Multimodal AI Systems: Combining Text, Image & Audio Data

Simor Consulting | 24 Oct, 2025 | 6 min read

Human communication is multimodal: we gesture while speaking, draw diagrams while explaining, and understand meaning through the interplay of sensory inputs. Yet most AI systems operate in silos—computer vision models that only see, language models that only read, speech models that only hear. This mismatch creates fundamental limitations.

A customer service system that can process a photo of a damaged product plus a voice message explaining the problem has fundamentally more information than one processing either modality alone. The challenge is building AI systems that can effectively combine these modalities.

The Multimodal Architecture Problem

Real-world applications require multiple modalities:

  • Healthcare: Diagnoses use visual scans, patient descriptions, and audio cues
  • Retail: Product discovery spans visual search, text queries, and voice commands
  • Security: Threat detection requires analyzing video, audio, and text

The limitations of unimodal systems:

Context Loss: A customer saying “this is broken” while showing an image provides more information than either modality alone.

Inefficient Workflows: Users translate between modalities—seeing something, struggling to describe it, hoping the text search understands.

Incomplete Understanding: Emotion in voice, sarcasm in text with images, urgency through multiple channels—unimodal systems miss these.

Building Multimodal Architectures

Multimodal Embeddings

The foundation is creating shared representation spaces where different modalities can interact:

import torch
import torch.nn as nn
from transformers import CLIPModel, AutoModel

class MultimodalEmbedder(nn.Module):
    def __init__(self, config):
        super().__init__()

        # Modality-specific encoders
        self.vision_encoder = CLIPModel.from_pretrained(
            config.vision_model
        ).vision_model

        self.text_encoder = AutoModel.from_pretrained(
            config.text_model
        )

        # AudioTransformer is a custom encoder class (not shown here)
        self.audio_encoder = AudioTransformer(
            config.audio_config
        )

        # Projection layers to shared space
        self.vision_projection = nn.Linear(
            config.vision_dim,
            config.shared_dim
        )
        self.text_projection = nn.Linear(
            config.text_dim,
            config.shared_dim
        )
        self.audio_projection = nn.Linear(
            config.audio_dim,
            config.shared_dim
        )

        # LayerNorm stabilizes the shared space; for true unit-sphere
        # embeddings, apply F.normalize after projection instead
        self.normalize = nn.LayerNorm(config.shared_dim)

    def encode_image(self, images):
        vision_outputs = self.vision_encoder(images)
        vision_features = vision_outputs.pooler_output
        vision_embedded = self.vision_projection(vision_features)
        vision_embedded = self.normalize(vision_embedded)
        return vision_embedded

    def encode_text(self, input_ids, attention_mask):
        text_outputs = self.text_encoder(
            input_ids=input_ids,
            attention_mask=attention_mask
        )
        # Masked mean pooling so padding tokens don't dilute the average
        mask = attention_mask.unsqueeze(-1)
        text_features = (
            (text_outputs.last_hidden_state * mask).sum(dim=1)
            / mask.sum(dim=1).clamp(min=1)
        )
        text_embedded = self.text_projection(text_features)
        text_embedded = self.normalize(text_embedded)
        return text_embedded

    def encode_audio(self, audio_features):
        audio_outputs = self.audio_encoder(audio_features)
        audio_embedded = self.audio_projection(audio_outputs)
        audio_embedded = self.normalize(audio_embedded)
        return audio_embedded
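
Once projected and normalized, embeddings from any two modalities can be compared with a plain dot product. A standalone sketch of that property with random weights and toy dimensions (768/512/256 are illustrative, not tied to any particular encoder):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

vision_dim, text_dim, shared_dim = 768, 512, 256  # toy sizes

vision_proj = nn.Linear(vision_dim, shared_dim)
text_proj = nn.Linear(text_dim, shared_dim)

# Stand-ins for pooled encoder outputs
vision_feats = torch.randn(4, vision_dim)
text_feats = torch.randn(4, text_dim)

# Project into the shared space and L2-normalize onto the unit sphere
v = F.normalize(vision_proj(vision_feats), dim=-1)
t = F.normalize(text_proj(text_feats), dim=-1)

# Cosine similarity matrix: rows are images, columns are texts
sim = v @ t.T  # shape (4, 4), entries in [-1, 1]
```

With untrained projections these similarities are meaningless; contrastive training is what pulls matching pairs together.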

Cross-Modal Attention

Modalities attend to each other through cross-attention:

class CrossModalAttention(nn.Module):
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.dim = dim
        self.head_dim = dim // num_heads

        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)

        self.scale = self.head_dim ** -0.5

    def forward(self, query_modality, key_modality, value_modality=None):
        if value_modality is None:
            value_modality = key_modality

        batch_size = query_modality.shape[0]

        # Project to multi-head format
        Q = self.q_proj(query_modality).view(
            batch_size, -1, self.num_heads, self.head_dim
        ).transpose(1, 2)

        K = self.k_proj(key_modality).view(
            batch_size, -1, self.num_heads, self.head_dim
        ).transpose(1, 2)

        V = self.v_proj(value_modality).view(
            batch_size, -1, self.num_heads, self.head_dim
        ).transpose(1, 2)

        # Compute attention scores
        scores = torch.matmul(Q, K.transpose(-2, -1)) * self.scale
        attn_weights = torch.softmax(scores, dim=-1)

        # Apply attention to values
        attn_output = torch.matmul(attn_weights, V)

        # Reshape and project output
        attn_output = attn_output.transpose(1, 2).contiguous().view(
            batch_size, -1, self.dim
        )
        output = self.out_proj(attn_output)

        return output, attn_weights
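
The hand-rolled module above mirrors what PyTorch's built-in nn.MultiheadAttention provides when you pass different tensors as query and key/value. A sketch with assumed toy shapes (16 text tokens attending to 49 image patches):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

dim, num_heads = 256, 8
cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

text_tokens = torch.randn(2, 16, dim)    # query modality
image_patches = torch.randn(2, 49, dim)  # key/value modality

# Text attends to image patches
output, attn_weights = cross_attn(
    query=text_tokens, key=image_patches, value=image_patches
)
# output: (2, 16, 256) -- one attended vector per text token
# attn_weights: (2, 16, 49) -- averaged over heads by default
```

Each row of attn_weights sums to 1: every text token distributes its attention across all image patches.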

Fusion Strategies

Different fusion strategies for different tasks:

class MultimodalFusion(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.fusion_type = config.fusion_type

        if self.fusion_type == "early":
            self.fusion = EarlyFusion(config)
        elif self.fusion_type == "late":
            self.fusion = LateFusion(config)
        elif self.fusion_type == "hybrid":
            self.fusion = HybridFusion(config)
        elif self.fusion_type == "attention":
            self.fusion = AttentionFusion(config)

    def forward(self, modalities):
        return self.fusion(modalities)

class AttentionFusion(nn.Module):
    def __init__(self, config):
        super().__init__()

        # Self-attention within modalities
        self.self_attention_layers = nn.ModuleDict({
            modality: nn.TransformerEncoder(
                nn.TransformerEncoderLayer(
                    d_model=config.hidden_dim,
                    nhead=config.num_heads
                ),
                num_layers=config.num_self_layers
            )
            for modality in config.modalities
        })

        # Cross-attention between modalities
        self.cross_attention_layers = nn.ModuleList([
            CrossModalAttention(
                config.hidden_dim,
                config.num_heads
            )
            for _ in range(config.num_cross_layers)
        ])

        # Gated fusion mechanism
        self.gates = nn.ModuleDict({
            modality: nn.Linear(config.hidden_dim, 1)
            for modality in config.modalities
        })

    def forward(self, modalities):
        # Self-attention within each modality
        enhanced_modalities = {}
        for name, features in modalities.items():
            if name in self.self_attention_layers:
                enhanced = self.self_attention_layers[name](features)
                enhanced_modalities[name] = enhanced

        # Cross-modal attention
        for cross_attn in self.cross_attention_layers:
            updated_modalities = {}
            for query_name, query_features in enhanced_modalities.items():
                attended_features = []

                for key_name, key_features in enhanced_modalities.items():
                    if query_name != key_name:
                        attended, _ = cross_attn(
                            query_features,
                            key_features
                        )
                        attended_features.append(attended)

                if attended_features:
                    combined = torch.stack(attended_features).mean(dim=0)
                    updated_modalities[query_name] = (
                        query_features + combined
                    )
                else:
                    updated_modalities[query_name] = query_features

            enhanced_modalities = updated_modalities

        # Gated fusion
        gate_weights = {}
        for name, features in enhanced_modalities.items():
            gate = torch.sigmoid(self.gates[name](features.mean(dim=1)))
            gate_weights[name] = gate

        # Normalize gates
        total_weight = sum(gate_weights.values())
        normalized_gates = {
            name: weight / total_weight
            for name, weight in gate_weights.items()
        }

        # Weighted combination (broadcast each gate over its sequence dim;
        # index gates by name so modality ordering cannot drift)
        fused = sum(
            normalized_gates[name].unsqueeze(1) * features
            for name, features in enhanced_modalities.items()
        )

        return fused, normalized_gates
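
EarlyFusion and LateFusion are referenced in the dispatch above but never defined. Minimal sketches of the two extremes (class names match the dispatch, but the constructor signatures here are simplified assumptions):

```python
import torch
import torch.nn as nn

class EarlyFusion(nn.Module):
    """Concatenate pooled modality features, then learn one joint projection."""
    def __init__(self, modality_dims, hidden_dim):
        super().__init__()
        self.order = list(modality_dims)  # fixed concatenation order
        self.project = nn.Linear(sum(modality_dims.values()), hidden_dim)

    def forward(self, modalities):
        pooled = [modalities[name].mean(dim=1) for name in self.order]
        return self.project(torch.cat(pooled, dim=-1))

class LateFusion(nn.Module):
    """Encode each modality independently, then average the outputs."""
    def __init__(self, modality_dims, hidden_dim):
        super().__init__()
        self.heads = nn.ModuleDict({
            name: nn.Linear(dim, hidden_dim)
            for name, dim in modality_dims.items()
        })

    def forward(self, modalities):
        outputs = [self.heads[n](f.mean(dim=1)) for n, f in modalities.items()]
        return torch.stack(outputs).mean(dim=0)

dims = {"vision": 768, "text": 512}
batch = {"vision": torch.randn(2, 49, 768), "text": torch.randn(2, 16, 512)}
early = EarlyFusion(dims, 256)(batch)  # (2, 256)
late = LateFusion(dims, 256)(batch)    # (2, 256)
```

Early fusion lets modalities interact from the first layer but couples them tightly; late fusion degrades gracefully when a modality is missing, which is why hybrid and attention variants exist.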

Multimodal Applications

Multimodal Product Search

class MultimodalProductSearch:
    def __init__(self, model, product_index):
        self.model = model
        self.product_index = product_index

    def search(self, query_image=None, query_text=None, query_audio=None):
        """Search products using multimodal query"""

        embeddings = []
        weights = []

        # Encode available modalities
        if query_image is not None:
            image_emb = self.model.encode_image(query_image)
            embeddings.append(image_emb)
            weights.append(1.0)

        if query_text is not None:
            text_emb = self.model.encode_text(query_text)
            embeddings.append(text_emb)
            weights.append(1.0)

        if query_audio is not None:
            audio_emb = self.model.encode_audio(query_audio)
            embeddings.append(audio_emb)
            weights.append(0.5)  # Lower weight for audio

        # Combine embeddings
        if len(embeddings) > 1:
            weights = torch.tensor(weights).unsqueeze(1)
            combined_emb = (torch.stack(embeddings) * weights).sum(dim=0)
            combined_emb = combined_emb / weights.sum()
        else:
            combined_emb = embeddings[0]

        # Search in product index
        similarities = self.product_index.search(
            combined_emb,
            k=20
        )

        # Re-rank using cross-modal reasoning
        reranked_results = self.rerank_with_reasoning(
            similarities,
            query_image,
            query_text
        )

        return reranked_results

    def rerank_with_reasoning(self, initial_results, image, text):
        """Use cross-modal reasoning to improve ranking"""

        reranked = []

        for product in initial_results:
            score = product['similarity']

            # Boost score if multiple modalities agree
            if image is not None and text is not None:
                consistency = self.check_cross_modal_consistency(
                    image,
                    text,
                    product
                )
                score *= (1 + consistency)

            # Penalize if modalities disagree
            disagreement = self.detect_modal_disagreement(
                image,
                text,
                product
            )
            score *= (1 - disagreement * 0.5)

            reranked.append({
                'product': product,
                'score': score,
                'reasoning': self.generate_reasoning(image, text, product)
            })

        return sorted(reranked, key=lambda x: x['score'], reverse=True)

Multimodal Customer Support

class MultimodalSupportAgent:
    def __init__(self, config):
        self.understanding_model = MultimodalUnderstanding(config)
        self.response_generator = ResponseGenerator(config)
        self.emotion_detector = EmotionDetector(config)

    def handle_customer_query(self, text=None, image=None, audio=None):
        """Process multimodal customer query"""

        # Extract features from each modality
        understanding = self.understanding_model(
            text=text,
            image=image,
            audio=audio
        )

        # Detect emotional state from audio/text
        emotion_state = None
        if audio is not None:
            emotion_state = self.emotion_detector.from_audio(audio)
        elif text is not None:
            emotion_state = self.emotion_detector.from_text(text)

        # Analyze image if provided
        image_analysis = None
        if image is not None:
            image_analysis = self.analyze_product_image(image)

        # Generate contextual response
        response = self.generate_response(
            understanding=understanding,
            emotion=emotion_state,
            image_analysis=image_analysis
        )

        return response

    def analyze_product_image(self, image):
        """Specialized analysis for product images"""

        analysis = {
            'damage_detected': False,
            'damage_type': None,
            'severity': None,
            'product_identified': None
        }

        # Damage detection
        damage_score = self.damage_detector(image)
        if damage_score > 0.7:
            analysis['damage_detected'] = True
            analysis['damage_type'] = self.classify_damage(image)
            analysis['severity'] = self.assess_severity(image)

        # Product identification
        product_match = self.identify_product(image)
        if product_match['confidence'] > 0.8:
            analysis['product_identified'] = product_match['product_id']

        return analysis

Video Understanding Pipeline

class VideoUnderstanding(nn.Module):
    def __init__(self, config):
        super().__init__()

        # Frame encoder (vision)
        self.frame_encoder = FrameEncoder(config)

        # Audio encoder
        self.audio_encoder = AudioEncoder(config)

        # Temporal modeling
        self.temporal_encoder = nn.LSTM(
            input_size=config.frame_dim + config.audio_dim,
            hidden_size=config.hidden_dim,
            num_layers=config.num_layers,
            bidirectional=True,
            batch_first=True  # inputs below are (batch, frames, features)
        )

        # Multimodal fusion
        self.fusion = TemporalMultimodalFusion(config)

    def forward(self, video_frames, audio_track, transcript=None):
        batch_size, num_frames = video_frames.shape[:2]

        # Encode frames
        frame_features = []
        for i in range(num_frames):
            frame_feat = self.frame_encoder(video_frames[:, i])
            frame_features.append(frame_feat)
        frame_features = torch.stack(frame_features, dim=1)

        # Encode audio in chunks aligned with frames
        audio_features = self.audio_encoder(
            audio_track,
            num_chunks=num_frames
        )

        # Combine frame and audio features
        combined_features = torch.cat(
            [frame_features, audio_features],
            dim=-1
        )

        # Temporal modeling
        temporal_features, _ = self.temporal_encoder(combined_features)

        # Optional transcript integration
        if transcript is not None:
            transcript_features = self.encode_transcript(
                transcript,
                num_frames
            )
            temporal_features = self.fusion(
                temporal_features,
                transcript_features
            )

        # Video-level representation
        video_representation = self.aggregate_temporal(temporal_features)

        return video_representation
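
The temporal step in isolation: concatenated frame and audio features through the bidirectional LSTM. Dimensions below are toy assumptions; note batch_first=True, since the features are stacked batch-first:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

frame_dim, audio_dim, hidden_dim = 512, 128, 256  # toy sizes
frame_features = torch.randn(2, 8, frame_dim)  # (batch, frames, dim)
audio_features = torch.randn(2, 8, audio_dim)  # aligned per frame

temporal_encoder = nn.LSTM(
    input_size=frame_dim + audio_dim,
    hidden_size=hidden_dim,
    num_layers=2,
    bidirectional=True,
    batch_first=True,  # inputs are (batch, frames, features)
)

combined = torch.cat([frame_features, audio_features], dim=-1)
temporal, _ = temporal_encoder(combined)
# temporal: (2, 8, 512) -- 2 * hidden_dim because of bidirectionality
```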

Training Techniques

Contrastive Learning Across Modalities

import torch.nn.functional as F

class MultimodalContrastiveLearning:
    def __init__(self, temperature=0.07):
        self.temperature = temperature

    def compute_loss(self, image_embeddings, text_embeddings, audio_embeddings=None):
        """Compute contrastive loss across modalities"""

        batch_size = image_embeddings.shape[0]

        # Normalize embeddings
        image_embeddings = F.normalize(image_embeddings, dim=-1)
        text_embeddings = F.normalize(text_embeddings, dim=-1)

        # Image-Text contrastive loss
        image_text_sim = torch.matmul(
            image_embeddings,
            text_embeddings.T
        ) / self.temperature

        # Labels: diagonal elements are positive pairs
        labels = torch.arange(batch_size).to(image_embeddings.device)

        # Compute cross-entropy loss both directions
        loss_i2t = F.cross_entropy(image_text_sim, labels)
        loss_t2i = F.cross_entropy(image_text_sim.T, labels)

        total_loss = (loss_i2t + loss_t2i) / 2

        # Add audio if available
        if audio_embeddings is not None:
            audio_embeddings = F.normalize(audio_embeddings, dim=-1)

            # Audio-Text loss
            audio_text_sim = torch.matmul(
                audio_embeddings,
                text_embeddings.T
            ) / self.temperature
            loss_a2t = F.cross_entropy(audio_text_sim, labels)
            loss_t2a = F.cross_entropy(audio_text_sim.T, labels)

            # Audio-Image loss
            audio_image_sim = torch.matmul(
                audio_embeddings,
                image_embeddings.T
            ) / self.temperature
            loss_a2i = F.cross_entropy(audio_image_sim, labels)
            loss_i2a = F.cross_entropy(audio_image_sim.T, labels)

            audio_loss = (loss_a2t + loss_t2a + loss_a2i + loss_i2a) / 4
            total_loss = (total_loss + audio_loss) / 2

        return total_loss
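
A quick sanity check of the symmetric image-text objective: perfectly aligned pairs should score far lower loss than random pairings. A standalone re-implementation of the two-modality case:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

def clip_style_loss(img, txt, temperature=0.07):
    # Normalize, compute the temperature-scaled similarity matrix,
    # and treat the diagonal as the positive pairs in both directions
    img, txt = F.normalize(img, dim=-1), F.normalize(txt, dim=-1)
    sim = img @ txt.T / temperature
    labels = torch.arange(img.shape[0])
    return (F.cross_entropy(sim, labels) + F.cross_entropy(sim.T, labels)) / 2

emb = torch.randn(8, 64)
aligned = clip_style_loss(emb, emb)                    # matching pairs
mismatched = clip_style_loss(emb, torch.randn(8, 64))  # random pairs
```

aligned comes out orders of magnitude lower than mismatched, which hovers around the chance-level loss.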

Modality Dropout for Robustness

import random

class ModalityDropout(nn.Module):
    def __init__(self, drop_prob=0.5):
        super().__init__()
        self.drop_prob = drop_prob

    def forward(self, modalities, training=True):
        if not training:
            return modalities

        # Randomly drop modalities during training
        kept_modalities = {}

        for name, features in modalities.items():
            if torch.rand(1).item() > self.drop_prob:
                kept_modalities[name] = features

        # Ensure at least one modality remains
        if not kept_modalities:
            random_modality = random.choice(list(modalities.keys()))
            kept_modalities[random_modality] = modalities[random_modality]

        return kept_modalities
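
Called once per training batch, the guard at the end is the important invariant: the model must always see at least one modality. A standalone functional sketch of the same idea:

```python
import random
import torch

def modality_dropout(modalities, drop_prob=0.5, rng=random):
    """Randomly drop modalities, but always keep at least one."""
    kept = {
        name: feats for name, feats in modalities.items()
        if rng.random() > drop_prob
    }
    if not kept:  # guard: never drop everything
        name = rng.choice(list(modalities))
        kept[name] = modalities[name]
    return kept

random.seed(0)
batch = {
    "vision": torch.randn(2, 49, 768),
    "text": torch.randn(2, 16, 512),
    "audio": torch.randn(2, 100, 128),
}

# Even at an aggressive drop rate, the empty set never appears
for _ in range(200):
    kept = modality_dropout(batch, drop_prob=0.9)
    assert 1 <= len(kept) <= 3
```

Training on random modality subsets forces the fusion layers to stop over-relying on any single channel, which pays off at inference when inputs are missing or noisy.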

Challenges

Modality Imbalance

Different modalities have different data availability:

import random
from collections import defaultdict

class BalancedMultimodalSampler:
    def __init__(self, dataset, balance_strategy="upsample"):
        self.dataset = dataset
        self.balance_strategy = balance_strategy
        self.modality_counts = self._count_modalities()

    def _count_modalities(self):
        counts = defaultdict(int)

        for item in self.dataset:
            modality_key = tuple(sorted(item['modalities'].keys()))
            counts[modality_key] += 1

        return counts

    def __iter__(self):
        if self.balance_strategy == "upsample":
            max_count = max(self.modality_counts.values())

            indices = []
            for modality_combo, count in self.modality_counts.items():
                combo_indices = [
                    i for i, item in enumerate(self.dataset)
                    if tuple(sorted(item['modalities'].keys())) == modality_combo
                ]

                repeat_times = max_count // count
                remainder = max_count % count

                indices.extend(combo_indices * repeat_times)
                indices.extend(random.sample(combo_indices, remainder))

            random.shuffle(indices)
            return iter(indices)

        raise ValueError(
            f"Unknown balance strategy: {self.balance_strategy}"
        )

Computational Complexity

Efficient inference:

class EfficientMultimodalInference:
    def __init__(self, model, efficiency_config):
        self.model = model
        self.config = efficiency_config

    def predict(self, inputs):
        # Quick classification of query complexity
        complexity = self.assess_complexity(inputs)

        if complexity == "simple":
            return self.simple_inference(inputs)
        elif complexity == "moderate":
            return self.moderate_inference(inputs)
        else:
            return self.complex_inference(inputs)

    def assess_complexity(self, inputs):
        """Quickly assess if multimodal processing is needed"""

        num_modalities = len(inputs)

        if num_modalities == 1:
            return "simple"

        if self.modalities_agree(inputs):
            return "moderate"

        return "complex"

Decision Rules

Adopt multimodal AI when:

  • User needs require combining modalities
  • Individual modality performance is insufficient
  • Cross-modal context improves outcomes
  • Training data with paired modalities is available

Stick with unimodal when:

  • Single modality captures the full problem
  • Multimodal data is unavailable
  • Computational budget is severely constrained
  • Latency requirements prevent multimodal processing

Key principles:

  • Start with two modalities, not all at once
  • Use pretrained encoders for each modality
  • Focus on alignment before fusion
  • Modality dropout improves robustness
  • Real-world data has missing and noisy modalities

Ready to Implement Multimodal AI Solutions?

Get a comprehensive AI Readiness Assessment to determine the best approach for your organization's data infrastructure and AI implementation needs.
