Multimodal Model
Models & Architectures
A model that processes multiple data types, such as text, images, and audio.
Multimodal models combine representations from different data modalities so that a single prediction can draw on all of them.
- Architectures: Early/late fusion, cross-attention, shared embedding spaces.
- Applications: Image captioning, visual question answering, audio-video analysis.
- Challenges: Modality alignment, annotation quality, compute cost.
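The fusion strategies named under Architectures can be illustrated with a minimal sketch. It is a toy under stated assumptions, not a reference implementation: the feature vectors, weights, and function names (`early_fusion_score`, `late_fusion_score`, `cosine_similarity`) are hypothetical, and a real system would obtain features from trained encoders rather than hand-written lists.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def early_fusion_score(text_feat, image_feat, weights):
    """Early fusion: concatenate modality features, then apply one scorer."""
    fused = text_feat + image_feat  # list concatenation
    if len(fused) != len(weights):
        raise ValueError("weights must match the fused feature length")
    return sigmoid(dot(fused, weights))

def late_fusion_score(text_feat, image_feat, text_weights, image_weights):
    """Late fusion: score each modality separately, then average the scores."""
    text_score = sigmoid(dot(text_feat, text_weights))
    image_score = sigmoid(dot(image_feat, image_weights))
    return 0.5 * (text_score + image_score)

def cosine_similarity(u, v):
    """Compare two vectors assumed to live in a shared embedding space."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

# Hypothetical toy features (assumption: produced by separate encoders).
text_feat = [0.2, -0.1, 0.4]
image_feat = [0.5, 0.3]

early = early_fusion_score(text_feat, image_feat, [1.0] * 5)
late = late_fusion_score(text_feat, image_feat, [1.0] * 3, [1.0] * 2)
```

Early fusion lets one scorer see interactions between modalities at the cost of a larger joint input, while late fusion keeps per-modality scorers independent and only merges their outputs. The cosine comparison assumes both vectors have already been projected into the same space, which is the alignment problem listed under Challenges.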