
Multimodal Model

Models & Architectures

A model that processes multiple data types, such as text, images, and audio.


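Multimodal models combine representations from different data modalities, typically by encoding each modality on its own and then fusing or aligning the resulting features so that one prediction can draw on all of them.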

  • Architectures: Early/late fusion, cross-attention, shared embedding spaces.
  • Applications: Image captioning, visual question answering, audio-video analysis.
  • Challenges: Modality alignment, annotation quality, compute cost.
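
The difference between early and late fusion can be made concrete with a small sketch. It assumes each modality has already been encoded into a plain list of floats and uses a simple linear scorer; the function names (`linear_score`, `early_fusion`, `late_fusion`) and the weights are hypothetical, chosen only for illustration, and real systems would use learned encoders and a tensor library.

```python
def linear_score(vec, weights, bias=0.0):
    """Dot product of a feature vector with a weight vector, plus a bias."""
    return sum(w * x for w, x in zip(weights, vec)) + bias


def early_fusion(text_vec, image_vec, joint_weights, bias=0.0):
    """Early fusion: concatenate the features, then score the joint vector once."""
    joint = text_vec + image_vec  # list concatenation, not element-wise addition
    return linear_score(joint, joint_weights, bias)


def late_fusion(text_score, image_score, text_weight=0.5):
    """Late fusion: score each modality separately, then mix the two scores."""
    return text_weight * text_score + (1.0 - text_weight) * image_score


# Toy features standing in for encoder outputs (assumed values).
text_vec = [1.0, 2.0]
image_vec = [3.0]

# Early fusion sees both modalities in a single scorer.
early = early_fusion(text_vec, image_vec, joint_weights=[0.5, 0.25, 1.0])

# Late fusion combines two independent per-modality scores.
late = late_fusion(
    linear_score(text_vec, [0.5, 0.25]),
    linear_score(image_vec, [1.0]),
)
```

In this sketch early fusion lets a single scorer weigh features from both modalities together, while late fusion keeps the modalities independent until the final mixing step, which makes it easier to drop or swap a modality but prevents the scorer from modeling interactions between them.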