Multimodal
An AI model is called multimodal when it can understand and/or generate several types of data at once: text, image, audio, video — and sometimes other signals like code or structured data.
Modern multimodal models (GPT-4o, Claude, Gemini) handle these modalities in a unified representation space. You can ask a question about a photo, transcribe and respond to an audio file, or generate text from a video.
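As a concrete sketch, here is roughly what a mixed text-and-image prompt looks like in the OpenAI-style chat format, where one user message carries both a text part and an image part in a single content list. The model name and image URL below are illustrative placeholders, not a definitive API reference:

```python
import json

# Hypothetical multimodal request in the OpenAI-style chat format:
# a single user message mixes a text part and an image part.
# The model name and URL are placeholders for illustration only.
payload = {
    "model": "gpt-4o",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is shown in this photo?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/photo.jpg"},
                },
            ],
        }
    ],
}

# Serialise the request as a multimodal endpoint would receive it.
print(json.dumps(payload, indent=2))
```

The key point is that both modalities travel in the same message: the model sees the text and the image together, rather than routing each through a separate specialised system.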
This paradigm is gradually replacing the older pipelines in which each modality had its own specialised model (OCR followed by NLP, ASR followed by NLP, and so on). In 2026, almost every frontier LLM is natively multimodal.
