Multimodal
An AI model is called multimodal when it can understand and/or generate several types of data at once: text, image, audio, video — and sometimes other signals like code or structured data.
Modern multimodal models (GPT-4o, Claude, Gemini) handle these modalities in a unified representation space. You can ask a question about a photo, transcribe and respond to an audio file, or generate text from a video.
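As a concrete sketch, here is roughly what a mixed text-and-image prompt looks like in the OpenAI-style chat format, where one user message carries both a text part and an image part in a single content list. The model name and image URL below are illustrative placeholders, not a definitive API reference:

```python
import json

# Hypothetical multimodal request in the OpenAI-style chat format:
# a single user message mixes a text part and an image part.
# The model name and URL are placeholders for illustration only.
payload = {
    "model": "gpt-4o",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is shown in this photo?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/photo.jpg"},
                },
            ],
        }
    ],
}

# Serialise the request as a multimodal endpoint would receive it.
print(json.dumps(payload, indent=2))
```

The key point is that both modalities travel in the same message: the model sees the text and the image together, rather than routing each through a separate specialised system.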
This paradigm is gradually replacing the older pipelines in which each modality had its own specialised model (OCR followed by NLP, ASR followed by NLP, and so on). In 2026, almost every frontier LLM is natively multimodal.
