Token (AI)
In the context of LLMs, a token is the basic unit manipulated by the model: a chunk of text (often part of a word, sometimes a short whole word or a single character) produced by a tokenizer before inference.
A French text of 1,000 characters typically represents between 250 and 350 tokens. LLM providers bill usage based on the number of input and output tokens, and a model's context window is also expressed in tokens.
The choice of tokenizer (BPE, SentencePiece, tiktoken…) influences token efficiency on non-English languages: a tokenizer poorly optimised for French can consume many more tokens per character than a well-suited one, directly increasing cost and eating into the context window.
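To make the splitting behaviour concrete, here is a toy greedy longest-match subword tokenizer. Real BPE tokenizers (tiktoken, SentencePiece) learn their vocabulary from data; this sketch uses a tiny hand-picked vocabulary purely to illustrate why one word can become several tokens.

```python
def tokenize(text, vocab):
    """Greedy longest-match tokenization against a fixed vocabulary (toy)."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest possible vocabulary entry starting at i.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # Unknown character: fall back to a single-character token.
            tokens.append(text[i])
            i += 1
    return tokens

# Hypothetical mini-vocabulary: "tokenization" has no single entry,
# so it is split into subword pieces.
vocab = {"token", "iza", "tion"}
print(tokenize("tokenization", vocab))  # ['token', 'iza', 'tion']
```

A production tokenizer works differently internally (learned merge rules, byte-level fallback), but the output shape is the same: a word unfamiliar to the vocabulary costs several tokens, which is exactly why a tokenizer trained mostly on English inflates the token count of French text.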
