Inference
In machine learning, inference refers to the stage at which a trained model is used to make predictions on new data, as opposed to the training phase.
For an LLM, an inference is a call that takes a prompt and returns a completion. Each call has a compute cost (often expressed in tokens), a latency profile (time to first token, then tokens-per-second throughput during generation) and a financial cost that can become significant at scale.
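The two latency metrics can be measured directly from a streaming call. A minimal sketch, assuming a hypothetical `stream_completion` generator standing in for a real streaming LLM API (the delays simulate prefill and per-token decode):

```python
import time

def stream_completion(prompt):
    # Hypothetical stand-in for a streaming LLM API: yields tokens one by one.
    time.sleep(0.05)  # simulated prefill (prompt-processing) delay
    for token in ["Hello", ",", " world", "!"]:
        time.sleep(0.01)  # simulated per-token decode delay
        yield token

def measure_inference(prompt):
    start = time.perf_counter()
    ttft = None
    n_tokens = 0
    for token in stream_completion(prompt):
        if ttft is None:
            ttft = time.perf_counter() - start  # time to first token (s)
        n_tokens += 1
    total = time.perf_counter() - start
    return ttft, n_tokens / total  # TTFT in seconds, throughput in tokens/s

ttft, tps = measure_inference("Say hello")
```

The same loop works against any API that streams tokens; only the generator changes.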
Inference optimisation (quantisation, batching, KV cache, speculative decoding, distillation) has become an engineering discipline of its own, with specialised engines (vLLM, TensorRT-LLM, llama.cpp) and dedicated providers (Together AI, Fireworks, Groq).
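To make one of those techniques concrete, here is a toy sketch of symmetric int8 quantisation: weights are mapped to integers in [-127, 127] with a single scale factor, trading a little precision for a 4x smaller memory footprint versus float32 (this is an illustration of the idea, not the scheme used by any particular engine):

```python
def quantize_int8(weights):
    # Symmetric int8 quantisation: one scale per tensor,
    # chosen so the largest weight maps to +/-127.
    scale = max(abs(w) for w in weights) / 127
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    # Recover approximate float values from the int8 codes.
    return [x * scale for x in q]

w = [0.5, -1.27, 0.02]
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
```

Real engines quantise per channel or per block rather than per tensor, but the principle is the same: store small integers, keep one or a few scales, and dequantise (or compute directly in low precision) at inference time.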
