
Inference


In machine learning, inference refers to the stage at which a trained model is used to make predictions on new data, as opposed to the training phase.

For an LLM, an inference is a single call that takes a prompt and returns a completion. Each call has a compute cost (often expressed in tokens), a latency profile (time to first token, tokens-per-second throughput) and a financial cost that can become significant at scale.
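As a rough sketch of how these three quantities compose, the snippet below models per-token pricing and streamed latency. The prices, token counts and timings are made-up placeholders, not any provider's actual rates:

```python
def inference_cost(prompt_tokens, completion_tokens,
                   price_in_per_1k=0.0005, price_out_per_1k=0.0015):
    """Financial cost of one inference call, in dollars.

    Prompt (input) and completion (output) tokens are usually
    billed at different per-1k-token rates.
    """
    return (prompt_tokens / 1000) * price_in_per_1k \
         + (completion_tokens / 1000) * price_out_per_1k

def total_latency(time_to_first_token, completion_tokens, tokens_per_second):
    """Rough end-to-end latency of a streamed completion, in seconds:
    time until the first token arrives, plus generation time for the rest."""
    return time_to_first_token + completion_tokens / tokens_per_second

# Illustrative call: 1200-token prompt, 400-token completion.
cost = inference_cost(prompt_tokens=1200, completion_tokens=400)
latency = total_latency(time_to_first_token=0.3,
                        completion_tokens=400, tokens_per_second=80)
```

At scale the same arithmetic explains the cost pressure: a million such calls at these placeholder rates would already run to four figures.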

Inference optimisation (quantisation, batching, KV cache, speculative decoding, distillation) has become an engineering discipline of its own, with specialised engines (vLLM, TensorRT-LLM, llama.cpp) and dedicated providers (Together AI, Fireworks, Groq).
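Of those techniques, quantisation is the easiest to illustrate in a few lines. The toy symmetric int8 scheme below (a pedagogical sketch, not how vLLM or TensorRT-LLM actually implement it) shows the core trade-off: smaller weights at the price of a bounded rounding error:

```python
def quantise_int8(weights):
    """Symmetric int8 quantisation: scale floats into [-127, 127] integers."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # avoid zero scale
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantise(q, scale):
    """Recover approximate float weights from the int8 representation."""
    return [v * scale for v in q]

weights = [0.12, -0.5, 0.03, 0.97]
q, scale = quantise_int8(weights)
restored = dequantise(q, scale)
# Rounding error is bounded by half the quantisation step (scale / 2).
max_err = max(abs(a - b) for a, b in zip(weights, restored))
```

Each weight now fits in one byte instead of four, and the worst-case error stays below half a quantisation step, which is why int8 (and even int4) inference often loses little accuracy in practice.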
