Inference
Inference is the act of running a trained model on new inputs to produce predictions or generated output.
What is Inference?
Inference is distinct from training. Training builds the model; inference uses it. Production inference systems care about latency, throughput, cost per request, and tail behaviour. Optimisations include batching, quantisation, distillation, caching, speculative decoding, and hardware-aware serving. For LLMs, inference cost is usually dominated by prompt and output token counts.
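To make the token-count point concrete, the following is a minimal sketch of a per-request cost estimate under per-token pricing. The `TokenPricing` class, the `request_cost` function, and the prices are hypothetical placeholders chosen for illustration, not any provider's actual rates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenPricing:
    """Hypothetical prices in USD per 1,000 tokens (illustrative only)."""
    input_per_1k: float
    output_per_1k: float


def request_cost(prompt_tokens: int, output_tokens: int, pricing: TokenPricing) -> float:
    """Estimate the cost of one request from its prompt and output token counts."""
    input_cost = (prompt_tokens / 1000) * pricing.input_per_1k
    output_cost = (output_tokens / 1000) * pricing.output_per_1k
    return input_cost + output_cost


# Assumed example: a 2,000-token prompt that produces a 500-token answer.
pricing = TokenPricing(input_per_1k=0.001, output_per_1k=0.003)
cost = request_cost(prompt_tokens=2000, output_tokens=500, pricing=pricing)
print(f"Estimated cost per request: ${cost:.4f}")
```

With these assumed prices, output tokens cost three times as much as prompt tokens, so shortening a long answer reduces cost more than trimming the same number of tokens from the prompt.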
How does Inference apply to enterprise AI?
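Enterprise inference economics drive build-vs-buy decisions. Hosted APIs are faster to adopt but typically cost more per request; self-hosted inference takes on operational burden in exchange for better unit economics at sufficient volume and control over data residency, such as keeping data in the EU.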
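The build-vs-buy trade-off can be framed as a break-even calculation. The sketch below assumes a simplified cost model: the hosted API has only a variable cost, while self-hosting has a fixed monthly cost plus a lower variable cost. The function name and all figures are hypothetical, and costs are expressed per 1,000 requests.

```python
import math
from typing import Optional


def break_even_thousand_requests(
    hosted_cost_per_1k: float,
    selfhosted_fixed_monthly: float,
    selfhosted_cost_per_1k: float,
) -> Optional[int]:
    """Return the monthly volume (in thousands of requests) at which
    self-hosting becomes no more expensive than the hosted API, or None
    if self-hosting never breaks even under this model."""
    saving_per_1k = hosted_cost_per_1k - selfhosted_cost_per_1k
    if saving_per_1k <= 0:
        return None  # Self-hosting saves nothing per request, so fixed costs are never recovered.
    return math.ceil(selfhosted_fixed_monthly / saving_per_1k)


# Assumed figures: $4.00 per 1k requests hosted, $1.00 per 1k self-hosted,
# plus $6,000 per month for hardware and operations.
volume = break_even_thousand_requests(4.0, 6000.0, 1.0)
print(f"Break-even volume: {volume} thousand requests per month")
```

A real comparison would also account for engineering time, utilisation, and compliance requirements such as residency, which this model deliberately leaves out.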
Related terms
Large Language Model
LLMOps
Observability