Inference

Inference is the act of running a trained model on new inputs to produce predictions or generated output.
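To make the definition concrete, here is a minimal sketch in PyTorch. The model is a hypothetical stand-in; a real system would load trained weights from disk rather than use fresh random ones:

    import torch
    import torch.nn as nn

    # Hypothetical stand-in for a trained model; in practice you would
    # load saved weights, e.g. model.load_state_dict(torch.load(path)).
    model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
    model.eval()  # inference mode: disables dropout, freezes batch-norm stats

    new_input = torch.randn(1, 4)  # one previously unseen example

    with torch.no_grad():  # inference never computes gradients or updates weights
        prediction = model(new_input)

    print(prediction)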

What is Inference?

Inference is distinct from training. Training builds the model; inference uses it. Production inference systems care about latency, throughput, cost per request, and tail latency (e.g., p99). Optimisations include batching, quantisation, distillation, caching, speculative decoding, and hardware-aware serving. For LLMs, inference cost is usually dominated by prompt and output token counts.
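Because token counts dominate LLM inference cost, a back-of-the-envelope cost model is easy to write. The sketch below uses hypothetical per-token prices; real prices vary by provider and model:

    # Hypothetical per-token prices; real prices vary by provider and model.
    PROMPT_PRICE_PER_1K = 0.003   # per 1,000 prompt tokens
    OUTPUT_PRICE_PER_1K = 0.006   # per 1,000 output tokens

    def request_cost(prompt_tokens: int, output_tokens: int) -> float:
        """Estimate the cost of a single LLM request from its token counts."""
        return (prompt_tokens / 1000) * PROMPT_PRICE_PER_1K \
             + (output_tokens / 1000) * OUTPUT_PRICE_PER_1K

    # Example: a 1,500-token prompt producing a 400-token answer.
    print(f"{request_cost(1500, 400):.4f}")  # 0.0045 + 0.0024 = 0.0069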

How does Inference apply to enterprise AI?

Enterprise inference economics drive build-vs-buy decisions. Hosted APIs trade higher per-request cost for speed of deployment; self-hosted inference trades operational burden for better unit economics and control over data residency (for example, keeping inference within the EU).
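That trade-off can be framed as a simple break-even calculation. All figures below are hypothetical placeholders, not vendor quotes:

    # Hypothetical figures for illustration; substitute your own quotes.
    hosted_per_request = 0.0069        # hosted API cost per request
    self_hosted_fixed = 4000.0         # monthly fixed cost: GPUs, ops, amortisation
    self_hosted_per_request = 0.0008   # marginal cost once the infrastructure runs

    # Monthly volume above which self-hosting becomes cheaper overall.
    break_even = self_hosted_fixed / (hosted_per_request - self_hosted_per_request)
    print(f"Break-even at ~{break_even:,.0f} requests/month")  # ~655,738

Below the break-even volume the hosted API wins on total cost; above it, self-hosting does, provided the team can absorb the operational burden.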

Related terms

  • Large Language Model - A Large Language Model (LLM) is a foundation model trained on text to predict the next token, capable of generating, summarising, and reasoning over natural language.
  • LLMOps - LLMOps is the subset of MLOps focused on the specific operational concerns of large language models: prompt versioning, evaluation, cost control, and output observability.
  • Observability - Observability for AI is the ability to understand what an AI system did, why it did it, and at what cost, by inspecting its inputs, outputs, intermediate steps, and metrics.
  • Build vs Buy AI - Build vs buy is the strategic decision between developing an AI capability internally or in partnership, versus licensing a finished product from a vendor.

Need help applying Inference to your enterprise? Submit a short brief and we reply within one business day.
