Inference

Inference is the act of running a trained model on new inputs to produce predictions or generated output.
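To make the definition concrete, here is a minimal sketch in PyTorch. The model is a hypothetical stand-in; a real system would load trained weights from disk rather than use fresh random ones:

    import torch
    import torch.nn as nn

    # Hypothetical stand-in for a trained model; in practice you would
    # load saved weights, e.g. model.load_state_dict(torch.load(path)).
    model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
    model.eval()  # inference mode: disables dropout, freezes batch-norm stats

    new_input = torch.randn(1, 4)  # one previously unseen example

    with torch.no_grad():  # inference never computes gradients or updates weights
        prediction = model(new_input)

    print(prediction)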

What is Inference?

Inference is distinct from training. Training builds the model; inference uses it. Production inference systems care about latency, throughput, cost per request, and tail latency (e.g., p99). Optimisations include batching, quantisation, distillation, caching, speculative decoding, and hardware-aware serving. For LLMs, inference cost is usually dominated by prompt and output token counts.
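Because token counts dominate LLM inference cost, a back-of-the-envelope cost model is easy to write. The sketch below uses hypothetical per-token prices; real prices vary by provider and model:

    # Hypothetical per-token prices; real prices vary by provider and model.
    PROMPT_PRICE_PER_1K = 0.003   # per 1,000 prompt tokens
    OUTPUT_PRICE_PER_1K = 0.006   # per 1,000 output tokens

    def request_cost(prompt_tokens: int, output_tokens: int) -> float:
        """Estimate the cost of a single LLM request from its token counts."""
        return (prompt_tokens / 1000) * PROMPT_PRICE_PER_1K \
             + (output_tokens / 1000) * OUTPUT_PRICE_PER_1K

    # Example: a 1,500-token prompt producing a 400-token answer.
    print(f"{request_cost(1500, 400):.4f}")  # 0.0045 + 0.0024 = 0.0069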

How does Inference apply to enterprise AI?

Enterprise inference economics drive build-vs-buy decisions. Hosted APIs trade higher per-request cost for speed of deployment; self-hosted inference trades operational burden for better unit economics and control over data residency (for example, keeping inference within the EU).
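That trade-off can be framed as a simple break-even calculation. All figures below are hypothetical placeholders, not vendor quotes:

    # Hypothetical figures for illustration; substitute your own quotes.
    hosted_per_request = 0.0069        # hosted API cost per request
    self_hosted_fixed = 4000.0         # monthly fixed cost: GPUs, ops, amortisation
    self_hosted_per_request = 0.0008   # marginal cost once the infrastructure runs

    # Monthly volume above which self-hosting becomes cheaper overall.
    break_even = self_hosted_fixed / (hosted_per_request - self_hosted_per_request)
    print(f"Break-even at ~{break_even:,.0f} requests/month")  # ~655,738

Below the break-even volume the hosted API wins on total cost; above it, self-hosting does, provided the team can absorb the operational burden.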

Related terms

  • Large Language Model - A Large Language Model (LLM) is a foundation model trained on text to predict the next token, capable of generating, summarising, and reasoning over natural language.
  • LLMOps - LLMOps is the subset of MLOps focused on the specific operational concerns of large language models: prompt versioning, evaluation, cost control, and output observability.
  • Observability - Observability for AI is the ability to understand what an AI system did, why it did it, and at what cost, by inspecting its inputs, outputs, intermediate steps, and metrics.
  • Build vs Buy AI - Build vs buy is the strategic decision between developing an AI capability internally or in partnership, versus licensing a finished product from a vendor.

Need help applying Inference to your enterprise? Submit a short brief and we reply within one business day.
