Impetora
Production

Inference

Inference is the act of running a trained model on new inputs to produce predictions or generated output.
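As a minimal sketch of what "running a trained model on new inputs" means, the snippet below assumes a tiny logistic-regression model whose weights (hypothetical values) were already learned during training. At inference time those parameters stay fixed and are simply applied to a new feature vector.

```python
import math

# Hypothetical parameters produced by an earlier training run.
# During inference they are frozen; nothing here updates them.
WEIGHTS = [0.8, -0.4]
BIAS = 0.1


def predict_proba(features: list[float]) -> float:
    """Apply the trained logistic model to one new input and return P(class = 1)."""
    z = BIAS + sum(w * x for w, x in zip(WEIGHTS, features))
    return 1.0 / (1.0 + math.exp(-z))


def predict(features: list[float], threshold: float = 0.5) -> int:
    """Turn the probability into a hard 0/1 prediction."""
    return int(predict_proba(features) >= threshold)


print(predict([1.0, 0.5]))
print(predict([-2.0, 1.0]))
```

Training would be the separate step that searched for `WEIGHTS` and `BIAS`; inference only reads them, which is why it can be served, scaled, and optimised independently.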

What is Inference?

Inference is distinct from training. Training builds the model; inference uses it. Production inference systems are judged on latency, throughput, cost per request, and tail latency (for example, the p99 response time rather than the average). Optimisations include batching, quantisation, distillation, caching, speculative decoding, and hardware-aware serving. For LLMs, inference cost is usually dominated by the number of prompt (input) and output tokens processed per request.
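Because LLM cost tracks token counts, a per-request estimate is simple arithmetic. The sketch below assumes separate per-million-token prices for input and output tokens; the `TokenPricing` class and the prices used are hypothetical placeholders, not any provider's actual rates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenPricing:
    # Hypothetical prices (EUR per one million tokens); real providers publish their own.
    input_per_million: float
    output_per_million: float


def request_cost(prompt_tokens: int, output_tokens: int, pricing: TokenPricing) -> float:
    """Estimate the cost of a single LLM request from its prompt and output token counts."""
    return (
        prompt_tokens * pricing.input_per_million
        + output_tokens * pricing.output_per_million
    ) / 1_000_000


pricing = TokenPricing(input_per_million=2.50, output_per_million=10.00)
cost = request_cost(prompt_tokens=3_000, output_tokens=500, pricing=pricing)
print(f"{cost:.4f}")  # prints 0.0125
```

With these assumed prices, trimming the prompt or capping output length reduces cost linearly, which is why caching repeated prompt content and shortening responses are common levers.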

How does Inference apply to enterprise AI?

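Enterprise inference economics drive build-vs-buy decisions. Hosted APIs offer a fast path to production at a higher cost per request; self-hosted inference accepts a heavier operational burden in exchange for potentially better unit economics at sufficient volume and control over EU data residency.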
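One way to frame the build-vs-buy trade-off is a break-even volume: the monthly request count at which self-hosting's fixed costs are recovered by its lower marginal cost per request. The function name and all figures below are hypothetical planning inputs, not vendor quotes, and the model deliberately ignores non-cost factors such as residency requirements.

```python
import math


def break_even_requests(
    self_hosted_fixed_monthly: float,
    self_hosted_marginal_per_request: float,
    api_cost_per_request: float,
) -> float:
    """Monthly request volume above which self-hosting is cheaper than a hosted API.

    Inputs are hypothetical planning figures (e.g. GPUs, staff, monitoring folded
    into the fixed monthly cost).
    """
    saving_per_request = api_cost_per_request - self_hosted_marginal_per_request
    if saving_per_request <= 0:
        # Self-hosting never pays back on unit cost alone.
        return math.inf
    return self_hosted_fixed_monthly / saving_per_request


volume = break_even_requests(
    self_hosted_fixed_monthly=20_000.0,
    self_hosted_marginal_per_request=0.002,
    api_cost_per_request=0.0125,
)
print(round(volume))  # prints 1904762
```

Below the break-even volume the hosted API is cheaper on these assumptions; above it, self-hosting wins on unit cost, and a residency requirement can tip the decision even earlier.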

