
How to deploy retrieval-augmented generation (RAG) in a regulated enterprise

By Impetora

Retrieval-augmented generation (RAG) pairs a vector-searchable corpus of source documents with a generative model so that every answer is grounded in citable text rather than parametric memory. For regulated enterprises in banking, insurance, healthcare and the public sector, RAG has become the default architecture because it preserves a citation chain back to authoritative documents, keeps proprietary content out of model training, and produces an auditable trail that satisfies supervisory expectations under the EU AI Act, GDPR, the NIST AI RMF and sector-specific regimes such as SR 11-7 and Solvency II [1][2][4].


The seven-step deployment playbook

A production-grade RAG system in a regulated environment is engineered as a sequence of governed stages, not a single pipeline. Each stage produces an artefact that an internal auditor or external supervisor can later inspect.

  1. Source-corpus governance. Classify every document by sensitivity, residency and retention class before it touches the index. Maintain a written data catalogue, record provenance and lawful basis under GDPR Article 6, and exclude special categories of data unless an explicit Article 9 condition applies. Documents flagged for deletion under Article 17 must be removable from the index within the contractual erasure window.
  2. Chunking and embedding strategy. Choose chunk size and overlap per document type, not globally. Regulatory text, contracts and clinical guidelines benefit from semantic chunking that respects clause and section boundaries; transactional logs benefit from fixed windows. Record the embedding model version with every vector so re-embedding can be triggered when a model is deprecated or a regulatory text is revised.
  3. Vector store selection. Pick a store that runs in EU regions, encrypts data at rest with customer-managed keys, publishes a sub-processor list, and supports per-tenant or per-document access control. Verify the operator's ISO/IEC 27001 and, where available, ISO/IEC 42001 attestations [3].
  4. Retrieval evaluation. Build a ground-truth set from real questions answered by subject-matter experts. Measure recall@k, mean reciprocal rank and context precision before any LLM endpoint is wired in; a minimal metric sketch follows this list. A retrieval layer that cannot return the right chunks will not be saved by a better generator.
  5. Generation with citation enforcement. Constrain the orchestration layer so that every output cites at least one retrieved chunk and ungrounded responses are rejected or flagged for review. The Stanford CRFM literature on grounding shows that enforced citation reduces hallucination materially compared with free generation [5].
  6. Human-in-the-loop review for regulated decisions. Any output that materially affects a data subject must route through a qualified reviewer before it takes effect. This is the operational reading of GDPR Article 22 and EU AI Act Article 14 and is not optional in credit, insurance underwriting, clinical triage or benefits adjudication [1][2].
  7. Continuous evaluation harness. Run faithfulness, answer relevance and citation completeness tests on every release. Track drift on the embedding model, the source corpus and the regulator's published guidance. Treat regulator updates as release events: when the EBA, EIOPA, ENISA or the AI Office publishes, the corpus and the tests are updated the same week [6][7].
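
As a concrete illustration of step 4, the sketch below computes recall@k and mean reciprocal rank over a ground-truth set. The data shapes and the retrieve callable are illustrative assumptions standing in for your own retrieval layer, not a prescribed schema.

    # Retrieval-evaluation sketch. The ground-truth shape and the retrieve()
    # callable are illustrative assumptions, not a required schema.

    def recall_at_k(relevant: set, ranked: list, k: int) -> float:
        """Fraction of expert-labelled chunks present in the top-k results."""
        return len(relevant & set(ranked[:k])) / len(relevant) if relevant else 0.0

    def reciprocal_rank(relevant: set, ranked: list) -> float:
        """1/rank of the first relevant chunk; 0.0 if none is retrieved."""
        for rank, chunk_id in enumerate(ranked, start=1):
            if chunk_id in relevant:
                return 1.0 / rank
        return 0.0

    def evaluate_retrieval(ground_truth: list, retrieve, k: int = 5) -> dict:
        """Items look like {"question": str, "relevant_chunk_ids": [str]}."""
        recalls, rrs = [], []
        for item in ground_truth:
            ranked = retrieve(item["question"])          # ranked chunk IDs
            relevant = set(item["relevant_chunk_ids"])   # expert labels
            recalls.append(recall_at_k(relevant, ranked, k))
            rrs.append(reciprocal_rank(relevant, ranked))
        n = len(ground_truth)
        return {"recall_at_k": sum(recalls) / n, "mrr": sum(rrs) / n}

Running this against the index alone, before any generation step, keeps retrieval regressions visible in isolation.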

What changes for regulated industries versus general enterprise?

The architecture is the same; the obligations are not. A regulated deployment has to demonstrate, on demand, how each output was produced and how each design choice was justified.

  • EU AI Act Article 9 requires a documented risk management system that runs across the full lifecycle of a high-risk system, with explicit identification, evaluation and mitigation of foreseeable risks [1].
  • GDPR Article 5 data minimisation applies inside the retrieval step itself. The pipeline should not fetch personal data the generator does not need; query-time filters and per-role access scopes are the correct controls (a filter sketch follows this list).
  • GDPR Article 32 security obligations cover encryption in transit and at rest, granular access control, immutable audit logs and tested incident response. Logs must be sufficient to reconstruct any single answer.
  • GDPR Article 35 requires a Data Protection Impact Assessment before production deployment whenever the processing is likely to result in a high risk to data subjects, which is the default assumption for generative systems handling personal data.
  • GDPR Article 22 plus EU AI Act Article 14 codify the human-oversight requirement: a competent reviewer with the authority and information to override the system must be in the loop for consequential decisions.
  • Sector overlays. Banking deployments must satisfy SR 11-7 model risk governance, including independent validation and ongoing monitoring. Insurance deployments fall under EIOPA guidance and Solvency II Pillar 2 governance. Healthcare deployments must align with the Medical Device Regulation when the system informs clinical decisions, and with HITRUST or equivalent controls for protected health information [6].
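
To make the data-minimisation bullet concrete, the sketch below shows the shape of a query-time filter. The field names and the vector_search call are assumptions standing in for whatever interface your vector store actually exposes.

    # Query-time filtering sketch. Field names and store.vector_search are
    # placeholders for your store's real API, not any specific product.
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class QueryScope:
        tenant_id: str          # mandatory: queries never cross tenants
        roles: frozenset        # caller's roles, resolved before retrieval

    def scoped_search(query: str, scope: QueryScope, store, k: int = 8):
        # The tenant predicate is non-negotiable; role scopes narrow further,
        # and special-category data stays out unless the use case requires it.
        filters = {
            "tenant_id": scope.tenant_id,
            "allowed_roles": sorted(scope.roles),  # store-side overlap match
            "special_category": False,             # minimisation by default
        }
        return store.vector_search(query=query, filters=filters, top_k=k)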

Common failure modes

Five failure modes recur across audits of production RAG systems, and each is preventable with the right test in the evaluation harness.

  • Silent grounding loss. A source document is updated, the old vectors stay in the index, and the system continues to cite an obsolete passage. Mitigation: every chunk carries a content hash and a source-document version, and stale vectors are evicted on ingest (see the sketch after this list).
  • Embedding drift on regulatory revisions. When a regulator amends a key text, the embedding of the new section sits at a different point in vector space. Mitigation: scheduled re-embedding on regulatory publication events, with a parity test against the previous index.
  • Re-ranker citation masking. A re-ranker promotes a high-similarity but low-authority chunk and the citation no longer points to the document the answer actually relied on. Mitigation: rank-aware citation logging that records every chunk that influenced the final response.
  • Prompt injection via uploaded documents. A user uploads a PDF whose content tries to override the system instruction. Mitigation: separate untrusted content from instructions at the orchestration layer, sanitise uploads, and follow the ENISA guidance on AI threat modelling [7].
  • Cross-tenant retrieval leakage. A misconfigured filter returns chunks from another tenant or another department. Mitigation: tenant identifiers as a mandatory predicate on every query, automated tests that probe with adversarial queries, and per-tenant index isolation for the most sensitive corpora.
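
The first failure mode above is cheap to prevent at ingest. Below is a minimal sketch of the hash-and-evict pattern; the index object and its delete/upsert methods are assumptions about your store's interface.

    # Staleness-eviction sketch for silent grounding loss. The index object's
    # delete/upsert interface is a placeholder, not a real library API.
    import hashlib

    def content_hash(chunk_text: str) -> str:
        return hashlib.sha256(chunk_text.encode("utf-8")).hexdigest()

    def ingest_document(doc_id: str, version: str, chunks: list, index) -> None:
        # Evict every vector from older versions of this document first,
        # so the retriever can never cite a superseded passage.
        index.delete_versions(doc_id=doc_id, keep_version=version)
        for i, text in enumerate(chunks):
            index.upsert(
                id=f"{doc_id}:{version}:{i}",
                text=text,
                metadata={
                    "doc_id": doc_id,
                    "doc_version": version,
                    "content_hash": content_hash(text),  # detects silent edits
                },
            )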

How to evaluate a RAG deployment

Evaluation in regulated settings is a continuous engineering practice, not a launch checkpoint. The minimum metrics matrix has six dimensions.

  • Faithfulness. Share of generated claims that are supported by the retrieved context. Measured against a labelled ground-truth set.
  • Answer relevance. Whether the response actually addresses the user's question rather than a tangent.
  • Context precision and recall. Whether the retriever surfaced the right chunks and only the right chunks.
  • Citation completeness. Whether every material claim in the answer points to a specific retrieved passage.
  • Latency at p95. A regulated workflow that times out at the human-review step is unusable; latency budgets must include the orchestration layer, retrieval, generation and any guardrail step.
  • Cost per query. Tracked per intent class so that the business can decide where to apply the system and where a deterministic lookup is cheaper.

Open-source RAG evaluation frameworks have matured to the point where regulated buyers can run these metrics in continuous integration. The frameworks differ in scoring model and reporting; the choice matters less than the discipline of running them on every release [4].
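
In practice, "running them on every release" can be a simple gate in continuous integration that fails the build when any metric drops below its agreed floor. A minimal sketch follows; the metric names and thresholds are illustrative, not recommendations.

    # Release-gate sketch: fail CI when any metric falls below its floor.
    # Thresholds are illustrative; set them from your own risk appetite.
    import sys

    FLOORS = {
        "faithfulness": 0.95,
        "answer_relevance": 0.90,
        "context_precision": 0.85,
        "citation_completeness": 0.98,
    }

    def gate(results: dict) -> int:
        failures = {m: v for m, v in results.items() if v < FLOORS.get(m, 0.0)}
        for metric, value in failures.items():
            print(f"FAIL {metric}: {value:.3f} < floor {FLOORS[metric]:.2f}")
        return 1 if failures else 0

    if __name__ == "__main__":
        # In CI this would read the evaluation harness output; hard-coded here.
        sys.exit(gate({"faithfulness": 0.97, "answer_relevance": 0.92,
                       "context_precision": 0.88, "citation_completeness": 0.99}))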

Frequently asked questions

How much does a regulated RAG deployment typically cost?
Cost ranges widely by scope, corpus size and integration surface. A focused deployment on a single workload, with one source domain and one user group, can ship in the low six figures including governance artefacts. Multi-domain programmes covering several business units and several regulators run an order of magnitude higher because the conformity assessment, DPIA and integration work scale with surface area, not with model spend. Infrastructure is rarely the binding cost; review and assurance are.
What is a typical time to production?
A scoped, regulated workload usually takes three to six months from kick-off to production with a single user group, assuming the source corpus exists in a usable form and the legal basis is settled. Programmes that include source-document remediation, new data-sharing agreements or a fresh DPIA take longer, and the bottleneck is almost always governance, not engineering.
How do we keep the data EU-resident end to end?
Select a vector store and an LLM endpoint that contractually guarantee processing in EU regions, restrict storage and inference to those regions through configuration, prevent egress with network policy, and require sub-processor disclosure clauses in every supplier contract. Verify the controls during procurement and re-verify on every supplier release that touches the processing path.
What is the realistic hallucination floor with grounded RAG?
With enforced citation and a curated corpus, faithfulness scores in the high nineties are achievable on factual questions answered from the corpus. The floor is set by retrieval quality and by ambiguity in the source documents themselves, not by the generator. Questions whose answer is not in the corpus must be refused rather than synthesised, and that refusal behaviour is the single most important guardrail to test.
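
Because refusal is the guardrail that matters most, it deserves an explicit, automated test. A minimal sketch, where answer_question and the refusal marker are placeholders for your own pipeline:

    # Refusal-behaviour test sketch. answer_question and REFUSAL are
    # placeholders for your pipeline's real entry point and signalling.
    REFUSAL = "NOT_IN_CORPUS"

    OUT_OF_CORPUS_QUESTIONS = [
        "What is our policy on a product we have never offered?",
        "Summarise a regulation that is absent from the corpus.",
    ]

    def test_refuses_out_of_corpus(answer_question) -> None:
        for question in OUT_OF_CORPUS_QUESTIONS:
            response = answer_question(question)
            assert response == REFUSAL, (
                f"Pipeline synthesised an answer for: {question!r}"
            )
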
When should we choose fine-tuning instead of RAG?
Choose fine-tuning when the task is a stable transformation of input to output, the desired behaviour cannot be specified in a prompt, and the corpus is closed. Choose RAG when the source of truth changes, when citations are required, or when the regulator expects to see the document the answer relied on. In regulated settings the default is RAG; fine-tuning is reserved for narrow stylistic or formatting tasks.
How do we handle PII inside source documents?
Classify and tag personal data at ingest, restrict its retrieval through per-role access scopes, redact at the chunk level when the use case does not require it, and never include PII in evaluation logs that leave the production boundary. For special categories of data, require an explicit lawful basis and a documented necessity test before the data is indexed at all.
What does the audit trail actually look like?
Each answer is paired with the user identity, the query, the retrieved chunks with content hashes and source-document versions, the model version, the orchestration layer configuration, the citations rendered to the user, the human reviewer (if any) and the final disposition. Logs are immutable, time-synchronised and retained for the longer of contractual and regulatory minimums. A supervisor should be able to reconstruct any single output end to end.
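
One way to pin that record down is as an explicit schema. The field names below mirror the list above and are illustrative assumptions, not a standard:

    # Illustrative audit-record schema mirroring the fields listed above.
    from dataclasses import dataclass, field
    from datetime import datetime, timezone
    from typing import Optional

    @dataclass(frozen=True)
    class RetrievedChunk:
        chunk_id: str
        content_hash: str
        source_doc_version: str

    @dataclass(frozen=True)
    class AuditRecord:
        user_id: str
        query: str
        retrieved: tuple                # RetrievedChunk entries, rank order
        model_version: str
        orchestration_config_hash: str
        citations_rendered: tuple       # citation IDs as shown to the user
        human_reviewer: Optional[str]   # None when no review was required
        final_disposition: str          # e.g. "approved", "overridden"
        timestamp: datetime = field(
            default_factory=lambda: datetime.now(timezone.utc)
        )
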
Do we need ISO/IEC 42001 to deploy RAG in production?
It is not a legal prerequisite, but ISO/IEC 42001 has become the procurement default for buyers who want a structured AI management system. It complements the EU AI Act and pairs cleanly with existing ISO/IEC 27001 controls. Treat it as the operating model, not the deliverable.

Ready to scope your project? Submit a short brief and we reply within one business day.

Sources cited

  1. Regulation (EU) 2024/1689 (Artificial Intelligence Act). European Union, Official Journal, 2024-07-12. https://eur-lex.europa.eu/eli/reg/2024/1689/oj
  2. Regulation (EU) 2016/679 (General Data Protection Regulation). European Union, Official Journal, 2016-05-04. https://eur-lex.europa.eu/eli/reg/2016/679/oj
  3. ISO/IEC 42001:2023 - Artificial intelligence management system. International Organization for Standardization, 2023-12. https://www.iso.org/standard/81230.html
  4. Artificial Intelligence Risk Management Framework: Generative AI Profile (NIST AI 600-1). National Institute of Standards and Technology, 2024-07. https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.600-1.pdf
  5. Center for Research on Foundation Models - publications on retrieval and grounding. Stanford CRFM, 2024. https://crfm.stanford.edu/publications.html
  6. EBA report on machine learning for IRB models and supervisory expectations. European Banking Authority, 2023-11. https://www.eba.europa.eu/publications-and-media/publications
  7. Multilayer Framework for Good Cybersecurity Practices for AI. ENISA, 2023-06. https://www.enisa.europa.eu/publications/multilayer-framework-for-good-cybersecurity-practices-for-ai
  8. Financial Stability Implications from FinTech: AI and Machine Learning. Financial Stability Board, 2017-11. https://www.fsb.org/2017/11/artificial-intelligence-and-machine-learning-in-financial-services/
About Impetora
Impetora designs, builds, and deploys custom AI systems for enterprises in regulated industries. We operate from Vilnius and work in five languages.