Building an Enterprise RAG Framework

A behind-the-scenes look at the reusable Enterprise RAG Framework I deploy across client engagements — and why the architecture decisions that seem over-engineered are the ones that matter most in production.

Early in my AI consulting work, I made the same mistake most architects make: I built bespoke AI systems for each client. New client, new ingestion pipeline. New client, new retrieval logic. New client, new governance layer. Three months of work, repeated from scratch, every engagement.

The breaking point came when a second financial services client asked for almost exactly what I had built for the first — a system to query regulatory documents in natural language, with citations, confidence scoring, and a full audit trail. The underlying problem was identical. The technology was identical. The only thing different was the document corpus and the cloud provider.

That’s when I stopped building projects and started building a framework.

What the Framework Actually Is

The Enterprise RAG Architecture Framework is a production-grade, configurable retrieval-augmented generation platform I deploy and adapt for each client engagement. It handles the foundational concerns that every enterprise AI deployment shares — secure ingestion, governed retrieval, auditable LLM orchestration, and a REST API with access control baked in.

Think of it the way a systems integrator brings a reference architecture to an engagement, rather than designing from first principles every time. The framework is the reference architecture. Each deployment is a configured instance of it.

What makes it a framework rather than a project is the deliberate abstraction at every layer. The LLM provider, vector store, document corpus, agent workflows, and governance configuration are all externalised. Changing them requires configuration, not code.

The Four Layers — and Why Each One Exists

Layer 01: Ingestion Pipeline (configurable)

PDF loading, text cleaning, semantic chunking, metadata tagging, and embedding generation. The pipeline is idempotent, so it is safe to re-run when documents are updated.

Layer 02: Hybrid Retrieval (tunable)

Vector search plus BM25 keyword search, fused using Reciprocal Rank Fusion. A query analyser classifies intent and applies metadata filters before retrieval.

Layer 03: LLM Orchestration (extensible)

A LangChain RAG chain for factual queries. A LangGraph multi-agent graph for complex workflows such as comparison, gap analysis, and multi-document synthesis.

Layer 04: Governance Layer (standard)

RBAC with role-based jurisdiction restrictions, prompt injection guard, append-only audit log, confidence scoring, and citation enforcement. Ships standard.
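
To make the idempotence claim for the ingestion layer concrete, here is a minimal sketch, not the framework's actual code. It assumes idempotence is achieved by deriving each record ID from the chunk's source, page, and cleaned text, so re-running ingestion on unchanged documents writes nothing new. `InMemoryIndex`, `chunk_id`, and `ingest` are hypothetical names, and the in-memory dictionary stands in for a real vector store.

```python
import hashlib


def chunk_id(source: str, page: int, text: str) -> str:
    """Derive a stable ID from the chunk's origin and content."""
    digest = hashlib.sha256(f"{source}|{page}|{text}".encode("utf-8"))
    return digest.hexdigest()[:16]


class InMemoryIndex:
    """Hypothetical stand-in for a vector store that supports upsert by ID."""

    def __init__(self) -> None:
        self.records: dict[str, dict] = {}

    def upsert(self, record_id: str, record: dict) -> bool:
        """Insert the record; return False if it was already present."""
        if record_id in self.records:
            return False
        self.records[record_id] = record
        return True


def ingest(index: InMemoryIndex, source: str, pages: list[str]) -> int:
    """Index every page once; return how many new records were written."""
    written = 0
    for page_number, text in enumerate(pages, start=1):
        cleaned = " ".join(text.split())  # collapse whitespace before hashing
        record = {"source": source, "page": page_number, "text": cleaned}
        if index.upsert(chunk_id(source, page_number, cleaned), record):
            written += 1
    return written
```

Running `ingest` twice on the same pages writes records only the first time. A changed page produces a new ID, so a real pipeline would also need to delete the stale record for that page, which this sketch leaves out.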

Design Principle

Governance is a first-class framework component — not a client-specific add-on. Every enterprise deployment, regardless of sector, needs identity and access control, an audit trail, prompt safety controls, and confidence signalling. Building these once and configuring them per engagement is the only sensible approach at consulting scale.
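
Two of these controls can be sketched in a few lines. The example below is an illustration under stated assumptions rather than the framework's implementation: the role names and jurisdiction codes are invented, and chaining each audit entry to the previous entry's hash is one common way to make an append-only log tamper-evident, not necessarily the mechanism used here. A production log would also record a timestamp per entry.

```python
import hashlib
import json
from dataclasses import dataclass, field

# Hypothetical role-to-jurisdiction mapping; real roles would come from configuration.
ROLE_JURISDICTIONS = {
    "analyst_ca": {"CA"},
    "analyst_global": {"CA", "INTL"},
}


def allowed_chunks(role: str, chunks: list[dict]) -> list[dict]:
    """Drop retrieved chunks whose jurisdiction the role may not see."""
    permitted = ROLE_JURISDICTIONS.get(role, set())
    return [c for c in chunks if c["jurisdiction"] in permitted]


def _digest(body: dict) -> str:
    """Hash an entry body deterministically."""
    payload = json.dumps(body, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


@dataclass
class AuditLog:
    """Append-only log where each entry includes the previous entry's hash."""

    entries: list[dict] = field(default_factory=list)

    def append(self, user: str, role: str, query: str) -> dict:
        previous = self.entries[-1]["hash"] if self.entries else "genesis"
        body = {"user": user, "role": role, "query": query, "prev": previous}
        entry = {**body, "hash": _digest(body)}
        self.entries.append(entry)
        return entry

    def verify(self) -> bool:
        """Recompute the chain; an edited or removed entry breaks it."""
        previous = "genesis"
        for entry in self.entries:
            body = {k: v for k, v in entry.items() if k != "hash"}
            if entry["prev"] != previous or _digest(body) != entry["hash"]:
                return False
            previous = entry["hash"]
        return True
```

Filtering by jurisdiction before generation means a restricted role never sees out-of-scope text in the prompt, and `verify()` gives an auditor a cheap integrity check over the whole log.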

The Financial Services Configuration

The deployment I’ll walk through here is the financial services configuration — built for regulatory compliance Q&A. The corpus is publicly available regulatory documents: Basel Framework publications from the Bank for International Settlements and OSFI guidelines from Canada’s federal banking regulator.

The use case is concrete: compliance analysts at a universal bank spend 60–80% of their research time manually searching regulatory PDFs to answer questions like “What are the minimum CET1 capital requirements under Basel III?” or “Does our current third-party risk framework satisfy OSFI B-10 obligations?” The framework answers both, with citations, confidence scores, and a full audit trail of who asked what and when.

Here is what a live response looks like:

Query: “What are the minimum capital requirements under Basel?”

Response: The minimum capital requirements under Basel III are:

· CET1 ≥ 4.5% of risk-weighted assets [SOURCE: basel_capital_requirements.pdf, Page 20]
· Tier 1 Capital ≥ 6.0% of risk-weighted assets
· Total Capital ≥ 8.0% of risk-weighted assets [SOURCE: basel_capital_requirements.pdf, Page 20]

Capital conservation buffer: +2.5% brings CET1 minimum to 7.0% including buffer requirements. [SOURCE: basel_capital_requirements.pdf, Page 72]

Confidence: HIGH
Latency: 3,603ms
Chunks used: 6

Every factual claim is cited. Every response carries a confidence assessment. Every query is logged to an immutable audit trail. These aren’t features — they’re requirements in a regulated environment, and they’re delivered by the framework, not built per engagement.
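
One way to enforce citations mechanically is to parse the `[SOURCE: file, Page N]` tags out of the generated answer and check each against the chunks that were actually retrieved. The sketch below assumes the tag format shown in the sample response. The function names are hypothetical, and the framework's own enforcement logic may differ.

```python
import re

# Matches tags in the style of the sample response: [SOURCE: file.pdf, Page 20]
CITATION = re.compile(r"\[SOURCE:\s*(?P<file>[^,\]]+),\s*Page\s+(?P<page>\d+)\]")


def extract_citations(answer: str) -> list[tuple[str, int]]:
    """Return every (file, page) pair cited in the answer text, in order."""
    return [(m["file"].strip(), int(m["page"])) for m in CITATION.finditer(answer)]


def unsupported_citations(
    answer: str, retrieved: set[tuple[str, int]]
) -> list[tuple[str, int]]:
    """List citations that do not correspond to any retrieved chunk."""
    return [c for c in extract_citations(answer) if c not in retrieved]
```

An answer with no citations at all, or with any unsupported citation, could then be rejected or downgraded before it reaches the user.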

The Decision That Changes Everything — Hybrid Retrieval

Most RAG implementations use pure vector search. It works well for conceptual queries. It fails badly for one specific class of query that is extremely common in enterprise contexts: exact citation lookup.

A question like “What is the spirit of Basel capital requirements?” is conceptual, and vector search handles it well. A query like “Article 147(2)(b) counterparty credit risk weighting” requires exact term matching. Vector search can miss it because embeddings capture meaning rather than exact tokens, so the chunk containing that precise citation may not rank among the nearest neighbours.

The framework uses Reciprocal Rank Fusion (RRF), a rank-based merging algorithm that combines vector search candidates and BM25 keyword search candidates without requiring score normalisation. Each document's fused score is the sum of 1/(k + rank) over the result lists it appears in, so documents found by both retrievers rise towards the top. The result is a 15–25% improvement in retrieval precision over pure vector search, and a 35 percentage point improvement in citation recall.
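
Reciprocal Rank Fusion is small enough to show in full. This is a generic sketch of the algorithm, not the framework's code. The constant `k = 60` is the value commonly used in the literature; the value the framework uses is not stated, so treat it as an assumption.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked ID lists; each list contributes 1 / (k + rank) per document."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties fall back to the ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


vector_hits = ["a", "b", "c"]  # ranked by embedding similarity
bm25_hits = ["c", "a", "d"]    # ranked by keyword score
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
```

Here `fused` is `["a", "c", "b", "d"]`: the two documents found by both retrievers outrank those found by only one, and no score normalisation was needed because only ranks are used.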

The Agentic Layer — When a Single Retrieval Pass Isn’t Enough

Simple factual queries need one retrieval pass and one generation step. But enterprise use cases rarely stay simple. Gap analysis between an internal policy and a regulatory requirement needs multiple retrieval passes against different document sets, independent reasoning over each, and a synthesis step that produces a risk-rated output.

The framework’s LangGraph multi-agent layer handles exactly this. A supervisor node classifies the query and routes it through the appropriate agent sequence:

Query Type	Example	Agent Path
Factual	What is the CET1 minimum?	RAG Chain only
Procedural	How do I implement an ICAAP?	RAG Chain only
Comparative	Basel III vs Basel IV liquidity rules	Retrieval → Comparison → Summary
Gap Analysis	Does our policy satisfy OSFI B-10?	Retrieval → Gap Analysis → Summary

The supervisor routing is rule-based, not LLM-driven. Routing decisions consume no LLM tokens, add negligible latency, and are fully auditable: I can explain every routing decision without inspecting LLM outputs. LLM tokens are spent only on the agents the query actually requires, not on routing logic.
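
A rule-based supervisor can be as simple as an ordered list of patterns, each mapped to an agent path. The sketch below reproduces the four rows of the routing table. The regular expressions and the names `ROUTING_RULES` and `route` are illustrative assumptions, not the framework's actual rules.

```python
import re

# Ordered rules: the first matching pattern wins. Patterns are illustrative only.
ROUTING_RULES = [
    (
        "gap_analysis",
        re.compile(r"\b(satisfy|satisfies|comply|complies|compliant|gap)\b", re.I),
        ["retrieval", "gap_analysis", "summary"],
    ),
    (
        "comparative",
        re.compile(r"\b(vs|versus|compare|compared|difference)\b", re.I),
        ["retrieval", "comparison", "summary"],
    ),
    (
        "procedural",
        re.compile(r"^\s*how (do|should|can)\b", re.I),
        ["rag_chain"],
    ),
]


def route(query: str) -> tuple[str, list[str]]:
    """Classify a query and return (query_type, agent_path)."""
    for query_type, pattern, path in ROUTING_RULES:
        if pattern.search(query):
            return query_type, path
    # Anything unmatched is treated as a plain factual lookup.
    return "factual", ["rag_chain"]
```

Because the decision is a pattern match, the audit log can record the name of the rule that fired, which is what makes each routing choice explainable after the fact.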

The Architecture Decisions That Made This a Framework

Three decisions separate a reusable framework from a one-time project. Each is documented in an Architecture Decision Record — a discipline I apply across every engagement so decisions are portable, reversible, and explainable.

ADR-001: Vector Store Selection

Decision: Azure AI Search over Pinecone/Qdrant

Why it matters: Data residency, Entra ID RBAC integration, and native hybrid search, all in one service.

Portability: The provider abstraction means a swap is one configuration change. Qdrant in dev, Azure AI Search in prod.
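
The portability claim rests on the retrieval layer depending on an interface rather than a vendor SDK. The following is a minimal sketch of that idea, with hypothetical class names and placeholder adapters that make no real Qdrant or Azure AI Search calls.

```python
from typing import Protocol


class VectorStore(Protocol):
    """Minimal interface the retrieval layer depends on (hypothetical)."""

    def search(self, query: str, top_k: int) -> list[str]: ...


class QdrantStore:
    """Placeholder adapter; a real one would wrap the Qdrant client."""

    def search(self, query: str, top_k: int) -> list[str]:
        return [f"qdrant:{query}:{i}" for i in range(top_k)]


class AzureSearchStore:
    """Placeholder adapter; a real one would wrap the Azure AI Search client."""

    def search(self, query: str, top_k: int) -> list[str]:
        return [f"azure_ai_search:{query}:{i}" for i in range(top_k)]


# Registry keyed by the value that appears in deployment configuration.
STORES = {"qdrant": QdrantStore, "azure_ai_search": AzureSearchStore}


def build_store(config: dict) -> VectorStore:
    """Instantiate the adapter named by config['vector_store']."""
    provider = config["vector_store"]
    if provider not in STORES:
        raise ValueError(f"Unknown vector store provider: {provider!r}")
    return STORES[provider]()
```

Switching from dev to prod then means changing `{"vector_store": "qdrant"}` to `{"vector_store": "azure_ai_search"}`, with no change to the calling code.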

ADR-002: Chunking Strategy

Decision: Semantic over fixed-size

Why it matters: Fixed-size chunking splits regulatory clauses mid-sentence. A half-clause retrieved in isolation is a compliance risk.

Result: 94% clause integrity vs 67% with fixed-size. Measurable. Documented. Repeatable.
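
The difference between the two strategies is easy to demonstrate on a toy passage. The sketch below is a simplification under stated assumptions: it packs whole sentences into chunks instead of using embedding-based semantic chunking, and it defines "clause integrity" as the share of chunks that end on a sentence terminator. The definition behind the 94% and 67% figures is not given, so this metric is a stand-in.

```python
import re

# Split after ., ! or ? followed by whitespace. A simplification: real regulatory
# text has abbreviations and numbered clauses that need a more careful splitter.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def fixed_size_chunks(text: str, size: int) -> list[str]:
    """Cut the text every `size` characters, regardless of sentence boundaries."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def sentence_chunks(text: str, max_chars: int) -> list[str]:
    """Pack whole sentences into chunks of up to max_chars characters."""
    chunks: list[str] = []
    current = ""
    for sentence in SENTENCE_END.split(text.strip()):
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def clause_integrity(chunks: list[str]) -> float:
    """Share of chunks that end on a sentence terminator (assumed metric)."""
    if not chunks:
        return 0.0
    complete = sum(1 for c in chunks if c.rstrip().endswith((".", "!", "?")))
    return complete / len(chunks)


text = "Banks must hold CET1 of 4.5%. Tier 1 must be 6.0%. Total capital must be 8.0%."
fixed = fixed_size_chunks(text, 40)
packed = sentence_chunks(text, 60)
```

On this passage the fixed-size split cuts the second sentence in half, while the sentence-packed chunks all end on a full stop, which is the property the ADR is protecting.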

ADR-003: Retrieval Strategy

Decision: Hybrid RRF over pure vector

Why it matters: Enterprise text has two retrieval modes, semantic and exact citation. One retriever cannot handle both optimally.

Result: Precision@6 of 0.84 vs 0.71 vector-only. Citation recall +35 percentage points.
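
For readers who want to reproduce this kind of measurement, Precision@k is the fraction of the top k retrieved chunks that are relevant. The sketch below uses hypothetical function names and toy data. The only figures taken from the ADR are 0.84 and 0.71, whose relative gain works out to roughly 18%.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved IDs that appear in the relevant set."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = retrieved[:k]
    return sum(1 for doc_id in top_k if doc_id in relevant) / k


def relative_gain(baseline: float, improved: float) -> float:
    """Relative improvement of `improved` over `baseline`."""
    return (improved - baseline) / baseline


# Toy example: 3 of the top 6 results are relevant.
toy_precision = precision_at_k(["a", "b", "c", "d", "e", "f"], {"a", "c", "e", "z"}, 6)

# Figures from ADR-003: hybrid 0.84 vs vector-only 0.71.
gain = relative_gain(0.71, 0.84)
```

`toy_precision` is 0.5, and `gain` is about 0.183, which falls inside the 15–25% range quoted for hybrid retrieval.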

The Tech Stack

Orchestration

LangChain 0.3, LangGraph 0.2, LangSmith, FastAPI, Python 3.11

Cloud — Local / Production

Qdrant (local), Azure AI Search, Azure OpenAI (GPT-4o), Azure AI Foundry, OpenRouter (dev)

Governance

RBAC (5 roles), Audit Logger, Prompt Guard, Citation Enforcement, Confidence Scoring

What Makes It Reusable Across Industries

The same framework deployed against a different corpus becomes a different product. The architecture doesn’t change — the configuration does.

Industry	Corpus	Agent Workflows	Governance Config
Financial Services	Basel, OSFI, FCA rulebooks	Gap analysis, comparison	Jurisdiction RBAC, immutable audit
Healthcare	Clinical guidelines, formularies	Treatment comparison, protocol Q&A	HIPAA audit controls, role restrictions
Legal	Case law, contracts, statutes	Precedent search, clause analysis	Matter-level access control
Enterprise Internal	Policies, SOPs, knowledge base	Factual Q&A, procedural guidance	Department-level RBAC
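
The rows of this table map naturally onto configuration objects. The sketch below shows one possible shape, assuming a frozen dataclass per deployment. The field names, the `vector_store` default, and the `DeploymentConfig` class itself are hypothetical; the values come from the table.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class DeploymentConfig:
    """One configured instance of the framework (hypothetical schema)."""

    industry: str
    corpus: tuple[str, ...]
    agent_workflows: tuple[str, ...]
    governance: tuple[str, ...]
    vector_store: str = "azure_ai_search"  # assumed production default


FINANCIAL_SERVICES = DeploymentConfig(
    industry="Financial Services",
    corpus=("Basel", "OSFI", "FCA rulebooks"),
    agent_workflows=("gap_analysis", "comparison"),
    governance=("jurisdiction_rbac", "immutable_audit"),
)

HEALTHCARE = DeploymentConfig(
    industry="Healthcare",
    corpus=("Clinical guidelines", "Formularies"),
    agent_workflows=("treatment_comparison", "protocol_qa"),
    governance=("hipaa_audit_controls", "role_restrictions"),
)

# A local development variant differs from production by a single field.
FINANCIAL_SERVICES_DEV = replace(FINANCIAL_SERVICES, vector_store="qdrant")
```

Keeping the configuration immutable and deriving variants with `replace` makes the difference between two deployments visible as data that can be diffed, instead of being scattered across code.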

The Consulting Advantage

A client engagement that would take 12 weeks to deliver from scratch takes 4–6 weeks with the framework. The first two weeks are corpus preparation and client-specific agent workflow design. The governance layer, retrieval architecture, and API layer are already built, tested, and documented. That time saving is the commercial case for framework thinking over project thinking.

The Lesson I’d Give Every AI Architect

The temptation in AI consulting is to let the technology lead. A new client arrives with a new problem, a new vector store is trending on Hacker News, a new agent framework just dropped — and the instinct is to start fresh, incorporate everything new, build something impressive.

The more durable instinct is to ask: what part of this problem have I solved before? What decision did I make three engagements ago that I should be able to reuse today? What governance control did I build last year that every client since then has needed?

The RAG framework I’ve described here took time to build. But it now represents a compounding asset — each engagement makes it more capable, more documented, and more configurable. Each ADR I write makes the next vendor evaluation faster. Each governance component I build makes the next regulated deployment safer.

That’s the difference between an AI project and an AI practice.

Check the Details

The financial services configuration is open source — architecture diagrams, ADRs, and working code available on GitHub.
