Building an Enterprise RAG Framework
Faiz Faruqi  ·  Enterprise AI Architecture
Architecture Deep Dive

Why I Stopped Building Bespoke AI Systems for Every Client

A behind-the-scenes look at the reusable Enterprise RAG Framework I deploy across client engagements — and why the architecture decisions that seem over-engineered are the ones that matter most in production.

Early in my AI consulting work, I made the same mistake most architects make: I built bespoke AI systems for each client. New client, new ingestion pipeline. New client, new retrieval logic. New client, new governance layer. Three months of work, repeated from scratch, every engagement.

The breaking point came when a second financial services client asked for almost exactly what I had built for the first: a system to query regulatory documents in natural language, with citations, confidence scoring, and a full audit trail. The underlying problem was identical. The technology was identical. The only differences were the document corpus and the cloud provider.

That’s when I stopped building projects and started building a framework.

What the Framework Actually Is

The Enterprise RAG Architecture Framework is a production-grade, configurable retrieval-augmented generation platform I deploy and adapt for each client engagement. It handles the foundational concerns that every enterprise AI deployment shares — secure ingestion, governed retrieval, auditable LLM orchestration, and a REST API with access control baked in.

Think of it the way a systems integrator brings a reference architecture to an engagement, rather than designing from first principles every time. The framework is the reference architecture. Each deployment is a configured instance of it.

What makes it a framework rather than a project is the deliberate abstraction at every layer. The LLM provider, vector store, document corpus, agent workflows, and governance configuration are all externalised. Changing them requires configuration, not code.
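To make "configuration, not code" concrete, here is a minimal sketch of how a deployment might be described and how a provider could be resolved from it. It is not the framework's actual code: `DeploymentConfig`, its field names, the stub store classes, and `build_vector_store` are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class DeploymentConfig:
    """Hypothetical per-engagement settings; field names are illustrative."""
    llm_provider: str
    vector_store: str
    corpus_path: str
    agent_workflows: tuple[str, ...]


class QdrantStore:
    """Stub standing in for a real Qdrant-backed store."""
    name = "qdrant"


class AzureAISearchStore:
    """Stub standing in for a real Azure AI Search-backed store."""
    name = "azure_ai_search"


# Registry keyed by the string that appears in configuration.
VECTOR_STORES = {cls.name: cls for cls in (QdrantStore, AzureAISearchStore)}


def build_vector_store(config: DeploymentConfig):
    """Resolve the configured store; unknown names fail loudly."""
    try:
        return VECTOR_STORES[config.vector_store]()
    except KeyError:
        raise ValueError(f"Unknown vector store: {config.vector_store!r}") from None


DEV = DeploymentConfig(
    llm_provider="openrouter",
    vector_store="qdrant",
    corpus_path="./corpus/financial_services",  # assumed path, for illustration
    agent_workflows=("factual", "comparative", "gap_analysis"),
)

# Moving to production changes values, not code paths.
PROD = replace(DEV, llm_provider="azure_openai", vector_store="azure_ai_search")
```

Calling `build_vector_store(DEV)` and `build_vector_store(PROD)` returns different store objects from the same call site, which is the property the abstraction is meant to provide.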

The Four Layers — and Why Each One Exists

LAYER 01 · Ingestion Pipeline · CONFIGURABLE

PDF loading, text cleaning, semantic chunking, metadata tagging, and embedding generation. Idempotent — safe to re-run when documents are updated.

LAYER 02 · Hybrid Retrieval · TUNABLE

Vector search plus BM25 keyword search, fused using Reciprocal Rank Fusion. Query analyser classifies intent and applies metadata filters before retrieval.

LAYER 03 · LLM Orchestration · EXTENSIBLE

LangChain RAG chain for factual queries. LangGraph multi-agent graph for complex workflows — comparison, gap analysis, multi-document synthesis.

LAYER 04 · Governance Layer · STANDARD

RBAC with role-based jurisdiction restrictions, prompt injection guard, append-only audit log, confidence scoring, and citation enforcement. Ships standard.

Design Principle

Governance is a first-class framework component — not a client-specific add-on. Every enterprise deployment, regardless of sector, needs identity and access control, an audit trail, prompt safety controls, and confidence signalling. Building these once and configuring them per engagement is the only sensible approach at consulting scale.
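As one illustration of a governance control that can be built once and configured per engagement, the sketch below applies role-based jurisdiction restrictions to retrieved chunks. The role names, jurisdiction codes, and the `Chunk` type are assumptions made for this example; the article does not list the framework's actual roles or metadata schema.

```python
from dataclasses import dataclass

# Hypothetical role-to-jurisdiction map. In a real deployment this would
# come from configuration rather than being hard-coded.
ROLE_JURISDICTIONS = {
    "canada_analyst": {"CA"},
    "global_compliance": {"CA", "INTL"},
}


@dataclass(frozen=True)
class Chunk:
    text: str
    source: str
    jurisdiction: str  # assumed metadata tag applied at ingestion time


def allowed_jurisdictions(role: str) -> set[str]:
    """Unknown roles get an empty set, so access is denied by default."""
    return ROLE_JURISDICTIONS.get(role, set())


def filter_chunks(role: str, chunks: list[Chunk]) -> list[Chunk]:
    """Drop any chunk whose jurisdiction the role is not entitled to see."""
    allowed = allowed_jurisdictions(role)
    return [c for c in chunks if c.jurisdiction in allowed]


chunks = [
    Chunk("OSFI guideline text", "osfi_guideline.pdf", "CA"),
    Chunk("Basel Framework text", "basel_capital_requirements.pdf", "INTL"),
]
visible_to_canada_analyst = filter_chunks("canada_analyst", chunks)
```

Filtering before the chunks reach the LLM means a restricted role cannot receive an answer grounded in documents it is not allowed to see. Changing the restrictions for a new client is an edit to the map, not to the retrieval code.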

The Financial Services Configuration

The deployment I’ll walk through here is the financial services configuration — built for regulatory compliance Q&A. The corpus is publicly available regulatory documents: Basel Framework publications from the Bank for International Settlements and OSFI guidelines from Canada’s federal banking regulator.

The use case is concrete: compliance analysts at a universal bank spend 60–80% of their research time manually searching regulatory PDFs to answer questions like “What are the minimum CET1 capital requirements under Basel III?” or “Does our current liquidity framework satisfy OSFI B-10 obligations?” The framework answers both — with citations, confidence scores, and a full audit trail of who asked what and when.

Here is what a live response looks like:

Query: “What are the minimum capital requirements under Basel?”

Response:
The minimum capital requirements under Basel III are:
· CET1 ≥ 4.5% of risk-weighted assets [SOURCE: basel_capital_requirements.pdf, Page 20]
· Tier 1 Capital ≥ 6.0% of risk-weighted assets
· Total Capital ≥ 8.0% of risk-weighted assets [SOURCE: basel_capital_requirements.pdf, Page 20]

Capital conservation buffer: +2.5% brings CET1 minimum to 7.0% including buffer requirements. [SOURCE: basel_capital_requirements.pdf, Page 72]

Confidence: HIGH
Latency: 3,603ms
Chunks used: 6

Every factual claim is cited. Every response carries a confidence assessment. Every query is logged to an immutable audit trail. These aren’t features — they’re requirements in a regulated environment, and they’re delivered by the framework, not built per engagement.
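One way a citation check could work is sketched below, assuming the `[SOURCE: file, Page N]` tag format shown in the response above. It verifies that an answer carries at least one citation and that every citation points at a chunk that was actually retrieved. The function names are hypothetical, and the framework's real enforcement logic may differ.

```python
import re

# Matches tags such as "[SOURCE: basel_capital_requirements.pdf, Page 20]".
CITATION_RE = re.compile(r"\[SOURCE:\s*([^,\]]+),\s*Page\s+(\d+)\]")


def extract_citations(answer: str) -> list[tuple[str, int]]:
    """Return (filename, page) pairs in the order they appear."""
    return [(name.strip(), int(page)) for name, page in CITATION_RE.findall(answer)]


def unsupported_citations(
    answer: str, retrieved: set[tuple[str, int]]
) -> list[tuple[str, int]]:
    """Citations that do not correspond to any retrieved chunk."""
    return [c for c in extract_citations(answer) if c not in retrieved]


def passes_citation_check(answer: str, retrieved: set[tuple[str, int]]) -> bool:
    """Require at least one citation, and every citation to be grounded."""
    return bool(extract_citations(answer)) and not unsupported_citations(answer, retrieved)


retrieved = {
    ("basel_capital_requirements.pdf", 20),
    ("basel_capital_requirements.pdf", 72),
}
answer = (
    "CET1 must be at least 4.5% of risk-weighted assets "
    "[SOURCE: basel_capital_requirements.pdf, Page 20]"
)
ok = passes_citation_check(answer, retrieved)
```

A response that fails the check can be rejected or regenerated before it reaches the user, and the failure itself can be written to the audit log.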

The Decision That Changes Everything — Hybrid Retrieval

Most RAG implementations use pure vector search. It works well for conceptual queries. It fails badly for one specific class of query that is extremely common in enterprise contexts: exact citation lookup.

A question like “What is the spirit of Basel capital requirements?” is conceptual, and vector search handles it well. A question like “Article 147(2)(b) counterparty credit risk weighting” requires exact term matching. Vector search can miss it, because an embedding captures the general meaning of a chunk rather than its exact tokens, so the chunk containing that precise citation may not rank among the nearest neighbours.

The framework uses Reciprocal Rank Fusion, a rank-based merging algorithm that combines vector search candidates and BM25 keyword search candidates without requiring score normalisation. A document that appears in both result sets accumulates score from each list, so it outranks documents found by only one retriever at a similar position. The result is a 15–25% improvement in retrieval precision over pure vector search, and a 35 percentage point improvement in citation recall.
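Reciprocal Rank Fusion is small enough to sketch in full. Each document's fused score is the sum of 1 / (k + rank) over every ranked list it appears in. The constant k = 60 used here is the conventional default; the article does not state which value the framework uses, and the chunk IDs are made up for the example.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of document IDs into a single ranking.

    Only ranks are used, so the retrievers' raw scores never need to be
    put on a common scale.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


vector_hits = ["c1", "c2", "c3"]  # hypothetical chunk IDs from vector search
bm25_hits = ["c4", "c2", "c1"]    # hypothetical chunk IDs from BM25
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
# fused == ["c1", "c2", "c4", "c3"]
```

In this example `c1` and `c2` appear in both lists and finish ahead of `c4`, even though `c4` was the top BM25 hit. This is the behaviour that lets an exact-citation match and a semantic match reinforce each other.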

The Agentic Layer — When a Single Retrieval Pass Isn’t Enough

Simple factual queries need a single retrieval pass and a single generation step. But enterprise use cases rarely stay simple. Gap analysis between an internal policy and a regulatory requirement needs multiple retrieval passes against different document sets, independent reasoning over each, and a synthesis step that produces a risk-rated output.

The framework’s LangGraph multi-agent layer handles exactly this. A supervisor node classifies the query and routes it through the appropriate agent sequence:

Query Type | Example | Agent Path
Factual | What is the CET1 minimum? | RAG Chain only
Procedural | How do I implement an ICAAP? | RAG Chain only
Comparative | Basel III vs Basel IV liquidity rules | Retrieval → Comparison → Summary
Gap Analysis | Does our policy satisfy OSFI B-10? | Retrieval → Gap Analysis → Summary

The supervisor routing is rule-based, not LLM-driven. Routing decisions cost no LLM tokens, add negligible latency, and are fully auditable: I can explain every routing decision without inspecting LLM outputs. LLM tokens are spent on the agents that the query actually requires, not on routing logic.
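A rule-based supervisor of this kind can be sketched with ordered keyword patterns. The patterns below are assumptions written to fit the four example queries in the routing table, not the framework's actual rules. In a LangGraph deployment, a function like `classify` would typically be the callable behind a conditional edge.

```python
import re

# Agent sequences per query type, mirroring the routing table.
AGENT_PATHS = {
    "factual": ["rag_chain"],
    "procedural": ["rag_chain"],
    "comparative": ["retrieval", "comparison", "summary"],
    "gap_analysis": ["retrieval", "gap_analysis", "summary"],
}

# Hypothetical keyword rules, checked in order; the first match wins.
RULES = [
    ("gap_analysis", re.compile(r"\b(satisf(y|ies)|compl(y|ies|iant)|gap)\b", re.I)),
    ("comparative", re.compile(r"\b(vs\.?|versus|compare[ds]?|difference)\b", re.I)),
    ("procedural", re.compile(r"^\s*how (do|can|should)\b", re.I)),
]


def classify(query: str) -> str:
    """Return the first matching query type, defaulting to 'factual'."""
    for label, pattern in RULES:
        if pattern.search(query):
            return label
    return "factual"


def route(query: str) -> tuple[str, list[str]]:
    """Return the query type and the agent sequence to run for it."""
    label = classify(query)
    return label, AGENT_PATHS[label]


label, path = route("Does our policy satisfy OSFI B-10?")
# label == "gap_analysis"; path == ["retrieval", "gap_analysis", "summary"]
```

Because the decision is a pure function of the query text, the matched rule can be written to the audit log alongside the query, which is what makes each routing decision explainable after the fact.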

The Architecture Decisions That Made This a Framework

Three decisions separate a reusable framework from a one-time project. Each is documented in an Architecture Decision Record — a discipline I apply across every engagement so decisions are portable, reversible, and explainable.

ADR-001 · Vector Store Selection
Decision: Azure AI Search over Pinecone/Qdrant
Why it matters: Data residency, Entra ID RBAC integration, and native hybrid search, all in one service.
Portability: Provider abstraction means a swap is one config change. Qdrant in dev, Azure AI Search in prod.

ADR-002 · Chunking Strategy
Decision: Semantic over fixed-size
Why it matters: Fixed-size chunking splits regulatory clauses mid-sentence. A half-clause retrieved in isolation is a compliance risk.
Result: 94% clause integrity vs 67% with fixed-size. Measurable. Documented. Repeatable.

ADR-003 · Retrieval Strategy
Decision: Hybrid RRF over pure vector
Why it matters: Enterprise text has two retrieval modes, semantic and exact citation. One retriever cannot handle both optimally.
Result: Precision@6 of 0.84 vs 0.71 vector-only. Citation recall +35 percentage points.
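The failure mode behind ADR-002 is easy to reproduce. The sketch below compares naive fixed-size splitting with a chunker that only breaks at sentence boundaries. This is a deliberate simplification: real semantic chunking typically also uses document structure or embeddings, and the article does not describe the framework's chunker. The `ends_cleanly` ratio is a crude proxy invented for this example, not the article's clause-integrity metric.

```python
import re


def fixed_size_chunks(text: str, size: int) -> list[str]:
    """Split every `size` characters, regardless of sentence boundaries."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def sentence_chunks(text: str, max_chars: int) -> list[str]:
    """Pack whole sentences into chunks of at most `max_chars` characters.

    A single sentence longer than `max_chars` is kept whole, not split.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def ends_cleanly(chunks: list[str]) -> float:
    """Share of chunks that end on sentence-terminal punctuation."""
    return sum(c.rstrip().endswith((".", "!", "?")) for c in chunks) / len(chunks)


text = (
    "Banks must hold CET1 of at least 4.5% of RWA. "
    "Tier 1 capital must be at least 6.0%. "
    "Total capital must be at least 8.0%."
)
fixed = fixed_size_chunks(text, 60)
aware = sentence_chunks(text, 60)
```

Here the first fixed-size chunk stops partway through the Tier 1 sentence, while every sentence-aware chunk contains complete sentences. The lookbehind only splits on punctuation followed by whitespace, so decimals such as "4.5%" are left intact. Abbreviations like "para. 3" would still need special handling.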

The Tech Stack

Orchestration: LangChain 0.3, LangGraph 0.2, LangSmith, FastAPI, Python 3.11

Cloud (Local / Production): Qdrant (local), Azure AI Search, Azure OpenAI (GPT-4o), Azure AI Foundry, OpenRouter (dev)

Governance: RBAC (5 roles), Audit Logger, Prompt Guard, Citation Enforcement, Confidence Scoring

What Makes It Reusable Across Industries

The same framework deployed against a different corpus becomes a different product. The architecture doesn’t change — the configuration does.

Industry | Corpus | Agent Workflows | Governance Config
Financial Services | Basel, OSFI, FCA rulebooks | Gap analysis, comparison | Jurisdiction RBAC, immutable audit
Healthcare | Clinical guidelines, formularies | Treatment comparison, protocol Q&A | HIPAA audit controls, role restrictions
Legal | Case law, contracts, statutes | Precedent search, clause analysis | Matter-level access control
Enterprise Internal | Policies, SOPs, knowledge base | Factual Q&A, procedural guidance | Department-level RBAC

The Consulting Advantage

A client engagement that would take 12 weeks to deliver from scratch takes 4–6 weeks with the framework. The first two weeks are corpus preparation and client-specific agent workflow design. The governance layer, retrieval architecture, and API layer are already built, tested, and documented. That time saving is the commercial case for framework thinking over project thinking.

The Lesson I’d Give Every AI Architect

The temptation in AI consulting is to let the technology lead. A new client arrives with a new problem, a new vector store is trending on Hacker News, a new agent framework just dropped — and the instinct is to start fresh, incorporate everything new, build something impressive.

The more durable instinct is to ask: what part of this problem have I solved before? What decision did I make three engagements ago that I should be able to reuse today? What governance control did I build last year that every client since then has needed?

The RAG framework I’ve described here took time to build. But it now represents a compounding asset — each engagement makes it more capable, more documented, and more configurable. Each ADR I write makes the next vendor evaluation faster. Each governance component I build makes the next regulated deployment safer.

That’s the difference between an AI project and an AI practice.

Check the Details

The financial services configuration is open source — architecture diagrams, ADRs, and working code available on GitHub.

© 2026 Faiz Faruqi  ·  Enterprise AI Architecture  ·  All views are my own