Portfolio Case Study — Enterprise GenAI Architecture

Building a Local-First
Hybrid AI Platform

From prototype to enterprise-ready GenAI architecture — balancing data privacy, inference cost, and operational resilience through a modular, routing-aware design.

FastAPI · Qdrant · Ollama · Redis · OpenRouter RAG · Hybrid Inference · LLMOps Enterprise Architecture
Executive Summary
The Problem
Enterprises want AI-powered insights without exposing sensitive data to cloud LLMs or absorbing unpredictable per-token costs at scale.
The Architecture
A local-first hybrid inference platform — routing queries through local models by default, falling back to cloud LLMs for complex reasoning, with semantic caching to eliminate redundant calls.
Primary Outcome
Privacy-sensitive documents stay on-premises. Repeated queries are served from cache. Cloud LLM is a precision instrument, not the default firehose.
Architectural Pattern
RAG + Hybrid Routing + Caching Layer. Each component independently replaceable — no vendor lock-in at any tier.
Motivating Use Case

A mid-sized energy company needed to analyse hundreds of vendor contracts and compliance documents — extracting obligations, financial terms, and risk flags — without sending sensitive legal data to a third-party cloud endpoint.

Sensitive Data Constraint
Contracts contain confidential financial terms, counterparty obligations, and regulatory disclosures. Transmission to cloud APIs is not permissible under internal policy.
Cost at Query Volume
Business users frequently ask identical or near-identical questions across document sets. Per-token cloud costs compound rapidly without a caching strategy.
Response Consistency
Downstream workflows depend on deterministic, auditable outputs. Non-deterministic LLM responses for repeated queries undermine process reliability.
Architecture Diagram

A three-tier platform: an orchestration layer (FastAPI), a retrieval and caching layer (Qdrant + Redis), and a dual inference layer (local Ollama + cloud OpenRouter) connected through a priority-based routing policy.

User Open WebUI ORCHESTRATION FastAPI — Control Plane Orchestration · Routing · Caching Integration RETRIEVAL + CACHE Qdrant Vector DB · Semantic Retrieval Redis Query Cache · Cost Control INFERENCE LAYER PRIMARY FALLBACK Ollama (Local) Gemma 2 · Privacy-first OpenRouter (Cloud) Complex Reasoning · Fallback Response → User Primary path Fallback path Return / Cache
Request Flow
01
Query ingestion
User submits a natural language query via Open WebUI. FastAPI receives the request and initiates the orchestration pipeline.
02
Semantic retrieval — Qdrant
The query is embedded and used to retrieve the top-k most semantically relevant document chunks from Qdrant. These chunks form the context window for the LLM prompt.
03
Cache lookup — Redis
Before invoking any LLM, the system checks Redis for an existing response to the same query-context pair.
Cache hit → return instantly, zero LLM cost
04
Inference routing decision
On a cache miss, the orchestrator routes to the appropriate inference tier based on query complexity and availability.
Primary → Ollama (Gemma 2, local) Fallback → OpenRouter (cloud, complex queries)
05
Cache write + response return
The response is stored in Redis before being returned to the user — ensuring future identical queries are served without model invocation.
Technology Stack

ORCHESTRATION

FastAPI Python 3.11+ Pydantic

RETRIEVAL

Qdrant Sentence Transformers LangChain Document Loaders

INFERENCE

Ollama Gemma 2 (9B) OpenRouter

INFRASTRUCTURE

Redis Ubuntu 24 Host Docker Compose Open WebUI
Design Principles
01 — PRIVACY
Local-first inference
Sensitive enterprise data is processed locally by default. Cloud models are never the primary path — only an explicitly triggered fallback for queries that exceed local model capability.
02 — RESILIENCE
Graceful degradation
If the local inference node is unavailable, the system automatically promotes the cloud path. No manual intervention required. The user experience is uninterrupted.
03 — ECONOMICS
Cost-aware caching
Redis intercepts repeated queries before they reach any LLM. For high-repetition enterprise query patterns, this can eliminate the majority of inference cost entirely.
04 — MODULARITY
Component independence
Each layer — LLM runtime, vector database, cache — is swappable without system-wide rearchitecting. Ollama can be replaced with vLLM; Qdrant with pgvector; Redis with a persistent store.
Measured Benefits
Capability Architectural Mechanism
Data Privacy Sensitive documents processed exclusively by Ollama on-premises. No document content exits the network perimeter.
Cost Control Redis cache eliminates redundant cloud API calls. Per-token spend is bounded and predictable, not volume-linear.
Response Latency Cache hits return in sub-10ms. Local inference (no network round-trip) is consistently faster than cloud for sub-threshold queries.
System Resilience Automatic fallback to cloud on local failure. No single point of failure across the inference tier.
Vendor Neutrality OpenRouter abstracts the cloud model provider. Switching from GPT-4 to Claude 3 is a config change, not a code change.
Lessons Learned
Local models are production-viable for constrained domains. With careful prompt design and appropriate model selection (Gemma 2 9B), local inference handles the majority of enterprise document Q&A without quality degradation.
Caching is the highest-ROI optimisation in enterprise GenAI. Most teams over-index on model quality and under-invest in caching. In repetitive query environments, a well-designed cache can reduce cost by 60–80%.
Hybrid architectures require explicit routing logic. The decision of when to invoke local vs cloud cannot be left to chance. Building confidence scoring and complexity thresholds early prevents later architectural debt.
Simplicity compounds. Every component added to the pipeline introduces failure modes and operational overhead. Resist feature sprawl — prove each layer earns its place through measurable impact.
Roadmap
Intelligent query routing
Classify query complexity before routing. Lightweight classifier pushes simple queries to local; complex multi-hop reasoning to cloud.
Session memory
Persist multi-turn conversation context in Redis or Qdrant. Enable coherent document exploration sessions without stateless limitations.
LLMOps observability
Integrate LangSmith or Arize for per-query latency tracking, cost attribution, and model quality scoring across inference tiers.
Multi-user access control
RBAC layer over document collections. Users query only documents scoped to their role — enforced at retrieval, not just UI.
Model performance scoring
Benchmark local vs cloud responses on a test set. Dynamically adjust routing thresholds based on empirical quality data.
FinOps dashboard
Real-time cost visibility across inference paths. Alert on cache miss rate spikes. Attribute cloud spend by team, document collection, or use case.
Portfolio Positioning Statement
“Designed and implemented a local-first hybrid GenAI platform integrating FastAPI, Qdrant, Ollama, Redis, and OpenRouter — enabling privacy-aware enterprise document intelligence with semantic caching and resilient multi-tier inference routing.”