Portfolio Case Study — Enterprise GenAI Architecture

Building a Local-First
Hybrid AI Platform

From prototype to enterprise-ready GenAI architecture — balancing data privacy, inference cost, and operational resilience through a modular, routing-aware design.

FastAPI · Qdrant · Ollama · Redis · OpenRouter RAG · Hybrid Inference · LLMOps Enterprise Architecture

Executive Summary

The Problem

Enterprises want AI-powered insights without exposing sensitive data to cloud LLMs or absorbing unpredictable per-token costs at scale.

The Architecture

A local-first hybrid inference platform — routing queries through local models by default, falling back to cloud LLMs for complex reasoning, with semantic caching to eliminate redundant calls.

Primary Outcome

Privacy-sensitive documents stay on-premises. Repeated queries are served from cache. Cloud LLM is a precision instrument, not the default firehose.

Architectural Pattern

RAG + Hybrid Routing + Caching Layer. Each component independently replaceable — no vendor lock-in at any tier.

Motivating Use Case

A mid-sized energy company needed to analyse hundreds of vendor contracts and compliance documents — extracting obligations, financial terms, and risk flags — without sending sensitive legal data to a third-party cloud endpoint.

Sensitive Data Constraint

Contracts contain confidential financial terms, counterparty obligations, and regulatory disclosures. Transmission to cloud APIs is not permissible under internal policy.

Cost at Query Volume

Business users frequently ask identical or near-identical questions across document sets. Per-token cloud costs compound rapidly without a caching strategy.

Response Consistency

Downstream workflows depend on deterministic, auditable outputs. Non-deterministic LLM responses for repeated queries undermine process reliability.

Architecture Diagram

A three-tier platform: an orchestration layer (FastAPI), a retrieval and caching layer (Qdrant + Redis), and a dual inference layer (local Ollama + cloud OpenRouter) connected through a priority-based routing policy.

Request Flow

Query ingestion

User submits a natural language query via Open WebUI. FastAPI receives the request and initiates the orchestration pipeline.

Semantic retrieval — Qdrant

The query is embedded and used to retrieve the top-k most semantically relevant document chunks from Qdrant. These chunks form the context window for the LLM prompt.

Cache lookup — Redis

Before invoking any LLM, the system checks Redis for an existing response to the same query-context pair.

Cache hit → return instantly, zero LLM cost

Inference routing decision

On a cache miss, the orchestrator routes to the appropriate inference tier based on query complexity and availability.

Primary → Ollama (Gemma 2, local) Fallback → OpenRouter (cloud, complex queries)

Cache write + response return

The response is stored in Redis before being returned to the user — ensuring future identical queries are served without model invocation.

Technology Stack

ORCHESTRATION

FastAPI Python 3.11+ Pydantic

RETRIEVAL

Qdrant Sentence Transformers LangChain Document Loaders

INFERENCE

Ollama Gemma 2 (9B) OpenRouter

INFRASTRUCTURE

Redis Ubuntu 24 Host Docker Compose Open WebUI

Design Principles

01 — PRIVACY

Local-first inference

Sensitive enterprise data is processed locally by default. Cloud models are never the primary path — only an explicitly triggered fallback for queries that exceed local model capability.

02 — RESILIENCE

Graceful degradation

If the local inference node is unavailable, the system automatically promotes the cloud path. No manual intervention required. The user experience is uninterrupted.

03 — ECONOMICS

Cost-aware caching

Redis intercepts repeated queries before they reach any LLM. For high-repetition enterprise query patterns, this can eliminate the majority of inference cost entirely.

04 — MODULARITY

Component independence

Each layer — LLM runtime, vector database, cache — is swappable without system-wide rearchitecting. Ollama can be replaced with vLLM; Qdrant with pgvector; Redis with a persistent store.

Measured Benefits

Capability	Architectural Mechanism
Data Privacy	Sensitive documents processed exclusively by Ollama on-premises. No document content exits the network perimeter.
Cost Control	Redis cache eliminates redundant cloud API calls. Per-token spend is bounded and predictable, not volume-linear.
Response Latency	Cache hits return in sub-10ms. Local inference (no network round-trip) is consistently faster than cloud for sub-threshold queries.
System Resilience	Automatic fallback to cloud on local failure. No single point of failure across the inference tier.
Vendor Neutrality	OpenRouter abstracts the cloud model provider. Switching from GPT-4 to Claude 3 is a config change, not a code change.

Lessons Learned

Local models are production-viable for constrained domains. With careful prompt design and appropriate model selection (Gemma 2 9B), local inference handles the majority of enterprise document Q&A without quality degradation.

Caching is the highest-ROI optimisation in enterprise GenAI. Most teams over-index on model quality and under-invest in caching. In repetitive query environments, a well-designed cache can reduce cost by 60–80%.

Hybrid architectures require explicit routing logic. The decision of when to invoke local vs cloud cannot be left to chance. Building confidence scoring and complexity thresholds early prevents later architectural debt.

Simplicity compounds. Every component added to the pipeline introduces failure modes and operational overhead. Resist feature sprawl — prove each layer earns its place through measurable impact.

Roadmap

Intelligent query routing

Classify query complexity before routing. Lightweight classifier pushes simple queries to local; complex multi-hop reasoning to cloud.

Session memory

Persist multi-turn conversation context in Redis or Qdrant. Enable coherent document exploration sessions without stateless limitations.

LLMOps observability

Integrate LangSmith or Arize for per-query latency tracking, cost attribution, and model quality scoring across inference tiers.

Multi-user access control

RBAC layer over document collections. Users query only documents scoped to their role — enforced at retrieval, not just UI.

Model performance scoring

Benchmark local vs cloud responses on a test set. Dynamically adjust routing thresholds based on empirical quality data.

FinOps dashboard

Real-time cost visibility across inference paths. Alert on cache miss rate spikes. Attribute cloud spend by team, document collection, or use case.

Portfolio Positioning Statement

“Designed and implemented a local-first hybrid GenAI platform integrating FastAPI, Qdrant, Ollama, Redis, and OpenRouter — enabling privacy-aware enterprise document intelligence with semantic caching and resilient multi-tier inference routing.”

Building a Local-FirstHybrid AI Platform

Building a Local-First
Hybrid AI Platform