A comprehensive enterprise architecture recommendation — one that spans business requirements, solution design, security risk, and infrastructure cost — typically requires three to four specialist consultants, a discovery workshop, and anywhere from two to eight weeks of structured analysis. What if the first-pass synthesis could be done in under five minutes?

That question is the starting point for the EA Advisor Agent: a multi-agent AI system in which four GPT-4-powered specialists deliberate over a set of requirements, each contributing a distinct analytical lens, before synthesising a unified architecture recommendation. It is not a replacement for human architects. It is a compression tool — one that shifts the human’s role from discovery to validation.

This post covers the architecture of the system, the decisions that made it interesting, and — critically — what you would need to add before deploying anything like this in an enterprise setting.

The Problem With Solo LLM Architecture Reviews

The obvious starting point is a single prompt: “You are a senior enterprise architect. Given these requirements, design a system.” It works, up to a point. GPT-4 is genuinely capable of producing reasonable architecture recommendations when given structured input.

The problem is cognitive role conflict. When a single model is asked to optimise simultaneously for requirements coverage, solution elegance, security posture, and cost efficiency, those objectives pull against each other. A business analyst and a CISO approach the same system very differently, and in human teams that productive tension is where the best architecture emerges. Collapsing all four perspectives into one generation pass flattens that tension. You get a compromise rather than a synthesis.

The multi-agent approach preserves the tension. Each agent has a defined scope, a specialised system prompt, and no mandate to weigh trade-offs outside its own domain. The orchestration layer, not any single agent, is where the synthesis happens.

The magic of multi-agent architecture is not intelligence — it’s structured disagreement. Each agent optimises for a different objective function, and the tension between them produces better outputs than any single agent could.

How the Four Agents Collaborate

The system orchestrates a sequential pipeline using the CrewAI framework. Sequentiality is a deliberate design choice: each agent receives not just the original user requirements, but the full output of every preceding agent. Context accumulates. By the time the Cost Optimisation Specialist runs, it has access to the business requirements, the proposed architecture, and the risk assessment — allowing it to price trade-offs, not just components.

EA Advisor Agent · Orchestration Pipeline
User Requirements → Sequential Agent Pipeline → Unified Recommendation
Agent 01 · GPT-4
Senior Business Analyst
  • Extracts functional and non-functional requirements
  • Maps stakeholder groups: users, IT, business owners, regulators
  • Surfaces constraints and dependencies before design begins
  • Structures output as a requirements taxonomy for downstream agents
Agent 02 · GPT-4
Lead Solution Architect
  • Selects architecture pattern: microservices, monolith, serverless, or hybrid
  • Recommends technology stack with explicit rationale for each decision
  • Defines service boundaries, integration patterns, and data flow
  • Grounds every decision in the BA’s requirements output
Agent 03 · GPT-4
Enterprise Risk Assessment Specialist
  • Evaluates authentication, authorisation, and data protection posture
  • Identifies single points of failure and scalability bottlenecks
  • Reviews compliance exposure: GDPR, PCI-DSS, SOC 2
  • Rates each risk High / Medium / Low with specific mitigations
Agent 04 · GPT-4
Cloud Cost Optimisation Specialist
  • Produces service-level cost breakdown and 3-year TCO
  • Identifies rightsizing, reserved instance, and spot instance opportunities
  • Compares cost-efficiency across architectural alternatives
  • Recommends monitoring and alerting strategy for cost governance
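
The context accumulation described above can be sketched without any framework. This is a minimal illustration rather than the CrewAI implementation: `AgentStep`, `run_pipeline`, and `stub_agent` are hypothetical names, and each stub stands in for what would be a GPT-4 call in the real system.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical type: an agent function maps (requirements, prior outputs)
# to that agent's output text.
AgentFn = Callable[[str, Dict[str, str]], str]


@dataclass
class AgentStep:
    role: str
    run: AgentFn


def run_pipeline(requirements: str, steps: List[AgentStep]) -> Dict[str, str]:
    """Run agents in order; each one sees every preceding agent's output."""
    outputs: Dict[str, str] = {}
    for step in steps:
        # Pass a copy so an agent cannot mutate what its peers produced.
        outputs[step.role] = step.run(requirements, dict(outputs))
    return outputs


def stub_agent(role: str) -> AgentStep:
    """Stand-in for an LLM call: reports which upstream roles were visible."""
    def run(requirements: str, context: Dict[str, str]) -> str:
        seen = ", ".join(context) or "nothing"
        return f"{role} saw: {seen}"
    return AgentStep(role=role, run=run)


steps = [stub_agent(r) for r in (
    "business_analyst", "solution_architect", "risk_specialist", "cost_specialist",
)]
result = run_pipeline("10,000 concurrent users, PCI-DSS, GDPR", steps)
for text in result.values():
    print(text)
```

Running the stubs shows the property that matters: the last agent in the chain sees all three upstream outputs, while the first sees only the raw requirements.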

For the canonical e-commerce example — 10,000 concurrent users, PCI-DSS and GDPR compliance, Black Friday traffic spikes — the pipeline produces a six-service microservices design on AWS, an eight-category risk register with mitigations, and a monthly infrastructure estimate of approximately $1,400 with a 30% optimisation potential. End-to-end generation time: two to four minutes.

[Figure: EA Advisor Agent output, architecture recommendation]
[Figure: EA Advisor Agent output, risk register and cost estimate]

The Architecture Decisions That Matter

Three decisions in the system design are worth examining, because they reflect choices that recur across agentic AI projects — and getting them wrong compounds quickly.

ADR-001
Orchestration Mode
Sequential over parallel
Why it matters
Parallel execution is faster but agents lose context from peers. The Risk Agent needs to see the proposed architecture to evaluate it — not just the raw requirements.
Trade-off
2–4 min vs 60–90 sec. For an architecture recommendation that replaces a multi-day workshop, this is an acceptable cost.
ADR-002
Role Isolation
Hard scope via system prompt
Why it matters
Without hard scope constraints, agents generalise. The Cost Specialist starts offering security advice. The output becomes as unfocused as a single-agent prompt.
Implementation
Each agent’s system prompt explicitly forbids cross-domain commentary. Output format templates enforce structured, parseable results.
ADR-003
Output Format
Structured template enforcement
Why it matters
Unstructured agent output cannot be parsed downstream or displayed consistently. Risk ratings and cost tables require predictable schemas.
Result
Hallucination rate drops when agents must fill a schema rather than generate free-form prose. Format constraints are a cheap quality gate.
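
One way to make the format constraint in ADR-003 concrete is to validate agent output against a schema before it goes any further. The sketch below assumes the Risk agent is prompted to return a JSON array of entries; the field names `risk`, `rating`, and `mitigation` and the function `parse_risk_register` are illustrative, not taken from the project's templates.

```python
import json
from dataclasses import dataclass
from typing import List

# The three ratings named in the Risk agent's brief.
ALLOWED_RATINGS = {"High", "Medium", "Low"}
REQUIRED_FIELDS = {"risk", "rating", "mitigation"}  # assumed field names


@dataclass(frozen=True)
class RiskEntry:
    risk: str
    rating: str
    mitigation: str


def parse_risk_register(raw: str) -> List[RiskEntry]:
    """Parse agent output and reject anything that does not fit the schema."""
    try:
        items = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"output is not valid JSON: {exc}") from exc
    if not isinstance(items, list):
        raise ValueError("expected a JSON array of risk entries")

    entries = []
    for i, item in enumerate(items):
        if not isinstance(item, dict):
            raise ValueError(f"entry {i}: expected an object")
        missing = REQUIRED_FIELDS - set(item)
        if missing:
            raise ValueError(f"entry {i}: missing fields {sorted(missing)}")
        if item["rating"] not in ALLOWED_RATINGS:
            raise ValueError(f"entry {i}: invalid rating {item['rating']!r}")
        entries.append(RiskEntry(item["risk"], item["rating"], item["mitigation"]))
    return entries
```

A `ValueError` here is a natural point to retry the agent or escalate, rather than passing malformed output downstream.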

The Tech Stack

Agent Orchestration

CrewAI 0.28, LangChain, GPT-4, FastAPI, Python 3.11

Frontend

Next.js 14, TypeScript, Tailwind CSS, NextAuth.js

Deployment

Vercel (frontend), Railway / Render (API), Docker, JWT Auth

What This Version Cannot Do — And Why That Matters

The prototype works. The agents produce coherent, useful output. For a portfolio demonstration or an internal innovation proof-of-concept, this is sufficient. For a production deployment inside a bank, a healthcare system, or any organisation where the recommendations might influence real infrastructure spend or real compliance posture, there are four categories of capability that are currently absent.

The Honest Assessment

Multi-agent AI systems are convincing. The output reads like expert analysis because it is structured like expert analysis. That legibility is precisely why the gap between prototype and production must be taken seriously. A system that sounds authoritative without the guardrails to be authoritative is an enterprise liability, not an asset.

Productionising This — Evaluation, Drift, and Cost

Shipping an agentic system into production is not a deployment problem. It is a measurement and governance problem. Here is the framework I would apply to take this system from prototype to enterprise-grade.

01
Evaluation
Ground truth scoring against known-good architecture recommendations. LLM-as-judge for qualitative criteria. Human-in-the-loop review gates before recommendations are acted upon.
02
Drift Detection
Output quality monitoring across model versions. Automated regression tests when OpenAI updates GPT-4. Embedding drift tracking on input distributions to flag out-of-domain queries.
03
Cost Governance
Token budgets per agent with hard limits. Request batching and output caching for repeated pattern queries. Real-time cost dashboards with alert thresholds before spend exceeds policy.

Evaluation — You Cannot Improve What You Do Not Measure

The first production requirement is a ground truth evaluation set: fifty to one hundred architecture scenarios with expert-validated recommendations, scored across four dimensions — requirements coverage, design pattern appropriateness, risk completeness, and cost estimate accuracy. Every major model update, prompt change, or agent restructuring should run against this set before reaching production traffic.
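
A regression gate over that evaluation set could look something like the following. It is a sketch under assumptions: scores are taken to be on a 0.0 to 1.0 scale per scenario, the dimension keys are shorthand for the four dimensions named above, and the 0.02 tolerance is an arbitrary placeholder.

```python
from statistics import mean
from typing import Dict, List

# Shorthand keys for the four scoring dimensions (names are illustrative).
DIMENSIONS = (
    "requirements_coverage",
    "pattern_appropriateness",
    "risk_completeness",
    "cost_accuracy",
)


def aggregate(scores: List[Dict[str, float]]) -> Dict[str, float]:
    """Average each dimension's score over all scenarios in the set."""
    return {dim: mean(s[dim] for s in scores) for dim in DIMENSIONS}


def regressions(baseline: Dict[str, float], candidate: Dict[str, float],
                tolerance: float = 0.02) -> List[str]:
    """Return the dimensions where the candidate scores worse than the
    baseline by more than the tolerance. An empty list means the gate passes."""
    return [d for d in DIMENSIONS if candidate[d] < baseline[d] - tolerance]
```

The baseline would be the aggregate for the configuration currently in production, and the candidate the aggregate for the proposed model version or prompt change; any non-empty result blocks the release.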

For qualitative dimensions where numerical scoring is insufficient, LLM-as-judge evaluation is appropriate — a separate GPT-4 instance evaluating agent output against a rubric, with structured scoring and a rationale. This is not circular validation; it is a scale-efficient proxy for expert human review, used to triage which outputs require human escalation. Tools like LangSmith and Arize Phoenix provide the tracing and evaluation infrastructure to make this operational without bespoke tooling.

Critically, evaluation must be continuous, not one-off. Every production query should log inputs, intermediate agent outputs, and final recommendations. Weekly sampling by a domain expert closes the feedback loop. Without it, you will not know when the system starts drifting — and it will drift.

Drift — The Problem Nobody Plans For

A multi-agent system has multiple surfaces where drift manifests. Model drift occurs when the provider updates the underlying LLM: GPT-4’s behaviour is not static, and a new model version can shift recommendation quality in ways that are subtle and hard to detect without systematic testing. Input drift occurs when user queries move away from the distribution on which the system was evaluated. If the prompts were developed and evaluated against e-commerce and SaaS requirements, quality degrades silently when someone submits an IoT edge architecture question.

The mitigation strategy is threefold. First, pin model versions explicitly — use gpt-4-0125-preview rather than gpt-4, and implement a controlled upgrade process with regression testing before advancing. Second, embed input queries using a lightweight model and monitor the embedding distribution using statistical process control — a shift in the distribution triggers a human review flag. Third, implement an output confidence signal: when agent outputs fall below a coherence threshold or contain known hallucination patterns (specific version numbers, named individuals, precise statistics without citations), route to a human reviewer rather than serving directly.
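
The second mitigation can be illustrated with a simple Shewhart-style control limit on each query's distance from the centroid of a baseline set of embeddings. This is a simplified sketch, not the system's implementation: the post does not name an embedding model, so vectors are passed in as plain lists of floats, and `DriftMonitor` and the three-sigma default are assumptions.

```python
import math
from statistics import mean, pstdev
from typing import List, Sequence

Vector = Sequence[float]


def centroid(vectors: List[Vector]) -> List[float]:
    """Component-wise mean of the baseline embeddings."""
    return [mean(col) for col in zip(*vectors)]


def distance(a: Vector, b: Vector) -> float:
    """Euclidean distance between two embeddings."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class DriftMonitor:
    """Flags queries whose embedding lies unusually far from the baseline."""

    def __init__(self, baseline: List[Vector], sigmas: float = 3.0) -> None:
        self.center = centroid(baseline)
        dists = [distance(v, self.center) for v in baseline]
        # Upper control limit: mean distance plus `sigmas` standard deviations.
        self.limit = mean(dists) + sigmas * pstdev(dists)

    def is_out_of_domain(self, embedding: Vector) -> bool:
        """True when the query should be routed to a human review flag."""
        return distance(embedding, self.center) > self.limit
```

A per-query limit like this catches individual outliers; tracking the rate of flagged queries over time would be the natural next step for detecting a gradual shift in the distribution.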

Cost — The Invisible Runaway Risk

At GPT-4 pricing, an analysis of 8,000–15,000 tokens costs approximately $0.40–$1.20 per query at current rates. That is reasonable at ten queries per day. At one thousand queries per day, which is plausible in an enterprise deployment, the unmanaged monthly cost is $12,000–$36,000 for LLM inference alone, before infrastructure.
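
The arithmetic behind those monthly figures is worth making explicit. The sketch below assumes a 30-day month, which is what the quoted range implies; `monthly_llm_cost` is an illustrative helper.

```python
def monthly_llm_cost(cost_per_query: float, queries_per_day: int,
                     days_per_month: int = 30) -> float:
    """Unmanaged monthly inference spend in dollars."""
    return cost_per_query * queries_per_day * days_per_month


# Per-query cost range quoted above: $0.40 to $1.20.
for queries_per_day in (10, 1000):
    low = monthly_llm_cost(0.40, queries_per_day)
    high = monthly_llm_cost(1.20, queries_per_day)
    print(f"{queries_per_day} queries/day: ${low:,.0f} to ${high:,.0f} per month")
```

The same per-query cost that looks negligible at ten queries per day scales linearly with volume, which is why the controls that follow focus on reducing either the number of billable calls or the cost of each one.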

Cost Control Mechanism | Implementation | Estimated Saving
Output Caching | Redis cache on input hash; TTL 24–72 hours for identical queries | 30–50%
Model Tiering | GPT-3.5-turbo for BA and Cost agents; GPT-4 for SA and Risk only | 40–60%
Token Budgets | Hard max_tokens per agent with truncation and retry logic | 15–25%
Async Streaming | WebSocket streaming to reduce perceived latency without extra cost | 0% (UX improvement only)
Request Throttling | Rate limits per user tier; queue for non-urgent analysis requests | Prevents runaway spend
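
The output caching row can be sketched as follows. An in-memory dictionary stands in for Redis so the example is self-contained, and `TTLCache` and `cache_key` are hypothetical names. Collapsing whitespace before hashing is an assumption about what counts as an identical query.

```python
import hashlib
import time
from typing import Callable, Dict, Optional, Tuple


def cache_key(requirements: str) -> str:
    """SHA-256 of the input text with runs of whitespace collapsed."""
    normalised = " ".join(requirements.split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


class TTLCache:
    """In-memory stand-in for a Redis cache keyed on the input hash."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self._store: Dict[str, Tuple[float, str]] = {}

    def get(self, requirements: str) -> Optional[str]:
        key = cache_key(requirements)
        hit = self._store.get(key)
        if hit is None:
            return None
        expires_at, value = hit
        if self.clock() >= expires_at:
            del self._store[key]  # expired: drop it and report a miss
            return None
        return value

    def put(self, requirements: str, value: str) -> None:
        self._store[cache_key(requirements)] = (self.clock() + self.ttl, value)
```

On a hit the pipeline is skipped entirely, so the saving depends on how often users submit the same requirements within the TTL window.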

Model tiering deserves emphasis. The Business Analyst and Cost Optimisation agents perform largely structured extraction and calculation tasks, which GPT-3.5-turbo handles well at roughly one-tenth the cost. Reserving GPT-4 for the Solution Architect and Risk Assessment agents, where nuanced reasoning over complex requirements matters most, reduces inference cost by 40–60% without meaningful degradation in the quality of the output that reaches the user.
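
A rough check on that saving, under stated assumptions: the one-tenth cost ratio is the approximate figure from the text rather than a price list, and the equal token split across agents is purely illustrative. `tiering_saving` and the agent keys are hypothetical names.

```python
from typing import Dict

# Relative cost per token, with GPT-4 as 1.0. The one-tenth figure for
# GPT-3.5-turbo is the rough ratio quoted in the text, not a price list.
RELATIVE_COST = {"gpt-4": 1.0, "gpt-3.5-turbo": 0.1}

# Tiering described above: cheaper model for the BA and Cost agents.
TIERING = {
    "business_analyst": "gpt-3.5-turbo",
    "solution_architect": "gpt-4",
    "risk_specialist": "gpt-4",
    "cost_specialist": "gpt-3.5-turbo",
}


def tiering_saving(token_share: Dict[str, float]) -> float:
    """Fraction of inference cost saved versus running every agent on GPT-4.

    token_share maps each agent to its share of total tokens (summing to 1.0).
    """
    tiered = sum(share * RELATIVE_COST[TIERING[agent]]
                 for agent, share in token_share.items())
    return 1.0 - tiered


equal_split = {agent: 0.25 for agent in TIERING}
print(f"saving with equal token shares: {tiering_saving(equal_split):.0%}")
```

With equal shares the estimate lands inside the 40–60% range. Because context accumulates through the pipeline, later agents consume more input tokens than earlier ones, so the real split is unlikely to be equal and the actual saving should be measured rather than assumed.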

The Governance Layer That Is Missing

Beyond evaluation, drift, and cost, an enterprise deployment needs two additional controls. Prompt injection guardrails — because a user who submits crafted requirements can attempt to redirect agent behaviour — should be implemented as a pre-processing classification step that screens inputs for adversarial patterns before they enter the pipeline. And an audit trail: every query, every intermediate agent output, and every final recommendation should be logged to an immutable store, with the user identity and timestamp attached. When an organisation acts on an AI-generated architecture recommendation and something goes wrong, you need to be able to reconstruct exactly what the system said and why.
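
The audit trail requirement can be approximated in application code with a hash-chained, append-only log, in which each entry commits to the one before it. This is a sketch of the tamper-evidence idea only; a real deployment would also need write-once storage, and `AuditLog` and its field names are assumptions.

```python
import hashlib
import json
from typing import Any, Dict, List


def _digest(record: Dict[str, Any]) -> str:
    """Stable SHA-256 over the record's JSON form."""
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


class AuditLog:
    """Append-only log in which each entry commits to the one before it."""

    def __init__(self) -> None:
        self.entries: List[Dict[str, Any]] = []

    def append(self, user_id: str, timestamp: str, stage: str, content: str) -> None:
        prev_hash = self.entries[-1]["hash"] if self.entries else ""
        body = {"user_id": user_id, "timestamp": timestamp, "stage": stage,
                "content": content, "prev_hash": prev_hash}
        self.entries.append({**body, "hash": _digest(body)})

    def verify(self) -> bool:
        """Recompute every hash; False means an entry was edited or the
        chain linking consecutive entries was broken."""
        prev_hash = ""
        for entry in self.entries:
            body = {k: v for k, v in entry.items() if k != "hash"}
            if entry["prev_hash"] != prev_hash or entry["hash"] != _digest(body):
                return False
            prev_hash = entry["hash"]
        return True
```

Logging the query, each intermediate agent output, and the final recommendation as separate entries (the `stage` field) is what makes it possible to reconstruct what the system said at each step.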

The Honest Value Proposition

Multi-agent architecture analysis works. For early-stage feasibility work, internal innovation reviews, or rapid client discovery, a system like this compresses a meaningful amount of structured expert thinking into a time frame that human teams cannot match. The Business Analyst agent does not get tired. The Risk Assessor does not forget to check compliance. The Cost Specialist does not assume the client has negotiated enterprise pricing.

The value is not in replacing the architect. It is in ensuring that the architect starts every engagement with a comprehensive first-pass analysis that covers all four domains, so they can spend their time on the decisions that genuinely require human judgement — organisational politics, existing technology debt, vendor relationships, team capability — rather than on producing the initial structured summary.

That is a defensible, commercially real use case. The production readiness work described above is the difference between demonstrating it and deploying it. For organisations willing to invest in the evaluation infrastructure and governance controls, the return on that investment is substantial.

See the Code

Source code, agent prompt templates, and deployment configuration are available on GitHub. Questions about production deployment or agentic AI architecture? Get in touch.