QRefAI
Contents
Custom AI Agents

Part 7 — Recommended stack

If I’m building a production vertical agent in Python today, what should I actually use?

5 min · Updated June 2026

This is a starting point you will adapt, not gospel. It is a coherent, genuinely cloud-neutral set that will run unchanged across AWS, GCP, Azure, or on-prem.

Q7.1 — What is the recommended stack?

The recommended production stack for custom AI agents: models, orchestration, memory, tools, observability, guardrails, and durable execution
ConcernDefault choiceWhen to deviate
OrchestrationLangGraph — stateful, durable, HITL-native; the most-deployed in regulated industriesPydantic AI if you value type-safety and a clean DX over orchestration breadth
Model accessLiteLLM in front, so providers are swappableDirect provider SDK only if you have committed to one cloud
ToolsMCP for shared/external systems, native functions for internal ones; gateway in front
Tool efficiencyProgrammatic tool calling / Code Mode once past ~20 tools
Short-term memoryLangGraph checkpointer (PostgresSaver)
Long-term memoryMem0 for personalisation, or Zep/Graphiti for temporal reasoningLetta if memory autonomy is the product
Vector storepgvector if you already run Postgres; else QdrantWeaviate for native hybrid; Milvus at billion-scale; LanceDB for multi-modal
RAGLlamaIndex (retrieval) + LangGraph (orchestration)Haystack when you need explicit, auditable pipelines
DurabilityTemporal for productionRestate for edge/serverless; DBOS for Postgres-only with no new infra
ObservabilityLangfuse or Arize Phoenix, instrumented via OpenLLMetryLaminar for deep agent-run debugging
EvaluationRagas (RAG) + DeepEval (CI gates) + a domain gold setPromptfoo for red-teaming
GuardrailsNeMo Guardrails + Llama Guard + Llama Prompt Guard + Guardrails AIAdd LLM Guard for heavy PII and secrets needs

Q7.2 — What is a sane rollout sequence?

Four-phase agent launch sequence: spine, memory and tools, hardening, and scale — with key milestones and decision gates at each phase

Weeks 0–4 — build the spine. Orchestrator, swappable models, a vector store, and observability from day one. Define context as a structured object assembled at the model-call boundary, never as free-form string concatenation. That single discipline prevents most context-rot pain later.

Weeks 4–10 — memory, tools, durability. Add the long-term memory layer, expose internal systems as MCP servers behind a gateway, adopt programmatic tool calling as the tool count grows, and put durable execution under anything long-running.

Weeks 10–20 — harden for the vertical. Layer the guardrails, build the audit logging to a real compliance standard, stand up the eval pipeline with SME-built gold sets, and bake HITL gates into the graph at every consequential action.

Post-launch — scale and govern. Tune prompt caching against real hit-rate logs, move to tiered model routing for cost, adopt signed agent-identity standards (A2A) when crossing organisational boundaries, and formalise governance under ISO 42001 and the NIST AI RMF.

Q7.3 — What four conditions should change my stack?

The four trigger conditions that should prompt a stack change: MCP CVE rate, multi-agent token cost, HITL rate, and context window saturation
  • MCP CVEs in your dependency tree outpacing your patch cadence → move to a gateway/portal model immediately.
  • Multi-agent token cost exceeding roughly 10× the single-agent baseline → collapse back to a supervisor with summary-returning sub-agents, or denormalise to a single agent.
  • Retrieval accuracy under roughly 85% → add reranking and contextual retrieval; consider GraphRAG if your data is relational.
  • HITL rate falling while customer escalations rise → your guardrails are catching the wrong errors; rebuild your eval set from real production traces.