Part 7 — Recommended stack
If I’m building a production vertical agent in Python today, what should I actually use?
5 min · Updated June 2026
This is a starting point you will adapt, not gospel. It is a coherent, genuinely cloud-neutral set that will run unchanged across AWS, GCP, Azure, or on-prem.
Q7.1 — What is the recommended stack?
| Concern | Default choice | When to deviate |
|---|---|---|
| Orchestration | LangGraph — stateful, durable, HITL-native; the most-deployed in regulated industries | Pydantic AI if you value type-safety and a clean DX over orchestration breadth |
| Model access | LiteLLM in front, so providers are swappable | Direct provider SDK only if you have committed to one cloud |
| Tools | MCP for shared/external systems, native functions for internal ones; gateway in front | — |
| Tool efficiency | Programmatic tool calling / Code Mode once past ~20 tools | — |
| Short-term memory | LangGraph checkpointer (PostgresSaver) | — |
| Long-term memory | Mem0 for personalisation, or Zep/Graphiti for temporal reasoning | Letta if memory autonomy is the product |
| Vector store | pgvector if you already run Postgres; else Qdrant | Weaviate for native hybrid; Milvus at billion-scale; LanceDB for multi-modal |
| RAG | LlamaIndex (retrieval) + LangGraph (orchestration) | Haystack when you need explicit, auditable pipelines |
| Durability | Temporal for production | Restate for edge/serverless; DBOS for Postgres-only with no new infra |
| Observability | Langfuse or Arize Phoenix, instrumented via OpenLLMetry | Laminar for deep agent-run debugging |
| Evaluation | Ragas (RAG) + DeepEval (CI gates) + a domain gold set | Promptfoo for red-teaming |
| Guardrails | NeMo Guardrails + Llama Guard + Llama Prompt Guard + Guardrails AI | Add LLM Guard for heavy PII and secrets needs |
Q7.2 — What is a sane rollout sequence?
Weeks 0–4 — build the spine. Orchestrator, swappable models, a vector store, and observability from day one. Define context as a structured object assembled at the model-call boundary, never as free-form string concatenation. That single discipline prevents most context-rot pain later.
Weeks 4–10 — memory, tools, durability. Add the long-term memory layer, expose internal systems as MCP servers behind a gateway, adopt programmatic tool calling as the tool count grows, and put durable execution under anything long-running.
Weeks 10–20 — harden for the vertical. Layer the guardrails, build the audit logging to a real compliance standard, stand up the eval pipeline with SME-built gold sets, and bake HITL gates into the graph at every consequential action.
Post-launch — scale and govern. Tune prompt caching against real hit-rate logs, move to tiered model routing for cost, adopt signed agent-identity standards (A2A) when crossing organisational boundaries, and formalise governance under ISO 42001 and the NIST AI RMF.
Q7.3 — What four conditions should change my stack?
- MCP CVEs in your dependency tree outpacing your patch cadence → move to a gateway/portal model immediately.
- Multi-agent token cost exceeding roughly 10× the single-agent baseline → collapse back to a supervisor with summary-returning sub-agents, or denormalise to a single agent.
- Retrieval accuracy under roughly 85% → add reranking and contextual retrieval; consider GraphRAG if your data is relational.
- HITL rate falling while customer escalations rise → your guardrails are catching the wrong errors; rebuild your eval set from real production traces.