Custom AI Agents

Part 7 — Recommended stack

If I’m building a production vertical agent in Python today, what should I actually use?

5 min · Updated June 2026

This is a starting point you will adapt, not gospel. It is a coherent, genuinely cloud-neutral set that will run unchanged across AWS, GCP, Azure, or on-prem.

Q7.1 — What is the recommended stack?

Concern	Default choice	When to deviate
Orchestration	LangGraph — stateful, durable, HITL-native; the most-deployed in regulated industries	Pydantic AI if you value type-safety and a clean DX over orchestration breadth
Model access	LiteLLM in front, so providers are swappable	Direct provider SDK only if you have committed to one cloud
Tools	MCP for shared/external systems, native functions for internal ones; gateway in front	—
Tool efficiency	Programmatic tool calling / Code Mode once past ~20 tools	—
Short-term memory	LangGraph checkpointer (PostgresSaver)	—
Long-term memory	Mem0 for personalisation, or Zep/Graphiti for temporal reasoning	Letta if memory autonomy is the product
Vector store	pgvector if you already run Postgres; else Qdrant	Weaviate for native hybrid; Milvus at billion-scale; LanceDB for multi-modal
RAG	LlamaIndex (retrieval) + LangGraph (orchestration)	Haystack when you need explicit, auditable pipelines
Durability	Temporal for production	Restate for edge/serverless; DBOS for Postgres-only with no new infra
Observability	Langfuse or Arize Phoenix, instrumented via OpenLLMetry	Laminar for deep agent-run debugging
Evaluation	Ragas (RAG) + DeepEval (CI gates) + a domain gold set	Promptfoo for red-teaming
Guardrails	NeMo Guardrails + Llama Guard + Llama Prompt Guard + Guardrails AI	Add LLM Guard for heavy PII and secrets needs

Q7.2 — What is a sane rollout sequence?

Weeks 0–4 — build the spine. Orchestrator, swappable models, a vector store, and observability from day one. Define context as a structured object assembled at the model-call boundary, never as free-form string concatenation. That single discipline prevents most context-rot pain later.

Weeks 4–10 — memory, tools, durability. Add the long-term memory layer, expose internal systems as MCP servers behind a gateway, adopt programmatic tool calling as the tool count grows, and put durable execution under anything long-running.

Weeks 10–20 — harden for the vertical. Layer the guardrails, build the audit logging to a real compliance standard, stand up the eval pipeline with SME-built gold sets, and bake HITL gates into the graph at every consequential action.

Post-launch — scale and govern. Tune prompt caching against real hit-rate logs, move to tiered model routing for cost, adopt signed agent-identity standards (A2A) when crossing organisational boundaries, and formalise governance under ISO 42001 and the NIST AI RMF.

Q7.3 — What four conditions should change my stack?

MCP CVEs in your dependency tree outpacing your patch cadence → move to a gateway/portal model immediately.
Multi-agent token cost exceeding roughly 10× the single-agent baseline → collapse back to a supervisor with summary-returning sub-agents, or denormalise to a single agent.
Retrieval accuracy under roughly 85% → add reranking and contextual retrieval; consider GraphRAG if your data is relational.
HITL rate falling while customer escalations rise → your guardrails are catching the wrong errors; rebuild your eval set from real production traces.