What I’m most curious about:
- Orchestrator choice and why: LangGraph, Temporal, Airflow, Prefect, custom queues.
- State and checkpointing: where do you persist steps, how do you replay, how do you handle schema changes.
- Concurrency control: parallel tool calls, backpressure, timeouts, idempotency for retries.
- Autoscaling and cost: policies that kept latency and spend sane, spot vs on-demand, GPU sharing.
- Memory and retrieval: vector DB vs KV store, eviction policies, preventing stale context.
- Observability: tracing, metrics, evals that actually predicted incidents.
- Safety and isolation: sandboxing tools, rate limits, abuse filters, PII handling.
- A war story: the incident that taught you a lesson and the fix.
Context (so it’s not a drive-by): small team, Python, k8s, MongoDB for state, Redis for queues, everything custom, experimenting with LangGraph and Temporal. Happy to share configs and trade notes in the comments.
Answer any subset. Even a quick sketch of your stack and one gotcha would help others reading this. Thanks!
My stack isn't super complex; in fact, keeping it simple seems to be the only way to get a more or less reliable agent right now: keep the graph small, the prompts concise, the nodes and tools atomic in function, etc.
* Orchestrator choice and why: LangGraph, because it seemed the most robust and well-established option from my research at the time (about 6 months ago). It has decent documentation and includes community-built graphs and nodes. People complain a lot about LangChain, but the general vibe around LangGraph is that it's a maturely designed framework.
* State and checkpointing: I'm using an in-memory checkpointer that saves after every state change (see the first sketch at the end of this reply). Why? Reports can just be re-run at negligible cost, and for chats my users' requirements don't call for persistent thread storage. Long-lived persistence is better handled through RAG entries.
* Concurrency control: I don't use parallel tool calling for most of my agents because it adds too much instability to graph execution. That's fine for chatbots and for my app's reporting system (which doesn't need many tools), but I can see it being an issue for more complex agents.
* Autoscaling and cost: I use hosted foundation models, not local ones, so GPU autoscaling doesn't really apply to me. I swap models per task and per customer subscription level (e.g., gpt-5-nano with low reasoning effort for trial users, gpt-5-mini for paying customers); that mapping, together with the sequential tool-call setting from the previous point, is sketched below.
* Memory and retrieval: Vector DB for RAG tooling, a normal DB for everything else. Sometimes I use the same Postgres database for both vector and relational data to simplify the architecture (sketched below). I load raw contextual data into prompts as a JSON dump. In my app's case, I keep a 30-day rolling window of store data, so no raw data lives longer than 30 days. What I keep permanently is distilled information, and I let the AI control its lifecycle (create, update, delete).
* Observability: The only thing I would use evals for is prompts, but I haven't found a good tool for that yet. I run sentiment analysis on chats the AI deems "interesting", just to see whether people are complaining about something.
* Safety and isolation: For reports, I filter out PII before the data reaches the AI (a rough scrubbing sketch is below). For chats, memory checkpointing makes threads ephemeral anyway, so I just add a rate limit plus a message length limit. The sentiment analysis never includes the original messages, only a thematic summary written by the AI.
* A war story: I spent weeks trying to fine-tune a prompt for the reporting agent, in which one node was tasked with A) analyzing multiple 30-day e-commerce reports, B) generating findings, C) comparing those findings to existing insights and mutating them, and D) writing short, punchy copy for new insights (title, description). I rewrote it maybe 100 times, and every time I ran it, it would screw up in a new way or in a way I thought I'd fixed five revisions earlier. Sometimes it worked perfectly; the next run it would screw up again, with the same data and temperature set to 0.
This, honestly, is the main problem with modern AI. My fix was to decompose the node into four separate ones that each handle a single task (roughly the shape of the last sketch below), and they still manage to screw up fairly often. It's much better, but not 100% reliable.
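Since I offered to share configs, here are a few rough sketches of the pieces above. They're simplified, and the specific names (state fields, node names, tables, tiers) are made up for illustration rather than lifted from my codebase.

First, the LangGraph wiring with the in-memory checkpointer. The ReportState fields and the analyze body are placeholders; the part that matters is MemorySaver plus a per-thread thread_id:

```python
# Minimal sketch: a one-node graph compiled with the in-memory checkpointer.
from typing import TypedDict

from langgraph.checkpoint.memory import MemorySaver
from langgraph.graph import StateGraph, START, END


class ReportState(TypedDict):
    store_data: dict      # raw context loaded into the prompt
    findings: list[str]   # what the node produces


def analyze(state: ReportState) -> dict:
    # Real node: prompt a model with state["store_data"] and parse its findings.
    return {"findings": ["placeholder finding"]}


builder = StateGraph(ReportState)
builder.add_node("analyze", analyze)
builder.add_edge(START, "analyze")
builder.add_edge("analyze", END)

# MemorySaver checkpoints every state change, but only in process memory:
# threads are ephemeral, and "recovery" for a failed report is just re-running it.
graph = builder.compile(checkpointer=MemorySaver())

result = graph.invoke(
    {"store_data": {"orders": 42}, "findings": []},
    config={"configurable": {"thread_id": "report-123"}},
)
```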
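Model selection per tier, plus the setting that keeps tool calls sequential, shown here with the plain OpenAI SDK for brevity. The tier names and reasoning-effort values are illustrative:

```python
# Sketch: pick the model by subscription tier and force sequential tool calls.
from openai import OpenAI

# Tier names and effort values are illustrative.
MODEL_BY_TIER = {
    "trial": {"model": "gpt-5-nano", "reasoning_effort": "low"},
    "paid": {"model": "gpt-5-mini", "reasoning_effort": "medium"},
}


def run_agent_step(tier: str, messages: list[dict], tools: list[dict]):
    cfg = MODEL_BY_TIER[tier]
    client = OpenAI()  # needs OPENAI_API_KEY in the environment
    return client.chat.completions.create(
        model=cfg["model"],
        reasoning_effort=cfg["reasoning_effort"],
        messages=messages,
        tools=tools,
        # One tool call per turn: slower, but far fewer weird partial-failure states.
        parallel_tool_calls=False,
    )
```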
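The single-Postgres setup: a pgvector table for RAG chunks, a plain table for raw store data, and a scheduled prune that enforces the 30-day window. Table and column names (and the embedding size) are illustrative; the distilled, permanent context would live elsewhere and isn't shown:

```python
# Sketch: one Postgres for both vector and relational data, plus the 30-day prune.
import psycopg

SCHEMA = [
    "CREATE EXTENSION IF NOT EXISTS vector",
    """CREATE TABLE IF NOT EXISTS rag_chunks (
           id bigserial PRIMARY KEY,
           content text NOT NULL,
           embedding vector(1536) NOT NULL
       )""",
    """CREATE TABLE IF NOT EXISTS store_events (
           id bigserial PRIMARY KEY,
           store_id text NOT NULL,
           payload jsonb NOT NULL,
           created_at timestamptz NOT NULL DEFAULT now()
       )""",
]

# Raw events older than 30 days are simply deleted; anything worth keeping
# longer has already been distilled into permanent context elsewhere.
PRUNE = "DELETE FROM store_events WHERE created_at < now() - interval '30 days'"

with psycopg.connect("postgresql://localhost/appdb") as conn:
    for stmt in SCHEMA:
        conn.execute(stmt)
    conn.execute(PRUNE)  # run this on a schedule (cron / k8s CronJob)
```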
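A minimal sketch of the PII-scrubbing idea for report data: redact before anything reaches a prompt. The two patterns are only illustrative and nowhere near complete coverage:

```python
# Sketch: deterministic PII scrubbing before report data reaches a prompt.
import re

# Illustrative patterns only; real coverage needs more than a couple of regexes.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}


def scrub(text: str) -> str:
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"<{label}-redacted>", text)
    return text


print(scrub("Order 1042 placed by jane@example.com, callback +1 (415) 555-0134"))
```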
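And the war-story fix: the A/B/C/D node split into four single-purpose nodes chained in order. Node names are made up for the example and the bodies are stubs; each one runs its own narrowly scoped prompt:

```python
# Sketch: the monolithic A+B+C+D node decomposed into four single-purpose nodes.
from typing import TypedDict

from langgraph.graph import StateGraph, START, END


class InsightState(TypedDict):
    reports: list[dict]   # raw 30-day e-commerce reports
    findings: list[str]   # output of the analysis steps
    insights: list[dict]  # existing insights, mutated by reconciliation
    copy: list[dict]      # title/description pairs for new insights


# Each stub is a separate, narrowly scoped prompt in the real graph.
def analyze_reports(state: InsightState) -> dict:
    return {}


def generate_findings(state: InsightState) -> dict:
    return {}


def reconcile_insights(state: InsightState) -> dict:
    return {}


def write_copy(state: InsightState) -> dict:
    return {}


builder = StateGraph(InsightState)
builder.add_node("analyze_reports", analyze_reports)
builder.add_node("generate_findings", generate_findings)
builder.add_node("reconcile_insights", reconcile_insights)
builder.add_node("write_copy", write_copy)

builder.add_edge(START, "analyze_reports")
builder.add_edge("analyze_reports", "generate_findings")
builder.add_edge("generate_findings", "reconcile_insights")
builder.add_edge("reconcile_insights", "write_copy")
builder.add_edge("write_copy", END)

graph = builder.compile()
```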