# Agent Orchestration and System Design

A practical, research-backed field guide for designing agent systems: workflows, multi-agent pipelines, memory, evaluation, and infrastructure.

Use it for:
- choosing between single-agent, workflow, and multi-agent designs
- orchestration patterns (sequential, parallel, evaluator-optimizer)
- agent memory and context management
- error recovery and production reliability
- evaluation and harness design
- tooling and automation loops

Prompt-level decisions (system prompt writing, CoT strategy, instruction following) live in `Research-prompt.md`.

## Fast Takeaways

1. Start with the simplest scaffold that can pass evals. Default to single-agent or workflow. Add agents only when evals show clear gains.
2. Separate generator and evaluator roles. Self-evaluation is too lenient; an external evaluator is much stronger.
3. Use different models or prompts only when they contribute distinct evidence or skills. Homogeneous agent swarms do not scale.
4. Model your agent loop as an explicit state machine. Named states with typed transitions beat open-ended ReAct loops beyond 5–6 steps.
5. Error states are first-class citizens. After two consecutive failures on the same action, return to planning — not more retries.
6. Memory is a system: working buffer + episodic store + semantic rules. A raw vector store is not a memory architecture.
7. Context window management: observation masking outperforms LLM summarization on cost and quality. Keep the reasoning chain; replace old tool outputs.
8. Grade outcomes, artifacts, and grounded evidence — not exact tool-call traces.
9. HITL approval is only useful if presented as a plain-language summary, not raw JSON.
10. Instrument with OpenTelemetry from day one. Correlate traces across agent boundaries with parent-child span IDs.

## What To Copy Into Systems

### Orchestration
- Keep the default path single-agent or workflow-based.
- Add planners, reviewers, or specialist agents only when evals show clear gains.
- Prefer bounded loops: one plan phase, one act phase, one verifier, one retry budget.
- Use different models or prompts only when they contribute distinct evidence or skills.
- Treat multi-agent diversity as a tool, not a religion.

### System Design
- One agent, one responsibility. Separate generator from evaluator from synthesizer.
- Small specialists for mechanical subproblems (grep, read, classify, run) are a real design pattern.
- Reserve expensive frontier models for hard reasoning and synthesis.
- Route by capability and role, not by "use more agents for quality."
- A compound system is not a complex system. Match complexity to business value.

### Tooling and Action Surface
- Favor tools that return verifiable feedback: tests, compiler errors, search results, fetched pages, graders.
- Apply poka-yoke to every tool: use absolute filepaths, validate inputs before calling external services, and return structured error objects, not raw exceptions.
- Keep traces and artifacts.
- If the task is sensitive to staleness, exactness, or sourcing, lookup beats memory.

### Automation and Safety
- Fix a metric before running an autonomous loop.
- Keep the mutable surface small.
- Auto-commit only after checks pass.
- Separate "experiment failed" from "checks failed" from "metric regressed."
- Prefer narrow optimization targets over grand autonomous platform behavior.

### Evaluation
- Build evals from real failures and real manual checks.
- Balance both sides of decision boundaries: "should do X" and "should not do X."
- Isolate trials — no shared repo state, hidden cache, or leaked history.
- Use deterministic graders where possible.
- Use LLM graders with clear rubrics and human calibration when needed.
- Read transcripts constantly. If metrics and transcripts disagree, suspect the harness or grader.

---

## Core Sources: System Design

### 1. The Shift from Models to Compound AI Systems (BAIR, 2024)
Source: https://bair.berkeley.edu/blog/2024/02/18/compound-ai-systems/

Why it matters:
- Strong AI systems increasingly come from multiple interacting components, not just bigger base models. System design can improve quality faster than scaling alone.

Key takeaways:
- Access to current data, control, trust, and cost are often easier to solve at the system level than at the model level.
- Optimizing a compound system is a distinct engineering problem from optimizing a model.

Implication:
- Build around tools, retrievers, graders, and routers when they solve a real product problem.
- Do not mistake "compound system" for "maximally complex system."

### 2. Building Effective AI Agents (Anthropic, 2026)
Source: https://resources.anthropic.com/building-effective-ai-agents

Why it matters:
- High-quality practical guidance from a team operating real agent systems at scale.

The most useful framing:
- Choose between single-agent, workflow, and multi-agent designs intentionally.
- Use a small set of reusable patterns:
  - **Prompt chaining** — sequential, each output feeds the next
  - **Routing** — classify input, dispatch to specialist
  - **Parallelization** — sectioning for independent subtasks; voting for confidence
  - **Orchestrator-workers** — dynamic delegation for unpredictable subtasks
  - **Evaluator-optimizer** — generate-then-critique loop for refineable outputs
- Match system complexity to business value.

Implication:
- Default to simple workflows first.
- Reach for evaluator-optimizer when correctness matters and the task benefits from revision.
- Reach for multi-agent only after single-agent/workflow baselines are exhausted.
- The most successful teams use "simple, composable patterns rather than complex frameworks."

Avoid:
- Using agent frameworks without understanding what they do under the hood — they often create abstraction layers that obscure prompts and responses.
- Adding agents before measuring whether simpler approaches fail.

### 3. Harness Design for Long-Running Application Development (Anthropic, 2026)
Source: https://www.anthropic.com/engineering/harness-design-long-running-apps

Why it matters:
- Strong, current practical guidance on long-running coding harnesses from a team actively tuning them against real app-building tasks.

Key takeaways:
- Separate generator and evaluator roles. Self-evaluation is too lenient; a tuned external evaluator is much more useful.
- Make "done" explicit before coding. Use a per-sprint contract negotiated between builder and evaluator.
- Keep planner output high-level and product-facing. Over-specifying low-level details too early cascades bad assumptions.
- Use evaluators that touch the environment directly. Playwright-driven QA against real UI/API behavior is much stronger than static inspection.
- Use hard thresholds for acceptance, not vague approval. If a criterion misses, fail the chunk and return actionable feedback.
- Preserve handoff artifacts and structured files between agents. File-based communication reduces drift across long runs.
- Context resets vs. compaction is a model-specific engineering choice. Resets help when the model shows context anxiety; stronger models may make resets unnecessary.
- Every harness component encodes an assumption about model weakness. Re-run ablations and remove scaffold that is no longer load-bearing after model upgrades.
- Evaluators are worth the cost near the model's capability boundary. When the model can already do the task reliably solo, evaluator passes can become mostly overhead.
- Read logs and calibrate the evaluator from real disagreements. Prompt tuning should come from concrete misses, not abstract "better QA" wishes.

Implication:
- Keep coder and reviewer/verifier separate when acceptance quality matters.
- Add an explicit contract or acceptance plan before implementation when the spec is high-level.
- Prefer grounded evaluator tools over reviewer vibes.
- Keep handoff state compact and structured enough to survive resets when resets are needed.
- Revisit harness complexity whenever the base model changes; stale scaffold is real technical debt.

### 4. Understanding Agent Scaling via Diversity (2026)
Source: arXiv:2602.03794

Why it matters:
- More homogeneous agents do not scale indefinitely; diversity matters more than count.

Key takeaway:
- Two meaningfully different agents can outperform a swarm of same-ish agents.

Implication:
- Diversity should come from role, model, tool access, or evidence channel.
- Do not duplicate the same model/prompt ten times and call it orchestration.

### 5. SOLVE-Med / MATA / Small-Model Orchestration (2025–2026)
Sources:
- SOLVE-Med: arXiv:2511.03542
- MATA: arXiv:2602.09642

Why they matter:
- Small specialized models, when orchestrated well, can outperform or match much larger standalone systems.

Key takeaway:
- Cheap specialists for mechanical subproblems are a real design pattern, not a hack.

Implication:
- Route grep/read/run/simple classification to cheaper lanes.
- Reserve expensive models for hard reasoning or integration steps.

### 6. Difficulty-Aware Agentic Orchestration (DAAO, 2025)
Source: arXiv:2509.11079

Why it matters:
- Not all subtasks need the same model size. A variational autoencoder estimating query difficulty + a cost/performance-aware router gives near-frontier quality at significantly lower cost.

Key takeaway:
- Difficulty-based routing is a high-leverage optimization most systems skip.

Implication:
- Classify task difficulty before dispatching to a model. Easy classification → cheap model. Hard synthesis → frontier model.
- This is the principled version of the "route to cheap specialists" heuristic.

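
The easy-vs-hard dispatch can be sketched in a few lines. This is a toy stand-in, not DAAO itself: the model names are placeholders, and the difficulty estimator is a keyword/length heuristic where the paper uses a learned (VAE-based) one.

```python
# Minimal difficulty-aware router sketch. Model names and the heuristic
# classifier are illustrative stand-ins, not the DAAO implementation.
from dataclasses import dataclass

@dataclass
class Route:
    model: str          # which model tier handles the task
    max_tokens: int     # budget attached to the tier

CHEAP = Route(model="small-model", max_tokens=1024)
FRONTIER = Route(model="frontier-model", max_tokens=8192)

def estimate_difficulty(task: str) -> float:
    """Toy difficulty proxy: score on length and trigger keywords.
    DAAO replaces this with a learned estimator."""
    score = min(len(task) / 500, 1.0)
    if any(k in task.lower() for k in ("synthesize", "design", "prove", "refactor")):
        score = max(score, 0.8)
    return score

def route(task: str, threshold: float = 0.5) -> Route:
    # Easy classification goes to the cheap lane; hard synthesis to frontier.
    return FRONTIER if estimate_difficulty(task) >= threshold else CHEAP
```

Usage: `route("Label this ticket as bug or feature")` lands on the cheap tier, while `route("Synthesize a migration design across three services")` escalates to the frontier tier.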
### 7. Multi-Agent Orchestration for Deterministic Decision Support (2025)
Source: arXiv:2511.15755

Why it matters:
- 348 controlled trials: multi-agent orchestration achieved a 100% actionable recommendation rate vs. 1.7% for single-agent, with 80x higher specificity and 140x higher correctness at similar latency.
- The reframing: multi-agent orchestration is a production-readiness requirement, not a performance optimization. Consistent, deterministic quality is what enables SLA commitments.

Key takeaway:
- Single agents produce high-variance outputs. Multi-agent systems with clear role separation produce stable ones.

Implication:
- When variance is unacceptable (financial decisions, infrastructure changes, compliance tasks), multi-agent is not optional — it's the architecture that enables quality guarantees.

### 8. Emergent Coordination in Multi-Agent Systems (2025)
Source: arXiv:2510.05174

Why it matters:
- Coordination is better when agents share objectives and understand complementary roles.

Key takeaway:
- Role awareness is useful; vague social-role prompts are not enough.

Implication:
- When using multiple agents, explicitly describe what each one contributes and how outputs combine.
- Name the shared objective in the orchestrator prompt; name each agent's responsibility in its own prompt.

---

## Core Sources: Memory

### 9. A-MEM: Agentic Memory for LLM Agents (NeurIPS 2025)
Source: https://arxiv.org/abs/2502.12110

Why it matters:
- A Zettelkasten-style memory network — structured notes with attributes, keywords, and tags — doubled complex reasoning performance vs. flat vector store baselines at lower token cost.

Key takeaways:
- Every memory node gets a structured note with contextual description, keywords, and tags at write time.
- An autonomous link-generation mechanism identifies connections via cosine similarity + LLM analysis.
- When a new memory is added, existing related memories are also updated — the memory network evolves.

Implication:
- At minimum, build two layers: a short-term in-context working buffer and a persistent episodic store with structured metadata per entry.
- Enrich every stored memory with metadata at write time (task context, success/failure outcome, timestamps, tags) — retrieval quality depends entirely on index richness.

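
A minimal sketch of that two-layer minimum: a bounded working buffer plus an episodic store enriched at write time. The field names (`task`, `outcome`, `tags`) are illustrative assumptions, not the A-MEM schema, and retrieval here is tag-only where a real system would add embedding similarity.

```python
# Two-layer memory sketch: in-context working buffer + persistent
# episodic store whose entries carry write-time metadata.
import time
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Episode:
    content: str
    task: str                 # task context captured at write time
    outcome: str              # "success" or "failure"
    tags: list[str] = field(default_factory=list)
    timestamp: float = field(default_factory=time.time)

class Memory:
    def __init__(self, working_size: int = 8):
        self.working: deque[str] = deque(maxlen=working_size)  # short-term buffer
        self.episodic: list[Episode] = []                      # persistent store

    def remember(self, content: str, task: str, outcome: str, tags: list[str]) -> None:
        self.working.append(content)
        # Enrich at write time: retrieval quality depends on this metadata.
        self.episodic.append(Episode(content, task, outcome, tags))

    def recall(self, tag: str) -> list[Episode]:
        # Metadata-first retrieval; combine with similarity search in practice.
        return [e for e in self.episodic if tag in e.tags]

mem = Memory(working_size=2)
mem.remember("pytest needs -x for fail-fast", task="run tests", outcome="success", tags=["pytest"])
mem.remember("rg is faster than grep -r", task="search code", outcome="success", tags=["search"])
mem.remember("don't cat binary files", task="read file", outcome="failure", tags=["read"])
```

Note the working buffer evicts oldest-first while the episodic store keeps everything; a real system would add the consolidation and eviction policies the "Avoid" list below warns about.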
Avoid:
- Flat vector stores with no structural metadata — retrieval becomes a bag-of-embeddings lottery.
- Unbounded episodic stores without consolidation or eviction policies.

### 10. Episodic Memory is the Missing Piece for Long-Term LLM Agents (2025)
Source: https://arxiv.org/abs/2502.06975

Why it matters:
- Of the four memory tiers (working, episodic, semantic, procedural), episodic memory is the most underinvested and the key enabler for genuine long-term agent improvement.

Key takeaway:
- Time-stamped traces of specific past task runs enable single-shot learning from concrete prior instances. Without episodic memory, agents keep relearning the same lessons.

Implication:
- Implement an episodic-to-semantic consolidation job: after task completion, abstract successful patterns from the episode trace into reusable rules in semantic memory.
- For multi-agent systems: distinguish per-agent private episodic memory from shared semantic memory. Sharing raw episodes risks leakage; sharing distilled rules is safer.

---

## Core Sources: Context Management

### 11. Cutting Through the Noise: Efficient Context Management (JetBrains Research, Dec 2025)
Source: https://blog.jetbrains.com/research/2025/12/efficient-context-management/

Why it matters:
- Both common strategies (observation masking and LLM summarization) cut costs >50% vs. unmanaged context. But observation masking matched or outperformed summarization in 4 of 5 configurations, at lower complexity.

Key takeaways:
- **Observation masking**: replaces older tool outputs/file contents with a placeholder, keeps the reasoning chain intact. Fast, cheap, no extra LLM calls.
- **LLM summarization**: compresses old turns. Slower, more expensive, and paradoxically caused agents to run ~15% longer trajectories (summaries gave false confidence to keep going).
- With Qwen3-Coder 480B, masking achieved 2.6% *higher* solve rates while being 52% cheaper.
- A 2026 industry report attributed ~65% of enterprise AI failures to "context drift" — accumulated noise causing agents to lose track of their goal.

Implication:
- Default to **observation masking** as the primary compaction strategy. Keep the reasoning chain; replace tool outputs after a rolling window.
- Add LLM summarization as a fallback only when a single tool response is too large to fit once.
- Set a hard token budget before each agent turn. Trigger compaction when projected input exceeds 70–80% of the context limit — before the LLM call, not after.
- Always preserve: original task specification, the most recent N turns verbatim, current goal state.

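
A masking pass along these lines is only a few lines of code. The role/content message shape below is a generic assumption, not any specific provider's API, and the placeholder text is arbitrary.

```python
# Observation-masking sketch: keep the reasoning chain, replace tool
# outputs outside a rolling window with a placeholder.
PLACEHOLDER = "[tool output elided - re-run the tool if needed]"

def mask_observations(messages: list[dict], keep_last: int = 3) -> list[dict]:
    """Return a copy where all but the last `keep_last` tool results are
    replaced. Assistant reasoning and user turns stay verbatim."""
    tool_idxs = [i for i, m in enumerate(messages) if m["role"] == "tool"]
    protect = set(tool_idxs[-keep_last:])
    out = []
    for i, m in enumerate(messages):
        if m["role"] == "tool" and i not in protect:
            out.append({**m, "content": PLACEHOLDER})
        else:
            out.append(m)
    return out

history = [
    {"role": "user", "content": "Fix the failing test"},
    {"role": "assistant", "content": "I'll read the test file first."},
    {"role": "tool", "content": "def test_add(): ... (500 lines)"},
    {"role": "tool", "content": "FAILED tests/test_add.py::test_add"},
    {"role": "tool", "content": "diff applied cleanly"},
    {"role": "tool", "content": "1 passed in 0.02s"},
]
masked = mask_observations(history, keep_last=3)
```

In a real loop this would run before each LLM call, triggered when the projected input crosses the 70–80% budget threshold described above.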
Avoid:
- LLM summarization as the primary strategy — slower, more expensive, longer trajectories.
- Letting context grow unchecked — quality degrades well before the hard limit due to lost-in-the-middle effects.
- Resetting context entirely for long-running tasks — you lose accumulated plan state.

---

## Core Sources: State Machines and Control Flow

### 12. StateFlow: Enhancing LLM Task-Solving through State-Driven Workflows (2024)
Source: https://arxiv.org/abs/2403.11322

Why it matters:
- Modeling a task as a finite state machine (FSM) with six components — States, Initial state, Final states, Output functions, Transitions, Context history — yielded 63.73% success on SQL tasks vs. 40.3% for ReAct, at 5.8x lower cost.

Key takeaways:
- Removing the explicit Error state caused a 5% success rate decline — error handling as a named state is critical.
- A specialist FSM variant (SF_Agent) with separate LLMs per state further reduced token usage.

Implication:
- Model your agent loop as an explicit FSM. Minimum viable states: `PLANNING`, `EXECUTING`, `OBSERVING`, `ERROR_RECOVERY`, `DONE`.
- Each state should have: a single well-defined LLM prompt, allowed tools, and explicit transition conditions.
- Add an `ERROR` state as a first-class citizen with its own prompt and recovery transitions.
- Use a transition counter per state (max N transitions before forcing fallback or human escalation) to prevent runaway loops.
- For complex multi-agent systems, define the FSM in a declarative config (YAML/JSON) rather than code — makes control flow auditable.

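
The minimum viable FSM can be sketched as deterministic routing in code, with stub step functions standing in for per-state LLM calls and a transition cap per state. State names follow the bullets above; everything else is an illustrative assumption.

```python
# Minimal agent FSM sketch with an explicit ERROR_RECOVERY state and a
# per-state visit budget that forces escalation instead of runaway loops.
from enum import Enum, auto

class State(Enum):
    PLANNING = auto()
    EXECUTING = auto()
    OBSERVING = auto()
    ERROR_RECOVERY = auto()
    DONE = auto()

MAX_VISITS = 5  # transition counter per state

def run(steps) -> list:
    """`steps` maps a state to a function returning the next state.
    Routing stays deterministic in code; the LLM only acts inside a state."""
    state, visits, trace = State.PLANNING, {}, []
    while state is not State.DONE:
        visits[state] = visits.get(state, 0) + 1
        if visits[state] > MAX_VISITS:
            raise RuntimeError(f"{state.name} exceeded transition budget")
        trace.append(state)
        state = steps[state](state)
    trace.append(state)
    return trace

# Stub transitions: the first observation finds an error, recovery routes
# back to planning, and the second pass succeeds.
attempts = {"n": 0}
def observe(_):
    attempts["n"] += 1
    return State.ERROR_RECOVERY if attempts["n"] == 1 else State.DONE

steps = {
    State.PLANNING: lambda s: State.EXECUTING,
    State.EXECUTING: lambda s: State.OBSERVING,
    State.OBSERVING: observe,
    State.ERROR_RECOVERY: lambda s: State.PLANNING,
}
trace = run(steps)
```

The trace visits `ERROR_RECOVERY` exactly once and ends in `DONE`; moving the `steps` table into YAML/JSON gives the auditable declarative form the last bullet recommends.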
Avoid:
- Pure ReAct loops for tasks requiring more than 5–6 steps — they accumulate drift and have no recovery path when stuck.
- Embedding transition logic in the LLM prompt ("decide what to do next") — the LLM is unreliable as a state router. Keep routing deterministic in code.

---

## Core Sources: Parallelization

### 13. Parallelization and Scatter-Gather Patterns (AWS Prescriptive Guidance, 2025)
Source: https://docs.aws.amazon.com/prescriptive-guidance/latest/agentic-ai-patterns/parallelization-and-scatter-gather-patterns.html

Why it matters:
- Structured scatter-gather (coordinator dispatches N independent subtasks, aggregator synthesizes) is the most battle-tested pattern for parallelizing LLM work.

Key takeaways:
- The aggregator is the critical bottleneck. It must handle partial failures gracefully.
- Use correlation IDs to match results to requests.
- Allow downstream tasks to start early on streaming outputs from upstream tasks where dependencies permit.
- Keep fan-out degree below ~20 parallel agents — coordination overhead grows non-linearly.

Implication:
- Structure fan-out tasks with explicit contracts: each subtask specifies inputs it consumes and the exact output schema it must produce.
- Design the aggregator as a separate, dedicated role with a prompt focused purely on synthesis and conflict resolution.
- Use async fan-out with per-task timeouts and a minimum quorum: "proceed to aggregation once 80% of tasks complete or 30 seconds elapse."
- Route cheap classification/filtering steps to small models; reserve large models for synthesis.

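
The quorum rule ("proceed once 80% of tasks complete or the deadline passes") can be sketched with `asyncio`. The workers are stubs; in a real system each would be an LLM subtask returning its correlation ID alongside its result.

```python
# Async fan-out sketch with a shared deadline and a quorum gate: aggregate
# partial results instead of blocking on stragglers.
import asyncio

async def worker(task_id: int, delay: float) -> dict:
    await asyncio.sleep(delay)  # stand-in for an LLM subtask
    return {"task_id": task_id, "result": f"part-{task_id}"}  # correlation id included

async def scatter_gather(delays, quorum: float = 0.8, deadline: float = 0.5):
    tasks = [asyncio.create_task(worker(i, d)) for i, d in enumerate(delays)]
    done, pending = await asyncio.wait(tasks, timeout=deadline)
    if len(done) < quorum * len(tasks):
        raise RuntimeError("quorum not reached")  # escalate, don't aggregate garbage
    for t in pending:
        t.cancel()  # drop stragglers; the aggregator must tolerate gaps
    results = [t.result() for t in done]
    return sorted(results, key=lambda r: r["task_id"])

# One slow straggler (2s) is dropped; 4 of 5 (80%) meet the quorum.
results = asyncio.run(scatter_gather([0.01, 0.01, 0.01, 0.01, 2.0]))
```

The aggregator here receives only clean structured outputs, matching the "Avoid" guidance below against context-heavy aggregators.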
Avoid:
- Parallelizing tasks with implicit ordering dependencies.
- Using the same large model for every subtask — expensive and unnecessary for simple tasks.
- Stateful or context-heavy aggregators — they should receive clean, structured outputs from workers, not full conversation transcripts.

### 14. Orla: A Library for Serving LLM-Based Multi-Agent Systems (2026)
Source: https://arxiv.org/abs/2603.13605

Why it matters:
- Stage-level model routing (small model for classification, large model for synthesis) cut wall-clock time by 38% and mean completion time by 60%, at 35% lower cost vs. single-model baselines on SWE-bench Lite.

Key takeaway:
- Model routing at the workflow stage level is a high-leverage optimization that most systems skip.

Implication:
- Assign model tiers to workflow stages at design time, not at runtime by the LLM.
- Workflow-level KV cache management (preserve cache across stages sharing context prefixes) delivers measurable latency gains.

---

## Core Sources: Error Recovery

### 15. Retries, Fallbacks, and Circuit Breakers in LLM Apps (Portkey / Maxim, 2025)
Sources:
- https://portkey.ai/blog/retries-fallbacks-and-circuit-breakers-in-llm-apps/
- https://www.getmaxim.ai/articles/retries-fallbacks-and-circuit-breakers-in-llm-apps-a-production-guide/

Why it matters:
- Three complementary patterns form the production resilience stack. Without all three, you get retry storms or cascading provider failures.

Key takeaways:
- **Retries**: for transient errors (network, rate limits). Exponential backoff + jitter. Max 3 attempts. Anti-pattern: retrying persistent failures.
- **Fallbacks**: for provider-level failures. Switch to an alternate model/provider. Anti-pattern: reactive fallback waits for timeout; shared-infrastructure fallbacks fail identically.
- **Circuit breakers**: for systematic degradation. Monitor failure rate over a rolling window; remove the endpoint from routing when it exceeds a threshold. Proactive, not reactive.
- For agent tool errors: feed the formatted error back to the LLM as a structured observation — not a crash. Let the agent decide to retry with modified parameters, try an alternative tool, or revise its plan.
- Define per-task max-retry budgets: an agent that retries the same tool call 10 times in one task is stuck, not recovering.

Implication:
- Implement retries with exponential backoff + jitter at the base: `base_delay * (2^attempt) + random(0, base_delay)`. Cap at 3 attempts, 30 seconds max total.
- Implement fallbacks across at least two LLM providers for any production agent.
- Implement circuit breakers at the LLM client level: open after 5 failures in 60 seconds, cooldown 30 seconds.
- For agent tool errors, return structured error observations: `{tool, error_type, message, suggested_action}`.
- After 2 consecutive tool failures on the same action, force a planning reset (back to `PLANNING` in the FSM).

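
A sketch of the backoff formula plus the structured error observation. Delays are accumulated rather than slept so the example runs instantly; `flaky` is a hypothetical stand-in for a real tool call.

```python
# Retry sketch: base_delay * 2**attempt + random jitter, capped at 3
# attempts and a total budget, returning a structured observation on
# exhaustion instead of raising.
import random

def backoff_delay(attempt: int, base_delay: float = 0.5) -> float:
    return base_delay * (2 ** attempt) + random.uniform(0, base_delay)

def call_with_retries(tool, max_attempts: int = 3, base_delay: float = 0.5):
    total_delay = 0.0
    for attempt in range(max_attempts):
        try:
            return {"ok": True, "value": tool()}
        except Exception as exc:
            total_delay += backoff_delay(attempt, base_delay)
            # A real loop would time.sleep() here, capped at 30s total.
            if total_delay > 30 or attempt == max_attempts - 1:
                # Structured observation for the agent, not a crash.
                return {"ok": False, "error": {
                    "tool": getattr(tool, "__name__", "tool"),
                    "error_type": type(exc).__name__,
                    "message": str(exc),
                    "suggested_action": "revise plan or try an alternative tool",
                }}

calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("transient network error")
    return "result"

obs = call_with_retries(flaky)
```

The `{tool, error_type, message, suggested_action}` shape matches the observation format above, so the failure case feeds straight back into the agent loop (and, after two consecutive misses, into the planning reset).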
Avoid:
- Catching all exceptions and silently continuing — the agent will proceed with incomplete state.
- Retrying non-idempotent mutations without deduplication keys.
- Identical retry and fallback strategies for rate-limit errors vs. model quality errors — these require different handling.

---

## Core Sources: Agent Communication Protocols

### 16. Survey of Agent Interoperability Protocols: MCP, ACP, A2A, ANP (2025)
Source: https://arxiv.org/abs/2505.02279

Why it matters:
- Four protocols now occupy distinct architectural layers. Choosing the wrong one for the wrong layer creates maintenance debt.

Key findings:

| Protocol | Layer | Best for |
|---|---|---|
| MCP (Anthropic) | LLM ↔ Tools | Tool/resource injection into a single LLM |
| ACP (IBM/BeeAI) | Agent ↔ Agent | Model-agnostic, polyglot agent ecosystems |
| A2A (Google) | Agent ↔ Agent (enterprise) | Trusted enterprise inter-agent task delegation |
| ANP | Agent ↔ Internet | Open-internet, trustless agent discovery |

Recommended adoption: Start with MCP for tool access. Layer ACP for richer agent-to-agent messaging. Implement A2A within organizational boundaries. Extend to ANP only for internet-scale interoperability.

Implication:
- Use MCP today for everything that is "give an LLM access to a tool or data source." It is the most mature option, with the widest tooling support.
- For agent-to-agent calls within your own system, a well-structured JSON message over HTTP with a correlation ID and defined output schema is sufficient and more debuggable than adopting a new protocol.
- Define a standard task envelope for all handoffs: `{task_id, parent_task_id, agent_role, input_schema, output_schema, deadline, status, error}`.
- Store task state externally (Redis, Postgres), not inside agent memory — so any agent can resume a task after failure.

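
The task envelope can be pinned down as a `TypedDict` so handoffs are checkable. The field names follow the bullet above; the status values and schema payloads are illustrative assumptions.

```python
# Standard task envelope for agent-to-agent handoffs. Children keep a
# pointer to the parent so traces and externally stored task state
# (e.g. Redis/Postgres) can be correlated.
from typing import Optional, TypedDict

class TaskEnvelope(TypedDict):
    task_id: str
    parent_task_id: Optional[str]
    agent_role: str
    input_schema: dict
    output_schema: dict
    deadline: str          # ISO-8601 timestamp
    status: str            # e.g. "pending" | "running" | "done" | "error"
    error: Optional[dict]

def make_subtask(parent: TaskEnvelope, role: str, task_id: str) -> TaskEnvelope:
    return TaskEnvelope(
        task_id=task_id,
        parent_task_id=parent["task_id"],  # correlation back to the parent
        agent_role=role,
        input_schema={"type": "object"},
        output_schema={"type": "object"},
        deadline=parent["deadline"],       # child inherits the parent deadline
        status="pending",
        error=None,
    )

root = TaskEnvelope(task_id="t-1", parent_task_id=None, agent_role="orchestrator",
                    input_schema={}, output_schema={}, deadline="2025-01-01T00:00:00Z",
                    status="running", error=None)
child = make_subtask(root, role="researcher", task_id="t-1.1")
```

Serialized as JSON over HTTP, this is the "sufficient and more debuggable" in-house alternative to adopting a full inter-agent protocol.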
Avoid:
- Passing full conversation transcripts between agents — pass structured outputs only.
- Deep delegation chains (A → B → C → D) without a policy layer enforcing permissions at each hop.
- Inventing bespoke message formats per integration — creates an N×M maintenance problem.

---

## Core Sources: LangGraph and Framework Patterns

### 17. LangGraph Orchestration Framework (2024–2025)
Source: https://www.langchain.com/langgraph

Why it matters:
- LangGraph's six primitives — Nodes, Edges, State, Checkpointing, Interrupts, Concurrency — remove real infrastructure boilerplate. But the framework also adds overhead when the task doesn't need these features.

Key takeaways:
- **Checkpointing** is the highest-ROI feature. Durable mid-run state means agents can recover from crashes without replaying from scratch.
- **Interrupts** are the cleanest available HITL pausing implementation — a first-class primitive.
- **Typed State** enforces a shared schema across all nodes, preventing the "agent passed the wrong keys" bug class.
- For simple linear workflows, LangGraph adds boilerplate with no functional gain — a plain Python function chain is faster to write and easier to debug.
- Research shows >75% of multi-agent systems become difficult to manage once they exceed 5 agents — LangGraph doesn't solve the cognitive complexity of large graphs.

Implication:
- Use LangGraph when you need any of: checkpointing, HITL interrupts, conditional branching on LLM output, or parallel node execution with join semantics.
- Keep graphs small and flat. More than 8–10 nodes is a design smell — split into sub-graphs with clear interfaces.
- Use `StateGraph` with a `TypedDict` state schema from day one. Untyped state dicts create subtle bugs.
- Use a persistent backend (Redis or Postgres) for checkpointing on any workflow longer than a few minutes.
- Do not use LangChain's high-level agent abstractions (`AgentExecutor`, `create_react_agent`) in production — they hide retry logic and error handling you need to control explicitly.

Avoid:
- Using LangGraph for simple prompt-chaining pipelines with no branching — the overhead is unjustified.
- Debugging via framework logs alone — instrument raw LLM inputs/outputs with LangSmith or equivalent.

---

## Core Sources: Human-in-the-Loop

### 18. Human-in-the-Loop Patterns (Permit.io / LangChain Docs, 2025)
Sources:
- https://www.permit.io/blog/human-in-the-loop-for-ai-agents-best-practices-frameworks-use-cases-and-demo
- https://docs.langchain.com/oss/python/langchain/human-in-the-loop
- The Human-in-the-Loop Illusion: https://www.resilientcyber.io/p/the-human-in-the-loop-illusion

Why it matters:
- HITL provides false confidence if approvers are presented with raw JSON or 50-step action summaries. Humans rubber-stamp without understanding. Good HITL requires human-readable summaries.

Four distinct HITL patterns:
1. **Interrupt & Resume** — the agent pauses at a checkpoint, waits for a decision, resumes. Best for irreversible action authorization.
2. **Human-as-a-Tool** — the agent treats human judgment as a callable service for genuine uncertainty. Best for ambiguous inputs.
3. **Approval Flows** — role-based, policy-driven authorization for action classes. Best for financial/compliance workflows.
4. **Fallback Escalation** — failed or permission-denied tasks route to humans via async channels. Best for lower-urgency decisions.

Key trigger criteria: access control changes, infrastructure modifications, destructive operations, financial transactions, operations outside the agent's intended scope. Heuristic: "Would I be okay if the agent did this without asking me?"

Implication:
- Define your HITL trigger policy in a config file, not in agent prompts. Specify: action classes requiring approval, required approver role, timeout behavior, fallback path if no response.
- Present approval requests as plain-language summaries: "Agent wants to delete 3 files in /prod/config: [list]. Reason: [reason]. Approve?" — never raw tool schemas.
- For async approval, save full agent state to durable storage before suspending.
- Set a maximum wait time for approval (e.g., 4 hours for low-stakes, 10 minutes for blocking workflows).
- Log every HITL interaction (request, approver, decision, timestamp) for audit trails.

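
A sketch of config-driven triggers plus a plain-language approval summary. The policy table and its fields are illustrative assumptions, not a Permit.io or LangChain API.

```python
# HITL sketch: the trigger policy lives in config (not in prompts), and
# approval requests are rendered as plain language, not raw tool JSON.
APPROVAL_POLICY = {
    "delete_file": {"approver_role": "oncall", "timeout_s": 600},
    "read_file": None,  # no approval needed
}

def needs_approval(tool_call: dict) -> bool:
    return APPROVAL_POLICY.get(tool_call["tool"]) is not None

def render_approval_request(tool_call: dict) -> str:
    paths = tool_call["args"]["paths"]
    return (f"Agent wants to {tool_call['tool'].replace('_', ' ')} "
            f"{len(paths)} file(s): {', '.join(paths)}. "
            f"Reason: {tool_call['reason']}. Approve?")

call = {"tool": "delete_file",
        "args": {"paths": ["/prod/config/a.yaml", "/prod/config/b.yaml"]},
        "reason": "stale feature flags"}
summary = render_approval_request(call)
```

A production version would also fail closed on unknown action classes, persist state before suspending, and log the request/approver/decision/timestamp tuple for the audit trail.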
Avoid:
- Requiring HITL approval for every step — creates approval fatigue leading to rubber-stamping.
- Presenting raw LLM output or tool call JSON to approvers.
- HITL with no timeout — human unavailability should be a handled failure mode.

---

## Core Sources: Observability

### 19. AI Agent Observability with OpenTelemetry (OTEL, 2025)
Source: https://opentelemetry.io/blog/2025/ai-agent-observability/

Why it matters:
- The industry is converging on OTEL as the standard for AI agent telemetry. Emit once, route to any backend without vendor lock-in. GenAI Semantic Conventions are being standardized.

Key takeaways:
- Multi-agent traces must reconstruct *why* an agent made a decision, not just *what* it did and *how long* it took. This requires correlation IDs linking all calls in a single task and parent-child span relationships across agent boundaries.
- Datadog launched AI Agent Monitoring (DASH 2025). Microsoft integrated multi-agent observability across Semantic Kernel, LangGraph, LangChain, and the OpenAI Agents SDK.
- Two audiences: LangSmith for prompt-level debugging in development; Datadog/Grafana for production operational monitoring.

Implication:
- Instrument with OTEL from day one using the GenAI semantic conventions.
- Emit a trace per agent task with: task ID, parent task ID, agent role, FSM state transitions, tool calls as child spans, final output, success/failure, and token counts per step.
- Propagate correlation IDs through all agent-to-agent calls. Without this, multi-agent debugging is blind.
- Alert on: agent loop depth >10 steps without completion, tool error rate >20% on any tool over 5 minutes, token-per-task cost 2x baseline.
- Capture the full state object at each checkpoint — this enables time-travel debugging.

Avoid:
|
||
- Logging only the final output of each agent.
|
||
- Building custom observability infrastructure when OTEL + a backend is available.
|
||
- Storing raw conversation histories as the only observability artifact.
|
||
- Instrumenting only happy-path flows — errors, retries, and HITL interrupts must also emit structured spans.
|
||
|
||
---

## Core Sources: Software Engineering Agents

### 20. SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering (Yang et al., 2024)

Source: https://arxiv.org/abs/2405.15793

Why it matters:
- The interface between model and environment is part of the model's performance.

Key takeaway:
- Good repo agents need a disciplined shell/editor/test interface, not just a better prompt.

Implication:
- Design the action surface carefully.
- Short loops over read/search/edit/test beat abstract planning without execution.

### 21. Agentless: Demystifying LLM-based Software Engineering Agents (Xia et al., 2024)

Source: https://arxiv.org/abs/2407.01489

Why it matters:
- A simpler pipeline can outperform complex software agents at lower cost.

Key takeaway:
- Simpler decomposition often beats a giant autonomous loop.

Implication:
- Always benchmark against a simpler non-agentic or lightly agentic baseline.
- If a full agent loop is not clearly better, cut it.

### 22. PatchPilot: A Cost-Efficient Software Engineering Agent (2025)

Source: https://arxiv.org/abs/2502.02747

Why it matters:
- A rule-based 5-step workflow matches or beats fully agentic approaches at a fraction of the cost (under $1/instance on SWE-bench).

Key takeaway:
- Adding an explicit localization step before generation measurably improves patch quality. Rule-based planning provides stability; agent-based planning provides peak performance. A hybrid uses rules as the default and escalates to agent planning on failure.

5-step workflow:
1. Reproduction — verify the issue is reproducible
2. Localization — retrieve relevant context from the codebase
3. Generation — produce the patch
4. Validation — run tests/checks
5. Refinement — iterate until validation passes

Implication:
- Before generating a fix, explicitly localize: find the affected files/functions and pull them into context.
- Use rule-based orchestration as the stable default; reserve LLM-driven planning for cases where the rule path has already failed.
- Treat refinement as a bounded loop with a fixed retry budget, not open-ended re-generation.
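
The workflow and its escalation rule reduce to a plain function. This is a sketch of the shape, not PatchPilot's actual interface: the step callables, return dictionary, and retry cap are all illustrative assumptions.

```python
# Sketch of a rule-based fix pipeline with a bounded refinement loop and
# agent escalation; step callables are placeholders, not PatchPilot's API.

def run_fix_pipeline(issue, reproduce, localize, generate, validate,
                     max_refinements=3, agent_fallback=None):
    if not reproduce(issue):                   # 1. Reproduction
        return {"status": "not-reproducible"}
    context = localize(issue)                  # 2. Localization
    patch = generate(issue, context)           # 3. Generation
    for attempt in range(max_refinements):     # 4+5. Validation + bounded refinement
        report = validate(patch)
        if report["passed"]:
            return {"status": "fixed", "patch": patch, "attempts": attempt + 1}
        patch = generate(issue, context, feedback=report)
    if agent_fallback is not None:             # rule path exhausted: escalate
        return agent_fallback(issue, context)
    return {"status": "failed", "patch": patch}


# Toy stubs: the second generated patch passes validation.
_n = {"calls": 0}

def _toy_generate(issue, context, feedback=None):
    _n["calls"] += 1
    return f"patch-{_n['calls']}"

result = run_fix_pipeline(
    "issue-42",
    reproduce=lambda i: True,
    localize=lambda i: ["src/buggy.py"],
    generate=_toy_generate,
    validate=lambda p: {"passed": p == "patch-2"},
)
```

The key design choice is that refinement cannot loop forever: the retry budget is a function argument, and escalation to an agent planner happens exactly once, after the cheap rule path has failed.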

---

## Core Sources: Evaluation

### 23. Demystifying Evals for AI Agents (Anthropic, 2026)

Source: https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents

Why it matters:
- One of the best practical writeups on agent evals and reliability.

High-signal takeaways:
- Start early; 20–50 tasks is enough to begin.
- Write unambiguous tasks with reference solutions.
- Evaluate both "should do X" and "should not do X."
- Isolate trials from each other.
- Grade outputs/outcomes, not rigid exact traces.
- Calibrate model graders against humans.
- Read transcripts constantly.
- Treat eval-driven development as normal engineering.

Implication:
- Search/tool-use policies should be evaluated on both over-triggering and under-triggering.
- Coding agents should be graded on artifacts and tests, not whether they followed a specific thought path.
- If the model improved but the score did not, suspect the benchmark or grader too.
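
These rules reduce to a small harness shape: grade the final artifact against both positive and negative constraints, keep trials independent, and report consistency as well as pass@1. The task schema and function names below are illustrative assumptions, not from the article.

```python
# Outcome-based grading sketch: checks "should do X" and "should not do X"
# against the artifact, never the trace. Task fields are illustrative.

def grade(output, must_contain=(), must_not_contain=()):
    return (all(s in output for s in must_contain)
            and not any(s in output for s in must_not_contain))

def run_eval(agent, tasks, trials=3):
    """Each trial is a fresh, isolated call; report pass@1 and consistency."""
    results = {}
    for task in tasks:
        passes = [
            grade(agent(task["prompt"]),
                  task.get("must_contain", ()),
                  task.get("must_not_contain", ()))
            for _ in range(trials)
        ]
        results[task["id"]] = {
            "pass@1": passes[0],
            "consistency": sum(passes) / trials,
        }
    return results


# Toy run with a deterministic stand-in agent.
report = run_eval(
    lambda prompt: "tests pass; no files deleted",
    [{"id": "t1", "prompt": "fix the bug",
      "must_contain": ("tests pass",),
      "must_not_contain": ("rm -rf",)}],
)
```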

---

## Projects Worth Studying

### 1. karpathy/autoresearch

Source: https://github.com/karpathy/autoresearch

What to study:
- Extremely narrow loop, fixed optimization target, small mutable surface, experiment-first framing.

Copy: tight loop, fixed budget, metric-first automation.

Avoid: generalizing it into a broad orchestration layer unless evals justify it.

### 2. davebcn87/pi-autoresearch

Source: https://github.com/davebcn87/pi-autoresearch

What to study:
- Explicit session files, checks vs. crashes vs. metric logs, dashboard and widget feedback.

Copy:
- Make experiment state visible.
- Distinguish correctness failures from benchmark failures.
- Commit only after the right checks pass.
### 3. SWE-agent / mini-SWE-agent

Source: https://github.com/princeton-nlp/SWE-agent

What to study: repo-focused action surface, issue → inspect → edit → test loop, benchmark-first iteration.

Copy: narrow interface and strong harnessing.

### 4. OpenHands

Source: https://github.com/All-Hands-AI/OpenHands

What to study: broad workspace/runtime architecture, interactive software agent product design.

Copy carefully: runtime ergonomics and environment handling.

Risk: very easy to absorb too much framework complexity.

### 5. aider's architect/editor split

Source: https://aider.chat/2024/09/26/architect.html

What to study: separate high-level reasoning from concrete editing.

Copy: planner/editor separation can help when one lane should stay terse and execution-oriented.

Risk: only worth it if the split clearly improves results on your tasks.
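
The split is essentially two lanes with a one-way handoff. A minimal sketch, with the caveat that the prompt wording and function names are mine, not aider's:

```python
# Two-lane sketch of an architect/editor split: one model reasons about
# the change, the other stays terse and execution-oriented.

def architect_editor(task, plan_model, edit_model):
    plan = plan_model(f"Describe how to solve this; propose edits, no code:\n{task}")
    return edit_model(f"Apply exactly this plan as concrete file edits:\n{plan}")


# Toy stand-in models showing the handoff.
out = architect_editor(
    "rename function foo to bar",
    plan_model=lambda p: "PLAN: update def and all call sites",
    edit_model=lambda p: "EDITED per " + p.splitlines()[-1],
)
```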

### 6. ATLAS (Adaptive Test-time Learning and Autonomous Specialization)

Source: https://github.com/itigges22/ATLAS

What to study:
- Self-hosted coding agent achieving 74.6% on LiveCodeBench using a quantized 14B Qwen model on a single consumer GPU (RTX 5060 Ti, 16GB VRAM).
- Three-phase pipeline: **Generate → Verify → Repair**
  - *Generate*: PlanSearch extracts constraints and produces diverse candidate solutions; Budget Forcing controls token spend during inference
  - *Verify*: a "Geometric Lens" component scores candidates with a 5120-dimensional energy field (87.8% selection accuracy) plus sandboxed execution
  - *Repair*: failing candidates trigger self-generated test cases and iterative refinement via PR-CoT (Problem-Repair Chain-of-Thought)
- ~$0.004/task in local electricity vs. $0.043–$0.066 for comparable API services; no external API calls.

Why it belongs here:
- Working proof that smart infrastructure — not model scale — can close the gap with frontier systems. Doubles baseline pass rate (from ~38% to 74.6%) entirely through the generation/verification/repair scaffold.

Copy:
- generate → external verify → self-repair loop as a default pattern for coding tasks.
- Budget forcing to limit token waste on low-confidence generations.
- Distinguish candidate selection accuracy from final pass rate — they are different metrics.
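
The generate → external verify → self-repair shape can be sketched as a short loop. The callables below are placeholders standing in for PlanSearch, the verifier, and PR-CoT repair; nothing here is ATLAS's actual code.

```python
# Shape of the generate -> external-verify -> self-repair loop; the
# callables are placeholders, not ATLAS's actual components.

def solve(task, generate, verify, repair, n_candidates=4, max_repairs=2):
    candidates = [generate(task) for _ in range(n_candidates)]
    # Selection uses an external verifier (e.g. sandboxed execution),
    # never the generator's own opinion of its output.
    best = max(candidates, key=verify)
    for _ in range(max_repairs):
        if verify(best):
            return best
        best = repair(task, best)   # e.g. self-generated tests guide the fix
    return best if verify(best) else None


# Toy run: generation always fails, one repair round fixes it.
answer = solve(
    "task",
    generate=lambda t: "wrong",
    verify=lambda c: c == "right",
    repair=lambda t, c: "right",
)
```

Note the two distinct metrics the Copy list mentions: how often `max(..., key=verify)` picks the best candidate (selection accuracy) is separate from how often the returned answer passes (final pass rate).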

Avoid:
- Treating it as a general-purpose agent; optimized explicitly for LiveCodeBench.
- Sequential pipeline if throughput matters.

Risk: the Geometric Lens is described as undertrained; verification signal could be a bottleneck on new domains.

---

## Distilled Rules for Orchestration

### Choosing the Architecture
- Default to a single LLM call or a simple workflow.
- Add an evaluator-optimizer loop when correctness matters and the task benefits from revision.
- Add multiple agents only when evals show clear gains from role separation.
- Use heterogeneous agents (different roles, models, or tool access), not homogeneous swarms.

### State Machine Design
- Minimum states: `PLANNING`, `EXECUTING`, `OBSERVING`, `ERROR_RECOVERY`, `DONE`.
- Each state: one well-defined prompt, allowed tools, explicit transition conditions.
- Error state is mandatory — not optional.
- Transition counter per state to prevent loops.
- Routing logic lives in code, not in LLM prompts.
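
The rules above can be sketched as a tiny loop: named states, routing in plain code, and a per-state visit counter as the loop guard. The state names match the minimum set; the caps and step-function signature are illustrative assumptions.

```python
# Minimal FSM loop sketch: named states, routing in code, a per-state
# transition counter, and a mandatory abort path. Caps are illustrative.

MAX_VISITS = {"PLANNING": 3, "EXECUTING": 10,
              "OBSERVING": 10, "ERROR_RECOVERY": 2}

def run_agent(step_fns, start="PLANNING"):
    """step_fns maps a state name to fn(ctx) -> (next_state, ctx)."""
    visits = {state: 0 for state in MAX_VISITS}
    state, ctx, trace = start, {}, []
    while state != "DONE":
        visits[state] += 1
        if visits[state] > MAX_VISITS[state]:
            return "ABORTED", trace          # loop guard: no unbounded cycling
        trace.append(state)
        state, ctx = step_fns[state](ctx)    # transitions decided in code
    return "DONE", trace


# Toy happy path through the minimum states.
status, trace = run_agent({
    "PLANNING":  lambda ctx: ("EXECUTING", ctx),
    "EXECUTING": lambda ctx: ("OBSERVING", ctx),
    "OBSERVING": lambda ctx: ("DONE", ctx),
})
```

Each state's prompt and allowed tool set would live inside its step function; the LLM never decides the routing.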

### Memory
- Don't treat the context window as your only memory.
- Two minimum layers: in-context working buffer + persistent episodic store with structured metadata.
- Consolidation job: after task completion, abstract lessons into semantic memory.
- Memory entries: structured notes with description, keywords, tags, outcome at write time.
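
A minimal sketch of the episodic-plus-semantic layering, with consolidation run after task completion. The field names and list-backed stores are illustrative assumptions; a real system would persist these and use an LLM as the summarizer.

```python
import time

# Two-layer memory sketch: episodic entries with structured metadata,
# plus a post-task consolidation pass. Field names are illustrative.
episodic = []   # persistent per-task episodes
semantic = []   # distilled rules abstracted from episodes

def write_episode(description, keywords, tags, outcome):
    episodic.append({
        "description": description,
        "keywords": keywords,
        "tags": tags,
        "outcome": outcome,   # recorded at write time, not inferred later
        "ts": time.time(),
    })

def consolidate(summarize):
    """Run after task completion: abstract failures into semantic rules."""
    failures = [e for e in episodic if e["outcome"] == "failure"]
    for episode in failures:
        semantic.append({"rule": summarize(episode),
                         "evidence": episode["description"]})
    return len(failures)


# Toy consolidation with a trivial summarizer standing in for an LLM.
write_episode("flaky test retried 3x", ["pytest", "retry"], ["ci"], "failure")
write_episode("patch merged cleanly", ["git"], ["vcs"], "success")
new_rules = consolidate(lambda e: "lesson: " + e["keywords"][0])
```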

### Context Management
- Default: observation masking (replace old tool outputs with placeholders; keep reasoning chain).
- Trigger compaction at 70–80% of context limit — before the LLM call.
- Always preserve: original task spec, last N turns verbatim, current goal state.
- LLM summarization only as a fallback for single oversized responses.
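
Observation masking with a pre-call trigger fits in one function. The message shape, role names, and 0.75 threshold below are illustrative defaults, not a specific framework's API.

```python
# Observation-masking sketch with a pre-call compaction trigger; the
# message shape and the 0.75 threshold are illustrative defaults.

def compact(messages, count_tokens, limit, keep_last=4, threshold=0.75):
    """Mask old tool outputs once usage nears the limit; keep the rest."""
    if count_tokens(messages) < threshold * limit:
        return messages                        # below trigger: no-op
    head, tail = messages[:-keep_last], messages[-keep_last:]
    masked = [
        {**m, "content": "[tool output elided]"} if m["role"] == "tool" else m
        for m in head                          # reasoning/user turns survive
    ]
    return masked + tail                       # last N turns stay verbatim


# Toy history: task spec plus six bulky tool observations.
history = [{"role": "user", "content": "task spec"}] + [
    {"role": "tool", "content": "x" * 50} for _ in range(6)
]
approx_tokens = lambda msgs: sum(len(m["content"]) for m in msgs)
compacted = compact(history, approx_tokens, limit=200)
```

Because only `tool` messages are rewritten, the original task spec and the reasoning chain survive compaction verbatim.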

### Error Recovery
- Retry with exponential backoff + jitter. Cap at 3 attempts.
- Circuit breaker at the LLM client level.
- Tool errors return as structured observations, not crashes.
- After 2 consecutive failures on the same action: planning reset.
- Per-task retry budget, not just per-call.
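
The per-call and per-task halves of this policy can be sketched separately: a capped backoff-with-jitter retry, plus a budget object that triggers the two-strikes planning reset. All constants and names are illustrative.

```python
import random
import time

# Sketch of the retry policy above; caps and delays are illustrative.

def call_with_retry(fn, max_attempts=3, base=0.5, sleep=time.sleep):
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise                      # cap reached: surface the error
            # exponential backoff with full jitter
            sleep(random.uniform(0, base * 2 ** attempt))

class TaskBudget:
    """Per-task retry budget plus the two-strikes planning-reset rule."""

    def __init__(self, max_retries=10):
        self.remaining = max_retries
        self.consecutive = {}              # action -> consecutive failures

    def record(self, action, success):
        if success:
            self.consecutive[action] = 0
            return "continue"
        self.remaining -= 1
        self.consecutive[action] = self.consecutive.get(action, 0) + 1
        if self.remaining <= 0:
            return "abort"
        if self.consecutive[action] >= 2:
            return "replan"                # back to planning, not more retries
        return "continue"


# Toy flaky call: fails twice, then succeeds; sleeps are captured.
attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("transient")
    return "ok"

delays = []
outcome = call_with_retry(flaky, sleep=delays.append)

budget = TaskBudget()
decisions = [budget.record("edit_file", False),
             budget.record("edit_file", False)]
```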

### Multi-Agent Design
- Use one agent unless there is a measured reason to split.
- Split by capability, not by story or persona.
- Small helpers do mechanical work; larger models handle synthesis and edge-case reasoning.
- Coordination prompts name the shared objective and each role's responsibility.
- Per-subagent permission scopes defined in config, not in prompts.
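
Config-defined permission scopes amount to a lookup enforced in code before any tool runs. The role and tool names below are illustrative; the point is that the check lives outside the prompt.

```python
# Sketch: per-subagent tool scopes defined in config and enforced in code
# before dispatch. Role and tool names are illustrative.

SCOPES = {
    "researcher": {"read_file", "search"},
    "editor":     {"read_file", "write_file", "run_tests"},
    "reviewer":   {"read_file"},
}

def dispatch(role, tool, call):
    if tool not in SCOPES.get(role, set()):
        # Denial is a structured observation the agent can react to.
        return {"error": f"tool '{tool}' not permitted for role '{role}'"}
    return {"result": call()}


# Toy calls: the reviewer cannot write, the editor can run tests.
denied = dispatch("reviewer", "write_file", lambda: "diff applied")
allowed = dispatch("editor", "run_tests", lambda: "3 passed")
```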

### Evaluation
- Run the same harness the product actually uses.
- Keep trials isolated.
- Track pass@1 and consistency, not just "found a good answer once."
- Review transcripts every week if the system matters.
- Rebuild evals from real failures and real manual checks.

---

## Anti-Patterns

- Too many homogeneous agents
- Persona-rich orchestration prompts with weak task constraints
- Unbounded self-reflection loops
- Auto-commits without validation
- Massive context files with no ownership
- Grading the exact tool path instead of the delivered outcome
- Building a platform before validating a narrow workflow
- Routing logic embedded in LLM prompts instead of code
- Missing error state in agent FSM
- Context growing unchecked until hard limit
- HITL approvals with raw JSON or no plain-language summary
- Agent-to-agent calls without correlation IDs
- Memory as only a raw vector store with no structured metadata
- Stale harness components after model upgrades

---

## What To Re-Read Often

- Anthropic, Building Effective AI Agents
- Anthropic, Harness Design for Long-Running Application Development
- Anthropic, Demystifying Evals for AI Agents
- Anthropic, Programmatic Tool Calling Docs
- BAIR, The Shift from Models to Compound AI Systems
- StateFlow (FSM-based agent loops)
- JetBrains Research, Efficient Context Management
- A-MEM (memory network design)
- Agentless
- SWE-agent
- karpathy/autoresearch
- pi-autoresearch
- ATLAS (generate → verify → repair; small-model infra vs. scale)
- Multi-Agent Orchestration for Deterministic Decision Support

---

## Update Policy

When adding a new source, prefer:
- primary paper
- official engineering article
- official project README or documentation

For each new source, capture:
- what it claims
- what to copy into orchestration design
- what to avoid
- whether it actually changes system design decisions

If it does not change design decisions, it probably does not belong here.

*Last updated: 2026-04-01*