# Agent Orchestration and System Design
A practical, research-backed field guide for designing agent systems: workflows, multi-agent pipelines, memory, evaluation, and infrastructure.
Use it for:
- choosing between single-agent, workflow, and multi-agent designs
- orchestration patterns (sequential, parallel, evaluator-optimizer)
- agent memory and context management
- error recovery and production reliability
- evaluation and harness design
- tooling and automation loops
Prompt-level decisions (system prompt writing, CoT strategy, instruction following) live in `Research-prompt.md`.
## Fast Takeaways
1. Start with the simplest scaffold that can pass evals. Default to single-agent or workflow. Add agents only when evals show clear gains.
2. Separate generator and evaluator roles. Self-evaluation is too lenient; an external evaluator is much stronger.
3. Use different models or prompts only when they contribute distinct evidence or skills. Homogeneous agent swarms do not scale.
4. Model your agent loop as an explicit state machine. Named states with typed transitions beat open-ended ReAct loops beyond 5-6 steps.
5. Error states are first-class citizens. After two consecutive failures on the same action, return to planning — not more retries.
6. Memory is a system: working buffer + episodic store + semantic rules. A raw vector store is not a memory architecture.
7. Context window management: observation masking outperforms LLM summarization on cost and quality. Keep the reasoning chain; replace old tool outputs.
8. Grade outcomes, artifacts, and grounded evidence — not exact tool-call traces.
9. HITL approval is only useful if presented as a plain-language summary, not raw JSON.
10. Instrument with OpenTelemetry from day one. Correlate traces across agent boundaries with parent-child span IDs.
## What To Copy Into Systems
### Orchestration
- Keep the default path single-agent or workflow-based.
- Add planners, reviewers, or specialist agents only when evals show clear gains.
- Prefer bounded loops: one plan phase, one act phase, one verifier, one retry budget.
- Use different models or prompts only when they contribute distinct evidence or skills.
- Treat multi-agent diversity as a tool, not a religion.
### System Design
- One agent, one responsibility. Separate generator from evaluator from synthesizer.
- Small specialists for mechanical subproblems (grep, read, classify, run) are a real design pattern.
- Reserve expensive frontier models for hard reasoning and synthesis.
- Route by capability and role, not by "use more agents for quality."
- A compound system is not a complex system. Match complexity to business value.
### Tooling and Action Surface
- Favor tools that return verifiable feedback: tests, compiler errors, search results, fetched pages, graders.
- Apply poka-yoke to every tool: use absolute filepaths, validate inputs before calling external services, return structured error objects not raw exceptions.
- Keep traces and artifacts.
- If the task is sensitive to staleness, exactness, or sourcing, lookup beats memory.
### Automation and Safety
- Fix a metric before running an autonomous loop.
- Keep the mutable surface small.
- Auto-commit only after checks pass.
- Separate "experiment failed" from "checks failed" from "metric regressed."
- Prefer narrow optimization targets over grand autonomous platform behavior.
### Evaluation
- Build evals from real failures and real manual checks.
- Balance both sides of decision boundaries: "should do X" and "should not do X."
- Isolate trials — no shared repo state, hidden cache, or leaked history.
- Use deterministic graders where possible.
- Use LLM graders with clear rubrics and human calibration when needed.
- Read transcripts constantly. If metrics and transcripts disagree, suspect the harness or grader.
---
## Core Sources: System Design
### 1. The Shift from Models to Compound AI Systems (BAIR, 2024)
Source: https://bair.berkeley.edu/blog/2024/02/18/compound-ai-systems/
Why it matters:
- Strong AI systems increasingly come from multiple interacting components, not just bigger base models. System design can improve quality faster than scaling alone.
Key takeaways:
- Access to current data, control, trust, and cost are often easier to solve at the system level.
- Optimizing a compound system is a distinct engineering problem from optimizing a model.
Implication:
- Build around tools, retrievers, graders, and routers when they solve a real product problem.
- Do not mistake "compound system" for "maximally complex system."
### 2. Building Effective AI Agents (Anthropic, 2026)
Source: https://resources.anthropic.com/building-effective-ai-agents
Why it matters:
- High-quality practical guidance from a team operating real agent systems at scale.
The most useful framing:
- Choose between single-agent, workflow, and multi-agent designs intentionally.
- Use a small set of reusable patterns:
- **Prompt chaining** — sequential, each output feeds the next
- **Routing** — classify input, dispatch to specialist
- **Parallelization** — sectioning for independent subtasks; voting for confidence
- **Orchestrator-workers** — dynamic delegation for unpredictable subtasks
- **Evaluator-optimizer** — generate-then-critique loop for refineable outputs
- Match system complexity to business value.
Implication:
- Default to simple workflows first.
- Reach for evaluator-optimizer when correctness matters and the task benefits from revision.
- Reach for multi-agent only after single-agent/workflow baselines are exhausted.
- The most successful teams use "simple, composable patterns rather than complex frameworks."
Avoid:
- Using agent frameworks without understanding what they do under the hood — they often create abstraction layers that obscure prompts and responses.
- Adding agents before measuring whether simpler approaches fail.
### 3. Harness Design for Long-Running Application Development (Anthropic, 2026)
Source: https://www.anthropic.com/engineering/harness-design-long-running-apps
Why it matters:
- Strong, current practical guidance on long-running coding harnesses from a team actively tuning them against real app-building tasks.
Key takeaways:
- Separate generator and evaluator roles. Self-evaluation is too lenient; a tuned external evaluator is much more useful.
- Make "done" explicit before coding. Use a per-sprint contract negotiated between builder and evaluator.
- Keep planner output high-level and product-facing. Over-specifying low-level details too early cascades bad assumptions.
- Use evaluators that touch the environment directly. Playwright-driven QA against real UI/API behavior is much stronger than static inspection.
- Use hard thresholds for acceptance, not vague approval. If a criterion misses, fail the chunk and return actionable feedback.
- Preserve handoff artifacts and structured files between agents. File-based communication reduces drift across long runs.
- Context resets vs. compaction is a model-specific engineering choice. Resets help when the model shows context anxiety; stronger models may make resets unnecessary.
- Every harness component encodes an assumption about model weakness. Re-run ablations and remove scaffold that is no longer load-bearing after model upgrades.
- Evaluators are worth the cost near the model's capability boundary. When the model can already do the task reliably solo, evaluator passes can become mostly overhead.
- Read logs and calibrate the evaluator from real disagreements. Prompt tuning should come from concrete misses, not abstract "better QA" wishes.
Implication:
- Keep coder and reviewer/verifier separate when acceptance quality matters.
- Add an explicit contract or acceptance plan before implementation when the spec is high-level.
- Prefer grounded evaluator tools over reviewer vibes.
- Keep handoff state compact and structured enough to survive resets when resets are needed.
- Revisit harness complexity whenever the base model changes; stale scaffold is real technical debt.
### 4. Understanding Agent Scaling via Diversity (2026)
Source: arXiv:2602.03794
Why it matters:
- More homogeneous agents do not scale indefinitely; diversity matters more than count.
Key takeaway:
- Two meaningfully different agents can outperform a swarm of same-ish agents.
Implication:
- Diversity should come from role, model, tool access, or evidence channel.
- Do not duplicate the same model/prompt ten times and call it orchestration.
### 5. SOLVE-Med / MATA / Small-Model Orchestration (2025-2026)
Sources:
- SOLVE-Med: arXiv:2511.03542
- MATA: arXiv:2602.09642
Why they matter:
- Small specialized models, when orchestrated well, can outperform or match much larger standalone systems.
Key takeaway:
- Cheap specialists for mechanical subproblems are a real design pattern, not a hack.
Implication:
- Route grep/read/run/simple classification to cheaper lanes.
- Reserve expensive models for hard reasoning or integration steps.
### 6. Difficulty-Aware Agentic Orchestration (DAAO, 2025)
Source: arXiv:2509.11079
Why it matters:
- Not all subtasks need the same model size. A variational autoencoder estimating query difficulty + a cost/performance-aware router gives near-frontier quality at significantly lower cost.
Key takeaway:
- Difficulty-based routing is a high-leverage optimization most systems skip.
Implication:
- Classify task difficulty before dispatching to a model. Easy classification → cheap model. Hard synthesis → frontier model.
- This is the principled version of the "route to cheap specialists" heuristic.
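A minimal sketch of this routing step. The model identifiers, threshold, and the keyword heuristic standing in for DAAO's learned difficulty estimator are all illustrative, not from the paper:

```python
# Difficulty-aware dispatch (sketch). DAAO learns a VAE-based difficulty
# estimator; the keyword heuristic below is a hypothetical stand-in.
CHEAP_MODEL = "small-model"        # hypothetical identifier
FRONTIER_MODEL = "frontier-model"  # hypothetical identifier

def classify_difficulty(task: str) -> float:
    """Return a difficulty score in [0, 1]; replace with a learned estimator."""
    hard_markers = ("refactor", "design", "prove", "synthesize")
    return 0.9 if any(m in task.lower() for m in hard_markers) else 0.2

def route(task: str) -> str:
    # Easy classification -> cheap model; hard synthesis -> frontier model.
    return FRONTIER_MODEL if classify_difficulty(task) > 0.5 else CHEAP_MODEL
```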
### 7. Multi-Agent Orchestration for Deterministic Decision Support (2025)
Source: arXiv:2511.15755
Why it matters:
- 348 controlled trials: multi-agent orchestration achieved 100% actionable recommendation rate vs. 1.7% for single-agent, with 80x specificity and 140x correctness improvement at similar latency.
- The reframing: multi-agent orchestration is a production-readiness requirement, not a performance optimization. Consistent, deterministic quality is what enables SLA commitments.
Key takeaway:
- Single agents produce high-variance outputs. Multi-agent systems with clear role separation produce stable ones.
Implication:
- When variance is unacceptable (financial decisions, infrastructure changes, compliance tasks), multi-agent is not optional — it's the architecture that enables quality guarantees.
### 8. Emergent Coordination in Multi-Agent Systems (2025)
Source: arXiv:2510.05174
Why it matters:
- Coordination is better when agents share objectives and understand complementary roles.
Key takeaway:
- Role awareness is useful; vague social-role prompts are not enough.
Implication:
- When using multiple agents, explicitly describe what each one contributes and how outputs combine.
- Name the shared objective in the orchestrator prompt; name each agent's responsibility in its own prompt.
---
## Core Sources: Memory
### 9. A-MEM: Agentic Memory for LLM Agents (NeurIPS 2025)
Source: https://arxiv.org/abs/2502.12110
Why it matters:
- A Zettelkasten-style memory network — structured notes with attributes, keywords, and tags — doubled complex reasoning performance vs. flat vector store baselines at lower token cost.
Key takeaways:
- Every memory node gets a structured note with contextual description, keywords, and tags at write time.
- An autonomous link-generation mechanism identifies connections via cosine similarity + LLM analysis.
- When a new memory is added, existing related memories are also updated — the memory network evolves.
Implication:
- At minimum, build two layers: a short-term in-context working buffer and a persistent episodic store with structured metadata per entry.
- Enrich every stored memory with metadata at write time (task context, success/failure outcome, timestamps, tags) — retrieval quality depends entirely on index richness.
Avoid:
- Flat vector stores with no structural metadata — retrieval becomes a bag-of-embeddings lottery.
- Unbounded episodic stores without consolidation or eviction policies.
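A sketch of what write-time enrichment could look like, loosely following the A-MEM note structure. Field names and the `store` helper are illustrative, not the paper's schema:

```python
# Structured memory note with metadata attached at write time (sketch).
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class MemoryNote:
    content: str                    # raw episode or observation
    description: str                # contextual summary written at store time
    keywords: list[str]             # for retrieval beyond raw embeddings
    tags: list[str]
    outcome: str                    # "success" | "failure"
    task_context: str               # which task produced this memory
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())
    links: list[str] = field(default_factory=list)  # IDs of related notes

def store(note: MemoryNote, index: dict) -> str:
    """Write path. A-MEM would also run link generation here, updating
    `links` on this note and on related existing notes (the evolving network)."""
    note_id = f"mem-{len(index)}"
    index[note_id] = note
    return note_id
```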
### 10. Episodic Memory is the Missing Piece for Long-Term LLM Agents (2025)
Source: https://arxiv.org/abs/2502.06975
Why it matters:
- Of the four memory tiers (working, episodic, semantic, procedural), episodic memory is the most underinvested and the key enabler for genuine long-term agent improvement.
Key takeaway:
- Time-stamped traces of specific past task runs enable single-shot learning from concrete prior instances. Without episodic memory, agents keep relearning the same lessons.
Implication:
- Implement an episodic-to-semantic consolidation job: after task completion, abstract successful patterns from the episode trace into reusable rules in semantic memory.
- For multi-agent systems: distinguish per-agent private episodic memory from shared semantic memory. Sharing raw episodes risks leakage; sharing distilled rules is safer.
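A sketch of the consolidation job, assuming a generic `llm(prompt) -> str` callable (hypothetical) and simple in-memory stores; the distillation prompt is illustrative:

```python
# Episodic-to-semantic consolidation (sketch): after task completion,
# abstract successful episode traces into reusable semantic rules.
def consolidate(episodes: list[dict], semantic_rules: list[str], llm) -> list[str]:
    successes = [e for e in episodes if e.get("outcome") == "success"]
    for episode in successes:
        rule = llm(
            "Distill one reusable, general rule from this task trace. "
            "Omit instance-specific names and values.\n\n"
            f"Trace: {episode['trace']}"
        )
        semantic_rules.append(rule.strip())  # distilled rules are safe to share
    return semantic_rules
```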
---
## Core Sources: Context Management
### 11. Cutting Through the Noise: Efficient Context Management (JetBrains Research, Dec 2025)
Source: https://blog.jetbrains.com/research/2025/12/efficient-context-management/
Why it matters:
- Both common strategies (observation masking and LLM summarization) cut costs >50% vs. unmanaged context. But observation masking matched or outperformed summarization in 4 of 5 configurations, at lower complexity.
Key takeaways:
- **Observation masking**: replaces older tool outputs/file contents with a placeholder, keeps the reasoning chain intact. Fast, cheap, no extra LLM calls.
- **LLM summarization**: compresses old turns. Slower, more expensive, and paradoxically caused agents to run ~15% longer trajectories (summaries gave false confidence to keep going).
- With Qwen3-Coder 480B, masking achieved 2.6% *higher* solve rates while being 52% cheaper.
- A 2026 industry report attributed ~65% of enterprise AI failures to "context drift" — accumulated noise causing agents to lose track of their goal.
Implication:
- Default to **observation masking** as the primary compaction strategy. Keep the reasoning chain; replace tool outputs after a rolling window.
- Add LLM summarization as a fallback only when a single tool response is too large to fit once.
- Set a hard token budget before each agent turn. Trigger compaction when projected input exceeds 70-80% of the context limit — before the LLM call, not after.
- Always preserve: original task specification, the most recent N turns verbatim, current goal state.
Avoid:
- LLM summarization as the primary strategy — slower, more expensive, longer trajectories.
- Letting context grow unchecked — quality degrades well before the hard limit due to lost-in-the-middle effects.
- Resetting context entirely for long-running tasks — you lose accumulated plan state.
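A minimal observation-masking sketch under stated assumptions: the role/content message shape, the placeholder text, and the crude token counter are all illustrative:

```python
# Observation masking (sketch): keep the reasoning chain, replace tool
# outputs older than a rolling window with a placeholder.
PLACEHOLDER = "[output elided: re-run the tool if this result is needed]"

def rough_tokens(text: str) -> int:
    return len(text) // 4  # crude approximation; swap in a real tokenizer

def mask_observations(messages: list[dict], keep_last: int = 5,
                      budget: int = 100_000) -> list[dict]:
    """Mask old tool outputs once projected input exceeds ~75% of budget."""
    total = sum(rough_tokens(m["content"]) for m in messages)
    if total <= int(budget * 0.75):
        return messages
    cutoff = len(messages) - keep_last  # the last N turns stay verbatim
    return [
        {**m, "content": PLACEHOLDER}
        if m["role"] == "tool" and i < cutoff else m  # reasoning turns intact
        for i, m in enumerate(messages)
    ]
```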
---
## Core Sources: State Machines and Control Flow
### 12. StateFlow: Enhancing LLM Task-Solving through State-Driven Workflows (2024)
Source: https://arxiv.org/abs/2403.11322
Why it matters:
- Modeling a task as a finite state machine (FSM) with six components — States, Initial state, Final states, Output functions, Transitions, Context history — yielded 63.73% success on SQL tasks vs. 40.3% for ReAct, at 5.8x lower cost.
Key takeaways:
- Removing the explicit Error state caused a 5% success rate decline — error handling as a named state is critical.
- A specialist FSM variant (SF_Agent) with separate LLMs per state further reduced token usage.
Implication:
- Model your agent loop as an explicit FSM. Minimum viable states: `PLANNING`, `EXECUTING`, `OBSERVING`, `ERROR_RECOVERY`, `DONE`.
- Each state should have: a single well-defined LLM prompt, allowed tools, and explicit transition conditions.
- Add an `ERROR` state as a first-class citizen with its own prompt and recovery transitions.
- Use a transition counter per state (max N transitions before forcing fallback or human escalation) to prevent runaway loops.
- For complex multi-agent systems, define the FSM in a declarative config (YAML/JSON) rather than code — makes control flow auditable.
Avoid:
- Pure ReAct loops for tasks requiring more than 5-6 steps — they accumulate drift and have no recovery path when stuck.
- Embedding transition logic in the LLM prompt ("decide what to do next") — the LLM is unreliable as a state router. Keep routing deterministic in code.
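A minimal sketch of the FSM loop described in the implications above. The outcome labels, transition table, and per-state visit budget are illustrative; `run_state` stands in for the per-state LLM call plus tools:

```python
# Explicit FSM agent loop (sketch). Routing is deterministic in code:
# the LLM produces an outcome label, the table maps it to the next state.
from enum import Enum, auto

class State(Enum):
    PLANNING = auto()
    EXECUTING = auto()
    OBSERVING = auto()
    ERROR_RECOVERY = auto()
    DONE = auto()

TRANSITIONS = {
    (State.PLANNING, "plan_ready"): State.EXECUTING,
    (State.EXECUTING, "acted"): State.OBSERVING,
    (State.EXECUTING, "error"): State.ERROR_RECOVERY,
    (State.OBSERVING, "goal_met"): State.DONE,
    (State.OBSERVING, "continue"): State.EXECUTING,
    (State.ERROR_RECOVERY, "recovered"): State.EXECUTING,
    (State.ERROR_RECOVERY, "stuck"): State.PLANNING,  # reset, don't retry
}
MAX_VISITS = 5  # per-state transition budget before forced escalation

def run_agent(task: str, run_state) -> State:
    state, visits = State.PLANNING, {}
    while state is not State.DONE:
        visits[state] = visits.get(state, 0) + 1
        if visits[state] > MAX_VISITS:
            raise RuntimeError(f"loop guard tripped in {state.name}; escalate")
        outcome = run_state(state, task)  # one prompt + tool set per state
        state = TRANSITIONS[(state, outcome)]
    return state
```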
---
## Core Sources: Parallelization
### 13. Parallelization and Scatter-Gather Patterns (AWS Prescriptive Guidance, 2025)
Source: https://docs.aws.amazon.com/prescriptive-guidance/latest/agentic-ai-patterns/parallelization-and-scatter-gather-patterns.html
Why it matters:
- Structured scatter-gather (coordinator dispatches N independent subtasks, aggregator synthesizes) is the most battle-tested pattern for parallelizing LLM work.
Key takeaways:
- The aggregator is the critical bottleneck. It must handle partial failures gracefully.
- Use correlation IDs to match results to requests.
- Allow downstream tasks to start early on streaming outputs from upstream tasks where dependencies permit.
- Keep fan-out degree below ~20 parallel agents — coordination overhead grows non-linearly.
Implication:
- Structure fan-out tasks with explicit contracts: each subtask specifies inputs it consumes and the exact output schema it must produce.
- Design the aggregator as a separate, dedicated role with a prompt focused purely on synthesis and conflict resolution.
- Use async fan-out with per-task timeouts and a minimum quorum: "proceed to aggregation once 80% of tasks complete or 30 seconds elapse."
- Route cheap classification/filtering steps to small models; reserve large models for synthesis.
Avoid:
- Parallelizing tasks with implicit ordering dependencies.
- Using the same large model for every subtask — expensive and unnecessary for simple tasks.
- Stateful or context-heavy aggregators — they should receive clean, structured outputs from workers, not full conversation transcripts.
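A sketch of the quorum-gated fan-out in asyncio; the `worker` and `aggregator` callables and the correlation-ID scheme are assumptions:

```python
# Scatter-gather with per-task error capture and a quorum gate (sketch):
# proceed to aggregation once enough workers finish or the deadline elapses.
import asyncio
import time
import uuid

async def scatter_gather(subtasks, worker, aggregator,
                         quorum: float = 0.8, deadline: float = 30.0):
    async def run(task):
        cid = str(uuid.uuid4())  # correlation ID matches result to request
        try:
            return {"cid": cid, "ok": True, "result": await worker(task)}
        except Exception as exc:  # partial failures reach the aggregator
            return {"cid": cid, "ok": False, "error": str(exc)}

    start = time.monotonic()
    pending = {asyncio.ensure_future(run(t)) for t in subtasks}
    done: set = set()
    while pending and len(done) < quorum * len(subtasks):
        remaining = deadline - (time.monotonic() - start)
        if remaining <= 0:
            break
        finished, pending = await asyncio.wait(
            pending, timeout=remaining, return_when=asyncio.FIRST_COMPLETED)
        done |= finished
    for f in pending:
        f.cancel()  # stragglers are dropped once quorum/deadline is hit
    # Aggregator sees clean structured outputs only, never transcripts.
    return aggregator([f.result() for f in done])
```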
### 14. Orla: A Library for Serving LLM-Based Multi-Agent Systems (2026)
Source: https://arxiv.org/abs/2603.13605
Why it matters:
- Stage-level model routing (small model for classification, large model for synthesis) cut wall-clock time by 38%, mean completion time by 60%, at 35% lower cost vs. single-model baselines on SWE-bench Lite.
Key takeaway:
- Model routing at the workflow stage level is a high-leverage optimization that most systems skip.
Implication:
- Assign model tiers to workflow stages at design time, not at runtime by the LLM.
- Workflow-level KV cache management (preserve cache across stages sharing context prefixes) delivers measurable latency gains.
---
## Core Sources: Error Recovery
### 15. Retries, Fallbacks, and Circuit Breakers in LLM Apps (Portkey / Maxim, 2025)
Sources:
- https://portkey.ai/blog/retries-fallbacks-and-circuit-breakers-in-llm-apps/
- https://www.getmaxim.ai/articles/retries-fallbacks-and-circuit-breakers-in-llm-apps-a-production-guide/
Why it matters:
- Three complementary patterns form the production resilience stack. Without all three, you get retry storms or cascading provider failures.
Key takeaways:
- **Retries**: for transient errors (network, rate limits). Exponential backoff + jitter. Max 3 attempts. Anti-pattern: retrying persistent failures.
- **Fallbacks**: for provider-level failures. Switch to alternate model/provider. Anti-pattern: reactive fallback waits for timeout; shared-infrastructure fallbacks fail identically.
- **Circuit breakers**: for systematic degradation. Monitor failure rate over a rolling window; remove the endpoint from routing when it exceeds a threshold. Proactive, not reactive.
- For agent tool errors: feed the formatted error back to the LLM as a structured observation — not a crash. Let the agent decide to retry with modified parameters, try an alternative tool, or revise its plan.
- Define per-task max-retry budgets: an agent that retries the same tool call 10 times in one task is stuck, not recovering.
Implication:
- Implement retries with exponential backoff + jitter at the base: `base_delay * (2^attempt) + random(0, base_delay)`. Cap at 3 attempts, 30 seconds max total.
- Implement fallbacks across at least two LLM providers for any production agent.
- Implement circuit breakers at the LLM client level: open after 5 failures in 60 seconds, cooldown 30 seconds.
- For agent tool errors, return structured error observations: `{tool, error_type, message, suggested_action}`.
- After 2 consecutive tool failures on the same action, force a planning reset (back to `PLANNING` in the FSM).
Avoid:
- Catching all exceptions and silently continuing — the agent will proceed with incomplete state.
- Retrying non-idempotent mutations without deduplication keys.
- Identical retry and fallback strategies for rate-limit errors vs. model quality errors — these require different handling.
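A sketch combining the backoff formula and the circuit-breaker thresholds from the implications above; `TransientError` and the `call` parameter are stand-ins for a real client's error taxonomy:

```python
# Retry with exponential backoff + jitter, and a rolling-window circuit
# breaker (sketch). Thresholds mirror the text: 3 attempts / 30 s total,
# open after 5 failures in 60 s, 30 s cooldown.
import random
import time

class TransientError(Exception):
    """Stand-in for network/rate-limit errors; map real SDK errors here."""

def retry_with_backoff(call, base_delay=1.0, max_attempts=3, max_total=30.0):
    start = time.monotonic()
    for attempt in range(max_attempts):
        try:
            return call()
        except TransientError:
            if attempt == max_attempts - 1:
                break
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            if time.monotonic() - start + delay > max_total:
                break
            time.sleep(delay)
    raise RuntimeError("retry budget exhausted; fall back to alternate provider")

class CircuitBreaker:
    def __init__(self, threshold=5, window=60.0, cooldown=30.0):
        self.threshold, self.window, self.cooldown = threshold, window, cooldown
        self.failures: list[float] = []
        self.opened_at = None

    def allow(self) -> bool:
        now = time.monotonic()
        if self.opened_at is not None and now - self.opened_at < self.cooldown:
            return False  # open: endpoint removed from routing
        self.opened_at = None
        return True

    def record_failure(self):
        now = time.monotonic()
        self.failures = [t for t in self.failures if now - t < self.window]
        self.failures.append(now)
        if len(self.failures) >= self.threshold:
            self.opened_at = now  # trip proactively, before total failure
```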
---
## Core Sources: Agent Communication Protocols
### 16. Survey of Agent Interoperability Protocols: MCP, ACP, A2A, ANP (2025)
Source: https://arxiv.org/abs/2505.02279
Why it matters:
- Four protocols now occupy distinct architectural layers. Choosing the wrong one for the wrong layer creates maintenance debt.
Key findings:
| Protocol | Layer | Best for |
|---|---|---|
| MCP (Anthropic) | LLM ↔ Tools | Tool/resource injection into a single LLM |
| ACP (IBM/BeeAI) | Agent ↔ Agent | Model-agnostic, polyglot agent ecosystems |
| A2A (Google) | Agent ↔ Agent (enterprise) | Trusted enterprise inter-agent task delegation |
| ANP | Agent ↔ Internet | Open-internet, trustless agent discovery |
Recommended adoption: Start with MCP for tool access. Layer ACP for richer agent-to-agent messaging. Implement A2A within organizational boundaries. Extend to ANP only for internet-scale interoperability.
Implication:
- Use MCP today for everything that is "give an LLM access to a tool or data source." Most mature, widest tooling support.
- For agent-to-agent calls within your own system, a well-structured JSON message over HTTP with a correlation ID and defined output schema is sufficient and more debuggable than adopting a new protocol.
- Define a standard task envelope for all handoffs: `{task_id, parent_task_id, agent_role, input_schema, output_schema, deadline, status, error}`.
- Store task state externally (Redis, Postgres), not inside agent memory — so any agent can resume a task after failure.
Avoid:
- Passing full conversation transcripts between agents — pass structured outputs only.
- Deep delegation chains (A → B → C → D) without a policy layer enforcing permissions at each hop.
- Inventing bespoke message formats per integration — creates an N×M maintenance problem.
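The task envelope above expressed as a typed schema; the field names come from the text, while the types and status values are assumptions:

```python
# Standard task envelope for agent-to-agent handoffs (sketch). Persist
# instances externally (Redis, Postgres) so any agent can resume a task.
from typing import Optional, TypedDict

class TaskEnvelope(TypedDict):
    task_id: str
    parent_task_id: Optional[str]   # correlation across delegation hops
    agent_role: str
    input_schema: dict              # JSON-schema-like contract for input
    output_schema: dict             # schema the receiving agent must satisfy
    deadline: str                   # ISO-8601 timestamp
    status: str                     # e.g. "pending" | "running" | "done" | "failed"
    error: Optional[str]
```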
---
## Core Sources: LangGraph and Framework Patterns
### 17. LangGraph Orchestration Framework (2024-2025)
Source: https://www.langchain.com/langgraph
Why it matters:
- LangGraph's six primitives — Nodes, Edges, State, Checkpointing, Interrupts, Concurrency — remove real infrastructure boilerplate. But the framework also adds overhead when the task doesn't need these features.
Key takeaways:
- **Checkpointing** is the highest-ROI feature. Durable mid-run state means agents can recover from crashes without replaying from scratch.
- **Interrupts** are the cleanest available HITL pausing implementation — a first-class primitive.
- **Typed State** enforces a shared schema across all nodes, preventing the "agent passed the wrong keys" bug class.
- For simple linear workflows, LangGraph adds boilerplate with no functional gain — a plain Python function chain is faster to write and easier to debug.
- Research shows >75% of multi-agent systems become difficult to manage once they exceed 5 agents — LangGraph doesn't solve the cognitive complexity of large graphs.
Implication:
- Use LangGraph when you need any of: checkpointing, HITL interrupts, conditional branching on LLM output, or parallel node execution with join semantics.
- Keep graphs small and flat. More than 8-10 nodes is a design smell — split into sub-graphs with clear interfaces.
- Use `StateGraph` with `TypedDict` state schema from day one. Untyped state dicts create subtle bugs.
- Use a persistent backend (Redis or Postgres) for checkpointing on any workflow longer than a few minutes.
- Do not use LangChain's high-level agent abstractions (`AgentExecutor`, `create_react_agent`) in production — they hide retry logic and error handling you need to control explicitly.
Avoid:
- Using LangGraph for simple prompt-chaining pipelines with no branching — overhead unjustified.
- Debugging via framework logs alone — instrument raw LLM inputs/outputs with LangSmith or equivalent.
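A minimal evaluator-optimizer loop using LangGraph's documented primitives (`StateGraph`, a `TypedDict` state schema, conditional edges). Node bodies are stand-ins, and exact signatures should be verified against current LangGraph docs:

```python
# Typed-state LangGraph sketch: generate -> evaluate -> loop or finish.
from typing import TypedDict
from langgraph.graph import END, StateGraph

class AgentState(TypedDict):
    task: str
    draft: str
    approved: bool

def generate(state: AgentState) -> dict:
    return {"draft": f"plan for {state['task']}"}  # stand-in for an LLM call

def evaluate(state: AgentState) -> dict:
    return {"approved": bool(state["draft"])}      # stand-in for a grader

graph = StateGraph(AgentState)
graph.add_node("generate", generate)
graph.add_node("evaluate", evaluate)
graph.set_entry_point("generate")
graph.add_edge("generate", "evaluate")
graph.add_conditional_edges(
    "evaluate",
    lambda s: "done" if s["approved"] else "retry",
    {"done": END, "retry": "generate"},
)
app = graph.compile()  # pass a checkpointer here for durable mid-run state
result = app.invoke({"task": "demo", "draft": "", "approved": False})
```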
---
## Core Sources: Human-in-the-Loop
### 18. Human-in-the-Loop Patterns (Permit.io / LangChain Docs, 2025)
Sources:
- https://www.permit.io/blog/human-in-the-loop-for-ai-agents-best-practices-frameworks-use-cases-and-demo
- https://docs.langchain.com/oss/python/langchain/human-in-the-loop
- The Human-in-the-Loop Illusion: https://www.resilientcyber.io/p/the-human-in-the-loop-illusion
Why it matters:
- HITL provides false confidence if approvers are presented with raw JSON or 50-step action summaries. Humans rubber-stamp without understanding. Good HITL requires human-readable summaries.
Four distinct HITL patterns:
1. **Interrupt & Resume** — agent pauses at a checkpoint, waits for decision, resumes. Best for irreversible action authorization.
2. **Human-as-a-Tool** — the agent treats human judgment as a callable service for genuine uncertainty. Best for ambiguous inputs.
3. **Approval Flows** — role-based, policy-driven authorization for action classes. Best for financial/compliance workflows.
4. **Fallback Escalation** — failed or permission-denied tasks route to humans via async channels. Best for lower-urgency decisions.
Key trigger criteria: access control changes, infrastructure modifications, destructive operations, financial transactions, operations outside the agent's intended scope. Heuristic: "Would I be okay if the agent did this without asking me?"
Implication:
- Define your HITL trigger policy in a config file, not in agent prompts. Specify: action classes requiring approval, required approver role, timeout behavior, fallback path if no response.
- Present approval requests as plain-language summaries: "Agent wants to delete 3 files in /prod/config: [list]. Reason: [reason]. Approve?" — never raw tool schemas.
- For async approval, save full agent state to durable storage before suspending.
- Set a maximum wait time for approval (e.g., 4 hours for low-stakes, 10 minutes for blocking workflows).
- Log every HITL interaction (request, approver, decision, timestamp) for audit trails.
Avoid:
- Requiring HITL approval for every step — creates approval fatigue leading to rubber-stamping.
- Presenting raw LLM output or tool call JSON to approvers.
- HITL with no timeout — human unavailability should be a handled failure mode.
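A sketch of a config-driven trigger policy plus a plain-language summary renderer; the action classes, roles, and timeouts are illustrative, and in practice the policy dict would load from a YAML/JSON file:

```python
# HITL trigger policy as data, not prompts (sketch). Hypothetical values.
HITL_POLICY = {
    "destructive_op": {"approver": "oncall-admin", "timeout_s": 600,
                       "on_timeout": "abort"},
    "infra_change":   {"approver": "platform-lead", "timeout_s": 14_400,
                       "on_timeout": "escalate"},
}

def needs_approval(action_class: str) -> bool:
    return action_class in HITL_POLICY

def approval_summary(agent: str, action: str, targets: list[str],
                     reason: str) -> str:
    """Plain-language summary for the approver, never raw tool JSON."""
    return (f"Agent {agent} wants to {action} {len(targets)} item(s): "
            f"{', '.join(targets)}. Reason: {reason}. Approve?")
```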
---
## Core Sources: Observability
### 19. AI Agent Observability with OpenTelemetry (OTEL, 2025)
Source: https://opentelemetry.io/blog/2025/ai-agent-observability/
Why it matters:
- The industry is converging on OTEL as the standard for AI agent telemetry. Emit once, route to any backend without vendor lock-in. GenAI Semantic Conventions are being standardized.
Key takeaways:
- Multi-agent traces must reconstruct *why* an agent made a decision, not just *what* it did and *how long* it took. This requires correlation IDs linking all calls in a single task and parent-child span relationships across agent boundaries.
- Datadog launched AI Agent Monitoring (DASH 2025). Microsoft integrated multi-agent observability across Semantic Kernel, LangGraph, LangChain, OpenAI Agents SDK.
- Two audiences: LangSmith for prompt-level debugging in development; Datadog/Grafana for production operational monitoring.
Implication:
- Instrument with OTEL from day one using GenAI semantic conventions.
- Emit a trace per agent task with: task ID, parent task ID, agent role, FSM state transitions, tool calls as child spans, final output, success/failure, token counts per step.
- Propagate correlation IDs through all agent-to-agent calls. Without this, multi-agent debugging is blind.
- Alert on: agent loop depth >10 steps without completion, tool error rate >20% on any tool over 5 minutes, token-per-task cost 2x baseline.
- Capture the full state object at each checkpoint — this enables time-travel debugging.
Avoid:
- Logging only the final output of each agent.
- Building custom observability infrastructure when OTEL + a backend is available.
- Storing raw conversation histories as the only observability artifact.
- Instrumenting only happy-path flows — errors, retries, and HITL interrupts must also emit structured spans.
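A minimal instrumentation sketch using the standard OpenTelemetry Python API; the custom `agent.*` attribute names are illustrative and should be aligned with the GenAI semantic conventions as they stabilize:

```python
# One span per agent task, tool calls as child spans (sketch). Child
# spans inherit trace context, giving parent-child linkage across agents.
from opentelemetry import trace

tracer = trace.get_tracer("agent-orchestrator")

def run_task(task_id: str, parent_task_id: str, role: str, steps):
    with tracer.start_as_current_span("agent.task") as span:
        span.set_attribute("agent.task_id", task_id)
        span.set_attribute("agent.parent_task_id", parent_task_id)
        span.set_attribute("agent.role", role)
        for step in steps:
            with tracer.start_as_current_span("agent.tool_call") as tool_span:
                tool_span.set_attribute("agent.tool_name", step["tool"])
                step["run"]()  # errors/retries should also emit spans
```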
---
## Core Sources: Software Engineering Agents
### 20. SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering (Yang et al., 2024)
Source: https://arxiv.org/abs/2405.15793
Why it matters:
- The interface between model and environment is part of the model's performance.
Key takeaway:
- Good repo agents need a disciplined shell/editor/test interface, not just a better prompt.
Implication:
- Design the action surface carefully.
- Short loops over read/search/edit/test beat abstract planning without execution.
### 21. Agentless: Demystifying LLM-based Software Engineering Agents (Xia et al., 2024)
Source: https://arxiv.org/abs/2407.01489
Why it matters:
- A simpler pipeline can outperform complex software agents at lower cost.
Key takeaway:
- Simpler decomposition often beats a giant autonomous loop.
Implication:
- Always benchmark against a simpler non-agentic or lightly agentic baseline.
- If a full agent loop is not clearly better, cut it.
### 22. PatchPilot: A Cost-Efficient Software Engineering Agent (2025)
Source: https://arxiv.org/abs/2502.02747
Why it matters:
- A rule-based 5-step workflow matches or beats fully agentic approaches at a fraction of the cost (under $1/instance on SWE-bench).
Key takeaway:
- Adding an explicit localization step before generation measurably improves patch quality. Rule-based planning provides stability; agent-based planning provides peak performance. A hybrid uses rules as the default and escalates to agent planning on failure.
5-step workflow:
1. Reproduction — verify the issue is reproducible
2. Localization — retrieve relevant context from the codebase
3. Generation — produce the patch
4. Validation — run tests/checks
5. Refinement — iterate until validation passes
Implication:
- Before generating a fix, explicitly localize: find the affected files/functions and pull them into context.
- Use rule-based orchestration as the stable default; reserve LLM-driven planning for cases where the rule path has already failed.
- Treat refinement as a bounded loop with a fixed retry budget, not open-ended re-generation.
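A sketch of the validation-refinement stage as a bounded loop, per the implication above; `generate_patch`, `run_checks`, and `refine` are hypothetical stand-ins for the pipeline's stages:

```python
# Bounded refinement loop (sketch): fixed retry budget, feedback-driven
# revision, explicit escalation instead of open-ended re-generation.
def patch_with_budget(issue, generate_patch, run_checks, refine,
                      max_refinements: int = 3):
    patch = generate_patch(issue)
    for _ in range(max_refinements):
        report = run_checks(patch)          # tests, linters, repro script
        if report["passed"]:
            return patch
        patch = refine(patch, report["failures"])
    raise RuntimeError("refinement budget exhausted; escalate or re-plan")
```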
---
## Core Sources: Evaluation
### 23. Demystifying Evals for AI Agents (Anthropic, 2026)
Source: https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents
Why it matters:
- One of the best practical writeups on agent evals and reliability.
High-signal takeaways:
- Start early; 20-50 tasks is enough to begin.
- Write unambiguous tasks with reference solutions.
- Evaluate both "should do X" and "should not do X."
- Isolate trials from each other.
- Grade outputs/outcomes, not rigid exact traces.
- Calibrate model graders against humans.
- Read transcripts constantly.
- Treat eval-driven development as normal engineering.
Implication:
- Search/tool-use policies should be evaluated on both over-triggering and under-triggering.
- Coding agents should be graded on artifacts and tests, not whether they followed a specific thought path.
- If the model improved but the score did not, suspect the benchmark or grader too.
---
## Projects Worth Studying
### 1. karpathy/autoresearch
Source: https://github.com/karpathy/autoresearch
What to study:
- Extremely narrow loop, fixed optimization target, small mutable surface, experiment-first framing.
Copy: tight loop, fixed budget, metric-first automation.
Avoid: generalizing it into a broad orchestration layer unless evals justify it.
### 2. davebcn87/pi-autoresearch
Source: https://github.com/davebcn87/pi-autoresearch
What to study:
- Explicit session files, checks vs. crashes vs. metric logs, dashboard and widget feedback.
Copy:
- Make experiment state visible.
- Distinguish correctness failures from benchmark failures.
- Commit only after the right checks pass.
### 3. SWE-agent / mini-SWE-agent
Source: https://github.com/princeton-nlp/SWE-agent
What to study: repo-focused action surface, issue → inspect → edit → test loop, benchmark-first iteration.
Copy: narrow interface and strong harnessing.
### 4. OpenHands
Source: https://github.com/All-Hands-AI/OpenHands
What to study: broad workspace/runtime architecture, interactive software agent product design.
Copy carefully: runtime ergonomics and environment handling.
Risk: very easy to absorb too much framework complexity.
### 5. aider's architect/editor split
Source: https://aider.chat/2024/09/26/architect.html
What to study: separate high-level reasoning from concrete editing.
Copy: planner/editor separation can help when one lane should stay terse and execution-oriented.
Risk: only worth it if the split clearly improves results on your tasks.
### 6. ATLAS (Adaptive Test-time Learning and Autonomous Specialization)
Source: https://github.com/itigges22/ATLAS
What to study:
- Self-hosted coding agent achieving 74.6% on LiveCodeBench using a quantized 14B Qwen model on a single consumer GPU (RTX 5060 Ti, 16GB VRAM).
- Three-phase pipeline: **Generate → Verify → Repair**
- *Generate*: PlanSearch extracts constraints and produces diverse candidate solutions; Budget Forcing controls token spend during inference
- *Verify*: a "Geometric Lens" component scores candidates with a 5120-dimensional energy field (87.8% selection accuracy) plus sandboxed execution
- *Repair*: failing candidates trigger self-generated test cases and iterative refinement via PR-CoT (Problem-Repair Chain-of-Thought)
- ~$0.004/task in local electricity cost vs. $0.043-$0.066 for comparable API services; no external API calls.
Why it belongs here:
- Working proof that smart infrastructure — not model scale — can close the gap with frontier systems. Doubles baseline pass rate (from ~38% to 74.6%) entirely through the generation/verification/repair scaffold.
Copy:
- generate → external verify → self-repair loop as a default pattern for coding tasks.
- Budget forcing to limit token waste on low-confidence generations.
- Distinguish candidate selection accuracy from final pass rate — they are different metrics.
Avoid:
- Treating it as a general-purpose agent; optimized explicitly for LiveCodeBench.
- Its strictly sequential pipeline if throughput matters.
Risk: the Geometric Lens is described as undertrained; its verification signal could be a bottleneck on new domains.
---
## Distilled Rules for Orchestration
### Choosing the Architecture
- Default to a single LLM call or a simple workflow.
- Add an evaluator-optimizer loop when correctness matters and the task benefits from revision.
- Add multiple agents only when evals show clear gains from role separation.
- Use heterogeneous agents (different roles, models, or tool access), not homogeneous swarms.
### State Machine Design
- Minimum states: `PLANNING`, `EXECUTING`, `OBSERVING`, `ERROR_RECOVERY`, `DONE`.
- Each state: one well-defined prompt, allowed tools, explicit transition conditions.
- Error state is mandatory — not optional.
- Transition counter per state to prevent loops.
- Routing logic lives in code, not in LLM prompts.
### Memory
- Don't treat the context window as your only memory.
- Two minimum layers: in-context working buffer + persistent episodic store with structured metadata.
- Consolidation job: after task completion, abstract lessons into semantic memory.
- Memory entries: structured notes with description, keywords, tags, outcome at write time.
### Context Management
- Default: observation masking (replace old tool outputs with placeholders; keep reasoning chain).
- Trigger compaction at 70-80% of context limit — before the LLM call.
- Always preserve: original task spec, last N turns verbatim, current goal state.
- LLM summarization only as a fallback for single oversized responses.
### Error Recovery
- Retry with exponential backoff + jitter. Cap at 3 attempts.
- Circuit breaker at the LLM client level.
- Tool errors return as structured observations, not crashes.
- After 2 consecutive failures on the same action: planning reset.
- Per-task retry budget, not just per-call.
### Multi-Agent Design
- Use one agent unless there is a measured reason to split.
- Split by capability, not by story or persona.
- Small helpers do mechanical work; larger models handle synthesis and edge-case reasoning.
- Coordination prompts name the shared objective and each role's responsibility.
- Per-subagent permission scopes defined in config, not in prompts.
### Evaluation
- Run the same harness the product actually uses.
- Keep trials isolated.
- Track pass@1 and consistency, not just "found a good answer once."
- Review transcripts every week if the system matters.
- Rebuild evals from real failures and real manual checks.
---
## Anti-Patterns
- Too many homogeneous agents
- Persona-rich orchestration prompts with weak task constraints
- Unbounded self-reflection loops
- Auto-commits without validation
- Massive context files with no ownership
- Grading the exact tool path instead of the delivered outcome
- Building a platform before validating a narrow workflow
- Routing logic embedded in LLM prompts instead of code
- Missing error state in agent FSM
- Context growing unchecked until hard limit
- HITL approvals with raw JSON or no plain-language summary
- Agent-to-agent calls without correlation IDs
- Memory as only a raw vector store with no structured metadata
- Stale harness components after model upgrades
---
## What To Re-Read Often
- Anthropic, Building Effective AI Agents
- Anthropic, Harness Design for Long-Running Application Development
- Anthropic, Demystifying Evals for AI Agents
- Anthropic, Programmatic Tool Calling Docs
- BAIR, The Shift from Models to Compound AI Systems
- StateFlow (FSM-based agent loops)
- JetBrains Research, Efficient Context Management
- A-MEM (memory network design)
- Agentless
- SWE-agent
- karpathy/autoresearch
- pi-autoresearch
- ATLAS (generate → verify → repair; small-model infra vs. scale)
- Multi-Agent Orchestration for Deterministic Decision Support
---
## Update Policy
When adding a new source, prefer:
- primary paper
- official engineering article
- official project README or documentation
For each new source, capture:
- what it claims
- what to copy into orchestration design
- what to avoid
- whether it actually changes system design decisions
If it does not change design decisions, it probably does not belong here.
*Last updated: 2026-04-01*