Agent Orchestration and System Design
A practical, research-backed field guide for designing agent systems: workflows, multi-agent pipelines, memory, evaluation, and infrastructure.
Use it for:
- choosing between single-agent, workflow, and multi-agent designs
- orchestration patterns (sequential, parallel, evaluator-optimizer)
- agent memory and context management
- error recovery and production reliability
- evaluation and harness design
- tooling and automation loops
Prompt-level decisions (system prompt writing, CoT strategy, instruction following) live in Research-prompt.md.
Fast Takeaways
- Start with the simplest scaffold that can pass evals. Default to single-agent or workflow. Add agents only when evals show clear gains.
- Separate generator and evaluator roles. Self-evaluation is too lenient; an external evaluator is much stronger.
- Use different models or prompts only when they contribute distinct evidence or skills. Homogeneous agent swarms do not scale.
- Model your agent loop as an explicit state machine. Named states with typed transitions beat open-ended ReAct loops beyond 5–6 steps.
- Error states are first-class citizens. After two consecutive failures on the same action, return to planning — not more retries.
- Memory is a system: working buffer + episodic store + semantic rules. A raw vector store is not a memory architecture.
- Context window management: observation masking outperforms LLM summarization on cost and quality. Keep the reasoning chain; replace old tool outputs.
- Grade outcomes, artifacts, and grounded evidence — not exact tool-call traces.
- HITL approval is only useful if presented as a plain-language summary, not raw JSON.
- Instrument with OpenTelemetry from day one. Correlate traces across agent boundaries with parent-child span IDs.
What To Copy Into Systems
Orchestration
- Keep the default path single-agent or workflow-based.
- Add planners, reviewers, or specialist agents only when evals show clear gains.
- Prefer bounded loops: one plan phase, one act phase, one verifier, one retry budget.
- Use different models or prompts only when they contribute distinct evidence or skills.
- Treat multi-agent diversity as a tool, not a religion.
System Design
- One agent, one responsibility. Separate generator from evaluator from synthesizer.
- Small specialists for mechanical subproblems (grep, read, classify, run) are a real design pattern.
- Reserve expensive frontier models for hard reasoning and synthesis.
- Route by capability and role, not by "use more agents for quality."
- A compound system is not a complex system. Match complexity to business value.
Tooling and Action Surface
- Favor tools that return verifiable feedback: tests, compiler errors, search results, fetched pages, graders.
- Apply poka-yoke to every tool: use absolute filepaths, validate inputs before calling external services, return structured error objects, not raw exceptions (see the sketch after this list).
- Keep traces and artifacts.
- If the task depends on fresh, exact, or source-sensitive data, lookup beats memory.
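A minimal sketch of a poka-yoke tool that returns structured errors; the ToolError shape and read_file tool here are illustrative, not drawn from any of the sources below:

```python
from dataclasses import dataclass
from pathlib import Path

@dataclass
class ToolError:
    """Structured error observation fed back to the LLM instead of a raw exception."""
    tool: str
    error_type: str
    message: str
    suggested_action: str

def read_file(path: str) -> str | ToolError:
    """Poka-yoke file read: validate inputs before touching the filesystem."""
    p = Path(path)
    if not p.is_absolute():
        return ToolError(
            tool="read_file",
            error_type="invalid_input",
            message=f"Path must be absolute, got: {path}",
            suggested_action="Retry with an absolute path rooted at the workspace.",
        )
    if not p.exists():
        return ToolError(
            tool="read_file",
            error_type="not_found",
            message=f"No file at {path}",
            suggested_action="List the parent directory to locate the file.",
        )
    return p.read_text()
```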
Automation and Safety
- Fix a metric before running an autonomous loop.
- Keep the mutable surface small.
- Auto-commit only after checks pass.
- Separate "experiment failed" from "checks failed" from "metric regressed."
- Prefer narrow optimization targets over grand autonomous platform behavior.
Evaluation
- Build evals from real failures and real manual checks.
- Balance both sides of decision boundaries: "should do X" and "should not do X."
- Isolate trials — no shared repo state, hidden cache, or leaked history.
- Use deterministic graders where possible.
- Use LLM graders with clear rubrics and human calibration when needed.
- Read transcripts constantly. If metrics and transcripts disagree, suspect the harness or grader.
Core Sources: System Design
1. The Shift from Models to Compound AI Systems (BAIR, 2024)
Source: https://bair.berkeley.edu/blog/2024/02/18/compound-ai-systems/
Why it matters:
- Strong AI systems increasingly come from multiple interacting components, not just bigger base models. System design can improve quality faster than scaling alone.
Key takeaways:
- Access to current data, control, trust, and cost are often easier to solve at the system level than inside the model.
- Optimizing a compound system is a distinct engineering problem from optimizing a model.
Implication:
- Build around tools, retrievers, graders, and routers when they solve a real product problem.
- Do not mistake "compound system" for "maximally complex system."
2. Building Effective AI Agents (Anthropic, 2026)
Source: https://resources.anthropic.com/building-effective-ai-agents
Why it matters:
- High-quality practical guidance from a team operating real agent systems at scale.
The most useful framing:
- Choose between single-agent, workflow, and multi-agent designs intentionally.
- Use a small set of reusable patterns:
- Prompt chaining — sequential, each output feeds the next
- Routing — classify input, dispatch to specialist
- Parallelization — sectioning for independent subtasks; voting for confidence
- Orchestrator-workers — dynamic delegation for unpredictable subtasks
- Evaluator-optimizer — generate-then-critique loop for refineable outputs
- Match system complexity to business value.
Implication:
- Default to simple workflows first.
- Reach for evaluator-optimizer when correctness matters and the task benefits from revision.
- Reach for multi-agent only after single-agent/workflow baselines are exhausted.
- The most successful teams use "simple, composable patterns rather than complex frameworks."
Avoid:
- Using agent frameworks without understanding what they do under the hood — they often create abstraction layers that obscure prompts and responses.
- Adding agents before measuring whether simpler approaches fail.
3. Harness Design for Long-Running Application Development (Anthropic, 2026)
Source: https://www.anthropic.com/engineering/harness-design-long-running-apps
Why it matters:
- Strong, current practical guidance on long-running coding harnesses from a team actively tuning them against real app-building tasks.
Key takeaways:
- Separate generator and evaluator roles. Self-evaluation is too lenient; a tuned external evaluator is much more useful.
- Make "done" explicit before coding. Use a per-sprint contract negotiated between builder and evaluator.
- Keep planner output high-level and product-facing. Over-specifying low-level details too early cascades bad assumptions.
- Use evaluators that touch the environment directly. Playwright-driven QA against real UI/API behavior is much stronger than static inspection.
- Use hard thresholds for acceptance, not vague approval. If a criterion misses, fail the chunk and return actionable feedback.
- Preserve handoff artifacts and structured files between agents. File-based communication reduces drift across long runs.
- Context resets vs. compaction is a model-specific engineering choice. Resets help when the model shows context anxiety; stronger models may make resets unnecessary.
- Every harness component encodes an assumption about model weakness. Re-run ablations and remove scaffold that is no longer load-bearing after model upgrades.
- Evaluators are worth the cost near the model's capability boundary. When the model can already do the task reliably solo, evaluator passes can become mostly overhead.
- Read logs and calibrate the evaluator from real disagreements. Prompt tuning should come from concrete misses, not abstract "better QA" wishes.
Implication:
- Keep coder and reviewer/verifier separate when acceptance quality matters.
- Add an explicit contract or acceptance plan before implementation when the spec is high-level.
- Prefer grounded evaluator tools over reviewer vibes.
- Keep handoff state compact and structured enough to survive resets when resets are needed.
- Revisit harness complexity whenever the base model changes; stale scaffold is real technical debt.
4. Understanding Agent Scaling via Diversity (2026)
Source: arXiv:2602.03794
Why it matters:
- More homogeneous agents do not scale indefinitely; diversity matters more than count.
Key takeaway:
- Two meaningfully different agents can outperform a swarm of same-ish agents.
Implication:
- Diversity should come from role, model, tool access, or evidence channel.
- Do not duplicate the same model/prompt ten times and call it orchestration.
5. SOLVE-Med / MATA / Small-Model Orchestration (2025–2026)
Sources:
- SOLVE-Med: arXiv:2511.03542
- MATA: arXiv:2602.09642
Why they matter:
- Small specialized models, when orchestrated well, can outperform or match much larger standalone systems.
Key takeaway:
- Cheap specialists for mechanical subproblems are a real design pattern, not a hack.
Implication:
- Route grep/read/run/simple classification to cheaper lanes.
- Reserve expensive models for hard reasoning or integration steps.
6. Difficulty-Aware Agentic Orchestration (DAAO, 2025)
Source: arXiv:2509.11079
Why it matters:
- Not all subtasks need the same model size. A variational autoencoder estimating query difficulty + a cost/performance-aware router gives near-frontier quality at significantly lower cost.
Key takeaway:
- Difficulty-based routing is a high-leverage optimization most systems skip.
Implication:
- Classify task difficulty before dispatching to a model. Easy classification → cheap model. Hard synthesis → frontier model.
- This is the principled version of the "route to cheap specialists" heuristic.
7. Multi-Agent Orchestration for Deterministic Decision Support (2025)
Source: arXiv:2511.15755
Why it matters:
- 348 controlled trials: multi-agent orchestration achieved 100% actionable recommendation rate vs. 1.7% for single-agent, with 80x specificity and 140x correctness improvement at similar latency.
- The reframing: multi-agent orchestration is a production-readiness requirement, not a performance optimization. Consistent, deterministic quality is what enables SLA commitments.
Key takeaway:
- Single agents produce high-variance outputs. Multi-agent systems with clear role separation produce stable ones.
Implication:
- When variance is unacceptable (financial decisions, infrastructure changes, compliance tasks), multi-agent is not optional — it's the architecture that enables quality guarantees.
8. Emergent Coordination in Multi-Agent Systems (2025)
Source: arXiv:2510.05174
Why it matters:
- Coordination is better when agents share objectives and understand complementary roles.
Key takeaway:
- Role awareness is useful; vague social-role prompts are not enough.
Implication:
- When using multiple agents, explicitly describe what each one contributes and how outputs combine.
- Name the shared objective in the orchestrator prompt; name each agent's responsibility in its own prompt.
Core Sources: Memory
9. A-MEM: Agentic Memory for LLM Agents (NeurIPS 2025)
Source: https://arxiv.org/abs/2502.12110
Why it matters:
- A Zettelkasten-style memory network — structured notes with attributes, keywords, and tags — doubled complex reasoning performance vs. flat vector store baselines at lower token cost.
Key takeaways:
- Every memory node gets a structured note with contextual description, keywords, and tags at write time.
- An autonomous link-generation mechanism identifies connections via cosine similarity + LLM analysis.
- When a new memory is added, existing related memories are also updated — the memory network evolves.
Implication:
- At minimum, build two layers: a short-term in-context working buffer and a persistent episodic store with structured metadata per entry.
- Enrich every stored memory with metadata at write time (task context, success/failure outcome, timestamps, tags) — retrieval quality depends entirely on index richness.
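A minimal sketch of a structured memory entry along these lines; field names and the linking logic are assumptions, not A-MEM's actual schema:

```python
import time
from dataclasses import dataclass, field

@dataclass
class MemoryNote:
    """One episodic memory entry, enriched at write time (illustrative fields)."""
    content: str                                     # what happened
    context: str                                     # task the agent was doing
    outcome: str                                     # "success" | "failure"
    keywords: list[str] = field(default_factory=list)
    tags: list[str] = field(default_factory=list)
    created_at: float = field(default_factory=time.time)
    links: list[str] = field(default_factory=list)   # IDs of related notes

def write_memory(store: dict[str, MemoryNote], note_id: str,
                 note: MemoryNote, related_ids: list[str]) -> None:
    """Store the note and back-link related notes so the network evolves.
    A-MEM would pick `related_ids` via cosine similarity + LLM analysis (elided)."""
    note.links.extend(related_ids)
    store[note_id] = note
    for rid in related_ids:
        store[rid].links.append(note_id)
```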
Avoid:
- Flat vector stores with no structural metadata — retrieval becomes a bag-of-embeddings lottery.
- Unbounded episodic stores without consolidation or eviction policies.
10. Episodic Memory is the Missing Piece for Long-Term LLM Agents (2025)
Source: https://arxiv.org/abs/2502.06975
Why it matters:
- Of the four memory tiers (working, episodic, semantic, procedural), episodic memory is the most underinvested and the key enabler for genuine long-term agent improvement.
Key takeaway:
- Time-stamped traces of specific past task runs enable single-shot learning from concrete prior instances. Without episodic memory, agents keep relearning the same lessons.
Implication:
- Implement an episodic-to-semantic consolidation job: after task completion, abstract successful patterns from the episode trace into reusable rules in semantic memory.
- For multi-agent systems: distinguish per-agent private episodic memory from shared semantic memory. Sharing raw episodes risks leakage; sharing distilled rules is safer.
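A sketch of what that consolidation job could look like, assuming episodes are stored as dicts and `llm` is a hypothetical prompt-to-text callable:

```python
def consolidate(episodes: list[dict], llm) -> list[str]:
    """Episodic-to-semantic consolidation job (sketch): distill successful
    episode traces into reusable rules for semantic memory."""
    wins = [e for e in episodes if e.get("outcome") == "success"]
    if not wins:
        return []
    prompt = (
        "Extract general, reusable rules from these successful task traces:\n"
        + "\n".join(f"- {e['context']}: {e['content']}" for e in wins)
    )
    # One rule per line; strip list markers and blanks before storing.
    return [line.lstrip("- ").strip() for line in llm(prompt).splitlines()
            if line.strip()]
```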
Core Sources: Context Management
11. Cutting Through the Noise: Efficient Context Management (JetBrains Research, Dec 2025)
Source: https://blog.jetbrains.com/research/2025/12/efficient-context-management/
Why it matters:
- Both common strategies (observation masking and LLM summarization) cut costs >50% vs. unmanaged context. But observation masking matched or outperformed summarization in 4 of 5 configurations, at lower complexity.
Key takeaways:
- Observation masking: replaces older tool outputs/file contents with a placeholder, keeps the reasoning chain intact. Fast, cheap, no extra LLM calls.
- LLM summarization: compresses old turns. Slower, more expensive, and paradoxically caused agents to run ~15% longer trajectories (summaries gave false confidence to keep going).
- With Qwen3-Coder 480B, masking achieved 2.6% higher solve rates while being 52% cheaper.
- A 2026 industry report attributed ~65% of enterprise AI failures to "context drift" — accumulated noise causing agents to lose track of their goal.
Implication:
- Default to observation masking as the primary compaction strategy. Keep the reasoning chain; replace tool outputs after a rolling window.
- Add LLM summarization as a fallback only when a single tool response is too large to fit once.
- Set a hard token budget before each agent turn. Trigger compaction when projected input exceeds 70–80% of the context limit — before the LLM call, not after.
- Always preserve: original task specification, the most recent N turns verbatim, current goal state.
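A minimal observation-masking sketch under these rules; the {"role", "content"} message shape and the `count_tokens` callable are assumptions:

```python
MASK = "[tool output elided; re-run the tool if needed]"

def mask_observations(messages: list[dict], keep_last: int = 4) -> list[dict]:
    """Observation masking (sketch): blank tool outputs older than the last
    `keep_last` messages; reasoning turns and the task spec stay verbatim."""
    cutoff = len(messages) - keep_last
    return [{**m, "content": MASK} if m["role"] == "tool" and i < cutoff else m
            for i, m in enumerate(messages)]

def maybe_compact(messages: list[dict], count_tokens, limit: int) -> list[dict]:
    """Check the budget before the LLM call: compact when projected input
    exceeds ~75% of the context limit. `count_tokens` is tokenizer-specific."""
    if count_tokens(messages) > 0.75 * limit:
        return mask_observations(messages)
    return messages
```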
Avoid:
- LLM summarization as the primary strategy — slower, more expensive, longer trajectories.
- Letting context grow unchecked — quality degrades well before the hard limit due to lost-in-the-middle effects.
- Resetting context entirely for long-running tasks — you lose accumulated plan state.
Core Sources: State Machines and Control Flow
12. StateFlow: Enhancing LLM Task-Solving through State-Driven Workflows (2024)
Source: https://arxiv.org/abs/2403.11322
Why it matters:
- Modeling a task as a finite state machine (FSM) with six components — States, Initial state, Final states, Output functions, Transitions, Context history — yielded 63.73% success on SQL tasks vs. 40.3% for ReAct, at 5.8x lower cost.
Key takeaways:
- Removing the explicit Error state caused a 5% success rate decline — error handling as a named state is critical.
- A specialist FSM variant (SF_Agent) with separate LLMs per state further reduced token usage.
Implication:
- Model your agent loop as an explicit FSM (a minimal sketch follows this list). Minimum viable states: PLANNING, EXECUTING, OBSERVING, ERROR_RECOVERY, DONE.
- Each state should have: a single well-defined LLM prompt, allowed tools, and explicit transition conditions.
- Add an ERROR state as a first-class citizen with its own prompt and recovery transitions.
- Use a transition counter per state (max N transitions before forcing fallback or human escalation) to prevent runaway loops.
- For complex multi-agent systems, define the FSM in a declarative config (YAML/JSON) rather than code — makes control flow auditable.
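A minimal FSM loop along these lines; all names are illustrative, and the real StateFlow components (output functions, context history) are richer:

```python
from enum import Enum, auto

class State(Enum):
    PLANNING = auto()
    EXECUTING = auto()
    OBSERVING = auto()
    ERROR_RECOVERY = auto()
    DONE = auto()

MAX_VISITS = 5  # per-state transition budget before forcing escalation

def run_agent(task: str, handlers: dict) -> str:
    """Minimal FSM loop (sketch). `handlers` maps each State to a function
    (ctx -> (next_state, ctx)); the LLM acts only inside handlers, while
    routing stays deterministic in code."""
    state, ctx = State.PLANNING, {"task": task}
    visits: dict[State, int] = {}
    while state is not State.DONE:
        visits[state] = visits.get(state, 0) + 1
        if visits[state] > MAX_VISITS:
            raise RuntimeError(f"Runaway loop in {state.name}; escalate to a human")
        state, ctx = handlers[state](ctx)
    return ctx.get("result", "")
```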
Avoid:
- Pure ReAct loops for tasks requiring more than 5–6 steps — they accumulate drift and have no recovery path when stuck.
- Embedding transition logic in the LLM prompt ("decide what to do next") — the LLM is unreliable as a state router. Keep routing deterministic in code.
Core Sources: Parallelization
13. Parallelization and Scatter-Gather Patterns (AWS Prescriptive Guidance, 2025)
Why it matters:
- Structured scatter-gather (coordinator dispatches N independent subtasks, aggregator synthesizes) is the most battle-tested pattern for parallelizing LLM work.
Key takeaways:
- The aggregator is the critical bottleneck. It must handle partial failures gracefully.
- Use correlation IDs to match results to requests.
- Allow downstream tasks to start early on streaming outputs from upstream tasks where dependencies permit.
- Keep fan-out degree below ~20 parallel agents — coordination overhead grows non-linearly.
Implication:
- Structure fan-out tasks with explicit contracts: each subtask specifies inputs it consumes and the exact output schema it must produce.
- Design the aggregator as a separate, dedicated role with a prompt focused purely on synthesis and conflict resolution.
- Use async fan-out with per-task timeouts and a minimum quorum: "proceed to aggregation once 80% of tasks complete or 30 seconds elapse."
- Route cheap classification/filtering steps to small models; reserve large models for synthesis.
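A sketch of quorum-based fan-out, assuming an async `worker` callable and using list indices as correlation IDs:

```python
import asyncio

async def scatter_gather(subtasks, worker, quorum: float = 0.8,
                         deadline: float = 30.0) -> list[dict]:
    """Fan-out with per-task results, a minimum quorum, and a deadline (sketch):
    proceed to aggregation once `quorum` of tasks complete or `deadline`
    seconds elapse, whichever comes first."""
    tasks = {asyncio.create_task(worker(spec)): i for i, spec in enumerate(subtasks)}
    need = max(1, int(quorum * len(tasks)))
    done: set = set()
    pending = set(tasks)
    loop = asyncio.get_running_loop()
    end = loop.time() + deadline
    while pending and len(done) < need and loop.time() < end:
        just_done, pending = await asyncio.wait(
            pending, timeout=end - loop.time(),
            return_when=asyncio.FIRST_COMPLETED)
        done |= just_done
    for t in pending:
        t.cancel()  # cancel stragglers; the aggregator works with partial results
    results = []
    for t in done:
        cid = tasks[t]
        if t.exception() is None:
            results.append({"correlation_id": cid, "ok": True, "result": t.result()})
        else:
            results.append({"correlation_id": cid, "ok": False,
                            "error": str(t.exception())})
    return results
```

Called as `asyncio.run(scatter_gather(specs, worker))`; the aggregator then receives only the structured result dicts, never worker transcripts.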
Avoid:
- Parallelizing tasks with implicit ordering dependencies.
- Using the same large model for every subtask — expensive and unnecessary for simple tasks.
- Stateful or context-heavy aggregators — they should receive clean, structured outputs from workers, not full conversation transcripts.
14. Orla: A Library for Serving LLM-Based Multi-Agent Systems (2026)
Source: https://arxiv.org/abs/2603.13605
Why it matters:
- Stage-level model routing (small model for classification, large model for synthesis) cut wall-clock time by 38%, mean completion time by 60%, at 35% lower cost vs. single-model baselines on SWE-bench Lite.
Key takeaway:
- Model routing at the workflow stage level is a high-leverage optimization that most systems skip.
Implication:
- Assign model tiers to workflow stages at design time, not at runtime by the LLM.
- Workflow-level KV cache management (preserve cache across stages sharing context prefixes) delivers measurable latency gains.
Core Sources: Error Recovery
15. Retries, Fallbacks, and Circuit Breakers in LLM Apps (Portkey / Maxim, 2025)
Sources:
- https://portkey.ai/blog/retries-fallbacks-and-circuit-breakers-in-llm-apps/
- https://www.getmaxim.ai/articles/retries-fallbacks-and-circuit-breakers-in-llm-apps-a-production-guide/
Why it matters:
- Three complementary patterns form the production resilience stack. Without all three, you get retry storms or cascading provider failures.
Key takeaways:
- Retries: for transient errors (network, rate limits). Exponential backoff + jitter. Max 3 attempts. Anti-pattern: retrying persistent failures.
- Fallbacks: for provider-level failures. Switch to alternate model/provider. Anti-pattern: reactive fallback waits for timeout; shared-infrastructure fallbacks fail identically.
- Circuit breakers: for systematic degradation. Monitor failure rate over a rolling window; remove the endpoint from routing when it exceeds a threshold. Proactive, not reactive.
- For agent tool errors: feed the formatted error back to the LLM as a structured observation — not a crash. Let the agent decide to retry with modified parameters, try an alternative tool, or revise its plan.
- Define per-task max-retry budgets: an agent that retries the same tool call 10 times in one task is stuck, not recovering.
Implication:
- Implement retries with exponential backoff + jitter: base_delay * (2^attempt) + random(0, base_delay). Cap at 3 attempts, 30 seconds max total (sketch below).
- Implement fallbacks across at least two LLM providers for any production agent.
- Implement circuit breakers at the LLM client level: open after 5 failures in 60 seconds, cooldown 30 seconds.
- For agent tool errors, return structured error observations: {tool, error_type, message, suggested_action}.
- After 2 consecutive tool failures on the same action, force a planning reset (back to PLANNING in the FSM).
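A sketch of the retry and circuit-breaker rules above; `TransientError` is a stand-in for whatever retryable errors your client raises:

```python
import random
import time

class TransientError(Exception):
    """Stand-in for retryable errors (network, rate limit). Persistent
    failures (bad request, auth) should not be retried."""

def retry_with_backoff(call, base_delay: float = 1.0, max_attempts: int = 3,
                       max_total: float = 30.0):
    """Exponential backoff + jitter per the formula above (sketch).
    `call` is a hypothetical zero-argument callable."""
    start = time.monotonic()
    for attempt in range(max_attempts):
        try:
            return call()
        except TransientError:
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            if attempt == max_attempts - 1 or time.monotonic() - start + delay > max_total:
                raise
            time.sleep(delay)

class CircuitBreaker:
    """Open after `threshold` failures within `window` seconds; allow a
    probe call again after `cooldown` (half-open)."""
    def __init__(self, threshold: int = 5, window: float = 60.0,
                 cooldown: float = 30.0):
        self.threshold, self.window, self.cooldown = threshold, window, cooldown
        self.failures: list[float] = []
        self.opened_at: float | None = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        return time.monotonic() - self.opened_at >= self.cooldown

    def record(self, ok: bool) -> None:
        now = time.monotonic()
        if ok:
            self.failures.clear()
            self.opened_at = None
            return
        self.failures = [t for t in self.failures if now - t < self.window] + [now]
        if len(self.failures) >= self.threshold:
            self.opened_at = now
```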
Avoid:
- Catching all exceptions and silently continuing — the agent will proceed with incomplete state.
- Retrying non-idempotent mutations without deduplication keys.
- Identical retry and fallback strategies for rate-limit errors vs. model quality errors — these require different handling.
Core Sources: Agent Communication Protocols
16. Survey of Agent Interoperability Protocols: MCP, ACP, A2A, ANP (2025)
Source: https://arxiv.org/abs/2505.02279
Why it matters:
- Four protocols now occupy distinct architectural layers. Choosing the wrong one for the wrong layer creates maintenance debt.
Key findings:
| Protocol | Layer | Best for |
|---|---|---|
| MCP (Anthropic) | LLM ↔ Tools | Tool/resource injection into a single LLM |
| ACP (IBM/BeeAI) | Agent ↔ Agent | Model-agnostic, polyglot agent ecosystems |
| A2A (Google) | Agent ↔ Agent (enterprise) | Trusted enterprise inter-agent task delegation |
| ANP | Agent ↔ Internet | Open-internet, trustless agent discovery |
Recommended adoption: Start with MCP for tool access. Layer ACP for richer agent-to-agent messaging. Implement A2A within organizational boundaries. Extend to ANP only for internet-scale interoperability.
Implication:
- Use MCP today for everything that is "give an LLM access to a tool or data source." Most mature, widest tooling support.
- For agent-to-agent calls within your own system, a well-structured JSON message over HTTP with a correlation ID and defined output schema is sufficient and more debuggable than adopting a new protocol.
- Define a standard task envelope for all handoffs: {task_id, parent_task_id, agent_role, input_schema, output_schema, deadline, status, error} (typed sketch below).
- Store task state externally (Redis, Postgres), not inside agent memory — so any agent can resume a task after failure.
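A typed sketch of that envelope; the concrete field types are assumptions:

```python
from typing import Literal, Optional, TypedDict

class TaskEnvelope(TypedDict):
    """Standard handoff envelope (sketch of the fields listed above)."""
    task_id: str
    parent_task_id: Optional[str]
    agent_role: str
    input_schema: dict     # what the worker consumes
    output_schema: dict    # what it must produce
    deadline: str          # ISO-8601 timestamp
    status: Literal["pending", "running", "done", "failed"]
    error: Optional[dict]  # structured error observation, if failed
```

Persist envelopes keyed by task_id in the external store so any agent can pick up a failed task.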
Avoid:
- Passing full conversation transcripts between agents — pass structured outputs only.
- Deep delegation chains (A → B → C → D) without a policy layer enforcing permissions at each hop.
- Inventing bespoke message formats per integration — creates an N×M maintenance problem.
Core Sources: LangGraph and Framework Patterns
17. LangGraph Orchestration Framework (2024–2025)
Source: https://www.langchain.com/langgraph
Why it matters:
- LangGraph's six primitives — Nodes, Edges, State, Checkpointing, Interrupts, Concurrency — remove real infrastructure boilerplate. But the framework also adds overhead when the task doesn't need these features.
Key takeaways:
- Checkpointing is the highest-ROI feature. Durable mid-run state means agents can recover from crashes without replaying from scratch.
- Interrupts are the cleanest available HITL pausing implementation — a first-class primitive.
- Typed State enforces a shared schema across all nodes, preventing the "agent passed the wrong keys" bug class.
- For simple linear workflows, LangGraph adds boilerplate with no functional gain — a plain Python function chain is faster to write and easier to debug.
- Research shows >75% of multi-agent systems become difficult to manage once they exceed 5 agents — LangGraph doesn't solve cognitive complexity of large graphs.
Implication:
- Use LangGraph when you need any of: checkpointing, HITL interrupts, conditional branching on LLM output, or parallel node execution with join semantics.
- Keep graphs small and flat. More than 8–10 nodes is a design smell — split into sub-graphs with clear interfaces.
- Use StateGraph with a TypedDict state schema from day one. Untyped state dicts create subtle bugs (see the sketch below).
- Use a persistent backend (Redis or Postgres) for checkpointing on any workflow longer than a few minutes.
- Do not use LangChain's high-level agent abstractions (AgentExecutor, create_react_agent) in production — they hide retry logic and error handling you need to control explicitly.
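A minimal evaluator-optimizer graph showing these primitives, assuming a recent LangGraph release (import paths have moved across versions):

```python
from typing import TypedDict
from langgraph.graph import StateGraph, END
from langgraph.checkpoint.memory import MemorySaver

class AgentState(TypedDict):
    task: str
    draft: str
    approved: bool

def generate(state: AgentState) -> dict:
    return {"draft": f"solution for {state['task']}"}  # generator LLM call goes here

def evaluate(state: AgentState) -> dict:
    return {"approved": bool(state["draft"])}          # evaluator LLM call goes here

graph = StateGraph(AgentState)
graph.add_node("generate", generate)
graph.add_node("evaluate", evaluate)
graph.set_entry_point("generate")
graph.add_edge("generate", "evaluate")
graph.add_conditional_edges(
    "evaluate",
    lambda s: END if s["approved"] else "generate",    # routing stays in code
)
app = graph.compile(checkpointer=MemorySaver())        # swap for a durable saver in prod
result = app.invoke(
    {"task": "demo", "draft": "", "approved": False},
    config={"configurable": {"thread_id": "t-1"}},     # thread_id keys the checkpoint
)
```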
Avoid:
- Using LangGraph for simple prompt-chaining pipelines with no branching — overhead unjustified.
- Debugging via framework logs alone — instrument raw LLM inputs/outputs with LangSmith or equivalent.
Core Sources: Human-in-the-Loop
18. Human-in-the-Loop Patterns (Permit.io / LangChain Docs, 2025)
Sources:
- https://www.permit.io/blog/human-in-the-loop-for-ai-agents-best-practices-frameworks-use-cases-and-demo
- https://docs.langchain.com/oss/python/langchain/human-in-the-loop
- The Human-in-the-Loop Illusion: https://www.resilientcyber.io/p/the-human-in-the-loop-illusion
Why it matters:
- HITL provides false confidence if approvers are presented with raw JSON or 50-step action summaries. Humans rubber-stamp without understanding. Good HITL requires human-readable summaries.
Four distinct HITL patterns:
- Interrupt & Resume — agent pauses at a checkpoint, waits for decision, resumes. Best for irreversible action authorization.
- Human-as-a-Tool — the agent treats human judgment as a callable service for genuine uncertainty. Best for ambiguous inputs.
- Approval Flows — role-based, policy-driven authorization for action classes. Best for financial/compliance workflows.
- Fallback Escalation — failed or permission-denied tasks route to humans via async channels. Best for lower-urgency decisions.
Key trigger criteria: access control changes, infrastructure modifications, destructive operations, financial transactions, operations outside the agent's intended scope. Heuristic: "Would I be okay if the agent did this without asking me?"
Implication:
- Define your HITL trigger policy in a config file, not in agent prompts. Specify: action classes requiring approval, required approver role, timeout behavior, fallback path if no response.
- Present approval requests as plain-language summaries: "Agent wants to delete 3 files in /prod/config: [list]. Reason: [reason]. Approve?" — never raw tool schemas.
- For async approval, save full agent state to durable storage before suspending.
- Set a maximum wait time for approval (e.g., 4 hours for low-stakes, 10 minutes for blocking workflows).
- Log every HITL interaction (request, approver, decision, timestamp) for audit trails.
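A sketch of rendering a tool call as a plain-language approval request; the `action` shape here is an assumption, not a standard schema:

```python
def approval_summary(action: dict) -> str:
    """Turn a pending tool call into a human-readable approval request."""
    return (
        f"Agent wants to run `{action['tool']}` affecting {action['target']}.\n"
        f"Reason: {action['reason']}\n"
        f"Details: {action['summary']}\n"
        "Approve? [y/N]"
    )

# Example: the approver sees this, never the raw tool-call JSON.
print(approval_summary({
    "tool": "delete_files",
    "target": "3 files in /prod/config",
    "reason": "stale configs superseded by v2 templates",
    "summary": "app.old.yaml, db.old.yaml, cache.old.yaml",
}))
```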
Avoid:
- Requiring HITL approval for every step — creates approval fatigue leading to rubber-stamping.
- Presenting raw LLM output or tool call JSON to approvers.
- HITL with no timeout — human unavailability should be a handled failure mode.
Core Sources: Observability
19. AI Agent Observability with OpenTelemetry (OTEL, 2025)
Source: https://opentelemetry.io/blog/2025/ai-agent-observability/
Why it matters:
- The industry is converging on OTEL as the standard for AI agent telemetry. Emit once, route to any backend without vendor lock-in. GenAI Semantic Conventions are being standardized.
Key takeaways:
- Multi-agent traces must reconstruct why an agent made a decision, not just what it did and how long it took. This requires correlation IDs linking all calls in a single task and parent-child span relationships across agent boundaries.
- Datadog launched AI Agent Monitoring (DASH 2025). Microsoft integrated multi-agent observability across Semantic Kernel, LangGraph, LangChain, OpenAI Agents SDK.
- Two audiences: LangSmith for prompt-level debugging in development; Datadog/Grafana for production operational monitoring.
Implication:
- Instrument with OTEL from day one using GenAI semantic conventions.
- Emit a trace per agent task with: task ID, parent task ID, agent role, FSM state transitions, tool calls as child spans, final output, success/failure, token counts per step.
- Propagate correlation IDs through all agent-to-agent calls. Without this, multi-agent debugging is blind.
- Alert on: agent loop depth >10 steps without completion, tool error rate >20% on any tool over 5 minutes, token-per-task cost above 2x baseline.
- Capture the full state object at each checkpoint — this enables time-travel debugging.
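A minimal OTEL sketch of per-task spans with parent/child structure; the attribute keys here are illustrative, and the GenAI semantic conventions define the canonical ones:

```python
from opentelemetry import trace

tracer = trace.get_tracer("agent-orchestrator")

def run_task(task_id: str, parent_task_id: str | None, agent_role: str) -> None:
    # One span per agent task; anything started inside this context
    # (tool calls, child-agent calls) becomes a child span automatically.
    with tracer.start_as_current_span("agent.task") as span:
        span.set_attribute("task.id", task_id)
        if parent_task_id:
            span.set_attribute("task.parent_id", parent_task_id)
        span.set_attribute("agent.role", agent_role)
        with tracer.start_as_current_span("tool.call") as tool_span:
            tool_span.set_attribute("tool.name", "read_file")
            # ...execute the tool, record success/failure on tool_span...
        span.set_attribute("agent.success", True)
```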
Avoid:
- Logging only the final output of each agent.
- Building custom observability infrastructure when OTEL + a backend is available.
- Storing raw conversation histories as the only observability artifact.
- Instrumenting only happy-path flows — errors, retries, and HITL interrupts must also emit structured spans.
Core Sources: Software Engineering Agents
20. SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering (Yang et al., 2024)
Source: https://arxiv.org/abs/2405.15793
Why it matters:
- The interface between model and environment is part of the model's performance.
Key takeaway:
- Good repo agents need a disciplined shell/editor/test interface, not just a better prompt.
Implication:
- Design the action surface carefully.
- Short loops over read/search/edit/test beat abstract planning without execution.
21. Agentless: Demystifying LLM-based Software Engineering Agents (Xia et al., 2024)
Source: https://arxiv.org/abs/2407.01489
Why it matters:
- A simpler pipeline can outperform complex software agents at lower cost.
Key takeaway:
- Simpler decomposition often beats a giant autonomous loop.
Implication:
- Always benchmark against a simpler non-agentic or lightly agentic baseline.
- If a full agent loop is not clearly better, cut it.
22. PatchPilot: A Cost-Efficient Software Engineering Agent (2025)
Source: https://arxiv.org/abs/2502.02747
Why it matters:
- A rule-based 5-step workflow matches or beats fully agentic approaches at a fraction of the cost (under $1/instance on SWE-bench).
Key takeaway:
- Adding an explicit localization step before generation measurably improves patch quality. Rule-based planning provides stability; agent-based planning provides peak performance. A hybrid uses rules as the default and escalates to agent planning on failure.
5-step workflow:
- Reproduction — verify the issue is reproducible
- Localization — retrieve relevant context from the codebase
- Generation — produce the patch
- Validation — run tests/checks
- Refinement — iterate until validation passes
Implication:
- Before generating a fix, explicitly localize: find the affected files/functions and pull them into context.
- Use rule-based orchestration as the stable default; reserve LLM-driven planning for cases where the rule path has already failed.
- Treat refinement as a bounded loop with a fixed retry budget, not open-ended re-generation.
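A sketch of that bounded refinement loop; this is the pattern, not PatchPilot's actual code, and all callables are hypothetical:

```python
def generate_validate_refine(task, generate, validate, refine, budget: int = 3):
    """Bounded loop: generate once, then validate/refine at most `budget` times."""
    patch = generate(task)
    for _ in range(budget):
        report = validate(patch)       # run tests/checks
        if report["passed"]:
            return patch
        patch = refine(patch, report)  # feed failures back, not open-ended regen
    raise RuntimeError("Refinement budget exhausted; escalate or replan")
```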
Core Sources: Evaluation
23. Demystifying Evals for AI Agents (Anthropic, 2026)
Source: https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents
Why it matters:
- One of the best practical writeups on agent evals and reliability.
High-signal takeaways:
- Start early; 20–50 tasks is enough to begin.
- Write unambiguous tasks with reference solutions.
- Evaluate both "should do X" and "should not do X."
- Isolate trials from each other.
- Grade outputs/outcomes, not rigid exact traces.
- Calibrate model graders against humans.
- Read transcripts constantly.
- Treat eval-driven development as normal engineering.
Implication:
- Search/tool-use policies should be evaluated on both over-triggering and under-triggering.
- Coding agents should be graded on artifacts and tests, not whether they followed a specific thought path.
- If the model improved but the score did not, suspect the benchmark or grader too.
Projects Worth Studying
1. karpathy/autoresearch
Source: https://github.com/karpathy/autoresearch
What to study:
- Extremely narrow loop, fixed optimization target, small mutable surface, experiment-first framing.
Copy: tight loop, fixed budget, metric-first automation.
Avoid: generalizing it into a broad orchestration layer unless evals justify it.
2. davebcn87/pi-autoresearch
Source: https://github.com/davebcn87/pi-autoresearch
What to study:
- Explicit session files, checks vs. crashes vs. metric logs, dashboard and widget feedback.
Copy:
- Make experiment state visible.
- Distinguish correctness failures from benchmark failures.
- Commit only after the right checks pass.
3. SWE-agent / mini-SWE-agent
Source: https://github.com/princeton-nlp/SWE-agent
What to study: repo-focused action surface, issue → inspect → edit → test loop, benchmark-first iteration.
Copy: narrow interface and strong harnessing.
4. OpenHands
Source: https://github.com/All-Hands-AI/OpenHands
What to study: broad workspace/runtime architecture, interactive software agent product design.
Copy carefully: runtime ergonomics and environment handling.
Risk: very easy to absorb too much framework complexity.
5. aider's architect/editor split
Source: https://aider.chat/2024/09/26/architect.html
What to study: separate high-level reasoning from concrete editing.
Copy: planner/editor separation can help when one lane should stay terse and execution-oriented.
Risk: only worth it if the split clearly improves results on your tasks.
6. ATLAS (Adaptive Test-time Learning and Autonomous Specialization)
Source: https://github.com/itigges22/ATLAS
What to study:
- Self-hosted coding agent achieving 74.6% on LiveCodeBench using a quantized 14B Qwen model on a single consumer GPU (RTX 5060 Ti, 16GB VRAM).
- Three-phase pipeline: Generate → Verify → Repair
- Generate: PlanSearch extracts constraints and produces diverse candidate solutions; Budget Forcing controls token spend during inference
- Verify: a "Geometric Lens" component scores candidates with a 5120-dimensional energy field (87.8% selection accuracy) plus sandboxed execution
- Repair: failing candidates trigger self-generated test cases and iterative refinement via PR-CoT (Problem-Repair Chain-of-Thought)
- ~$0.004/task in local electricity vs. $0.043–$0.066 for comparable API services; no external API calls.
Why it belongs here:
- Working proof that smart infrastructure — not model scale — can close the gap with frontier systems. Doubles baseline pass rate (from ~38% to 74.6%) entirely through the generation/verification/repair scaffold.
Copy:
- generate → external verify → self-repair loop as a default pattern for coding tasks.
- Budget forcing to limit token waste on low-confidence generations.
- Distinguish candidate selection accuracy from final pass rate — they are different metrics.
Avoid:
- Treating it as a general-purpose agent; optimized explicitly for LiveCodeBench.
- Running the pipeline strictly sequentially if throughput matters.
Risk: the Geometric Lens is described as undertrained; verification signal could be a bottleneck on new domains.
Distilled Rules for Orchestration
Choosing the Architecture
- Default to a single LLM call or a simple workflow.
- Add an evaluator-optimizer loop when correctness matters and the task benefits from revision.
- Add multiple agents only when evals show clear gains from role separation.
- Use heterogeneous agents (different roles, models, or tool access), not homogeneous swarms.
State Machine Design
- Minimum states: PLANNING, EXECUTING, OBSERVING, ERROR_RECOVERY, DONE.
- Each state: one well-defined prompt, allowed tools, explicit transition conditions.
- Error state is mandatory — not optional.
- Transition counter per state to prevent loops.
- Routing logic lives in code, not in LLM prompts.
Memory
- Don't treat the context window as your only memory.
- Two minimum layers: in-context working buffer + persistent episodic store with structured metadata.
- Consolidation job: after task completion, abstract lessons into semantic memory.
- Memory entries: structured notes with description, keywords, tags, outcome at write time.
Context Management
- Default: observation masking (replace old tool outputs with placeholders; keep reasoning chain).
- Trigger compaction at 70–80% of context limit — before the LLM call.
- Always preserve: original task spec, last N turns verbatim, current goal state.
- LLM summarization only as a fallback for single oversized responses.
Error Recovery
- Retry with exponential backoff + jitter. Cap at 3 attempts.
- Circuit breaker at the LLM client level.
- Tool errors return as structured observations, not crashes.
- After 2 consecutive failures on the same action: planning reset.
- Per-task retry budget, not just per-call.
Multi-Agent Design
- Use one agent unless there is a measured reason to split.
- Split by capability, not by story or persona.
- Small helpers do mechanical work; larger models handle synthesis and edge-case reasoning.
- Coordination prompts name the shared objective and each role's responsibility.
- Per-subagent permission scopes defined in config, not in prompts.
Evaluation
- Run the same harness the product actually uses.
- Keep trials isolated.
- Track pass@1 and consistency, not just "found a good answer once."
- Review transcripts every week if the system matters.
- Rebuild evals from real failures and real manual checks.
Anti-Patterns
- Too many homogeneous agents
- Persona-rich orchestration prompts with weak task constraints
- Unbounded self-reflection loops
- Auto-commits without validation
- Massive context files with no ownership
- Grading the exact tool path instead of the delivered outcome
- Building a platform before validating a narrow workflow
- Routing logic embedded in LLM prompts instead of code
- Missing error state in agent FSM
- Context growing unchecked until hard limit
- HITL approvals with raw JSON or no plain-language summary
- Agent-to-agent calls without correlation IDs
- Memory as only a raw vector store with no structured metadata
- Stale harness components after model upgrades
What To Re-Read Often
- Anthropic, Building Effective AI Agents
- Anthropic, Harness Design for Long-Running Application Development
- Anthropic, Demystifying Evals for AI Agents
- Anthropic, Programmatic Tool Calling Docs
- BAIR, The Shift from Models to Compound AI Systems
- StateFlow (FSM-based agent loops)
- JetBrains Research, Efficient Context Management
- A-MEM (memory network design)
- Agentless
- SWE-agent
- karpathy/autoresearch
- pi-autoresearch
- ATLAS (generate → verify → repair; small-model infra vs. scale)
- Multi-Agent Orchestration for Deterministic Decision Support
Update Policy
When adding a new source, prefer:
- primary paper
- official engineering article
- official project README or documentation
For each new source, capture:
- what it claims
- what to copy into orchestration design
- what to avoid
- whether it actually changes system design decisions
If it does not change design decisions, it probably does not belong here.
Last updated: 2026-04-01