Agent Systems Research Notes

This file is a practical, research-backed field guide for building agentic systems.

Use it for:

  • prompt design
  • orchestration decisions
  • tool-use policy
  • automation loop design
  • eval and reliability practices

The goal is not to collect every paper. The goal is to keep the highest-signal findings that actually change how an agent or scaffold should be built.

Fast Takeaways

  1. Start with the simplest scaffold that can pass evals. Do not default to multi-agent.
  2. Use tools when the task depends on current facts, exact details, or environment feedback.
  3. Grade outcomes, artifacts, and grounded evidence, not exact tool-call traces.
  4. Separate cheap mechanical work from expensive reasoning.
  5. Use reflection/revision only when it improves measured performance more than it hurts latency/cost.
  6. Keep prompts short, constraint-like, and verification-oriented. Avoid persona-heavy prompt sludge.
  7. Read transcripts. If metrics and transcripts disagree, the harness or grader may be wrong.
  8. Heterogeneous systems beat piles of homogeneous agents when the roles are genuinely different.
  9. External feedback beats self-confidence. Tests, search results, compiler output, and graders matter.
  10. Narrow loops outperform vague autonomy. Small mutable surface, fixed metric, bounded retries.

What To Copy Into Systems

Prompting

  • State objective, constraints, and success criteria explicitly (a minimal prompt skeleton following these rules is sketched after this list).
  • Preserve exact terms from the user or evidence; do not rename concrete entities.
  • Prefer a short rule like "if not verified, say so" over long keyword lists and examples.
  • Give the model an honest escape hatch: unknown, need evidence, or search more.
  • Use prompt tricks as baselines first, not as substitutes for retrieval, tests, or evals.
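
A minimal sketch of these rules as a concrete prompt, in Python. The field names, the wording, and the UNKNOWN / NEED_EVIDENCE / SEARCH_MORE escape-hatch convention are all illustrative assumptions, not any harness's real format:

    # Hypothetical prompt skeleton following the rules above. Every field
    # name and all wording here are illustrative, not from a real harness.

    PROMPT_TEMPLATE = """\
    Objective: {objective}

    Constraints:
    - Preserve exact names and terms from the task and the evidence.
    - Cite evidence for factual claims; if not verified, say so.

    Success criteria:
    {success_criteria}

    If you cannot meet the criteria, answer with one of:
    UNKNOWN | NEED_EVIDENCE | SEARCH_MORE
    """

    def build_prompt(objective: str, criteria: list[str]) -> str:
        """Render the skeleton; keeps the prompt short and constraint-like."""
        bullets = "\n".join(f"- {c}" for c in criteria)
        return PROMPT_TEMPLATE.format(objective=objective,
                                      success_criteria=bullets)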

Orchestration

  • Keep the default path single-agent or workflow-based.
  • Add planners, reviewers, or specialist agents only when evals show clear gains.
  • Prefer bounded loops: one plan, one act phase, one verifier, one retry budget (sketched below).
  • Use different models or prompts only when they contribute distinct evidence or skills.
  • Treat multi-agent diversity as a tool, not a religion.
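
A sketch of the bounded-loop shape, assuming `plan`, `act`, and `verify` are placeholder callables standing in for model and tool calls:

    from typing import Callable

    def bounded_run(task: str,
                    plan: Callable[[str], str],
                    act: Callable[[str], str],
                    verify: Callable[[str], tuple[bool, str]],
                    retries: int = 2) -> str | None:
        steps = plan(task)                     # one planning pass, not a loop
        feedback = ""
        for _ in range(1 + retries):           # fixed retry budget
            artifact = act(steps + feedback)   # one act phase per attempt
            ok, report = verify(artifact)      # external, evidence-based check
            if ok:
                return artifact
            feedback = f"\nVerifier feedback: {report}"
        return None                            # fail loudly instead of looping on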

Tooling

  • Favor tools that return verifiable feedback:
    • tests
    • compiler errors
    • search results
    • fetched pages
    • graders
  • Keep traces and artifacts.
  • Persist compact research notes when follow-up questions are common.
  • If the answer can go stale or must be exact and source-attributed, lookup beats memory.

Automation

  • Fix a metric before running an autonomous loop.
  • Keep the mutable surface small.
  • Auto-commit only after checks pass.
  • Separate "experiment failed" from "checks failed" from "metric regressed" (see the sketch after this list).
  • Prefer narrow optimization targets over grand autonomous platform behavior.
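
A sketch of the three failure lanes kept distinct, with auto-commit gated on the one safe lane. The names and thresholds are hypothetical:

    import enum
    import subprocess

    class Outcome(enum.Enum):
        IMPROVED = "improved"              # metric went up: safe to commit
        METRIC_REGRESSED = "regressed"     # ran fine, but the number got worse
        CHECKS_FAILED = "checks_failed"    # tests/lint failed: never commit
        EXPERIMENT_FAILED = "crashed"      # the run itself errored out

    def classify(run_ok: bool, checks_ok: bool,
                 new: float, old: float) -> Outcome:
        if not run_ok:
            return Outcome.EXPERIMENT_FAILED
        if not checks_ok:
            return Outcome.CHECKS_FAILED
        return Outcome.IMPROVED if new > old else Outcome.METRIC_REGRESSED

    def maybe_commit(outcome: Outcome, msg: str) -> None:
        # Auto-commit only on the one lane where it is safe.
        if outcome is Outcome.IMPROVED:
            subprocess.run(["git", "commit", "-am", msg], check=True)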

Evaluation

  • Build evals from real failures and real manual checks.
  • Balance both sides of decision boundaries:
    • should search
    • should not search
  • Isolate trials. No shared repo state, hidden cache, or leaked history.
  • Use deterministic graders where possible.
  • Use LLM graders with clear rubrics and human calibration when needed.
  • Track both quality and consistency.

Core Sources

Prompting And Reasoning

1. Large Language Models are Zero-Shot Reasoners (Kojima et al., 2022)

Source:

Why it matters:

  • A very small reasoning cue can unlock much better performance than a plain direct answer.

Key takeaway:

  • Before inventing a complicated prompt chain, test a minimal reasoning baseline.

Implication for agents:

  • Use simple reasoning scaffolds as the baseline to beat.
  • If a complex workflow does not outperform a minimal prompt + tool loop, the workflow is probably not worth it.

2. Self-Consistency Improves Chain of Thought Reasoning in Language Models (Wang et al., 2022)

Source:

Why it matters:

  • Sampling multiple reasoning paths and aggregating them can improve correctness on hard reasoning tasks.

Key takeaway:

  • Best-of-N / vote / pass@k style decoding is often more useful than one brittle "perfect prompt".

Implication for agents:

  • Use this selectively for high-value reasoning or planning steps; a minimal voting sketch follows this list.
  • Do not apply it blindly to every turn; it is a latency and cost tradeoff.
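
A minimal self-consistency sketch, assuming `sample_answer` wraps one sampled model completion. The low-agreement escalation is an added assumption, not part of the paper:

    from collections import Counter
    from typing import Callable

    def self_consistent_answer(question: str,
                               sample_answer: Callable[[str], str],
                               n: int = 5) -> str:
        """Sample n reasoning paths and keep the majority final answer."""
        votes = Counter(sample_answer(question) for _ in range(n))
        answer, count = votes.most_common(1)[0]
        if count <= n // 2:          # weak agreement: escalate, don't trust it
            return "UNCERTAIN: " + answer
        return answer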

3. ReAct: Synergizing Reasoning and Acting in Language Models (Yao et al., 2022)

Source:

Why it matters:

  • ReAct formalized the now-standard pattern of interleaving reasoning with external actions.

Key takeaway:

  • Reasoning is better when it can touch the world.

Implication for agents:

  • For factual, interactive, or environment-dependent tasks, combine thinking with acting instead of pushing all work into one monologue.
  • This is a strong default for search, repo work, shell use, and structured tool loops; a bare-bones loop is sketched below.
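
A bare-bones ReAct-style loop sketch. `llm_step`, the action convention, and the tool registry are hypothetical; real harnesses format the transcript differently:

    from typing import Callable

    def react_loop(task: str,
                   llm_step: Callable[[str], tuple[str, str, str]],
                   tools: dict[str, Callable[[str], str]],
                   max_steps: int = 8) -> str:
        transcript = f"Task: {task}"
        for _ in range(max_steps):                 # bounded, never open-ended
            thought, action, arg = llm_step(transcript)
            transcript += f"\nThought: {thought}\nAction: {action}({arg})"
            if action == "finish":
                return arg                         # final answer
            observation = tools[action](arg)       # reasoning touches the world
            transcript += f"\nObservation: {observation}"
        return "FAILED: step budget exhausted"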

4. Tree of Thoughts / Search-Style Deliberation

Sources:

Why it matters:

  • Search over candidate reasoning paths can improve hard problems when a single left-to-right pass is too brittle.

Key takeaway:

  • Deliberate search can help, but it is expensive. Use it for genuinely hard branches, not for routine chat.

Implication for agents:

  • Keep search/planning loops bounded.
  • Reach for tree search only when the task is hard enough and the eval gain justifies the extra cost.

Tool Use And Self-Correction

5. Toolformer: Language Models Can Teach Themselves to Use Tools (Schick et al., 2023)

Source:

Why it matters:

  • Tool use is not just a static external heuristic; it can be learned and integrated into the model's behavior.

Key takeaway:

  • A good agent should decide when to call tools from the information need, not from crude keyword triggers.

Implication for agents:

  • Prefer model-directed tool decisions over brittle word lists.
  • Keep a simple fallback policy, but do not let the fallback dominate the product behavior.

6. CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing (Gou et al., 2024)

Source:

Why it matters:

  • Self-correction becomes much stronger when the critique is grounded in external tools instead of pure self-reflection.

Key takeaway:

  • Verification works better with evidence than with vibes.

Implication for agents:

  • When possible, critique drafts against search results, tests, or environment state.
  • A grounded revision pass is usually higher value than another creative generation pass.

7. Self-Refine: Iterative Refinement with Self-Feedback (Madaan et al., 2023)

Source:

Why it matters:

  • Even without external tools, generate -> critique -> revise can improve outputs.

Key takeaway:

  • Revision is a useful primitive, but should be bounded and measured.

Implication for agents:

  • Keep self-refine loops short.
  • Prefer one clear revision pass over open-ended introspection (sketched below).
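
A sketch of one bounded generate -> critique -> revise pass, in the spirit of Self-Refine and CRITIC. All callables are placeholders, and the "OK" convention for a clean critique is an assumption; grounding `critique` in tests or search is optional here but usually worth it:

    from typing import Callable

    def refine_once(prompt: str,
                    generate: Callable[[str], str],
                    critique: Callable[[str], str],
                    revise: Callable[[str, str], str]) -> str:
        draft = generate(prompt)
        feedback = critique(draft)      # ideally grounded: run tests, fetch sources
        if feedback.strip() == "OK":    # assumed convention for "no issues found"
            return draft
        return revise(draft, feedback)  # exactly one revision pass, then stop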

8. Reflexion: Language Agents with Verbal Reinforcement Learning (Shinn et al., 2023)

Source:

Why it matters:

  • Reflection across attempts can improve repeated-task performance.

Key takeaway:

  • Memory is most useful when it captures compact lessons from failures, not giant transcripts.

Implication for agents:

  • Store short, actionable reflections (see the sketch after this list).
  • Use memory across repeated tasks or sessions, not as an excuse to keep every token forever.
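
A sketch of a compact reflection store, assuming a capped JSON file; the cap, the truncation length, and the schema are arbitrary illustrative choices:

    import json
    import pathlib

    MEMORY = pathlib.Path("reflections.json")   # arbitrary location
    MAX_LESSONS = 50                            # hard cap keeps memory compact

    def record_lesson(task_id: str, lesson: str) -> None:
        """Append a one-line lesson from a failed attempt, never a transcript."""
        lessons = json.loads(MEMORY.read_text()) if MEMORY.exists() else []
        lessons.append({"task": task_id, "lesson": lesson[:200]})  # force brevity
        MEMORY.write_text(json.dumps(lessons[-MAX_LESSONS:], indent=2))

    def recall(task_id: str) -> list[str]:
        if not MEMORY.exists():
            return []
        return [item["lesson"] for item in json.loads(MEMORY.read_text())
                if item["task"] == task_id]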

8a. Programmatic Tool Calling (Anthropic Docs, 2026)

Source:

Why it matters:

  • Strong practical guidance on when a model should batch tool work inside a code/execution environment instead of paying a model round-trip per tool call.

Key takeaways from the docs:

  • Programmatic tool calling is useful when the model can write code that fans out or sequences tool calls, filters large intermediate results, and returns only the compact summary back into context.
  • This is especially attractive for multi-step workflows with 3+ dependent tool calls, large datasets, or parallel checks across many items.
  • Caller boundaries matter. Tools should usually be either direct-call tools or execution-only tools, not both by default.
  • Tool outputs must be treated as untrusted strings. Validation and injection hygiene matter if the execution environment will parse or act on those results.
  • This is not the default for single fast calls or highly interactive steps where the code-execution overhead outweighs the gain.

Implication for agents:

  • Add batched tool fanout only as an opt-in executor pattern for research-heavy or data-heavy tasks (sketched below).
  • Use it to cut latency and context pressure, not to replace the main verifier contract.
  • If implemented, log caller/executor state clearly enough to debug failures and reuse behavior.
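
A sketch of the executor-side pattern: fan out calls in code, validate the untrusted outputs, and return only a compact summary to context. `fetch_price` is a hypothetical tool wrapper, not a real API:

    from concurrent.futures import ThreadPoolExecutor

    def fetch_price(sku: str) -> str:
        # Placeholder for a real direct tool call; returns an untrusted string.
        return "9.99"

    def check_prices(skus: list[str], limit: float) -> str:
        # Fan out in code: no model round-trip per tool call.
        with ThreadPoolExecutor(max_workers=8) as pool:
            raw = dict(zip(skus, pool.map(fetch_price, skus)))
        flagged = []
        for sku, text in raw.items():
            try:
                price = float(text)      # validate; never eval/exec tool output
            except ValueError:
                flagged.append(f"{sku}: unparseable")
                continue
            if price > limit:
                flagged.append(f"{sku}: {price:.2f}")
        # Only this short line re-enters model context, not the raw results.
        return f"{len(flagged)}/{len(skus)} over limit: " + "; ".join(flagged[:10])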

System Design And Orchestration

9. The Shift from Models to Compound AI Systems (BAIR, 2024)

Source:

Why it matters:

  • Strong AI systems increasingly come from multiple interacting components, not just bigger base models.

Key takeaways from the article:

  • system design can improve quality faster than scaling alone
  • current-data access, control, trust, and cost are often easier to solve at the system level
  • optimizing a compound system is a distinct engineering problem

Implication for agents:

  • Build around tools, retrievers, graders, and routers when they solve a real product problem.
  • Do not mistake "compound system" for "maximally complex system".

10. Building Effective AI Agents (Anthropic, 2026)

Source:

Why it matters:

  • High-quality practical guidance from a team operating real agent systems at scale.

The most useful framing:

  • choose between single-agent, workflow, and multi-agent designs intentionally
  • use a small set of reusable patterns:
    • sequential
    • parallel
    • evaluator-optimizer
  • match system complexity to business value

Implication for agents:

  • Default to simple workflows first.
  • Reach for evaluator-optimizer when correctness matters and the task benefits from revision.
  • Reach for multi-agent only after single-agent/workflow baselines are exhausted.

10a. Harness Design for Long-Running Application Development (Anthropic, 2026)

Source:

Why it matters:

  • Strong, current practical guidance on long-running coding harnesses from a team actively tuning them against real app-building tasks.

Key takeaways from the article:

  • Separate generator and evaluator roles. Self-evaluation is too lenient; a tuned external evaluator is much more useful.
  • Make "done" explicit before coding. Anthropic used planner output plus a per-sprint contract negotiated between builder and evaluator.
  • Keep planner output high-level and product-facing. Over-specifying low-level implementation details too early can cascade bad assumptions.
  • Use evaluators that touch the environment directly. Playwright-driven QA against real UI behavior, API behavior, and data state is much stronger than static inspection.
  • Use hard thresholds for acceptance, not vague approval. If a criterion misses, fail the chunk and return actionable feedback.
  • Preserve handoff artifacts and structured files between agents. File-based communication and contracts reduce drift across long runs.
  • Context resets vs compaction is a model-specific engineering choice. Resets help when the model shows context anxiety; stronger models may make resets unnecessary.
  • Every harness component encodes an assumption about model weakness. Re-run ablations and remove scaffold that is no longer load-bearing after model upgrades.
  • Evaluators are worth the cost near the model's capability boundary. When the model can already do the task reliably solo, evaluator passes can become mostly overhead.
  • Read logs and calibrate the evaluator from real disagreements. Prompt tuning should come from concrete misses, not abstract "better QA" wishes.

Implication for agents:

  • Keep coder and reviewer/verifier separate when acceptance quality matters.
  • Add an explicit contract or acceptance plan before implementation when the spec is high-level (a contract sketch follows this list).
  • Prefer grounded evaluator tools over reviewer vibes.
  • Keep handoff state compact and structured enough to survive resets when resets are needed.
  • Revisit harness complexity whenever the base model changes; stale scaffold is real technical debt.
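
A sketch of what a structured, file-based contract might look like; the fields and format are assumptions, not Anthropic's actual artifact:

    import dataclasses
    import json

    @dataclasses.dataclass
    class Criterion:
        check: str            # e.g. "signup form rejects invalid emails"
        min_pass_rate: float  # hard acceptance bar, not vague approval
        grounded_by: str      # e.g. "playwright", "api_test", "db_state"

    @dataclasses.dataclass
    class SprintContract:
        sprint: str
        criteria: list[Criterion]

        def save(self, path: str) -> None:
            # File-based handoff: survives context resets and agent swaps.
            with open(path, "w") as f:
                json.dump(dataclasses.asdict(self), f, indent=2)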

11. Understanding Agent Scaling via Diversity (2026)

Source:

  • arXiv:2602.03794

Why it matters:

  • More homogeneous agents do not scale indefinitely; diversity matters more than count.

Key takeaway:

  • Two meaningfully different agents can outperform a swarm of same-ish agents.

Implication for agents:

  • Diversity should come from role, model, tool access, or evidence channel.
  • Do not duplicate the same model/prompt ten times and call it orchestration.

12. SOLVE-Med / MATA / Small-Model Orchestration (2025-2026)

Sources:

  • SOLVE-Med: arXiv:2511.03542
  • MATA: arXiv:2602.09642

Why they matter:

  • Small specialized models, when orchestrated well, can outperform or match much larger standalone systems.

Key takeaway:

  • Cheap specialists for mechanical subproblems are a real design pattern, not a hack.

Implication for agents:

  • Route grep/read/run/simple classification to cheaper lanes (a routing sketch follows below).
  • Reserve expensive models for hard reasoning or integration steps.

Concrete example (see Projects section):

  • ATLAS achieves 74.6% on LiveCodeBench using a quantized 14B model on a single consumer GPU by layering structured generation, energy-based verification, and self-verified repair — no frontier model, no cloud API. The infrastructure more than doubles the baseline pass rate.
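
A minimal routing sketch for the cheap-lane pattern; the operation names and the call signature are assumptions:

    from typing import Callable

    # Operations cheap enough for a small model; membership is the routing rule.
    MECHANICAL = {"grep", "read_file", "run_tests", "classify_simple"}

    def route(op: str, payload: str,
              cheap: Callable[[str], str],
              frontier: Callable[[str], str]) -> str:
        """Dispatch by operation type: mechanical work never hits the big model."""
        model = cheap if op in MECHANICAL else frontier
        return model(f"{op}: {payload}")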

13. Agent READMEs: An Empirical Study of Context Files for Agentic Coding (2025)

Source:

  • arXiv:2511.12884

Why it matters:

  • Agent context files become living operational artifacts, but often drift into unreadable piles.

Key takeaways from the study:

  • teams heavily specify build/run, architecture, and implementation context
  • security and performance are badly underspecified

Implication for agents:

  • Keep context files short, operational, and constraint-rich.
  • Add explicit non-functional requirements.
  • Treat agent context as maintained configuration, not lore.

14. From Biased Chatbots to Biased Agents (2026)

Source:

  • arXiv:2602.12285

Why it matters:

  • Persona baggage can actively hurt agent behavior.

Key takeaway:

  • Capability framing helps; character acting often hurts.

Implication for agents:

  • Keep personalities light.
  • Put behavior into constraints and tools, not theatrics.

15. Emergent Coordination in Multi-Agent Systems (2025)

Source:

  • arXiv:2510.05174

Why it matters:

  • Coordination is better when agents share objectives and understand complementary roles.

Key takeaway:

  • Role awareness is useful; vague social-role prompts are not enough.

Implication for agents:

  • When using multiple agents, explicitly describe what each one contributes and how outputs combine.

Software Engineering Agents

16. SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering (Yang et al., 2024)

Source:

Why it matters:

  • The interface between model and environment is part of the model's performance.

Key takeaway:

  • Good repo agents need a disciplined shell/editor/test interface, not just a better prompt.

Implication for agents:

  • Design the action surface carefully.
  • Short loops over read/search/edit/test beat abstract planning without execution.

17. Agentless: Demystifying LLM-based Software Engineering Agents (Xia et al., 2024)

Source:

Why it matters:

  • A simpler pipeline can outperform complex software agents at lower cost.

Key takeaway:

  • Simpler decomposition often beats a giant autonomous loop.

Implication for agents:

  • Always benchmark against a simpler non-agentic or lightly agentic baseline.
  • If a full agent loop is not clearly better, cut it.

18. PatchPilot: A Cost-Efficient Software Engineering Agent (2025)

Source:

Why it matters:

  • Demonstrates that a rule-based 5-step workflow can match or beat fully agentic approaches at a fraction of the cost (under $1/instance on SWE-bench).

Key takeaway:

  • An explicit localization step before generation (retrieve relevant context from the codebase) measurably improves patch quality.
  • Rule-based planning provides stability; agent-based planning provides peak performance.
  • A hybrid uses rules as the default and escalates to agent planning on failure.

5-step workflow:

  1. Reproduction — verify the issue is reproducible
  2. Localization — retrieve relevant context from the codebase
  3. Generation — produce the patch
  4. Validation — run tests/checks
  5. Refinement — iterate until validation passes

Implication for agents:

  • Before generating a fix, explicitly localize: find the affected files/functions and pull them into context.
  • Use rule-based orchestration as the stable default; reserve LLM-driven planning for cases where the rule path has already failed.
  • Treat refinement as a bounded loop with a fixed retry budget, not open-ended re-generation; the pipeline shape is sketched below.
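
A sketch of the five steps as a bounded pipeline. Each stage is a placeholder callable; only the control flow (localize before generate, fixed refinement budget, escalate on exhaustion) mirrors the paper:

    def patch_pipeline(issue, reproduce, localize, generate, validate,
                       max_refinements: int = 3):
        if not reproduce(issue):               # 1. Reproduction
            return None
        context = localize(issue)              # 2. Localization before generation
        patch = generate(issue, context)       # 3. Generation
        for _ in range(max_refinements):       # 5. Refinement: fixed budget
            ok, report = validate(patch)       # 4. Validation
            if ok:
                return patch
            patch = generate(issue, context + "\n" + report)
        return None                            # rule path failed: escalate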

Evaluation And Reliability

19. Demystifying evals for AI agents (Anthropic, 2026)

Source:

Why it matters:

  • This is one of the best practical writeups on agent evals and reliability.

High-signal takeaways:

  • start early; 20-50 tasks is enough to begin
  • write unambiguous tasks with reference solutions
  • evaluate both "should do X" and "should not do X"
  • isolate trials from each other
  • grade outputs/outcomes, not rigid exact traces
  • calibrate model graders against humans
  • read transcripts constantly
  • treat eval-driven development as normal engineering

Implication for agents:

  • Search/tool-use policies should be evaluated on both over-triggering and under-triggering (see the sketch after this list).
  • Coding agents should be graded on artifacts and tests, not whether they followed a specific thought path.
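
A sketch of grading both sides of the search decision boundary; the case schema and the `run_agent` return shape are assumptions:

    from typing import Callable

    # Both sides of the boundary get cases; grading looks at the artifact
    # (did it search, is the answer grounded), not the exact tool trace.
    CASES = [
        {"q": "current EUR/USD exchange rate", "should_search": True},
        {"q": "reverse a list in Python", "should_search": False},
    ]

    def grade(run_agent: Callable[[str], dict]) -> dict[str, float]:
        over = under = ok = 0
        for case in CASES:
            result = run_agent(case["q"])  # assumed: {"searched": bool, ...}
            if result["searched"] and not case["should_search"]:
                over += 1                  # over-triggering
            elif not result["searched"] and case["should_search"]:
                under += 1                 # under-triggering
            else:
                ok += 1
        n = len(CASES)
        return {"ok": ok / n, "over_trigger": over / n, "under_trigger": under / n}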

Projects Worth Studying

These are not all research papers, but they are useful design references.

1. karpathy/autoresearch

Source:

What to study:

  • extremely narrow loop
  • fixed optimization target
  • small mutable surface
  • experiment-first framing instead of "general agent platform"

Copy:

  • tight loop, fixed budget, metric-first automation

Avoid:

  • generalizing it into a broad orchestration layer unless evals justify it

2. davebcn87/pi-autoresearch

Source:

What to study:

  • practical extension of the autoresearch idea
  • explicit session files
  • checks vs crashes vs metric logs
  • dashboard and widget feedback

Copy:

  • make experiment state visible
  • distinguish correctness failures from benchmark failures
  • commit only after the right checks pass

3. SWE-agent / mini-SWE-agent

Source:

What to study:

  • repo-focused action surface
  • issue -> inspect -> edit -> test loop
  • benchmark-first iteration

Copy:

  • narrow interface and strong harnessing

4. OpenHands

Source:

What to study:

  • broad workspace/runtime architecture
  • interactive software agent product design

Copy carefully:

  • runtime ergonomics and environment handling

Risk:

  • very easy to absorb too much framework complexity

5. aider's architect/editor split

Source:

What to study:

  • separate high-level reasoning from concrete editing

Copy:

  • planner/editor separation can help when one lane should stay terse and execution-oriented

Risk:

  • only worth it if the split clearly improves results on your tasks

6. ATLAS (Adaptive Test-time Learning and Autonomous Specialization)

Source:

What to study:

  • a self-hosted coding agent that achieves 74.6% on LiveCodeBench using a quantized 14B Qwen model on a single consumer GPU (RTX 5060 Ti, 16GB VRAM)
  • three-phase pipeline: Generate → Verify → Repair
    • Generate: PlanSearch extracts constraints and produces diverse candidate solutions; Budget Forcing controls token spend during inference
    • Verify: a "Geometric Lens" component scores candidates with a 5120-dimensional energy field (87.8% selection accuracy) plus sandboxed execution
    • Repair: failing candidates trigger self-generated test cases and iterative refinement via PR-CoT (Problem-Repair Chain-of-Thought)
  • estimated ~$0.004/task in local electricity vs. $0.043 to $0.066 per task for comparable API services; no external API calls required

Why it belongs here:

  • It is a working proof that smart infrastructure — not model scale — can close the gap with frontier systems
  • Directly validates the small-model orchestration pattern from #12: doubling the baseline pass rate (from ~38% to 74.6%) comes entirely from the generation/verification/repair scaffold, not from a bigger model
  • The Geometric Lens energy-field selector is an unusual but measurable alternative to pure LLM-based self-critique

Copy:

  • generate → external verify → self-repair loop as a default pattern for coding tasks
  • budget forcing to limit token waste on low-confidence generations
  • distinguishing candidate selection accuracy (Geometric Lens) from final pass rate — they are different metrics worth tracking separately

Avoid:

  • treating it as a general-purpose agent; it is explicitly optimized for LiveCodeBench and cross-domain generalization is listed as a known limitation
  • the sequential/single-threaded pipeline if throughput matters — version 3.1 targets parallel processing

Risk:

  • the Geometric Lens is described as undertrained; the verification signal could be a bottleneck on new domains

Distilled Rules For Kokoclaw/OpenClaw-Like Systems

Search And Retrieval

  • Do not rely on hardcoded keywords to decide whether to search.
  • Let the model judge whether fresh evidence is needed, then measure the behavior.
  • Keep the first search shallow and literal.
  • Allow bounded refinement if the first results are weak or mismatched (sketched below).
  • Ground final factual answers in retrieved evidence.
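
A sketch of this search policy; `search`, `looks_weak`, and `refine` are placeholder callables:

    from typing import Callable

    def retrieve(question: str,
                 search: Callable[[str], list],
                 looks_weak: Callable[[list], bool],
                 refine: Callable[[str, list], str],
                 max_refinements: int = 1) -> list:
        query = question                       # first pass: shallow and literal
        results = search(query)
        for _ in range(max_refinements):       # bounded refinement, not a crawl
            if not looks_weak(results):
                break
            query = refine(question, results)  # e.g. a model-proposed rewrite
            results = search(query)
        return results                         # the final answer must cite these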

Coding Agents

  • Keep repo agents on short inspect/edit/test loops.
  • Preserve exact names and file-local conventions.
  • Use docs lookup when behavior depends on framework or version details.
  • Grade the produced diff and test result, not exact intermediate steps.
  • Always compare against simpler baselines like "read more, act less".

Multi-Agent Design

  • Use one agent unless there is a measured reason to split.
  • Split by capability, not by story or persona.
  • Small helpers should do mechanical work.
  • Larger models should handle synthesis and edge-case reasoning.
  • Coordination prompts should name the shared objective and each role's responsibility.

Prompt Writing

  • Short beats bloated.
  • Abstract rules beat example catalogs unless the task genuinely needs demonstrations.
  • Put uncertainty policy in the prompt:
    • verify
    • revise
    • say unknown when unsupported
  • Do not try to encode every failure mode in one mega-prompt.

Evaluation

  • Run the same harness the product actually uses.
  • Keep trials isolated.
  • Track pass@1 and consistency, not just "found a good answer once".
  • Review transcripts every week if the system matters.
  • If the model improved but the score did not, suspect the benchmark or grader too.

Anti-Patterns

  • Too many homogeneous agents
  • Persona-rich prompts with weak task constraints
  • Keyword-triggered search/tool policies
  • Unbounded self-reflection loops
  • Auto-commits without validation
  • Massive context files with no ownership
  • Grading only the exact path instead of the delivered outcome
  • Building a platform before validating a narrow workflow

What To Re-Read Often

  • Anthropic, Building Effective AI Agents
  • Anthropic, Harness Design for Long-Running Application Development
  • Anthropic, Programmatic tool calling docs
  • Anthropic, Demystifying evals for AI agents
  • BAIR, The Shift from Models to Compound AI Systems
  • ReAct
  • CRITIC
  • Agentless
  • SWE-agent
  • karpathy/autoresearch
  • pi-autoresearch
  • ATLAS (generate → verify → repair; small-model infra vs. scale)

Update Policy For This File

When adding a new source, prefer one of:

  • primary paper
  • official engineering article
  • official project README or documentation

For each new source, capture:

  • what it claims
  • what to copy
  • what to avoid
  • whether it actually changes system design decisions

If it does not change design decisions, it probably does not belong here.

Last updated: 2026-03-29