Multi-Agent LLM Research Findings

Papers Reviewed

1. Agent READMEs: An Empirical Study of Context Files for Agentic Coding (Nov 2025)

arXiv:2511.12884

Key Findings:

  • Analyzed 2,303 agent context files from 1,925 repositories
  • Context files evolve like configuration code through frequent, small additions
  • Developers prioritize functional context:
    • Build/run commands: 62.3%
    • Implementation details: 69.9%
    • Architecture: 67.7%
  • Critical gap: Non-functional requirements rarely specified:
    • Security: only 14.5%
    • Performance: only 14.5%
  • Files are often "complex, difficult-to-read artifacts"

Implications for Our Setup:

  • AGENTS.md files are effective but need guardrails
  • Add explicit security/performance constraints
  • Keep persona definitions minimal and task-focused
  • Context files should be living documents, not static

2. Understanding Agent Scaling via Diversity (Feb 2026)

arXiv:2602.03794

Key Findings:

  • Homogeneous agents saturate early: diminishing returns from adding more of the same model
  • Heterogeneity wins: Different models, prompts, or tools yield substantial gains
  • 2 diverse agents can match/exceed performance of 16 homogeneous agents
  • Performance bounded by task uncertainty, not agent count
  • Homogeneous outputs are strongly correlated; heterogeneous agents provide complementary evidence

Implications for Our Setup:

  • VALIDATED: 4B orchestrator + 14B coder + 1.2B runners is the right pattern
  • The diversity in model sizes creates effective specialization
  • Small models for mechanical tasks (grep, read, run)
  • Large models for complex reasoning (coding, architecture)
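
This tiered split can be sketched as a simple dispatch rule. The model names and the keyword heuristic below are illustrative assumptions, not from the paper; a real router would likely use the orchestrator model itself to classify tasks.

```python
# Sketch of tier-based task routing across the 4B/14B/1.2B split described
# above. Model names and the verb heuristic are hypothetical.

MECHANICAL = {"grep", "read", "run", "list", "copy"}
COMPLEX = {"implement", "refactor", "design"}

def route_task(task: str) -> str:
    """Pick a model tier for a task based on a simple keyword heuristic."""
    verb = task.split()[0].lower()
    if verb in MECHANICAL:
        return "runner-1.2b"      # cheap, parallelizable mechanical work
    if verb in COMPLEX:
        return "coder-14b"        # complex reasoning reserved for the big model
    return "orchestrator-4b"      # everything else goes to the coordinator

print(route_task("grep for banned imports"))    # runner-1.2b
print(route_task("implement the blog layout"))  # coder-14b
```

The point is not the heuristic itself but that routing happens before any expensive model is invoked.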

3. SOLVE-Med: Specialized Orchestration with Small Models (Nov 2025)

arXiv:2511.03542

Key Findings:

  • Uses 10 specialized models (1B each) orchestrated together
  • Outperforms standalone models up to 14B parameters
  • Architecture: Router Agent → Specialists → Orchestrator Agent
  • Enables local deployment while beating much larger models

Implications:

  • Our 1.2B fast runners pattern is validated by research
  • Router/Orchestrator separation is effective
  • Small specialists + coordination > single large model
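
The Router → Specialists → Orchestrator flow can be sketched as a small pipeline. The specialist registry and the `call_model` stub below are hypothetical stand-ins; in SOLVE-Med each specialist is a ~1B model trained for one domain.

```python
# Minimal sketch of the Router -> Specialists -> Orchestrator pattern.
# call_model is a stand-in for a local inference call; names are illustrative.

def call_model(name: str, prompt: str) -> str:
    """Stand-in for invoking a local model (e.g. an HTTP inference endpoint)."""
    return f"[{name}] answer to: {prompt}"

SPECIALISTS = {
    "cardiology": "cardio-1b",
    "neurology": "neuro-1b",
}

def router(question: str) -> list[str]:
    """Select relevant specialists (keyword match as a stand-in for a 1B router)."""
    return [m for domain, m in SPECIALISTS.items() if domain in question.lower()]

def orchestrate(question: str) -> str:
    """Fan out to the selected specialists, then fuse their answers."""
    answers = [call_model(m, question) for m in router(question)]
    return call_model("orchestrator-1b", " | ".join(answers))

print(orchestrate("cardiology question about arrhythmia"))
```

The key design choice is that the router and orchestrator are themselves small models, so the whole pipeline stays locally deployable.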

4. MATA: Multi-Agent with Small Model Tools (Feb 2026)

arXiv:2602.09642

Key Findings:

  • Uses small language models as tools, with an algorithm designed to minimize expensive LLM agent calls
  • "Careful orchestration of multiple reasoning pathways yields scalable and reliable" results
  • Strong performance across 10 different LLMs

Implications:

  • Fast task runners (1.2B) should handle all simple operations
  • Reserve 14B model for complex tasks only
  • Expensive model calls should be minimized through smart routing
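
One concrete way to minimize expensive calls is "cheap first, escalate on failure": try the smallest model and only move up a tier when its answer fails validation. This is a sketch of that policy under assumed tier names and caller-supplied validation, not MATA's actual algorithm.

```python
# Sketch of escalate-on-failure routing: walk up the tiers, cheapest first,
# stopping at the first answer that validates. Tier names are assumptions.

from typing import Callable

TIERS = ["runner-1.2b", "orchestrator-4b", "coder-14b"]  # cheapest first

def solve(task: str,
          try_model: Callable[[str, str], str],
          is_valid: Callable[[str], bool]) -> tuple[str, str]:
    """Return (model, answer) from the cheapest tier that produces a valid answer."""
    for model in TIERS:
        answer = try_model(model, task)
        if is_valid(answer):
            return model, answer
    return TIERS[-1], answer  # fall back to the strongest model's attempt

# Toy demo: only the 14B model "succeeds" on this task.
model, _ = solve(
    "refactor module",
    try_model=lambda m, t: "ok" if m == "coder-14b" else "fail",
    is_valid=lambda a: a == "ok",
)
print(model)  # coder-14b
```

In practice `is_valid` would be a cheap check such as a build, a linter, or a banned-import grep.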

5. From Biased Chatbots to Biased Agents (Jan 2026)

arXiv:2602.12285

Key Findings:

  • Persona assignments can degrade performance by up to 26.2%
  • Task-irrelevant persona cues introduce implicit biases
  • "Persona assignments can introduce implicit biases and increase behavioral volatility"

Implications:

  • Norse mythology naming is fine, but keep persona definitions task-focused and minimal
  • Don't over-engineer personalities
  • Focus on capability descriptions, not character roles

6. Emergent Coordination in Multi-Agent Systems (Oct 2025)

arXiv:2510.05174

Key Findings:

  • Multi-agent systems can be steered from "mere aggregates" to "higher-order collectives"
  • Best coordination: Personas + "think about what other agents might do" instruction
  • "Effective performance requires both alignment on shared objectives and complementary contributions"

Implications:

  • Agents need awareness of other agents' roles
  • Thor should explicitly mention Odin's capabilities in specs
  • Shared understanding of project structure is critical
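
Cross-agent awareness can be injected mechanically when building a prompt. The agent names and roles below come from this document; the roster format and prompt wording are illustrative assumptions.

```python
# Sketch of prepending a roster of the other agents to a spec prompt, so each
# agent can "think about what other agents might do". Wording is hypothetical.

AGENT_ROLES = {
    "Thor": "orchestrator: writes specs and coordinates",
    "Odin": "coder: implements specs",
    "Loki": "reviewer: verifies functional and non-functional requirements",
}

def build_spec_prompt(author: str, spec: str) -> str:
    """Prepend the other agents' roles so the author can anticipate them."""
    others = "\n".join(f"- {n}: {r}" for n, r in AGENT_ROLES.items() if n != author)
    return (f"You are {author}. Other agents on this task:\n{others}\n"
            f"Think about what each of them will need from your output.\n\n{spec}")

prompt = build_spec_prompt("Thor", "Spec: add blog index page")
print(prompt)
```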

Summary: What Research Validates

✅ Our Current Approach (Heterogeneous Agents)

  • 4B dispatcher + 14B coder + 1.2B runners is correct
  • Diversity beats homogeneous scaling
  • Small models work when orchestrated properly

⚠️ Areas for Improvement

  1. Add non-functional requirements to specs (security, performance)
  2. Minimize expensive model calls - use 1.2B agents for simple tasks
  3. Keep personas minimal - avoid character over-engineering
  4. Agent awareness - specs should reference other agents' capabilities
  5. Context file maintenance - AGENTS.md should evolve, not be static

What Failed in Koko Blog Implementation

  • Thor didn't enforce stack constraints - allowed React when Astro was required
  • No verification step - no check that code matched PLAN.md
  • Missing guardrails - no security/performance requirements in specs
  • Poor context - specs didn't reference Astro patterns explicitly

Recommendations for Better Agent Performance

1. Spec Template Improvements

Every spec sent to Odin should include:

STACK ENFORCEMENT (NON-NEGOTIABLE):
- Framework: [specific framework]
- Forbidden: [list of banned libraries/patterns]
- Required patterns: [specific conventions]
- Verification: [how to check compliance]

NON-FUNCTIONAL REQUIREMENTS:
- Security: [specific constraints]
- Performance: [targets]
- Bundle size: [limits]
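
A pre-flight check can reject a spec that omits any of these sections before it reaches Odin. The section names below mirror the template; storing specs as plain markdown text is an assumption.

```python
# Sketch of a spec completeness check against the template above.
# REQUIRED_SECTIONS mirrors the template headings; matching is case-insensitive.

import re

REQUIRED_SECTIONS = [
    "STACK ENFORCEMENT",
    "Framework:",
    "Forbidden:",
    "NON-FUNCTIONAL REQUIREMENTS",
    "Security:",
    "Performance:",
]

def missing_sections(spec_text: str) -> list[str]:
    """Return template sections the spec omits."""
    return [s for s in REQUIRED_SECTIONS
            if not re.search(re.escape(s), spec_text, re.IGNORECASE)]

spec = "STACK ENFORCEMENT\n- Framework: Astro\n- Forbidden: react, react-dom\n"
print(missing_sections(spec))  # ['NON-FUNCTIONAL REQUIREMENTS', 'Security:', 'Performance:']
```

A non-empty result would send the spec back to Thor instead of on to Odin.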

2. Pre-Implementation Verification

Before coding starts:

  • Thor reads existing files of the target type
  • Explicitly lists patterns to follow
  • References Astro docs for APIs
  • States "use ONLY [framework] patterns"

3. Post-Implementation Checklist

After Odin completes:

  • Verify framework compliance (grep for banned imports)
  • Check bundle size impact
  • Validate against PLAN.md line-by-line
  • Run build and check output
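
The "grep for banned imports" step can be automated. The banned list below reflects the Astro-vs-React failure described in this document; the file extensions and list contents are project-specific assumptions.

```python
# Sketch of scanning source files for imports of banned packages.
# BANNED_IMPORTS and the extension set are project-specific assumptions.

import re
from pathlib import Path

BANNED_IMPORTS = ["react", "react-dom", "styled-components"]

def find_violations(root: str) -> list[tuple[str, str]]:
    """Return (file, package) pairs for import statements of banned packages."""
    pattern = re.compile(
        r"""(?:^|\n)\s*import\b.*?from\s+['"](%s)['"]"""
        % "|".join(re.escape(b) for b in BANNED_IMPORTS)
    )
    hits = []
    for path in Path(root).rglob("*.*"):
        if path.suffix in {".js", ".jsx", ".ts", ".tsx", ".astro"}:
            for match in pattern.finditer(path.read_text(errors="ignore")):
                hits.append((str(path), match.group(1)))
    return hits
```

A non-empty result fails the checklist and routes the diff back to Odin before the PLAN.md comparison even starts.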

4. Agent Coordination Improvements

  • Thor specs should explicitly mention: "Odin will implement, Loki will verify"
  • Include "Review Requirements" section in specs
  • Loki should check both functional and non-functional requirements

Research-Backed Configuration Adjustments

Effective Agent Hierarchy (Validated)

Level 1: Orchestrators (4B)
  - Task coordination
  - Context extraction
  - Spec writing with guardrails
  
Level 2: Specialist (14B)
  - Complex coding only
  - Architecture decisions
  - Deep problem solving
  
Level 3: Fast Runners (1.2B)
  - Grep, read, run
  - No coding
  - Parallel execution

Optimal Work Distribution

  • 2 diverse agents > 16 homogeneous agents
  • Our 3-tier setup (4B/14B/1.2B) is research-optimal
  • Don't add more agents of same type
  • Differentiate by capability, not just name

Last updated: 2026-03-01
Sources: arXiv papers reviewed March 2026