Multi-Agent LLM Research Findings
Papers Reviewed
1. Agent READMEs: An Empirical Study of Context Files for Agentic Coding (Nov 2025)
arXiv:2511.12884
Key Findings:
- Analyzed 2,303 agent context files from 1,925 repositories
- Context files evolve like configuration code through frequent, small additions
- Developers prioritize functional context:
- Build/run commands: 62.3%
- Implementation details: 69.9%
- Architecture: 67.7%
- Critical gap: Non-functional requirements rarely specified:
- Security: only 14.5%
- Performance: only 14.5%
- Files are often "complex, difficult-to-read artifacts"
Implications for Our Setup:
- AGENTS.md files are effective but need guardrails
- Add explicit security/performance constraints
- Keep persona definitions minimal and task-focused
- Context files should be living documents, not static
2. Understanding Agent Scaling via Diversity (Feb 2026)
arXiv:2602.03794
Key Findings:
- Homogeneous agents saturate early: adding more instances of the same model yields diminishing returns
- Heterogeneity wins: Different models, prompts, or tools yield substantial gains
- 2 diverse agents can match/exceed performance of 16 homogeneous agents
- Performance bounded by task uncertainty, not agent count
- Homogeneous outputs are strongly correlated; heterogeneous agents provide complementary evidence
Implications for Our Setup:
- ✅ VALIDATED: 4B orchestrator + 14B coder + 1.2B runners is the right pattern
- The diversity in model sizes creates effective specialization
- Small models for mechanical tasks (grep, read, run)
- Large models for complex reasoning (coding, architecture)
3. SOLVE-Med: Specialized Orchestration with Small Models (Nov 2025)
arXiv:2511.03542
Key Findings:
- Uses 10 specialized models (1B each) orchestrated together
- Outperforms standalone models up to 14B parameters
- Architecture: Router Agent → Specialists → Orchestrator Agent
- Enables local deployment while beating much larger models
Implications:
- Our 1.2B fast runners pattern is validated by research
- Router/Orchestrator separation is effective
- Small specialists + coordination > single large model
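The Router Agent → Specialists → Orchestrator Agent flow can be sketched roughly as below. All names here (`route`, `SPECIALISTS`, `orchestrate`) and the specialist domains are illustrative assumptions, not taken from the SOLVE-Med paper:

```python
# Sketch of a Router -> Specialists -> Orchestrator pipeline,
# loosely modeled on the SOLVE-Med architecture described above.
from typing import Callable, Dict, List

# Hypothetical 1B specialists, each handling one narrow domain.
# In a real setup each lambda would be a call to a small model.
SPECIALISTS: Dict[str, Callable[[str], str]] = {
    "cardiology": lambda q: f"[cardiology answer to: {q}]",
    "neurology":  lambda q: f"[neurology answer to: {q}]",
    "general":    lambda q: f"[general answer to: {q}]",
}

def route(query: str) -> List[str]:
    """Router agent: pick which specialists should see the query."""
    picks = [name for name in ("cardiology", "neurology")
             if name in query.lower()]
    return picks or ["general"]

def orchestrate(query: str) -> str:
    """Orchestrator agent: collect specialist outputs and merge them."""
    answers = [SPECIALISTS[name](query) for name in route(query)]
    return " | ".join(answers)

print(orchestrate("cardiology question about arrhythmia"))
```

The key property is that no single model sees every query: the router fans work out, and the orchestrator only merges.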
4. MATA: Multi-Agent with Small Model Tools (Feb 2026)
arXiv:2602.09642
Key Findings:
- Uses small language models as tools, with an algorithm designed to minimize expensive LLM agent calls
- Careful orchestration of multiple reasoning pathways yields scalable and reliable results
- Strong performance across 10 different LLMs
Implications:
- Fast task runners (1.2B) should handle all simple operations
- Reserve 14B model for complex tasks only
- Expensive model calls should be minimized through smart routing
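The routing implication above can be sketched as a minimal dispatcher, with mechanical operations going to the 1.2B runner and everything else to the 14B coder. The model names and call accounting are illustrative assumptions, not MATA's actual algorithm:

```python
# Sketch: send cheap mechanical ops to a small model and reserve the
# expensive model for complex work, minimizing expensive LLM calls.
CHEAP_OPS = {"grep", "read", "run"}

def dispatch(task: str, op: str, calls: dict) -> str:
    """Route an operation; tally how often each model tier is invoked."""
    if op in CHEAP_OPS:
        calls["runner_1.2b"] = calls.get("runner_1.2b", 0) + 1
        return f"runner handled {op}: {task}"
    calls["coder_14b"] = calls.get("coder_14b", 0) + 1
    return f"coder handled {op}: {task}"

calls: dict = {}
for task, op in [("find TODOs", "grep"), ("read config", "read"),
                 ("refactor module", "code")]:
    dispatch(task, op, calls)
print(calls)  # -> {'runner_1.2b': 2, 'coder_14b': 1}
```

Even in this toy run, the expensive 14B model is invoked once while the runner absorbs the mechanical work.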
5. From Biased Chatbots to Biased Agents (Feb 2026)
arXiv:2602.12285
Key Findings:
- Persona assignments can degrade performance by up to 26.2%
- Task-irrelevant persona cues introduce implicit biases
- "Persona assignments can introduce implicit biases and increase behavioral volatility"
Implications:
- Norse mythology naming is fine, but keep persona definitions task-focused and minimal
- Don't over-engineer personalities
- Focus on capability descriptions, not character roles
6. Emergent Coordination in Multi-Agent Systems (Oct 2025)
arXiv:2510.05174
Key Findings:
- Multi-agent systems can be steered from "mere aggregates" to "higher-order collectives"
- Best coordination: Personas + "think about what other agents might do" instruction
- "Effective performance requires both alignment on shared objectives and complementary contributions"
Implications:
- Agents need awareness of other agents' roles
- Thor should explicitly mention Odin's capabilities in specs
- Shared understanding of project structure is critical
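One way to give agents awareness of each other is a shared registry whose entries are prepended to every spec. The registry contents below restate the roles this document assigns to Thor, Odin, and Loki; the `spec_preamble` helper itself is an illustrative assumption:

```python
# Sketch: a shared agent registry so each spec can reference the
# other agents' roles, per the coordination finding above.
AGENTS = {
    "Thor": "orchestrator (4B): writes specs, coordinates tasks",
    "Odin": "coder (14B): implements specs, makes architecture decisions",
    "Loki": "reviewer: verifies output against the spec",
}

def spec_preamble(author: str) -> str:
    """Prepend every spec with what the other agents will do."""
    others = [f"- {name}: {role}" for name, role in AGENTS.items()
              if name != author]
    return "Other agents on this task:\n" + "\n".join(others)

print(spec_preamble("Thor"))
```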
Summary: What Research Validates
✅ Our Current Approach (Heterogeneous Agents)
- 4B dispatcher + 14B coder + 1.2B runners is correct
- Diversity beats homogeneous scaling
- Small models work when orchestrated properly
⚠️ Areas for Improvement
- Add non-functional requirements to specs (security, performance)
- Minimize expensive model calls - use 1.2B agents for simple tasks
- Keep personas minimal - avoid character over-engineering
- Agent awareness - specs should reference other agents' capabilities
- Context file maintenance - AGENTS.md should evolve, not be static
❌ What Failed in Koko Blog Implementation
- Thor didn't enforce stack constraints - allowed React when Astro was required
- No verification step - no check that code matched PLAN.md
- Missing guardrails - no security/performance requirements in specs
- Poor context - specs didn't reference Astro patterns explicitly
Recommendations for Better Agent Performance
1. Spec Template Improvements
Every spec sent to Odin should include:
STACK ENFORCEMENT (NON-NEGOTIABLE):
- Framework: [specific framework]
- Forbidden: [list of banned libraries/patterns]
- Required patterns: [specific conventions]
- Verification: [how to check compliance]
NON-FUNCTIONAL REQUIREMENTS:
- Security: [specific constraints]
- Performance: [targets]
- Bundle size: [limits]
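A spec can be rejected before dispatch if it is missing the sections the template requires. The section names mirror the template above; the `validate_spec` helper is an illustrative sketch:

```python
# Sketch: gate dispatch on the spec containing the required sections.
REQUIRED_SECTIONS = [
    "STACK ENFORCEMENT",
    "NON-FUNCTIONAL REQUIREMENTS",
]

def validate_spec(spec: str) -> list:
    """Return the list of required sections missing from the spec text."""
    return [s for s in REQUIRED_SECTIONS if s not in spec]

spec = """STACK ENFORCEMENT (NON-NEGOTIABLE):
- Framework: Astro
- Forbidden: React, Vue
"""
print(validate_spec(spec))  # -> ['NON-FUNCTIONAL REQUIREMENTS']
```

An orchestrator like Thor would refuse to send the spec to Odin until this list is empty.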
2. Pre-Implementation Verification
Before coding starts:
- Thor reads existing files of the target type
- Explicitly lists patterns to follow
- References Astro docs for APIs
- States "use ONLY [framework] patterns"
3. Post-Implementation Checklist
After Odin completes:
- Verify framework compliance (grep for banned imports)
- Check bundle size impact
- Validate against PLAN.md line-by-line
- Run build and check output
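The "grep for banned imports" step above can be automated as a small compliance check. The banned list and sample source are illustrative (matching the Koko blog failure, where React slipped into an Astro project):

```python
# Sketch: scan source text for imports of banned packages,
# implementing the framework-compliance step of the checklist.
import re

BANNED_IMPORTS = ["react", "vue"]  # forbidden when the stack is Astro

def find_violations(source: str) -> list:
    """Return banned packages imported anywhere in the given source."""
    imported = re.findall(r'from\s+["\']([^"\']+)["\']', source)
    return [pkg for pkg in BANNED_IMPORTS
            if any(name == pkg or name.startswith(pkg + "/")
                   for name in imported)]

sample = 'import { useState } from "react";\n'
print(find_violations(sample))  # -> ['react']
```

A nonempty result would fail the review before the build step even runs.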
4. Agent Coordination Improvements
- Thor specs should explicitly mention: "Odin will implement, Loki will verify"
- Include "Review Requirements" section in specs
- Loki should check both functional and non-functional requirements
Research-Backed Configuration Adjustments
Effective Agent Hierarchy (Validated)
Level 1: Orchestrators (4B)
- Task coordination
- Context extraction
- Spec writing with guardrails
Level 2: Specialist (14B)
- Complex coding only
- Architecture decisions
- Deep problem solving
Level 3: Fast Runners (1.2B)
- Grep, read, run
- No coding
- Parallel execution
Optimal Work Distribution
- 2 diverse agents > 16 homogeneous agents
- Our 3-tier setup (4B/14B/1.2B) matches the patterns these papers validate
- Don't add more agents of same type
- Differentiate by capability, not just name
Last updated: 2026-03-01
Sources: arXiv papers reviewed March 2026