# Multi-Agent LLM Research Findings
## Papers Reviewed
### 1. Agent READMEs: An Empirical Study of Context Files for Agentic Coding (Nov 2025)
**arXiv:2511.12884**
**Key Findings:**
- Analyzed **2,303 agent context files** from 1,925 repositories
- Context files evolve like **configuration code** through frequent, small additions
- Developers prioritize **functional context**:
  - Build/run commands: **62.3%**
  - Implementation details: **69.9%**
  - Architecture: **67.7%**
- **Critical gap**: non-functional requirements are rarely specified:
  - Security: only **14.5%**
  - Performance: only **14.5%**
- Files are often **"complex, difficult-to-read artifacts"**
**Implications for Our Setup:**
- AGENTS.md files are effective but need guardrails
- Add explicit security/performance constraints
- Keep persona definitions minimal and task-focused
- Context files should be living documents, not static
---
### 2. Understanding Agent Scaling via Diversity (Feb 2026)
**arXiv:2602.03794**
**Key Findings:**
- **Homogeneous agents saturate early** - diminishing returns with more of same model
- **Heterogeneity wins**: Different models, prompts, or tools yield substantial gains
- **2 diverse agents can match/exceed performance of 16 homogeneous agents**
- Performance bounded by task uncertainty, not agent count
- Homogeneous outputs are strongly correlated; heterogeneous agents provide **complementary evidence**
**Implications for Our Setup:**
- **VALIDATED**: 4B orchestrator + 14B coder + 1.2B runners is the right pattern
- The diversity in model sizes creates effective specialization
- Small models for mechanical tasks (grep, read, run)
- Large models for complex reasoning (coding, architecture)
---
### 3. SOLVE-Med: Specialized Orchestration with Small Models (Nov 2025)
**arXiv:2511.03542**
**Key Findings:**
- Uses **10 specialized models (1B each)** orchestrated together
- Outperforms standalone models up to **14B parameters**
- Architecture: Router Agent → Specialists → Orchestrator Agent
- Enables local deployment while beating much larger models
**Implications:**
- Our 1.2B fast runners pattern is validated by research
- Router/Orchestrator separation is effective
- Small specialists + coordination > single large model
---
### 4. MATA: Multi-Agent with Small Model Tools (Feb 2026)
**arXiv:2602.09642**
**Key Findings:**
- Uses **small language models** as tools, with an algorithm designed to **minimize expensive LLM agent calls**
- "Careful orchestration of multiple reasoning pathways yields scalable and reliable" results
- Strong performance across **10 different LLMs**
**Implications:**
- Fast task runners (1.2B) should handle all simple operations
- Reserve 14B model for complex tasks only
- Expensive model calls should be minimized through smart routing
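The routing rule described above can be sketched in a few lines. The model tier names and the keyword heuristic below are illustrative assumptions, not taken from the paper:

```python
# Cost-aware routing sketch: send simple mechanical operations to a
# small "fast runner" model and reserve the large coder model for
# complex work. Tier names and keywords are hypothetical.

SIMPLE_OPS = {"grep", "read", "run", "list", "ls"}

def route_task(task: str) -> str:
    """Pick a model tier for a natural-language task description."""
    first_word = task.split()[0].lower()
    if first_word in SIMPLE_OPS:
        return "runner-1.2b"       # cheap, parallelizable
    if "architecture" in task.lower() or "implement" in task.lower():
        return "coder-14b"         # expensive; complex reasoning only
    return "orchestrator-4b"       # default: coordination / spec writing

print(route_task("grep for banned imports"))    # -> runner-1.2b
print(route_task("implement the blog layout"))  # -> coder-14b
```

A real router would classify tasks with the 4B orchestrator itself rather than keywords; the point is that the expensive tier is only reached when the cheap checks fall through.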
---
### 5. From Biased Chatbots to Biased Agents (Feb 2026)
**arXiv:2602.12285**
**Key Findings:**
- Persona assignments can degrade performance by up to **26.2%**
- Task-irrelevant persona cues introduce implicit biases
- "Persona assignments can introduce implicit biases and increase behavioral volatility"
**Implications:**
- Norse mythology naming is fine, but keep persona definitions **task-focused and minimal**
- Don't over-engineer personalities
- Focus on capability descriptions, not character roles
---
### 6. Emergent Coordination in Multi-Agent Systems (Oct 2025)
**arXiv:2510.05174**
**Key Findings:**
- Multi-agent systems can be steered from "mere aggregates" to "higher-order collectives"
- Best coordination: Personas + "think about what other agents might do" instruction
- "Effective performance requires both alignment on shared objectives and complementary contributions"
**Implications:**
- Agents need awareness of other agents' roles
- Thor should explicitly mention Odin's capabilities in specs
- Shared understanding of project structure is critical
---
## Summary: What Research Validates
### ✅ Our Current Approach (Heterogeneous Agents)
- 4B orchestrator + 14B coder + 1.2B runners is **correct**
- Diversity beats homogeneous scaling
- Small models work when orchestrated properly
### ⚠️ Areas for Improvement
1. **Add non-functional requirements** to specs (security, performance)
2. **Minimize expensive model calls** - use 1.2B agents for simple tasks
3. **Keep personas minimal** - avoid character over-engineering
4. **Agent awareness** - specs should reference other agents' capabilities
5. **Context file maintenance** - AGENTS.md should evolve, not be static
### ❌ What Failed in Koko Blog Implementation
- **Thor didn't enforce stack constraints** - allowed React when Astro was required
- **No verification step** - no check that code matched PLAN.md
- **Missing guardrails** - no security/performance requirements in specs
- **Poor context** - specs didn't reference Astro patterns explicitly
---
## Recommendations for Better Agent Performance
### 1. Spec Template Improvements
Every spec sent to Odin should include:
```
STACK ENFORCEMENT (NON-NEGOTIABLE):
- Framework: [specific framework]
- Forbidden: [list of banned libraries/patterns]
- Required patterns: [specific conventions]
- Verification: [how to check compliance]
NON-FUNCTIONAL REQUIREMENTS:
- Security: [specific constraints]
- Performance: [targets]
- Bundle size: [limits]
```
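As an illustration only (every value below is an assumption, not taken from the research), a filled-in template for the Koko blog might read:

```
STACK ENFORCEMENT (NON-NEGOTIABLE):
- Framework: Astro
- Forbidden: React, react-dom, any client-side framework imports
- Required patterns: .astro components, content collections
- Verification: grep src/ for "react" imports; run the Astro build
NON-FUNCTIONAL REQUIREMENTS:
- Security: no inline scripts; sanitize user-supplied markdown
- Performance: Lighthouse performance score >= 90
- Bundle size: no client-side JS beyond islands that are actually needed
```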
### 2. Pre-Implementation Verification
Before coding starts:
- Thor reads existing files of the target type
- Explicitly lists patterns to follow
- References Astro docs for APIs
- States "use ONLY [framework] patterns"
### 3. Post-Implementation Checklist
After Odin completes:
- Verify framework compliance (grep for banned imports)
- Check bundle size impact
- Validate against PLAN.md line-by-line
- Run build and check output
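The first checklist item can be automated. A minimal sketch, assuming a banned-import list and the file extensions of an Astro/TypeScript project (both are illustrative assumptions):

```python
# Scan a source tree for banned imports (e.g. React in an Astro-only
# project). The BANNED patterns and extension set are hypothetical.
from pathlib import Path

BANNED = ("from react", "import react", 'from "react"')
EXTENSIONS = {".astro", ".ts", ".tsx", ".js", ".jsx"}

def find_banned_imports(root: str) -> list[str]:
    """Return file:line entries that mention a banned import."""
    hits = []
    for path in Path(root).rglob("*"):
        if not path.is_file() or path.suffix not in EXTENSIONS:
            continue
        lines = path.read_text(errors="ignore").splitlines()
        for lineno, line in enumerate(lines, 1):
            if any(pattern in line.lower() for pattern in BANNED):
                hits.append(f"{path}:{lineno}")
    return hits
```

Loki (or a pre-commit hook) could run this after every Odin change and fail the review if the list is non-empty.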
### 4. Agent Coordination Improvements
- Thor specs should explicitly mention: "Odin will implement, Loki will verify"
- Include "Review Requirements" section in specs
- Loki should check both functional and non-functional requirements
---
## Research-Backed Configuration Adjustments
### Effective Agent Hierarchy (Validated)
```
Level 1: Orchestrators (4B)
- Task coordination
- Context extraction
- Spec writing with guardrails
Level 2: Specialist (14B)
- Complex coding only
- Architecture decisions
- Deep problem solving
Level 3: Fast Runners (1.2B)
- Grep, read, run
- No coding
- Parallel execution
```
### Optimal Work Distribution
- **2 diverse agents > 16 homogeneous agents**
- Our 3-tier setup (4B/14B/1.2B) is research-optimal
- Don't add more agents of same type
- Differentiate by capability, not just name
---
*Last updated: 2026-03-01*
*Sources: arXiv papers reviewed March 2026*