# Multi-Agent LLM Research Findings

## Papers Reviewed

### 1. Agent READMEs: An Empirical Study of Context Files for Agentic Coding (Nov 2025)

**arXiv:2511.12884**

**Key Findings:**

- Analyzed **2,303 agent context files** from 1,925 repositories
- Context files evolve like **configuration code** through frequent, small additions
- Developers prioritize **functional context**:
  - Build/run commands: **62.3%**
  - Implementation details: **69.9%**
  - Architecture: **67.7%**
- **Critical gap**: non-functional requirements are rarely specified:
  - Security: only **14.5%**
  - Performance: only **14.5%**
- Files are often **"complex, difficult-to-read artifacts"**

**Implications for Our Setup:**

- AGENTS.md files are effective but need guardrails
- Add explicit security/performance constraints
- Keep persona definitions minimal and task-focused
- Context files should be living documents, not static ones

---

### 2. Understanding Agent Scaling via Diversity (Feb 2026)

**arXiv:2602.03794**

**Key Findings:**

- **Homogeneous agents saturate early**: adding more copies of the same model yields diminishing returns
- **Heterogeneity wins**: different models, prompts, or tools yield substantial gains
- **2 diverse agents can match or exceed 16 homogeneous agents**
- Performance is bounded by task uncertainty, not agent count
- Homogeneous outputs are strongly correlated; heterogeneous agents provide **complementary evidence**

**Implications for Our Setup:**

✅ **VALIDATED**: 4B orchestrator + 14B coder + 1.2B runners is the right pattern

- Diversity in model sizes creates effective specialization
- Small models for mechanical tasks (grep, read, run)
- Large models for complex reasoning (coding, architecture)
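The correlation point can be illustrated with a toy majority-vote simulation (a sketch only; the per-agent accuracy and the perfect-correlation assumption are illustrative, not values from the paper):

```python
import random

random.seed(0)
TRIALS = 10_000
P_CORRECT = 0.7  # assumed per-agent accuracy

def majority(votes):
    # Three voters: the ensemble is correct when at least two agree
    return sum(votes) >= 2

homo_hits = div_hits = 0
for _ in range(TRIALS):
    # Homogeneous agents: same model, near-identical outputs ->
    # modeled here as perfectly correlated (one draw shared by all three)
    shared = random.random() < P_CORRECT
    homo_hits += majority([shared, shared, shared])

    # Diverse agents: independent error profiles -> independent draws
    div_hits += majority([random.random() < P_CORRECT for _ in range(3)])

print(f"homogeneous x3: {homo_hits / TRIALS:.3f}")  # ~0.70, no gain over one agent
print(f"diverse x3:     {div_hits / TRIALS:.3f}")   # ~0.78, independent errors cancel
```

Under these assumptions, tripling a homogeneous ensemble buys nothing, while three diverse agents already beat any number of correlated copies.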

---

### 3. SOLVE-Med: Specialized Orchestration with Small Models (Nov 2025)

**arXiv:2511.03542**

**Key Findings:**

- Uses **10 specialized models (1B each)** orchestrated together
- Outperforms standalone models of up to **14B parameters**
- Architecture: Router Agent → Specialists → Orchestrator Agent
- Enables local deployment while beating much larger models

**Implications:**

- Our 1.2B fast-runner pattern is validated by this research
- Router/orchestrator separation is effective
- Small specialists + coordination > single large model

---

### 4. MATA: Multi-Agent with Small Model Tools (Feb 2026)

**arXiv:2602.09642**

**Key Findings:**

- Uses **small language models** as tools, with an algorithm designed to **minimize expensive LLM agent calls**
- "Careful orchestration of multiple reasoning pathways yields scalable and reliable" results
- Strong performance across **10 different LLMs**

**Implications:**

- Fast task runners (1.2B) should handle all simple operations
- Reserve the 14B model for complex tasks only
- Expensive model calls should be minimized through smart routing
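A minimal routing sketch of this idea (the tier names and the keyword heuristic are illustrative assumptions for our setup, not the paper's algorithm):

```python
# Route each task to the cheapest agent tier that can handle it, so the
# expensive 14B model is only called when genuinely needed.
SIMPLE_VERBS = {"grep", "read", "run", "list", "ls", "cat"}

def route(task: str) -> str:
    """Return the agent tier for a task description (toy heuristic)."""
    verb = task.split()[0].lower()
    if verb in SIMPLE_VERBS:
        return "runner-1.2b"      # mechanical operation, cheap model
    if verb in {"plan", "dispatch", "summarize"}:
        return "orchestrator-4b"  # coordination, mid-size model
    return "coder-14b"            # complex reasoning, expensive model

tasks = ["grep TODO src/", "plan sprint backlog", "refactor auth module"]
print([route(t) for t in tasks])
# → ['runner-1.2b', 'orchestrator-4b', 'coder-14b']
```

A real router would look at more than the first word, but the invariant is the same: the default (most capable, most expensive) tier is the fallback, never the first choice.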

---

### 5. From Biased Chatbots to Biased Agents (Jan 2026)

**arXiv:2602.12285**

**Key Findings:**

- Persona assignments can degrade performance by up to **26.2%**
- Task-irrelevant persona cues introduce implicit biases
- "Persona assignments can introduce implicit biases and increase behavioral volatility"

**Implications:**

- Norse mythology naming is fine, but keep persona definitions **task-focused and minimal**
- Don't over-engineer personalities
- Focus on capability descriptions, not character roles

---

### 6. Emergent Coordination in Multi-Agent Systems (Oct 2025)

**arXiv:2510.05174**

**Key Findings:**

- Multi-agent systems can be steered from "mere aggregates" to "higher-order collectives"
- Best coordination: personas plus a "think about what other agents might do" instruction
- "Effective performance requires both alignment on shared objectives and complementary contributions"

**Implications:**

- Agents need awareness of other agents' roles
- Thor should explicitly mention Odin's capabilities in specs
- Shared understanding of project structure is critical

---

## Summary: What Research Validates

### ✅ Our Current Approach (Heterogeneous Agents)

- 4B dispatcher + 14B coder + 1.2B runners is **correct**
- Diversity beats homogeneous scaling
- Small models work when orchestrated properly

### ⚠️ Areas for Improvement

1. **Add non-functional requirements** to specs (security, performance)
2. **Minimize expensive model calls**: use 1.2B agents for simple tasks
3. **Keep personas minimal**: avoid character over-engineering
4. **Agent awareness**: specs should reference other agents' capabilities
5. **Context file maintenance**: AGENTS.md should evolve, not stay static

### ❌ What Failed in Koko Blog Implementation

- **Thor didn't enforce stack constraints**: allowed React when Astro was required
- **No verification step**: no check that code matched PLAN.md
- **Missing guardrails**: no security/performance requirements in specs
- **Poor context**: specs didn't reference Astro patterns explicitly

---

## Recommendations for Better Agent Performance

### 1. Spec Template Improvements

Every spec sent to Odin should include:

```
STACK ENFORCEMENT (NON-NEGOTIABLE):
- Framework: [specific framework]
- Forbidden: [list of banned libraries/patterns]
- Required patterns: [specific conventions]
- Verification: [how to check compliance]

NON-FUNCTIONAL REQUIREMENTS:
- Security: [specific constraints]
- Performance: [targets]
- Bundle size: [limits]
```

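A small sketch of how Thor could sanity-check a spec draft against this template before dispatching it (the section names follow the template; the function itself is a hypothetical helper, not existing tooling):

```python
# Template sections every outgoing spec must contain
REQUIRED_SECTIONS = ("STACK ENFORCEMENT", "NON-FUNCTIONAL REQUIREMENTS")

def missing_sections(spec: str) -> list[str]:
    """Return required template sections absent from a spec draft."""
    return [s for s in REQUIRED_SECTIONS if s not in spec]

draft = "STACK ENFORCEMENT (NON-NEGOTIABLE):\n- Framework: Astro\n"
print(missing_sections(draft))  # → ['NON-FUNCTIONAL REQUIREMENTS']
```

A non-empty result means the spec goes back to Thor before Odin ever sees it.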
### 2. Pre-Implementation Verification

Before coding starts:

- Thor reads existing files of the target type
- Explicitly lists patterns to follow
- References Astro docs for APIs
- States "use ONLY [framework] patterns"

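The first step can be sketched as follows (a hypothetical helper; the `.astro` extension is just our example stack):

```python
from pathlib import Path

def exemplar_files(root: str, ext: str, limit: int = 3) -> list[str]:
    """Collect a few existing files of the target type for Thor to read
    before writing a spec, so the spec cites real in-repo patterns."""
    return [str(p) for p in sorted(Path(root).rglob(f"*{ext}"))[:limit]]

# e.g. exemplar_files("src/pages", ".astro") -> up to 3 existing pages
```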
### 3. Post-Implementation Checklist

After Odin completes:

- Verify framework compliance (grep for banned imports)
- Check bundle size impact
- Validate against PLAN.md line by line
- Run the build and check output

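The first checklist item can be automated with a short script (a sketch; the banned list and file extensions are assumptions for an Astro-only project, and in practice would come from the spec's Forbidden section):

```python
import re
from pathlib import Path

# Hypothetical guardrail: imports that signal a stack violation
BANNED = [r"from ['\"]react['\"]", r"from ['\"]vue['\"]"]

def find_banned_imports(root: str) -> list[tuple[str, int]]:
    """Return (file, line_number) pairs where a banned import appears."""
    hits = []
    for path in Path(root).rglob("*"):
        if path.suffix not in {".js", ".ts", ".jsx", ".tsx", ".astro"}:
            continue
        if not path.is_file():
            continue
        for n, line in enumerate(path.read_text(errors="ignore").splitlines(), 1):
            if any(re.search(p, line) for p in BANNED):
                hits.append((str(path), n))
    return hits

# A non-empty result -> Loki flags the change for rework before merge.
```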
### 4. Agent Coordination Improvements

- Thor specs should explicitly mention: "Odin will implement, Loki will verify"
- Include a "Review Requirements" section in specs
- Loki should check both functional and non-functional requirements

---

## Research-Backed Configuration Adjustments

### Effective Agent Hierarchy (Validated)

```
Level 1: Orchestrators (4B)
- Task coordination
- Context extraction
- Spec writing with guardrails

Level 2: Specialist (14B)
- Complex coding only
- Architecture decisions
- Deep problem solving

Level 3: Fast Runners (1.2B)
- Grep, read, run
- No coding
- Parallel execution
```

### Optimal Work Distribution

- **2 diverse agents > 16 homogeneous agents**
- Our 3-tier setup (4B/14B/1.2B) is research-optimal
- Don't add more agents of the same type
- Differentiate by capability, not just name

---

*Last updated: 2026-03-01*

*Sources: arXiv papers reviewed March 2026*