# Multi-Agent LLM Research Findings
## Papers Reviewed
### 1. Agent READMEs: An Empirical Study of Context Files for Agentic Coding (Nov 2025)
**arXiv:2511.12884**
**Key Findings:**
- Analyzed **2,303 agent context files** from 1,925 repositories
- Context files evolve like **configuration code** through frequent, small additions
- Developers prioritize **functional context**:
  - Build/run commands: **62.3%**
  - Implementation details: **69.9%**
  - Architecture: **67.7%**
- **Critical gap**: non-functional requirements are rarely specified:
  - Security: only **14.5%**
  - Performance: only **14.5%**
- Files are often **"complex, difficult-to-read artifacts"**
**Implications for Our Setup:**
- AGENTS.md files are effective but need guardrails
- Add explicit security/performance constraints
- Keep persona definitions minimal and task-focused
- Context files should be living documents, not static
---
### 2. Understanding Agent Scaling via Diversity (Feb 2026)
**arXiv:2602.03794**
**Key Findings:**
- **Homogeneous agents saturate early** - diminishing returns with more of same model
- **Heterogeneity wins**: Different models, prompts, or tools yield substantial gains
- **2 diverse agents can match/exceed performance of 16 homogeneous agents**
- Performance bounded by task uncertainty, not agent count
- Homogeneous outputs are strongly correlated; heterogeneous agents provide **complementary evidence**
**Implications for Our Setup:**
- **VALIDATED**: 4B orchestrator + 14B coder + 1.2B runners is the right pattern
- The diversity in model sizes creates effective specialization
- Small models for mechanical tasks (grep, read, run)
- Large models for complex reasoning (coding, architecture)
---
### 3. SOLVE-Med: Specialized Orchestration with Small Models (Nov 2025)
**arXiv:2511.03542**
**Key Findings:**
- Uses **10 specialized models (1B each)** orchestrated together
- Outperforms standalone models up to **14B parameters**
- Architecture: Router Agent → Specialists → Orchestrator Agent
- Enables local deployment while beating much larger models
**Implications:**
- Our 1.2B fast runners pattern is validated by research
- Router/Orchestrator separation is effective
- Small specialists + coordination > single large model
---
### 4. MATA: Multi-Agent with Small Model Tools (Feb 2026)
**arXiv:2602.09642**
**Key Findings:**
- Uses **small language models** as tools, with an algorithm designed to **minimize expensive LLM agent calls**
- "Careful orchestration of multiple reasoning pathways yields scalable and reliable" results
- Strong performance across **10 different LLMs**
**Implications:**
- Fast task runners (1.2B) should handle all simple operations
- Reserve 14B model for complex tasks only
- Expensive model calls should be minimized through smart routing
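The routing rule described above can be sketched in a few lines. The model tier names and the keyword heuristic below are illustrative assumptions, not taken from the paper:

```python
# Cost-aware routing sketch: send simple mechanical operations to a
# small "fast runner" model and reserve the large coder model for
# complex work. Tier names and keywords are hypothetical.

SIMPLE_OPS = {"grep", "read", "run", "list", "ls"}

def route_task(task: str) -> str:
    """Pick a model tier for a natural-language task description."""
    first_word = task.split()[0].lower()
    if first_word in SIMPLE_OPS:
        return "runner-1.2b"       # cheap, parallelizable
    if "architecture" in task.lower() or "implement" in task.lower():
        return "coder-14b"         # expensive; complex reasoning only
    return "orchestrator-4b"       # default: coordination / spec writing

print(route_task("grep for banned imports"))    # -> runner-1.2b
print(route_task("implement the blog layout"))  # -> coder-14b
```

A real router would classify tasks with the 4B orchestrator itself rather than keywords; the point is that the expensive tier is only reached when the cheap checks fall through.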
---
### 5. From Biased Chatbots to Biased Agents (Feb 2026)
**arXiv:2602.12285**
**Key Findings:**
- Persona assignments can degrade performance by up to **26.2%**
- Task-irrelevant persona cues introduce implicit biases
- "Persona assignments can introduce implicit biases and increase behavioral volatility"
**Implications:**
- Norse mythology naming is fine, but keep persona definitions **task-focused and minimal**
- Don't over-engineer personalities
- Focus on capability descriptions, not character roles
---
### 6. Emergent Coordination in Multi-Agent Systems (Oct 2025)
**arXiv:2510.05174**
**Key Findings:**
- Multi-agent systems can be steered from "mere aggregates" to "higher-order collectives"
- Best coordination: Personas + "think about what other agents might do" instruction
- "Effective performance requires both alignment on shared objectives and complementary contributions"
**Implications:**
- Agents need awareness of other agents' roles
- Thor should explicitly mention Odin's capabilities in specs
- Shared understanding of project structure is critical
---
## Summary: What Research Validates
### ✅ Our Current Approach (Heterogeneous Agents)
- 4B orchestrator + 14B coder + 1.2B runners is **correct**
- Diversity beats homogeneous scaling
- Small models work when orchestrated properly
### ⚠️ Areas for Improvement
1. **Add non-functional requirements** to specs (security, performance)
2. **Minimize expensive model calls** - use 1.2B agents for simple tasks
3. **Keep personas minimal** - avoid character over-engineering
4. **Agent awareness** - specs should reference other agents' capabilities
5. **Context file maintenance** - AGENTS.md should evolve, not be static
### ❌ What Failed in Koko Blog Implementation
- **Thor didn't enforce stack constraints** - allowed React when Astro was required
- **No verification step** - no check that code matched PLAN.md
- **Missing guardrails** - no security/performance requirements in specs
- **Poor context** - specs didn't reference Astro patterns explicitly
---
## Recommendations for Better Agent Performance
### 1. Spec Template Improvements
Every spec sent to Odin should include:
```
STACK ENFORCEMENT (NON-NEGOTIABLE):
- Framework: [specific framework]
- Forbidden: [list of banned libraries/patterns]
- Required patterns: [specific conventions]
- Verification: [how to check compliance]
NON-FUNCTIONAL REQUIREMENTS:
- Security: [specific constraints]
- Performance: [targets]
- Bundle size: [limits]
```
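As an illustration only (every value below is an assumption, not taken from the research), a filled-in template for the Koko blog might read:

```
STACK ENFORCEMENT (NON-NEGOTIABLE):
- Framework: Astro
- Forbidden: React, react-dom, any client-side framework imports
- Required patterns: .astro components, content collections
- Verification: grep src/ for "react" imports; run the Astro build
NON-FUNCTIONAL REQUIREMENTS:
- Security: no inline scripts; sanitize user-supplied markdown
- Performance: Lighthouse performance score >= 90
- Bundle size: no client-side JS beyond islands that are actually needed
```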
### 2. Pre-Implementation Verification
Before coding starts:
- Thor reads existing files of the target type
- Explicitly lists patterns to follow
- References Astro docs for APIs
- States "use ONLY [framework] patterns"
### 3. Post-Implementation Checklist
After Odin completes:
- Verify framework compliance (grep for banned imports)
- Check bundle size impact
- Validate against PLAN.md line-by-line
- Run build and check output
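The first checklist item can be automated. A minimal sketch, assuming a banned-import list and the file extensions of an Astro/TypeScript project (both are illustrative assumptions):

```python
# Scan a source tree for banned imports (e.g. React in an Astro-only
# project). The BANNED patterns and extension set are hypothetical.
from pathlib import Path

BANNED = ("from react", "import react", 'from "react"')
EXTENSIONS = {".astro", ".ts", ".tsx", ".js", ".jsx"}

def find_banned_imports(root: str) -> list[str]:
    """Return file:line entries that mention a banned import."""
    hits = []
    for path in Path(root).rglob("*"):
        if not path.is_file() or path.suffix not in EXTENSIONS:
            continue
        lines = path.read_text(errors="ignore").splitlines()
        for lineno, line in enumerate(lines, 1):
            if any(pattern in line.lower() for pattern in BANNED):
                hits.append(f"{path}:{lineno}")
    return hits
```

Loki (or a pre-commit hook) could run this after every Odin change and fail the review if the list is non-empty.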
### 4. Agent Coordination Improvements
- Thor specs should explicitly mention: "Odin will implement, Loki will verify"
- Include "Review Requirements" section in specs
- Loki should check both functional and non-functional requirements
---
## Research-Backed Configuration Adjustments
### Effective Agent Hierarchy (Validated)
```
Level 1: Orchestrators (4B)
- Task coordination
- Context extraction
- Spec writing with guardrails
Level 2: Specialist (14B)
- Complex coding only
- Architecture decisions
- Deep problem solving
Level 3: Fast Runners (1.2B)
- Grep, read, run
- No coding
- Parallel execution
```
### Optimal Work Distribution
- **2 diverse agents > 16 homogeneous agents**
- Our 3-tier setup (4B/14B/1.2B) is research-optimal
- Don't add more agents of same type
- Differentiate by capability, not just name
---
*Last updated: 2026-03-01*
*Sources: arXiv papers reviewed March 2026*