# Multi-Agent LLM Research Findings

## Papers Reviewed

### 1. Agent READMEs: An Empirical Study of Context Files for Agentic Coding (Nov 2025)

**arXiv:2511.12884**

**Key Findings:**
- Analyzed **2,303 agent context files** from 1,925 repositories
- Context files evolve like **configuration code**, through frequent, small additions
- Developers prioritize **functional context**:
  - Build/run commands: **62.3%**
  - Implementation details: **69.9%**
  - Architecture: **67.7%**
- **Critical gap**: non-functional requirements are rarely specified:
  - Security: only **14.5%**
  - Performance: only **14.5%**
- Files are often **"complex, difficult-to-read artifacts"**

**Implications for Our Setup:**
- AGENTS.md files are effective but need guardrails
- Add explicit security/performance constraints
- Keep persona definitions minimal and task-focused
- Treat context files as living documents, not static ones

---

### 2. Understanding Agent Scaling via Diversity (Feb 2026)

**arXiv:2602.03794**

**Key Findings:**
- **Homogeneous agents saturate early**: adding more copies of the same model yields diminishing returns
- **Heterogeneity wins**: varying models, prompts, or tools yields substantial gains
- **2 diverse agents can match or exceed 16 homogeneous agents**
- Performance is bounded by task uncertainty, not agent count
- Homogeneous outputs are strongly correlated; heterogeneous agents provide **complementary evidence**

**Implications for Our Setup:**
- ✅ **VALIDATED**: the 4B orchestrator + 14B coder + 1.2B runners pattern is right
- Diversity in model sizes creates effective specialization
- Small models for mechanical tasks (grep, read, run)
- Large models for complex reasoning (coding, architecture)

---
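The split described above, small models for mechanical tasks and large models for complex reasoning, amounts to a routing rule. A minimal, hypothetical sketch (tier names and task labels are illustrative, not taken from the paper):

```python
# Hypothetical tier-based routing: pick the cheapest model tier
# that can handle a task. Names are illustrative assumptions.

MODEL_TIERS = {
    "runner": "1.2B",        # mechanical operations
    "orchestrator": "4B",    # coordination, spec writing
    "specialist": "14B",     # complex coding, architecture
}

MECHANICAL_TASKS = {"grep", "read", "run", "list_files"}


def route(task_type: str, requires_reasoning: bool) -> str:
    """Return the model size of the cheapest tier that can handle the task."""
    if task_type in MECHANICAL_TASKS:
        return MODEL_TIERS["runner"]
    if requires_reasoning:
        return MODEL_TIERS["specialist"]
    return MODEL_TIERS["orchestrator"]


print(route("grep", False))              # → 1.2B
print(route("implement_feature", True))  # → 14B
print(route("write_spec", False))        # → 4B
```

The point of the sketch is that routing is decided before any model is invoked, so the expensive tier is never consulted for work a runner can do.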
### 3. SOLVE-Med: Specialized Orchestration with Small Models (Nov 2025)

**arXiv:2511.03542**

**Key Findings:**
- Orchestrates **10 specialized models (1B each)**
- Outperforms standalone models of up to **14B parameters**
- Architecture: Router Agent → Specialists → Orchestrator Agent
- Enables local deployment while beating much larger models

**Implications:**
- Our 1.2B fast-runner pattern is validated by research
- Router/orchestrator separation is effective
- Small specialists + coordination > a single large model

---

### 4. MATA: Multi-Agent with Small Model Tools (Feb 2026)

**arXiv:2602.09642**

**Key Findings:**
- Uses **small language models** as tools to avoid expensive LLM calls
- The algorithm is explicitly designed to **minimize expensive LLM agent calls**
- "Careful orchestration of multiple reasoning pathways" yields scalable, reliable results
- Strong performance across **10 different LLMs**

**Implications:**
- Fast task runners (1.2B) should handle all simple operations
- Reserve the 14B model for complex tasks only
- Minimize expensive model calls through smart routing

---

### 5. From Biased Chatbots to Biased Agents (Jan 2026)

**arXiv:2602.12285**

**Key Findings:**
- Persona assignments can degrade performance by up to **26.2%**
- Task-irrelevant persona cues introduce implicit biases
- "Persona assignments can introduce implicit biases and increase behavioral volatility"

**Implications:**
- Norse mythology naming is fine, but keep persona definitions **task-focused and minimal**
- Don't over-engineer personalities
- Describe capabilities, not character roles

---
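The "task-focused and minimal" guidance can be checked mechanically before a persona ships. A hypothetical sketch (the cue list and prompt strings are illustrative assumptions, not from the paper):

```python
# Crude guardrail for persona definitions: flag character flavor
# that is irrelevant to the task. Cue words are an assumed,
# project-specific deny list, not a published taxonomy.

CHARACTER_CUES = ("wise", "ancient", "weary", "speaks in riddles")

TASK_FOCUSED = (
    "You are Odin, the implementation agent. "
    "Capabilities: write code, run tests, follow the spec exactly. "
    "Constraints: use only the framework named in the spec."
)

OVER_ENGINEERED = (
    "You are Odin, the wise all-father of Asgard, weary from "
    "centuries of wandering, who speaks in riddles and proverbs."
)


def is_task_focused(persona: str) -> bool:
    """Return True if the persona contains no task-irrelevant character cues."""
    lowered = persona.lower()
    return not any(cue in lowered for cue in CHARACTER_CUES)


print(is_task_focused(TASK_FOCUSED))      # → True
print(is_task_focused(OVER_ENGINEERED))   # → False
```

A keyword check like this is obviously shallow, but it is enough to catch the most common drift: personality prose accreting onto what should stay a capability description.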
### 6. Emergent Coordination in Multi-Agent Systems (Oct 2025)

**arXiv:2510.05174**

**Key Findings:**
- Multi-agent systems can be steered from "mere aggregates" to "higher-order collectives"
- Best coordination: personas plus a "think about what other agents might do" instruction
- "Effective performance requires both alignment on shared objectives and complementary contributions"

**Implications:**
- Agents need awareness of other agents' roles
- Thor should explicitly mention Odin's capabilities in specs
- A shared understanding of the project structure is critical

---

## Summary: What Research Validates

### ✅ Our Current Approach (Heterogeneous Agents)
- 4B dispatcher + 14B coder + 1.2B runners is **correct**
- Diversity beats homogeneous scaling
- Small models work when orchestrated properly

### ⚠️ Areas for Improvement
1. **Add non-functional requirements** to specs (security, performance)
2. **Minimize expensive model calls**: use 1.2B agents for simple tasks
3. **Keep personas minimal**: avoid character over-engineering
4. **Agent awareness**: specs should reference other agents' capabilities
5. **Context file maintenance**: AGENTS.md should evolve, not stay static

### ❌ What Failed in the Koko Blog Implementation
- **Thor didn't enforce stack constraints**: allowed React when Astro was required
- **No verification step**: nothing checked that the code matched PLAN.md
- **Missing guardrails**: no security/performance requirements in specs
- **Poor context**: specs didn't reference Astro patterns explicitly

---

## Recommendations for Better Agent Performance

### 1. Spec Template Improvements

Every spec sent to Odin should include:

```
STACK ENFORCEMENT (NON-NEGOTIABLE):
- Framework: [specific framework]
- Forbidden: [list of banned libraries/patterns]
- Required patterns: [specific conventions]
- Verification: [how to check compliance]

NON-FUNCTIONAL REQUIREMENTS:
- Security: [specific constraints]
- Performance: [targets]
- Bundle size: [limits]
```
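The spec template lends itself to a mechanical pre-flight check before a spec is handed to Odin. A minimal sketch, assuming specs are plain text and using the template's own section headers:

```python
# Hypothetical spec linter: report which required template sections
# are missing from a spec before it is dispatched.

REQUIRED_SECTIONS = [
    "STACK ENFORCEMENT",
    "NON-FUNCTIONAL REQUIREMENTS",
]


def lint_spec(spec_text: str) -> list[str]:
    """Return the required section headers missing from a spec."""
    return [s for s in REQUIRED_SECTIONS if s not in spec_text]


spec = "STACK ENFORCEMENT (NON-NEGOTIABLE):\n- Framework: Astro\n"
print(lint_spec(spec))  # → ['NON-FUNCTIONAL REQUIREMENTS']
```

Running this in the orchestrator would have caught the Koko blog failure mode at dispatch time: a spec with no non-functional requirements never reaches the coder.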
### 2. Pre-Implementation Verification

Before coding starts:
- Thor reads existing files of the target type
- Explicitly lists patterns to follow
- References Astro docs for APIs
- States "use ONLY [framework] patterns"

### 3. Post-Implementation Checklist

After Odin completes:
- Verify framework compliance (grep for banned imports)
- Check bundle-size impact
- Validate against PLAN.md line by line
- Run the build and check the output

### 4. Agent Coordination Improvements
- Thor's specs should explicitly state: "Odin will implement, Loki will verify"
- Include a "Review Requirements" section in specs
- Loki should check both functional and non-functional requirements

---

## Research-Backed Configuration Adjustments

### Effective Agent Hierarchy (Validated)

```
Level 1: Orchestrators (4B)
- Task coordination
- Context extraction
- Spec writing with guardrails

Level 2: Specialist (14B)
- Complex coding only
- Architecture decisions
- Deep problem solving

Level 3: Fast Runners (1.2B)
- Grep, read, run
- No coding
- Parallel execution
```

### Optimal Work Distribution
- **2 diverse agents > 16 homogeneous agents**
- Our 3-tier setup (4B/14B/1.2B) matches the research optimum
- Don't add more agents of the same type
- Differentiate by capability, not just name

---

*Last updated: 2026-03-01*
*Sources: arXiv papers reviewed March 2026*