Multi-Agent LLM Research Findings

Papers Reviewed

1. Agent READMEs: An Empirical Study of Context Files for Agentic Coding (Nov 2025)

arXiv:2511.12884

Key Findings:

  • Analyzed 2,303 agent context files from 1,925 repositories
  • Context files evolve like configuration code through frequent, small additions
  • Developers prioritize functional context:
    • Build/run commands: 62.3%
    • Implementation details: 69.9%
    • Architecture: 67.7%
  • Critical gap: Non-functional requirements rarely specified:
    • Security: only 14.5%
    • Performance: only 14.5%
  • Files are often "complex, difficult-to-read artifacts"

Implications for Our Setup:

  • AGENTS.md files are effective but need guardrails
  • Add explicit security/performance constraints
  • Keep persona definitions minimal and task-focused
  • Context files should be living documents, not static

2. Understanding Agent Scaling via Diversity (Feb 2026)

arXiv:2602.03794

Key Findings:

  • Homogeneous agents saturate early: diminishing returns from adding more of the same model
  • Heterogeneity wins: Different models, prompts, or tools yield substantial gains
  • 2 diverse agents can match/exceed performance of 16 homogeneous agents
  • Performance bounded by task uncertainty, not agent count
  • Homogeneous outputs are strongly correlated; heterogeneous agents provide complementary evidence

Implications for Our Setup:

  • VALIDATED: 4B orchestrator + 14B coder + 1.2B runners is the right pattern
  • The diversity in model sizes creates effective specialization
  • Small models for mechanical tasks (grep, read, run)
  • Large models for complex reasoning (coding, architecture)
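
This tiered split can be sketched as a simple dispatch rule. The model names and the keyword heuristic below are illustrative assumptions, not from the paper; a real router would likely use the orchestrator model itself to classify tasks.

```python
# Sketch of tier-based task routing across the 4B/14B/1.2B split described
# above. Model names and the verb heuristic are hypothetical.

MECHANICAL = {"grep", "read", "run", "list", "copy"}
COMPLEX = {"implement", "refactor", "design"}

def route_task(task: str) -> str:
    """Pick a model tier for a task based on a simple keyword heuristic."""
    verb = task.split()[0].lower()
    if verb in MECHANICAL:
        return "runner-1.2b"      # cheap, parallelizable mechanical work
    if verb in COMPLEX:
        return "coder-14b"        # complex reasoning reserved for the big model
    return "orchestrator-4b"      # everything else goes to the coordinator

print(route_task("grep for banned imports"))    # runner-1.2b
print(route_task("implement the blog layout"))  # coder-14b
```

The point is not the heuristic itself but that routing happens before any expensive model is invoked.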

3. SOLVE-Med: Specialized Orchestration with Small Models (Nov 2025)

arXiv:2511.03542

Key Findings:

  • Uses 10 specialized models (1B each) orchestrated together
  • Outperforms standalone models up to 14B parameters
  • Architecture: Router Agent → Specialists → Orchestrator Agent
  • Enables local deployment while beating much larger models

Implications:

  • Our 1.2B fast runners pattern is validated by research
  • Router/Orchestrator separation is effective
  • Small specialists + coordination > single large model
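
The Router → Specialists → Orchestrator flow can be sketched as a small pipeline. The specialist registry and the `call_model` stub below are hypothetical stand-ins; in SOLVE-Med each specialist is a ~1B model trained for one domain.

```python
# Minimal sketch of the Router -> Specialists -> Orchestrator pattern.
# call_model is a stand-in for a local inference call; names are illustrative.

def call_model(name: str, prompt: str) -> str:
    """Stand-in for invoking a local model (e.g. an HTTP inference endpoint)."""
    return f"[{name}] answer to: {prompt}"

SPECIALISTS = {
    "cardiology": "cardio-1b",
    "neurology": "neuro-1b",
}

def router(question: str) -> list[str]:
    """Select relevant specialists (keyword match as a stand-in for a 1B router)."""
    return [m for domain, m in SPECIALISTS.items() if domain in question.lower()]

def orchestrate(question: str) -> str:
    """Fan out to the selected specialists, then fuse their answers."""
    answers = [call_model(m, question) for m in router(question)]
    return call_model("orchestrator-1b", " | ".join(answers))

print(orchestrate("cardiology question about arrhythmia"))
```

The key design choice is that the router and orchestrator are themselves small models, so the whole pipeline stays locally deployable.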

4. MATA: Multi-Agent with Small Model Tools (Feb 2026)

arXiv:2602.09642

Key Findings:

  • Uses small language models as tools, with an algorithm designed to minimize expensive LLM agent calls
  • "Careful orchestration of multiple reasoning pathways yields scalable and reliable" results
  • Strong performance across 10 different LLMs

Implications:

  • Fast task runners (1.2B) should handle all simple operations
  • Reserve 14B model for complex tasks only
  • Expensive model calls should be minimized through smart routing
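
One concrete way to minimize expensive calls is "cheap first, escalate on failure": try the smallest model and only move up a tier when its answer fails validation. This is a sketch of that policy under assumed tier names and caller-supplied validation, not MATA's actual algorithm.

```python
# Sketch of escalate-on-failure routing: walk up the tiers, cheapest first,
# stopping at the first answer that validates. Tier names are assumptions.

from typing import Callable

TIERS = ["runner-1.2b", "orchestrator-4b", "coder-14b"]  # cheapest first

def solve(task: str,
          try_model: Callable[[str, str], str],
          is_valid: Callable[[str], bool]) -> tuple[str, str]:
    """Return (model, answer) from the cheapest tier that produces a valid answer."""
    for model in TIERS:
        answer = try_model(model, task)
        if is_valid(answer):
            return model, answer
    return TIERS[-1], answer  # fall back to the strongest model's attempt

# Toy demo: only the 14B model "succeeds" on this task.
model, _ = solve(
    "refactor module",
    try_model=lambda m, t: "ok" if m == "coder-14b" else "fail",
    is_valid=lambda a: a == "ok",
)
print(model)  # coder-14b
```

In practice `is_valid` would be a cheap check such as a build, a linter, or a banned-import grep.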

5. From Biased Chatbots to Biased Agents (Jan 2026)

arXiv:2602.12285

Key Findings:

  • Persona assignments can degrade performance by up to 26.2%
  • Task-irrelevant persona cues introduce implicit biases
  • "Persona assignments can introduce implicit biases and increase behavioral volatility"

Implications:

  • Norse mythology naming is fine, but keep persona definitions task-focused and minimal
  • Don't over-engineer personalities
  • Focus on capability descriptions, not character roles

6. Emergent Coordination in Multi-Agent Systems (Oct 2025)

arXiv:2510.05174

Key Findings:

  • Multi-agent systems can be steered from "mere aggregates" to "higher-order collectives"
  • Best coordination: Personas + "think about what other agents might do" instruction
  • "Effective performance requires both alignment on shared objectives and complementary contributions"

Implications:

  • Agents need awareness of other agents' roles
  • Thor should explicitly mention Odin's capabilities in specs
  • Shared understanding of project structure is critical
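
Cross-agent awareness can be injected mechanically when building a prompt. The agent names and roles below come from this document; the roster format and prompt wording are illustrative assumptions.

```python
# Sketch of prepending a roster of the other agents to a spec prompt, so each
# agent can "think about what other agents might do". Wording is hypothetical.

AGENT_ROLES = {
    "Thor": "orchestrator: writes specs and coordinates",
    "Odin": "coder: implements specs",
    "Loki": "reviewer: verifies functional and non-functional requirements",
}

def build_spec_prompt(author: str, spec: str) -> str:
    """Prepend the other agents' roles so the author can anticipate them."""
    others = "\n".join(f"- {n}: {r}" for n, r in AGENT_ROLES.items() if n != author)
    return (f"You are {author}. Other agents on this task:\n{others}\n"
            f"Think about what each of them will need from your output.\n\n{spec}")

prompt = build_spec_prompt("Thor", "Spec: add blog index page")
print(prompt)
```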

Summary: What Research Validates

✅ Our Current Approach (Heterogeneous Agents)

  • 4B dispatcher + 14B coder + 1.2B runners is correct
  • Diversity beats homogeneous scaling
  • Small models work when orchestrated properly

⚠️ Areas for Improvement

  1. Add non-functional requirements to specs (security, performance)
  2. Minimize expensive model calls - use 1.2B agents for simple tasks
  3. Keep personas minimal - avoid character over-engineering
  4. Agent awareness - specs should reference other agents' capabilities
  5. Context file maintenance - AGENTS.md should evolve, not be static

What Failed in Koko Blog Implementation

  • Thor didn't enforce stack constraints - allowed React when Astro was required
  • No verification step - no check that code matched PLAN.md
  • Missing guardrails - no security/performance requirements in specs
  • Poor context - specs didn't reference Astro patterns explicitly

Recommendations for Better Agent Performance

1. Spec Template Improvements

Every spec sent to Odin should include:

STACK ENFORCEMENT (NON-NEGOTIABLE):
- Framework: [specific framework]
- Forbidden: [list of banned libraries/patterns]
- Required patterns: [specific conventions]
- Verification: [how to check compliance]

NON-FUNCTIONAL REQUIREMENTS:
- Security: [specific constraints]
- Performance: [targets]
- Bundle size: [limits]
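
A pre-flight check can reject a spec that omits any of these sections before it reaches Odin. The section names below mirror the template; storing specs as plain markdown text is an assumption.

```python
# Sketch of a spec completeness check against the template above.
# REQUIRED_SECTIONS mirrors the template headings; matching is case-insensitive.

import re

REQUIRED_SECTIONS = [
    "STACK ENFORCEMENT",
    "Framework:",
    "Forbidden:",
    "NON-FUNCTIONAL REQUIREMENTS",
    "Security:",
    "Performance:",
]

def missing_sections(spec_text: str) -> list[str]:
    """Return template sections the spec omits."""
    return [s for s in REQUIRED_SECTIONS
            if not re.search(re.escape(s), spec_text, re.IGNORECASE)]

spec = "STACK ENFORCEMENT\n- Framework: Astro\n- Forbidden: react, react-dom\n"
print(missing_sections(spec))  # ['NON-FUNCTIONAL REQUIREMENTS', 'Security:', 'Performance:']
```

A non-empty result would send the spec back to Thor instead of on to Odin.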

2. Pre-Implementation Verification

Before coding starts:

  • Thor reads existing files of the target type
  • Explicitly lists patterns to follow
  • References Astro docs for APIs
  • States "use ONLY [framework] patterns"

3. Post-Implementation Checklist

After Odin completes:

  • Verify framework compliance (grep for banned imports)
  • Check bundle size impact
  • Validate against PLAN.md line-by-line
  • Run build and check output
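
The "grep for banned imports" step can be automated. The banned list below reflects the Astro-vs-React failure described in this document; the file extensions and list contents are project-specific assumptions.

```python
# Sketch of scanning source files for imports of banned packages.
# BANNED_IMPORTS and the extension set are project-specific assumptions.

import re
from pathlib import Path

BANNED_IMPORTS = ["react", "react-dom", "styled-components"]

def find_violations(root: str) -> list[tuple[str, str]]:
    """Return (file, package) pairs for import statements of banned packages."""
    pattern = re.compile(
        r"""(?:^|\n)\s*import\b.*?from\s+['"](%s)['"]"""
        % "|".join(re.escape(b) for b in BANNED_IMPORTS)
    )
    hits = []
    for path in Path(root).rglob("*.*"):
        if path.suffix in {".js", ".jsx", ".ts", ".tsx", ".astro"}:
            for match in pattern.finditer(path.read_text(errors="ignore")):
                hits.append((str(path), match.group(1)))
    return hits
```

A non-empty result fails the checklist and routes the diff back to Odin before the PLAN.md comparison even starts.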

4. Agent Coordination Improvements

  • Thor specs should explicitly mention: "Odin will implement, Loki will verify"
  • Include "Review Requirements" section in specs
  • Loki should check both functional and non-functional requirements

Research-Backed Configuration Adjustments

Effective Agent Hierarchy (Validated)

Level 1: Orchestrators (4B)
  - Task coordination
  - Context extraction
  - Spec writing with guardrails
  
Level 2: Specialist (14B)
  - Complex coding only
  - Architecture decisions
  - Deep problem solving
  
Level 3: Fast Runners (1.2B)
  - Grep, read, run
  - No coding
  - Parallel execution

Optimal Work Distribution

  • 2 diverse agents > 16 homogeneous agents
  • Our 3-tier setup (4B/14B/1.2B) is research-optimal
  • Don't add more agents of same type
  • Differentiate by capability, not just name

Last updated: 2026-03-01
Sources: arXiv papers reviewed March 2026