Initial commit: coding harness feedback analysis
Harnesses under analysis:
- opencode (Go-based coding agent)
- pi (minimal terminal coding harness by Mario Zechner)
- hermes (Nous Research agent)
- forgecode (AI pair programmer with sub-agents)

Each harness folder contains:
- repo/: Source code from respective repositories
- feedback/localllm/: Community feedback for local/smaller models
- feedback/frontier/: Community feedback for frontier models

Research focus: Tool handling, skills systems, prompt engineering, context management, and best practices for smaller/local models.
# MiniMax, GLM, DeepSeek with ForgeCode - Feedback Report
**Models:** MiniMax M2/M2.1, GLM-4.6, DeepSeek V3.x/R1
**Source References:** llm-stats.com, ForgeCode Blog
**Date Compiled:** April 9, 2026

---

## MiniMax M2.1
### Performance
- **Terminal-Bench Score:** 47.9% (Rank #2 on independent leaderboard)
- **Parameters:** 230B
- **Context:** 1.0M tokens
- **Cost:** $0.30 (input) / $1.20 (output) per million tokens
### Value Proposition
- **Best cost-performance ratio** among top performers
- Near-SOTA performance at entry-level pricing
- Massive 1.0M context window
### ForgeCode Usage
- Well-supported via OpenRouter (a request sketch follows this list)
- Good tool calling reliability
- Recommended for budget-conscious users
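
To make the OpenRouter path concrete, here is a minimal sketch of a tool-calling request against MiniMax M2.1 through OpenRouter's OpenAI-compatible endpoint. The model slug `minimax/minimax-m2.1`, the `run_shell` tool, and the placeholder API key are illustrative assumptions; this is not ForgeCode's internal code.

```python
# Minimal sketch: one tool-calling request to MiniMax M2.1 via OpenRouter.
# Model slug, tool schema, and key handling are assumptions for illustration.
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_OPENROUTER_KEY",  # placeholder; read from env in practice
)

# A single hypothetical shell tool in the standard OpenAI tool-calling schema.
tools = [{
    "type": "function",
    "function": {
        "name": "run_shell",
        "description": "Run a shell command in the project workspace",
        "parameters": {
            "type": "object",
            "properties": {"command": {"type": "string"}},
            "required": ["command"],
        },
    },
}]

response = client.chat.completions.create(
    model="minimax/minimax-m2.1",  # assumed slug; check the OpenRouter catalog
    messages=[{"role": "user", "content": "List the files in the repo root."}],
    tools=tools,
)

# With reliable tool calling, the reply carries a structured tool call
# rather than a prose answer.
print(response.choices[0].message.tool_calls)
```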

---

## GLM-4.6 (Zhipu AI)
### Performance
- **Terminal-Bench Score:** 40.5% (Rank #7)
- **Parameters:** 357B
- **Context:** 131K tokens
- **Cost:** $0.55 (input) / $2.19 (output) per million tokens
### Characteristics
- Open weights
- Competitive with proprietary models at similar price point
- Good context length (131K)

---

## DeepSeek Models
### DeepSeek-V3.2-Exp
- **Terminal-Bench Score:** 37.7% (Rank #10)
- **Status:** Experimental
- **Note:** Results from llm-stats.com
### DeepSeek-V3.1
- **Terminal-Bench Score:** 31.3% (Rank #16)
- **Parameters:** 671B
- **Observation:** Large parameter count doesn't translate to top-tier performance
### DeepSeek-R1-0528
- **Terminal-Bench Score:** 5.7% (Rank #23 - lowest)
- **Parameters:** 671B
- **Note:** Reasoning model may not be optimized for terminal tasks

---

## Key Insights
### Scale ≠ Performance
- Kimi K2 (1.0T parameters) underperforms smaller models
- DeepSeek-V3.1 (671B) scores lower than MiniMax M2.1 (230B)
- **Quality of architecture > raw parameter count**
### Cost-Performance Leaders
1. **MiniMax M2.1:** 47.9% at $0.30/$1.20 (worked cost example after this list)
2. **LongCat-Flash-Lite:** Budget option at $0.10/$0.40 (score not in top 10)
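
The pricing gap is easier to see with concrete numbers. The sketch below estimates the cost of a hypothetical agent session from the per-million-token prices quoted in this report; the token volumes are assumptions for illustration, not measured ForgeCode workloads.

```python
# Rough cost arithmetic for a hypothetical coding-agent session, using the
# (input, output) prices quoted above in USD per million tokens.
PRICES = {
    "MiniMax M2.1":       (0.30, 1.20),
    "GLM-4.6":            (0.55, 2.19),
    "LongCat-Flash-Lite": (0.10, 0.40),
}

INPUT_TOKENS = 2_000_000   # assumed: repeated context re-reads over a session
OUTPUT_TOKENS = 200_000    # assumed: generated plans, diffs, and tool calls

for model, (in_price, out_price) in PRICES.items():
    cost = (INPUT_TOKENS / 1e6) * in_price + (OUTPUT_TOKENS / 1e6) * out_price
    print(f"{model}: ${cost:.2f}")

# MiniMax M2.1: $0.84
# GLM-4.6: $1.54
# LongCat-Flash-Lite: $0.28
```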
### Context Window Comparison
| Model | Context | Rank |
|-------|---------|------|
| MiniMax M2.1 | 1.0M | #2 |
| Claude Opus 4.1 | 200K | #5 |
| GLM-4.6 | 131K | #7 |

---

## Recommendations
### For Budget + Performance
**MiniMax M2.1** - Best value proposition
### For Open Weights
**GLM-4.6** or **MiniMax M2** - Both ship open weights with strong performance
### For Research
Avoid DeepSeek-R1 for terminal tasks; use V3 variants instead

---

## Source References
1. **llm-stats.com:** https://llm-stats.com/benchmarks/terminal-bench
2. **ForgeCode Blog:** Model comparison series