MiniMax, GLM, DeepSeek with ForgeCode - Feedback Report
Models: MiniMax M2/M2.1, GLM-4.6, DeepSeek (V3.1, V3.2-Exp, R1)
Source References: llm-stats.com, ForgeCode Blog
Date Compiled: April 9, 2026
MiniMax M2.1
Performance
- Terminal-Bench Score: 47.9% (Rank #2 on independent leaderboard)
- Parameters: 230B
- Context: 1.0M tokens
- Cost: $0.30 input / $1.20 output per million tokens
Value Proposition
- Best cost-performance ratio among top performers
- Near-SOTA performance at entry-level pricing
- Massive 1.0M context window
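At these rates, per-session cost is easy to estimate. A minimal sketch, assuming simple linear input/output billing (no caching discounts) and an illustrative session size:

```python
# Estimate API cost from token counts, using the per-million-token
# rates quoted above for MiniMax M2.1. Assumes plain linear
# input/output pricing with no prompt-caching discounts.
def session_cost(input_tokens, output_tokens,
                 in_rate=0.30, out_rate=1.20):
    """Cost in USD; rates are per 1M tokens."""
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# A hypothetical agentic coding session: 200K tokens in, 20K out.
print(f"${session_cost(200_000, 20_000):.3f}")  # → $0.084
```

Even a context-heavy session stays under a dime, which is where the "entry-level pricing" claim comes from.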
ForgeCode Usage
- Well-supported via OpenRouter
- Good tool calling reliability
- Recommended for budget-conscious users
GLM-4.6 (Zhipu AI)
Performance
- Terminal-Bench Score: 40.5% (Rank #7)
- Parameters: 357B
- Context: 131K tokens
- Cost: $0.55 input / $2.19 output per million tokens
Characteristics
- Open weights
- Competitive with proprietary models at similar price point
- Good context length (131K)
DeepSeek Models
DeepSeek-V3.2-Exp
- Terminal-Bench Score: 37.7% (Rank #10)
- Status: Experimental
- Note: Results from llm-stats.com
DeepSeek-V3.1
- Terminal-Bench Score: 31.3% (Rank #16)
- Parameters: 671B
- Observation: Large parameter count doesn't translate to top-tier performance
DeepSeek-R1-0528
- Terminal-Bench Score: 5.7% (Rank #23 - lowest)
- Parameters: 671B
- Note: Reasoning model may not be optimized for terminal tasks
Key Insights
Scale ≠ Performance
- Kimi K2 (1.0T parameters) underperforms smaller models
- DeepSeek-V3.1 (671B) scores lower than MiniMax M2.1 (230B)
- Quality of architecture > raw parameter count
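One rough way to see this is to normalize the Terminal-Bench scores quoted above by parameter count. This sketch uses total parameters as listed in this report; several of these models are MoE, so active-parameter efficiency would look different:

```python
# Terminal-Bench score per billion (total) parameters, using the
# figures quoted in this report. Total counts, not active counts:
# MoE models activate far fewer parameters per token.
models = {
    "MiniMax M2.1":  (47.9, 230),   # (score %, params in B)
    "GLM-4.6":       (40.5, 357),
    "DeepSeek-V3.1": (31.3, 671),
}
for name, (score, params) in sorted(
        models.items(), key=lambda kv: -kv[1][0] / kv[1][1]):
    print(f"{name:14s} {score / params:.3f} %/B")
```

MiniMax M2.1 delivers roughly four times the score per billion parameters of DeepSeek-V3.1, which is the "architecture over scale" point in numbers.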
Cost-Performance Leaders
- MiniMax M2.1: 47.9% at $0.30/$1.20
- LongCat-Flash-Lite: Budget option at $0.10/$0.40 (score not in top 10)
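To compare value across models with different input/output rates, a blended rate is handy. A sketch assuming a 3:1 input:output token mix (the blend ratio is an assumption, not from the benchmark):

```python
# Blended $/Mtok at an assumed 3:1 input:output mix, then
# Terminal-Bench points per blended dollar. Rates and scores are
# the figures quoted above.
def blended_rate(in_rate, out_rate, in_share=0.75):
    return in_rate * in_share + out_rate * (1 - in_share)

pricing = {                 # (score %, $in/Mtok, $out/Mtok)
    "MiniMax M2.1": (47.9, 0.30, 1.20),
    "GLM-4.6":      (40.5, 0.55, 2.19),
}
for name, (score, cin, cout) in pricing.items():
    rate = blended_rate(cin, cout)
    print(f"{name}: {score / rate:.1f} points per blended $/Mtok")
```

Under this mix, MiniMax M2.1 yields roughly twice the benchmark points per dollar of GLM-4.6, consistent with its "best value" ranking here.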
Context Window Comparison
| Model | Context | Rank |
|---|---|---|
| MiniMax M2.1 | 1.0M | #2 |
| Claude Opus 4.1 | 200K | #5 |
| GLM-4.6 | 131K | #7 |
Recommendations
For Budget + Performance
MiniMax M2.1 - Best value proposition
For Open Weights
GLM-4.6 or MiniMax M2 - Both open, strong performance
For Research
Avoid DeepSeek-R1 for terminal tasks; use V3 variants instead
Source References
- llm-stats.com: https://llm-stats.com/benchmarks/terminal-bench
- ForgeCode Blog: Model comparison series