# MiniMax, GLM, DeepSeek with ForgeCode - Feedback Report

**Models:** MiniMax M2/M2.1, GLM-4.6, DeepSeek-V3

**Source References:** llm-stats.com, ForgeCode Blog

**Date Compiled:** April 9, 2026

---
## MiniMax M2.1

### Performance

- **Terminal-Bench Score:** 47.9% (Rank #2 on independent leaderboard)
- **Parameters:** 230B
- **Context:** 1.0M tokens
- **Cost:** $0.30 (input) / $1.20 (output) per million tokens

### Value Proposition

- **Best cost-performance ratio** among top performers
- Near-SOTA performance at entry-level pricing
- Massive 1.0M-token context window
### ForgeCode Usage

- Well-supported via OpenRouter
- Good tool-calling reliability
- Recommended for budget-conscious users
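As a minimal sketch of what "via OpenRouter" looks like in practice: OpenRouter exposes an OpenAI-compatible chat-completions endpoint, and tool calling is just the standard `tools` array in the request body. The model ID and the `run_shell` tool definition below are illustrative assumptions, not confirmed ForgeCode internals.

```python
import json

# OpenRouter's OpenAI-compatible chat completions endpoint.
OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

def build_request(prompt: str, model: str = "minimax/minimax-m2") -> dict:
    """Assemble an OpenAI-style request body with one example tool.

    The model slug and tool schema here are hypothetical placeholders;
    check OpenRouter's model list for the exact MiniMax identifier.
    """
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "tools": [{
            "type": "function",
            "function": {
                "name": "run_shell",  # illustrative tool name
                "description": "Run a shell command and return stdout",
                "parameters": {
                    "type": "object",
                    "properties": {"command": {"type": "string"}},
                    "required": ["command"],
                },
            },
        }],
    }

body = build_request("List the files in the current directory.")
print(json.dumps(body, indent=2))
```

A harness would POST this body with an `Authorization: Bearer <key>` header and then dispatch any `tool_calls` entries returned in the assistant message.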
---

## GLM-4.6 (Zhipu AI)

### Performance

- **Terminal-Bench Score:** 40.5% (Rank #7)
- **Parameters:** 357B
- **Context:** 131K tokens
- **Cost:** $0.55 (input) / $2.19 (output) per million tokens

### Characteristics

- Open weights
- Competitive with proprietary models at a similar price point
- Good context length (131K)
---

## DeepSeek Models

### DeepSeek-V3.2-Exp

- **Terminal-Bench Score:** 37.7% (Rank #10)
- **Status:** Experimental
- **Note:** Results from llm-stats.com

### DeepSeek-V3.1

- **Terminal-Bench Score:** 31.3% (Rank #16)
- **Parameters:** 671B
- **Observation:** Large parameter count doesn't translate to top-tier performance

### DeepSeek-R1-0528

- **Terminal-Bench Score:** 5.7% (Rank #23, lowest)
- **Parameters:** 671B
- **Note:** Reasoning model may not be optimized for terminal tasks
---

## Key Insights

### Scale ≠ Performance

- Kimi K2 (1.0T parameters) underperforms smaller models
- DeepSeek-V3.1 (671B) scores lower than MiniMax M2.1 (230B)
- **Quality of architecture > raw parameter count**

### Cost-Performance Leaders

1. **MiniMax M2.1:** 47.9% at $0.30/$1.20
2. **LongCat-Flash-Lite:** Budget option at $0.10/$0.40 (score not in top 10)
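The per-million-token rates above make back-of-envelope cost comparisons straightforward. The token counts in the example are illustrative, not measured ForgeCode usage:

```python
# (input $/M tokens, output $/M tokens), per the rates quoted above.
RATES = {
    "MiniMax M2.1": (0.30, 1.20),
    "GLM-4.6": (0.55, 2.19),
    "LongCat-Flash-Lite": (0.10, 0.40),
}

def task_cost(model: str, in_tokens: int, out_tokens: int) -> float:
    """Dollar cost of one request at the listed rates."""
    rate_in, rate_out = RATES[model]
    return (in_tokens * rate_in + out_tokens * rate_out) / 1_000_000

# Example: a 50K-token prompt producing a 5K-token response.
for name in RATES:
    print(f"{name}: ${task_cost(name, 50_000, 5_000):.4f}")
```

At those illustrative sizes, MiniMax M2.1 comes out to about 2 cents per request, roughly half of GLM-4.6's cost for a higher benchmark score.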
### Context Window Comparison

| Model | Context | Rank |
|-------|---------|------|
| MiniMax M2.1 | 1.0M | #2 |
| Claude Opus 4.1 | 200K | #5 |
| GLM-4.6 | 131K | #7 |
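A practical use of this table is a quick fit check: given an estimated prompt size for a codebase, which of these models can hold it at all? The window sizes are the table's values; the 150K example figure is arbitrary.

```python
# Context windows (in tokens) from the comparison table above.
WINDOWS = {
    "MiniMax M2.1": 1_000_000,
    "Claude Opus 4.1": 200_000,
    "GLM-4.6": 131_000,
}

def models_that_fit(prompt_tokens: int) -> list[str]:
    """Return the models whose context window can hold the prompt."""
    return [m for m, w in WINDOWS.items() if prompt_tokens <= w]

# Example: a repo snapshot estimated at 150K tokens.
print(models_that_fit(150_000))  # GLM-4.6 is ruled out at this size
```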
---

## Recommendations

### For Budget + Performance

**MiniMax M2.1** - Best value proposition

### For Open Weights

**GLM-4.6** or **MiniMax M2** - Both open, strong performance

### For Research

Avoid DeepSeek-R1 for terminal tasks; use V3 variants instead.
---

## Source References

1. **llm-stats.com:** https://llm-stats.com/benchmarks/terminal-bench
2. **ForgeCode Blog:** Model comparison series