Initial commit: coding harness feedback analysis
Harnesses under analysis:
- opencode (Go-based coding agent)
- pi (minimal terminal coding harness by Mario Zechner)
- hermes (Nous Research agent)
- forgecode (AI pair programmer with sub-agents)

Each harness folder contains:
- repo/: Source code from respective repositories
- feedback/localllm/: Community feedback for local/smaller models
- feedback/frontier/: Community feedback for frontier models

Research focus: Tool handling, skills systems, prompt engineering, context management, and best practices for smaller/local models.
# MiniMax, GLM, DeepSeek with ForgeCode - Feedback Report
**Models:** MiniMax M2/M2.1, GLM-4.6, DeepSeek V3.x/R1
**Source References:** llm-stats.com, ForgeCode Blog
**Date Compiled:** April 9, 2026

---

## MiniMax M2.1
### Performance
- **Terminal-Bench Score:** 47.9% (Rank #2 on independent leaderboard)
- **Parameters:** 230B
- **Context:** 1.0M tokens
- **Cost:** $0.30 (input) / $1.20 (output) per million tokens
### Value Proposition
- **Best cost-performance ratio** among top performers
- Near-SOTA performance at entry-level pricing
- Massive 1.0M context window
### ForgeCode Usage
- Well-supported via OpenRouter (a request sketch follows this list)
- Good tool calling reliability
- Recommended for budget-conscious users
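
To make the OpenRouter path concrete, here is a minimal sketch of a tool-calling request against MiniMax M2.1 through OpenRouter's OpenAI-compatible endpoint. The model slug `minimax/minimax-m2.1`, the `run_shell` tool, and the placeholder API key are illustrative assumptions; this is not ForgeCode's internal code.

```python
# Minimal sketch: one tool-calling request to MiniMax M2.1 via OpenRouter.
# Model slug, tool schema, and key handling are assumptions for illustration.
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_OPENROUTER_KEY",  # placeholder; read from env in practice
)

# A single hypothetical shell tool in the standard OpenAI tool-calling schema.
tools = [{
    "type": "function",
    "function": {
        "name": "run_shell",
        "description": "Run a shell command in the project workspace",
        "parameters": {
            "type": "object",
            "properties": {"command": {"type": "string"}},
            "required": ["command"],
        },
    },
}]

response = client.chat.completions.create(
    model="minimax/minimax-m2.1",  # assumed slug; check the OpenRouter catalog
    messages=[{"role": "user", "content": "List the files in the repo root."}],
    tools=tools,
)

# With reliable tool calling, the reply carries a structured tool call
# rather than a prose answer.
print(response.choices[0].message.tool_calls)
```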

---

## GLM-4.6 (Zhipu AI)
### Performance
- **Terminal-Bench Score:** 40.5% (Rank #7)
- **Parameters:** 357B
- **Context:** 131K tokens
- **Cost:** $0.55 (input) / $2.19 (output) per million tokens
### Characteristics
- Open weights
- Competitive with proprietary models at similar price point
- Good context length (131K)

---

## DeepSeek Models
### DeepSeek-V3.2-Exp
- **Terminal-Bench Score:** 37.7% (Rank #10)
- **Status:** Experimental
- **Note:** Results from llm-stats.com
### DeepSeek-V3.1
- **Terminal-Bench Score:** 31.3% (Rank #16)
- **Parameters:** 671B
- **Observation:** Large parameter count doesn't translate to top-tier performance
### DeepSeek-R1-0528
- **Terminal-Bench Score:** 5.7% (Rank #23 - lowest)
- **Parameters:** 671B
- **Note:** Reasoning model may not be optimized for terminal tasks

---

## Key Insights
### Scale ≠ Performance
- Kimi K2 (1.0T parameters) underperforms smaller models
- DeepSeek-V3.1 (671B) scores lower than MiniMax M2.1 (230B)
- **Quality of architecture > raw parameter count**
### Cost-Performance Leaders
1. **MiniMax M2.1:** 47.9% at $0.30/$1.20 (worked cost example after this list)
2. **LongCat-Flash-Lite:** Budget option at $0.10/$0.40 (score not in top 10)
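
The pricing gap is easier to see with concrete numbers. The sketch below estimates the cost of a hypothetical agent session from the per-million-token prices quoted in this report; the token volumes are assumptions for illustration, not measured ForgeCode workloads.

```python
# Rough cost arithmetic for a hypothetical coding-agent session, using the
# (input, output) prices quoted above in USD per million tokens.
PRICES = {
    "MiniMax M2.1":       (0.30, 1.20),
    "GLM-4.6":            (0.55, 2.19),
    "LongCat-Flash-Lite": (0.10, 0.40),
}

INPUT_TOKENS = 2_000_000   # assumed: repeated context re-reads over a session
OUTPUT_TOKENS = 200_000    # assumed: generated plans, diffs, and tool calls

for model, (in_price, out_price) in PRICES.items():
    cost = (INPUT_TOKENS / 1e6) * in_price + (OUTPUT_TOKENS / 1e6) * out_price
    print(f"{model}: ${cost:.2f}")

# MiniMax M2.1: $0.84
# GLM-4.6: $1.54
# LongCat-Flash-Lite: $0.28
```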
### Context Window Comparison
| Model | Context | Rank |
|-------|---------|------|
| MiniMax M2.1 | 1.0M | #2 |
| Claude Opus 4.1 | 200K | #5 |
| GLM-4.6 | 131K | #7 |

---

## Recommendations
### For Budget + Performance
**MiniMax M2.1** - Best value proposition
### For Open Weights
**GLM-4.6** or **MiniMax M2** - Both ship open weights with strong performance
### For Research
Avoid DeepSeek-R1 for terminal tasks; use V3 variants instead

---

## Source References
1. **llm-stats.com:** https://llm-stats.com/benchmarks/terminal-bench
2. **ForgeCode Blog:** Model comparison series