# MiniMax, GLM, DeepSeek with ForgeCode - Feedback Report
**Models:** MiniMax M2/M2.1, GLM-4.6, DeepSeek (V3.x / R1)
**Source References:** llm-stats.com, ForgeCode Blog
**Date Compiled:** April 9, 2026
---
## MiniMax M2.1
### Performance
- **Terminal-Bench Score:** 47.9% (Rank #2 on independent leaderboard)
- **Parameters:** 230B
- **Context:** 1.0M tokens
- **Cost:** $0.30 input / $1.20 output per million tokens
### Value Proposition
- **Best cost-performance ratio** among top performers
- Near-SOTA performance at entry-level pricing
- Massive 1.0M context window
### ForgeCode Usage
- Well-supported via OpenRouter
- Good tool calling reliability
- Recommended for budget-conscious users
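As a rough illustration of the OpenRouter path, the sketch below builds an OpenAI-style chat-completions request body. The model slug `minimax/minimax-m2.1` and the endpoint URL are assumptions for illustration, not confirmed ForgeCode internals; check openrouter.ai for the actual identifiers.

```python
# Sketch: building a chat-completions payload for OpenRouter.
# ASSUMPTIONS: the model slug "minimax/minimax-m2.1" and the endpoint
# URL are illustrative; verify both against openrouter.ai docs.
import json

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"  # assumed

def build_payload(prompt: str, model: str = "minimax/minimax-m2.1") -> dict:
    """Return an OpenAI-style chat-completions request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 1024,
    }

payload = build_payload("List the files in the current directory.")
print(json.dumps(payload, indent=2))
```

Sending this body (with an `Authorization: Bearer <key>` header) to the endpoint is all an OpenAI-compatible client needs, which is why harnesses like ForgeCode can support such models with minimal glue code.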
---
## GLM-4.6 (Zhipu AI)
### Performance
- **Terminal-Bench Score:** 40.5% (Rank #7)
- **Parameters:** 357B
- **Context:** 131K tokens
- **Cost:** $0.55 input / $2.19 output per million tokens
### Characteristics
- Open weights
- Competitive with proprietary models at a similar price point
- Good context length (131K)
---
## DeepSeek Models
### DeepSeek-V3.2-Exp
- **Terminal-Bench Score:** 37.7% (Rank #10)
- **Status:** Experimental
- **Note:** Results from llm-stats.com
### DeepSeek-V3.1
- **Terminal-Bench Score:** 31.3% (Rank #16)
- **Parameters:** 671B
- **Observation:** Large parameter count doesn't translate to top-tier performance
### DeepSeek-R1-0528
- **Terminal-Bench Score:** 5.7% (Rank #23 - lowest)
- **Parameters:** 671B
- **Note:** The reasoning model may not be optimized for terminal tasks
---
## Key Insights
### Scale ≠ Performance
- Kimi K2 (1.0T parameters) underperforms smaller models
- DeepSeek-V3.1 (671B) scores lower than MiniMax M2.1 (230B)
- **Quality of architecture > raw parameter count**
### Cost-Performance Leaders
1. **MiniMax M2.1:** 47.9% at $0.30/$1.20
2. **LongCat-Flash-Lite:** Budget option at $0.10/$0.40 (score not in top 10)
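The cost-performance claim can be made concrete with a back-of-the-envelope calculation. The 50/50 blend of input and output prices below is a simplifying assumption; real agentic workloads usually skew heavily toward input tokens.

```python
# Rough dollars-per-benchmark-point comparison.
# ASSUMPTION: blended price = simple average of input and output
# price per million tokens; actual input/output ratios vary.

def cost_per_point(input_price: float, output_price: float, score: float) -> float:
    """Blended per-million-token price divided by Terminal-Bench score."""
    blended = (input_price + output_price) / 2
    return blended / score

minimax = cost_per_point(0.30, 1.20, 47.9)  # ~ $0.0157 per point
glm = cost_per_point(0.55, 2.19, 40.5)      # ~ $0.0338 per point
print(f"MiniMax M2.1: {minimax:.4f}  GLM-4.6: {glm:.4f}")
```

On this crude metric MiniMax M2.1 delivers each benchmark point at under half the blended price of GLM-4.6, which is the sense in which it leads on cost-performance.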
### Context Window Comparison
| Model | Context | Rank |
|-------|---------|------|
| MiniMax M2.1 | 1.0M | #2 |
| Claude Opus 4.1 | 200K | #5 |
| GLM-4.6 | 131K | #7 |
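To put those window sizes in perspective, a rough estimate using the common ~4 characters-per-token heuristic (an approximation, not a tokenizer guarantee) shows how many average source files each window could hold:

```python
# Estimate how many source files fit in a context window.
# ASSUMPTIONS: ~4 characters per token is a rough heuristic (actual
# tokenization varies by model and content), and 20% of the window
# is reserved for model output.
CHARS_PER_TOKEN = 4

def files_that_fit(context_tokens: int, avg_file_chars: int = 8_000) -> int:
    """Approximate count of files fitting in the usable window."""
    usable_tokens = int(context_tokens * 0.8)
    return usable_tokens * CHARS_PER_TOKEN // avg_file_chars

print(files_that_fit(1_000_000))  # MiniMax M2.1, 1.0M-token window -> 400
print(files_that_fit(131_000))    # GLM-4.6, 131K-token window -> 52
```

Under these assumptions the 1.0M window fits roughly an order of magnitude more project context than a 131K window, which matters for whole-repo coding sessions.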
---
## Recommendations
### For Budget + Performance
**MiniMax M2.1** - Best value proposition
### For Open Weights
**GLM-4.6** or **MiniMax M2** - both open-weights, with strong performance
### For Research
Avoid DeepSeek-R1 for terminal tasks; use V3 variants instead
---
## Source References
1. **llm-stats.com:** https://llm-stats.com/benchmarks/terminal-bench
2. **ForgeCode Blog:** Model comparison series