Harnesses under analysis:
- opencode (Go-based coding agent)
- pi (minimal terminal coding harness by Mario Zechner)
- hermes (Nous Research agent)
- forgecode (AI pair programmer with sub-agents)

Each harness folder contains:
- repo/: Source code from respective repositories
- feedback/localllm/: Community feedback for local/smaller models
- feedback/frontier/: Community feedback for frontier models

Research focus: Tool handling, skills systems, prompt engineering,
context management, and best practices for smaller/local models.

MiniMax, GLM, DeepSeek with ForgeCode - Feedback Report

Models: MiniMax M2/M2.1, GLM-4.6, DeepSeek-V3.x/R1
Source References: llm-stats.com, ForgeCode Blog
Date Compiled: April 9, 2026


MiniMax M2.1

Performance

  • Terminal-Bench Score: 47.9% (Rank #2 on the independent llm-stats.com leaderboard)
  • Parameters: 230B
  • Context: 1.0M tokens
  • Cost: $0.30 input / $1.20 output per million tokens

Value Proposition

  • Best cost-performance ratio among top performers
  • Near-SOTA performance at entry-level pricing
  • Massive 1.0M context window

ForgeCode Usage

  • Well-supported via OpenRouter
  • Good tool calling reliability
  • Recommended for budget-conscious users
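Since the report recommends reaching MiniMax M2.1 through OpenRouter, here is a minimal sketch of building a request against OpenRouter's OpenAI-compatible chat completions endpoint. The model slug `minimax/minimax-m2.1` and the placeholder API key are assumptions; check OpenRouter's model list for the exact identifier.

```python
import json
import urllib.request

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"
MODEL = "minimax/minimax-m2.1"  # assumed slug; verify against OpenRouter's catalog

def build_request(prompt: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-compatible chat request for OpenRouter."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        OPENROUTER_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )

req = build_request("List the files in the current directory.", "sk-or-...")
```

Sending `req` with `urllib.request.urlopen` (or any HTTP client) returns a standard chat-completion JSON body, which is why harnesses that already speak the OpenAI API can route to these models with only a base-URL change.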

GLM-4.6 (Zhipu AI)

Performance

  • Terminal-Bench Score: 40.5% (Rank #7)
  • Parameters: 357B
  • Context: 131K tokens
  • Cost: $0.55 input / $2.19 output per million tokens

Characteristics

  • Open weights
  • Competitive with proprietary models at similar price point
  • Good context length (131K)

DeepSeek Models

DeepSeek-V3.2-Exp

  • Terminal-Bench Score: 37.7% (Rank #10)
  • Status: Experimental
  • Note: Results from llm-stats.com

DeepSeek-V3.1

  • Terminal-Bench Score: 31.3% (Rank #16)
  • Parameters: 671B
  • Observation: Large parameter count doesn't translate to top-tier performance

DeepSeek-R1-0528

  • Terminal-Bench Score: 5.7% (Rank #23 - lowest)
  • Parameters: 671B
  • Note: R1 is a reasoning-focused model and may not be tuned for agentic terminal tasks

Key Insights

Scale ≠ Performance

  • Kimi K2 (1.0T parameters) underperforms smaller models
  • DeepSeek-V3.1 (671B) scores lower than MiniMax M2.1 (230B)
  • Quality of architecture > raw parameter count

Cost-Performance Leaders

  1. MiniMax M2.1: 47.9% at $0.30/$1.20
  2. LongCat-Flash-Lite: Budget option at $0.10/$0.40 (score not in top 10)
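The per-million-token prices above can be turned into a concrete cost-per-task comparison. A quick sketch, using the report's prices and Terminal-Bench scores; the 300K-input / 100K-output token split per agentic task is an illustrative assumption, not a measured figure:

```python
def task_cost(price_in: float, price_out: float,
              tokens_in: int, tokens_out: int) -> float:
    """Dollar cost of one request, given per-million-token prices."""
    return price_in * tokens_in / 1e6 + price_out * tokens_out / 1e6

# Figures from the report: (input $/M, output $/M, Terminal-Bench score)
models = {
    "MiniMax M2.1": (0.30, 1.20, 47.9),
    "GLM-4.6": (0.55, 2.19, 40.5),
}

# Assumed agentic task: 300K input tokens, 100K output tokens.
for name, (p_in, p_out, score) in models.items():
    cost = task_cost(p_in, p_out, 300_000, 100_000)
    print(f"{name}: ${cost:.2f} per task, {score / cost:.0f} score points per $")
```

Under this assumed workload MiniMax M2.1 costs roughly half as much per task as GLM-4.6 while scoring higher, which is the "best cost-performance ratio" claim in numbers.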

Context Window Comparison

Model             Context   Rank
MiniMax M2.1      1.0M      #2
Claude Opus 4.1   200K      #5
GLM-4.6           131K      #7
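To give the context sizes some intuition, a rough capacity estimate using the common ~4 characters-per-token heuristic (an approximation; real tokenizers vary by language and content):

```python
# Approximate text capacity of a context window,
# assuming ~4 characters per token (heuristic, not exact).
def approx_chars(context_tokens: int) -> int:
    return context_tokens * 4

for name, tokens in [("MiniMax M2.1", 1_000_000),
                     ("Claude Opus 4.1", 200_000),
                     ("GLM-4.6", 131_000)]:
    mb = approx_chars(tokens) / 1_000_000
    print(f"{name}: ~{mb:.1f} MB of source text fits in context")
```

By this estimate a 1.0M-token window holds on the order of 4 MB of source text, enough for many whole repositories without aggressive context pruning.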

Recommendations

For Budget + Performance

MiniMax M2.1 - Best value proposition

For Open Weights

GLM-4.6 or MiniMax M2 - Both open-weight with strong performance

For Research

Avoid DeepSeek-R1 for terminal tasks; use V3 variants instead


Source References

  1. llm-stats.com: https://llm-stats.com/benchmarks/terminal-bench
  2. ForgeCode Blog: Model comparison series