4 Commits

Author SHA1 Message Date
sleepy f764aaac8b Add 'Last Updated: April 9, 2026' to all markdown files missing dates 2026-04-09 17:27:57 +02:00
sleepy 827c4eb121 Continue unifying feedback file structure
Restructured to unified template:
- pi/feedback/localllm/local-llm-feedback.md
- hermes/feedback/localllm/qwen-models-feedback.md

Applied standardized sections:
- Header with Model/Provider/Harness/Date
- Quick Reference table
- Per-model sections with Benchmark/What Worked/Issues
- Source References with descriptions
2026-04-09 16:14:57 +02:00
sleepy f561bed731 Fix model references and benchmark data across all feedback files
Qwen Model Corrections:
- Added Model Reference Guide to clarify Qwen3 vs Qwen 3.5 families
- Qwen3: 0.6B, 1.7B, 4B, 8B, 14B, 32B + MoE (30B-A3B, 235B-A22B)
- Qwen 3.5: 0.8B, 2B, 4B, 9B + MoE (27B, 122B-A10B, 397B-A17B)
- Fixed 'Qwen3.5-35B-A3B' -> 'Qwen3-30B-A3B' (non-existent model corrected)
- Note: Qwen 3.5 14B does NOT exist; references likely mean Qwen3-14B

Terminal-Bench 2.0 Fixes:
- Clarified that Terminal-Bench measures HARNESS+MODEL combinations
- Updated rankings with current leaderboard data (April 2026):
  - #1: Pilot + Claude Opus 4.6: 82.9%
  - #2: ForgeCode + GPT-5.4: 81.8%
  - #3: ForgeCode + Claude Opus 4.6: 81.8%
- Removed incorrect 'GPT-5.4 Rank #1' claims (scores vary by harness)
- Added harness attribution to all Terminal-Bench references

SWE-Bench Pro Updates (Verified):
- #1: Claude Mythos Preview: 77.8%
- #2: GLM-5.1: 58.4% (top open-source)
- #3: GPT-5.4: 57.7%
- Added source references to llm-stats.com

Files Modified:
- forgecode/feedback/localllm/qwen-3.5.md
- forgecode/feedback/frontier/benchmark-controversy.md
- hermes/feedback/localllm/qwen-models-feedback.md
- opencode/opencode/feedback/SUMMARY.md
- opencode/opencode/feedback/frontier/frontier-model-feedback.md
- opencode/opencode/feedback/localllm/local-llm-feedback.md
- pi/feedback/frontier/frontier-model-feedback.md
2026-04-09 16:05:14 +02:00
sleepy 2623737ad2 Add pi (pi-mono) feedback analysis
- Comprehensive feedback document covering tool handling, UX, performance
- Frontier model feedback (Claude, GPT, Gemini)
- Local LLM feedback (context window issues, prompting strategies)
- Source references from GitHub issues and community
2026-04-09 15:40:56 +02:00