sleepy
f764aaac8b
Add 'Last Updated: April 9, 2026' to all markdown files missing dates
2026-04-09 17:27:57 +02:00
sleepy
691cdfcb5d
Update README.md: Mark research run as complete, add conclusion.md reference, clean up folder structure
2026-04-09 17:25:30 +02:00
sleepy
f31942c35f
Add conclusion.md: Comprehensive analysis of harness suitability for local models with research-backed recommendations
2026-04-09 17:22:42 +02:00
sleepy
46a59f0aa8
Move pi REPO_FEEDBACK.md to correct location (pi/ instead of pi/pi/)
2026-04-09 17:15:36 +02:00
sleepy
a794d9bddf
Add REPO_FEEDBACK.md files for opencode, hermes, forgecode, and pi-mono harnesses
2026-04-09 17:14:27 +02:00
sleepy
e1781947f4
Fix Qwen3.5-35B-A3B model references
...
Reverted incorrect changes - Qwen3.5-35B-A3B IS a real model:
- 35B total / 3B active parameters (MoE)
- 262k native context (up to 1M extended)
- Apache 2.0 license
- Available on HuggingFace: Qwen/Qwen3.5-35B-A3B
Updated files:
- opencode/opencode/feedback/localllm/local-llm-feedback.md
- opencode/opencode/feedback/SUMMARY.md
- FEEDBACK_TEMPLATE.md
Added correct specs:
- MMLU-Pro: 85.3%
- SWE-bench Verified: 69.2%
- Context: 262k native, 1M extended
2026-04-09 16:25:19 +02:00
sleepy
1a1522266c
Final batch of structure unification
...
Restructured to unified template:
- hermes/feedback/localllm/gemma-models-feedback.md
- hermes/feedback/frontier/openai-gpt-feedback.md
All key feedback files now follow FEEDBACK_TEMPLATE.md structure
2026-04-09 16:16:15 +02:00
sleepy
827c4eb121
Continue unifying feedback file structure
...
Restructured to unified template:
- pi/feedback/localllm/local-llm-feedback.md
- hermes/feedback/localllm/qwen-models-feedback.md
Applied standardized sections:
- Header with Model/Provider/Harness/Date
- Quick Reference table
- Per-model sections with Benchmark/What Worked/Issues
- Source References with descriptions
2026-04-09 16:14:57 +02:00
sleepy
b012a406c7
Unify feedback file structure across harness folders
...
Applied unified structure template to key feedback files:
Structure now includes:
1. Standard header (Model/Size/Provider/Harness/Date)
2. Quick Reference table
3. Benchmark Results (with harness+model note)
4. What Worked Well
5. Issues Encountered (with severity levels)
6. Configuration (if applicable)
7. Source References (with descriptions)
Files restructured:
- forgecode/feedback/frontier/gpt-5.4.md
- forgecode/feedback/frontier/claude-opus-4.6.md
- hermes/feedback/frontier/claude-sonnet-feedback.md
Also created FEEDBACK_TEMPLATE.md as a style guide for all future feedback files.
2026-04-09 16:12:52 +02:00
sleepy
f561bed731
Fix model references and benchmark data across all feedback files
...
Qwen Model Corrections:
- Added Model Reference Guide to clarify Qwen3 vs Qwen 3.5 families
- Qwen3: 0.6B, 1.7B, 4B, 8B, 14B, 32B + MoE (30B-A3B, 235B-A22B)
- Qwen 3.5: 0.8B, 2B, 4B, 9B + MoE (27B, 122B-A10B, 397B-A17B)
- Fixed 'Qwen3.5-35B-A3B' -> 'Qwen3-30B-A3B' (non-existent model corrected)
- Note: Qwen 3.5 14B does NOT exist; references likely mean Qwen3-14B
Terminal-Bench 2.0 Fixes:
- Clarified that Terminal-Bench measures HARNESS+MODEL combinations
- Updated rankings with current leaderboard data (April 2026):
- #1: Pilot + Claude Opus 4.6: 82.9%
- #2: ForgeCode + GPT-5.4: 81.8%
- #3: ForgeCode + Claude Opus 4.6: 81.8%
- Removed incorrect 'GPT-5.4 Rank #1' claims (scores vary by harness)
- Added harness attribution to all Terminal-Bench references
SWE-Bench Pro Updates (Verified):
- #1: Claude Mythos Preview: 77.8%
- #2: GLM-5.1: 58.4% (top open-source)
- #3: GPT-5.4: 57.7%
- Added source references to llm-stats.com
Files Modified:
- forgecode/feedback/localllm/qwen-3.5.md
- forgecode/feedback/frontier/benchmark-controversy.md
- hermes/feedback/localllm/qwen-models-feedback.md
- opencode/opencode/feedback/SUMMARY.md
- opencode/opencode/feedback/frontier/frontier-model-feedback.md
- opencode/opencode/feedback/localllm/local-llm-feedback.md
- pi/feedback/frontier/frontier-model-feedback.md
2026-04-09 16:05:14 +02:00
sleepy
2623737ad2
Add pi (pi-mono) feedback analysis
...
- Comprehensive feedback document covering tool handling, UX, performance
- Frontier model feedback (Claude, GPT, Gemini)
- Local LLM feedback (context window issues, prompting strategies)
- Source references from GitHub issues and community
2026-04-09 15:40:56 +02:00
sleepy
1175ddd42a
Add README with folder navigation
2026-04-09 15:15:28 +02:00
sleepy
51123212c4
Initial commit: coding harness feedback analysis
...
Harnesses under analysis:
- opencode (Go-based coding agent)
- pi (minimal terminal coding harness by Mario Zechner)
- hermes (Nous Research agent)
- forgecode (AI pair programmer with sub-agents)
Each harness folder contains:
- repo/: Source code from respective repositories
- feedback/localllm/: Community feedback for local/smaller models
- feedback/frontier/: Community feedback for frontier models
Research focus: Tool handling, skills systems, prompt engineering,
context management, and best practices for smaller/local models.
2026-04-09 15:13:45 +02:00