From f4601547d2be0439cf7032ceea93844e0e9f663f Mon Sep 17 00:00:00 2001
From: Kaloyan Nikolov
Date: Fri, 24 Apr 2026 00:32:55 +0200
Subject: [PATCH] Initial commit: PLAN.md

---
 PLAN.md | 266 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 266 insertions(+)
 create mode 100644 PLAN.md

diff --git a/PLAN.md b/PLAN.md
new file mode 100644
index 0000000..a18be86
--- /dev/null
+++ b/PLAN.md
@@ -0,0 +1,266 @@

# Ternary Quantization Research Plan

## Objective
Research and prototype ternary (1.58-bit) quantization for LLMs, exploring quantization-aware training (QAT) and post-training quantization (PTQ) + fine-tuning pipelines. The goal is to understand how to take a pre-trained model, quantize it to ternary/2-bit weights, and recover accuracy through fine-tuning.

**Key methodology addition: Autonomous experiment iteration via Karpathy's autoresearch pattern** to accelerate hyperparameter and technique discovery.

---

## What "Bonsai" Actually Is

**Bonsai** is a family of commercially viable sub-2-bit LLMs developed by **PrismML** (not Microsoft/BitNet). They have two families:

| Family | Weights | Sizes | Format |
|--------|---------|-------|--------|
| **Bonsai** | Binary {-1, +1} | 1.7B, 4B, 8B | Q1_0 (GGUF), MLX 1-bit |
| **Ternary-Bonsai** | Ternary {-1, 0, +1} | 1.7B, 4B, 8B | Q2_0 (GGUF), MLX 2-bit |

Key properties:
- Uses **group size 128** for quantization scales
- Llama architecture with Mistral tokenizer
- Trained **natively** at low bit-width (not PTQ from FP16)
- Inference via llama.cpp fork (PrismML-Eng/llama.cpp) and MLX
- Models available on HuggingFace: `prism-ml/Bonsai-8B-gguf`, `prism-ml/Ternary-Bonsai-8B-gguf`

**Bonsai is NOT open-source training code** — only inference weights and demos are released. To replicate Bonsai-style results, you need to implement your own QAT pipeline.

---

## Recommended Stack

### For Training / Fine-tuning
| Component | Recommendation | Rationale |
|-----------|---------------|-----------|
| **Framework** | PyTorch + HuggingFace Transformers | Widest ecosystem; ParetoQ and EfficientQAT both use it |
| **Training** | HuggingFace TRL / custom training loop | ParetoQ uses vanilla HF Trainer; HF blog uses Nanotron |
| **Quantization Layer** | Custom `BitLinear` (see HF blog code) | Drop-in replacement for `nn.Linear` |
| **Dataset** | FineWeb-edu, RedPajama, or UltraFineWeb | Proven for ternary QAT (HF blog, Tequila, ParetoQ) |
| **Inference** | bitnet.cpp (Microsoft) or llama.cpp (Bonsai fork) | Optimized CPU/GPU kernels for ternary |

### For Quick Experiments
| Component | Recommendation | Rationale |
|-----------|---------------|-----------|
| **Small model** | Llama-3.2-1B or SmolLM-135M/360M | Fast iteration; ParetoQ has released quantized versions |
| **GPU** | Single A100-80GB or H100 | EfficientQAT does 2-bit Llama-2-70B on one A100-80GB in 41h |
| **Tokens** | 10B–100B for fine-tuning | HF blog: 10B tokens is competitive; 100B gets closer to the FP baseline |

---

## Core Technical Approaches

### Approach 1: Warmup Quantization Fine-tuning (HF Blog / Most Practical)
**Best for:** Starting from a pretrained FP model, quantizing to ternary, recovering via QAT fine-tuning.

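The warmup snippet below calls two quantizer helpers, `activation_quant` and `weight_quant`, that are not spelled out here. A minimal sketch of what they can look like, following the recipe listed under the hyperparameters (per-tensor mean-abs scaling to {-1, 0, +1} for weights, per-token 8-bit absmax for activations); treat the bodies as illustrative rather than the blog's exact code:

```python
import torch

def weight_quant(w: torch.Tensor) -> torch.Tensor:
    # Per-tensor scale from the mean absolute weight, then round to {-1, 0, +1}.
    scale = 1.0 / w.abs().mean().clamp(min=1e-5)
    return (w * scale).round().clamp(-1, 1) / scale

def activation_quant(x: torch.Tensor) -> torch.Tensor:
    # Per-token absmax quantization to the int8 range.
    scale = 127.0 / x.abs().max(dim=-1, keepdim=True).values.clamp(min=1e-5)
    return (x * scale).round().clamp(-128, 127) / scale
```

With those helpers in place, quantization is blended in gradually during fine-tuning; the `.detach()` below acts as a straight-through estimator, so gradients keep flowing to the full-precision weights:
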
```python
# Core idea: gradually introduce quantization
lambda_ = min(training_step / 1000, 1)  # linear warmup over 1000 steps

x_quant = x + lambda_ * (activation_quant(x) - x).detach()
w_quant = w + lambda_ * (weight_quant(w) - w).detach()
```

Key hyperparameters from HF blog (Llama3-8B):
- **LR:** 1e-4 (critical — they experimented extensively)
- **Batch size:** 2M tokens
- **Dataset:** FineWeb-edu
- **Warmup steps:** 1000 (linear scheduler)
- **Weight quant:** `scale = 1.0 / w.abs().mean()`, then `(w * scale).round().clamp(-1, 1)`
- **Activation quant:** 8-bit absmax per token

Results: WikiText PPL 12.2 after 10B tokens; surpasses Llama-1-7B on MMLU.

### Approach 2: ParetoQ-style QAT (Meta Research)
**Best for:** Rigorous comparison across bit-widths; released training code.

```bash
# From their repo
torchrun train.py \
  --input_model_filename "meta-llama/Llama-3.2-1B" \
  --qat True --w_bits 2 \
  --learning_rate 2e-5 --bf16 True
```

Key insights:
- **2-bit and ternary sit on the Pareto frontier** for size-vs-accuracy
- 3-bit+ models stay close to the FP distribution; 2-bit and below change drastically
- Scale initialization differs by bit-width (a critical detail in their code)
- Released MobileLLM-ParetoQ models: 125M–1.5B in 1/1.58/2/3/4-bit

### Approach 3: Two-Phase PTQ + Fine-tuning (EfficientQAT)
**Best for:** Starting from PTQ and then recovering accuracy.

Phase 1: Block-wise training of all parameters (Block-AP)
Phase 2: End-to-end training of only quantization parameters (E2E-QP)

Supports **INT2**, i.e. uniform 2-bit quantization, not ternary {-1, 0, +1}.

### Approach 4: TernaryLLM-style Knowledge Distillation
**Best for:** Maximum accuracy recovery with feature-level distillation.

- **DLT (Dual Learnable Ternarization):** learnable scale α + shift γ per layer
- **OFF loss:** cosine similarity between FP and ternary features (scale-invariant, outlier-friendly)
- `L_total = L_label + ε·L_logits + δ·L_feat`
- Results: LLaMA-3-8B W1.58A16 outperforms W2A16 (DB-LLM) by 5.8 PPL on C4

### Approach 5: Tequila (Deadzone Trapping Fix)
**Best for:** Fixing the fundamental problem where ternary QAT weights get stuck at 0.

- Problem: STE gives noisy gradients to deadzone weights → they can't escape
- Solution: repurpose deadzone weights as dynamic biases with a learnable reactivation λ
- Forward: `Y = X·Q̂(W)·α + Σᵢ∈D λ·wᵢ`
- Results on LLaMA-3.2-1B (10B tokens): <1% gap to FP on ARC benchmarks

---

## Autonomous Experimentation: Karpathy's autoresearch Pattern

Karpathy's [autoresearch](https://github.com/karpathy/autoresearch) is an autonomous AI-driven experiment loop. An agent iteratively modifies a training script, runs a short training job, evaluates the result, and either keeps or discards the change. The loop runs indefinitely until interrupted.

### Why This Matters for Ternary Quantization

Ternary quantization research has a **large, poorly understood hyperparameter space**: quantization schedules (λ warmup), LR schedules, group sizes, deadzone recovery thresholds, distillation loss weights (ε, δ), and architecture trade-offs. Manually grid-searching this is impractical. The autoresearch pattern automates it.

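To make that search space concrete, the knobs can be written down as an explicit grid for the agent to sample from. The keys and ranges below are placeholders drawn from the phase descriptions later in this plan, not a format autoresearch prescribes:

```python
# Illustrative search space for the agent; keys and ranges are placeholders.
SEARCH_SPACE = {
    "lambda_schedule": ["linear", "cosine", "exponential"],  # quantization warmup shape
    "lambda_warmup_steps": [500, 1000, 2000, 5000],
    "learning_rate": [1e-5, 2e-5, 1e-4, 5e-4, 1e-3],
    "group_size": [64, 128, 256],
    "tequila_lambda": [0.0, 0.25, 0.5, 1.0],                 # deadzone reactivation (0 = off)
    "distill_weights": [(0.0, 0.0), (1.0, 0.1), (1.0, 1.0)], # (ε for logits, δ for features)
}
```
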
### How It Works (adapted for our use case)

```
LOOP FOREVER:
  1. Agent reads current state of train.py and results.tsv
  2. Agent proposes a change (e.g., "try Tequila deadzone reactivation with λ=0.5")
  3. Agent modifies train.py and commits
  4. Run training for fixed time budget (~5 min on small model)
  5. Extract val_bpb / val_ppl from output
  6. Log result to results.tsv (commit, metric, memory, status, description)
  7. If improved → keep the commit
  8. If equal or worse → git reset to previous commit
  9. Repeat
```

### Key Design Choices from autoresearch

| Decision | Rationale |
|----------|-----------|
| **Single mutable file** (`train.py`) | Keeps scope manageable; diffs are reviewable |
| **Fixed time budget** (5 min) | Experiments are comparable regardless of model/architecture changes |
| **Single metric** (`val_bpb`) | Removes ambiguity in what "better" means |
| **Git-based version control** | Automatic rollback on failed experiments; full audit trail |
| **NEVER STOP** directive | Agent runs until manually stopped (e.g., overnight = ~100 experiments) |

### Adapting autoresearch for Ternary Quantization

Our adaptation differs from vanilla autoresearch in key ways:

1. **Metric**: `val_ppl` or `val_bpb` on WikiText/C4 instead of autoresearch's synthetic data metric
2. **Base model**: Start from a pretrained HF model (Llama-3.2-1B) rather than training from scratch
3. **Scope of mutations**: Agent can modify quantization layers, loss functions, warmup schedules, deadzone recovery, distillation weights — not just architecture/hyperparameters
4. **Two-file boundary**: `train.py` (mutable — quantization logic + training loop) vs `prepare.py` (read-only — data loading, tokenizer, evaluation)
5. **Longer runs**: Full QAT fine-tuning needs 10B+ tokens. The autoresearch loop handles **short ablation experiments** (5-15 min) to find the best hyperparameter combos, then the winning config gets a **full long run** outside the loop

### Proposed autoresearch Integration

```
┌─────────────────────────────────────────────┐
│ SHORT-LOOP (autoresearch agent, 5-min runs) │
│  - Quantization schedule shape              │
│  - Lambda warmup length                     │
│  - LR warmup vs constant                    │
│  - Deadzone recovery thresholds             │
│  - Distillation loss weights                │
│  - Group size ablations                     │
│  → Outputs: best hyperparameter config      │
└──────────────────┬──────────────────────────┘
                   │ winning config
                   ▼
┌─────────────────────────────────────────────┐
│ LONG-RUN (manual, full QAT fine-tuning)     │
│  - 10B-100B token training                  │
│  - Full dataset (FineWeb-edu)               │
│  - Eval on WikiText + MMLU + ARC            │
│  → Outputs: production ternary model        │
└─────────────────────────────────────────────┘
```

The short-loop runs autonomously (overnight) to explore the hyperparameter space. Once a winning configuration emerges, you run a full-scale fine-tuning with those settings.

---

## Proposed POC Pipeline

### Phase 0: Infrastructure & autoresearch Setup (2-3 days)
1. Set up the autoresearch-style project structure:
   - `prepare.py` — data loading, tokenizer, evaluation (read-only)
   - `train.py` — model loading, `BitLinear` layer, quantization logic, training loop (mutable by agent)
   - `program.md` — agent instructions specific to ternary quantization experimentation
   - `results.tsv` — experiment log
2. Clone autoresearch repo as reference; adapt `prepare.py` patterns for our data pipeline
3. Set up evaluation harness: WikiText PPL, optionally ARC/MMLU zero-shot
4. **Goal**: Working 5-minute training loop that loads Llama-3.2-1B, applies ternary quantization, and reports val_ppl

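As a concrete starting point for the Phase 0 goal, here is a minimal sketch of the keep/discard driver from the "How It Works" loop above. It assumes `train.py` prints a line like `val_ppl=12.34` and that the project directory is a git repo; the file names come from this plan, while the regex, logging format, and git calls are illustrative:

```python
import re
import subprocess
import time

TIME_BUDGET_S = 5 * 60  # per-experiment budget; here a run that overruns it counts as a failure


def run_experiment() -> float:
    """Run train.py under the time budget and return val_ppl (inf on failure)."""
    try:
        out = subprocess.run(["python", "train.py"],
                             capture_output=True, text=True, timeout=TIME_BUDGET_S)
    except subprocess.TimeoutExpired:
        return float("inf")
    match = re.search(r"val_ppl=([\d.]+)", out.stdout)
    return float(match.group(1)) if match else float("inf")


def keep_or_revert(best_ppl: float) -> float:
    """Commit the agent's edit, run it, log the result, and revert if not improved."""
    subprocess.run(["git", "commit", "-am", "experiment"])
    ppl = run_experiment()
    status = "keep" if ppl < best_ppl else "revert"
    with open("results.tsv", "a") as f:
        f.write(f"{time.time():.0f}\t{ppl:.3f}\t{status}\n")
    if status == "revert":
        subprocess.run(["git", "reset", "--hard", "HEAD~1"])
        return best_ppl
    return ppl
```

The agent's proposal step (actually editing `train.py`) sits outside this sketch; in the autoresearch pattern that edit is made by the LLM agent before `keep_or_revert` is called.
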
### Phase 1: Reproduce HF Blog Fine-tuning (1–2 weeks)
1. Take **Llama-3.2-1B** or **SmolLM-135M**
2. Implement `BitLinear` layer with STE + warmup quantization
3. Fine-tune on FineWeb-edu (10B tokens) with lambda warmup
4. Evaluate on WikiText + zero-shot tasks
5. **Goal:** Validate the pipeline works; establish baseline PPL

### Phase 2: Autonomous Hyperparameter Search via autoresearch (1-2 weeks)
1. Launch autoresearch agent with `program.md` tuned for ternary quantization
2. Agent iteratively explores:
   - Quantization warmup schedules (linear, cosine, exponential)
   - Lambda warmup step counts (500, 1000, 2000, 5000)
   - Learning rates (1e-5 to 1e-3 grid)
   - Group sizes (64, 128, 256)
   - Deadzone recovery strategies (Tequila λ values, ON/OFF)
   - Distillation loss weights (ε, δ)
3. Each experiment: ~5 min run, automatic keep/discard
4. Review `results.tsv` after overnight runs; identify patterns
5. **Goal:** Find optimal hyperparameter configuration through autonomous search (~100-500 experiments)

### Phase 3: Recovery Technique Deep Dive (2–3 weeks)
1. Apply winning autoresearch config as baseline
2. Systematically add **Tequila** deadzone reactivation to `BitLinear`
3. Try **TernaryLLM**-style OFF distillation loss
4. Compare: warmup-only vs warmup+Tequila vs warmup+OFF vs combined
5. Use the autoresearch short-loop to find optimal weights for each technique
6. **Goal:** Find best accuracy recovery method; quantify each technique's contribution

### Phase 4: Full-Scale Fine-tuning (2–4 weeks)
1. Apply winning recipe to target model size (e.g., 7B–8B)
2. Scale to 100B tokens
3. Monitor training loss curve vs FP16 baseline
4. Evaluate on full benchmark suite (WikiText, C4, MMLU, ARC)
5. **Goal:** Ternary model within 10-15% of FP16 baseline on key benchmarks

### Phase 5: Export & Inference (1 week)
1. Export to GGUF Q2_0 format (Bonsai-compatible) or bitnet.cpp I2_S
2. Benchmark inference speed vs FP16 baseline (tokens/sec, memory footprint)
3. Quantize activations for inference (INT8 activations)
4. **Goal:** Production-ready ternary model with measured speed/memory gains

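For a sense of the storage arithmetic behind Phase 5, here is a sketch of packing ternary weights at 2 bits each (four values per byte). This is generic bit-packing to illustrate the footprint, not the actual GGUF Q2_0 or bitnet.cpp I2_S layout:

```python
import numpy as np

def pack_ternary(w: np.ndarray) -> np.ndarray:
    """Pack a flat array of {-1, 0, +1} values into 2 bits each (4 per byte)."""
    codes = (w.astype(np.int8) + 1).astype(np.uint8)  # map {-1, 0, +1} -> {0, 1, 2}
    codes = codes.reshape(-1, 4)                       # assumes len(w) % 4 == 0
    return (codes[:, 0] | (codes[:, 1] << 2) |
            (codes[:, 2] << 4) | (codes[:, 3] << 6)).astype(np.uint8)

def unpack_ternary(packed: np.ndarray) -> np.ndarray:
    """Invert pack_ternary back to a flat {-1, 0, +1} array."""
    codes = np.stack([(packed >> s) & 0b11 for s in (0, 2, 4, 6)], axis=1)
    return codes.reshape(-1).astype(np.int8) - 1
```

Four weights per byte is the 8x reduction over FP16 that the 2-bit formats target; the real formats additionally store per-group scales (group size 128 above), which adds a small overhead on top of the 2 bits per weight.
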
---

## Key Risks & Mitigations

| Risk | Likelihood | Impact | Mitigation |
|------|-----------|--------|------------|
| **Catastrophic forgetting** during QAT | High | High | Use warmup quantization (lambda scheduling); start from instruct-tuned model; use diverse dataset |
| **Deadzone trapping** (weights stuck at 0) | Medium | High | Implement Tequila reactivation; use per-group quantization; autoresearch explores λ values |
| **Training instability** at low LR | Medium | Medium | LR 1e-4 worked for HF; ParetoQ uses 2e-5. Autoresearch grid-searches on small model first |
| **autoresearch agent wastes runs** on bad ideas | Low | Low | The keep/discard loop naturally prunes; 5-min budget limits waste; `program.md` constrains search space |
| **autoresearch metric not correlating** with full fine-tune results | Medium | High | Validate: run 3-5 winning configs as longer runs (30+ min) and check correlation before committing to full run |
| **autoresearch agent breaks train.py** | Medium | Low | Git reset on failure; `prepare.py` is immutable; crash logged and skipped |

---

## References

| Resource | Link |
|----------|------|
| autoresearch (Karpathy) | https://github.com/karpathy/autoresearch |
| Bonsai / Ternary-Bonsai (PrismML) | https://huggingface.co/prism-ml |
| ParetoQ (Meta) | https://github.com/facebookresearch/ParetoQ |
| HF Blog: Ternary LLM Fine-tuning | https://huggingface.co/blog/ternary-llm |
| Tequila (Deadzone Trapping) | https://arxiv.org/abs/2506.18907 |
| TernaryLLM (Distillation) | https://arxiv.org/abs/2406.11943 |
| EfficientQAT (PTQ + Fine-tune) | https://github.com/microsoft/BrickFlow |
| ParetoQ MobileLLM Models | https://huggingface.co/collections/meta/pq-675198e3097f6a25e810eea2 |