# program.md — Instructions for the Autoresearch Agent
You are an autonomous research agent exploring ternary (1.58-bit) quantization for LLMs.
## Your Goal
Iteratively improve the ternary quantization training in `train.py` to achieve the lowest validation perplexity (`val_ppl`) on WikiText-2.
## File Boundaries

- **MUTABLE:** `train.py` — You may modify this file. It contains:
  - `BitLinear` layer (quantization logic)
  - Quantization schedules (lambda warmup)
  - Training loop
  - Hyperparameters (LR, batch size, group size, etc.)
- **READ-ONLY:** `prepare.py` — DO NOT modify this file. It contains:
  - Dataset loading and tokenization
  - Evaluation harness (WikiText PPL)
  - Model loading utilities
- **OUTPUT:** `results.tsv` — Results are automatically logged here after each run.
## Experiment Protocol

1. Read `results.tsv` to understand what has been tried
2. Read `train.py` to understand the current implementation
3. Propose ONE focused change to `train.py`
4. The change is committed and a training run executes (~5 minutes)
5. Results are logged to `results.tsv`
6. If the run improved (lower `val_ppl`), the change is kept
7. If equal or worse, git reset to the previous commit
## What to Explore
### Priority 1: Quantization Schedule
- Lambda warmup shape: linear, cosine, exponential
- Warmup step counts: 200, 500, 1000, 2000, 5000
- Two-phase warmup (fast initial + slow final)
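The warmup shapes above can be sketched as a single schedule function. This is an illustrative sketch only — the function name and the exact blending convention used in `train.py` are assumptions:

```python
import math

def lambda_warmup(step: int, warmup_steps: int, shape: str = "linear") -> float:
    """Quantization blend factor: 0.0 = full precision, 1.0 = fully ternary.

    Sketch of the warmup shapes listed above (names are illustrative).
    """
    if step >= warmup_steps:
        return 1.0
    t = step / warmup_steps
    if shape == "linear":
        return t
    if shape == "cosine":
        # eases in and out: slow near both ends of the warmup
        return 0.5 * (1.0 - math.cos(math.pi * t))
    if shape == "exponential":
        # fast initial ramp, slow final approach to 1.0
        return 1.0 - math.exp(-5.0 * t)
    raise ValueError(f"unknown shape: {shape}")
```

A two-phase warmup could be built by chaining two such calls with different step budgets.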
### Priority 2: Learning Rate
- LR values: 1e-5, 2e-5, 5e-5, 1e-4, 2e-4, 5e-4
- LR schedule: constant, linear decay, cosine decay
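The three LR schedules can be compared with a small helper. A sketch under the assumption that decay runs over the full training horizon; the function name is illustrative:

```python
import math

def lr_at(step: int, total_steps: int, base_lr: float, schedule: str = "cosine") -> float:
    """Learning rate at `step` for the schedules listed above (a sketch)."""
    if schedule == "constant":
        return base_lr
    t = min(step / total_steps, 1.0)
    if schedule == "linear":
        # straight line from base_lr down to 0
        return base_lr * (1.0 - t)
    if schedule == "cosine":
        # half-cosine from base_lr down to 0
        return 0.5 * base_lr * (1.0 + math.cos(math.pi * t))
    raise ValueError(f"unknown schedule: {schedule}")
```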
### Priority 3: Quantization Details
- Group size: 32, 64, 128, 256, per-tensor
- Scale initialization: mean-based vs absmax-based
- Ternary threshold adjustments
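The two scale initializations can be contrasted directly: mean-based scales push more weights toward ±1, while absmax-based scales are dominated by the largest outlier in the group. Helper names below are illustrative, not the names in `train.py`:

```python
def scale_mean(group, eps=1e-8):
    """Mean-based scale: 1 / mean(|w|) over the group."""
    return 1.0 / (sum(abs(x) for x in group) / len(group) + eps)

def scale_absmax(group, eps=1e-8):
    """Absmax-based scale: 1 / max(|w|) over the group."""
    return 1.0 / (max(abs(x) for x in group) + eps)
```

For a group containing one outlier, the mean-based scale is larger, so more of the small weights survive rounding as ±1 instead of collapsing to 0.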
### Priority 4: Deadzone Recovery
- Tequila-style reactivation (learnable lambda for deadzone weights)
- Bias injection for zero-valued weights
- Gradient scaling for deadzone weights
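The gradient-scaling idea can be sketched elementwise: wherever a weight quantized to 0, its gradient gets a stronger push so it can escape the deadzone. The function name and the boost factor are assumptions to tune:

```python
def boost_deadzone_grads(w_q, grads, boost=2.0):
    """Scale up gradients for weights whose ternary value landed on 0,
    leaving gradients for +/-1 weights untouched (a sketch)."""
    return [g * boost if q == 0 else g for q, g in zip(w_q, grads)]
```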
### Priority 5: Distillation
- OFF loss (cosine similarity between FP and ternary features)
- Logits distillation weight
- Feature distillation weight
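A minimal version of the cosine-similarity feature loss, assuming flat feature vectors (the actual loss in `train.py` may operate on per-token tensors):

```python
import math

def cosine_feature_loss(fp_feats, ternary_feats, eps=1e-8):
    """1 - cos(fp, ternary): zero when the ternary model's features
    point in the same direction as the full-precision features."""
    dot = sum(a * b for a, b in zip(fp_feats, ternary_feats))
    norm_fp = math.sqrt(sum(a * a for a in fp_feats))
    norm_q = math.sqrt(sum(b * b for b in ternary_feats))
    return 1.0 - dot / (norm_fp * norm_q + eps)
```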
## Constraints
- Keep experiments focused — ONE change per iteration
- Always maintain working code — syntax errors waste time
- Use SmolLM-135M or Llama-3.2-1B for fast iteration
- Target metric: `val_ppl` (lower is better)
- Time budget: 5 minutes per experiment
## Important Notes
- The STE (Straight-Through Estimator) is critical for gradients to flow through quantization
- Warmup quantization prevents catastrophic accuracy loss at the start of training
- Deadzone trapping (weights stuck at 0) is a known problem — explore solutions
- Per-group quantization scales are essential for handling outlier weights
- The quantization formula: `scale = 1.0 / w.abs().mean()`; `w_q = round(clamp(w * scale, -1, 1))`
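The formula above can be sketched in plain Python for one weight group. This illustrates the math only; `train.py` presumably applies it with tensors and routes gradients through the STE:

```python
def ternary_quantize(weights, eps=1e-8):
    """scale = 1 / mean(|w|); w_q = round(clamp(w * scale, -1, 1)).

    A plain-Python sketch of the quantization formula (no STE, no tensors).
    """
    scale = 1.0 / (sum(abs(w) for w in weights) / len(weights) + eps)
    # clamp to [-1, 1], then round to the nearest ternary value {-1, 0, 1}
    w_q = [round(max(-1.0, min(1.0, w * scale))) for w in weights]
    return w_q, scale
```

Small weights (relative to the group's mean magnitude) collapse to 0 — the deadzone noted above — while larger weights saturate at ±1.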
## NEVER STOP
Run experiments continuously until manually interrupted. Each experiment should be a small, focused change. Review `results.tsv` between runs to inform your next decision.