Orthrus-Qwen3.5: Full-Coverage Parallel Token Generation
Adapting the Orthrus framework to Qwen3.5-0.8B's hybrid architecture (18 GatedDeltaNet + 6 GatedAttention layers).
New here? Read ONBOARDING.md first.
Status: Architecture Ceiling Reached
| Phase | Status |
|---|---|
| Validation tests (1, 3, 3b, consensus, rank analysis) | PASS |
| Production training pipeline (KL distillation, GaLore, diagnostics) | Complete |
| Training (Run 13) | Peaked at 55.6% eval exact — overfitting after step 2500 |
| Per-layer diagnostics | Done — both DeltaNet and attention layers underperform |
| Inference with diffusion + consensus | Blocked — need >70% exact for meaningful TPF |
Peak Training Performance (Run 13)
Step 2500: KL= 0.84 exact=55.8% eval=55.6% ← peak eval
Step 3000: KL= 0.82 exact=56.5% eval=52.9% ← overfitting
Per-Layer Diagnostics (Step 2500)
| Layer Type | Cos Sim | KL/layer | Exact/layer |
|---|---|---|---|
| DeltaNet (18L) | 0.219 | 0.020 | 9.3% |
| Attention (6L) | 0.329 | 0.047 | 14.7% |
Both layer types are far from converged. Worst DeltaNet layer (L10) has cos_sim=-0.083 (anti-correlated with AR). The 55% ceiling is likely a fundamental limitation of the first-order bidirectional delta rule approximation.
Architecture
The Problem
Qwen3.5-0.8B uses a hybrid architecture: 18 DeltaNet (linear attention, recurrent state) + 6 standard attention layers. Standard Orthrus only works on attention layers (shared KV cache). DeltaNet layers have no KV cache — they use a recurrent state matrix S_t of shape (B, 16, 128, 128).
Partial coverage (attention-only) has a hard ceiling of ~1.6x speedup because DeltaNet layers are 76% of compute. We need full 24-layer coverage.
Bidirectional Delta Rule (Core Innovation)
For DeltaNet layers, we replace the causal chunk kernel with a bidirectional computation during the diffusion pass:
1. Decay products: G_t = Π_{j=t+1}^{K} g_j (a suffix product, computable with a parallel scan)
2. Surprise: δ_t = β_t * (v_t - S_prefix @ k_t)
3. Aggregate: ΔS = Σ_t G_t * k_t ⊗ δ_t (decay-weighted, all tokens simultaneously)
4. Bidir state: S_bi = (Π g_i) * S_prefix + ΔS
5. Output: o_t = S_bi @ q_t (each token sees ALL other tokens' corrections)
Token t's output depends on token t+1's key-value binding through ΔS. This is structurally different from causal recurrence, in the same way that Orthrus attention is bidirectional within the block.
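The five steps above can be sketched in NumPy for a single head and a single block. This is a minimal illustration, not the repository's kernel: the function name, the state orientation (S maps a key to a value, so each correction is written as δ_t k_tᵀ), and the random test data are all assumptions made here for the example.

```python
import numpy as np

def bidirectional_delta(S_prefix, q, k, v, beta, g):
    """First-order bidirectional delta rule over one block of K tokens.

    S_prefix: (d_v, d_k) state after the prefix; q, k: (K, d_k);
    v: (K, d_v); beta, g: (K,) write strengths and decay gates.
    Shapes and orientation are assumptions for this sketch.
    """
    # 1. Decay products G_t = prod_{j>t} g_j (exclusive suffix product).
    suffix_incl = np.cumprod(g[::-1])[::-1]        # prod_{j>=t} g_j
    G = np.append(suffix_incl[1:], 1.0)            # shift by one, last G is 1
    # 2. Surprise of every token, measured against the shared prefix state.
    delta = beta[:, None] * (v - k @ S_prefix.T)   # (K, d_v)
    # 3. Decay-weighted sum of rank-1 corrections, all tokens at once.
    dS = np.einsum("t,tv,tk->vk", G, delta, k)     # (d_v, d_k)
    # 4. Bidirectional state.
    S_bi = np.prod(g) * S_prefix + dS
    # 5. Every token reads from the same state.
    o = q @ S_bi.T                                 # (K, d_v)
    return o, S_bi

rng = np.random.default_rng(0)
K, d = 4, 8
S0 = rng.normal(size=(d, d))
q, k, v = (rng.normal(size=(K, d)) for _ in range(3))
beta = rng.uniform(0.1, 1.0, size=K)
g = rng.uniform(0.8, 1.0, size=K)

o, _ = bidirectional_delta(S0, q, k, v, beta, g)
v_changed = v.copy()
v_changed[-1] += 1.0                               # perturb only the last token
o_changed, _ = bidirectional_delta(S0, q, k, v_changed, beta, g)
first_token_shift = np.abs(o_changed[0] - o[0]).max()
```

Perturbing only the last token's value changes the first token's output (`first_token_shift` is nonzero), which is the "future context" dependence that a causal recurrence cannot produce.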
Known limitation: The first-order approximation (ignoring non-commutativity of state updates) causes error that compounds across 18 DeltaNet layers. Per-layer cos_sim=0.22, with anti-correlated layers. This limits exact match to ~55%.
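To make the non-commutativity concrete, the toy comparison below contrasts exact sequential delta updates with the first-order version, where every surprise is measured against the same starting state. It is a simplified sketch under stated assumptions (β=1, no decay, 2×2 state, hand-picked keys and values), not the project's code.

```python
import numpy as np

def sequential_updates(S, keys, values):
    """Exact delta rule: each write sees the state left by the previous one."""
    for k, v in zip(keys, values):
        S = S + np.outer(v - S @ k, k)
    return S

def first_order_updates(S, keys, values):
    """Approximation: every surprise is measured against the same start state."""
    return S + sum(np.outer(v - S @ k, k) for k, v in zip(keys, values))

S0 = np.zeros((2, 2))
values = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
orthogonal = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
overlapping = [np.array([1.0, 0.0]), np.array([0.6, 0.8])]

gap_orthogonal = np.abs(
    sequential_updates(S0, orthogonal, values)
    - first_order_updates(S0, orthogonal, values)
).max()
gap_overlapping = np.abs(
    sequential_updates(S0, overlapping, values)
    - first_order_updates(S0, overlapping, values)
).max()
```

With orthogonal keys the two results coincide. With overlapping keys the approximation misses the cross term proportional to k_1·k_2, and this kind of per-layer error is what the limitation above describes as compounding across layers.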
Per-Layer Design
| | Full Attention (6 layers) | DeltaNet (18 layers) |
|---|---|---|
| AR pass | Standard, builds KV cache | Standard, builds recurrent + conv state |
| Shared state | KV cache (K_AR, V_AR) | Recurrent state S_t + conv state |
| Diff projections | Full-rank, init from AR weights | LoRA rank 256 (PiSSA init), scaling=32/sqrt(rank) (rsLoRA) |
| Diffusion pass | Attend to [K_AR || K_diff] bidir | Bidirectional delta rule with S_prefix |
| Trainable params | ~7.3M/layer = 44M | ~1.8M/layer = 32M |
| Total trainable | ~106M / 859M (12.4%) | |
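The rsLoRA scaling in the table, 32/sqrt(rank), can be checked with a few lines of arithmetic. The helper names are made up for this example, and calling the numerator `alpha` is an assumption; the classic alpha/rank scaling is shown only for comparison.

```python
import math

def rslora_scaling(alpha: float, rank: int) -> float:
    """Rank-stabilized LoRA: scale the low-rank update by alpha / sqrt(rank)."""
    return alpha / math.sqrt(rank)

def lora_scaling(alpha: float, rank: int) -> float:
    """Classic LoRA scaling, shown for comparison only."""
    return alpha / rank

for rank in (128, 256):
    print(rank, rslora_scaling(32, rank), lora_scaling(32, rank))
```

At rank 256 the rsLoRA factor is 32/16 = 2.0, whereas the classic rule would give 0.125, so the update does not shrink as quickly when the rank is raised.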
Consensus Mechanism
The AR (causal, sequential) pass validates the proposals from the bidirectional diffusion pass. For attention layers, validation is exact via the KV cache. For DeltaNet layers, validation is chunk-processed with frozen AR weights (approximate but conservative). Accepting or rejecting proposals this way preserves output quality.
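The README does not spell out the acceptance rule, so the sketch below assumes the common greedy scheme from speculative decoding: keep the longest prefix of the proposal that the AR pass agrees with, then take the AR token at the first mismatch. The function names and token IDs are hypothetical.

```python
from typing import List, Sequence

def accept_prefix(proposed: Sequence[int], verified: Sequence[int]) -> int:
    """Number of leading proposal tokens that the AR pass agrees with."""
    accepted = 0
    for p, a in zip(proposed, verified):
        if p != a:
            break
        accepted += 1
    return accepted

def commit(proposed: Sequence[int], verified: Sequence[int]) -> List[int]:
    """Keep the agreed prefix, then take the AR token at the first mismatch."""
    n = accept_prefix(proposed, verified)
    return list(proposed[:n]) + list(verified[n:n + 1])

block = [11, 42, 7, 99, 5, 3, 8, 1]   # diffusion proposal for one block, K=8
ar = [11, 42, 7, 13, 5, 3, 8, 1]      # AR argmax at the same positions
committed = commit(block, ar)
```

Here three proposals are accepted and the fourth is replaced by the AR token, so the step commits four tokens. Tokens after the mismatch are discarded because the AR predictions there were conditioned on a rejected token.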
Validation Results
Test 1: Frozen Bidirectional vs Causal DeltaNet Output
- Random weights: cos_sim=0.037 (expected: chaotic dynamics with random weights)
- Real Qwen3.5-0.8B weights: cos_sim=0.904, relative_L2=1.43
Per-token pattern confirms bidirectional information flow:
- Last token: cos_sim=0.995 (near identical — both paths have full context by last token)
- Earlier tokens: cos_sim≈0.0-0.1 (bidir has "future context" that causal doesn't)
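Interpretation: With real trained weights, the bidir and causal outputs have an aggregate cosine similarity of 0.90. The magnitude difference (relative L2 of 1.43, i.e. an error larger than the reference norm) was read as a scaling mismatch, and the expectation at this stage was that LoRA could bridge the remaining 10% gap.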
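The two metrics quoted above measure different things, which a small example makes visible. The definitions here are assumptions (relative L2 taken as ||a - b|| / ||b|| with the causal output as reference), and the vectors are made up.

```python
import math

def cos_sim(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def relative_l2(a, b):
    """||a - b|| / ||b||, with b as the reference (causal) output."""
    return math.hypot(*(x - y for x, y in zip(a, b))) / math.hypot(*b)

causal = [0.5, -1.0, 2.0, 0.25]
scaled = [2.43 * x for x in causal]   # same direction, larger magnitude

direction = cos_sim(scaled, causal)
magnitude_error = relative_l2(scaled, causal)
```

A pure rescaling leaves the cosine similarity at 1 while the relative L2 error is 1.43, so a relative L2 above 1 is compatible with well-aligned directions.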
Test 3: Single-Sequence Overfit (RTX 3090, CUDA, bf16)
Result: PASS. KL loss drops from 544 → 0.43 in 100 steps. 100% exact token match achieved by step 80.
Step 10: KL=544.00, exact_match=10%
Step 50: KL=184.00, exact_match=40%
Step 80: KL=1.28, exact_match=100%
Step 100: KL=0.43, exact_match=100%
Conclusion: Green light for full training pipeline.
Test 3b: Single-Sequence Overfit with LoRA rank 128 (9.1% trainable)
Trainable: 75.7M / 828M (9.1%)
Step 90: KL=19.8, exact_match=100%
Step 200: KL=0.30, exact_match=100% (stable, no oscillation)
Conclusion: LoRA rank 128 is sufficient. Proceeding with 9.1% trainable params.
Training
Setup
- Data: 588K tokenized samples across 5 domains (code, math, instruction, scientific, multilingual), truncated to 2048 tokens
- Optimizer: GaLoreAdamW8bit (full-rank _diff) + AdamW8bit (LoRA), LR=2e-4 with warmup-hold-linear schedule
- Hardware: RTX 3090 (24GB), CUDA 13.2, PyTorch bf16
- Trainable: 106.4M / 858.8M (12.4%) — LoRA rank 256 on DeltaNet, full-rank diff on attention layers
- FLA: flash-linear-attention + causal-conv1d installed for fast DeltaNet kernels
- Checkpoint: Saves every 1000 steps with model weights + optimizer/scheduler state. Resume with --resume_from.
- Diagnostics: Per-layer eval (cos_sim, KL, exact by layer type) runs every eval step.
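A warmup-hold-linear schedule like the one named in the optimizer entry could look as follows. Only the peak LR of 2e-4 comes from the setup above; the warmup, hold, and total step counts, the final LR, and the function name are placeholder assumptions.

```python
def warmup_hold_linear(step, peak_lr=2e-4, warmup=100, hold=1000,
                       total=3000, final_lr=0.0):
    """Linear warmup, constant hold, then linear decay to final_lr.

    All step counts are illustrative placeholders, not the run's settings.
    """
    if step < warmup:
        return peak_lr * (step + 1) / warmup
    if step < warmup + hold:
        return peak_lr
    decay_steps = max(1, total - warmup - hold)
    progress = min(1.0, (step - warmup - hold) / decay_steps)
    return peak_lr + (final_lr - peak_lr) * progress

schedule = [warmup_hold_linear(s) for s in (0, 99, 500, 2050, 3000)]
```

The hold phase keeps the LR at its peak for a fixed number of steps before decay starts, which separates "how long to train at full LR" from "how long to anneal".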
Training Method: KL Distillation
Each microbatch:
- AR forward (no_grad): run full causal model on prefix+block to get target distribution
- Prefix forward (with cache): process prefix, save KV cache (attention) and recurrent states (DeltaNet)
- Diffusion forward (with grad): run bidirectional diffusion pass on block tokens using shared prefix state
- Loss: KL(diffusion_logits || AR_logits) — train diff projections so bidir matches causal
Block size K=8 (fixed). Random prefix split per microbatch. LoRA rank 256 with PiSSA init.
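The loss and the exact-match metric can be written out in plain Python for one block. This is an illustrative sketch, not the training code: the KL direction follows the notation above literally (diffusion distribution first), the helper names are invented, and the toy logits use a 3-token vocabulary.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one position's logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def kl_divergence(diff_logits, ar_logits):
    """KL(p_diff || p_ar) for one position, in nats."""
    lp, lq = log_softmax(diff_logits), log_softmax(ar_logits)
    return sum(math.exp(a) * (a - b) for a, b in zip(lp, lq))

def argmax(xs):
    return max(range(len(xs)), key=xs.__getitem__)

def block_metrics(diff_block, ar_block):
    """Mean KL and exact-match rate over the positions of one block."""
    kls = [kl_divergence(d, a) for d, a in zip(diff_block, ar_block)]
    hits = [argmax(d) == argmax(a) for d, a in zip(diff_block, ar_block)]
    return sum(kls) / len(kls), sum(hits) / len(hits)

ar_block = [[2.0, 0.0, -1.0], [0.0, 3.0, 0.0]]     # teacher (AR) logits
diff_block = [[2.0, 0.0, -1.0], [1.0, 0.5, 0.0]]   # student (diffusion) logits
mean_kl, exact = block_metrics(diff_block, ar_block)
```

The first position matches the teacher exactly and contributes zero KL; the second disagrees on the argmax, so the block's exact-match rate is 0.5. In training, the same quantities would be computed with batched tensor ops and gradients flowing only through the diffusion logits.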
Run History Summary
| Run | Key Change | Best Eval Exact | Status |
|---|---|---|---|
| 5-9 | Various configs, bugs present | ~38% | Plateaued |
| 10 | All 6 init fixes | ~48.7% | New ceiling |
| 11 | warmup-hold-cosine LR, K=8 | 51.75% | Climbing |
| 12 | PiSSA + rsLoRA + GaLore | 52.8% | Checkpoint bug |
| 13 | Checkpoint fix, per-layer diagnostics | 55.6% | Architecture ceiling |
Full history: see docs/TRAINING_LOG.md
Open Questions
- How to break 55%: Second-order delta rule correction? StateQuery + BlockAttention? Higher LoRA rank?
- Anti-correlated DeltaNet layers: L6/L10 have cos_sim=-0.083. Training instability or fundamental issue?
- Supervision density: The paper uses 256 blocks/step; we use 1. Would more blocks help even though K=8 is the optimal block size?
- Consensus for DeltaNet: Chunk-processed consensus is approximate. May need sequential verification.
References
- Orthrus paper: arXiv:2605.12825 → summary
- GatedDeltaNet: arXiv:2412.06464 → summary
- Qwen3: arXiv:2505.09388 → summary
- Reference implementation: https://github.com/chiennv2000/orthrus
- Diagnostic report: docs/REPORT.md
- Subagent analysis: docs/SUBAGENT_ANALYSIS.md
Model
- Base: Qwen/Qwen3.5-0.8B (1.6GB bf16, text-only weights)
- Local path: ~/.cache/huggingface/hub/models--Qwen--Qwen3.5-0.8B/snapshots/...
- Training target: single 24GB GPU (RTX 3090 on sleepy-wsl)