  • Python 97.5%
  • Shell 2.5%
Repository files (latest commit first)
Filename Latest commit message Latest commit date
docs Update docs for Run 14: dual-scan architecture, conv1d fix, DirDrop, multi-block training 2026-05-17 22:29:39 +02:00
.gitignore Initial project: bidirectional delta rule kernel + validation tests 2026-05-16 12:47:57 +02:00
AGENTS.md Add ONBOARDING.md, move docs to docs/, update references 2026-05-17 11:50:55 +02:00
analyze_training.py Fix 4 critical training bugs + add training analysis 2026-05-16 23:04:52 +02:00
bidirectional_delta.py Initial project: bidirectional delta rule kernel + validation tests 2026-05-16 12:47:57 +02:00
debug_load.py All 3 validation tests pass 2026-05-16 13:08:28 +02:00
model.py fix PiSSA: correct SVD→LoRA shape mapping 2026-05-17 09:47:38 +02:00
ONBOARDING.md Update docs for Run 14: dual-scan architecture, conv1d fix, DirDrop, multi-block training 2026-05-17 22:29:39 +02:00
README.md Run 13: 55.6% eval exact (architecture ceiling), per-layer diagnostics, checkpoint fix 2026-05-17 16:37:38 +02:00
sweep.sh Replace OneCycleLR with warmup-hold-cosine decay schedule 2026-05-17 08:21:38 +02:00
test_load.py All 3 validation tests pass 2026-05-16 13:08:28 +02:00
train.py Run 13: 55.6% eval exact (architecture ceiling), per-layer diagnostics, checkpoint fix 2026-05-17 16:37:38 +02:00
validate_consensus.py Consensus test: frozen bidir produces 0% logit-level match (expected) 2026-05-16 13:50:17 +02:00
validate_rank.py Training pipeline: FLA fast path, 8-bit Adam, contiguous tensor dataset 2026-05-16 17:10:49 +02:00
validate_test1.py Initial project: bidirectional delta rule kernel + validation tests 2026-05-16 12:47:57 +02:00
validate_test2.py All 3 validation tests pass 2026-05-16 13:08:28 +02:00
validate_test3.py LoRA rank 128 validated: 9.1% trainable, 100% exact match by step 90 2026-05-16 13:24:27 +02:00

Orthrus-Qwen3.5: Full-Coverage Parallel Token Generation

Adapting the Orthrus framework to Qwen3.5-0.8B's hybrid architecture (18 GatedDeltaNet + 6 GatedAttention layers).

New here? Read ONBOARDING.md first.

Status: Architecture Ceiling Reached

| Phase | Status |
|---|---|
| Validation tests (1, 3, 3b, consensus, rank analysis) | PASS |
| Production training pipeline (KL distillation, GaLore, diagnostics) | Complete |
| Training (Run 13) | Peaked at 55.6% eval exact; overfitting after step 2500 |
| Per-layer diagnostics | Done; both DeltaNet and attention layers underperform |
| Inference with diffusion + consensus | Blocked; need >70% exact for meaningful TPF |

Peak Training Performance (Run 13)

Step   2500: KL= 0.84  exact=55.8%  eval=55.6%  ← peak eval
Step   3000: KL= 0.82  exact=56.5%  eval=52.9%  ← overfitting

Per-Layer Diagnostics (Step 2500)

| Layer Type | Cos Sim | KL/layer | Exact/layer |
|---|---|---|---|
| DeltaNet (18L) | 0.219 | 0.020 | 9.3% |
| Attention (6L) | 0.329 | 0.047 | 14.7% |

Both layer types are far from converged. Worst DeltaNet layer (L10) has cos_sim=-0.083 (anti-correlated with AR). The 55% ceiling is likely a fundamental limitation of the first-order bidirectional delta rule approximation.

Architecture

The Problem

Qwen3.5-0.8B uses a hybrid architecture: 18 DeltaNet (linear attention, recurrent state) + 6 standard attention layers. Standard Orthrus only works on attention layers (shared KV cache). DeltaNet layers have no KV cache — they use a recurrent state matrix S_t of shape (B, 16, 128, 128).

Partial coverage (attention-only) has a hard ceiling of ~1.6x speedup because DeltaNet layers are 76% of compute. We need full 24-layer coverage.

Bidirectional Delta Rule (Core Innovation)

For DeltaNet layers, we replace the causal chunk kernel with a bidirectional computation during the diffusion pass:

1. Decay products:  G_t = Π_{j=t+1}^{K} g_j    (suffix product, computable with a parallel scan)
2. Surprise:        δ_t = β_t * (v_t - S_prefix @ k_t)
3. Aggregate:       ΔS = Σ_t G_t * k_t ⊗ δ_t    (decay-weighted, all tokens simultaneously)
4. Bidir state:     S_bi = (Π g_i) * S_prefix + ΔS
5. Output:          o_t = S_bi @ q_t             (each token sees ALL other tokens' corrections)

Token t's output depends on token t+1's key-value binding through ΔS. This is structurally different from causal recurrence, in the same way that Orthrus attention is bidirectional within the block.
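The five steps above can be sketched in NumPy for a single head and a single block. This is an illustrative sketch, not the kernel in bidirectional_delta.py: the function name, argument names, and the state layout are assumptions. In particular it assumes S has shape (d_v, d_k), so that `S @ k_t` returns a value-sized vector and the outer product in step 3 is written as δ_t k_tᵀ.

```python
import numpy as np

def bidirectional_delta_block(S_prefix, q, k, v, beta, g):
    """First-order bidirectional delta rule for one head and one block.

    S_prefix: (d_v, d_k) recurrent state after the prefix (assumed layout)
    q, k:     (K, d_k)   queries and keys of the K block tokens
    v:        (K, d_v)   values
    beta:     (K,)       write strengths
    g:        (K,)       per-token decay gates
    """
    # Step 1: G_t = prod_{j=t+1}^{K} g_j, a suffix product (G for the last token is 1).
    suffix_incl = np.cumprod(g[::-1])[::-1]        # prod_{j=t}^{K} g_j
    G = np.append(suffix_incl[1:], 1.0)            # shift by one position

    # Step 2: surprise of every token against the prefix state only.
    delta = beta[:, None] * (v - k @ S_prefix.T)   # (K, d_v)

    # Step 3: decay-weighted sum of outer products delta_t k_t^T.
    dS = np.einsum("t,tv,tk->vk", G, delta, k)     # (d_v, d_k)

    # Step 4: decay the prefix state across the whole block, then add corrections.
    S_bi = suffix_incl[0] * S_prefix + dS

    # Step 5: every token reads from the same bidirectional state.
    out = q @ S_bi.T                               # (K, d_v)
    return out, S_bi
```

Because every token reads from the same `S_bi`, changing the value of the last token in the block changes the output of the first one, which is the bidirectional behaviour described above.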

Known limitation: The first-order approximation (ignoring non-commutativity of state updates) causes error that compounds across 18 DeltaNet layers. Per-layer cos_sim=0.22, with anti-correlated layers. This limits exact match to ~55%.
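To make the non-commutativity concrete, the sketch below compares the final state of a sequential gated delta rule with the first-order aggregate. It is an assumption-laden illustration: the causal recurrence is written in the commonly used gated delta rule form (decay, then correct against the current state), which may differ in detail from the FLA kernel used here, and all function names are hypothetical. With gates equal to 1 and mutually orthogonal keys the two agree; with repeated keys they do not.

```python
import numpy as np

def causal_delta_state(S0, k, v, beta, g):
    """Sequential reference: each token corrects against the *current* state."""
    S = S0.copy()
    for t in range(len(g)):
        S = g[t] * S                                   # decay
        S = S + beta[t] * np.outer(v[t] - S @ k[t], k[t])  # delta-rule write
    return S

def first_order_state(S0, k, v, beta, g):
    """First-order aggregate: each token corrects against the prefix state only."""
    suffix_incl = np.cumprod(g[::-1])[::-1]
    G = np.append(suffix_incl[1:], 1.0)
    delta = beta[:, None] * (v - k @ S0.T)
    return suffix_incl[0] * S0 + np.einsum("t,tv,tk->vk", G, delta, k)

def state_gap(S0, k, v, beta, g):
    """Relative Frobenius distance between the two final states."""
    exact = causal_delta_state(S0, k, v, beta, g)
    approx = first_order_state(S0, k, v, beta, g)
    return np.linalg.norm(exact - approx) / np.linalg.norm(exact)
```

The gap grows with key overlap inside the block, which is one plausible reading of why the error compounds over 18 stacked DeltaNet layers.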

Per-Layer Design

| | Full Attention (6 layers) | DeltaNet (18 layers) |
|---|---|---|
| AR pass | Standard, builds KV cache | Standard, builds recurrent + conv state |
| Shared state | KV cache (K_AR, V_AR) | Recurrent state S_t + conv state |
| Diff projections | Full-rank, init from AR weights | LoRA rank 256 (PiSSA init), scaling=32/sqrt(rank) (rsLoRA) |
| Diffusion pass | Attend to [K_AR ‖ K_diff] bidir | Bidirectional delta rule with S_prefix |
| Trainable params | ~7.3M/layer = 44M | ~3.5M/layer = 62M |
| Total trainable | ~106M / 859M (12.4%) | |
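The DeltaNet column combines two published LoRA variants: PiSSA initialises the adapter from the top singular directions of the frozen weight, and rsLoRA scales the adapter by alpha/sqrt(rank) instead of alpha/rank. The sketch below shows one way the two can fit together. It is not taken from model.py: the function names are hypothetical, and folding the scale into the factors so that the initial model is unchanged is an assumption.

```python
import math
import numpy as np

def rslora_scale(alpha, rank):
    """rsLoRA scaling: alpha / sqrt(rank) (classic LoRA uses alpha / rank)."""
    return alpha / math.sqrt(rank)

def pissa_init(W, rank, alpha):
    """Split W into a frozen residual plus a rank-`rank` adapter B @ A.

    W: (out, in). Returns (W_res, A, B, scale) with
    W_res + scale * (B @ A) == W at initialisation.
    """
    scale = rslora_scale(alpha, rank)
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    # Share each singular value between A and B, pre-dividing by the scale
    # so the scaled product reproduces the top-`rank` part of W exactly.
    root = np.sqrt(s[:rank] / scale)
    B = U[:, :rank] * root             # (out, rank)
    A = root[:, None] * Vt[:rank]      # (rank, in)
    W_res = W - scale * (B @ A)        # frozen residual
    return W_res, A, B, scale
```

With the values in the table, `rslora_scale(32, 256)` gives a scaling of 2.0, whereas alpha/rank would give 0.125.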

Consensus Mechanism

The AR (causal, sequential) path validates the proposals produced by the bidirectional diffusion pass. For attention layers the check is exact via the KV cache. For DeltaNet layers the block is chunk-processed with frozen AR weights, which is approximate but conservative. Acceptance/rejection of the proposed tokens ensures output quality.
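The acceptance rule itself is not spelled out here. A common choice in draft-and-verify decoding is greedy prefix acceptance: keep proposals while they agree with the AR tokens, and take the AR token at the first disagreement. The sketch below assumes that rule; the function name and return convention are hypothetical.

```python
def accept_prefix(proposed, verified):
    """Greedy prefix acceptance.

    proposed: token ids from the diffusion pass for one block
    verified: token ids the AR path would emit at the same positions
    Returns (number of accepted proposals, tokens actually kept).
    """
    n = 0
    for p, a in zip(proposed, verified):
        if p != a:
            break
        n += 1
    kept = list(proposed[:n])
    if n < len(verified):
        # The AR token replaces the first rejected proposal.
        kept.append(verified[n])
    return n, kept
```

For example, `accept_prefix([5, 7, 9, 2], [5, 7, 1, 2])` accepts two proposals and keeps `[5, 7, 1]`. Under this rule the number of tokens gained per block depends directly on per-token exact match, which is why the exact-match rate gates inference.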

Validation Results

Test 1: Frozen Bidirectional vs Causal DeltaNet Output

  • Random weights: cos_sim=0.037 (expected: chaotic dynamics with random weights)
  • Real Qwen3.5-0.8B weights: cos_sim=0.904, relative_L2=1.43

Per-token pattern confirms bidirectional information flow:

  • Last token: cos_sim=0.995 (near identical — both paths have full context by last token)
  • Earlier tokens: cos_sim≈0.0-0.1 (bidir has "future context" that causal doesn't)

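Interpretation: With real trained weights, bidir and causal outputs reach a cosine similarity of 0.904, so they point in largely the same direction. The relative L2 error of 1.43 exceeds the norm of the causal output, which suggests that most of the remaining difference is a magnitude mismatch rather than a change of direction. The working expectation at this stage was that LoRA could bridge the remaining gap.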
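The two metrics quoted in this test can be computed as follows. This is a minimal sketch with hypothetical function names, assuming outputs of shape (tokens, hidden); validate_test1.py may aggregate differently.

```python
import numpy as np

def per_token_cos_sim(a, b, eps=1e-8):
    """Cosine similarity between matching rows of a and b, one value per token."""
    num = np.sum(a * b, axis=-1)
    den = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1)
    return num / np.maximum(den, eps)

def relative_l2(test, ref, eps=1e-8):
    """||test - ref|| / ||ref|| over the whole tensor."""
    return np.linalg.norm(test - ref) / max(np.linalg.norm(ref), eps)
```

The two metrics are complementary: an output that is an exact multiple c of the reference has cosine similarity 1 but a relative L2 of |c - 1|, so a relative L2 above 1 is compatible with high directional agreement.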

Test 3: Single-Sequence Overfit (RTX 3090, CUDA, bf16)

Result: PASS. KL loss drops from 544 to 0.43 within 100 steps, and exact token match reaches 100% by step 80.

Step  10: KL=544.00, exact_match=10%
Step  50: KL=184.00, exact_match=40%
Step  80: KL=1.28,   exact_match=100%
Step 100: KL=0.43,   exact_match=100%

Conclusion: Green light for full training pipeline.

Test 3b: Single-Sequence Overfit with LoRA rank 128 (9.1% trainable)

Trainable: 75.7M / 828M (9.1%)
Step  90: KL=19.8,  exact_match=100%
Step 200: KL=0.30,  exact_match=100%  (stable, no oscillation)

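Conclusion: LoRA rank 128 is sufficient for single-sequence overfit. At this stage the plan was to proceed with 9.1% trainable params; the later training runs use rank 256 (12.4% trainable).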

Training

Setup

  • Data: 588K tokenized samples across 5 domains (code, math, instruction, scientific, multilingual), truncated to 2048 tokens
  • Optimizer: GaLoreAdamW8bit (full-rank _diff) + AdamW8bit (LoRA), LR=2e-4 with warmup-hold-cosine schedule
  • Hardware: RTX 3090 (24GB), CUDA 13.2, PyTorch bf16
  • Trainable: 106.4M / 858.8M (12.4%) — LoRA rank 256 on DeltaNet, full-rank diff on attention layers
  • FLA: flash-linear-attention + causal-conv1d installed for fast DeltaNet kernels
  • Checkpoint: Saves every 1000 steps with model weights + optimizer/scheduler state. Resume with --resume_from.
  • Diagnostics: Per-layer eval (cos_sim, KL, exact by layer type) runs every eval step.
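A warmup-hold-decay learning-rate schedule of the kind named in the setup and run history can be written as a pure function of the step. The sketch below is illustrative only: the warmup and hold lengths, the `min_lr` floor, and the function name are assumptions, and only the peak LR of 2e-4 comes from the setup above. Both a cosine and a linear decay tail are included.

```python
import math

def lr_at(step, total_steps, peak_lr=2e-4, warmup_steps=200,
          hold_steps=1000, min_lr=0.0, decay="cosine"):
    """Learning rate at `step` for a warmup -> hold -> decay schedule."""
    if step < warmup_steps:
        # Linear warmup from peak_lr / warmup_steps up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    if step < warmup_steps + hold_steps:
        # Hold at the peak.
        return peak_lr
    decay_steps = max(1, total_steps - warmup_steps - hold_steps)
    progress = min(1.0, (step - warmup_steps - hold_steps) / decay_steps)
    if decay == "cosine":
        factor = 0.5 * (1.0 + math.cos(math.pi * progress))
    else:
        factor = 1.0 - progress  # linear decay
    return min_lr + (peak_lr - min_lr) * factor
```

In PyTorch such a function can be wrapped in `torch.optim.lr_scheduler.LambdaLR` by returning `lr_at(step, total_steps) / peak_lr` as the multiplier.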

Training Method: KL Distillation

Each microbatch:

  1. AR forward (no_grad): run full causal model on prefix+block to get target distribution
  2. Prefix forward (with cache): process prefix, save KV cache (attention) and recurrent states (DeltaNet)
  3. Diffusion forward (with grad): run bidirectional diffusion pass on block tokens using shared prefix state
  4. Loss: KL(diffusion_logits || AR_logits) — train diff projections so bidir matches causal
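The loss in step 4 and the exact-match metric reported throughout can be sketched in NumPy as below. The function names are hypothetical, and the direction of the KL term is read literally from the notation KL(diffusion || AR); train.py may use the opposite direction, which is also common in distillation.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def kl_diff_to_ar(diff_logits, ar_logits):
    """Mean over tokens of KL(p_diff || p_ar); logits have shape (tokens, vocab)."""
    log_p = log_softmax(diff_logits)   # diffusion (student) distribution
    log_q = log_softmax(ar_logits)     # AR (teacher) distribution, no gradient
    return float(np.mean(np.sum(np.exp(log_p) * (log_p - log_q), axis=-1)))

def exact_match(diff_logits, ar_logits):
    """Fraction of positions where both passes pick the same argmax token."""
    return float(np.mean(diff_logits.argmax(-1) == ar_logits.argmax(-1)))
```

Note that exact match only compares argmax tokens, so KL can keep falling on the training set while eval exact match stalls or drops, as seen between steps 2500 and 3000.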

Block size K=8 (fixed). Random prefix split per microbatch. LoRA rank 256 with PiSSA init.
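The random prefix split can be sketched as below. This is an assumption about how the split might be drawn (uniformly over all positions that leave room for a full block); the function name and the minimum prefix length of 1 are hypothetical.

```python
import random

def sample_prefix_split(seq_len, block_size=8, rng=random):
    """Pick where the K-token diffusion block starts inside a sequence.

    Returns (prefix_len, block_end): tokens[:prefix_len] is the prefix and
    tokens[prefix_len:block_end] is the block trained with the KL loss.
    """
    if seq_len <= block_size:
        raise ValueError("sequence too short for a prefix plus one block")
    # randint is inclusive on both ends, so the block always fits.
    prefix_len = rng.randint(1, seq_len - block_size)
    return prefix_len, prefix_len + block_size
```

Calling the function several times per sequence would be one way to raise the number of supervised blocks per step, at the cost of one extra diffusion forward per block.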

Run History Summary

| Run | Key Change | Best Eval Exact | Status |
|---|---|---|---|
| 5-9 | Various configs, bugs present | ~38% | Plateaued |
| 10 | All 6 init fixes | ~48.7% | New ceiling |
| 11 | warmup-hold-cosine LR, K=8 | 51.75% | Climbing |
| 12 | PiSSA + rsLoRA + GaLore | 52.8% | Checkpoint bug |
| 13 | Checkpoint fix, per-layer diagnostics | 55.6% | Architecture ceiling |

Full history: see docs/TRAINING_LOG.md

Open Questions

  1. How to break 55%: Second-order delta rule correction? StateQuery + BlockAttention? Higher LoRA rank?
  2. Anti-correlated DeltaNet layers: L6/L10 have cos_sim=-0.083. Training instability or fundamental issue?
  3. Supervision density: Paper uses 256 blocks/step, we use 1. Would more blocks help despite K=8 being optimal?
  4. Consensus for DeltaNet: Chunk-processed consensus is approximate. May need sequential verification.

References

Model

  • Base: Qwen/Qwen3.5-0.8B (1.6GB bf16, text-only weights)
  • Local path: ~/.cache/huggingface/hub/models--Qwen--Qwen3.5-0.8B/snapshots/...
  • Training target: single 24GB GPU (RTX 3090 on sleepy-wsl)