Python 97.5%
Shell 2.5%

Repository files (latest commit first)
Filename	Latest commit message	Latest commit date
Kaloyan Nikolov d838eee9a6 Update docs for Run 14: dual-scan architecture, conv1d fix, DirDrop, multi-block training		2026-05-17 22:29:39 +02:00
docs	Update docs for Run 14: dual-scan architecture, conv1d fix, DirDrop, multi-block training	2026-05-17 22:29:39 +02:00
.gitignore	Initial project: bidirectional delta rule kernel + validation tests	2026-05-16 12:47:57 +02:00
AGENTS.md	Add ONBOARDING.md, move docs to docs/, update references	2026-05-17 11:50:55 +02:00
analyze_training.py	Fix 4 critical training bugs + add training analysis	2026-05-16 23:04:52 +02:00
bidirectional_delta.py	Initial project: bidirectional delta rule kernel + validation tests	2026-05-16 12:47:57 +02:00
debug_load.py	All 3 validation tests pass	2026-05-16 13:08:28 +02:00
model.py	fix PiSSA: correct SVD→LoRA shape mapping	2026-05-17 09:47:38 +02:00
ONBOARDING.md	Update docs for Run 14: dual-scan architecture, conv1d fix, DirDrop, multi-block training	2026-05-17 22:29:39 +02:00
README.md	Run 13: 55.6% eval exact (architecture ceiling), per-layer diagnostics, checkpoint fix	2026-05-17 16:37:38 +02:00
sweep.sh	Replace OneCycleLR with warmup-hold-cosine decay schedule	2026-05-17 08:21:38 +02:00
test_load.py	All 3 validation tests pass	2026-05-16 13:08:28 +02:00
train.py	Run 13: 55.6% eval exact (architecture ceiling), per-layer diagnostics, checkpoint fix	2026-05-17 16:37:38 +02:00
validate_consensus.py	Consensus test: frozen bidir produces 0% logit-level match (expected)	2026-05-16 13:50:17 +02:00
validate_rank.py	Training pipeline: FLA fast path, 8-bit Adam, contiguous tensor dataset	2026-05-16 17:10:49 +02:00
validate_test1.py	Initial project: bidirectional delta rule kernel + validation tests	2026-05-16 12:47:57 +02:00
validate_test2.py	All 3 validation tests pass	2026-05-16 13:08:28 +02:00
validate_test3.py	LoRA rank 128 validated: 9.1% trainable, 100% exact match by step 90	2026-05-16 13:24:27 +02:00

README.md

Orthrus-Qwen3.5: Full-Coverage Parallel Token Generation

Adapting the Orthrus framework to Qwen3.5-0.8B's hybrid architecture (18 GatedDeltaNet + 6 GatedAttention layers).

New here? Read ONBOARDING.md first.

Status: Architecture Ceiling Reached

Phase	Status
Validation tests (1, 3, 3b, consensus, rank analysis)	PASS
Production training pipeline (KL distillation, GaLore, diagnostics)	Complete
Training (Run 13)	Peaked at 55.6% eval exact — overfitting after step 2500
Per-layer diagnostics	Done — both DeltaNet and attention layers underperform
Inference with diffusion + consensus	Blocked — need >70% exact for meaningful TPF

Peak Training Performance (Run 13)

Step   2500: KL= 0.84  exact=55.8%  eval=55.6%  ← peak eval
Step   3000: KL= 0.82  exact=56.5%  eval=52.9%  ← overfitting

Per-Layer Diagnostics (Step 2500)

Layer Type	Cos Sim	KL/layer	Exact/layer
DeltaNet (18L)	0.219	0.020	9.3%
Attention (6L)	0.329	0.047	14.7%

Both layer types are far from converged. Worst DeltaNet layer (L10) has cos_sim=-0.083 (anti-correlated with AR). The 55% ceiling is likely a fundamental limitation of the first-order bidirectional delta rule approximation.

Architecture

The Problem

Qwen3.5-0.8B uses a hybrid architecture: 18 DeltaNet (linear attention, recurrent state) + 6 standard attention layers. Standard Orthrus only works on attention layers (shared KV cache). DeltaNet layers have no KV cache — they use a recurrent state matrix S_t of shape (B, 16, 128, 128).
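As a rough sense of scale, the arithmetic below works out the size of that recurrent state. It is a sketch under assumptions: one sequence (B=1), bf16 storage at 2 bytes per element (fp32 storage would double the numbers), and the helper name `state_bytes` is made up for illustration. Unlike a KV cache, the result does not depend on prefix length.

```python
def state_bytes(shape=(16, 128, 128), bytes_per_elem=2, n_layers=18):
    """Bytes held by the recurrent state: per layer and across all DeltaNet layers.

    shape is S_t without the batch dimension, as given in the README.
    bytes_per_elem=2 assumes bf16 (an assumption, not confirmed by the repo).
    """
    elems = 1
    for dim in shape:
        elems *= dim
    per_layer = elems * bytes_per_elem
    return per_layer, per_layer * n_layers


per_layer, total = state_bytes()
print(f"per layer: {per_layer / 2**20:.2f} MiB, 18 layers: {total / 2**20:.2f} MiB")
```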

Partial coverage (attention-only) has a hard ceiling of ~1.6x speedup because DeltaNet layers are 76% of compute. We need full 24-layer coverage.

Bidirectional Delta Rule (Core Innovation)

For DeltaNet layers, we replace the causal chunk kernel with a bidirectional computation during the diffusion pass:

1. Decay products:  G_t = Π_{j=t+1}^{K} g_j    (suffix product, computable with a parallel scan)
2. Surprise:        δ_t = β_t * (v_t - S_prefix @ k_t)
3. Aggregate:       ΔS = Σ_t G_t * k_t ⊗ δ_t    (decay-weighted, all tokens simultaneously)
4. Bidir state:     S_bi = (Π_{j=1}^{K} g_j) * S_prefix + ΔS
5. Output:          o_t = S_bi @ q_t             (each token sees ALL other tokens' corrections)

Token t's output depends on token t+1's key-value binding through ΔS. This is structurally different from causal recurrence, in the same way that Orthrus attention is bidirectional within the block.
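The five steps can be sketched in NumPy for a single head. This is an illustration, not the kernel in bidirectional_delta.py. Assumptions: the state is stored as (d_v, d_k) so that `S @ k` predicts a value (which makes each correction the rank-1 matrix δ_t k_tᵀ), the gates g_t are scalars per token, and all function and argument names are hypothetical.

```python
import numpy as np


def bidirectional_delta_block(S_prefix, q, k, v, beta, g):
    """First-order bidirectional delta rule over one block of K tokens.

    S_prefix: (d_v, d_k) recurrent state after the prefix
    q, k:     (K, d_k) queries and keys for the block
    v:        (K, d_v) values
    beta, g:  (K,) write strengths and scalar decay gates
    """
    # 1. Suffix decay products G_t = prod_{j>t} g_j, with G_K = 1 (empty product).
    G = np.append(np.cumprod(g[::-1])[::-1][1:], 1.0)
    # 2. Surprise of every token, measured against the shared prefix state.
    delta = beta[:, None] * (v - k @ S_prefix.T)        # (K, d_v)
    # 3. Decay-weighted sum of rank-1 corrections, all tokens at once.
    dS = np.einsum("t,tv,tk->vk", G, delta, k)          # (d_v, d_k)
    # 4. Bidirectional state: fully decayed prefix state plus the aggregate.
    S_bi = np.prod(g) * S_prefix + dS
    # 5. Every query reads the same state, so each token sees all corrections.
    return q @ S_bi.T, S_bi


rng = np.random.default_rng(0)
K, d = 4, 8
keys = np.eye(d)[:K]                  # orthonormal keys
vals = rng.normal(size=(K, d))
S0 = rng.normal(size=(d, d))
out, _ = bidirectional_delta_block(S0, keys, keys, vals, np.ones(K), np.ones(K))
print(np.allclose(out, vals))
```

With orthonormal keys, β=1 and no decay, the corrections do not interact, so querying with k_t returns v_t and the check should print True. Correlated keys are where the ignored cross terms between updates start to matter.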

Known limitation: The first-order approximation (ignoring non-commutativity of state updates) causes error that compounds across 18 DeltaNet layers. Per-layer cos_sim=0.22, with anti-correlated layers. This limits exact match to ~55%.

Per-Layer Design

	Full Attention (6 layers)	DeltaNet (18 layers)
AR pass	Standard, builds KV cache	Standard, builds recurrent + conv state
Shared state	KV cache (K_AR, V_AR)	Recurrent state S_t + conv state
Diff projections	Full-rank, init from AR weights	LoRA rank 256 (PiSSA init), scaling=32/sqrt(rank) (rsLoRA)
Diffusion pass	Attend to [K_AR || K_diff] bidir	Bidirectional delta rule with S_prefix
Trainable params	~7.3M/layer = 44M	~1.8M/layer = 32M
Total trainable	~106M / 859M (12.4%)
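The "Diff projections" row for DeltaNet combines two ideas: PiSSA (initialise the LoRA pair from the top singular directions of the frozen weight) and rsLoRA (scale the update by a constant over sqrt(rank)). The NumPy sketch below shows one way these fit together. It is not the code in model.py: the function names, the even split of singular values between A and B, and folding the scaling into the factors are assumptions.

```python
import math
import numpy as np


def rslora_scaling(rank, alpha=32.0):
    # README: scaling = 32 / sqrt(rank); alpha=32 is read off that formula.
    return alpha / math.sqrt(rank)


def pissa_init(W, rank, alpha=32.0):
    """Split W (out, in) into a frozen residual plus a trainable rank-r pair.

    Returns (W_res, A, B) such that W_res + scaling * (B @ A) equals W at init.
    """
    s = rslora_scaling(rank, alpha)
    U, sigma, Vt = np.linalg.svd(W, full_matrices=False)
    # Share each singular value between A and B, pre-dividing by the scaling
    # so the scaled product reproduces the principal component exactly.
    root = np.sqrt(sigma[:rank] / s)
    B = U[:, :rank] * root                 # (out, r)
    A = root[:, None] * Vt[:rank]          # (r, in)
    W_res = W - s * (B @ A)                # frozen remainder
    return W_res, A, B


print(rslora_scaling(256))  # 32 / sqrt(256) = 2.0
```

At initialisation the adapted layer computes the same function as the original weight, and training then moves the principal directions while the residual stays frozen.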

Consensus Mechanism

The AR (causal, sequential) path validates the proposals made by the bidirectional diffusion pass. For attention layers the check is exact via the KV cache. For DeltaNet layers it is chunk-processed with frozen AR weights (approximate but conservative). Accepting or rejecting proposals this way is what preserves output quality.
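The README does not spell out the acceptance rule, so the sketch below uses a common one as a stand-in: greedy prefix matching, as in speculative decoding. Both function names are hypothetical. The second helper estimates how many tokens one verification step emits if each position matches independently with probability p, which is itself a simplifying assumption.

```python
def accept_block(proposed, verified):
    """Keep proposed tokens up to the first disagreement with the AR check.

    proposed: token ids from the bidirectional diffusion pass (length K)
    verified: AR argmax at the same positions, given the proposed tokens
              as context
    Returns the tokens to emit for this step.
    """
    accepted = []
    for p, a in zip(proposed, verified):
        if p != a:
            accepted.append(a)  # the AR token replaces the first mismatch
            break
        accepted.append(p)
    return accepted


def expected_emitted(p_match, K=8):
    """Expected tokens emitted per step under independent per-token matches."""
    return sum(p_match ** i for i in range(K))


print(accept_block([5, 7, 9, 2], [5, 7, 3, 2]))  # [5, 7, 3]
```

Under this rule every step emits at least one token, and the expected count grows quickly with the per-token match rate, which gives a rough feel for why eval exact match gates inference speedup.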

Validation Results

Test 1: Frozen Bidirectional vs Causal DeltaNet Output

Random weights: cos_sim=0.037 (expected: chaotic dynamics with random weights)
Real Qwen3.5-0.8B weights: cos_sim=0.904, relative_L2=1.43

Per-token pattern confirms bidirectional information flow:

Last token: cos_sim=0.995 (near identical — both paths have full context by last token)
Earlier tokens: cos_sim≈0.0-0.1 (bidir has "future context" that causal doesn't)

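Interpretation: With real trained weights, bidir and causal outputs are directionally aligned (cos_sim=0.904). The relative L2 above 1 means the difference between the two outputs is larger in norm than the causal output itself. At this stage that was read as mostly a magnitude (scaling) mismatch, and LoRA was expected to bridge the remaining gap.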
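To make the two metrics concrete, here is a small pure-Python sketch. It assumes relative_L2 means ||bidir - causal|| / ||causal||, which the README does not define, and the vectors are made up. It shows that a pure rescaling leaves cos_sim at 1 while pushing relative L2 above 1.

```python
import math


def cos_sim(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def relative_l2(pred, ref):
    """Norm of the error divided by the norm of the reference."""
    err = math.sqrt(sum((p - r) ** 2 for p, r in zip(pred, ref)))
    return err / math.sqrt(sum(r * r for r in ref))


causal = [1.0, -2.0, 0.5, 3.0]
bidir = [2.43 * x for x in causal]   # same direction, 2.43x the norm

print(round(cos_sim(bidir, causal), 3), round(relative_l2(bidir, causal), 3))
```

The printed values are approximately 1 and 1.43. A measured cos_sim of 0.904 is below 1, so scaling alone would not account for the whole gap.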

Test 3: Single-Sequence Overfit (RTX 3090, CUDA, bf16)

Result: PASS. KL loss drops from 544 to 0.43 by step 100, with 100% exact token match from step 80.

Step  10: KL=544.00, exact_match=10%
Step  50: KL=184.00, exact_match=40%
Step  80: KL=1.28,   exact_match=100%
Step 100: KL=0.43,   exact_match=100%

Conclusion: Green light for full training pipeline.

Test 3b: Single-Sequence Overfit with LoRA rank 128 (9.1% trainable)

Trainable: 75.7M / 828M (9.1%)
Step  90: KL=19.8,  exact_match=100%
Step 200: KL=0.30,  exact_match=100%  (stable, no oscillation)

Conclusion: LoRA rank 128 is sufficient to overfit a single sequence with 9.1% trainable params. (The later training runs listed under Setup use rank 256, 12.4% trainable.)

Training

Setup

Data: 588K tokenized samples across 5 domains (code, math, instruction, scientific, multilingual), truncated to 2048 tokens
Optimizer: GaLoreAdamW8bit (full-rank _diff) + AdamW8bit (LoRA), LR=2e-4 with warmup-hold-cosine schedule
Hardware: RTX 3090 (24GB), CUDA 13.2, PyTorch bf16
Trainable: 106.4M / 858.8M (12.4%) — LoRA rank 256 on DeltaNet, full-rank diff on attention layers
FLA: flash-linear-attention + causal-conv1d installed for fast DeltaNet kernels
Checkpoint: Saves every 1000 steps with model weights + optimizer/scheduler state. Resume with --resume_from.
Diagnostics: Per-layer eval (cos_sim, KL, exact by layer type) runs every eval step.
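The run history names a warmup-hold-cosine learning-rate schedule (Run 11). A minimal sketch of that shape follows. Only the peak LR of 2e-4 comes from the setup above; the warmup and hold fractions, the floor `min_lr`, and the function name are placeholders, not the values used in train.py.

```python
import math


def warmup_hold_cosine(step, total_steps, peak_lr=2e-4,
                       warmup_frac=0.05, hold_frac=0.25, min_lr=0.0):
    """Linear warmup to peak_lr, hold at peak, then cosine decay to min_lr.

    warmup_frac, hold_frac and min_lr are illustrative values.
    """
    warmup = int(total_steps * warmup_frac)
    hold_end = warmup + int(total_steps * hold_frac)
    if step < warmup:
        return peak_lr * (step + 1) / warmup
    if step < hold_end:
        return peak_lr
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min(1.0, (step - hold_end) / max(1, total_steps - hold_end))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for s in (0, 49, 200, 650, 1000):
    print(s, warmup_hold_cosine(s, total_steps=1000))
```

Compared with a one-cycle schedule, the hold phase keeps the model at peak LR for a fixed stretch before decay begins.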

Training Method: KL Distillation

Each microbatch:

AR forward (no_grad): run full causal model on prefix+block to get target distribution
Prefix forward (with cache): process prefix, save KV cache (attention) and recurrent states (DeltaNet)
Diffusion forward (with grad): run bidirectional diffusion pass on block tokens using shared prefix state
Loss: KL(diffusion_logits || AR_logits) — train diff projections so bidir matches causal

Block size K=8 (fixed). Random prefix split per microbatch. LoRA rank 256 with PiSSA init.
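The loss and the exact-match metric for one block can be sketched in pure Python. This is an illustration, not train.py: it reads the loss literally as KL(P_diff || P_AR) per position, treats "exact" as argmax agreement between diffusion and AR logits, and uses hypothetical function names. The last helper shows one way to draw a random prefix split that leaves room for a full K-token block.

```python
import math
import random


def log_softmax(logits):
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def kl_div(logits_p, logits_q):
    """KL(P || Q) for one position, computed from raw logits."""
    lp, lq = log_softmax(logits_p), log_softmax(logits_q)
    return sum(math.exp(a) * (a - b) for a, b in zip(lp, lq))


def argmax(row):
    return max(range(len(row)), key=row.__getitem__)


def block_metrics(diff_logits, ar_logits):
    """Mean KL(diffusion || AR) and exact-match rate over K block positions."""
    n = len(ar_logits)
    kl = sum(kl_div(d, a) for d, a in zip(diff_logits, ar_logits)) / n
    exact = sum(argmax(d) == argmax(a) for d, a in zip(diff_logits, ar_logits)) / n
    return kl, exact


def sample_split(seq_len, K=8, rng=random):
    """Random prefix length such that a full K-token block follows it."""
    prefix_len = rng.randint(1, seq_len - K)   # randint is inclusive on both ends
    return prefix_len, prefix_len + K


ar = [[2.0, 0.1, -1.0], [0.0, 3.0, 0.5]]
diff = [[1.5, 0.3, -0.8], [2.0, 1.0, 0.0]]
print(block_metrics(diff, ar))
```

In the toy example the first position agrees on the argmax and the second does not, so the exact-match rate is 0.5 while the KL term still penalises both positions.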

Run History Summary

Run	Key Change	Best Eval Exact	Status
5-9	Various configs, bugs present	~38%	Plateaued
10	All 6 init fixes	~48.7%	New ceiling
11	warmup-hold-cosine LR, K=8	51.75%	Climbing
12	PiSSA + rsLoRA + GaLore	52.8%	Checkpoint bug
13	Checkpoint fix, per-layer diagnostics	55.6%	Architecture ceiling

Full history: see docs/TRAINING_LOG.md

Open Questions

How to break 55%: Second-order delta rule correction? StateQuery + BlockAttention? Higher LoRA rank?
Anti-correlated DeltaNet layers: L6/L10 have cos_sim=-0.083. Training instability or fundamental issue?
Supervision density: The paper uses 256 blocks/step; we use 1. Would more blocks help despite K=8 being optimal?
Consensus for DeltaNet: Chunk-processed consensus is approximate. May need sequential verification.

References

Orthrus paper: arXiv:2605.12825 → summary
GatedDeltaNet: arXiv:2412.06464 → summary
Qwen3: arXiv:2505.09388 → summary
Reference implementation: https://github.com/chiennv2000/orthrus
Diagnostic report: docs/REPORT.md
Subagent analysis: docs/SUBAGENT_ANALYSIS.md

Model

Base: Qwen/Qwen3.5-0.8B (1.6GB bf16, text-only weights)
Local path: ~/.cache/huggingface/hub/models--Qwen--Qwen3.5-0.8B/snapshots/...
Training target: single 24GB GPU (RTX 3090 on sleepy-wsl)