45c3aad453
- Add Claude Opus 4.7, Kimi K2.6, GLM-5.1 to existing GLM-5, Qwen3-6, MiniMax-M2.7 - Add 5 new challenges: flash attention fwd/bwd, beam search, DFlash, ternary training - Rewrite README with TL;DR rankings, grade matrix, and DeepSeek V4 Pro attribution - Add analysis/ folder with cross-model comparisons and per-challenge deep dives - Add deploy_challenges.sh script - Expand .gitignore to exclude Python envs, ML weights, and build artifacts
4.0 KiB
4.0 KiB
Implement a correct batched beam search decoder for autoregressive generation in pure NumPy.
Simulate a minimal language model:
- vocab_size = 1000
- d_model = 64
- Use random embeddings + 1 transformer block with random weights (correctness depends on beam search logic, not model quality)
Requirements:
-
MULTI-BATCH SUPPORT:
- Accept prompt_token_ids: list[list[int]] — one prompt per batch item
- beam_width K per batch item
- Each batch item's beams are INDEPENDENT (no cross-contamination between different prompts)
-
PER-STEP BEAM EXPANSION:
- For each UNFINISHED beam, compute logits for the next token
- Take top-(2*K) candidates per beam (not just top-K, to preserve diversity)
- Compute total logprob = accumulated_logprob + new_logprob
- Pool all candidates across all beams, sort globally by total logprob (most negative = worst), take top K
- These K become the active beams for the next step
-
LENGTH PENALTY (for ranking only, not for accumulated score):
- adjusted_score = accumulated_logprob / (generated_length ^ alpha)
- alpha is a hyperparameter (default 0.6)
- The accumulated logprob is NEVER modified by length penalty — it stays as the raw sum of logprobs
- Length penalty is used ONLY when comparing beams for ranking/selection
- generated_length = number of NEW tokens generated (NOT including prompt tokens — the prompt does not count toward length penalty)
-
EOS HANDLING (the critical part — get this right):
- When a beam produces token_id == eos_token:
- Mark that beam as FINISHED
- Freeze its accumulated_logprob and generated_length
- The beam STAYS in the pool — it competes with unfinished beams
- At each step, the top-K selection pool includes BOTH: (a) all FINISHED beams, and (b) all candidates from expanding UNFINISHED beams
- If all K beams in a batch item are finished, that item stops expanding
- If all batch items have K finished beams, stop early
- IMPORTANT: Do NOT remove finished beams from the pool. They must compete against unfinished beams using their length-penalized scores. If you remove them, a short, high-confidence sequence that hit EOS early will be wrongly discarded in favor of a longer, lower-confidence sequence.
- When a beam produces token_id == eos_token:
-
RETURN:
- For each batch item: a list of K sequences (generated token IDs only, NOT including prompt tokens), sorted by length-penalized score descending (best/highest score first)
- Each sequence's score = accumulated_logprob / (len(seq) ^ alpha)
-
EDGE CASES:
- If max_new_tokens is reached before K beams finish, return the best K (finished + unfinished) by length-penalized score
- A batch item may end with fewer than K finished beams (if max_new_tokens hit). Return whatever is available.
- Log-space accumulation: keep everything in log space; avoid unnecessary exp/log conversions. Don't let very negative numbers cause underflow.
Deliver:
- A class or function
batched_beam_search(prompts, beam_width, max_new_tokens, alpha, eos_token_id)that returns the K best sequences per batch item - Test 1: Single batch item, K=1, short prompt, alpha=0 → verify this behaves identically to greedy decoding (always pick argmax)
- Test 2: batch=2, beam_width=3, different prompt lengths [3, 5], alpha=0.6 → verify per-batch independence: beams from prompt 0 never interact with beams from prompt 1
- Test 3: THE EOS RETENTION TEST. Monkey-patch or modify the model's forward pass so that at step 1, one beam produces EOS with total logprob=-3.0 while another beam continues with logprob=-4.0. At step 2, the continuing beam has logprob=-5.0. Verify that the EOS beam with score=-3.0 is correctly returned as the winner (even though it stopped early). If you had removed EOS beams from the pool, the unfinished beam with score=-5.0 would wrongly win. This test distinguishes correct from buggy implementations.
- Comments explaining why finished beams must NOT be removed from the pool
Use only NumPy. No PyTorch, JAX, TensorFlow, or autograd.