Implement a correct batched beam search decoder for autoregressive generation in pure NumPy.

Simulate a minimal language model:

  • vocab_size = 1000
  • d_model = 64
  • Use random embeddings + one transformer block with random weights (correctness depends on the beam search logic, not model quality; a minimal sketch follows this list)
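
A minimal sketch of the toy model, for illustration only: the weight shapes, initialization scale, and the helper names `make_toy_model` / `next_token_logits` are assumptions, not part of the spec. Any random single-block model that maps a token sequence to next-token logits works.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def make_toy_model(vocab_size=1000, d_model=64, seed=0):
    """Random embeddings + one attention/MLP block; output quality is irrelevant here."""
    rng = np.random.default_rng(seed)
    p = {name: rng.normal(0.0, 0.02, shape) for name, shape in {
        "emb":  (vocab_size, d_model),
        "wq":   (d_model, d_model),
        "wk":   (d_model, d_model),
        "wv":   (d_model, d_model),
        "w1":   (d_model, 4 * d_model),
        "w2":   (4 * d_model, d_model),
        "head": (d_model, vocab_size),
    }.items()}

    def next_token_logits(token_ids):
        """token_ids: sequence of ints (prompt + generated so far) -> logits for the next token."""
        x = p["emb"][np.asarray(token_ids)]                  # (T, d_model)
        q, k, v = x @ p["wq"], x @ p["wk"], x @ p["wv"]
        att = softmax(q @ k.T / np.sqrt(d_model), axis=-1)   # single-head attention (no mask needed:
        x = x + att @ v                                      # only the last position's logits are used)
        x = x + np.maximum(x @ p["w1"], 0.0) @ p["w2"]       # ReLU MLP + residual
        return x[-1] @ p["head"]                             # logits over the vocabulary
    return next_token_logits
```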

Requirements:

  1. MULTI-BATCH SUPPORT:

    • Accept prompt_token_ids: list[list[int]] — one prompt per batch item
    • beam_width K per batch item
    • Each batch item's beams are INDEPENDENT (no cross-contamination between different prompts)
  2. PER-STEP BEAM EXPANSION:

    • For each UNFINISHED beam, compute logits for the next token
    • Take top-(2*K) candidates per beam (not just top-K, to preserve diversity)
    • Compute total logprob = accumulated_logprob + new_logprob
    • Pool all candidates across all beams, sort globally by total logprob (most negative = worst), take top K
    • These K become the active beams for the next step
  3. LENGTH PENALTY (for ranking only, not for accumulated score):

    • adjusted_score = accumulated_logprob / (generated_length ^ alpha)
    • alpha is a hyperparameter (default 0.6)
    • The accumulated logprob is NEVER modified by length penalty — it stays as the raw sum of logprobs
    • Length penalty is used ONLY when comparing beams for ranking/selection
    • generated_length = number of NEW tokens generated (NOT including prompt tokens — the prompt does not count toward length penalty)
  4. EOS HANDLING (the critical part — get this right):

    • When a beam produces token_id == eos_token:
      • Mark that beam as FINISHED
      • Freeze its accumulated_logprob and generated_length
      • The beam STAYS in the pool — it competes with unfinished beams
    • At each step, the top-K selection pool includes BOTH: (a) all FINISHED beams, and (b) all candidates from expanding UNFINISHED beams
    • If all K beams in a batch item are finished, that item stops expanding
    • If all batch items have K finished beams, stop early
    • IMPORTANT: Do NOT remove finished beams from the pool. They must compete against unfinished beams using their length-penalized scores. If you remove them, a short, high-confidence sequence that hit EOS early will be wrongly discarded in favor of a longer, lower-confidence sequence (see the sketch after this requirements list).
  5. RETURN:

    • For each batch item: a list of K sequences (generated token IDs only, NOT including prompt tokens), sorted by length-penalized score descending (best/highest score first)
    • Each sequence's score = accumulated_logprob / (len(seq) ^ alpha)
  6. EDGE CASES:

    • If max_new_tokens is reached before K beams finish, return the best K (finished + unfinished) by length-penalized score
    • A batch item may end with fewer than K finished beams (if max_new_tokens hit). Return whatever is available.
    • Log-space accumulation: keep everything in log space; avoid unnecessary exp/log conversions. Don't let very negative numbers cause underflow.
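
A minimal sketch of the per-step selection loop for a single batch item. The `Beam` dataclass, the helper names, and ranking the per-step pool by the length-penalized score from requirement 3 are this sketch's own choices, not mandated; the point it illustrates is that the raw accumulated log-prob is never modified, the penalty is applied only at comparison time, and finished beams stay in the pool exactly as requirement 4 demands.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class Beam:
    tokens: list                 # generated token ids only (prompt excluded)
    logprob: float = 0.0         # raw sum of token log-probs; never length-penalized
    finished: bool = False

def ranking_score(beam, alpha):
    """Length-penalized score used ONLY for comparisons; logprob itself stays raw."""
    return beam.logprob / (max(len(beam.tokens), 1) ** alpha)

def expand_one_item(beams, next_token_logits, prompt, K, alpha, eos_token_id):
    """One expansion step for one batch item: finished beams remain in the pool and compete."""
    pool = [b for b in beams if b.finished]                  # do NOT drop finished beams
    for b in beams:
        if b.finished:
            continue
        logits = next_token_logits(list(prompt) + b.tokens)
        logprobs = logits - np.logaddexp.reduce(logits)      # log-softmax; stays in log space
        top = np.argpartition(-logprobs, 2 * K)[: 2 * K]     # top-(2K) candidates per beam
        for tok in top:
            pool.append(Beam(tokens=b.tokens + [int(tok)],
                             logprob=b.logprob + float(logprobs[tok]),
                             finished=(int(tok) == eos_token_id)))
    pool.sort(key=lambda b: ranking_score(b, alpha), reverse=True)   # global selection
    return pool[:K]                                          # survivors (finished or not)
```

Looping this independently per batch item keeps beams from different prompts separate (requirement 1); stopping when every survivor is finished, or when max_new_tokens is reached, covers requirement 6.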

Deliver:

  • A class or function batched_beam_search(prompts, beam_width, max_new_tokens, alpha, eos_token_id) that returns the K best sequences per batch item
  • Test 1: Single batch item, K=1, short prompt, alpha=0 → verify this behaves identically to greedy decoding (always pick argmax)
  • Test 2: batch=2, beam_width=3, different prompt lengths [3, 5], alpha=0.6 → verify per-batch independence: beams from prompt 0 never interact with beams from prompt 1
  • Test 3: THE EOS RETENTION TEST. Monkey-patch or modify the model's forward pass so that at step 1, one beam produces EOS with total logprob=-3.0 while another beam continues with logprob=-4.0. At step 2, the continuing beam has logprob=-5.0. Verify that the EOS beam with score=-3.0 is correctly returned as the winner (even though it stopped early). If you had removed EOS beams from the pool, the unfinished beam with score=-5.0 would wrongly win. This test distinguishes correct from buggy implementations (a possible harness sketch follows this list).
  • Comments explaining why finished beams must NOT be removed from the pool
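
One possible shape for the Test 3 harness (the token ids, the 3-token prompt and beam_width=2 assumptions, and the `scripted_logits` helper are illustrative, not required): replace the model's forward pass with a scripted function whose log-softmax reproduces the log-probs in the scenario, then check which beam the decoder under test returns first.

```python
import numpy as np

VOCAB, EOS, CONT = 1000, 2, 7          # illustrative token ids (assumed, not mandated)

def scripted_logits(token_ids, prompt_len=3):
    """Drop-in forward pass for Test 3, assuming a 3-token prompt."""
    step = len(token_ids) - prompt_len          # number of tokens generated so far
    probs = np.full(VOCAB, 1e-12)
    if step == 0:               # step 1: EOS gets logprob -3.0, the continuation -4.0
        probs[EOS], probs[CONT] = np.exp(-3.0), np.exp(-4.0)
    else:                       # step 2: continuing beam adds another -1.0 (total -5.0)
        probs[CONT] = np.exp(-1.0)
    probs[0] = 1.0 - probs[1:].sum()            # dump leftover mass on token 0 so probs sum to 1
    return np.log(probs)        # log-softmax of these "logits" reproduces the values above

# Expected outcome with alpha = 0.6:
#   finished EOS beam:  logprob -3.0, length 1 -> score -3.0 / 1**0.6  = -3.00
#   continuing beam:    logprob -5.0, length 2 -> score -5.0 / 2**0.6 ~= -3.30
# so the early-stopping EOS beam must be returned first. If finished beams were
# dropped from the pool, the -5.0 beam would wrongly win.
```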

Use only NumPy. No PyTorch, JAX, TensorFlow, or autograd.