generate() doesn't use KV cache — re-runs full forward pass every token #12

Open
opened 2026-05-09 07:30:22 +02:00 by sleepy · 0 comments
Owner

Problem

TergentModel.generate() (model.py:165-178) re-runs the entire forward pass for every generated token:

for _ in range(max_new_tokens):
    seq = input_ids[:, -self.config.max_seq_len:]
    out = self.forward(seq)  # re-runs ALL layers from scratch

With 20 layers and ternary weights, this is extremely slow. TransformerBlock.forward() already returns a KV cache (kv), but generate() discards it.

Impact

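  • The number of token positions pushed through the layers grows as O(n²) in the sequence length instead of O(n), because step t reprocesses all t tokens seen so far (attention cost grows faster still)
  • For 128 new tokens, this means 128 full forward passes over the growing sequence, instead of one prefill pass followed by single-token steps that reuse the cached KV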
  • Makes inference impractical for long sequences
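
To make the scaling concrete, here is a rough count of how many token positions go through the layer stack with and without a cache. This is only a back-of-the-envelope sketch: `positions_processed` is a hypothetical helper, the prompt length of 16 is an illustrative value rather than a measurement, and it assumes the sequence never hits the `max_seq_len` truncation. It also ignores the extra attention cost per position.

```python
def positions_processed(prompt_len, new_tokens, use_cache):
    """Count token positions pushed through the layer stack during generation."""
    total = 0
    seq_len = prompt_len
    for step in range(new_tokens):
        if use_cache and step > 0:
            total += 1        # cached: only the newest token is processed
        else:
            total += seq_len  # uncached (or prefill): the whole sequence is processed
        seq_len += 1
    return total


prompt_len, new_tokens = 16, 128  # illustrative values (assumption)
naive = positions_processed(prompt_len, new_tokens, use_cache=False)
cached = positions_processed(prompt_len, new_tokens, use_cache=True)
print(naive, cached)  # 10176 143
```

Under these assumptions the uncached loop does roughly 70× more per-position work, and the gap widens as the sequence gets longer.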

Action needed

  • Thread past_kv through generate() so each layer reuses its cached KV
  • Update forward() to accept and return per-layer KV caches
  • Consider adding a separate cached generation method alongside generate() that handles caching explicitly
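
The steps above could look roughly like the following. This is a minimal sketch on a toy model, not the actual Tergent code: `ToyBlock`, `ToyModel`, `past_kvs`, `generate_naive`, and `generate_cached` are hypothetical names, the attention is plain single-head with learned absolute positions, and it assumes prompt plus new tokens fit within `max_seq_len` (a sliding window would need the cache trimmed as well). Decoding is greedy so the two paths can be compared directly.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyBlock(nn.Module):
    """Single-head causal attention block that accepts and returns a KV cache."""

    def __init__(self, dim):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, past_kv=None):
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        if past_kv is not None:
            # Reuse cached keys/values; only the new positions were projected above.
            k = torch.cat([past_kv[0], k], dim=1)
            v = torch.cat([past_kv[1], v], dim=1)
        t_new, t_all = q.size(1), k.size(1)
        scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
        # Query i sits at absolute position (t_all - t_new + i); mask out later keys.
        mask = torch.ones(t_new, t_all, device=x.device).tril(diagonal=t_all - t_new).bool()
        scores = scores.masked_fill(~mask, float("-inf"))
        out = x + self.proj(F.softmax(scores, dim=-1) @ v)
        return out, (k, v)


class ToyModel(nn.Module):
    def __init__(self, vocab=50, dim=16, n_layers=2, max_seq_len=64):
        super().__init__()
        self.max_seq_len = max_seq_len
        self.tok = nn.Embedding(vocab, dim)
        self.pos = nn.Embedding(max_seq_len, dim)
        self.blocks = nn.ModuleList([ToyBlock(dim) for _ in range(n_layers)])
        self.head = nn.Linear(dim, vocab)

    def forward(self, input_ids, past_kvs=None):
        # Positions of the new tokens start where the cache ends.
        past_len = 0 if past_kvs is None else past_kvs[0][0].size(1)
        positions = torch.arange(past_len, past_len + input_ids.size(1), device=input_ids.device)
        x = self.tok(input_ids) + self.pos(positions)
        new_kvs = []
        for i, block in enumerate(self.blocks):
            x, kv = block(x, None if past_kvs is None else past_kvs[i])
            new_kvs.append(kv)
        return self.head(x), new_kvs

    @torch.no_grad()
    def generate_naive(self, input_ids, max_new_tokens):
        # Mirrors the loop in the issue: every step re-runs the whole sequence.
        for _ in range(max_new_tokens):
            logits, _ = self(input_ids[:, -self.max_seq_len:])
            next_id = logits[:, -1].argmax(dim=-1, keepdim=True)
            input_ids = torch.cat([input_ids, next_id], dim=1)
        return input_ids

    @torch.no_grad()
    def generate_cached(self, input_ids, max_new_tokens):
        step_input, kvs = input_ids, None  # first iteration is the prefill pass
        for _ in range(max_new_tokens):
            logits, kvs = self(step_input, kvs)
            next_id = logits[:, -1].argmax(dim=-1, keepdim=True)
            input_ids = torch.cat([input_ids, next_id], dim=1)
            step_input = next_id  # later iterations feed only the newest token
        return input_ids


torch.manual_seed(0)
model = ToyModel().double().eval()  # float64 keeps the two paths numerically close
prompt = torch.randint(0, 50, (1, 5))
naive_out = model.generate_naive(prompt, 10)
cached_out = model.generate_cached(prompt, 10)
print(torch.equal(naive_out, cached_out))
```

Keeping the uncached loop around, as in this sketch, gives a simple reference to check the cached path against while the change is being made.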

Files

  • tergent/model.py:165-178
Reference
sleepy/ternary#12