NGramHasher.lookup leaves early positions as index 0 — wrong embedding injected #10

Open
opened 2026-05-08 23:49:37 +02:00 by sleepy · 0 comments
Owner

Problem

NGramHasher.lookup() (tergent/engram.py:92-93) skips positions that do not yet have enough preceding tokens to form an n-gram:

if t < order - 1:
    continue  # indices[b, t, oi, h] stays at 0 (default from torch.zeros)

Every position that cannot form a full n-gram therefore keeps hash index 0. All of these "not enough context" positions look up the same embedding slot, which injects a real embedding value (whatever is stored at index 0) rather than a neutral zero signal.

Impact

  • Early positions in every sequence get the same (arbitrary) embedding from index 0
  • This adds noise to the model's input, potentially corrupting learning at sequence boundaries
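  • The effect grows with n-gram order: because of the t < order - 1 check, a lookup of a given order leaves the first order - 1 positions of each sequence unhashed (e.g., order=3 affects 2 positions per sequence)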
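
To make the scale of the problem concrete, the count of affected lookups can be sketched directly from the `t < order - 1` condition. The helper names below (`unhashed_positions`, `fallback_fraction`) are hypothetical and not part of the repository; this is only an illustration of the arithmetic, assuming one lookup per position per order.

```python
def unhashed_positions(order: int, seq_len: int) -> int:
    """Positions t in [0, seq_len) that satisfy t < order - 1."""
    return min(max(order - 1, 0), seq_len)


def fallback_fraction(orders, seq_len: int) -> float:
    """Share of (position, order) lookups that fall back to index 0."""
    total = len(orders) * seq_len
    affected = sum(unhashed_positions(o, seq_len) for o in orders)
    return affected / total


if __name__ == "__main__":
    # order=3 leaves positions 0 and 1 without a full trigram
    print(unhashed_positions(3, 128))
    # with orders 2 and 3 on a 4-token sequence, 3 of 8 lookups fall back
    print(fallback_fraction([2, 3], 4))
```

The fraction is small for long sequences but becomes significant for short ones, which is where the shared index-0 embedding would matter most.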

Action needed

  • Initialize indices with a special "no-context" value (e.g., -1), and have EngramEmbedding handle it by returning zeros, OR
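  • Use a dedicated "padding" embedding slot at index 0 that is initialized to zeros and kept at zero during training, with real n-gram hashes mapped to indices 1 and above so they never collide with it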
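
A minimal sketch of the first option, assuming a sentinel value of -1 and a mask applied after the embedding lookup. `NO_CONTEXT`, `build_indices`, and `MaskedEmbedding` are hypothetical stand-ins, not the actual `NGramHasher` or `EngramEmbedding` code, and the polynomial hash is a toy placeholder for whatever hash the repository uses.

```python
import torch
import torch.nn as nn

NO_CONTEXT = -1  # hypothetical sentinel for "not enough tokens for an n-gram"


def build_indices(tokens: torch.Tensor, orders, table_size: int) -> torch.Tensor:
    """Toy stand-in for NGramHasher.lookup with one hash per order."""
    batch, seq_len = tokens.shape
    # Start from the sentinel instead of torch.zeros, so skipped positions
    # are distinguishable from a real hash that happens to equal 0.
    indices = torch.full((batch, seq_len, len(orders)), NO_CONTEXT, dtype=torch.long)
    for oi, order in enumerate(orders):
        for t in range(order - 1, seq_len):
            window = tokens[:, t - order + 1 : t + 1]  # shape (batch, order)
            h = torch.zeros(batch, dtype=torch.long)
            for k in range(order):
                # toy polynomial hash, NOT the repository's hash function
                h = (h * 31 + window[:, k]) % table_size
            indices[:, t, oi] = h
    return indices


class MaskedEmbedding(nn.Module):
    """Embedding lookup that returns zeros wherever the index is NO_CONTEXT."""

    def __init__(self, table_size: int, dim: int):
        super().__init__()
        self.table = nn.Embedding(table_size, dim)

    def forward(self, indices: torch.Tensor) -> torch.Tensor:
        valid = indices != NO_CONTEXT
        safe = indices.clamp(min=0)  # keep the lookup in range for sentinel entries
        out = self.table(safe)
        # Zero the sentinel positions; this also blocks gradient flow through them.
        return out * valid.unsqueeze(-1).to(out.dtype)


if __name__ == "__main__":
    tokens = torch.randint(0, 100, (2, 5))
    idx = build_indices(tokens, orders=[2, 3], table_size=64)
    emb = MaskedEmbedding(table_size=64, dim=8)
    print(emb(idx)[:, 0].abs().sum())  # position 0 has no bigram or trigram
```

The second option could be sketched in a similar way with `nn.Embedding(table_size + 1, dim, padding_idx=0)` and every real hash shifted by one; `padding_idx` keeps that row at zero and excludes it from gradient updates. Which variant fits better depends on how `EngramEmbedding` consumes the indices, which this issue does not show.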

Files

  • tergent/engram.py:92-93

Reference
sleepy/ternary#10