compress_token_ids uses Python hash() — non-deterministic across runs #9

Open
opened 2026-05-08 23:49:37 +02:00 by sleepy · 0 comments
Owner

Problem

compress_token_ids() (engram.py:31) uses Python's built-in hash() to map token strings to compressed IDs:

compress_cache[raw_id_int] = hash(canonical) % (2**24)

Python's hash() for strings is salted with a random per-process seed unless PYTHONHASHSEED is set to a fixed value. The same token string therefore maps to a different compressed ID on each training run, making Engram lookups non-reproducible.
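The behaviour can be reproduced outside the training code. The following is a minimal sketch, not code from the repository: the helper name hash_in_fresh_process and the sample token strings are made up for illustration. It evaluates the same expression as the line above in fresh interpreters with different hash seeds.

```python
import os
import subprocess
import sys


def hash_in_fresh_process(text: str, seed: str) -> int:
    """Return hash(text) % 2**24 as computed by a new interpreter.

    Hypothetical helper for illustration only. `seed` is passed through
    PYTHONHASHSEED, which controls the salt used by str hashing.
    """
    env = dict(os.environ, PYTHONHASHSEED=seed)
    result = subprocess.run(
        [sys.executable, "-c", f"print(hash({text!r}) % (2**24))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout)


if __name__ == "__main__":
    for token in ("the", "token", "engram"):
        a = hash_in_fresh_process(token, "1")
        b = hash_in_fresh_process(token, "2")
        print(f"{token!r}: seed=1 -> {a}, seed=2 -> {b}")
```

With an unset PYTHONHASHSEED, every process picks its own random seed, so two training runs behave like the two differently seeded interpreters here.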

Impact

  • Checkpoints can't be reliably resumed across restarts
  • Training results are non-reproducible
  • Multi-process training (if added) will have inconsistent lookups

Action needed

Replace hash() with a deterministic hash function. The file already has _murmur_hash(), so use that, or switch to hashlib.md5.
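A minimal sketch of the hashlib.md5 option, assuming canonical is a str and the 2**24 bucket count stays unchanged. The function name stable_token_id is hypothetical and does not exist in engram.py; the signature of _murmur_hash() is not shown in this issue, so that variant is not sketched here.

```python
import hashlib


def stable_token_id(canonical: str, buckets: int = 2**24) -> int:
    """Map a token string to a bucket ID that is identical across processes.

    Hypothetical replacement for `hash(canonical) % (2**24)`. MD5 is used
    only as a stable, well-distributed hash, not for security.
    """
    digest = hashlib.md5(canonical.encode("utf-8")).digest()
    # Take the first 8 bytes as an unsigned integer, then fold into range.
    return int.from_bytes(digest[:8], "big") % buckets


if __name__ == "__main__":
    # The value depends only on the input bytes, never on PYTHONHASHSEED.
    print(stable_token_id("hello"))
```

Under that assumption the assignment would become compress_cache[raw_id_int] = stable_token_id(canonical). Switching the hash function changes every compressed ID, so checkpoints written with the old mapping would not line up with the new one.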

Files

  • tergent/engram.py:31
Reference
sleepy/ternary#9