compress_token_ids uses Python hash() — non-deterministic across runs #9

Open

opened 2026-05-08 23:49:37 +02:00 by sleepy · 0 comments

sleepy commented

2026-05-08 23:49:37 +02:00

Owner

Problem

compress_token_ids() (engram.py:31) uses Python's built-in hash() to map token strings to compressed IDs:

compress_cache[raw_id_int] = hash(canonical) % (2**24)

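Python's hash() is randomized per process for str and bytes inputs: each interpreter start picks a new random salt unless PYTHONHASHSEED is set to a fixed value. The cache will therefore map the same token to a different compressed ID on each training run, making Engram lookups non-reproducible.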
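
One way to observe the problem outside of training is to compute the same expression in separate interpreter processes. This is only a sketch: the helper name `hash_in_fresh_process` and the sample token are made up for illustration and are not part of engram.py.

```python
import os
import subprocess
import sys


def hash_in_fresh_process(text: str, seed: str) -> int:
    """Return hash(text) % 2**24 as computed by a new interpreter.

    `seed` is passed as PYTHONHASHSEED; "random" reproduces the default
    behaviour, an integer string pins the salt.
    """
    env = dict(os.environ, PYTHONHASHSEED=seed)
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(hash(sys.argv[1]) % (2**24))", text],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout.strip())


if __name__ == "__main__":
    # Same seed: the value is stable across processes.
    print(hash_in_fresh_process("token", "1"), hash_in_fresh_process("token", "1"))
    # Default (random) salt: the two values will almost always differ.
    print(hash_in_fresh_process("token", "random"), hash_in_fresh_process("token", "random"))
```

Pinning PYTHONHASHSEED for every run would hide the symptom, but it depends on each launcher setting the variable consistently, so it does not replace the fix requested under "Action needed".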

Impact

Checkpoints can't be reliably resumed across restarts
Training results are non-reproducible
Multi-process training (if added) will have inconsistent lookups

Action needed

Replace with a deterministic hash function. The file already has _murmur_hash() — use that instead, or switch to hashlib.md5.
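
A minimal sketch of the hashlib.md5 option follows. It assumes `canonical` is a str and keeps the existing 2**24 modulus; the function name `stable_compressed_id` is hypothetical, and the signature of `_murmur_hash()` is not shown in this issue, so that option is not sketched here.

```python
import hashlib

NUM_COMPRESSED_IDS = 2**24  # same modulus as the current hash() call


def stable_compressed_id(canonical: str) -> int:
    """Map a token string to an ID in [0, 2**24) that is identical in every process."""
    # md5 is used only as a stable, well-distributed hash, not for security.
    digest = hashlib.md5(canonical.encode("utf-8")).digest()
    # Take the first 8 bytes with a fixed byte order so the result does not
    # depend on the platform, then reduce to the ID range.
    return int.from_bytes(digest[:8], "little") % NUM_COMPRESSED_IDS
```

With a helper like this, the assignment would become `compress_cache[raw_id_int] = stable_compressed_id(canonical)`. The resulting IDs will differ from whatever any earlier run produced with hash(), so Engram tables trained under the old mapping would not line up with the new one.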

Files

tergent/engram.py:31


Reference

sleepy/ternary#9
