[RAG] rag_vector.py uses non-deterministic hash() for doc IDs — breaks dedup across restarts #753

Closed
opened 2026-06-03 00:19:25 +02:00 by sleepy · 0 comments
Owner

File: src/rag_vector.py — 496 lines, just under the limit but close

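doc_id = f"doc_{hash(text) % 10**16}" (line 102) uses Python's built-in hash(), whose result for str values is salted per process (hash randomization, controlled by PYTHONHASHSEED). The value therefore differs from one interpreter run to the next unless the seed is pinned. This means: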

  1. The same text produces different doc_ids across app restarts
  2. Duplicate detection (existing = self._collection.get(ids=[doc_id])) fails after restart — the same document gets re-indexed with a new ID
  3. The batch path (line 138) has the same issue
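
The effect is easy to reproduce outside the app. The sketch below is not taken from rag_vector.py; the helper name str_hash_in_fresh_process is made up for illustration. It starts a fresh interpreter with a given PYTHONHASHSEED and reports hash() of the same string, standing in for two separate app starts.

```python
import os
import subprocess
import sys


def str_hash_in_fresh_process(text: str, seed: str) -> int:
    """Return hash(text) as computed by a new interpreter started with PYTHONHASHSEED=seed.

    Hypothetical helper for illustration only; it is not part of rag_vector.py.
    """
    env = dict(os.environ, PYTHONHASHSEED=seed)
    result = subprocess.run(
        # With "-c", sys.argv[0] is "-c" and sys.argv[1] is the text we pass in.
        [sys.executable, "-c", "import sys; print(hash(sys.argv[1]))", text],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout.strip())


if __name__ == "__main__":
    sample = "the same document text"
    first_run = str_hash_in_fresh_process(sample, "1")
    second_run = str_hash_in_fresh_process(sample, "2")
    # Different seeds stand in for two app restarts with default randomization.
    print("hashes match across runs:", first_run == second_run)
```

With the default setting (no PYTHONHASHSEED in the environment) each start picks a random seed, which is the situation the app is in after a restart.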

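Also in src/rag_vector.py: The remove_directory() method (line 378) uses $contains as a ChromaDB metadata filter. As far as I can tell from the ChromaDB documentation, $contains is a where_document (document content) operator and is not among the metadata where operators, so the filter may be rejected or may not behave as a path substring match, depending on the ChromaDB version.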
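
One possible workaround, sketched under assumptions: fetch ids and metadata, then select the entries to delete client-side with a real path-prefix check instead of a substring filter. The function name ids_under_directory and the metadata key "source" are hypothetical; the actual key used by remove_directory() is not shown in this issue.

```python
from pathlib import PurePosixPath
from typing import Any, List, Mapping, Optional, Sequence


def ids_under_directory(
    ids: Sequence[str],
    metadatas: Sequence[Optional[Mapping[str, Any]]],
    directory: str,
    path_key: str = "source",  # assumed metadata key, not confirmed by the issue
) -> List[str]:
    """Return the ids whose stored path lies inside `directory` (or equals it)."""
    root = PurePosixPath(directory)
    matches = []
    for doc_id, meta in zip(ids, metadatas):
        path = (meta or {}).get(path_key)
        if not isinstance(path, str):
            continue  # entry has no usable path metadata
        candidate = PurePosixPath(path)
        # Component-wise comparison: "/data/docs2" is not inside "/data/docs",
        # which a plain substring match would get wrong.
        if candidate == root or root in candidate.parents:
            matches.append(doc_id)
    return matches


if __name__ == "__main__":
    ids = ["a", "b", "c"]
    metas = [
        {"source": "/data/docs/x.md"},
        {"source": "/data/docs2/y.md"},
        {"source": "/data/docs/sub/z.md"},
    ]
    print(ids_under_directory(ids, metas, "/data/docs"))
```

The inputs have the shape of the "ids" and "metadatas" lists returned by collection.get(include=["metadatas"]), and the result could be passed to collection.delete(ids=...). This trades a server-side filter for a full metadata scan, so it may be too slow for large collections.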

Action: Replace hash(text) with a deterministic hash (e.g., hashlib.sha256(text.encode()).hexdigest()[:16]) for stable document IDs.
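
A minimal sketch of the proposed replacement, assuming the "doc_" prefix is kept and that UTF-8 is the intended encoding. The function name stable_doc_id is hypothetical.

```python
import hashlib


def stable_doc_id(text: str) -> str:
    """Derive a document ID that is identical in every process for the same text."""
    # SHA-256 depends only on the input bytes, not on PYTHONHASHSEED.
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    # Keep the first 16 hex characters (64 bits), as suggested in the issue.
    return f"doc_{digest[:16]}"


if __name__ == "__main__":
    print(stable_doc_id("the same document text"))
```

Documents indexed before the change keep their old hash()-based IDs, so a one-time re-index or cleanup is probably needed for dedup to cover them. The same helper could serve both the single-document path (line 102) and the batch path (line 138).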

Reference
sleepy/odysseus#753