[RAG] rag_vector.py uses non-deterministic hash() for doc IDs — breaks dedup across restarts #753
Reference
sleepy/odysseus#753
File: `src/rag_vector.py` (496 lines, just under the limit)

`doc_id = f"doc_{hash(text) % 10**16}"` (line 102) uses Python's built-in `hash()`, which is non-deterministic across processes for strings (`PYTHONHASHSEED` randomization). This means the dedup check (`existing = self._collection.get(ids=[doc_id])`) fails after a restart: the same document gets re-indexed under a new ID.

Also in `src/rag_vector.py`: the `remove_directory()` method (line 378) uses `$contains` as a ChromaDB metadata filter, but `$contains` is not a standard ChromaDB metadata operator. It is a substring match that may not work as intended on all backends.

Action: Replace `hash(text)` with a deterministic hash (e.g., `hashlib.sha256(text.encode()).hexdigest()[:16]`) for stable document IDs.
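
A minimal sketch of the proposed change, assuming the ID should depend only on the document text. The helper name `stable_doc_id` is hypothetical and does not exist in `src/rag_vector.py`; only the `hashlib.sha256(...).hexdigest()[:16]` expression comes from the suggestion above.

```python
import hashlib


def stable_doc_id(text: str) -> str:
    """Return a document ID that is identical across processes and restarts.

    Hypothetical helper: unlike the built-in hash(), SHA-256 is not affected
    by PYTHONHASHSEED, so the same text always maps to the same ID.
    """
    # Encode explicitly as UTF-8 so the digest does not depend on any default.
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    # Keep the first 16 hex characters (64 bits), as suggested in this issue.
    return f"doc_{digest[:16]}"


if __name__ == "__main__":
    # Running this twice, or in two separate processes, prints the same ID.
    print(stable_doc_id("hello"))
```

Documents already indexed under the old `hash()`-based IDs would not match the new scheme, so a one-time re-index or cleanup of the collection is likely needed after switching.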