fix: replace non-deterministic hash() with hashlib.sha256 for doc IDs #801

Open
sleepy wants to merge 0 commits from fix/deterministic-doc-ids-753 into main
Owner

Replace hash() with hashlib.sha256() for deterministic doc IDs.

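Python's hash() is randomized per process for str and bytes values (the seed is controlled by PYTHONHASHSEED), so it produces different values across process restarts. This breaks document deduplication, because the same text no longer maps to the same doc ID.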

Fixes both add_document() and add_documents_batch() in src/rag_vector.py.

Closes #753
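
To make the failure mode concrete, here is a minimal sketch that computes hash() of the same string in separate interpreter processes. The helper name hash_in_fresh_process is hypothetical and is not part of src/rag_vector.py; it only illustrates the behaviour described above.

```python
import os
import subprocess
import sys


def hash_in_fresh_process(text, seed=None):
    """Return hash(text) as computed by a newly started Python interpreter.

    Hypothetical helper for illustration only. With seed=None the child
    process picks a random hash seed; with an integer seed, PYTHONHASHSEED
    is set so the result is repeatable.
    """
    env = dict(os.environ)
    env.pop("PYTHONHASHSEED", None)  # start from the default (randomized) state
    if seed is not None:
        env["PYTHONHASHSEED"] = str(seed)
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(hash(sys.argv[1]))", text],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout.strip())


if __name__ == "__main__":
    text = "the same document text"
    # Each call starts a new process, mimicking a restart of the service.
    print("randomized:", [hash_in_fresh_process(text) for _ in range(3)])
    # Pinning the seed makes the values repeat, but it is a process-wide
    # setting and an easy one to forget in deployment.
    print("pinned seed:", [hash_in_fresh_process(text, seed=0) for _ in range(3)])
```

Pinning PYTHONHASHSEED would also make the IDs repeatable, but it depends on every deployment setting the variable, which is presumably why this PR switches to a content hash instead.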

Python's hash() is randomized via PYTHONHASHSEED, producing different
values across process restarts. This breaks document deduplication
because the same text gets a different doc_id each time.

Replace with hashlib.sha256(text.encode()).hexdigest()[:16] which is
deterministic and stable across processes.

Closes #753
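
The replacement described in the commit message can be sketched as follows. The function names make_doc_id and dedupe are assumptions for illustration; the actual code in add_document() and add_documents_batch() may be structured differently.

```python
import hashlib


def make_doc_id(text: str) -> str:
    """Derive a stable 16-hex-character ID from the document text.

    Hypothetical helper: uses the expression from the commit message,
    hashlib.sha256(text.encode()).hexdigest()[:16]. str.encode() defaults
    to UTF-8, so the ID depends only on the text content.
    """
    return hashlib.sha256(text.encode()).hexdigest()[:16]


def dedupe(texts):
    """Keep the first occurrence of each distinct text, keyed by doc ID."""
    seen = {}
    for text in texts:
        # setdefault leaves an existing entry untouched, so repeats are dropped.
        seen.setdefault(make_doc_id(text), text)
    return seen


if __name__ == "__main__":
    docs = ["alpha", "beta", "alpha"]
    unique = dedupe(docs)
    print(len(unique))  # two distinct texts remain
    print(make_doc_id("alpha") == make_doc_id("alpha"))  # stable within and across runs
```

Truncating to 16 hex characters keeps 64 bits of the digest. That is compact, but it raises the chance of an accidental collision compared with the full digest, so it is worth checking against the expected corpus size.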
This branch is already included in the target branch. There is nothing to merge.