fix(#748): Deduplicate tokenization and similarity logic in services/memory/ #805

Open
sleepy wants to merge 0 commits from fix/748-deduplicate-tokenization into main

Extract shared text_utils.py from three modules with duplicate tokenize/Jaccard implementations. Closes #748

Three modules had their own tokenize/Jaccard implementations:
- memory.py: tokenize() + get_text_similarity() (basic punctuation stripping)
- skills.py: _tokenize() + _jaccard() (enhanced, strips :()[] too)
- memory_extractor.py: _is_text_duplicate() (raw split, no punctuation strip)

Create services/memory/text_utils.py with the comprehensive implementation
(based on skills.py version which strips the most characters and filters
short tokens). Update all consumers to import from the shared module.
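For readers without the diff at hand, a minimal sketch of what such a shared module might look like. The function names come from the description above; the exact punctuation set, the minimum token length, and replacing punctuation with spaces rather than deleting it are assumptions, not the actual contents of services/memory/text_utils.py.

```python
import re
from typing import Optional, Set

# Assumption: illustrative punctuation set, including the ":()[]" characters
# the description says the skills.py version strips.
_PUNCT_RE = re.compile(r"[.,;:!?\"'()\[\]{}]")

# Assumption: "short tokens" means fewer than 2 characters.
_MIN_TOKEN_LEN = 2


def tokenize(text: Optional[str]) -> Set[str]:
    """Lowercase, replace punctuation with spaces, drop short tokens."""
    if not text:
        return set()
    cleaned = _PUNCT_RE.sub(" ", text.lower())
    return {tok for tok in cleaned.split() if len(tok) >= _MIN_TOKEN_LEN}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard index |a & b| / |a | b|; defined here as 0.0 if either set is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def get_text_similarity(left: Optional[str], right: Optional[str]) -> float:
    """Token-set similarity of two strings."""
    return jaccard(tokenize(left), tokenize(right))
```

With this sketch, "hello world" and "hello there" share one token out of three distinct tokens, so their similarity is 1/3.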

Backward compatible: memory.py re-exports tokenize, get_text_similarity,
jaccard, and is_text_duplicate. skills.py uses import aliases so existing
call-sites (_tokenize, _jaccard) work unchanged.
- Remove unused imports (tokenize_list, jaccard, is_text_duplicate) from
  services/memory/memory.py — only tokenize and get_text_similarity are
  used by the module.
- Add tests/test_text_utils.py with 27 unit tests covering tokenize,
  tokenize_list, jaccard, get_text_similarity, and is_text_duplicate,
  including edge cases for empty strings, None, single-char tokens,
  and threshold behavior.
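The edge cases listed above could be exercised roughly as follows. This is a self-contained sketch, not the contents of tests/test_text_utils.py: the `_words` helper is a simplified stand-in tokenizer, and the default threshold of 0.8 and the inclusive `>=` comparison are assumptions.

```python
from typing import Optional, Set


def _words(text: Optional[str]) -> Set[str]:
    # Hypothetical stand-in for the shared tokenizer; no punctuation handling.
    return set(text.lower().split()) if text else set()


def is_text_duplicate(a: Optional[str], b: Optional[str],
                      threshold: float = 0.8) -> bool:
    # Assumption: empty or None input is never reported as a duplicate.
    wa, wb = _words(a), _words(b)
    if not wa or not wb:
        return False
    return len(wa & wb) / len(wa | wb) >= threshold


def test_empty_and_none_inputs() -> None:
    assert not is_text_duplicate("", "some text")
    assert not is_text_duplicate(None, "some text")
    assert not is_text_duplicate(None, None)


def test_threshold_behavior() -> None:
    a = "the cat sat on mat"
    b = "the cat sat on rug"
    # 4 shared tokens out of 6 distinct tokens gives a similarity of about 0.67.
    assert not is_text_duplicate(a, b, threshold=0.8)
    assert is_text_duplicate(a, b, threshold=0.6)
    assert is_text_duplicate(a, a)
```

The functions follow pytest's `test_` naming convention, so a runner such as pytest would collect them automatically.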
This branch is already included in the target branch. There is nothing to merge.