[memory/RAG] Duplicated tokenization and similarity logic across 3 modules #748
Reference
sleepy/odysseus#748
Duplicated helpers:

- `services/memory/memory.py` (lines 13-33): `tokenize()` (splits on whitespace, strips punctuation) and `get_text_similarity()` (Jaccard)
- `src/memory.py` (lines 13-33): identical copies of the above
- `services/memory/skills.py` (lines 37-44): `_tokenize()` and `_jaccard()`, the same logic with a slightly different implementation (also strips `:()[]`, filters len > 1)
- `services/memory/memory_extractor.py` (lines 209-222): `_is_text_duplicate()`, another Jaccard similarity implementation
- `src/personal_docs.py` (lines 69-72): `tokenize()`, yet another tokenizer with stop-word filtering

AGENTS.md says: "Writing new helpers without searching for existing ones first."
Action: Extract a single `services/memory/text_similarity.py` (or a shared `src/text_helpers.py`) with canonical `tokenize()`, `jaccard_similarity()`, and `is_text_duplicate()`. Remove all duplicates.

Fixed via PR #869. Most duplicates were already resolved in prior batches. Replaced the last duplicate in `src/personal_docs.py` with an import from `services.memory.text_utils`.
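For reference, a minimal sketch of what the three canonical helpers named in the Action item could look like. This is not the code merged in PR #869. The lowercasing step, the punctuation set, the handling of empty input, and the `threshold=0.8` default are all assumptions made for illustration, and the real `services.memory.text_utils` module may differ.

```python
import string

# Assumption: strip any ASCII punctuation from token edges.
# The duplicated helpers each used a slightly different character set.
_PUNCT = string.punctuation


def tokenize(text: str) -> set[str]:
    """Split on whitespace, strip surrounding punctuation, drop empty tokens.

    Lowercasing is an assumption; the issue only mentions whitespace
    splitting and punctuation stripping.
    """
    stripped = (word.strip(_PUNCT) for word in text.lower().split())
    return {token for token in stripped if token}


def jaccard_similarity(a: str, b: str) -> float:
    """Jaccard index of the two token sets: |A & B| / |A | B|."""
    tokens_a, tokens_b = tokenize(a), tokenize(b)
    if not tokens_a or not tokens_b:
        # Assumption: treat empty input as "no similarity" rather than 1.0.
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def is_text_duplicate(a: str, b: str, threshold: float = 0.8) -> bool:
    """True when the Jaccard similarity reaches the (hypothetical) threshold."""
    return jaccard_similarity(a, b) >= threshold
```

Callers would then import these helpers rather than redefining them locally. Variants that need extra behaviour, such as the stop-word filtering in `src/personal_docs.py`, could post-filter the set returned by `tokenize()` instead of carrying their own tokenizer.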