[memory/RAG] Duplicated tokenization and similarity logic across 3 modules #748

Closed
opened 2026-06-03 00:18:12 +02:00 by sleepy · 1 comment
Owner

Duplicated helpers:

  1. services/memory/memory.py (lines 13-33): tokenize() (splits on whitespace, strips punctuation) and get_text_similarity() (Jaccard)
  2. src/memory.py (lines 13-33): Identical copies of the above
  3. services/memory/skills.py (lines 37-44): _tokenize() and _jaccard() — same logic, slightly different implementation (also strips :()[], filters len > 1)
  4. services/memory/memory_extractor.py (lines 209-222): _is_text_duplicate() — another Jaccard similarity implementation
  5. src/personal_docs.py (lines 69-72): tokenize() — yet another tokenizer with stop-word filtering

AGENTS.md lists this as something to avoid: "Writing new helpers without searching for existing ones first."

Action: Extract a single services/memory/text_similarity.py (or a shared src/text_helpers.py) with canonical tokenize(), jaccard_similarity(), and is_text_duplicate(). Remove all duplicates.
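
A minimal sketch of what the shared module could look like, assuming the behaviour described above (whitespace split, punctuation stripped, Jaccard over token sets). The lowercasing step, the `threshold` default of 0.8, and the handling of empty input are assumptions for illustration, not taken from the existing code:

```python
import string


def tokenize(text: str) -> set[str]:
    """Split on whitespace and strip leading/trailing punctuation.

    Lowercasing is an assumption; the existing helpers may not do it.
    """
    words = (w.strip(string.punctuation) for w in text.lower().split())
    return {w for w in words if w}  # drop tokens that were pure punctuation


def jaccard_similarity(a: str, b: str) -> float:
    """Jaccard index of the two token sets: |A & B| / |A | B|."""
    tokens_a, tokens_b = tokenize(a), tokenize(b)
    union = tokens_a | tokens_b
    if not union:
        return 0.0  # assumption: two empty texts count as dissimilar
    return len(tokens_a & tokens_b) / len(union)


def is_text_duplicate(a: str, b: str, threshold: float = 0.8) -> bool:
    """Treat texts as duplicates when similarity reaches the threshold.

    The 0.8 default is hypothetical; callers should pass their own value.
    """
    return jaccard_similarity(a, b) >= threshold
```

Callers with extra requirements (the stop-word filtering in src/personal_docs.py, the len > 1 filter in skills.py) would need either optional parameters on `tokenize()` or a thin wrapper around it; which of the two fits better depends on how much the variants actually differ.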

Author
Owner

Fixed via PR #869. Most duplicates had already been resolved in earlier batches; this PR replaces the last one, in src/personal_docs.py, with an import from services.memory.text_utils.

Reference
sleepy/odysseus#748