[memory/RAG] Duplicated tokenization and similarity logic across 3 modules #748

Closed

opened 2026-06-03 00:18:12 +02:00 by sleepy · 1 comment

sleepy commented

2026-06-03 00:18:12 +02:00

Owner

Duplicated helpers:

1. `services/memory/memory.py` (lines 13-33): `tokenize()` (splits on whitespace, strips punctuation) and `get_text_similarity()` (Jaccard)
2. `src/memory.py` (lines 13-33): identical copies of the above
3. `services/memory/skills.py` (lines 37-44): `_tokenize()` and `_jaccard()`, the same logic with a slightly different implementation (also strips `:()[]`, keeps only tokens with len > 1)
4. `services/memory/memory_extractor.py` (lines 209-222): `_is_text_duplicate()`, another Jaccard similarity implementation
5. `src/personal_docs.py` (lines 69-72): `tokenize()`, yet another tokenizer, this one with stop-word filtering

AGENTS.md warns against exactly this: "Writing new helpers without searching for existing ones first."

Action: Extract a single services/memory/text_similarity.py (or a shared src/text_helpers.py) with canonical tokenize(), jaccard_similarity(), and is_text_duplicate(). Remove all duplicates.
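For reference, a minimal sketch of what the canonical module could look like. Only the three function names come from this issue; the signatures, the lowercasing step, the exact punctuation set, the `min_len` and `stop_words` parameters (meant to cover the `skills.py` and `personal_docs.py` variants), and the 0.8 threshold are all assumptions, not the actual implementation.

```python
from __future__ import annotations

import string

# Assumption: strip ASCII punctuation from token edges. The issue only says
# "strips punctuation"; string.punctuation also covers the ":()[]" set
# that skills.py strips.
_STRIP_CHARS = string.punctuation


def tokenize(
    text: str,
    min_len: int = 1,
    stop_words: frozenset[str] = frozenset(),
) -> set[str]:
    """Split on whitespace, strip edge punctuation, return unique tokens.

    min_len=2 would mimic the "len > 1" filter in skills.py; stop_words
    would mimic the filtering in personal_docs.py (both hypothetical knobs).
    Lowercasing is an assumption, not stated in the issue.
    """
    tokens: set[str] = set()
    for raw in text.lower().split():
        token = raw.strip(_STRIP_CHARS)
        if len(token) >= min_len and token not in stop_words:
            tokens.add(token)
    return tokens


def jaccard_similarity(a: str, b: str) -> float:
    """Jaccard index |A & B| / |A | B| over the token sets of a and b."""
    tokens_a, tokens_b = tokenize(a), tokenize(b)
    union = tokens_a | tokens_b
    if not union:
        # Two texts with no tokens: treated as dissimilar here (a design choice).
        return 0.0
    return len(tokens_a & tokens_b) / len(union)


def is_text_duplicate(a: str, b: str, threshold: float = 0.8) -> bool:
    """True when the Jaccard similarity reaches the (assumed) threshold."""
    return jaccard_similarity(a, b) >= threshold
```

With this sketch, `jaccard_similarity("a b c", "a b d")` is 0.5 (two shared tokens out of four distinct ones), and `is_text_duplicate("the cat sat", "The cat sat.")` is true because case and trailing punctuation are normalized away. Callers with stricter needs would pass their own `min_len` or `stop_words` rather than keep a local tokenizer.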

sleepy referenced this issue from a commit

2026-06-03 16:13:00 +02:00

fix(#748): extract shared text_utils.py, deduplicate tokenization/similarity

sleepy referenced this issue from a pull request that will close it,

2026-06-03 16:13:27 +02:00

fix(#748): Deduplicate tokenization and similarity logic in services/memory/ #805

sleepy referenced this issue from a commit

2026-06-03 23:54:36 +02:00

deduplicate tokenization: replace personal_docs local tokenize with canonical text_utils

sleepy closed this issue

2026-06-03 23:56:36 +02:00

sleepy commented

2026-06-03 23:56:47 +02:00

Author

Owner

Fixed via PR #869 — most duplicates already resolved in prior batches. Replaced last duplicate in src/personal_docs.py with import from services.memory.text_utils.