[embeddings] Failed HTTP endpoint latched for entire process lifetime with no auto-retry #766

Closed
opened 2026-06-03 00:22:44 +02:00 by sleepy · 1 comment
Owner

**File**: `src/embeddings.py` line 203

```python
_http_embed_down = False  # process-level latch
```

Once the HTTP embedding endpoint fails, `_http_embed_down = True` is set for the **entire process lifetime**. The only way to reset it is to call `reset_http_embed_state()`, which is only triggered by manual admin panel saves.

This means:

1. If the embedding endpoint is briefly down during startup, the process runs on FastEmbed forever
2. If the endpoint recovers, no automatic retry occurs
3. In a long-running server, this can cause degraded quality (FastEmbed may use a different/smaller model than the configured endpoint)

`rag_singleton.py` has a better pattern: it retries every 30 seconds. `embeddings.py` should adopt a similar approach.

**Action**: Replace the boolean latch with a time-based retry (e.g., re-probe every N seconds after failure), similar to `rag_singleton.py`'s `_RETRY_INTERVAL`.
Author
Owner

Fixed in PR #802 — replaced boolean latch with time-based retry (30s re-probe interval).
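
For readers who want a picture of what a time-based retry can look like, here is a minimal sketch. It is not the code from the PR: `HttpEmbedGate`, `embed`, `http_embed`, and `local_embed` are hypothetical names chosen for illustration, and the 30-second value is only assumed to match the interval mentioned above. The idea is to remember when the endpoint last failed instead of remembering only that it failed.

```python
import time
from typing import Callable, List, Optional

# Assumed value, mirroring the 30 s re-probe interval mentioned in this issue.
_RETRY_INTERVAL = 30.0


class HttpEmbedGate:
    """Hypothetical replacement for a boolean latch: stores the failure time."""

    def __init__(self, retry_interval: float = _RETRY_INTERVAL,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._retry_interval = retry_interval
        self._clock = clock  # injectable so the behaviour can be tested
        self._failed_at: Optional[float] = None

    def should_try_http(self) -> bool:
        # Healthy, or enough time has passed since the last failure to re-probe.
        if self._failed_at is None:
            return True
        return self._clock() - self._failed_at >= self._retry_interval

    def record_failure(self) -> None:
        self._failed_at = self._clock()

    def record_success(self) -> None:
        self._failed_at = None


def embed(texts: List[str], gate: HttpEmbedGate,
          http_embed: Callable[[List[str]], List[List[float]]],
          local_embed: Callable[[List[str]], List[List[float]]]) -> List[List[float]]:
    """Try the HTTP endpoint when the gate allows it, else use the local fallback."""
    if gate.should_try_http():
        try:
            vectors = http_embed(texts)
        except Exception:
            gate.record_failure()  # start (or restart) the cool-down window
        else:
            gate.record_success()
            return vectors
    return local_embed(texts)
```

With this shape, a brief outage at startup costs at most one retry interval of fallback embeddings, and a manual reset simply becomes a call to `record_success()`. `time.monotonic` is used rather than wall-clock time so that system clock adjustments cannot shorten or extend the cool-down.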

Reference
sleepy/odysseus#766