sleepy/odysseus

[feature] Video Analysis side panel — transcribe + find + summarize (like Deep Research panel) #926

Open

opened 2026-06-04 12:13:56 +02:00 by sleepy · 0 comments

sleepy commented

2026-06-04 12:13:56 +02:00

Owner

Feature Request

Add a Video Analysis side panel to Odysseus, modeled after the existing Deep Research panel (static/js/research/panel.js + research-overlay). Not a chat tool — a dedicated UI section with its own overlay, progress tracking, and results viewer.

The entire backend pipeline already exists and is production-tested at ~/workspace/vod-pipeline/:

vod_pipeline.py (548 lines) — download → preprocess → Parakeet transcribe → LLM analysis
discord_bot.py (762 lines) — working Discord bot with /url, /retry, /find, /help
~/workspace/clyde-vods/prompts/PROMPT_5hr.md — analysis prompt (chapters, highlights, games, audio pollution filtering)

~95% of the backend code can be reused. The main work is: frontend panel, backend routes, settings integration.

UI Design (follow Deep Research pattern)

Side Panel Button

Like the Deep Research button in the sidebar, add a Video Analysis button that opens a full-screen overlay (vod-overlay).

Overlay Layout

The overlay should have 3 tabs/modes (like DR has running/completed sections):

1. Analyze: full pipeline (paste URL → transcribe + analyze → view results)
   - Input: YouTube/Twitch URL field + "Analyze" button
   - Progress bar (reuse the bot's progress parsing logic): Download → Transcribe → Analyze
   - Results: transcript viewer + analysis cards (chapters, highlights, games)
   - Downloads: transcript .txt, analysis .json
2. Find: timestamp search in a video
   - Input: URL + search query
   - Auto-transcribes if no existing transcript
   - Returns timestamps + context snippets
   - Reuses /find implementation from discord_bot.py:564+
3. Summarize: video summary with key points
   - Input: URL
   - Auto-transcribes if no existing transcript
   - Returns structured summary with timestamps for each key point
   - New (does not exist in Discord bot)
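
To make the Find tab's output shape concrete (timestamps plus context snippets), here is a minimal sketch of a plain keyword search over a timestamped transcript. It is not the bot's /find implementation, which pairs FIND_SYSTEM_PROMPT with transcript search; the `FindHit` type, the `find_in_transcript` helper, and the assumed transcript line shape `[HH:MM:SS] text` are all hypothetical.

```python
import re
from dataclasses import dataclass
from typing import List

# Assumed transcript line shape: "[HH:MM:SS] spoken text"
LINE_RE = re.compile(r"^\[(\d{2,}):(\d{2}):(\d{2})\]\s*(.*)$")


@dataclass
class FindHit:
    timestamp: str  # "[HH:MM:SS]" rebuilt from the parsed offset
    seconds: int    # offset from the start of the video
    context: str    # matching line joined with its neighbours


def find_in_transcript(transcript: str, query: str, context_lines: int = 1) -> List[FindHit]:
    """Case-insensitive substring search over timestamped transcript lines."""
    parsed = []
    for raw in transcript.splitlines():
        match = LINE_RE.match(raw.strip())
        if match:
            hours, minutes, seconds, text = match.groups()
            parsed.append((int(hours) * 3600 + int(minutes) * 60 + int(seconds), text))

    needle = query.lower()
    hits = []
    for i, (secs, text) in enumerate(parsed):
        if needle in text.lower():
            lo = max(0, i - context_lines)
            hi = min(len(parsed), i + context_lines + 1)
            context = " ".join(t for _, t in parsed[lo:hi])
            stamp = f"[{secs // 3600:02d}:{secs % 3600 // 60:02d}:{secs % 60:02d}]"
            hits.append(FindHit(stamp, secs, context))
    return hits
```

Each hit carries both the display timestamp and the offset in seconds, so a results viewer could link straight to that point in the video.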

History

Just as Deep Research keeps data/deep_research/<id>.json, store analyses in data/video_analysis/<id>.json with:

url, video_id, video_info (title, channel, date)
transcript_path, analysis_path
status (pending/transcribing/analyzing/complete/error)
timestamps for each phase
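
The record fields above could be modeled roughly as follows. This is a sketch only: the `VodJob` class, its `advance`/`save`/`load` methods, and the `phase_timestamps` field name are hypothetical, and the caller decides where the JSON file lives.

```python
import json
import time
from dataclasses import dataclass, field, asdict
from pathlib import Path
from typing import Optional

# Status values taken from the list above.
STATUSES = ("pending", "transcribing", "analyzing", "complete", "error")


@dataclass
class VodJob:
    id: str
    url: str
    video_id: str = ""
    video_info: dict = field(default_factory=dict)        # title, channel, date
    transcript_path: str = ""
    analysis_path: str = ""
    status: str = "pending"
    phase_timestamps: dict = field(default_factory=dict)  # status -> epoch seconds

    def advance(self, status: str, now: Optional[float] = None) -> None:
        """Move to a new status and record when that phase started."""
        if status not in STATUSES:
            raise ValueError(f"unknown status: {status}")
        self.status = status
        self.phase_timestamps[status] = time.time() if now is None else now

    def save(self, path: Path) -> Path:
        """Write the record as JSON, creating parent directories as needed."""
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps(asdict(self), indent=2), encoding="utf-8")
        return path

    @classmethod
    def load(cls, path: Path) -> "VodJob":
        return cls(**json.loads(path.read_text(encoding="utf-8")))
```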

Architecture (follow Deep Research pattern)

Frontend

static/js/vod/
  panel.js      — overlay, tabs, progress, results rendering
  jobs.js       — job tracking, polling, history

Backend Routes

routes/vod_routes.py   — /api/vod/* endpoints

Endpoints:

POST /api/vod/analyze — Start analysis (download + transcribe + analyze)
POST /api/vod/find — Find in video (auto-transcribe if needed, then search)
POST /api/vod/summarize — Summarize video (auto-transcribe if needed, then summarize)
GET /api/vod/jobs — List all jobs (history)
GET /api/vod/jobs/{id} — Get job status + results
GET /api/vod/jobs/{id}/transcript — Stream transcript text
DELETE /api/vod/jobs/{id} — Delete job + artifacts

Backend Pipeline

src/vod_pipeline.py         — audio download, preprocess, Parakeet transcription (from vod_pipeline.py)
src/vod_analysis.py         — LLM analysis, find, summarize (from discord_bot.py find logic + new summarize)

Settings Keys

"video_analysis_enabled": True,
"video_analysis_model": "",           # model for analysis/find/summarize
"video_analysis_endpoint_id": "",     # endpoint for the model

Model resolution uses Odysseus' existing endpoint_resolver and is NOT hardcoded to DeepSeek. It falls back to default_model / default_endpoint_id if no video-specific model is configured.

Data Storage

data/video_analysis/
  <job_id>/
    audio_16k_mono.wav     — kept for re-analysis (delete after configurable TTL)
    <video_id>_transcript.txt
    <video_id>_analysis.json
    job.json               — metadata, status, timestamps
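
A small path helper can keep this layout in one place. The sketch below is illustrative: `job_paths` and `audio_expired` are hypothetical names, and judging the TTL by file modification time is an assumption rather than a stated requirement.

```python
import time
from pathlib import Path
from typing import Optional


def job_paths(root: Path, job_id: str, video_id: str) -> dict:
    """Return the artifact paths for one job, following the layout above."""
    job_dir = root / job_id
    return {
        "dir": job_dir,
        "audio": job_dir / "audio_16k_mono.wav",
        "transcript": job_dir / f"{video_id}_transcript.txt",
        "analysis": job_dir / f"{video_id}_analysis.json",
        "job": job_dir / "job.json",
    }


def audio_expired(audio: Path, ttl_seconds: float, now: Optional[float] = None) -> bool:
    """True when the kept 16 kHz audio is older than the TTL (by mtime)."""
    if not audio.exists():
        return False
    now = time.time() if now is None else now
    return now - audio.stat().st_mtime > ttl_seconds
```

A periodic cleanup task could call `audio_expired` on each job's `audio` path and delete only that file, leaving the transcript and analysis in place.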

Model Resolution

All LLM calls (analysis, find, summarize) go through Odysseus' endpoint resolution:

from src.settings import get_setting
from src.endpoint_resolver import resolve_endpoint

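model = get_setting("video_analysis_model") or get_setting("default_model")
endpoint_id = get_setting("video_analysis_endpoint_id") or get_setting("default_endpoint_id")
# Unpack the endpoint's own model under a separate name so the settings-derived model is not overwritten.
url, endpoint_model, headers = resolve_endpoint(endpoint_id)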

Uses src.llm_core.llm_call_async for the actual calls instead of raw urllib.request.
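
The fallback rule can be checked in isolation from Odysseus. The sketch below uses a plain dict in place of `get_setting`; `pick_model_and_endpoint` is a hypothetical helper, and it relies on the video-specific keys defaulting to empty strings, as in the settings keys above.

```python
from typing import Tuple


def pick_model_and_endpoint(settings: dict) -> Tuple[str, str]:
    """Prefer the video-specific keys; fall back to the defaults when they are empty."""
    model = settings.get("video_analysis_model") or settings.get("default_model", "")
    endpoint_id = (
        settings.get("video_analysis_endpoint_id")
        or settings.get("default_endpoint_id", "")
    )
    return model, endpoint_id
```

Because an empty string is falsy, leaving `video_analysis_model` blank in settings is enough to select the default.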

Key Differences from Discord Bot

No Discord dependency — strip all discord.py imports, use FastAPI routes + SSE/WebSocket for progress
Model from settings — not hardcoded DeepSeek
Frontend progress — SSE stream or polling (like DR's progress tracking) instead of Discord message edits
Summarize tool — new, doesn't exist in the bot
Results viewer — rendered in the overlay, not just file uploads
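
For the SSE option, progress updates could be serialized as in the sketch below. It assumes the pipeline pushes dict updates onto an `asyncio.Queue` and signals completion with `None`; the event names `progress` and `done` and all function names are hypothetical. In a FastAPI route the generator would typically be wrapped in a streaming response with the `text/event-stream` media type.

```python
import asyncio
import json

PHASES = ("download", "transcribe", "analyze")


def sse_frame(event: str, payload: dict) -> str:
    """Serialize one Server-Sent Events frame: event name, JSON data, blank line."""
    return f"event: {event}\ndata: {json.dumps(payload)}\n\n"


async def progress_stream(queue: asyncio.Queue):
    """Yield SSE frames until a None sentinel arrives on the queue."""
    while True:
        update = await queue.get()
        if update is None:
            yield sse_frame("done", {})
            return
        yield sse_frame("progress", update)


async def demo() -> list:
    """Feed one update per phase, then the sentinel, and collect the frames."""
    queue: asyncio.Queue = asyncio.Queue()
    for phase in PHASES:
        queue.put_nowait({"phase": phase, "percent": 100})
    queue.put_nowait(None)
    return [frame async for frame in progress_stream(queue)]


if __name__ == "__main__":
    for frame in asyncio.run(demo()):
        print(frame, end="")
```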

Requirements

Python packages (Parakeet transcription):

nemo_toolkit[asr]   # ~2GB, optional — tool disables gracefully if missing
torchaudio
soundfile

System dependencies:

yt-dlp    # on PATH
ffmpeg    # on PATH

Make NeMo an optional import — the panel shows a setup message if not installed.
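
One way to detect NeMo without importing it at module load is `importlib.util.find_spec`. The sketch below assumes the `nemo_toolkit` package is importable as `nemo`; `module_available`, `transcription_status`, and the message text are hypothetical.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Check whether a top-level module can be imported, without importing it."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        return False


# Assumption: the nemo_toolkit distribution exposes the top-level module "nemo".
NEMO_AVAILABLE = module_available("nemo")


def transcription_status() -> dict:
    """Payload the panel could use to decide between the tool and a setup message."""
    if NEMO_AVAILABLE:
        return {"available": True, "message": ""}
    return {
        "available": False,
        "message": "Parakeet transcription needs nemo_toolkit[asr]; install it to enable Video Analysis.",
    }
```

The heavy `import nemo.collections.asr` can then be deferred to the first transcription call, so the rest of the app starts even when the ~2GB dependency is absent.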

Pitfalls (from production experience)

Timestamp format: Must use [HH:MM:SS], not [MM:SS:00]. In the latter, a minutes field above 59 (any video longer than an hour) breaks duration estimation.
JSON truncation: LLM max_tokens must be 65536 for long videos. 16K truncates mid-object on 4h+ VODs.
Stereo→mono: NeMo needs (batch, time) shape. Convert with np.mean(data, axis=1).
yt-dlp PATH: When running as subprocess, PATH may not include venv bin. Resolve full path at import time: YT_DLP = shutil.which("yt-dlp") or str(Path(sys.executable).parent / "yt-dlp")
Highlight cap: Append instruction to limit to 50 highlights max.
Audio cleanup: Delete original WAV immediately after transcription (2.4GB+). Keep 16kHz mono for reuse.
Long-running: Transcription takes ~1.5 min per hour of video. Must be async with progress reporting via SSE/polling.
Audio pollution: The PROMPT_5hr.md already handles filtering game voice lines, music lyrics, etc.
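
The timestamp pitfall is the easiest one to pin down in code. Below is a minimal sketch of formatting and validating [HH:MM:SS] stamps; the helper names are hypothetical, and `last_timestamp_seconds` is only an illustration of duration estimation, not the pipeline's `estimate_duration`.

```python
import re

# Hours may exceed two digits; minutes and seconds must stay within 00-59.
TS_RE = re.compile(r"\[(\d{2,}):([0-5]\d):([0-5]\d)\]")


def format_timestamp(total_seconds: float) -> str:
    """Render an offset in seconds as [HH:MM:SS]."""
    secs = int(total_seconds)
    return f"[{secs // 3600:02d}:{secs % 3600 // 60:02d}:{secs % 60:02d}]"


def parse_timestamp(stamp: str) -> int:
    """Convert a [HH:MM:SS] stamp to seconds, rejecting out-of-range fields."""
    match = TS_RE.fullmatch(stamp)
    if not match:
        raise ValueError(f"not a valid [HH:MM:SS] timestamp: {stamp!r}")
    hours, minutes, seconds = (int(group) for group in match.groups())
    return hours * 3600 + minutes * 60 + seconds


def last_timestamp_seconds(transcript: str) -> int:
    """Largest valid timestamp found in the transcript, in seconds (0 if none)."""
    return max(
        (int(h) * 3600 + int(m) * 60 + int(s) for h, m, s in TS_RE.findall(transcript)),
        default=0,
    )
```

Validation catches fields above 59, but a stamp such as [75:30:00] written in the wrong [MM:SS:00] layout still parses as 75 hours, which is why the writer side must emit the right format in the first place.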

Reference Code

| Source | What to reuse |
|--------|---------------|
| `~/workspace/vod-pipeline/vod_pipeline.py` | download_audio, preprocess_audio, transcribe_parakeet, build_analysis_prompt, call_deepseek_analysis, estimate_duration, get_video_info |
| `~/workspace/vod-pipeline/discord_bot.py` | /find command logic (FIND_SYSTEM_PROMPT + transcript search), progress parsing, retry logic |
| `~/workspace/clyde-vods/prompts/PROMPT_5hr.md` | Full analysis prompt with audio pollution filtering, chapter/highlight/game extraction |
| `static/js/research/panel.js` | Overlay pattern, progress tracking, job history, section collapse |
| `routes/research_routes.py` | Route pattern for long-running async jobs with progress |
| `src/research_handler.py` | Job lifecycle pattern (pending → running → complete/error) |

Related

vod-pipeline-bot Hermes skill has comprehensive architecture docs for all pitfalls
#921 — Settings persistence (should be fixed first)
#924 — Subagent tool rewrite (video analysis could eventually use role-based model routing)


sleepy added the enhancement and area:tools labels 2026-06-04 12:13:56 +02:00

sleepy changed title from ~~[feature] VOD/Video analysis tool — transcribe + find + summarize as chat tool~~ to [feature] Video Analysis side panel — transcribe + find + summarize (like Deep Research panel)

2026-06-04 12:18:10 +02:00