[frontend/multimodal] Add video upload support to chat for direct video input to multimodal models (Gemma 4 12B) #826


Closed

opened 2026-06-03 19:07:55 +02:00 by sleepy · 0 comments

sleepy commented

2026-06-03 19:07:55 +02:00

Owner

Context

Gemma 4 12B supports native video input: frames sampled at 1 FPS from clips of up to 60 seconds are passed directly to the model alongside any text prompt. The OpenAI-compatible API accepts video either as a sequence of image frames or as a video_url content type:

{
  "role": "user",
  "content": [
    {"type": "text", "text": "Describe what happens in this video"},
    {"type": "video_url", "video_url": {"url": "data:video/mp4;base64,<base64-encoded-video>"}}
  ]
}
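
As a minimal sketch of how that payload could be assembled once the base64 data is available, the helper below wraps a prompt and an encoded video in the content shape shown above. The function name `buildVideoMessage` and its parameters are hypothetical and not taken from the codebase.

```javascript
// Hypothetical helper: wraps a prompt and a base64-encoded video in the
// video_url content shape shown above. Names are illustrative only.
function buildVideoMessage(promptText, base64Video, mimeType = "video/mp4") {
  return {
    role: "user",
    content: [
      { type: "text", text: promptText },
      {
        type: "video_url",
        // Data URL: "data:<mime>;base64,<payload>"
        video_url: { url: `data:${mimeType};base64,${base64Video}` },
      },
    ],
  };
}
```

Per requirement 3 below, this assembly would most likely live on the backend side; the frontend only ships the raw base64 and the format.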

The frontend already has file attachment infrastructure (attachments field in chat submission). We need to extend it to support video files as inline multimodal content rather than just document attachments.

Requirements

1. Video upload in chat bar

- Add a video/clip icon button next to the existing attachment button in the chat input area, or extend the existing file picker to accept video MIME types (video/mp4, video/webm)
- Show or enable the control only when the active model supports video input
- Accept mp4 and webm, the common browser-friendly formats
- Max duration: 60 seconds (enforce client-side; reject longer videos with a clear message)
- Max file size: a reasonable limit (e.g. 50MB)
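
The limits above could be checked with a small pure function, sketched here under a few assumptions: the names `validateVideo`, `ALLOWED_VIDEO_TYPES`, `MAX_VIDEO_SECONDS` and `MAX_VIDEO_BYTES` are hypothetical, the "e.g. 50MB" limit is read as 50 MiB, and the error strings are placeholders.

```javascript
// Hypothetical constants mirroring the limits listed above.
const ALLOWED_VIDEO_TYPES = ["video/mp4", "video/webm"];
const MAX_VIDEO_SECONDS = 60;
const MAX_VIDEO_BYTES = 50 * 1024 * 1024; // assumption: "50MB" read as 50 MiB

// Takes plain values so it can be tested without a browser.
// `type` and `size` would come from the selected File; `duration` from a
// <video> element after its "loadedmetadata" event has fired.
function validateVideo({ type, size, duration }) {
  if (!ALLOWED_VIDEO_TYPES.includes(type)) {
    return { ok: false, error: "Unsupported format. Please choose an MP4 or WebM video." };
  }
  if (size > MAX_VIDEO_BYTES) {
    return { ok: false, error: "Video is too large (limit: 50 MB)." };
  }
  // duration is NaN before metadata loads and can be Infinity for
  // unbounded streams; both are rejected here.
  if (!Number.isFinite(duration) || duration > MAX_VIDEO_SECONDS) {
    return { ok: false, error: "Video is longer than 60 seconds. Please trim it and try again." };
  }
  return { ok: true, error: null };
}
```

In the browser, the caller might load the file into a temporary `<video>` via `URL.createObjectURL(file)`, wait for `loadedmetadata`, and then pass `file.type`, `file.size` and `video.duration` to this function.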

2. Video preview

- After a video is selected, show a thumbnail preview with a duration badge in the chat input area (similar to the existing image attachment preview)
- Allow removing the video before sending
- Show video metadata: duration, resolution
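
For the duration badge and metadata line, a sketch of the formatting could look like the following. `formatDuration` and `describeVideo` are hypothetical names, and the exact label layout is an assumption rather than a design decision from this issue.

```javascript
// Formats a duration in seconds as "m:ss" for the badge, e.g. 42.4 -> "0:42".
function formatDuration(seconds) {
  const total = Math.max(0, Math.round(seconds));
  const minutes = Math.floor(total / 60);
  const secs = String(total % 60).padStart(2, "0");
  return `${minutes}:${secs}`;
}

// Builds a one-line metadata label from duration and resolution.
// In the browser these values would come from video.duration,
// video.videoWidth and video.videoHeight once metadata has loaded.
function describeVideo({ duration, width, height }) {
  return `${formatDuration(duration)} · ${width}×${height}`;
}
```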

3. Direct video passthrough to backend

- Base64-encode the video file
- Send it as video_data (base64) and video_format ("mp4" or "webm") in the FormData
- The backend handles frame extraction and passes the frames to the model
- Do NOT process or extract frames client-side; leave that to the backend
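
The passthrough steps above can be sketched as follows. Only the `video_data` and `video_format` field names come from this issue; the helper names and the `message` field are hypothetical. Encoding is done in chunks because spreading a multi-megabyte array into a single `String.fromCharCode` call can exceed the engine's argument limit.

```javascript
// Encodes raw bytes (Uint8Array) as base64 without building one huge
// argument list: the input is converted in 32 KiB chunks.
function bytesToBase64(bytes) {
  let binary = "";
  const chunkSize = 0x8000;
  for (let i = 0; i < bytes.length; i += chunkSize) {
    binary += String.fromCharCode.apply(null, bytes.subarray(i, i + chunkSize));
  }
  return btoa(binary);
}

// Builds the submission body. "message" is an assumed field name;
// video_data and video_format are the fields named in this issue.
function buildVideoFormData(message, bytes, mimeType) {
  const form = new FormData();
  form.append("message", message);
  form.append("video_data", bytesToBase64(bytes));
  // "video/mp4" -> "mp4"; any ";codecs=..." suffix is dropped first.
  form.append("video_format", mimeType.split(";")[0].split("/")[1]);
  return form;
}
```

In the browser the bytes could be obtained with `new Uint8Array(await file.arrayBuffer())`. Base64 makes the payload roughly a third larger than the file, so a 50 MB video becomes about 67 MB on the wire, which is worth keeping in mind for any server-side request size limit.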

4. UI placement

- Video upload button: next to existing image/file attachment button
- Video preview: inline in chat input area, above the text input
- Supported video MIME types for file picker: video/mp4, video/webm

5. Fallback for non-multimodal models

If the active model doesn't support video, either:
- Hide the video upload button entirely, OR
- Show it greyed out with tooltip "Current model does not support video input"
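
Either fallback can be derived from a single capability check, as in this sketch. The `supportsVideo` flag on the model object and the `videoButtonState` function are assumptions for illustration; the actual model metadata in the codebase may expose this differently.

```javascript
const VIDEO_UNSUPPORTED_TOOLTIP = "Current model does not support video input";

// Returns how the video upload button should render for the given model.
// `fallback` selects between the two options above: "hide" or "disable".
// `model.supportsVideo` is a hypothetical capability flag.
function videoButtonState(model, fallback = "hide") {
  const supported = Boolean(model && model.supportsVideo);
  if (supported) {
    return { visible: true, disabled: false, tooltip: "" };
  }
  if (fallback === "disable") {
    return { visible: true, disabled: true, tooltip: VIDEO_UNSUPPORTED_TOOLTIP };
  }
  return { visible: false, disabled: true, tooltip: "" };
}
```

Re-running this whenever the active model changes keeps the button in sync, and any already attached video could be cleared at the same point.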

Files to modify

- static/js/chat.js: video file handling in message submission
- static/index.html: video upload button in chat bar, video preview container
- static/css/*.css: video thumbnail styling, upload button styling
- Possibly routes/upload_routes.py: if video needs server-side processing before sending to model

Acceptance criteria

- [ ] Video upload button appears when a multimodal model is selected
- [ ] mp4/webm files can be selected via the file picker
- [ ] Video preview with thumbnail and duration is shown before sending
- [ ] Video is rejected if longer than 60 seconds, with a clear error message
- [ ] Video is base64-encoded and sent to the backend
- [ ] Video can be removed before sending
- [ ] Button is hidden or greyed out for non-multimodal models


sleepy referenced this issue from a commit

2026-06-03 19:42:44 +02:00

[frontend] Add video upload support for multimodal models (#826)

sleepy referenced this issue from a pull request that will close it

2026-06-03 19:43:10 +02:00

[frontend] Add video upload support for multimodal models (#826) #830

sleepy closed this issue

2026-06-03 19:45:02 +02:00