[frontend/multimodal] Add video upload support to chat for direct video input to multimodal models (Gemma 4 12B) #826

Closed
opened 2026-06-03 19:07:55 +02:00 by sleepy · 0 comments
Owner

Context

Gemma 4 12B supports native video input — video frames (at 1 FPS, up to 60 seconds) are passed directly to the model alongside any text prompt. The OpenAI-compatible API accepts video as a sequence of image frames or as a video_url content type:

{
  "role": "user",
  "content": [
    {"type": "text", "text": "Describe what happens in this video"},
    {"type": "video_url", "video_url": {"url": "data:video/mp4;base64,<base64-encoded-video>"}}
  ]
}
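For illustration, the message above could be assembled from an already base64-encoded file roughly as follows. This is a minimal sketch: `buildVideoMessage` and its parameters are hypothetical names, and the issue does not say whether this object is built in the frontend or in the backend.

```javascript
// Hypothetical helper: builds the OpenAI-style user message shown above.
// promptText  - the user's text prompt
// base64Video - the video file contents, already base64-encoded
// mimeType    - "video/mp4" or "video/webm"
function buildVideoMessage(promptText, base64Video, mimeType) {
  return {
    role: "user",
    content: [
      { type: "text", text: promptText },
      {
        type: "video_url",
        // Data URL carrying the whole video inline.
        video_url: { url: `data:${mimeType};base64,${base64Video}` },
      },
    ],
  };
}
```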

The frontend already has file attachment infrastructure (attachments field in chat submission). We need to extend it to support video files as inline multimodal content rather than just document attachments.

Requirements

1. Video upload in chat bar

  • Either add a video/clip icon button next to the existing attachment button in the chat input area,
  • or extend the existing file picker to accept video MIME types (video/mp4, video/webm)
  • Only visible/active when the active model supports video input
  • Accept: mp4, webm — common browser-friendly formats
  • Max duration: 60 seconds (enforce client-side, reject longer videos with clear message)
  • Max file size: reasonable limit (e.g. 50MB)
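The limits above could be checked client-side along these lines. This is only a sketch: `validateVideo`, the constant names, and the error wording are assumptions, and the 50 MB cap is the example value from this issue rather than a decided limit.

```javascript
// Limits taken from the requirements above; names are hypothetical.
const ALLOWED_VIDEO_TYPES = ["video/mp4", "video/webm"];
const MAX_DURATION_SECONDS = 60;
const MAX_FILE_BYTES = 50 * 1024 * 1024; // assumed from "e.g. 50MB"

// Returns null when the video is acceptable, otherwise a user-facing message.
// `type` and `size` come from the File object; `duration` (in seconds) has to
// be read from the loaded video metadata before calling this.
function validateVideo({ type, size, duration }) {
  if (!ALLOWED_VIDEO_TYPES.includes(type)) {
    return "Unsupported video format. Please use MP4 or WebM.";
  }
  if (size > MAX_FILE_BYTES) {
    return "Video is too large. The maximum file size is 50 MB.";
  }
  if (!Number.isFinite(duration)) {
    // Duration could not be determined, so the 60 s rule cannot be enforced.
    return "Could not read the video duration.";
  }
  if (duration > MAX_DURATION_SECONDS) {
    return "Video is too long. The maximum duration is 60 seconds.";
  }
  return null;
}
```

Returning a message instead of throwing keeps the check easy to wire into whatever error display the chat bar already uses.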

2. Video preview

  • After selecting a video, show a thumbnail preview with a duration badge in the chat input area (similar to the existing image attachment preview)
  • Allow removing the video before sending
  • Show video metadata: duration, resolution
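One way to obtain the duration and resolution for the preview is to load the file into an off-screen `<video>` element and read its metadata. The sketch below assumes that approach; `loadVideoMetadata`, `formatDuration`, and `describeVideo` are hypothetical names, and the badge format (`m:ss`) is a guess at what the UI wants.

```javascript
// Browser-only: reads duration and resolution without decoding the whole file.
function loadVideoMetadata(file) {
  return new Promise((resolve, reject) => {
    const url = URL.createObjectURL(file);
    const video = document.createElement("video");
    video.preload = "metadata";
    video.onloadedmetadata = () => {
      const meta = {
        duration: video.duration, // seconds
        width: video.videoWidth,
        height: video.videoHeight,
      };
      URL.revokeObjectURL(url);
      resolve(meta);
    };
    video.onerror = () => {
      URL.revokeObjectURL(url);
      reject(new Error("Could not read video metadata"));
    };
    video.src = url;
  });
}

// Formats seconds as "m:ss" for the duration badge.
function formatDuration(seconds) {
  const total = Math.floor(seconds);
  const minutes = Math.floor(total / 60);
  const secs = total % 60;
  return `${minutes}:${String(secs).padStart(2, "0")}`;
}

// Builds the metadata label shown next to the thumbnail.
function describeVideo({ duration, width, height }) {
  return `${formatDuration(duration)} · ${width}×${height}`;
}
```

The same `duration` value can feed the 60-second check, so the file only needs to be probed once.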

3. Direct video passthrough to backend

  • Base64-encode the video file
  • Send as video_data (base64) and video_format ("mp4", "webm") in the FormData
  • Backend will handle frame extraction and passing to the model
  • Do NOT process/extract frames client-side — leave that to the backend
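The passthrough step could look roughly like this. It is a sketch under assumptions: the field names `video_data` and `video_format` come from this issue, while `bytesToBase64`, `videoFormatFromMime`, and `appendVideoToFormData` are hypothetical helper names. No frames are touched; the file bytes are encoded as-is.

```javascript
// Encodes raw bytes as base64. Works in chunks so that large files do not
// exceed the argument limit of String.fromCharCode.
function bytesToBase64(bytes) {
  let binary = "";
  const chunkSize = 0x8000;
  for (let i = 0; i < bytes.length; i += chunkSize) {
    binary += String.fromCharCode.apply(null, bytes.subarray(i, i + chunkSize));
  }
  return btoa(binary);
}

// Maps a MIME type to the video_format value; null if unsupported.
function videoFormatFromMime(mimeType) {
  const formats = { "video/mp4": "mp4", "video/webm": "webm" };
  return formats[mimeType] ?? null;
}

// Adds the encoded video to an existing FormData for chat submission.
async function appendVideoToFormData(formData, file) {
  const format = videoFormatFromMime(file.type);
  if (!format) {
    throw new Error(`Unsupported video type: ${file.type}`);
  }
  const bytes = new Uint8Array(await file.arrayBuffer());
  formData.append("video_data", bytesToBase64(bytes));
  formData.append("video_format", format);
  return formData;
}
```

Base64 grows the payload by about one third, so a 50 MB file becomes roughly 67 MB of form data; any request size limit on the backend would need to allow for that.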

4. UI placement

  • Video upload button: next to existing image/file attachment button
  • Video preview: inline in chat input area, above the text input
  • Supported video MIME types for file picker: video/mp4, video/webm

5. Fallback for non-multimodal models

  • If model doesn't support video, either:
    • Hide the video upload button entirely, OR
    • Show it greyed out with tooltip "Current model does not support video input"
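Both fallback options can be driven by one small function, which keeps the choice between them a one-line change. This is a sketch only: the `supportsVideo` flag on the model object, the `mode` parameter, and the "Attach video" tooltip are assumptions, since the issue does not say how model capabilities are exposed to the frontend.

```javascript
// Decides how the video upload button should render for the active model.
// mode: "hide" removes the button, "disable" shows it greyed out with a tooltip.
function videoButtonState(model, mode = "hide") {
  // `supportsVideo` is a hypothetical capability flag.
  const supportsVideo = Boolean(model && model.supportsVideo);
  if (supportsVideo) {
    return { visible: true, disabled: false, tooltip: "Attach video" };
  }
  if (mode === "hide") {
    return { visible: false, disabled: true, tooltip: "" };
  }
  return {
    visible: true,
    disabled: true,
    tooltip: "Current model does not support video input",
  };
}
```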

Files to modify

  • static/js/chat.js — video file handling in message submission
  • static/index.html — video upload button in chat bar, video preview container
  • static/css/*.css — video thumbnail styling, upload button styling
  • Possibly routes/upload_routes.py — if video needs server-side processing before sending to model

Acceptance criteria

  • Video upload button appears when a model that supports video input is selected
  • Can select mp4/webm files via file picker
  • Video preview with thumbnail and duration shown before sending
  • Video rejected if longer than 60 seconds, with a clear error message
  • Video base64-encoded and sent to backend
  • Can remove video before sending
  • Button hidden/greyed out for models that do not support video input
Reference
sleepy/odysseus#826