[frontend/multimodal] Add mic button to chat bar for direct audio input to multimodal models (Gemma 4 12B) #825

Closed
opened 2026-06-03 19:06:54 +02:00 by sleepy · 0 comments
Owner

Context

Gemma 4 12B supports native audio input — audio is passed directly to the model (no separate STT transcription step). The model accepts audio via the OpenAI-compatible input_audio content type:

{
  "role": "user",
  "content": [
    {"type": "text", "text": "What do you hear in this audio?"},
    {"type": "input_audio", "input_audio": {"data": "<base64-encoded-audio>", "format": "wav"}}
  ]
}

The frontend already has static/js/voiceRecorder.js with full MediaRecorder infrastructure, but it currently only supports STT transcription modes (browser/local/endpoint). When a multimodal model like Gemma 4 12B is selected, the mic button should send the raw audio directly to the model as an input_audio content block instead of transcribing it first.
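
As a rough illustration of the message shape above, the content array could be assembled like this. The helper name buildAudioMessage is hypothetical and not part of the existing codebase; only the role/content/input_audio structure comes from the example in this issue.

```javascript
// Hypothetical helper (not in the codebase): builds a user message in the
// shape shown above, with an optional text part and one input_audio part.
function buildAudioMessage(text, audioBase64, format = "wav") {
  const content = [];
  // Only include the text part when the user actually typed something.
  if (typeof text === "string" && text.trim() !== "") {
    content.push({ type: "text", text });
  }
  content.push({
    type: "input_audio",
    input_audio: { data: audioBase64, format },
  });
  return { role: "user", content };
}
```

Whether this object is built in the frontend or in the backend (from the FormData fields in requirement 3) is left open here.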

Requirements

1. Mic button in chat bar

  • Add a microphone icon button next to the send button in the chat input area
  • Only visible when the active model supports audio input (backend can expose this via model capabilities)
  • Click starts recording, click again stops (or hold-to-record)
  • Visual feedback: button pulses red while recording, with a timer display
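
A minimal sketch of the click-to-toggle behavior and timer label, kept free of DOM calls so it can be wired to whatever button markup ends up in index.html. All names here (formatElapsed, createMicToggle, the recorder object with start()/stop(), the onState callback) are assumptions for illustration and do not reflect the current voiceRecorder.js API.

```javascript
// Formats elapsed milliseconds as m:ss for the recording timer label.
function formatElapsed(ms) {
  const totalSeconds = Math.floor(ms / 1000);
  const minutes = Math.floor(totalSeconds / 60);
  const seconds = String(totalSeconds % 60).padStart(2, "0");
  return `${minutes}:${seconds}`;
}

// Click-to-toggle controller. `recorder` is a hypothetical object exposing
// start() and stop(); `onState` receives { recording, label } so the caller
// can add/remove a "recording" CSS class and update the timer text.
function createMicToggle(recorder, onState, now = () => Date.now()) {
  let startedAt = null;
  return {
    // First click starts recording, second click stops it.
    click() {
      if (startedAt === null) {
        startedAt = now();
        recorder.start();
        onState({ recording: true, label: formatElapsed(0) });
      } else {
        recorder.stop();
        startedAt = null;
        onState({ recording: false, label: "" });
      }
    },
    // Call periodically (e.g. from setInterval) to refresh the timer label.
    tick() {
      if (startedAt !== null) {
        onState({ recording: true, label: formatElapsed(now() - startedAt) });
      }
    },
    isRecording: () => startedAt !== null,
  };
}
```

Hold-to-record could reuse the same controller by calling click() on pointer down and again on pointer up.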

2. Direct audio passthrough mode

  • When the selected model has audio_input: true capability:
    • Record audio as WAV (or record WebM and convert it; section 4 lists the format Gemma 4 expects)
    • Base64-encode the recorded audio
    • Send as input_audio content block in the chat message alongside any text
    • Do NOT transcribe via STT — pass raw audio directly
  • When the model does NOT support audio:
    • Fall back to existing STT behavior (transcribe → insert text)
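
The mode decision above might look roughly like the following. The model object shape (model.capabilities.audio_input) and the "direct" mode name are assumptions; the issue only states that the backend can expose an audio_input capability and that the existing STT modes are browser/local/endpoint.

```javascript
// Hypothetical selector: returns "direct" for audio-capable models,
// otherwise the currently configured STT mode ("browser", "local" or "endpoint").
function selectVoiceMode(model, sttMode = "browser") {
  const capabilities = (model && model.capabilities) || {};
  // Strict check so a missing or malformed flag falls back to STT.
  return capabilities.audio_input === true ? "direct" : sttMode;
}
```

Defaulting to STT when the capability is absent keeps the current behavior unchanged for every model that does not explicitly opt in.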

3. Frontend message format

  • The chat_stream endpoint currently accepts FormData with message + attachments
  • Add two new fields to the FormData: audio_data (base64-encoded audio) and audio_format (e.g. "wav", "webm")
  • Alternatively, add the audio as a special attachment type that the backend recognizes as inline multimodal content
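
For the first option, a sketch of how the two fields could be attached. The field names audio_data and audio_format come from this issue; the helper names bytesToBase64 and appendAudioFields are made up for illustration, and the backend handling of these fields is not shown.

```javascript
// Base64-encodes raw bytes. The input is processed in chunks because passing
// a very large array to String.fromCharCode at once can exceed the engine's
// argument limit.
function bytesToBase64(bytes) {
  let binary = "";
  const chunkSize = 0x8000;
  for (let i = 0; i < bytes.length; i += chunkSize) {
    binary += String.fromCharCode.apply(null, bytes.subarray(i, i + chunkSize));
  }
  return btoa(binary);
}

// Hypothetical helper: adds the proposed audio fields to an existing FormData
// (or any object with a compatible append method) and returns it.
function appendAudioFields(formData, audioBytes, format) {
  formData.append("audio_data", bytesToBase64(audioBytes));
  formData.append("audio_format", format);
  return formData;
}
```

In chat.js this would be called on the same FormData that already carries message and attachments, right before the request to chat_stream is sent.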

4. Audio format considerations

  • Gemma 4 expects audio as 16 kHz, mono, 16-bit PCM (standard WAV)
  • The frontend should record with MediaRecorder using the audio/wav MIME type where MediaRecorder.isTypeSupported("audio/wav") returns true; otherwise record WebM and convert it to WAV via AudioContext before base64 encoding
  • Max duration: 30 seconds for Gemma 4 audio input
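
A sketch of the WebM to WAV conversion step, assuming the recorded blob has already been decoded to mono Float32 samples (in the browser, for example via AudioContext.decodeAudioData and AudioBuffer.getChannelData(0)). The function names are hypothetical, and the resampler is a naive linear interpolation without an anti-aliasing filter, so treat it as a starting point rather than production-quality audio processing.

```javascript
// Naive linear-interpolation resampler (no low-pass filtering).
function resampleLinear(samples, fromRate, toRate) {
  if (fromRate === toRate) return samples;
  const outLength = Math.floor((samples.length * toRate) / fromRate);
  const out = new Float32Array(outLength);
  const step = fromRate / toRate;
  const last = samples.length - 1;
  for (let i = 0; i < outLength; i++) {
    const pos = i * step;
    const i0 = Math.min(Math.floor(pos), last);
    const i1 = Math.min(i0 + 1, last);
    const frac = pos - i0;
    out[i] = samples[i0] * (1 - frac) + samples[i1] * frac;
  }
  return out;
}

// Encodes mono Float32 samples in [-1, 1] as a 16-bit PCM WAV file
// (44-byte RIFF header followed by little-endian samples).
function encodeWavPcm16(samples, sampleRate) {
  const bytesPerSample = 2;
  const dataSize = samples.length * bytesPerSample;
  const buffer = new ArrayBuffer(44 + dataSize);
  const view = new DataView(buffer);
  const writeAscii = (offset, text) => {
    for (let i = 0; i < text.length; i++) {
      view.setUint8(offset + i, text.charCodeAt(i));
    }
  };
  writeAscii(0, "RIFF");
  view.setUint32(4, 36 + dataSize, true); // remaining file size
  writeAscii(8, "WAVE");
  writeAscii(12, "fmt ");
  view.setUint32(16, 16, true); // fmt chunk size for PCM
  view.setUint16(20, 1, true); // audio format 1 = PCM
  view.setUint16(22, 1, true); // one channel (mono)
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * bytesPerSample, true); // byte rate
  view.setUint16(32, bytesPerSample, true); // block align
  view.setUint16(34, 16, true); // bits per sample
  writeAscii(36, "data");
  view.setUint32(40, dataSize, true);
  for (let i = 0; i < samples.length; i++) {
    // Clamp, then scale to the signed 16-bit range.
    const s = Math.max(-1, Math.min(1, samples[i]));
    view.setInt16(44 + i * 2, s < 0 ? s * 0x8000 : s * 0x7fff, true);
  }
  return buffer;
}
```

A typical call would be encodeWavPcm16(resampleLinear(channelData, audioBuffer.sampleRate, 16000), 16000), with the resulting ArrayBuffer wrapped in a Uint8Array for base64 encoding.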

Files to modify

  • static/js/voiceRecorder.js — add direct audio mode alongside existing STT modes
  • static/js/chat.js — handle audio attachment in message submission, add mic button to UI
  • static/index.html — mic button HTML element in chat bar
  • static/css/*.css — mic button styling

Acceptance criteria

  • Mic button appears in chat bar when a multimodal model is selected
  • Recording works, produces WAV audio
  • Audio is base64-encoded and sent to backend as input_audio content
  • Falls back to STT for non-multimodal models
  • Visual recording indicator (pulsing red, timer)
  • 30-second maximum recording length with auto-stop
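
The auto-stop criterion could be covered with a small timer wrapper such as the one below. The names (MAX_RECORDING_MS, armAutoStop) are hypothetical; the schedule/cancel parameters exist only so the timer can be replaced in tests and default to the standard setTimeout/clearTimeout.

```javascript
// 30-second limit taken from the audio format considerations in this issue.
const MAX_RECORDING_MS = 30000;

// Schedules stopRecording after maxMs and returns a function that cancels
// the pending auto-stop (call it when the user stops recording manually).
function armAutoStop(
  stopRecording,
  maxMs = MAX_RECORDING_MS,
  schedule = (fn, ms) => setTimeout(fn, ms),
  cancel = (id) => clearTimeout(id)
) {
  const timerId = schedule(stopRecording, maxMs);
  return () => cancel(timerId);
}
```

Calling the returned function on a manual stop avoids a stale timer firing during the next recording.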
Reference
sleepy/odysseus#825