[frontend/multimodal] Add mic button to chat bar for direct audio input to multimodal models (Gemma 4 12B) #825

Closed

opened 2026-06-03 19:06:54 +02:00 by sleepy · 0 comments

sleepy commented

2026-06-03 19:06:54 +02:00

Owner

Context

Gemma 4 12B supports native audio input — audio is passed directly to the model (no separate STT transcription step). The model accepts audio via the OpenAI-compatible input_audio content type:

{
  "role": "user",
  "content": [
    {"type": "text", "text": "What do you hear in this audio?"},
    {"type": "input_audio", "input_audio": {"data": "<base64-encoded-audio>", "format": "wav"}}
  ]
}
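
As a rough sketch of how the frontend might assemble a message in this shape: the helper name buildAudioMessage is hypothetical and is not existing code in chat.js. It omits the text part when the user recorded audio without typing anything.

```javascript
// Hypothetical helper (not part of the existing codebase): builds one user
// message in the input_audio shape quoted in the issue.
function buildAudioMessage(text, base64Audio, format = "wav") {
  const content = [];
  // Only include a text part when the user actually typed something.
  if (typeof text === "string" && text.trim() !== "") {
    content.push({ type: "text", text });
  }
  content.push({
    type: "input_audio",
    input_audio: { data: base64Audio, format },
  });
  return { role: "user", content };
}
```

Usage: buildAudioMessage("What do you hear in this audio?", base64Wav) returns an object with one text part followed by one input_audio part.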

The frontend already has static/js/voiceRecorder.js with full MediaRecorder infrastructure, but it currently only supports STT transcription modes (browser/local/endpoint). When a multimodal model like Gemma 4 12B is selected, the mic button should send the raw audio directly to the model as an input_audio content block instead of transcribing it first.

Requirements

1. Mic button in chat bar

- Add a microphone icon button next to the send button in the chat input area
- Only visible when the active model supports audio input (backend can expose this via model capabilities)
- Click starts recording, click again stops (or hold-to-record)
- Visual feedback: button pulses/red while recording, timer display
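
The click-to-toggle behaviour and the timer display could be kept as small pure functions, separate from the DOM wiring. This is only a sketch: toggleRecording and formatElapsed are hypothetical names, and the state strings "idle" and "recording" are assumptions, not values taken from voiceRecorder.js.

```javascript
// Hypothetical click handler logic: given the current state, return the next
// state and the recorder action the caller should run.
function toggleRecording(state) {
  return state === "recording"
    ? { state: "idle", action: "stop" }
    : { state: "recording", action: "start" };
}

// Formats elapsed milliseconds as m:ss for the timer next to the mic button.
function formatElapsed(ms) {
  const totalSeconds = Math.floor(Math.max(0, ms) / 1000);
  const minutes = Math.floor(totalSeconds / 60);
  const seconds = String(totalSeconds % 60).padStart(2, "0");
  return `${minutes}:${seconds}`;
}
```

The caller would apply the returned state as a CSS class on the button (for the pulsing red style) and refresh the timer text from formatElapsed on an interval.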

2. Direct audio passthrough mode

- When the selected model has audio_input: true capability:
  - Record audio as WAV (or record WebM and convert it; check what Gemma 4 expects)
  - Base64-encode the recorded audio
  - Send as input_audio content block in the chat message alongside any text
  - Do NOT transcribe via STT; pass raw audio directly
- When the model does NOT support audio:
  - Fall back to existing STT behavior (transcribe → insert text)
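
A minimal sketch of the mode switch and the base64 step, under stated assumptions: the audio_input flag comes from the issue, but the model.capabilities object shape and the names selectAudioMode and bytesToBase64 are hypothetical. btoa is the standard browser function; the bytes are chunked so that String.fromCharCode is never called with too many arguments on longer recordings.

```javascript
// Decide between direct passthrough and the existing STT path.
// ASSUMPTION: the backend exposes capabilities as model.capabilities.audio_input.
function selectAudioMode(model) {
  const caps = (model && model.capabilities) || {};
  return caps.audio_input === true ? "direct" : "stt";
}

// Base64-encode raw bytes (Uint8Array) using the built-in btoa.
function bytesToBase64(bytes) {
  let binary = "";
  const chunkSize = 0x8000; // keep argument lists small
  for (let i = 0; i < bytes.length; i += chunkSize) {
    binary += String.fromCharCode.apply(null, bytes.subarray(i, i + chunkSize));
  }
  return btoa(binary);
}
```

In "direct" mode the recorded bytes go through bytesToBase64 and are sent to the model; in "stt" mode the recorder keeps using its current transcription path.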

3. Frontend message format

- The chat_stream endpoint currently accepts FormData with message + attachments
- Add a new field audio_data (base64-encoded) and audio_format (e.g. "wav", "webm") to the FormData
- Alternatively, add the audio as a special attachment type that the backend recognizes as inline multimodal content
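
For the first option, the submission code might add the two fields as below. This is a sketch only: appendAudioFields is a hypothetical name, and it assumes the backend reads audio_data and audio_format as plain string fields, which the issue proposes but does not confirm.

```javascript
// Hypothetical helper: adds the proposed audio fields to an existing FormData
// (or any object with a compatible append method) and returns it.
function appendAudioFields(formData, base64Audio, format) {
  if (!base64Audio) {
    return formData; // nothing recorded, leave the form untouched
  }
  formData.append("audio_data", base64Audio);
  formData.append("audio_format", format);
  return formData;
}
```

Usage in the browser: create the FormData as chat.js does today, append message and attachments, then call appendAudioFields(formData, base64Wav, "wav") before posting to chat_stream.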

4. Audio format considerations

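- Gemma 4 expects audio at 16 kHz mono 16-bit PCM (standard WAV)
- The frontend should use MediaRecorder with the audio/wav MIME type only where MediaRecorder.isTypeSupported("audio/wav") returns true; many browsers record WebM or MP4 instead, so otherwise decode the recording and convert it to WAV via AudioContext before base64 encoding
- Max duration: 30 seconds for Gemma 4 audio input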
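
The WAV packaging step can be sketched as follows, assuming the recording has already been decoded, downmixed to mono and resampled to 16 kHz (for example with AudioContext.decodeAudioData and an OfflineAudioContext, not shown here). The function and constant names are hypothetical; the 16 kHz and 30 second values are the ones stated in this issue.

```javascript
const TARGET_SAMPLE_RATE = 16000; // 16 kHz mono, as stated in the issue
const MAX_SECONDS = 30;           // recording limit stated in the issue

// Wraps mono Float32 samples in [-1, 1] in a 44-byte RIFF/WAVE header as
// 16-bit little-endian PCM. Returns a Uint8Array holding the whole file.
function encodeWavPcm16(samples, sampleRate = TARGET_SAMPLE_RATE) {
  const dataBytes = samples.length * 2;
  const buffer = new ArrayBuffer(44 + dataBytes);
  const view = new DataView(buffer);
  const writeAscii = (offset, text) => {
    for (let i = 0; i < text.length; i++) {
      view.setUint8(offset + i, text.charCodeAt(i));
    }
  };
  writeAscii(0, "RIFF");
  view.setUint32(4, 36 + dataBytes, true);  // file size minus 8 bytes
  writeAscii(8, "WAVE");
  writeAscii(12, "fmt ");
  view.setUint32(16, 16, true);             // fmt chunk size
  view.setUint16(20, 1, true);              // audio format 1 = PCM
  view.setUint16(22, 1, true);              // channels: mono
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * 2, true); // byte rate = rate * 2 bytes
  view.setUint16(32, 2, true);              // block align
  view.setUint16(34, 16, true);             // bits per sample
  writeAscii(36, "data");
  view.setUint32(40, dataBytes, true);
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i])); // clamp before scaling
    const scaled = s < 0 ? s * 0x8000 : s * 0x7fff;
    view.setInt16(44 + i * 2, Math.round(scaled), true);
  }
  return new Uint8Array(buffer);
}

// True when a clip of this many samples is longer than the stated limit.
function exceedsMaxDuration(sampleCount, sampleRate = TARGET_SAMPLE_RATE) {
  return sampleCount / sampleRate > MAX_SECONDS;
}
```

The resulting bytes can then be base64-encoded and sent with format "wav". The recorder would still need its own auto-stop timer; exceedsMaxDuration only guards the final clip length.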

Files to modify

- static/js/voiceRecorder.js: add direct audio mode alongside existing STT modes
- static/js/chat.js: handle audio attachment in message submission, add mic button to UI
- static/index.html: mic button HTML element in chat bar
- static/css/*.css: mic button styling

Acceptance criteria

- [ ] Mic button appears in chat bar when a multimodal (audio-capable) model is selected
- [ ] Recording works and produces WAV audio
- [ ] Audio is base64-encoded and sent to backend as input_audio content
- [ ] Falls back to STT for non-multimodal models
- [ ] Visual recording indicator (pulsing red, timer)
- [ ] Max 30 second recording limit with auto-stop

sleepy referenced this issue from a commit

2026-06-03 19:33:39 +02:00

[frontend] Add mic button for direct audio input to multimodal models (#825)

sleepy referenced this issue

2026-06-03 19:33:51 +02:00

[frontend] Add mic button for direct audio input to multimodal models (#825) #829

sleepy referenced this issue from a commit

2026-06-03 19:39:43 +02:00

fix: review fixes for mic button (#825)

sleepy referenced this issue

2026-06-03 19:43:10 +02:00

[frontend] Add video upload support for multimodal models (#826) #830