
Audio Streaming Architecture

Last Updated: 2026-04-29

This document describes the audio paths in memQL: the audio WebSocket (browser-based STT/TTS for spaces), the gRPC streaming transcription flow on MemqlService.Stream, and the Polyphon pipeline (multi-agent real-time voice conversations).

Overview

memQL provides three audio paths, each for a different use case:

  1. Audio WebSocket (/memql/audio) -- legacy browser path for in-space STT and the "Read Aloud" TTS feature. Users speak into their mic; audio is transcribed and committed as a v1:cognition:utterance. Still in production use.

  2. gRPC streaming transcription -- canonical path for new clients. AiTranscribeStreamStart / Chunk / End (client -> server) plus AiTranscribeStreamDelta / Complete (server -> client) on MemqlService.Stream. The voice node owns the provider session; the BFF proxies via AiForwardRouter.ForwardContinuation. See component/grpc/ai_transcribe_stream.go.

  3. Polyphon Pipeline -- multi-agent, multi-human real-time voice conversations. LiveKit for audio transport, a Bridge Agent for ASR/TTS, and the cognition pipeline for turn-taking decisions.

text
┌─────────────────────────────────────────────────────────────────────────┐
│                              MEMQL SERVER                               │
│                                                                         │
│  /memql/ws (gRPC tunneled over WS)        /memql/audio (legacy WS)      │
│  - All gRPC messages incl. AiTranscribe*  - In-space STT + TTS chunks   │
│  - Queries / mutations / subscriptions    - Single-user per stream      │
│  - Streaming transcription                                              │
│                                                                         │
│  HTTP (browser-required exceptions only):                               │
│    /auth/*                   OAuth callbacks (HTTP-required)            │
│    /healthz                  Health probe                               │
│    /spaces/{id}/attachments  multipart upload                           │
│    /polyphon/room-token, /polyphon/status  Polyphon multi-agent voice   │
│                                                                         │
│  All paths share the same identity-service-validated context.           │
└─────────────────────────────────────────────────────────────────────────┘

When to use each path

| Path | Purpose | Transport | Multi-party |
| --- | --- | --- | --- |
| AiTranscribeStream* (gRPC) | Transcription for any client | gRPC stream | No |
| AiTranscribe (gRPC, batch) | One-shot upload-and-transcribe | gRPC stream | No |
| /memql/audio (WebSocket) | Legacy browser STT + Read Aloud TTS | WebSocket | No |
| Polyphon pipeline | Real-time voice conversations | LiveKit (WebRTC SFU) | Yes (up to 3 agents + 5 humans) |

New clients should use the gRPC streaming path (AiTranscribeStreamStart/Chunk/End). The /memql/audio WebSocket exists for the older browser flow and the Read-Aloud TTS feature.


Audio WebSocket Endpoint

The audio WebSocket (/memql/audio) provides browser-based STT transcription and TTS synthesis for spaces. When users speak, their audio is streamed to the server, transcribed using a speech-to-text provider, and converted into utterances that appear in the chat.

Connection

  • Endpoint: /memql/audio
  • Protocol: WebSocket
  • Auth: Same JWT/cookie as /memql/ws

Why a Separate WebSocket?

Audio, video, and query traffic have fundamentally different characteristics:

| Traffic Type | Frequency | Message Size | Latency Sensitivity |
| --- | --- | --- | --- |
| Queries | 1-10/minute | 100B - 10KB | Low |
| Audio | 10-20/second | 2-4KB | High |
| Video | 30-60/second | 10-100KB | Very High |

Using separate connections provides:

  1. No interference: Audio flows independently from queries
  2. Optimized for purpose: Each connection is tuned for its traffic type
  3. Independent scaling: Audio processing can scale separately
  4. Failure isolation: STT provider issues don't affect chat
  5. Future-proof: Same pattern extends to video

Message Protocol

All messages are JSON-encoded.

Start Stream (Client to Server)

Sent when the user begins recording:

json
{
  "type": "start",
  "streamId": "550e8400-e29b-41d4-a716-446655440000",
  "spaceId": "space-123",
  "participantId": "participant-456",
  "format": "pcm16",
  "sampleRate": 16000,
  "channels": 1,
  "languageHint": "en"
}

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| type | string | Yes | Must be "start" |
| streamId | string | Yes | Client-generated UUID for this audio stream |
| spaceId | string | Yes | ID of the space |
| participantId | string | Yes | ID of the participant speaking |
| format | string | No | Audio format: "pcm16" (default), "opus", "webm" |
| sampleRate | number | No | Sample rate in Hz (default: 16000) |
| channels | number | No | Number of channels (default: 1) |
| languageHint | string | No | Language code hint (e.g., "en", "es") |
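
A minimal sketch of building the start message with the documented defaults filled in. The StartMessage and StartOptions types and the buildStartMessage helper are illustrative names, not part of any memQL client library; only the field names and defaults come from the table above.

```typescript
// Hypothetical types: field names follow the table above, type names are illustrative.
type AudioFormat = "pcm16" | "opus" | "webm";

interface StartMessage {
  type: "start";
  streamId: string;
  spaceId: string;
  participantId: string;
  format: AudioFormat;
  sampleRate: number;
  channels: number;
  languageHint?: string;
}

interface StartOptions {
  streamId: string;
  spaceId: string;
  participantId: string;
  format?: AudioFormat;
  sampleRate?: number;
  channels?: number;
  languageHint?: string;
}

// Fills in the documented defaults (pcm16, 16000 Hz, mono) for omitted fields.
function buildStartMessage(opts: StartOptions): StartMessage {
  const msg: StartMessage = {
    type: "start",
    streamId: opts.streamId,
    spaceId: opts.spaceId,
    participantId: opts.participantId,
    format: opts.format ?? "pcm16",
    sampleRate: opts.sampleRate ?? 16000,
    channels: opts.channels ?? 1,
  };
  // languageHint is optional, so it is only included when the caller provides it.
  if (opts.languageHint !== undefined) {
    msg.languageHint = opts.languageHint;
  }
  return msg;
}
```

The server applies the same defaults when the fields are omitted, so sending them explicitly is a readability choice rather than a requirement.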

Audio Chunk (Client to Server)

Sent continuously while recording:

json
{
  "type": "chunk",
  "streamId": "550e8400-e29b-41d4-a716-446655440000",
  "audio": "SGVsbG8gV29ybGQ=",
  "sequence": 1
}

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| type | string | Yes | Must be "chunk" |
| streamId | string | Yes | Same UUID from start message |
| audio | string | Yes | Base64-encoded audio data |
| sequence | number | No | Sequence number for ordering |
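
A sketch of how a client might turn PCM16 samples into a chunk message. The helper names are illustrative. It assumes the server expects little-endian samples (this document does not state the byte order) and relies on typed arrays using the platform's byte order, which is little-endian on mainstream browser hardware.

```typescript
// Hypothetical message type: field names follow the table above.
interface ChunkMessage {
  type: "chunk";
  streamId: string;
  audio: string; // base64-encoded PCM16 bytes
  sequence: number;
}

// Base64-encodes the raw bytes backing an Int16Array.
function pcm16ToBase64(pcm16: Int16Array): string {
  const bytes = new Uint8Array(pcm16.buffer, pcm16.byteOffset, pcm16.byteLength);
  let binary = "";
  for (let i = 0; i < bytes.length; i++) {
    binary += String.fromCharCode(bytes[i]);
  }
  return btoa(binary);
}

// Builds one "chunk" message for the given stream and sequence number.
function buildChunkMessage(streamId: string, pcm16: Int16Array, sequence: number): ChunkMessage {
  return { type: "chunk", streamId, audio: pcm16ToBase64(pcm16), sequence };
}
```

A caller would typically pass the result through JSON.stringify and send it on the open /memql/audio socket, incrementing sequence for each chunk.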

End Stream (Client to Server)

Sent when the user stops recording:

json
{
  "type": "end",
  "streamId": "550e8400-e29b-41d4-a716-446655440000",
  "cancelled": false
}

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| type | string | Yes | Must be "end" |
| streamId | string | Yes | Same UUID from start message |
| cancelled | boolean | No | true to discard without creating utterance |

Started Response (Server to Client)

Sent after successful stream initialization:

json
{
  "type": "started",
  "streamId": "550e8400-e29b-41d4-a716-446655440000"
}

Transcription Event (Server to Client)

Sent as transcription results arrive:

json
{
  "type": "transcription",
  "streamId": "550e8400-e29b-41d4-a716-446655440000",
  "text": "Hello, how are you?",
  "isFinal": true,
  "confidence": 0.95,
  "words": [
    { "word": "Hello", "start": 0, "end": 320, "confidence": 0.98 },
    { "word": "how", "start": 350, "end": 480, "confidence": 0.94 },
    { "word": "are", "start": 500, "end": 580, "confidence": 0.96 },
    { "word": "you", "start": 600, "end": 750, "confidence": 0.93 }
  ],
  "utteranceId": "utt-voice-1702156800000000000",
  "durationMs": 1500
}

| Field | Type | Description |
| --- | --- | --- |
| type | string | Always "transcription" |
| streamId | string | Stream this result belongs to |
| text | string | Transcribed text |
| isFinal | boolean | false for interim, true for final |
| confidence | number | Confidence score (0.0 - 1.0) |
| words | array | Word-level timestamps (final only) |
| utteranceId | string | ID of created utterance (final only) |
| durationMs | number | Audio duration in ms (final only) |
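
One way a client could fold these events into display state is sketched below. The TranscriptView type and applyTranscription helper are hypothetical. The sketch also assumes each interim event carries the full text so far, so it replaces rather than appends; this document does not specify that behavior.

```typescript
// Hypothetical event type: field names follow the table above.
interface TranscriptionEvent {
  type: "transcription";
  streamId: string;
  text: string;
  isFinal: boolean;
  confidence?: number;
  utteranceId?: string;
  durationMs?: number;
}

// Hypothetical client-side view state for one stream.
interface TranscriptView {
  interimText: string;        // latest interim text, replaced on each interim event
  finalText: string | null;   // set once the final event arrives
  utteranceId: string | null; // ID of the committed utterance (final only)
}

// Returns the next view state; interim events replace the preview,
// the final event clears it and records the committed text.
function applyTranscription(view: TranscriptView, ev: TranscriptionEvent): TranscriptView {
  if (!ev.isFinal) {
    return { ...view, interimText: ev.text };
  }
  return { interimText: "", finalText: ev.text, utteranceId: ev.utteranceId ?? null };
}
```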

Error Response (Server to Client)

json
{
  "type": "error",
  "streamId": "550e8400-e29b-41d4-a716-446655440000",
  "error": {
    "code": "STREAM_NOT_FOUND",
    "message": "No active stream with this ID"
  }
}

Audio Format

  • Sample Rate: 16000 Hz (optimal for speech recognition)
  • Channels: 1 (mono)
  • Format: PCM16 (16-bit signed integer)
  • Chunk Size: ~100-200ms of audio per chunk
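
As a rough sizing check, the byte count of a PCM16 chunk follows from the sample rate, channel count, and chunk duration. The helper names below are illustrative and not part of memQL.

```typescript
// Raw PCM16 payload size: 2 bytes per sample, per channel.
function pcm16ChunkBytes(sampleRate: number, channels: number, chunkMs: number): number {
  const bytesPerSample = 2;
  return Math.round((sampleRate * channels * bytesPerSample * chunkMs) / 1000);
}

// Length of the base64 string for a payload of rawBytes bytes (with padding).
function base64Length(rawBytes: number): number {
  return 4 * Math.ceil(rawBytes / 3);
}
```

At 16000 Hz mono, a 100 ms chunk is 3200 bytes of raw audio (4268 characters once base64-encoded), and a 200 ms chunk is 6400 bytes.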

PCM16 Format

Browser audio (Float32Array with values -1.0 to 1.0) must be converted to PCM16:

typescript
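function float32ToPcm16(float32Array: Float32Array): Int16Array {
  const pcm16 = new Int16Array(float32Array.length);
  for (let i = 0; i < float32Array.length; i++) {
    // Clamp to [-1, 1] so out-of-range samples cannot overflow the 16-bit range.
    const s = Math.max(-1, Math.min(1, float32Array[i]));
    // Scale negatives by 32768 and positives by 32767 to cover the full signed range.
    pcm16[i] = s < 0 ? s * 0x8000 : s * 0x7FFF;
  }
  return pcm16;
}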

STT Data Flow

text
1. User presses mic button
2. Client opens /memql/audio WebSocket (if not already open)
3. Client sends "start" message with spaceId, participantId
4. Client captures audio via getUserMedia + AudioWorklet
5. Client converts Float32 to PCM16, base64 encodes, sends "chunk" messages
6. Server forwards chunks to STT provider (Deepgram Nova-3 when configured; OpenAI Realtime / OpenAI Whisper otherwise)
7. Server receives interim transcriptions, sends to client (isFinal: false)
8. User releases mic button
9. Client sends "end" message
10. Server finalizes STT stream, gets complete transcription
11. Server inserts v1:cognition:utterance with:
    - utteranceType: "speech"
    - source.inputMethod: "stt"
    - source.sttProvider: configured provider name
    - timestamps.words: word-level timing
12. Server sends final transcription event (isFinal: true) with utteranceId
13. Event bus emits graph.node.created.v1:cognition:utterance
14. All participants receive utterance via /memql/ws subscription
15. Chat UI displays the voice message

Utterance Structure

Voice messages create v1:cognition:utterance records with this structure:

json
{
  "concept": "v1:cognition:utterance",
  "id": "utt-voice-1702156800000000000",
  "payload": {
    "spaceId": "space-123",
    "participantId": "participant-456",
    "utteranceType": "speech",
    "text": "Hello, how are you?",
    "duration": 1500,
    "timestamps": {
      "words": [
        { "word": "Hello", "start": 0, "end": 320 },
        { "word": "how", "start": 350, "end": 480 },
        { "word": "are", "start": 500, "end": 580 },
        { "word": "you", "start": 600, "end": 750 }
      ]
    },
    "source": {
      "inputMethod": "stt",
      "sttProvider": "openai-whisper"
    }
  }
}

STT Configuration

Environment Variables

| Variable | Description | Required |
| --- | --- | --- |
| MEMQL_STT_PROVIDER | STT provider: deepgram (auto-default when key set) / openai-realtime / openai-whisper | No |
| MEMQL_DEEPGRAM_API_KEY | Deepgram API key (selects deepgram automatically when set) | Yes (for Deepgram) |
| MEMQL_SI_OPENAI_API_KEY | OpenAI API key (for Whisper / Realtime) | Yes (if using OpenAI providers) |

MemQL Variables (v1:platform:partitionVariable)

| Variable | Description | Default |
| --- | --- | --- |
| MEMQL_STT_PROVIDER | STT provider name | openai-realtime |
| MEMQL_STT_DEFAULT_LANGUAGE | Default language hint | en |

Provider Comparison

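| Feature | Deepgram Nova-3 | OpenAI Realtime | OpenAI Whisper |
| --- | --- | --- | --- |
| Real-time streaming | Yes | Yes | No (batch) |
| Interim results | Yes | Yes | No |
| Word timestamps | Yes | Yes | Yes |
| Deploy | Cloud API | Cloud API | Cloud API |
| Best for | Lowest TTFB, default | OpenAI-only stacks | Accuracy, offline (non-real-time) transcription |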

Deepgram Nova-3 (default when MEMQL_DEEPGRAM_API_KEY is set): Streaming WebSocket via Deepgram's /v1/listen; sub-300 ms first interim partials.

OpenAI Realtime (fallback): Streaming transcription via the Realtime API in transcription-only mode.

OpenAI Whisper: Batch transcription via the transcriptions API. Audio is buffered during the session and transcribed when the user stops speaking. Best for accuracy but no interim results.

STT Component Structure

text
server/audiows/
├── handler.go # WebSocket handler, session management
└── messages.go # Message type definitions
 
integrations/stt/
├── stt.go # Provider interface, common types
├── openai_whisper.go # OpenAI Whisper (batch)
├── openai_realtime.go # OpenAI Realtime (streaming)
└── deepgram.go # Deepgram Nova-3 (streaming)

STT Provider Interface

go
// StreamingProvider provides real-time streaming transcription.
type StreamingProvider interface {
	// StartStream begins a new streaming session.
	StartStream(ctx context.Context, config StreamConfig) (StreamingSession, error)

	// Name returns the provider name.
	Name() string
}

// StreamingSession represents an active transcription session.
type StreamingSession interface {
	// SendAudio sends audio data to the STT service.
	SendAudio(audio []byte) error

	// Receive returns a channel for transcription events.
	Receive() <-chan TranscriptionResult

	// Finalize closes the stream and returns the final transcription.
	Finalize(ctx context.Context) (*FinalTranscription, error)

	// Close terminates without waiting for the final result.
	Close() error
}

Error Handling

| Error Code | Description | Recovery |
| --- | --- | --- |
| STREAM_NOT_FOUND | streamId doesn't exist | Start a new stream |
| STREAM_START_FAILED | Failed to connect to STT | Retry or check config |
| INVALID_FORMAT | Bad audio format | Check format settings |
| STT_ERROR | STT provider error | Retry |
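
The recovery column can be encoded as a small lookup on the client. This is a sketch only: the RecoveryAction names and the recoveryFor helper are hypothetical, and treating unrecognized codes as "unknown" is an assumption rather than documented behavior.

```typescript
// Hypothetical action names derived from the Recovery column above.
type RecoveryAction =
  | "start_new_stream"
  | "retry_or_check_config"
  | "check_format_settings"
  | "retry"
  | "unknown";

// Shape of the error message shown in the Error Response section.
interface ErrorMessage {
  type: "error";
  streamId: string;
  error: { code: string; message: string };
}

// Maps a server error code to the documented recovery step.
function recoveryFor(msg: ErrorMessage): RecoveryAction {
  switch (msg.error.code) {
    case "STREAM_NOT_FOUND":
      return "start_new_stream";
    case "STREAM_START_FAILED":
      return "retry_or_check_config";
    case "INVALID_FORMAT":
      return "check_format_settings";
    case "STT_ERROR":
      return "retry";
    default:
      // Assumption: codes not listed in the table are surfaced to the caller as-is.
      return "unknown";
  }
}
```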

Limitations

  • Maximum audio duration: Limited by STT provider (typically 5+ minutes)
  • Chunk size: ~200ms recommended for balance of latency/overhead
  • Concurrent streams per connection: 1 (start new after previous ends)
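
Because a connection carries one stream at a time, a client may want to guard against starting a second stream early. The StreamTracker class below is a hypothetical client-side helper, sketched under the assumption that the client decides locally when a stream has ended.

```typescript
// Hypothetical client-side guard for the one-stream-per-connection limit.
class StreamTracker {
  private active: string | null = null;

  // Records a new stream; throws if the previous one has not ended yet.
  start(streamId: string): void {
    if (this.active !== null) {
      throw new Error("stream " + this.active + " is still active on this connection");
    }
    this.active = streamId;
  }

  // Clears the active stream. Returns false when streamId is not the active one.
  end(streamId: string): boolean {
    if (this.active !== streamId) {
      return false;
    }
    this.active = null;
    return true;
  }

  // Returns the active stream ID, or null when the connection is idle.
  activeStreamId(): string | null {
    return this.active;
  }
}
```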

Text-to-Speech (TTS) via Audio WebSocket

The audio WebSocket also supports TTS synthesis for the "Read Aloud" feature in spaces. All TTS requests through this endpoint use the OpenAI TTS API provider configured in the engine's provider registry.

Read Aloud Feature

The "Read Aloud" feature allows any chat message to be spoken by the SI agent.

Synthesize Request (Client to Server)

Sent when the user clicks "Read Aloud" on a message:

json
{
  "type": "synthesize",
  "requestId": "req-550e8400-e29b-41d4-a716-446655440000",
  "text": "Hello, how are you today?",
  "voice": "nova",
  "format": "wav",
  "sampleRate": 24000
}

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| type | string | Yes | Either "synthesize" or "tts_synthesize" (both accepted) |
| requestId | string | Yes | Client-generated UUID for this request |
| text | string | Yes | Text to synthesize |
| voice | string | No | Voice ID (defaults to agent's configured voice) |
| format | string | No | Audio format: "wav" (default) - each chunk is complete WAV file |
| sampleRate | number | No | Sample rate in Hz (default: 24000) |

TTS Started Response (Server to Client)

Sent immediately when TTS synthesis begins:

json
{
  "type": "tts_started",
  "requestId": "req-550e8400-e29b-41d4-a716-446655440000",
  "format": "wav",
  "sampleRate": 24000,
  "spaceId": "space-123",
  "participantId": "ai-participant-456",
  "text": "Hello, how are you today?"
}

| Field | Type | Description |
| --- | --- | --- |
| type | string | Always "tts_started" |
| requestId | string | Matches the synthesize request |
| format | string | Audio format: "wav" (each chunk is complete WAV file) |
| sampleRate | number | Sample rate in Hz |
| spaceId | string | Space ID for context |
| participantId | string | SI participant ID generating the audio |
| text | string | The text being synthesized |

TTS Chunk Response (Server to Client)

Streamed back as TTS generates audio. Each chunk is a complete WAV file that browsers can decode independently:

json
{
  "type": "tts_chunk",
  "requestId": "req-550e8400-e29b-41d4-a716-446655440000",
  "audio": "UklGRiQAAABXQVZFZm10IBAAAA...",
  "format": "wav",
  "sampleRate": 24000,
  "sequence": 0,
  "done": false,
  "spaceId": "space-123",
  "participantId": "ai-participant-456",
  "text": "Hello, how are you today?"
}

| Field | Type | Description |
| --- | --- | --- |
| type | string | Always "tts_chunk" |
| requestId | string | Matches the synthesize request |
| audio | string | Base64-encoded WAV file (complete file with header, ~10KB per 200ms) |
| format | string | Audio format: "wav" |
| sampleRate | number | Sample rate in Hz (24000) |
| sequence | number | Chunk sequence number (starts at 0) |
| done | boolean | true for last chunk |
| spaceId | string | Space ID for context |
| participantId | string | SI participant ID generating the audio |
| text | string | The text being synthesized |

TTS Ended Response (Server to Client)

Sent when TTS synthesis completes or fails:

json
{
  "type": "tts_ended",
  "requestId": "req-550e8400-e29b-41d4-a716-446655440000",
  "spaceId": "space-123",
  "participantId": "ai-participant-456"
}

| Field | Type | Description |
| --- | --- | --- |
| type | string | Always "tts_ended" |
| requestId | string | Matches the synthesize request |
| spaceId | string | Space ID for context |
| participantId | string | SI participant ID |
| cancelled | boolean | true if TTS was cancelled (optional) |
| error | string | Error message if TTS failed (optional) |
errorstringError message if TTS failed (optional)

Audio Format Recommendation

WAV is the default format for reliable progressive playback:

| Format | Size per 200ms | Browser Decode | Recommendation |
| --- | --- | --- | --- |
| wav | ~10KB | Perfect - native support | Default - Most reliable |
| mp3 | ~800 bytes | Requires frame parsing | Complex, error-prone |
| opus | ~400 bytes | Needs Ogg container | Not supported raw |

Why WAV:

  • Simple container: A 44-byte header followed by raw PCM data
  • Native decoding: The browser's decodeAudioData() decodes each WAV chunk directly
  • No frame boundaries: Unlike MP3/Opus, no complex parsing required
  • Immediate playback: Each chunk plays immediately with no initialization

Frontend playback with WAV (progressive):

typescript
// ttsStream, base64ToArrayBuffer, audioContext, and queueAudioForPlayback
// are application-provided (not shown here).
// Each chunk is a complete WAV file - decode and play immediately!
for await (const chunk of ttsStream) {
  const wavBuffer = base64ToArrayBuffer(chunk.audio);
  const audioBuffer = await audioContext.decodeAudioData(wavBuffer);
  // Queue immediately for playback - starts playing within ~200ms
  queueAudioForPlayback(audioBuffer);
}

Chunk characteristics:

  • Each chunk is ~200ms of audio
  • Each chunk is a complete WAV file (44-byte header + PCM data)
  • Each chunk is ~10KB (24kHz mono 16-bit)
  • Each chunk decodes on its own via decodeAudioData(), with no state carried over from earlier chunks
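
To make the "44-byte header + PCM data" layout concrete, here is a sketch that wraps PCM16 samples in a canonical RIFF/WAVE header. It illustrates the file layout only; the server's actual chunking code is not shown in this document, and pcm16ToWav is a hypothetical name.

```typescript
// Wraps raw PCM16 samples in a canonical 44-byte RIFF/WAVE header.
function pcm16ToWav(pcm: Int16Array, sampleRate: number, channels: number): Uint8Array {
  const bytesPerSample = 2;
  const dataSize = pcm.length * bytesPerSample;
  const buffer = new ArrayBuffer(44 + dataSize);
  const view = new DataView(buffer);

  const writeAscii = (offset: number, text: string): void => {
    for (let i = 0; i < text.length; i++) {
      view.setUint8(offset + i, text.charCodeAt(i));
    }
  };

  writeAscii(0, "RIFF");
  view.setUint32(4, 36 + dataSize, true); // RIFF chunk size = file size - 8
  writeAscii(8, "WAVE");
  writeAscii(12, "fmt ");
  view.setUint32(16, 16, true); // fmt chunk size for PCM
  view.setUint16(20, 1, true); // audio format 1 = linear PCM
  view.setUint16(22, channels, true);
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * channels * bytesPerSample, true); // byte rate
  view.setUint16(32, channels * bytesPerSample, true); // block align
  view.setUint16(34, 16, true); // bits per sample
  writeAscii(36, "data");
  view.setUint32(40, dataSize, true);

  // WAV stores samples little-endian regardless of platform byte order.
  for (let i = 0; i < pcm.length; i++) {
    view.setInt16(44 + i * bytesPerSample, pcm[i], true);
  }
  return new Uint8Array(buffer);
}
```

For 200 ms at 24 kHz mono (4800 samples) this yields 44 + 9600 = 9644 bytes, which lines up with the ~10KB chunk size quoted above.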

TTS Data Flow (Read Aloud)

text
1. User clicks "Read Aloud" on a message
2. Client sends "synthesize" message with text and requestId
3. Server sends "tts_started" message with format info
4. Server calls OpenAI TTS API (from engine provider registry)
5. Server streams "tts_chunk" messages (WAV audio)
6. Client uses native decodeAudioData() for playback
7. Server sends "tts_ended" on completion
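
On the client, the three server messages can drive a small per-request state reducer. The sketch below is illustrative: the TtsStatus values and reduceTts helper are hypothetical names, and only the message fields documented above are used.

```typescript
// Subset of the documented server messages needed to track one request.
type TtsServerMessage =
  | { type: "tts_started"; requestId: string }
  | { type: "tts_chunk"; requestId: string; sequence: number; done: boolean }
  | { type: "tts_ended"; requestId: string; cancelled?: boolean; error?: string };

// Hypothetical client-side status values.
type TtsStatus = "pending" | "streaming" | "completed" | "cancelled" | "failed";

interface TtsRequestState {
  requestId: string;
  status: TtsStatus;
  chunksReceived: number;
  error?: string;
}

// Advances the state for one server message; messages for other requests are ignored.
function reduceTts(state: TtsRequestState, msg: TtsServerMessage): TtsRequestState {
  if (msg.requestId !== state.requestId) {
    return state;
  }
  switch (msg.type) {
    case "tts_started":
      return { ...state, status: "streaming" };
    case "tts_chunk":
      return { ...state, chunksReceived: state.chunksReceived + 1 };
    case "tts_ended":
      if (msg.error !== undefined) {
        return { ...state, status: "failed", error: msg.error };
      }
      if (msg.cancelled === true) {
        return { ...state, status: "cancelled" };
      }
      return { ...state, status: "completed" };
  }
}
```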

Voice consistency: The agent's providerConfig.voice.voiceId is used, ensuring the SI agent has a consistent voice identity.

TTS Configuration

| Variable | Description | Default |
| --- | --- | --- |
| MEMQL_DEFAULT_TTS_PROVIDER | TTS provider name from registry | tts1 |

TTS providers are configured in providers/v1/openai/ as .memql files with @type("OpenAITTS"). The default voice, format, and speed are set per-provider in the MemQL configuration.

Chunk Sizing

Chunk sizes are optimized per format for ~200-300ms of audio:

| Format | Chunk Size | Duration |
| --- | --- | --- |
| wav | ~10 KB | ~200-300ms |
| opus | 8 KB | ~200-300ms |
| mp3 | 8 KB | ~200-300ms |
| pcm | 12 KB | ~250ms |

gRPC Streaming Transcription

The canonical streaming-transcription path for new clients lives on MemqlService.Stream -- the same bidirectional gRPC stream that carries chat, suggest, and graph traffic.

Message flow

text
client -> server                        server -> client
─────────────────────────────────       ─────────────────────────────────
AiTranscribeStreamStart {               AiTranscribeStreamDelta {
  request_id, sample_rate, ...            request_id, text, is_final
}                                       }  (zero or more interim deltas)
AiTranscribeStreamChunk {               AiTranscribeStreamComplete {
  request_id, audio (PCM16 bytes)         request_id, transcript, words
}                                       }
... more chunks ...
AiTranscribeStreamEnd { request_id }

The flow is keyed by request_id. The voice node owns the provider session; the BFF proxies via AiForwardRouter.ForwardContinuation so chunks land on the same voice instance that owns the session.
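
The client side of this exchange can be sketched as an ordered list of messages that all share one request_id. The real messages are protobuf types on MemqlService.Stream; the plain-object shapes and the kind discriminator below are hypothetical stand-ins used only to show the ordering.

```typescript
// Hypothetical plain-object stand-ins for the protobuf messages.
type ClientTranscribeMessage =
  | { kind: "AiTranscribeStreamStart"; request_id: string; sample_rate: number }
  | { kind: "AiTranscribeStreamChunk"; request_id: string; audio: Uint8Array }
  | { kind: "AiTranscribeStreamEnd"; request_id: string };

// Builds the Start -> Chunk* -> End sequence for one recording.
function transcribeMessages(
  requestId: string,
  sampleRate: number,
  chunks: Uint8Array[],
): ClientTranscribeMessage[] {
  const messages: ClientTranscribeMessage[] = [
    { kind: "AiTranscribeStreamStart", request_id: requestId, sample_rate: sampleRate },
  ];
  for (let i = 0; i < chunks.length; i++) {
    // Every chunk reuses the same request_id so it reaches the session opened by Start.
    messages.push({ kind: "AiTranscribeStreamChunk", request_id: requestId, audio: chunks[i] });
  }
  messages.push({ kind: "AiTranscribeStreamEnd", request_id: requestId });
  return messages;
}
```

In a live client the chunks would be sent as they are captured rather than collected first; the array form is used here only to make the ordering easy to inspect.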

Files

  • component/grpc/ai_transcribe_stream.go -- handler + per-stream state machine
  • component/grpc/ai_forward.go -- BFF -> voice forwarding
  • integrations/stt/ -- provider implementations (Deepgram Nova-3, OpenAI Realtime, OpenAI Whisper)

Provider selection

Same env vars as the legacy /memql/audio path:

| Variable | Values | Default |
| --- | --- | --- |
| MEMQL_STT_PROVIDER | deepgram, openai-realtime, openai-whisper | auto (deepgram when MEMQL_DEEPGRAM_API_KEY is set, else openai-realtime) |
| MEMQL_DEEPGRAM_API_KEY | Deepgram key | required for deepgram |
| MEMQL_SI_OPENAI_API_KEY | OpenAI key (Realtime / Whisper) | required for OpenAI |
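
The auto-default rule in the table can be read as the following sketch. resolveSttProvider is a hypothetical helper, not the server's actual code (which lives in Go). Treating an unrecognized MEMQL_STT_PROVIDER value the same as an unset one is an assumption, and the sketch ignores any partition-variable override.

```typescript
type SttProvider = "deepgram" | "openai-realtime" | "openai-whisper";

// Resolves the STT provider from an environment-like map.
function resolveSttProvider(env: Record<string, string | undefined>): SttProvider {
  const explicit = env["MEMQL_STT_PROVIDER"];
  if (explicit === "deepgram" || explicit === "openai-realtime" || explicit === "openai-whisper") {
    return explicit;
  }
  // Auto default: deepgram when a Deepgram key is present, else openai-realtime.
  const deepgramKey = env["MEMQL_DEEPGRAM_API_KEY"];
  if (deepgramKey !== undefined && deepgramKey !== "") {
    return "deepgram";
  }
  return "openai-realtime";
}
```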

docker-compose.full.yml brings up a voice node alongside the BFF so streaming transcription works on the basic dev path without needing the cluster overlay (this was the change in 545537d).

Single-shot batch path

AiTranscribeMsg (one request, one response) is still supported for clients that buffer the whole recording client-side. Same provider backends.


Polyphon Voice Pipeline

Multi-agent real-time voice conversations route through the Polyphon pipeline -- LiveKit for audio transport, a Bridge Agent for ASR/TTS, and the cognition node for turn-taking.

The full architecture (audio flow, provider flavors, configuration, component structure, costs) lives in /docs/polyphon-architecture.md. Don't duplicate it here.

Endpoints

| Endpoint | Method | Description |
| --- | --- | --- |
| /polyphon/room-token | POST | Generate a LiveKit room token for a participant |
| /polyphon/status | GET | Session count and health status |

These are HTTP endpoints (not gRPC) because the LiveKit JavaScript SDK expects an HTTP token endpoint. Available only when the LiveKit env vars are configured.

Provider selection

POLYPHON_VOICE_PROVIDER:

  • deepgram (auto-default when MEMQL_DEEPGRAM_API_KEY is set) -- Nova-3 ASR + Aura-2 TTS.
  • openai (fallback) -- OpenAI Realtime transcription + /v1/audio/speech TTS.

For the Polyphon architecture and deployment details, see /docs/polyphon-architecture.md.

For the overall memQL architecture, see /docs/public/concepts/architecture.md.

For integration patterns, see /integrations/CLAUDE.md.