Streaming Text-to-Speech API
This WebSocket-based streaming endpoint sends audio chunks progressively as text is processed. Connect once, stream text in, and receive audio out, with no polling and no repeated HTTP calls.
Common use cases:
- Conversational AI agents — Stream TTS responses in real time for voice-based assistants
- Interactive podcasts — Generate and play back dialogue on the fly with minimal delay
- Low-latency applications — Any scenario where time-to-first-byte (TTFB) matters (IVR, live narration, kiosks)
Why WebSocket Streaming
Audio playback begins as soon as the first chunk is synthesized — no need to wait for the full response.
Supported languages: Hindi, Bengali, Tamil, Telugu, Kannada, Malayalam, Marathi, Gujarati, Punjabi, Odia, and English (Indian accent).
A single WebSocket connection handles multiple text-to-speech conversions. Send the config once, then stream text continuously.
The Python (AsyncSarvamAI) and JavaScript (SarvamAIClient) SDKs provide built-in async/await and event-driven patterns for integration.
Code Examples
Best Practices
- Always send the config message first
- Use flush messages strategically to ensure complete text processing
- Send ping messages to maintain long-running connections
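The ordering described in these practices can be sketched as follows. This is a minimal illustration, not the official client: the JSON envelope (`type` and `data` keys), the helper names, and the sample config values are assumptions made for the example, so check the API reference for the actual message schema.

```python
import json

# Hypothetical helpers: the {"type": ..., "data": ...} envelope is an assumption.
def config_message(**params):
    return json.dumps({"type": "config", "data": params})

def text_message(text):
    return json.dumps({"type": "text", "data": {"text": text}})

def flush_message():
    return json.dumps({"type": "flush"})

def ping_message():
    # Intended to be sent periodically on otherwise idle, long-running connections.
    return json.dumps({"type": "ping"})

def build_session(config, sentences):
    """Return outgoing messages in send order: config, each text, then a flush."""
    messages = [config_message(**config)]               # config always goes first
    messages.extend(text_message(s) for s in sentences)
    messages.append(flush_message())                    # force any buffered text out
    return messages

# Sample values below are illustrative only.
outgoing = build_session(
    {"min_buffer_size": 50, "max_chunk_length": 200, "output_audio_codec": "mp3"},
    ["Hello.", "How are you?"],
)
types = [json.loads(m)["type"] for m in outgoing]
print(types)  # ['config', 'text', 'text', 'flush']
```

Each string in `outgoing` would then be sent over the open WebSocket in order, with `ping_message()` sent separately whenever the connection has been quiet for a while.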
End of Speech Signal
The TTS streaming API now supports an end-of-speech signal that allows clean stream termination once speech generation is complete.
Using send_completion_event
When you set send_completion_event=True on the connection, the API sends a completion event when speech generation ends, so your application can handle stream termination gracefully.
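One way to handle the completion event on the receiving side is sketched below. The message shapes are assumptions made for illustration: audio is taken to arrive as `{"type": "audio", "data": {"audio": <base64 string>}}` and the completion event as `{"type": "event", "data": {"event_type": "final"}}`. The actual payloads may differ, and `collect_audio` is a hypothetical helper, not part of the SDK.

```python
import base64

def collect_audio(messages):
    """Concatenate audio chunks until the (assumed) completion event arrives.

    Returns (audio_bytes, completed). `completed` is False when the stream
    ended without a completion event.
    """
    chunks = []
    for message in messages:
        kind = message.get("type")
        if kind == "audio":
            chunks.append(base64.b64decode(message["data"]["audio"]))
        elif kind == "event" and message["data"].get("event_type") == "final":
            return b"".join(chunks), True   # clean termination
    return b"".join(chunks), False          # stream ended without the signal

# Simulated incoming messages standing in for a real WebSocket stream.
incoming = [
    {"type": "audio", "data": {"audio": base64.b64encode(b"chunk-1").decode()}},
    {"type": "audio", "data": {"audio": base64.b64encode(b"chunk-2").decode()}},
    {"type": "event", "data": {"event_type": "final"}},
]
audio, completed = collect_audio(incoming)
```

Returning a `completed` flag lets the caller distinguish a clean end of speech from a connection that simply dropped.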
Streaming TTS WebSocket – Integration Guide
Easily convert text to speech in real time using Sarvam’s low-latency WebSocket-based TTS API.
Input Message Types
- Config Message
- Text Message
- Flush Message
- Ping Message
The config message sets up voice parameters and must be the first message sent after connecting. Parameters:
- min_buffer_size: Minimum character length that triggers buffer flushing for TTS model processing
- max_chunk_length: Maximum length for sentence splitting (adjust based on content length)
- output_audio_codec: Output audio format. Supported values: mp3, wav, aac, opus, flac, pcm (LINEAR16), mulaw (μ-law), and alaw (A-law)
- output_audio_bitrate: Choose from 5 supported bitrate options
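A config payload built from these parameters could be checked before sending, as in the sketch below. The helper `build_config`, the positive-integer checks, and the exact codec strings are assumptions based on the list above, not the documented schema. `output_audio_bitrate` is left out because its allowed values are not listed here.

```python
# Codec names as listed above; the exact strings the API expects are an assumption.
SUPPORTED_CODECS = {"mp3", "wav", "aac", "opus", "flac", "pcm", "mulaw", "alaw"}

def build_config(min_buffer_size, max_chunk_length, output_audio_codec):
    """Validate the parameters and return a config dict (hypothetical helper)."""
    for name, value in (("min_buffer_size", min_buffer_size),
                        ("max_chunk_length", max_chunk_length)):
        # Assumed rule: both lengths are positive integers.
        if not isinstance(value, int) or value <= 0:
            raise ValueError(f"{name} must be a positive integer, got {value!r}")
    if output_audio_codec not in SUPPORTED_CODECS:
        raise ValueError(f"unsupported codec: {output_audio_codec!r}")
    return {
        "min_buffer_size": min_buffer_size,
        "max_chunk_length": max_chunk_length,
        "output_audio_codec": output_audio_codec,
    }

# Sample values are illustrative only.
config = build_config(min_buffer_size=50, max_chunk_length=200, output_audio_codec="mp3")
```

Validating on the client side surfaces a mistyped codec name before the connection is opened, rather than as a server-side error after the first message.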