Streaming Text-to-Speech API
WebSocket-based streaming endpoint that sends audio chunks progressively as text is processed. Connect once, stream text in, receive audio out — no polling, no repeated HTTP calls.
For complete API reference documentation, see the Text-to-Speech API Reference section.
Common use cases:
Audio playback begins as soon as the first chunk is synthesized — no need to wait for the full response.
Hindi, Bengali, Tamil, Telugu, Kannada, Malayalam, Marathi, Gujarati, Punjabi, Odia, and English (Indian accent).
A single WebSocket connection handles multiple text-to-speech conversions. Send config once, then stream text continuously.
Python (AsyncSarvamAI) and JavaScript (SarvamAIClient) SDKs with built-in async/await and event-driven patterns for seamless integration.
The TTS streaming API now supports an end of speech signal that allows for clean stream termination when speech generation is complete.
send_completion_event
When you set send_completion_event=True in the connection, the API sends a completion event when speech generation ends, so your application can handle stream termination gracefully.
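A minimal sketch of how a receive loop might use the completion event to stop cleanly. The message shape used here (dicts with a `type` of `"audio"` or `"event"`, base64 audio under `data.audio`, and an `event_type` of `"final"`) is an assumption for illustration, so check the API reference for the actual schema. `fake_stream` is a hypothetical stand-in for an open connection.

```python
import asyncio
import base64
from typing import AsyncIterator


async def collect_audio(messages: AsyncIterator[dict]) -> bytes:
    """Concatenate audio chunks until a completion event arrives.

    Assumed (hypothetical) message shapes:
      {"type": "audio", "data": {"audio": "<base64>"}}
      {"type": "event", "data": {"event_type": "final"}}
    """
    audio = bytearray()
    async for message in messages:
        if message.get("type") == "audio":
            audio.extend(base64.b64decode(message["data"]["audio"]))
        elif message.get("type") == "event":
            if message.get("data", {}).get("event_type") == "final":
                break  # speech generation finished; stop reading
    return bytes(audio)


async def fake_stream() -> AsyncIterator[dict]:
    """Stand-in for a live connection, used only to exercise the loop."""
    for chunk in (b"abc", b"def"):
        yield {"type": "audio", "data": {"audio": base64.b64encode(chunk).decode()}}
    yield {"type": "event", "data": {"event_type": "final"}}
    # Anything after the completion event is never read by collect_audio.
    yield {"type": "audio", "data": {"audio": base64.b64encode(b"late").decode()}}


if __name__ == "__main__":
    print(len(asyncio.run(collect_audio(fake_stream()))), "bytes received")
```

Without the completion event, a loop like this would have to rely on a timeout or on the socket closing to know that synthesis had finished.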
Easily convert text to speech in real time using Sarvam’s low-latency WebSocket-based TTS API.
Sets up voice parameters and must be the first message sent after connection. Parameters:
min_buffer_size: Minimum character length that triggers buffer flushing for TTS model processing
max_chunk_length: Maximum length for sentence splitting (adjust based on content length)
output_audio_codec: Supports multiple formats: mp3, wav, aac, opus, flac, pcm (LINEAR16), mulaw (μ-law), and alaw (A-law)
output_audio_bitrate: Choose from 5 supported bitrate options
Codes 1000–1015 are standard WebSocket codes; 4000–4999 are application-specific. Always read the accompanying close reason rather than assuming a fixed meaning. An idle connection closes automatically after ~1 minute, so send ping() to keep long-lived sessions open.
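The close-code ranges and the keepalive advice above can be sketched as follows. This is illustrative only: `classify_close_code` and `keepalive` are hypothetical helper names, the 20-second default interval is an assumed value chosen to stay well under the ~1 minute idle timeout, and `ping` is whatever coroutine function wraps the connection's ping() call in your client.

```python
import asyncio
from typing import Awaitable, Callable


def classify_close_code(code: int) -> str:
    """Bucket a WebSocket close code by the ranges described above."""
    if 1000 <= code <= 1015:
        return "standard"
    if 4000 <= code <= 4999:
        return "application"  # meaning is service-defined; read the close reason
    return "other"


async def keepalive(
    ping: Callable[[], Awaitable[None]],
    stop: asyncio.Event,
    interval_s: float = 20.0,  # assumed value, not a documented recommendation
) -> int:
    """Call `ping` every `interval_s` seconds until `stop` is set.

    Returns the number of pings sent.
    """
    sent = 0
    while not stop.is_set():
        try:
            # Wake early if the session is ending; otherwise ping on timeout.
            await asyncio.wait_for(stop.wait(), timeout=interval_s)
        except asyncio.TimeoutError:
            await ping()
            sent += 1
    return sent
```

Run `keepalive` as a background task next to the receive loop and set `stop` when the session ends, so the task exits without waiting out a full interval.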
Reconnect with exponential backoff (pseudocode):
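One possible shape for the reconnect loop, written as runnable Python rather than pseudocode. Everything named here is hypothetical: `ConnectionClosed` stands in for whatever exception your WebSocket client raises on close, `session` is your own coroutine that connects, streams, and returns on a clean finish, and the base delay, cap, and attempt count are assumed values. As a conservative assumption, the sketch refuses to retry any close in the 4000–4999 range.

```python
import asyncio
import random
from typing import Awaitable, Callable


class ConnectionClosed(Exception):
    """Hypothetical error carrying the WebSocket close code and reason."""

    def __init__(self, code: int, reason: str = "") -> None:
        super().__init__(f"closed with {code}: {reason}")
        self.code = code
        self.reason = reason


def backoff_delay(attempt: int, base_s: float = 0.5, cap_s: float = 30.0) -> float:
    """Upper bound on the wait before retry `attempt` (0-based), capped at cap_s."""
    return min(cap_s, base_s * (2 ** attempt))


async def run_with_reconnect(
    session: Callable[[], Awaitable[None]],
    max_attempts: int = 5,
    base_s: float = 0.5,
    cap_s: float = 30.0,
) -> None:
    """Run `session`, retrying with exponential backoff when the socket drops."""
    for attempt in range(max_attempts):
        try:
            await session()
            return
        except ConnectionClosed as exc:
            # Assumption: treat every application-specific close as non-retryable.
            if 4000 <= exc.code <= 4999:
                raise
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the last failure
            # Full jitter: sleep a random amount up to the exponential bound.
            bound = backoff_delay(attempt, base_s, cap_s)
            await asyncio.sleep(random.uniform(0, bound))
```

The jitter spreads out reconnects so that many clients dropped at the same moment do not all retry in lockstep.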
Don’t auto-retry on 4xxx auth/quota closes — fix the cause first (see Errors & Troubleshooting).
When a user interrupts the agent mid-reply, you want playback to stop instantly. The TTS WebSocket has no server-side cancel/clear message — convert, flush, ping, and close are the only client messages. Handle barge-in entirely on the client:
speech_start: flush/clear your local audio buffer and stop the player.
Because there’s no in-band cancel, keep TTS replies chunked into shorter convert() calls so a barge-in discards less already-generated audio. See the STT WebSocket barge-in recipe, the LiveKit and Pipecat voice-agent guides, and Credits & Rate Limits for streaming concurrency limits.
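A rough sketch of the client-side pieces described above: a local playback buffer that can be cleared on barge-in, and a helper that splits a reply into shorter pieces for separate convert() calls. `PlaybackBuffer`, `split_for_convert`, and the 120-character default are hypothetical names and values, not part of the SDK. Stopping the audio device itself depends on your player and is not shown.

```python
from collections import deque
from typing import Deque, List, Optional


class PlaybackBuffer:
    """Client-side queue of audio chunks waiting to be played."""

    def __init__(self) -> None:
        self._chunks: Deque[bytes] = deque()
        self._reply_id = 0  # identifies the reply currently allowed to play

    def start_reply(self) -> int:
        """Begin a new agent reply and return its id for tagging chunks."""
        self._reply_id += 1
        return self._reply_id

    def push(self, reply_id: int, chunk: bytes) -> bool:
        """Queue a chunk; drop it if it belongs to an interrupted reply."""
        if reply_id != self._reply_id:
            return False
        self._chunks.append(chunk)
        return True

    def pop(self) -> Optional[bytes]:
        """Return the next chunk to play, or None if the queue is empty."""
        return self._chunks.popleft() if self._chunks else None

    def barge_in(self) -> int:
        """Discard queued audio and invalidate the current reply."""
        dropped = len(self._chunks)
        self._chunks.clear()
        self._reply_id += 1  # late chunks from the old reply no longer match
        return dropped


def split_for_convert(text: str, max_chars: int = 120) -> List[str]:
    """Split text on whitespace into pieces of at most `max_chars` characters.

    A single word longer than `max_chars` is kept whole.
    """
    pieces: List[str] = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if len(candidate) > max_chars and current:
            pieces.append(current)
            current = word
        else:
            current = candidate
    if current:
        pieces.append(current)
    return pieces
```

Tagging each chunk with a reply id matters because audio generated before the interruption can still arrive after the buffer has been cleared. Without the id check, those late chunks would play over the user.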