Streaming Text-to-Speech API
Real-time Processing
Converts text into spoken audio in real time: audio is generated and played back progressively while the text is still being processed.
- Efficient for long texts
- Real-time conversion
- Handle multiple requests easily
- Low latency audio generation and faster responses
Features
Low Latency Playback
- Audio starts playing immediately as the text is processed
- Speaks dynamic or live content as it comes in
Language Support
- Multiple Indian languages and English support
- Language code specification (e.g., “kn-IN” for Kannada)
- High-accuracy speech synthesis
Efficient Resource Usage
- Streams small chunks of audio instead of generating everything at once
- Uses less memory and keeps performance stable even with long texts
Integration
- Python and JavaScript SDKs with async support
- WebSocket connections
- Easy-to-use API interface
Code Examples
Best Practices
- Always send the config message first
- Keep text chunks under 500 characters for optimal streaming
- Use flush messages strategically to ensure complete text processing
- Send ping messages to maintain long-running connections
- Handle error responses appropriately in your application logic
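The ordering and chunking advice above can be sketched in a few lines of Python. This is a minimal illustration, not the SDK's implementation: the JSON envelope (`type` / `data` keys), the `target_language_code` key, and the helper names `split_text` and `build_messages` are assumptions made for the example. Only the 500-character guideline and the config-first, flush-last ordering come from the list above.

```python
import json
import re

MAX_CHUNK_CHARS = 499  # stay strictly under the 500-character guideline


def split_text(text: str, limit: int = MAX_CHUNK_CHARS) -> list:
    """Split text into chunks of at most `limit` characters, preferring sentence ends."""
    sentences = re.split(r"(?<=[.!?।])\s+", text.strip())
    chunks = []
    current = ""
    for sentence in sentences:
        # A single sentence longer than the limit is hard-wrapped (may cut mid-word).
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


def build_messages(config: dict, text: str) -> list:
    """Return JSON strings in send order: config first, text chunks, then flush."""
    # Hypothetical envelope: the real wire format may differ.
    messages = [{"type": "config", "data": config}]
    messages += [{"type": "text", "data": {"text": c}} for c in split_text(text)]
    messages.append({"type": "flush"})
    return [json.dumps(m, ensure_ascii=False) for m in messages]


if __name__ == "__main__":
    for msg in build_messages({"target_language_code": "kn-IN"}, "Hello there. " * 100):
        print(msg[:80])
```

Splitting on sentence boundaries keeps each chunk speakable on its own. The trailing flush corresponds to the advice to flush so that any buffered text is processed.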
End of Speech Signal
The TTS streaming API now supports an end-of-speech signal that allows clean stream termination when speech generation is complete.
Using send_completion_event
When you set send_completion_event=True in the connection, the API sends a completion event when speech generation ends, allowing your application to handle stream termination gracefully.
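A receive loop that stops on the completion event might look like the sketch below. The message shapes are assumptions for illustration (an `audio` message carrying a payload, an `event` message whose `event_type` is `final`, and an `error` message). `fake_stream` is a hypothetical stand-in for the WebSocket so the example runs offline, so check the API reference for the actual event format.

```python
import asyncio
import json
from typing import AsyncIterator, List


async def collect_audio(messages: AsyncIterator[str]) -> List[str]:
    """Gather audio payloads until the completion event arrives."""
    audio_chunks: List[str] = []
    async for raw in messages:
        message = json.loads(raw)
        kind = message.get("type")
        if kind == "audio":
            audio_chunks.append(message["data"]["audio"])
        elif kind == "event" and message.get("data", {}).get("event_type") == "final":
            # Assumed completion event: stop reading so the caller can close cleanly.
            break
        elif kind == "error":
            raise RuntimeError(f"TTS stream error: {message.get('data')}")
    return audio_chunks


async def fake_stream() -> AsyncIterator[str]:
    """Hypothetical stand-in for a WebSocket connection."""
    yield json.dumps({"type": "audio", "data": {"audio": "AAAA"}})
    yield json.dumps({"type": "audio", "data": {"audio": "BBBB"}})
    yield json.dumps({"type": "event", "data": {"event_type": "final"}})
    yield json.dumps({"type": "audio", "data": {"audio": "never read"}})


chunks = asyncio.run(collect_audio(fake_stream()))
print(chunks)  # ['AAAA', 'BBBB']
```

Without a completion event, a client typically has to guess when the stream is finished, for example with a timeout. With it, the loop can exit as soon as the last audio chunk has arrived.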
Streaming TTS WebSocket – Integration Guide
Easily convert text to speech in real time using Sarvam’s low-latency WebSocket-based TTS API.
Input Message Types
- Config Message
- Text Message
- Flush Message
- Ping Message
Config Message
Sets up voice parameters and must be the first message sent after connection. Parameters:
- min_buffer_size: Minimum character length that triggers buffer flushing for TTS model processing
- max_chunk_length: Maximum length for sentence splitting (adjust based on content length)
- output_audio_codec: Supports multiple formats: mp3, wav, aac, opus, flac, pcm (LINEAR16), mulaw (μ-law), and alaw (A-law)
- output_audio_bitrate: Choose from 5 supported bitrate options
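As a rough illustration of assembling a config message from these parameters, the sketch below validates the codec against the formats listed above and serializes the result. The `type` / `data` envelope, the helper name `build_config`, and the example values (50, 200) are assumptions. The bitrate is passed through unvalidated because the supported options are not listed here.

```python
import json
from typing import Optional

# Codec names as listed in the parameter description above.
SUPPORTED_CODECS = {"mp3", "wav", "aac", "opus", "flac", "pcm", "mulaw", "alaw"}


def build_config(
    min_buffer_size: int,
    max_chunk_length: int,
    output_audio_codec: str,
    output_audio_bitrate: Optional[str] = None,
) -> str:
    """Serialize a config message; the envelope shape is hypothetical."""
    if output_audio_codec not in SUPPORTED_CODECS:
        raise ValueError(f"unsupported codec: {output_audio_codec!r}")
    if min_buffer_size <= 0 or max_chunk_length <= 0:
        raise ValueError("min_buffer_size and max_chunk_length must be positive")
    data = {
        "min_buffer_size": min_buffer_size,
        "max_chunk_length": max_chunk_length,
        "output_audio_codec": output_audio_codec,
    }
    if output_audio_bitrate is not None:
        # Not validated: the five supported values are not enumerated in this guide.
        data["output_audio_bitrate"] = output_audio_bitrate
    return json.dumps({"type": "config", "data": data})


# Illustrative values only; tune them to your content.
print(build_config(min_buffer_size=50, max_chunk_length=200, output_audio_codec="wav"))
```

Validating the codec on the client side surfaces typos before the connection is opened, rather than as an error response mid-stream.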