Streaming Text-to-Speech API
Real-time Processing
Real-time conversion of text into spoken audio, where the audio is generated and played back progressively as the text is being processed.
- Efficient for long texts
- Real-time conversion
- Handle multiple requests easily
- Low latency audio generation and faster responses
Features
Low Latency Playback
- Audio starts playing immediately as the text is processed
- Speaks dynamic or live content as it comes in
Language Support
- Supports multiple Indian languages and English
- Language code specification (e.g., “kn-IN” for Kannada)
- High accuracy transcription
Efficient Resource Usage
- Streams small chunks of audio instead of generating everything at once.
- Uses less memory and keeps performance stable even with long texts.
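As a rough illustration of why chunked streaming keeps memory use flat, the sketch below consumes audio one chunk at a time. `fake_audio_stream` is a hypothetical stand-in for audio messages arriving over the WebSocket; it is not part of the SDK, and the chunk size is arbitrary.

```python
import asyncio
import io


async def fake_audio_stream(n_chunks, chunk_size=1024):
    """Hypothetical stand-in for audio chunks arriving from the server."""
    for i in range(n_chunks):
        await asyncio.sleep(0)  # yield control, as a real network read would
        yield bytes([i % 256]) * chunk_size


async def consume(stream, sink):
    """Write each chunk to the sink as soon as it arrives.

    This function holds only one chunk at a time and returns the total
    number of bytes written.
    """
    total = 0
    async for chunk in stream:
        sink.write(chunk)
        total += len(chunk)
    return total


def run_demo():
    # BytesIO is used here only for the demo; in practice the sink would
    # be a file or an audio player buffer.
    sink = io.BytesIO()
    return asyncio.run(consume(fake_audio_stream(8), sink))


if __name__ == "__main__":
    print(run_demo())
```

Because the consumer forwards each chunk immediately, playback can begin before the full text has been synthesized.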
Integration
- Python and JavaScript SDKs with async support
- WebSocket connections
- Easy-to-use API interface
Code Examples
Best Practices
- Always send the config message first
- Keep text chunks under 500 characters for optimal streaming
- Use flush messages strategically to ensure complete text processing
- Send ping messages to maintain long-running connections
- Handle error responses appropriately in your application logic
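The ordering and chunking practices above can be sketched as follows. This is a minimal illustration, not the SDK's implementation: the JSON envelope (`type` / `data` keys) and the `build_messages` and `split_text` helpers are assumptions made for the example, so check the API reference for the actual message schema.

```python
import json

MAX_CHUNK_CHARS = 500  # the chunk-size limit recommended above


def split_text(text, limit=MAX_CHUNK_CHARS):
    """Split text into chunks of at most `limit` characters, on word boundaries."""
    chunks = []
    current = ""
    for word in text.split():
        # A single word longer than the limit is hard-split.
        while len(word) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(word[:limit])
            word = word[limit:]
        candidate = word if not current else current + " " + word
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


def build_messages(text, config):
    """Return ordered JSON messages: config first, then text chunks, then flush.

    The envelope shape used here is hypothetical.
    """
    messages = [{"type": "config", "data": config}]
    messages += [{"type": "text", "data": {"text": c}} for c in split_text(text)]
    messages.append({"type": "flush"})
    return [json.dumps(m) for m in messages]


# A ping (hypothetical shape) would be sent periodically on long-lived connections.
PING_MESSAGE = json.dumps({"type": "ping"})
```

Sending the returned messages in order over the WebSocket satisfies the config-first rule, and the trailing flush asks the server to process any text still buffered.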
Streaming TTS WebSocket – Integration Guide
Easily convert text to speech in real time using Sarvam’s low-latency WebSocket-based TTS API.
Input Message Types
- Config Message
- Text Message
- Flush Message
- Ping Message
Config Message
Sets up voice parameters and must be the first message sent after connection. Parameters:
- min_buffer_size: Minimum character length that triggers buffer flushing for TTS model processing
- max_chunk_length: Maximum length for sentence splitting (adjust based on content length)
- output_audio_codec: Currently supports MP3 only (optimized for real-time playback)
- output_audio_bitrate: Choose from 5 supported bitrate options
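As a sketch of how these parameters might be assembled into a config message, the helper below builds a JSON payload. The function name `build_config_message`, the `type` / `data` envelope, the codec string `"mp3"`, and all numeric and bitrate values are assumptions for illustration only; they are not documented defaults.

```python
import json


def build_config_message(min_buffer_size, max_chunk_length, output_audio_bitrate,
                         output_audio_codec="mp3", **voice_params):
    """Build a hypothetical config message from the four parameters above.

    Extra keyword arguments are passed through as additional voice parameters.
    """
    # Only MP3 is currently supported, so reject anything else early.
    if output_audio_codec != "mp3":
        raise ValueError("only MP3 output is currently supported")
    data = {
        "min_buffer_size": min_buffer_size,
        "max_chunk_length": max_chunk_length,
        "output_audio_codec": output_audio_codec,
        "output_audio_bitrate": output_audio_bitrate,
    }
    data.update(voice_params)
    return json.dumps({"type": "config", "data": data})


if __name__ == "__main__":
    # Placeholder values; consult the API reference for the supported bitrates.
    print(build_config_message(min_buffer_size=50, max_chunk_length=200,
                               output_audio_bitrate="128k"))
```

Validating the codec on the client side surfaces an unsupported setting before the connection is opened, rather than as an error response from the server.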