Speech-to-Text APIs
Sarvam AI offers two speech recognition models:
Saaras v3 (recommended): State-of-the-art ASR model with flexible output modes: transcribe, translate, verbatim, transliterate, and codemix. Best choice for new integrations.
Saarika v2.5 (legacy): ASR model that transcribes Indian language speech into the same spoken language. It will be deprecated soon; migrate to Saaras v3.
Three API types are available:
REST API: Synchronous processing for files under 30 seconds.
Batch API: Asynchronous processing for files up to 1 hour.
Streaming API: Real-time audio streaming with instant results.
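The limits above can be turned into a small routing helper. The sketch below is illustrative only: `choose_api` and the returned labels are hypothetical names, not part of any Sarvam SDK, and the only facts it encodes are the "under 30 seconds" and "up to 1 hour" limits stated above.

```python
# Hypothetical helper: picks an API type from the limits described above.
REST_MAX_SECONDS = 30          # REST: files under 30 seconds
BATCH_MAX_SECONDS = 60 * 60    # Batch: files up to 1 hour


def choose_api(duration_seconds: float, realtime: bool = False) -> str:
    """Return "streaming", "rest", or "batch" for a given audio duration."""
    if realtime:
        # Live audio goes to the Streaming API regardless of length.
        return "streaming"
    if duration_seconds < 0:
        raise ValueError("duration_seconds must be non-negative")
    if duration_seconds < REST_MAX_SECONDS:
        return "rest"
    if duration_seconds <= BATCH_MAX_SECONDS:
        return "batch"
    raise ValueError("audio exceeds the 1-hour Batch API limit")


if __name__ == "__main__":
    for seconds in (12, 45, 3600):
        print(seconds, "->", choose_api(seconds))
```

A file of exactly 30 seconds is routed to Batch here because the REST limit is stated as "under 30 seconds".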
Not sure which one fits your audio length and latency needs? See Which Speech-to-Text API to Use for a side-by-side comparison of REST, WebSocket, and Batch.
The STT and STTT REST and Batch APIs support over 10 major audio formats and MIME type variants. Supported formats and MIME types are listed below:
For most audio formats, our API automatically detects the codec. However, when using PCM formats (pcm_s16le, pcm_l16, pcm_raw), you must explicitly specify the input_audio_codec parameter. PCM files are only supported at a 16 kHz sample rate.
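As a rough sketch of the PCM rule, the helper below decides whether a request needs the input_audio_codec field. Only the parameter name, the three PCM codec values, and the 16 kHz requirement come from the text above; `codec_params` is a hypothetical function, and how the resulting field is attached to a real request is left out because it depends on the client you use.

```python
from typing import Dict, Optional

# PCM codecs that require an explicit input_audio_codec (from the note above).
PCM_CODECS = {"pcm_s16le", "pcm_l16", "pcm_raw"}
PCM_SAMPLE_RATE_HZ = 16_000  # PCM is only supported at 16 kHz


def codec_params(audio_format: str,
                 sample_rate_hz: Optional[int] = None) -> Dict[str, str]:
    """Return extra request fields needed for the given audio format.

    Hypothetical helper: returns {} when the codec is auto-detected.
    """
    fmt = audio_format.strip().lower()
    if fmt not in PCM_CODECS:
        # Non-PCM formats rely on automatic codec detection.
        return {}
    if sample_rate_hz != PCM_SAMPLE_RATE_HZ:
        raise ValueError("PCM input must be sampled at 16 kHz")
    return {"input_audio_codec": fmt}


if __name__ == "__main__":
    print(codec_params("mp3"))
    print(codec_params("pcm_s16le", sample_rate_hz=16_000))
```

Failing early on a wrong sample rate is a design choice of this sketch; it avoids uploading a file that does not meet the stated PCM requirement.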
WebSocket/Streaming APIs: The STT and STTT WebSocket streaming APIs only support WAV and raw PCM formats (wav, pcm_s16le, pcm_l16, pcm_raw). Other audio formats are not supported for real-time streaming.
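A client can check the format before opening a streaming connection. This is a minimal sketch: `supports_streaming` is a hypothetical name, and the set contains only the four formats listed above.

```python
# Formats accepted by the WebSocket streaming APIs, per the note above.
STREAMING_FORMATS = frozenset({"wav", "pcm_s16le", "pcm_l16", "pcm_raw"})


def supports_streaming(audio_format: str) -> bool:
    """Return True if the format can be sent over the streaming API."""
    return audio_format.strip().lower() in STREAMING_FORMATS


if __name__ == "__main__":
    for fmt in ("wav", "mp3", "pcm_raw"):
        print(fmt, supports_streaming(fmt))
```

Audio in any other format would need to be converted to WAV or raw PCM first, or sent through the REST or Batch APIs instead.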
Need help choosing the right API? Contact us on Discord for guidance.