
Speech-to-Text APIs


Sarvam AI offers two speech recognition models: Saaras v3 (recommended), a state-of-the-art ASR model with five output modes (transcribe, translate, verbatim, transliterate, and codemix), and Saarika v2.5, a legacy model that will be deprecated. New and existing integrations should migrate to Saaras v3.

Saaras v3 (Recommended)

State-of-the-art ASR model with flexible output modes: transcribe, translate, verbatim, transliterate, and codemix. Best choice for new integrations.

Saarika v2.5

ASR model that transcribes Indian-language speech into text in the same language that was spoken. It will be deprecated soon; migrate to Saaras v3.

API Types

Available API types: REST API for synchronous processing (files under 30 seconds), Batch API for asynchronous processing (files up to 1 hour), and Streaming API for real-time audio with instant results.

REST API

Synchronous processing for files under 30 seconds.

Batch API

Asynchronous processing for files up to 1 hour.

Streaming API

Real-time audio streaming with instant results.

Not sure which one fits your audio length and latency needs? See Which Speech-to-Text API to Use for a side-by-side comparison of REST, Batch, and Streaming (WebSocket).
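The limits above can be turned into a small routing rule. The following is a minimal sketch, not part of any Sarvam SDK: the helper name `choose_api` and its return labels are hypothetical, and only the 30-second and 1-hour thresholds come from this page.

```python
from typing import Optional

REST_MAX_SECONDS = 30         # REST API: files under 30 seconds
BATCH_MAX_SECONDS = 60 * 60   # Batch API: files up to 1 hour


def choose_api(duration_seconds: Optional[float], realtime: bool = False) -> str:
    """Return "rest", "batch", or "streaming" (hypothetical labels)."""
    if realtime:
        # Live audio goes to the Streaming API regardless of length.
        return "streaming"
    if duration_seconds is None or duration_seconds < 0:
        raise ValueError("a non-negative duration is required for recorded files")
    if duration_seconds < REST_MAX_SECONDS:
        return "rest"
    if duration_seconds <= BATCH_MAX_SECONDS:
        return "batch"
    raise ValueError("audio longer than 1 hour exceeds the Batch API limit")
```

For example, `choose_api(12.5)` selects the REST API, while a 30-second file already falls to the Batch API because the REST limit is "under 30 seconds".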

Supported Audio Formats & MIME Types

The STT and STTT REST and Batch APIs accept more than 10 major audio formats, along with their common MIME type variants:

Format Group | Supported MIME Types
MP3 Variants | mpeg, mp3, mpeg3, x-mpeg-3, x-mp3
WAV Variants | wav, x-wav, wave
AAC Variants | aac, x-aac
AIFF Variants | aiff, x-aiff
OGG / Opus Formats | ogg, opus
FLAC Variants (Lossless) | flac, x-flac
MP4 / M4A Audio | mp4, x-m4a
AMR (Narrowband) | amr
WMA (Windows Media Audio) | x-ms-wma
WEBM (Audio & Video) | webm (audio and video)
PCM Formats | pcm_s16le, pcm_l16, pcm_raw

For most audio formats, the API detects the codec automatically. PCM formats (pcm_s16le, pcm_l16, pcm_raw) are the exception: you must set the input_audio_codec parameter explicitly, and PCM audio is supported only at a 16 kHz sample rate.
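The PCM rule can be enforced before a request is sent. This is a rough sketch under stated assumptions: the helper `build_codec_fields` is hypothetical, and passing the format name itself as the value of `input_audio_codec` is an assumption to confirm against the API reference. Only the parameter name, the three PCM format names, and the 16 kHz limit come from this page.

```python
from typing import Dict, Optional

PCM_FORMATS = {"pcm_s16le", "pcm_l16", "pcm_raw"}
PCM_SAMPLE_RATE_HZ = 16_000  # PCM is supported only at 16 kHz


def build_codec_fields(audio_format: str,
                       sample_rate_hz: Optional[int] = None) -> Dict[str, str]:
    """Return extra request fields needed for the given audio format."""
    fmt = audio_format.strip().lower()
    if fmt not in PCM_FORMATS:
        # Codec is auto-detected for non-PCM formats, so no extra field.
        return {}
    if sample_rate_hz != PCM_SAMPLE_RATE_HZ:
        raise ValueError("PCM audio is only supported at a 16 kHz sample rate")
    # Assumption: the parameter value is the PCM format name.
    return {"input_audio_codec": fmt}
```

The returned dictionary is meant to be merged into whatever form fields the request already carries; for a WAV or MP3 upload it is empty.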

WebSocket/Streaming APIs: The STT and STTT WebSocket streaming APIs only support WAV and raw PCM formats (wav, pcm_s16le, pcm_l16, pcm_raw). Other audio formats are not supported for real-time streaming.
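A client can check this restriction up front and fall back to REST or Batch for other formats. The sketch below is illustrative only; `supports_streaming` is a hypothetical helper, and the format set is copied from the note above.

```python
# Formats accepted by the WebSocket streaming APIs, per the note above.
STREAMING_FORMATS = frozenset({"wav", "pcm_s16le", "pcm_l16", "pcm_raw"})


def supports_streaming(audio_format: str) -> bool:
    """True if the format can be sent over the streaming APIs."""
    return audio_format.strip().lower() in STREAMING_FORMATS
```

Audio in any other format, such as MP3 or FLAC, would need to be converted to WAV or raw PCM first, or sent through the REST or Batch API instead.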


Technical Capabilities

Language Support
  • 22 Indian languages (Saaras v3)
  • Automatic language detection
  • Code-mixing support
  • Multi-speaker handling
Advanced Processing
  • Speaker diarization (Batch API)
  • Timestamp generation
  • Entity preservation
  • Telephony optimization

Next Steps

1. Choose Your API
   Select the appropriate API type based on your use case.

2. Get API Key
   Sign up and get your API key from the dashboard.

3. Go Live
   Deploy your integration and monitor usage in the dashboard.

Need help choosing the right API? Contact us on Discord for guidance.