Speech-to-Text REST API

Synchronous Processing

Process short audio files and get the transcript back in the same response. Best for quick transcriptions and testing. Each request accepts at most 30 seconds of audio.

Saaras v3 is our latest state-of-the-art speech recognition model with flexible output formats. It supports multiple modes for different use cases: transcribe, translate, verbatim, translit, and codemix.

Recommended for new integrations. Saaras v3 offers improved accuracy and flexible output modes. Learn more about Saaras v3.

Output Modes

| Mode | Description |
| --- | --- |
| transcribe (default) | Standard transcription in the original language |
| translate | Translates speech to English |
| verbatim | Exact word-for-word transcription |
| translit | Romanization to Latin script |
| codemix | Code-mixed text output |

Code Examples for Saaras v3

```python
from sarvamai import SarvamAI

client = SarvamAI(
    api_subscription_key="YOUR_SARVAM_API_KEY",
)

# Transcribe mode (default)
response = client.speech_to_text.transcribe(
    file=open("audio.wav", "rb"),
    model="saaras:v3",
    mode="transcribe",  # or "translate", "verbatim", "translit", "codemix"
)

print(response)
```

Check out our detailed API Reference to explore all available options.

Preparing Your Audio

Most failed STT requests are caused by the audio itself, not the API call. Run through this checklist before uploading:

| Check | Recommendation |
| --- | --- |
| Duration | The sync REST API accepts up to 30 seconds per request. For longer files, use the Batch API (up to 1 hour per file) or split the audio into chunks of at most 30 seconds. |
| Sample rate | 16 kHz is recommended. 8 kHz telephony audio (IVR, call recordings) is fully supported, so there is no need to upsample. |
| Channels | Use mono. For stereo telephony recordings with one speaker per channel, split the channels and transcribe each separately to keep speakers separated. |
| Format | WAV, MP3, AAC, FLAC, or OGG. Prefer WAV (16-bit PCM) for best accuracy. |
| File integrity | Verify the file exists and is non-empty before uploading (file.size > 0 in browsers). Pass a file object, not a path string, e.g. file=open("audio.wav", "rb") in Python. |
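
The checklist can be partly automated for WAV input. The following is a minimal sketch using only Python's standard `wave` module; the function name `preflight_wav` and the exact set of checks are illustrative assumptions, not part of the Sarvam SDK, and other formats (MP3, AAC, FLAC, OGG) would need a different reader.

```python
import os
import wave

MAX_SYNC_SECONDS = 30  # sync REST limit stated in the checklist


def preflight_wav(path):
    """Return a list of issues found in a WAV file; an empty list means none were found.

    Hypothetical helper: it mirrors the checklist, not any server-side validation.
    """
    if not os.path.isfile(path) or os.path.getsize(path) == 0:
        return ["file is missing or empty"]

    issues = []
    with wave.open(path, "rb") as wav:
        duration = wav.getnframes() / wav.getframerate()
        if duration > MAX_SYNC_SECONDS:
            issues.append(f"duration {duration:.1f}s exceeds {MAX_SYNC_SECONDS}s")
        if wav.getnchannels() != 1:
            issues.append(f"{wav.getnchannels()} channels; mono is recommended")
        if wav.getframerate() not in (8000, 16000):
            # Assumption: other rates are only flagged, since the checklist names 8 and 16 kHz.
            issues.append(f"sample rate {wav.getframerate()} Hz; 16 kHz is recommended")
        if wav.getsampwidth() != 2:
            issues.append(f"{8 * wav.getsampwidth()}-bit samples; 16-bit PCM is preferred")
    return issues
```

Calling `preflight_wav("audio.wav")` before uploading and refusing to send when the list is non-empty is one way to catch these problems early. Only the duration and missing-file checks correspond to hard limits; the rest are recommendations.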

Not sure whether to use REST, Batch, or Streaming? See the Which API to Use decision table for a side-by-side comparison of limits, latency, and features.
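
If you stay on the sync REST API with longer audio, the file has to be cut into pieces of at most 30 seconds first. Below is a rough sketch for WAV files using the standard `wave` module; `split_wav` and the output naming scheme are hypothetical, and the cuts fall at fixed offsets, so a word at a boundary may be split across two chunks.

```python
import wave

CHUNK_SECONDS = 30  # per-request limit of the sync REST API


def split_wav(path, out_prefix, chunk_seconds=CHUNK_SECONDS):
    """Split a WAV file into consecutive chunks of at most chunk_seconds each.

    Returns the chunk file paths in playback order.
    """
    chunk_paths = []
    with wave.open(path, "rb") as src:
        frames_per_chunk = src.getframerate() * chunk_seconds
        index = 0
        while True:
            frames = src.readframes(frames_per_chunk)
            if not frames:
                break  # end of the source file
            chunk_path = f"{out_prefix}_{index:03d}.wav"
            with wave.open(chunk_path, "wb") as dst:
                # Copy the audio format so each chunk matches the source.
                dst.setnchannels(src.getnchannels())
                dst.setsampwidth(src.getsampwidth())
                dst.setframerate(src.getframerate())
                dst.writeframes(frames)
            chunk_paths.append(chunk_path)
            index += 1
    return chunk_paths
```

Each returned path can then be transcribed in its own request and the transcripts joined in order. For anything beyond occasional long files, the Batch API mentioned above is likely the simpler route.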


Legacy Models (Soon to Be Deprecated)

The following models will be deprecated soon. Use Saaras v3 for new integrations, and plan to migrate existing ones.

Saarika v2.5: Speech to Text Transcription

Saarika is a speech-to-text transcription model that excels in handling multi-speaker content, mixed language content, and conference recordings.

Deprecation Notice: Saarika v2.5 will be deprecated soon. Use Saaras v3 with mode="transcribe" instead — see the code examples above.

Saaras v2.5: Speech to Text Translation

Saaras v2.5 is available in the Speech-to-Text Translate endpoint for translating speech directly to English.

Deprecation Notice: Saaras v2.5 will be deprecated soon. Use Saaras v3 with mode="translate" instead — see the code examples above.
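
The two deprecation notices amount to a small parameter mapping. The sketch below illustrates it; the legacy identifiers `saarika:v2.5` and `saaras:v2.5` and the helper `migrate_params` are assumptions for illustration, so check them against the model strings your code actually sends.

```python
# Assumed legacy model identifiers mapped to their Saaras v3 equivalents.
LEGACY_TO_V3 = {
    "saarika:v2.5": {"model": "saaras:v3", "mode": "transcribe"},
    "saaras:v2.5": {"model": "saaras:v3", "mode": "translate"},
}


def migrate_params(params):
    """Return a copy of params with a legacy model swapped for Saaras v3 plus the matching mode."""
    replacement = LEGACY_TO_V3.get(params.get("model"))
    if replacement is None:
        return dict(params)  # already on a non-legacy model; leave unchanged
    return {**params, **replacement}
```

This only rewrites keyword arguments. Because Saaras v2.5 was served from the separate Speech-to-Text Translate endpoint, the call site itself may also need to change to the transcribe call shown in the code examples.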

API Response Format

Speech to Text Transcription Response

| Field | Type | Description |
| --- | --- | --- |
| request_id | string | Unique identifier for the request |
| transcript | string | The transcribed text from the audio file |
| language_code | string | BCP-47 code of the detected language (e.g., hi-IN). Returns null if no language is detected |

```json
{
  "request_id": "20241115_12345678-1234-5678-1234-567812345678",
  "transcript": "नमस्ते, आप कैसे हैं?",
  "language_code": "hi-IN"
}
```
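
Because `language_code` can be null, code that consumes the raw JSON should not assume it is always a string. A small sketch of defensive handling follows; the helper name `summarize` and the "unknown" fallback are illustrative choices, and with the Python SDK you would read the same fields as attributes (for example `response.transcript`).

```python
import json

raw = """{
  "request_id": "20241115_12345678-1234-5678-1234-567812345678",
  "transcript": "नमस्ते, आप कैसे हैं?",
  "language_code": "hi-IN"
}"""


def summarize(payload):
    """Format a transcription response, tolerating a null language_code."""
    language = payload.get("language_code") or "unknown"
    return f"[{language}] {payload['transcript']}"


print(summarize(json.loads(raw)))
```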

Speech to Text Translation Response

| Field | Type | Description |
| --- | --- | --- |
| request_id | string | Unique identifier for the request |
| transcript | string | Translated text in English |
| language_code | string | BCP-47 code of the detected source language |

Supported source languages: hi-IN, bn-IN, kn-IN, ml-IN, mr-IN, od-IN, pa-IN, ta-IN, te-IN, gu-IN, en-IN

```json
{
  "request_id": "20241115_12345678-1234-5678-1234-567812345678",
  "transcript": "Hello, how are you?",
  "language_code": "hi-IN"
}
```

Error Responses

All errors return a JSON object with an error field (message, code, request_id). The full error-code table, retry guidance, and SDK exception reference live on the central Errors & Troubleshooting page.

Errors specific to this endpoint:

| HTTP Status | Error Code | When This Happens | What To Do |
| --- | --- | --- | --- |
| 422 | unprocessable_entity_error | Invalid audio format, file too large, or audio over 30 seconds | Use supported formats (WAV, MP3, AAC, FLAC, OGG); for longer audio use the Batch API |

```python
from sarvamai import SarvamAI
from sarvamai.core.api_error import ApiError

client = SarvamAI(api_subscription_key="YOUR_SARVAM_API_KEY")

try:
    response = client.speech_to_text.transcribe(
        file=open("audio.wav", "rb"),
        model="saaras:v3",
        mode="transcribe",
    )
    print(response.transcript)
except ApiError as e:
    if e.status_code == 400:
        print(f"Bad request: {e.body}")
    elif e.status_code == 403:
        print("Invalid API key. Check your credentials.")
    elif e.status_code == 422:
        # Endpoint-specific: invalid format, file too large, or audio over 30 seconds.
        print(f"Unprocessable audio: {e.body}")
    elif e.status_code == 429:
        print("Rate limit exceeded. Wait and retry.")
    elif e.status_code == 503:
        print("Service overloaded. Retry with backoff.")
    else:
        print(f"Error {e.status_code}: {e.body}")
```
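
The 429 and 503 cases call for retrying rather than just printing a message. Here is one possible sketch of exponential backoff with jitter; `call_with_backoff` is a hypothetical helper, not an SDK function, and the attempt count and delays are arbitrary defaults. It assumes only that the raised exception exposes a `status_code` attribute, as `ApiError` does in the example above.

```python
import random
import time

RETRYABLE_STATUS = {429, 503}  # rate limited, service overloaded


def call_with_backoff(call, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Run call(); on a retryable status, wait base_delay * 2**attempt plus jitter, then retry.

    Non-retryable errors, and the error from the final attempt, are re-raised unchanged.
    """
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception as exc:
            status = getattr(exc, "status_code", None)
            is_last_attempt = attempt == max_attempts - 1
            if status not in RETRYABLE_STATUS or is_last_attempt:
                raise
            # Jitter spreads out retries from clients that failed at the same moment.
            sleep(base_delay * 2 ** attempt + random.uniform(0, 0.5))
```

To use it, wrap the transcription request in a function and pass that function as `call`. Open the audio file inside that function so every attempt reads it from the start.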

Next Steps

1. Get API Key: Sign up and get your API key from the dashboard.
2. Test Integration: Try the API with sample audio files.
3. Go Live: Deploy your integration and monitor usage.

Need help? Contact us on Discord for guidance.