STT API Tutorial Using Saarika Model

🔗 Overview

This notebook demonstrates how to use the Saarika model with the Speech-to-Text (STT) API. It covers both short and long audio transcription, including how to split large files into chunks and transcribe them using the real-time API.
It includes instructions for installation, setting up the API key, uploading audio files, and using the API for transcription.

1. Installation

Before you begin, ensure you have the necessary Python libraries installed. Run the following commands to install the required packages:

!pip install sarvamai
from sarvamai import SarvamAI

2. Authentication

To use the API, you need an API subscription key. Follow these steps to set up your API key:

  1. Obtain your API key: If you don’t have an API key, sign up on the Sarvam AI Dashboard to get one.
  2. Replace the placeholder key: In the code below, replace “YOUR_SARVAM_AI_API_KEY” with your actual API key.
SARVAM_API_KEY = "YOUR_SARVAM_AI_API_KEY"

2.1 Initialize the Client

Create a Sarvam client instance using your API key. This client will be used to interact with the Speech-to-Text API.

client = SarvamAI(api_subscription_key=SARVAM_API_KEY)
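If you prefer not to hardcode the key in the notebook, you can read it from an environment variable instead. A minimal sketch, assuming the key has been exported as SARVAM_API_KEY (an example variable name, not one required by the SDK):

import os

# Read the API key from an environment variable instead of hardcoding it.
# "SARVAM_API_KEY" is just an example variable name.
SARVAM_API_KEY = os.environ.get("SARVAM_API_KEY", "YOUR_SARVAM_AI_API_KEY")
client = SarvamAI(api_subscription_key=SARVAM_API_KEY)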

3. Uploading Audio Files

To transcribe audio, you need to provide a .wav or .mp3 file.

✅ Supported Environments:

  • Google Colab
  • Jupyter Notebook (VS Code, JupyterLab, etc.)

📝 Instructions:

  • Ensure your audio file is in .wav or .mp3 format.
  • Run the cell below. The uploader will automatically adjust based on your environment:
    • In Google Colab: You’ll be prompted to upload a .wav or .mp3 file via a file picker.
    • In Jupyter Notebook: You’ll be prompted to enter the full file path of the .wav or .mp3 file stored locally on your machine.
  • Once provided, the file will be available for use in the next step.
import sys
import os

def get_audio_file():
    supported_formats = ['.wav', '.mp3']

    if 'google.colab' in sys.modules:
        # Running in Google Colab: use upload widget
        from google.colab import files
        uploaded = files.upload()
        audio_file_path = list(uploaded.keys())[0]
        ext = os.path.splitext(audio_file_path)[1].lower()
        if ext not in supported_formats:
            print(f"Unsupported file format '{ext}'. Please upload a WAV or MP3 file.")
            return None
        print(f"File '{audio_file_path}' uploaded successfully in Colab!")
        return audio_file_path
    else:
        # Running in Jupyter Notebook: input file path
        audio_file_path = input("Enter the path to your MP3 or WAV file: ").strip()
        ext = os.path.splitext(audio_file_path)[1].lower()
        if not os.path.exists(audio_file_path):
            print(f"File not found at: {audio_file_path}")
            return None
        if ext not in supported_formats:
            print(f"Unsupported file format '{ext}'. Please provide a WAV or MP3 file.")
            return None
        print(f"File '{audio_file_path}' found successfully in Jupyter!")
        return audio_file_path
# Enter the file path (or upload the file in Colab) and press Enter/Return.
audio_file_path = get_audio_file()

4. Saarika-v2.5 Usage for STT

The Saarika model can be used for converting speech to text across different scenarios.
It supports basic transcription, code-mixed speech, and automatic language detection for Indian languages.

4.1 Basic Usage

Basic transcription with a specified language code.
Ideal for single-language audio with clear speech and minimal noise.

if audio_file_path:
    with open(audio_file_path, "rb") as audio_file:
        response = client.speech_to_text.transcribe(
            file=audio_file,
            model="saarika:v2.5",
            language_code="en-IN"
        )
    print("✅ Transcription Response:")
    print(response)
else:
    print("🚫 No audio file found. Transcription aborted.")
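The printed response contains the full API payload. If you only need the recognized text, you can read it from the response's transcript field (assumed here based on the STT API's response schema):

if audio_file_path:
    # `transcript` is assumed based on the STT API's response schema;
    # adjust the attribute name if your SDK version differs.
    print("Transcript only:", response.transcript)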

4.2 Code-Mixed Speech

Handles mid-sentence language switches intelligently.
Perfect for conversational speech in Indian multilingual settings.

if audio_file_path:
    with open(audio_file_path, "rb") as audio_file:
        response = client.speech_to_text.transcribe(
            file=audio_file,
            model="saarika:v2.5"
        )
    print(response)
else:
    print("No valid audio file found.")

4.3 Automatic Language Detection

Let Saarika detect the spoken language automatically.
Useful when input language is unknown or for multilingual speech.

if audio_file_path:
    with open(audio_file_path, "rb") as audio_file:
        response = client.speech_to_text.transcribe(
            file=audio_file,
            model="saarika:v2.5",
            language_code="unknown"
        )
    print(response)
else:
    print("No valid audio file found.")

5. Handling Long Audio Files

If your audio file exceeds the 30-second limit supported by the real-time transcription API, you must split it into smaller chunks for accurate and successful transcription. These smaller segments are then transcribed individually using the real-time API, and the results are stitched back together to form the final transcript.

👉 For large audio files, switch to the Batch API designed for longer durations.
🔗 Try the Batch API here


📝 When to Use

  • Audio length > 30 seconds
  • Real-time API returns timeout or error due to size
  • You want to batch process long audio files for better accuracy and reliability

⚙️ How It Works

  1. The full .mp3 or .wav file is first split into smaller chunks (e.g., 29 seconds each)
  2. Each chunk is then transcribed individually using the real-time API
  3. The individual results are finally combined to form one seamless transcript

> ⚠️ For short audio files (<30 seconds), you can skip this step and directly proceed with transcription using the real-time API.

The functions below help with:

  • Preventing real-time API timeouts
  • Splitting large .wav or .mp3 files into smaller chunks
  • Transcribing each chunk using Saarika:v2.5
  • Collating the results into a single transcript

5.1 Define the split_audio_ffmpeg Function

This function splits a long .mp3 or .wav audio file into smaller chunks (default: 29 seconds) using FFmpeg. It ensures each segment remains within the real-time API’s 30-second limit and stores them in the specified output directory.

import os
import subprocess

def split_audio_ffmpeg(audio_path, chunk_duration=29, output_dir="chunks"):
    os.makedirs(output_dir, exist_ok=True)
    ext = os.path.splitext(audio_path)[1].lower()
    base_name = os.path.splitext(os.path.basename(audio_path))[0]
    output_pattern = os.path.join(output_dir, f"{base_name}_%03d{ext}")

    codec = "pcm_s16le" if ext == ".wav" else "libmp3lame"

    command = [
        "ffmpeg",
        "-i", audio_path,
        "-f", "segment",
        "-segment_time", str(chunk_duration),
        "-c:a", codec,
        output_pattern
    ]

    print("Running command:", " ".join(command))

    result = subprocess.run(command, capture_output=True, text=True)
    print("Return code:", result.returncode)
    print("STDOUT:\n", result.stdout)
    print("STDERR:\n", result.stderr)

    output_files = sorted([
        os.path.join(output_dir, f) for f in os.listdir(output_dir)
        if f.endswith(ext)
    ])

    print("Chunks generated:", output_files)
    return output_files
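Note: split_audio_ffmpeg shells out to FFmpeg, so the ffmpeg binary must be available on your PATH (it is typically preinstalled on Google Colab). A quick check before splitting, using only the standard library:

import shutil

# Verify that the ffmpeg binary is available before attempting to split audio.
if shutil.which("ffmpeg") is None:
    print("ffmpeg not found. Install it first (e.g. apt-get install ffmpeg on Debian/Ubuntu).")
else:
    print("ffmpeg is available.")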

5.2 Define the transcribe_audio_chunks_sdk Function

This function takes the list of chunked audio file paths and uses the Saarika real-time API to transcribe each one individually. It collects all partial transcriptions and combines them into a single, complete transcript.

def transcribe_audio_chunks_sdk(chunk_paths, client, model="saarika:v2.5", language_code="en-IN"):
    full_transcript = []

    for idx, chunk_path in enumerate(chunk_paths):
        print(f"\n🔄 Transcribing chunk {idx + 1}/{len(chunk_paths)} → {chunk_path}")
        with open(chunk_path, "rb") as audio_file:
            try:
                response = client.speech_to_text.transcribe(
                    file=audio_file,
                    model=model,
                    language_code=language_code
                )
                print("✅ Chunk Response:", response)
                full_transcript.append(str(response))
            except Exception as e:
                print(f"❌ Error with chunk {chunk_path}: {e}")

    return " ".join(full_transcript).strip()

5.3 Putting It All Together

Call the split_audio_ffmpeg() function first to break the audio into chunks, and then pass those chunks to transcribe_audio_chunks_sdk() for transcription. This two-step process ensures large audio files are handled smoothly using the real-time API.

# 1. Split the audio
chunks = split_audio_ffmpeg(audio_file_path)

# 2. Transcribe each chunk and collate
if chunks:
    final_transcript = transcribe_audio_chunks_sdk(chunks, client)
    print("\n📝 Final Combined Transcript:\n")
    print(final_transcript)
else:
    print("🚫 No audio chunks generated. Transcription aborted.")
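Once the combined transcript has been produced, the intermediate chunk files are no longer needed. An optional cleanup step, assuming the default chunks output directory used above:

import shutil

# Optional: remove the temporary chunk files created by split_audio_ffmpeg.
# "chunks" is the default output_dir used above.
shutil.rmtree("chunks", ignore_errors=True)
print("Temporary chunk files removed.")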

6. Error Handling

You may encounter these errors while using the API:

  • 403 Forbidden (invalid_api_key_error)

    • Cause: Invalid or missing API key.
    • Solution: Verify that your API key is correct and active on the Sarvam AI Dashboard.
  • 429 Too Many Requests (insufficient_quota_error)

    • Cause: Exceeded API quota.
    • Solution: Check your usage, upgrade if needed, or implement exponential backoff when retrying (see the sketch after this list).
  • 500 Internal Server Error (internal_server_error)

    • Cause: Issue on our servers.
    • Solution: Try again later. If the problem persists, contact support.
  • 400 Bad Request (invalid_request_error)

    • Cause: Incorrect request formatting.
    • Solution: Verify your request structure and parameters.
  • 422 Unprocessable Entity (unprocessable_entity_error)

    • Cause: The language of the input audio could not be detected.
    • Solution: Explicitly pass the language_code parameter with a supported language.
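For rate-limit (429) and transient server (5xx) errors, a retry loop with exponential backoff is usually enough. A minimal sketch, reusing the client and model from above (the retry count and delays are illustrative, not values prescribed by the API):

import time

def transcribe_with_retry(client, audio_path, retries=3, base_delay=2.0):
    """Retry transcription with exponential backoff on transient errors."""
    for attempt in range(retries):
        try:
            with open(audio_path, "rb") as audio_file:
                return client.speech_to_text.transcribe(
                    file=audio_file,
                    model="saarika:v2.5",
                    language_code="en-IN"
                )
        except Exception as e:
            # Simplification: this retries on any exception. A production version
            # would inspect the error and only retry 429s and transient 5xx failures.
            if attempt == retries - 1:
                raise
            delay = base_delay * (2 ** attempt)
            print(f"Attempt {attempt + 1} failed ({e}). Retrying in {delay:.0f}s...")
            time.sleep(delay)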

7. Additional Resources

For more details, refer to our official documentation. We are also happy to support and help you on our Discord server.

8. Final Notes

  • Keep your API key secure.
  • Use clear audio for best results.
  • Explore advanced features like diarization and translation.

Keep Building! 🚀