Overview

This notebook provides a step-by-step guide on how to use the STT-Translate API for translating audio files into text using Saaras. It includes instructions for installation, setting up the API key, uploading audio files, and translating audio using the API.

0. Installation

Before you begin, ensure you have the necessary Python libraries installed. Run the following commands to install the required packages:

1 pip install requests pandas pydub

1. Import Required Libraries

This section imports the necessary Python libraries for making HTTP requests, handling audio files, and managing data.

1 import requests
2 import io

requests: For making HTTP requests to the API.
pandas: For data manipulation (optional, depending on your use case).

2. Set Up the API Endpoint and Payload

To use the Saaras API, you need an API subscription key. Follow these steps to set up your API key:

Obtain your API key: If you don’t have an API key, sign up on the Sarvam AI Dashboard to get one.
Replace the placeholder key: In the code below, replace “YOUR_SARVAM_AI_API_KEY” with your actual API key.

1 SARVAM_AI_API="YOUR_SARVAM_AI_API_KEY"

2.1 Setting Up the API Endpoint and Payload

This section defines the API endpoint and the payload for the translation request. Replace the placeholder values with your actual API key and desired parameters.

1 # API endpoint for speech-to-text translation
2 api_url = "https://api.sarvam.ai/speech-to-text-translate"
3 
4 # Headers containing the API subscription key
5 headers = {
6     "api-subscription-key": SARVAM_AI_API  # Replace with your API key
7 }
8 
9 # Data payload for the translation request
10 data = {
11     "model": "saaras:v2",  # Specify the model to be used
12     "with_diarization": False  # Set to True for speaker diarization
13 }

3. Uploading Audio Files

To translate audio, you need to upload a .wav file. Follow these steps:

Prepare your audio file: Ensure your audio file is in .wav format. If your file is in a different format, you can use tools like pydub to convert it.
Upload the file: If you’re using Google Colab, you can upload the file using the file uploader:

1 from google.colab import files
2 
3 uploaded = files.upload()
4 audio_file_path = list(uploaded.keys())[0]  # Get the name of the uploaded file

If you’re working locally, ensure the file is in the same directory as your notebook and specify the file name:

1 audio_file_path = "test.wav"  # Replace with your file name

4. Speech-to-Text Translation API

This section demonstrates how to use the STT-Translate API for translating audio files into text using Saaras. The API automatically identifies the language of the audio and supports long audio files by splitting them into chunks.

4.1. Splitting Audio into Chunks

The split_audio function splits an audio file into smaller chunks of a specified duration. This is useful for processing long audio files that exceed the API’s input length limit.

1 from pydub import AudioSegment
2 
3 def split_audio(audio_path, chunk_duration_ms):
4     """
5     Splits an audio file into smaller chunks of specified duration.
6 
7     Args:
8         audio_path (str): Path to the audio file to be split.
9         chunk_duration_ms (int): Duration of each chunk in milliseconds.
10 
11     Returns:
12         list: A list of AudioSegment objects representing the audio chunks.
13     """
14     audio = AudioSegment.from_file(audio_path)  # Load the audio file
15     chunks = []
16     if len(audio) > chunk_duration_ms:
17         # Split the audio into chunks of the specified duration
18         for i in range(0, len(audio), chunk_duration_ms):
19             chunks.append(audio[i:i + chunk_duration_ms])
20     else:
21         # If the audio is shorter than the chunk duration, use the entire audio
22         chunks.append(audio)
23     return chunks

4.2. Translating Audio

The translate_audio function translates audio chunks using the Saaras API. It handles the API request for each chunk and collates the results.

1 def translate_audio(audio_file_path, api_url, headers, data, chunk_duration_ms=5*60*1000):
2     """
3     Translates audio into text with optional diarization and timestamps.
4 
5     Args:
6         audio_file_path (str): Path to the audio file.
7         api_url (str): API endpoint URL for Speech-to-Text and Translate.
8         headers (dict): Headers for API authentication.
9         data (dict): Payload containing model and other options like diarization.
10         chunk_duration_ms (int): Duration of each audio chunk in milliseconds.
11 
12     Returns:
13         dict: Collated response containing the transcript and word-level timestamps.
14     """
15     # Split the audio into chunks
16     chunks = split_audio(audio_file_path, chunk_duration_ms)
17     responses = []
18 
19     # Process each chunk
20     for idx, chunk in enumerate(chunks):
21         # Export the chunk to a BytesIO object (in-memory binary stream)
22         chunk_buffer = io.BytesIO()
23         chunk.export(chunk_buffer, format="wav")
24         chunk_buffer.seek(0)  # Reset the pointer to the start of the stream
25 
26         # Prepare the file for the API request
27         files = {'file': ('audiofile.wav', chunk_buffer, 'audio/wav')}
28 
29         try:
30             # Make the POST request to the API
31             response = requests.post(api_url, headers=headers, files=files, data=data)
32             if response.status_code == 200 or response.status_code == 201:
33                 print(f"Chunk {idx} POST Request Successful!")
34                 response_data = response.json()
35                 transcript = response_data.get("transcript", "")
36                 responses.append({"transcript": transcript})
37             else:
38                 # Handle failed requests
39                 print(f"Chunk {idx} POST Request failed with status code: {response.status_code}")
40                 print("Response:", response.text)
41         except Exception as e:
42             # Handle any exceptions during the request
43             print(f"Error processing chunk {idx}: {e}")
44         finally:
45             # Ensure the buffer is closed after processing
46             chunk_buffer.close()
47 
48     # Collate the transcriptions from all chunks
49     collated_transcript = " ".join([resp["transcript"] for resp in responses])
50     collated_response = {
51         "transcript": collated_transcript,
52         "language": response_data.get("language_code")
53     }
54     return collated_response

4.3 Translating the Audio

This section calls the translate_audio function to translate the audio file. Replace audio_file_path with the path to your audio file.

1 # Path to the audio file to be translated
2 # audio_file_path = "test.wav"  # Replace with your file path
3 
4 # Translate the audio
5 translation = translate_audio(audio_file_path, api_url, headers, data)
6 
7 # Display the translation results
8 translation

Example output:

1 {
2   "transcript": "There are many ethical stories in English that are beneficial for children. They activate your child's imagination, entertain them, and make them happy. Short ethical stories are ideal to keep them focused and focused throughout the story.",
3   "language": "hi-IN"
4 }

5. Conclusion

This tutorial demonstrated how to use the Saaras API for translating audio files into text. By following the steps, you can easily translate audio, even long files, by splitting them into smaller chunks. The process involves installing required libraries, setting up your API key, uploading audio, and translating it using the provided functions.

6. Additional Resources

For more details, refer to the official Saaras API documentation and join the community for support:

Documentation: docs.sarvam.ai
Community: Join the Discord Community

7. Final Notes

Keep your API key secure.
Use clear audio for best results.
Explore advanced features like diarization and word-level timestamps.

Keep Building! 🚀