How to enable speaker diarization

Batch API only: speaker diarization is available through the Batch API, not the REST or Streaming APIs.

Speaker diarization identifies and labels different speakers in your audio, making it easy to know “who said what.” This is ideal for meetings, interviews, podcasts, and call center recordings.

Key Features

- Automatic speaker detection
- Support for up to 10 speakers
- Speaker-wise transcription with timestamps

Parameters

| Parameter | Type | Description |
|---|---|---|
| `with_diarization` | boolean | Enable speaker diarization (default: `false`) |
| `num_speakers` | integer | Expected number of speakers (optional, 1 to 10) |

If you omit `num_speakers`, the model automatically detects the number of speakers.
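The two parameters combine as follows: omitting `num_speakers` leaves speaker detection to the model, while passing a value from 1 to 10 pins the count. The helper below, `diarization_job_options`, is a hypothetical sketch (not part of the `sarvamai` SDK) that builds the keyword arguments for a diarized job and enforces the 1 to 10 range locally before any request is made. It assumes `create_job` accepts `num_speakers` as a keyword argument, as the Parameters table suggests.

```python
from typing import Any, Optional

MAX_SPEAKERS = 10  # upper bound stated in the Parameters table


def diarization_job_options(
    language_code: str,
    num_speakers: Optional[int] = None,
    model: str = "saaras:v3",
) -> dict[str, Any]:
    """Build keyword arguments for a diarized batch job (hypothetical helper).

    When num_speakers is None it is left out entirely, so the model
    auto-detects the number of speakers.
    """
    options: dict[str, Any] = {
        "model": model,
        "language_code": language_code,
        "mode": "transcribe",
        "with_diarization": True,
    }
    if num_speakers is not None:
        # Reject bools and non-integers before checking the documented range.
        if isinstance(num_speakers, bool) or not isinstance(num_speakers, int):
            raise TypeError("num_speakers must be an integer")
        if not 1 <= num_speakers <= MAX_SPEAKERS:
            raise ValueError(f"num_speakers must be between 1 and {MAX_SPEAKERS}")
        options["num_speakers"] = num_speakers
    return options


# Two-party call: pin the speaker count.
print(diarization_job_options("hi-IN", num_speakers=2))
# Meeting: omit num_speakers and let the model auto-detect.
print(diarization_job_options("hi-IN"))
```

You would then pass the result to the SDK, for example `client.speech_to_text_job.create_job(**diarization_job_options("hi-IN", num_speakers=2))`.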

Example Code

Basic Diarization

```python
from sarvamai import SarvamAI

client = SarvamAI(api_subscription_key="YOUR_SARVAM_API_KEY")

# Create batch job with diarization
job = client.speech_to_text_job.create_job(
    model="saaras:v3",
    language_code="hi-IN",
    mode="transcribe",
    with_diarization=True
)

# Upload audio files
job.upload_files(file_paths=["meeting_recording.mp3"])

# Start processing
job.start()

# Wait for completion
job.wait_until_complete()

# Download results
job.download_outputs(output_dir="./output")
```
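Once `download_outputs` finishes, the results sit in `./output`. The sketch below reads them back, assuming each result is saved as a `.json` file with the structure shown under Output Format; the function name `load_diarized_entries` is hypothetical, and the exact output file names may differ.

```python
import json
from pathlib import Path
from typing import Any


def load_diarized_entries(output_dir: str) -> dict[str, list[dict[str, Any]]]:
    """Map each downloaded JSON file name to its diarized entries.

    Assumes every *.json file in output_dir follows the Output Format
    structure; files without a diarized_transcript map to an empty list.
    """
    results: dict[str, list[dict[str, Any]]] = {}
    for path in sorted(Path(output_dir).glob("*.json")):
        data = json.loads(path.read_text(encoding="utf-8"))
        diarized = data.get("diarized_transcript") or {}
        results[path.name] = diarized.get("entries", [])
    return results


# Only scan the directory if the download step actually created it.
if Path("./output").is_dir():
    for name, entries in load_diarized_entries("./output").items():
        print(name, len(entries), "segments")
```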

Output Format

When `with_diarization=True` is set, the output includes a `diarized_transcript` field with speaker information:

```json
{
  "request_id": "20260130_d8d2c0e6-1eb6-4982-8045-b267d5165c44",
  "transcript": "Full transcript text...",
  "timestamps": {
    "words": ["Hello, how can I help you today?", "I have a question about my order."],
    "start_time_seconds": [0.01, 2.8],
    "end_time_seconds": [2.5, 5.2]
  },
  "diarized_transcript": {
    "entries": [
      {
        "transcript": "Hello, how can I help you today?",
        "start_time_seconds": 0.01,
        "end_time_seconds": 2.5,
        "speaker_id": "0"
      },
      {
        "transcript": "I have a question about my order.",
        "start_time_seconds": 2.8,
        "end_time_seconds": 5.2,
        "speaker_id": "1"
      }
    ]
  },
  "language_code": "en-IN"
}
```
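To turn `diarized_transcript.entries` into a readable “who said what” transcript, you can walk the entries in order. A minimal sketch, assuming the output JSON has already been parsed into a Python dict (for example with `json.load`); `format_diarized` is a hypothetical helper:

```python
from typing import Any


def format_diarized(result: dict[str, Any]) -> str:
    """Render diarized entries as 'Speaker N [start-end]: text' lines."""
    lines = []
    for entry in result["diarized_transcript"]["entries"]:
        lines.append(
            f"Speaker {entry['speaker_id']} "
            f"[{entry['start_time_seconds']:.2f}s-{entry['end_time_seconds']:.2f}s]: "
            f"{entry['transcript']}"
        )
    return "\n".join(lines)


# The two entries from the sample output above.
sample = {
    "diarized_transcript": {
        "entries": [
            {"transcript": "Hello, how can I help you today?",
             "start_time_seconds": 0.01, "end_time_seconds": 2.5, "speaker_id": "0"},
            {"transcript": "I have a question about my order.",
             "start_time_seconds": 2.8, "end_time_seconds": 5.2, "speaker_id": "1"},
        ]
    }
}
print(format_diarized(sample))
# Speaker 0 [0.01s-2.50s]: Hello, how can I help you today?
# Speaker 1 [2.80s-5.20s]: I have a question about my order.
```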

Each entry in `diarized_transcript.entries` contains:

- `transcript`: the text spoken in this segment
- `start_time_seconds`: when the segment starts, in seconds (float)
- `end_time_seconds`: when the segment ends, in seconds (float)
- `speaker_id`: identifier for the speaker, as a string (e.g., “0”, “1”)
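Because every entry carries its own start and end times, a speaker's total talk time is the sum of `end_time_seconds - start_time_seconds` over that speaker's entries. The sketch below (hypothetical helper `speaking_time`) uses the two entries from the sample output; it assumes segments from the same speaker do not overlap, so summing does not double-count.

```python
from collections import defaultdict
from typing import Any


def speaking_time(entries: list[dict[str, Any]]) -> dict[str, float]:
    """Sum (end - start) seconds for each speaker_id, rounded to 2 decimals."""
    totals: dict[str, float] = defaultdict(float)
    for entry in entries:
        duration = entry["end_time_seconds"] - entry["start_time_seconds"]
        # Guard against malformed segments with end < start.
        totals[entry["speaker_id"]] += max(duration, 0.0)
    return {speaker: round(seconds, 2) for speaker, seconds in totals.items()}


entries = [
    {"transcript": "Hello, how can I help you today?",
     "start_time_seconds": 0.01, "end_time_seconds": 2.5, "speaker_id": "0"},
    {"transcript": "I have a question about my order.",
     "start_time_seconds": 2.8, "end_time_seconds": 5.2, "speaker_id": "1"},
]
print(speaking_time(entries))  # {'0': 2.49, '1': 2.4}
```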

Use Cases

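| Use case | Recommended setting |
|---|---|
| Call center recordings | `num_speakers=2` |
| Meetings | Omit `num_speakers` and let the model auto-detect |
| Interviews | Set `num_speakers` to the exact number of participants |
| Podcasts | Set `num_speakers` to the number of hosts and guests (typically 2 to 4) |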
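When you set `num_speakers` for a use case, counting the distinct `speaker_id` values in the output is a quick way to confirm the result matches your expectation; a mismatch is a cue to review that recording manually. This is a hypothetical check, not an SDK feature; passing `expected=None` corresponds to the auto-detect setting.

```python
from typing import Any, Optional


def check_speaker_count(
    entries: list[dict[str, Any]], expected: Optional[int]
) -> tuple[int, bool]:
    """Return (distinct speakers found, whether that matches expected).

    expected=None means auto-detect, so any count is accepted.
    """
    found = len({entry["speaker_id"] for entry in entries})
    return found, expected is None or found == expected


call = [{"speaker_id": "0"}, {"speaker_id": "1"}, {"speaker_id": "0"}]
print(check_speaker_count(call, expected=2))     # (2, True)
print(check_speaker_count(call, expected=3))     # (2, False)
print(check_speaker_count(call, expected=None))  # (2, True)
```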

Speaker diarization has separate pricing. For detailed pricing information, visit dashboard.sarvam.ai.

→ Full Batch API Documentation