How to enable speaker diarization

Batch API only: Speaker diarization is only available through the Batch API, not the REST or Streaming APIs.

Speaker diarization identifies and labels different speakers in your audio, making it easy to know “who said what.” This is ideal for meetings, interviews, podcasts, and call center recordings.

Key Features

  • Automatic speaker detection
  • Support for up to 10 speakers
  • Speaker-wise transcription with timestamps

Parameters

Parameter          Type      Description
with_diarization   boolean   Enable speaker diarization (default: false)
num_speakers       integer   Expected number of speakers (optional, 1-10)

If you don’t specify num_speakers, the model will automatically detect the number of speakers.
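
A minimal sketch of how these two parameters might be assembled before creating a job. The helper `build_diarization_params` is hypothetical and not part of the SDK: it only enforces the 1-10 range listed above and leaves `num_speakers` out when no value is given, so the model detects the count on its own.

```python
from typing import Optional


def build_diarization_params(num_speakers: Optional[int] = None) -> dict:
    """Hypothetical helper: build the diarization keyword arguments for a batch job."""
    params = {"with_diarization": True}
    if num_speakers is not None:
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(num_speakers, bool) or not isinstance(num_speakers, int):
            raise TypeError("num_speakers must be an integer")
        if not 1 <= num_speakers <= 10:
            raise ValueError("num_speakers must be between 1 and 10")
        params["num_speakers"] = num_speakers
    return params


print(build_diarization_params())                # {'with_diarization': True}
print(build_diarization_params(num_speakers=2))  # {'with_diarization': True, 'num_speakers': 2}
```

The resulting dictionary could then be unpacked into the job call, for example `create_job(model="saaras:v3", language_code="hi-IN", mode="transcribe", **params)`, assuming the SDK accepts `num_speakers` as a keyword argument there.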

Example Code

from sarvamai import SarvamAI

client = SarvamAI(api_subscription_key="YOUR_SARVAM_API_KEY")

# Create batch job with diarization
job = client.speech_to_text_job.create_job(
    model="saaras:v3",
    language_code="hi-IN",
    mode="transcribe",
    with_diarization=True
)

# Upload audio files
job.upload_files(file_paths=["meeting_recording.mp3"])

# Start processing
job.start()

# Wait for completion
job.wait_until_complete()

# Download results
job.download_outputs(output_dir="./output")
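
Once the download finishes, the results can be read back from disk. The sketch below assumes that `download_outputs` writes one JSON file per input into the output directory; the file layout is not specified here, so treat that as an assumption. `load_transcripts` is a hypothetical helper, not an SDK function.

```python
import json
from pathlib import Path


def load_transcripts(output_dir: str) -> dict:
    """Map each JSON file name in output_dir to its parsed content."""
    results = {}
    # Assumption: each result is stored as a separate .json file.
    for path in sorted(Path(output_dir).glob("*.json")):
        with path.open(encoding="utf-8") as f:
            results[path.name] = json.load(f)
    return results


if __name__ == "__main__":
    for name, data in load_transcripts("./output").items():
        # diarized_transcript is only present when diarization was enabled.
        entries = (data.get("diarized_transcript") or {}).get("entries", [])
        print(f"{name}: {len(entries)} diarized entries")
```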

Output Format

When with_diarization=True is passed, the response includes a diarized_transcript field with speaker information:

{
  "request_id": "20260130_d8d2c0e6-1eb6-4982-8045-b267d5165c44",
  "transcript": "Full transcript text...",
  "timestamps": {
    "words": ["Hello, how can I help you today?", "I have a question about my order."],
    "start_time_seconds": [0.01, 2.8],
    "end_time_seconds": [2.5, 5.2]
  },
  "diarized_transcript": {
    "entries": [
      {
        "transcript": "Hello, how can I help you today?",
        "start_time_seconds": 0.01,
        "end_time_seconds": 2.5,
        "speaker_id": "0"
      },
      {
        "transcript": "I have a question about my order.",
        "start_time_seconds": 2.8,
        "end_time_seconds": 5.2,
        "speaker_id": "1"
      }
    ]
  },
  "language_code": "en-IN"
}

Each entry contains:

  • transcript: The text spoken by the speaker
  • start_time_seconds: When the speaker started speaking (float)
  • end_time_seconds: When the speaker stopped speaking (float)
  • speaker_id: Unique identifier for the speaker (e.g., “0”, “1”)
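
As a rough illustration of working with these fields, the sketch below turns the entries into speaker-labeled lines and totals the speaking time per speaker. The sample data is copied from the output example above; `format_entries` and `talk_time_by_speaker` are hypothetical helper names chosen for this example.

```python
from collections import defaultdict

# Sample entries copied from the diarized_transcript example.
entries = [
    {
        "transcript": "Hello, how can I help you today?",
        "start_time_seconds": 0.01,
        "end_time_seconds": 2.5,
        "speaker_id": "0",
    },
    {
        "transcript": "I have a question about my order.",
        "start_time_seconds": 2.8,
        "end_time_seconds": 5.2,
        "speaker_id": "1",
    },
]


def format_entries(entries):
    """Render each entry as '[start - end] Speaker N: text'."""
    return [
        f"[{e['start_time_seconds']:.2f}s - {e['end_time_seconds']:.2f}s] "
        f"Speaker {e['speaker_id']}: {e['transcript']}"
        for e in entries
    ]


def talk_time_by_speaker(entries):
    """Sum the duration of every entry, keyed by speaker_id."""
    totals = defaultdict(float)
    for e in entries:
        totals[e["speaker_id"]] += e["end_time_seconds"] - e["start_time_seconds"]
    return dict(totals)


for line in format_entries(entries):
    print(line)
# [0.01s - 2.50s] Speaker 0: Hello, how can I help you today?
# [2.80s - 5.20s] Speaker 1: I have a question about my order.

print(talk_time_by_speaker(entries))
```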

Use Cases

Use Case                 Recommended Settings
Call center recordings   num_speakers=2
Meetings                 Omit num_speakers and let the model auto-detect
Interviews               Set num_speakers to the exact count
Podcasts                 num_speakers between 2 and 4

Speaker diarization is available via the Batch API and has separate pricing. For detailed pricing information, visit dashboard.sarvam.ai.

Full Batch API Documentation