Call Analytics Pipeline: Diarized Transcription + LLM Analysis

Overview

This cookbook builds a call analytics pipeline that turns raw call recordings into structured, actionable insights. It uses Sarvam’s Batch Speech-to-Text Translate API with speaker diarization to transcribe calls, then uses a Sarvam chat model to analyze the conversation, answer follow-up questions, and generate summaries.

What you’ll build:

A reusable CallAnalytics class that goes from raw audio to structured insight with one method call
Speaker-wise, diarized transcripts (*_conversation.txt) with per-speaker talk-time (*_timing.json)
An LLM-generated structured analysis of each call, covering customer type, issue, sentiment, and resolution (*_analysis.txt)
Answers to any custom question you ask across all processed calls
A consolidated summary report across all calls (summary_*.txt)

Business Value

Improve agent effectiveness
Understand customer sentiment
Detect operational issues early
Spot upsell/cross-sell opportunities
Generate real-time dashboards

Typical industries: E-commerce / D2C, Contact Centers / BPOs, Healthcare & Insurance.

E-commerce / D2C example: Understand refund requests, delivery concerns, or dissatisfaction with product quality directly from customer calls.

Sample audio files are available in the sarvam-ai-cookbook GitHub repo.

How It Works

Diarization assigns a speaker label to every line of the transcript, which is what makes agent/customer-specific analysis (sentiment, talk-time, resolution tracking) possible in the first place. The pipeline below preserves that speaker structure end-to-end, from raw audio to the final report.

Transcribe with diarization

Upload one or more call recordings to Sarvam’s Batch STT Translate API with with_diarization=True. The API transcribes and translates the call to English, tagging each line with a speaker ID.

Parse speaker-wise transcripts

Convert the raw diarized JSON into a clean SPEAKER: text transcript and compute total speaking time per speaker, useful for agent talk-time vs. listening-time monitoring.

Run LLM analysis

Send the transcript to a Sarvam chat model with a structured analysis prompt to extract customer type, issue, sentiment, resolution, and upsell opportunities.

Ask follow-up questions & summarize

Query any specific detail across all processed calls, and generate a concise, dashboard-ready summary report.
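As a rough illustration of the parsing step, the sketch below turns a list of diarized entries into SPEAKER: text lines and per-speaker talk-time. The field names (speaker_id, transcript, start_time_seconds, end_time_seconds) are the ones the full script in this cookbook reads; the sample entries are made up for illustration and are not real API output.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def format_diarized(entries: List[dict]) -> Tuple[str, Dict[str, float]]:
    """Return (transcript text, seconds spoken per speaker) for diarized entries."""
    lines = []
    talk_time: Dict[str, float] = defaultdict(float)
    for entry in entries:
        speaker = entry["speaker_id"]
        lines.append(f"{speaker}: {entry['transcript']}")
        start = entry.get("start_time_seconds")
        end = entry.get("end_time_seconds")
        # Only count entries that carry both timestamps.
        if start is not None and end is not None:
            talk_time[speaker] += end - start
    return "\n".join(lines), dict(talk_time)


# Made-up sample entries, for illustration only.
sample_entries = [
    {"speaker_id": "SPEAKER_00", "transcript": "Hi, I want to return an item.",
     "start_time_seconds": 0.0, "end_time_seconds": 3.5},
    {"speaker_id": "SPEAKER_01", "transcript": "Sure, may I have the order number?",
     "start_time_seconds": 3.5, "end_time_seconds": 6.0},
    {"speaker_id": "SPEAKER_00", "transcript": "Yes, one moment.",
     "start_time_seconds": 6.0, "end_time_seconds": 7.5},
]

text, seconds = format_diarized(sample_entries)
print(text)
print(seconds)
```

Keeping the speaker label on every line is what lets the later LLM prompt reason about agent and customer turns separately.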

1. Prerequisites

Python 3.8+
A Sarvam API key (sign up on the Sarvam AI Dashboard to get one)
FFmpeg installed and on your PATH (required by pydub to read and split audio files)
One or more call recordings (.mp3, .wav, etc.). You can use the sample call recording to follow along.
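Before running anything, it can help to confirm that FFmpeg is actually discoverable. The sketch below uses the standard library’s shutil.which; the helper name missing_tools is hypothetical, and checking ffprobe alongside ffmpeg is an assumption based on both normally shipping together in an FFmpeg install.

```python
import shutil
from typing import List


def missing_tools(tools: List[str]) -> List[str]:
    """Return the subset of command-line tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_tools(["ffmpeg", "ffprobe"])
if missing:
    print(f"Not found on PATH: {', '.join(missing)}. Install FFmpeg before running the pipeline.")
else:
    print("FFmpeg tools found.")
```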

2. Install the SDK and Dependencies

$ pip install -U sarvamai pydub

pydub is used to inspect and split long recordings before transcription (see The Full Pipeline Script below). pydub depends on FFmpeg, which is not a Python package and must be installed separately.

3. Authentication

Obtain your API key: If you don’t have one, sign up on the Sarvam AI Dashboard.
Set your API key: Export it as an environment variable with export SARVAM_API_KEY="your-key-here" (macOS/Linux) or setx SARVAM_API_KEY "your-key-here" (Windows; open a new terminal afterwards, because setx only affects future sessions). The script below reads it via os.environ["SARVAM_API_KEY"].

Loading the key from an environment variable, instead of hardcoding it, keeps it out of source control. If you’re just experimenting locally, you can instead replace os.environ["SARVAM_API_KEY"] with the key as a plain string, but avoid committing that change.
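If the variable is missing, os.environ["SARVAM_API_KEY"] fails with a bare KeyError. A small wrapper can give a clearer message instead. This is only a sketch: load_api_key is a hypothetical helper, not part of the Sarvam SDK.

```python
import os
from typing import Mapping


def load_api_key(env: Mapping[str, str] = os.environ, name: str = "SARVAM_API_KEY") -> str:
    """Return the API key from the environment, or raise a readable error."""
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set. Export it before running the script.")
    return key


# Usage in the pipeline script (commented out so this snippet runs on its own):
# client = SarvamAI(api_subscription_key=load_api_key())
```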

4. The Full Pipeline Script

Everything below lives in one script: the imports, the helpers for the 2-hour file limit, the CallAnalytics class, and the code that runs it end-to-end. Copy it into a single .py file as-is:

import os
import json
import hashlib
import textwrap
from pathlib import Path
from datetime import datetime
from typing import List, Dict, Optional
from pydub import AudioSegment
from sarvamai import SarvamAI

OUTPUT_DIR = "outputs"
Path(OUTPUT_DIR).mkdir(exist_ok=True)


def split_audio(audio_path: str, chunk_duration_ms: int = 2 * 60 * 60 * 1000) -> List[AudioSegment]:
    """Splits a recording into chunks so each fits the Batch API's 2-hour-per-file limit."""
    audio = AudioSegment.from_file(audio_path)
    if len(audio) <= chunk_duration_ms:
        return [audio]
    # pydub measures length and slices in milliseconds.
    return [audio[i:i + chunk_duration_ms] for i in range(0, len(audio), chunk_duration_ms)]


def prepare_audio_paths(audio_path: str) -> List[str]:
    """Splits and exports a recording longer than 2 hours; returns the original path unchanged otherwise."""
    chunks = split_audio(audio_path)
    if len(chunks) == 1:
        return [audio_path]

    paths = []
    for i, chunk in enumerate(chunks):
        chunk_path = f"{OUTPUT_DIR}/{Path(audio_path).stem}_part{i + 1}.mp3"
        chunk.export(chunk_path, format="mp3")
        paths.append(chunk_path)
    return paths


ANALYSIS_PROMPT_TEMPLATE = """
Analyze this call transcription thoroughly from start to finish.

TRANSCRIPTION:
{transcription}

Please answer the following:

1. Identify which speaker is the **customer** and which one is the **agent**.
2. Determine if the customer is a **new/potential customer** or an **existing customer**.
3. What **problem, query, or doubt** did the customer raise at the beginning?
4. What **services/products** was the customer inquiring about or facing issues with?
5. How did the agent respond to and resolve the issue throughout the call?
6. Was the **customer satisfied** at the end of the call?
7. Did the customer express any **emotions or sentiments** (positive, negative, or neutral)?
8. Were there any mentions of **competitors**, or any opportunities for **upselling or cross-selling**?
9. Summarize the **resolution** and whether it was successful.

Provide your answer in a clear, structured format with section headings and bullet points.
"""

SUMMARY_PROMPT_TEMPLATE = """
Based on this call analysis, summarize each of the following in 2-3 words:

{analysis_text}

1. Customer & Agent
2. Customer Type
3. Main Issue
4. Service Discussed
5. Agent's Response
6. Customer Satisfaction
7. Sentiment
8. Competitor or Upsell
9. Resolution
"""

class CallAnalytics:
    def __init__(self, client):
        self.client = client
        # Maps each transcribed file's stem to its parsed entries and output paths.
        self.transcriptions = {}

    def process_audio_files(self, audio_paths: List[str]) -> Dict[str, dict]:
        if not audio_paths:
            print("No audio files provided")
            return {}

        print(f"Processing {len(audio_paths)} audio file(s)...")

        try:
            job = self.client.speech_to_text_job.create_job(
                model="saaras:v3",
                mode="translate",
                with_diarization=True,
            )

            # For longer audio files, raise `timeout` so the upload has time to complete.
            job.upload_files(file_paths=audio_paths, timeout=300)
            job.start()

            print("Waiting for transcription to complete...")
            job.wait_until_complete()

            file_results = job.get_file_results()
            print(f"Successful: {len(file_results['successful'])} | Failed: {len(file_results['failed'])}")
            for f in file_results["failed"]:
                print(f"  Failed: {f['file_name']}, {f['error_message']}")

            if not file_results["successful"]:
                print("All files failed, nothing to analyze.")
                return {}

            output_dir = Path(f"{OUTPUT_DIR}/transcriptions_{job._job_id}")
            output_dir.mkdir(parents=True, exist_ok=True)
            job.download_outputs(output_dir=str(output_dir))

            transcriptions = self._parse_transcriptions(output_dir)
            self.transcriptions.update(transcriptions)

            print(f"Successfully transcribed {len(transcriptions)} file(s)!")

            for file_name, data in transcriptions.items():
                self.analyze_transcription(data["conversation_path"], output_dir, file_name)

            return transcriptions

        except Exception as e:
            print(f"Error processing audio files: {e}")
            return {}

    def _parse_transcriptions(self, output_dir: Path) -> Dict[str, dict]:
        transcriptions = {}
        # Snapshot the file list first so the *_timing.json files written below
        # are never picked up as transcripts.
        for json_file in sorted(output_dir.glob("*.json")):
            try:
                with open(json_file, "r", encoding="utf-8") as f:
                    data = json.load(f)
                diarized = data.get("diarized_transcript", {}).get("entries")
                speaker_times = {}
                lines = []
                if diarized:
                    for entry in diarized:
                        speaker = entry["speaker_id"]
                        text = entry["transcript"]
                        lines.append(f"{speaker}: {text}")
                        start = entry.get("start_time_seconds")
                        end = entry.get("end_time_seconds")
                        if start is not None and end is not None:
                            speaker_times[speaker] = speaker_times.get(speaker, 0.0) + (end - start)
                else:
                    # No diarized entries: fall back to the plain transcript.
                    lines = [f"UNKNOWN: {data.get('transcript', '')}"]

                conversation_text = "\n".join(lines)
                txt_path = output_dir / f"{json_file.stem}_conversation.txt"
                with open(txt_path, "w", encoding="utf-8") as f:
                    f.write(conversation_text)

                timing_path = None
                if speaker_times:
                    timing_path = output_dir / f"{json_file.stem}_timing.json"
                    with open(timing_path, "w", encoding="utf-8") as f:
                        json.dump(speaker_times, f, indent=2)

                transcriptions[json_file.stem] = {
                    "entries": diarized or [],
                    "conversation_path": str(txt_path),
                    "timing_path": str(timing_path) if timing_path else None,
                }
            except Exception as e:
                print(f"Error parsing {json_file}: {e}")
        return transcriptions

    def analyze_transcription(self, conversation_path: str, output_dir: Path, file_name: str) -> Dict:
        try:
            with open(conversation_path, "r", encoding="utf-8") as f:
                transcription = f.read()

            analysis_prompt = textwrap.dedent(ANALYSIS_PROMPT_TEMPLATE).format(transcription=transcription)

            response = self.client.chat.completions(
                model="sarvam-105b",
                max_tokens=4096,
                reasoning_effort=None,
                messages=[
                    {"role": "system", "content": "You are a call analytics expert working for a company's support operations team. Your job is to understand customer calls end-to-end and provide structured insights to improve customer experience and agent effectiveness."},
                    {"role": "user", "content": analysis_prompt},
                ],
            )
            analysis = response.choices[0].message.content

            analysis_path = output_dir / f"{file_name}_analysis.txt"
            with open(analysis_path, "w", encoding="utf-8") as f:
                f.write(analysis.strip())
            print(f"Analysis saved to {analysis_path}")
            return {"file_name": file_name, "analysis_path": str(analysis_path)}

        except Exception as e:
            error_msg = f"Error analyzing transcription: {str(e)}"
            print(error_msg)
            return {"file_name": file_name, "error": error_msg, "timestamp": datetime.now().isoformat()}

    def answer_question(self, question: str) -> None:
        for file_name, data in self.transcriptions.items():
            try:
                with open(data["conversation_path"], "r", encoding="utf-8") as f:
                    transcription = f.read()

                prompt = f"Based on this call transcription, answer the question below:\n\nTRANSCRIPTION:\n{transcription}\n\nQUESTION: {question}"
                response = self.client.chat.completions(
                    model="sarvam-105b",
                    max_tokens=4096,
                    reasoning_effort=None,
                    messages=[
                        {"role": "system", "content": "You are a call analytics expert. Answer questions about the call using only information present in the transcription."},
                        {"role": "user", "content": prompt},
                    ],
                )
                answer = response.choices[0].message.content

                # A short hash of the question keeps answers to different questions in separate files.
                q_hash = hashlib.sha1(question.encode()).hexdigest()[:6]
                path = Path(data["conversation_path"]).parent / f"{file_name}_question_{q_hash}.txt"
                with open(path, "w", encoding="utf-8") as f:
                    f.write(f"Question: {question}\n\nAnswer:\n{answer}")
                print(f"Answer saved to {path}")
            except Exception as e:
                print(f"Error answering question for {file_name}: {e}")

    def get_summary(self, output_dir: Optional[Path] = None) -> None:
        output_dir = output_dir or Path(OUTPUT_DIR)
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        summary_path = output_dir / f"summary_{timestamp}.txt"
        try:
            with open(summary_path, "w", encoding="utf-8") as f:
                f.write("CALL ANALYTICS SUMMARY REPORT\n")
                f.write("=" * 60 + "\n")
                f.write(f"Generated: {datetime.now()}\n")
                f.write(f"Total Calls: {len(self.transcriptions)}\n")
                f.write("=" * 60 + "\n\n")

                for file_name, data in self.transcriptions.items():
                    analysis_file = Path(data["conversation_path"]).parent / f"{file_name}_analysis.txt"
                    if not analysis_file.exists():
                        print(f"Analysis file not found for {file_name}, skipping.")
                        continue

                    with open(analysis_file, "r", encoding="utf-8") as af:
                        analysis_text = af.read()

                    summary_prompt = textwrap.dedent(SUMMARY_PROMPT_TEMPLATE).format(analysis_text=analysis_text)

                    response = self.client.chat.completions(
                        model="sarvam-105b",
                        max_tokens=4096,
                        reasoning_effort=None,
                        messages=[
                            {"role": "system", "content": "You are a call analytics summarizing expert. Provide concise and clear answers to each point."},
                            {"role": "user", "content": summary_prompt},
                        ],
                    )
                    concise_summary = response.choices[0].message.content.strip()

                    f.write(f"Call File: {file_name}\n")
                    f.write("-" * 30 + "\n")
                    f.write(f"{concise_summary}\n\n")

            print(f"Summary saved to {summary_path}")
        except Exception as e:
            print(f"Error writing summary: {e}")


if __name__ == "__main__":
    client = SarvamAI(api_subscription_key=os.environ["SARVAM_API_KEY"])
    analytics = CallAnalytics(client=client)

    audio_path = "/path/to/your/audio/file.mp3"
    analytics.process_audio_files(prepare_audio_paths(audio_path))
    analytics.answer_question("Was a refund timeline promised to the customer?")
    analytics.get_summary()

How the script works

split_audio / prepare_audio_paths handle the Batch API’s 2-hour-per-file limit. Most calls are well under 2 hours, so prepare_audio_paths just returns the original path unchanged; only recordings longer than 2 hours actually get split and exported into parts.
process_audio_files creates a diarized transcription job on Sarvam’s Batch STT Translate API, uploads your audio, waits for completion, checks per-file success/failure, downloads the raw outputs, and kicks off analysis for every successfully transcribed file.
_parse_transcriptions reads the downloaded JSON, converts diarized entries into a SPEAKER: text transcript (*_conversation.txt), and tallies each speaker’s total talk-time (*_timing.json). The timing file makes it easy to spot whether the agent is doing too much (or too little) of the talking.
analyze_transcription sends the parsed transcript to a Sarvam chat model with a structured prompt covering speaker roles, customer type, issue, resolution, sentiment, and upsell opportunities, saving the result to *_analysis.txt.
answer_question lets you ask any custom question (e.g. “Did the agent mention a refund timeline?”) against every transcript you’ve processed so far.
get_summary condenses each call’s analysis into a summary of 2-3 words per field and writes one consolidated report, which is the fastest way to scan many calls at a glance.
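The *_timing.json files store raw seconds per speaker. For talk-time monitoring, a percentage share is often easier to read. The sketch below assumes the file layout written by _parse_transcriptions (a flat mapping of speaker ID to seconds); the helper names are hypothetical.

```python
import json
from typing import Dict


def talk_time_share(speaker_seconds: Dict[str, float]) -> Dict[str, float]:
    """Convert seconds per speaker into a percentage of total talk-time."""
    total = sum(speaker_seconds.values())
    if total <= 0:
        return {speaker: 0.0 for speaker in speaker_seconds}
    return {speaker: round(100 * secs / total, 1) for speaker, secs in speaker_seconds.items()}


def load_talk_time_share(timing_path: str) -> Dict[str, float]:
    """Read a *_timing.json file written by the pipeline and return percentage shares."""
    with open(timing_path, "r", encoding="utf-8") as f:
        return talk_time_share(json.load(f))


print(talk_time_share({"SPEAKER_00": 90.0, "SPEAKER_01": 30.0}))
# {'SPEAKER_00': 75.0, 'SPEAKER_01': 25.0}
```

Diarization only provides speaker IDs, so mapping an ID to "agent" or "customer" still relies on the LLM analysis (point 1 of the analysis prompt).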

sarvam-105b has reasoning enabled by default, and reasoning tokens count toward max_tokens. With no max_tokens set, the 9-point structured analysis prompt above can get cut off mid-answer (which then breaks get_summary, since it summarizes a truncated analysis). Setting max_tokens=4096 and reasoning_effort=None avoids this and is also cheaper and faster for this kind of structured-extraction task, which doesn’t benefit much from chain-of-thought reasoning. See Reasoning for details.
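One way to notice a cut-off analysis before it reaches get_summary is to inspect why generation stopped. The sketch below assumes the response follows the common chat-completions shape, where each choice carries a finish_reason that equals "length" when the token limit was hit. That field and value are an assumption here, so check the API reference before relying on them. The fake object only mimics the assumed shape and is not real API output.

```python
from types import SimpleNamespace


def was_truncated(response) -> bool:
    """True if the first choice reports that generation stopped at the token limit."""
    choice = response.choices[0]
    # finish_reason == "length" is an assumed convention, not confirmed by this cookbook.
    return getattr(choice, "finish_reason", None) == "length"


# Stand-in object mimicking the assumed response shape.
fake = SimpleNamespace(
    choices=[SimpleNamespace(finish_reason="length", message=SimpleNamespace(content="..."))]
)
print(was_truncated(fake))
```

If the check fires, raising max_tokens or shortening the transcript are the two obvious levers.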

For very long calls, the transcript plus prompt may exceed the chat model’s context window. See Tips and Best Practices below for how to handle this.
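A minimal sketch of one chunking approach: split the SPEAKER: text transcript on line boundaries so that no speaker turn is cut in half, using a character budget as a crude proxy for tokens. The function name and the 12,000-character default are illustrative placeholders, not values taken from the Sarvam documentation.

```python
from typing import List


def chunk_transcript(transcript: str, max_chars: int = 12000) -> List[str]:
    """Split a transcript into chunks of roughly max_chars, breaking only between lines.

    A single line longer than max_chars is kept whole in its own chunk.
    """
    chunks: List[str] = []
    current: List[str] = []
    current_len = 0
    for line in transcript.splitlines():
        line_len = len(line) + 1  # +1 for the newline that joins lines
        if current and current_len + line_len > max_chars:
            chunks.append("\n".join(current))
            current, current_len = [], 0
        current.append(line)
        current_len += line_len
    if current:
        chunks.append("\n".join(current))
    return chunks


demo = "A: one\nB: two\nA: three"
print(chunk_transcript(demo, max_chars=14))
```

Each chunk can then be passed through analyze_transcription-style prompts separately, with a final prompt that merges the per-chunk findings.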

The last block in the script creates the client, runs the pipeline on your audio, asks a follow-up question, and generates a summary. Once it completes, check the outputs/ directory. You’ll have the raw transcription JSON, the parsed conversation and timing files, the structured analysis, your question’s answer, and a summary report, all named by the original audio file.

5. Sample Output

Below is the analysis you’d get by running the pipeline on the sample Sample_product_refund.mp3 recording.

Here's a structured analysis of the call transcription:
### 1. Speaker Identification
* **Customer:** SPEAKER_00 (Adam Wilson)
* **Agent:** SPEAKER_01 (Sam from Coaching Downs)
### 2. Customer Type
* **Existing customer:** The customer has previously made a purchase (order number provided) and is now initiating a return and refund request.
### 3. Initial Problem/Query
* The customer called to:
    * Return an item due to incorrect size.
    * Inquire about the status of their refund, as it hasn't reflected in their account yet.
### 4. Services/Products Involved
* **Product:** Clothing item (implied by the return and size issue).
* **Services:** Return processing and refund issuance.
### 5. Agent's Response and Resolution Process
* **Initial Steps:**
    * Requested the order number, customer name, and contact details (phone number and email).
    * Verified the order details in the system.
* **Issue Identification:**
    * The order wasn't immediately found in the system, leading to further verification.
    * The customer lacked the return tracking number from the courier company.
* **Resolution Steps:**
    * Agent confirmed the return request date (15th of November) and noted it was outside the standard refund processing timeframe.
    * Agent escalated the case to the corporate office for review and promised to send an email update within 2-4 business days.
    * Agent reassured the customer about the refund timeline and confirmed the email address for communication.
### 6. Customer Satisfaction
* **Neutral to slightly positive:** The customer seemed somewhat reassured by the agent's explanation and the promise of a prompt update. However, there was initial frustration about the refund delay.
### 7. Customer Sentiments
* **Initial Frustration:** Expressing concern about the missing refund and the potential delay in processing.
* **Reassurance:** After the agent's explanation, the customer seemed more at ease, though still awaiting confirmation.
### 8. Competitors/Upselling Opportunities
* **No mention of competitors.**
* **No clear upselling/cross-selling opportunities identified during the call.** The focus was solely on resolving the return and refund issue.
### 9. Resolution Summary and Success
* **Resolution:** The agent escalated the case to the corporate office for review and promised an email update within 2-4 business days.

6. Tips & Best Practices

Audio quality: Clear audio with minimal background noise and cross-talk significantly improves diarization and transcription accuracy.
Speaker count: If you know the number of speakers in advance, pass num_speakers to create_job for more consistent diarization instead of relying on auto-detection.
Batch limits: A single job accepts up to 20 files and each file can be up to 2 hours long. split_audio and prepare_audio_paths in The Full Pipeline Script handle anything longer.
Long transcripts: If a call transcript is long enough to risk exceeding the chat model’s context window, chunk it (e.g. by time segment) and analyze each chunk before combining results, rather than sending the entire transcript in one prompt.
Cost & latency: Each call triggers one LLM request per method (analyze_transcription, answer_question, get_summary). For large call volumes, batch or parallelize these calls and monitor your token usage.
API key security: Load your key from an environment variable rather than hardcoding it, especially outside local experimentation.
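The full script submits every path in a single job, so a list longer than the 20-file limit mentioned above would need to be split first. A small sketch of that batching, with a hypothetical helper name:

```python
from typing import Iterator, List

MAX_FILES_PER_JOB = 20  # per-job file limit stated in this cookbook


def batched_paths(paths: List[str], batch_size: int = MAX_FILES_PER_JOB) -> Iterator[List[str]]:
    """Yield consecutive slices of paths, each with at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(paths), batch_size):
        yield paths[start:start + batch_size]


# Usage with the CallAnalytics class from the full script (commented out so this
# snippet runs on its own):
# for batch in batched_paths(all_audio_paths):
#     analytics.process_audio_files(batch)

example = [f"call_{i}.mp3" for i in range(45)]
print([len(batch) for batch in batched_paths(example)])
# [20, 20, 5]
```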

7. Error Handling

You may encounter these errors while using the API:

403 Forbidden (invalid_api_key_error): invalid API key. Use a valid key from the Sarvam AI Dashboard.
429 Too Many Requests (insufficient_quota_error / rate_limit_exceeded_error): credits exhausted or rate limit hit. Retry with exponential backoff.
500 Internal Server Error (internal_server_error): issue on our servers. Try again later; contact support if persistent.

For the full error-code table, request validation errors (400/422), retry guidance, and SDK exceptions, see Errors & Troubleshooting.
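The retry-with-backoff advice for 429 responses can be sketched generically. The helper below is not part of the Sarvam SDK: with_backoff and its is_retryable hook are hypothetical, and deciding which SDK exceptions correspond to a rate limit is left to the caller.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_backoff(call: Callable[[], T], max_attempts: int = 5, base_delay: float = 1.0,
                 is_retryable: Callable[[Exception], bool] = lambda exc: True,
                 sleep: Callable[[float], None] = time.sleep) -> T:
    """Run call(), retrying with exponentially growing, jittered delays on retryable errors."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except Exception as exc:
            # Give up on the last attempt or on errors the caller marks as permanent.
            if attempt == max_attempts or not is_retryable(exc):
                raise
            delay = base_delay * 2 ** (attempt - 1)  # 1s, 2s, 4s, ...
            sleep(delay + random.uniform(0, delay / 2))  # jitter spreads out retries
    raise RuntimeError("unreachable")


# Usage with the pipeline (commented out so this snippet runs on its own):
# with_backoff(lambda: analytics.answer_question("Was a refund promised?"))
```

Note that the CallAnalytics methods in the full script catch exceptions internally and print them, so a wrapper like this is only useful around code that lets the exception propagate, such as the underlying client call.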

8. Additional Resources

Documentation: docs.sarvam.ai
Related cookbooks: Batch Speech-to-Text Translate · Chat Completion API
Example projects: sarvam-ai-cookbook on GitHub
Community: Join the Discord Community

9. Final Notes

Keep your API key secure; prefer environment variables over hardcoding.
Use clear audio for best diarization and transcription results.
All outputs (transcripts, timing, analysis, answers, summaries) are saved under outputs/ for easy review.

Keep Building! 🚀