Build Your First Voice Agent using Pipecat

Overview

This guide demonstrates how to build a real-time voice agent that can listen, understand, and respond naturally using Pipecat for real-time communication and Sarvam AI for speech processing. It's a good fit for voice assistants, customer support bots, and conversational AI applications in Indian languages.

What You’ll Build

A voice agent that can:

  • Listen to users speaking (in multiple Indian languages!)
  • Understand and process their requests
  • Respond back in natural-sounding voices

Quick Overview

  1. Get API keys (Sarvam, OpenAI)
  2. Install packages: pip install "pipecat-ai[daily,openai,sarvam]" python-dotenv loguru
  3. Create .env file with your API keys
  4. Write ~80 lines of Python code
  5. Run with appropriate transport

Quick Start

1. Prerequisites

  • Python 3.9 or higher
  • A Sarvam AI API key (for speech-to-text and text-to-speech)
  • An OpenAI API key (for LLM responses)
2. Install Dependencies

$ pip install "pipecat-ai[daily,openai,sarvam]" python-dotenv loguru

3. Create Environment File

Create a file named .env in your project folder and add your API keys:

SARVAM_API_KEY=sk_xxxxxxxxxxxxxxxxxxxxxxxx
OPENAI_API_KEY=sk-proj-xxxxxxxxxxxxxxxx

Replace the values with your actual API keys.
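
Before writing the agent, you can sanity-check that the keys are actually visible to Python. This is a small stdlib-only sketch of ours (not part of Pipecat); the key names match the .env file above:

```python
import os

REQUIRED_KEYS = ["SARVAM_API_KEY", "OPENAI_API_KEY"]


def missing_keys(env=os.environ):
    """Return the names of required API keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not env.get(key)]


if __name__ == "__main__":
    absent = missing_keys()
    if absent:
        print("Missing keys:", ", ".join(absent))
    else:
        print("All API keys found.")
```

Run it after calling load_dotenv() (or with the variables exported) to catch typos in the key names early.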

4. Write Your Agent

Create agent.py:

import os
from dotenv import load_dotenv
from loguru import logger
from pipecat.frames.frames import LLMRunFrame
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineTask
from pipecat.processors.aggregators.llm_context import LLMContext
from pipecat.processors.aggregators.llm_response_universal import (
    LLMContextAggregatorPair,
)
from pipecat.runner.types import RunnerArguments
from pipecat.runner.utils import create_transport
from pipecat.services.sarvam.stt import SarvamSTTService
from pipecat.services.sarvam.tts import SarvamTTSService
from pipecat.services.openai.llm import OpenAILLMService
from pipecat.transports.base_transport import TransportParams
from pipecat.transports.daily.transport import DailyParams

load_dotenv(override=True)


async def bot(runner_args: RunnerArguments):
    """Main bot entry point."""

    # Create transport (supports both Daily and WebRTC)
    transport = await create_transport(
        runner_args,
        {
            "daily": lambda: DailyParams(audio_in_enabled=True, audio_out_enabled=True),
            "webrtc": lambda: TransportParams(
                audio_in_enabled=True, audio_out_enabled=True
            ),
        },
    )

    # Initialize AI services
    stt = SarvamSTTService(api_key=os.getenv("SARVAM_API_KEY"))
    tts = SarvamTTSService(api_key=os.getenv("SARVAM_API_KEY"))
    llm = OpenAILLMService(api_key=os.getenv("OPENAI_API_KEY"), model="gpt-4o-mini")

    # Set up conversation context
    messages = [
        {
            "role": "system",
            "content": "You are a friendly AI assistant. Keep your responses brief and conversational.",
        },
    ]
    context = LLMContext(messages)
    context_aggregator = LLMContextAggregatorPair(context)

    # Build pipeline
    pipeline = Pipeline(
        [
            transport.input(),
            stt,
            context_aggregator.user(),
            llm,
            tts,
            transport.output(),
            context_aggregator.assistant(),
        ]
    )

    task = PipelineTask(pipeline)

    @transport.event_handler("on_client_connected")
    async def on_client_connected(transport, client):
        logger.info("Client connected")
        messages.append(
            {"role": "system", "content": "Say hello and briefly introduce yourself."}
        )
        await task.queue_frames([LLMRunFrame()])

    @transport.event_handler("on_client_disconnected")
    async def on_client_disconnected(transport, client):
        logger.info("Client disconnected")
        await task.cancel()

    runner = PipelineRunner(handle_sigint=runner_args.handle_sigint)
    await runner.run(task)


if __name__ == "__main__":
    from pipecat.runner.run import main

    main()

5. Run Your Agent

For Daily transport:

$ python agent.py

The agent will create a Daily room and provide you with a URL to join.

6. Test Your Agent

Open the provided Daily room URL in your browser and start speaking. Your voice agent will listen and respond!


Customization Examples

Example 1: Hindi Voice Agent

# Initialize AI services with Hindi support
stt = SarvamSTTService(
    api_key=os.getenv("SARVAM_API_KEY"),
    language="hi-IN",  # Hindi
    model="saarika:v2.5",
)

tts = SarvamTTSService(
    api_key=os.getenv("SARVAM_API_KEY"),
    target_language_code="hi-IN",
    model="bulbul:v2",
    speaker="manisha",  # Or: anushka, vidya, arya, abhilash, karun, hitesh
)

llm = OpenAILLMService(api_key=os.getenv("OPENAI_API_KEY"), model="gpt-4o-mini")

Example 2: Tamil Voice Agent

stt = SarvamSTTService(
    api_key=os.getenv("SARVAM_API_KEY"),
    language="ta-IN",
    model="saarika:v2.5",
)

tts = SarvamTTSService(
    api_key=os.getenv("SARVAM_API_KEY"),
    target_language_code="ta-IN",
    model="bulbul:v2",
    speaker="anushka",
)

llm = OpenAILLMService(api_key=os.getenv("OPENAI_API_KEY"), model="gpt-4o-mini")

Example 3: Multilingual Agent (Auto-detect)

# Auto-detect the user's language
stt = SarvamSTTService(
    api_key=os.getenv("SARVAM_API_KEY"),
    language="unknown",  # Auto-detects language
    model="saarika:v2.5",
)

tts = SarvamTTSService(
    api_key=os.getenv("SARVAM_API_KEY"),
    target_language_code="en-IN",
    model="bulbul:v2",
    speaker="karun",
)

llm = OpenAILLMService(api_key=os.getenv("OPENAI_API_KEY"), model="gpt-4o-mini")

Example 4: Speech-to-English Agent (Saaras)

Difference: Saarika transcribes speech to text in the same language, while Saaras translates speech directly into English text. Use Saaras when users speak Indian languages but you want to process and respond in English.

# User speaks Hindi → Saaras converts to English → LLM processes → responds in English

stt = SarvamSTTService(
    api_key=os.getenv("SARVAM_API_KEY"),
    model="saaras:v2.5",  # Speech-to-English translation
)

tts = SarvamTTSService(
    api_key=os.getenv("SARVAM_API_KEY"),
    target_language_code="en-IN",
    model="bulbul:v2",
    speaker="abhilash",
)

llm = OpenAILLMService(api_key=os.getenv("OPENAI_API_KEY"), model="gpt-4o-mini")

Note: Saaras automatically detects the source language (Hindi, Tamil, etc.) and translates spoken content directly to English text, making Indian language speech comprehensible to English-based LLMs.
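
The choice between the two STT models can be captured in a small helper. This is a sketch of ours, not part of the SDK; the model names and the language parameter follow the examples above:

```python
def stt_config(translate_to_english, language="unknown"):
    """Build keyword arguments for SarvamSTTService.

    Saaras translates speech directly to English (and auto-detects the
    source language, so no language parameter is needed); Saarika
    transcribes in the spoken language.
    """
    if translate_to_english:
        return {"model": "saaras:v2.5"}
    return {"model": "saarika:v2.5", "language": language}
```

Then, for example, SarvamSTTService(api_key=..., **stt_config(True)) gives you the speech-to-English setup from Example 4.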


Available Options

Language Codes

Language          Code
English (India)   en-IN
Hindi             hi-IN
Bengali           bn-IN
Tamil             ta-IN
Telugu            te-IN
Gujarati          gu-IN
Kannada           kn-IN
Malayalam         ml-IN
Marathi           mr-IN
Punjabi           pa-IN
Odia              od-IN
Auto-detect       unknown
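
For use in code, the table above can be captured as a plain Python dict. This is a convenience sketch of ours, not part of the Pipecat or Sarvam SDKs:

```python
# Supported language codes, as listed in the table above.
LANGUAGE_CODES = {
    "English (India)": "en-IN",
    "Hindi": "hi-IN",
    "Bengali": "bn-IN",
    "Tamil": "ta-IN",
    "Telugu": "te-IN",
    "Gujarati": "gu-IN",
    "Kannada": "kn-IN",
    "Malayalam": "ml-IN",
    "Marathi": "mr-IN",
    "Punjabi": "pa-IN",
    "Odia": "od-IN",
}


def resolve_language(name):
    """Map a human-readable language name to its code; fall back to auto-detect."""
    return LANGUAGE_CODES.get(name, "unknown")
```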

Speaker Voices (Bulbul v2)

Female Voices:

  • anushka - Clear and professional (default)
  • manisha - Warm and friendly
  • vidya - Articulate and precise
  • arya - Young and energetic

Male Voices:

  • abhilash - Deep and authoritative
  • karun - Natural and conversational
  • hitesh - Professional and engaging
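
If you select voices programmatically, the lists above fit in a small lookup. This is a convenience sketch of ours, not part of the SDK:

```python
# Bulbul v2 speakers, grouped by gender as listed above.
SPEAKERS = {
    "anushka": "female",
    "manisha": "female",
    "vidya": "female",
    "arya": "female",
    "abhilash": "male",
    "karun": "male",
    "hitesh": "male",
}


def speakers_by_gender(gender):
    """Return the available speaker names for a gender, alphabetically."""
    return sorted(name for name, g in SPEAKERS.items() if g == gender)
```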

TTS Additional Parameters

You can customize the TTS service with additional parameters:

tts = SarvamTTSService(
    api_key=os.getenv("SARVAM_API_KEY"),
    target_language_code="en-IN",
    model="bulbul:v2",
    speaker="anushka",
    pitch=0.0,  # Range: -1.0 to 1.0
    pace=1.0,  # Range: 0.5 to 2.0
    loudness=1.5,  # Range: 0.5 to 2.0
    speech_sample_rate=16000,  # 8000, 16000, or 24000 Hz
)
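
If these tuning values come from user input or a config file, it can help to clamp them to the documented ranges before passing them on. A small helper sketch of ours (not part of the SDK), assuming the ranges shown above:

```python
def clamp(value, lo, hi):
    """Constrain value to the inclusive range [lo, hi]."""
    return max(lo, min(hi, value))


def safe_tts_params(pitch=0.0, pace=1.0, loudness=1.0, sample_rate=16000):
    """Clamp TTS tuning values to their documented ranges.

    Raises ValueError for unsupported sample rates, since those cannot
    be meaningfully clamped.
    """
    if sample_rate not in (8000, 16000, 24000):
        raise ValueError("speech_sample_rate must be 8000, 16000, or 24000")
    return {
        "pitch": clamp(pitch, -1.0, 1.0),
        "pace": clamp(pace, 0.5, 2.0),
        "loudness": clamp(loudness, 0.5, 2.0),
        "speech_sample_rate": sample_rate,
    }
```

You could then construct the service with SarvamTTSService(api_key=..., target_language_code="en-IN", model="bulbul:v2", speaker="anushka", **safe_tts_params(pitch=0.2)).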

Understanding the Pipeline

Pipecat uses a pipeline architecture where data flows through a series of processors:

User Audio → STT → Context Aggregator → LLM → TTS → Audio Output

  1. Transport Input: Receives audio from the user
  2. STT (Speech-to-Text): Converts audio to text using Sarvam’s Saarika
  3. Context Aggregator (User): Adds user message to conversation context
  4. LLM: Generates response using OpenAI
  5. TTS (Text-to-Speech): Converts response to audio using Sarvam’s Bulbul
  6. Transport Output: Sends audio back to the user
  7. Context Aggregator (Assistant): Saves assistant’s response to context
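
The flow above can be sketched with plain functions. This toy chain is not Pipecat's real frame machinery, just an illustration of the processor-chain idea: each stage transforms the previous stage's output and passes it along:

```python
def fake_stt(audio):
    # Stand-in for SarvamSTTService: pretend the "audio" is already text.
    return audio


def fake_llm(text):
    # Stand-in for OpenAILLMService: produce a canned reply.
    return f"You said: {text}"


def fake_tts(text):
    # Stand-in for SarvamTTSService: wrap text as a pretend audio frame.
    return f"<audio:{text}>"


def run_pipeline(frame, processors):
    """Pass a frame through each processor in order, like a Pipecat Pipeline."""
    for process in processors:
        frame = process(frame)
    return frame


result = run_pipeline("hello", [fake_stt, fake_llm, fake_tts])
print(result)  # <audio:You said: hello>
```

In the real pipeline the frames are audio and text objects flowing asynchronously, and the context aggregators record each turn, but the ordering principle is the same.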

Pro Tips

  • Use language="unknown" to automatically detect the language. Great for multilingual scenarios!
  • Sarvam’s models understand code-mixing - your agent can naturally handle Hinglish, Tanglish, and other mixed languages.
  • Adjust pitch, pace, and loudness to customize the voice personality.
  • Use gpt-4o-mini for faster responses, or gpt-4o for more complex conversations.

Troubleshooting

API key errors: Check that all keys are in your .env file and the file is in the same directory as your script.

Module not found: Run pip install "pipecat-ai[daily,openai,sarvam]" python-dotenv loguru again.

Poor transcription: Try language="unknown" for auto-detection, or specify the correct language code (en-IN, hi-IN, etc.).

Connection issues: Ensure you have a stable internet connection and the transport is properly configured.



Happy Building!