Build Your First Voice Agent using Pipecat
Overview
This guide demonstrates how to build a real-time voice agent that can listen, understand, and respond naturally using Pipecat for real-time communication and Sarvam AI for speech processing. Perfect for building voice assistants, customer support bots, and conversational AI applications for Indian languages.
What You’ll Build
A voice agent that can:
- Listen to users speaking (in multiple Indian languages!)
- Understand and process their requests
- Respond back in natural-sounding voices
Quick Overview
- Get API keys (Sarvam, OpenAI)
- Install packages:
pip install pipecat-ai[daily,openai,sarvam] python-dotenv - Create
.envfile with your API keys - Write ~80 lines of Python code
- Run with appropriate transport
Quick Start
1. Prerequisites
- Python 3.9 or higher
- API keys from:
2. Install Dependencies
3. Create Environment File
Create a file named .env in your project folder and add your API keys:
Replace the values with your actual API keys.
4. Write Your Agent
Create agent.py:
5. Run Your Agent
For Daily transport:
The agent will create a Daily room and provide you with a URL to join.
6. Test Your Agent
Open the provided Daily room URL in your browser and start speaking. Your voice agent will listen and respond!
Customization Examples
Example 1: Hindi Voice Agent
Example 2: Tamil Voice Agent
Example 3: Multilingual Agent (Auto-detect)
Example 4: Speech-to-English Agent (Saaras)
Difference: Saarika transcribes speech to text in the same language, while Saaras translates speech directly to English text. Use Saaras when user speaks Indian languages but you want to process/respond in English.
Note: Saaras automatically detects the source language (Hindi, Tamil, etc.) and translates spoken content directly to English text, making Indian language speech comprehensible to English-based LLMs.
Available Options
Language Codes
Speaker Voices (Bulbul v2)
Female Voices:
anushka- Clear and professional (default)manisha- Warm and friendlyvidya- Articulate and precisearya- Young and energetic
Male Voices:
abhilash- Deep and authoritativekarun- Natural and conversationalhitesh- Professional and engaging
TTS Additional Parameters
You can customize the TTS service with additional parameters:
Understanding the Pipeline
Pipecat uses a pipeline architecture where data flows through a series of processors:
- Transport Input: Receives audio from the user
- STT (Speech-to-Text): Converts audio to text using Sarvam’s Saarika
- Context Aggregator (User): Adds user message to conversation context
- LLM: Generates response using OpenAI
- TTS (Text-to-Speech): Converts response to audio using Sarvam’s Bulbul
- Transport Output: Sends audio back to the user
- Context Aggregator (Assistant): Saves assistant’s response to context
Pro Tips
- Use
language="unknown"to automatically detect the language. Great for multilingual scenarios! - Sarvam’s models understand code-mixing - your agent can naturally handle Hinglish, Tanglish, and other mixed languages.
- Adjust
pitch,pace, andloudnessto customize the voice personality. - Use
gpt-4o-minifor faster responses, orgpt-4ofor more complex conversations.
Troubleshooting
API key errors: Check that all keys are in your .env file and the file is in the same directory as your script.
Module not found: Run pip install pipecat-ai[daily,openai] python-dotenv loguru again.
Poor transcription: Try language="unknown" for auto-detection, or specify the correct language code (en-IN, hi-IN, etc.).
Connection issues: Ensure you have a stable internet connection and the transport is properly configured.
Additional Resources
Need Help?
- Sarvam Support: support@sarvam.ai
- Community: Join the Discord Community
Happy Building!