Build Your First Voice Agent using Pipecat

Overview

This guide demonstrates how to build a real-time voice agent that can listen, understand, and respond naturally using Pipecat for real-time communication and Sarvam AI for speech processing. It's a good fit for voice assistants, customer support bots, and conversational AI applications in Indian languages.

What You’ll Build

A voice agent that can:

  • Listen to users speaking (in multiple Indian languages!)
  • Understand and process their requests
  • Respond back in natural-sounding voices

Quick Overview

  1. Get API keys (Sarvam, OpenAI)
  2. Install packages: pip install "pipecat-ai[daily,openai,sarvam]" python-dotenv loguru
  3. Create .env file with your API keys
  4. Write ~80 lines of Python code
  5. Run with appropriate transport

Quick Start

1. Prerequisites

  • Python 3.9 or higher
  • A Sarvam AI API key (for speech-to-text and text-to-speech)
  • An OpenAI API key (for LLM responses)
2. Install Dependencies

$ pip install "pipecat-ai[daily,openai,sarvam]" python-dotenv loguru

3. Create Environment File

Create a file named .env in your project folder and add your API keys:

SARVAM_API_KEY=sk_xxxxxxxxxxxxxxxxxxxxxxxx
OPENAI_API_KEY=sk-proj-xxxxxxxxxxxxxxxx

Replace the values with your actual API keys.
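
Before writing the agent, you can sanity-check that the keys are actually visible to Python. This is a small stdlib-only sketch of ours (not part of Pipecat); the key names match the .env file above:

```python
import os

REQUIRED_KEYS = ["SARVAM_API_KEY", "OPENAI_API_KEY"]


def missing_keys(env=os.environ):
    """Return the names of required API keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not env.get(key)]


if __name__ == "__main__":
    absent = missing_keys()
    if absent:
        print("Missing keys:", ", ".join(absent))
    else:
        print("All API keys found.")
```

Run it after calling load_dotenv() (or with the variables exported) to catch typos in the key names early.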

4. Write Your Agent

Create agent.py:

import os
from dotenv import load_dotenv
from loguru import logger
from pipecat.frames.frames import LLMRunFrame
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineTask
from pipecat.processors.aggregators.llm_context import LLMContext
from pipecat.processors.aggregators.llm_response_universal import (
    LLMContextAggregatorPair,
)
from pipecat.runner.types import RunnerArguments
from pipecat.runner.utils import create_transport
from pipecat.services.sarvam.stt import SarvamSTTService
from pipecat.services.sarvam.tts import SarvamTTSService
from pipecat.services.openai.llm import OpenAILLMService
from pipecat.transports.base_transport import TransportParams
from pipecat.transports.daily.transport import DailyParams

load_dotenv(override=True)


async def bot(runner_args: RunnerArguments):
    """Main bot entry point."""

    # Create transport (supports both Daily and WebRTC)
    transport = await create_transport(
        runner_args,
        {
            "daily": lambda: DailyParams(audio_in_enabled=True, audio_out_enabled=True),
            "webrtc": lambda: TransportParams(
                audio_in_enabled=True, audio_out_enabled=True
            ),
        },
    )

    # Initialize AI services
    stt = SarvamSTTService(api_key=os.getenv("SARVAM_API_KEY"))
    tts = SarvamTTSService(api_key=os.getenv("SARVAM_API_KEY"))
    llm = OpenAILLMService(api_key=os.getenv("OPENAI_API_KEY"), model="gpt-4o-mini")

    # Set up conversation context
    messages = [
        {
            "role": "system",
            "content": "You are a friendly AI assistant. Keep your responses brief and conversational.",
        },
    ]
    context = LLMContext(messages)
    context_aggregator = LLMContextAggregatorPair(context)

    # Build pipeline
    pipeline = Pipeline(
        [
            transport.input(),
            stt,
            context_aggregator.user(),
            llm,
            tts,
            transport.output(),
            context_aggregator.assistant(),
        ]
    )

    task = PipelineTask(pipeline)

    @transport.event_handler("on_client_connected")
    async def on_client_connected(transport, client):
        logger.info("Client connected")
        messages.append(
            {"role": "system", "content": "Say hello and briefly introduce yourself."}
        )
        await task.queue_frames([LLMRunFrame()])

    @transport.event_handler("on_client_disconnected")
    async def on_client_disconnected(transport, client):
        logger.info("Client disconnected")
        await task.cancel()

    runner = PipelineRunner(handle_sigint=runner_args.handle_sigint)
    await runner.run(task)


if __name__ == "__main__":
    from pipecat.runner.run import main

    main()

5. Run Your Agent

For Daily transport:

$ python agent.py

The agent will create a Daily room and provide you with a URL to join.

6. Test Your Agent

Open the provided Daily room URL in your browser and start speaking. Your voice agent will listen and respond!


Customization Examples

Example 1: Hindi Voice Agent

# Initialize AI services with Hindi support
stt = SarvamSTTService(
    api_key=os.getenv("SARVAM_API_KEY"),
    language="hi-IN",  # Hindi
    model="saarika:v2.5",
)

tts = SarvamTTSService(
    api_key=os.getenv("SARVAM_API_KEY"),
    target_language_code="hi-IN",
    model="bulbul:v2",
    speaker="manisha",  # Or: anushka, vidya, arya, abhilash, karun, hitesh
)

llm = OpenAILLMService(api_key=os.getenv("OPENAI_API_KEY"), model="gpt-4o-mini")

Example 2: Tamil Voice Agent

stt = SarvamSTTService(
    api_key=os.getenv("SARVAM_API_KEY"),
    language="ta-IN",
    model="saarika:v2.5",
)

tts = SarvamTTSService(
    api_key=os.getenv("SARVAM_API_KEY"),
    target_language_code="ta-IN",
    model="bulbul:v2",
    speaker="anushka",
)

llm = OpenAILLMService(api_key=os.getenv("OPENAI_API_KEY"), model="gpt-4o-mini")

Example 3: Multilingual Agent (Auto-detect)

# Auto-detect the user's language
stt = SarvamSTTService(
    api_key=os.getenv("SARVAM_API_KEY"),
    language="unknown",  # Auto-detects language
    model="saarika:v2.5",
)

tts = SarvamTTSService(
    api_key=os.getenv("SARVAM_API_KEY"),
    target_language_code="en-IN",
    model="bulbul:v2",
    speaker="karun",
)

llm = OpenAILLMService(api_key=os.getenv("OPENAI_API_KEY"), model="gpt-4o-mini")

Example 4: Speech-to-English Agent (Saaras)

Difference: Saarika transcribes speech to text in the same language, while Saaras translates speech directly into English text. Use Saaras when users speak Indian languages but you want to process and respond in English.

# User speaks Hindi → Saaras converts to English → LLM processes → responds in English

stt = SarvamSTTService(
    api_key=os.getenv("SARVAM_API_KEY"),
    model="saaras:v2.5",  # Speech-to-English translation
)

tts = SarvamTTSService(
    api_key=os.getenv("SARVAM_API_KEY"),
    target_language_code="en-IN",
    model="bulbul:v2",
    speaker="abhilash",
)

llm = OpenAILLMService(api_key=os.getenv("OPENAI_API_KEY"), model="gpt-4o-mini")

Note: Saaras automatically detects the source language (Hindi, Tamil, etc.) and translates spoken content directly to English text, making Indian language speech comprehensible to English-based LLMs.
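
The choice between the two STT models can be captured in a small helper. This is a sketch of ours, not part of the SDK; the model names and the language parameter follow the examples above:

```python
def stt_config(translate_to_english, language="unknown"):
    """Build keyword arguments for SarvamSTTService.

    Saaras translates speech directly to English (and auto-detects the
    source language, so no language parameter is needed); Saarika
    transcribes in the spoken language.
    """
    if translate_to_english:
        return {"model": "saaras:v2.5"}
    return {"model": "saarika:v2.5", "language": language}
```

Then, for example, SarvamSTTService(api_key=..., **stt_config(True)) gives you the speech-to-English setup from Example 4.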


Available Options

Language Codes

Language          Code
English (India)   en-IN
Hindi             hi-IN
Bengali           bn-IN
Tamil             ta-IN
Telugu            te-IN
Gujarati          gu-IN
Kannada           kn-IN
Malayalam         ml-IN
Marathi           mr-IN
Punjabi           pa-IN
Odia              od-IN
Auto-detect       unknown
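
For use in code, the table above can be captured as a plain Python dict. This is a convenience sketch of ours, not part of the Pipecat or Sarvam SDKs:

```python
# Supported language codes, as listed in the table above.
LANGUAGE_CODES = {
    "English (India)": "en-IN",
    "Hindi": "hi-IN",
    "Bengali": "bn-IN",
    "Tamil": "ta-IN",
    "Telugu": "te-IN",
    "Gujarati": "gu-IN",
    "Kannada": "kn-IN",
    "Malayalam": "ml-IN",
    "Marathi": "mr-IN",
    "Punjabi": "pa-IN",
    "Odia": "od-IN",
}


def resolve_language(name):
    """Map a human-readable language name to its code; fall back to auto-detect."""
    return LANGUAGE_CODES.get(name, "unknown")
```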

Speaker Voices (Bulbul v2)

Female Voices:

  • anushka - Clear and professional (default)
  • manisha - Warm and friendly
  • vidya - Articulate and precise
  • arya - Young and energetic

Male Voices:

  • abhilash - Deep and authoritative
  • karun - Natural and conversational
  • hitesh - Professional and engaging
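
If you select voices programmatically, the lists above fit in a small lookup. This is a convenience sketch of ours, not part of the SDK:

```python
# Bulbul v2 speakers, grouped by gender as listed above.
SPEAKERS = {
    "anushka": "female",
    "manisha": "female",
    "vidya": "female",
    "arya": "female",
    "abhilash": "male",
    "karun": "male",
    "hitesh": "male",
}


def speakers_by_gender(gender):
    """Return the available speaker names for a gender, alphabetically."""
    return sorted(name for name, g in SPEAKERS.items() if g == gender)
```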

TTS Additional Parameters

You can customize the TTS service with additional parameters:

tts = SarvamTTSService(
    api_key=os.getenv("SARVAM_API_KEY"),
    target_language_code="en-IN",
    model="bulbul:v2",
    speaker="anushka",
    pitch=0.0,  # Range: -1.0 to 1.0
    pace=1.0,  # Range: 0.5 to 2.0
    loudness=1.5,  # Range: 0.5 to 2.0
    speech_sample_rate=16000,  # 8000, 16000, or 24000 Hz
)
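
If these tuning values come from user input or a config file, it can help to clamp them to the documented ranges before passing them on. A small helper sketch of ours (not part of the SDK), assuming the ranges shown above:

```python
def clamp(value, lo, hi):
    """Constrain value to the inclusive range [lo, hi]."""
    return max(lo, min(hi, value))


def safe_tts_params(pitch=0.0, pace=1.0, loudness=1.0, sample_rate=16000):
    """Clamp TTS tuning values to their documented ranges.

    Raises ValueError for unsupported sample rates, since those cannot
    be meaningfully clamped.
    """
    if sample_rate not in (8000, 16000, 24000):
        raise ValueError("speech_sample_rate must be 8000, 16000, or 24000")
    return {
        "pitch": clamp(pitch, -1.0, 1.0),
        "pace": clamp(pace, 0.5, 2.0),
        "loudness": clamp(loudness, 0.5, 2.0),
        "speech_sample_rate": sample_rate,
    }
```

You could then construct the service with SarvamTTSService(api_key=..., target_language_code="en-IN", model="bulbul:v2", speaker="anushka", **safe_tts_params(pitch=0.2)).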

Understanding the Pipeline

Pipecat uses a pipeline architecture where data flows through a series of processors:

User Audio → STT → Context Aggregator → LLM → TTS → Audio Output

  1. Transport Input: Receives audio from the user
  2. STT (Speech-to-Text): Converts audio to text using Sarvam’s Saarika
  3. Context Aggregator (User): Adds user message to conversation context
  4. LLM: Generates response using OpenAI
  5. TTS (Text-to-Speech): Converts response to audio using Sarvam’s Bulbul
  6. Transport Output: Sends audio back to the user
  7. Context Aggregator (Assistant): Saves assistant’s response to context
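
The flow above can be sketched with plain functions. This toy chain is not Pipecat's real frame machinery, just an illustration of the processor-chain idea: each stage transforms the previous stage's output and passes it along:

```python
def fake_stt(audio):
    # Stand-in for SarvamSTTService: pretend the "audio" is already text.
    return audio


def fake_llm(text):
    # Stand-in for OpenAILLMService: produce a canned reply.
    return f"You said: {text}"


def fake_tts(text):
    # Stand-in for SarvamTTSService: wrap text as a pretend audio frame.
    return f"<audio:{text}>"


def run_pipeline(frame, processors):
    """Pass a frame through each processor in order, like a Pipecat Pipeline."""
    for process in processors:
        frame = process(frame)
    return frame


result = run_pipeline("hello", [fake_stt, fake_llm, fake_tts])
print(result)  # <audio:You said: hello>
```

In the real pipeline the frames are audio and text objects flowing asynchronously, and the context aggregators record each turn, but the ordering principle is the same.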

Pro Tips

  • Use language="unknown" to automatically detect the language. Great for multilingual scenarios!
  • Sarvam’s models understand code-mixing - your agent can naturally handle Hinglish, Tanglish, and other mixed languages.
  • Adjust pitch, pace, and loudness to customize the voice personality.
  • Use gpt-4o-mini for faster responses, or gpt-4o for more complex conversations.

Troubleshooting

API key errors: Check that all keys are in your .env file and the file is in the same directory as your script.

Module not found: Run pip install "pipecat-ai[daily,openai,sarvam]" python-dotenv loguru again.

Poor transcription: Try language="unknown" for auto-detection, or specify the correct language code (en-IN, hi-IN, etc.).

Connection issues: Ensure you have a stable internet connection and the transport is properly configured.



Happy Building!