Build Your First Voice Agent using LiveKit

Overview

This guide demonstrates how to build a real-time voice agent that can listen, understand, and respond naturally using LiveKit for real-time communication and Sarvam AI for speech processing. Perfect for building voice assistants, customer support bots, and conversational AI applications for Indian languages.

What You’ll Build

A voice agent that can:

  • Listen to users speaking (in multiple Indian languages!)
  • Understand and process their requests
  • Respond back in natural-sounding voices

Quick Overview

  1. Get API keys (LiveKit, Sarvam, OpenAI)
  2. Install packages: pip install "livekit-agents[sarvam,openai,silero]" python-dotenv
  3. Create .env file with your API keys
  4. Write ~40 lines of Python code
  5. Run: python agent.py dev
  6. Test: python agent.py console

Quick Start

1. Prerequisites

Before you start, make sure you have Python installed, along with API keys for LiveKit Cloud, Sarvam AI, and OpenAI (see the Quick Overview above).

2. Install Dependencies

```shell
pip install "livekit-agents[sarvam,openai,silero]" python-dotenv
```

3. Create Environment File

Create a file named .env in your project folder and add your API keys:

```shell
LIVEKIT_URL=wss://your-project-xxxxx.livekit.cloud
LIVEKIT_API_KEY=APIxxxxxxxxxxxxx
LIVEKIT_API_SECRET=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
SARVAM_API_KEY=sk_xxxxxxxxxxxxxxxxxxxxxxxx
OPENAI_API_KEY=sk-proj-xxxxxxxxxxxxxxxx
```

Replace the values with your actual API keys.
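If you're curious what load_dotenv() does with this file, here is a rough pure-Python sketch. The real python-dotenv library also handles quoting, `export` prefixes, and variable interpolation that this toy version skips, so this is for illustration only:

```python
import os

def load_env_file(path: str = ".env") -> None:
    """Minimal illustration of what python-dotenv's load_dotenv() does:
    read KEY=VALUE lines and place them into os.environ."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            # Skip blank lines and comments
            if not line or line.startswith("#"):
                continue
            key, _, value = line.partition("=")
            # Existing environment variables win, matching load_dotenv's default
            os.environ.setdefault(key.strip(), value.strip())

# After loading, your code reads the keys via os.environ, e.g.
# os.environ["SARVAM_API_KEY"]
```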

4. Write Your Agent

Create agent.py:

```python
import logging

from dotenv import load_dotenv
from livekit.agents import JobContext, WorkerOptions, cli
from livekit.agents.voice import Agent, AgentSession
from livekit.plugins import openai, sarvam

# Load environment variables
load_dotenv()

# Set up logging
logger = logging.getLogger("voice-agent")
logger.setLevel(logging.INFO)


class VoiceAgent(Agent):
    def __init__(self) -> None:
        super().__init__(
            # Your agent's personality and instructions
            instructions="""
            You are a helpful voice assistant.
            Be friendly, concise, and conversational.
            Speak naturally as if you're having a real conversation.
            """,
            # Saaras v3 STT - Converts speech to text
            stt=sarvam.STT(
                language="unknown",  # Auto-detect language, or use "en-IN", "hi-IN", etc.
                model="saaras:v3",
                mode="transcribe",
            ),
            # OpenAI LLM - The "brain" that processes and generates responses
            llm=openai.LLM(model="gpt-4o"),
            # Bulbul TTS - Converts text to speech
            tts=sarvam.TTS(
                target_language_code="en-IN",
                model="bulbul:v3",
                speaker="shubh",  # Female: priya, simran, ishita, kavya | Male: aditya, anand, rohan
            ),
        )

    async def on_enter(self):
        """Called when the user joins - the agent starts the conversation."""
        self.session.generate_reply()


async def entrypoint(ctx: JobContext):
    """Main entry point - LiveKit calls this when a user connects."""
    logger.info(f"User connected to room: {ctx.room.name}")

    # Create and start the agent session
    session = AgentSession()
    await session.start(
        agent=VoiceAgent(),
        room=ctx.room,
    )


if __name__ == "__main__":
    # Run the agent
    cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint))
```

5. Run Your Agent

```shell
python agent.py dev
```

6. Test Your Agent

In a new terminal, run:

```shell
python agent.py console
```

That’s it! You’ve built your first voice agent!


Customization Examples

Example 1: Hindi Voice Agent

```python
stt=sarvam.STT(
    language="hi-IN",  # Hindi
    model="saaras:v3",
    mode="transcribe",
),
tts=sarvam.TTS(
    target_language_code="hi-IN",
    model="bulbul:v3",
    speaker="simran",  # Or: priya, ishita, kavya, aditya, anand, rohan
)
```

Example 2: Tamil Voice Agent

```python
stt=sarvam.STT(language="ta-IN", model="saaras:v3", mode="transcribe"),
tts=sarvam.TTS(
    target_language_code="ta-IN",
    model="bulbul:v3",
    speaker="shubh",
)
```

Example 3: Multilingual Agent (Auto-detect)

```python
stt=sarvam.STT(language="unknown", model="saaras:v3", mode="transcribe"),  # Auto-detects language
tts=sarvam.TTS(target_language_code="en-IN", model="bulbul:v3", speaker="anand")
```

Example 4: Speech-to-English Agent (Saaras)

Difference: Saaras v3 handles both transcription (same-language output) and translation (English output) via the mode parameter. Use mode="translate" when the user speaks an Indian language but you want to process and respond in English.

```python
# User speaks Hindi → Saaras converts to English → LLM processes → Responds in English

stt=sarvam.STT(model="saaras:v3", mode="translate"),  # Speech-to-English translation
llm=openai.LLM(model="gpt-4o"),
tts=sarvam.TTS(target_language_code="en-IN", model="bulbul:v3", speaker="aditya")
```

Note: Saaras v3 with mode="translate" automatically detects the source language (Hindi, Tamil, etc.) and translates spoken content directly to English text, making Indian language speech comprehensible to English-based LLMs.
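One way to keep the transcribe-vs-translate choice explicit in your own code is a small helper that returns the STT settings for each pipeline style. This is purely illustrative glue (the `stt_settings_for` helper is not part of the LiveKit or Sarvam APIs); the returned dicts mirror the keyword arguments shown above and could be expanded into `sarvam.STT(**settings)`:

```python
def stt_settings_for(pipeline: str) -> dict:
    """Return Saaras v3 STT keyword arguments for a pipeline style.

    'same-language'  -> transcribe in whatever language the user speaks
    'english-output' -> translate any supported Indian language to English
    """
    if pipeline == "same-language":
        return {"language": "unknown", "model": "saaras:v3", "mode": "transcribe"}
    if pipeline == "english-output":
        # No language argument needed: translate mode auto-detects the source language
        return {"model": "saaras:v3", "mode": "translate"}
    raise ValueError(f"unknown pipeline: {pipeline}")

# Usage: stt=sarvam.STT(**stt_settings_for("english-output"))
```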


Available Options

Language Codes

| Language        | Code    |
| --------------- | ------- |
| English (India) | en-IN   |
| Hindi           | hi-IN   |
| Bengali         | bn-IN   |
| Tamil           | ta-IN   |
| Telugu          | te-IN   |
| Gujarati        | gu-IN   |
| Kannada         | kn-IN   |
| Malayalam       | ml-IN   |
| Marathi         | mr-IN   |
| Punjabi         | pa-IN   |
| Odia            | od-IN   |
| Auto-detect     | unknown |
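To guard against typos in configuration, you might keep the table above as a mapping and validate codes at startup. This is a hypothetical convenience helper (not part of any SDK), assuming exactly the codes listed in the table:

```python
# Language codes from the table above
LANGUAGE_CODES = {
    "English (India)": "en-IN",
    "Hindi": "hi-IN",
    "Bengali": "bn-IN",
    "Tamil": "ta-IN",
    "Telugu": "te-IN",
    "Gujarati": "gu-IN",
    "Kannada": "kn-IN",
    "Malayalam": "ml-IN",
    "Marathi": "mr-IN",
    "Punjabi": "pa-IN",
    "Odia": "od-IN",
}

def validate_language_code(code: str) -> str:
    """Return the code unchanged if supported, else raise with the valid options."""
    if code == "unknown" or code in LANGUAGE_CODES.values():
        return code
    supported = ", ".join(sorted(LANGUAGE_CODES.values()))
    raise ValueError(
        f"Unsupported language code {code!r}. Use 'unknown' or one of: {supported}"
    )
```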

Speaker Voices (Bulbul v3)

Male (23): Shubh (default), Aditya, Rahul, Rohan, Amit, Dev, Ratan, Varun, Manan, Sumit, Kabir, Aayan, Ashutosh, Advait, Anand, Tarun, Sunny, Mani, Gokul, Vijay, Mohit, Rehan, Soham

Female (16): Ritu, Priya, Neha, Pooja, Simran, Kavya, Ishita, Shreya, Roopa, Amelia, Sophia, Tanya, Shruti, Suhani, Kavitha, Rupali


Pro Tips

  • Use language="unknown" to automatically detect the language. Great for multilingual scenarios!
  • Sarvam’s models understand code-mixing: your agent can naturally handle Hinglish, Tanglish, and other mixed languages.

Best Practices

When using Sarvam AI plugins with LiveKit, follow these recommendations for optimal performance:

1. Do Not Pass VAD to AgentSession

The vad parameter should not be passed to AgentSession as Voice Activity Detection is handled internally by the Sarvam plugin.

```python
# ❌ Avoid this
session = AgentSession(vad=silero.VAD.load())

# ✅ Do this instead
session = AgentSession()
```

2. Enable Flush Signal in STT

Add flush_signal=True to the STT configuration. This enables the plugin to emit start and end of speech events, which is essential for proper turn-taking.

```python
stt=sarvam.STT(
    language="unknown",
    model="saaras:v3",
    mode="transcribe",
    flush_signal=True,  # Enables speech start/end events
)
```

3. Set Turn Detection to STT

Add turn_detection="stt" to the AgentSession configuration. This ensures turn detection is handled by the Sarvam plugin, which emits start and end of speech signals.

```python
session = AgentSession(turn_detection="stt")
```

4. Configure Min Endpointing Delay

Set min_endpointing_delay=0.07 in your AgentSession. The Sarvam STT plugin has a processing latency of approximately 70ms. This setting ensures the agent transitions to the next pipeline step (LLM) as soon as STT finishes processing, minimizing response delay.

```python
session = AgentSession(
    turn_detection="stt",
    min_endpointing_delay=0.07,
)
```

Complete Optimized Example

Here’s a complete example incorporating all best practices:

```python
# STT with flush_signal enabled
stt=sarvam.STT(
    language="unknown",
    model="saaras:v3",
    mode="transcribe",
    flush_signal=True,
)

# AgentSession with optimized settings (no VAD parameter)
session = AgentSession(
    turn_detection="stt",
    min_endpointing_delay=0.07,
)
```

Troubleshooting

API key errors: Check that all keys are in your .env file and the file is in the same directory as your script.

Module not found: Re-run the installation command from Step 2, and make sure you run the agent with the same Python environment you installed into.

Poor transcription: Try language="unknown" for auto-detection, or specify the correct language code (en-IN, hi-IN, etc.).
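A quick preflight check can catch missing API keys before the agent starts. A minimal sketch, assuming the variable names from the .env file in Step 3 (the `missing_env_vars` helper is just illustrative, not part of any SDK):

```python
import os

# Variable names from the .env file in Step 3
REQUIRED_VARS = [
    "LIVEKIT_URL",
    "LIVEKIT_API_KEY",
    "LIVEKIT_API_SECRET",
    "SARVAM_API_KEY",
    "OPENAI_API_KEY",
]

def missing_env_vars() -> list:
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not os.environ.get(name)]

# Call after load_dotenv(), before starting the agent:
# problems = missing_env_vars()
# if problems:
#     raise SystemExit(f"Missing environment variables: {', '.join(problems)}")
```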


Happy Building!