Build Your First Voice Agent using LiveKit

Overview

This guide demonstrates how to build a real-time voice agent that can listen, understand, and respond naturally using LiveKit for real-time communication and Sarvam AI for speech processing. Perfect for building voice assistants, customer support bots, and conversational AI applications for Indian languages.

What You’ll Build

A voice agent that can:

  • Listen to users speaking (in multiple Indian languages!)
  • Understand and process their requests
  • Respond back in natural-sounding voices

Quick Overview

  1. Get API keys (LiveKit, Sarvam, OpenAI)
  2. Install packages: pip install "livekit-agents[sarvam,openai,silero]" python-dotenv
  3. Create .env file with your API keys
  4. Write ~40 lines of Python code
  5. Run: python agent.py dev
  6. Test: python agent.py console

Quick Start

1. Prerequisites

Before you start, make sure you have Python installed, along with API keys for LiveKit Cloud, Sarvam AI, and OpenAI (see the Quick Overview above).

2. Install Dependencies

```shell
pip install "livekit-agents[sarvam,openai,silero]" python-dotenv
```

3. Create Environment File

Create a file named .env in your project folder and add your API keys:

```shell
LIVEKIT_URL=wss://your-project-xxxxx.livekit.cloud
LIVEKIT_API_KEY=APIxxxxxxxxxxxxx
LIVEKIT_API_SECRET=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
SARVAM_API_KEY=sk_xxxxxxxxxxxxxxxxxxxxxxxx
OPENAI_API_KEY=sk-proj-xxxxxxxxxxxxxxxx
```

Replace the values with your actual API keys.
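If you're curious what load_dotenv() does with this file, here is a rough pure-Python sketch. The real python-dotenv library also handles quoting, `export` prefixes, and variable interpolation that this toy version skips, so this is for illustration only:

```python
import os

def load_env_file(path: str = ".env") -> None:
    """Minimal illustration of what python-dotenv's load_dotenv() does:
    read KEY=VALUE lines and place them into os.environ."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            # Skip blank lines and comments
            if not line or line.startswith("#"):
                continue
            key, _, value = line.partition("=")
            # Existing environment variables win, matching load_dotenv's default
            os.environ.setdefault(key.strip(), value.strip())

# After loading, your code reads the keys via os.environ, e.g.
# os.environ["SARVAM_API_KEY"]
```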

4. Write Your Agent

Create agent.py:

```python
import logging

from dotenv import load_dotenv
from livekit.agents import JobContext, WorkerOptions, cli
from livekit.agents.voice import Agent, AgentSession
from livekit.plugins import openai, sarvam

# Load environment variables
load_dotenv()

# Set up logging
logger = logging.getLogger("voice-agent")
logger.setLevel(logging.INFO)


class VoiceAgent(Agent):
    def __init__(self) -> None:
        super().__init__(
            # Your agent's personality and instructions
            instructions="""
            You are a helpful voice assistant.
            Be friendly, concise, and conversational.
            Speak naturally as if you're having a real conversation.
            """,
            # Saaras v3 STT - Converts speech to text
            stt=sarvam.STT(
                language="unknown",  # Auto-detect language, or use "en-IN", "hi-IN", etc.
                model="saaras:v3",
                mode="transcribe",
            ),
            # OpenAI LLM - The "brain" that processes and generates responses
            llm=openai.LLM(model="gpt-4o"),
            # Bulbul TTS - Converts text to speech
            tts=sarvam.TTS(
                target_language_code="en-IN",
                model="bulbul:v3",
                speaker="shubh",  # Female: priya, simran, ishita, kavya | Male: aditya, anand, rohan
            ),
        )

    async def on_enter(self):
        """Called when the user joins - the agent starts the conversation."""
        self.session.generate_reply()


async def entrypoint(ctx: JobContext):
    """Main entry point - LiveKit calls this when a user connects."""
    logger.info(f"User connected to room: {ctx.room.name}")

    # Create and start the agent session
    session = AgentSession()
    await session.start(
        agent=VoiceAgent(),
        room=ctx.room,
    )


if __name__ == "__main__":
    # Run the agent
    cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint))
```

5. Run Your Agent

```shell
python agent.py dev
```

6. Test Your Agent

In a new terminal, run:

```shell
python agent.py console
```

That’s it! You’ve built your first voice agent!


Customization Examples

Example 1: Hindi Voice Agent

```python
stt=sarvam.STT(
    language="hi-IN",  # Hindi
    model="saaras:v3",
    mode="transcribe",
),
tts=sarvam.TTS(
    target_language_code="hi-IN",
    model="bulbul:v3",
    speaker="simran",  # Or: priya, ishita, kavya, aditya, anand, rohan
)
```

Example 2: Tamil Voice Agent

```python
stt=sarvam.STT(language="ta-IN", model="saaras:v3", mode="transcribe"),
tts=sarvam.TTS(
    target_language_code="ta-IN",
    model="bulbul:v3",
    speaker="shubh",
)
```

Example 3: Multilingual Agent (Auto-detect)

```python
stt=sarvam.STT(language="unknown", model="saaras:v3", mode="transcribe"),  # Auto-detects language
tts=sarvam.TTS(target_language_code="en-IN", model="bulbul:v3", speaker="anand")
```

Example 4: Speech-to-English Agent (Saaras)

Difference: Saaras v3 handles both transcription (same-language output) and translation (English output) via the mode parameter. Use mode="translate" when the user speaks an Indian language but you want to process and respond in English.

```python
# User speaks Hindi → Saaras converts to English → LLM processes → Responds in English

stt=sarvam.STT(model="saaras:v3", mode="translate"),  # Speech-to-English translation
llm=openai.LLM(model="gpt-4o"),
tts=sarvam.TTS(target_language_code="en-IN", model="bulbul:v3", speaker="aditya")
```

Note: Saaras v3 with mode="translate" automatically detects the source language (Hindi, Tamil, etc.) and translates spoken content directly to English text, making Indian language speech comprehensible to English-based LLMs.
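One way to keep the transcribe-vs-translate choice explicit in your own code is a small helper that returns the STT settings for each pipeline style. This is purely illustrative glue (the `stt_settings_for` helper is not part of the LiveKit or Sarvam APIs); the returned dicts mirror the keyword arguments shown above and could be expanded into `sarvam.STT(**settings)`:

```python
def stt_settings_for(pipeline: str) -> dict:
    """Return Saaras v3 STT keyword arguments for a pipeline style.

    'same-language'  -> transcribe in whatever language the user speaks
    'english-output' -> translate any supported Indian language to English
    """
    if pipeline == "same-language":
        return {"language": "unknown", "model": "saaras:v3", "mode": "transcribe"}
    if pipeline == "english-output":
        # No language argument needed: translate mode auto-detects the source language
        return {"model": "saaras:v3", "mode": "translate"}
    raise ValueError(f"unknown pipeline: {pipeline}")

# Usage: stt=sarvam.STT(**stt_settings_for("english-output"))
```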


Available Options

Language Codes

| Language        | Code    |
| --------------- | ------- |
| English (India) | en-IN   |
| Hindi           | hi-IN   |
| Bengali         | bn-IN   |
| Tamil           | ta-IN   |
| Telugu          | te-IN   |
| Gujarati        | gu-IN   |
| Kannada         | kn-IN   |
| Malayalam       | ml-IN   |
| Marathi         | mr-IN   |
| Punjabi         | pa-IN   |
| Odia            | od-IN   |
| Auto-detect     | unknown |
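To guard against typos in configuration, you might keep the table above as a mapping and validate codes at startup. This is a hypothetical convenience helper (not part of any SDK), assuming exactly the codes listed in the table:

```python
# Language codes from the table above
LANGUAGE_CODES = {
    "English (India)": "en-IN",
    "Hindi": "hi-IN",
    "Bengali": "bn-IN",
    "Tamil": "ta-IN",
    "Telugu": "te-IN",
    "Gujarati": "gu-IN",
    "Kannada": "kn-IN",
    "Malayalam": "ml-IN",
    "Marathi": "mr-IN",
    "Punjabi": "pa-IN",
    "Odia": "od-IN",
}

def validate_language_code(code: str) -> str:
    """Return the code unchanged if supported, else raise with the valid options."""
    if code == "unknown" or code in LANGUAGE_CODES.values():
        return code
    supported = ", ".join(sorted(LANGUAGE_CODES.values()))
    raise ValueError(
        f"Unsupported language code {code!r}. Use 'unknown' or one of: {supported}"
    )
```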

Speaker Voices (Bulbul v3)

Male (23): Shubh (default), Aditya, Rahul, Rohan, Amit, Dev, Ratan, Varun, Manan, Sumit, Kabir, Aayan, Ashutosh, Advait, Anand, Tarun, Sunny, Mani, Gokul, Vijay, Mohit, Rehan, Soham

Female (16): Ritu, Priya, Neha, Pooja, Simran, Kavya, Ishita, Shreya, Roopa, Amelia, Sophia, Tanya, Shruti, Suhani, Kavitha, Rupali


Pro Tips

  • Use language="unknown" to automatically detect the language. Great for multilingual scenarios!
  • Sarvam’s models understand code-mixing: your agent can naturally handle Hinglish, Tanglish, and other mixed languages.

Best Practices

When using Sarvam AI plugins with LiveKit, follow these recommendations for optimal performance:

1. Do Not Pass VAD to AgentSession

The vad parameter should not be passed to AgentSession as Voice Activity Detection is handled internally by the Sarvam plugin.

```python
# ❌ Avoid this
session = AgentSession(vad=silero.VAD.load())

# ✅ Do this instead
session = AgentSession()
```

2. Enable Flush Signal in STT

Add flush_signal=True to the STT configuration. This enables the plugin to emit start and end of speech events, which is essential for proper turn-taking.

```python
stt=sarvam.STT(
    language="unknown",
    model="saaras:v3",
    mode="transcribe",
    flush_signal=True,  # Enables speech start/end events
)
```

3. Set Turn Detection to STT

Add turn_detection="stt" to the AgentSession configuration. This ensures turn detection is handled by the Sarvam plugin, which emits start and end of speech signals.

```python
session = AgentSession(turn_detection="stt")
```

4. Configure Min Endpointing Delay

Set min_endpointing_delay=0.07 in your AgentSession. The Sarvam STT plugin has a processing latency of approximately 70ms. This setting ensures the agent transitions to the next pipeline step (LLM) as soon as STT finishes processing, minimizing response delay.

```python
session = AgentSession(
    turn_detection="stt",
    min_endpointing_delay=0.07,
)
```

Complete Optimized Example

Here’s a complete example incorporating all best practices:

```python
# STT with flush_signal enabled
stt=sarvam.STT(
    language="unknown",
    model="saaras:v3",
    mode="transcribe",
    flush_signal=True,
)

# AgentSession with optimized settings (no VAD parameter)
session = AgentSession(
    turn_detection="stt",
    min_endpointing_delay=0.07,
)
```

Troubleshooting

API key errors: Check that all keys are in your .env file and the file is in the same directory as your script.

Module not found: Re-run the installation command from Step 2, and make sure you run the agent with the same Python environment you installed into.

Poor transcription: Try language="unknown" for auto-detection, or specify the correct language code (en-IN, hi-IN, etc.).
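A quick preflight check can catch missing API keys before the agent starts. A minimal sketch, assuming the variable names from the .env file in Step 3 (the `missing_env_vars` helper is just illustrative, not part of any SDK):

```python
import os

# Variable names from the .env file in Step 3
REQUIRED_VARS = [
    "LIVEKIT_URL",
    "LIVEKIT_API_KEY",
    "LIVEKIT_API_SECRET",
    "SARVAM_API_KEY",
    "OPENAI_API_KEY",
]

def missing_env_vars() -> list:
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not os.environ.get(name)]

# Call after load_dotenv(), before starting the agent:
# problems = missing_env_vars()
# if problems:
#     raise SystemExit(f"Missing environment variables: {', '.join(problems)}")
```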


Happy Building!