# Best Practices for Writing Text for TTS

> A guide to writing text that produces natural-sounding speech output with Sarvam AI Bulbul.

This guide covers both text-formatting tips and model configuration recommendations for production use.

***

## 1. Punctuation for Pauses

| Punctuation       | Effect                           | Example                                 |
| ----------------- | -------------------------------- | --------------------------------------- |
| `,` (comma)       | Short pause                      | "हाँ, मैं समझ गया"                      |
| `.` (full stop)   | Medium pause, sentence end       | "यह Very good है।"                      |
| `!` (exclamation) | Emphasis + pause                 | "नमस्ते!"                               |
| `…` (ellipsis)    | Hesitation / trailing off        | "मुझे लगता है… शायद हम try कर सकते हैं" |
| Line break        | Natural pause between paragraphs | See below                               |

**Tip:** Use `…` (ellipsis) to create a hesitation or trailing-off effect — it signals the speaker is thinking or pausing mid-thought. Use sparingly for natural results.

**Tip:** Use line breaks between paragraphs for natural breathing pauses:

```
हमारी technology सबको समझती है।

हमारा mission है कि हर Indian अपनी mother tongue में technology use कर सके।
```

***

## 2. Fillers & Hesitations for Natural Speech

Add fillers and hesitation markers to make speech sound conversational:

| Filler         | Effect                   | Example                                                     |
| -------------- | ------------------------ | ----------------------------------------------------------- |
| `um`           | Thinking pause           | "चाहे आप um Hindi बोलते हों"                                |
| `uh`           | Short hesitation         | "uh, मुझे एक second दो"                                     |
| `hmm`          | Contemplation            | "hmm, यह interesting है"                                    |
| `like...`      | Casual filler            | "या like... कोई भी Indian language"                         |
| `basically...` | Starting explanation     | "So basically... हम India की हर language को voice देते हैं" |
| `actually...`  | Adding emphasis          | "हमारी technology actually... सबको समझती है"                |
| `you know...`  | Conversational connector | "you know... यह बहुत simple है"                             |
| `I mean...`    | Self-correction          | "I mean... दूसरा option भी है"                              |

**Combining fillers with ellipsis for natural hesitation:**

```
So basically… हमारा goal है कि um हर Indian language को support करें।
I mean... यह easy नहीं है... but we're getting there.
```

***

## 3. Code-Mixing (Hinglish)

For natural Indian speech, mix English words where they're commonly used. This is how most urban Indians speak — the model handles it well.

**Rule: Write English words in English script, Hindi words in Devanagari:**

* ✅ "Sarvam AI में आपका स्वागत है"
* ❌ "सरवम एआई में आपका स्वागत है"

**Common code-mixed categories:**

| Category           | Examples                                       |
| ------------------ | ---------------------------------------------- |
| Tech terms         | technology, app, website, download, update, AI |
| Everyday words     | basically, actually, like, amazing, simple     |
| Social Expressions | thank you, sorry, please, welcome              |
| Business           | meeting, deadline, budget, report, feedback    |

**Full code-mixed examples:**

```
So basically... हम India की हर language को voice देते हैं।
चाहे आप um Hindi बोलते हों, Tamil, Telugu, Bengali या like... कोई भी Indian language।

अगर आपको कोई doubt है तो please हमें contact करें।
Meeting actually postpone हो गई है, I mean... tomorrow रखते हैं।
```

**Keep Hindi sentence structure, swap key nouns/verbs with English:**

* "हर Indian अपनी mother tongue में technology use कर सके"
* "आज का weather actually बहुत pleasant है"
* "यह app basically आपकी daily life को simple बना देगा"

***

## 4. Avoid These

| Avoid                  | Why                            | Fix                                                                            |
| ---------------------- | ------------------------------ | ------------------------------------------------------------------------------ |
| Overusing `...`        | Too many ellipses sound choppy | Use `…` sparingly for hesitation; prefer `,` or line breaks for regular pauses |
| Complex Sanskrit words | May mispronounce               | Use simpler Hindi                                                              |
| Very long sentences    | Unnatural breathing            | Break into shorter sentences                                                   |
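
The checks in the table above can be automated before text is sent for synthesis. The following is a minimal sketch, not part of any Sarvam SDK: `lint_tts_text` is a hypothetical helper, and both thresholds are illustrative assumptions that should be tuned by listening to the output.

```python
import re

# Hypothetical thresholds, chosen only for illustration.
MAX_SENTENCE_CHARS = 200
MAX_ELLIPSES = 2


def lint_tts_text(text: str) -> list[str]:
    """Return warnings for patterns this guide advises against."""
    warnings = []
    # Count both the single-character ellipsis and three ASCII dots.
    ellipses = len(re.findall(r"…|\.{3}", text))
    if ellipses > MAX_ELLIPSES:
        warnings.append(f"{ellipses} ellipses found; prefer commas or line breaks")
    # Split after sentence-ending punctuation (danda, full stop, !, ?).
    for sentence in re.split(r"(?<=[।.!?])\s+", text):
        if len(sentence) > MAX_SENTENCE_CHARS:
            warnings.append(f"long sentence ({len(sentence)} chars): {sentence[:30]}")
    return warnings


for warning in lint_tts_text("um... uh... hmm... यह interesting है।"):
    print(warning)
```

The helper only flags problems; deciding how to rewrite a sentence is left to the author.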

***

## 5. Language-Specific Tips

### Sentence-ending punctuation

* If a sentence **ends in Hindi or a regional language**, use `।`: `"हमारी technology सबको समझती है।"`
* If a sentence **ends in English**, use `.`: `"Plan simple है, just execute."`

### Writing Conventions

* Write language names in English: Tamil, Telugu, Bengali (not तमिल, तेलुगु)
* Keep brand names in English: Sarvam AI, Google, WhatsApp
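
The sentence-ending rule can be applied mechanically when text is generated by a program. This sketch assumes that "ends in English" can be approximated by checking whether the last letter is ASCII; `end_sentence` is a hypothetical helper, not an SDK function.

```python
def end_sentence(sentence: str) -> str:
    """Append '।' or '.' depending on the script of the last letter."""
    text = sentence.rstrip()
    # Leave trailing ellipses, exclamation marks, and question marks alone.
    if text.endswith(("...", "…", "!", "?")):
        return text
    # Drop any existing terminator so it can be replaced with the right one.
    text = text.rstrip("।.")
    # Find the last alphabetic character; vowel signs and digits are skipped.
    last_letter = next((ch for ch in reversed(text) if ch.isalpha()), "")
    # Assumption: an ASCII letter means the sentence ends in English.
    terminator = "." if last_letter.isascii() else "।"
    return text + terminator


print(end_sentence("हमारी technology सबको समझती है"))
print(end_sentence("Plan simple है, just execute"))
```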

***

## 6. Target Language Code

The `target_language_code` parameter is **required** for every TTS request. Its main effect is on language-specific handling of numbers, abbreviations, and special characters.

### Supported Languages

| Language  | Code    |
| --------- | ------- |
| English   | `en-IN` |
| Hindi     | `hi-IN` |
| Bengali   | `bn-IN` |
| Tamil     | `ta-IN` |
| Telugu    | `te-IN` |
| Kannada   | `kn-IN` |
| Malayalam | `ml-IN` |
| Marathi   | `mr-IN` |
| Gujarati  | `gu-IN` |
| Punjabi   | `pa-IN` |
| Odia      | `od-IN` |

### Example

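```python
from sarvamai import SarvamAI

client = SarvamAI(api_subscription_key="YOUR_SARVAM_API_KEY")

audio = client.text_to_speech.convert(
    text="नमस्ते! Sarvam AI में आपका स्वागत है।",
    model="bulbul:v3",
    target_language_code="hi-IN",  # required on every request
    speaker="shubh"
)
```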

If your text contains mixed languages (e.g., Hinglish), set `target_language_code` to the language in which entities such as numbers should be spoken.
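
Because the parameter is required, it can help to validate the code before making a request. The sketch below simply encodes the supported-languages table as a dictionary; `resolve_language_code` is a hypothetical helper, not part of the SDK.

```python
# Codes copied from the supported-languages table in this section.
SUPPORTED_LANGUAGE_CODES = {
    "english": "en-IN",
    "hindi": "hi-IN",
    "bengali": "bn-IN",
    "tamil": "ta-IN",
    "telugu": "te-IN",
    "kannada": "kn-IN",
    "malayalam": "ml-IN",
    "marathi": "mr-IN",
    "gujarati": "gu-IN",
    "punjabi": "pa-IN",
    "odia": "od-IN",
}


def resolve_language_code(language: str) -> str:
    """Map a language name to its code, failing early on unsupported input."""
    try:
        return SUPPORTED_LANGUAGE_CODES[language.strip().lower()]
    except KeyError:
        raise ValueError(f"unsupported language: {language!r}") from None


print(resolve_language_code("Hindi"))
```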

***

## 7. Understanding the Audio Output (Base64)

The TTS API returns audio data as a **base64-encoded string**. You must decode this string before saving or playing the audio file.

### REST API Response

The REST API returns a response with an `audios` field — an array of base64-encoded audio strings. You need to decode them:

```python
import base64
from sarvamai import SarvamAI

client = SarvamAI(api_subscription_key="YOUR_SARVAM_API_KEY")

audio = client.text_to_speech.convert(
    text="नमस्ते! Sarvam AI में आपका स्वागत है।",
    model="bulbul:v3",
    target_language_code="hi-IN",
    speaker="shubh"
)

# The response contains base64-encoded audio in the 'audios' field
# Combine all audio chunks and decode from base64
combined_audio = "".join(audio.audios)
audio_bytes = base64.b64decode(combined_audio)

with open("output.wav", "wb") as f:
    f.write(audio_bytes)
```

```javascript
import { SarvamAIClient } from "sarvamai";
import fs from "fs";

const client = new SarvamAIClient({
  apiSubscriptionKey: "YOUR_SARVAM_API_KEY"
});

const response = await client.textToSpeech.convert({
  text: "नमस्ते! Sarvam AI में आपका स्वागत है।",
  model: "bulbul:v3",
  target_language_code: "hi-IN",
  speaker: "shubh"
});

// The response contains base64-encoded audio
// Decode the base64 string to a Buffer and save
const audioBuffer = Buffer.from(response.audios.join(""), "base64");
fs.writeFileSync("output.wav", audioBuffer);
```

```bash
# The API returns JSON with a base64-encoded "audios" array
curl -X POST https://api.sarvam.ai/text-to-speech \
  -H "api-subscription-key: <YOUR_SARVAM_API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "नमस्ते! Sarvam AI में आपका स्वागत है।",
    "model": "bulbul:v3",
    "target_language_code": "hi-IN",
    "speaker": "shubh"
  }' | python3 -c "
import sys, json, base64
resp = json.load(sys.stdin)
audio = base64.b64decode(''.join(resp['audios']))
with open('output.wav', 'wb') as f:
    f.write(audio)
print('Saved output.wav')
"
```

### Streaming API Response

For the streaming (WebSocket) API, each chunk arrives as a base64-encoded audio string. Decode each chunk as it arrives:

```python
import asyncio
import base64
from sarvamai import AsyncSarvamAI, AudioOutput

async def tts_stream():
    client = AsyncSarvamAI(api_subscription_key="YOUR_SARVAM_API_KEY")

    async with client.text_to_speech_streaming.connect(model="bulbul:v3") as ws:
        await ws.configure(
            target_language_code="hi-IN",
            speaker="shubh"
        )

        await ws.convert("नमस्ते! Sarvam AI में आपका स्वागत है।")
        await ws.flush()

        with open("output.wav", "wb") as f:
            async for message in ws:
                if isinstance(message, AudioOutput):
                    # Each chunk is base64-encoded — decode before writing
                    audio_chunk = base64.b64decode(message.data.audio)
                    f.write(audio_chunk)

asyncio.run(tts_stream())
```

**Do not** write the raw base64 string directly to a file. The audio will be corrupted and unplayable. Always decode with `base64.b64decode()` (Python) or `Buffer.from(data, "base64")` (JavaScript) first.
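
To see why the raw string is not usable as audio, the following sketch round-trips some stand-in bytes through base64 using only the standard library. `fake_audio` is placeholder data rather than real API output, and `decode_audio_chunk` is a hypothetical helper.

```python
import base64


def decode_audio_chunk(b64_audio: str) -> bytes:
    """Decode one base64 string, rejecting input that is not valid base64."""
    return base64.b64decode(b64_audio, validate=True)


# Stand-in bytes for illustration only; a real response carries actual audio.
fake_audio = b"RIFF\x00\x00\x00\x00WAVE"
encoded = base64.b64encode(fake_audio).decode("ascii")

decoded = decode_audio_chunk(encoded)
print(decoded == fake_audio)                  # decoding restores the bytes
print(encoded.encode("ascii") == fake_audio)  # the raw text is different data
```

Passing `validate=True` makes malformed input raise an error instead of being silently skipped, which surfaces truncated or corrupted chunks early.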

***

## 8. Choosing the Right API Mode

Bulbul v3 supports two API modes. Choosing correctly has a significant impact on latency and user experience.

|                     | **REST API**                               | **WebSocket Streaming**                           |
| ------------------- | ------------------------------------------ | ------------------------------------------------- |
| **Endpoint**        | `/text-to-speech`                          | `wss://api.sarvam.ai/v1/text-to-speech/stream`    |
| **Character limit** | 2,500 chars/call                           | 2,500 chars/session                               |
| **Latency**         | Higher — full audio returned at once       | Low — audio chunks streamed in real time          |
| **Best for**        | Short/pre-known text, batch, notifications | Conversational agents, IVR, LLM voice output      |
| **Output**          | Complete audio file                        | Incremental audio chunks for progressive playback |
| **Integrations**    | Async pipelines, batch jobs                | LiveKit, Pipecat, custom WebSocket clients        |

**For voice agent pipelines (LLM → TTS), always use WebSocket streaming.** Playback can start as soon as the first chunk arrives, which makes the conversation feel far more responsive. REST adds a noticeable pause because the full audio is generated before anything is delivered.

***

## 9. Voice Parameter Tuning — Pace & Temperature

Two parameters give you fine-grained control over how Bulbul v3 sounds: `pace` and `temperature`. Getting them right is the most impactful tuning available to developers.

### Pace (Range: 0.5 – 2.0)

`pace` controls the speaking rate relative to the model's natural speed. `1.0` is native speed.

| Pace Value | Effect                      | Recommended Use Case                                     |
| ---------- | --------------------------- | -------------------------------------------------------- |
| 0.5 – 0.7  | Very slow, deliberate       | Accessibility tools, elderly users, pronunciation guides |
| 0.8 – 0.9  | Relaxed, measured           | EdTech narration, meditation/wellness apps, tutorials    |
| **1.0**    | **Natural speed (default)** | **Conversational agents, general-purpose TTS**           |
| 1.1        | Slightly brisk              | Notifications, news briefings, professional IVR          |
| 1.2 – 1.5  | Fast, energetic             | Quick summaries, high-engagement marketing audio         |
| 1.6 – 2.0  | Very fast                   | Screen readers, speed-listening (use with caution)       |

**Default recommendation:** Start at `1.0` (natural) or `1.1` (brisk, professional contexts). Avoid values above `1.5` unless specifically building speed-listening features.

### Temperature (Range: 0.01 – 1.0)

`temperature` controls expressiveness and prosodic variation. Lower values produce consistent, predictable delivery; higher values introduce more natural pitch variation and emotional colour.

| Temperature | Character                        | Recommended Use Case                                |
| ----------- | -------------------------------- | --------------------------------------------------- |
| 0.01 – 0.2  | Flat, highly consistent          | Screen readers, accessibility, compliance narration |
| 0.3 – 0.5   | Controlled, professional         | IVR menus, BFSI notifications, status updates       |
| 0.6         | Balanced; natural yet reliable   | Conversational agents, EdTech, general purpose      |
| 0.7 – 0.8   | Expressive, warm, conversational | Voice personas, companion apps, storytelling        |
| 0.9 – 1.0   | Highly expressive, variable      | Entertainment, creative content, character voices   |
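
Both ranges can be enforced in application code before a request is built. This is a minimal sketch: `voice_params` is a hypothetical helper, and its `temperature` default of `0.6` is taken from the "balanced" row above rather than from a documented API default.

```python
# Ranges as stated in this section.
PACE_RANGE = (0.5, 2.0)
TEMPERATURE_RANGE = (0.01, 1.0)


def voice_params(pace: float = 1.0, temperature: float = 0.6) -> dict:
    """Validate pace and temperature, returning them as keyword arguments."""
    for name, value, (low, high) in (
        ("pace", pace, PACE_RANGE),
        ("temperature", temperature, TEMPERATURE_RANGE),
    ):
        if not low <= value <= high:
            raise ValueError(f"{name}={value} is outside {low}-{high}")
    return {"pace": pace, "temperature": temperature}


print(voice_params(pace=1.1, temperature=0.4))
```

Raising on out-of-range values, rather than clamping them, keeps configuration mistakes visible during development.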

***

## 10. Speaker Selection by Language

Not all speakers perform equally across all languages. Always use the language-specific recommendations below rather than arbitrary speaker selection.

| Language  | Code    | Recommended Male | Recommended Female |
| --------- | ------- | ---------------- | ------------------ |
| English   | `en-IN` | ratan            | ishita             |
| Hindi     | `hi-IN` | shubh, ashutosh  | priya, suhani      |
| Telugu    | `te-IN` | shubh, ratan     | neha, priya        |
| Kannada   | `kn-IN` | shubh, ratan     | neha, ishita       |
| Bengali   | `bn-IN` | rehan            | roopa, suhani      |
| Tamil     | `ta-IN` | ratan, rohan     | ishita, ritu       |
| Odia      | `od-IN` | shubh            | ritu, pooja        |
| Malayalam | `ml-IN` | shubh            | pooja              |
| Marathi   | `mr-IN` | ratan            | priya, ritu        |
| Punjabi   | `pa-IN` | mani             | roopa, suhani      |
| Gujarati  | `gu-IN` | ratan            | priya, ritu        |

**Top picks:** `priya` & `ishita` (best female, excellent across Hindi, Telugu, Kannada, Tamil, Marathi, Gujarati, English) · `mani` (best male overall, Punjabi) · `shubh` (best male for hi, te, kn, od, ml) · `ratan` (best male for en, te, kn, ta, mr, gu).

**Varun** has a deep, dramatic villain/suspense character voice. He is **not** suitable as a neutral default. Reserve `varun` exclusively for thriller, drama, or suspense content.
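
One way to avoid arbitrary speaker selection is to keep the table in code and look speakers up by language. The sketch below copies the recommendations above; `pick_speaker` is a hypothetical helper that returns the first listed name, which is an arbitrary choice made for illustration.

```python
# Recommendations copied from the table in this section.
RECOMMENDED_SPEAKERS = {
    "en-IN": {"male": ["ratan"], "female": ["ishita"]},
    "hi-IN": {"male": ["shubh", "ashutosh"], "female": ["priya", "suhani"]},
    "te-IN": {"male": ["shubh", "ratan"], "female": ["neha", "priya"]},
    "kn-IN": {"male": ["shubh", "ratan"], "female": ["neha", "ishita"]},
    "bn-IN": {"male": ["rehan"], "female": ["roopa", "suhani"]},
    "ta-IN": {"male": ["ratan", "rohan"], "female": ["ishita", "ritu"]},
    "od-IN": {"male": ["shubh"], "female": ["ritu", "pooja"]},
    "ml-IN": {"male": ["shubh"], "female": ["pooja"]},
    "mr-IN": {"male": ["ratan"], "female": ["priya", "ritu"]},
    "pa-IN": {"male": ["mani"], "female": ["roopa", "suhani"]},
    "gu-IN": {"male": ["ratan"], "female": ["priya", "ritu"]},
}


def pick_speaker(language_code: str, gender: str = "female") -> str:
    """Return the first recommended speaker for a language and gender."""
    try:
        return RECOMMENDED_SPEAKERS[language_code][gender][0]
    except KeyError:
        raise ValueError(
            f"no recommendation for {language_code!r} / {gender!r}"
        ) from None


print(pick_speaker("ta-IN", "male"))
```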

***

## 11. Use-Case Quick Reference

Recommended parameter and speaker combinations for common production scenarios:

| Use Case              | Language     | Speaker(s)       | Pace | Temperature | Format | Sample Rate |
| --------------------- | ------------ | ---------------- | ---- | ----------- | ------ | ----------- |
| Voice Agent (chat)    | hi-IN        | priya / shubh    | 1.0  | 0.6         | PCM    | 16 kHz      |
| IVR / Telephony       | hi-IN, en-IN | ratan / ishita   | 1.1  | 0.4         | MULAW  | 8 kHz       |
| EdTech Narration      | hi-IN, ta-IN | shubh / ishita   | 0.9  | 0.6         | MP3    | 22 kHz      |
| BFSI Notification     | hi-IN, en-IN | ashutosh / ratan | 1.1  | 0.3         | MP3    | 22 kHz      |
| Wellness / Meditation | hi-IN        | priya / suhani   | 0.75 | 0.5         | MP3    | 24 kHz      |
| News Briefing         | hi-IN, en-IN | ratan / ishita   | 1.2  | 0.5         | MP3    | 22 kHz      |
| Storytelling          | hi-IN, bn-IN | shubh / roopa    | 0.9  | 0.8         | WAV    | 24 kHz      |
| Thriller / Suspense   | hi-IN        | varun            | 0.9  | 0.8         | MP3    | 24 kHz      |
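
The quick-reference rows can be kept as named presets so that each product surface uses a consistent configuration. This is a sketch only: `TtsPreset`, its field names, and the preset keys are hypothetical and do not correspond to API parameter names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TtsPreset:
    speakers: tuple[str, ...]
    pace: float
    temperature: float
    audio_format: str
    sample_rate_khz: int


# Values copied from the quick-reference table above.
PRESETS = {
    "voice_agent": TtsPreset(("priya", "shubh"), 1.0, 0.6, "PCM", 16),
    "ivr": TtsPreset(("ratan", "ishita"), 1.1, 0.4, "MULAW", 8),
    "edtech": TtsPreset(("shubh", "ishita"), 0.9, 0.6, "MP3", 22),
    "bfsi": TtsPreset(("ashutosh", "ratan"), 1.1, 0.3, "MP3", 22),
    "wellness": TtsPreset(("priya", "suhani"), 0.75, 0.5, "MP3", 24),
    "news": TtsPreset(("ratan", "ishita"), 1.2, 0.5, "MP3", 22),
    "storytelling": TtsPreset(("shubh", "roopa"), 0.9, 0.8, "WAV", 24),
    "thriller": TtsPreset(("varun",), 0.9, 0.8, "MP3", 24),
}

preset = PRESETS["ivr"]
print(preset.speakers[0], preset.pace, preset.temperature)
```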

***

## 12. Output Format Recommendations

| Format     | Best For                         | Notes                                           |
| ---------- | -------------------------------- | ----------------------------------------------- |
| `mp3`      | Web, mobile, content delivery    | Good compression; universal compatibility       |
| `wav`      | Post-processing, archival        | Lossless; larger file size                      |
| `aac`      | iOS apps, streaming              | Better quality than MP3 at same bitrate         |
| `opus`     | WebRTC, low-bandwidth streaming  | Excellent for real-time voice; very low latency |
| `flac`     | High-fidelity archival           | Lossless compression                            |
| `linear16` | Real-time playback, voice agents | Raw samples; lowest overhead for streaming      |
| `mulaw`    | PSTN telephony, legacy IVR       | 8 kHz; standard G.711 telephony codec           |
| `alaw`     | European telephony systems       | 8 kHz; G.711 A-law variant                      |

For real-time WebSocket voice agents, use `linear16` (PCM) at 16 kHz — lowest decode overhead, integrates directly with LiveKit, Pipecat, and most audio buffers. Use `mulaw` at 8 kHz for telephony/IVR.

***

## 13. Known Limitations

| Limitation                        | Detail                                                                                                                   | Workaround                                                                                                                |
| --------------------------------- | ------------------------------------------------------------------------------------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------- |
| **Character limits**              | REST: 2,500 chars/call · WebSocket: 2,500 chars/session                                                                  | Chunk long texts at sentence boundaries before sending                                                                    |
| **Script input (critical)**       | Romanised/transliterated Indic input significantly degrades output quality — this is the most common integration mistake | Always use native script for Indic words (e.g., `"आपका order confirm हो गया है"` not `"Aapka order confirm ho gaya hai"`) |
| **No SSML support**               | Bulbul v3 does not support SSML tags for fine-grained prosody control                                                    | Use `pace` and `temperature` for coarse control; split text at natural pause points for rhythm                            |
| **Speaker–language fit**          | Not all speakers perform equally across all 11 languages                                                                 | Always use the language-specific recommended speakers from Section 10                                                     |
| **High sample rates (REST only)** | 32 kHz, 44.1 kHz, and 48 kHz are available via REST API only                                                             | Use ≤ 24 kHz for WebSocket streaming                                                                                      |
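
The character-limit workaround above (chunking at sentence boundaries) might look like the following. It is a sketch under stated assumptions: `chunk_text` is a hypothetical helper, sentence ends are detected only by `।`, `.`, `!`, and `?`, and sentences are rejoined with single spaces, so paragraph line breaks are not preserved.

```python
import re

MAX_CHARS = 2500  # limit quoted in this guide for REST calls and WebSocket sessions


def chunk_text(text: str, limit: int = MAX_CHARS) -> list[str]:
    """Greedily pack whole sentences into chunks of at most `limit` characters."""
    sentences = re.split(r"(?<=[।.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if len(sentence) > limit:
            raise ValueError("a single sentence exceeds the limit; shorten it first")
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= limit:
            current = candidate       # the sentence still fits in this chunk
        else:
            chunks.append(current)    # close the chunk and start a new one
            current = sentence
    if current:
        chunks.append(current)
    return chunks


for part in chunk_text("पहला वाक्य। Second sentence. तीसरा!", limit=20):
    print(part)
```

Each returned chunk can then be sent as its own request, and the resulting audio played or stored in order.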

***

## 14. Key Considerations

* For numbers with five or more digits, use commas (e.g., `10,000` instead of `10000`) for correct pronunciation.
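
Comma insertion can be done as a preprocessing step. The sketch below is an assumption-laden illustration: `add_thousands_separators` is a hypothetical helper, it uses international three-digit grouping because this guide does not say whether Indian grouping (for example `1,00,000`) is preferred, and it would also reformat long identifiers such as phone numbers, which should be excluded beforehand.

```python
import re


def add_thousands_separators(text: str) -> str:
    """Insert commas into bare ASCII integers of five or more digits."""

    def _format(match: re.Match) -> str:
        digits = match.group(0)
        # Leave values with a leading zero alone (likely codes, not quantities).
        if digits.startswith("0"):
            return digits
        return f"{int(digits):,}"

    # The lookbehind skips digits that follow a digit, comma, or decimal point.
    return re.sub(r"(?<![0-9.,])[0-9]{5,}", _format, text)


print(add_thousands_separators("कीमत 10000 रुपये है"))
```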