Best Practices for Writing Text for TTS
A guide to writing text that produces natural-sounding speech output. Covers both text-formatting tips and model configuration recommendations for production use.
1. Punctuation for Pauses
Tip: Use … (ellipsis) to create a hesitation or trailing-off effect — it signals the speaker is thinking or pausing mid-thought. Use sparingly for natural results.
Tip: Use line breaks between paragraphs for natural breathing pauses:
2. Fillers & Hesitations for Natural Speech
Add fillers and hesitation markers to make speech sound conversational:
Combining fillers with ellipsis for natural hesitation:
3. Code-Mixing (Hinglish)
For natural Indian speech, mix English words where they’re commonly used. This is how most urban Indians speak — the model handles it well.
Rule: Write English words in English script, Hindi words in Devanagari:
- ✅ “Sarvam AI में आपका स्वागत है”
- ❌ “सरवम एआई में आपका स्वागत है”
Common code-mixed categories:
Full code-mixed examples:
Keep Hindi sentence structure, swap key nouns/verbs with English:
- “हर Indian अपनी mother tongue में technology use कर सके”
- “आज का weather actually बहुत pleasant है”
- “यह app basically आपकी daily life को simple बना देगा”
4. Avoid These
5. Language-Specific Tips
Sentence-ending punctuation
- If a sentence ends in Hindi or a regional language, use ।: "हमारी technology सबको समझती है।"
- If a sentence ends in English, use .: "प्लान simple है, just execute."
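The punctuation rule can be applied automatically before text is sent to TTS. The sketch below is a minimal illustration, not part of any SDK: `terminator_for` and `fix_terminator` are hypothetical helper names, and the heuristic assumes that a sentence whose last letter is ASCII ends in English and that anything else ends in Hindi or a regional language.

```python
DANDA = "।"


def terminator_for(sentence: str) -> str:
    """Pick "." when the last letter is ASCII (English), otherwise the danda."""
    for ch in reversed(sentence):
        # Vowel signs (matras) are not letters, so they are skipped here
        # and the base consonant before them decides the script.
        if ch.isalpha():
            return "." if ch.isascii() else DANDA
    return "."  # assumption: no letters at all, fall back to a full stop


def fix_terminator(sentence: str) -> str:
    """Replace or add the sentence-ending mark according to the last word's script."""
    text = sentence.rstrip()
    # Leave questions, exclamations, and trailing-off ellipses untouched.
    if text.endswith(("?", "!", "…", "...")):
        return text
    if text.endswith((".", DANDA)):
        text = text[:-1].rstrip()
    return text + terminator_for(text)
```

Calling `fix_terminator` on each sentence of a script corrects a full stop after a Hindi ending and a danda after an English ending, and leaves sentences that are already correct unchanged.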
Writing Conventions
- Write language names in English: Tamil, Telugu, Bengali (not तमिल, तेलुगु)
- Keep brand names in English: Sarvam AI, Google, WhatsApp
6. Target Language Code
The target_language_code parameter is required for every TTS request. It mainly affects language-specific processing of numbers, abbreviations, and special characters.
Supported Languages
Example
If your text contains mixed languages (e.g., Hinglish), set target_language_code to the language in which you want entities such as numbers to be spoken.
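As a rough illustration, a request body for Hinglish text might be assembled as follows. This is a sketch under assumptions: only `target_language_code` is named in this guide, so the `text` field name, the `hi-IN` code, and the helper `build_tts_payload` are hypothetical and should be checked against the API reference.

```python
def build_tts_payload(text: str, target_language_code: str) -> dict:
    """Build a TTS request body; the field name "text" is an assumption."""
    if not target_language_code:
        # The guide states this parameter is required for every request.
        raise ValueError("target_language_code is required for every TTS request")
    return {"text": text, "target_language_code": target_language_code}


# Hinglish text containing a number: the language code (assumed here to be
# "hi-IN") selects the language in which entities such as "3" are read out.
payload = build_tts_payload("आपका order 3 items का है।", "hi-IN")
```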
7. Understanding the Audio Output (Base64)
The TTS API returns audio data as a base64-encoded string. You must decode this string before saving or playing the audio file.
REST API Response
The REST API returns a response with an audios field — an array of base64-encoded audio strings. You need to decode them:
Python
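A minimal sketch of the decoding step, assuming the JSON response has already been parsed into a dict that contains the `audios` array described above. The helper names, the `output_<n>` file naming, and the `wav` extension are illustrative assumptions; use the extension that matches the output format you requested.

```python
import base64
from pathlib import Path


def decode_audios(response: dict) -> list[bytes]:
    """Decode every base64 string in the response's `audios` array to raw bytes."""
    return [base64.b64decode(item) for item in response.get("audios", [])]


def save_audios(response: dict, out_dir: str, ext: str = "wav") -> list[Path]:
    """Write each decoded clip to out_dir and return the file paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, audio in enumerate(decode_audios(response)):
        path = out / f"output_{index}.{ext}"
        path.write_bytes(audio)  # decoded bytes, never the base64 text itself
        paths.append(path)
    return paths
```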
JavaScript
cURL
Streaming API Response
For the streaming (WebSocket) API, each chunk arrives as a base64-encoded audio string. Decode each chunk as it arrives:
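The sketch below shows one way to handle streamed chunks, assuming the base64 strings have already been extracted from the WebSocket messages (the message envelope is not covered here, and the helper names are hypothetical). Each chunk is decoded on its own; joining the base64 strings first and decoding once is unsafe because padding characters can appear in the middle of the joined text.

```python
import base64
from typing import Iterable, Iterator


def decode_chunks(chunks: Iterable[str]) -> Iterator[bytes]:
    """Yield raw audio bytes for each base64 chunk as soon as it arrives."""
    for chunk in chunks:
        yield base64.b64decode(chunk)


def collect_audio(chunks: Iterable[str]) -> bytes:
    """Concatenate the decoded chunks into one byte string, e.g. for saving."""
    return b"".join(decode_chunks(chunks))
```

In a voice agent, the bytes yielded by `decode_chunks` would typically be pushed straight to the playback buffer, while `collect_audio` suits cases where the full clip is written to disk afterwards.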
Do not write the raw base64 string directly to a file. The audio will be corrupted and unplayable. Always decode with base64.b64decode() (Python) or Buffer.from(data, "base64") (JavaScript) first.
8. Choosing the Right API Mode
Bulbul v3 supports two API modes. Choosing correctly has a significant impact on latency and user experience.
For voice agent pipelines (LLM → TTS), always use WebSocket streaming. Playback can start within milliseconds of the first chunk arriving, which makes the conversation feel far more responsive. REST adds a noticeable pause because the full audio is generated before delivery.
9. Voice Parameter Tuning — Pace & Temperature
Two parameters give you fine-grained control over how Bulbul v3 sounds. Getting these right is the single most impactful tuning lever available to developers.
Pace (Range: 0.5 – 2.0)
pace controls the speaking rate relative to the model’s natural speed. 1.0 is native speed.
Default recommendation: Start at 1.0 (natural) or 1.1 (brisk, professional contexts). Avoid values above 1.5 unless specifically building speed-listening features.
Temperature (Range: 0.01 – 1.0)
temperature controls expressiveness and prosodic variation. Lower values produce consistent, predictable delivery; higher values introduce more natural pitch variation and emotional colour.
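Because both parameters have fixed ranges, it can help to validate them before building a request. The following is a sketch only: `voice_settings` is a hypothetical helper, and clamping out-of-range values (rather than rejecting them) is a design choice made here for illustration, not documented API behaviour.

```python
PACE_RANGE = (0.5, 2.0)          # range stated in this guide
TEMPERATURE_RANGE = (0.01, 1.0)  # range stated in this guide


def clamp(value: float, low: float, high: float) -> float:
    """Limit value to the closed interval [low, high]."""
    return max(low, min(high, value))


def voice_settings(pace: float, temperature: float) -> dict:
    """Return pace and temperature clamped to their supported ranges."""
    return {
        "pace": clamp(pace, *PACE_RANGE),
        "temperature": clamp(temperature, *TEMPERATURE_RANGE),
    }
```

Raising an error instead of clamping may be preferable in production, so that a misconfigured value is noticed rather than silently adjusted.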
10. Speaker Selection by Language
Not all speakers perform equally across all languages. Always use the language-specific recommendations below rather than arbitrary speaker selection.
Top picks: priya & ishita (best female, excellent across Hindi, Telugu, Kannada, Tamil, Marathi, Gujarati, English) · mani (best male overall, Punjabi) · shubh (best male for hi, te, kn, od, ml) · ratan (best male for en, te, kn, ta, mr, gu).
varun has a deep, dramatic voice suited to villain or suspense characters. It is not suitable as a neutral default; reserve it exclusively for thriller, drama, or suspense content.
11. Use-Case Quick Reference
Recommended parameter and speaker combinations for common production scenarios:
12. Output Format Recommendations
For real-time WebSocket voice agents, use linear16 (PCM) at 16 kHz — lowest decode overhead, integrates directly with LiveKit, Pipecat, and most audio buffers. Use mulaw at 8 kHz for telephony/IVR.
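The two recommendations can be kept in a small lookup that also gives the data rate, which is useful when sizing audio buffers. This is an illustrative sketch: the use-case keys and the `OutputFormat` type are hypothetical, and the byte rates assume mono audio (2 bytes per sample for 16-bit PCM, 1 byte per sample for mu-law).

```python
from typing import NamedTuple


class OutputFormat(NamedTuple):
    codec: str
    sample_rate_hz: int
    bytes_per_sample: int  # per channel; mono is assumed

    @property
    def bytes_per_second(self) -> int:
        """Raw data rate of the stream for a single channel."""
        return self.sample_rate_hz * self.bytes_per_sample


# Keys are hypothetical labels for the two scenarios named in this guide.
FORMATS = {
    "realtime_agent": OutputFormat("linear16", 16000, 2),
    "telephony": OutputFormat("mulaw", 8000, 1),
}
```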
13. Known Limitations
14. Key Considerations
- For numbers with more than 4 digits, use commas (e.g., 10,000 instead of 10000) for correct pronunciation.
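Comma insertion can be automated as a preprocessing step. The sketch below is one possible approach with a hypothetical helper name; it assumes international three-digit grouping and skips numbers with leading zeros and the fractional part of decimals. It cannot tell a quantity from a phone number or ID, so exclude those before applying it.

```python
import re

# Integers of 5+ digits that do not start with 0 and are not preceded by
# a digit, comma, or decimal point. re.ASCII restricts \d to 0-9.
_LONG_NUMBER = re.compile(r"(?<![\d.,])[1-9]\d{4,}(?!\d)", re.ASCII)


def add_thousands_commas(text: str) -> str:
    """Insert commas into plain integers that have more than 4 digits."""
    return _LONG_NUMBER.sub(lambda m: f"{int(m.group()):,}", text)
```

For example, `add_thousands_commas("Pay 10000 rupees")` returns `"Pay 10,000 rupees"`, while four-digit values such as `1234` are left as they are.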