Best Practices for Writing Text for TTS

A guide to writing text that produces natural-sounding speech output. Covers both text-formatting tips and model configuration recommendations for production use.


1. Punctuation for Pauses

| Punctuation | Effect | Example |
|---|---|---|
| , (comma) | Short pause | "हाँ, मैं समझ गया" |
| . (full stop) | Medium pause, sentence end | "यह Very good है।" |
| ! (exclamation) | Emphasis + pause | "नमस्ते!" |
| … (ellipsis) | Hesitation / trailing off | "मुझे लगता है… शायद हम try कर सकते हैं" |
| Line break | Natural pause between paragraphs | See below |

Tip: Use … (ellipsis) to create a hesitation or trailing-off effect; it signals that the speaker is thinking or pausing mid-thought. Use it sparingly for natural results.

Tip: Use line breaks between paragraphs for natural breathing pauses:

हमारी technology सबको समझती है।
हमारा mission है कि हर Indian अपनी mother tongue में technology use कर सके।

2. Fillers & Hesitations for Natural Speech

Add fillers and hesitation markers to make speech sound conversational:

| Filler | Effect | Example |
|---|---|---|
| um | Thinking pause | "चाहे आप um Hindi बोलते हों" |
| uh | Short hesitation | "uh, मुझे एक second दो" |
| hmm | Contemplation | "hmm, यह interesting है" |
| like... | Casual filler | "या like… कोई भी Indian language" |
| basically... | Starting explanation | "So basically… हम India की हर language को voice देते हैं" |
| actually... | Adding emphasis | "हमारी technology actually… सबको समझती है" |
| you know... | Conversational connector | "you know… यह बहुत simple है" |
| I mean... | Self-correction | "I mean… दूसरा option भी है" |

Combining fillers with ellipsis for natural hesitation:

So basically… हमारा goal है कि um हर Indian language को support करें।
I mean... यह easy नहीं है... but we're getting there.

3. Code-Mixing (Hinglish)

For natural Indian speech, mix English words where they’re commonly used. This is how most urban Indians speak — the model handles it well.

Rule: Write English words in English script, Hindi words in Devanagari:

  • ✅ “Sarvam AI में आपका स्वागत है”
  • ❌ “सरवम एआई में आपका स्वागत है”

Common code-mixed categories:

| Category | Examples |
|---|---|
| Tech terms | technology, app, website, download, update, AI |
| Everyday words | basically, actually, like, amazing, simple |
| Social expressions | thank you, sorry, please, welcome |
| Business | meeting, deadline, budget, report, feedback |

Full code-mixed examples:

So basically... हम India की हर language को voice देते हैं।
चाहे आप um Hindi बोलते हों, Tamil, Telugu, Bengali या like... कोई भी Indian language।
अगर आपको कोई doubt है तो please हमें contact करें।
Meeting actually postpone हो गई है, I mean... tomorrow रखते हैं।

Keep the Hindi sentence structure and swap key nouns/verbs for their English equivalents:

  • “हर Indian अपनी mother tongue में technology use कर सके”
  • “आज का weather actually बहुत pleasant है”
  • “यह app basically आपकी daily life को simple बना देगा”

4. Avoid These

| Avoid | Why | Fix |
|---|---|---|
| Overusing ... | Too many ellipses sound choppy | Use sparingly for hesitation; prefer , or line breaks for regular pauses |
| Complex Sanskrit words | May be mispronounced | Use simpler Hindi |
| Very long sentences | Unnatural breathing | Break into shorter sentences |
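
The checks in this table can be automated before text is sent for synthesis. The following is a minimal sketch, not part of any SDK: `lint_tts_text` and both thresholds are hypothetical, since the guide gives no numeric limits, so tune them by ear.

```python
import re

# Hypothetical thresholds; the guide does not define numeric limits.
MAX_WORDS_PER_SENTENCE = 25
MAX_ELLIPSES_PER_100_WORDS = 3

_ELLIPSIS = re.compile(r"…|\.{3}")
# Split after ।, ! or ?, or after a single full stop (not the end of "...").
_SENTENCE_SPLIT = re.compile(r"(?<=[।!?])\s+|(?<=[^.]\.)\s+")


def lint_tts_text(text: str) -> list[str]:
    """Return human-readable warnings for ellipsis overuse and long sentences."""
    warnings = []
    words = text.split()
    ellipses = len(_ELLIPSIS.findall(text))
    if words and ellipses * 100 / len(words) > MAX_ELLIPSES_PER_100_WORDS:
        warnings.append(
            f"{ellipses} ellipses in {len(words)} words; prefer commas or line breaks"
        )
    for sentence in _SENTENCE_SPLIT.split(text):
        word_count = len(sentence.split())
        if word_count > MAX_WORDS_PER_SENTENCE:
            warnings.append(f"sentence has {word_count} words; consider splitting it")
    return warnings
```

An empty list means the text passed both checks. The sketch does not try to detect complex Sanskrit vocabulary, which needs human review.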

5. Language-Specific Tips

Sentence-ending punctuation

  • If a sentence ends in Hindi or a regional language, use । : "हमारी technology सबको समझती है।"
  • If a sentence ends in English, use . : "प्लान simple है, just execute."
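
The rule above can be applied mechanically by checking the script of the last word. This is a rough sketch under one assumption: any non-ASCII final letter is treated as Hindi or a regional language. `closing_punctuation` and `terminate` are hypothetical helper names.

```python
import unicodedata

DANDA = "\u0964"  # the "।" character


def closing_punctuation(sentence: str) -> str:
    """Pick "।" if the sentence ends in an Indic script, otherwise "."."""
    for ch in reversed(sentence.rstrip()):
        # Letters (L), combining vowel signs (M) and digits (N) count as text;
        # spaces and punctuation are skipped.
        if unicodedata.category(ch)[0] in ("L", "M", "N"):
            return "." if ch.isascii() else DANDA
    return "."


def terminate(sentence: str) -> str:
    """Append the matching terminator unless the sentence already has one."""
    stripped = sentence.rstrip()
    if stripped.endswith((".", DANDA, "!", "?", "…")):
        return stripped
    return stripped + closing_punctuation(stripped)
```

For example, `terminate("हमारी technology सबको समझती है")` appends "।", while `terminate("प्लान simple है, just execute")` appends ".".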

Writing Conventions

  • Write language names in English: Tamil, Telugu, Bengali (not तमिल, तेलुगु)
  • Keep brand names in English: Sarvam AI, Google, WhatsApp

6. Target Language Code

The target_language_code parameter is required for every TTS request. It mainly controls language-specific processing of numbers, abbreviations, and special characters.

Supported Languages

| Language | Code |
|---|---|
| English | en-IN |
| Hindi | hi-IN |
| Bengali | bn-IN |
| Tamil | ta-IN |
| Telugu | te-IN |
| Kannada | kn-IN |
| Malayalam | ml-IN |
| Marathi | mr-IN |
| Gujarati | gu-IN |
| Punjabi | pa-IN |
| Odia | od-IN |

Example

from sarvamai import SarvamAI

client = SarvamAI(api_subscription_key="YOUR_SARVAM_API_KEY")

audio = client.text_to_speech.convert(
    text="नमस्ते! Sarvam AI में आपका स्वागत है।",
    model="bulbul:v3",
    target_language_code="hi-IN",
    speaker="shubh"
)

If your text mixes languages (e.g., Hinglish), set target_language_code to the language in which entities such as numbers should be spoken.


7. Understanding the Audio Output (Base64)

The TTS API returns audio data as a base64-encoded string. You must decode this string before saving or playing the audio file.

REST API Response

The REST API returns a response with an audios field — an array of base64-encoded audio strings. You need to decode them:

import base64
from sarvamai import SarvamAI

client = SarvamAI(api_subscription_key="YOUR_SARVAM_API_KEY")

audio = client.text_to_speech.convert(
    text="नमस्ते! Sarvam AI में आपका स्वागत है।",
    model="bulbul:v3",
    target_language_code="hi-IN",
    speaker="shubh"
)

# The response contains base64-encoded audio in the 'audios' field
# Combine all audio chunks and decode from base64
combined_audio = "".join(audio.audios)
audio_bytes = base64.b64decode(combined_audio)

with open("output.wav", "wb") as f:
    f.write(audio_bytes)

Streaming API Response

For the streaming (WebSocket) API, each chunk arrives as a base64-encoded audio string. Decode each chunk as it arrives:

import asyncio
import base64
from sarvamai import AsyncSarvamAI, AudioOutput

async def tts_stream():
    client = AsyncSarvamAI(api_subscription_key="YOUR_SARVAM_API_KEY")

    async with client.text_to_speech_streaming.connect(model="bulbul:v3") as ws:
        await ws.configure(
            target_language_code="hi-IN",
            speaker="shubh"
        )

        await ws.convert("नमस्ते! Sarvam AI में आपका स्वागत है।")
        await ws.flush()

        with open("output.wav", "wb") as f:
            async for message in ws:
                if isinstance(message, AudioOutput):
                    # Each chunk is base64-encoded; decode before writing
                    audio_chunk = base64.b64decode(message.data.audio)
                    f.write(audio_chunk)

asyncio.run(tts_stream())

Do not write the raw base64 string directly to a file. The audio will be corrupted and unplayable. Always decode with base64.b64decode() (Python) or Buffer.from(data, "base64") (JavaScript) first.
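
When a response carries several chunks, one cautious variant is to decode each chunk on its own and concatenate the resulting bytes. Padding characters (`=`) can then only appear at the end of the string being decoded. This is a sketch using only the standard library. `save_base64_audio` is a hypothetical helper, not an SDK function, and it assumes the decoded chunks can simply be appended to one another.

```python
import base64
from pathlib import Path


def save_base64_audio(chunks: list[str], path: str) -> int:
    """Decode each base64 chunk separately and write the bytes to `path`.

    Returns the number of bytes written.
    """
    # Decoding per chunk keeps any "=" padding at the end of each input,
    # where the decoder expects it.
    audio_bytes = b"".join(base64.b64decode(chunk) for chunk in chunks)
    Path(path).write_bytes(audio_bytes)
    return len(audio_bytes)
```

With the REST response shown earlier, this would be called as `save_base64_audio(audio.audios, "output.wav")`.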



8. Choosing the Right API Mode

Bulbul v3 supports two API modes. Choosing correctly has a significant impact on latency and user experience.

| | REST API | WebSocket Streaming |
|---|---|---|
| Endpoint | /text-to-speech | wss://api.sarvam.ai/v1/text-to-speech/stream |
| Character limit | 2,500 chars/call | 2,500 chars/session |
| Latency | Higher; full audio returned at once | Low; audio chunks streamed in real time |
| Best for | Short/pre-known text, batch, notifications | Conversational agents, IVR, LLM voice output |
| Output | Complete audio file | Incremental audio chunks for progressive playback |
| Integrations | Async pipelines, batch jobs | LiveKit, Pipecat, custom WebSocket clients |

For voice agent pipelines (LLM → TTS), always use WebSocket streaming. The user perceives audio starting within milliseconds of the first chunk, dramatically improving conversational feel. REST adds a noticeable pause as full audio is generated before delivery.


9. Voice Parameter Tuning — Pace & Temperature

Two parameters give you fine-grained control over how Bulbul v3 sounds. Tuning them well is the most impactful lever available to developers.

Pace (Range: 0.5 – 2.0)

pace controls the speaking rate relative to the model’s natural speed. 1.0 is native speed.

| Pace Value | Effect | Recommended Use Case |
|---|---|---|
| 0.5 – 0.7 | Very slow, deliberate | Accessibility tools, elderly users, pronunciation guides |
| 0.8 – 0.9 | Relaxed, measured | EdTech narration, meditation/wellness apps, tutorials |
| 1.0 | Natural speed (default) | Conversational agents, general-purpose TTS |
| 1.1 | Slightly brisk | Notifications, news briefings, professional IVR |
| 1.2 – 1.5 | Fast, energetic | Quick summaries, high-engagement marketing audio |
| 1.6 – 2.0 | Very fast | Screen readers, speed-listening (use with caution) |

Default recommendation: Start at 1.0 (natural) or 1.1 (brisk, professional contexts). Avoid values above 1.5 unless specifically building speed-listening features.

Temperature (Range: 0.01 – 1.0)

temperature controls expressiveness and prosodic variation. Lower values produce consistent, predictable delivery; higher values introduce more natural pitch variation and emotional colour.

| Temperature | Character | Recommended Use Case |
|---|---|---|
| 0.01 – 0.2 | Flat, highly consistent | Screen readers, accessibility, compliance narration |
| 0.3 – 0.5 | Controlled, professional | IVR menus, BFSI notifications, status updates |
| 0.6 | Balanced; natural yet reliable | Conversational agents, EdTech, general purpose |
| 0.7 – 0.8 | Expressive, warm, conversational | Voice personas, companion apps, storytelling |
| 0.9 – 1.0 | Highly expressive, variable | Entertainment, creative content, character voices |
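
Because both parameters have fixed ranges, it can help to validate them before a request is made. The sketch below is illustrative only: `VoiceSettings` is a hypothetical class, and its starting values (1.0 and 0.6) are taken from the general-purpose rows of the two tables.

```python
from dataclasses import dataclass

PACE_RANGE = (0.5, 2.0)          # from the Pace section
TEMPERATURE_RANGE = (0.01, 1.0)  # from the Temperature section


def _check(name: str, value: float, bounds: tuple[float, float]) -> None:
    low, high = bounds
    if not low <= value <= high:
        raise ValueError(f"{name}={value} is outside the range {low} to {high}")


@dataclass(frozen=True)
class VoiceSettings:
    pace: float = 1.0
    temperature: float = 0.6

    def __post_init__(self) -> None:
        _check("pace", self.pace, PACE_RANGE)
        _check("temperature", self.temperature, TEMPERATURE_RANGE)

    def as_kwargs(self) -> dict[str, float]:
        """Return the settings as keyword arguments for a TTS call."""
        return {"pace": self.pace, "temperature": self.temperature}
```

Assuming the SDK accepts `pace` and `temperature` as keyword arguments, the result could be passed along as `client.text_to_speech.convert(..., **VoiceSettings(pace=1.1, temperature=0.4).as_kwargs())`.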

10. Speaker Selection by Language

Not all speakers perform equally across all languages. Always use the language-specific recommendations below rather than arbitrary speaker selection.

| Language | Code | Recommended Male | Recommended Female |
|---|---|---|---|
| English | en-IN | ratan | ishita |
| Hindi | hi-IN | shubh, ashutosh | priya, suhani |
| Telugu | te-IN | shubh, ratan | neha, priya |
| Kannada | kn-IN | shubh, ratan | neha, ishita |
| Bengali | bn-IN | rehan | roopa, suhani |
| Tamil | ta-IN | ratan, rohan | ishita, ritu |
| Odia | od-IN | shubh | ritu, pooja |
| Malayalam | ml-IN | shubh | pooja |
| Marathi | mr-IN | ratan | priya, ritu |
| Punjabi | pa-IN | mani | roopa, suhani |
| Gujarati | gu-IN | ratan | priya, ritu |

Top picks: priya & ishita (best female, excellent across Hindi, Telugu, Kannada, Tamil, Marathi, Gujarati, English) · mani (best male overall, Punjabi) · shubh (best male for hi, te, kn, od, ml) · ratan (best male for en, te, kn, ta, mr, gu).

Varun has a deep, dramatic villain/suspense character voice. He is not suitable as a neutral default. Reserve varun exclusively for thriller, drama, or suspense content.
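
One way to avoid arbitrary speaker selection is to keep the recommendations as a lookup table in code. This is a minimal sketch: `RECOMMENDED_SPEAKERS` and `pick_speaker` are hypothetical names, and the data is copied from the table in this section.

```python
RECOMMENDED_SPEAKERS = {
    "en-IN": {"male": ["ratan"], "female": ["ishita"]},
    "hi-IN": {"male": ["shubh", "ashutosh"], "female": ["priya", "suhani"]},
    "te-IN": {"male": ["shubh", "ratan"], "female": ["neha", "priya"]},
    "kn-IN": {"male": ["shubh", "ratan"], "female": ["neha", "ishita"]},
    "bn-IN": {"male": ["rehan"], "female": ["roopa", "suhani"]},
    "ta-IN": {"male": ["ratan", "rohan"], "female": ["ishita", "ritu"]},
    "od-IN": {"male": ["shubh"], "female": ["ritu", "pooja"]},
    "ml-IN": {"male": ["shubh"], "female": ["pooja"]},
    "mr-IN": {"male": ["ratan"], "female": ["priya", "ritu"]},
    "pa-IN": {"male": ["mani"], "female": ["roopa", "suhani"]},
    "gu-IN": {"male": ["ratan"], "female": ["priya", "ritu"]},
}


def pick_speaker(language_code: str, gender: str = "female") -> str:
    """Return the first recommended speaker for a language code and gender."""
    try:
        return RECOMMENDED_SPEAKERS[language_code][gender][0]
    except KeyError:
        raise ValueError(
            f"no recommendation for {language_code!r} / {gender!r}"
        ) from None
```

`varun` is deliberately left out of the lookup, so it has to be requested explicitly for thriller or suspense content.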


11. Use-Case Quick Reference

Recommended parameter and speaker combinations for common production scenarios:

| Use Case | Language | Speaker(s) | Pace | Temperature | Format | Sample Rate |
|---|---|---|---|---|---|---|
| Voice Agent (chat) | hi-IN | priya / shubh | 1.0 | 0.6 | PCM | 16 kHz |
| IVR / Telephony | hi-IN, en-IN | ratan / ishita | 1.1 | 0.4 | MULAW | 8 kHz |
| EdTech Narration | hi-IN, ta-IN | shubh / ishita | 0.9 | 0.6 | MP3 | 22 kHz |
| BFSI Notification | hi-IN, en-IN | ashutosh / ratan | 1.1 | 0.3 | MP3 | 22 kHz |
| Wellness / Meditation | hi-IN | priya / suhani | 0.75 | 0.5 | MP3 | 24 kHz |
| News Briefing | hi-IN, en-IN | ratan / ishita | 1.2 | 0.5 | MP3 | 22 kHz |
| Storytelling | hi-IN, bn-IN | shubh / roopa | 0.9 | 0.8 | WAV | 24 kHz |
| Thriller / Suspense | hi-IN | varun | 0.9 | 0.8 | MP3 | 24 kHz |

12. Output Format Recommendations

| Format | Best For | Notes |
|---|---|---|
| mp3 | Web, mobile, content delivery | Good compression; universal compatibility |
| wav | Post-processing, archival | Lossless; larger file size |
| aac | iOS apps, streaming | Better quality than MP3 at same bitrate |
| opus | WebRTC, low-bandwidth streaming | Excellent for real-time voice; very low latency |
| flac | High-fidelity archival | Lossless compression |
| linear16 | Real-time playback, voice agents | Raw samples; lowest overhead for streaming |
| mulaw | PSTN telephony, legacy IVR | 8 kHz; standard G.711 telephony codec |
| alaw | European telephony systems | 8 kHz; G.711 A-law variant |

For real-time WebSocket voice agents, use linear16 (PCM) at 16 kHz — lowest decode overhead, integrates directly with LiveKit, Pipecat, and most audio buffers. Use mulaw at 8 kHz for telephony/IVR.
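
Raw linear16 samples carry no header, so most media players cannot open them as a file. If streamed PCM needs to be saved for debugging, it can be wrapped in a WAV container with Python's standard `wave` module. The sketch assumes the samples are 16-bit, little-endian, and mono, which this guide does not state explicitly. `pcm16_to_wav` is a hypothetical helper name.

```python
import io
import wave


def pcm16_to_wav(pcm: bytes, sample_rate: int = 16000, channels: int = 1) -> bytes:
    """Wrap raw 16-bit PCM samples in a WAV container and return the file bytes."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)            # 16-bit samples are 2 bytes wide
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buffer.getvalue()
```

The default `sample_rate` of 16000 matches the 16 kHz recommendation above; pass the rate actually requested from the API if it differs.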


13. Known Limitations

| Limitation | Detail | Workaround |
|---|---|---|
| Character limits | REST: 2,500 chars/call · WebSocket: 2,500 chars/session | Chunk long texts at sentence boundaries before sending |
| Script input (critical) | Romanised/transliterated Indic input significantly degrades output quality; this is the most common integration mistake | Always use native script for Indic words (e.g., "आपका order confirm हो गया है" not "Aapka order confirm ho gaya hai") |
| No SSML support | Bulbul v3 does not support SSML tags for fine-grained prosody control | Use pace and temperature for coarse control; split text at natural pause points for rhythm |
| Speaker–language fit | Not all speakers perform equally across all 11 languages | Always use the language-specific recommended speakers from Section 10 |
| High sample rates (REST only) | 32 kHz, 44.1 kHz, and 48 kHz are available via REST API only | Use ≤ 24 kHz for WebSocket streaming |
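
The chunking workaround for the character limit can be sketched as follows. `chunk_text` is a hypothetical helper. It assumes the limit is counted in Unicode code points (Python's `len`), which may not match how the API counts, so leave some headroom in practice.

```python
import re

MAX_CHARS = 2500  # limit stated in the table above

# Split on whitespace that follows a sentence-ending mark.
_SENTENCE_BOUNDARY = re.compile(r"(?<=[।.!?])\s+")


def chunk_text(text: str, max_chars: int = MAX_CHARS) -> list[str]:
    """Group whole sentences into chunks no longer than `max_chars`."""
    chunks: list[str] = []
    current = ""
    for sentence in _SENTENCE_BOUNDARY.split(text.strip()):
        if not sentence:
            continue
        if len(sentence) > max_chars:
            # A single oversized sentence is hard-split as a last resort.
            if current:
                chunks.append(current)
                current = ""
            for start in range(0, len(sentence), max_chars):
                chunks.append(sentence[start:start + max_chars])
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk can then be sent as its own REST call, with the audio results concatenated in order.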

14. Key Considerations

  • For numbers with more than four digits, use comma separators (e.g., 10,000 instead of 10000) so that they are pronounced correctly.
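
Comma insertion can be applied as a preprocessing step. The sketch below uses international three-digit grouping, which matches the 10,000 example. Whether Indian-style grouping (1,00,000) reads better is not covered by this guide. `add_thousands_separators` is a hypothetical helper, and it cannot tell amounts apart from phone numbers, PIN codes, or IDs, so apply it only to text where long digit runs are quantities.

```python
import re

# A run of five or more ASCII digits that does not start with 0 and is not
# already part of a decimal or comma-separated number.
_LONG_NUMBER = re.compile(r"(?<![0-9.,])[1-9][0-9]{4,}(?![0-9])")


def add_thousands_separators(text: str) -> str:
    """Insert comma separators into bare numbers with five or more digits."""
    return _LONG_NUMBER.sub(lambda m: f"{int(m.group()):,}", text)
```

Numbers with leading zeros and numbers of four digits or fewer are left unchanged.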