A guide to writing text that produces natural-sounding speech output. Covers both text-formatting tips and model configuration recommendations for production use.
Tip: Use … (ellipsis) to create a hesitation or trailing-off effect — it signals the speaker is thinking or pausing mid-thought. Use sparingly for natural results.
Tip: Use line breaks between paragraphs for natural breathing pauses:
Add fillers and hesitation markers to make speech sound conversational:
Combining fillers with ellipsis for natural hesitation:
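The formatting tips above can be combined when input text is assembled in code. The following is a minimal sketch, not part of any SDK: `build_script` is a hypothetical helper, and the sample sentences are illustrative only.

```python
def build_script(paragraphs):
    """Join paragraphs with a blank line so each break reads as a breathing pause."""
    cleaned = [p.strip() for p in paragraphs if p.strip()]
    return "\n\n".join(cleaned)


# Illustrative sentences only: a filler ("Umm"), one ellipsis for hesitation,
# and a paragraph break between the two thoughts.
script = build_script([
    "Umm… let me check that for you.",
    "Okay, so your order should arrive by Friday.",
])
print(script)
```

Keeping the ellipsis to a single occurrence follows the advice to use it sparingly.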
For natural Indian speech, mix in English words where they are commonly used. This is how most urban Indians speak, and the model handles it well.
Rule: Write English words in English script, Hindi words in Devanagari:
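One way to sanity-check text against this rule is to flag tokens that mix both scripts inside a single word. The sketch below is a rough heuristic under stated assumptions: `mixed_script_tokens` is a hypothetical helper, tokens are split on whitespace only, and the Devanagari check uses the Unicode block U+0900 to U+097F.

```python
import re

DEVANAGARI = re.compile(r"[\u0900-\u097F]")
LATIN = re.compile(r"[A-Za-z]")


def mixed_script_tokens(text):
    """Return whitespace-separated tokens that contain both Latin and Devanagari letters."""
    return [
        token
        for token in text.split()
        if DEVANAGARI.search(token) and LATIN.search(token)
    ]


clean = mixed_script_tokens("हमारी technology सबको समझती है।")
flagged = mixed_script_tokens("हमारी techनोलॉजी सबको समझती है।")
print(clean)
print(flagged)
```

This only catches words that switch script midway. It cannot detect an English word written entirely in Devanagari, which still needs a manual review.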
Common code-mixed categories:
Full code-mixed examples:
Keep Hindi sentence structure, swap key nouns/verbs with English:
।: "हमारी technology सबको समझती है।"
.: "प्लान simple है, just execute."
The target_language_code parameter is required for every TTS request. It is primarily effective for handling language-specific processing of numbers, abbreviations, and special characters.
If your text contains mixed languages (e.g. Hinglish), set target_language_code to the language in which you want entities such as numbers to be spoken.
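As an illustration, a request body for Hinglish text might be assembled as follows. This is a sketch only: `target_language_code` is the one field name taken from this guide, while `build_tts_payload`, the `text` field, and the `"hi-IN"` code format are assumptions to verify against the API reference.

```python
def build_tts_payload(text, target_language_code):
    """Build a request body; only target_language_code is a field name from the guide."""
    if not target_language_code:
        raise ValueError("target_language_code is required for every TTS request")
    return {
        "text": text,  # assumed field name
        "target_language_code": target_language_code,
    }


# Hinglish text where the number should be read out in Hindi.
# "hi-IN" is an assumed code format, not confirmed by this guide.
payload = build_tts_payload("मेरा order 3 दिन में आएगा।", "hi-IN")
print(payload)
```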
The TTS API returns audio data as a base64-encoded string. You must decode this string before saving or playing the audio file.
The REST API returns a response with an audios field — an array of base64-encoded audio strings. You need to decode them:
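A minimal Python sketch of that decoding step is shown below. It assumes the response has already been parsed into a dict. `decode_audios` is a hypothetical helper, the placeholder bytes stand in for real audio, and the `.wav` extension is an assumption that should match the codec you requested.

```python
import base64
import tempfile
from pathlib import Path


def decode_audios(response_json):
    """Decode every base64 string in the response's `audios` array to raw bytes."""
    return [base64.b64decode(item) for item in response_json["audios"]]


# Stand-in for a parsed REST response; a real one carries encoded audio.
fake_response = {"audios": [base64.b64encode(b"RIFF-placeholder").decode("ascii")]}

out_dir = Path(tempfile.mkdtemp())
for index, audio_bytes in enumerate(decode_audios(fake_response)):
    # Write the decoded bytes, never the base64 text itself.
    (out_dir / f"output_{index}.wav").write_bytes(audio_bytes)
```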
For the streaming (WebSocket) API, each chunk arrives as a base64-encoded audio string. Decode each chunk as it arrives:
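The per-chunk decoding can be sketched as below. The WebSocket client and the message format are deliberately left out: `collect_stream` is a hypothetical helper that accepts any iterable of base64 strings, and the generator here only simulates incoming chunks.

```python
import base64
import io


def collect_stream(chunks, sink):
    """Decode each base64 chunk as it arrives and write the bytes to `sink`."""
    total = 0
    for chunk in chunks:
        total += sink.write(base64.b64decode(chunk))
    return total


# Stand-in for WebSocket messages; each item plays the role of one chunk's audio string.
incoming = (
    base64.b64encode(part).decode("ascii")
    for part in (b"\x00\x01", b"\x02\x03")
)
buffer = io.BytesIO()
bytes_written = collect_stream(incoming, buffer)
print(bytes_written)
```

Decoding each chunk separately, rather than concatenating the base64 strings first, avoids problems when individual chunks carry their own padding.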
Do not write the raw base64 string directly to a file. The audio will be corrupted and unplayable. Always decode with base64.b64decode() (Python) or Buffer.from(data, "base64") (JavaScript) first.
Bulbul v3 supports two API modes. Choosing correctly has a significant impact on latency and user experience.
For voice agent pipelines (LLM → TTS), always use WebSocket streaming. The user perceives audio starting within milliseconds of the first chunk, dramatically improving conversational feel. REST adds a noticeable pause as full audio is generated before delivery.
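The difference can be illustrated with simple arithmetic. The sketch below uses hypothetical per-chunk synthesis times, not measured values, and ignores network overhead. It only shows why playback can start after the first chunk when streaming, whereas a full-response mode waits for every chunk.

```python
def time_to_first_audio(chunk_times_ms, streaming):
    """Estimate the wait before playback can start.

    Streaming can begin after the first chunk; a full-response mode
    waits until all chunks have been generated.
    """
    if not chunk_times_ms:
        return 0
    return chunk_times_ms[0] if streaming else sum(chunk_times_ms)


# Hypothetical per-chunk synthesis times in milliseconds.
chunk_times = [120, 110, 130, 115]
print(time_to_first_audio(chunk_times, streaming=True))   # 120
print(time_to_first_audio(chunk_times, streaming=False))  # 475
```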
Two parameters give you fine-grained control over how Bulbul v3 sounds. Tuning them is the most impactful lever available to developers.
pace controls the speaking rate relative to the model’s natural speed. 1.0 is native speed.
Default recommendation: Start at 1.0 (natural) or 1.1 (brisk, professional contexts). Avoid values above 1.5 unless specifically building speed-listening features.
temperature controls expressiveness and prosodic variation. Lower values produce consistent, predictable delivery; higher values introduce more natural pitch variation and emotional colour.
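The two parameters can be kept together in a small settings object. This is a sketch under stated assumptions: `VoiceSettings` is a hypothetical container, not an SDK class. The rule that pace must be positive is an assumption, as is the idea that omitting temperature leaves the service default in place. Only the 1.5 threshold comes from the guidance above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class VoiceSettings:
    pace: float = 1.0                    # 1.0 is the model's native speed
    temperature: Optional[float] = None  # None: assumed to fall back to the service default

    def to_params(self):
        """Return only the parameters that were explicitly set."""
        params = {"pace": self.pace}
        if self.temperature is not None:
            params["temperature"] = self.temperature
        return params

    def pace_note(self):
        """Flag pace values the guide advises against for general use."""
        if self.pace <= 0:
            return "invalid: pace must be positive"  # assumed constraint
        if self.pace > 1.5:
            return "above 1.5: reserve for speed-listening features"
        return "ok"


brisk = VoiceSettings(pace=1.1)
print(brisk.to_params(), brisk.pace_note())
```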
Not all speakers perform equally across all languages. Always use the language-specific recommendations below rather than arbitrary speaker selection.
Top picks: priya & ishita (best female, excellent across Hindi, Telugu, Kannada, Tamil, Marathi, Gujarati, English) · mani (best male overall, Punjabi) · shubh (best male for hi, te, kn, od, ml) · ratan (best male for en, te, kn, ta, mr, gu).
Varun has a deep, dramatic villain/suspense character voice. He is not suitable as a neutral default. Reserve varun exclusively for thriller, drama, or suspense content.
Recommended parameter and speaker combinations for common production scenarios:
For real-time WebSocket voice agents, use linear16 (PCM) at 16 kHz — lowest decode overhead, integrates directly with LiveKit, Pipecat, and most audio buffers. Use mulaw at 8 kHz for telephony/IVR.
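Raw linear16 audio has no header, so it needs a container before most players will open it. The sketch below wraps PCM bytes in a WAV file using Python's standard `wave` module. It assumes mono, signed 16-bit little-endian samples at 16 kHz, and `pcm16_to_wav` is a hypothetical helper name.

```python
import io
import wave


def pcm16_to_wav(pcm_bytes, sample_rate=16000, channels=1):
    """Wrap raw 16-bit linear PCM in a WAV container using the standard library."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(channels)     # mono is an assumption
        wav_file.setsampwidth(2)            # 16-bit samples = 2 bytes each
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm_bytes)
    return buffer.getvalue()


silence = b"\x00\x00" * 16000  # one second of silence at 16 kHz
wav_bytes = pcm16_to_wav(silence)
print(len(wav_bytes))
```

For live playback the PCM bytes can usually be passed straight to the audio buffer. The wrapper is mainly useful for saving or inspecting output.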
Format large numbers with commas (10,000 instead of 10000) for correct pronunciation.
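The comma-grouping tip can be applied automatically before text is sent for synthesis. The following is a heuristic sketch, not an official preprocessing step: `add_thousands_separators` is a hypothetical helper that rewrites only bare integers of five or more digits, and it uses three-digit grouping as in the 10,000 example.

```python
import re


def add_thousands_separators(text):
    """Rewrite bare integers of five or more digits with comma grouping."""
    def group(match):
        digits = match.group()
        if digits.startswith("0"):
            return digits  # leave zero-padded codes untouched
        return f"{int(digits):,}"

    return re.sub(r"\b[0-9]{5,}\b", group, text)


formatted = add_thousands_separators("The plan costs 10000 rupees in 2025.")
print(formatted)  # The plan costs 10,000 rupees in 2025.
```

Phone numbers, order IDs, and similar digit strings would also be regrouped by this pattern, so exclude them before applying it.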