Bulbul v3 handles most text well out of the box — code-mixed Hinglish, numbers, common abbreviations. But some words need explicit guidance: your company name, niche acronyms, or terms borrowed from another language. That’s what pronunciation dictionaries solve.
You upload a JSON file with "word" → "how to say it" mappings, get back a dict_id, and pass it in any TTS call. The engine swaps matching words before synthesis — no model retraining, no prompt engineering.
TTS models do a great job with everyday language. But they can stumble on words they haven’t seen before — abbreviations specific to your industry, brand names with unusual spellings, or acronyms that should be spelled out rather than read as a word.
For example, suppose your app speaks a sentence that contains “NAIC” and “B2B”. The model might try to read “NAIC” as a single word (like “naik”) instead of spelling it out, or pronounce “B2B” literally. A pronunciation dictionary tells the engine exactly what to say for each of these terms.
Everything else in the sentence stays the same — the dictionary only touches exact matches.
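To make the exact-match behavior concrete, here is a minimal local sketch of that kind of substitution. It only illustrates the idea and is not the engine's actual implementation: the sample sentence, the spoken forms, and the choice of case-sensitive whole-word matching are all assumptions.

```python
import re


def apply_pronunciations(text: str, mapping: dict[str, str]) -> str:
    """Replace whole-word, case-sensitive matches of each key with its spoken form."""
    if not mapping:
        return text
    # Try longer keys first so a longer entry wins over a shorter one that it contains.
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(" + "|".join(re.escape(k) for k in keys) + r")(?!\w)"
    )
    return pattern.sub(lambda m: mapping[m.group(1)], text)


# Hypothetical entries and sentence, chosen only for illustration.
mapping = {"NAIC": "N A I C", "B2B": "B to B"}
sentence = "Our NAIC filing covers every B2B policy."

print(apply_pronunciations(sentence, mapping))
# -> Our N A I C filing covers every B to B policy.
```

Words that merely contain a dictionary key (for example “NAICS”) are left alone in this sketch, which mirrors the "exact matches only" rule.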
The dictionary is a single JSON file. Its top-level key, pronunciations, maps language codes to word → replacement pairs.
Save this as a .json file. That’s it — no XML, no special phoneme notation, no markup. Just plain text replacements.
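The structure described above can be sketched as a Python dictionary serialized to JSON. The words and spoken forms here are hypothetical placeholders; only the pronunciations key and the language-code nesting come from the description above.

```python
import json

# Top-level "pronunciations" key -> language code -> word -> spoken replacement.
dictionary = {
    "pronunciations": {
        "en-IN": {"NAIC": "N A I C", "B2B": "B to B"},
        "hi-IN": {"NAIC": "एन ए आई सी"},
    }
}

# ensure_ascii=False keeps Indic scripts readable in the output file.
serialized = json.dumps(dictionary, ensure_ascii=False, indent=2)
print(serialized)
```

The replacements are plain strings, so the file can be edited by hand in any text editor.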
Matching is language-aware. When target_language_code is hi-IN, only the hi-IN block applies. This means the same word (like “Sarvam”) can have different spoken forms in Hindi vs English.
1. Create the dictionary JSON file
You can create the file manually, or use this helper function:
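One possible shape for such a helper is sketched below. The function name create_pronunciation_dict, its validation rules, and the output filename are assumptions made for illustration, not part of any Sarvam SDK.

```python
import json
from pathlib import Path


def create_pronunciation_dict(entries: dict[str, dict[str, str]], path: str) -> Path:
    """Validate entries and write them under the top-level "pronunciations" key."""
    if not entries:
        raise ValueError("at least one language block is required")
    for lang, words in entries.items():
        if not isinstance(words, dict) or not words:
            raise ValueError(f"{lang}: expected a non-empty word-to-replacement mapping")
        for word, spoken in words.items():
            for value in (word, spoken):
                if not isinstance(value, str) or not value.strip():
                    raise ValueError(
                        f"{lang}: words and replacements must be non-empty strings"
                    )
    out = Path(path)
    out.write_text(
        json.dumps({"pronunciations": entries}, ensure_ascii=False, indent=2),
        encoding="utf-8",
    )
    return out


# Hypothetical entries; writes pronunciations.json to the current directory.
dict_path = create_pronunciation_dict(
    {"en-IN": {"NAIC": "N A I C", "B2B": "B to B"}},
    "pronunciations.json",
)
print(dict_path)
```

Validating before writing catches empty or non-string entries locally, before the file is uploaded.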
2. Upload it
3. Pass dict_id in your TTS call
That’s the core flow. The same dict_id works across REST, HTTP Stream, and WebSocket — just pass it as a parameter.
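Because dict_id is just another request parameter, the same small payload builder can feed any of the three transports. The sketch below makes no network call. The text field name and the placeholder ID are assumptions, so check the API Reference for the exact request schema.

```python
from typing import Optional


def build_tts_payload(
    text: str, target_language_code: str, dict_id: Optional[str] = None
) -> dict:
    """Assemble TTS request parameters, adding dict_id only when one is given."""
    payload = {
        "text": text,  # assumed field name for the input text
        "target_language_code": target_language_code,
    }
    if dict_id:
        payload["dict_id"] = dict_id
    return payload


# "your-dict-id" is a placeholder for the ID returned by the upload step.
payload = build_tts_payload(
    "Our NAIC filing covers every B2B policy.",
    target_language_code="en-IN",
    dict_id="your-dict-id",
)
print(payload)
```

Leaving dict_id out of the payload when none is supplied keeps requests without a dictionary unchanged.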
This is the key design choice: pronunciations are scoped to language codes. A single dictionary can hold mappings for multiple languages, and only the entries matching your target_language_code are applied at synthesis time.
When you call TTS with target_language_code="ta-IN", only the Tamil entries are used. The Hindi and English entries are ignored for that request.
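The scoping rule can be illustrated with a small local lookup. This is a sketch of the selection logic only, with hypothetical spoken forms for “Sarvam”; the real selection happens inside the engine at synthesis time.

```python
def entries_for(dictionary: dict, target_language_code: str) -> dict[str, str]:
    """Return only the word -> replacement pairs for the requested language."""
    return dictionary.get("pronunciations", {}).get(target_language_code, {})


# One dictionary holding hypothetical entries for three languages.
multi_lang = {
    "pronunciations": {
        "hi-IN": {"Sarvam": "सर्वम"},
        "en-IN": {"Sarvam": "Sar-vum"},
        "ta-IN": {"Sarvam": "சர்வம்"},
    }
}

print(entries_for(multi_lang, "ta-IN"))  # only the Tamil block is selected
print(entries_for(multi_lang, "bn-IN"))  # no Bengali block, so nothing applies
```

A language with no block simply yields an empty mapping, so the text is synthesized unchanged.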
Supported language codes: hi-IN, bn-IN, ta-IN, te-IN, kn-IN, ml-IN, mr-IN, gu-IN, pa-IN, od-IN, en-IN.
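A quick local check can catch a mistyped language code before upload. This sketch assumes the eleven codes listed above are the complete set of accepted keys; the function name is hypothetical.

```python
# Language codes copied from the list above.
SUPPORTED_CODES = {
    "hi-IN", "bn-IN", "ta-IN", "te-IN", "kn-IN", "ml-IN",
    "mr-IN", "gu-IN", "pa-IN", "od-IN", "en-IN",
}


def unsupported_codes(dictionary: dict) -> list[str]:
    """List language keys in the dictionary that are not in SUPPORTED_CODES."""
    return sorted(set(dictionary.get("pronunciations", {})) - SUPPORTED_CODES)


candidate = {
    "pronunciations": {
        "en-IN": {"B2B": "B to B"},
        "fr-FR": {"B2B": "B to B"},  # not in the list above
    }
}

print(unsupported_codes(candidate))  # -> ['fr-FR']
```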
Here are some real-world patterns that work well with Sarvam pronunciation dictionaries:
Financial services (IVR / voice bots)
Healthcare
Brand and product names
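As an illustration of how entries for these three categories might be organized, the sketch below groups hypothetical mappings by domain and merges them into one dictionary for a single language. Every word and spoken form here is invented for the example (including the brand name "Zyvra"), and whether a given respelling sounds right should be checked by listening to the synthesized audio.

```python
# Hypothetical entries per domain, for illustration only.
FINANCIAL = {"EMI": "E M I", "KYC": "K Y C", "UPI": "U P I"}
HEALTHCARE = {"OPD": "O P D", "ICU": "I C U", "ECG": "E C G"}
BRANDS = {"Zyvra": "ziv rah"}


def merge_groups(lang: str, *groups: dict[str, str]) -> dict:
    """Merge word groups into one dictionary, rejecting conflicting entries."""
    merged: dict[str, str] = {}
    for group in groups:
        for word, spoken in group.items():
            if word in merged and merged[word] != spoken:
                raise ValueError(f"conflicting entries for {word!r}")
            merged[word] = spoken
    return {"pronunciations": {lang: merged}}


combined = merge_groups("en-IN", FINANCIAL, HEALTHCARE, BRANDS)
print(len(combined["pronunciations"]["en-IN"]))  # -> 7
```

Keeping each domain in its own group makes it easier to review entries with the team that owns the vocabulary, while still shipping one dictionary.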
The dict_id parameter works across REST, HTTP Stream, and WebSocket. Here are complete examples for each:
Full endpoint specs, request/response schemas, and error codes are in the API Reference.
Need help? Reach out on Discord.