Pronunciation Dictionary
Bulbul v3 handles most text well out of the box — code-mixed Hinglish, numbers, common abbreviations. But some words need explicit guidance: your company name, niche acronyms, or terms borrowed from another language. That’s what pronunciation dictionaries solve.
You upload a JSON file with "word" → "how to say it" mappings, get back a dict_id, and pass it in any TTS call. The engine swaps matching words before synthesis — no model retraining, no prompt engineering.
When Do You Need This?
TTS models do a great job with everyday language. But they can stumble on words they haven’t seen before — abbreviations specific to your industry, brand names with unusual spellings, or acronyms that should be spelled out rather than read as a word.
For example, suppose your app speaks a sentence containing “NAIC” and “B2B”. The model might read “NAIC” as a single word (like “naik”) instead of spelling it out, or pronounce “B2B” literally. A pronunciation dictionary tells the model exactly how to say each term.
Everything else in the sentence stays the same — the dictionary only touches exact matches.
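To make the swap concrete, here is a rough local simulation in Python. It is not the engine's implementation: the sample sentence, the spoken forms, and the whole-word, case-sensitive matching rule are all assumptions made for illustration.

```python
import re

def apply_dictionary(text: str, entries: dict[str, str]) -> str:
    """Replace whole-word, case-sensitive matches of each key with its spoken form."""
    if not entries:
        return text
    # Try longer keys first so a longer entry wins over a shorter one at the same position.
    keys = sorted(entries, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(re.escape(k) for k in keys) + r")(?!\w)"
    )
    return pattern.sub(lambda m: entries[m.group(0)], text)

# Hypothetical entries and sentence, used only to show the effect.
entries = {"NAIC": "N A I C", "B2B": "B to B"}
print(apply_dictionary("NAIC rules apply to B2B insurers.", entries))
```

Words that merely contain a key, such as "NAICS", are left alone in this sketch because the match has to cover the whole word.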
Dictionary Format
A single JSON file. The top-level key pronunciations maps language codes to word → replacement pairs:
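As a small sketch of that shape, the Python snippet below builds a dictionary and prints it as JSON. The words and spoken forms are illustrative assumptions, not recommended values; only the `pronunciations` key and the language-code nesting come from the description above.

```python
import json

# Illustrative entries only; pick words that your own audio actually gets wrong.
dictionary = {
    "pronunciations": {
        "en-IN": {"NAIC": "N A I C", "B2B": "B to B"},
        "hi-IN": {"NAIC": "एन ए आई सी"},
    }
}

# ensure_ascii=False keeps Devanagari readable instead of \u escapes.
dictionary_json = json.dumps(dictionary, ensure_ascii=False, indent=2)
print(dictionary_json)
```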
Save this as a .json file. That’s it — no XML, no special phoneme notation, no markup. Just plain text replacements.
Matching is language-aware. When target_language_code is hi-IN, only the hi-IN block applies. This means the same word (like “Sarvam”) can have different spoken forms in Hindi vs English.
Getting Started
1. Create the dictionary JSON file
You can create the file manually, or use this helper function:
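A minimal version of such a helper might look like the sketch below. The function name `write_dictionary` and its validation rules are assumptions; the language codes are the ones listed under Supported Language Codes.

```python
import json
from pathlib import Path

# Language codes listed under "Supported Language Codes" in this guide.
SUPPORTED_LANGUAGES = {
    "hi-IN", "bn-IN", "ta-IN", "te-IN", "kn-IN", "ml-IN",
    "mr-IN", "gu-IN", "pa-IN", "od-IN", "en-IN",
}

def write_dictionary(pronunciations: dict[str, dict[str, str]], path: str) -> Path:
    """Validate language codes and entries, then write the dictionary JSON file."""
    for lang, entries in pronunciations.items():
        if lang not in SUPPORTED_LANGUAGES:
            raise ValueError(f"Unsupported language code: {lang}")
        for word, spoken in entries.items():
            if not word.strip() or not spoken.strip():
                raise ValueError(f"Empty word or replacement in {lang}: {word!r}")
    out = Path(path)
    out.write_text(
        json.dumps({"pronunciations": pronunciations}, ensure_ascii=False, indent=2),
        encoding="utf-8",
    )
    return out
```

Calling `write_dictionary({"en-IN": {"B2B": "B to B"}}, "pronunciations.json")` would write a file in the format described earlier.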
2. Upload it
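The sketch below builds an upload request with only the Python standard library. Treat every specific as an assumption to check against the API Reference: the upload URL is left as a parameter because it is not given here, and the multipart encoding, the `file` form field, and the `api-subscription-key` header are guesses at a typical shape rather than confirmed details.

```python
import uuid
import urllib.request
from pathlib import Path

def build_upload_request(upload_url: str, api_key: str, json_path: str) -> urllib.request.Request:
    """Build (but do not send) a multipart/form-data POST carrying the dictionary file."""
    boundary = uuid.uuid4().hex
    file_path = Path(json_path)
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        # "file" as the form field name is an assumption.
        f'Content-Disposition: form-data; name="file"; filename="{file_path.name}"\r\n'.encode(),
        b"Content-Type: application/json\r\n\r\n",
        file_path.read_bytes(),
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    return urllib.request.Request(
        upload_url,
        data=body,
        method="POST",
        headers={
            "api-subscription-key": api_key,  # assumed auth header name
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )
```

Sending the request with `urllib.request.urlopen(request)` and reading the JSON response should give you the dict_id, assuming the response exposes it under that name.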
3. Pass dict_id in your TTS call
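As a sketch, the request body might be assembled like this. The `text`, `target_language_code`, and `model` field names and the `bulbul:v3` identifier are assumptions about the TTS request shape; `dict_id` is the parameter this guide describes.

```python
import json

def tts_payload(text: str, target_language_code: str, dict_id: str) -> dict:
    """Build the JSON body for a TTS call that applies a pronunciation dictionary."""
    return {
        "text": text,
        "target_language_code": target_language_code,
        "model": "bulbul:v3",  # assumed identifier for Bulbul v3
        "dict_id": dict_id,    # value returned by the upload step
    }

# "YOUR_DICT_ID" is a placeholder for the ID you received after uploading.
body = json.dumps(tts_payload("NAIC rules apply to B2B insurers.", "en-IN", "YOUR_DICT_ID"))
print(body)
```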
That’s the core flow. The same dict_id works across REST, HTTP Stream, and WebSocket — just pass it as a parameter.
Per-Language Matching
This is the key design choice: pronunciations are scoped to language codes. A single dictionary can hold mappings for multiple languages, and only the entries matching your target_language_code are applied at synthesis time.
When you call TTS with target_language_code="ta-IN", only the Tamil entries are used. The Hindi and English entries are ignored for that request.
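The scoping rule can be pictured with a few lines of Python. This is a local illustration of the selection logic, not the service's code, and the spoken forms for “Sarvam” are made-up examples.

```python
def entries_for(dictionary: dict, target_language_code: str) -> dict[str, str]:
    """Return only the entries scoped to the requested language (empty if none)."""
    return dictionary.get("pronunciations", {}).get(target_language_code, {})

# Hypothetical spoken forms for the same word in three languages.
dictionary = {
    "pronunciations": {
        "hi-IN": {"Sarvam": "सर्वम"},
        "ta-IN": {"Sarvam": "சர்வம்"},
        "en-IN": {"Sarvam": "Sar-vum"},
    }
}

print(entries_for(dictionary, "ta-IN"))
print(entries_for(dictionary, "bn-IN"))
```

A language with no block in the dictionary simply gets no replacements in this sketch.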
Supported Language Codes
hi-IN, bn-IN, ta-IN, te-IN, kn-IN, ml-IN, mr-IN, gu-IN, pa-IN, od-IN, en-IN
Managing Dictionaries
List
Get Contents
Update
Delete
Limits
Common Patterns
Here are some real-world patterns that work well with Sarvam pronunciation dictionaries:
Financial services (IVR / voice bots)
Healthcare
Brand and product names
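One way to build entries for the three categories above is sketched below. The term lists and the brand's spoken form are illustrative assumptions, not recommended values; the only idea taken from this guide is that acronyms often need to be spelled out letter by letter.

```python
def spell_out(acronym: str) -> str:
    """Turn 'KYC' into 'K Y C' so each character is read on its own."""
    return " ".join(acronym)

# Hypothetical term lists for each category.
financial = {term: spell_out(term) for term in ["KYC", "EMI", "NEFT"]}
healthcare = {term: spell_out(term) for term in ["OPD", "ICU", "ECG"]}
# Brand names usually need a hand-written spoken form; this one is made up.
brands = {"Sarvam": "Sar-vum"}

dictionary = {"pronunciations": {"en-IN": {**financial, **healthcare, **brands}}}
print(dictionary)
```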
Tips
- Only add words that are actually mispronounced. Bulbul v3 already handles common English words, numbers, and Hinglish well. Test without a dictionary first.
- One dictionary per request. If you need entries from multiple dictionaries, merge them into one (a dictionary has room for 100 words).
- Update preserves the ID. When pronunciations change, use the update endpoint rather than delete + recreate. Your existing TTS integrations keep working.
- Test with your production voice. Different speakers may handle certain words differently — always verify with the voice you’ll deploy.
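The merge tip above can be sketched as a small helper. The 100-word figure comes from the tip itself; counting it across all languages combined is an assumption, as is letting later dictionaries override earlier ones.

```python
MAX_WORDS = 100  # from the tip above; applying it to the total across languages is an assumption

def merge_dictionaries(*dictionaries: dict) -> dict:
    """Merge several dictionaries into one, with later entries overriding earlier ones."""
    merged: dict[str, dict[str, str]] = {}
    for d in dictionaries:
        for lang, entries in d.get("pronunciations", {}).items():
            merged.setdefault(lang, {}).update(entries)
    total = sum(len(entries) for entries in merged.values())
    if total > MAX_WORDS:
        raise ValueError(f"Merged dictionary has {total} words; the limit is {MAX_WORDS}")
    return {"pronunciations": merged}
```

Upload the merged result once and reuse its dict_id, rather than switching dictionaries between requests.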
Using Pronunciation Dictionary with All TTS APIs
The dict_id parameter works across REST, HTTP Stream, and WebSocket. Here are complete examples for each:
REST API
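Here is a hedged end-to-end sketch for the REST path using only the standard library. The endpoint URL, the `api-subscription-key` header, the `bulbul:v3` model identifier, and the base64 `audios` list in the response are assumptions to verify against the API Reference; `dict_id` is the parameter described in this guide.

```python
import base64
import json
import urllib.request

TTS_URL = "https://api.sarvam.ai/text-to-speech"  # assumed endpoint

def extract_audio(response: dict) -> bytes:
    """Decode the first clip, assuming a list of base64 strings under 'audios'."""
    audios = response.get("audios") or []
    if not audios:
        raise ValueError("Response contains no audio")
    return base64.b64decode(audios[0])

def synthesize(api_key: str, text: str, target_language_code: str, dict_id: str) -> bytes:
    """Send one TTS request with a pronunciation dictionary and return raw audio bytes."""
    payload = {
        "text": text,
        "target_language_code": target_language_code,
        "model": "bulbul:v3",  # assumed identifier for Bulbul v3
        "dict_id": dict_id,
    }
    request = urllib.request.Request(
        TTS_URL,
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "api-subscription-key": api_key,  # assumed auth header name
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(request, timeout=30) as resp:
        return extract_audio(json.load(resp))
```

You could then write the returned bytes to a file, for example with `Path("output.wav").write_bytes(audio)`, assuming the service returns WAV data.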
HTTP Stream
WebSocket
Full endpoint specs, request/response schemas, and error codes are in the API Reference.
Need help? Reach out on Discord.