Pronunciation Dictionary

Bulbul v3 handles most text well out of the box — code-mixed Hinglish, numbers, common abbreviations. But some words need explicit guidance: your company name, niche acronyms, or terms borrowed from another language. That’s what pronunciation dictionaries solve.

You upload a JSON file of "word" → "how to say it" mappings, get back a dictionary ID, and pass that ID as dict_id in any TTS call. The engine swaps matching words before synthesis, with no model retraining and no prompt engineering.


When Do You Need This?

TTS models do a great job with everyday language. But they can stumble on words they haven’t seen before — abbreviations specific to your industry, brand names with unusual spellings, or acronyms that should be spelled out rather than read as a word.
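One way to build a shortlist of words worth testing is to scan your prompt texts for acronym-like tokens. The sketch below is only a heuristic: the function name find_candidate_words and the pattern (a capital letter followed by more capitals or digits) are assumptions made for illustration, not rules that Bulbul applies. Listen to each candidate without a dictionary first, and add only the ones that are actually mispronounced.

```python
import re

# Heuristic (an assumption, not a Bulbul rule): a capital letter followed by
# one or more capitals or digits, standing alone, e.g. "HDFC", "KYC", "B2B".
ACRONYM_PATTERN = re.compile(r"(?<![A-Za-z0-9])[A-Z][A-Z0-9]+(?![A-Za-z0-9])")

def find_candidate_words(texts):
    """Return acronym-like tokens in first-seen order, without duplicates."""
    seen = []
    for text in texts:
        for token in ACRONYM_PATTERN.findall(text):
            if token not in seen:
                seen.append(token)
    return seen

prompts = [
    "HDFC account ke liye KYC complete karein",
    "KYC ke baad PhonePe se payment karein",
]
print(find_candidate_words(prompts))  # ['HDFC', 'KYC']
```

Mixed-case brand names such as "PhonePe" are skipped by this pattern, so they still need a manual check.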

For example, if your app says:

NAIC policy number check karein aur B2B portal pe login karein

The model might read “NAIC” as a single word (like “naik”) instead of spelling it out, or read the “2” in “B2B” as the number two (“दो”). A pronunciation dictionary tells the model exactly what to say:

| Input text | Without dictionary | With dictionary |
|---|---|---|
| NAIC | might say “naik” or “na-ic” | says “N A I C” |
| B2B | might say “b-दो-b” | says “B to B” |

Everything else in the sentence stays the same — the dictionary only touches exact matches.
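To preview what a set of entries would do to a sentence before uploading, you can mimic the substitution locally. This is a minimal sketch, not the engine's implementation: preview_substitution is a hypothetical helper, and its matching rules (case-sensitive, whole words only) are assumptions about what "exact match" means.

```python
import re

def preview_substitution(text, entries):
    """Replace whole-word, case-sensitive matches of each key with its spoken form."""
    if not entries:
        return text
    # Try longer keys first so a longer entry wins over a shorter one it contains.
    keys = sorted(entries, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(" + "|".join(re.escape(k) for k in keys) + r")(?!\w)"
    )
    return pattern.sub(lambda m: entries[m.group(1)], text)

entries = {"NAIC": "N A I C", "B2B": "B to B"}
text = "NAIC policy number check karein aur B2B portal pe login karein"
print(preview_substitution(text, entries))
# N A I C policy number check karein aur B to B portal pe login karein
```

Always confirm the final result by listening to the synthesized audio, since the real matching behavior may differ from this sketch.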


Dictionary Format

A single JSON file. The top-level key pronunciations maps language codes to word → replacement pairs:

{
  "pronunciations": {
    "hi-IN": {
      "B2B": "B to B",
      "NAIC": "N A I C",
      "Sarvam": "सारवम"
    },
    "en-IN": {
      "Sarvam": "Saar-vum",
      "HDFC": "H D F C"
    },
    "ta-IN": {
      "EMI": "இ எம் ஐ"
    }
  }
}

Save this as a .json file. That’s it — no XML, no special phoneme notation, no markup. Just plain text replacements.

Matching is language-aware. When target_language_code is hi-IN, only the hi-IN block applies. This means the same word (like “Sarvam”) can have different spoken forms in Hindi vs English.


Getting Started

1. Create the dictionary JSON file

You can create the file manually, or use this helper function:

import json

def create_dictionary_file(pronunciations, filename="dict.json"):
    # Wrap the per-language mappings in the required top-level key.
    dictionary = {"pronunciations": pronunciations}
    # Write UTF-8 explicitly so Indic-script replacements are saved correctly
    # regardless of the platform's default encoding.
    with open(filename, "w", encoding="utf-8") as f:
        json.dump(dictionary, f, ensure_ascii=False, indent=2)
    return filename

create_dictionary_file({
    "hi-IN": {
        "B2B": "B to B",
        "NAIC": "N A I C",
        "CIBIL": "सिबिल"
    },
    "en-IN": {
        "Sarvam": "Saar-vum"
    }
})

2. Upload it

from sarvamai import SarvamAI

client = SarvamAI(api_subscription_key="YOUR_SARVAM_API_KEY")

with open("dict.json", "rb") as f:
    result = client.pronunciation_dictionary.create(file=f)

print(result.dictionary_id)  # e.g. "p_5cb7faa6"

3. Pass dict_id in your TTS call

from sarvamai import SarvamAI
from sarvamai.play import save

client = SarvamAI(api_subscription_key="YOUR_SARVAM_API_KEY")

audio = client.text_to_speech.convert(
    text="NAIC policy check karein aur B2B portal pe login karein",
    target_language_code="hi-IN",
    speaker="shubh",
    model="bulbul:v3",
    dict_id="p_5cb7faa6",
)

save(audio, "output.wav")

That’s the core flow. The same dict_id works across REST, HTTP Stream, and WebSocket — just pass it as a parameter.


Per-Language Matching

This is the key design choice: pronunciations are scoped to language codes. A single dictionary can hold mappings for multiple languages, and only the entries matching your target_language_code are applied at synthesis time.

{
  "pronunciations": {
    "hi-IN": { "EMI": "ई एम आई", "SIP": "सिप" },
    "en-IN": { "EMI": "E M I", "SIP": "S I P" },
    "ta-IN": { "EMI": "இ எம் ஐ" },
    "te-IN": { "EMI": "ఇ ఎం ఐ" }
  }
}

When you call TTS with target_language_code="ta-IN", only the Tamil entries are used. The Hindi and English entries are ignored for that request.
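The selection rule can be pictured as a plain lookup by language code. The sketch below is illustrative only: entries_for_language is a hypothetical helper, not part of the SDK, and it simply returns the block that would apply to a given target_language_code.

```python
def entries_for_language(dictionary, target_language_code):
    """Return the word -> replacement pairs scoped to one language code."""
    # A language with no block in the dictionary gets no replacements.
    return dictionary.get("pronunciations", {}).get(target_language_code, {})

dictionary = {
    "pronunciations": {
        "hi-IN": {"EMI": "ई एम आई", "SIP": "सिप"},
        "en-IN": {"EMI": "E M I", "SIP": "S I P"},
        "ta-IN": {"EMI": "இ எம் ஐ"},
    }
}

print(entries_for_language(dictionary, "ta-IN"))  # only the Tamil entry for "EMI"
print(entries_for_language(dictionary, "kn-IN"))  # {}
```

In this example "SIP" has no Tamil entry, so a ta-IN request would leave it to the model's default pronunciation.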

Supported Language Codes

hi-IN bn-IN ta-IN te-IN kn-IN ml-IN mr-IN gu-IN pa-IN od-IN en-IN


Managing Dictionaries

# Reuses the client created in the upload step.
all_dicts = client.pronunciation_dictionary.list()
print(all_dicts.dictionary_count)  # number of dictionaries
print(all_dicts.dictionaries)      # list of dictionary IDs

Limits

| Limit | Value |
|---|---|
| Dictionaries per user | 10 |
| Words per dictionary | 100 |
| File size | 1 MB |
| Model support | bulbul:v3 only |
| Dictionaries per request | 1 |
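A quick local check before uploading can catch structural mistakes and limit violations early. The following is a sketch under stated assumptions: validate_dictionary is a hypothetical helper, the 100-word limit is assumed to count entries across all language blocks, 1 MB is assumed to mean 1,048,576 bytes, and the size is estimated from the serialized JSON rather than read from the file on disk. The API remains the authority on what it accepts.

```python
import json

SUPPORTED_LANGUAGES = {
    "hi-IN", "bn-IN", "ta-IN", "te-IN", "kn-IN", "ml-IN",
    "mr-IN", "gu-IN", "pa-IN", "od-IN", "en-IN",
}
MAX_WORDS = 100                 # assumed to be the total across all languages
MAX_FILE_BYTES = 1024 * 1024    # assumed meaning of "1 MB"

def validate_dictionary(dictionary):
    """Return a list of human-readable problems; an empty list means none found."""
    pronunciations = dictionary.get("pronunciations") if isinstance(dictionary, dict) else None
    if not isinstance(pronunciations, dict):
        return ['missing top-level "pronunciations" object']

    problems = []
    total = 0
    for lang, entries in pronunciations.items():
        if lang not in SUPPORTED_LANGUAGES:
            problems.append(f"unsupported language code: {lang}")
        if not isinstance(entries, dict):
            problems.append(f"{lang}: expected word -> replacement pairs")
            continue
        for word, spoken in entries.items():
            total += 1
            if not isinstance(spoken, str) or not spoken.strip():
                problems.append(f"{lang}: replacement for {word!r} must be non-empty text")

    if total > MAX_WORDS:
        problems.append(f"too many words: {total} > {MAX_WORDS}")
    size = len(json.dumps(dictionary, ensure_ascii=False).encode("utf-8"))
    if size > MAX_FILE_BYTES:
        problems.append(f"serialized size {size} bytes exceeds {MAX_FILE_BYTES}")
    return problems

print(validate_dictionary({"pronunciations": {"hi-IN": {"KYC": "के वाई सी"}}}))  # []
print(validate_dictionary({"pronunciations": {"fr-FR": {"KYC": "K Y C"}}}))
# ['unsupported language code: fr-FR']
```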

Common Patterns

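Here are some real-world patterns that work well with Sarvam pronunciation dictionaries. Each snippet shows only the language blocks; in an uploaded file they sit inside the top-level pronunciations key.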

Financial services (IVR / voice bots)

{
  "hi-IN": {
    "NEFT": "एन ई एफ टी",
    "RTGS": "आर टी जी एस",
    "KYC": "के वाई सी",
    "EMI": "ई एम आई",
    "CIBIL": "सिबिल"
  }
}

Healthcare

{
  "hi-IN": {
    "OPD": "ओ पी डी",
    "ICU": "आई सी यू",
    "MRI": "एम आर आई",
    "BP": "बी पी"
  }
}

Brand and product names

{
  "hi-IN": {
    "Sarvam": "सारवम",
    "PhonePe": "फ़ोन पे",
    "Zerodha": "ज़ीरोधा"
  },
  "en-IN": {
    "Sarvam": "Saar-vum"
  }
}

Tips

  • Only add words that actually mispronounce. Bulbul v3 already handles common English words, numbers, and Hinglish well. Test without a dictionary first.
  • One dictionary per request. If you need entries from multiple dictionaries, merge them into one (you have 100 words to work with).
  • Update preserves the ID. When pronunciations change, use the update endpoint rather than delete + recreate. Your existing TTS integrations keep working.
  • Test with your production voice. Different speakers may handle certain words differently — always verify with the voice you’ll deploy.
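Because only one dictionary can be passed per request, entries kept in separate files have to be combined before upload. A minimal sketch of such a merge follows; merge_dictionaries is a hypothetical helper, later dictionaries are assumed to win when the same word appears twice for a language, and the 100-word limit is assumed to count entries across all languages.

```python
MAX_WORDS = 100  # per-dictionary limit; assumed to count all languages together

def merge_dictionaries(*dictionaries):
    """Combine several dictionaries; later ones win on conflicting words."""
    merged = {}
    for dictionary in dictionaries:
        for lang, entries in dictionary.get("pronunciations", {}).items():
            merged.setdefault(lang, {}).update(entries)
    total = sum(len(entries) for entries in merged.values())
    if total > MAX_WORDS:
        raise ValueError(f"merged dictionary has {total} words; the limit is {MAX_WORDS}")
    return {"pronunciations": merged}

finance = {"pronunciations": {"hi-IN": {"KYC": "के वाई सी", "EMI": "ई एम आई"}}}
brands = {"pronunciations": {"hi-IN": {"Sarvam": "सारवम"}, "en-IN": {"Sarvam": "Saar-vum"}}}

merged = merge_dictionaries(finance, brands)
print(sorted(merged["pronunciations"]))        # ['en-IN', 'hi-IN']
print(len(merged["pronunciations"]["hi-IN"]))  # 3
```

The merged result has the same shape as any other dictionary, so it can be written to a JSON file and uploaded in the usual way.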

Using Pronunciation Dictionary with All TTS APIs

The dict_id parameter works across REST, HTTP Stream, and WebSocket. The example below uses the REST API through the Python SDK; the other two accept the same dict_id parameter.

from sarvamai import SarvamAI
from sarvamai.play import save

client = SarvamAI(api_subscription_key="YOUR_SARVAM_API_KEY")

audio = client.text_to_speech.convert(
    text="NEFT transfer karein aur KYC complete karein",
    target_language_code="hi-IN",
    speaker="shubh",
    model="bulbul:v3",
    dict_id="p_5cb7faa6",
)

save(audio, "output.wav")

Full endpoint specs, request/response schemas, and error codes are in the API Reference.

Need help? Reach out on Discord.