How to control the response length with max_tokens
The max_tokens parameter sets an upper limit on how long the model's response can be, measured in tokens.
- A token can be a word, part of a word, or even punctuation (example: "Hello!" is 2 tokens: "Hello" + "!")
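Exactly how text splits into tokens depends on the model's tokenizer, so counts vary between models and providers. As a rough illustration, here is a minimal sketch using OpenAI's tiktoken library with the cl100k_base encoding (the encoding name and the resulting counts are specific to that tokenizer):

```python
import tiktoken

# cl100k_base is the encoding used by several OpenAI chat models;
# other models and providers tokenize text differently.
enc = tiktoken.get_encoding("cl100k_base")

tokens = enc.encode("Hello!")
print(tokens)       # two token IDs: one for "Hello", one for "!"
print(len(tokens))  # 2
```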
Why use max_tokens?
- To limit the size of the output
- To control latency / cost (fewer tokens = faster and cheaper)
- To keep answers concise when brevity matters (see the sketch below)
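As an illustration, here is a minimal sketch of setting the cap on a request, assuming the OpenAI Python SDK; the parameter name and the "length" finish reason are specific to that API, but most providers expose an equivalent limit:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarize the water cycle."}],
    max_tokens=60,  # cap the response at 60 tokens
)

choice = response.choices[0]
print(choice.message.content)

# finish_reason == "length" means the cap was hit and the answer was cut off
if choice.finish_reason == "length":
    print("Response was truncated by max_tokens.")
```

Note that a response stopped by the cap is cut off mid-thought rather than gracefully shortened, so choose a limit with some headroom above the length you actually expect.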