How to control the response length with max_tokens

The max_tokens parameter lets you control how long the model's response can be, measured in tokens.

  • A token can be a word, part of a word, or even punctuation
    (Example: “Hello!” = 2 tokens: "Hello" + "!")
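Exact token counts depend on the model's tokenizer, but a common rule of thumb for English text is roughly four characters per token. The helper below is only a rough approximation for budgeting purposes, not SarvamAI's actual tokenizer:

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: ~4 characters per token for English text."""
    return max(1, round(len(text) / 4))

# Rough estimate for a short prompt (actual tokenizer counts will differ)
print(estimate_tokens("Tell me about the planet Mars."))
```

Use estimates like this only as a starting point; check the model's real token usage in the API response when precision matters.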

Why use max_tokens?

  • To limit the size of the output
  • To control latency / cost (fewer tokens = faster and cheaper)
  • To avoid overly long answers if you want concise responses

How to choose the value:

  • Pick a value large enough for a complete answer, but small enough to keep latency and cost in check
  • The examples below use the default (2048) for an open-ended answer and 100 for a short summary

Parameter details:

Parameter     Type      Default
max_tokens    Integer   2048

# Install the SarvamAI SDK
!pip install -Uqq sarvamai

from sarvamai import SarvamAI
# Initialize the SarvamAI client with your API key
client = SarvamAI(api_subscription_key="YOUR_API_SUBSCRIPTION_KEY")
# Example 1: Using default max_tokens (not specified) — model decides length
response = client.chat.completions(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me about the planet Mars."}
    ]
    # max_tokens not specified → model uses internal maximum
)
# Example 2: Using max_tokens = 100 — limit response length
response = client.chat.completions(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize the plot of Mahabharata."}
    ],
    max_tokens=100
)
# Print the assistant's reply
print(response.choices[0].message.content)
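When max_tokens cuts a reply short, many chat-completion APIs report a finish reason of "length" on the choice. Assuming the SarvamAI response object follows this common convention (verify against the SDK documentation), you can detect truncation like this; the SimpleNamespace object below is only a stand-in for a real SDK response:

```python
from types import SimpleNamespace

def was_truncated(response) -> bool:
    """True if generation stopped because the max_tokens limit was reached.

    "length" is the conventional finish_reason for hitting the token limit;
    confirm the exact field and value against the SarvamAI SDK docs.
    """
    return response.choices[0].finish_reason == "length"

# Stand-in response object for illustration (real ones come from the SDK)
fake = SimpleNamespace(choices=[SimpleNamespace(finish_reason="length")])
print(was_truncated(fake))  # True
```

If truncation is detected, you can raise max_tokens, or re-prompt the model to continue from where it stopped.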