How to control the response length with `max_tokens`
The `max_tokens` parameter lets you control how long the model's response can be, measured in tokens.
- A token can be a word, part of a word, or even punctuation. For example, “Hello!” is 2 tokens: `Hello` + `!`.
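Here is a minimal sketch of setting the parameter, assuming the OpenAI Python SDK and an `OPENAI_API_KEY` in the environment; the model name is an example, and other providers expose an equivalent setting under a similar name:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",  # example model name (assumption)
    messages=[{"role": "user", "content": "Explain tokens in one paragraph."}],
    max_tokens=50,  # cap the response at 50 tokens
)

print(response.choices[0].message.content)
```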
Why use `max_tokens`?
- To limit the size of the output
- To control latency / cost (fewer tokens = faster and cheaper)
- To keep answers concise when a short response is all you need
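One caveat: when the cap is hit, the output is cut off mid-sentence rather than wrapped up early. Continuing the sketch above (still assuming the OpenAI SDK), you can detect this by checking the `finish_reason` field:

```python
choice = response.choices[0]
if choice.finish_reason == "length":
    # The model hit the max_tokens cap and the answer is truncated.
    print("Response was truncated by max_tokens.")
else:
    print("Response finished naturally:", choice.finish_reason)
```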