How to control the response length with max_tokens
The max_tokens parameter sets an upper limit on how long the model's response can be, measured in tokens.
- A token can be a word, part of a word, or even punctuation (example: "Hello!" is 2 tokens: "Hello" + "!")
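Exactly how text splits into tokens depends on the model's tokenizer, so counts vary between models and providers. As a rough illustration, here is a minimal sketch using OpenAI's tiktoken library with the cl100k_base encoding (the encoding name and the resulting counts are specific to that tokenizer):

```python
import tiktoken

# cl100k_base is the encoding used by several OpenAI chat models;
# other models and providers tokenize text differently.
enc = tiktoken.get_encoding("cl100k_base")

tokens = enc.encode("Hello!")
print(tokens)       # two token IDs: one for "Hello", one for "!"
print(len(tokens))  # 2
```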
Why use max_tokens?
- To limit the size of the output
- To control latency / cost (fewer tokens = faster and cheaper)
- To keep answers concise when brevity matters (see the sketch below)
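As an illustration, here is a minimal sketch of setting the cap on a request, assuming the OpenAI Python SDK; the parameter name and the "length" finish reason are specific to that API, but most providers expose an equivalent limit:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarize the water cycle."}],
    max_tokens=60,  # cap the response at 60 tokens
)

choice = response.choices[0]
print(choice.message.content)

# finish_reason == "length" means the cap was hit and the answer was cut off
if choice.finish_reason == "length":
    print("Response was truncated by max_tokens.")
```

Note that a response stopped by the cap is cut off mid-thought rather than gracefully shortened, so choose a limit with some headroom above the length you actually expect.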