Chat Completions Overview
Sarvam AI provides powerful chat completion APIs designed to build intelligent conversational AI experiences, with native support for Indian languages and deep contextual reasoning.
Our Chat Completion APIs support the following chat models:
30B parameter model with strong reasoning and Indic language support. Balanced performance-to-cost ratio for production workloads.
105B parameter flagship model. Highest quality outputs for complex reasoning, coding, and generation tasks.
Choosing a Model
Simply pass the model name as the model parameter (e.g., model="sarvam-105b").
Token budgeting: the context length covers everything — your messages, any reasoning_content the model produces in think mode, and the generated reply (capped by max_tokens, default 2048). Reasoning tokens are billed as completion tokens, so high reasoning_effort increases both latency and cost. For long conversations, trim or summarize older turns instead of resending the full history.
Authentication: like every Sarvam API, this endpoint uses the api-subscription-key header. It additionally accepts Authorization: Bearer <key> for OpenAI-compatible tooling — see Authentication for details.
Sarvam-M (24B) has been deprecated and is no longer available through the Chat Completions API. Please migrate to Sarvam-30B or Sarvam-105B for improved performance.
Features
- Supports both “think” and “non-think” modes
- Think mode for complex logical reasoning
- Non-think mode for efficient conversations
- Ideal for mathematical and coding tasks
- Post-trained on Indian languages
- Native English proficiency
- Authentic Indian cultural values
- Rich understanding of local context
- Outperforms similar-sized models
- Strong performance on coding tasks
- Excellent mathematical reasoning
- Advanced problem-solving abilities
- Full Indic script support
- Romanized language support
- Multilingual conversation handling
- Natural language understanding
Code Examples
Basic Chat Completion
Multi-turn Conversation
Hindi (Indic Script)
Reasoning effort options: low, medium, high
- Thinking mode is on by default (
low); passreasoning_effort=None(Python) /reasoning_effort: null(JS, cURL) to disable it - Higher values increase reasoning depth
- Reasoning tokens (returned as
reasoning_content) count toward your completion tokens and bill — use lower effort or disable reasoning for latency- and cost-sensitive paths
- Thinking mode is on by default (
Output length is capped by
max_tokens(default 2048) — raise it for long-form generation
Because thinking mode is on by default, a low max_tokens (e.g. under a few hundred) can be consumed entirely by reasoning — you’ll get finish_reason: "length" with an empty content and only reasoning_content populated. Either keep max_tokens generous or disable reasoning with reasoning_effort=None for short replies.
Streaming
Set stream: true to receive the response incrementally over server-sent events instead of waiting for the full completion. This is essential for responsive chat UIs and voice-agent pipelines, where you want to start rendering (or speaking) the reply as soon as the first tokens arrive.
Both SDKs return an iterator of chat.completion.chunk objects. Each chunk carries a delta with the new portion of the message — delta.content for the reply text and, when reasoning is enabled, delta.reasoning_content for thinking tokens.
Over raw HTTP, each event is a data: line containing a chat.completion.chunk JSON object. The final data chunk carries usage (with an empty choices array), and the stream ends with data: [DONE]:
When reasoning_effort is set, thinking tokens stream first via delta.reasoning_content, followed by the reply via delta.content. Check both fields if you display reasoning to users.
Tool Calling (Function Calling)
Describe functions your application exposes with the tools parameter, and the model will decide when to call them — returning the function name and JSON arguments instead of (or alongside) a text reply. You execute the function yourself, append the result as a tool message, and call the API again so the model can produce its final answer.
The flow is:
- Send the conversation plus
toolsdefinitions. - If the model wants a tool, the response has
finish_reason: "tool_calls"andmessage.tool_callswith the function name and stringified JSONarguments. - Run the function, append the assistant message and a
{"role": "tool", "tool_call_id": ..., "content": ...}message with the result. - Call the API again — the model answers using the tool output.
A tool-call response looks like:
Controlling tool use with tool_choice
function.arguments is a JSON string, not an object — always parse it (and validate against your schema) before executing the function.
Structured Outputs (JSON)
The Chat Completions API supports the OpenAI-compatible response_format parameter for getting reliably structured JSON:
Structured Outputs with json_schema
Pass a JSON Schema under json_schema.schema, and set "strict": true to enforce adherence. The structured reply arrives as a JSON string in message.content — parse it before use.
In the current Python SDK, pass response_format through request_options={"additional_body_parameters": {...}} as shown above. The JavaScript SDK forwards response_format from the request object as-is.
The json_schema object accepts:
JSON mode with json_object
When you only need valid JSON without enforcing a specific structure, use {"type": "json_object"} and describe the desired shape in your prompt:
Even with Structured Outputs, validate the parsed JSON against your expected schema (e.g. with pydantic or zod) before acting on it — the schema constrains the model’s output shape, but your application logic may have stricter requirements (value ranges, business rules, etc.).
Alternative: Tool calling as a JSON schema
If your workflow is already built around tool calling, you can also get structured output by defining a single tool whose parameters schema describes the structure you want, and forcing it with tool_choice. The model’s arguments are then constrained to the schema.
Alternative: Prompt-based JSON
For simple cases, you can also instruct the model to reply with JSON only, set a low temperature, and validate the output before using it (consider JSON mode instead, which guarantees valid JSON):
Always validate model-produced JSON against your expected schema (e.g. with pydantic or zod) and add a retry path — prompt-based JSON is good, but not guaranteed.
API Response Format
Success Response Structure
Response Fields
Error Responses
All errors return a JSON object with an error field (message, code, request_id). The full error-code table, retry guidance, and SDK exception reference live on the central Errors & Troubleshooting page.
Errors specific to this endpoint:
Error Handling Code Example
Limits
Check out our detailed API Reference to explore Chat Completion and all available options.