Chat Completions Overview

Sarvam AI provides powerful chat completion APIs designed to build intelligent conversational AI experiences, with native support for Indian languages and deep contextual reasoning.

Our Chat Completion APIs support the following chat models:

Sarvam-30B

30B parameter model with strong reasoning and Indic language support. Balanced performance-to-cost ratio for production workloads.

Sarvam-105B

105B parameter flagship model. Highest quality outputs for complex reasoning, coding, and generation tasks.

Choosing a Model

	`sarvam-30b`	`sarvam-105b`
Context length	64K tokens	128K tokens
Latency	Lower — faster time-to-first-token, well suited for voice agents and interactive chat	Higher — prioritizes output quality over speed
Cost	Lower per token	Higher per token
Quality	Strong reasoning and Indic language support for everyday tasks	Highest quality for complex reasoning, coding, and long-form generation
Best for	Standard conversations, Q&A, voice-agent pipelines, high-throughput workloads	Complex multi-step reasoning, code generation, document analysis over long contexts

Simply pass the model name as the model parameter (e.g., model="sarvam-105b").

Token budgeting: the context length covers everything — your messages, any reasoning_content the model produces in think mode, and the generated reply (capped by max_tokens, default 2048). Reasoning tokens are billed as completion tokens, so high reasoning_effort increases both latency and cost. For long conversations, trim or summarize older turns instead of resending the full history.
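The trimming advice above can be sketched roughly as follows. This is a minimal illustration, not Sarvam's method: `estimate_tokens` and `trim_history` are hypothetical helper names, and the estimate of about four characters per token is an assumption rather than the behavior of the actual tokenizer (Indic scripts in particular may tokenize differently). The sketch keeps any system message plus the newest turns that fit in a budget.

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic (assumption): about 4 characters per token.
    # Use it only as an approximate bound, never for billing.
    return max(1, len(text) // 4)


def trim_history(messages: list, budget: int) -> list:
    """Keep system messages plus the newest turns that fit in `budget` tokens."""
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]

    used = sum(estimate_tokens(m["content"]) for m in system)
    kept = []
    # Walk backwards from the newest turn and stop once the budget is exhausted.
    for message in reversed(turns):
        cost = estimate_tokens(message["content"])
        if used + cost > budget:
            break
        kept.append(message)
        used += cost
    return system + list(reversed(kept))


history = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "x" * 400},
    {"role": "assistant", "content": "y" * 400},
    {"role": "user", "content": "What is the capital of India?"},
]
trimmed = trim_history(history, budget=50)
print([m["role"] for m in trimmed])
```

When choosing the budget, leave headroom for the reply and for reasoning tokens, since both share the same context window as the messages.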

Authentication: like every Sarvam API, this endpoint uses the api-subscription-key header. It additionally accepts Authorization: Bearer <key> for OpenAI-compatible tooling — see Authentication for details.
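As an illustration of the two header styles, here is a minimal sketch that builds (but does not send) a request with Python's standard library. The helper name `build_request` is hypothetical, and the URL is the one used in the cURL example on this page.

```python
import json
import urllib.request

# Endpoint as it appears in the cURL example on this page.
API_URL = "https://api.sarvam.ai/v1/chat/completions"


def build_request(api_key: str, payload: dict, bearer: bool = False) -> urllib.request.Request:
    """Build a POST request using either of the two accepted auth headers."""
    headers = {"Content-Type": "application/json"}
    if bearer:
        # OpenAI-compatible style
        headers["Authorization"] = f"Bearer {api_key}"
    else:
        # Native Sarvam style
        headers["api-subscription-key"] = api_key
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(API_URL, data=body, headers=headers, method="POST")


payload = {
    "model": "sarvam-30b",
    "messages": [{"role": "user", "content": "Hello"}],
}
request = build_request("YOUR_SARVAM_API_KEY", payload)
# To actually send it: urllib.request.urlopen(request)
```

Note that `urllib` normalizes header capitalization when storing them; HTTP header names are case-insensitive, so this does not affect the request.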

Sarvam-M (24B) has been deprecated and is no longer available through the Chat Completions API. Please migrate to Sarvam-30B or Sarvam-105B for improved performance.

Features

Hybrid Thinking Mode

Supports both “think” and “non-think” modes
Think mode for complex logical reasoning
Non-think mode for efficient conversations
Ideal for mathematical and coding tasks

Advanced Indic Skills

Post-trained on Indian languages
Native English proficiency
Authentic Indian cultural values
Rich understanding of local context

Superior Reasoning Capabilities

Outperforms similar-sized models
Strong performance on coding tasks
Excellent mathematical reasoning
Advanced problem-solving abilities

Seamless Chatting Experience

Full Indic script support
Romanized language support
Multilingual conversation handling
Natural language understanding

Code Examples

Basic Chat Completion

from sarvamai import SarvamAI

client = SarvamAI(
    api_subscription_key="YOUR_SARVAM_API_KEY",
)
response = client.chat.completions(
    model="sarvam-105b",
    messages=[
        {"role": "user", "content": "Hey, what is the capital of India?"}
    ],
)
print(response)

Key Considerations

Reasoning effort options: low, medium, high
- Thinking mode is on by default (low); pass reasoning_effort=None (Python) / reasoning_effort: null (JS, cURL) to disable it
- Higher values increase reasoning depth
- Reasoning tokens (returned as reasoning_content) count toward your completion tokens and bill — use lower effort or disable reasoning for latency- and cost-sensitive paths
Output length is capped by max_tokens (default 2048) — raise it for long-form generation

Because thinking mode is on by default, a low max_tokens (e.g. under a few hundred) can be consumed entirely by reasoning — you’ll get finish_reason: "length" with an empty content and only reasoning_content populated. Either keep max_tokens generous or disable reasoning with reasoning_effort=None for short replies.
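One way to catch this situation in application code is to check `finish_reason` together with an empty `content`. The sketch below works on a response already converted to a plain dict in the shape documented under API Response Format; `extract_reply` is a hypothetical helper name, not part of the SDK.

```python
def extract_reply(response: dict) -> str:
    """Return the reply text, or raise if reasoning used up the whole token budget."""
    choice = response["choices"][0]
    content = choice["message"].get("content") or ""
    if choice["finish_reason"] == "length" and not content:
        raise ValueError(
            "max_tokens was exhausted before any reply text was produced; "
            "raise max_tokens or disable reasoning"
        )
    return content


truncated = {
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "",
                "reasoning_content": "Let me think about this...",
            },
            "finish_reason": "length",
        }
    ]
}

try:
    extract_reply(truncated)
except ValueError as exc:
    print(f"Needs a retry with different settings: {exc}")
```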

Streaming

Set stream: true to receive the response incrementally over server-sent events instead of waiting for the full completion. This is essential for responsive chat UIs and voice-agent pipelines, where you want to start rendering (or speaking) the reply as soon as the first tokens arrive.

Both SDKs return an iterator of chat.completion.chunk objects. Each chunk carries a delta with the new portion of the message — delta.content for the reply text and, when reasoning is enabled, delta.reasoning_content for thinking tokens.

from sarvamai import SarvamAI

client = SarvamAI(
    api_subscription_key="YOUR_SARVAM_API_KEY",
)

stream = client.chat.completions(
    model="sarvam-105b",
    messages=[
        {"role": "user", "content": "Write a short poem about the monsoon."}
    ],
    stream=True,
)

for chunk in stream:
    # The final chunk reports usage and has no choices; guard before indexing
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

Over raw HTTP, each event is a data: line containing a chat.completion.chunk JSON object. The final data chunk carries usage (with an empty choices array), and the stream ends with data: [DONE]:

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1699000000,"model":"sarvam-105b","choices":[{"index":0,"delta":{"role":"assistant","content":"The"},"finish_reason":null}]}
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1699000000,"model":"sarvam-105b","choices":[{"index":0,"delta":{"content":" rains"},"finish_reason":null}]}
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1699000000,"model":"sarvam-105b","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1699000000,"model":"sarvam-105b","choices":[],"usage":{"prompt_tokens":19,"completion_tokens":3,"total_tokens":22}}
data: [DONE]

When reasoning is enabled (the default), thinking tokens stream first via delta.reasoning_content, followed by the reply via delta.content. Check both fields if you display reasoning to users.
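For raw HTTP consumers, the event format described above can be parsed along these lines. This is a sketch under the assumption that each event arrives as a single `data:` line, as in the sample stream; `collect_stream` is a hypothetical helper, and the sample lines are shortened versions of the ones shown earlier.

```python
import json


def collect_stream(lines):
    """Accumulate reasoning text, reply text, and usage from SSE data lines."""
    reasoning, reply, usage = [], [], None
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and anything unexpected
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        if chunk.get("usage"):
            usage = chunk["usage"]
        # The usage chunk has an empty choices array, so this loop is a no-op for it.
        for choice in chunk.get("choices", []):
            delta = choice.get("delta", {})
            if delta.get("reasoning_content"):
                reasoning.append(delta["reasoning_content"])
            if delta.get("content"):
                reply.append(delta["content"])
    return "".join(reasoning), "".join(reply), usage


sample = [
    'data: {"choices":[{"index":0,"delta":{"reasoning_content":"Plan a poem."},"finish_reason":null}]}',
    'data: {"choices":[{"index":0,"delta":{"content":"The"},"finish_reason":null}]}',
    'data: {"choices":[{"index":0,"delta":{"content":" rains"},"finish_reason":null}]}',
    'data: {"choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}',
    'data: {"choices":[],"usage":{"prompt_tokens":19,"completion_tokens":3,"total_tokens":22}}',
    "data: [DONE]",
]
thinking, text, usage = collect_stream(sample)
print(text)
```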

Tool Calling (Function Calling)

Describe functions your application exposes with the tools parameter, and the model will decide when to call them — returning the function name and JSON arguments instead of (or alongside) a text reply. You execute the function yourself, append the result as a tool message, and call the API again so the model can produce its final answer.

The flow is:

Send the conversation plus tools definitions.
If the model wants a tool, the response has finish_reason: "tool_calls" and message.tool_calls with the function name and stringified JSON arguments.
Run the function, append the assistant message and a {"role": "tool", "tool_call_id": ..., "content": ...} message with the result.
Call the API again — the model answers using the tool output.

import json
from sarvamai import SarvamAI

client = SarvamAI(api_subscription_key="YOUR_SARVAM_API_KEY")

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for an Indian city",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name, e.g. Mumbai"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                },
                "required": ["city"],
            },
        },
    }
]

messages = [{"role": "user", "content": "What's the weather in Mumbai right now?"}]

response = client.chat.completions(
    model="sarvam-105b",
    messages=messages,
    tools=tools,
    tool_choice="auto",
)

message = response.choices[0].message

if message.tool_calls:
    tool_call = message.tool_calls[0]
    args = json.loads(tool_call.function.arguments)

    # Run your actual function here
    weather = {"city": args["city"], "temperature": 31, "condition": "Humid"}

    messages.append(
        {
            "role": "assistant",
            "tool_calls": [
                {
                    "id": tool_call.id,
                    "type": "function",
                    "function": {
                        "name": tool_call.function.name,
                        "arguments": tool_call.function.arguments,
                    },
                }
            ],
        }
    )
    messages.append(
        {
            "role": "tool",
            "tool_call_id": tool_call.id,
            "content": json.dumps(weather),
        }
    )

    final = client.chat.completions(
        model="sarvam-105b",
        messages=messages,
        tools=tools,
    )
    print(final.choices[0].message.content)

A tool-call response looks like:

{
  "choices": [
    {
      "index": 0,
      "finish_reason": "tool_calls",
      "message": {
        "role": "assistant",
        "content": null,
        "tool_calls": [
          {
            "id": "call_abc123",
            "type": "function",
            "function": {
              "name": "get_weather",
              "arguments": "{\"city\": \"Mumbai\", \"unit\": \"celsius\"}"
            }
          }
        ]
      }
    }
  ]
}

Controlling tool use with `tool_choice`

Value	Behavior
`"auto"` (default when tools are provided)	The model decides whether to call a tool or reply directly
`"none"`	The model never calls a tool — tools are ignored
`"required"`	The model must call at least one tool
`{"type": "function", "function": {"name": "get_weather"}}`	Forces the model to call the named function

function.arguments is a JSON string, not an object — always parse it (and validate against your schema) before executing the function.
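A minimal sketch of that parse-and-validate step, assuming the `get_weather` tool defined earlier. The registry `TOOLS`, the stub `get_weather` implementation, and the helper `run_tool_call` are all hypothetical; errors are returned as JSON so they can be sent back to the model as the tool message content instead of crashing the application.

```python
import json


def get_weather(city: str, unit: str = "celsius") -> dict:
    # Stub standing in for a real weather lookup.
    return {"city": city, "unit": unit, "temperature": 31}


# Hypothetical registry mapping tool names to local functions.
TOOLS = {"get_weather": get_weather}


def run_tool_call(name: str, arguments: str) -> str:
    """Parse and validate stringified arguments, then run the tool."""
    if name not in TOOLS:
        return json.dumps({"error": f"unknown tool: {name}"})
    try:
        args = json.loads(arguments)
    except json.JSONDecodeError as exc:
        return json.dumps({"error": f"invalid JSON arguments: {exc.msg}"})
    # Hand-written checks mirroring the tool's parameter schema.
    if not isinstance(args, dict) or not isinstance(args.get("city"), str):
        return json.dumps({"error": "'city' must be a string"})
    unit = args.get("unit", "celsius")
    if unit not in ("celsius", "fahrenheit"):
        return json.dumps({"error": "'unit' must be celsius or fahrenheit"})
    return json.dumps(TOOLS[name](args["city"], unit))


print(run_tool_call("get_weather", '{"city": "Mumbai", "unit": "celsius"}'))
```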

Structured Outputs (JSON)

The Chat Completions API supports the OpenAI-compatible response_format parameter for getting reliably structured JSON:

`response_format`	Behavior
`{"type": "json_schema", "json_schema": {...}}`	Structured Outputs — output is constrained to match the JSON Schema you supply (recommended)
`{"type": "json_object"}`	JSON mode — output is guaranteed to be valid JSON, but not a specific schema
`{"type": "text"}` (default)	Plain text output

Structured Outputs with `json_schema`

Pass a JSON Schema under json_schema.schema, and set "strict": true to enforce adherence. The structured reply arrives as a JSON string in message.content — parse it before use.

import json
from sarvamai import SarvamAI

client = SarvamAI(api_subscription_key="YOUR_SARVAM_API_KEY")

response = client.chat.completions(
    model="sarvam-105b",
    messages=[
        {
            "role": "user",
            "content": "Order: 2 masala dosas and 1 filter coffee to Koramangala, Bengaluru.",
        }
    ],
    request_options={
        "additional_body_parameters": {
            "response_format": {
                "type": "json_schema",
                "json_schema": {
                    "name": "food_order",
                    "strict": True,
                    "schema": {
                        "type": "object",
                        "properties": {
                            "items": {
                                "type": "array",
                                "items": {
                                    "type": "object",
                                    "properties": {
                                        "name": {"type": "string"},
                                        "quantity": {"type": "integer"},
                                    },
                                    "required": ["name", "quantity"],
                                    "additionalProperties": False,
                                },
                            },
                            "city": {"type": "string"},
                        },
                        "required": ["items", "city"],
                        "additionalProperties": False,
                    },
                },
            }
        }
    },
)

order = json.loads(response.choices[0].message.content)
print(order)
# {'items': [{'name': 'masala dosa', 'quantity': 2}, {'name': 'filter coffee', 'quantity': 1}], 'city': 'Bengaluru'}

In the current Python SDK, pass response_format through request_options={"additional_body_parameters": {...}} as shown above. The JavaScript SDK forwards response_format from the request object as-is.

The json_schema object accepts:

Field	Type	Description
`name`	string (required)	Name of the response format. Alphanumeric characters, underscores and dashes only
`schema`	object	The output structure, described as a JSON Schema object
`strict`	boolean	Enable strict schema adherence when generating the output (default `false`)
`description`	string	What the format is for — helps the model decide how to respond

JSON mode with `json_object`

When you only need valid JSON without enforcing a specific structure, use {"type": "json_object"} and describe the desired shape in your prompt:

curl -X POST https://api.sarvam.ai/v1/chat/completions \
  -H "api-subscription-key: $SARVAM_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "sarvam-105b",
    "messages": [
      {"role": "system", "content": "Reply with a JSON object: {\"sentiment\": \"positive\" | \"negative\" | \"neutral\", \"confidence\": number}"},
      {"role": "user", "content": "यह फिल्म शानदार थी!"}
    ],
    "response_format": {"type": "json_object"}
  }'

Even with Structured Outputs, validate the parsed JSON against your expected schema (e.g. with pydantic or zod) before acting on it — the schema constrains the model’s output shape, but your application logic may have stricter requirements (value ranges, business rules, etc.).
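As a sketch of what such application-level validation might look like for the food-order example, here is a standard-library version (pydantic or zod would express the same checks more compactly). The helper `validate_order` and the limit of 20 units per item are hypothetical, chosen only to illustrate a business rule that a JSON Schema shape check would not cover.

```python
import json

MAX_QUANTITY = 20  # hypothetical business rule, not an API limit


def validate_order(raw: str) -> dict:
    """Parse a structured reply and apply checks beyond the schema's shape."""
    order = json.loads(raw)
    if not isinstance(order, dict):
        raise ValueError("order must be a JSON object")
    items = order.get("items")
    if not isinstance(items, list) or not items:
        raise ValueError("order must contain at least one item")
    for item in items:
        if not isinstance(item, dict):
            raise ValueError("each item must be an object")
        name, quantity = item.get("name"), item.get("quantity")
        if not isinstance(name, str) or not name.strip():
            raise ValueError("item name must be a non-empty string")
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(quantity, bool) or not isinstance(quantity, int):
            raise ValueError("quantity must be an integer")
        if not 1 <= quantity <= MAX_QUANTITY:
            raise ValueError(f"quantity must be between 1 and {MAX_QUANTITY}")
    city = order.get("city")
    if not isinstance(city, str) or not city.strip():
        raise ValueError("city must be a non-empty string")
    return order


reply = '{"items": [{"name": "masala dosa", "quantity": 2}], "city": "Bengaluru"}'
print(validate_order(reply))
```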

Alternative: Tool calling as a JSON schema

If your workflow is already built around tool calling, you can also get structured output by defining a single tool whose parameters schema describes the structure you want, and forcing it with tool_choice. The model’s arguments are then constrained to the schema.

import json
from sarvamai import SarvamAI

client = SarvamAI(api_subscription_key="YOUR_SARVAM_API_KEY")

response = client.chat.completions(
    model="sarvam-105b",
    messages=[
        {
            "role": "user",
            "content": "Order: 2 masala dosas and 1 filter coffee to Koramangala, Bengaluru.",
        }
    ],
    tools=[
        {
            "type": "function",
            "function": {
                "name": "extract_order",
                "description": "Extract a structured food order",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "items": {
                            "type": "array",
                            "items": {
                                "type": "object",
                                "properties": {
                                    "name": {"type": "string"},
                                    "quantity": {"type": "integer"},
                                },
                                "required": ["name", "quantity"],
                            },
                        },
                        "delivery_area": {"type": "string"},
                        "city": {"type": "string"},
                    },
                    "required": ["items", "city"],
                },
            },
        }
    ],
    tool_choice={"type": "function", "function": {"name": "extract_order"}},
)

arguments = response.choices[0].message.tool_calls[0].function.arguments
order = json.loads(arguments)
print(order)
# {'items': [{'name': 'masala dosa', 'quantity': 2}, {'name': 'filter coffee', 'quantity': 1}], 'delivery_area': 'Koramangala', 'city': 'Bengaluru'}

Alternative: Prompt-based JSON

For simple cases, you can also instruct the model to reply with JSON only, set a low temperature, and validate the output before using it (consider JSON mode instead, which guarantees valid JSON):

import json
from sarvamai import SarvamAI

client = SarvamAI(api_subscription_key="YOUR_SARVAM_API_KEY")

response = client.chat.completions(
    model="sarvam-105b",
    messages=[
        {
            "role": "system",
            "content": (
                "Reply with a single JSON object only. No prose, no markdown fences. "
                'Schema: {"sentiment": "positive" | "negative" | "neutral", "confidence": number}'
            ),
        },
        {"role": "user", "content": "यह फिल्म शानदार थी!"},
    ],
    temperature=0.1,
)

raw = response.choices[0].message.content
try:
    result = json.loads(raw)
except json.JSONDecodeError:
    # Retry, or strip markdown fences / extra text before parsing
    raise
print(result)

Always validate model-produced JSON against your expected schema (e.g. with pydantic or zod) and add a retry path — prompt-based JSON is good, but not guaranteed.
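The retry path mentioned above could look roughly like this. It is a sketch, not SDK functionality: `strip_fences` and `json_with_retry` are hypothetical helpers, and `generate` stands for any zero-argument callable that returns the model's raw reply (for example, a lambda wrapping the chat completion call and returning `message.content`).

```python
import json
from typing import Callable

FENCE_MARK = "`" * 3  # a markdown code fence, built this way to keep the example readable


def strip_fences(raw: str) -> str:
    """Remove a surrounding markdown code fence, if the reply has one."""
    text = raw.strip()
    if text.startswith(FENCE_MARK):
        # Drop the opening fence line, which may carry a language tag such as "json".
        text = text.split("\n", 1)[1] if "\n" in text else ""
        if text.rstrip().endswith(FENCE_MARK):
            text = text.rstrip()[: -len(FENCE_MARK)]
    return text.strip()


def json_with_retry(generate: Callable[[], str], attempts: int = 3) -> dict:
    """Call `generate` until it yields a parseable JSON object, up to `attempts` times."""
    last_error = None
    for _ in range(attempts):
        try:
            result = json.loads(strip_fences(generate()))
        except json.JSONDecodeError as exc:
            last_error = exc
            continue
        if isinstance(result, dict):
            return result
        last_error = ValueError("reply was valid JSON but not an object")
    raise ValueError(f"no valid JSON object after {attempts} attempts") from last_error


# Simulated replies: the first is prose, the second is fenced JSON.
replies = iter([
    "Sure! Here is the analysis you asked for.",
    FENCE_MARK + 'json\n{"sentiment": "positive", "confidence": 0.9}\n' + FENCE_MARK,
])
result = json_with_retry(lambda: next(replies))
print(result)
```

Each retry is a new billed request, so keep `attempts` small and prefer JSON mode or Structured Outputs when the reply must always parse.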

API Response Format

Success Response Structure

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1699000000,
  "model": "sarvam-105b",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The capital of India is New Delhi. It has been the capital since 1931."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 15,
    "completion_tokens": 25,
    "total_tokens": 40
  }
}

Response Fields

Field	Type	Description
`id`	string	Unique identifier for the completion request
`object`	string	Always `"chat.completion"`
`created`	integer	Unix timestamp when the completion was created
`model`	string	The model used for completion
`choices[].index`	integer	Index of the choice in the list
`choices[].message.role`	string	Always `"assistant"`
`choices[].message.content`	string	The generated text response (`null` when the model calls a tool)
`choices[].message.reasoning_content`	string	Thinking steps (present when reasoning is enabled, which is the default; absent when `reasoning_effort` is disabled)
`choices[].message.tool_calls`	array	Tool invocations requested by the model (only when using tool calling)
`choices[].finish_reason`	string	Why generation stopped: `"stop"`, `"length"`, `"tool_calls"`, `"content_filter"`
`usage.prompt_tokens`	integer	Tokens in the input prompt
`usage.completion_tokens`	integer	Tokens in the generated response
`usage.total_tokens`	integer	Total tokens used (prompt + completion)

Error Responses

All errors return a JSON object with an error field (message, code, request_id). The full error-code table, retry guidance, and SDK exception reference live on the central Errors & Troubleshooting page.

Errors specific to this endpoint:

HTTP Status	Error Code	When This Happens	What To Do
`400`	`invalid_request_error`	Missing `messages` array or missing `model` field	Include both `model` and a valid `messages` array with role/content
`422`	`unprocessable_entity_error`	Invalid model name or parameter values	Check temperature (0-2), model name, etc.

Error Handling Code Example

from sarvamai import SarvamAI
from sarvamai.core.api_error import ApiError

client = SarvamAI(api_subscription_key="YOUR_SARVAM_API_KEY")

try:
    response = client.chat.completions(
        model="sarvam-105b",
        messages=[
            {"role": "user", "content": "What is the capital of India?"}
        ],
    )
    print(response.choices[0].message.content)
except ApiError as e:
    if e.status_code == 400:
        print(f"Bad request: {e.body}")
    elif e.status_code == 403:
        print("Invalid API key. Check your credentials.")
    elif e.status_code == 422:
        print(f"Invalid parameters: {e.body}")
    elif e.status_code == 429:
        print("Rate limit exceeded. Wait and retry.")
    else:
        print(f"Error {e.status_code}: {e.body}")
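The "wait and retry" advice for 429 responses can be sketched as exponential backoff with jitter. This is an illustrative helper, not part of the SDK: `call_with_backoff` is a hypothetical name, the delays are arbitrary defaults, and the only assumption about the exception is that it exposes a `status_code` attribute, as `ApiError` does in the example above.

```python
import random
import time

RETRYABLE_STATUS = {429}  # assumption: only rate-limit errors are retried here


def call_with_backoff(call, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Run `call`, retrying rate-limit errors with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception as exc:
            status = getattr(exc, "status_code", None)
            is_last_attempt = attempt == max_attempts - 1
            if status not in RETRYABLE_STATUS or is_last_attempt:
                raise
            # Waits of roughly 1s, 2s, 4s, ... plus jitter to avoid synchronized retries.
            sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))


class FakeRateLimitError(Exception):
    """Stand-in for an SDK error, used only to demonstrate the helper."""

    status_code = 429


attempts = {"count": 0}


def flaky_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise FakeRateLimitError()
    return "ok"


# Pass a no-op sleep so the demonstration finishes instantly.
print(call_with_backoff(flaky_call, sleep=lambda seconds: None))
```

With the SDK, the wrapped call would be something like `lambda: client.chat.completions(model=..., messages=...)`. Consult the Rate Limits page for the actual limits before tuning the delays.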

Limits

Limit	Value
Context window	64K tokens (`sarvam-30b`) / 128K tokens (`sarvam-105b`)
`max_tokens`	`sarvam-30b`: Starter 4096 / Pro 8192 / Business 64000; `sarvam-105b`: Starter 4096 / Pro 16384 / Business 128000 (reasoning tokens count toward completion tokens)
`temperature`	0–2 (default 0.5 when reasoning is enabled — the default — and 0.2 when reasoning is disabled)
`top_p`	0–1
`n` (completions per request)	1–128
`frequency_penalty` / `presence_penalty`	-2 to 2
`stop`	Up to 4 sequences
Rate limits	See Rate Limits
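
Checking parameters client-side against the ranges in this table can catch some 422 errors before a request is sent. The sketch below is a hypothetical helper (`check_params` is not part of the SDK) and covers only the numeric ranges and the stop-sequence count listed above; it does not check the plan-dependent `max_tokens` caps.

```python
# Ranges copied from the Limits table above.
RANGES = {
    "temperature": (0, 2),
    "top_p": (0, 1),
    "n": (1, 128),
    "frequency_penalty": (-2, 2),
    "presence_penalty": (-2, 2),
}
MAX_STOP_SEQUENCES = 4


def check_params(params: dict) -> list:
    """Return a list of human-readable problems; an empty list means no issues found."""
    problems = []
    for name, (low, high) in RANGES.items():
        value = params.get(name)
        if value is not None and not low <= value <= high:
            problems.append(f"{name}={value} is outside {low} to {high}")
    stop = params.get("stop")
    if isinstance(stop, list) and len(stop) > MAX_STOP_SEQUENCES:
        problems.append(f"stop has {len(stop)} sequences; at most {MAX_STOP_SEQUENCES} allowed")
    return problems


request_params = {
    "temperature": 2.5,
    "top_p": 0.9,
    "stop": ["a", "b", "c", "d", "e"],
}
for problem in check_params(request_params):
    print(problem)
```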

Check out our detailed API Reference to explore Chat Completion and all available options.