Sarvam-30B

Sarvam-30B (Chat LLM)

A 30B-parameter Mixture-of-Experts reasoning model trained from scratch and optimized for Indian languages, with only 2.4B parameters active per token. It delivers strong reasoning, coding, and conversational capabilities while remaining efficient to deploy.

Highlights:

  • 30B total parameters, 2.4B active — efficient MoE architecture with Grouped Query Attention
  • Pre-trained on 16 trillion tokens across code, math, multilingual, and web data
  • State-of-the-art Indian language performance across native and romanized scripts
  • Optimized inference for H100, L40S, and Apple Silicon (MXFP4)
  • OpenAI-compatible chat completions API

At a Glance

  • Model ID: sarvam-30b
  • What it does: Chat LLM for reasoning, coding, and conversation, with 30B total / 2.4B active parameters (MoE)
  • Languages: 10 most-spoken Indian languages + English; native script, romanized, and code-mixed input
  • APIs: Chat Completions (/v1/chat/completions, OpenAI-compatible, streaming supported)
  • Input limits: 64K-token context window (see Limits for all limits)
  • Benchmarks: see Key Features and the model blog
  • Pricing: see the Pricing page
  • Best for: Voice-agent pipelines, interactive chat, high-throughput workloads
  • Known limitations: see Known Limitations below

Key Features

Strong Indian Language Support

Trained on the 10 most-spoken Indian languages with support for native script, romanized, and code-mixed inputs. Wins 89% of pairwise comparisons on Indian language benchmarks and 87% on STEM, math, and coding.

Efficient MoE Architecture

Mixture-of-Experts Transformer with 128 sparse experts and only 2.4B active parameters per token, enabling high throughput with 3x–6x gains on H100 and local execution on Apple Silicon via MXFP4.
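
To make the "active parameters" idea concrete, here is a toy sketch of sparse expert routing in plain Python. Only the 128-expert count and the 2.4B / 30B figures come from the text above; the top-k value (`TOP_K = 4`), the softmax gating, and the function name are illustrative assumptions, not a description of Sarvam-30B's actual router.

```python
import math
import random

NUM_EXPERTS = 128   # from the model description
TOP_K = 4           # hypothetical: the source does not say how many experts fire per token


def route_token(router_logits, top_k=TOP_K):
    """Pick the top_k experts for one token and return (expert_id, weight) pairs.

    Weights are a softmax over the selected logits only, so they sum to 1.
    """
    chosen = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i],
                    reverse=True)[:top_k]
    peak = max(router_logits[i] for i in chosen)  # subtract for numerical stability
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
selected = route_token(logits)

# Only TOP_K of the 128 experts run for this token; the rest are skipped.
print(len(selected), "of", NUM_EXPERTS, "experts active")
# Share of total parameters active per token, using the figures from the text.
print(f"{2.4 / 30:.0%} of parameters active per token")
```

Because each token touches only a small subset of experts, per-token compute tracks the active parameter count rather than the total, which is the efficiency argument made above.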

Reasoning & Coding

Achieves 97.0 on Math500, 92.1 on HumanEval, 92.7 on MBPP, and 88.3 on AIME 25 (96.7 with tools) — exceeding typical expectations for models with similar active compute.

Agentic Capabilities

Native tool calling with strong performance on BrowseComp (35.5) and Tau2 (45.7) for web-search-driven tasks, planning, retrieval, and multi-step task execution.
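
Since the API is described as OpenAI-compatible, a tool-calling request would presumably follow the OpenAI `tools` schema. The sketch below only builds the request body and dispatches a tool call locally. The `get_weather` tool, its parameters, and the assumption that `sarvam-30b` accepts the `tools` field in exactly this shape are illustrative and should be checked against the API reference.

```python
import json


def get_weather(city: str) -> str:
    """Hypothetical local tool; a real one would call a weather service."""
    return f"Weather lookup for {city} is not implemented in this sketch."


# Tool description in the OpenAI function-calling schema (assumed to be accepted).
TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

REGISTRY = {"get_weather": get_weather}


def build_request(user_text: str) -> dict:
    """Assemble a chat completions body that advertises the tools."""
    return {
        "model": "sarvam-30b",
        "messages": [{"role": "user", "content": user_text}],
        "tools": TOOLS,
    }


def run_tool_call(tool_call: dict) -> dict:
    """Execute one tool call from an assistant message and wrap the result
    as a `tool` role message to send back on the next turn."""
    fn = tool_call["function"]
    args = json.loads(fn["arguments"])  # arguments arrive as a JSON string
    result = REGISTRY[fn["name"]](**args)
    return {"role": "tool", "tool_call_id": tool_call["id"], "content": result}


# Made-up example of the tool call shape an OpenAI-compatible server returns.
example_call = {
    "id": "call_1",
    "type": "function",
    "function": {"name": "get_weather", "arguments": '{"city": "Pune"}'},
}
print(run_tool_call(example_call))
```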

Learn More

For detailed information on architecture, training methodology, performance benchmarks, and inference optimizations, visit our blog.

Model Specifications

Key Considerations
  • Model ID: sarvam-30b
  • Total Parameters: 30B (2.4B active per token)
  • Architecture: MoE Transformer with GQA and 128 sparse experts
  • Pre-training Data: 16T tokens
  • Temperature range: 0 to 2
  • Top-p range: 0 to 1
  • Supports streaming and non-streaming responses
  • OpenAI-compatible chat completions format
  • License: Apache 2.0
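
Streaming responses from OpenAI-compatible endpoints are typically delivered as server-sent events: one `data: {...}` line per chunk, ending with `data: [DONE]`. Assuming Sarvam-30B follows that convention (this page only states that streaming is supported), a minimal parser that reassembles the text from raw lines might look like the following. The sample lines are made up for illustration.

```python
import json


def collect_stream_text(lines):
    """Concatenate `delta.content` pieces from OpenAI-style SSE lines."""
    parts = []
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            piece = choice.get("delta", {}).get("content")
            if piece:
                parts.append(piece)
    return "".join(parts)


# Fabricated sample chunks in the assumed wire format.
sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Namaste"}}]}',
    'data: {"choices": [{"delta": {"content": ", world"}}]}',
    "data: [DONE]",
]
print(collect_stream_text(sample))
```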

Key Capabilities

Simple, one-turn interaction where the user asks a question and the model replies with a single, direct response.

from sarvamai import SarvamAI

client = SarvamAI(
    api_subscription_key="YOUR_SARVAM_API_KEY",
)

response = client.chat.completions(
    model="sarvam-30b",
    messages=[
        {"role": "user", "content": "Explain the significance of the Indian monsoon season."}
    ],
    temperature=0.5,
    top_p=1,
    max_tokens=1000,
)

print(response.choices[0].message.content)
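
Because the endpoint is OpenAI-compatible, the same call can also be expressed as a plain HTTP request. The sketch below only constructs the request with the standard library and does not send it. The base URL `https://api.sarvam.ai` and the `api-subscription-key` header name are assumptions not stated on this page; confirm them in the API reference before use.

```python
import json
import urllib.request

BASE_URL = "https://api.sarvam.ai"  # assumed base URL


def build_chat_request(api_key: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a POST to the chat completions endpoint."""
    body = {
        "model": "sarvam-30b",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.5,
        "top_p": 1,
        "max_tokens": 1000,
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "api-subscription-key": api_key,  # assumed header name
        },
        method="POST",
    )


req = build_chat_request(
    "YOUR_SARVAM_API_KEY",
    "Explain the significance of the Indian monsoon season.",
)
print(req.get_method(), req.full_url)

# To actually send it (requires a valid key and network access):
# with urllib.request.urlopen(req) as resp:
#     reply = json.load(resp)
#     print(reply["choices"][0]["message"]["content"])
```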

Limits

  • Context window: 64K tokens
  • max_tokens: Starter 4096 / Pro 8192 / Business 64000
  • temperature: 0–2 (default 0.5 when reasoning is enabled, which is the default setting; 0.2 when reasoning is disabled)
  • top_p: 0–1
  • n (completions per request): 1–128
  • frequency_penalty / presence_penalty: -2 to 2
  • stop: Up to 4 sequences
  • Rate limits: See Rate Limits
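
The limits above can be checked client-side before a request is sent. The helper below is a hypothetical convenience, not part of the SDK: the plan names and ceilings are copied from the limits listed here, while the function name and error messages are illustrative.

```python
MAX_TOKENS_BY_PLAN = {"starter": 4096, "pro": 8192, "business": 64000}


def validate_params(plan, max_tokens, temperature=0.5, top_p=1.0, n=1,
                    frequency_penalty=0.0, presence_penalty=0.0, stop=None):
    """Return a list of limit violations; an empty list means the request fits."""
    errors = []
    ceiling = MAX_TOKENS_BY_PLAN[plan.lower()]
    if max_tokens > ceiling:
        errors.append(f"max_tokens {max_tokens} exceeds {ceiling} for plan {plan}")
    if not 0 <= temperature <= 2:
        errors.append("temperature must be between 0 and 2")
    if not 0 <= top_p <= 1:
        errors.append("top_p must be between 0 and 1")
    if not 1 <= n <= 128:
        errors.append("n must be between 1 and 128")
    for name, value in (("frequency_penalty", frequency_penalty),
                        ("presence_penalty", presence_penalty)):
        if not -2 <= value <= 2:
            errors.append(f"{name} must be between -2 and 2")
    if isinstance(stop, str):
        stop = [stop]  # a single stop string counts as one sequence
    if stop is not None and len(stop) > 4:
        errors.append("stop accepts at most 4 sequences")
    return errors


print(validate_params("starter", max_tokens=1000))
print(validate_params("starter", max_tokens=8192, temperature=2.5))
```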

Known Limitations

Thinking mode is on by default
  • Detail: Reasoning (reasoning_effort, default low) is enabled by default, and reasoning tokens count toward completion tokens, so a small max_tokens can be consumed entirely by reasoning.
  • Workaround: Increase max_tokens to leave room for the visible answer, or disable reasoning with reasoning_effort=None.
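
One way to apply this workaround is to choose request parameters based on whether reasoning is wanted. In the sketch below, the `reasoning_headroom` value is an arbitrary illustrative number rather than a documented figure, and the helper name is hypothetical. Only `reasoning_effort`, its `None` value for disabling reasoning, and the two temperature defaults come from this page.

```python
def completion_params(answer_tokens, use_reasoning=True, reasoning_headroom=2000):
    """Build keyword arguments for a chat completion call.

    With reasoning on, reasoning tokens share the max_tokens budget with the
    visible answer, so extra headroom is added on top of the answer size.
    """
    if use_reasoning:
        return {
            "max_tokens": answer_tokens + reasoning_headroom,
            "temperature": 0.5,  # documented default with reasoning enabled
        }
    return {
        "max_tokens": answer_tokens,
        "reasoning_effort": None,  # disables reasoning, per the workaround
        "temperature": 0.2,        # documented default with reasoning disabled
    }


print(completion_params(500))
print(completion_params(500, use_reasoning=False))
```

The resulting dictionary could be unpacked into the SDK call, for example `client.chat.completions(model="sarvam-30b", messages=messages, **params)`, provided the SDK version in use accepts `reasoning_effort` as a keyword argument. Keep the total within the plan's max_tokens ceiling.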

Next Steps