Sarvam-30B (Chat LLM)
A 30B-parameter Mixture-of-Experts reasoning model trained from scratch and optimized for Indian languages, with only 2.4B active parameters per token. It delivers strong reasoning, coding, and conversational capabilities while remaining efficient to deploy.
Highlights:
- 30B total parameters, 2.4B active — efficient MoE architecture with Grouped Query Attention
- Pre-trained on 16 trillion tokens across code, math, multilingual, and web data
- State-of-the-art Indian language performance across native and romanized scripts
- Optimized inference for H100, L40S, and Apple Silicon (MXFP4)
- OpenAI-compatible chat completions API
Key Features
Trained on the 10 most-spoken Indian languages with support for native script, romanized, and code-mixed inputs. Wins 89% of pairwise comparisons on Indian language benchmarks and 87% on STEM, math, and coding.
Mixture-of-Experts Transformer with 128 sparse experts and only 2.4B active parameters per token, enabling high throughput with 3x–6x gains on H100 and local execution on Apple Silicon via MXFP4.
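The sparse-expert idea above can be illustrated with a toy routing sketch: a router scores every expert and only the highest-scoring few run for each token, so per-token compute scales with the number of experts selected rather than the total. The scores and the top-k value here are illustrative assumptions, not Sarvam-30B's actual router.

```python
# Toy Mixture-of-Experts routing sketch. The random scores stand in for
# router logits; top_k=4 is an assumption for illustration, not the
# model's real configuration.
import random


def route_token(num_experts: int = 128, top_k: int = 4, seed: int = 0):
    """Pick the top_k highest-scoring experts for one token."""
    rng = random.Random(seed)
    scores = [rng.random() for _ in range(num_experts)]  # stand-in for router logits
    # Sort expert indices by score, keep the best top_k.
    return sorted(range(num_experts), key=lambda i: scores[i], reverse=True)[:top_k]


experts = route_token()
# Only the selected experts (plus shared layers) execute for this token,
# which is why a 30B-total model can run with ~2.4B active parameters.
```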
Achieves 97.0 on Math500, 92.1 on HumanEval, 92.7 on MBPP, and 88.3 on AIME 25 (96.7 with tools) — exceeding typical expectations for models with similar active compute.
Native tool calling with strong performance on BrowseComp (35.5) and Tau2 (45.7) for web-search-driven tasks, planning, retrieval, and multi-step task execution.
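Since the model exposes native tool calling through the OpenAI-compatible format, a request that offers the model a tool can be sketched as below. The `get_weather` tool is a hypothetical example; only the message and tool schema follow the OpenAI chat-completions convention the model card advertises.

```python
# Sketch of an OpenAI-compatible tool-calling request body for sarvam-30b.
# The get_weather tool is illustrative, not a built-in capability.
import json

payload = {
    "model": "sarvam-30b",
    "messages": [{"role": "user", "content": "What's the weather in Mumbai?"}],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_weather",  # hypothetical tool
                "description": "Look up current weather for a city.",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }
    ],
}
body = json.dumps(payload)  # serialized request body to POST to the API
```

When the model decides to use the tool, the response carries a `tool_calls` entry whose arguments the client executes before sending the result back as a `tool` role message.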
Learn More
For detailed information on architecture, training methodology, performance benchmarks, and inference optimizations, visit our blog.
Model Specifications
- Model ID: sarvam-30b
- Total Parameters: 30B (2.4B active per token)
- Architecture: MoE Transformer with GQA and 128 sparse experts
- Pre-training Data: 16T tokens
- Temperature range: 0 to 2
- Top-p range: 0 to 1
- Supports streaming and non-streaming responses
- OpenAI-compatible chat completions format
- License: Apache 2.0
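The spec list above notes streaming support in the OpenAI-compatible format, where the server sends `data: {...}` event lines ending with a `[DONE]` sentinel. A minimal sketch of reassembling the streamed content, using hard-coded sample lines in place of a live HTTP response:

```python
# Parse OpenAI-style server-sent-event lines from a streaming chat
# completion. The sample lines below stand in for a real response body.
import json


def iter_stream_content(lines):
    """Yield content deltas from 'data: {...}' SSE lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data: "):
            continue
        data = line[len("data: "):]
        if data == "[DONE]":  # end-of-stream sentinel
            break
        chunk = json.loads(data)
        delta = chunk["choices"][0]["delta"].get("content")
        if delta:
            yield delta


sample = [
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "data: [DONE]",
]
print("".join(iter_stream_content(sample)))  # prints "Hello"
```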
Key Capabilities
- Basic Chat Completion: a simple, single-turn interaction where the user asks a question and the model replies with a single, direct response.
- Multi-turn Conversation
- Streaming
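A basic chat completion can be sketched as a single POST in the OpenAI-compatible format. The base URL and API key below are placeholder assumptions; substitute your actual deployment. Temperature and top-p stay inside the documented 0–2 and 0–1 ranges.

```python
# Minimal non-streaming chat-completion request for sarvam-30b.
# BASE_URL and API_KEY are placeholders, not real credentials.
import json
import urllib.request

BASE_URL = "https://api.example.com/v1"  # placeholder endpoint
API_KEY = "YOUR_API_KEY"

payload = {
    "model": "sarvam-30b",
    "messages": [
        {"role": "user", "content": "Namaste! Introduce yourself in one sentence."}
    ],
    "temperature": 0.7,  # within the documented 0-2 range
    "top_p": 0.9,        # within the documented 0-1 range
    "stream": False,
}

req = urllib.request.Request(
    f"{BASE_URL}/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    },
)
# Uncomment to send against a real deployment:
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read())["choices"][0]["message"]["content"])
```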