Sarvam-105B (Flagship Chat LLM)

Sarvam AI’s flagship Mixture-of-Experts reasoning model, trained from scratch with Multi-head Latent Attention (MLA) for efficient long-context inference. It matches or outperforms most open- and closed-source frontier models in its class across knowledge, reasoning, and agentic benchmarks.

Highlights:

  • 105B+ total parameters — our most capable MoE model with Multi-head Latent Attention
  • Pre-trained on 12 trillion tokens across code, math, multilingual, and web data
  • 98.6 on Math500, 88.3 on AIME 25 (96.7 with tools), 49.5 on BrowseComp
  • State-of-the-art Indian language performance: wins 90% of pairwise comparisons
  • Powers Indus, Sarvam’s AI assistant for complex reasoning and agentic workflows
  • OpenAI-compatible chat completions API | Apache 2.0 open-source

Key Features

Flagship Indian Language Support

Wins 90% of pairwise comparisons across Indian language benchmarks and 84% on STEM, math, and coding. Trained extensively on native script, romanized, and code-mixed inputs across the 10 most-spoken Indian languages.
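The three input styles mentioned above can all be sent as ordinary chat messages. The prompts below are illustrative examples (not taken from the official docs) showing what native-script, romanized, and code-mixed versions of the same question look like:

```python
# Illustrative prompts for the three Indian-language input styles the model
# is trained on. The wording is an example, not from the official docs.
prompts = {
    "native_script": "भारत में मानसून कृषि को कैसे प्रभावित करता है?",
    "romanized": "Bharat mein monsoon krishi ko kaise prabhavit karta hai?",
    "code_mixed": "Monsoon ka impact Indian agriculture par kya hota hai?",
}

# Any of the styles goes into a standard chat message; no special flag is needed.
messages = [{"role": "user", "content": prompts["code_mixed"]}]
```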

Advanced Reasoning

98.6 on Math500, 88.3 on AIME 25 (96.7 with tools), 85.8 on HMMT, and 69.1 on Beyond AIME — reflecting deep multi-step reasoning and complex mathematical problem solving.

Agentic Capabilities

49.5 on BrowseComp and 68.3 on Tau2 (avg.) — highest among compared models. Optimized for tool use, long-horizon reasoning, and environment interaction in real-world workflows.
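Assuming the OpenAI-compatible chat completions format extends to the standard `tools` parameter (this page does not confirm it), an agentic request might be shaped as follows. The `search_web` tool name and schema are hypothetical:

```python
# Sketch of a tool-calling request payload, assuming the OpenAI-compatible
# `tools` parameter is supported. The tool itself is hypothetical.
tools = [
    {
        "type": "function",
        "function": {
            "name": "search_web",  # hypothetical tool name
            "description": "Search the web and return the top results.",
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
        },
    }
]

request = {
    "model": "sarvam-105b",
    "messages": [{"role": "user", "content": "Find the latest RBI repo rate."}],
    "tools": tools,
}
```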

Efficient MoE Architecture

Mixture-of-Experts Transformer with 128 sparse experts and Multi-head Latent Attention (MLA), a compressed attention formulation that reduces memory requirements for long-context inference.
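The memory saving from compressed attention can be sketched with a toy low-rank KV cache: instead of caching full-width keys and values per token, only a small shared latent is cached and keys/values are reconstructed from it at attention time. All dimensions and weights below are invented for illustration and bear no relation to the model's actual configuration:

```python
import numpy as np

# Toy sketch of latent-attention KV compression (illustrative only; the
# dimensions and projections are invented, not Sarvam's actual config).
rng = np.random.default_rng(0)
seq_len, d_model, d_latent = 8, 512, 64  # d_latent << d_model

x = rng.standard_normal((seq_len, d_model))
W_down = rng.standard_normal((d_model, d_latent))  # compress into latent
W_up_k = rng.standard_normal((d_latent, d_model))  # reconstruct keys
W_up_v = rng.standard_normal((d_latent, d_model))  # reconstruct values

# Standard attention caches K and V: 2 * seq_len * d_model floats.
# Latent attention caches only the shared latent: seq_len * d_latent floats.
latent_cache = x @ W_down
k = latent_cache @ W_up_k  # recomputed from the latent at attention time
v = latent_cache @ W_up_v

full_cache_floats = 2 * seq_len * d_model
mla_cache_floats = latent_cache.size
print(full_cache_floats / mla_cache_floats)  # 16x smaller in this toy setup
```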

Learn More

For detailed information on architecture, training methodology, performance benchmarks, and inference optimizations, visit our blog.

Model Specifications

Key Considerations
  • Model ID: sarvam-105b
  • Total Parameters: 105B+ with MoE architecture and 128 sparse experts
  • Attention: Multi-head Latent Attention (MLA)
  • Pre-training Data: 12T tokens
  • Temperature range: 0 to 2
  • Top-p range: 0 to 1
  • Supports streaming and non-streaming responses
  • OpenAI-compatible chat completions format
  • License: Apache 2.0
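The parameter ranges above can be enforced client-side before a request is sent. The helper below is an illustrative sketch, not part of any SDK:

```python
# Minimal request builder (illustrative, not part of the SDK) that validates
# sampling parameters against the documented ranges.
def build_request(prompt, temperature=0.5, top_p=1.0, stream=False):
    if not 0 <= temperature <= 2:
        raise ValueError("temperature must be in [0, 2]")
    if not 0 <= top_p <= 1:
        raise ValueError("top_p must be in [0, 1]")
    return {
        "model": "sarvam-105b",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "top_p": top_p,
        "stream": stream,  # both streaming and non-streaming are supported
    }

req = build_request("Namaste!", temperature=0.7)
```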

Choosing Between Sarvam Models

| Feature | Sarvam-30B | Sarvam-105B |
| --- | --- | --- |
| Total Parameters | 30B (2.4B active) | 105B+ |
| Architecture | MoE + GQA | MoE + MLA |
| Pre-training Data | 16T tokens | 12T tokens |
| Best for | Real-time deployment & conversational AI | Maximum quality, reasoning & agentic workflows |
| Math500 | 97.0 | 98.6 |
| AIME 25 | 88.3 | 88.3 (96.7 w/ tools) |
| BrowseComp | 35.5 | 49.5 |
| Indian Language Win Rate | 89% avg | 90% avg |
| Inference | H100, L40S, Apple Silicon | Server-centric (H100) |

Choose Sarvam-30B for a balanced performance-to-cost ratio and real-time conversational workloads, and Sarvam-105B when you need the highest quality outputs for complex reasoning and agentic tasks. Sarvam-M (24B) is still available as a legacy model.

Key Capabilities

A simple, single-turn interaction: the user asks a question and the model replies with its highest-quality response, drawing on its 105B+ parameters.

```python
from sarvamai import SarvamAI

client = SarvamAI(
    api_subscription_key="YOUR_SARVAM_API_KEY",
)

response = client.chat.completions(
    model="sarvam-105b",
    messages=[
        {"role": "user", "content": "Explain the economic impact of GST implementation in India."}
    ],
    temperature=0.5,
    top_p=1,
    max_tokens=2000,
)

print(response.choices[0].message.content)
```

Next Steps