Sarvam-105B (Flagship Chat LLM)

Sarvam AI’s flagship Mixture-of-Experts reasoning model, trained from scratch with Multi-head Latent Attention (MLA) for efficient long-context inference. It matches or outperforms most open- and closed-source frontier models in its class across knowledge, reasoning, and agentic benchmarks.

Highlights:

  • 105B+ total parameters — our most capable MoE model with Multi-head Latent Attention
  • Pre-trained on 12 trillion tokens across code, math, multilingual, and web data
  • 98.6 on Math500, 88.3 on AIME 25 (96.7 with tools), 49.5 on BrowseComp
  • State-of-the-art Indian language performance: wins 90% of pairwise comparisons
  • Powers Indus, Sarvam’s AI assistant for complex reasoning and agentic workflows
  • OpenAI-compatible chat completions API | Apache 2.0 open-source

Key Features

Flagship Indian Language Support

Wins 90% of pairwise comparisons across Indian language benchmarks and 84% on STEM, math, and coding. Trained extensively on native script, romanized, and code-mixed inputs across the 10 most-spoken Indian languages.
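The three input styles mentioned above can all be sent as ordinary chat messages. The prompts below are illustrative examples (not taken from the official docs) showing what native-script, romanized, and code-mixed versions of the same question look like:

```python
# Illustrative prompts for the three Indian-language input styles the model
# is trained on. The wording is an example, not from the official docs.
prompts = {
    "native_script": "भारत में मानसून कृषि को कैसे प्रभावित करता है?",
    "romanized": "Bharat mein monsoon krishi ko kaise prabhavit karta hai?",
    "code_mixed": "Monsoon ka impact Indian agriculture par kya hota hai?",
}

# Any of the styles goes into a standard chat message; no special flag is needed.
messages = [{"role": "user", "content": prompts["code_mixed"]}]
```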

Advanced Reasoning

98.6 on Math500, 88.3 on AIME 25 (96.7 with tools), 85.8 on HMMT, and 69.1 on Beyond AIME — reflecting deep multi-step reasoning and complex mathematical problem solving.

Agentic Capabilities

49.5 on BrowseComp and 68.3 on Tau2 (avg.) — highest among compared models. Optimized for tool use, long-horizon reasoning, and environment interaction in real-world workflows.
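Assuming the OpenAI-compatible chat completions format extends to the standard `tools` parameter (this page does not confirm it), an agentic request might be shaped as follows. The `search_web` tool name and schema are hypothetical:

```python
# Sketch of a tool-calling request payload, assuming the OpenAI-compatible
# `tools` parameter is supported. The tool itself is hypothetical.
tools = [
    {
        "type": "function",
        "function": {
            "name": "search_web",  # hypothetical tool name
            "description": "Search the web and return the top results.",
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
        },
    }
]

request = {
    "model": "sarvam-105b",
    "messages": [{"role": "user", "content": "Find the latest RBI repo rate."}],
    "tools": tools,
}
```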

Efficient MoE Architecture

Mixture-of-Experts Transformer with 128 sparse experts and Multi-head Latent Attention (MLA), a compressed attention formulation that reduces memory requirements for long-context inference.
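The memory saving from compressed attention can be sketched with a toy low-rank KV cache: instead of caching full-width keys and values per token, only a small shared latent is cached and keys/values are reconstructed from it at attention time. All dimensions and weights below are invented for illustration and bear no relation to the model's actual configuration:

```python
import numpy as np

# Toy sketch of latent-attention KV compression (illustrative only; the
# dimensions and projections are invented, not Sarvam's actual config).
rng = np.random.default_rng(0)
seq_len, d_model, d_latent = 8, 512, 64  # d_latent << d_model

x = rng.standard_normal((seq_len, d_model))
W_down = rng.standard_normal((d_model, d_latent))  # compress into latent
W_up_k = rng.standard_normal((d_latent, d_model))  # reconstruct keys
W_up_v = rng.standard_normal((d_latent, d_model))  # reconstruct values

# Standard attention caches K and V: 2 * seq_len * d_model floats.
# Latent attention caches only the shared latent: seq_len * d_latent floats.
latent_cache = x @ W_down
k = latent_cache @ W_up_k  # recomputed from the latent at attention time
v = latent_cache @ W_up_v

full_cache_floats = 2 * seq_len * d_model
mla_cache_floats = latent_cache.size
print(full_cache_floats / mla_cache_floats)  # 16x smaller in this toy setup
```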

Learn More

For detailed information on architecture, training methodology, performance benchmarks, and inference optimizations, visit our blog.

Model Specifications

Key Considerations
  • Model ID: sarvam-105b
  • Total Parameters: 105B+ with MoE architecture and 128 sparse experts
  • Attention: Multi-head Latent Attention (MLA)
  • Pre-training Data: 12T tokens
  • Temperature range: 0 to 2
  • Top-p range: 0 to 1
  • Supports streaming and non-streaming responses
  • OpenAI-compatible chat completions format
  • License: Apache 2.0
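The parameter ranges above can be enforced client-side before a request is sent. The helper below is an illustrative sketch, not part of any SDK:

```python
# Minimal request builder (illustrative, not part of the SDK) that validates
# sampling parameters against the documented ranges.
def build_request(prompt, temperature=0.5, top_p=1.0, stream=False):
    if not 0 <= temperature <= 2:
        raise ValueError("temperature must be in [0, 2]")
    if not 0 <= top_p <= 1:
        raise ValueError("top_p must be in [0, 1]")
    return {
        "model": "sarvam-105b",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "top_p": top_p,
        "stream": stream,  # both streaming and non-streaming are supported
    }

req = build_request("Namaste!", temperature=0.7)
```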

Choosing Between Sarvam Models

| Feature | Sarvam-30B | Sarvam-105B |
| --- | --- | --- |
| Total Parameters | 30B (2.4B active) | 105B+ |
| Architecture | MoE + GQA | MoE + MLA |
| Pre-training Data | 16T tokens | 12T tokens |
| Best for | Real-time deployment & conversational AI | Maximum quality, reasoning & agentic workflows |
| Math500 | 97.0 | 98.6 |
| AIME 25 | 88.3 | 88.3 (96.7 w/ tools) |
| BrowseComp | 35.5 | 49.5 |
| Indian Language Win Rate | 89% avg | 90% avg |
| Inference | H100, L40S, Apple Silicon | Server-centric (H100) |

Choose Sarvam-30B for a balanced performance-to-cost ratio and real-time conversational workloads, and Sarvam-105B when you need the highest quality outputs for complex reasoning and agentic tasks. Sarvam-M (24B) is still available as a legacy model.

Key Capabilities

A simple, single-turn interaction: the user asks a question and the model replies with its highest-quality response, drawing on its 105B+ parameters.

```python
from sarvamai import SarvamAI

client = SarvamAI(
    api_subscription_key="YOUR_SARVAM_API_KEY",
)

response = client.chat.completions(
    model="sarvam-105b",
    messages=[
        {"role": "user", "content": "Explain the economic impact of GST implementation in India."}
    ],
    temperature=0.5,
    top_p=1,
    max_tokens=2000,
)

print(response.choices[0].message.content)
```

Next Steps