Sarvam Vision

Sarvam Vision is a 3B-parameter state-space Vision Language Model (VLM) purpose-built for high-accuracy Document Intelligence. It powers our Document Intelligence pipeline.

Why Sarvam Vision?

One of the most challenging problems in vision AI today is accurate document intelligence for Indian languages. Much of India’s knowledge—historical texts, government records, academic papers, and cultural archives—remains locked in libraries, scanned collections, and legacy documents. Unlocking this vast repository is essential for preserving cultural heritage and making knowledge accessible.

While frontier Vision Language Models have set a high bar for processing modern English documents, a significant gap remains: most global models treat Indian languages as secondary, often resulting in lower accuracy for regional scripts. Sarvam Vision bridges this gap with native support for 22 Indian languages, delivering world-class accuracy where others fall short.

Want to learn more about how we built Sarvam Vision? Check out our blog post.


What You Can Do

  • Text Extraction: Extract text from PDFs and scanned documents in 23 languages (22 Indian + English)
  • Tables: Convert complex tables to HTML or Markdown
  • Structure Preservation: Maintain document layout, reading order, and hierarchies

Supported Languages

All 22 official Indian languages plus English:

Language     Code      Language     Code      Language     Code
Hindi        hi-IN     Assamese     as-IN     Konkani      kok-IN
Bengali      bn-IN     Urdu         ur-IN     Maithili     mai-IN
Tamil        ta-IN     Sanskrit     sa-IN     Sindhi       sd-IN
Telugu       te-IN     Nepali       ne-IN     Kashmiri     ks-IN
Marathi      mr-IN     Dogri        doi-IN    Manipuri     mni-IN
Gujarati     gu-IN     Bodo         brx-IN    Santali      sat-IN
Kannada      kn-IN     Punjabi      pa-IN     English      en-IN
Malayalam    ml-IN     Odia         od-IN
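
When a language comes from user input, it can help to check it against the codes in the table above before creating a job. The following is a minimal sketch of such a check; the names SUPPORTED_LANGUAGES and resolve_language_code are hypothetical helpers written for this illustration and are not part of the sarvamai SDK.

```python
# Language codes copied from the Supported Languages table above.
# SUPPORTED_LANGUAGES and resolve_language_code are illustrative helpers,
# not part of the sarvamai SDK.
SUPPORTED_LANGUAGES = {
    "hi-IN": "Hindi", "bn-IN": "Bengali", "ta-IN": "Tamil",
    "te-IN": "Telugu", "mr-IN": "Marathi", "gu-IN": "Gujarati",
    "kn-IN": "Kannada", "ml-IN": "Malayalam", "as-IN": "Assamese",
    "ur-IN": "Urdu", "sa-IN": "Sanskrit", "ne-IN": "Nepali",
    "doi-IN": "Dogri", "brx-IN": "Bodo", "pa-IN": "Punjabi",
    "od-IN": "Odia", "kok-IN": "Konkani", "mai-IN": "Maithili",
    "sd-IN": "Sindhi", "ks-IN": "Kashmiri", "mni-IN": "Manipuri",
    "sat-IN": "Santali", "en-IN": "English",
}

# Case-insensitive lookup keyed by both code and language name.
_LOOKUP = {}
for _code, _name in SUPPORTED_LANGUAGES.items():
    _LOOKUP[_code.lower()] = _code
    _LOOKUP[_name.lower()] = _code


def resolve_language_code(value: str) -> str:
    """Return the supported code for a code or language name.

    Raises ValueError if the value matches nothing in the table.
    """
    key = value.strip().lower()
    if key not in _LOOKUP:
        raise ValueError(f"Unsupported language: {value!r}")
    return _LOOKUP[key]
```

For example, resolve_language_code("Tamil") returns "ta-IN", and a value such as "fr-FR" raises ValueError instead of reaching the API.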

Capabilities

High-Fidelity Document Intelligence

Sarvam Vision extracts text from documents with exceptional accuracy, preserving the original structure and reading order across 23 languages (22 Indian + English).

Features:

  • High-accuracy text extraction from PDFs and scanned documents
  • Preserves document layout and reading order
  • Native support for the scripts of all 22 supported Indian languages
  • Outputs clean HTML or Markdown

Quick Start

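Get started with Document Intelligence for high-fidelity text extraction across all supported languages. The example creates a job, uploads a document, waits for processing to finish, and downloads the result.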

from sarvamai import SarvamAI

client = SarvamAI(
    api_subscription_key="YOUR_SARVAM_API_KEY"
)

# Create a document intelligence job
job = client.document_intelligence.create_job(
    language="hi-IN",
    output_format="md"
)
print(f"Job created: {job.job_id}")

# Upload document
job.upload_file("document.pdf")
print("File uploaded")

# Start processing
job.start()
print("Job started")

# Wait for completion
status = job.wait_until_complete()
print(f"Job completed with state: {status.job_state}")

# Get processing metrics
metrics = job.get_page_metrics()
print(f"Page metrics: {metrics}")

# Download output (ZIP file containing the processed document)
job.download_output("./output.zip")
print("Output saved to ./output.zip")
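
Since the output arrives as a ZIP file, a typical next step is to read the extracted text out of it. The sketch below uses only the Python standard library. It assumes the archive contains one or more .md or .html files encoded as UTF-8; the exact layout of the archive is not specified here, so treat the file names as unknown and inspect the archive first. read_text_outputs is a hypothetical helper, not an SDK function.

```python
import zipfile


def read_text_outputs(zip_path, suffixes=(".md", ".html")):
    """Return {archive member name: text} for Markdown/HTML files in the ZIP.

    Assumption: text outputs are UTF-8 encoded. Other members (for example
    images) are skipped.
    """
    texts = {}
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if name.lower().endswith(suffixes):
                texts[name] = archive.read(name).decode("utf-8")
    return texts


if __name__ == "__main__":
    # Path matches the download_output call in the Quick Start.
    for name, text in read_text_outputs("./output.zip").items():
        print(f"{name}: {len(text)} characters")
```

Reading members directly from the archive avoids writing files to disk; call archive.extractall() instead if the extracted files are needed on disk.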

Model Specifications

Technical Specifications
  • Model Size: 3B parameters
  • Supported Input Formats: PDF, PNG, JPG, ZIP (flat archive with JPG/PNG document pages)
  • Output Formats: HTML or Markdown (md), delivered as a ZIP file
  • Languages: 23 languages (22 Indian + English)
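
For multi-page scans stored as individual images, the ZIP input format above expects a flat archive of JPG/PNG pages. The following sketch builds one with the standard library. build_flat_page_archive is a hypothetical helper, and sorting pages by file name is an assumption made for illustration: how page order is determined is not stated here.

```python
import zipfile
from pathlib import Path

# Extensions taken from the supported input formats listed above.
# Whether other spellings such as ".jpeg" are accepted is not stated.
PAGE_SUFFIXES = {".jpg", ".png"}


def build_flat_page_archive(image_dir, zip_path):
    """Write all JPG/PNG files directly inside image_dir into a flat ZIP.

    Returns the list of member names written. Subdirectories are ignored so
    that the archive contains no nested folders.
    """
    pages = sorted(
        p for p in Path(image_dir).iterdir()
        if p.is_file() and p.suffix.lower() in PAGE_SUFFIXES
    )
    if not pages:
        raise ValueError(f"No JPG/PNG pages found in {image_dir}")
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for page in pages:
            # arcname=page.name drops the directory part, keeping the archive flat.
            archive.write(page, arcname=page.name)
    return [page.name for page in pages]
```

Zero-padded file names such as page_001.png and page_002.png keep the sorted order aligned with the intended page order.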

Next Steps