Document Digitization Overview

Sarvam’s Document Digitization API provides enterprise-grade document processing powered by Sarvam Vision, our state-of-the-art multimodal model.

Transform any document into structured, searchable, and machine-readable data with world-class accuracy.

Sarvam Vision

Our 3B-parameter multimodal model that powers Document Digitization, with state-of-the-art (SOTA) performance on global and Indic benchmarks.

What is Document Digitization?

Document Digitization is a comprehensive document processing pipeline powered by Sarvam Vision that:

Extracts Text: High-fidelity text extraction across 23 languages (22 Indian + English)
Preserves Structure: Maintains document layout, reading order, and hierarchies
Parses Tables: Transforms tables into structured HTML or Markdown formats
Outputs Structured Data: Generates clean, machine-readable HTML or Markdown output

Key Features

23 Language Support

Native support for all Constitutionally recognized Indian languages and English with script-native accuracy.

Multiple Output Formats

Export to HTML or Markdown files, delivered as a ZIP archive. A JSON file with structured page-level data is always included alongside your chosen format.

Table Extraction

Intelligent table detection and conversion to structured formats.

Batch Processing

Process multi-page documents and ZIP archives with automatic page handling.

Layout Preservation

Intelligent reading order detection and complex layout handling.

Enterprise-Ready

Scalable API with job management, progress tracking, and error handling.

Supported Languages

Document Digitization supports all 22 Constitutionally recognized Indian languages, plus English:

Primary Languages

Language	Code	Script
Hindi	`hi-IN`	Devanagari
Bengali	`bn-IN`	Bengali
Tamil	`ta-IN`	Tamil
Telugu	`te-IN`	Telugu
Marathi	`mr-IN`	Devanagari
Gujarati	`gu-IN`	Gujarati
Kannada	`kn-IN`	Kannada
Malayalam	`ml-IN`	Malayalam
Odia	`od-IN`	Odia
Punjabi	`pa-IN`	Gurmukhi
English	`en-IN`	Latin

Supported Input Formats

Format	Extension	Description
PDF	`.pdf`	Multi-page PDF documents (max 10 pages)
PNG	`.png`	Document page images
JPEG	`.jpg`, `.jpeg`	Document page images
ZIP	`.zip`	Flat archive containing document page images — JPG/PNG (max 10 images)

Page Limit: PDF and ZIP uploads are each limited to 10 pages. Exceeding the limit returns a `422 Unprocessable Entity` error with code `max_page_limit_exceeded`. Split larger documents into batches of 10 pages or fewer before uploading.
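The batching step can be sketched as follows. This is a minimal illustration that assumes the document is already available as an ordered list of page image paths; `batch_pages` and `MAX_PAGES_PER_UPLOAD` are hypothetical names, not part of the SDK.

```python
from typing import List, Sequence

MAX_PAGES_PER_UPLOAD = 10  # page limit stated in this section


def batch_pages(pages: Sequence[str],
                batch_size: int = MAX_PAGES_PER_UPLOAD) -> List[List[str]]:
    """Split an ordered list of page paths into consecutive batches.

    Every batch holds at most `batch_size` pages and page order is preserved.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [list(pages[i:i + batch_size])
            for i in range(0, len(pages), batch_size)]


# 23 hypothetical page images split into uploads of 10, 10, and 3 pages.
pages = [f"page_{n:03d}.png" for n in range(1, 24)]
batches = batch_pages(pages)
print([len(batch) for batch in batches])
```

Each batch can then be packaged and submitted as its own job.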

For ZIP files, include only JPG and PNG document pages in a flat structure (no nested folders). The API will process all pages in the archive and maintain page order based on filename.
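One way to build such an archive is sketched below with Python's standard `zipfile` module. The helper name `build_flat_zip` and the `page_001` naming scheme are assumptions made for illustration; the exact filename sort rule used by the API is not stated here, so zero-padded names are chosen because they sort the same way under lexicographic and natural ordering. Accepting `.jpeg` alongside `.jpg` is also an assumption.

```python
import zipfile
from pathlib import Path
from typing import Iterable, List

ALLOWED_EXTENSIONS = {".jpg", ".jpeg", ".png"}  # .jpeg is assumed to be accepted
MAX_IMAGES_PER_ZIP = 10  # limit stated in this section


def build_flat_zip(image_paths: Iterable[str], zip_path: str) -> List[str]:
    """Write page images into a ZIP with no folders and return the member names.

    Members are renamed page_001, page_002, ... in the order given, so that
    filename order matches the intended page order.
    """
    images = list(image_paths)
    if len(images) > MAX_IMAGES_PER_ZIP:
        raise ValueError(f"At most {MAX_IMAGES_PER_ZIP} images per ZIP")

    names: List[str] = []
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for index, image in enumerate(images, start=1):
            suffix = Path(image).suffix.lower()
            if suffix not in ALLOWED_EXTENSIONS:
                raise ValueError(f"Unsupported page image: {image}")
            # A bare arcname (no directory part) keeps the archive flat.
            arcname = f"page_{index:03d}{suffix}"
            archive.write(image, arcname=arcname)
            names.append(arcname)
    return names
```

Renaming on the way in avoids surprises from source filenames such as `scan10.png` sorting before `scan2.png`.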

Output Formats

The output_format parameter controls the primary content format in the output ZIP archive. You can choose between html and md (Markdown).

JSON is always included. Whichever format you choose, the output ZIP archive also contains a JSON file with structured page-level data. It holds the same extracted content in machine-readable form, so you can process results programmatically alongside the human-readable HTML or Markdown output.

Format	`output_format` value	Description
Markdown	`md`	Human-readable Markdown output + JSON page data
HTML	`html`	Rich HTML output for web rendering + JSON page data
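Reading the output archive could look like the sketch below. The member names inside the ZIP and the JSON schema are not documented in this section, so the hypothetical helper `read_output_archive` finds files by extension only and assumes UTF-8 text.

```python
import json
import zipfile
from typing import Any, Dict


def read_output_archive(zip_path: str) -> Dict[str, Any]:
    """Load every Markdown, HTML, and JSON member of an output ZIP.

    Returns a dict keyed by member name: text for .md/.html files and
    parsed objects for .json files. Other members are skipped.
    """
    contents: Dict[str, Any] = {}
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            lower = name.lower()
            if lower.endswith(".json"):
                contents[name] = json.loads(archive.read(name).decode("utf-8"))
            elif lower.endswith((".md", ".html")):
                contents[name] = archive.read(name).decode("utf-8")
    return contents
```

Calling `read_output_archive("./output.zip")` after a download would give both the rendered document and the page-level data in one pass.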

Quick Start

API parameter names: Document Digitization takes `language` and `output_format` in `job_parameters`, and `output_format` accepts only `"md"` or `"html"`. Do not use `language_code` (ignored by the API) or `"markdown"` (returns 400; use `"md"` instead). This differs from the STT, translate, and LID endpoints.
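A small client-side check can catch both mistakes before a request is sent. The sketch below is illustrative only: `check_job_parameters` is a hypothetical helper, and it validates just the two rules described above rather than the full parameter schema.

```python
from typing import Dict

VALID_OUTPUT_FORMATS = {"md", "html"}


def check_job_parameters(params: Dict[str, str]) -> Dict[str, str]:
    """Raise ValueError for the parameter mistakes described above."""
    if "language_code" in params:
        # The API ignores this key, so the intended language would be lost.
        raise ValueError("Use 'language' instead of 'language_code'")
    output_format = params.get("output_format")
    if output_format is not None and output_format not in VALID_OUTPUT_FORMATS:
        raise ValueError(
            f"output_format must be 'md' or 'html', got {output_format!r}"
        )
    return params


check_job_parameters({"language": "hi-IN", "output_format": "md"})  # passes
```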

Get started with Document Digitization in minutes:

from sarvamai import SarvamAI

client = SarvamAI(
    api_subscription_key="YOUR_SARVAM_API_KEY"
)

# Create a Document Digitization job
job = client.document_intelligence.create_job(
    language="hi-IN",           # Target language (BCP-47 format)
    output_format="md"          # Output format: "html" or "md" (delivered as ZIP)
)

# Upload your document
job.upload_file("document.pdf")

# Start processing
job.start()

# Wait for completion
status = job.wait_until_complete()
print(f"Job completed: {status.job_state}")

# Get processing metrics
metrics = job.get_page_metrics()
print(f"Pages processed: {metrics['pages_processed']}")

# Download the output (ZIP file containing the processed document)
# The ZIP includes your chosen format (MD/HTML) plus a JSON file with page-level data
job.download_output("./output.zip")
print("Output saved to ./output.zip")

Response Format

Job Status Response

{
  "job_id": "abc123-def456-ghi789",
  "job_state": "Completed",
  "created_at": "2026-02-04T10:30:00Z",
  "updated_at": "2026-02-04T10:35:00Z",
  "page_metrics": {
    "total_pages": 10,
    "pages_processed": 10,
    "pages_succeeded": 10,
    "pages_failed": 0
  }
}
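The `page_metrics` object in a response like the one above can be turned into a typed value for progress reporting. This is a sketch under the assumption that the four fields shown are always present; `PageMetrics` and `parse_page_metrics` are hypothetical names.

```python
import json
from dataclasses import dataclass


@dataclass
class PageMetrics:
    total_pages: int
    pages_processed: int
    pages_succeeded: int
    pages_failed: int

    @property
    def progress(self) -> float:
        """Fraction of pages processed so far, between 0.0 and 1.0."""
        if self.total_pages == 0:
            return 0.0
        return self.pages_processed / self.total_pages


def parse_page_metrics(status_json: str) -> PageMetrics:
    """Extract page_metrics from a job status response body."""
    metrics = json.loads(status_json)["page_metrics"]
    return PageMetrics(
        total_pages=metrics["total_pages"],
        pages_processed=metrics["pages_processed"],
        pages_succeeded=metrics["pages_succeeded"],
        pages_failed=metrics["pages_failed"],
    )


sample = """{"job_state": "Running", "page_metrics": {"total_pages": 10,
"pages_processed": 4, "pages_succeeded": 3, "pages_failed": 1}}"""
metrics = parse_page_metrics(sample)
print(f"{metrics.progress:.0%} processed, {metrics.pages_failed} failed")
```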

Job States

State	Description
`Accepted`	Job created, awaiting file upload
`Pending`	File uploaded, waiting to start
`Running`	Job is being processed
`Completed`	All pages processed successfully
`PartiallyCompleted`	Some pages succeeded, some failed
`Failed`	All pages failed or job-level error
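The state table maps naturally onto a lookup that tells a client what to do next. The sketch below is an assumption-laden illustration: the action labels are hypothetical strings chosen for this example, not SDK values.

```python
# Next step for each documented job state (labels are illustrative only).
NEXT_ACTION = {
    "Accepted": "upload_file",           # job created, awaiting file upload
    "Pending": "start",                  # file uploaded, waiting to start
    "Running": "wait",                   # processing in progress
    "Completed": "download",             # all pages succeeded
    "PartiallyCompleted": "review_failed_pages",  # some pages failed
    "Failed": "inspect_error",           # all pages failed or job-level error
}

TERMINAL_STATES = {"Completed", "PartiallyCompleted", "Failed"}


def next_action(job_state: str) -> str:
    """Return the suggested next step for a job state from the table above."""
    try:
        return NEXT_ACTION[job_state]
    except KeyError:
        raise ValueError(f"Unknown job state: {job_state!r}") from None


def is_terminal(job_state: str) -> bool:
    """True once the job will no longer change state."""
    return job_state in TERMINAL_STATES
```

Treating `PartiallyCompleted` separately from `Failed` keeps a client from discarding jobs where most pages succeeded.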

Error Handling

Error Handling Example

from sarvamai import SarvamAI
from sarvamai.core.api_error import ApiError

client = SarvamAI(api_subscription_key="YOUR_SARVAM_API_KEY")

try:
    job = client.document_intelligence.create_job(
        language="hi-IN",
        output_format="md"
    )
    job.upload_file("document.pdf")
    job.start()
    status = job.wait_until_complete()

    if status.job_state == "Completed":
        job.download_output("./output.zip")
        print("Output saved to ./output.zip")
    else:
        # Covers both PartiallyCompleted and Failed
        print(f"Job ended in state {status.job_state}: {status}")

except ApiError as e:
    if e.status_code == 400:
        print(f"Bad request: {e.body}")
    elif e.status_code == 403:
        print("Invalid API key")
    elif e.status_code == 429:
        print("Rate limit exceeded")
    else:
        print(f"Error {e.status_code}: {e.body}")
except FileNotFoundError:
    print("Document file not found")

Error Codes

The full error-code table, retry guidance, and SDK exception reference live on the central Errors & Troubleshooting page. Errors specific to this API:

HTTP Status	Error Code	Description
`404`	`not_found_error`	Job not found
`422`	`unprocessable_entity_error`	Invalid file format or corrupted file
`422`	`max_page_limit_exceeded`	Document exceeds the 10-page limit (applies to both PDF and ZIP)
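The table above can be mirrored in a lookup that turns a status and error code into an actionable hint. This is a minimal sketch: `describe_error` is a hypothetical helper, and how the error code is read out of the SDK's exception body is not covered here, so the code is passed in as a plain string.

```python
from typing import Dict, Optional, Tuple

# Hints for the API-specific errors listed in the table above.
ERROR_HINTS: Dict[Tuple[int, str], str] = {
    (404, "not_found_error"): "Job not found; check the job ID.",
    (422, "unprocessable_entity_error"):
        "Invalid or corrupted file; check the file format and re-upload.",
    (422, "max_page_limit_exceeded"):
        "Document exceeds 10 pages; split it into smaller batches.",
}


def describe_error(status_code: int, error_code: Optional[str]) -> str:
    """Return a hint for a known error, or a generic message otherwise."""
    hint = ERROR_HINTS.get((status_code, error_code or ""))
    if hint is not None:
        return hint
    return f"Unexpected error {status_code} ({error_code})"


print(describe_error(422, "max_page_limit_exceeded"))
```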

Best Practices

Choose the Right Format

Use Markdown for human-readable output and HTML for web rendering and rich formatting. JSON page-level data is always included regardless of your choice.

Specify Language

Always specify the correct language code for optimal text extraction accuracy, especially for Indian languages.

Handle Large Documents

Split documents longer than 10 pages into batches before uploading, then monitor page_metrics on each job to track progress and handle partial failures gracefully.

Use HTML for Tables

Choose HTML output format when you need to preserve table structures and rich formatting.

Limits

Limit	Value
Max pages per PDF	10 (`422 max_page_limit_exceeded` if exceeded)
Max images per ZIP	10
Max file size	200 MB
Supported input formats	PDF, PNG, JPG, ZIP
Rate limit	10 requests/minute (all plans) — see Rate Limits
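Staying under the request rate can be handled on the client with a sliding-window limiter, sketched below. This is one possible approach, not something the SDK provides: `SlidingWindowLimiter` is a hypothetical class, and which calls count toward the limit is an assumption to confirm against the Rate Limits page.

```python
import time
from collections import deque
from typing import Callable, Deque


class SlidingWindowLimiter:
    """Allow at most `max_calls` calls in any `period`-second window."""

    def __init__(self, max_calls: int = 10, period: float = 60.0,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self.max_calls = max_calls
        self.period = period
        self._clock = clock
        self._sleep = sleep
        self._calls: Deque[float] = deque()  # timestamps of recent calls

    def _evict(self, now: float) -> None:
        # Drop timestamps that have left the window.
        while self._calls and now - self._calls[0] >= self.period:
            self._calls.popleft()

    def acquire(self) -> float:
        """Block until a call is allowed and return the seconds waited."""
        waited = 0.0
        now = self._clock()
        self._evict(now)
        while len(self._calls) >= self.max_calls:
            # Wait until the oldest call in the window expires.
            delay = max(self._calls[0] + self.period - now, 0.0)
            self._sleep(delay)
            waited += delay
            now = self._clock()
            self._evict(now)
        self._calls.append(now)
        return waited
```

Calling `limiter.acquire()` before each API request keeps a batch script from hitting a 429. The injectable `clock` and `sleep` arguments exist so the limiter can be tested without real delays.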

Next Steps

Sarvam Vision Model

Learn about the model powering Document Digitization.

API Reference

Complete API documentation with all parameters and options.

Try in API Dashboard

Upload and process documents in the API Dashboard.