# Document Digitization Overview

> Transform documents into structured, queryable data with Sarvam's Document Digitization API. Powered by Sarvam Vision for accurate text extraction and table parsing across 23 languages (22 Indian + English).

Sarvam's Document Digitization API provides enterprise-grade document processing powered by [Sarvam Vision](/api-reference-docs/getting-started/models/sarvam-vision), our state-of-the-art multimodal model.

Transform any document into structured, searchable, and machine-readable data with world-class accuracy.

Sarvam Vision, the 3B-parameter multimodal model that powers Document Digitization, reports state-of-the-art performance on global and Indic benchmarks.

***

## What is Document Digitization?

Document Digitization is a comprehensive document processing pipeline powered by Sarvam Vision that:

1. **Extracts Text**: High-fidelity text extraction across 23 languages (22 Indian + English)
2. **Preserves Structure**: Maintains document layout, reading order, and hierarchies
3. **Parses Tables**: Transforms tables into structured HTML or Markdown formats
4. **Outputs Structured Data**: Generates clean, machine-readable HTML or Markdown output

***

## Key Features

- Native support for all Constitutionally recognized Indian languages and English with script-native accuracy.
- Export to HTML or Markdown files, delivered as a ZIP archive. A JSON file with structured page-level data is always included alongside your chosen format.
- Intelligent table detection and conversion to structured formats.
- Process multi-page documents and ZIP archives with automatic page handling.
- Intelligent reading order detection and complex layout handling.
- Scalable API with job management, progress tracking, and error handling.

***

## Supported Languages

Document Digitization supports all 22 Constitutionally recognized Indian languages, plus English:

| Language  | Code    | Script     |
| --------- | ------- | ---------- |
| Hindi     | `hi-IN` | Devanagari |
| Bengali   | `bn-IN` | Bengali    |
| Tamil     | `ta-IN` | Tamil      |
| Telugu    | `te-IN` | Telugu     |
| Marathi   | `mr-IN` | Devanagari |
| Gujarati  | `gu-IN` | Gujarati   |
| Kannada   | `kn-IN` | Kannada    |
| Malayalam | `ml-IN` | Malayalam  |
| Odia      | `od-IN` | Odia       |
| Punjabi   | `pa-IN` | Gurmukhi   |
| English   | `en-IN` | Latin      |

| Language | Code     | Script            |
| -------- | -------- | ----------------- |
| Assamese | `as-IN`  | Assamese          |
| Urdu     | `ur-IN`  | Perso-Arabic      |
| Sanskrit | `sa-IN`  | Devanagari        |
| Nepali   | `ne-IN`  | Devanagari        |
| Konkani  | `kok-IN` | Devanagari        |
| Maithili | `mai-IN` | Devanagari        |
| Sindhi   | `sd-IN`  | Devanagari/Arabic |
| Kashmiri | `ks-IN`  | Perso-Arabic      |
| Dogri    | `doi-IN` | Devanagari        |
| Manipuri | `mni-IN` | Meetei Mayek      |
| Bodo     | `brx-IN` | Devanagari        |
| Santali  | `sat-IN` | Ol Chiki          |

***

## Supported Input Formats

| Format | Extension       | Description                                                            |
| ------ | --------------- | ---------------------------------------------------------------------- |
| PDF    | `.pdf`          | Multi-page PDF documents (max 10 pages)                                |
| PNG    | `.png`          | Document page images                                                   |
| JPEG   | `.jpg`, `.jpeg` | Document page images                                                   |
| ZIP    | `.zip`          | Flat archive containing document page images — JPG/PNG (max 10 images) |

**Page Limit:** Both PDF and ZIP uploads are limited to a maximum of **10 pages**. Exceeding this limit will return a `422 Unprocessable Entity` error with code `max_page_limit_exceeded`. Split larger documents into batches of 10 pages or fewer before uploading.
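As a rough sketch of the batching step, the helper below computes the page ranges to split a longer document into. It only does the arithmetic; `page_batches` and `MAX_PAGES_PER_JOB` are illustrative names, not part of the Sarvam SDK, and the actual PDF splitting is left to whichever PDF library you already use.

```python
from typing import Iterator, Tuple

MAX_PAGES_PER_JOB = 10  # documented limit for both PDF and ZIP uploads


def page_batches(
    total_pages: int, batch_size: int = MAX_PAGES_PER_JOB
) -> Iterator[Tuple[int, int]]:
    """Yield (first_page, last_page) ranges, 1-indexed and inclusive."""
    if total_pages < 0:
        raise ValueError("total_pages must be non-negative")
    if not 1 <= batch_size <= MAX_PAGES_PER_JOB:
        raise ValueError(f"batch_size must be between 1 and {MAX_PAGES_PER_JOB}")
    for first in range(1, total_pages + 1, batch_size):
        yield first, min(first + batch_size - 1, total_pages)


# A 23-page document becomes three uploads.
print(list(page_batches(23)))  # [(1, 10), (11, 20), (21, 23)]
```

Each range can then be written to its own file and submitted as a separate job.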

For ZIP files, include only JPG and PNG document pages in a flat structure (no nested folders). The API will process all pages in the archive and maintain page order based on filename.
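One way to meet these ZIP requirements is to build the archive yourself with Python's standard `zipfile` module. The sketch below is an assumption-laden example, not an official utility: `build_flat_zip` is a hypothetical helper, and the zero-padded `page_001`-style names are a defensive choice because the exact filename sort order used by the API is not specified here.

```python
import zipfile
from pathlib import Path
from typing import List, Sequence

ALLOWED_SUFFIXES = {".jpg", ".jpeg", ".png"}
MAX_IMAGES_PER_ZIP = 10  # documented limit for ZIP uploads


def build_flat_zip(image_paths: Sequence[str], zip_path: str) -> List[str]:
    """Write page images into a flat ZIP, renamed so filename order matches page order."""
    if not 1 <= len(image_paths) <= MAX_IMAGES_PER_ZIP:
        raise ValueError(f"Expected 1 to {MAX_IMAGES_PER_ZIP} images, got {len(image_paths)}")

    # Validate every path before creating the archive, so a bad input
    # never leaves a half-written ZIP behind.
    suffixes = [Path(path).suffix.lower() for path in image_paths]
    for path, suffix in zip(image_paths, suffixes):
        if suffix not in ALLOWED_SUFFIXES:
            raise ValueError(f"Unsupported page image: {path}")

    names = []
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for index, (path, suffix) in enumerate(zip(image_paths, suffixes), start=1):
            # No directory component in arcname keeps the archive flat.
            arcname = f"page_{index:03d}{suffix}"
            archive.write(path, arcname=arcname)
            names.append(arcname)
    return names
```

Pass the images in reading order; the returned list shows the names stored in the archive.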

***

## Output Formats

The `output_format` parameter controls the primary content format in the output ZIP archive. You can choose between `html` and `md` (Markdown).

**JSON is always included by default.** Regardless of whether you choose HTML or Markdown as your output format, a JSON file with structured page-level data is always included in the output ZIP archive. This JSON contains the same extracted content in a machine-readable format, making it easy to programmatically process the results alongside the human-readable HTML or Markdown output.
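To work with the archive programmatically, something like the following sketch may help. It assumes nothing about member names or the JSON schema, neither of which is described on this page: it simply groups archive members by file extension and parses any `.json` file it finds. `read_output_zip` is a hypothetical helper name.

```python
import json
import zipfile
from pathlib import PurePosixPath


def read_output_zip(zip_path: str) -> dict:
    """Group archive members by extension and parse every JSON file found."""
    members_by_suffix = {}
    parsed_json = {}
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if name.endswith("/"):
                continue  # skip directory entries
            suffix = PurePosixPath(name).suffix.lower()
            members_by_suffix.setdefault(suffix, []).append(name)
            if suffix == ".json":
                parsed_json[name] = json.loads(archive.read(name).decode("utf-8"))
    return {"members": members_by_suffix, "json": parsed_json}


# Usage, after downloading a job's output:
# result = read_output_zip("./output.zip")
# print(result["members"])      # which files the archive contains
# print(list(result["json"]))   # names of the parsed JSON files
```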

| Format   | `output_format` value | Description                                         |
| -------- | --------------------- | --------------------------------------------------- |
| Markdown | `md`                  | Human-readable Markdown output + JSON page data     |
| HTML     | `html`                | Rich HTML output for web rendering + JSON page data |

***

## Quick Start

**API parameter names:** In `job_parameters`, Document Digitization takes `language` (not `language_code`, which the API ignores) and `output_format` with the value `"md"` or `"html"` (not `"markdown"`, which returns `400`). These names differ from the STT, translate, and LID endpoints.
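Because a wrong parameter name is silently ignored, it can be worth checking values on the client side before creating a job. The helper below is a minimal sketch: `build_job_parameters` is a hypothetical name, the language set is copied from the Supported Languages tables on this page, and the check may be stricter than the API itself.

```python
SUPPORTED_LANGUAGES = {
    "hi-IN", "bn-IN", "ta-IN", "te-IN", "mr-IN", "gu-IN", "kn-IN", "ml-IN",
    "od-IN", "pa-IN", "en-IN", "as-IN", "ur-IN", "sa-IN", "ne-IN", "kok-IN",
    "mai-IN", "sd-IN", "ks-IN", "doi-IN", "mni-IN", "brx-IN", "sat-IN",
}
OUTPUT_FORMATS = {"md", "html"}


def build_job_parameters(language: str, output_format: str) -> dict:
    """Return a job_parameters dict, rejecting values this page says will not work."""
    if language not in SUPPORTED_LANGUAGES:
        raise ValueError(f"Unsupported language code: {language!r}")
    if output_format not in OUTPUT_FORMATS:
        hint = ' (use "md")' if output_format == "markdown" else ""
        raise ValueError(f"Unsupported output_format: {output_format!r}{hint}")
    # Keys are `language` and `output_format`, never `language_code`.
    return {"language": language, "output_format": output_format}


print(build_job_parameters("hi-IN", "md"))
# {'language': 'hi-IN', 'output_format': 'md'}
```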

Get started with Document Digitization in minutes:

```python
from sarvamai import SarvamAI

client = SarvamAI(
    api_subscription_key="YOUR_SARVAM_API_KEY"
)

# Create a Document Digitization job
job = client.document_intelligence.create_job(
    language="hi-IN",           # Target language (BCP-47 format)
    output_format="md"          # Output format: "html" or "md" (delivered as ZIP)
)

# Upload your document
job.upload_file("document.pdf")

# Start processing
job.start()

# Wait for completion
status = job.wait_until_complete()
print(f"Job completed: {status.job_state}")

# Get processing metrics
metrics = job.get_page_metrics()
print(f"Pages processed: {metrics['pages_processed']}")

# Download the output (ZIP file containing the processed document)
# The ZIP includes your chosen format (MD/HTML) plus a JSON file with page-level data
job.download_output("./output.zip")
print("Output saved to ./output.zip")
```

```javascript
import { SarvamAIClient } from "sarvamai";

const client = new SarvamAIClient({
    apiSubscriptionKey: "YOUR_SARVAM_API_KEY"
});

async function processDocument() {
    // Create a Document Digitization job
    const job = await client.documentIntelligence.createJob({
        language: "hi-IN",
        outputFormat: "md"
    });

    // Upload your document
    await job.uploadFile("document.pdf");

    // Start processing
    await job.start();

    // Wait for completion
    const status = await job.waitUntilComplete();
    console.log(`Job completed: ${status.job_state}`);

    // Get processing metrics
    const metrics = job.getPageMetrics();
    console.log(`Pages processed: ${metrics.pagesProcessed}`);

    // Download the output (ZIP file containing the processed document)
    // The ZIP includes your chosen format (MD/HTML) plus a JSON file with page-level data
    await job.downloadOutput("./output.zip");
    console.log("Output saved to ./output.zip");
}

processDocument();
```

***

## Response Format

### Job Status Response

```json
{
  "job_id": "abc123-def456-ghi789",
  "job_state": "Completed",
  "created_at": "2026-02-04T10:30:00Z",
  "updated_at": "2026-02-04T10:35:00Z",
  "page_metrics": {
    "total_pages": 10,
    "pages_processed": 10,
    "pages_succeeded": 10,
    "pages_failed": 0
  }
}
```

### Job States

| State                | Description                         |
| -------------------- | ----------------------------------- |
| `Accepted`           | Job created, awaiting file upload   |
| `Pending`            | File uploaded, waiting to start     |
| `Running`            | Job is being processed              |
| `Completed`          | All pages processed successfully    |
| `PartiallyCompleted` | Some pages succeeded, some failed   |
| `Failed`             | All pages failed or job-level error |
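The states and `page_metrics` fields above can be combined into a short status summary. This sketch works on a plain dictionary shaped like the Job Status Response; treating `Completed`, `PartiallyCompleted`, and `Failed` as the terminal states is an assumption drawn from their descriptions, and `is_terminal` and `summarize_job` are illustrative names.

```python
# Assumed terminal states, inferred from the table descriptions.
TERMINAL_STATES = {"Completed", "PartiallyCompleted", "Failed"}


def is_terminal(state: str) -> bool:
    """Return True once the job can no longer change state."""
    return state in TERMINAL_STATES


def summarize_job(status: dict) -> str:
    """Build a one-line summary from a job status dictionary."""
    state = status["job_state"]
    metrics = status.get("page_metrics") or {}
    total = metrics.get("total_pages", 0)
    if not is_terminal(state):
        processed = metrics.get("pages_processed", 0)
        return f"{state}: {processed}/{total} pages processed"
    succeeded = metrics.get("pages_succeeded", 0)
    failed = metrics.get("pages_failed", 0)
    return f"{state}: {succeeded}/{total} pages succeeded, {failed} failed"


sample = {
    "job_id": "abc123-def456-ghi789",
    "job_state": "Completed",
    "page_metrics": {
        "total_pages": 10,
        "pages_processed": 10,
        "pages_succeeded": 10,
        "pages_failed": 0,
    },
}
print(summarize_job(sample))  # Completed: 10/10 pages succeeded, 0 failed
```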

***

## Error Handling

```python
from sarvamai import SarvamAI
from sarvamai.core.api_error import ApiError

client = SarvamAI(api_subscription_key="YOUR_SARVAM_API_KEY")

try:
    job = client.document_intelligence.create_job(
        language="hi-IN",
        output_format="md"
    )
    job.upload_file("document.pdf")
    job.start()
    status = job.wait_until_complete()
    
    if status.job_state == "Completed":
        job.download_output("./output.zip")
        print("Output saved to ./output.zip")
    else:
        # Reached for "PartiallyCompleted" as well as "Failed"
        print(f"Job ended in state {status.job_state}: {status}")
        
except ApiError as e:
    if e.status_code == 400:
        print(f"Bad request: {e.body}")
    elif e.status_code == 403:
        print("Invalid API key")
    elif e.status_code == 429:
        print("Rate limit exceeded")
    else:
        print(f"Error {e.status_code}: {e.body}")
except FileNotFoundError:
    print("Document file not found")
```

### Error Codes

| HTTP Status | Error Code                   | Description                                                      |
| ----------- | ---------------------------- | ---------------------------------------------------------------- |
| `400`       | `invalid_request_error`      | Invalid parameters or missing required fields                    |
| `403`       | `invalid_api_key_error`      | Invalid or missing API key                                       |
| `404`       | `not_found_error`            | Job not found                                                    |
| `422`       | `unprocessable_entity_error` | Invalid file format or corrupted file                            |
| `422`       | `max_page_limit_exceeded`    | Document exceeds the 10-page limit (applies to both PDF and ZIP) |
| `429`       | `insufficient_quota_error`   | Rate limit or quota exceeded                                     |
| `500`       | `internal_server_error`      | Server error, retry the request                                  |
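The table marks `500` as retryable. A retry wrapper along these lines is one possible approach; it is a sketch, not SDK functionality. `call_with_retries` is a hypothetical helper, it relies only on the exception exposing a `status_code` attribute (as `ApiError` does in the example above), and retrying `429` is an assumption that helps with rate limits but not with an exhausted quota.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

# 500 is documented as retryable; including 429 is an assumption.
RETRYABLE_STATUS_CODES = {429, 500}


def call_with_retries(
    operation: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation(), retrying with exponential backoff on retryable errors."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception as exc:
            status_code = getattr(exc, "status_code", None)
            # Re-raise anything non-retryable, and the final failed attempt.
            if status_code not in RETRYABLE_STATUS_CODES or attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))  # waits 1s, 2s, 4s, ...
    raise RuntimeError("unreachable")  # the loop always returns or raises
```

Wrap a single API call in a zero-argument function or `lambda` and pass it as `operation`. Errors such as `400`, `403`, and `422` are raised immediately, since repeating the same request would not change the outcome.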

***

## Best Practices

- Use Markdown for human-readable output and HTML for web rendering and rich formatting. JSON page-level data is always included regardless of your choice.
- Always specify the correct language code for optimal text extraction accuracy, especially for Indian languages.
- For large documents, monitor `page_metrics` to track progress and handle partial failures gracefully.
- Choose HTML output format when you need to preserve table structures and rich formatting.

***

## Next Steps

- Learn about [Sarvam Vision](/api-reference-docs/getting-started/models/sarvam-vision), the model powering Document Digitization.
- Complete API documentation with all parameters and options.
- Upload and process documents in the API Dashboard.