Sarvam’s Document Digitization API provides enterprise-grade document processing powered by Sarvam Vision, our state-of-the-art multimodal model.
Transform any document into structured, searchable, and machine-readable data with world-class accuracy.
Document Digitization is a document processing pipeline powered by Sarvam Vision, with the following capabilities:
Native support for all Constitutionally recognized Indian languages and English with script-native accuracy.
Export to HTML or Markdown files, delivered as a ZIP archive. A JSON file with structured page-level data is always included alongside your chosen format.
Intelligent table detection and conversion to structured formats.
Process multi-page documents and ZIP archives with automatic page handling.
Intelligent reading order detection and complex layout handling.
Scalable API with job management, progress tracking, and error handling.
Document Digitization supports all 22 Constitutionally recognized Indian languages, plus English.
Page Limit: Both PDF and ZIP uploads are limited to a maximum of 10 pages. Exceeding this limit will return a 422 Unprocessable Entity error with code max_page_limit_exceeded. Split larger documents into batches of 10 pages or fewer before uploading.
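The batching advice above can be sketched in a few lines of Python. This is a minimal illustration, not part of the Sarvam SDK: the helper names split_into_batches and is_page_limit_error are hypothetical, and the sketch assumes you already have an ordered list of page files and access to the status code and error code of a failed request.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")

MAX_PAGES_PER_UPLOAD = 10  # page limit stated in the note above


def split_into_batches(pages: Sequence[T], batch_size: int = MAX_PAGES_PER_UPLOAD) -> List[List[T]]:
    """Split an ordered sequence of pages into consecutive batches of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [list(pages[i:i + batch_size]) for i in range(0, len(pages), batch_size)]


def is_page_limit_error(status_code: int, error_code: str) -> bool:
    """Return True for the documented rejection: HTTP 422 with code max_page_limit_exceeded."""
    return status_code == 422 and error_code == "max_page_limit_exceeded"


# Hypothetical input: 23 page images with zero-padded names.
page_files = [f"page_{n:03d}.png" for n in range(1, 24)]
batches = split_into_batches(page_files)
print([len(batch) for batch in batches])  # [10, 10, 3]
```

Each batch can then be uploaded as its own job, and is_page_limit_error gives a single place to detect the case where a batch was still too large.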
For ZIP files, include only JPG or PNG page images in a flat structure (no nested folders). The API processes every page in the archive and orders pages by filename.
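One way to prepare a conforming archive is sketched below using only the Python standard library. The function name build_flat_zip is hypothetical, and two choices are assumptions rather than documented behavior: the sketch accepts only the .jpg and .png extensions, and it relies on zero-padded filenames so that alphabetical order matches the intended page order (the exact sorting rule the API applies is not described here).

```python
import io
import zipfile
from typing import Dict

ALLOWED_EXTENSIONS = (".jpg", ".png")  # the two image types named above
MAX_PAGES = 10  # documented page limit for ZIP uploads


def build_flat_zip(pages: Dict[str, bytes]) -> bytes:
    """Pack page images into a flat ZIP (no folders), written in sorted filename order."""
    if not pages:
        raise ValueError("at least one page is required")
    if len(pages) > MAX_PAGES:
        raise ValueError(f"{len(pages)} pages exceeds the {MAX_PAGES}-page limit")

    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name in sorted(pages):
            if "/" in name or "\\" in name:
                raise ValueError(f"nested path not allowed: {name}")
            if not name.lower().endswith(ALLOWED_EXTENSIONS):
                raise ValueError(f"unsupported file type: {name}")
            archive.writestr(name, pages[name])
    return buffer.getvalue()


# Hypothetical usage with placeholder bytes standing in for real image data.
zip_bytes = build_flat_zip({"page_02.png": b"...", "page_01.png": b"..."})
print(zipfile.ZipFile(io.BytesIO(zip_bytes)).namelist())  # ['page_01.png', 'page_02.png']
```

Validating before upload keeps failures local and cheap, instead of discovering a nested folder or an oversized archive only after the job is rejected.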
The output_format parameter controls the primary content format in the output ZIP archive. You can choose between html and md (Markdown).
JSON is always included. Whether you choose HTML or Markdown, the output ZIP archive also contains a JSON file with structured page-level data. It holds the same extracted content in machine-readable form, so you can process results programmatically alongside the human-readable HTML or Markdown output.
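Reading both parts of the output archive might look like the sketch below. It makes several assumptions that this page does not confirm: files are identified purely by extension (.json, plus .html or .md for the chosen format), the text is UTF-8 encoded, and the JSON is simply parsed without relying on any particular schema. The function name read_output_zip is hypothetical.

```python
import io
import json
import zipfile
from typing import Any, Dict


def read_output_zip(zip_bytes: bytes, output_format: str) -> Dict[str, Dict[str, Any]]:
    """Return {"content": {filename: text}, "json": {filename: parsed_json}} from an output ZIP."""
    if output_format not in ("html", "md"):
        raise ValueError('output_format must be "html" or "md"')

    content_suffix = "." + output_format
    result: Dict[str, Dict[str, Any]] = {"content": {}, "json": {}}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in sorted(archive.namelist()):
            lower = name.lower()
            if lower.endswith(".json"):
                # Assumption: the structured page-level data is UTF-8 encoded JSON.
                result["json"][name] = json.loads(archive.read(name).decode("utf-8"))
            elif lower.endswith(content_suffix):
                result["content"][name] = archive.read(name).decode("utf-8")
    return result
```

Keeping the human-readable content and the JSON in separate dictionaries lets downstream code render one and index the other without re-opening the archive.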
API parameter names: In job_parameters, Document Digitization expects language and output_format, and output_format accepts only "md" or "html". Do not use language_code (ignored by the API) or "markdown" (returns a 400 error; use "md" instead). These names differ from those used by the STT, translate, and LID endpoints.
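A small client-side check can catch these naming mistakes before a request is sent. The sketch below is illustrative only: find_parameter_problems is a hypothetical helper, and the language value "hi-IN" is an assumed example, since the accepted language values are not listed in this section.

```python
from typing import Dict, List

VALID_OUTPUT_FORMATS = ("md", "html")


def find_parameter_problems(job_parameters: Dict[str, str]) -> List[str]:
    """Return human-readable problems found in a job_parameters dict (empty list if none)."""
    problems: List[str] = []
    if "language_code" in job_parameters:
        problems.append('"language_code" is ignored by the API; use "language"')
    if not job_parameters.get("language"):
        problems.append('"language" is missing')
    output_format = job_parameters.get("output_format")
    if output_format == "markdown":
        problems.append('"markdown" returns a 400 error; use "md"')
    elif output_format not in VALID_OUTPUT_FORMATS:
        problems.append('"output_format" must be "md" or "html"')
    return problems


# "hi-IN" is an assumed example value, not taken from this page.
good = {"language": "hi-IN", "output_format": "md"}
bad = {"language_code": "hi-IN", "output_format": "markdown"}
print(find_parameter_problems(good))       # []
print(len(find_parameter_problems(bad)))   # 3
```

Because language_code is silently ignored rather than rejected, a check like this is the only place such a typo would surface before results come back with degraded accuracy.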
Get started with Document Digitization in minutes:
Use Markdown for human-readable output and HTML for web rendering and rich formatting. JSON page-level data is always included regardless of your choice.
Always set the language parameter to the document's actual language for the best text extraction accuracy, especially for Indian languages.
For large documents, monitor page_metrics to track progress and handle partial failures gracefully.
Choose HTML output format when you need to preserve table structures and rich formatting.
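The tip about monitoring page_metrics could be applied along the lines of the sketch below. The field names total_pages, pages_succeeded, and pages_failed are assumptions made for illustration, as are the state labels and the helper name summarize_page_metrics; check the actual job status response for the real keys before relying on this.

```python
from typing import Dict, Mapping, Union


def summarize_page_metrics(page_metrics: Mapping[str, int]) -> Dict[str, Union[int, str]]:
    """Summarize progress from a page_metrics mapping (all key names here are assumed)."""
    total = page_metrics.get("total_pages", 0)          # assumed key
    succeeded = page_metrics.get("pages_succeeded", 0)  # assumed key
    failed = page_metrics.get("pages_failed", 0)        # assumed key
    processed = succeeded + failed

    if total == 0:
        state = "pending"
    elif processed < total:
        state = "in_progress"
    elif failed == 0:
        state = "completed"
    elif succeeded == 0:
        state = "failed"
    else:
        state = "partially_completed"

    percent = (100 * processed) // total if total else 0
    return {"state": state, "percent": percent, "failed": failed}


# Hypothetical status snapshot: 9 of 10 pages succeeded and 1 failed.
print(summarize_page_metrics({"total_pages": 10, "pages_succeeded": 9, "pages_failed": 1}))
# {'state': 'partially_completed', 'percent': 100, 'failed': 1}
```

Separating a partially completed job from a fully failed one makes it possible to keep the pages that succeeded and resubmit only the ones that did not.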