docs: add agent skill bundle for coding assistants (SKILL.md, pipelines, convert/evaluate) (#3174)

* docs: add agent skill bundle with convert/evaluate helpers

- Add docs/examples/agent_skill/docling-document-intelligence/ with
  SKILL.md, pipelines.md, EXAMPLE.md, improvement-log template, and
  scripts/docling-convert.py + docling-evaluate.py (standard/vlm-local/vlm-api).
- Document InputFormat.PDF + PdfFormatOption for explicit PdfPipelineOptions.
- Link from examples index and mkdocs nav.

Made-with: Cursor

* docs: align agent skill README and EXAMPLE with Cursor bundle

- Document both ~/.cursor/skills and docs/examples paths.
- README notes repo parity for PRs and local installs.

Made-with: Cursor

* DCO Remediation Commit for jehlum11 <jehlum11@gmail.com>

I, jehlum11 <jehlum11@gmail.com>, hereby add my Signed-off-by to this commit: 2d268ffb6f
I, jehlum11 <jehlum11@gmail.com>, hereby add my Signed-off-by to this commit: 041e709c66

Signed-off-by: jehlum11 <jehlum11@gmail.com>
Made-with: Cursor

* docs: refactor agent skill to use docling CLI for conversion

Address maintainer feedback: the custom docling-convert.py script was
largely redundant with the existing docling CLI. This commit:

- Removes scripts/docling-convert.py (redundant with `docling` CLI)
- Refactors SKILL.md (v1.4 → v2.0) to use `docling` CLI for all
  conversion tasks, reserving the Python API only for features the
  CLI does not expose (chunking, VLM API endpoint config,
  force_backend_text hybrid mode)
- Updates docling-evaluate.py recommended_actions to reference
  `docling` CLI flags instead of the removed script
- Updates README.md, EXAMPLE.md, pipelines.md to use `docling` CLI
  examples throughout
- Simplifies requirements.txt (removes packaging dependency)

The only custom script retained is docling-evaluate.py, which provides
heuristic quality evaluation — functionality the CLI does not cover.

Signed-off-by: jehlum11 <jehlum11@gmail.com>
Made-with: Cursor

* docs: fix ruff format on docling-evaluate.py

Signed-off-by: jehlum11 <jehlum11@gmail.com>
Made-with: Cursor

---------

Signed-off-by: jehlum11 <jehlum11@gmail.com>
Jehlum Pandit
2026-04-13 09:02:51 -04:00
committed by GitHub
parent 42157a3e10
commit c23622f6f5
9 changed files with 1109 additions and 0 deletions
@@ -0,0 +1,99 @@
# Using the Docling agent skill
[Agent Skills](https://agentskills.io/specification) are folders of instructions that AI coding agents (Cursor, Claude Code, GitHub Copilot, etc.) can load when relevant.
## Where this bundle lives
- **Cursor (local):** `~/.cursor/skills/docling-document-intelligence/` (or copy this folder there).
- **Docling repository (docs + PRs):** `docs/examples/agent_skill/docling-document-intelligence/` in [github.com/docling-project/docling](https://github.com/docling-project/docling).
The two trees are kept in sync; use either source.
## Install (copy into your agent's skills directory)
```bash
# From a checkout of the Docling repo
cp -r docs/examples/agent_skill/docling-document-intelligence ~/.cursor/skills/
# Or copy from another machine / archive into e.g. ~/.claude/skills/
```
No extra config is required beyond installing Python dependencies (below).
## Usage
Open your agent-enabled IDE and ask, for example:
```
Parse report.pdf and give me a structural outline
```
```
Convert https://arxiv.org/pdf/2408.09869 to markdown
```
```
Chunk invoice.pdf for RAG ingestion with 512 token chunks
```
```
Process scanned.pdf using the VLM pipeline
```
The agent should read `SKILL.md`, match the task, and run the appropriate
`docling` CLI command or Python API call.
## Running the docling CLI directly
```bash
pip install docling docling-core
# Basic conversion to Markdown
docling report.pdf --output /tmp/
# JSON output
docling report.pdf --to json --output /tmp/
# Custom OCR engine
docling report.pdf --ocr-engine rapidocr --output /tmp/
# VLM pipeline
docling scanned.pdf --pipeline vlm --output /tmp/
# VLM with specific model
docling scanned.pdf --pipeline vlm --vlm-model granite_docling --output /tmp/
# Remote VLM services
docling doc.pdf --pipeline vlm --enable-remote-services --output /tmp/
```
## Evaluate and refine
```bash
docling report.pdf --to json --output /tmp/
docling report.pdf --to md --output /tmp/
python3 scripts/docling-evaluate.py /tmp/report.json --markdown /tmp/report.md
```
If the report shows `warn` or `fail`, follow `recommended_actions`, re-convert
with `docling` using the suggested flags, and optionally append a note to
`improvement-log.md` (see `SKILL.md` section 7).
## What the skill covers
| Task | How to ask |
|---|---|
| Parse PDF / DOCX / PPTX / HTML / image | "parse this file" |
| Convert to Markdown | "convert to markdown" |
| Export as structured JSON | "export as JSON" |
| Chunk for RAG | "chunk for RAG", "prepare for ingestion" |
| Analyze structure | "show me the headings and tables" |
| Use VLM pipeline | "use the VLM pipeline", "process scanned PDF" |
| Use remote inference | "use vLLM", "call the API pipeline" |
## Further reading
- [Agent Skills specification](https://agentskills.io/specification)
- [Docling documentation](https://docling-project.github.io/docling/)
- [Docling CLI reference](https://docling-project.github.io/docling/reference/cli/)
- [Docling GitHub](https://github.com/docling-project/docling)
@@ -0,0 +1,43 @@
# Docling agent skill (Cursor & compatible assistants)
This folder is an **[Agent Skill](https://agentskills.io/specification)**-style bundle for AI coding assistants: structured instructions (`SKILL.md`), a pipeline reference (`pipelines.md`), and a quality evaluator (`scripts/docling-evaluate.py`).
Conversion is done via the **`docling` CLI** (included with `pip install docling`).
The evaluator provides a **convert → evaluate → refine** feedback loop that the
existing CLI does not cover.
It complements the official [Docling documentation](https://docling-project.github.io/docling/) and the [`docling` CLI reference](https://docling-project.github.io/docling/reference/cli/).
The same layout is published in the Docling repo at `docs/examples/agent_skill/docling-document-intelligence/` (for docs and PRs).
## Contents
| Path | Purpose |
|------|---------|
| [`SKILL.md`](SKILL.md) | Full skill instructions (pipelines, chunking, evaluation loop) |
| [`pipelines.md`](pipelines.md) | Standard vs VLM pipelines, OCR engines, API notes |
| [`EXAMPLE.md`](EXAMPLE.md) | Installing into `~/.cursor/skills/`; running the CLI and evaluator |
| [`improvement-log.md`](improvement-log.md) | Optional template for local "what worked" notes |
| [`scripts/docling-evaluate.py`](scripts/docling-evaluate.py) | Heuristic quality report on JSON (+ optional Markdown) |
| [`scripts/requirements.txt`](scripts/requirements.txt) | Minimal pip deps for the evaluator |
## Quick start
```bash
pip install docling docling-core
# Convert to Markdown
docling https://arxiv.org/pdf/2408.09869 --output /tmp/
# Convert to JSON
docling https://arxiv.org/pdf/2408.09869 --to json --output /tmp/
# Evaluate quality
python3 scripts/docling-evaluate.py /tmp/2408.09869.json --markdown /tmp/2408.09869.md
```
Use `--pipeline vlm` for vision-model pipelines; see `SKILL.md` and `pipelines.md`.
## License
MIT (aligned with [Docling](https://github.com/docling-project/docling)).
@@ -0,0 +1,393 @@
---
name: docling-document-intelligence
description: >
  Parse, convert, chunk, and analyze documents using Docling. Use this skill
  when the user provides a document (PDF, DOCX, PPTX, HTML, image) as a file
  path or URL and wants to: extract text or structured content, convert to
  Markdown or JSON, chunk the document for RAG ingestion, analyze document
  structure (headings, tables, figures, reading order), or run quality
  evaluation with iterative pipeline tuning. Triggers: "parse this PDF",
  "convert to markdown", "chunk for RAG", "extract tables", "analyze document
  structure", "prepare for ingestion", "process document", "evaluate docling
  output", "improve conversion quality".
license: MIT
compatibility: Requires Python 3.10+, docling>=2.81.0, docling-core>=2.67.1
metadata:
  author: docling-project
  version: "2.0"
  upstream: https://github.com/docling-project/docling
allowed-tools: Bash(docling:*) Bash(python3:*) Bash(pip:*)
---
# Docling Document Intelligence Skill
Use this skill to parse, convert, chunk, and analyze documents with Docling.
It handles both local file paths and URLs, and outputs either Markdown or
structured JSON (`DoclingDocument`).
Conversion uses the **`docling` CLI** (installed with `pip install docling`).
The Python API is used only for features the CLI does not expose (chunking,
VLM remote-API endpoint configuration, hybrid `force_backend_text` mode).
## Scope
| Task | Covered |
|---|---|
| Parse PDF / DOCX / PPTX / HTML / image | ✅ |
| Convert to Markdown | ✅ |
| Export as DoclingDocument JSON | ✅ |
| Chunk for RAG (hybrid: heading + token) | ✅ (Python API) |
| Analyze structure (headings, tables, figures) | ✅ (Python API) |
| OCR for scanned PDFs | ✅ (auto-enabled) |
| Multi-source batch conversion | ✅ |
## Step-by-Step Instructions
### 1. Resolve the input
Determine whether the user supplied a **local path** or a **URL**.
The `docling` CLI accepts both directly.
```bash
docling path/to/file.pdf
docling https://example.com/a.pdf
```
### 2. Choose a pipeline
Docling has two pipeline families. Pick based on document type and hardware.
| Pipeline | CLI flag | Best for | Key tradeoff |
|---|---|---|---|
| **Standard** (default) | `--pipeline standard` | Born-digital PDFs, speed | No GPU needed; OCR for scanned pages |
| **VLM** | `--pipeline vlm` | Complex layouts, handwriting, formulas | Needs GPU; slower |
See [pipelines.md](pipelines.md) for the full decision matrix, OCR engine table
(EasyOCR, RapidOCR, Tesseract, macOS), and VLM model presets.
### 3. Convert the document
#### CLI (preferred for straightforward conversions)
```bash
# Markdown (default output)
docling report.pdf --output /tmp/
# JSON (structured, lossless)
docling report.pdf --to json --output /tmp/
# VLM pipeline
docling report.pdf --pipeline vlm --output /tmp/
# VLM with specific model
docling report.pdf --pipeline vlm --vlm-model granite_docling --output /tmp/
# Custom OCR engine
docling report.pdf --ocr-engine tesserocr --output /tmp/
# Disable OCR or tables for speed
docling report.pdf --no-ocr --output /tmp/
docling report.pdf --no-tables --output /tmp/
# Remote VLM services
docling report.pdf --pipeline vlm --enable-remote-services --output /tmp/
```
The CLI writes output files to the `--output` directory, named after the
input file (e.g. `report.pdf` → `report.md` or `report.json`).
**CLI reference:** <https://docling-project.github.io/docling/reference/cli/>
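When an agent drives the CLI from Python, the flags shown above can be assembled programmatically. The sketch below is illustrative only: `build_docling_command` is a hypothetical helper, not part of Docling, and it covers only the flags listed in this section (`--to`, `--pipeline`, `--ocr-engine`, `--no-ocr`, `--no-tables`, `--output`).

```python
from __future__ import annotations

import shlex


def build_docling_command(
    source: str,
    output_dir: str = "/tmp/",
    to: str | None = None,
    pipeline: str | None = None,
    ocr_engine: str | None = None,
    no_ocr: bool = False,
    no_tables: bool = False,
) -> list[str]:
    """Assemble an argument list for the docling CLI (hypothetical helper)."""
    cmd = ["docling", source]
    if to:
        cmd += ["--to", to]
    if pipeline:
        cmd += ["--pipeline", pipeline]
    if ocr_engine:
        cmd += ["--ocr-engine", ocr_engine]
    if no_ocr:
        cmd.append("--no-ocr")
    if no_tables:
        cmd.append("--no-tables")
    cmd += ["--output", output_dir]
    return cmd


cmd = build_docling_command("report.pdf", to="json", pipeline="vlm")
print(shlex.join(cmd))  # shell-safe rendering, useful for logging
```

Passing the list to `subprocess.run(cmd, check=True)` rather than a single string avoids shell quoting problems with paths that contain spaces.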
#### Python API (for advanced features)
Use the Python API when you need features the CLI does not expose:
chunking, VLM remote-API endpoint configuration, or hybrid
`force_backend_text` mode.
**Docling 2.81+ API note:** `DocumentConverter(format_options=...)` expects
`dict[InputFormat, FormatOption]` (e.g. `InputFormat.PDF` → `PdfFormatOption`).
Using string keys like `{"pdf": PdfPipelineOptions(...)}` fails at runtime with
`AttributeError: 'PdfPipelineOptions' object has no attribute 'backend'`.
**Standard pipeline (default):**
```python
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
# Minimal: library defaults (standard PDF pipeline)
converter = DocumentConverter()
result = converter.convert("report.pdf")
# Explicit options: key by InputFormat.PDF and wrap in PdfFormatOption
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_options=PdfPipelineOptions(do_ocr=True, do_table_structure=True),
        ),
    }
)
result = converter.convert("report.pdf")
```
**VLM pipeline — local (GraniteDocling via HF Transformers):**
```python
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import VlmPipelineOptions
from docling.datamodel import vlm_model_specs
from docling.pipeline.vlm_pipeline import VlmPipeline
pipeline_options = VlmPipelineOptions(
    vlm_options=vlm_model_specs.GRANITEDOCLING_TRANSFORMERS,
    generate_page_images=True,
)
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_cls=VlmPipeline,
            pipeline_options=pipeline_options,
        )
    }
)
result = converter.convert("report.pdf")
```
**VLM pipeline — remote API (vLLM / LM Studio / Ollama):**
This is only available via the Python API; the CLI does not expose endpoint
URL, model name, or API key configuration.
```python
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import VlmPipelineOptions
from docling.datamodel.pipeline_options_vlm_model import ApiVlmOptions, ResponseFormat
from docling.pipeline.vlm_pipeline import VlmPipeline
vlm_opts = ApiVlmOptions(
    url="http://localhost:8000/v1/chat/completions",
    params=dict(model="ibm-granite/granite-docling-258M", max_tokens=4096),
    prompt="Convert this page to docling.",
    response_format=ResponseFormat.DOCTAGS,
    timeout=120,
)
pipeline_options = VlmPipelineOptions(
    vlm_options=vlm_opts,
    generate_page_images=True,
    enable_remote_services=True,  # required: gates all outbound HTTP
)
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_cls=VlmPipeline,
            pipeline_options=pipeline_options,
        )
    }
)
result = converter.convert("report.pdf")
```
**Hybrid mode (force_backend_text) — Python API only:**
Uses deterministic PDF text extraction for text regions while routing
images and tables through the VLM. Reduces hallucination on text-heavy pages.
```python
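from docling.datamodel import vlm_model_specs
from docling.datamodel.pipeline_options import VlmPipelineOptions
pipeline_options = VlmPipelineOptions(
    vlm_options=vlm_model_specs.GRANITEDOCLING_TRANSFORMERS,
    force_backend_text=True,  # hybrid mode: PDF text layer + VLM for images/tables
    generate_page_images=True,
)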
```
`result.document` is a `DoclingDocument` object in all cases.
### 4. Choose output format
**Markdown** (default, human-readable):
```bash
docling report.pdf --to md --output /tmp/
```
Or via Python: `result.document.export_to_markdown()`
**JSON / DoclingDocument** (structured, lossless):
```bash
docling report.pdf --to json --output /tmp/
```
Or via Python: `result.document.export_to_dict()`
> If the user does not specify a format, ask: "Should I output Markdown or
> structured JSON (DoclingDocument)?"
### 5. Chunk for RAG (hybrid strategy)
Chunking is only available via the Python API.
Default: **hybrid chunker** — splits first by heading hierarchy, then
subdivides oversized sections by token count. This preserves semantic
boundaries while respecting model context limits.
The tokenizer API changed in docling-core 2.8.0. Pass a `BaseTokenizer`
object, not a raw string:
**HuggingFace tokenizer (default):**
```python
from docling.chunking import HybridChunker
from docling_core.transforms.chunker.tokenizer.huggingface import HuggingFaceTokenizer
tokenizer = HuggingFaceTokenizer.from_pretrained(
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    max_tokens=512,
)
chunker = HybridChunker(tokenizer=tokenizer, merge_peers=True)
chunks = list(chunker.chunk(result.document))
for chunk in chunks:
    embed_text = chunker.contextualize(chunk)  # heading-enriched text to embed
    print(chunk.meta.headings)  # heading breadcrumb list
    # Source pages come from the provenance of each item in the chunk
    pages = sorted({p.page_no for it in chunk.meta.doc_items for p in it.prov})
    print(pages)
```
**OpenAI tokenizer (for OpenAI embedding models):**
```python
import tiktoken
from docling_core.transforms.chunker.tokenizer.openai import OpenAITokenizer
tokenizer = OpenAITokenizer(
    tokenizer=tiktoken.encoding_for_model("text-embedding-3-small"),
    max_tokens=8192,
)
# Requires: pip install 'docling-core[chunking-openai]'
```
For chunking strategies and tokenizer details, see the Docling documentation
on chunking and `HybridChunker`.
### 6. Analyze document structure
Use the `DoclingDocument` object directly to inspect structure:
```python
doc = result.document
for item, level in doc.iterate_items():
    if hasattr(item, 'label') and item.label.name == 'SECTION_HEADER':
        print(f"{'#' * level} {item.text}")
for table in doc.tables:
    print(table.export_to_dataframe())  # pandas DataFrame
    print(table.export_to_markdown())
for picture in doc.pictures:
    print(picture.caption_text(doc))  # caption if present
```
For the full API surface, see Docling's structure and table export docs.
### 7. Evaluate output and iterate (required for "best effort" conversions)
After **every** conversion where the user cares about fidelity (not quick
previews), run the bundled evaluator on the JSON export, then refine the
pipeline if needed. This is how the agent **checks its work** and **improves
the run** without guessing.
**Step A — Produce JSON and optional Markdown**
```bash
docling "<source>" --to json --output /tmp/
docling "<source>" --to md --output /tmp/
```
**Step B — Evaluate**
```bash
python3 scripts/docling-evaluate.py /tmp/<filename>.json --markdown /tmp/<filename>.md
```
If the user expects tables (invoices, spreadsheets in PDF), add
`--expect-tables`. Tighten gates with `--fail-on-warn` in CI-style checks.
The script prints a JSON report to stdout: `status` (`pass` | `warn` | `fail`),
`metrics`, `issues`, and `recommended_actions` (concrete `docling` CLI
flags to try next).
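A calling agent will usually want to parse that report rather than read it by eye. The following is a minimal sketch, assuming the report has exactly the four top-level keys named above; `EvalReport`, `parse_report`, and the sample values are hypothetical and exist only for illustration.

```python
import json
from dataclasses import dataclass, field


@dataclass
class EvalReport:
    """Typed view of the evaluator's stdout (field names from the report keys)."""

    status: str
    metrics: dict = field(default_factory=dict)
    issues: list = field(default_factory=list)
    recommended_actions: list = field(default_factory=list)


def parse_report(stdout_text: str) -> EvalReport:
    """Parse the JSON report and reject unexpected status values."""
    data = json.loads(stdout_text)
    status = data.get("status")
    if status not in {"pass", "warn", "fail"}:
        raise ValueError(f"unexpected status: {status!r}")
    return EvalReport(
        status=status,
        metrics=data.get("metrics") or {},
        issues=data.get("issues") or [],
        recommended_actions=data.get("recommended_actions") or [],
    )


# Illustrative report text; the values are made up for this example
sample_stdout = json.dumps(
    {
        "status": "warn",
        "metrics": {"page_count": 4, "chars_per_page": 35.0},
        "issues": ["Low text density (35.0 chars/page)"],
        "recommended_actions": ["Retry: docling <source> --pipeline vlm"],
    }
)
report = parse_report(sample_stdout)
print(report.status, report.recommended_actions[0])
```

In practice `stdout_text` would come from running the evaluator, for example via `subprocess.run(..., capture_output=True, text=True).stdout`.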
**Step C — Refinement loop (max 3 attempts unless the user says otherwise)**
1. If `status` is `warn` or `fail`, apply **one** primary change from
`recommended_actions` (e.g. switch `--pipeline vlm`, change
`--ocr-engine`, ensure tables are enabled).
2. Re-convert with `docling`, re-run `scripts/docling-evaluate.py`.
3. Stop when `status` is `pass`, or after 3 iterations — then summarize what
worked and any remaining issues for the user.
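The loop in Step C can be sketched as a small driver. This is only an outline under stated assumptions: `convert` and `evaluate` are hypothetical callables supplied by the caller (for instance thin wrappers around the `docling` CLI and the evaluator script), and "3 attempts" is read here as three convert-and-evaluate rounds in total.

```python
from typing import Callable, Optional


def refine(
    convert: Callable[[Optional[str]], None],
    evaluate: Callable[[], dict],
    max_attempts: int = 3,
):
    """Run convert -> evaluate, applying one recommended action per retry."""
    tried = []  # actions that were actually applied
    action = None  # first round uses default settings
    report = {}
    for _ in range(max_attempts):
        convert(action)
        if action is not None:
            tried.append(action)
        report = evaluate()
        if report.get("status") == "pass":
            break
        # Choose the first recommended action that has not been tried yet
        untried = [a for a in report.get("recommended_actions") or [] if a not in tried]
        if not untried:
            break
        action = untried[0]
    return report, tried


# Demo with stand-in callables: the second round passes
calls = []
reports = iter(
    [
        {"status": "warn", "recommended_actions": ["Retry: docling <source> --pipeline vlm"]},
        {"status": "pass", "recommended_actions": []},
    ]
)
final_report, applied = refine(calls.append, lambda: next(reports))
print(final_report["status"], applied)
```

Returning the list of applied actions alongside the final report gives the agent what it needs for the summary in step 3.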
**Step D — Self-improvement log (skill memory)**
After a successful pass **or** after the final iteration, append one entry to
[improvement-log.md](improvement-log.md) in this skill directory:
- Source type (e.g. scanned PDF, digital PDF, DOCX)
- First-run problems (from `issues`)
- Pipeline + flags that fixed or best mitigated them
- Final `status` and one line of subjective quality notes
The user may git-ignore this log; it exists for **local** learning, so that
future runs on similar documents start closer to the right pipeline.
### 8. Agent quality checklist (manual, if script unavailable)
If `scripts/docling-evaluate.py` cannot run, still verify:
| Check | Action if bad |
|---|---|
| Page count matches source (roughly) | Re-run; try `--pipeline vlm` if layout is complex |
| Markdown is not near-empty | Enable OCR / VLM |
| Tables missing when visually obvious | Remove `--no-tables`; try `--pipeline vlm` |
| `\ufffd` replacement characters | Different `--ocr-engine` or `--pipeline vlm` |
| Same line repeated many times | `--pipeline vlm` or hybrid `force_backend_text` (Python API) |
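Three of these checks (near-empty Markdown, replacement characters, repeated lines) can be approximated with the standard library alone. The sketch below is a rough stand-in, not the bundled evaluator: `quick_markdown_checks` and both thresholds are assumptions chosen for illustration.

```python
from collections import Counter


def quick_markdown_checks(markdown: str, min_chars: int = 200, max_repeats: int = 8) -> list:
    """Return short problem codes for a Markdown export (thresholds are guesses)."""
    problems = []
    if len(markdown.strip()) < min_chars:
        problems.append("near_empty")
    if "\ufffd" in markdown:
        problems.append("replacement_chars")
    lines = [ln.strip() for ln in markdown.splitlines() if ln.strip()]
    if lines:
        _, count = Counter(lines).most_common(1)[0]
        if count > max_repeats:
            problems.append("repeated_lines")
    return problems


suspicious = "same line\n" * 20 + "x" * 300 + "\ufffd"
print(quick_markdown_checks(suspicious))
```

Page count and missing tables still need a look at the source document, so this only covers part of the table above.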
## Common Edge Cases
| Situation | Handling |
|---|---|
| Scanned / image-only PDF | Standard pipeline with OCR, or `--pipeline vlm` for best quality |
| Password-protected PDF | `--pdf-password PASSWORD`; will raise `ConversionError` if wrong |
| Very large document (500+ pages) | Standard pipeline with `--no-tables` for speed |
| Complex layout / multi-column | `--pipeline vlm`; standard may misorder reading flow |
| Handwriting or formulas | `--pipeline vlm` only — standard OCR will not handle these |
| URL behind auth | Pre-download to temp file; pass local path |
| Tables with merged cells | `table.export_to_markdown()` handles spans; VLM often more accurate |
| Non-UTF-8 encoding | Docling normalises internally; no special handling needed |
| VLM hallucinating text | `force_backend_text=True` via Python API for hybrid mode |
| VLM API call blocked | `--enable-remote-services` (CLI) or `enable_remote_services=True` (Python) |
| Apple Silicon | `--vlm-model granite_docling` with MLX backend, or `GRANITEDOCLING_MLX` preset (Python API) |
## Pipeline reference
Full decision matrix, all OCR engine options, VLM model presets, and API
server configuration: [pipelines.md](pipelines.md)
## Output conventions
- Always report the number of pages and conversion status.
- When evaluation is in scope, report evaluator `status`, top `issues`, and
which refinement attempt produced the final output.
- For Markdown output: wrap in a fenced code block only if the user will copy/paste it; otherwise render directly.
- For JSON output: pretty-print with `indent=2` unless the user specifies otherwise.
- For chunks: report total chunk count, min/max/avg token counts.
- For structure analysis: summarise heading tree + table count + figure count before going into detail.
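For the chunk summary, a small helper is enough. This is a minimal sketch, assuming the per-chunk token counts have already been computed with the same tokenizer used for chunking; `summarize_chunks` and its key names are hypothetical.

```python
def summarize_chunks(token_counts: list) -> dict:
    """Total chunk count plus min/max/avg token counts."""
    if not token_counts:
        return {"chunks": 0, "min_tokens": 0, "max_tokens": 0, "avg_tokens": 0.0}
    return {
        "chunks": len(token_counts),
        "min_tokens": min(token_counts),
        "max_tokens": max(token_counts),
        "avg_tokens": round(sum(token_counts) / len(token_counts), 1),
    }


print(summarize_chunks([100, 200, 300]))
```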
## Dependencies
```bash
pip install docling docling-core
# For OpenAI tokenizer support:
pip install 'docling-core[chunking-openai]'
```
The `docling` CLI is included with the `docling` package — no separate install needed.
Check installed versions (prefer distribution metadata — `docling` may not set `__version__`):
```python
from importlib.metadata import version
print(version("docling"), version("docling-core"))
```
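To compare those versions against the minimums in the `compatibility` field without extra dependencies, a numeric tuple comparison is usually sufficient. The sketch below is an assumption-laden simplification: `version_tuple` and `below_minimum` are hypothetical helpers, and pre-release suffixes such as `rc1` are ignored.

```python
import re

# Minimums taken from the `compatibility` field of this skill
MINIMUMS = {"docling": "2.81.0", "docling-core": "2.67.1"}


def version_tuple(text: str) -> tuple:
    """Return the first three numeric components, padded with zeros."""
    numbers = [int(n) for n in re.findall(r"\d+", text)[:3]]
    return tuple(numbers + [0] * (3 - len(numbers)))


def below_minimum(installed: dict) -> list:
    """List packages that are missing or older than the required minimum."""
    return [
        name
        for name, minimum in MINIMUMS.items()
        if name not in installed
        or version_tuple(installed[name]) < version_tuple(minimum)
    ]


# Numeric comparison matters: as strings, "2.100.0" would sort before "2.81.0"
print(below_minimum({"docling": "2.100.0", "docling-core": "2.67.0"}))
```

The `installed` mapping can be filled from `importlib.metadata.version`, as in the snippet above.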
@@ -0,0 +1,20 @@
# Docling agent skill — improvement log
Agents may append a short entry after running **evaluate → refine** on a document
so similar files are faster to process next time. This file is optional and is
not tracked by every user; it is meant for **local** learning.
## Template (copy for each entry)
```markdown
### YYYY-MM-DD — <short source label>
- **Source type:** (e.g. scanned PDF / digital PDF / DOCX / URL)
- **Issues (first run):** …
- **Pipeline / flags that helped:** …
- **Final evaluator status:** pass | warn | fail
- **Notes:** …
```
## Entries
_(None — add your own after running conversions.)_
@@ -0,0 +1,253 @@
# Docling Pipelines Reference
Docling has two pipeline families for PDFs: **standard** (parse + OCR + layout/tables)
and **VLM** (page images through a vision-language model). The `docling` CLI
exposes both via `--pipeline standard` (default) and `--pipeline vlm`.
The right choice depends on document type, hardware, and latency budget.
---
## Decision matrix
| Document type | Recommended pipeline | Reason |
|---|---|---|
| Born-digital PDF (text selectable) | Standard | Fast, accurate, no GPU needed |
| Scanned PDF / image-only | Standard + OCR or VLM | Depends on quality |
| Complex layout (multi-column, dense tables) | VLM | Better structural understanding |
| Handwriting, formulas, figures with embedded text | VLM | Only viable option |
| Air-gapped / no GPU | Standard | Runs on CPU |
| Production scale, GPU server available | VLM (vLLM) | Best throughput |
| Apple Silicon / local dev | VLM (MLX) | MPS acceleration |
| Speed-critical, accuracy secondary | Standard, no tables | Fastest path |
---
## Pipeline 1: Standard PDF Pipeline
Uses deterministic PDF parsing (docling-parse) + optional neural OCR + neural
table structure detection.
### CLI usage
```bash
# Default (standard pipeline, OCR + tables enabled)
docling report.pdf --output /tmp/
# Custom OCR engine
docling report.pdf --ocr-engine tesserocr --output /tmp/
# Disable OCR or tables
docling report.pdf --no-ocr --output /tmp/
docling report.pdf --no-tables --output /tmp/
```
### Python API
```python
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
# Minimal — library defaults (standard PDF pipeline)
converter = DocumentConverter()
# Explicit PdfPipelineOptions (docling 2.81+): use InputFormat.PDF + PdfFormatOption.
# Do not use format_options={"pdf": opts}; that raises AttributeError on pipeline options.
opts = PdfPipelineOptions(
    do_ocr=True,  # False = skip OCR entirely
    do_table_structure=True,  # False = skip table detection (faster)
)
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=opts),
    }
)
```
### OCR engine options
All engines are plug-and-play via the CLI `--ocr-engine` flag or the Python
`ocr_options` parameter. Default is EasyOCR.
#### CLI flags
| Engine | CLI flag | Notes |
|--------|----------|-------|
| EasyOCR | `--ocr-engine easyocr` (default) | No extra pip beyond docling defaults |
| RapidOCR | `--ocr-engine rapidocr` | Lightweight; see Docling notes on read-only FS |
| Tesseract (Python) | `--ocr-engine tesserocr` | Needs `pip install tesserocr` and system Tesseract |
| Tesseract (CLI) | `--ocr-engine tesseract` | Shells out to `tesseract` binary |
| macOS Vision | `--ocr-engine ocrmac` | macOS only |
#### Python API
```python
# EasyOCR (default — no extra install needed)
from docling.datamodel.pipeline_options import PdfPipelineOptions
opts = PdfPipelineOptions(do_ocr=True) # uses EasyOCR by default
# Tesseract (requires system Tesseract + pip install tesserocr — see Docling install docs)
from docling.datamodel.pipeline_options import TesseractOcrOptions
opts = PdfPipelineOptions(do_ocr=True, ocr_options=TesseractOcrOptions())
# RapidOCR (lightweight, no C deps)
from docling.datamodel.pipeline_options import RapidOcrOptions
opts = PdfPipelineOptions(do_ocr=True, ocr_options=RapidOcrOptions())
# macOS native OCR
from docling.datamodel.pipeline_options import OcrMacOptions
opts = PdfPipelineOptions(do_ocr=True, ocr_options=OcrMacOptions())
```
---
## Pipeline 2: VLM Pipeline — local inference
Processes each page as an image through a vision-language model. Replaces the
standard layout detection + OCR stack entirely.
### CLI usage
```bash
# Default VLM model (granite_docling)
docling report.pdf --pipeline vlm --output /tmp/
# Specific model
docling report.pdf --pipeline vlm --vlm-model smoldocling --output /tmp/
```
### Python API
```python
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import VlmPipelineOptions
from docling.datamodel import vlm_model_specs
from docling.pipeline.vlm_pipeline import VlmPipeline
pipeline_options = VlmPipelineOptions(
    vlm_options=vlm_model_specs.GRANITEDOCLING_TRANSFORMERS,
    generate_page_images=True,
)
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_cls=VlmPipeline,
            pipeline_options=pipeline_options,
        )
    }
)
```
### Available model presets
| CLI `--vlm-model` | Python preset (`vlm_model_specs`) | Backend | Device | Notes |
|---|---|---|---|---|
| `granite_docling` | `GRANITEDOCLING_TRANSFORMERS` | HF Transformers | CPU/GPU | Default |
| `smoldocling` | `SMOLDOCLING_TRANSFORMERS` | HF Transformers | CPU/GPU | Lighter |
| (Python API only) | `GRANITEDOCLING_VLLM` | vLLM | GPU | Fast batch |
| (Python API only) | `GRANITEDOCLING_MLX` | MLX | Apple MPS | M-series Macs |
### Hybrid mode: PDF text + VLM for images/tables
Set `force_backend_text=True` (Python API only) to use deterministic text
extraction for normal text regions while routing images and tables through the
VLM. Reduces hallucination risk on text-heavy pages.
```python
pipeline_options = VlmPipelineOptions(
    vlm_options=vlm_model_specs.GRANITEDOCLING_TRANSFORMERS,
    force_backend_text=True,  # <-- hybrid mode
    generate_page_images=True,
)
```
---
## Pipeline 3: VLM Pipeline — remote API
Sends page images to any OpenAI-compatible endpoint. Works with vLLM,
LM Studio, Ollama, or a hosted model API.
This is available via the CLI with `--pipeline vlm --enable-remote-services`,
but endpoint URL, model name, and API key configuration require the Python API.
### CLI usage (basic)
```bash
docling report.pdf --pipeline vlm --enable-remote-services --output /tmp/
```
### Python API (full configuration)
```python
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import VlmPipelineOptions
from docling.datamodel.pipeline_options_vlm_model import ApiVlmOptions, ResponseFormat
from docling.pipeline.vlm_pipeline import VlmPipeline
vlm_opts = ApiVlmOptions(
    url="http://localhost:8000/v1/chat/completions",
    params=dict(
        model="ibm-granite/granite-docling-258M",
        max_tokens=4096,
    ),
    headers={"Authorization": "Bearer YOUR_KEY"},  # omit if not needed
    prompt="Convert this page to docling.",
    response_format=ResponseFormat.DOCTAGS,
    timeout=120,
    scale=2.0,
)
pipeline_options = VlmPipelineOptions(
    vlm_options=vlm_opts,
    generate_page_images=True,
    enable_remote_services=True,  # required: gates any HTTP call
)
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_cls=VlmPipeline,
            pipeline_options=pipeline_options,
        )
    }
)
```
**`enable_remote_services=True` is mandatory** for API pipelines. Docling
blocks outbound HTTP by default as a safety measure.
### Common API targets
| Server | Default URL | Notes |
|---|---|---|
| vLLM | `http://localhost:8000/v1/chat/completions` | Best throughput |
| LM Studio | `http://localhost:1234/v1/chat/completions` | Local dev |
| Ollama | `http://localhost:11434/v1/chat/completions` | Model: `ibm/granite-docling:258m` |
| OpenAI-compatible cloud | Provider URL | Set Authorization header |
---
## VLM install requirements
Local inference requires PyTorch + Transformers:
```bash
pip install 'docling[vlm]'
# or manually:
pip install torch transformers accelerate
```
MLX (Apple Silicon only):
```bash
pip install mlx mlx-vlm
```
vLLM backend (server-side):
```bash
pip install vllm
vllm serve ibm-granite/granite-docling-258M
```
@@ -0,0 +1,296 @@
#!/usr/bin/env python3
# SPDX-License-Identifier: MIT
"""
Evaluate a Docling JSON export and suggest pipeline / option changes.
Typical flow (agent or human):
docling input.pdf --to json --output /tmp/
docling input.pdf --to md --output /tmp/
python3 scripts/docling-evaluate.py /tmp/input.json --markdown /tmp/input.md
Exit codes: 0 = pass; 1 = fail or --fail-on-warn with status warn
"""
from __future__ import annotations
import argparse
import json
import sys
from collections import Counter
from pathlib import Path
from typing import Any
def load_document(path: Path):
    data = json.loads(path.read_text(encoding="utf-8"))
    try:
        from docling_core.types.doc.document import DoclingDocument
        return DoclingDocument.model_validate(data), data
    except Exception:
        return None, data
def page_numbers_from_doc(doc) -> set[int]:
    pages: set[int] = set()
    for item, _ in doc.iterate_items():
        for prov in getattr(item, "prov", None) or []:
            p = getattr(prov, "page_no", None)
            if p is not None:
                pages.add(int(p))
    return pages
def collect_text_samples(doc, limit: int = 200) -> list[str]:
    texts: list[str] = []
    for item, _ in doc.iterate_items():
        t = getattr(item, "text", None)
        if t and str(t).strip():
            texts.append(str(t).strip())
        if len(texts) >= limit:
            break
    return texts
def metrics_from_doc(doc) -> dict[str, Any]:
    n_tables = len(getattr(doc, "tables", []) or [])
    n_pictures = len(getattr(doc, "pictures", []) or [])
    n_headers = 0
    n_text_items = 0
    total_chars = 0
    for item, _ in doc.iterate_items():
        label = getattr(getattr(item, "label", None), "name", None) or ""
        if label == "SECTION_HEADER":
            n_headers += 1
        t = getattr(item, "text", None)
        if t:
            n_text_items += 1
            total_chars += len(str(t))
    pages = page_numbers_from_doc(doc)
    n_pages = len(pages) if pages else 0
    density = (total_chars / n_pages) if n_pages else total_chars
    samples = collect_text_samples(doc)
    rep = Counter(samples)
    top_rep = rep.most_common(1)[0] if rep else ("", 0)
    dup_ratio = (
        sum(c for _, c in rep.items() if c > 2) / max(len(rep), 1) if rep else 0.0
    )
    md = ""
    try:
        md = doc.export_to_markdown()
    except Exception:
        pass
    replacement = md.count("\ufffd") + sum(str(t).count("\ufffd") for t in samples)
    return {
        "page_count": n_pages,
        "section_headers": n_headers,
        "text_items": n_text_items,
        "total_text_chars": total_chars,
        "chars_per_page": round(density, 2),
        "tables": n_tables,
        "pictures": n_pictures,
        "markdown_chars": len(md),
        "replacement_chars": replacement,
        "most_repeated_text_count": int(top_rep[1]) if top_rep else 0,
        "duplicate_heavy": dup_ratio > 0.15 and len(samples) > 10,
    }


def heuristic_metrics(data: dict) -> dict[str, Any]:
    """Fallback when DoclingDocument cannot be validated (older export / drift)."""
    texts = data.get("texts") or []
    tables = data.get("tables") or []
    body = data.get("body") or {}
    children = body.get("children") if isinstance(body, dict) else None
    n_children = len(children) if isinstance(children, list) else 0
    char_sum = 0
    for t in texts:
        if isinstance(t, dict):
            char_sum += len(str(t.get("text") or ""))
    return {
        "page_count": 0,
        "section_headers": 0,
        "text_items": len(texts),
        "total_text_chars": char_sum,
        "chars_per_page": 0.0,
        "tables": len(tables),
        "pictures": len(data.get("pictures") or []),
        "markdown_chars": 0,
        "replacement_chars": 0,
        "most_repeated_text_count": 0,
        "duplicate_heavy": False,
        "heuristic_only": True,
        "body_children": n_children,
    }


def evaluate(
    m: dict[str, Any],
    *,
    expect_tables: bool,
    min_chars_per_page: float,
    min_markdown_chars: int,
) -> tuple[str, list[str], list[str]]:
    """Return (status, issues, de-duplicated actions) for a metrics dict."""
    issues: list[str] = []
    actions: list[str] = []

    if m.get("heuristic_only"):
        issues.append("Could not load full DoclingDocument; metrics are partial.")
        actions.append(
            "Ensure docling-core matches export; re-export with: docling <source> --to json --output <dir>"
        )

    cpp = m.get("chars_per_page") or 0
    if m.get("page_count", 0) >= 2 and cpp < min_chars_per_page:
        issues.append(
            f"Low text density ({cpp} chars/page); likely scan, image-heavy PDF, or extraction gap."
        )
        actions.append(
            "Retry: docling <source> --ocr-engine tesserocr (or rapidocr, ocrmac)"
        )
        actions.append("Retry: docling <source> --pipeline vlm")

    if m.get("replacement_chars", 0) > 5:
        issues.append(
            "Unicode replacement characters detected; OCR may be garbling text."
        )
        actions.append("Retry: docling <source> --ocr-engine tesserocr (or rapidocr)")
        actions.append(
            "Retry: docling <source> --pipeline vlm (use force_backend_text=True via Python API for hybrid)"
        )

    if m.get("duplicate_heavy") or (m.get("most_repeated_text_count", 0) > 8):
        issues.append(
            "Repeated text blocks; possible layout/OCR loop or bad reading order."
        )
        actions.append("Retry: docling <source> --pipeline vlm")
        actions.append(
            "If using VLM: try force_backend_text=True via Python API for text-heavy pages"
        )

    if expect_tables and m.get("tables", 0) == 0:
        issues.append("No tables detected but tables were expected.")
        actions.append(
            "Retry: docling <source> (tables are enabled by default; remove --no-tables if set)"
        )
        actions.append(
            "Retry: docling <source> --pipeline vlm (better for merged-cell or visual tables)"
        )

    mc = m.get("markdown_chars", 0)
    if mc > 0 and mc < min_markdown_chars and m.get("page_count", 0) >= 1:
        issues.append(f"Markdown export is very short ({mc} chars) for the page count.")
        actions.append(
            "Retry: docling <source> --pipeline vlm (or try different --ocr-engine)"
        )

    if m.get("text_items", 0) == 0 and m.get("page_count", 0) == 0:
        issues.append(
            "No text items and no page provenance; export may be empty or invalid."
        )
        actions.append(
            "Verify source file opens correctly; retry with: docling <source> --pipeline standard"
        )

    # De-duplicate actions while preserving first-seen order.
    seen = set()
    uniq_actions = []
    for a in actions:
        if a not in seen:
            seen.add(a)
            uniq_actions.append(a)

    if not issues:
        return "pass", [], []

    severe = m.get("text_items", 0) == 0 or (
        m.get("page_count", 0) >= 1 and mc < 50 and mc > 0
    )
    status = "fail" if severe or m.get("replacement_chars", 0) > 20 else "warn"
    return status, issues, uniq_actions


def parse_args():
    p = argparse.ArgumentParser(description="Evaluate Docling JSON export quality")
    p.add_argument(
        "json_path", type=Path, help="Path to DoclingDocument JSON (export_to_dict)"
    )
    p.add_argument(
        "--markdown",
        type=Path,
        default=None,
        help="Optional markdown file to cross-check length",
    )
    p.add_argument("--expect-tables", action="store_true")
    p.add_argument("--min-chars-per-page", type=float, default=120.0)
    p.add_argument("--min-markdown-chars", type=int, default=200)
    p.add_argument("--fail-on-warn", action="store_true")
    p.add_argument(
        "--quiet", action="store_true", help="Only print JSON report to stdout"
    )
    return p.parse_args()


def main() -> None:
    args = parse_args()
    if not args.json_path.is_file():
        print(json.dumps({"error": f"not found: {args.json_path}"}), file=sys.stderr)
        sys.exit(1)

    doc, raw = load_document(args.json_path)
    if doc is not None:
        m = metrics_from_doc(doc)
    else:
        m = heuristic_metrics(raw)

    if args.markdown and args.markdown.is_file():
        md_len = len(args.markdown.read_text(encoding="utf-8"))
        m["markdown_file_chars"] = md_len
        if m.get("markdown_chars", 0) == 0:
            m["markdown_chars"] = md_len

    status, issues, actions = evaluate(
        m,
        expect_tables=args.expect_tables,
        min_chars_per_page=args.min_chars_per_page,
        min_markdown_chars=args.min_markdown_chars,
    )

    report = {
        "status": status,
        "metrics": m,
        "issues": issues,
        "recommended_actions": actions,
        "next_steps_for_agent": [
            "Re-run docling with flags from recommended_actions.",
            "Re-export JSON and run this script again until status is pass.",
            "Append a row to improvement-log.md (see SKILL.md).",
        ],
    }
    print(json.dumps(report, indent=2, ensure_ascii=False))

    if not args.quiet:
        print(f"\nstatus={status}", file=sys.stderr)
        if issues:
            print("issues:", file=sys.stderr)
            for i in issues:
                print(f"  - {i}", file=sys.stderr)
        if actions:
            print("recommended_actions:", file=sys.stderr)
            for a in actions:
                print(f"  - {a}", file=sys.stderr)

    if status == "fail":
        sys.exit(1)
    if status == "warn" and args.fail_on_warn:
        sys.exit(1)
    sys.exit(0)


if __name__ == "__main__":
    main()
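
The script prints a JSON report with `status`, `metrics`, `issues`, and `recommended_actions` keys on stdout. A minimal sketch of how a caller might consume that report is shown here; the helper name `pick_retry` and the sample report are hypothetical and only illustrate the report shape, reusing message strings from `evaluate`.

```python
import json
from typing import Optional


def pick_retry(report_json: str) -> Optional[str]:
    """Return the first recommended action, or None when status is "pass".

    Hypothetical helper: it only reads the keys the script writes to stdout.
    """
    report = json.loads(report_json)
    if report.get("status") == "pass":
        return None
    actions = report.get("recommended_actions") or []
    return actions[0] if actions else None


# Illustrative report, shaped like the script's output (not real results).
sample = json.dumps(
    {
        "status": "warn",
        "metrics": {"page_count": 3, "tables": 0},
        "issues": ["No tables detected but tables were expected."],
        "recommended_actions": [
            "Retry: docling <source> --pipeline vlm (better for merged-cell or visual tables)"
        ],
    }
)

print(pick_retry(sample))
```

Because the script exits with code 1 on `fail` (and on `warn` when `--fail-on-warn` is set), a caller could also branch on the exit code and parse stdout only when it needs the action list.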
@@ -0,0 +1,3 @@
# pip install -r scripts/requirements.txt
docling>=2.81.0
docling-core>=2.67.1
@@ -7,6 +7,7 @@ Here some of our picks to get you started:
- 📤 [{==\[:fontawesome-solid-flask:{ title="beta feature" } beta\]==} structured data extraction](./extraction.ipynb)
- examples for ✍️ [serialization](./serialization.ipynb) and ✂️ [chunking](./hybrid_chunking.ipynb), including [user-defined customizations](./advanced_chunking_and_serialization.ipynb)
- 🖼️ [picture annotations](./pictures_description.ipynb) and [enrichments](./enrich_doclingdocument.py)
- 🤝 [**Agent skill**](./agent_skill/docling-document-intelligence/README.md) for Cursor and other assistants (`SKILL.md`, pipeline reference, `docling-evaluate.py` helper)
👈 ... and there is much more: explore all the examples using the navigation menu on the side
@@ -80,6 +80,7 @@ nav:
    - Plugins: concepts/plugins.md
  - Examples:
    - Examples: examples/index.md
    - "🤝 Agent skill (Cursor / assistants)": examples/agent_skill/docling-document-intelligence/README.md
    - 🔀 Conversion:
      - "Simple conversion": examples/minimal.py
      - "Custom conversion": examples/custom_convert.py