mirror of
https://github.com/docling-project/docling.git
synced 2026-05-17 13:10:38 +00:00
docs: add agent skill bundle for coding assistants (SKILL.md, pipelines, convert/evaluate) (#3174)
* docs: add agent skill bundle with convert/evaluate helpers

  - Add docs/examples/agent_skill/docling-document-intelligence/ with SKILL.md, pipelines.md, EXAMPLE.md, improvement-log template, and scripts/docling-convert.py + docling-evaluate.py (standard/vlm-local/vlm-api).
  - Document InputFormat.PDF + PdfFormatOption for explicit PdfPipelineOptions.
  - Link from examples index and mkdocs nav.

  Made-with: Cursor

* docs: align agent skill README and EXAMPLE with Cursor bundle

  - Document both ~/.cursor/skills and docs/examples paths.
  - README notes repo parity for PRs and local installs.

  Made-with: Cursor

* DCO Remediation Commit for jehlum11 <jehlum11@gmail.com>

  I, jehlum11 <jehlum11@gmail.com>, hereby add my Signed-off-by to this commit: 2d268ffb6f
  I, jehlum11 <jehlum11@gmail.com>, hereby add my Signed-off-by to this commit: 041e709c66

  Signed-off-by: jehlum11 <jehlum11@gmail.com>
  Made-with: Cursor

* docs: refactor agent skill to use docling CLI for conversion

  Address maintainer feedback: the custom docling-convert.py script was largely redundant with the existing docling CLI. This commit:

  - Removes scripts/docling-convert.py (redundant with `docling` CLI)
  - Refactors SKILL.md (v1.4 → v2.0) to use `docling` CLI for all conversion tasks, reserving the Python API only for features the CLI does not expose (chunking, VLM API endpoint config, force_backend_text hybrid mode)
  - Updates docling-evaluate.py recommended_actions to reference `docling` CLI flags instead of the removed script
  - Updates README.md, EXAMPLE.md, pipelines.md to use `docling` CLI examples throughout
  - Simplifies requirements.txt (removes packaging dependency)

  The only custom script retained is docling-evaluate.py, which provides heuristic quality evaluation, functionality the CLI does not cover.

  Signed-off-by: jehlum11 <jehlum11@gmail.com>
  Made-with: Cursor

* docs: fix ruff format on docling-evaluate.py

  Signed-off-by: jehlum11 <jehlum11@gmail.com>
  Made-with: Cursor

---------

Signed-off-by: jehlum11 <jehlum11@gmail.com>
@@ -0,0 +1,99 @@
# Using the Docling agent skill

[Agent Skills](https://agentskills.io/specification) are folders of instructions that AI coding agents (Cursor, Claude Code, GitHub Copilot, etc.) can load when relevant.

## Where this bundle lives

- **Cursor (local):** `~/.cursor/skills/docling-document-intelligence/` (or copy this folder there).
- **Docling repository (docs + PRs):** `docs/examples/agent_skill/docling-document-intelligence/` in [github.com/docling-project/docling](https://github.com/docling-project/docling).

The two trees are kept in sync; use either source.

## Install (copy into your agent's skills directory)

```bash
# From a checkout of the Docling repo
cp -r docs/examples/agent_skill/docling-document-intelligence ~/.cursor/skills/

# Or copy from another machine / archive into e.g. ~/.claude/skills/
```

No extra config is required beyond installing Python dependencies (below).

## Usage

Open your agent-enabled IDE and ask, for example:

```
Parse report.pdf and give me a structural outline
```

```
Convert https://arxiv.org/pdf/2408.09869 to markdown
```

```
Chunk invoice.pdf for RAG ingestion with 512 token chunks
```

```
Process scanned.pdf using the VLM pipeline
```

The agent should read `SKILL.md`, match the task, and run the appropriate
`docling` CLI command or Python API call.

## Running the docling CLI directly

```bash
pip install docling docling-core

# Basic conversion to Markdown
docling report.pdf --output /tmp/

# JSON output
docling report.pdf --to json --output /tmp/

# Custom OCR engine
docling report.pdf --ocr-engine rapidocr --output /tmp/

# VLM pipeline
docling scanned.pdf --pipeline vlm --output /tmp/

# VLM with specific model
docling scanned.pdf --pipeline vlm --vlm-model granite_docling --output /tmp/

# Remote VLM services
docling doc.pdf --pipeline vlm --enable-remote-services --output /tmp/
```
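
When a script or agent needs to assemble these invocations programmatically, a small helper can keep the flag handling in one place. The sketch below is illustrative only: `build_docling_command` is a hypothetical helper, not part of Docling, and it uses only flags that appear in the commands above (`--to`, `--pipeline`, `--ocr-engine`, `--output`).

```python
import shlex
from typing import List, Optional


def build_docling_command(
    source: str,
    output_dir: str = "/tmp/",
    to: Optional[str] = None,          # e.g. "json" or "md"; None keeps the CLI default
    pipeline: Optional[str] = None,    # e.g. "vlm"; None keeps the CLI default
    ocr_engine: Optional[str] = None,  # e.g. "rapidocr"
) -> List[str]:
    """Return an argv list for the `docling` CLI (hypothetical helper)."""
    cmd = ["docling", source]
    if to:
        cmd += ["--to", to]
    if pipeline:
        cmd += ["--pipeline", pipeline]
    if ocr_engine:
        cmd += ["--ocr-engine", ocr_engine]
    cmd += ["--output", output_dir]
    return cmd


if __name__ == "__main__":
    argv = build_docling_command("report.pdf", to="json")
    # Print a shell-safe rendering; pass `argv` to subprocess.run to execute it.
    print(shlex.join(argv))
```

Returning a list rather than a string means the command can be handed to `subprocess.run(argv, check=True)` without shell quoting concerns.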

## Evaluate and refine

```bash
docling report.pdf --to json --output /tmp/
docling report.pdf --to md --output /tmp/
python3 scripts/docling-evaluate.py /tmp/report.json --markdown /tmp/report.md
```

If the report shows `warn` or `fail`, follow `recommended_actions`, re-convert
with `docling` using the suggested flags, and optionally append a note to
`improvement-log.md` (see `SKILL.md` section 7).
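
As a rough illustration of acting on the evaluator's output, the sketch below parses a report and decides whether another conversion is warranted. It assumes the report is a JSON object with a `status` string and a `recommended_actions` list, as described in `SKILL.md`; the contents of the sample report and the helper name `summarize_report` are hypothetical.

```python
import json
from typing import Any, Dict, List


def summarize_report(report_json: str) -> Dict[str, Any]:
    """Parse the evaluator's JSON output and decide whether to re-convert."""
    report = json.loads(report_json)
    status = report.get("status", "fail")  # treat a missing status as a failure
    actions: List[Any] = report.get("recommended_actions", [])
    return {
        "status": status,
        "needs_refinement": status in {"warn", "fail"},
        "next_action": actions[0] if actions else None,
    }


if __name__ == "__main__":
    # Hypothetical report; the real field contents may differ.
    sample = json.dumps(
        {
            "status": "warn",
            "metrics": {},
            "issues": ["no tables found"],
            "recommended_actions": ["try --pipeline vlm"],
        }
    )
    print(summarize_report(sample))
```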

## What the skill covers

| Task | How to ask |
|---|---|
| Parse PDF / DOCX / PPTX / HTML / image | "parse this file" |
| Convert to Markdown | "convert to markdown" |
| Export as structured JSON | "export as JSON" |
| Chunk for RAG | "chunk for RAG", "prepare for ingestion" |
| Analyze structure | "show me the headings and tables" |
| Use VLM pipeline | "use the VLM pipeline", "process scanned PDF" |
| Use remote inference | "use vLLM", "call the API pipeline" |

## Further reading

- [Agent Skills specification](https://agentskills.io/specification)
- [Docling documentation](https://docling-project.github.io/docling/)
- [Docling CLI reference](https://docling-project.github.io/docling/reference/cli/)
- [Docling GitHub](https://github.com/docling-project/docling)
@@ -0,0 +1,43 @@
# Docling agent skill (Cursor & compatible assistants)

This folder is an **[Agent Skill](https://agentskills.io/specification)**-style bundle for AI coding assistants: structured instructions (`SKILL.md`), a pipeline reference (`pipelines.md`), and a quality evaluator (`scripts/docling-evaluate.py`).

Conversion is done via the **`docling` CLI** (included with `pip install docling`).
The evaluator provides a **convert → evaluate → refine** feedback loop that the
existing CLI does not cover.

It complements the official [Docling documentation](https://docling-project.github.io/docling/) and the [`docling` CLI reference](https://docling-project.github.io/docling/reference/cli/).

The same layout is published in the Docling repo at `docs/examples/agent_skill/docling-document-intelligence/` (for docs and PRs).

## Contents

| Path | Purpose |
|------|---------|
| [`SKILL.md`](SKILL.md) | Full skill instructions (pipelines, chunking, evaluation loop) |
| [`pipelines.md`](pipelines.md) | Standard vs VLM pipelines, OCR engines, API notes |
| [`EXAMPLE.md`](EXAMPLE.md) | Installing into `~/.cursor/skills/`; running the CLI and evaluator |
| [`improvement-log.md`](improvement-log.md) | Optional template for local "what worked" notes |
| [`scripts/docling-evaluate.py`](scripts/docling-evaluate.py) | Heuristic quality report on JSON (+ optional Markdown) |
| [`scripts/requirements.txt`](scripts/requirements.txt) | Minimal pip deps for the evaluator |

## Quick start

```bash
pip install docling docling-core

# Convert to Markdown
docling https://arxiv.org/pdf/2408.09869 --output /tmp/

# Convert to JSON
docling https://arxiv.org/pdf/2408.09869 --to json --output /tmp/

# Evaluate quality
python3 scripts/docling-evaluate.py /tmp/2408.09869.json --markdown /tmp/2408.09869.md
```

Use `--pipeline vlm` for vision-model pipelines; see `SKILL.md` and `pipelines.md`.

## License

MIT (aligned with [Docling](https://github.com/docling-project/docling)).
@@ -0,0 +1,393 @@
---
name: docling-document-intelligence
description: >
  Parse, convert, chunk, and analyze documents using Docling. Use this skill
  when the user provides a document (PDF, DOCX, PPTX, HTML, image) as a file
  path or URL and wants to: extract text or structured content, convert to
  Markdown or JSON, chunk the document for RAG ingestion, analyze document
  structure (headings, tables, figures, reading order), or run quality
  evaluation with iterative pipeline tuning. Triggers: "parse this PDF",
  "convert to markdown", "chunk for RAG", "extract tables", "analyze document
  structure", "prepare for ingestion", "process document", "evaluate docling
  output", "improve conversion quality".
license: MIT
compatibility: Requires Python 3.10+, docling>=2.81.0, docling-core>=2.67.1
metadata:
  author: docling-project
  version: "2.0"
  upstream: https://github.com/docling-project/docling
allowed-tools: Bash(docling:*) Bash(python3:*) Bash(pip:*)
---

# Docling Document Intelligence Skill

Use this skill to parse, convert, chunk, and analyze documents with Docling.
It handles both local file paths and URLs, and outputs either Markdown or
structured JSON (`DoclingDocument`).

Conversion uses the **`docling` CLI** (installed with `pip install docling`).
The Python API is used only for features the CLI does not expose (chunking,
VLM remote-API endpoint configuration, hybrid `force_backend_text` mode).

## Scope

| Task | Covered |
|---|---|
| Parse PDF / DOCX / PPTX / HTML / image | ✅ |
| Convert to Markdown | ✅ |
| Export as DoclingDocument JSON | ✅ |
| Chunk for RAG (hybrid: heading + token) | ✅ (Python API) |
| Analyze structure (headings, tables, figures) | ✅ (Python API) |
| OCR for scanned PDFs | ✅ (auto-enabled) |
| Multi-source batch conversion | ✅ |

## Step-by-Step Instructions

### 1. Resolve the input

Determine whether the user supplied a **local path** or a **URL**.
The `docling` CLI accepts both directly.

```bash
docling path/to/file.pdf
docling https://example.com/a.pdf
```
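
If an agent wants to validate the source before invoking the CLI, the distinction can be made with the standard library alone. This is a minimal sketch under the assumption that only `http` and `https` sources count as URLs; `classify_source` and `resolve_source` are hypothetical helper names, not Docling functions.

```python
from pathlib import Path
from urllib.parse import urlparse


def classify_source(source: str) -> str:
    """Return "url" for http(s) sources, otherwise "path"."""
    scheme = urlparse(source).scheme.lower()
    return "url" if scheme in {"http", "https"} else "path"


def resolve_source(source: str) -> str:
    """Validate a source before handing it to the `docling` CLI."""
    if classify_source(source) == "url":
        return source
    path = Path(source).expanduser()
    if not path.is_file():
        # Fail early with a clear message instead of a CLI error later.
        raise FileNotFoundError(f"No such file: {path}")
    return str(path)
```

Checking the scheme against an explicit allow-list keeps Windows drive letters such as `C:\docs\a.pdf` from being mistaken for URLs.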

### 2. Choose a pipeline

Docling has two pipeline families. Pick based on document type and hardware.

| Pipeline | CLI flag | Best for | Key tradeoff |
|---|---|---|---|
| **Standard** (default) | `--pipeline standard` | Born-digital PDFs, speed | No GPU needed; OCR for scanned pages |
| **VLM** | `--pipeline vlm` | Complex layouts, handwriting, formulas | Needs GPU; slower |

See [pipelines.md](pipelines.md) for the full decision matrix, OCR engine table
(EasyOCR, RapidOCR, Tesseract, macOS), and VLM model presets.

### 3. Convert the document

#### CLI (preferred for straightforward conversions)

```bash
# Markdown (default output)
docling report.pdf --output /tmp/

# JSON (structured, lossless)
docling report.pdf --to json --output /tmp/

# VLM pipeline
docling report.pdf --pipeline vlm --output /tmp/

# VLM with specific model
docling report.pdf --pipeline vlm --vlm-model granite_docling --output /tmp/

# Custom OCR engine
docling report.pdf --ocr-engine tesserocr --output /tmp/

# Disable OCR or tables for speed
docling report.pdf --no-ocr --output /tmp/
docling report.pdf --no-tables --output /tmp/

# Remote VLM services
docling report.pdf --pipeline vlm --enable-remote-services --output /tmp/
```

The CLI writes output files to the `--output` directory, named after the
input file (e.g. `report.pdf` → `report.md` or `report.json`).
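
To locate the file the CLI produced (for example, before passing it to the evaluator), the naming rule above can be approximated in a few lines. This is a heuristic sketch, not Docling's own naming logic: `expected_output_path` is a hypothetical helper, and the set of extensions it strips is an assumption chosen so that `report.pdf` maps to `report.md` while an extension-less URL such as `https://arxiv.org/pdf/2408.09869` keeps its full last path segment.

```python
from pathlib import Path
from urllib.parse import urlparse

# Assumed set of input extensions to strip; not taken from Docling's source.
KNOWN_INPUT_SUFFIXES = {".pdf", ".docx", ".pptx", ".html", ".png", ".jpg", ".jpeg"}


def expected_output_path(source: str, output_dir: str, to: str = "md") -> Path:
    """Guess where the CLI will write its output (a hedged heuristic)."""
    parsed = urlparse(source)
    # For URLs, use only the path component; for local files, the path itself.
    name = Path(parsed.path if parsed.scheme in {"http", "https"} else source).name
    suffix = Path(name).suffix.lower()
    stem = name[: -len(suffix)] if suffix in KNOWN_INPUT_SUFFIXES else name
    return Path(output_dir) / f"{stem}.{to}"
```

If the guessed path does not exist after conversion, listing the `--output` directory is the more reliable fallback.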

**CLI reference:** <https://docling-project.github.io/docling/reference/cli/>

#### Python API (for advanced features)

Use the Python API when you need features the CLI does not expose:
chunking, VLM remote-API endpoint configuration, or hybrid
`force_backend_text` mode.

**Docling 2.81+ API note:** `DocumentConverter(format_options=...)` expects
`dict[InputFormat, FormatOption]` (e.g. `InputFormat.PDF` → `PdfFormatOption`).
Using string keys like `{"pdf": PdfPipelineOptions(...)}` fails at runtime with
`AttributeError: 'PdfPipelineOptions' object has no attribute 'backend'`.

**Standard pipeline (default):**
```python
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions

converter = DocumentConverter()
result = converter.convert("report.pdf")

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_options=PdfPipelineOptions(do_ocr=True, do_table_structure=True),
        ),
    }
)
result = converter.convert("report.pdf")
```

**VLM pipeline — local (GraniteDocling via HF Transformers):**
```python
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import VlmPipelineOptions
from docling.datamodel import vlm_model_specs
from docling.pipeline.vlm_pipeline import VlmPipeline

pipeline_options = VlmPipelineOptions(
    vlm_options=vlm_model_specs.GRANITEDOCLING_TRANSFORMERS,
    generate_page_images=True,
)
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_cls=VlmPipeline,
            pipeline_options=pipeline_options,
        )
    }
)
result = converter.convert("report.pdf")
```

**VLM pipeline — remote API (vLLM / LM Studio / Ollama):**

Configuring the endpoint is only possible via the Python API. The CLI flag
`--enable-remote-services` permits remote calls, but the CLI does not expose
the endpoint URL, model name, or API key.

```python
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import VlmPipelineOptions
from docling.datamodel.pipeline_options_vlm_model import ApiVlmOptions, ResponseFormat
from docling.pipeline.vlm_pipeline import VlmPipeline

vlm_opts = ApiVlmOptions(
    url="http://localhost:8000/v1/chat/completions",
    params=dict(model="ibm-granite/granite-docling-258M", max_tokens=4096),
    prompt="Convert this page to docling.",
    response_format=ResponseFormat.DOCTAGS,
    timeout=120,
)
pipeline_options = VlmPipelineOptions(
    vlm_options=vlm_opts,
    generate_page_images=True,
    enable_remote_services=True,  # required — gates all outbound HTTP
)
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_cls=VlmPipeline,
            pipeline_options=pipeline_options,
        )
    }
)
result = converter.convert("report.pdf")
```

**Hybrid mode (force_backend_text) — Python API only:**

Uses deterministic PDF text extraction for text regions while routing
images and tables through the VLM. Reduces hallucination on text-heavy pages.

```python
pipeline_options = VlmPipelineOptions(
    vlm_options=vlm_model_specs.GRANITEDOCLING_TRANSFORMERS,
    force_backend_text=True,
    generate_page_images=True,
)
```

`result.document` is a `DoclingDocument` object in all cases.

### 4. Choose output format

**Markdown** (default, human-readable):
```bash
docling report.pdf --to md --output /tmp/
```
Or via Python: `result.document.export_to_markdown()`

**JSON / DoclingDocument** (structured, lossless):
```bash
docling report.pdf --to json --output /tmp/
```
Or via Python: `result.document.export_to_dict()`

> If the user does not specify a format, ask: "Should I output Markdown or
> structured JSON (DoclingDocument)?"

### 5. Chunk for RAG (hybrid strategy)

Chunking is only available via the Python API.

Default: **hybrid chunker** — splits first by heading hierarchy, then
subdivides oversized sections by token count. This preserves semantic
boundaries while respecting model context limits.

The tokenizer API changed in docling-core 2.8.0. Pass a `BaseTokenizer`
object, not a raw string:

**HuggingFace tokenizer (default):**
```python
from docling.chunking import HybridChunker
from docling_core.transforms.chunker.tokenizer.huggingface import HuggingFaceTokenizer

tokenizer = HuggingFaceTokenizer.from_pretrained(
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    max_tokens=512,
)
chunker = HybridChunker(tokenizer=tokenizer, merge_peers=True)
chunks = list(chunker.chunk(result.document))

for chunk in chunks:
    embed_text = chunker.contextualize(chunk)
    print(chunk.meta.headings)  # heading breadcrumb list
    # Page numbers live on each item's provenance, not on `meta.origin`
    pages = sorted({prov.page_no for item in chunk.meta.doc_items for prov in item.prov})
    print(pages)  # source page numbers
```

**OpenAI tokenizer (for OpenAI embedding models):**
```python
import tiktoken
from docling_core.transforms.chunker.tokenizer.openai import OpenAITokenizer

tokenizer = OpenAITokenizer(
    tokenizer=tiktoken.encoding_for_model("text-embedding-3-small"),
    max_tokens=8192,
)
# Requires: pip install 'docling-core[chunking-openai]'
```

For chunking strategies and tokenizer details, see the Docling documentation
on chunking and `HybridChunker`.
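
The two-stage idea behind the hybrid strategy (split on headings first, then cap each section by token count) can be illustrated without Docling at all. The toy sketch below is not `HybridChunker`'s implementation: it works on plain Markdown, treats whitespace-separated words as tokens, and `toy_hybrid_chunks` is a hypothetical name used only for illustration.

```python
import re
from typing import List, Tuple


def toy_hybrid_chunks(markdown: str, max_tokens: int) -> List[Tuple[str, str]]:
    """Split by headings first, then cap each section at max_tokens words.

    Returns (heading, text) pairs. Whitespace-separated words stand in for
    real tokenizer tokens.
    """
    chunks: List[Tuple[str, str]] = []
    heading = ""
    words: List[str] = []

    def flush() -> None:
        # Stage 2: subdivide an oversized section by token count.
        for start in range(0, len(words), max_tokens):
            chunks.append((heading, " ".join(words[start : start + max_tokens])))

    for line in markdown.splitlines():
        match = re.match(r"^#{1,6}\s+(.*)", line)
        if match:
            flush()  # Stage 1: a heading closes the previous section.
            heading, words = match.group(1).strip(), []
        else:
            words.extend(line.split())
    flush()
    return chunks
```

Every piece of an oversized section keeps the heading it came from, which mirrors why heading context is worth carrying into the text that gets embedded.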

### 6. Analyze document structure

Use the `DoclingDocument` object directly to inspect structure:

```python
doc = result.document

for item, level in doc.iterate_items():
    if hasattr(item, 'label') and item.label.name == 'SECTION_HEADER':
        print(f"{'#' * level} {item.text}")

for table in doc.tables:
    print(table.export_to_dataframe())   # pandas DataFrame
    print(table.export_to_markdown())

for picture in doc.pictures:
    print(picture.caption_text(doc))     # caption if present
```

For the full API surface, see Docling's structure and table export docs.

### 7. Evaluate output and iterate (required for "best effort" conversions)

After **every** conversion where the user cares about fidelity (not quick
previews), run the bundled evaluator on the JSON export, then refine the
pipeline if needed. This is how the agent **checks its work** and **improves
the run** without guessing.

**Step A — Produce JSON and optional Markdown**

```bash
docling "<source>" --to json --output /tmp/
docling "<source>" --to md --output /tmp/
```

**Step B — Evaluate**

```bash
python3 scripts/docling-evaluate.py /tmp/<filename>.json --markdown /tmp/<filename>.md
```

If the user expects tables (invoices, spreadsheets in PDF), add
`--expect-tables`. Tighten gates with `--fail-on-warn` in CI-style checks.

The script prints a JSON report to stdout: `status` (`pass` | `warn` | `fail`),
`metrics`, `issues`, and `recommended_actions` (concrete `docling` CLI
flags to try next).

**Step C — Refinement loop (max 3 attempts unless the user says otherwise)**

1. If `status` is `warn` or `fail`, apply **one** primary change from
   `recommended_actions` (e.g. switch `--pipeline vlm`, change
   `--ocr-engine`, ensure tables are enabled).
2. Re-convert with `docling`, re-run `scripts/docling-evaluate.py`.
3. Stop when `status` is `pass`, or after 3 iterations — then summarize what
   worked and any remaining issues for the user.
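
The refinement loop in Step C can be sketched as a bounded retry with one change applied per attempt. The code below is a minimal illustration under stated assumptions: `refine` is a hypothetical function, the `convert` and `evaluate` callables are supplied by the caller (for example, thin wrappers around the `docling` CLI and the evaluator script), and each entry in `recommended_actions` is assumed to be a list of CLI flags, which may not match the evaluator's actual report shape.

```python
from typing import Callable, Dict, List, Tuple

Report = Dict[str, object]


def refine(
    convert: Callable[[List[str]], None],  # runs `docling` with the given extra flags
    evaluate: Callable[[], Report],        # runs the evaluator and returns its report
    max_attempts: int = 3,
) -> Tuple[Report, List[List[str]]]:
    """Convert, evaluate, and apply one recommended change per attempt."""
    flags: List[str] = []
    history: List[List[str]] = []
    report: Report = {}
    for _ in range(max_attempts):
        convert(flags)
        history.append(list(flags))
        report = evaluate()
        actions = report.get("recommended_actions") or []
        if report.get("status") == "pass" or not actions:
            break
        # Assumption: each action is a list of CLI flags, e.g. ["--pipeline", "vlm"].
        flags = flags + list(actions[0])
    return report, history
```

The returned `history` records which flags each attempt used, which is the information needed for the summary to the user and for the improvement log.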

**Step D — Self-improvement log (skill memory)**

After a successful pass **or** after the final iteration, append one entry to
[improvement-log.md](improvement-log.md) in this skill directory:

- Source type (e.g. scanned PDF, digital PDF, DOCX)
- First-run problems (from `issues`)
- Pipeline + flags that fixed or best mitigated them
- Final `status` and one line of subjective quality notes

Users may git-ignore this log; it exists for **local** learning, so that
future runs on similar documents start closer to the right pipeline.

### 8. Agent quality checklist (manual, if script unavailable)

If `scripts/docling-evaluate.py` cannot run, still verify:

| Check | Action if bad |
|---|---|
| Page count matches source (roughly) | Re-run; try `--pipeline vlm` if layout is complex |
| Markdown is not near-empty | Enable OCR / VLM |
| Tables missing when visually obvious | Remove `--no-tables`; try `--pipeline vlm` |
| `\ufffd` replacement characters | Different `--ocr-engine` or `--pipeline vlm` |
| Same line repeated many times | `--pipeline vlm` or hybrid `force_backend_text` (Python API) |
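
Several of the checks in this table can be automated against the exported Markdown with a few lines of standard-library Python. This is an illustrative sketch rather than the bundled evaluator: `manual_markdown_checks` is a hypothetical helper, and the thresholds (`min_chars`, `max_repeats`) are arbitrary assumptions that should be tuned per document type.

```python
from collections import Counter
from typing import List


def manual_markdown_checks(
    markdown: str, min_chars: int = 200, max_repeats: int = 5
) -> List[str]:
    """Return a list of problems found; thresholds are illustrative assumptions."""
    problems: List[str] = []
    if len(markdown.strip()) < min_chars:
        problems.append("near-empty output: enable OCR or try --pipeline vlm")
    if "\ufffd" in markdown:
        problems.append("replacement characters: try another --ocr-engine or --pipeline vlm")
    # Count identical non-blank lines to spot runaway repetition.
    lines = [line.strip() for line in markdown.splitlines() if line.strip()]
    if lines and Counter(lines).most_common(1)[0][1] > max_repeats:
        problems.append("repeated line: try --pipeline vlm or force_backend_text")
    return problems
```

Page-count and missing-table checks are left out because they need the source document or the JSON export, not just the Markdown text.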

## Common Edge Cases

| Situation | Handling |
|---|---|
| Scanned / image-only PDF | Standard pipeline with OCR, or `--pipeline vlm` for best quality |
| Password-protected PDF | `--pdf-password PASSWORD`; will raise `ConversionError` if wrong |
| Very large document (500+ pages) | Standard pipeline with `--no-tables` for speed |
| Complex layout / multi-column | `--pipeline vlm`; standard may misorder reading flow |
| Handwriting or formulas | `--pipeline vlm` only — standard OCR will not handle these |
| URL behind auth | Pre-download to temp file; pass local path |
| Tables with merged cells | `table.export_to_markdown()` handles spans; VLM often more accurate |
| Non-UTF-8 encoding | Docling normalises internally; no special handling needed |
| VLM hallucinating text | `force_backend_text=True` via Python API for hybrid mode |
| VLM API call blocked | `--enable-remote-services` (CLI) or `enable_remote_services=True` (Python) |
| Apple Silicon | `--vlm-model granite_docling` with MLX backend, or `GRANITEDOCLING_MLX` preset (Python API) |

## Pipeline reference

Full decision matrix, all OCR engine options, VLM model presets, and API
server configuration: [pipelines.md](pipelines.md)

## Output conventions

- Always report the number of pages and conversion status.
- When evaluation is in scope, report evaluator `status`, top `issues`, and
  which refinement attempt produced the final output.
- For Markdown output: wrap in a fenced code block only if the user will copy/paste it; otherwise render directly.
- For JSON output: pretty-print with `indent=2` unless the user specifies otherwise.
- For chunks: report total chunk count, min/max/avg token counts.
- For structure analysis: summarise heading tree + table count + figure count before going into detail.
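
The chunk summary asked for above (total count plus min/max/avg token counts) is a small aggregation once per-chunk token counts are available. The sketch below is a hypothetical helper, `chunk_token_stats`, that takes a plain list of integers; how those counts are obtained (for example, by counting tokens of each contextualized chunk with the same tokenizer passed to the chunker) is left to the caller and is an assumption about the surrounding code.

```python
from statistics import mean
from typing import Dict, List


def chunk_token_stats(token_counts: List[int]) -> Dict[str, float]:
    """Summarize per-chunk token counts for the final report."""
    if not token_counts:
        # Avoid min()/max() errors on an empty document.
        return {"count": 0, "min": 0, "max": 0, "avg": 0.0}
    return {
        "count": len(token_counts),
        "min": min(token_counts),
        "max": max(token_counts),
        "avg": round(mean(token_counts), 1),
    }
```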

## Dependencies

```bash
pip install docling docling-core
# For OpenAI tokenizer support:
pip install 'docling-core[chunking-openai]'
```

The `docling` CLI is included with the `docling` package — no separate install needed.

Check installed versions (prefer distribution metadata — `docling` may not set `__version__`):

```python
from importlib.metadata import version
print(version("docling"), version("docling-core"))
```
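
To compare the installed versions against the minimums in this skill's `compatibility` field without adding a dependency, a simple numeric comparison is usually enough. This is a rough sketch: `parse_version` and `check_minimums` are hypothetical helpers, and the parser only looks at the first three numeric components, so pre-release suffixes are ignored.

```python
import re
from importlib.metadata import PackageNotFoundError, version
from typing import Dict, Tuple

# Minimums taken from the skill's `compatibility` field.
MINIMUMS = {"docling": "2.81.0", "docling-core": "2.67.1"}


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn "2.81.0" into (2, 81, 0); suffixes such as "rc1" are ignored."""
    return tuple(int(part) for part in re.findall(r"\d+", text)[:3])


def check_minimums(minimums: Dict[str, str] = MINIMUMS) -> Dict[str, str]:
    """Return {package: problem} for packages that are missing or too old."""
    problems: Dict[str, str] = {}
    for name, minimum in minimums.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            problems[name] = "not installed"
            continue
        if parse_version(installed) < parse_version(minimum):
            problems[name] = f"{installed} < {minimum}"
    return problems


if __name__ == "__main__":
    print(check_minimums() or "all minimum versions satisfied")
```

Comparing integer tuples avoids the string-comparison trap where `"2.9.1"` sorts after `"2.81.0"`.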
@@ -0,0 +1,20 @@
# Docling agent skill — improvement log

Agents may append a short entry after running **evaluate → refine** on a document
so similar files are faster to process next time. This file is optional and is
not tracked by every user; it is meant for **local** learning.

## Template (copy for each entry)

```markdown
### YYYY-MM-DD — <short source label>
- **Source type:** (e.g. scanned PDF / digital PDF / DOCX / URL)
- **Issues (first run):** …
- **Pipeline / flags that helped:** …
- **Final evaluator status:** pass | warn | fail
- **Notes:** …
```

## Entries

_(None — add your own after running conversions.)_
@@ -0,0 +1,253 @@
# Docling Pipelines Reference

Docling has two pipeline families for PDFs: **standard** (parse + OCR + layout/tables)
and **VLM** (page images through a vision-language model). The `docling` CLI
exposes both via `--pipeline standard` (default) and `--pipeline vlm`.
The right choice depends on document type, hardware, and latency budget.

---

## Decision matrix

| Document type | Recommended pipeline | Reason |
|---|---|---|
| Born-digital PDF (text selectable) | Standard | Fast, accurate, no GPU needed |
| Scanned PDF / image-only | Standard + OCR or VLM | Depends on quality |
| Complex layout (multi-column, dense tables) | VLM | Better structural understanding |
| Handwriting, formulas, figures with embedded text | VLM | Only viable option |
| Air-gapped / no GPU | Standard | Runs on CPU |
| Production scale, GPU server available | VLM (vLLM) | Best throughput |
| Apple Silicon / local dev | VLM (MLX) | MPS acceleration |
| Speed-critical, accuracy secondary | Standard, no tables | Fastest path |
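
A few rows of the matrix can be expressed as a small rule function that returns CLI flags. The sketch below is one possible reading, not an official policy: `DocTraits` and `choose_pipeline_flags` are hypothetical names, and the precedence (handwriting or formulas first, then complex layout when a GPU is available, then speed) is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DocTraits:
    """Coarse document and environment traits (hypothetical model)."""
    complex_layout: bool = False
    handwriting_or_formulas: bool = False
    gpu_available: bool = False
    speed_critical: bool = False


def choose_pipeline_flags(traits: DocTraits) -> List[str]:
    """Map traits to `docling` CLI flags using an assumed precedence order."""
    # The matrix lists VLM as the only viable option for handwriting and formulas.
    if traits.handwriting_or_formulas:
        return ["--pipeline", "vlm"]
    # Assumption: prefer VLM for complex layouts only when a GPU is present.
    if traits.complex_layout and traits.gpu_available:
        return ["--pipeline", "vlm"]
    if traits.speed_critical:
        return ["--pipeline", "standard", "--no-tables"]
    return ["--pipeline", "standard"]
```

For example, `choose_pipeline_flags(DocTraits(speed_critical=True))` yields the standard pipeline with table detection disabled, matching the last row of the matrix.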

---

## Pipeline 1: Standard PDF Pipeline

Uses deterministic PDF parsing (docling-parse) + optional neural OCR + neural
table structure detection.

### CLI usage

```bash
# Default (standard pipeline, OCR + tables enabled)
docling report.pdf --output /tmp/

# Custom OCR engine
docling report.pdf --ocr-engine tesserocr --output /tmp/

# Disable OCR or tables
docling report.pdf --no-ocr --output /tmp/
docling report.pdf --no-tables --output /tmp/
```

### Python API

```python
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions

# Minimal — library defaults (standard PDF pipeline)
converter = DocumentConverter()

# Explicit PdfPipelineOptions (docling 2.81+): use InputFormat.PDF + PdfFormatOption.
# Do not use format_options={"pdf": opts}; that raises AttributeError on pipeline options.
opts = PdfPipelineOptions(
    do_ocr=True,               # False = skip OCR entirely
    do_table_structure=True,   # False = skip table detection (faster)
)
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=opts),
    }
)
```

### OCR engine options

All engines are plug-and-play via the CLI `--ocr-engine` flag or the Python
`ocr_options` parameter. Default is EasyOCR.

#### CLI flags

| Engine | CLI flag | Notes |
|--------|----------|-------|
| EasyOCR | `--ocr-engine easyocr` (default) | No extra pip beyond docling defaults |
| RapidOCR | `--ocr-engine rapidocr` | Lightweight; see Docling notes on read-only FS |
| Tesseract (Python) | `--ocr-engine tesserocr` | Needs `pip install tesserocr` and system Tesseract |
| Tesseract (CLI) | `--ocr-engine tesseract` | Shells out to `tesseract` binary |
| macOS Vision | `--ocr-engine ocrmac` | macOS only |

#### Python API

```python
# EasyOCR (default — no extra install needed)
from docling.datamodel.pipeline_options import PdfPipelineOptions
opts = PdfPipelineOptions(do_ocr=True)  # uses EasyOCR by default

# Tesseract (requires system Tesseract + pip install tesserocr — see Docling install docs)
from docling.datamodel.pipeline_options import TesseractOcrOptions
opts = PdfPipelineOptions(do_ocr=True, ocr_options=TesseractOcrOptions())

# RapidOCR (lightweight, no C deps)
from docling.datamodel.pipeline_options import RapidOcrOptions
opts = PdfPipelineOptions(do_ocr=True, ocr_options=RapidOcrOptions())

# macOS native OCR
from docling.datamodel.pipeline_options import OcrMacOptions
opts = PdfPipelineOptions(do_ocr=True, ocr_options=OcrMacOptions())
```

---

## Pipeline 2: VLM Pipeline — local inference

Processes each page as an image through a vision-language model. Replaces the
standard layout detection + OCR stack entirely.

### CLI usage

```bash
# Default VLM model (granite_docling)
docling report.pdf --pipeline vlm --output /tmp/

# Specific model
docling report.pdf --pipeline vlm --vlm-model smoldocling --output /tmp/
```

### Python API

```python
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import VlmPipelineOptions
from docling.datamodel import vlm_model_specs
from docling.pipeline.vlm_pipeline import VlmPipeline

pipeline_options = VlmPipelineOptions(
    vlm_options=vlm_model_specs.GRANITEDOCLING_TRANSFORMERS,
    generate_page_images=True,
)

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_cls=VlmPipeline,
            pipeline_options=pipeline_options,
        )
    }
)
```

### Available model presets

| CLI `--vlm-model` | Python preset (`vlm_model_specs`) | Backend | Device | Notes |
|---|---|---|---|---|
| `granite_docling` | `GRANITEDOCLING_TRANSFORMERS` | HF Transformers | CPU/GPU | Default |
| `smoldocling` | `SMOLDOCLING_TRANSFORMERS` | HF Transformers | CPU/GPU | Lighter |
| (Python API only) | `GRANITEDOCLING_VLLM` | vLLM | GPU | Fast batch |
| (Python API only) | `GRANITEDOCLING_MLX` | MLX | Apple MPS | M-series Macs |

### Hybrid mode: PDF text + VLM for images/tables

Set `force_backend_text=True` (Python API only) to use deterministic text
extraction for normal text regions while routing images and tables through the
VLM. Reduces hallucination risk on text-heavy pages.

```python
pipeline_options = VlmPipelineOptions(
    vlm_options=vlm_model_specs.GRANITEDOCLING_TRANSFORMERS,
    force_backend_text=True,   # <-- hybrid mode
    generate_page_images=True,
)
```

---

## Pipeline 3: VLM Pipeline — remote API

Sends page images to any OpenAI-compatible endpoint. Works with vLLM,
LM Studio, Ollama, or a hosted model API.

The pipeline can be enabled from the CLI with `--pipeline vlm --enable-remote-services`,
but configuring the endpoint URL, model name, and API key requires the Python API.

### CLI usage (basic)

```bash
docling report.pdf --pipeline vlm --enable-remote-services --output /tmp/
```

### Python API (full configuration)

```python
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import VlmPipelineOptions
from docling.datamodel.pipeline_options_vlm_model import ApiVlmOptions, ResponseFormat
from docling.pipeline.vlm_pipeline import VlmPipeline

vlm_opts = ApiVlmOptions(
    url="http://localhost:8000/v1/chat/completions",
    params=dict(
        model="ibm-granite/granite-docling-258M",
        max_tokens=4096,
    ),
    headers={"Authorization": "Bearer YOUR_KEY"},  # omit if not needed
    prompt="Convert this page to docling.",
    response_format=ResponseFormat.DOCTAGS,
    timeout=120,
    scale=2.0,
)

pipeline_options = VlmPipelineOptions(
    vlm_options=vlm_opts,
    generate_page_images=True,
    enable_remote_services=True,  # required: gates any HTTP call
)

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_cls=VlmPipeline,
            pipeline_options=pipeline_options,
        )
    }
)
```

**`enable_remote_services=True` is mandatory** for API pipelines. Docling
blocks outbound HTTP by default as a safety measure.

### Common API targets

| Server | Default URL | Notes |
|---|---|---|
| vLLM | `http://localhost:8000/v1/chat/completions` | Best throughput |
| LM Studio | `http://localhost:1234/v1/chat/completions` | Local dev |
| Ollama | `http://localhost:11434/v1/chat/completions` | Model: `ibm/granite-docling:258m` |
| OpenAI-compatible cloud | Provider URL | Set Authorization header |
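
Switching between these servers mostly means changing the URL, the model name, and whether an `Authorization` header is sent. The sketch below collects those three fields in one place. `DEFAULT_URLS` and `connection_kwargs` are hypothetical names, the URLs are the defaults from the table, and the result is a plain dictionary, so nothing here calls Docling itself.

```python
from __future__ import annotations

# Default URLs copied from the table above; adjust host and port to your setup.
DEFAULT_URLS = {
    "vllm": "http://localhost:8000/v1/chat/completions",
    "lmstudio": "http://localhost:1234/v1/chat/completions",
    "ollama": "http://localhost:11434/v1/chat/completions",
}


def connection_kwargs(
    server: str, model: str, api_key: str | None = None, max_tokens: int = 4096
) -> dict:
    """Build the url/params/headers fields for one server (hypothetical helper)."""
    kwargs = {
        "url": DEFAULT_URLS[server],
        "params": {"model": model, "max_tokens": max_tokens},
    }
    if api_key:
        # Add the header only when a key is supplied; omit it otherwise.
        kwargs["headers"] = {"Authorization": f"Bearer {api_key}"}
    return kwargs


ollama = connection_kwargs("ollama", "ibm/granite-docling:258m")
```

The dictionary is meant to be merged with the remaining fields from the full configuration example, for instance `ApiVlmOptions(prompt=..., response_format=ResponseFormat.DOCTAGS, **ollama)`.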

---

## VLM install requirements

Local inference requires PyTorch + Transformers:

```bash
pip install "docling[vlm]"
# or manually:
pip install torch transformers accelerate
```

MLX (Apple Silicon only):
```bash
pip install mlx mlx-lm
```

vLLM backend (server-side):
```bash
pip install vllm
vllm serve ibm-granite/granite-docling-258M
```
+296
@@ -0,0 +1,296 @@
#!/usr/bin/env python3
# SPDX-License-Identifier: MIT
"""
Evaluate a Docling JSON export and suggest pipeline / option changes.

Typical flow (agent or human):

  docling input.pdf --to json --output /tmp/
  docling input.pdf --to md --output /tmp/
  python3 scripts/docling-evaluate.py /tmp/input.json --markdown /tmp/input.md

Exit codes: 0 = pass; 1 = fail or --fail-on-warn with status warn
"""

from __future__ import annotations

import argparse
import json
import sys
from collections import Counter
from pathlib import Path
from typing import Any


def load_document(path: Path):
    data = json.loads(path.read_text(encoding="utf-8"))
    try:
        from docling_core.types.doc.document import DoclingDocument

        return DoclingDocument.model_validate(data), data
    except Exception:
        return None, data


def page_numbers_from_doc(doc) -> set[int]:
    pages: set[int] = set()
    for item, _ in doc.iterate_items():
        for prov in getattr(item, "prov", None) or []:
            p = getattr(prov, "page_no", None)
            if p is not None:
                pages.add(int(p))
    return pages


def collect_text_samples(doc, limit: int = 200) -> list[str]:
    texts: list[str] = []
    for item, _ in doc.iterate_items():
        t = getattr(item, "text", None)
        if t and str(t).strip():
            texts.append(str(t).strip())
        if len(texts) >= limit:
            break
    return texts


def metrics_from_doc(doc) -> dict[str, Any]:
    n_tables = len(getattr(doc, "tables", []) or [])
    n_pictures = len(getattr(doc, "pictures", []) or [])
    n_headers = 0
    n_text_items = 0
    total_chars = 0
    for item, _ in doc.iterate_items():
        label = getattr(getattr(item, "label", None), "name", None) or ""
        if label == "SECTION_HEADER":
            n_headers += 1
        t = getattr(item, "text", None)
        if t:
            n_text_items += 1
            total_chars += len(str(t))

    pages = page_numbers_from_doc(doc)
    n_pages = len(pages) if pages else 0
    density = (total_chars / n_pages) if n_pages else total_chars

    samples = collect_text_samples(doc)
    rep = Counter(samples)
    top_rep = rep.most_common(1)[0] if rep else ("", 0)
    dup_ratio = (
        sum(c for _, c in rep.items() if c > 2) / max(len(rep), 1) if rep else 0.0
    )

    md = ""
    try:
        md = doc.export_to_markdown()
    except Exception:
        pass

    replacement = md.count("\ufffd") + sum(str(t).count("\ufffd") for t in samples)

    return {
        "page_count": n_pages,
        "section_headers": n_headers,
        "text_items": n_text_items,
        "total_text_chars": total_chars,
        "chars_per_page": round(density, 2),
        "tables": n_tables,
        "pictures": n_pictures,
        "markdown_chars": len(md),
        "replacement_chars": replacement,
        "most_repeated_text_count": int(top_rep[1]) if top_rep else 0,
        "duplicate_heavy": dup_ratio > 0.15 and len(samples) > 10,
    }


def heuristic_metrics(data: dict) -> dict[str, Any]:
    """Fallback when DoclingDocument cannot be validated (older export / drift)."""
    texts = data.get("texts") or []
    tables = data.get("tables") or []
    body = data.get("body") or {}
    children = body.get("children") if isinstance(body, dict) else None
    n_children = len(children) if isinstance(children, list) else 0
    char_sum = 0
    for t in texts:
        if isinstance(t, dict):
            char_sum += len(str(t.get("text") or ""))
    return {
        "page_count": 0,
        "section_headers": 0,
        "text_items": len(texts),
        "total_text_chars": char_sum,
        "chars_per_page": 0.0,
        "tables": len(tables),
        "pictures": len(data.get("pictures") or []),
        "markdown_chars": 0,
        "replacement_chars": 0,
        "most_repeated_text_count": 0,
        "duplicate_heavy": False,
        "heuristic_only": True,
        "body_children": n_children,
    }


def evaluate(
    m: dict[str, Any],
    *,
    expect_tables: bool,
    min_chars_per_page: float,
    min_markdown_chars: int,
) -> tuple[str, list[str], list[str]]:
    issues: list[str] = []
    actions: list[str] = []

    if m.get("heuristic_only"):
        issues.append("Could not load full DoclingDocument; metrics are partial.")
        actions.append(
            "Ensure docling-core matches export; re-export with: docling <source> --to json --output <dir>"
        )

    cpp = m.get("chars_per_page") or 0
    if m.get("page_count", 0) >= 2 and cpp < min_chars_per_page:
        issues.append(
            f"Low text density ({cpp} chars/page); likely scan, image-heavy PDF, or extraction gap."
        )
        actions.append(
            "Retry: docling <source> --ocr-engine tesserocr (or rapidocr, ocrmac)"
        )
        actions.append("Retry: docling <source> --pipeline vlm")

    if m.get("replacement_chars", 0) > 5:
        issues.append(
            "Unicode replacement characters detected; OCR may be garbling text."
        )
        actions.append("Retry: docling <source> --ocr-engine tesserocr (or rapidocr)")
        actions.append(
            "Retry: docling <source> --pipeline vlm (use force_backend_text=True via Python API for hybrid)"
        )

    if m.get("duplicate_heavy") or (m.get("most_repeated_text_count", 0) > 8):
        issues.append(
            "Repeated text blocks; possible layout/OCR loop or bad reading order."
        )
        actions.append("Retry: docling <source> --pipeline vlm")
        actions.append(
            "If using VLM: try force_backend_text=True via Python API for text-heavy pages"
        )

    if expect_tables and m.get("tables", 0) == 0:
        issues.append("No tables detected but tables were expected.")
        actions.append(
            "Retry: docling <source> (tables are enabled by default; remove --no-tables if set)"
        )
        actions.append(
            "Retry: docling <source> --pipeline vlm (better for merged-cell or visual tables)"
        )

    mc = m.get("markdown_chars", 0)
    if mc > 0 and mc < min_markdown_chars and m.get("page_count", 0) >= 1:
        issues.append(f"Markdown export is very short ({mc} chars) for the page count.")
        actions.append(
            "Retry: docling <source> --pipeline vlm (or try different --ocr-engine)"
        )

    if m.get("text_items", 0) == 0 and m.get("page_count", 0) == 0:
        issues.append(
            "No text items and no page provenance; export may be empty or invalid."
        )
        actions.append(
            "Verify source file opens correctly; retry with: docling <source> --pipeline standard"
        )

    seen = set()
    uniq_actions = []
    for a in actions:
        if a not in seen:
            seen.add(a)
            uniq_actions.append(a)

    if not issues:
        return "pass", [], []

    severe = m.get("text_items", 0) == 0 or (
        m.get("page_count", 0) >= 1 and mc < 50 and mc > 0
    )
    status = "fail" if severe or m.get("replacement_chars", 0) > 20 else "warn"
    return status, issues, uniq_actions


def parse_args():
    p = argparse.ArgumentParser(description="Evaluate Docling JSON export quality")
    p.add_argument(
        "json_path", type=Path, help="Path to DoclingDocument JSON (export_to_dict)"
    )
    p.add_argument(
        "--markdown",
        type=Path,
        default=None,
        help="Optional markdown file to cross-check length",
    )
    p.add_argument("--expect-tables", action="store_true")
    p.add_argument("--min-chars-per-page", type=float, default=120.0)
    p.add_argument("--min-markdown-chars", type=int, default=200)
    p.add_argument("--fail-on-warn", action="store_true")
    p.add_argument(
        "--quiet", action="store_true", help="Only print JSON report to stdout"
    )
    return p.parse_args()


def main() -> None:
    args = parse_args()
    if not args.json_path.is_file():
        print(json.dumps({"error": f"not found: {args.json_path}"}), file=sys.stderr)
        sys.exit(1)

    doc, raw = load_document(args.json_path)
    if doc is not None:
        m = metrics_from_doc(doc)
    else:
        m = heuristic_metrics(raw)

    if args.markdown and args.markdown.is_file():
        md_len = len(args.markdown.read_text(encoding="utf-8"))
        m["markdown_file_chars"] = md_len
        if m.get("markdown_chars", 0) == 0:
            m["markdown_chars"] = md_len

    status, issues, actions = evaluate(
        m,
        expect_tables=args.expect_tables,
        min_chars_per_page=args.min_chars_per_page,
        min_markdown_chars=args.min_markdown_chars,
    )

    report = {
        "status": status,
        "metrics": m,
        "issues": issues,
        "recommended_actions": actions,
        "next_steps_for_agent": [
            "Re-run docling with flags from recommended_actions.",
            "Re-export JSON and run this script again until status is pass.",
            "Append a row to improvement-log.md (see SKILL.md).",
        ],
    }

    print(json.dumps(report, indent=2, ensure_ascii=False))
    if not args.quiet:
        print(f"\nstatus={status}", file=sys.stderr)
        if issues:
            print("issues:", file=sys.stderr)
            for i in issues:
                print(f" - {i}", file=sys.stderr)
        if actions:
            print("recommended_actions:", file=sys.stderr)
            for a in actions:
                print(f" - {a}", file=sys.stderr)

    if status == "fail":
        sys.exit(1)
    if status == "warn" and args.fail_on_warn:
        sys.exit(1)
    sys.exit(0)


if __name__ == "__main__":
    main()
+3
@@ -0,0 +1,3 @@
# pip install -r scripts/requirements.txt
docling>=2.81.0
docling-core>=2.67.1
+1
@@ -7,6 +7,7 @@ Here some of our picks to get you started:
- 📤 [{==\[:fontawesome-solid-flask:{ title="beta feature" } beta\]==} structured data extraction](./extraction.ipynb)
- examples for ✍️ [serialization](./serialization.ipynb) and ✂️ [chunking](./hybrid_chunking.ipynb), including [user-defined customizations](./advanced_chunking_and_serialization.ipynb)
- 🖼️ [picture annotations](./pictures_description.ipynb) and [enrichments](./enrich_doclingdocument.py)
- 🤝 [**Agent skill**](./agent_skill/docling-document-intelligence/README.md) for Cursor and other assistants (`SKILL.md`, pipeline reference, `docling-evaluate.py` helper)

👈 ... and there is much more: explore all the examples using the navigation menu on the side

@@ -80,6 +80,7 @@ nav:
    - Plugins: concepts/plugins.md
  - Examples:
    - Examples: examples/index.md
    - "🤝 Agent skill (Cursor / assistants)": examples/agent_skill/docling-document-intelligence/README.md
    - 🔀 Conversion:
      - "Simple conversion": examples/minimal.py
      - "Custom conversion": examples/custom_convert.py