docling/docs/examples/enrich_doclingdocument.py
Christoph Auer 03532938b5 feat: Unified model-family inference engines (including image-classification) and KServe v2 API support (#2979)
2026-02-18 10:49:19 +01:00

# %% [markdown]
# Enrich an existing DoclingDocument JSON with a custom model (post-conversion).
#
# What this example does
# - Loads a previously converted DoclingDocument from JSON (no reconversion).
# - Uses a backend to crop images for items and runs an enrichment model in batches.
# - Prints a few example annotations to stdout.
#
# Prerequisites
# - A DoclingDocument JSON produced by another conversion (path configured below).
# - Install Docling and dependencies for the chosen enrichment model.
# - Ensure the JSON and the referenced PDF match (same document/version), so
#   provenance bounding boxes line up for accurate cropping.
#
# How to run
# - From the repo root: `python docs/examples/enrich_doclingdocument.py`.
# - Adjust `input_doc_path` and `input_pdf_path` if your data is elsewhere.
#
# Notes
# - `BATCH_SIZE` controls how many elements are passed to the model at once.
# - `prepare_element()` crops each element together with some surrounding context;
#   the margin is set by the model's `expansion_factor`.
# %%

### Load modules

from pathlib import Path
from typing import Iterable, Optional

from docling_core.types.doc import BoundingBox, DocItem, DoclingDocument, NodeItem
from rich.pretty import pprint

from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
from docling.datamodel.accelerator_options import AcceleratorOptions
from docling.datamodel.base_models import InputFormat, ItemAndImageEnrichmentElement
from docling.datamodel.document import InputDocument
from docling.models.base_model import BaseItemAndImageEnrichmentModel
from docling.models.stages.picture_classifier.document_picture_classifier import (
    DocumentPictureClassifier,
    DocumentPictureClassifierOptions,
)
from docling.utils.utils import chunkify

### Define batch size used for processing

BATCH_SIZE = 4
# Trade-off: larger batches improve throughput but increase memory usage.

### From DocItem to the model inputs
# The following function is responsible for taking an item and applying the required pre-processing for the model.
# In this case we generate a cropped image from the document backend.
def prepare_element(
    doc: DoclingDocument,
    backend: PyPdfiumDocumentBackend,
    model: BaseItemAndImageEnrichmentModel,
    element: NodeItem,
) -> Optional[ItemAndImageEnrichmentElement]:
    # Skip items the model does not handle.
    if not model.is_processable(doc=doc, element=element):
        return None

    assert isinstance(element, DocItem)
    # Use the first provenance entry to locate the item on its page.
    element_prov = element.prov[0]

    # Grow the bounding box on every side by a fraction of its own size.
    bbox = element_prov.bbox
    width = bbox.r - bbox.l
    height = bbox.t - bbox.b

    expanded_bbox = BoundingBox(
        l=bbox.l - width * model.expansion_factor,
        t=bbox.t + height * model.expansion_factor,
        r=bbox.r + width * model.expansion_factor,
        b=bbox.b - height * model.expansion_factor,
        coord_origin=bbox.coord_origin,
    )

    # Convert the page number to the index passed to the backend.
    page_ix = element_prov.page_no - 1
    page_backend = backend.load_page(page_no=page_ix)
    cropped_image = page_backend.get_page_image(
        scale=model.images_scale, cropbox=expanded_bbox
    )
    return ItemAndImageEnrichmentElement(item=element, image=cropped_image)
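
The expansion arithmetic in `prepare_element()` can be sketched in isolation. This is a minimal illustration, not docling code: `Box` and `expand` are hypothetical stand-ins for `BoundingBox` and the inline computation above, and the factor of 0.5 is an arbitrary example value. It shows that each side moves outward by `factor` times the box's width or height, and that the same formula also grows the box when the origin is top-left, because `height` is then negative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical stand-in for a bounding box; not a docling type."""

    l: float
    t: float
    r: float
    b: float


def expand(box: Box, factor: float) -> Box:
    """Grow `box` on every side by `factor` times its width or height."""
    width = box.r - box.l
    height = box.t - box.b  # negative when the origin is top-left (t < b)
    return Box(
        l=box.l - width * factor,
        t=box.t + height * factor,
        r=box.r + width * factor,
        b=box.b - height * factor,
    )


# Bottom-left origin: the top edge has the larger coordinate.
bottom_left = expand(Box(l=100.0, t=300.0, r=200.0, b=250.0), factor=0.5)

# Top-left origin: the top edge has the smaller coordinate.
top_left = expand(Box(l=100.0, t=250.0, r=200.0, b=300.0), factor=0.5)

print(bottom_left)
print(top_left)
```

In both cases the box grows by 50 units on the left and right and by 25 units on the top and bottom. The sketch does not clamp the result to the page, so a box near the page edge can extend beyond it.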
### Iterate through the document
# This block defines `enrich_document()`, which iterates through the document
# and batches the selected document items for running through the model.
def enrich_document(
    doc: DoclingDocument,
    backend: PyPdfiumDocumentBackend,
    model: BaseItemAndImageEnrichmentModel,
) -> DoclingDocument:
    def _prepare_elements(
        doc: DoclingDocument,
        backend: PyPdfiumDocumentBackend,
        model: BaseItemAndImageEnrichmentModel,
    ) -> Iterable[ItemAndImageEnrichmentElement]:
        # Yield only the items the model can process, already cropped.
        for doc_element, _level in doc.iterate_items():
            prepared_element = prepare_element(
                doc=doc, backend=backend, model=model, element=doc_element
            )
            if prepared_element is not None:
                yield prepared_element

    for element_batch in chunkify(
        _prepare_elements(doc, backend, model),
        BATCH_SIZE,
    ):
        for element in model(doc=doc, element_batch=element_batch):  # Must exhaust!
            pass

    return doc
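
The batching loop above can be sketched without docling. This is a sketch under stated assumptions: `chunked` is a hypothetical stand-in for `chunkify`, and `lazy_model` assumes the model call behaves like a generator, which is what the `# Must exhaust!` comment suggests. It shows why the inner loop matters even though its body is `pass`: a generator does no work until something iterates over it.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def chunked(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Hypothetical stand-in for `chunkify`: yield lists of up to `size` items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


processed: List[int] = []


def lazy_model(batch: List[int]) -> Iterator[int]:
    """Generator: the body runs only while the caller iterates."""
    for item in batch:
        processed.append(item)  # side effect, standing in for annotating an item
        yield item


batches = list(chunked(range(10), 4))

# Calling the generator function without iterating performs no work.
lazy_model(batches[0])
untouched = list(processed)

# Exhausting the generator is what triggers the side effects.
for batch in batches:
    for _ in lazy_model(batch):
        pass

print(batches)
print(untouched)
print(processed)
```

With a batch size of 4, ten items are split into batches of 4, 4, and 2. The last batch is simply shorter, so the item count does not need to be a multiple of the batch size.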
### Open and process
# The `main()` function initializes the document and model objects, then calls `enrich_document()`.
def main():
    data_folder = Path(__file__).parent / "../../tests/data"
    input_pdf_path = data_folder / "pdf/2206.01062.pdf"
    input_doc_path = data_folder / "groundtruth/docling_v2/2206.01062.json"

    doc = DoclingDocument.load_from_json(input_doc_path)

    in_pdf_doc = InputDocument(
        input_pdf_path,
        format=InputFormat.PDF,
        backend=PyPdfiumDocumentBackend,
        filename=input_pdf_path.name,
    )
    backend = in_pdf_doc._backend

    model = DocumentPictureClassifier(
        enabled=True,
        artifacts_path=None,
        options=DocumentPictureClassifierOptions.from_preset(
            "document_figure_classifier_v2"
        ),
        accelerator_options=AcceleratorOptions(),
    )

    doc = enrich_document(doc=doc, backend=backend, model=model)

    for pic in doc.pictures[:5]:
        print(pic.self_ref)
        pprint(pic.meta)


if __name__ == "__main__":
    main()