Commit Graph

151 Commits

Author SHA1 Message Date
geoHeil 5b1df788ef ci: tighten pre-commit guardrails (#3346)
* ci: tighten pre-commit guardrails

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: validate pre-commit guardrail changes

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: switch hook validation to prek

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: exempt active slim plan from max-lines

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: move max-lines config under github

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: fail on uncovered tach modules

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: ignore generated docs in max-lines check

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: clarify local validation tasks

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* docs: refine agent instructions

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: replace mypy with ty

(cherry picked from commit 382afbde8f00abfaeba95ea9c8e9cc603f27a2d9)
Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: replace justfile with makefile

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

---------

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>
2026-05-08 15:07:11 +02:00
Christoph Auer aba7f155ae fix(client): Make submit_and_retrieve_many accept lazy iterable and yield (#3405)
* Remove eager materialization from docling-service batch submission

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update lock

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Make convert_all evaluate Iterable input lazily, remove raises_on_error

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Make convert_all use async generator like submit_and_retrieve_many

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* fix mypy fast check

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* upgrade packages

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* update test GT data

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update test GT from linux machine

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Reset all GT test data and uv.lock

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2026-05-07 18:15:26 +02:00
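
The lazy-evaluation change in the commit above can be illustrated with a minimal sketch. This is not the docling client code: `iter_batches`, `batch_size`, and the `sources` generator are hypothetical names, and the sketch is synchronous, whereas the commit mentions an async generator. It only shows the general technique of pulling items from an iterable on demand instead of materializing the whole input first.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def iter_batches(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield lists of up to batch_size items, pulling from `items` on demand."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        # islice stops after batch_size items, so nothing beyond the
        # current batch is requested from the underlying iterable.
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


consumed: List[int] = []


def sources() -> Iterator[int]:
    """Hypothetical lazy source that records how far it has been consumed."""
    for i in range(5):
        consumed.append(i)
        yield i


batches = iter_batches(sources(), batch_size=2)
first = next(batches)  # only the first two items have been pulled so far
```

Because the input is never wrapped in `list(...)`, a caller can pass a generator over a very large set of sources and results can start flowing before the input is exhausted.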
geoHeil 45c3d2b895 ci: share typecheck deps with PR fast checks (#3406)
Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>
2026-05-06 15:05:21 +02:00
github-actions[bot] 61c37a23a9 chore: bump version to 2.93.0 [skip ci] 2026-05-05 19:53:32 +00:00
geoHeil eb4724ee4c ci: prototype tach-based modular skipping (#3333)
* ci: prototype tach-based modular skipping

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: modularize ubuntu setup and refine gating

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: adopt metaxy-inspired governance helpers

- replace custom aggregate check with re-actors/alls-green

- set FORCE_JAVASCRIPT_ACTIONS_TO_NODE24 on every workflow

- keep PR concurrency alive when the graphite:merge label is present

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: tune checks and pin action versions

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: split CI suites and heavy examples

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* DCO Remediation Commit for Georg Heiler <georg.kf.heiler@gmail.com>

I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: ecaa4777886157d5c2a7b3893c3a820983089dbf
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: d15416f3ca94ac97af2a8317cd6404208db9d896

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: sharpen tach graph and per-suite path filters

- Split docling.pipeline into per-pipeline tach modules
  (asr, vlm, standard_pdf, threaded_standard_pdf, legacy_standard_pdf,
  extraction_vlm, base, base_extraction, simple) so pytest --tach-base
  impact analysis can attribute changes to a specific pipeline rather
  than the whole package.
- Split the asr- and vlm-specific docling.datamodel option files
  (asr_model_specs, pipeline_options_asr_model, vlm_engine_options,
  vlm_model_specs, pipeline_options_vlm_model, layout_model_specs,
  stage_model_specs, backend_options) into their own tach modules so
  a narrow spec/options change no longer marks the full datamodel as
  impacted.
- Narrow the per-suite pipeline path filters in checks.yml to the
  concrete pipeline files relevant to each suite, so editing
  vlm_pipeline.py only triggers the vlm matrix cell and editing
  asr_pipeline.py only the asr one.
- Rekey the model cache in setup-ubuntu-ci to include runner.os and
  hashFiles(uv.lock, pyproject.toml), with ordered restore-keys
  fallbacks so a lockfile bump no longer silently stales the cache.

Metaxy parity note: layered tach enforcement (layer = "...") is
blocked by existing backend<->datamodel and utils<->stages cycles;
depot runners, nox dynamic matrices, devenv/nix, dprint and ty are
not applicable to docling's stack. All pinned action SHAs are on
their latest release as of this commit.

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: introduce pipeline and orchestration tach layers

Earlier notes claimed layers were blocked. That was only true for the
cyclic core (backend<->datamodel, utils<->stages). The boundary
*above* core is clean:

- No module under docling/backend, docling/datamodel, docling/models,
  docling/utils, docling/exceptions, or docling/chunking imports
  anything from docling.pipeline (verified by grep).
- No module anywhere in docling/ imports from docling.cli,
  docling.document_converter, docling.document_extractor, or
  docling.service_client (also verified).

So we can introduce two real layers on top of the cyclic core:

- "pipeline"      — docling.pipeline and all nine concrete pipelines
                     (base, simple, base_extraction, asr, vlm,
                     extraction_vlm, standard_pdf,
                     threaded_standard_pdf, legacy_standard_pdf).
- "orchestration" — docling.cli, docling.document_converter,
                     docling.document_extractor, and
                     docling.experimental.pipeline.

Unlayered modules stay "below" both layers (tach allows them to be
depended on freely) and continue to carry the declared-but-cyclic
backend<->datamodel and utils<->stages edges.

A VLM-only layer was explored but rejected: only
docling.pipeline.vlm_pipeline and docling.pipeline.extraction_vlm_pipeline
could be cleanly layered as "vlm", because the matching datamodel
options (pipeline_options_vlm_model, vlm_engine_options,
vlm_model_specs) and model stages (vlm_convert, vlm_pipeline_models)
sit inside the datamodel/models cycle and cannot be promoted to a
higher layer without first breaking that cycle. Layering only the
two pipeline files is not worth the extra config.

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: expand tach layers to entrypoints/pipeline/models/core

Follow-up to the two-layer attempt. After verifying via grep that
nothing in datamodel/utils/backend imports from
docling.models.{extraction,factories,plugins,vlm_pipeline_models}
or from the "upper" stages (page_assemble, page_preprocessing,
reading_order, picture_description, vlm_convert), those nine
modules can be promoted out of the cyclic core into a dedicated
"models" layer.

The resulting order (highest first):

- entrypoints — cli, document_converter, document_extractor,
                experimental.pipeline
- pipeline    — docling.pipeline + the nine concrete pipelines
- models      — model factories, extraction, plugins,
                vlm_pipeline_models, and the five "upper" stages
- core        — datamodel*, backend*, utils, exceptions, chunking,
                models (base), models.utils, inference_engines.*,
                the six "core stages" that utils cycles with
                (chart_extraction, code_formula, layout, ocr,
                picture_classifier, table_structure), and the
                experimental.* and service_client modules

Rename the previous "orchestration" layer to "entrypoints" to
match the common docling vocabulary. Every module now carries an
explicit layer tag instead of relying on implicit unlayered
behaviour, so future additions must pick a layer deliberately.

A VLM layer, a stand-alone inference-engines layer, and separating
datamodel from backend all remain blocked by the bidirectional
backend<->datamodel and utils<->core-stages edges; those need a
code-level refactor first.

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>
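
The layer ordering described in the commit message above can be sketched as a small import check. This is an illustration only, not tach's implementation or configuration: it assumes the usual layering rule that a module may import from its own layer or a lower one but not from a higher one, and the `MODULE_LAYER` mapping is a hand-picked subset chosen for the example.

```python
from typing import Dict, List, Tuple

# Layer order from the commit message, highest first.
LAYERS: List[str] = ["entrypoints", "pipeline", "models", "core"]
RANK: Dict[str, int] = {name: index for index, name in enumerate(LAYERS)}

# Illustrative subset of module-to-layer assignments (not the full config).
MODULE_LAYER: Dict[str, str] = {
    "docling.cli": "entrypoints",
    "docling.pipeline.vlm_pipeline": "pipeline",
    "docling.models.factories": "models",
    "docling.datamodel": "core",
}


def layer_violations(imports: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Return (importer, imported) pairs where a lower layer imports a higher one."""
    bad = []
    for importer, imported in imports:
        # A larger rank means a lower layer; importing "upwards" is a violation.
        if RANK[MODULE_LAYER[importer]] > RANK[MODULE_LAYER[imported]]:
            bad.append((importer, imported))
    return bad


edges = [
    ("docling.cli", "docling.pipeline.vlm_pipeline"),        # downward: allowed
    ("docling.datamodel", "docling.pipeline.vlm_pipeline"),  # upward: violation
]
violations = layer_violations(edges)
```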

* ci: refine tach client and foundation layers

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: add optional windows and macos smoke lanes

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: normalize reusable workflow boolean inputs

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: replace external all-green action

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: use org-allowed setup-uv action

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: install compiler toolchain for ML tests

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* DCO Remediation Commit for Georg Heiler <georg.kf.heiler@gmail.com>

I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: bb714afb42cd1b29ab073a7f59cc72874ff2fdcd

I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: a1f2761da8f72bfed636bd571ebf77b42c8771b6

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* DCO Remediation Commit for Georg Heiler <georg.kf.heiler@gmail.com>

I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: cc6551b54c5bf4815ae9cd57cf43a98928a74be0

I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: b21b0e7ca12b552dbdd54fac1bda113719c286f1

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: simplify ML pytest suite patterns

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: gate heavy examples on label, add job timeouts

- ci-heavy-examples: run only on main push, schedule, workflow_dispatch,
  or when a PR is labeled tests:full / tests:heavy-examples. Drops the
  path-based auto-trigger so that common edits to pyproject.toml,
  uv.lock, or .github/actions do not kick off the 45-60min matrix on
  every PR push. Collapses the changes job into a job-level if gate and
  adds timeout-minutes: 90.
- checks.yml: add timeout-minutes to every job so stuck runners cannot
  burn the full 6h default.

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: tolerate cancelled allowed-skip jobs in check aggregator

Intentional cancellations (manual cancel, concurrency replacement) on
jobs that are already in ALLOWED_SKIPS should not mark the overall
workflow red. Treat `cancelled` the same as `skipped` when the job is
listed as an allowed skip; any unexpected cancellation of a required
job still fails.

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>
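
The aggregation rule described above can be sketched as follows. The function name `workflow_passes` and the job names are hypothetical; only the `ALLOWED_SKIPS` concept and the result strings (`success`, `skipped`, `cancelled`, `failure`) come from the commit message and from GitHub Actions job results. The real check lives in the workflow files and may differ in detail.

```python
from typing import Dict, Set


def workflow_passes(results: Dict[str, str], allowed_skips: Set[str]) -> bool:
    """Return True when every job result is acceptable."""
    for job, result in results.items():
        if result == "success":
            continue
        # A skipped or cancelled job is tolerated only if it is an allowed skip.
        if result in {"skipped", "cancelled"} and job in allowed_skips:
            continue
        return False
    return True


ALLOWED_SKIPS = {"heavy-examples"}  # hypothetical job name

ok = workflow_passes(
    {"lint": "success", "heavy-examples": "cancelled"}, ALLOWED_SKIPS
)
required_cancelled = workflow_passes(
    {"lint": "cancelled", "heavy-examples": "skipped"}, ALLOWED_SKIPS
)
```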

* docs: make minimal vlm example portable

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* DCO Remediation Commit for Georg Heiler <georg.kf.heiler@gmail.com>

I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 2135051da3ed73d4b8a9130f584f40b56155af1a

I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 4f6d1d7960f7418d0cde6425ae61538da84fda40

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: install workspace packages in CI syncs

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* DCO Remediation Commit for Georg Heiler <georg.kf.heiler@gmail.com>

I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 492fa9883d4de6d98ebcb40fa863eafe2facff3c

I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 3eefae71643f9ca3df0264690c0c6eb1f67f06f1

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* DCO Remediation Commit for Georg Heiler <georg.kf.heiler@gmail.com>

I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: fe8c9689a0ee94f36eb826da8e2177ef87404f5e

I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: eabdd24a6734ec873cdaac857718aef2473677e7

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: remove unused graphite concurrency exception

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: document test labels and gate cross-platform lanes

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: select ml tests with pytest markers

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: fix marker selector typing

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: simplify ml suite scheduling

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: mark cross-platform smoke tests

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: reuse test trigger for ml matrix

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: tighten full ci aggregation

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: share required job result check

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

---------

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 14:15:35 +02:00
geoHeil 41e9fa7886 ci: implement phase 1 path-based workflow skipping (#3332)
* ci: add phase 1 path-based workflow skipping

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: add fast pull_request_target lint checks

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: keep pr fast checks cheap

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: expand full matrix triggers

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: enable same-repo and merge queue checks

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: harden pull_request_target fetch inputs

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: address phase 1 workflow review

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: grant reusable checks permissions

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: temporarily enable pr fast checks validation

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: allow first run of pr fast checks

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: load pr fast check script for first validation

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: format pr fast check script

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: guard temporary pr fast check script fallback

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: use pr metadata for temporary fast check validation

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: remove temporary pr fast checks trigger

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: disable duplicate pull request runs

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: run fast pr checks without path trigger filter

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: add job timeouts in checks.yml

Cap every job so a stuck runner cannot burn the 6h default. Limits in
minutes: changes=5, lint=20, run-tests-1/2=45, run-examples=60,
test-pip-install-*=30, build/test-package=15.

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: restore pull request workflow triggers

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: run lint on pull requests

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

---------

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>
2026-04-29 10:55:27 +02:00
github-actions[bot] 80f81b2799 chore: bump version to 2.92.0 [skip ci] 2026-04-29 07:38:26 +00:00
Michele Dolfi ed32c5e993 feat: Introduce modular docling-slim package (#3285)
* plans folder structure

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* initial plan

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* updated plan

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* restructure repo for docling and docling-slim

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* transpose package structures

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add all-packages

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* updated lock and deps

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* align deps

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* more lock like main

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* more locked pinning

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* rename extras

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add simple README for docling-slim

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* fix scikit-image issue

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add readme placeholder

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add all extras in package test

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* cli in docling-slim

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* apply formatting

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* fix testing package

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* override grpcio in no-header test

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* update lock

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* update package description

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* updated extras

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* fix publish scripts

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* update package test

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2026-04-24 15:14:57 +02:00
github-actions[bot] 188b6a192c chore: bump version to 2.91.0 [skip ci] 2026-04-23 09:29:36 +00:00
github-actions[bot] d5bff7155c chore: bump version to 2.90.0 [skip ci] 2026-04-17 11:56:33 +00:00
github-actions[bot] fa334aeb46 chore: bump version to 2.89.0 [skip ci] 2026-04-16 08:08:36 +00:00
Rayabharapu Trinai a15c16e19f feat: explicit TikZ environment handling in LaTeX backend (#3187)
* feat: explicitly handle tikzpicture as atomic figure nodes

Signed-off-by: StealthTensor <rayabharaputrinai@gmail.com>

* fix: resolve copilot review on recursion depth and text labels

Signed-off-by: StealthTensor <rayabharaputrinai@gmail.com>

* test: cover tikz fallback label and depth guard

Signed-off-by: StealthTensor <rayabharaputrinai@gmail.com>

* fix: address copilot tikz parent and end-tag edge cases

Signed-off-by: StealthTensor <rayabharaputrinai@gmail.com>

* feat(latex): store TikZ source in Picture meta code field

Handle tikzpicture environments as real Picture nodes via add_picture and
store raw TikZ in PictureMeta.code (CodeMetaField) with language metadata.

Also update LaTeX backend tests to assert picture meta code behavior and
bump docling-core minimum version to 2.71.0 (includes PictureMeta.code).

Signed-off-by: StealthTensor <rayabharaputrinai@gmail.com>

* fix(latex): make TikZ code metadata compatible across docling-core versions

Avoid hard import of CodeMetaField to prevent import-time crashes when older
docling-core is resolved. Use PictureMeta.code when available and retain legacy
CODE fallback otherwise. Update TikZ tests for dual-path behavior and refresh
uv.lock to pin docling-core 2.71.0.

Signed-off-by: StealthTensor <rayabharaputrinai@gmail.com>

* refactor(latex): simplify TikZ parent assignment

Signed-off-by: StealthTensor <rayabharaputrinai@gmail.com>

* fix(latex): remove legacy PictureMeta guard and use TIKZ code metadata

Signed-off-by: StealthTensor <rayabharaputrinai@gmail.com>

* fix: require docling-core 2.73.0 for TIKZ code label

Signed-off-by: StealthTensor <rayabharaputrinai@gmail.com>

* test: remove legacy tikz code-meta compatibility branch

Signed-off-by: StealthTensor <rayabharaputrinai@gmail.com>

---------

Signed-off-by: StealthTensor <rayabharaputrinai@gmail.com>
2026-04-15 05:47:20 +02:00
github-actions[bot] e04e602fc8 chore: bump version to 2.88.0 [skip ci] 2026-04-13 14:05:27 +00:00
Christoph Auer 42157a3e10 feat(service): Establish client SDK for docling serve (#3264)
* Move client SDK to docling

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add client SDK examples

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Mark client SDK as experimental

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2026-04-13 14:54:06 +02:00
geoHeil 6b257ece33 fix(ocr): support rapidocr 3.8 mobile model naming (#3277)
Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>
2026-04-13 11:44:33 +02:00
github-actions[bot] 2446f5c41b chore: bump version to 2.87.0 [skip ci] 2026-04-13 07:37:13 +00:00
github-actions[bot] 45daa001a8 chore: bump version to 2.86.0 [skip ci] 2026-04-10 14:15:01 +00:00
Peter W. J. Staar fd834204fa feat: Support for GraniteVision v4 (#3217)
* feat: add GV4

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* got everything to work with granite-vision-4

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* robustifying the output of granite-vision-v4

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* refactored the code to reduce duplication

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* refactored the code to align with pipeline options

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* ran pre-commit

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* fixed the circular import

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* added the chart_extraction_options file

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2026-04-10 08:55:40 +02:00
github-actions[bot] 50b9b63b66 chore: bump version to 2.85.0 [skip ci] 2026-04-07 14:22:01 +00:00
github-actions[bot] 76e7ce83b1 chore: bump version to 2.84.0 [skip ci] 2026-04-01 18:35:26 +00:00
github-actions[bot] eacd09f1a2 chore: bump version to 2.83.0 [skip ci] 2026-03-31 09:32:58 +00:00
Peter W. J. Staar d2c6357982 feat: upgrade to transformers v5 (#3200)
* feat: upgrade to transformers v5

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* fixed the pre-commit

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* allow transformers v4

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* Allow transformers 4.x in vlm extra, clean up annotations

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
2026-03-31 09:40:41 +02:00
github-actions[bot] d355af6c59 chore: bump version to 2.82.0 [skip ci] 2026-03-25 09:40:06 +00:00
Christoph Auer a0fc3c9d73 fix: manage PDFium backend resource lifecycles to avoid SIGSEGV/SIGTRAP crashes (#3180)
* fix: explicitly close PdfBitmap after copy in both PDF backends

pypdfium2's to_pil() shares native buffer memory for RGBA/RGBX/L formats
via frombuffer(). The chained render().to_pil().resize() pattern allowed
the PdfBitmap to reach refcount 0 mid-expression, causing GC to invoke
FPDFBitmap_Destroy and free the native buffer while PIL still held a
dangling pointer to it — resulting in non-deterministic SIGSEGV crashes
in concurrent scenarios.

Fix: store the bitmap explicitly, copy the PIL image to detach it from
the shared native buffer, then close the bitmap under the lock before
proceeding with the resize on the independent copy.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* upgrade uv.lock

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* fix: managed PDFium backend lifecycle with explicit native close and live-page tracking

Introduces ManagedPdfiumDocumentBackend / ManagedPdfiumPageBackend base
classes that both PDF backends now inherit from. Key changes:

- Live pages are tracked in a set on the document; document unload waits
  for all pages to be released before tearing down native handles.
- Page and document unload now call explicit .close() on native PDFium
  objects under the lock, rather than just nulling Python references.
  This makes teardown deterministic rather than relying on GC finalizers
  which can fire from any thread without the lock.
- text_page is explicitly closed before _ppage to respect the PDFium
  parent/child handle hierarchy.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* refactor: strip dead live-page tracking from managed PDFium backend

The Condition, Lock, _live_pages set, _closing flag, and owner back-ref
on pages were remnants of the Group-3b pipeline defensive shutdown that
was not included here. The pipeline always unloads page backends before
calling document.unload(), so _close_live_pages() was always a no-op
and notify_all() had zero waiters.

Reduced ManagedPdfiumDocumentBackend/ManagedPdfiumPageBackend to just
a _closed guard and the abstract _close_native_* dispatch.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
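
The reduced shape described above (a `_closed` guard plus an abstract close hook, called under a lock) might look roughly like the sketch below. All names here are hypothetical stand-ins: `ManagedPageSketch`, `NATIVE_LOCK`, `_close_native_page`, and `RecordingPage` are not the docling classes, and no PDFium calls are made. The recording subclass only shows the child-before-parent close order mentioned in the earlier commit.

```python
import threading
from abc import ABC, abstractmethod
from typing import List

# Stand-in for a lock that serializes native calls (an assumption of this sketch).
NATIVE_LOCK = threading.Lock()


class ManagedPageSketch(ABC):
    """Idempotent unload: native handles are closed at most once, under the lock."""

    def __init__(self) -> None:
        self._closed = False

    def unload(self) -> None:
        with NATIVE_LOCK:
            if self._closed:
                return  # already torn down; nothing to do
            self._closed = True
            self._close_native_page()

    @abstractmethod
    def _close_native_page(self) -> None:
        """Close native handles, children before parents."""


class RecordingPage(ManagedPageSketch):
    """Hypothetical subclass that records the close order instead of calling PDFium."""

    def __init__(self) -> None:
        super().__init__()
        self.closed_handles: List[str] = []

    def _close_native_page(self) -> None:
        self.closed_handles.append("text_page")  # child handle first
        self.closed_handles.append("page")       # then its parent


page = RecordingPage()
page.unload()
page.unload()  # second call is a no-op thanks to the _closed guard
```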

* DCO Remediation Commit for Christoph Auer <cau@zurich.ibm.com>

I, Christoph Auer <cau@zurich.ibm.com>, hereby add my Signed-off-by to this commit: b3f4e6692d
I, Christoph Auer <cau@zurich.ibm.com>, hereby add my Signed-off-by to this commit: 79b18945a8
I, Christoph Auer <cau@zurich.ibm.com>, hereby add my Signed-off-by to this commit: b389c82456
I, Christoph Auer <cau@zurich.ibm.com>, hereby add my Signed-off-by to this commit: 5e3510f80f

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* downgrade mkdocs-jupyter to <0.26 because it breaks docs gen

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-24 17:59:55 +01:00
Maxim Lysak 1c74a9b9c7 feat: Implementation of HTML backend with headless browser (#2969)
- Implementation of an HTML backend that (optionally) uses a headless browser (via Playwright) to render HTML pages into images and adds provenance with bounding boxes to all elements in the converted docling document.
- Conversion preserves the reading order given by the HTML DOM tree.
- Added support for HTML "input" fields: checkboxes, radio buttons, text inputs, etc.
- Added support for a key-value convention in HTML (i.e. elements with id "key1" and "key1_value1" are paired as key-values; see the test cases for examples).
- Added a heuristic that merges independent inline HTML elements containing single-character text into larger text blocks.
- Support for inline styling (bold, italic, etc.).

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
2026-03-24 14:28:57 +01:00
github-actions[bot] 4e650af56d chore: bump version to 2.81.0 [skip ci] 2026-03-20 21:32:59 +00:00
Ron 8ae0974a9d fix: handle external image relationships in MsWordDocumentBackend (#3114)
* fix: handle external image relationships in MsWordDocumentBackend

When a .docx file contains image relationships with TargetMode="External"
(common in documents saved from web browsers), accessing
`_Relationship.target_part` raises ValueError because external relationships
don't have a target part within the package.

Check `rel.is_external` before accessing `target_part`, emitting a
UserWarning with the external target URL and returning None so external
images fall through to the existing "image cannot be found" handling.

Includes test with ground truth files for a .docx with external image
references.

Fixes #3113

Signed-off-by: rongo-ms <127863751+rongo-ms@users.noreply.github.com>
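
The guard described above can be sketched without python-docx installed. The helper name `image_part_or_none` and the `FakeExternalRel` test double are hypothetical; the attributes `is_external`, `target_ref`, and `target_part` are assumed to mirror python-docx's relationship object, and the real backend code may be structured differently.

```python
import warnings
from typing import Any, Optional


def image_part_or_none(rel: Any) -> Optional[Any]:
    """Return the relationship's target part, or None for an external target."""
    if rel.is_external:
        # External targets live outside the package, so there is no part to load.
        warnings.warn(
            f"Skipping external image reference: {rel.target_ref}", UserWarning
        )
        return None
    return rel.target_part


class FakeExternalRel:
    """Test double that mimics an external relationship."""

    is_external = True
    target_ref = "https://example.com/image.png"

    @property
    def target_part(self) -> Any:
        raise ValueError("external relationships have no target part")


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = image_part_or_none(FakeExternalRel())
```

Checking `is_external` first means `target_part` is never touched for external relationships, so the `ValueError` path is not reached.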

* chore: upgrade dependencies in uv.lock file

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

---------

Signed-off-by: rongo-ms <127863751+rongo-ms@users.noreply.github.com>
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2026-03-19 14:22:21 +01:00
github-actions[bot] cbe3db470d chore: bump version to 2.80.0 [skip ci] 2026-03-14 05:57:47 +00:00
github-actions[bot] f73df4f916 chore: bump version to 2.79.0 [skip ci] 2026-03-12 07:40:51 +00:00
github-actions[bot] 594fc3e2ca chore: bump version to 2.78.0 [skip ci] 2026-03-10 14:55:08 +00:00
Peter W. J. Staar 4ccd1d465d feat: Add support for TableFormer v2 (#3013)
* ran DCO

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* modified tf

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* Added  to TableFormer v1

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* added convenience methods for quality testing

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* updated with comments

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* ran pre-commit

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* chore: update lockfile

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* fix: Align __init__ args with factory method kwargs

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* chore: bump docling-ibm-models version

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* fix mypy type stubs error with torchvision

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* fix: Add torch/torchvision direct deps

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Remove global torch imports

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* add test diffs

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* fix jats/xbrl test generate, updated test GT from docling-core upgrade

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* fixed the  in the cli

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* updated uv lock

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* fixing merge conflicts with main

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* fixing merge conflicts with main (2)

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* upgrade uv.lock

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Ahmed Nassar <AHN@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
2026-03-10 11:57:00 +01:00
Ivan Traus 80f75b8896 fix(html): fix broken document tree and quadratic complexity in rich table cells (#3025)
* fix(html): fix broken document tree and quadratic PictureItems in rich table cells

Three related bugs in the HTML backend when processing table cells that
contain rich content (RichTableCell), as found on Wikipedia pages with
large reference, taxobox, or classification tables:

Bug 1 — orphaned InlineGroups causing broken parent/child relationships
------------------------------------------------------------------------
When _use_inline_group() created an InlineGroup node (for paragraphs
containing multiple hyperlinks, e.g. "text <a> and <a>"), it was added
as a child of the current parent via doc.add_group(), but its RefItem was
never appended to added_refs / provs_in_cell. This meant:

  - group_cell_elements() reparented the text items inside the InlineGroup
    (because their individual refs WERE in added_refs), moving them from
    body → outer_group_element.
  - The InlineGroup itself remained in body.children still pointing to
    those same text items as its .children.
  - Result: two nodes (InlineGroup and outer_group_element) claimed the
    same child items, with contradictory .parent pointers. This broken
    tree caused double-serialization of text items in export_to_markdown().

Fix: make _use_inline_group() yield the RefItem of the created group.
Callers (_flush_buffer, _handle_block, _handle_list) now track the
InlineGroup ref instead of individual leaf refs when a group was created.
group_cell_elements() then reparents the whole InlineGroup (with its
children intact) rather than orphaning it.
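
The Bug 1 fix can be pictured with a toy context manager. This is only a sketch under stated assumptions: `ToyDoc`, `use_inline_group`, and `collect_refs` are hypothetical stand-ins, not the backend's actual code, and refs are plain strings here rather than RefItem objects.

```python
from contextlib import contextmanager
from typing import Iterator, Optional


class ToyDoc:
    """Hypothetical stand-in for a document that hands out group refs."""

    def __init__(self) -> None:
        self.groups = []

    def add_group(self) -> str:
        ref = f"#/groups/{len(self.groups)}"
        self.groups.append(ref)
        return ref


@contextmanager
def use_inline_group(doc: ToyDoc, link_count: int) -> Iterator[Optional[str]]:
    # Yield the ref of the created group, or None when no group is needed.
    group_ref = doc.add_group() if link_count > 1 else None
    yield group_ref


def collect_refs(doc: ToyDoc, leaf_refs: list, link_count: int) -> list:
    # Track the group ref when a group exists, otherwise the leaf refs,
    # so a later reparenting step moves the group as a whole.
    with use_inline_group(doc, link_count) as group_ref:
        if group_ref is not None:
            return [group_ref]
        return list(leaf_refs)
```

With two links the caller records only the group ref; with a single link it records the leaf refs as before.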

Bug 2 — quadratic PictureItem creation from stray outer image loop
-------------------------------------------------------------------
In _handle_block() for <table> tags, after parse_table_data() had already
walked the entire table subtree (including nested tables) and emitted
PictureItems for every <img>, there was an additional outer loop:

    for img_tag in tag("img"):
        im_ref2 = self._emit_image(tag, doc)

Because BeautifulSoup's .find_all("img") on a tag finds ALL descendant
<img> elements (including those in nested tables), this loop processed
every image in the entire subtree again. A table nested N levels deep
caused N*(N+1)/2 duplicate PictureItems per image (quadratic growth).

Fix: remove the outer loop. Images are already handled by parse_table_data()
-> _use_table_cell_context() -> _walk() -> _emit_image().
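
One way to see the quadratic growth is a simplified model, assuming one image per nesting level. The sketch uses the standard library's `xml.etree.ElementTree` as a stand-in for BeautifulSoup, since `Element.iter("img")` also visits every descendant; `build_nested` and `count_stray_emissions` are hypothetical helpers, not code from the backend.

```python
import xml.etree.ElementTree as ET


def build_nested(depth: int) -> ET.Element:
    """Build `depth` nested <table> elements, each holding one <img>."""
    root = ET.Element("table")
    node = root
    for _ in range(depth - 1):
        ET.SubElement(node, "img")
        node = ET.SubElement(node, "table")
    ET.SubElement(node, "img")
    return root


def count_stray_emissions(table: ET.Element) -> int:
    """Count the images a per-table 'all descendant images' loop would emit."""
    # The stray loop sees every <img> below this table, nested ones included.
    count = sum(1 for _ in table.iter("img"))
    # The same loop then runs again for each directly nested table.
    for child in table:
        if child.tag == "table":
            count += count_stray_emissions(child)
    return count
```

Under this model, a table nested N levels deep yields N + (N-1) + ... + 1 = N*(N+1)/2 stray emissions in total.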

Bug 3 — missing space separator between nested table cell text
--------------------------------------------------------------
HTMLDocumentBackend.get_text() uses _extract_text_recursively(), which
only appended a trailing space for <p> and <li> tags. When a table cell
contained a nested <table>, adjacent <th> or <td> elements without
whitespace NavigableString nodes between them were concatenated directly
(e.g. "TypeSound" instead of "Type Sound").

Fix: add "th" and "td" to the trailing-space tag set so that the text
content of each cell is separated by a space.
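
The effect of the Bug 3 fix can be sketched with a small recursive extractor. This is an illustration under assumptions: `extract_text` is a hypothetical helper built on the standard library's `xml.etree.ElementTree`, not the backend's `_extract_text_recursively()`.

```python
import xml.etree.ElementTree as ET


def extract_text(node: ET.Element, space_tags: set) -> str:
    """Concatenate text recursively, adding a space after selected tags."""
    parts = [node.text or ""]
    for child in node:
        parts.append(extract_text(child, space_tags))
        parts.append(child.tail or "")
    text = "".join(parts)
    if node.tag in space_tags:
        text += " "
    return text


CELL = "<table><tr><th>Type</th><td>Sound</td></tr></table>"

# Only <p> and <li> get a trailing space: adjacent cells run together.
before = extract_text(ET.fromstring(CELL), {"p", "li"}).strip()
# With "th" and "td" in the set, each cell's text is separated by a space.
after = extract_text(ET.fromstring(CELL), {"p", "li", "th", "td"}).strip()
```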

Bug 1 and Bug 2 were introduced in docling v2.55.0 (commit c803abe) with
rich table cell support.

Signed-off-by: Ivan Traus <ivan@liminary.io>

* test(html): align markdown fixtures with current docling-core behavior

Signed-off-by: Ivan Traus <ivan@liminary.io>

* test(xbrl): update XBRL fixture after get_text() cell spacing fix

The Bug 3 fix (adding th/td to trailing-space tags in get_text())
affects the XBRL backend which internally uses HTMLDocumentBackend.
Regenerate the mlac-20251231 fixture to match the corrected text
extraction.

Signed-off-by: Ivan Traus <ivan@liminary.io>

* chore(deps): bump docling-core to 2.67.1, regenerate fixtures and trim tests

Update uv.lock to pull in the merged nested-table flattening fix
(docling-core#525). Regenerate markdown fixtures that now show flattened
text instead of invalid embedded table syntax. Trim verbose test
docstrings and remove narrating comments.

Signed-off-by: Ivan Traus <ivan@liminary.io>

* fix: annotate _use_inline_group return type and regenerate docx fixtures

Add Generator[RefItem | None, None, None] return type and Google-style
Yields section to _use_inline_group. Regenerate docx ground truth
fixtures affected by docling-core 2.67.1 nested-table flattening.

Signed-off-by: Ivan Traus <ivan@liminary.io>

* refactor: use Iterator type hint and remove redundant test

Apply feedback: use Iterator instead of Generator, drop type from Yields docstring, and remove
test_e2e_rich_table_cells_markdown (already covered by test_e2e_html_conversions).

Signed-off-by: Ivan Traus <ivan@liminary.io>

* style(html): apply indent to docstrings

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

---------

Signed-off-by: Ivan Traus <ivan@liminary.io>
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2026-03-10 09:48:21 +01:00
Faiq Adzlan 5188180ea3 fix: loosen dependency for pandas3 (#3095)
* modify `pyproject.toml` to support pandas 3.0.0

* update `uv.lock` file

Signed-off-by: wanadzhar913 <adzhar.faiq@gmail.com>

* DCO Remediation Commit for wanadzhar913 <adzhar.faiq@gmail.com>

I, wanadzhar913 <adzhar.faiq@gmail.com>, hereby add my Signed-off-by to this commit: 9c1e9ae5b4

Signed-off-by: wanadzhar913 <adzhar.faiq@gmail.com>

---------

Signed-off-by: wanadzhar913 <adzhar.faiq@gmail.com>
2026-03-10 06:36:19 +01:00
Christoph Auer 3d90778e3e feat: Add gRPC transport for KServe v2 API engine (#3074)
* debug: add profiling across threaded standard PDF pipeline

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* feat: add KServe v2 gRPC client for inference engines

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* cleanup

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* cleanup excess profiling logging

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* remove nonsense tests

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2026-03-07 10:54:25 +01:00
github-actions[bot] b7815658d1 chore: bump version to 2.77.0 [skip ci] 2026-03-06 13:45:28 +00:00
github-actions[bot] 752f81b3dd chore: bump version to 2.76.0 [skip ci] 2026-03-02 14:43:12 +00:00
Cesar Berrospi Ramis d276e60561 feat: export to WebVTT format (#3036)
* style(cli): apply python 3.10+ syntax, remove unnecessary imports

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* feat(vtt): export of DoclingDocument to WebVTT format

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* build: pin docling-core version 2.66.0

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

---------

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2026-02-27 14:22:52 +01:00
github-actions[bot] aec57629af chore: bump version to 2.75.0 [skip ci] 2026-02-24 20:16:56 +00:00
Cesar Berrospi Ramis 334ba6e51f feat: create a backend parser for XBRL instance reports (#3017)
* build(xbrl): add Arelle as open-source library for XBRL

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* feat(xbrl): design and implement a backend parser for XBRL documents

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* test: remove print statements to reduce verbosity

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* style(XBRL): apply PEP8 naming convention for acronyms

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* refactor(XBRL): set XBRL dependencies as optional

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

---------

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2026-02-24 16:52:02 +01:00
Christoph Auer 03532938b5 feat: Unified model-family inference engines (including image-classification) and KServe v2 API support (#2979)
* feat: Inference engines abstraction for image classification model family with HF Transformers and ONNX runtime

Implements runtime abstraction for image classification models with support for both ONNX Runtime and HuggingFace Transformers engines. Users can switch between engines without model retraining, similar to the object detection abstraction (#2959).

Key components:
- BaseImageClassificationEngine with factory pattern
- OnnxRuntimeImageClassificationEngine and TransformersImageClassificationEngine implementations
- Shared HfVisionModelMixin for common HF model utilities
- Engine-specific configuration options
- Test suite and example demonstrating runtime engine switching

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add missing files and re-export for backward compat

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Don't run with OCR in the example.

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Remove excess onnxruntime related options for inputs and outputs

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* feat: centralize torch compile defaults with DOCLING_INFERENCE_COMPILE_TORCH_MODELS

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* feat: Add Kserve2 API engine for image classifier and object detection models (#2999)

* fix: add failed pages to DoclingDocument for page break consistency (#2939)

* fix: add failed pages to DoclingDocument for page break consistency

When some PDF pages fail to parse, they were not added to
DoclingDocument.pages, causing page break markers to be incorrect
during export. This adds failed/skipped pages with their size info
(if available) to maintain correct page numbering and structure.

- Add _add_failed_pages_to_document() method in StandardPdfPipeline
- Add test cases for failed page handling
- Add test cases for normal page handling (regression test)
- Add test PDF files

Signed-off-by: jhchoi1182 <jhchoi1182@gmail.com>

* fix: ensure resource cleanup and simplify type hints

- Wrap page_backend usage in try-finally to guarantee unload (prevents resource leaks).
- Simplify redundant 'float | None | None' type hint.

Signed-off-by: jhchoi1182 <jhchoi1182@gmail.com>

* fix: add groundtruth for normal_4pages.pdf and exclude failing PDFs from e2e test

Signed-off-by: jhchoi1182 <jhchoi1182@gmail.com>

* fix: ensure correct status assertion for failed pages in tests

Signed-off-by: jhchoi1182 <jhchoi1182@gmail.com>

---------

Signed-off-by: jhchoi1182 <jhchoi1182@gmail.com>

* fix: Use timezone-aware datetime (#2947)

* Use timezone-aware datetime for profiling timestamps

Updated timestamp recording to use timezone-aware datetime.
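
A timezone-aware timestamp can be produced as sketched below. The commit text does not show which call was replaced, so `profiling_timestamp` is a hypothetical helper and not the project's code.

```python
from datetime import datetime, timedelta, timezone


def profiling_timestamp() -> datetime:
    # datetime.now(timezone.utc) returns an aware datetime with tzinfo set,
    # unlike naive values, which carry no offset information.
    return datetime.now(timezone.utc)


stamp = profiling_timestamp()
```

Aware timestamps can be compared and subtracted safely across machines in different time zones.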

Signed-off-by: Nikhil Singh <124866156+Ritinikhil@users.noreply.github.com>

* run formatter

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Nikhil Singh <124866156+Ritinikhil@users.noreply.github.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>

* fix(asciidoc): handle commas in image alt text (#2983)

* Fix: Handle commas in AsciiDoc image alt text

  - Modified _parse_picture() to gracefully handle alt text containing commas
  - Commas in alt text are now preserved instead of causing ValueError
  - Added test case with realistic auto-generated alt text
  - split('=', 1) prevents issues when values contain '=' characters
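
A rough sketch of the parsing idea follows, assuming that comma-separated parts without `=` belong to the alt text. `parse_image_attributes` is a hypothetical function and not the backend's `_parse_picture()`.

```python
def parse_image_attributes(raw: str) -> dict:
    """Split an image attribute list into alt text and key=value pairs."""
    alt_parts = []
    attrs = {}
    for part in raw.split(","):
        part = part.strip()
        if "=" in part:
            # Split only on the first '=' so values may contain '=' themselves.
            key, value = part.split("=", 1)
            attrs[key.strip()] = value.strip()
        elif part:
            # Parts without '=' are treated as pieces of the alt text.
            alt_parts.append(part)
    if alt_parts:
        attrs["alt"] = ", ".join(alt_parts)
    return attrs
```

For example, `"A chart, showing growth, width=200"` keeps the comma inside the alt text instead of raising an error. Alt text that itself contains `=` would need extra handling in this simplified version.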

* DCO Remediation Commit for n0rdp0l <n90.w135@gmail.com>

I, n0rdp0l <n90.w135@gmail.com>, hereby add my Signed-off-by to this commit: ee752491fc

Signed-off-by: n0rdp0l <n90.w135@gmail.com>

* style: fix ruff formatting in test_backend_asciidoc.py

Signed-off-by: n0rdp0l <n90.w135@gmail.com>

---------

Signed-off-by: n0rdp0l <n90.w135@gmail.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>

* chore: bump version to 2.73.1 [skip ci]

* First attempt at establishing API Kserve2 facet

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* refactor: improve KServe v2 engine implementation after code review

- Add comprehensive error handling to KserveV2HttpClient
  - Catch and wrap Timeout, ConnectionError, HTTPError with context
  - Validate response formats with clear error messages

- Refactor URL building to eliminate duplication
  - Extract _build_model_url() helper method
  - Single source of truth for infer_url and model_metadata_url

- Make URL required parameter (remove default localhost:8000)
  - Update ApiKserveV2*EngineOptions to require explicit URL
  - Add preset validation with helpful error messages

- Rename constants for clarity: TRITON_* → KSERVE_V2_*
  - Add comment explaining KServe v2 uses Triton type system

- Improve error messages with actual values
  - Show counts, shapes, and supported types in validation errors

- Document official KServe Python SDK alternative
  - Note async-only requirement and alpha status

- Update tests for required URL parameter
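
The URL-building refactor could be sketched as below. The paths follow the KServe v2 REST layout (`/v2/models/<name>[/versions/<version>]` for metadata, plus `/infer` for inference); the function names and signatures are assumptions for illustration, not the client's actual helpers.

```python
from typing import Optional


def build_model_url(base_url: str, model_name: str, model_version: Optional[str] = None) -> str:
    """Single place where the model path is assembled."""
    url = f"{base_url.rstrip('/')}/v2/models/{model_name}"
    if model_version:
        url += f"/versions/{model_version}"
    return url


def model_metadata_url(base_url: str, model_name: str, model_version: Optional[str] = None) -> str:
    return build_model_url(base_url, model_name, model_version)


def infer_url(base_url: str, model_name: str, model_version: Optional[str] = None) -> str:
    return build_model_url(base_url, model_name, model_version) + "/infer"
```

Because `base_url` has no default, callers have to pass the server address explicitly, in line with the required-URL change listed above.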

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Cleanup in kserve http helper and options

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Further cleanup

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fix for remote-services on tablemodel

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* fix: improved deserialization of engine_options (#3008)

* add registry of discriminated subclasses

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* fix detection of engine_type value

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* Add options serialization improvements

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: jhchoi1182 <jhchoi1182@gmail.com>
Signed-off-by: Nikhil Singh <124866156+Ritinikhil@users.noreply.github.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: n0rdp0l <n90.w135@gmail.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: jhchoi1182 <jhchoi1182@gmail.com>
Co-authored-by: Nikhil Singh <124866156+Ritinikhil@users.noreply.github.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Felix Wente <63914035+n0rdp0l@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>

* Fixes from review

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* DCO Remediation Commit for Christoph Auer <cau@zurich.ibm.com>

I, Christoph Auer <cau@zurich.ibm.com>, hereby add my Signed-off-by to this commit: 4cdb01e6d3

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* DCO Remediation Commit for Christoph Auer <60343111+cau-git@users.noreply.github.com>

I, Christoph Auer <60343111+cau-git@users.noreply.github.com>, hereby add my Signed-off-by to this commit: e293ba3270

Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add fallback for API variants

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Recreate uv.lock

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: jhchoi1182 <jhchoi1182@gmail.com>
Signed-off-by: Nikhil Singh <124866156+Ritinikhil@users.noreply.github.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: n0rdp0l <n90.w135@gmail.com>
Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>
Co-authored-by: jhchoi1182 <jhchoi1182@gmail.com>
Co-authored-by: Nikhil Singh <124866156+Ritinikhil@users.noreply.github.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Felix Wente <63914035+n0rdp0l@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
2026-02-18 10:49:19 +01:00
github-actions[bot] 460e4e5ce6 chore: bump version to 2.74.0 [skip ci] 2026-02-17 21:16:42 +00:00
Cesar Berrospi Ramis 576bada7b7 fix: security vulnerabilities with XML External Entity and related attacks (#3009)
* fix(uspto): disable external entity resolution in SAX parser to prevent XXE

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* style(uspto): use vertical bar annotation instead of Optional and Union

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* fix(jats): add parser options to prevent XXE attacks

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* style(jats): use vertical bar annotation instead of Optional and Union

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
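
For the SAX-based hardening described in the USPTO item above, a minimal sketch with Python's standard `xml.sax` module might look like this. `TextCollector` and `parse_hardened` are hypothetical names, and the exact options used in the backends are not shown in the commit text.

```python
import io
import xml.sax
from xml.sax.handler import (
    ContentHandler,
    feature_external_ges,
    feature_external_pes,
)


class TextCollector(ContentHandler):
    """Collects character data so the parse result can be inspected."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks = []

    def characters(self, content: str) -> None:
        self.chunks.append(content)


def parse_hardened(xml_bytes: bytes) -> str:
    collector = TextCollector()
    parser = xml.sax.make_parser()
    # Do not resolve external general or parameter entities.
    parser.setFeature(feature_external_ges, False)
    parser.setFeature(feature_external_pes, False)
    parser.setContentHandler(collector)
    parser.parse(io.BytesIO(xml_bytes))
    return "".join(collector.chunks)
```

Recent Python versions already leave external general entities disabled by default; setting the features explicitly documents the intent and guards against a changed default.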

---------

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2026-02-17 20:29:04 +01:00
Peter W. J. Staar bf417e6d26 feat: Introduce docling-parse v5 and deprecate old docling-parse backends (#2872)
* feat: simplifying towards docling-parse v5

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* working on integrating docling-parse v5

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* added the test_backend_docling_parse

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* Updated the docling-parse to 5.3.0

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* ran the pre-commit

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* fixed the backend_docling_parse

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* ran pre-commit

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* updated the groundtruth to deal with rounding errors

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* updated comments for later docling-parse integrations

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* ran pre-commit

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* Make DoclingParseV2 and DoclingParseV4 backend stubs that route to new backend, emit warning.

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
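
A deprecation stub of the kind described above can be sketched as follows. The class names are hypothetical placeholders, not the real backend classes.

```python
import warnings


class NewParseBackend:
    """Hypothetical replacement backend."""

    def __init__(self, path: str) -> None:
        self.path = path

    def describe(self) -> str:
        return f"new backend for {self.path}"


class LegacyV2Backend(NewParseBackend):
    """Stub kept for compatibility: warns, then behaves like the new backend."""

    def __init__(self, path: str) -> None:
        warnings.warn(
            "LegacyV2Backend is deprecated; use NewParseBackend instead",
            DeprecationWarning,
            stacklevel=2,  # point the warning at the caller, not the stub
        )
        super().__init__(path)
```

Existing imports keep working while users are nudged toward the new class.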

* lock docling-parse

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* updated to 3.5.2

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2026-02-17 20:27:56 +01:00
github-actions[bot] 16b2081035 chore: bump version to 2.73.1 [skip ci] 2026-02-13 15:34:52 +00:00
Samved Divekar 0967a4d908 fix: Loosen pillow version constraints to allow CVE-2026-25990 fix (#2992)
* Loosen pillow version constraints to allow CVE-2026-25990 fix

Signed-off-by: divekarsc <divekar.samved@gmail.com>

* Added numba>=0.63.0 constraint directly to the asr optional dependency in pyproject.toml

Signed-off-by: divekarsc <divekar.samved@gmail.com>

* fix: Added numba>=0.63.0 constraint directly to the asr optional dependency in pyproject.toml

Signed-off-by: divekarsc <divekar.samved@gmail.com>

---------

Signed-off-by: divekarsc <divekar.samved@gmail.com>
2026-02-13 10:03:58 +01:00
github-actions[bot] 9166b47e73 chore: bump version to 2.73.0 [skip ci] 2026-02-11 09:53:58 +00:00
Christoph Auer 14e474c955 feat: Inference engines abstraction for object detection model family with HF Transformers and ONNX runtime (#2959)
* Add object_detection family and inference engine

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fixes for alignment with vlm family

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Proposals for layout and table models

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update the object-detection family, runtime, plugin, and more

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add comments

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Clean up artifacts path handling.

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* feat: Load label mappings from HuggingFace config in object detection engines

- Add abstract get_label_mapping() method to BaseObjectDetectionEngine
- Implement label loading from config.json in OnnxRuntimeObjectDetectionEngine
- Refactor LayoutObjectDetectionModel to use engine-provided labels instead of hardcoded mapping
- Centralizes label mapping logic in the inference engine layer

This eliminates hardcoded label dictionaries and makes label mappings configurable through model configs.
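
Reading labels from a HuggingFace-style `config.json` might look like the sketch below, assuming the file has an `id2label` object whose keys are stringified integers. `load_label_mapping` is a hypothetical helper, not the engine's `get_label_mapping()`.

```python
import json
from pathlib import Path


def load_label_mapping(config_path) -> dict:
    """Return {class index: label name} from a model's config.json."""
    config = json.loads(Path(config_path).read_text(encoding="utf-8"))
    # JSON object keys are always strings, so convert them back to integers.
    return {int(idx): label for idx, label in config.get("id2label", {}).items()}
```

A model that ships a different label set then needs no code change, only a different config file.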

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* feat: Add Transformers engine for object detection

Implement TransformersObjectDetectionEngine as a PyTorch-based alternative
to ONNX Runtime. Works as drop-in replacement for both layout and table
detection models with support for CPU, CUDA, and MPS devices.

- Add TransformersObjectDetectionEngine with AutoModelForObjectDetection
- Update TransformersObjectDetectionEngineOptions (score_threshold, torch_dtype)
- Update factory to instantiate Transformers engine
- Switch OBJECT_DETECTION_LAYOUT_HERON preset to use Transformers by default
- Add logging configuration to layout_object_detection_example.py

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Improve OD example with different runtimes to demonstrate abstraction.

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Added onnxruntime as extra

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update example header.

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add missing transformers_engine for OD

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Remove unused cleanup hook.

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Guard against onnxruntime missing on python 3.14

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* refactor: extract shared HF object-detection engine base

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2026-02-10 16:15:30 +01:00
github-actions[bot] c28d0d6c5d chore: bump version to 2.72.0 [skip ci] 2026-02-03 15:08:32 +00:00
github-actions[bot] c74d378b08 chore: bump version to 2.71.0 [skip ci] 2026-01-30 17:11:20 +00:00
Cesar Berrospi Ramis 0602a7cdab feat: webvtt and source tracker (#2787)
* refactor(provenance): account for provenance as union of ProvenanceItem and ProvenanceTrack

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* refactor(webvtt): update WebVTTDocumentBackend with new docling-core classes

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* refactor(webvtt): preserve new lines and add helper handlers

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* refactor(webvtt): set ProvenanceTrack timings as float type

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* style(asr): remove unnecessary imports

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* refactor(asr): use ProvenanceTrack in ASR pipeline

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* tests(webvtt): add additional tests

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* chore(webvtt): parse the title of the WEBVTT file

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* refactor(webvtt): apply refactoring of TrackProvenance from docling-core

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* style(webvtt): apply X | Y annotation instead of Optional, Union

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* refactor(webvtt): drop cue span classes, 'lang' and 'c' tags

Drop WebVTT formatting features not covered by Docling across formats.
Only 'u', 'b', 'i', and 'v' are supported and without classes.
Align with docling-core v2.62.0

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* build: pin docling-core 2.62.0

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

---------

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2026-01-30 17:44:03 +01:00