Commit Graph

303 Commits

Author SHA1 Message Date
Maksym Lysak 38354b7d13 Added support for "row_section" semantics in HTML_backend.
Improvements on complex rendering example.

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
2026-05-12 17:08:27 +02:00
David Wallace 694cf0c791 fix: Handle valid JATS contributor name variants (#3432)
fix: Handle JATS author name variants

Signed-off-by: David Wallace <dwallace0723@gmail.com>
2026-05-12 10:44:54 +02:00
benvizel b5f2e530e2 feat(extraction): add Granite Vision 4.1 as alternative KVP extraction model (#3398)
Introduce a unified TransformersExtractionModel that supports multiple
prompt styles via an ExtractionPromptStyle enum. This replaces the
need for separate model classes per VLM.

- Add ExtractionPromptStyle enum (NUEXTRACT, GRANITE_VISION)
- Add prompt_utils.py with style-specific prompt builders
- Add TransformersExtractionModel with prompt-style dispatch
- Add GRANITE_VISION_4_1_TRANSFORMERS model spec
- Add extraction_prompt_style field to VlmExtractionPipelineOptions

Signed-off-by: Ben Wiesel <benwiesel@ibm.com>
Co-authored-by: Ben Wiesel <benwiesel@ibm.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-12 06:53:22 +02:00
Brighton 0c317060cf fix(docx): preserve custom numbering text prefix in list markers (#3425)
* fix(docx): preserve custom numbering text prefix in list markers

* DCO Remediation Commit for Brighton <brighton@Brightons-MacBook-Air.local>

I, Brighton <brighton@Brightons-MacBook-Air.local>, hereby add my Signed-off-by to this commit: b92e32597b

Signed-off-by: Brighton <brighton@Brightons-MacBook-Air.local>

* fix: resolve lint errors in test (unused vars, import order)

Signed-off-by: Brighton <brighton@Brightons-MacBook-Air.local>

* style: fix ruff lint and format

Signed-off-by: Brighton <brighton@Brightons-MacBook-Air.local>

* refactor: move import re to top of file

Signed-off-by: Brighton <brighton@Brightons-MacBook-Air.local>

---------

Signed-off-by: Brighton <brighton@Brightons-MacBook-Air.local>
Co-authored-by: Brighton <brighton@Brightons-MacBook-Air.local>
2026-05-12 06:49:34 +02:00
Christoph Auer 64ddeb64b8 fix: Update service client URL parsing with v1 suffix (#3415)
Update service client URL parsing of v1 suffix

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2026-05-08 15:34:21 +02:00
geoHeil 5b1df788ef ci: tighten pre-commit guardrails (#3346)
* ci: tighten pre-commit guardrails

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: validate pre-commit guardrail changes

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: switch hook validation to prek

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: exempt active slim plan from max-lines

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: move max-lines config under github

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: fail on uncovered tach modules

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: ignore generated docs in max-lines check

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: clarify local validation tasks

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* docs: refine agent instructions

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: replace mypy with ty

(cherry picked from commit 382afbde8f00abfaeba95ea9c8e9cc603f27a2d9)
Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: replace justfile with makefile

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

---------

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>
2026-05-08 15:07:11 +02:00
Christoph Auer aba7f155ae fix(client): Make submit_and_retrieve_many accept lazy iterable and yield (#3405)
* Remove eager materialization from docling-service batch submission

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update lock

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Make convert_all evaluate Iterable input lazily, remove raises_on_error

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Make convert_all use async generator like submit_and_retrieve_many

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* fix mypy fast check

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* upgrade packages

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* update test GT data

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update test GT from linux machine

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Reset all GT test data and uv.lock

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2026-05-07 18:15:26 +02:00
Panos Vagenas eb6e1e6609 fix(html): add redirect validation to image fetching (#3407)
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2026-05-07 08:12:06 +02:00
Panos Vagenas 2bb0fa67bd fix(html): improve local file path handling (#3400)
* fix: improve local file path handling

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* fix: improve Windows path handling

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* add backslash examples

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

---------

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2026-05-06 16:34:17 +02:00
Qiefan Jiang 6b3322ef85 fix(markdown): flush pending list/heading creation on CodeSpan to prevent RecursionError (#3361)
Signed-off-by: Qiefan Jiang <jiangqiefan@bytedance.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2026-05-06 13:26:44 +02:00
geoHeil 885873ea36 ci: avoid mutable PR merge refs in fast checks (#3397)
* ci: build stable PR fast-check merge tree

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* test: skip PR fast-check tree test on Windows

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

---------

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>
2026-05-06 10:33:16 +02:00
Cesar Berrospi Ramis e00735dd59 fix(docx): fix OMML equation handling and improve type safety (#3381)
* fix(docx): handle missing chr attribute in groupChr OMML elements

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* fix(docx): escape spaces in OMML limit text for proper LaTeX rendering

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* fix(docx): fix inline equation reconstruction to prevent tag corruption

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* chore(docx): add type hints and docstrings to OMML module

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* fix(docx): fix genfrac formatting and eliminate grouping function warnings

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* fix(docx): handle unmapped characters in OMML % formatting

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

---------

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2026-05-04 10:58:25 +02:00
geoHeil eb4724ee4c ci: prototype tach-based modular skipping (#3333)
* ci: prototype tach-based modular skipping

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: modularize ubuntu setup and refine gating

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: adopt metaxy-inspired governance helpers

- replace custom aggregate check with re-actors/alls-green

- set FORCE_JAVASCRIPT_ACTIONS_TO_NODE24 on every workflow

- keep PR concurrency alive when the graphite:merge label is present

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: tune checks and pin action versions

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: split CI suites and heavy examples

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* DCO Remediation Commit for Georg Heiler <georg.kf.heiler@gmail.com>

I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: ecaa4777886157d5c2a7b3893c3a820983089dbf
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: d15416f3ca94ac97af2a8317cd6404208db9d896

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: sharpen tach graph and per-suite path filters

- Split docling.pipeline into per-pipeline tach modules
  (asr, vlm, standard_pdf, threaded_standard_pdf, legacy_standard_pdf,
  extraction_vlm, base, base_extraction, simple) so pytest --tach-base
  impact analysis can attribute changes to a specific pipeline rather
  than the whole package.
- Split the asr- and vlm-specific docling.datamodel option files
  (asr_model_specs, pipeline_options_asr_model, vlm_engine_options,
  vlm_model_specs, pipeline_options_vlm_model, layout_model_specs,
  stage_model_specs, backend_options) into their own tach modules so
  a narrow spec/options change no longer marks the full datamodel as
  impacted.
- Narrow the per-suite pipeline path filters in checks.yml to the
  concrete pipeline files relevant to each suite, so editing
  vlm_pipeline.py only triggers the vlm matrix cell and editing
  asr_pipeline.py only the asr one.
- Rekey the model cache in setup-ubuntu-ci to include runner.os and
  hashFiles(uv.lock, pyproject.toml), with ordered restore-keys
  fallbacks so a lockfile bump no longer silently stales the cache.

Metaxy parity note: layered tach enforcement (layer = "...") is
blocked by existing backend<->datamodel and utils<->stages cycles;
depot runners, nox dynamic matrices, devenv/nix, dprint and ty are
not applicable to docling's stack. All pinned action SHAs are on
their latest release as of this commit.

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: introduce pipeline and orchestration tach layers

Earlier notes claimed layers were blocked. That was only true for the
cyclic core (backend<->datamodel, utils<->stages). The boundary
*above* core is clean:

- No module under docling/backend, docling/datamodel, docling/models,
  docling/utils, docling/exceptions, or docling/chunking imports
  anything from docling.pipeline (verified by grep).
- No module anywhere in docling/ imports from docling.cli,
  docling.document_converter, docling.document_extractor, or
  docling.service_client (also verified).

So we can introduce two real layers on top of the cyclic core:

- "pipeline"      — docling.pipeline and all nine concrete pipelines
                     (base, simple, base_extraction, asr, vlm,
                     extraction_vlm, standard_pdf,
                     threaded_standard_pdf, legacy_standard_pdf).
- "orchestration" — docling.cli, docling.document_converter,
                     docling.document_extractor, and
                     docling.experimental.pipeline.

Unlayered modules stay "below" both layers (tach allows them to be
depended on freely) and continue to carry the declared-but-cyclic
backend<->datamodel and utils<->stages edges.

A VLM-only layer was explored but rejected: only
docling.pipeline.vlm_pipeline and docling.pipeline.extraction_vlm_pipeline
could be cleanly layered as "vlm", because the matching datamodel
options (pipeline_options_vlm_model, vlm_engine_options,
vlm_model_specs) and model stages (vlm_convert, vlm_pipeline_models)
sit inside the datamodel/models cycle and cannot be promoted to a
higher layer without first breaking that cycle. Layering only the
two pipeline files is not worth the extra config.

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: expand tach layers to entrypoints/pipeline/models/core

Follow-up to the two-layer attempt. After verifying via grep that
nothing in datamodel/utils/backend imports from
docling.models.{extraction,factories,plugins,vlm_pipeline_models}
or from the "upper" stages (page_assemble, page_preprocessing,
reading_order, picture_description, vlm_convert), those nine
modules can be promoted out of the cyclic core into a dedicated
"models" layer.

The resulting order (highest first):

- entrypoints — cli, document_converter, document_extractor,
                experimental.pipeline
- pipeline    — docling.pipeline + the nine concrete pipelines
- models      — model factories, extraction, plugins,
                vlm_pipeline_models, and the five "upper" stages
- core        — datamodel*, backend*, utils, exceptions, chunking,
                models (base), models.utils, inference_engines.*,
                the six "core stages" that utils cycles with
                (chart_extraction, code_formula, layout, ocr,
                picture_classifier, table_structure), and the
                experimental.* and service_client modules

Rename the previous "orchestration" layer to "entrypoints" to
match the common docling vocabulary. Every module now carries an
explicit layer tag instead of relying on implicit unlayered
behaviour, so future additions must pick a layer deliberately.

A VLM layer, a stand-alone inference-engines layer, and separating
datamodel from backend all remain blocked by the bidirectional
backend<->datamodel and utils<->core-stages edges; those need a
code-level refactor first.

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: refine tach client and foundation layers

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: add optional windows and macos smoke lanes

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: normalize reusable workflow boolean inputs

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: replace external all-green action

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: use org-allowed setup-uv action

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: install compiler toolchain for ML tests

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* DCO Remediation Commit for Georg Heiler <georg.kf.heiler@gmail.com>

I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: bb714afb42cd1b29ab073a7f59cc72874ff2fdcd

I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: a1f2761da8f72bfed636bd571ebf77b42c8771b6

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* DCO Remediation Commit for Georg Heiler <georg.kf.heiler@gmail.com>

I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: cc6551b54c5bf4815ae9cd57cf43a98928a74be0

I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: b21b0e7ca12b552dbdd54fac1bda113719c286f1

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: simplify ML pytest suite patterns

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: gate heavy examples on label, add job timeouts

- ci-heavy-examples: run only on main push, schedule, workflow_dispatch,
  or when a PR is labeled tests:full / tests:heavy-examples. Drops the
  path-based auto-trigger so that common edits to pyproject.toml,
  uv.lock, or .github/actions do not kick off the 45-60min matrix on
  every PR push. Collapses the changes job into a job-level if gate and
  adds timeout-minutes: 90.
- checks.yml: add timeout-minutes to every job so stuck runners cannot
  burn the full 6h default.

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: tolerate cancelled allowed-skip jobs in check aggregator

Intentional cancellations (manual cancel, concurrency replacement) on
jobs that are already in ALLOWED_SKIPS should not mark the overall
workflow red. Treat `cancelled` the same as `skipped` when the job is
listed as an allowed skip; any unexpected cancellation of a required
job still fails.
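
A minimal sketch of the rule described above, assuming the aggregator sees a mapping of job name to result string. The function name `overall_ok` and the job names in the usage note are hypothetical; only ALLOWED_SKIPS and the `cancelled`/`skipped` results come from this commit message.

```python
from typing import Dict, Set


def overall_ok(results: Dict[str, str], allowed_skips: Set[str]) -> bool:
    """Return True when every job succeeded or was tolerably skipped."""
    for job, result in results.items():
        if result == "success":
            continue
        # A cancelled job is treated like a skipped one, but only when the
        # job is explicitly listed as an allowed skip.
        if result in {"skipped", "cancelled"} and job in allowed_skips:
            continue
        # Failures, and skips or cancellations of required jobs, fail the run.
        return False
    return True
```

With `allowed_skips={"macos-smoke"}`, a cancelled `macos-smoke` job leaves the aggregate green, while a cancelled `lint` job still turns it red.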

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* docs: make minimal vlm example portable

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* DCO Remediation Commit for Georg Heiler <georg.kf.heiler@gmail.com>

I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 2135051da3ed73d4b8a9130f584f40b56155af1a

I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 4f6d1d7960f7418d0cde6425ae61538da84fda40

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: install workspace packages in CI syncs

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* DCO Remediation Commit for Georg Heiler <georg.kf.heiler@gmail.com>

I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 492fa9883d4de6d98ebcb40fa863eafe2facff3c

I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 3eefae71643f9ca3df0264690c0c6eb1f67f06f1

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* DCO Remediation Commit for Georg Heiler <georg.kf.heiler@gmail.com>

I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: fe8c9689a0ee94f36eb826da8e2177ef87404f5e

I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: eabdd24a6734ec873cdaac857718aef2473677e7

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: remove unused graphite concurrency exception

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: document test labels and gate cross-platform lanes

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: select ml tests with pytest markers

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: fix marker selector typing

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: simplify ml suite scheduling

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: mark cross-platform smoke tests

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: reuse test trigger for ml matrix

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: tighten full ci aggregation

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: share required job result check

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

---------

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 14:15:35 +02:00
geoHeil 41e9fa7886 ci: implement phase 1 path-based workflow skipping (#3332)
* ci: add phase 1 path-based workflow skipping

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: add fast pull_request_target lint checks

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: keep pr fast checks cheap

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: expand full matrix triggers

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: enable same-repo and merge queue checks

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: harden pull_request_target fetch inputs

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: address phase 1 workflow review

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: grant reusable checks permissions

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: temporarily enable pr fast checks validation

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: allow first run of pr fast checks

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: load pr fast check script for first validation

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: format pr fast check script

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: guard temporary pr fast check script fallback

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: use pr metadata for temporary fast check validation

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: remove temporary pr fast checks trigger

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: disable duplicate pull request runs

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: run fast pr checks without path trigger filter

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: add job timeouts in checks.yml

Cap every job so a stuck runner cannot burn the 6h default. Limits:
changes=5, lint=20, run-tests-1/2=45, run-examples=60,
test-pip-install-*=30, build/test-package=15.

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: restore pull request workflow triggers

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: run lint on pull requests

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

---------

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>
2026-04-29 10:55:27 +02:00
pateltejas 72942486ff fix(pptx): skip malformed picture shapes instead of aborting conversion (#3372)
* fix(pptx): skip malformed picture shapes instead of aborting conversion

MsPowerpointDocumentBackend._handle_pictures reads embedded image bytes via python-pptx's shape.image accessor. On PPTX files with slightly malformed <p:pic> shapes, shape.image raises three exceptions that the existing (UnidentifiedImageError, OSError, ValueError) clause does not catch, so one bad picture aborts conversion of the entire presentation:

- InvalidXmlError when <p:blipFill> is missing
- KeyError when <a:blip r:embed> points to an unknown relationship
- AttributeError when the embedded part's content-type isn't an image

These files open normally in Keynote and Google Drive, so the backend should handle them as gracefully as it already handles truncated or unreadable image payloads.

This follows the same pattern as #2914, which extended the same except tuple with ValueError to handle linked (external) image references. The three cases above are the remaining shape.image failure modes that still escape.

Extend the except tuple to cover the three cases and log the same warning used for other unreadable images, leaving the rest of the presentation to convert normally. Add a regression fixture with one malformed picture per failure mode plus a focused test.

Fixes #3371

Signed-off-by: pateltejas <tejas226@hotmail.com>

* refactor(pptx): use warnings.warn for malformed picture skips

Address PR review feedback: use Python's warnings module with UserWarning to signal the skip to callers instead of logging.Logger.warning, matching the pattern used in msword_backend for "Skipping external image reference". This makes the skip visible via standard warning filters and catchable in tests.

Update the regression test to assert the warning is emitted via pytest.warns, which also suppresses the message during the test run so it doesn't clutter suite output.

Signed-off-by: pateltejas <tejas226@hotmail.com>

---------

Signed-off-by: pateltejas <tejas226@hotmail.com>
2026-04-29 08:29:08 +02:00
Nikos Livathinos 8b67fae687 feat: Extend the kserve-triton OCR model to have multi-lingual support (#3368)
* chore: Update .gitignore with local dirs of AI agents

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* feat: Extend KserveV2OcrModel and kserve_v2_grpc.py to support the new version of Triton-RapidOCR
model where the language is the first input parameter:
- The gRPC client has been extended to encode BYTE input, needed for String types.
- An additional test verifies that BYTE encoding and decoding work correctly.
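
For context, the KServe v2 binary representation of a string (BYTES) tensor is commonly a 4-byte little-endian length followed by the raw bytes of each element. The sketch below illustrates that framing under this assumption; the helper names are hypothetical and are not the ones used in kserve_v2_grpc.py.

```python
import struct
from typing import List


def encode_bytes_tensor(items: List[bytes]) -> bytes:
    """Serialize each element as <4-byte little-endian length><raw bytes>."""
    return b"".join(struct.pack("<I", len(item)) + item for item in items)


def decode_bytes_tensor(payload: bytes) -> List[bytes]:
    """Inverse of encode_bytes_tensor: walk the length-prefixed elements."""
    items: List[bytes] = []
    offset = 0
    while offset < len(payload):
        (length,) = struct.unpack_from("<I", payload, offset)
        offset += 4
        items.append(payload[offset : offset + length])
        offset += length
    return items
```

A language code such as `b"en"` would be framed as four length bytes followed by the two characters, and decoding the result returns the original list.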

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* feat: Add test for the KServe-Triton integration: WIP
- The test currently supports only the gRPC KServe client
- Extend the ground-truth test data.

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* fix: Simplify code in kserve test

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* chore: Rename test file

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* feat: Extend the kserve_v2 implementation to support binary data in the HTTP interface.
- Decouple functions for binary encoding/decoding inside the kserve_v2_utils.py and share for both HTTP and gRPC.
- Introduce use_binary_data init parameter in KserveV2OptionsMixin
- Improve tests

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* fix: Put back the field grpc_use_binary_data of KserveV2OptionsMixin as a deprecated alias to use_binary_data

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

---------

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
2026-04-28 16:00:57 +02:00
Cesar Berrospi Ramis 3df80e7f46 fix(docx): OMML conversion failures for unsupported limit functions (#3359)
* fix(docx): handle unsupported limit functions gracefully in OMML conversion

Replace RuntimeError with graceful fallback for unknown limit functions in do_limlow().
Add argmax and argmin to LIM_FUNC dictionary for proper LaTeX rendering.
Fixes conversion failures when Word documents contain mathematical operators
not previously supported in the limit function dictionary.
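
A rough sketch of the fallback described above. The LIM_FUNC name, the argmax/argmin additions, and the move away from RuntimeError come from this commit message; the dictionary contents, the `render_lower_limit` helper, and the `\operatorname` fallback are assumptions for illustration and may differ from what do_limlow() actually emits.

```python
# Hypothetical mapping; the real LIM_FUNC in the OMML module may differ.
LIM_FUNC = {
    "lim": r"\lim",
    "max": r"\max",
    "min": r"\min",
    "argmax": r"\operatorname{argmax}",
    "argmin": r"\operatorname{argmin}",
}


def render_lower_limit(func_name: str, limit: str) -> str:
    """Render a function with a lower limit as LaTeX."""
    command = LIM_FUNC.get(func_name)
    if command is None:
        # Graceful fallback: keep the unknown operator name instead of
        # raising RuntimeError and aborting the whole conversion.
        command = r"\operatorname{" + func_name + "}"
    return f"{command}_{{{limit}}}"
```

An unknown operator such as `softmax` then degrades to a generic operator name in the output rather than failing the document.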

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* test(docx): regenerate ground truth files

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

---------

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2026-04-28 14:43:24 +02:00
Cesar Berrospi Ramis c455a65e36 feat(docx): add checkbox parsing support (#3349)
* feat(docx): add checkbox parsing support to MsWordDocumentBackend

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* refactor(docx): remove duplicate code in text element handling

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* docs(docx): update checkbox method docstrings

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* refactor(docx): use self._BLIP_NAMESPACES for w14 namespace in checkbox methods

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

---------

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2026-04-28 14:38:43 +02:00
Aatrey Sahay f2c03edb30 fix(html): preserve fragment-only anchor links during path resolution (#3262)
fix(html): preserve fragment-only anchor links during path resolution

Fragment-only hrefs (e.g. href="#section1") were resolved as filesystem
paths when source_uri was set, breaking internal document navigation.

Add '#' to the skip-resolution prefixes in _resolve_relative_path() so
fragment links pass through unchanged.
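
A simplified sketch of the skip-prefix idea, assuming a POSIX-style base directory. Only the '#' entry is taken from this commit message; the other prefixes, the `SKIP_PREFIXES` name, and the function signature are assumptions and do not mirror the real _resolve_relative_path().

```python
import posixpath

# Only "#" is confirmed by the commit message; the other entries are assumed.
SKIP_PREFIXES = ("http://", "https://", "data:", "file://", "/", "#")


def resolve_relative_path(href: str, base_dir: str) -> str:
    """Resolve href against base_dir unless it must pass through unchanged."""
    if href.startswith(SKIP_PREFIXES):
        # Fragment-only links such as "#section1" point inside the document
        # itself, so they are not filesystem paths.
        return href
    return posixpath.normpath(posixpath.join(base_dir, href))
```

With this check, `href="#section1"` is returned as-is, while `img/a.png` is still joined onto the base directory.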

Partially addresses #2929

Signed-off-by: aatrey56 <aatrey.sahay@gmail.com>
2026-04-28 10:28:23 +02:00
Christoph Auer a6a37ca895 fix: Make VLLM model_impl configurable (#3358)
* Add ResponseFormat.DOCLANG and parsing branch in VLM pipeline

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* fix: Remove bogus preamble from VLM chat template

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* fix: Add include_stop_str_in_output in allowed VLLM sampling params

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* allow vllm model_impl to be defined

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* make VLLM model_impl default to auto

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2026-04-24 09:53:23 +02:00
Christoph Auer 0f6f8d0bcd feat: Add ResponseFormat.DOCLANG and parsing branch in VLM pipeline (#3350)
* Add ResponseFormat.DOCLANG and parsing branch in VLM pipeline

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* fix: Remove bogus preamble from VLM chat template

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* fix: Add include_stop_str_in_output in allowed VLLM sampling params

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2026-04-24 08:36:30 +02:00
Cesar Berrospi Ramis c1dbac22c7 fix: strengthen input validation for METS-GBS processing (#3336)
* fix: prevent XXE and decompression bomb in METS-GBS processing

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* refactor: enforce resource limits for METS-GBS tar extraction

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

---------

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2026-04-23 10:17:39 +02:00
Panos Vagenas cd0cb69530 fix(html): refine image URL and size handling (#3348)
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2026-04-22 17:34:01 +02:00
Cesar Berrospi Ramis 2ddaa3be97 feat(docx): extract VML images with v:imagedata elements (#3343)
feat(docx): Extract VML images with v:imagedata elements

Add VML image support with EMF/WMF conversion and consolidate image handler code.

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2026-04-22 08:46:36 +02:00
Matvei Smirnov 3a3c8f68dd fix(pptx)!: assign pptx notes to ContentLayer.NOTES (#3341)
Signed-off-by: Matvei Smirnov <vdalekesmirnov@gmail.com>
2026-04-21 18:35:43 +02:00
Christoph Auer 075fa69491 fix(service): Add explicit usage exceeded exception handling (#3325)
* Add explicit usage exceeded error handling

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Make websocket more resilient

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2026-04-17 16:57:50 +02:00
EliSchwartz 1569e42f84 feat: implement GraniteVisionTableStructureModel for VLM-based table extraction (#3323)
Add a new table structure model using IBM Granite Vision to extract table
structure from document images via OTSL token generation.

Changes:
- Add `GraniteVisionTableStructureOptions` with configurable model repo,
  device, batch size, and crop padding options
- Implement `GraniteVisionTableStructureModel` that uses a VLM pipeline to
  generate OTSL tokens from cropped table images, then parses them into
  `TableData` with cells, rows, and columns
- Register the model in `table_structure_engines` alongside existing engines
- Add example script `docs/examples/granite_vision_table_structure.py`
- Add tests covering options, model enable/disable, OTSL parsing (including
  self-closing tags xcel/srow/ecel), and invalid-backend error handling
- Update model catalog docs and CI workflow accordingly

Signed-off-by: Eli Schwartz <eli.shw@gmail.com>
2026-04-17 11:02:20 +02:00
Smeet Agrawal 101233ebe2 fix(latex): fully unwrap deeply nested formatting macros (#3249)
* fix(latex): fully unwrap deeply nested formatting macros

Two related bugs when formatting macros are nested:

1. `\textcolor{color}{...}` extracted the color name alongside the
   text content because `_nodes_to_text` fell through to the generic
   else branch, which concatenates all arguments. E.g.
   `\section{\textcolor{blue}{\textbf{[SEP]}}}` produced heading text
   "blue [SEP]" instead of "[SEP]".

2. `\textsc`, `\textsf`, `\textrm`, `\textnormal`, `\mbox` and
   `\textcolor`/`\colorbox` are listed in MACROS_STRUCTURAL, so when
   encountered mid-sentence `_process_macro_node_inline` flushed the
   text buffer and called `_process_macro`, which creates a new doc
   node. This broke inline paragraphs into fragments.

Fix:
- Add explicit handlers for MACROS_TEXT_STYLE and textcolor/colorbox
  in `_process_macro_node_inline` (before the MACROS_STRUCTURAL flush
  path) so they are accumulated inline like MACROS_TEXT_FORMATTING.
- Add matching handlers in `_nodes_to_text` so colour names are
  skipped and only the text-content argument is returned.

Fixes #3207

Signed-off-by: Smeet Agrawal <smeetagrawal23@gmail.com>
Signed-off-by: Smeet23 <smeetagrawal2003@gmail.com>

* fix(latex): fully unwrap deeply nested formatting macros

Two related bugs when formatting macros are nested inside each other:

1. `\textcolor{color}{...}` extracted the color name alongside the
   text content because `_nodes_to_text` fell through to the generic
   else branch, which concatenates all arguments. E.g.
   `\section{\textcolor{blue}{\textbf{[SEP]}}}` produced heading text
   "blue [SEP]" instead of "[SEP]".

2. `\textsc`, `\textsf`, `\textrm`, `\textnormal`, `\mbox` and
   `\textcolor`/`\colorbox` are listed in MACROS_STRUCTURAL, so when
   encountered mid-sentence `_process_macro_node_inline` flushed the
   text buffer and called `_process_macro`, which creates a new doc
   node. This broke inline paragraphs into fragments.

Fix:
- Add MACROS_COLOR_INLINE constant for textcolor/colorbox to keep
  all macro classifications in one place (constants.py).
- Add explicit handlers for MACROS_TEXT_STYLE and MACROS_COLOR_INLINE
  in `_process_macro_node_inline` (before the MACROS_STRUCTURAL flush
  path) so they are accumulated inline like MACROS_TEXT_FORMATTING.
- Merge the identical MACROS_TEXT_FORMATTING and MACROS_TEXT_STYLE
  branches in `_nodes_to_text` into a single branch.
- Use argnlist[-1] instead of reversed() iteration for
  MACROS_COLOR_INLINE since the text content is always the last arg,
  consistent with _extract_macro_arg.
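
The argument-selection rule in the list above can be illustrated with a small sketch. It does not use the backend's real parser nodes: `Macro`, `nodes_to_text` and the contents of the three sets are simplified stand-ins assumed for illustration. Only the constant names and the "text content is the last argument" rule come from this commit message.

```python
from dataclasses import dataclass

# Hypothetical contents; the real sets live in constants.py and may differ.
MACROS_TEXT_FORMATTING = {"textbf", "textit", "emph"}
MACROS_TEXT_STYLE = {"textsc", "textsf", "textrm", "textnormal", "mbox"}
MACROS_COLOR_INLINE = {"textcolor", "colorbox"}


@dataclass
class Macro:
    """Toy macro node: a name plus a list of arguments (each a list of nodes)."""
    name: str
    args: list


def nodes_to_text(nodes):
    """Flatten a list of nodes (plain strings or Macro objects) to text."""
    parts = []
    for node in nodes:
        if isinstance(node, str):
            parts.append(node)
        elif node.name in MACROS_TEXT_FORMATTING | MACROS_TEXT_STYLE:
            # Single merged branch: unwrap the formatted content.
            parts.append(nodes_to_text(node.args[-1]))
        elif node.name in MACROS_COLOR_INLINE:
            # The colour name comes first; the text is always the last argument.
            parts.append(nodes_to_text(node.args[-1]))
        else:
            # Generic fallback: concatenate every argument.
            parts.append(" ".join(nodes_to_text(arg) for arg in node.args))
    return "".join(parts)


# \section{\textcolor{blue}{\textbf{[SEP]}}}
heading = [Macro("textcolor", [["blue"], [Macro("textbf", [["[SEP]"]])]])]
heading_text = nodes_to_text(heading)

# The same arguments routed through the generic branch leak the colour name.
leaky_text = nodes_to_text([Macro("unknownmacro", [["blue"], ["[SEP]"]])])
```

In this toy model the generic branch reproduces the "blue [SEP]" symptom, while the dedicated colour branch returns only the heading text.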

Fixes #3207

Signed-off-by: Smeet Agrawal <smeetagrawal23@gmail.com>
Signed-off-by: Smeet23 <smeetagrawal2003@gmail.com>

* refactor(latex): extract _macro_node_to_text to reduce complexity

Split the macro-handling branch of `_nodes_to_text` into a dedicated
`_macro_node_to_text` helper so that cyclomatic complexity stays within
the ruff C901 limit (was 31, now < 30 for both methods).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Smeet23 <smeetagrawal2003@gmail.com>

* fix(latex): migrate nested-formatting test to tests/test_latex/

Upstream reorganised all latex tests from tests/test_backend_latex.py
into tests/test_latex/. Move test_latex_nested_formatting_macros to
tests/test_latex/test_macros.py and fix ruff-reported style nits.

Signed-off-by: Smeet23 <smeetagrawal2003@gmail.com>

---------

Signed-off-by: Smeet Agrawal <smeetagrawal23@gmail.com>
Signed-off-by: Smeet23 <smeetagrawal2003@gmail.com>
Co-authored-by: Smeet Agrawal <smeetagrawal23@gmail.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 09:21:44 +02:00
Cesar Berrospi Ramis c7615123e6 fix(docx): handle inline formulas in list items (#3304)
* fix(docx): handle inline formulas in list items

Fix an issue where inline formulas in list items were ignored during conversion.
Add helper methods to eliminate code duplication.
Update the test data with list items containing inline equations.

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* fix(docx): collect element refs in _add_inline_equations_to_parent

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

---------

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2026-04-17 07:33:20 +02:00
Yarizakura 3bab6b4d38 fix(format): add MD fallback for .txt files in _guess_from_content (#3311)
fix(format): detect plain-text (.txt/.text/.qmd) as Markdown

Fixes docling-project/docling#3259

Previously, .txt files were rejected with "File format not allowed"
because _mime_from_extension intentionally returns None for .txt
(the extension is ambiguous between XML_USPTO and MD), and the
text/plain branch in _guess_from_content only checked for the USPTO
PATN\r\n marker before giving up.

This change lets _guess_from_content fall back to InputFormat.MD for
text/plain content when the file extension is in the MD format's
extension list (md/txt/text/qmd/rmd/Rmd). Crucially, the fallback is
extension-gated: unknown extensions like .xyz continue to return None
and raise ConversionError as before, preserving the expected behavior
in tests/test_invalid_input.py.

Changes:
- _guess_from_content accepts an optional ext parameter
- _guess_format now captures the extension into obj_ext and passes
  it into _guess_from_content
- test_input_doc.py asserts that .txt streams and files are detected
  as InputFormat.MD (not None)
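
The extension-gated fallback described above could look roughly like the following. This is a simplified sketch, not the actual implementation: the two-member `InputFormat` enum, the `MD_EXTENSIONS` set and the `guess_from_content` signature are assumptions made so the example is self-contained.

```python
from enum import Enum
from typing import Optional


class InputFormat(Enum):
    # Reduced stand-in for the real enum.
    MD = "md"
    XML_USPTO = "xml_uspto"


# Extensions listed in the commit message for the MD format.
MD_EXTENSIONS = {"md", "txt", "text", "qmd", "rmd", "Rmd"}


def guess_from_content(
    content: bytes, mime: str, ext: Optional[str] = None
) -> Optional[InputFormat]:
    """Guess the format of text/plain content, falling back to MD by extension."""
    if mime != "text/plain":
        return None
    # The USPTO marker is checked first, as before the change.
    if content.startswith(b"PATN\r\n"):
        return InputFormat.XML_USPTO
    # New: fall back to Markdown only for known MD extensions.
    if ext in MD_EXTENSIONS:
        return InputFormat.MD
    # Unknown extensions still yield None, so the caller can reject the file.
    return None
```

Gating on the extension keeps the rejection path for unknown extensions such as .xyz intact, which is what the invalid-input tests rely on.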

Signed-off-by: CrepuscularIRIS <serenitygp@qq.com>
2026-04-17 07:29:20 +02:00
pateltejas 043ed2dd3d fix(pptx): handle NotImplementedError from shape.shape_type (#3309)
* fix(pptx): handle NotImplementedError from shape.shape_type

python-pptx raises NotImplementedError from Shape.shape_type for
<p:sp> elements that aren't placeholders, autoshapes, textboxes, or
freeforms (e.g. shapes with empty <p:spPr> from Google Slides exports,
LibreOffice, or Keynote). handle_groups() and handle_shapes() access
shape_type without catching this, crashing the entire conversion.

Add a _safe_shape_type() helper that returns None on
NotImplementedError, so unrecognized shapes skip only the GROUP
recursion and PICTURE extraction while text and table extraction
proceed normally.
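
A minimal sketch of such a helper is shown below. The two shape classes are fakes written for this example (they only mimic the `shape_type` behaviour described above), and the helper body is an assumption about how `_safe_shape_type()` might be written.

```python
def safe_shape_type(shape):
    """Return shape.shape_type, or None if the shape type is unrecognized."""
    try:
        return shape.shape_type
    except NotImplementedError:
        return None


class FakeUnrecognizedShape:
    """Fake shape whose shape_type property raises, like an empty <p:spPr>."""

    @property
    def shape_type(self):
        raise NotImplementedError("unrecognized shape type")


class FakeGroupShape:
    """Fake shape with a known type."""

    shape_type = "GROUP"


unknown = safe_shape_type(FakeUnrecognizedShape())
known = safe_shape_type(FakeGroupShape())
```

Because `None` never equals the GROUP or PICTURE type, a caller comparing against those values simply skips the type-specific branches for unrecognized shapes.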

Fixes #3308

Signed-off-by: Tejas Patel <tejas226@hotmail.com>

* Fix lint

Signed-off-by: Tejas Patel <tejas226@hotmail.com>

---------

Signed-off-by: Tejas Patel <tejas226@hotmail.com>
2026-04-17 06:59:48 +02:00
geoHeil 251c8b217a fix(ocr): align RapidOCR english assets with 3.8 mobile models (#3291)
* fix(ocr): support language selection for RapidOCR engine

Allows specifying 'english' or 'chinese' via the --ocr-lang flag and automatically downloads the correct models.

Signed-off-by: DevAbdullah90 <abdullahkashif12b3@gmail.com>

* fix(ocr): fix linting and add unit tests for RapidOCR language selection

Signed-off-by: DevAbdullah90 <abdullahkashif12b3@gmail.com>

* fix(ocr): align RapidOCR english assets with 3.8 mobile models

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* fix(ocr): restore RapidOCR default model compatibility

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* fix(examples): disable OCR in code formula comparison

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

---------

Signed-off-by: DevAbdullah90 <abdullahkashif12b3@gmail.com>
Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>
Co-authored-by: DevAbdullah90 <abdullahkashif12b3@gmail.com>
2026-04-15 12:16:41 +02:00
Cesar Berrospi Ramis 740c386730 fix(docx): isolate list state in table cells (#3294)
* fix(docx): isolate list state in table cells

Lists with the same numId in different table cells were incorrectly
merged. Added context manager to isolate list state during cell
processing. Includes test cases and updated ground truth files.
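
The isolation idea can be sketched with a context manager that swaps in fresh list state and restores the previous state on exit. The class and attribute names here (`ListTracker`, `counters`, `isolated`) are hypothetical and are not taken from the backend.

```python
from contextlib import contextmanager


class ListTracker:
    """Toy tracker that numbers list items per numId."""

    def __init__(self):
        self.counters = {}

    def next_number(self, num_id):
        self.counters[num_id] = self.counters.get(num_id, 0) + 1
        return self.counters[num_id]

    @contextmanager
    def isolated(self):
        """Run a block with empty list state, then restore the outer state."""
        saved = self.counters
        self.counters = {}
        try:
            yield self
        finally:
            self.counters = saved


tracker = ListTracker()
numbers_per_cell = []
for _cell in range(2):
    # Each table cell starts from a clean slate, even with the same numId.
    with tracker.isolated():
        numbers_per_cell.append([tracker.next_number(7), tracker.next_number(7)])
```

Without the `isolated()` block, the second cell would continue at 3 and 4, which corresponds to the merged-list symptom.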

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* style(docx): modernize type hints to use PEP 604 union syntax

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

---------

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2026-04-15 09:51:37 +02:00
Hemanth Battu 5b84911a4c fix(pipeline): prevent cache miss due to pipeline options mutation during chart extraction (#3300)
* fix(pipeline): prevent cache miss due to pipeline options mutation

Use a local variable instead of mutating the shared options object, which prevents the cache miss.
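
To illustrate why mutation can cause a cache miss, here is a hedged sketch. The option names, the `cache_key` function and the assumption that the cache is keyed on serialized options are all hypothetical; the commit message only states that a shared options object was mutated.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Options:
    # Hypothetical flags, for illustration only.
    do_chart_extraction: bool = False
    do_picture_classification: bool = False


def cache_key(options: Options) -> str:
    """Assumed cache key: a stable serialization of the options."""
    return json.dumps(asdict(options), sort_keys=True)


def needs_classifier_mutating(options: Options) -> bool:
    # Problematic pattern: writes the derived value back into shared options.
    if options.do_chart_extraction:
        options.do_picture_classification = True
    return options.do_picture_classification


def needs_classifier_local(options: Options) -> bool:
    # Fixed pattern: derive the value locally and leave the options untouched.
    return options.do_picture_classification or options.do_chart_extraction


opts = Options(do_chart_extraction=True)
key_before = cache_key(opts)
local_result = needs_classifier_local(opts)
key_after_local = cache_key(opts)
mutating_result = needs_classifier_mutating(opts)
key_after_mutation = cache_key(opts)
```

Both functions return the same answer, but only the mutating one changes the key that a later lookup would compute.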

Signed-off-by: Hemanth Battu <hbattu@ibm.com>

* test(pipeline): add regression test for chart extraction cache miss

Signed-off-by: Hemanth Battu <hbattu@ibm.com>

---------

Signed-off-by: Hemanth Battu <hbattu@ibm.com>
Co-authored-by: Hemanth Battu <hbattu@ibm.com>
2026-04-15 09:39:56 +02:00
Rayabharapu Trinai a15c16e19f feat: explicit TikZ environment handling in LaTeX backend (#3187)
* feat: explicitly handle tikzpicture as atomic figure nodes

Signed-off-by: StealthTensor <rayabharaputrinai@gmail.com>

* fix: resolve copilot review on recursion depth and text labels

Signed-off-by: StealthTensor <rayabharaputrinai@gmail.com>

* test: cover tikz fallback label and depth guard

Signed-off-by: StealthTensor <rayabharaputrinai@gmail.com>

* fix: address copilot tikz parent and end-tag edge cases

Signed-off-by: StealthTensor <rayabharaputrinai@gmail.com>

* feat(latex): store TikZ source in Picture meta code field

Handle tikzpicture environments as real Picture nodes via add_picture and
store raw TikZ in PictureMeta.code (CodeMetaField) with language metadata.

Also update LaTeX backend tests to assert picture meta code behavior and
bump docling-core minimum version to 2.71.0 (includes PictureMeta.code).

Signed-off-by: StealthTensor <rayabharaputrinai@gmail.com>

* fix(latex): make TikZ code metadata compatible across docling-core versions

Avoid hard import of CodeMetaField to prevent import-time crashes when older
docling-core is resolved. Use PictureMeta.code when available and retain legacy
CODE fallback otherwise. Update TikZ tests for dual-path behavior and refresh
uv.lock to pin docling-core 2.71.0.

Signed-off-by: StealthTensor <rayabharaputrinai@gmail.com>

* refactor(latex): simplify TikZ parent assignment

Signed-off-by: StealthTensor <rayabharaputrinai@gmail.com>

* fix(latex): remove legacy PictureMeta guard and use TIKZ code metadata

Signed-off-by: StealthTensor <rayabharaputrinai@gmail.com>

* fix: require docling-core 2.73.0 for TIKZ code label

Signed-off-by: StealthTensor <rayabharaputrinai@gmail.com>

* test: remove legacy tikz code-meta compatibility branch

Signed-off-by: StealthTensor <rayabharaputrinai@gmail.com>

---------

Signed-off-by: StealthTensor <rayabharaputrinai@gmail.com>
2026-04-15 05:47:20 +02:00
Christoph Auer 42157a3e10 feat(service): Establish client SDK for docling serve (#3264)
* Move client SDK to docling

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add client SDK examples

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Mark client SDK as experimental

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2026-04-13 14:54:06 +02:00
geoHeil 6b257ece33 fix(ocr): support rapidocr 3.8 mobile model naming (#3277)
Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>
2026-04-13 11:44:33 +02:00
Aditya Sasidhar 60fc517af0 chore: Condensing the latex test backend into multiple files (#3281)
chore: Condensing the latex test backend into multiple files

Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>
2026-04-13 10:04:22 +02:00
geoHeil 27d3cf490f fix(vlm): add explicit MLX support for OCR presets (#3272)
* fix(vlm): add explicit mlx support for OCR presets

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* build(vlm): require mlx-vlm 0.4.3 for OCR presets

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* fix(vlm): use dedicated mlx falcon checkpoint

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

---------

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>
2026-04-13 07:27:38 +02:00
Said Gürbüz a6aeddf9e2 fix(markdown): normalize repeated leading dash markers (#3286)
Signed-off-by: Saidgurbuz <said.guerbuez@inf.ethz.ch>
2026-04-13 07:26:11 +02:00
Said Gürbüz 6cb1bc0c02 fix(docx): preserve inline SDT references (#3280)
Signed-off-by: Saidgurbuz <said.guerbuez@inf.ethz.ch>
2026-04-13 06:55:25 +02:00
Said Gürbüz e4fd93742e fix(pptx): respect page_range during conversion (#3282)
fix(pptx): honor page_range during conversion

Signed-off-by: Saidgurbuz <said.guerbuez@inf.ethz.ch>
2026-04-13 06:52:35 +02:00
geoHeil 9970d1ef94 feat(vlm): add Nanonets OCR2 onboarding (#3274)
* feat(vlm): add nanonets ocr2 onboarding

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* feat(vlm): add vLLM and API runtimes for Nanonets-OCR2

Extend the Nanonets-OCR2 preset with vLLM + remote API paths so all
standard docling runtimes (Transformers, MLX, vLLM, API, LM Studio,
OpenAI-compatible) work out of the box. Drop the restricted
supported_engines set to match the GLM-OCR / LightOnOCR / Falcon-OCR
pattern, add top-level torch_dtype on the Transformers override, and
register NANONETS_OCR2_VLLM / NANONETS_OCR2_VLLM_API /
NANONETS_OCR2_LMSTUDIO_API legacy specs plus VlmModelType enum entries.

Folds in the remote-API scope that was on the superseded PR #3275.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

---------

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-13 06:51:35 +02:00
Said Gürbüz 9c3ab934d6 fix(vlm): support tool-calling API responses (#3271)
Signed-off-by: Saidgurbuz <said.guerbuez@inf.ethz.ch>
2026-04-12 09:02:06 +02:00
Smeet Agrawal ab5254df7c fix(pdf): extend ligature map with Dutch IJ and PUA glyph U+F0A0 (#3254)
* fix(pdf): extend ligature map with Dutch IJ and PUA glyph U+F0A0

Add two entries missing from the PDF text sanitizer's ligature map:
- U+0132 (Ĳ) → "IJ" and U+0133 (ĳ) → "ij": Latin capital/small ligature
  IJ, used in Dutch (e.g. IJssel; Ĳ becomes IJ at the start of words).
- U+F0A0 → "": a Private-Use Area glyph emitted by some PDF fonts as a
  spurious character with no textual meaning; it is silently discarded.

The _LIGATURE_RE pattern is updated to match these new code points.
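
As a rough illustration, a ligature map with a matching regex can be applied as follows. Only the three code points named above are included, and `LIGATURES`, `LIGATURE_RE` and `sanitize` are illustrative names rather than the sanitizer's actual identifiers.

```python
import re

# Replacement table limited to the entries described in this commit.
LIGATURES = {
    "\u0132": "IJ",  # Latin capital ligature IJ
    "\u0133": "ij",  # Latin small ligature ij
    "\uf0a0": "",    # Private-Use Area glyph, discarded
}

# One alternation over all keys, escaped defensively.
LIGATURE_RE = re.compile("|".join(re.escape(ch) for ch in LIGATURES))


def sanitize(text: str) -> str:
    """Replace each mapped code point with its plain-text expansion."""
    return LIGATURE_RE.sub(lambda match: LIGATURES[match.group()], text)
```

Building the pattern from the map's keys keeps the two in sync, so adding an entry does not require a separate regex edit.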

Closes #2882

Signed-off-by: Smeet Agrawal <smeetagrawal23@gmail.com>

* style: apply ruff formatter fixes

Signed-off-by: Smeet Agrawal <smeetagrawal23@gmail.com>

* fix: remove accidentally included msexcel tests from ligature branch

Signed-off-by: Smeet Agrawal <smeetagrawal23@gmail.com>

---------

Signed-off-by: Smeet Agrawal <smeetagrawal23@gmail.com>
Co-authored-by: Smeet Agrawal <smeetagrawal23@gmail.com>
2026-04-12 07:37:11 +02:00
vwe-ibm 9b4b67b23e feat: add signature/stamp html block to DC document (#3251)
2026-04-09 06:13:22 +02:00
Smeet Agrawal 61809252ec fix(latex): discard arguments of filtered spacing commands (#3245)
Commands like \vspace{-1mm} and \hspace{0.2cm} were being filtered
at the command level but their argument values were leaking through as
plain text nodes. Ensure that when a spacing/ignored command is
encountered, its arguments are also suppressed.

- Add vspace, hspace, vspace*, hspace*, addvspace to MACROS_SPACING
- Guard against spacing/ignored macros in _process_macro_node_inline
  so their brace arguments are not extracted as inline text
- Guard against spacing/ignored macros in _nodes_to_text so dimension
  values do not leak when processing footnotes, captions, etc.
- Update ground truth files to reflect corrected output
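
The guard described in the list above can be sketched on a toy node model, where a macro is a `(name, args)` tuple and each argument is a list of nodes. This is an assumption-laden simplification: the real backend works on parser nodes, and only the MACROS_SPACING names are taken from this commit.

```python
MACROS_SPACING = {"vspace", "hspace", "vspace*", "hspace*", "addvspace"}


def nodes_to_text(nodes):
    """Flatten nodes to text, dropping spacing macros together with their arguments."""
    parts = []
    for node in nodes:
        if isinstance(node, str):
            parts.append(node)
            continue
        name, args = node
        if name in MACROS_SPACING:
            # Skip the macro and its arguments, so "-1mm" never reaches the output.
            continue
        parts.extend(nodes_to_text(arg) for arg in args)
    return "".join(parts)


# Roughly: Results\vspace{-1mm} overview
caption = ["Results", ("vspace", [["-1mm"]]), " overview"]
caption_text = nodes_to_text(caption)
```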

Fixes #3240

Signed-off-by: Smeet Agrawal <smeetagrawal23@gmail.com>
Co-authored-by: Smeet Agrawal <smeetagrawal23@gmail.com>
2026-04-08 11:19:56 +02:00
Aatrey Sahay 6699642fa0 feat(vlm): add PARTIAL_SUCCESS status for VLM pipeline pages (#3215)
* feat(vlm): add PARTIAL_SUCCESS status for VLM pipeline pages

Override _determine_status in VlmPipeline to detect partial failures
from VLM inference. Pages with truncated output (LENGTH) or filtered
content (CONTENT_FILTERED) now correctly report PARTIAL_SUCCESS with
descriptive error messages, matching the pattern used by
ExtractionVlmPipeline and AsrPipeline.

Includes @override decorator per coding conventions and 7 unit tests
covering all VlmStopReason variants.

Closes docling-project#2583
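
The status decision could be sketched as below. The enums are reduced stand-ins defined only for this example, and `determine_status` with its per-page input is a hypothetical shape, not the pipeline's actual method signature.

```python
from enum import Enum


class VlmStopReason(Enum):
    # Stand-in enum; the real one may have more members.
    END_OF_SEQUENCE = "end_of_sequence"
    LENGTH = "length"
    CONTENT_FILTERED = "content_filtered"


class ConversionStatus(Enum):
    SUCCESS = "success"
    PARTIAL_SUCCESS = "partial_success"


# Stop reasons that indicate an incomplete page, with an error description.
PARTIAL_REASONS = {
    VlmStopReason.LENGTH: "output truncated at the generation length limit",
    VlmStopReason.CONTENT_FILTERED: "output removed by a content filter",
}


def determine_status(page_stop_reasons):
    """Return (status, errors) given one stop reason per page."""
    errors = [
        f"Page {page_no}: {PARTIAL_REASONS[reason]}"
        for page_no, reason in enumerate(page_stop_reasons, start=1)
        if reason in PARTIAL_REASONS
    ]
    status = ConversionStatus.PARTIAL_SUCCESS if errors else ConversionStatus.SUCCESS
    return status, errors


status, errors = determine_status(
    [VlmStopReason.END_OF_SEQUENCE, VlmStopReason.LENGTH]
)
```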

Signed-off-by: aatrey56 <aatrey.sahay@gmail.com>

* style(vlm): fix import ordering for typing_extensions

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: aatrey56 <aatrey.sahay@gmail.com>

---------

Signed-off-by: aatrey56 <aatrey.sahay@gmail.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 09:09:56 +02:00
geoHeil d0e19be14f feat: add support for Falcon-OCR (#3237)
* feat: add support for Falcon-OCR (tiiuae/Falcon-OCR)

Add Falcon-OCR as a VLM convert model for document OCR and markdown
conversion. Falcon-OCR is a 0.3B parameter model from TII that supports
full-page text extraction, formula (LaTeX), and table (HTML) recognition.

Changes:
- Add VLM_CONVERT_FALCON_OCR preset (stage_model_specs.py)
- Add legacy specs: FALCON_OCR_TRANSFORMERS, FALCON_OCR_VLLM,
  FALCON_OCR_VLLM_API (vlm_model_specs.py)
- Register preset with VlmConvertOptions (pipeline_options.py)
- Fix trust_remote_code propagation from VlmModelSpec to engine options
  in auto_inline_engine.py and from_preset(), which also fixes the
  existing Phi-4 preset
- Add test suite (tests/test_falcon_ocr_vlm.py)

Closes: docling-project/docling#3236

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* style: apply ruff formatting

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* fix: align Falcon-OCR presets with review feedback

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

---------

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-07 14:41:49 +02:00
geoHeil f2affd7614 feat: add support for LightOnOCR-2-1B (#3213)
feat: add support for LightOnOCR-2-1B VLM model

Add LightOnOCR-2-1B as a new VLM model option for OCR and markdown conversion.
Includes transformers, vLLM, and API configurations, stage model preset,
and tests.

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>
2026-04-02 18:34:00 +02:00