Commit Graph

6 Commits

Author SHA1 Message Date
geoHeil eb4724ee4c ci: prototype tach-based modular skipping (#3333)
* ci: prototype tach-based modular skipping

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: modularize ubuntu setup and refine gating

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: adopt metaxy-inspired governance helpers

- replace custom aggregate check with re-actors/alls-green

- set FORCE_JAVASCRIPT_ACTIONS_TO_NODE24 on every workflow

- keep PR concurrency alive when the graphite:merge label is present

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: tune checks and pin action versions

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: split CI suites and heavy examples

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* DCO Remediation Commit for Georg Heiler <georg.kf.heiler@gmail.com>

I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: ecaa4777886157d5c2a7b3893c3a820983089dbf
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: d15416f3ca94ac97af2a8317cd6404208db9d896

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: sharpen tach graph and per-suite path filters

- Split docling.pipeline into per-pipeline tach modules
  (asr, vlm, standard_pdf, threaded_standard_pdf, legacy_standard_pdf,
  extraction_vlm, base, base_extraction, simple) so pytest --tach-base
  impact analysis can attribute changes to a specific pipeline rather
  than the whole package.
- Split the asr- and vlm-specific docling.datamodel option files
  (asr_model_specs, pipeline_options_asr_model, vlm_engine_options,
  vlm_model_specs, pipeline_options_vlm_model, layout_model_specs,
  stage_model_specs, backend_options) into their own tach modules so
  a narrow spec/options change no longer marks the full datamodel as
  impacted.
- Narrow the per-suite pipeline path filters in checks.yml to the
  concrete pipeline files relevant to each suite, so editing
  vlm_pipeline.py only triggers the vlm matrix cell and editing
  asr_pipeline.py only the asr one.
- Rekey the model cache in setup-ubuntu-ci to include runner.os and
  hashFiles(uv.lock, pyproject.toml), with ordered restore-keys
  fallbacks, so a lockfile bump no longer silently reuses a stale
  cache.
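
Illustration of the cache rekeying idea above, as a minimal Python
sketch rather than the actual workflow YAML. The function name, the
"models" prefix, the 16-character digest and the two fallback prefixes
are assumptions for illustration; GitHub's hashFiles() computes its
digest differently, but the effect is the same: any change to the
hashed files yields a new primary key, while the ordered fallbacks
still allow a partial restore.

```python
import hashlib
from pathlib import Path


def cache_keys(
    runner_os: str, files: list[Path], prefix: str = "models"
) -> tuple[str, list[str]]:
    """Return (primary_key, restore_keys) for a dependency-sensitive cache."""
    digest = hashlib.sha256()
    # Hash file contents in a stable order so the key is deterministic.
    for path in sorted(files):
        digest.update(path.read_bytes())
    primary = f"{prefix}-{runner_os}-{digest.hexdigest()[:16]}"
    # Hypothetical fallbacks, ordered from most to least specific.
    restore = [f"{prefix}-{runner_os}-", f"{prefix}-"]
    return primary, restore
```

Calling cache_keys("Linux", [Path("uv.lock"), Path("pyproject.toml")])
before and after a lockfile bump gives two different primary keys that
share the same fallback prefixes.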

Metaxy parity note: layered tach enforcement (layer = "...") is
blocked by existing backend<->datamodel and utils<->stages cycles;
depot runners, nox dynamic matrices, devenv/nix, dprint and ty are
not applicable to docling's stack. All pinned action SHAs are on
their latest release as of this commit.

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: introduce pipeline and orchestration tach layers

Earlier notes claimed layers were blocked. That was only true for the
cyclic core (backend<->datamodel, utils<->stages). The boundary
*above* core is clean:

- No module under docling/backend, docling/datamodel, docling/models,
  docling/utils, docling/exceptions, or docling/chunking imports
  anything from docling.pipeline (verified by grep).
- No module anywhere in docling/ imports from docling.cli,
  docling.document_converter, docling.document_extractor, or
  docling.service_client (also verified).
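
The grep verification above could also be scripted. The following is
a rough sketch, not what was actually run for this commit: both
function names are hypothetical, and relative imports are ignored, so
it only catches absolute imports such as "from docling.pipeline import
...".

```python
import ast
from pathlib import Path


def imported_modules(source: str) -> set[str]:
    """Collect absolute module names imported by a piece of Python source."""
    found: set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            found.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            found.add(node.module)
            # Also record "pkg.name" so "from docling import pipeline" is caught.
            found.update(f"{node.module}.{alias.name}" for alias in node.names)
    return found


def forbidden_imports(root: Path, forbidden_prefix: str) -> list[tuple[str, str]]:
    """List (file, module) pairs under root that import forbidden_prefix."""
    hits = []
    for path in sorted(root.rglob("*.py")):
        source = path.read_text(encoding="utf-8")
        for module in sorted(imported_modules(source)):
            if module == forbidden_prefix or module.startswith(forbidden_prefix + "."):
                hits.append((str(path), module))
    return hits
```

An empty result from forbidden_imports(Path("docling/datamodel"),
"docling.pipeline") would correspond to the first claim in the list.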

So we can introduce two real layers on top of the cyclic core:

- "pipeline"      — docling.pipeline and all nine concrete pipelines
                     (base, simple, base_extraction, asr, vlm,
                     extraction_vlm, standard_pdf,
                     threaded_standard_pdf, legacy_standard_pdf).
- "orchestration" — docling.cli, docling.document_converter,
                     docling.document_extractor, and
                     docling.experimental.pipeline.

Unlayered modules stay "below" both layers (tach allows them to be
depended on freely) and continue to carry the declared-but-cyclic
backend<->datamodel and utils<->stages edges.

A VLM-only layer was explored but rejected: only
docling.pipeline.vlm_pipeline and docling.pipeline.extraction_vlm_pipeline
could be cleanly layered as "vlm", because the matching datamodel
options (pipeline_options_vlm_model, vlm_engine_options,
vlm_model_specs) and model stages (vlm_convert, vlm_pipeline_models)
sit inside the datamodel/models cycle and cannot be promoted to a
higher layer without first breaking that cycle. Layering only the
two pipeline files is not worth the extra config.

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: expand tach layers to entrypoints/pipeline/models/core

Follow-up to the two-layer attempt. After verifying via grep that
nothing in datamodel/utils/backend imports from
docling.models.{extraction,factories,plugins,vlm_pipeline_models}
or from the "upper" stages (page_assemble, page_preprocessing,
reading_order, picture_description, vlm_convert), those nine
modules can be promoted out of the cyclic core into a dedicated
"models" layer.

The resulting order (highest first):

- entrypoints — cli, document_converter, document_extractor,
                experimental.pipeline
- pipeline    — docling.pipeline + the nine concrete pipelines
- models      — model factories, extraction, plugins,
                vlm_pipeline_models, and the five "upper" stages
- core        — datamodel*, backend*, utils, exceptions, chunking,
                models (base), models.utils, inference_engines.*,
                the six "core stages" that utils cycles with
                (chart_extraction, code_formula, layout, ocr,
                picture_classifier, table_structure), and the
                experimental.* and service_client modules

Rename the previous "orchestration" layer to "entrypoints" to
match the common docling vocabulary. Every module now carries an
explicit layer tag instead of relying on implicit unlayered
behaviour, so future additions must pick a layer deliberately.

A VLM layer, a stand-alone inference-engines layer, and separating
datamodel from backend all remain blocked by the bidirectional
backend<->datamodel and utils<->core-stages edges; those need a
code-level refactor first.
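
For illustration only, the layer rule described above (a module may
import from its own or a lower layer, never from a higher one) can be
sketched in a few lines of Python. This is not how tach implements
it; the function name and the sample data are assumptions, and the
datamodel -> pipeline edge in the example is deliberately invented to
show what a violation looks like.

```python
# Layer order from this commit, highest first.
LAYERS = ["entrypoints", "pipeline", "models", "core"]


def layer_violations(
    module_layer: dict[str, str], imports: dict[str, set[str]]
) -> list[tuple[str, str]]:
    """Return (importer, target) pairs where a lower layer imports a higher one."""
    rank = {name: index for index, name in enumerate(LAYERS)}
    bad = []
    for importer, targets in sorted(imports.items()):
        for target in sorted(targets):
            # A smaller index means a higher layer; importing upward is forbidden.
            if rank[module_layer[target]] < rank[module_layer[importer]]:
                bad.append((importer, target))
    return bad


module_layer = {
    "docling.cli": "entrypoints",
    "docling.pipeline.vlm_pipeline": "pipeline",
    "docling.datamodel": "core",
}
imports = {
    "docling.cli": {"docling.pipeline.vlm_pipeline"},        # downward: allowed
    "docling.datamodel": {"docling.pipeline.vlm_pipeline"},  # invented violation
}
print(layer_violations(module_layer, imports))
# -> [('docling.datamodel', 'docling.pipeline.vlm_pipeline')]
```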

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: refine tach client and foundation layers

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: add optional windows and macos smoke lanes

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: normalize reusable workflow boolean inputs

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: replace external all-green action

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: use org-allowed setup-uv action

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: install compiler toolchain for ML tests

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* DCO Remediation Commit for Georg Heiler <georg.kf.heiler@gmail.com>

I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: bb714afb42cd1b29ab073a7f59cc72874ff2fdcd

I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: a1f2761da8f72bfed636bd571ebf77b42c8771b6

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* DCO Remediation Commit for Georg Heiler <georg.kf.heiler@gmail.com>

I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: cc6551b54c5bf4815ae9cd57cf43a98928a74be0

I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: b21b0e7ca12b552dbdd54fac1bda113719c286f1

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: simplify ML pytest suite patterns

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: gate heavy examples on label, add job timeouts

- ci-heavy-examples: run only on main push, schedule, workflow_dispatch,
  or when a PR is labeled tests:full / tests:heavy-examples. Drops the
  path-based auto-trigger so that common edits to pyproject.toml,
  uv.lock, or .github/actions do not kick off the 45-60 min matrix on
  every PR push. Collapses the changes job into a job-level if gate and
  adds timeout-minutes: 90.
- checks.yml: add timeout-minutes to every job so stuck runners cannot
  burn the full 6h default.
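
The ci-heavy-examples gate above boils down to a small predicate. The
sketch below restates it in Python for readability; the real gate is
a job-level if expression in the workflow, and the function and
parameter names here are hypothetical. The pull_request event name
and the refs/heads/main ref format are assumed from standard GitHub
Actions conventions.

```python
from collections.abc import Iterable

# Labels that opt a PR into the heavy examples matrix.
HEAVY_LABELS = {"tests:full", "tests:heavy-examples"}


def should_run_heavy_examples(
    event_name: str, ref: str = "", labels: Iterable[str] = ()
) -> bool:
    """Decide whether the heavy examples matrix should run for an event."""
    if event_name in {"schedule", "workflow_dispatch"}:
        return True
    if event_name == "push":
        # Only pushes to main qualify; other branches are skipped.
        return ref == "refs/heads/main"
    if event_name == "pull_request":
        # Path changes no longer matter; only an explicit label opts in.
        return bool(HEAVY_LABELS & set(labels))
    return False
```

With this shape, a PR that only touches uv.lock returns False unless
one of the two labels is attached.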

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: tolerate cancelled allowed-skip jobs in check aggregator

Intentional cancellations (manual cancel, concurrency replacement) on
jobs that are already in ALLOWED_SKIPS should not mark the overall
workflow red. Treat `cancelled` the same as `skipped` when the job is
listed as an allowed skip; any unexpected cancellation of a required
job still fails.
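
A minimal sketch of the rule above, assuming the aggregator receives
a mapping of job name to result string (success, failure, cancelled
or skipped, as GitHub reports them). The function name and signature
are hypothetical and do not mirror the actual script.

```python
def all_required_green(results: dict[str, str], allowed_skips: set[str]) -> bool:
    """Return True when every job succeeded or was tolerably skipped/cancelled."""
    for job, result in results.items():
        if result == "success":
            continue
        # cancelled is treated like skipped, but only for allowed-skip jobs.
        if result in {"skipped", "cancelled"} and job in allowed_skips:
            continue
        return False
    return True
```

A cancelled job outside allowed_skips, or any failure, still makes
the aggregate check fail.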

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* docs: make minimal vlm example portable

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* DCO Remediation Commit for Georg Heiler <georg.kf.heiler@gmail.com>

I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 2135051da3ed73d4b8a9130f584f40b56155af1a

I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 4f6d1d7960f7418d0cde6425ae61538da84fda40

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: install workspace packages in CI syncs

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* DCO Remediation Commit for Georg Heiler <georg.kf.heiler@gmail.com>

I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 492fa9883d4de6d98ebcb40fa863eafe2facff3c

I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 3eefae71643f9ca3df0264690c0c6eb1f67f06f1

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* DCO Remediation Commit for Georg Heiler <georg.kf.heiler@gmail.com>

I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: fe8c9689a0ee94f36eb826da8e2177ef87404f5e

I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: eabdd24a6734ec873cdaac857718aef2473677e7

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: remove unused graphite concurrency exception

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: document test labels and gate cross-platform lanes

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: select ml tests with pytest markers

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: fix marker selector typing

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: simplify ml suite scheduling

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: mark cross-platform smoke tests

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: reuse test trigger for ml matrix

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: tighten full ci aggregation

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: share required job result check

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

---------

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 14:15:35 +02:00
Mahafuzur Rahman dbab30e92c fix: formula conversion with page_range param set (#1791)
When the page_range param is set for formula conversion,
the conversion fails with a "list index out of range" error.

Included tests to validate that the fix works.

Signed-off-by: Masum <masumsofts@yahoo.com>
2025-06-17 13:58:45 +02:00
Michele Dolfi 5458a88464 ci: add coverage and ruff (#1383)
* add coverage calculation and push

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* new codecov version and usage of token

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* enable ruff formatter instead of black and isort

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* apply ruff lint fixes

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* apply ruff unsafe fixes

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add removed imports

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* runs 1 on linter issues

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* finalize linter fixes

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* Update pyproject.toml

Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
2025-04-14 18:01:26 +02:00
Christoph Auer 3960b199d6 feat: Add DoclingParseV4 backend, using high-level docling-parse API (#905)
* Add DoclingParseV3 backend implementation

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Use docling-core with docling-parse types

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fixes and test updates

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fix streams

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fix streams

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Reset tests

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* update test cases

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* update test units

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add back DoclingParse v1 backend, pipeline options

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update locks

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* fix: update docling-core to 2.22.0

Update dependency library docling-core to latest release 2.22.0
Fix regression tests and ground truth files

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

* Ground-truth files updated

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update tests, use TextCell.from_ocr property

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Text fixes, new test data

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Rename docling backend to v4

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Test all backends, fixes

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Reset all tests to use docling-parse v1 for now

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fixes for DPv4 backend init, better test coverage

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* test_input_doc use default backend

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
2025-03-18 10:38:19 +01:00
Michele Dolfi 9114ada7bc fix: Test cases for RTL programmatic PDFs and fixes for the formula model (#903)
fix: Support for RTL programmatic documents
fix(parser): detect and handle rotated pages
fix(parser): fix bug causing duplicated text
fix(formula): improve stopping criteria
chore: update lock file
fix: temporarily constrain beautifulsoup


* switch to code formula model v1.0.1 and new test pdf

Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>

* switch to code formula model v1.0.1 and new test pdf

Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>

* cleaned up the data folder in the tests

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* switch to code formula model v1.0.1 and new test pdf

Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>

* added three test-files for right-to-left

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* fix black

Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>

* added new gt for test_e2e_conversion

Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>

* added new gt for test_e2e_conversion

Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>

* Add code to expose text direction of cell

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* new test file

Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>

* update lock

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* fix mypy reports

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* fix example filepaths

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add test data results

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* pin wheel of latest docling-parse release

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* use latest docling-core

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* remove debugging code

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* fix path to files in example

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* Revert unwanted RTL additions

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fix test data paths in examples

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>
Co-authored-by: Peter Staar <taa@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
2025-02-07 08:43:31 +01:00
Matteo 3213b247ad feat: Code and equation model for PDF and code blocks in markdown (#752)
* propagated changes for new CodeItem class

Signed-off-by: Matteo Omenetti <omenetti.matteo@gmail.com>

* Rebased branch on latest main. changes for CodeItem

Signed-off-by: Matteo Omenetti <omenetti.matteo@gmail.com>

* removed unused files

Signed-off-by: Matteo Omenetti <omenetti.matteo@gmail.com>

* chore: update lockfile

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* pin latest docling-core

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* update docling-core pinning

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* pin docling-core

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* use new add_code in backends and update typing in MD backend

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* added if statement for backend

Signed-off-by: Matteo Omenetti <omenetti.matteo@gmail.com>

* removed unused import

Signed-off-by: Matteo Omenetti <omenetti.matteo@gmail.com>

* removed print statements

Signed-off-by: Matteo Omenetti <omenetti.matteo@gmail.com>

* gt for new pdf

Signed-off-by: Matteo Omenetti <omenetti.matteo@gmail.com>

* Update docling/pipeline/standard_pdf_pipeline.py

Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
Signed-off-by: Matteo <43417658+Matteo-Omenetti@users.noreply.github.com>

* fixed doc comment of __call__ function of code_formula_model

Signed-off-by: Matteo Omenetti <omenetti.matteo@gmail.com>

* fix artifacts_path type

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* move imports

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* move expansion_factor to base class

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Matteo Omenetti <omenetti.matteo@gmail.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Matteo <43417658+Matteo-Omenetti@users.noreply.github.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
2025-01-24 16:54:22 +01:00