4 Commits

Author SHA1 Message Date
Christoph Auer 629a451d7b feat: Layout evaluation fixes, mode control and cleanup (#133)
* Misc fixes

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Make DatasetRecord tolerant to old parquet files

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Make DatasetRecord tolerant to old parquet files (2)

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fix docvqa test, more cleanup

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Important fixes for layout mAP computation

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Adding modes for missing_prediction_strategy and label_filtering_strategy

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fixes for mismatched docs

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add F1 no_picture metrics to layout evaluator

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fixed commands on all READMEs

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Remove extract_images ambiguity, use utility and fix errors on visualizer

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Upgrade to latest docling_core

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fix ocrmac dep, upgrade uv.lock

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fix for tableformer provider

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Remove code redundancy

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-07-01 10:02:59 +02:00
Christoph Auer e3debd61d7 fix: Address missing conversion status (PENDING), add artifacts path, remove unused CLI args (#69)
* Add README for Docling-DPBench

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Restructured CVAT builder (WIP)

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* CVAT preannotation and dataset builders, with test cases

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add CLI, merge from main

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update README for CVAT

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add artifacts path option to CLI, several fixes

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Remove raise

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-04-17 14:14:58 +02:00
Christoph Auer a3d99b9f13 feat: Establish new API encapsulation for dataset creation and prediction providers (#30)
* correct mypy

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* reformatting

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* adding the script to make an initial dataset from PDFs

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* before switching to specific docling-core branch

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* rebased on kv-items and updated the create script in CVAT

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* fixed the cvat

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* added the annotation description on CVAT

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* added the annotation description on CVAT (2)

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* added the annotation description on CVAT (3)

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* [WIP] Crafting new dataset builder and prediction provider API

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Restructure to docling_eval_next

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fix mypy

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fix f-strings

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Changes for prediction_provider interface, to support all cases.

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add omnidocbench DatasetBuilder

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add doclaynet v1, funsd

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fixes

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add XFUND, more fixes

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* update the kv cell creation to prevent false positives

Signed-off-by: Saidgurbuz <said.gurbuz@epfl.ch>

* chore: Fixing imports

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* chore: Update docling-core version

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* feat: Introduce new design for Evaluators based on BaseEvaluator that accept external predictions.
And utility adapters.

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* Factor PredictionProvider out of dataset builder, many fixes on DatasetRecord

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Sketch example for file-directory prediction provider

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* chore: Fix typing hints

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* chore: Update poetry to docling-core 2.24.0

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* feat: WIP: Introduce the FilePredictionProvider that reads files with predictions from the disk
- It currently supports doctags, markdown, json, yaml formats.
- We still need to improve the returned type so that it allows for no DoclingDocument but only for
  the source data (e.g. in case of markdown).

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* Add DocLayNetV2DatasetBuilder

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Added TableDatasetBuilder and test, update TableFormerPredictionProvider

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* chore: Update MyPy configuration in toml

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* feat: Refactor the BasePredictionProvider.predict() to return DatasetRecordWithPrediction

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* Fixes

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* fix: Fix the FilePredictionProvider. Return None in the predicted document in case of Markdown.

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* fix: Remove the kwargs from all PredictionProvider classes and introduce provider-specific
initialization arguments

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* feat: Introduce the parameter "ignore_missing_files" in FilePredictionProvider

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* Add do_visualization to PredictionProvider

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Move next-gen API to main source tree, re-organize module paths

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fixes

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Cleanup, change path handling

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Cleanup, change path handling

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* More module removal and renaming

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Small test fixes

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* fix: Add the "prediction_format" in the serialization of DatasetRecordWithPrediction

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* feat: Refactor the MarkdownTextEvaluator to support the new classes design. Add unit test.

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* fix: Improve the new design of MarkdownEvaluator to move common functionalities into the base class

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* feat: Refactor the LayoutEvaluator to use the new class design. Add unit test.

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* fix: Clean up LayoutEvaluator code

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* chore: Implementation cleanup and fixes for new class design (#52)

* More module removal and renaming

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Small test fixes

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Small test fixes

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Cleanup of tests and more fixes

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add visualization for tables

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add visualization for all tests

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fixes for test files, FilePredictionProvider changes

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Put new CLI

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Cleanup

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Rename CLI

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update all README with new commands.

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Remove old examples

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Several Fixes

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* README updates

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add gt_dir arg to create-eval, README fixes

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fixes, pass tests

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* feat: Refactor the TableEvaluator to use the new class design.
Move common evaluator code to BaseEvaluator.
Add more unit tests. Introduce pytest dependencies.

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* Update lockfile

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update lockfile

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Make pytest CI output more verbose

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* feat: Refactor the ReadingOrderEvaluator to use the new class design.
Remove the BaseReadingOrderEvaluator. Add unit test.

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* Optimize GT downloading behaviour

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add file sources

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Allow pytest output on CI

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Disable tests in CI

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Reenable tests in CI

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add correct @pytest.mark.dependency()

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* feat: Introduce TypeVars for the UnitEvaluation and DatasetEvaluation used by the BaseEvaluator.

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* Minimize tests in CI

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* feat: Refactor BboxTextEvaluator to use the new design. Introduce unit test.

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* Remove streaming in DocLayNet v1

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add back test dependency

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Saidgurbuz <said.gurbuz@epfl.ch>
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
Co-authored-by: Peter Staar <taa@zurich.ibm.com>
Co-authored-by: Saidgurbuz <said.gurbuz@epfl.ch>
Co-authored-by: Nikos Livathinos <nli@zurich.ibm.com>
2025-04-01 13:04:03 +02:00
Nikos Livathinos ddae1ec966 fix: Fix the modalities for DPBench, OmniDocBench, DLNv1. Switch to new settings in SmolDocling API. Improve the documentation. (#37)
* chore: Change the pinning of docling

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* fix: Fix the modalities supported for DPBench, OmniDocBench, DLNv1. Clean up code.

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* docs: Update documentation to have all benchmarks in separate md files and place links in Readme.

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* fix: Change the initialization of the create_smol_docling_converter() to allow flash-attn

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* docs: List benchmarks in the main readme with short description. Fix broken links in the documentation.

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* docs: Fix broken link in Readme.

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* chore: Update lock file

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* fix: Add debug code to dump the predicted text in create_dlnv1_e2e_dataset()

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* chore: Update toml to pin docling with branch and extras

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* fix: Disable the generation of VLM text debugging files for DLNv1 benchmark

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* chore: Update toml to docling v2.25.0 with vlm extra

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

---------

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
2025-02-26 15:50:02 +01:00