DP-Bench Benchmarks

DP-Bench on HuggingFace

Create DPBench evaluation datasets:

# Make the ground-truth
docling-eval create-gt --benchmark DPBench --output-dir ./benchmarks/DPBench-gt/ 

# Make predictions for different modalities.
docling-eval create-eval \
  --benchmark DPBench \
  --gt-dir ./benchmarks/DPBench-gt/gt_dataset/ \
  --output-dir ./benchmarks/DPBench-e2e/ \
  --prediction-provider Docling # use full-document predictions from docling
  
docling-eval create-eval \
  --benchmark DPBench \
  --gt-dir ./benchmarks/DPBench-gt/gt_dataset/ \
  --output-dir ./benchmarks/DPBench-tables/ \
  --prediction-provider TableFormer # use tableformer predictions only

Layout Evaluation

Create the evaluation report:

docling-eval evaluate \
  --modality layout \
  --benchmark DPBench \
  --output-dir ./benchmarks/DPBench-e2e/ 

Layout evaluation json

Visualize the report:

docling-eval visualize \
  --modality layout \
  --benchmark DPBench \
  --output-dir ./benchmarks/DPBench-e2e/ 

mAP[0.5:0.95] report

mAP[0.5:0.95] plot

TableFormer Evaluation

Create the evaluation report:

docling-eval evaluate \
  --modality table_structure \
  --benchmark DPBench \
  --output-dir ./benchmarks/DPBench-tables/ 

TableFormer evaluation json

Visualize the report:

docling-eval visualize \
  --modality table_structure \
  --benchmark DPBench \
  --output-dir ./benchmarks/DPBench-tables/ 

TEDS plot

TEDS struct only plot

TEDS struct only report

TEDS struct with text plot

TEDS struct with text report

Reading Order Evaluation

Create the evaluation report:

docling-eval evaluate \
  --modality reading_order \
  --benchmark DPBench \
  --output-dir ./benchmarks/DPBench-e2e/ 

Reading order json

Visualize the report:

docling-eval visualize \
  --modality reading_order \
  --benchmark DPBench \
  --output-dir ./benchmarks/DPBench-e2e/ 

ARD plot

ARD report

Weighted ARD plot

Weighted ARD report

Markdown Text Evaluation

Create the evaluation report:

docling-eval evaluate \
  --modality markdown_text \
  --benchmark DPBench \
  --output-dir ./benchmarks/DPBench-e2e/ 

Markdown text json

Visualize the report:

docling-eval visualize \
  --modality markdown_text \
  --benchmark DPBench \
  --output-dir ./benchmarks/DPBench-e2e/ 

Markdown text report

BLEU plot

Edit distance plot

F1 plot

METEOR plot

Precision plot

Recall plot
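
Run all evaluations in one go

The evaluate and visualize commands above differ only in the modality and the output directory, so they can be wrapped in a small loop. The script below is a minimal sketch, not part of docling-eval: the `run` and `evaluate_and_visualize` helpers and the `DRY_RUN` variable are hypothetical names introduced here. It uses only the flags shown above and assumes the datasets have already been created in `./benchmarks/DPBench-e2e/` and `./benchmarks/DPBench-tables/`. By default it only prints the commands; set `DRY_RUN=0` to execute them.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical helper: print the command when DRY_RUN=1 (the default here),
# execute it when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Hypothetical helper: run `evaluate` and then `visualize` for one modality,
# using the same flags as the individual commands in this document.
evaluate_and_visualize() {
  local modality="$1"
  local output_dir="$2"
  local step
  for step in evaluate visualize; do
    run docling-eval "$step" \
      --modality "$modality" \
      --benchmark DPBench \
      --output-dir "$output_dir"
  done
}

# Modalities evaluated on the full-document (Docling) predictions.
for modality in layout reading_order markdown_text; do
  evaluate_and_visualize "$modality" ./benchmarks/DPBench-e2e/
done

# Table structure is evaluated on the TableFormer predictions.
evaluate_and_visualize table_structure ./benchmarks/DPBench-tables/
```

Saved as, for example, `run_dpbench.sh` (an arbitrary name), `bash run_dpbench.sh` lists the eight commands it would run, and `DRY_RUN=0 bash run_dpbench.sh` runs them in order, stopping at the first failure because of `set -e`.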