Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Perf tools for page-level parsing benchmarking.
Usage
- Install extras for optional parsers (not part of the main package):
  - pip:
      pip install .[perf-tools]
  - uv (already configured):
      uv sync --group perf-test
- Run on a file or directory:
    python perf/run_perf.py ./docs/sample.pdf
    python perf/run_perf.py ./dataset --recursive -p pdfplumber
CLI
- input: PDF file or directory of PDFs.
- --parser|-p: one of docling (default), pdfplumber, pypdfium2 (alias: pypdfium), pymupdf.
- --recursive|-r: recurse when input is a directory.
- --output|-o: output CSV path (default under perf/results).
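For readers who want to script around these options, the following is a minimal sketch of an equivalent argparse setup. It is an assumption-based reconstruction from the option list above, not the actual contents of perf/run_perf.py; the names `build_arg_parser` and `normalize_parser` are hypothetical.

```python
import argparse

# Parser names as listed in this README; "pypdfium" is the documented alias.
PARSERS = ["docling", "pdfplumber", "pypdfium2", "pypdfium", "pymupdf"]


def build_arg_parser() -> argparse.ArgumentParser:
    """Hypothetical reconstruction of the documented CLI surface."""
    ap = argparse.ArgumentParser(description="Page-level parsing benchmark")
    ap.add_argument("input", help="PDF file or directory of PDFs")
    ap.add_argument("--parser", "-p", choices=PARSERS, default="docling",
                    help="which parser to benchmark")
    ap.add_argument("--recursive", "-r", action="store_true",
                    help="recurse when input is a directory")
    # Assumption: None means "pick a default path under perf/results".
    ap.add_argument("--output", "-o", default=None, help="output CSV path")
    return ap


def normalize_parser(name: str) -> str:
    """Map the documented alias onto its canonical parser name."""
    return "pypdfium2" if name == "pypdfium" else name


# Parse an explicit argument list so the sketch does not read sys.argv.
args = build_arg_parser().parse_args(
    ["./dataset", "--recursive", "-p", "pdfplumber"]
)
```

Resolving the alias in a separate step keeps the `choices` list identical to what the README documents.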
CSV columns
filename,page_number,elapsed_sec,success,error
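To illustrate how one row with these columns might be produced, here is a small sketch that times a single page parse and records success or failure. The helpers `time_page` and `write_rows` are hypothetical, and the formats of the `success` and `error` fields are assumptions; the real script may serialize them differently.

```python
import csv
import io
import time
from typing import Callable, Iterable

FIELDS = ["filename", "page_number", "elapsed_sec", "success", "error"]


def time_page(filename: str, page_number: int,
              parse_page: Callable[[], object]) -> dict:
    """Run parse_page once and return a row dict with the columns above."""
    start = time.perf_counter()
    try:
        parse_page()
        success, error = True, ""
    except Exception as exc:  # record the failure instead of aborting the run
        success, error = False, f"{type(exc).__name__}: {exc}"
    elapsed = time.perf_counter() - start
    return {
        "filename": filename,
        "page_number": page_number,
        "elapsed_sec": elapsed,
        "success": success,
        "error": error,
    }


def write_rows(rows: Iterable[dict], fh) -> None:
    """Write a header line followed by one CSV line per row."""
    writer = csv.DictWriter(fh, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
```

Catching the exception per page means one broken page does not discard the timings already collected for the rest of the document.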
Statistics
- Prints totals, avg sec/page, min/max, and percentiles (p50/p90/p95/p99) after the run.
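The summary can be reproduced from the `elapsed_sec` column alone. The sketch below assumes linear interpolation between neighbouring samples for the percentiles; the method used by the actual script is not stated here and may differ. The function names are hypothetical.

```python
import math
from typing import Sequence


def percentile(sorted_vals: Sequence[float], q: float) -> float:
    """Percentile of pre-sorted values, q in [0, 100], linear interpolation."""
    if not sorted_vals:
        raise ValueError("no samples")
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo = math.floor(pos)
    hi = math.ceil(pos)
    frac = pos - lo
    return sorted_vals[lo] * (1.0 - frac) + sorted_vals[hi] * frac


def summarize(times: Sequence[float]) -> dict:
    """Totals, average seconds per page, min/max, and p50/p90/p95/p99."""
    vals = sorted(times)
    if not vals:
        raise ValueError("no samples")
    total = sum(vals)
    stats = {
        "pages": len(vals),
        "total_sec": total,
        "avg_sec_per_page": total / len(vals),
        "min": vals[0],
        "max": vals[-1],
    }
    for q in (50, 90, 95, 99):
        stats[f"p{q}"] = percentile(vals, q)
    return stats
```

Page times tend to be heavy-tailed, so p95 and p99 usually say more about worst-case behaviour than the average does.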
Visualization
python perf/run_eval.py perf/results creates plots under perf/viz:
- Per-parser page-time histograms (log-log)
- Superposed histogram across parsers
- Stacked histograms with common x-axis
- Per-document scatter (pages vs total time) with linear fit
- Pairwise hexbin plots of per-page times across parsers
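The per-document scatter with a linear fit can be approximated without any plotting library. The sketch below aggregates rows from the results CSV into (pages, total time) points and fits an ordinary least-squares line; the function names are hypothetical and the actual plotting code may compute the fit differently.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple


def per_document_totals(rows: Iterable[dict]) -> List[Tuple[int, float]]:
    """Collapse per-page rows into one (page_count, total_sec) per file."""
    pages = defaultdict(int)
    total = defaultdict(float)
    for row in rows:
        pages[row["filename"]] += 1
        total[row["filename"]] += float(row["elapsed_sec"])
    return [(pages[name], total[name]) for name in pages]


def linear_fit(points: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Ordinary least squares; returns (slope, intercept)."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two documents")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        raise ValueError("all documents have the same page count")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x
```

Under this model the slope estimates seconds per additional page and the intercept estimates fixed per-document overhead.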
Analysis
python perf/run_analysis.py <perf_csv> --top 20 --mode typed extracts detailed stage timings for the slowest pages to help identify bottlenecks. Writes perf/results/analysis_<ts>.csv by default.
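Picking the slowest pages from a results CSV is the first step of such an analysis. The sketch below shows one way to do that selection with the standard library; `slowest_pages` is a hypothetical helper, and it does not reproduce the stage-level timings that the analysis script extracts.

```python
import csv
import heapq
import io
from typing import List


def slowest_pages(csv_file, top: int = 20) -> List[dict]:
    """Return the `top` rows with the largest elapsed_sec, slowest first."""
    reader = csv.DictReader(csv_file)
    return heapq.nlargest(top, reader, key=lambda r: float(r["elapsed_sec"]))


# Small in-memory sample using the documented column names.
sample = io.StringIO(
    "filename,page_number,elapsed_sec,success,error\n"
    "a.pdf,1,0.10,True,\n"
    "a.pdf,2,2.50,True,\n"
    "b.pdf,1,0.75,True,\n"
)
worst = slowest_pages(sample, top=2)
```

`heapq.nlargest` avoids sorting the whole file when only a handful of pages is needed.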