mirror of https://github.com/docling-project/docling-parse.git synced 2026-05-17 13:10:49 +00:00

Peter W. J. Staar ae66f6ddf0 feat: add parallelization for parsing (#216)

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

2026-03-04 10:42:04 +01:00

Performance Benchmarking Guide

This page explains how to benchmark page-level PDF parsing with the perf/run_perf.py script. The script supports multiple backends, including a multi-threaded mode, and writes a CSV plus summary statistics for quick comparison.

Prerequisites

Python 3.9+
Install the project in your environment. For optional parsers, install the perf extras:
- pip: pip install ".[perf-tools]" (the quotes keep shells such as zsh from expanding the brackets)
- uv: uv sync --group perf-test

Optional parsers (pdfplumber, pypdfium2, pymupdf) are only needed when you select them via -p.

Script Overview

Entry point: python perf/run_perf.py <input> [options]
Input can be a single PDF file or a directory containing PDFs.
Output is a CSV file with one row per page and a printed summary.

CSV columns:

filename,page_number,elapsed_sec,success,error
For docling backends, additional columns with per-step timing breakdowns and percentages are appended.

Summary includes overall totals and per-document tables with mean/median/min/max and p50/p90/p95/p99 percentiles.
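
To re-derive similar statistics from a saved CSV, something like the following sketch may help. It relies only on the elapsed_sec and success columns listed above. The values stored in the success column are an assumption here (anything matching "true", "1", or "yes" counts as success), and the percentile interpolation may differ slightly from what the script prints.

```python
import csv
import io
import statistics


def summarize_elapsed(csv_text, truthy=("true", "1", "yes")):
    """Return basic stats over elapsed_sec for rows marked successful.

    Assumption: the success column holds a value such as "True" or "1";
    adjust `truthy` to whatever your CSV actually contains.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    times = [
        float(row["elapsed_sec"])
        for row in reader
        if row["success"].strip().lower() in truthy
    ]
    if not times:
        return {}
    stats = {
        "pages": len(times),
        "mean": statistics.fmean(times),
        "median": statistics.median(times),
        "min": min(times),
        "max": max(times),
    }
    if len(times) >= 2:
        # quantiles(n=100) returns 99 cut points; index k-1 is the k-th percentile.
        cuts = statistics.quantiles(times, n=100, method="inclusive")
        stats.update({"p90": cuts[89], "p95": cuts[94], "p99": cuts[98]})
    return stats


# Tiny made-up sample in the documented column layout.
sample = """filename,page_number,elapsed_sec,success,error
a.pdf,1,0.10,True,
a.pdf,2,0.30,True,
b.pdf,1,0.20,False,boom
"""
print(summarize_elapsed(sample))
```

For a real run, pass the file contents instead of the sample, for example `summarize_elapsed(open(path, newline="").read())`.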

Parser Backends

Use --parser/-p to select the backend. Available options:

docling (default) — sequential, one page at a time
docling-threaded — parallel, multi-threaded with backpressure
pdfplumber
pypdfium2 (alias: pypdfium)
pymupdf
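
As a rough illustration of what "multi-threaded with backpressure" means, the sketch below uses a bounded result queue: workers block when the buffer is full and resume once the consumer has drained a result. This is a generic pattern, not the docling-parse implementation; `parse_page` and the function name are hypothetical stand-ins.

```python
import queue
import threading


def parse_with_backpressure(pages, parse_page, threads=4, max_buffered=64):
    """Run parse_page over a list of pages with a bounded result buffer.

    parse_page is a hypothetical stand-in for a per-page parse call.
    """
    tasks = queue.Queue()
    results = queue.Queue(maxsize=max_buffered)  # put() blocks when full
    for page in pages:
        tasks.put(page)

    def worker():
        while True:
            try:
                page = tasks.get_nowait()
            except queue.Empty:
                return  # no work left
            try:
                outcome = parse_page(page)
            except Exception as exc:  # report failures instead of hanging the consumer
                outcome = exc
            results.put((page, outcome))  # pauses here while the buffer is full

    workers = [threading.Thread(target=worker) for _ in range(threads)]
    for w in workers:
        w.start()
    # Each get() frees a slot, which lets one blocked worker continue.
    collected = [results.get() for _ in range(len(pages))]
    for w in workers:
        w.join()
    return collected


if __name__ == "__main__":
    out = parse_with_backpressure(list(range(10)), lambda p: p * p, threads=3, max_buffered=2)
    print(sorted(out))
```

A smaller buffer caps memory held by unconsumed results at the cost of workers idling more often when the consumer is slow.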

CLI Reference

python perf/run_perf.py <input> [options]

positional arguments:
  input                      Path to a PDF file or directory of PDFs

options:
  --parser, -p PARSER        Parser backend (default: docling)
  --recursive, -r            Recurse into subdirectories
  --output, -o PATH          Output CSV path (default: perf/results/perf_<parser>_<timestamp>.csv)
  --limit, -l N              Maximum number of documents to process
  --bytesio                  (docling only) Load PDFs via BytesIO instead of file path
  --threads, -t N            (docling-threaded only) Number of worker threads (default: 4)
  --max-concurrent-results N (docling-threaded only) Max buffered results before workers pause (default: 64)

Common Examples

Sequential (single-threaded) docling parsing

# Single file
python perf/run_perf.py ./docs/sample.pdf

# Directory, recursive
python perf/run_perf.py ./dataset -r -p docling

Parallel (multi-threaded) docling parsing

# 4 threads (default), up to 64 buffered results
python perf/run_perf.py ./dataset -r -p docling-threaded

# 8 threads, tighter backpressure
python perf/run_perf.py ./dataset -r -p docling-threaded --threads 8 --max-concurrent-results 32

# Single thread (useful as a baseline to measure thread overhead)
python perf/run_perf.py ./dataset -r -p docling-threaded --threads 1

Compare with non-docling backends

python perf/run_perf.py ./dataset -r -p pdfplumber
python perf/run_perf.py ./dataset -r -p pypdfium2
python perf/run_perf.py ./dataset -r -p pymupdf

Output Location

By default, results are written to:

perf/results/perf_<parser>_<timestamp>.csv

Customize the output path with --output/-o:

python perf/run_perf.py ./dataset -r -p docling          -o perf/results/sequential.csv
python perf/run_perf.py ./dataset -r -p docling-threaded  -o perf/results/parallel_4t.csv
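
When comparing several configurations, it can help to generate the command lines from one place so that every run uses the same input and a predictable output name. The helper below is a hypothetical convenience wrapper, not part of the repository; it uses only the flags documented in the CLI Reference.

```python
import shlex


def build_perf_command(dataset, parser, output, threads=None, recursive=True):
    """Build an argv list for perf/run_perf.py (hypothetical helper)."""
    if threads is not None and parser != "docling-threaded":
        raise ValueError("--threads applies to docling-threaded only")
    cmd = ["python", "perf/run_perf.py", dataset]
    if recursive:
        cmd.append("-r")
    cmd += ["-p", parser, "-o", output]
    if threads is not None:
        cmd += ["--threads", str(threads)]
    return cmd


runs = [
    build_perf_command("./dataset", "docling", "perf/results/sequential.csv"),
    build_perf_command("./dataset", "docling-threaded", "perf/results/parallel_4t.csv", threads=4),
    build_perf_command("./dataset", "docling-threaded", "perf/results/parallel_8t.csv", threads=8),
]
for cmd in runs:
    print(shlex.join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` to execute the runs one after another.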

Tips for Fair Comparisons

Run sequential vs threaded in separate invocations and compare CSVs and printed summaries.
Use the same input set and the same --recursive setting for both runs.
For docling-threaded, note that elapsed_sec per row measures the wait time for each result (not CPU time). The total wall time printed at the end is the true parallel throughput measure.
Consider two passes:
- Cold run: after a reboot or clearing OS caches (if feasible).
- Warm run: repeat the same command to observe cache effects.
Avoid other heavy workloads while benchmarking.
Record environment details (CPU, RAM, OS, Python, library versions).
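
One way to capture environment details alongside each CSV is a small standard-library report such as the sketch below. The distribution name "docling-parse" is an assumption and may need adjusting for your installation; RAM size is not available from the standard library and would have to be recorded separately.

```python
import json
import os
import platform
from importlib import metadata


def environment_report(packages=("docling-parse",)):
    """Collect OS, CPU, Python, and package version details."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed under this name
    return {
        "os": platform.platform(),
        "machine": platform.machine(),
        "cpu_count": os.cpu_count(),
        "python": platform.python_version(),
        "packages": versions,
    }


print(json.dumps(environment_report(), indent=2))
```

Saving this output next to the CSV makes it easier to explain differences between runs on different machines.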

Interpreting Results

Sequential (`docling`)

elapsed_sec measures wall-clock seconds per page parse.

Parallel (`docling-threaded`)

elapsed_sec per row is the time spent waiting for that specific result (includes queue wait time).
The total wall time printed at the end is the key metric — it reflects actual parallel throughput.
Compare the total wall time of the docling-threaded run with the time_total_sec of the sequential docling run to see the speedup.
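
The comparison can be reduced to two numbers: speedup (sequential total divided by threaded wall time) and parallel efficiency (speedup divided by thread count). A minimal sketch follows; the input figures are made up for illustration and are not measured results.

```python
def speedup_report(sequential_total_sec, threaded_wall_sec, threads):
    """Return speedup and per-thread efficiency for a threaded run."""
    if sequential_total_sec <= 0 or threaded_wall_sec <= 0 or threads < 1:
        raise ValueError("times must be positive and threads >= 1")
    speedup = sequential_total_sec / threaded_wall_sec
    return {"speedup": speedup, "efficiency": speedup / threads}


# Hypothetical figures: 120 s sequential total, 40 s threaded wall time, 4 threads.
print(speedup_report(120.0, 40.0, 4))
```

An efficiency well below 1.0 suggests that threads spend time waiting, for example on a full result buffer or on shared resources.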

General

The summary shows overall averages and percentiles across all successful pages.
The per-document table helps identify outliers at the file level.
For docling backends, a timing breakdown table shows average time and percentage for each internal parsing step (resource decoding, content parsing, etc.).

Troubleshooting

"No PDFs found": verify the input path and file extensions (.pdf). Use -r for directories with nested PDFs.
Import errors for optional parsers: install extras (see Prerequisites) or switch -p to another backend.
Permission errors writing CSV: pass -o to a writable location.
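
To check what a given input path contains before running the benchmark, a quick listing like the one below can be useful. It is a standalone sketch and does not reproduce the script's own file discovery; in particular, the case-insensitive extension match here is an assumption.

```python
from pathlib import Path


def find_pdfs(root, recursive=False):
    """List PDF files at root (a file or directory), optionally recursing."""
    root = Path(root)
    if root.is_file():
        return [root] if root.suffix.lower() == ".pdf" else []
    if not root.is_dir():
        return []  # path does not exist
    walker = root.rglob("*") if recursive else root.glob("*")
    return sorted(p for p in walker if p.is_file() and p.suffix.lower() == ".pdf")


if __name__ == "__main__":
    for pdf in find_pdfs("./dataset", recursive=True):
        print(pdf)
```

If the non-recursive listing is empty but the recursive one is not, the PDFs live in subdirectories and the benchmark needs -r.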

Reproducible Runs (uv)

If you use uv, sync the perf-test group (see Prerequisites) and run:

uv run python perf/run_perf.py ./dataset -r -p docling
uv run python perf/run_perf.py ./dataset -r -p docling-threaded --threads 4

This ensures a consistent environment for fair, repeatable measurements.
