# Performance Benchmarking Guide

This page explains how to benchmark page-level PDF parsing with the `perf/run_perf.py` script. It supports multiple backends, including a multi-threaded mode, and produces a CSV plus summary stats for quick comparison.

## Prerequisites

- Python 3.9+
- Install the project in your environment. For optional parsers, install the perf extras:
  - pip: `pip install .[perf-tools]`
  - uv: `uv sync --group perf-test`

Optional parsers (pdfplumber, pypdfium2, pymupdf) are only needed when you select them via `-p`.

## Script Overview

- Entry point: `python perf/run_perf.py [options]`
- Input can be a single PDF file or a directory containing PDFs.
- Output is a CSV file with one row per page and a printed summary.

CSV columns:

- `filename,page_number,elapsed_sec,success,error`
- For docling backends, additional columns with per-step timing breakdowns and percentages are appended.

Summary includes overall totals and per-document tables with mean/median/min/max and p50/p90/p95/p99 percentiles.

## Parser Backends

Use `--parser/-p` to select the backend.
Available options:

- `docling` (default): sequential, one page at a time
- `docling-threaded`: parallel, multi-threaded with backpressure
- `pdfplumber`
- `pypdfium2` (alias: `pypdfium`)
- `pymupdf`

## CLI Reference

```
python perf/run_perf.py [options]

positional arguments:
  input                         Path to a PDF file or directory of PDFs

options:
  --parser, -p PARSER           Parser backend (default: docling)
  --recursive, -r               Recurse into subdirectories
  --output, -o PATH             Output CSV path (default: perf/results/perf__.csv)
  --limit, -l N                 Maximum number of documents to process
  --bytesio                     (docling only) Load PDFs via BytesIO instead of file path
  --threads, -t N               (docling-threaded only) Number of worker threads (default: 4)
  --max-concurrent-results N    (docling-threaded only) Max buffered results before workers pause (default: 64)
```

## Common Examples

### Sequential (single-threaded) docling parsing

```sh
# Single file
python perf/run_perf.py ./docs/sample.pdf

# Directory, recursive
python perf/run_perf.py ./dataset -r -p docling
```

### Parallel (multi-threaded) docling parsing

```sh
# 4 threads (default), up to 64 buffered results
python perf/run_perf.py ./dataset -r -p docling-threaded

# 8 threads, tighter backpressure
python perf/run_perf.py ./dataset -r -p docling-threaded --threads 8 --max-concurrent-results 32

# Single thread (useful as a baseline to measure thread overhead)
python perf/run_perf.py ./dataset -r -p docling-threaded --threads 1
```

### Compare with non-docling backends

```sh
python perf/run_perf.py ./dataset -r -p pdfplumber
python perf/run_perf.py ./dataset -r -p pypdfium2
python perf/run_perf.py ./dataset -r -p pymupdf
```

## Output Location

By default, results are written to:

- `perf/results/perf__.csv`

Customize output path with `--output/-o`:

```sh
python perf/run_perf.py ./dataset -r -p docling -o perf/results/sequential.csv
python perf/run_perf.py ./dataset -r -p docling-threaded -o perf/results/parallel_4t.csv
```

## Tips for Fair Comparisons

- Run sequential vs
threaded in separate invocations and compare the CSVs and printed summaries.
- Use the same `--recursive` setting and the same input set for both runs.
- For `docling-threaded`, note that `elapsed_sec` per row measures the *wait time* for each result (not CPU time). The total wall time printed at the end is the true measure of parallel throughput.
- Consider two passes:
  - Cold run: after a reboot or clearing OS caches (if feasible).
  - Warm run: repeat the same command to observe cache effects.
- Avoid other heavy workloads while benchmarking.
- Record environment details (CPU, RAM, OS, Python, library versions).

## Interpreting Results

### Sequential (`docling`)

- `elapsed_sec` measures the wall-clock seconds taken to parse each page.

### Parallel (`docling-threaded`)

- `elapsed_sec` per row is the time spent waiting for that specific result (includes queue wait time).
- The **total wall time** printed at the end is the key metric: it reflects actual parallel throughput.
- Compare `total wall time` of `docling-threaded` with `time_total_sec` of sequential `docling` to see the speedup.

### General

- The summary shows overall averages and percentiles across all successful pages.
- The per-document table helps identify outliers at the file level.
- For docling backends, a timing breakdown table shows average time and percentage for each internal parsing step (resource decoding, content parsing, etc.).

## Troubleshooting

- "No PDFs found": verify the input path and file extensions (`.pdf`). Use `-r` for directories with nested PDFs.
- Import errors for optional parsers: install extras (see Prerequisites) or switch `-p` to another backend.
- Permission errors writing CSV: pass `-o` to a writable location.

## Reproducible Runs (uv)

If you use `uv`, sync the perf group and run:

```sh
uv run python perf/run_perf.py ./dataset -r -p docling
uv run python perf/run_perf.py ./dataset -r -p docling-threaded --threads 4
```

This ensures a consistent environment for fair, repeatable measurements.
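
To compare the CSVs from two such runs outside the script, the per-page timings can be summarized with the standard library alone. The following is a minimal sketch and is not part of `perf/run_perf.py`. The helper names `load_elapsed`, `percentile`, and `summarize` are hypothetical, the `success` column is assumed to be serialized as `True`/`False` or `1`/`0`, and the nearest-rank percentile used here may differ from whatever method the script uses for its printed summary.

```python
import csv
import math
import statistics
import sys
from pathlib import Path


def load_elapsed(csv_path):
    """Return the elapsed_sec values of all rows marked as successful."""
    values = []
    with Path(csv_path).open(newline="") as fh:
        for row in csv.DictReader(fh):
            # Assumption: success is written as True/False or 1/0.
            if row["success"].strip().lower() in {"true", "1", "yes"}:
                values.append(float(row["elapsed_sec"]))
    return values


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    if not sorted_values:
        raise ValueError("no values to summarize")
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def summarize(values):
    """Compute page count, mean, median, p90, and p99 for a list of timings."""
    ordered = sorted(values)
    return {
        "pages": len(ordered),
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "p90": percentile(ordered, 90),
        "p99": percentile(ordered, 99),
    }


if __name__ == "__main__":
    # Each command-line argument is the path to one results CSV.
    for path in sys.argv[1:]:
        print(path, summarize(load_elapsed(path)))
```

Saved under a name of your choice (for example a hypothetical `compare_csv.py`), it could be run as `python compare_csv.py perf/results/sequential.csv perf/results/parallel_4t.csv`. Keep in mind that for `docling-threaded` the per-row `elapsed_sec` is wait time, so the total wall time printed by the benchmark remains the figure to compare across modes.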