
# Performance Benchmarking Guide

This page explains how to benchmark page-level PDF parsing with the `perf/run_perf.py` script. The script supports multiple parser backends, including a multi-threaded docling mode, and writes a per-page CSV plus printed summary statistics for quick comparison.

## Prerequisites
- Python 3.9+
- Install the project in your environment. For optional parsers, install the perf extras:
  - pip: `pip install ".[perf-tools]"` (quoted so that shells such as zsh do not treat the brackets as a glob pattern)
  - uv: `uv sync --group perf-test`

Optional parsers (pdfplumber, pypdfium2, pymupdf) are only needed when you select them via `-p`.

## Script Overview
- Entry point: `python perf/run_perf.py <input> [options]`
- Input can be a single PDF file or a directory containing PDFs.
- Output is a CSV file with one row per page, plus a printed summary.

CSV columns:
- `filename,page_number,elapsed_sec,success,error`
- For docling backends, additional columns with per-step timing breakdowns and percentages are appended.

Summary includes overall totals and per-document tables with mean/median/min/max and p50/p90/p95/p99 percentiles.
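
To re-derive similar statistics from a results file, for example to check a run after the fact, a minimal sketch along these lines may help. It relies only on the `elapsed_sec` and `success` columns listed above. How `success` is serialized (`True`/`False` versus `1`/`0`) is an assumption, and the percentile method may differ from the one the script uses, so small differences from the printed summary are possible.

```python
import csv
import io
import statistics


def summarize_elapsed(csv_file):
    """Return summary stats over rows whose `success` value looks truthy."""
    times = []
    for row in csv.DictReader(csv_file):
        # Assumption: success is written as True/False or 1/0; adjust if needed.
        if row["success"].strip().lower() in {"true", "1"}:
            times.append(float(row["elapsed_sec"]))
    if not times:
        return {}
    summary = {
        "pages": len(times),
        "mean": statistics.mean(times),
        "median": statistics.median(times),
        "min": min(times),
        "max": max(times),
    }
    if len(times) >= 2:
        # quantiles(n=100) returns the 99 cut points p1..p99.
        cuts = statistics.quantiles(times, n=100, method="inclusive")
        for p in (50, 90, 95, 99):
            summary[f"p{p}"] = cuts[p - 1]
    return summary


# Tiny in-memory stand-in for a results file (made-up numbers).
sample = io.StringIO(
    "filename,page_number,elapsed_sec,success,error\n"
    "a.pdf,1,0.10,True,\n"
    "a.pdf,2,0.30,True,\n"
    "b.pdf,1,0.20,True,\n"
    "b.pdf,2,0.00,False,parse failed\n"
)
stats = summarize_elapsed(sample)
print(stats)
```

For a real run, replace the in-memory sample with `open(path, newline="")` on the CSV written by the script.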

## Parser Backends
Use `--parser/-p` to select the backend. Available options:
- `docling` (default) — sequential, one page at a time
- `docling-threaded` — parallel, multi-threaded with backpressure
- `pdfplumber`
- `pypdfium2` (alias: `pypdfium`)
- `pymupdf`

## CLI Reference

```
python perf/run_perf.py <input> [options]

positional arguments:
  input                      Path to a PDF file or directory of PDFs

options:
  --parser, -p PARSER        Parser backend (default: docling)
  --recursive, -r            Recurse into subdirectories
  --output, -o PATH          Output CSV path (default: perf/results/perf_<parser>_<timestamp>.csv)
  --limit, -l N              Maximum number of documents to process
  --bytesio                  (docling only) Load PDFs via BytesIO instead of file path
  --threads, -t N            (docling-threaded only) Number of worker threads (default: 4)
  --max-concurrent-results N (docling-threaded only) Max buffered results before workers pause (default: 64)
```

## Common Examples

### Sequential (single-threaded) docling parsing

```sh
# Single file
python perf/run_perf.py ./docs/sample.pdf

# Directory, recursive
python perf/run_perf.py ./dataset -r -p docling
```

### Parallel (multi-threaded) docling parsing

```sh
# 4 threads (default), up to 64 buffered results
python perf/run_perf.py ./dataset -r -p docling-threaded

# 8 threads, tighter backpressure
python perf/run_perf.py ./dataset -r -p docling-threaded --threads 8 --max-concurrent-results 32

# Single thread (useful as a baseline to measure thread overhead)
python perf/run_perf.py ./dataset -r -p docling-threaded --threads 1
```
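
The interaction between `--threads` and `--max-concurrent-results` can be pictured with a bounded queue. The sketch below is a generic illustration of backpressure, not the implementation used by `perf/run_perf.py`: `parse_page` and `run_with_backpressure` are hypothetical names, and the "parsing" is a placeholder computation.

```python
import queue
import threading


def parse_page(page_number):
    """Hypothetical stand-in for real page parsing work."""
    return page_number * page_number


def run_with_backpressure(pages, threads=4, max_concurrent_results=64):
    tasks = queue.Queue()
    # Bounded queue: put() blocks once max_concurrent_results items are waiting,
    # so workers pause until the consumer catches up.
    results = queue.Queue(maxsize=max_concurrent_results)

    for page in pages:
        tasks.put(page)

    def worker():
        while True:
            try:
                page = tasks.get_nowait()
            except queue.Empty:
                return  # no work left for this thread
            results.put((page, parse_page(page)))

    workers = [threading.Thread(target=worker) for _ in range(threads)]
    for w in workers:
        w.start()

    collected = {}
    for _ in range(len(pages)):
        # Each get() frees a slot, which unblocks a paused worker.
        page, value = results.get()
        collected[page] = value
    for w in workers:
        w.join()
    return collected


out = run_with_backpressure(list(range(10)), threads=4, max_concurrent_results=2)
print(out)
```

A smaller buffer bounds memory use at the cost of workers idling more often when the consumer is slow, which is one reason to try more than one value when benchmarking.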

### Compare with non-docling backends

```sh
python perf/run_perf.py ./dataset -r -p pdfplumber
python perf/run_perf.py ./dataset -r -p pypdfium2
python perf/run_perf.py ./dataset -r -p pymupdf
```
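
When comparing several backends, it can be convenient to generate the command lines from one place so that every run uses the same input and flags. The following sketch only uses options documented in the CLI reference. The `compare_<parser>.csv` file naming and the `build_command` helper are illustrative choices, not part of the script.

```python
import shlex
import sys

BACKENDS = ["docling", "docling-threaded", "pdfplumber", "pypdfium2", "pymupdf"]


def build_command(parser, input_path, output_dir="perf/results", recursive=True):
    """Build one run_perf.py invocation as an argument list."""
    cmd = [sys.executable, "perf/run_perf.py", input_path, "-p", parser]
    if recursive:
        cmd.append("-r")
    # Hypothetical naming scheme so that runs do not overwrite each other.
    cmd += ["-o", f"{output_dir}/compare_{parser}.csv"]
    return cmd


commands = [build_command(p, "./dataset") for p in BACKENDS]
for cmd in commands:
    print(shlex.join(cmd))
    # To execute from the repository root, import subprocess and call:
    # subprocess.run(cmd, check=True)
```

Running the backends one after another, rather than concurrently, keeps them from competing for CPU and disk during measurement.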

## Output Location
By default, results are written to:
- `perf/results/perf_<parser>_<timestamp>.csv`

Customize the output path with `--output/-o`:
```sh
python perf/run_perf.py ./dataset -r -p docling          -o perf/results/sequential.csv
python perf/run_perf.py ./dataset -r -p docling-threaded  -o perf/results/parallel_4t.csv
```

## Tips for Fair Comparisons
- Run sequential vs threaded in separate invocations and compare CSVs and printed summaries.
- Use the same input set and the same `--recursive` setting for both runs.
- For `docling-threaded`, note that `elapsed_sec` per row measures the *wait time* for each result (not CPU time). The total wall time printed at the end is the true parallel throughput measure.
- Consider two passes:
  - Cold run: after a reboot or clearing OS caches (if feasible).
  - Warm run: repeat the same command to observe cache effects.
- Avoid other heavy workloads while benchmarking.
- Record environment details (CPU, RAM, OS, Python, library versions).
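
Environment details can be captured alongside each CSV with a few standard-library calls. This is a sketch under assumptions: the distribution names passed to `importlib.metadata` (`docling-parse`, `pdfplumber`, `pypdfium2`, `pymupdf`) are guesses at the installed package names and may need adjusting, and RAM is omitted because the standard library has no portable way to query it.

```python
import json
import os
import platform
import sys
from importlib import metadata


def collect_environment(packages=("docling-parse", "pdfplumber", "pypdfium2", "pymupdf")):
    """Gather basic host and package details for a benchmark log."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # optional parser not installed
    return {
        "os": platform.platform(),
        "machine": platform.machine(),
        "cpu_count": os.cpu_count(),
        "python": sys.version.split()[0],
        "packages": versions,
    }


env = collect_environment()
print(json.dumps(env, indent=2))
```

Saving this JSON next to the results file makes it easier to explain differences between runs made on different machines or library versions.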

## Interpreting Results

### Sequential (`docling`)
- `elapsed_sec` measures wall-clock seconds per page parse.

### Parallel (`docling-threaded`)
- `elapsed_sec` per row is the time spent waiting for that specific result (includes queue wait time).
- The **total wall time** printed at the end is the key metric — it reflects actual parallel throughput.
- Compare the `total wall time` of `docling-threaded` with the `time_total_sec` of sequential `docling` to estimate the speedup (sequential time divided by threaded time).
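
The speedup comparison is a simple ratio. As a worked sketch with made-up numbers, the following computes the speedup and a parallel efficiency figure (speedup divided by thread count). The efficiency metric is an addition for illustration and is not something the script reports.

```python
def speedup(sequential_sec, threaded_sec):
    """Ratio of sequential total time to threaded total wall time."""
    if sequential_sec <= 0 or threaded_sec <= 0:
        raise ValueError("timings must be positive")
    return sequential_sec / threaded_sec


def parallel_efficiency(sequential_sec, threaded_sec, threads):
    """Speedup per thread; 1.0 would mean perfect linear scaling."""
    if threads < 1:
        raise ValueError("threads must be at least 1")
    return speedup(sequential_sec, threaded_sec) / threads


# Made-up example: 120 s sequential, 40 s with 4 threads.
s = speedup(120.0, 40.0)
e = parallel_efficiency(120.0, 40.0, 4)
print(f"speedup: {s:.2f}x, efficiency: {e:.0%}")
```

With these example numbers the speedup is 3.0 and the efficiency is 0.75, meaning each thread contributes about three quarters of what ideal scaling would give.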

### General
- The summary shows overall averages and percentiles across all successful pages.
- The per-document table helps identify outliers at the file level.
- For docling backends, a timing breakdown table shows average time and percentage for each internal parsing step (resource decoding, content parsing, etc.).
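
To look for file-level outliers directly in the CSV, pages can be grouped by `filename` and ranked by median page time. This is a minimal sketch that assumes `success` is serialized as `True`/`False` or `1`/`0`. The `per_document_median` helper is illustrative and is not part of the script.

```python
import csv
import io
import statistics
from collections import defaultdict


def per_document_median(csv_file):
    """Return (filename, median elapsed_sec) pairs, slowest document first."""
    by_doc = defaultdict(list)
    for row in csv.DictReader(csv_file):
        # Assumption: success is written as True/False or 1/0.
        if row["success"].strip().lower() in {"true", "1"}:
            by_doc[row["filename"]].append(float(row["elapsed_sec"]))
    medians = {name: statistics.median(times) for name, times in by_doc.items()}
    return sorted(medians.items(), key=lambda item: item[1], reverse=True)


# Made-up rows for illustration only.
rows = io.StringIO(
    "filename,page_number,elapsed_sec,success,error\n"
    "a.pdf,1,0.1,True,\n"
    "a.pdf,2,0.3,True,\n"
    "b.pdf,1,0.5,True,\n"
    "b.pdf,2,0.7,True,\n"
)
ranked = per_document_median(rows)
for name, median_sec in ranked:
    print(f"{name}: {median_sec:.3f} s")
```

The median is used here rather than the mean so that a single slow page does not dominate a document's ranking.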

## Troubleshooting
- "No PDFs found": verify the input path and file extensions (`.pdf`). Use `-r` for directories with nested PDFs.
- Import errors for optional parsers: install extras (see Prerequisites) or switch `-p` to another backend.
- Permission errors writing CSV: pass `-o` to a writable location.

## Reproducible Runs (uv)
If you use `uv`, sync the perf group first (`uv sync --group perf-test`, as in Prerequisites), then run:
```sh
uv run python perf/run_perf.py ./dataset -r -p docling
uv run python perf/run_perf.py ./dataset -r -p docling-threaded --threads 4
```

Running both commands through `uv run` keeps them in the same project environment, which makes measurements easier to repeat and compare.