From be085c0e39dd5c51572b883d0f795c5a7abefd5d Mon Sep 17 00:00:00 2001
From: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
Date: Fri, 19 Dec 2025 13:16:59 +0100
Subject: [PATCH] docs(RTX): Guidelines for best performance on RTX GPUs (#2765)

* add RTX docs

Signed-off-by: Michele Dolfi

* add artwork and fix title

Signed-off-by: Michele Dolfi

* fix series definition

Signed-off-by: Michele Dolfi

* add nvidia logo and update todo

Signed-off-by: Michele Dolfi

---------

Signed-off-by: Michele Dolfi
---
 docs/assets/nvidia_logo_green.svg |  39 +++++
 docs/getting_started/rtx.md       | 261 ++++++++++++++++++++++++++++++
 mkdocs.yml                        |   1 +
 3 files changed, 301 insertions(+)
 create mode 100644 docs/assets/nvidia_logo_green.svg
 create mode 100644 docs/getting_started/rtx.md

diff --git a/docs/assets/nvidia_logo_green.svg b/docs/assets/nvidia_logo_green.svg
new file mode 100644
index 00000000..ffd2e63c
--- /dev/null
+++ b/docs/assets/nvidia_logo_green.svg
@@ -0,0 +1,39 @@
+
+
+
+
+generated by pstoedit version:3.44 from NVBadge_2D.eps
+
+
+
+
+
+
diff --git a/docs/getting_started/rtx.md b/docs/getting_started/rtx.md
new file mode 100644
index 00000000..7f9fde1e
--- /dev/null
+++ b/docs/getting_started/rtx.md
@@ -0,0 +1,261 @@
+# ⚡ RTX GPU Acceleration
+
+
+ Docling on RTX +
+
+
+Whether you're an AI enthusiast, researcher, or developer working with document processing, this guide will help you unlock the full potential of your NVIDIA RTX GPU with Docling.
+
+By leveraging GPU acceleration, you can achieve up to **6x speedup** compared to CPU-only processing. This dramatic performance improvement makes GPU acceleration especially valuable for processing large batches of documents, handling high-throughput document conversion workflows, or experimenting with advanced document understanding models.
+
+
+
+## Prerequisites
+
+Before setting up GPU acceleration, ensure you have:
+
+- An NVIDIA RTX GPU (RTX 40/50 series)
+- Windows 10/11 or Linux operating system
+
+## Installation Steps
+
+### 1. Install NVIDIA GPU Drivers
+
+First, ensure you have the latest NVIDIA GPU drivers installed:
+
+- **Windows**: Download from [NVIDIA Driver Downloads](https://www.nvidia.com/Download/index.aspx)
+- **Linux**: Use your distribution's package manager or download from NVIDIA
+
+Verify the installation:
+
+```bash
+nvidia-smi
+```
+
+This command should display your GPU information and driver version.
+
+### 2. Install CUDA Toolkit
+
+CUDA is NVIDIA's parallel computing platform required for GPU acceleration.
+
+Follow the official installation guide for your operating system at [NVIDIA CUDA Downloads](https://developer.nvidia.com/cuda-downloads). The installer will guide you through the process and automatically set up the required environment variables.
+
+### 3. Install cuDNN
+
+cuDNN provides optimized implementations for deep learning operations.
+
+Follow the official installation guide at [NVIDIA cuDNN Downloads](https://developer.nvidia.com/cudnn). The guide provides detailed instructions for all supported platforms.
+
+### 4. Install PyTorch with CUDA Support
+
+To use GPU acceleration with Docling, you need to install PyTorch with CUDA support using the `--index-url` option:
+
+```bash
+# For CUDA 12.8 (current default for PyTorch)
+pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
+
+# For CUDA 13.0
+pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu130
+```
+
+!!! note
+    The `--index-url` parameter is crucial as it ensures you get the CUDA-enabled version of PyTorch instead of the CPU-only version.
+
+For other CUDA versions and installation options, refer to the [PyTorch Installation Matrix](https://pytorch.org/get-started/locally/).
+
+Verify PyTorch CUDA installation:
+
+```python
+import torch
+print(f"PyTorch version: {torch.__version__}")
+print(f"CUDA available: {torch.cuda.is_available()}")
+print(f"CUDA version: {torch.version.cuda}")
+print(f"GPU device: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'None'}")
+```
+
+### 5. Install and Run Docling
+
+Install Docling with all dependencies:
+
+```bash
+pip install docling
+```
+
+**That's it!** Docling will automatically detect and use your RTX GPU when available. No additional configuration is required for basic usage.
+
+```python
+from docling.document_converter import DocumentConverter
+
+# Docling automatically uses GPU when available
+converter = DocumentConverter()
+result = converter.convert("document.pdf")
+```
+
+
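Reviewer sketch (not part of the patch): the two wheel index URLs in step 4 differ only in the CUDA tag (`cu128`, `cu130`). The helper below shows how such a URL could be derived from a CUDA version string. The names `cuda_tag` and `pip_install_command` are hypothetical, and the sketch assumes the `cu<major><minor>` pattern holds for other versions; only 12.8 and 13.0 are confirmed by the guide, so check the PyTorch Installation Matrix before relying on any other tag.

```python
# Hypothetical helper: derive the PyTorch wheel index URL from a CUDA version.
# Only the cu128 and cu130 indexes appear in the guide; other tags are an assumption.
PYTORCH_WHEEL_BASE = "https://download.pytorch.org/whl"


def cuda_tag(cuda_version: str) -> str:
    """Turn a 'major.minor' CUDA version such as '12.8' into a tag such as 'cu128'."""
    major, minor = cuda_version.strip().split(".")[:2]
    return f"cu{int(major)}{int(minor)}"


def pip_install_command(cuda_version: str) -> str:
    """Build the pip command that installs the CUDA-enabled PyTorch wheels."""
    index_url = f"{PYTORCH_WHEEL_BASE}/{cuda_tag(cuda_version)}"
    return f"pip install torch torchvision torchaudio --index-url {index_url}"


if __name__ == "__main__":
    print(pip_install_command("12.8"))
    print(pip_install_command("13.0"))
```

A version string without a minor component (for example `"12"`) raises `ValueError`, which seems preferable to silently guessing a tag.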
+Advanced: Tuning GPU Performance
+
+For optimal GPU performance with large document batches, you can adjust batch sizes and explicitly configure the accelerator:
+
+```python
+from docling.datamodel.accelerator_options import AcceleratorDevice, AcceleratorOptions
+from docling.datamodel.base_models import InputFormat
+from docling.datamodel.pipeline_options import ThreadedPdfPipelineOptions
+from docling.document_converter import DocumentConverter, PdfFormatOption
+from docling.pipeline.threaded_standard_pdf_pipeline import ThreadedStandardPdfPipeline
+
+# Explicitly configure GPU acceleration
+accelerator_options = AcceleratorOptions(
+    device=AcceleratorDevice.CUDA,  # Use CUDA for NVIDIA GPUs
+)
+
+# Configure pipeline for optimal GPU performance
+pipeline_options = ThreadedPdfPipelineOptions(
+    accelerator_options=accelerator_options,
+    ocr_batch_size=64,  # Increase batch size for GPU
+    layout_batch_size=64,  # Increase batch size for GPU
+    table_batch_size=4,
+)
+
+# Create converter with custom settings; options are passed per input format
+converter = DocumentConverter(
+    format_options={
+        InputFormat.PDF: PdfFormatOption(
+            pipeline_cls=ThreadedStandardPdfPipeline,
+            pipeline_options=pipeline_options,
+        )
+    }
+)
+
+# Convert documents
+result = converter.convert("document.pdf")
+```
+
+Adjust batch sizes based on your GPU memory (see Performance Optimization Tips below).
+
+
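Reviewer sketch (not part of the patch): the batch-size ranges listed under Performance Optimization Tips can be turned into a starting value chosen from the detected GPU memory. The function names are hypothetical. The thresholds sit slightly below the nominal 32/24/12 GB figures on the assumption that the driver reports a little less than the marketed capacity, and the fallback of 4 for smaller cards is a guess the guide does not cover. The lower end of each range is returned as a conservative default.

```python
from typing import Optional


def detect_gpu_memory_gb() -> Optional[float]:
    """Return total memory of GPU 0 in GiB, or None without torch or CUDA."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_properties(0).total_memory / 1024**3


def suggest_batch_size(total_memory_gb: float) -> int:
    """Map GPU memory to the lower end of the guide's batch-size ranges."""
    if total_memory_gb >= 30:  # 32 GB class (e.g. RTX 5090): guide says 64-128
        return 64
    if total_memory_gb >= 22:  # 24 GB class (e.g. RTX 4090): guide says 32-64
        return 32
    if total_memory_gb >= 11:  # 12 GB class (e.g. RTX 5070): guide says 16-32
        return 16
    return 4  # assumption: smaller cards are not covered by the guide


if __name__ == "__main__":
    memory_gb = detect_gpu_memory_gb()
    if memory_gb is None:
        print("No CUDA GPU detected; keep the default batch sizes.")
    else:
        print(f"{memory_gb:.1f} GiB detected, try batch size {suggest_batch_size(memory_gb)}")
```

The suggested value could then be passed as `ocr_batch_size` and `layout_batch_size`, raising it only while `nvidia-smi` still shows free memory.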
+
+## GPU-Accelerated VLM Pipeline
+
+For maximum performance with Vision Language Models (VLM), you can run a local inference server on your RTX GPU. This approach provides significantly better throughput than inline VLM processing.
+
+### Linux: Using vLLM (Recommended)
+
+vLLM provides the best performance for GPU-accelerated VLM inference. Start the vLLM server with optimized parameters:
+
+```bash
+vllm serve ibm-granite/granite-docling-258M \
+    --host 127.0.0.1 --port 8000 \
+    --max-num-seqs 512 \
+    --max-num-batched-tokens 8192 \
+    --enable-chunked-prefill \
+    --gpu-memory-utilization 0.9
+```
+
+### Windows: Using llama-server
+
+On Windows, you can use `llama-server` from llama.cpp for GPU-accelerated VLM inference:
+
+#### Installation
+
+1. Download the latest llama.cpp release from the [GitHub releases page](https://github.com/ggml-org/llama.cpp/releases)
+2. Extract the archive and locate `llama-server.exe`
+
+#### Launch Command
+
+```powershell
+llama-server.exe `
+    --hf-repo ibm-granite/granite-docling-258M-GGUF `
+    -cb `
+    -ngl -1 `
+    --port 8000 `
+    --context-shift `
+    -np 16 -c 131072
+```
+
+!!! note "Performance Comparison"
+    vLLM delivers approximately **4x better performance** compared to llama-server. For Windows users seeking maximum performance, consider running vLLM via WSL2 (Windows Subsystem for Linux). See [vLLM on RTX 5090 via Docker](https://github.com/BoltzmannEntropy/vLLM-5090) for detailed WSL2 setup instructions.
+
+### Configure Docling for VLM Server
+
+Once your inference server is running, configure Docling to use it:
+
+```python
+from docling.datamodel import vlm_model_specs
+from docling.datamodel.base_models import InputFormat
+from docling.datamodel.pipeline_options import VlmPipelineOptions
+from docling.datamodel.settings import settings
+from docling.document_converter import DocumentConverter, PdfFormatOption
+from docling.pipeline.vlm_pipeline import VlmPipeline
+
+BATCH_SIZE = 64
+
+# Configure VLM options
+vlm_options = vlm_model_specs.GRANITEDOCLING_VLLM_API
+vlm_options.concurrency = BATCH_SIZE
+
+# When running with llama.cpp (llama-server), use a different model name:
+# vlm_options.params["model"] = "ibm-granite_granite-docling-258M-GGUF_granite-docling-258M-BF16.gguf"
+
+# Set page batch size to match or exceed concurrency
+settings.perf.page_batch_size = BATCH_SIZE
+
+# Wrap the VLM options in pipeline options; calling the local server requires remote services
+pipeline_options = VlmPipelineOptions(
+    vlm_options=vlm_options,
+    enable_remote_services=True,
+)
+
+# Create converter with VLM pipeline
+converter = DocumentConverter(
+    format_options={
+        InputFormat.PDF: PdfFormatOption(
+            pipeline_cls=VlmPipeline,
+            pipeline_options=pipeline_options,
+        )
+    }
+)
+```
+
+For more details on VLM pipeline configuration, see the [GPU Support Guide](../usage/gpu.md).
+
+## Performance Optimization Tips
+
+### Batch Size Tuning
+
+Adjust batch sizes based on your GPU memory:
+
+- **RTX 5090 (32GB)**: Use batch sizes of 64-128
+- **RTX 4090 (24GB)**: Use batch sizes of 32-64
+- **RTX 5070 (12GB)**: Use batch sizes of 16-32
+
+### Memory Management
+
+Monitor GPU memory usage:
+
+```python
+import torch
+
+# Check GPU memory
+if torch.cuda.is_available():
+    print(f"GPU Memory allocated: {torch.cuda.memory_allocated(0) / 1024**3:.2f} GB")
+    print(f"GPU Memory reserved: {torch.cuda.memory_reserved(0) / 1024**3:.2f} GB")
+```
+
+## Troubleshooting
+
+### CUDA Out of Memory
+
+If you encounter out-of-memory errors:
+
+1. Reduce batch sizes in `pipeline_options`
+2. Process fewer documents concurrently
+3. Clear GPU cache between batches:
+
+```python
+import torch
+torch.cuda.empty_cache()
+```
+
+### CUDA Not Available
+
+If `torch.cuda.is_available()` returns `False`:
+
+1. Verify NVIDIA drivers are installed: `nvidia-smi`
+2. Check CUDA installation: `nvcc --version`
+3. Reinstall PyTorch with correct CUDA version
+4. Ensure your GPU is CUDA-compatible
+
+### Performance Not Improving
+
+If GPU acceleration doesn't improve performance:
+
+1. Increase batch sizes (if memory allows)
+2. Ensure you're processing enough documents to benefit from GPU parallelization
+3. Check GPU utilization: `nvidia-smi -l 1`
+4. Verify PyTorch is using GPU: `torch.cuda.is_available()`
+
+## Additional Resources
+
+- [NVIDIA CUDA Documentation](https://docs.nvidia.com/cuda/)
+- [PyTorch CUDA Installation Guide](https://pytorch.org/get-started/locally/)
+- [Docling GPU Support Guide](../usage/gpu.md)
+- [GPU Performance Examples](../examples/gpu_standard_pipeline.py)
diff --git a/mkdocs.yml b/mkdocs.yml
index 2d483f9f..8d51fe60 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -57,6 +57,7 @@ nav:
   - Getting started :
     - Installation: getting_started/installation.md
     - Quickstart: getting_started/quickstart.md
+    - ⚡ RTX GPU: getting_started/rtx.md
   - Usage:
     - Advanced options: usage/advanced_options.md
     - Supported formats: usage/supported_formats.md