mirror of https://github.com/docling-project/docling-parse.git synced 2026-05-17 13:10:49 +00:00

Christoph Auer 6fdd74870d feat!: Upgrade to v2.0.0 (#48 )

* feat!: Upgrade to v2.0.0

Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>

* Dummy change

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* rename old parser as pdf_parser_v1

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>

2024-10-23 14:15:52 +02:00

.github

feat!: Upgrade to v2.0.0 (#48 )

2024-10-23 14:15:52 +02:00

app

feat!: Upgrade to v2.0.0 (#48 )

2024-10-23 14:15:52 +02:00

cmake

fix: cmake-cxxopts by using similar approach as glm (#44 )

2024-10-18 13:56:07 +02:00

docling_parse

feat!: Upgrade to v2.0.0 (#48 )

2024-10-23 14:15:52 +02:00

docs/example_visualisations

feat: add an experimental v2 parser to improve performance (#29 )

2024-10-11 11:52:33 +02:00

src

feat: fixed the v2 parser to only return the pages that are requested (#47 )

2024-10-23 10:14:39 +02:00

tests

feat!: Upgrade to v2.0.0 (#48 )

2024-10-23 14:15:52 +02:00

.gitignore

feat: build using system deps (#33 )

2024-10-02 12:34:52 +02:00

.pre-commit-config.yaml

add pre-commit and update CONTRIBUTING

2024-08-05 15:55:15 +02:00

build.py

feat: build using system deps (#33 )

2024-10-02 12:34:52 +02:00

CHANGELOG.md

chore: bump version to 1.6.2 [skip ci]

2024-10-18 12:08:42 +00:00

CMakeLists.txt

feat: add an experimental v2 parser to improve performance (#29 )

2024-10-11 11:52:33 +02:00

CODE_OF_CONDUCT.md

add OSS release files

2024-08-05 15:49:13 +02:00

CONTRIBUTING.md

add pre-commit and update CONTRIBUTING

2024-08-05 15:55:15 +02:00

LICENSE

add OSS release files

2024-08-05 15:49:13 +02:00

MAINTAINERS.md

update OSS info

2024-08-06 09:52:36 +02:00

poetry.lock

feat: add an experimental v2 parser to improve performance (#29 )

2024-10-11 11:52:33 +02:00

pyproject.toml

chore: bump version to 1.6.2 [skip ci]

2024-10-18 12:08:42 +00:00

README.md

feat!: Upgrade to v2.0.0 (#48 )

2024-10-23 14:15:52 +02:00

README.md

Docling Parse

Simple package to extract text, paths and bitmap images with coordinates from programmatic PDFs. This package is used in the Docling PDF conversion.

Version	Original	Word-level	Snippet-level	Performance
V1		Not supported		~0.250 sec/page
V2				~0.050 sec/page [~5-10X faster than v1]

Quick start

Install the package from PyPI

pip install docling-parse

Convert a PDF (see visualise.py for a more detailed example)

# import io  # only needed for the BytesIO variant below

from docling_parse.docling_parse import pdf_parser_v2

# Do this only once to load fonts (avoid initialising it many times)
parser = pdf_parser_v2()

# parser.set_loglevel(1) # 1=error, 2=warning, 3=success, 4=info

doc_file = "my-doc.pdf" # filename
doc_key = f"key={doc_file}" # unique document key (eg hash, UUID, etc)

# Load the document from file using filename doc_file. This only loads
# the QPDF document, but no extracted data
success = parser.load_document(doc_key, doc_file)

# Alternatively, open the file in binary mode and read its contents
# with open(doc_file, "rb") as file:
#     file_content = file.read()

# Create a BytesIO object from the file contents and load from it
# bytes_io = io.BytesIO(file_content)
# success = parser.load_document_from_bytesio(doc_key, bytes_io)

# Parse the entire document in one go. This is easier, but could require
# a lot more memory than parsing page-by-page
# json_doc = parser.parse_pdf_from_key(doc_key)

# Get number of pages
num_pages = parser.number_of_pages(doc_key)

# Parse page by page to minimize memory footprint
for page in range(0, num_pages):

    # Internal memory for page is auto-deleted after this call.
    # No need to unload a specific page
    json_doc = parser.parse_pdf_from_key_on_page(doc_key, page)

    if "pages" not in json_doc:  # page could not get parsed
        continue

    # The requested page is the only entry, so it sits at index 0
    json_page = json_doc["pages"][0]

    # <Insert your own code>

# Unload the (QPDF) document and buffers
parser.unload_document(doc_key)

# Unloads everything at once
# parser.unload_documents()
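
The page-by-page loop above can be wrapped in a small generator so that the document is always unloaded, even if a page raises or the generator is closed early. This is only a sketch: `iter_pages` is a hypothetical helper name, not part of docling-parse, and it relies solely on the parser methods already used above (`load_document`, `number_of_pages`, `parse_pdf_from_key_on_page`, `unload_document`). It also assumes that `load_document` returns a truthy value on success, as the `success` variable above suggests.

```python
from typing import Any, Dict, Iterator, Tuple


def iter_pages(parser: Any, doc_key: str, doc_file: str) -> Iterator[Tuple[int, Dict]]:
    """Yield (page_index, page_dict) for every page that could be parsed.

    `parser` is expected to be a pdf_parser_v2 instance, or any object
    exposing the same four methods. This helper is hypothetical.
    """
    # Assumption: load_document returns a truthy value on success.
    if not parser.load_document(doc_key, doc_file):
        raise RuntimeError(f"could not load {doc_file!r}")
    try:
        for page in range(parser.number_of_pages(doc_key)):
            json_doc = parser.parse_pdf_from_key_on_page(doc_key, page)
            if "pages" not in json_doc:  # page could not be parsed
                continue
            # Only the requested page is returned, so it sits at index 0.
            yield page, json_doc["pages"][0]
    finally:
        # Runs when the loop finishes, raises, or the generator is closed.
        parser.unload_document(doc_key)
```

With a real parser this could be used as `for page, json_page in iter_pages(pdf_parser_v2(), "my-key", "my-doc.pdf"): ...`, keeping the load and unload calls out of the calling code.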

Use the CLI

$ docling-parse -h
usage: docling-parse [-h] -p PDF

Process a PDF file.

options:
  -h, --help         show this help message and exit
  -p PDF, --pdf PDF  Path to the PDF file

Development

CXX

To build the parser, simply run the following command in the root folder,

rm -rf build; cmake -B ./build; cd build; make

You can run the parser from your build folder. Example from parse_v1,

% ./parse_v1.exe -h
A program to process PDF files or configuration files
Usage:
  PDFProcessor [OPTION...]

  -i, --input arg          Input PDF file
  -c, --config arg         Config file
      --create-config arg  Create config file
  -o, --output arg         Output file
  -l, --loglevel arg       loglevel [error;warning;success;info]
  -h, --help               Print usage

Example from parse_v2,

% ./parse_v2.exe -h
program to process PDF files or configuration files
Usage:
  PDFProcessor [OPTION...]

  -i, --input arg          Input PDF file
  -c, --config arg         Config file
      --create-config arg  Create config file
  -p, --page arg           Pages to process (default: -1 for all) (default:
                           -1)
  -o, --output arg         Output file
  -l, --loglevel arg       loglevel [error;warning;success;info]
  -h, --help               Print usage

If you don't have an input file, a template input file will be printed to the terminal.
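
When driving the parse_v2 executable from a script, the options listed in its help text can be assembled programmatically. The following is a minimal sketch, not part of the project: `build_parse_v2_command` is a hypothetical helper, the default path `./build/parse_v2.exe` is an assumption about where the build placed the binary, and only flags shown in the help output (`-i`, `-p`, `-o`, `-l`) are used. That `--loglevel` accepts the bare words from the bracketed list is likewise an assumption.

```python
from typing import List, Optional

# Log levels as listed in the help text of the executable.
LOGLEVELS = ("error", "warning", "success", "info")


def build_parse_v2_command(
    input_pdf: str,
    output_file: Optional[str] = None,
    page: Optional[int] = None,
    loglevel: str = "error",
    executable: str = "./build/parse_v2.exe",  # assumed location of the binary
) -> List[str]:
    """Return an argument list for the parse_v2 command-line tool."""
    if loglevel not in LOGLEVELS:
        raise ValueError(f"loglevel must be one of {LOGLEVELS}, got {loglevel!r}")
    cmd = [executable, "-i", input_pdf, "-l", loglevel]
    # Omitting -p leaves the tool on its documented default (all pages).
    if page is not None:
        if page < 0:
            raise ValueError("page must be non-negative; omit it to process all pages")
        cmd += ["-p", str(page)]
    if output_file is not None:
        cmd += ["-o", output_file]
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once the binary has been built and the input PDF exists.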

Python

To build the package, simply run (make sure poetry is installed),

poetry build

To test the package, run:

poetry run pytest ./tests -v -s

Contributing

Please read Contributing to Docling Parse for details.

References

If you use Docling in your projects, please consider citing the following:

@techreport{Docling,
  author = {Deep Search Team},
  month = {8},
  title = {Docling Technical Report},
  url = {https://arxiv.org/abs/2408.09869},
  eprint = {2408.09869},
  doi = {10.48550/arXiv.2408.09869},
  version = {1.0.0},
  year = {2024}
}

License

The Docling Parse codebase is under the MIT license. For individual model usage, please refer to the model licenses found in the original packages.