mirror of https://github.com/docling-project/docling-parse.git synced 2026-05-17 13:10:49 +00:00


Michele Dolfi 4856cdf677 add OSS release files

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

2024-08-05 15:49:13 +02:00

app

adding python resources

2024-08-05 11:35:19 +02:00

cmake

added extra parameters to qpdf build to enable native platform crypto

2024-08-05 14:00:31 +02:00

docling_parse

removed the PDF_DATA_DIR and replaced it with the resources

2024-08-05 14:12:37 +02:00

src

fixed segfault in infinite loop

2024-08-05 15:17:27 +02:00

src_data/proj_folders/pdf_library/data

initial commit

2024-07-30 13:31:03 +02:00

tests

removed the PDF_DATA_DIR and replaced it with the resources

2024-08-05 14:12:37 +02:00

.gitignore

removed assembler stuff

2024-07-31 13:58:29 +02:00

build.py

added the build script

2024-07-30 18:00:07 +02:00

CMakeLists.txt

back to O3

2024-08-05 15:18:42 +02:00

CODE_OF_CONDUCT.md

add OSS release files

2024-08-05 15:49:13 +02:00

CONTRIBUTING.md

add OSS release files

2024-08-05 15:49:13 +02:00

LICENSE

add OSS release files

2024-08-05 15:49:13 +02:00

MAINTAINERS.md

add OSS release files

2024-08-05 15:49:13 +02:00

poetry.lock

removed assembler stuff

2024-07-31 13:58:29 +02:00

pyproject.toml

add OSS release files

2024-08-05 15:49:13 +02:00

README.md

add OSS release files

2024-08-05 15:49:13 +02:00

README.md

Docling Parse

A simple package for extracting text with coordinates from programmatic PDFs, that is, PDFs that were generated digitally rather than scanned. This package is part of the Docling document conversion project.

Quick start

Install the package from PyPI:

pip install docling-parse

Convert a PDF

from docling_parse import pdf_parser

# Parse the PDF into a dict that holds a list of pages
parser = pdf_parser()
doc = parser.find_cells("mydoc.pdf")

# Print the page index, cell index, and text of every cell
for i, page in enumerate(doc["pages"]):
    for j, cell in enumerate(page["cells"]):
        print(i, "\t", j, "\t", cell["content"]["rnormalized"])
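Once the cells are loaded, it is often convenient to collect the text per page instead of printing it. The following is a minimal sketch that relies only on the keys used in the snippet above ("pages", "cells", "content", "rnormalized"). The helper names iter_cell_text and page_texts, and the hand-written sample_doc, are hypothetical and not part of the docling_parse API; the real parser output may contain further fields that are not shown here.

```python
from typing import Any, Dict, Iterator, List, Tuple


def iter_cell_text(doc: Dict[str, Any]) -> Iterator[Tuple[int, int, str]]:
    """Yield (page_index, cell_index, text) for every cell in the document."""
    for i, page in enumerate(doc.get("pages", [])):
        for j, cell in enumerate(page.get("cells", [])):
            yield i, j, cell["content"]["rnormalized"]


def page_texts(doc: Dict[str, Any], sep: str = " ") -> List[str]:
    """Join the cell texts of each page into one string per page."""
    return [
        sep.join(cell["content"]["rnormalized"] for cell in page.get("cells", []))
        for page in doc.get("pages", [])
    ]


# Hand-written stand-in for parser output (assumption: it mirrors only the
# keys used in the quick-start snippet, nothing else).
sample_doc = {
    "pages": [
        {
            "cells": [
                {"content": {"rnormalized": "Hello"}},
                {"content": {"rnormalized": "world"}},
            ]
        },
        {"cells": []},
    ]
}

if __name__ == "__main__":
    for i, j, text in iter_cell_text(sample_doc):
        print(i, "\t", j, "\t", text)
    print(page_texts(sample_doc))
```

To use it on a real document, pass the dict returned by parser.find_cells("mydoc.pdf") in place of sample_doc.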

Development

CXX

To build the parser, run the following command in the root folder:

rm -rf build; cmake -B ./build; cd build; make

You can run the parser from your build folder with:

./parse.exe <input-file> <optional-logging:true>

If you don't provide an input file, a template input file is printed to the terminal.
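If you want to drive the compiled parser from a script, the usage line above can be wrapped with Python's standard subprocess module. This is a sketch under assumptions: the function names build_parse_command and run_parse are hypothetical, and passing the literal string "true" as the second argument to enable logging is inferred from the <optional-logging:true> placeholder rather than confirmed by the source.

```python
import subprocess
from typing import List, Optional


def build_parse_command(
    executable: str,
    input_file: Optional[str] = None,
    logging: bool = False,
) -> List[str]:
    """Assemble the argument list for the parser executable."""
    cmd = [executable]
    if input_file is not None:
        cmd.append(input_file)
        # Assumption: logging is switched on by a literal "true" argument.
        if logging:
            cmd.append("true")
    return cmd


def run_parse(
    executable: str = "./parse.exe",
    input_file: Optional[str] = None,
    logging: bool = False,
) -> subprocess.CompletedProcess:
    """Run the parser and capture its stdout and stderr as text."""
    cmd = build_parse_command(executable, input_file, logging)
    return subprocess.run(cmd, capture_output=True, text=True, check=False)
```

Calling run_parse() with no input file from inside the build folder should then return the template input file in the stdout attribute of the result, as described above.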

Python

To build the package, make sure Poetry is installed, then run:

poetry build

To test the package, run:

poetry run pytest ./tests/test_parse.py

Contributing

Please read Contributing to Docling Parse (CONTRIBUTING.md) for details.

References

If you use Docling in your projects, please consider citing the following:

@software{Docling,
author = {Deep Search Team},
month = {7},
title = {{Docling}},
url = {https://github.com/DS4SD/docling},
version = {main},
year = {2024}
}

License

The Docling Parse codebase is released under the MIT license. For individual model usage, please refer to the model licenses found in the original packages.