Docling Parse
A simple package that extracts text with coordinates from programmatic PDFs. It is part of the Docling conversion pipeline.
Quick start
Install the package from PyPI:
pip install docling-parse
Convert a PDF
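from docling_parse import pdf_parser

# Create a parser and extract the cells of every page
parser = pdf_parser()
doc = parser.find_cells("mydoc.pdf")

# Print the page index, cell index, and rnormalized content of each cell
for i, page in enumerate(doc["pages"]):
    for j, cell in enumerate(page["cells"]):
        print(i, "\t", j, "\t", cell["content"]["rnormalized"])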
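The nested layout used in the snippet above (pages, then cells, then `content` / `rnormalized`) can be flattened into one list of strings per page. The following is a minimal sketch under that assumption only: `page_texts` and `sample_doc` are hypothetical names introduced for illustration, and `sample_doc` is a hand-made stand-in for the parser output so that the example runs without a PDF.

```python
from typing import Any, Dict, List


def page_texts(doc: Dict[str, Any]) -> List[List[str]]:
    """Return one list of "rnormalized" cell strings per page.

    Assumes the dict layout shown in the quick start:
    doc["pages"][i]["cells"][j]["content"]["rnormalized"].
    """
    pages: List[List[str]] = []
    for page in doc.get("pages", []):
        texts: List[str] = []
        for cell in page.get("cells", []):
            # Skip cells that lack the expected "content" -> "rnormalized" keys.
            text = cell.get("content", {}).get("rnormalized")
            if text is not None:
                texts.append(text)
        pages.append(texts)
    return pages


# Hand-made stand-in for parser output, used only to illustrate the shape.
sample_doc = {
    "pages": [
        {
            "cells": [
                {"content": {"rnormalized": "Hello"}},
                {"content": {"rnormalized": "world"}},
            ]
        },
        {"cells": []},
    ]
}

for page_number, texts in enumerate(page_texts(sample_doc)):
    print(page_number, " ".join(texts))
```

With real data, the result of `parser.find_cells("mydoc.pdf")` would be passed in place of `sample_doc`.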
Development
CXX
To build the parser, run the following command in the root folder:
rm -rf build; cmake -B ./build; cd build; make
You can run the parser from your build folder with
./parse.exe <input-file> <optional-logging:true>
If you don't have an input file, a template input file will be printed on the terminal.
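The command line above can also be driven from a script. The following is a rough sketch, not part of the package: `build_parse_command` and `run_parse` are hypothetical helpers, and passing the literal string `true` as the second argument is an assumption based on the `<optional-logging:true>` placeholder.

```python
import subprocess
from pathlib import Path
from typing import List, Optional


def build_parse_command(
    build_dir: str,
    input_file: Optional[str] = None,
    logging: bool = False,
) -> List[str]:
    """Assemble the argument list for the parse.exe binary in build_dir."""
    cmd = [str(Path(build_dir) / "parse.exe")]
    if input_file is not None:
        cmd.append(input_file)
        if logging:
            # Assumption: logging is enabled by the literal argument "true".
            cmd.append("true")
    return cmd


def run_parse(
    build_dir: str,
    input_file: Optional[str] = None,
    logging: bool = False,
) -> subprocess.CompletedProcess:
    """Run the parser and capture its output as text."""
    return subprocess.run(
        build_parse_command(build_dir, input_file, logging),
        capture_output=True,
        text=True,
        check=False,  # leave return-code handling to the caller
    )


# Only the command is printed here; nothing is executed.
print(build_parse_command("build", "input.json", logging=True))
```

Calling `run_parse("build")` without an input file would correspond to the case described above, where the template input file is printed on the terminal.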
Python
To build the package, make sure Poetry is installed, then run:
poetry build
To test the package, run:
poetry run pytest ./tests/test_parse.py
Contributing
Please read Contributing to Docling Parse for details.
References
If you use Docling in your projects, please consider citing the following:
@software{Docling,
  author = {Deep Search Team},
  month = {7},
  title = {{Docling}},
  url = {https://github.com/DS4SD/docling},
  version = {main},
  year = {2024}
}
License
The Docling Parse codebase is under the MIT license. For individual model usage, please refer to the model licenses found in the original packages.