59 Commits

Author SHA1 Message Date
Christoph Auer b066b26215 feat!: Public threaded PDF parser and rendering API (#265)
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2026-05-11 15:37:22 +02:00
Peter W. J. Staar ac0a361a4f feat: rendering of math and latex symbols (#264)
2026-05-08 12:14:22 +02:00
Eric Van Boxsom e56632d962 fix: locale-independent float parsing (fixes docling#1455) (#243)
Signed-off-by: Eric Van Boxsom <14831976+evb87-tech@users.noreply.github.com>
2026-05-06 13:11:31 -04:00
Peter W. J. Staar 8546560474 feat: add jpeg2000 pixel data (#259)
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2026-04-22 08:47:15 +02:00
Peter W. J. Staar b5804c1654 fix: refactored the black to ruff (#258)
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2026-04-18 05:56:59 +02:00
Peter W. J. Staar 7be5d62336 feat: add jbig2 decoder (#252)
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2026-04-17 15:46:44 +02:00
Peter W. J. Staar 70fa30054e feat: adding the cpp analysis script and enhancing the extraction of bitmap types (fix for rotated images). (#250)
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2026-04-15 05:30:14 +02:00
Peter W. J. Staar c3c1e85da3 feat: improve extraction from fillable fields (#247)
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2026-04-07 14:42:32 +02:00
Peter W. J. Staar e7ef57fbf6 feat: extend the renderer (#245)
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2026-04-01 06:48:09 +02:00
Peter W. J. Staar 1f650dd412 fix: bo10k document failures (#244)
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2026-03-23 19:49:33 +01:00
Peter W. J. Staar ae66f6ddf0 feat: add parallelization for parsing (#216)
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2026-03-04 10:42:04 +01:00
Peter W. J. Staar 856c0fedb9 fix: ligatures and unicode chars in Differences (#234)
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2026-03-03 15:40:39 +01:00
Peter W. J. Staar 96570232f6 feat: add config option to remove glyph output (#231)
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2026-02-24 09:36:17 +01:00
Peter W. J. Staar 237cef698a fix: replace fixed-size utf8::append buffers with std::back_inserter to prevent segfaults (#224)
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2026-02-20 07:26:30 +01:00
Peter W. J. Staar 6d984796a9 fix: rotated pages (missing commits) (#219)
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2026-02-17 16:50:28 +01:00
Peter W. J. Staar e7812a122a feat: Refactor pdf resources to pdf page item (#215)
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2026-02-13 17:25:41 +01:00
Peter W. J. Staar 67d2922913 feat: refactored the code and removed a lot of extra json parameters (#213)
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2026-02-12 18:23:20 +01:00
Peter W. J. Staar 3272dd8d0b feat: removing the json from the pdf-parser (#210)
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2026-02-11 07:30:12 +01:00
Peter W. J. Staar ea5f1d8d7b feat: renaming lines to shapes and enriching with graphics (color, filling and stroking) (#209)
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2026-02-10 05:35:19 +01:00
Peter W. J. Staar f01ce848aa feat: add decoding config to decode_page (#208)
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2026-02-06 15:32:39 +01:00
Peter W. J. Staar 25672da1e8 feat: add-image-extraction (#207)
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2026-02-04 17:35:00 +01:00
Peter W. J. Staar fe25ac9854 chore: added the constrained test groundtruth (#205)
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2026-01-30 11:22:47 +01:00
Sam Quigley bb0b4ef0b1 fix: recursively traverse parent chain for inherited MediaBox (#204)
Signed-off-by: Sam Quigley <quigley@emerose.com>
2026-01-30 08:46:13 +01:00
Peter W. J. Staar 23c7fb8e8f feat: add typed serialization (#201)
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2026-01-28 17:51:51 +01:00
Peter W. J. Staar a98871e9e3 chore: removed the v2 naming in the code (#198)
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2026-01-26 13:36:37 +01:00
Peter W. J. Staar adcb9b00e5 feat!: Remove deprecated v1 api (#189)
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2026-01-19 17:16:32 +01:00
Peter W. J. Staar ec6149ecd7 fix: updated the font-parsing (#193)
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2026-01-13 09:10:16 +01:00
Peter W. J. Staar 365a7175ce chore: remove timings from test_parse_v2 regression files (#192)
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2026-01-12 16:51:25 +01:00
Peter W. J. Staar a6facf09ec chore: updating test-parse-v2 regression data (#190)
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2026-01-12 14:46:16 +01:00
Michele Dolfi 1d3f78e514 fix: "could not find the page-dimensions" error solved restoring the parent mediabox (#181)
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-12-02 06:01:13 -08:00
Peter W. J. Staar 327dc4ba13 fix: 360 rotated pages (#177)
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2025-11-05 16:52:32 +01:00
Peter W. J. Staar f8d53ee481 feat: add perf tools (#165)
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2025-09-16 16:53:46 +02:00
Peter W. J. Staar 5ded3b8f7f fix: media box (#157)
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2025-08-22 12:26:29 +02:00
Peter W. J. Staar fe3482f7d7 feat: add page unloading (#150)
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2025-08-19 08:49:18 +02:00
Michele Dolfi 4a578a165c ci: switch to windows 2025 (#149)
Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Co-authored-by: Peter Staar <taa@zurich.ibm.com>
2025-08-04 17:00:57 +02:00
Rui Dias Gomes 29d62f58be chore: switch to uv (#135)
Signed-off-by: rmdg88 <rmdg88@gmail.com>
2025-06-24 13:00:03 +02:00
Peter W. J. Staar 8872e736bf feat: Fixed char ordering in text lines (#138)
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
2025-06-24 12:43:01 +02:00
Peter W. J. Staar 63972876e8 fix: glyph issue with encodings (#129)
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2025-06-20 10:40:09 +02:00
David Huggins-Daines 38ddbb5256 fix: Use FontMatrix to scale Type3 font metrics (#113)
Signed-off-by: David Huggins-Daines <dhd@ecolingui.ca>
2025-04-09 05:18:15 +02:00
Christoph Auer ca7d584fa3 feat!: Update API, naming, and tests. Move data model to docling-core (#107)
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2025-03-14 13:00:24 +01:00
Peter W. J. Staar c2f9741a5b feat: Establish char_cells, word_cells and line_cells, other fixes (#101)
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2025-02-18 09:54:17 +01:00
Peter W. J. Staar 25b1e64846 feat: add support for RtL (#94)
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2025-02-06 07:11:19 +01:00
Peter W. J. Staar 9718762209 feat: Added the pure chars and fixed the duplicate text (#91)
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2025-02-02 13:57:45 +01:00
Peter W. J. Staar d663eec5fd fix: added the fix for rotated pages (#90)
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2025-01-30 13:28:37 +01:00
Peter W. J. Staar de18986f03 fix: added more updates to better font-parsing (#87)
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Peter W. J. Staar <91719829+PeterStaar-IBM@users.noreply.github.com>
Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
2025-01-27 08:24:52 +01:00
Peter W. J. Staar 525ed8e380 feat: Update for complex fonts, rendering, and experimental high-level API (#82)
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
2025-01-17 18:46:16 +01:00
Peter W. J. Staar 1fccb29d3f feat!: Massive quality improvements to v2 parser and new sanitize_cells API (#73)
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-09 10:20:52 +01:00
Peter W. J. Staar 22cf280b1f feat: add the export of annotations and ToC (#58)
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2024-11-20 14:32:36 +01:00
Christoph Auer 6fdd74870d feat!: Upgrade to v2.0.0 (#48)
* feat!: Upgrade to v2.0.0

Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>

* Dummy change

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* rename old parser as pdf_parser_v1

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2024-10-23 14:15:52 +02:00
Peter W. J. Staar 48451ad095 feat: fixed the v2 parser to only return the pages that are requested (#47)
* fixed the v2 parser to only return the pages that are requested

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* updated the visualize script

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* fixed the default args for compilation

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* put std::make_pair to avoid warnings

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2024-10-23 10:14:39 +02:00