291 Commits

Author SHA1 Message Date
Peter Staar 29fe809cf0 refactored the tests
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2025-06-05 16:54:37 +02:00
Peter Staar 5c955e6d75 formatted the code
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2025-06-05 16:50:54 +02:00
Peter Staar 40aee366bf added script mode
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2025-06-05 16:38:09 +02:00
Panos Vagenas d8a5256b2c feat: add table annotations (#304)
* feat: add table annotations

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* refactor annotation types

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* expand to HTML

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* introduce annotation serializer

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* Update dummy_doc.yaml

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>

---------

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2025-06-05 15:38:46 +02:00
github-actions[bot] 58d93e6eff chore: bump version to 2.33.1 [skip ci] v2.33.1 2025-06-04 09:46:14 +00:00
Michele Dolfi e17eabf0f9 fix: new typer version with new click (#315)
fix: typer version with new click

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-06-04 11:40:06 +02:00
Christoph Auer defd49efae fix: Support section_header levels in doctags deserialization (#313)
Adding support for section_header levels in doctags deserialization

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-06-03 11:40:13 +02:00
github-actions[bot] 71956ed253 chore: bump version to 2.33.0 [skip ci] v2.33.0 2025-06-02 08:47:52 +00:00
samiuc c521766cb7 feat: add BoundingBox methods for overlap and union calculations (#311)
* feat(BoundingBox): add methods for overlap and union calculations

Signed-off-by: samiullahchattha <Sami.Ullah1@ibm.com>

* format files

Signed-off-by: samiullahchattha <Sami.Ullah1@ibm.com>

---------

Signed-off-by: samiullahchattha <Sami.Ullah1@ibm.com>
Co-authored-by: samiullahchattha <Sami.Ullah1@ibm.com>
2025-06-02 10:45:20 +02:00
Panos Vagenas 8415969608 chore: exclude test data from GH Linguist (#309)
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2025-05-28 14:06:22 +02:00
Michele Dolfi 1b0b39b0b4 refactor: use uv as dependencies management and packaging (#307)
* use new pyproject.toml format with uv

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* update ci/cd scripts

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* update MD files

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* build without pre-commit cache

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* run pre-commit from uv

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* small changes

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add uv package install

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* Apply suggestions from code review

Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>

* docs: update README

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
Co-authored-by: Panos Vagenas <pva@zurich.ibm.com>
2025-05-28 13:06:06 +02:00
github-actions[bot] a523107f1b chore: bump version to 2.32.0 [skip ci] v2.32.0 2025-05-27 13:44:40 +00:00
Panos Vagenas 87b72d6537 fix(HybridChunker): refine max_tokens auto-detection (#306)
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2025-05-27 15:38:49 +02:00
Panos Vagenas f067c51c48 feat: add annotations in MD & HTML serialization (#295)
* feat: include annotations in MD & HTML serialization

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* (HTML) move annotations into figcaptions

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* add explicit beginning/end markers, fix case of excluded refs

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* improve annotation marking, extend tests

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* wrap captions (#305)

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* revert temp test changes

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

---------

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2025-05-27 10:49:13 +02:00
Cesar Berrospi Ramis 4a174b5679 chore: fix deprecation warnings (#303)
* chore: fix deprecation warnings

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

* chore: disregard deprecated captions from hierarchical chunker in hybrid chunker

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

* chore: update poetry lock file

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

---------

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
2025-05-26 15:58:08 +02:00
github-actions[bot] b021374940 chore: bump version to 2.31.2 [skip ci] v2.31.2 2025-05-22 15:28:08 +00:00
Panos Vagenas ebc356a787 fix: fix hybrid chunker legacy patching (#300)
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2025-05-22 13:34:50 +02:00
github-actions[bot] 6274b1a3c2 chore: bump version to 2.31.1 [skip ci] v2.31.1 2025-05-20 19:02:29 +00:00
Said Gürbüz ae1cdcedad test: update tests for load_from_doctags (#299)
* update tests for load_from_doctags

Signed-off-by: Saidgurbuz <said.gurbuz@epfl.ch>

* update tests to fix the flakiness

Signed-off-by: Saidgurbuz <said.gurbuz@epfl.ch>

---------

Signed-off-by: Saidgurbuz <said.gurbuz@epfl.ch>
2025-05-20 15:06:25 +02:00
Panos Vagenas c49a50e76b fix(markdown): fix case of empty page break string (#298)
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2025-05-20 10:51:24 +02:00
github-actions[bot] 81760f56fd chore: bump version to 2.31.0 [skip ci] v2.31.0 2025-05-18 08:44:00 +00:00
Panos Vagenas 6a7eb537eb feat: provide visualizer option in HTML split view (#294)
* feat: provide visualizer option in HTML split view

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* loosen test

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

---------

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2025-05-16 11:22:12 +02:00
github-actions[bot] 56f70de798 chore: bump version to 2.30.1 [skip ci] v2.30.1 2025-05-14 15:22:51 +00:00
Christoph Auer aa957cf4b6 fix: Updates for labels and methods to support document GT annotation (#293)
Update labels and get_color

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-05-14 14:35:17 +02:00
github-actions[bot] e3bb22f789 chore: bump version to 2.30.0 [skip ci] v2.30.0 2025-05-06 12:23:25 +00:00
Christoph Auer ad88ecf845 fix: Add unit flags to SegmentedPage (#286)
Add unit flags to SegmentedPage

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-05-06 14:00:47 +02:00
Peter W. J. Staar 7f83f1ce84 feat: add image group serialization in html (#284)
* feat: add-image group serialization in html

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* reformatted the code

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2025-05-06 13:59:47 +02:00
Said Gürbüz 511fb98a03 fix: update deserialization for better recovery (#282)
update deserialization for better recovery

Signed-off-by: Saidgurbuz <said.gurbuz@epfl.ch>
2025-05-06 10:25:36 +02:00
Peter W. J. Staar 2f0f12160b feat: adding the label picture_group (#283)
* feat: adding the label picture_group

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* renamed group to area

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* reformatted the code

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* reformatted the docs

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2025-05-06 09:43:35 +02:00
Panos Vagenas 7eb9fa96e6 fix: include captions regardless of traverse_pictures flag (#278)
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2025-05-05 10:11:18 +02:00
Michele Dolfi 4b967ab55c fix: hashlib usage for FIPS (#280)
fix usage of hashlib for FIPS

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-05-02 15:00:02 +02:00
github-actions[bot] 801f0187f7 chore: bump version to 2.29.0 [skip ci] v2.29.0 2025-05-01 05:55:59 +00:00
Panos Vagenas 8677d6e9c6 fix: fix multi-provenance item visualization (#277)
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2025-04-30 15:55:27 +02:00
Panos Vagenas d05fe08546 feat: promote serializers to stable API (#276)
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2025-04-30 11:26:39 +02:00
Rahul Das 591fe59357 fix: added return value for crop_text method in segmentedPdfPage Class (#275)
* Added Return value for crop_text method in segmentedPdfPage Class

Signed-off-by: rahuldas-dev <r.das699@gmail.com>

* fix: added return value of crop_text method along with return type annotation plus doc string

Signed-off-by: rahuldas-dev <r.das699@gmail.com>

---------

Signed-off-by: rahuldas-dev <r.das699@gmail.com>
2025-04-29 13:25:55 +02:00
Said Gürbüz 8f85d056e8 fix: make load_from_doctags method static (#273)
make load_from_doctags method static

Signed-off-by: Saidgurbuz <said.gurbuz@epfl.ch>
2025-04-29 07:52:54 +02:00
github-actions[bot] c66d8dd757 chore: bump version to 2.28.1 [skip ci] v2.28.1 2025-04-25 10:49:31 +00:00
Peter W. J. Staar a947440545 fix: visualization of document pages without items (#271)
* fix: visualization of document pages without items

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* reformatted the code

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* fixed MyPy

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* fixed MyPy (2)

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* fixed MyPy (3)

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* fixed flake

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* fixed flake (2)

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2025-04-25 11:41:08 +02:00
Said Gürbüz d9709d0b12 fix: UnboundLocal variable (#269)
fix UnboundLocal variable

Signed-off-by: Saidgurbuz <said.gurbuz@epfl.ch>
2025-04-24 16:51:17 +02:00
github-actions[bot] 03a53dd51d chore: bump version to 2.28.0 [skip ci] v2.28.0 2025-04-23 07:42:47 +00:00
Eugene c30ada64d6 chore: Types fixes (#267)
* docs: Recommend installing all groups for mypy to work properly

Signed-off-by: Eugene <fogaprod@gmail.com>

* chore: Fix some incomplete type defs

Signed-off-by: Eugene <fogaprod@gmail.com>

---------

Signed-off-by: Eugene <fogaprod@gmail.com>
2025-04-22 10:29:20 +02:00
Guillermo 763e1364ff feat: Add tiktoken tokenizers support to HybridChunker (#240)
* feat: Add tiktoken tokenizers support to HybridChunker

Signed-off-by: ruizguille <guillermo@codeawake.com>

* separate OpenAI tokenizer

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

---------

Signed-off-by: ruizguille <guillermo@codeawake.com>
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
Co-authored-by: Panos Vagenas <pva@zurich.ibm.com>
2025-04-18 07:43:20 +02:00
Eugene c19c5164cb chore: typehint DoclingDocument.export_to_dict return type (#264)
Signed-off-by: Eugene <fogaprod@gmail.com>
2025-04-17 12:29:17 +02:00
Panos Vagenas a258d525e1 feat: add visualizers (#263)
* feat: add visualizers

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* make visualizers composable

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* use BoundingRectangle instead of BoundingBox

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* enforce top-left coordinates

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* narrow down test data to first 3 pages

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* add file deletions

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

---------

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2025-04-17 11:43:45 +02:00
github-actions[bot] 8b676b97ba chore: bump version to 2.27.0 [skip ci] v2.27.0 2025-04-16 14:48:37 +00:00
Peter W. J. Staar d0a49da6de fix: HTML serialization for single image documents (#261)
* fix for HTML serialization for single image documents

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* minor refactor, add test

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
Co-authored-by: Panos Vagenas <pva@zurich.ibm.com>
Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2025-04-16 16:40:27 +02:00
Cesar Berrospi Ramis 1af07218e1 fix(codecov): fix codecov argument and yaml file (#260)
* fix(codecov): fix codecov argument and yaml file

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

* ci: set the codecov status to success even if the CI fails

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

---------

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
2025-04-15 16:53:33 +02:00
Christoph Auer 159f61d6d8 fix: Safer label color API (#259)
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-04-15 16:28:10 +02:00
Maxim Lysak caa8aeefae feat: Chart tabular data serialization for HTML serializer (#258)
* Add chart tabular data serialization to HTML serializer

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Fixing pre-commit issues

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Fixed table serialization for tabular chart data

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* test for loading doctags with chart data

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Improved loading doctags test with chart example, added tests for chart serialization into html and md

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

---------

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
2025-04-15 13:31:50 +02:00
Michele Dolfi 64bafa1afe ci: add tests in ci with coverage upload (#257)
* add coverage of tests

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* use coverage from pre-commit

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-04-15 10:08:07 +02:00