Peter Staar
29fe809cf0
refactored the tests
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2025-06-05 16:54:37 +02:00
Peter Staar
5c955e6d75
formatted the code
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2025-06-05 16:50:54 +02:00
Peter Staar
40aee366bf
added script mode
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2025-06-05 16:38:09 +02:00
Panos Vagenas
d8a5256b2c
feat: add table annotations (#304)
* feat: add table annotations
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
* refactor annotation types
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
* expand to HTML
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
* introduce annotation serializer
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
* Update dummy_doc.yaml
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
---------
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2025-06-05 15:38:46 +02:00
github-actions[bot]
58d93e6eff
chore: bump version to 2.33.1 [skip ci]
v2.33.1
2025-06-04 09:46:14 +00:00
Michele Dolfi
e17eabf0f9
fix: new typer version with new click (#315)
fix: typer version with new click
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-06-04 11:40:06 +02:00
Christoph Auer
defd49efae
fix: Support section_header levels in doctags deserialization (#313)
Adding support for section_header levels in doctags deserialization
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-06-03 11:40:13 +02:00
github-actions[bot]
71956ed253
chore: bump version to 2.33.0 [skip ci]
v2.33.0
2025-06-02 08:47:52 +00:00
samiuc
c521766cb7
feat: add BoundingBox methods for overlap and union calculations (#311)
* feat(BoundingBox): add methods for overlap and union calculations
Signed-off-by: samiullahchattha <Sami.Ullah1@ibm.com>
* format files
Signed-off-by: samiullahchattha <Sami.Ullah1@ibm.com>
---------
Signed-off-by: samiullahchattha <Sami.Ullah1@ibm.com>
Co-authored-by: samiullahchattha <Sami.Ullah1@ibm.com>
2025-06-02 10:45:20 +02:00
Panos Vagenas
8415969608
chore: exclude test data from GH Linguist (#309)
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2025-05-28 14:06:22 +02:00
Michele Dolfi
1b0b39b0b4
refactor: use uv as dependencies management and packaging (#307)
* use new pyproject.toml format with uv
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* update ci/cd scripts
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* update MD files
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* build without pre-commit cache
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* run pre-commit from uv
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* small changes
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* add uv package install
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* Apply suggestions from code review
Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
* docs: update README
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
Co-authored-by: Panos Vagenas <pva@zurich.ibm.com>
2025-05-28 13:06:06 +02:00
github-actions[bot]
a523107f1b
chore: bump version to 2.32.0 [skip ci]
v2.32.0
2025-05-27 13:44:40 +00:00
Panos Vagenas
87b72d6537
fix(HybridChunker): refine max_tokens auto-detection (#306)
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2025-05-27 15:38:49 +02:00
Panos Vagenas
f067c51c48
feat: add annotations in MD & HTML serialization (#295)
* feat: include annotations in MD & HTML serialization
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
* (HTML) move annotations into figcaptions
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
* add explicit beginning/end markers, fix case of excluded refs
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
* improve annotation marking, extend tests
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
* wrap captions (#305)
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* revert temp test changes
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
---------
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2025-05-27 10:49:13 +02:00
Cesar Berrospi Ramis
4a174b5679
chore: fix deprecation warnings (#303)
* chore: fix deprecation warnings
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
* chore: disregard deprecated captions from hierarchical chunker in hybrid chunker
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
* chore: update poetry lock file
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
---------
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
2025-05-26 15:58:08 +02:00
github-actions[bot]
b021374940
chore: bump version to 2.31.2 [skip ci]
v2.31.2
2025-05-22 15:28:08 +00:00
Panos Vagenas
ebc356a787
fix: fix hybrid chunker legacy patching (#300)
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2025-05-22 13:34:50 +02:00
github-actions[bot]
6274b1a3c2
chore: bump version to 2.31.1 [skip ci]
v2.31.1
2025-05-20 19:02:29 +00:00
Said Gürbüz
ae1cdcedad
test: update tests for load_from_doctags (#299)
* update tests for load_from_doctags
Signed-off-by: Saidgurbuz <said.gurbuz@epfl.ch>
* update tests to fix the flakiness
Signed-off-by: Saidgurbuz <said.gurbuz@epfl.ch>
---------
Signed-off-by: Saidgurbuz <said.gurbuz@epfl.ch>
2025-05-20 15:06:25 +02:00
Panos Vagenas
c49a50e76b
fix(markdown): fix case of empty page break string (#298)
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2025-05-20 10:51:24 +02:00
github-actions[bot]
81760f56fd
chore: bump version to 2.31.0 [skip ci]
v2.31.0
2025-05-18 08:44:00 +00:00
Panos Vagenas
6a7eb537eb
feat: provide visualizer option in HTML split view (#294)
* feat: provide visualizer option in HTML split view
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
* loosen test
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
---------
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2025-05-16 11:22:12 +02:00
github-actions[bot]
56f70de798
chore: bump version to 2.30.1 [skip ci]
v2.30.1
2025-05-14 15:22:51 +00:00
Christoph Auer
aa957cf4b6
fix: Updates for labels and methods to support document GT annotation (#293)
Update labels and get_color
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-05-14 14:35:17 +02:00
github-actions[bot]
e3bb22f789
chore: bump version to 2.30.0 [skip ci]
v2.30.0
2025-05-06 12:23:25 +00:00
Christoph Auer
ad88ecf845
fix: Add unit flags to SegmentedPage (#286)
Add unit flags to SegmentedPage
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-05-06 14:00:47 +02:00
Peter W. J. Staar
7f83f1ce84
feat: add image group serialization in html (#284)
* feat: add-image group serialization in html
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* reformatted the code
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
---------
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2025-05-06 13:59:47 +02:00
Said Gürbüz
511fb98a03
fix: update deserialization for better recovery (#282)
update deserialization for better recovery
Signed-off-by: Saidgurbuz <said.gurbuz@epfl.ch>
2025-05-06 10:25:36 +02:00
Peter W. J. Staar
2f0f12160b
feat: adding the label picture_group (#283)
* feat: adding the label picture_group
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* renamed group to area
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* reformatted the code
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* reformatted the docs
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
---------
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2025-05-06 09:43:35 +02:00
Panos Vagenas
7eb9fa96e6
fix: include captions regardless of traverse_pictures flag (#278)
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2025-05-05 10:11:18 +02:00
Michele Dolfi
4b967ab55c
fix: hashlib usage for FIPS (#280)
fix usage of hashlib for FIPS
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-05-02 15:00:02 +02:00
github-actions[bot]
801f0187f7
chore: bump version to 2.29.0 [skip ci]
v2.29.0
2025-05-01 05:55:59 +00:00
Panos Vagenas
8677d6e9c6
fix: fix multi-provenance item visualization (#277)
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2025-04-30 15:55:27 +02:00
Panos Vagenas
d05fe08546
feat: promote serializers to stable API (#276)
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2025-04-30 11:26:39 +02:00
Rahul Das
591fe59357
fix: added return value for crop_text method in segmentedPdfPage Class (#275)
* Added Return value for crop_text method in segmentedPdfPage Class
Signed-off-by: rahuldas-dev <r.das699@gmail.com>
* fix: added return value of crop_text method along with return type annotation plus doc string
Signed-off-by: rahuldas-dev <r.das699@gmail.com>
---------
Signed-off-by: rahuldas-dev <r.das699@gmail.com>
2025-04-29 13:25:55 +02:00
Said Gürbüz
8f85d056e8
fix: make load_from_doctags method static (#273)
make load_from_doctags method static
Signed-off-by: Saidgurbuz <said.gurbuz@epfl.ch>
2025-04-29 07:52:54 +02:00
github-actions[bot]
c66d8dd757
chore: bump version to 2.28.1 [skip ci]
v2.28.1
2025-04-25 10:49:31 +00:00
Peter W. J. Staar
a947440545
fix: visualization of document pages without items (#271)
* fix: visualization of document pages without items
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* reformatted the code
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* fixed MyPy
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* fixed MyPy (2)
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* fixed MyPy (3)
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* fixed flake
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* fixed flake (2)
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
---------
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2025-04-25 11:41:08 +02:00
Said Gürbüz
d9709d0b12
fix: UnboundLocal variable (#269)
fix UnboundLocal variable
Signed-off-by: Saidgurbuz <said.gurbuz@epfl.ch>
2025-04-24 16:51:17 +02:00
github-actions[bot]
03a53dd51d
chore: bump version to 2.28.0 [skip ci]
v2.28.0
2025-04-23 07:42:47 +00:00
Eugene
c30ada64d6
chore: Types fixes (#267)
* docs: Recommend installing all groups for mypy to work properly
Signed-off-by: Eugene <fogaprod@gmail.com>
* chore: Fix some incomplete type defs
Signed-off-by: Eugene <fogaprod@gmail.com>
---------
Signed-off-by: Eugene <fogaprod@gmail.com>
2025-04-22 10:29:20 +02:00
Guillermo
763e1364ff
feat: Add tiktoken tokenizers support to HybridChunker (#240)
* feat: Add tiktoken tokenizers support to HybridChunker
Signed-off-by: ruizguille <guillermo@codeawake.com>
* separate OpenAI tokenizer
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
---------
Signed-off-by: ruizguille <guillermo@codeawake.com>
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
Co-authored-by: Panos Vagenas <pva@zurich.ibm.com>
2025-04-18 07:43:20 +02:00
Eugene
c19c5164cb
chore: typehint DoclingDocument.export_to_dict return type (#264)
Signed-off-by: Eugene <fogaprod@gmail.com>
2025-04-17 12:29:17 +02:00
Panos Vagenas
a258d525e1
feat: add visualizers (#263)
* feat: add visualizers
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
* make visualizers composable
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
* use BoundingRectangle instead of BoundingBox
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
* enforce top-left coordinates
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
* narrow down test data to first 3 pages
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
* add file deletions
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
---------
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2025-04-17 11:43:45 +02:00
github-actions[bot]
8b676b97ba
chore: bump version to 2.27.0 [skip ci]
v2.27.0
2025-04-16 14:48:37 +00:00
Peter W. J. Staar
d0a49da6de
fix: HTML serialization for single image documents (#261)
* fix for HTML serialization for single image documents
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* minor refactor, add test
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
---------
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
Co-authored-by: Panos Vagenas <pva@zurich.ibm.com>
Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2025-04-16 16:40:27 +02:00
Cesar Berrospi Ramis
1af07218e1
fix(codecov): fix codecov argument and yaml file (#260)
* fix(codecov): fix codecov argument and yaml file
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
* ci: set the codecov status to success even if the CI fails
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
---------
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
2025-04-15 16:53:33 +02:00
Christoph Auer
159f61d6d8
fix: Safer label color API (#259)
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-04-15 16:28:10 +02:00
Maxim Lysak
caa8aeefae
feat: Chart tabular data serialization for HTML serializer (#258)
* Add chart tabular data serialization to HTML serializer
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Fixing pre-commit issues
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Fixed table serialization for tabular chart data
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* test for loading doctags with chart data
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Improved loading doctags test with chart example, added tests for chart serialization into html and md
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
---------
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
2025-04-15 13:31:50 +02:00
Michele Dolfi
64bafa1afe
ci: add tests in ci with coverage upload (#257)
* add coverage of tests
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* use coverage from pre-commit
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-04-15 10:08:07 +02:00