523 Commits

Author SHA1 Message Date
github-actions[bot] 7cc62cbde7 chore: bump version to 2.75.0 [skip ci] v2.75.0 2026-05-12 14:54:25 +00:00
Peter W. J. Staar 014948b0e8 feat: updated the HTML serialization (#609)
* updated the HTML serialization

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* fixed the tests

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2026-05-12 16:49:08 +02:00
vwe-ibm 48c5b97593 fix(DocLang): fix chemistry serialization. (#607)
* fix chemistry serialization. use molecule meta instead of classification

* DCO Remediation Commit for VWE@zurich.ibm.com;1G0002848;Valery Weber <vwe@login4.bluevela.rmf.ibm.com>

I, VWE@zurich.ibm.com;1G0002848;Valery Weber <vwe@login4.bluevela.rmf.ibm.com>, hereby add my Signed-off-by to this commit: e77b7e9478

Signed-off-by: VWE@zurich.ibm.com;1G0002848;Valery Weber <vwe@login3.bluevela.rmf.ibm.com>

---------

Signed-off-by: VWE@zurich.ibm.com;1G0002848;Valery Weber <vwe@login3.bluevela.rmf.ibm.com>
Co-authored-by: VWE@zurich.ibm.com;1G0002848;Valery Weber <vwe@login4.bluevela.rmf.ibm.com>
Co-authored-by: VWE@zurich.ibm.com;1G0002848;Valery Weber <vwe@login3.bluevela.rmf.ibm.com>
2026-05-08 13:43:26 +02:00
Michele Dolfi 6a6512ebe4 docs(security): Document security processes (#606)
add SECURITY.md file

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2026-05-08 10:00:26 +02:00
github-actions[bot] 275bd1d217 chore: bump version to 2.74.1 [skip ci] v2.74.1 2026-04-22 14:33:02 +00:00
Panos Vagenas 2087d0f362 fix: refine ImageRef URI handling (#595)
* fix: refine ImageRef URI handling

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* make URI handling configurable via settings

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

---------

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2026-04-22 16:13:10 +02:00
Ahmed Nassar 048f1720f6 fix(doclang): default DoclangDeserializer to page 1 (#590)
* fix(doclang): default DoclangDeserializer to page 1 and allow start offset

Doclang VLM output was assigning provenance to page 0 while DoclingDocument and ground truth use 1-based page numbers, which broke page-aligned matching.

- Default internal _page_no to 1 and add deserialize(..., page_no=1).
- Replace hardcoded self._page_no = 0 with the page_no argument.
- Update roundtrip_list_item_with_inline deserialized fixture for page 1.
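
A minimal sketch of why the 0-based default broke page-aligned matching. The `Prov` record, `match_by_page`, and this stand-alone `deserialize` are hypothetical illustrations, not the library's API; only the `page_no=1` default comes from the commit message.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prov:
    """Hypothetical provenance record: a page number plus an item label."""
    page_no: int
    label: str


def deserialize(labels, page_no=1):
    """Assign every item to `page_no`; the default is 1-based, as in the fix."""
    return [Prov(page_no=page_no, label=label) for label in labels]


def match_by_page(predicted, truth):
    """Keep only predicted items whose (page, label) pair exists in the truth."""
    truth_keys = {(p.page_no, p.label) for p in truth}
    return [p for p in predicted if (p.page_no, p.label) in truth_keys]


# Ground truth uses 1-based page numbers.
truth = [Prov(1, "title"), Prov(1, "text")]

zero_based = deserialize(["title", "text"], page_no=0)  # old behaviour
one_based = deserialize(["title", "text"])              # new default
```

With the old 0-based numbering nothing lines up with the ground truth, while the 1-based default matches both items.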

* DCO Remediation Commit for Ahmed Nassar AHN@zurich.ibm.com <AHN@zurich.ibm.com>

I, Ahmed Nassar AHN@zurich.ibm.com <AHN@zurich.ibm.com>, hereby add my Signed-off-by to this commit: 0c34abd5c3

Signed-off-by: Ahmed Nassar AHN@zurich.ibm.com <AHN@zurich.ibm.com>

---------

Signed-off-by: Ahmed Nassar AHN@zurich.ibm.com <AHN@zurich.ibm.com>
Co-authored-by: Ahmed Nassar AHN@zurich.ibm.com <AHN@zurich.ibm.com>
2026-04-21 17:54:48 +02:00
Panos Vagenas 473fbacfb9 fix: refine remote filename handling (#591)
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2026-04-21 16:37:04 +02:00
github-actions[bot] 0425dc0c03 chore: bump version to 2.74.0 [skip ci] v2.74.0 2026-04-17 06:48:41 +00:00
Matteo b72af126b7 fix(DocLang): fix chemistry serialization (#584)
added correct serialization of chemistry into doclang

Signed-off-by: MatteoOmenetti <omenetti.matteo@gmail.com>
2026-04-17 07:33:45 +02:00
Smeet Agrawal 9dc882dc48 feat(serializer): add MsExcelMarkdownDocSerializer for sheet-name headings (#587)
* feat(serializer): add MsExcelMarkdownDocSerializer for sheet-name headings

Add `MsExcelMarkdownFallbackSerializer` and `MsExcelMarkdownDocSerializer`
to the serializer package so that `GroupLabel.SHEET` groups are rendered as
level-2 Markdown headings when exporting Excel-sourced documents.
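
As a rough illustration of the rendering described above, the sketch below turns sheet names into level-2 Markdown headings. `render_sheet_heading` and `render_workbook` are hypothetical helpers written for this example; they do not reproduce the serializer classes named in the commit.

```python
def render_sheet_heading(sheet_name: str) -> str:
    """Render a sheet name as a level-2 Markdown heading."""
    return f"## {sheet_name.strip()}"


def render_workbook(sheets: dict[str, list[str]]) -> str:
    """Emit each sheet's heading followed by its already-serialized blocks."""
    parts: list[str] = []
    for name, blocks in sheets.items():
        parts.append(render_sheet_heading(name))
        parts.extend(blocks)
    # Blank lines between blocks keep headings and tables separate in Markdown.
    return "\n\n".join(parts)
```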

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* DCO Remediation Commit for Smeet23 <smeetagrawal2003@gmail.com>

I, Smeet23 <smeetagrawal2003@gmail.com>, hereby add my Signed-off-by to this commit: 2a3808e5dc

Signed-off-by: Smeet23 <smeetagrawal2003@gmail.com>

* style: apply ruff formatter to markdown_excel.py

Signed-off-by: Smeet23 <smeetagrawal2003@gmail.com>

---------

Signed-off-by: Smeet23 <smeetagrawal2003@gmail.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 07:01:13 +02:00
Cesar Berrospi Ramis 6cbdee9626 fix: prevent numeric precision loss in Markdown table serialization (#588)
* fix: preserve numeric precision in markdown serialization

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* test: apply style conventions to test_serialization.py module

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

---------

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2026-04-15 17:16:57 +02:00
odelliab f2a61868d4 feat: DocChunk expansion (#549)
* line_chunker

* split table to header and body

* duplicate table headers

* Revert "duplicate table headers"

This reverts commit 5d17bdacf2.

* Revert "split table to header and body"

This reverts commit 91b43f97e4.

* Revert "line_chunker"

This reverts commit 5cc61d93fb.

* chunk expansion + test

Signed-off-by: odelliab <odelliab@il.ibm.com>

* small fixes

Signed-off-by: odelliab <odelliab@il.ibm.com>

* small fixes

Signed-off-by: odelliab <odelliab@il.ibm.com>

* remove unnecessary code

Signed-off-by: odelliab <odelliab@il.ibm.com>

* DCO Remediation Commit for odelliab <odelliab@il.ibm.com>

I, odelliab <odelliab@il.ibm.com>, hereby add my Signed-off-by to this commit: 5cc61d93fb
I, odelliab <odelliab@il.ibm.com>, hereby add my Signed-off-by to this commit: 91b43f97e4
I, odelliab <odelliab@il.ibm.com>, hereby add my Signed-off-by to this commit: 5d17bdacf2
I, odelliab <odelliab@il.ibm.com>, hereby add my Signed-off-by to this commit: a50392e53c
I, odelliab <odelliab@il.ibm.com>, hereby add my Signed-off-by to this commit: e5894290d5
I, odelliab <odelliab@il.ibm.com>, hereby add my Signed-off-by to this commit: 30c72a99be
I, odelliab <91875866+odelliab@users.noreply.github.com>, hereby add my Signed-off-by to this commit: 6aa0019fe0

Signed-off-by: odelliab <odelliab@il.ibm.com>

* change names

Signed-off-by: odelliab <odelliab@il.ibm.com>

* remove some tests

Signed-off-by: odelliab <odelliab@il.ibm.com>

* address review comments

Signed-off-by: odelliab <odelliab@il.ibm.com>

* Apply suggestions from code review

Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
Signed-off-by: odelliab <91875866+odelliab@users.noreply.github.com>

* consolidate tests

Signed-off-by: odelliab <odelliab@il.ibm.com>

* fix failing test

Signed-off-by: odelliab <odelliab@il.ibm.com>

* test: regenerate ground truth for hybrid chunker test

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* refactor: extract chunk expansion into TreeChunkExpander and PageChunkExpander

- Create chunk_expander.py with TreeChunkExpander and PageChunkExpander classes
- Remove DocChunk expansion methods
- Improve docstrings
- Refactor tests: test_doc_chunk_expansion.py → test_chunk_expander.py

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

---------

Signed-off-by: odelliab <odelliab@il.ibm.com>
Signed-off-by: odelliab <91875866+odelliab@users.noreply.github.com>
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2026-04-15 17:16:28 +02:00
github-actions[bot] d2e79f4a80 chore: bump version to 2.73.0 [skip ci] v2.73.0 2026-04-09 08:08:14 +00:00
Cesar Berrospi Ramis 18f573899b feat(outline): extend OutlineDocSerializer with filtering capabilities (#580)
* feat(outline): add filtering params to OutlineDocSerializer

Add start_item and max_level parameters for flexible outline generation.
Refactor get_parts to reduce complexity below Ruff C901 threshold.
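
One way to picture the two filters, as a sketch under assumptions: the `Node` tree and `outline` function are hypothetical, `max_level` is modelled as a depth cut-off, and `start_item` is modelled by passing a sub-node as the root. The real parameters may behave differently.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical outline node: a title plus nested children."""
    title: str
    children: list["Node"] = field(default_factory=list)


def outline(node, max_level=None, level=1):
    """Return indented outline lines, skipping anything deeper than max_level."""
    if max_level is not None and level > max_level:
        return []
    lines = ["  " * (level - 1) + node.title]
    for child in node.children:
        lines.extend(outline(child, max_level, level + 1))
    return lines


doc = Node("Doc", [Node("A", [Node("A.1")]), Node("B")])
```

Calling `outline(doc, max_level=2)` drops the `A.1` entry, and calling `outline(doc.children[0])` starts the outline at the `A` subtree.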

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* fix(outline): preserve spans in serialization results

Pass span_source to create_ser_result in all outline serializers.
Add test_outline_serialization_spans to verify get_unique_doc_items().

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* refactor(outline): reduce complexity and improve code quality

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

---------

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2026-04-09 09:56:02 +02:00
Jordan White 7d8c9db2a7 docs: Fixes a typo in CONTRIBUTING.md (#582)
* Fixes a typo in CONTRIBUTING.md

I spotted this while checking the contributing docs to create my
previous MR and thought it'd be nice to just push up a quick fix.

* DCO Remediation Commit for Jordan White <jordan.page.white@gmail.com>

I, Jordan White <jordan.page.white@gmail.com>, hereby add my Signed-off-by to this commit: b4026f63dd

Signed-off-by: Jordan White <jordan.page.white@gmail.com>

---------

Signed-off-by: Jordan White <jordan.page.white@gmail.com>
2026-04-09 07:57:11 +02:00
Peter W. J. Staar 46a9b5a329 feat: add latex and Tikz as codelabels (#579)
* feat: add latex and Tikz as codelabels

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* fix the docs

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2026-04-09 06:46:18 +02:00
github-actions[bot] b7d35cef4b chore: bump version to 2.72.0 [skip ci] v2.72.0 2026-04-07 12:35:15 +00:00
mergify[bot] aaa1fbb031 ci(mergify): upgrade configuration to current format (#576)
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
2026-04-06 09:19:25 +02:00
Panos Vagenas 00c3bb223d feat(Doclang): add newline handling (#575)
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2026-04-01 17:13:37 +02:00
Peter W. J. Staar f20068db91 feat: add transforms in the hierarchy (#572)
* feat: add transforms in the hierarchy

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* implement hierarchizing & flattening based on shift/tree operations

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* address conflict

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
Co-authored-by: Panos Vagenas <pva@zurich.ibm.com>
2026-04-01 17:12:06 +02:00
github-actions[bot] 0388a99570 chore: bump version to 2.71.0 [skip ci] v2.71.0 2026-03-30 15:47:39 +00:00
Panos Vagenas 0bd5d8e649 feat: add code representation meta field (#573)
* feat: add code representation meta field

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* add code language

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* restore custom field API changes

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

---------

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2026-03-30 17:43:39 +02:00
Panos Vagenas c9b51520d2 fix(Doclang): improve checkbox serialization & deserialization (#570)
Wrap in `<text>` if needed.

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2026-03-30 12:31:07 +02:00
Panos Vagenas a1535bc1d8 fix(Doclang): fix serialization order in text items (#571)
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2026-03-30 10:44:45 +02:00
Panos Vagenas 57187b2364 chore: address subtree moving within same parent (#564)
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2026-03-30 10:41:54 +02:00
Panos Vagenas fe9bbfbb0f feat(Doclang): add content layer support (#568)
* feat(Doclang): add content layer support

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* rename layer attribute

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

---------

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2026-03-30 10:40:49 +02:00
Vittorio Pippi fb3b603bc4 feat: add handwriting support (#561)
* feat: Add HANDWRITTEN_TEXT label support

Add full integration for the HANDWRITTEN_TEXT document item label.

Changes:
- tokens.py: Add HANDWRITTEN_TEXT to DocumentToken enum and mapping
- document.py: Add HANDWRITTEN_TEXT to DEFAULT_EXPORT_LABELS

Also adds test/test_handwritten_text_label.py for integration tests.

* DCO Remediation Commit for Vittorio Pippi <vpi@wila.zurich.ibm.com>

I, Vittorio Pippi <vpi@wila.zurich.ibm.com>, hereby add my Signed-off-by to this commit: acbdfa3902e74da44a5844ad0cecab8657da4904

Signed-off-by: Vittorio Pippi <vpi@wila.zurich.ibm.com>

* add handwriting support to Doclang

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* DCO Remediation Commit for Vittorio Pippi <vpi@wila.zurich.ibm.com>

I, Vittorio Pippi <vpi@wila.zurich.ibm.com>, hereby add my Signed-off-by to this commit: 000ccc55c5

Signed-off-by: Vittorio Pippi <vpi@wila.zurich.ibm.com>

---------

Signed-off-by: Vittorio Pippi <vpi@wila.zurich.ibm.com>
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
Co-authored-by: Vittorio Pippi <vpi@wila.zurich.ibm.com>
Co-authored-by: Panos Vagenas <pva@zurich.ibm.com>
2026-03-26 08:43:29 +01:00
Panos Vagenas 0cfb663275 fix: extend validation to address duplicate refs (#565)
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2026-03-25 15:43:28 +01:00
Panos Vagenas 159eb8f021 fix(Doclang): fix group serialization (#566)
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2026-03-25 14:31:32 +01:00
Christoph Auer b65dd24212 fix: repair table children when rich table cells break hierarchy (#563)
* fix: repair table children when rich table cells break hierarchy

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* require _normalize_table_children_from_rich_cells to be called explicitly

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2026-03-23 16:01:55 +01:00
github-actions[bot] 2808317b3f chore: bump version to 2.70.2 [skip ci] v2.70.2 2026-03-20 15:37:35 +00:00
Ahmed Nassar 91ee7e2302 fix(Doclang): suppress empty elements in Doclang serialization (#554)
* fix(Doclang): suppress empty elements in Doclang serialization

Add `suppress_empty_elements` parameter to `DoclangParams` that, when
enabled, omits text items, picture groups, tables, and inline elements
that produce no content instead of emitting empty open/close tag pairs.
This avoids cluttering OCR-oriented output with vacant tags.
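
The effect of the flag can be sketched as follows. `emit` and `serialize` are hypothetical stand-ins that only mirror the described behaviour (skip elements with no content rather than writing an empty open/close pair); the tag names are illustrative.

```python
def emit(tag: str, content: str, suppress_empty_elements: bool = False) -> str:
    """Wrap content in a tag pair, or return '' for empty content when suppressing."""
    if suppress_empty_elements and not content.strip():
        return ""
    return f"<{tag}>{content}</{tag}>"


def serialize(items, suppress_empty_elements=False):
    """Serialize (tag, content) pairs in order, dropping suppressed elements."""
    parts = (emit(tag, content, suppress_empty_elements) for tag, content in items)
    return "".join(part for part in parts if part)


items = [("text", "Hello"), ("text", ""), ("table", "")]
```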

Signed-off-by: Ahmed Nassar <nassarofficial@gmail.com>

* added tests

Signed-off-by: MatteoOmenetti <omenetti.matteo@gmail.com>

---------

Signed-off-by: Ahmed Nassar <nassarofficial@gmail.com>
Signed-off-by: MatteoOmenetti <omenetti.matteo@gmail.com>
Co-authored-by: MatteoOmenetti <omenetti.matteo@gmail.com>
2026-03-20 14:49:59 +01:00
Panos Vagenas 4807381050 chore: improve key-value migration (#559)
* fix: improve picture KV migration

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* consider page nr in bbox indexing & comparisons

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* remove unused internal method

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* improve further cases

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* fix: support arbitrary cell ids

(Until now a cell's id was assumed to be its position in `.cells`)
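
A small sketch of the difference between the old positional assumption and an id-based lookup. The dictionary layout and the `cell_id` key are assumptions made for this example, not the library's data model.

```python
# Two cells whose ids do not match their positions in the list.
cells = [{"cell_id": 7, "text": "Name"}, {"cell_id": 3, "text": "Alice"}]


def by_position(cells, cell_id):
    """Old assumption: the id is the index into the list."""
    return cells[cell_id]


def by_id(cells, cell_id):
    """Arbitrary ids: build an index keyed by id and look the cell up there."""
    index = {cell["cell_id"]: cell for cell in cells}
    return index[cell_id]
```

Here `by_id(cells, 3)` finds the second cell, whereas `by_position(cells, 3)` fails because the list has only two entries.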

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

---------

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2026-03-20 14:32:30 +01:00
Cesar Berrospi Ramis de575b7c3a chore: add item label in outline serializer (#558)
* chore(serializer): add item label to outline serializer

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* chore(serializer): leverage pydantic for validating and dumping in outline serializer

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* test(serializer): test '_format_indented_text_line' from outline serializer

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* fix(serializer): pass kwargs to all outline serializers for format consistency

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

---------

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2026-03-20 13:16:57 +01:00
samiuc 3e030edc6f fix: expose traverse_pictures in export_to_markdown and export_to_text (#557)
* fix: expose traverse_pictures in export_to_markdown and export_to_text

Signed-off-by: samiuc <sami.ullah.chat@gmail.com>

* address review comment

Signed-off-by: samiuc <sami.ullah.chat@gmail.com>

---------

Signed-off-by: samiuc <sami.ullah.chat@gmail.com>
2026-03-20 08:12:54 +01:00
jhchoi1182 f97ec83f67 fix: sync picture classification enums with DocumentFigureClassifier-v2.0 model (#529)
* fix: add missing picture classification tokens for photograph and geographical_map

The DocumentFigureClassifier model outputs 'photograph' and 'geographical_map'
class names that were not defined in _PictureClassificationToken and
PictureClassificationLabel enums, causing ValueError during DocTags serialization.
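
The failure mode is standard Python `Enum` behaviour: looking up a value that has no member raises `ValueError`. In the sketch below, `OldLabel`, `NewLabel`, and `to_label` are hypothetical, and `bar_chart` and `pie_chart` are placeholder members; only `photograph` and `geographical_map` come from the commit message.

```python
from enum import Enum


class OldLabel(str, Enum):
    """Enum before the fix: the model's new class names are missing."""
    BAR_CHART = "bar_chart"
    PIE_CHART = "pie_chart"


class NewLabel(str, Enum):
    """Enum after the fix: the missing class names are defined."""
    BAR_CHART = "bar_chart"
    PIE_CHART = "pie_chart"
    PHOTOGRAPH = "photograph"
    GEOGRAPHICAL_MAP = "geographical_map"


def to_label(enum_cls, class_name):
    """Map a model class name to an enum member, or None if it is undefined."""
    try:
        return enum_cls(class_name)
    except ValueError:
        # Enum lookup by value raises ValueError for unknown values.
        return None
```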

Signed-off-by: jhchoi1182 <jhchoi1182@gmail.com>

* fix: add all missing picture classification tokens from DocumentFigureClassifier-v2.0

The v2.0 model outputs 12 additional class names not defined in enums:
scatter_plot, box_plot, table, full_page_image, page_thumbnail,
chemistry_structure, screenshot_from_computer, screenshot_from_manual,
topographical_map, engineering_drawing, music, calendar, crossword_puzzle.

Signed-off-by: jhchoi1182 <jhchoi1182@gmail.com>

* fix: update extract_picture_classification to include v2.0 labels

The all_labels list in extract_picture_classification was missing the
newly added v2.0 classification labels, which would cause classification
info to be lost when loading documents from DocTags format.

Signed-off-by: jhchoi1182 <jhchoi1182@gmail.com>

* chore: reorganize picture classification labels and tokens

Reorganize the list of picture classification labels and tokens to separate
those produced by the current model from the rest.

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

---------

Signed-off-by: jhchoi1182 <jhchoi1182@gmail.com>
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2026-03-19 16:00:00 +01:00
github-actions[bot] 2a4377b713 chore: bump version to 2.70.1 [skip ci] v2.70.1 2026-03-17 14:06:00 +00:00
Cesar Berrospi Ramis afa5bd9e3a chore: document-level metadata serialization via body field (#551)
* feat: add 'meta' field to DoclingDocument at root level

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* test(metadata): reuse DoclingDocument objects with fixtures

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* chore(serializer): add document-level meta to the outline serialization

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* test: add helper function to compare or regenerate ground truth files

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* test: regenerate (upgrade) ground truth files

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* refactor(serializer): set default indent to 2 spaces in outline serializer

To be aligned with other indented text serializations, the outline serializer takes 2 spaces
as default indent for serialization.

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* refactor(serializer): add line break after ref in outline md serialization

To render a line break after the reference element in the outline markdown serializer, add 2 trailing spaces.
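
In Markdown, a line that ends with two or more spaces renders as a hard line break. A minimal sketch of that convention follows; `ref_line` and `render_entry` are hypothetical helpers, and the reference string is only an example.

```python
def ref_line(ref: str) -> str:
    """Append two trailing spaces so Markdown renders a hard line break."""
    return f"{ref}  "


def render_entry(ref: str, summary: str) -> str:
    """Put the summary on the line after the reference."""
    return f"{ref_line(ref)}\n{summary}"
```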

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* refactor: rollback adding 'meta' field to DoclingDocument

Revert the changes in commit 7e17c52, since document-level meta-information can
already be stored through the 'body' field, which is of type 'NodeItem'
and therefore has a 'meta' field.

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

---------

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2026-03-17 14:55:52 +01:00
Cesar Berrospi Ramis 0a3b2787e0 fix(markdown): remove assert statements to support Python optimization mode (#548)
fix(markdown): remove 'assert' statement in runtime code
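
Python strips `assert` statements when run with `-O`, so a runtime check written as an assertion silently disappears in optimization mode. The sketch below contrasts the two styles; both function names and the `ValueError` choice are assumptions for illustration, not the code changed in this commit.

```python
def cell_text_with_assert(cell):
    """Relies on assert: the check is removed entirely under `python -O`."""
    assert cell is not None, "cell must not be None"
    return str(cell)


def cell_text_checked(cell):
    """Explicit check: behaves the same with or without `-O`."""
    if cell is None:
        raise ValueError("cell must not be None")
    return str(cell)
```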

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2026-03-17 11:25:39 +01:00
Panos Vagenas c57e50ac43 fix: improve rich table cell validation (#550)
Extend validation to also cover case of missing reference to
rich child on table item.

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2026-03-16 17:13:53 +01:00
github-actions[bot] 1513f7d171 chore: bump version to 2.70.0 [skip ci] v2.70.0 2026-03-13 15:06:05 +00:00
Panos Vagenas b56f75190f chore(Doclang): remove inline element (#517)
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2026-03-13 15:58:42 +01:00
Panos Vagenas b93d5a3920 feat: introduce field data model incl. Doclang serialization (#519)
* feat: introduce new kv data model

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* add test

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* extend tests

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* add form-table test

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* update invoice test

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* add migration

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* include pre-migration YAML

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* add nesting test

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* switch to field naming

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* improve prov migration & location serialization

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* enable proper nested serialization without inline

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* add field marker and test

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* generalize marker naming

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* updated API

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* align with TextItem conventions

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* make FieldItem a DocItem

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* extend KV migration scope, extend tree manipulation operations

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* extend kv migration tests

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* add test with KV nesting

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* improve key splitting and location assignment

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* address conflicts

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* minor cleanup

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* remove unnecessary files

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* add add_content parameter, expose export_to_doclang

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* bump DoclingDocument version

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* address conflicts, add save_as_doclang

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

---------

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2026-03-13 15:37:42 +01:00
Cesar Berrospi Ramis 9f3c0c6757 chore: upgrade dependencies to address dependabot alerts (#543)
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2026-03-13 15:18:38 +01:00
Peter W. J. Staar 8d7859eeec feat: make an experimental outline serializer (#415)
* feat: make an experimental outline serializer

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* work ongoing

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* working outline serializer

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* style(outline): align style of outline serializer to docling-core's

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* fix(serializer): ensure 'include_non_meta' works as designed by CommonParams

Ensure the optional parameter 'include_non_meta' works as designed by CommonParams when used by OutlineDocSerializer.
Add regression tests for OutlineDocSerializer.

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* chore(serializer): outline serializer to optionally return in JSON format

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* chore(serializer): enable a custom set of labels for TOC serializer

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* test(serializer): add summary to title in test data

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* fix(serializer): ensure outline serializer keeps structured fields in JSON format

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* chore(serializer): refactor outline serializer to create proper markdown

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* refactor(serializer): use pydantic models for the outline serializer JSON representation

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* test: rename outline serialization module to align it with others

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* chore(serializer): add indented text format for outline serializer

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2026-03-13 15:14:40 +01:00
Cesar Berrospi Ramis af50f1cb07 feat: profile a document or collection (#511)
* feat: profile a document or collection

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* chore(profiler): add deciles and histograms

Add deciles and histograms to the Docling collection statistics.
Add an example script to plot histograms.

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* chore(profiler): add option to plot log frequencies in histogram

Add the option to plot the histogram frequencies in logarithmic scale.
Extend README with documentation on the document profiler.
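
For readers unfamiliar with the statistics mentioned, here is a standard-library sketch of deciles and a log-scaled histogram. The function names, the fixed-width binning, and the choice of `log10` are assumptions; the profiler itself may compute these differently.

```python
import math
from collections import Counter
from statistics import quantiles


def deciles(values):
    """Return the nine cut points that split the data into ten equal groups."""
    return quantiles(values, n=10, method="inclusive")


def log_histogram(values, bin_width):
    """Count values per fixed-width bin and report log10 of each frequency."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return {start: math.log10(count) for start, count in sorted(counts.items())}
```

A logarithmic frequency axis keeps rare, very large documents visible next to the many small ones that would otherwise dominate the plot.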

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* test(profiler): cover missing lines in doc_profiler with tests

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

---------

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2026-03-13 13:36:38 +01:00
odelliab b435090fdf feat: split html table to headers and body (#532)
* line_chunker

* split table to header and body

* duplicate table headers

* Revert "duplicate table headers"

This reverts commit 5d17bdacf2.

* Revert "split table to header and body"

This reverts commit 91b43f97e4.

* Revert "line_chunker"

This reverts commit 5cc61d93fb.

* line chunker

* split table to header and body

* duplicate table headers

* DCO Remediation Commit for odelliab <odelliab@il.ibm.com>

I, odelliab <odelliab@il.ibm.com>, hereby add my Signed-off-by to this commit: 5cc61d93fb
I, odelliab <odelliab@il.ibm.com>, hereby add my Signed-off-by to this commit: 91b43f97e4
I, odelliab <odelliab@il.ibm.com>, hereby add my Signed-off-by to this commit: 5d17bdacf2
I, odelliab <odelliab@il.ibm.com>, hereby add my Signed-off-by to this commit: a50392e53c
I, odelliab <odelliab@il.ibm.com>, hereby add my Signed-off-by to this commit: e5894290d5
I, odelliab <odelliab@il.ibm.com>, hereby add my Signed-off-by to this commit: 30c72a99be
I, odelliab <odelliab@il.ibm.com>, hereby add my Signed-off-by to this commit: 510e949692
I, odelliab <odelliab@il.ibm.com>, hereby add my Signed-off-by to this commit: 6c3a8f726e
I, odelliab <odelliab@il.ibm.com>, hereby add my Signed-off-by to this commit: 0642a0701e

Signed-off-by: odelliab <odelliab@il.ibm.com>

* style changes

* pre-commit fixes

* expected output name change

* DCO Remediation Commit for odelliab <odelliab@il.ibm.com>

I, odelliab <odelliab@il.ibm.com>, hereby add my Signed-off-by to this commit: 9b9ef09007
I, odelliab <odelliab@il.ibm.com>, hereby add my Signed-off-by to this commit: b3699e32cc
I, odelliab <odelliab@il.ibm.com>, hereby add my Signed-off-by to this commit: 9d393df546

Signed-off-by: odelliab <odelliab@il.ibm.com>

* Apply suggestion from @ceberam

Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
Signed-off-by: odelliab <91875866+odelliab@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
Signed-off-by: odelliab <91875866+odelliab@users.noreply.github.com>

* address review comments

Signed-off-by: odelliab <odelliab@il.ibm.com>

* refactor: move get_default_tokenizer to huggingface module

Move the method get_default_tokenizer to module huggingface.py to avoid circular dependencies
and avoid an unconventional import in a method

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* remove unnecessary lines

Signed-off-by: odelliab <odelliab@il.ibm.com>

* split html tables

Signed-off-by: odelliab <odelliab@il.ibm.com>

* add bs4 dependency

Signed-off-by: odelliab <odelliab@il.ibm.com>

* Apply suggestions from code review

Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
Signed-off-by: odelliab <91875866+odelliab@users.noreply.github.com>

* address review comments

Signed-off-by: odelliab <odelliab@il.ibm.com>

* typo

Signed-off-by: odelliab <odelliab@il.ibm.com>

* add beautifulsoup tests

Signed-off-by: odelliab <odelliab@il.ibm.com>

* refactor(serializer): drop dependency 'beautifulsoup4' in HTML serializer

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* test(serializer): refactor tests for HTML serializer

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

---------

Signed-off-by: odelliab <odelliab@il.ibm.com>
Signed-off-by: odelliab <91875866+odelliab@users.noreply.github.com>
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2026-03-13 10:40:05 +01:00
Anish Raghavendra e00125c477 feat: handle wide table outliers with LineBasedTokenChunker (#536)
* feat: handle large prefixes by splitting into multiple chunks in LineBasedTokenChunker

Signed-off-by: z404 <anishr890@gmail.com>

* feat: add option to omit table headers on row overflow

Signed-off-by: z404 <anishr890@gmail.com>

* chore: add warning when prefix is omitted on overflow

Signed-off-by: z404 <anishr890@gmail.com>

* refactor: simplify prefix chunking logic by removing redundant method

Signed-off-by: z404 <anishr890@gmail.com>

* fix: ensure prefix appears as standalone chunk when all lines overflow

Signed-off-by: z404 <anishr890@gmail.com>

* docs: add usage examples for LineBasedTokenChunker with prefix handling

Signed-off-by: z404 <anishr890@gmail.com>

* docs: remove examples and describe parameters for LineBasedTokenChunker

Signed-off-by: z404 <anishr890@gmail.com>

* docs: simplify class docstring by removing parameter details

Signed-off-by: z404 <anishr890@gmail.com>

---------

Signed-off-by: z404 <anishr890@gmail.com>
2026-03-13 07:29:46 +01:00
github-actions[bot] 24574110d3 chore: bump version to 2.69.0 [skip ci] v2.69.0 2026-03-09 04:31:38 +00:00