Commit Graph

482 Commits

Author SHA1 Message Date
github-actions[bot] 1513f7d171 chore: bump version to 2.70.0 [skip ci] v2.70.0 2026-03-13 15:06:05 +00:00
Panos Vagenas b56f75190f chore(Doclang): remove inline element (#517)
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2026-03-13 15:58:42 +01:00
Panos Vagenas b93d5a3920 feat: introduce field data model incl. Doclang serialization (#519)
* feat: introduce new kv data model

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* add test

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* extend tests

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* add form-table test

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* update invoice test

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* add migration

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* include pre-migration YAML

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* add nesting test

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* switch to field naming

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* improve prov migration & location serialization

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* enable proper nested serialization without inline

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* add field marker and test

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* generalize marker naming

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* updated API

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* align with TextItem conventions

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* make FieldItem a DocItem

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* extend KV migration scope, extend tree manipulation operations

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* extend kv migration tests

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* add test with KV nesting

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* improve key splitting and location assignment

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* address conflicts

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* minor cleanup

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* remove unnecessary files

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* add add_content parameter, expose export_to_doclang

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* bump DoclingDocument version

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* address conflicts, add save_as_doclang

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

---------

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2026-03-13 15:37:42 +01:00
Cesar Berrospi Ramis 9f3c0c6757 chore: upgrade dependencies to address dependabot alerts (#543)
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2026-03-13 15:18:38 +01:00
Peter W. J. Staar 8d7859eeec feat: make an experimental outline serializer (#415)
* feat: make an experimental outline serializer

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* work ongoing

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* working outline serializer

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* style(outline): align style of outline serializer to docling-core's

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* fix(serializer): ensure 'include_non_meta' works as designed by CommonParams

Ensure the optional parameter 'include_non_meta' works as designed by CommonParams when used by OutlineDocSerializer.
Add regression tests for OutlineDocSerializer.

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* chore(serializer): outline serializer to optionally return in JSON format

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* chore(serializer): enable a custom set of labels for TOC serializer

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* test(serializer): add summary to title in test data

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* fix(serializer): ensure outline serializer keeps structured fields in JSON format

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* chore(serializer): refactor outline serializer to create proper markdown

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* refactor(serializer): use pydantic models for the outline serializer JSON representation

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* test: rename outline serialization module to align it with others

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* chore(serializer): add indented text format for outline serializer

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2026-03-13 15:14:40 +01:00
Cesar Berrospi Ramis af50f1cb07 feat: profile a document or collection (#511)
* feat: profile a document or collection

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* chore(profiler): add deciles and histograms

Add deciles and histograms to the Docling collection statistics.
Add an example script to plot histograms.

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* chore(profiler): add option to plot log frequencies in histogram

Add the option to plot the histogram frequencies in logarithmic scale.
Extend README with documentation on the document profiler.
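
A minimal sketch of what deciles and a histogram with optional logarithmic frequencies could look like for one collection statistic, using only the standard library. The function names `deciles` and `histogram`, the bin count, and the sample page counts are illustrative assumptions, not the profiler's actual API.

```python
import math
from statistics import quantiles


def deciles(values):
    """Return the nine cut points that split `values` into ten groups."""
    return quantiles(values, n=10, method="inclusive")


def histogram(values, bins=5, log_scale=False):
    """Count values per equal-width bin; optionally report log10(1 + count)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1  # avoid a zero width when all values are equal
    counts = [0] * bins
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)  # clamp the maximum into the last bin
        counts[idx] += 1
    if log_scale:
        # log10(1 + c) keeps empty bins at 0 instead of failing on log(0).
        return [math.log10(1 + c) for c in counts]
    return counts


# Hypothetical per-document page counts for a small collection.
pages = [1, 2, 2, 3, 4, 5, 8, 13, 21, 100]
print(deciles(pages))
print(histogram(pages))
print(histogram(pages, log_scale=True))
```

With a skewed sample like this one, most documents fall into the first bin, which is the situation where a logarithmic frequency axis makes the sparse bins visible.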

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* test(profiler): cover missing lines in doc_profiler with tests

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

---------

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2026-03-13 13:36:38 +01:00
odelliab b435090fdf feat: split html table to headers and body (#532)
* line_chunker

* split table to header and body

* duplicate table headers

* Revert "duplicate table headers"

This reverts commit 5d17bdacf2.

* Revert "split table to header and body"

This reverts commit 91b43f97e4.

* Revert "line_chunker"

This reverts commit 5cc61d93fb.

* line chunker

* split table to header and body

* duplicate table headers

* DCO Remediation Commit for odelliab <odelliab@il.ibm.com>

I, odelliab <odelliab@il.ibm.com>, hereby add my Signed-off-by to this commit: 5cc61d93fb
I, odelliab <odelliab@il.ibm.com>, hereby add my Signed-off-by to this commit: 91b43f97e4
I, odelliab <odelliab@il.ibm.com>, hereby add my Signed-off-by to this commit: 5d17bdacf2
I, odelliab <odelliab@il.ibm.com>, hereby add my Signed-off-by to this commit: a50392e53c
I, odelliab <odelliab@il.ibm.com>, hereby add my Signed-off-by to this commit: e5894290d5
I, odelliab <odelliab@il.ibm.com>, hereby add my Signed-off-by to this commit: 30c72a99be
I, odelliab <odelliab@il.ibm.com>, hereby add my Signed-off-by to this commit: 510e949692
I, odelliab <odelliab@il.ibm.com>, hereby add my Signed-off-by to this commit: 6c3a8f726e
I, odelliab <odelliab@il.ibm.com>, hereby add my Signed-off-by to this commit: 0642a0701e

Signed-off-by: odelliab <odelliab@il.ibm.com>

* style changes

* pre-commit fixes

* expected output name change

* DCO Remediation Commit for odelliab <odelliab@il.ibm.com>

I, odelliab <odelliab@il.ibm.com>, hereby add my Signed-off-by to this commit: 9b9ef09007
I, odelliab <odelliab@il.ibm.com>, hereby add my Signed-off-by to this commit: b3699e32cc
I, odelliab <odelliab@il.ibm.com>, hereby add my Signed-off-by to this commit: 9d393df546

Signed-off-by: odelliab <odelliab@il.ibm.com>

* Apply suggestion from @ceberam

Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
Signed-off-by: odelliab <91875866+odelliab@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
Signed-off-by: odelliab <91875866+odelliab@users.noreply.github.com>

* address review comments

Signed-off-by: odelliab <odelliab@il.ibm.com>

* refactor: move get_default_tokenizer to huggingface module

Move the method get_default_tokenizer to the module huggingface.py to avoid circular dependencies
and an unconventional import inside a method.

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* remove unnecessary lines

Signed-off-by: odelliab <odelliab@il.ibm.com>

* split html tables

Signed-off-by: odelliab <odelliab@il.ibm.com>

* add bs4 dependency

Signed-off-by: odelliab <odelliab@il.ibm.com>

* Apply suggestions from code review

Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
Signed-off-by: odelliab <91875866+odelliab@users.noreply.github.com>

* address review comments

Signed-off-by: odelliab <odelliab@il.ibm.com>

* typo

Signed-off-by: odelliab <odelliab@il.ibm.com>

* add beautifulsoup tests

Signed-off-by: odelliab <odelliab@il.ibm.com>

* refactor(serializer): drop dependency 'beautifulsoup4' in HTML serializer

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* test(serializer): refactor tests for HTML serializer

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

---------

Signed-off-by: odelliab <odelliab@il.ibm.com>
Signed-off-by: odelliab <91875866+odelliab@users.noreply.github.com>
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2026-03-13 10:40:05 +01:00
Anish Raghavendra e00125c477 feat: handle wide table outliers with LineBasedTokenChunker (#536)
* feat: handle large prefixes by splitting into multiple chunks in LineBasedTokenChunker

Signed-off-by: z404 <anishr890@gmail.com>

* feat: add option to omit table headers on row overflow

Signed-off-by: z404 <anishr890@gmail.com>

* chore: add warning when prefix is omitted on overflow

Signed-off-by: z404 <anishr890@gmail.com>

* refactor: simplify prefix chunking logic by removing redundant method

Signed-off-by: z404 <anishr890@gmail.com>

* fix: ensure prefix appears as standalone chunk when all lines overflow

Signed-off-by: z404 <anishr890@gmail.com>

* docs: add usage examples for LineBasedTokenChunker with prefix handling

Signed-off-by: z404 <anishr890@gmail.com>

* docs: remove examples and describe parameters for LineBasedTokenChunker

Signed-off-by: z404 <anishr890@gmail.com>

* docs: simplify class docstring by removing parameter details

Signed-off-by: z404 <anishr890@gmail.com>

---------

Signed-off-by: z404 <anishr890@gmail.com>
2026-03-13 07:29:46 +01:00
github-actions[bot] 24574110d3 chore: bump version to 2.69.0 [skip ci] v2.69.0 2026-03-09 04:31:38 +00:00
Michele Dolfi 4eb0d20d04 feat: Loosen dependency version constraints (#534)
* update deps

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* fix code to support all versions and regen tests

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2026-03-09 05:27:48 +01:00
github-actions[bot] eb900064c8 chore: bump version to 2.68.0 [skip ci] v2.68.0 2026-03-07 12:19:50 +00:00
Cesar Berrospi Ramis a661bb10cb fix: prevent infinite loop in LineBasedTokenChunker with unbreakable tokens (#533)
* fix: prevent infinite loop in LineBasedTokenChunker with unbreakable tokens

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* tests: add a pytest fixture to test LineBasedTokenChunker efficiently

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* style: apply style conventions to test_line_chunker.py test module

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

---------

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2026-03-06 12:52:30 +01:00
Faiq Adzlan 43030488c9 chore: add support for pandas==3.0.0 (#513)
* feat: add support for `pandas==3.0.0`

* DCO Remediation Commit for wanadzhar913 <adzhar.faiq@gmail.com>
I, wanadzhar913 <adzhar.faiq@gmail.com>, hereby add my Signed-off-by to this commit: bb8f81a193

Signed-off-by: wanadzhar913 <adzhar.faiq@gmail.com>

* Revert "feat: add support for `pandas==3.0.0`"

This reverts commit bb8f81a193.

* chore: add support for pandas==3.0.0

* DCO Remediation Commit for wanadzhar913 <adzhar.faiq@gmail.com>

I, wanadzhar913 <adzhar.faiq@gmail.com>, hereby add my Signed-off-by to this commit: 440484229e
I, wanadzhar913 <adzhar.faiq@gmail.com>, hereby add my Signed-off-by to this commit: b47e96c02d

Signed-off-by: wanadzhar913 <adzhar.faiq@gmail.com>

* allow pandas >3 and update lock for it

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: wanadzhar913 <adzhar.faiq@gmail.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2026-03-06 11:25:56 +01:00
samiuc e363c951d8 feat: add plain-text serializer (#522)
* feat: add plain-text serializer

Signed-off-by: samiuc <sami.ullah.chat@gmail.com>

* fix code coverage

Signed-off-by: samiuc <sami.ullah.chat@gmail.com>

---------

Signed-off-by: samiuc <sami.ullah.chat@gmail.com>
2026-03-06 10:20:21 +01:00
github-actions[bot] 1c6ae32b46 chore: bump version to 2.67.1 [skip ci] v2.67.1 2026-03-05 09:11:24 +00:00
Ivan Traus 2debe0836f fix: prevent hang in export_to_markdown() on nested RichTableCells (#525)
* fix: prevent hang in export_to_markdown() on nested RichTableCells

When a table cell contains another table (via RichTableCell.ref pointing
to a nested table item), MarkdownTableSerializer recursively invoked
itself through doc_serializer.serialize(). Each level of nesting caused
the outer tabulate() call to receive cell strings that were already
formatted markdown tables, which then got &#124;-encoded and re-processed.
On documents with many such cells (e.g. Wikipedia pages with taxobox or
classification tables containing hyperlink-rich cells), this caused
exponential string growth, eventually exhausting memory or hanging.

Fix: add two module-level helpers:

  _cell_content_has_table(item, doc) — walks the item subtree and
  returns True if any node is a TableItem.

  _mark_subtree_visited(item, doc, visited) — adds all nodes in the
  subtree to the shared 'visited' set that the document serializer uses
  to prevent double-emitting items.

In MarkdownTableSerializer.serialize(), when a RichTableCell's content
contains a nested table, call _mark_subtree_visited() on it and then use
col.text (the precomputed plain-text fallback) instead of calling
doc_serializer.serialize(). This avoids the recursive MarkdownTableSerializer
call entirely, and keeps the visited set consistent so the nested table is
not emitted again as a separate top-level item.

For RichTableCells whose content does NOT contain a nested table (e.g.
italic text, lists, hyperlinks), doc_serializer.serialize() is called as
before, so inline Markdown formatting is preserved.

Bug introduced in docling-core v2.46.0 (commit 1d04154) when RichTableCell
markdown serialization was first added.
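
The two helpers described above can be sketched on a simplified tree. This is an illustration only: the `Node` class, the `ref` and `kind` fields, and `serialize_cell` are hypothetical stand-ins, and the real helpers take `(item, doc)` and operate on docling-core items.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a document item; not the docling-core model."""
    ref: str
    kind: str  # e.g. "text", "group" or "table"
    text: str = ""
    children: list = field(default_factory=list)


def iter_subtree(node):
    """Yield the node and all of its descendants, depth first."""
    yield node
    for child in node.children:
        yield from iter_subtree(child)


def cell_content_has_table(node):
    """Return True if any node in the subtree is a table."""
    return any(n.kind == "table" for n in iter_subtree(node))


def mark_subtree_visited(node, visited):
    """Record every node of the subtree so it is not emitted again later."""
    visited.update(n.ref for n in iter_subtree(node))


def serialize_cell(node, fallback_text, visited):
    """Use the plain-text fallback for cells that contain a nested table."""
    if cell_content_has_table(node):
        mark_subtree_visited(node, visited)
        return fallback_text
    return " ".join(n.text for n in iter_subtree(node) if n.text)
```

Marking the whole subtree as visited is what keeps the nested table from being serialized a second time as a top-level item once the fallback text has been used.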

Signed-off-by: Ivan Traus <ivan@liminary.io>

* style: apply ruff formatting to markdown serializer

Keep the branch green in local and CI pre-commit runs by applying Ruff's required formatting in the nested rich table serializer path.

Signed-off-by: Ivan Traus <ivan@liminary.io>

* test(markdown): cover descendant table detection in rich-cell helper

Signed-off-by: Ivan Traus <ivan@liminary.io>

* fix(markdown): preserve nested table content as flattened text

Instead of falling back to the opaque col.text placeholder when a
RichTableCell contains a nested table, walk the subtree and collect
text from all grid cells and doc items.  This preserves the actual
inner table content in the markdown output while still avoiding the
recursive doc_serializer.serialize() call that caused exponential
string growth and the OOM hang.

Signed-off-by: Ivan Traus <ivan@liminary.io>

* fix(markdown): use TextItem instead of DocItem in subtree text collector

Signed-off-by: Ivan Traus <ivan@liminary.io>

* refactor(markdown): flatten only nested tables, preserve sibling rich formatting

Address review feedback: replace the all-or-nothing _cell_content_has_table
branch with a _nested_in_table flag propagated through kwargs, so only
MarkdownTableSerializer flattens when nested — sibling rich elements (italic,
lists, etc.) keep their markdown formatting. Also replace getattr with
isinstance(NodeItem) + dot notation and remove string annotations.
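
A rough sketch of the flag-propagation idea, assuming a toy dict-based node format. Only the `_nested_in_table` keyword comes from the commit message; the `serialize` and `serialize_table` functions and the node layout are hypothetical, and the header delimiter row of a real Markdown table is omitted for brevity.

```python
def serialize(node, **kwargs):
    """Dispatch on a hypothetical dict-based node with a 'kind' key."""
    kind = node["kind"]
    if kind == "table":
        return serialize_table(node, **kwargs)
    if kind == "italic":
        return f"*{node['text']}*"
    if kind == "group":
        return " ".join(serialize(child, **kwargs) for child in node["children"])
    return node["text"]


def serialize_table(table, _nested_in_table=False, **kwargs):
    """Render rows as pipe-separated lines, or flatten when already inside a table."""
    if _nested_in_table:
        # Nested table: flatten to plain text instead of emitting another table.
        return " ".join(
            serialize(cell, _nested_in_table=True, **kwargs)
            for row in table["rows"]
            for cell in row
        )
    lines = []
    for row in table["rows"]:
        # Every cell is serialized with the flag set, so only tables react to it.
        cells = [serialize(cell, _nested_in_table=True, **kwargs) for cell in row]
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)


inner = {"kind": "table", "rows": [[{"kind": "text", "text": "x"}, {"kind": "text", "text": "y"}]]}
rich_cell = {"kind": "group", "children": [{"kind": "italic", "text": "note"}, inner]}
outer = {"kind": "table", "rows": [[rich_cell, {"kind": "text", "text": "b"}]]}
print(serialize(outer))
```

Because the flag is consulted only by the table serializer, the italic sibling keeps its Markdown formatting while the inner table collapses to plain text.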

Signed-off-by: Ivan Traus <ivan@liminary.io>

---------

Signed-off-by: Ivan Traus <ivan@liminary.io>
2026-03-05 09:54:27 +01:00
github-actions[bot] 7c21840c4f chore: bump version to 2.67.0 [skip ci] v2.67.0 2026-03-04 15:29:11 +00:00
odelliab ea359bcc63 feat: table aware chunking (#527)
* line_chunker

* split table to header and body

* duplicate table headers

* Revert "duplicate table headers"

This reverts commit 5d17bdacf2.

* Revert "split table to header and body"

This reverts commit 91b43f97e4.

* Revert "line_chunker"

This reverts commit 5cc61d93fb.

* line chunker

* split table to header and body

* duplicate table headers

* DCO Remediation Commit for odelliab <odelliab@il.ibm.com>

I, odelliab <odelliab@il.ibm.com>, hereby add my Signed-off-by to this commit: 5cc61d93fb
I, odelliab <odelliab@il.ibm.com>, hereby add my Signed-off-by to this commit: 91b43f97e4
I, odelliab <odelliab@il.ibm.com>, hereby add my Signed-off-by to this commit: 5d17bdacf2
I, odelliab <odelliab@il.ibm.com>, hereby add my Signed-off-by to this commit: a50392e53c
I, odelliab <odelliab@il.ibm.com>, hereby add my Signed-off-by to this commit: e5894290d5
I, odelliab <odelliab@il.ibm.com>, hereby add my Signed-off-by to this commit: 30c72a99be
I, odelliab <odelliab@il.ibm.com>, hereby add my Signed-off-by to this commit: 510e949692
I, odelliab <odelliab@il.ibm.com>, hereby add my Signed-off-by to this commit: 6c3a8f726e
I, odelliab <odelliab@il.ibm.com>, hereby add my Signed-off-by to this commit: 0642a0701e

Signed-off-by: odelliab <odelliab@il.ibm.com>

* style changes

* pre-commit fixes

* expected output name change

* DCO Remediation Commit for odelliab <odelliab@il.ibm.com>

I, odelliab <odelliab@il.ibm.com>, hereby add my Signed-off-by to this commit: 9b9ef09007
I, odelliab <odelliab@il.ibm.com>, hereby add my Signed-off-by to this commit: b3699e32cc
I, odelliab <odelliab@il.ibm.com>, hereby add my Signed-off-by to this commit: 9d393df546

Signed-off-by: odelliab <odelliab@il.ibm.com>

* Apply suggestion from @ceberam

Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
Signed-off-by: odelliab <91875866+odelliab@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
Signed-off-by: odelliab <91875866+odelliab@users.noreply.github.com>

* address review comments

Signed-off-by: odelliab <odelliab@il.ibm.com>

* refactor: move get_default_tokenizer to huggingface module

Move the method get_default_tokenizer to the module huggingface.py to avoid circular dependencies
and an unconventional import inside a method.

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

---------

Signed-off-by: odelliab <odelliab@il.ibm.com>
Signed-off-by: odelliab <91875866+odelliab@users.noreply.github.com>
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2026-03-04 16:25:57 +01:00
github-actions[bot] 9eca661df7 chore: bump version to 2.66.0 [skip ci] v2.66.0 2026-02-26 10:46:22 +00:00
Matvei Smirnov c566268e0a fix: rich table triplet serialization (#425)
* fix: rich table triplet serialization

Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Signed-off-by: Vdaleke <vdalekesmirnov@gmail.com>
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* refactor: remove kwargs from 'export_to_dataframe' signature

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

---------

Signed-off-by: Vdaleke <vdalekesmirnov@gmail.com>
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2026-02-26 11:39:34 +01:00
Cesar Berrospi Ramis 73b07572cf fix: support single-column table default serialization (#526)
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2026-02-24 12:23:01 +01:00
Cesar Berrospi Ramis b8ef7bad1b feat: add WebVTT export and save functionality (#523)
* feat(vtt): export and save to WebVTT format

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* fix(vtt): omit empty blocks in parsing

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* test(vtt): add tests for exporting and saving to WebVTT

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

---------

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2026-02-24 10:17:17 +01:00
github-actions[bot] cc4df04ac0 chore: bump version to 2.65.2 [skip ci] v2.65.2 2026-02-23 15:17:00 +00:00
Ultizan 6032c7c175 fix: accept relative URIs in PdfHyperlink without validation failure (#520)
PDF hyperlinks may contain relative paths, internal bookmarks, or
fragment-only references that are not valid absolute URLs. The strict
AnyUrl validation on PdfHyperlink.uri caused the entire page preprocess
stage to fail when such URIs were encountered, resulting in empty
documents and lost content.

Change uri type to Union[AnyUrl, str] with a field_validator that
attempts AnyUrl parsing first (preserving structured metadata like
scheme/host/path) and falls back to str for non-absolute URIs.
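
The parse-first, fall-back-to-string behaviour can be illustrated with the standard library alone. This is a sketch under assumptions: `AbsoluteUrl` and `parse_uri` are hypothetical names standing in for pydantic's `AnyUrl` and the actual field validator, and the scheme-plus-host check is a simplification chosen for this example.

```python
from typing import NamedTuple, Union
from urllib.parse import urlsplit


class AbsoluteUrl(NamedTuple):
    """Hypothetical structured result, standing in for pydantic's AnyUrl."""
    scheme: str
    host: str
    path: str


def parse_uri(value: str) -> Union[AbsoluteUrl, str]:
    """Return structured parts for absolute URLs, else the raw string."""
    parts = urlsplit(value)
    if parts.scheme and parts.netloc:
        return AbsoluteUrl(parts.scheme, parts.hostname or "", parts.path)
    # Relative paths, bookmarks and fragment-only references are kept verbatim.
    return value


print(parse_uri("https://example.com/a/b"))
print(parse_uri("#section-2"))
print(parse_uri("../other.pdf"))
```

The point of the fallback is that a non-absolute reference no longer raises during validation, so one unusual link cannot fail the whole page.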

Signed-off-by: Ultizan <ultizan@gmail.com>
2026-02-23 16:11:38 +01:00
Christoph Auer 6a04db77aa fix: shift KV/Form graph cell page numbers during DoclingDocument.concatenate (#521)
fix: When concatenating docs, adjust the page numbers on which GraphCell elements appear

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2026-02-23 10:35:16 +01:00
Cesar Berrospi Ramis a3b6e3fb89 fix(chunker): propagate 'traverse_pictures' parameter to chunker (#518)
* fix(chunker): propagate 'traverse_pictures' parameter from serializer to chunker

Propagate 'traverse_pictures' parameter from the 'serializer_provider' to the HierarchicalChunker, to ensure
that chunks include any TextItem children from PictureItem.

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* test(chunker): simplify imports

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

---------

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2026-02-20 20:37:43 +01:00
github-actions[bot] 5a7e567323 chore: bump version to 2.65.1 [skip ci] v2.65.1 2026-02-13 12:10:52 +00:00
Peter W. J. Staar a63685e827 fix: add pdf page widget and hyperlink (#516)
* feat: add hyperlinks and widgets

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* added visualization

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2026-02-13 12:49:50 +01:00
github-actions[bot] f0cbfae063 chore: bump version to 2.65.0 [skip ci] v2.65.0 2026-02-13 11:26:24 +00:00
Peter W. J. Staar 4e472592ed feat: add hyperlinks and widgets (#515)
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2026-02-13 12:22:38 +01:00
Panos Vagenas 9ba605d225 fix(Doclang): fix table cell content deserialization (#512)
* fix(DocLang): fix table cell `<content>` deserialization

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* remove irrelevant file

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

---------

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2026-02-13 10:42:30 +01:00
Panos Vagenas aec74d4eb0 fix(Doclang): align image mode, defaulting to placeholder (#506)
* fix(Doclang): align image mode, defaulting to placeholder

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* update crop test

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

---------

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2026-02-13 10:38:27 +01:00
Panos Vagenas 134cf75b69 test(Doclang): add chart serialization test (#498)
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2026-02-13 10:31:20 +01:00
Panos Vagenas 1d969d4e37 fix: fix document re-indexing (#510)
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2026-02-12 10:11:14 +01:00
Panos Vagenas 2793dda9d7 fix: switch XML parsing (#509)
* fix: switch xml parsing

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* move types package to dev

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

---------

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2026-02-11 14:39:34 +01:00
github-actions[bot] 5fa0d3672c chore: bump version to 2.64.0 [skip ci] v2.64.0 2026-02-09 12:04:16 +00:00
Peter W. J. Staar 6adfbdacdc feat: add PdfShape to SegmentedPdfPage (#507)
* chore: moved PdfLine to PdfShape

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* updated the PdfShape

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* update the _render_shapes

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* moved rendering shapes before the text

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* keep  with deprecation warnings

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2026-02-09 12:47:54 +01:00
Panos Vagenas 193c25f083 fix(Doclang): fix image URI serialization (#504)
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2026-02-04 09:39:03 +01:00
Panos Vagenas 8005892f6c fix(DocTags): fix deserialization to populate picture meta fields (#505)
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2026-02-03 16:50:34 +01:00
github-actions[bot] 46934cc436 chore: bump version to 2.63.0 [skip ci] v2.63.0 2026-02-03 14:40:27 +00:00
Peter W. J. Staar 409c83e32d feat: add image to BitMapResource (#502)
* feat: add image to BitMapResource

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* updated the

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2026-02-03 15:22:57 +01:00
Cesar Berrospi Ramis 04cf44b2c5 fix(serialization): add 'traverse_pictures' parameter to serializers (#501)
* fix(serializer): add 'traverse_pictures' to CommonParams

Expose 'traverse_pictures' parameter to serializer common parameters to control
whether to traverse into PictureItem objects to serialize their children.

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* test(serializer): add test for 'traverse_pictures' parameter

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

---------

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2026-02-03 15:07:55 +01:00
Matteo de2b729617 fix(DocTags): fix picture classification deserialization (#500)
* fixed picture classification deserialization from doctags

Signed-off-by: MatteoOmenetti <omenetti.matteo@gmail.com>

* fixed picture classification deserialization from doctags

Signed-off-by: MatteoOmenetti <omenetti.matteo@gmail.com>

---------

Signed-off-by: MatteoOmenetti <omenetti.matteo@gmail.com>
2026-02-03 10:58:11 +01:00
Panos Vagenas 1d8b78cbad fix(Doclang): fix checkbox serialization (#503)
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2026-02-03 09:54:26 +01:00
Panos Vagenas ad86b85b92 chore: rename IDocTags to Doclang (#494)
* chore: rename IDocTags to Doclang

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* rename remaining file

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

---------

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2026-01-30 16:41:09 +01:00
github-actions[bot] 7f4e5271f7 chore: bump version to 2.62.0 [skip ci] v2.62.0 2026-01-30 14:01:11 +00:00
Peter W. J. Staar fd27df1f07 fix(html): visualize picture meta as html collapsible (#497)
* feat: make meta of pictures cleaner in HTML

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* ran pre-commit

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* fixed the tests

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2026-01-30 14:45:26 +01:00
Cesar Berrospi Ramis 3b0b909c89 fix(markdown): add an option to compact table serialization (#495)
* fix(markdown): add an option to compact table serialization

Add an option in MarkdownTableSerializer to remove padding in tables in markdown serialization.
Propagate the option to MarkdownDocSerializer and DoclingDocument.
Remove unnecessary use of 'mode' argument in 'open' function when mode is 'r'.
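
As an illustration of removing padding from a Markdown table, here is a minimal sketch. The function name `compact_table`, the three-dash delimiter width, and the naive split on `|` are assumptions of this example, not the serializer's implementation; escaped pipes inside cells and data rows made only of dashes are not handled.

```python
def compact_table(markdown_table: str) -> str:
    """Strip cell padding from a pipe table while keeping alignment colons."""
    compact_lines = []
    for line in markdown_table.splitlines():
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        if cells and all(cell and set(cell) <= set("-:") for cell in cells):
            # Delimiter row: shrink the dashes but keep leading/trailing colons.
            cells = [
                (":" if cell.startswith(":") else "")
                + "---"
                + (":" if cell.endswith(":") else "")
                for cell in cells
            ]
        compact_lines.append("|" + "|".join(cells) + "|")
    return "\n".join(compact_lines)


table = "| Name  | Qty |\n| :---- | --: |\n| apple |   3 |"
print(compact_table(table))
```

Dropping the padding shortens the serialized text, which matters when the output is counted in tokens, while the colons in the delimiter row still carry the column alignment.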

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* chore(markdown): keep alignment marks in '_compact_table'

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

---------

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2026-01-29 16:23:54 +01:00
Panos Vagenas 549a2f1472 fix(IDocTags): fix default location resolution handling (#492)
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2026-01-29 13:52:59 +01:00
Panos Vagenas 62f8d4d838 feat(IDocTags): add rich table support (#491)
* feat(IDocTags): add rich table serialization

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* update deserialization

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

---------

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2026-01-28 13:44:57 +01:00