docling-core

mirror of https://github.com/docling-project/docling-core.git synced 2026-05-17 13:10:44 +00:00

Author	SHA1	Message	Date
github-actions[bot]	1513f7d171	chore: bump version to 2.70.0 [skip ci] v2.70.0	2026-03-13 15:06:05 +00:00
Panos Vagenas	b56f75190f	chore(Doclang): remove `inline` element (#517) Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>	2026-03-13 15:58:42 +01:00
Panos Vagenas	b93d5a3920	feat: introduce field data model incl. Doclang serialization (#519 ) * feat: introduce new kv data model Signed-off-by: Panos Vagenas <pva@zurich.ibm.com> * add test Signed-off-by: Panos Vagenas <pva@zurich.ibm.com> * extend tests Signed-off-by: Panos Vagenas <pva@zurich.ibm.com> * add form-table test Signed-off-by: Panos Vagenas <pva@zurich.ibm.com> * update invoice test Signed-off-by: Panos Vagenas <pva@zurich.ibm.com> * add migration Signed-off-by: Panos Vagenas <pva@zurich.ibm.com> * include pre-migration YAML Signed-off-by: Panos Vagenas <pva@zurich.ibm.com> * add nesting test Signed-off-by: Panos Vagenas <pva@zurich.ibm.com> * switch to field naming Signed-off-by: Panos Vagenas <pva@zurich.ibm.com> * improve prov migration & location serialization Signed-off-by: Panos Vagenas <pva@zurich.ibm.com> * enable proper nested serialization without inline Signed-off-by: Panos Vagenas <pva@zurich.ibm.com> * add field marker and test Signed-off-by: Panos Vagenas <pva@zurich.ibm.com> * generalize marker naming Signed-off-by: Panos Vagenas <pva@zurich.ibm.com> * updated API Signed-off-by: Panos Vagenas <pva@zurich.ibm.com> * align with TextItem conventions Signed-off-by: Panos Vagenas <pva@zurich.ibm.com> * make FieldItem a DocItem Signed-off-by: Panos Vagenas <pva@zurich.ibm.com> * extend KV migration scope, extend tree manipulation operations Signed-off-by: Panos Vagenas <pva@zurich.ibm.com> * extend kv migration tests Signed-off-by: Panos Vagenas <pva@zurich.ibm.com> * add test with KV nesting Signed-off-by: Panos Vagenas <pva@zurich.ibm.com> * improve key splitting and location assignment Signed-off-by: Panos Vagenas <pva@zurich.ibm.com> * address conflicts Signed-off-by: Panos Vagenas <pva@zurich.ibm.com> * minor cleanup Signed-off-by: Panos Vagenas <pva@zurich.ibm.com> * remove unnecessary files Signed-off-by: Panos Vagenas <pva@zurich.ibm.com> * add add_content parameter, expose export_to_doclang Signed-off-by: Panos Vagenas 
<pva@zurich.ibm.com> * bump DoclingDocument version Signed-off-by: Panos Vagenas <pva@zurich.ibm.com> * address conflicts, add save_as_doclang Signed-off-by: Panos Vagenas <pva@zurich.ibm.com> --------- Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>	2026-03-13 15:37:42 +01:00
Cesar Berrospi Ramis	9f3c0c6757	chore: upgrade dependencies to address dependabot alerts (#543) Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>	2026-03-13 15:18:38 +01:00
Peter W. J. Staar	8d7859eeec	feat: make an experimental outline serializer (#415 ) * feat: make an experimental outline serializer Signed-off-by: Peter Staar <taa@zurich.ibm.com> * work ongoing Signed-off-by: Peter Staar <taa@zurich.ibm.com> * working outline serializer Signed-off-by: Peter Staar <taa@zurich.ibm.com> * style(outline): align style of outline serializer to docling-core's Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> * fix(serializer): ensure 'include_non_meta' works as designed by CommonParams Ensure the optional parameter 'include_non_meta' works as designed by CommonParams when used by OutlineDocSerializer. Add regression tests for OutlineDocSerializer. Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> * chore(serializer): outline serializer to optionally return in JSON format Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> * chore(serializer): enable a custom set of labels for TOC serializer Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> * test(serializer): add summary to title in test data Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> * fix(serializer): ensure outline serializer keeps structured fields in JSON format Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> * chore(serializer): refactor outline serializer to create proper markdown Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> * refactor(serializer): use pydantic models for the outine serializer JSON representation Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> * test: rename outline serialization module to align it with others Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> * chore(serializer): add indented text format for outline serializer Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> --------- Signed-off-by: Peter Staar <taa@zurich.ibm.com> Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>	2026-03-13 15:14:40 +01:00
Cesar Berrospi Ramis	af50f1cb07	feat: profile a document or collection (#511 ) * feat: profile a document or collection Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> * chore(profiler): add deciles and histograms Add deciles and histograms to the Docling collection statistics. Add an example script to plot histograms. Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> * chore(profiler): add option to plot log frequencies in histogram Add the option to plot the histogram frequencies in logarithmic scale. Extend README with documentation on the document profiler. Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> * test(profiler): cover missing lines in doc_profiler with tests Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> --------- Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>	2026-03-13 13:36:38 +01:00
odelliab	b435090fdf	feat: split html table to headers and body (#532 ) * line_chunker * split table to header and body * duplicat table headers * Revert "duplicat table headers" This reverts commit `5d17bdacf2`. * Revert "split table to header and body" This reverts commit `91b43f97e4`. * Revert "line_chunker" This reverts commit `5cc61d93fb`. * line chunker * split table to header and body * duplicate table headers * DCO Remediation Commit for odelliab <odelliab@il.ibm.com> I, odelliab <odelliab@il.ibm.com>, hereby add my Signed-off-by to this commit: `5cc61d93fb` I, odelliab <odelliab@il.ibm.com>, hereby add my Signed-off-by to this commit: `91b43f97e4` I, odelliab <odelliab@il.ibm.com>, hereby add my Signed-off-by to this commit: `5d17bdacf2` I, odelliab <odelliab@il.ibm.com>, hereby add my Signed-off-by to this commit: `a50392e53c` I, odelliab <odelliab@il.ibm.com>, hereby add my Signed-off-by to this commit: `e5894290d5` I, odelliab <odelliab@il.ibm.com>, hereby add my Signed-off-by to this commit: `30c72a99be` I, odelliab <odelliab@il.ibm.com>, hereby add my Signed-off-by to this commit: `510e949692` I, odelliab <odelliab@il.ibm.com>, hereby add my Signed-off-by to this commit: `6c3a8f726e` I, odelliab <odelliab@il.ibm.com>, hereby add my Signed-off-by to this commit: `0642a0701e` Signed-off-by: odelliab <odelliab@il.ibm.com> * style changes * pre-commit fixes * expected output name change * DCO Remediation Commit for odelliab <odelliab@il.ibm.com> I, odelliab <odelliab@il.ibm.com>, hereby add my Signed-off-by to this commit: `9b9ef09007` I, odelliab <odelliab@il.ibm.com>, hereby add my Signed-off-by to this commit: `b3699e32cc` I, odelliab <odelliab@il.ibm.com>, hereby add my Signed-off-by to this commit: `9d393df546` Signed-off-by: odelliab <odelliab@il.ibm.com> * Apply suggestion from @ceberam Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> Signed-off-by: odelliab <91875866+odelliab@users.noreply.github.com> * Apply 
suggestions from code review Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> Signed-off-by: odelliab <91875866+odelliab@users.noreply.github.com> * address review comments Signed-off-by: odelliab <odelliab@il.ibm.com> * refactor: move get_default_tokenizer to huggingface module Move the method get_default_tokenizer to module huggingface.py to avoid circular dependencies and avoid an unconventional import in a method Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> * remove unnecessary lines Signed-off-by: odelliab <odelliab@il.ibm.com> * split html tables Signed-off-by: odelliab <odelliab@il.ibm.com> * add bs4 dependency Signed-off-by: odelliab <odelliab@il.ibm.com> * Apply suggestions from code review Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> Signed-off-by: odelliab <91875866+odelliab@users.noreply.github.com> * address review comments Signed-off-by: odelliab <odelliab@il.ibm.com> * typo Signed-off-by: odelliab <odelliab@il.ibm.com> * add beautifulsoup tests Signed-off-by: odelliab <odelliab@il.ibm.com> * refactor(serializer): drop dependency 'beautifulsoup4' in HTML serializer Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> * test(serializer): refactor tests for HTML serializer Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> --------- Signed-off-by: odelliab <odelliab@il.ibm.com> Signed-off-by: odelliab <91875866+odelliab@users.noreply.github.com> Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>	2026-03-13 10:40:05 +01:00
Anish Raghavendra	e00125c477	feat: handle wide table outliers with LineBasedTokenChunker (#536 ) * feat: handle large prefixes by splitting into multiple chunks in LineBasedTokenChunker Signed-off-by: z404 <anishr890@gmail.com> * feat: add option to omit table headers on row overflow Signed-off-by: z404 <anishr890@gmail.com> * chore: add warning when prefix is omitted on overflow Signed-off-by: z404 <anishr890@gmail.com> * refactor: simplify prefix chunking logic by removing redundant method Signed-off-by: z404 <anishr890@gmail.com> * fix: ensure prefix appears as standalone chunk when all lines overflow Signed-off-by: z404 <anishr890@gmail.com> * docs: add usage examples for LineBasedTokenChunker with prefix handling Signed-off-by: z404 <anishr890@gmail.com> * docs: remove examples and describe parameters for LineBasedTokenChunker Signed-off-by: z404 <anishr890@gmail.com> * docs: simplify class docstring by removing parameter details Signed-off-by: z404 <anishr890@gmail.com> --------- Signed-off-by: z404 <anishr890@gmail.com>	2026-03-13 07:29:46 +01:00
github-actions[bot]	24574110d3	chore: bump version to 2.69.0 [skip ci] v2.69.0	2026-03-09 04:31:38 +00:00
Michele Dolfi	4eb0d20d04	feat: Loosen dependency version constraints (#534 ) * update deps Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * fix code to support all versions and regen tests Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> --------- Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>	2026-03-09 05:27:48 +01:00
github-actions[bot]	eb900064c8	chore: bump version to 2.68.0 [skip ci] v2.68.0	2026-03-07 12:19:50 +00:00
Cesar Berrospi Ramis	a661bb10cb	fix: prevent infinite loop in LineBasedTokenChunker with unbreakable tokens (#533 ) * fix: prevent infinite loop in LineBasedTokenChunker with unbreakable tokens Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> * tests: add a pytest fixture to test LineBasedTokenChunker efficiently Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> * style: apply style conventions to test_line_chunker.py test module Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> --------- Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>	2026-03-06 12:52:30 +01:00
Faiq Adzlan	43030488c9	chore: add support for `pandas==3.0.0` (#513 ) * feat: add support for `pandas==3.0.0` * DCO Remediation Commit for wanadzhar913 <adzhar.faiq@gmail.com> I, wanadzhar913 <adzhar.faiq@gmail.com>, hereby add my Signed-off-by to this commit: `bb8f81a193` Signed-off-by: wanadzhar913 <adzhar.faiq@gmail.com> * Revert "feat: add support for `pandas==3.0.0`" This reverts commit `bb8f81a193`. * chore: add support for pandas==3.0.0 * DCO Remediation Commit for wanadzhar913 <adzhar.faiq@gmail.com> I, wanadzhar913 <adzhar.faiq@gmail.com>, hereby add my Signed-off-by to this commit: `440484229e` I, wanadzhar913 <adzhar.faiq@gmail.com>, hereby add my Signed-off-by to this commit: `b47e96c02d` Signed-off-by: wanadzhar913 <adzhar.faiq@gmail.com> * allow pandas >3 and update lock for it Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> --------- Signed-off-by: wanadzhar913 <adzhar.faiq@gmail.com> Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>	2026-03-06 11:25:56 +01:00
samiuc	e363c951d8	feat: add plain-text serializer (#522 ) * feat: add plain-text serializer Signed-off-by: samiuc <sami.ullah.chat@gmail.com> * fix code coverage Signed-off-by: samiuc <sami.ullah.chat@gmail.com> --------- Signed-off-by: samiuc <sami.ullah.chat@gmail.com>	2026-03-06 10:20:21 +01:00
github-actions[bot]	1c6ae32b46	chore: bump version to 2.67.1 [skip ci] v2.67.1	2026-03-05 09:11:24 +00:00
Ivan Traus	2debe0836f	fix: prevent hang in export_to_markdown() on nested RichTableCells (#525 ) * fix: prevent hang in export_to_markdown() on nested RichTableCells When a table cell contains another table (via RichTableCell.ref pointing to a nested table item), MarkdownTableSerializer recursively invoked itself through doc_serializer.serialize(). Each level of nesting caused the outer tabulate() call to receive cell strings that were already formatted markdown tables, which then got \|-encoded and re-processed. On documents with many such cells (e.g. Wikipedia pages with taxobox or classification tables containing hyperlink-rich cells), this caused exponential string growth, eventually exhausting memory or hanging. Fix: add two module-level helpers: _cell_content_has_table(item, doc) — walks the item subtree and returns True if any node is a TableItem. _mark_subtree_visited(item, doc, visited) — adds all nodes in the subtree to the shared 'visited' set that the document serializer uses to prevent double-emitting items. In MarkdownTableSerializer.serialize(), when a RichTableCell's content contains a nested table, call _mark_subtree_visited() on it and then use col.text (the precomputed plain-text fallback) instead of calling doc_serializer.serialize(). This avoids the recursive MarkdownTableSerializer call entirely, and keeps the visited set consistent so the nested table is not emitted again as a separate top-level item. For RichTableCells whose content does NOT contain a nested table (e.g. italic text, lists, hyperlinks), doc_serializer.serialize() is called as before, so inline Markdown formatting is preserved. Bug introduced in docling-core v2.46.0 (commit `1d04154`) when RichTableCell markdown serialization was first added. Signed-off-by: Ivan Traus <ivan@liminary.io> * style: apply ruff formatting to markdown serializer Keep the branch green in local and CI pre-commit runs by applying Ruff's required formatting in the nested rich table serializer path. 
Signed-off-by: Ivan Traus <ivan@liminary.io> * test(markdown): cover descendant table detection in rich-cell helper Signed-off-by: Ivan Traus <ivan@liminary.io> * fix(markdown): preserve nested table content as flattened text Instead of falling back to the opaque col.text placeholder when a RichTableCell contains a nested table, walk the subtree and collect text from all grid cells and doc items. This preserves the actual inner table content in the markdown output while still avoiding the recursive doc_serializer.serialize() call that caused exponential string growth and the OOM hang. Signed-off-by: Ivan Traus <ivan@liminary.io> * fix(markdown): use TextItem instead of DocItem in subtree text collector Signed-off-by: Ivan Traus <ivan@liminary.io> * refactor(markdown): flatten only nested tables, preserve sibling rich formatting Address review feedback: replace the all-or-nothing _cell_content_has_table branch with a _nested_in_table flag propagated through kwargs, so only MarkdownTableSerializer flattens when nested — sibling rich elements (italic, lists, etc.) keep their markdown formatting. Also replace getattr with isinstance(NodeItem) + dot notation and remove string annotations. Signed-off-by: Ivan Traus <ivan@liminary.io> --------- Signed-off-by: Ivan Traus <ivan@liminary.io>	2026-03-05 09:54:27 +01:00
github-actions[bot]	7c21840c4f	chore: bump version to 2.67.0 [skip ci] v2.67.0	2026-03-04 15:29:11 +00:00
odelliab	ea359bcc63	feat: table aware chunking (#527 ) * line_chunker * split table to header and body * duplicat table headers * Revert "duplicat table headers" This reverts commit `5d17bdacf2`. * Revert "split table to header and body" This reverts commit `91b43f97e4`. * Revert "line_chunker" This reverts commit `5cc61d93fb`. * line chunker * split table to header and body * duplicate table headers * DCO Remediation Commit for odelliab <odelliab@il.ibm.com> I, odelliab <odelliab@il.ibm.com>, hereby add my Signed-off-by to this commit: `5cc61d93fb` I, odelliab <odelliab@il.ibm.com>, hereby add my Signed-off-by to this commit: `91b43f97e4` I, odelliab <odelliab@il.ibm.com>, hereby add my Signed-off-by to this commit: `5d17bdacf2` I, odelliab <odelliab@il.ibm.com>, hereby add my Signed-off-by to this commit: `a50392e53c` I, odelliab <odelliab@il.ibm.com>, hereby add my Signed-off-by to this commit: `e5894290d5` I, odelliab <odelliab@il.ibm.com>, hereby add my Signed-off-by to this commit: `30c72a99be` I, odelliab <odelliab@il.ibm.com>, hereby add my Signed-off-by to this commit: `510e949692` I, odelliab <odelliab@il.ibm.com>, hereby add my Signed-off-by to this commit: `6c3a8f726e` I, odelliab <odelliab@il.ibm.com>, hereby add my Signed-off-by to this commit: `0642a0701e` Signed-off-by: odelliab <odelliab@il.ibm.com> * style changes * pre-commit fixes * expected output name change * DCO Remediation Commit for odelliab <odelliab@il.ibm.com> I, odelliab <odelliab@il.ibm.com>, hereby add my Signed-off-by to this commit: `9b9ef09007` I, odelliab <odelliab@il.ibm.com>, hereby add my Signed-off-by to this commit: `b3699e32cc` I, odelliab <odelliab@il.ibm.com>, hereby add my Signed-off-by to this commit: `9d393df546` Signed-off-by: odelliab <odelliab@il.ibm.com> * Apply suggestion from @ceberam Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> Signed-off-by: odelliab <91875866+odelliab@users.noreply.github.com> * Apply suggestions from code 
review Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> Signed-off-by: odelliab <91875866+odelliab@users.noreply.github.com> * address review comments Signed-off-by: odelliab <odelliab@il.ibm.com> * refactor: move get_default_tokenizer to huggingface module Move the method get_default_tokenizer to module huggingface.py to avoid circular dependencies and avoid an unconventional import in a method Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> --------- Signed-off-by: odelliab <odelliab@il.ibm.com> Signed-off-by: odelliab <91875866+odelliab@users.noreply.github.com> Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>	2026-03-04 16:25:57 +01:00
github-actions[bot]	9eca661df7	chore: bump version to 2.66.0 [skip ci] v2.66.0	2026-02-26 10:46:22 +00:00
Matvei Smirnov	c566268e0a	fix: rich table triplet serialization (#425 ) * fix: rich table triplet serialization Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> Signed-off-by: Vdaleke <vdalekesmirnov@gmail.com> Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> * refactor: remove kwargs from 'export_to_dataframe' signature Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> --------- Signed-off-by: Vdaleke <vdalekesmirnov@gmail.com> Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>	2026-02-26 11:39:34 +01:00
Cesar Berrospi Ramis	73b07572cf	fix: support single-column table default serialization (#526) Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>	2026-02-24 12:23:01 +01:00
Cesar Berrospi Ramis	b8ef7bad1b	feat: add WebVTT export and save functionality (#523 ) * feat(vtt): export and save to WebVTT format Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> * fix(vtt): omit empty blocks in parsing Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> * test(vtt): add tests for exporting and saving to WebVTT Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> --------- Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>	2026-02-24 10:17:17 +01:00
github-actions[bot]	cc4df04ac0	chore: bump version to 2.65.2 [skip ci] v2.65.2	2026-02-23 15:17:00 +00:00
Ultizan	6032c7c175	fix: accept relative URIs in PdfHyperlink without validation failure (#520 ) PDF hyperlinks may contain relative paths, internal bookmarks, or fragment-only references that are not valid absolute URLs. The strict AnyUrl validation on PdfHyperlink.uri caused the entire page preprocess stage to fail when such URIs were encountered, resulting in empty documents and lost content. Change uri type to Union[AnyUrl, str] with a field_validator that attempts AnyUrl parsing first (preserving structured metadata like scheme/host/path) and falls back to str for non-absolute URIs. Signed-off-by: Ultizan <ultizan@gmail.com>	2026-02-23 16:11:38 +01:00
Christoph Auer	6a04db77aa	fix: shift KV/Form graph cell page numbers during DoclingDocument.concatenate (#521 ) fix: When concatenating docs, adjust page numbers the GraphCell elements appear on Signed-off-by: Christoph Auer <cau@zurich.ibm.com>	2026-02-23 10:35:16 +01:00
Cesar Berrospi Ramis	a3b6e3fb89	fix(chunker): propagate 'traverse_pictures' parameter to chunker (#518 ) * fix(chunker): propagate 'traverse_pictures' parameter from serializer to chunker Propagate 'traverse_pictures' parameter from the 'serializer_provider' to the HierarchicalChunker, to ensure that chunks include any TextItem children from PictureItem. Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> * test(chunker): simplify imports Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> --------- Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>	2026-02-20 20:37:43 +01:00
github-actions[bot]	5a7e567323	chore: bump version to 2.65.1 [skip ci] v2.65.1	2026-02-13 12:10:52 +00:00
Peter W. J. Staar	a63685e827	fix: add pdf page widget and hyperlink (#516 ) * feat: add hyperlinks and widgets Signed-off-by: Peter Staar <taa@zurich.ibm.com> * added visualization Signed-off-by: Peter Staar <taa@zurich.ibm.com> --------- Signed-off-by: Peter Staar <taa@zurich.ibm.com>	2026-02-13 12:49:50 +01:00
github-actions[bot]	f0cbfae063	chore: bump version to 2.65.0 [skip ci] v2.65.0	2026-02-13 11:26:24 +00:00
Peter W. J. Staar	4e472592ed	feat: add hyperlinks and widgets (#515) Signed-off-by: Peter Staar <taa@zurich.ibm.com>	2026-02-13 12:22:38 +01:00
Panos Vagenas	9ba605d225	fix(Doclang): fix table cell `content` deserialization (#512 ) * fix(DocLang): fix table cell `<content>` deserialization Signed-off-by: Panos Vagenas <pva@zurich.ibm.com> * remove irrelevant file Signed-off-by: Panos Vagenas <pva@zurich.ibm.com> --------- Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>	2026-02-13 10:42:30 +01:00
Panos Vagenas	aec74d4eb0	fix(Doclang): align image mode, defaulting to placeholder (#506 ) * fix(Doclang): align image mode, defaulting to placeholder Signed-off-by: Panos Vagenas <pva@zurich.ibm.com> * update crop test Signed-off-by: Panos Vagenas <pva@zurich.ibm.com> --------- Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>	2026-02-13 10:38:27 +01:00
Panos Vagenas	134cf75b69	test(Doclang): add chart serialization test (#498) Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>	2026-02-13 10:31:20 +01:00
Panos Vagenas	1d969d4e37	fix: fix document re-indexing (#510) Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>	2026-02-12 10:11:14 +01:00
Panos Vagenas	2793dda9d7	fix: switch XML parsing (#509 ) * fix: switch xml parsing Signed-off-by: Panos Vagenas <pva@zurich.ibm.com> * move types package to dev Signed-off-by: Panos Vagenas <pva@zurich.ibm.com> --------- Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>	2026-02-11 14:39:34 +01:00
github-actions[bot]	5fa0d3672c	chore: bump version to 2.64.0 [skip ci] v2.64.0	2026-02-09 12:04:16 +00:00
Peter W. J. Staar	6adfbdacdc	feat: add PdfShape to SegmentedPdfPage (#507 ) * chore: moved PdfLine to PdfShape Signed-off-by: Peter Staar <taa@zurich.ibm.com> * updated the PdfShape Signed-off-by: Peter Staar <taa@zurich.ibm.com> * update the _render_shapes Signed-off-by: Peter Staar <taa@zurich.ibm.com> * moved rendering shapes before the text Signed-off-by: Peter Staar <taa@zurich.ibm.com> * keep with deprecation warnings Signed-off-by: Peter Staar <taa@zurich.ibm.com> --------- Signed-off-by: Peter Staar <taa@zurich.ibm.com>	2026-02-09 12:47:54 +01:00
Panos Vagenas	193c25f083	fix(Doclang): fix image URI serialization (#504) Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>	2026-02-04 09:39:03 +01:00
Panos Vagenas	8005892f6c	fix(DocTags): fix deserialization to populate picture meta fields (#505) Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>	2026-02-03 16:50:34 +01:00
github-actions[bot]	46934cc436	chore: bump version to 2.63.0 [skip ci] v2.63.0	2026-02-03 14:40:27 +00:00
Peter W. J. Staar	409c83e32d	feat: add image to BitMapResource (#502 ) * feat: add image to BitMapResource Signed-off-by: Peter Staar <taa@zurich.ibm.com> * updated the Signed-off-by: Peter Staar <taa@zurich.ibm.com> --------- Signed-off-by: Peter Staar <taa@zurich.ibm.com>	2026-02-03 15:22:57 +01:00
Cesar Berrospi Ramis	04cf44b2c5	fix(serialization): add 'traverse_pictures' parameter to serializers (#501) * fix(serializer): add 'traverse_pictures' to CommonParams Expose 'traverse_pictures' parameter to serializer common parameters to control whether to traverse into PictureItem objects to serialize their children. Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> * test(serializer): add test for 'traverse_pictures' parameter Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> --------- Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>	2026-02-03 15:07:55 +01:00
Matteo	de2b729617	fix(DocTags): fix picture classification deserialization (#500 ) * fixed picture classification deserialization from doctags Signed-off-by: MatteoOmenetti <omenetti.matteo@gmail.com> * fixed picture classification deserialization from doctags Signed-off-by: MatteoOmenetti <omenetti.matteo@gmail.com> --------- Signed-off-by: MatteoOmenetti <omenetti.matteo@gmail.com>	2026-02-03 10:58:11 +01:00
Panos Vagenas	1d8b78cbad	fix(Doclang): fix checkbox serialization (#503) Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>	2026-02-03 09:54:26 +01:00
Panos Vagenas	ad86b85b92	chore: rename IDocTags to Doclang (#494 ) * chore: rename IDocTags to Doclang Signed-off-by: Panos Vagenas <pva@zurich.ibm.com> * rename remaining file Signed-off-by: Panos Vagenas <pva@zurich.ibm.com> --------- Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>	2026-01-30 16:41:09 +01:00
github-actions[bot]	7f4e5271f7	chore: bump version to 2.62.0 [skip ci] v2.62.0	2026-01-30 14:01:11 +00:00
Peter W. J. Staar	fd27df1f07	fix(html): visualize picture meta as html collapsible (#497 ) * feat: make meta of pictures cleaner in HTML Signed-off-by: Peter Staar <taa@zurich.ibm.com> * ran pre-commit Signed-off-by: Peter Staar <taa@zurich.ibm.com> * fixed the tests Signed-off-by: Peter Staar <taa@zurich.ibm.com> --------- Signed-off-by: Peter Staar <taa@zurich.ibm.com>	2026-01-30 14:45:26 +01:00
Cesar Berrospi Ramis	3b0b909c89	fix(markdown): add an option to compact table serialization (#495 ) * fix(markdown): add an option to compact table serialization Add an option in MarkdownTableSerializer to remove padding in tables in markdown serialization. Propagate the option to MarkdownDocSerializer and DoclingDocument. Remove unnecessary use of 'mode' argument in 'open' function when mode is 'r'. Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> * chore(markdown): keep alignment marks in '_compact_table' Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> --------- Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>	2026-01-29 16:23:54 +01:00
Panos Vagenas	549a2f1472	fix(IDocTags): fix default location resolution handling (#492) Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>	2026-01-29 13:52:59 +01:00
Panos Vagenas	62f8d4d838	feat(IDocTags): add rich table support (#491 ) * feat(IDocTags): add rich table serialization Signed-off-by: Panos Vagenas <pva@zurich.ibm.com> * update deserialization Signed-off-by: Panos Vagenas <pva@zurich.ibm.com> --------- Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>	2026-01-28 13:44:57 +01:00
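
Commit 2debe0836f above describes flattening a table nested inside a RichTableCell into plain text, while marking its subtree as visited so it is not emitted a second time, and leaving sibling rich elements formatted. The idea can be sketched roughly as follows. This is a toy model, not docling-core code: `Node`, `render_cell`, `flatten_text` and the `italic` flag are hypothetical names invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Node:
    """Hypothetical stand-in for a document tree item (not docling-core's NodeItem)."""

    ref: str
    kind: str = "text"  # "text", "group" or "table" in this toy model
    text: str = ""
    italic: bool = False
    children: List["Node"] = field(default_factory=list)


def mark_subtree_visited(node: Node, visited: Set[str]) -> None:
    """Add the node and all of its descendants to the shared visited set."""
    visited.add(node.ref)
    for child in node.children:
        mark_subtree_visited(child, visited)


def flatten_text(node: Node) -> str:
    """Collect plain text from a subtree, ignoring any rich formatting."""
    parts = [node.text] if node.text else []
    for child in node.children:
        child_text = flatten_text(child)
        if child_text:
            parts.append(child_text)
    return " ".join(parts)


def render_cell(node: Node, visited: Set[str]) -> str:
    """Render cell content: nested tables are flattened, other items keep formatting."""
    visited.add(node.ref)
    if node.kind == "table":
        # No recursive table rendering here: flatten once and mark everything visited.
        mark_subtree_visited(node, visited)
        return flatten_text(node)
    parts = []
    if node.text:
        parts.append(f"*{node.text}*" if node.italic else node.text)
    for child in node.children:
        rendered = render_cell(child, visited)
        if rendered:
            parts.append(rendered)
    return " ".join(parts)


inner = Node("#/tables/1", "table", children=[
    Node("#/texts/1", text="a", italic=True),
    Node("#/texts/2", text="b"),
])
cell = Node("#/groups/0", "group", children=[
    Node("#/texts/0", text="intro", italic=True),
    inner,
])
seen: Set[str] = set()
rendered_cell = render_cell(cell, seen)
```

In this sketch the italic sibling keeps its markers while the nested table contributes only flattened text, and every reference under the cell ends up in `seen`, so a later top-level pass could skip those items.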


482 Commits
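
Several commits in this listing (ea359bcc63, a661bb10cb, e00125c477) concern line-based chunking under a token budget, with a repeated prefix such as a table header and a guarantee that oversized input cannot stall the loop. A minimal sketch of that general idea follows. It does not reproduce LineBasedTokenChunker's API or its exact overflow behaviour: `chunk_lines`, `count_tokens` and `split_long_line` are hypothetical helpers, the whitespace tokenizer is an assumption, and dropping an oversized prefix is a simplification chosen here.

```python
from typing import List


def count_tokens(text: str) -> int:
    """Toy tokenizer (assumption): one token per whitespace-separated word."""
    return len(text.split())


def split_long_line(line: str, budget: int) -> List[str]:
    """Break an over-long line into pieces of at most `budget` toy tokens."""
    words = line.split()
    return [" ".join(words[i:i + budget]) for i in range(0, len(words), budget)]


def chunk_lines(lines: List[str], max_tokens: int, prefix: str = "") -> List[str]:
    """Pack lines into chunks of at most `max_tokens`, repeating `prefix` in each."""
    if max_tokens < 1:
        raise ValueError("max_tokens must be positive")
    prefix_cost = count_tokens(prefix)
    if prefix_cost >= max_tokens:
        # Simplification: a prefix that leaves no room for content is dropped.
        prefix, prefix_cost = "", 0
    budget = max_tokens - prefix_cost

    # Normalise first so that every piece fits the budget on its own;
    # this is what guarantees the packing loop below always makes progress.
    pieces: List[str] = []
    for line in lines:
        if count_tokens(line) <= budget:
            pieces.append(line)
        else:
            pieces.extend(split_long_line(line, budget))

    chunks: List[str] = []
    current: List[str] = []
    used = 0
    for piece in pieces:
        cost = count_tokens(piece)
        if current and used + cost > budget:
            chunks.append("\n".join(([prefix] if prefix else []) + current))
            current, used = [], 0
        current.append(piece)
        used += cost
    if current:
        chunks.append("\n".join(([prefix] if prefix else []) + current))
    return chunks


table_chunks = chunk_lines(["a b", "c d", "e f g h i"], max_tokens=4, prefix="H")
```

Each chunk in `table_chunks` starts with the header line `H`, and the five-word row is split across two chunks rather than exceeding the budget.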