* feat: make an experimental outline serializer
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* work ongoing
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* working outline serializer
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* style(outline): align style of outline serializer to docling-core's
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* fix(serializer): ensure 'include_non_meta' works as designed by CommonParams
Ensure the optional parameter 'include_non_meta' works as designed by CommonParams when used by OutlineDocSerializer.
Add regression tests for OutlineDocSerializer.
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* chore(serializer): outline serializer to optionally return in JSON format
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* chore(serializer): enable a custom set of labels for TOC serializer
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* test(serializer): add summary to title in test data
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* fix(serializer): ensure outline serializer keeps structured fields in JSON format
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* chore(serializer): refactor outline serializer to create proper markdown
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* refactor(serializer): use pydantic models for the outline serializer JSON representation
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* test: rename outline serialization module to align it with others
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* chore(serializer): add indented text format for outline serializer
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
---------
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* feat: profile a document or collection
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* chore(profiler): add deciles and histograms
Add deciles and histograms to the Docling collection statistics.
Add an example script to plot histograms.
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* chore(profiler): add option to plot log frequencies in histogram
Add the option to plot the histogram frequencies in logarithmic scale.
Extend README with documentation on the document profiler.
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* test(profiler): cover missing lines in doc_profiler with tests
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
---------
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
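The deciles, histograms, and log-scaled frequencies mentioned in the profiler commits can be illustrated with the standard library alone. This is only a sketch of the statistics involved, not the profiler's code: the function names `deciles`, `histogram`, and `log_frequencies` and the fixed-width binning are assumptions made for illustration.

```python
import math
import statistics
from collections import Counter
from typing import Dict, List


def deciles(values: List[float]) -> List[float]:
    """Return the nine cut points that split the data into ten equal-sized groups."""
    return statistics.quantiles(values, n=10)


def histogram(values: List[float], bin_width: float) -> Dict[float, int]:
    """Count values per fixed-width bin, keyed by the lower edge of each bin."""
    counts = Counter(math.floor(v / bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


def log_frequencies(counts: Dict[float, int]) -> Dict[float, float]:
    """Map each bin count to log10(count); bins only exist when count >= 1."""
    return {edge: math.log10(n) for edge, n in counts.items()}


# Hypothetical per-document page counts for a small collection.
pages_per_doc = [1, 2, 12, 40, 41, 250]
bins = histogram(pages_per_doc, bin_width=10)
print(bins)
print(log_frequencies(bins))
print(deciles(list(range(1, 100))))
```

A logarithmic scale is useful when a few bins dominate, as is common for long-tailed quantities such as pages per document.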
* fix: prevent infinite loop in LineBasedTokenChunker with unbreakable tokens
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* tests: add a pytest fixture to test LineBasedTokenChunker efficiently
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* style: apply style conventions to test_line_chunker.py test module
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
---------
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
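The commit subject for the LineBasedTokenChunker fix does not say how the loop was broken, so the following is only a generic sketch of one way a greedy chunker can guarantee forward progress. The function `chunk_words`, its character budget, and the decision to emit an oversized word as its own chunk are all assumptions, not the library's behavior.

```python
from typing import List


def chunk_words(text: str, max_chars: int) -> List[str]:
    """Greedily pack whitespace-separated words into chunks of at most max_chars.

    A word longer than max_chars cannot be broken here, so it is emitted as
    its own oversized chunk. Every word is consumed exactly once, which means
    the loop always terminates.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")

    chunks: List[str] = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= max_chars:
            current = candidate
            continue
        if current:
            chunks.append(current)
        # Start a new chunk with this word even if it exceeds the budget;
        # retrying to fit it would never make progress.
        current = word
    if current:
        chunks.append(current)
    return chunks


print(chunk_words("aa bb cccccccccc dd", max_chars=5))
# → ['aa bb', 'cccccccccc', 'dd']
```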
* fix: prevent hang in export_to_markdown() on nested RichTableCells
When a table cell contains another table (via RichTableCell.ref pointing
to a nested table item), MarkdownTableSerializer recursively invoked
itself through doc_serializer.serialize(). Each level of nesting caused
the outer tabulate() call to receive cell strings that were already
formatted markdown tables, which then got |-encoded and re-processed.
On documents with many such cells (e.g. Wikipedia pages with taxobox or
classification tables containing hyperlink-rich cells), this caused
exponential string growth, eventually exhausting memory or hanging.
Fix: add two module-level helpers:
- _cell_content_has_table(item, doc): walks the item subtree and
  returns True if any node is a TableItem.
- _mark_subtree_visited(item, doc, visited): adds all nodes in the
  subtree to the shared 'visited' set that the document serializer uses
  to prevent double-emitting items.
In MarkdownTableSerializer.serialize(), when a RichTableCell's content
contains a nested table, call _mark_subtree_visited() on it and then use
col.text (the precomputed plain-text fallback) instead of calling
doc_serializer.serialize(). This avoids the recursive MarkdownTableSerializer
call entirely, and keeps the visited set consistent so the nested table is
not emitted again as a separate top-level item.
For RichTableCells whose content does NOT contain a nested table (e.g.
italic text, lists, hyperlinks), doc_serializer.serialize() is called as
before, so inline Markdown formatting is preserved.
Bug introduced in docling-core v2.46.0 (commit 1d04154) when RichTableCell
markdown serialization was first added.
Signed-off-by: Ivan Traus <ivan@liminary.io>
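The two helpers described in the commit above can be sketched on a toy tree. The `Node` class below is a hypothetical stand-in for docling-core items, and the helper signatures are simplified (no `doc` argument), so this only illustrates the subtree walk and the shared visited set, not the actual implementation.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Node:
    """Hypothetical stand-in for a document item."""

    ref: str
    kind: str  # "text" or "table"
    text: str = ""
    children: List["Node"] = field(default_factory=list)


def cell_content_has_table(item: Node) -> bool:
    """Return True if the item or any descendant is a table."""
    return item.kind == "table" or any(
        cell_content_has_table(child) for child in item.children
    )


def mark_subtree_visited(item: Node, visited: Set[str]) -> None:
    """Record every node in the subtree so it is not emitted a second time."""
    visited.add(item.ref)
    for child in item.children:
        mark_subtree_visited(child, visited)


def serialize_cell(item: Node, fallback_text: str, visited: Set[str]) -> str:
    """Use the plain-text fallback when the cell holds a nested table."""
    if cell_content_has_table(item):
        mark_subtree_visited(item, visited)
        return fallback_text
    # Cells without nested tables would go through the regular serializer;
    # here that is reduced to returning the node text.
    return item.text


nested = Node("#/groups/0", "text", children=[Node("#/tables/1", "table")])
visited: Set[str] = set()
print(serialize_cell(nested, "plain text fallback", visited), sorted(visited))
```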
* style: apply ruff formatting to markdown serializer
Keep the branch green in local and CI pre-commit runs by applying Ruff's required formatting in the nested rich table serializer path.
Signed-off-by: Ivan Traus <ivan@liminary.io>
* test(markdown): cover descendant table detection in rich-cell helper
Signed-off-by: Ivan Traus <ivan@liminary.io>
* fix(markdown): preserve nested table content as flattened text
Instead of falling back to the opaque col.text placeholder when a
RichTableCell contains a nested table, walk the subtree and collect
text from all grid cells and doc items. This preserves the actual
inner table content in the markdown output while still avoiding the
recursive doc_serializer.serialize() call that caused exponential
string growth and the OOM hang.
Signed-off-by: Ivan Traus <ivan@liminary.io>
* fix(markdown): use TextItem instead of DocItem in subtree text collector
Signed-off-by: Ivan Traus <ivan@liminary.io>
* refactor(markdown): flatten only nested tables, preserve sibling rich formatting
Address review feedback: replace the all-or-nothing _cell_content_has_table
branch with a _nested_in_table flag propagated through kwargs, so only
MarkdownTableSerializer flattens when nested — sibling rich elements (italic,
lists, etc.) keep their markdown formatting. Also replace getattr with
isinstance(NodeItem) + dot notation and remove string annotations.
Signed-off-by: Ivan Traus <ivan@liminary.io>
---------
Signed-off-by: Ivan Traus <ivan@liminary.io>
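The final approach, a flag propagated through kwargs so that only a table nested inside another table is flattened, can be sketched as follows. The `Text` and `Table` classes and both functions are hypothetical simplifications; only the `_nested_in_table` flag name comes from the commit message, and the sketch assumes non-empty tables whose first row is the header.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Text:
    text: str
    italic: bool = False


@dataclass
class Table:
    rows: List[List[Union[Text, "Table"]]]


def serialize(item: Union[Text, Table], **kwargs) -> str:
    """Dispatch on item type, forwarding kwargs (including the nesting flag)."""
    if isinstance(item, Table):
        return serialize_table(item, **kwargs)
    return f"*{item.text}*" if item.italic else item.text


def serialize_table(table: Table, *, _nested_in_table: bool = False, **kwargs) -> str:
    # Every cell is serialized with the flag set, so a table found inside a
    # cell knows it is nested and flattens itself instead of emitting pipes.
    rendered = [
        [serialize(cell, _nested_in_table=True, **kwargs) for cell in row]
        for row in table.rows
    ]
    if _nested_in_table:
        return " ".join(text for row in rendered for text in row)

    header, *body = rendered
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


inner = Table([[Text("a"), Text("b")], [Text("c"), Text("d", italic=True)]])
outer = Table([[Text("Name"), Text("Details")], [Text("note", italic=True), inner]])
print(serialize(outer))
```

In this sketch the italic sibling cell keeps its markdown formatting, while the inner table is reduced to space-separated cell text, so the outer table never receives an already-formatted table as a cell string.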
* feat(vtt): export and save to WebVTT format
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* fix(vtt): omit empty blocks in parsing
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* test(vtt): add tests for exporting and saving to WebVTT
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
---------
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
PDF hyperlinks may contain relative paths, internal bookmarks, or
fragment-only references that are not valid absolute URLs. The strict
AnyUrl validation on PdfHyperlink.uri caused the entire page preprocess
stage to fail when such URIs were encountered, resulting in empty
documents and lost content.
Change uri type to Union[AnyUrl, str] with a field_validator that
attempts AnyUrl parsing first (preserving structured metadata like
scheme/host/path) and falls back to str for non-absolute URIs.
Signed-off-by: Ultizan <ultizan@gmail.com>
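The "parse as an absolute URL first, otherwise keep the raw string" behavior can be illustrated without pydantic. The sketch below swaps in `urllib.parse` from the standard library as a stand-in for `AnyUrl` and a dataclass for the model; the names `parse_uri` and `HyperlinkSketch`, and the rule that an absolute URI needs both a scheme and a host, are assumptions. The actual change uses a pydantic `field_validator` on `PdfHyperlink.uri`.

```python
from dataclasses import dataclass
from typing import Union
from urllib.parse import SplitResult, urlsplit


def parse_uri(raw: str) -> Union[SplitResult, str]:
    """Return structured URL parts for absolute URIs, else the raw string."""
    try:
        parts = urlsplit(raw)
    except ValueError:
        # Malformed input (for example a broken IPv6 host) is kept verbatim.
        return raw
    if parts.scheme and parts.netloc:
        return parts
    # Relative paths, bookmarks, and fragment-only references fall back to str.
    return raw


@dataclass
class HyperlinkSketch:
    """Hypothetical stand-in for a hyperlink model with a lenient uri field."""

    uri: Union[SplitResult, str]

    def __post_init__(self) -> None:
        if isinstance(self.uri, str):
            self.uri = parse_uri(self.uri)


for raw in ("https://example.com/doc.pdf#p2", "#section-2", "../other.pdf"):
    link = HyperlinkSketch(raw)
    print(type(link.uri).__name__, link.uri)
```

The point of the fallback is that one unparseable link no longer aborts processing of the whole page, while absolute links still expose scheme, host, and path.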
* fix(chunker): propagate 'traverse_pictures' parameter from serializer to chunker
Propagate 'traverse_pictures' parameter from the 'serializer_provider' to the HierarchicalChunker, to ensure
that chunks include any TextItem children from PictureItem.
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* test(chunker): simplify imports
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
---------
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* chore: moved PdfLine to PdfShape
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* updated the PdfShape
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* update the _render_shapes
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* moved rendering shapes before the text
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* keep with deprecation warnings
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
---------
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* fix(serializer): add 'traverse_pictures' to CommonParams
Expose the 'traverse_pictures' parameter in the serializer's common parameters to control
whether to traverse into PictureItem objects to serialize their children.
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* test(serializer): add test for 'traverse_pictures' parameter
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
---------
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* fix(markdown): add an option to compact table serialization
Add an option in MarkdownTableSerializer to remove padding in tables in markdown serialization.
Propagate the option to MarkdownDocSerializer and DoclingDocument.
Remove unnecessary use of 'mode' argument in 'open' function when mode is 'r'.
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* chore(markdown): keep alignment marks in '_compact_table'
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
---------
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
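Removing table padding while keeping alignment marks can be sketched as a post-processing step on a padded Markdown table. The function `compact_table` below is hypothetical and is not the library's `_compact_table`; it assumes a single table whose lines start and end with a pipe, whose second line is the separator row, and whose cells contain no escaped pipes.

```python
import re

# Separator cell such as "-----", ":----", "----:" or ":---:".
_SEPARATOR_CELL = re.compile(r"^(:?)-+(:?)$")


def compact_table(table_md: str) -> str:
    """Strip cell padding and shrink separator dashes, keeping ':' alignment marks."""
    compacted = []
    for index, line in enumerate(table_md.strip().splitlines()):
        # Drop the outer pipes, then split into cells and trim the padding.
        cells = [cell.strip() for cell in line.strip()[1:-1].split("|")]
        if index == 1:
            # Second line of the table: the header separator row.
            shrunk = []
            for cell in cells:
                match = _SEPARATOR_CELL.match(cell)
                shrunk.append(f"{match.group(1)}---{match.group(2)}" if match else cell)
            cells = shrunk
        compacted.append("| " + " | ".join(cells) + " |")
    return "\n".join(compacted)


padded = "\n".join(
    [
        "| Name    | Qty |",
        "|:--------|----:|",
        "| apple   |   3 |",
    ]
)
print(compact_table(padded))
```

Compact tables render identically in Markdown viewers but use fewer characters, which matters when the output is counted in tokens.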