* fix(doclang): default DoclangDeserializer to page 1 and allow start offset
Doclang VLM output was assigning provenance to page 0 while DoclingDocument and ground truth use 1-based page numbers, which broke page-aligned matching.
- Default internal _page_no to 1 and add deserialize(..., page_no=1).
- Replace hardcoded self._page_no = 0 with the page_no argument.
- Update roundtrip_list_item_with_inline deserialized fixture for page 1.
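The page-number change above can be illustrated with a minimal sketch. `SketchDeserializer` and its `provenance` list are hypothetical stand-ins, not the real `DoclangDeserializer`; only the `page_no=1` default and the `_page_no` attribute come from the commit message.

```python
from dataclasses import dataclass, field


@dataclass
class SketchDeserializer:
    """Hypothetical stand-in, not the real DoclangDeserializer."""

    _page_no: int = 1  # 1-based default, matching DoclingDocument page numbers
    provenance: list = field(default_factory=list)

    def deserialize(self, chunks, page_no: int = 1):
        # Use the caller-supplied start page instead of a hardcoded 0.
        self._page_no = page_no
        self.provenance = [(self._page_no, chunk) for chunk in chunks]
        return self.provenance


first = SketchDeserializer().deserialize(["title", "text"])
offset = SketchDeserializer().deserialize(["text"], page_no=5)
```

With the default, provenance lands on page 1; passing `page_no=5` shifts it to a later page, which is what page-aligned matching against 1-based ground truth would need.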
* DCO Remediation Commit for Ahmed Nassar AHN@zurich.ibm.com <AHN@zurich.ibm.com>
I, Ahmed Nassar AHN@zurich.ibm.com <AHN@zurich.ibm.com>, hereby add my Signed-off-by to this commit: 0c34abd5c3
Signed-off-by: Ahmed Nassar AHN@zurich.ibm.com <AHN@zurich.ibm.com>
---------
Signed-off-by: Ahmed Nassar AHN@zurich.ibm.com <AHN@zurich.ibm.com>
Co-authored-by: Ahmed Nassar AHN@zurich.ibm.com <AHN@zurich.ibm.com>
* feat(serializer): add MsExcelMarkdownDocSerializer for sheet-name headings
Add `MsExcelMarkdownFallbackSerializer` and `MsExcelMarkdownDocSerializer`
to the serializer package so that `GroupLabel.SHEET` groups are rendered as
level-2 Markdown headings when exporting Excel-sourced documents.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* DCO Remediation Commit for Smeet23 <smeetagrawal2003@gmail.com>
I, Smeet23 <smeetagrawal2003@gmail.com>, hereby add my Signed-off-by to this commit: 2a3808e5dc
Signed-off-by: Smeet23 <smeetagrawal2003@gmail.com>
* style: apply ruff formatter to markdown_excel.py
Signed-off-by: Smeet23 <smeetagrawal2003@gmail.com>
---------
Signed-off-by: Smeet23 <smeetagrawal2003@gmail.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix: repair table children when rich table cells break hierarchy
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* require _normalize_table_children_from_rich_cells to be called explicitly
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* fix(Doclang): suppress empty elements in Doclang serialization
Add `suppress_empty_elements` parameter to `DoclangParams` that, when
enabled, omits text items, picture groups, tables, and inline elements
that produce no content instead of emitting empty open/close tag pairs.
This avoids cluttering OCR-oriented output with vacant tags.
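A rough sketch of the suppression behavior described above, under stated assumptions: `render_element`, `render_all`, and the tag names are hypothetical and do not reflect the actual Doclang serializer; only the `suppress_empty_elements` flag name is taken from the commit message.

```python
def render_element(tag: str, content: str, suppress_empty_elements: bool = False) -> str:
    # Hypothetical helper: skip elements with no content when the flag is set.
    if suppress_empty_elements and not content.strip():
        return ""
    return f"<{tag}>{content}</{tag}>"


def render_all(items, suppress_empty_elements: bool = False) -> str:
    parts = [render_element(tag, text, suppress_empty_elements) for tag, text in items]
    # Drop the empty strings produced by suppressed elements.
    return "".join(part for part in parts if part)


items = [("text", "Hello"), ("text", ""), ("table", "  ")]
verbose = render_all(items)
compact = render_all(items, suppress_empty_elements=True)
```

Here `verbose` still carries the empty open/close pairs, while `compact` keeps only the element that has content.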
Signed-off-by: Ahmed Nassar <nassarofficial@gmail.com>
* added tests
Signed-off-by: MatteoOmenetti <omenetti.matteo@gmail.com>
---------
Signed-off-by: Ahmed Nassar <nassarofficial@gmail.com>
Signed-off-by: MatteoOmenetti <omenetti.matteo@gmail.com>
Co-authored-by: MatteoOmenetti <omenetti.matteo@gmail.com>
* chore(serializer): add item label to outline serializer
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* chore(serializer): leverage pydantic for validating and dumping in outline serializer
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* test(serializer): test '_format_indented_text_line' from outline serializer
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* fix(serializer): pass kwargs to all outline serializers for format consistency
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
---------
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* fix: add missing picture classification tokens for photograph and geographical_map
The DocumentFigureClassifier model outputs 'photograph' and 'geographical_map'
class names that were not defined in _PictureClassificationToken and
PictureClassificationLabel enums, causing ValueError during DocTags serialization.
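The failure mode can be reproduced with plain Python enums. The sketch below assumes simplified, hypothetical enums (`OldLabel`, `NewLabel`) rather than the real `_PictureClassificationToken` and `PictureClassificationLabel` definitions; it only shows why an undefined class name raises `ValueError` on lookup by value.

```python
from enum import Enum


class OldLabel(str, Enum):
    # Hypothetical subset of labels before the fix.
    PIE_CHART = "pie_chart"


class NewLabel(str, Enum):
    # Same subset with the two missing class names added.
    PIE_CHART = "pie_chart"
    PHOTOGRAPH = "photograph"
    GEOGRAPHICAL_MAP = "geographical_map"


def lookup(enum_cls, name: str):
    # Enum lookup by value raises ValueError for names that are not defined.
    try:
        return enum_cls(name)
    except ValueError:
        return None


before = lookup(OldLabel, "photograph")
after = lookup(NewLabel, "photograph")
```

`before` is `None` because the old enum has no such member, while `after` resolves to `NewLabel.PHOTOGRAPH`.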
Signed-off-by: jhchoi1182 <jhchoi1182@gmail.com>
* fix: add all missing picture classification tokens from DocumentFigureClassifier-v2.0
The v2.0 model outputs 13 additional class names not defined in enums:
scatter_plot, box_plot, table, full_page_image, page_thumbnail,
chemistry_structure, screenshot_from_computer, screenshot_from_manual,
topographical_map, engineering_drawing, music, calendar, crossword_puzzle.
Signed-off-by: jhchoi1182 <jhchoi1182@gmail.com>
* fix: update extract_picture_classification to include v2.0 labels
The all_labels list in extract_picture_classification was missing the
newly added v2.0 classification labels, which would cause classification
info to be lost when loading documents from DocTags format.
Signed-off-by: jhchoi1182 <jhchoi1182@gmail.com>
* chore: reorganize picture classification labels and tokens
Reorganize the list of picture classification labels and tokens to separate
those produced by the current model from the rest.
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
---------
Signed-off-by: jhchoi1182 <jhchoi1182@gmail.com>
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* feat: add 'meta' field to DoclingDocument at root level
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* test(metadata): reuse DoclingDocument objects with fixtures
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* chore(serializer): add document-level meta to the outline serialization
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* test: add helper function to compare or regenerate ground truth files
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* test: regenerate (upgrade) ground truth files
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* refactor(serializer): set default indent to 2 spaces in outline serializer
To align with other indented text serializations, the outline serializer uses 2 spaces
as its default indent.
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* refactor(serializer): add line break after ref in outline md serialization
To render a line break after the reference element in the outline markdown serializer, add 2 trailing spaces.
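For context, two trailing spaces at the end of a line are the Markdown convention for a hard line break. A minimal sketch, where `ref_line` and the reference string are hypothetical and not the outline serializer's actual code:

```python
def ref_line(ref: str) -> str:
    # Two trailing spaces mark a hard line break in Markdown.
    return f"{ref}  "


lines = [ref_line("#/texts/0"), "Introduction"]
markdown = "\n".join(lines)
```

Without the trailing spaces, most Markdown renderers would join the reference and the following text into a single paragraph line.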
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* refactor: rollback adding 'meta' field to DoclingDocument
Revert the changes in commit 7e17c52 since DoclingDocument meta-information can
already be stored by leveraging the 'body' field, which is of type 'NodeItem'
and thus has a 'meta' field.
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
---------
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* feat: make an experimental outline serializer
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* work ongoing
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* working outline serializer
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* style(outline): align style of outline serializer to docling-core's
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* fix(serializer): ensure 'include_non_meta' works as designed by CommonParams
Ensure the optional parameter 'include_non_meta' works as designed by CommonParams when used by OutlineDocSerializer.
Add regression tests for OutlineDocSerializer.
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* chore(serializer): outline serializer to optionally return in JSON format
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* chore(serializer): enable a custom set of labels for TOC serializer
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* test(serializer): add summary to title in test data
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* fix(serializer): ensure outline serializer keeps structured fields in JSON format
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* chore(serializer): refactor outline serializer to create proper markdown
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* refactor(serializer): use pydantic models for the outline serializer JSON representation
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* test: rename outline serialization module to align it with others
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* chore(serializer): add indented text format for outline serializer
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
---------
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* feat: profile a document or collection
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* chore(profiler): add deciles and histograms
Add deciles and histograms to the Docling collection statistics.
Add an example script to plot histograms.
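As an illustration of the statistics named above, deciles and a simple histogram can be computed with the standard library alone. This is a sketch under assumptions: `deciles`, `histogram`, and the sample `page_counts` are hypothetical and not taken from the profiler's implementation.

```python
import statistics
from collections import Counter


def deciles(values):
    # Nine cut points that split the data into ten equal-frequency groups.
    return statistics.quantiles(values, n=10, method="inclusive")


def histogram(values, bin_width: int):
    # Map each value to the lower edge of its fixed-width bin and count.
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


page_counts = [1, 2, 2, 3, 5, 8, 13, 21, 34, 55, 89]
cuts = deciles(page_counts)
hist = histogram(page_counts, 10)
```

Fixed-width bins keep the example short; a skewed distribution like this one is the kind of case where plotting frequencies on a logarithmic scale can help.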
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* chore(profiler): add option to plot log frequencies in histogram
Add the option to plot the histogram frequencies in logarithmic scale.
Extend README with documentation on the document profiler.
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* test(profiler): cover missing lines in doc_profiler with tests
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
---------
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>