Commit Graph

34 Commits

Author SHA1 Message Date
Peter W. J. Staar 46a9b5a329 feat: add latex and Tikz as codelabels (#579)
* feat: add latex and Tikz as codelabels

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* fix the docs

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2026-04-09 06:46:18 +02:00
Panos Vagenas 0bd5d8e649 feat: add code representation meta field (#573)
* feat: add code representation meta field

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* add code language

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* restore custom field API changes

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

---------

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2026-03-30 17:43:39 +02:00
Vittorio Pippi fb3b603bc4 feat: add handwriting support (#561)
* feat: Add HANDWRITTEN_TEXT label support

Add full integration for the HANDWRITTEN_TEXT document item label.

Changes:
- tokens.py: Add HANDWRITTEN_TEXT to DocumentToken enum and mapping
- document.py: Add HANDWRITTEN_TEXT to DEFAULT_EXPORT_LABELS

Also adds test/test_handwritten_text_label.py for integration tests.

* DCO Remediation Commit for Vittorio Pippi <vpi@wila.zurich.ibm.com>

I, Vittorio Pippi <vpi@wila.zurich.ibm.com>, hereby add my Signed-off-by to this commit: acbdfa3902e74da44a5844ad0cecab8657da4904

Signed-off-by: Vittorio Pippi <vpi@wila.zurich.ibm.com>

* add handwriting support to Doclang

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* DCO Remediation Commit for Vittorio Pippi <vpi@wila.zurich.ibm.com>

I, Vittorio Pippi <vpi@wila.zurich.ibm.com>, hereby add my Signed-off-by to this commit: 000ccc55c5

Signed-off-by: Vittorio Pippi <vpi@wila.zurich.ibm.com>

---------

Signed-off-by: Vittorio Pippi <vpi@wila.zurich.ibm.com>
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
Co-authored-by: Vittorio Pippi <vpi@wila.zurich.ibm.com>
Co-authored-by: Panos Vagenas <pva@zurich.ibm.com>
2026-03-26 08:43:29 +01:00
Panos Vagenas b93d5a3920 feat: introduce field data model incl. Doclang serialization (#519)
* feat: introduce new kv data model

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* add test

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* extend tests

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* add form-table test

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* update invoice test

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* add migration

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* include pre-migration YAML

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* add nesting test

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* switch to field naming

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* improve prov migration & location serialization

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* enable proper nested serialization without inline

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* add field marker and test

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* generalize marker naming

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* updated API

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* align with TextItem conventions

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* make FieldItem a DocItem

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* extend KV migration scope, extend tree manipulation operations

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* extend kv migration tests

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* add test with KV nesting

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* improve key splitting and location assignment

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* address conflicts

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* minor cleanup

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* remove unnecessary files

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* add add_content parameter, expose export_to_doclang

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* bump DoclingDocument version

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* address conflicts, add save_as_doclang

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

---------

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2026-03-13 15:37:42 +01:00
Cesar Berrospi Ramis c8f3c01a61 feat: model and serializer for audio tracks (#426)
* refactor: move WebVTT data model from docling

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* fix(webvtt): deal with HTML entities in cue text spans

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* refactor(webvtt): support more WebVTT models

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* refactor(DoclingDocument): create a new provenance model for media file types

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* refactor(webvtt): make WebVTTTimestamp public

Since WebVTTTimestamp is used in DoclingDocument, the class should be public.
Strengthen validation of cue language start tag annotation.

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* refactor(webvtt): set languages to a list of strings in ProvenanceTrack

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* tests(webvtt): add test for ProvenanceTrack

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* refactor(webvtt): make all WebVTT classes public for reuse

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* chore(webvtt): preserve newlines as WebVTTLineTerminator

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* refactor(webvtt): set ProvenanceTrack time fields as float

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* chore(webvtt): ensure start time offsets are in sequence

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* chore(webvtt): improve regex to remove note,region,style blocks

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* chore(webvtt): parse the WebVTT file title

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* chore(webvtt): rebase to latest changes in idoctags

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* feat(webvtt): add WebVTT serializer

Add a DoclingDocument serializer to WebVTT format.
Improve WebVTT data model.

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* fix(webvtt): add 'text/vtt' as extra mimetype

Add 'text/vtt' as extra MIME type to support WebVTT serialization, since it is not
supported by 'mimetypes' with python < 3.11

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
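
The fix described above can be illustrated with the standard-library `mimetypes` module. This is only a sketch of the general technique, not the code from the commit; the file name `captions.vtt` is made up for illustration.

```python
import mimetypes

# Register the WebVTT MIME type explicitly so that lookups do not depend on
# the mapping shipped with a particular Python version.
mimetypes.add_type("text/vtt", ".vtt")

# "captions.vtt" is a made-up file name used only for illustration.
guessed_type, _encoding = mimetypes.guess_type("captions.vtt")
print(guessed_type)
```

Registering the type is harmless on interpreters that already know it, so the call can run unconditionally.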

* refactor(webvtt): roll back DocItem.prov as list of ProvenanceItem

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* tests(webvtt): fix test with STYLE and NOTE blocks

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* style(webvtt): apply X | Y annotation instead of Optional, Union

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* refactor(webvtt): simplify TrackProvenance model with tags

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* refactor(webvtt): align class and field names to new 'source' type

Classes and fields that are related to the new source type should align with their names.
The term 'provenance' will identify the legacy implementation.

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* chore(DoclingDocument): drop the validation on field assignment

Drop the validation on field assignment in NodeItem objects.
Add the 'source' argument in the convenience function 'add_text' to create TextItem with track source data.

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* refactor(webvtt): drop cue span classes, 'lang' and 'c' tags

Drop WebVTT formatting features not covered by Docling across formats.
Only 'u', 'b', 'i', and 'v' are supported and without classes.
Make 'v' tag explicit as 'voice' feature in SourceTrack class.

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

---------

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2026-01-27 14:45:03 +01:00
Florian Schwarb c73904e68e style: replace black, isort, flake8 and autoflake with ruff (#456)
* Added ruff to dev dependencies

Signed-off-by: Florian Schwarb <florian.schwarb@gmail.com>

* Added ruff settings to pyproject.toml as in docling

Signed-off-by: Florian Schwarb <florian.schwarb@gmail.com>

* Cleanup of pyproject.toml

Signed-off-by: Florian Schwarb <florian.schwarb@gmail.com>

* Copied settings for ruff pre-commit hooks from docling

Signed-off-by: Florian Schwarb <florian.schwarb@gmail.com>

* Excluded test/data/** from ruff formatting / linting

Signed-off-by: Florian Schwarb <florian.schwarb@gmail.com>

* ruff format

Signed-off-by: Florian Schwarb <florian.schwarb@gmail.com>

* Added some ignore statements to pyproject.toml such that ruff check raises fewer issues

Signed-off-by: Florian Schwarb <florian.schwarb@gmail.com>

* ruff check --fix

Signed-off-by: Florian Schwarb <florian.schwarb@gmail.com>

* Ignored some more rules

Signed-off-by: Florian Schwarb <florian.schwarb@gmail.com>

* Fixed the rest of the errors that would only concern 1 - 3 files

Signed-off-by: Florian Schwarb <florian.schwarb@gmail.com>

* Added another ignore related to df for DataFrame names

Signed-off-by: Florian Schwarb <florian.schwarb@gmail.com>

* Modified CONTRIBUTING.md such that black / isort are replaced by ruff

Signed-off-by: Florian Schwarb <florian.schwarb@gmail.com>

* Added UP045 to ignore list such that Optional[...] does not raise

Signed-off-by: Florian Schwarb <florian.schwarb@gmail.com>

* Moved .flake8 configs to pyproject.toml

Signed-off-by: Florian Schwarb <florian.schwarb@gmail.com>

* Moved autoflake to be used with ruff

Signed-off-by: Florian Schwarb <florian.schwarb@gmail.com>

* Moved all .flake8 settings to pyproject.toml to be compatible with ruff (i.e. no separate [tool.flake8] section)

Signed-off-by: Florian Schwarb <florian.schwarb@gmail.com>

* Removed flake8 from .pre-commit hooks

Signed-off-by: Florian Schwarb <florian.schwarb@gmail.com>

* Applied ruff format (again); formatted some files as the line-length = 120 equals now what was set for the .flake8 settings

Signed-off-by: Florian Schwarb <florian.schwarb@gmail.com>

* Set max-complexity to 30 (as was originally) in the pyproject.toml as one linting check would fail

Signed-off-by: Florian Schwarb <florian.schwarb@gmail.com>

* Adding PD901 to ignore list such that pre-commit hooks run fully again

Signed-off-by: Florian Schwarb <florian.schwarb@gmail.com>

* Replaced dtype | None syntax by Optional[dtype] in remaining places

Signed-off-by: Florian Schwarb <florian.schwarb@gmail.com>

* chore: fix 'test' ref in pyproject

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* style: remove typing List, Set, Tuple, Dict

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* style: remove UP015 check from ignore list

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* style: remove UP034 check from ignore list

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* style: normalize dashes in comments and docstrings

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* style: remove PD901 check from ignore list

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* style: remove C403 check from ignore list

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* style: remove C403, C413, C416 check from ignore list

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* style: remove E203, F811 check from ignore list

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

---------

Signed-off-by: Florian Schwarb <florian.schwarb@gmail.com>
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Co-authored-by: Florian Schwarb <florian.schwarb@gmail.com>
Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2026-01-13 17:03:10 +01:00
Siva 1d2e0c7ebe feat(DocItem): Add comments field for linking annotations to document… (#465)
* feat(DocItem): Add comments field for linking annotations to document items

Implements support for linking comments (from Word/PPT documents) to their
annotated content using the established FloatingItem/RefItem pattern.

Changes:
- Add `comments: List[RefItem]` field to DocItem class
- Update `_update_breadth_first_with_lookup()` to handle comment references on deletion
- Bump CURRENT_VERSION to 1.9.0
- Fix version comparison bug (string vs integer for minor version)
- Add 4 new tests for comments functionality
- Update test data files for new schema

Closes: docling-project/docling#464
Related: docling-project/docling#2834

Signed-off-by: s1v4-d <leelasaisivasubrahmanyamdurga@gmail.com>
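
The "string vs integer" version comparison bug mentioned in the list above belongs to a common class of errors, sketched below. The helper `parse_version` and the version `1.10.0` are hypothetical and chosen only for illustration; they are not taken from the commit.

```python
def parse_version(version: str) -> tuple[int, ...]:
    """Split a dotted version string into integers (hypothetical helper)."""
    return tuple(int(part) for part in version.split("."))


older, newer = "1.9.0", "1.10.0"

# Comparing the raw strings is lexicographic, so "1.10.0" sorts before "1.9.0".
string_says_newer = newer > older

# Comparing integer tuples respects the numeric order of each component.
tuple_says_newer = parse_version(newer) > parse_version(older)

print(string_says_newer, tuple_says_newer)
```

The string comparison only goes wrong once a component reaches two digits, which is why this kind of bug can stay hidden for a long time.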

* improve comment Pydantic serialization

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* add add_comment, update tests

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* introduce fine-granular references with span ranges

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* simplify last test

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

---------

Signed-off-by: s1v4-d <leelasaisivasubrahmanyamdurga@gmail.com>
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
Co-authored-by: Panos Vagenas <pva@zurich.ibm.com>
2026-01-08 05:00:57 +01:00
Peter W. J. Staar 09ef91c272 fix: add JSON to CodeLanguageLabel (#413)
* fix: add JSON to CodeLanguageLabel

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* fixed the docs

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2025-11-04 14:06:47 +01:00
Panos Vagenas 2ee3cacdd6 feat: add metadata model hierarchy (#408)
* feat: add metadata model hierarchy

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* add deprecation, add first migration

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* extend annotations migration

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* update with feedback

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* expose main prediction

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* ideas on enforcing separation between standard and custom fields

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* add custom field setter method

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* update Markdown serialization

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* revert description, add include_non_meta, showcase custom serializer for summaries

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* simplify customization

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* fix reference exclusion

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* eliminate serialization duplication between meta & (legacy) annotations

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* remove old file

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* fix item used in get_parts for meta ser

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* serialize GroupItem meta prior to content, DocItem meta after content

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* restore ser order for all nodeitems

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* move meta serialization into DocSerializer.serialize() to maintain seamless chunking integration

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* add allow- & block-lists for meta names, add std field name enum

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* add HTML serializer, document meta field names, rename SMILES field

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* bump DoclingDocument version

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* make TabularChartMetaField.title optional, expose new classes through __init__.py, add MetaUtils

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* add DocTags serialization, revert smiles to smi to prevent confusion with plural

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

---------

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2025-10-30 11:15:14 +01:00
Maxim Lysak b13267f18b feat: Introduction of fillable TableCell (#384)
* New properties for TableCell, corner_header and fillable

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Updated docs

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* bumped docling document version to 1.7.0

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

---------

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
2025-09-09 15:48:42 +02:00
Panos Vagenas 1d04154378 feat: add rich table cells (#368)
* feat: add rich table cells

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* propagate cell text resolution, cover row deletions

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* add doctags, fix referential integrity, expand tests, reenable mypy

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* bump DoclingDocument version

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* simplify / remove serialize_cell

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* update rich table cell refs in doc indexing

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* update notebook

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* expose new classes in `docling_core.types.doc`

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

---------

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2025-08-27 10:07:47 +02:00
Peter W. J. Staar eb2538eb3a feat: added different content-layers (#345)
* feat: added different content-layers

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* reformatted code

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* renamed concealed to invisible

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* fixed docs

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* Update docling_core/types/doc/document.py

Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>

* Update docs/DoclingDocument.json

Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>

---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>
Co-authored-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>
Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2025-07-02 17:28:53 +02:00
Panos Vagenas 14a4fdee87 feat: remodel lists, add MD & HTML ser. params, enable unset marker (#339)
* feat: remodel lists

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* prepare test document separate markers

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* add MD/HTML serializer options, expand test data

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* create list groups where not in place

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* add auto-increment logic

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* restore UnorderedList as deprecated alias (for backwards compatibility)

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* enable unset marker case

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* make ordered markers able to be "unset" (empty) too

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* rename default marker mode to auto

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

---------

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2025-06-27 14:52:25 +02:00
Christoph Auer aa430d3776 feat: New labels for CVAT annotation (#314)
* New labels and utility for CVAT annotation

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add EMPTY_VALUE to TextItem

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Small fixes

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* align usage of content layer param (#326)

* align usage of content layer param

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* parametrize content layers in visualizers

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

---------

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>
Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2025-06-11 19:25:13 +02:00
Panos Vagenas ae961299a5 feat: add subscript & superscript formatting (#319)
* feat: add subscript & superscript formatting

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* switch to enum

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

---------

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2025-06-06 09:01:32 +02:00
Panos Vagenas d8a5256b2c feat: add table annotations (#304)
* feat: add table annotations

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* refactor annotation types

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* expand to HTML

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* introduce annotation serializer

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* Update dummy_doc.yaml

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>

---------

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2025-06-05 15:38:46 +02:00
Cesar Berrospi Ramis 4a174b5679 chore: fix deprecation warnings (#303)
* chore: fix deprecation warnings

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

* chore: disregard deprecated captions from hierarchical chunker in hybrid chunker

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

* chore: update poetry lock file

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

---------

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
2025-05-26 15:58:08 +02:00
Christoph Auer aa957cf4b6 fix: Updates for labels and methods to support document GT annotation (#293)
Update labels and get_color

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-05-14 14:35:17 +02:00
Peter W. J. Staar 2f0f12160b feat: adding the label picture_group (#283)
* feat: adding the label picture_group

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* renamed group to area

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* reformatted the code

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* reformatted the docs

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2025-05-06 09:43:35 +02:00
Maxim Lysak e9259a5f87 feat: Support of DocTags charts (serialization and deserialization) (#229)
* Doctags charts deserialization into Docling Document, added PictureTabularChartData

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Correction for chart label mapping

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* simplified label mapping

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Improved documentation comment

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Added stacked_bar_chart to labels and tokens, as well as adjusted doctags_load

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* doctags serializer for charts

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* - Fixed DocTags table deserialization
- Updated tests

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* cleaning

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Allowing DocItemLabel.CHART as a label for PictureItem

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

---------

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
2025-04-07 12:27:48 +02:00
Panos Vagenas a7cdc87411 feat: add serializers, text formatting, update Markdown export (#182)
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2025-03-13 17:54:32 +01:00
Panos Vagenas 2abaf9b537 feat: add inline groups, revamp Markdown export incl. list groups (#156)
* feat: add inline groups, update markdown export

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>

* rename parameter for clarity

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>

* drop redundant `traverse_inline`, use only `visited`

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* update API usage after rebase

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* minor refactor

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* extend nesting logic to list groups

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* (MD export) add inline formulas, placeholders for `KeyValueItem`, `FormItem`

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* include 3rd-level list item in test

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* update test file with more recent conversion result

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* update list construction pattern so that sublists are added to actual ListItems

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

---------

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2025-02-25 09:48:10 +01:00
Saidgurbuz d622800750 feat: Introduce Key-Value and Forms items (#158)
* Draft KeyValueItem content

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* added add_key_value_item method

Signed-off-by: Saidgurbuz <said.gurbuz@epfl.ch>

* add KeyValueLink

Signed-off-by: Saidgurbuz <said.gurbuz@epfl.ch>

* update an add_key_value_item argument

Signed-off-by: Saidgurbuz <said.gurbuz@epfl.ch>

* remove KeyOrValueCellType and KeyValueLinkType

Signed-off-by: Saidgurbuz <said.gurbuz@epfl.ch>

* update tests for KeyValueItem

Signed-off-by: Saidgurbuz <said.gurbuz@epfl.ch>

* add union method to create bbox that covers all the given bboxes

Signed-off-by: Saidgurbuz <said.gurbuz@epfl.ch>
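
A minimal sketch of such a union operation follows. It assumes a top-left coordinate origin, and the `Box` class, its field names, and `enclosing_box` are hypothetical stand-ins rather than the library's actual bounding-box type.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Box:
    """Hypothetical axis-aligned box; assumes a top-left origin (top <= bottom)."""

    left: float
    top: float
    right: float
    bottom: float


def enclosing_box(boxes: Iterable[Box]) -> Box:
    """Return the smallest box that covers every box in the input."""
    items = list(boxes)
    if not items:
        raise ValueError("at least one box is required")
    return Box(
        left=min(b.left for b in items),
        top=min(b.top for b in items),
        right=max(b.right for b in items),
        bottom=max(b.bottom for b in items),
    )


merged = enclosing_box([Box(0, 0, 10, 5), Box(8, 2, 20, 12)])
print(merged)
```

With a bottom-left origin, the `min` and `max` on the vertical edges would have to be swapped.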

* added the , and rewrote it to make it more general with a GraphItem, as well as having cell- and link-labels

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* updated the code due to MyPy and reformatted

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* updated the reference docs with form-key

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* updated image

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* removed the circle

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* fixed the figure, testing now ...

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* added square in image

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* removed square in image

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* tests should go through

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* update poetry.lock versions

Signed-off-by: Saidgurbuz <said.gurbuz@epfl.ch>

* fix import issue

Signed-off-by: Saidgurbuz <said.gurbuz@epfl.ch>

* add validator for links in GraphData

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* added field_validator

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* rename add_key_values and add_form

Signed-off-by: Saidgurbuz <said.gurbuz@epfl.ch>

* rename union to enclosingbbox

Signed-off-by: Saidgurbuz <said.gurbuz@epfl.ch>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Saidgurbuz <said.gurbuz@epfl.ch>
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Peter Staar <taa@zurich.ibm.com>
2025-02-19 09:50:45 +01:00
Christoph Auer 7267c3f571 fix: Fix inheritance of CodeItem for backward compatibility (#162)
* fix: Fix inheritance of CodeItem for backward compatibility

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update docs

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-02-14 17:48:34 +01:00
Peter W. J. Staar 916323fb55 feat: Redefine CodeItem as floating object with captions (#160)
* updated the CodeItem with captions

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* use FloatItem and add captions to add_code interface

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2025-02-14 11:34:40 +01:00
Christoph Auer 786f0c6833 feat: Add ContentLayer attribute to designate items to body or furniture (#148)
* feat: Add ContentLayer attribute to designate items to body or furniture

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* introduce safer data gen mechanism, update chunking test data

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>

* Do not make test rely on order in yaml

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* chore: format fixes

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* fix: legacy_to_docling_doc must use content_layer

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add content_layer in iterate_items

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Bump format version, add model_validator for old page_header,page_footer in body

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* fix: Change to before model_validator

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update tests

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Address review comments

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2025-02-10 10:58:41 +01:00
Matteo c940aa5ca9 feat: Add CodeItem as pydantic type, update export methods and APIs (#129)
* added code item

* added code item

* added code item

Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>

* added code item

Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>

* added code item

Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>

* added code item

Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>

* added code item

Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>

* add constraints to allow numpy > 2.1.0 on python3.13 and others

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* Add CodeItem to ContentItem

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* added CodeItem in ContentItem tagged union.

* added enum for programming languages

* removed double CodeItem in ContentItem Union

* fixed type of code_language in CodeItem class

* fixed sorting of programming languages, not sorted anymore by value of string but variable name

---------

Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
2025-01-17 16:59:33 +01:00
Peter W. J. Staar 5101dd8845 feat: added the new label comment_section in the groups (#114)
* added the new label comment_section in the groups

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* reformatted the code

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2024-12-17 14:23:18 +01:00
Christoph Auer aeaf89de10 feat: Add group labels for form and key-value areas (#110)
Add group labels for form and key-value

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-16 13:40:01 +01:00
Panos Vagenas 047a1960af fix: improve doc item typing (#105)
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-12-13 12:51:53 +01:00
Cesar Berrospi Ramis 266e7fc603 build: set pydantic version to 2.10.3 (#93)
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
2024-12-04 14:26:22 +01:00
Peter W. J. Staar ef49fd3f34 feat: adding HTML export to DoclingDocument, adding export of images in png with links to Markdown & HTML (#69)
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2024-11-27 05:27:23 +01:00
Matteo 36b7bea53a feat: added pydantic models to store charts data (pie, bar, stacked bar, line, scatter) (#52)
* added pydantic models to store charts data (pie, bar, line)

Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>

* changed chart models hierarchy structure, added StackedBarChart class

Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>

* added scatter chart and addressed Peter's comments

Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>

* fixed names of classes

Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>

---------

Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>
Co-authored-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>
2024-10-29 15:49:09 +01:00
Panos Vagenas e12d6a70c3 fix: fix legacy doc ref (#48)
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-10-18 15:21:01 +02:00