* refactor: move WebVTT data model from docling
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* fix(webvtt): deal with HTML entities in cue text spans
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
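As a rough illustration of the HTML entity fix above, character references in cue text can be decoded with the standard library. The helper name `unescape_cue_text` is hypothetical and is not taken from the actual change; this is only a sketch of the general idea.

```python
import html


def unescape_cue_text(text: str) -> str:
    """Decode HTML character references such as &amp;, &lt; or &nbsp;.

    Hypothetical helper for illustration; the real parser may treat
    entities differently.
    """
    return html.unescape(text)


if __name__ == "__main__":
    print(unescape_cue_text("Tom &amp; Jerry &lt;3"))
```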
* refactor(webvtt): support more WebVTT models
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* refactor(DoclingDocument): create a new provenance model for media file types
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* refactor(webvtt): make WebVTTTimestamp public
Since WebVTTTimestamp is used in DoclingDocument, the class should be public.
Strengthen validation of cue language start tag annotation.
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* refactor(webvtt): set languages to a list of strings in ProvenanceTrack
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* tests(webvtt): add test for ProvenanceTrack
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* refactor(webvtt): make all WebVTT classes public for reuse
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* chore(webvtt): preserve newlines as WebVTTLineTerminator
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* refactor(webvtt): set ProvenanceTrack time fields as float
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* chore(webvtt): ensure start time offsets are in sequence
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
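A minimal sketch of the two changes above (time fields as float, start time offsets in sequence), assuming timestamps in the WebVTT form `[hh:]mm:ss.ttt`. The function names are hypothetical and the actual model may validate differently.

```python
import re

# WebVTT timestamp: optional hours (2+ digits), then mm:ss.ttt
_TIMESTAMP = re.compile(r"^(?:(\d{2,}):)?([0-5]\d):([0-5]\d)\.(\d{3})$")


def timestamp_to_seconds(ts: str) -> float:
    """Convert a WebVTT timestamp string to seconds as a float."""
    match = _TIMESTAMP.match(ts)
    if match is None:
        raise ValueError(f"invalid WebVTT timestamp: {ts!r}")
    hours, minutes, seconds, millis = match.groups()
    return (
        int(hours or 0) * 3600
        + int(minutes) * 60
        + int(seconds)
        + int(millis) / 1000
    )


def starts_in_sequence(starts: list[str]) -> bool:
    """Return True if each start time is >= the previous start time."""
    secs = [timestamp_to_seconds(s) for s in starts]
    return all(a <= b for a, b in zip(secs, secs[1:]))
```

Equal start times are accepted here, on the assumption that only a decreasing start time should be rejected.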
* chore(webvtt): improve regex to remove note,region,style blocks
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* chore(webvtt): parse the WebVTT file title
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* chore(webvtt): rebase to latest changes in idoctags
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* feat(webvtt): add WebVTT serializer
Add a DoclingDocument serializer to WebVTT format.
Improve WebVTT data model.
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* fix(webvtt): add 'text/vtt' as extra mimetype
Add 'text/vtt' as an extra MIME type to support WebVTT serialization, since it is not
supported by the 'mimetypes' module with Python < 3.11.
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
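For reference, registering the extra MIME type with the standard library could look like the sketch below. Where the real change registers it is not shown in this log, so the placement and the file name `captions.vtt` are assumptions.

```python
import mimetypes

# Register the mapping explicitly so that lookups behave the same on
# interpreters whose built-in table does not include '.vtt'.
mimetypes.add_type("text/vtt", ".vtt")

guessed_type, _encoding = mimetypes.guess_type("captions.vtt")
print(guessed_type)
```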
* refactor(webvtt): roll back DocItem.prov as list of ProvenanceItem
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* tests(webvtt): fix test with STYLE and NOTE blocks
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* style(webvtt): apply X | Y annotation instead of Optional, Union
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* refactor(webvtt): simplify TrackProvenance model with tags
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* refactor(webvtt): align class and field names to new 'source' type
Classes and fields related to the new 'source' type should align their names with it.
The term 'provenance' will identify the legacy implementation.
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* chore(DoclingDocument): drop the validation on field assignment
Drop the validation on field assignment in NodeItem objects.
Add the 'source' argument to the convenience function 'add_text' to create a TextItem with track source data.
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* refactor(webvtt): drop cue span classes, 'lang' and 'c' tags
Drop WebVTT formatting features not covered by Docling across formats.
Only 'u', 'b', 'i', and 'v' are supported and without classes.
Make 'v' tag explicit as 'voice' feature in SourceTrack class.
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
---------
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* Added ruff to dev dependencies
Signed-off-by: Florian Schwarb <florian.schwarb@gmail.com>
* Added ruff settings to pyproject.toml as in docling
Signed-off-by: Florian Schwarb <florian.schwarb@gmail.com>
* Cleanup of pyproject.toml
Signed-off-by: Florian Schwarb <florian.schwarb@gmail.com>
* Copied settings for ruff pre-commit hooks from docling
Signed-off-by: Florian Schwarb <florian.schwarb@gmail.com>
* Excluded test/data/** from ruff formatting / linting
Signed-off-by: Florian Schwarb <florian.schwarb@gmail.com>
* ruff format
Signed-off-by: Florian Schwarb <florian.schwarb@gmail.com>
* Added some ignore statements to pyproject.toml such that ruff check raises fewer issues
Signed-off-by: Florian Schwarb <florian.schwarb@gmail.com>
* ruff check --fix
Signed-off-by: Florian Schwarb <florian.schwarb@gmail.com>
* Ignored some more rules
Signed-off-by: Florian Schwarb <florian.schwarb@gmail.com>
* Fixed the rest of the errors that would only concern 1 - 3 files
Signed-off-by: Florian Schwarb <florian.schwarb@gmail.com>
* Added another ignore related to df for DataFrame names
Signed-off-by: Florian Schwarb <florian.schwarb@gmail.com>
* Modified CONTRIBUTING.md such that black / isort are replaced by ruff
Signed-off-by: Florian Schwarb <florian.schwarb@gmail.com>
* Added UP045 to ignore list such that Optional[...] does not raise
Signed-off-by: Florian Schwarb <florian.schwarb@gmail.com>
* Moved .flake8 configs to pyproject.toml
Signed-off-by: Florian Schwarb <florian.schwarb@gmail.com>
* Moved autoflake to be used with ruff
Signed-off-by: Florian Schwarb <florian.schwarb@gmail.com>
* Moved all .flake8 settings to pyproject.toml to be compatible with ruff (i.e. no separate [tool.flake8] section)
Signed-off-by: Florian Schwarb <florian.schwarb@gmail.com>
* Removed flake8 from .pre-commit hooks
Signed-off-by: Florian Schwarb <florian.schwarb@gmail.com>
* Applied ruff format (again); some files were reformatted because line-length = 120 now matches what was set in the .flake8 settings
Signed-off-by: Florian Schwarb <florian.schwarb@gmail.com>
* Set max-complexity to 30 (as it was originally) in pyproject.toml, as one linting check would otherwise fail
Signed-off-by: Florian Schwarb <florian.schwarb@gmail.com>
* Adding PD901 to ignore list such that pre-commit hooks run fully again
Signed-off-by: Florian Schwarb <florian.schwarb@gmail.com>
* Replaced dtype | None syntax by Optional[dtype] in remaining places
Signed-off-by: Florian Schwarb <florian.schwarb@gmail.com>
* chore: fix 'test' ref in pyproject
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* style: remove typing List, Set, Tuple, Dict
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* style: remove UP015 check from ignore list
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* style: remove UP034 check from ignore list
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* style: normalize dashes in comments and docstrings
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* style: remove PD901 check from ignore list
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* style: remove C403 check from ignore list
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* style: remove C403, C413, C416 check from ignore list
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* style: remove E203, F811 check from ignore list
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
---------
Signed-off-by: Florian Schwarb <florian.schwarb@gmail.com>
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Co-authored-by: Florian Schwarb <florian.schwarb@gmail.com>
Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* feat(DocItem): Add comments field for linking annotations to document items
Implements support for linking comments (from Word/PPT documents) to their
annotated content using the established FloatingItem/RefItem pattern.
Changes:
- Add `comments: List[RefItem]` field to DocItem class
- Update `_update_breadth_first_with_lookup()` to handle comment references on deletion
- Bump CURRENT_VERSION to 1.9.0
- Fix version comparison bug (string vs integer for minor version)
- Add 4 new tests for comments functionality
- Update test data files for new schema
Closes: docling-project/docling#464
Related: docling-project/docling#2834
Signed-off-by: s1v4-d <leelasaisivasubrahmanyamdurga@gmail.com>
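The version comparison bug mentioned above (string vs integer for the minor version) is easy to reproduce in isolation. The sketch below is a generic illustration, not the project's actual check, and `parse_version` is a hypothetical helper.

```python
def parse_version(version: str) -> tuple[int, ...]:
    """Split a dotted version string into a tuple of integers."""
    return tuple(int(part) for part in version.split("."))


# Comparing the minor parts as strings is lexicographic, so "10" sorts
# before "9" even though 10 > 9.
assert "10" < "9"

# Comparing integer tuples gives the intended ordering.
assert parse_version("1.10.0") > parse_version("1.9.0")
```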
* improve comment Pydantic serialization
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
* add add_comment, update tests
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
* introduce fine-grained references with span ranges
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
* simplify last test
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
---------
Signed-off-by: s1v4-d <leelasaisivasubrahmanyamdurga@gmail.com>
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
Co-authored-by: Panos Vagenas <pva@zurich.ibm.com>
* feat: add inline groups, update markdown export
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
* rename parameter for clarity
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
* drop redundant `traverse_inline`, use only `visited`
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
* update API usage after rebase
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
* minor refactor
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
* extend nesting logic to list groups
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
* (MD export) add inline formulas, placeholders for `KeyValueItem`, `FormItem`
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
* include 3rd-level list item in test
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
* update test file with more recent conversion result
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
* update list construction pattern so that sublists are added to actual ListItems
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
---------
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
* Draft KeyValueItem content
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* added add_key_value_item method
Signed-off-by: Saidgurbuz <said.gurbuz@epfl.ch>
* add KeyValueLink
Signed-off-by: Saidgurbuz <said.gurbuz@epfl.ch>
* update an add_key_value_item argument
Signed-off-by: Saidgurbuz <said.gurbuz@epfl.ch>
* remove KeyOrValueCellType and KeyValueLinkType
Signed-off-by: Saidgurbuz <said.gurbuz@epfl.ch>
* update tests for KeyValueItem
Signed-off-by: Saidgurbuz <said.gurbuz@epfl.ch>
* add union method to create bbox that covers all the given bboxes
Signed-off-by: Saidgurbuz <said.gurbuz@epfl.ch>
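A minimal sketch of what a bounding box union like the one above computes: the smallest box that covers all given boxes. The `Box` class and `enclosing_box` function are hypothetical stand-ins, not the library's own types, and the sketch assumes each box satisfies x0 <= x1 and y0 <= y1.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Box:
    """Axis-aligned box with x0 <= x1 and y0 <= y1 (illustrative only)."""

    x0: float
    y0: float
    x1: float
    y1: float


def enclosing_box(boxes: Iterable[Box]) -> Box:
    """Return the smallest box that covers every box in `boxes`."""
    items = list(boxes)
    if not items:
        raise ValueError("at least one box is required")
    return Box(
        x0=min(b.x0 for b in items),
        y0=min(b.y0 for b in items),
        x1=max(b.x1 for b in items),
        y1=max(b.y1 for b in items),
    )
```

Raising on an empty input is a design choice made for this sketch; returning an optional value would be an equally reasonable alternative.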
* added the , and rewrote it to make it more general with a GraphItem, as well as having cell- and link-labels
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* updated the code due to MyPy and reformatted
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* updated the reference docs with form-key
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* updated image
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* removed the circle
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* fixed the figure, testing now ...
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* added square in image
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* removed square in image
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* tests should go through
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* update poetry.lock versions
Signed-off-by: Saidgurbuz <said.gurbuz@epfl.ch>
* fix import issue
Signed-off-by: Saidgurbuz <said.gurbuz@epfl.ch>
* add validator for links in GraphData
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* added field_validator
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* rename add_key_values and add_form
Signed-off-by: Saidgurbuz <said.gurbuz@epfl.ch>
* rename union to enclosingbbox
Signed-off-by: Saidgurbuz <said.gurbuz@epfl.ch>
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Saidgurbuz <said.gurbuz@epfl.ch>
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Peter Staar <taa@zurich.ibm.com>
* fix: Fix inheritance of CodeItem for backward compatibility
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Update docs
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* added the new label comment_section in the groups
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* reformatted the code
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
---------
Signed-off-by: Peter Staar <taa@zurich.ibm.com>