Cesar Berrospi Ramis c8f3c01a61 feat: model and serializer for audio tracks (#426)
* refactor: move WebVTT data model from docling

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* fix(webvtt): deal with HTML entities in cue text spans

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
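
As an illustration of the HTML entity fix above, the sketch below decodes character references in a cue text span with the standard library. The helper name `decode_cue_text` is hypothetical and is not taken from this change; the actual fix may handle entities differently.

```python
import html


def decode_cue_text(raw: str) -> str:
    """Decode HTML character references (e.g. &amp;, &lt;) in a cue text span.

    Hypothetical helper for illustration only; not the code from this commit.
    """
    return html.unescape(raw)


decoded = decode_cue_text("Tom &amp; Jerry &lt;3")
print(decoded)
```

Escaping on the way out (for example with `html.escape(text, quote=False)`) would be the natural counterpart when writing cue text back to a WebVTT file.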

* refactor(webvtt): support more WebVTT models

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* refactor(DoclingDocument): create a new provenance model for media file types

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* refactor(webvtt): make WebVTTTimestamp public

Since WebVTTTimestamp is used in DoclingDocument, the class should be public.
Strengthen validation of cue language start tag annotation.

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* refactor(webvtt): set languages to a list of strings in ProvenanceTrack

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* tests(webvtt): add test for ProvenanceTrack

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* refactor(webvtt): make all WebVTT classes public for reuse

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* chore(webvtt): preserve newlines as WebVTTLineTerminator

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* refactor(webvtt): set ProvenanceTrack time fields as float

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
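
To make the float time fields above concrete, here is a minimal sketch that converts a WebVTT timestamp string into seconds. The function name `timestamp_to_seconds` and the regular expression are assumptions for illustration; they are not the WebVTTTimestamp implementation from this change.

```python
import re

# WebVTT timestamps look like "mm:ss.ttt" or "hh:mm:ss.ttt" (hours optional).
_TIMESTAMP = re.compile(r"^(?:(\d{2,}):)?([0-5]\d):([0-5]\d)\.(\d{3})$")


def timestamp_to_seconds(value: str) -> float:
    """Convert a WebVTT timestamp string to seconds (hypothetical helper)."""
    match = _TIMESTAMP.match(value)
    if match is None:
        raise ValueError(f"Invalid WebVTT timestamp: {value!r}")
    hours, minutes, seconds, millis = match.groups()
    return (
        int(hours or 0) * 3600
        + int(minutes) * 60
        + int(seconds)
        + int(millis) / 1000
    )


print(timestamp_to_seconds("00:01:02.500"))
print(timestamp_to_seconds("01:02.500"))
```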

* chore(webvtt): ensure start time offsets are in sequence

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* chore(webvtt): improve regex to remove note,region,style blocks

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
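
One possible way to strip NOTE, REGION, and STYLE blocks with a regular expression is sketched below. The pattern and the name `strip_metadata_blocks` are assumptions, not the regex from this change. The sketch assumes line endings are already normalized to `\n` and treats any block that follows a blank line and starts with one of the three keywords as removable.

```python
import re

# A block starts right after a blank line and runs until the next blank line
# (or the end of the input). Assumes "\n" line endings.
_METADATA_BLOCK = re.compile(
    r"(?<=\n\n)(?:NOTE|REGION|STYLE)\b.*?(?:\n\n|\Z)",
    re.DOTALL,
)


def strip_metadata_blocks(text: str) -> str:
    """Remove NOTE, REGION, and STYLE blocks from WebVTT text (illustrative)."""
    return _METADATA_BLOCK.sub("", text)


sample = (
    "WEBVTT\n\n"
    "NOTE this is a comment\n\n"
    "STYLE\n::cue { color: red }\n\n"
    "00:00.000 --> 00:01.000\nHello\n"
)
cleaned = strip_metadata_blocks(sample)
print(cleaned)
```

A cue whose identifier line begins with one of these keywords would also be removed by this pattern, so a real parser needs a stricter check.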

* chore(webvtt): parse the WebVTT file title

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* chore(webvtt): rebase to latest changes in idoctags

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* feat(webvtt): add WebVTT serializer

Add a DoclingDocument serializer to WebVTT format.
Improve WebVTT data model.

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* fix(webvtt): add 'text/vtt' as extra mimetype

Add 'text/vtt' as an extra MIME type to support WebVTT serialization, since it is not
supported by 'mimetypes' with Python < 3.11

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* refactor(webvtt): roll back DocItem.prov as list of ProvenanceItem

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* tests(webvtt): fix test with STYLE and NOTE blocks

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* style(webvtt): apply X | Y annotation instead of Optional, Union

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* refactor(webvtt): simplify TrackProvenance model with tags

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* refactor(webvtt): align class and field names to new 'source' type

Names of classes and fields related to the new 'source' type should align with it.
The term 'provenance' will identify the legacy implementation.

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* chore(DoclingDocument): drop the validation on field assignment

Drop the validation on field assignment in NodeItem objects.
Add the 'source' argument to the convenience function 'add_text' to create a TextItem with track source data.

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* refactor(webvtt): drop cue span classes, 'lang' and 'c' tags

Drop WebVTT formatting features not covered by Docling across formats.
Only 'u', 'b', 'i', and 'v' are supported and without classes.
Make 'v' tag explicit as 'voice' feature in SourceTrack class.

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

---------

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2026-01-27 14:45:03 +01:00