fix(html): preserve fragment-only anchor links during path resolution
Fragment-only hrefs (e.g. href="#section1") were resolved as filesystem
paths when source_uri was set, breaking internal document navigation.
Add '#' to the skip-resolution prefixes in _resolve_relative_path() so
fragment links pass through unchanged.
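The pass-through behavior can be sketched as follows. This is a minimal illustration, not the backend's actual code: the function name `resolve_href`, the `SKIP_PREFIXES` tuple and its contents other than '#', and the use of `PurePosixPath` are all assumptions made for the example.

```python
from pathlib import PurePosixPath
from typing import Optional

# Hypothetical prefix list; only '#' is taken from the commit message.
SKIP_PREFIXES = ("http://", "https://", "data:", "#")


def resolve_href(href: str, source_uri: Optional[str]) -> str:
    """Resolve a relative href against source_uri unless it must be skipped."""
    if source_uri is None or href.startswith(SKIP_PREFIXES):
        # Fragment-only links such as "#section1" are returned unchanged.
        return href
    # Relative paths are resolved against the directory of the source file.
    return str(PurePosixPath(source_uri).parent / href)
```

With this sketch, `resolve_href("#section1", "/data/site/index.html")` returns the fragment as-is, while `resolve_href("img/a.png", "/data/site/index.html")` is resolved against `/data/site`.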
Partially addresses #2929
Signed-off-by: aatrey56 <aatrey.sahay@gmail.com>
- Implement an HTML backend that optionally uses a headless browser (via Playwright) to render HTML pages into images, and adds provenance with bounding boxes to all elements in the converted docling document.
- Conversion preserves the reading order given by the HTML DOM tree
- Added support for HTML "input" fields: checkboxes, radio buttons, text inputs, etc.
- Added support for a key-value convention in HTML (i.e. elements with ids "key1" and "key1_value1" are paired as key-values; see the test cases for examples)
- Heuristic that glues independent inline HTML elements containing single-character text into larger text blocks
- Support for inline styling (bold, italic, etc.)
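The key-value id convention mentioned in the list above could be sketched like this. The exact matching rule is an assumption generalized from the single "key1" / "key1_value1" example; the helper name `pair_key_values` and the `_value<N>` suffix pattern are hypothetical and may differ from the backend.

```python
import re

# Assumed convention: a value id is "<key id>_value<number>".
VALUE_ID = re.compile(r"^(?P<key>.+)_value\d+$")


def pair_key_values(element_ids):
    """Map each key id to the ids of the value elements that belong to it."""
    ids = list(element_ids)
    known = set(ids)
    pairs = {}
    for element_id in ids:
        match = VALUE_ID.match(element_id)
        # Only pair a value when its key element is present in the document.
        if match and match.group("key") in known:
            pairs.setdefault(match.group("key"), []).append(element_id)
    return pairs
```

For example, `pair_key_values(["key1", "key1_value1", "other"])` pairs "key1" with "key1_value1" and ignores "other".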
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
* fix(html): fix broken document tree and quadratic PictureItems in rich table cells
Three related bugs in the HTML backend when processing table cells that
contain rich content (RichTableCell), as found on Wikipedia pages with
large reference, taxobox, or classification tables:
Bug 1 — orphaned InlineGroups causing broken parent/child relationships
------------------------------------------------------------------------
When _use_inline_group() created an InlineGroup node (for paragraphs
containing multiple hyperlinks, e.g. "text <a> and <a>"), it was added
as a child of the current parent via doc.add_group(), but its RefItem was
never appended to added_refs / provs_in_cell. This meant:
- group_cell_elements() reparented the text items inside the InlineGroup
(because their individual refs WERE in added_refs), moving them from
body → outer_group_element.
- The InlineGroup itself remained in body.children still pointing to
those same text items as its .children.
- Result: two nodes (InlineGroup and outer_group_element) claimed the
same child items, with contradictory .parent pointers. This broken
tree caused double-serialization of text items in export_to_markdown().
Fix: make _use_inline_group() yield the RefItem of the created group.
Callers (_flush_buffer, _handle_block, _handle_list) now track the
InlineGroup ref instead of individual leaf refs when a group was created.
group_cell_elements() then reparents the whole InlineGroup (with its
children intact) rather than orphaning it.
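The effect of tracking the group ref instead of the leaf refs can be sketched with a toy tree. The `Node` class and the `reparent` and `is_consistent` helpers are hypothetical stand-ins for illustration, not docling-core types.

```python
class Node:
    """Toy tree node with a parent pointer and an ordered child list."""

    def __init__(self, name):
        self.name = name
        self.parent = None
        self.children = []

    def add(self, child):
        child.parent = self
        self.children.append(child)
        return child


def reparent(nodes, new_parent):
    """Move each node (together with its subtree) under new_parent."""
    for node in nodes:
        if node.parent is not None:
            node.parent.children.remove(node)
        new_parent.add(node)


def is_consistent(root):
    """Check that every child's parent pointer refers back to its parent."""
    return all(
        child.parent is root and is_consistent(child) for child in root.children
    )


body = Node("body")
inline = body.add(Node("inline_group"))
inline.add(Node("text_a"))
inline.add(Node("text_b"))
cell_group = body.add(Node("cell_group"))

# Track the group itself, not its leaves, so the subtree moves as one unit.
reparent([inline], cell_group)
```

After the move, the inline group sits under `cell_group` with both text items still attached, and no node is claimed by two parents.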
Bug 2 — quadratic PictureItem creation from stray outer image loop
-------------------------------------------------------------------
In _handle_block() for <table> tags, after parse_table_data() had already
walked the entire table subtree (including nested tables) and emitted
PictureItems for every <img>, there was an additional outer loop:
    for img_tag in tag("img"):
        im_ref2 = self._emit_image(tag, doc)
Because BeautifulSoup's .find_all("img") on a tag finds ALL descendant
<img> elements (including those in nested tables), this loop processed
every image in the entire subtree again. A table nested N levels deep
caused N*(N+1)/2 duplicate PictureItems per image (quadratic growth).
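The growth can be reproduced with a toy model. The sketch assumes one image placed directly in each nesting level and a loop that re-scans all descendant images once per table; `Node`, `nested_tables` and `count_outer_loop_emissions` are hypothetical names, not backend code.

```python
class Node:
    """Toy element with a tag name and child elements."""

    def __init__(self, name, children=None):
        self.name = name
        self.children = children or []

    def find_all(self, name):
        """Return every descendant with the given name, depth-first."""
        found = []
        for child in self.children:
            if child.name == name:
                found.append(child)
            found.extend(child.find_all(name))
        return found


def nested_tables(depth):
    """Build `depth` (>= 1) nested tables, each holding one image directly."""
    table = None
    for _ in range(depth):
        children = [Node("img")]
        if table is not None:
            children.append(table)
        table = Node("table", children)
    return table


def count_outer_loop_emissions(table):
    """Count emissions when every table re-scans all of its descendant images."""
    total = len(table.find_all("img"))
    for nested in table.find_all("table"):
        total += len(nested.find_all("img"))
    return total
```

Under these assumptions, a table nested 3 levels deep yields 3 + 2 + 1 = 6 emissions for 3 images, matching N*(N+1)/2.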
Fix: remove the outer loop. Images are already handled by parse_table_data()
-> _use_table_cell_context() -> _walk() -> _emit_image().
Bug 3 — missing space separator between nested table cell text
--------------------------------------------------------------
HTMLDocumentBackend.get_text() uses _extract_text_recursively(), which
only appended a trailing space for <p> and <li> tags. When a table cell
contained a nested <table>, adjacent <th> or <td> elements without
whitespace NavigableString nodes between them were concatenated directly
(e.g. "TypeSound" instead of "Type Sound").
Fix: add "th" and "td" to the trailing-space tag set so that the text
content of each cell is separated by a space.
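The effect of the trailing-space tag set can be sketched with the standard library's `html.parser` as a stand-in for the BeautifulSoup-based `_extract_text_recursively()`. The `TextExtractor` class and `extract_text` function are hypothetical and only illustrate the idea.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text, appending a space after selected closing tags."""

    def __init__(self, space_after):
        super().__init__()
        self.space_after = set(space_after)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

    def handle_endtag(self, tag):
        if tag in self.space_after:
            self.parts.append(" ")


def extract_text(html, space_after):
    """Return the text of `html`, separated according to `space_after`."""
    parser = TextExtractor(space_after)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip()


CELLS = "<table><tr><th>Type</th><td>Sound</td></tr></table>"
before = extract_text(CELLS, {"p", "li"})              # "TypeSound"
after = extract_text(CELLS, {"p", "li", "th", "td"})   # "Type Sound"
```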
Bug 1 and Bug 2 were introduced in docling v2.55.0 (commit c803abe) with
rich table cell support.
Signed-off-by: Ivan Traus <ivan@liminary.io>
* test(html): align markdown fixtures with current docling-core behavior
Signed-off-by: Ivan Traus <ivan@liminary.io>
* test(xbrl): update XBRL fixture after get_text() cell spacing fix
The Bug 3 fix (adding th/td to trailing-space tags in get_text())
affects the XBRL backend which internally uses HTMLDocumentBackend.
Regenerate the mlac-20251231 fixture to match the corrected text
extraction.
Signed-off-by: Ivan Traus <ivan@liminary.io>
* chore(deps): bump docling-core to 2.67.1, regenerate fixtures and trim tests
Update uv.lock to pull in the merged nested-table flattening fix
(docling-core#525). Regenerate markdown fixtures that now show flattened
text instead of invalid embedded table syntax. Trim verbose test
docstrings and remove narrating comments.
Signed-off-by: Ivan Traus <ivan@liminary.io>
* fix: annotate _use_inline_group return type and regenerate docx fixtures
Add Generator[RefItem | None, None, None] return type and Google-style
Yields section to _use_inline_group. Regenerate docx ground truth
fixtures affected by docling-core 2.67.1 nested-table flattening.
Signed-off-by: Ivan Traus <ivan@liminary.io>
* refactor: use Iterator type hint and remove redundant test
Apply feedback: use Iterator instead of Generator, drop type from Yields docstring, and remove
test_e2e_rich_table_cells_markdown (already covered by test_e2e_html_conversions).
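A minimal sketch of the annotation style described above: a generator-based context manager annotated with Iterator, with an untyped Google-style Yields section. The function body, the string refs, and the `collect_refs` caller are hypothetical placeholders rather than the backend's implementation.

```python
from collections.abc import Iterator
from contextlib import contextmanager
from typing import Optional


@contextmanager
def use_inline_group(count: int, groups: list[str]) -> Iterator[Optional[str]]:
    """Create an inline group when more than one annotated text is present.

    Yields:
        Reference of the created group, or None if no group was needed.
    """
    if count > 1:
        ref = f"#/groups/{len(groups)}"
        groups.append(ref)
        yield ref
    else:
        yield None


def collect_refs(leaf_refs: list[str], groups: list[str]) -> list[str]:
    """Track the group ref when a group was created, else the leaf refs."""
    with use_inline_group(len(leaf_refs), groups) as group_ref:
        return [group_ref] if group_ref is not None else list(leaf_refs)
```

Iterator is sufficient here because the generator neither receives sent values nor returns one, which is all that Generator's two extra type parameters would describe.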
Signed-off-by: Ivan Traus <ivan@liminary.io>
* style(html): apply indent to docstrings
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
---------
Signed-off-by: Ivan Traus <ivan@liminary.io>
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Handle p elements that contain block-level elements anywhere inside them, as browsers do.
Fix wrong type annotations.
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* fix(html): simplify parsing of simple table cells
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* tests(html): add test for rich table cells
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* fix(html): ensure table cells with formatted text are parsed as RichTableCell
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* refactor(html): simplify process_rich_table_cells since only rich cells are processed
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* fix(html): formatted cell runs should be parsed as text items respecting the order
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* chore: pin latest docling-core and update uv.lock
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* chore: upgrade dependencies on uv.lock
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
---------
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* feat: add backend options support to document backends
Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Signed-off-by: Leg0shii <dragonsaremyfavourite@gmail.com>
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* feat: enhance document backends with generic backend options and improve HTML image handling
Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Signed-off-by: Leg0shii <dragonsaremyfavourite@gmail.com>
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* Refactor tests for declarativebackend
Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Signed-off-by: Leg0shii <dragonsaremyfavourite@gmail.com>
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* fix(HTML): improve image caption handling and ensure backend options are set correctly
Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Signed-off-by: Leg0shii <dragonsaremyfavourite@gmail.com>
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* fix: enhance HTML backend image handling and add support for local file paths
Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Signed-off-by: Leg0shii <dragonsaremyfavourite@gmail.com>
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* chore: Add ground truth data for test data
Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Signed-off-by: Leg0shii <dragonsaremyfavourite@gmail.com>
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* fix(HTML): skip loading SVG files in image data handling
Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Signed-off-by: Leg0shii <dragonsaremyfavourite@gmail.com>
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* refactor(html): simplify backend options and address gaps
Provide backend options for DeclarativeDocumentBackend classes only, and only when necessary.
Refactor caption parsing in 'img' elements and remove dummy text.
Replace deprecated annotations from the typing module with native types.
Replace typing annotations according to pydantic guidelines.
Add some documentation using pydantic annotations.
Fix diff issue with test files.
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* tests(html): add tests and fix bugs
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* refactor(html): refactor backend options
Move backend option classes to their own module within the datamodel package.
Rename 'source_location' to 'source_uri' in HTMLBackendOptions.
Rename 'image_fetch' to 'fetch_images' in HTMLBackendOptions.
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* refactor(markdown): create a class for the markdown backend options
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
---------
Signed-off-by: Leg0shii <dragonsaremyfavourite@gmail.com>
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* Fix header support in rich tables in HTML
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* cleaning up
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Compatibility with older Python versions
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Fixing Furniture before the first heading rule
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Added minimalistic test case
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* added html for the test
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
---------
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
* Rich tables support for HTML backend
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Decouple the JATS backend from the HTML backend; the way tables are created changed significantly
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* updated and added tests
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Refactored parse_table_data in html_backend into a few smaller functions
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Changing the scope of a few functions in html_backend.py, making them static when possible
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Fix for HTML tables that have tbody and/or thead; these tables are now properly supported
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
---------
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
* chore(html): refactor parser to leverage context managers
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* fix(html): parse inline code snippets, also from list items
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* chore(html): remove hidden tags
Remove tags that are not meant to be displayed.
Add regression tests for code blocks, inline code, and hidden tags.
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
---------
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* re-implement links for html backend.
Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch>
* fix inline groups in list items. write specific test for find_parent_annotation of _extract_text_and_hyperlink_recursively.
Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch>
* implement hack for images.
Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch>
---------
Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch>