docling

mirror of https://github.com/docling-project/docling.git synced 2026-05-17 13:10:38 +00:00

Author	SHA1	Message	Date
Cesar Berrospi Ramis	e00735dd59	fix(docx): fix OMML equation handling and improve type safety (#3381) * fix(docx): handle missing chr attribute in groupChr OMML elements Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> * fix(docx): escape spaces in OMML limit text for proper LaTeX rendering Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> * fix(docx): fix inline equation reconstruction to prevent tag corruption Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> * chore(docx): add type hints and docstrings to OMML module Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> * fix(docx): fix genfrac formatting and eliminate grouping function warnings Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> * fix(docx): handle unmapped characters in OMML % formatting Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> --------- Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>	2026-05-04 10:58:25 +02:00
pateltejas	72942486ff	fix(pptx): skip malformed picture shapes instead of aborting conversion (#3372 ) * fix(pptx): skip malformed picture shapes instead of aborting conversion MsPowerpointDocumentBackend._handle_pictures reads embedded image bytes via python-pptx's shape.image accessor. On PPTX files with slightly malformed <p:pic> shapes, shape.image raises three exceptions that the existing (UnidentifiedImageError, OSError, ValueError) clause does not catch, so one bad picture aborts conversion of the entire presentation: - InvalidXmlError when <p:blipFill> is missing - KeyError when <a:blip r:embed> points to an unknown relationship - AttributeError when the embedded part's content-type isn't an image These files open normally in Keynote and Google Drive, so the backend should handle them as gracefully as it already handles truncated or unreadable image payloads. This follows the same pattern as #2914, which extended the same except tuple with ValueError to handle linked (external) image references. The three cases above are the remaining shape.image failure modes that still escape. Extend the except tuple to cover the three cases and log the same warning used for other unreadable images, leaving the rest of the presentation to convert normally. Add a regression fixture with one malformed picture per failure mode plus a focused test. Fixes #3371 Signed-off-by: pateltejas <tejas226@hotmail.com> * refactor(pptx): use warnings.warn for malformed picture skips Address PR review feedback: use Python's warnings module with UserWarning to signal the skip to callers instead of logging.Logger.warning, matching the pattern used in msword_backend for "Skipping external image reference". This makes the skip visible via standard warning filters and catchable in tests. Update the regression test to assert the warning is emitted via pytest.warns, which also suppresses the message during the test run so it doesn't clutter suite output. 
Signed-off-by: pateltejas <tejas226@hotmail.com> --------- Signed-off-by: pateltejas <tejas226@hotmail.com>	2026-04-29 08:29:08 +02:00
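The skip-and-warn behaviour described in the pptx commit above can be sketched roughly as follows. This is a minimal illustration, not the backend's code: `InvalidXmlError` is a local stand-in for the python-pptx exception of the same name, `read_image_bytes` and `collect_pictures` are hypothetical helpers, and the Pillow `UnidentifiedImageError` from the original tuple is left out so the sketch needs only the standard library.

```python
import warnings


class InvalidXmlError(Exception):
    """Local stand-in for the python-pptx exception of the same name."""


def read_image_bytes(shape):
    """Hypothetical accessor standing in for shape.image; may raise."""
    return shape["loader"]()


def collect_pictures(shapes):
    """Return the bytes of every readable picture, warning about the rest."""
    images = []
    for index, shape in enumerate(shapes):
        try:
            images.append(read_image_bytes(shape))
        except (OSError, ValueError, InvalidXmlError, KeyError, AttributeError) as exc:
            # One bad picture is skipped; the remaining shapes still convert.
            warnings.warn(
                f"Skipping malformed picture shape {index}: {exc!r}",
                UserWarning,
                stacklevel=2,
            )
    return images
```

Because the skip is reported through `warnings.warn`, callers can silence or escalate it with the standard warning filters, and a test can assert on it with `pytest.warns(UserWarning)`.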
Cesar Berrospi Ramis	3df80e7f46	fix(docx): OMML conversion failures for unsupported limit functions (#3359) * fix(docx): handle unsupported limit functions gracefully in OMML conversion Replace RuntimeError with graceful fallback for unknown limit functions in do_limlow(). Add argmax and argmin to LIM_FUNC dictionary for proper LaTeX rendering. Fixes conversion failures when Word documents contain mathematical operators not previously supported in the limit function dictionary. Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> * test(docx): regenerate ground truth files Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> --------- Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>	2026-04-28 14:43:24 +02:00
Cesar Berrospi Ramis	c455a65e36	feat(docx): add checkbox parsing support (#3349) * feat(docx): add checkbox parsing support to MsWordDocumentBackend Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> * refactor(docx): remove duplicate code in text element handling Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> * docs(docx): update checkbox method docstrings Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> * refactor(docx): use self._BLIP_NAMESPACES for w14 namespace in checkbox methods Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> --------- Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>	2026-04-28 14:38:43 +02:00
Aatrey Sahay	f2c03edb30	fix(html): preserve fragment-only anchor links during path resolution (#3262) Fragment-only hrefs (e.g. href="#section1") were resolved as filesystem paths when source_uri was set, breaking internal document navigation. Add '#' to the skip-resolution prefixes in _resolve_relative_path() so fragment links pass through unchanged. Partially addresses #2929 Signed-off-by: aatrey56 <aatrey.sahay@gmail.com>	2026-04-28 10:28:23 +02:00
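A skip-prefix check of the kind this commit describes might look like the sketch below. The function name `resolve_href`, the `SKIP_PREFIXES` tuple and its other entries are assumptions for illustration; only the idea of adding '#' to the skip list comes from the commit message, and `urljoin` stands in for whatever resolution the backend actually performs.

```python
from typing import Optional
from urllib.parse import urljoin

# Hypothetical prefix list; the commit only states that "#" was added to it.
SKIP_PREFIXES = ("http://", "https://", "data:", "mailto:", "#")


def resolve_href(href: str, source_uri: Optional[str]) -> str:
    """Resolve a relative href against source_uri, leaving skipped prefixes alone."""
    if source_uri is None or href.startswith(SKIP_PREFIXES):
        # Fragment-only links such as "#section1" pass through unchanged.
        return href
    return urljoin(source_uri, href)
```

With a source URI of `https://example.com/docs/page.html`, a relative `img/a.png` is resolved against the page location, while `#section1` stays a same-document anchor.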
Cesar Berrospi Ramis	2ddaa3be97	feat(docx): extract VML images with v:imagedata elements (#3343) Add VML image support with EMF/WMF conversion and consolidate image handler code. Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>	2026-04-22 08:46:36 +02:00
Matvei Smirnov	3a3c8f68dd	fix(pptx)!: assign pptx notes to ContentLayer.NOTES (#3341) Signed-off-by: Matvei Smirnov <vdalekesmirnov@gmail.com>	2026-04-21 18:35:43 +02:00
Cesar Berrospi Ramis	c7615123e6	fix(docx): handle inline formulas in list items (#3304) * fix(docx): Handle inline formulas in list items Fixes issue where inline formulas in list items were ignored during conversion. Added helper methods to eliminate code duplication. Updated test data with list items containing inline equations. Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> * fix(docx): collect element refs in _add_inline_equations_to_parent Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> --------- Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>	2026-04-17 07:33:20 +02:00
pateltejas	043ed2dd3d	fix(pptx): handle NotImplementedError from shape.shape_type (#3309 ) * fix(pptx): handle NotImplementedError from shape.shape_type python-pptx raises NotImplementedError from Shape.shape_type for <p:sp> elements that aren't placeholders, autoshapes, textboxes, or freeforms (e.g. shapes with empty <p:spPr> from Google Slides exports, LibreOffice, or Keynote). handle_groups() and handle_shapes() access shape_type without catching this, crashing the entire conversion. Add a _safe_shape_type() helper that returns None on NotImplementedError, so unrecognized shapes skip only the GROUP recursion and PICTURE extraction while text and table extraction proceed normally. Fixes #3308 Signed-off-by: Tejas Patel <tejas226@hotmail.com> * Fix lint Signed-off-by: Tejas Patel <tejas226@hotmail.com> --------- Signed-off-by: Tejas Patel <tejas226@hotmail.com>	2026-04-17 06:59:48 +02:00
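The helper described in this commit can be sketched without python-pptx. The two shape classes below are stand-ins invented for the example; only the behaviour of returning None when `shape_type` raises NotImplementedError is taken from the commit message.

```python
class UnrecognizedShape:
    """Stand-in for a <p:sp> element whose shape_type cannot be determined."""

    @property
    def shape_type(self):
        raise NotImplementedError("Shape instance of unrecognized shape type")


class PictureShape:
    """Stand-in for a shape with a known type."""

    shape_type = "PICTURE"


def safe_shape_type(shape):
    """Return shape.shape_type, or None when the library cannot classify it."""
    try:
        return shape.shape_type
    except NotImplementedError:
        return None
```

A caller that compares the result against GROUP or PICTURE simply sees no match for None, so group recursion and picture extraction are skipped while text and table extraction continue.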
Cesar Berrospi Ramis	740c386730	fix(docx): isolate list state in table cells (#3294) * fix(docx): isolate list state in table cells Lists with the same numId in different table cells were incorrectly merged. Added context manager to isolate list state during cell processing. Includes test cases and updated ground truth files. Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> * style(docx): modernize type hints to use PEP 604 union syntax Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> --------- Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>	2026-04-15 09:51:37 +02:00
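One way to isolate list state with a context manager, as this commit describes, is sketched below. The `ListState` class, its `counters` mapping and the `isolated()` method are hypothetical names chosen for the example; the backend's real state and method names are not given in the commit message.

```python
from contextlib import contextmanager


class ListState:
    """Hypothetical holder for per-list numbering counters."""

    def __init__(self):
        self.counters = {}  # (num_id, level) -> last number handed out

    def next_number(self, num_id, level=0):
        key = (num_id, level)
        self.counters[key] = self.counters.get(key, 0) + 1
        return self.counters[key]

    @contextmanager
    def isolated(self):
        """Run a block (e.g. one table cell) against a fresh set of counters."""
        saved = self.counters
        self.counters = {}
        try:
            yield self
        finally:
            # Restore the outer state even if cell processing raises.
            self.counters = saved
```

Processing each cell inside `with state.isolated():` means two cells that reuse the same numId each start from 1, and the numbering outside the table resumes where it left off.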
vwe-ibm	9b4b67b23e	feat: add signature/stamp html block to DC document (#3251)	2026-04-09 06:13:22 +02:00
Smeet Agrawal	61809252ec	fix(latex): discard arguments of filtered spacing commands (#3245 ) Commands like \vspace{-1mm} and \hspace{0.2cm} were being filtered at the command level but their argument values were leaking through as plain text nodes. Ensure that when a spacing/ignored command is encountered, its arguments are also suppressed. - Add vspace, hspace, vspace, hspace, addvspace to MACROS_SPACING - Guard against spacing/ignored macros in _process_macro_node_inline so their brace arguments are not extracted as inline text - Guard against spacing/ignored macros in _nodes_to_text so dimension values do not leak when processing footnotes, captions, etc. - Update ground truth files to reflect corrected output Fixes #3240 Signed-off-by: Smeet Agrawal <smeetagrawal23@gmail.com> Co-authored-by: Smeet Agrawal <smeetagrawal23@gmail.com>	2026-04-08 11:19:56 +02:00
Hussain Arslan	524edcce73	fix(pdf): propagate hyperlinks to DoclingDocument text items (#3131 ) * fix(pdf): propagate hyperlinks to DoclingDocument text items docling-parse already extracts PdfHyperlink objects with bounding rectangles and URIs into SegmentedPdfPage.hyperlinks, and TextItem already has a hyperlink field. However, the PDF pipeline never matched hyperlink annotations to text clusters — the data was available but never propagated. Add spatial matching of PDF hyperlinks to text clusters during page assembly, then pass the resolved hyperlink through the reading order model to the final DoclingDocument. Changes: - Add hyperlink field to TextElement (base_models.py) - Add _match_hyperlink() to PageAssembleModel that spatially matches cluster bboxes against hyperlink annotation rects, aggregating coverage per URI to handle wrapped links with multiple rects - Thread hyperlink= through add_text(), add_heading(), add_list_item() calls in ReadingOrderModel - Drop hyperlink on text merge when constituent clusters disagree - Fall back to Path when AnyUrl validation fails (matches HTML backend) - Regenerate affected ground truth files - Add unit tests for _match_hyperlink() edge cases Closes #3096 Signed-off-by: hussainarslan <m.hussain.arslan@gmail.com> * fix(pdf): recover unmatched hyperlinks as REFERENCE items Track consumed hyperlink indices during cluster matching so that hyperlinks which don't meet the overlap threshold are not silently dropped. Unmatched hyperlinks that overlap text clusters are materialized as synthetic REFERENCE TextElements. Also propagate hyperlinks through FORMULA items in reading-order assembly. 
Signed-off-by: macbook <macbook@users.noreply.github.com> Signed-off-by: hussainarslan <m.hussain.arslan@gmail.com> * DCO Remediation Commit for hussainarslan <m.hussain.arslan@gmail.com> I, hussainarslan <m.hussain.arslan@gmail.com>, hereby add my Signed-off-by to this commit: `71a8d900bd` Signed-off-by: hussainarslan <m.hussain.arslan@gmail.com> * test: regenerate reference data for hyperlink propagation Update groundtruth files for 2206.01062, 2305.03393v1, and textbox.docx to reflect hyperlink fields on text items and new REFERENCE items for unmatched hyperlinks. Signed-off-by: hussainarslan <m.hussain.arslan@gmail.com> * Revert "test: regenerate reference data for hyperlink propagation" This reverts commit 374f478ebf71e7e43b1b98d7106375c7f3d77101. Signed-off-by: hussainarslan <m.hussain.arslan@gmail.com> * Revert "fix(pdf): recover unmatched hyperlinks as REFERENCE items" This reverts commit e0e9b9225fa5caa0a7b2578a29600a9531edc624. Signed-off-by: hussainarslan <m.hussain.arslan@gmail.com> * test: regenerate groundtruth for hyperlink propagation Regenerate the affected docling_v2 PDF and DOCX fixtures after rerunning the hyperlink propagation groundtruth suite and switch the hyperlink coverage selection helper to the explicit items() form to avoid a type ignore. Signed-off-by: hussainarslan <m.hussain.arslan@gmail.com> * test: regenerate groundtruth for docling_core 1.10.0 Regenerate the affected docling_v2 PDF and DOCX fixtures with the current docling_core schema version so committed groundtruth stays compatible with CI and example loading. Signed-off-by: hussainarslan <m.hussain.arslan@gmail.com> --------- Signed-off-by: hussainarslan <m.hussain.arslan@gmail.com> Signed-off-by: macbook <macbook@users.noreply.github.com>	2026-03-31 08:58:21 +02:00
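The spatial matching with per-URI coverage aggregation described in this commit can be illustrated as follows. This is a simplified sketch, not the `_match_hyperlink()` implementation: the box layout `(left, top, right, bottom)`, the function names and the 0.5 coverage threshold are all assumptions.

```python
from collections import defaultdict


def overlap_area(a, b):
    """Area of intersection of two (left, top, right, bottom) boxes."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(width, 0) * max(height, 0)


def match_hyperlink(cluster_box, links, min_coverage=0.5):
    """Return the URI whose rects cover most of the cluster, or None.

    links is an iterable of (rect, uri) pairs; a wrapped link contributes
    several rects with the same URI, so coverage is summed per URI.
    """
    area = (cluster_box[2] - cluster_box[0]) * (cluster_box[3] - cluster_box[1])
    if area <= 0:
        return None
    coverage = defaultdict(float)
    for rect, uri in links:
        coverage[uri] += overlap_area(cluster_box, rect) / area
    if not coverage:
        return None
    uri, best = max(coverage.items(), key=lambda item: item[1])
    return uri if best >= min_coverage else None
```

Summing per URI is what lets a link that wraps across two lines win over a shorter link that only touches part of the same cluster.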
lif	89c68f8ec3	fix: parse LaTeX macros in multicolumn/multirow table cells (#3204 ) * fix: parse LaTeX macros in multicolumn/multirow table cells Table cells in multicolumn and multirow used raw LaTeX content (e.g. \textbf{Header}) wrapped in LatexCharsNode, which passed through _nodes_to_text() verbatim. Parse the content string with LatexWalker so formatting macros get properly resolved to text. Signed-off-by: majiayu000 <1835304752@qq.com> * test: add regression test for LaTeX formatting in table cells Add test_latex_table_formatting_in_cells to verify multicolumn/multirow cells with \textbf, \textit, \tiny, and nested formatting produce clean text output without raw LaTeX commands (issue #3199). Also narrow except clause from bare Exception to LatexWalkerParseError for both multicolumn and multirow parse fallbacks. Signed-off-by: majiayu000 <1835304752@qq.com> --------- Signed-off-by: majiayu000 <1835304752@qq.com>	2026-03-30 07:31:26 +02:00
Giulio Leone	e36125ba2d	fix(omml): correct LaTeX output for fractions, math operators, and functions (#3122 ) * fix(omml): correct LaTeX output for fractions, math operators, and functions Fixes three related bugs in OMML-to-LaTeX conversion: A) Fraction raised to a power now produces correct grouping braces: {\frac{(x-c)}{v}}^{2} instead of \frac{(x-c)}{v}^{2} Adds dedicated do_ssub/do_ssup/do_ssubsup handlers that wrap complex base expressions (fractions, radicals) in braces. B) EN DASH (U+2013) and CIRCUMFLEX (U+005E) inside math runs are now mapped to their math-mode equivalents (- and ^) instead of being escaped as \text{\textendash} and \text{\textasciicircum}. C) Adds missing standard math functions to the FUNC dict: log, ln, exp, det, gcd, deg, hom, ker, dim, arg, inf, sup, lim, Pr. These now emit proper LaTeX commands (e.g. \log) instead of falling back to plain italic text. Closes #3120 Signed-off-by: giulio-leone <giulio.leone@users.noreply.github.com> Signed-off-by: giulio-leone <giulio97.leone@gmail.com> * test(omml): add test documents for OMML-to-LaTeX conversion bugs Add three minimal DOCX files exercising the fixed edge cases: - omml_frac_superscript.docx: fraction as superscript base (Bug A) - omml_text_escapes_in_math.docx: en-dash and caret in math runs (Bug B) - omml_func_log.docx: log function recognition (Bug C) Each file includes matching groundtruth (md, json, itxt). Requested-by: @dolfim-ibm Signed-off-by: Giulio Leone <giulioleone10@gmail.com> Signed-off-by: giulio-leone <giulio.leone@users.noreply.github.com> Signed-off-by: giulio-leone <giulio97.leone@gmail.com> * fix(omml): avoid double-wrapping nested sub/sup containers Signed-off-by: giulio-leone <giulio.leone@users.noreply.github.com> Signed-off-by: giulio-leone <giulio97.leone@gmail.com> * fix(omml): fix Bug B caret escape + use issue #3120 test documents Bug B fix: prevent escape_latex from re-escaping characters that process_unicode intentionally mapped to math operators. 
The caret character U+005E inside <m:r><m:t> math runs was being converted to ^ by _MATH_CHAR_MAP, then immediately re-escaped to \^ by escape_latex. Now do_r restores math-mapped chars after escaping. Result: x - y\^2 → x - y^2 (correct superscript) Test documents: replace minimal programmatic fixtures (~1.2 KB) with the real Word documents from issue #3120 reporter (smroels, ~37 KB each). Regenerate all groundtruth. Signed-off-by: giulio-leone <giulio97.leone@gmail.com> * test: regenerate groundtruth for omml_text_escapes_in_math Update .itxt to use proper indented-text export format (item hierarchy) and refresh .json to match current converter output. Signed-off-by: giulio-leone <giulio97.leone@gmail.com> * test(omml): regenerate indented text snapshots The OMML regression documents were exported into the .itxt fixtures using the wrong format, so the real DOCX end-to-end check failed even though the rebased converter output was correct. Regenerate the two broken indented-text snapshots from the current branch so the MS Word E2E test verifies the actual converter behavior. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * style(omml): apply ruff format normalization Normalize the multiline condition in omml.py to match the repository ruff-format output so the pre-commit gate stays clean on the refreshed PR head. 
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: giulio-leone <giulio97.leone@gmail.com> * DCO Remediation Commit for giulio-leone <giulio97.leone@gmail.com> I, giulio-leone <giulio97.leone@gmail.com>, hereby add my Signed-off-by to this commit: `08001d9c5c` Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: giulio-leone <giulio97.leone@gmail.com> --------- Signed-off-by: giulio-leone <giulio.leone@users.noreply.github.com> Signed-off-by: giulio-leone <giulio97.leone@gmail.com> Signed-off-by: Giulio Leone <giulioleone10@gmail.com> Co-authored-by: giulio-leone <giulio.leone@users.noreply.github.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-03-25 07:07:31 +01:00
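The grouping fix from Bug A in this commit can be shown with a small sketch. The `superscript` function and the `COMPLEX_PREFIXES` tuple are hypothetical; the commit only states that complex bases such as fractions and radicals are wrapped in braces before the exponent is attached.

```python
# Hypothetical list of LaTeX prefixes treated as "complex" bases.
COMPLEX_PREFIXES = ("\\frac", "\\sqrt")


def superscript(base: str, exponent: str) -> str:
    """Attach an exponent, bracing the base when it is a fraction or radical."""
    if base.startswith(COMPLEX_PREFIXES):
        base = "{" + base + "}"
    return f"{base}^{{{exponent}}}"


print(superscript(r"\frac{(x-c)}{v}", "2"))  # {\frac{(x-c)}{v}}^{2}
print(superscript("x", "2"))                 # x^{2}
```

Without the extra braces the exponent would be attached to the bare `\frac{(x-c)}{v}`, which is the ungrouped output the commit message reports as incorrect.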
Maxim Lysak	1c74a9b9c7	feat: Implementation of HTML backend with headless browser (#2969) - Implementation of an HTML backend that (optionally) uses a headless browser (via Playwright) to materialize HTML pages into images and adds provenances with bboxes to all elements in the converted docling document. - Conversion preserves the reading order given by the HTML DOM tree - Added support for HTML "input" fields: checkboxes, radio buttons, text inputs, etc. - Added support for a key-value convention in HTML (i.e. elements with id "key1" and "key1_value1" will be paired as key-values, see test cases as examples) - Heuristic that glues independent inline HTML elements with single-character text in them into larger text blocks - Support for inline styling (bold, italic, etc.) Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>	2026-03-24 14:28:57 +01:00
Giulio Leone	90d6dd4e87	fix(docx): split multiple OMML equations into separate formula items (#3123 ) * fix(msword): split multiple OMML equations into separate formula items When a DOCX paragraph contains multiple sibling <m:oMath> elements (e.g. separate equations on one line), the converter previously concatenated them into a single LaTeX string because element.iter() walks all descendants depth-first. Fix: iterate direct children of the paragraph element first to correctly identify sibling <m:oMath> elements, converting each independently. Falls back to deep iteration only when oMath elements are nested inside wrapper elements. Also splits standalone multi-equation paragraphs into individual FORMULA document items instead of merging them into one. Closes #3121 Signed-off-by: giulio-leone <giulio.leone@users.noreply.github.com> Signed-off-by: giulio-leone <giulio97.leone@gmail.com> * test(msword): add multi-equation paragraph test document Add a minimal DOCX file containing two separate oMath elements in one paragraph with a text separator, along with groundtruth output files for markdown, json, and plain text export. Requested-by: @dolfim-ibm Signed-off-by: Giulio Leone <giulioleone10@gmail.com> Signed-off-by: giulio-leone <giulio.leone@users.noreply.github.com> Signed-off-by: giulio-leone <giulio97.leone@gmail.com> * test(msword): regenerate multi-equation indented-text snapshot Signed-off-by: giulio-leone <giulio.leone@users.noreply.github.com> Signed-off-by: giulio-leone <giulio97.leone@gmail.com> * test: replace test doc with issue #3121 attachment Use the real Word document from the issue reporter (smroels) instead of the minimal programmatic fixture. The new document contains three sibling <m:oMath> elements in one paragraph, matching the exact failing shape described in #3121. Regenerate groundtruth to match the richer document structure. 
Signed-off-by: giulio-leone <giulio97.leone@gmail.com> * test: regenerate groundtruth for omml_multi_equation_paragraph Re-run document conversion with current code to update .itxt and .json groundtruth files. The .itxt had stale structure from the previous programmatic fixture; the new real-document conversion produces the correct output with three separate formula items. Signed-off-by: giulio-leone <giulio97.leone@gmail.com> * style(docx): rerun ruff formatter for msword backend Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * refactor(docx): drop unused tag_name binding Remove the unused local in the direct oMath iteration path so the code reads clearly and the outstanding review comment is fully addressed without changing equation-handling behavior. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: giulio-leone <giulio97.leone@gmail.com> * DCO Remediation Commit for giulio-leone <giulio97.leone@gmail.com> I, giulio-leone <giulio97.leone@gmail.com>, hereby add my Signed-off-by to this commit: `84cc70b55e` Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: giulio-leone <giulio97.leone@gmail.com> * test(docx): cover equation paragraph branches Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: giulio-leone <giulio97.leone@gmail.com> * test(docx): reuse backend fixture in msword tests Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: giulio-leone <giulio97.leone@gmail.com> --------- Signed-off-by: giulio-leone <giulio.leone@users.noreply.github.com> Signed-off-by: giulio-leone <giulio97.leone@gmail.com> Signed-off-by: Giulio Leone <giulioleone10@gmail.com> Co-authored-by: giulio-leone <giulio.leone@users.noreply.github.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-03-24 09:42:16 +01:00
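The direct-children-first strategy from this commit can be sketched with the standard library's ElementTree. The function `equation_texts` is a hypothetical name, and joining the text of each `<m:oMath>` stands in for the real OMML-to-LaTeX conversion, which is not reproduced here.

```python
import xml.etree.ElementTree as ET

M = "http://schemas.openxmlformats.org/officeDocument/2006/math"
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
OMATH = f"{{{M}}}oMath"


def equation_texts(paragraph):
    """Return one string per equation in the paragraph element."""
    # Sibling equations are direct children of the paragraph.
    direct = [child for child in paragraph if child.tag == OMATH]
    # Fall back to a deep search only when oMath is nested in a wrapper.
    equations = direct or list(paragraph.iter(OMATH))
    return ["".join(eq.itertext()) for eq in equations]


xml = (
    f'<w:p xmlns:w="{W}" xmlns:m="{M}">'
    "<m:oMath><m:r><m:t>a=b</m:t></m:r></m:oMath>"
    "<w:r><w:t> and </w:t></w:r>"
    "<m:oMath><m:r><m:t>c=d</m:t></m:r></m:oMath>"
    "</w:p>"
)
print(equation_texts(ET.fromstring(xml)))  # ['a=b', 'c=d']
```

Each sibling equation is converted on its own, so the paragraph yields two separate items rather than one concatenated string.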
Emre Çalışır	2f7c09e0d8	fix(docx): Missing list items after numbered header (#2665 ) (#2678 ) * fix(docx): Correct list numbering with interleaved numIds and hierarchical markers Fixes incorrect numbering and missing items in DOCX documents that use multiple interleaved numbering sequences (numIds). Changes: * Reset sub-level counters in _get_list_counter when a parent level advances, preventing counter bleed-across (e.g. "4. Functional Requirements" now correctly renders as "1. Functional Requirements") * Add _build_enum_marker helper to produce hierarchical markers in "1.2.3." format instead of flat single-level counters * Fix anchor-based level calculation in new-sequence branch: use level_at_new_list + ilevel instead of _get_level() to correctly place items from a different numId at the right document level * Only set level_at_new_list in the else case (when None) to avoid corrupting the anchor when switching between interleaved numIds * Remove _reset_list_counters_for_new_sequence from new-sequence branch so that returning to a previously seen numId continues its counter (e.g. Appendix A=1, B=2, C=3 instead of A=1, B=1, C=1) Signed-off-by: Emre Çalışır <emrecalisir95@gmail.com> * style(docx): apply Ruff formatting to msword_backend Signed-off-by: Emre Çalışır <emrecalisir95@gmail.com> * test(docx): add unit tests for list counter, enum marker, and sequence reset helpers Adds test_list_counter_and_enum_marker covering helper methods introduced in the list numbering fix: counter increment, sub-level reset on parent advance, hierarchical marker building, and selective sequence reset. Signed-off-by: Emre Çalışır <emrecalisir95@gmail.com> * fix(docx): read start values from abstractNum for correct numbering When Word creates a new numbering definition that continues from a previous list, it embeds start values in the abstractNum XML instead of reusing the same numId. 
Docling previously ignored these start values and always initialized counters from 1, producing incorrect markers like "1.1.1." instead of "2.3.1.". Changes: * Add _get_level_element helper to extract level XML from abstractNum, eliminating duplicated XML traversal in _is_numbered_list * Add _get_start_value to read w:start from the numbering definition * Initialize counters in _get_list_counter using start values * Use start values as fallback in _build_enum_marker for parent levels that have not been explicitly incremented Signed-off-by: Emre Çalışır <emrecalisir95@gmail.com> * test(docx): add interleaved numId edge case to unit_test_headers_numbered Extends the existing test document with an Appendix section that uses a different numId, followed by list items that resume the original numbering sequence with Word-embedded start values (e.g. 2.3.1.). Updates groundtruth files accordingly. Signed-off-by: Emre Çalışır <emrecalisir95@gmail.com> * style(docx): add return type annotation to _get_level_element Signed-off-by: Emre Çalışır <emrecalisir95@gmail.com> --------- Signed-off-by: Emre Çalışır <emrecalisir95@gmail.com>	2026-03-20 21:24:27 +01:00
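The counter behaviour described in this commit (sub-level reset when a parent advances, hierarchical "1.2.3." markers, and start values as a fallback for untouched parent levels) can be sketched as below. The `ListCounter` class is a hypothetical simplification that tracks a single numbering sequence; it is not the backend's `_get_list_counter` or `_build_enum_marker`.

```python
class ListCounter:
    """Hypothetical single-sequence counter producing hierarchical markers."""

    def __init__(self, starts=None):
        self.starts = starts or {}  # level -> start value (defaults to 1)
        self.counts = {}            # level -> current value

    def advance(self, level):
        """Advance the counter at `level` and return its marker string."""
        if level in self.counts:
            self.counts[level] += 1
        else:
            self.counts[level] = self.starts.get(level, 1)
        # A parent advancing resets every deeper level.
        for deeper in [lvl for lvl in self.counts if lvl > level]:
            del self.counts[deeper]
        # Parent levels never incremented fall back to their start value.
        parts = [
            self.counts.get(lvl, self.starts.get(lvl, 1))
            for lvl in range(level + 1)
        ]
        return ".".join(str(part) for part in parts) + "."
```

With `starts={0: 2, 1: 3}`, the first item at level 2 is rendered as "2.3.1.", the continued-numbering case the commit message gives as its example.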
Ron	8ae0974a9d	fix: handle external image relationships in MsWordDocumentBackend (#3114 ) * fix: handle external image relationships in MsWordDocumentBackend When a .docx file contains image relationships with TargetMode="External" (common in documents saved from web browsers), accessing `_Relationship.target_part` raises ValueError because external relationships don't have a target part within the package. Check `rel.is_external` before accessing `target_part`, emitting a UserWarning with the external target URL and returning None so external images fall through to the existing "image cannot be found" handling. Includes test with ground truth files for a .docx with external image references. Fixes #3113 Signed-off-by: rongo-ms <127863751+rongo-ms@users.noreply.github.com> * chore: upgrade dependencies in uv.lock file Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> --------- Signed-off-by: rongo-ms <127863751+rongo-ms@users.noreply.github.com> Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>	2026-03-19 14:22:21 +01:00
Peter W. J. Staar	4ccd1d465d	feat: Add support for TableFormer v2 (#3013 ) * ran DCO Signed-off-by: Peter Staar <taa@zurich.ibm.com> * modified tf Signed-off-by: Peter Staar <taa@zurich.ibm.com> * Added to TableFormer v1 Signed-off-by: Peter Staar <taa@zurich.ibm.com> * added convenience methods for quality testing Signed-off-by: Peter Staar <taa@zurich.ibm.com> * updated with comments Signed-off-by: Peter Staar <taa@zurich.ibm.com> * ran pre-commit Signed-off-by: Peter Staar <taa@zurich.ibm.com> * chore: update lockfile Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * fix: Align __init__ args with factory method kwargs Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * chore: bump docling-ibm-models version Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * fix mypy type stubs error with torchvision Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * fix: Add torch/torchvision direct deps Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Remove global torch imports Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * add test diffs Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * fix jats/xbrl test generate, updated test GT from docling-core upgrade Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * fixed the in the cli Signed-off-by: Peter Staar <taa@zurich.ibm.com> * updated uv lock Signed-off-by: Peter Staar <taa@zurich.ibm.com> * fixing merge conflicts between main Signed-off-by: Peter Staar <taa@zurich.ibm.com> * fixing merge conflicts between main (2) Signed-off-by: Peter Staar <taa@zurich.ibm.com> * upgrade uv.lock Signed-off-by: Christoph Auer <cau@zurich.ibm.com> --------- Signed-off-by: Peter Staar <taa@zurich.ibm.com> Signed-off-by: Christoph Auer <cau@zurich.ibm.com> Co-authored-by: Ahmed Nassar AHN@zurich.ibm.com <AHN@zurich.ibm.com> Co-authored-by: Christoph Auer <cau@zurich.ibm.com>	2026-03-10 11:57:00 +01:00
Ivan Traus	80f75b8896	fix(html): fix broken document tree and quadratic complexity in rich table cells (#3025 ) * fix(html): fix broken document tree and quadratic PictureItems in rich table cells Three related bugs in the HTML backend when processing table cells that contain rich content (RichTableCell), as found on Wikipedia pages with large reference, taxobox, or classification tables: Bug 1 — orphaned InlineGroups causing broken parent/child relationships ------------------------------------------------------------------------ When _use_inline_group() created an InlineGroup node (for paragraphs containing multiple hyperlinks, e.g. "text <a> and <a>"), it was added as a child of the current parent via doc.add_group(), but its RefItem was never appended to added_refs / provs_in_cell. This meant: - group_cell_elements() reparented the text items inside the InlineGroup (because their individual refs WERE in added_refs), moving them from body → outer_group_element. - The InlineGroup itself remained in body.children still pointing to those same text items as its .children. - Result: two nodes (InlineGroup and outer_group_element) claimed the same child items, with contradictory .parent pointers. This broken tree caused double-serialization of text items in export_to_markdown(). Fix: make _use_inline_group() yield the RefItem of the created group. Callers (_flush_buffer, _handle_block, _handle_list) now track the InlineGroup ref instead of individual leaf refs when a group was created. group_cell_elements() then reparents the whole InlineGroup (with its children intact) rather than orphaning it. 
Bug 2 — quadratic PictureItem creation from stray outer image loop ------------------------------------------------------------------- In _handle_block() for <table> tags, after parse_table_data() had already walked the entire table subtree (including nested tables) and emitted PictureItems for every <img>, there was an additional outer loop: for img_tag in tag("img"): im_ref2 = self._emit_image(tag, doc) Because BeautifulSoup's .find_all("img") on a tag finds ALL descendant <img> elements (including those in nested tables), this loop processed every image in the entire subtree again. A table nested N levels deep caused N(N+1)/2 duplicate PictureItems per image (quadratic growth). Fix: remove the outer loop. Images are already handled by parse_table_data() -> _use_table_cell_context() -> _walk() -> _emit_image(). Bug 3 — missing space separator between nested table cell text -------------------------------------------------------------- HTMLDocumentBackend.get_text() uses _extract_text_recursively(), which only appended a trailing space for <p> and <li> tags. When a table cell contained a nested <table>, adjacent <th> or <td> elements without whitespace NavigableString nodes between them were concatenated directly (e.g. "TypeSound" instead of "Type Sound"). Fix: add "th" and "td" to the trailing-space tag set so that the text content of each cell is separated by a space. Bug 1 and Bug 2 were introduced in docling v2.55.0 (commit `c803abe`) with rich table cell support. Signed-off-by: Ivan Traus <ivan@liminary.io> test(html): align markdown fixtures with current docling-core behavior Signed-off-by: Ivan Traus <ivan@liminary.io> * test(xbrl): update XBRL fixture after get_text() cell spacing fix The Bug 3 fix (adding th/td to trailing-space tags in get_text()) affects the XBRL backend which internally uses HTMLDocumentBackend. Regenerate the mlac-20251231 fixture to match the corrected text extraction. 
Signed-off-by: Ivan Traus <ivan@liminary.io> * chore(deps): bump docling-core to 2.67.1, regenerate fixtures and trim tests Update uv.lock to pull in the merged nested-table flattening fix (docling-core#525). Regenerate markdown fixtures that now show flattened text instead of invalid embedded table syntax. Trim verbose test docstrings and remove narrating comments. Signed-off-by: Ivan Traus <ivan@liminary.io> * fix: annotate _use_inline_group return type and regenerate docx fixtures Add Generator[RefItem \| None, None, None] return type and Google-style Yields section to _use_inline_group. Regenerate docx ground truth fixtures affected by docling-core 2.67.1 nested-table flattening. Signed-off-by: Ivan Traus <ivan@liminary.io> * refactor: use Iterator type hint and remove redundant test Apply feedback: use Iterator instead of Generator, drop type from Yields docstring, and remove test_e2e_rich_table_cells_markdown (already covered by test_e2e_html_conversions). Signed-off-by: Ivan Traus <ivan@liminary.io> * style(html): apply indent to docstrings Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> --------- Signed-off-by: Ivan Traus <ivan@liminary.io> Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>	2026-03-10 09:48:21 +01:00
Christoph Auer	3b7bba0212	chore: Revert unintended test ground truth changes from #3019 (#3093 ) add test diffs Signed-off-by: Christoph Auer <cau@zurich.ibm.com>	2026-03-09 17:38:34 +01:00
Aditya Sasidhar	1192714b53	fix: add parse timeout to legacy LaTeX documents (#3019 ) * add a quick 30-second timeout for each file (this seems exorbitant, but I'll have to review the average parse time of large files and decide on an upper limit and an average limit; the next commit needs to skip individual nodes instead of the whole file) Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com> * fix: bypass mypy attr-defined for parse_timeout in late options Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com> * feat(latex): SOTA improvements from pandoc: theorems, preamble metadata, math envs, bugfixes Features: - Theorem/proof/lemma/corollary/definition/remark/example/conjecture environments - Proof environment with conditional QED ◻ symbol - \paragraph and \subparagraph as headings (levels 4, 5) - \author, \date, \title extracted from preamble - \href preserves URL as [text](url) - \renewcommand and \providecommand macro extraction - dmath/dgroup/darray/subequations math environments - \input cycle detection with depth limit of 10 - quote/quotation/verse environment handling Bugfixes: - Fixed UnboundLocalError in _extract_custom_macros - Fixed _extract_verbatim_content regex stealing content - Fixed is_valid() rejecting preamble-only fragments - Removed unused deepcopy import - Unified recursion depth limits to 10 Tests: - 7 new tests, 1 updated, ground-truth regenerated Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com> * Added more dangerous macros to the ignore list; the is_valid() function now accepts \documentstyle too; added some essential, primitive layout passes Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com> * removed the restrictive nature of the is_valid() function Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com> * added test coverage for the new features and removed the time-formatted parsing test, which caused a hang on Python 3.10 in the
CI/CD testing pipeline Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com> --------- Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>	2026-03-09 10:52:56 +01:00
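The "\input cycle detection with depth limit of 10" item in the commit above can be illustrated with a small sketch. This is not the backend's implementation: `expand_inputs`, the `files` dictionary standing in for the filesystem, and the choice to replace a cyclic include with an empty string are all assumptions made for the example.

```python
import re

MAX_DEPTH = 10  # depth limit quoted in the commit message
INPUT_RE = re.compile(r"\\input\{([^}]+)\}")


def expand_inputs(name, files, depth=0, active=frozenset()):
    """Recursively inline \\input{...} references (hypothetical helper).

    `files` maps a file name to its LaTeX source. `active` holds the names
    currently being expanded, so a file that includes itself (directly or
    through other files) is skipped instead of recursing forever.
    """
    if depth > MAX_DEPTH or name in active:
        return ""
    active = active | {name}

    def replace(match):
        return expand_inputs(match.group(1), files, depth + 1, active)

    # A function replacement avoids re.sub interpreting backslashes in the
    # inlined LaTeX as escape sequences.
    return INPUT_RE.sub(replace, files.get(name, ""))


files = {
    "main": r"Intro \input{part} Outro",
    "part": r"Body \input{main}",  # cycle back to main
}
expanded = expand_inputs("main", files)
print(expanded)
```

Tracking only the names on the current expansion path, rather than every file ever seen, still allows the same file to be included twice from different places.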
Br1an	cd9dd10ccf	fix(docx): preserve URL fragments and query params in hyperlinks (#3050 ) Remove `Path()` wrapper from hyperlink address extraction. `Path()` is designed for filesystem paths and strips URL fragments (#) and query parameters (?), causing truncated hyperlinks in DOCX output. Signed-off-by: Br1an67 <932039080@qq.com>	2026-03-06 11:35:17 +01:00
Cesar Berrospi Ramis	56eb12782c	fix(docx): handle list items immediately after numbered headings (#3070 ) fix(docx): create a new list group with a list item after a heading When a list with a different 'numid' appears after a heading (marked as list item too), a new list group needs to be created to avoid inconsistencies (list item under heading). Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>	2026-03-06 09:30:48 +01:00
Br1an	859c302310	fix(xlsx): handle OneCellAnchor images in Excel backend (#3045 ) * fix: handle OneCellAnchor images in Excel backend Add support for OneCellAnchor image positioning in _find_images_in_sheet(). Previously, only TwoCellAnchor images had their position extracted; images with OneCellAnchor (the default when inserting images in Excel) would default to bounding box (0,0,0,0), placing them all at the top-left corner regardless of their actual position. Now OneCellAnchor images use the anchor cell as their bounding box origin, correctly preserving the image's position in the output document. * DCO Remediation Commit for Br1an67 <932039080@qq.com> I, Br1an67 <932039080@qq.com>, hereby add my Signed-off-by to this commit: `cd878618ff` Signed-off-by: Br1an67 <932039080@qq.com> --------- Signed-off-by: Br1an67 <932039080@qq.com>	2026-03-02 12:55:17 +01:00
Cesar Berrospi Ramis	334ba6e51f	feat: create a backend parser for XBRL instance reports (#3017 ) * build(xbrl): add Arelle as open-source library for XBRL Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> * feat(xbrl): design and implement a backend parser for XBRL documents Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> * test: remove print statements to reduce verbosity Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> * style(XBRL): apply PEP8 naming convention for acronyms Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> * refactor(XBRL): set XBRL dependencies as optional Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> --------- Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>	2026-02-24 16:52:02 +01:00
Peter W. J. Staar	bf417e6d26	feat: Introduce docling-parse v5 and deprecate old docling-parse backends (#2872 ) * feat: simplifying towards docling-parse v5 Signed-off-by: Peter Staar <taa@zurich.ibm.com> * working on integrating docling-parse v5 Signed-off-by: Peter Staar <taa@zurich.ibm.com> * added the test_backend_docling_parse Signed-off-by: Peter Staar <taa@zurich.ibm.com> * Updated the docling-parse to 5.3.0 Signed-off-by: Peter Staar <taa@zurich.ibm.com> * ran the pre-commit Signed-off-by: Peter Staar <taa@zurich.ibm.com> * fixed the backend_docling_parse Signed-off-by: Peter Staar <taa@zurich.ibm.com> * ran pre-commit Signed-off-by: Peter Staar <taa@zurich.ibm.com> * updated the groundtruth to deal with rounding errors Signed-off-by: Peter Staar <taa@zurich.ibm.com> * updated comments for later docling-parse integrations Signed-off-by: Peter Staar <taa@zurich.ibm.com> * ran pre-commit Signed-off-by: Peter Staar <taa@zurich.ibm.com> * Make DoclingParseV2 and DoclingParseV4 backend stubs that route to new backend, emit warning. Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * lock docling-parse Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * updated to 3.5.2 Signed-off-by: Peter Staar <taa@zurich.ibm.com> --------- Signed-off-by: Peter Staar <taa@zurich.ibm.com> Signed-off-by: Christoph Auer <cau@zurich.ibm.com> Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> Co-authored-by: Christoph Auer <cau@zurich.ibm.com> Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>	2026-02-17 20:27:56 +01:00
Cesar Berrospi Ramis	a1b0e3fd6b	fix(csv): set default delimiter by default (#3005 ) If the delimiter cannot be determined, assume the default delimiter (comma). As a result, address single-column CSV, which triggered a parsing error. Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>	2026-02-17 20:26:05 +01:00
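The fallback described in the commit above can be sketched with the standard library, assuming delimiter detection is done with `csv.Sniffer`. The helper name `detect_delimiter` and the candidate set `",;\t|"` are illustrative choices, not taken from the backend.

```python
import csv
import io


def detect_delimiter(sample, default=","):
    """Guess the delimiter of a CSV sample, falling back to a default.

    csv.Sniffer raises csv.Error when no candidate delimiter fits, which
    is what happens for a single-column file.
    """
    try:
        return csv.Sniffer().sniff(sample, delimiters=",;\t|").delimiter
    except csv.Error:
        return default


semicolon_sample = "a;b\nc;d\ne;f\n"
single_column = "alpha\nbeta\ngamma\n"

semicolon_delim = detect_delimiter(semicolon_sample)
fallback_delim = detect_delimiter(single_column)
rows = list(csv.reader(io.StringIO(single_column), delimiter=fallback_delim))
print(repr(semicolon_delim), repr(fallback_delim), rows)
```

With the fallback in place, the single-column sample parses into three one-cell rows instead of raising.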
jhchoi1182	1f914826bb	fix: add failed pages to DoclingDocument for page break consistency (#2939 ) * fix: add failed pages to DoclingDocument for page break consistency When some PDF pages fail to parse, they were not added to DoclingDocument.pages, causing page break markers to be incorrect during export. This adds failed/skipped pages with their size info (if available) to maintain correct page numbering and structure. - Add _add_failed_pages_to_document() method in StandardPdfPipeline - Add test cases for failed page handling - Add test cases for normal page handling (regression test) - Add test PDF files Signed-off-by: jhchoi1182 <jhchoi1182@gmail.com> * fix: ensure resource cleanup and simplify type hints - Wrap page_backend usage in try-finally to guarantee unload (prevents resource leaks). - Simplify redundant 'float \| None \| None' type hint. Signed-off-by: jhchoi1182 <jhchoi1182@gmail.com> * fix: add groundtruth for normal_4pages.pdf and exclude failing PDFs from e2e test Signed-off-by: jhchoi1182 <jhchoi1182@gmail.com> * fix: ensure correct status assertion for failed pages in tests Signed-off-by: jhchoi1182 <jhchoi1182@gmail.com> --------- Signed-off-by: jhchoi1182 <jhchoi1182@gmail.com>	2026-02-13 13:35:35 +01:00
Aditya Sasidhar	e6ccb8b2c1	feat: added support for parsing LaTeX (.tex) documents (#2890 ) * feat: added support for parsing LaTeX (.tex) documents Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com> * feat: implement PR #2890 feedback for LaTeX backend - Add text formatting options (bold, italic, underline) for LaTeX macros - Enhance image embedding with PIL and ImageRef.from_pil() - Refactor list processing to use GroupItem structure - Refactor bibliography to use GroupItem structure - Add nested list test coverage - All tests passing (39/39), all linters passing Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com> * DCO Remediation Commit for Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com> I, Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>, hereby add my Signed-off-by to this commit: f19f135b431d489cd8bf3982524505a0bbd8696d Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com> * DCO Remediation Commit for Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com> I, Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>, hereby add my Signed-off-by to this commit: f19f135b431d489cd8bf3982524505a0bbd8696d Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com> * feat: enhance latex backend with robustness fixes and ground truth - Add custom macro expansion for improved text quality - Fix preamble filtering to remove metadata garbage - Support recursive \input{} and \include{} file loading - Organize test data into subdirectories for complex papers - Add full end-to-end ground truth for 4 major arXiv papers (Attention, Mistral, DeepSeek, OTSL) - Pass all 41 unit tests and pre-commit checks Addresses @cau-git feedback for ground-truth data. 
Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com> * fix: minor formatting in test file Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com> * feat: enhance LaTeX backend with robust math and figure support - Fixed re.error: bad escape in macro expansion by using lambda in re.sub - Fixed sentences breaking at inline math ($) by preserving it within paragraphs - Improved figure environment with proper grouping and structured representation - Fixed crashes on documents starting with % comments - Added comprehensive unit tests and updated all ground truth data Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com> * WIP: saving work for laptop migration Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com> * got rid of the line breaking issues, still some do exist Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com> * fix: generalized LaTeX macro parsing and robustness improvements This commit addresses several issues with LaTeX parsing: - Correctly handle unknown macros (like \ion{N}{2}) inline to avoid line breaks. - Fix extraction of structural macros (section, caption, etc.) vs text-only groups. - Address PR feedback regarding inline math spacing and splitting. - Regenerate ground truth files reflecting these improvements. 
Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com> * style: apply automatic formatting fixes Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com> * style: fix ruff linter and formatter errors Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com> * fix: typing issues identified by mypy Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com> * style: apply formatting fixes to tests Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com> * fix: update groundtruth files for latex backend Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com> * fixed the awkward line-breaking issue, which came from not considering the text buffer * added the ground truth that was missing from the previous commit * DCO Remediation Commit for Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com> I, Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>, hereby add my Signed-off-by to this commit: `7e032635ef` I, Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>, hereby add my Signed-off-by to this commit: `aeba688384` Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com> * Ran the precommit as requested Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com> --------- Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>	2026-02-10 15:13:09 +01:00
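The "re.error: bad escape" fix mentioned in the commit above relies on a general property of `re.sub`: a replacement string is parsed as a template, so a LaTeX body such as `\mathbb{R}` is rejected, while a callable's return value is inserted verbatim. A minimal sketch, in which `expand_macro` and the `\R` macro are hypothetical:

```python
import re


def expand_macro(text, name, replacement):
    """Replace every use of \\<name> with the literal replacement text."""
    pattern = re.compile(r"\\" + re.escape(name) + r"(?![A-Za-z])")
    # The lambda's return value is used as-is, with no escape processing.
    return pattern.sub(lambda _match: replacement, text)


source = r"Let \R be the reals."

# Passing the LaTeX body as a template string fails: "\m" is not a valid
# escape sequence in a replacement template.
try:
    re.sub(r"\\R(?![A-Za-z])", r"\mathbb{R}", source)
    naive_failed = False
except re.error:
    naive_failed = True

expanded = expand_macro(source, "R", r"\mathbb{R}")
print(naive_failed, expanded)
```

The negative lookahead keeps `\R` from matching the start of a longer macro name such as `\Rightarrow`.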
Rashid Ul Islam	3110c439da	fix(backend): improve Excel table bounds detection and flatten merged cells (#2778 ) * fix(backend): improve Excel table detection with BFS and configurable tolerance Replaces the bounding-box strategy with a Flood Fill (BFS) algorithm to correctly detect non-rectangular tables. Reverts span flattening to preserve semantic structure. Adds 'gap_tolerance' option to backend. Signed-off-by: Rashidul Islam <rasidulislam71@gmail.com> * fix(backend): improve Excel table detection with BFS and configurable tolerance Replaces the bounding-box strategy with a Flood Fill (BFS) algorithm to correctly detect non-rectangular tables. Reverts span flattening to preserve semantic structure. Adds 'gap_tolerance' option to backend. Signed-off-by: Rashidul Islam <rasidulislam71@gmail.com> * chore: reverse unnecessary file changes Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> --------- Signed-off-by: Rashidul Islam <rasidulislam71@gmail.com> Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>	2026-02-02 10:31:11 +01:00
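The flood-fill idea in the commit above can be sketched as a breadth-first search over non-empty cells. Everything here is illustrative: the function name, the set-of-coordinates input, and the reading of `gap_tolerance` as "number of empty cells that may separate two cells of the same table" are assumptions, not the backend's actual definitions.

```python
from collections import deque


def find_table_bounds(filled, start, gap_tolerance=0):
    """Return (min_row, min_col, max_row, max_col) of the region around start.

    `filled` is a set of (row, col) coordinates of non-empty cells. Two cells
    are treated as connected when they are at most gap_tolerance empty cells
    apart in each direction (diagonals included, an assumption of this sketch).
    """
    reach = gap_tolerance + 1
    seen = {start}
    queue = deque([start])
    while queue:
        row, col = queue.popleft()
        for d_row in range(-reach, reach + 1):
            for d_col in range(-reach, reach + 1):
                cell = (row + d_row, col + d_col)
                if cell in filled and cell not in seen:
                    seen.add(cell)
                    queue.append(cell)
    rows = [r for r, _ in seen]
    cols = [c for _, c in seen]
    return min(rows), min(cols), max(rows), max(cols)


# An L-shaped block of cells plus one cell separated by a one-column gap.
cells = {(0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (2, 4)}
strict = find_table_bounds(cells, (0, 0), gap_tolerance=0)
tolerant = find_table_bounds(cells, (0, 0), gap_tolerance=1)
print(strict, tolerant)
```

Because the search follows connectivity instead of growing a rectangle, the L-shaped region is found as one table, and the tolerance decides whether the detached cell joins it.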
Sam Quigley	5e452a2e8f	fix(pptx): handle picture shapes with external image references (#2914 ) * fix(pptx): handle picture shapes with external image references When processing PowerPoint files containing picture shapes that reference external images (rather than embedded images), the python-pptx library raises a ValueError("no embedded image") when accessing the `image` property. Previously, this caused the entire document conversion to fail because: 1. The `hasattr(shape, "image")` check at line 690 would trigger the property getter, which raises ValueError (hasattr only catches AttributeError, not ValueError) 2. The exception handler in `_handle_pictures()` only caught UnidentifiedImageError and OSError, not ValueError This fix: - Removes the unnecessary hasattr check since we already verify the shape type is MSO_SHAPE_TYPE.PICTURE - Adds ValueError to the exception handler in `_handle_pictures()` so that picture shapes with external references are gracefully skipped with a warning instead of crashing the pipeline Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> * DCO Remediation Commit for Sam Quigley <quigley@emerose.com> I, Sam Quigley <quigley@emerose.com>, hereby add my Signed-off-by to this commit: `e69779e07b` Signed-off-by: Sam Quigley <quigley@emerose.com> * tests(pptx): add a linked image to test the fix on `e69779e` Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> --------- Signed-off-by: Sam Quigley <quigley@emerose.com> Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com> Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>	2026-02-01 11:44:29 +01:00
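The `hasattr` point in the commit above is general Python behaviour and can be checked without python-pptx. In the sketch below, `LinkedPicture` is a hypothetical stand-in for a picture shape whose `image` property raises `ValueError`, and `read_image` mirrors only the shape of the fix (a wider `except` tuple), not the backend code.

```python
class LinkedPicture:
    """Stand-in for a picture shape that references an external image."""

    @property
    def image(self):
        raise ValueError("no embedded image")


shape = LinkedPicture()

# hasattr() swallows AttributeError only, so the ValueError raised by the
# property getter propagates to the caller.
try:
    hasattr(shape, "image")
    outcome = "hasattr returned normally"
except ValueError as exc:
    outcome = f"hasattr raised ValueError: {exc}"


def read_image(candidate):
    """Return the embedded image, or None when it cannot be read."""
    try:
        return candidate.image
    except (OSError, ValueError):
        return None


print(outcome)
print(read_image(shape))
```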
Cesar Berrospi Ramis	0602a7cdab	feat: webvtt and source tracker (#2787 ) * refactor(provenance): account for provenance as union of ProvenanceItem and ProvenanceTrack Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> * refactor(webvtt): update WebVTTDocumentBackend with new docling-core classes Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> * refactor(webvtt): preserve new lines and add helper handlers Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> * refactor(webvtt): set ProvenanceTrack timings as float type Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> * style(asr): remove unnecessary imports Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> * refactor(asr): use ProvenanceTrack in ASR pipeline Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> * tests(webvtt): add additional tests Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> * chore(webvtt): parse the title of the WEBVTT file Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> * refactor(webvtt): apply refactoring of TrackProvenance from docling-core Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> * style(webvtt): apply X | Y annotation instead of Optional, Union Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> * refactor(webvtt): drop cue span classes, 'lang' and 'c' tags Drop WebVTT formatting features not covered by Docling across formats. Only 'u', 'b', 'i', and 'v' are supported and without classes. Align with docling-core v2.62.0 Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> * build: pin docling-core 2.62.0 Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> --------- Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>	2026-01-30 17:44:03 +01:00
Siva	b6ca094519	feat: add support for Word document comments extraction (#2834 ) * feat: add support for Word document comments extraction (fixes #485) Signed-off-by: s1v4-d <leelasaisivasubrahmanyamdurga@gmail.com> * fix: address PR review feedback for comments extraction - Change DocItemLabel.PARAGRAPH to TEXT (deprecating PARAGRAPH) - Change initials format from '(initials)' to 'author: initials' - Change timestamp format to include 'time:' prefix - Update test assertions and regenerate ground truth files Signed-off-by: s1v4-d <leelasaisivasubrahmanyamdurga@gmail.com> * chore: update comment format and move format documentation from inline comment to function docstring Signed-off-by: s1v4-d <leelasaisivasubrahmanyamdurga@gmail.com> * Use docling-core v2.58.0 add_comment() API to properly link Word document comments to their annotated text items via FineRef references. - Import FineRef from docling_core.types.doc.document - Refactor _add_comments to use doc.add_comment(targets=[...]) API - Parse DOCX XML for commentRangeStart/End markers in _extract_comment_ranges - Track paragraph-to-items mapping for comment linking - Fallback to unlinked comments in COMMENT_SECTION group when no targets found Signed-off-by: s1v4-d <leelasaisivasubrahmanyamdurga@gmail.com> * - Extract comment IDs directly during paragraph element processing to match element IDs - Clear paragraph mappings at start of each conversion for consistent behavior - Always create comment groups and use add_comment() API with targets - Add _get_comment_ids_for_element() helper to extract comment markers from XML - Regenerate ground-truth files (JSON/MD/itxt) with comments field properly linked Signed-off-by: s1v4-d <leelasaisivasubrahmanyamdurga@gmail.com> * fix: remove incorrect ground-truth files, keep versions with comments field Signed-off-by: s1v4-d <leelasaisivasubrahmanyamdurga@gmail.com> * fix: reference comment groups instead of text items in comments field Signed-off-by: s1v4-d 
<leelasaisivasubrahmanyamdurga@gmail.com> --------- Signed-off-by: s1v4-d <leelasaisivasubrahmanyamdurga@gmail.com> Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>	2026-01-26 09:58:46 +01:00
Cesar Berrospi Ramis	86eaef5b45	fix(md): handle pipe symbols that are not table markers (#2904 ) * fix(md): handle pipe symbols that are not table markers Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> * chore: update uv.lock with latest docling-core 2.60.2 Update uv.lock file with the latest release of docling-core (2.60.2). Update (fix) ground truth files for testing markdown serialization to be in sync with the serialization fix (issue 2880). Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> --------- Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>	2026-01-23 15:19:09 +01:00
Tong Luo	999dbb2765	fix: PPTX parsing: bullet points not grouped correctly under subheadings (#2663 ) (#2855 ) * fix: PPTX parsing: bullet points not grouped correctly under subheadings (#2663) Signed-off-by: Tong Luo <luotng@cn.ibm.com> * fix: PPTX parsing: bullet points not grouped correctly under subheadings, support Python 3.9 (#2663) Signed-off-by: Tong Luo <luotng@cn.ibm.com> * fix: PPTX parsing: optimized code naming, descriptions (#2663) Signed-off-by: Tong Luo <luotng@cn.ibm.com> * fix: PPTX parsing: optimized code naming, descriptions (#2663) Signed-off-by: Tong Luo <luotng@cn.ibm.com> * docs(pptx): updated docstrings in pptx backend parser Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> --------- Signed-off-by: Tong Luo <luotng@cn.ibm.com> Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>	2026-01-21 14:34:24 +01:00
Michele Dolfi	a1f8bddcb7	chore: update locked deps in CI (#2895 ) * chore: update locked deps Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * update test results (to review) Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> --------- Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>	2026-01-20 11:45:57 +01:00
Michele Dolfi	19af03f539	feat: Support for DeepSeek-OCR in VLM pipeline (#2798 ) * add parsing of annotated markdown and definition of new ResponseFormat for the VLM pipeline Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * fix broken html in test Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * update result with initial text Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * move parsing to vlm pipeline Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * restore md from main Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * process table structure Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * simplify and refactor Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * factor out deepseekocr utils Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * renaming Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * refactor common logic in vlm parsing logic Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * add deepseek-ocr with ollama Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * update tests for new annotation format Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * fix parsing of title Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * more test data Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * add picture item Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * fix bbox parsing Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * remove old tests Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * add test parsing deepseek md Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * rename test Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * add test with ollama conversion Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * fix test and mark methods as private Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> --------- Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>	2026-01-09 18:42:40 +01:00
Cesar Berrospi Ramis	5c1f8f0171	fix(docx): handle grouped pictures (#2861 ) Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>	2026-01-09 09:42:52 +01:00
Michele Dolfi	595115d892	fix(markdown): allow text before headers also in mixed markdown and html (#2801 ) * fix missing content in mixed markdown Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * delete elements outside of iterate_items Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * no need for new test files Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * add new export without extra furniture title Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * add html options Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * use html options Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> --------- Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>	2025-12-17 13:54:07 +01:00
Cesar Berrospi Ramis	d007ba0e6f	fix(html): tackle paragraphs with block-level elements (#2720 ) Fix p elements having block-level elements anywhere inside as browsers do. Fix wrong type annotations. Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>	2025-12-05 12:52:53 +01:00
Matvei Smirnov	aebe25cf00	fix(html): prevent hierarchy reset in rich table cells (#2716 ) * fix(html): restore parents after rich cell walking Signed-off-by: Matvei Smirnov <vdalekesmirnov@gmail.com> * fix(html): add table cell context manager, update tests Signed-off-by: Matvei Smirnov <vdalekesmirnov@gmail.com> * fix(html): table with heading test data Signed-off-by: Matvei Smirnov <vdalekesmirnov@gmail.com> --------- Signed-off-by: Matvei Smirnov <vdalekesmirnov@gmail.com>	2025-12-03 18:52:23 +01:00
Cesar Berrospi Ramis	c97715f5fd	fix(docx): parse integrals as n-ary objects without chr element (#2712 ) Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>	2025-12-03 11:25:52 +01:00
glypt	54cd6d7406	fix: do not consider singleton cells in xlsx as TableItems but rather TextItems (#2589 ) fix: do not handle 1x1 cell as a tableitem but as a textitem Signed-off-by: glypt <8trash-can8@protonmail.ch>	2025-11-27 16:25:32 +01:00
Michele Dolfi	e58055465c	fix(docx): Missing list items after numbered header (#2665 ) * fix #2250. list items after numbered headers Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * add test for new case Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * chore(docx): remove unnecessary check Remove 'current_parent is None' check in '_add_list_item' function since it will always be None. Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> --------- Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>	2025-11-24 08:49:21 +01:00
Cesar Berrospi Ramis	054c4a634d	fix(docx): parse page headers and footers (#2599 ) * fix(docx): parse page headers and footers Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> * chore(docx): rename _add_header with _add_heading To avoid confusion, rename _add_header function name with _add_heading since the function is about adding section headings. Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> * chore(docx): extend the page header and footer parsing to any content type Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> * chore(docx): fix _add_header_footer function Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> --------- Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>	2025-11-10 16:10:12 +01:00
Cesar Berrospi Ramis	ef623ffcee	fix(docx): slow table parsing (#2553 ) * chore(docx): remove unnecessary import Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> * fix(docx): simplify parsing of simple tables Simplify the parsing of tables with just text (no rich cells). Move nested function group_cell_elements out of _handle_tables for readability. Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> * chore(docx): reuse method for finding inline pictures Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> * chore(docx): format strikethrough text Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> * tests(docx): use fixtures to avoid converting same file multiple times Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> * fix(docx): remove unnecessary argument docx_obj in functions Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> * tests(docx): add test for rich table cells Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> * chore(docx): small improvements in backend and its unit tests Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> * chore(docx): parse superscript and subscript formatted text Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> --------- Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>	2025-11-06 05:25:53 +01:00
Cesar Berrospi Ramis	0ba8d5d9e3	fix(html): slow table parsing (#2582 ) * fix(html): simplify parsing of simple table cells Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> * tests(html): add test for rich table cells Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> * fix(html): ensure table cells with formatted text are parsed as RichTableCell Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> * refactor(html): simplify process_rich_table_cells since only rich cells are processed Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> * fix(html): formatted cell runs should be parsed as text items respecting the order Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> * chore: pin latest docling-core and update uv.lock Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> * chore: upgrade dependencies on uv.lock Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> --------- Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>	2025-11-06 05:25:36 +01:00
glypt	d9c90eb45e	fix: xlsx cell parsing, now returning values instead of formulas (#2520 ) * fix: xlsx doc parsing, now returning values instead of formulas Signed-off-by: glypt <8trash-can8@protonmail.ch> * fix: add test for better coverage of xlsx backend Signed-off-by: glypt <8trash-can8@protonmail.ch> * fix: add the total of ducks as a formula in the tests/data This also adds the test that the value 310 is contained in the table. Without the fix from the previous commit, it would return "B7+C7" Signed-off-by: glypt <8trash-can8@protonmail.ch> --------- Signed-off-by: glypt <8trash-can8@protonmail.ch>	2025-10-29 11:35:51 +01:00


163 Commits