Commit Graph

181 Commits

Author SHA1 Message Date
Cesar Berrospi Ramis e00735dd59 fix(docx): fix OMML equation handling and improve type safety (#3381)
* fix(docx): handle missing chr attribute in groupChr OMML elements

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* fix(docx): escape spaces in OMML limit text for proper LaTeX rendering

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* fix(docx): fix inline equation reconstruction to prevent tag corruption

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* chore(docx): add type hints and docstrings to OMML module

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* fix(docx): fix genfrac formatting and eliminate grouping function warnings

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* fix(docx): handle unmapped characters in OMML % formatting

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

---------

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2026-05-04 10:58:25 +02:00
pateltejas 72942486ff fix(pptx): skip malformed picture shapes instead of aborting conversion (#3372)
* fix(pptx): skip malformed picture shapes instead of aborting conversion

MsPowerpointDocumentBackend._handle_pictures reads embedded image bytes via python-pptx's shape.image accessor. On PPTX files with slightly malformed <p:pic> shapes, shape.image raises three exceptions that the existing (UnidentifiedImageError, OSError, ValueError) clause does not catch, so one bad picture aborts conversion of the entire presentation:

- InvalidXmlError when <p:blipFill> is missing
- KeyError when <a:blip r:embed> points to an unknown relationship
- AttributeError when the embedded part's content-type isn't an image

These files open normally in Keynote and Google Drive, so the backend should handle them as gracefully as it already handles truncated or unreadable image payloads.

This follows the same pattern as #2914, which extended the same except tuple with ValueError to handle linked (external) image references. The three cases above are the remaining shape.image failure modes that still escape.

Extend the except tuple to cover the three cases and log the same warning used for other unreadable images, leaving the rest of the presentation to convert normally. Add a regression fixture with one malformed picture per failure mode plus a focused test.

Fixes #3371

Signed-off-by: pateltejas <tejas226@hotmail.com>

* refactor(pptx): use warnings.warn for malformed picture skips

Address PR review feedback: use Python's warnings module with UserWarning to signal the skip to callers instead of logging.Logger.warning, matching the pattern used in msword_backend for "Skipping external image reference". This makes the skip visible via standard warning filters and catchable in tests.

Update the regression test to assert the warning is emitted via pytest.warns, which also suppresses the message during the test run so it doesn't clutter suite output.
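
A minimal sketch of the skip-and-warn pattern described in this commit, using only the standard library. `read_image_bytes`, `collect_pictures`, and the dict-based shapes are hypothetical stand-ins, not python-pptx or docling APIs, and the exception tuple is a reduced version of the one the commit describes.

```python
import warnings

# Exceptions treated as "unreadable picture" in this sketch. The commit also
# lists InvalidXmlError and UnidentifiedImageError, which come from
# third-party packages and are left out to keep the sketch self-contained.
SKIPPABLE = (OSError, ValueError, KeyError, AttributeError)


def read_image_bytes(shape: dict) -> bytes:
    """Hypothetical stand-in for the shape.image accessor."""
    if "error" in shape:
        raise shape["error"]
    return shape["blob"]


def collect_pictures(shapes: list) -> list:
    """Return the readable image payloads and warn about the rest."""
    images = []
    for shape in shapes:
        try:
            images.append(read_image_bytes(shape))
        except SKIPPABLE as exc:
            # warnings.warn keeps the skip filterable and catchable in tests.
            warnings.warn(
                f"Skipping malformed picture {shape['name']!r}: {exc!r}",
                UserWarning,
                stacklevel=2,
            )
    return images
```

With this shape, one bad picture costs a single warning while the remaining pictures are still collected.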

Signed-off-by: pateltejas <tejas226@hotmail.com>

---------

Signed-off-by: pateltejas <tejas226@hotmail.com>
2026-04-29 08:29:08 +02:00
Cesar Berrospi Ramis 3df80e7f46 fix(docx): OMML conversion failures for unsupported limit functions (#3359)
* fix(docx): handle unsupported limit functions gracefully in OMML conversion

Replace RuntimeError with graceful fallback for unknown limit functions in do_limlow().
Add argmax and argmin to LIM_FUNC dictionary for proper LaTeX rendering.
Fixes conversion failures when Word documents contain mathematical operators
not previously supported in the limit function dictionary.
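
A rough sketch of the dictionary-with-fallback idea. The table contents, the `render_lim_low` name, and the `\operatorname*` fallback are assumptions for illustration; the real LIM_FUNC dictionary and do_limlow() may use different keys, templates, and fallback output.

```python
# Hypothetical subset of a name -> LaTeX template table.
LIM_FUNC = {
    "lim": "\\lim_{{{lim}}}",
    "max": "\\max_{{{lim}}}",
    "min": "\\min_{{{lim}}}",
    "argmax": "\\operatorname*{{argmax}}_{{{lim}}}",
    "argmin": "\\operatorname*{{argmin}}_{{{lim}}}",
}


def render_lim_low(func_name: str, lim: str) -> str:
    """Render a lower-limit function, falling back for unknown names."""
    template = LIM_FUNC.get(func_name)
    if template is None:
        # Fallback instead of raising: keep the name and attach the limit.
        return f"\\operatorname*{{{func_name}}}_{{{lim}}}"
    return template.format(lim=lim)
```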

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* test(docx): regenerate ground truth files

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

---------

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2026-04-28 14:43:24 +02:00
Cesar Berrospi Ramis c455a65e36 feat(docx): add checkbox parsing support (#3349)
* feat(docx): add checkbox parsing support to MsWordDocumentBackend

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* refactor(docx): remove duplicate code in text element handling

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* docs(docx): update checkbox method docstrings

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* refactor(docx): use self._BLIP_NAMESPACES for w14 namespace in checkbox methods

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

---------

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2026-04-28 14:38:43 +02:00
Aatrey Sahay f2c03edb30 fix(html): preserve fragment-only anchor links during path resolution (#3262)
fix(html): preserve fragment-only anchor links during path resolution

Fragment-only hrefs (e.g. href="#section1") were resolved as filesystem
paths when source_uri was set, breaking internal document navigation.

Add '#' to the skip-resolution prefixes in _resolve_relative_path() so
fragment links pass through unchanged.
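
A small sketch of the prefix check, assuming a hypothetical `resolve_relative_path` function and an illustrative prefix tuple; the real _resolve_relative_path() and its prefix list may differ.

```python
from typing import Optional
from urllib.parse import urljoin

# Prefixes that pass through unchanged. This tuple is an assumption made for
# the sketch; only the "#" entry is taken from the commit message.
SKIP_PREFIXES = ("http://", "https://", "data:", "mailto:", "#")


def resolve_relative_path(href: str, source_uri: Optional[str]) -> str:
    """Resolve href against source_uri unless it must stay untouched."""
    if source_uri is None or href.startswith(SKIP_PREFIXES):
        return href
    return urljoin(source_uri, href)
```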

Partially addresses #2929

Signed-off-by: aatrey56 <aatrey.sahay@gmail.com>
2026-04-28 10:28:23 +02:00
Cesar Berrospi Ramis 2ddaa3be97 feat(docx): extract VML images with v:imagedata elements (#3343)
feat(docx): Extract VML images with v:imagedata elements

Add VML image support with EMF/WMF conversion and consolidate image handler code.

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2026-04-22 08:46:36 +02:00
Matvei Smirnov 3a3c8f68dd fix(pptx)!: assign pptx notes to ContentLayer.NOTES (#3341)
Signed-off-by: Matvei Smirnov <vdalekesmirnov@gmail.com>
2026-04-21 18:35:43 +02:00
Cesar Berrospi Ramis c7615123e6 fix(docx): handle inline formulas in list items (#3304)
* fix(docx): handle inline formulas in list items

Fixes issue where inline formulas in list items were ignored during conversion.
Added helper methods to eliminate code duplication.
Updated test data with list items containing inline equations.

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* fix(docx): collect element refs in _add_inline_equations_to_parent

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

---------

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2026-04-17 07:33:20 +02:00
pateltejas 043ed2dd3d fix(pptx): handle NotImplementedError from shape.shape_type (#3309)
* fix(pptx): handle NotImplementedError from shape.shape_type

python-pptx raises NotImplementedError from Shape.shape_type for
<p:sp> elements that aren't placeholders, autoshapes, textboxes, or
freeforms (e.g. shapes with empty <p:spPr> from Google Slides exports,
LibreOffice, or Keynote). handle_groups() and handle_shapes() access
shape_type without catching this, crashing the entire conversion.

Add a _safe_shape_type() helper that returns None on
NotImplementedError, so unrecognized shapes skip only the GROUP
recursion and PICTURE extraction while text and table extraction
proceed normally.
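
The helper can be sketched as follows. The two shape classes are hypothetical stand-ins so the example runs without python-pptx, and `safe_shape_type` only mirrors the behavior described above.

```python
from typing import Any, Optional


class UnrecognizedShape:
    """Hypothetical stand-in for a shape whose type cannot be determined."""

    @property
    def shape_type(self) -> Any:
        raise NotImplementedError("unrecognized shape type")


class PictureShape:
    """Hypothetical stand-in for a shape with a known type."""

    shape_type = "PICTURE"


def safe_shape_type(shape: Any) -> Optional[Any]:
    """Return shape.shape_type, or None when the library cannot classify it."""
    try:
        return shape.shape_type
    except NotImplementedError:
        return None
```

Callers can then compare the result against GROUP or PICTURE as before; a None result simply matches neither branch.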

Fixes #3308

Signed-off-by: Tejas Patel <tejas226@hotmail.com>

* Fix lint

Signed-off-by: Tejas Patel <tejas226@hotmail.com>

---------

Signed-off-by: Tejas Patel <tejas226@hotmail.com>
2026-04-17 06:59:48 +02:00
Cesar Berrospi Ramis 740c386730 fix(docx): isolate list state in table cells (#3294)
* fix(docx): isolate list state in table cells

Lists with the same numId in different table cells were incorrectly
merged. Added context manager to isolate list state during cell
processing. Includes test cases and updated ground truth files.
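
One way to picture the isolation, as a sketch only: `ListState`, `isolated`, and `number_cells` are hypothetical names, and the real backend tracks more state than a single counter dictionary.

```python
from contextlib import contextmanager
from typing import Iterator


class ListState:
    """Hypothetical holder for per-numId list counters."""

    def __init__(self) -> None:
        self.counters = {}

    def next_marker(self, num_id: int) -> int:
        self.counters[num_id] = self.counters.get(num_id, 0) + 1
        return self.counters[num_id]

    @contextmanager
    def isolated(self) -> Iterator[None]:
        """Swap in fresh counters and restore the outer ones on exit."""
        saved = self.counters
        self.counters = {}
        try:
            yield
        finally:
            self.counters = saved


def number_cells(state: ListState, cells: list) -> list:
    """Number the list items of each cell with the outer state shielded."""
    result = []
    for cell in cells:
        with state.isolated():
            result.append([state.next_marker(num_id) for num_id in cell])
    return result
```

Each cell starts counting from scratch, and the counters of the surrounding document are untouched once the cell is done.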

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* style(docx): modernize type hints to use PEP 604 union syntax

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

---------

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2026-04-15 09:51:37 +02:00
vwe-ibm 9b4b67b23e feat: add signature/stamp html block to DC document (#3251) 2026-04-09 06:13:22 +02:00
Smeet Agrawal 61809252ec fix(latex): discard arguments of filtered spacing commands (#3245)
Commands like \vspace{-1mm} and \hspace{0.2cm} were being filtered
at the command level but their argument values were leaking through as
plain text nodes. Ensure that when a spacing/ignored command is
encountered, its arguments are also suppressed.

- Add vspace, hspace, vspace*, hspace*, addvspace to MACROS_SPACING
- Guard against spacing/ignored macros in _process_macro_node_inline
  so their brace arguments are not extracted as inline text
- Guard against spacing/ignored macros in _nodes_to_text so dimension
  values do not leak when processing footnotes, captions, etc.
- Update ground truth files to reflect corrected output
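
The guard can be illustrated on a toy node model. `Macro` and `nodes_to_text` below are hypothetical simplifications and do not use the LaTeX parser the backend relies on; only the MACROS_SPACING names come from the commit message.

```python
from dataclasses import dataclass, field

MACROS_SPACING = {"vspace", "hspace", "vspace*", "hspace*", "addvspace"}


@dataclass
class Macro:
    """Toy macro node: a name plus argument nodes (str or Macro)."""

    name: str
    args: list = field(default_factory=list)


def nodes_to_text(nodes: list) -> str:
    """Flatten nodes to text, dropping spacing macros with their arguments."""
    parts = []
    for node in nodes:
        if isinstance(node, str):
            parts.append(node)
        elif node.name in MACROS_SPACING:
            continue  # suppress the macro and its dimension argument
        else:
            parts.append(nodes_to_text(node.args))
    return "".join(parts)
```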

Fixes #3240

Signed-off-by: Smeet Agrawal <smeetagrawal23@gmail.com>
Co-authored-by: Smeet Agrawal <smeetagrawal23@gmail.com>
2026-04-08 11:19:56 +02:00
Hussain Arslan 524edcce73 fix(pdf): propagate hyperlinks to DoclingDocument text items (#3131)
* fix(pdf): propagate hyperlinks to DoclingDocument text items

docling-parse already extracts PdfHyperlink objects with bounding
rectangles and URIs into SegmentedPdfPage.hyperlinks, and TextItem
already has a hyperlink field. However, the PDF pipeline never matched
hyperlink annotations to text clusters — the data was available but
never propagated.

Add spatial matching of PDF hyperlinks to text clusters during page
assembly, then pass the resolved hyperlink through the reading order
model to the final DoclingDocument.

Changes:
- Add hyperlink field to TextElement (base_models.py)
- Add _match_hyperlink() to PageAssembleModel that spatially matches
  cluster bboxes against hyperlink annotation rects, aggregating
  coverage per URI to handle wrapped links with multiple rects
- Thread hyperlink= through add_text(), add_heading(), add_list_item()
  calls in ReadingOrderModel
- Drop hyperlink on text merge when constituent clusters disagree
- Fall back to Path when AnyUrl validation fails (matches HTML backend)
- Regenerate affected ground truth files
- Add unit tests for _match_hyperlink() edge cases
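
A simplified sketch of the per-URI coverage aggregation. `Rect`, `match_hyperlink`, the top-left coordinate origin, and the 0.5 threshold are all assumptions for illustration, not the values or types used by PageAssembleModel.

```python
from collections import defaultdict
from typing import NamedTuple, Optional


class Rect(NamedTuple):
    """Axis-aligned box with a top-left origin (top < bottom)."""

    left: float
    top: float
    right: float
    bottom: float

    def area(self) -> float:
        return max(0.0, self.right - self.left) * max(0.0, self.bottom - self.top)

    def intersection_area(self, other: "Rect") -> float:
        width = min(self.right, other.right) - max(self.left, other.left)
        height = min(self.bottom, other.bottom) - max(self.top, other.top)
        return max(0.0, width) * max(0.0, height)


def match_hyperlink(cluster: Rect, links: list, threshold: float = 0.5) -> Optional[str]:
    """Return the URI whose rects cover the largest share of the cluster."""
    coverage = defaultdict(float)
    for rect, uri in links:
        # Sum over all rects of a URI so wrapped links count as one link.
        coverage[uri] += cluster.intersection_area(rect)
    if not coverage or cluster.area() == 0:
        return None
    uri, covered = max(coverage.items(), key=lambda item: item[1])
    return uri if covered / cluster.area() >= threshold else None
```

The sketch assumes the rects of one link do not overlap each other; overlapping rects would be counted twice.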

Closes #3096

Signed-off-by: hussainarslan <m.hussain.arslan@gmail.com>

* fix(pdf): recover unmatched hyperlinks as REFERENCE items

Track consumed hyperlink indices during cluster matching so that
hyperlinks which don't meet the overlap threshold are not silently
dropped. Unmatched hyperlinks that overlap text clusters are
materialized as synthetic REFERENCE TextElements. Also propagate
hyperlinks through FORMULA items in reading-order assembly.

Signed-off-by: macbook <macbook@users.noreply.github.com>
Signed-off-by: hussainarslan <m.hussain.arslan@gmail.com>

* DCO Remediation Commit for hussainarslan <m.hussain.arslan@gmail.com>

I, hussainarslan <m.hussain.arslan@gmail.com>, hereby add my Signed-off-by to this commit: 71a8d900bd

Signed-off-by: hussainarslan <m.hussain.arslan@gmail.com>

* test: regenerate reference data for hyperlink propagation

Update groundtruth files for 2206.01062, 2305.03393v1, and
textbox.docx to reflect hyperlink fields on text items and
new REFERENCE items for unmatched hyperlinks.

Signed-off-by: hussainarslan <m.hussain.arslan@gmail.com>

* Revert "test: regenerate reference data for hyperlink propagation"

This reverts commit 374f478ebf71e7e43b1b98d7106375c7f3d77101.

Signed-off-by: hussainarslan <m.hussain.arslan@gmail.com>

* Revert "fix(pdf): recover unmatched hyperlinks as REFERENCE items"

This reverts commit e0e9b9225fa5caa0a7b2578a29600a9531edc624.

Signed-off-by: hussainarslan <m.hussain.arslan@gmail.com>

* test: regenerate groundtruth for hyperlink propagation

Regenerate the affected docling_v2 PDF and DOCX fixtures after rerunning the hyperlink propagation groundtruth suite and switch the hyperlink coverage selection helper to the explicit items() form to avoid a type ignore.

Signed-off-by: hussainarslan <m.hussain.arslan@gmail.com>

* test: regenerate groundtruth for docling_core 1.10.0

Regenerate the affected docling_v2 PDF and DOCX fixtures with the current docling_core schema version so committed groundtruth stays compatible with CI and example loading.

Signed-off-by: hussainarslan <m.hussain.arslan@gmail.com>

---------

Signed-off-by: hussainarslan <m.hussain.arslan@gmail.com>
Signed-off-by: macbook <macbook@users.noreply.github.com>
2026-03-31 08:58:21 +02:00
lif 89c68f8ec3 fix: parse LaTeX macros in multicolumn/multirow table cells (#3204)
* fix: parse LaTeX macros in multicolumn/multirow table cells

Table cells in multicolumn and multirow used raw LaTeX content
(e.g. \textbf{Header}) wrapped in LatexCharsNode, which passed
through _nodes_to_text() verbatim. Parse the content string with
LatexWalker so formatting macros get properly resolved to text.

Signed-off-by: majiayu000 <1835304752@qq.com>

* test: add regression test for LaTeX formatting in table cells

Add test_latex_table_formatting_in_cells to verify multicolumn/multirow
cells with \textbf, \textit, \tiny, and nested formatting produce clean
text output without raw LaTeX commands (issue #3199).

Also narrow except clause from bare Exception to LatexWalkerParseError
for both multicolumn and multirow parse fallbacks.

Signed-off-by: majiayu000 <1835304752@qq.com>

---------

Signed-off-by: majiayu000 <1835304752@qq.com>
2026-03-30 07:31:26 +02:00
Giulio Leone e36125ba2d fix(omml): correct LaTeX output for fractions, math operators, and functions (#3122)
* fix(omml): correct LaTeX output for fractions, math operators, and functions

Fixes three related bugs in OMML-to-LaTeX conversion:

A) Fraction raised to a power now produces correct grouping braces:
   {\frac{(x-c)}{v}}^{2} instead of \frac{(x-c)}{v}^{2}
   Adds dedicated do_ssub/do_ssup/do_ssubsup handlers that wrap
   complex base expressions (fractions, radicals) in braces.

B) EN DASH (U+2013) and CIRCUMFLEX (U+005E) inside math runs are
   now mapped to their math-mode equivalents (- and ^) instead of
   being escaped as \text{\textendash} and \text{\textasciicircum}.

C) Adds missing standard math functions to the FUNC dict: log, ln,
   exp, det, gcd, deg, hom, ker, dim, arg, inf, sup, lim, Pr.
   These now emit proper LaTeX commands (e.g. \log) instead of
   falling back to plain italic text.
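
Fix A can be sketched in a few lines. `wrap_base`, `superscript`, and the prefix tuple are hypothetical and only illustrate the brace-wrapping rule, not the do_ssup handler itself.

```python
# Commands whose output counts as a "complex" base in this sketch.
COMPLEX_PREFIXES = ("\\frac", "\\sqrt")


def wrap_base(base: str) -> str:
    """Wrap fractions and radicals in braces so a script binds to the whole."""
    if base.startswith(COMPLEX_PREFIXES):
        return "{" + base + "}"
    return base


def superscript(base: str, exponent: str) -> str:
    """Render base^{exponent} with the base grouped when needed."""
    return f"{wrap_base(base)}^{{{exponent}}}"
```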

Closes #3120

Signed-off-by: giulio-leone <giulio.leone@users.noreply.github.com>
Signed-off-by: giulio-leone <giulio97.leone@gmail.com>

* test(omml): add test documents for OMML-to-LaTeX conversion bugs

Add three minimal DOCX files exercising the fixed edge cases:
- omml_frac_superscript.docx: fraction as superscript base (Bug A)
- omml_text_escapes_in_math.docx: en-dash and caret in math runs (Bug B)
- omml_func_log.docx: log function recognition (Bug C)

Each file includes matching groundtruth (md, json, itxt).

Requested-by: @dolfim-ibm
Signed-off-by: Giulio Leone <giulioleone10@gmail.com>
Signed-off-by: giulio-leone <giulio.leone@users.noreply.github.com>
Signed-off-by: giulio-leone <giulio97.leone@gmail.com>

* fix(omml): avoid double-wrapping nested sub/sup containers

Signed-off-by: giulio-leone <giulio.leone@users.noreply.github.com>
Signed-off-by: giulio-leone <giulio97.leone@gmail.com>

* fix(omml): fix Bug B caret escape + use issue #3120 test documents

Bug B fix: prevent escape_latex from re-escaping characters that
process_unicode intentionally mapped to math operators.  The caret
character U+005E inside <m:r><m:t> math runs was being converted
to ^ by _MATH_CHAR_MAP, then immediately re-escaped to \^ by
escape_latex.  Now do_r restores math-mapped chars after escaping.

Result: x - y\^2 → x - y^2 (correct superscript)

Test documents: replace minimal programmatic fixtures (~1.2 KB)
with the real Word documents from issue #3120 reporter (smroels,
~37 KB each).  Regenerate all groundtruth.

Signed-off-by: giulio-leone <giulio97.leone@gmail.com>

* test: regenerate groundtruth for omml_text_escapes_in_math

Update .itxt to use proper indented-text export format (item hierarchy)
and refresh .json to match current converter output.

Signed-off-by: giulio-leone <giulio97.leone@gmail.com>

* test(omml): regenerate indented text snapshots

The OMML regression documents were exported into the .itxt fixtures using the
wrong format, so the real DOCX end-to-end check failed even though the rebased
converter output was correct.

Regenerate the two broken indented-text snapshots from the current branch so
the MS Word E2E test verifies the actual converter behavior.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* style(omml): apply ruff format normalization

Normalize the multiline condition in omml.py to match the repository
ruff-format output so the pre-commit gate stays clean on the refreshed
PR head.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: giulio-leone <giulio97.leone@gmail.com>

* DCO Remediation Commit for giulio-leone <giulio97.leone@gmail.com>

I, giulio-leone <giulio97.leone@gmail.com>, hereby add my Signed-off-by to this commit: 08001d9c5c

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: giulio-leone <giulio97.leone@gmail.com>

---------

Signed-off-by: giulio-leone <giulio.leone@users.noreply.github.com>
Signed-off-by: giulio-leone <giulio97.leone@gmail.com>
Signed-off-by: Giulio Leone <giulioleone10@gmail.com>
Co-authored-by: giulio-leone <giulio.leone@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-25 07:07:31 +01:00
Maxim Lysak 1c74a9b9c7 feat: Implementation of HTML backend with headless browser (#2969)
- Implementation of an HTML backend that (optionally) uses a headless browser (via Playwright) to render HTML pages into images, and adds provenances with bounding boxes to all elements in the converted docling document.
- Conversion preserves the reading order given by the HTML DOM tree
- Added support for HTML "input" fields: checkboxes, radio buttons, text inputs, etc.
- Added support for a key-value convention in HTML (i.e. elements with ids "key1" and "key1_value1" are paired as key-values; see the test cases for examples)
- Heuristic that glues independent inline HTML elements containing single-character text into larger text blocks
- Support for inline styling (bold, italic, etc.)
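
A guess at how the id-based key-value pairing might work, as a sketch. `pair_key_values` and the exact matching rule (a value id starts with the key id followed by "_value") are assumptions inferred from the "key1" / "key1_value1" example above.

```python
def pair_key_values(elements: dict) -> dict:
    """Map each key element's text to the texts of its value elements.

    `elements` maps an element id to its text content.
    """
    pairs = {}
    for key_id, key_text in elements.items():
        values = [
            text
            for elem_id, text in elements.items()
            if elem_id.startswith(key_id + "_value")
        ]
        if values:
            pairs[key_text] = values
    return pairs
```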

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
2026-03-24 14:28:57 +01:00
Giulio Leone 90d6dd4e87 fix(docx): split multiple OMML equations into separate formula items (#3123)
* fix(msword): split multiple OMML equations into separate formula items

When a DOCX paragraph contains multiple sibling <m:oMath> elements
(e.g. separate equations on one line), the converter previously
concatenated them into a single LaTeX string because element.iter()
walks all descendants depth-first.

Fix: iterate direct children of the paragraph element first to
correctly identify sibling <m:oMath> elements, converting each
independently. Falls back to deep iteration only when oMath
elements are nested inside wrapper elements.

Also splits standalone multi-equation paragraphs into individual
FORMULA document items instead of merging them into one.
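
The direct-children-first lookup can be sketched with the standard library XML parser. `find_equations` is a hypothetical name, and the backend works on a different element type, but the namespace URIs are the standard OOXML ones.

```python
import xml.etree.ElementTree as ET

M_NS = "http://schemas.openxmlformats.org/officeDocument/2006/math"
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
OMATH = f"{{{M_NS}}}oMath"


def find_equations(paragraph: ET.Element) -> list:
    """Return sibling oMath elements, or nested ones if none are direct."""
    direct = [child for child in paragraph if child.tag == OMATH]
    if direct:
        return direct
    # Fallback for oMath elements wrapped in another element.
    return list(paragraph.iter(OMATH))
```

Converting each returned element on its own yields one LaTeX string per equation instead of one concatenated string.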

Closes #3121

Signed-off-by: giulio-leone <giulio.leone@users.noreply.github.com>
Signed-off-by: giulio-leone <giulio97.leone@gmail.com>

* test(msword): add multi-equation paragraph test document

Add a minimal DOCX file containing two separate oMath elements
in one paragraph with a text separator, along with groundtruth
output files for markdown, json, and plain text export.

Requested-by: @dolfim-ibm
Signed-off-by: Giulio Leone <giulioleone10@gmail.com>
Signed-off-by: giulio-leone <giulio.leone@users.noreply.github.com>
Signed-off-by: giulio-leone <giulio97.leone@gmail.com>

* test(msword): regenerate multi-equation indented-text snapshot

Signed-off-by: giulio-leone <giulio.leone@users.noreply.github.com>
Signed-off-by: giulio-leone <giulio97.leone@gmail.com>

* test: replace test doc with issue #3121 attachment

Use the real Word document from the issue reporter (smroels)
instead of the minimal programmatic fixture. The new document
contains three sibling <m:oMath> elements in one paragraph,
matching the exact failing shape described in #3121.

Regenerate groundtruth to match the richer document structure.

Signed-off-by: giulio-leone <giulio97.leone@gmail.com>

* test: regenerate groundtruth for omml_multi_equation_paragraph

Re-run document conversion with current code to update .itxt and .json
groundtruth files. The .itxt had stale structure from the previous
programmatic fixture; the new real-document conversion produces the
correct output with three separate formula items.

Signed-off-by: giulio-leone <giulio97.leone@gmail.com>

* style(docx): rerun ruff formatter for msword backend

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* refactor(docx): drop unused tag_name binding

Remove the unused local in the direct oMath iteration path so the code
reads clearly and the outstanding review comment is fully addressed
without changing equation-handling behavior.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: giulio-leone <giulio97.leone@gmail.com>

* DCO Remediation Commit for giulio-leone <giulio97.leone@gmail.com>

I, giulio-leone <giulio97.leone@gmail.com>, hereby add my Signed-off-by to this commit: 84cc70b55e

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: giulio-leone <giulio97.leone@gmail.com>

* test(docx): cover equation paragraph branches

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: giulio-leone <giulio97.leone@gmail.com>

* test(docx): reuse backend fixture in msword tests

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: giulio-leone <giulio97.leone@gmail.com>

---------

Signed-off-by: giulio-leone <giulio.leone@users.noreply.github.com>
Signed-off-by: giulio-leone <giulio97.leone@gmail.com>
Signed-off-by: Giulio Leone <giulioleone10@gmail.com>
Co-authored-by: giulio-leone <giulio.leone@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-24 09:42:16 +01:00
Emre Çalışır 2f7c09e0d8 fix(docx): Missing list items after numbered header (#2665) (#2678)
* fix(docx): Correct list numbering with interleaved numIds and hierarchical markers

  Fixes incorrect numbering and missing items in DOCX documents that use
  multiple interleaved numbering sequences (numIds).

  Changes:
  * Reset sub-level counters in _get_list_counter when a parent level
    advances, preventing counter bleed-across (e.g. "4. Functional
    Requirements" now correctly renders as "1. Functional Requirements")
  * Add _build_enum_marker helper to produce hierarchical markers in
    "1.2.3." format instead of flat single-level counters
  * Fix anchor-based level calculation in new-sequence branch: use
    level_at_new_list + ilevel instead of _get_level() to correctly
    place items from a different numId at the right document level
  * Only set level_at_new_list in the else case (when None) to avoid
    corrupting the anchor when switching between interleaved numIds
  * Remove _reset_list_counters_for_new_sequence from new-sequence branch
    so that returning to a previously seen numId continues its counter
    (e.g. Appendix A=1, B=2, C=3 instead of A=1, B=1, C=1)
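
  A compact sketch of the counter rules listed above. `ListCounters` is a hypothetical class; the real helpers (_get_list_counter, _build_enum_marker) carry additional state such as document levels.

```python
class ListCounters:
    """Hypothetical counters keyed by (numId, indentation level)."""

    def __init__(self) -> None:
        self.counters = {}

    def advance(self, num_id: int, ilvl: int) -> int:
        """Increment one level and reset deeper levels of the same numId."""
        key = (num_id, ilvl)
        self.counters[key] = self.counters.get(key, 0) + 1
        stale = [k for k in self.counters if k[0] == num_id and k[1] > ilvl]
        for other in stale:
            del self.counters[other]
        return self.counters[key]

    def marker(self, num_id: int, ilvl: int) -> str:
        """Build a hierarchical marker such as '1.2.3.'."""
        parts = [str(self.counters.get((num_id, lvl), 1)) for lvl in range(ilvl + 1)]
        return ".".join(parts) + "."
```

  Because counters are keyed by numId, switching to another numId and back continues the earlier sequence instead of restarting it.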

Signed-off-by: Emre Çalışır <emrecalisir95@gmail.com>

* style(docx): apply Ruff formatting to msword_backend

Signed-off-by: Emre Çalışır <emrecalisir95@gmail.com>

* test(docx): add unit tests for list counter, enum marker, and sequence reset helpers

  Adds test_list_counter_and_enum_marker covering helper methods introduced
  in the list numbering fix: counter increment, sub-level reset on parent
  advance, hierarchical marker building, and selective sequence reset.

Signed-off-by: Emre Çalışır <emrecalisir95@gmail.com>

* fix(docx): read start values from abstractNum for correct numbering

    When Word creates a new numbering definition that continues from a
    previous list, it embeds start values in the abstractNum XML instead of
    reusing the same numId. Docling previously ignored these start values
    and always initialized counters from 1, producing incorrect markers
    like "1.1.1." instead of "2.3.1.".

    Changes:
    * Add _get_level_element helper to extract level XML from abstractNum,
      eliminating duplicated XML traversal in _is_numbered_list
    * Add _get_start_value to read w:start from the numbering definition
    * Initialize counters in _get_list_counter using start values
    * Use start values as fallback in _build_enum_marker for parent levels
      that have not been explicitly incremented
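
    Reading start values can be sketched with the standard library XML parser. `get_start_value` and `build_marker` are hypothetical simplifications of the helpers named above, and the abstractNum fragment used with them is made up for illustration.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = f"{{{W_NS}}}"


def get_start_value(abstract_num: ET.Element, ilvl: int) -> int:
    """Read w:start for a level, defaulting to 1 when it is absent."""
    for lvl in abstract_num.findall(f"{W}lvl"):
        if lvl.get(f"{W}ilvl") == str(ilvl):
            start = lvl.find(f"{W}start")
            if start is not None:
                return int(start.get(f"{W}val", "1"))
    return 1


def build_marker(abstract_num: ET.Element, counters: dict, ilvl: int) -> str:
    """Use explicit counters where present, start values otherwise."""
    parts = [
        str(counters.get(level, get_start_value(abstract_num, level)))
        for level in range(ilvl + 1)
    ]
    return ".".join(parts) + "."
```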

Signed-off-by: Emre Çalışır <emrecalisir95@gmail.com>

* test(docx): add interleaved numId edge case to unit_test_headers_numbered

    Extends the existing test document with an Appendix section that uses a
    different numId, followed by list items that resume the original
    numbering sequence with Word-embedded start values (e.g. 2.3.1.).
    Updates groundtruth files accordingly.

Signed-off-by: Emre Çalışır <emrecalisir95@gmail.com>

* style(docx): add return type annotation to _get_level_element

Signed-off-by: Emre Çalışır <emrecalisir95@gmail.com>

---------

Signed-off-by: Emre Çalışır <emrecalisir95@gmail.com>
2026-03-20 21:24:27 +01:00
Ron 8ae0974a9d fix: handle external image relationships in MsWordDocumentBackend (#3114)
* fix: handle external image relationships in MsWordDocumentBackend

When a .docx file contains image relationships with TargetMode="External"
(common in documents saved from web browsers), accessing
`_Relationship.target_part` raises ValueError because external relationships
don't have a target part within the package.

Check `rel.is_external` before accessing `target_part`, emitting a
UserWarning with the external target URL and returning None so external
images fall through to the existing "image cannot be found" handling.
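
A sketch of the guard, using a hypothetical `Relationship` stand-in so it runs without python-docx; `load_image_blob` and the warning text are illustrative only.

```python
import warnings
from dataclasses import dataclass
from typing import Optional


@dataclass
class Relationship:
    """Hypothetical stand-in for a package relationship."""

    is_external: bool
    target_ref: str
    blob: Optional[bytes] = None

    @property
    def target_part(self) -> Optional[bytes]:
        if self.is_external:
            raise ValueError("target_part is undefined for external relationships")
        return self.blob


def load_image_blob(rel: Relationship) -> Optional[bytes]:
    """Return the embedded image data, or None for external references."""
    if rel.is_external:
        # Check first, so target_part is never touched for external links.
        warnings.warn(
            f"Skipping external image reference: {rel.target_ref}",
            UserWarning,
            stacklevel=2,
        )
        return None
    return rel.target_part
```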

Includes test with ground truth files for a .docx with external image
references.

Fixes #3113

Signed-off-by: rongo-ms <127863751+rongo-ms@users.noreply.github.com>

* chore: upgrade dependencies in uv.lock file

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

---------

Signed-off-by: rongo-ms <127863751+rongo-ms@users.noreply.github.com>
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2026-03-19 14:22:21 +01:00
Peter W. J. Staar 4ccd1d465d feat: Add support for TableFormer v2 (#3013)
* ran DCO

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* modified tf

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* Added  to TableFormer v1

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* added convenience methods for quality testing

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* updated with comments

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* ran pre-commit

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* chore: update lockfile

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* fix: Align __init__ args with factory method kwargs

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* chore: bump docling-ibm-models version

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* fix mypy type stubs error with torchvision

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* fix: Add torch/torchvision direct deps

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Remove global torch imports

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* add test diffs

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* fix jats/xbrl test generate, updated test GT from docling-core upgrade

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* fixed the  in the cli

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* updated uv lock

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* fixing merge conflicts between main

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* fixing merge conflicts between main (2)

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* upgrade uv.lock

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Ahmed Nassar AHN@zurich.ibm.com <AHN@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
2026-03-10 11:57:00 +01:00
Ivan Traus 80f75b8896 fix(html): fix broken document tree and quadratic complexity in rich table cells (#3025)
* fix(html): fix broken document tree and quadratic PictureItems in rich table cells

Three related bugs in the HTML backend when processing table cells that
contain rich content (RichTableCell), as found on Wikipedia pages with
large reference, taxobox, or classification tables:

Bug 1 — orphaned InlineGroups causing broken parent/child relationships
------------------------------------------------------------------------
When _use_inline_group() created an InlineGroup node (for paragraphs
containing multiple hyperlinks, e.g. "text <a> and <a>"), it was added
as a child of the current parent via doc.add_group(), but its RefItem was
never appended to added_refs / provs_in_cell. This meant:

  - group_cell_elements() reparented the text items inside the InlineGroup
    (because their individual refs WERE in added_refs), moving them from
    body → outer_group_element.
  - The InlineGroup itself remained in body.children still pointing to
    those same text items as its .children.
  - Result: two nodes (InlineGroup and outer_group_element) claimed the
    same child items, with contradictory .parent pointers. This broken
    tree caused double-serialization of text items in export_to_markdown().

Fix: make _use_inline_group() yield the RefItem of the created group.
Callers (_flush_buffer, _handle_block, _handle_list) now track the
InlineGroup ref instead of individual leaf refs when a group was created.
group_cell_elements() then reparents the whole InlineGroup (with its
children intact) rather than orphaning it.

Bug 2 — quadratic PictureItem creation from stray outer image loop
-------------------------------------------------------------------
In _handle_block() for <table> tags, after parse_table_data() had already
walked the entire table subtree (including nested tables) and emitted
PictureItems for every <img>, there was an additional outer loop:

    for img_tag in tag("img"):
        im_ref2 = self._emit_image(tag, doc)

Because BeautifulSoup's .find_all("img") on a tag finds ALL descendant
<img> elements (including those in nested tables), this loop processed
every image in the entire subtree again. A table nested N levels deep
caused N*(N+1)/2 duplicate PictureItems per image (quadratic growth).

Fix: remove the outer loop. Images are already handled by parse_table_data()
-> _use_table_cell_context() -> _walk() -> _emit_image().
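
The growth can be illustrated with a simplified model, not the backend's code: every table rescans all of its descendant images, and each of N nested tables holds one image. All function names here are hypothetical.

```python
import xml.etree.ElementTree as ET


def nested_tables(depth: int) -> ET.Element:
    """Build `depth` nested tables with one <img> in each table's cell."""
    root = ET.Element("table")
    table = root
    for level in range(depth):
        cell = ET.SubElement(ET.SubElement(table, "tr"), "td")
        ET.SubElement(cell, "img", {"src": f"{level}.png"})
        if level + 1 < depth:
            table = ET.SubElement(cell, "table")
    return root


def emissions_with_rescan(root: ET.Element) -> int:
    """Count emits when every table re-processes all descendant images."""
    return sum(len(list(table.iter("img"))) for table in root.iter("table"))


def emissions_single_walk(root: ET.Element) -> int:
    """Count emits when each image is visited exactly once."""
    return len(list(root.iter("img")))
```

In this model, four nested tables produce 4 + 3 + 2 + 1 = 10 emits with rescanning, against 4 for a single walk.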

Bug 3 — missing space separator between nested table cell text
--------------------------------------------------------------
HTMLDocumentBackend.get_text() uses _extract_text_recursively(), which
only appended a trailing space for <p> and <li> tags. When a table cell
contained a nested <table>, adjacent <th> or <td> elements without
whitespace NavigableString nodes between them were concatenated directly
(e.g. "TypeSound" instead of "Type Sound").

Fix: add "th" and "td" to the trailing-space tag set so that the text
content of each cell is separated by a space.

Bug 1 and Bug 2 were introduced in docling v2.55.0 (commit c803abe) with
rich table cell support.

Signed-off-by: Ivan Traus <ivan@liminary.io>

* test(html): align markdown fixtures with current docling-core behavior

Signed-off-by: Ivan Traus <ivan@liminary.io>

* test(xbrl): update XBRL fixture after get_text() cell spacing fix

The Bug 3 fix (adding th/td to trailing-space tags in get_text())
affects the XBRL backend which internally uses HTMLDocumentBackend.
Regenerate the mlac-20251231 fixture to match the corrected text
extraction.

Signed-off-by: Ivan Traus <ivan@liminary.io>

* chore(deps): bump docling-core to 2.67.1, regenerate fixtures and trim tests

Update uv.lock to pull in the merged nested-table flattening fix
(docling-core#525). Regenerate markdown fixtures that now show flattened
text instead of invalid embedded table syntax. Trim verbose test
docstrings and remove narrating comments.

Signed-off-by: Ivan Traus <ivan@liminary.io>

* fix: annotate _use_inline_group return type and regenerate docx fixtures

Add Generator[RefItem | None, None, None] return type and Google-style
Yields section to _use_inline_group. Regenerate docx ground truth
fixtures affected by docling-core 2.67.1 nested-table flattening.

Signed-off-by: Ivan Traus <ivan@liminary.io>

* refactor: use Iterator type hint and remove redundant test

Apply feedback: use Iterator instead of Generator, drop type from Yields docstring, and remove
test_e2e_rich_table_cells_markdown (already covered by test_e2e_html_conversions).

Signed-off-by: Ivan Traus <ivan@liminary.io>

* style(html): apply indent to docstrings

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

---------

Signed-off-by: Ivan Traus <ivan@liminary.io>
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2026-03-10 09:48:21 +01:00
Christoph Auer 3b7bba0212 chore: Revert unintended test ground truth changes from #3019 (#3093)
add test diffs

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2026-03-09 17:38:34 +01:00
Aditya Sasidhar 1192714b53 fix: add parse timeout to legacy LaTeX documents (#3019)
* add a quick 30-second parse timeout per file (this seems exorbitant; the average parse time of large files still has to be measured to settle on an upper and an average limit). The next commit should skip individual nodes instead of the whole file

Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>

* fix: bypass mypy attr-defined for parse_timeout in late options

Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>

* feat(latex): SOTA improvements from pandoc — theorems, preamble metadata, math envs, bugfixes

Features:
- Theorem/proof/lemma/corollary/definition/remark/example/conjecture environments
- Proof environment with conditional QED ◻ symbol
- \paragraph and \subparagraph as headings (levels 4, 5)
- \author, \date, \title extracted from preamble
- \href preserves URL as [text](url)
- \renewcommand and \providecommand macro extraction
- dmath/dgroup/darray/subequations math environments
- \input cycle detection with depth limit of 10
- quote/quotation/verse environment handling
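
The \input cycle detection with a depth limit can be sketched roughly as follows. This is an illustration under assumptions, not the backend's implementation: expand_inputs is a hypothetical helper, and files are modelled as an in-memory dict instead of being read from disk.

```python
import re

MAX_DEPTH = 10  # the depth limit named in the feature list above
INPUT_RE = re.compile(r"\\input\{([^}]+)\}")


def expand_inputs(name, files, depth=0, active=frozenset()):
    # Inline every \input{...} in files[name]; a reference is dropped when it
    # is unknown, nested too deeply, or already being expanded (a cycle).
    if depth > MAX_DEPTH or name in active or name not in files:
        return ""
    active = active | {name}
    return INPUT_RE.sub(
        lambda m: expand_inputs(m.group(1), files, depth + 1, active),
        files[name],
    )


files = {
    "main": r"A \input{part} B",
    "part": r"P \input{main} Q",  # points back at main: a cycle
}
expanded = expand_inputs("main", files)
```

The set of files currently being expanded is passed down the recursion, so a file may be included twice in different branches but never inside itself.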

Bugfixes:
- Fixed UnboundLocalError in _extract_custom_macros
- Fixed _extract_verbatim_content regex stealing content
- Fixed is_valid() rejecting preamble-only fragments
- Removed unused deepcopy import
- Unified recursion depth limits to 10

Tests:
- 7 new tests, 1 updated, ground-truth regenerated

Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>

* Added more dangerous macros to the ignore list; the is_valid() function now accepts \documentstyle too; added some essential, primitive layout passes

Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>

* removed the restrictive nature of the is_valid() function

Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>

* added test coverage for the new features and removed the timed parsing test, which caused hangs on Python 3.10 in the CI/CD testing pipeline

Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>

---------

Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>
2026-03-09 10:52:56 +01:00
Br1an cd9dd10ccf fix(docx): preserve URL fragments and query params in hyperlinks (#3050)
Remove `Path()` wrapper from hyperlink address extraction. `Path()`
is designed for filesystem paths and strips URL fragments (#) and
query parameters (?), causing truncated hyperlinks in DOCX output.
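
As a standard-library illustration of why filesystem path types are a poor fit for URLs (it does not reproduce the backend's exact failure, and the example URL is made up): pathlib treats the string as path segments and collapses the "//" after the scheme, while urllib.parse keeps the query and fragment addressable.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit

url = "https://example.com/docs/page?lang=en#section-2"  # made-up example URL

# pathlib knows nothing about schemes, queries or fragments.
as_path = str(PurePosixPath(url))

# urllib.parse splits the same string into URL components.
parts = urlsplit(url)
query, fragment = parts.query, parts.fragment

# Rebuilding from the components gives back the original URL.
round_trip = parts.geturl()
```

Keeping the address as a plain string, or parsing it with urllib.parse when components are needed, avoids any path normalization.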

Signed-off-by: Br1an67 <932039080@qq.com>
2026-03-06 11:35:17 +01:00
Cesar Berrospi Ramis 56eb12782c fix(docx): handle list items immediately after numbered headings (#3070)
fix(docx): create a new list group with a list item after a heading

When a list with a different 'numid' appears after a heading (marked as list item too), a
new list group needs to be created to avoid inconsistencies (list item under heading).

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2026-03-06 09:30:48 +01:00
Br1an 859c302310 fix(xlsx): handle OneCellAnchor images in Excel backend (#3045)
* fix: handle OneCellAnchor images in Excel backend

Add support for OneCellAnchor image positioning in _find_images_in_sheet().
Previously, only TwoCellAnchor images had their position extracted; images
with OneCellAnchor (the default when inserting images in Excel) would
default to bounding box (0,0,0,0), placing them all at the top-left
corner regardless of their actual position.

Now OneCellAnchor images use the anchor cell as their bounding box
origin, correctly preserving the image's position in the output document.

* DCO Remediation Commit for Br1an67 <932039080@qq.com>

I, Br1an67 <932039080@qq.com>, hereby add my Signed-off-by to this commit: cd878618ff

Signed-off-by: Br1an67 <932039080@qq.com>

---------

Signed-off-by: Br1an67 <932039080@qq.com>
2026-03-02 12:55:17 +01:00
Cesar Berrospi Ramis 334ba6e51f feat: create a backend parser for XBRL instance reports (#3017)
* build(xbrl): add Arelle as open-source library for XBRL

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* feat(xbrl): design and implement a backend parser for XBRL documents

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* test: remove print statements to reduce verbosity

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* style(XBRL): apply PEP8 naming convention for acronyms

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* refactor(XBRL): set XBRL dependencies as optional

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

---------

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2026-02-24 16:52:02 +01:00
Peter W. J. Staar bf417e6d26 feat: Introduce docling-parse v5 and deprecate old docling-parse backends (#2872)
* feat: simplifying towards docling-parse v5

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* working on integrating docling-parse v5

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* added the test_backend_docling_parse

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* Updated the docling-parse to 5.3.0

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* ran the pre-commit

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* fixed the backend_docling_parse

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* ran pre-commit

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* updated the groundtruth to deal with rounding errors

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* updated comments for later docling-parse integrations

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* ran pre-commit

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* Make DoclingParseV2 and DoclingParseV4 backend stubs that route to new backend, emit warning.

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* lock docling-parse

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* updated to 3.5.2

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2026-02-17 20:27:56 +01:00
Cesar Berrospi Ramis a1b0e3fd6b fix(csv): set default delimiter by default (#3005)
If the delimiter cannot be determined, assume the default delimiter (comma).
This fixes single-column CSV files, which previously triggered a parsing error.
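
A minimal sketch of this fallback, assuming the delimiter is detected with the standard library's csv.Sniffer (detect_delimiter and its candidate list are hypothetical names chosen for the example):

```python
import csv
import io


def detect_delimiter(sample: str, candidates: str = ",;\t|", default: str = ",") -> str:
    """Sniff the delimiter; fall back to the default when sniffing fails."""
    try:
        return csv.Sniffer().sniff(sample, delimiters=candidates).delimiter
    except csv.Error:
        # Raised when no candidate delimiter can be determined,
        # e.g. for a file with a single column.
        return default


single_column = "name\nalice\nbob\n"
delimiter = detect_delimiter(single_column)
rows = list(csv.reader(io.StringIO(single_column), delimiter=delimiter))
```

A single-column file contains none of the candidate characters, so the sniffer gives up and the comma default is used; each row then parses as a one-element list.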

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2026-02-17 20:26:05 +01:00
jhchoi1182 1f914826bb fix: add failed pages to DoclingDocument for page break consistency (#2939)
* fix: add failed pages to DoclingDocument for page break consistency

When some PDF pages fail to parse, they were not added to
DoclingDocument.pages, causing page break markers to be incorrect
during export. This adds failed/skipped pages with their size info
(if available) to maintain correct page numbering and structure.

- Add _add_failed_pages_to_document() method in StandardPdfPipeline
- Add test cases for failed page handling
- Add test cases for normal page handling (regression test)
- Add test PDF files

Signed-off-by: jhchoi1182 <jhchoi1182@gmail.com>

* fix: ensure resource cleanup and simplify type hints

- Wrap page_backend usage in try-finally to guarantee unload (prevents resource leaks).
- Simplify redundant 'float | None | None' type hint.

Signed-off-by: jhchoi1182 <jhchoi1182@gmail.com>

* fix: add groundtruth for normal_4pages.pdf and exclude failing PDFs from e2e test

Signed-off-by: jhchoi1182 <jhchoi1182@gmail.com>

* fix: ensure correct status assertion for failed pages in tests

Signed-off-by: jhchoi1182 <jhchoi1182@gmail.com>

---------

Signed-off-by: jhchoi1182 <jhchoi1182@gmail.com>
2026-02-13 13:35:35 +01:00
Aditya Sasidhar e6ccb8b2c1 feat: added support for parsing LaTeX (.tex) documents (#2890)
* feat: added support for parsing LaTeX (.tex) documents

Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>

* feat: implement PR #2890 feedback for LaTeX backend

- Add text formatting options (bold, italic, underline) for LaTeX macros
- Enhance image embedding with PIL and ImageRef.from_pil()
- Refactor list processing to use GroupItem structure
- Refactor bibliography to use GroupItem structure
- Add nested list test coverage
- All tests passing (39/39), all linters passing

Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>

* DCO Remediation Commit for Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>

I, Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>, hereby add my Signed-off-by to this commit: f19f135b431d489cd8bf3982524505a0bbd8696d

Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>

* feat: enhance latex backend with robustness fixes and ground truth

- Add custom macro expansion for improved text quality
- Fix preamble filtering to remove metadata garbage
- Support recursive \input{} and \include{} file loading
- Organize test data into subdirectories for complex papers
- Add full end-to-end ground truth for 4 major arXiv papers (Attention, Mistral, DeepSeek, OTSL)
- Pass all 41 unit tests and pre-commit checks

Addresses @cau-git feedback for ground-truth data.

Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>

* fix: minor formatting in test file

Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>

* feat: enhance LaTeX backend with robust math and figure support

- Fixed re.error: bad escape in macro expansion by using lambda in re.sub
- Fixed sentences breaking at inline math ($) by preserving it within paragraphs
- Improved figure environment with proper grouping and structured representation
- Fixed crashes on documents starting with % comments
- Added comprehensive unit tests and updated all ground truth data
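
The "bad escape" fix in the first item can be reproduced in isolation. In this sketch the macro table and function names are hypothetical; the point is that re.sub parses a string replacement as a template, so a LaTeX body such as \mathbb{R} is rejected, whereas a callable replacement is inserted verbatim.

```python
import re

MACROS = {r"\R": r"\mathbb{R}"}  # hypothetical custom macro table


def _pattern(name):
    # Match the macro name literally, but not as a prefix of a longer name.
    return re.escape(name) + r"(?![A-Za-z])"


def expand_naive(text):
    for name, body in MACROS.items():
        # body is parsed as a replacement template: "\m" is a bad escape.
        text = re.sub(_pattern(name), body, text)
    return text


def expand_safe(text):
    for name, body in MACROS.items():
        # A callable's return value is used as-is, with no escape processing.
        text = re.sub(_pattern(name), lambda _m, body=body: body, text)
    return text


source = r"x \in \R^n"
try:
    expand_naive(source)
    naive_failed = False
except re.error:
    naive_failed = True

expanded = expand_safe(source)
```

Binding body as a default argument of the lambda keeps each replacement tied to its own macro inside the loop.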

Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>

* WIP: saving work for laptop migration

Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>

* fixed most of the line-breaking issues; some still remain

Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>

* fix: generalized LaTeX macro parsing and robustness improvements

This commit addresses several issues with LaTeX parsing:
- Correctly handle unknown macros (like \ion{N}{2}) inline to avoid line breaks.
- Fix extraction of structural macros (section, caption, etc.) vs text-only groups.
- Address PR feedback regarding inline math spacing and splitting.
- Regenerate ground truth files reflecting these improvements.

Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>

* style: apply automatic formatting fixes

Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>

* style: fix ruff linter and formatter errors

Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>

* fix: typing issues identified by mypy

Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>

* style: apply formatting fixes to tests

Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>

* fix: update groundtruth files for latex backend

Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>

* fixed the awkward line-breaking issue, which came from mishandling the text buffer

* added the ground truth missing from the previous commit

* DCO Remediation Commit for Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>

I, Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>, hereby add my Signed-off-by to this commit: 7e032635ef
I, Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>, hereby add my Signed-off-by to this commit: aeba688384

Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>

* Ran the precommit as requested

Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>

---------

Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>
2026-02-10 15:13:09 +01:00
Rashid Ul Islam 3110c439da fix(backend): improve Excel table bounds detection and flatten merged cells (#2778)
* fix(backend): improve Excel table detection with BFS and configurable tolerance

Replaces the bounding-box strategy with a Flood Fill (BFS) algorithm to correctly detect non-rectangular tables. Reverts span flattening to preserve semantic structure. Adds 'gap_tolerance' option to backend.
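
A flood fill over occupied cells can be sketched as below. This is an illustration only: find_tables is a hypothetical function, cells are modelled as (row, col) tuples, and the reading of gap_tolerance as "number of empty rows or columns that may be bridged" is an assumption, not the backend's documented semantics.

```python
from collections import deque


def find_tables(cells, gap_tolerance=0):
    """Group non-empty (row, col) cells into connected regions via BFS."""
    remaining = set(cells)
    reach = gap_tolerance + 1  # how far a neighbour may be, in rows/columns
    regions = []
    while remaining:
        start = remaining.pop()
        region = {start}
        queue = deque([start])
        while queue:
            row, col = queue.popleft()
            for d_row in range(-reach, reach + 1):
                for d_col in range(-reach, reach + 1):
                    neighbour = (row + d_row, col + d_col)
                    if neighbour in remaining:
                        remaining.remove(neighbour)
                        region.add(neighbour)
                        queue.append(neighbour)
        regions.append(region)
    # Order regions by their top-left-most cell for stable output.
    return sorted(regions, key=min)


# A 2x2 block in columns 0-1 and a 2x1 block in column 3 (column 2 is empty).
cells = {(0, 0), (0, 1), (1, 0), (1, 1), (0, 3), (1, 3)}
strict = find_tables(cells)                   # empty column separates the blocks
lenient = find_tables(cells, gap_tolerance=1)  # one empty column is bridged
```

Because regions are sets of cells rather than rectangles, an L-shaped or otherwise non-rectangular table stays one region instead of being cut by a bounding box.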

Signed-off-by: Rashidul Islam <rasidulislam71@gmail.com>

* chore: reverse unnecessary file changes

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

---------

Signed-off-by: Rashidul Islam <rasidulislam71@gmail.com>
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2026-02-02 10:31:11 +01:00
Sam Quigley 5e452a2e8f fix(pptx): handle picture shapes with external image references (#2914)
* fix(pptx): handle picture shapes with external image references

When processing PowerPoint files containing picture shapes that reference
external images (rather than embedded images), the python-pptx library
raises a ValueError("no embedded image") when accessing the `image`
property.

Previously, this caused the entire document conversion to fail because:

1. The `hasattr(shape, "image")` check at line 690 would trigger the
   property getter, which raises ValueError (hasattr only catches
   AttributeError, not ValueError)

2. The exception handler in `_handle_pictures()` only caught
   UnidentifiedImageError and OSError, not ValueError

This fix:
- Removes the unnecessary hasattr check since we already verify the
  shape type is MSO_SHAPE_TYPE.PICTURE
- Adds ValueError to the exception handler in `_handle_pictures()` so
  that picture shapes with external references are gracefully skipped
  with a warning instead of crashing the pipeline
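
The hasattr behaviour described in point 1 is easy to reproduce without python-pptx. In this sketch LinkedPicture is a hypothetical stand-in for a picture shape with an external image, and image_or_none is an illustrative helper, not the backend's method.

```python
class LinkedPicture:
    """Stand-in for a picture shape whose image is linked, not embedded."""

    @property
    def image(self):
        raise ValueError("no embedded image")


def image_or_none(shape):
    # Access the property directly and treat the known failures as "skip".
    try:
        return shape.image
    except (OSError, ValueError) as exc:
        print(f"skipping picture: {exc}")
        return None


shape = LinkedPicture()

# hasattr() runs the property getter and only swallows AttributeError,
# so the ValueError escapes from the check itself.
try:
    hasattr(shape, "image")
    hasattr_raised = False
except ValueError:
    hasattr_raised = True

result = image_or_none(shape)
```

Dropping the hasattr check and widening the except clause, as the fix does, turns the crash into a skipped picture.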

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* DCO Remediation Commit for Sam Quigley <quigley@emerose.com>

I, Sam Quigley <quigley@emerose.com>, hereby add my Signed-off-by to this commit: e69779e07b

Signed-off-by: Sam Quigley <quigley@emerose.com>

* tests(pptx): add a linked image to test the fix on e69779e

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

---------

Signed-off-by: Sam Quigley <quigley@emerose.com>
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2026-02-01 11:44:29 +01:00
Cesar Berrospi Ramis 0602a7cdab feat: webvtt and source tracker (#2787)
* refactor(provenance): account for provenance as union of ProvenanceItem and ProvenanceTrack

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* refactor(webvtt): update WebVTTDocumentBackend with new docling-core classes

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* refactor(webvtt): preserve new lines and add helper handlers

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* refactor(webvtt): set ProvenanceTrack timings as float type

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* style(asr): remove unnecessary imports

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* refactor(asr): use ProvenanceTrack in ASR pipeline

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* tests(webvtt): add additional tests

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* chore(webvtt): parse the title of the WEBVTT file

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* refactor(webvtt): apply refactoring of TrackProvenance from docling-core

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* style(webvtt): apply X | Y annotation instead of Optional, Union

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* refactor(webvtt): drop cue span classes, 'lang' and 'c' tags

Drop WebVTT formatting features not covered by Docling across formats.
Only 'u', 'b', 'i', and 'v' are supported and without classes.
Align with docling-core v2.62.0

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* build: pin docling-core 2.62.0

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

---------

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2026-01-30 17:44:03 +01:00
Siva b6ca094519 feat: add support for Word document comments extraction (#2834)
* feat: add support for Word document comments extraction (fixes #485)

Signed-off-by: s1v4-d <leelasaisivasubrahmanyamdurga@gmail.com>

* fix: address PR review feedback for comments extraction

- Change DocItemLabel.PARAGRAPH to TEXT (deprecating PARAGRAPH)
- Change initials format from '(initials)' to 'author: initials'
- Change timestamp format to include 'time:' prefix
- Update test assertions and regenerate ground truth files

Signed-off-by: s1v4-d <leelasaisivasubrahmanyamdurga@gmail.com>

* chore: update comment format and move format documentation from inline comment to function docstring

Signed-off-by: s1v4-d <leelasaisivasubrahmanyamdurga@gmail.com>

* Use docling-core v2.58.0 add_comment() API to properly link Word document
comments to their annotated text items via FineRef references.

- Import FineRef from docling_core.types.doc.document
- Refactor _add_comments to use doc.add_comment(targets=[...]) API
- Parse DOCX XML for commentRangeStart/End markers in _extract_comment_ranges
- Track paragraph-to-items mapping for comment linking
- Fallback to unlinked comments in COMMENT_SECTION group when no targets found
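
Locating the comment range markers in a paragraph's XML can be sketched with the standard library. This is an assumption-laden illustration, not the body of _extract_comment_ranges: comment_ids is a hypothetical helper and the paragraph XML is a hand-written sample.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by DOCX document parts.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def comment_ids(paragraph_xml: str) -> list:
    """Return the w:id of every w:commentRangeStart in the element."""
    root = ET.fromstring(paragraph_xml)
    return [
        marker.get(f"{{{W}}}id")
        for marker in root.iter(f"{{{W}}}commentRangeStart")
    ]


sample = (
    f'<w:p xmlns:w="{W}">'
    '<w:commentRangeStart w:id="0"/>'
    "<w:r><w:t>Annotated text</w:t></w:r>"
    '<w:commentRangeEnd w:id="0"/>'
    "</w:p>"
)
ids = comment_ids(sample)
```

The collected IDs can then be matched against the comments part of the document to decide which items a comment targets.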

Signed-off-by: s1v4-d <leelasaisivasubrahmanyamdurga@gmail.com>

* - Extract comment IDs directly during paragraph element processing to match element IDs
- Clear paragraph mappings at start of each conversion for consistent behavior
- Always create comment groups and use add_comment() API with targets
- Add _get_comment_ids_for_element() helper to extract comment markers from XML
- Regenerate ground-truth files (JSON/MD/itxt) with comments field properly linked

Signed-off-by: s1v4-d <leelasaisivasubrahmanyamdurga@gmail.com>

* fix: remove incorrect ground-truth files, keep versions with comments field

Signed-off-by: s1v4-d <leelasaisivasubrahmanyamdurga@gmail.com>

* fix: reference comment groups instead of text items in comments field

Signed-off-by: s1v4-d <leelasaisivasubrahmanyamdurga@gmail.com>

---------

Signed-off-by: s1v4-d <leelasaisivasubrahmanyamdurga@gmail.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2026-01-26 09:58:46 +01:00
Cesar Berrospi Ramis 86eaef5b45 fix(md): handle pipe symbols that are not table markers (#2904)
* fix(md): handle pipe symbols that are not table markers

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* chore: update uv.lock with latest docling-core 2.60.2

Update uv.lock file with the latest release of docling-core (2.60.2).
Update (fix) ground truth files for testing markdown serialization to be
in sync with the serialization fix (issue 2880).

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

---------

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2026-01-23 15:19:09 +01:00
Tong Luo 999dbb2765 fix: PPTX parsing: bullet points not grouped correctly under subheadings (#2663) (#2855)
* fix: PPTX parsing: bullet points not grouped correctly under subheadings (#2663)

Signed-off-by: Tong Luo <luotng@cn.ibm.com>

* fix: PPTX parsing: bullet points not grouped correctly under subheadings, support Python 3.9 (#2663)

Signed-off-by: Tong Luo <luotng@cn.ibm.com>

* fix: PPTX parsing: optimized code naming, descriptions (#2663)

Signed-off-by: Tong Luo <luotng@cn.ibm.com>

* docs(pptx): updated docstrings in pptx backend parser

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

---------

Signed-off-by: Tong Luo <luotng@cn.ibm.com>
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2026-01-21 14:34:24 +01:00
Michele Dolfi a1f8bddcb7 chore: update locked deps in CI (#2895)
* chore: update locked deps

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* update test results (to review)

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2026-01-20 11:45:57 +01:00
Michele Dolfi 19af03f539 feat: Support for DeepSeek-OCR in VLM pipeline (#2798)
* add parsing of annotated markdown and definition of new ResponseFormat for the VLM pipeline

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* fix broken html in test

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* update result with initial text

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* move parsing to vlm pipeline

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* restore md from main

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* process table structure

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* simplify and refactor

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* factor out deepseekocr utils

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* renaming

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* refactor common logic in vlm parsing logic

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add deepseek-ocr with ollama

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* update tests for new annotation format

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* fix parsing of title

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* more test data

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add picture item

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* fix bbox parsing

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* remove old tests

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add test parsing deepseek md

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* rename test

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add test with ollama conversion

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* fix test and mark methods as private

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2026-01-09 18:42:40 +01:00
Cesar Berrospi Ramis 5c1f8f0171 fix(docx): handle grouped pictures (#2861)
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2026-01-09 09:42:52 +01:00
Michele Dolfi 595115d892 fix(markdown): allow text before headers also in mixed markdown and html (#2801)
* fix missing content in mixed markdown

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* delete elements outside of iterate_items

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* no need for new test files

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add new export without extra furniture title

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add html options

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* use html options

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-12-17 13:54:07 +01:00
Cesar Berrospi Ramis d007ba0e6f fix(html): tackle paragraphs with block-level elements (#2720)
Handle <p> elements that contain block-level elements anywhere inside them, as browsers do.
Fix wrong type annotations.

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2025-12-05 12:52:53 +01:00
Matvei Smirnov aebe25cf00 fix(html): prevent hierarchy reset in rich table cells (#2716)
* fix(html): restore parents after rich cell walking

Signed-off-by: Matvei Smirnov <vdalekesmirnov@gmail.com>

* fix(html): add table cell context manager, update tests

Signed-off-by: Matvei Smirnov <vdalekesmirnov@gmail.com>

* fix(html): table with heading test data

Signed-off-by: Matvei Smirnov <vdalekesmirnov@gmail.com>

---------

Signed-off-by: Matvei Smirnov <vdalekesmirnov@gmail.com>
2025-12-03 18:52:23 +01:00
Cesar Berrospi Ramis c97715f5fd fix(docx): parse integrals as n-ary objects without chr element (#2712)
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2025-12-03 11:25:52 +01:00
glypt 54cd6d7406 fix: do not consider singleton cells in xlsx as TableItems but rather TextItems (#2589)
fix: do not handle 1x1 cell as a tableitem but as a textitem

Signed-off-by: glypt <8trash-can8@protonmail.ch>
2025-11-27 16:25:32 +01:00
Christoph Auer 134436245a feat(experimental): Add experimental TableCropsLayoutModel (#2669)
* feat: Scaffolding for layout and table model plugin factory

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add missing files

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add base options classes for layout and table

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* feat(experimental): Add experimental TableCropsLayoutModel

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add example

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-11-25 05:14:51 +01:00
Michele Dolfi e58055465c fix(docx): Missing list items after numbered header (#2665)
* fix #2250. list items after numbered headers

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add test for new case

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* chore(docx): remove unnecessary check

Remove 'current_parent is None' check in '_add_list_item' function since it
will always be None.

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2025-11-24 08:49:21 +01:00
Cesar Berrospi Ramis 054c4a634d fix(docx): parse page headers and footers (#2599)
* fix(docx): parse page headers and footers

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* chore(docx): rename _add_header to _add_heading

To avoid confusion, rename the _add_header function to _add_heading,
since the function adds section headings.

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* chore(docx): extend the page header and footer parsing to any content type

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* chore(docx): fix _add_header_footer function

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

---------

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2025-11-10 16:10:12 +01:00
Cesar Berrospi Ramis ef623ffcee fix(docx): slow table parsing (#2553)
* chore(docx): remove unnecessary import

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* fix(docx): simplify parsing of simple tables

Simplify the parsing of tables with just text (no rich cells).
Move nested function group_cell_elements out of _handle_tables for readability.

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* chore(docx): reuse method for finding inline pictures

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* chore(docx): format strikethrough text

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* tests(docx): use fixtures to avoid converting same file multiple times

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* fix(docx): remove unnecessary argument docx_obj in functions

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* tests(docx): add test for rich table cells

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* chore(docx): small improvements in backend and its unit tests

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* chore(docx): parse superscript and subscript formatted text

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

---------

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2025-11-06 05:25:53 +01:00
Cesar Berrospi Ramis 0ba8d5d9e3 fix(html): slow table parsing (#2582)
* fix(html): simplify parsing of simple table cells

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* tests(html): add test for rich table cells

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* fix(html): ensure table cells with formatted text are parsed as RichTableCell

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* refactor(html): simplify process_rich_table_cells since only rich cells are processed

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* fix(html): formatted cell runs should be parsed as text items respecting the order

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* chore: pin latest docling-core and update uv.lock

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* chore: upgrade dependencies on uv.lock

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

---------

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2025-11-06 05:25:36 +01:00