* fix(docx): handle missing chr attribute in groupChr OMML elements
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* fix(docx): escape spaces in OMML limit text for proper LaTeX rendering
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* fix(docx): fix inline equation reconstruction to prevent tag corruption
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* chore(docx): add type hints and docstrings to OMML module
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* fix(docx): fix genfrac formatting and eliminate grouping function warnings
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* fix(docx): handle unmapped characters in OMML % formatting
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
---------
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* fix(pptx): skip malformed picture shapes instead of aborting conversion
MsPowerpointDocumentBackend._handle_pictures reads embedded image bytes via python-pptx's shape.image accessor. On PPTX files with slightly malformed <p:pic> shapes, shape.image raises three exceptions that the existing (UnidentifiedImageError, OSError, ValueError) clause does not catch, so one bad picture aborts conversion of the entire presentation:
- InvalidXmlError when <p:blipFill> is missing
- KeyError when <a:blip r:embed> points to an unknown relationship
- AttributeError when the embedded part's content-type isn't an image
These files open normally in Keynote and Google Drive, so the backend should handle them as gracefully as it already handles truncated or unreadable image payloads.
This follows the same pattern as #2914, which extended the same except tuple with ValueError to handle linked (external) image references. The three cases above are the remaining shape.image failure modes that still escape.
Extend the except tuple to cover the three cases and log the same warning used for other unreadable images, leaving the rest of the presentation to convert normally. Add a regression fixture with one malformed picture per failure mode plus a focused test.
Fixes #3371
Signed-off-by: pateltejas <tejas226@hotmail.com>
* refactor(pptx): use warnings.warn for malformed picture skips
Address PR review feedback: use Python's warnings module with UserWarning to signal the skip to callers instead of logging.Logger.warning, matching the pattern used in msword_backend for "Skipping external image reference". This makes the skip visible via standard warning filters and catchable in tests.
Update the regression test to assert the warning is emitted via pytest.warns, which also suppresses the message during the test run so it doesn't clutter suite output.
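A minimal sketch of the skip-and-warn pattern described above, not the backend's actual code. `read_picture_bytes`, `SKIPPABLE`, and `BrokenShape` are hypothetical names, and `InvalidXmlError` is a local stand-in for the python-pptx exception so the snippet runs without python-pptx installed. Pillow's UnidentifiedImageError is left out to keep the sketch stdlib-only.

```python
import warnings


class InvalidXmlError(Exception):
    """Local stand-in for the python-pptx exception of the same name."""


# The three newly handled failure modes plus the previously handled
# OSError and ValueError.
SKIPPABLE = (InvalidXmlError, KeyError, AttributeError, OSError, ValueError)


def read_picture_bytes(shape):
    """Return the embedded image bytes, or None if the picture is malformed."""
    try:
        return shape.image.blob
    except SKIPPABLE as exc:
        # warnings.warn (rather than a logger) lets callers filter the
        # message and lets tests assert on it with pytest.warns.
        warnings.warn(
            f"Skipping malformed picture shape: {type(exc).__name__}",
            UserWarning,
            stacklevel=2,
        )
        return None


class BrokenShape:
    """Hypothetical shape whose image accessor fails like a dangling r:embed."""

    @property
    def image(self):
        raise KeyError("rId42")
```

Calling `read_picture_bytes(BrokenShape())` returns None and emits one UserWarning, so the caller can move on to the next shape.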
Signed-off-by: pateltejas <tejas226@hotmail.com>
---------
Signed-off-by: pateltejas <tejas226@hotmail.com>
* fix(docx): handle unsupported limit functions gracefully in OMML conversion
Replace RuntimeError with graceful fallback for unknown limit functions in do_limlow().
Add argmax and argmin to LIM_FUNC dictionary for proper LaTeX rendering.
Fixes conversion failures when Word documents contain mathematical operators
not previously supported in the limit function dictionary.
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* test(docx): regenerate ground truth files
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
---------
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* feat(docx): add checkbox parsing support to MsWordDocumentBackend
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* refactor(docx): remove duplicate code in text element handling
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* docs(docx): update checkbox method docstrings
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* refactor(docx): use self._BLIP_NAMESPACES for w14 namespace in checkbox methods
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
---------
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
fix(html): preserve fragment-only anchor links during path resolution
Fragment-only hrefs (e.g. href="#section1") were resolved as filesystem
paths when source_uri was set, breaking internal document navigation.
Add '#' to the skip-resolution prefixes in _resolve_relative_path() so
fragment links pass through unchanged.
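The idea can be sketched as follows. This is an illustration only: `resolve_href` and `SKIP_PREFIXES` are hypothetical names, only the '#' prefix comes from this commit, and `urljoin` stands in for whatever resolution _resolve_relative_path() actually performs.

```python
from typing import Optional
from urllib.parse import urljoin

# Prefixes that are returned untouched. Only '#' is taken from the commit
# message; the other entries are illustrative assumptions.
SKIP_PREFIXES = ("#", "http://", "https://", "mailto:", "data:")


def resolve_href(href: str, source_uri: Optional[str]) -> str:
    """Resolve href against source_uri unless it must pass through as-is."""
    if source_uri is None or href.startswith(SKIP_PREFIXES):
        return href
    return urljoin(source_uri, href)
```

With this guard, `resolve_href("#section1", "https://example.com/docs/page.html")` stays `"#section1"`, while a relative href such as `"img/a.png"` is still resolved against the source location.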
Partially addresses #2929
Signed-off-by: aatrey56 <aatrey.sahay@gmail.com>
feat(docx): Extract VML images with v:imagedata elements
Add VML image support with EMF/WMF conversion and consolidate image handler code.
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* fix(docx): handle inline formulas in list items
Fixes issue where inline formulas in list items were ignored during conversion.
Added helper methods to eliminate code duplication.
Updated test data with list items containing inline equations.
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* fix(docx): collect element refs in _add_inline_equations_to_parent
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
---------
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* fix(pptx): handle NotImplementedError from shape.shape_type
python-pptx raises NotImplementedError from Shape.shape_type for
<p:sp> elements that aren't placeholders, autoshapes, textboxes, or
freeforms (e.g. shapes with empty <p:spPr> from Google Slides exports,
LibreOffice, or Keynote). handle_groups() and handle_shapes() access
shape_type without catching this, crashing the entire conversion.
Add a _safe_shape_type() helper that returns None on
NotImplementedError, so unrecognized shapes skip only the GROUP
recursion and PICTURE extraction while text and table extraction
proceed normally.
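A minimal sketch of such a helper, assuming it only needs to translate the exception into None. `UnclassifiedShape` is a hypothetical stand-in so the snippet runs without python-pptx.

```python
def safe_shape_type(shape):
    """Return shape.shape_type, or None when the shape cannot be classified."""
    try:
        return shape.shape_type
    except NotImplementedError:
        return None


class UnclassifiedShape:
    """Hypothetical stand-in for a <p:sp> with an empty <p:spPr>."""

    @property
    def shape_type(self):
        raise NotImplementedError("unrecognized shape type")
```

Because None never compares equal to the GROUP or PICTURE enum members, callers that branch on the returned value skip those two paths without extra checks.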
Fixes #3308
Signed-off-by: Tejas Patel <tejas226@hotmail.com>
* Fix lint
Signed-off-by: Tejas Patel <tejas226@hotmail.com>
---------
Signed-off-by: Tejas Patel <tejas226@hotmail.com>
* fix(docx): isolate list state in table cells
Lists with the same numId in different table cells were incorrectly
merged. Added context manager to isolate list state during cell
processing. Includes test cases and updated ground truth files.
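One way to picture the isolation is a context manager that swaps in fresh list state for the duration of a cell and restores the outer state afterwards. The sketch below is an assumption about the shape of the fix, not the backend's code. `ListState`, `next_counter`, and `isolated` are hypothetical names.

```python
from contextlib import contextmanager


class ListState:
    """Tracks the last counter value per (numId, ilvl) pair."""

    def __init__(self):
        self.counters = {}

    def next_counter(self, num_id, ilvl=0):
        key = (num_id, ilvl)
        self.counters[key] = self.counters.get(key, 0) + 1
        return self.counters[key]

    @contextmanager
    def isolated(self):
        """Run a block with empty list state, then restore the outer state."""
        saved = self.counters
        self.counters = {}
        try:
            yield self
        finally:
            self.counters = saved
```

Two cells that both use numId 1 then each start at 1 inside their own `with state.isolated():` block, and numbering outside the table continues where it left off.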
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* style(docx): modernize type hints to use PEP 604 union syntax
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
---------
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Commands like \vspace{-1mm} and \hspace{0.2cm} were being filtered
at the command level but their argument values were leaking through as
plain text nodes. Ensure that when a spacing/ignored command is
encountered, its arguments are also suppressed.
- Add vspace, hspace, vspace*, hspace*, addvspace to MACROS_SPACING
- Guard against spacing/ignored macros in _process_macro_node_inline
so their brace arguments are not extracted as inline text
- Guard against spacing/ignored macros in _nodes_to_text so dimension
values do not leak when processing footnotes, captions, etc.
- Update ground truth files to reflect corrected output
Fixes #3240
Signed-off-by: Smeet Agrawal <smeetagrawal23@gmail.com>
Co-authored-by: Smeet Agrawal <smeetagrawal23@gmail.com>
* fix(pdf): propagate hyperlinks to DoclingDocument text items
docling-parse already extracts PdfHyperlink objects with bounding
rectangles and URIs into SegmentedPdfPage.hyperlinks, and TextItem
already has a hyperlink field. However, the PDF pipeline never matched
hyperlink annotations to text clusters — the data was available but
never propagated.
Add spatial matching of PDF hyperlinks to text clusters during page
assembly, then pass the resolved hyperlink through the reading order
model to the final DoclingDocument.
Changes:
- Add hyperlink field to TextElement (base_models.py)
- Add _match_hyperlink() to PageAssembleModel that spatially matches
cluster bboxes against hyperlink annotation rects, aggregating
coverage per URI to handle wrapped links with multiple rects
- Thread hyperlink= through add_text(), add_heading(), add_list_item()
calls in ReadingOrderModel
- Drop hyperlink on text merge when constituent clusters disagree
- Fall back to Path when AnyUrl validation fails (matches HTML backend)
- Regenerate affected ground truth files
- Add unit tests for _match_hyperlink() edge cases
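The per-URI coverage aggregation can be sketched as below. This is a simplified illustration, not the PageAssembleModel code: boxes are plain (left, top, right, bottom) tuples, and the function names and the 0.5 threshold are assumptions made for the example.

```python
from collections import defaultdict
from typing import Optional

Box = tuple[float, float, float, float]  # (left, top, right, bottom)


def intersection_area(a: Box, b: Box) -> float:
    """Area of the overlap between two boxes, 0.0 if they are disjoint."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(width, 0.0) * max(height, 0.0)


def match_hyperlink(
    cluster: Box, links: list[tuple[Box, str]], min_coverage: float = 0.5
) -> Optional[str]:
    """Return the URI whose rects cover most of the cluster, if enough."""
    area = (cluster[2] - cluster[0]) * (cluster[3] - cluster[1])
    if area <= 0:
        return None
    # Sum coverage per URI so a wrapped link with several rects counts once.
    coverage: dict[str, float] = defaultdict(float)
    for rect, uri in links:
        coverage[uri] += intersection_area(cluster, rect)
    if not coverage:
        return None
    uri, covered = max(coverage.items(), key=lambda item: item[1])
    return uri if covered / area >= min_coverage else None
```

A link that wraps across two lines contributes two rects with the same URI. Summing them before comparing against the threshold keeps either half from being rejected on its own.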
Closes #3096
Signed-off-by: hussainarslan <m.hussain.arslan@gmail.com>
* fix(pdf): recover unmatched hyperlinks as REFERENCE items
Track consumed hyperlink indices during cluster matching so that
hyperlinks which don't meet the overlap threshold are not silently
dropped. Unmatched hyperlinks that overlap text clusters are
materialized as synthetic REFERENCE TextElements. Also propagate
hyperlinks through FORMULA items in reading-order assembly.
Signed-off-by: macbook <macbook@users.noreply.github.com>
Signed-off-by: hussainarslan <m.hussain.arslan@gmail.com>
* DCO Remediation Commit for hussainarslan <m.hussain.arslan@gmail.com>
I, hussainarslan <m.hussain.arslan@gmail.com>, hereby add my Signed-off-by to this commit: 71a8d900bd
Signed-off-by: hussainarslan <m.hussain.arslan@gmail.com>
* test: regenerate reference data for hyperlink propagation
Update groundtruth files for 2206.01062, 2305.03393v1, and
textbox.docx to reflect hyperlink fields on text items and
new REFERENCE items for unmatched hyperlinks.
Signed-off-by: hussainarslan <m.hussain.arslan@gmail.com>
* Revert "test: regenerate reference data for hyperlink propagation"
This reverts commit 374f478ebf71e7e43b1b98d7106375c7f3d77101.
Signed-off-by: hussainarslan <m.hussain.arslan@gmail.com>
* Revert "fix(pdf): recover unmatched hyperlinks as REFERENCE items"
This reverts commit e0e9b9225fa5caa0a7b2578a29600a9531edc624.
Signed-off-by: hussainarslan <m.hussain.arslan@gmail.com>
* test: regenerate groundtruth for hyperlink propagation
Regenerate the affected docling_v2 PDF and DOCX fixtures after rerunning the hyperlink propagation groundtruth suite and switch the hyperlink coverage selection helper to the explicit items() form to avoid a type ignore.
Signed-off-by: hussainarslan <m.hussain.arslan@gmail.com>
* test: regenerate groundtruth for docling_core 1.10.0
Regenerate the affected docling_v2 PDF and DOCX fixtures with the current docling_core schema version so committed groundtruth stays compatible with CI and example loading.
Signed-off-by: hussainarslan <m.hussain.arslan@gmail.com>
---------
Signed-off-by: hussainarslan <m.hussain.arslan@gmail.com>
Signed-off-by: macbook <macbook@users.noreply.github.com>
* fix: parse LaTeX macros in multicolumn/multirow table cells
Table cells in multicolumn and multirow used raw LaTeX content
(e.g. \textbf{Header}) wrapped in LatexCharsNode, which passed
through _nodes_to_text() verbatim. Parse the content string with
LatexWalker so formatting macros get properly resolved to text.
Signed-off-by: majiayu000 <1835304752@qq.com>
* test: add regression test for LaTeX formatting in table cells
Add test_latex_table_formatting_in_cells to verify multicolumn/multirow
cells with \textbf, \textit, \tiny, and nested formatting produce clean
text output without raw LaTeX commands (issue #3199).
Also narrow except clause from bare Exception to LatexWalkerParseError
for both multicolumn and multirow parse fallbacks.
Signed-off-by: majiayu000 <1835304752@qq.com>
---------
Signed-off-by: majiayu000 <1835304752@qq.com>
* fix(omml): correct LaTeX output for fractions, math operators, and functions
Fixes three related bugs in OMML-to-LaTeX conversion:
A) Fraction raised to a power now produces correct grouping braces:
{\frac{(x-c)}{v}}^{2} instead of \frac{(x-c)}{v}^{2}
Adds dedicated do_ssub/do_ssup/do_ssubsup handlers that wrap
complex base expressions (fractions, radicals) in braces.
B) EN DASH (U+2013) and CIRCUMFLEX (U+005E) inside math runs are
now mapped to their math-mode equivalents (- and ^) instead of
being escaped as \text{\textendash} and \text{\textasciicircum}.
C) Adds missing standard math functions to the FUNC dict: log, ln,
exp, det, gcd, deg, hom, ker, dim, arg, inf, sup, lim, Pr.
These now emit proper LaTeX commands (e.g. \log) instead of
falling back to plain italic text.
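The brace wrapping from fix A can be illustrated with a small sketch. `superscript` and `COMPLEX_PREFIXES` are hypothetical names, and detecting a complex base by its LaTeX prefix is a simplification chosen for the example. The actual do_ssup handler works on OMML elements.

```python
# LaTeX commands whose output must be grouped before taking a superscript.
COMPLEX_PREFIXES = ("\\frac", "\\sqrt")


def superscript(base: str, exponent: str) -> str:
    """Build base^{exponent}, wrapping fraction or radical bases in braces."""
    if base.startswith(COMPLEX_PREFIXES):
        base = "{" + base + "}"
    return f"{base}^{{{exponent}}}"
```

For the base `\frac{(x-c)}{v}` and exponent `2` this produces `{\frac{(x-c)}{v}}^{2}`, while a simple base such as `x` is left ungrouped as `x^{2}`.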
Closes #3120
Signed-off-by: giulio-leone <giulio.leone@users.noreply.github.com>
Signed-off-by: giulio-leone <giulio97.leone@gmail.com>
* test(omml): add test documents for OMML-to-LaTeX conversion bugs
Add three minimal DOCX files exercising the fixed edge cases:
- omml_frac_superscript.docx: fraction as superscript base (Bug A)
- omml_text_escapes_in_math.docx: en-dash and caret in math runs (Bug B)
- omml_func_log.docx: log function recognition (Bug C)
Each file includes matching groundtruth (md, json, itxt).
Requested-by: @dolfim-ibm
Signed-off-by: Giulio Leone <giulioleone10@gmail.com>
Signed-off-by: giulio-leone <giulio.leone@users.noreply.github.com>
Signed-off-by: giulio-leone <giulio97.leone@gmail.com>
* fix(omml): avoid double-wrapping nested sub/sup containers
Signed-off-by: giulio-leone <giulio.leone@users.noreply.github.com>
Signed-off-by: giulio-leone <giulio97.leone@gmail.com>
* fix(omml): fix Bug B caret escape + use issue #3120 test documents
Bug B fix: prevent escape_latex from re-escaping characters that
process_unicode intentionally mapped to math operators. The caret
character U+005E inside <m:r><m:t> math runs was being converted
to ^ by _MATH_CHAR_MAP, then immediately re-escaped to \^ by
escape_latex. Now do_r restores math-mapped chars after escaping.
Result: x - y\^2 → x - y^2 (correct superscript)
Test documents: replace minimal programmatic fixtures (~1.2 KB)
with the real Word documents from issue #3120 reporter (smroels,
~37 KB each). Regenerate all groundtruth.
Signed-off-by: giulio-leone <giulio97.leone@gmail.com>
* test: regenerate groundtruth for omml_text_escapes_in_math
Update .itxt to use proper indented-text export format (item hierarchy)
and refresh .json to match current converter output.
Signed-off-by: giulio-leone <giulio97.leone@gmail.com>
* test(omml): regenerate indented text snapshots
The OMML regression documents were exported into the .itxt fixtures using the
wrong format, so the real DOCX end-to-end check failed even though the rebased
converter output was correct.
Regenerate the two broken indented-text snapshots from the current branch so
the MS Word E2E test verifies the actual converter behavior.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* style(omml): apply ruff format normalization
Normalize the multiline condition in omml.py to match the repository
ruff-format output so the pre-commit gate stays clean on the refreshed
PR head.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: giulio-leone <giulio97.leone@gmail.com>
* DCO Remediation Commit for giulio-leone <giulio97.leone@gmail.com>
I, giulio-leone <giulio97.leone@gmail.com>, hereby add my Signed-off-by to this commit: 08001d9c5c
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: giulio-leone <giulio97.leone@gmail.com>
---------
Signed-off-by: giulio-leone <giulio.leone@users.noreply.github.com>
Signed-off-by: giulio-leone <giulio97.leone@gmail.com>
Signed-off-by: Giulio Leone <giulioleone10@gmail.com>
Co-authored-by: giulio-leone <giulio.leone@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Implementation of HTML backend that (optionally) uses headless browser (via Playwright) to materialize HTML pages into images, and add provenances with bboxes to all elements in the converted docling document.
- Conversion preserves reading order given by HTML DOM tree
- Added support for HTML "input" fields: checkboxes, radiobuttons, text inputs, etc.
- Added support to Key-Value convention in HTML (i.e. elements with id "key1" and "key1_value1" will be paired as key-values, see test cases as examples)
- Heuristic that glues independent inline HTML elements with single-character text in them into larger text blocks
- Support for inline styling (bold, italic, etc.)
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
* fix(msword): split multiple OMML equations into separate formula items
When a DOCX paragraph contains multiple sibling <m:oMath> elements
(e.g. separate equations on one line), the converter previously
concatenated them into a single LaTeX string because element.iter()
walks all descendants depth-first.
Fix: iterate direct children of the paragraph element first to
correctly identify sibling <m:oMath> elements, converting each
independently. Falls back to deep iteration only when oMath
elements are nested inside wrapper elements.
Also splits standalone multi-equation paragraphs into individual
FORMULA document items instead of merging them into one.
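The difference between direct-child iteration and deep iteration can be shown with the standard library's ElementTree. This is a sketch under that assumption: `find_equations` is a hypothetical name and the sample paragraph is hand-written for illustration.

```python
import xml.etree.ElementTree as ET

M = "http://schemas.openxmlformats.org/officeDocument/2006/math"
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
OMATH = f"{{{M}}}oMath"


def find_equations(paragraph):
    """Prefer sibling <m:oMath> children; fall back to a deep search."""
    direct = [child for child in paragraph if child.tag == OMATH]
    if direct:
        return direct
    # Only reached when equations sit inside wrapper elements.
    return list(paragraph.iter(OMATH))


SAMPLE = (
    f'<w:p xmlns:w="{W}" xmlns:m="{M}">'
    "<m:oMath><m:r><m:t>a</m:t></m:r></m:oMath>"
    "<w:r><w:t> and </w:t></w:r>"
    "<m:oMath><m:r><m:t>b</m:t></m:r></m:oMath>"
    "</w:p>"
)

equations = find_equations(ET.fromstring(SAMPLE))
texts = ["".join(eq.itertext()) for eq in equations]
```

Here `texts` holds one entry per sibling equation, so each can be converted to LaTeX independently instead of being concatenated with its neighbours.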
Closes #3121
Signed-off-by: giulio-leone <giulio.leone@users.noreply.github.com>
Signed-off-by: giulio-leone <giulio97.leone@gmail.com>
* test(msword): add multi-equation paragraph test document
Add a minimal DOCX file containing two separate oMath elements
in one paragraph with a text separator, along with groundtruth
output files for markdown, json, and plain text export.
Requested-by: @dolfim-ibm
Signed-off-by: Giulio Leone <giulioleone10@gmail.com>
Signed-off-by: giulio-leone <giulio.leone@users.noreply.github.com>
Signed-off-by: giulio-leone <giulio97.leone@gmail.com>
* test(msword): regenerate multi-equation indented-text snapshot
Signed-off-by: giulio-leone <giulio.leone@users.noreply.github.com>
Signed-off-by: giulio-leone <giulio97.leone@gmail.com>
* test: replace test doc with issue #3121 attachment
Use the real Word document from the issue reporter (smroels)
instead of the minimal programmatic fixture. The new document
contains three sibling <m:oMath> elements in one paragraph,
matching the exact failing shape described in #3121.
Regenerate groundtruth to match the richer document structure.
Signed-off-by: giulio-leone <giulio97.leone@gmail.com>
* test: regenerate groundtruth for omml_multi_equation_paragraph
Re-run document conversion with current code to update .itxt and .json
groundtruth files. The .itxt had stale structure from the previous
programmatic fixture; the new real-document conversion produces the
correct output with three separate formula items.
Signed-off-by: giulio-leone <giulio97.leone@gmail.com>
* style(docx): rerun ruff formatter for msword backend
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* refactor(docx): drop unused tag_name binding
Remove the unused local in the direct oMath iteration path so the code
reads clearly and the outstanding review comment is fully addressed
without changing equation-handling behavior.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: giulio-leone <giulio97.leone@gmail.com>
* DCO Remediation Commit for giulio-leone <giulio97.leone@gmail.com>
I, giulio-leone <giulio97.leone@gmail.com>, hereby add my Signed-off-by to this commit: 84cc70b55e
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: giulio-leone <giulio97.leone@gmail.com>
* test(docx): cover equation paragraph branches
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: giulio-leone <giulio97.leone@gmail.com>
* test(docx): reuse backend fixture in msword tests
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: giulio-leone <giulio97.leone@gmail.com>
---------
Signed-off-by: giulio-leone <giulio.leone@users.noreply.github.com>
Signed-off-by: giulio-leone <giulio97.leone@gmail.com>
Signed-off-by: Giulio Leone <giulioleone10@gmail.com>
Co-authored-by: giulio-leone <giulio.leone@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* fix(docx): Correct list numbering with interleaved numIds and hierarchical markers
Fixes incorrect numbering and missing items in DOCX documents that use
multiple interleaved numbering sequences (numIds).
Changes:
* Reset sub-level counters in _get_list_counter when a parent level
advances, preventing counter bleed-across (e.g. "4. Functional
Requirements" now correctly renders as "1. Functional Requirements")
* Add _build_enum_marker helper to produce hierarchical markers in
"1.2.3." format instead of flat single-level counters
* Fix anchor-based level calculation in new-sequence branch: use
level_at_new_list + ilevel instead of _get_level() to correctly
place items from a different numId at the right document level
* Only set level_at_new_list in the else case (when None) to avoid
corrupting the anchor when switching between interleaved numIds
* Remove _reset_list_counters_for_new_sequence from new-sequence branch
so that returning to a previously seen numId continues its counter
(e.g. Appendix A=1, B=2, C=3 instead of A=1, B=1, C=1)
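The counter behaviour described in the list above can be sketched with a small class. `ListCounters`, `advance`, and `marker` are hypothetical names for illustration. The backend's _get_list_counter and _build_enum_marker may differ in detail.

```python
class ListCounters:
    """Per-numId counters with sub-level reset and hierarchical markers."""

    def __init__(self):
        self._counters = {}  # (num_id, ilvl) -> current value

    def advance(self, num_id, ilvl):
        """Increment one level and reset deeper levels of the same numId."""
        key = (num_id, ilvl)
        self._counters[key] = self._counters.get(key, 0) + 1
        for other in list(self._counters):
            if other[0] == num_id and other[1] > ilvl:
                del self._counters[other]
        return self._counters[key]

    def marker(self, num_id, ilvl):
        """Build a marker such as '1.2.' from all levels up to ilvl."""
        parts = [
            str(self._counters.get((num_id, level), 1))
            for level in range(ilvl + 1)
        ]
        return ".".join(parts) + "."
```

Counters are keyed by numId and never cleared when another numId appears, so returning to an earlier sequence continues its numbering instead of restarting at 1.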
Signed-off-by: Emre Çalışır <emrecalisir95@gmail.com>
* style(docx): apply Ruff formatting to msword_backend
Signed-off-by: Emre Çalışır <emrecalisir95@gmail.com>
* test(docx): add unit tests for list counter, enum marker, and sequence reset helpers
Adds test_list_counter_and_enum_marker covering helper methods introduced
in the list numbering fix: counter increment, sub-level reset on parent
advance, hierarchical marker building, and selective sequence reset.
Signed-off-by: Emre Çalışır <emrecalisir95@gmail.com>
* fix(docx): read start values from abstractNum for correct numbering
When Word creates a new numbering definition that continues from a
previous list, it embeds start values in the abstractNum XML instead of
reusing the same numId. Docling previously ignored these start values
and always initialized counters from 1, producing incorrect markers
like "1.1.1." instead of "2.3.1.".
Changes:
* Add _get_level_element helper to extract level XML from abstractNum,
eliminating duplicated XML traversal in _is_numbered_list
* Add _get_start_value to read w:start from the numbering definition
* Initialize counters in _get_list_counter using start values
* Use start values as fallback in _build_enum_marker for parent levels
that have not been explicitly incremented
Signed-off-by: Emre Çalışır <emrecalisir95@gmail.com>
* test(docx): add interleaved numId edge case to unit_test_headers_numbered
Extends the existing test document with an Appendix section that uses a
different numId, followed by list items that resume the original
numbering sequence with Word-embedded start values (e.g. 2.3.1.).
Updates groundtruth files accordingly.
Signed-off-by: Emre Çalışır <emrecalisir95@gmail.com>
* style(docx): add return type annotation to _get_level_element
Signed-off-by: Emre Çalışır <emrecalisir95@gmail.com>
---------
Signed-off-by: Emre Çalışır <emrecalisir95@gmail.com>
* fix: handle external image relationships in MsWordDocumentBackend
When a .docx file contains image relationships with TargetMode="External"
(common in documents saved from web browsers), accessing
`_Relationship.target_part` raises ValueError because external relationships
don't have a target part within the package.
Check `rel.is_external` before accessing `target_part`, emitting a
UserWarning with the external target URL and returning None so external
images fall through to the existing "image cannot be found" handling.
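A minimal sketch of the guard, assuming a relationship object that exposes `is_external`, `target_ref`, and `target_part` as python-docx's `_Relationship` does. `image_part_or_none` and `ExternalRel` are hypothetical names, and the stand-in class lets the snippet run without python-docx.

```python
import warnings


def image_part_or_none(rel):
    """Return the image part, or None with a warning for external targets."""
    if rel.is_external:
        warnings.warn(
            f"Skipping external image reference: {rel.target_ref}",
            UserWarning,
            stacklevel=2,
        )
        return None
    return rel.target_part


class ExternalRel:
    """Stand-in for a relationship with TargetMode="External"."""

    is_external = True
    target_ref = "https://example.com/logo.png"

    @property
    def target_part(self):
        raise ValueError("target_part is undefined for external targets")
```

Checking the flag first means `target_part` is never touched for external relationships, so the ValueError cannot be raised on that path.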
Includes test with ground truth files for a .docx with external image
references.
Fixes #3113
Signed-off-by: rongo-ms <127863751+rongo-ms@users.noreply.github.com>
* chore: upgrade dependencies in uv.lock file
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
---------
Signed-off-by: rongo-ms <127863751+rongo-ms@users.noreply.github.com>
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* ran DCO
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* modified tf
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* Added to TableFormer v1
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* added convenience methods for quality testing
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* updated with comments
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* ran pre-commit
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* chore: update lockfile
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* fix: Align __init__ args with factory method kwargs
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* chore: bump docling-ibm-models version
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* fix mypy type stubs error with torchvision
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* fix: Add torch/torchvision direct deps
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Remove global torch imports
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* add test diffs
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* fix jats/xbrl test generate, updated test GT from docling-core upgrade
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* fixed the in the cli
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* updated uv lock
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* fixing merge conflicts between main
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* fixing merge conflicts between main (2)
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* upgrade uv.lock
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
---------
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Ahmed Nassar AHN@zurich.ibm.com <AHN@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
* fix(html): fix broken document tree and quadratic PictureItems in rich table cells
Three related bugs in the HTML backend when processing table cells that
contain rich content (RichTableCell), as found on Wikipedia pages with
large reference, taxobox, or classification tables:
Bug 1 — orphaned InlineGroups causing broken parent/child relationships
------------------------------------------------------------------------
When _use_inline_group() created an InlineGroup node (for paragraphs
containing multiple hyperlinks, e.g. "text <a> and <a>"), it was added
as a child of the current parent via doc.add_group(), but its RefItem was
never appended to added_refs / provs_in_cell. This meant:
- group_cell_elements() reparented the text items inside the InlineGroup
(because their individual refs WERE in added_refs), moving them from
body → outer_group_element.
- The InlineGroup itself remained in body.children still pointing to
those same text items as its .children.
- Result: two nodes (InlineGroup and outer_group_element) claimed the
same child items, with contradictory .parent pointers. This broken
tree caused double-serialization of text items in export_to_markdown().
Fix: make _use_inline_group() yield the RefItem of the created group.
Callers (_flush_buffer, _handle_block, _handle_list) now track the
InlineGroup ref instead of individual leaf refs when a group was created.
group_cell_elements() then reparents the whole InlineGroup (with its
children intact) rather than orphaning it.
Bug 2 — quadratic PictureItem creation from stray outer image loop
-------------------------------------------------------------------
In _handle_block() for <table> tags, after parse_table_data() had already
walked the entire table subtree (including nested tables) and emitted
PictureItems for every <img>, there was an additional outer loop:
    for img_tag in tag("img"):
        im_ref2 = self._emit_image(tag, doc)
Because BeautifulSoup's .find_all("img") on a tag finds ALL descendant
<img> elements (including those in nested tables), this loop processed
every image in the entire subtree again. A table nested N levels deep
caused N*(N+1)/2 duplicate PictureItems per image (quadratic growth).
Fix: remove the outer loop. Images are already handled by parse_table_data()
-> _use_table_cell_context() -> _walk() -> _emit_image().
Bug 3 — missing space separator between nested table cell text
--------------------------------------------------------------
HTMLDocumentBackend.get_text() uses _extract_text_recursively(), which
only appended a trailing space for <p> and <li> tags. When a table cell
contained a nested <table>, adjacent <th> or <td> elements without
whitespace NavigableString nodes between them were concatenated directly
(e.g. "TypeSound" instead of "Type Sound").
Fix: add "th" and "td" to the trailing-space tag set so that the text
content of each cell is separated by a space.
Bug 1 and Bug 2 were introduced in docling v2.55.0 (commit c803abe) with
rich table cell support.
Signed-off-by: Ivan Traus <ivan@liminary.io>
* test(html): align markdown fixtures with current docling-core behavior
Signed-off-by: Ivan Traus <ivan@liminary.io>
* test(xbrl): update XBRL fixture after get_text() cell spacing fix
The Bug 3 fix (adding th/td to trailing-space tags in get_text())
affects the XBRL backend which internally uses HTMLDocumentBackend.
Regenerate the mlac-20251231 fixture to match the corrected text
extraction.
Signed-off-by: Ivan Traus <ivan@liminary.io>
* chore(deps): bump docling-core to 2.67.1, regenerate fixtures and trim tests
Update uv.lock to pull in the merged nested-table flattening fix
(docling-core#525). Regenerate markdown fixtures that now show flattened
text instead of invalid embedded table syntax. Trim verbose test
docstrings and remove narrating comments.
Signed-off-by: Ivan Traus <ivan@liminary.io>
* fix: annotate _use_inline_group return type and regenerate docx fixtures
Add Generator[RefItem | None, None, None] return type and Google-style
Yields section to _use_inline_group. Regenerate docx ground truth
fixtures affected by docling-core 2.67.1 nested-table flattening.
Signed-off-by: Ivan Traus <ivan@liminary.io>
* refactor: use Iterator type hint and remove redundant test
Apply feedback: use Iterator instead of Generator, drop type from Yields docstring, and remove
test_e2e_rich_table_cells_markdown (already covered by test_e2e_html_conversions).
Signed-off-by: Ivan Traus <ivan@liminary.io>
* style(html): apply indent to docstrings
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
---------
Signed-off-by: Ivan Traus <ivan@liminary.io>
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* add a quick 30-second parse timeout for each file (this seems generous, but I still need to measure the average parse time of large files to settle on an upper limit and a typical limit); the next commit should skip individual nodes instead of the whole file
Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>
* fix: bypass mypy attr-defined for parse_timeout in late options
Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>
* feat(latex): SOTA improvements from pandoc — theorems, preamble metadata, math envs, bugfixes
Features:
- Theorem/proof/lemma/corollary/definition/remark/example/conjecture environments
- Proof environment with conditional QED ◻ symbol
- \paragraph and \subparagraph as headings (levels 4, 5)
- \author, \date, \title extracted from preamble
- \href preserves URL as [text](url)
- \renewcommand and \providecommand macro extraction
- dmath/dgroup/darray/subequations math environments
- \input cycle detection with depth limit of 10
- quote/quotation/verse environment handling
Bugfixes:
- Fixed UnboundLocalError in _extract_custom_macros
- Fixed _extract_verbatim_content regex stealing content
- Fixed is_valid() rejecting preamble-only fragments
- Removed unused deepcopy import
- Unified recursion depth limits to 10
Tests:
- 7 new tests, 1 updated, ground-truth regenerated
Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>
* Added some more dangerous macros to the ignore list; the is_valid() function now accepts \documentstyle too; added some essential, primitive layout passes
Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>
* removed the restrictive nature of the is_valid() function
Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>
* added test coverage for the new features and removed the time-formatted parsing test, which hung on Python 3.10 in the CI/CD testing pipeline
Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>
---------
Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>
Remove `Path()` wrapper from hyperlink address extraction. `Path()`
is designed for filesystem paths and strips URL fragments (#) and
query parameters (?), causing truncated hyperlinks in DOCX output.
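A small sketch of why a filesystem path object is a poor container for a hyperlink address. It does not reproduce the exact truncation described above, which depends on how the path object was used in the backend. It only shows one effect that is easy to reproduce with the standard library. The example URL is made up.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit

url = "https://example.com/report?page=2#results"

# A path object normalizes the string as a filesystem path; for example,
# the "//" after the scheme collapses to a single "/".
as_path = str(PurePosixPath(url))

# Keeping the address as a plain string leaves every URL component intact.
parts = urlsplit(url)
```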
Signed-off-by: Br1an67 <932039080@qq.com>
fix(docx): create a new list group with a list item after a heading
When a list with a different 'numid' appears after a heading (marked as list item too), a
new list group needs to be created to avoid inconsistencies (list item under heading).
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* fix: handle OneCellAnchor images in Excel backend
Add support for OneCellAnchor image positioning in _find_images_in_sheet().
Previously, only TwoCellAnchor images had their position extracted; images
with OneCellAnchor (the default when inserting images in Excel) would
default to bounding box (0,0,0,0), placing them all at the top-left
corner regardless of their actual position.
Now OneCellAnchor images use the anchor cell as their bounding box
origin, correctly preserving the image's position in the output document.
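The positioning rule can be sketched with stand-in types. `Marker`, the two anchor dataclasses, and `anchor_bbox` are hypothetical names for illustration and are not the spreadsheet library's classes. Giving a one-cell anchor a box that spans exactly one cell is also an assumption, since the message only says the anchor cell becomes the bounding box origin.

```python
from dataclasses import dataclass


@dataclass
class Marker:
    """Stand-in for an anchor marker: zero-based column and row."""
    col: int
    row: int


@dataclass
class OneCellAnchor:
    start: Marker


@dataclass
class TwoCellAnchor:
    start: Marker
    end: Marker


def anchor_bbox(anchor):
    """Return (left, top, right, bottom) in cell units."""
    if isinstance(anchor, TwoCellAnchor):
        return (anchor.start.col, anchor.start.row, anchor.end.col, anchor.end.row)
    if isinstance(anchor, OneCellAnchor):
        # Assumption: the image occupies the single anchor cell.
        return (anchor.start.col, anchor.start.row,
                anchor.start.col + 1, anchor.start.row + 1)
    return (0, 0, 0, 0)  # old fallback, now reached only by unknown anchors
```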
* DCO Remediation Commit for Br1an67 <932039080@qq.com>
I, Br1an67 <932039080@qq.com>, hereby add my Signed-off-by to this commit: cd878618ff
Signed-off-by: Br1an67 <932039080@qq.com>
---------
Signed-off-by: Br1an67 <932039080@qq.com>
* build(xbrl): add Arelle as open-source library for XBRL
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* feat(xbrl): design and implement a backend parser for XBRL documents
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* test: remove print statements to reduce verbosity
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* style(XBRL): apply PEP8 naming convention for acronyms
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* refactor(XBRL): set XBRL dependencies as optional
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
---------
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* feat: simplifying towards docling-parse v5
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* working on integrating docling-parse v5
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* added the test_backend_docling_parse
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* Updated the docling-parse to 5.3.0
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* ran the pre-commit
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* fixed the backend_docling_parse
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* ran pre-commit
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* updated the groundtruth to deal with rounding errors
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* updated comments for later docling-parse integrations
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* ran pre-commit
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* Make DoclingParseV2 and DoclingParseV4 backend stubs that route to new backend, emit warning.
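A stub that routes an old backend name to the new one and warns on use could look roughly like this. The class names are placeholders, and both the subclassing approach and the use of `DeprecationWarning` are assumptions; the message only says that the stubs route to the new backend and emit a warning.

```python
import warnings


class NewParseBackend:
    """Hypothetical stand-in for the current backend."""

    def __init__(self, path):
        self.path = path


class LegacyParseBackend(NewParseBackend):
    """Deprecated name kept so existing imports keep working."""

    def __init__(self, path):
        warnings.warn(
            "LegacyParseBackend is deprecated; use NewParseBackend instead.",
            DeprecationWarning,
            stacklevel=2,  # point the warning at the caller, not at this stub
        )
        super().__init__(path)
```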
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* lock docling-parse
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* updated to 3.5.2
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
---------
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
If the delimiter cannot be determined, assume the default delimiter (comma).
This fixes single-column CSV files, which triggered a parsing error.
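A minimal sketch of the fallback using the standard library's `csv.Sniffer`. The function name `detect_delimiter` and the list of candidate delimiters are assumptions for illustration; the backend's own detection code may differ.

```python
import csv

CANDIDATES = ",;\t|"  # assumption: delimiters worth considering


def detect_delimiter(sample, default=","):
    """Guess the delimiter of a CSV sample, falling back to a comma."""
    try:
        return csv.Sniffer().sniff(sample, delimiters=CANDIDATES).delimiter
    except csv.Error:
        # Raised when no delimiter can be determined, for example when the
        # file has a single column and contains no candidate character.
        return default
```

With this fallback, a single-column sample such as `"name\nalice\nbob\n"` is read as comma-delimited with one field per row instead of raising an error.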
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* fix: add failed pages to DoclingDocument for page break consistency
When some PDF pages failed to parse, they were not added to
DoclingDocument.pages, which made page break markers incorrect
during export. This adds failed/skipped pages with their size info
(if available) to maintain correct page numbering and structure.
- Add _add_failed_pages_to_document() method in StandardPdfPipeline
- Add test cases for failed page handling
- Add test cases for normal page handling (regression test)
- Add test PDF files
Signed-off-by: jhchoi1182 <jhchoi1182@gmail.com>
* fix: ensure resource cleanup and simplify type hints
- Wrap page_backend usage in try-finally to guarantee unload (prevents resource leaks).
- Simplify redundant 'float | None | None' type hint.
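The try-finally pattern from the first bullet, sketched with a stand-in object. `FakePageBackend`, `get_size`, and `read_page_size` are hypothetical names; only the idea of always calling `unload` comes from the message above.

```python
class FakePageBackend:
    """Hypothetical stand-in for a page backend holding native resources."""

    def __init__(self):
        self.loaded = True

    def get_size(self):
        raise RuntimeError("page could not be parsed")

    def unload(self):
        self.loaded = False


def read_page_size(backend):
    """Return the page size, or None on failure; always unload the backend."""
    try:
        return backend.get_size()
    except RuntimeError:
        return None
    finally:
        backend.unload()  # runs on success, on failure, and on early return
```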
Signed-off-by: jhchoi1182 <jhchoi1182@gmail.com>
* fix: add groundtruth for normal_4pages.pdf and exclude failing PDFs from e2e test
Signed-off-by: jhchoi1182 <jhchoi1182@gmail.com>
* fix: ensure correct status assertion for failed pages in tests
Signed-off-by: jhchoi1182 <jhchoi1182@gmail.com>
---------
Signed-off-by: jhchoi1182 <jhchoi1182@gmail.com>
* feat: added support for parsing LaTeX (.tex) documents
Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>
* feat: implement PR #2890 feedback for LaTeX backend
- Add text formatting options (bold, italic, underline) for LaTeX macros
- Enhance image embedding with PIL and ImageRef.from_pil()
- Refactor list processing to use GroupItem structure
- Refactor bibliography to use GroupItem structure
- Add nested list test coverage
- All tests passing (39/39), all linters passing
Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>
* DCO Remediation Commit for Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>
I, Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>, hereby add my Signed-off-by to this commit: f19f135b431d489cd8bf3982524505a0bbd8696d
Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>
* feat: enhance latex backend with robustness fixes and ground truth
- Add custom macro expansion for improved text quality
- Fix preamble filtering to remove metadata garbage
- Support recursive \input{} and \include{} file loading
- Organize test data into subdirectories for complex papers
- Add full end-to-end ground truth for 4 major arXiv papers (Attention, Mistral, DeepSeek, OTSL)
- Pass all 41 unit tests and pre-commit checks
Addresses @cau-git feedback for ground-truth data.
Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>
* fix: minor formatting in test file
Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>
* feat: enhance LaTeX backend with robust math and figure support
- Fixed re.error: bad escape in macro expansion by using lambda in re.sub
- Fixed sentences breaking at inline math ($) by preserving it within paragraphs
- Improved figure environment with proper grouping and structured representation
- Fixed crashes on documents starting with % comments
- Added comprehensive unit tests and updated all ground truth data
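The `re.error: bad escape` fix from the first bullet can be reproduced in isolation. When the replacement passed to `re.sub` is a string, backslash sequences in it are parsed, so LaTeX text such as `\mathbb{R}` fails; a callable's return value is inserted literally. The helper names below are made up for this sketch and are not the backend's functions.

```python
import re


def expand_macro(text, name, replacement):
    """Replace every \\name in text with the replacement, taken literally."""
    pattern = re.compile(r"\\" + re.escape(name) + r"(?![A-Za-z])")
    # A callable's return value is inserted as-is, so backslashes in the
    # replacement are not parsed as escape sequences.
    return pattern.sub(lambda _match: replacement, text)


def expand_macro_naive(text, name, replacement):
    """Same substitution with a string replacement; fails on '\\m' and similar."""
    pattern = re.compile(r"\\" + re.escape(name) + r"(?![A-Za-z])")
    return pattern.sub(replacement, text)
```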
Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>
* WIP: saving work for laptop migration
Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>
* got rid of most of the line-breaking issues; some still exist
Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>
* fix: generalized LaTeX macro parsing and robustness improvements
This commit addresses several issues with LaTeX parsing:
- Correctly handle unknown macros (like \ion{N}{2}) inline to avoid line breaks.
- Fix extraction of structural macros (section, caption, etc.) vs text-only groups.
- Address PR feedback regarding inline math spacing and splitting.
- Regenerate ground truth files reflecting these improvements.
Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>
* style: apply automatic formatting fixes
Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>
* style: fix ruff linter and formatter errors
Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>
* fix: typing issues identified by mypy
Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>
* style: apply formatting fixes to tests
Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>
* fix: update groundtruth files for latex backend
Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>
* fixed the awkward line-breaking issue; it turns out I had not considered the text buffer properly
* I forgot to add the ground truth, so here it is
* DCO Remediation Commit for Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>
I, Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>, hereby add my Signed-off-by to this commit: 7e032635ef
I, Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>, hereby add my Signed-off-by to this commit: aeba688384
Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>
* Ran pre-commit as requested
Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>
---------
Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>
* fix(backend): improve Excel table detection with BFS and configurable tolerance
Replaces the bounding-box strategy with a Flood Fill (BFS) algorithm to correctly detect non-rectangular tables. Reverts span flattening to preserve semantic structure. Adds 'gap_tolerance' option to backend.
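A rough sketch of flood-fill table detection over the set of non-empty cells. `find_tables` is a hypothetical name, and treating diagonal neighbors as connected is an assumption. `gap_tolerance` is read here as the number of empty cells that may separate two cells of the same table, which may not match the option's exact semantics in the backend.

```python
from collections import deque


def find_tables(cells, gap_tolerance=0):
    """Group non-empty (row, col) cells into connected regions using BFS."""
    remaining = set(cells)
    reach = gap_tolerance + 1  # how far a neighbor may be in rows or columns
    tables = []
    while remaining:
        start = remaining.pop()
        region = {start}
        queue = deque([start])
        while queue:
            row, col = queue.popleft()
            for d_row in range(-reach, reach + 1):
                for d_col in range(-reach, reach + 1):
                    neighbor = (row + d_row, col + d_col)
                    if neighbor in remaining:
                        remaining.remove(neighbor)
                        region.add(neighbor)
                        queue.append(neighbor)
        tables.append(region)
    return tables
```

Unlike a single bounding box, each region contains only the cells actually reached, so an L-shaped table stays one region without absorbing unrelated cells that fall inside its rectangle.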
Signed-off-by: Rashidul Islam <rasidulislam71@gmail.com>
* chore: reverse unnecessary file changes
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
---------
Signed-off-by: Rashidul Islam <rasidulislam71@gmail.com>
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* fix(pptx): handle picture shapes with external image references
When processing PowerPoint files containing picture shapes that reference
external images (rather than embedded images), the python-pptx library
raises a ValueError("no embedded image") when accessing the `image`
property.
Previously, this caused the entire document conversion to fail because:
1. The `hasattr(shape, "image")` check at line 690 would trigger the
property getter, which raises ValueError (hasattr only catches
AttributeError, not ValueError)
2. The exception handler in `_handle_pictures()` only caught
UnidentifiedImageError and OSError, not ValueError
This fix:
- Removes the unnecessary hasattr check since we already verify the
shape type is MSO_SHAPE_TYPE.PICTURE
- Adds ValueError to the exception handler in `_handle_pictures()` so
that picture shapes with external references are gracefully skipped
with a warning instead of crashing the pipeline
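The `hasattr` pitfall described above can be reproduced without python-pptx. `LinkedPicture` is a stand-in class written for this sketch; the real exception tuple in `_handle_pictures()` also includes `UnidentifiedImageError`, which is left out here to keep the example free of third-party imports.

```python
class LinkedPicture:
    """Hypothetical stand-in for a picture shape with no embedded image."""

    @property
    def image(self):
        raise ValueError("no embedded image")


def try_hasattr(shape):
    """Show that hasattr() lets the property's ValueError propagate."""
    try:
        return hasattr(shape, "image")
    except ValueError as exc:
        return f"ValueError: {exc}"


def read_image(shape):
    """Access the property directly and skip the shape on known failures."""
    try:
        return shape.image
    except (OSError, ValueError):
        return None  # the backend logs a warning here instead of crashing
```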
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* DCO Remediation Commit for Sam Quigley <quigley@emerose.com>
I, Sam Quigley <quigley@emerose.com>, hereby add my Signed-off-by to this commit: e69779e07b
Signed-off-by: Sam Quigley <quigley@emerose.com>
* tests(pptx): add a linked image to test the fix on e69779e
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
---------
Signed-off-by: Sam Quigley <quigley@emerose.com>
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* refactor(provenance): account for provenance as union of ProvenanceItem and ProvenanceTrack
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* refactor(webvtt): update WebVTTDocumentBackend with new docling-core classes
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* refactor(webvtt): preserve new lines and add helper handlers
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* refactor(webvtt): set ProvenanceTrack timings as float type
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* style(asr): remove unnecessary imports
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* refactor(asr): use ProvenanceTrack in ASR pipeline
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* tests(webvtt): add additional tests
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* chore(webvtt): parse the title of the WEBVTT file
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* refactor(webvtt): apply refactoring of TrackProvenance from docling-core
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* style(webvtt): apply X | Y annotation instead of Optional, Union
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* refactor(webvtt): drop cue span classes, 'lang' and 'c' tags
Drop WebVTT formatting features not covered by Docling across formats.
Only 'u', 'b', 'i', and 'v' are supported and without classes.
Align with docling-core v2.62.0
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* build: pin docling-core 2.62.0
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
---------
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* feat: add support for Word document comments extraction (fixes #485)
Signed-off-by: s1v4-d <leelasaisivasubrahmanyamdurga@gmail.com>
* fix: address PR review feedback for comments extraction
- Change DocItemLabel.PARAGRAPH to TEXT (deprecating PARAGRAPH)
- Change initials format from '(initials)' to 'author: initials'
- Change timestamp format to include 'time:' prefix
- Update test assertions and regenerate ground truth files
Signed-off-by: s1v4-d <leelasaisivasubrahmanyamdurga@gmail.com>
* chore: update comment format and move format documentation from inline comment to function docstring
Signed-off-by: s1v4-d <leelasaisivasubrahmanyamdurga@gmail.com>
* Use docling-core v2.58.0 add_comment() API to properly link Word document
comments to their annotated text items via FineRef references.
- Import FineRef from docling_core.types.doc.document
- Refactor _add_comments to use doc.add_comment(targets=[...]) API
- Parse DOCX XML for commentRangeStart/End markers in _extract_comment_ranges
- Track paragraph-to-items mapping for comment linking
- Fallback to unlinked comments in COMMENT_SECTION group when no targets found
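Finding the `commentRangeStart` markers in a paragraph's XML can be sketched with the standard library. `comment_ids_in` is a hypothetical helper, not the backend's `_extract_comment_ranges`. The namespace URI is the standard WordprocessingML main namespace.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def comment_ids_in(paragraph_xml):
    """Return the ids of comment ranges that start inside one paragraph."""
    root = ET.fromstring(paragraph_xml)
    return [
        marker.get(f"{{{W_NS}}}id")
        for marker in root.iter(f"{{{W_NS}}}commentRangeStart")
    ]
```

The returned ids can then be matched against the entries of the document's comments part to decide which text items a comment should target.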
Signed-off-by: s1v4-d <leelasaisivasubrahmanyamdurga@gmail.com>
* - Extract comment IDs directly during paragraph element processing to match element IDs
- Clear paragraph mappings at start of each conversion for consistent behavior
- Always create comment groups and use add_comment() API with targets
- Add _get_comment_ids_for_element() helper to extract comment markers from XML
- Regenerate ground-truth files (JSON/MD/itxt) with comments field properly linked
Signed-off-by: s1v4-d <leelasaisivasubrahmanyamdurga@gmail.com>
* fix: remove incorrect ground-truth files, keep versions with comments field
Signed-off-by: s1v4-d <leelasaisivasubrahmanyamdurga@gmail.com>
* fix: reference comment groups instead of text items in comments field
Signed-off-by: s1v4-d <leelasaisivasubrahmanyamdurga@gmail.com>
---------
Signed-off-by: s1v4-d <leelasaisivasubrahmanyamdurga@gmail.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
* fix(md): handle pipe symbols that are not table markers
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* chore: update uv.lock with latest docling-core 2.60.2
Update uv.lock file with the latest release of docling-core (2.60.2).
Update (fix) ground truth files for testing markdown serialization to be
in sync with the serialization fix (issue 2880).
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
---------
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* fix missing content in mixed markdown
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* delete elements outside of iterate_items
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* no need for new test files
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* add new export without extra furniture title
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* add html options
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* use html options
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Fix handling of p elements that contain block-level elements anywhere inside, following what browsers do.
Fix wrong type annotations.
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* fix #2250: list items after numbered headers
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* add test for new case
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* chore(docx): remove unnecessary check
Remove 'current_parent is None' check in '_add_list_item' function since it
will always be None.
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* fix(docx): parse page headers and footers
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* chore(docx): rename _add_header to _add_heading
To avoid confusion, rename the _add_header function to _add_heading,
since the function adds section headings.
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* chore(docx): extend the page header and footer parsing to any content type
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* chore(docx): fix _add_header_footer function
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
---------
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* chore(docx): remove unnecessary import
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* fix(docx): simplify parsing of simple tables
Simplify the parsing of tables with just text (no rich cells).
Move nested function group_cell_elements out of _handle_tables for readability.
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* chore(docx): reuse method for finding inline pictures
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* chore(docx): format strikethrough text
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* tests(docx): use fixtures to avoid converting same file multiple times
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* fix(docx): remove unnecessary argument docx_obj in functions
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* tests(docx): add test for rich table cells
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* chore(docx): small improvements in backend and its unit tests
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* chore(docx): parse superscript and subscript formatted text
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
---------
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* fix(html): simplify parsing of simple table cells
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* tests(html): add test for rich table cells
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* fix(html): ensure table cells with formatted text are parsed as RichTableCell
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* refactor(html): simplify process_rich_table_cells since only rich cells are processed
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* fix(html): formatted cell runs should be parsed as text items respecting the order
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* chore: pin latest docling-core and update uv.lock
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* chore: upgrade dependencies on uv.lock
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
---------
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* fix: xlsx doc parsing, now returning values instead of formulas
Signed-off-by: glypt <8trash-can8@protonmail.ch>
* fix: add test for better coverage of xlsx backend
Signed-off-by: glypt <8trash-can8@protonmail.ch>
* fix: add the total of ducks as a formula in the tests/data
This also adds the test that the value 310 is contained in the table.
Without the fix from the previous commit, it would return "B7+C7"
Signed-off-by: glypt <8trash-can8@protonmail.ch>
---------
Signed-off-by: glypt <8trash-can8@protonmail.ch>