* fix(docx): handle missing chr attribute in groupChr OMML elements
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* fix(docx): escape spaces in OMML limit text for proper LaTeX rendering
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* fix(docx): fix inline equation reconstruction to prevent tag corruption
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* chore(docx): add type hints and docstrings to OMML module
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* fix(docx): fix genfrac formatting and eliminate grouping function warnings
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* fix(docx): handle unmapped characters in OMML % formatting
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
---------
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* fix(docx): handle unsupported limit functions gracefully in OMML conversion
Replace RuntimeError with graceful fallback for unknown limit functions in do_limlow().
Add argmax and argmin to LIM_FUNC dictionary for proper LaTeX rendering.
Fixes conversion failures when Word documents contain mathematical operators
not previously supported in the limit function dictionary.
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
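The fallback described above can be sketched as follows. This is a minimal illustration, not the actual do_limlow() code: the contents of LIM_FUNC, the helper name render_limit_function, and the choice of \operatorname* for unknown names are all assumptions.

```python
# Hypothetical mapping from OMML limit-function names to LaTeX templates.
LIM_FUNC = {
    "lim": r"\lim_{{{lim}}}",
    "max": r"\max_{{{lim}}}",
    "min": r"\min_{{{lim}}}",
    "argmax": r"\operatorname*{{argmax}}_{{{lim}}}",
    "argmin": r"\operatorname*{{argmin}}_{{{lim}}}",
}


def render_limit_function(name: str, lim: str) -> str:
    """Render a limit-style operator, falling back for unknown names."""
    template = LIM_FUNC.get(name)
    if template is None:
        # Unknown operator: emit a generic operator (assumed fallback)
        # instead of raising RuntimeError and aborting the conversion.
        return rf"\operatorname*{{{name}}}_{{{lim}}}"
    return template.format(lim=lim)


known = render_limit_function("argmax", "x")
unknown = render_limit_function("softmax", "i")
```

With this shape, adding a new operator is a one-line dictionary change, and an unmapped operator degrades to readable LaTeX rather than a failed conversion.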
* test(docx): regenerate ground truth files
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
---------
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* feat(docx): add checkbox parsing support to MsWordDocumentBackend
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* refactor(docx): remove duplicate code in text element handling
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* docs(docx): update checkbox method docstrings
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* refactor(docx): use self._BLIP_NAMESPACES for w14 namespace in checkbox methods
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
---------
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
feat(docx): Extract VML images with v:imagedata elements
Add VML image support with EMF/WMF conversion and consolidate image handler code.
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* fix(docx): handle inline formulas in list items
Fixes issue where inline formulas in list items were ignored during conversion.
Added helper methods to eliminate code duplication.
Updated test data with list items containing inline equations.
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* fix(docx): collect element refs in _add_inline_equations_to_parent
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
---------
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* fix(docx): isolate list state in table cells
Lists with the same numId in different table cells were incorrectly
merged. Added context manager to isolate list state during cell
processing. Includes test cases and updated ground truth files.
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
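A minimal sketch of the isolation idea, assuming list state is a dictionary of counters keyed by (numId, level). The class ListState and its methods are hypothetical stand-ins, not the backend's real attributes.

```python
from contextlib import contextmanager
from typing import Iterator


class ListState:
    """Hypothetical stand-in for the backend's list-tracking state."""

    def __init__(self) -> None:
        self.counters: dict[tuple[int, int], int] = {}

    @contextmanager
    def isolated(self) -> Iterator[None]:
        # Swap in a fresh state for the duration of one table cell,
        # then restore the outer state even if processing fails.
        saved = self.counters
        self.counters = {}
        try:
            yield
        finally:
            self.counters = saved

    def next_number(self, num_id: int, ilvl: int) -> int:
        key = (num_id, ilvl)
        self.counters[key] = self.counters.get(key, 0) + 1
        return self.counters[key]


state = ListState()
first_in_body = state.next_number(1, 0)
with state.isolated():
    first_in_cell = state.next_number(1, 0)  # same numId, fresh counter
second_in_body = state.next_number(1, 0)  # body numbering continues
```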
* style(docx): modernize type hints to use PEP 604 union syntax
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
---------
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* fix(omml): correct LaTeX output for fractions, math operators, and functions
Fixes three related bugs in OMML-to-LaTeX conversion:
A) Fraction raised to a power now produces correct grouping braces:
{\frac{(x-c)}{v}}^{2} instead of \frac{(x-c)}{v}^{2}
Adds dedicated do_ssub/do_ssup/do_ssubsup handlers that wrap
complex base expressions (fractions, radicals) in braces.
B) EN DASH (U+2013) and CIRCUMFLEX (U+005E) inside math runs are
now mapped to their math-mode equivalents (- and ^) instead of
being escaped as \text{\textendash} and \text{\textasciicircum}.
C) Adds missing standard math functions to the FUNC dict: log, ln,
exp, det, gcd, deg, hom, ker, dim, arg, inf, sup, lim, Pr.
These now emit proper LaTeX commands (e.g. \log) instead of
falling back to plain italic text.
Closes #3120
Signed-off-by: giulio-leone <giulio.leone@users.noreply.github.com>
Signed-off-by: giulio-leone <giulio97.leone@gmail.com>
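The grouping rule from fix A can be sketched as below. The helper names and the prefix check are assumptions for illustration; the real do_ssup handler may decide what counts as a complex base differently.

```python
def wrap_base_for_script(base: str) -> str:
    """Wrap a LaTeX base in braces when it is a fraction or a radical."""
    complex_prefixes = (r"\frac", r"\sqrt")
    if base.startswith(complex_prefixes):
        return "{" + base + "}"
    return base


def superscript(base: str, exponent: str) -> str:
    """Attach an exponent to a base, grouping complex bases first."""
    return f"{wrap_base_for_script(base)}^{{{exponent}}}"


fraction_power = superscript(r"\frac{(x-c)}{v}", "2")
simple_power = superscript("x", "2")
```

Without the extra braces, the exponent visually and semantically attaches to the wrong part of the expression, which is the failure the commit describes.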
* test(omml): add test documents for OMML-to-LaTeX conversion bugs
Add three minimal DOCX files exercising the fixed edge cases:
- omml_frac_superscript.docx: fraction as superscript base (Bug A)
- omml_text_escapes_in_math.docx: en-dash and caret in math runs (Bug B)
- omml_func_log.docx: log function recognition (Bug C)
Each file includes matching groundtruth (md, json, itxt).
Requested-by: @dolfim-ibm
Signed-off-by: Giulio Leone <giulioleone10@gmail.com>
Signed-off-by: giulio-leone <giulio.leone@users.noreply.github.com>
Signed-off-by: giulio-leone <giulio97.leone@gmail.com>
* fix(omml): avoid double-wrapping nested sub/sup containers
Signed-off-by: giulio-leone <giulio.leone@users.noreply.github.com>
Signed-off-by: giulio-leone <giulio97.leone@gmail.com>
* fix(omml): fix Bug B caret escape + use issue #3120 test documents
Bug B fix: prevent escape_latex from re-escaping characters that
process_unicode intentionally mapped to math operators. The caret
character U+005E inside <m:r><m:t> math runs was being converted
to ^ by _MATH_CHAR_MAP, then immediately re-escaped to \^ by
escape_latex. Now do_r restores math-mapped chars after escaping.
Result: x - y\^2 → x - y^2 (correct superscript)
Test documents: replace minimal programmatic fixtures (~1.2 KB)
with the real Word documents from issue #3120 reporter (smroels,
~37 KB each). Regenerate all groundtruth.
Signed-off-by: giulio-leone <giulio97.leone@gmail.com>
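A simplified sketch of the Bug B behavior. Note the swapped-in technique: the commit restores math-mapped characters after escaping, while this sketch skips escaping for mapped characters, which yields the same output for these inputs. The escape_latex stand-in and the map contents are assumptions.

```python
# Characters that should become math operators rather than escaped text.
_MATH_CHAR_MAP = {"\u2013": "-", "^": "^"}

# Simplified stand-in for a LaTeX text escaper (assumed subset).
_ESCAPES = {"^": r"\^", "%": r"\%", "#": r"\#"}


def escape_latex(text: str) -> str:
    return "".join(_ESCAPES.get(ch, ch) for ch in text)


def convert_math_run(text: str) -> str:
    """Map math characters first, and escape only the remaining ones."""
    out = []
    for ch in text:
        if ch in _MATH_CHAR_MAP:
            out.append(_MATH_CHAR_MAP[ch])  # intentionally not re-escaped
        else:
            out.append(escape_latex(ch))
    return "".join(out)


fixed = convert_math_run("x \u2013 y^2")
naive = escape_latex("x - y^2")
```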
* test: regenerate groundtruth for omml_text_escapes_in_math
Update .itxt to use proper indented-text export format (item hierarchy)
and refresh .json to match current converter output.
Signed-off-by: giulio-leone <giulio97.leone@gmail.com>
* test(omml): regenerate indented text snapshots
The OMML regression documents were exported into the .itxt fixtures using the
wrong format, so the real DOCX end-to-end check failed even though the rebased
converter output was correct.
Regenerate the two broken indented-text snapshots from the current branch so
the MS Word E2E test verifies the actual converter behavior.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* style(omml): apply ruff format normalization
Normalize the multiline condition in omml.py to match the repository
ruff-format output so the pre-commit gate stays clean on the refreshed
PR head.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: giulio-leone <giulio97.leone@gmail.com>
* DCO Remediation Commit for giulio-leone <giulio97.leone@gmail.com>
I, giulio-leone <giulio97.leone@gmail.com>, hereby add my Signed-off-by to this commit: 08001d9c5c
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: giulio-leone <giulio97.leone@gmail.com>
---------
Signed-off-by: giulio-leone <giulio.leone@users.noreply.github.com>
Signed-off-by: giulio-leone <giulio97.leone@gmail.com>
Signed-off-by: Giulio Leone <giulioleone10@gmail.com>
Co-authored-by: giulio-leone <giulio.leone@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* fix(msword): split multiple OMML equations into separate formula items
When a DOCX paragraph contains multiple sibling <m:oMath> elements
(e.g. separate equations on one line), the converter previously
concatenated them into a single LaTeX string because element.iter()
walks all descendants depth-first.
Fix: iterate direct children of the paragraph element first to
correctly identify sibling <m:oMath> elements, converting each
independently. Falls back to deep iteration only when oMath
elements are nested inside wrapper elements.
Also splits standalone multi-equation paragraphs into individual
FORMULA document items instead of merging them into one.
Closes #3121
Signed-off-by: giulio-leone <giulio.leone@users.noreply.github.com>
Signed-off-by: giulio-leone <giulio97.leone@gmail.com>
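The direct-children-first strategy can be sketched with the standard library XML parser. The function names and the sample paragraph are illustrative assumptions; the backend operates on its own element objects.

```python
import xml.etree.ElementTree as ET

M_NS = "http://schemas.openxmlformats.org/officeDocument/2006/math"
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
OMATH = f"{{{M_NS}}}oMath"


def find_equations(paragraph: ET.Element) -> list[ET.Element]:
    """Prefer sibling oMath children; fall back to a deep search."""
    direct = [child for child in paragraph if child.tag == OMATH]
    if direct:
        return direct
    # oMath elements are nested inside wrappers such as m:oMathPara.
    return list(paragraph.iter(OMATH))


def equation_text(omath: ET.Element) -> str:
    return "".join(omath.itertext())


xml = (
    f'<w:p xmlns:w="{W_NS}" xmlns:m="{M_NS}">'
    "<m:oMath><m:r><m:t>a=b</m:t></m:r></m:oMath>"
    "<w:r><w:t> and </w:t></w:r>"
    "<m:oMath><m:r><m:t>c=d</m:t></m:r></m:oMath>"
    "</w:p>"
)
paragraph = ET.fromstring(xml)
equations = [equation_text(e) for e in find_equations(paragraph)]
```

Each sibling equation is converted on its own here, so the separator text between them never ends up inside a formula.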
* test(msword): add multi-equation paragraph test document
Add a minimal DOCX file containing two separate oMath elements
in one paragraph with a text separator, along with groundtruth
output files for markdown, json, and plain text export.
Requested-by: @dolfim-ibm
Signed-off-by: Giulio Leone <giulioleone10@gmail.com>
Signed-off-by: giulio-leone <giulio.leone@users.noreply.github.com>
Signed-off-by: giulio-leone <giulio97.leone@gmail.com>
* test(msword): regenerate multi-equation indented-text snapshot
Signed-off-by: giulio-leone <giulio.leone@users.noreply.github.com>
Signed-off-by: giulio-leone <giulio97.leone@gmail.com>
* test: replace test doc with issue #3121 attachment
Use the real Word document from the issue reporter (smroels)
instead of the minimal programmatic fixture. The new document
contains three sibling <m:oMath> elements in one paragraph,
matching the exact failing shape described in #3121.
Regenerate groundtruth to match the richer document structure.
Signed-off-by: giulio-leone <giulio97.leone@gmail.com>
* test: regenerate groundtruth for omml_multi_equation_paragraph
Re-run document conversion with current code to update .itxt and .json
groundtruth files. The .itxt had stale structure from the previous
programmatic fixture; the new real-document conversion produces the
correct output with three separate formula items.
Signed-off-by: giulio-leone <giulio97.leone@gmail.com>
* style(docx): rerun ruff formatter for msword backend
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* refactor(docx): drop unused tag_name binding
Remove the unused local in the direct oMath iteration path so the code
reads clearly and the outstanding review comment is fully addressed
without changing equation-handling behavior.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: giulio-leone <giulio97.leone@gmail.com>
* DCO Remediation Commit for giulio-leone <giulio97.leone@gmail.com>
I, giulio-leone <giulio97.leone@gmail.com>, hereby add my Signed-off-by to this commit: 84cc70b55e
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: giulio-leone <giulio97.leone@gmail.com>
* test(docx): cover equation paragraph branches
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: giulio-leone <giulio97.leone@gmail.com>
* test(docx): reuse backend fixture in msword tests
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: giulio-leone <giulio97.leone@gmail.com>
---------
Signed-off-by: giulio-leone <giulio.leone@users.noreply.github.com>
Signed-off-by: giulio-leone <giulio97.leone@gmail.com>
Signed-off-by: Giulio Leone <giulioleone10@gmail.com>
Co-authored-by: giulio-leone <giulio.leone@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* fix(docx): Correct list numbering with interleaved numIds and hierarchical markers
Fixes incorrect numbering and missing items in DOCX documents that use
multiple interleaved numbering sequences (numIds).
Changes:
* Reset sub-level counters in _get_list_counter when a parent level
advances, preventing counter bleed-across (e.g. "4. Functional
Requirements" now correctly renders as "1. Functional Requirements")
* Add _build_enum_marker helper to produce hierarchical markers in
"1.2.3." format instead of flat single-level counters
* Fix anchor-based level calculation in new-sequence branch: use
level_at_new_list + ilevel instead of _get_level() to correctly
place items from a different numId at the right document level
* Only set level_at_new_list in the else case (when None) to avoid
corrupting the anchor when switching between interleaved numIds
* Remove _reset_list_counters_for_new_sequence from new-sequence branch
so that returning to a previously seen numId continues its counter
(e.g. Appendix A=1, B=2, C=3 instead of A=1, B=1, C=1)
Signed-off-by: Emre Çalışır <emrecalisir95@gmail.com>
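A minimal sketch of the counter reset and hierarchical marker logic listed above. It tracks a single numbering sequence only; the real helpers also key counters by numId, and the function names here are assumptions.

```python
def advance_counter(counters: dict[int, int], level: int) -> int:
    """Increment the counter at `level` and reset all deeper levels."""
    counters[level] = counters.get(level, 0) + 1
    for deeper in [lvl for lvl in counters if lvl > level]:
        del counters[deeper]  # prevents sub-level counters bleeding across
    return counters[level]


def build_enum_marker(counters: dict[int, int], level: int) -> str:
    """Build a hierarchical marker such as "1.2.3." up to `level`."""
    parts = [str(counters.get(lvl, 1)) for lvl in range(level + 1)]
    return ".".join(parts) + "."


counters: dict[int, int] = {}
markers = []
for level in (0, 1, 1, 0, 1):
    advance_counter(counters, level)
    markers.append(build_enum_marker(counters, level))
```

After the second top-level item, the sub-level restarts at 1, which is the reset the first bullet describes.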
* style(docx): apply Ruff formatting to msword_backend
Signed-off-by: Emre Çalışır <emrecalisir95@gmail.com>
* test(docx): add unit tests for list counter, enum marker, and sequence reset helpers
Adds test_list_counter_and_enum_marker covering helper methods introduced
in the list numbering fix: counter increment, sub-level reset on parent
advance, hierarchical marker building, and selective sequence reset.
Signed-off-by: Emre Çalışır <emrecalisir95@gmail.com>
* fix(docx): read start values from abstractNum for correct numbering
When Word creates a new numbering definition that continues from a
previous list, it embeds start values in the abstractNum XML instead of
reusing the same numId. Docling previously ignored these start values
and always initialized counters from 1, producing incorrect markers
like "1.1.1." instead of "2.3.1.".
Changes:
* Add _get_level_element helper to extract level XML from abstractNum,
eliminating duplicated XML traversal in _is_numbered_list
* Add _get_start_value to read w:start from the numbering definition
* Initialize counters in _get_list_counter using start values
* Use start values as fallback in _build_enum_marker for parent levels
that have not been explicitly incremented
Signed-off-by: Emre Çalışır <emrecalisir95@gmail.com>
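Reading w:start from a numbering definition might look like the sketch below. It parses a hand-written abstractNum fragment with the standard library; the function name and the default of 1 when w:start is absent are assumptions, not the signature of _get_start_value.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def get_start_value(abstract_num: ET.Element, ilvl: int, default: int = 1) -> int:
    """Return the w:start value for a level, or `default` if absent."""
    for lvl in abstract_num.findall(f"{{{W_NS}}}lvl"):
        if lvl.get(f"{{{W_NS}}}ilvl") == str(ilvl):
            start = lvl.find(f"{{{W_NS}}}start")
            if start is not None:
                return int(start.get(f"{{{W_NS}}}val", default))
    return default


abstract_num = ET.fromstring(
    f'<w:abstractNum xmlns:w="{W_NS}" w:abstractNumId="0">'
    '<w:lvl w:ilvl="0"><w:start w:val="2"/></w:lvl>'
    '<w:lvl w:ilvl="1"><w:start w:val="3"/></w:lvl>'
    '<w:lvl w:ilvl="2"/>'
    "</w:abstractNum>"
)
starts = [get_start_value(abstract_num, lvl) for lvl in range(3)]
```

With start values 2 and 3 on the first two levels, the first item at the third level would be marked "2.3.1." rather than "1.1.1.".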
* test(docx): add interleaved numId edge case to unit_test_headers_numbered
Extends the existing test document with an Appendix section that uses a
different numId, followed by list items that resume the original
numbering sequence with Word-embedded start values (e.g. 2.3.1.).
Updates groundtruth files accordingly.
Signed-off-by: Emre Çalışır <emrecalisir95@gmail.com>
* style(docx): add return type annotation to _get_level_element
Signed-off-by: Emre Çalışır <emrecalisir95@gmail.com>
---------
Signed-off-by: Emre Çalışır <emrecalisir95@gmail.com>
* fix: handle external image relationships in MsWordDocumentBackend
When a .docx file contains image relationships with TargetMode="External"
(common in documents saved from web browsers), accessing
`_Relationship.target_part` raises ValueError because external relationships
don't have a target part within the package.
Check `rel.is_external` before accessing `target_part`, emitting a
UserWarning with the external target URL and returning None so external
images fall through to the existing "image cannot be found" handling.
Includes test with ground truth files for a .docx with external image
references.
Fixes #3113
Signed-off-by: rongo-ms <127863751+rongo-ms@users.noreply.github.com>
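The guard can be sketched as follows. FakePart and FakeRelationship are hypothetical stand-ins that mimic the is_external, target_ref, and target_part attributes of a python-docx relationship so the example runs without the library; load_image_blob is likewise an illustrative name.

```python
import warnings
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakePart:
    blob: bytes


@dataclass
class FakeRelationship:
    """Stand-in: target_part raises for external targets, as described."""

    is_external: bool
    target_ref: str
    part: Optional[FakePart] = None

    @property
    def target_part(self) -> Optional[FakePart]:
        if self.is_external:
            raise ValueError("target_part is undefined for external targets")
        return self.part


def load_image_blob(rel: FakeRelationship) -> Optional[bytes]:
    """Return image bytes, or None (with a warning) for external images."""
    if rel.is_external:
        warnings.warn(f"Skipping external image: {rel.target_ref}", UserWarning)
        return None
    part = rel.target_part
    return part.blob if part is not None else None


internal = FakeRelationship(False, "media/image1.png", FakePart(b"\x89PNG"))
internal_blob = load_image_blob(internal)
```

Returning None lets the caller reuse its existing handling for images that cannot be found.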
* chore: upgrade dependencies in uv.lock file
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
---------
Signed-off-by: rongo-ms <127863751+rongo-ms@users.noreply.github.com>
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
fix(docx): create a new list group with a list item after a heading
When a list with a different 'numid' appears after a heading (marked as list item too), a
new list group needs to be created to avoid inconsistencies (list item under heading).
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* feat: add support for Word document comments extraction (fixes #485)
Signed-off-by: s1v4-d <leelasaisivasubrahmanyamdurga@gmail.com>
* fix: address PR review feedback for comments extraction
- Change DocItemLabel.PARAGRAPH to TEXT (deprecating PARAGRAPH)
- Change initials format from '(initials)' to 'author: initials'
- Change timestamp format to include 'time:' prefix
- Update test assertions and regenerate ground truth files
Signed-off-by: s1v4-d <leelasaisivasubrahmanyamdurga@gmail.com>
* chore: update comment format and move format documentation from inline comment to function docstring
Signed-off-by: s1v4-d <leelasaisivasubrahmanyamdurga@gmail.com>
* Use docling-core v2.58.0 add_comment() API to properly link Word document
comments to their annotated text items via FineRef references.
- Import FineRef from docling_core.types.doc.document
- Refactor _add_comments to use doc.add_comment(targets=[...]) API
- Parse DOCX XML for commentRangeStart/End markers in _extract_comment_ranges
- Track paragraph-to-items mapping for comment linking
- Fallback to unlinked comments in COMMENT_SECTION group when no targets found
Signed-off-by: s1v4-d <leelasaisivasubrahmanyamdurga@gmail.com>
* - Extract comment IDs directly during paragraph element processing to match element IDs
- Clear paragraph mappings at start of each conversion for consistent behavior
- Always create comment groups and use add_comment() API with targets
- Add _get_comment_ids_for_element() helper to extract comment markers from XML
- Regenerate ground-truth files (JSON/MD/itxt) with comments field properly linked
Signed-off-by: s1v4-d <leelasaisivasubrahmanyamdurga@gmail.com>
* fix: remove incorrect ground-truth files, keep versions with comments field
Signed-off-by: s1v4-d <leelasaisivasubrahmanyamdurga@gmail.com>
* fix: reference comment groups instead of text items in comments field
Signed-off-by: s1v4-d <leelasaisivasubrahmanyamdurga@gmail.com>
---------
Signed-off-by: s1v4-d <leelasaisivasubrahmanyamdurga@gmail.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
* fix #2250: list items after numbered headers
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* add test for new case
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* chore(docx): remove unnecessary check
Remove 'current_parent is None' check in '_add_list_item' function since it
will always be None.
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* fix(docx): parse page headers and footers
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* chore(docx): rename _add_header to _add_heading
To avoid confusion, rename the _add_header function to _add_heading,
since the function adds section headings.
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* chore(docx): extend the page header and footer parsing to any content type
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* chore(docx): fix _add_header_footer function
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
---------
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* chore(docx): remove unnecessary import
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* fix(docx): simplify parsing of simple tables
Simplify the parsing of tables with just text (no rich cells).
Move nested function group_cell_elements out of _handle_tables for readability.
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* chore(docx): reuse method for finding inline pictures
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* chore(docx): format strikethrough text
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* tests(docx): use fixtures to avoid converting same file multiple times
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* fix(docx): remove unnecessary argument docx_obj in functions
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* tests(docx): add test for rich table cells
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* chore(docx): small improvements in backend and its unit tests
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* chore(docx): parse superscript and subscript formatted text
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
---------
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* Export of DrawingML figures into docling document
* Adding libreoffice env var and libreoffice to checks image
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
* DCO Remediation Commit for Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
I, Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>, hereby add my Signed-off-by to this commit: 9518fffcad
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
* Enforcing apt-get update
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
* Only display drawingml warning once per document
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
* add util to test libreoffice and exclude files from test when not found
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* check libreoffice only once
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* Only initialise converter if needed
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
---------
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
* fix(msword_backend): Identify text in the same line after an image / image anchor #1425
Signed-off-by: Michael Krissgau <michael.krissgau@ibm.com>
* test: add test file and case for fix(msword_backend): Identify text in the same line after an image / image anchor #1425
Signed-off-by: Michael Krissgau <michael.krissgau@ibm.com>
* test: added groundtruth test files for fix(msword_backend): Identify text in the same line after an image / image anchor #1425
Signed-off-by: Michael Krissgau <michael.krissgau@ibm.com>
* fix: extraneous empty paragraphs for test files
Signed-off-by: Michael Krissgau <michael.krissgau@ibm.com>
---------
Signed-off-by: Michael Krissgau <michael.krissgau@ibm.com>
Co-authored-by: Michael Krissgau <michael.krissgau@ibm.com>
* fix(docx): merged cells not properly converted
Fix conversion issue of merged cells in Word tables leading to repeated text.
Simplify Word table conversion code.
Add docx file with several table formats for regression tests.
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
* chore: add type hinting to docx backend
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
---------
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
* fixes for referencing drawing blip in wordx
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Added safety try-except when trying to load pillow image from a docx blob. Added explicit dependency on lxml.
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Added test for word file with embedded emf images, re-generated full tests for docx, eased up dependency on lxml
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Updated lxml dependency version
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
---------
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
* Handling of single-cell tables in DOCX backend
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* returned try-catch on tables handling
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* cleaned
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* proceed processing the content of single cell table as if it's just part of the body
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Added example of tricky 1-cell table docx
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
---------
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>