29 Commits

Author SHA1 Message Date
Cesar Berrospi Ramis e00735dd59 fix(docx): fix OMML equation handling and improve type safety (#3381)
* fix(docx): handle missing chr attribute in groupChr OMML elements

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* fix(docx): escape spaces in OMML limit text for proper LaTeX rendering

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* fix(docx): fix inline equation reconstruction to prevent tag corruption

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* chore(docx): add type hints and docstrings to OMML module

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* fix(docx): fix genfrac formatting and eliminate grouping function warnings

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* fix(docx): handle unmapped characters in OMML % formatting

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

---------

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2026-05-04 10:58:25 +02:00
Cesar Berrospi Ramis 3df80e7f46 fix(docx): OMML conversion failures for unsupported limit functions (#3359)
* fix(docx): handle unsupported limit functions gracefully in OMML conversion

Replace RuntimeError with graceful fallback for unknown limit functions in do_limlow().
Add argmax and argmin to LIM_FUNC dictionary for proper LaTeX rendering.
Fixes conversion failures when Word documents contain mathematical operators
not previously supported in the limit function dictionary.
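
The fallback described above can be sketched roughly as follows. This is a minimal illustration, not the project's code: the contents of `LIM_FUNC`, the helper name `render_limit`, and the `\operatorname` fallback format are all assumptions.

```python
# Hypothetical stand-in for the LIM_FUNC dictionary named in the commit message;
# the entries and the fallback format are assumptions, not the project's code.
LIM_FUNC = {
    "lim": r"\lim_{{{lim}}}",
    "max": r"\max_{{{lim}}}",
    "min": r"\min_{{{lim}}}",
    "argmax": r"\operatorname{{argmax}}_{{{lim}}}",
    "argmin": r"\operatorname{{argmin}}_{{{lim}}}",
}


def render_limit(func_name: str, limit_text: str) -> str:
    """Render a lower-limit function, falling back instead of raising."""
    template = LIM_FUNC.get(func_name)
    if template is None:
        # Unknown operator: keep its name as an upright operator rather than fail.
        return rf"\operatorname{{{func_name}}}_{{{limit_text}}}"
    return template.format(lim=limit_text)
```

With this shape, an operator missing from the dictionary still yields usable LaTeX, so one unsupported function no longer aborts the whole conversion.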

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* test(docx): regenerate ground truth files

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

---------

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2026-04-28 14:43:24 +02:00
Cesar Berrospi Ramis c455a65e36 feat(docx): add checkbox parsing support (#3349)
* feat(docx): add checkbox parsing support to MsWordDocumentBackend

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* refactor(docx): remove duplicate code in text element handling

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* docs(docx): update checkbox method docstrings

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* refactor(docx): use self._BLIP_NAMESPACES for w14 namespace in checkbox methods

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

---------

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2026-04-28 14:38:43 +02:00
Cesar Berrospi Ramis 2ddaa3be97 feat(docx): extract VML images with v:imagedata elements (#3343)
feat(docx): Extract VML images with v:imagedata elements

Add VML image support with EMF/WMF conversion and consolidate image handler code.

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2026-04-22 08:46:36 +02:00
Cesar Berrospi Ramis c7615123e6 fix(docx): handle inline formulas in list items (#3304)
* fix(docx): handle inline formulas in list items

Fixes issue where inline formulas in list items were ignored during conversion.
Added helper methods to eliminate code duplication.
Updated test data with list items containing inline equations.

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* fix(docx): collect element refs in _add_inline_equations_to_parent

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

---------

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2026-04-17 07:33:20 +02:00
Cesar Berrospi Ramis 740c386730 fix(docx): isolate list state in table cells (#3294)
* fix(docx): isolate list state in table cells

Lists with the same numId in different table cells were incorrectly
merged. Added context manager to isolate list state during cell
processing. Includes test cases and updated ground truth files.
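
One way such isolation might look is sketched below. The `ListState` class and its attributes are hypothetical and only illustrate the save-and-restore pattern of a context manager; they are not the backend's actual data structures.

```python
from contextlib import contextmanager


class ListState:
    """Hypothetical holder for per-numId list bookkeeping (not docling's class)."""

    def __init__(self) -> None:
        self.counters: dict[tuple[int, int], int] = {}
        self.open_groups: dict[int, str] = {}

    @contextmanager
    def isolated(self):
        """Run a block with fresh list state, then restore the outer state."""
        saved = (self.counters, self.open_groups)
        self.counters, self.open_groups = {}, {}
        try:
            yield self
        finally:
            self.counters, self.open_groups = saved


state = ListState()
state.open_groups[7] = "body-list"
for cell in ("cell-A", "cell-B"):
    with state.isolated():
        # Both cells reuse numId 7, but each starts from a clean slate.
        assert 7 not in state.open_groups
        state.open_groups[7] = f"list-in-{cell}"
```

The `finally` clause restores the outer state even if cell processing raises, so a failing cell cannot leak its list bookkeeping into the rest of the document.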

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* style(docx): modernize type hints to use PEP 604 union syntax

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

---------

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2026-04-15 09:51:37 +02:00
Giulio Leone e36125ba2d fix(omml): correct LaTeX output for fractions, math operators, and functions (#3122)
* fix(omml): correct LaTeX output for fractions, math operators, and functions

Fixes three related bugs in OMML-to-LaTeX conversion:

A) Fraction raised to a power now produces correct grouping braces:
   {\frac{(x-c)}{v}}^{2} instead of \frac{(x-c)}{v}^{2}
   Adds dedicated do_ssub/do_ssup/do_ssubsup handlers that wrap
   complex base expressions (fractions, radicals) in braces.

B) EN DASH (U+2013) and CIRCUMFLEX (U+005E) inside math runs are
   now mapped to their math-mode equivalents (- and ^) instead of
   being escaped as \text{\textendash} and \text{\textasciicircum}.

C) Adds missing standard math functions to the FUNC dict: log, ln,
   exp, det, gcd, deg, hom, ker, dim, arg, inf, sup, lim, Pr.
   These now emit proper LaTeX commands (e.g. \log) instead of
   falling back to plain italic text.
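
The grouping rule from item A can be illustrated with a small sketch. The function name `superscript` and the prefix test for a "complex" base are assumptions made for this example; the actual handlers may decide differently.

```python
def superscript(base: str, exponent: str) -> str:
    """Attach an exponent, bracing bases that are themselves LaTeX commands."""
    # Assumption: a base starting with \frac or \sqrt counts as "complex".
    needs_braces = base.startswith(("\\frac", "\\sqrt"))
    if needs_braces:
        base = "{" + base + "}"
    return f"{base}^{{{exponent}}}"
```

For the fraction in item A, `superscript(r"\frac{(x-c)}{v}", "2")` gives `{\frac{(x-c)}{v}}^{2}`, while a simple base such as `x` stays unbraced.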

Closes #3120

Signed-off-by: giulio-leone <giulio.leone@users.noreply.github.com>
Signed-off-by: giulio-leone <giulio97.leone@gmail.com>

* test(omml): add test documents for OMML-to-LaTeX conversion bugs

Add three minimal DOCX files exercising the fixed edge cases:
- omml_frac_superscript.docx: fraction as superscript base (Bug A)
- omml_text_escapes_in_math.docx: en-dash and caret in math runs (Bug B)
- omml_func_log.docx: log function recognition (Bug C)

Each file includes matching groundtruth (md, json, itxt).

Requested-by: @dolfim-ibm
Signed-off-by: Giulio Leone <giulioleone10@gmail.com>
Signed-off-by: giulio-leone <giulio.leone@users.noreply.github.com>
Signed-off-by: giulio-leone <giulio97.leone@gmail.com>

* fix(omml): avoid double-wrapping nested sub/sup containers

Signed-off-by: giulio-leone <giulio.leone@users.noreply.github.com>
Signed-off-by: giulio-leone <giulio97.leone@gmail.com>

* fix(omml): fix Bug B caret escape + use issue #3120 test documents

Bug B fix: prevent escape_latex from re-escaping characters that
process_unicode intentionally mapped to math operators.  The caret
character U+005E inside <m:r><m:t> math runs was being converted
to ^ by _MATH_CHAR_MAP, then immediately re-escaped to \^ by
escape_latex.  Now do_r restores math-mapped chars after escaping.

Result: x - y\^2 → x - y^2 (correct superscript)
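
The interaction between mapping and escaping can be reproduced in miniature. Note that this sketch takes a simpler route than the commit describes: it skips escaping for mapped characters instead of restoring them afterwards. The table contents and the helper `convert_run` are assumptions.

```python
# Hypothetical miniature of the mapping and escaping steps; the names echo the
# commit message, the bodies are assumptions.
_MATH_CHAR_MAP = {"\u2013": "-", "\u005e": "^"}
_ESCAPES = {"^": r"\^", "%": r"\%", "&": r"\&"}


def escape_latex(text: str) -> str:
    """Escape characters that are special in LaTeX text."""
    return "".join(_ESCAPES.get(ch, ch) for ch in text)


def convert_run(text: str) -> str:
    """Escape a math run while keeping intentionally mapped operators intact."""
    out = []
    for ch in text:
        if ch in _MATH_CHAR_MAP:
            out.append(_MATH_CHAR_MAP[ch])  # already math-safe, skip escaping
        else:
            out.append(escape_latex(ch))
    return "".join(out)
```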

Test documents: replace minimal programmatic fixtures (~1.2 KB)
with the real Word documents from issue #3120 reporter (smroels,
~37 KB each).  Regenerate all groundtruth.

Signed-off-by: giulio-leone <giulio97.leone@gmail.com>

* test: regenerate groundtruth for omml_text_escapes_in_math

Update .itxt to use proper indented-text export format (item hierarchy)
and refresh .json to match current converter output.

Signed-off-by: giulio-leone <giulio97.leone@gmail.com>

* test(omml): regenerate indented text snapshots

The OMML regression documents were exported into the .itxt fixtures using the
wrong format, so the real DOCX end-to-end check failed even though the rebased
converter output was correct.

Regenerate the two broken indented-text snapshots from the current branch so
the MS Word E2E test verifies the actual converter behavior.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* style(omml): apply ruff format normalization

Normalize the multiline condition in omml.py to match the repository
ruff-format output so the pre-commit gate stays clean on the refreshed
PR head.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: giulio-leone <giulio97.leone@gmail.com>

* DCO Remediation Commit for giulio-leone <giulio97.leone@gmail.com>

I, giulio-leone <giulio97.leone@gmail.com>, hereby add my Signed-off-by to this commit: 08001d9c5c

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: giulio-leone <giulio97.leone@gmail.com>

---------

Signed-off-by: giulio-leone <giulio.leone@users.noreply.github.com>
Signed-off-by: giulio-leone <giulio97.leone@gmail.com>
Signed-off-by: Giulio Leone <giulioleone10@gmail.com>
Co-authored-by: giulio-leone <giulio.leone@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-25 07:07:31 +01:00
Giulio Leone 90d6dd4e87 fix(docx): split multiple OMML equations into separate formula items (#3123)
* fix(msword): split multiple OMML equations into separate formula items

When a DOCX paragraph contains multiple sibling <m:oMath> elements
(e.g. separate equations on one line), the converter previously
concatenated them into a single LaTeX string because element.iter()
walks all descendants depth-first.

Fix: iterate direct children of the paragraph element first to
correctly identify sibling <m:oMath> elements, converting each
independently. Falls back to deep iteration only when oMath
elements are nested inside wrapper elements.

Also splits standalone multi-equation paragraphs into individual
FORMULA document items instead of merging them into one.
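
The difference between walking descendants and walking direct children can be shown with the standard library alone. The paragraph below is a hand-written fixture, and `equation_elements` and `math_text` are illustrative helpers, not the backend's methods.

```python
import xml.etree.ElementTree as ET

M = "http://schemas.openxmlformats.org/officeDocument/2006/math"
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

paragraph = ET.fromstring(
    f'<w:p xmlns:w="{W}" xmlns:m="{M}">'
    "<m:oMath><m:r><m:t>a=b</m:t></m:r></m:oMath>"
    "<w:r><w:t>, </w:t></w:r>"
    "<m:oMath><m:r><m:t>c=d</m:t></m:r></m:oMath>"
    "</w:p>"
)

OMATH = f"{{{M}}}oMath"


def equation_elements(par):
    """Return oMath elements: direct children first, deep search as a fallback."""
    direct = [child for child in par if child.tag == OMATH]
    return direct or list(par.iter(OMATH))


def math_text(omath):
    """Concatenate the m:t text of a single equation."""
    return "".join(t.text or "" for t in omath.iter(f"{{{M}}}t"))


equations = [math_text(e) for e in equation_elements(paragraph)]
```

Because each sibling is converted on its own, the two equations stay separate instead of being merged into one string.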

Closes #3121

Signed-off-by: giulio-leone <giulio.leone@users.noreply.github.com>
Signed-off-by: giulio-leone <giulio97.leone@gmail.com>

* test(msword): add multi-equation paragraph test document

Add a minimal DOCX file containing two separate oMath elements
in one paragraph with a text separator, along with groundtruth
output files for markdown, json, and plain text export.

Requested-by: @dolfim-ibm
Signed-off-by: Giulio Leone <giulioleone10@gmail.com>
Signed-off-by: giulio-leone <giulio.leone@users.noreply.github.com>
Signed-off-by: giulio-leone <giulio97.leone@gmail.com>

* test(msword): regenerate multi-equation indented-text snapshot

Signed-off-by: giulio-leone <giulio.leone@users.noreply.github.com>
Signed-off-by: giulio-leone <giulio97.leone@gmail.com>

* test: replace test doc with issue #3121 attachment

Use the real Word document from the issue reporter (smroels)
instead of the minimal programmatic fixture. The new document
contains three sibling <m:oMath> elements in one paragraph,
matching the exact failing shape described in #3121.

Regenerate groundtruth to match the richer document structure.

Signed-off-by: giulio-leone <giulio97.leone@gmail.com>

* test: regenerate groundtruth for omml_multi_equation_paragraph

Re-run document conversion with current code to update .itxt and .json
groundtruth files. The .itxt had stale structure from the previous
programmatic fixture; the new real-document conversion produces the
correct output with three separate formula items.

Signed-off-by: giulio-leone <giulio97.leone@gmail.com>

* style(docx): rerun ruff formatter for msword backend

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* refactor(docx): drop unused tag_name binding

Remove the unused local in the direct oMath iteration path so the code
reads clearly and the outstanding review comment is fully addressed
without changing equation-handling behavior.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: giulio-leone <giulio97.leone@gmail.com>

* DCO Remediation Commit for giulio-leone <giulio97.leone@gmail.com>

I, giulio-leone <giulio97.leone@gmail.com>, hereby add my Signed-off-by to this commit: 84cc70b55e

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: giulio-leone <giulio97.leone@gmail.com>

* test(docx): cover equation paragraph branches

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: giulio-leone <giulio97.leone@gmail.com>

* test(docx): reuse backend fixture in msword tests

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: giulio-leone <giulio97.leone@gmail.com>

---------

Signed-off-by: giulio-leone <giulio.leone@users.noreply.github.com>
Signed-off-by: giulio-leone <giulio97.leone@gmail.com>
Signed-off-by: Giulio Leone <giulioleone10@gmail.com>
Co-authored-by: giulio-leone <giulio.leone@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-24 09:42:16 +01:00
Emre Çalışır 2f7c09e0d8 fix(docx): Missing list items after numbered header (#2665) (#2678)
* fix(docx): Correct list numbering with interleaved numIds and hierarchical markers

  Fixes incorrect numbering and missing items in DOCX documents that use
  multiple interleaved numbering sequences (numIds).

  Changes:
  * Reset sub-level counters in _get_list_counter when a parent level
    advances, preventing counter bleed-across (e.g. "4. Functional
    Requirements" now correctly renders as "1. Functional Requirements")
  * Add _build_enum_marker helper to produce hierarchical markers in
    "1.2.3." format instead of flat single-level counters
  * Fix anchor-based level calculation in new-sequence branch: use
    level_at_new_list + ilevel instead of _get_level() to correctly
    place items from a different numId at the right document level
  * Only set level_at_new_list in the else case (when None) to avoid
    corrupting the anchor when switching between interleaved numIds
  * Remove _reset_list_counters_for_new_sequence from new-sequence branch
    so that returning to a previously seen numId continues its counter
    (e.g. Appendix A=1, B=2, C=3 instead of A=1, B=1, C=1)
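
The counter behaviour listed above can be sketched as follows. `ListCounters` is a hypothetical class written for this illustration; it only mirrors the ideas of resetting deeper levels, building hierarchical markers, and keeping counters per numId.

```python
class ListCounters:
    """Hypothetical per-(numId, level) counters; not the backend's actual helper."""

    def __init__(self) -> None:
        self._counters: dict[tuple[int, int], int] = {}

    def advance(self, num_id: int, ilevel: int) -> int:
        """Increment one level and reset every deeper level of the same numId."""
        for key in list(self._counters):
            if key[0] == num_id and key[1] > ilevel:
                del self._counters[key]
        self._counters[(num_id, ilevel)] = self._counters.get((num_id, ilevel), 0) + 1
        return self._counters[(num_id, ilevel)]

    def marker(self, num_id: int, ilevel: int) -> str:
        """Build a hierarchical marker such as "1.2.3." from levels 0..ilevel."""
        parts = [str(self._counters.get((num_id, lvl), 1)) for lvl in range(ilevel + 1)]
        return ".".join(parts) + "."
```

Since counters are keyed by numId and never cleared when another numId appears, returning to an earlier sequence continues its count rather than restarting at 1.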

Signed-off-by: Emre Çalışır <emrecalisir95@gmail.com>

* style(docx): apply Ruff formatting to msword_backend

Signed-off-by: Emre Çalışır <emrecalisir95@gmail.com>

* test(docx): add unit tests for list counter, enum marker, and sequence reset helpers

  Adds test_list_counter_and_enum_marker covering helper methods introduced
  in the list numbering fix: counter increment, sub-level reset on parent
  advance, hierarchical marker building, and selective sequence reset.

Signed-off-by: Emre Çalışır <emrecalisir95@gmail.com>

* fix(docx): read start values from abstractNum for correct numbering

    When Word creates a new numbering definition that continues from a
    previous list, it embeds start values in the abstractNum XML instead of
    reusing the same numId. Docling previously ignored these start values
    and always initialized counters from 1, producing incorrect markers
    like "1.1.1." instead of "2.3.1.".

    Changes:
    * Add _get_level_element helper to extract level XML from abstractNum,
      eliminating duplicated XML traversal in _is_numbered_list
    * Add _get_start_value to read w:start from the numbering definition
    * Initialize counters in _get_list_counter using start values
    * Use start values as fallback in _build_enum_marker for parent levels
      that have not been explicitly incremented
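
Reading a start value from a numbering definition might look like the sketch below. The XML is a hand-written fixture, `start_value` is an illustrative helper, and the default of 1 for a missing `w:start` is an assumption that follows the previous behaviour described in this commit.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

numbering = ET.fromstring(
    f'<w:numbering xmlns:w="{W}">'
    '<w:abstractNum w:abstractNumId="4">'
    '<w:lvl w:ilvl="0"><w:start w:val="2"/></w:lvl>'
    '<w:lvl w:ilvl="1"><w:start w:val="3"/></w:lvl>'
    '<w:lvl w:ilvl="2"/>'
    "</w:abstractNum>"
    "</w:numbering>"
)


def start_value(root, abstract_id: str, ilevel: int) -> int:
    """Return w:start for a level, defaulting to 1 (an assumption) when absent."""
    for abstract in root.iter(f"{{{W}}}abstractNum"):
        if abstract.get(f"{{{W}}}abstractNumId") != abstract_id:
            continue
        for lvl in abstract.iter(f"{{{W}}}lvl"):
            if lvl.get(f"{{{W}}}ilvl") == str(ilevel):
                start = lvl.find(f"{{{W}}}start")
                if start is not None:
                    return int(start.get(f"{{{W}}}val", "1"))
    return 1
```

Seeding levels 0 and 1 with 2 and 3 is what lets the first item at level 2 render as "2.3.1." instead of "1.1.1.".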

Signed-off-by: Emre Çalışır <emrecalisir95@gmail.com>

* test(docx): add interleaved numId edge case to unit_test_headers_numbered

    Extends the existing test document with an Appendix section that uses a
    different numId, followed by list items that resume the original
    numbering sequence with Word-embedded start values (e.g. 2.3.1.).
    Updates groundtruth files accordingly.

Signed-off-by: Emre Çalışır <emrecalisir95@gmail.com>

* style(docx): add return type annotation to _get_level_element

Signed-off-by: Emre Çalışır <emrecalisir95@gmail.com>

---------

Signed-off-by: Emre Çalışır <emrecalisir95@gmail.com>
2026-03-20 21:24:27 +01:00
Ron 8ae0974a9d fix: handle external image relationships in MsWordDocumentBackend (#3114)
* fix: handle external image relationships in MsWordDocumentBackend

When a .docx file contains image relationships with TargetMode="External"
(common in documents saved from web browsers), accessing
`_Relationship.target_part` raises ValueError because external relationships
don't have a target part within the package.

Check `rel.is_external` before accessing `target_part`, emitting a
UserWarning with the external target URL and returning None so external
images fall through to the existing "image cannot be found" handling.
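
The guard can be sketched without a real package by using a stand-in object. `FakeRelationship` is hypothetical and exposes only the attributes mentioned above (plus `target_ref`, assumed here to hold the target URL); `image_bytes` is an illustrative helper, not the backend's method.

```python
import warnings
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeRelationship:
    """Stand-in for a package relationship; only the attributes used below."""

    is_external: bool
    target_ref: str
    blob: Optional[bytes] = None

    @property
    def target_part(self):
        if self.is_external:
            raise ValueError("external relationships have no target part")
        return self


def image_bytes(rel) -> Optional[bytes]:
    """Return embedded image data, or None (with a warning) for external targets."""
    if rel.is_external:
        warnings.warn(f"Skipping external image: {rel.target_ref}", UserWarning)
        return None
    return rel.target_part.blob
```

Checking the flag first means `target_part` is never touched for external targets, so the `ValueError` cannot occur on that path.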

Includes test with ground truth files for a .docx with external image
references.

Fixes #3113

Signed-off-by: rongo-ms <127863751+rongo-ms@users.noreply.github.com>

* chore: upgrade dependencies in uv.lock file

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

---------

Signed-off-by: rongo-ms <127863751+rongo-ms@users.noreply.github.com>
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2026-03-19 14:22:21 +01:00
Cesar Berrospi Ramis 56eb12782c fix(docx): handle list items immediately after numbered headings (#3070)
fix(docx): create a new list group with a list item after a heading

When a list with a different 'numId' appears after a heading that is itself marked as a list item, a
new list group must be created to avoid inconsistencies (such as a list item nested under the heading).

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2026-03-06 09:30:48 +01:00
Siva b6ca094519 feat: add support for Word document comments extraction (#2834)
* feat: add support for Word document comments extraction (fixes #485)

Signed-off-by: s1v4-d <leelasaisivasubrahmanyamdurga@gmail.com>

* fix: address PR review feedback for comments extraction

- Change DocItemLabel.PARAGRAPH to TEXT (deprecating PARAGRAPH)
- Change initials format from '(initials)' to 'author: initials'
- Change timestamp format to include 'time:' prefix
- Update test assertions and regenerate ground truth files

Signed-off-by: s1v4-d <leelasaisivasubrahmanyamdurga@gmail.com>

* chore: update comment format and move format documentation from inline comment to function docstring

Signed-off-by: s1v4-d <leelasaisivasubrahmanyamdurga@gmail.com>

* Use docling-core v2.58.0 add_comment() API to properly link Word document
comments to their annotated text items via FineRef references.

- Import FineRef from docling_core.types.doc.document
- Refactor _add_comments to use doc.add_comment(targets=[...]) API
- Parse DOCX XML for commentRangeStart/End markers in _extract_comment_ranges
- Track paragraph-to-items mapping for comment linking
- Fallback to unlinked comments in COMMENT_SECTION group when no targets found
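
Locating the paragraphs covered by a comment range could be sketched like this. The XML body is a hand-written fixture and `comment_ranges` is a hypothetical helper; the real `_extract_comment_ranges` may track ranges differently.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

body = ET.fromstring(
    f'<w:body xmlns:w="{W}">'
    '<w:p><w:commentRangeStart w:id="0"/><w:r><w:t>first</w:t></w:r></w:p>'
    '<w:p><w:r><w:t>second</w:t></w:r><w:commentRangeEnd w:id="0"/></w:p>'
    "<w:p><w:r><w:t>third</w:t></w:r></w:p>"
    "</w:body>"
)


def comment_ranges(root) -> dict[str, list[int]]:
    """Map each comment id to the indices of the paragraphs it spans."""
    ranges: dict[str, list[int]] = {}
    open_ids: set[str] = set()
    for index, par in enumerate(root.iter(f"{{{W}}}p")):
        touched = set(open_ids)  # ranges still open from earlier paragraphs
        for node in par.iter():
            if node.tag == f"{{{W}}}commentRangeStart":
                open_ids.add(node.get(f"{{{W}}}id"))
                touched.add(node.get(f"{{{W}}}id"))
            elif node.tag == f"{{{W}}}commentRangeEnd":
                open_ids.discard(node.get(f"{{{W}}}id"))
        for comment_id in touched:
            ranges.setdefault(comment_id, []).append(index)
    return ranges
```

A comment with no entry in the returned mapping has no identifiable target, which corresponds to the unlinked fallback case in the list above.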

Signed-off-by: s1v4-d <leelasaisivasubrahmanyamdurga@gmail.com>

* - Extract comment IDs directly during paragraph element processing to match element IDs
- Clear paragraph mappings at start of each conversion for consistent behavior
- Always create comment groups and use add_comment() API with targets
- Add _get_comment_ids_for_element() helper to extract comment markers from XML
- Regenerate ground-truth files (JSON/MD/itxt) with comments field properly linked

Signed-off-by: s1v4-d <leelasaisivasubrahmanyamdurga@gmail.com>

* fix: remove incorrect ground-truth files, keep versions with comments field

Signed-off-by: s1v4-d <leelasaisivasubrahmanyamdurga@gmail.com>

* fix: reference comment groups instead of text items in comments field

Signed-off-by: s1v4-d <leelasaisivasubrahmanyamdurga@gmail.com>

---------

Signed-off-by: s1v4-d <leelasaisivasubrahmanyamdurga@gmail.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2026-01-26 09:58:46 +01:00
Cesar Berrospi Ramis 5c1f8f0171 fix(docx): handle grouped pictures (#2861)
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2026-01-09 09:42:52 +01:00
Cesar Berrospi Ramis c97715f5fd fix(docx): parse integrals as n-ary objects without chr element (#2712)
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2025-12-03 11:25:52 +01:00
Michele Dolfi e58055465c fix(docx): Missing list items after numbered header (#2665)
* fix #2250. list items after numbered headers

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add test for new case

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* chore(docx): remove unnecessary check

Remove 'current_parent is None' check in '_add_list_item' function since it
will always be None.

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2025-11-24 08:49:21 +01:00
Cesar Berrospi Ramis 054c4a634d fix(docx): parse page headers and footers (#2599)
* fix(docx): parse page headers and footers

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* chore(docx): rename _add_header to _add_heading

To avoid confusion, rename the _add_header function to _add_heading,
since the function adds section headings.

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* chore(docx): extend the page header and footer parsing to any content type

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* chore(docx): fix _add_header_footer function

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

---------

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2025-11-10 16:10:12 +01:00
Cesar Berrospi Ramis ef623ffcee fix(docx): slow table parsing (#2553)
* chore(docx): remove unnecessary import

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* fix(docx): simplify parsing of simple tables

Simplify the parsing of tables with just text (no rich cells).
Move nested function group_cell_elements out of _handle_tables for readability.

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* chore(docx): reuse method for finding inline pictures

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* chore(docx): format strikethrough text

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* tests(docx): use fixtures to avoid converting same file multiple times

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* fix(docx): remove unnecessary argument docx_obj in functions

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* tests(docx): add test for rich table cells

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* chore(docx): small improvements in backend and its unit tests

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* chore(docx): parse superscript and subscript formatted text

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

---------

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2025-11-06 05:25:53 +01:00
Rafael Teixeira de Lima 16829939cf feat(docx): Process drawingml objects in docx (#2453)
* Export of DrawingML figures into docling document

* Adding libreoffice env var and libreoffice to checks image

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* DCO Remediation Commit for Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

I, Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>, hereby add my Signed-off-by to this commit: 9518fffcad

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* Enforcing apt get update

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* Only display drawingml warning once per document

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* add util to test libreoffice and exclude files from test when not found

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* check libreoffice only once

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* Only initialise converter if needed

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

---------

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2025-10-15 10:58:08 +02:00
Rafael Teixeira de Lima 0b83609531 fix(docx): Adding plain latex equations to table cells (#1986)
* Adding plain latex equations to table cells

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* Adding test files

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

---------

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
2025-07-24 11:02:24 +02:00
mkrssg 1350a8d3e5 fix(msword_backend): Identify text in the same line after an image #1425 (#1610)
* fix(msword_backend): Identify text in the same line after an image / image anchor #1425

Signed-off-by: Michael Krissgau <michael.krissgau@ibm.com>

* test: add test file and case for fix(msword_backend): Identify text in the same line after an image / image anchor #1425

Signed-off-by: Michael Krissgau <michael.krissgau@ibm.com>

* test: added groundtruth test files for fix(msword_backend): Identify text in the same line after an image / image anchor #1425

Signed-off-by: Michael Krissgau <michael.krissgau@ibm.com>

* fix: extraneous empty paragraphs for test files

Signed-off-by: Michael Krissgau <michael.krissgau@ibm.com>

---------

Signed-off-by: Michael Krissgau <michael.krissgau@ibm.com>
Co-authored-by: Michael Krissgau <michael.krissgau@ibm.com>
2025-06-20 10:55:30 +02:00
Panos Vagenas 61d0d6c755 test: mark flaky test (#1698)
* test: cleanse Word test file

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* mark textbox file test as flaky

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* fix path usage

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

---------

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2025-06-03 13:13:44 +02:00
AndrewTsai0406 12a0e64892 feat: add textbox content extraction in msword_backend (#1538)
* feat: add textbox content extraction in msword_backend

Signed-off-by: Andrew <tsai247365@gmail.com>

* feat: add textbox content extraction in msword_backend

Signed-off-by: Andrew <tsai247365@gmail.com>

* feat: add textbox content extraction in msword_backend

Signed-off-by: Andrew <tsai247365@gmail.com>

---------

Signed-off-by: Andrew <tsai247365@gmail.com>
2025-05-19 15:01:36 +02:00
Simon Jégou bfcab3d677 feat(docx): add text formatting and hyperlink support (#630)
* feat: Enable markdown text formatting for docx

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Fix imports

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Use Formatting

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Handle hyperlink

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Handle formatting properly for DocItemLabel.PARAGRAPH

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Use inline group

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Handle bullet lists

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Strip elements

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Strip elements

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Run black and mypy

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Handle header and footer

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Use inline_fmt everywhere

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Run precommit

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Address feedback

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Fix add_list_item

Signed-off-by: SimJeg <sjegou@nvidia.com>

* fix minor bugs, mark helper methods internal

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

---------

Signed-off-by: SimJeg <sjegou@nvidia.com>
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
Co-authored-by: Panos Vagenas <pva@zurich.ibm.com>
2025-04-03 15:11:50 +02:00
Rafael Teixeira de Lima 6eb718f849 feat: equations to latex in MSWord backend (with inline groups) (#1114)
* Equation groups

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* fix: Proper handling of orphan IDs in layout postprocessing (#1118)

* Fix the handling of orphan IDs in layout postprocessing

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update test cases

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* chore: bump version to 2.25.2 [skip ci]

* docs: add description of DOCLING_ARTIFACTS_PATH env var (#1124)

add env var in docs

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* fix(CLI): fix help message for abort options (#1130)

fix help message

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* perf: New revision code formula model and document picture classifier (#1140)

* new version code formula model

Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>

* new version document picture classifier

Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>

* new code formula model

Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>

* restored original code formula test pdf

Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>

---------

Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>
Co-authored-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* feat: Use new TableFormer model weights and default to accurate model version (#1100)

* feat: New tableformer model weights [WIP]

Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>

* Updated TF version

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Updated tests, after merging with Main, Switched to Accurate TF model by default

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* chore: bump version to 2.26.0 [skip ci]

* fix: Pass tests, update docling-core to 2.22.0 (#1150)

fix: update docling-core to 2.22.0

Update dependency library docling-core to latest release 2.22.0
Fix regression tests and ground truth files

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

* Updating content hash

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

---------

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>
Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
Co-authored-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
Co-authored-by: Matteo <43417658+Matteo-Omenetti@users.noreply.github.com>
Co-authored-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
2025-03-13 15:12:22 +01:00
Cesar Berrospi Ramis 0cd81a8122 fix(docx): merged table cells not properly converted (#857)
* fix(docx): merged cells not properly converted

Fix conversion issue of merged cells in Word tables leading to repeated text.
Simplify Word table conversion code.
Add docx file with several table formats for regression tests.

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

* chore: add type hinting to docx backend

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

---------

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
2025-02-03 10:20:03 +01:00
Maxim Lysak 2c037ae62e fix: Fixed docx import with headers that are also lists (#842)
* Fix for docx when headers are also lists, now recorded as appropriate headers and subheaders, unit test included

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Update docling/backend/msword_backend.py

Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
Signed-off-by: Maxim Lysak <101627549+maxmnemonic@users.noreply.github.com>

* Update docling/backend/msword_backend.py

Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
Signed-off-by: Maxim Lysak <101627549+maxmnemonic@users.noreply.github.com>

---------

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Signed-off-by: Maxim Lysak <101627549+maxmnemonic@users.noreply.github.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
2025-01-31 10:51:21 +01:00
Maxim Lysak d0a1180478 fix: Fixes for wordx (#432)
* fixes for referencing drawing blip in wordx

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Added safety try-except when trying to load pillow image from a docx blob. Added explicit dependency on lxml.

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Added test for word file with embedded emf images, re-generated full tests for docx, eased up dependency on lxml

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Updated lxml dependency version

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

---------

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
2024-11-26 14:44:43 +01:00
Maxim Lysak fb8ba861e2 fix: Handling of single-cell tables in DOCX backend (#314)
* Handling of single-cell tables in DOCX backend

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* returned try-catch on tables handling

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* cleaned

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* proceed processing the content of single cell table as if it's just part of the body

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Added example of tricky 1 cell table docx

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

---------

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
2024-11-12 15:20:55 +01:00
Peter W. J. Staar f542460af3 fix: fix duplicate title and heading + add e2e tests for html and docx (#186)
* add real e2e tests for html and docx

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* updated the output of itxt

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* reformatted the text

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* fixed the tests

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* fixed the tests (2)

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* fixed the examples (1)

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* fixed the output of the test

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* updated the tests, moved the ground-truth

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* moved the ground-truth data

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* fixed the html tests

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* restructure title fix (#187)

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>

---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-10-30 13:14:56 +01:00