docling

mirror of https://github.com/docling-project/docling.git synced 2026-05-17 13:10:38 +00:00

Author	SHA1	Message	Date
Cesar Berrospi Ramis	e00735dd59	fix(docx): fix OMML equation handling and improve type safety (#3381 ) * fix(docx): handle missing chr attribute in groupChr OMML elements Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> * fix(docx): escape spaces in OMML limit text for proper LaTeX rendering Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> * fix(docx): fix inline equation reconstruction to prevent tag corruption Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> * chore(docx): add type hints and docstrings to OMML module Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> * fix(docx): fix genfrac formatting and eliminate grouping function warnings Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> * fix(docx): handle unmapped characters in OMML % formatting Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> --------- Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>	2026-05-04 10:58:25 +02:00
Cesar Berrospi Ramis	3df80e7f46	fix(docx): OMML conversion failures for unsupported limit functions (#3359 ) * fix(docx): handle unsupported limit functions gracefully in OMML conversion Replace RuntimeError with graceful fallback for unknown limit functions in do_limlow(). Add argmax and argmin to LIM_FUNC dictionary for proper LaTeX rendering. Fixes conversion failures when Word documents contain mathematical operators not previously supported in the limit function dictionary. Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> * test(docx): regenerate ground truth files Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> --------- Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>	2026-04-28 14:43:24 +02:00
Cesar Berrospi Ramis	c455a65e36	feat(docx): add checkbox parsing support (#3349 ) * feat(docx): add checkbox parsing support to MsWordDocumentBackend Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> * refactor(docx): remove duplicate code in text element handling Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> * docs(docx): update checkbox method docstrings Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> * refactor(docx): use self._BLIP_NAMESPACES for w14 namespace in checkbox methods Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> --------- Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>	2026-04-28 14:38:43 +02:00
Cesar Berrospi Ramis	2ddaa3be97	feat(docx): extract VML images with v:imagedata elements (#3343) feat(docx): Extract VML images with v:imagedata elements Add VML image support with EMF/WMF conversion and consolidate image handler code. Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>	2026-04-22 08:46:36 +02:00
Cesar Berrospi Ramis	c7615123e6	fix(docx): handle inline formulas in list items (#3304 ) * fix(docx) Handle inline formulas in list items Fixes issue where inline formulas in list items were ignored during conversion. Added helper methods to eliminate code duplication. Updated test data with list items containing inline equations. Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> * fix(docx): collect element refs in _add_inline_equations_to_parent Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> --------- Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>	2026-04-17 07:33:20 +02:00
Cesar Berrospi Ramis	740c386730	fix(docx): isolate list state in table cells (#3294 ) * fix(docx): isolate list state in table cells Lists with the same numId in different table cells were incorrectly merged. Added context manager to isolate list state during cell processing. Includes test cases and updated ground truth files. Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> * style(docx): modernize type hints to use PEP 604 union syntax Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> --------- Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>	2026-04-15 09:51:37 +02:00
Giulio Leone	e36125ba2d	fix(omml): correct LaTeX output for fractions, math operators, and functions (#3122 ) * fix(omml): correct LaTeX output for fractions, math operators, and functions Fixes three related bugs in OMML-to-LaTeX conversion: A) Fraction raised to a power now produces correct grouping braces: {\frac{(x-c)}{v}}^{2} instead of \frac{(x-c)}{v}^{2} Adds dedicated do_ssub/do_ssup/do_ssubsup handlers that wrap complex base expressions (fractions, radicals) in braces. B) EN DASH (U+2013) and CIRCUMFLEX (U+005E) inside math runs are now mapped to their math-mode equivalents (- and ^) instead of being escaped as \text{\textendash} and \text{\textasciicircum}. C) Adds missing standard math functions to the FUNC dict: log, ln, exp, det, gcd, deg, hom, ker, dim, arg, inf, sup, lim, Pr. These now emit proper LaTeX commands (e.g. \log) instead of falling back to plain italic text. Closes #3120 Signed-off-by: giulio-leone <giulio.leone@users.noreply.github.com> Signed-off-by: giulio-leone <giulio97.leone@gmail.com> * test(omml): add test documents for OMML-to-LaTeX conversion bugs Add three minimal DOCX files exercising the fixed edge cases: - omml_frac_superscript.docx: fraction as superscript base (Bug A) - omml_text_escapes_in_math.docx: en-dash and caret in math runs (Bug B) - omml_func_log.docx: log function recognition (Bug C) Each file includes matching groundtruth (md, json, itxt). Requested-by: @dolfim-ibm Signed-off-by: Giulio Leone <giulioleone10@gmail.com> Signed-off-by: giulio-leone <giulio.leone@users.noreply.github.com> Signed-off-by: giulio-leone <giulio97.leone@gmail.com> * fix(omml): avoid double-wrapping nested sub/sup containers Signed-off-by: giulio-leone <giulio.leone@users.noreply.github.com> Signed-off-by: giulio-leone <giulio97.leone@gmail.com> * fix(omml): fix Bug B caret escape + use issue #3120 test documents Bug B fix: prevent escape_latex from re-escaping characters that process_unicode intentionally mapped to math operators. 
The caret character U+005E inside <m:r><m:t> math runs was being converted to ^ by _MATH_CHAR_MAP, then immediately re-escaped to \^ by escape_latex. Now do_r restores math-mapped chars after escaping. Result: x - y\^2 → x - y^2 (correct superscript) Test documents: replace minimal programmatic fixtures (~1.2 KB) with the real Word documents from issue #3120 reporter (smroels, ~37 KB each). Regenerate all groundtruth. Signed-off-by: giulio-leone <giulio97.leone@gmail.com> * test: regenerate groundtruth for omml_text_escapes_in_math Update .itxt to use proper indented-text export format (item hierarchy) and refresh .json to match current converter output. Signed-off-by: giulio-leone <giulio97.leone@gmail.com> * test(omml): regenerate indented text snapshots The OMML regression documents were exported into the .itxt fixtures using the wrong format, so the real DOCX end-to-end check failed even though the rebased converter output was correct. Regenerate the two broken indented-text snapshots from the current branch so the MS Word E2E test verifies the actual converter behavior. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * style(omml): apply ruff format normalization Normalize the multiline condition in omml.py to match the repository ruff-format output so the pre-commit gate stays clean on the refreshed PR head. 
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: giulio-leone <giulio97.leone@gmail.com> * DCO Remediation Commit for giulio-leone <giulio97.leone@gmail.com> I, giulio-leone <giulio97.leone@gmail.com>, hereby add my Signed-off-by to this commit: `08001d9c5c` Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: giulio-leone <giulio97.leone@gmail.com> --------- Signed-off-by: giulio-leone <giulio.leone@users.noreply.github.com> Signed-off-by: giulio-leone <giulio97.leone@gmail.com> Signed-off-by: Giulio Leone <giulioleone10@gmail.com> Co-authored-by: giulio-leone <giulio.leone@users.noreply.github.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-03-25 07:07:31 +01:00
Giulio Leone	90d6dd4e87	fix(docx): split multiple OMML equations into separate formula items (#3123 ) * fix(msword): split multiple OMML equations into separate formula items When a DOCX paragraph contains multiple sibling <m:oMath> elements (e.g. separate equations on one line), the converter previously concatenated them into a single LaTeX string because element.iter() walks all descendants depth-first. Fix: iterate direct children of the paragraph element first to correctly identify sibling <m:oMath> elements, converting each independently. Falls back to deep iteration only when oMath elements are nested inside wrapper elements. Also splits standalone multi-equation paragraphs into individual FORMULA document items instead of merging them into one. Closes #3121 Signed-off-by: giulio-leone <giulio.leone@users.noreply.github.com> Signed-off-by: giulio-leone <giulio97.leone@gmail.com> * test(msword): add multi-equation paragraph test document Add a minimal DOCX file containing two separate oMath elements in one paragraph with a text separator, along with groundtruth output files for markdown, json, and plain text export. Requested-by: @dolfim-ibm Signed-off-by: Giulio Leone <giulioleone10@gmail.com> Signed-off-by: giulio-leone <giulio.leone@users.noreply.github.com> Signed-off-by: giulio-leone <giulio97.leone@gmail.com> * test(msword): regenerate multi-equation indented-text snapshot Signed-off-by: giulio-leone <giulio.leone@users.noreply.github.com> Signed-off-by: giulio-leone <giulio97.leone@gmail.com> * test: replace test doc with issue #3121 attachment Use the real Word document from the issue reporter (smroels) instead of the minimal programmatic fixture. The new document contains three sibling <m:oMath> elements in one paragraph, matching the exact failing shape described in #3121. Regenerate groundtruth to match the richer document structure. 
Signed-off-by: giulio-leone <giulio97.leone@gmail.com> * test: regenerate groundtruth for omml_multi_equation_paragraph Re-run document conversion with current code to update .itxt and .json groundtruth files. The .itxt had stale structure from the previous programmatic fixture; the new real-document conversion produces the correct output with three separate formula items. Signed-off-by: giulio-leone <giulio97.leone@gmail.com> * style(docx): rerun ruff formatter for msword backend Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * refactor(docx): drop unused tag_name binding Remove the unused local in the direct oMath iteration path so the code reads clearly and the outstanding review comment is fully addressed without changing equation-handling behavior. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: giulio-leone <giulio97.leone@gmail.com> * DCO Remediation Commit for giulio-leone <giulio97.leone@gmail.com> I, giulio-leone <giulio97.leone@gmail.com>, hereby add my Signed-off-by to this commit: `84cc70b55e` Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: giulio-leone <giulio97.leone@gmail.com> * test(docx): cover equation paragraph branches Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: giulio-leone <giulio97.leone@gmail.com> * test(docx): reuse backend fixture in msword tests Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: giulio-leone <giulio97.leone@gmail.com> --------- Signed-off-by: giulio-leone <giulio.leone@users.noreply.github.com> Signed-off-by: giulio-leone <giulio97.leone@gmail.com> Signed-off-by: Giulio Leone <giulioleone10@gmail.com> Co-authored-by: giulio-leone <giulio.leone@users.noreply.github.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-03-24 09:42:16 +01:00
Emre Çalışır	2f7c09e0d8	fix(docx): Missing list items after numbered header (#2665 ) (#2678 ) * fix(docx): Correct list numbering with interleaved numIds and hierarchical markers Fixes incorrect numbering and missing items in DOCX documents that use multiple interleaved numbering sequences (numIds). Changes: * Reset sub-level counters in _get_list_counter when a parent level advances, preventing counter bleed-across (e.g. "4. Functional Requirements" now correctly renders as "1. Functional Requirements") * Add _build_enum_marker helper to produce hierarchical markers in "1.2.3." format instead of flat single-level counters * Fix anchor-based level calculation in new-sequence branch: use level_at_new_list + ilevel instead of _get_level() to correctly place items from a different numId at the right document level * Only set level_at_new_list in the else case (when None) to avoid corrupting the anchor when switching between interleaved numIds * Remove _reset_list_counters_for_new_sequence from new-sequence branch so that returning to a previously seen numId continues its counter (e.g. Appendix A=1, B=2, C=3 instead of A=1, B=1, C=1) Signed-off-by: Emre Çalışır <emrecalisir95@gmail.com> * style(docx): apply Ruff formatting to msword_backend Signed-off-by: Emre Çalışır <emrecalisir95@gmail.com> * test(docx): add unit tests for list counter, enum marker, and sequence reset helpers Adds test_list_counter_and_enum_marker covering helper methods introduced in the list numbering fix: counter increment, sub-level reset on parent advance, hierarchical marker building, and selective sequence reset. Signed-off-by: Emre Çalışır <emrecalisir95@gmail.com> * fix(docx): read start values from abstractNum for correct numbering When Word creates a new numbering definition that continues from a previous list, it embeds start values in the abstractNum XML instead of reusing the same numId. 
Docling previously ignored these start values and always initialized counters from 1, producing incorrect markers like "1.1.1." instead of "2.3.1.". Changes: * Add _get_level_element helper to extract level XML from abstractNum, eliminating duplicated XML traversal in _is_numbered_list * Add _get_start_value to read w:start from the numbering definition * Initialize counters in _get_list_counter using start values * Use start values as fallback in _build_enum_marker for parent levels that have not been explicitly incremented Signed-off-by: Emre Çalışır <emrecalisir95@gmail.com> * test(docx): add interleaved numId edge case to unit_test_headers_numbered Extends the existing test document with an Appendix section that uses a different numId, followed by list items that resume the original numbering sequence with Word-embedded start values (e.g. 2.3.1.). Updates groundtruth files accordingly. Signed-off-by: Emre Çalışır <emrecalisir95@gmail.com> * style(docx): add return type annotation to _get_level_element Signed-off-by: Emre Çalışır <emrecalisir95@gmail.com> --------- Signed-off-by: Emre Çalışır <emrecalisir95@gmail.com>	2026-03-20 21:24:27 +01:00
Ron	8ae0974a9d	fix: handle external image relationships in MsWordDocumentBackend (#3114 ) * fix: handle external image relationships in MsWordDocumentBackend When a .docx file contains image relationships with TargetMode="External" (common in documents saved from web browsers), accessing `_Relationship.target_part` raises ValueError because external relationships don't have a target part within the package. Check `rel.is_external` before accessing `target_part`, emitting a UserWarning with the external target URL and returning None so external images fall through to the existing "image cannot be found" handling. Includes test with ground truth files for a .docx with external image references. Fixes #3113 Signed-off-by: rongo-ms <127863751+rongo-ms@users.noreply.github.com> * chore: upgrade dependencies in uv.lock file Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> --------- Signed-off-by: rongo-ms <127863751+rongo-ms@users.noreply.github.com> Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>	2026-03-19 14:22:21 +01:00
Cesar Berrospi Ramis	56eb12782c	fix(docx): handle list items immediately after numbered headings (#3070) fix(docx): create a new list group with a list item after a heading When a list with a different 'numid' appears after a heading (marked as list item too), a new list group needs to be created to avoid inconsistencies (list item under heading). Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>	2026-03-06 09:30:48 +01:00
Siva	b6ca094519	feat: add support for Word document comments extraction (#2834 ) * feat: add support for Word document comments extraction (fixes #485) Signed-off-by: s1v4-d <leelasaisivasubrahmanyamdurga@gmail.com> * fix: address PR review feedback for comments extraction - Change DocItemLabel.PARAGRAPH to TEXT (deprecating PARAGRAPH) - Change initials format from '(initials)' to 'author: initials' - Change timestamp format to include 'time:' prefix - Update test assertions and regenerate ground truth files Signed-off-by: s1v4-d <leelasaisivasubrahmanyamdurga@gmail.com> * chore: update comment format and move format documentation from inline comment to function docstring Signed-off-by: s1v4-d <leelasaisivasubrahmanyamdurga@gmail.com> * Use docling-core v2.58.0 add_comment() API to properly link Word document comments to their annotated text items via FineRef references. - Import FineRef from docling_core.types.doc.document - Refactor _add_comments to use doc.add_comment(targets=[...]) API - Parse DOCX XML for commentRangeStart/End markers in _extract_comment_ranges - Track paragraph-to-items mapping for comment linking - Fallback to unlinked comments in COMMENT_SECTION group when no targets found Signed-off-by: s1v4-d <leelasaisivasubrahmanyamdurga@gmail.com> * - Extract comment IDs directly during paragraph element processing to match element IDs - Clear paragraph mappings at start of each conversion for consistent behavior - Always create comment groups and use add_comment() API with targets - Add _get_comment_ids_for_element() helper to extract comment markers from XML - Regenerate ground-truth files (JSON/MD/itxt) with comments field properly linked Signed-off-by: s1v4-d <leelasaisivasubrahmanyamdurga@gmail.com> * fix: remove incorrect ground-truth files, keep versions with comments field Signed-off-by: s1v4-d <leelasaisivasubrahmanyamdurga@gmail.com> * fix: reference comment groups instead of text items in comments field Signed-off-by: s1v4-d 
<leelasaisivasubrahmanyamdurga@gmail.com> --------- Signed-off-by: s1v4-d <leelasaisivasubrahmanyamdurga@gmail.com> Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>	2026-01-26 09:58:46 +01:00
Cesar Berrospi Ramis	5c1f8f0171	fix(docx): handle grouped pictures (#2861) Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>	2026-01-09 09:42:52 +01:00
Cesar Berrospi Ramis	c97715f5fd	fix(docx): parse integrals as n-ary objects without chr element (#2712) Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>	2025-12-03 11:25:52 +01:00
Michele Dolfi	e58055465c	fix(docx): Missing list items after numbered header (#2665 ) * fix #2250. list items after numbered headers Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * add test for new case Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * chore(docx): remove unnecessary check Remove 'current_parent is None' check in '_add_list_item' function since it will always be None. Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> --------- Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>	2025-11-24 08:49:21 +01:00
Cesar Berrospi Ramis	054c4a634d	fix(docx): parse page headers and footers (#2599 ) * fix(docx): parse page headers and footers Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> * chore(docx): rename _add_header with _add_heading To avoid confusion, rename _add_header function name with _add_heading since the function is about adding section headings. Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> * chore(docx): extend the page header and footer parsing to any content type Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> * chore(docx): fix _add_header_footer function Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> --------- Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>	2025-11-10 16:10:12 +01:00
Cesar Berrospi Ramis	ef623ffcee	fix(docx): slow table parsing (#2553 ) * chore(docx): remove unnecessary import Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> * fix(docx): simplify parsing of simple tables Simplify the parsing of tables with just text (no rich cells). Move nested function group_cell_elements out of _handle_tables for readability. Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> * chore(docx): reuse method for finding inline pictures Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> * chore(docx): format strikethrough text Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> * tests(docx): use fixtures to avoid converting same file multiple times Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> * fix(docx): remove unnecessary argument docx_obj in functions Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> * tests(docx): add test for rich table cells Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> * chore(docx): small improvements in backend and its unit tests Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> * chore(docx): parse superscript and subscript formatted text Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> --------- Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>	2025-11-06 05:25:53 +01:00
Rafael Teixeira de Lima	16829939cf	feat(docx): Process drawingml objects in docx (#2453 ) * Export of DrawingML figures into docling document * Adding libreoffice env var and libreoffice to checks image Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com> * DCO Remediation Commit for Rafael Teixeira de Lima <Rafael.td.lima@gmail.com> I, Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>, hereby add my Signed-off-by to this commit: `9518fffcad` Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com> * Enforcing apt get update Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com> * Only display drawingml warning once per document Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com> * add util to test libreoffice and exclude files from test when not found Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * check libreoffice only once Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * Only initialise converter if needed Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com> --------- Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com> Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>	2025-10-15 10:58:08 +02:00
Rafael Teixeira de Lima	0b83609531	fix(docx): Adding plain latex equations to table cells (#1986) * Adding plain latex equations to table cells Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com> * Adding test files Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com> --------- Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>	2025-07-24 11:02:24 +02:00
mkrssg	1350a8d3e5	fix(msword_backend): Identify text in the same line after an image #1425 (#1610 ) * fix(msword_backend): Identify text in the same line after an image / image anchor #1425 Signed-off-by: Michael Krissgau <michael.krissgau@ibm.com> * test: add test file and case for fix(msword_backend): Identify text in the same line after an image / image anchor #1425 Signed-off-by: Michael Krissgau <michael.krissgau@ibm.com> * test: added groundtruth test files for fix(msword_backend): Identify text in the same line after an image / image anchor #1425 Signed-off-by: Michael Krissgau <michael.krissgau@ibm.com> * fix: extraneous empty paragraphs for test files Signed-off-by: Michael Krissgau <michael.krissgau@ibm.com> --------- Signed-off-by: Michael Krissgau <michael.krissgau@ibm.com> Co-authored-by: Michael Krissgau <michael.krissgau@ibm.com>	2025-06-20 10:55:30 +02:00
Panos Vagenas	61d0d6c755	test: mark flaky test (#1698 ) * test: cleanse Word test file Signed-off-by: Panos Vagenas <pva@zurich.ibm.com> * mark textbox file test as flaky Signed-off-by: Panos Vagenas <pva@zurich.ibm.com> * fix path usage Signed-off-by: Panos Vagenas <pva@zurich.ibm.com> --------- Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>	2025-06-03 13:13:44 +02:00
AndrewTsai0406	12a0e64892	feat: add textbox content extraction in msword_backend (#1538) * feat: add textbox content extraction in msword_backend Signed-off-by: Andrew <tsai247365@gmail.com> * feat: add textbox content extraction in msword_backend Signed-off-by: Andrew <tsai247365@gmail.com> * feat: add textbox content extraction in msword_backend Signed-off-by: Andrew <tsai247365@gmail.com> --------- Signed-off-by: Andrew <tsai247365@gmail.com>	2025-05-19 15:01:36 +02:00
Simon Jégou	bfcab3d677	feat(docx): add text formatting and hyperlink support (#630 ) * feat: Enable markdown text formatting for docx Signed-off-by: SimJeg <sjegou@nvidia.com> * Fix imports Signed-off-by: SimJeg <sjegou@nvidia.com> * Use Formatting Signed-off-by: SimJeg <sjegou@nvidia.com> * Handle hyperlink Signed-off-by: SimJeg <sjegou@nvidia.com> * Handle formatting properly for DocItemLabel.PARAGRAPH Signed-off-by: SimJeg <sjegou@nvidia.com> * Use inline group Signed-off-by: SimJeg <sjegou@nvidia.com> * Handle bullet lists Signed-off-by: SimJeg <sjegou@nvidia.com> * Strip elements Signed-off-by: SimJeg <sjegou@nvidia.com> * Strip elements Signed-off-by: SimJeg <sjegou@nvidia.com> * Run black and mypy Signed-off-by: SimJeg <sjegou@nvidia.com> * Handle header and footer Signed-off-by: SimJeg <sjegou@nvidia.com> * Use inline_fmt everywhere Signed-off-by: SimJeg <sjegou@nvidia.com> * Run precommit Signed-off-by: SimJeg <sjegou@nvidia.com> * Address feedback Signed-off-by: SimJeg <sjegou@nvidia.com> * Fix add_list_item Signed-off-by: SimJeg <sjegou@nvidia.com> * fix minor bugs, mark helper methods internal Signed-off-by: Panos Vagenas <pva@zurich.ibm.com> --------- Signed-off-by: SimJeg <sjegou@nvidia.com> Signed-off-by: Panos Vagenas <pva@zurich.ibm.com> Co-authored-by: Panos Vagenas <pva@zurich.ibm.com>	2025-04-03 15:11:50 +02:00
Rafael Teixeira de Lima	6eb718f849	feat: equations to latex in MSWord backend (with inline groups) (#1114 ) * Equation groups Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com> * fix: Proper handling of orphan IDs in layout postprocessing (#1118) * Fix the handling of orphan IDs in layout postprocessing Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Update test cases Signed-off-by: Christoph Auer <cau@zurich.ibm.com> --------- Signed-off-by: Christoph Auer <cau@zurich.ibm.com> Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com> * chore: bump version to 2.25.2 [skip ci] * docs: add description of DOCLING_ARTIFACTS_PATH env var (#1124) add env var in docs Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com> * fix(CLI): fix help message for abort options (#1130) fix help message Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com> * perf: New revision code formula model and document picture classifier (#1140) * new version code formula model Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com> * new version document picture classifier Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com> * new code formula model Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com> * restored original code formula test pdf Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com> --------- Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com> Co-authored-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com> Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com> * feat: Use new TableFormer model weights and default to accurate model version (#1100) * feat: New tableformer model weights [WIP] Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com> * Updated TF version Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * Updated tests, after merging with Main, Switched to Accurate TF model by 
default Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> --------- Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com> Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> Co-authored-by: Maksym Lysak <mly@zurich.ibm.com> Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com> * chore: bump version to 2.26.0 [skip ci] * fix: Pass tests, update docling-core to 2.22.0 (#1150) fix: update docling-core to 2.22.0 Update dependency library docling-core to latest release 2.22.0 Fix regression tests and ground truth files Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> * Updating content hash Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com> --------- Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com> Signed-off-by: Christoph Auer <cau@zurich.ibm.com> Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com> Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com> Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> Co-authored-by: Christoph Auer <60343111+cau-git@users.noreply.github.com> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com> Co-authored-by: Matteo <43417658+Matteo-Omenetti@users.noreply.github.com> Co-authored-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com> Co-authored-by: Maksym Lysak <mly@zurich.ibm.com> Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>	2025-03-13 15:12:22 +01:00
Cesar Berrospi Ramis	0cd81a8122	fix(docx): merged table cells not properly converted (#857) * fix(docx): merged cells not properly converted Fix conversion issue of merged cells in Word tables leading to repeated text. Simplify Word table conversion code. Add docx file with several table formats for regression tests. Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> * chore: add type hinting to docx backend Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> --------- Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>	2025-02-03 10:20:03 +01:00
Maxim Lysak	2c037ae62e	fix: Fixed docx import with headers that are also lists (#842 ) * Fix for docx when headers are also lists, now recorded as appropriate headers and subheaders, unit test included Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * Update docling/backend/msword_backend.py Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> Signed-off-by: Maxim Lysak <101627549+maxmnemonic@users.noreply.github.com> * Update docling/backend/msword_backend.py Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> Signed-off-by: Maxim Lysak <101627549+maxmnemonic@users.noreply.github.com> --------- Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> Signed-off-by: Maxim Lysak <101627549+maxmnemonic@users.noreply.github.com> Co-authored-by: Maksym Lysak <mly@zurich.ibm.com> Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>	2025-01-31 10:51:21 +01:00
Maxim Lysak	d0a1180478	fix: Fixes for wordx (#432 ) * fixes for referencing drawing blip in wordx Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * Added safety try-except when trying to load pillow image from a docx blob. Added explicit dependency on lxml. Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * Added test for word file with embedded emf images, re-generated full tests for docx, eased up dependency on lxml Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * Updated lxml dependency version Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> --------- Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>	2024-11-26 14:44:43 +01:00
Maxim Lysak	fb8ba861e2	fix: Handling of single-cell tables in DOCX backend (#314) * Handling of single-cell tables in DOCX backend Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * returned try-catch on tables handling Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * cleaned Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * proceed processing the content of single cell table as if it's just part of the body Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * Added example of tricky 1 cell table docx Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> --------- Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>	2024-11-12 15:20:55 +01:00
Peter W. J. Staar	f542460af3	fix: fix duplicate title and heading + add e2e tests for html and docx (#186 ) * add real e2e tests for html and docx Signed-off-by: Peter Staar <taa@zurich.ibm.com> * updated the output of itxt Signed-off-by: Peter Staar <taa@zurich.ibm.com> * reformatted the text Signed-off-by: Peter Staar <taa@zurich.ibm.com> * fixed the tests Signed-off-by: Peter Staar <taa@zurich.ibm.com> * fixed the tests (2) Signed-off-by: Peter Staar <taa@zurich.ibm.com> * fixed the examples (1) Signed-off-by: Peter Staar <taa@zurich.ibm.com> * fixed the output of the test Signed-off-by: Peter Staar <taa@zurich.ibm.com> * updated the tests, moved the ground-truth Signed-off-by: Peter Staar <taa@zurich.ibm.com> * moved the ground-truth data Signed-off-by: Peter Staar <taa@zurich.ibm.com> * fixed the html tests Signed-off-by: Peter Staar <taa@zurich.ibm.com> * restructure title fix (#187) Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com> --------- Signed-off-by: Peter Staar <taa@zurich.ibm.com> Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com> Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>	2024-10-30 13:14:56 +01:00
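
Note on 90d6dd4e87 (#3123): that commit message describes iterating the direct children of a paragraph to find sibling <m:oMath> elements, and falling back to deep iteration only when the equations sit inside a wrapper element. The sketch below illustrates that idea with the standard-library ElementTree. It is not the code from docling's MS Word backend: find_equations and equation_text are hypothetical names, and equation_text only joins the m:t runs as a stand-in for the real OMML-to-LaTeX conversion.

```python
import xml.etree.ElementTree as ET

M_NS = "http://schemas.openxmlformats.org/officeDocument/2006/math"
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
OMATH = f"{{{M_NS}}}oMath"


def find_equations(paragraph):
    """Return one element per equation in a paragraph (hypothetical helper)."""
    # Prefer direct children so that sibling equations stay separate.
    direct = [child for child in paragraph if child.tag == OMATH]
    if direct:
        return direct
    # Fallback: equations nested inside wrapper elements such as m:oMathPara.
    return list(paragraph.iter(OMATH))


def equation_text(omath):
    """Stand-in for the LaTeX conversion: join the text of all m:t runs."""
    return "".join(t.text or "" for t in omath.iter(f"{{{M_NS}}}t"))


xml = f"""
<w:p xmlns:w="{W_NS}" xmlns:m="{M_NS}">
  <m:oMath><m:r><m:t>a=b</m:t></m:r></m:oMath>
  <w:r><w:t> and </w:t></w:r>
  <m:oMath><m:r><m:t>c=d</m:t></m:r></m:oMath>
</w:p>
"""
paragraph = ET.fromstring(xml)
equations = [equation_text(e) for e in find_equations(paragraph)]
print(equations)  # ['a=b', 'c=d']
```

Each sibling equation is converted independently, so the two formulas are not concatenated into one string.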

29 Commits
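
Note on 2f7c09e0d8 (#2678): that commit message describes three behaviours for numbered lists: sub-level counters reset when a parent level advances, markers are hierarchical ("1.2.3."), and counters start from the values Word stores in w:start instead of always from 1. A minimal sketch of that bookkeeping follows, assuming start values are supplied as a plain dict. ListCounters and its methods are hypothetical names for illustration and do not reproduce docling's _get_list_counter or _build_enum_marker.

```python
from collections import defaultdict


class ListCounters:
    """Per-numId counters, one per indentation level (hypothetical helper)."""

    def __init__(self, start_values=None):
        # Maps (num_id, ilevel) -> first number, standing in for w:start.
        self.start_values = start_values or {}
        self.counters = defaultdict(dict)

    def _start(self, num_id, ilevel):
        return self.start_values.get((num_id, ilevel), 1)

    def advance(self, num_id, ilevel):
        """Move to the next item at ilevel and return its number."""
        levels = self.counters[num_id]
        if ilevel in levels:
            levels[ilevel] += 1
        else:
            levels[ilevel] = self._start(num_id, ilevel)
        # A parent advancing resets every deeper level of the same numId.
        for deeper in [lvl for lvl in levels if lvl > ilevel]:
            del levels[deeper]
        return levels[ilevel]

    def marker(self, num_id, ilevel):
        """Build a hierarchical marker such as '1.2.3.'."""
        levels = self.counters[num_id]
        # Parent levels never incremented fall back to their start value.
        parts = [levels.get(lvl, self._start(num_id, lvl)) for lvl in range(ilevel + 1)]
        return ".".join(str(p) for p in parts) + "."


counters = ListCounters()
markers = []
for ilevel in (0, 1, 1, 0, 1):
    counters.advance(1, ilevel)
    markers.append(counters.marker(1, ilevel))
print(markers)  # ['1.', '1.1.', '1.2.', '2.', '2.1.']

# A numbering definition that continues an earlier list through start values.
resumed = ListCounters({(7, 0): 2, (7, 1): 3})
resumed.advance(7, 2)
print(resumed.marker(7, 2))  # 2.3.1.
```

Because counters are keyed by numId, returning to a previously seen numId continues its sequence instead of restarting it.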