docling

mirror of https://github.com/docling-project/docling.git synced 2026-05-17 13:10:38 +00:00

Author	SHA1	Message	Date
pateltejas	72942486ff	fix(pptx): skip malformed picture shapes instead of aborting conversion (#3372) * fix(pptx): skip malformed picture shapes instead of aborting conversion MsPowerpointDocumentBackend._handle_pictures reads embedded image bytes via python-pptx's shape.image accessor. On PPTX files with slightly malformed <p:pic> shapes, shape.image raises three exceptions that the existing (UnidentifiedImageError, OSError, ValueError) clause does not catch, so one bad picture aborts conversion of the entire presentation: - InvalidXmlError when <p:blipFill> is missing - KeyError when <a:blip r:embed> points to an unknown relationship - AttributeError when the embedded part's content-type isn't an image These files open normally in Keynote and Google Drive, so the backend should handle them as gracefully as it already handles truncated or unreadable image payloads. This follows the same pattern as #2914, which extended the same except tuple with ValueError to handle linked (external) image references. The three cases above are the remaining shape.image failure modes that still escape. Extend the except tuple to cover the three cases and log the same warning used for other unreadable images, leaving the rest of the presentation to convert normally. Add a regression fixture with one malformed picture per failure mode plus a focused test. Fixes #3371 Signed-off-by: pateltejas <tejas226@hotmail.com> * refactor(pptx): use warnings.warn for malformed picture skips Address PR review feedback: use Python's warnings module with UserWarning to signal the skip to callers instead of logging.Logger.warning, matching the pattern used in msword_backend for "Skipping external image reference". This makes the skip visible via standard warning filters and catchable in tests. Update the regression test to assert the warning is emitted via pytest.warns, which also suppresses the message during the test run so it doesn't clutter suite output. Signed-off-by: pateltejas <tejas226@hotmail.com> --------- Signed-off-by: pateltejas <tejas226@hotmail.com>	2026-04-29 08:29:08 +02:00
pateltejas	043ed2dd3d	fix(pptx): handle NotImplementedError from shape.shape_type (#3309) * fix(pptx): handle NotImplementedError from shape.shape_type python-pptx raises NotImplementedError from Shape.shape_type for <p:sp> elements that aren't placeholders, autoshapes, textboxes, or freeforms (e.g. shapes with empty <p:spPr> from Google Slides exports, LibreOffice, or Keynote). handle_groups() and handle_shapes() access shape_type without catching this, crashing the entire conversion. Add a _safe_shape_type() helper that returns None on NotImplementedError, so unrecognized shapes skip only the GROUP recursion and PICTURE extraction while text and table extraction proceed normally. Fixes #3308 Signed-off-by: Tejas Patel <tejas226@hotmail.com> * Fix lint Signed-off-by: Tejas Patel <tejas226@hotmail.com> --------- Signed-off-by: Tejas Patel <tejas226@hotmail.com>	2026-04-17 06:59:48 +02:00
Sam Quigley	5e452a2e8f	fix(pptx): handle picture shapes with external image references (#2914) * fix(pptx): handle picture shapes with external image references When processing PowerPoint files containing picture shapes that reference external images (rather than embedded images), the python-pptx library raises a ValueError("no embedded image") when accessing the `image` property. Previously, this caused the entire document conversion to fail because: 1. The `hasattr(shape, "image")` check at line 690 would trigger the property getter, which raises ValueError (hasattr only catches AttributeError, not ValueError) 2. The exception handler in `_handle_pictures()` only caught UnidentifiedImageError and OSError, not ValueError This fix: - Removes the unnecessary hasattr check since we already verify the shape type is MSO_SHAPE_TYPE.PICTURE - Adds ValueError to the exception handler in `_handle_pictures()` so that picture shapes with external references are gracefully skipped with a warning instead of crashing the pipeline Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> * DCO Remediation Commit for Sam Quigley <quigley@emerose.com> I, Sam Quigley <quigley@emerose.com>, hereby add my Signed-off-by to this commit: `e69779e07b` Signed-off-by: Sam Quigley <quigley@emerose.com> * tests(pptx): add a linked image to test the fix on `e69779e` Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> --------- Signed-off-by: Sam Quigley <quigley@emerose.com> Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com> Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>	2026-02-01 11:44:29 +01:00
Tong Luo	999dbb2765	fix: PPTX parsing: bullet points not grouped correctly under subheadings (#2663) (#2855) * fix: PPTX parsing: bullet points not grouped correctly under subheadings (#2663) Signed-off-by: Tong Luo <luotng@cn.ibm.com> * fix: PPTX parsing: bullet points not grouped correctly under subheadings, support Python 3.9 (#2663) Signed-off-by: Tong Luo <luotng@cn.ibm.com> * fix: PPTX parsing: optimized code naming, descriptions (#2663) Signed-off-by: Tong Luo <luotng@cn.ibm.com> * fix: PPTX parsing: optimized code naming, descriptions (#2663) Signed-off-by: Tong Luo <luotng@cn.ibm.com> * docs(pptx): updated docstrings in pptx backend parser Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> --------- Signed-off-by: Tong Luo <luotng@cn.ibm.com> Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>	2026-01-21 14:34:24 +01:00
Martin Wind	f28d23cf03	fix: pptx line break and space handling (#1664) Signed-off-by: Martin Wind <martin.wind@im-c.at>	2025-06-16 10:44:30 +02:00
Maciej Wieczorek	b454aa1551	feat: Add PPTX notes slides (#474) * feat: Add PPTX notes slides Presenter notes may have useful information and should also be extracted. Signed-off-by: Maciej Wieczorek <maciej@wieczorek.co> * feat: Move presenter notes into furniture Signed-off-by: Maciej Wieczorek <maciej@wieczorek.co> --------- Signed-off-by: Maciej Wieczorek <maciej@wieczorek.co>	2025-03-19 14:52:09 +01:00
Maxim Lysak	7a97d7119f	feat: Extracting picture data for raster images found in PPTX (#349) * Added picture data for pptx pictures Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * Added tests for pptx Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * Inferring image DPI from pptx file Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> --------- Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>	2024-11-18 15:22:28 +01:00
Peter W. J. Staar	f542460af3	fix: fix duplicate title and heading + add e2e tests for html and docx (#186) * add real e2e tests for html and docx Signed-off-by: Peter Staar <taa@zurich.ibm.com> * updated the output of itxt Signed-off-by: Peter Staar <taa@zurich.ibm.com> * reformatted the text Signed-off-by: Peter Staar <taa@zurich.ibm.com> * fixed the tests Signed-off-by: Peter Staar <taa@zurich.ibm.com> * fixed the tests (2) Signed-off-by: Peter Staar <taa@zurich.ibm.com> * fixed the examples (1) Signed-off-by: Peter Staar <taa@zurich.ibm.com> * fixed the output of the test Signed-off-by: Peter Staar <taa@zurich.ibm.com> * updated the tests, moved the ground-truth Signed-off-by: Peter Staar <taa@zurich.ibm.com> * moved the ground-truth data Signed-off-by: Peter Staar <taa@zurich.ibm.com> * fixed the html tests Signed-off-by: Peter Staar <taa@zurich.ibm.com> * restructure title fix (#187) Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com> --------- Signed-off-by: Peter Staar <taa@zurich.ibm.com> Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com> Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>	2024-10-30 13:14:56 +01:00
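
The picture fixes in #2914 and #3372 above rely on two Python behaviours: hasattr() swallows only AttributeError, and a failing property getter can be skipped per item by catching a tuple of exceptions and calling warnings.warn. The sketch below illustrates both with stdlib-only stand-ins. The classes LinkedPicture, BrokenRelPicture and GoodPicture and the functions hasattr_leaks and collect_images are hypothetical names invented for this illustration, not docling or python-pptx code, and the tuple omits the third-party UnidentifiedImageError and InvalidXmlError that the commits mention.

```python
import warnings


class LinkedPicture:
    """Hypothetical stand-in: a picture shape whose image is linked, not embedded."""

    @property
    def image(self):
        # Mirrors the ValueError("no embedded image") described in #2914.
        raise ValueError("no embedded image")


class BrokenRelPicture:
    """Hypothetical stand-in: the blip points to an unknown relationship."""

    @property
    def image(self):
        raise KeyError("rId99")


class GoodPicture:
    """Hypothetical stand-in: a shape with readable image bytes."""

    image = b"\x89PNG"


def hasattr_leaks(shape):
    """Return True if hasattr() lets a non-AttributeError escape."""
    try:
        hasattr(shape, "image")  # runs the property getter
    except ValueError:
        return True
    return False


# Exceptions treated as "unreadable picture" in this sketch (stdlib only).
SKIPPABLE = (OSError, ValueError, KeyError, AttributeError)


def collect_images(shapes):
    """Collect image payloads, warning about and skipping unreadable ones."""
    images = []
    for index, shape in enumerate(shapes):
        try:
            images.append(shape.image)
        except SKIPPABLE as exc:
            warnings.warn(f"Skipping picture {index}: {exc!r}", UserWarning)
    return images
```

Because the skip is signalled with a UserWarning rather than a log record, callers can filter it with the standard warnings filters and tests can assert on it, which is the motivation the #3372 refactor gives.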

8 Commits
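
The _safe_shape_type() helper described in commit 043ed2dd3d (#3309) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the backend's actual code: ShapeType is a hypothetical stand-in for python-pptx's MSO_SHAPE_TYPE (its member values here are arbitrary), and UnrecognizedShape, PictureShape and classify are invented for the example.

```python
from enum import Enum


class ShapeType(Enum):
    """Hypothetical stand-in for python-pptx's MSO_SHAPE_TYPE."""

    GROUP = "group"
    PICTURE = "picture"


class UnrecognizedShape:
    """Hypothetical shape whose shape_type getter is not implemented."""

    text = "still extractable"

    @property
    def shape_type(self):
        raise NotImplementedError("unrecognized shape type")


class PictureShape:
    """Hypothetical shape that reports itself as a picture."""

    text = ""
    shape_type = ShapeType.PICTURE


def safe_shape_type(shape):
    """Return the shape type, or None when the getter raises NotImplementedError."""
    try:
        return shape.shape_type
    except NotImplementedError:
        return None


def classify(shape):
    """Pick a handling path; unknown types fall through to text handling."""
    kind = safe_shape_type(shape)
    if kind is ShapeType.GROUP:
        return "recurse"
    if kind is ShapeType.PICTURE:
        return "picture"
    return "text-only"
```

Returning None instead of raising means an unrecognized shape only misses the group and picture branches, so its text can still be processed, which matches the behaviour the commit message describes.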