8 Commits

Author SHA1 Message Date
pateltejas 72942486ff fix(pptx): skip malformed picture shapes instead of aborting conversion (#3372)
* fix(pptx): skip malformed picture shapes instead of aborting conversion

MsPowerpointDocumentBackend._handle_pictures reads embedded image bytes via python-pptx's shape.image accessor. On PPTX files with slightly malformed <p:pic> shapes, shape.image can raise three exception types that the existing (UnidentifiedImageError, OSError, ValueError) clause does not catch, so a single bad picture aborts conversion of the entire presentation:

- InvalidXmlError when <p:blipFill> is missing
- KeyError when <a:blip r:embed> points to an unknown relationship
- AttributeError when the embedded part's content-type isn't an image

These files open normally in Keynote and Google Drive, so the backend should handle them as gracefully as it already handles truncated or unreadable image payloads.

This follows the same pattern as #2914, which extended the same except tuple with ValueError to handle linked (external) image references. The three cases above are the remaining shape.image failure modes that still escape.

Extend the except tuple to cover the three cases and log the same warning used for other unreadable images, leaving the rest of the presentation to convert normally. Add a regression fixture with one malformed picture per failure mode plus a focused test.
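
A minimal sketch of the widened except tuple, written against stand-ins so it runs without python-pptx or Pillow. FakePicture, extract_images, UNREADABLE_PICTURE_ERRORS and the local InvalidXmlError class are hypothetical names used for illustration only; the actual change lives in _handle_pictures and its tuple also names UnidentifiedImageError, which is omitted here.

```python
import logging

_log = logging.getLogger(__name__)


class InvalidXmlError(Exception):
    """Stand-in for python-pptx's InvalidXmlError (assumption: keeps the sketch dependency-free)."""


class FakePicture:
    """Hypothetical picture shape whose .image accessor may raise a configured error."""

    def __init__(self, name, error=None, blob=b""):
        self.name = name
        self._error = error
        self._blob = blob

    @property
    def image(self):
        if self._error is not None:
            raise self._error
        return self._blob


# Previously handled cases plus the three failure modes listed above.
UNREADABLE_PICTURE_ERRORS = (
    OSError,
    ValueError,
    InvalidXmlError,  # <p:blipFill> is missing
    KeyError,         # r:embed points to an unknown relationship
    AttributeError,   # embedded part is not an image
)


def extract_images(shapes):
    """Return image payloads, skipping any shape whose image cannot be read."""
    images = []
    for shape in shapes:
        try:
            images.append(shape.image)
        except UNREADABLE_PICTURE_ERRORS as exc:
            # One bad picture is logged and skipped; the loop continues.
            _log.warning("Skipping unreadable picture %r: %r", shape.name, exc)
    return images
```

Catching at the per-shape level is what keeps the rest of the presentation converting: the loop moves on to the next shape instead of unwinding out of the whole conversion.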

Fixes #3371

Signed-off-by: pateltejas <tejas226@hotmail.com>

* refactor(pptx): use warnings.warn for malformed picture skips

Address PR review feedback: use Python's warnings module with UserWarning to signal the skip to callers instead of logging.Logger.warning, matching the pattern used in msword_backend for "Skipping external image reference". This makes the skip visible via standard warning filters and catchable in tests.

Update the regression test to assert the warning is emitted via pytest.warns, which also suppresses the message during the test run so it doesn't clutter suite output.
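
A rough illustration of the warnings-based skip and of how a caller or test can observe it. read_picture and broken_loader are hypothetical names, not the backend's code. The sketch records the warning with the standard library's warnings.catch_warnings so it has no test-framework dependency; pytest.warns, which the commit's regression test uses, plays the same role inside a pytest test.

```python
import warnings


def read_picture(shape_name, loader):
    """Return the loader's bytes, or None after emitting a UserWarning on failure."""
    try:
        return loader()
    except (OSError, ValueError, KeyError, AttributeError) as exc:
        warnings.warn(
            f"Skipping malformed picture {shape_name!r}: {exc!r}",
            UserWarning,
            stacklevel=2,
        )
        return None


def broken_loader():
    """Hypothetical loader that fails like an unknown relationship id."""
    raise KeyError("rId42")


# Record warnings instead of printing them, as a test would.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = read_picture("Picture 3", broken_loader)

skip_messages = [
    str(w.message) for w in caught if issubclass(w.category, UserWarning)
]
```

Because the skip is an ordinary UserWarning, callers can also escalate it with a standard filter such as warnings.simplefilter("error") when a malformed picture should fail the run.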

Signed-off-by: pateltejas <tejas226@hotmail.com>

---------

Signed-off-by: pateltejas <tejas226@hotmail.com>
2026-04-29 08:29:08 +02:00
pateltejas 043ed2dd3d fix(pptx): handle NotImplementedError from shape.shape_type (#3309)
* fix(pptx): handle NotImplementedError from shape.shape_type

python-pptx raises NotImplementedError from Shape.shape_type for
<p:sp> elements that aren't placeholders, autoshapes, textboxes, or
freeforms (e.g. shapes with empty <p:spPr> from Google Slides exports,
LibreOffice, or Keynote). handle_groups() and handle_shapes() access
shape_type without catching this, crashing the entire conversion.

Add a _safe_shape_type() helper that returns None on
NotImplementedError, so unrecognized shapes skip only the GROUP
recursion and PICTURE extraction while text and table extraction
proceed normally.
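
The helper's behavior can be sketched as follows. ShapeKind, FakeShape and collect are hypothetical stand-ins so the example runs without python-pptx, and safe_shape_type mirrors the described _safe_shape_type only in spirit.

```python
from enum import Enum


class ShapeKind(Enum):
    """Hypothetical stand-in for python-pptx's MSO_SHAPE_TYPE."""

    GROUP = "group"
    PICTURE = "picture"
    TEXT_BOX = "text_box"


class FakeShape:
    """Hypothetical shape; kind=None models an unrecognized <p:sp> element."""

    def __init__(self, text, kind=None):
        self.text = text
        self._kind = kind

    @property
    def shape_type(self):
        if self._kind is None:
            raise NotImplementedError("Shape instance not recognized")
        return self._kind


def safe_shape_type(shape):
    """Return the shape type, or None when the library cannot classify the shape."""
    try:
        return shape.shape_type
    except NotImplementedError:
        return None


def collect(shapes):
    """Gather text from every shape and pictures only from recognized PICTURE shapes."""
    texts, pictures = [], []
    for shape in shapes:
        kind = safe_shape_type(shape)
        if kind is ShapeKind.PICTURE:
            pictures.append(shape)
        # Text extraction does not depend on the shape type, so it still runs.
        if shape.text:
            texts.append(shape.text)
    return texts, pictures
```

Returning None rather than re-raising means the type comparisons for GROUP and PICTURE simply evaluate to False for unrecognized shapes, so only those two branches are skipped.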

Fixes #3308

Signed-off-by: Tejas Patel <tejas226@hotmail.com>

* Fix lint

Signed-off-by: Tejas Patel <tejas226@hotmail.com>

---------

Signed-off-by: Tejas Patel <tejas226@hotmail.com>
2026-04-17 06:59:48 +02:00
Sam Quigley 5e452a2e8f fix(pptx): handle picture shapes with external image references (#2914)
* fix(pptx): handle picture shapes with external image references

When processing PowerPoint files containing picture shapes that reference
external images (rather than embedded images), the python-pptx library
raises a ValueError("no embedded image") when accessing the `image`
property.

Previously, this caused the entire document conversion to fail because:

1. The `hasattr(shape, "image")` check at line 690 would trigger the
   property getter, which raises ValueError (hasattr only catches
   AttributeError, not ValueError)

2. The exception handler in `_handle_pictures()` only caught
   UnidentifiedImageError and OSError, not ValueError

This fix:
- Removes the unnecessary hasattr check since we already verify the
  shape type is MSO_SHAPE_TYPE.PICTURE
- Adds ValueError to the exception handler in `_handle_pictures()` so
  that picture shapes with external references are gracefully skipped
  with a warning instead of crashing the pipeline
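
The hasattr pitfall from point 1 can be reproduced in isolation. LinkedPicture and image_or_none are hypothetical names standing in for a python-pptx picture shape with an external image reference and for the handler described above.

```python
class LinkedPicture:
    """Hypothetical shape whose image property fails like a linked (external) image."""

    @property
    def image(self):
        raise ValueError("no embedded image")


shape = LinkedPicture()

# hasattr() calls the property getter and swallows only AttributeError,
# so the ValueError escapes from what looks like a harmless check.
try:
    hasattr(shape, "image")
    outcome = "hasattr returned normally"
except ValueError as exc:
    outcome = f"hasattr propagated ValueError: {exc}"


def image_or_none(picture):
    """Access the property directly and treat ValueError as 'no usable image'."""
    try:
        return picture.image
    except ValueError:
        return None
```

Dropping the hasattr check and catching ValueError at the point of access keeps a single place where the failure is handled.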

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* DCO Remediation Commit for Sam Quigley <quigley@emerose.com>

I, Sam Quigley <quigley@emerose.com>, hereby add my Signed-off-by to this commit: e69779e07b

Signed-off-by: Sam Quigley <quigley@emerose.com>

* tests(pptx): add a linked image to test the fix on e69779e

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

---------

Signed-off-by: Sam Quigley <quigley@emerose.com>
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2026-02-01 11:44:29 +01:00
Tong Luo 999dbb2765 fix: PPTX parsing: bullet points not grouped correctly under subheadings (#2663) (#2855)
* fix: PPTX parsing: bullet points not grouped correctly under subheadings (#2663)

Signed-off-by: Tong Luo <luotng@cn.ibm.com>

* fix: PPTX parsing: bullet points not grouped correctly under subheadings, support Python 3.9 (#2663)

Signed-off-by: Tong Luo <luotng@cn.ibm.com>

* fix: PPTX parsing: optimized code naming, descriptions (#2663)

Signed-off-by: Tong Luo <luotng@cn.ibm.com>

* fix: PPTX parsing: optimized code naming, descriptions (#2663)

Signed-off-by: Tong Luo <luotng@cn.ibm.com>

* docs(pptx): updated docstrings in pptx backend parser

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

---------

Signed-off-by: Tong Luo <luotng@cn.ibm.com>
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2026-01-21 14:34:24 +01:00
Martin Wind f28d23cf03 fix: pptx line break and space handling (#1664)
Signed-off-by: Martin Wind <martin.wind@im-c.at>
2025-06-16 10:44:30 +02:00
Maciej Wieczorek b454aa1551 feat: Add PPTX notes slides (#474)
* feat: Add PPTX notes slides

Presenter notes may have useful information and should also be extracted.

Signed-off-by: Maciej Wieczorek <maciej@wieczorek.co>

* feat: Move presenter notes into furniture

Signed-off-by: Maciej Wieczorek <maciej@wieczorek.co>

---------

Signed-off-by: Maciej Wieczorek <maciej@wieczorek.co>
2025-03-19 14:52:09 +01:00
Maxim Lysak 7a97d7119f feat: Extracting picture data for raster images found in PPTX (#349)
* Added picture data for pptx pictures

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Added tests for pptx

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Inferring image DPI from pptx file

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

---------

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
2024-11-18 15:22:28 +01:00
Peter W. J. Staar f542460af3 fix: fix duplicate title and heading + add e2e tests for html and docx (#186)
* add real e2e tests for html and docx

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* updated the output of itxt

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* reformatted the text

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* fixed the tests

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* fixed the tests (2)

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* fixed the examples (1)

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* fixed the output of the test

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* updated the tests, moved the ground-truth

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* moved the ground-truth data

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* fixed the html tests

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* restructure title fix (#187)

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>

---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-10-30 13:14:56 +01:00