Module: parsers/pptxparser.py

Purpose:

This module serves as the public interface for interacting with PPTX files and parsing their contents.

Platform:

Linux/Windows | Python 3.11+

Developer:

J Berendt

Email:

development@s3dev.uk

Comments:

n/a

Example:

For example code usage, please refer to the PPTXParser class docstring.

class PPTXParser(path: str)

Bases: _PPTXTextParser

PPTX document parser.

Parameters:

path (str) – Full path to the PPTX document to be parsed.

Example:

Extract text from a PPTX file:

>>> from docp_parsers import PPTXParser

>>> pptx = PPTXParser(path='/path/to/myfile.pptx')
>>> pptx.extract_text()

# Access the text on slide 1.
>>> pg1 = pptx.doc.slides[1].content

property doc: DocPPTX

Accessor to the document object.

extract_text(*, remove_newlines: bool = False, convert_to_ascii: bool = True, **kwargs) → None

Extract text from the document.

A list of slides, with their extracted content, can be accessed using the slides attribute.

Parameters:
  • remove_newlines (bool, optional) – If True, each newline character is replaced with a space. Defaults to False.

  • convert_to_ascii (bool, optional) – When a non-ASCII character is found, an attempt is made to convert it to an associated ASCII character. If a character cannot be converted, it is replaced with a '?'. Defaults to True.

Keyword Args:
  • None

Returns:

None.
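
The effect of the two flags can be illustrated with a minimal standard-library sketch. This is not the parser's implementation: the helper names to_ascii and strip_newlines are hypothetical, and the use of Unicode NFKD decomposition to find an "associated ASCII character" is an assumption made only for illustration.

```python
import unicodedata


def to_ascii(text: str) -> str:
    """Hypothetical helper: approximate each non-ASCII character with ASCII.

    Assumption: NFKD decomposition is used to find an ASCII equivalent;
    the actual parser may map characters differently.
    """
    out = []
    for ch in text:
        if ch.isascii():
            out.append(ch)
            continue
        # Decompose the character (e.g. an accented letter becomes the base
        # letter plus a combining mark) and keep only the ASCII parts.
        decomposed = unicodedata.normalize('NFKD', ch)
        kept = ''.join(c for c in decomposed if c.isascii())
        # Nothing ASCII survived, so fall back to '?'.
        out.append(kept or '?')
    return ''.join(out)


def strip_newlines(text: str) -> str:
    """Hypothetical helper: replace each newline character with a space."""
    return text.replace('\n', ' ')
```

With these helpers, an accented letter collapses to its base letter, a character with no ASCII decomposition becomes '?', and a two-line string becomes a single line joined by a space.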

property slides: list

Accessor to the PPTX document’s slide objects.
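
As a rough sketch of how the slides accessor might be consumed after extract_text() has run, the helper below gathers the text of every slide. It assumes each slide object exposes a content attribute, as the class example suggests; collect_slide_text is a hypothetical name, and the stand-in object only mimics a parser so the snippet runs without a PPTX file.

```python
from types import SimpleNamespace


def collect_slide_text(parser) -> list[str]:
    """Hypothetical helper: return the extracted text of each slide, in order."""
    # Assumption: every item in parser.slides has a .content attribute.
    return [slide.content for slide in parser.slides]


# Stand-in for a PPTXParser on which extract_text() has already been called.
fake_parser = SimpleNamespace(
    slides=[SimpleNamespace(content='Title'), SimpleNamespace(content='Agenda')]
)
texts = collect_slide_text(fake_parser)
```

With a real parser, the same helper would be called as collect_slide_text(pptx) once pptx.extract_text() has returned.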