Module: parsers/pptxparser.py

Purpose: This module serves as the public interface for interacting with PPTX files and parsing their contents.
Platform: Linux/Windows | Python 3.11+
Developer: J Berendt
Email: development@s3dev.uk
Comments: n/a
Example: For example code usage, please refer to the PPTXParser class docstring.

class PPTXParser(path: str)

Bases: _PPTXTextParser

PPTX document parser.

Parameters:

path (str) – Full path to the PPTX document to be parsed.

Example:

Extract text from a PPTX file:

>>> from docp_parsers import PPTXParser

>>> pptx = PPTXParser(path='/path/to/myfile.pptx')
>>> pptx.extract_text()

# Access the text on slide 1.
>>> pg1 = pptx.doc.slides[1].content

property doc: DocPPTX: Accessor to the document object.

extract_text(*, remove_newlines: bool = False, convert_to_ascii: bool = True, **kwargs) → None

Extract text from the document.

After extraction, the list of slides and their extracted content can be accessed using the slides attribute.

Parameters:

remove_newlines (bool, optional) – If True, each newline character is replaced with a space. Defaults to False.
convert_to_ascii (bool, optional) – When a non-ASCII character is found, an attempt is made to convert it to an associated ASCII character. If a character cannot be converted, it is replaced with a '?'. Defaults to True.

Keyword Args:

None

Returns:

None.
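
The effect of the two parameters can be illustrated with a small standalone sketch. This is not the library's implementation; the helper names to_ascii and clean are hypothetical, and the ASCII conversion shown here (Unicode decomposition via unicodedata) is only one assumed way of finding an "associated" ASCII character. The library's actual mapping may differ, for example for punctuation such as curly quotes.

```python
import unicodedata


def to_ascii(text: str) -> str:
    """Hypothetical helper: approximate each non-ASCII character, else use '?'."""
    out = []
    for char in text:
        if char.isascii():
            out.append(char)
            continue
        # Decompose the character (e.g. 'é' becomes 'e' plus a combining accent)
        # and keep only the ASCII portion of the decomposition.
        decomposed = unicodedata.normalize('NFKD', char)
        ascii_part = ''.join(c for c in decomposed if c.isascii())
        # No ASCII portion means the character cannot be converted.
        out.append(ascii_part or '?')
    return ''.join(out)


def clean(text: str, *, remove_newlines: bool = False,
          convert_to_ascii: bool = True) -> str:
    """Hypothetical helper mirroring the documented defaults of extract_text."""
    if remove_newlines:
        text = text.replace('\n', ' ')
    if convert_to_ascii:
        text = to_ascii(text)
    return text
```

With the defaults, newlines are kept and accented characters are approximated, so clean('Café\nmenu') keeps the line break, while clean('Café\nmenu', remove_newlines=True) joins the two lines with a space.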

property slides: list: Accessor to the PPTX document’s slide objects.
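
As a rough sketch of how the slides accessor might be used once extract_text() has been called, the snippet below collects the text of every slide. It assumes each slide object exposes a content attribute, as in the class example above; the function slide_texts is hypothetical, and a stand-in object replaces a real PPTXParser so the sketch runs without a PPTX file.

```python
from types import SimpleNamespace


def slide_texts(parser) -> list:
    """Hypothetical helper: return the extracted text of each slide, in order."""
    # Assumes each slide object has a 'content' attribute.
    return [slide.content for slide in parser.slides]


# Stand-in for a parser on which extract_text() has already been called.
fake_parser = SimpleNamespace(
    slides=[SimpleNamespace(content='Title'), SimpleNamespace(content='Agenda')]
)
texts = slide_texts(fake_parser)
```

With a real parser, the same function would be called as slide_texts(pptx) after pptx.extract_text().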