Base (Private) Module: parsers/_pptxtextparser.py

Purpose: This module provides the logic for parsing text from a PPTX document.
Platform: Linux/Windows | Python 3.11+
Developer: J Berendt
Email: development@s3dev.uk

Attention

This module is not designed to be interacted with directly, only via the appropriate interface class(es).

Rather, please create an instance of a PPTX document parsing object using the following:

PPTXParser

class _PPTXTextParser(path: str)

Bases: _PPTXBaseParser

Private PPTX document text parser intermediate class.

Parameters:

path (str) – Full path to the PPTX document.

Example:

Extract text from a PPTX file:

>>> from docp_parsers import PPTXParser

>>> pptx = PPTXParser(path='/path/to/myfile.pptx')
>>> pptx.extract_text()

# Access the text on slide 1.
>>> pg1 = pptx.doc.slides[1].content

extract_text(*, remove_newlines: bool = False, convert_to_ascii: bool = True, **kwargs) → None

Extract text from the document.

A list of slides, with their extracted content, can be accessed using the slides attribute.

Parameters:

remove_newlines (bool, optional) – If True, the newline characters are replaced with a space. Defaults to False.
convert_to_ascii (bool, optional) – When a non-ASCII character is found, an attempt is made to convert it to an associated ASCII character. If a character cannot be converted, it is replaced with a '?'. Defaults to True.

Keyword Args:

None

Returns:

None.
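
As a rough illustration of what the two flags describe, the sketch below normalises a string in the way the parameter descriptions read. It is not the library's implementation: the normalise_text helper is hypothetical, and the use of unicodedata decomposition to find an ASCII equivalent is an assumption made for this example.

```python
import unicodedata


def normalise_text(text: str, *, remove_newlines: bool = False,
                   convert_to_ascii: bool = True) -> str:
    """Hypothetical helper mirroring the two extract_text() flags."""
    if remove_newlines:
        # Each newline character becomes a single space.
        text = text.replace('\n', ' ')
    if convert_to_ascii:
        chars = []
        for ch in text:
            if ch.isascii():
                chars.append(ch)
                continue
            # Assumption: decompose the character (NFKD) and drop combining
            # marks to look for an ASCII equivalent, e.g. an accented 'e'.
            decomposed = unicodedata.normalize('NFKD', ch)
            base = ''.join(c for c in decomposed
                           if not unicodedata.combining(c))
            # Anything still outside ASCII is replaced with '?'.
            chars.append(base if base and base.isascii() else '?')
        text = ''.join(chars)
    return text
```

With both flags enabled, a line break becomes a space and an accented letter falls back to its base letter, while characters with no ASCII decomposition become '?'. The real parser may map more characters than this sketch does.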

_extract_text(remove_newlines: bool, convert_to_ascii: bool) → None

Extract the text from all shapes on all slides.

Parameters:

remove_newlines (bool) – Replace the newline characters with a space.
convert_to_ascii (bool) – Attempt to convert any non-ASCII characters to their ASCII equivalent.

The text extracted from each slide is stored as a TextObject which is appended to the slide’s texts attribute.
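
The traversal described above (all shapes on all slides) might look roughly like the following. This is a minimal sketch, not the module's code: collect_slide_texts is a hypothetical name, it returns plain strings rather than TextObject instances, and it only relies on the slides, shapes, has_text_frame and text_frame.text attributes that python-pptx objects expose.

```python
def collect_slide_texts(presentation) -> dict[int, list[str]]:
    """Return {slide number: [text of each text-bearing shape]}.

    `presentation` is assumed to behave like a pptx.Presentation object.
    """
    texts = {}
    for number, slide in enumerate(presentation.slides, start=1):
        texts[number] = [
            shape.text_frame.text
            for shape in slide.shapes
            # Shapes such as pictures have no text frame and are skipped.
            if getattr(shape, 'has_text_frame', False)
        ]
    return texts
```

With python-pptx installed, this could be called as collect_slide_texts(pptx.Presentation('/path/to/myfile.pptx')). The newline and ASCII options would then be applied to each string before it is stored.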

_open() → None

Open the PPTX document for reading.

Before opening the file, a test is performed to ensure the PPTX is valid. The file must:

exist
be a ZIP archive, per the file signature
have a .pptx file extension

Other Operations:

Store the pptx.Presentation object returned by the pptx.Presentation() call in the doc.parser attribute.
Store the number of pages in the doc.npages attribute.
Store the document’s metadata in the doc.metadata attribute.

Raises:

TypeError – Raised if the file type criteria above are not met.
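
The three validity criteria and the resulting TypeError can be sketched as below. The validate_pptx_path name is hypothetical, and two details are assumptions rather than documented behaviour: the signature is taken to be the four-byte ZIP local file header, and the extension comparison is case-insensitive.

```python
import os

# Assumption: the four-byte local file header that begins a non-empty
# ZIP archive (a PPTX file is a ZIP container).
ZIP_SIGNATURE = b'PK\x03\x04'


def validate_pptx_path(path: str) -> None:
    """Hypothetical check: raise TypeError unless all criteria hold."""
    exists = os.path.isfile(path)
    has_ext = os.path.splitext(path)[1].lower() == '.pptx'
    is_zip = False
    if exists:
        # Read only the leading bytes; the archive itself is not opened.
        with open(path, 'rb') as f:
            is_zip = f.read(len(ZIP_SIGNATURE)) == ZIP_SIGNATURE
    if not (exists and is_zip and has_ext):
        raise TypeError(f'Not a valid PPTX file: {path}')
```

Checking the signature instead of trusting the extension alone catches files that were merely renamed to .pptx.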

_set_paths() → None: Set the document’s file path attributes.

property doc: DocPPTX: Accessor to the document object.

property slides: list: Accessor to the PPTX document’s slide objects.