Base (Private) Module: parsers/_pptxtextparser.py
- Purpose:
This module provides the logic for parsing text from a PPTX document.
- Platform:
Linux/Windows | Python 3.11+
- Developer:
J Berendt
- Email:
Attention
This module is not designed to be interacted with directly, only via the appropriate interface class(es).
Rather, please create an instance of a PPTX document parsing object using the public interface class (PPTXParser, as used in the example below).
- class _PPTXTextParser(path: str)
Bases: _PPTXBaseParser
Private PPTX document text parser intermediate class.
- Parameters:
path (str) – Full path to the PPTX document.
- Example:
Extract text from a PPTX file:
>>> from docp_parsers import PPTXParser
>>> pptx = PPTXParser(path='/path/to/myfile.pptx')
>>> pptx.extract_text()
>>> # Access the text on slide 1.
>>> pg1 = pptx.doc.slides[1].content
- extract_text(*, remove_newlines: bool = False, convert_to_ascii: bool = True, **kwargs) -> None
Extract text from the document.
A list of slides, with their extracted content, can be accessed using the slides attribute.
- Parameters:
remove_newlines (bool, optional) – If True, the newline characters are replaced with a space. Defaults to False.
convert_to_ascii (bool, optional) – When a non-ASCII character is found, an attempt is made to convert it to an associated ASCII character. If a character cannot be converted, it is replaced with a '?'. Defaults to True.
- Keyword Args:
None
- Returns:
None.
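The effect of the two flags can be illustrated with a minimal sketch. The function name clean_text is hypothetical, and the NFKD-based transliteration is an assumption made for illustration; the module's actual conversion routine is not shown in this page.

```python
import unicodedata


def clean_text(text: str, *, remove_newlines: bool = False,
               convert_to_ascii: bool = True) -> str:
    """Hypothetical helper illustrating the two extract_text() flags."""
    if remove_newlines:
        # Replace each newline character with a single space.
        text = text.replace('\n', ' ')
    if convert_to_ascii:
        # Assumption: decompose accented characters, drop the combining
        # marks, then replace anything still non-ASCII with '?'.
        decomposed = unicodedata.normalize('NFKD', text)
        stripped = ''.join(c for c in decomposed
                           if not unicodedata.combining(c))
        text = stripped.encode('ascii', errors='replace').decode('ascii')
    return text
```

Under these assumptions, an accented Latin character maps to its base letter, while a character with no ASCII counterpart becomes '?'.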
- _extract_text(remove_newlines: bool, convert_to_ascii: bool) -> None
Extract the text from all shapes on all slides.
- Parameters:
remove_newlines (bool) – Replace the newline characters with a space.
convert_to_ascii (bool) – Attempt to convert any non-ASCII characters to their ASCII equivalent.
The text extracted from each slide is stored as a TextObject, which is appended to the slide’s texts attribute.
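The slide-by-slide, shape-by-shape traversal might look roughly like the sketch below. The helper name extract_slide_texts is hypothetical, and it returns plain strings rather than the TextObject instances this module stores; it only assumes a presentation object exposing slides, shapes, has_text_frame and text_frame.text, as python-pptx does.

```python
def extract_slide_texts(presentation) -> list[list[str]]:
    """Return one list of text strings per slide (hypothetical helper)."""
    all_slides = []
    for slide in presentation.slides:
        texts = []
        for shape in slide.shapes:
            # Pictures, connectors and similar shapes have no text
            # frame, so they are skipped.
            if getattr(shape, 'has_text_frame', False):
                text = shape.text_frame.text
                if text:
                    texts.append(text)
        all_slides.append(texts)
    return all_slides
```

With python-pptx installed, passing the object returned by pptx.Presentation('/path/to/myfile.pptx') would be the typical call.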
- _open() -> None
Open the PPTX document for reading.
Before opening the file, a test is performed to ensure the PPTX is valid. The file must:
exist
be a ZIP archive, per the file signature
have a .pptx file extension
- Other Operations:
Store the pptx.Presentation parser object returned from the pptx.Presentation() instance creation into the doc.parser attribute.
Store the number of pages into the doc.npages attribute.
Store the document’s metadata into the doc.metadata attribute.
- Raises:
TypeError – Raised if the file type criteria above are not met.
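The three validity checks described for _open() can be sketched as follows. The function name validate_pptx and the error messages are hypothetical; the signature test assumes the standard 'PK\x03\x04' leading bytes of a non-empty ZIP archive, which may differ from how the module itself inspects the file.

```python
from pathlib import Path

# Leading bytes of a non-empty ZIP archive (local file header).
ZIP_SIGNATURE = b'PK\x03\x04'


def validate_pptx(path: str) -> None:
    """Raise TypeError unless path looks like a PPTX file (hypothetical)."""
    p = Path(path)
    if not p.is_file():
        raise TypeError(f'File does not exist: {path}')
    if p.suffix.lower() != '.pptx':
        raise TypeError(f'Not a .pptx file extension: {path}')
    with p.open('rb') as f:
        # Compare the first four bytes against the ZIP signature.
        if f.read(4) != ZIP_SIGNATURE:
            raise TypeError(f'Not a ZIP archive: {path}')
```

Reading only the first four bytes keeps the check cheap; it does not confirm that the archive actually contains a valid presentation.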
- _set_paths() -> None
Set the document’s file path attributes.
- property slides: list
Accessor to the PPTX document’s slide objects.