Module: objects/pdfobject.py

Purpose:: This module provides the ‘PDF Document’ object structure into which PDF documents are parsed for transport and onward use.
Platform:: Linux/Windows | Python 3.11+
Developer:: J Berendt
Email:: development@s3dev.uk
Comments:: n/a

class DocPDF

Bases: _DocBase

Container class for storing data parsed from a PDF file.

property pages: list[PageObject]

A list containing an object for each page in the document.

Tip

The list index aligns with the page number in the PDF file.

For example, to access the PageObject for page 42, use:

pages[42]
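The index alignment can be illustrated with a small sketch. It assumes the list carries a placeholder at index 0 so that index n lands on page n; the source does not state how the alignment is achieved. FakePage and get_page are hypothetical names used only for illustration and are not part of this module.

```python
from dataclasses import dataclass


@dataclass
class FakePage:
    """Hypothetical stand-in for PageObject (not the library's class)."""
    pageno: int
    text: str = ''


# Assumption: index 0 is a placeholder, so pages[n] is page n of the PDF.
pages = [FakePage(0)] + [FakePage(n, f'Text of page {n}') for n in range(1, 4)]


def get_page(pages: list, pageno: int) -> FakePage:
    """Return the page whose list index equals its PDF page number."""
    if not 1 <= pageno < len(pages):
        raise IndexError(f'Page {pageno} is outside 1..{len(pages) - 1}')
    return pages[pageno]


page = get_page(pages, 2)
print(page.pageno, page.text)
```

The range check rejects index 0 explicitly, so a placeholder entry (if one exists) is never mistaken for a real page.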

property marked_content: bool

Indicate if the document was parsed using marked-content tags.

PDF documents can be created with ‘marked content’ tags. When a PDF document is parsed using tags, as this flag indicates, the parser respects columns and other page formatting schemes. If a multi-column page is parsed without tags, the parser reads straight across each line, running text from adjacent columns together and corrupting the reading order.
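A caller may want to check this flag before trusting the reading order of the extracted text. The following is a minimal sketch of such a check; check_reading_order is a hypothetical helper, and the stand-in objects expose only the documented marked_content and basename properties rather than being real DocPDF instances.

```python
import warnings
from types import SimpleNamespace


def check_reading_order(doc) -> bool:
    """Return True if the document was parsed using marked-content tags."""
    if doc.marked_content:
        return True
    warnings.warn(
        f'{doc.basename}: parsed without marked-content tags; '
        'text on multi-column pages may be read across columns.'
    )
    return False


# Stand-in objects exposing only the two documented properties used above.
tagged = SimpleNamespace(basename='tagged.pdf', marked_content=True)
untagged = SimpleNamespace(basename='untagged.pdf', marked_content=False)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    results = [check_reading_order(tagged), check_reading_order(untagged)]

print(results, len(caught))
```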

property tables: list: Accessor to data extracted from a document’s tables.

property basename: str

Accessor for the file’s basename.

property documents: list

Accessor to the Document objects.

These objects are used for passing into text splitters and for loading documents (and embeddings) into vector databases.
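As a rough illustration of the hand-off to a text splitter, the sketch below chunks the text of each Document by a fixed character count. It assumes each Document exposes its text through a page_content attribute; the source does not say which Document class is used, so that attribute name, the split_documents helper and the stand-in objects are all hypothetical.

```python
from types import SimpleNamespace


def split_documents(documents: list, chunk_size: int = 40) -> list[str]:
    """Split each document's text into chunks of at most chunk_size characters."""
    chunks = []
    for doc in documents:
        text = doc.page_content  # Assumed attribute name, not confirmed by the source.
        chunks.extend(text[i:i + chunk_size] for i in range(0, len(text), chunk_size))
    return chunks


# Stand-ins for the Document objects; the content is made up.
documents = [
    SimpleNamespace(page_content='A' * 100),
    SimpleNamespace(page_content='B' * 30),
]
chunks = split_documents(documents)
print([len(c) for c in chunks])
```

A production splitter would normally break on sentence or paragraph boundaries and overlap adjacent chunks; fixed-width slicing is used here only to keep the example short.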

property filepath: str

Accessor for the explicit path to this file.

property metadata: dict | object

The metadata as extracted from the document.

property npages: int

The number of pages successfully extracted from the source.

property ntables: int

The number of tables successfully extracted from the source.

property parser: object

Accessor to the underlying document parser’s functionality.
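The scalar properties listed above lend themselves to a quick post-parse summary. A minimal sketch follows, assuming only the documented property names; summarise is a hypothetical helper, and the stand-in object and its values are made up for illustration.

```python
from types import SimpleNamespace


def summarise(doc) -> str:
    """Build a one-line summary from the documented properties."""
    return f'{doc.basename}: {doc.npages} page(s), {doc.ntables} table(s) [{doc.filepath}]'


# Stand-in exposing the documented property names; not a real DocPDF instance.
doc = SimpleNamespace(
    basename='report.pdf',
    filepath='/tmp/report.pdf',
    npages=12,
    ntables=3,
)
print(summarise(doc))
```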