Module: objects/pdfobject.py

Purpose:

This module provides the ‘PDF Document’ object structure into which PDF documents are parsed for transport and onward use.

Platform:

Linux/Windows | Python 3.11+

Developer:

J Berendt

Email:

development@s3dev.uk

Comments:

n/a

class DocPDF

Bases: _DocBase

Container class for storing data parsed from a PDF file.

property pages: list[PageObject]

A list containing an object for each page in the document.

Tip

The page number index aligns to the page number in the PDF file.

For example, to access the PageObject for page 42, use:

pages[42]

property marked_content: bool

Indicate if the document was parsed using marked-content tags.

PDF documents can be created with ‘marked content’ tags. When a PDF document is parsed using tags, as this flag indicates, the parser respects columns and other page formatting schemes. If a multi-column page is parsed without tags, the parser reads straight across the line, thus corrupting the text.
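
The effect described above can be illustrated with a toy two-column page. This is a minimal sketch, not the parser's implementation: the `left` and `right` column lists and both helper functions are hypothetical, invented only to show why reading straight across a line interleaves the columns.

```python
# Toy two-column page: each list holds one column's lines, top to bottom.
# (Hypothetical data; not produced by the library.)
left = ['The quick brown', 'fox jumps over']
right = ['A second story', 'starts over here']


def read_across(left: list[str], right: list[str]) -> str:
    """Read each visual line left to right, ignoring the column layout."""
    return ' '.join(f'{a} {b}' for a, b in zip(left, right))


def read_by_column(left: list[str], right: list[str]) -> str:
    """Read the whole left column first, then the right column."""
    return ' '.join(left + right)


print(read_across(left, right))     # columns interleaved line by line
print(read_by_column(left, right))  # each column kept intact
```

When `marked_content` is False, text from multi-column pages may resemble the first output and is worth checking before onward use.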

property tables: list

Accessor to data extracted from a document’s tables.

property basename: str

Accessor for the file’s basename.

property documents: list

Accessor to the Document objects.

These objects are used for passing into text splitters and for loading documents (and embeddings) into vector databases.
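
As a rough illustration of passing these objects to a text splitter, the sketch below cuts each document's text into fixed-size chunks. It rests on assumptions: `StubDocument` is a hypothetical stand-in, and its `page_content` and `metadata` attributes follow the convention of LangChain-style Document objects, which this page does not confirm. `split_fixed` is likewise a naive illustrative splitter, not part of the library.

```python
from dataclasses import dataclass, field


@dataclass
class StubDocument:
    """Hypothetical stand-in for one entry of ``documents`` (assumed attributes)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def split_fixed(docs: list[StubDocument], size: int) -> list[StubDocument]:
    """Naive splitter: cut each text into chunks of at most ``size`` characters."""
    chunks = []
    for doc in docs:
        text = doc.page_content
        for start in range(0, len(text), size):
            # Copy the metadata so each chunk remembers its source document.
            chunks.append(StubDocument(text[start:start + size], dict(doc.metadata)))
    return chunks


# Made-up content and metadata key, for demonstration only.
docs = [StubDocument('abcdefghij', {'pageno': 1})]
chunks = split_fixed(docs, size=4)
print([c.page_content for c in chunks])
```

In practice a dedicated splitter that respects sentence or paragraph boundaries would normally replace `split_fixed`; the point here is only that each chunk carries its source metadata forward.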

property filepath: str

Accessor for the explicit path to this file.

property metadata: dict | object

The metadata as extracted from the document.

property npages: int

The number of pages successfully extracted from the source.

property ntables: int

The number of tables successfully extracted from the source.

property parser: object

Accessor to the underlying document parser’s functionality.
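
The read-only properties above can be combined into a quick summary of a parsed document. The sketch below is illustrative only: `summarise` is a hypothetical helper, and a `SimpleNamespace` with made-up values stands in for a real `DocPDF` instance, which would normally be produced by the parser.

```python
from types import SimpleNamespace


def summarise(doc) -> str:
    """Build a one-line summary from the documented properties."""
    layout = 'tagged' if doc.marked_content else 'untagged'
    return f'{doc.basename}: {doc.npages} pages, {doc.ntables} tables ({layout})'


# Stand-in carrying the same attribute names as DocPDF (values are made up).
doc = SimpleNamespace(basename='report.pdf', npages=3, ntables=1, marked_content=True)
print(summarise(doc))
```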