Module: objects/pdfobject.py
- Purpose:
This module provides the ‘PDF Document’ object structure into which PDF documents are parsed for transport and onward use.
- Platform:
Linux/Windows | Python 3.11+
- Developer:
J Berendt
- Email:
- Comments:
n/a
- class DocPDF
Bases: _DocBase
Container class for storing data parsed from a PDF file.
- property pages: list[PageObject]
A list containing an object for each page in the document.
Tip
The page number index aligns to the page number in the PDF file.
For example, to access the PageObject for page 42, use: pages[42]
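The index alignment described in the tip can be illustrated with a minimal sketch. It assumes the list keeps a placeholder at index 0 so that index n maps to page n; that layout is an assumption made for illustration, and PageStub and build_pages are hypothetical names, not part of this module.

```python
from dataclasses import dataclass


@dataclass
class PageStub:
    """Hypothetical stand-in for PageObject; not the library's class."""
    pageno: int
    text: str = ''


def build_pages(texts):
    """Return a list in which the index equals the page number."""
    # Assumption: index 0 holds a placeholder, so pages[n] is page n.
    placeholder = PageStub(pageno=0)
    return [placeholder] + [PageStub(i, t) for i, t in enumerate(texts, start=1)]


pages = build_pages(['first page', 'second page', 'third page'])
page = pages[2]  # The object for page 2, not the third item of the document.
```

With this layout, no off-by-one adjustment is needed when moving between the page number shown in a PDF viewer and the list index.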
- property marked_content: bool
Indicate if the document was parsed using marked-content tags.
PDF documents can be created with ‘marked content’ tags. When a PDF document is parsed using tags, as this flag indicates, the parser respects columns and other page formatting schemes. If a multi-column page is parsed without tags, the parser reads straight across each line, merging text from adjacent columns and corrupting the reading order.
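The effect of reading a multi-column page straight across can be shown with a toy example. This is only a sketch of the reading-order problem using plain lists of strings; it does not use or represent the parser, and the function names are hypothetical.

```python
# Two columns of a page, each stored as a list of visual lines.
left = ['The quick brown', 'fox jumps over']
right = ['Lorem ipsum dolor', 'sit amet.']


def read_across(left, right):
    """Read each visual line left to right, ignoring the column boundary."""
    return ' '.join(f'{l} {r}' for l, r in zip(left, right))


def read_by_column(left, right):
    """Read the whole left column first, then the right column."""
    return ' '.join(left + right)


untagged = read_across(left, right)
tagged = read_by_column(left, right)
```

Here `untagged` interleaves the two columns, while `tagged` keeps each column's sentences intact, which is the behaviour the flag is described as indicating.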
- property tables: list
Accessor to data extracted from a document’s tables.
- property basename: str
Accessor for the file’s basename.
- property documents: list
Accessor to the Document objects. These objects are used for passing into text splitters and for loading documents (and embeddings) into vector databases.
- property filepath: str
Accessor for the explicit path to this file.
- property metadata: dict | object
The metadata as extracted from the document.
- property npages: int
The number of pages successfully extracted from the source.
- property ntables: int
The number of tables successfully extracted from the source.
- property parser: object
Accessor to the underlying document parser’s functionality.
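As a rough usage sketch, the properties listed above could be combined into a short summary of a parsed document. The summarise helper is hypothetical, and a SimpleNamespace with made-up values stands in for a DocPDF instance, since how an instance is created is not covered here.

```python
from types import SimpleNamespace


def summarise(doc):
    """Build a one-line summary from the documented property names."""
    mode = 'tagged' if doc.marked_content else 'untagged'
    return f'{doc.basename}: {doc.npages} pages, {doc.ntables} tables ({mode})'


# Stand-in object with made-up values; a real DocPDF would come from the parser.
doc = SimpleNamespace(
    basename='report.pdf',
    npages=3,
    ntables=1,
    marked_content=True,
)
summary = summarise(doc)
```

Because the helper only reads attributes, it should work with any object exposing the same property names.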