Base (Private) Module: parsers/_pdftableparser.py

Purpose: This module provides the logic for parsing tables from a PDF document.
Platform: Linux
Developer: J Berendt
Email: jeremy.berendt@rolls-royce.com

Attention

This module is not designed to be interacted with directly, only via the appropriate interface class(es).

Rather, please create an instance of a PDF document parsing object using the following:

PDFParser

class _PDFTableParser(path: str)

Bases: _PDFBaseParser

Private PDF document table parser intermediate class.

Parameters:

path (str) – Full path to the PDF document.

Example:

Extract tables from a PDF file:

>>> from docp_parsers import PDFParser

>>> pdf = PDFParser(path='/path/to/myfile.pdf')
>>> pdf.extract_tables()

>>> tables = pdf.doc.tables

extract_tables(table_settings: dict = None, as_dataframe: bool = False, to_csv: bool = True, verbose: bool = False, **kwargs) → None

Extract tables from the document.

Before a table is extracted, a number of validation tests are performed to verify that what has been identified as a ‘table’ is actually a table that might be useful to the user.

If to_csv is True (the default), each ‘valid’ table is written as a CSV file on the user’s desktop.

Additionally, the extracted table data is stored in the class’s tables attribute.

Parameters:

table_settings (dict, optional) – Table settings to be used for the table extraction. Defaults to None, which is replaced by the value in the config.
as_dataframe (bool, optional) – By default, the extracted tables are returned as a list of (lists of lists), for example: all_tables[table[rows[data]]]. However, if this argument is True, the table data is returned as a list of pandas.DataFrame objects. In this case, the first row of the table is used as the header, and all remaining rows are treated as data. Note: This will not work properly for all tables. Defaults to False.
to_csv (bool, optional) – Dump extracted table data to a CSV file, one per table. Defaults to True.
verbose (bool, optional) – Display how many tables were extracted, and the path to their location. Defaults to False.

Keyword Args:

Additional keyword arguments to be added to the table_settings argument of pdfplumber’s extract_table() method.
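The way table_settings and the extra keyword arguments might combine can be pictured with the minimal sketch below. It is an illustration only, not the module's own code: DEFAULT_TABLE_SETTINGS is a hypothetical stand-in for the value held in the config, resolve_table_settings is a hypothetical helper name, and letting keyword arguments override existing keys is an assumption. The keys shown are table-settings names documented by pdfplumber.

```python
# Hypothetical stand-in for the table settings held in the config.
DEFAULT_TABLE_SETTINGS = {"vertical_strategy": "lines", "horizontal_strategy": "lines"}


def resolve_table_settings(table_settings: dict = None, **kwargs) -> dict:
    """Return the settings dict to hand to pdfplumber (illustrative only)."""
    # None falls back to the configured default; copy so the input is not mutated.
    chosen = DEFAULT_TABLE_SETTINGS if table_settings is None else table_settings
    settings = dict(chosen)
    # Additional keyword arguments are added on top (assumed to override).
    settings.update(kwargs)
    return settings


print(resolve_table_settings(snap_tolerance=5))
```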

_create_table_directory_path()

Create the output directory for table data.

If the directory does not exist, it is created.

_create_table_file_path(pageno: int, tblno: int) → str

Create the filename for the table.

Parameters:

pageno (int) – Page from which the table was extracted.
tblno (int) – Number of the table on the page, starting at 1.

Returns:

Explicit path to the file to be written.

Return type:

str

static _filter_tables(tables: list, threshold: int = 5000) → list

Remove tables from the passed list which are deemed invalid.

Parameters:

tables (list) – A list of tables as detected by the Page.find_tables() method.
threshold (int, optional) – Minimum pixel area for a detected table to be returned. Defaults to 5000.

Rationale:

An ‘invalid’ table is identified by the pixel area it covers. Any table whose area is smaller than threshold pixels is likely a block of text which has been categorised as a ‘table’, but is not one.

Returns:

A list of tables whose pixel area is greater than threshold.

Return type:

list
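The area test described above can be sketched as follows. This is a minimal illustration rather than the module's implementation: FakeTable is a hypothetical stand-in for a detected table, and the sketch assumes each table exposes a bbox tuple of the form (x0, top, x1, bottom).

```python
from typing import NamedTuple


class FakeTable(NamedTuple):
    """Hypothetical stand-in for a detected table; only the bbox is modelled."""
    bbox: tuple  # (x0, top, x1, bottom)


def table_area(table: FakeTable) -> float:
    """Area covered by the table's bounding box."""
    x0, top, x1, bottom = table.bbox
    return (x1 - x0) * (bottom - top)


def filter_tables(tables: list, threshold: int = 5000) -> list:
    """Keep only the tables whose area is greater than the threshold."""
    return [t for t in tables if table_area(t) > threshold]


candidates = [FakeTable((0, 0, 200, 100)), FakeTable((0, 0, 50, 20))]
kept = filter_tables(candidates)  # only the 200 x 100 table survives
```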

_table_header_footer(table: list[list]) → bool

Verify a table is not a header or footer.

Parameters:

table (list[list]) – Table (a list of lists) to be analysed.

Rationale:

A table is determined to be a header or footer if any of the lines contained in the ‘common lines list’ are found in the table.

If any of these lines are found, the table is determined to be a header/footer and True is returned.

Returns:

False if the table is not a header/footer, otherwise True.

Return type:

bool
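A minimal sketch of this check is shown below. The function name and the common_lines argument are hypothetical, and matching a common line as a substring of a cell is an assumption; the module may compare lines differently.

```python
def is_header_footer(table: list, common_lines: list) -> bool:
    """Return True if any common line appears in any cell of the table."""
    # Flatten the table into its non-empty cells (empty cells may be None or '').
    cells = [str(cell) for row in table for cell in row if cell]
    return any(line in cell for line in common_lines for cell in cells)


common = ["ACME Corp Confidential"]  # illustrative 'common lines list'
header = [["ACME Corp Confidential", None], ["Page 1", ""]]
body = [["Part", "Qty"], ["Bolt", "4"]]
```

With these inputs, is_header_footer(header, common) evaluates to True and is_header_footer(body, common) to False.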

_to_buffer(data: list[list]) → StringIO

Write the table data into a string buffer.

Parameters: data (list[list]) – The table data as a list of lists to be written to a buffer.
Returns: A string buffer as an io.StringIO object.
Return type: io.StringIO
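The idea of collecting table rows in an in-memory buffer can be sketched as below. Note the swapped-in technique: this sketch uses the standard library's csv.writer for quoting, whereas the module documents its own _prepare_row() step, so the exact output may differ. The function name is hypothetical.

```python
import csv
import io


def to_buffer(data: list) -> io.StringIO:
    """Write a list of rows into a StringIO buffer as comma-separated text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(data)
    buffer.seek(0)  # rewind so a reader starts at the first row
    return buffer


buf = to_buffer([["a", "b"], ["1", "x,y"]])
```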

_to_csv(buffer: StringIO, pageno: int, tableno: int) → int

Write a table (from the buffer) to CSV.

Parameters:

buffer (io.StringIO) – A pre-processed StringIO object containing table data to be written.
pageno (int) – Page number from the Page object.
tableno (int) – Number of the table on the page, starting at 1.

Returns:

1 if the file was written, otherwise 0. This is used by the caller to track the number of CSV files written.

Return type:

int
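The 1-or-0 return value lends itself to a simple running count in the caller. The sketch below illustrates that pattern only: the function name and the explicit path argument are hypothetical (the real method derives the path from pageno and tableno), and treating an empty buffer as "not written" is an assumption.

```python
import io


def to_csv(buffer: io.StringIO, path: str) -> int:
    """Write the buffer to path; return 1 if a file was written, otherwise 0."""
    content = buffer.getvalue()
    if not content:
        # Assumed rule: nothing to write means no file is created.
        return 0
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write(content)
    return 1
```

A caller can then accumulate the results, for example count += to_csv(buffer, path), to report how many CSV files were produced.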

_to_df(buffer: StringIO)

Write a table (from the buffer) to a DataFrame.

Once written, the DataFrame is appended to the doc.tables list.

Parameters: buffer (io.StringIO) – A pre-processed StringIO object containing table data to be written.

_get_crop_coordinates(skip_header: bool = False, skip_footer: bool = False) → tuple[float]

Determine the bounding box coordinates.

These coordinates are used for removing the header and/or footer.

Parameters:

skip_header (bool, optional) – If True, set the coordinates such that the header is skipped. Defaults to False.
skip_footer (bool, optional) – If True, set the coordinates such that the footer is skipped. Defaults to False.

Logic:

When excluding a header and/or footer, the page used for header/footer position detection depends on the length of the document:

1 page: page 1 is used

2 to 10 pages: page 2 is used

11 or more pages: page 5 is used

Returns:

A bounding box tuple of the following form, to be passed directly into the Page.crop() method:

(x0, top, x1, bottom)

Return type:

tuple
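The page-selection rule in the Logic section can be sketched as a small function. The name detection_page is hypothetical, and raising ValueError for a document with no pages is an assumption not stated above.

```python
def detection_page(npages: int) -> int:
    """Return the page number used for header/footer position detection."""
    if npages < 1:
        raise ValueError("The document must have at least one page.")
    if npages == 1:
        return 1
    if npages <= 10:
        return 2
    return 5
```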

_open() → None

Open the PDF document for reading.

Before opening the file, a test is performed to ensure the PDF is valid. The file must:

exist

be a valid PDF file, per the file signature

have a .pdf file extension

Other Operations:

Store the pdfplumber parser object returned from the pdfplumber.open() function into the doc.parser attribute.
Store the number of pages into the doc.npages attribute.
Store the document’s metadata into the doc.metadata attribute.

Raises:

TypeError – Raised if the file type criteria above are not met.
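The three validity criteria can be sketched as below. This is an illustration under stated assumptions, not the module's code: validate_pdf is a hypothetical name, the extension check is assumed to be case-insensitive, and the signature check assumes the file begins with the standard PDF marker %PDF-.

```python
import os


def validate_pdf(path: str) -> None:
    """Raise TypeError unless path exists, ends in .pdf and starts with %PDF-."""
    if not os.path.isfile(path):
        raise TypeError(f"File not found: {path}")
    if not path.lower().endswith(".pdf"):
        raise TypeError(f"Expected a .pdf file extension: {path}")
    with open(path, "rb") as f:
        # A PDF file's signature is the five bytes '%PDF-'.
        if f.read(5) != b"%PDF-":
            raise TypeError(f"Not a valid PDF file signature: {path}")
```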

static _prepare_row(row: list) → str

Prepare a table row for writing to a CSV file.

Parameters:

row (list) – A list of strings, constituting a table row.

Processing Tasks:

For each element in the row:

Remove any double quote characters (ASCII and Unicode).

Replace any empty values with 'None'.

If the element contains a comma, wrap the element in double quotes.

Attempt to convert any non-ASCII characters to an associated ASCII character. If the replacement cannot be made, the character is replaced with a '?'.

Returns:

A processed comma-separated string, ready to be written to a CSV file.

Return type:

str
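The processing tasks above can be sketched as follows. This is a minimal illustration, not the module's implementation: the set of Unicode double-quote characters is an assumption, None cells are assumed to count as empty, and the ASCII conversion uses Unicode decomposition from the standard library, which may differ from the module's own mapping.

```python
import unicodedata

# Assumed set of double-quote characters to strip: ASCII plus common Unicode forms.
_QUOTES = str.maketrans("", "", '"\u201c\u201d\u201e\u201f')


def to_ascii(text: str) -> str:
    """Map characters to a close ASCII form; anything unmappable becomes '?'."""
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop combining marks so that an accented letter keeps its base letter.
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return base.encode("ascii", errors="replace").decode("ascii")


def prepare_row(row: list) -> str:
    """Return the row as a single comma-separated string."""
    cells = []
    for cell in row:
        text = "" if cell is None else str(cell)
        text = text.translate(_QUOTES)  # remove double quotes
        if not text:
            text = "None"  # replace empty values
        if "," in text:
            text = f'"{text}"'  # protect embedded commas
        cells.append(to_ascii(text))
    return ",".join(cells)
```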

_scan_common() → list[str]

Scan the PDF document to find the most common lines.

Rationale:

Generally, the most common lines in a document will be the header and footer, as these are expected to be repeated on each page of the document.

‘Most common’ is defined as a line occurring on 90% of the pages throughout the document. Therefore, only documents with more than three pages are scanned. For shorter documents, the 90% rule may exclude relevant parts of the document (as was discovered in testing).

Logic:

For documents with more than three pages, the entire PDF is read through and each line extracted. The occurrence of each line is counted, with the most common occurrences returned to the caller.

The returned lines are to be passed into a page search to determine the x/y coordinates of the header and footer.

Returns:

For documents with more than three pages, a list containing the most common lines in the document. Otherwise, an empty list is returned.

Return type:

list
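The counting logic can be sketched as below. This is an illustration under stated assumptions: scan_common takes a hypothetical list of per-page text strings rather than a PDF, each distinct line is counted once per page, "90% of the pages" is read as "at least 90%", and the result is sorted only to make the output deterministic.

```python
from collections import Counter


def scan_common(pages: list, ratio: float = 0.9) -> list:
    """Return the lines found on at least `ratio` of the pages."""
    npages = len(pages)
    if npages <= 3:
        # Short documents are not scanned.
        return []
    counts = Counter()
    for text in pages:
        # A set ensures a line repeated within one page is counted once.
        counts.update({ln.strip() for ln in text.splitlines() if ln.strip()})
    return sorted(line for line, n in counts.items() if n / npages >= ratio)


doc_pages = [f"Header\nbody text {i}\nFooter" for i in range(10)]
common_lines = scan_common(doc_pages)
```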

_set_paths() → None: Set the document’s file path attributes.

property doc: DocPDF: Accessor to the document object.

property pages: list: Accessor to the PDF’s page objects.

property tables: list: Accessor to the PDF’s table objects.