Module: parsers/pdfparser.py

Purpose:

This module serves as the public interface for interacting with PDF files and parsing their contents.

Platform:

Linux/Windows | Python 3.11+

Developer:

J Berendt

Email:

development@s3dev.uk

Comments:

n/a

Example:

For example code usage, please refer to the PDFParser class docstring.

class PDFParser(path: str)

Bases: _PDFTableParser, _PDFTextParser

PDF document parser.

Parameters:

path (str) – Full path to the PDF document to be parsed.

Example:

Extract text from a PDF file:

>>> from docp_parsers import PDFParser

>>> pdf = PDFParser(path='/path/to/myfile.pdf')
>>> pdf.extract_text()

# Access the content of page 1.
>>> pg1 = pdf.pages[1].content
>>> pg1
'Lorem ipsum dolor sit amet, consectetur adipiscing elit,
 sed do eiusmod tempor incididunt ut labore et dolore magna
 aliqua.'

Extract tables from a PDF file:

>>> from docp_parsers import PDFParser

>>> pdf = PDFParser('/path/to/myfile.pdf')
>>> pdf.extract_tables()

# Access the first table.
>>> tbl1 = pdf.tables[1]

property doc: DocPDF

Accessor to the document object.

extract_tables(table_settings: dict = None, as_dataframe: bool = False, to_csv: bool = True, verbose: bool = False, **kwargs) → None

Extract tables from the document.

Before a table is extracted, a number of validation tests are performed to verify that what has been identified as a ‘table’ is actually a table that might be useful to the user.

Each ‘valid’ table is written as a CSV file on the user’s desktop.

Additionally, the extracted table data is stored to the class’ tables attribute.

Parameters:
  • table_settings (dict, optional) – Table settings to be used for the table extraction. Defaults to None, which is replaced by the value in the config.

  • as_dataframe (bool, optional) – By default, the extracted tables are stored as a list of tables, where each table is a list of rows and each row is a list of cell values, for example: all_tables[table[rows[data]]]. However, if this argument is True, the table data is stored as a list of pandas.DataFrame objects. In this case, the first row of each table is used as the header, and all remaining rows are treated as data. Note: This will not work properly for all tables. Defaults to False.

  • to_csv (bool, optional) – Dump extracted table data to a CSV file, one per table. Defaults to True.

  • verbose (bool, optional) – Display how many tables were extracted, and the path to their location. Defaults to False.

Keyword Args:

Additional keyword arguments to be added to the table_settings argument of pdfplumber’s extract_table() method.
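The nested-list layout described for as_dataframe can be pictured with a small standalone sketch. It uses made-up table data and a hypothetical helper, split_header, and calls neither docp nor pandas; it only illustrates the all_tables[table[rows[data]]] shape and the "first row is the header" convention.

```python
# Illustrative data only: a list of tables, each a list of rows,
# each row a list of cell values.
all_tables = [
    [['name', 'qty'], ['bolt', '4'], ['nut', '8']],
    [['id', 'status'], ['1', 'ok']],
]


def split_header(table):
    """Hypothetical helper: first row is the header, the rest is data."""
    header, *rows = table
    return header, rows


header, rows = split_header(all_tables[0])

# Pair each data row with the header, one dict per row.
records = [dict(zip(header, row)) for row in rows]
print(header)
print(records)
```

A table with merged cells or a multi-row header would not fit this convention, which is in line with the note that the DataFrame conversion will not work properly for all tables.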

extract_text(*, remove_header: bool = False, remove_footer: bool = False, remove_newlines: bool = False, ignore_tags: set = None, convert_to_ascii: bool = True, x_tolerance: int = 3, y_tolerance: int = 3, **kwargs)

Extract text from the document.

If the PDF document contains ‘marked content’ tags, these tags are used to extract the text as this is a more accurate approach and respects the structure of the page(s). Otherwise, a bounding box method is used to extract the text. If instructed, the header and/or footer regions can be excluded.

Tip

If tag-based extraction is used, the header and footer should be excluded automatically, as these will often carry an ‘Artifact’ tag. This tag is ignored by default, i.e. when ignore_tags=None is passed.

To keep the header and footer, pass ignore_tags='na'.

A list of pages with their extracted content can be accessed using the self.doc.pages attribute.

Tip

Alldocumenttextisrunningtogether

If, when examining the parsed content, e.g.:

pdf.doc.pages[2].content

you observe that alltexthasbeenruntogether, this is a sign that marked content tags were not available for processing, so OCR was employed.

To add separation to the words, the x_tolerance keyword argument can be passed in with a value < 3 (as the default value is 3). For example:

pdf.extract_text(x_tolerance=2)

Re-examine the parsed content and the words should now be separated.
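To see why a smaller x_tolerance separates words, the following is a simplified model of tolerance-based spacing. It is a sketch under stated assumptions, not pdfplumber's implementation: the function join_chars and the (text, x0, x1) tuples with their coordinates are hypothetical and exist only for illustration.

```python
def join_chars(chars, x_tolerance=3):
    """Join characters left to right, adding a space where the
    horizontal gap to the previous character exceeds x_tolerance.

    Each character is a hypothetical (text, x0, x1) tuple.
    """
    out = []
    prev_x1 = None
    for text, x0, x1 in chars:
        if prev_x1 is not None and (x0 - prev_x1) > x_tolerance:
            out.append(' ')
        out.append(text)
        prev_x1 = x1
    return ''.join(out)


# Made-up coordinates: the gap between 'i' and 't' is 2.5 units,
# all other gaps are well under 1 unit.
chars = [('h', 0.0, 5.0), ('i', 5.2, 7.0), ('t', 9.5, 12.0), ('o', 12.1, 17.0)]

print(join_chars(chars, x_tolerance=3))
print(join_chars(chars, x_tolerance=2))
```

With these coordinates the 2.5-unit gap is below the default tolerance of 3, so the characters run together as 'hito'; with a tolerance of 2 the same gap is treated as a word break, giving 'hi to'.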

Parameters:
  • remove_header (bool, optional) – If True, the header is cropped (skipped) from text extraction. This only applies to the bounding box extraction method. Defaults to False.

  • remove_footer (bool, optional) – If True, the footer is cropped (skipped) from text extraction. This only applies to the bounding box extraction method. Defaults to False.

  • remove_newlines (bool, optional) – If True, the newline characters are replaced with a space. Defaults to False.

  • ignore_tags (set, optional) – If provided, these are the PDF ‘marked content’ tags which will be ignored. Note that the PDF document must contain tags, otherwise the bounding box method is used and this argument is ignored. Defaults to None, which is treated as {'Artifact'}, as these tags generally relate to a header and/or footer. To include all tags (i.e. not skip any), pass this argument as 'na'.

  • convert_to_ascii (bool, optional) – When a non-ASCII character is found, an attempt is made to convert it to an associated ASCII character. If a character cannot be converted, it is replaced with a '?'. Defaults to True.

  • x_tolerance (int, optional) – Adds space where the difference between x1 of one character and the x0 of the next character is greater than x_tolerance. Defaults to 3.

  • y_tolerance (int, optional) – Adds space where the difference between y1 of one character and the y0 of the next character is greater than y_tolerance. Defaults to 3.

Keyword Args:

Additional keyword arguments to be passed directly into pdfplumber’s .extract_text method.

Returns:

None.
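The behaviour described for convert_to_ascii can be approximated with the standard library. The sketch below is one possible approach, assuming Unicode decomposition is an acceptable way to find an "associated" ASCII character; the function to_ascii is hypothetical and is not taken from this module.

```python
import unicodedata


def to_ascii(text):
    """Best-effort ASCII conversion; unconvertible characters become '?'."""
    out = []
    for ch in text:
        if ch.isascii():
            out.append(ch)
            continue
        # Decompose the character (e.g. an accented letter into its base
        # letter plus a combining accent) and keep only the ASCII part.
        decomposed = unicodedata.normalize('NFKD', ch)
        base = decomposed.encode('ascii', 'ignore').decode('ascii')
        # Nothing ASCII survived the decomposition: fall back to '?'.
        out.append(base or '?')
    return ''.join(out)


print(to_ascii('caf\u00e9 costs 3\u20ac'))
```

Here the accented 'é' reduces to 'e', while the euro sign has no ASCII decomposition and is replaced with '?'.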

property pages: list

Accessor to the PDF’s page objects.

property tables: list

Accessor to the PDF’s table objects.