Module: parsers/pdfparser.py
- Purpose:
This module serves as the public interface for interacting with PDF files and parsing their contents.
- Platform:
Linux/Windows | Python 3.11+
- Developer:
J Berendt
- Email:
- Comments:
n/a
- Example:
For example code usage, please refer to the PDFParser class docstring.
- class PDFParser(path: str)[source]
Bases:
_PDFTableParser, _PDFTextParser
PDF document parser.
- Parameters:
path (str) – Full path to the PDF document to be parsed.
- Example:
Extract text from a PDF file:
>>> from docp_parsers import PDFParser
>>> pdf = PDFParser(path='/path/to/myfile.pdf')
>>> pdf.extract_text()
>>> # Access the content of page 1.
>>> pg1 = pdf.pages[1].content
>>> pg1
'Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.'
Extract tables from a PDF file:
>>> from docp_parsers import PDFParser
>>> pdf = PDFParser('/path/to/myfile.pdf')
>>> pdf.extract_tables()
>>> # Access the first table.
>>> tbl1 = pdf.tables[1]
- extract_tables(table_settings: dict = None, as_dataframe: bool = False, to_csv: bool = True, verbose: bool = False, **kwargs) → None
Extract tables from the document.
Before a table is extracted, a number of validation tests are performed to verify that what has been identified as a ‘table’ is actually a table that might be useful to the user.
Each ‘valid’ table is written as a CSV file on the user’s desktop.
Additionally, the extracted table data is stored to the class’ tables attribute.
- Parameters:
table_settings (dict, optional) – Table settings to be used for the table extraction. Defaults to None, which is replaced by the value in the config.
as_dataframe (bool, optional) – By default, the extracted tables are returned as a list of (lists of lists), for example: all_tables[table[rows[data]]]. However, if this argument is True, the table data is returned as a list of pandas.DataFrame objects. In this case, the first row of the table is used as the header, and all remaining rows are treated as data. Note: This will not work properly for all tables. Defaults to False.
to_csv (bool, optional) – Dump extracted table data to a CSV file, one per table. Defaults to True.
verbose (bool, optional) – Display how many tables were extracted, and the path to their location. Defaults to False.
- Keyword Args:
Additional keyword args_tbl to be added to the table_settings argument of pdfplumber’s extract_table() method.
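The list-of-lists layout and the first-row-as-header convention described for as_dataframe can be illustrated with a minimal, standard-library-only sketch. The sample data and the helper names rows_to_records and table_to_csv are hypothetical and are not part of this module; the method itself relies on pdfplumber and, optionally, pandas, whose behaviour may differ in detail.

```python
import csv
import io

# Hypothetical sample mirroring the documented layout:
# all_tables[table[rows[data]]]
all_tables = [
    [
        ["name", "qty"],   # First row: treated as the header.
        ["apple", "3"],    # Remaining rows: treated as data.
        ["pear", "5"],
    ],
]


def rows_to_records(table):
    """Treat the first row as the header and the rest as data."""
    header, *data = table
    return [dict(zip(header, row)) for row in data]


def table_to_csv(table):
    """Serialise a single table to CSV text (one CSV per table)."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(table)
    return buffer.getvalue()


records = rows_to_records(all_tables[0])
csv_text = table_to_csv(all_tables[0])
```

A table whose first row is not a true header (for example, one with merged or multi-line header cells) would produce misleading keys here, which is the kind of case the note on as_dataframe warns about.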
- extract_text(*, remove_header: bool = False, remove_footer: bool = False, remove_newlines: bool = False, ignore_tags: set = None, convert_to_ascii: bool = True, x_tolerance: int = 3, y_tolerance: int = 3, **kwargs)
Extract text from the document.
If the PDF document contains ‘marked content’ tags, these tags are used to extract the text as this is a more accurate approach and respects the structure of the page(s). Otherwise, a bounding box method is used to extract the text. If instructed, the header and/or footer regions can be excluded.
Tip
If a tag-based extract is used, the header/footer should be automatically excluded, as these will often have an ‘Artifact’ tag, which is excluded by default by passing ignore_tags=None.
To keep the header and footer, pass ignore_tags='na'.
A list of pages with extracted content can be accessed using the self.doc.pages attribute.
Tip
Alldocumenttextisrunningtogether
When examining the parsed content, e.g.:
pdf.doc.pages[2].content
and you observe alltexthasbeenruntogether, this is a sign that marked content tags were not available for processing, so OCR was employed.
To add separation to the words, the x_tolerance keyword argument can be passed in with a value < 3 (as the default value is 3). For example:
pdf.extract_text(x_tolerance=2)
Re-examine the parsed content and the words should now be separated.
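To see why a smaller x_tolerance separates words, the following is a simplified sketch of the documented rule: a space is added when the gap between the x1 of one character and the x0 of the next exceeds x_tolerance. The function join_chars and the character boxes are hypothetical illustrations, not pdfplumber's or this module's actual implementation.

```python
def join_chars(chars, x_tolerance=3):
    """Join characters left to right, adding a space when the gap is wide."""
    out = []
    prev_x1 = None
    for ch in chars:
        # Gap between the right edge of the previous character (x1)
        # and the left edge of the current character (x0).
        if prev_x1 is not None and ch["x0"] - prev_x1 > x_tolerance:
            out.append(" ")
        out.append(ch["text"])
        prev_x1 = ch["x1"]
    return "".join(out)


# Hypothetical character boxes: "ab", then a gap of 2.5 units, then "cd".
chars = [
    {"text": "a", "x0": 0.0, "x1": 5.0},
    {"text": "b", "x0": 5.0, "x1": 10.0},
    {"text": "c", "x0": 12.5, "x1": 17.5},
    {"text": "d", "x0": 17.5, "x1": 22.5},
]

default_text = join_chars(chars)                 # Gap of 2.5 is not > 3.
tighter_text = join_chars(chars, x_tolerance=2)  # Gap of 2.5 is > 2.
```

With the default tolerance the 2.5-unit gap is not wide enough to count as a word break, so the text runs together; lowering the tolerance to 2 inserts the space.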
- Parameters:
remove_header (bool, optional) – If True, the header is cropped (skipped) from text extraction. This only applies to the bounding box extraction method. Defaults to False.
remove_footer (bool, optional) – If True, the footer is cropped (skipped) from text extraction. This only applies to the bounding box extraction method. Defaults to False.
remove_newlines (bool, optional) – If True, the newline characters are replaced with a space. Defaults to False.
ignore_tags (set, optional) – If provided, these are the PDF ‘marked content’ tags which will be ignored. Note that the PDF document must contain tags, otherwise the bounding box method is used and this argument is ignored. Defaults to {'Artifact'}, as these generally relate to a header and/or footer. To include all tags (not skip any), pass this argument as 'na'.
convert_to_ascii (bool, optional) – When a non-ASCII character is found, an attempt is made to convert it to an associated ASCII character. If a character cannot be converted, it is replaced with a '?'. Defaults to True.
x_tolerance (int, optional) – Adds space where the difference between x1 of one character and the x0 of the next character is greater than x_tolerance. Defaults to 3.
y_tolerance (int, optional) – Adds a newline where the difference between the doctop of one character and the doctop of the next character is greater than y_tolerance. Defaults to 3.
- Keyword Args:
Keyword args_txt to be passed directly into pdfplumber’s .extract_text method.
- Returns:
None.
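The behaviour described for convert_to_ascii (convert where possible, otherwise substitute '?') can be approximated with the standard library. This is only a sketch under the assumption that Unicode decomposition is an acceptable way to find an 'associated' ASCII character; the helper name to_ascii is hypothetical and the module's own conversion may work differently.

```python
import unicodedata


def to_ascii(text):
    """Best-effort ASCII conversion; unconvertible characters become '?'."""
    # Decompose characters, e.g. 'é' becomes 'e' plus a combining accent.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks so only the base characters remain.
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Anything still outside ASCII is replaced with '?'.
    return stripped.encode("ascii", errors="replace").decode("ascii")


accented = to_ascii("naïve résumé")
unconvertible = to_ascii("日本")
```

Accented Latin characters reduce to their base letters, while characters with no ASCII decomposition fall through to the '?' replacement.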
- property pages: list
Accessor to the PDF’s page objects.
- property tables: list
Accessor to the PDF’s table objects.