Base (Private) Module: parsers/_pdftextparser.py
- Purpose:
This module provides the logic for parsing text from a PDF document.
- Platform:
Linux/Windows | Python 3.11+
- Developer:
J Berendt
- Email:
Attention
This module is not designed to be interacted with directly, only via the appropriate interface class(es).
Rather, please create an instance of a PDF document parsing object using the following:
Note
Multi-processing
Text extraction through multi-processing has been tested and
is not feasible because the pdfplumber.page.Page object
cannot be pickled. This object was passed into the
extraction method because it provides the extract_text()
function.
Additionally, multi-threading has also been tested and
it was determined to be too complex and inefficient. This was
tested using the concurrent.futures.ThreadPoolExecutor
class and two documents, 14 and 92 pages; the timings are
shown below. The multi-threaded approach took longer to
process and added unnecessary complexity to the code base.
As a side-effect, the pages are processed and stored out of
order which would require a re-order, adding more complexity.
It has therefore been determined that this module will remain single-threaded.
Multi-Thread Timings
Single-threaded:
14 page document: ~2 seconds
92 page document: ~32 seconds
Multi-threaded:
14 page document: ~2 seconds
92 page document: ~35 seconds
- class _PDFTextParser(path: str)
Bases:
_PDFBaseParser
Private PDF document text parser intermediate class.
- Parameters:
path (str) – Full path to the PDF document.
- Example:
Extract text from a PDF file:
>>> from docp_parsers import PDFParser
>>> pdf = PDFParser(path='/path/to/myfile.pdf')
>>> pdf.extract_text()
# Access the content of page 1.
>>> pg1 = pdf.doc.pages[1].content
- extract_text(*, remove_header: bool = False, remove_footer: bool = False, remove_newlines: bool = False, ignore_tags: set = None, convert_to_ascii: bool = True, x_tolerance: int = 3, y_tolerance: int = 3, **kwargs)
Extract text from the document.
If the PDF document contains ‘marked content’ tags, these tags are used to extract the text as this is a more accurate approach and respects the structure of the page(s). Otherwise, a bounding box method is used to extract the text. If instructed, the header and/or footer regions can be excluded.
Tip
If a tag-based extract is used, the header/footer should be automatically excluded as these will often have an ‘Artifact’ tag, which is excluded by default, by passing ignore_tags=None. To keep the header and footer, pass ignore_tags='na'.
A list of pages with extracted content can be accessed using the self.doc.pages attribute.
Tip
Alldocumenttextisrunningtogether
If, when examining the parsed content, e.g.:
pdf.doc.pages[2].content
you observe that alltexthasbeenruntogether, this is a sign that marked content tags were not available for processing, so OCR was employed.
To add separation to the words, the x_tolerance keyword argument can be passed in with a value < 3 (as the default value is 3). For example:
pdf.extract_text(x_tolerance=2)
Re-examine the parsed content and the words should now be separated.
- Parameters:
remove_header (bool, optional) – If True, the header is cropped (skipped) from text extraction. This only applies to the bounding box extraction method. Defaults to False.
remove_footer (bool, optional) – If True, the footer is cropped (skipped) from text extraction. This only applies to the bounding box extraction method. Defaults to False.
remove_newlines (bool, optional) – If True, the newline characters are replaced with a space. Defaults to False.
ignore_tags (set, optional) – If provided, these are the PDF ‘marked content’ tags which will be ignored. Note that the PDF document must contain tags, otherwise the bounding box method is used and this argument is ignored. Defaults to {'Artifact'}, as these generally relate to a header and/or footer. To include all tags (not skip any), pass this argument as 'na'.
convert_to_ascii (bool, optional) – When a non-ASCII character is found, an attempt is made to convert it to an associated ASCII character. If a character cannot be converted, it is replaced with a '?'. Defaults to True.
x_tolerance (int, optional) – Adds space where the difference between x1 of one character and the x0 of the next character is greater than x_tolerance. Defaults to 3.
y_tolerance (int, optional) – Adds a newline where the difference between the doctop of one character and the doctop of the next character is greater than y_tolerance. Defaults to 3.
- Keyword Args:
Keyword arguments to be passed directly into pdfplumber’s .extract_text method.
- Returns:
None.
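The effect of x_tolerance described above can be illustrated with a small, self-contained sketch. This is a simplification for illustration only, not pdfplumber's implementation: the join_chars function is hypothetical, and the character dictionaries are assumed to carry only the text, x0 and x1 keys.

```python
def join_chars(chars: list[dict], x_tolerance: float = 3) -> str:
    """Join characters, adding a space where the horizontal gap is large.

    Hypothetical helper: a space is inserted when the gap between the
    x1 of the previous character and the x0 of the current character
    is greater than ``x_tolerance``.
    """
    out = []
    prev = None
    for ch in chars:
        if prev is not None and ch['x0'] - prev['x1'] > x_tolerance:
            out.append(' ')
        out.append(ch['text'])
        prev = ch
    return ''.join(out)


# Two characters separated by a 2.5 point gap (illustrative values).
chars = [
    {'text': 'a', 'x0': 10.0, 'x1': 15.0},
    {'text': 'b', 'x0': 17.5, 'x1': 22.5},
]
run_together = join_chars(chars, x_tolerance=3)  # gap is within tolerance
separated = join_chars(chars, x_tolerance=2)     # gap exceeds tolerance
```

With the default tolerance the 2.5 point gap is not treated as a word break, which is why lowering x_tolerance can restore the spacing.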
- _extract_text_using_bbox(**kwargs)
Extract text using a bbox for finding the header and footer.
- Keyword Arguments:
Those passed by the caller (extract_text()) to be passed directly into the underlying .extract_text method.
- _extract_text_using_tags(**kwargs)
Extract text using tags.
The tags defined by the ignore_tags argument are skipped.
- Keyword Arguments:
Those passed by the caller, extract_text().
- static _text_from_tags(page: pdfplumber.page.Page, ignored: set) → str
Generate a page of text extracted from tags.
When extracting text from tags, newlines are not encoded and must be derived. For each character on the page, the top and bottom coordinates are compared to determine when a newline should be inserted. If both the top and bottom of the current character are greater than the previous character, a newline is inserted into the text stream.
- Parameters:
page (pdfplumber.page.Page) – Page to be parsed.
ignored (set) – A set containing the tags to be ignored.
- Yields:
str – Each character on the page, provided its tag is not to be ignored. Or, a newline character if the current character’s coordinates are greater than (lower on the page than) those of the previous character.
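The newline derivation described above can be sketched as a generator over plain character dictionaries. This is a minimal sketch, not the module's code: the text_from_tags name is hypothetical, each character is assumed to carry text, top, bottom and tag keys, and the comparison is assumed to be top against top and bottom against bottom. Ignored characters are also assumed not to update the previous position.

```python
from collections.abc import Iterable, Iterator


def text_from_tags(chars: Iterable[dict], ignored: set) -> Iterator[str]:
    """Yield characters, inserting a newline when the text moves down.

    Hypothetical helper: a newline is yielded when both the top and
    bottom of the current character are greater than those of the
    previously yielded character.
    """
    prev = None
    for ch in chars:
        if ch.get('tag') in ignored:
            continue  # Skip characters carrying an ignored tag.
        if (prev is not None
                and ch['top'] > prev['top']
                and ch['bottom'] > prev['bottom']):
            yield '\n'
        yield ch['text']
        prev = ch


# Illustrative input: two characters on one line, an ignored artifact,
# then one character on a lower line.
chars = [
    {'text': 'H', 'top': 10, 'bottom': 20, 'tag': 'P'},
    {'text': 'i', 'top': 10, 'bottom': 20, 'tag': 'P'},
    {'text': 'X', 'top': 10, 'bottom': 20, 'tag': 'Artifact'},
    {'text': '!', 'top': 30, 'bottom': 40, 'tag': 'P'},
]
page_text = ''.join(text_from_tags(chars, ignored={'Artifact'}))
```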
- _uses_marked_content() → bool
Test whether the document can be parsed using tags.
Marked content allows us to parse the PDF using tags (rather than OCR) which is more accurate not only in terms of character recognition, but also with regard to the structure of the text on a page.
- Logic:
If the document’s catalog shows Marked: True, then True is returned immediately.
Otherwise, a second attempt is made which detects marked content tags on the first three pages. If no tags are found, a third attempt is made by searching the first 10 pages. If tags are found during either of these attempts, True is returned immediately.
Finally, if no marked content or tags were found, False is returned.
- Returns:
Returns True if the document can be parsed using marked content tags, otherwise False.
- Return type:
bool
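The staged detection described in the logic above can be sketched without a PDF library by injecting the per-page check. This is a sketch under stated assumptions, not the module's implementation: the function name, the catalog_marked flag and the page_has_tags callable are all hypothetical, and the third attempt is assumed to skip the pages already checked in the second.

```python
from collections.abc import Callable


def uses_marked_content(catalog_marked: bool,
                        npages: int,
                        page_has_tags: Callable[[int], bool]) -> bool:
    """Return True if the document appears to carry marked content.

    Hypothetical helper. ``page_has_tags`` receives a zero-based page
    index and reports whether any tags were found on that page.
    """
    if catalog_marked:
        return True  # The catalog already reports Marked: True.
    start = 0
    for limit in (3, 10):  # First three pages, then up to ten.
        stop = min(limit, npages)
        if any(page_has_tags(i) for i in range(start, stop)):
            return True
        start = stop  # Do not re-check pages from the previous pass.
    return False


# Illustrative probe: tags first appear on the sixth page (index 5).
checked = []

def probe(index: int) -> bool:
    checked.append(index)
    return index == 5

found = uses_marked_content(False, 20, probe)
```

Passing the check in as a callable keeps the expensive per-page work lazy, so pages beyond the first hit are never inspected.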
- _get_crop_coordinates(skip_header: bool = False, skip_footer: bool = False) → tuple[float]
Determine the bounding box coordinates.
These coordinates are used for removing the header and/or footer.
- Parameters:
skip_header (bool, optional) – If True, set the coordinates such that the header is skipped. Defaults to False.
skip_footer (bool, optional) – If True, set the coordinates such that the footer is skipped. Defaults to False.
- Logic:
When excluding a header and/or footer, the following page numbers are used for header/footer position detection, given the length of the document:
Number of pages [1]: 1
Number of pages [2,10]: 2
Number of pages [11,]: 5
- Returns:
A bounding box tuple of the following form, to be passed directly into the Page.crop() method:
(x0, top, x1, bottom)
- Return type:
tuple
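The page selection rule listed under the logic above can be expressed as a small function. This is only a sketch of that mapping: the detection_page name is hypothetical, and raising ValueError for an empty document is an assumption not stated in the source.

```python
def detection_page(npages: int) -> int:
    """Return the page number used for header/footer detection.

    Hypothetical helper implementing the mapping:
    1 page -> page 1, 2 to 10 pages -> page 2, 11 or more -> page 5.
    """
    if npages < 1:
        # Assumption: an empty document is treated as an error.
        raise ValueError('The document must contain at least one page.')
    if npages == 1:
        return 1
    if npages <= 10:
        return 2
    return 5


selected = [detection_page(n) for n in (1, 2, 10, 11, 92)]
```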
- _open() → None
Open the PDF document for reading.
Before opening the file, a test is performed to ensure the PDF is valid. The file must:
exist
be a valid PDF file, per the file signature
have a .pdf file extension
- Other Operations:
Store the pdfplumber parser object returned from the pdfplumber.open() function into the doc.parser attribute.
Store the number of pages into the doc.npages attribute.
Store the document’s meta data into the doc.metadata attribute.
- Raises:
TypeError – Raised if the file type criteria above are not met.
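The three validity checks listed above can be sketched using only the standard library. This is a minimal sketch, not the module's code: the validate_pdf name is hypothetical, and the signature test is assumed to compare the first bytes of the file against the standard %PDF- marker.

```python
from pathlib import Path

PDF_SIGNATURE = b'%PDF-'  # Leading bytes of a PDF file.


def validate_pdf(path: str) -> Path:
    """Return the path if it passes the three checks, else raise TypeError.

    Hypothetical helper: the file must exist, have a .pdf extension
    and start with the PDF file signature.
    """
    fpath = Path(path)
    if not fpath.is_file():
        raise TypeError(f'File not found: {fpath}')
    if fpath.suffix.lower() != '.pdf':
        raise TypeError(f'Not a .pdf file extension: {fpath}')
    with fpath.open('rb') as f:
        if f.read(len(PDF_SIGNATURE)) != PDF_SIGNATURE:
            raise TypeError(f'Invalid PDF file signature: {fpath}')
    return fpath
```

Checking the cheap conditions (existence and extension) first means the file is only opened when it is likely to be a PDF.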
- static _prepare_row(row: list) → str
Prepare a table row for writing to a CSV file.
- Parameters:
row (list) – A list of strings, constituting a table row.
- Processing Tasks:
For each element in the row:
Remove any double quote characters (ASCII and Unicode).
Replace any empty values with 'None'.
If the element contains a comma, wrap the element in double quotes.
Attempt to convert any non-ASCII characters to an associated ASCII character. If the replacement cannot be made, the character is replaced with a '?'.
- Returns:
A processed comma-separated string, ready to be written to a CSV file.
- Return type:
str
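The processing tasks above can be sketched with the standard library. This is an illustrative sketch under stated assumptions, not the module's implementation: the prepare_row and to_ascii names are hypothetical, only the ASCII and curly double quotes are stripped, and the ASCII conversion is assumed to use Unicode NFKD decomposition, which may differ from the conversion the module actually applies.

```python
import unicodedata

# Assumption: ASCII and curly double quotes are the ones removed.
_QUOTES = str.maketrans('', '', '"\u201c\u201d')


def to_ascii(text: str) -> str:
    """Convert each non-ASCII character to ASCII, or '?' if not possible."""
    out = []
    for ch in text:
        if ch.isascii():
            out.append(ch)
            continue
        # Decompose (e.g. an accented letter into letter + accent) and
        # keep only the ASCII part.
        decomposed = unicodedata.normalize('NFKD', ch)
        ascii_part = ''.join(c for c in decomposed if c.isascii())
        out.append(ascii_part or '?')
    return ''.join(out)


def prepare_row(row: list) -> str:
    """Return a comma-separated string built from the row's elements."""
    cells = []
    for cell in row:
        value = '' if cell is None else str(cell)
        value = value.translate(_QUOTES)   # Remove double quotes.
        if not value.strip():
            value = 'None'                 # Replace empty values.
        if ',' in value:
            value = f'"{value}"'           # Protect embedded commas.
        cells.append(to_ascii(value))      # Convert non-ASCII characters.
    return ','.join(cells)


line = prepare_row(['a,b', '', 'caf\u00e9', '\u201cq\u201d', '\u65e5'])
```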
- _scan_common() → list[str]
Scan the PDF document to find the most common lines.
- Rationale:
Generally, the most common lines in a document will be the header and footer, as these are expected to be repeated on each page of the document.
‘Most common’ is defined as a line occurring on 90% of the pages throughout the document. Therefore, only documents with more than three pages are scanned. Otherwise, the 90% may exclude relevant pieces of the document (as was discovered in testing).
- Logic:
For documents with more than three pages, the entire PDF is read through and each line extracted. The occurrence of each line is counted, with the most common occurrences returned to the caller.
The returned lines are to be passed into a page search to determine the x/y coordinates of the header and footer.
- Returns:
For documents with more than three pages, a list containing the most common lines in the document. Otherwise, an empty list is returned.
- Return type:
list
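The counting logic above can be sketched over a list of page texts. This is a minimal sketch, not the module's code: the scan_common name and its parameters are hypothetical, each line is assumed to be counted at most once per page, and 'on 90% of the pages' is assumed to mean at least 90%.

```python
from collections import Counter


def scan_common(pages: list[str], threshold: float = 0.9) -> list[str]:
    """Return the lines found on at least ``threshold`` of the pages.

    Hypothetical helper: documents with three pages or fewer are not
    scanned and produce an empty list.
    """
    if len(pages) <= 3:
        return []
    counts = Counter()
    for text in pages:
        # Count each distinct, non-blank line once per page.
        lines = {ln.strip() for ln in text.splitlines() if ln.strip()}
        counts.update(lines)
    cutoff = threshold * len(pages)
    return [line for line, n in counts.most_common() if n >= cutoff]


# Illustrative ten page document with a repeated header and footer.
pages = [f'ACME Report\nBody text {i}\nConfidential' for i in range(10)]
common = scan_common(pages)
```

The returned lines would then be candidates for the header and footer position search described above.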
- _set_paths() → None
Set the document’s file path attributes.
- property pages: list
Accessor to the PDF’s page objects.
- property tables: list
Accessor to the PDF’s table objects.