By George Harpur, CEO and Co-Founder at Aluma
PDF is the most
common way to pass documents between organisations, so any
Intelligent Document Processing (IDP) system needs to be able to
extract the contents of PDFs ahead of any other processing. Easy,
right? Well... no. Reliable, efficient, and accurate content
extraction from diverse sources is far more challenging than it
appears. To find out why, read on...
The PDF Puzzle: A History of Complexity
To understand the
challenges with the PDF ('Portable Document Format'), it's useful to
know a bit about its history. It was originally introduced as a
proprietary format by Adobe in 1993, and became an open ISO standard
in 2008. It was based on Adobe's PostScript, a language for
communicating exactly how pages should be laid out by a printer:
Adobe created PDF by modifying PostScript a little, and making it
applicable to screens as well as printers. It has since become the de
facto standard for fixed-format documents (those where the page size
and layout are fully specified, as opposed to HTML, for example, where
the browser has flexibility in how it chooses to render the content).
Due to its long
history, layout-centric origins, and multiple bolt-on features, PDF
text extraction is complex. While PDF/A (a stricter version of the
standard) helps somewhat, its adoption is limited. In practice, the
only rule for PDF generation is ‘as long as it displays OK in a
viewer, anything goes’, and this leads to all sorts of weird and
wonderful ways that documents can be constructed internally.
Decoding PDF Content: Text, Images and Vectors
As a 'universal'
document format, PDF offers three main ways to represent text.
The first is by placing specific characters in a particular font
on the page – this is common for documents that have been converted
from another electronic file, such as Microsoft Word, and is
sometimes called 'electronic content'. The second is by putting a
bitmap picture of the text on the page – a document that has come
directly from a scanner or camera will be represented this way. The
third, less common, is that the text is 'drawn' on the page using
vector graphics (sets of lines) without any reference to character
codes – packages that directly create PDF output may take this
approach to avoid the need to use and embed fonts. It is relatively
easy (in theory at least) to extract text of the first type, but if a
page contains either of the other two, it will always need to be
subjected to some form of OCR (Optical Character Recognition), which
is the process of converting pictures to text.
The three ways of representing a character within a PDF: as a defined character within a font; as a bitmap; or as a set of vectors
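To make the three cases above concrete, here is a minimal per-page triage sketch. It assumes the open-source PyMuPDF library and a hypothetical input file name; a real IDP system would apply far more nuanced checks, but the idea is simply to look at whether a page carries embedded text, bitmap images, or vector drawings.

```python
# A rough per-page triage sketch, assuming PyMuPDF ("pip install pymupdf").
import pymupdf  # imported as "fitz" in older releases

def triage_page(page: pymupdf.Page) -> str:
    """Roughly classify a page by the kind of text-bearing content it holds."""
    text = page.get_text("text").strip()   # electronic (font-encoded) text, if any
    images = page.get_images(full=True)    # bitmap images placed on the page
    drawings = page.get_drawings()         # vector paths (lines, curves, fills)

    if text and not images and not drawings:
        return "electronic text"
    if images and not text:
        return "bitmap only - OCR needed"
    if drawings and not text:
        return "possibly vector-drawn text - OCR needed"
    if not text and not images and not drawings:
        return "blank page"
    return "hybrid - needs region-by-region analysis"

doc = pymupdf.open("incoming.pdf")         # hypothetical input file
for number, page in enumerate(doc, start=1):
    print(f"page {number}: {triage_page(page)}")
```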
It is also worth noting that there are other forms of PDF content beyond these three, including form fields (areas that can be filled in dynamically), annotations, embedded metadata, and 'rich media' which includes sound, video and 3D graphics – the specification is extensive! To further complicate things, a PDF document, and even a single page, may be a hybrid of any of these content types. Common examples include:
- a mixture
of characters and images containing text. This can be seen in some
invoices, for example, where the main body of the invoice has been
produced directly by the accounting package, but the header (which
includes critical information such as the supplier's name and
address) has been provided as a bitmap. It can also occur in PDF
renditions of Word documents or web pages, where the content
contains images (such as diagrams) which nevertheless contain useful
text.
- images with
invisible text behind. This variant is seen when a scanned image has
already been subject to OCR, and the OCR package has kept the exact
look of the original document (the bitmap), but embedded the text by
putting invisible characters on the page, so that the document can
be searched or indexed by a document management system, for example.
If you've ever opened a scanned document in Acrobat or a browser and
been able to select and copy text, you've seen an example of this.
When you receive such a document from a third party it is not wise
to trust the quality of the OCR text, however, since it may have
been processed by a mediocre engine, and so for best accuracy such
documents should always be re-OCRed by an IDP system (a simple way of
spotting these pages is sketched after this list).
- a 'package'
of documents generated by multiple disparate systems, each of which
uses a different generation mechanism, resulting in hybrids of all
three methods. This is commonly seen in mortgage loan processing or
health insurance claims processing, for example.
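Because these hybrids are so common, a parser cannot simply trust the first content type it finds on a page. For the second case above (scans with an invisible text layer), one rough way to spot candidates is to check whether a page's text blocks sit on top of its images, as in the heuristic sketch below. It again assumes PyMuPDF; a production system would also inspect the text rendering mode in the content stream rather than rely on geometry alone.

```python
# Heuristic sketch: if almost all of a page's text blocks lie inside image
# areas, it is probably a scan that already carries an (untrusted) OCR layer.
import pymupdf

def looks_like_preocr_scan(page: pymupdf.Page, threshold: float = 0.9) -> bool:
    layout = page.get_text("dict")   # text and image blocks with bounding boxes
    image_rects = [pymupdf.Rect(b["bbox"]) for b in layout["blocks"] if b["type"] == 1]
    text_rects = [pymupdf.Rect(b["bbox"]) for b in layout["blocks"] if b["type"] == 0]
    if not image_rects or not text_rects:
        return False
    covered = sum(1 for t in text_rects if any(img.contains(t) for img in image_rects))
    return covered / len(text_rects) >= threshold
```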
The Pitfalls of PDF Text Extraction
Although it appears that the first content type (direct text) should be simple to read, because it doesn't require OCR, even this can have several complications:
- The modern standard for character encoding is called Unicode, and the PDF/A standard requires it, but many other PDFs, particularly those produced by legacy systems, may use different and sometimes completely arbitrary codes for the characters, making it difficult or impossible to convert these to their Unicode equivalents. If you've ever copied and pasted text from a PDF into a text editor and ended up with gibberish, you've seen an example of this. If the character set cannot be inferred, we need to fall back to using OCR in such cases (a simple heuristic for detecting this kind of garbage is sketched after this list).
- PDF generators do not necessarily embed text in the correct logical order, so it's important to analyse the layout and reorder the text appropriately ahead of further processing.
- There are various other oddities and shortcuts used by some PDF creators which need to be detected and handled. One example is to create a bold effect by rendering the text a second time with a slight offset rather than using a bold font – the parser needs to spot and deal with this to avoid ending up with two copies of the text.
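For the encoding problem in the first bullet, one practical safeguard (sketched below, purely as an illustration rather than a standard algorithm) is to sanity-check extracted text before trusting it: a high proportion of replacement characters, private-use code points, or other unprintable output is a strong hint that the font's character codes could not be mapped to Unicode and the page should be sent to OCR instead.

```python
import unicodedata

def looks_like_garbage(text: str, max_bad_ratio: float = 0.2) -> bool:
    """Flag extracted text that probably failed to map cleanly to Unicode."""
    if not text.strip():
        return True
    bad = 0
    for ch in text:
        if ch == "\ufffd":                 # explicit Unicode replacement character
            bad += 1
            continue
        category = unicodedata.category(ch)
        # Co = private use, Cn = unassigned, Cc = control (other than whitespace)
        if category in ("Co", "Cn") or (category == "Cc" and ch not in "\n\r\t"):
            bad += 1
    return bad / len(text) > max_bad_ratio

print(looks_like_garbage("Invoice number: 12345"))     # False
print(looks_like_garbage("\uf041\uf042\uf043\uf044"))  # True (private-use codes)
```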
Why Not Just OCR Everything?
Given these complexities, a simplistic approach would be to OCR every page, and indeed many IDP systems do just that. There are, however, a few reasons why this is not a good idea if efficiency and accuracy are your goals:
1) Efficiency: OCR is a compute-intensive task requiring significant CPU/GPU usage, so rendering every page to an image and then processing that image back to text is going to result in much greater processing times and cost than simply extracting the text directly whenever possible. As an efficiency fanatic, it breaks my heart a little when I see this happening, although admittedly, it's a little better than the early days of scan-based capture systems when often the only way to handle electronic documents was to print them out and scan them back in!
2) Accuracy: although modern OCR engines are highly accurate, there are still ambiguities in many fonts where the OCR may not deliver back the same Unicode character originally embedded in the document. A classic case is trying to distinguish between ‘O’ and ‘0’ when it is not clear from the context, such as in an alpha-numeric code. A more extreme example of this is that, at the time of writing, the Google OCR engine sometimes reads the text 'MIN' as the Greek characters mu, iota and nu, even in documents that are otherwise in English. In most fonts, these characters look identical to their Latin counterparts but map to different Unicode code points – if uncorrected, this can cause great confusion in subsequent processing! Extracting the original Unicode straight from the document (when present) avoids the risk of such ambiguities.
3) Completeness: sometimes there is electronic text in a document that is not fully visible but which we still need to extract. In a PDF form, for example, the user may have entered more text in a field than fits in the box provided: in the rendered version of the page, the text is not fully visible – it is still there, however, and can be extracted by direct parsing, just not by the 'render and OCR' approach. Another case is where comments are annotated on the document: sometimes it is useful to capture and process these, but they are not typically visible in an image-rendered version (a sketch of extracting both form fields and annotations follows this list).
4) File size and fidelity: fully-featured IDP systems use an input PDF for many things other than just content extraction: it may be auto-separated into individual documents; it will often be displayed to the user for human-in-the-loop validation of the extracted data; it may be further processed (e.g. to redact sensitive data); and the final version, enhanced with OCR content, is commonly exported for downstream use in workflow, document or records management systems. It is critical therefore that: (a) the file size be kept small (for efficiency of rendering in a browser, and for minimising long-term storage costs); and (b) the final PDF be as faithful as possible to the original. An image-rendered version of a previously electronic PDF is likely worse on all counts: it can easily be 5-10 times the size (often more when dealing with color documents); key features like form fields and annotations will be lost; and the image quality is reduced (zooming in produces a 'blocky' image because individual pixels start to be visible).
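To make point 3 concrete, the fragment below (a sketch assuming PyMuPDF and a hypothetical file name) pulls form-field values and annotation comments directly from the PDF; neither would survive a render-and-OCR round trip, and real code would of course handle many more field and annotation types.

```python
import pymupdf

doc = pymupdf.open("filled_form.pdf")      # hypothetical input file
for page in doc:
    # Form fields: the full entered value, even where it overflows the visible box.
    for widget in page.widgets():
        print("field:", widget.field_name, "=", widget.field_value)

    # Annotations: comments that an image-rendered version would not show.
    for annot in page.annots():
        content = annot.info.get("content", "")
        if content:
            print("annotation:", content)
```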
AI’s Role in PDF Processing: Hype vs. Reality
Riding on the latest wave of AI hype, it's fashionable to claim that OCR has been fully replaced by recent AI models, leading to eye-catching headlines like 'OCR is dead'. As with all hype, there's a kernel of truth in this, but it's overblown and misleading. What is true is that when researchers extended LLMs (Large Language Models) to handle inputs other than just text, resulting in LMMs (Large Multi-modal Models), they found that the system could be trained to extract text directly from images without requiring an explicit OCR step – OCR, if you like, was an 'emergent property' of the model. What's also true is that there are some interesting benefits to this approach, most notably that the model can be contextually aware of other layout features like lines and logos which are often removed as unwanted 'noise' by dedicated OCR engines at an early stage of processing – this context can help with some more complex tasks like table extraction.
However, here's why the rumours of OCR's death are greatly exaggerated:
- if you define OCR as the process of converting images to text, then LMMs are still doing OCR, just in a slightly different form than dedicated OCR engines.
- just because LMMs can perform the OCR function doesn't mean they're going to do it better or faster than a dedicated OCR engine, and in fact this is unlikely to be the case. As evidence, most of the major LMM providers (including OpenAI and Google) call upon a dedicated OCR engine internally when given an image that requires it. They feed both the OCR results (in the form of a 'text embedding') and an encoding of the original image (an 'image embedding') into the main model.
- the simplistic approach of just feeding a PDF to an LMM is great for those who want an off-the-shelf way to answer simple queries about a document, but is not sufficient for serious IDP systems, where we need more than just the extraction result. In particular, current LMMs are not able to give accurate information about exactly where a particular piece of text was found (at best they can indicate the approximate area on the page), and this information is critical for feedback to human operators for rapid review (image highlighting) and for embedding the text into the PDF for lasso text selection or downstream indexing and search.
- when OCR is embedded in this way, it is also not possible to extract any detailed confidence values – this information is essential to produce a trustworthy system, and I've covered the topic of confidence in more detail here.
- an LMM call that requires it to handle an image-based PDF can easily take 5 to 10 times as long as the equivalent 'text only' query, and that might be several seconds or more. This only becomes worse when the PDF runs to multiple pages, and for many models it becomes impossible to get any response at all beyond a certain page count, either because of timeouts or system-imposed page limits.
- for more advanced processing, multiple queries may need to be made to the LMM, and in most cases this will mean that the PDF has to be processed, from scratch, multiple times, further compounding the efficiency issues and resulting in additional cost and unacceptably long processing times.
So, when some fresh-faced startup has found that they can throw a PDF into an LMM, ask it a few pertinent questions, and claim to have an IDP system 'without OCR', they are actually doing OCR, one way or the other; they just don't know it! They're also doing it in a way that's often non-optimal, and fails to provide critical outputs like position and confidence data.
Aluma’s ‘Smart OCR’ Solution
At Aluma, we love to come up with neat solutions to tricky problems like this, and when you're processing millions of pages a day, any compromise on processing efficiency or accuracy is simply not acceptable. Our approach to the PDF-handling dilemma is a multi-step process that we call 'Smart OCR':
- on receipt of a PDF, we analyse every page to check for content (images or particular forms of vector graphics) that may benefit from OCR;
- we strip out any existing hidden OCR text, as it may not be trustworthy;
- we extract all text that's directly embedded within the document, checking for cases where this results in garbage, as discussed earlier – any such pages will also be marked for OCR;
- we OCR (in parallel, for speed) just those pages that require it;
- we merge in the OCR results as hidden text, but only in those locations on the page that need it, to ensure there's no repeated text in places where the content was already electronic.
The 5-step Smart OCR process
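For the technically curious, the skeleton below sketches part of this flow in code. It is a deliberate simplification, not our implementation: it assumes PyMuPDF for the parsing side, reuses the looks_like_garbage() helper sketched earlier, uses a placeholder run_ocr() where a real OCR engine would be called, and omits the hidden-text stripping and merge-back steps, which require low-level manipulation of the PDF content.

```python
from concurrent.futures import ThreadPoolExecutor
import pymupdf

def needs_ocr(page: pymupdf.Page) -> bool:
    # Pages with images or vector graphics but no embedded text are OCR candidates.
    has_text = bool(page.get_text("text").strip())
    return not has_text and bool(page.get_images(full=True) or page.get_drawings())

def run_ocr(page: pymupdf.Page) -> str:
    # Placeholder: a real system would render the page and call an OCR engine here.
    raise NotImplementedError("plug in your OCR engine of choice")

def smart_ocr_sketch(pdf_path: str) -> dict[int, str]:
    """Return per-page text, OCRing only the pages that need it."""
    doc = pymupdf.open(pdf_path)
    texts: dict[int, str] = {}
    to_ocr = []
    for page in doc:
        embedded = page.get_text("text")
        # Steps 1 and 3: analyse the page and check the embedded text isn't garbage.
        if needs_ocr(page) or looks_like_garbage(embedded):
            to_ocr.append(page)
        else:
            texts[page.number] = embedded
    # Step 4: OCR just the flagged pages, in parallel for speed.
    with ThreadPoolExecutor() as pool:
        for page, ocr_text in zip(to_ocr, pool.map(run_ocr, to_ocr)):
            texts[page.number] = ocr_text
    # Steps 2 and 5 (not shown): strip any pre-existing hidden text, then merge the
    # OCR results back in as hidden text, only where content was not already electronic.
    return texts
```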
None of the above precludes the use of LMMs when the situation demands it – we will often feed the appropriate pages of a PDF to an LMM when performing complex table extraction, for example, but because we already have independent OCR in this case, we can cross-check the responses against the document text to guard against LMMs' well-known propensity to hallucinate (discussed here).
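As a concrete illustration of that cross-check (a simplified sketch, not our actual implementation), the snippet below verifies that each value an LMM claims to have extracted actually appears, after light normalisation, somewhere in the document text, and flags anything that doesn't for human review; real systems use fuzzier, position-aware matching.

```python
import re

def normalise(s: str) -> str:
    # Collapse whitespace and ignore case so trivial formatting differences don't matter.
    return re.sub(r"\s+", " ", s).strip().lower()

def verify_against_text(extracted: dict[str, str], document_text: str) -> dict[str, bool]:
    """For each field, report whether the LMM's value is actually present in the text."""
    haystack = normalise(document_text)
    return {field: normalise(value) in haystack for field, value in extracted.items()}

# Hypothetical example: the 'total' value does not appear in the document text,
# so it is flagged as a possible hallucination and routed for human review.
fields = {"invoice_number": "INV-1042", "total": "£1,999.00"}
text = "Invoice INV-1042 ... Total due: £199.00"
print(verify_against_text(fields, text))   # {'invoice_number': True, 'total': False}
```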
Although complex, Smart OCR delivers what we believe is the best possible blend of processing speed and accuracy in text extraction to supply the subsequent steps of document classification and data extraction. Furthermore, it produces output PDFs that are compact and as faithful as possible to the originals, yet still always enhanced with full content for lasso selection, indexing and search.
Conclusion: Choosing the Right PDF Processing System
If your organization receives and needs to process PDFs from a variety of sources, you should be aware that they can be a dog’s breakfast of types and formats. Everyone can process the easy ones; not everyone can deal with the tough ones – make sure you’re using a system that can handle them all well.
About the Author
George Harpur is the CEO and Co-Founder at Aluma. Since receiving a PhD in Machine Learning from the University of Cambridge in the 90s, he has dedicated his career to applying AI and other technologies to document processing. Along the way, this has led to a love/hate relationship with various file formats, and PDF in particular, as he’s learned to come to terms with clients’ expectations that all PDFs, no matter how badly formed, must be processable.