By Ed Wingate, Head of Strategic Partnerships at Aluma

Despite living in an increasingly digital world, many organizations still find themselves needing to deal with large amounts of paper. Banks, insurance companies, legal firms, and government departments regularly need to handle hundreds if not thousands of documents daily and turn them from paper into digital documents and data.

High-volume scanners have existed for many years, as has the capture software to accompany them. In the early days, documents were simply scanned and passed through optical character recognition (OCR) tools to recognize the text on the page. This served to digitize the paper document, but organizations want more than just machine-readable PDFs: they want to extract the data from the document itself.

So technology adapted and advanced to enable data extraction. Techniques such as zonal templates and regular expression matching help, but each requires high levels of upfront configuration, custom coding, and scripting. Even with that upfront effort, further labor is required to handle the typically large number of exceptions generated by incoming documents that do not conform to the pre-built rules.
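To make the exception problem concrete, here is a minimal sketch of rules-based extraction in Python. The two patterns and the sample texts are made up for illustration and are not taken from any particular capture product; they simply show how a rule that works for one layout returns nothing for another.

```python
import re

# Hypothetical rules of the kind a rules-based capture setup might rely on.
INVOICE_NUMBER = re.compile(r"Invoice\s+No\.?\s*[:#]?\s*([A-Z0-9-]+)", re.IGNORECASE)
TOTAL = re.compile(r"Total\s*:?\s*\$?\s*([0-9][0-9,]*\.[0-9]{2})", re.IGNORECASE)


def extract_fields(ocr_text: str) -> dict:
    """Apply each rule to the OCR text; a field is None when its rule does not match."""
    fields = {}
    for name, pattern in (("invoice_number", INVOICE_NUMBER), ("total", TOTAL)):
        match = pattern.search(ocr_text)
        fields[name] = match.group(1) if match else None
    return fields


# A document that follows the layout the rules were written for.
conforming = "Invoice No: INV-1042\nTotal: $1,250.00"

# The same information, worded differently by another supplier.
non_conforming = "Ref INV-1042\nAmount due 1,250.00 USD"

print(extract_fields(conforming))
print(extract_fields(non_conforming))
```

The second document carries the same data but yields no fields, so it lands in the exception queue until someone writes another rule or keys the values by hand.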

This leads us to today, where the capture function of an organization, whether in-house or outsourced to a scanning bureau or business process outsourcing (BPO) agent, is being asked to provide more and more detailed information faster, more accurately, and at lower costs than ever before.

The traditional methods of capturing and processing paper documents cannot scale to deliver this. Adding more professional services upfront to script ever more complex extraction algorithms, or even worse, adding humans to the downstream process slows the process down, often introduces errors, and increases costs. A new way of working is needed to address this challenge.

That new way requires adding document intelligence to the process.

Document intelligence applies a combination of new technologies, including artificial intelligence, computer vision, and machine learning, together with existing technologies such as OCR, to the processing of documents. The result can be a fast and cost-effective way to automate document classification and data extraction, improve accuracy and efficiency, and subsequently provide much higher service levels without increasing cost and resources.

In this blog, we explore three ways that document intelligence can deliver significant benefits to organizations with high-volume scanning requirements.

1 - Improved Accuracy and Productivity

High-volume scanners can process hundreds of pages per minute, and OCR engines can work equally fast. When data extraction is added to the mix, however, humans frequently need to check the results to validate their accuracy.

This has a significant impact on the cost-effectiveness of any scanning operation as the function becomes more reliant on the cost of human resources.

Document intelligence can help in many ways here.

Faster document classification

One of the most challenging parts of document scanning is automating the classification of documents. This sounds simple: telling the difference between an invoice and a statement is easy for a human. It is challenging for automated systems, though, unless large numbers of templates are created during setup. Document intelligence rapidly categorizes documents without requiring user training, so that different file types can be passed to the appropriate processes or specific data extraction engines.

More sophisticated data extraction

Documents are a form of unstructured data. Every document contains information, but its format and layout can vary massively, as can the type and amount of data on it. Document scanning projects aim to extract this unstructured data and turn it into structured data that a computer can understand. For example, an invoice will (at least) contain a date, an invoice number, supplier details, and an invoice total. These items reside on the page in variable locations, but document intelligence uses multiple techniques such as AI/ML and flexible fuzzy logic to determine where these values are on the page. The sophistication of these new algorithms allows for the near elimination of those dreaded “document templates” that sometimes required days or weeks of programming and scripting to create.

Innovations in both these areas significantly reduce the time required to configure very “intelligent” systems that deliver automation while not compromising on accuracy.

2 - Enriched Data

Speed, productivity, and accuracy are, of course, the first priorities for any high-volume scanning operation. But document intelligence provides a platform to deliver much more.

Automatic data validation and error checking

Verifying the accuracy of extracted data has always been the most human-intensive part of high-volume scanning operations. With database integrations, the system can use fuzzy logic to match multiple fields from a document with the corresponding fields of a database record, and automatically verify that the data pulled from the document is accurate. This is frequently used in accounts payable departments, where invoice details can be automatically compared against the recognized supplier database. In this way, even poorly OCR’d documents can provide near 100% accurate data to the system.
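As a rough illustration of multi-field fuzzy matching, the sketch below compares an OCR result against a small in-memory supplier list using Python's standard `difflib` module. The supplier records, the field names, the equal weighting of the two fields, and the 0.8 threshold are all assumptions made for this example; a production system would use its own database and a tuned matching strategy.

```python
from difflib import SequenceMatcher

# Hypothetical supplier records standing in for a real supplier database.
SUPPLIERS = [
    {"id": "S-001", "name": "Northwind Traders Ltd", "postcode": "SW1A 1AA"},
    {"id": "S-002", "name": "Contoso Office Supplies", "postcode": "M1 2AB"},
]


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity between 0.0 (nothing shared) and 1.0 (identical)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_supplier(extracted: dict, threshold: float = 0.8):
    """Return the best-matching supplier record, or None if no record is close enough."""
    best, best_score = None, 0.0
    for record in SUPPLIERS:
        # Average the similarity of both fields so one noisy field can be
        # compensated by a clean one.
        score = (
            similarity(extracted["name"], record["name"])
            + similarity(extracted["postcode"], record["postcode"])
        ) / 2
        if score > best_score:
            best, best_score = record, score
    return best if best_score >= threshold else None


# OCR has misread the letter "o" as the digit "0" in the supplier name.
ocr_result = {"name": "N0rthwind Traders Ltd", "postcode": "SW1A 1AA"}
matched = match_supplier(ocr_result)
print(matched["name"] if matched else "No confident match; route to a human")
```

Once a record is matched, the clean values from the database can replace the noisy OCR values, which is how a poorly recognized document can still yield accurate data.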

In-line data cleansing

The privacy and security of personal information is of supreme importance at all times, especially given regulations such as GDPR and CCPA, and document intelligence can help here.

At a minimum, any document containing personally identifiable information (PII) can be identified as such, allowing it to be treated appropriately. That treatment may include creating two versions: one with the PII intact, for storage in a highly secure area of the organization’s systems, and a second, redacted copy to be stored in a more widely available area of the business.
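The two-version idea can be sketched in a few lines of Python. The patterns below cover only email addresses and US-style Social Security numbers and are purely illustrative; real PII detection needs far broader coverage than two regular expressions, and the function and field names are invented for this example.

```python
import re

# Illustrative patterns only; real PII detection covers far more than these two.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def split_versions(text: str) -> dict:
    """Flag a document that contains PII and produce a redacted copy alongside the original."""
    redacted = text
    found = []
    for label, pattern in PII_PATTERNS.items():
        if pattern.search(redacted):
            found.append(label)
            redacted = pattern.sub(f"[REDACTED {label.upper()}]", redacted)
    return {
        "contains_pii": bool(found),
        "pii_types": found,
        "original": text,      # for the highly secure store
        "redacted": redacted,  # for the widely available store
    }


sample = "Contact jane.doe@example.com, SSN 123-45-6789."
versions = split_versions(sample)
print(versions["redacted"])
```

The `contains_pii` flag is what lets downstream systems decide which storage area, or which processing route, each version is allowed to reach.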

For off-shore processing, documents can be automatically sanitized before they cross organizational or country borders. This opens up low-cost exception handling processes while not undermining security controls on PII and company-sensitive information.

3 - Extend, not replace, existing infrastructure

“My organization has already made significant investments in document capture software - why would I want to replace it?” This is a common question raised by organizations with high-volume scanning operations. The process might not be perfect, cost-effective, or productive — but it is working.

But who said anything about replacing what is already there?

New REST API architectures allow the new to coexist with the old. Put another way, new technologies can be deployed on top of existing older technologies without requiring a complete re-architecting of the solution.

Existing capture systems and processes do a tremendous job at handling non-changing and standard document processing at scale. But what happens when the client introduces a new requirement? How about smaller projects that would help drive volume to the scanning center, but are perhaps not big enough to justify the requisite rescripting of algorithms in older systems? Any modern document intelligence solution will work in tandem with existing capture tools and processes - classifying documents, enriching metadata, and more to add value to the current investment as opposed to ripping and replacing it.
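As a sketch of what working in tandem over REST might look like, the Python below prepares a classification request for a record exported by an existing capture system and merges the answer back into that record. The endpoint URL, the JSON field names, and the response shape are all hypothetical and do not describe any specific vendor's API; the request is only built here, not sent.

```python
import json
from urllib.request import Request

# Hypothetical endpoint used for illustration; not a real service URL.
CLASSIFY_URL = "https://document-intelligence.example.com/v1/classify"


def build_classify_request(document_id: str, ocr_text: str) -> Request:
    """Prepare a POST request carrying OCR text already produced by the capture system."""
    payload = json.dumps({"document_id": document_id, "text": ocr_text}).encode("utf-8")
    return Request(
        CLASSIFY_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def merge_metadata(capture_record: dict, service_response: dict) -> dict:
    """Add the returned document type to the existing record without altering its other fields."""
    enriched = dict(capture_record)
    enriched["document_type"] = service_response.get("document_type")
    return enriched


# A record as the existing capture system might export it (assumed shape).
record = {"document_id": "doc-1", "batch": "2024-01", "text": "Invoice No: INV-1042"}
request = build_classify_request(record["document_id"], record["text"])

# Stand-in for the parsed JSON body the hypothetical service would return.
response = {"document_type": "invoice"}
print(merge_metadata(record, response))
```

The existing capture system stays the system of record; the new service only adds fields to what it already produces.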


High-volume document processing will, over time, disappear, but we have been saying that for the past 20 years. Until we are free of paper that needs processing, we will need to identify more effective ways to convert paper to digital, at lower cost and with greater accuracy. Document intelligence addresses that need but also goes beyond it by offering:

  • Automated document classification
  • Improved accuracy and therefore reduced human costs
  • Simplified, comprehensive, and easy-to-configure data extraction
  • The ability to enrich metadata from external sources
  • Enhanced features such as privacy, security, and watermarking to be applied during the capture process
  • Integration with existing capture hardware and software to leverage existing investment rather than ripping and replacing

These factors provide compelling arguments for considering document intelligence — but the most persuasive argument of all? Document intelligence tools work just as well on digital files in document repositories as on paper documents - making any investment in the technology not only incredibly valuable today but future-proofed for tomorrow too. Now that sounds like an intelligent decision if I ever heard one.

If you would like to learn more, sign up for our webinar: Rethinking High-Volume Document Automation.
