Information extraction: traditional vs. modern solutions

The article examines how information is extracted from documents. The task is often handled with traditional solutions, which have many limitations. Modern solutions overcome these limitations and adapt to any type of document.

Valerio Caravani

CTO & Co-Founder

In document processing, information extraction consists of locating key information in documents and entering it into computer systems. The task can be performed on any type of document: structured, semi-structured, and unstructured.

In an invoice, for example, the information of interest includes the company that issued the invoice, the date of issue, and the address. In an identity document, on the other hand, extraction concerns the main biographical data, such as first name, last name, and residence.

A generic pipeline for solving this task involves 3 steps:

A first step that locates the text and transcribes it.
A second step that analyzes the transcription to identify the information of interest.
A third, optional step that post-processes the results.
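
The three steps above can be sketched as a chain of functions. This is only an illustrative outline, not the implementation of any specific product: the `Word` type, the function names, the stubbed OCR output, and the date pattern are all assumptions introduced for the example.

```python
import re
from dataclasses import dataclass


@dataclass
class Word:
    """One transcribed word and its position (hypothetical OCR output format)."""
    text: str
    x: int
    y: int


def recognize_text(document_image) -> list[Word]:
    """Step 1: locate and transcribe the text.

    Stubbed with fixed values; a real system would call an OCR engine here.
    """
    return [Word("Invoice", 40, 20), Word("date:", 135, 20), Word(" 2023-03-14 ", 190, 20)]


def find_fields(words: list[Word]) -> dict[str, str]:
    """Step 2: identify the information of interest in the transcription."""
    text = " ".join(word.text for word in words)
    match = re.search(r"date:\s*(\d{4}-\d{2}-\d{2}\s*)", text)
    return {"issue_date": match.group(1)} if match else {}


def post_process(fields: dict[str, str]) -> dict[str, str]:
    """Step 3 (optional): clean up the extracted values."""
    return {name: value.strip() for name, value in fields.items()}


def extract(document_image) -> dict[str, str]:
    """Run the three steps in sequence."""
    return post_process(find_fields(recognize_text(document_image)))
```

Keeping the steps separate means each one can be replaced independently, for example by swapping the stub in `recognize_text` for a real OCR call.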

Information extraction from documents is a cross-industry application. Performed manually, the task requires massive human intervention, which has created the need for solutions that automate or semi-automate the process.

Traditional solutions

Traditional solutions for extracting information from documents are based on rule-based and template-matching approaches. In template-matching approaches, a mask is superimposed on the image to filter out the document template and highlight the values to be extracted. In rule-based approaches, by contrast, the information of interest is extracted by static rules built a posteriori on the output of an Optical Character Recognition (OCR) system. These approaches can work alone or in combination on structured and semi-structured documents, once a technical team has configured the data extraction system. The configuration is static, so the technical team must handle every change and every new document type.
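
To make the two traditional approaches concrete, here is a minimal sketch. Everything in it is an assumption made for illustration: the field names, the regular expressions, the pixel coordinates of the template region, and the `(text, x, y)` format of the OCR words. The template part is simplified to a region lookup on OCR word positions rather than a mask applied to the image itself.

```python
import re

# Hypothetical static rules: one regular expression per field,
# written by hand after inspecting the OCR output of sample invoices.
RULES = {
    "invoice_number": re.compile(r"Invoice\s+No\.?\s*:?\s*(\S+)", re.IGNORECASE),
    "issue_date": re.compile(r"Date\s*:?\s*(\d{2}/\d{2}/\d{4})", re.IGNORECASE),
}

# Hypothetical template: field name -> (left, top, right, bottom) in pixels.
TEMPLATE = {"total": (400, 700, 580, 730)}


def apply_rules(ocr_text: str) -> dict[str, str]:
    """Rule-based extraction: run every static rule on the OCR text."""
    fields = {}
    for name, pattern in RULES.items():
        match = pattern.search(ocr_text)
        if match:
            fields[name] = match.group(1)
    return fields


def apply_template(ocr_words: list[tuple[str, int, int]]) -> dict[str, str]:
    """Template-style extraction: keep the words whose top-left corner
    falls inside the fixed region configured for each field."""
    fields = {}
    for name, (left, top, right, bottom) in TEMPLATE.items():
        inside = [text for text, x, y in ocr_words
                  if left <= x <= right and top <= y <= bottom]
        if inside:
            fields[name] = " ".join(inside)
    return fields
```

With these rules, an invoice that labels its number `Fattura n.` instead of `Invoice No` yields no result until someone adds a new rule, which is the static-configuration cost described above.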

These solutions present numerous limitations and high development and maintenance costs. In addition, this type of approach cannot handle unstructured documents at all, since it is not possible to establish templates and rules a priori for this document type.

Modern solutions

The use of machine learning methodologies has made it possible to overcome many limitations of traditional solutions. The paradigm shift leads to fully data-driven solutions. The flow of developing and maintaining these types of solutions takes the form described by the following diagram:

[Diagram: development and maintenance flow of a data-driven solution]

A generic approach involves training a system on a large corpus of documents so that it acquires general knowledge of the application domain. The goal is a system that can generalize to unfamiliar documents and therefore does not need to be reconfigured every time a format changes or a new document type appears.

Solutions of this type shift the cost from the continuous configuration required to manage new documents to the collection and creation of a high-quality dataset that covers the different cases of the process of interest.

These solutions can draw on different techniques from the fields of Computer Vision and Natural Language Processing. The most recent proposals use neural networks; the architectures that perform best on these tasks are transformer-based models and graph neural networks.
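
One common way to frame extraction with such models, assumed here purely for illustration and not stated in this article, is token classification: a trained model assigns a label to each token, and a small decoding step groups the labeled tokens into fields. The sketch below covers only that decoding step, with hand-written labels standing in for model output. The `B-`/`I-`/`O` labeling scheme and the field names are assumptions.

```python
def group_fields(tokens: list[str], labels: list[str]) -> dict[str, str]:
    """Group per-token labels into field values.

    "B-X" starts field X, "I-X" continues it, and any other label
    (such as "O") marks a token outside every field. If a field starts
    more than once, the last occurrence wins.
    """
    fields: dict[str, list[str]] = {}
    current = None  # name of the field currently being read, if any
    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            current = label[2:]
            fields[current] = [token]
        elif label.startswith("I-") and current == label[2:]:
            fields[current].append(token)
        else:
            current = None
    return {name: " ".join(parts) for name, parts in fields.items()}


# Hand-written labels standing in for the predictions of a trained model.
tokens = ["Issued", "by", "ACME", "S.r.l.", "on", "14/03/2023"]
labels = ["O", "O", "B-COMPANY", "I-COMPANY", "O", "B-DATE"]
fields = group_fields(tokens, labels)
```

Unlike static rules, nothing in this step depends on the layout or wording of a particular document; that knowledge lives in the trained model that produces the labels.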

You have the documents, we have the solution

myBiros is a next-generation solution for automating processes that involve document processing. myBiros leverages the latest deep learning techniques to overcome the limitations of traditional solutions. myBiros is a no-code Document AI platform that offers ready-to-use use cases and the ability to set up new use cases on the fly with a small number of sample documents.

With myBiros it is possible to reduce the document processing costs incurred with traditional solutions. myBiros offers clear savings in terms of time, cost and resources. In addition, the features offered by the platform reduce the cost of collecting data for model training.

Through Human in the Loop and Continuous Learning approaches, myBiros offers the ability to continuously improve model performance through human feedback, achieving unprecedented accuracy and quality of extracted data.

The approach used by myBiros is entirely data-driven. This makes the entire pipeline fully adaptable to vertical domains. Leveraging approaches from Computer Vision and Natural Language Processing, myBiros is able to interpret the document using its different features: the text content, the layout, and the document image itself.

Want to find out more about our solutions? Contact us and try our demo. We are waiting for you!
