As part of the overall document processing workflow, the information extraction step involves extracting key data from documents and entering it into computer systems. Information extraction can be performed on any type of document: structured, semi-structured, or unstructured.
In the case of an invoice, the information to be extracted includes, for example, the company that issued the invoice, the date of issue, and the address. For an identity document, the extraction instead concerns the main personal data, such as name, surname, and residence.
A generic pipeline to extract information from documents involves three steps:
- a first step to locate the text and transcribe it;
- a second step of analysis to identify the information of interest;
- an optional third step for post-processing of the results.
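The three steps can be sketched as three small functions chained together. This is only an illustrative skeleton: the function names, the `Word` structure, the hard-coded sample words, and the "label followed by value" heuristic are all assumptions made for the example. In a real system the first step would call an OCR engine.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Word:
    text: str
    box: tuple  # (x0, y0, x1, y1), hypothetical pixel coordinates


def recognize(image_path):
    """Step 1 (stub): a real system would run OCR on the image here.

    The words below are hard-coded sample data, deliberately out of order.
    """
    return [
        Word("Date:", (10, 40, 60, 55)),
        Word("Invoice", (10, 10, 80, 25)),
        Word("12/03/2024", (70, 40, 160, 55)),
        Word("Issuer:", (10, 70, 70, 85)),
        Word("ACME", (80, 70, 130, 85)),
    ]


def analyze(words):
    """Step 2 (toy heuristic): take the word that follows each 'Label:' token."""
    # Sort top-to-bottom, then left-to-right, to approximate reading order.
    ordered = sorted(words, key=lambda w: (w.box[1], w.box[0]))
    fields = {}
    for label, value in zip(ordered, ordered[1:]):
        if label.text.endswith(":"):
            fields[label.text[:-1].lower()] = value.text
    return fields


def postprocess(fields):
    """Step 3 (optional): normalize the date to ISO 8601, assuming day/month/year."""
    cleaned = dict(fields)
    if "date" in cleaned:
        parsed = datetime.strptime(cleaned["date"], "%d/%m/%Y")
        cleaned["date"] = parsed.date().isoformat()
    return cleaned


def extract(image_path):
    """Run the three steps in sequence."""
    return postprocess(analyze(recognize(image_path)))


print(extract("invoice.png"))
```

Keeping each step behind its own function makes it possible to replace one stage, for example swapping the toy heuristic for a learned model, without touching the others.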
Extracting information from documents is a need that cuts across all sectors. Because the manual process requires a great deal of human involvement, it has become necessary to find automated or semi-automated solutions.
The traditional approach to extracting information from documents relies on rule-based and template-matching techniques. With template matching, a mask is superimposed on the image in order to filter out the document template and highlight the values to be extracted. With a rule-based approach, by contrast, the information of interest is extracted using static rules applied to the output of an optical character recognition (OCR) system. Structured and semi-structured documents can be analyzed with these methods, used alone or in combination, once the technical team has configured the data extraction systems. The configuration is static, so the technical team must handle every variation and every new document type.
These solutions have many limitations, as well as high development and maintenance costs. In addition, they cannot handle unstructured documents, since templates and rules cannot be established a priori for this type of document.
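To make the two traditional techniques concrete, here is a minimal sketch of each. The field names, regular expressions, zone coordinates, and sample data are all hypothetical. The template part is a simplification: instead of masking the image, it keeps the OCR words whose boxes fall inside fixed zones.

```python
import re

# Hypothetical static rules: one regular expression per field,
# applied to the plain text returned by an OCR system.
RULES = {
    "invoice_number": re.compile(r"Invoice\s+No\.?\s*[:#]?\s*(\w[\w-]*)", re.IGNORECASE),
    "issue_date": re.compile(r"Date\s*:\s*(\d{2}/\d{2}/\d{4})", re.IGNORECASE),
}


def extract_with_rules(ocr_text):
    """Rule-based extraction: keep the first match of each pattern."""
    result = {}
    for field, pattern in RULES.items():
        match = pattern.search(ocr_text)
        if match:
            result[field] = match.group(1)
    return result


# Hypothetical template: one fixed zone (x0, y0, x1, y1) per field,
# valid only for documents that share exactly this layout.
TEMPLATE = {
    "issuer": (0, 0, 300, 50),
    "total": (400, 700, 600, 750),
}


def extract_with_template(words, template):
    """Template-style extraction: join the words lying fully inside each zone.

    `words` is a list of (text, (x0, y0, x1, y1)) pairs.
    """
    result = {}
    for field, (zx0, zy0, zx1, zy1) in template.items():
        inside = [
            (box, text)
            for text, box in words
            if zx0 <= box[0] and box[2] <= zx1 and zy0 <= box[1] and box[3] <= zy1
        ]
        # Sort top-to-bottom, then left-to-right, before joining.
        inside.sort(key=lambda item: (item[0][1], item[0][0]))
        if inside:
            result[field] = " ".join(text for _, text in inside)
    return result


sample_text = "ACME S.p.A.\nInvoice No: INV-2024-001\nDate: 12/03/2024\nTotal: 1,250.00"
sample_words = [
    ("S.p.A.", (90, 10, 160, 30)),
    ("ACME", (10, 10, 80, 30)),
    ("1,250.00", (450, 710, 560, 740)),
    ("Date:", (10, 100, 60, 120)),
]

print(extract_with_rules(sample_text))
print(extract_with_template(sample_words, TEMPLATE))
print(extract_with_rules("Fattura n. 7"))  # a wording the rules do not cover
```

The last call returns an empty result because none of the patterns anticipate that wording, which illustrates why a static configuration has to be updated by hand for every new format.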
Machine learning methodologies now make it possible to solve many of the problems associated with traditional solutions. This paradigm shift leads to fully data-driven solutions, whose development and maintenance flow takes the form outlined in the following diagram:
A generic approach involves training a system on a large corpus of documents in order to acquire general knowledge of the application domain. The goal is a system that can generalize to unseen documents, so that it does not require constant reconfiguration to react to format changes or new documents.
These types of solutions do not need the ongoing configuration required to manage new documents. Instead, it is necessary to collect and create a valuable dataset that describes the different cases of the process of interest.
This type of solution can be based on different techniques from the fields of Computer Vision and Natural Language Processing. The most recent proposals use neural networks; the architectures that work best for these tasks are transformer-based models and graph neural networks.
You have documents, we have the solution
myBiros is a new-generation solution for automating document-based processes. It uses the most modern deep learning techniques to overcome the limitations of traditional solutions. myBiros is a no-code Document AI platform that offers ready-to-use use cases and the ability to set up new use cases on the fly with fewer documents.
Unlike traditional solutions, myBiros allows you to automatically process any document, extracting information and data while saving time, reducing costs, and cutting repetitive work.
Through Human-in-the-Loop and Continuous Learning approaches, myBiros makes it possible to constantly improve the performance of the models thanks to human feedback, achieving unprecedented accuracy.
The approach used by myBiros is entirely based on data. This makes the entire pipeline completely adaptable to vertical domains. Exploiting approaches from Computer Vision and Natural Language Processing, myBiros is able to interpret the document using its different characteristics: the text contained, the layout and the image of the document itself.