IDP: automatic document classification

In this article, you will find all the details about automatic document classification in intelligent document processing (IDP): what it is, the steps in the process, the classification methodologies, and the advantages of using this kind of software.

Francesco Cavina

CEO & Co-Founder

Document classification is the process of automatically assigning a category to a document as a whole or to each of its pages.

Automatic classification of a document can be done by following several methodologies:

through transcription and subsequent analysis of the text contained within;
through image analysis of the document;
with hybrid techniques that involve analyzing both the text and its image.

Both supervised and unsupervised machine learning techniques can be used in the intelligent document processing workflow. The unsupervised approach costs less to set up (no data labeling phase is required) but typically offers lower accuracy. Depending on the algorithm used, the model can also provide the user with a confidence score that conveys how reliable the model considers its classification predictions.

So what does automatic document classification consist of? Which processes can benefit from it? What are the different methodologies for performing automatic document classification? What are the limitations and advantages of the different machine learning approaches used to automate these processes? All questions are answered in this article.

Introduction

Document classification (automatic and non-automatic) allows the user to upload different types of documents, either individually or in bulk, and classify them into their respective categories. It is also essential when a single file of many pages contains several distinct documents to analyze. Classification is necessary for the subsequent processing of the different types of documents because it enables, for example, assignment to the right team member for review, processing, and analysis. This operation can be a huge bottleneck for publishers, insurance companies, financial institutions, and many other companies that receive large numbers of disparate documents for processing.

A concrete example is the evaluation process for issuing a mortgage, in which the applicant sends three types of documents, say by e-mail: identity documents, payslips, and the CUD (as proof of income). Before these documents can be processed, they must be sorted into their respective categories, placed in the processing queue, and assigned to the right team member.
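The sorting and assignment step in the mortgage example can be sketched in a few lines of Python. This is only an illustration: the category names, team names, and the route_documents function are hypothetical, and the categories are assumed to come from an upstream classifier.

```python
from collections import defaultdict

# Hypothetical mapping from document category to the team that handles it.
ROUTING_TABLE = {
    "identity_document": "kyc_team",
    "payslip": "income_verification_team",
    "cud": "income_verification_team",
}


def route_documents(classified_docs, routing_table=ROUTING_TABLE, fallback="manual_review"):
    """Group (filename, category) pairs into per-team processing queues.

    Categories missing from the routing table go to the fallback queue.
    """
    queues = defaultdict(list)
    for filename, category in classified_docs:
        queues[routing_table.get(category, fallback)].append(filename)
    return dict(queues)


# Example batch, as it might arrive in a single e-mail (invented file names).
batch = [
    ("id_card.jpg", "identity_document"),
    ("march_payslip.pdf", "payslip"),
    ("cud_2023.pdf", "cud"),
    ("unknown_scan.png", "other"),
]
queues = route_documents(batch)
```

Sending unrecognized categories to a fallback queue keeps unexpected documents visible instead of silently dropping them.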

Document classification methods

The two main methodologies for classifying a document are manual and automatic.

Many companies still rely on manual document classification in their workflow, with the drawbacks that come with it. Small companies that process a small volume of documents typically handle the manual process in-house, while large organizations with massive volumes often outsource the work. Besides being time-consuming, manual classification is error-prone, expensive, and inefficient. In addition, more complex cases require trained staff capable of understanding the documents to be classified; think, for example, of the classification of legal documents in debt collection.

The main disadvantages of a manual approach can be summarized as:

Excessive processing time - The cost of the time required to process a large volume of documents can be critical.
Subjectivity - Human operators often have biases that lead them to classify documents subjectively, incurring classification errors.

In manual classification steps, a clerk often spends about 20-40% of the time retrieving documents and the remaining time processing them.

By contrast, IDP technology can automate document management and processing, lowering costs and time throughout the pipeline.

Automatic Classification

Automatic document classification solutions are typically faster and more accurate than manual classification. In addition, a human-in-the-loop (HITL) approach allows the remaining errors to be corrected and minimized. Beyond classifying documents automatically, an IDP solution allows the process to be structured more effectively, with benefits such as the ability to:

Scan documents in no particular order and without inserting separators between documents;
Automatically send the document to the right department for processing;
Automatically classify single and multiple page documents;
Automate audits on sensitive processes through scoring mechanisms.
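To illustrate how a scanned batch without separators might be split, the sketch below groups consecutive pages that received the same predicted class into one document. It assumes a page-level classifier has already produced one label per page; the function name and labels are hypothetical. Note the limitation: two consecutive documents of the same class would be merged, so a real system would also need some form of first-page detection.

```python
from itertools import groupby


def split_by_page_class(page_labels):
    """Return (label, first_page, last_page) for each run of identical labels.

    Pages are numbered from 1. Consecutive pages with the same label are
    assumed to belong to the same document (a simplifying assumption).
    """
    segments = []
    page = 1
    for label, run in groupby(page_labels):
        length = len(list(run))
        segments.append((label, page, page + length - 1))
        page += length
    return segments


# One predicted label per scanned page (invented example).
labels = ["id", "id", "payslip", "cud", "cud", "cud"]
segments = split_by_page_class(labels)
```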

Process steps

An IDP process typically uses deep learning techniques to identify the document class, preceded by several preliminary steps.

Identification of file format

IDP solutions typically handle a variety of formats. At this stage, the most relevant information is whether the document is a digital PDF or an image (JPG, PNG, TIFF, etc.). For images, an additional OCR step is usually required to extract the text contained in the document.
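As a rough illustration of this step, the container format can often be recognized from the first bytes of the file. The sketch below checks well-known file signatures; the function names detect_format and needs_ocr are hypothetical. It also assumes that every PDF is digital, which is not always true: a PDF can simply wrap scanned images, and checking for a text layer requires a PDF library.

```python
# Well-known file signatures ("magic numbers") at the start of each format.
MAGIC_NUMBERS = [
    (b"%PDF-", "pdf"),
    (b"\xff\xd8\xff", "jpeg"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"II*\x00", "tiff"),  # little-endian TIFF
    (b"MM\x00*", "tiff"),  # big-endian TIFF
]

IMAGE_FORMATS = {"jpeg", "png", "tiff"}


def detect_format(header: bytes) -> str:
    """Guess the file format from its first bytes; 'unknown' if no match."""
    for magic, name in MAGIC_NUMBERS:
        if header.startswith(magic):
            return name
    return "unknown"


def needs_ocr(header: bytes) -> bool:
    """Images need OCR; PDFs are assumed here to carry a text layer."""
    return detect_format(header) in IMAGE_FORMATS
```

In practice the header would be read with something like open(path, "rb").read(8) before calling detect_format.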

Identification of document type

Depending on the type of document, different techniques can be used that take advantage of particular features of the document. The main features are the image, the text, and the document geometry (the coordinates of the text on the page).

The main categories of documents can be summarized as:

Structured documents - These documents are typically homogeneous in format and information content and can often be processed using a purely positional approach. A typical example is an identity document.
Semi-structured documents - These documents contain a predetermined set of information or tables but can vary greatly in terms of template and location of information. In this case it is useful to analyze the text, its position, and the image of the document. A typical example is an invoice.
Unstructured documents - These documents do not follow a format and may contain highly variable information. In this case natural language analysis, and in some cases the geometry and image of the document, can be used to process the document. A classic example is a contract.

It is important to have a clear idea of the types of documents you want to process, so that you can build a well-performing pipeline around the algorithm that best fits the specific use case.

Document class identification

At this stage, an attempt is made to automatically identify the category to which the document belongs. Usually this phase is broken down into several steps.

1. Pre-processing

In many IDP processes, preliminary steps must be taken before the document can be properly classified. Typically, documents are binarized and rotated, and noise is removed to improve their quality and readability.
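Binarization, for instance, can be illustrated with Otsu's method, a classic technique that picks the grey-level threshold maximizing the variance between the dark and light pixel classes. The following is a minimal pure-Python sketch on a tiny greyscale image represented as nested lists. It is offered as an assumption for illustration, not as the method any particular IDP product uses, and a production pipeline would normally rely on an image-processing library.

```python
def otsu_threshold(pixels):
    """Return the grey level t (0-255) that maximizes between-class variance.

    Pixels with value <= t form the dark class, the others the light class.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels in the dark class so far
    sum0 = 0  # sum of grey levels in the dark class so far
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue  # dark class still empty
        w1 = total - w0
        if w1 == 0:
            break     # light class empty: no further split possible
        mu0 = sum0 / w0
        mu1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(image, threshold):
    """Map each pixel to 1 (light) if above the threshold, else 0 (dark)."""
    return [[1 if p > threshold else 0 for p in row] for row in image]


# Tiny invented greyscale image: dark ink on the left, light paper on the right.
image = [[12, 200], [15, 220]]
threshold = otsu_threshold([p for row in image for p in row])
binary = binarize(image, threshold)
```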

2. OCR

If textual features are to be exploited (typically through Natural Language Processing techniques), it is necessary to obtain a transcription of the document (unless it is a digital PDF). Many people neglect this step and simply rely on traditional OCR engines, but in reality accurate transcription is critical to correctly classifying a complex document. In a high-performing IDP flow, the ability to retrain the OCR engine can be important to reduce errors and to process documents that are difficult to read.

3. Document classification

The main methodologies are:

(i) Visual approach

In this case, computer vision techniques make it possible to analyze the visual appearance of the document without having to transcribe it. Recurring positions of information or a recurring layout allow the document to be classified automatically. These techniques work well on structured documents and, if you have enough data, even on semi-structured documents. One of the advantages of this approach is that it does not require an OCR step, because it works directly on the image.

(ii) Text-based approach

By taking advantage of NLP techniques, it is possible to analyze the text contained in the document automatically and determine which category the document belongs to. These methodologies make it possible to effectively process even unstructured documents such as contracts. However, in many cases the inability to analyze the image and geometry of the document is a major source of errors.
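A minimal sketch of the text-based approach, assuming the transcription is already available, is a bag-of-words multinomial Naive Bayes classifier with Laplace smoothing. This is a deliberately simple stand-in for the NLP models used in practice; the class name, training texts, and labels are invented for illustration.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesTextClassifier:
    """Multinomial Naive Bayes over whitespace-separated, lowercased words."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)       # documents per class
        self.word_counts = defaultdict(Counter)   # word frequencies per class
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        scores = {}
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            # Laplace smoothing: add 1 to every word count in the vocabulary.
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for word in text.lower().split():
                if word in self.vocab:             # ignore unseen words
                    score += math.log((counts[word] + 1) / denom)
            scores[label] = score
        return max(scores, key=scores.get)


# Tiny invented training set of already-transcribed documents.
train_texts = [
    "invoice number total amount due vat",
    "invoice total vat payment terms",
    "employment contract parties agree terms clause",
    "contract clause termination parties liability",
]
train_labels = ["invoice", "invoice", "contract", "contract"]

clf = NaiveBayesTextClassifier().fit(train_texts, train_labels)
prediction = clf.predict("total amount due on this invoice")
```

Because the model only sees words, two documents with the same wording but very different layouts look identical to it, which is the limitation described above.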

(iii) Multimodal approach

The most modern approaches analyze all the salient features of a document: text, layout, and image. This combines the main benefits of the previous techniques and offers greater versatility in terms of application, allowing structured, semi-structured, and unstructured documents to be processed with the same pipeline.
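One simple way to combine modalities is late fusion: separate text and image classifiers each output class probabilities, and these are merged with a weighted average. Modern multimodal models usually learn a joint representation instead, so the sketch below is a simplified assumption rather than a description of any specific system. The function name, probabilities, and weights are hypothetical.

```python
def late_fusion(probs_by_modality, weights):
    """Weighted average of per-class probabilities from several modalities.

    probs_by_modality maps a modality name to a {class: probability} dict;
    weights maps the same modality names to non-negative weights.
    """
    total_weight = sum(weights[m] for m in probs_by_modality)
    classes = set()
    for probs in probs_by_modality.values():
        classes.update(probs)
    return {
        c: sum(weights[m] * probs.get(c, 0.0)
               for m, probs in probs_by_modality.items()) / total_weight
        for c in classes
    }


# Invented outputs of a text classifier and an image classifier.
fused = late_fusion(
    {
        "text": {"invoice": 0.6, "contract": 0.4},
        "image": {"invoice": 0.9, "contract": 0.1},
    },
    weights={"text": 0.5, "image": 0.5},
)
best_class = max(fused, key=fused.get)
```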

Leveraging models pre-trained with unsupervised techniques can cut down on the amount of labeled data needed for training, allowing even processes with limited document volume to be automated. In all the cases described above, depending on the type of algorithm used, it is also possible to obtain a confidence score and use it to flag the most critical documents for review.
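A confidence-based review step might look like the following sketch. It assumes the classifier exposes raw scores (logits) per class, converts them to probabilities with a softmax, and flags the document for human review when the top probability falls below a threshold. The threshold of 0.8 and the function names are hypothetical, and softmax probabilities are not necessarily well calibrated, so in practice the threshold would be tuned on validation data.

```python
import math


def softmax(logits):
    """Convert a {class: raw score} dict into probabilities that sum to 1."""
    m = max(logits.values())  # subtract the max for numerical stability
    exps = {label: math.exp(v - m) for label, v in logits.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


def classify_with_review(logits, threshold=0.8):
    """Return (label, confidence, needs_review) for one document."""
    probs = softmax(logits)
    label = max(probs, key=probs.get)
    confidence = probs[label]
    return label, confidence, confidence < threshold


# Invented scores: one clear-cut document and one ambiguous document.
clear = classify_with_review({"invoice": 4.0, "contract": 0.5, "id": 0.1})
ambiguous = classify_with_review({"invoice": 1.0, "contract": 0.8})
```

Only the ambiguous document is routed to a human reviewer, which concentrates manual effort where the model is least certain.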

Advantages of automatic classification

Regardless of how sophisticated the algorithm used to classify documents is, the main benefits to be gained are as follows:

1. Management of documents with high variability in format and content

With advances in deep learning and data augmentation techniques, a wide variety of processes can be automated with excellent results.

2. Save time and money

Automating document classification eliminates or reduces the need for human intervention in the process itself, which is highly time-consuming and repetitive and therefore costly and error-prone. It also frees up resources and improves the quality of people's working life.

3. Preventing data breaches

Automated and centralized data management reduces the risk of security holes.

Automatic Classification with myBiros

myBiros is an IDP solution that enables automatic processing of documents of any type. Key features include information extraction and automatic document classification. myBiros provides a prebuilt set of ready-to-use APIs with pre-trained templates for common use cases and the ability to retrain the entire pipeline (both the OCR engine and the document interpretation system) for custom cases.

By leveraging advanced deep learning techniques that analyze multimodal features, it is possible to process all of the above document types with the same solution. Pre-trained models and data augmentation techniques make it possible to train the system with a limited amount of data, so that even those who do not have large volumes of documents can train AI models. Through the scoring mechanism, low-confidence results can be sent for review, which reduces false positives and minimizes errors. Interaction with a human user allows the system to correct errors while it continues to be trained, so that past mistakes are not repeated (human-in-the-loop and continuous learning). Finally, the high scalability of the cloud-based architecture makes it possible to process highly variable volumes of documents without having to allocate expensive resources in advance.

If you are curious about how myBiros works and want to find out how to simplify document processing for different industries with the ability to accurately extract data from documents, classify them, and validate results, schedule a free demo with us. We would love to hear about your business use case and understand how we can help you!

