IDP: automatic document classification

In this article, you will find details about automatic document classification (IDP): what it is, the steps involved in the process, various classification methods, and the advantages of utilizing this innovative software.

Francesco Cavina
Francesco Cavina
CEO & Co-Founder

The document classification process involves automatically assigning a category to each page of a document or to the document as a whole. Automatic classification can occur using various methods:

  • Text Transcription and Analysis: This involves analyzing the text contained within the document.
  • Image Analysis: This method focuses on examining the visual aspects of the document.
  • Hybrid Techniques: These approaches analyze both the text and the image of the document.

In the intelligent document processing workflow, both supervised and unsupervised machine learning techniques can be utilized. The unsupervised approach has a lower setup cost since it does not require data labeling; however, it typically offers lower accuracy. Depending on the algorithm used, the model can provide a Reliability Score (Confidence Score) to indicate the confidence level of its predictions for document classification.

So, what does automatic document classification entail? Which processes can benefit from it? What are the different methods for performing automatic classification of documents? What are the limitations and advantages of the various machine learning approaches used to automate these processes? All these questions will be answered in this article.

Intro

Document classification, whether automatic or manual, enables users to upload various types of documents individually or in bulk and categorize them accordingly. This process is essential, especially when dealing with complex documents that may contain multiple items requiring analysis. Effective classification is necessary for the subsequent processing of different document types, facilitating the assignment to the appropriate team member for review, processing, and analysis.

This operation can become a significant bottleneck for publishers, insurance companies, financial institutions, and many other organizations that handle a large volume of diverse documents. A concrete example is the evaluation process for mortgage issuance, where an underwriter may submit three types of documents via email: identity documents, paychecks, and a proof of income. Before these documents can be processed, they must be classified into their respective categories and queued for processing, with each assigned to the appropriate team member.

Document classification methods

The two main methodologies for classifying documents are manual and automatic. Many companies still rely on manual document classification within their workflows, which comes with several drawbacks. Small businesses that process a limited volume of documents typically handle the manual process in-house, while larger organizations often outsource this work.

Despite being time-consuming, manual classification is prone to errors, costly, and inefficient. Moreover, complex cases require skilled personnel capable of understanding the nuances of the documents being classified, such as legal documents related to debt collection.

The main disadvantages of a manual approach can be summarized as follows:

  • Excessive Processing Time: The time needed to process a large volume of documents can be significant and costly.
  • Subjectivity: Human operators may introduce biases that lead to subjective classifications, resulting in errors.

During the manual classification process, employees often spend about 20-40% of their time retrieving documents and the remainder processing them.

In contrast, utilizing Intelligent Document Processing (IDP) technology can automate the management and processing of documents, significantly reducing both costs and the time required for the entire pipeline.

Automatic Classification

Automatic document classification solutions are faster and more accurate. In addition, using a Human-in-the-Loop (HITL) approach allows for the correction and minimization of errors. Using an IDP solution not only allows for the automatic classification of documents but also structures the process more effectively with the following advantages:

  • Scan documents in no particular order and without inserting separators between documents.
  • Automatically send the document to the right department for processing.
  • Automatically classify single-page and multi-page documents.
  • Automate checks on sensitive processes through scoring mechanisms.

Stages of the process

In an Intelligent Document Processing (IDP) workflow, deep learning techniques are typically employed to identify the document class and perform several preliminary steps.

Identifying the file format

IDP solutions generally handle a variety of formats. In this phase, the key objective is to determine whether the document is a digital PDF or an image (e.g., JPG, PNG, TIFF, etc.). If images are involved, an additional OCR phase may be required to extract the text contained within the document.

Identifying the type of document

Depending on the type of document, various techniques can be employed that leverage specific characteristics of the document. The primary characteristics utilized include the image, the text, and the geometry of the document (i.e., the coordinates of the text).

The main categories of documents can be summarized as follows:

  • Structured Documents: These documents typically have a homogeneous format and consistent information content, allowing for a purely positional approach. A common example is an identity document.
  • Semi-Structured Documents: These documents contain a fixed set of information or tables but can vary significantly in terms of template and information layout. In this case, it is useful to analyze both the text and its position, as well as the document's image. A typical example would be invoices.
  • Unstructured Documents: These documents lack a standardized format and may contain highly variable information. For processing these documents, natural language analysis, and sometimes geometry and image analysis, can be employed. A classic example is contracts.

It is essential to consider the types of documents you wish to process in order to create a high-performance pipeline using the most suitable algorithm for your specific use case.

Identifying the document class

In this phase we try to automatically identify the category to which you belong of the document. Usually this phase is divided into several phases.

1. Pre-processing

In this phase, the goal is to automatically identify the category to which the document belongs. Typically, this phase is divided into several steps.

2. Optical character recognition (OCR)

To leverage textual features—typically through Natural Language Processing techniques—it is necessary to obtain a transcript of the document (especially if it is not a digital PDF). Many people overlook this phase, relying solely on traditional OCR engines. However, accurate transcription is crucial for correctly classifying complex documents. In a high-performance IDP workflow, the ability to retrain the OCR engine is essential for reducing errors and effectively processing difficult-to-read documents.

3. Document classification

The main methodologies for document classification are:

i) Visual Approach

Using computer vision techniques, this method analyzes the visual aspects of a document without the need for transcription. By leveraging the positioning of information and the layout of the document, it can be automatically classified. These techniques work effectively on structured documents and, with sufficient data, can also be applied to semi-structured documents. One of the advantages of this approach is that it eliminates the need for an OCR phase, working directly with the image.

ii) Text-based approach

This method employs Natural Language Processing (NLP) techniques to automatically analyze the text within the document and determine its category. This approach is effective for processing unstructured documents such as contracts. However, the inability to analyze the image and geometry of the document can introduce errors in many cases.

iii) Multimodal approach

Modern approaches advocate for analyzing all salient characteristics of a document: text, layout, and image. This multimodal approach combines the strengths of the previous techniques, offering greater versatility in applications. It enables the processing of structured, semi-structured, and unstructured documents within the same pipeline.

By utilizing pre-trained algorithms with unsupervised techniques, this approach can reduce the amount of data required for training, allowing even processes with limited document volumes to be automated.

Advantages of automatic classification

Regardless of the sophistication of the algorithm used for document classification, the main benefits include:

1. Management of documents with high variability

With advancements in Deep Learning and Data Augmentation techniques, a wide range of processes can be automated with excellent results, even for documents with varying formats and content.

2. Time and Cost Savings

Automating document classification eliminates the need for human intervention, which is often time-consuming and repetitive, leading to reduced costs and fewer errors. Additionally, this process frees up resources, enhancing the overall quality of work life.

3. Preventing data breaches

Automating and centralizing data management significantly reduces the risk of security breaches.

Automatic classification with MyBiros

MyBiros is an Intelligent Document Processing (IDP) solution that enables the automatic processing of various document types. Among its key features are information extraction and automatic document classification. MyBiros provides a prebuilt set of APIs that are ready to use, along with pre-trained models for the most common use cases and the capability to retrain the entire pipeline (including the OCR engine and the document interpretation system) for custom applications.

By utilizing advanced deep learning techniques that analyze multimodal features, MyBiros can process all types of documents mentioned above within a single solution. The use of pre-trained models and data augmentation techniques allows the system to be trained with a limited amount of data, enabling AI models to be developed even for those with smaller document volumes.

Through a scoring mechanism, MyBiros effectively reduces false positives by allowing for the review of low-confidence data, thereby minimizing errors. Interacting with human users enables the correction of system errors while facilitating continuous training, ensuring that past mistakes are not repeated (Human-in-the-Loop and continuous learning).

Finally, the high scalability of the cloud-based architecture allows for the processing of a large number of highly variable documents without the need to allocate expensive resources in advance.

If you’re curious about how MyBiros works and want to discover how to simplify document processing across different sectors - accurately extracting data, classifying documents, and validating results - please contact us. We’d love to hear about your business use case and explore how we can assist you!

Articles in the same category

digital transformation and automated document processing

Digital transformation and document hyperautomation

Digital transformation involves implementing innovative technologies and redefining business processes to enable automation.

Read it now
Expense management

Why automate Expense Management processes?

Many companies still manage expenses manually, leading to reduced employee productivity. Today, expense management can be automated, significantly cutting down on time, costs, and the repetitive tasks that often lead to frustration.

Read it now
risks of manual document processing

Risks of Manual Document Processing

Every business department relies on document management to record information, communicate with customers and suppliers, and store critical data. When done manually, these activities expose the company to numerous risks.

Read it now
Hands typing on keyboard

Companies still rely on manual data entry

Many companies still rely on manual data entry, which leads to numerous challenges. Today, this process can be automated using modern technologies, eliminating repetitive tasks and significantly reducing both time and costs.

Read it now
IDP Intelligent Document Processing

Intelligent Document Processing (IDP)

Intelligent Document Processing refers to a suite of tools and solutions based on deep learning techniques, designed to automate the processing of all types of documents.

Read it now
document classification with myBiros

IDP: automatic document classification

In this article, you will find details about automatic document classification (IDP): what it is, the steps involved in the process, various classification methods, and the advantages of utilizing this innovative software.

Read it now