Automatic document classification assigns a category to each page of a document, or to the document as a whole.
Different methodologies can be used to classify a document:
- Transcribing and analyzing the text it contains;
- Analyzing the document image;
- Analyzing both the text and the image using hybrid techniques.
Both supervised and unsupervised machine learning techniques can be used in an intelligent document processing (IDP) workflow. Compared with the supervised approach, the unsupervised approach typically has a lower setup cost (no labeling is required), but it is typically less accurate. Depending on the algorithm used, the model can also provide a confidence score for each classification prediction.
So what is automatic document classification? Which processes can benefit from it? What are the different methodologies for performing it? What are the limitations and benefits of the machine learning approaches used to automate it? This article answers all of these questions.
Introduction
Document classification, whether automatic or manual, lets users upload different types of documents, individually or in batches, and sort them into their respective categories. It is also essential when a long file of many pages contains several distinct documents to be analyzed. Classification is a prerequisite for the subsequent processing of each document type: for example, it allows each document to be assigned to the right team member for review, processing and analysis. This step can be a huge bottleneck for publishers, insurance companies, financial institutions, and many other companies that receive a large number of heterogeneous documents to process.
A concrete example is the evaluation process for granting a mortgage, in which the applicant sends three types of documents, let’s assume by e-mail: identity documents, a pay slip and a CUD (an Italian income certificate, as proof of income). Before they can be processed, these documents must be classified into their respective categories, placed in the processing queue and assigned to the right team member.
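The routing step in this example can be sketched in a few lines of Python. This is a minimal illustration, not a description of any specific product: the class names, team names and file names are all hypothetical, and the predicted classes are assumed to come from an upstream classifier.

```python
from collections import defaultdict

# Hypothetical mapping from predicted class to the team that reviews it.
QUEUE_BY_CLASS = {
    "identity_document": "kyc_team",
    "pay_slip": "income_team",
    "cud": "income_team",
}


def route(documents):
    """Group (filename, predicted_class) pairs into per-team queues."""
    queues = defaultdict(list)
    for filename, predicted_class in documents:
        # Classes without a known owner fall back to a manual review queue.
        team = QUEUE_BY_CLASS.get(predicted_class, "manual_review")
        queues[team].append(filename)
    return dict(queues)


# Attachments from one e-mail, already labeled by a (hypothetical) classifier.
attachments = [
    ("id_card.jpg", "identity_document"),
    ("march_payslip.pdf", "pay_slip"),
    ("cud_2023.pdf", "cud"),
    ("unknown.pdf", "other"),
]
queues = route(attachments)
```

Keeping the class-to-team mapping in a plain dictionary means that adding a new document category only requires a new entry, not a change to the routing logic.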
Document classification methods
A document can be classified in two main ways: manually or automatically.
Many companies still rely on manual document classification in their workflow, with the drawbacks this entails. Small companies with a low volume of documents typically manage the manual process in-house, while large organizations with massive volumes often outsource the work. Besides being time-consuming, manual document classification is error-prone, expensive and inefficient. Complex cases also require trained staff who can understand the documents to be classified: think, for example, of classifying legal documents in debt collection.
The main disadvantages of a manual approach can be summarized as follows:
- Excessive Processing Time – An important consideration is the time it takes to process a considerable volume of documents.
- Subjectivity – Human operators often have biases that lead them to classify documents in a subjective way incurring classification errors.
During the manual classification stages, an employee often spends about 20-40% of the time retrieving documents and the remaining time processing them.
IDP technology, however, can automate document management and processing, reducing the cost and time of the entire pipeline.
Automatic classification
Automatic document classification solutions are typically faster and more accurate than manual classification. In addition, a human-in-the-loop (HITL) approach allows you to correct and minimize errors. Beyond classifying documents automatically, an IDP solution lets you structure the process more effectively. It lets you:
- Scan documents in no particular order and without inserting separators between documents;
- Automatically send the document to the right department for processing;
- Automatically classify single and multiple page documents;
- Automate checks on sensitive processes through scoring mechanisms.
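To illustrate how a scanned batch without separators might be split, here is a minimal sketch. It assumes that each page has already received a class label from a page-level classifier (the labels below are hypothetical) and that a new document starts whenever the label changes. That assumption is a simplification: two consecutive documents of the same class would be merged, so a real pipeline would need an additional boundary signal.

```python
from itertools import groupby


def split_into_documents(page_labels):
    """Return (label, first_page, last_page) for each run of equal labels."""
    documents = []
    page = 0
    for label, run in groupby(page_labels):
        length = len(list(run))
        # Page indices are zero-based and inclusive.
        documents.append((label, page, page + length - 1))
        page += length
    return documents


# Hypothetical per-page predictions for a six-page scan.
labels = ["id", "id", "pay_slip", "cud", "cud", "cud"]
segments = split_into_documents(labels)
```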
Process steps
In an IDP process, deep learning techniques are typically used to identify the class of the document, after several preliminary steps.
File format identification
IDP solutions typically handle a variety of formats. At this stage, the most relevant information is whether the document is a digital PDF or an image (JPG, PNG, TIFF, etc.). Images usually require an additional OCR step to extract the text contained in the document.
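One common way to identify the format, independently of the file extension, is to inspect the first bytes of the file. The sketch below checks the standard signatures of PDF, JPEG, PNG and TIFF files. The function names are illustrative, and the OCR rule is a simplifying assumption: a PDF can also be a scan with no text layer, which this check does not detect.

```python
# Leading byte signatures ("magic numbers") of the formats mentioned above.
MAGIC_NUMBERS = [
    (b"%PDF-", "pdf"),
    (b"\xff\xd8\xff", "jpg"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"II*\x00", "tiff"),  # little-endian TIFF
    (b"MM\x00*", "tiff"),  # big-endian TIFF
]


def sniff_format(header: bytes) -> str:
    """Guess the file format from the first bytes of the file."""
    for magic, name in MAGIC_NUMBERS:
        if header.startswith(magic):
            return name
    return "unknown"


def needs_ocr(fmt: str) -> bool:
    """Assumption: every image needs OCR, every PDF has a text layer."""
    return fmt in {"jpg", "png", "tiff"}
```

In practice the header would be read with something like `open(path, "rb").read(8)`, since eight bytes are enough for all the signatures listed here.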
Document type identification
Depending on the document type, the techniques applied can use some or all of the document's characteristics. The main features used are the image, the text and the geometry of the document (the coordinates of the text).
The main macro categories of documents are:
- Structured documents – These documents are typically homogeneous in format and information content and can often be processed with a purely positional approach. A typical example is an identity document.
- Semi-structured documents – These documents contain a predetermined set of information or tables, but their template and the location of the information can vary greatly. In this case it is useful to analyze the text, the position and the image of the document. A typical example is an invoice.
- Unstructured documents – These documents do not follow a format and can contain highly variable information. In this case, the analysis of natural language and in some cases of the geometry and image of the document can be used to process the document. A classic example is a contract.
It is important to keep in mind the type of documents you want to process, so that you can build a high-performance pipeline with the algorithm that best suits the specific use case.
Document class identification
At this stage, an attempt is made to automatically identify the category to which the document belongs. Usually this phase is divided into several steps.
1. Pre-processing
In many IDP processes, preliminary steps must be taken before the document can be properly classified. Typically, documents are binarized, rotated to the correct orientation and cleaned of noise, which increases the quality and readability of the document.
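As an illustration of the binarization step, the following sketch implements Otsu's method, a classic way to choose a global threshold from the grey-level histogram. It is written in plain Python for readability and operates on a flat list of 8-bit grey values; this is an assumed input format, and a real pipeline would normally rely on an image library instead.

```python
def otsu_threshold(pixels):
    """Return the grey level (0-255) that maximizes between-class variance."""
    histogram = [0] * 256
    for value in pixels:
        histogram[value] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(histogram))

    best_threshold, best_variance = 0, -1.0
    weight_bg, sum_bg = 0, 0
    for level in range(256):
        weight_bg += histogram[level]
        if weight_bg == 0:
            continue  # no pixels at or below this level yet
        weight_fg = total - weight_bg
        if weight_fg == 0:
            break  # every pixel is already in the lower class
        sum_bg += level * histogram[level]
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        variance = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if variance > best_variance:
            best_variance, best_threshold = variance, level
    return best_threshold


def binarize(pixels):
    """Map pixels above the Otsu threshold to 255 and the rest to 0."""
    threshold = otsu_threshold(pixels)
    return [255 if value > threshold else 0 for value in pixels]


# Toy "image": three dark (ink) pixels and three light (paper) pixels.
sample = [10, 12, 11, 200, 210, 205]
binary = binarize(sample)
```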
2. OCR
To exploit textual features (typically through Natural Language Processing techniques), a transcript of the document is required, unless it is a digital PDF. Many neglect this phase and simply rely on traditional OCR engines, but a correct transcription is essential to classify a complex document correctly. In a high-performing IDP flow, the ability to retrain your OCR engine can be important for reducing errors and processing documents that are difficult to read.
3. Document classification
The main methodologies are:
i) Visual approach
In this case, computer vision techniques analyze the visual appearance of the document without having to transcribe it. The position of the information and the layout of the document allow it to be classified automatically. These techniques work well on structured documents and, if enough data is available, also on semi-structured documents. One advantage of this approach is that it does not require an OCR phase, because it works directly on the image.
ii) Text-based approach
Using NLP techniques, it is possible to automatically analyze the text contained in the document and determine the category to which the document belongs. These methodologies make it possible to process even unstructured documents such as contracts. However, in many cases the inability to analyze the image and the geometry of the document is a major source of errors.
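A very small example of a text-based classifier is a multinomial Naive Bayes model over word counts. It is used here only as an easy-to-read stand-in for the NLP techniques mentioned above, which in practice are typically deep learning models. The training sentences and labels are invented for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


class NaiveBayesTextClassifier:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)
        self.class_counts = Counter(labels)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocabulary = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        scores = {}
        for label, doc_count in self.class_counts.items():
            counts = self.word_counts[label]
            denominator = sum(counts.values()) + len(self.vocabulary)
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(doc_count / total_docs)
            for word in tokenize(text):
                score += math.log((counts[word] + 1) / denominator)
            scores[label] = score
        return max(scores, key=scores.get)


# Invented training data: OCR transcripts and their categories.
texts = [
    "invoice number total amount due vat",
    "invoice total vat payment terms",
    "contract parties agree clause termination",
    "agreement between parties governing law clause",
]
labels = ["invoice", "invoice", "contract", "contract"]
model = NaiveBayesTextClassifier().fit(texts, labels)
```

Calling `model.predict("total amount due on this invoice")` returns the `invoice` label on this toy data. Because the model only sees words, two documents with similar wording but different layouts are indistinguishable to it, which is the limitation described above.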
iii) Multimodal approach
The most modern approaches analyze all the salient characteristics of a document: text, layout and image. This approach combines the main benefits of the previous techniques and offers greater versatility, allowing structured, semi-structured and unstructured documents to be processed with the same pipeline.
Exploiting models that have been pre-trained with unsupervised techniques reduces the amount of data needed to train these algorithms, so that even processes with a limited volume of documents can be automated. In all the cases described above, depending on the type of algorithm used, it is also possible to obtain a confidence score and use it to review the most critical documents.
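A confidence-based review step can be sketched as follows. The example assumes the classifier outputs one raw score per class and uses a softmax to turn the scores into probabilities. The 0.8 threshold and the class names are arbitrary illustrations, and softmax probabilities are not guaranteed to be well calibrated, so a real threshold should be tuned on validation data.

```python
import math


def softmax(scores):
    """Convert raw per-class scores into probabilities that sum to 1."""
    peak = max(scores.values())  # subtract the max for numerical stability
    exps = {label: math.exp(s - peak) for label, s in scores.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


def classify_with_review(scores, threshold=0.8):
    """Return (label, confidence, needs_review) for one document."""
    probabilities = softmax(scores)
    label = max(probabilities, key=probabilities.get)
    confidence = probabilities[label]
    # Predictions below the threshold are flagged for a human reviewer.
    return label, confidence, confidence < threshold


# Hypothetical raw scores from a classifier for two documents.
clear_case = classify_with_review({"invoice": 4.0, "contract": 0.5, "id": 0.1})
ambiguous_case = classify_with_review({"invoice": 1.0, "contract": 0.9})
```

Here the first document is accepted automatically, while the second, whose two top classes score almost the same, is sent to review.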
Advantages of automatic classification
Regardless of how sophisticated the algorithm used to classify the documents is, the following benefits can be obtained:
1. Handling of documents with high variability in format and content
Automatic classification can process structured, semi-structured and unstructured documents with the same pipeline, and documents can be scanned in no particular order and without separators between them.
2. Save time and money
Automating document classification eliminates or sharply reduces the need for human intervention in the process, which is highly time-consuming and repetitive, with the resulting costs and errors. It also frees up staff and improves the quality of their working life.
3. Prevent data breaches
Automated and centralized data management reduces the risk of security breaches.
Automatic classification with myBiros
myBiros is an IDP solution for automatic document processing. Its main features include information extraction and automatic document classification. myBiros provides a set of ready-to-use APIs with pre-trained models for the most common use cases, as well as the ability to retrain the entire pipeline (both the OCR engine and the document interpretation system) for custom cases.
Deep learning techniques that analyze multimodal features make it possible to process all the document types mentioned above with the same solution. Thanks to pre-trained models and data augmentation, the system can be trained with a limited amount of data, so even those who do not have large collections of documents can train AI models. Through the scoring mechanism, the system reduces false positives by flagging low-confidence data for review, minimizing errors. Interaction with a human user makes it possible to correct the system's errors and to keep training it so that past mistakes are not repeated (human-in-the-loop and continuous learning). Finally, the high scalability of the cloud-based architecture lets you process highly variable volumes of documents without allocating expensive resources in advance.
If you are curious about how myBiros works and want to find out how it can simplify document processing in different sectors by accurately extracting data from documents, classifying them and validating the results, all in seconds, schedule a free demo with us. We’d love to hear about your business use case and understand how we can help!