Structured, semi-structured and unstructured documents

The first question vendors ask when you are looking for an intelligent document processing solution is: “What type of document do you want to process?” They expect one of three answers: structured, semi-structured, or unstructured. This article outlines the differences between these three types of documents and the challenges of extracting information from each.

Structured documents

Typically, every copy of a structured document has the same layout and design in terms of colors, fonts and images. In some cases, the layout may change slightly when a new version of the document is released. An example of a structured document is an identity document.

This type of document is the easiest to handle because the information is easily identifiable and keeps the same position across samples. A common approach to processing this type of document is to use traditional solutions based on rules and templates applied to the output of an OCR engine.
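As a rough illustration of what a template applied to OCR output can look like, the sketch below assumes the OCR engine returns words with normalized center coordinates. The `Word` type, the `ID_CARD_TEMPLATE` regions and the `extract_fields` helper are hypothetical names introduced for this example; they do not belong to any specific OCR product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """One OCR word (hypothetical format, not a real engine's output)."""
    text: str
    x: float  # horizontal center, as a fraction of page width (0..1)
    y: float  # vertical center, as a fraction of page height (0..1)


# Hypothetical template: field name -> (left, top, right, bottom) region.
ID_CARD_TEMPLATE = {
    "surname": (0.35, 0.20, 0.95, 0.30),
    "name": (0.35, 0.30, 0.95, 0.40),
    "birth_date": (0.35, 0.40, 0.95, 0.50),
}


def extract_fields(words, template):
    """Collect, for each field, the words whose center lies in its region."""
    fields = {}
    for field_name, (left, top, right, bottom) in template.items():
        inside = [w for w in words
                  if left <= w.x <= right and top <= w.y <= bottom]
        # Assumes single-line fields, so reading order is left to right.
        inside.sort(key=lambda w: w.x)
        fields[field_name] = " ".join(w.text for w in inside)
    return fields


ocr_words = [
    Word("ROSSI", 0.45, 0.25),
    Word("MARIA", 0.45, 0.35),
    Word("01/02/1990", 0.50, 0.45),
]
result = extract_fields(ocr_words, ID_CARD_TEMPLATE)

# The same words shifted down the page, as in a badly framed photo.
shifted_words = [Word(w.text, w.x, w.y + 0.12) for w in ocr_words]
shifted_result = extract_fields(shifted_words, ID_CARD_TEMPLATE)
```

With the well-aligned words, `result` contains the three expected values. With the shifted words, the surname region is empty and the other fields receive the wrong values, which shows how strongly this kind of template depends on a fixed layout.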

This approach has the following drawbacks:

  • Document acquisition is not always guided. This can result in low-quality or rotated images that are hard to read. In such cases, rule-based approaches cannot handle the use case;
  • Even though a structured document is easy to interpret, its format can come in several variants: for example, identity documents differ from one country to another;
  • Documents change format over time, which requires additional rules and templates.

Semi-structured documents

Semi-structured documents contain types of information that are known a priori, but whose position and format can change within the document. In addition, semi-structured documents vary a lot in terms of layout and design (fonts and colors). The classic example is an invoice. Each company has to include certain mandatory information in the invoice, but can freely choose the level of detail, the fonts, the colors and the layout of the invoice itself. This makes semi-structured documents more difficult to process than the previous category.

Rule-based and template-based solutions for processing semi-structured documents have a number of challenges and limitations. First, they share the drawbacks described above for structured documents. Second, this type of document varies by provider, which means creating a new template and related rules each time.
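The per-provider maintenance cost can be sketched with a few label-anchored regular expressions. This is a minimal example under stated assumptions: the provider name, the label wording in `PROVIDER_RULES` and both sample invoices are invented for illustration.

```python
import re

# Hypothetical rules written for one provider's invoice wording.
PROVIDER_RULES = {
    "provider_a": {
        "invoice_number": re.compile(
            r"Invoice\s+No\.?\s*:?\s*(\S+)", re.IGNORECASE),
        "total": re.compile(r"Total\s*:?\s*([\d.,]+)", re.IGNORECASE),
    },
}


def extract_invoice_fields(text, rules):
    """Apply each field's pattern; a field is None when its label is absent."""
    extracted = {}
    for field_name, pattern in rules.items():
        match = pattern.search(text)
        extracted[field_name] = match.group(1) if match else None
    return extracted


# Invented OCR text for two providers that label the same data differently.
invoice_a = "ACME Ltd\nInvoice No: 2023-001\nTotal: 1,250.00\n"
invoice_b = "Beta GmbH\nAmount due 980.50\nRef. number B-77\n"

fields_a = extract_invoice_fields(invoice_a, PROVIDER_RULES["provider_a"])
fields_b = extract_invoice_fields(invoice_b, PROVIDER_RULES["provider_a"])
```

The rules find both fields in the first invoice and nothing in the second, even though the second contains the same kind of information. Supporting the second provider means writing and maintaining another set of rules.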

Unstructured documents

Unstructured documents do not follow any constraints in terms of format or content. A concrete example of an unstructured document is a contract: its terms and conditions vary completely depending on the type of contract and on how the document is drafted.

Processing this type of document is more complicated than for the previous categories, because template-based techniques cannot be used. Hence the need for solutions that take advantage of machine learning and natural language processing.

You have documents, we have the solution

myBiros is an intelligent document processing product designed for companies that face challenges in document processing. Unlike traditional solutions, myBiros lets you automatically process any document and extract its information and data, saving time, reducing costs and cutting repetitive work.

With myBiros it is easy to automate document processes, thanks to a pipeline built on Deep Learning techniques. Much more than an OCR, myBiros is able to interpret the data locked inside documents, allowing companies to manage risks, make important decisions and seize opportunities. myBiros differs from traditional solutions in that it does not use a rule-based or template-based pipeline. Its approach is entirely data-driven, which makes the whole pipeline trainable on a vertical domain without having to specify any rules or domain information. myBiros uses techniques from Computer Vision and NLP that allow it to interpret a document through its various characteristics: the text it contains, its layout and the image of the document itself.

myBiros is therefore able to process any type of document: structured, semi-structured and unstructured.

