Structured, semi-structured and unstructured documents

This article outlines the difference between structured, semi-structured and unstructured documents. It illustrates the problems that arise when processing each document type, which Artificial Intelligence-based solutions aim to solve.

Francesco Cavina

CEO & Co-Founder

When you are looking for an intelligent document processing (IDP) solution, one of the first questions vendors ask is, "What type of document do you want to process?" The expected answer is one of the following: structured, semi-structured or unstructured. This article outlines the difference between these three types of documents and illustrates the issues involved in extracting information of interest from each of them.

Structured documents

Structured documents typically follow a fixed pattern: the layout and design, in terms of colors, fonts, and images, are the same from one copy to the next. A structured document may change slightly when a new version of it is released. An example is an ID document, where each individual copy has the same format.

This type of document is the easiest to process because the information is easy to identify and keeps the same position across samples. A common approach is to apply traditional rule-based and template-based solutions to the output of an OCR engine.

Such an approach is challenged by the following issues:

document capture is not always guided. This can result in rotated, low-quality images that are difficult to read and therefore difficult to process with a traditional solution;

although a structured document is easy to interpret, its format may vary for several reasons: documents change over time, and the same type of document can exist in multiple formats, for example one per issuing country;

a change of language in a document may require a different setup for the same type of document.
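
The template-based approach described above can be sketched as follows. This is a minimal illustration, not the implementation of any particular product: the `Word` and `Region` types, the `ID_CARD_TEMPLATE` coordinates, and the sample OCR words are all hypothetical. It assumes an OCR engine has already returned each word with the normalized position of its center.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    text: str
    x: float  # horizontal center of the word, normalized to 0..1
    y: float  # vertical center of the word, normalized to 0..1


@dataclass(frozen=True)
class Region:
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, word: Word) -> bool:
        return (self.left <= word.x <= self.right
                and self.top <= word.y <= self.bottom)


# Hypothetical template: one fixed region per field for a single ID layout.
ID_CARD_TEMPLATE = {
    "document_number": Region(0.60, 0.05, 0.95, 0.15),
    "surname": Region(0.35, 0.20, 0.95, 0.30),
    "name": Region(0.35, 0.32, 0.95, 0.42),
}


def extract_fields(words, template):
    """Join, in reading order, the OCR words that fall inside each region."""
    fields = {}
    for field_name, region in template.items():
        inside = sorted(
            (w for w in words if region.contains(w)),
            key=lambda w: (w.y, w.x),
        )
        fields[field_name] = " ".join(w.text for w in inside)
    return fields


# Made-up OCR output for a well-aligned scan.
ocr_words = [
    Word("CA1234567", 0.78, 0.10),
    Word("ROSSI", 0.45, 0.25),
    Word("MARIA", 0.45, 0.37),
    Word("LUISA", 0.60, 0.37),
]
aligned = extract_fields(ocr_words, ID_CARD_TEMPLATE)

# The same words shifted down the page, as in an unguided capture.
shifted_words = [Word(w.text, w.x, w.y + 0.5) for w in ocr_words]
shifted = extract_fields(shifted_words, ID_CARD_TEMPLATE)
```

On the aligned scan every field is filled, while on the shifted one every field comes back empty. This is the fragility that the issues listed above describe: the template only works while the information stays where the template expects it.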

Semi-structured documents

Semi-structured documents contain a type of information that is known in advance, but whose position and format can change within the document. They also vary greatly in layout and design, including the colors, fonts, and decorations used. The classic example is an invoice. Each company must include certain required information in the invoice but can freely choose its level of detail, fonts, colors, and layout. This makes semi-structured documents more difficult to process than structured ones.

Rule- and template-based solutions for processing semi-structured documents have a number of limitations. First, they share the issues described above for structured documents. Second, this type of document varies from vendor to vendor, so a new template and its related rules must be implemented for each one.
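
The per-vendor cost can be made concrete with a small sketch. The vendor names, the label wording, and the `VENDOR_RULES` table are all hypothetical, and plain regular expressions stand in for whatever rule language a real template engine would use.

```python
import re

# Hypothetical rules: each vendor labels the same fields differently,
# so each vendor needs its own set of patterns.
VENDOR_RULES = {
    "acme": {
        "invoice_number": re.compile(r"Invoice No\.\s*(\S+)"),
        "total": re.compile(r"Total due:\s*([\d.,]+)"),
    },
    "globex": {
        "invoice_number": re.compile(r"Ref\s*#\s*(\S+)"),
        "total": re.compile(r"Amount payable\s+EUR\s+([\d.,]+)"),
    },
}


def extract_invoice(text, vendor):
    """Apply one vendor's rules to the OCR text of an invoice."""
    rules = VENDOR_RULES.get(vendor)
    if rules is None:
        # A vendor never seen before cannot be processed until
        # someone writes a new template for it.
        raise KeyError(f"no template for vendor {vendor!r}")
    fields = {}
    for field_name, pattern in rules.items():
        match = pattern.search(text)
        fields[field_name] = match.group(1) if match else None
    return fields


# Made-up OCR text for two invoices carrying the same kind of information.
acme_text = "ACME Srl\nInvoice No. A-1001\nTotal due: 1.250,00"
globex_text = "Globex\nAmount payable EUR 980,50\nRef # 77-B"

acme_fields = extract_invoice(acme_text, "acme")
globex_fields = extract_invoice(globex_text, "globex")
wrong_template = extract_invoice(globex_text, "acme")
```

Both invoices contain an invoice number and a total, yet the rules written for one vendor find nothing in the other vendor's document (`wrong_template` holds `None` for both fields), and an unknown vendor raises an error until a new entry is added.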

Unstructured documents

Unstructured documents do not follow any constraints in terms of format or content. A concrete example is a contract: its terms and conditions vary completely depending on the type of contract and on how the document is written.

Processing this type of document is more complicated than processing the two categories above. Because no fixed layout or set of fields can be assumed, template-based techniques are not usable here. This gives rise to the need for solutions that take advantage of machine learning and natural language processing.
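
To illustrate what learning from examples instead of writing rules can look like, here is a deliberately tiny sketch: a bag-of-words Naive Bayes classifier that labels contract clauses. It is an assumption chosen for brevity, not the technique used by any specific product, and the training clauses and labels are invented.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    return text.lower().split()


def train(examples):
    """Count words per label from (clause, label) pairs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocabulary = set()
    for clause, label in examples:
        tokens = tokenize(clause)
        word_counts[label].update(tokens)
        label_counts[label] += 1
        vocabulary.update(tokens)
    return word_counts, label_counts, vocabulary


def predict(model, clause):
    """Return the label with the highest Naive Bayes log-probability."""
    word_counts, label_counts, vocabulary = model
    total_examples = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        score = math.log(count / total_examples)
        # Add-one smoothing so unseen words do not zero out the score.
        denominator = sum(word_counts[label].values()) + len(vocabulary)
        for token in tokenize(clause):
            score += math.log((word_counts[label][token] + 1) / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented training clauses; no layout or position is involved.
examples = [
    ("either party may terminate this agreement with thirty days notice",
     "termination"),
    ("this agreement ends upon written notice of termination",
     "termination"),
    ("the client shall pay all invoices within sixty days", "payment"),
    ("payment of fees is due at the end of each month", "payment"),
]
model = train(examples)

first = predict(model, "fees are due within ten days")
second = predict(model, "the supplier may terminate with written notice")
```

Neither test clause appears in the training data or follows a fixed layout, yet the model assigns `"payment"` to the first and `"termination"` to the second based only on the wording. Supporting a new kind of clause means adding labeled examples, not writing a new template.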

You have the documents, we have the solution

myBiros is an intelligent document processing product designed for companies that face many internal challenges in processing documents to obtain structured data. Unlike traditional processes, myBiros allows any document to be processed automatically, extracting the information and data of interest. The benefit to companies is a net saving in time, cost, and the repetitive tasks carried out by staff.

With myBiros, it is easy to automate document processes thanks to a pipeline built on Deep Learning techniques. Much more than an OCR engine, myBiros can interpret the data locked within documents, which helps companies manage risks, make important decisions, and seize opportunities. myBiros differs from traditional solutions in that it does not use a rule-based or template-based pipeline. Its approach is entirely data-driven, which makes the entire pipeline trainable on a vertical domain without having to specify any rules or domain information. myBiros leverages techniques from Computer Vision and NLP that allow a document to be interpreted using its different characteristics: the text content, the layout, and the document image itself.

Thanks to the features described above, myBiros can process any type of document: structured, semi-structured, and unstructured.

Want to find out more about our solutions? Contact us and try our demo. We are waiting for you!
