Machine Learning Use Case: Information extraction from layout-driven and template-based documents.

Pradeep Kumar Choudhary
Nov 8, 2020

This article discusses the problem of data-field extraction from documents such as invoices, receipts, and forms.

Billions of invoices are generated daily, and only 20 to 30% are exchanged digitally. Although standards exist for e-invoicing, the majority of invoices are still generated and exchanged in the traditional way. Most vendors and sellers have their own invoice layouts.

In any organization's accounts-payable department, the transactional processes involving invoices are prime candidates for automation. Back-office teams work solely on manually reading data from invoices and typing it into software (ERP/CRM). The majority of these invoices are hard copies or scans, arriving on paper or via email.

This space has ample opportunity for modernization and process improvement using today's smart technologies, including machine learning.

There have been various attempts to apply machine-learning techniques to extract data from scanned invoice documents.

The majority of early machine-learning solutions were based on training logistic-regression models, which involves manual feature engineering: identifying features related to content, layout, and the relative positions of tokens and labels, and manually classifying and tagging token types such as numeric, string, date, address, zip, and amount.

Tradeshift, one of the market leaders, has used this approach commercially and at volume in its CloudScan service for invoice processing. The approach has its own limitations and constraints, requiring manual intervention for error correction.

As AI research progressed, new deep-learning models appeared for interpreting and translating textual content. Models based on Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU) have been very successful at recognizing language context.

These models have been applied and adapted to invoice documents, but without the expected success. They are biased toward processing sequential text, with the underlying assumption that the sequence carries the context. Their ability to selectively remember context from previous steps in a text sequence works well for understanding overall textual context. Invoices are not like that: invoice labels are independent, and invoice documents have prominent 2D layout features. The human mind pays more attention to an invoice's layout than to its actual text labels, and layout plays an important role in localizing the required data fields. Sequential models ignore this aspect entirely, and there is no good way to embed dimensionality information in a sequential data stream.

In recent times, computer-vision models have become affordable in terms of computing resources. Convolutional Neural Network (CNN) based models are extensively used for image recognition, classification, and visual object detection.

Image credit: CNN encoder-decoder (Badrinarayanan et al., 2017)

CNN models are inherently biased toward capturing image features and efficiently learn layout attributes. If we can encode invoice contents in a 2D format with embedded textual context, CNN models could prove an effective solution for learning, interpreting, and extracting invoice contents.

A few approaches proposed in recent times have proved to perform well.

In an image tensor, the third dimension is usually used to encode the color information (RGB). We can instead use the third dimension to encode the meaning of the text, at either the character or the word level. Character-level encoding can simply be implemented with one-hot encoding; for word-level encoding, pre-trained word-embedding models like Word2Vec can be used to derive the text vector. To a CNN, it does not matter whether the third dimension holds color bits or text bits; it will learn those patterns either way.
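A minimal NumPy sketch of the character-level variant: the depth dimension holds a one-hot character encoding instead of RGB channels. The grid size, vocabulary, and function name here are illustrative assumptions, not details from any particular paper.

```python
import numpy as np

# Hypothetical chargrid-style input tensor: depth = one channel per
# character in a small vocabulary, plus one "background" channel.
VOCAB = "abcdefghijklmnopqrstuvwxyz0123456789"
DEPTH = len(VOCAB) + 1          # 36 characters + background = 37
GRID_H, GRID_W = 8, 16          # tiny grid, just for illustration

def encode_chargrid(placed_chars):
    """placed_chars: list of (row, col, char) tuples, e.g. from an OCR step."""
    grid = np.zeros((GRID_H, GRID_W, DEPTH), dtype=np.float32)
    grid[:, :, 0] = 1.0         # every cell starts as background
    for row, col, ch in placed_chars:
        idx = VOCAB.find(ch.lower())
        if idx >= 0:
            grid[row, col, 0] = 0.0          # no longer background
            grid[row, col, idx + 1] = 1.0    # one-hot character channel
    return grid

# Place the word "total" on row 2, starting at column 3.
tensor = encode_chargrid([(2, 3 + i, c) for i, c in enumerate("total")])
print(tensor.shape)             # (8, 16, 37)
```

A real pipeline would use page-scale grids and, for the word-level variant, replace the one-hot depth vector with a Word2Vec embedding; the convolution itself is indifferent to what the depth channels mean.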

The computer-vision pipeline for invoice extraction uses a fully convolutional encoder-decoder network that predicts a segmentation mask and bounding boxes. Each document page is encoded as a two-dimensional grid. For an optimized CNN model it is advisable to keep the input matrix's width and height comparatively small, resulting in a lighter memory footprint and fewer network parameters.

This can be achieved with a reasonably sized 2D layout grid, in which each word's position is mapped to one of the grid slots. The grid can be smaller than 80×80 cells, each cell representing the position of one text word in the invoice.
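As a sketch of that mapping, assuming OCR output gives each word a normalized center coordinate on the page (the 80×80 size and function name are assumptions for illustration):

```python
import numpy as np

GRID = 80   # 80x80 word grid, as suggested in the text

def words_to_grid(word_boxes):
    """word_boxes: list of (x_center, y_center) in [0, 1) page coordinates.
    Returns an 80x80 int grid; a cell holds word index + 1, 0 = empty."""
    grid = np.zeros((GRID, GRID), dtype=np.int32)
    for i, (x, y) in enumerate(word_boxes):
        row = int(y * GRID)
        col = int(x * GRID)
        grid[row, col] = i + 1   # on collision, later words overwrite
    return grid

# Two words: "Invoice" near the top-left, "Total" near the bottom-right.
g = words_to_grid([(0.10, 0.05), (0.85, 0.92)])
print(g[4, 8], g[73, 68])        # 1 2
```

In practice each occupied cell would carry a depth vector (embedding plus other features) rather than a bare index, and collisions between nearby words would need a resolution rule.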

The following three feature vectors can be encoded along the depth (i.e., the third dimension) of the CNN input matrix:

  • Word embedding vector
  • Color encoding vector
  • Text size encoding vector

The network architecture can follow one of two approaches to handle the three feature vectors:

  • For each word position, create a consolidated depth vector by concatenating the three feature vectors before feeding it into the encoder network [concatenate Word2Vec + one-hot size encoding + one-hot color encoding].
  • Alternatively, use three encoders, each taking the depth vector for one of the above features, and concatenate the encoder outputs before feeding them into the decoder.
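The first approach (early concatenation) can be sketched per grid cell as follows; the embedding size and the numbers of size/color classes are illustrative assumptions:

```python
import numpy as np

EMB_DIM, N_SIZES, N_COLORS = 50, 4, 3   # assumed dimensions
rng = np.random.default_rng(0)

def cell_depth_vector(word_vec, size_id, color_id):
    """Concatenate word embedding + one-hot size + one-hot color
    into a single depth vector for one grid cell."""
    size_onehot = np.eye(N_SIZES, dtype=np.float32)[size_id]
    color_onehot = np.eye(N_COLORS, dtype=np.float32)[color_id]
    return np.concatenate([word_vec, size_onehot, color_onehot])

# Stand-in for a Word2Vec lookup of one word.
word_vec = rng.standard_normal(EMB_DIM).astype(np.float32)
v = cell_depth_vector(word_vec, size_id=2, color_id=0)
print(v.shape)                   # (57,) = 50 + 4 + 3
```

The second approach would instead run three separate encoders, one per feature channel, and concatenate their output feature maps before the decoder; early concatenation is simpler, while separate encoders let each feature learn its own spatial filters.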

The results achieved with CNN-based encoder-decoder models are really encouraging.

The following research papers, published in recent years, leverage 2D layouts with textual context embedded along the depth dimension.

BERTgrid: Contextualized Embedding for 2D Document Representation and Understanding

https://arxiv.org/abs/1909.04948

Image Credits: BERTGrid

CUTIE: Learning to Understand Documents with Convolutional Universal Text Information Extractor

https://arxiv.org/abs/1903.12363

Chargrid: Towards Understanding 2D Documents

https://arxiv.org/abs/1809.08799

Recently, the big cloud players Google and Amazon have also made services available for name-value data extraction from forms and invoices.

Google Document-AI:

https://cloud.google.com/document-ai/

Document AI is a Google service that detects and processes text in small and large form documents, as well as documents containing tables.

The Google AI Blog discusses the methods used in Google's service for invoice parsing.

Amazon Textract:

Amazon Textract is a fully managed machine-learning service that automatically extracts text and data from scanned documents.

There is huge demand to fill this space with intelligent automation, and everyone is eyeing a bigger piece of the pie.
