OCR as a solution to avoid manual document entry

How many times have you had to type the data from an expense report into your computer by hand for your company's accounts? With a recognition accuracy of 99% according to a study published on ResearchGate in 2017, Optical Character Recognition (OCR) can change your daily life: here's how this revolutionary business process works.

What is OCR?

OCR stands for Optical Character Recognition.

The term refers to the process by which a file containing images of text is automatically converted into a text document.

In practice, the image sources that can be scanned are:

  • the printed document ;
  • the digital photo (taken by a camera) ;
  • the typewritten sheet.

From there, the scan converts the content into one of the following formats:

  • plain text ;
  • word-processing document (.doc, .docx) ;
  • XML file.

Today, the most powerful OCR software can identify and preserve certain formatting attributes of the text, such as:

  • bold ;
  • underlining ;
  • italics ;
  • font size ;
  • the type of font ;
  • page layout (line breaks, indentation) ;
  • illustrations (tables, graphs and images).

OCR might seem like a very recent practice, but it is not: as early as 1929, Gustav Tauschek, an Austrian engineer, patented one of the first OCR systems in Germany. It was based on a photosensitive detector capable of recognizing and separating certain sequences of characters by comparing them with a library of templates stored in the machine, thus reconstituting words.

Since then, of course, technological progress has developed this practice and made it part of the daily life of individuals and companies alike.

Good to know: in 1976, Stevie Wonder himself was one of the first buyers of the reading machine designed by Ray Kurzweil, an OCR-based device that read printed text aloud to blind people (although it was rudimentary at the time!).

OCR: how does it work?

In practice, scanning a document in image format involves five steps.

#1: Pre-analysis of the image

First, the image is brought up to the OCR software's requirements through possible improvements such as straightening and cropping, contrast and brightness adjustment, conversion to two-color mode (black and white), and edge detection.
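As an illustration of the switch to two-color mode, here is a minimal sketch in Python. It assumes the image is already available as rows of grayscale values from 0 (black) to 255 (white); the function name `binarize` and the use of the mean brightness as the default threshold are assumptions made for this example, not what any particular OCR product does.

```python
def binarize(gray, threshold=None):
    """Convert a grayscale image (rows of 0-255 ints) to black and white.

    Returns rows of 1 (ink) and 0 (background). If no threshold is given,
    the mean brightness of the image is used, which is a crude assumption.
    """
    pixels = [p for row in gray for p in row]
    if threshold is None:
        threshold = sum(pixels) / len(pixels)
    # Pixels darker than the threshold are treated as ink.
    return [[1 if p < threshold else 0 for p in row] for row in gray]


# A tiny hypothetical 3x4 "scan": light paper with a few dark pixels.
page = [
    [250, 245, 30, 240],
    [248, 20, 25, 243],
    [251, 247, 28, 246],
]
print(binarize(page))
```

Real software typically chooses the threshold more carefully (for example per region of the page), but the principle of separating ink from background is the same.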

#2: Image segmentation

The image is then segmented into lines of text, and then into individual characters. This step identifies and isolates each symbol in the text.
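One common way to picture this step is the projection profile: count the ink pixels in each row to find the lines, then count them in each column of a line to find the characters. The sketch below assumes a black-and-white image given as rows of 1 (ink) and 0 (background), with characters separated by at least one empty column. The names `runs` and `segment` are hypothetical.

```python
def runs(profile):
    """Return (start, end) index pairs for each run of non-zero values."""
    spans, start = [], None
    for i, value in enumerate(profile):
        if value and start is None:
            start = i
        elif not value and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(profile)))
    return spans


def segment(binary):
    """Split a binary image (1 = ink) into lines, then into characters."""
    characters = []
    row_profile = [sum(row) for row in binary]
    for top, bottom in runs(row_profile):
        line = binary[top:bottom]
        # zip(*line) iterates over the columns of the line.
        col_profile = [sum(col) for col in zip(*line)]
        for left, right in runs(col_profile):
            characters.append([row[left:right] for row in line])
    return characters


image = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 0, 0],
    [0, 1, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0, 0],
]
for glyph in segment(image):
    print(glyph)
```

Touching or overlapping characters defeat this simple approach, which is one reason segmentation remains a source of errors.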

#3: Character recognition

Once each character is isolated, it must be recognized. Three methods exist:

  • the metric method, which compares the character with every model stored in the software library without any prior classification (this method is not very popular because it is considered more time-consuming) ;
  • the feature method, which matches each character against a bank of between 100 and 300 point clouds and selects the closest one ;
  • the statistical or probabilistic method, which is better suited to handwritten characters, which are harder to recognize.
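The metric method can be sketched as a direct comparison between the isolated character and every stored model. The example below is a toy illustration under strong assumptions: characters are 3x3 bitmaps written as strings of "0" and "1", the library `TEMPLATES` holds only three made-up models, and the distance is simply the number of differing pixels.

```python
# Hypothetical template library: one 3x3 bitmap per known character.
TEMPLATES = {
    "I": ["010", "010", "010"],
    "L": ["100", "100", "111"],
    "T": ["111", "010", "010"],
}


def distance(glyph, template):
    """Count the pixels that differ between two same-sized bitmaps."""
    return sum(
        a != b
        for glyph_row, template_row in zip(glyph, template)
        for a, b in zip(glyph_row, template_row)
    )


def recognize(glyph):
    """Return the label of the template with the smallest pixel distance."""
    return min(TEMPLATES, key=lambda label: distance(glyph, TEMPLATES[label]))


noisy_t = ["111", "010", "011"]  # a "T" with one stray pixel
print(recognize(noisy_t))
```

Because every character is compared with every model, the cost grows with the size of the library, which matches the remark above that this method is considered time-consuming.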

#4: Post-processing

Once each character has been identified, post-processing applies linguistic context to reduce the errors that come from identifying each character in isolation. The software uses grammar rules, a built-in word dictionary or n-grams (sequences of characters or words).
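The dictionary part of this step can be illustrated with Python's standard `difflib` module, which finds the closest entry in a word list. The vocabulary, the `correct` function and the similarity cutoff of 0.75 are assumptions chosen for this sketch; real software combines this kind of lookup with grammar rules and n-gram statistics.

```python
import difflib

# Hypothetical built-in dictionary, reduced to a few expense-related words.
VOCABULARY = ["receipt", "hotel", "restaurant", "total", "amount"]


def correct(word, vocabulary=VOCABULARY, cutoff=0.75):
    """Replace a word with its closest vocabulary entry, if one is close enough."""
    matches = difflib.get_close_matches(word.lower(), vocabulary, n=1, cutoff=cutoff)
    # Leave the word untouched when nothing in the dictionary resembles it.
    return matches[0] if matches else word


# Typical OCR confusion: the letters "l" and "i" read as the digit "1".
print([correct(w) for w in "hote1 rece1pt tota1".split()])
```

The cutoff is a trade-off: set too low, valid but unusual words get "corrected" into dictionary words; set too high, genuine recognition errors slip through.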

#5: Generating the output format

Finally, the output format is generated according to the user's preferences. Note that typed documents are much more easily converted by OCR than handwritten documents. 
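To make this last step concrete, here is a small sketch that writes the same recognized lines either as plain text or as XML, two of the output formats mentioned earlier. The element names `document` and `line` are invented for the example and do not correspond to any standard OCR schema.

```python
import xml.etree.ElementTree as ET


def to_plain_text(lines):
    """Join the recognized lines with newlines."""
    return "\n".join(lines)


def to_xml(lines):
    """Wrap each recognized line in a numbered <line> element."""
    root = ET.Element("document")
    for number, text in enumerate(lines, start=1):
        line = ET.SubElement(root, "line", number=str(number))
        line.text = text
    return ET.tostring(root, encoding="unicode")


recognized = ["Hotel Example", "Total 42.00"]
print(to_plain_text(recognized))
print(to_xml(recognized))
```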

Also, many programs include a learning feature that continually enriches and adapts their character set. Even so, it is still impossible to reduce the conversion error rate to zero because of factors such as:

  • the quality of the initial image document ;
  • the fonts used ;
  • nearby annotations ;
  • the shape of the text.

Examples of OCR

OCR is used in many fields, including the digitization of expense reports: this is what Jenji's Beyond Limits OCR (ADDI) offers.

Using artificial intelligence based on deep learning, the software offers one of the lowest error rates on the market. It not only converts the image to text, but also sorts that text into the corresponding fields, such as the amount, the expense category or the identity of the seller.

This OCR system works on the web as well as on mobile devices (iOS and Android) and enables the processing of receipts such as hotel bills, restaurant receipts and cash register receipts.
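To give an idea of what sorting recognized text into fields involves, here is a deliberately simple, rule-based sketch. It is not the deep learning approach described above and does not reflect how Jenji's software works: the function `extract_fields`, the regular expression and the rule "the seller is the first line" are all assumptions made for illustration.

```python
import re

# Look for the word "total", skip any non-digits, then capture an amount
# written with a comma or a dot as the decimal separator.
AMOUNT = re.compile(r"\btotal\b\D*(\d+[.,]\d{2})", re.IGNORECASE)


def extract_fields(text):
    """Pull the seller (first non-empty line) and total amount from receipt text."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    match = AMOUNT.search(text)
    return {
        "seller": lines[0] if lines else None,
        "total": float(match.group(1).replace(",", ".")) if match else None,
    }


# Hypothetical text as it might come out of the recognition steps.
receipt = """Cafe Example
2 x Espresso 5,00
TOTAL EUR 5,00
"""
print(extract_fields(receipt))
```

Hand-written rules like these break as soon as a receipt uses an unexpected layout, which is the kind of variability a learning-based system is meant to absorb.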

Where manual entry of expense reports can consume many hours without adding any value, OCR automates the process to save time and money.

Maria Khizhniakova