OCR: From Concept to Deployment

Optical Character Recognition (OCR) is about teaching computers to read or, in more technical terms, to extract text from an image or scanned document. It is now widely used in applications such as document scanning, receipt scanning, automated data entry, and many others.

In this blog, we are going to focus on Tesseract, one of the most widely used open-source OCR engines. In future blogs, we will explore and compare some of the other algorithms as well.

A Little Bit of History on OCR

Before delving into building our first OCR, let’s take a short history lesson to get an idea of how significant OCR has been through the years. Back in 1913, Dr. Edmund Fournier d’Albe invented the Optophone.

The Optophone was the first OCR machine, invented to help the visually impaired read. It worked by scanning a document and producing a different sound for each character it detected. Mary Jameson was the first visually impaired person to read a full book, thanks to the Optophone and OCR.

Available OCR Algorithms

Currently, there are many OCR algorithms. Some are free and open-source, like Tesseract, Kraken, and EasyOCR; others are commercial, like the Google Cloud Platform OCR API and ABBYY FineReader.

How Does Tesseract Work?

Tesseract was originally developed at Hewlett-Packard (yes, the HP) between 1985 and 1994, with Ray Smith as its lead developer, and is written in C and C++. It was released as open source in 2005, and Google sponsored its development for many years afterward. Today, Tesseract supports more than 100 languages! Let’s go into the Tesseract process.

Tesseract starts with its own pre-processing of the image, applying adaptive binarization (converting the image from its initial color space to a binary one). Then the document is divided into text lines, lines are divided into words, and words are segmented into characters.

Character recognition then proceeds as a two-pass process. In the first pass, an attempt is made to recognize each word in turn. Each word that is satisfactory is passed to an adaptive classifier as training data. The adaptive classifier then gets a chance to more accurately recognize text lower down the page.
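To make the binarization step more concrete, here is a minimal sketch of local-mean adaptive thresholding in plain Python. It only illustrates the general idea and is not Tesseract's actual implementation; the function name `adaptive_binarize`, the window radius, and the offset are assumptions chosen for this example.

```python
def adaptive_binarize(gray, radius=1, offset=10):
    """Binarize a grayscale image given as a list of rows of 0-255 ints.

    Each pixel is compared with the mean of its local neighbourhood, so
    the threshold adapts to uneven lighting across the page.
    Returns rows of 0 (dark, likely ink) and 1 (light, likely background).
    NOTE: illustrative sketch only, not Tesseract's own algorithm.
    """
    height, width = len(gray), len(gray[0])
    binary = []
    for y in range(height):
        row = []
        for x in range(width):
            # Collect the neighbourhood, clipped at the image borders
            window = [
                gray[ny][nx]
                for ny in range(max(0, y - radius), min(height, y + radius + 1))
                for nx in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            local_mean = sum(window) / len(window)
            # The offset keeps small variations in flat regions from counting as ink
            row.append(1 if gray[y][x] > local_mean - offset else 0)
        binary.append(row)
    return binary


# A tiny 3x4 "image": a dark vertical stroke on a light background
sample = [
    [200, 20, 200, 210],
    [190, 20, 205, 215],
    [195, 25, 200, 220],
]
result = adaptive_binarize(sample)
for row in result:
    print(row)  # each row prints as [1, 0, 1, 1]
```

Because the threshold is computed per pixel, the dark stroke is separated from the background even though the background brightness drifts from left to right.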

Making our First OCR

Enough rambling; let’s delve into building our very own OCR script.

Requirements and Installation

First of all, you are going to need to install these things:

  • Python
  • OpenCV
  • Tesseract: the OCR algorithm of choice, for now. In the next blog of this series, we are going to compare different OCR algorithms and see which is better for which application.

To install Tesseract, simply run the following command:

Ubuntu:

sudo apt install tesseract-ocr

macOS:

brew install tesseract

Windows:

Go to the official Tesseract installation page and download the installer that matches your version of Windows (32- or 64-bit). Install it as you would any other software on a Windows machine.

To test that Tesseract is installed correctly, open up your terminal and run the following command:

tesseract --version

If the installation succeeded, the command prints the installed Tesseract version, followed by the versions of the image libraries it was built with.

For the other libraries, simply run the following commands:

pip install numpy
pip install opencv-python
pip install pytesseract
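Once everything is installed, a quick sanity check from Python can save some debugging later. The sketch below uses only the standard library; the function name `check_ocr_setup` is made up for this example, and it assumes the `tesseract` executable is expected on your PATH.

```python
import importlib.util
import shutil


def check_ocr_setup():
    """Report which pieces of the OCR toolchain can be found."""
    report = {
        # Path to the tesseract executable, or None if it is not on PATH
        "tesseract": shutil.which("tesseract"),
    }
    # Import names of the Python packages installed with pip
    # (opencv-python is imported as cv2)
    for module in ("numpy", "cv2", "pytesseract"):
        report[module] = importlib.util.find_spec(module) is not None
    return report


for name, status in check_ocr_setup().items():
    print(f"{name}: {status if status else 'NOT FOUND'}")
```

If any entry prints NOT FOUND, repeat the matching installation step before moving on.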

Apply Tesseract

We will start by importing the necessary libraries, then read the image using OpenCV. Finally, we will call Tesseract to read the text in the image.

Here we import the libraries, read the image, and then call the image_to_string function from PyTesseract to extract the text from the image.
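A minimal sketch of these steps might look like the following. The function name `read_text` and the image path in the usage comment are hypothetical, and the default `--oem 3 --psm 6` configuration is borrowed from the Arabic example later in this post.

```python
def read_text(image_path, config="--oem 3 --psm 6"):
    """Read an image with OpenCV and return the text Tesseract finds in it."""
    # Imported here so the function can be defined even before the OCR
    # libraries are installed; top-level imports work just as well.
    import cv2
    import pytesseract

    img = cv2.imread(image_path)
    if img is None:
        # cv2.imread returns None instead of raising when the file is unreadable
        raise FileNotFoundError(f"Could not read image: {image_path}")

    # OpenCV loads images in BGR channel order; convert to RGB for Tesseract
    rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    return pytesseract.image_to_string(rgb, config=config).strip()


# Example usage (hypothetical path):
# print(f"Output: {read_text('data/image1.png')}")
```

Checking for `None` right after `cv2.imread` is worth the extra lines, because a mistyped path otherwise surfaces as a confusing error further down.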

Output: NO TEXTING WHILE DRIVING

“What are OEM and PSM, then?” That is quite a long story, and you do not need to worry about them at the moment. In their simplest form:

OEM (OCR Engine Mode)
Controls the engine that Tesseract uses. You can use Legacy mode, the original engine that does not use deep learning; LSTM mode, which uses a recurrent neural network, more specifically a Long Short-Term Memory (LSTM) network; or Default, which automatically chooses a mode for you.

0 = Legacy engine only.
1 = Neural nets LSTM engine only.
2 = Legacy + LSTM engines.
3 = Default, based on what is available.

PSM (Page Segmentation Mode)
Specifies the layout of the input page (image).

0 = Orientation and script detection (OSD) only.
1 = Automatic page segmentation with OSD.
2 = Automatic page segmentation, but no OSD, or OCR.
3 = Fully automatic page segmentation, but no OSD. (Default)
4 = Assume a single column of text of variable sizes.
5 = Assume a single uniform block of vertically aligned text.
6 = Assume a single uniform block of text.
7 = Treat the image as a single text line.
8 = Treat the image as a single word.
9 = Treat the image as a single word in a circle.
10 = Treat the image as a single character.
11 = Sparse text. Find as much text as possible in no particular order.
12 = Sparse text with OSD.
13 = Raw line. Treat the image as a single text line, bypassing hacks that are Tesseract-specific.
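Both values are passed to Tesseract in a single configuration string. As a small sketch, the helper below builds that string and rejects values outside the ranges listed above; the name `build_config` is hypothetical and not part of PyTesseract.

```python
def build_config(oem=3, psm=3):
    """Build a Tesseract config string such as '--oem 3 --psm 6'.

    Hypothetical helper: it only validates the ranges listed above
    (OEM 0-3, PSM 0-13) and formats the two flags.
    """
    if oem not in range(4):
        raise ValueError(f"oem must be between 0 and 3, got {oem}")
    if psm not in range(14):
        raise ValueError(f"psm must be between 0 and 13, got {psm}")
    return f"--oem {oem} --psm {psm}"


print(build_config())              # --oem 3 --psm 3
print(build_config(oem=1, psm=7))  # --oem 1 --psm 7
```

The result can then be passed along as `pytesseract.image_to_string(img, config=build_config(psm=7))`, for instance when the image holds a single line of text.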

More Fun with Tesseract

1. Detecting Text Language

Tesseract has a very cool feature, orientation and script detection (OSD), that tells the user which script a document is written in, which in turn narrows down its language.

import re
import pytesseract

# Run Tesseract OSD on the image loaded earlier with cv2.imread
osd_output = pytesseract.image_to_osd(img)
print(f'Full OSD output:\n{osd_output}')

# Select the script from the OSD output (fall back if it is missing)
match = re.search(r'(?<=Script: )[a-zA-Z]+', osd_output)
text_language = match.group(0) if match else 'unknown'
print(f'Language: {text_language}')

The OSD output contains some useful information about the document, such as the page number, orientation, and script. We are most interested in the script, which is what the regular expression in the snippet extracts: it captures the word that follows “Script: ”, and the snippet prints it as the language.

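The OSD output:

Full OSD output:
Page number: 0
Orientation in degrees: 0
Rotate: 0
Orientation confidence: 4.83
Script: Latin
Script confidence: 3.84

Language: Latin

Note: “Latin” here is the script (the writing system), not a language. English is written in the Latin script, as are many other languages.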
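If you need more than one field, it can be handier to parse the whole OSD report at once instead of writing one regular expression per field. Here is a minimal sketch, assuming the plain-text “key: value” layout shown above; `parse_osd` is a name made up for this example.

```python
def parse_osd(osd_output):
    """Turn Tesseract's 'key: value' OSD text into a dictionary of strings."""
    fields = {}
    for line in osd_output.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines that contain no colon
            fields[key.strip()] = value.strip()
    return fields


# Sample text copied from the OSD output shown above
sample_osd = """Page number: 0
Orientation in degrees: 0
Rotate: 0
Orientation confidence: 4.83
Script: Latin
Script confidence: 3.84"""

info = parse_osd(sample_osd)
print(info["Script"])       # Latin
print(int(info["Rotate"]))  # 0
```

All values stay strings, so convert them with `int` or `float` where you need numbers.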

2. Text Matching

Using Tesseract, we can also check for the presence of a certain word in a document. For example, suppose we have a large database of documents and we want to search for a certain word in all of them. We can simply run Tesseract on those documents and write a very simple search routine that looks for that keyword in the extracted text.
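A minimal sketch of such a search is shown here. The function names `find_keyword` and `report` are hypothetical, and the sample text stands in for whatever `pytesseract.image_to_string` returns for your document.

```python
import re


def find_keyword(text, keyword):
    """Return True if `keyword` occurs in `text` as a whole word, ignoring case."""
    pattern = rf"\b{re.escape(keyword)}\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


def report(text, keyword):
    """Build a short human-readable message about the search result."""
    if find_keyword(text, keyword):
        return f"I found the keyword `{keyword}`"
    return f"I did not find the keyword `{keyword}`"


# Stand-in for OCR output; normally this comes from pytesseract.image_to_string(...)
text = "NO TEXTING WHILE DRIVING"
message = report(text, "Tesseract")
print(f"Output: {message}")
```

Matching whole words case-insensitively is a design choice for this sketch; OCR output often differs in case from the original, and a plain substring test would also match inside longer words.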

Output: I did not find the keyword `Tesseract`

3. Tesseract for Other Languages

To use Tesseract to recognize text in another language, all we need to do is:

  • Install the data package for that language, where {lang} is the language code:

sudo apt install tesseract-ocr-{lang}

  • For example, to install the Arabic package for Tesseract:

sudo apt install tesseract-ocr-ara

  • Then specify that language in your code:

import cv2
import pytesseract

# Read image
img = cv2.imread('data/image3.png')

# Run Tesseract using the Arabic model
text = pytesseract.image_to_string(img, lang='ara', config='--oem 3 --psm 6')
print(f'I found the following text in the image: {text}')

Output: تفائلوا بالخير تجدوه.

That’s it! You have successfully used your very first OCR algorithm. Congratulations!
In this post, we covered a general introduction to OCR, how to install Tesseract and the other important Python libraries, and we got our hands dirty working with Tesseract.

In the next blog post, we are going to compare different OCR algorithms and analyze their pros and cons and when to use each of them.
Afterward, we are going to use Tesseract for a more advanced application: document scanning.

Author: Pavly Salah
Editor: Sherif Adel