
Evaluating Google Tesseract


Tesseract is an optical character recognition (OCR) engine. It is free software, released under the Apache License, Version 2.0, and its development has been sponsored by Google since 2006. It is considered one of the most accurate open-source OCR engines currently available.

OCR is a very hard problem. Font variations, image noise, and alignment problems make it extremely difficult to design an algorithm that reliably translates an image of text into actual text.

I conducted a series of tests with machine-printed and handwritten documents:
  • Machine-printed documents 
    • Font: Serif and Sans Serif (3 each)
    • Font: One decorative style and one script type style
    • Layout: Left-aligned text (3 serif, 3 sans serif) and justified text (3 serif and 3 sans serif)
    • With figures (one serif, one sans serif)
  • Hand-written documents
    • A one-paragraph self-description in my own handwriting
    • One-paragraph self-descriptions written by two other people

I used DejaVu Serif, Times New Roman, and Cambria for the serif fonts, and Arial, Calibri, and DejaVu Sans for the sans serif fonts. Tesseract's output on these images ranged from good to perfect: it handles simple fonts well.
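For anyone who wants to repeat this kind of test, here is a minimal sketch of calling the Tesseract command-line tool from Python. It is not the exact setup I used. The function names and the file name `serif_sample.png` are hypothetical, and I am assuming the `tesseract` executable is on your PATH. Passing `stdout` as the output base makes Tesseract print the recognized text, `-l` selects the language, and `--psm` is the page segmentation flag as spelled in Tesseract 4 and later (older 3.x releases spelled it `-psm`).

```python
import shutil
import subprocess


def build_tesseract_command(image_path, lang="eng", psm=None):
    """Build the argument list for one Tesseract CLI call (hypothetical helper)."""
    # "stdout" as the output base tells Tesseract to print the text
    # instead of writing it to a .txt file.
    cmd = ["tesseract", image_path, "stdout", "-l", lang]
    if psm is not None:
        # Assumption: Tesseract 4+ flag spelling.
        cmd += ["--psm", str(psm)]
    return cmd


def ocr_image(image_path, lang="eng", psm=None):
    """Run Tesseract on one image and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    result = subprocess.run(
        build_tesseract_command(image_path, lang, psm),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Hypothetical file name; replace it with one of your own test images.
    print(build_tesseract_command("serif_sample.png"))
```

Keeping the command builder separate from the function that runs it makes it easy to check the arguments without having Tesseract installed.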

      
Serif Text Fonts



Sans Serif Text Fonts


For the script and decorative font styles, I used Dancing Script and Rock Salt, respectively. Tesseract's output on these documents was a disaster. (Hu hu hu) I think the variation in the letterforms and the spacing between characters affect the application's decisions.
Dancing Script
Rock Salt


I think the justified layout did not improve character detection for the serif and sans serif documents. The output was almost the same as for the left-aligned documents.
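I judged "almost the same" by eye, but a comparison like this can be put into numbers. Below is a small sketch that computes the character error rate (CER): the edit distance between the OCR output and the original text, divided by the length of the original text. I did not compute this in my tests. The function names and the sample strings are made up for illustration.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_error_rate(reference, ocr_output):
    """Edit distance divided by the reference length; 0.0 means a perfect match."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return levenshtein(reference, ocr_output) / len(reference)


if __name__ == "__main__":
    # Made-up strings standing in for the source text and two OCR outputs.
    reference = "The quick brown fox"
    left_aligned_output = "The quick brown fox"
    justified_output = "The quick brovvn fox"
    print(character_error_rate(reference, left_aligned_output))
    print(character_error_rate(reference, justified_output))
```

If two layouts give nearly equal CER values against the same source text, that supports the impression that the layout makes little difference.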

Also, the images in a document did not affect character recognition. Tesseract simply ignored them. Here is an example of a document with an image.



For the handwritten documents, the output of the application varies with the cleanliness and clarity of the handwriting. Also, some letters could not be detected at all. I preprocessed the images beforehand to get a clearer view of the characters. Here are my handwriting examples.
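One common preprocessing step for scanned handwriting is binarization with Otsu's method, which picks the gray level that best separates dark ink from light paper. The sketch below is only an illustration under that assumption, not a record of the exact steps I applied. It works on a flat list of 0-255 grayscale values so it needs no image library, and the function names are hypothetical.

```python
def otsu_threshold(pixels):
    """Return the gray level (0-255) that maximizes the between-class variance."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    weight_dark = 0   # number of pixels at or below the candidate threshold
    sum_dark = 0      # sum of their gray levels
    best_t, best_var = 0, -1.0
    for t in range(256):
        weight_dark += hist[t]
        sum_dark += t * hist[t]
        if weight_dark == 0:
            continue
        weight_light = total - weight_dark
        if weight_light == 0:
            break
        mean_dark = sum_dark / weight_dark
        mean_light = (sum_all - sum_dark) / weight_light
        between_var = weight_dark * weight_light * (mean_dark - mean_light) ** 2
        if between_var > best_var:
            best_var, best_t = between_var, t
    return best_t


def binarize(pixels):
    """Map each pixel to 0 (ink) or 255 (paper) using the Otsu threshold."""
    t = otsu_threshold(pixels)
    return [255 if p > t else 0 for p in pixels]


if __name__ == "__main__":
    # Made-up data: dark ink strokes (around 10) on light paper (around 200).
    sample = [10] * 50 + [200] * 50
    print(otsu_threshold(sample))
```

With a real scan you would read the grayscale pixel values using an image library, binarize them, and save the result before passing it to Tesseract.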
My Self Description

Monina's Self Description
Erin's Self Description

In conclusion, Google Tesseract works best with clean, everyday fonts such as the serif and sans serif fonts I tested. Its output on decorative and script fonts ranges from bad to worst: it fails to detect most of the characters and produces garbage text. Handwritten documents fall in between, because the output depends on each person's handwriting. I think Google Tesseract should keep improving its training data for text detection and add important features such as layout analysis and a user interface. All in all, though, the application is superb. It looks as if a high-quality open-source OCR solution is on the horizon.


*The article I used is from bleacherreport.com 
