
Evaluating Google Tesseract


Tesseract is an optical character recognition (OCR) engine. It is free software, released under the Apache License, Version 2.0, and its development has been sponsored by Google since 2006. It is considered one of the most accurate open source OCR engines currently available.

OCR is a very hard problem. Font variations, image noise, and alignment problems make it extremely difficult to design an algorithm that reliably translates an image of text into actual text.

I conducted a series of tests with machine-printed documents and hand-written documents:
  • Machine-printed documents 
    • Font: Serif and Sans Serif (3 each)
    • Font: One decorative style and one script type style
    • Layout: Left-aligned text (3 serif, 3 sans serif) and justified text (3 serif and 3 sans serif)
    • With figures (one serif, one sans serif)
  • Hand-written documents
    • A one-paragraph self-description in my own handwriting
    • One-paragraph self-descriptions written by two other people
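Each test image can be passed to Tesseract from the command line. Below is a minimal sketch of how that could be scripted, not the exact setup used for these tests. It assumes the `tesseract` executable is installed and on the PATH, and that it is a recent version (4.x or later) where the page segmentation flag is spelled `--psm`. The function names are hypothetical.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, lang="eng", psm=None):
    """Build the argument list for a Tesseract CLI call.

    Using "stdout" as the output base makes Tesseract print the
    recognized text instead of writing it to a .txt file.
    """
    cmd = ["tesseract", image_path, "stdout", "-l", lang]
    if psm is not None:
        # Assumption: Tesseract 4.x or later; older 3.x releases used "-psm".
        cmd += ["--psm", str(psm)]
    return cmd


def run_tesseract(image_path, lang="eng", psm=None):
    """Run Tesseract on one image and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    result = subprocess.run(
        build_tesseract_command(image_path, lang, psm),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if Tesseract fails
    )
    return result.stdout
```

Calling `run_tesseract` once per test image and saving each result makes it easy to compare the outputs across fonts and layouts afterwards.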

I used DejaVu Serif, Times New Roman, and Cambria for the serif fonts, and Arial, Calibri, and DejaVu Sans for the sans serif fonts. Tesseract's output for these images ranged from good to perfect. The application handles simple fonts well.

      
Serif Text Fonts



Sans Serif Text Fonts


For the script and decorative font styles, I used Dancing Script and Rock Salt, respectively. Tesseract's output for these documents was a disaster. (Hu hu hu) I think the font variation and the spacing between characters affect the application's decisions.
Dancing Script
Rock Salt


I think the justified layout does not improve character detection for the serif and sans serif documents. The output is almost the same as for the left-aligned documents.
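One way to put a number on "almost the same" is to score each OCR output against the known source text. The sketch below is an illustration rather than the method used in this post: it computes a character-level accuracy from the Levenshtein edit distance. The function names are hypothetical, and the metric (1 minus edit distance over reference length) is one common choice among several.

```python
def levenshtein(a, b):
    """Return the edit distance (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_accuracy(reference, ocr_output):
    """Score OCR output against the reference text, from 0.0 to 1.0."""
    if not reference:
        return 1.0 if not ocr_output else 0.0
    distance = levenshtein(reference, ocr_output)
    # Clamp at 0.0 because very long garbage output can exceed the reference.
    return max(0.0, 1.0 - distance / len(reference))
```

Scoring the left-aligned and justified outputs against the same source paragraph would show directly whether the layout makes a measurable difference.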

Also, images in the document don't affect character recognition. Tesseract simply ignores them. Here is an example of a document with an image.



For the handwritten documents, the output varies with the cleanliness and clarity of the handwriting. Also, some letters can't be detected by the application. The images I used were preprocessed to get a clearer view of the characters. Here are my examples of handwritten documents.
My Self Description

Monina's Self Description
Erin's Self Description
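The preprocessing applied to the handwriting images is not described in detail here. As one plausible example, and purely as an assumption, the sketch below binarizes a grayscale image with Otsu's thresholding, a common cleanup step before OCR. It uses plain Python lists of 0-255 pixel values so it runs without any image library. The function names are hypothetical.

```python
def otsu_threshold(pixels):
    """Pick the threshold that maximizes between-class variance.

    `pixels` is a flat list of integer gray values in the range 0-255.
    """
    histogram = [0] * 256
    for value in pixels:
        histogram[value] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(histogram))

    best_threshold, best_variance = 0, -1.0
    weight_dark, sum_dark = 0, 0
    for level in range(256):
        weight_dark += histogram[level]
        sum_dark += level * histogram[level]
        if weight_dark == 0:
            continue  # no pixels in the dark class yet
        weight_bright = total - weight_dark
        if weight_bright == 0:
            break  # every pixel is already in the dark class
        mean_dark = sum_dark / weight_dark
        mean_bright = (sum_all - sum_dark) / weight_bright
        variance = weight_dark * weight_bright * (mean_dark - mean_bright) ** 2
        if variance > best_variance:
            best_threshold, best_variance = level, variance
    return best_threshold


def binarize(pixels, threshold):
    """Map pixels above the threshold to white (255), the rest to black (0)."""
    return [255 if value > threshold else 0 for value in pixels]
```

For dark ink on light paper, the result is black strokes on a clean white background, which is generally easier for an OCR engine to read than a noisy scan.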

In conclusion, Google Tesseract works best with clean, everyday fonts like the serif and sans serif fonts I tested. Its output with decorative and script fonts ranges from bad to worst: it can't detect most of the characters, and the output is garbage-like text. Handwritten documents fall in between, because the output varies with the person's handwriting. I think Google Tesseract must continue to improve its training data for text detection and add important features like layout analysis and a user interface. But all in all, the application is superb. It looks as if a high-quality open-source OCR solution is on the horizon.


*The article I used is from bleacherreport.com 
