
Evaluating Google Tesseract


Tesseract is an optical character recognition (OCR) engine. It is free software released under the Apache License, Version 2.0, and its development has been sponsored by Google since 2006. It is considered one of the most accurate open source OCR engines currently available.

OCR is a very hard problem. Font variations, image noise, and alignment problems make it extremely difficult to design an algorithm that reliably translates an image of text into actual text.

I conducted a series of tests with machine-printed and handwritten documents:
  • Machine-printed documents
    • Font: serif and sans serif (3 each)
    • Font: one decorative style and one script style
    • Layout: left-aligned text (3 serif, 3 sans serif) and justified text (3 serif, 3 sans serif)
    • With figures (one serif, one sans serif)
  • Handwritten documents
    • A one-paragraph self-description in my own handwriting
    • One-paragraph self-descriptions written by two other people
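
For anyone who wants to repeat a test set like this, here is a minimal sketch of one way to run the Tesseract command-line tool from Python. It is not necessarily how I ran my tests: the function names and the example file name are hypothetical, the language and page segmentation values are only assumed defaults, and the `--psm` spelling assumes a recent Tesseract release (older 3.x versions used `-psm`).

```python
import shutil
import subprocess


def build_tesseract_command(image_path, lang="eng", psm=3):
    """Build the argument list for one Tesseract CLI run that prints to stdout."""
    # Passing "stdout" as the output base makes the CLI print the text
    # instead of writing it to a .txt file.
    return ["tesseract", image_path, "stdout", "-l", lang, "--psm", str(psm)]


def run_tesseract(image_path, lang="eng", psm=3):
    """Run Tesseract on one image and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    result = subprocess.run(
        build_tesseract_command(image_path, lang, psm),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Calling `run_tesseract("serif_left_1.png")` (a made-up file name) would return the recognized text as a string, which can then be compared with the original text of the document.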

I used DejaVu Serif, Times New Roman, and Cambria for the serif fonts, and Arial, Calibri, and DejaVu Sans for the sans serif fonts. Tesseract's output on these images ranged from good to perfect. The application handles simple fonts well.
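
I judged the results by eye, but "good to perfect" can also be given a number. The sketch below is one possible way to do that, not the method used in this post: it compares the OCR output with the known source text using Python's standard `difflib` module. The function name `ocr_similarity` is my own, and the score is a generic similarity ratio rather than a standard OCR metric.

```python
import difflib


def ocr_similarity(expected, recognized):
    """Return a similarity ratio in [0, 1] between source text and OCR output."""
    # Collapse runs of whitespace so that differences in line wrapping
    # are not counted as recognition errors.
    expected = " ".join(expected.split())
    recognized = " ".join(recognized.split())
    if not expected and not recognized:
        return 1.0
    matcher = difflib.SequenceMatcher(None, expected, recognized, autojunk=False)
    return matcher.ratio()
```

A score of 1.0 means the two texts match exactly after whitespace is normalized, and a score near 0 means the output shares almost nothing with the source.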

      
Serif Text Fonts



Sans Serif Text Fonts


For the script and decorative font styles, I used Dancing Script and Rock Salt, respectively. Tesseract's output on these documents was a disaster. (Hu hu hu) I think the font variation and the spacing between characters affect the application's decisions.
Dancing Script
Rock Salt


I think the justified layout does not improve character detection for the serif and sans serif documents. The output was almost the same as for the left-aligned documents.

Also, images in a document did not affect character recognition. Tesseract simply ignored them. Here is an example of a document with an image.



For the handwritten documents, the output varied with the cleanliness and clarity of the handwriting, and some letters could not be detected at all. The images I used were also preprocessed to get a clearer view of the characters. Here are my examples of handwritten documents.
My Self Description

Monina's Self Description
Erin's Self Description
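
As an illustration of what preprocessing can look like, here is a minimal sketch of Otsu thresholding, which turns a grayscale image into pure black and white. This is an assumed example and not necessarily the steps applied to the images above. It works on a flat list of 0-255 grayscale values so that it needs no image library, and the function names are my own.

```python
def otsu_threshold(pixels):
    """Return the 0-255 level that maximizes the between-class variance."""
    histogram = [0] * 256
    for value in pixels:
        histogram[value] += 1

    total = len(pixels)
    total_sum = sum(level * count for level, count in enumerate(histogram))

    best_threshold, best_variance = 0, -1.0
    background_count, background_sum = 0, 0
    for level in range(256):
        background_count += histogram[level]
        if background_count == 0:
            continue  # no pixels at or below this level yet
        foreground_count = total - background_count
        if foreground_count == 0:
            break  # every pixel is already in the background class
        background_sum += level * histogram[level]
        mean_bg = background_sum / background_count
        mean_fg = (total_sum - background_sum) / foreground_count
        variance = background_count * foreground_count * (mean_bg - mean_fg) ** 2
        if variance > best_variance:
            best_threshold, best_variance = level, variance
    return best_threshold


def binarize(pixels, threshold):
    """Map pixels above the threshold to white (255) and the rest to black (0)."""
    return [255 if value > threshold else 0 for value in pixels]
```

With dark ink on light paper, the ink ends up black and the paper white, which removes shading and uneven lighting before the image is passed to the OCR engine.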

In conclusion, Google Tesseract works best with everyday, cleaner fonts such as the serif and sans serif fonts I tested. Its output with decorative and script fonts ranges from bad to worst: it can't detect most of the characters, and the result is garbage-like text. Handwritten documents fall in between, because the output varies with each person's handwriting. I think Google Tesseract should keep improving its training data for text detection and add important features like layout analysis and a user interface. But all in all, the application is superb. It looks as if a high-quality open-source OCR solution is on the horizon.


*The article I used is from bleacherreport.com 
