Tesseract is an optical character recognition (OCR) engine. It is free software, released under the Apache License, Version 2.0, and its development has been sponsored by Google since 2006. It is considered one of the most accurate open source OCR engines currently available.
OCR is a very hard problem. Font variations, image noise, and alignment problems make it extremely difficult to design an algorithm that reliably translates an image of text into actual text.
I conducted a series of tests with machine-printed and handwritten documents:
- Machine-printed documents
  - Font: serif and sans serif (3 each)
  - Font: one decorative style and one script style
  - Layout: left-aligned text (3 serif, 3 sans serif) and justified text (3 serif, 3 sans serif)
  - With figures (1 serif, 1 sans serif)
- Handwritten documents
  - One-paragraph description of yourself
  - One-paragraph description of two other persons
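A test matrix like the one above can be run through the Tesseract command-line tool in a loop. The snippet below is only a minimal sketch of how that could look, not the exact procedure used for this post. It assumes the `tesseract` executable is on the PATH, that the English language data (`eng`) is installed, and that the release is recent enough to accept `--psm` (older 3.x releases spelled it `-psm`). The function names and the file name are made up for illustration.

```python
import shutil
import subprocess
from typing import List


def build_tesseract_command(image_path: str, lang: str = "eng", psm: int = 3) -> List[str]:
    """Build the argument list for one Tesseract CLI call.

    Passing "stdout" as the output base makes Tesseract print the
    recognized text instead of writing a .txt file.
    """
    return ["tesseract", image_path, "stdout", "-l", lang, "--psm", str(psm)]


def run_ocr(image_path: str, lang: str = "eng", psm: int = 3) -> str:
    """Run Tesseract on one image and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    result = subprocess.run(
        build_tesseract_command(image_path, lang, psm),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if Tesseract fails
    )
    return result.stdout


if __name__ == "__main__":
    # Hypothetical file name; replace it with one of your own test images.
    print(run_ocr("serif_left_aligned_1.png"))
```

Keeping the command construction in its own function makes it easy to see exactly which options each test run used.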
I used DejaVu Serif, Times New Roman, and Cambria for the serif fonts, and Arial, Calibri, and DejaVu Sans for the sans serif fonts. Tesseract's output for these images ranged from good to perfect. The application handles simple fonts well.
Serif Text Fonts
Sans Serif Text Fonts
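Ratings such as "good to perfect" can be made more concrete by comparing the OCR output with the original text. I did not compute such a score for these tests. The snippet below is just a sketch of one common measure, the character error rate, which is the edit distance between the two strings divided by the length of the reference text. The function names are my own.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions that turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or keep if equal
            ))
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, ocr_output: str) -> float:
    """Edit distance normalized by the length of the reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, ocr_output) / len(reference)
```

A rate of 0.0 means the output matches the reference exactly. Values near or above 1.0 mean the output is mostly garbage.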
For the script and decorative font styles, I used Dancing Script and Rock Salt, respectively. The output of Tesseract for these documents was a disaster. (Hu hu hu) I think the font variation and the spacing between characters affect the application's decision making.
Dancing Script
Rock Salt
I think the justified layout does not improve character detection for the serif and sans serif documents. The output is almost the same as for the left-aligned documents.
Also, images in a document do not affect the character recognition. Tesseract just ignores them. Here is an example of a document with an image.
For the handwritten documents, the output of the application depends on the cleanliness and clarity of the handwriting. Also, some letters can't be detected by the application. The images I used were also preprocessed to get a clearer view of the characters. Here are my examples of handwritten documents.
My Self Description
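The preprocessing mentioned above can take many forms. One common step before OCR is binarization, which turns a grayscale scan into pure black and white. The snippet below is a sketch of Otsu's thresholding method in plain Python. It is an assumption for illustration and not necessarily the preprocessing applied to the images in this post. It expects a flat list of grayscale values from 0 to 255, and the function names are my own.

```python
from typing import List, Optional


def otsu_threshold(pixels: List[int]) -> int:
    """Pick the threshold (0..255) that maximizes the between-class
    variance of the dark and bright pixel groups (Otsu's method)."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(value * count for value, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels with value <= t
    sum0 = 0  # sum of those pixel values
    for t in range(256):
        w0 += hist[t]
        if w0 == 0:
            continue
        w1 = total - w0
        if w1 == 0:
            break
        sum0 += t * hist[t]
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        between_var = w0 * w1 * (mean0 - mean1) ** 2
        if between_var > best_var:
            best_var, best_t = between_var, t
    return best_t


def binarize(pixels: List[int], threshold: Optional[int] = None) -> List[int]:
    """Map each pixel to 0 (black) or 255 (white) around the threshold."""
    if threshold is None:
        threshold = otsu_threshold(pixels)
    return [255 if p > threshold else 0 for p in pixels]
```

For real scans, an image library would normally do this work. The plain-Python version only shows the idea.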
In conclusion, Google Tesseract works best with the everyday fonts we use and with cleaner fonts like the serif and sans serif fonts. The output for decorative and script fonts, on the other hand, ranges from bad to terrible. It can't detect most of the characters, and the output is garbage-like text. Results for handwritten documents are average, because the output varies with each person's handwriting. I think Google Tesseract must continue to improve its training data for text detection and add important features like layout analysis and a user interface. But all in all, the application is superb. It looks as if a high-quality open-source OCR solution is on the horizon.
*The article I used is from bleacherreport.com