అందం: Tesseract OCR for Telugu

Wednesday, May 19, 2010

Tesseract OCR for Telugu - Part 3

TESSDATA in Telugu

What do we need to teach a child so that he can read Telugu? Never mind, let us start with an easier question.
What do we need to teach a child so that he can read English?
a to z, likewise A to Z, and 0 to 9.
Now for Telugu:
అ through ఔ
అఁ, అం and అః
క through హ

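Is that enough? Not at all. Didn't you learn the guninthams (each consonant combined with every vowel sign) at school as a child?
We also have to teach the guninthams, starting from క కా కి కీ and going all the way to హొ హో హౌ.
Is that enough? Not at all. What about the vattulu (the subscript forms of consonants)?
We have to learn everything from the క vattu in అక్క to the హ vattu in అహ్హ (and don't forget ళ and ఱ in between).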

Is even that enough? No, because some vattulu fuse into the letter itself.
E.g. త్త, స్త, స్తి, ట్లు and ర్ల are frequently seen conjuncts whose vattulu cannot be written off to the side. Vattulu such as those in క్క, స్స, బ్బ and ప్ప can be written to the side. Counting this way, we have to keep teaching our Tess symbols such as గ్గ, గ్గా, గ్గి, గ్గు and గ్గౌ.

Important note:

For now, our Tess cannot yet place one box underneath another. Why? In plain English there is never any need to put one letter beneath another, so a brain that was born there was built accordingly. It splits క్కి into కి and the క vattu, but it cannot recognize గ్గు as a గ vattu beneath గు. So for now గ్గు has to be counted as a single symbol. All told, we have to teach it about a thousand symbols.
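To get a feel for how quickly the symbol count grows, here is a rough sketch in Python. The choice of 35 consonants and 12 vowel signs is my own assumption about what counts as a basic gunintham, and only doubled consonants (గ్గ and friends) are generated as stacked symbols; the real inventory depends on which conjuncts the font fuses.

```python
# Telugu consonants క..హ: U+0C15..U+0C39, skipping the unassigned U+0C29
# and U+0C34 (ఴ), which modern Telugu does not use. ళ and ఱ are included.
CONSONANTS = [chr(cp) for cp in range(0x0C15, 0x0C3A) if cp not in (0x0C29, 0x0C34)]

# Dependent vowel signs ా ి ీ ు ూ ృ ె ే ై ొ ో ౌ.
# The leading empty string stands for the inherent అ (no sign written).
VOWEL_SIGNS = [""] + [
    chr(cp)
    for cp in (0x0C3E, 0x0C3F, 0x0C40, 0x0C41, 0x0C42, 0x0C43,
               0x0C46, 0x0C47, 0x0C48, 0x0C4A, 0x0C4B, 0x0C4C)
]

VIRAMA = "\u0c4d"  # joins two consonants into a conjunct


def gunintham(base):
    """Return the base followed by each vowel sign in turn."""
    return [base + sign for sign in VOWEL_SIGNS]


def doubled_conjunct(consonant):
    """Return the consonant with its own vattu, e.g. గ -> గ్గ."""
    return consonant + VIRAMA + consonant


simple_forms = [s for c in CONSONANTS for s in gunintham(c)]
stacked_forms = [s for c in CONSONANTS for s in gunintham(doubled_conjunct(c))]

print(len(CONSONANTS), "consonants")
print(len(simple_forms), "simple guninthams")
print(len(stacked_forms), "doubled-consonant guninthams")
```

Not every stacked form has to become its own symbol (the క్క kind can be split), but adding the mixed conjuncts such as స్త and ట్ల on top of these makes a figure of around a thousand look plausible.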

Stuff Required for Telugu Training
1) You need to give the program a *.box and *.tif file pair. The TIFF file contains all the possible characters (as an image), and the box file lists each character together with the coordinates of its box in that image.

E.g., image and box file contents:
[Image: sample Telugu text]

కే 29 115 50 154
్య 49 94 67 135
క్రై 81 79 114 150
క్ష్య 142 94 181 148
ప్రే 28 25 56 78
జ్ఞ 82 14 110 61
ఋ 141 36 204 62
As seen in Box File viewer

A *.box file viewer is one of many programs written by OCR enthusiasts for the sake of easy manipulation of box files. CowBoxer is one of them.
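If you would rather poke at a box file from a script, the format is easy to read. The sketch below assumes the five-field layout shown above: the symbol, then left, bottom, right and top pixel coordinates, which as I understand it are measured from the bottom-left corner of the image. The `Box` type and the function names are my own, not part of Tesseract.

```python
from typing import NamedTuple


class Box(NamedTuple):
    symbol: str
    left: int
    bottom: int
    right: int
    top: int


def parse_box_line(line):
    """Parse one 'symbol left bottom right top' line."""
    # Split from the right so the symbol can be any multi-codepoint cluster.
    symbol, left, bottom, right, top = line.strip().rsplit(" ", 4)
    return Box(symbol, int(left), int(bottom), int(right), int(top))


def parse_box_text(text):
    """Parse a whole box file, skipping blank lines."""
    return [parse_box_line(line) for line in text.splitlines() if line.strip()]


# The sample lines from this post.
SAMPLE = """\
కే 29 115 50 154
్య 49 94 67 135
క్రై 81 79 114 150
క్ష్య 142 94 181 148
ప్రే 28 25 56 78
జ్ఞ 82 14 110 61
ఋ 141 36 204 62
"""

boxes = parse_box_text(SAMPLE)
for box in boxes:
    width = box.right - box.left
    height = box.top - box.bottom
    print(box.symbol, "width", width, "height", height)
```

A quick sanity check like this (every width and height positive, no box wildly larger than the rest) catches many box file mistakes before training.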

Ideal Box File for Telugu.
The ideal box file for Telugu would contain nearly 10,000 lines, made up of about 1000 symbols repeated again and again. The more frequent a symbol is in real text, the more samples of it the box file should contain.
Hence, in a 10,000-line box file (with repetitions of the 1000 symbols), common symbols such as డు, ము, వు and లు would appear 20 to 100 times each. You wondered what the prathama vibhakti (the nominative case endings) was good for, didn't you? Well, this is its use.
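To check whether a box file actually has this kind of spread, you can count how often each symbol occurs. This is only a sketch: the tiny box text below is made up for illustration, and the `minimum` threshold is whatever you decide is enough samples.

```python
from collections import Counter


def symbol_counts(box_text):
    """Count how many boxes each symbol has in a box file's text."""
    counts = Counter()
    for line in box_text.splitlines():
        fields = line.split()
        if len(fields) >= 5:  # symbol plus four coordinates
            counts[fields[0]] += 1
    return counts


def undertrained(counts, minimum):
    """Return the symbols that appear fewer than `minimum` times."""
    return sorted(symbol for symbol, n in counts.items() if n < minimum)


# Made-up example data, not taken from any real training page.
EXAMPLE = """\
డు 10 10 30 40
ము 35 10 55 40
డు 60 10 80 40
ఱ 85 10 105 40
"""

counts = symbol_counts(EXAMPLE)
for symbol, n in counts.most_common():
    print(symbol, n)
print("needs more samples:", undertrained(counts, 2))
```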

For now, download long32.box.txt and long32.tif from here. Rename them as pothana.box and pothana.tif respectively. Also (you may want to) download the tiff images as test images.

Once you have a box file
We shall look at generating good, large box files in the next part. Right now we will concentrate on understanding the system. Assume God (read: me) gives you a good box file that can be used for training. How do we generate the required TESSDATA files? Or assume you only need to train the non-conjuncts, అ to హ and కా to హౌ, and you have a sufficiently large box file. What do you do next?

RUN
tesseract pothana.tif junk nobatch box.train
Note that the box filename must match the tif filename, including the path, or Tesseract won't find it. The output of this step is pothana.tr which contains the features of each character of the training page.
Error: You might get an error here

read_variables_file:Can't open ./tessdata/configs/box.trainCould not open file, nobatch

To fix this, make sure that the tessdata folder has the configs and tessconfigs folders in it. See step 1 under TEST RUN for OCRing English text in Part-2.

RUN
mftraining pothana.tr
This will output two data files: inttemp, pffmtable.

cntraining pothana.tr
This will output the normproto data file.

RUN
unicharset_extractor pothana.box
This will generate the unicharset data file.
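The four commands so far can be strung together in a small script. This is just a convenience sketch: it assumes the Tesseract 2.x training tools are on your PATH under the names used above, and it only prints the commands unless you pass `dry_run=False`. The function names are my own.

```python
import subprocess


def training_commands(base):
    """Return the training commands from this post for <base>.tif / <base>.box."""
    return [
        ["tesseract", f"{base}.tif", "junk", "nobatch", "box.train"],
        ["mftraining", f"{base}.tr"],
        ["cntraining", f"{base}.tr"],
        ["unicharset_extractor", f"{base}.box"],
    ]


def run_training(base, dry_run=True):
    """Print each command and, unless dry_run is set, execute it in order."""
    for argv in training_commands(base):
        print(" ".join(argv))
        if not dry_run:
            # check=True stops at the first step that fails.
            subprocess.run(argv, check=True)


run_training("pothana")
```

Run it from the directory that holds pothana.tif and pothana.box so that the relative file names resolve.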

DATA DICTIONARIES

If we are teaching a child to read Telugu, he should already know at least a few Telugu words. Suppose, for example, that he is not good with vattulu and cannot reliably recognize the మ vattu or the న vattu. When he sees words like అమ్మ (mother) and నాన్న (father), he reasons: "అమ plus some vattu, so it must be అమ్మ." Likewise, if he sees నాన plus something, he guesses it must be నాన్న. Tess, it seems, also has to be given some words: the frequently occurring words, and the rest of the words.

Tesseract uses 3 dictionary files for each language. To make the DAWG dictionary files, you first need a wordlist for your language. The wordlist is formatted as a UTF-8 text file with one word per line. Split the wordlist into two sets: the frequent words, and the rest of the words, and then use wordlist2dawg to make the DAWG files:

RUN
wordlist2dawg frequent_words_list freq-dawg
wordlist2dawg words_list word-dawg

The third dictionary file is called user-words and is usually empty.
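One way to produce the two lists is to count words in some Telugu text and cut the ranking at a point of your choosing. The sketch below assumes you already have the text split into words; the cut-off `top_n` and the function names are my own, and the two output file names simply match the wordlist2dawg commands above.

```python
from collections import Counter


def split_wordlist(words, top_n):
    """Split words into (frequent, rest): the top_n most common and the others."""
    counts = Counter(words)
    # Most frequent first; ties broken alphabetically so the result is stable.
    ranked = sorted(counts, key=lambda word: (-counts[word], word))
    return ranked[:top_n], sorted(ranked[top_n:])


def write_wordlist(path, words):
    """Write one word per line as UTF-8, the format wordlist2dawg expects."""
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        for word in words:
            handle.write(word + "\n")


# A toy corpus for illustration only.
tokens = "అమ్మ నాన్న అమ్మ అక్క అమ్మ నాన్న".split()
frequent, rest = split_wordlist(tokens, 2)
print("frequent:", frequent)
print("rest:", rest)

# To create the input files for wordlist2dawg:
# write_wordlist("frequent_words_list", frequent)
# write_wordlist("words_list", rest)
```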

tel.Files
For now, leave the DangAmbigs file blank.

Rename the files you have as:
tel.inttemp
tel.normproto
tel.pffmtable
tel.unicharset
tel.freq-dawg
tel.word-dawg
tel.user-words
tel.DangAmbigs
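If you retrain often, the renaming is worth scripting. The sketch below copies the eight files into a tessdata folder with the tel. prefix and creates user-words and DangAmbigs as empty files when they are missing. The function name and folder layout are my own assumptions; the demo at the end runs in a throwaway temporary folder with empty stand-in files.

```python
import shutil
import tempfile
from pathlib import Path

DATA_FILES = ["inttemp", "normproto", "pffmtable", "unicharset",
              "freq-dawg", "word-dawg", "user-words", "DangAmbigs"]
EMPTY_OK = {"user-words", "DangAmbigs"}  # these may be blank for now


def install_language(src_dir, tessdata_dir, lang="tel"):
    """Copy the training outputs into tessdata as <lang>.<name>."""
    src, dst = Path(src_dir), Path(tessdata_dir)
    dst.mkdir(parents=True, exist_ok=True)
    installed = []
    for name in DATA_FILES:
        source = src / name
        target = dst / f"{lang}.{name}"
        if source.exists():
            shutil.copyfile(source, target)
        elif name in EMPTY_OK:
            target.touch()  # create a blank file
        else:
            raise FileNotFoundError(source)
        installed.append(target.name)
    return installed


# Demo with empty stand-in files in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    work = Path(tmp) / "work"
    work.mkdir()
    for name in DATA_FILES:
        if name not in EMPTY_OK:
            (work / name).write_bytes(b"")
    installed = install_language(work, Path(tmp) / "tessdata")
    print(installed)
```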

Place them in the tessdata directory in the same directory as your tesseract.exe and the libtiff dlls.
* Alternatively, you can get the above data files as tel.tessdata.zip from here.

RUN

tesseract sample_telugu.tif output -l tel

You should find the recognized Telugu characters in the output.txt file.

Hurray!

Please try the above and ask me if you have any doubts.
+౯౧ ౯౫ ౫౦ ౧౭ ౦౪ ౭౧

12 comments:

Unknown, 2:48 pm, May 20, 2010
Excellent article in English. Congratulations, keep it up.
Please also post an article on the latest version of Tesseract, released yesterday, for the benefit of users.

K.S.R, 5:50 pm, September 13, 2010
hai sir,
you have taken an example of mandarasample.
what is the font style and font size.
i think it is pothana 26.
give reply.
thank you

Kiran, 3:24 am, October 18, 2010
hello Rakeswara Rao Garu,
It's really a good article on Telugu OCR. By far it's the best I could find by searching on Google.
I tried to follow your step-by-step instructions. When I reached
-'* Alternately you can get the above data files as tel.tessdata.zip { from here }."
I tried to get that file from the "http://groups.google.com/group/telugu-computing/files" but file portion is always down in last 3 days.
If possible can you please e-mail me those files.
thanks in advance
-kiran

Kiran, 2:17 am, October 20, 2010
Hello Rakeswara Rao Garu,
Thanks for helping to run tesseract on my local machine.
Can you please let me know how to create a box file from the images. If you could point out a link that would be fine. Probably I will get back to you with more questions :-)
thanks
-kiran

Sunil Mulagada, 10:47 am, February 15, 2011
Here I am unable to download tel.tessdata.zip, so I am trying to generate the training files.
But I haven't got the steps to create the tel.training data correct.
Right now I have the following files
tel.freq-dawg
tel.frequent_words_list
tel.inttemp
tel.normproto
tel.pffmtable
tel.unicharset
tel.word-dawg
tel.words_list

Can you please help me ..?

Regards,
sunil

rākeśvara, 10:53 am, February 15, 2011
Sunil,
This is old stuff; material on Tess 3 would be elsewhere. Why don't you send me an email?
రాకేశ్వర్@जीमेयिल्.काम्
आर् ये के ई ऎस् हॆच् वी ये आर्

Sunil Mulagada, 9:25 pm, February 15, 2011
Hello Rajeshwar,

Thanks for your quick response.
I searched a lot on the web. Do you have any suggested links for me?

Thanks,
Sunil

rākeśvara, 9:46 pm, February 15, 2011
It isn't practical to explain this in the comments, but do send an email to the address given above.

K.S.R, 2:13 pm, April 16, 2011
hai,
please send pothana images.i have to test them.
k.s.r

K.S.R, 4:55 pm, May 23, 2011
rakesh garu,
you r not giving reply for my questions.
which type of features r used for segmentation,making box file in tesseract.how to make changes in tesseract to train other font(lohit telugu).

K.S.R, 4:56 pm, May 23, 2011
In which university you r pursuing ph.d & when will u come back to india.
i want to meet you.
bye,
k.srinivasa rao,hyd