I want to train Tesseract for the Urdu language. I have installed Tesseract 4.00 and have a representative Urdu dataset available, with images like the one attached:
Training Image
I have the label for each image; e.g. the label for the above image is:
بعد نجی ٹی وی سے گفتگو کرتے ہوئے وزیر خارجہ شاہ محمود قریشی نے بتایا کہ ملاقات
Images are in these fonts: Pak Nastaleeq, Alvi Nastaleeq, Jameel Noori Nastaleeq, Nafees Nastaleeq.
What is the process of training Tesseract? How do I tag each Urdu word in the images (i.e. draw a box around each word)?
Related
I am making licence plate recognition software. I already trained my language using SunnyPage 2.7, and detection is currently good, except that Tesseract is not giving me good results. For example, it reads this plate as AC2 4529, which is close, but when I load the same image in SunnyPage with my language I get ACZ 4529, which is correct. I ended up configuring Tesseract with tess.setPageSegMode(10) (single character mode), segmenting the individual characters and processing each character one by one in Tesseract. That increased accuracy, but not by much. Below is my Tesseract configuration:
import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.ITessAPI.TessOcrEngineMode;
import net.sourceforge.tess4j.ITessAPI.TessPageSegMode;

Tesseract instance = new Tesseract();
instance.setLanguage(LANGUAGE);
instance.setHocr(false);
// restrict output to the characters that appear on our plates
instance.setTessVariable("tessedit_char_whitelist", "ACPBZRT960847152");
// disable the dictionaries, since plates are not dictionary words
instance.setTessVariable("load_system_dawg", "false");
instance.setTessVariable("load_freq_dawg", "false");
instance.setOcrEngineMode(TessOcrEngineMode.OEM_CUBE_ONLY);
// PSM_SINGLE_CHAR (= 10): treat each pre-segmented input as a single character
instance.setPageSegMode(TessPageSegMode.PSM_SINGLE_CHAR);
Does anyone know how I can get results as good as SunnyPage's? As far as I know my image is good: it is deskewed and well segmented, so the problem most likely lies with Tesseract alone.
The best thing to do would be to train Tesseract with actual images of your license plates. This will make your results much more accurate, because Tesseract will actually know what a Z and a 2 look like and will distinguish them much more reliably.
https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract
http://vietocr.sourceforge.net/training.html
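The training walkthrough in those links revolves around .box files, which pair each character in a training image with its bounding box. As a rough illustration only (the coordinates below are made up; the format is "glyph left bottom right top page", with the origin at the bottom-left of the image), a box file for part of a plate could look like:
A 15 30 42 80 0
C 50 30 78 80 0
Z 86 30 110 80 0
2 118 30 140 80 0
Tools such as jTessBoxEditor (from the same project as the vietocr page above) let you draw and correct these boxes visually instead of editing the file by hand.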
I am trying to implement the OpenCV LBPHFaceRecognizer() and make it work for the images of digits from the MNIST dataset. These images are 28 x 28 px and look like this:
But for this task I need a haarcascade .xml file that is able to detect digits. In the OpenCV package I only find .xml files suited for face recognition and Russian plate numbers.
Here is my code. I basically just need to replace cascadePath = "haarcascade_frontalface_default.xml" with an appropriate .xml for digits, but where do I get one?
All in all, I want to test face recognition with numbers instead of faces, so an input image showing a "1" should match all the other "1"s in the dataset.
For this, you need to train a cascade yourself. Here are two links explaining how to do it:
1. The OpenCV documentation for opencv_traincascade, the OpenCV application that trains a cascade (and generates the .xml).
2. A useful tutorial on training a cascade with OpenCV. It explains what to do and gives some tricks for generating the input files.
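Once opencv_traincascade has produced an .xml, you use it the same way as the bundled face cascades. Below is a minimal sketch with OpenCV's Java bindings; digit_cascade.xml and mnist_sample.png are placeholder names, not files that ship with OpenCV.
import org.opencv.core.Core;
import org.opencv.core.Mat;
import org.opencv.core.MatOfRect;
import org.opencv.imgcodecs.Imgcodecs;
import org.opencv.objdetect.CascadeClassifier;

public class DetectDigits {
    public static void main(String[] args) {
        System.loadLibrary(Core.NATIVE_LIBRARY_NAME);
        // load the cascade produced by opencv_traincascade
        CascadeClassifier cascade = new CascadeClassifier("digit_cascade.xml");
        Mat image = Imgcodecs.imread("mnist_sample.png");
        MatOfRect detections = new MatOfRect();
        // run the detector over the image at multiple scales
        cascade.detectMultiScale(image, detections);
        System.out.println("Detected regions: " + detections.toArray().length);
    }
}
Keep in mind that a cascade only detects objects at least as large as its training window, so the tiny 28 x 28 MNIST digits may need to be upscaled first.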
The example training exercise labels single-term names after tokenizing with something like a simple split(' ').
I need to train for and recognize names that include spaces. How do I train the recognizer?
Example: "I saw a Big Red Apple Tree." -- How would I tokenize for training and then recognize "Big Red Apple Tree" instead of recognizing four separate words?
Will this work for the training data?
I\tO
saw\tO
a\tO
Big Red Apple Tree\tMyName
.\tO
Would the output from the recognizer look the same as that?
The training section in the FAQ says "The training file parser isn't very forgiving: You should make sure each line consists of solely content fields and tab characters. Spaces don't work."
The problem you are trying to solve is phrase identification. There are different ways you can tag the words; for example, you can tag them with IOB tags. Train the Stanford NER model on this newly created data, then write a post-processing step to concatenate the predicted tokens back into phrases.
For example, your training data should look like this:
I\tO
saw\tO
a\tO
Big\tB-MyName
Red\tI-MyName
Apple\tI-MyName
Tree\tO-MyName
.\tO
So basically, you are using [ O, B-MyName, I-MyName, O-MyName ] as tags.
I have solved a similar problem this way and it works great. But make sure you have enough data to train on.
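As a rough sketch of that post-processing step (this is not from the original answer; the token and tag lists are illustrative), consecutive tokens whose predicted tag belongs to MyName can be concatenated back into a single name:
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MergeEntities {
    // Merge consecutive tokens tagged B-MyName / I-MyName / O-MyName into one phrase.
    public static List<String> mergeNames(List<String> tokens, List<String> tags) {
        List<String> names = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < tokens.size(); i++) {
            boolean partOfName = tags.get(i).endsWith("MyName");
            if (partOfName) {
                if (current.length() > 0) current.append(' ');
                current.append(tokens.get(i));
            }
            // flush the phrase when the entity ends or the sentence is over
            if ((!partOfName || i == tokens.size() - 1) && current.length() > 0) {
                names.add(current.toString());
                current.setLength(0);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("I", "saw", "a", "Big", "Red", "Apple", "Tree", ".");
        List<String> tags   = Arrays.asList("O", "O", "O", "B-MyName", "I-MyName", "I-MyName", "O-MyName", "O");
        System.out.println(mergeNames(tokens, tags)); // prints [Big Red Apple Tree]
    }
}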
I have implemented Tesseract OCR for text recognition on iOS. I preprocess the input image and pass it to the Tesseract method, but it gives poor recognition results.
Steps:
1. Erode function
2. Dilate function
3. Bitwise_not function
Mat MCRregion;
// morphological cleanup of the cropped region
cv::dilate ( MCRregion, MCRregion, 24);
cv::erode ( MCRregion, MCRregion, 24);
// invert so the text becomes dark on a light background
cv::bitwise_not(MCRregion, MCRregion);
UIImage * croppedMCRregion = [self UIImageFromCVMat:MCRregion];
Tesseract* tesseract = [[Tesseract alloc] initWithDataPath:@"tessdata" language:@"eng"];
[tesseract setVariableValue:@"0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz.>,'`;-:</" forKey:@"tessedit_char_whitelist"];
[tesseract setImage:[self UIImageFromCVMat:MCRregion]];
// [tesseract setImage:image];
[tesseract recognize];
NSLog(@"%@", [tesseract recognizedText]);
Input Image:
Image Link
1. How can I improve the text recognition rate with Tesseract?
2. Are any other preprocessing steps applied inside Tesseract?
3. Is text dewarping done in Tesseract OCR?
Tesseract is a highly configurable piece of software, though its configuration options are poorly documented (unless you want to dig deep into the roughly 150K lines of code). A good comprehensive list is available here: http://www.sk-spell.sk.cx/tesseract-ocr-parameters-in-302-version.
Also look at
https://code.google.com/p/tesseract-ocr/wiki/ControlParams and https://code.google.com/p/tesseract-ocr/wiki/ImproveQuality
You can improve the quality tremendously if you feed Tesseract more information about the data you're OCR'ing.
For example, if the images are all national IDs or passports that follow the standard MRZ format, you can configure Tesseract to use that information.
For the image you attached (an MRZ), I got the following result,
IDFRADOUEL<<<<<<<<<<<<<<<<<<<<9320
05O693202O438CHRISTIANE<<N1Z90620<3
by using the following config
# disable dict, freq tables etc which would distract OCR'ing an MRZ
load_system_dawg F
load_freq_dawg F
load_unambig_dawg F
load_punc_dawg F
load_number_dawg F
load_fixed_length_dawgs F
load_bigram_dawg F
wordrec_enable_assoc F
# mrz allows only these chars
tessedit_char_whitelist 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ<
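If you are calling Tesseract from code rather than the command line, the same settings can be applied programmatically. Here is a minimal sketch using the Tess4J API shown earlier in this thread (this is not the answerer's code, and passport.png is a placeholder file name):
import java.io.File;
import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;

public class MrzOcr {
    public static void main(String[] args) throws TesseractException {
        Tesseract ocr = new Tesseract();
        ocr.setLanguage("eng");
        // disable the dictionaries that would try to "correct" MRZ text
        ocr.setTessVariable("load_system_dawg", "false");
        ocr.setTessVariable("load_freq_dawg", "false");
        ocr.setTessVariable("load_punc_dawg", "false");
        ocr.setTessVariable("load_number_dawg", "false");
        // an MRZ only contains these characters
        ocr.setTessVariable("tessedit_char_whitelist",
                "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ<");
        System.out.println(ocr.doOCR(new File("passport.png")));
    }
}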
Also make sure your installation is trained for the fonts involved, to get more accurate results. In your case it appears to be the OCR-B font.
It is not necessary to go through the tedious task of retraining Tesseract. Yes, you will get much better results but in some cases you can get pretty far with the ENG training set.
You can improve your results by paying attention to the following things:
Use a binary image as input and make sure you have black text on a white background
By default Tesseract will try to make words out of things that have no spacing. Try to segment each character separately and place them in a new image with lots of spacing. Especially if you have combinations of letters and numbers, Tesseract will "correct" them to match the surrounding characters.
Try to segment different parts of your image and use a whitelist for the characters you know should be in there. If you're only looking for digits in the first part, then use a separate instance of Tesseract to detect those numbers with a digits-only whitelist.
If you use the same object multiple times without resetting it, Tesseract seems to have a memory, so you can get a different result each time you perform OCR. You can reset Tesseract to counter this, or just create a new object.
Last but not least, use the result iterator to go through the boxes that Tesseract can return. You can check the size and confidence of each character and filter accordingly.
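As a rough illustration of that last point (not the answerer's code), Tess4J exposes per-symbol boxes and confidences through getWords(); plate.png and the thresholds below are placeholders:
import java.awt.image.BufferedImage;
import java.io.File;
import java.util.List;
import javax.imageio.ImageIO;
import net.sourceforge.tess4j.ITessAPI.TessPageIteratorLevel;
import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.Word;

public class FilterByConfidence {
    public static void main(String[] args) throws Exception {
        Tesseract ocr = new Tesseract();
        ocr.setLanguage("eng");
        BufferedImage img = ImageIO.read(new File("plate.png"));
        // one Word per recognized symbol, with its bounding box and confidence
        List<Word> symbols = ocr.getWords(img, TessPageIteratorLevel.RIL_SYMBOL);
        for (Word w : symbols) {
            // keep only confident, reasonably sized characters
            if (w.getConfidence() > 60 && w.getBoundingBox().height > 10) {
                System.out.println(w.getText() + " " + w.getBoundingBox());
            }
        }
    }
}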
Based on my experience:
1. How can I improve the text recognition rate with Tesseract?
Firstly, preprocessing: ensure that the input image is a binary image with a good threshold. OpenCV has a good set of thresholding functions (such as the Otsu algorithm) as well as contour detection to help with warping and rotation (a small OpenCV sketch follows these tips).
You can also use contour detection in OpenCV to distinguish between lines of text.
Some filtering would also remove noise, which often confuses Tesseract and increases processing time.
Set up proper configurations for Tesseract (e.g. eng.config). The full list of configuration parameters is here (http://www.sk-spell.sk.cx/tesseract-ocr-parameters-in-302-version). Some examples include blacklists, whitelists, chopping, etc.
Use proper flags. E.g. -psm 6 if you are doing blocks of text rather than lines
Having trained my own language data... I would say do so only if you have lots of time and resources. Or if your font is very peculiar (e.g. dot matrix).
More recent versions of Tesseract (closer to 3.0) allow for multiple language files to be used on the same pass (-l one+two). This means you can have one specially trained for text and another for numbers. In our case, it seemed to work well.
Postprocessing of Tesseract results was particularly important for us too: string replacements of typical misrecognitions and the like.
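Here is the preprocessing sketch referred to above: grayscale conversion plus an Otsu threshold with OpenCV's Java bindings (not the answerer's code; input.png and binary.png are placeholder names). The resulting binary image is what you would then hand to Tesseract.
import org.opencv.core.Core;
import org.opencv.core.Mat;
import org.opencv.imgcodecs.Imgcodecs;
import org.opencv.imgproc.Imgproc;

public class PreprocessForOcr {
    public static void main(String[] args) {
        System.loadLibrary(Core.NATIVE_LIBRARY_NAME);
        Mat src = Imgcodecs.imread("input.png");
        Mat gray = new Mat();
        Imgproc.cvtColor(src, gray, Imgproc.COLOR_BGR2GRAY);
        Mat binary = new Mat();
        // Otsu picks the threshold automatically; THRESH_BINARY keeps dark text on a white background
        Imgproc.threshold(gray, binary, 0, 255, Imgproc.THRESH_BINARY + Imgproc.THRESH_OTSU);
        Imgcodecs.imwrite("binary.png", binary);
    }
}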
2. Are any other preprocessing steps applied inside Tesseract?
Tesseract uses the Leptonica library for preprocessing.
3. Is text dewarping done in Tesseract OCR?
I am inclined to think so, considering that warping functions are part of Leptonica.
I am trying to classify the Reuters text using SVM light, but my training data does not follow the required format
<line> .=. <target> <feature>:<value> <feature>:<value> ... <feature>:<value> # <info>
it is of the form
<line> .=. <feature>:<value> <feature>:<value> ... <feature>:<value> # <info>
The target label is in a separate file.
I know there's an option in SVM light that lets you specify a separate target label file, but I cannot find it on the SVM light website. I get the error message:
Reading examples into memory...Line must start with label or 0!!!
whenever I load my training data using
svm_learn example1/train.dat example1/model
Any help?
After doing some rigorous research, I realized that there is no option in SVM light that allows users to specify an external class label file for the training data. The class labels must be part of the training data, which must follow SVM light's "<target> <feature>:<value>" format.
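So the practical fix is to merge the label file into the feature file before calling svm_learn. A minimal sketch (labels.txt, features.dat and train.dat are placeholder names), assuming the n-th line of the label file corresponds to the n-th feature line:
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.PrintWriter;

public class MergeSvmLightLabels {
    public static void main(String[] args) throws Exception {
        try (BufferedReader labels = new BufferedReader(new FileReader("labels.txt"));
             BufferedReader features = new BufferedReader(new FileReader("features.dat"));
             PrintWriter out = new PrintWriter("train.dat")) {
            String label, feats;
            // prepend each class label to its feature line so every line
            // follows SVM light's "<target> <feature>:<value> ..." format
            while ((label = labels.readLine()) != null
                    && (feats = features.readLine()) != null) {
                out.println(label.trim() + " " + feats.trim());
            }
        }
    }
}
The merged train.dat can then be passed to svm_learn exactly as in the question: svm_learn example1/train.dat example1/model.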