Tesseract does not recognize complete image whereas correctly recognizes part of it? - image-processing

I have to parse some lab reports and I am using Tesseract to extract data from them. I have encountered an issue: Tesseract does not correctly recognize the text if I pass the entire page's image, but if I pass a small subsection of the page (from "Test Report" covering the entire table down to *****) it is able to read all the text correctly.
In the former case (when I pass the entire image) it produces some random output of English-looking words which do not make sense. Part of the text is as follows:
Command I ran: tesseract -l eng report.png out
Refierence No : assurcAN, 98941-EU
5:er Nu (SKU) , 95942, 95943
Labelled age gwup “aw
Quamny 20 pweces
Fackagmg pmwosd Yes
Vendor
Manmamurer
But when I pass the subsection, I get accurate results.
What might be the issue here? How do I fix it?
See the sample report image:

Related

Image preprocessing methods that can be used for identification of industrial part names (stuck-on or engraved) on the surface?

I am working on a project where my task is to identify a machine part by its part number, written on a label attached to it or engraved on its surface. One such example of a label and an engraved part is shown in the figures below.
My task is to recognise the 9- or 10-character alphanumeric number (03C 997 032 D in the 1st image and 357 955 531 in the 2nd image). This seems like an easy task; however, I am facing problems distinguishing the useful information from the rest of the part, i.e. there are many other numbers and characters in both images and I want to focus only on the mentioned numbers. I have tried many things but no success so far. Does anyone know image preprocessing methods or an ML/DL model I should apply to get the desired result?
Thanks in advance!
JD
You can use OCR to get all characters from the image and then use regular expressions to extract the desired patterns.
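For instance, a minimal sketch of that idea in Python, assuming the pytesseract wrapper is installed and that part numbers follow the space-separated three-character-group pattern seen in the examples above (the regex is a guess you would tune to your real labels):

import re
import pytesseract
from PIL import Image

# OCR the whole label/part image into one text blob.
text = pytesseract.image_to_string(Image.open("part_label.png"))

# Hypothetical pattern: three groups of 3 alphanumerics, plus an optional
# single trailing letter, e.g. "03C 997 032 D" or "357 955 531".
pattern = re.compile(r"\b[0-9A-Z]{3} [0-9A-Z]{3} [0-9A-Z]{3}(?: [A-Z])?\b")

for match in pattern.findall(text):
    print(match)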
You can use an OCR engine, like Tesseract.
You may want to clean the images before running the text-recognition system, by performing some filtering to remove noise / extra information, such as:
Convert to grayscale (colors are not relevant, are they?)
Crop to the region of interest
Canny filter
A good starting point can be one of these tutorials (a code sketch of the filtering steps follows them):
OpenCV OCR with Tesseract (Python API)
Recognizing text/number with OpenCV (C++ API)
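As an illustration, here is a minimal OpenCV sketch of the cleaning steps listed above (the crop coordinates and Canny thresholds are placeholders you would adapt to your images):

import cv2

img = cv2.imread("part_label.png")

# 1. Convert to grayscale: color carries no useful information here.
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# 2. Crop to the region of interest (placeholder coordinates).
roi = gray[100:300, 50:400]

# 3. Mild denoising before edge detection.
blurred = cv2.GaussianBlur(roi, (5, 5), 0)

# 4. Canny filter to keep only edges (thresholds are guesses to tune).
edges = cv2.Canny(blurred, 50, 150)

cv2.imwrite("preprocessed.png", edges)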

Scan video for text string?

My goal is to find the title screen from a movie trailer. I need a service where I can search a video for a string, and have it return the frame containing that string. Pretty obscure; does anything like this exist?
e.g. for this movie, I'd scan for "Sausage Party" and retrieve this frame:
Edit: I found the CloudSight API, which would actually work, except the cost is prohibitive at $0.04 per call, assuming I need to split the video into 1s intervals and scan every image (at least 60 calls per video).
No exact service that I can find, but you could attempt to do this yourself...
ffmpeg -i sausage_party.mp4 -r 1 %04d.png
/usr/local/bin/parallel --no-notice -j 8 \
/usr/local/bin/tesseract -psm 6 -l eng {} {.} \
::: *.png
This extracts one frame per second from the video file, and then uses tesseract to extract the text via OCR into files with the same name as the image frame (e.g. 0135.txt). However, your results are going to vary massively depending on the font used and the quality of the video file.
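To close the loop on the original goal (find the frame containing a given string), a small follow-up sketch in Python could scan the OCR output files produced above and report the matching frame image (the string and file layout follow the commands above):

import glob
import os

TARGET = "Sausage Party"  # the title string to search for

# The tesseract step above wrote one .txt file per extracted frame,
# named like the frame image (0001.txt for 0001.png, etc.).
for txt_path in sorted(glob.glob("*.txt")):
    with open(txt_path, errors="ignore") as f:
        if TARGET.lower() in f.read().lower():
            frame = os.path.splitext(txt_path)[0] + ".png"
            print("Found in frame:", frame)
            break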
You'd probably find it cheaper/easier to use something like Amazon Mechanical Turk, especially since the OCR is going to have a hard time doing this automatically.
Another option could be implementing this service yourself using the Scene Text Detection and Recognition module in OpenCV (docs.opencv.org/3.0-beta/modules/text/doc/text.html). You can take a look at this video to get an idea of how such a system would operate. As pointed out above, the accuracy will depend on the font used in the movie titles, the quality of the video files, and the OCR engine.
OpenCV relies on Tesseract as the underlying OCR but, alternatively, you could use the text detection and localization functions (docs.opencv.org/3.0-beta/modules/text/doc/erfilter.html) in OpenCV to find text areas in the image and then employ a different OCR engine to perform the recognition. The text detection and localization stage can be done very quickly, so achieving real-time performance would mostly be a matter of picking a fast OCR.
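A rough sketch of that detection stage with OpenCV's text module, closely following the module's own sample (the classifier XML files ship with opencv_contrib's text samples; their presence in the working directory is an assumption):

import cv2

img = cv2.imread("frame.png")

# Process each channel separately, as the Neumann-Matas ER method suggests.
for channel in cv2.text.computeNMChannels(img):
    # Two-stage extremal-region filtering; the XML classifiers come from
    # opencv_contrib/modules/text/samples.
    er1 = cv2.text.createERFilterNM1(
        cv2.text.loadClassifierNM1("trained_classifierNM1.xml"),
        16, 0.00015, 0.13, 0.2, True, 0.1)
    er2 = cv2.text.createERFilterNM2(
        cv2.text.loadClassifierNM2("trained_classifierNM2.xml"), 0.5)

    regions = cv2.text.detectRegions(channel, er1, er2)
    # Group character candidates into text-line rectangles (x, y, w, h).
    rects = cv2.text.erGrouping(img, channel, [r.tolist() for r in regions])
    for x, y, w, h in rects:
        cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)

cv2.imwrite("detected.png", img)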

Opencv traincascade cannot fill temp stage

So, I have 20 positive samples and 500 negative samples. I created the .vec file using the createsamples utility. Now, when I try to train the classifier using the traincascade.exe utility, I run into the following error:
I have looked into many solutions given to people who have faced similar issues, but none of them worked.
Things I tried:
1. Increasing the negative sample size
2. Checking the paths to the negative (background) images stored in the Negative.txt file
3. Varying different parameters
Here is some information regarding the paths. My working directory has the following files:
1. Traincascade.exe
2. Positive image folder
3. NegativeImageFolder
4. vec file
5. Negative.txt (a file that has the paths to the images in the negative image folder)
My Negative.txt file has the absolute file paths for the images in the negative image folder. I also tried changing the file paths to the following format:
NegativeImageFolder\Image1.pgm
but that didn't work! I tried both forward and backslashes too!
I have run out of ways to change the file path or make any modification to make this work!
First of all: is NumStages 1 and maxDepth 1 intentional?
Looking at OpenCV's source code (cascadeclassifier.cpp, imagestorage.cpp), the error is thrown when, in the function
bool CvCascadeClassifier::updateTrainingSet( double& acceptanceRatio)
the required number of negative samples, negCount=500, cannot be filled.
Up to that point everything was OK with the positive samples (the line about the positive count printed on the screen is proof of this).
Digging deeper into the source code, negCount cannot be filled when imgReader.getNeg( img ) returns false; this means it cannot provide any image, which in turn happens when the list of source negatives is empty.
So you have to concentrate all your efforts on providing the algorithm with a correct list of negative images.
In practice there are two things to verify: that Negative.txt is actually read and all its paths are well formed, and that every image in the list can actually be loaded. A quick validation sketch follows below.
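As an illustration, a minimal Python check along those lines (the file name Negative.txt follows the question; cv2 is assumed to be installed):

import cv2

# Validate every path listed in Negative.txt: traincascade hits the
# "cannot fill" error when it cannot actually load the negative images.
with open("Negative.txt") as f:
    paths = [line.strip() for line in f if line.strip()]

bad = [p for p in paths if cv2.imread(p) is None]

print("listed: %d, unreadable: %d" % (len(paths), len(bad)))
for p in bad:
    print("cannot read:", p)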
Is the file name “Negative.txt” or “Negatives.txt”?
Anyway, with so few positive and negative samples you won't train anything that actually works; it is only useful for understanding how the training process goes.
Well, I was able to resolve the issue and train the classifier successfully. However, I am not 100% sure how the change I made helped.
This is what I did:
I was generating the Negative.txt file using Excel. I would enter the file path of one image and increment the image filename (since my images were named image1, image2, image3, ...). So the format, as mentioned earlier, would be:
C:\OpenCV-3.0.0\opencv\build\x64\vc12\bin\Negative\Image1.pgm
And finally I would save the file as a Unicode text document. However, saving it as a Unicode text document gave me the error stated in the question. I saved it as a Text (tab delimited) file instead and it worked. (Presumably traincascade's plain-text reader chokes on the byte-order mark and two-byte characters of Excel's Unicode export, whereas the tab-delimited export is plain ANSI.)
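A way to sidestep the Excel encoding issue entirely is to generate the file programmatically; a minimal sketch, assuming the folder layout from the question:

import glob

# Write Negative.txt as plain ASCII, one image path per line, avoiding
# the Unicode/BOM encoding that Excel's "Unicode Text" export produces.
with open("Negative.txt", "w", encoding="ascii", newline="\n") as f:
    for path in sorted(glob.glob("NegativeImageFolder/*.pgm")):
        f.write(path + "\n")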

A guide to convert_imageset.cpp

I am relatively new to machine learning/python/ubuntu.
I have a set of images in .jpg format where half contain a feature I want caffe to learn and half don't. I'm having trouble finding a way to convert them to the required lmdb format.
I have the necessary text input files.
My question is can anyone provide a step by step guide on how to use convert_imageset.cpp in the ubuntu terminal?
Thanks
A quick guide to Caffe's convert_imageset
Build
First thing you must do is build caffe and caffe's tools (convert_imageset is one of these tools).
After installing caffe and making it, make sure you ran make tools as well.
Verify that a binary file convert_imageset is created in $CAFFE_ROOT/build/tools.
Prepare your data
Images: put all images in a folder (I'll call it here /path/to/jpegs/).
Labels: create a text file (e.g., /path/to/labels/train.txt) with a line per input image. For example:
img_0000.jpeg 1
img_0001.jpeg 0
img_0002.jpeg 0
In this example the first image is labeled 1 while the other two are labeled 0.
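For what it's worth, here is a hedged Python sketch for producing such a file, assuming (hypothetically) that the positive and negative images sit in two subfolders of /path/to/jpegs/; convert_imageset joins the root folder with each listed relative path:

import os

ROOT = "/path/to/jpegs"

# Hypothetical layout: ROOT/positives and ROOT/negatives hold the .jpg
# files. Writes "relative_path label" lines as convert_imageset expects.
with open("/path/to/labels/train.txt", "w") as f:
    for folder, label in [("positives", 1), ("negatives", 0)]:
        for name in sorted(os.listdir(os.path.join(ROOT, folder))):
            if name.lower().endswith((".jpg", ".jpeg")):
                f.write("%s/%s %d\n" % (folder, name, label))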
Convert the dataset
Run the binary in shell
~$ GLOG_logtostderr=1 $CAFFE_ROOT/build/tools/convert_imageset \
--resize_height=200 --resize_width=200 --shuffle \
/path/to/jpegs/ \
/path/to/labels/train.txt \
/path/to/lmdb/train_lmdb
Command line explained:
Setting the GLOG_logtostderr flag to 1 before calling convert_imageset tells the logging mechanism to redirect log messages to stderr.
--resize_height and --resize_width resize all input images to the same size, 200x200 in this example.
--shuffle randomly changes the order of the images, ignoring the order in the /path/to/labels/train.txt file.
Following those are the paths to the images folder, the labels text file, and the output name. Note that the output name should not exist prior to calling convert_imageset, otherwise you'll get a scary error message.
Other flags that might be useful:
--backend - allows you to choose between an lmdb dataset or LevelDB.
--gray - convert all images to grayscale.
--encoded and --encoded_type - keep image data in encoded (jpg/png) compressed form in the database.
--help - shows some help; see all relevant flags under "Flags from tools/convert_imageset.cpp".
You can check out $CAFFE_ROOT/examples/imagenet/convert_imagenet.sh for an example of how to use convert_imageset.
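To sanity-check the resulting database, a short sketch that reads one record back (assuming the lmdb Python package and Caffe's compiled Python protobufs are available):

import lmdb
from caffe.proto import caffe_pb2

# Open the freshly created database read-only and decode its first record.
env = lmdb.open("/path/to/lmdb/train_lmdb", readonly=True)
with env.begin() as txn:
    key, value = next(txn.cursor().iternext())
    datum = caffe_pb2.Datum()
    datum.ParseFromString(value)
    print(key, datum.channels, datum.height, datum.width, datum.label)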

Tesseract appears to be learning characters as you perform more OCRs; how do I save the learning data between uses?

I have a particular set of 10 images on which to perform OCR. They are all digits; somewhat short, about 20 digits in each image. There is one particular image: if I run it first, it will have some mismatches; however, if I run the other tests first, then come back to that one, all characters match.
I am inclined to conclude that Tesseract is learning the characters as more OCR operations are performed, which makes me very happy. Now the question is whether it's possible to save the learning data, so Tesseract would know to pick it up the next time I use it.
You can set classify_save_adapted_templates to 1 in your Tesseract config file to save the adapted templates, and set classify_use_pre_adapted_templates to 1 to load the templates the next time you run Tesseract.
The code that specifies the behavior of these options is here:
http://code.google.com/p/tesseract-ocr/source/browse/trunk/classify/classify.cpp?r=570
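Putting the two settings together, a minimal Python sketch (the config file name adapted_cfg is hypothetical, and whether your Tesseract build accepts a config file passed by path like this is worth verifying):

import subprocess

# Hypothetical config file enabling adaptive-template persistence; both
# variable names come from the answer above.
with open("adapted_cfg", "w") as f:
    f.write("classify_save_adapted_templates 1\n")
    f.write("classify_use_pre_adapted_templates 1\n")

# Tesseract accepts config file names after the output base.
subprocess.run(["tesseract", "digits.png", "out", "adapted_cfg"])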
