Tesseract OCR not 100% accurate

I need to scan a handwritten font. I am using Tesseract OCR v3.02 for that. I have trained the OCR using box files and added dictionary words as well, but I am still not able to get 100% accuracy.
I am trying to scan the following image
So far the text file I am obtaining looks like this:
We each took baths before bed. Bigfoot was so large that he had to use the bathtub more like a sink trying to clean
up his best. he left a pool of water around the usa It was a mess and I mopped it and removed all of the fur
left behind before taking a bath of my Own. I have to admit it was cguite disgusting. “It”s cguite simple.” Me
assured her. “Mere, I”ll show you.”Iimmy‘s was the favorite burger joint among us kids. Sreat burgers. and
there were video games and pinball machines. Jimmy”s has been around since the fift"es. all our parents used to
go there when they were young. “‘Mot exactly.""Another blanket. Ben. please."“I really hope so. because I feel so
Any help to improve the OCR?

Related

What image preprocessing methods can be used to identify industrial part names (stuck on a label or engraved) on the surface?

I am working on a project where my task is to identify a machine part by its part number, which is either written on a label attached to the part or engraved on its surface. One example each of a label and an engraved part is shown in the figures below.
My task is to recognise a 9- or 10-character alphanumeric number (03C 997 032 D in the first image and 357 955 531 in the second image). This seems like an easy task, however I am having trouble distinguishing the useful information in the image from the rest of the part, i.e. there are many other numbers and characters in both images and I want to focus only on the numbers mentioned. I have tried many things but with no success so far. Does anyone know which image preprocessing methods or ML/DL model I should apply to get the desired result?
Thanks in advance!
JD
You can use OCR to get all the characters from the image and then use regular expressions to extract the desired patterns.
You can use an OCR engine such as Tesseract.
You may want to clean the images before running the text-recognition system, by performing some filtering to remove noise and extra information, such as:
Convert to grayscale (colors are not relevant, are they?)
Crop to region of interest
Canny Filter
A good start can be one of these tutorials:
OpenCV OCR with Tesseract (Python API)
Recognizing text/number with OpenCV (C++ API)
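As a rough illustration of the OCR-plus-regex approach above (not from the original answer; the file name, crop coordinates, and the regex pattern guessed from the two part numbers in the question are all assumptions), a Python sketch could look like this:
import re
import cv2
import pytesseract

# Hypothetical photo of the part label / engraving
img = cv2.imread("part_label.jpg")

# Basic cleanup: grayscale, then crop to an assumed region of interest
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
roi = gray[100:400, 50:600]  # placeholder coordinates

# Run Tesseract on the cleaned-up region
text = pytesseract.image_to_string(roi)

# Pattern guessed from "03C 997 032 D" and "357 955 531":
# three groups of three alphanumerics, optionally followed by a single letter
pattern = r"\b[0-9A-Z]{3}\s+[0-9A-Z]{3}\s+[0-9A-Z]{3}(?:\s+[A-Z])?\b"
print(re.findall(pattern, text))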

Image processing with programming

I was given a task to process image files, and analyze data on them.
Imagine an exam paper with A, B, C, D answers to fill in (Picture1).
A vision sensor inspects this paper and saves an image file of it on the computer. I would like to have this image file analyzed (check for the correctly filled-in circles) and create a document with the results.
With close to no programming skills, I am kind of clueless on how to even start this project. I basically need something to detect if the red circles are filled in or at least have some % of the area filled (Picture2), and the others in the row are not, and give scores accordingly.
I don't know whether this will help you, and I'm not even sure it is the right answer, but I cannot make comments yet.
So basically you should build an application that processes every picture and checks the pixels in the areas covered by your template of correct answers. Then you store the true/false result of each inspection and sum these up to get the score; a rough sketch of this idea follows after this answer.
Maybe this will be helpful: Images and Pixels by Daniel Shiffman
But I also think that with no programming skills this task could be very hard to accomplish.
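As a rough illustration of that idea (not part of the original answer; the image name, bubble positions, radius, and fill threshold are all placeholder assumptions), a Python sketch with OpenCV could look like this:
import cv2
import numpy as np

# Hypothetical scan of the answer sheet
img = cv2.imread("exam_sheet.png", cv2.IMREAD_GRAYSCALE)

# Dark pixels (pencil marks) become white in the binary mask
_, mask = cv2.threshold(img, 128, 255, cv2.THRESH_BINARY_INV)

# Placeholder template: (x, y) centre of each answer bubble for one question
bubbles = {"A": (100, 200), "B": (150, 200), "C": (200, 200), "D": (250, 200)}
radius = 12           # assumed bubble radius in pixels
fill_threshold = 0.5  # a bubble counts as "filled" if half its area is marked

filled = []
for label, (x, y) in bubbles.items():
    circle = np.zeros_like(mask)
    cv2.circle(circle, (x, y), radius, 255, thickness=-1)
    area = cv2.countNonZero(circle)
    marked = cv2.countNonZero(cv2.bitwise_and(mask, circle))
    if marked / area >= fill_threshold:
        filled.append(label)

print("Filled answers:", filled)  # score the question from this list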

Finding features for classifying document into printable or non-printable

I would like to perform a binary classification of documents (.txt, .pdf, .jpeg, .img, etc.) into two categories: printable and non-printable. Essentially our school runs a free printing service for clubs, but the reality is that many clubs abuse the free printing and end up printing their homework, papers, etc., which amounts to thousands of dollars in ink and paper. Thus we would like to take some unsupervised methods to help limit this by determining whether a document is with high probability not club related (e.g. Biophysics paper, there is no biophysics club!).
So this is a very simple binary classification problem. I am not looking for low-level implementation details or which ML algorithms I should use, but rather how I should discover the relevant features that will then be fed to the training, etc.
My first idea was to gather all the documents that students print in the library. The idea is that if you have actual club printing, you'll do it for free at the club printing center rather than pay for it at the library. That would be a massive dataset, assuming every document printed at the library is assigned the non-printable/club material category. Unfortunately, the school is very liberal and opposed to allowing this due to privacy concerns, so it is not really an option without legal risks.
A similar-minded option would be to collect documents that are tied to courses / school work, e.g. course syllabi, available course documents online (homeworks, papers, etc.) and do feature extraction / selection on these. The assumption is that students would be abusing the printing to generally print material relevant to their studies.
While for .pdf and .txt based documents this approach should have reasonable performance, I am at a loss as to how to classify image-based documents, besides perhaps using the title of the document and other metadata. A clever violator could simply convert all their text documents to image format to circumvent this system. However, that is outside the scope of this question and should be saved for a future question / research. For now the scope is just text-based documents.
Note that there are previous questions on topics similar to this, but mine is very specific and I believe it may pose challenges that something like movie review classification might not have to face.
I just wanted to leave a comment, but it ended up way longer than I imagined.
While this is an interesting problem, I'm not sure ML will get you what you need easily.
Firstly, your classification problem is of the type "A vs. the world", and A isn't strictly defined. Unless you know exactly what kind of material the clubs print, you can't really say whether new material belongs to that class or not.
This will prove particularly difficult when you need to assemble a training set large enough to cover whatever can or cannot be printed. Such a task will be extremely tedious, and as you said you won't have access to what the clubs usually print out, so at best you will have a large class imbalance in your training set.
Since the goal is to make the system automated (if there is human interaction anyway, it's faster to check what will be printed than to build an ML algorithm whose score a human will have to investigate anyway), the number of false positives and false negatives will also be problematic: there will be cases where the clubs won't be able to print things they have the right to.
As you said, you could simplify the problem greatly by classifying Course Material vs. Not Course Material. For that I would look towards bag-of-words (BoW) features, because some words appear more often than others in papers or course material (everything remotely technical). The number of words as well as the overall size of the file seem like sensible things to extract. The structure is often also distinctive: it might be a good idea to extract things such as "number of lines with fewer than x words", "number of lines per page", "number of pictures" (if that's something you can extract from the file), and so on; a rough sketch of this kind of feature extraction follows at the end of this answer.
For pictures, the main thing to check would be whether it is a scan of something (I guess they will often scan and print course-related material); for that, the format of the image is already a good indication, but I don't see other features that would be particularly "course related".
So for me, if you can't precisely define one of your two classes, don't go with classification, or reduce the problem to something you can really define (course-related material).
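A rough sketch of this kind of feature extraction (not from the original answer; the sample documents, the short-line cutoff, and the use of scikit-learn are assumptions), in Python:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical plain-text contents of two documents to classify
docs = [
    "Problem set 3: derive the gradient of the loss function ...",
    "Chess club bake sale flyer: join us Friday at 5pm!",
]

# Bag-of-words counts over the corpus
vectorizer = CountVectorizer(lowercase=True, stop_words="english")
bow = vectorizer.fit_transform(docs).toarray()

# Simple structural features along the lines suggested above
def structural_features(text, short_line_words=5):
    lines = text.splitlines() or [text]
    n_words = len(text.split())
    n_short_lines = sum(len(l.split()) < short_line_words for l in lines)
    return [n_words, len(text), len(lines), n_short_lines]

extra = np.array([structural_features(d) for d in docs])

# Final feature matrix: word counts plus structural features,
# ready for whatever classifier is chosen later
features = np.hstack([bow, extra])
print(features.shape)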
If you are able to compile a "black list" of documents students are not allowed to print, you can then implement a rejection mechanism with several layers.
I would suggest these 3 levels:
compare the md5 of the file they want to print with a database of the md5 hashes of all the black-listed documents.
if 1) is passed, repeat 1) but at page level rather than document level (perhaps they want to print just a few pages rather than the entire document).
if 2) is passed, you can compare the page they want to print with the pages of the black-listed documents using an image similarity method such as SSIM. If you get a high score between the page they want to print and one of the black-listed items, do not print, and update your md5 database accordingly.
if 3) is passed: print!
A few words about SSIM: this method is quite robust to noise, so even a smart student who adds some sort of noise to the image will be caught.
However:
you have to find a proper way to extract a region of interest (ROI) from the page and from the documents in the database (if the two ROIs are in two different areas of the page, SSIM will be negative)
SSIM might be slow! A C implementation is definitely needed here.
I think SSIM is not rotationally invariant, hence the check will fail if they print the page upside down (unless you have a smart way to rotate the page).
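A minimal Python sketch of levels 1) and 3) (not from the original answer; the file names, the SSIM threshold, and the use of hashlib and scikit-image are assumptions):
import hashlib
from skimage.io import imread
from skimage.metrics import structural_similarity as ssim

# Level 1: whole-file md5 check against a hypothetical black-list database
def file_md5(path):
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()

blacklist_md5 = {file_md5(p) for p in ["blacklisted_paper.pdf"]}

def level1_passes(path):
    return file_md5(path) not in blacklist_md5

# Level 3: page-image similarity via SSIM on grayscale page renderings
def level3_passes(page_png, blacklisted_pages, threshold=0.9):
    page = imread(page_png, as_gray=True)
    for other_path in blacklisted_pages:
        other = imread(other_path, as_gray=True)
        if page.shape == other.shape and ssim(page, other, data_range=1.0) > threshold:
            return False  # too similar to a black-listed page
    return True

if level1_passes("job.pdf") and level3_passes("job_page1.png", ["bad_page1.png"]):
    print("print!")
else:
    print("reject and update the md5 database")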

Tesseract on iOS - bad results

After spending over 10 hours compiling Tesseract with libc++ so it works with OpenCV, I'm having trouble getting any meaningful results. I'm trying to use it for digit recognition; the image data I'm passing is a small square (50x50) image containing either one digit or no digit at all.
I've tried using both the eng and equ tessdata (from Google Code); the results differ, but neither guesses the digits correctly. Using the eng data I get '4\n\n' or '\n\n' as a result most of the time (even when there's no digit in the image), with confidence anywhere from 1 to 99.
Using equ data I get '\n\n' with confidence 0-4.
I also tried binarizing the image and the results are more or less the same; I don't think there's a need for it though, since the images are filtered pretty well already.
I'm assuming that something is wrong, since these images are pretty easy to recognize compared to even the simplest of the example images.
Here's the code:
Initialization:
_tess = new TessBaseAPI();                                                  // create the Tesseract API instance
_tess->Init([dataPath cStringUsingEncoding:NSUTF8StringEncoding], "eng");   // point it at the tessdata path and load the English data
_tess->SetVariable("tessedit_char_whitelist", "0123456789");                // restrict recognition to digits
_tess->SetVariable("classify_bln_numeric_mode", "1");                       // hint that the input is numeric
Recognition:
char *text = _tess->TesseractRect(imageData, (int)bytes_per_pixel, (int)bytes_per_line, 0, 0, (int)imageSize.width, (int)imageSize.height); // recognize the raw pixel buffer over the full image rectangle
I'm getting no errors. TESSDATA_PREFIX is set properly and I've tried different methods for recognition. imageData looks ok when inspected.
Here are some sample images:
http://imgur.com/a/Kg8ar
Should this work with the regular training data?
Any help is appreciated; it's my first time trying Tesseract out and I could have missed something.
EDIT:
I've found this:
_tess->SetPageSegMode(PSM_SINGLE_CHAR); // treat the whole image as a single character
I'm assuming it should be used in this situation; I tried it but got the same results.
I think Tesseract is a bit overkill for this task. You would be better off with a simple neural network, trained explicitly for your images. At my company we recently tried to use Tesseract on iOS for an OCR task (scanning utility bills with the camera), but it was too slow and inaccurate for our purposes (scanning took more than 30 seconds on an iPhone 4, at a tremendously low FPS). In the end, I trained a neural network specifically for our target font, and this solution not only beat Tesseract (it could scan flawlessly even on an iPhone 3GS) but also a commercial ABBYY OCR engine, of which the company had given us a sample.
This course's material would be a good start in machine learning.
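As an illustration of the "train a small network for your own digits" idea (this is not the network described above; it uses scikit-learn's bundled 8x8 digits dataset as a stand-in, and the layer size and training settings are assumptions):
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Stand-in dataset: 8x8 digit images; in practice you would use crops
# of your own 50x50 digit images together with their labels
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data / 16.0, digits.target, test_size=0.2, random_state=0
)

# One small hidden layer is usually enough for clean, isolated digits
clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))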

How to make texts in images sharper using PIL?

I was working with PIL, OpenCV and OCR readers to read text from images. The biggest problem I faced is the image processing needed to make the text sharp enough for easier and more accurate extraction by the OCR reader.
For that, I thought of increasing the contrast/brightness and doing a histogram equalization using PIL, but that didn't help the cause either.
So, what would you suggest doing to make the text appear sharper for better text extraction?
PIL has sharpen and edge enhancing filters. Is this what you want? An example image showing what you are dealing with would be helpful.
Your image has an uneven background color which may be causing problems. Try looking at this solution to create a nice leveled b&w image.
But the black collar is also going to cause problems and you should look at ways of cropping it out.
That said, I get reasonable improvements with a simple PIL SHARPEN filter:
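A minimal sketch of that step (the answer's original code is not reproduced here; the file name and the use of pytesseract are assumptions):
from PIL import Image, ImageFilter
import pytesseract

# Hypothetical input scan
img = Image.open("page.jpg").convert("L")      # grayscale
sharpened = img.filter(ImageFilter.SHARPEN)    # PIL's built-in SHARPEN filter
sharpened.save("page_sharpened.jpg")

print(pytesseract.image_to_string(sharpened))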
tesseract results after SHARPEN filter:
From what I've learned looking inside people, ^ I've decided human
beings are somewhere ` between a hurricane and an ice cube} in some
respects, permanently mysterious, but in others- with enough science
and careful probingeentirely ' scrutabler It would be as foolish to
think we have reached the limits of human knowledge as it is to 3
think we could ever know everything. There is still room enough to
get better, to ask questions of i even the dead, to learn from
knowing when our i simple certainties are wrong.
And results without filter:
From what I've learned lnnkmg wade maple} Fve deculed lunnuan wlng;.
el'. .y.w.r-a' isbetween a luurrlctuvr null llva laAll.' a. I ll
respects, permanently unyst:-rwntMl ln ms. re with enough scaena)
and turutul pmlulng l~m.rely scrutable. It would he as loallsla to
thank we have reached the llmlts of human knowledge as lt ls to think
we could ever know everything. There ls still room enough to get
better, to ask quesuons of ` even the dead, to learn from knowmg when
our simple certeindes are wrong.
