How to Distinguish Slashed Zero From Eight (0->8) in OCR - machine-learning

I'm using ML Kit for Firebase for my Android app (ReCalc: Receipt Calculator) and it performs very well except in the case of slashed zero.
In around half of the cases or more, it recognizes a slashed zero as an eight.
One idea I have is to slice the rectangle containing the zero into regions and detect whether the regions just above and below the middle are dark or not.
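A minimal sketch of that slicing idea, as I read it (this assumes the digit is already cropped and binarized; the band sizes and the 0.35 darkness threshold are guesses that would need tuning):

    import numpy as np

    def looks_like_slashed_zero(digit, dark_threshold=0.35):
        """digit: 2-D array of a cropped, binarized glyph, ink pixels = 1.
        Just above and below the vertical centre an '8' pinches inwards, so
        its outer left/right cells there are mostly background, while a
        slashed '0' keeps its wide oval sides."""
        h, w = digit.shape
        band = max(1, h // 6)                  # thin bands around the centre row
        upper = digit[h // 2 - band: h // 2, :]
        lower = digit[h // 2: h // 2 + band, :]
        side = max(1, w // 3)                  # outer thirds of each band
        cells = [upper[:, :side], upper[:, -side:],
                 lower[:, :side], lower[:, -side:]]
        darkness = [c.mean() for c in cells]   # fraction of ink pixels per cell
        # Wide dark sides near the centre suggest a (slashed) zero.
        return min(darkness) > dark_threshold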
But actually, I'm planning to train a model to classify zeroes and eights.
This is a lot of work, so I decided to first ask here for another solution or idea.
Here is an example:
Similar question: Tesseract OCR confuses slashed 0 as 8

Finally, I've trained a model myself.
Its accuracy is pretty good (more than 98%), although I have concerns about how well it generalizes.
Here is the project: ZEC - Slashed Zero-Eight Classifier
I've also created an application showing how to use the model on Android: ZECA - Slashed Zero-Eight Classifier for Android

Related

Recognize objects while falling - viewpoint variation

I have a problem statement: recognize 10 classes of different variations (in color and size) of the same object (a bottle cap) while it is falling, taking into account that the camera sees different viewpoints of the object. I have split this into sub-tasks:
1) Trained a deep learning model to classify only the flat surface of the object, and was successful in this attempt.
Flat Faces of sample 2 class
2) Instead of taking the fall into account, trained a model for possible perspective changes - not successful.
Perspective changes of sample 2 class
What are the approaches to recognizing the object even under perspective changes? I am not constrained to a single-camera solution, and am open to ideas for approaching this problem of varying perspectives.
Any help would be really appreciated. Thanks in advance!
The answer I want to give you is: CapsNets
You should definitely check out the paper, where you will be introduced to some shortcomings of CNNs and how they tried to fix them.
That said, I find it hard to believe that your architecture cannot solve the problem successfully when the perspective changes. Is your dataset extremely small? I'd expect the neural network to learn filters for the riffled edges, which can be seen from all perspectives.
If you're not limited to one camera, you could try to train a "normal" classifier, feed it multiple images in production and average the predictions. Or you could build an architecture that takes in multiple perspectives at once. You have to try for yourself what works best.
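As a rough illustration of the "feed multiple images and average the prediction" idea (assuming a Keras-style classifier that returns per-class probabilities; model and views are placeholders):

    import numpy as np

    def predict_from_views(model, views):
        """Average the class probabilities over several perspectives of the
        same falling cap, then pick the best class. 'views' is a list of
        already-preprocessed image arrays."""
        probs = np.stack([model.predict(v[np.newaxis, ...])[0] for v in views])
        return int(np.argmax(probs.mean(axis=0)))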
Also, never underestimate the power of old school image preprocessing. If you have 3 different perspectives, you could take the one that comes closest to the "flat" perspective. This is probably as easy as using the image with the largest colored area, where img.sum() is the highest.
Another idea is to figure out the color through explicit programming, which should be fairly easy and then feed the network a grayscale image. Maybe your network is confused by the strong correlation of the color and ignores the shape altogether.

OpenCV: how to use opencv_createsamples and opencv_traincascade

In my project I need to recognize euro coins and someone advised me to use OpenCV classifiers and training algorithms. So I downloaded the 3.1 version of OpenCV and I am trying to get started. I would like to know some things that I don't understand from the tutorials (the ones I am following are this, this from the OpenCV official documentation, and this).
First of all, is it mandatory to generate and consider negative samples? If yes, which kind of "objects" should I consider as negatives? In my app I should detect and recognize euro coins, so... should I create negatives from any other random kind of objects?
Secondly, my app is supposed to recognize 2€, 1€ and 0.50€ coins. So, how many positive samples shall I generate with opencv_createsamples? One for each coin (front and back) or a unique one for all 3 kinds of coins? If I have understood correctly, I will then have some .xml files which I am supposed to include in my iOS app project, right?
Finally, will detectMultiScale() not only detect the coin but also its kind? That's why I was thinking that I would need more than one classifier file, to distinguish the front from the back side and to distinguish the value of the coin.
Hoping I have not written too broad a question, thank you for your attention.
Regarding negative (or background) examples: consider which background the euro coins will be on in the images your program has to process in the end. E.g., if they are on a table together with non-Euros, use table tops (without coins) as negatives. If you want to spot them in images where people hold them, use hands as negatives, etc.
Regarding the second question - I am also searching for advice on this. I guess it depends on the complexity of the features you need to recognize and also on the complexity of the background. I succeeded in training a classifier with 40 circle positives, but completely failed to train a classifier using 60 (aerial) cow images.
Regarding your third question, I think (not tested!) that depends on your recognition needs: If you just want to spot one of the Euro coins, you'll be fine to train one classifier with images from all coins from all sides. If you want to distinguish the coins, create different classifiers with images of each coin from all sides. If you even need to distinguish front from back, you'll have to create one classifier for every side of every coin.
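To illustrate the "one classifier per coin" setup with OpenCV's Python bindings (the .xml file names are placeholders for whatever opencv_traincascade produces, and the detectMultiScale parameters are common defaults, not tuned values):

    import cv2

    # One cascade per coin type; detectMultiScale only reports where *its*
    # object was found, so telling the coins apart comes from running one
    # detector per class.
    cascades = {
        "2 EUR":    cv2.CascadeClassifier("cascade_2euro.xml"),
        "1 EUR":    cv2.CascadeClassifier("cascade_1euro.xml"),
        "0.50 EUR": cv2.CascadeClassifier("cascade_50cent.xml"),
    }

    img = cv2.imread("coins.jpg")               # placeholder input image
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    for label, cascade in cascades.items():
        detections = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
        for (x, y, w, h) in detections:
            print(label, (x, y, w, h))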
Yes, negative samples are mandatory, and again yes, you can have images of random objects, but you need to have lots of them, so I suggest using videos to extract frames from, or a website such as this.
Next, you need positive samples for your coins. The greater the number of positive samples, the more accurate the results will be. One thing you can do is take 2-3 images of your coin and superimpose them on your negatives, changing the angle every time you superimpose. That way, if you have, say, 500 negative images and 2 positive images of a coin, you will have 1000 positive images after superimposing. The opencv_createsamples tool takes care of all of this.

How to train SVM for "Euro" coin recognition with OpenCV 3?

My xmas holiday project this year was to build a little Android app, which should be able to detect arbitrary Euro coins in a picture, recognize their value and sum the values up.
My assumptions/requirements for the picture for a good recognition are
uniform background
picture should be roughly the size of a DinA4 paper
coins may not overlap, but may touch each other
number-side of the coins must be up/visible
My initial thought was, that for the coin value-recognition later it would be best to first detect the actual coins/their regions in the picture. Any recognition then would run on only these regions of the picture, where actual coins are found.
So the first step was to find circles. This I have accomplished using this OpenCV 3 pipeline, as suggested in several books and SO postings:
convert to gray
Canny edge detection
Gaussian blurring
HoughCircle detection
filtering out inner/redundant circles
The detection works rather successfully IMHO; here is a picture of the result:
Coins detected with HoughCircles with blue border
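For reference, a rough Python/OpenCV sketch of such a pipeline (note that HoughCircles runs a Canny detector internally, with param1 as its upper threshold; all numeric parameters are guesses to be tuned):

    import cv2
    import numpy as np

    img = cv2.imread("coins.jpg")                 # placeholder file name
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)  # convert to gray
    blurred = cv2.GaussianBlur(gray, (9, 9), 2)   # Gaussian blurring

    circles = cv2.HoughCircles(blurred, cv2.HOUGH_GRADIENT, dp=1.2, minDist=60,
                               param1=100, param2=40, minRadius=20, maxRadius=120)
    if circles is not None:
        for x, y, r in np.round(circles[0]).astype(int):
            cv2.circle(img, (x, y), r, (255, 0, 0), 2)   # blue border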
Now on to the recognition for every found coin!
I searched for solutions to this problem and came up with
template matching
feature detection
machine learning
Template matching seems very inappropriate for this problem, as the coins can be arbitrarily rotated with respect to a template coin (and the template matching algorithm is not rotation-invariant, so I would have to rotate the coins!).
Also, the pixels of the template coin will never exactly match those of the region of the previously detected coin. So any algorithm computing the similarity will produce only poor results, I think.
Then I looked into feature detection. This seemed more appropriate to me. I detected the features of a template coin and the candidate-coin picture and drew the matches (combination of ORB and BRUTEFORCE_HAMMING). Unfortunately the features of the template coin were also detected in the wrong candidate coins.
See the following picture, where the template or "feature" coin is on the left, a 20 cent coin. To the right are the candidate coins, where the left-most coin is a 20 cent coin. I actually expected this coin to have the most matches, but unfortunately it did not. So again, this does not seem to be a viable way to recognize the value of coins.
Feature-matches drawn between a template coin and candidate coins
So machine learning is the third possible solution. From university I still know about neural networks, how they work, etc. Unfortunately my practical knowledge is rather poor AND I don't know Support Vector Machines (SVM) at all, which is the machine learning method supported by OpenCV.
So my question is actually not source-code related, but more how to setup the learning process.
Should I learn on the plain coin images or should I first extract features and learn on the features? (I think: features)
How many positives and negatives per coin should be given?
Would I also have to train on rotated coins or would this rotation be handled "automagically" by the SVM? So would the SVM recognize rotated coins, even if I only trained it on non-rotated coins?
One of my picture requirements above ("DinA4") limits the size of the coin to a certain size, e.g. 1/12 of the picture height. Should I learn on coins of roughly the same size or on different sizes? I think that different sizes would result in different features, which would not help the learning process; what do you think?
Of course, if you have a different possible solution, this is also welcome!
Any help is appreciated! :-)
Bye & Thanks!
Answering your questions:
1 - Should I learn on the plain coin images or should I first extract features and learn on the features? (I think: features)
For many object classification tasks it's better to extract the features first and then train a classifier using a learning algorithm (e.g. the features can be HOG and the learning algorithm can be something like SVM or AdaBoost). This is mainly because the features carry more meaningful information than the raw pixel values (they can describe edges, shapes, texture, etc.). However, algorithms like deep learning will extract the useful features as part of the learning procedure.
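For illustration, a small HOG + SVM sketch in Python (the window, block and cell sizes are arbitrary choices, and scikit-learn's SVC stands in for OpenCV's own SVM; the coin crops and labels are whatever training data you collect):

    import cv2
    import numpy as np
    from sklearn.svm import SVC

    # HOGDescriptor(winSize, blockSize, blockStride, cellSize, nbins)
    hog = cv2.HOGDescriptor((64, 64), (16, 16), (8, 8), (8, 8), 9)

    def hog_features(coin_crop):
        """Resize a grayscale coin crop to the HOG window and flatten the descriptor."""
        return hog.compute(cv2.resize(coin_crop, (64, 64))).ravel()

    def train_coin_classifier(coin_crops, labels):
        """coin_crops: list of grayscale coin images; labels: their values."""
        X = np.array([hog_features(c) for c in coin_crops])
        return SVC(kernel="linear").fit(X, labels)

    # Usage: clf = train_coin_classifier(crops, values)
    #        clf.predict([hog_features(new_crop)])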
2 - How many positives and negatives per coin should be given?
You need to answer this question depending on the variation in the classes you want to recognize and the learning algorithm you use. For SVM, if you use HOG features and want to recognize specific numbers on coins, you won't need many.
3 - Would I also have to train on rotated coins or would this rotation be handled "automagically" by the SVM? So would the SVM recognize rotated coins, even if I only trained it on non-rotated coins?
Again, it depends on your final decision about the features (not the SVM, which is the learning algorithm) you're going to choose. HOG features are not rotation invariant, but there are features like SIFT or SURF which are.
4 - One of my picture requirements above ("DinA4") limits the size of the coin to a certain size, e.g. 1/12 of the picture height. Should I learn on coins of roughly the same size or on different sizes? I think that different sizes would result in different features, which would not help the learning process; what do you think?
Again, it depends on your choice of algorithm; some of them require a fixed or similar width/height ratio. You can find the specific requirements in the related papers.
If you decide to use an SVM, take a look at this, and if you feel comfortable with neural networks, using TensorFlow is a good idea.

Sparse Image matching in iOS

I am building an iOS app that, as a key feature, incorporates image matching. The problem is that the images I need to recognize are small orienteering 10x10 plaques with simple large text on them. They can be quite reflective and will be outside (so the light conditions will be variable). Sample image
There will be up to 15 of these types of image in the pool and really all I need to detect is the text, in order to log where the user has been.
The problem I am facing is that with the image matching software I have tried, aurasma and slightly more successfully arlabs, they can't distinguish between them as they are primarily built to work with detailed images.
I need to accurately detect which plaque is being scanned, and have considered using GPS to refine the selection, but the only reliable way I have found is to get the user to manually enter the text. One of the key attractions we have based the product around is being able to detect these images that are already in place and not having to set up any additional material.
Can anyone suggest a piece of software that would work (and is iOS friendly), or a method of detection that would be effective and interactive/pleasing for the user?
Sample environment:
http://www.orienteeringcoach.com/wp-content/uploads/2012/08/startfinishscp.jpeg
The environment can change substantially; the plaques can be basically anywhere a plaque could be positioned: fences, walls, and posts in either wooded or open areas, but overwhelmingly outdoors.
I'm not an iOS programmer, but I will try to answer from an algorithmic point of view. Essentially, you have a detection problem ("Where is the plaque?") and a classification problem ("Which one is it?"). Asking the user to keep the plaque in a pre-defined region is certainly a good idea. This solves the detection problem, which is often harder to solve with limited resources than the classification problem.
For classification, I see two alternatives:
The classic "Computer Vision" route would be feature extraction and classification. Local Binary Patterns and HOG are feature extractors known to be fast enough for mobile (the former more than the latter), and they are not too complicated to implement. Classifiers, however, are non-trivial, and you would probably have to search for an appropriate iOs library.
Alternatively, you could try to binarize the image, i.e. classify pixels as "plate" / white or "text" / black. Then you can use an error-tolerant similarity measure for comparing your binarized image with a binarized reference image of the plaque. The chamfer distance measure is a good candidate. It essentially boils down to comparing the distance transforms of your two binarized images. This is more tolerant to misalignment than comparing binary images directly. The distance transforms of the reference images can be pre-computed and stored on the device.
Personally, I would try the second approach. A (non-mobile) prototype of the second approach is relatively easy to code and evaluate with a good image processing library (OpenCV, Matlab + Image Processing Toolbox, Python, etc).
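A rough Python/OpenCV prototype of the chamfer comparison described above (assuming both images are already binarized to text = 255 / background = 0 and brought to the same size):

    import cv2
    import numpy as np

    def chamfer_score(query_bin, ref_bin):
        """Lower is better. distanceTransform gives each pixel its distance to
        the nearest zero pixel, so after inverting the reference (text becomes 0)
        every pixel holds its distance to the nearest reference stroke; averaging
        that over the query's text pixels measures how far the query strokes lie
        from the reference strokes."""
        dist_to_ref = cv2.distanceTransform(cv2.bitwise_not(ref_bin), cv2.DIST_L2, 3)
        ys, xs = np.nonzero(query_bin)
        return float(dist_to_ref[ys, xs].mean())

    def classify_plaque(query_bin, references):
        """references: dict mapping plaque label -> pre-computed binarized reference."""
        return min(references, key=lambda k: chamfer_score(query_bin, references[k]))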
I managed to find a solution that is working quite well. It's not fully optimized yet, but I think it's just a matter of tweaking filters, as I'll explain later on.
Initially I tried to set up OpenCV, but it was very time consuming with a steep learning curve; it did give me an idea, though. The key to my problem is really detecting the characters within the image and ignoring the background, which is basically just noise. OCR was designed exactly for this purpose.
I found the free library tesseract (https://github.com/ldiqual/tesseract-ios-lib) easy to use and with plenty of customizability. At first the results were very random, but applying sharpening, a monochromatic filter and a color invert worked well to clean up the text. Next I marked out a target area on the UI and used that to cut out the rectangle of the image to process. The speed of processing is slow on large images and this cut it down dramatically. The OCR filter allowed me to restrict the allowable characters, and as the plaques follow a standard configuration this narrowed things down and improved accuracy.
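For what it's worth, the same preprocessing steps sketched with Python and pytesseract (the iOS library exposes similar options; the whitelist, blur strength and page-segmentation mode are placeholders to adjust):

    import cv2
    import pytesseract

    def read_plaque(gray_crop, whitelist="0123456789"):
        """gray_crop: the user-targeted rectangle, already cut out and grayscale.
        Sharpen with an unsharp mask, binarize with an inverted Otsu threshold,
        then OCR a single text line restricted to the expected characters."""
        blur = cv2.GaussianBlur(gray_crop, (0, 0), 3)
        sharp = cv2.addWeighted(gray_crop, 1.5, blur, -0.5, 0)
        _, binary = cv2.threshold(sharp, 0, 255,
                                  cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
        config = "--psm 7 -c tessedit_char_whitelist=" + whitelist
        return pytesseract.image_to_string(binary, config=config).strip()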
So far it's been successful with the grey-background plaques, but I haven't found the correct filter for the red and white editions. My goal is to add color detection and remove the need to feed in the data type.

Scoreboard digit recognition using OpenCV

I am trying to extract numbers from a typical scoreboard that you would find at a high school gym. I have each number in a digital "alarm clock" font and have managed to perspective-correct, threshold and extract a given digit from the video feed.
Here's a sample of my template input
My problem is that no single classification method will accurately determine all digits 0-9. I have tried several methods:
1) Tesseract OCR - this one consistently messes up on 4 and frequently returns weird results. Just using the command line version. If I actually try to train it on an "alarm clock" font, I get unknown character every time.
2) kNearest with OpenCV - I search a database consisting of my template images (0-9) and see which one is nearest. I frequently get confusion between 3/1 and 7/1
3) cvMatchShapes - this one is fairly bad, it usually can't tell the difference between 2 of the digits for each input digit
4) Tangent Distance - This one is the closest, but the smallest tangent distance between the input and my templates ends up mapping "7" to "1" every time
I'm really at a loss to get a classification algorithm for such a simple problem. I feel I have cleaned up the input fairly well and it's a fairly simple case for classification but I can't get anything reliable enough to actually use in practice. Any ideas about where to look for classification algorithms, or how to use them correctly would be appreciated. Am I not cleaning up the input? What about a better input database? I don't know what else I'd use for input, each digit and template looks spot on at this point.
The classical digit recognition approach, which should work well in this case, is to crop the image just around the digit and resize it to 4x4 pixels.
A Discrete Cosine Transform (DCT) can be used to further slim down the search space. You could select the first 4-6 values.
With those values, train a classifier. SVM is a good one, readily available in OpenCV.
It is not as simple as emma's or martin's suggestions, but it's more elegant and, I think, more robust.
Given the width/height ratio of your input, you may choose a different resolution, like 3x4. Choose the smallest one that retains readable digits.
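A compact Python sketch of this recipe (the 4x4 size and 6 coefficients follow the suggestion above; taking the first values row-wise is a simplification of a proper zig-zag scan, and scikit-learn's SVC stands in for OpenCV's SVM):

    import cv2
    import numpy as np
    from sklearn.svm import SVC

    def dct_features(digit_img, size=4, n_coeffs=6):
        """Resize the cropped digit, take the 2-D DCT and keep the first
        low-frequency coefficients as a tiny feature vector."""
        small = cv2.resize(digit_img, (size, size)).astype(np.float32)
        return cv2.dct(small).ravel()[:n_coeffs]

    def train_digit_classifier(digit_images, labels):
        """digit_images: cropped grayscale digits; labels: the digit values 0-9."""
        X = np.array([dct_features(d) for d in digit_images])
        return SVC(kernel="rbf").fit(X, labels)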
Given the highly regular nature of your input, you could define a set of 7 target areas of the image to check. Each area should encompass a significant portion of one of the 7 segments of each digit of the display, but the areas should not overlap.
You can then check each area and average the color/brightness of the pixels in it to generate a probability for a given binary (on/off) state. If your probability is high for all areas, you can then easily figure out what the digit is.
It's not as elegant as a pure ML type algorithm, but ML is far more suited to inputs which are not regular, and in this case that does not seem to apply - so you trade elegance for accuracy.
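A sketch of that segment-area check (the region boxes would be hand-picked for a particular scoreboard crop, so they are placeholders here; the segment order used is top, top-right, bottom-right, bottom, bottom-left, top-left, middle):

    # Lit-segment pattern for each digit, in the order
    # top, top-right, bottom-right, bottom, bottom-left, top-left, middle.
    SEGMENT_PATTERNS = {
        "1111110": 0, "0110000": 1, "1101101": 2, "1111001": 3, "0110011": 4,
        "1011011": 5, "1011111": 6, "1110000": 7, "1111111": 8, "1111011": 9,
    }

    def read_digit(binary_digit, regions, on_threshold=0.5):
        """binary_digit: cropped glyph as a 2-D array with lit pixels = 1, dark = 0.
        regions: seven (y0, y1, x0, x1) boxes, one per segment, chosen by hand.
        Each box is averaged and thresholded into on/off, then the resulting
        pattern is looked up; None means the pattern matched no digit."""
        pattern = "".join(
            "1" if binary_digit[y0:y1, x0:x1].mean() > on_threshold else "0"
            for y0, y1, x0, x1 in regions
        )
        return SEGMENT_PATTERNS.get(pattern)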
Might sound silly, but have you tried simply checking for black bars vertically and then horizontally in the top and bottom halves, left and right of the centerline?
If you are trying text recognition with Tesseract, try passing not one digit but a number of duplicated digits; sometimes that can produce better results. Here's the example.
However, if you're planning business software, you may want to have a look at a commercial OCR SDK. For example, try ABBYY FineReader Engine. It's not affordable for free-to-use applications, but when it comes to business it can add good value to your product. As far as I know, ABBYY provides the best OCR quality; for example, check out http://www.splitbrain.org/blog/2010-06/15-linux_ocr_software_comparison
You want your scoreboard image inputs S feeding an algorithm that maps them to {0,1,2,3,4,5,6,7,8,9}.
Let V denote the set of n-tuples of integers.
Construct an algorithm α that maps each image S to an n-tuple (k1, k2, ..., kn) that can differentiate between two different scoreboard digits.
If you can specify the range of α then you only have to collect the vectors in V that correspond to a digit in order to solve the problem.
I've applied this idea using Martin Beckett's suggestion and it works. My initial attempt was a simple injection into a 2-tuple by vertical left-to-right summing, with the first integer an image column offset and the second the length of a 'nice' vertical line.
This did not work - images for 6 and 8 would map to the same vectors. So I needed another mini info-capture for my digit input types (they are not scoreboard digits), and a 3-tuple info vector does the trick.
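For concreteness, one possible reading of that 2-tuple encoding (the exact summing the author used isn't spelled out, so the details below are guesses):

    def column_signature(binary_digit, min_run=5):
        """Scan columns left to right over a binarized digit (dark = 1) and
        return (column offset, run length) for the first column containing a
        vertical dark run of at least min_run pixels, or None if there is no
        such column. min_run is an arbitrary choice."""
        rows, cols = binary_digit.shape
        for x in range(cols):
            run = best = 0
            for v in binary_digit[:, x]:
                run = run + 1 if v else 0
                best = max(best, run)
            if best >= min_run:
                return (x, best)
        return None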
