I have a camera pointed at a hallway, which can be in one of three states: it can be empty, it can contain my cat, or it can contain a bad cat. I'm trying to train a neural network to alert me when the camera sees the bad cat in the hallway.
I am new to machine learning and classification, so my question is: should I use binary classification (empty/my cat vs. bad cat), or three-class classification (empty vs. my cat vs. bad cat)? Which would give me better results?
As additional information, my cat is black and the bad cat is black and white. The hallway is lit throughout the day and night, though the quality of the light changes. The images are 640 by 480, though I currently crop them to about 450 by 300.
Based on the information provided, you basically have three situations. However, it also depends on the color set of the background: computer vision algorithms look for distinguishing features and learn to describe them from their training data. If your hallway were red or yellow, for instance, you would have a much easier time than with a gray hallway. That being said, since you have asked about neural networks and the Keras library, the well-known Cats vs. Dogs CNN exercise for Keras might work well with your dataset; a minimal sketch of a comparable network is given below. Besides that, I am not sure whether your application is meant to run in real time?
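To make that concrete, here is a minimal sketch of a small three-class CNN in Keras, under the assumption that your frames are cropped to roughly 300 by 450 pixels and sorted into one folder per class; the layer sizes, the directory layout and the training settings are illustrative, not tuned values. Switching to the binary formulation only means merging the first two folders and changing the final layer to a single sigmoid unit.

```python
# Minimal three-class CNN sketch (empty / my cat / bad cat) in Keras.
# The folder layout "data/empty", "data/my_cat", "data/bad_cat" is an assumption.
import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Rescaling(1.0 / 255, input_shape=(300, 450, 3)),  # roughly your cropped frame size
    layers.Conv2D(16, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(3, activation="softmax"),                   # empty / my cat / bad cat
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Expects one sub-folder per class under "data".
train_ds = tf.keras.utils.image_dataset_from_directory(
    "data", image_size=(300, 450), batch_size=32)
model.fit(train_ds, epochs=10)
```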
tl;dr - I use an autoencoder to try to reduce input dimensions for a reinforcement-learning (RL) agent to learn how to play Atari-KungFu. But it fails at encoding/decoding thrown knives, because they are only a couple pixels and getting them wrong probably has negligible impact on the autoencoder MSE loss (see green arrows in bottom left of image). This will probably permanently hobble the results. I want to figure out if there is a way to solve this -- preferably with a generalized solution, but I'd be happy for now with something specific to this problem.
Background:
I am working on Week5 of the "Practical Reinforcement Learning" course on Coursera (National Research University HSE), and I decided to spend extra time trying to expand performance on the Atari-KungFu assignment using Actor-Critic architecture. This post is not about actor-critic, but more about an interesting sub-problem I ran into related to autoencoders.
I create an encoder which outputs a 64-neuron tanh layer, which is used as a common input to the decoder, policy learner (actor), and value learner (critic). During training, the simulator returns batches of four sequential frames (64 x 144 x 4) and rewards from the last action. The images are first used to train the autoencoder, and then used with the rewards to train the actor & critic branches.
I display some metrics and example frames every 25000 iterations to see how it's doing. If the reconstructed images are accurate, then the inputs to the actor & critic branches should be getting good distilled information for efficient learning.
You can see below that the autoencoder is pretty good except for the thrown knives (see bottom-left). Arguably this is because missing those couple of pixels barely increases the MSE loss of the reconstructed image, so the network has little incentive to learn them (and there also aren't many frames that contain knives). Yet seeing those knives is critical for the RL agent to learn how to survive.
I haven't seen this kind of problem addressed before. A tiny artifact in the input images is crucial for learning, but is unlikely to be learned by the autoencoder. Can we fix/improve this?
IMO your problem is loss-specific. A few things would probably help the autoencoder reconstruct the knives as well:

Find the knives in the input image using classical image-processing techniques, and give the regions where knives are present a higher weight in the MSE loss, say 10 times more. One way to find them semi-automatically could be convolution with a large kernel: white pixels at the exact center increase the response, and having only zeros around them increases it further, so the response is high only where a small bright blob sits in an otherwise dark region. Something along these lines should find regions that contain only knives (the fighters would not match, as they contain too many white pixels and holes). An empirically chosen threshold on the kernel response should be enough to locate them correctly.

Lower the loss for images in which no knife was found, say by dividing it by two. This would focus the autoencoder harder on the rarely seen frames in which a knife appears.

On the downside, I suppose this could introduce some artifacts. In that case you might consider using a pretrained encoder (some version of ResNet, for instance) to increase the model's capacity. A minimal sketch of the weighted loss is given below.
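As a concrete illustration of the weighting idea, here is a minimal sketch of a per-pixel weighted MSE in TensorFlow/Keras; the knife_mask input, the 10x knife weight and the 0.5 scaling for knife-free frames are assumptions taken from the suggestions above, not tuned values.

```python
# Per-pixel weighted MSE sketch: knife pixels count more, knife-free frames count less.
import tensorflow as tf

def weighted_mse(y_true, y_pred, knife_mask, knife_weight=10.0, no_knife_scale=0.5):
    """knife_mask: float mask of shape (batch, H, W, 1), 1 where a knife was detected."""
    per_pixel = tf.square(y_true - y_pred)                   # (batch, H, W, C)
    weights = 1.0 + (knife_weight - 1.0) * knife_mask        # 1 everywhere, 10 on knife pixels
    per_image = tf.reduce_mean(per_pixel * weights, axis=[1, 2, 3])

    # Scale down the loss of frames that contain no knife pixels at all.
    has_knife = tf.cast(tf.reduce_max(knife_mask, axis=[1, 2, 3]) > 0, tf.float32)
    scale = has_knife + (1.0 - has_knife) * no_knife_scale
    return tf.reduce_mean(per_image * scale)
```

The mask itself can come from the big-kernel convolution described above, or from any other detector; the loss only assumes it is a binary map aligned with the input frames.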
I have to determine the number of elephant seals in an image. The original image is too big to be uploaded, so here is a sample:
Elephant seals
Classical image processing techniques cannot be used since the animals and the sand have nearly the same color; we would end up segmenting shadows or textures, but not the seals. That's why I wanted to try machine learning.
The goal would be to manually select some ROIs representing the seals and others representing the sand, in order to recognize the other animals in the image. The problem is that I don't know which features I could use to describe the seals and distinguish them from the sand.
Local histograms and their statistics (in particular the mean and standard deviation) seem interesting but are not enough. I thought about using the image gradient, but it did not lead to good discrimination. Moreover, I think a combination of several features must be used, but it's hard to tell which ones.
That's why I wonder whether there is a way to automatically determine discriminative features and use them for the learning and prediction steps of the machine learning algorithm.
In every tutorial I found, the descriptors were already defined.
Do you have any clue?
I have to determine the number of elephant seals in an image. .. any clue?
Well," classical " ML-Features ( per-se ) are not enough here :
this is a common situation for smart object recognition, which must provide a reasonable robustness, before any counting starts to have a sense.
As an example, CNN-methods deploy ( typically deep ) architectures of pre-processing with specialised kernels, that first help to decompose the 2D-scene into pre-cursors, that next may help the actual ML-based learner ( the fully connected "tail" section of the pipeline ) to start learning the object recognitions.
Without these ( deep or shallow ) convolutional layers and many there applied transcoding and pooling tricks, that re-rasterise the scene with non-linearly transformed co-local new, kernel-produced "visual"-features, pre-process these auto-synthetised-features for the ( yet ) deeper layers of the actual ML-learner.
Many papers published on this, so you are actually happy to have public sources to work with.
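If you want to avoid hand-crafting descriptors entirely, one practical shortcut is to let an ImageNet-pretrained CNN produce the features for your ROIs and train a simple classifier on top. Here is a minimal sketch of that idea; the patch size, the MobileNetV2 backbone, the SVM, and the variable names seal_patches / sand_patches are illustrative assumptions.

```python
# Pretrained CNN as a fixed feature extractor for seal / sand patches,
# followed by a simple classifier on those features.
import numpy as np
from tensorflow.keras.applications import MobileNetV2
from tensorflow.keras.applications.mobilenet_v2 import preprocess_input
from sklearn.svm import SVC

backbone = MobileNetV2(weights="imagenet", include_top=False,
                       pooling="avg", input_shape=(96, 96, 3))

def featurize(patches):
    # patches: array of shape (n, 96, 96, 3), i.e. the ROIs resized to 96x96
    return backbone.predict(preprocess_input(patches.astype("float32")))

# seal_patches / sand_patches are the ROIs you labeled by hand (assumed arrays).
X = np.concatenate([featurize(seal_patches), featurize(sand_patches)])
y = np.concatenate([np.ones(len(seal_patches)), np.zeros(len(sand_patches))])

clf = SVC(kernel="rbf").fit(X, y)
# At prediction time, slide a window over the full image, featurize each window,
# and call clf.predict on it; connected clusters of "seal" windows can then be counted.
```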
I want to train a neural network to extract a number of (128) face features from an image.
The features are numbers that measure things like the distance between the centers of the eyes, or the distance between the center of the left eye and the center of the mouth.
I need this to measure the dissimilarity between two faces: given a database of users, by analyzing a photo I'll be able to tell whether it's a photo of John.
I began my study using this link, which states: Researchers have discovered that the most accurate approach is to let the computer figure out the measurements to collect itself.
Ok, so the output of the network is an array of 128 numbers, I'll use some formula to adjust the weights so the output numbers are as accurate as possible.
What should I use as input? Will my input nodes be three photos, like in this article, and I'll extract the features based on the comparisons between the photos?
My first thought would be for you to use a library such as OpenFace, which is already trained on lots of faces and produces a great face representation (with the same 128 dimensions you need).
However, you mentioned that you want to train it yourself. I'd recommend you start by taking a look at Siamese Neural Networks. A Siamese network receives a pair of images (a genuine pair, e.g. images of the same person, or an impostor pair, e.g. images of different people) and tries to learn a similarity/dissimilarity metric (this is also called metric learning). It is very useful for learning face embeddings, which seems to be exactly your goal. Siamese networks basically learn a way to map the input images to a representation that "benefits comparison". Other implementations (such as OpenFace) are trained with triplet embeddings, where instead of a pair of images you use a triplet (two similar images and one dissimilar).
Here are some references to start with Siamese Networks:
Signature recognition (a little old but good to understand about them): https://papers.nips.cc/paper/769-signature-verification-using-a-siamese-time-delay-neural-network.pdf
Siamese networks for face embeddings: http://yann.lecun.com/exdb/publis/pdf/chopra-05.pdf
Triplet Embeddings paper: https://arxiv.org/pdf/1503.03832.pdf
Just keep in mind that training these architectures is quite difficult, since selecting good pairs is a very important and challenging part of the problem. One paper that mentions some of the challenges of creating image pairs, though it is not related to faces, is this one. A minimal sketch of a Siamese setup is given below.
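To make the Siamese idea concrete, here is a minimal sketch in Keras with a contrastive loss; the input size, the tiny backbone and the margin are illustrative assumptions, with only the 128-dimensional embedding taken from your question.

```python
# Siamese network sketch with a contrastive loss (Keras).
import tensorflow as tf
from tensorflow.keras import layers, Model

def make_embedder(embedding_dim=128):
    inp = layers.Input((96, 96, 1))
    x = layers.Conv2D(32, 3, activation="relu")(inp)
    x = layers.MaxPooling2D()(x)
    x = layers.Conv2D(64, 3, activation="relu")(x)
    x = layers.GlobalAveragePooling2D()(x)
    out = layers.Dense(embedding_dim)(x)          # the 128-number face representation
    return Model(inp, out)

embedder = make_embedder()
img_a = layers.Input((96, 96, 1))
img_b = layers.Input((96, 96, 1))
emb_a, emb_b = embedder(img_a), embedder(img_b)   # shared weights: the same embedder twice

# Euclidean distance between the two embeddings.
dist = layers.Lambda(
    lambda t: tf.sqrt(tf.reduce_sum(tf.square(t[0] - t[1]), axis=1, keepdims=True) + 1e-9)
)([emb_a, emb_b])
siamese = Model([img_a, img_b], dist)

def contrastive_loss(y_true, d, margin=1.0):
    # y_true = 1 for genuine pairs (same person), 0 for impostor pairs.
    y_true = tf.cast(y_true, d.dtype)
    return tf.reduce_mean(y_true * tf.square(d) +
                          (1 - y_true) * tf.square(tf.maximum(margin - d, 0.0)))

siamese.compile(optimizer="adam", loss=contrastive_loss)
# siamese.fit([pairs_a, pairs_b], pair_labels, ...)  # pairs you mine from your dataset
```

After training, only the shared embedder is kept: it maps a face image to the 128-dimensional vector you compare across your database.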
Hope that helps!
I have a set of images of a particular object. I want to find out whether some of them contain anomalies, using a machine learning algorithm. For example, if I have many photos of glasses, I want to detect when one of them is broken or otherwise anomalous. Something like this:
GOOD!!
BAD!!
(Obviously I will use the same kind of glasses...)
The problem is that I don't know every negative situation, so, for training, I have only positive images.
In other words, I want an algorithm that recognizes whether an image contains something different from the dataset. Do you have any suggestions?
In particular, is there a way to use a convolutional neural network?
What you are looking for is usually called anomaly, outlier, or novelty detection. You have lots of examples of what your data should look like, and you want to know when something doesn't look like your data.
A good approach for this problem, since you are working with images, is to get a feature-vector representation of each image from a CNN pre-trained on ImageNet, and then run an anomaly detector on that feature set. An isolation forest should be one of the easier ones to get working; a minimal sketch is given below.
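Here is a minimal sketch of that pipeline; the MobileNetV2 backbone, the contamination value and the good_images array are illustrative assumptions.

```python
# ImageNet-pretrained CNN features + isolation forest for anomaly detection.
import numpy as np
from tensorflow.keras.applications import MobileNetV2
from tensorflow.keras.applications.mobilenet_v2 import preprocess_input
from sklearn.ensemble import IsolationForest

backbone = MobileNetV2(weights="imagenet", include_top=False,
                       pooling="avg", input_shape=(224, 224, 3))

def featurize(images):
    # images: array of shape (n, 224, 224, 3)
    return backbone.predict(preprocess_input(images.astype("float32")))

X_good = featurize(good_images)            # only "good" examples are needed for fitting
detector = IsolationForest(contamination=0.01, random_state=0).fit(X_good)

# At inference time: -1 means "looks anomalous", +1 means "looks like the training data".
# labels = detector.predict(featurize(new_images))
```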
This is a typical classification problem; I do not understand why you need a CNN for this.

My suggestion would be to build/train a classification model comprising only GOOD images of glasses. Here you would possibly have all kinds of glasses that are intact, with a regular shape. If the model encounters anything other than GOOD images, it will classify them as BAD images. These so-called BAD images may include cracked or broken glasses with an irregular shape.
Another option that might work is to use an autoencoder.
Autoencoders are unsupervised neural networks with a bottleneck architecture that try to reconstruct their own input.
We could train a deep convolutional autoencoder on examples of good glasses so that it becomes specialized in reconstructing that type of image. You don't need to train the autoencoder with bad glasses.
Therefore I would expect the trained autoencoder to produce a low reconstruction error for good glasses and a high error for bad glasses. The error can be measured with the MSE between the reconstructed and original values (pixels).
From the trained autoencoder you can plot the MSEs for good vs. bad glasses to help you define the right threshold, or you can try statistical thresholds such as avg + 2*std or median + 2*MAD. A minimal sketch of this thresholding step is given below.
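Here is a minimal sketch of that thresholding step, assuming you already have a trained Keras autoencoder (autoencoder) and an array of normal examples (good_images) with pixel values in [0, 1]; the names and the 2*std rule are illustrative.

```python
# Per-image reconstruction error and a simple statistical threshold.
import numpy as np

def reconstruction_mse(model, images):
    recon = model.predict(images)
    return np.mean((images - recon) ** 2, axis=(1, 2, 3))   # one MSE per image

mse_good = reconstruction_mse(autoencoder, good_images)
threshold = mse_good.mean() + 2 * mse_good.std()            # e.g. avg + 2*std

def is_anomalous(images):
    return reconstruction_mse(autoencoder, images) > threshold
```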
Autoencoder details:
http://ufldl.stanford.edu/tutorial/unsupervised/Autoencoders/
Deep autoencoder for images:
https://cds.cern.ch/record/2209085/files/Outlier%20detection%20using%20autoencoders.%20Olga%20Lyudchick%20(NMS).pdf
I am new to OpenCV, and I am guessing that this problem could be fairly simple: I am trying to detect an object of roughly 25 by 15 pixels in an image that is 470 by 590 pixels.
I am attaching a zoomed image of this object. I have several options to go with:
1 - Detection of two close circles using the Hough transform,
2 - Histogram matching,
3 - SURF feature detection.
Any advice on which direction I should take? Please consider speed and real-time application. Thanks
It should probably go without saying, but there are hundreds of things that could be tried, and with only one example image it is quite difficult to advise. For instance, are the LEDs always green? We don't know.
That aside, imho, two good places to start would be with the ol' faithful template matching, or blob detection.
Then if that is not robust enough, you will need to look at some alternative representation of the template/blob, like the classic HoG (good for shape, though maybe a bit heavy for this application), or even a bespoke one of your own that encodes your domain-specific knowledge of this problem.
Then if that is still not robust enough, build a dataset of representative positive and negative examples, as big as you can, and train a classifier such as an SVM or a boosted classifier.
Template Matching:
http://docs.opencv.org/doc/tutorials/imgproc/histograms/template_matching/template_matching.html
Blob detection:
https://code.google.com/p/cvblob/
Machine Learning:
http://docs.opencv.org/modules/ml/doc/ml.html
TIPS:
Add as much domain knowledge as possible, i.e. if the LEDs are always green, use color in the representation, such as HoG on the green channel. If they are always circular, try to encode that, for instance by using a log-polar grid in the template rather than a regular grid... and so on.
Machine learning is not magic: a linear classifier will essentially weight different points in the feature space, so you still need a good representation. If template matching is a total failure, it is unlikely that simple linear ML will help, but if template matching is okay, then ML may well boost the performance to a good level. A minimal template-matching sketch is given below.
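As a starting point, here is a minimal template-matching sketch with OpenCV in Python; the file names, the matching method and the 0.8 threshold are illustrative assumptions.

```python
# Template matching: slide a small template over the scene and keep the best match.
import cv2

scene = cv2.imread("frame.png")                  # the 470x590 image
template = cv2.imread("template.png")            # a ~25x15 crop of the object

result = cv2.matchTemplate(scene, template, cv2.TM_CCOEFF_NORMED)
min_val, max_val, min_loc, max_loc = cv2.minMaxLoc(result)

if max_val > 0.8:                                # empirically chosen score threshold
    h, w = template.shape[:2]
    top_left = max_loc
    cv2.rectangle(scene, top_left, (top_left[0] + w, top_left[1] + h), (0, 255, 0), 2)
    print("object found at", top_left, "score", max_val)
```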
Step 1: Remove the black background.
Step 2: A snake (active contour) algorithm can be used to find the boundaries of your object.
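A minimal sketch of those two steps, using OpenCV for the thresholding and scikit-image for the snake; the file name, the threshold value, the initial circle position and the snake parameters are all illustrative assumptions.

```python
# Step 1: threshold away the dark background. Step 2: fit an active contour (snake).
import numpy as np
import cv2
from skimage.filters import gaussian
from skimage.segmentation import active_contour

gray = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)

# Step 1: remove the black background with a simple threshold.
_, mask = cv2.threshold(gray, 30, 255, cv2.THRESH_BINARY)

# Step 2: run a snake starting from a circle assumed to enclose the object
# (center (row, col) = (240, 295) is a placeholder, not a real coordinate).
s = np.linspace(0, 2 * np.pi, 200)
init = np.column_stack([240 + 40 * np.sin(s), 295 + 40 * np.cos(s)])
snake = active_contour(gaussian(mask.astype(float), 3), init,
                       alpha=0.015, beta=10, gamma=0.001)
```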