I have to determine the number of elephant seals in an image. The original image is too big to be uploaded, so here is a sample:
Elephant seals
Classical image processing techniques cannot be used since the animals and the sand are almost the same color: we would segment shadows or textures, but not the seals. That's why I wanted to try machine learning.
The goal would be to manually define some ROIs representing the seals and others for the sand, in order to recognize the remaining animals in the image. The problem is that I don't know which features I can use to describe the seals and distinguish them from the sand.
Local histograms and their statistics (in particular mean and standard deviation) seem interesting but not sufficient. I thought about using the image gradient, but it did not lead to a good discrimination. Moreover, I think a combination of several features must be used, but it's hard to tell which ones.
That's why I wonder if there is a way to automatically determine discriminative features to use for the learning and prediction steps of the machine learning algorithm.
In every tutorial I found, the descriptors were already defined.
Do you have any clue?
I have to determine the number of elephant seals in an image. .. any clue?
Well," classical " ML-Features ( per-se ) are not enough here :
this is a common situation for smart object recognition, which must provide a reasonable robustness, before any counting starts to have a sense.
As an example, CNN-methods deploy ( typically deep ) architectures of pre-processing with specialised kernels, that first help to decompose the 2D-scene into pre-cursors, that next may help the actual ML-based learner ( the fully connected "tail" section of the pipeline ) to start learning the object recognitions.
Without these ( deep or shallow ) convolutional layers and many there applied transcoding and pooling tricks, that re-rasterise the scene with non-linearly transformed co-local new, kernel-produced "visual"-features, pre-process these auto-synthetised-features for the ( yet ) deeper layers of the actual ML-learner.
Many papers have been published on this, so you are fortunate to have public sources to work with.
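As a rough illustration only of the kind of pipeline described above (a few convolutional feature extractors feeding a small fully connected "tail"), a minimal seal-vs-sand patch classifier might look like the sketch below. The 64x64 patch size, layer sizes, and the two classes are my own assumptions, not something taken from the question.

```python
import torch
import torch.nn as nn

# Sketch of a tiny patch classifier: conv layers extract local "visual" features,
# the fully connected tail learns seal-vs-sand from labelled patches.
# All sizes here are arbitrary assumptions for illustration.
class SealPatchNet(nn.Module):
    def __init__(self, num_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 64 -> 32
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 32 -> 16
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 16 -> 8
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 8 * 8, 128), nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

# Usage: feed batches of labelled 64x64 patches cut from the big image.
model = SealPatchNet()
dummy_batch = torch.randn(8, 3, 64, 64)   # 8 RGB patches
logits = model(dummy_batch)               # shape (8, 2)
```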
I don't mean that a neural network should replace the traditional image processing algorithm. What I want to know is whether there exists a kind of neural network that can use the parameters of the traditional method as input and output more universal parameters that don't require manual adjustment. Intuitively, my idea is less efficient than using a neural network directly, but I don't know much about the mathematics of neural networks.
If I understood correctly, what you mean is that for a traditional method (let's say thresholding), you want to find the best parameters using an ANN. It is possible, but you have to supply so much training data, which needs to be created, processed and evaluated, that it will take a lot of time. AFAIK many mobile phones with an AI-assisted camera use this method to find the best aperture, exposure, etc.
First of all, thank you very much. I still have two things to figure out. First, if I wanted to get a (or a set of) relatively optimal parameters, what data set would I need to build (such as some kind of error between input and output, and a threshold)? Second, taking your example, is selecting the optimal threshold through a neural network more efficient or better in practice than traversal or Otsu? To be honest, I wonder if this is really more efficient than training directly on the input and output with a neural network.
For your second question, Otsu only works in cases where the histogram has two distinct peaks. Thresholding is a simple function, but the cut-off value is based on your objective; there is no single "best" value valid for every case. So if you want to train a model for thresholding, I think you have to come up with separate models for each case (like a model for thresholding bright objects, another for darker ones, etc.). Maybe an additional output parameter for specifying the aim would work, but I am not sure. Will it be more efficient and better? It depends on the case (and your definition of better). Otsu, traversal or adaptive thresholding do not work all the time (actually Otsu has very specific use cases). If they work for your case, excellent. If not, then things get messy. So to answer your question, it depends on the problem at hand.
For the first question, TBF, it is quite difficult to work with images in traditional ANNs. Images have a lot of pixels, so standard ANNs struggle with such inputs. Moreover, when the location/scale of an object in the image changes, the whole pixel data changes even though the content is the same (these are the reasons why CNNs are superior to plain ANNs for images). For these reasons it is better to use processed metrics which contain condensed and location-invariant information. E.g. for thresholding, you can give the histogram and it returns a threshold value. Therefore you need an ANN with 256 input neurons (for an intensity histogram of an 8-bit grayscale image), 1 output neuron, and 1-2 hidden layers with some densely connected neurons (128 maybe?). Your training data will be a bunch of histograms as input and the corresponding best threshold value for each histogram. Then, once training is finished, you can give the ANN a histogram it has never seen before and it will tell you the optimal threshold value based on its training.
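To make that concrete, here is a minimal sketch of such a network. I'm assuming PyTorch, a 256-bin grayscale histogram as input, and a single regression output for the threshold; the layer sizes, learning rate, and training data are placeholders.

```python
import torch
import torch.nn as nn

# Sketch: 256-bin histogram in, one threshold value out. Hidden sizes are arbitrary.
model = nn.Sequential(
    nn.Linear(256, 128), nn.ReLU(),
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, 1),            # predicted threshold (e.g. scaled to [0, 1])
)

loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# hist_batch: (B, 256) normalized histograms; target_batch: (B, 1) "best" thresholds
# that you labelled or computed offline for each training image.
def train_step(hist_batch, target_batch):
    optimizer.zero_grad()
    pred = model(hist_batch)
    loss = loss_fn(pred, target_batch)
    loss.backward()
    optimizer.step()
    return loss.item()
```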
What I want to do is a model that can output different parameters (parameter sets) based on different input images, so I think that if you choose a good enough data set it should be somewhat universal.
Most likely, but your data set should be quite inclusive of expected images (in terms of metrics and features), which means it has to be large.
Also, I don't know much about modeling -- can I use a function of the output/parameters (which might be a function of the result of the traditional method) as the error in back-propagation by creating a custom loss function?
I think so, but training the model will be more involved than using predefined loss functions because, well, you have to write them. You also have to test that they work as expected.
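On the custom-loss idea: the main catch is that a hard threshold is not differentiable, so you cannot backpropagate through the traditional method directly. One common workaround (my assumption, not something from the discussion above) is to replace the hard cut-off with a sigmoid as a soft surrogate during training:

```python
import torch

# Soft surrogate for "apply threshold t to image x, then compare the resulting mask
# to a reference mask". The sigmoid makes the thresholding step differentiable so the
# error can be backpropagated to the network that predicted t. The steepness k is an
# arbitrary hyperparameter.
def soft_threshold_loss(image, predicted_t, target_mask, k=50.0):
    # image, target_mask: (B, H, W) in [0, 1]; predicted_t: (B, 1)
    soft_mask = torch.sigmoid(k * (image - predicted_t.view(-1, 1, 1)))  # approx. (image > t)
    return torch.mean((soft_mask - target_mask) ** 2)
```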
tl;dr - I use an autoencoder to try to reduce input dimensions for a reinforcement-learning (RL) agent to learn how to play Atari-KungFu. But it fails at encoding/decoding thrown knives, because they are only a couple pixels and getting them wrong probably has negligible impact on the autoencoder MSE loss (see green arrows in bottom left of image). This will probably permanently hobble the results. I want to figure out if there is a way to solve this -- preferably with a generalized solution, but I'd be happy for now with something specific to this problem.
Background:
I am working on Week5 of the "Practical Reinforcement Learning" course on Coursera (National Research University HSE), and I decided to spend extra time trying to expand performance on the Atari-KungFu assignment using Actor-Critic architecture. This post is not about actor-critic, but more about an interesting sub-problem I ran into related to autoencoders.
I create an encoder which outputs a tanh-64-neuron layer, which is used as a common input to the decoder, policy learner (actor), and value learner (critic). During training, the simulator returns batches of four sequential frames (64 x 144 x 4) and rewards from the last action. Then images are first used to train the autoencoder, then used with the rewards to train the actor & critic branches.
I display some metrics and example frames every 25000 iterations to see how it's doing. If the reconstructed images are accurate, then the inputs to the actor & critic branches should be getting good distilled information for efficient learning.
You can see below that the autoencoder is pretty good except for the thrown knives (see bottom-left). Arguably this is because missing those couple of pixels only minimally increases the MSE loss of the reconstructed image, so it has little incentive to learn them (and also there are not a lot of frames that have knives). Yet, seeing those knives is critical for the RL agent to learn how to survive.
I haven't seen this kind of problem addressed before. A tiny artifact in the input images is crucial for learning, but is unlikely to be learned by the autoencoder. Can we fix/improve this?
IMO your problem is loss-specific. Some things which would probably help the autoencoder reconstruct the knives as well:
Find the knives in the input image using image processing techniques. Regions where knives are present should get a higher weight in the MSE loss, say 10 times more. One way to find those regions semi-automatically could be convolution with a big kernel: white pixels exactly at the center surrounded only by zeros would give a strong response. Something along these lines should find regions where only knives are located (the throwing guys wouldn't match, as they contain too many white pixels and holes). Applying some empirically found threshold to the response of this kernel should be enough to locate them correctly.
Lower the loss for images where no knife was found, say by dividing it by two. This would focus the autoencoder harder on the rarely seen cases where a knife is present.
On the downside, I suppose it could introduce some artifacts. In that case you may think about using a pretrained encoder (like some version of ResNet) to increase the model's capacity.
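A minimal sketch of the re-weighted MSE idea (the knife mask itself would come from the kernel-plus-threshold trick described above, or any other detector; the weight of 10 and the tensor shapes are assumptions):

```python
import torch

# Per-pixel weighted MSE: pixels inside the knife mask count `knife_weight`
# times more than the rest, so the autoencoder can no longer ignore them.
def weighted_mse(recon, target, knife_mask, knife_weight=10.0):
    # recon, target: (B, C, H, W); knife_mask: (B, 1, H, W) with 1 where a knife is
    weights = 1.0 + (knife_weight - 1.0) * knife_mask
    return torch.mean(weights * (recon - target) ** 2)
```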
I understand that Batch Normalisation helps with faster training by pushing the activations towards a unit Gaussian distribution and thus tackling the vanishing gradients problem. Batch norm is applied differently at training time (use mean/var from each batch) and at test time (use the finalized running mean/var from the training phase).
Instance normalisation, on the other hand, acts as contrast normalisation, as mentioned in this paper https://arxiv.org/abs/1607.08022 . The authors mention that the output stylised images should not depend on the contrast of the input content image and hence instance normalisation helps.
But then should we not also use instance normalisation for image classification, where the class label should not depend on the contrast of the input image? I have not seen any paper using instance normalisation in place of batch normalisation for classification. What is the reason for that? Also, can and should batch and instance normalisation be used together? I am eager to get an intuitive as well as a theoretical understanding of when to use which normalisation.
Definition
Let's begin with the strict definition of both:
Batch normalization
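(The original formula image is not reproduced here; as a sketch, following the instance-norm paper's notation, where $t$ indexes the batch of size $T$, $i$ the channel, and $l, m$ run over the $W \times H$ spatial positions:)

$$\mu_i = \frac{1}{HWT}\sum_{t=1}^{T}\sum_{l=1}^{W}\sum_{m=1}^{H} x_{tilm},\qquad \sigma_i^2 = \frac{1}{HWT}\sum_{t=1}^{T}\sum_{l=1}^{W}\sum_{m=1}^{H}\left(x_{tilm}-\mu_i\right)^2,\qquad y_{tijk} = \frac{x_{tijk}-\mu_i}{\sqrt{\sigma_i^2+\epsilon}}$$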
Instance normalization
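(Correspondingly, again a reconstruction rather than the original image, with the statistics computed per instance $t$ and channel $i$, over the spatial positions only:)

$$\mu_{ti} = \frac{1}{HW}\sum_{l=1}^{W}\sum_{m=1}^{H} x_{tilm},\qquad \sigma_{ti}^2 = \frac{1}{HW}\sum_{l=1}^{W}\sum_{m=1}^{H}\left(x_{tilm}-\mu_{ti}\right)^2,\qquad y_{tijk} = \frac{x_{tijk}-\mu_{ti}}{\sqrt{\sigma_{ti}^2+\epsilon}}$$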
As you can notice, they are doing the same thing, except for the number of input tensors that are normalized jointly. The batch version normalizes all images across the batch and spatial locations (in the CNN case; in the ordinary case it's different); the instance version normalizes each element of the batch independently, i.e., across spatial locations only.
In other words, where batch norm computes one mean and std dev (thus making the distribution of the whole layer Gaussian), instance norm computes T of them, making each individual image distribution look Gaussian, but not jointly.
A simple analogy: during the data pre-processing step, it's possible to normalize the data on a per-image basis or to normalize the whole data set.
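To make the axis difference concrete, here is a small NumPy sketch (assuming the usual N x C x H x W layout; this is only the normalization step, without the learned scale and shift):

```python
import numpy as np

x = np.random.randn(8, 3, 32, 32)   # batch of 8 images, 3 channels, 32x32
eps = 1e-5

# Batch norm: one mean/var per channel, computed over batch AND spatial dims.
mu_bn  = x.mean(axis=(0, 2, 3), keepdims=True)     # shape (1, 3, 1, 1)
var_bn = x.var(axis=(0, 2, 3), keepdims=True)
y_bn   = (x - mu_bn) / np.sqrt(var_bn + eps)

# Instance norm: one mean/var per (image, channel), computed over spatial dims only.
mu_in  = x.mean(axis=(2, 3), keepdims=True)        # shape (8, 3, 1, 1)
var_in = x.var(axis=(2, 3), keepdims=True)
y_in   = (x - mu_in) / np.sqrt(var_in + eps)
```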
Credit: the formulas are from here.
Which normalization is better?
The answer depends on the network architecture, in particular on what is done after the normalization layer. Image classification networks usually stack the feature maps together and wire them to the FC layer, which shares weights across the batch (the modern way is to use a CONV layer instead of FC, but the argument still applies).
This is where the distribution nuances start to matter: the same neuron is going to receive the input from all images. If the variance across the batch is high, the gradient from the small activations will be completely suppressed by the high activations, which is exactly the problem that batch norm tries to solve. That's why it's fairly possible that per-instance normalization won't improve network convergence at all.
On the other hand, batch normalization adds extra noise to the training, because the result for a particular instance depends on the neighboring instances. As it turns out, this kind of noise may be either good or bad for the network. This is well explained in the "Weight Normalization" paper by Tim Salimans et al., which names recurrent neural networks and reinforcement learning DQNs as noise-sensitive applications. I'm not entirely sure, but I think that the same noise-sensitivity was the main issue in the stylization task, which instance norm tried to fight. It would be interesting to check if weight norm performs better for this particular task.
Can you combine batch and instance normalization?
Though it makes a valid neural network, there's no practical use for it. Batch normalization noise is either helping the learning process (in which case it's preferable) or hurting it (in which case it's better to omit it). In either case, leaving the network with one type of normalization is likely to improve performance.
Great question and already answered nicely. Just to add: I found this visualisation from Kaiming He's Group Norm paper helpful.
Source: link to article on Medium contrasting the Norms
I wanted to add more information to this question since there are some more recent works in this area. Your intuition
use instance normalisation for image classification where class label
should not depend on the contrast of input image
is partly correct. I would say that a pig in broad daylight is still a pig when the image is taken at night or at dawn. However, this does not mean using instance normalization across the network will give you a better result. Here are some reasons:
Color distribution still plays a role. It is more likely to be an apple than an orange if it has a lot of red.
At later layers, you can no longer think of instance normalization as contrast normalization. Class-specific details emerge in deeper layers, and normalizing them per instance will hurt the model's performance greatly.
IBN-Net uses both batch normalization and instance normalization in their model. They only put instance normalization in early layers and have achieved improvement in both accuracy and ability to generalize. They have open sourced code here.
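Not their exact implementation, but as a rough PyTorch-style sketch of the idea: instance-normalize part of the channels in an early block and batch-normalize the rest, then concatenate. The half-and-half split is my assumption for illustration.

```python
import torch
import torch.nn as nn

# Rough sketch of an IBN-style block mixing IN (appearance invariance) with BN
# (discriminative features). Only illustrates the idea, not IBN-Net's exact code.
class IBNBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.half = channels // 2
        self.inorm = nn.InstanceNorm2d(self.half, affine=True)
        self.bnorm = nn.BatchNorm2d(channels - self.half)

    def forward(self, x):
        a, b = torch.split(x, [self.half, x.size(1) - self.half], dim=1)
        return torch.cat([self.inorm(a), self.bnorm(b)], dim=1)
```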
IN provides visual and appearance invariance, while BN accelerates training and preserves discriminative features.
IN is preferred in shallow layers (the starting layers of the CNN) to remove appearance variation, while BN is preferred in deep layers (the last CNN layers) in order to maintain discrimination.
I want to find images similar to another image. After researching, I found two methods. The first was to represent the image by its attributes, like
length = full
pattern = check
color = blue
but the limitation of this method is that I will not be able to get an exhaustive dataset with all the features marked.
The second approach I found was to extract features and do feature mapping.
So I decided to use deep convolutional neural networks with Caffe, so that by using any of the existing models I could learn the features and then perform feature matching or some other operation. I just wanted some general advice on what other methods are good and worth a try. And since I am just starting out with Caffe, can anyone give a general guideline on how to approach the problem with Caffe?
Thanks in advance
I looked at pHash; I was just curious whether it will only find images that are the same up to minor intensity variations and other small variations, or whether it will also work to give the same type (semantically). For example, for a t-shirt with blue and red stripes, will it give a black and white striped one as similar, and would it consider things like the length of the shirt, collar style, etc.?
It's true that it's been empirically shown that the euclidean distance between features extracted using ConvNets is smaller for images of the same class and larger for images of different classes - but it's important to understand what kind of similarity you're looking for.
One can define many types of similarity measures, and the type of features you use (in the case of ConvNets, the type of data it was trained on) affects the kind of similar images you'll get. For instance, maybe given an image of a dog, you want to find other pictures of dogs but not specifically that exact dog; alternatively, maybe you have a picture of a church and you want to find another image of the exact same church but from a different angle. These are two very different problems, with different methods you can use to solve them.
One particular kind of convolutional neural network you can look at is the Siamese network, which is built to learn similarities between two images, given a dataset of pairs of images with the labels same/not_same. You can look for a Caffe implementation of this method here.
A different method is to take a ConvNet trained on ImageNet data (see here for options), use the python/matlab interface to classify images, then extract the second-to-last layer and use that as the representation for the image. Now you can just take the euclidean distance between those representations, and this will be your similarity measure.
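A rough pycaffe sketch of that second method. Note that the prototxt/weights paths, the blob name 'fc7' (the second-to-last layer in CaffeNet-like models) and the image file names are placeholders that depend on which pretrained model you download from the model zoo.

```python
import numpy as np
import caffe

# Placeholder paths; substitute the deploy prototxt and weights of your chosen model.
net = caffe.Net('deploy.prototxt', 'weights.caffemodel', caffe.TEST)

transformer = caffe.io.Transformer({'data': net.blobs['data'].data.shape})
transformer.set_transpose('data', (2, 0, 1))      # HWC -> CHW
transformer.set_raw_scale('data', 255)            # [0,1] -> [0,255]
transformer.set_channel_swap('data', (2, 1, 0))   # RGB -> BGR

def describe(image_path):
    image = caffe.io.load_image(image_path)
    net.blobs['data'].data[...] = transformer.preprocess('data', image)
    net.forward()
    return net.blobs['fc7'].data[0].copy()         # second-to-last layer activations

# Similarity = euclidean distance between the two representations (smaller = more similar).
d = np.linalg.norm(describe('shirt_a.jpg') - describe('shirt_b.jpg'))
```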
Unrelated to Caffe, you can also use "old school" methods of feature matching, included in open source libraries like OpenCV (an example tutorial of such a method).
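For completeness, a tiny OpenCV sketch of such "old school" feature matching (ORB keypoints with brute-force Hamming matching; the file names and the distance cut-off of 50 are assumptions):

```python
import cv2

img1 = cv2.imread('query.jpg', cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread('candidate.jpg', cv2.IMREAD_GRAYSCALE)

orb = cv2.ORB_create()
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)

# Brute-force matcher with Hamming distance (appropriate for ORB's binary descriptors).
bf = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(bf.match(des1, des2), key=lambda m: m.distance)

# A crude similarity score: more good (low-distance) matches = more similar images.
good = [m for m in matches if m.distance < 50]
print(len(good))
```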
I am new to OpenCV, and I am guessing that this problem could be somewhat simple: I am trying to detect an object which is roughly 25 by 15 pixels in an image which is 470 by 590 pixels.
I am attaching a zoomed image of this object, I have several options to go with:
1 - Detection of two close circles using the Hough transform,
2 - Histogram matching
3 - SURF feature detection
Any advice on which direction I should take? Please consider speed and real-time application. Thanks
I think it should go without explicitly saying so, but there are probably hundreds of things that could be tried, and with only one example image it is quite difficult to advise. For instance, are the LEDs always green? We don't know.
That aside, IMHO, two good places to start would be with ol' faithful template matching, or blob detection.
Then, if that is not robust enough, you will need to look at some alternative representations of the template/blob, like the classic HOG (good for shape, maybe a bit heavy for this application), or even your own bespoke one that encodes your domain-specific knowledge of this problem.
Then, if that is still not robust enough, build a dataset of representative positive and negative examples, as big as you can, and train a classifier such as an SVM or a boosted classifier.
Template Matching:
http://docs.opencv.org/doc/tutorials/imgproc/histograms/template_matching/template_matching.html
Blob detection:
https://code.google.com/p/cvblob/
Machine Learning:
http://docs.opencv.org/modules/ml/doc/ml.html
TIPS:
Add as much domain knowledge as possible, i.e. if they are always green, use color in the representation, like HOG on the G channel for instance. If they are always circular, try to encode that, e.g. use a log-polar grid in the template rather than a regular grid... and so on.
Machine learning is not magic: a linear classifier will essentially weight different points in the feature space, so you still require a good representation. If the template matching was a total failure, then it is unlikely that simple linear ML will help; but if the template matching was okay, then ML may well boost the performance to a good level.
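To make the first suggestion concrete, here is a minimal template-matching sketch that also uses the colour tip above by matching on the green channel only. The file names, the match threshold of 0.7, and the single-detection logic are assumptions for illustration.

```python
import cv2

scene = cv2.imread('scene.png')          # the 470x590 image
templ = cv2.imread('led_template.png')   # a cropped ~25x15 example of the object

# Domain knowledge: if the target is always green, match on the G channel only (BGR index 1).
scene_g = scene[:, :, 1]
templ_g = templ[:, :, 1]

res = cv2.matchTemplate(scene_g, templ_g, cv2.TM_CCOEFF_NORMED)
min_val, max_val, min_loc, max_loc = cv2.minMaxLoc(res)

# max_loc is the top-left corner of the best match; the threshold is chosen empirically.
if max_val > 0.7:
    h, w = templ_g.shape
    cv2.rectangle(scene, max_loc, (max_loc[0] + w, max_loc[1] + h), (0, 0, 255), 2)
```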
Step 1: Remove the black background.
Step 2: A snake algorithm can be used to find the boundaries of your object.