Not CNNs, regular NNs. Also, I'm actually interested in making an AI-based edge detector. I've read some papers, but none of them gave me a good starting point. Can anyone share some getting-started tips for making edge detectors with AI? CNNs work as classifiers, not image filters. So how can I do it?
Back-propagation is one of the most popular techniques for training a neural network, and it is mainly used for classification. During back-propagation training, the network effectively learns a convolution matrix: the knowledge that actually produces the edges from a gray-level image.
But I have another question: what kind of learning are you opting for to train your NN, supervised or unsupervised?
Supervised - Train the network on a labelled data set, i.e. example inputs marked as edge or non-edge.
Unsupervised - Create an input layer with 5 inputs (a central pixel and its four neighbours), subtract the central pixel from each of the four neighbouring pixels, and apply thresholding at the output layer.
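The five-input scheme above can be sketched without any training at all. This is only a rough illustration in NumPy, not a neural network: the function name, the border handling (edge replication) and the threshold value are my own assumptions.

```python
import numpy as np

def neighbour_difference_edges(image, threshold):
    """Mark a pixel as an edge if it differs strongly from any 4-neighbour."""
    img = np.asarray(image, dtype=float)
    # Assumption: replicate the border so every pixel has four neighbours.
    padded = np.pad(img, 1, mode="edge")
    centre = padded[1:-1, 1:-1]
    # The five "inputs" are the centre pixel and its up/down/left/right neighbours.
    neighbours = np.stack([
        padded[:-2, 1:-1],   # up
        padded[2:, 1:-1],    # down
        padded[1:-1, :-2],   # left
        padded[1:-1, 2:],    # right
    ])
    # Largest absolute centre-neighbour difference, then threshold.
    response = np.abs(neighbours - centre).max(axis=0)
    return response > threshold

# A 5x5 image with a vertical step edge between columns 1 and 2.
step = np.zeros((5, 5))
step[:, 2:] = 1.0
edges = neighbour_difference_edges(step, threshold=0.5)
```

Here `edges` is true only in the two columns on either side of the step. A trained network would replace the fixed subtraction and threshold with learned weights.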
You can even go for a HYBRID NEURO-FUZZY APPROACH:
On the given input image, the Sobel and Laplacian operators are applied.
Fuzzy rules are applied to the output we get from these operators.
In the neural network, the input layer consists of the gradient direction and the hidden layer consists of the fuzzy data. Both are used to train the network.
Hope it helps.
Each layer in a CNN reduces the size of the input via convolution and max-pooling operations. Convolution is translation equivariant, but max-pooling is translation invariant. Correct me if this is wrong: each time max-pooling is applied, the precise location of a feature becomes less certain. So the feature maps of the final conv layer in a very deep CNN will have a large receptive field (w.r.t. the original image), but the location of a feature (in the original image) is not discernible from looking at this feature map alone.
If this is true, how can the accuracy of bounding boxes when we do localisation be so good with a deep CNN? I understand how classification works, but making accurate bounding box predictions is confusing me.
Perhaps a toy example will clarify my confusion:
Say we have a dataset of images with dimension 256x256x1, and we want to predict whether a cat is present, and if so, where it is, so our target is something like [sigmoid_cat_present, cat_location].
Our vanilla CNN (let's assume something like VGG) will take in the image and transform it to something like 16x16x256 in the last convolutional layer. Each pixel in this final 16x16 feature map can be influenced by a much larger region in the original image. So if we determine a cat is present, how can the [cat_location] be refined to a value more granular than this effective receptive field?
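To make the toy example concrete, here is a small NumPy sketch of what pooling does to location. It assumes plain non-overlapping 2x2 max-pooling and four pooling stages (256 / 2^4 = 16); the real layer layout of the network is not specified above, so treat both as assumptions.

```python
import numpy as np

def max_pool_2x2(x):
    """Non-overlapping 2x2 max-pooling on a 2-D array with even sides."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

# Two inputs whose single "feature" sits at different pixels of the same window.
a = np.zeros((4, 4))
a[0, 0] = 1.0
b = np.zeros((4, 4))
b[1, 1] = 1.0
same_after_pooling = np.array_equal(max_pool_2x2(a), max_pool_2x2(b))

# Four pooling stages take a 256x256 map down to 16x16,
# so one final cell corresponds to a stride of 16 input pixels.
x = np.zeros((256, 256))
for _ in range(4):
    x = max_pool_2x2(x)
final_shape = x.shape
```

The two inputs become indistinguishable after a single pooling step, which is exactly the loss of precise position described in the question.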
To add to your question: how about the pixel-perfect accuracy of segmentation boundaries?
Your intuition regarding down-sampling via max-pooling is correct. Normal CNNs have that limit. However, there have been some improvements recently to overcome it.
The breakthrough on this problem came in 2015-16 in the form of U-Net and the atrous/dilated convolutions introduced in DeepLab.
Dilated convolutions (also called atrous convolutions), previously described for wavelet analysis without signal decimation, expand the window size without increasing the number of weights by inserting zero values into the convolution kernels. Dilated convolutions have been shown to decrease blurring in semantic segmentation maps, and are purported to work at least in part by extracting long-range information without the need for pooling.
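As a rough illustration of the idea, here is a 1-D dilated convolution in plain NumPy. The function name and the 'valid' boundary handling are my own choices; the point is only that the same three weights cover a wider window when the rate goes up.

```python
import numpy as np

def dilated_conv1d(signal, kernel, rate):
    """'Valid' 1-D correlation with gaps of (rate - 1) samples between taps."""
    signal = np.asarray(signal, dtype=float)
    kernel = np.asarray(kernel, dtype=float)
    k = len(kernel)
    span = (k - 1) * rate + 1          # input samples covered by one output
    n_out = len(signal) - span + 1
    out = np.empty(n_out)
    for i in range(n_out):
        # Take every rate-th sample inside the window, then weight and sum.
        out[i] = np.dot(signal[i:i + span:rate], kernel)
    return out, span

signal = np.arange(10.0)
kernel = np.array([1.0, 1.0, 1.0])     # three weights in both cases
dense, dense_span = dilated_conv1d(signal, kernel, rate=1)
dilated, dilated_span = dilated_conv1d(signal, kernel, rate=2)
```

With `rate=1` each output sees 3 neighbouring samples; with `rate=2` it sees 5 samples' worth of context using the same 3 weights, and no pooling is involved.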
Using U-Net architectures is another method that seeks to retain high spatial frequency information by directly adding skip connections between early and late layers. In other words, down-sampling followed by up-sampling, with the early high-resolution feature maps passed across to the matching late layers.
In TensorFlow, atrous convolutions are implemented with the function:
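A very rough NumPy sketch of the skip-connection idea, under my own simplifying assumptions (a single 2x2 max-pool, nearest-neighbour upsampling, no learned weights): the upsampled map alone is blurred, while stacking it with the early map keeps the exact position available to later layers.

```python
import numpy as np

def downsample(x):
    """2x2 max-pooling."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def upsample(x):
    """Nearest-neighbour upsampling by a factor of 2."""
    return np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)

early = np.zeros((4, 4))
early[1, 2] = 1.0                      # a sharp feature at full resolution
coarse = downsample(early)             # 2x2: exact position is lost
restored = upsample(coarse)            # 4x4 again, but smeared over a 2x2 block

# Skip connection: stack the early map with the upsampled one as channels.
merged = np.stack([early, restored])
```

In a real U-Net the merged channels would be fed through further convolutions; this only shows what information the skip connection carries across.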
tf.nn.atrous_conv2d
There are many more methods and this is an ongoing research area.
I'm learning about Convolutional Neural Networks and right now I'm confused about how to implement them.
I know about regular neural networks and concepts like gradient descent and back-propagation, and I understand intuitively how CNNs work.
My question is about back-propagation in CNNs. How does it happen? The last fully connected layers are a regular neural network, and there is no problem with that. But how can I update the filters in the convolution layers? How can I back-propagate the error from the fully connected layers to these filters? My problem is updating the filters!
Are filters just simple matrices? Or do they have a structure like regular NNs, with the connections between layers simulating that capability? I have read about sparse connectivity and shared weights, but I can't relate them to CNNs. I'm really confused about implementing CNNs and I can't find any tutorials that talk about these concepts. I can't read papers because I'm new to these things and my math is not good.
I don't want to use TensorFlow or tools like it; I'm learning the main concepts and using pure Python.
First off, I can recommend this introduction to CNNs. Maybe you can grasp the idea of it better with this.
To answer some of your questions in short:
Let's say you want to use a CNN for image classification. The picture consists of NxM pixels and has 3 channels (RGB). To apply a convolutional layer to it, you use a filter. Filters are matrices of (usually, but not necessarily) square shape (e.g. PxP) with a number of channels equal to the number of channels of the representation they are applied to. Therefore, a filter in the first conv layer also has 3 channels. Channels are the number of layers of the filter, so to speak.
When applying a filter to a picture, you do something called discrete convolution. You take your filter (which is usually smaller than your image), slide it over the picture step by step, and at each position calculate the sum of the element-wise products of the filter and the image patch beneath it (a dot product, not a full matrix multiplication). Then you apply an activation function to the result, and maybe even a pooling layer. It is important to note that the filter stays the same for all convolutions performed in this layer, so each filter has only P*P parameters per input channel (P*P*3 for the first layer above, plus a bias if one is used). You tweak the filter so that it fits the training data as well as possible. Because the same weights are reused at every position, they are called shared weights. When applying GD, you simply apply it to these filter weights: the gradient of each weight is the sum of its contributions over all positions where the filter was applied.
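Since the question was specifically about updating the filters in pure Python, here is a minimal NumPy sketch of one filter's forward pass and its gradient. It is a simplification under my own assumptions: a single filter, stride 1, no padding, no bias, no activation, and a made-up squared-error loss against a random target just so that there is an error to back-propagate.

```python
import numpy as np

def conv2d_forward(image, filt):
    """'Valid' sliding-window correlation of an HxWxC image with a PxPxC filter."""
    H, W, _ = image.shape
    P = filt.shape[0]
    out = np.empty((H - P + 1, W - P + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Element-wise product of filter and patch, summed to one number.
            out[i, j] = np.sum(image[i:i + P, j:j + P, :] * filt)
    return out

def conv2d_filter_grad(image, filt_shape, grad_out):
    """Gradient of the loss w.r.t. the shared filter, given dLoss/dOutput."""
    P = filt_shape[0]
    grad_filt = np.zeros(filt_shape)
    for i in range(grad_out.shape[0]):
        for j in range(grad_out.shape[1]):
            # The same weights were used at every position, so contributions add up.
            grad_filt += grad_out[i, j] * image[i:i + P, j:j + P, :]
    return grad_filt

rng = np.random.default_rng(0)
image = rng.normal(size=(6, 6, 3))     # hypothetical 6x6 RGB input
filt = rng.normal(size=(3, 3, 3))      # one 3x3 filter with 3 channels
target = rng.normal(size=(4, 4))       # hypothetical target, for illustration only

def loss_fn(f):
    return 0.5 * np.sum((conv2d_forward(image, f) - target) ** 2)

out = conv2d_forward(image, filt)
grad_out = out - target                # dLoss/dOutput for the squared error
grad_filt = conv2d_filter_grad(image, filt.shape, grad_out)

# Finite-difference check of a single weight.
eps = 1e-6
bumped = filt.copy()
bumped[1, 2, 0] += eps
numeric = (loss_fn(bumped) - loss_fn(filt)) / eps

# One plain gradient-descent step on the filter weights.
learning_rate = 0.001
new_filt = filt - learning_rate * grad_filt
```

In a full network, `grad_out` would come from the layers above (pooling, activation, fully connected) via the chain rule instead of directly from a loss, but the filter update itself looks the same.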
Also, you can find a nice demo for the convolutions here.
Implementing these things is certainly possible, but for starting out you could try TensorFlow for experimenting. At least that's the way I learn new concepts :)
Yesterday I learned about convolutional neural networks and went through some implementations of CNNs using TensorFlow. All the implementations only specify the size, number of filters and strides for the filters. But when I learned about filters, I read that the filters in each layer extract different features like edges, corners etc.
My question is: can we explicitly specify the filters, i.e. which features we should extract, or which portion of the image is more important, etc.?
All the explanations say that we take a small part of an input image and slide it across the image, convolving as we go. If so, do we take all the parts of the image and convolve them across the image?
can we explicitly specify the filters, i.e. which features we should extract, or which portion of the image is more important, etc.?
Sure, this could be done. But the advantage of CNNs is that they learn the best features themselves (or at least very good ones; better ones than we can come up with in most cases).
One famous example is the ImageNet dataset:
In 2012, an end-to-end learned CNN (AlexNet) won the ImageNet challenge for the first time. End-to-end means that the network gets the raw data on one end as input and the optimization objective on the other end.
Before CNNs, the computer vision community used manually designed features for many years. After AlexNet in 2012, hand-designed features largely fell out of use (for "typical" computer vision; there are special applications where they are still worth a shot).
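For completeness, this is roughly what "explicitly specifying a filter" looks like. The sketch uses the classic horizontal Sobel kernel as the hand-designed filter and a plain NumPy sliding window; the helper name and the tiny test image are made up for illustration.

```python
import numpy as np

# Hand-specified weights: the horizontal Sobel kernel (responds to vertical edges).
SOBEL_X = np.array([[-1.0, 0.0, 1.0],
                    [-2.0, 0.0, 2.0],
                    [-1.0, 0.0, 1.0]])

def apply_filter(image, filt):
    """Slide a fixed filter over the whole image ('valid' region only)."""
    P = filt.shape[0]
    H, W = image.shape
    out = np.empty((H - P + 1, W - P + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + P, j:j + P] * filt)
    return out

# A 5x6 image that is dark on the left and bright on the right.
image = np.zeros((5, 6))
image[:, 3:] = 1.0
response = apply_filter(image, SOBEL_X)
```

The response is non-zero only where the window straddles the dark-to-bright boundary. In a CNN the nine numbers in `SOBEL_X` would instead start out random and be learned from data.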
All the explanations say that we take a small part of an input image and slide it across the image, convolving as we go. If so, do we take all the parts of the image and convolve them across the image?
It is always the complete image which is convolved with a small filter. The convolution operation is local, meaning you can compute much of it in parallel, as the result of the convolution in the upper left corner is not dependent on the convolution in the lower left corner.
I think you may be confusing filters and channels. A filter is the window of weights that is slid over the input in your convolution; each filter produces one channel of the convolution output. It is typically these channels that represent different features:
In this car identification example you can see some of the earlier channels picking up things like the hood, doors, and other borders of the car. It is hard to truly specify which features the network is extracting. If you already know which features are important, you can feed them in as an additional mask layer or apply some type of weighting matrix to them.
Probably lots of people have already seen this article by Google Research:
http://googleresearch.blogspot.ru/2015/06/inceptionism-going-deeper-into-neural.html
It describes how a Google team made neural networks actually draw pictures, like an artificial artist :)
I wanted to do something similar just to see how it works, and maybe use it in the future to better understand what makes my network fail. The question is how to achieve it with nolearn/lasagne (or maybe pybrain; that would also work, but I prefer nolearn).
To be more specific, the people at Google trained an ANN with some architecture to classify images (for example, to classify which fish is in a photo). Fine, suppose I have an ANN constructed in nolearn with some architecture and I have trained it to some degree. But what do I do next? I don't get it from their article. It doesn't seem that they just visualize the weights of some specific layers. It seems to me (maybe I am wrong) that they do one of 2 things:
1) Feed some existing image or pure random noise to the trained network and visualize the activations of one of the neuron layers. But it looks like this is not the full story, since if they used a convolutional neural network the dimensionality of the layers might be lower than the dimensionality of the original image
2) Or they feed random noise to the trained ANN, get its intermediate output from one of the middle layers and feed it back into the network, to get some kind of loop and inspect what the network's layers think might be out there in the random noise. But again, I might be wrong due to the same dimensionality issue as in #1
So... Any thoughts on that? How could we do something similar to what Google did in the original article using nolearn or pybrain?
From their ipython notebook on github:
Making the "dream" images is very simple. Essentially it is just a gradient ascent process that tries to maximize the L2 norm of activations of a particular DNN layer. Here are a few simple tricks that we found useful for getting good images:
offset image by a random jitter
normalize the magnitude of gradient ascent steps
apply ascent across multiple scales (octaves)
It is done using a convolutional neural network, and you are correct that the dimensions of the activations will be smaller than the original image, but this isn't a problem.
You change the image with iterations of forward/backward propagation, just as you would when normally training a network. On the forward pass, you only need to go until you reach the particular layer you want to work with. Then on the backward pass, you propagate back to the inputs of the network instead of the weights.
So instead of finding the gradients of a loss function with respect to the weights, you are finding the gradients of the L2 norm of a certain set of activations with respect to the input image.
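Here is a toy sketch of that loop in NumPy. It is not the Google notebook and not nolearn code: the "network" is a single frozen linear layer with ReLU and random stand-in weights, the "image" is a flat vector, and only the gradient-normalization trick is included (no jitter, no octaves). All names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))           # stand-in for trained weights, kept frozen

def layer_activations(x):
    """Forward pass up to the layer of interest (one linear layer + ReLU here)."""
    return np.maximum(W @ x, 0.0)

def objective(x):
    """Half the squared L2 norm of the chosen layer's activations."""
    return 0.5 * np.sum(layer_activations(x) ** 2)

def input_gradient(x):
    """Backward pass to the input: d/dx 0.5*||relu(Wx)||^2 = W^T relu(Wx)."""
    # Inactive units contribute nothing because their activation is zero.
    return W.T @ layer_activations(x)

# Start from an input that activates at least one unit
# (an input with all-zero activations would give a zero gradient).
x = 0.1 * W[0]
start_value = objective(x)

step_size = 0.05
for _ in range(20):
    g = input_gradient(x)
    # Gradient ascent on the input, with the step magnitude normalized.
    x = x + step_size * g / (np.abs(g).mean() + 1e-8)

end_value = objective(x)
```

The weights never change; only `x` does, and the chosen layer's activation norm grows with each step. With a real framework you would get `input_gradient` from its automatic differentiation rather than deriving it by hand.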
At the end of the introduction to this instructive Kaggle competition, they state that the methods used in "Viola and Jones' seminal paper works quite well". However, that paper describes a system for binary facial recognition, and the problem being addressed is the classification of keypoints, not entire images. I am having a hard time figuring out how, exactly, I would go about adjusting the Viola/Jones system for keypoint recognition.
I assume I should train a separate classifier for each keypoint, and some ideas I have are:
iterate over sub-images of a fixed size and classify each one, where an image with a keypoint as center pixel is a positive example. In this case I'm not sure what I would do with pixels close to the edge of the image.
instead of training binary classifiers, train classifiers with l*w possible classes (one for each pixel). The big problem with this is that I suspect it will be prohibitively slow, as every weak classifier suddenly has to do l*w*original operations
the third idea I have isn't totally hashed out in my mind, but since the keypoints are each parts of a greater part of a face (left, right center of an eye, for example), maybe I could try to classify sub-images as just an eye, and then use the left, right, and center pixels (centered in the y coordinate) of the best-fit subimage for each face-part
Is there any merit to these ideas, and are there methods I haven't thought of?
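For the first idea, one possible way to deal with pixels near the image border is to pad the image before cutting out sub-images, so that every pixel gets a full-size window. The sketch below uses reflection padding, which is just one option I picked for illustration; the function name and the odd window size are assumptions too.

```python
import numpy as np

def centred_patches(image, size):
    """Yield ((row, col), patch) with a size x size patch centred on every pixel.

    Assumes `size` is odd. The border is handled by reflection padding.
    """
    half = size // 2
    padded = np.pad(image, half, mode="reflect")
    for r in range(image.shape[0]):
        for c in range(image.shape[1]):
            yield (r, c), padded[r:r + size, c:c + size]

image = np.arange(25.0).reshape(5, 5)
patches = dict(centred_patches(image, 3))
```

Each patch would then be labelled positive if its centre pixel is the keypoint and negative otherwise, and handed to the binary classifier.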
however, that paper describes a system for binary facial recognition
No, read the paper carefully. What they describe is not face-specific; face detection was just the motivating problem. The Viola-Jones paper introduced a new strategy for binary object detection.
You could train a Viola-Jones style cascade for eyes, another for a nose, and one for each keypoint you are interested in.
Then, when you run the code - you should (hopefully) get 2 eyes, 1 nose, etc, for each face.
Provided you get the number of items you expected, you can then say "here are the key points!" What takes more work is getting enough data to build a good detector for each thing you want to detect, and gracefully handling false positives / negatives.
I ended up working on this problem extensively. I used "deep learning," aka several layers of neural networks. I used convolutional networks. You can learn more about them by checking out these demos:
http://cs.stanford.edu/people/karpathy/convnetjs/demo/mnist.html
http://deeplearning.net/tutorial/lenet.html#lenet
I made the following changes to a typical convolutional network:
I did not do any down-sampling, as any loss of precision directly translates to a decrease in the model's score
I did n-way binary classification, with each pixel being classified as a keypoint or non-keypoint (#2 in the things I listed in my original post). As I suspected, computational complexity was the primary barrier here. I tried to use my GPU to overcome these issues, but the number of parameters in the neural network was too large to fit in GPU memory, so I ended up using an XL Amazon instance for training.
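To give an idea of what the per-pixel formulation looks like, here is a small NumPy sketch of building the target for one keypoint and reading a prediction back out. This is not the code from my repo; the helper names, the 96x96 image size and the example coordinates are assumptions for illustration.

```python
import numpy as np

def keypoint_target(height, width, keypoint_xy):
    """Binary map with a single 1 at the (rounded) keypoint position."""
    target = np.zeros((height, width), dtype=np.float32)
    x, y = keypoint_xy
    col = int(np.clip(round(x), 0, width - 1))
    row = int(np.clip(round(y), 0, height - 1))
    target[row, col] = 1.0
    return target

def predicted_keypoint(score_map):
    """Recover (x, y) as the location of the highest per-pixel score."""
    row, col = np.unravel_index(np.argmax(score_map), score_map.shape)
    return float(col), float(row)

# Hypothetical example: a 96x96 image with a keypoint at x=66.3, y=39.8.
target = keypoint_target(96, 96, (66.3, 39.8))
recovered = predicted_keypoint(target)
```

Note that the target has height * width outputs per keypoint, which is where the memory and compute cost mentioned above comes from, and that rounding to whole pixels already costs a little precision.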
Here's a github repo with some of the work I did:
https://github.com/cowpig/deep_keypoints
Anyway, given that deep learning has blown up in popularity, there are surely people who have done this much better than I did, and published papers about it. Here's a write-up that looks pretty good:
http://danielnouri.org/notes/2014/12/17/using-convolutional-neural-nets-to-detect-facial-keypoints-tutorial/