I'm trying to implement an object detection module which contains the following steps:
1) extract image descriptors with SURF, creating a matrix of size [x, 64], where x depends on the number of keypoints found in the image;
2) fix the descriptor size to a [k, 64] format using a bag-of-features/words approach, where k is the number of clusters created using k-means;
3) feed a neural network using the resulting bag-of-words matrix as trainingSamples.
So far I've implemented steps 1 and 2, but I'm not quite sure how to format the output matrix of the NN. In OpenCV's CvANN_MLP, the output matrix must have the same number of rows as the input matrix (otherwise train throws an exception), but my input rows are the k clusters from step 2, so I'm not understanding how to write the output matrix based on that.
I know the output matrix should have n columns corresponding to the number of classes I want (e.g. 3 classes: cat, dog and bird would result in a matrix with 3 columns), but how do I organize the rows of this matrix based on the input rows? I read this related post; it uses MATLAB and says that each feature should be a row, but I'm not sure how to do this in OpenCV C++.
If anyone has any ideas or tips on how to proceed, it would be very appreciated.
Have you done this:
However, before you train your neural network, as you suspected, you
must represent every image you wish to train with this feature vector.
Before feeding your neural network? I lack experience with neural networks; however, after reading this and your question, it seems that you are trying to feed the bag-of-words clusters themselves to your neural network, which is incorrect: each training image should be represented by its own fixed-length feature vector.
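To illustrate the shapes CvANN_MLP expects, here is a minimal sketch against the OpenCV 2.4 API, assuming numImages training images each encoded as a K-bin bag-of-words histogram and 3 classes; bowHistogramFor() and labelOf() are hypothetical helpers, and the hidden layer size of 32 is an arbitrary choice:

    #include <opencv2/core/core.hpp>
    #include <opencv2/ml/ml.hpp>

    // Sketch only: one BoW histogram per row, one one-hot target row per image.
    void trainMlp(int numImages, int K) {
        const int numClasses = 3;  // e.g. cat, dog, bird
        cv::Mat samples(numImages, K, CV_32F);                          // inputs
        cv::Mat targets = cv::Mat::zeros(numImages, numClasses, CV_32F);

        for (int i = 0; i < numImages; ++i) {
            // bowHistogramFor(i).copyTo(samples.row(i));  // hypothetical helper
            // targets.at<float>(i, labelOf(i)) = 1.0f;    // hypothetical helper
        }

        cv::Mat layers = (cv::Mat_<int>(1, 3) << K, 32, numClasses);
        CvANN_MLP mlp(layers);
        // Rows of samples and targets must match: one row per training image.
        mlp.train(samples, targets, cv::Mat());
    }

The key point is that the row count is the number of training images, not the number of clusters; k only determines the number of input columns.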
I am trying to create a neural network that outputs more than a binary value.
The problem is the following:
I have recently stumbled upon this problem on Kaggle: https://www.kaggle.com/c/poker-rule-induction
Basically, the problem is getting the program to predict the poker hand class (0-9) for each row of the test set. I have already managed to solve this problem using the RandomForest library.
My question is how can I solve this problem using a neural network?
I have already tried to follow some tutorials where you have 2 binary inputs and 1 binary output.
The dataset looks like the following:
If I understand you correctly, you are asking how to structure your neural network's output neurons for a multi-class problem. Instead of having one neuron output a non-binary value (0-9), which doesn't really work for many reasons, you can design the outputs to produce a binary vector.
Where...
0 = [1,0,0,0,0,0,0,0,0,0]
1 = [0,1,0,0,0,0,0,0,0,0]
2 = [0,0,1,0,0,0,0,0,0,0]
3 = [0,0,0,1,0,0,0,0,0,0]
...etc
So, each item in the vector corresponds to one of the 10 output neurons, and if that item is a 1, its position indicates the classification group. The best example of this is the MNIST digit neural networks, which also usually use 10 binary output neurons.
Bear in mind the actual outputs will be decimals representing a probability/guess that is close to either 0 or 1.
This also means your target value has to be a vector of the same form, so that backpropagation has one error term for each output neuron.
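As a minimal sketch of this encoding (in C++, with function names of my own choosing), the target for each training sample is a one-hot vector, and a prediction is decoded by taking the argmax of the outputs:

    #include <algorithm>
    #include <vector>

    // Encode a class label (0-9) as a one-hot target vector.
    std::vector<float> oneHot(int label, int numClasses = 10) {
        std::vector<float> target(numClasses, 0.0f);
        target[label] = 1.0f;  // the position of the 1 identifies the class
        return target;
    }

    // Decode a network's output back to a label by picking the largest value,
    // since the outputs are only close to 0 or 1, never exact.
    int decode(const std::vector<float>& output) {
        return static_cast<int>(
            std::max_element(output.begin(), output.end()) - output.begin());
    }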
I am using the word2vec model for training a neural network and building a neural embedding for finding similar words in the vector space. But my question is about the dimensions of the word and context embeddings (matrices), which we initialise with random numbers (vectors) at the beginning of training, like this: https://iksinc.wordpress.com/2015/04/13/words-as-vectors/
Let's say we want to display the words {book, paper, notebook, novel} on a graph. First of all we have to build a matrix with dimensions 4x2 or 4x3 or 4x4 etc. I know the first dimension of the matrix is the size of our vocabulary |v|. But what about the second dimension of the matrix (the number of the vector's dimensions)? For example, if this is the vector for the word "book": [0.3, 0.01, 0.04], what are these numbers? Do they have any meaning? For example, is 0.3 related to the relation between the words "book" and "paper" in the vocabulary, 0.01 to the relation between "book" and "notebook", etc.?
Just like in TF-IDF or co-occurrence matrices, where each dimension (column) y has a meaning: it's a word or document related to the word in row x.
The word2vec model uses a network architecture to represent the input word(s) and the most likely associated output word(s).
Assuming there is one hidden layer (as in the example linked in the question), the two matrices introduced represent the weights and biases that allow the network to compute its internal representation of the function mapping the input vector (e.g. “cat” in the linked example) to the output vector (e.g. “climbed”).
The weights of the network are a sub-symbolic representation of the mapping between the input and the output – any single weight doesn’t necessarily represent anything meaningful on its own. It’s the connection weights between all units (i.e. the interactions of all the weights) in the network that gives rise to the network’s representation of the function mapping. This is why neural networks are often referred to as “black box” models – it can be very difficult to interpret why they make particular decisions and how they learn. As such, it's very difficult to say what the vector [0.3,0.01,0.04] represents exactly.
Network weights are traditionally initialised to random values for two main reasons:
It prevents a bias from being introduced into the model before training begins
It allows the network to start from different points in the search space after initialisation (helping reduce the impact of local minima)
A network’s ability to learn can be very sensitive to the way its weights are initialised. There are more advanced ways of initialising weights today e.g. this paper (see section: Weights initialization scaling coefficient).
The way in which weights are initialised and the dimension of the hidden layer are often referred to as hyper-parameters and are typically chosen according to heuristics and prior knowledge of the problem space.
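As an illustration of one of the "more advanced ways" mentioned above, here is a sketch of Glorot/Xavier uniform initialization for a fully connected layer (one common scheme, not necessarily the one used in the linked paper):

    #include <cmath>
    #include <cstddef>
    #include <random>
    #include <vector>

    // Glorot/Xavier uniform initialization: draw each weight from
    // U(-limit, limit) with limit = sqrt(6 / (fanIn + fanOut)), which keeps
    // activation variance roughly constant across layers.
    std::vector<float> initWeights(int fanIn, int fanOut) {
        float limit = std::sqrt(6.0f / (fanIn + fanOut));
        std::mt19937 rng(std::random_device{}());
        std::uniform_real_distribution<float> dist(-limit, limit);

        std::vector<float> w(static_cast<std::size_t>(fanIn) * fanOut);
        for (float& x : w) x = dist(rng);  // independent random draws
        return w;
    }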
I have wondered the same thing and put in a vector like (1, 0, 0, 0, 0, 0, ...) to see what terms it was nearest to. The answer is that the results returned didn't seem to cluster around any particular meaning; they were just kind of random. This was using Mikolov's 300-dimensional vectors trained on Google News.
Look up NNSE semantic vectors for a vector space where the individual dimensions do seem to carry specific human-graspable meanings.
I am just starting to grasp the idea of backpropagation and MLP networks. What I am confused about is how the input vectors are "clamped" in the input layer.
For example, let's take a mock Iris dataset:
[5.0,4.4,2.7,1.5,0],
[3.0,3.6,1.8,1.7,1],
[2.0,1.2,3.3,4.2,2]
Are these inputs fed in all together into the input layer, or are they fed in one by one?
What I mean is: on the first iteration, is the first input vector fed in, like
[5.0,4.4,2.7,1.5,0]
and then the error is calculated, and then the next input vector is sent, i.e.
[3.0,3.6,1.8,1.7,1]
Or are they all sent in together, as
[[A vector of all petal lengths],[A vector of all sepal lengths],etc]
I know different frameworks handle this differently, but feel free to comment on how any popular deep learning framework would do it. I use DeepLearning4J myself.
Thanks.
Input vectors are usually fed into neural networks in batches. How many vectors those batches contain is dependent on the batch size. E.g. a batch size of 128 would mean that you feed 128 input vectors into the network (or less if there aren't that many left) and then update the weights/parameters. The iris tutorial of Deeplearning4J seems to use a batch size of 150: int batchSize = 150; and later DataSetIterator iter = new IrisDataSetIterator(batchSize, numSamples);.
Note that there's also a batch mode for updating the weights of neural networks, which, confusingly, updates the weights only after all input vectors have been fed into the network. The batch mode, however, is basically never used in practice.
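So the answer is: neither one by one nor all at once, but batch by batch. A rough sketch of one training epoch, with trainOnBatch() standing in for the framework's forward/backward pass and weight update (a hypothetical helper, not a real DeepLearning4J call):

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // One epoch of mini-batch training: slice the samples into batches of
    // batchSize rows and perform one weight update per batch.
    void trainOneEpoch(const std::vector<std::vector<float>>& samples,
                       std::size_t batchSize) {
        for (std::size_t start = 0; start < samples.size(); start += batchSize) {
            // The last batch may be smaller if the sample count isn't a
            // multiple of batchSize.
            std::size_t end = std::min(start + batchSize, samples.size());
            // trainOnBatch(samples, start, end);  // hypothetical helper
            (void)end;  // silence the unused-variable warning in this sketch
        }
    }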
I am building a software to classify cells from images taken by a microscope.
I have a dataset of images of cells to use as a training dataset. I have extracted keypoints from each image using ORB. Here is my problem: some images produce a lot of keypoints and some only a small number, so the descriptor matrices produced are of different lengths. When I try to build a training matrix from them, I have to 'normalize' the number of keypoints chosen from each image so that the lengths of the descriptor vectors will be the same.
How many keypoints should I pick, and which ones? How do I pick the 'best' keypoints? (This question also arises when I want to perform a prediction on an object I want to classify.) Are there known approaches to this problem?
Regards.
You could use a bag-of-words approach to classify your images. You first have to collect all keypoint descriptors and cluster them into a certain number of groups. Each group (cluster) is your word, and each descriptor corresponds to a word. For each image, you can then build a histogram by counting the occurrences of the words. You can then normalize the histogram to remove the effect of the varying number of keypoints in different images.
Using spatial pyramid matching could be another solution.
The simplest approach, as described by Ajay, is to cluster keypoints into N clusters and then define N binary features, such that for a given sample, feature i equals 1 if the sample shows a keypoint in cluster i, and 0 otherwise.
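A minimal sketch of this binary encoding, assuming wordIds holds the cluster index assigned to each keypoint descriptor of one image:

    #include <vector>

    // Binary feature vector: feature i is 1 if the image has at least one
    // keypoint whose descriptor falls in cluster i, and 0 otherwise.
    std::vector<int> binaryFeatures(const std::vector<int>& wordIds, int N) {
        std::vector<int> features(N, 0);
        for (int id : wordIds) features[id] = 1;
        return features;
    }

Every image then yields a length-N vector regardless of how many keypoints it produced, which solves the variable-length problem.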
Another approach is to use a kernel classifier, like Support Vector Machines (SVM), and use a kernel that accepts variable-length vectors (e.g. Fisher kernel).
I have some doubts about bag-of-words based image classification. I will first tell you what I have done:
I have extracted the features from the training images of two different categories using the SURF method;
I have then clustered the features for the two categories.
In order to classify my test image (i.e. to decide which of the two categories it belongs to), I am using an SVM classifier. But here is my doubt: how do we input the test image? Do we have to do the same steps 1 and 2 again and then use the result as the test set, or is there some other method?
It would also be great to know how effective the BoW approach is.
Could someone kindly provide a clarification?
The classifier needs the representation of the test data to have the same meaning as the training data. So, when you're evaluating a test image, you extract the features and then build the histogram of which words from your original vocabulary they're closest to.
That is:
1) Extract features from your entire training set.
2) Cluster those features into a vocabulary V; you get K distinct cluster centers.
3) Encode each training image as a histogram of the number of times each vocabulary element shows up in the image. Each image is then represented by a length-K vector.
4) Train the classifier.
5) When given a test image, extract the features. Now represent the test image as a histogram of the number of times each cluster center from V was closest to a feature in the test image. This is again a length-K vector.
It's also often helpful to discount the histograms by taking the square root of the entries. This approximates a more realistic model for image features.
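To make the steps above concrete, here is a minimal sketch using OpenCV 3.x with the xfeatures2d contrib module (where SURF lives). It covers steps 1-3 for building the vocabulary and step 5 for encoding a test image; training the classifier (step 4) is left out:

    #include <vector>
    #include <opencv2/core.hpp>
    #include <opencv2/features2d.hpp>
    #include <opencv2/xfeatures2d.hpp>  // SURF lives in the contrib modules

    // Build a K-word vocabulary from the training images, then encode a test
    // image as a normalized K-bin histogram of visual words.
    cv::Mat bowHistogram(const std::vector<cv::Mat>& trainImages,
                         const cv::Mat& testImage, int K) {
        cv::Ptr<cv::Feature2D> surf = cv::xfeatures2d::SURF::create();
        cv::BOWKMeansTrainer bowTrainer(K);  // K = vocabulary size

        // Steps 1-2: pool descriptors from the whole training set and cluster.
        for (const cv::Mat& img : trainImages) {
            std::vector<cv::KeyPoint> keypoints;
            cv::Mat descriptors;
            surf->detectAndCompute(img, cv::noArray(), keypoints, descriptors);
            if (!descriptors.empty()) bowTrainer.add(descriptors);
        }
        cv::Mat vocabulary = bowTrainer.cluster();  // K x 64 cluster centers

        // Steps 3/5: encode an image as a K-bin word histogram (OpenCV
        // normalizes by the number of descriptors, as suggested above).
        cv::BOWImgDescriptorExtractor bowExtractor(
            surf, cv::DescriptorMatcher::create("FlannBased"));
        bowExtractor.setVocabulary(vocabulary);

        std::vector<cv::KeyPoint> keypoints;
        surf->detect(testImage, keypoints);
        cv::Mat hist;  // 1 x K row vector, ready to feed an SVM
        bowExtractor.compute(testImage, keypoints, hist);
        return hist;
    }

The same compute() call is used to encode the training images before training the SVM, so the train and test representations have the same meaning.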