Neural Networks - How are the IRIS input vectors processed? - machine-learning

I am just starting to grasp the idea of backpropagation and MLP networks. What I'm confused about is how the input vectors are "clamped" to the input layer.
For example, let's take a mock Iris dataset:
[5.0,4.4,2.7,1.5,0],
[3.0,3.6,1.8,1.7,1],
[2.0,1.2,3.3,4.2,2]
Are these inputs all fed into the input layer together, or are they fed in one by one?
What I mean is: on the first iteration, is the first input vector fed in like this:
[5.0,4.4,2.7,1.5,0]
and then the error is calculated, and then the next input vector is sent, i.e.
[3.0,3.6,1.8,1.7,1]
OR are they all sent in together, as:
[[A vector of all petal lengths],[A vector of all sepal lengths],etc]
I know different frameworks handle this differently, but feel free to comment on how any popular deep learning framework would do it. I use Deeplearning4J myself.
Thanks.

Input vectors are usually fed into neural networks in batches; how many vectors a batch contains is determined by the batch size. E.g. a batch size of 128 means that you feed 128 input vectors into the network (or fewer, if there aren't that many left) and then update the weights/parameters. The Iris tutorial of Deeplearning4J seems to use a batch size of 150: int batchSize = 150; and later DataSetIterator iter = new IrisDataSetIterator(batchSize, numSamples);.
Note that there's also a (full) batch mode for updating the weights of neural networks, which - confusingly - updates the weights only after all input vectors have been fed into the network. This batch mode, however, is rarely used in practice.
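To make the mini-batch idea concrete, here is a minimal numpy sketch (not DL4J; the batch size and the slicing are the point, and the data is the mock set from the question):

```python
import numpy as np

# Mock Iris-style rows from the question: four features plus a class label.
data = np.array([[5.0, 4.4, 2.7, 1.5, 0],
                 [3.0, 3.6, 1.8, 1.7, 1],
                 [2.0, 1.2, 3.3, 4.2, 2]])
batch_size = 2

for start in range(0, len(data), batch_size):
    batch = data[start:start + batch_size]  # the last batch may be smaller
    X, y = batch[:, :4], batch[:, 4]        # features and labels
    # forward pass on X, compute the loss against y, then update the weights
    print(X.shape, y.shape)                 # (2, 4) (2,) then (1, 4) (1,)
```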

Related

How do Convolutional neural networks proceed after the pooling step?

I am trying to learn about convolutional neural networks, but I am having trouble understanding what happens in the network after the pooling step.
Starting from the left, we have our 28x28 matrix representing our picture. We apply three 5x5 filters to it to get three 24x24 feature maps. We then apply 2x2 max pooling to each feature map to get three 12x12 pooled layers. I understand everything up to this step.
But what happens now? The document I am reading says:
"The final layer of connections in the network is a fully-connected
layer. That is, this layer connects every neuron from the max-pooled
layer to every one of the 10 output neurons. "
The text did not go further into describing what happens beyond that and it left me with a few questions.
How are the three pooled layers mapped to the 10 output neurons? By fully connected, does it mean that each neuron in each of the three 12x12 pooled layers has a weight connecting it to the output layer, so there are 3x12x12x10 weights linking the pooled layer to the output layer? Is an activation function still applied at the output neurons?
Pictures and extract taken from this online resource: http://neuralnetworksanddeeplearning.com/chap6.html
Essentially, the fully connected layer provides the main way for the neural network to make a prediction. If you have ten classes, then the fully connected layer consists of ten neurons, each giving the probability that the classified sample belongs to the class that neuron represents. These values are determined by the convolutional and hidden layers that precede it. The pooling layer's output feeds into these ten neurons, providing the final interface for your network to make the prediction. Here's an example. After pooling, your fully connected layer could output this:
(0.1, 0.01, 0.2, 0.9, 0.2, 0.1, 0.1, 0.1, 0.1, 0.1)
where each neuron contains the probability that the sample belongs to the corresponding class. In this case, if you are classifying images of handwritten digits, the prediction would be the fourth class, since that neuron has the highest value (0.9). Hope that helps!
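As a tiny numpy sketch of that decision rule, using the example values above: the predicted class is just the index of the largest output.

```python
import numpy as np

outputs = np.array([0.1, 0.01, 0.2, 0.9, 0.2, 0.1, 0.1, 0.1, 0.1, 0.1])
print(outputs.argmax())  # 3, i.e. the fourth neuron wins with 0.9
```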
Yes, you're on the right track. There is a layer with a weight matrix of 4320 entries.
This matrix will typically be arranged as 432x10. This is because these 432 numbers are a fixed-size representation of the input image. At this point, you don't care about how you got it -- CNN, plain feed-forward or a crazy RNN going pixel-by-pixel -- you just want to turn the description into a classification. In most toolkits (e.g. TensorFlow, PyTorch or even plain numpy), you'll need to explicitly reshape the 3x12x12 output of the pooling into a 432-long vector. But that's just a rearrangement; the individual elements do not change.
Additionally, there will usually be a 10-long vector of biases, one for every output element.
Finally, about the nonlinearity: since this is a classification task, you typically want the 10 output units to represent posterior probabilities that the input belongs to a particular class (digit). For this purpose, the softmax function is used: y = exp(o) / sum(exp(o)), where exp(o) stands for element-wise exponentiation. It guarantees that its output will be a proper categorical distribution: all elements in (0, 1) and summing up to 1. There is a nice, detailed discussion of softmax in neural networks in the Deep Learning book (I recommend reading Section 6.2.1 in addition to the softmax sub-subsection itself).
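Here is a minimal numpy sketch of this final block, using the shapes discussed above; the weights are random stand-ins, not a trained network:

```python
import numpy as np

pooled = np.random.rand(3, 12, 12)   # output of the max-pooling stage
W = np.random.randn(432, 10) * 0.01  # 3*12*12 = 432 inputs, 10 classes: 4320 weights
b = np.zeros(10)                     # the 10-long bias vector

x = pooled.reshape(-1)               # flatten: just a rearrangement of elements
o = x @ W + b                        # fully connected layer
y = np.exp(o - o.max())              # subtract the max for numerical stability
y = y / y.sum()                      # softmax: y = exp(o) / sum(exp(o))
print(y.sum(), y.argmax())           # 1.0 and the predicted class
```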
Also note that this is not specific to convolutional networks at all; you'll find this fully-connected-layer-plus-softmax block at the end of virtually every classification network. You can also view this block as the actual classifier, while anything in front of it (a shallow CNN in your case) is just trying to prepare nice features.

Multi output neural network

I am trying to create a neural network that outputs more than a binary value.
The problem is the following:
I have recently stumbled upon this problem on Kaggle: https://www.kaggle.com/c/poker-rule-induction
Basically, the task is to predict the class (0-9) of each poker hand in the test set. I have already managed to solve this problem using a RandomForest library.
My question is how can I solve this problem using a neural network?
I have already tried to follow some tutorials where you have 2 binary inputs and 1 binary output.
The dataset looks like the following:
If I understand you correctly, you are asking how to structure your neural network's output neurons for multi-class classification. Instead of having a single output neuron that emits a non-binary value (0-9), which doesn't really work well for many reasons, you can design the outputs to produce a binary (one-hot) vector.
Where...
1 = [0,1,0,0,0,0,0,0,0,0]
2 = [0,0,1,0,0,0,0,0,0,0]
3 = [0,0,0,1,0,0,0,0,0,0]
...etc
So each item in the vector corresponds to one of the 10 output neurons, and if that item is a 1, its position indicates the classification group. The best example of this is the MNIST digit networks, which also usually have 10 binary output neurons.
Bear in mind the actual outputs will be decimals representing a probability/guess, each close to either 0 or 1.
This also means your target value has to be a one-hot vector as well, so that backpropagation gets an error signal for each item, i.e. for each output neuron.
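A minimal numpy sketch of this target encoding (the label values are made up for illustration):

```python
import numpy as np

labels = np.array([1, 2, 3, 0, 9])  # example hand classes, 0-9
targets = np.eye(10)[labels]        # one-hot matrix, one row per sample
print(targets[0])                   # [0. 1. 0. 0. 0. 0. 0. 0. 0. 0.] -> class 1
```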

Using Bag of words/features and neural network

I'm trying to implement an object detection module which contains the following steps:
1) extract image descriptors with SURF, creating a matrix of size [x, 64], where x depends on the number of keypoints found in the image;
2) fix the descriptor size to a [k, 64] format using the bag of features/words approach, where k is the number of clusters created using k-means;
3) feed a neural network using the resulting bag of words matrix as trainingSamples.
So far I've implemented steps 1 and 2, but I'm not quite sure how to format the output vector for the NN. In OpenCV's CvANN_MLP, the output matrix must have the same number of rows as the input matrix (otherwise it throws a what() exception), but the number of input rows is the number of k clusters from step 2, so I don't understand how to write the output matrix based on that.
I know the output matrix should have n columns corresponding to the number of classes I want in the output (e.g. 3 classes: cat, dog and bird will result in a matrix with 3 columns), but how do I organize the rows of this matrix based on the input rows? I read this related post; it uses MATLAB and says that each feature should be a row, but I'm not sure how to do this in OpenCV C++.
If anyone has any ideas/tips on how to proceed with this, it would be very much appreciated.
Have you done this:
However, before you train your neural network, as you suspected, you must represent every image you wish to train with this feature vector.
Before feeding your neural network? I lack experience with neural networks, but after reading this and your question, it seems that you are trying to feed the bag-of-words clusters themselves to your neural network, which is incorrect: each image should first be summarised as a single fixed-length feature vector, giving one input row per image.
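For illustration, here is a minimal numpy sketch of that representation (the cluster centres and descriptors are random stand-ins for the k-means and SURF output): each image's [x, 64] descriptor matrix is quantised against the k centres and reduced to one k-long histogram, so the NN's input matrix gets one row per image, and the output matrix one (one-hot) row per image as well.

```python
import numpy as np

k = 50
centres = np.random.rand(k, 64)               # stand-in for the k-means centres

def bow_histogram(descriptors):
    # nearest cluster centre for each descriptor
    dists = ((descriptors[:, None, :] - centres[None, :, :]) ** 2).sum(-1)
    words = dists.argmin(axis=1)
    hist = np.bincount(words, minlength=k).astype(float)
    return hist / hist.sum()                  # normalised visual-word counts

image_descriptors = np.random.rand(120, 64)   # stand-in for one image's SURF output
row = bow_histogram(image_descriptors)        # one training row, shape (k,)
```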

Where do dimensions in Word2Vec come from?

I am using the word2vec model to train a neural network and build a neural embedding for finding similar words in the vector space. But my question is about the dimensions of the word and context embeddings (matrices), which we initialise with random numbers (vectors) at the beginning of training, like this: https://iksinc.wordpress.com/2015/04/13/words-as-vectors/
Let's say we want to display the words {book, paper, notebook, novel} on a graph. First of all, we should build a matrix of dimensions 4x2 or 4x3 or 4x4 etc. I know the first dimension of the matrix is the size of our vocabulary |v|. But what about the second dimension (the number of the vector's dimensions)? For example, this is a vector for the word "book": [0.3,0.01,0.04]. What are these numbers? Do they have any meaning? For example, is 0.3 related to the relation between the words "book" and "paper" in the vocabulary, 0.01 to the relation between "book" and "notebook", etc.?
Just like in TF-IDF or co-occurrence matrices, where each dimension (column) Y has a meaning - it's a word or document related to the word in row X.
The word2vec model uses a network architecture to represent the input word(s) and most likely associated output word(s).
Assuming there is one hidden layer (as in the example linked in the question), the two matrices introduced represent the weights that allow the network to compute its internal representation of the function mapping the input vector (e.g. "cat" in the linked example) to the output vector (e.g. "climbed").
The weights of the network are a sub-symbolic representation of the mapping between the input and the output – any single weight doesn’t necessarily represent anything meaningful on its own. It’s the connection weights between all units (i.e. the interactions of all the weights) in the network that gives rise to the network’s representation of the function mapping. This is why neural networks are often referred to as “black box” models – it can be very difficult to interpret why they make particular decisions and how they learn. As such, it's very difficult to say what the vector [0.3,0.01,0.04] represents exactly.
Network weights are traditionally initialised to random values for two main reasons:
It prevents a bias being introduced to the model before training begins
It allows the network to start from different points in the search space after initialisation (helping reduce the impact of local minima)
A network’s ability to learn can be very sensitive to the way its weights are initialised. There are more advanced ways of initialising weights today e.g. this paper (see section: Weights initialization scaling coefficient).
The way in which weights are initialised and the dimension of the hidden layer are often referred to as hyper-parameters and are typically chosen according to heuristics and prior knowledge of the problem space.
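A minimal numpy sketch of such a random initialisation (the uniform scaling used here is just illustrative, not the scheme from the cited paper):

```python
import numpy as np

vocab = ["book", "paper", "notebook", "novel"]
d = 3                                # embedding / hidden-layer size, a hyper-parameter
rng = np.random.default_rng(0)
W_in = rng.uniform(-0.5 / d, 0.5 / d, size=(len(vocab), d))  # the |v| x d matrix
print(W_in[0])  # the starting vector for "book"; as discussed above, its
                # individual numbers carry no standalone meaning
```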
I have wondered the same thing and put in a vector like (1 0 0 0 0 0...) to see what terms it was nearest to. The answer is that the results returned didn't seem to cluster around any particular meaning, but were just kind of random. This was using Mikolov's 300-dimensional vectors trained on Google News.
Look up NNSE semantic vectors for a vector space where the individual dimensions do seem to carry specific human-graspable meanings.

How to use stacked autoencoders for pretraining

Let's say I wish to use stacked autoencoders as a pretraining step.
Let's say my full autoencoder is 40-30-10-30-40.
My steps are:
Train a 40-30-40 using the original 40 features data set in both input and output layers.
Using the trained encoder part only of the above i.e. 40-30 encoder, derive a new 30 feature representation of the original 40 features.
Train a 30-10-30 using the new 30 features data set (derived in step 2) in both input and output layers.
Take the trained encoder from step 1 (40-30) and feed it into the encoder from step 3 (30-10), giving a 40-30-10 encoder.
Take the 40-30-10 encoder from step 4 and use it as the input to the NN.
a) Is that correct?
b) Do I freeze the weights in the 40-30-10 encoder when training the NN? That would be the same as pregenerating the 10-feature representation from the original 40-feature data set and training on that new 10-feature data set.
PS. I already have a question open asking whether I need to tie the weights of the encoder and decoder
a) Is that correct?
This is one of the typical approaches. You could also try to fit the autoencoder directly, as a "raw" autoencoder with that many layers should be possible to fit right away. As an alternative, you might consider fitting stacked denoising autoencoders instead, which might benefit more from "stacked" training.
b) Do I freeze the weights in the 40-30-10 encoder when training the NN? That would be the same as pregenerating the 10-feature representation from the original 40-feature data set and training on that new 10-feature data set.
When you train the whole NN, you do not freeze anything. Pretraining is only a kind of preconditioning for the optimization process - you show your method where to start, but you do not want to limit the fitting procedure of the actual supervised learning.
PS. I already have a question open asking whether I need to tie the weights of the encoder and decoder
No, you do not have to tie the weights, especially since you actually throw away your decoder anyway. Tying the weights is important for some more probabilistic models in order to make the minimization procedure possible (as in the case of RBMs), but for autoencoders there is no point.
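To tie the question's steps and this answer together, here is a minimal numpy sketch of the greedy layer-wise procedure (sigmoid encoders, linear decoders, plain gradient descent on squared error; all sizes, data and hyper-parameters are illustrative):

```python
import numpy as np

def train_autoencoder(X, hidden, epochs=200, lr=0.1, seed=0):
    """Train a d-hidden-d autoencoder on squared error; return the encoder."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W1 = rng.normal(0, 0.1, (d, hidden)); b1 = np.zeros(hidden)
    W2 = rng.normal(0, 0.1, (hidden, d)); b2 = np.zeros(d)
    sig = lambda z: 1.0 / (1.0 + np.exp(-z))
    for _ in range(epochs):
        H = sig(X @ W1 + b1)              # encode
        X_hat = H @ W2 + b2               # decode (linear output)
        err = (X_hat - X) / n             # gradient of mean squared error
        gW2, gb2 = H.T @ err, err.sum(0)
        dH = (err @ W2.T) * H * (1 - H)   # backprop through the sigmoid
        gW1, gb1 = X.T @ dH, dH.sum(0)
        W1 -= lr * gW1; b1 -= lr * gb1
        W2 -= lr * gW2; b2 -= lr * gb2
    return W1, b1

sig = lambda z: 1.0 / (1.0 + np.exp(-z))
X = np.random.rand(500, 40)            # toy data with 40 features
W1, b1 = train_autoencoder(X, 30)      # step 1: 40-30-40
H1 = sig(X @ W1 + b1)                  # step 2: the 30-feature representation
W2, b2 = train_autoencoder(H1, 10)     # step 3: 30-10-30
# Steps 4-5: (W1, b1) followed by (W2, b2) is the 40-30-10 encoder; use it to
# initialise the first two layers of the supervised NN and fine-tune everything
# end to end - no freezing, as recommended above.
```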
