How to find direction of the negative gradient vector? - machine-learning

I'm a ninth grader working on a neural network that takes 28x28 pixel grids (from MNIST) and guesses a digit.
There's something about backpropagation I don't understand. You compute the partial derivatives of the cost function with respect to the weights and the biases. Then you put them in a direction vector, which holds the nudges you need to add to the weights, and average those over 10000 pictures. And here's the problem: how do you compute those little nudges from the partial derivatives of the cost function with respect to the weights and the biases to find the minimum of the cost?
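As a rough illustration of the step being asked about, here is a minimal sketch assuming plain gradient descent: each nudge is the averaged partial derivative multiplied by a negative step size. The toy cost, the data, and the names learning_rate and average_gradient are hypothetical stand-ins, not part of the original network.

```python
import numpy as np

# Hypothetical toy problem standing in for the real network:
# cost(w) = mean over samples of (w . x - y)^2
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))          # 100 samples, 3 inputs each
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w                         # targets generated from known weights

def cost(w):
    return np.mean((X @ w - y) ** 2)

def average_gradient(w):
    # Per-sample partial derivatives are 2 * (w . x - y) * x;
    # averaging over all samples gives the gradient of the mean cost.
    residual = X @ w - y
    return 2.0 * (X * residual[:, None]).mean(axis=0)

w = np.zeros(3)
learning_rate = 0.1                    # assumed step size
initial_cost = cost(w)
for _ in range(200):
    # The nudge points along the negative gradient (downhill).
    nudge = -learning_rate * average_gradient(w)
    w = w + nudge
final_cost = cost(w)
```

The same rule applies to biases: each parameter moves by minus the step size times its own averaged partial derivative.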

Related

What is compared when a CNN learns a set of features during backpropagation?

I am relatively new to the subject and have been doing loads of reading. What I am particularly confused about is how a CNN learns its filters for a particular labeled feature in a training data set.
Is the cost calculated by which outputs should or shouldn't be active on a pixel-by-pixel basis? And if that is the case, how does mapping the activations to the labeled data work after downsampling?
I apologize for any poor assumptions or general misunderstandings. Again, I am new to this field and would appreciate all feedback.
I'll break this up into a few small pieces.
Cost calculation -- cost / error / loss depends only on comparing the final prediction (the last layer's output) to the label (ground truth). This serves as a metric of how right or wrong the prediction is.
Inter-layer structure -- Each input to the prediction is an output of the prior layer. This output has a value; the link between the two has a weight.
Back-prop -- Each weight gets adjusted in proportion to how much it contributed to the error. A connection that contributed to a correct prediction gets rewarded: its weight is increased in magnitude. Conversely, a connection that pushed for a wrong prediction gets its weight reduced.
Pixel-level control -- To clarify the terminology ... traditionally, each kernel is a square matrix of float values, each of which is called a "pixel". The pixels are trained individually. However, that training comes from sliding a smaller filter (also square) across the kernel, performing a dot-product of the window with the corresponding square sub-matrix of the kernel. The output of that dot-product is the value of a single pixel in the next layer.
When the strength of a pixel in layer N is increased, this effectively increases the influence of the filter in layer N-1 providing that input. That filter's pixels are, in turn, tuned by the inputs from layer N-2.
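The sliding dot-product described above can be sketched as follows. This is a minimal illustration with no padding and a stride of 1; the function name slide_filter and the example values are hypothetical.

```python
import numpy as np

def slide_filter(matrix, filt):
    """Slide a small filter over a larger matrix, taking a dot-product at each position."""
    fh, fw = filt.shape
    out_h = matrix.shape[0] - fh + 1
    out_w = matrix.shape[1] - fw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Window of the larger matrix currently covered by the filter.
            window = matrix[i:i + fh, j:j + fw]
            # Element-wise product summed up: one pixel of the next layer.
            out[i, j] = np.sum(window * filt)
    return out

matrix = np.arange(16, dtype=float).reshape(4, 4)   # example 4x4 input
filt = np.ones((2, 2))                              # example 2x2 filter
result = slide_filter(matrix, filt)
```

During training, the float values in filt are the quantities that back-prop adjusts.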

How is cross entropy calculated for pixel level prediction

I'm running an FCN in Keras that uses binary cross-entropy as the loss function. However, I'm not sure how the losses are accumulated.
I know that the loss gets applied at the pixel level, but then are the losses for each pixel in the image summed up to form a single loss per image? Or instead of being summed up, is it being averaged?
And furthermore, is the loss of each image simply summed (or is it some other operation) over the batch?
I assume that your question is a general one and not specific to a particular model (if not, can you share your model?).
You are right that if the cross-entropy is used at a pixel level, the results have to be reduced (summed or averaged) over all pixels to get a single value.
Here is an example of a convolutional autoencoder in TensorFlow where this step is explicit:
https://github.com/udacity/deep-learning/blob/master/autoencoder/Convolutional_Autoencoder_Solution.ipynb
The relevant lines are:
loss = tf.nn.sigmoid_cross_entropy_with_logits(labels=targets_, logits=logits)
cost = tf.reduce_mean(loss)
Whether you take the mean or the sum of the cost function does not change the value of the minimizer. But if you take the mean, the value of the cost function is more easily comparable between experiments when you change the batch size or image size.
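To make the mean-versus-sum point concrete, here is a small NumPy sketch (not the Keras or TensorFlow internals) that computes a per-pixel sigmoid cross-entropy and reduces it both ways. The batch shape and random data are assumptions for illustration only.

```python
import numpy as np

def pixel_bce_from_logits(logits, targets):
    # Numerically stable per-pixel sigmoid cross-entropy:
    # max(x, 0) - x * z + log(1 + exp(-|x|))
    return (np.maximum(logits, 0) - logits * targets
            + np.log1p(np.exp(-np.abs(logits))))

rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 28, 28))    # hypothetical batch of 4 images
targets = rng.integers(0, 2, size=(4, 28, 28)).astype(float)

per_pixel = pixel_bce_from_logits(logits, targets)  # one loss value per pixel
cost_mean = per_pixel.mean()   # average over all pixels and all images
cost_sum = per_pixel.sum()     # same quantity scaled by the pixel count
```

Because cost_sum is just cost_mean times a constant (the number of pixels in the batch), both have the same minimizer; only the effective step size of the gradient changes.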

I couldn’t understand 3 things in normalized inputs and initial weights video?

This video https://www.udacity.com/course/viewer#!/c-ud730/l-6370362152/m-7119160655 talks about zero mean and equal variance for our cross-entropy function. I cannot understand where the zero mean and variance come in. Could someone give me an example to explain it? It also talks about initializing weights using a normal distribution; could someone explain how? And in the end it talks about taking derivatives with respect to the weights and biases, then subtracting them from the values of the weights and biases, and repeating this in a loop. Could you explain this to me? I'm very confused!!
These are basic statistics questions; you can find many resources on them.
A quick summary:
zero mean: Calculate the mean of the data points by
Sum(data points) / Count(data points)
Equal variance: Calculate the variance of the two different datasets then apply normalization to their data points:
Each Data Point <- (Data Point Value - Mean(data points for that dataset)) / Standard Deviation(data points for that dataset)
The standard deviation / variance for the two datasets will be different. By dividing each data point in the respective datasets by their corresponding standard deviation you get more easily comparable results.
E.g. if the variance of dataset A is 25 and the variance of dataset B is 100, each data point in A is divided by 5 and each point in B is divided by 10. That allows the cross-entropy calculation to compare similar-amplitude values.
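The normalization above can be sketched in NumPy as follows. The two datasets are randomly generated stand-ins with standard deviations of roughly 5 and 10, matching the example; the function name normalize is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical datasets: A has variance near 25, B has variance near 100.
dataset_a = rng.normal(loc=3.0, scale=5.0, size=1000)
dataset_b = rng.normal(loc=-7.0, scale=10.0, size=1000)

def normalize(data):
    # Subtract the dataset's own mean, then divide by its own standard deviation.
    return (data - data.mean()) / data.std()

norm_a = normalize(dataset_a)
norm_b = normalize(dataset_b)
```

After this step both datasets have mean 0 and standard deviation 1, so their values sit on the same scale.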

How to calculate second order derivative at output layer in neural networks?

I am trying to implement the stochastic diagonal Levenberg-Marquardt method for Convolutional Neural Network in order to back propagate for learning weights.
I am new to it and quite confused, so I have a few questions; I hope you can help me.
1) How can I calculate the second order derivative at the output layer from the two outputs?
For the first order derivative I have to subtract the output from the desired output and multiply it by the derivative of the output.
But how can I do that for the second derivative?
2) In the MaxPooling layer of a convolutional neural network, I select the max value in a 2x2 window and multiply it by the weight. Do I then have to pass it through the activation function or not?
Can someone explain how to do it in OpenCV, or give a mathematical explanation or any reference that shows the mathematics?
Thanks in advance.
If you have already calculated the Jacobian matrix (the matrix of first order partial derivatives), then you can obtain an approximation of the Hessian (the matrix of second order partial derivatives) by computing J^T*J (valid if the residuals are small).
You can calculate second derivative from two outputs: y and f(X) and Jacobian this way:
In other words Hessian approximation B is chosen to satisfy:
In this paper you can find more about it.
Ananth Ranganathan: The Levenberg-Marquardt Algorithm
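As a hedged illustration of the J^T*J approximation mentioned above, here is a small NumPy sketch. It assumes a least-squares cost 0.5 * sum((f(X) - y)^2) and a hypothetical two-parameter model a * exp(b * x); neither comes from the original question, and a real network would need the Jacobian of its own outputs.

```python
import numpy as np

x = np.linspace(0.0, 1.0, 20)
a_true, b_true = 2.0, -1.5               # hypothetical true parameters
y = a_true * np.exp(b_true * x)          # desired outputs

def model(params):
    a, b = params
    return a * np.exp(b * x)

def jacobian(params):
    # One row per output; columns are d f / d a and d f / d b.
    a, b = params
    e = np.exp(b * x)
    return np.column_stack([e, a * x * e])

def gradient(params):
    # First derivative of 0.5 * sum(residual^2): J^T times the residual.
    residual = model(params) - y
    return jacobian(params).T @ residual

params = np.array([a_true, b_true])
J = jacobian(params)
hessian_approx = J.T @ J                 # Gauss-Newton approximation of the Hessian
```

The approximation drops the term that multiplies the residuals by the second derivatives of the model, which is why it is only reliable when the residuals are small. Keeping only the diagonal of hessian_approx gives a per-parameter curvature estimate.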

How to visualize SVM weights in HOG

In the original HOG (Histogram of Oriented Gradients) paper http://lear.inrialpes.fr/people/triggs/pubs/Dalal-cvpr05.pdf there are some images that show the HOG representation of an image (Figure 6). In this figure, the captions of parts f and g say "HOG descriptor weighted by respectively the positive and the negative SVM weights".
I don't understand what this means. I understand that when I train an SVM I get a weight vector, and that to classify I have to use the features (HOG descriptors) as the input of the function. So what do they mean by positive and negative weights? And how would I plot them like the paper does?
Thanks in Advance.
The weights tell you how significant a specific element of the feature vector is for a given class. That means that if you see a high value in your feature vector, you can look up the corresponding weight:
If the weight is a high positive number, it's more likely that your object is of the class
If the weight is a high negative number, it's more likely that your object is NOT of the class
If the weight is close to zero, this position is mostly irrelevant for the classification
Now you're using those weights to scale your feature vector, where the length of the gradients is mapped to the color intensity. Because you can't display negative color intensities, they decided to split the visualization into a positive and a negative part. In the visualizations you can now see which parts of the input image contribute to the class (positive) and which don't (negative).
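The split into positive and negative weights can be sketched as below. This is a minimal NumPy illustration, assuming a linear SVM whose weight vector has the same length as the HOG descriptor; the random arrays are placeholders for a real descriptor and real trained weights, and it is not necessarily the exact procedure used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
hog_descriptor = rng.random(36)      # placeholder: non-negative HOG feature vector
svm_weights = rng.normal(size=36)    # placeholder: trained linear SVM weights

# Keep only the positive part, and the magnitude of the negative part.
positive_weights = np.maximum(svm_weights, 0.0)
negative_weights = np.maximum(-svm_weights, 0.0)

# Scale the descriptor element-wise; both results are non-negative,
# so each can be drawn as an intensity image on its own.
weighted_positive = hog_descriptor * positive_weights
weighted_negative = hog_descriptor * negative_weights
```

Each weighted vector can then be drawn with whatever HOG rendering routine you already use for the plain descriptor.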
