I'm studying neural networks and I have some questions about the theory of gradient descent.
Why is the fluctuation of the loss in batch gradient descent smaller than in SGD?
Why does SGD escape local minima better than batch gradient descent?
Batch gradient descent surveys all the data before each update. What does that mean?
Everything comes down to the trade-off between exploitation and exploration.
Full gradient descent uses all the data for every weight update, which gives a more reliable update. In neural networks, (mini-)batch gradient descent is used instead, because the full version is not practical on large datasets. Stochastic gradient descent uses only a single example per update, and that adds noise. With batch GD and full GD you exploit more of the data.
With SGD you can escape local minima: using a single example favours exploration, and the noise lets you end up in solutions that batch GD could not reach. With SGD you explore more.
(Mini-)batch gradient descent takes a dataset and breaks it into N chunks where each chunk has B samples (B is the batch_size). That still forces you to go through all the data in the dataset.
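To make the comparison concrete, here is a rough NumPy sketch (made-up data and a simple linear model, not any particular framework's implementation): the only difference between the variants is how many samples feed each update, which is why the batch updates are smooth and the stochastic ones are noisy.

import numpy as np

def mse_gradient(w, X, y):
    # gradient of the mean squared error for a linear model y_hat = X @ w
    return 2.0 * X.T @ (X @ w - y) / len(y)

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
true_w = np.array([1.0, -2.0, 0.5, 3.0, -1.0])
y = X @ true_w + rng.normal(scale=0.1, size=1000)
lr = 0.01

# Batch (full) gradient descent: one smooth update per epoch, computed on all 1000 samples
w_batch = np.zeros(5)
for epoch in range(200):
    w_batch -= lr * mse_gradient(w_batch, X, y)

# Stochastic gradient descent: one noisy update per single sample
w_sgd = np.zeros(5)
for epoch in range(10):
    for i in rng.permutation(len(y)):
        w_sgd -= lr * mse_gradient(w_sgd, X[i:i+1], y[i:i+1])

Both end up close to true_w here, but if you record the loss after every update, the batch curve decreases smoothly while the SGD curve jumps around, which is exactly the fluctuation the question asks about.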
I would like to understand the difference between a cost function and an activation function in machine learning problems.
Can you please help me understand the difference?
A cost function is a measure of error between the value your model predicts and the value it actually is. For example, say we wish to predict the value y_i for a data point x_i. Let f_θ(x_i) represent the prediction, or output, of some arbitrary model for the point x_i with parameters θ. One possible cost function is the sum of (y_i − f_θ(x_i))^2 over the data points (this is only an example; it could be the absolute value instead of the square). Training the hypothetical model described above is the process of finding the θ that minimises this sum.
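As a tiny worked example of that sum (made-up numbers, squared-error variant only):

import numpy as np

y_true = np.array([3.0, -0.5, 2.0])      # the actual values y_i
y_pred = np.array([2.5,  0.0, 2.0])      # the model outputs f_θ(x_i)
cost = np.sum((y_true - y_pred) ** 2)    # (0.5)^2 + (0.5)^2 + 0^2 = 0.5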
An activation function transforms the representation of the data as it flows through the model. A simple example is max(0, x), a function which outputs 0 if the input x is negative and x if the input x is positive. This function is known as the "Rectified Linear Unit" (ReLU) activation function. These non-linear transformations are essential for making high-dimensional data linearly separable, which is one of the many uses of a neural network. The choice of activation function depends on your use case, on the kind of layer it sits in (hidden / output), and on the model architecture.
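Written out, the ReLU example above is just (a minimal NumPy sketch):

import numpy as np

def relu(x):
    # max(0, x): passes positive inputs through unchanged, zeroes out negative ones
    return np.maximum(0, x)

relu(np.array([-2.0, -0.1, 0.0, 1.5]))   # -> array([0. , 0. , 0. , 1.5])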
I'm working on a deep learning classifier (Keras and Python) that classifies time series into three categories. The loss function that I'm using is the standard categorical cross-entropy. In addition to this, I also have an attention map which is being learnt within the same model.
I would like this attention map to be as small as possible, so I'm using a regulariser. Here comes the problem: how do I set the right regularisation parameter? What I want is for the network to reach its maximum classification accuracy first, and only then start minimising the intensity of the attention map. For this reason, I train my model once without the regulariser and a second time with the regulariser on. However, if the regularisation parameter (lambda) is too high, the network completely loses accuracy and only minimises the attention, while if it is too small, the network only cares about the classification error and won't minimise the attention, even when the accuracy is already at its maximum.
Is there a smarter way to combine the categorical cross-entropy with the regulariser? Maybe something that tracks how the categorical cross-entropy changes over time, and if it hasn't gone down for, say, N iterations, only then gives weight to the regulariser?
Thank you
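For reference, the combined objective described above can be written down roughly as follows. This is a hypothetical TensorFlow/Keras sketch, not the asker's actual code; attention_map and lam are made-up names.

import tensorflow as tf

lam = tf.Variable(0.0, trainable=False)  # regularisation weight; start at 0, raise it later

def combined_loss(y_true, y_pred, attention_map):
    # classification term plus a weighted L1 penalty on the attention map
    ce = tf.reduce_mean(tf.keras.losses.categorical_crossentropy(y_true, y_pred))
    return ce + lam * tf.reduce_mean(tf.abs(attention_map))

A Keras callback could then raise lam only once the validation cross-entropy has stopped improving for N epochs, which is essentially the scheduling idea in the question; how well that behaves will depend on the model.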
Regularisation is a way to fight overfitting, so you should first check whether your model overfits. A simple way to do this is to compare the F1 score on the training set and on the test set. If the F1 score is high on the training set and low on the test set, you most likely have overfitting, and you need to add some regularisation.
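For example, with scikit-learn that check could look like this (a self-contained toy sketch; a decision tree stands in here for the actual Keras classifier):

from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = DecisionTreeClassifier().fit(X_train, y_train)   # unconstrained tree, prone to overfit
f1_train = f1_score(y_train, clf.predict(X_train))     # typically close to 1.0
f1_test = f1_score(y_test, clf.predict(X_test))        # noticeably lower, a sign of overfitting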
In the paper A Tutorial on Energy Based Learning I have seen two definitions:
The energy function E(X, Y) is minimised by the inference process: the goal is to find the value of Y for which E(X, Y) takes its minimal value.
The loss function is a measure of the quality of an energy function, evaluated using the training set.
I understand the meaning of the loss function (a good example is the mean squared error). But can you explain the difference between an energy function and a loss function? Can you give me an example of an energy function in ML or DL?
In short, the energy function describes your problem, whereas the loss function is just something an ML algorithm takes as input to optimise. These might be the same function, but that is not necessarily the case.
The energy of a system in physics might be, say, the amount of movement inside that system. In an ML context, you might want to minimise that movement by adjusting the parameters. One way to achieve this is to use the energy function directly as the loss function and minimise it. In other cases the energy function might not be easy to evaluate or to differentiate, and then another function is used as the loss for your ML algorithm. This is similar to classification, where you care about the accuracy of the classifier, but you still use cross-entropy on the softmax as the loss function, not accuracy.
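To make the classification analogy concrete, here is a small NumPy sketch (made-up probabilities): two models with identical accuracy can still be told apart, and optimised, through the cross-entropy loss.

import numpy as np

y = np.array([0, 1, 1])                    # true labels
p_a = np.array([0.4, 0.6, 0.6])            # model A's predicted P(class = 1)
p_b = np.array([0.1, 0.9, 0.9])            # model B's predicted P(class = 1)

def accuracy(p):
    return np.mean((p > 0.5) == y)         # what we actually care about

def cross_entropy(p):
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

accuracy(p_a), accuracy(p_b)               # both 1.0: accuracy cannot tell them apart
cross_entropy(p_a), cross_entropy(p_b)     # ~0.51 vs ~0.11: the loss can, and it is differentiable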
I have a question regarding data preprocessing for machine learning, specifically transforming the data so it has zero mean and unit variance.
I have split my data into two datasets (I know I should have three, but for the sake of simplicity let's just say I have two). Should I transform the training set so that the entire training set has zero mean and unit variance, and then, when testing the model, transform each test input vector so that it individually has zero mean and unit variance? Or should I just transform the entire dataset (training and testing) together so that the whole thing has zero mean and unit variance? My belief is that I should do the former, so that I won't be introducing unwanted bias into the test data set. But I am no expert, hence my question.
Fitting your preprocessor should only be done on the training set; the fitted mean and variance are then used to transform the test set. Computing these statistics on train and test together leaks information about the test set.
Let me link you to a good course on Deep-Learning and show you a citation (both from Andrej Karpathy):
Common pitfall. An important point to make about the preprocessing is that any preprocessing statistics (e.g. the data mean) must only be computed on the training data, and then applied to the validation / test data. E.g. computing the mean and subtracting it from every image across the entire dataset and then splitting the data into train/val/test splits would be a mistake. Instead, the mean must be computed only over the training data and then subtracted equally from all splits (train/val/test).
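In scikit-learn terms, the same advice looks roughly like this (a minimal sketch with made-up data):

import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(5.0, 2.0, size=(80, 3))
X_test = rng.normal(5.0, 2.0, size=(20, 3))

scaler = StandardScaler().fit(X_train)    # mean/variance estimated on the training set only
X_train_std = scaler.transform(X_train)   # zero mean, unit variance on the training set
X_test_std = scaler.transform(X_test)     # the same training statistics applied to the test set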
I would like to understand the best way to conduct further analysis on a trained TensorFlow neural network for regression.
Specifically, I am looking into how to find maxima/minima of a trained neural network (equivalent to finding the max/min of a regression curve). The obvious, easy way is to "try out" all possible input combinations and check the result set for a max/min, but testing all combinations quickly becomes a huge resource sink when there are multiple inputs and dependent variables.
Is there any way to use a trained TensorFlow neural network to conduct these further analyses?
As networks are trained incrementally, you can find the maximum incrementally.
Suppose you have a neural network with an input size of 100 (e.g. a 10x10 image) and a scalar output of size 1 (e.g. the score of the image for a given task).
You can incrementally modify the input, starting from random noise, until you obtain a local maximum of the output. All you need is the gradient of the output with respect to the input:
import tensorflow as tf  # TF1-style graph code (tf.compat.v1 in TF2)

input = tf.Variable(tf.truncated_normal([100], mean=127.5, stddev=127.5 / 2.))
output = model(input)
# tf.gradients returns a list (one tensor per variable passed in), so take the first entry
grads = tf.gradients(output, input)[0]
learning_rate = 0.1
# gradient *ascent* on the input: nudge the input towards a higher output
update_op = input.assign_add(learning_rate * grads)
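Sticking with the TF1-style graph API used above, running the update could look roughly like this (a sketch, assuming model is already built):

sess = tf.Session()
sess.run(tf.global_variables_initializer())
for step in range(1000):
    _, score = sess.run([update_op, output])  # one gradient-ascent step on the input
best_input = sess.run(input)                  # an input that (locally) maximises the network's output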
ANNs are not something that can be analysed in closed form. A network may have millions of weights and thousands of neurons, non-linear activation functions of different types, convolution and max-pooling layers... there is no way to determine anything about it analytically. That is actually why networks are trained incrementally in the first place.