In the paper A Tutorial on Energy Based Learning I have seen two definitions:
Energy function E(X, Y) is minimized by the inference process: the goal is to find the value of Y for which E(X, Y) takes its minimal value.
Loss function is a measure of the quality of an energy function on the training set.
I understand the meaning of a loss function (a good example is the mean squared error). But can you explain the difference between an energy function and a loss function? Can you give an example of an energy function in ML or DL?
In short, the energy function describes your problem. In contrast, the loss function is just something an ML algorithm uses as its input. These might be the same function, but that is not necessarily the case.
The energy of a system in physics might be the movement inside that system. In an ML context, you might want to minimize that movement by adjusting the parameters. One way to achieve this is to use the energy function as the loss function and minimize it directly. In other cases this function might not be easy to evaluate or to differentiate, and then other functions might be used as the loss for your ML algorithm. This is similar to classification, where you care about the accuracy of the classifier, but you still use cross-entropy on the softmax as the loss function rather than accuracy itself.
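To make this concrete, here is a minimal sketch (my own toy construction, not from the paper) of the energy-based view in Python: inference minimizes the energy over Y for a fixed input, while the loss measures how good the current parameters are over a training set. All the names (energy, infer, loss, w) are illustrative.

import numpy as np

# Toy energy: squared distance between a linear prediction w*x and a candidate y.
def energy(w, x, y):
    return (y - w * x) ** 2

# Inference: for a fixed input x and fixed parameters w, pick the y with lowest energy.
def infer(w, x, candidates):
    return min(candidates, key=lambda y: energy(w, x, y))

# Loss: measures how good the energy function (i.e. the parameters w) is on a
# training set; here it happens to reuse the energy, but it need not.
def loss(w, data):
    return np.mean([energy(w, x, y) for x, y in data])

data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]                # toy (x, y) pairs
w = 2.0
print(infer(w, 1.5, candidates=np.linspace(0, 10, 101)))   # inference over y
print(loss(w, data))                                       # quality of w on the data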
I'm studying neural networks and I have some questions about the theory of gradient descent.
Why does batch gradient descent fluctuate (in its loss curve) less than SGD?
Why does SGD avoid local minima better than batch gradient descent?
Batch gradient descent surveys all the data before making an update. What does that mean?
Everything is related to the compromise between exploitation and exploration.
Gradient descent uses all the data to update the weights, which gives a better update. In neural networks, batch (mini-batch) gradient descent is used because the original full-dataset version is often not practical. Stochastic gradient descent, by contrast, only uses a single example, which adds noise. With BGD and GD you exploit more of the data.
With SGD you can escape local minima because, by using a single example, you benefit from exploration: the added noise can lead you to solutions that BGD could not reach. With SGD you explore more.
BGD takes the dataset and breaks it into N chunks where each chunk has B samples (B is the batch_size). That still forces you to go through all the data in the dataset, as the sketch below shows.
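To make the three variants concrete, here is a rough NumPy sketch of how one pass over the data differs between full-batch gradient descent, mini-batch gradient descent and SGD on a toy linear regression. The data, learning rate and batch_size are made up for illustration.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                 # toy data: 100 samples, 3 features
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)

def grad(w, Xb, yb):
    # Gradient of the mean squared error of a linear model on a batch (Xb, yb).
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

w = np.zeros(3)
lr = 0.1

# Full-batch gradient descent: one update uses all the data (smooth, little noise).
w -= lr * grad(w, X, y)

# Mini-batch gradient descent: one update per chunk of batch_size samples.
batch_size = 16
for start in range(0, len(y), batch_size):
    w -= lr * grad(w, X[start:start + batch_size], y[start:start + batch_size])

# Stochastic gradient descent: one update per single example (noisy, more exploration).
for i in range(len(y)):
    w -= lr * grad(w, X[i:i + 1], y[i:i + 1])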
I'm working on k-means clustering, but unlike supervised learning I cannot figure out the performance metrics for clustering algorithms. How do I measure performance after fitting the data?
For k-means you can look at its inertia_, which gives you an idea of how well the k-means algorithm has worked.
from sklearn.cluster import KMeans

kmeans = KMeans(...)
# Assuming you have already fitted the data, e.g. with kmeans.fit(X).
kmeans.inertia_  # lower is better
Alternatively, you can call the score() function, which gives you the same quantity but with a negative sign. By convention a bigger score means better, whereas for k-means a lower inertia_ is better, so to keep them consistent an extra negation is applied.
# Call score with data X
kmeans.score(X) # greater is better
This is the most basic way of analyzing the performance of k-means. In reality, if you make the number of clusters too high, score() will increase accordingly (in other words, inertia_ will decrease), because inertia_ is nothing but the sum of the squared distances from each point to the centroid of the cluster it is assigned to. So if you increase the number of clusters too much, this overall sum of squared distances will keep decreasing, since each point gets a centroid very near to it, even though the quality of the clustering is horrible in that case. So, for a better analysis you should compute the silhouette score, or even better use a silhouette diagram.
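As a rough sketch of how that analysis could look with scikit-learn (the toy data from make_blobs is only for illustration; in practice use your own X), you can compare inertia_ and silhouette_score for several values of n_clusters:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Toy data with 4 true blobs; X stands in for your own feature matrix.
X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

for k in range(2, 7):
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    # Silhouette is in [-1, 1]; closer to 1 means better-separated clusters.
    print(k, kmeans.inertia_, silhouette_score(X, kmeans.labels_))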
You will find all of the implementations in this notebook: 09_unsupervised_learning.ipynb
The book corresponding to this repository is: Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2nd Edition. It is a great book to learn all of these details.
I would like to understand the difference between a cost function and an activation function in machine learning problems.
Can you please help me understand the difference?
A cost function is a measure of the error between the value your model predicts and the actual value. For example, say we wish to predict the value yi for a data point xi. Let fθ(xi) represent the prediction, or output, of some arbitrary model for the point xi with parameters θ. One possible cost function is the sum over i of (yi − fθ(xi))², though this is only an example; it could just as well use the absolute value instead of the square. Training the hypothetical model above would be the process of finding the θ that minimizes this sum.
An activation function transforms the representation of the data as it passes through the model. A simple example is max(0, x), a function which outputs 0 if the input x is negative and x if the input x is positive. This function is known as the "Rectified Linear Unit" (ReLU) activation function. These non-linear transformations are essential for making high-dimensional data linearly separable, which is one of the many uses of a neural network. The choice of activation function depends on your case: whether you need a custom model, the kind of layer (hidden / output), and the model architecture.
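To keep the two concepts apart, here is a tiny illustrative NumPy sketch (all names and values are made up for the example): relu is an activation that reshapes values element-wise, while squared_error_cost judges how good the final predictions are.

import numpy as np

def relu(x):
    # Activation function: transforms the representation element-wise.
    return np.maximum(0, x)

def squared_error_cost(y_true, y_pred):
    # Cost function: measures the error between predictions and targets.
    return np.sum((y_true - y_pred) ** 2)

x = np.array([-1.0, 0.5, 2.0])
y_true = np.array([0.0, 1.0, 2.0])
y_pred = relu(x)                           # the activation shapes the model's output
print(squared_error_cost(y_true, y_pred))  # the cost judges how good that output is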
I'm working on a deep learning classifier (Keras and Python) that classifies time series into three categories. The loss function that I'm using is the standard categorical cross-entropy. In addition to this, I also have an attention map which is being learnt within the same model.
I would like this attention map to be as small as possible, so I'm using a regularizer. Here comes the problem: how do I set the right regularization parameter? What I want is for the network to reach its maximum classification accuracy first, and only then start minimising the intensity of the attention map. For this reason, I train my model once without the regulariser and a second time with the regulariser on. However, if the regulariser parameter (lambda) is too high, the network completely loses accuracy and only minimises the attention, while if the regulariser is too small, the network only cares about the classification error and won't minimise the attention, even when the accuracy is already at its maximum.
Is there a smarter way to combine the categorical cross-entropy with the regulariser? Maybe something that considers the variation of the categorical cross-entropy over time, and if it doesn't go down for, say, N iterations, it only considers the regulariser?
Thank you
Regularisation is a way to fight overfitting. So, you should first understand whether your model overfits. A simple way to do it: compare the f1 score on train and test. If the f1 score on train is high and on test is low, it seems you have overfitting, so you need to add some regularisation.
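A rough sketch of that check with scikit-learn could look like this (the toy data and the DecisionTreeClassifier are just stand-ins for your own model and splits):

from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy data only for illustration; substitute your own model and data splits.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An unpruned tree is prone to over-fitting, which makes the gap easy to see.
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

f1_train = f1_score(y_train, model.predict(X_train))
f1_test = f1_score(y_test, model.predict(X_test))
# A high train score with a much lower test score suggests over-fitting,
# i.e. a sign that more regularisation could help.
print(f1_train, f1_test)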
In deep learning we should choose the best model according to the train/val loss and accuracy, but how do I know which point is the best?
Does it only depend on the val accuracy regardless of the other metrics?
And two more relevant questions:
What do the optimal train/val loss and accuracy curves look like?
What should I do if the train loss is decreasing and the train accuracy is increasing, but the val loss is increasing while the val accuracy stops increasing after training for a long time?
It looks like this: [plots of train accuracy, train loss, val accuracy, and val loss over training]
First of all, you need to choose the model according to its result on the development/validation dataset. Therefore, val accuracy and val loss are used to judge the model's performance.
To some extent, higher val accuracy is often associated with lower val loss. That's because the loss measures the difference between the predicted result and the ground truth.
Different problems are measured by different metrics; for example, we often use the BLEU score in machine translation. You should read some papers in your research field to learn which metric is popular.
Train loss decreasing while val loss increases is quite a normal occurrence in model training; it usually means your model is over-fitting. It learns features that appear only in the training dataset and not in the whole data distribution.
As for dealing with over-fitting, there are many methods such as early stopping, dropout, etc. You can just google them.
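For example, a rough Keras sketch of early stopping might look like this (the toy data and the architecture are made up; only the EarlyStopping callback is the point):

import numpy as np
import tensorflow as tf

# Toy binary classification data, just to make the example runnable.
X = np.random.rand(1000, 20)
y = (X.sum(axis=1) > 10).astype("float32")

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(20,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dropout(0.3),            # dropout also helps against over-fitting
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Stop when val loss has not improved for 5 epochs and keep the best weights.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=5, restore_best_weights=True)

model.fit(X, y, validation_split=0.2, epochs=100, callbacks=[early_stop])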