Is there a way to minimize a cost function of the form cost = output (to be calculated) - function of (output to be calculated) + input provided? If yes, would anyone be kind enough to point me to a good learning link? A short explanation would be even better!
NOTE: θ (theta) in the formulas below denotes the model's parameter vector, not the number zero.
I've been studying Andrew Ng's Machine Learning Course, and I have the following inquiry:
(Short version: if one looks at all the mathematical expressions/calculations used for both forward AND backward propagation, it appears that we never use the cost function directly, only its derivative. So what is the importance of the cost function and its choice anyway? Is it purely to evaluate our system whenever we feel like it?)
Andrew mentioned that for logistic regression, using the MSE (mean squared error) cost function wouldn't be good, because applying it to our sigmoid function would yield a non-convex cost function with many local optima, so it is best that we use the following logistic cost function:
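For reference, this is the standard cross-entropy form used in the course:

$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[\,y^{(i)}\log h_\theta(x^{(i)}) + (1 - y^{(i)})\log\bigl(1 - h_\theta(x^{(i)})\bigr)\right]$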
This has two branches (one for y=0 and one for y=1), both of which are convex.
My question is the following. Our objective is to minimize the cost function (i.e. have its derivative reach 0), which we achieve using gradient descent, updating our weights with the derivative of the cost function; and that derivative seems to me to be the same for both cost functions:
$\frac{\partial J}{\partial \theta_j} = \bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)\, x_j^{(i)}$
So how does the different choice of cost function affect our algorithm in any way? Because in forward propagation, all we need is
$h_\theta(x^{(i)}) = \mathrm{sigmoid}(\theta^T x^{(i)})$
which can be calculated without ever needing to calculate the cost function. Then in backward propagation and in updating the weights, we always use the derivative of the cost function. So when does the cost function itself come into play? Is it only needed when we want an indication of how well our network is doing? (And if so, why not just rely on the derivative for that?)
Forward propagation does not need the cost function in any way, because you are just applying your learned weights to the corresponding input.
The cost function is generally used to measure how good your algorithm is, by comparing your model's output (obtained by applying your current weights to the input) with the true label of the input (in supervised algorithms). The main objective is therefore to minimize the cost, as (in most cases) you want the difference between the prediction and the true label to be as small as possible. In optimization it is very helpful if the function you want to optimize is convex, because it guarantees that any local minimum you find is also the global minimum.
To minimize the cost function, gradient descent is used to iteratively update your weights and get closer to the minimum. The gradient is taken with respect to the learned weights, so that you can update the model's weights towards the lowest possible cost. The backpropagation algorithm is what propagates the gradient of the cost function through the network in the backward pass so the weights can be adjusted.
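A minimal numpy sketch (names, data and hyperparameters are made up for illustration) of logistic regression trained with gradient descent; note that only the gradient drives the weight updates, while the cost value is merely computed for monitoring:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def train_logistic(X, y, lr=0.1, epochs=100):
        m, n = X.shape
        theta = np.zeros(n)
        for _ in range(epochs):
            h = sigmoid(X @ theta)                  # forward pass: predictions
            grad = X.T @ (h - y) / m                # gradient of the cross-entropy cost
            theta -= lr * grad                      # the update only needs the gradient
            cost = -np.mean(y * np.log(h + 1e-12)
                            + (1 - y) * np.log(1 - h + 1e-12))  # only used for monitoring/plotting
        return theta

    X = np.array([[0.5, 1.0], [1.5, -0.2], [-1.0, -1.0], [-0.3, 0.8]])
    y = np.array([1, 1, 0, 0])
    print(train_logistic(X, y))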
Technically, you are correct: we do not explicitly use the cost function in any of the calculations for forward propagation and back propagation.
You asked 'what is the importance of the cost function and its choice anyway?'. I have two answers:
The cost function is incredibly important because its gradient is what allows us to update our weights. Although we are only actually computing the gradient of the cost function and not the cost function itself, choosing a different cost function would mean we would have a different gradient, thus changing how we update our weights.
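For instance, for a single training example with sigmoid output $h_\theta(x) = \sigma(\theta^T x)$ (a quick sketch, not from the course notes): the cross-entropy cost gives the gradient $(h_\theta(x) - y)\, x_j$, while the MSE cost $\frac{1}{2}(h_\theta(x) - y)^2$ gives $(h_\theta(x) - y)\, h_\theta(x)\,(1 - h_\theta(x))\, x_j$, so the weight updates are genuinely different.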
The cost function allows us to evaluate our model performance. It is common practice to plot cost vs epoch to understand how the cost decreases over time.
Your question indicated you essentially understood all of this already, but I hoped to clarify it a bit. Thanks!
I am working on a problem that we aim to solve with deep Q-learning. However, training takes too long for each episode, roughly 83 hours. We envision solving the problem within, say, 100 episodes.
So we are gradually learning a matrix (100 * 10), and within each episode we need to perform 100*10 iterations of certain operations. Basically, we select a candidate from a pool of 1000 candidates, put this candidate into the matrix, and compute a reward by feeding the whole matrix as the input.
The central hurdle is that the reward computation at each step is costly, roughly 2 minutes, and at each step we only update one entry of the matrix.
All the elements in the matrix depend on each other in the long term, so the whole procedure does not seem suitable for a "distributed" setup, if I understand correctly.
Could anyone shed some light on potential optimization opportunities here, such as extra engineering effort? Any suggestions and comments would be much appreciated. Thanks.
======================= update of some definitions =================
0. initial stage:
a 100 * 10 matrix, with every element initially empty
1. action space:
at each step I select one element from a candidate pool of 1000 elements and insert it into the matrix, one element at a time.
2. environment:
at each step I have an updated matrix to learn from.
An oracle function F returns a quantitative value ranging from 5000 to 30000, the higher the better (one computation of F takes roughly 120 seconds).
This function F takes the matrix as input, performs a very costly computation, and returns a quantitative value indicating the quality of the matrix synthesized so far.
This function essentially measures some performance of the system, so it does take a while to compute a reward value at each step.
3. episode:
Saying "we envision solving it within 100 episodes" is just an empirical estimate, but it shouldn't take fewer than 100 episodes, at least.
4. constraints
Ideally, like I mentioned, "all the elements in the matrix depend on each other in the long term", and that's why the reward function F computes the reward from the whole matrix rather than from only the most recently selected element.
Indeed, as more and more elements are appended to the matrix, the reward could increase, but it could also decrease.
5. goal
The synthesized matrix should make the oracle function F return a value greater than 25000. Whenever it reaches this goal, I terminate the learning.
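To make the setup concrete, here is a rough sketch of one episode as described above (oracle_F and select_candidate are dummy placeholders, not the real system):

    import random
    import numpy as np

    N_ROWS, N_COLS, POOL_SIZE, TARGET = 100, 10, 1000, 25000

    def oracle_F(matrix):
        # Placeholder for the real oracle, which takes ~120 seconds per call.
        filled = [v for v in matrix.ravel() if v is not None]
        return 5000 + 25 * len(filled) * random.random()

    def select_candidate(matrix, pool):
        # Placeholder policy; the deep Q-network would pick the highest-Q action here.
        return random.choice(pool)

    def run_episode(pool):
        matrix = np.full((N_ROWS, N_COLS), None, dtype=object)  # 100 * 10, initially empty
        for step in range(N_ROWS * N_COLS):                     # 100*10 iterations per episode
            r, c = divmod(step, N_COLS)
            matrix[r, c] = select_candidate(matrix, pool)       # pick one of 1000 candidates
            reward = oracle_F(matrix)                           # the 2-minute bottleneck, every step
            if reward > TARGET:                                 # stop once the goal value is reached
                return matrix
        return matrix

    run_episode(list(range(POOL_SIZE)))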
Honestly, there is no effective way to know how to optimize this system without knowing specifics such as which computations are in the reward function or which programming design decisions you have made that we can help with.
You are probably right that the episodes are not suitable for distributed calculation, meaning we cannot parallelize this, as they depend on previous search steps. However, it might be possible to throw more computing power at the reward function evaluation, reducing the total time required to run.
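For example, if evaluating F on several candidate placements at the same step is independent work, a sketch like the following (oracle_F and the list-based "matrix" are stand-ins, not your actual code) could spread those evaluations over multiple processes and keep only the best candidate:

    from concurrent.futures import ProcessPoolExecutor

    def oracle_F(matrix):
        # Stand-in for the expensive (~120 s) reward function.
        return sum(matrix)

    def best_candidate(matrix, candidates, workers=8):
        # Score each trial placement in a separate process and keep the best one.
        trial_matrices = [matrix + [c] for c in candidates]
        with ProcessPoolExecutor(max_workers=workers) as pool:
            scores = list(pool.map(oracle_F, trial_matrices))
        best = max(range(len(candidates)), key=lambda i: scores[i])
        return candidates[best], scores[best]

    if __name__ == "__main__":
        print(best_candidate([1, 2, 3], candidates=[10, 20, 30]))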
I would encourage you to share more details on the problem, for example by profiling the code to see which component takes up the most time, by sharing a code excerpt or, as the standard for doing science gets higher, by sharing a reproducible code base.
Not a solution to your question, just some general thoughts that maybe are relevant:
One of the biggest obstacles to applying Reinforcement Learning to "real world" problems is the astoundingly large amount of data/experience required to achieve acceptable results. For example, OpenAI's Dota 2 agent collected experience equivalent to 900 years of play per day. In the original Deep Q-network paper, achieving performance close to a typical human required hundreds of millions of game frames, depending on the specific game. In other benchmarks where the input is not raw pixels, such as MuJoCo, the situation isn't much better. So, if you don't have a simulator that can generate samples (state, action, next state, reward) cheaply, maybe RL is not a good choice. On the other hand, if you have a ground-truth model, maybe other approaches can easily outperform RL, such as Monte Carlo Tree Search (e.g., Deep Learning for Real-Time Atari Game Play Using Offline Monte-Carlo Tree Search Planning) or random search (Simple random search provides a competitive approach to reinforcement learning). All these ideas and much more are discussed in this great blog post.
The previous point is especially true for deep RL. Approximating value functions or policies with a deep neural network containing millions of parameters usually implies that you'll need a huge quantity of data, or experience.
And regarding your specific question:
In the comments, I've asked a few questions about the specific features of your problem. I was trying to figure out whether you really need RL to solve the problem, since it's not the easiest technique to apply. On the other hand, if you really need RL, it's not clear whether you should use a deep neural network as the approximator or whether a shallow model (e.g., random trees) would suffice. However, these questions and other potential optimizations require more domain knowledge. Here, it seems you are not able to share the domain of the problem, which could be due to numerous reasons, and I perfectly understand.
You have estimated the number of episodes required to solve the problem based on some empirical studies using a smaller 20*10 matrix. Just a cautionary note: due to the curse of dimensionality, the complexity of the problem (or the experience needed) could grow exponentially as the state-space dimensionality grows, although maybe that is not your case.
That said, I'm looking forward to seeing an answer that really helps you solve your problem.
So I've done a Coursera ML course, and now I am looking at scikit-learn's logistic regression, which is a little bit different. I've been using the sigmoid function, and the cost function was divided into two separate cases, one for y=0 and one for y=1. But scikit-learn has a single function (I've found out that this is the generalised logistic function), which really doesn't make sense to me.
http://scikit-learn.org/stable/_images/math/760c999ccbc78b72d2a91186ba55ce37f0d2cf37.png
I am sorry, I don't have enough reputation to post the image.
My main concern with the function is the case when y=0: then the cost function always has this log(e^0+1) value, so it would not matter what X or w were. Can someone explain that to me?
If I am correct, the $y_i$ in this formula can only assume the values -1 or 1 (as derived in the link given by #jinyu0310). Normally you would use this cost function (with regularisation):
(sorry for inserting the equation as an image, but I cannot use LaTeX here; the image comes from goo.gl/M53u44)
So you always have two terms, one that plays a role when $y_i = 0$ and one when $y_i = 1$. I am trying to find a better explanation of the notation that scikit-learn uses in this formula, but so far no luck.
Hope that helps you. It is just another way of writing the cost function with a regularisation factor. Also keep in mind that a constant factor in front of everything does not play a role in the optimisation process, since you want to find the minimum and are not interested in overall factors that multiply everything.
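For what it's worth, here is a sketch of why the two forms agree (writing $z_i = w^T x_i + c$ and $\sigma(z) = 1/(1+e^{-z})$): since $-\log \sigma(z) = \log(1 + e^{-z})$ and $-\log(1 - \sigma(z)) = \log(1 + e^{z})$, the per-example cross-entropy term $-[\,y \log \sigma(z) + (1-y)\log(1-\sigma(z))\,]$ with $y \in \{0, 1\}$ equals $\log(1 + e^{-\tilde{y} z})$ with $\tilde{y} = 2y - 1 \in \{-1, +1\}$. So the case $y = 0$ corresponds to $\tilde{y} = -1$ and gives $\log(1 + e^{z})$, which very much depends on $X$ and $w$; the confusing $\log(e^{0} + 1)$ reading comes from plugging the label itself into the exponent instead of $\tilde{y} z$.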
I am reading A Tutorial on Energy Based Learning and I am trying to understand the difference between all those terms stated above in the context of SVMs. This link summarizes the differences between a loss, cost and an objective function. Based on my understanding,
Objective function: Something we want to minimize. For example ||w||^2 for SVM.
Loss function: Penalty between prediction and label which is also equivalent to the regularization term. Example is the hinge loss function in SVM.
Cost function: A general formulation that combines the objective and loss function.
Now, the 1st link states that the hinge function is max(0, m + E(W,Yi,Xi) - E(W,Y,X)), i.e. it is a function of the energy term. Does that mean that the energy function of the SVM is 1 - y(wx + b)? Are energy functions a part of a loss function? And is a loss + objective function part of the cost function?
A concise summary of the 4 terms would immensely help my understanding. Also, do correct me if my understanding is wrong. The terms sound so confusing. Thanks !
Objective function: Something we want to minimize. For example ||w||^2 for SVM.
The objective function is, as the name suggests, the objective of the optimization. It can be something we want to minimize (like a cost function) or maximize (like a likelihood). In general, it is a function that measures how good our current solution is (usually by returning a real number).
Loss function: Penalty between prediction and label which is also equivalent to the regularization term. Example is the hinge loss function in SVM.
First of all, the loss is not equivalent to regularization, in any sense. A loss function is a penalty between a model and the truth. This can be a predicted class-conditional distribution vs. the true label, a model distribution vs. an empirical sample, and many more.
Regularization
Regularization is a term, a penalty, which is meant to punish overly complex models. In ML, or generally in statistics when dealing with estimators, you always try to balance two sources of error: variance (coming from overly complex models, overfitting) and bias (coming from overly simple models or bad learning methods, underfitting). Regularization is a technique of penalizing high-variance models during optimization in order to obtain a less overfitted one. In other words, for techniques which can fit the training set perfectly, it is important to have a measure that forbids this in order to preserve the ability to generalize.
Cost function: A general formulation that combines the objective and loss function.
A cost function is just an objective function which one minimizes. It can be composed of an agglomeration of loss functions and regularizers.
Now, the 1st link states that the hinge function is max(0, m + E(W,Yi,Xi) - E(W,Y,X)), i.e. it is a function of the energy term. Does that mean that the energy function of the SVM is 1 - y(wx + b)? Are energy functions a part of a loss function? And is a loss + objective function part of the cost function?
The hinge loss is max(0, 1 - y(<w,x> - b)). The one defined here is not really for SVMs but for general factor graphs. I would strongly suggest starting to learn ML from the basics and not from advanced techniques; without a good understanding of the basics of ML, this paper will not be possible to understand.
To show an example with the SVM and the naming convention:
C * SUM_{i=1}^{N} max(0, 1 - y_i(<w, x_i> - b))   +   ||w||^2
\_____________________________________________/       \______/
                     loss                          regularization
\____________________________________________________________/
                   cost / objective function
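A small numpy sketch of the same naming convention (C, w, b and the toy data below are made-up values, only to make the pieces concrete):

    import numpy as np

    def svm_objective(w, b, X, y, C=1.0):
        margins = y * (X @ w - b)                          # y_i (<w, x_i> - b)
        loss = C * np.sum(np.maximum(0.0, 1.0 - margins))  # hinge loss over the data
        regularization = np.dot(w, w)                      # ||w||^2, penalizes complex models
        return loss + regularization                       # cost / objective function

    X = np.array([[1.0, 2.0], [2.0, 0.5], [-1.0, -1.5]])
    y = np.array([1.0, 1.0, -1.0])
    w = np.array([0.3, -0.2])
    print(svm_objective(w, 0.1, X, y))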
Suppose there is a 15-puzzle and we will solve it using A* search. The heuristic function is the Manhattan distance.
Now a solution with cost T is provided by someone, and we are not sure whether this solution is optimal. With this information,
Is it possible to find a better solution with cost < T?
Is it possible to optimize the performance of the searching algorithm?
For this question, I have considered several approaches.
Set h(x) = MAX_INT if g(x) >= T. That is, the f(x) value becomes maximal once the cost so far already reaches T.
Mark the search node as CLOSED if g(x) >= T.
Is it possible to find a better solution?
You need to know if T is the optimal solution. If you do not know the optimal solution, use the average cost; a good path is better than the average. If T is already better than average, you don't need to find a new path.
Is it possible to optimize the performance of the searching algorithm?
Yes. Heuristics are assumptions that help algorithms to make good decisions. The A* algorithm makes the following assumptions:
The best path costs the least (Dijkstra's Algorithm - stay near origin of search)
The best path is the most direct path (Greedy Search - minimize distance to goal)
Good heuristics vastly improve performance (A* is useful for this reason). Bad heuristics lead the search away from good solutions and obliterate performance. My advice is to know the game you are searching; in chess, it's generally best to avoid losing a queen, so that may be a good heuristic to use.
Heuristics will have the largest impact on performance, especially in the case of a 15x15 search space. In larger search spaces (2000x2000), good use of high efficiency data structures like arrays and integers may improve performance.
Potential solutions
Both the solutions you provide are effectively the same; if a path isn't as good as the other paths you have, ignore it. Search algorithms like A* do this for you, as j_random_hacker has said in a roundabout manner.
The OPEN list is a set of possible moves; select the best and ignore the rest. The CLOSED list is the set of moves that have already been selected, not the ones you wish to ignore.
(1) d(x) = Dijkstra's Algorithm
(2) g(x) = Greedy Search
(3) a*(x) = A* Algorithm = d(x) + g(x)
To make your A* greedier (prefer suboptimal but fast solutions), weight g(x) more heavily to favour the greedy component: (4) a*(x) = d(x) + 1.1 * g(x)
I actually tested this in a search space of 1500x2000. (3), a standard A*, took about 5 seconds to find the goal on the opposite side; (4) took only milliseconds, demonstrating the value of using heuristics well.
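As a minimal illustration (on a grid with a Manhattan-distance heuristic rather than the 15-puzzle; the grid, start/goal and the 1.1 weight below are made up), this is the kind of weighted A* meant in (4):

    import heapq

    def weighted_a_star(grid, start, goal, w=1.1):
        # grid[r][c] == 0 means free, 1 means blocked; w > 1 makes the search greedier.
        rows, cols = len(grid), len(grid[0])
        def h(p):                                          # Manhattan distance to the goal
            return abs(p[0] - goal[0]) + abs(p[1] - goal[1])
        open_heap = [(w * h(start), 0, start)]             # the OPEN list: (f, g, node)
        closed = set()                                     # the CLOSED list: already expanded
        best_g = {start: 0}
        while open_heap:
            f, g, node = heapq.heappop(open_heap)
            if node == goal:
                return g                                   # cost of the path found
            if node in closed:
                continue
            closed.add(node)
            r, c = node
            for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                    ng = g + 1
                    if ng < best_g.get((nr, nc), float("inf")):
                        best_g[(nr, nc)] = ng
                        heapq.heappush(open_heap, (ng + w * h((nr, nc)), ng, (nr, nc)))
        return None

    grid = [[0, 0, 0, 0],
            [1, 1, 0, 1],
            [0, 0, 0, 0]]
    print(weighted_a_star(grid, (0, 0), (2, 3)))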
You may also add other heuristics to A*, such as:
Depth-first search (prefer a greater number of moves)
Breadth-first (prefer a smaller number of moves)
Stick to Roads (if terrain determines movement speed, increase the cost of choosing bad terrain)
Stay out of enemy territory (if you want to avoid losing units, don't put them in harm's way)