The formula for the harmonic mean is: (2*Recall*Precision) / (1*Precision + Recall).
The 2 comes from (Beta² + 1) and the 1 comes from Beta² (the factor attached to the Precision term), where Beta is a factor that indicates the importance of recall relative to precision.
How do I update the formula so that Recall becomes twice as important?
I think you have kind of answered your own question: the harmonic mean is the formula you stated with Beta equal to 1, so in order to make recall twice as important as precision, simply set Beta to 2 to obtain: (5*Recall*Precision) / (4*Precision + Recall).
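As a quick sanity check, here is a minimal Python sketch of the general F-Beta formula (the function name f_beta is just for illustration):

    def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
        """General F-beta score: beta > 1 weights recall more heavily."""
        b2 = beta ** 2
        return (1 + b2) * precision * recall / (b2 * precision + recall)

    # beta = 1 recovers the plain harmonic mean (F1);
    # beta = 2 makes recall twice as important as precision.
    print(f_beta(0.5, 1.0, beta=1))  # 0.666...
    print(f_beta(0.5, 1.0, beta=2))  # 0.833..., pulled toward the higher recall

The same numbers can be reproduced with scikit-learn's fbeta_score if you start from label vectors rather than precomputed precision and recall.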
I've been looking at the deviance calculation for the negative binomial model in H2O (code line 580/959), and I'm struggling to see why it is 0 when yr or ym is 0.
    // returns 0 whenever yr (observed) or ym (predicted mean) is 0, otherwise the usual NB deviance
    (yr==0||ym==0)?0:2*((_invTheta+yr)*Math.log((1+_theta*ym)/(1+_theta*yr))+yr*Math.log(yr/ym))
The formula for the deviance calculation is as below (from the H2O documentation, matching the code above):
$D = 2\left[\left(\frac{1}{\theta} + y_r\right)\log\left(\frac{1+\theta y_m}{1+\theta y_r}\right) + y_r \log\left(\frac{y_r}{y_m}\right)\right]$
Working through the math, I don't see how the deviance can be 0 unless both yr and ym are 0.
Does anyone happen to know if there is a special case where the negative binomial deviance needs to be set to 0 when either yr or ym is 0?
Thanks!
I'm not sure, but it seems to me they may simply have chosen the lazy way out of a numerical difficulty.
ym = 0 (i.e. mu = 0) is a degenerate case where p = 0, and so y = 0 always. It's not interesting and not really part of any useful analysis. I'm not sure it can even arise from the linear predictor: with the natural parameter equal to the linear predictor, you would need the linear predictor to be minus infinity...
However, y can equal 0 for other values of mu. What you do in that case is take the limit of the deviance as y -> 0, which is perfectly well defined for the negative binomial and is not equal to 0. They could have implemented it but chose not to, which is why I call it "lazy".
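To make the limit concrete: as y -> 0, the y*log(y/mu) term vanishes, so the unit deviance tends to (2/theta)*log(1 + theta*mu), which is nonzero whenever mu > 0. A minimal Python sketch with that limit handled explicitly (variable names mirror the H2O snippet; theta is the same dispersion parameter):

    import math

    def nb_unit_deviance(yr: float, ym: float, theta: float) -> float:
        """Negative binomial unit deviance, handling the y -> 0 limit instead of returning 0."""
        if yr == 0:
            # limit as yr -> 0: the yr*log(yr/ym) term vanishes
            return (2.0 / theta) * math.log(1.0 + theta * ym)
        return 2.0 * ((1.0 / theta + yr) * math.log((1.0 + theta * ym) / (1.0 + theta * yr))
                      + yr * math.log(yr / ym))

    print(nb_unit_deviance(0.0, 2.0, 0.5))  # ~2.77, not 0
    print(nb_unit_deviance(2.0, 2.0, 0.5))  # 0.0, as expected when yr == ym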
I'm relatively new to machine learning concepts, and I have been following several lectures/tutorials covering Q-Learning, such as: Stanford's Lecture on Reinforcement Learning
They all give short or vague answers to what exactly gamma's utility is in the policy function. The most understandable explanation I have found so far says it is "how much we value future rewards."
Is it really that simple? Is gamma what defines how we delay rewards/look ahead? Such as knowing to take option B in the following example:
In case of two options, A and B, A will give an immediate payoff of 10 then a payoff of another 10, while B will give an immediate payoff of 0 and then 30.
So, my questions:
What is a deep explanation of gamma?
How do we set it?
If it's not for looking-ahead, how do we look ahead?
The gamma parameter is indeed used to say something about how you value your future rewards. In more detail, the discounted reward (which is used in training) looks like:
Discounted reward: $R = \sum_{k=0}^{\infty} \gamma^k r_k$
This means that an exponential function decides how future rewards are taken into account.
As an example, let's compare 2 gamma values:
gamma = 0.9
gamma = 0.99
Let's look at when gamma^steps reaches 0.5. In the case of gamma = 0.9 this takes about 6-7 steps, while with gamma = 0.99 it takes about 69 steps. This means that for gamma = 0.9 the reward 6-7 steps ahead is half as important as the immediate reward, but for gamma = 0.99 the same only holds after 69 steps. The drop-off is thus much less significant for gamma = 0.99, and rewards far in the future are valued more highly than with gamma = 0.9.
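The half-life in steps follows directly from solving gamma^n = 0.5, i.e. n = log(0.5)/log(gamma); a quick Python check (half_life is just an illustrative helper):

    import math

    def half_life(gamma: float) -> float:
        """Number of steps after which a reward counts half as much as an immediate one."""
        return math.log(0.5) / math.log(gamma)

    print(half_life(0.9))   # ~6.6 steps
    print(half_life(0.99))  # ~69 steps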
To choose the gamma parameter for your application, it is important to have some feeling for how many steps you need in your environment to reach your rewards.
To come back to your options A and B: with a low gamma value the agent favours A, since its immediate reward dominates; with a higher gamma value the agent favours B, because its larger reward lies in the future. The crossover can be computed exactly, as shown below.
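For the concrete example, the discounted values are V(A) = 10 + 10*gamma and V(B) = 0 + 30*gamma, so B wins exactly when 30*gamma > 10 + 10*gamma, i.e. gamma > 0.5. A minimal check:

    def value(rewards, gamma):
        """Discounted return of a fixed reward sequence."""
        return sum(r * gamma ** t for t, r in enumerate(rewards))

    A, B = [10, 10], [0, 30]
    for gamma in (0.4, 0.5, 0.6):
        print(gamma, value(A, gamma), value(B, gamma))
    # gamma = 0.4: A = 14.0 > B = 12.0 -> prefer A
    # gamma = 0.5: A = 15.0 = B = 15.0 -> indifferent
    # gamma = 0.6: A = 16.0 < B = 18.0 -> prefer B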
I have been trying to solve a problem from a Coursera exam. I am not seeking the solution, but I do need the steps and concepts required to reach it.
Can anyone share the concepts and steps that would help me find the solution?
UPDATE:
I was expecting a down-vote, and it's not unusual, as it's the easiest thing people can do. I am seeking the direction to solve the problem, as I wasn't able to work out how to solve it after watching the videos on Coursera. I hope someone sensible out there can share a direction and the steps to achieve the mentioned goal.
Mean Normalization
Mean normalization, closely related to 'standardization', is one of the most popular feature-scaling techniques.
Andrew Ng describes it in slide 12a of lecture 4; the formula is x_std = (x - μ) / σ.
How to resolve the problem
The problem asks you to standardize the first feature in the third example: midterm = 94;
Well, we just have to solve the equation!
Just for clarity, the notation:
μ (mu) = "avg value of x in training set", in other words: the mean of the x1 column.
σ (sigma) = "range (max - min)", literally σ = max - min (of the x1 column).
So:
μ = ( 89 + 72 + 94 +69 )/4 = 81
σ = ( 94 - 69 ) = 25
x_std = (94 - 81)/25 = 0.52
Result: 0.52
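As a quick check, here is the same computation as a minimal Python sketch:

    xs = [89, 72, 94, 69]      # the midterm (x1) column
    mu = sum(xs) / len(xs)     # 81.0
    sigma = max(xs) - min(xs)  # 25, the range, as used in the course
    print((94 - mu) / sigma)   # 0.52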
Best regards,
Marco.
The first step in solving this question is to identify what $x_1^{(3)}$ is: from the content of the lecture, it refers to the first feature of the third training example, which is the unsquared version of the midterm score in the third row of the table.
Secondly, you need to understand the concept of normalization. The reason we need normalization is that the values of some features may be much larger than the values of other features across the training examples, which can give the cost function a badly conditioned shape and make it harder for gradient descent to find the minimum. To solve this, we want all features to have nearly the same scale, and we want the range of each feature to be centered at zero.
In this question, we want to scale every feature to a range of 1. To do this, you find the max and min values of the feature among all training examples and squeeze the feature's range down to width 1. The second step is to find the center value of the feature (the average value in this case) and shift that center to 0.
I think that is pretty much all the hints I can give you; from this point you should be able to calculate the answer to this question yourself. A rough sketch of the two steps follows below.
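In code, the two steps look roughly like this (a sketch assuming xs holds one feature column):

    xs = [89, 72, 94, 69]

    # step 1: the range (max - min) used to squeeze the feature to width 1
    span = max(xs) - min(xs)

    # step 2: the center (mean) that gets shifted to 0
    mu = sum(xs) / len(xs)

    print([(x - mu) / span for x in xs])  # centered at 0, total range of 1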
I have a problem which I think can be converted to a variant of the fractional knapsack problem.
The objective function is in the form of:
$\sum_{i} x_iv_i$
However, my problem differs in that it allows the $v_i$ and $x_i$ to be negative.
I want to prove that this problem can be solved using the greedy algorithm (explained in the link).
I have tested this on many test cases and the greedy algorithm seems to solve it, but I want a definite proof that the greedy algorithm is still applicable under this relaxation.
In the fractional knapsack problem, you compute the value/weight (V/W) ratio of every item that you may put in the knapsack and sort the items from the best V/W ratio to the worst. You then start with the best ratio and fill the knapsack with that item until either the knapsack is full or you run out of the item. If you run out, you move to the next item in the list and keep filling the knapsack with it. This pattern continues until the knapsack is full. It is greedy because, once we sort the list, we know we can confidently add the items fractionally in this order and end up with the greatest possible value in the bag.
By allowing the values and "weights" to be negative, as in this problem, however, the greedy algorithm is no longer correct. It is ruined by the fact that an item can have a negative "weight" and a negative value, resulting in a positive V/W ratio. For example, take the following list of items:
V=-1, W=-1 -> V/W = 1.0
V=.9, W=1 -> V/W = 0.9
V=.8, W=1 -> V/W = 0.8
Following the greedy algorithm, we would want to add as much of item 1 as exists, because it has the best V/W ratio. However, adding item 1 really hurts us in the long run, because we lose more value per unit of weight than we can make up later. For example, assume |W| = 10 for each item and a knapsack capacity of 10. By adding all of item 1, we have a weight of -10 and a value of -10. Then we add all of item 2, which results in a weight of 0 and a value of -1. Then we add all of item 3, which results in a weight of 10 and a value of 7.
If instead we had just added all of item 2 from the start, we would have a weight of 10 and a value of 9. This counterexample shows that when weights and values can be negative, the greedy algorithm is no longer optimal.
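A small Python sketch of the counterexample (greedy_by_ratio is a hypothetical helper implementing the sort-by-ratio rule; items are given as (value per unit, weight per unit, units available)):

    def greedy_by_ratio(items, capacity):
        """Classic fractional-knapsack greedy: best V/W ratio first."""
        total_value = used = 0.0
        for v, w, units in sorted(items, key=lambda it: it[0] / it[1], reverse=True):
            # negative weights free up capacity, which is exactly
            # what misleads the ratio-based rule
            take = units if w <= 0 else min(units, max(0.0, (capacity - used) / w))
            total_value += v * take
            used += w * take
        return total_value

    items = [(-1.0, -1.0, 10), (0.9, 1.0, 10), (0.8, 1.0, 10)]
    print(greedy_by_ratio(items, capacity=10))  # 7.0
    # taking only item 2 yields 9.0, so greedy is suboptimal here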
The Scharr filter is explained in Scharr's dissertation. However, the values given on page 155 (167 in the PDF) are [47 162 47] / 256. Multiplying this smoothing kernel with the derivative filter [-1 0 1] / 2 would yield:
[-47 0 47; -162 0 162; -47 0 47] / 512
Yet all the other references I found use
[-3 0 3; -10 0 10; -3 0 3]
which is roughly the same as the kernel given by Scharr, scaled by a factor of 32.
Now my guess is that the range can be represented better, but I'm curious if there is an official explanation somewhere.
To get the ball rolling on this question in case no "expert" can be found...
I believe the values [3, 10, 3] ... instead of [47 162 47] / 256 ... are used simply for speed. Recall that this method is competing against the Sobel operator, whose coefficient values are 0 and positive/negative 1's and 2's.
Even though the divisor in the division, 256 or 512, is a power of 2 and can be performed by a shift, doing that and multiplying by 47 or 162 is going to take more time. A multiplication by 3, however, can in fact be done on some RISC architectures like the IBM POWER series in a single shift-and-add operation, that is, 3x = (x << 1) + x. (On these architectures, the shifter and adder are separate units and can operate independently.)
I don't find it surprising that the PhD thesis used the more complicated and probably more precise formula; it needed to prove or demonstrate something, and the author probably wasn't overly concerned with how it would be implemented alongside other methods. The purpose in the thesis was probably to achieve "perfect rotational symmetry". Afterwards, when someone decided to implement it, I suspect that person used the approximation and gave up a little of the perfect rotational symmetry in order to gain speed: the goal was to have something competitive, trading a small amount of rotational accuracy for speed.
Since I'm guessing you are willing to do this work as it is for your thesis, my suggestion is to implement the original algorithm and benchmark it against both the OpenCV Scharr and Sobel code, along the lines sketched below.
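A rough benchmarking sketch of that comparison in Python, using OpenCV's built-in operators plus a filter2D call with the dissertation kernel (the /512 normalization is my assumption, folding in the [-1 0 1]/2 derivative):

    import time
    import cv2
    import numpy as np

    img = np.random.rand(1024, 1024).astype(np.float32)

    # dissertation kernel: outer product of the [47 162 47]/256 smoother
    # and the [-1 0 1]/2 derivative (normalization assumed, see above)
    k_diss = np.outer([47, 162, 47], [-1, 0, 1]).astype(np.float32) / 512.0

    def bench(name, fn, reps=20):
        t0 = time.perf_counter()
        for _ in range(reps):
            fn()
        print(name, (time.perf_counter() - t0) / reps, "s per call")

    bench("Sobel ", lambda: cv2.Sobel(img, cv2.CV_32F, 1, 0, ksize=3))
    bench("Scharr", lambda: cv2.Scharr(img, cv2.CV_32F, 1, 0))
    bench("diss. ", lambda: cv2.filter2D(img, cv2.CV_32F, k_diss))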
The other way to try to get an "official" answer is: "Use the source, Luke!" The code is on GitHub, so check it out, see who added the Scharr filter there, and contact that person. I won't put the person's name here, but I will say that the code was added on 2010-05-11.