Negative Binomial Deviance Calculation in H2O - glm

I've been looking at the deviance calculation for the negative binomial model in H2O (code line 580/959) and I'm struggling to see why it is 0 when yr or ym is 0.
(yr==0||ym==0)?0:2*((_invTheta+yr)*Math.log((1+_theta*ym)/(1+_theta*yr))+yr*Math.log(yr/ym))
The formula for the deviance calculation is given below (from the H2O documentation):
Working through the math, I don't see how the deviance can be 0 unless both yr and ym are 0.
Does anyone happen to know of a special case where the deviance for the negative binomial needs to be set to 0 when either yr or ym is 0?
Thanks!

I'm not sure, but it seems to me they may have just chosen the lazy way out of a numerical difficulty.
mu = 0 (ym) is a degenerate case where p = 0 and so y = 0 always. It's not interesting, and not really part of any useful analysis. I'm not sure it can even arise from the linear predictor: with the natural parameter equal to the linear predictor, you would need the linear predictor to be minus infinity...
However, y can be 0 for other mu's. What you do in that case is take the limit of the deviance as y -> 0, which is well defined for the negative binomial and is not equal to 0. They could have implemented it but chose not to, which is why I call it "lazy".
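For reference, the y -> 0 limit the answer describes could be implemented as in the following Python sketch (H2O's actual code is Java; `nb_deviance` and its argument names are made up here):

```python
import math

def nb_deviance(y, mu, theta, eps=1e-10):
    """Per-observation negative binomial deviance, handling the y -> 0 limit.

    As y -> 0 the term y*log(y/mu) vanishes, so the deviance tends to
    (2/theta)*log(1 + theta*mu), which is generally nonzero - unlike the
    shortcut of returning 0 discussed above.
    """
    if mu <= eps:   # degenerate case: mu = 0 forces y = 0
        return 0.0
    if y <= eps:    # limit of the deviance as y -> 0
        return (2.0 / theta) * math.log(1.0 + theta * mu)
    inv_theta = 1.0 / theta
    return 2.0 * ((inv_theta + y) * math.log((1.0 + theta * mu) / (1.0 + theta * y))
                  + y * math.log(y / mu))
```

Note that when y == mu both log terms vanish, so a perfect fit still gives a deviance of 0, as expected.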

Related

Is there any method to approximate the softmax probability under special conditions?

I'm trying to find an approach to compute the softmax probability without using exp().
Assume that:
target: compute f(x1, x2, x3) = exp(x1)/[exp(x1)+exp(x2)+exp(x3)]
conditions:
1. -64 < x1, x2, x3 < 64
2. the result is kept to just 3 decimal places.
Is there any way to find a polynomial that approximately represents the result under such conditions?
My understanding of Softmax probability
The output of a neural network (NN) is not very discriminating. For example, if I have 3 classes, the NN output for the correct class may be some value a and for the others b and c, such that a > b, a > c. But if we apply the softmax trick, then after the transformation, firstly, a + b + c = 1, which makes the output interpretable as a probability. Secondly, a >>> b, a >>> c, so we are now much more confident.
So how to go further
To get the first advantage, it is sufficient to use
f(x1)/[f(x1)+f(x2)+f(x3)]
(equation 1)
for any function f(x)
Softmax chooses f(x) = exp(x). But since you are not comfortable with exp(x), you can choose, say, f(x) = x^2.
I give some plots below which have a profile similar to the exponential; you may choose from them or use some similar function. To tackle the negative range, you may add a bias of 64 to the output.
Please note that the denominator is just a constant and need not be computed exactly. For simplicity you can just use the following instead of equation 1:
[f(x)] / [3*f(xmax)]
In your case, xmax = 64 + bias (if you choose to use one).
Regards.
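As an illustration of equation 1 with f(x) = (x + 64)^2, here is a small Python sketch (`poly_softmax` is a made-up name, and the squared-plus-bias function is just one of the exp-free choices suggested above):

```python
import math

def softmax(xs):
    """Standard softmax, shifted by the max for numerical stability."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def poly_softmax(xs, bias=64.0):
    """Exp-free alternative (equation 1 with f(x) = (x + bias)^2).

    The bias shifts inputs from (-64, 64) into a positive range so that
    f is monotonically increasing over the whole domain.
    """
    fs = [(x + bias) ** 2 for x in xs]
    s = sum(fs)
    return [f / s for f in fs]
```

Like softmax, the result sums to 1 and preserves the ordering of the inputs, but the distribution is much less peaked than with exp(), so the "confidence boosting" effect is weaker.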

How should zero standard deviation in one of the features be handled in multi-variate gaussian distribution

I am using a multivariate Gaussian distribution to detect abnormalities.
This is how the training set looks:
19-04-16 05:30:31 1 0 0 377816 305172 5567044 0 0 0 14 62 75 0 0 100 0 0
<Date> <time> <--------------------------- ------- Features --------------------------->
Let's say one of the above features does not change; it remains zero.
Calculating the mean mu as
mu = mean(X)'
Calculating sigma2 as
sigma2 = ((1/m) * (sum((X - mu') .^ 2)))'
The probability of each individual feature in the data set is calculated using the standard Gaussian formula.
For a particular feature, if all values come out to be zero, then the mean (mu) is also zero, and subsequently sigma2 will also be zero.
Thereby, when I calculate the probability through the Gaussian distribution, I get a divide-by-zero problem.
However, in test sets this feature's value can fluctuate, and I would like to flag that as an abnormality. How should this be handled? I don't want to ignore such a feature.
So - the problem occurs every time you have a variable which is constant. But then approximating it by a normal distribution makes absolutely no sense. All the information about such a variable is contained in a single value - and this is the intuition behind why the division-by-zero phenomenon occurs.
If you know that there are fluctuations in this variable that are not observed in the training set, you could simply require the variance of that variable to be no less than a certain value. Apply max(variance(X), eps) instead of the classic variance definition. Then you can be sure that no division by zero occurs.
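A minimal NumPy sketch of the max(variance(X), eps) idea (the function names here are made up for illustration):

```python
import numpy as np

def safe_variance(X, eps=1e-8):
    """Per-feature variance with a floor, so constant features do not
    cause division by zero in the Gaussian density."""
    var = X.var(axis=0)
    return np.maximum(var, eps)

def gaussian_log_pdf(x, mu, var):
    """Log of the univariate Gaussian density, evaluated feature-wise."""
    return -0.5 * (np.log(2.0 * np.pi * var) + (x - mu) ** 2 / var)
```

With the floored variance, a constant training feature that suddenly fluctuates at test time produces an extremely low (but finite) probability, so it is flagged as abnormal rather than crashing the calculation.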

How to use previous values of calculation as the initial conditions in ABAQUS

I am trying to implement a subroutine in ABAQUS.
It is a very simple nonlinear elastic model, in which the Young's modulus depends on the mean pressure; in detail, E = 3*(1-2*poisson)*p/kap (where poisson = 0.3 is Poisson's ratio and kap = 0.005 is the swelling index). The initial stress is 1e5 Pa for sigma11, 22 and 33.
When I run the subroutine, it gives linear behavior with E = 3*(1-2*0.3)*(3*1e5/3)/0.005 (which is the Young's modulus calculated with the initial stress). If the initial stress is 0 for all components, it gives 0 for the whole calculation, because E = 3*(1-2*0.3)*(3*0/3)/0.005 = 0.
I would like to ask if you could help me solve this problem (define the initial conditions as the previous values of each variable).
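The ABAQUS side of this would live in a Fortran user subroutine, but the update logic the question is after - recomputing E from the current stress at every increment, instead of freezing it at the initial stress - can be sketched in Python (a hypothetical 1-D strain-driven loop, using the poisson and kap values from the question):

```python
def step(sigma, deps, poisson=0.3, kap=0.005):
    """One 1-D strain increment of the pressure-dependent elastic model.

    E is recomputed from the *current* stress (the state carried over from
    the previous increment), not frozen at the initial stress. In an ABAQUS
    user subroutine this state would come from the stress/state arrays
    passed in at each increment.
    """
    p = sigma  # 1-D stand-in for the mean pressure (sigma11+sigma22+sigma33)/3
    E = 3.0 * (1.0 - 2.0 * poisson) * p / kap  # pressure-dependent modulus
    return sigma + E * deps

# Starting from the nonzero initial stress of 1e5 Pa, E changes at every
# increment, giving nonlinear behavior. Starting from 0 stress, E = 0 and
# the stress stays at 0 forever - the degenerate case described above.
s = 1e5
for _ in range(10):
    s = step(s, 1e-5)
```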

How to check if gradient descent with multiple variables converged correctly?

In linear regression with 1 variable I can clearly see the prediction line on a plot and check whether it properly fits the training data: I just plot the variable against the output and construct the prediction line from the found values of Theta 0 and Theta 1.
But how can I check the validity of gradient descent results when it is implemented on multiple variables/features, for example when the number of features is 4 or 5? How do I check that it works correctly and that the found values of all the thetas are valid? Do I have to rely only on the cost function plotted against the number of iterations carried out?
Gradient descent converges to a local minimum, meaning that the first derivative (the gradient) should be zero and the second derivative (the Hessian) positive semidefinite. Checking these two will tell you whether the algorithm has converged.
We can think of gradient descent as solving the problem f'(x) = 0, where f' denotes the gradient of f. To check convergence of this problem, the standard approach, as far as I know, is to calculate the discrepancy on each iteration and see whether it converges to 0.
That is, check if ||f'(x)|| (or its square) converges to 0.
There are some things you can try.
1) Check whether your cost/energy function has stopped improving as the iterations progress. Use something like abs(E_after - E_before) < 0.00001*E_before, i.e. check whether the relative difference is very low.
2) Check whether your variables have stopped changing. You can adopt a very similar strategy to the one above.
There is actually no perfect way to fully make sure that your function has converged, but the things mentioned above are what people usually try.
Good luck!
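The checks described in the answers above (gradient norm and change in the variables) can be sketched as follows (a hypothetical implementation, not from any particular library):

```python
import numpy as np

def gradient_descent(grad, x0, lr=0.1, tol=1e-8, max_iter=10_000):
    """Gradient descent with the two convergence checks described above:
    1) the discrepancy ||f'(x)|| is close to 0,
    2) the variables have (nearly) stopped changing.
    """
    x = np.asarray(x0, dtype=float)
    for i in range(max_iter):
        g = grad(x)
        x_new = x - lr * g
        if (np.linalg.norm(g) < tol
                or np.linalg.norm(x_new - x) < tol * (1.0 + np.linalg.norm(x))):
            return x_new, i
        x = x_new
    return x, max_iter

# Quadratic bowl f(x) = ||x - 1||^2, whose gradient is 2*(x - 1);
# the minimizer is the all-ones vector, regardless of dimensionality.
x_min, iters = gradient_descent(lambda x: 2.0 * (x - np.ones(4)), np.zeros(4))
```

The point of the 4-dimensional example is that neither check relies on plotting: both generalize to any number of features, which is what the question asks about.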

Classifying Output of a Network

I made a network that predicts either 1 or 0. I'm now working on the ROC curve of that network, for which I have to find the TN, FN, TP and FP counts. When the output of my network is >= 0.5 with a desired output of 1, I classified it as a true positive. And when it is >= 0.5 with a desired output of 0, I classified it as a false positive. Is that the right thing to do? I just want to make sure my understanding is correct.
It all depends on how you use your network: true/false positives/negatives are a way of analysing the results of your classification, not the internals of the network. From what you have written I assume that you have a network with one output node, which can yield values in [0,1]. If you use your model so that a value bigger than 0.5 means output 1 and otherwise 0, then yes, you are correct. In general, you should consider what the "interpretation" of your output is and simply apply the definitions of TP, FN, etc., which can be summarized as follows:
              your network
                1     0
truth    1     TP    FN
         0     FP    TN
I referred to "interpretation" because in fact you are always using some function g(output) which returns the predicted class number. In your case it is simply g(output) = 1 iff output >= 0.5, but in a multi-class problem it would probably be g(output) = argmax(output) - though it does not have to be; in particular, what about "draws" (when two or more neurons have the same value)? For calculating true/false positives/negatives you should only ever consider the final classification. As a result, you are measuring the quality of the model and the learning process together with this "interpretation" g.
It should also be noted that the concept of "positive" and "negative" classes is often ambiguous. In problems like detection of some object/event it is quite clear that an "occurrence" is a positive event and a "lack of" is negative, but in many others - like, for example, gender classification - there is no clear interpretation. In such cases one should carefully choose the metrics used, as some of them are biased towards positive (or negative) examples (for example, precision considers neither true nor false negatives).
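The thresholding and counting described above can be sketched in Python (`confusion_counts` is a made-up name):

```python
def confusion_counts(outputs, targets, threshold=0.5):
    """TP/FP/TN/FN counts for a single-output network, using the
    interpretation g(output) = 1 iff output >= threshold."""
    tp = fp = tn = fn = 0
    for out, t in zip(outputs, targets):
        pred = 1 if out >= threshold else 0
        if pred == 1 and t == 1:
            tp += 1          # predicted 1, truth 1
        elif pred == 1 and t == 0:
            fp += 1          # predicted 1, truth 0
        elif pred == 0 and t == 0:
            tn += 1          # predicted 0, truth 0
        else:
            fn += 1          # predicted 0, truth 1
    return tp, fp, tn, fn
```

For an ROC curve, this function would be re-run over a range of threshold values, tracing out the true-positive rate against the false-positive rate.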
