Logistic Regression - loss function - machine-learning

I tried to substitute p = 1/1+e(-z) back to the first function, but I can't reach the simplify function.
For the middle term, I get the sum of (z(1-t^i)) instead of the sum of (t^i z^i). I don't know what's going wrong.

Since you have not posted your solution,I cannot say what mistake you did commit but I think you need to substitute p(C=0|x,w) and p(C=1|x,w) both and not just p=1/1+e(-z).Here goes the derivation-
substitute p(C=0|x,w) and p(C=1|x,w) in main equation(first)(sum==summation) and z is z^(i)
L=-sum(t^(i)*log(e^-z))+sum(t^(i)*log(1+e^-z))+sum(log(1+e^-z)-sum(t^(i)*log(1+e^-z))
The second and fourth term cancel out.We get
L=-sum(t^(i)*log(e^-z))+sum(log(1+e^-z)
By prop of logarithms-log(e^(-z))=z.So
L=sum(log(1+e^-z)+sum(t^(i)*z)

Related

Maxima: Is there any way to make functions defined within the main function be local, in a similar way to local variables?

I wonder if there is any way to make functions defined within the main function be local, in a similar way to local variables. For example, in this function that calculates the gradient of a scalar function,
grad(var,f) := block([aux],
aux : [gradient, DfDx[i]],
gradient : [],
DfDx[i] := diff(f(x_1,x_2,x_3),var[i],1),
for i in [1,2,3] do (
gradient : append(gradient, [DfDx[i]])
),
return(gradient)
)$
The variable gradient that has been defined inside the main function grad(var,f) has no effect outside the main function, as it is inside the aux list. However, I have observed that the function DfDx, despite being inside the aux list, does have an effect outside the main function.
Is there any way to make the sub-functions defined inside the main function to be local only, in a similar way to what can be made with local variables? (I know that one can kill them once they have been used, but perhaps there is a more elegant way)
To address the problem you are needing to solve here, another way to compute the gradient is to say
grad(var, e) := makelist(diff(e, var1), var1, var);
and then you can say for example
grad([x, y, z], sin(x)*y/z);
to get
cos(x) y sin(x) sin(x) y
[--------, ------, - --------]
z z 2
z
(There isn't a built-in gradient function; this is an oversight.)
About local functions, bear in mind that all function definitions are global. However you can approximate a local function definition via local, which saves and restores all properties of a symbol. Since the function definition is a property, local has the effect of temporarily wiping out an existing function definition and later restoring it. In between you can create a temporary function definition. E.g.
foo(x) := 2*x;
bar(y) := block(local(foo), foo(x) := x - 1, foo(y));
bar(100); /* output is 99 */
foo(100); /* output is 200 */
However, I don't this you need to use local -- just makelist plus diff is enough to compute the gradient.
There is more to say about Maxima's scope rules, named and unnamed functions, etc. I'll try to come back to this question tomorrow.
To compute the gradient, my advice is to call makelist and diff as shown in my first answer. Let me take this opportunity to address some related topics.
I'll paste the definition of grad shown in the problem statement and use that to make some comments.
grad(var,f) := block([aux],
aux : [gradient, DfDx[i]],
gradient : [],
DfDx[i] := diff(f(x_1,x_2,x_3),var[i],1),
for i in [1,2,3] do (
gradient : append(gradient, [DfDx[i]])
),
return(gradient)
)$
(1) Maxima works mostly with expressions as opposed to functions. That's not causing a problem here, I just want to make it clear. E.g. in general one has to say diff(f(x), x) when f is a function, instead of diff(f, x), likewise integrate(f(x), ...) instead of integrate(f, ...).
(2) When gradient and Dfdx are to be the local variables, you have to name them in the list of variables for block. E.g. block([gradient, Dfdx], ...) -- Maxima won't understand block([aux], aux: ...).
(3) Note that a function defined with square brackets instead of parentheses, e.g. f[x] := ... instead of f(x) := ..., is a so-called array function in Maxima. An array function is a memoizing function, i.e. if f[x] is called two or more times, the return value is only computed once, and then returned every time thereafter. Sometimes that's a useful optimization when the domain of the function comprises a finite set.
(4) Bear in mind that x_1, x_2, x_3, are distinct symbols, not related to each other, and not related to x[1], x[2], x[3], even if they are displayed the same. My advice is to work with subscripted symbols x[i] when i is a variable.
(5) About building up return values, try to arrange to compute the whole thing at one go, instead of growing the result incrementally. In this case, makelist is preferable to for plus append.
(6) The return function in Maxima acts differently than in other programming languages; it's a little hard to explain. A function returns the value of the last expression which was evaluated, so if gradient is that last expression, you can just write grad(var, f) := block(..., gradient).
Hope this helps, I know it's obscure and complex. The Maxima programming language was not designed before being implemented, and some of the decisions are clearly questionable at the long interval of more than 50 years (!) later. That's okay, they were figuring it out as they went along. There was not a body of established results which could provide a point of reference; the original authors were contributing to what's considered common knowledge today.

Simplify product of roots containing goniometric functions

In Maxima, I am trying to simplify the expression
sqrt(1 - sin(x)) * sqrt(1 + sin(x))
to yield
cos(x)
I properly restricted the definition of x
declare(x, real) $
assume(x > 0, x < %pi/2) $
and tried several simplification commands including radcan, trigsimp, trigreduce and trigexpand, but without any success. How can this be done?
Try trigsimp(rootscontract(expr))
The restrictions you assert do not uniquely determine the simplified result you request.
It would seem both harmless and obviously unnecessary to declare or assume the following:
declare(9, real)
assume(9>0)
and yet, sqrt(9) is still the set {-3, +3}, mathematically speaking, as opposed to "what I learned in 6th grade".
Stavros' suggestion give |cos(x)|, which is not quite what the original questioner wanted.
Another way of getting the same result, one which may more explicitly exhibit the -in general falseness - of the result, is to square and then use the semi-bogus sqrt, that attempts to pick the positive answer.
trigsimp (sqrt(expand(expr^2)));
If you think this is a way of simplifying expr, note that it changes -3 to 3.

The tensor product ti() in GAM package gives incorrect results

I am surprising to notice that it is somehow difficult to obtain a correct fit of interaction function from gam().
To be more specific, I want to estimate an additive function:
y=m_1(x)+m_2(z)+m_{12}(x,z)+u,
where m_1(x)=x^2, m_2(z)=z^2,m_{12}(x,z)=xz. The following code generate this model:
test1 <- function(x,z,sx=1,sz=1) {
#--m1(x) function
m.x<-x^2
m.x<-m.x-mean(m.x)
#--m2(z) function
m.z<-z^2
m.z<-m.z-mean(m.z)
#--m12(x,z) function
m.xz<-x*z
m.xz<-m.xz-mean(m.xz)
m<-m.x+m.z+m.xz
return(list(m=m,m.x=m.x,m.z=m.z,m.xz=m.xz))
}
n <- 1000
a=0
b=2
x <- runif(n,a,b)/20
z <- runif(n,a,b)
u <- rnorm(n,0,0.5)
model<-test1(x,z)
y <- model$m + u
So I use gam() by fitting the model as
b3 <- gam(y~ ti(x) + ti(z) + ti(x,z))
vis.gam(b3);title("tensor anova")
#---extracting basis matrix
B.f3<-model.matrix.gam(b3)
#---extracting series estimator
b3.hat<-b3$coefficients
Question: when I plot the estimated function by gam()above against its true function, I end up with
par(mfrow=c(1,3))
#---m1(x)
B.x<-B.f3[,c(2:5)]
b.x.hat<-b3.hat[c(2:5)]
plot(x,B.x%*%b.x.hat)
points(x,model$m.x,col='red')
legend('topleft',c('Estimate','True'),lty=c(1,1),col=c('black','red'))
#---m2(z)
B.z<-B.f3[,c(6:9)]
b.z.hat<-b3.hat[c(6:9)]
plot(z,B.z%*%b.z.hat)
points(z,model$m.z,col='red')
legend('topleft',c('Estimate','True'),lty=c(1,1),col=c('black','red'))
#---m12(x,z)
B.xz<-B.f3[,-c(1:9)]
b.xz.hat<-b3.hat[-c(1:9)]
plot(x,B.xz%*%b.xz.hat)
points(x,model$m.xz,col='red')
legend('topleft',c('Estimate','True'),lty=c(1,1),col=c('black','red'))
However, the function estimate of m_1(x) is largely different from x^2, and the interaction function estimate m_{12}(x,z) is also largely different from xz defined in test1 above. The results are the same if I use predict(b3).
I really can't figure it out. Can anybody help me out by explaining why the results end up with this? Greatly appreciate it!
First, the problem of the above issue is not due to the package, of course. It is closely related to the identification conditions of the smooth functions. One common practice is to impose the assumptions that E(mj(.))=0 for all individual function j=1,...,d, and E(m_ij(x_i,x_j)|x_i)=E(m_ij(x_i,x_j)|x_j)=0 for i not equal to j. Those conditions require one to employ centered basis function in series estimator, which has been done already in GAM package. However, in my case above, function m(x,z)=x*z defined in test1 does not satisfy the above identification assumptions, since the integral of x*z with respect to either x or z is not zero when x and z have range from zero to two.
Furthermore, series estimator allows the individual and interaction function to be identified if one impose m(0)=0 or m(0,x_j)=m(x_i,0)=0. This can be readily achieved if we center the basis function around zero. I have tried both cases, and they work well whenever DGP satisfies the identification conditions.

Recurrence Relation tree method

I am currently having issues with figuring our some recurrence stuff and since I have midterms about it coming up soon I could really use some help and maybe an explanation on how it works.
So I basically have pseudocode for solving the Tower of Hanoi
TOWER_OF_HANOI ( n, FirstRod, SecondRod, ThirdRod)
if n == 1
move disk from FirstRod to ThirdRod
else
TOWER_OF_HANOI(n-1, FirstRod, ThirdRod, SecondRod)
move disk from FirstRod to ThirdRod
TOWER_OF_HANOI(n-1, SecondRod, FirstRod, ThirdRod)
And provided I understand how to write the relation (which, honestly I'm not sure I do...) it should be T(n) = 2T(n-1)+Ɵ(n), right? I sort of understand how to make a tree with fractional subproblems, but even then I don't fully understand the process that would give you the end solution of Ɵ(n) or Ɵ(n log n) or whatnot.
Thanks for any help, it would be greatly appreciated.
Assume the time complexity is T(n), it is supposed to be: T(n) = T(n-1) + T(n-1) + 1 = 2T(n-1) + 1. Why "+1" but not "+n"? Since "move disk from FirstRod to ThirdRod" costs you only one move.
For T(n) = 2T(n-1) + 1, its recursion tree will exactly look like this:
https://www.quora.com/What-is-the-complexity-of-T-n-2T-n-1-+-C (You might find it helpful, the image is neat.) C is a constant; it means the cost per operation. In the case of Tower of Hanoi, C = 1.
Calculate the sum of the cost each level, you will easily find out in this case, the total cost will be 2^n-1, which is exponential(expensive). Therefore, the answer of this recursion equation is Ɵ(2^n).

Can a SHA-1 hash be all-zeroes?

Is there any input that SHA-1 will compute to a hex value of fourty-zeros, i.e. "0000000000000000000000000000000000000000"?
Yes, it's just incredibly unlikely. I.e. one in 2^160, or 0.00000000000000000000000000000000000000000000006842277657836021%.
Also, becuase SHA1 is cryptographically strong, it would also be computationally unfeasible (at least with current computer technology -- all bets are off for emergent technologies such as quantum computing) to find out what data would result in an all-zero hash until it occurred in practice. If you really must use the "0" hash as a sentinel be sure to include an appropriate assertion (that you did not just hash input data to your "zero" hash sentinel) that survives into production. It is a failure condition your code will permanently need to check for. WARNING: Your code will permanently be broken if it does.
Depending on your situation (if your logic can cope with handling the empty string as a special case in order to forbid it from input) you could use the SHA1 hash ('da39a3ee5e6b4b0d3255bfef95601890afd80709') of the empty string. Also possible is using the hash for any string not in your input domain such as sha1('a') if your input has numeric-only as an invariant. If the input is preprocessed to add any regular decoration then a hash of something without the decoration would work as well (eg: sha1('abc') if your inputs like 'foo' are decorated with quotes to something like '"foo"').
I don't think so.
There is no easy way to show why it's not possible. If there was, then this would itself be the basis of an algorithm to find collisions.
Longer analysis:
The preprocessing makes sure that there is always at least one 1 bit in the input.
The loop over w[i] will leave the original stream alone, so there is at least one 1 bit in the input (words 0 to 15). Even with clever design of the bit patterns, at least some of the values from 0 to 15 must be non-zero since the loop doesn't affect them.
Note: leftrotate is circular, so no 1 bits will get lost.
In the main loop, it's easy to see that the factor k is never zero, so temp can't be zero for the reason that all operands on the right hand side are zero (k never is).
This leaves us with the question whether you can create a bit pattern for which (a leftrotate 5) + f + e + k + w[i] returns 0 by overflowing the sum. For this, we need to find values for w[i] such that w[i] = 0 - ((a leftrotate 5) + f + e + k)
This is possible for the first 16 values of w[i] since you have full control over them. But the words 16 to 79 are again created by xoring the first 16 values.
So the next step could be to unroll the loops and create a system of linear equations. I'll leave that as an exercise to the reader ;-) The system is interesting since we have a loop that creates additional equations until we end up with a stable result.
Basically, the algorithm was chosen in such a way that you can create individual 0 words by selecting input patterns but these effects are countered by xoring the input patterns to create the 64 other inputs.
Just an example: To make temp 0, we have
a = h0 = 0x67452301
f = (b and c) or ((not b) and d)
= (h1 and h2) or ((not h1) and h3)
= (0xEFCDAB89 & 0x98BADCFE) | (~0x98BADCFE & 0x10325476)
= 0x98badcfe
e = 0xC3D2E1F0
k = 0x5A827999
which gives us w[0] = 0x9fb498b3, etc. This value is then used in the words 16, 19, 22, 24-25, 27-28, 30-79.
Word 1, similarly, is used in words 1, 17, 20, 23, 25-26, 28-29, 31-79.
As you can see, there is a lot of overlap. If you calculate the input value that would give you a 0 result, that value influences at last 32 other input values.
The post by Aaron is incorrect. It is getting hung up on the internals of the SHA1 computation while ignoring what happens at the end of the round function.
Specifically, see the pseudo-code from Wikipedia. At the end of the round, the following computation is done:
h0 = h0 + a
h1 = h1 + b
h2 = h2 + c
h3 = h3 + d
h4 = h4 + e
So an all 0 output can happen if h0 == -a, h1 == -b, h2 == -c, h3 == -d, and h4 == -e going into this last section, where the computations are mod 2^32.
To answer your question: nobody knows whether there exists an input that produces all zero outputs, but cryptographers expect that there are based upon the simple argument provided by daf.
Without any knowledge of SHA-1 internals, I don't see why any particular value should be impossible (unless explicitly stated in the description of the algorithm). An all-zero value is no more or less probable than any other specific value.
Contrary to all of the current answers here, nobody knows that. There's a big difference between a probability estimation and a proof.
But you can safely assume it won't happen. In fact, you can safely assume that just about ANY value won't be the result (assuming it wasn't obtained through some SHA-1-like procedures). You can assume this as long as SHA-1 is secure (it actually isn't anymore, at least theoretically).
People doesn't seem realize just how improbable it is (if all humanity focused all of it's current resources on finding a zero hash by bruteforcing, it would take about xxx... ages of the current universe to crack it).
If you know the function is safe, it's not wrong to assume it won't happen. That may change in the future, so assume some malicious inputs could give that value (e.g. don't erase user's HDD if you find a zero hash).
If anyone still thinks it's not "clean" or something, I can tell you that nothing is guaranteed in the real world, because of quantum mechanics. You assume you can't walk through a solid wall just because of an insanely low probability.
[I'm done with this site... My first answer here, I tried to write a nice answer, but all I see is a bunch of downvoting morons who are wrong and can't even tell the reason why are they doing it. Your community really disappointed me. I'll still use this site, but only passively]
Contrary to all answers here, the answer is simply No.
The hash value always contains bits set to 1.

Resources