forward backward algorithm pseudocode clarification - machine-learning

I had a question regarding the pseudocode on the Wikipedia page (https://en.wikipedia.org/wiki/Forward%E2%80%93backward_algorithm#Python_example) for the forward–backward algorithm. Specifically, what is the purpose of this section of code:
# merging the two parts
posterior = []
for i in range(L):
    posterior.append({st: fwd[i][st] * bkw[i][st] / p_fwd for st in states})

This might be a little late.
The goal of this algorithm is to perform inference on the data given the model. That loop computes the posterior probability P(X_i = st | o_1, ..., o_L) of each hidden state st at each position i, i.e. the probability of the state given the entire observation sequence. The product fwd[i][st] * bkw[i][st] is the joint probability of being in state st at step i and observing the whole sequence, and dividing by p_fwd (the total probability of the observations) normalises it into a posterior, which can help in deciding the next step of your procedure.
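For concreteness, here is a minimal, self-contained sketch of just that merging step on a toy two-state model. The fwd and bkw values below are made-up placeholders chosen to be mutually consistent, not the output of a real forward or backward pass:

states = ("Healthy", "Fever")
L = 2  # length of the observation sequence

# Pretend these came from the forward and backward passes.
fwd = [{"Healthy": 0.30, "Fever": 0.10}, {"Healthy": 0.08, "Fever": 0.04}]
bkw = [{"Healthy": 0.30, "Fever": 0.30}, {"Healthy": 1.00, "Fever": 1.00}]
p_fwd = sum(fwd[-1][st] for st in states)  # P(o_1, ..., o_L) = 0.12

posterior = []
for i in range(L):
    # P(X_i = st | o_1, ..., o_L) = fwd[i][st] * bkw[i][st] / P(o)
    posterior.append({st: fwd[i][st] * bkw[i][st] / p_fwd for st in states})

print(posterior)  # each dict sums to 1, as a posterior should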
Hope that has helped.

Related

Can a linear classifier represent logic gates requiring more than one operator?

For example, how do you get the 0/1 truth table that represents, say, a OR (b AND NOT c)? I understand the individual logic gates, but how do you put the pieces together? I've tried, but to no avail.
Check this example. You can break your expression into a few steps so each step is easy to understand.
I'm not aware of a general method to find a linear decision function. What you can do is train a linear classifier on the Boolean truth table and see the result.
A possible decision function for OR could be:
if A + B - 0.5 > 0:
    return 1
else:
    return 0
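To answer the composition question directly, here is one possible way (a sketch, not the only choice of weights) to build a OR (b AND NOT c) out of such linear threshold units:

def step(z):
    # Linear threshold unit: fire iff the weighted sum is positive.
    return 1 if z > 0 else 0

def not_gate(c):
    return step(0.5 - c)          # 1 exactly when c == 0

def and_gate(x, y):
    return step(x + y - 1.5)      # 1 only when both inputs are 1

def or_gate(x, y):
    return step(x + y - 0.5)      # 1 when at least one input is 1

# Truth table for a OR (b AND NOT c):
for a in (0, 1):
    for b in (0, 1):
        for c in (0, 1):
            print(a, b, c, '->', or_gate(a, and_gate(b, not_gate(c))))

Note that each gate on its own is linear, but the composition is no longer a single linear classifier, which is consistent with the answer above.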
Here is an interesting link related to your question:
It is possible to use neural networks to represent some logic gates like XOR.
https://towardsdatascience.com/how-neural-networks-solve-the-xor-problem-59763136bdd7
Basically, this link explains the methodology from the book Deep Learning, p. 171 (Ian Goodfellow, Yoshua Bengio, Aaron Courville), where the authors try to learn the XOR function. XOR is not linearly separable: no matter how you place a single decision boundary, some point will be mislabelled (points cannot sit on the decision boundary), as you can see in the first image, "XOR with decision boundary attempt". The authors then train a two-layer network with an MSE loss.
However, I think this question is more directed toward the math Stack Exchange. Here is a link if you want to learn more: https://datascience.stackexchange.com/questions/11589/creating-neural-net-for-xor-function/19799#19799
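For reference, the book also gives a closed-form two-layer ReLU network that computes XOR exactly, with no training required; here is that solution written out as a quick check (hand-picked weights as given in section 6.1 of the book):

import numpy as np

# Hand-picked weights from Deep Learning, section 6.1.
W = np.array([[1.0, 1.0], [1.0, 1.0]])  # input -> hidden
c = np.array([0.0, -1.0])               # hidden biases
w = np.array([1.0, -2.0])               # hidden -> output

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
h = np.maximum(0.0, X @ W + c)          # ReLU hidden layer
print(h @ w)                            # [0. 1. 1. 0.] == XOR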

How do sample weights work in classification models?

What does it mean to provide weights to each sample in classification? How does a classification algorithm like logistic regression or SVMs use weights to emphasise certain examples more than others? I would love to go into the details of how these algorithms leverage sample weights.
If you look at the sklearn documentation for logistic regression, you can see that the fit function has an optional sample_weight parameter which is defined as an array of weights assigned to individual samples.
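For example, a quick sketch of that parameter in use (the data here is made up purely for illustration):

import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
# Make the last sample count five times as much as the others.
weights = np.array([1.0, 1.0, 1.0, 5.0])

clf = LogisticRegression().fit(X, y, sample_weight=weights)
print(clf.predict_proba(X))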
This option is often used for imbalanced datasets. Let's take an example: I have a lot of data and some of it is just noise, but other points are really important to me, and I'd like my algorithm to consider them much more than the rest. So I assign those points a higher weight to make sure they are treated accordingly.
It changes the way the loss is calculated: each point's error (residual) is multiplied by that point's weight, so the minimum of the objective function is shifted towards fitting the heavily weighted points. I hope that's clear enough. I don't know if you're familiar with the math behind it, so here is a small introduction to have everything at hand (apologies if this was not needed):
https://perso.telecom-paristech.fr/rgower/pdf/M2_statistique_optimisation/Intro-ML-expanded.pdf
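As a concrete illustration of what "multiplying the error by the weight" means, here is a minimal sketch of a per-sample weighted logistic loss (an illustration of the idea only, not scikit-learn's actual implementation):

import numpy as np

def weighted_log_loss(params, X, y, sample_weight):
    # Labels y are 0/1; params are the coefficients being optimised.
    p = 1.0 / (1.0 + np.exp(-(X @ params)))      # sigmoid predictions
    per_sample = -(y * np.log(p) + (1 - y) * np.log(1 - p))
    # Each point's loss is scaled by its weight before averaging, so
    # heavily weighted points shift the minimum of the objective.
    return np.average(per_sample, weights=sample_weight)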
See a good explanation here: https://www.kdnuggets.com/2019/11/machine-learning-what-why-how-weighting.html .

How is the equation in “Evolution Strategies as a Scalable Alternative to Reinforcement Learning” derived?

In the OpenAI paper "Evolution Strategies as a Scalable Alternative to Reinforcement Learning", how is the equation on page 3 derived?
It's not "derived," in the sense that this equation was not a natural progression from the previous equation presented in the paper.
This formula demonstrates how the authors chose to apply stochastic gradient ascent. It is a mathematical representation of the algorithm they used.
Right below that equation, they explain how it works:
The resulting algorithm repeatedly executes two phases: 1) Stochastically perturbing the parameters of the policy and evaluating the resulting parameters by running an episode in the environment, and 2) Combining the results of these episodes, calculating a stochastic gradient estimate, and updating the parameters.
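In code, that two-phase loop looks roughly like the sketch below. F is a stand-in for the episodic return; here a toy objective replaces actually running an episode in the environment:

import numpy as np

def F(theta):
    return -np.sum(theta ** 2)   # toy "return", maximised at theta = 0

theta = np.random.randn(5)
alpha, sigma, n = 0.05, 0.1, 50

for step in range(200):
    eps = np.random.randn(n, theta.size)                     # phase 1: perturb
    returns = np.array([F(theta + sigma * e) for e in eps])  # evaluate
    # Phase 2: combine episodes into a stochastic gradient estimate,
    # theta <- theta + alpha * (1 / (n * sigma)) * sum_i F_i * eps_i
    grad = (returns[:, None] * eps).sum(axis=0) / (n * sigma)
    theta += alpha * grad

print(theta)  # close to the optimum at zero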
It might help to restart the paper from the beginning and read very slowly and carefully. If you come across anything that doesn't make sense, look it up, and don't continue reading until you understand what the authors are trying to tell you.

What is actually the purpose of the sigmoid derivative?

Honestly, I'm learning about neural networks and I have a question about the activation part.
I know the question is general and there are a lot of explanations around the internet, but I still don't understand it clearly.
Why do we need to differentiate the sigmoid function? Why don't we just use it as it is?
It would be good if you could give a clear explanation. Thank you.
I've seen many videos on YouTube and read many articles about it, but I still don't get it.
Thanks for your help.
Your question is not entirely clear, but I assume you are asking: "Why don't we just use the sigmoid function without having to calculate its derivative?"
Your question is also very broad, so my answer is necessarily broad and wordy; you will need to read more to understand all the details, for which I'll try to provide links.
Activation function: as the name suggests, we want to know whether a given node is "on" or "off", and the sigmoid function provides an easy way to squash a continuous input (x) into the range (0, 1).
Use cases vary and the sigmoid has particular properties, which is why there are many alternative activation functions, like tanh, ReLU, etc. Read more here: https://en.wikipedia.org/wiki/Sigmoid_function
Differentiate (take the derivative): for most models we want to find the best-fit parameters of the network. To do this, we typically minimise a "cost" function that describes how good the model is at predicting the observed data. One way to solve this optimisation problem is gradient descent: each step updates the parameters by following the slope of the cost function through the multi-dimensional parameter space. Backpropagation computes that slope with the chain rule, which involves the derivative of every activation function along the way; this is why the activation functions you use must (in most cases) be differentiable.
Read more here: https://en.wikipedia.org/wiki/Gradient_descent
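To make that concrete, here is a minimal sketch showing the sigmoid, its derivative sigma'(z) = sigma(z) * (1 - sigma(z)), and where the derivative actually appears in a single-weight gradient descent step (toy numbers, not a full network):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)        # derivative of the sigmoid

# One weight, one sample: prediction p = sigmoid(w * x),
# squared-error cost C = (p - y)^2.
w, x, y, lr = 0.5, 2.0, 1.0, 0.5
for _ in range(100):
    p = sigmoid(w * x)
    # Chain rule: dC/dw = dC/dp * dp/dz * dz/dw -- the middle factor
    # is exactly the sigmoid derivative, which is why we need it.
    grad = 2 * (p - y) * sigmoid_prime(w * x) * x
    w -= lr * grad              # gradient descent update

print(sigmoid(w * x))           # prediction has moved towards y = 1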
If you have a deeper question, I suggest you take it to one of the machine learning Stack Exchange sites.

logistic regression cost function scikit learn

So I've done a Coursera ML course, and now I am looking at scikit-learn's logistic regression, which is a little bit different. I've been using the sigmoid function with a cost function split into two separate cases, y = 0 and y = 1. But scikit-learn has a single expression (I've found out that this is the generalised logistic function), which really doesn't make sense to me.
http://scikit-learn.org/stable/_images/math/760c999ccbc78b72d2a91186ba55ce37f0d2cf37.png
(the image shows $\min_{w,c} \frac{1}{2} w^T w + C \sum_{i=1}^{n} \log(\exp(-y_i (X_i^T w + c)) + 1)$)
I am sorry, I don't have enough reputation to post the image.
So my main concern with the function is the case y = 0: the cost function then always contains this $\log(e^0 + 1)$ term, no matter what X or w are. Can someone explain that to me?
If I am correct, the $y_i$ you have in this formula can only assume the values -1 or 1 (as derived in the link given by #jinyu0310). Normally you use this cost function (with regularisation):
$J(w) = -\frac{1}{m}\sum_{i=1}^{m}\left[y_i \log h_w(x_i) + (1 - y_i)\log(1 - h_w(x_i))\right] + \frac{\lambda}{2m}\sum_j w_j^2$
(sorry for giving the equation that was originally inserted as an image; it comes from goo.gl/M53u44, since LaTeX could not be used there)
In that formulation $y_i \in \{0, 1\}$, so you always have two terms, one playing a role when $y_i = 0$ and the other when $y_i = 1$. In scikit-learn's formulation $y_i \in \{-1, 1\}$ instead, so $y_i$ is never 0 and the single term $\log(e^{-y_i(X_i^T w + c)} + 1)$ covers both cases. I am trying to find a better explanation of the notation scikit-learn uses in this formula, but so far no luck.
Hope that helps. It is just another way of writing the cost function, with a regularisation factor. Keep in mind also that a constant factor in front of everything plays no role in the optimisation: you want to find the minimum, and overall factors that multiply everything do not move it.
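You can also check numerically that the two ways of writing the loss agree, once the labels are mapped between the {0, 1} and {-1, 1} conventions (a quick sanity-check sketch):

import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=5)               # raw scores X_i^T w + c
y_pm = rng.choice([-1, 1], size=5)   # scikit-learn's convention
y01 = (y_pm + 1) // 2                # the 0/1 convention

p = 1.0 / (1.0 + np.exp(-z))         # sigmoid
two_case = -(y01 * np.log(p) + (1 - y01) * np.log(1 - p))
one_term = np.log1p(np.exp(-y_pm * z))

print(np.allclose(two_case, one_term))  # True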
