While going through Andrew Ng's Coursera course on machine learning, I found that with a quadratic regression equation the predicted price of a house might go down beyond a certain value of x. Can anyone explain why that is so?
Andrew Ng is trying to show that a Quadratic function doesn't really make sense to represent the price of houses.
This is what the graph of a quadratic function might look like:
The values of a, b and c were chosen randomly for this example.
As you can see in the figure, the graph first rises to a maximum and then begins to dip. This isn't representative of the real world, since the price of a house wouldn't normally come down as the house gets larger.
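To make the dip concrete, here is a minimal sketch. The coefficients are hypothetical (picked only so that a is negative, as in the figure described above), and `quadratic` and `turning_point` are illustrative names. A quadratic a*x^2 + b*x + c with a < 0 peaks at x = -b / (2a) and falls on both sides of it.

```python
def quadratic(x, a, b, c):
    """Predicted price for house size x under a quadratic hypothesis."""
    return a * x**2 + b * x + c


def turning_point(a, b):
    """x-coordinate where the parabola stops rising (for a < 0)."""
    return -b / (2 * a)


# Hypothetical coefficients; a < 0 gives a parabola that opens downward.
a, b, c = -0.5, 40.0, 100.0

x_peak = turning_point(a, b)
price_before = quadratic(20, a, b, c)
price_at_peak = quadratic(x_peak, a, b, c)
price_after = quadratic(60, a, b, c)

print(x_peak, price_before, price_at_peak, price_after)
```

Past `x_peak` every extra unit of size lowers the predicted price, which is the behaviour the course flags as unrealistic.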
He recommends that we use a different polynomial function to represent this problem better, such as the cubic function.
The values of a, b, c and d were chosen randomly for this example.
In reality, we would use a different method altogether for choosing the best polynomial function to fit a problem. We would try different polynomial functions on a cross-validation dataset and have an algorithm choose the best-suited one. We could also manually choose a polynomial function for a dataset if we already know the trend that our data should follow (due to prior mathematical or physical knowledge).
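The selection procedure can be sketched roughly as follows. This is only an illustration under assumptions: the data are synthetic (generated from a made-up cubic plus noise), a single held-out validation split stands in for a full cross-validation set, and NumPy's `polyfit` is used instead of gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: a made-up cubic trend plus a little noise.
x = np.linspace(0, 3, 60)
y = 1.0 + 2.0 * x - 1.5 * x**2 + 0.5 * x**3 + rng.normal(0, 0.1, x.size)

# Hold out a third of the points for validation.
idx = rng.permutation(x.size)
train, val = idx[:40], idx[40:]


def validation_error(degree):
    """Fit a polynomial on the training split, score it on the held-out split."""
    coeffs = np.polyfit(x[train], y[train], degree)
    pred = np.polyval(coeffs, x[val])
    return float(np.mean((pred - y[val]) ** 2))


errors = {degree: validation_error(degree) for degree in range(1, 7)}
best_degree = min(errors, key=errors.get)
print(errors, best_degree)
```

The degree with the lowest held-out error is kept; the training error alone would always favour the highest degree.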
I've been studying the Machine Learning course on Coursera by Andrew Ng.
I've solved the programming assignment given at the end of week 3. But there was something I was wondering about, and I do not want to skip things that I cannot explain to myself.
Why do we omit coefficients of polynomial terms when feature mapping?
I quote from the assignment's PDF (which you can view from the link).
2.2 Feature mapping
One way to fit the data better is to create more features from each data
point. In the provided function mapFeature.m, we will map the features into
all polynomial terms of $x_1$ and $x_2$ up to the sixth power.
As you can see coefficients are not included.
Considering I am not good at math I think we found all the polynomial terms doing:
since here we would get coefficients after we take the powers, why do we omit them in the vector?
By coefficients I mean coefficients of the terms
instead of in the vector.
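For reference, the mapping quoted from the assignment can be sketched in Python. This is a hedged re-implementation, not the course's `mapFeature.m` itself; `map_feature` is an illustrative name. It lists every monomial $x_1^{i-j} x_2^{j}$ with $0 \le j \le i \le 6$.

```python
def map_feature(x1, x2, degree=6):
    """Return all monomials x1^(i-j) * x2^j for 0 <= j <= i <= degree."""
    features = []
    for i in range(degree + 1):
        for j in range(i + 1):
            features.append(x1 ** (i - j) * x2 ** j)
    return features


mapped = map_feature(2.0, 3.0)
print(len(mapped))   # number of features, including the constant term
print(mapped[:6])    # 1, x1, x2, x1^2, x1*x2, x2^2
```

One way to read the missing coefficients: each entry of this vector later gets its own learned weight, so any fixed numerical factor in front of a term would simply be absorbed into that weight during training.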
So I've done a Coursera ML course, and now I am looking at scikit-learn's logistic regression, which is a little bit different. I've been using the sigmoid function, with a cost function split into two separate cases, y=0 and y=1. But scikit-learn has a single function (I've found out that this is the generalised logistic function), which really doesn't make sense to me.
http://scikit-learn.org/stable/_images/math/760c999ccbc78b72d2a91186ba55ce37f0d2cf37.png
I am sorry, I don't have enough reputation to post an image.
So my main concern about the function is the case when y=0: then the cost function always has the value log(e^0+1), so it does not matter what X or w are. Can someone explain this to me?
If I am correct, the $y_i$ in this formula can only take the values -1 or 1 (as derived in the link given by @jinyu0310). Normally you use this cost function (with regularisation):
(sorry for the equation inserted as image, but cannot use latex here, the image comes from goo.gl/M53u44)
So you always have the two terms that play a role when $y_i = 0$ or when $y_i = 1$. I am trying to find a better explanation of the notation that scikit-learn uses in this formula, but so far no luck.
Hope that helps. It is just another way of writing the cost function with a regularisation factor. Also keep in mind that a constant factor in front of everything plays no role in the optimisation, since you want to find the minimum and are not interested in overall factors that multiply everything.
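A quick numerical check of this claim, as a sketch: `loss_01` is the per-example two-case cost with labels in {0, 1}, and `loss_pm1` is the per-example term $\log(1 + e^{-y_i z})$ with labels in {-1, 1}, where z stands for the linear score $X_i^T w + c$. The function names are illustrative, and regularisation and constant factors are left out.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss_01(y, z):
    """Two-case cross-entropy cost, labels y in {0, 1}."""
    h = sigmoid(z)
    return -(y * math.log(h) + (1 - y) * math.log(1 - h))


def loss_pm1(t, z):
    """Single-expression logistic loss, labels t in {-1, +1}."""
    return math.log(1.0 + math.exp(-t * z))


# Map y in {0, 1} to t = 2*y - 1 in {-1, +1} and compare both forms.
pairs = [
    (loss_01(y, z), loss_pm1(2 * y - 1, z))
    for y in (0, 1)
    for z in (-2.0, 0.5, 3.0)
]
for a, b in pairs:
    print(round(a, 6), round(b, 6))
```

The two columns agree, so the case y=0 of the first form corresponds to the label -1 in the second form, not to a label of 0 plugged into the exponent.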
I am implementing Probabilistic Matrix Factorization models in Theano and would like to make use of the Adam gradient descent rules.
My goal is to have code that is as uncluttered as possible, which means that I do not want to explicitly keep track of the 'm' and 'v' quantities from Adam's algorithm.
It would seem that this is possible, especially after seeing how Lasagne's Adam is implemented: it hides the 'm' and 'v' quantities inside the update rules of theano.function.
This works when each term of the negative log-likelihood handles a different quantity. But in Probabilistic Matrix Factorization each term contains a dot product of one latent user vector and one latent item vector. So if I make an instance of Lasagne's Adam for each term, I will have multiple 'm' and 'v' quantities for the same latent vector, which is not how Adam is supposed to work.
I also posted on Lasagne's group, actually twice, where there is some more detail and some examples.
I thought about two possible implementations:
1. Each existing rating (which means each existing term in the global NLL objective function) has its own Adam update via a dedicated call to a theano.function. Unfortunately this leads to an incorrect use of Adam, as the same latent vector will be associated with different 'm' and 'v' quantities, which is not how Adam is supposed to work.
2. Calling Adam on the entire NLL objective, which makes the update mechanism behave like plain gradient descent instead of SGD, with all the known disadvantages (high computation time, getting stuck in local minima, etc.).
My questions are:
Is there something that I did not understand correctly about how Lasagne's Adam works?
Would option number 2 actually be like SGD, in the sense that every update to a latent vector influences another update (in the same Adam call) that makes use of that updated vector?
Do you have any other suggestions on how to implement it?
Any idea on how to solve this issue and avoid manually keeping replicated vectors and matrices for the 'v' and 'm' quantities?
It looks like in the paper they are suggesting that you optimise the entire function at once using gradient descent:
We can then perform gradient descent in Y , V , and W to minimize the objective function given by Eq. 10.
So I would say your option 2 is the correct way to implement what they have done.
There aren't many complexities or non-linearities in there (besides the sigmoid), so you probably aren't that likely to run into the typical optimisation problems associated with neural networks that call for something like Adam. So as long as it all fits in memory, I guess this approach will work.
If it doesn't fit in memory, perhaps there's some way you could devise a minibatch version of the loss to optimise over. Would also be interested to see if you could add one or more neural networks to replace some of these conditional probabilities. But that's getting a bit off-topic...
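A minibatch version along those lines could look like the sketch below. It is written in plain NumPy rather than Theano/Lasagne, uses synthetic ratings, and leaves out the sigmoid and the Gaussian prior (L2) terms for brevity; all names (`params`, `grads_on`, `adam_step`) are hypothetical. The point it illustrates is that there is exactly one 'm' and one 'v' array per latent matrix, shared by every rating term, while each step only looks at a minibatch of ratings.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, k = 20, 15, 3

# Synthetic observed ratings generated from hidden latent factors.
true_U = rng.normal(size=(n_users, k))
true_V = rng.normal(size=(n_items, k))
users = rng.integers(0, n_users, size=200)
items = rng.integers(0, n_items, size=200)
ratings = np.sum(true_U[users] * true_V[items], axis=1)

# One parameter matrix per side, and ONE (m, v) pair per matrix.
params = {"U": 0.1 * rng.normal(size=(n_users, k)),
          "V": 0.1 * rng.normal(size=(n_items, k))}
m = {name: np.zeros_like(p) for name, p in params.items()}
v = {name: np.zeros_like(p) for name, p in params.items()}


def grads_on(batch):
    """Gradient of 0.5 * mean squared error over one minibatch of ratings."""
    u, i, r = users[batch], items[batch], ratings[batch]
    err = np.sum(params["U"][u] * params["V"][i], axis=1) - r
    g = {name: np.zeros_like(p) for name, p in params.items()}
    # add.at accumulates correctly when a user or item repeats in the batch.
    np.add.at(g["U"], u, err[:, None] * params["V"][i] / len(batch))
    np.add.at(g["V"], i, err[:, None] * params["U"][u] / len(batch))
    return g


def adam_step(grads, t, lr=0.05, b1=0.9, b2=0.999, eps=1e-8):
    """Standard Adam update applied to whole matrices."""
    for name in params:
        m[name] = b1 * m[name] + (1 - b1) * grads[name]
        v[name] = b2 * v[name] + (1 - b2) * grads[name] ** 2
        m_hat = m[name] / (1 - b1 ** t)
        v_hat = v[name] / (1 - b2 ** t)
        params[name] -= lr * m_hat / (np.sqrt(v_hat) + eps)


def full_mse():
    pred = np.sum(params["U"][users] * params["V"][items], axis=1)
    return float(np.mean((pred - ratings) ** 2))


mse_before = full_mse()
t = 0
for epoch in range(50):
    order = rng.permutation(len(ratings))
    for start in range(0, len(order), 32):
        t += 1
        adam_step(grads_on(order[start:start + 32]), t)
mse_after = full_mse()
print(mse_before, mse_after)
```

In a Theano setting the same structure would correspond to a single update rule built on the minibatch loss with the full latent matrices as parameters, rather than one optimiser instance per rating.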
Given a machine learning model built on top of scikit-learn, how can I classify new instances but then keep only those with the highest confidence? How do we define confidence in machine learning, and how do we generate it (if it is not generated automatically by scikit-learn)? What should I change in this approach if I had more than 2 potential classes?
This is what I have done so far:
# load libraries
from sklearn import neighbors
# initialize NearestNeighbor classifier
knn = neighbors.KNeighborsClassifier(n_neighbors=3)
# train model
knn.fit([[1],[2],[3],[4],[5],[6]], [0,0,0,1,1,1])
# predict ::: get class probabilities (input must be 2D: one row per sample)
print(knn.predict_proba([[1.5]]))
print(knn.predict_proba([[37]]))
print(knn.predict_proba([[3.5]]))
Example:
Let's assume that we have created a model using the XYZ machine learning algorithm. Let's also assume that we are trying to classify users by gender using information such as location, hobbies, and income. Then we have, say, 10 new instances that we want to classify. As usual, after applying the model we get 10 outputs, either M (for male) or F (for female). So far so good. However, I would like to somehow measure the precision of these results and then, using a hard-coded threshold, leave out those with low precision. My question is how to measure the precision. Is probability (as given by the predict_proba() function) a good measure? For example, can I say that if the probability is between 0.9 and 1 then "keep" (otherwise "omit")? Or should I use a more sophisticated method? As you can see, I lack theoretical background, so any help would be highly appreciated.
While this is more of a stats question I can give answers relative to scikit-learn.
Confidence in machine learning depends on the method used for the model. For example, with 3-NN (what you used), predict_proba(x) will give you n/3, with n the number of "class 1" points among the 3 nearest neighbours of x. If n/3 is smaller than 0.5, there are fewer than 2 "class 1" points among the nearest neighbours and at least 2 "class 0" points, so x is more likely to be from "class 0". (I assume you knew that already.)
For another method like SVM, the confidence can be the distance from the point considered to the hyperplane, and for ensemble models it could be the number of aggregated votes towards a certain class. Scikit-learn's predict_proba() uses what is available from the model.
For multiclass problems (imagine Y can be equal to A, B or C) you have two main approaches, which are sometimes handled directly in scikit-learn.
The first approach is OneVsOne. It evaluates every new sample with an A-vs-B, an A-vs-C and a B-vs-C model and takes the most probable class. If A wins against B and against C, it is very likely that the right class is A. The annoying cases are resolved by taking the class that has the highest confidence in the match-ups: e.g. if A wins against B, B wins against C and C wins against A, and the confidence of A winning against B is higher than the rest, the result will most likely be A.
The second approach is OneVsAll, in which you compute A vs (B and C), B vs (A and C), and C vs (A and B), and take the class that is the most likely by looking at the confidence scores.
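Both strategies exist as wrappers in `sklearn.multiclass`. The sketch below uses a made-up four-class toy dataset and `LogisticRegression` as the base model (both are assumptions for illustration) and only shows how many binary models each strategy trains.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

# Toy data: four well-separated clusters, one per class (made up).
centers = [(0, 0), (5, 0), (0, 5), (5, 5)]
offsets = [(-0.5, 0), (0.5, 0), (0, -0.5), (0, 0.5)]
X = [[cx + dx, cy + dy] for cx, cy in centers for dx, dy in offsets]
y = [label for label in "ABCD" for _ in offsets]

ovo = OneVsOneClassifier(LogisticRegression()).fit(X, y)
ovr = OneVsRestClassifier(LogisticRegression()).fit(X, y)

# One-vs-one trains one model per pair of classes, one-vs-rest one per class.
n_ovo_models = len(ovo.estimators_)
n_ovr_models = len(ovr.estimators_)
print(n_ovo_models, n_ovr_models)
print(ovo.predict([[0.2, 0.1]]), ovr.predict([[0.2, 0.1]]))
```

With K classes, one-vs-one needs K(K-1)/2 binary models and one-vs-rest needs K, which is why the two counts differ for four classes.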
Using scikit-learn's predict() will always give the most likely class based on the confidence scores that predict_proba would give.
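As a sketch of the "keep only confident predictions" idea from the question, using the same toy 3-NN model: the threshold of 0.9 and the query points are illustrative choices. Taking the maximum over the columns of `predict_proba` works unchanged when there are more than two classes.

```python
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit([[1], [2], [3], [4], [5], [6]], [0, 0, 0, 1, 1, 1])

new_points = [[1.5], [3.4], [37]]
proba = knn.predict_proba(new_points)      # one row per sample, one column per class

confidence = proba.max(axis=1)             # probability of the most likely class
labels = knn.classes_[proba.argmax(axis=1)]

threshold = 0.9                            # hard-coded cut-off (illustrative)
kept = [
    (point[0], int(label))
    for point, label, conf in zip(new_points, labels, confidence)
    if conf >= threshold
]
print(proba)
print(kept)
```

The point 3.4 has two "class 0" neighbours and one "class 1" neighbour, so its confidence is 2/3 and it is dropped, while the other two points have unanimous neighbours and are kept.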
I suggest you read this http://scikit-learn.org/stable/modules/multiclass.html very carefully.
EDIT :
Ah, I see what you are trying to do. predict_proba() has a big flaw: suppose you have a big outlier among your new instances (e.g. a female with video games and guns as hobbies, software developer as a job, etc.). If you use, for instance, k-NN and the outlier falls inside the other class's cloud of points, predict_proba() could give 1 as the confidence score for Male while the instance is Female. However, it will work well for undecided cases (e.g. male or female, with video games and guns as hobbies, working in a nursery), as predict_proba() will give something around 0.5.
I don't know if something better can be used, though. If you have enough training samples for cross-validation, I suggest you look at ROC and PR curves to optimise your threshold.
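One hedged way to pick the threshold from a PR curve: compute precision at every candidate threshold on held-out data and take the lowest threshold that reaches a target precision. The labels and scores below are invented for illustration, and the 0.9 target is an arbitrary choice.

```python
from sklearn.metrics import precision_recall_curve

# Invented validation labels and predicted probabilities for class 1.
y_true = [0, 0, 1, 0, 1, 1, 0, 1, 1, 1]
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.65, 0.8, 0.9, 0.95]

precision, recall, thresholds = precision_recall_curve(y_true, scores)

# precision has one more entry than thresholds; drop the final sentinel value.
target_precision = 0.9
meets_target = precision[:-1] >= target_precision
chosen = float(thresholds[meets_target].min())
print(chosen)
```

Predictions whose score falls below `chosen` would then be the ones to omit; a lower target precision keeps more instances at the cost of more mistakes.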
I am working on a Machine Learning problem which looks like this:
Input Variables
Categorical
a
b
c
d
Continuous
e
Output Variables
Discrete(Integers)
v
x
y
Continuous
z
The major issue that I am facing is that the output variables are not totally independent of each other, yet no explicit relation can be established between them. That is, there is a dependence, but it is not causal (one value being high doesn't imply that the other will be high too, but the chances of the other being higher will improve).
An Example would be:
v - Number of Ad Impressions
x - Number of Ad Clicks
y - Number of Conversions
z - Revenue
Now, for an Ad to be clicked, it has to first appear on a search, so Click is somewhat dependent on Impression.
Again, for an Ad to be Converted, it has to be first clicked, so again Conversion is somewhat dependent on Click.
So running 4 instances of the problem, each predicting one of the output variables, doesn't make sense to me. In fact there should be some way to predict all 4 together, taking care of their implicit dependencies.
But as you can see, there won't be a direct relation; in fact there is a probability involved, which can't be worked out manually.
Plus the output variables are not categorical but are in fact discrete and continuous.
Any input on how to go about solving this problem? Please also point me to existing implementations and to a toolkit I can use to implement the solution quickly.
Just a random guess - I think this problem can be targeted by Bayesian Networks. What do you think ?
Bayesian Networks will do fine in your case. Your network won't be that huge either so you can live with exact inference algorithms like graph elimination or junction tree. If you decide to use BNs, then you can use Kevin Murphy's BN toolbox. Here is a link to that. For a more general toolbox that uses Gibbs sampling for approximate Monte Carlo inference, you can use BUGS.
Edit:
As an example, look at the famous sprinkler example here. For totally discrete variables, you define the conditional probability tables as in the link. For instance, you say that given that today is cloudy, there is a 0.8 probability of rain. You define all the probability distributions, where the graph shows the causality relations (i.e. if cloudy then rain, etc.). Then, as a query, you ask your inference algorithm questions like: given that the grass was wet, was it cloudy, was it raining, was the sprinkler on, and so on.
To use BNs one needs a system model that is described in terms of causality relations (a Directed Acyclic Graph) and probability transitions. If you wanna learn your system parameters, there are techniques like the EM algorithm. However, learning the graph structure is a really hard task, and supervised machine learning approaches will do better in that case.
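The sprinkler network is small enough to query by brute-force enumeration, which the sketch below does in plain Python. Only the 0.8 probability of rain given cloudy comes from the text above; the remaining table entries are assumed to be the commonly used textbook values for this example, and the function names are illustrative.

```python
from itertools import product

# Conditional probability tables. Only P(rain | cloudy) = 0.8 is taken from
# the text; the other numbers are the usual textbook values (assumed).
P_CLOUDY = 0.5
P_SPRINKLER = {True: 0.1, False: 0.5}            # P(sprinkler | cloudy)
P_RAIN = {True: 0.8, False: 0.2}                 # P(rain | cloudy)
P_WET = {(True, True): 0.99, (True, False): 0.9,
         (False, True): 0.9, (False, False): 0.0}  # P(wet | sprinkler, rain)


def bernoulli(p, value):
    return p if value else 1.0 - p


def joint(c, s, r, w):
    """Joint probability, factorised along the edges of the graph."""
    return (bernoulli(P_CLOUDY, c) * bernoulli(P_SPRINKLER[c], s)
            * bernoulli(P_RAIN[c], r) * bernoulli(P_WET[(s, r)], w))


def posterior(query, evidence):
    """P(query is True | evidence), summing the joint over all assignments."""
    numerator = denominator = 0.0
    for values in product([True, False], repeat=4):
        world = dict(zip(("c", "s", "r", "w"), values))
        if any(world[name] != val for name, val in evidence.items()):
            continue
        p = joint(**world)
        denominator += p
        if world[query]:
            numerator += p
    return numerator / denominator


p_rain = posterior("r", {"w": True})
p_sprinkler = posterior("s", {"w": True})
p_cloudy = posterior("c", {"w": True})
print(round(p_rain, 4), round(p_sprinkler, 4), round(p_cloudy, 4))
```

Enumeration grows exponentially with the number of variables, which is why larger networks rely on algorithms such as variable elimination, junction trees, or sampling.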