Get the result of a normal distribution via machine learning - machine-learning

I have data from a survey with results for each question, and I would like to find a formula/pattern to calculate/predict a customer satisfaction index. We assume the satisfaction index is normally distributed, and the formula is:
Satisfaction Index = ∑ Weight(i) * Rate(i), where Rate(i) is the rating score of question i; the goal is to figure out Weight(i).
Any idea how to figure out Weight(i) based on the normally distributed Satisfaction Index?

You should just use linear regression: regress the index on the rating score of each question. If you assume that the index and ratings are normally distributed, and that the index is a linear combination of the ratings, then this is the correct thing to do.
However, this probably isn't true; there are probably nonlinearities in the satisfaction index. In that case you could use a neural network or an SVM to try to build a better predictor. But you should first try the linear regression and see if it gives a good fit.
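A minimal sketch of the linear-regression approach, assuming the per-question ratings form a matrix X (one row per respondent, one column per question) and the satisfaction index is a vector y; the data is simulated and the variable names are illustrative, not from the question:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical survey data: 200 respondents, 5 questions rated 1-5.
rng = np.random.default_rng(0)
X = rng.integers(1, 6, size=(200, 5)).astype(float)

# Assumed "true" weights, only used here to simulate an index.
true_w = np.array([0.4, 0.1, 0.2, 0.2, 0.1])
y = X @ true_w + rng.normal(0, 0.1, size=200)  # noisy satisfaction index

# Fit index ~ sum_i Weight(i) * Rate(i); no intercept keeps the pure weighted-sum form.
model = LinearRegression(fit_intercept=False).fit(X, y)
print("Estimated Weight(i):", model.coef_)
```

The fitted coefficients play the role of Weight(i); if the fit is poor, that is the point at which a nonlinear model becomes worth trying.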

Related

Is there a machine learning algorithm to deal with this problem?

Let's say that we have a database where each row includes:
a person vector such as (1,0, ..., 1)
a movie vector (1,1, ..., 0)
the rating that the person gave to the content
What machine learning algorithm would be able to establish a movie ranking based on the user vector?
Thank you in advance for your answers!
Your question is a bit vague, but I will try to answer. It sounds like you are asking which algorithms can be used if you already have the latent factors. How did you obtain these latent factors? Are they part of the dataset? If they are not derived from the data but simply given, it will be hard to predict the movie rating from them. Do you have ratings as part of this dataset? If so, combining them with the vectors, you can use MF or clustering to obtain movie rankings.
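One simple way to use the given vectors directly, assuming binary person/movie vectors and observed ratings as described in the question, is to concatenate the two vectors and fit an ordinary regressor; sorting the predicted ratings then gives a per-user movie ranking. This is an illustrative sketch with simulated data, not a specific algorithm from the answer:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical data: each row pairs a person vector with a movie vector and a rating.
rng = np.random.default_rng(1)
person_vecs = rng.integers(0, 2, size=(500, 8))   # e.g. (1, 0, ..., 1)
movie_vecs = rng.integers(0, 2, size=(500, 6))    # e.g. (1, 1, ..., 0)
ratings = rng.integers(1, 6, size=500)            # ratings 1-5

# Concatenate person and movie features and learn rating = f(person, movie).
X = np.hstack([person_vecs, movie_vecs])
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, ratings)

# Rank a catalogue of movies for one person by predicted rating.
person = person_vecs[0]
catalogue = np.unique(movie_vecs, axis=0)
scores = model.predict(np.hstack([np.tile(person, (len(catalogue), 1)), catalogue]))
ranking = catalogue[np.argsort(scores)[::-1]]     # best-scoring movies first
```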

Can we use Logistic Regression to predict a numerical (continuous) variable, i.e. the revenue of a restaurant?

I have been given a task to predict the revenue of a restaurant based on some variables. Can I use logistic regression to predict the revenue data?
The dataset is from the Kaggle Restaurant Revenue Prediction project.
PS: I have been told to use logistic regression; I know it is not the correct algorithm for this problem.
Yes... you can!
Prediction using logistic regression can be done for numerical variables. The data you have right now contains all the independent variables, and the outcome will be dichotomous (a dependent variable with value TRUE/1 or FALSE/0).
You can then use the log-odds ratio to obtain a probability (in the range 0-1).
For a reference, you can have a look at this.
-------------------UPDATE-------------------------------
Let me give you an example from my work last year: we had to predict whether a student would qualify for campus placement or not, given 3 years of historical test results and the final success or failure. (NOTE: this is dichotomous; I will come back to this later.)
The sample data was the students' academic marks, their scores in the aptitude test held at the college, and their status as placed or not.
But in your case, you have to predict the revenue (WHICH IS non-dichotomous). So what to do? It seems that my case was simple, right?
Nope!
We were not asked just to predict whether a student would qualify or not; we had to predict the chances of each individual student getting placed, which is not at all dichotomous. Looks like your scenario, right?
So, what you can do is first decide, for your input variables, what the final output variable should be (one that will help in the revenue calculation).
For example: use the data to find out whether the restaurant will make a profit or a loss, then combine that with other algorithms to get an approximate revenue prediction (see the sketch at the end of this answer).
I'm not sure whether algorithms identical to your need already exist, but I'm sure you can do much better by putting more effort into research and analysis on this topic.
TIP: NEVER think "Will logistic regression ONLY solve my problem?" Rather, expand it to "What can logistic regression do better if used with some other technique?"
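A minimal sketch of the two-stage idea above, assuming a numeric feature matrix and a continuous revenue target; the simulated data, the 50,000 profit threshold, and the plain linear regressor for the second stage are illustrative assumptions, not part of the Kaggle task:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical restaurant data: 300 restaurants, 10 numeric features, continuous revenue.
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 10))
revenue = 50_000 + X @ rng.normal(scale=5_000, size=10) + rng.normal(scale=2_000, size=300)

X_train, X_test, rev_train, rev_test = train_test_split(X, revenue, random_state=0)

# Stage 1: logistic regression on a dichotomous label (above/below an assumed profit threshold).
threshold = 50_000
clf = LogisticRegression(max_iter=1000).fit(X_train, rev_train > threshold)
p_profit = clf.predict_proba(X_test)[:, 1]          # probability of being "profitable"

# Stage 2: a separate regressor gives the actual revenue estimate.
reg = LinearRegression().fit(X_train, rev_train)
revenue_pred = reg.predict(X_test)

print("P(profitable) for first 3 restaurants:", np.round(p_profit[:3], 2))
print("Predicted revenue for first 3 restaurants:", np.round(revenue_pred[:3], 0))
```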

Is my method to detect overfitting in matrix factorization correct?

I am using matrix factorization as a recommender-system algorithm based on user click-behavior records. I tried two matrix factorization methods:
The first one is basic SVD, whose prediction is just the product of the user factor vector u and the item factor vector i: r = u * i
The second one is SVD with a bias component:
r = u * i + b_u + b_i
where b_u and b_i represent the bias in the preferences of users and items.
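For concreteness, the two prediction rules can be written as small functions; the factor vectors and bias values below are made up and only illustrate the formulas above:

```python
import numpy as np

def predict_basic(u, i):
    """Basic matrix factorization: r = u . i"""
    return np.dot(u, i)

def predict_with_bias(u, i, b_u, b_i):
    """Matrix factorization with bias terms: r = u . i + b_u + b_i"""
    return np.dot(u, i) + b_u + b_i

# Made-up factor vectors and biases for one user/item pair.
u = np.array([0.3, -0.1, 0.8])
i = np.array([0.5, 0.2, -0.4])
print(predict_basic(u, i))                          # basic SVD prediction
print(predict_with_bias(u, i, b_u=0.2, b_i=-0.1))   # biased prediction
```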
One of the models has very low performance and the other one is reasonable. I really do not understand why one of them performs so much worse, and I suspect that it is overfitting.
I googled methods to detect overfitting and found that the learning curve is a good way. However, its x-axis is the size of the training set and its y-axis is the accuracy. This makes me quite confused: how can I change the size of the training set? Pick some of the records out of the data set?
Another problem is that I tried to plot the iteration-loss curve (the loss is the ), and the curve seems normal:
But I am not sure whether this method is correct, because the metrics I use are precision and recall. Should I plot an iteration-precision curve instead? Or does this curve already tell me that my model is correct?
Can anybody please tell me whether I am going in the right direction? Thank you so much. :)
I will answer in reverse:
So you are trying two different models: one that uses straight matrix factorization, r = u * i, and another that adds the biases, r = u * i + b_u + b_i.
You mentioned you are doing matrix factorization for a recommender system that looks at users' clicks. So my question here is: is this an implicit-ratings case or an explicit one? I believe it is an implicit-ratings problem if it is about clicks.
This is the first important thing you need to be very aware of, whether your problem is about explicit or implicit ratings, because there are some differences in the way they are used and implemented.
If you check here:
http://yifanhu.net/PUB/cf.pdf
Implicit ratings are treated in such a way that, for example, the number of times someone clicked or bought a given item is used to infer a confidence level. If you check the error function, you can see that the confidence levels are used almost as a weight factor. So the whole idea is that in this scenario the biases have no meaning.
In the case of explicit ratings, where one has ratings as a score, for example from 1-5, one can calculate those biases for users and products (averages of these bounded scores) and introduce them in the rating formula. They make sense in this scenario.
The whole point is that, depending on whether you are in one scenario or the other, you can use the biases or not.
On the other hand, your question is about overfitting. For that you can plot training errors against test errors; depending on the size of your data you can hold out a test set, and if the errors differ a lot then you are overfitting (a minimal sketch follows below).
Another thing is that matrix factorization models usually include regularization terms, see the article posted here, to avoid overfitting.
So I think in your case you are having a different problem from the one I mentioned before.
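A minimal sketch of that train-versus-holdout check, assuming (user, item, rating) triples derived from the click records; the toy data, SGD update, and hyperparameters below are illustrative assumptions, not the asker's actual setup:

```python
import numpy as np

# Hypothetical (user, item, rating) triples; in practice these come from the click records.
rng = np.random.default_rng(3)
n_users, n_items, k = 50, 40, 5
triples = [(u, i, float(rng.integers(1, 6)))
           for u in range(n_users)
           for i in rng.choice(n_items, size=8, replace=False)]
rng.shuffle(triples)
split = int(0.8 * len(triples))
train, test = triples[:split], triples[split:]

# Simple SGD matrix factorization with L2 regularization (illustrative hyperparameters).
P = 0.1 * rng.standard_normal((n_users, k))   # user factors
Q = 0.1 * rng.standard_normal((n_items, k))   # item factors
lr, reg = 0.01, 0.05

def rmse(data):
    return np.sqrt(np.mean([(r - P[u] @ Q[i]) ** 2 for u, i, r in data]))

for epoch in range(20):
    for u, i, r in train:
        err = r - P[u] @ Q[i]
        p_u = P[u].copy()
        P[u] += lr * (err * Q[i] - reg * P[u])
        Q[i] += lr * (err * p_u - reg * Q[i])
    # A widening gap between train and test RMSE is the sign of overfitting.
    print(f"epoch {epoch:2d}  train RMSE {rmse(train):.3f}  test RMSE {rmse(test):.3f}")
```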

How to run a reverse prediction with machine learning?

I am quite new to machine learning, but I am looking to solve the following problem. It is a kind of reverse prediction.
I have a lot of inputs and, accordingly, one output for each record. So I could easily do a classification and predict the output for an unknown new set of data.
The problem I would like to solve is taking one expected outcome and getting a classification of the set of input data which will end up, with a very high probability, at the expected, defined output.
To make the problem more complex, I would like the flexibility to define some of the input criteria which are probably not changeable (e.g. male/female), add these criteria like filters, and get a new reverse prediction: what would be the most relevant/important inputs, besides the given ones, to end up with the expected, defined outcome?
Let me give an example: I have thousands of records of students, including education etc., and the information whether they earn normal or extreme money after 10 years of work experience. So, as a new student, I could predict whether I will earn a lot of money or an average amount based on my education, gender, age at degree, what I am studying, etc.
What I would like to get is: given the fact that I am male and have an expected age at the time of my degree, what should I study to have a high probability of earning extremely well?
This problem does not have a unique or optimal solution, though it can be tackled in several ways, IMO.
The key fact to understand is that you have a loss of information going from the vector input to the scalar/categorical output. It is not an 'invertible' or 'reversible' transformation, because multiple, very different input vectors could lead to the same output value, thus diluting the information.
That said, one possible angle of attack would be to cluster your input vectors, obtaining several relevant clusters for every output value. Then you could extract those cluster centers and inspect what the prototypical values are that lead to the desired outcome. This way you will have your desired reverse 'input points of interest'.
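A minimal sketch of that clustering idea, assuming the records are available as a numeric feature matrix with a binary outcome; the simulated data, the number of clusters, and the use of scikit-learn's KMeans are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical student records: numeric features (e.g. grades, age at degree, field code)
# and a binary outcome (1 = "extreme" earnings, 0 = "normal").
rng = np.random.default_rng(4)
X = rng.normal(size=(1000, 4))
y = (X[:, 0] + 0.5 * X[:, 2] + rng.normal(scale=0.5, size=1000) > 1).astype(int)

# Keep only the inputs that led to the desired outcome; fixed attributes
# (gender, age, ...) can be applied as additional filters here first.
desired = X[y == 1]

# Cluster those inputs; the centers act as prototypical "input points of interest".
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(desired)
for c, center in enumerate(kmeans.cluster_centers_):
    print(f"prototype {c}:", np.round(center, 2))
```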

In a content-based recommender systems, how to judge per-user rather than per-rating?

I'm studying recommender systems in the Andrew Ng course on Coursera, and this question popped into my mind.
In the course, Andrew does recommendations for movies, like Netflix does.
We have an output matrix Y of ratings of various movies, where each cell Y(i,j) is the rating given by user j to movie i. If the user has not rated it, Y(i,j)=?
Assuming we are doing linear regression, we had the following minimization objective:
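The objective itself is not reproduced above; presumably it is the standard collaborative-filtering squared-error objective from the course (notation as in the lectures: x^(i) are the movie feature vectors, θ^(j) the per-user parameter vectors, and r(i,j)=1 when user j rated movie i):

$$
\min_{x^{(1)},\dots,x^{(n_m)},\;\theta^{(1)},\dots,\theta^{(n_u)}}
\frac{1}{2}\sum_{(i,j):\,r(i,j)=1}\left((\theta^{(j)})^{T}x^{(i)} - y^{(i,j)}\right)^{2}
+\frac{\lambda}{2}\sum_{i=1}^{n_m}\sum_{k=1}^{n}\left(x_{k}^{(i)}\right)^{2}
+\frac{\lambda}{2}\sum_{j=1}^{n_u}\sum_{k=1}^{n}\left(\theta_{k}^{(j)}\right)^{2}
$$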
My question is: doesn't this calculate on a per-rating basis? As in, all ratings are treated equally, so someone who rates 100 movies has more effect on the algorithm than someone who rates only 10 movies.
I was wondering if it is possible to judge on a per-user basis, i.e. all users are equal.
It is definitely possible to apply a weight to the loss function, with either weight = 1/ratings_per_user[u] or weight = 1/sqrt(ratings_per_user[u]), where ratings_per_user[u] is the number of ratings from the user who gave the rating in your particular sample. Whether it's a good idea or not is another question.
To answer that question, I would first ask: is this more meaningful for the problem you are really trying to solve? If it is, then ask the second question: does the model you built work well? Does it have a good cross-validation score?
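A minimal sketch of that per-user weighting, assuming the rating samples are stored as flat (user, item-features, rating) rows and a model that accepts per-sample weights; the toy data and the choice of Ridge regression are illustrative assumptions, and only the 1/ratings_per_user[u] weight comes from the answer above:

```python
import numpy as np
from collections import Counter
from sklearn.linear_model import Ridge

# Hypothetical rating samples: user id, item features, rating.
rng = np.random.default_rng(5)
users = rng.integers(0, 20, size=400)              # some users rate many more items
item_features = rng.normal(size=(400, 6))
ratings = rng.integers(1, 6, size=400).astype(float)

# Per-user weights so every user contributes equally to the loss,
# regardless of how many movies they rated.
ratings_per_user = Counter(users)
weights = np.array([1.0 / ratings_per_user[u] for u in users])
# Alternative from the answer: 1 / sqrt(ratings_per_user[u]).

# Any model that accepts sample weights applies the re-weighted loss directly.
model = Ridge().fit(item_features, ratings, sample_weight=weights)
```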
