Given a data set, such as:
(FirstName, LastName, Sex, DateOfBirth, HairColor, EyeColor, Height, Weight, Location)
that some model can train on, what kind of Machine Learning paradigm can be used to predict missing values if only given some of them?
Example:
Given:
(FirstName: John, LastName: Doe, Sex: M, Height: (5,10))
What model could predict the missing values?
(DateOfBirth, HairColor, EyeColor, Weight, Location)
In other words, the model should be able to take any of the fields as inputs, and "fill in" any that are missing.
And what type of ML/DL is this even called?
If you're looking to fill missing values with an algorithm, this is called imputing missing data. If you're using Python, the scikit-learn library has a number of imputation algorithms that you can explore in the docs.
One nice algorithm is KNNImputer, which looks at the n_neighbors most similar observations to the current one and fills the missing data with the mean of the column across those neighbours. Read more here: https://scikit-learn.org/stable/modules/generated/sklearn.impute.KNNImputer.html
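For instance, a minimal sketch of KNNImputer in use (the numeric data here is made up purely for illustration):

import numpy as np
from sklearn.impute import KNNImputer

# Toy numeric data with holes; each row is one observation.
X = np.array([
    [25.0, 178.0, 70.0],
    [32.0, np.nan, 80.0],
    [28.0, 165.0, np.nan],
    [41.0, 180.0, 85.0],
])

imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)  # each NaN becomes the mean of that column
print(X_filled)                      # across the 2 most similar rows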
If a row has a lot of missing values, first ask: will keeping it add value to my problem? If not, drop such rows.
One way to handle it: set the target variable aside. Using the features which have no missing values, predict the columns that do, filling them in with ML algorithms. Then use the freshly imputed columns to help predict the remaining missing values.
E.g., if the features and target are X1, X2, X3, X4, Y:
Suppose X1 and X2 have no missing values, while X3 and X4 do.
First, keep Y aside. Using X1 and X2, fill the missing values in X3 with the help of an ML algorithm. Then, using X1, X2 and X3, fill the missing values in X4. Finally, predict the target values (Y).
I have used this method in hackathons and got good results. Before applying it, first try to get a good understanding of the data. The approach might be slightly different from what you asked, but it is a decent approach for such problems.
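A rough sketch of that chained approach (RandomForestRegressor is just one possible choice of model, and the data here is simulated):

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "X1": rng.normal(size=100),
    "X2": rng.normal(size=100),
    "X3": rng.normal(size=100),
    "X4": rng.normal(size=100),
})
# Punch holes in X3 and X4 to simulate missingness.
df.loc[rng.choice(100, 15, replace=False), "X3"] = np.nan
df.loc[rng.choice(100, 15, replace=False), "X4"] = np.nan

def impute_column(df, target, predictors):
    """Fit on rows where `target` is known, predict where it is missing."""
    known = df[target].notna()
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(df.loc[known, predictors], df.loc[known, target])
    df.loc[~known, target] = model.predict(df.loc[~known, predictors])

impute_column(df, "X3", ["X1", "X2"])        # fill X3 from the complete columns
impute_column(df, "X4", ["X1", "X2", "X3"])  # then reuse the imputed X3 for X4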
I was learning Machine Learning from this course on Coursera taught by Andrew Ng. The instructor defines the hypothesis as a linear function of the "input" (x, in my case) like the following:
hθ(x) = θ0 + θ1·x
In supervised learning, we have some training data, and based on that we try to deduce a function which closely maps the inputs to the corresponding outputs. To deduce that function, we introduce the hypothesis as a linear function of the input x. My question is: why is a function involving two θs chosen? Why can't it be as simple as y(i) = a * x(i), where a is a coefficient, and we then go about finding a good value of a for a given example (i) using an algorithm? This question might look very stupid; I apologize, but I'm not very good at machine learning, I am just a beginner. Please help me understand this.
Thanks!
The a corresponds to θ1. Your proposed linear model is leaving out the intercept, which is θ0.
Consider an output function y equal to the constant 5, or perhaps equal to a constant plus some tiny fraction of x which never exceeds .01. Driving the error function to zero is going to be difficult if your model doesn't have a θ0 that can soak up the D.C. component (the constant offset).
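A small numerical sketch of that point, using a plain least-squares fit for illustration:

import numpy as np

x = np.linspace(0.0, 10.0, 50)
y = 5 + 0.01 * x          # a constant plus a tiny fraction of x

# With an intercept: solve least squares for [theta0, theta1].
A = np.column_stack([np.ones_like(x), x])
theta, *_ = np.linalg.lstsq(A, y, rcond=None)
err_with = np.sum((A @ theta - y) ** 2)

# Without an intercept: a single coefficient a, as in y(i) = a * x(i).
a, *_ = np.linalg.lstsq(x[:, None], y, rcond=None)
err_without = np.sum((x * a[0] - y) ** 2)

print(err_with, err_without)  # ~0 versus a residual the slope alone can't absorb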
Here is an open question:
suppose I need to predict a student's exam score given some inputs, e.g. hours spent on prep, previous scores, etc. How should I bound the output between 0 - 100? What are the best practices out there?
Thanks!
Edit:
Since the answers are mostly concerned with bounding the model output after we already have predictions, is it possible to train the model beforehand such that this bound is implicitly learned by the model?
You would train an Isotonic Regression model: http://scikit-learn.org/stable/modules/generated/sklearn.isotonic.IsotonicRegression.html
Or you could simply clip the predicted values that are out of bounds.
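A minimal sketch of both options (the data is invented; note the y_min/y_max and out_of_bounds arguments):

import numpy as np
from sklearn.isotonic import IsotonicRegression

hours = np.array([1.0, 2.0, 3.0, 5.0, 8.0, 10.0])
scores = np.array([40.0, 55.0, 60.0, 75.0, 90.0, 98.0])

# Isotonic regression with explicit bounds on the fitted values.
iso = IsotonicRegression(y_min=0, y_max=100, out_of_bounds="clip")
iso.fit(hours, scores)
print(iso.predict([0.5, 6.0, 12.0]))

# Or clip any regressor's raw predictions after the fact.
raw_preds = np.array([-5.0, 42.0, 113.0])
print(np.clip(raw_preds, 0, 100))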
It is general practice, when training on multi-flavored data, to scale it appropriately between 0 and 1. So, for example, say your training data was:
[input: [10 hrs studying, 100% on last test], output: [95% on this test] ]
then you should first scale both input and output, dividing each by the greatest numerical value among its elements, or by the greatest possible value:
input = input/input.max
output = output/100
[input: [0.1 , 1], output: [0.95] ]
When you are done training and want to predict a test score, simply multiply the output by 100 and you are done.
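For example, a tiny sketch of that scale-then-rescale round trip (in NumPy, with made-up numbers for two extra students):

import numpy as np

# Columns: [hours studied, last test %]; rows are students.
inputs = np.array([[10.0, 100.0],
                   [4.0, 70.0],
                   [7.0, 85.0]])
outputs = np.array([95.0, 62.0, 78.0])

inputs_scaled = inputs / inputs.max(axis=0)  # divide each column by its max
outputs_scaled = outputs / 100.0             # scores are at most 100

# ... train a model on (inputs_scaled, outputs_scaled) ...

predicted_scaled = 0.88        # stand-in for a model's scaled prediction
print(predicted_scaled * 100)  # back on the 0-100 scale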
BTW, what you want to do is well documented in stephenwelch's Neural Network YouTube series.
You can do either Normalisation or Standardisation. Normalisation transforms your values into [0, 1]; Standardisation centres them around 0 with unit variance, so it does not bound them to a fixed range.
I am not sure why you need the range to be 0-100, but if you really do, you can multiply the normalised values by 100 after the transformation.
Normalise: Here each value of your feature column is converted like so:
X_new = (X - X_min) / (X_max - X_min)
where X_min and X_max are min and max values in the feature.
Standardise: Here each value of your feature column is converted like so:
X_new = (X - Mean) / StandardDeviation
where Mean and StandardDeviation are the mean and SD values of your feature.
Check which one gives you better results. If your data has extreme outliers, Standardisation might give better results.
In sklearn, you can use sklearn.preprocessing.MinMaxScaler for normalisation (note that sklearn.preprocessing.normalize does something different: it rescales each row to unit norm) and sklearn.preprocessing.StandardScaler for standardisation.
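A brief sketch of both transforms on a single made-up feature column:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

scores = np.array([[35.0], [60.0], [72.0], [88.0], [100.0]])

minmax = MinMaxScaler().fit_transform(scores)      # (X - X_min) / (X_max - X_min)
standard = StandardScaler().fit_transform(scores)  # (X - Mean) / StandardDeviation

print(minmax.ravel())        # values inside [0, 1]
print(standard.ravel())      # values centred at 0, unit variance
print(minmax.ravel() * 100)  # rescaled to 0-100 if that range is needed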
HTH
I am new to ML and I am not sure how to solve this problem.
Could someone tell me how to find the values, in a step-by-step manner?
From a newcomer's viewpoint, you can actually just test:
h1=0.5+0.5x
h2=0+0.5x
h3=0.5+0x
h4=1+0.5x
h5=1+x
Then check which of h1..h5 gives exactly the observed values of y (0.5, 1, 2, 0) for the given set of independent variable values x (1, 2, 4, 0).
You can answer that by plugging the sample values of x into the equations above.
I hope I made it simple enough.
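A quick sketch of that brute-force check in Python:

import numpy as np

x = np.array([1.0, 2.0, 4.0, 0.0])
y = np.array([0.5, 1.0, 2.0, 0.0])

candidates = {
    "h1 = 0.5 + 0.5x": lambda t: 0.5 + 0.5 * t,
    "h2 = 0 + 0.5x":   lambda t: 0.0 + 0.5 * t,
    "h3 = 0.5 + 0x":   lambda t: 0.5 + 0.0 * t,
    "h4 = 1 + 0.5x":   lambda t: 1.0 + 0.5 * t,
    "h5 = 1 + x":      lambda t: 1.0 + 1.0 * t,
}

for name, h in candidates.items():
    print(name, "-> exact fit" if np.allclose(h(x), y) else "-> no")
# Only h2 = 0 + 0.5x reproduces all the observed y values.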
Here is the catch: it's one of the easiest problems in machine learning.
Just see that we have to create a linear regression model that fits the following data:
STEP 1: UNDERSTANDING THE PROBLEM
As mentioned at the end of the question, the model should fit the data completely.
We have to find theta0 and theta1 such that, for a given value of x, Htheta(x) gives the correct value of y.
STEP 2: FINDING THETA1
From these m examples, take any 2 random examples, (x1, y1) and (x2, y2).
Htheta(x2) - Htheta(x1) = theta1*x2 - theta1*x1
(subtracting the two hypotheses eliminates theta0)
Since the parameters fit the data exactly, Htheta(x1) = y1 and Htheta(x2) = y2
(the y corresponding to each x in the data), so:
(y2 - y1)/(x2 - x1) = theta1
(take theta1 common and divide both sides by (x2 - x1))
From this:
theta1 = 0.5
STEP 3: CALCULATING THETA0
Take any random example and put the values of theta1, y and x into this equation:
y = theta1*x + theta0
theta0 comes out to be 0.
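The same two steps in a few lines of Python, assuming the x and y values quoted in the first answer:

x = [1.0, 2.0, 4.0, 0.0]
y = [0.5, 1.0, 2.0, 0.0]

# STEP 2: slope from any two distinct examples.
theta1 = (y[1] - y[0]) / (x[1] - x[0])  # (1 - 0.5) / (2 - 1) = 0.5

# STEP 3: intercept from any single example.
theta0 = y[0] - theta1 * x[0]           # 0.5 - 0.5 * 1 = 0

print(theta0, theta1)                   # 0.0 0.5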
My approach would be to view these points by plotting them on a graph of x against y. Since they lie on a straight line, calculate the slope tan(theta) using basic trigonometry, which in this case is just y/x (since it's mentioned they fit perfectly!!). E.g.:
tan(theta) = 0.5/1 = 1/2, which gives theta1 = 0.5
(the angle itself would be arctan(1/2), approx. 0.46 rad)
Note: this is not a scalable approach, but just some maths fun! Sorry.
In general you would use some non-iterative algorithmic approach (probably based on solving a system of linear equations) or some iterative approach like GD (Gradient Descent), but it's simpler here, as a perfect fit is already given.
A perfect fit means a loss/error of zero.
A loss of zero implies that theta0 needs to be zero, or else sample 4 (the last one, where x = 0) induces a loss.
The overall loss is the sum of the per-sample losses, and each component is nonnegative -> we can't tolerate any loss here.
With theta0 fixed at zero, sample 4 is satisfied by an infinite number of theta1 values (since x = 0 there), producing no loss.
But sample 1 shows that theta1 has to be 0.5 to induce no loss.
Check the others: it fits perfectly.
One assumption I made:
Gradient descent will converge to the optimal solution (which is not always true, even for convex optimization problems; it depends on the learning rate; one might use line searches to prove convergence under some assumptions about the problem; but all that is irrelevant here).
The following is an example question from previous years of a Machine Learning course. Can anyone help me solve it?
The correct way to solve part (a) involves marginalizing out all the other variables in the model (x1, x2 and x5).
p(x3,x4)=1/Z \sum_{x1,x2,x5} \phi(x1,x2) \phi(x2,x4) \phi(x3,x4) \phi(x4,x5)
Z=\sum_{x1,x2,x3,x4,x5} \phi(x1,x2) \phi(x2,x4) \phi(x3,x4) \phi(x4,x5)
In a small model like this, you can just compute the sums over the 2^3 and 2^5 respective possibilities. A better method, however, is to compute the sums using belief propagation.
For instance, the sum in the numerator above can be rewritten as
S(x3,x4)=\sum_{x1,x2,x5} \phi(x1,x2) \phi(x2,x4) \phi(x3,x4) \phi(x4,x5)
=\phi(x3,x4) \sum_{x5} \phi(x4,x5) \sum_{x2} \phi(x2,x4) \sum_{x1} \phi(x1,x2)
The following intermediate sums can then be computed and used to obtain the final marginal probability:
sx1x2(x2=0)=\phi(x1=0,x2=0)+\phi(x1=1,x2=0)
sx1x2(x2=1)=\phi(x1=0,x2=1)+\phi(x1=1,x2=1)
sx1x2x4(x4=0)=\phi(x2=0,x4=0) sx1x2(x2=0)+\phi(x2=1,x4=0) sx1x2(x2=1)
sx1x2x4(x4=1)=\phi(x2=0,x4=1) sx1x2(x2=0)+\phi(x2=1,x4=1) sx1x2(x2=1)
sx4x5(x4=0)=\phi(x4=0,x5=0)+\phi(x4=0,x5=1)
sx4x5(x4=1)=\phi(x4=1,x5=0)+\phi(x4=1,x5=1)
Then
S(x3,x4)=\phi(x3,x4) sx1x2x4(x4) sx4x5(x4)
and
Z=\sum_{x3,x4} S(x3,x4)
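A sketch of that elimination order in NumPy; the 2x2 potential tables below are made up, since the actual values come from the exam question:

import numpy as np

# Made-up 2x2 potential tables, indexed [first_var, second_var].
phi12 = np.array([[1.0, 2.0], [3.0, 1.0]])  # phi(x1, x2)
phi24 = np.array([[2.0, 1.0], [1.0, 2.0]])  # phi(x2, x4)
phi34 = np.array([[1.0, 3.0], [2.0, 1.0]])  # phi(x3, x4)
phi45 = np.array([[1.0, 1.0], [2.0, 3.0]])  # phi(x4, x5)

s12 = phi12.sum(axis=0)            # sx1x2(x2): sum over x1
s124 = phi24.T @ s12               # sx1x2x4(x4): sum over x2 of phi(x2,x4)*sx1x2(x2)
s45 = phi45.sum(axis=1)            # sx4x5(x4): sum over x5

S = phi34 * (s124 * s45)[None, :]  # S(x3,x4) = phi(x3,x4) sx1x2x4(x4) sx4x5(x4)
Z = S.sum()                        # Z = sum over x3, x4 of S(x3, x4)
print(S / Z)                       # the marginal p(x3, x4)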
I am working on a data mining project and I have to design the following model.
I am given 4 features x1, x2, x3 and x4, and three functions defined on these
features, such that each function depends on some subset of the available features.
e.g.
F1(x1, x2) = x1^2 + 2x2^2
F2(x2, x3) = 2x2^2 + 3x3^3
F3(x3, x4) = 3x3^3 + 4x4^4
This means F1 is some function which depends on features x1 and x2, F2 is some function which depends on x2 and x3, and so on.
Now, a training data set is available where the values of x1, x2, x3, x4 are known, together with sum(F1+F2+F3) (I know the total sum but not the individual values of the functions).
Using this training data, I have to build a model which can correctly predict the total sum of all the functions, i.e. (F1+F2+F3).
I am new to the data mining and machine learning field, so I apologize in advance if this question is too trivial or wrong. I have tried to model it in many ways but I cannot get any clear thoughts about it. I would appreciate any help regarding this.
Your problem is non-linear regression. You have the features
x1, x2, x3, x4 and a target S (where S = sum(F1+F2+F3)).
You would like to predict S using the xn, but S is a non-linear function of them.
Since S is non-linear, you need a non-linear regression algorithm for this problem. Ordinary non-linear regression may solve it, or you may choose other approaches: you could for example try tree regression or MARS (Multivariate Adaptive Regression Splines). They are well-known algorithms and you can find both commercial and open-source implementations.
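As a rough sketch of that idea, one could simulate data with the functions stated in the question and fit a tree-based regressor (GradientBoostingRegressor is just one possible choice here):

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(500, 4))
x1, x2, x3, x4 = X.T
# The observed target is only the total sum of the three functions.
S = (x1**2 + 2 * x2**2) + (2 * x2**2 + 3 * x3**3) + (3 * x3**3 + 4 * x4**4)

X_train, X_test, S_train, S_test = train_test_split(X, S, random_state=0)
model = GradientBoostingRegressor(random_state=0)
model.fit(X_train, S_train)
print(model.score(X_test, S_test))  # R^2 on held-out data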