Bound output values [closed] - machine-learning

Here is an open question:
Suppose I need to predict a student's exam score given some inputs, e.g. hours spent on prep, previous scores, etc. How should I bound the output between 0 and 100? What are the best practices out there?
Thanks!
Edit:
Since the answers are mostly concerned about bounding model output after we have the predictions, is it possible to train the model beforehand such that this bound is implicitly learned by the model?

You could train an Isotonic Regression model (its y_min and y_max parameters let you bound the predictions): http://scikit-learn.org/stable/modules/generated/sklearn.isotonic.IsotonicRegression.html
Or you could simply clip the predicted values that are out of bounds.
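For the clipping option, a minimal sketch (assuming the predictions already live in a numpy array; the values here are made up):

import numpy as np

# Hypothetical raw model predictions, some outside the valid score range.
predictions = np.array([87.5, 103.2, -4.1, 56.0])

# Clamp anything outside 0-100 back into the valid range.
bounded = np.clip(predictions, 0, 100)
print(bounded)  # -> 87.5, 100.0, 0.0, 56.0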

It is general practice, when training on data whose features live on different scales, to scale everything into the 0 - 1 range. So, for example, say one of your training examples was:
[input: [10 hrs studying, 100% on last test], output: [95% on this test] ]
then you should first scale both input and output, dividing each by the greatest value among its elements or by the greatest possible value:
input = input/input.max
output = output/100
[input: [0.1 , 1], output: [0.95] ]
When you are done training and want to predict a test score, simply multiply the output by 100 and you are done.
BTW, what you want to do is well documented in stephenwelch's Neural Network YouTube series.
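A minimal sketch of that scaling and un-scaling (assuming numpy arrays; the numbers are the ones from the example above, and the trained model itself is omitted):

import numpy as np

# Example from above: [10 hrs studying, 100% on last test] -> 95% on this test.
X = np.array([[10.0, 100.0]])
y = np.array([95.0])

# Scale by the greatest value in the input and by the greatest possible score.
X_scaled = X / X.max()   # [[0.1, 1.0]]
y_scaled = y / 100.0     # [0.95]

# ... train on (X_scaled, y_scaled) ...

# At prediction time, undo the output scaling by multiplying by 100.
y_pred_scaled = 0.95          # placeholder for the model's scaled prediction
print(y_pred_scaled * 100)    # back on the 0-100 exam scale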

You can do either Normalisation or Standardisation. Normalisation maps your values into [0, 1]; Standardisation rescales them to zero mean and unit variance.
I am not sure why you need the range to be 0-100, but if you really do, you can multiply the normalised values by 100 after the transformation above.
Normalise: Here each value of your feature column is converted like so:
X_new = (X - X_min) / (X_max - X_min)
where X_min and X_max are min and max values in the feature.
Standardise: Here each value of your feature column is converted like so:
X_new = (X - Mean) / StandardDeviation
where Mean and StandardDeviation are the mean and SD values of your feature.
Check which one gives you better results. If your data has extreme outliers, Standardisation might give better results.
In sklearn, you can use sklearn.preprocessing.MinMaxScaler (for the min-max normalisation above) or sklearn.preprocessing.StandardScaler to do the conversions.
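A minimal sketch of both conversions (column-wise, with made-up feature values):

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up features: hours of prep and previous score for three students.
X = np.array([[10.0, 100.0],
              [ 2.0,  60.0],
              [ 5.0,  80.0]])

print(MinMaxScaler().fit_transform(X))    # each column mapped into [0, 1]
print(StandardScaler().fit_transform(X))  # each column to zero mean, unit variance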
HTH

Related

What kind of ML model can find missing parameters? [closed]

Given a data set, such as:
(FirstName, LastName, Sex, DateOfBirth, HairColor, EyeColor, Height, Weight, Location)
that some model can train on, what kind of Machine Learning paradigm can be used to predict missing values if only given some of them?
Example:
Given:
(FirstName: John, LastName: Doe, Sex: M, Height: (5,10))
What model could predict the missing values?
(DateOfBirth, HairColor, EyeColor, Weight, Location)
In other words, the model should be able to take any of the fields as inputs, and "fill in" any that are missing.
And what type of ML/DL is this even called?
If you're looking to fill missing values with an algorithm, this is called imputing missing data. If you're using Python, the scikit-learn library has a number of imputation algorithms that you can explore in the docs.
One nice algorithm is KNNImputer, which looks at the n_neighbors most similar observations to the current observation and fills the missing data with the mean of the column across those similar observations. Read more here: https://scikit-learn.org/stable/modules/generated/sklearn.impute.KNNImputer.html
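A minimal sketch with KNNImputer (numeric placeholder data; in practice categorical fields like HairColor or Location would need encoding first):

import numpy as np
from sklearn.impute import KNNImputer

# Placeholder numeric columns, e.g. (Height_in, Weight_lb, Age); np.nan marks missing values.
X = np.array([[70.0, 170.0, 30.0],
              [68.0, np.nan, 28.0],
              [72.0, 190.0, np.nan],
              [69.0, 165.0, 27.0]])

# Each missing value is filled with the mean of that column over the n_neighbors most similar rows.
imputer = KNNImputer(n_neighbors=2)
print(imputer.fit_transform(X))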
If a row has a lot of missing values, first ask whether keeping it adds value to your problem; otherwise, drop such rows.
One way to handle it: set the target variable aside. Using the features which have no missing values, predict the columns that do, filling them with an ML algorithm. Then reuse the previously imputed columns to predict the remaining missing values.
E.g. suppose the features and target are X1, X2, X3, X4, Y.
Let X1 and X2 have no missing values, while X3 and X4 do.
First, keep Y aside. Using X1 and X2, fill the missing values in X3 with an ML algorithm. Then, using X1, X2 and X3, fill the missing values in X4. Finally, predict the target values (Y).
I have used this method in hackathons and got good results. Before applying it, first try to get a good understanding of the data. The approach might be slightly different from what you asked, but it is a decent approach for such problems.
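A minimal sketch of that chained, model-based idea using scikit-learn's IterativeImputer (still experimental, hence the extra import; the numbers are made-up stand-ins for X1..X4):

import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 - enables IterativeImputer
from sklearn.impute import IterativeImputer

# Made-up stand-ins for X1..X4; X3 and X4 contain missing values, X1 and X2 do not.
X = np.array([[1.0, 2.0, 3.0, np.nan],
              [2.0, 4.0, np.nan, 8.0],
              [3.0, 6.0, 9.0, 12.0],
              [4.0, 8.0, 12.0, np.nan]])

# Each column with missing values is modelled from the other columns, and the
# imputations are refined over several rounds, much like the manual X1,X2 -> X3 -> X4 chain above.
imputer = IterativeImputer(random_state=0)
print(imputer.fit_transform(X))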

Partial derivative for mean absolute error [closed]

What is the partial derivative for MAE? I understand that for mean squared error (MSE) the partial derivative with respect to some x1 would be -x1 * (y_pred - y_actual), assuming the following version of MSE is used.
What is the partial derivative for x1 when the loss function is MAE instead of MSE? I've been trying to find this but I haven't had any luck. Would it just be -(y_pred - y_actual) when x1 is greater than 0, and (y_pred - y_actual) when x1 is less than 0? Or is there something else that I'm missing?
Unless you have a single neuron, there is no single fixed formula for the partial derivative of the loss function with respect to each weight; it depends strictly on the connections between neurons in the network, and each weight has a different partial derivative.
For a small network with 2 or 3 layers, apply the chain rule and sum rule to find the partial derivatives of the loss function manually; otherwise, the dynamic programming of backpropagation is needed.
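For the single-neuron case the question describes, here is a minimal sketch (assuming, as in the MSE expression above, a linear model y_pred = w1*x1 + ... + wn*xn and that the derivative is taken with respect to the weight w1 attached to x1):

d|y_pred - y_actual| / dw1 = sign(y_pred - y_actual) * x1

So the case split is on the sign of the residual (y_pred - y_actual), not on the sign of x1, and at y_pred = y_actual the derivative is undefined; any subgradient between -|x1| and |x1| (commonly 0) is used there.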

How to find the value of theta 0 and theta 1? [closed]

I am new to ML and I am not sure how to solve this problem.
Could someone tell me how to find these values in a step-by-step manner?
From a newcomer's viewpoint, you can actually just test:
h1=0.5+0.5x
h2=0+0.5x
h3=0.5+0x
h4=1+0.5x
h5=1+x
Then check which of h1..h5 gives exactly the observed values of y (0.5, 1, 2, 0) for the given values of the independent variable x (1, 2, 4, 0).
You can answer that by plugging the sample values of x into the equations above (a quick sketch of this check follows).
I hope I made it simple enough.
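A minimal sketch of that check in Python (the candidate hypotheses and sample points are the ones quoted above):

# Candidate hypotheses h1..h5 and the sample data from the question.
candidates = {
    "h1": lambda x: 0.5 + 0.5 * x,
    "h2": lambda x: 0.0 + 0.5 * x,
    "h3": lambda x: 0.5 + 0.0 * x,
    "h4": lambda x: 1.0 + 0.5 * x,
    "h5": lambda x: 1.0 + 1.0 * x,
}
xs = [1, 2, 4, 0]
ys = [0.5, 1, 2, 0]

for name, h in candidates.items():
    fits = all(h(x) == y for x, y in zip(xs, ys))
    print(name, "fits all points" if fits else "does not fit")  # only h2 fits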
Here is the catch: it's one of the easiest problems in machine learning.
We just have to create a linear regression model that fits the given data.
STEP 1: UNDERSTANDING THE PROBLEM
As mentioned at the end of the question, the model should fit the data exactly.
We have to find theta0 and theta1 such that, for a given value of x, Htheta(x) gives the correct value of y.
STEP 2: FINDING THETA1
From the m examples, take any 2:
Htheta(x2) - Htheta(x1) = theta1*x2 - theta1*x1
(subtracting the two equations eliminates theta0)
Since the parameters fit the provided data exactly, Htheta(x2) = y2 and Htheta(x1) = y1 (the y values corresponding to those x values in the data), so:
(y2 - y1) / (x2 - x1) = theta1
(factor out theta1 and divide both sides by (x2 - x1))
From this:
theta1 = 0.5
STEP 3: CALCULATING THETA0
Take any random example and put the values of theta1, y and x in this equation
y = theta1*x + theta0
theta0 will come out to be 0
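Plugging in two of the sample points from the other answer, (x1, y1) = (1, 0.5) and (x2, y2) = (2, 1):

theta1 = (y2 - y1) / (x2 - x1) = (1 - 0.5) / (2 - 1) = 0.5
theta0 = y1 - theta1 * x1 = 0.5 - 0.5 * 1 = 0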
My approach would be to view these points by plotting a graph with the x, y values. Since it's a straight line through the origin, the slope theta1 is just y/x by normal trigonometry (since it's mentioned they fit perfectly!!). E.g.:
tan(angle) = 0.5/1 = 1/2, so theta1 = 0.5 (the slope itself; the angle of the line would be arctan(1/2) ≈ 0.46 rad).
Note: this is not a scalable approach, just some maths fun! Sorry.
In general you would use some non-iterative algorithmic approach (probably based on solving a system of linear equations) or some iterative approach like GD (Gradient Descent), but it is simpler here, as it's already given that there is a perfect fit.
Perfect fit means: loss/error of zero.
A loss of zero implies that theta0 needs to be zero, or else sample 4 (the last one, x = 0, y = 0) induces a loss.
The overall loss is the sum of the per-sample losses and each component is nonnegative -> we can't tolerate a loss on any sample.
With theta0 fixed at zero, sample 4 leaves an infinite number of theta1 values producing no loss.
But sample 1 shows that theta1 has to be 0.5 to induce no loss.
Check the others: it fits perfectly.
One assumption I made:
gradient descent will converge to the optimal solution (which is not always true, even for convex optimization problems; it depends on the learning rate; one might use line searches to prove convergence based on some assumptions about the problem; but all that is irrelevant here)
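A quick numerical cross-check of that reasoning (a minimal sketch with numpy's least-squares solver, using the sample points quoted in the other answers):

import numpy as np

# Sample points: x = (1, 2, 4, 0), y = (0.5, 1, 2, 0).
x = np.array([1.0, 2.0, 4.0, 0.0])
y = np.array([0.5, 1.0, 2.0, 0.0])

# Columns for theta0 (intercept) and theta1 (slope).
A = np.column_stack([np.ones_like(x), x])
theta, residuals, rank, _ = np.linalg.lstsq(A, y, rcond=None)
print(theta)  # expected: [0. 0.5], i.e. theta0 = 0, theta1 = 0.5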

Neural network weird prediction [closed]

I am trying to implement a neural network. I'm using backpropagation to compute the gradients. After obtaining the gradients, I multiply them by the learning rate and subtract them from the corresponding weights (basically I'm trying to apply gradient descent, please tell me if this is wrong).
So the first thing I tried, after having backpropagation and gradient descent ready, was to train a simple XOR classifier where the inputs can be (0,0), (1,0), (0,1), (1,1) and the corresponding outputs are 0, 1, 1, 0. My neural network contains 2 input units, 1 output unit and one hidden layer with 3 units. When training it with a learning rate of 3.0 for >100 iterations (I even tried >5000), the cost drops until a specific point where it gets stuck and remains constant. The weights are randomly initialized each time I run the program, but it always gets stuck at the same cost. After training is finished, running the network on any of the above inputs always gives an output of 0.5000.

I then changed the inputs and outputs to (-1,-1), (1,-1), (-1,1), (1,1) with outputs -1, 1, 1, -1. Now, trained with the same learning rate, the cost drops continuously no matter the number of iterations, but the results are still wrong and always very close to 0. I even trained it for an insane number of iterations, with results like [iterations: 20kk, inputs: (1,-1), output: 1.6667e-08] and [iterations: 200kk, inputs: (1,-1), output: 1.6667e-09]; for inputs (1,1) and the others the output is also very close to 0. It seems like the output is always mean(min(y), max(y)), no matter in what form I provide the input/output. I can't figure out what I'm doing wrong, can someone please help?
There are so many places where you might be wrong (a small sketch covering these points follows this checklist):
check your gradients numerically
you have to use nonlinear hidden units to learn XOR - do you have non-linear activation there?
you need a bias neuron - do you have one?
minor things that should not cause the mentioned problem, but worth fixing either way:
do you have sigmoidal activation in the output node (as your network is a classifier)?
do you train with cross-entropy cost (although this is minor problem)?
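A minimal sketch that ticks these boxes, using scikit-learn's MLPClassifier rather than a hand-rolled network (non-linear logistic hidden units, bias terms handled internally, logistic output for the binary classifier); whether it converges can still depend on the random initialization:

from sklearn.neural_network import MLPClassifier

# XOR inputs and targets from the question.
X = [[0, 0], [1, 0], [0, 1], [1, 1]]
y = [0, 1, 1, 0]

# 2 inputs, one hidden layer of 3 logistic units, 1 output; biases are included by default.
clf = MLPClassifier(hidden_layer_sizes=(3,), activation="logistic",
                    solver="lbfgs", max_iter=10000, random_state=0)
clf.fit(X, y)
print(clf.predict(X))  # ideally [0 1 1 0]; try another random_state if it gets stuck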

Naive bayes text classification fails in one category. Why? [closed]

I am implementing a Naive Bayes classifier for text category detection.
I have 37 categories and I get an accuracy of about 36% on my test set.
I want to improve the accuracy, so I decided to implement 37 two-way classifiers, as suggested in many sources (Ways to improve the accuracy of a Naive Bayes Classifier? is one of them). Each of these classifiers would answer, for a given text:
specific_category OR everything_else
and I would determine the text's category by applying them sequentially.
But I've got a problem with the first classifier: it always decides in favour of the "specific_category" class.
I have training data: 37 categories, 100 documents of the same size for each category.
For each category I selected a list of 50 features by the mutual information criterion (the features are just words).
For the sake of example, I use two categories "agriculture" and "everything_else" (except agriculture).
For category "agriculture":
number of words in all documents of this class
(first term in denominator in http://nlp.stanford.edu/IR-book/pdf/13bayes.pdf, (13.7))
W_agriculture = 31649.
Size of vocabulary V_agriculture = 6951.
Log probability of Unknown word (UNK) P(UNK|agriculture) = -10.56
Log probability of class P(agriculture) = log(1/37) = -3.61 (we have 37 categories of same-size documents)
For category "everything_else":
W_everything_else = 1030043
V_everything_else = 44221
P(UNK|everything_else) = -13.89
P(everything_else) = log(36/37) = -0.03
Then I have a text not related to agriculture; let it consist mostly of unknown words (UNK). It has 270 words, mostly unknown to both the "agriculture" and "everything_else" categories. Let's assume 260 of the words are UNK for "everything_else" and the other 10 are known.
Then, when I calculate probabilities
P(text|agriculture) = P(agriculture) + SUM(P(UNK|agriculture) for 270 times)
P(text|everything_else) = P(everything_else) + SUM(P(UNK|everything_else) for 260 times) + SUM(P(word|everything_else) for 10 times)
In the last line we counted 260 words as UNK and 10 as known for a category.
Main problem. As P(UNK|agriculture) >> P(UNK|everything_else) (in log terms it is much greater), the influence of those 270 P(UNK|agriculture) terms outweighs the influence of the sum of P(word|everything_else) over each word in the text.
Because
SUM(P(UNK|agriculture) for 270 times) = -2851.2
SUM(P(UNK|everything_else) for 260 times) = -3611.4
and the first sum is much larger, and this can't be corrected either by P(agriculture) or by SUM(P(word|everything_else) for the 10 known words), because the difference is huge. So the text always ends up in the "agriculture" category even though it does not belong there.
The question is: am I missing something? Or how should I deal with a large number of UNK words whose probability is significantly higher for small categories?
UPD: I tried to enlarge the training data for the "agriculture" category (just concatenating the documents 36 times) to make it equal in the number of documents. It helped for a few categories, not much for others; I suspect that, due to the smaller number of words and dictionary size, P(UNK|specific_category) gets bigger and outweighs P(UNK|everything_else) when summed 270 times.
So it seems this method is very sensitive to the number of words in the training data and the vocabulary size. How can I overcome this? Maybe bigrams/trigrams would help?
Right, ok. You're pretty confused, but I'll give you a couple of basic pointers.
Firstly, even if you're following a 1-vs-all scheme, you can't have different vocabularies for the different classes. If you do this, the event spaces of the random variables are different, so probabilities are not comparable. You need to decide on a single common vocabulary for all classes.
Secondly, throw out the unknown token. It doesn't help you. Ignore any words that aren't part of the vocabulary you decide upon.
Finally, I don't know what you're doing with summing probabilities. You're confused about taking logs, I think. This formula is not correct:
P(text|agriculture) = P(agriculture) + SUM(P(UNK|agriculture) for 270 times)
Instead it's:
p(text|agriculture) = p(agriculture) * p(unk|agriculture)^270 * p(all other words in doc|agriculture)
If you take logs, this becomes:
log( p(t|a) ) = log(p(agriculture)) + 270*log(p(unk|agriculture)) + log(p(all other words|agriculture))
Lastly, if your classifier is right, there's no real reason to believe that one-vs-all will work better than a straight n-way classification. Empirically it might, but theoretically their results should be equivalent. In any case, you shouldn't apply the decisions sequentially, but solve all n 2-way problems and assign the text to the class where the positive probability is highest.
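A minimal sketch of that straight n-way setup with one shared vocabulary (scikit-learn; the categories and documents here are placeholders, not the asker's data):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Placeholder training data; the real setup has 37 categories with 100 documents each.
train_texts = ["corn yield and farming subsidies",
               "stock market rally continues",
               "tractor maintenance and soil quality"]
train_labels = ["agriculture", "finance", "agriculture"]

# One CountVectorizer builds a single common vocabulary for all classes;
# at prediction time, words outside that vocabulary are simply ignored.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

print(model.predict(["soil subsidies and farming news"]))  # expected: ['agriculture']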
