statsmodel OLS p-values vs sklearn's feature_selection.f_regression - machine-learning

Basically, I ran a normal LinearRegression() with sklearn. I got the coefficients no problem. But to get the p-values, I have to use sklearn's feature_selection.f_regression. I then tried to cross check it with StatsModel OLS. This is where I got a problem.
Coefficients:
They are the same from sklearn's LinearRegression() and StatsModel's OLS.
P-Values:
TOTALLY DIFFERENT between the two. I've triple checked that I fed the same information into both models. If I made a mistake by feeding different data into the two models, my coefficients would not have been the same.
So my question is, why are the p-values different for the two?

Related

Can logistic and lineair regression produce a prediction on a scale?

I currently have a dataset of drawings, each drawing being represented by some features. Each feature (independent variable) is a continuous number. None of the drawings have a label as of yet, which is why I am planning to start a sort of questionaire with people. However, before I can correctly setup such questionaire, I should have an idea of what kind of labels I should use for my training data.
At first thought, I was thinking about letting people rate the drawings on a scale, for example from 1 to 5 with 1 being bad, 3 being average and 5 being good. Alternatively, I could also reduce the question to a simple good or bad question. The latter would mean I lose some valuable information, but the dependent variable could then be considered 'binary'.
Using the training data I then composed, I would need to have a machine learning algorithm (model) which given a drawing, predicts if the drawing is good or not. Ideally, I would have some way of tuning the strictness in this prediction. For example, the model could instead of simply predicting 'good' or 'bad', predict the likelyhood of a painting being good on a scale of 0 to 1. I could then say "Well, let's say all paintings which are 70% likely to be good, are considered as good". Another example would be that the model predicts the goodness using the same categorical values the people used to rate the drawing initially. So it would either predict the drawing being a 1, 2, 3, 4 or 5. Similar to my first example, I could then say "Well, all paintings which are rated at least a 4, are considered good paintings" and tune this threshhold to my liking.
After doing some research, I came up with logistic and linear regression being good candidates. However, if which of the two would be the best for my scenario? Equally important, how would I need to format my labels? Just simple 0's and 1's or a scale?
You could use a 1 vs all representation if you wanted to use a multi-class categorical classification:
Essentially, you train 1 classifier for every category you have (you have 10 categories, so you have 10 classifiers) and then each classifier is just trained to predict whether or not the category belongs to each specific class.
There are alternative ways to make multi-class logistic regression work that only require training a single model, such as by using categorical cross entropy, but given that you'd like to use ordinal data, a linear regression used as a regression model is likely more ideal. You'd predict a value between 1 and 10 and then just round to the nearest integer. This way you aren't penalizing close guesses as much as far away guesses.
what keeps you from using a logistic regression model. Due to a lack of better dataset I used the standard diabetes data. The target variable is an integer between 50 and 200. I normalised the data between [-1,1] such that I can use sigmoid as activation function. For the loss I decided to use
import tensorflow as tf
from tensorflow.keras.models import Model
from tensorflow.keras.layers import MaxPooling2D, Input, Convolution2D
import numpy as np
from sklearn import datasets
diabetes = datasets.load_diabetes()
x_train=diabetes.data
y_train=2*(diabetes.target-min(diabetes.target))/(max(diabetes.target)-min(diabetes.target))-1
inputs = tf.keras.Input(shape=(x_train.shape[1],))
outputs = tf.keras.layers.Dense(1, activation=tf.nn.sigmoid)(inputs)
model = tf.keras.Model(inputs=inputs, outputs=outputs)
model.compile(optimizer=tf.keras.optimizers.Adam(), # Optimizer
loss=tf.keras.losses.MSE,
metrics=['sparse_categorical_accuracy'])
history = model.fit(x_train, y_train,
batch_size=64,
epochs=300,
validation_data=(x_train, y_train))
You could also use a linear regression model. There you only need to replace the activation function by linear. However I think the squashing character, besides ensuring hat there is no rating larger 1 or smaller -1.
A last alternative would be to train pair-wise preference. The idea is to show the human two drawings and ask which one he likes more. Then build a binary model, e.g., logistic regression. This approach appears preferable to me as it is easier to answer for the human

Machine learning models don't work with continuous data

I'm attempting to get a machine learning model to predict a baseball players Batting Average based on their At Bats and Hits. Since:
Batting Average = Hits/At Bats
I would think this relationship would be relatively easier to discover. However, since Batting Average is float (i.e. 0.300), all the models I try return the following error:
ValueError: Unknown label type: 'continuous'
I'm using sklearns models. I've tried LogisticRegression, RandomForestClassifier, LinearRegression. They all have the same problem.
From reading other StackOverflow posts on this error, I began doing this:
lab_enc = preproccessing.LabelEncoder()
y = pd.DataFrame(data=lab_enc.fit_transform(y))
Which seems to change values such as 0.227 to 136 which seems odd to me. Probably just because I don't quite understand what the transform is doing. I would, if possible, prefer just using the actual Batting Average values.
Is there a way to get the models I tried to work when predicting continuous values?
The problem you are trying to solve falls into the regression (i.e. numeric prediction) context, and it can certainly be dealt with ML algorithms.
I'm using sklearns models. I've tried LogisticRegression, RandomForestClassifier, LinearRegression. They all have the same problem.
The first two algorithms you mention here (Logistic Regression and Random Forest Classifier) are for classification problems, and thus are not suitable for your (regression) setting (they expectedly produce the error you mention). Linear Regression however is suitable and it should work fine here.
Please, for starters, stick to Linear Regression, in order to convince yourself that it can indeed handle the problem; you can subsequently extend to other scikit-learn algorithms like RandomForestRegressor etc. If you face any issues, open a new question with the specific code & error(s)...

How azure ML give an output for a value which is not used when training the model?

I am trying to predict the price of a house. Therefore I added no-of-rooms as one variable to get the prediction. Previous values for that variable was (3,2,1) when I was training the model. Now I am adding no-of-rooms as "6" to get an output(which was not use before to get the predicted value). How will it give the output for a new value?Is it only consider the variables except no-of-rooms ? I used Boosted decision tree regression as the model.
The short answer is that when you train your model on a set of features and then use a test set to run predictions, yes it will be able to utilize/understand feature values that the model hasn't previously seen during training. If you have large outliers in your test set that would differ significantly from what the model saw during training, it will affect accuracy, but it will still attempt a prediction.
This is less of a Azure Machine Learning question and more machine learning basics (or really just the basics of how regression works). I would do some research on both "linear regression", and the concept of "over-fitting in machine learning". These are two very basic conceptual topics that will help with your understanding. Understanding regression will help you see why a model can use a value it hasn't previously seen to create a prediction.

Sigmoid activation for multi-class classification?

I am implementing a simple neural net from scratch, just for practice. I have got it working fine with sigmoid, tanh and ReLU activations for binary classification problems. I am now attempting to use it for multi-class, mutually exclusive problems. Of course, softmax is the best option for this.
Unfortunately, I have had a lot of trouble understanding how to implement softmax, cross-entropy loss and their derivatives in backprop. Even after asking a couple of questions here and on Cross Validated, I can't get any good guidance.
Before I try to go further with implementing softmax, is it possible to somehow use sigmoid for multi-class problems (I am trying to predict 1 of n characters, which are encoded as one-hot vectors)? And if so, which loss function would be best? I have been using the squared error for all binary classifications.
Your question is about the fundamentals of neural networks and therefore I strongly suggest you start here ( Michael Nielsen's book ).
It is python-oriented book with graphical, textual and formulated explanations - great for beginners. I am confident that you will find this book useful for your understanding. Look for chapters 2 and 3 to address your problems.
Addressing your question about the Sigmoids, it is possible to use it for multiclass predictions, but not recommended. Consider the following facts.
Sigmoids are activation functions of the form 1/(1+exp(-z)) where z is the scalar multiplication of the previous hidden layer (or inputs) and a row of the weights matrix, in addition to a bias (reminder: z=w_i . x + b where w_i is the i-th row of the weight matrix ). This activation is independent of the others rows of the matrix.
Classification tasks are regarding categories. Without any prior knowledge ,and even with, most of the times, categories have no order-value interpretation; predicting apple instead of orange is no worse than predicting banana instead of nuts. Therefore, one-hot encoding for categories usually performs better than predicting a category number using a single activation function.
To recap, we want an output layer with number of neurons equals to number of categories, and sigmoids are independent of each other, given the previous layer values. We also would like to predict the most probable category, which implies that we want the activations of the output layer to have a meaning of probability disribution. But Sigmoids are not guaranteed to sum to 1, while softmax activation does.
Using L2-loss function is also problematic due to vanishing gradients issue. Shortly, the derivative of the loss is (sigmoid(z)-y) . sigmoid'(z) (error times the derivative), that makes this quantity small, even more when the sigmoid is closed to saturation. You can choose cross entropy instead, or a log-loss.
EDIT:
Corrected phrasing about ordering the categories. To clarify, classification is a general term for many tasks related to what we used today as categorical predictions for definite finite sets of values. As of today, using softmax in deep models to predict these categories in a general "dog/cat/horse" classifier, one-hot-encoding and cross entropy is a very common practice. It is reasonable to use that if the aforementioned is correct. However, there are (many) cases it doesn't apply. For instance, when trying to balance the data. For some tasks, e.g. semantic segmentation tasks, categories can have ordering/distance between them (or their embeddings) with meaning. So please, choose wisely the tools for your applications, understanding what their doing mathematically and what their implications are.
What you ask is a very broad question.
As far as I know, when the class become 2, the softmax function will be the same as sigmoid, so yes they are related. Cross entropy maybe the best loss function.
For the backpropgation, it is not easy to find the formula...there
are many ways.Since the help of CUDA, I don't think it is necessary to spend much time on it if you just want to use the NN or CNN in the future. Maybe try some framework like Tensorflow or Keras(highly recommand for beginers) will help you.
There is also many other factors like methods of gradient descent, the setting of hyper parameters...
Like I said, the topic is very abroad. Why not trying the machine learning/deep learning courses on Coursera or Stanford online course?

Regression Model Comparrison

I'm looking for metrics to compare various regressions models (e.g. SVM, Decision Tree, Neural Network etc), to decide the merits of each for solving a specific problem.
For my problem I have just over 80,000 training samples with 12 variables, all of which are independent and identically distributed.
I've done most of my research into neural networks but I'm drawing a blank when trying to compare them against other models.
Any input (including reading suggestions) would be greatly appreciated, thanks!
You can compare regression models by calculating the mean squared error for each model over a test set. The best model will simply be the one with the least error.
Sadly, there ist nothing like roc curves for regression models. Except your output is a binary variable like with logistic regression.

Resources