Learning something new here, so I'm hoping the community can help. I'm mostly a Ruby guy but trying to transition to Python so I can tackle machine learning with TensorFlow. I'm having a very hard time getting this logistic regression script to work with housing data I collected.
Links to data:
https://storage.googleapis.com/datastorage_machinelearning/first1500.csv
logistic regression script:
https://gist.github.com/Nick-Harvey/404b605423b3c19710eb2a1de6cb5880
Script Output:
https://gist.github.com/Nick-Harvey/3eab9262770bfb690730cad1fbadf9eb
The error is somewhat obvious: it says the shapes are incompatible. This is most likely because the encoding I apply to the city names adds extra columns. However, I can't figure out how to fit the data so I can predict house price from sqft and plot it all. Eventually, I'd like to plot the data so you could compare sqft to price and group it by city.
Note: I think you're not doing logistic regression. Logistic regression requires the dependent variable to be binary and is estimated by maximum likelihood (e.g. via Fisher scoring). Your loss function is a mean squared error and your dependent variable is numerical. It looks like multiple linear regression to me.
I studied your code and was able to train a model. Your code crashed later, when you calculated the least-squares loss, but I'll leave that to you.
Your problem is the dimension of your training data. Your dependent variable y_train has shape (1176,).
Try:
y_train = y_train.reshape((y_train.shape[0], 1))
after
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
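Put together, this is roughly what the fix looks like (a minimal sketch, assuming X and y are NumPy arrays holding your features and prices):
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# train_test_split returns y_train as a 1-D array, e.g. shape (1176,)
y_train = y_train.reshape((y_train.shape[0], 1))  # now (1176, 1), matching the model output
y_test = y_test.reshape((y_test.shape[0], 1))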
I'm taking a course in statistical learning / ML and am currently doing a project that includes a classification task, and I have some newbie questions regarding the random_state parameter. The accuracy of my model changes heavily depending on the random_state. I'm currently working with logistic regression (sklearn.linear_model.LogisticRegression()) and I try to tune the hyperparameters using GridSearchCV.
The problem:
I get different prediction accuracy, depending on which random_state I'm using.
What I have tried:
I have tried to set the random_state parameter as a global state (using np.random.seed(randomState) and setting randomState as an integer at the top of the script). Further, I split the data using
train_test_split(X, y, test_size=0.2, random_state=randomState)
with the same (global) integer randomState. Further, I want to perform GridSearchCV to tune the hyperparameters, so I specify a param_grid and perform a GridSearchCV on it. From this I find the best estimator and choose it as my model. Then I use my model for prediction and print a classification report of the results. I take the average over 10 runs by changing randomState.
Example: I do this procedure with randomState=1 and find the best model from GridSearchCV: model_1. I get an accuracy of 84%. If I change to randomState = 2,...,10 and still use model_1, the average accuracy becomes 80.5%.
I do this procedure with randomState=42 and find the best model from GridSearchCV: model_42. I get an accuracy of 77%. If I change to randomState = 41, 40, 39,...,32 and still use model_42, the average accuracy becomes 78.7%.
I'm very confused why the accuracy varies so much depending on random_state.
Different random_state values give you different accuracies. random_state controls how the dataset is split randomly into train and test (rather than, say, splitting by ascending index). If the resulting test set happens to contain points that are unlike anything in the train set, accuracy can be poor for that particular split. The best way to deal with this is cross-validation: randomly split the data into train and test, fit and evaluate the model, and repeat this n times, where n is the number of splits (commonly n = 5). Then take the mean of all the accuracies and report that as the final result. Instead of changing random_state every time, you can simply perform a cross-validation split.
You can find references to this in the link below:
https://machinelearningmastery.com/k-fold-cross-validation/#:~:text=Cross%2Dvalidation%20is%20a%20resampling,k%2Dfold%20cross%2Dvalidation
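As a minimal sketch (assuming X, y, and the LogisticRegression model from the question), 5-fold cross-validation in scikit-learn looks like this:
from sklearn.model_selection import cross_val_score, KFold
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=1000)
cv = KFold(n_splits=5, shuffle=True, random_state=0)   # 5 different random splits
scores = cross_val_score(model, X, y, cv=cv, scoring='accuracy')
print(scores.mean(), scores.std())  # report the mean (and spread) instead of one lucky or unlucky split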
I currently have a dataset of drawings, each drawing being represented by some features. Each feature (independent variable) is a continuous number. None of the drawings have a label as of yet, which is why I am planning to start a sort of questionnaire with people. However, before I can correctly set up such a questionnaire, I should have an idea of what kind of labels I should use for my training data.
At first thought, I was thinking about letting people rate the drawings on a scale, for example from 1 to 5 with 1 being bad, 3 being average and 5 being good. Alternatively, I could also reduce the question to a simple good or bad question. The latter would mean I lose some valuable information, but the dependent variable could then be considered 'binary'.
Using the training data I then composed, I would need to have a machine learning algorithm (model) which, given a drawing, predicts whether the drawing is good or not. Ideally, I would have some way of tuning the strictness of this prediction. For example, instead of simply predicting 'good' or 'bad', the model could predict the likelihood of a painting being good on a scale of 0 to 1. I could then say "Well, let's say all paintings which are 70% likely to be good are considered good". Another example would be that the model predicts the goodness using the same categorical values the people used to rate the drawing initially. So it would predict the drawing as being a 1, 2, 3, 4 or 5. Similar to my first example, I could then say "Well, all paintings which are rated at least a 4 are considered good paintings" and tune this threshold to my liking.
After doing some research, I came up with logistic and linear regression as good candidates. However, which of the two would be best for my scenario? Equally important, how should I format my labels: just simple 0's and 1's, or a scale?
You could use a one-vs-all representation if you wanted to do multi-class categorical classification:
Essentially, you train one classifier for every category you have (with 10 categories you would have 10 classifiers), and each classifier is trained to predict whether or not an example belongs to its specific class.
There are alternative ways to make multi-class logistic regression work that only require training a single model, such as using categorical cross-entropy, but given that you'd like to use ordinal data, a linear regression used as a regression model is likely a better fit. You'd predict a value on the rating scale and then just round to the nearest integer. This way you aren't penalizing close guesses as much as far-away guesses.
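As a rough sketch of both options, using the asker's 1-5 scale (X, ratings, and X_new are hypothetical placeholders for the features, the collected labels, and new drawings):
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Option 1: one-vs-rest classification over the rating categories
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, ratings)
predicted_category = ovr.predict(X_new)

# Option 2: treat the rating as a number, regress, then round and threshold
reg = LinearRegression().fit(X, ratings)
predicted_rating = np.clip(np.round(reg.predict(X_new)), 1, 5)
is_good = predicted_rating >= 4   # tune this threshold to taste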
What keeps you from using a logistic-regression-style model? For lack of a better dataset I used the standard diabetes data. The target variable is an integer. I normalised the target to [0, 1] so that I can use sigmoid as the activation function. For the loss I decided to use mean squared error:
import tensorflow as tf
from sklearn import datasets

diabetes = datasets.load_diabetes()
x_train = diabetes.data
# scale the target to [0, 1] so that a sigmoid output can cover its whole range
y_train = (diabetes.target - min(diabetes.target)) / (max(diabetes.target) - min(diabetes.target))

inputs = tf.keras.Input(shape=(x_train.shape[1],))
outputs = tf.keras.layers.Dense(1, activation=tf.nn.sigmoid)(inputs)  # squashing output, as in logistic regression
model = tf.keras.Model(inputs=inputs, outputs=outputs)

model.compile(optimizer=tf.keras.optimizers.Adam(),
              loss=tf.keras.losses.MSE,
              metrics=['mae'])  # mean absolute error; a classification metric makes no sense here

history = model.fit(x_train, y_train,
                    batch_size=64,
                    epochs=300,
                    validation_data=(x_train, y_train))  # no hold-out set here, just to watch the loss
You could also use a linear regression model; there you only need to replace the activation function with a linear one. However, I think the squashing character of the sigmoid is an advantage, besides ensuring that there is no rating larger than 1 or smaller than 0.
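For example, swapping the output layer above for a linear activation would look like this (everything else stays the same):
outputs = tf.keras.layers.Dense(1, activation='linear')(inputs)   # unbounded output instead of the sigmoid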
A last alternative would be to train on pairwise preferences. The idea is to show a person two drawings and ask which one they like more, then build a binary model on those comparisons, e.g. logistic regression. This approach seems preferable to me, as the question is easier for people to answer.
I'm trying to test the prediction score of the following classifiers:
- random forest
- k neighbors
- svm
- naïve bayes
I'm not using feature selection or feature scaling (no preprocessing at all).
I'm using a train-test split as follows:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)
I tested several datasets (from sklearn):
- load_iris
- load_breast_cancer
- load_wine
In all those 3, random forest always gave perfect prediction (test accuracy 1.0).
I tried to create random samples for classification:
make_classification(flip_y=0.3, weights = [0.65, 0.35], n_features=40, n_redundant=4, n_informative=36,n_classes=2,n_clusters_per_class=1, n_samples=50000)
and again random forest gave perfect prediction on the test set (accuracy 1.0).
All the other classifiers gave good performance on the test set (0.8-0.97), but none were perfect (1.0) like random forest.
What am I missing?
Does random forest really outperform all the other classifiers so decisively?
Regarding the perfect accuracy score of 1.0, we have to keep in mind that all three of these datasets are nowadays considered toy ones, and the same probably holds for the artificial data generated by scikit-learn's make_classification.
That said, it is true that RF is considered a very powerful classification algorithm. There is even a relatively recent (2014) paper, titled Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?, which concluded (quoting from the abstract, emphasis in the original):
We evaluate 179 classifiers arising from 17 families (discriminant analysis, Bayesian, neural networks, support vector machines, decision trees, rule-based classifiers, boosting, bagging, stacking, random forests and other ensembles, generalized linear models, nearest-neighbors, partial least squares and principal component regression, logistic and multinomial regression, multiple adaptive regression splines and other methods) [...] We use 121 data sets, which represent the whole UCI data base [...] The classifiers most likely to be the bests are the random forest (RF) versions
Although there has been some criticism of the paper, mainly because it did not include boosted trees (but not only for that; see also Are Random Forests Truly the Best Classifiers?), the truth is that, in the area of "traditional", pre-deep-learning classification at least, there was already the saying "when in doubt, try RF", which the paper mentioned above came to reinforce.
The assigned question reads as follows: use scikit-learn to split the data into a training and test set, then classify the data as either cat or dog using DBSCAN.
I am trying to figure out how to use DBSCAN to fit a model on training data and then predict the labels of a test set. I am well aware that DBSCAN is meant for clustering and not prediction. I have also looked at Use sklearn DBSCAN model to classify new entries as well as numerous other threads. DBSCAN only comes with fit and fit_predict methods, which don't seem particularly useful when trying to fit the model on the training data and then test it on the testing data.
Is the question worded poorly, or am I missing something? I have looked at the scikit-learn documentation and searched for examples, but have not had any luck.
from sklearn.cluster import DBSCAN
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score

# Split the samples into two subsets, use one for training and the other for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# Instantiate the learning model
dbscan = DBSCAN()

# Fit the model (y_train is accepted but silently ignored: DBSCAN is unsupervised)
dbscan.fit(X_train, y_train)

# Predict the response: there is no predict(); fit_predict() just re-clusters X_test from scratch
dbscan_pred = dbscan.fit_predict(X_test)

# Confusion matrix and quantitative metrics
print("The confusion matrix is: " + str(confusion_matrix(y_test, dbscan_pred)))
print("The accuracy score is: " + str(accuracy_score(y_test, dbscan_pred)))
Whoever gave you that assignment has no clue...
DBSCAN will never predict "cat" or "dog". It just can't.
Because it is an unsupervised algorithm, it doesn't use training labels. y_train is ignored (see the parameter documentation), and it is stupid that sklearn will allow you to pass it at all! It will output sets of points that are clusters. Many tools will enumerate these sets as 1, 2, ... But it won't name a set "dogs".
Furthermore it can't predict on new data either - which you need for predicting on "test" data. So it can't work with a train-test split, but that does not really matter because it does not use labels anyway.
The accepted answer in the question you linked is a pretty good one for you, too: you want to perform classification, not discover structure (which is what clustering does).
DBSCAN, as implemented in scikit-learn, is a transductive algorithm, meaning you can't do predictions on new data. There's an old discussion from 2012 on the scikit-learn repository about this.
Suffice it to say, when you're using a clustering algorithm, the concept of a train/test split is less well defined. Cross-validation usually involves a different metric; for example, in K-means the cross-validation is often over the hyperparameter k rather than over mutually exclusive subsets of the data, and the metric that is optimized is the intra- vs. inter-cluster variance rather than F1 or accuracy.
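For instance, a typical "model selection" loop for K-means sweeps k and scores cluster quality (here the silhouette, one common choice) rather than held-out accuracy; a minimal sketch, assuming X is the feature matrix:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

for k in range(2, 10):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, silhouette_score(X, labels))   # pick the k with the best cluster separation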
Bottom line: trying to perform classification using a clustering technique is effectively square-peg-round-hole. You can jam it through if you really want to, but it'd be considerably easier to just use an off-the-shelf classifier.
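If the goal really is to predict cat vs. dog, an off-the-shelf classifier drops straight into the train/test split from the question; a sketch using k-nearest neighbours, purely as an example:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, accuracy_score

clf = KNeighborsClassifier().fit(X_train, y_train)   # any supervised classifier works here
y_pred = clf.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(accuracy_score(y_test, y_pred))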
First, there are questions on this forum that are very similar to this one, but trust me, none of them matches, so please don't flag this as a duplicate.
I have encountered two ways of doing linear regression with scikit-learn (sklearn), and I am failing to understand the difference between the two, especially since the first snippet calls train_test_split() while the second one calls fit directly.
I am studying with multiple resources and this single issue is very confusing to me.
The first one uses SVR:
X = np.array(df.drop(['label'], 1))
X = preprocessing.scale(X)
y = np.array(df['label'])
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, test_size=0.2)
clf = svm.SVR(kernel='linear')
clf.fit(X_train, y_train)
confidence = clf.score(X_test, y_test)
And the second is this one:
# Split the data into training/testing sets
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]
# Split the targets into training/testing sets
diabetes_y_train = diabetes.target[:-20]
diabetes_y_test = diabetes.target[-20:]
# Create linear regression object
regr = linear_model.LinearRegression()
# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)
# Make predictions using the testing set
diabetes_y_pred = regr.predict(diabetes_X_test)
So my main focus is the difference between using SVR(kernel="linear") and using LinearRegression().
cross_validation.train_test_split: splits arrays or matrices into random train and test subsets.
In the second snippet, the splitting is not random; it simply slices off the last 20 rows as the test set.
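You can make that difference explicit with the shuffle flag; a small sketch (shuffle=False reproduces the "slice off the last 20 rows" behaviour of the second snippet):
from sklearn.model_selection import train_test_split

# random split (the default): rows are shuffled before splitting
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=20, random_state=0)

# deterministic split: the last 20 rows become the test set, like diabetes_X[-20:]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=20, shuffle=False)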
svm.SVR: Support Vector Regression (SVR) uses the same principles as the SVM for classification, with only a few minor differences. Because the output is a real number, there are infinitely many possible values, which makes the prediction problem harder. In the regression case, a margin of tolerance (epsilon) is set, as an approximation to the margin the SVM would otherwise demand from the problem. The algorithm is also more involved and has to be taken into consideration. However, the main idea is always the same: minimise the error by finding the hyperplane that maximises the margin, keeping in mind that part of the error is tolerated.
Linear Regression: In statistics, linear regression is a linear approach for modeling the relationship between a scalar dependent variable y and one or more explanatory variables (or independent variables) denoted X. The case of one explanatory variable is called simple linear regression.
Reference:
https://cs.adelaide.edu.au/~chhshen/teaching/ML_SVR.pdf
This is what I found:
Intuitively, like all regressors, SVR tries to fit a line to the data by minimising a cost function. However, the interesting part about SVR is that you can deploy a non-linear kernel. In that case you end up doing non-linear regression, i.e. fitting a curve rather than a line.
This process is based on the kernel trick and on representing the solution/model in the dual rather than in the primal. That is, the model is represented as a combination of the training points rather than as a function of the features and some weights. At the same time, the basic algorithm remains the same: the only real change in the process of going non-linear is the kernel function, which changes from a simple inner product to some non-linear function.
So SVR handles non-linear fitting problems as well, while LinearRegression() only fits a straight line (a hyperplane when there are several features; both can take any number of features).
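A tiny illustration of that point on a made-up curved target (not the question's data):
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR

X = np.linspace(0, 6, 200).reshape(-1, 1)   # a toy 1-D problem with a curved target
y = np.sin(X).ravel()

print(LinearRegression().fit(X, y).score(X, y))   # low R^2: a straight line can't follow sin(x)
print(SVR(kernel='linear').fit(X, y).score(X, y)) # roughly comparable to LinearRegression
print(SVR(kernel='rbf').fit(X, y).score(X, y))    # much higher R^2: the kernel fits the curve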
The main difference between these methods is in the mathematical background!
We have samples X and want to predict target Y.
The linear regression method just minimizes the least-squares error:
for one object the prediction is y = x^T * w, where w is the vector of model weights.
Loss(w) = Sum_1_N (x_n^T * w - y_n)^2 --> min over w
As this is a convex functional, the global minimum will always be found.
After taking the derivative of the loss with respect to w and rewriting the sums in vector form, you get:
w = (X^T * X)^(-1) * (X^T * Y)
So in ML libraries (I'm fairly sure sklearn does something equivalent), w is calculated according to the formula above:
X is the matrix of training samples you pass to the fit method,
and in predict these weights are simply multiplied by X_test.
So the solution is explicit and fast (except for very large datasets, where inverting the matrix becomes an expensive task), in contrast to iterative methods such as SVM.
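That closed-form solution is easy to check numerically; a small sketch on synthetic data (note that sklearn also handles an intercept and uses a least-squares solver instead of an explicit inverse, but the result is the same):
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

w = np.linalg.inv(X.T @ X) @ (X.T @ y)   # w = (X^T * X)^(-1) * (X^T * Y)
print(w)
print(LinearRegression(fit_intercept=False).fit(X, y).coef_)   # essentially the same numbers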
In addition: Lasso and Ridge solve the same task but add a regularization term on the weights to the loss.
For Ridge the weights can still be computed in closed form; Lasso generally needs an iterative solver.
Linear SVM/SVR does almost the same thing, except that its optimization task also involves maximizing the margin (I apologize, but it is difficult to write down because I didn't find out how to write TeX here).
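For reference, the standard epsilon-insensitive objective for linear SVR, written in the same plain-text style as above, is roughly:
Loss(w) = 1/2 * ||w||^2 + C * Sum_1_N max(0, |y_n - x_n^T * w| - eps) --> min(w)
Only errors larger than the tolerance eps are penalised; the ||w||^2 term is the margin/regularisation part, and C trades the two off.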
So it uses iterative numerical optimisation (gradient-type methods) to find the global optimum.
Sklearn's SVM classes even have a max_iter parameter, which caps the number of iterations of that numerical solver.
To sum up: linear regression has an explicit (closed-form) solution, while SVM arrives at an approximation of the exact solution through numerical (computational) optimisation.