Accuracy metric isn't working on linear regression - machine-learning

Kindly help here:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
X = [[1.1],[1.3],[1.5],[2],[2.2],[2.9],[3],[3.2],[3.2],[3.7],[3.9],[4],[4],[4.1],[4.5],[4.9],[5.1],[5.3],[5.9],[6],[6.8],[7.1],[7.9],[8.2],[8.7],[9],[9.5],[9.6],[10.3],[10.5]]
y = [39343,46205,37731,43525,39891,56642,60150,54445,64445,57189,63218,55794,56957,57081,61111,67938,66029,83088,81363,93940,91738,98273,101302,113812,109431,105582,116969,112635,122391,121872]
#implement the dataset for train & test
from sklearn.cross_validation import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size = 1/3,random_state=0)
#implement our classifier based on Simple Linear Regression
from sklearn.linear_model import LinearRegression
SimpleLinearRegression = LinearRegression()
SimpleLinearRegression.fit(X_train,y_train)
y_predict= SimpleLinearRegression.predict(X_test)
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test,y_predict))
I'm sure I'm missing something here. Is there some other way to calculate an accuracy score for regression? Thanks in advance :)

Accuracy as a metric is applicable to classification problems, as it is defined as the fraction of labels that is predicted correctly. In your case you are doing regression (LinearRegression), i.e. your target variable is continuous. So either you picked the wrong model by mistake, or accuracy is the wrong metric for your problem.

You can use mean absolute error and mean squared error.
from sklearn.metrics import mean_absolute_error, mean_squared_error
import numpy as np
MAE = mean_absolute_error(y_test, y_predict)
RMSE = np.sqrt(mean_squared_error(y_test, y_predict))

We can't use accuracy for regression problems; it is only used in classification problems.
You can use MSE, RMSE, MAPE or MAE as the metric to determine how good your regression model is.
These values tell us how far we are from the correct predictions. Lower values are better for these metrics.
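For completeness, here is a small sketch (assuming the y_test and y_predict arrays from the question, and scikit-learn >= 0.24 for the MAPE helper) that computes several of these metrics at once:
from sklearn.metrics import mean_absolute_error, mean_squared_error, mean_absolute_percentage_error, r2_score
import numpy as np
mae = mean_absolute_error(y_test, y_predict)
rmse = np.sqrt(mean_squared_error(y_test, y_predict))
mape = mean_absolute_percentage_error(y_test, y_predict)  # needs scikit-learn >= 0.24
r2 = r2_score(y_test, y_predict)  # closest regression analogue to an "accuracy" score
print(f"MAE={mae:.1f}  RMSE={rmse:.1f}  MAPE={mape:.3f}  R2={r2:.3f}")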

Related

Clustering on 97 features of categorical data

I am trying to apply unsupervised learning on a dataset with 97 features and around 6500 rows/samples. All features have discrete data (mostly from 1-10), with some being binary (0/1). What are some of the best clustering algorithms to apply to this data? Thank you!
It's impossible to say which clustering algo will perform best on your given dataset. You just have to try several methodologies and inspect the final results that you get. Here are several clustering algos that you can try.
https://github.com/ASH-WICUS/Notebooks/blob/master/Clustering%20Algorithms%20Compared.ipynb
Here is a small sample.
import statsmodels.api as sm
import numpy as np
import pandas as pd
mtcars = sm.datasets.get_rdataset("mtcars", "datasets", cache=True).data
df_cars = pd.DataFrame(mtcars)
df_cars.head()
from sklearn.cluster import KMeans
from matplotlib import pyplot
# define the dataset (copy so the cluster column can be added safely)
X = df_cars[['mpg', 'hp']].copy()
# define the model
model = KMeans(n_clusters=8)
# fit the model
model.fit(X)
# assign a cluster to each example
yhat = model.predict(X)
X['kmeans'] = yhat
# plot the points and colour by cluster number
pyplot.scatter(X['mpg'], X['hp'], c=X['kmeans'], cmap='rainbow', s=50, alpha=0.8)
pyplot.show()
# the same plot as an interactive plotly figure
import plotly.express as px
fig = px.scatter(X, x="hp", y="mpg", color="kmeans", size='mpg', hover_data=['kmeans'])
fig.show()
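If you want a number to compare several methodologies on your own data, one rough sketch (not part of the linked notebook) is to score each candidate algorithm with the silhouette coefficient, which needs no ground-truth labels:
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import silhouette_score
features = df_cars[['mpg', 'hp']]
candidates = {
    'kmeans': KMeans(n_clusters=3, n_init=10, random_state=0),
    'agglomerative': AgglomerativeClustering(n_clusters=3),
}
for name, algo in candidates.items():
    labels = algo.fit_predict(features)
    # silhouette is in [-1, 1]; higher means better-separated clusters
    print(name, silhouette_score(features, labels))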

Get 100% accuracy score on Decision tree model

I got 100% accuracy on my decision tree using the decision tree algorithm, but only got 75% accuracy with a random forest.
Is there something wrong with my model, or is a decision tree best suited for the dataset provided?
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.3, random_state= 30)
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier()
classifier = classifier.fit(X_train,y_train)
y_pred = classifier.predict(X_test)
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test,y_pred)
print(cm)
At first it may look like your model is overfitted, but that is not the case, as you have put the test set aside.
The reason is a data leak. Random forest randomly excludes some features for every tree. Now suppose you have the label as one of the features: in some trees the label gets excluded and the accuracy is reduced, while in the decision tree the label is always among the features and predicts the result perfectly.
How can you find out if this is the case?
Use visualization for the decision tree; if my guess is true, you will find that there are only a few decision nodes. You can also visualize the correlation between the label and every feature and check whether there is any perfect correlation.
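As a rough sketch of how you might check this (assuming the classifier, X_train and y_train from the question; the 'target' column name below is only for illustration), you can plot the fitted tree and look at how each feature correlates with the label:
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree
# a tree with only one or two decision nodes is a strong hint of leakage
plt.figure(figsize=(12, 6))
plot_tree(classifier, filled=True)
plt.show()
# correlation of every feature with the label; values near 1.0 are suspicious
df_check = pd.DataFrame(X_train)
df_check['target'] = np.array(y_train)
print(df_check.corr()['target'].sort_values(ascending=False))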

How to draw ROC curve for a multi-class dataset?

I have a multi-class confusion matrix as below and would like to draw its associated ROC curve for one of its classes (e.g. class 1). I know the "one-vs-all-others" approach should be used in this case, but I want to know how exactly we need to change the threshold to obtain different pairs of TP and corresponding FP rates. [confusion matrix image]
SkLearn has a handy implementation which calculates the tpr and fpr and another function which generates the auc for you. You can just apply this to your data by treating each class on its own (all other data being negative) by looping through each class. The code below was inspired by the scikit-learn page on this topic itself.
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt
#generating synthetic data
N_classes = 3
N_per_class=100
labels = np.concatenate([[i]*N_per_class for i in range(N_classes)])
preds = np.stack([np.random.uniform(0,1,N_per_class*N_classes) for _ in range(N_classes)]).T
preds /= preds.sum(1,keepdims=True) #approximate softmax
tpr, fpr, roc_auc = ({} for _ in range(3))
f, ax = plt.subplots()
# generate ROC data, treating each class as positive vs. the rest
for i in range(N_classes):
    fpr[i], tpr[i], _ = roc_curve(labels == i, preds[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])
    ax.plot(fpr[i], tpr[i])
plt.legend(['Class {:d}'.format(d) for d in range(N_classes)])
plt.xlabel('FPR')
plt.ylabel('TPR')
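If you also want a single summary number across all classes, roc_auc_score (imported above; multi-class support requires scikit-learn >= 0.22) can average the one-vs-rest AUCs, e.g.:
# macro-averaged one-vs-rest AUC over the three synthetic classes
print(roc_auc_score(labels, preds, multi_class='ovr', average='macro'))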

How to know if my data has been scaled by StandardScaler?

"I have scaled my dataset by using Standard Scaler , Now how to know it has been scaled, I am sure it has been scaled but how to see it"
As @Coderji said, you can always check the mean and standard deviation, which should be equal to 0 and 1 respectively.
However, there is another method to visualize it.
from sklearn import datasets
import numpy as np
from sklearn.preprocessing import StandardScaler
I am using the iris dataset for this example.
iris = datasets.load_iris()
X = iris.data
sc = StandardScaler()
sc.fit(X)
x = sc.transform(X)
import matplotlib.pyplot as plt
import seaborn as sns
# distplot is deprecated in recent seaborn releases; histplot with kde=True gives an equivalent plot
sns.histplot(x[:, 1], kde=True)  # column 1 is sepal width
plt.show()
See the resulting distribution plot for sepal width: after scaling it is centred around 0.
Similarly you can look at all the variables, or a simple pairplot will do the job.
This gives a visual indication that the data has been standardised.
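If you prefer a purely numerical check in addition to the plots, the per-column mean and standard deviation of the transformed data should be (approximately) 0 and 1:
print(np.round(x.mean(axis=0), 6))  # ~0 for every column
print(np.round(x.std(axis=0), 6))   # ~1 for every column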

scikit-learn cross validation score in regression

I'm trying to build a regression model, validate and test it and make sure it doesn't overfit the data. This is my code thus far:
from pandas import read_csv
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split, cross_val_score, validation_curve
import numpy as np
import matplotlib.pyplot as plt
data = np.array(read_csv('timeseries_8_2.csv', index_col=0))
inputs = data[:, :8]
targets = data[:, 8:]
x_train, x_test, y_train, y_test = train_test_split(
    inputs, targets, test_size=0.1, random_state=2)
rate1 = 0.005
rate2 = 0.1
mlpr = MLPRegressor(hidden_layer_sizes=(12,10), max_iter=700, learning_rate_init=rate1)
# trained = mlpr.fit(x_train, y_train) # should I fit before cross val?
# predicted = mlpr.predict(x_test)
scores = cross_val_score(mlpr, inputs, targets, cv=5)
print(scores)
scores prints an array of 5 numbers, where the first number is usually around 0.91 and is always the largest number in the array.
I'm having a little trouble figuring out what to do with these numbers. If the first number is the largest, does this mean that on the first cross-validation attempt the model scored the highest, and then the scores decreased as it kept cross-validating?
Also, should I fit the model on the training data before I call the cross-validation function? I tried commenting it out and it gives more or less the same results.
The cross validation function performs the model fitting as part of the operation, so you gain nothing from doing that by hand:
The following example demonstrates how to estimate the accuracy of a linear kernel support vector machine on the iris dataset by splitting the data, fitting a model and computing the score 5 consecutive times (with different splits each time):
http://scikit-learn.org/stable/modules/cross_validation.html#computing-cross-validated-metrics
And yes, the returned numbers reflect multiple runs:
Returns: Array of scores of the estimator for each run of the cross validation.
http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html#sklearn.model_selection.cross_val_score
Finally, there is no reason to expect that the first result is the largest:
from sklearn.model_selection import cross_val_score
from sklearn import datasets
from sklearn.neural_network import MLPRegressor
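# Note: load_boston was removed in scikit-learn 1.2; on newer versions substitute another regression dataset, e.g. datasets.fetch_california_housing().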
boston = datasets.load_boston()
est = MLPRegressor(hidden_layer_sizes=(120,100), max_iter=700, learning_rate_init=0.0001)
cross_val_score(est, boston.data, boston.target, cv=5)
# Output
array([-0.5611023 , -0.48681641, -0.23720267, -0.19525727, -4.23935449])
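To summarise those five numbers (a small addition; the default score of cross_val_score for a regressor is R^2), it is common to report their mean and spread rather than reading the folds in order:
scores = cross_val_score(est, boston.data, boston.target, cv=5)
print("R^2: %0.3f (+/- %0.3f)" % (scores.mean(), scores.std()))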
