How to show kdeplot in a 5*4 subplot? - machine-learning

I am working on a machine learning project and am using seaborn's kdeplot to show the distribution of each feature after applying StandardScaler. However, no matter how large I make the figure, the graphs do not render and I get the error: AttributeError: 'numpy.ndarray' object has no attribute 'plot'. The image I want to produce is a 5*4 grid of subplots that looks like this:
expected subplot image
#feature scaling
#since numerical attributes have very different scales,
#we use standardization to get all attributes to have the same scale
import pandas as pd
import numpy as np
from sklearn import preprocessing
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
matplotlib.style.use('ggplot')
scaler = preprocessing.StandardScaler()
scaled_df = scaler.fit_transform(train_set)
scaled_df = pd.DataFrame(scaled_df, columns=["SaleAmount","SaleCount","ReturnAmount","ReturnCount",
"KeyedAmount","KeyedCount","VoidRejectAmount","VoidRejectCount","RetrievalAmount",
"RetrievalCount","ChargebackAmount","ChargebackCount","DepositAmount","DepositCount",
"NetDeposit","AuthorizationAmount","AuthorizationCount","DeclinedAuthorizationAmount","DeclinedAuthorizationCount"])
fig, axes = plt.subplots(figsize=(20,10), ncols=5, nrows=4)
sns.kdeplot(scaled_df['SaleAmount'], ax=axes[0])
sns.kdeplot(scaled_df['SaleCount'], ax=axes[1])
sns.kdeplot(scaled_df['ReturnAmount'], ax=axes[2])
sns.kdeplot(scaled_df['ReturnCount'], ax=axes[3])
sns.kdeplot(scaled_df['KeyedAmount'], ax=axes[4])
sns.kdeplot(scaled_df['KeyedCount'], ax=axes[5])
sns.kdeplot(scaled_df['VoidRejectAmount'], ax=axes[6])
sns.kdeplot(scaled_df['VoidRejectCount'], ax=axes[7])
sns.kdeplot(scaled_df['RetrievalAmount'], ax=axes[8])
sns.kdeplot(scaled_df['RetrievalCount'], ax=axes[9])
sns.kdeplot(scaled_df['ChargebackAmount'], ax=axes[10])
sns.kdeplot(scaled_df['ChargebackCount'], ax=axes[11])
sns.kdeplot(scaled_df['DepositAmount'], ax=axes[12])
sns.kdeplot(scaled_df['DepositCount'], ax=axes[13])
sns.kdeplot(scaled_df['NetDeposit'], ax=axes[14])
sns.kdeplot(scaled_df['AuthorizationAmount'], ax=axes[15])
sns.kdeplot(scaled_df['AuthorizationCount'], ax=axes[16])
sns.kdeplot(scaled_df['DeclinedAuthorizationAmount'], ax=axes[17])
sns.kdeplot(scaled_df['DeclinedAuthorizationCount'], ax=axes[18])

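plt.subplots with nrows=4 and ncols=5 returns a two-dimensional array of axes with shape (4, 5), so each subplot has to be addressed with a row index and a column index rather than a single integer. For example, the 19th plot goes in the last row, fourth column:
sns.kdeplot(scaled_df['DeclinedAuthorizationCount'], ax=axes[3, 3])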

Related

create error bars for random forest regression

I'm new to the world of machine learning and more generally to AI.
I am analyzing a dataset containing characteristics of different houses and their prices using Python and JupyterLab.
Here is the dataset in use:
https://www.kaggle.com/datasets/harlfoxem/housesalesprediction
I applied random forest (scikit-learn) on this dataset and now I would like to plot the error bars of the model.
Specifically, I'm using the ForestCI package and applying exactly this code to my case:
http://contrib.scikit-learn.org/forest-confidence-interval/auto_examples/plot_mpg.html
This is my code:
# Regression Forest Example
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn import metrics
from sklearn.metrics import r2_score
import numpy as np
from matplotlib import pyplot as plt
from sklearn.ensemble import RandomForestRegressor
import sklearn.model_selection as xval
import forestci as fci
#import dataset
mpg_data = pd.read_csv(path_to_dataset)
#drop some useless features
mpg_data=mpg_data.drop('date', axis=1)
mpg_data=mpg_data.drop('yr_built', axis=1)
mpg_data = mpg_data.drop(["id"],axis=1)
#separate mpg data into predictors and outcome variable
mpg_X = mpg_data.drop(labels='price', axis=1)
mpg_y = mpg_data['price']
# remove rows where the data is nan
not_null_sel = np.where(mpg_X.isna().sum(axis=1).values == 0)
mpg_X = mpg_X.values[not_null_sel]
mpg_y = mpg_y.values[not_null_sel]
# split mpg data into training and test set
mpg_X_train, mpg_X_test, mpg_y_train, mpg_y_test = xval.train_test_split(
mpg_X,
mpg_y,
test_size=0.25,
random_state=42)
# Create RandomForestRegressor
mpg_forest = RandomForestRegressor(random_state=42)
mpg_forest.fit(mpg_X_train, mpg_y_train)
mpg_y_hat = mpg_forest.predict(mpg_X_test)
# Plot predicted MPG without error bars
plt.scatter(mpg_y_test, mpg_y_hat)
plt.xlabel('Reported MPG')
plt.ylabel('Predicted MPG')
plt.show()
print(r2_score(mpg_y_test, mpg_y_hat))
# Calculate the variance
mpg_V_IJ_unbiased = fci.random_forest_error(mpg_forest, mpg_X_train,
mpg_X_test)
# Plot error bars for predicted MPG using unbiased variance
plt.errorbar(mpg_y_test, mpg_y_hat, yerr=np.sqrt(mpg_V_IJ_unbiased), fmt='o')
plt.xlabel('Reported MPG')
plt.ylabel('Predicted MPG')
plt.show()
It seems to work but when the graphs are plotted, neither the error bar nor the prediction line appears:
Instead, as visible in the documentation, it should look like the picture here: http://contrib.scikit-learn.org/forest-confidence-interval/auto_examples/plot_mpg.html
You forgot to add this line:
plt.plot([5, 45], [5, 45], 'k--')
Your code should look like this:
plt.errorbar(mpg_y_test, mpg_y_hat, yerr=np.sqrt(mpg_V_IJ_unbiased), fmt='o')
plt.plot([5, 45], [5, 45], 'k--')
plt.xlabel('Reported MPG')
plt.ylabel('Predicted MPG')
plt.show()
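Note that the endpoints 5 and 45 in the answer above come from the MPG example, where the values lie roughly in that range. House prices are on a much larger scale, so a diagonal drawn between 5 and 45 would be practically invisible. A minimal sketch that takes the endpoints from the data instead is given here; the random arrays are made-up stand-ins for mpg_y_test, mpg_y_hat and mpg_V_IJ_unbiased. It also clips the variance at zero, on the assumption that some estimates might be negative: np.sqrt of a negative number is NaN, and an error bar with a NaN length is not drawn.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line inside Jupyter
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(42)
# Made-up stand-ins on a house-price scale (not the real dataset).
y_test = rng.uniform(100_000, 1_000_000, size=50)
y_hat = y_test + rng.normal(0, 50_000, size=50)
variance = rng.normal(2.5e9, 1.5e9, size=50)  # may contain negative entries

# Clip negative variance estimates at zero so np.sqrt never returns NaN.
y_err = np.sqrt(np.clip(variance, 0, None))

# Diagonal endpoints taken from the data instead of the MPG-specific 5 and 45.
lo = min(y_test.min(), y_hat.min())
hi = max(y_test.max(), y_hat.max())

fig, ax = plt.subplots()
ax.errorbar(y_test, y_hat, yerr=y_err, fmt='o')
ax.plot([lo, hi], [lo, hi], 'k--')
ax.set_xlabel('Reported price')
ax.set_ylabel('Predicted price')
```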

Linear Regression script not working in Python

I tried running my Machine Learning LinearRegression code, but it is not working. Here is the code:
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import pandas as pd
df = pd.read_csv(r'C:\Users\SVISHWANATH\Downloads\datasets\GGP_data.csv')
df["OHLC"] = (df.open+df.high+df.low+df.close)/4
df['HLC'] = (df.high+df.low+df.close)/3
df.index = df.index+1
reg = LinearRegression()
reg.fit(df.index, df.OHLC)
Basically, I just imported a few libraries, used the read_csv function, and called the LinearRegression() function, and this is the error:
ValueError: Expected 2D array, got 1D array instead:
array=[ 1 2 3 ... 1257 1258 1259].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or
array.reshape(1, -1) if it contains a single sample
Thanks!
As mentioned in the error message, you need to give the fit method a 2D array.
df.index is a 1D array. You can do it this way:
reg.fit(df.index.values.reshape(-1, 1), df.OHLC)
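To make the shape requirement concrete, here is a small self-contained sketch. The five OHLC values are made up (the asker's CSV is not available), chosen so that they lie exactly on a straight line.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up stand-in for the asker's DataFrame, with a 1-based index as in the question.
df = pd.DataFrame({"OHLC": [3.0, 5.0, 7.0, 9.0, 11.0]})
df.index = df.index + 1

X = df.index.values.reshape(-1, 1)  # shape (5, 1): five samples, one feature
y = df["OHLC"].values               # shape (5,): a 1D target is accepted

reg = LinearRegression()
reg.fit(X, y)
print(X.shape, reg.coef_, reg.intercept_)
```

The -1 in reshape(-1, 1) tells NumPy to infer the number of rows, so the same line works however long the index is.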

How to display categorical values on export tree image of decision tree classifier?

I am trying to export the decision tree as an image with the original labels of all categorical fields.
The current data I have is like so:
I transformed the categorical features into numerical:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('data.csv')
X = dataset.iloc[:, 0:4]
y = dataset.iloc[:, 4]
from sklearn.preprocessing import LabelEncoder
lb = LabelEncoder()
X['Outlook'] = lb.fit_transform(X['Outlook'])
X['Temp'] = lb.fit_transform(X['Temp'])
X['Humidity'] = lb.fit_transform(X['Humidity'])
X['Windy'] = lb.fit_transform(X['Windy'])
y = lb.fit_transform(y)
Afterwards, I applied the DecisionTreeClassifier:
from sklearn.tree import DecisionTreeClassifier
dtc = DecisionTreeClassifier(criterion="entropy")
dtc.fit(X, y)
At the end, I needed to check the tree generated from the model using the following:
# Import tools needed for visualization
from sklearn.tree import export_graphviz
import pydot
# Pull out one tree from the forest
# Export the image to a dot file
export_graphviz(dtc, out_file = 'tree.dot', feature_names = X.columns, rounded = True, precision = 1)
# Use dot file to create a graph
(graph, ) = pydot.graph_from_dot_file('tree.dot')
# Write graph to a png file
graph.write_png('tree.png')
The tree.png:
But what I really need is to see the original labels of each feature inside the nodes or on each branch, instead of true/false or a numeric representation.
I tried the following:
y=lb.inverse_transform(y)
And the same for X features, but the tree is being generated the same as above.
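One possible direction, which is not from the original thread, is to one-hot encode the features so that each column name carries its category, and to pass the original class labels through class_names. The sketch below assumes a tiny made-up table in place of data.csv; the column and variable names (Play, target_encoder) are hypothetical. scikit-learn trees always split on numeric thresholds, so the nodes will still read like "Outlook_sunny <= 0.5", but the category is at least visible in the name.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Made-up stand-in for data.csv (the real file is not shown in the question).
dataset = pd.DataFrame({
    "Outlook": ["sunny", "sunny", "overcast", "rainy", "rainy", "overcast"],
    "Windy": ["no", "yes", "no", "no", "yes", "yes"],
    "Play": ["no", "no", "yes", "yes", "no", "yes"],
})

# One-hot encode the features so every column name carries its category.
X = pd.get_dummies(dataset[["Outlook", "Windy"]], dtype=int)

# A dedicated encoder for the target keeps the original class labels available.
target_encoder = LabelEncoder()
y = target_encoder.fit_transform(dataset["Play"])

dtc = DecisionTreeClassifier(criterion="entropy", random_state=0)
dtc.fit(X, y)

# out_file=None returns the DOT source as a string instead of writing a file.
dot_source = export_graphviz(
    dtc,
    out_file=None,
    feature_names=list(X.columns),
    class_names=[str(c) for c in target_encoder.classes_],
    rounded=True,
    precision=1,
)
print(dot_source)
```

The returned string can be written to tree.dot and rendered with pydot exactly as in the question.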

Keras: Model Compilation Giving "Index 200005 is out of bounds for axis 0 with size 200000" Error

I'm using the Jena Climate data that my book links to:
https://s3.amazonaws.com/keras-datasets/jena_climate_2009_2016.csv.zip
I tried messing with it but I have no clue why the index is surpassing 200000. I'm not sure why it gets to 200005 since my training data is 200001 observations long.
I've also gotten an error that said, " Index 200000 is out of bounds for axis 0 with size 200000."
The data is 420551x14 of weather data. My code is as follows:
import pandas as pd
import numpy as np
import keras
data = pd.read_csv("D:\\School\\Spring_2019\\GraduateProject\\jena_climate_2009_2016_Data\\jena_climate_2009_2016.csv")
data = data.iloc[:,data.columns!='Date Time']
data
# Standardize the Data
from sklearn import preprocessing
data = preprocessing.scale(data[:200000])
# Build Generators
from keras.preprocessing.sequence import TimeseriesGenerator
target = data[:,1] # Should target be scaled?
# ? Do I need to remove targets from the data variable?
trainGen = TimeseriesGenerator(data,targets=target,length=1440,
sampling_rate=6,
batch_size=190,
start_index=0,
end_index=200000)
valGen = TimeseriesGenerator(data,targets=target,length=1440,
sampling_rate=6,
batch_size=190,
start_index=199999,
end_index=300000)
testGen = TimeseriesGenerator(data,targets=target,length=6,
batch_size=128,
start_index=300000,
end_index=420550)
from keras.models import Sequential
from keras import layers
from keras.optimizers import RMSprop
from keras.layers import LSTM
#Flatten part is: 240 = lookback//step. This is 1440/6 because we are looking at
model = Sequential()
model.add(layers.Flatten(input_shape=(240,data.shape[-1])))
model.add(layers.Dense(32,activation='relu'))
model.add(layers.Dense(1))
val_steps = 300000-200001-1440
model.compile(optimizer=RMSprop(),loss='mae')
history = model.fit_generator(trainGen,
steps_per_epoch=250,
epochs=20,
validation_data=valGen,
validation_steps=val_steps)
Let me know if you need anything else and thank you greatly in advance.
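Well, you've only selected the first 200000 rows for your data (data = preprocessing.scale(data[:200000])), so the valid indices run from 0 to 199999. The validation and test generators ask for indices up to 300000 and 420550, which do not exist, and even end_index=200000 in the training generator points one past the last row.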
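One way to keep all rows while still standardizing with training statistics only is sketched below in plain NumPy. The array sizes are shrunk and the data is random, so n_train and the split boundaries are illustrative assumptions rather than the asker's actual values.

```python
import numpy as np

# Small made-up array standing in for the 420551 x 14 weather data.
rng = np.random.default_rng(0)
raw = rng.normal(loc=10.0, scale=3.0, size=(1000, 14))
n_train = 500  # plays the role of 200000 in the question

# Compute the statistics on the training rows only, then apply them to every row.
mean = raw[:n_train].mean(axis=0)
std = raw[:n_train].std(axis=0)
data = (raw - mean) / std

# Index ranges for the three splits; every index must be smaller than len(data).
splits = {
    "train": (0, n_train - 1),
    "val": (n_train, 749),
    "test": (750, len(data) - 1),
}
for name, (start, end) in splits.items():
    assert end < len(data), f"{name} end index {end} is out of bounds"
    print(name, start, end)
```

Because data now has as many rows as the raw array, start and end indices for the validation and test generators can refer to rows beyond the training range without running off the end.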

BaggingClassifier take all dataset each time

from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, RandomForestClassifier
import numpy as np
import random
from sklearn.svm import SVC
X=np.random.rand(1000,2)
Y=[random.randint(0,1) for x in range(0,1000)]
svm=BaggingClassifier(SVC(kernel='rbf', random_state=123, gamma=.000001, C=100000, class_weight='balanced'), max_samples=1/5.0, n_estimators=5, n_jobs=-1,random_state=123)
classfier=svm.fit(X,Y)
print(len(svm.estimators_samples_))
print(len(svm.estimators_samples_[0]))  # here I expect 0.2*1000 = 200 samples, but the result is 1000.
In this code, I try to apply BaggingClassifier with an SVM. According to the scikit-learn documentation, max_samples sets the maximum number of samples used to train each estimator. However, I notice that each of the 5 estimators (n_estimators=5) appears to take the whole dataset. Is this a bug?
svm.estimators_samples_[0] returns an array whose length equals the number of rows in the data. The array holds boolean values: the entries equal to True mark, by index, the data points that were used to train that estimator.
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, RandomForestClassifier
import numpy as np
import random
from sklearn.svm import SVC
X=np.random.rand(1000,2)
Y=[random.randint(0,1) for x in range(0,1000)]
svm=BaggingClassifier(SVC(kernel='rbf', random_state=123, gamma=.000001, C=100000, class_weight='balanced'), max_samples=1/5.0, n_estimators=5, n_jobs=-1,random_state=123)
classfier=svm.fit(X,Y)
print(len([i for i in svm.estimators_samples_[0] if i == True]))
Running the above code I get:
181
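The count is below 200 because bagging draws with replacement by default, so some rows are picked more than once. As far as I know, in some scikit-learn versions the entries of estimators_samples_ are arrays of drawn row indices rather than boolean masks, so the sketch below handles both layouts. The data is random and the variable names (bag, n_unique) are just for illustration.

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(123)
X = rng.random((1000, 2))
y = rng.integers(0, 2, size=1000)

bag = BaggingClassifier(SVC(kernel='rbf'), max_samples=1 / 5.0,
                        n_estimators=5, random_state=123)
bag.fit(X, y)

samples = np.asarray(bag.estimators_samples_[0])
if samples.dtype == bool:
    # Boolean mask over all rows: True marks a row that was drawn at least once.
    n_unique = int(samples.sum())
else:
    # Array of drawn row indices, possibly with repeats because of bootstrapping.
    n_unique = len(np.unique(samples))

n_draws = 200  # max_samples * number of rows = 0.2 * 1000
print(n_unique, "distinct rows out of", n_draws, "draws")
```

Passing bootstrap=False to BaggingClassifier makes each estimator sample without replacement, in which case the number of distinct rows equals the number of draws.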
