max_samples hyperparameter in PU bagging for a highly imbalanced dataset

I am using the credit card fraud dataset (link below). It is highly imbalanced: the positive class has only 492 instances and the negative class has 284,315 instances.
I am applying PU bagging (link below) to extract hidden positives from the negative class, i.e. negative instances whose properties/values resemble those of positive instances. For the max_samples hyperparameter I was passing sum(y), which worked. For testing purposes I set max_samples=1000 just to check whether it raises an error, but it does not. If I pass max_samples=1000, that should mean taking 1000 samples from both classes, yet no error was raised. I also tested values smaller than 492, such as 30, and it still worked, even with bootstrap and oob_score set to False. I also tried passing max_samples as a list like [492, 492], but a list is not accepted.
I want the classifier to take 492 samples from each class, as in [492, 492], but I don't know whether that is what it is actually doing.
Link for the dataset: https://machinelearningmastery.com/standard-machine-learning-datasets-for-imbalanced-classification/
Link for pu_bagging code: https://github.com/roywright/pu_learning/blob/master/baggingPU.py
My code is:
# importing and preprocessing
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.model_selection import train_test_split

df_bank = pd.read_csv('testingcreditcard.csv')
y_bank = df_bank['labels']
df_bank.drop(['labels'], axis=1, inplace=True)

# class counts
unique, counts = np.unique(y_bank, return_counts=True)
dict(zip(unique, counts))

# PU bagging
from sklearn.ensemble import RandomForestClassifier
from baggingPU import BaggingClassifierPU
bc = BaggingClassifierPU(RandomForestClassifier(), n_estimators=200, n_jobs=-1, max_samples=30)
bc.fit(df_bank, y_bank)

# predictions
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
rpredd = bc.predict(df_bank)
print(confusion_matrix(y_bank, rpredd))
print(accuracy_score(y_bank, rpredd))
print(classification_report(y_bank, rpredd))
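One way to see what max_samples actually did is to inspect which rows each base estimator was trained on after fitting. This is a minimal sketch under the assumption that BaggingClassifierPU, being a fork of scikit-learn's BaggingClassifier, exposes the same estimators_samples_ attribute (one entry per base estimator):

# sketch: assumes bc exposes estimators_samples_ like sklearn's BaggingClassifier
import numpy as np

for i, s in enumerate(bc.estimators_samples_[:5]):
    s = np.asarray(s)
    # depending on the fork's version, s is either an index array or a boolean mask
    idx = np.where(s)[0] if s.dtype == bool else s
    # count how many positives/negatives each base estimator saw
    print(i, np.unique(y_bank.iloc[idx], return_counts=True))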

Related

Clustering on 97 features of categorical data

I am trying to apply unsupervised learning to a dataset with 97 features and around 6,500 rows/samples. All features hold discrete data (mostly on a 1-10 scale), with some being binary (0/1). What are some of the best clustering algorithms to apply to this data? Thank you!
It's impossible to say which clustering algorithm will perform best on a given dataset; you just have to try several approaches and inspect the results you get. Here are several clustering algorithms you can try.
https://github.com/ASH-WICUS/Notebooks/blob/master/Clustering%20Algorithms%20Compared.ipynb
Here is a small sample.
import statsmodels.api as sm
import pandas as pd
from sklearn.cluster import KMeans
from matplotlib import pyplot

# load the mtcars dataset
mtcars = sm.datasets.get_rdataset("mtcars", "datasets", cache=True).data
df_cars = pd.DataFrame(mtcars)
df_cars.head()

# define the dataset: cluster on two numeric features
# (.copy() avoids pandas' SettingWithCopyWarning when adding a column below)
X = df_cars[['mpg', 'hp']].copy()

# define and fit the model
model = KMeans(n_clusters=8)
model.fit(X)

# assign a cluster to each example
yhat = model.predict(X)
X['kmeans'] = yhat

# plot the points, colored by cluster number
pyplot.scatter(X['mpg'], X['hp'], c=X['kmeans'], cmap='rainbow', s=50, alpha=0.8)
pyplot.show()

# interactive version of the same plot
# (note: the cluster column lives on X, not df_cars)
import plotly.express as px
fig = px.scatter(X, x="hp", y="mpg", color="kmeans", size='mpg', hover_data=['kmeans'])
fig.show()
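Since the only real answer is to try several algorithms and compare, here is a hedged sketch of one way to compare them numerically with silhouette_score; the algorithm list and cluster count are illustrative assumptions, not recommendations:

from sklearn.cluster import KMeans, AgglomerativeClustering, Birch
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# scale first: distance-based clustering is sensitive to feature scale
X_scaled = StandardScaler().fit_transform(X[['mpg', 'hp']])

candidates = {
    'kmeans': KMeans(n_clusters=3),
    'agglomerative': AgglomerativeClustering(n_clusters=3),
    'birch': Birch(n_clusters=3),
}
for name, algo in candidates.items():
    labels = algo.fit_predict(X_scaled)
    # silhouette ranges from -1 to 1; higher means better-separated clusters
    print(name, silhouette_score(X_scaled, labels))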

How to know if my data has been scaled by StandardScaler?

"I have scaled my dataset by using Standard Scaler , Now how to know it has been scaled, I am sure it has been scaled but how to see it"
As #Coderji said you can always find out the mean and standard deviation, which should be equal to 0 and 1 respectively.
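For a quick numeric check, here is a minimal self-contained sketch (the column-wise means won't be exactly zero due to floating point):

import numpy as np
from sklearn import datasets
from sklearn.preprocessing import StandardScaler

X = datasets.load_iris().data
x = StandardScaler().fit_transform(X)

# per-feature mean should be ~0 and per-feature std ~1 after scaling
print(np.round(x.mean(axis=0), 6))  # ~[0, 0, 0, 0]
print(np.round(x.std(axis=0), 6))   # ~[1, 1, 1, 1]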
However, there is another method: visualize it.
from sklearn import datasets
import numpy as np
from sklearn.preprocessing import StandardScaler
I am using the iris dataset for this example.
iris = datasets.load_iris()
X = iris.data
sc = StandardScaler()
sc.fit(X)
x = sc.transform(X)
import matplotlib.pyplot as plt
import seaborn as sns
# distplot is deprecated in recent seaborn; histplot(..., kde=True) is the replacement
sns.histplot(x[:, 1], kde=True)
See the output for sepal width (column 1): the distribution is centered at 0.
Similarly, you can check every variable, or a simple pairplot will do the job.
This gives a visual indication that the data has been standardised.

Accuracy metric isn't working on linear regression

Kindly help here:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
X = [[1.1],[1.3],[1.5],[2],[2.2],[2.9],[3],[3.2],[3.2],[3.7],[3.9],[4],[4],[4.1],[4.5],[4.9],[5.1],[5.3],[5.9],[6],[6.8],[7.1],[7.9],[8.2],[8.7],[9],[9.5],[9.6],[10.3],[10.5]]
y = [39343,46205,37731,43525,39891,56642,60150,54445,64445,57189,63218,55794,56957,57081,61111,67938,66029,83088,81363,93940,91738,98273,101302,113812,109431,105582,116969,112635,122391,121872]
#implement the dataset for train & test
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed; use model_selection
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size = 1/3,random_state=0)
#implement our classifier based on Simple Linear Regression
from sklearn.linear_model import LinearRegression
SimpleLinearRegression = LinearRegression()
SimpleLinearRegression.fit(X_train,y_train)
y_predict= SimpleLinearRegression.predict(X_test)
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test,y_predict))
I'm sure I'm missing something here. Is there some other way to calculate an accuracy score for regression? Thanks in advance :)
Accuracy as a metric is applicable to classification problems, since it is defined as the fraction of labels predicted correctly. In your case you are doing regression (LinearRegression), i.e. your target variable is continuous. So either you picked the wrong model by mistake, or accuracy is the wrong metric for your problem.
You can use mean absolute error and mean squared error.
from sklearn.metrics import mean_absolute_error, mean_squared_error
import numpy as np
MAE = mean_absolute_error(y_test, y_predict)
RMSE = np.sqrt(mean_squared_error(y_test, y_predict))
We can't use accuracy for regression problems; it is only used for classification problems.
You can use MSE, RMSE, MAPE, or MAE as the metric to determine how good your regression model is.
These values tell us how far we are from the correct predictions; lower values are better here.
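A small sketch computing those four metrics, assuming scikit-learn >= 0.24 (which added mean_absolute_percentage_error):

import numpy as np
from sklearn.metrics import (mean_absolute_error, mean_squared_error,
                             mean_absolute_percentage_error)

mae = mean_absolute_error(y_test, y_predict)
mse = mean_squared_error(y_test, y_predict)
rmse = np.sqrt(mse)
mape = mean_absolute_percentage_error(y_test, y_predict)  # a fraction, not a percent
print(f"MAE={mae:.1f}  MSE={mse:.1f}  RMSE={rmse:.1f}  MAPE={mape:.3f}")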

scikit-learn cross validation score in regression

I'm trying to build a regression model, validate and test it and make sure it doesn't overfit the data. This is my code thus far:
from pandas import read_csv
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split, cross_val_score, validation_curve
import numpy as np
import matplotlib.pyplot as plt
data = np.array(read_csv('timeseries_8_2.csv', index_col=0))
inputs = data[:, :8]
targets = data[:, 8:]
x_train, x_test, y_train, y_test = train_test_split(
    inputs, targets, test_size=0.1, random_state=2)
rate1 = 0.005
rate2 = 0.1
mlpr = MLPRegressor(hidden_layer_sizes=(12,10), max_iter=700, learning_rate_init=rate1)
# trained = mlpr.fit(x_train, y_train) # should I fit before cross val?
# predicted = mlpr.predict(x_test)
scores = cross_val_score(mlpr, inputs, targets, cv=5)
print(scores)
scores prints an array of 5 numbers, where the first number is usually around 0.91 and is always the largest in the array.
I'm having a little trouble figuring out what to do with these numbers. If the first number is the largest, does that mean the model scored highest on the first cross-validation fold and the scores decreased on subsequent folds?
Also, should I fit the training data before I call the cross-validation function? I tried commenting that out and got more or less the same results.
The cross validation function performs the model fitting as part of the operation, so you gain nothing from doing that by hand:
The following example demonstrates how to estimate the accuracy of a linear kernel support vector machine on the iris dataset by splitting the data, fitting a model and computing the score 5 consecutive times (with different splits each time):
http://scikit-learn.org/stable/modules/cross_validation.html#computing-cross-validated-metrics
And yes, the returned numbers reflect multiple runs:
Returns: Array of scores of the estimator for each run of the cross validation.
http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html#sklearn.model_selection.cross_val_score
Finally, there is no reason to expect that the first result is the largest:
from sklearn.model_selection import cross_val_score
from sklearn import datasets
from sklearn.neural_network import MLPRegressor
boston = datasets.load_boston()  # note: load_boston was removed in scikit-learn 1.2
est = MLPRegressor(hidden_layer_sizes=(120,100), max_iter=700, learning_rate_init=0.0001)
cross_val_score(est, boston.data, boston.target, cv=5)
# Output
array([-0.5611023 , -0.48681641, -0.23720267, -0.19525727, -4.23935449])
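When reporting, the usual practice (per the scikit-learn docs) is to summarise the folds with their mean and standard deviation rather than reading meaning into their order; a minimal sketch:

scores = cross_val_score(est, boston.data, boston.target, cv=5)
# summarise all folds; fold order carries no meaning
print("%0.2f (+/- %0.2f)" % (scores.mean(), scores.std()))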

Bagging using random forest classifier in sklearn

I built a random forest and I want to find the out-of-bag score, but my out-of-bag score comes out as 1.0 when it should be less than 1. My sample consists of 20,000 elements. Here is the Python code; please tell me what changes to make. Here X is a numpy array of the data and Z contains the true labels.
import csv
import numpy as np
from sklearn import preprocessing
from sklearn.ensemble import RandomForestClassifier

# raw string so the backslashes in the Windows path are not treated as escape sequences
with open(r'C:\Users\Harsh Bhandari\Desktop\letter.csv') as f:
    reader = csv.reader(f, delimiter='\t')
    # first column is the letter label; the remaining 16 columns are integer features
    data = [(row[0], *(int(v) for v in row[1:])) for row in reader]

X = []
Y = []
for row in data[:20000]:
    X.append(row[1:])
    Y.append(row[0])
X = np.asarray(X)
Y = np.asarray(Y)

le = preprocessing.LabelEncoder()
Z = le.fit_transform(Y)

clf = RandomForestClassifier(n_estimators=100, oob_score=True)
clf = clf.fit(X, Z)
a = clf.predict(X)
scores = clf.score(X, a)
print(scores)
http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
In score you are supposed to pass the test data and its actual labels; here you are passing the predicted labels themselves, which of course match the prediction, hence the 1.0 score.
I see a couple of things here.
You are doing clf.score(X, a), but you should be doing clf.score(X, Z), where Z is the true label for X.
The score method is defined as clf.score(X, true_labels_for_X). You instead passed the values you predicted as y_true, which doesn't make sense: sklearn will already run predict on X, so you don't need to pass a.
Also, you can find the OOB score by doing
print(clf.oob_score_)
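Putting the two fixes together, a minimal sketch continuing from the code above:

# score against the true labels, not the predictions
train_accuracy = clf.score(X, Z)
# the out-of-bag estimate is computed during fit when oob_score=True
print(train_accuracy, clf.oob_score_)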
