Clustering on 97 features of categorical data - machine-learning

I am trying to apply unsupervised learning to a dataset with 97 features and around 6500 rows/samples. All features are discrete (mostly values from 1 to 10), and some are binary (0/1). What are some good clustering algorithms to apply to this data? Thank you!

It's impossible to say in advance which clustering algorithm will perform best on your dataset. You have to try several methods and inspect the results you get. Here is a notebook comparing several clustering algorithms that you can try.
https://github.com/ASH-WICUS/Notebooks/blob/master/Clustering%20Algorithms%20Compared.ipynb
Here is a small sample.
import statsmodels.api as sm
from sklearn.cluster import KMeans
from matplotlib import pyplot
import plotly.express as px
# load the mtcars dataset (already a pandas DataFrame)
df_cars = sm.datasets.get_rdataset("mtcars", "datasets", cache=True).data
df_cars.head()
# define dataset
X = df_cars[['mpg','hp']]
# define the model
model = KMeans(n_clusters=8)
# fit the model
model.fit(X)
# assign a cluster to each example and store it on the full DataFrame
df_cars['kmeans'] = model.predict(X)
# plot X & Y coordinates and color by cluster number
pyplot.scatter(df_cars['mpg'], df_cars['hp'], c=df_cars['kmeans'], cmap='rainbow', s=50, alpha=0.8)
pyplot.show()
# same plot, interactive, with plotly
fig = px.scatter(df_cars, x="hp", y="mpg", color="kmeans", size='mpg', hover_data=['kmeans'])
fig.show()
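Since the question is about discrete features, one of the methods worth trying is a k-modes style clustering, which replaces Euclidean distance with the number of mismatching features (Hamming distance) and replaces the mean with the per-feature mode. The following is only a minimal sketch of that idea in NumPy, not a tuned implementation; the function name k_modes, its parameters, and the random toy data are assumptions made for illustration.

```python
import numpy as np

def k_modes(X, n_clusters, n_iter=20, seed=0):
    """Tiny k-modes sketch: Hamming distance plus per-feature mode update."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X)
    # start from n_clusters rows picked at random, without replacement
    modes = X[rng.choice(len(X), size=n_clusters, replace=False)].copy()
    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        # Hamming distance: number of features where a row differs from each mode
        dist = (X[:, None, :] != modes[None, :, :]).sum(axis=2)
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments stopped changing
        labels = new_labels
        for k in range(n_clusters):
            members = X[labels == k]
            if len(members) == 0:
                continue  # keep the old mode for an empty cluster
            # most frequent value in each column (non-negative integers assumed)
            modes[k] = [np.bincount(col).argmax() for col in members.T]
    return labels, modes

# toy data shaped like the question: discrete values from 1 to 10, 97 features
rng = np.random.default_rng(42)
X_toy = rng.integers(1, 11, size=(200, 97))
labels, modes = k_modes(X_toy, n_clusters=4)
print(np.bincount(labels, minlength=4))  # cluster sizes
```

The number of clusters still has to be chosen by inspection, as with KMeans above.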


Max_samples hyperparameter in PU bagging for highly imbalanced dataset

I am using the credit card fraud dataset (link below). It is highly imbalanced: the positive class has only 492 instances and the negative class has 284315 instances.
I am applying PU bagging (link below) to it to extract hidden positives from the negative class, i.e. negative instances whose feature values resemble those of positive instances. For the max_samples hyperparameter I first passed sum(y), which worked. Just for testing, I then set max_samples to 1000, expecting an error, but there was none. If max_samples=1000 means taking 1000 samples from each class, it should have failed, because the positive class has only 492 instances. Values below 492, such as 30, also worked, and setting bootstrap and oob_score to False still produced no error. Passing max_samples as a list like [492,492] is not accepted.
I want the classifier to take 492 samples from each class, as in [492,492], but I don't know whether it is doing that.
Link for the dataset: https://machinelearningmastery.com/standard-machine-learning-datasets-for-imbalanced-classification/
Link for pu_bagging code: https://github.com/roywright/pu_learning/blob/master/baggingPU.py
My code is:
#importing and preprocessing
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.model_selection import train_test_split
df_bank=pd.read_csv('testingcreditcard.csv')
y_bank=df_bank['labels']
df_bank.drop(['labels'],axis=1,inplace=True)
#counter
unique, counts = np.unique(y_bank, return_counts=True)
dict(zip(unique, counts))
#Pu_bagging
from sklearn.ensemble import RandomForestClassifier
from baggingPU import BaggingClassifierPU
bc = BaggingClassifierPU(RandomForestClassifier(), n_estimators = 200, n_jobs = -1, max_samples = 30 )
bc.fit(df_bank, y_bank)
#Predictions
from sklearn.metrics import accuracy_score,confusion_matrix,classification_report
rpredd=bc.predict(df_bank)
print(confusion_matrix(y_bank,rpredd))
print(accuracy_score(y_bank,rpredd))
print(classification_report(y_bank,rpredd))
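To make "492 samples from each class" concrete, here is a minimal sketch of drawing such a balanced subsample by hand with NumPy. It is independent of BaggingClassifierPU and does not show how that class samples internally; the helper name balanced_subsample_indices and the toy labels are assumptions made for illustration.

```python
import numpy as np

def balanced_subsample_indices(y, n_per_class, rng):
    """Return row indices holding n_per_class positives and n_per_class negatives."""
    y = np.asarray(y)
    pos_idx = np.flatnonzero(y == 1)
    neg_idx = np.flatnonzero(y == 0)
    # positives: drawn without replacement; negatives: bootstrap-style draw
    pos_draw = rng.choice(pos_idx, size=n_per_class, replace=False)
    neg_draw = rng.choice(neg_idx, size=n_per_class, replace=True)
    return np.concatenate([pos_draw, neg_draw])

rng = np.random.default_rng(0)
# toy labels with a similar kind of imbalance: 50 positives, 5000 negatives
y_toy = np.array([1] * 50 + [0] * 5000)
idx = balanced_subsample_indices(y_toy, n_per_class=50, rng=rng)
print(np.bincount(y_toy[idx]))  # counts of class 0 and class 1 in the subsample
```

Counting the classes in the drawn indices, as in the last line, is a direct way to check what any single bag actually contains.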

How to know if my data has been scaled by StandardScaler?

"I have scaled my dataset using StandardScaler. I am sure it has been scaled, but how can I see that?"
As @Coderji said, you can always compute the mean and standard deviation of each feature, which should be approximately 0 and 1 respectively.
However, you can also check it visually.
from sklearn import datasets
import numpy as np
from sklearn.preprocessing import StandardScaler
I am using the iris dataset for this example.
iris = datasets.load_iris()
X = iris.data
sc = StandardScaler()
sc.fit(X)
x = sc.transform(X)
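import matplotlib.pyplot as plt
import seaborn as sns
# distplot is deprecated in recent seaborn releases; histplot with kde=True gives a similar plot
sns.histplot(x[:,1], kde=True)
plt.show()
The plot shows the distribution of the scaled sepal width (column index 1), which is centred on 0.
Similarly, you can plot every variable, or a simple pairplot will do the job.
This gives a visual indication that the data is standardised.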
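The numerical check mentioned at the start (mean 0, standard deviation 1) can be sketched as a small helper. This is an illustrative sketch only: the name is_standardized and the tolerance are assumptions, and the toy data is standardised by hand with the population standard deviation (ddof=0), which is the convention StandardScaler uses.

```python
import numpy as np

def is_standardized(x, tol=1e-6):
    """True if every column has mean ~0 and population std ~1."""
    x = np.asarray(x, dtype=float)
    means = x.mean(axis=0)
    stds = x.std(axis=0)  # ddof=0, the convention StandardScaler uses
    return bool(np.all(np.abs(means) < tol) and np.all(np.abs(stds - 1) < tol))

rng = np.random.default_rng(0)
raw = rng.normal(loc=5.0, scale=3.0, size=(150, 4))
# standardise by hand: subtract the column mean, divide by the column std
scaled = (raw - raw.mean(axis=0)) / raw.std(axis=0)
print(is_standardized(raw), is_standardized(scaled))
```

With the variables from the example above, the same check would be is_standardized(x).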

How to get better accuracy from a CNN with a small number of images

I am currently working on the flower classification dataset from Kaggle, which has only 210 images. With this set of images I am getting an accuracy of only 11% on the validation set.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import cv2
#from tqdm import tqdm
import os
import warnings
warnings.filterwarnings('ignore')
flower_img = r'C:\Users\asus\Downloads\flower_images\flower_images'
data = pd.read_csv(r'C:\Users\asus\Downloads\flower_images\flower_labels.csv')
img = os.listdir(flower_img)[1]
image_name = [img.split('.')[-2] for img in os.listdir(flower_img)]
label_array = np.array(data['label'])
label_unique = np.unique(label_array)
names = [' phlox','rose','calendula','iris','leucanthemum maximum','bellflower','viola','rudbeckia laciniata','peony','aquilegia']
Flower_names = {}
for i in range(10):
    Flower_names[i] = names[i]
print(Flower_names)
Flower_names.get(8)
x = data['label'][2]
Flower_names.get(x)
i=0
for img in os.listdir(flower_img):
    #print(img)
    path = os.path.join(flower_img,img)
    #img = cv2.imread(path,cv2.IMREAD_GRAYSCALE)
    img = cv2.imread(path)
    #print(img.shape)
    img = cv2.resize(img,(128,128))
    data['file'][i] = np.array(img)
    i+=1
data['file'][0].shape
plt.imshow(data['file'][0])
plt.show()
import keras
from keras.models import Sequential
from keras.layers import Dense,Conv2D,Activation,MaxPool2D,Dropout,Flatten
model = Sequential()
model.add(Conv2D(32,kernel_size=3,activation='relu',input_shape=(128,128,3)))
model.add(MaxPool2D(pool_size=(2,2)))
model.add(Conv2D(64,kernel_size=3,activation='relu'))
model.add(MaxPool2D(pool_size=(2,2)))
model.add(Conv2D(128,kernel_size=3,activation='relu'))
model.add(MaxPool2D(pool_size=(2,2)))
#model.add(Conv2D(512,kernel_size=3,activation='relu'))
#model.add(MaxPool2D(pool_size=(2,2)))
model.add(Flatten())
model.add(Dense(512,activation='relu'))
model.add(Dense(10,activation='softmax'))
model.add(Dropout(0.25))
from keras.optimizers import Adam
model.compile(loss='categorical_crossentropy',optimizer=Adam(lr=0.002),metrics=['accuracy'])
model.summary()
x = np.array([i for i in data['file']]).reshape(-1,128,128,3)
y = np.array([i for i in data['label']])
from keras.utils import to_categorical
y = to_categorical(y)
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y)
model.fit(x_train,y_train,validation_data=(x_test,y_test),epochs=10)
model.evaluate(x_test,y_test)
model.evaluate(x_train,y_train)
How can I increase the accuracy using only this dataset? Also, how can I predict the class of an arbitrary input image?
Link of Flower color images dataset : https://www.kaggle.com/olgabelitskaya/flower-color-images
Your dataset size is very small. Convolutional neural networks are optimal when trained using very large data sets. You really want to have thousands of images (or more!) in your data set.
You can try to enhance your current data set by using various image processing techniques to increase the size of the data set. These techniques will take the original images, skew them, rotate them and do other modification to bolster the size of the training data. These techniques can be helpful, but increasing the natural size of the data set is preferred.
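As a rough illustration of this kind of augmentation, the sketch below produces flipped and rotated copies of a single image with NumPy alone. It is a minimal example under stated assumptions: the function name augment is hypothetical, the image is random pixels standing in for one 128x128 RGB picture, and only flips and 90-degree rotations are shown (no skewing).

```python
import numpy as np

def augment(image):
    """Return simple variants of one H x W x C image: flips and 90-degree rotations."""
    return [
        image,
        np.fliplr(image),      # mirror left-right
        np.flipud(image),      # mirror top-bottom
        np.rot90(image, k=1),  # rotate by 90 degrees
        np.rot90(image, k=2),  # rotate by 180 degrees
        np.rot90(image, k=3),  # rotate by 270 degrees
    ]

# stand-in for one 128x128 RGB image from the dataset (random pixels, illustration only)
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(128, 128, 3), dtype=np.uint8)
variants = np.stack(augment(image))
print(variants.shape)
```

Each variant keeps the label of its source image, and augmentation should be applied to the training split only so that the validation images stay untouched.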
If you cannot increase the size of the dataset, you should examine why you need to use a CNN. There are other algorithms that may give better results when trained with a smaller data set. Take a look at Support Vector Machines or k-NN.
If you must use a CNN, Transfer Learning is a good solution. You can use the features from a trained model and apply them to your problem. I have had great success with this approach.
Things you can do:
Progressive resizing
Image augmentation
Transfer learning
There are many more techniques that can make better use of limited data. The three above are just the major ones that came to mind within a minute; the topic is worth some dedicated research.

How to run PCA with dask_ml. I am getting an error, "This function (tsqr) supports QR decomposition in the case of tall-and-skinny matrices"?

I want to perform dimensionality reduction over data with around 3000 rows and 6000 columns. Here the number of observations (n_samples) < number of features (n_columns). I am not able to achieve the result using dask-ml whereas the same is possible through scikit learn. What modifications do I need to perform to my existing code?
#### dask_ml
from dask_ml.decomposition import PCA
from dask_ml import preprocessing
import dask.array as da
import numpy as np
train = np.random.rand(3000,6000)
train = da.from_array(train,chunks=(100,100))
complete_pca = PCA().fit(train)
#### scikit learn
from sklearn.decomposition import PCA
from sklearn import preprocessing
import numpy as np
train = np.random.rand(3000,6000)
complete_pca = PCA().fit(train)
The PCA algorithm in Dask-ML is only designed for tall-and-skinny matrices. You could try using the raw SVD algorithms in dask.array. Also, with a 3000x6000 matrix you can probably also just use a single machine.
Adding in something like Dask-ML for a problem of this size might be adding more complexity than you need. If Scikit-Learn works for you then I would stick with that.
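For reference, the "raw SVD" route amounts to centering the data and taking a thin SVD, which works when there are fewer rows than columns. The sketch below does this on a single machine with NumPy; the function name pca_via_svd and the smaller 300 x 600 stand-in matrix are assumptions for illustration, and carrying the same steps over to the SVD routines in dask.array is not shown here.

```python
import numpy as np

def pca_via_svd(data, n_components):
    """PCA of an (n_samples, n_features) array using a thin SVD of the centered data."""
    centered = data - data.mean(axis=0)
    # thin SVD works for any shape, including n_samples < n_features
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]                   # principal axes
    explained_variance = s[:n_components] ** 2 / (len(data) - 1)
    scores = u[:, :n_components] * s[:n_components]  # data projected onto the axes
    return components, explained_variance, scores

rng = np.random.default_rng(0)
train = rng.random((300, 600))  # smaller stand-in for the 3000 x 6000 matrix
components, explained_variance, scores = pca_via_svd(train, n_components=10)
print(components.shape, scores.shape)
```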

Bagging using random forest classifier in sklearn

I built a random forest and I want to find the out-of-bag score. But my out-of-bag score is coming out as 1.0, and it should be less than 1. My sample consists of 20000 elements. Here is the Python code. Please tell me what changes are needed. Here X is a numpy array holding the dataset and Z contains the true labels.
import csv
import numpy as np
from sklearn import preprocessing
from sklearn import cross_validation
from sklearn.ensemble import RandomForestClassifier
with open('C:\Users\Harsh Bhandari\Desktop\letter.csv') as f:
    reader = csv.reader(f, delimiter='\t')
    data = [(col1, int(col2), int(col3), int(col4),int(col5),int(col6),int(col7),int(col8),int(col9),int(col10),int(col11),int(col12),int(col13),int(col14),int(col15),int(col16),int(col17))
            for col1,col2,col3,col4,col5,col6,col7,col8,col9,col10,col11,col12,col13,col14,col15,col16,col17 in reader]
X=[]
Y=[]
i=0
while i<20000:
    t=data[i][1:]
    X.append(t)
    t=data[i][0]
    Y.append(t)
    i=1+i
X=np.asarray(X)
Y=np.asarray(Y)
le = preprocessing.LabelEncoder()
Z=le.fit_transform(Y)
clf = RandomForestClassifier(n_estimators=100,oob_score=True)
clf=clf.fit(X,Z)
a=clf.predict(X)
scores=clf.score(X,a)
print scores
http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
score expects the test data and its actual labels. Here you are passing the predicted labels themselves, which trivially match the prediction, hence you are getting a score of 1.0.
I see a couple of things here.
You are doing clf.score(X, a)
but you should be doing clf.score(X, Z)
where Z is the true label for X.
The score method is defined as clf.score(X, true_labels_for_X).
You instead passed the values that you predicted as y_true, which doesn't make sense. Since sklearn already runs predict on X internally, you don't need to pass a.
Also, you can get the out-of-bag score with
print clf.oob_score_
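The effect described in both answers can be reproduced without a classifier. The sketch below uses a hypothetical accuracy helper and made-up label arrays to show that scoring predictions against themselves always gives 1.0, whereas scoring against the true labels gives the real accuracy.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of positions where the two label arrays agree."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

true_labels = np.array([0, 1, 1, 0, 1])
predicted = np.array([0, 1, 0, 0, 0])   # a classifier that gets 3 of 5 right

print(accuracy(predicted, predicted))    # predictions scored against themselves
print(accuracy(true_labels, predicted))  # predictions scored against the truth
```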
