Does sklearn clustering output differ across machines? - machine-learning

I am using the sklearn AffinityPropagation clustering algorithm. The output of the clustering algorithm on my 4-core machine is different from what is generated on a typical server machine. Can someone suggest a method so that I can get the same output on both systems?
I am using the same feature vectors on both machines.
Output on my machine is cluster0:[1,2,3], cluster1:[4,5,6], but on the server it is cluster0:[1,2], cluster1:[3,4], cluster2:[5].
from keras.applications.xception import Xception
from keras.applications.xception import preprocess_input
from keras.models import Model
from sklearn.cluster import AffinityPropagation
import numpy as np
import cv2
import glob

# model_path points to the pretrained Xception weights file
base_model = Xception(weights=model_path)
# use the global average pooling layer as the feature extractor
base_model = Model(inputs=base_model.input, outputs=base_model.get_layer('avg_pool').output)

files = glob.glob("*.jpg")
image_vector = []
for f in files:
    img = cv2.imread(f)
    img = cv2.resize(img, (299, 299))                 # Xception expects 299x299 input
    img = preprocess_input(np.expand_dims(img.astype('float32'), axis=0))
    temp_vector = base_model.predict(img).flatten()   # one feature vector per image
    image_vector.append(temp_vector)

image_vector = np.asarray(image_vector)
clustering = AffinityPropagation()
clustering.fit(image_vector)
Packages:
scikit-learn 0.20.3
sklearn 0.0
tensorflow 1.12.0
keras 2.2.4
opencv-python
Machine 1: 4 cores, 8 GB RAM
Machine 2: 7 cores, 16 GB RAM

Results on different machines can differ when running algorithms that are not deterministic.
I suggest that you fix the random seed of NumPy and the random seed of Python if you want to be able to reproduce results across machines for such algorithms.
The Python random seed can be fixed with random.seed(42) (or any other integer).
The NumPy random seed can be fixed with np.random.seed(12345) (or any other integer).
sklearn and Keras use NumPy's random number generator, so the second option by itself may be enough to solve your issue.
This answer assumes that all library versions are the same on both systems.
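As a minimal sketch (applied to the script above), both seeds would be set once at the top, before the model is built and the clustering is run:
import random
import numpy as np

# fix both seeds before any stochastic code runs, so repeated runs
# (and runs on different machines) start from the same random state
random.seed(42)
np.random.seed(12345)

# ... build the feature vectors and fit AffinityPropagation as above ...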

Related

Clustering on 97 features of categorical data

I am trying to apply unsupervised learning to a dataset with 97 features and around 6500 rows/samples. All features are discrete (mostly values from 1-10), with some being binary (0/1). What are some of the best clustering algorithms to apply to this data? Thank you!
It's impossible to say which clustering algorithm will perform best on your dataset. You just have to try several methods and inspect the results you get. Here are several clustering algorithms that you can try.
https://github.com/ASH-WICUS/Notebooks/blob/master/Clustering%20Algorithms%20Compared.ipynb
Here is a small sample.
import statsmodels.api as sm
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from matplotlib import pyplot

# load the mtcars sample dataset
mtcars = sm.datasets.get_rdataset("mtcars", "datasets", cache=True).data
df_cars = pd.DataFrame(mtcars)
df_cars.head()

# define the dataset (two numeric features)
X = df_cars[['mpg', 'hp']].copy()
# define the model
model = KMeans(n_clusters=8)
# fit the model
model.fit(X)
# assign a cluster to each example
yhat = model.predict(X)
X['kmeans'] = yhat
pyplot.scatter(X['mpg'], X['hp'], c=X['kmeans'], cmap='rainbow', s=50, alpha=0.8)

# plot x & y coordinates and color by cluster number
import plotly.express as px
fig = px.scatter(X, x="hp", y="mpg", color="kmeans", size='mpg', hover_data=['kmeans'])
fig.show()
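To illustrate the "try several methods" advice, a second sketch with a different algorithm (AgglomerativeClustering, reusing the X defined above) looks almost identical:
from sklearn.cluster import AgglomerativeClustering

# hierarchical clustering on the same two features
agg = AgglomerativeClustering(n_clusters=8)
X['agglo'] = agg.fit_predict(X[['mpg', 'hp']])
pyplot.scatter(X['mpg'], X['hp'], c=X['agglo'], cmap='rainbow', s=50, alpha=0.8)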

CPU and GPU running only around 10-15% during machine learning, training is very slow

Question edited, with code added.
I am getting into machine learning (not even deep learning yet) and I notice that calculations take an extremely long time, while my CPU and GPU don't seem to be working very hard. I am playing with the MNIST dataset (70,000 samples with 784 features each).
My hardware is:
CPU: AMD Ryzen 5 3600X 6 core
GPU: Radeon RX 570
RAM: 16 GB
Windows 10, python 3.8 in jupyter notebook
Here's the code, along with the time I measured for each block (cells in Jupyter). I have no reference for whether these are normal times. Some seem very long to me (I used a training set of only 10000 instead of 60000, for demonstration purposes), and what I find strange is that my CPU and GPU hardly go above 10-15%.
My questions:
Are those times normal for these classification tasks on my hardware setup?
Why are my CPU and GPU not working harder? I even wonder if my GPU is doing anything at all.
from sklearn.datasets import fetch_openml
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
mnist = fetch_openml('mnist_784', version=1)
X,y = mnist.data, mnist.target
y = y.astype(np.uint8)
some_digit = X[0]
# following code block is just to visualize what the MNIST dataset is
'''
some_digit_image = some_digit.reshape(28,28)
plt.imshow(some_digit_image, cmap = 'binary')
plt.axis('off')
plt.show()
'''
print(y[0])
# To demonstrate timing, I just took 10000 for training (60000 out of 70000 would make more sense)
X_train, X_test, y_train, y_test = X[:10000], X[10000:], y[:10000], y[10000:]
# first make a binary classifier: 5 or not 5
y_train_5 = (y_train ==5)
y_test_5 = (y_test == 5)
# following takes 703 ms
from sklearn.linear_model import SGDClassifier
sgd_clf = SGDClassifier(max_iter=1000, tol=1e-3, random_state = 40)
sgd_clf.fit(X_train, y_train_5)
sgd_clf.predict([some_digit])
# following takes 1.44 s
from sklearn.model_selection import cross_val_score
cross_val_score(sgd_clf, X_train, y_train_5,cv=3, scoring='accuracy')
# following takes 1.44 s
from sklearn.model_selection import cross_val_predict
y_train_predict = cross_val_predict(sgd_clf,X_train, y_train_5,cv=3)
from sklearn.metrics import confusion_matrix
confusion_matrix(y_train_5, y_train_predict)
# following takes 16.8 s
from sklearn.svm import SVC
svm_clf = SVC()
svm_clf.fit(X_train, y_train)
svm_clf.predict([some_digit])
# following takes 1.7 s
from sklearn.neighbors import KNeighborsClassifier
kn = KNeighborsClassifier()
kn.fit(X_train,y_train)
#following takes 57 s
y_train_predict = cross_val_predict(kn,X_train, y_train,cv=3)
print(confusion_matrix(y_train, y_train_predict))
The easiest way to decrease training time is to increase the batch size. Increase it until GPU usage is near its maximum.
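The scikit-learn estimators above do not expose a batch size, but for a neural-network model the advice maps to the batch_size argument of fit. A minimal Keras sketch (a hypothetical dense classifier on the same MNIST vectors, not code from the question):
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Dense(128, activation='relu', input_shape=(784,)),
    keras.layers.Dense(10, activation='softmax'),
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
# increase batch_size (512, 1024, ...) until GPU utilisation approaches its maximum
model.fit(X_train, y_train, epochs=5, batch_size=512)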

scikit-learn.impute isn't being imported from Imputer via Spyder using the code from Machine Learning A-Z tutorial

The code that I copied word for word from the Machine Learning A-Z™: Hands-On Python & R In Data Science tutorial course isn't working. I am using Python 3.7 and have installed the scikit-learn package in my environment. I have tried looking for a package that provides sklearn, but it doesn't seem to find anything. It is giving me this error.
I am running my environment through Anaconda.
ImportError: cannot import name 'Imputer' from 'sklearn.preprocessing' (C:\Users\vygan\.conda\envs\env\lib\site-packages\sklearn\preprocessing\__init__.py)
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('Data.csv')
X = pd.DataFrame(dataset.iloc[:, :-1].values)
y = pd.DataFrame(dataset.iloc[:, 3].values)
# Taking care of missing data
from sklearn.preprocessing import Imputer
imputer = Imputer(missing_values = 'NaN', strategy = 'mean', axis = 0)
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])
It was moved permanently from the preprocessing module to the impute module; you can import it like this:
from sklearn.impute import SimpleImputer
It works in much the same way.
If it doesn't work, you should uninstall scikit-learn with pip and then install it again; it may not have installed properly the first time.
It doesn't have the axis parameter anymore, but you can easily handle that by selecting the pandas DataFrame column, like this:
si = SimpleImputer()
si.fit(dataset[["headername"]])
There is a strategy parameter that lets you choose between "mean", "most_frequent", "median" and "constant".
But there is another imputer that I like more:
from sklearn.impute import KNNImputer
which imputes missing values with an average of the k nearest neighbors.
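A minimal sketch of how KNNImputer can be used (the column names and data here are illustrative):
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# hypothetical numeric data with missing values
df = pd.DataFrame({"a": [1.0, 2.0, np.nan, 4.0], "b": [2.0, np.nan, 6.0, 8.0]})
imputer = KNNImputer(n_neighbors=2)
# each NaN is replaced using the 2 nearest rows (by distance on the non-missing columns)
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)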
A more complete answer:
Imputer (https://sklearn.org/modules/generated/sklearn.preprocessing.Imputer.html)
was deprecated in scikit-learn 0.20 and removed in 0.22.
SimpleImputer was added in the newer versions, and it is what you need.
Try to install the latest version:
pip install -U scikit-learn # or using conda
And then use:
from sklearn.impute import SimpleImputer
Source: https://github.com/mindsdb/lightwood/issues/75
Your code works fine for me. Which sklearn version do you have?
import sklearn
sklearn.__version__
'0.21.3'
You can upgrade packages with conda as described here:
How to upgrade scikit-learn package in anaconda
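For reference, a typical upgrade command (a sketch, assuming a conda environment with scikit-learn already installed) would be:
conda update scikit-learn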
I faced the same problem because the library was moved from preprocessing to impute and the class was renamed from Imputer to SimpleImputer.
I changed my code as follows:
from sklearn.impute import SimpleImputer
import numpy as np
# X is assumed to be a NumPy array here (e.g. dataset.iloc[:, :-1].values)
simp = SimpleImputer(missing_values=np.nan, strategy='mean')
simp = simp.fit(X[:, 1:3])
X[:, 1:3] = simp.transform(X[:, 1:3])

Flatten layer incompatible with input

I am trying to run the code
import data_processing as dp
import numpy as np
test_set = dp.read_data("./data2019-12-01.csv")
import tensorflow as tf
import keras
def train_model():
    autoencoder = keras.Sequential([
        keras.layers.Flatten(input_shape=[400]),
        keras.layers.Dense(150, name='bottleneck'),
        keras.layers.Dense(400, activation='sigmoid')
    ])
    autoencoder.compile(optimizer='adam', loss='mse')
    return autoencoder
trained_model=train_model()
trained_model.load_weights('./weightsfile.h5')
trained_model.evaluate(test_set,test_set)
The test_set in line 3 is a numpy array of shape (3280977, 400). I am using keras 2.1.4 and tensorflow 1.5.
However, this produces the following error:
ValueError: Input 0 is incompatible with layer flatten_1: expected min_ndim=3, found ndim=2
How can I solve it? I tried changing the input_shape in the Flatten layer and also searched the internet for possible solutions, but none of them worked. Can anyone help me out here? Thanks.
After much trial and error, I was able to run the code. This is the code that runs:
import data_processing as dp
import numpy as np
test_set = np.array(dp.read_data("./datanew.csv"))
print(np.shape(test_set))
import tensorflow as tf
from tensorflow import keras
# import keras
def train_model():
    autoencoder = keras.Sequential([
        keras.layers.Flatten(input_shape=[400]),
        keras.layers.Dense(150, name='bottleneck'),
        keras.layers.Dense(400, activation='sigmoid')
    ])
    autoencoder.compile(optimizer='adam', loss='mse')
    return autoencoder
trained_model=train_model()
trained_model.load_weights('./weightsfile.h5')
trained_model.evaluate(test_set,test_set)
The change I made is that I replaced
import keras
with
from tensorflow import keras
This may also work for others who are using old versions of tensorflow and keras. I used tensorflow 1.5 and keras 2.1.4 in my code.
Keras and TensorFlow only accept batched input data for prediction.
You must 'simulate' the batch index dimension.
For example, if your data is of shape (M x N), you need to feed a tensor of shape (K x M x N) at the prediction step, where K is the batch dimension.
Simulating the batch axis is very easy; you can use numpy to achieve it.
Using np.expand_dims(x, axis=0), an input tensor of shape M x N becomes 1 x M x N. This is why you get that error: the missing '1' (or 'K') dimension is the batch index.
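A minimal sketch of what that looks like (the array here is illustrative):
import numpy as np

x = np.random.rand(20, 400)            # data of shape (M, N) = (20, 400)
x_batch = np.expand_dims(x, axis=0)    # shape (1, 20, 400); axis 0 is the batch index K
# x_batch can now be passed to a model that expects batched input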

How to run PCA with dask_ml? I am getting an error: "This function (tsqr) supports QR decomposition in the case of tall-and-skinny matrices"

I want to perform dimensionality reduction on data with around 3000 rows and 6000 columns. Here the number of observations (n_samples) is less than the number of features (n_columns). I am not able to get a result using dask-ml, whereas the same is possible with scikit-learn. What modifications do I need to make to my existing code?
#### dask_ml
from dask_ml.decomposition import PCA
from dask_ml import preprocessing
import dask.array as da
import numpy as np
train = np.random.rand(3000,6000)
train = da.from_array(train,chunks=(100,100))
complete_pca = PCA().fit(train)
#### scikit learn
from sklearn.decomposition import PCA
from sklearn import preprocessing
import numpy as np
train = np.random.rand(3000,6000)
complete_pca = PCA().fit(train)
The PCA algorithm in Dask-ML is only designed for tall-and-skinny matrices. You could try using the raw SVD algorithms in dask.array. Also, with a 3000x6000 matrix you can probably just use a single machine.
Adding something like Dask-ML for a problem of this size might be adding more complexity than you need. If scikit-learn works for you, then I would stick with that.
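As a sketch of the raw-SVD route in dask.array (using the approximate/compressed SVD, which does not require tall-and-skinny chunking; k=100 is an assumed number of components):
import dask.array as da
import numpy as np

train = da.from_array(np.random.rand(3000, 6000), chunks=(1000, 1000))
# approximate truncated SVD; keep the first k singular vectors/values
u, s, v = da.linalg.svd_compressed(train, k=100)
u, s, v = da.compute(u, s, v)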
