TPOT freezes while optimizing in Ubuntu 18.04

I have started working with TPOT with dask, but the optimization always freezes at 0%. I have tried almost all the solutions suggested, but still no luck. Here is my code:
import numpy as np
import multiprocessing
import csv
import tpot
from dask.distributed import Client
from sklearn.model_selection import train_test_split

# Load the target labels
target = list(csv.reader(open('target.csv')))
target_n = []
for i in range(len(target)):
    target_n.append(int(target[i][0]))
target_array = np.asarray(target_n)  # needed below; the original code never defined target_array

# Load the feature data
data = list(csv.reader(open('data.csv')))
data_n = []
for i in range(len(data)):
    tmp = []
    for j in range(len(data[i])):
        tmp.append(np.longdouble(data[i][j]))
    data_n.append(tmp)
data_array = np.asarray(data_n)

# Cap values at the float64 maximum and clip negatives to zero
data_array = np.where(data_array < np.finfo(np.float64).max, data_array, np.finfo(np.float64).max)
data_array = data_array.clip(min=0)

X_train, X_test, y_train, y_test = train_test_split(data_array, target_array, train_size=0.75, test_size=0.25)

tp = tpot.TPOTClassifier(generations=5, config_dict='TPOT light', population_size=10, cv=5,
                         random_state=0, verbosity=3, use_dask=True,
                         max_eval_time_mins=0.04, n_jobs=20)
tp.fit(X_train, y_train)
print(tp.score(X_test, y_test))  # was tpot.score(...), which refers to the module, not the fitted classifier
Any kind of suggestion will be greatly appreciated. By the way, the size of my data is ~85 MB.

A few things, if this is still an issue:
Let's assume you're using significant hardware, since you're running dask with n_jobs=20.
My guess is that your generations and population size are too small to parallelise across that many workers, so the workloads being split up are too small.
You may also need a guard for multiprocessing when using multiple n_jobs (a sketch follows below).
See the crash/freeze heading here.
It's tricky to run dev tests by its nature, but TPOT recommends testing with a population size of at least 60, from memory.
Post your error message if this is still an issue.
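For reference, a minimal sketch of what that multiprocessing guard could look like (assuming the data loading and the X_train/y_train split from the question; the forkserver start method is what the TPOT documentation describes under its crash/freeze section, and starting a local dask Client here is an assumption about your setup):
import multiprocessing
import tpot
from dask.distributed import Client

# ... load data.csv / target.csv and build X_train, y_train exactly as in the question ...

if __name__ == '__main__':
    # Protect the entry point so worker processes don't re-execute the script,
    # and use the forkserver start method to avoid the n_jobs > 1 freeze on Linux.
    multiprocessing.set_start_method('forkserver')

    client = Client()  # local dask cluster for use_dask=True to talk to

    tp = tpot.TPOTClassifier(generations=5, population_size=60,  # 60 as suggested above
                             config_dict='TPOT light', cv=5, random_state=0,
                             verbosity=3, use_dask=True, n_jobs=20)
    tp.fit(X_train, y_train)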

Related

dead kernel when doing feature engineering?

I am working on a prediction problem. In my training set, I have around 8,700 samples and around 1,000 features. I have used different models, but the results are still highly biased. So, I decided to add new features to the model. I added some lags to the features and then used the polynomial tools in sklearn to generate polynomial features (degree=2).
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(2)
X_poly = poly.fit_transform(X)
X = pd.DataFrame(X_poly, columns=poly.get_feature_names_out(), index=X.index)
Now, I have around 490,000 features. Next, when I want to do the feature scaling,
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X)
I face a "dead kernel" error in my Jupyter notebook and cannot go further.
What should I do? Any suggestions?
You need to do batch processing with partial_fit and then transform (which also needs a loop; a sketch of that second pass follows the code below):
scaler = StandardScaler()
n = X.shape[0]  # number of rows
batch_size = 1000
i = 0
while i < n:
    partial_size = min(batch_size, n - i)
    partial_x = X[i:i + partial_size]
    scaler.partial_fit(partial_x)
    i += partial_size
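The second pass would follow the same pattern; a minimal sketch, assuming the scaled batches can be stacked back into one array in memory:
import numpy as np

# Second pass: scale the data in batches with the already-fitted scaler
scaled_parts = []
i = 0
while i < n:
    partial_size = min(batch_size, n - i)
    partial_x = X[i:i + partial_size]
    scaled_parts.append(scaler.transform(partial_x))
    i += partial_size

X_scaled = np.vstack(scaled_parts)  # stitch the scaled batches back together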

Why is Scikit-learn RFECV returning very different features for the training dataset?

I have been experimenting with RFECV on the Boston dataset.
My understanding, thus far, is that to prevent data leakage, it is important to perform activities such as this only on the training data and not on the whole dataset.
I performed RFECV on just the training data, and it indicated that 13 of the 14 features are optimal. However, I then ran the same process on the whole dataset, and this time around, it indicated that only 6 of the features are optimal - which seems more likely.
To illustrate:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression
from sklearn.datasets import load_boston
### CONSTANTS
TARGET_COLUMN = 'Price'
TEST_SIZE = 0.1
RANDOM_STATE = 0
### LOAD THE DATA AND ASSIGN TO X and y
data_dict = load_boston()
data = data_dict.data
features = list(data_dict.feature_names)
target = data_dict.target
df = pd.DataFrame(data=data, columns=features)
df[TARGET_COLUMN] = target
X = df[features]
y = df[TARGET_COLUMN]
### PERFORM TRAIN TEST SPLIT
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=TEST_SIZE,
random_state=RANDOM_STATE)
#### DETERMINE THE DATA THAT IS PASSED TO RFECV
## Just the Training data
X_input = X_train
y_input = y_train
## All the data
# X_input = X
# y_input = y
### IMPLEMENT RFECV AND GET RESULTS
rfecv = RFECV(estimator=LinearRegression(), step=1, scoring='neg_mean_squared_error')
rfecv.fit(X_input, y_input)
rfecv.transform(X_input)
print(f'Optimal number of features: {rfecv.n_features_}')
imp_feats = X.drop(X.columns[np.where(rfecv.support_ == False)[0]], axis=1)
print('Important features:', list(imp_feats.columns))
Running the above will result in:
Optimal number of features: 13
Important features: ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']
Now, if I change the code so that RFECV fits on all the data:
#### DETERMINE THE DATA THAT IS PASSED TO RFECV
## Just the Training data
# X_input = X_train # NOW COMMENTED OUT
# y_input = y_train # NOW COMMENTED OUT
## All the data
X_input = X # NOW UN-COMMENTED
y_input = y # NOW UN-COMMENTED
and run it, I get the following result:
Optimal number of features: 6
Important features: ['CHAS', 'NOX', 'RM', 'DIS', 'PTRATIO', 'LSTAT']
I don't understand why the results are so markedly different (and seemingly more accurate) for the whole dataset as opposed to just the training set.
I have tried making the training set close to the size of the whole data, by making the test_size extremely small (via my TEST_SIZE constant), but I still get this seemingly unlikely difference.
What am I missing?
It certainly seems like unexpected behavior, especially when, as you say, you can reduce the test size to 10% or even 5% and find a similar disparity, which seems very counter-intuitive.
The key to understanding what's going on here is to realize that for this particular dataset the values in each column are not randomly distributed across the rows (for example, try running X['CRIM'].plot()). The train_test_split function you're using to split the data has a shuffle parameter which defaults to True. So if you look at the X_train dataset you'll see that the index is jumbled up, whereas in X it is sequential. This means that when the cross-validation is performed under the hood by the RFECV class, it is getting a biased subset of data in each split of X, but a more representative/random subset of data in each split of X_train.
If you pass shuffle=False to train_test_split you'll see that the two results are much closer (or, alternatively and probably better, try shuffling the index of X).
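To check this concretely, one way to follow the "shuffle the index of X" suggestion (a minimal sketch, reusing the names from the question's code) is to permute the rows of the full dataset before fitting RFECV and compare the result with the X_train run:
# Shuffle the rows of the full dataset so each CV split sees a representative sample
X_shuffled = X.sample(frac=1, random_state=RANDOM_STATE)
y_shuffled = y.loc[X_shuffled.index]

rfecv_shuffled = RFECV(estimator=LinearRegression(), step=1, scoring='neg_mean_squared_error')
rfecv_shuffled.fit(X_shuffled, y_shuffled)
print(f'Optimal number of features (shuffled X): {rfecv_shuffled.n_features_}')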

CPU and GPU running only around 10-15% during machine learning, training is very slow

Question edited, with code added.
I am getting into machine learning (not even deep learning yet) and I notice that calculations take extremely long, while my CPU and GPU don't seem to be working very hard. I am playing with the MNIST dataset (70000 samples with 784 features each).
My hardware is:
CPU: AMD Ryzen 5 3600X 6 core
GPU: Radeon RX 570
RAM: 16 GB
Windows 10, python 3.8 in jupyter notebook
Here's the code, and the time I measured for each block (cells in Jupyter). I have no reference for whether these are normal times. Some seem very long to me (I used a training set of only 10,000 instead of 60,000, for demonstration purposes), and what I find strange is that my CPU and GPU hardly go above 10-15%.
My questions:
Are those times normal for these classification tasks on my hardware setup?
Why are my CPU and GPU not working harder? I even wonder if my GPU is doing anything at all.
from sklearn.datasets import fetch_openml
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
mnist = fetch_openml('mnist_784', version=1)
X,y = mnist.data, mnist.target
y = y.astype(np.uint8)
some_digit = X[0]
# following code block is just to visualize what the MNIST dataset is
'''
some_digit_image = some_digit.reshape(28,28)
plt.imshow(some_digit_image, cmap = 'binary')
plt.axis('off')
plt.show()
'''
print(y[0])
# To demonstrate timing, I just took 10000 for training (60000 out of 70000 would make more sense)
X_train, X_test, y_train, y_test = X[:10000],X[10000:],y[:10000], y[10000:]
# first make a binary classifier: 5 or not 5
y_train_5 = (y_train ==5)
y_test_5 = (y_test == 5)
# following takes 703 ms
from sklearn.linear_model import SGDClassifier
sgd_clf = SGDClassifier(max_iter=1000, tol=1e-3, random_state = 40)
sgd_clf.fit(X_train, y_train_5)
sgd_clf.predict([some_digit])
# following takes 1.44 s
from sklearn.model_selection import cross_val_score
cross_val_score(sgd_clf, X_train, y_train_5,cv=3, scoring='accuracy')
# following takes 1.44 s
from sklearn.model_selection import cross_val_predict
y_train_predict = cross_val_predict(sgd_clf,X_train, y_train_5,cv=3)
from sklearn.metrics import confusion_matrix
confusion_matrix(y_train_5, y_train_predict)
# following takes 16.8 s
from sklearn.svm import SVC
svm_clf = SVC()
svm_clf.fit(X_train, y_train)
svm_clf.predict([some_digit])
# following takes 1.7 s
from sklearn.neighbors import KNeighborsClassifier
kn = KNeighborsClassifier()
kn.fit(X_train,y_train)
#following takes 57 s
y_train_predict = cross_val_predict(kn,X_train, y_train,cv=3)
print(confusion_matrix(y_train, y_train_predict))
The easiest way to decrease training time is to increase batch size. Increase it until GPU usage is near max.
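One additional lever, not mentioned in the answer above: scikit-learn runs on the CPU, and its cross-validation helpers use a single core unless told otherwise. A minimal sketch of parallelising the folds with the n_jobs parameter, reusing the kn, X_train and y_train objects from the question:
from sklearn.model_selection import cross_val_predict

# n_jobs=-1 runs the three CV folds in parallel on all available CPU cores,
# which should raise CPU utilisation well above 10-15% during this step.
y_train_predict = cross_val_predict(kn, X_train, y_train, cv=3, n_jobs=-1)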

reduction of model accuracy while using PCA for a regression problem

I am trying to build a prediction model to predict the fare of flights. My data set has several categorical variables like class, hour, day of week, day of month, month of year, etc. I am using multiple algorithms like xgboost and ANN to fit the model.
Initially I one-hot encoded these variables, which led to a total of 90 variables. When I tried to fit a model to this data, the training r2_score was high, around 0.90, but the test score was relatively low (0.6).
I then used sine and cosine transformations for the temporal variables, which led to a total of only 27 variables. With this, the training accuracy dropped to 0.83 but the test score increased to 0.70.
I was thinking that my variables are sparse, so I tried doing a PCA, but this drastically reduced the performance on both the train set and the test set.
So I have a few questions regarding this:
Why is PCA not helping, and in fact reducing the performance of my model so badly?
Any suggestions on how to improve my model performance?
code
from xgboost import XGBRegressor
import pandas as pd
import matplotlib.pyplot as plt
dataset = pd.read_excel('Airline Dataset1.xlsx',sheet_name='Airline Dataset1')
dataset = dataset.drop(columns = ['SL. No.'])
dataset['time'] = dataset['time'] - 24
import numpy as np
dataset['time'] = np.where(dataset['time']==24,0,dataset['time'])
cat_cols = ['demand', 'from_ind', 'to_ind']
cyc_cols = ['time','weekday','month','monthday']
def cyclic_encode(data, col, col_max):
    data[col + '_sin'] = np.sin(2*np.pi*data[col]/col_max)
    data[col + '_cos'] = np.cos(2*np.pi*data[col]/col_max)
    return data
cyclic_encode(dataset,'time',23)
cyclic_encode(dataset,'weekday',6)
cyclic_encode(dataset,'month',11)
cyclic_encode(dataset,'monthday',31)
dataset = dataset.drop(columns=cyc_cols)
ohe_dataset = pd.get_dummies(dataset,columns = cat_cols , drop_first=True)
X = ohe_dataset.iloc[:,:-1]
y = ohe_dataset.iloc[:,27:28]
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train_us, X_test_us, y_train_us, y_test_us = train_test_split(X, y, test_size = 0.2, random_state = 0)
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
sc_Y = StandardScaler()
X_train = sc_X.fit_transform(X_train_us)
X_test = sc_X.transform(X_test_us)
y_train = sc_Y.fit_transform(y_train_us)
y_test = sc_Y.transform(y_test_us)
#Applying PCA
from sklearn.decomposition import PCA
pca = PCA(n_components = 2)
X_train = pca.fit_transform(X_train,y_train)
X_test = pca.transform(X_test)
explained_variance = pca.explained_variance_ratio_
regressor = XGBRegressor()
model = regressor.fit(X_train,y_train)
# Predicting the Test & Train set with regressor built
y_pred = regressor.predict(X_test)
y_pred = sc_Y.inverse_transform(y_pred)
y_pred_train = regressor.predict(X_train)
y_pred_train = sc_Y.inverse_transform(y_pred_train)
y_train = sc_Y.inverse_transform(y_train)
y_test = sc_Y.inverse_transform(y_test)
#calculate r2_score
from sklearn.metrics import r2_score
score_train = r2_score(y_train,y_pred_train)
score_test = r2_score(y_test,y_pred)
Thanks
You don't really need PCA for such a low-dimensional problem. Decision trees perform very well even with thousands of variables.
Here are a few things you can try:
Pass a watchlist and train until you stop overfitting on the validation set (a sketch follows this list). https://github.com/dmlc/xgboost/blob/2d95b9a4b6d87e9f630c59995403988dee390c20/demo/guide-python/basic_walkthrough.py#L64
Try the sine/cosine transformations and the one-hot encodings together and build a model from them (along with a watchlist).
Look for more causal data. Seasonal patterns alone do not cause air fare fluctuations. For a start, you can add flags for festivals, holidays and important dates, and do feature engineering on proximity to these days. Weather data is also easy to find and add.
PCA usually helps in cases where you have extreme dimensionality, like genome data, or where the algorithm involved doesn't do well with high-dimensional data, like kNN.
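A minimal sketch of the watchlist / early-stopping idea with the xgboost scikit-learn wrapper, reusing the scaled arrays from the question (ideally the watchlist would be a separate validation split rather than the test set, and note that in recent xgboost releases early_stopping_rounds moves from fit() to the XGBRegressor constructor):
from xgboost import XGBRegressor

# Train with a watchlist and stop adding trees once the validation error
# has not improved for 20 rounds.
regressor = XGBRegressor(n_estimators=1000)
regressor.fit(
    X_train, y_train,
    eval_set=[(X_train, y_train), (X_test, y_test)],  # the "watchlist"
    early_stopping_rounds=20,
    verbose=False,
)
y_pred = regressor.predict(X_test)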

how to get more accuracy on CNN with less number of images

Currently I am working on the flower classification dataset from Kaggle, which has only 210 images. With this set of images I am getting an accuracy of only 11% on the validation set.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import cv2
#from tqdm import tqdm
import os
import warnings
warnings.filterwarnings('ignore')
flower_img = r'C:\Users\asus\Downloads\flower_images\flower_images'
data = pd.read_csv(r'C:\Users\asus\Downloads\flower_images\flower_labels.csv')
img = os.listdir(flower_img)[1]
image_name = [img.split('.')[-2] for img in os.listdir(flower_img)]
label_array = np.array(data['label'])
label_unique = np.unique(label_array)
names = [' phlox','rose','calendula','iris','leucanthemum maximum','bellflower','viola','rudbeckia laciniata','peony','aquilegia']
Flower_names = {}
for i in range(10):
    Flower_names[i] = names[i]
print(Flower_names)
Flower_names.get(8)
x = data['label'][2]
Flower_names.get(x)
i = 0
for img in os.listdir(flower_img):
    #print(img)
    path = os.path.join(flower_img, img)
    #img = cv2.imread(path,cv2.IMREAD_GRAYSCALE)
    img = cv2.imread(path)
    #print(img.shape)
    img = cv2.resize(img, (128, 128))
    data['file'][i] = np.array(img)
    i += 1
data['file'][0].shape
plt.imshow(data['file'][0])
plt.show()
import keras
from keras.models import Sequential
from keras.layers import Dense,Conv2D,Activation,MaxPool2D,Dropout,Flatten
model = Sequential()
model.add(Conv2D(32,kernel_size=3,activation='relu',input_shape=(128,128,3)))
model.add(MaxPool2D(pool_size=(2,2)))
model.add(Conv2D(64,kernel_size=3,activation='relu'))
model.add(MaxPool2D(pool_size=(2,2)))
model.add(Conv2D(128,kernel_size=3,activation='relu'))
model.add(MaxPool2D(pool_size=(2,2)))
#model.add(Conv2D(512,kernel_size=3,activation='relu'))
#model.add(MaxPool2D(pool_size=(2,2)))
model.add(Flatten())
model.add(Dense(512,activation='relu'))
model.add(Dense(10,activation='softmax'))
model.add(Dropout(0.25))
from keras.optimizers import Adam
model.compile(loss='categorical_crossentropy',optimizer=Adam(lr=0.002),metrics=['accuracy'])
model.summary()
x = np.array([i for i in data['file']]).reshape(-1,128,128,3)
y = np.array([i for i in data['label']])
from keras.utils import to_categorical
y = to_categorical(y)
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y)
model.fit(x_train,y_train,validation_data=(x_test,y_test),epochs=10)
model.evaluate(x_test,y_test)
model.evaluate(x_train,y_train)
How can I increase accuracy using only this dataset? Also, how can I predict classes for any input image?
Link to the flower color images dataset: https://www.kaggle.com/olgabelitskaya/flower-color-images
Your dataset size is very small. Convolutional neural networks are optimal when trained using very large data sets. You really want to have thousands of images (or more!) in your data set.
You can try to enhance your current data set by using various image processing techniques to increase its size. These techniques take the original images, skew them, rotate them and make other modifications to bolster the size of the training data. They can be helpful, but increasing the natural size of the data set is preferred.
If you cannot increase the size of the dataset, you should examine why you need to use a CNN. There are other algorithms that may give better results when trained with a smaller data set. Take a look at Support Vector Machines or k-NN.
If you must use a CNN, Transfer Learning is a good solution. You can use the features from a trained model and apply them to your problem. I have had great success with this approach.
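A rough sketch of that transfer-learning route in Keras, assuming the 128x128x3 arrays built in the question and using VGG16 ImageNet weights as a frozen feature extractor (an illustration of the idea, not the answerer's exact setup; ideally the inputs would also go through the matching preprocess_input):
from keras.applications import VGG16
from keras.models import Sequential
from keras.layers import Flatten, Dense, Dropout

# Frozen ImageNet feature extractor; only the small head on top is trained.
base = VGG16(weights='imagenet', include_top=False, input_shape=(128, 128, 3))
base.trainable = False

tl_model = Sequential([
    base,
    Flatten(),
    Dense(256, activation='relu'),
    Dropout(0.5),
    Dense(10, activation='softmax'),  # 10 flower classes
])
tl_model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
tl_model.fit(x_train, y_train, validation_data=(x_test, y_test), epochs=10)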
The things you can do:
Progressive resizing link
Image augmentation link
Transfer learning link
To be honest, there are many more techniques that could be used to get more out of the data; try searching on this topic. These are just the ones that come to mind, and the linked ones are only major examples. You can dig deeper with dedicated research.
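Since both answers point to image augmentation, here is a minimal sketch with Keras' ImageDataGenerator, reusing the model and the x_train/y_train arrays from the question (the transform ranges are arbitrary examples; on older Keras versions model.fit_generator is needed instead of model.fit):
from keras.preprocessing.image import ImageDataGenerator

# Randomly rotate, shift, zoom and flip the training images on the fly,
# so every epoch sees slightly different versions of the 210 originals.
datagen = ImageDataGenerator(
    rotation_range=20,
    width_shift_range=0.1,
    height_shift_range=0.1,
    zoom_range=0.1,
    horizontal_flip=True,
)

model.fit(
    datagen.flow(x_train, y_train, batch_size=16),
    steps_per_epoch=len(x_train) // 16,
    validation_data=(x_test, y_test),
    epochs=30,
)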
