dead kernel when doing feature engineering?

I am working on a prediction problem. In my training set I have around 8,700 samples and around 1,000 features. I have tried different models, but they are all highly biased (underfitting). So I decided to add new features: I added some lags of the existing features and then used sklearn's PolynomialFeatures to generate polynomial features (degree=2).
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(2)
X_poly = poly.fit_transform(X)
X = pd.DataFrame(X_poly, columns=poly.get_feature_names_out(), index=X.index)
Now, I have around 490,000 features. Next, when I want to do the feature scaling,
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X)
Jupyter Notebook reports a "Dead kernel" error and I cannot go any further.
What should I do? Any suggestions?

You need to do batch processing with partial_fit, and then transform the data (which also needs a loop):
scaler = StandardScaler()
n = X.shape[0]  # number of rows
batch_size = 1000
i = 0
while i < n:
    partial_size = min(batch_size, n - i)
    partial_x = X[i:i + partial_size]
    scaler.partial_fit(partial_x)
    i += partial_size
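After the fitting pass, run a second loop that transforms each batch and stacks the results. A minimal sketch along the same lines, assuming the fully scaled array fits in memory once assembled:
import numpy as np

scaled_batches = []
i = 0
while i < n:
    partial_size = min(batch_size, n - i)
    # transform one batch at a time with the already-fitted scaler
    scaled_batches.append(scaler.transform(X[i:i + partial_size]))
    i += partial_size
X_scaled = np.vstack(scaled_batches)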

Related

reduction of model accuracy while using PCA for a regression problem

I am trying to build a prediction model for flight fares. My data set has several categorical variables like class, hour, day of week, day of month, month of year, etc. I am using multiple algorithms like xgboost and an ANN to fit the model.
Initially I one-hot encoded these variables, which led to a total of 90 variables. When I fit a model on this data, the training r2_score was high, around 0.90, while the test score was relatively low (0.6).
I then used sine and cosine transformations for the temporal variables, which led to a total of only 27 variables. With this, the training score dropped to 0.83 but the test score increased to 0.70.
I was thinking that my variables are sparse, so I tried PCA, but this drastically reduced the performance on both the train set and the test set.
So I have a few questions:
Why is PCA not helping, and in turn reducing the performance of my model so badly?
Any suggestions on how to improve my model's performance?
Code:
from xgboost import XGBRegressor
import pandas as pd
import matplotlib.pyplot as plt
dataset = pd.read_excel('Airline Dataset1.xlsx',sheet_name='Airline Dataset1')
dataset = dataset.drop(columns = ['SL. No.'])
dataset['time'] = dataset['time'] - 24
import numpy as np
dataset['time'] = np.where(dataset['time']==24,0,dataset['time'])
cat_cols = ['demand', 'from_ind', 'to_ind']
cyc_cols = ['time','weekday','month','monthday']
def cyclic_encode(data, col, col_max):
    data[col + '_sin'] = np.sin(2 * np.pi * data[col] / col_max)
    data[col + '_cos'] = np.cos(2 * np.pi * data[col] / col_max)
    return data
cyclic_encode(dataset,'time',23)
cyclic_encode(dataset,'weekday',6)
cyclic_encode(dataset,'month',11)
cyclic_encode(dataset,'monthday',31)
dataset = dataset.drop(columns=cyc_cols)
ohe_dataset = pd.get_dummies(dataset,columns = cat_cols , drop_first=True)
X = ohe_dataset.iloc[:,:-1]
y = ohe_dataset.iloc[:,27:28]
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train_us, X_test_us, y_train_us, y_test_us = train_test_split(X, y, test_size = 0.2, random_state = 0)
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
sc_Y = StandardScaler()
X_train = sc_X.fit_transform(X_train_us)
X_test = sc_X.transform(X_test_us)
y_train = sc_Y.fit_transform(y_train_us)
y_test = sc_Y.transform(y_test_us)
#Applying PCA
from sklearn.decomposition import PCA
pca = PCA(n_components = 2)
X_train = pca.fit_transform(X_train,y_train)
X_test = pca.transform(X_test)
explained_variance = pca.explained_variance_ratio_
regressor = XGBRegressor()
model = regressor.fit(X_train,y_train)
# Predicting the Test & Train set with regressor built
y_pred = regressor.predict(X_test)
y_pred = sc_Y.inverse_transform(y_pred.reshape(-1, 1))  # predict returns 1-D; the scaler expects 2-D
y_pred_train = regressor.predict(X_train)
y_pred_train = sc_Y.inverse_transform(y_pred_train.reshape(-1, 1))
y_train = sc_Y.inverse_transform(y_train)
y_test = sc_Y.inverse_transform(y_test)
#calculate r2_score
from sklearn.metrics import r2_score
score_train = r2_score(y_train,y_pred_train)
score_test = r2_score(y_test,y_pred)
Thanks
You don't really need PCA for such a low-dimensional problem; decision trees perform well even with thousands of variables.
Here are a few things you can try:
Pass a watchlist and train only until you stop improving on the validation set (see the sketch after this list): https://github.com/dmlc/xgboost/blob/2d95b9a4b6d87e9f630c59995403988dee390c20/demo/guide-python/basic_walkthrough.py#L64
Try the sine/cosine transformations and the one-hot encodings together in one model (along with the watchlist).
Look for more causal data. Seasonal patterns alone do not cause air-fare fluctuations. For a start you can add flags for festivals, holidays, and important dates, and engineer features for proximity to those days. Weather data is also easy to find and add.
PCA usually helps in cases of extreme dimensionality, like genome data, or when the algorithm involved does not do well in high-dimensional data, like kNN.
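A minimal sketch of the watchlist idea with the xgboost sklearn API (note: the early_stopping_rounds argument to fit works in older xgboost releases; newer releases take it in the constructor instead, so check your version):
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

# hold out a validation set from the training data
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=0)

regressor = XGBRegressor(n_estimators=1000)
# stop adding trees once validation error has not improved for 20 rounds
regressor.fit(X_tr, y_tr,
              eval_set=[(X_val, y_val)],
              early_stopping_rounds=20,
              verbose=False)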

How do I indicate that certain features (words) and certain documents are more important than others in Naive Bayes?

I am developing a binary document classifier using sklearn. I want to indicate that certain features (words) are more (or less) important and certain documents are more (or less) important for learning.
If you are using CountVectorizer, you can get the number of occurrences (frequency) of each word in the text corpus:
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer().fit(corpus)
bag_of_words = vec.transform(corpus)
# total count of each word across all documents
sum_words = bag_of_words.sum(axis=0)
words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
You can also get the importance of each feature (word) from a univariate statistical test, using SelectKBest:
from sklearn.feature_selection import SelectKBest, chi2
...
skb = SelectKBest(chi2, k="all").fit(X, y)
feature_importance = skb.scores_
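To see which words score highest, you can pair these scores with the vectorizer's vocabulary. A sketch, assuming X is the document-term matrix produced by the CountVectorizer above:
# column order of the scores matches the vectorizer's feature order
feature_names = vec.get_feature_names_out()
top_words = sorted(zip(feature_names, skb.scores_), key=lambda t: t[1], reverse=True)[:20]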
Finally, you can also get importances by training a RandomForestClassifier:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier()
model = clf.fit(X, y)
# Calculate feature importances
importances = model.feature_importances_

Using scikit-learn's MLPClassifier in AdaBoostClassifier

For a binary classification problem I want to use MLPClassifier as the base estimator in AdaBoostClassifier. However, this does not work because MLPClassifier does not implement sample_weight, which AdaBoostClassifier requires (see here). Before that, I tried using a Keras model and the KerasClassifier within AdaBoostClassifier, but that did not work either, as mentioned here.
One way, proposed by user V1nc3nt, is to build your own MLP classifier in TensorFlow and take sample_weight into account.
User V1nc3nt shared large parts of his code, but since I have only limited experience with TensorFlow, I am not able to fill in the missing parts. Hence, I was wondering if anyone has found a working solution for building AdaBoost ensembles from MLPs, or can help me complete the solution proposed by V1nc3nt.
Thank you very much in advance!
Based on the references you mentioned, I have modified MLPClassifier to accommodate sample_weight.
Try this!
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier

class customMLPClassifer(MLPClassifier):
    def resample_with_replacement(self, X_train, y_train, sample_weight):
        # normalize sample_weights if not already
        sample_weight = sample_weight / sample_weight.sum(dtype=np.float64)

        X_train_resampled = np.zeros((len(X_train), len(X_train[0])), dtype=np.float32)
        y_train_resampled = np.zeros((len(y_train)), dtype=int)
        for i in range(len(X_train)):
            # draw a row index from 0 to len(X_train)-1, weighted by sample_weight
            draw = np.random.choice(np.arange(len(X_train)), p=sample_weight)

            # place the X and y at the drawn index into the resampled X and y
            X_train_resampled[i] = X_train[draw]
            y_train_resampled[i] = y_train[draw]

        return X_train_resampled, y_train_resampled

    def fit(self, X, y, sample_weight=None):
        if sample_weight is not None:
            X, y = self.resample_with_replacement(X, y, sample_weight)

        return self._fit(X, y, incremental=(self.warm_start and
                                            hasattr(self, "classes_")))

X, y = load_iris(return_X_y=True)
adabooster = AdaBoostClassifier(base_estimator=customMLPClassifer())
adabooster.fit(X, y)

Can we fit a model using the same dataset applied during cross validation process?

I have the following method that performs Cross Validation on a dataset followed by a final model fit:
import numpy as np
import utilities.utils as utils
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
import pandas as pd
from sklearn.utils import shuffle
def CV(args, path):
    df = pd.read_csv(path + 'HIGGS.csv', sep=',')
    df = shuffle(df)
    df_labels = df[df.columns[0]]
    df_features = df.drop(df.columns[0], axis=1)

    clf = MLPClassifier(hidden_layer_sizes=(64, 64, 64),
                        activation='logistic',
                        solver='adam',
                        learning_rate_init=1e-3,
                        max_iter=1000,
                        batch_size=1000,
                        learning_rate='adaptive',
                        early_stopping=True)

    print('\t >>> Start Cross Validation ... ')
    scores = cross_val_score(estimator=clf, X=df_features, y=df_labels, cv=5, n_jobs=-1)
    print("CV Accuracy: %0.2f (+/- %0.2f)\n\n" % (scores.mean(), scores.std() * 2))

    # Final Fit
    print('\t >>> Start Final Fit ... ')
    num_to_read = (int(10999999) * (args.stages * np.dtype(np.float64).itemsize))
    C1 = utils.read_from_disk(path + 'HIGGS.dat', 0, num_to_read, args.stages)
    print(C1)
    print(C1.shape)

    r = C1[:, :1]
    C = np.delete(C1, 0, axis=1)

    tr_C, ts_C, tr_r, ts_r = train_test_split(C, r, train_size=.8)
    clf.fit(tr_C, tr_r)
    prd_r = clf.predict(ts_C)
    test_acc = accuracy_score(ts_r, prd_r) * 100.

    return test_acc
I understand that cross validation is about evaluating how well your model performs on a given dataset. My questions are:
Is it logically correct to fit the model again on the same dataset I used during the cross-validation process?
During each CV fold, are the model parameters carried over to the next fold? For instance, in the case of a neural network, is the fitted model from fold 1 carried over to fold 2?
Does this process (I mean fitting the entire dataset, as I did above) produce a model accuracy near the average accuracy we get after cross validation?
Thank you
R1. When you perform CV you split your dataset into k folds; each time you train on k-1 of the folds and test/validate on the remaining fold (a different one every time). It is standard practice to then fit the final model on the whole dataset: the CV scores estimate how well that final model will generalize.
R2. No, the parameters are not carried over. Every time the MLP learns on a new combination of k-1 folds, the learning task starts again from scratch; at the end, the reported error measure is the mean of the errors over the k different scenarios.
R3. If the class distribution in the data is balanced, the results of k-fold CV and a traditional 70/30 split will generalize similarly. On the other hand, if the dataset is highly unbalanced, k-fold CV (e.g., k=10) will tend to give better generalization than a traditional split, since each training set represents all or most of the problem classes more effectively.
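A minimal sketch of this pattern (cross_val_score clones the estimator for every fold, so nothing is carried over between folds, and clone gives the final fit a fresh start too; df_features and df_labels are as in the question):
from sklearn.base import clone
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

clf = MLPClassifier(hidden_layer_sizes=(64, 64, 64), early_stopping=True)
# k independent fits: each fold trains a fresh clone from scratch
scores = cross_val_score(clf, df_features, df_labels, cv=5)
# final model: one more fresh fit on the full dataset
final_model = clone(clf).fit(df_features, df_labels)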

Why is `sklearn.manifold.MDS` random when `skbio's pcoa` is not?

I'm trying to figure out how to implement Principal Coordinate Analysis with various distance metrics. I stumbled across implementations in both skbio and sklearn. I don't understand why sklearn's implementation gives different results every time while skbio's is always the same. Is there a degree of randomness to Multidimensional Scaling, and in particular to Principal Coordinate Analysis? I see that all of the clusters are very similar, but why are they different? Am I implementing this correctly?
Running Principal Coordinate Analysis using Scikit-bio (i.e. Skbio) always gives the same results:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn import decomposition
import seaborn as sns; sns.set_style("whitegrid", {'axes.grid' : False})
import skbio
from scipy.spatial import distance
%matplotlib inline
np.random.seed(0)
# Iris dataset
DF_data = pd.DataFrame(load_iris().data,
                       index=["iris_%d" % i for i in range(load_iris().data.shape[0])],
                       columns=load_iris().feature_names)
n, m = DF_data.shape
# print(n, m)
# 150 4

Se_targets = pd.Series(load_iris().target,
                       index=["iris_%d" % i for i in range(load_iris().data.shape[0])],
                       name="Species")

# Scaling mean = 0, var = 1
DF_standard = pd.DataFrame(StandardScaler().fit_transform(DF_data),
                           index=DF_data.index,
                           columns=DF_data.columns)

# Distance Matrix
Ar_dist = distance.squareform(distance.pdist(DF_data, metric="braycurtis"))  # (n x n) distance measure
DM_dist = skbio.stats.distance.DistanceMatrix(Ar_dist, ids=DF_standard.index)
PCoA = skbio.stats.ordination.pcoa(DM_dist)
Now with sklearn's Multidimensional Scaling:
from sklearn.manifold import MDS
fig, ax = plt.subplots(ncols=5, figsize=(12, 3))
for rs in range(5):
    M = MDS(n_components=2, metric=True, random_state=rs, dissimilarity='precomputed')
    A = M.fit(Ar_dist).embedding_
    ax[rs].scatter(A[:, 0], A[:, 1], c=[{0: "b", 1: "g", 2: "r"}[t] for t in Se_targets])
scikit-bio's PCoA (skbio.stats.ordination.pcoa) and scikit-learn's MDS (sklearn.manifold.MDS) use entirely different algorithms to transform the data. scikit-bio directly solves a symmetric eigenvalue problem and scikit-learn uses an iterative minimization procedure [1].
scikit-bio's PCoA is deterministic, though it is possible to receive different (arbitrary) rotations of the transformed coordinates depending on the system it is executed on [2]. scikit-learn's MDS is stochastic by default unless a fixed random_state is used. random_state appears to be used to initialize the iterative minimization procedure (the scikit-learn docs say that random_state is used to "initialize the centers" [3] though I don't know exactly what that means). Each random_state may produce slightly different embeddings with arbitrary rotation [4].
References: [1], [2], [3], [4]
MDS is a probabilistic algorithm; there is a random_state parameter that you can use to fix the random seed, and you can pass it if you want to get the same results each time. PCA, on the other hand, is a deterministic algorithm: if you use sklearn.decomposition.PCA, you will get the same results each time.
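A quick check of that claim, reusing Ar_dist from the question (a sketch; it assumes numpy is imported as np, as above):
from sklearn.manifold import MDS

# two runs with the same fixed seed give identical embeddings
emb1 = MDS(n_components=2, dissimilarity='precomputed', random_state=0).fit_transform(Ar_dist)
emb2 = MDS(n_components=2, dissimilarity='precomputed', random_state=0).fit_transform(Ar_dist)
assert np.allclose(emb1, emb2)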
