Splitting MNIST dataset vs CSV dataset - machine-learning

I am trying to use a custom dataset in here (instead of MNIST) and my dataset looks like this:
age gender genre(output)
-- ------ -------------
20 1 HipHop
26 1 Jazz
31 0 Classical
20 0 Dance
Previously, I used this method for splitting:
X = df.drop(columns=["genre"])
y = df["genre"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
I am struggling to split this CSV dataset the same way the MNIST dataset is split.
I would appreciate any help.
Thanks
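A minimal sketch of one way this could look, assuming that "the method that splits the MNIST dataset" means the ((x_train, y_train), (x_test, y_test)) tuple layout returned by typical MNIST loaders. The helper name split_like_mnist is hypothetical, and the four rows are the ones shown in the table above.

```python
import numpy as np

# Toy rows copied from the table in the question: age, gender, genre.
ages = np.array([20, 26, 31, 20])
genders = np.array([1, 1, 0, 0])
genres = np.array(["HipHop", "Jazz", "Classical", "Dance"])

# Features as a float matrix, one row per sample.
X = np.column_stack([ages, genders]).astype("float32")
# return_inverse maps every genre string to an integer class code.
classes, y = np.unique(genres, return_inverse=True)

def split_like_mnist(X, y, test_size=0.2, seed=0):
    """Shuffle once, then return ((x_train, y_train), (x_test, y_test))."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_test = max(1, int(round(len(X) * test_size)))
    test_idx, train_idx = order[:n_test], order[n_test:]
    return (X[train_idx], y[train_idx]), (X[test_idx], y[test_idx])

(x_train, y_train), (x_test, y_test) = split_like_mnist(X, y)
print(x_train.shape, x_test.shape, list(classes))
```

The integer codes in y index into classes, so classes[y_test] recovers the original genre names.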

Related

XGBoost Survival Model

I'm trying to develop an XGBoost Survival model. Here is a quick snap of my code:
X = df_High_School[['Gender', 'Lived_both_Parents', 'Moth_Born_in_Canada', 'Father_Born_in_Canada','Born_in_Canada','Aboriginal','Visible_Minority']] # covariates
y = df_High_School[['time_to_event', 'event']] # time to event and event indicator
#split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
#Develop the model
model = xgb.XGBRegressor(objective='survival:cox')
It's giving me the following error:
ValueError Traceback (most recent call last)
in
18
19 # fit the model to the training data
---> 20 model.fit(X_train, y_train)
21
22 # make predictions on the test set
2 frames
/usr/local/lib/python3.8/dist-packages/xgboost/core.py in _maybe_pandas_label(label)
261 if isinstance(label, DataFrame):
262 if len(label.columns) > 1:
--> 263 raise ValueError('DataFrame for label cannot have multiple columns')
264
265 label_dtypes = label.dtypes
ValueError: DataFrame for label cannot have multiple columns
As this is a survival model, I need two columns to indicate the event and the time_to_event. I also tried converting the DataFrames to NumPy arrays, but that didn't work either.
Any clue? Thanks!
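For what it's worth, XGBoost's survival:cox objective expects a single label column: the survival time, with negative values marking right-censored rows. Below is a minimal sketch of folding the two columns into that form. The helper name cox_label and the toy values are hypothetical, and times are assumed to be strictly positive.

```python
import numpy as np

def cox_label(time_to_event, event):
    """Fold (time, event) into one signed label for survival:cox.

    Rows where the event was observed keep a positive time;
    right-censored rows (event == 0) get a negative time.
    """
    time_to_event = np.asarray(time_to_event, dtype=float)
    event = np.asarray(event).astype(bool)
    return np.where(event, time_to_event, -time_to_event)

# Hypothetical toy values: the second row is censored.
times = [5.0, 12.0, 3.5]
events = [1, 0, 1]
y = cox_label(times, events)
print(y)

# Possible usage with the objects from the question (not run here):
# y_train_cox = cox_label(y_train['time_to_event'], y_train['event'])
# model.fit(X_train, y_train_cox)
```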

Drift Detection in categorical variables of high cardinality (10000+)

I am trying to solve a drift detection problem where I have to find out the drift in high cardinality (10000+) categorical variables such as ip_address, zipcode, cities. I have data points in the order of millions. I have tried the following methods -
chi square test from evidently python package https://github.com/evidentlyai/evidently/blob/main/src/evidently/analyzers/stattests/chisquare_stattest.py
maximum mean discrepancy test with GaussianRBF Kernel from alibi-detect https://github.com/SeldonIO/alibi-detect/blob/master/alibi_detect/cd/mmd.py
I have faced the below problems while applying these methods on my data
In the chi-square test, there is a constraint that the training and inference datasets must have the same set of categories. This is highly unlikely for features like IP address and zipcode. Some categories are present in the training data but not in the inference data. For these I don't get an observed frequency; as a workaround, I can assume their frequency is 0.
But there are also categories which are newly introduced in the inference dataset and are absent from the training dataset, so I cannot derive their expected frequency from the training dataset. For these I would have 0 in the denominator of the chi-square formula, and the test statistic would be NaN. As a workaround, I can assume a minimum expected frequency of 1. But I wonder whether this is the correct way to approach drift detection.
Moreover, the larger problem is the following -
The nature of these categorical features is such that they can take any value from a very large set of values. I have no control over which values they take: users of the system can log in from any IP address and from any zipcode. This makes it very difficult to find the real drift in the data. Methods like the chi-square test may always report a significant result for such features. Is there any drift detection method that takes into account the high cardinality and the aforementioned nature of the data?
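The mechanics of the zero-frequency workaround described above can be sketched as follows. This is only an illustration under assumptions: the helper names are hypothetical, additive smoothing with alpha is swapped in for the "minimum expected frequency of 1" idea, and it does nothing about the larger problem of such tests nearly always being significant.

```python
from collections import Counter

def aligned_counts(reference, current):
    """Count both samples over the union of their categories (missing -> 0)."""
    ref_counts, cur_counts = Counter(reference), Counter(current)
    categories = sorted(set(ref_counts) | set(cur_counts))
    ref = [ref_counts[c] for c in categories]
    cur = [cur_counts[c] for c in categories]
    return categories, ref, cur

def smoothed_chi_square(ref, cur, alpha=1.0):
    """Pearson statistic with additive smoothing on the reference counts,
    so a category unseen in the reference never gives a zero denominator."""
    ref_total = sum(ref) + alpha * len(ref)
    cur_total = sum(cur)
    stat = 0.0
    for r, observed in zip(ref, cur):
        expected = cur_total * (r + alpha) / ref_total
        stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical zipcodes: "94105" vanishes, "60601" is new at inference time.
reference = ["10001", "10001", "94105"]
current = ["10001", "60601"]
categories, ref, cur = aligned_counts(reference, current)
stat = smoothed_chi_square(ref, cur)
print(categories, ref, cur, stat)
```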
In the MMD test with the GaussianRBF kernel, pairwise distances between the rows of two datasets are calculated. X is my training dataset, which has 15 million records and 10 features, and Y is my inference dataset, which has 10 million records and 10 features. When I perform the MMD test on these datasets, I get the following -
a) K_XX = within similarity of X
b) K_YY = within similarity of Y
c) K_XY = cross similarity between X and Y
K_XX will try to generate a matrix (15 million X 15 million). This gives me "ResourceExhaustedError".
ResourceExhaustedError Traceback (most recent call last)
<command-574623167633967> in <module>
1 from alibi_detect.cd import MMDDrift
---> 2 detector = MMDDrift(x_ref=X, backend='tensorflow')
3 res = detector.predict(x=Y)
/databricks/python/lib/python3.8/site-packages/alibi_detect/utils/warnings.py in wrapper(*args, **kwargs)
15 def wrapper(*args, **kwargs):
16 _rename_kwargs(f.__name__, kwargs, aliases)
---> 17 return f(*args, **kwargs)
18 return wrapper
19 return deco
/databricks/python/lib/python3.8/site-packages/alibi_detect/cd/mmd.py in __init__(self, x_ref, backend, p_val, x_ref_preprocessed, preprocess_at_init, update_x_ref, preprocess_fn, kernel, sigma, configure_kernel_from_x_ref, n_permutations, device, input_shape, data_type)
101 if backend == 'tensorflow' and has_tensorflow:
102 kwargs.pop('device', None)
--> 103 self._detector = MMDDriftTF(*args, **kwargs) # type: ignore
104 else:
105 self._detector = MMDDriftTorch(*args, **kwargs) # type: ignore
/databricks/python/lib/python3.8/site-packages/alibi_detect/utils/warnings.py in wrapper(*args, **kwargs)
15 def wrapper(*args, **kwargs):
16 _rename_kwargs(f.__name__, kwargs, aliases)
---> 17 return f(*args, **kwargs)
18 return wrapper
19 return deco
/databricks/python/lib/python3.8/site-packages/alibi_detect/cd/tensorflow/mmd.py in __init__(self, x_ref, p_val, x_ref_preprocessed, preprocess_at_init, update_x_ref, preprocess_fn, kernel, sigma, configure_kernel_from_x_ref, n_permutations, input_shape, data_type)
86 # compute kernel matrix for the reference data
87 if self.infer_sigma or isinstance(sigma, tf.Tensor):
---> 88 self.k_xx = self.kernel(self.x_ref, self.x_ref, infer_sigma=self.infer_sigma)
89 self.infer_sigma = False
90 else:
/databricks/python/lib/python3.8/site-packages/keras/utils/traceback_utils.py in error_handler(*args, **kwargs)
65 except Exception as e: # pylint: disable=broad-except
66 filtered_tb = _process_traceback_frames(e.__traceback__)
---> 67 raise e.with_traceback(filtered_tb) from None
68 finally:
69 del filtered_tb
/databricks/python/lib/python3.8/site-packages/alibi_detect/utils/tensorflow/kernels.py in call(self, x, y, infer_sigma)
75 y = tf.cast(y, x.dtype)
76 x, y = tf.reshape(x, (x.shape[0], -1)), tf.reshape(y, (y.shape[0], -1)) # flatten
---> 77 dist = distance.squared_pairwise_distance(x, y) # [Nx, Ny]
78
79 if infer_sigma or self.init_required:
/databricks/python/lib/python3.8/site-packages/alibi_detect/utils/tensorflow/distance.py in squared_pairwise_distance(x, y, a_min, a_max)
28 x2 = tf.reduce_sum(x ** 2, axis=-1, keepdims=True)
29 y2 = tf.reduce_sum(y ** 2, axis=-1, keepdims=True)
---> 30 dist = x2 + tf.transpose(y2, (1, 0)) - 2. * x @ tf.transpose(y, (1, 0))
31 return tf.clip_by_value(dist, a_min, a_max)
32
ResourceExhaustedError: Exception encountered when calling layer "gaussian_rbf_20" (type GaussianRBF).
OOM when allocating tensor with shape[14335347,14335347] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu [Op:AddV2]
The error seems pretty obvious given the quadratic complexity of MMD. Is there any way to mitigate this issue, given the constraints that the number of records is in the millions and the categorical features have high cardinality?
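To put a number on the quadratic cost, here is a small back-of-the-envelope sketch. The row count comes from the tensor shape in the error message; the 8 GiB budget and the helper names are hypothetical. It only sizes a single square float32 kernel matrix, while the test also needs K_XY and K_YY, so the real limit is lower.

```python
import math

def kernel_matrix_bytes(n_rows, n_cols, bytes_per_value=4):
    """Memory needed for a dense n_rows x n_cols float32 matrix."""
    return n_rows * n_cols * bytes_per_value

def max_square_sample(budget_bytes, bytes_per_value=4):
    """Largest n such that an n x n matrix fits into the given budget."""
    return math.isqrt(budget_bytes // bytes_per_value)

n_ref = 14_335_347  # rows reported in the OOM message
full = kernel_matrix_bytes(n_ref, n_ref)
print(f"K_XX alone would need about {full / 1e12:.0f} TB")

budget = 8 * 1024**3  # hypothetical 8 GiB budget for one kernel matrix
n_sub = max_square_sample(budget)
print(f"A random subsample of at most {n_sub} rows would fit")
```

Running the detector on random subsamples of roughly this size is one option; whether a subsample keeps enough sensitivity for this data is not something the sketch answers.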

Keras model not predicting values in the Test set

I'm building a Keras model to predict whether a user will select a certain product or not (binary classification).
The model seems to be making progress on the validation set that is held out during training, but its predictions are all 0s when it comes to the test set.
My dataset looks something like this:
train_dataset
customer_id id target customer_num_id
0 TCHWPBT 4 0 1
1 TCHWPBT 13 0 1
2 TCHWPBT 20 0 1
3 TCHWPBT 23 0 1
4 TCHWPBT 28 0 1
... ... ... ... ...
1631695 D4Q7TMM 849 0 7417
1631696 D4Q7TMM 855 0 7417
1631697 D4Q7TMM 856 0 7417
1631698 D4Q7TMM 858 0 7417
1631699 D4Q7TMM 907 0 7417
I split it into Train/Val sets using:
from sklearn.model_selection import train_test_split
Train, Val = train_test_split(train_dataset, test_size=0.1, random_state=42, shuffle=False)
After I split the dataset, I select the features that are used when training and validating the model:
train_customer_id = Train['customer_num_id']
train_vendor_id = Train['id']
train_target = Train['target']
val_customer_id = Val['customer_num_id']
val_vendor_id = Val['id']
val_target = Val['target']
... And run the model:
epochs = 2
for e in range(epochs):
    print('EPOCH: ', e)
    model.fit([train_customer_id, train_vendor_id], train_target, epochs=1, verbose=1, batch_size=384)
    prediction = model.predict(x=[train_customer_id, train_vendor_id], verbose=1, batch_size=384)
    train_f1 = f1_score(y_true=train_target.astype('float32'), y_pred=prediction.round())
    print('TRAIN F1: ', train_f1)
    val_prediction = model.predict(x=[val_customer_id, val_vendor_id], verbose=1, batch_size=384)
    val_f1 = f1_score(y_true=val_target.astype('float32'), y_pred=val_prediction.round())
    print('VAL F1: ', val_f1)
EPOCH: 0
1468530/1468530 [==============================] - 19s 13us/step - loss: 0.0891
TRAIN F1: 0.1537511577647422
VAL F1: 0.09745762711864409
EPOCH: 1
1468530/1468530 [==============================] - 19s 13us/step - loss: 0.0691
TRAIN F1: 0.308748569645272
VAL F1: 0.2076433121019108
The validation F1 score seems to be improving over time, and the model predicts both 1s and 0s:
prediction = model.predict(x=[val_customer_id, val_vendor_id], verbose=1, batch_size=384)
np.unique(prediction.round())
array([0., 1.], dtype=float32)
But when I try to predict on the test set, the model predicts 0 for all values:
prediction = model.predict(x=[test_dataset['customer_num_id'], test_dataset['id']], verbose=1, batch_size=384)
np.unique(prediction.round())
array([0.], dtype=float32)
Test dataset looks similar to the training and validation sets, and it has been left out during training just like the validation set, yet the model can't output values other than 0.
Here's what test dataset looks like:
test_dataset
customer_id id customer_num_id
0 Z59FTQD 243 7418
1 0JP29SK 243 7419
... ... ... ...
1671995 L9G4OFV 907 17414
1671996 L9G4OFV 907 17414
1671997 FDZFYBA 907 17415
Does anyone know what might be the issue here?
EDIT: made dataset text more readable
Please take a look at the distribution of your data. In the sample data you've shown, target is all 0s. If most users don't select the product, a model that always predicts 0 will be right most of the time, so it could be improving its accuracy by over-fitting to the majority class (0).
You can reduce over-fitting by adjusting parameters like the learning rate, and by changing the model architecture, for example by adding dropout layers.
Also, I'm not sure what your model looks like, but you're only training for 2 epochs, so it may not have had enough time to learn to generalize, and depending on how deep your model is, it could need a lot more training time.
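A quick way to look at the distribution, as a minimal sketch with hypothetical helper names and toy values loosely modeled on the tables in the question. The second check is an extra assumption on top of the advice above: the tables suggest that the test customer_num_id values (7418 and up) never occur in training (1 to 7417), which might be worth confirming.

```python
from collections import Counter

def class_balance(labels):
    """Fraction of samples per class."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

def unseen_ids(train_ids, test_ids):
    """IDs that appear in the test set but never in the training set."""
    return set(test_ids) - set(train_ids)

# Hypothetical toy values; pandas Series such as Train['target'] or
# test_dataset['customer_num_id'] can be passed in the same way.
train_target = [0, 0, 0, 0, 1]
train_customer = [1, 1, 1, 7417, 7417]
test_customer = [7418, 7419, 17414]

balance = class_balance(train_target)
new_customers = unseen_ids(train_customer, test_customer)
print(balance)
print(sorted(new_customers))
```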

Value Error : Only 2 class/es in training fold, but 1 in overall dataset. This is not supported for decision_function with imbalanced folds

I am learning machine learning and creating my first model on the MNIST dataset.
Can anyone help me here? I have tried stratified k-fold, k-fold and other methods to resolve this issue.
Pandas Version '0.25.1', Python Version 3.7, using Anaconda Distribution.
from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(mnist, test_size=0.2, random_state=29)
X_train, y_train = train_set.drop('label', axis=1), train_set[['label']]
X_test, y_test = test_set.drop('label', axis=1), test_set[['label']]
y_train_5 = (y_train == 5)  # True for all 5s and False otherwise
y_test_5 = (y_test == 5)
from sklearn.linear_model import SGDClassifier
sgd_clf = SGDClassifier(random_state=29)
sgd_clf.fit(X_train, y_train_5)
from sklearn.model_selection import cross_val_predict
print(X_train.shape)
print(y_train_5.shape)
cross_val_predict(sgd_clf, X_train, y_train_5, cv=3, method="decision_function")
Last line of the code block gives an error:
RuntimeWarning: Number of classes in training fold (2) does not match total number of classes (1). Results may not be appropriate for your use case. To fix this, use a cross-validation technique resulting in properly stratified folds
RuntimeWarning)
ValueError Traceback (most recent call last)
<ipython-input-39-da1ad024473a> in <module>
3 print(X_train.shape)
4 print(y_train_5.shape)
----> 5 cross_val_predict(sgd_clf, X_train, y_train_5, cv=3, method="decision_function")
~\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py in cross_val_predict(estimator, X, y, groups, cv, n_jobs, verbose, fit_params, pre_dispatch, method)
787 prediction_blocks = parallel(delayed(_fit_and_predict)(
788 clone(estimator), X, y, train, test, verbose, fit_params, method)
--> 789 for train, test in cv.split(X, y, groups))
790
791 # Concatenate the predictions
~\AppData\Local\Continuum\anaconda3\lib\site-packages\joblib\parallel.py in __call__(self, iterable)
919 # remaining jobs.
920 self._iterating = False
--> 921 if self.dispatch_one_batch(iterator):
922 self._iterating = self._original_iterator is not None
923
~\AppData\Local\Continuum\anaconda3\lib\site-packages\joblib\parallel.py in dispatch_one_batch(self, iterator)
757 return False
758 else:
--> 759 self._dispatch(tasks)
760 return True
761
~\AppData\Local\Continuum\anaconda3\lib\site-packages\joblib\parallel.py in _dispatch(self, batch)
714 with self._lock:
715 job_idx = len(self._jobs)
--> 716 job = self._backend.apply_async(batch, callback=cb)
717 # A job can complete so quickly than its callback is
718 # called before we get here, causing self._jobs to
~\AppData\Local\Continuum\anaconda3\lib\site-packages\joblib\_parallel_backends.py in apply_async(self, func, callback)
180 def apply_async(self, func, callback=None):
181 """Schedule a func to be run"""
--> 182 result = ImmediateResult(func)
183 if callback:
184 callback(result)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\joblib\_parallel_backends.py in __init__(self, batch)
547 # Don't delay the application, to avoid keeping the input
548 # arguments in memory
--> 549 self.results = batch()
550
551 def get(self):
~\AppData\Local\Continuum\anaconda3\lib\site-packages\joblib\parallel.py in __call__(self)
223 with parallel_backend(self._backend, n_jobs=self._n_jobs):
224 return [func(*args, **kwargs)
--> 225 for func, args, kwargs in self.items]
226
227 def __len__(self):
~\AppData\Local\Continuum\anaconda3\lib\site-packages\joblib\parallel.py in <listcomp>(.0)
223 with parallel_backend(self._backend, n_jobs=self._n_jobs):
224 return [func(*args, **kwargs)
--> 225 for func, args, kwargs in self.items]
226
227 def __len__(self):
~\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py in _fit_and_predict(estimator, X, y, train, test, verbose, fit_params, method)
887 n_classes = len(set(y)) if y.ndim == 1 else y.shape[1]
888 predictions = _enforce_prediction_order(
--> 889 estimator.classes_, predictions, n_classes, method)
890 return predictions, test
891
~\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py in _enforce_prediction_order(classes, predictions, n_classes, method)
933 'is not supported for decision_function '
934 'with imbalanced folds. {}'.format(
--> 935 len(classes), n_classes, recommendation))
936
937 float_min = np.finfo(predictions.dtype).min
ValueError: Only 2 class/es in training fold, but 1 in overall dataset. This is not supported for decision_function with imbalanced folds. To fix this, use a cross-validation technique resulting in properly stratified folds
I ran into a similar problem and, on further investigation, found a warning message in the error log:
DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
There are two ways to solve this:
1. Use the hint in the warning message and change your code to:
cross_val_predict(sgd_clf, X_train, y_train_5.values.ravel(), cv=3,
                  method="decision_function")
refer - answer here
2. Following the same hint (A column-vector y was passed when a 1d array was expected), I realized my mistake and passed a Series instead. The clue is also in your error log: Number of classes in training fold (2) does not match total number of classes (1).
I assume y_train_5 here is a DataFrame (you are probably working your way through Aurélien Géron's book).
The expected type for y_train_5 is a one-dimensional array-like object, i.e. of shape (n,), but a DataFrame is 2-dimensional, in your case (n, 1).
All you need to do is pass a Series object for your column vector, as either:
y_train_5.iloc[:,0] (I prefer this)
y_train_5.{COLUMN_NAME} (another variant)
Try running the following in your console:
> y_train_5.iloc[:,0].shape
(n,)
cross_val_predict(sgd_clf, X_train, y_train_5.iloc[:,0], cv=3,
                  method="decision_function")
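Why the error reports "1 in overall dataset" can be seen from line 887 of the traceback in the question: for a 2-dimensional y, the number of classes is taken from y.shape[1]. Here is a minimal sketch with a hypothetical NumPy stand-in for y_train_5 (the name count_classes is made up for the illustration):

```python
import numpy as np

# Hypothetical stand-in for y_train_5: a boolean column vector, shape (n, 1).
y_2d = np.array([[True], [False], [False], [True]])
y_1d = y_2d.ravel()  # same values flattened to shape (n,)

def count_classes(y):
    # Same expression as line 887 of the traceback in the question.
    return len(set(y)) if y.ndim == 1 else y.shape[1]

print(y_2d.shape, count_classes(y_2d))  # column vector: one "class"
print(y_1d.shape, count_classes(y_1d))  # 1-D array: True and False
```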

How to predict multi-label dataset using svm

I'm using a dataset with all decimal values and timestamp which has the following features :
1. sno
2. timestamp
3. v1
4. v2
5. v3
I have data for 5 months, with a timestamp for every minute. I need to predict whether v1, v2 and v3 will be in use at any time in the future. The values of v1, v2 and v3 are between 0 and 25.
How can I do this?
I've used binary classification before, but I have no clue how to proceed with a multi-label prediction problem. Until now I've always used the code below. How should I train the model, and how should I use v1, v2 and v3 to fit into 'y'?
X_train, X_test, y_train, y_test = train_test_split(train, y, test_size=0.2)
Data:
sno power voltage v1 v2 v3 timestamp
1 3.74 235.24 0 16 18 2006-12-16 18:03:00
2 4.928 237.14 0 37 16 2006-12-16 18:04:00
3 6.052 236.73 0 37 17 2006-12-16 18:05:00
4 6.752 237.06 0 36 17 2006-12-16 18:06:00
5 6.474 237.13 0 37 16 2006-12-16 18:07:00
6 6.308 235.84 0 36 17 2006-12-16 18:08:00
7 4.464 232.69 0 37 16 2006-12-16 18:09:00
8 3.396 230.98 0 22 18 2006-12-16 18:10:00
9 3.09 232.21 0 12 17 2006-12-16 18:11:00
10 3.73 234.19 0 27 17 2006-12-16 18:12:00
11 2.308 234.96 0 1 17 2006-12-16 18:13:00
12 2.388 236.66 0 1 17 2006-12-16 18:14:00
13 4.598 235.84 0 20 17 2006-12-16 18:15:00
14 4.524 235.6 0 9 17 2006-12-16 18:16:00
15 4.202 235.49 0 1 17 2006-12-16 18:17:00
Following the documentation:
The multiclass support is handled according to a one-vs-one scheme (and should thus support one-vs-all strategy).
One-vs-one strategy
The one-vs-one scheme uses one classifier per pair of classes. At prediction time, the class that receives the most votes (one vote from each pairwise classifier) is selected as the prediction. If the vote is tied, i.e. two classes receive the same number of votes, the classification confidence is used to break the tie.
To use an SVM with this scheme:
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import SVC
...
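subclf = SVC(**params)  # params: a dict of SVC hyperparameters
clf = OneVsOneClassifier(estimator=subclf)
clf.fit(X_train, y_train)  # fit() requires the training features and labels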
One-vs-rest strategy
The alternative is a one-vs-rest (one-vs-all) strategy, which fits one classifier per class, each separating that class from all other classes in the data. It is more popular than the first scheme because the results are easier to interpret and fewer classifiers have to be fitted (one per class rather than one per pair of classes). It is as simple to use as the first example:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC
...
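subclf = SVC(**params)  # params: a dict of SVC hyperparameters
clf = OneVsRestClassifier(estimator=subclf)
clf.fit(X_train, y_train)  # fit() requires the training features and labels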
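The difference between the two strategies in the number of fitted classifiers, and the one-vs-one vote, can be sketched in plain Python. The class labels and pairwise winners are hypothetical, and ties here simply fall to the first class counted, which is a simplification of the confidence-based tie-break described above.

```python
from collections import Counter
from itertools import combinations

classes = [0, 1, 2, 3]

# One-vs-one: one classifier for every unordered pair of classes.
ovo_pairs = list(combinations(classes, 2))
# One-vs-rest: one classifier per class.
ovr_count = len(classes)
print(len(ovo_pairs), "pairwise classifiers vs", ovr_count, "one-vs-rest classifiers")

def ovo_vote(pairwise_winners):
    """Return the class that won the most pairwise duels."""
    votes = Counter(pairwise_winners)
    return votes.most_common(1)[0][0]

# Hypothetical winners for one sample, in the same order as ovo_pairs:
# (0,1)->1, (0,2)->2, (0,3)->0, (1,2)->2, (1,3)->1, (2,3)->2
winners = [1, 2, 0, 2, 1, 2]
predicted = ovo_vote(winners)
print("predicted class:", predicted)
```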
To read more about multi-label classification and learning proceed here
Target variable coding
So, the basic idea is to build a single combined target variable from v1, v2 and v3, such that:
y equals 0 if v1, v2 and v3 are all zero
y equals 1 if exactly one of v1, v2, v3 is nonzero
y equals 2 if exactly two of v1, v2, v3 are nonzero
y equals 3 if v1, v2 and v3 are all nonzero
One way to do this is the following:
y = []
for i, j, k in zip(data['v1'], data['v2'], data['v3']):
    # Compare every value explicitly: `i and j and k > 0` only compares k.
    if i > 0 and j > 0 and k > 0:
        y.append(3)
    elif (i > 0 and j > 0) or (i > 0 and k > 0) or (j > 0 and k > 0):
        y.append(2)
    elif i > 0 or j > 0 or k > 0:
        y.append(1)
    else:
        y.append(0)
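Since the four codes are just the number of values greater than zero, the same coding can also be written without a loop. Below is a minimal sketch using NumPy, with the first three v1, v2, v3 rows from the question plus one hypothetical all-zero row:

```python
import numpy as np

# Columns are v1, v2, v3. The first three rows come from the question's data;
# the last row is a hypothetical "nothing in use" example.
values = np.array([
    [0, 16, 18],
    [0, 37, 16],
    [0, 37, 17],
    [0, 0, 0],
])

# Count, per row, how many of the three values are greater than zero.
y = (values > 0).sum(axis=1)
print(y.tolist())
```

With a DataFrame, data[['v1', 'v2', 'v3']].to_numpy() could be used in place of values.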
