XGBoost Survival Model - machine-learning

I'm trying to develop an XGBoost Survival model. Here is a quick snap of my code:
X = df_High_School[['Gender', 'Lived_both_Parents', 'Moth_Born_in_Canada', 'Father_Born_in_Canada','Born_in_Canada','Aboriginal','Visible_Minority']] # covariates
y = df_High_School[['time_to_event', 'event']] # time to event and event indicator
#split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
#Develop the model
model = xgb.XGBRegressor(objective='survival:cox')
It's giving me the following error:
ValueError Traceback (most recent call last)
in
18
19 # fit the model to the training data
---> 20 model.fit(X_train, y_train)
21
22 # make predictions on the test set
2 frames
/usr/local/lib/python3.8/dist-packages/xgboost/core.py in _maybe_pandas_label(label)
261 if isinstance(label, DataFrame):
262 if len(label.columns) > 1:
--> 263 raise ValueError('DataFrame for label cannot have multiple columns')
264
265 label_dtypes = label.dtypes
ValueError: DataFrame for label cannot have multiple columns
As this is a survival model, I need two columns t indicate the event and the time_to_event. I also tried converting the Dataframes to Numpy but it didn't work too.
Any clue? Thanks!

Related

Splitting MNIST dataset vs CSV dataset

I am trying to use a custom dataset in here (instead of MNIST) and my dataset looks like this:
age gender genre(output)
-- ------ -------------
20 1 HipHop
26 1 Jazz
31 0 Classical
20 0 Dance
Priorly, I used this method for splitting:
X = df.drop(columns=["genre"])
y = df["genre"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
I am struggling how to split this CSV dataset into the method that splits the MNIST dataset.
I would appreciate your helps.
Thanks

Drift Detection in categorical variables of high cardinality (10000+)

I am trying to solve a drift detection problem where I have to find out the drift in high cardinality (10000+) categorical variables such as ip_address, zipcode, cities. I have data points in the order of millions. I have tried the following methods -
chi square test from evidently python package https://github.com/evidentlyai/evidently/blob/main/src/evidently/analyzers/stattests/chisquare_stattest.py
maximum mean discrepancy test with GaussianRBF Kernel from alibi-detecthttps://github.com/SeldonIO/alibi-detect/blob/master/alibi_detect/cd/mmd.py
I have faced the below problems while applying these methods on my data
In chi square test, there is a constraint that we should have same set of categories in both training and inference datasets. This is highly unlikely for the features like ip address and zipcode. There are some data points which are available in training data but not in inference data. For such data points, I don't get observed frequency. I can assume their frequency as 0 as a work around.
But there are data points which have been newly introduced in the inference dataset and don't have their presence in the training dataset. So I would not be able to find out their expected frequency from training dataset. For such data points, I would have 0 in the denominator of the chi square formula and test statistic will be NaN. As a workaround, I can assume their minimum expected frequency equal to 1. But I wonder whether this is the correct way to approach the drift detection.
Moreover, the larger problem is the following -
The nature of these categorical feature variables is such that they can take any possible value from a very very large set of values. I don't have any control over these features taking a value. The users of the system can login from any IP address and from any zipcode. This becomes very difficult to find out the real drift in the data. Methods like chi square test can always give the significant result for such features. Is their any method which can handle such features for drift detection which takes into consideration the high cardinality and the aforementioned nature of the data.
In MMD test with GaussianRBF kernel, we use to calculate pairwise distance between two vectors. X is my training dataset which is having 15 millions records and 10 features. and Y is my inference dataset which is having 10 millions records and 10 features. Now when I perform MMD test on these datasets. I get the following -
a) K_XX = within similarity of X
b) K_YY = within similarity of Y
c) K_XY = cross similarity between X and Y
K_XX will try to generate a matrix (15 million X 15 million). This gives me "ResourceExhaustedError".
ResourceExhaustedError Traceback (most recent call last)
<command-574623167633967> in <module>
1 from alibi_detect.cd import MMDDrift
---> 2 detector = MMDDrift(x_ref=X, backend='tensorflow')
3 res = detector.predict(x=Y)
/databricks/python/lib/python3.8/site-packages/alibi_detect/utils/warnings.py in wrapper(*args, **kwargs)
15 def wrapper(*args, **kwargs):
16 _rename_kwargs(f.__name__, kwargs, aliases)
---> 17 return f(*args, **kwargs)
18 return wrapper
19 return deco
/databricks/python/lib/python3.8/site-packages/alibi_detect/cd/mmd.py in __init__(self, x_ref, backend, p_val, x_ref_preprocessed, preprocess_at_init, update_x_ref, preprocess_fn, kernel, sigma, configure_kernel_from_x_ref, n_permutations, device, input_shape, data_type)
101 if backend == 'tensorflow' and has_tensorflow:
102 kwargs.pop('device', None)
--> 103 self._detector = MMDDriftTF(*args, **kwargs) # type: ignore
104 else:
105 self._detector = MMDDriftTorch(*args, **kwargs) # type: ignore
/databricks/python/lib/python3.8/site-packages/alibi_detect/utils/warnings.py in wrapper(*args, **kwargs)
15 def wrapper(*args, **kwargs):
16 _rename_kwargs(f.__name__, kwargs, aliases)
---> 17 return f(*args, **kwargs)
18 return wrapper
19 return deco
/databricks/python/lib/python3.8/site-packages/alibi_detect/cd/tensorflow/mmd.py in __init__(self, x_ref, p_val, x_ref_preprocessed, preprocess_at_init, update_x_ref, preprocess_fn, kernel, sigma, configure_kernel_from_x_ref, n_permutations, input_shape, data_type)
86 # compute kernel matrix for the reference data
87 if self.infer_sigma or isinstance(sigma, tf.Tensor):
---> 88 self.k_xx = self.kernel(self.x_ref, self.x_ref, infer_sigma=self.infer_sigma)
89 self.infer_sigma = False
90 else:
/databricks/python/lib/python3.8/site-packages/keras/utils/traceback_utils.py in error_handler(*args, **kwargs)
65 except Exception as e: # pylint: disable=broad-except
66 filtered_tb = _process_traceback_frames(e.__traceback__)
---> 67 raise e.with_traceback(filtered_tb) from None
68 finally:
69 del filtered_tb
/databricks/python/lib/python3.8/site-packages/alibi_detect/utils/tensorflow/kernels.py in call(self, x, y, infer_sigma)
75 y = tf.cast(y, x.dtype)
76 x, y = tf.reshape(x, (x.shape[0], -1)), tf.reshape(y, (y.shape[0], -1)) # flatten
---> 77 dist = distance.squared_pairwise_distance(x, y) # [Nx, Ny]
78
79 if infer_sigma or self.init_required:
/databricks/python/lib/python3.8/site-packages/alibi_detect/utils/tensorflow/distance.py in squared_pairwise_distance(x, y, a_min, a_max)
28 x2 = tf.reduce_sum(x ** 2, axis=-1, keepdims=True)
29 y2 = tf.reduce_sum(y ** 2, axis=-1, keepdims=True)
---> 30 dist = x2 + tf.transpose(y2, (1, 0)) - 2. * x # tf.transpose(y, (1, 0))
31 return tf.clip_by_value(dist, a_min, a_max)
32
ResourceExhaustedError: Exception encountered when calling layer "gaussian_rbf_20" (type GaussianRBF).
OOM when allocating tensor with shape[14335347,14335347] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu [Op:AddV2]
The error seems pretty obvious given the quadratic complexity of MMD. Is their any way to mitigate this issue given the constraints that number of records in millions and high cardinality of categorical features.

Keras model not predicting values in the Test set

I'm building a Keras model to predict predict if the user will select the certain product or not (binary classification).
Model seems to be making progress on Validation set that is heldout while training, but the model's predictions are all 0s when it comes to the test set.
My dataset looks something like this:
train_dataset
customer_id id target customer_num_id
0 TCHWPBT 4 0 1
1 TCHWPBT 13 0 1
2 TCHWPBT 20 0 1
3 TCHWPBT 23 0 1
4 TCHWPBT 28 0 1
... ... ... ... ...
1631695 D4Q7TMM 849 0 7417
1631696 D4Q7TMM 855 0 7417
1631697 D4Q7TMM 856 0 7417
1631698 D4Q7TMM 858 0 7417
1631699 D4Q7TMM 907 0 7417
I split it into Train/Val sets using:
from sklearn.model_selection import train_test_split
Train, Val = train_test_split(train_dataset, test_size=0.1, random_state=42, shuffle=False)
After I split the dataset, I select the features that are used when training and validating the model:
train_customer_id = Train['customer_num_id']
train_vendor_id = Train['id']
train_target = Train['target']
val_customer_id = Val['customer_num_id']
val_vendor_id = Val['id']
val_target = Val['target']
... And run the model:
epochs = 2
for e in range(epochs):
print('EPOCH: ', e)
model.fit([train_customer_id, train_vendor_id], train_target, epochs=1, verbose=1, batch_size=384)
prediction = model.predict(x=[train_customer_id, train_vendor_id], verbose=1, batch_size=384)
train_f1 = f1_score(y_true=train_target.astype('float32'), y_pred=prediction.round())
print('TRAIN F1: ', train_f1)
val_prediction = model.predict(x=[val_customer_id, val_vendor_id], verbose=1, batch_size=384)
val_f1 = f1_score(y_true=val_target.astype('float32'), y_pred=val_prediction.round())
print('VAL F1: ', val_f1)
EPOCH: 0
1468530/1468530 [==============================] - 19s 13us/step - loss: 0.0891
TRAIN F1: 0.1537511577647422
VAL F1: 0.09745762711864409
EPOCH: 1
1468530/1468530 [==============================] - 19s 13us/step - loss: 0.0691
TRAIN F1: 0.308748569645272
VAL F1: 0.2076433121019108
The validation accuracy seems to be improving with time, and model predicts both 1s and 0s:
prediction = model.predict(x=[val_customer_id, val_vendor_id], verbose=1, batch_size=384)
np.unique(prediction.round())
array([0., 1.], dtype=float32)
But when I try predict the test set, model predicts 0 for all values:
prediction = model.predict(x=[test_dataset['customer_num_id'], test_dataset['id']], verbose=1, batch_size=384)
np.unique(prediction.round())
array([0.], dtype=float32)
Test dataset looks similar to the training and validation sets, and it has been left out during training just like the validation set, yet the model can't output values other than 0.
Here's what test dataset looks like:
test_dataset
customer_id id customer_num_id
0 Z59FTQD 243 7418
1 0JP29SK 243 7419
... ... ... ...
1671995 L9G4OFV 907 17414
1671996 L9G4OFV 907 17414
1671997 FDZFYBA 907 17415
Does anyone know what might be the issue here?
EDIT: made dataset text more readable
Please take a look at the distribution of your data. I see in the sample data you've shown that target is all 0's. Consider that if most users don't select the product, then if the model always predicts 0, it will be right most of the time. So, it could be improving it's accuracy by over-fitting to the majority class (0).
You can prevent over-fitting by adjusting params like the learning rate and model architecture by adding dropout layers.
Also, I'm not sure what your model looks like, but you're only training for 2 epochs so it may not have had enough time to generalize the data, and depending on how deep your model is it could need a lot more training time

Value Error : Only 2 class/es in training fold, but 1 in overall dataset. This is not supported for decision_function with imbalanced folds

I am learning machine learning and creating my first model on #mnist data set.
Can anyone help me over here? I have tried Stratified Fold, kfold and other methods to resolve this issue.
Pandas Version '0.25.1', Python Version 3.7, using Anaconda Distribution.
from sklearn.model_selection import train_test_split
train_set ,test_set = train_test_split(mnist,test_size = 0.2, random_state = 29)
from sklearn.linear_model import SGDClassifier
sgd_clf = SGDClassifier(random_state=29)
sgd_clf.fit(X_train,y_train_5)
X_train, y_train = train_set.drop('label',axis = 1), train_set[['label']]
X_test, y_test = test_set.drop('label',axis = 1),test_set[['label']]
y_train_5 = (y_train == 5) #True for all 5's and false otherwise
y_test_5 = (y_train == 5)
from sklearn.model_selection import cross_val_predict
print(X_train.shape)
print(y_train_5.shape)
cross_val_predict(sgd_clf, X_train, y_train_5, cv=3, method="decision_function")
Last line of the code block gives an error:
RuntimeWarning: Number of classes in training fold (2) does not match total number of classes (1). Results may not be appropriate for your use case. To fix this, use a cross-validation technique resulting in properly stratified folds
RuntimeWarning)
ValueError Traceback (most recent call last)
<ipython-input-39-da1ad024473a> in <module>
3 print(X_train.shape)
4 print(y_train_5.shape)
----> 5 cross_val_predict(sgd_clf, X_train, y_train_5, cv=3, method="decision_function")
~\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py in cross_val_predict(estimator, X, y, groups, cv, n_jobs, verbose, fit_params, pre_dispatch, method)
787 prediction_blocks = parallel(delayed(_fit_and_predict)(
788 clone(estimator), X, y, train, test, verbose, fit_params, method)
--> 789 for train, test in cv.split(X, y, groups))
790
791 # Concatenate the predictions
~\AppData\Local\Continuum\anaconda3\lib\site-packages\joblib\parallel.py in __call__(self, iterable)
919 # remaining jobs.
920 self._iterating = False
--> 921 if self.dispatch_one_batch(iterator):
922 self._iterating = self._original_iterator is not None
923
~\AppData\Local\Continuum\anaconda3\lib\site-packages\joblib\parallel.py in dispatch_one_batch(self, iterator)
757 return False
758 else:
--> 759 self._dispatch(tasks)
760 return True
761
~\AppData\Local\Continuum\anaconda3\lib\site-packages\joblib\parallel.py in _dispatch(self, batch)
714 with self._lock:
715 job_idx = len(self._jobs)
--> 716 job = self._backend.apply_async(batch, callback=cb)
717 # A job can complete so quickly than its callback is
718 # called before we get here, causing self._jobs to
~\AppData\Local\Continuum\anaconda3\lib\site-packages\joblib\_parallel_backends.py in apply_async(self, func, callback)
180 def apply_async(self, func, callback=None):
181 """Schedule a func to be run"""
--> 182 result = ImmediateResult(func)
183 if callback:
184 callback(result)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\joblib\_parallel_backends.py in __init__(self, batch)
547 # Don't delay the application, to avoid keeping the input
548 # arguments in memory
--> 549 self.results = batch()
550
551 def get(self):
~\AppData\Local\Continuum\anaconda3\lib\site-packages\joblib\parallel.py in __call__(self)
223 with parallel_backend(self._backend, n_jobs=self._n_jobs):
224 return [func(*args, **kwargs)
--> 225 for func, args, kwargs in self.items]
226
227 def __len__(self):
~\AppData\Local\Continuum\anaconda3\lib\site-packages\joblib\parallel.py in <listcomp>(.0)
223 with parallel_backend(self._backend, n_jobs=self._n_jobs):
224 return [func(*args, **kwargs)
--> 225 for func, args, kwargs in self.items]
226
227 def __len__(self):
~\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py in _fit_and_predict(estimator, X, y, train, test, verbose, fit_params, method)
887 n_classes = len(set(y)) if y.ndim == 1 else y.shape[1]
888 predictions = _enforce_prediction_order(
--> 889 estimator.classes_, predictions, n_classes, method)
890 return predictions, test
891
~\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py in _enforce_prediction_order(classes, predictions, n_classes, method)
933 'is not supported for decision_function '
934 'with imbalanced folds. {}'.format(
--> 935 len(classes), n_classes, recommendation))
936
937 float_min = np.finfo(predictions.dtype).min
ValueError: Only 2 class/es in training fold, but 1 in overall dataset. This is not supported for decision_function with imbalanced folds. To fix this, use a cross-validation technique resulting in properly stratified folds
I ran through a similar problem and on further investigation found a warning message with the error log-
DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
There are two ways to solve this:
Use the hint in the warning message and change your code as:
cross_val_predict(sgd_clf, X_train, y_train_5.values.ravel(), cv=3,
method="decision_function")
refer - answere here
Also, using the hint from - A column-vector y was passed when a 1d array was expected.; I released my mistake and did the following:
Even in your error log- Number of classes in training fold (2) does not match total number of classes (1)
I assume y_train_5 here is a DataFrame, (probably you are working your way through Aurelien's publication)
The expected type for y_train_5 is an array-type object (meaning the shaoe to be (n,) or one-dimensional), but DataFrame is 2-dimensional, in your case (n,1).
All you need to do is pass the Series object for your column vector as-
y_train_5.iloc[:,0] (I prefer this)
y_train_5.{COLUMN_NAME} (another variant)
Try running below in your console.
> y_train_5.iloc[:,0].shape
(n,)
cross_val_predict(sgd_clf, X_train, y_train_5.iloc[:,0], cv=3,
method="decision_function")

Dask distributed does not run SVD if some of the chunks contain only NaN values

First of all thank you for providing dask with all its functionality, which is highly appreciated!
However, using dask.distributed to run an SVD on a rasterized dataset, it seems as if it fails when only single chunks consist only of NaN values although most of the dataset does contain correct values.
I read a dataset using xarray.open_mfdataset(chunks={...}) and try to set the chunksize such, that SVD computation (dask.array.linalg) used in the eofs.xarray package makes use of the cores our cluster provides, by using a dask.distributed client.
<xarray.Dataset>
Dimensions: (time: 8760, x: 1000, y: 840)
Coordinates:
* x (x) float64 2.452e+06 2.458e+06 2.462e+06 ... 7.442e+06 7.448e+06
* y (y) float64 1.352e+06 1.358e+06 1.362e+06 ... 5.542e+06 5.548e+06
* time (time) datetime64[ns] 2005-01-01 ... 2005-12-31T23:00:00
Data variables:
capacity (y, x) float64 dask.array<shape=(840, 1000), chunksize=(840, 840)>
capfac (time, y, x) float32 dask.array<shape=(8760, 840, 1000), chunksize=(876, 840, 840)>
However, when I run the computation, it fails with the below-mentioned error message.
ValueError: error encountered in SVD, check that missing values are in the same places at each time and that all the values are not missing
See complete error message:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
~/.conda/envs/spagat_py36/lib/python3.6/site-packages/eofs/standard.py in __init__(self, dataset, weights, center, ddof)
164 # Use the parallel Dask algorithm
--> 165 dsvd = dask.array.linalg.svd(dataNoMissing)
166 A, Lh, E = (x.compute() for x in dsvd)
~/.conda/envs/spagat_py36/lib/python3.6/site-packages/dask/array/linalg.py in svd(a)
803 """
--> 804 return tsqr(a, compute_svd=True)
805
~/.conda/envs/spagat_py36/lib/python3.6/site-packages/dask/array/linalg.py in tsqr(data, compute_svd, _max_vchunk_size)
116 "Current shape: {},\nCurrent chunksize: {}".format(
--> 117 data.shape, data.chunksize
118 )
ValueError: Input must have the following properties:
1. Have two dimensions
2. Have only one column of blocks
Note: This function (tsqr) supports QR decomposition in the case of
tall-and-skinny matrices (single column chunk/block; see qr)Current shape: (8760, nan),
Current chunksize: (876, nan)
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
<ipython-input-17-f60250fedf8b> in <module>
----> 1 pca.analyze()
~/code/tsa_lib/tsa_lib/time_tools.py in f(*args, **kwargs)
8 def f(*args, **kwargs):
9 before = time.perf_counter() # maybe exchange with time.process_time()
---> 10 rv = func(*args, **kwargs)
11 after = time.perf_counter()
12 print('elapsed time for {.__name__}: {:.2f} minutes'.format(func, (after - before)/60))
~/code/playground/playground/PCA.py in analyze(self)
145 print('PCA completed. Weights used.')
146 else:
--> 147 self.eofs, self.pcs, self.solver = eof_analysis(self.data_variability, n_eofs=None, xarray=True)
148 print('PCA completed. No weights used.')
149
~/code/tsa_lib/tsa_lib/time_tools.py in f(*args, **kwargs)
8 def f(*args, **kwargs):
9 before = time.perf_counter() # maybe exchange with time.process_time()
---> 10 rv = func(*args, **kwargs)
11 after = time.perf_counter()
12 print('elapsed time for {.__name__}: {:.2f} minutes'.format(func, (after - before)/60))
~/code/playground/playground/PCA.py in eof_analysis(data, n_eofs, xarray, wgts, lats)
36 solver = xEof(data, weights=wgts)
37 else:
---> 38 solver = xEof(data)
39
40 eofs = solver.eofsAsCovariance(neofs=n_eofs)
~/.conda/envs/spagat_py36/lib/python3.6/site-packages/eofs/xarray.py in __init__(self, array, weights, center, ddof)
131 weights=wtarray,
132 center=center,
--> 133 ddof=ddof)
134 # Name of the input DataArray.
135 self._name = array.name
~/.conda/envs/spagat_py36/lib/python3.6/site-packages/eofs/standard.py in __init__(self, dataset, weights, center, ddof)
175
176 except (np.linalg.LinAlgError, ValueError):
--> 177 raise ValueError('error encountered in SVD, check that missing '
178 'values are in the same places at each time and '
179 'that all the values are not missing')
ValueError: error encountered in SVD, check that missing values are in the same places at each time and that all the values are not missing
When applying SVD on a rasterized dataset, the below-mentioned error is given. Is it possible, that the error is raised because single chunks might be only containing NaN values?
If so, it could be considered as a bug of dask.distributed because the SVD works fine when applying it without chunking. Hence, the SVD should not fail only because single chunks only contain NaN values, whereas other chunks contain valid values, should it?

Resources