Keras model not predicting values in the Test set - machine-learning

I'm building a Keras model to predict whether a user will select a certain product or not (binary classification).
The model seems to be making progress on the validation set that is held out during training, but its predictions are all 0s on the test set.
My dataset looks something like this:
train_dataset
         customer_id   id  target  customer_num_id
0            TCHWPBT    4       0                1
1            TCHWPBT   13       0                1
2            TCHWPBT   20       0                1
3            TCHWPBT   23       0                1
4            TCHWPBT   28       0                1
...              ...  ...     ...              ...
1631695      D4Q7TMM  849       0             7417
1631696      D4Q7TMM  855       0             7417
1631697      D4Q7TMM  856       0             7417
1631698      D4Q7TMM  858       0             7417
1631699      D4Q7TMM  907       0             7417
I split it into Train/Val sets using:
from sklearn.model_selection import train_test_split
Train, Val = train_test_split(train_dataset, test_size=0.1, random_state=42, shuffle=False)
After I split the dataset, I select the features that are used when training and validating the model:
train_customer_id = Train['customer_num_id']
train_vendor_id = Train['id']
train_target = Train['target']
val_customer_id = Val['customer_num_id']
val_vendor_id = Val['id']
val_target = Val['target']
... And run the model:
import numpy as np
from sklearn.metrics import f1_score

epochs = 2
for e in range(epochs):
    print('EPOCH: ', e)
    model.fit([train_customer_id, train_vendor_id], train_target, epochs=1, verbose=1, batch_size=384)

    prediction = model.predict(x=[train_customer_id, train_vendor_id], verbose=1, batch_size=384)
    train_f1 = f1_score(y_true=train_target.astype('float32'), y_pred=prediction.round())
    print('TRAIN F1: ', train_f1)

    val_prediction = model.predict(x=[val_customer_id, val_vendor_id], verbose=1, batch_size=384)
    val_f1 = f1_score(y_true=val_target.astype('float32'), y_pred=val_prediction.round())
    print('VAL F1: ', val_f1)
EPOCH: 0
1468530/1468530 [==============================] - 19s 13us/step - loss: 0.0891
TRAIN F1: 0.1537511577647422
VAL F1: 0.09745762711864409
EPOCH: 1
1468530/1468530 [==============================] - 19s 13us/step - loss: 0.0691
TRAIN F1: 0.308748569645272
VAL F1: 0.2076433121019108
The validation F1 score seems to be improving over time, and the model predicts both 1s and 0s:
prediction = model.predict(x=[val_customer_id, val_vendor_id], verbose=1, batch_size=384)
np.unique(prediction.round())
array([0., 1.], dtype=float32)
But when I try to predict on the test set, the model predicts 0 for every row:
prediction = model.predict(x=[test_dataset['customer_num_id'], test_dataset['id']], verbose=1, batch_size=384)
np.unique(prediction.round())
array([0.], dtype=float32)
The test dataset looks similar to the training and validation sets, and it was left out during training just like the validation set, yet the model can't output anything other than 0.
Here's what test dataset looks like:
test_dataset
         customer_id   id  customer_num_id
0            Z59FTQD  243             7418
1            0JP29SK  243             7419
...              ...  ...              ...
1671995      L9G4OFV  907            17414
1671996      L9G4OFV  907            17414
1671997      FDZFYBA  907            17415
Does anyone know what might be the issue here?
EDIT: made dataset text more readable

Please take a look at the distribution of your data. In the sample data you've shown, the target is all 0s. Consider that if most users don't select the product, then a model that always predicts 0 will be right most of the time. So it could be improving its accuracy by over-fitting to the majority class (0).
You can mitigate this over-fitting by adjusting parameters such as the learning rate, or by changing the model architecture, for example by adding dropout layers.
Also, I'm not sure what your model looks like, but you're only training for 2 epochs, so it may not have had enough time to generalize; depending on how deep your model is, it could need a lot more training time.
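Since the model itself isn't shown, here is a minimal sketch (not the asker's actual architecture; the embedding sizes, layer widths, dropout rate, and class-weight ratio are all assumptions) of what dropout plus class weighting for the imbalanced target could look like in Keras:
from keras.models import Model
from keras.layers import Input, Embedding, Flatten, Dense, Dropout, concatenate

# Hypothetical embedding model over the two id inputs; use the real cardinalities of your id columns
customer_in = Input(shape=(1,))
vendor_in = Input(shape=(1,))
c = Flatten()(Embedding(input_dim=20000, output_dim=16)(customer_in))
v = Flatten()(Embedding(input_dim=1000, output_dim=16)(vendor_in))
x = concatenate([c, v])
x = Dense(64, activation='relu')(x)
x = Dropout(0.3)(x)  # dropout layer to reduce over-fitting
out = Dense(1, activation='sigmoid')(x)

model = Model([customer_in, vendor_in], out)
model.compile(optimizer='adam', loss='binary_crossentropy')

# Up-weight the rare positive class; compute the actual ratio from your target column
model.fit([train_customer_id, train_vendor_id], train_target,
          epochs=10, batch_size=384, class_weight={0: 1.0, 1: 10.0})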

Related

XGBoost Survival Model

I'm trying to develop an XGBoost Survival model. Here is a quick snap of my code:
import xgboost as xgb
from sklearn.model_selection import train_test_split

X = df_High_School[['Gender', 'Lived_both_Parents', 'Moth_Born_in_Canada', 'Father_Born_in_Canada',
                    'Born_in_Canada', 'Aboriginal', 'Visible_Minority']]  # covariates
y = df_High_School[['time_to_event', 'event']]  # time to event and event indicator

# split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# develop the model and fit it (this is the call that raises the error)
model = xgb.XGBRegressor(objective='survival:cox')
model.fit(X_train, y_train)
It's giving me the following error:
ValueError Traceback (most recent call last)
in
18
19 # fit the model to the training data
---> 20 model.fit(X_train, y_train)
21
22 # make predictions on the test set
2 frames
/usr/local/lib/python3.8/dist-packages/xgboost/core.py in _maybe_pandas_label(label)
261 if isinstance(label, DataFrame):
262 if len(label.columns) > 1:
--> 263 raise ValueError('DataFrame for label cannot have multiple columns')
264
265 label_dtypes = label.dtypes
ValueError: DataFrame for label cannot have multiple columns
As this is a survival model, I need two columns to indicate the event and the time_to_event. I also tried converting the DataFrames to NumPy arrays, but that didn't work either.
Any clue? Thanks!
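One note that may help: xgboost's survival:cox objective expects a single label column in which right-censored rows carry a negative survival time, so the two columns can be folded into one before fitting. A minimal sketch (assuming event == 1 marks an observed event, which is my assumption, not stated in the question):
import numpy as np

# Fold (time_to_event, event) into a single label: positive time = event observed,
# negative time = right-censored
time = df_High_School['time_to_event'].to_numpy(dtype=float)
event = df_High_School['event'].to_numpy()
y_cox = np.where(event == 1, time, -time)

X_train, X_test, y_train, y_test = train_test_split(X, y_cox, test_size=0.2)

model = xgb.XGBRegressor(objective='survival:cox')
model.fit(X_train, y_train)      # single-column label, so no ValueError
risk = model.predict(X_test)     # predictions are on the hazard-ratio scale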

Splitting MNIST dataset vs CSV dataset

I am trying to use a custom dataset in here (instead of MNIST) and my dataset looks like this:
age  gender  genre (output)
---  ------  --------------
20   1       HipHop
26   1       Jazz
31   0       Classical
20   0       Dance
Previously, I used this method for splitting:
from sklearn.model_selection import train_test_split

X = df.drop(columns=["genre"])
y = df["genre"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
I am struggling with how to split this CSV dataset in the same way the MNIST dataset is split in that example.
I would appreciate your help.
Thanks
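For what it's worth, a minimal sketch of one way to turn a CSV like this into the same four train/test arrays that the MNIST loader returns (the file name music.csv is a placeholder):
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("music.csv")  # hypothetical file name

X = df.drop(columns=["genre"]).to_numpy()
y = df["genre"].to_numpy()

# These four arrays play the same roles as the tuples returned by
# keras.datasets.mnist.load_data(): (x_train, y_train), (x_test, y_test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)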

Value Error : Only 2 class/es in training fold, but 1 in overall dataset. This is not supported for decision_function with imbalanced folds

I am learning machine learning and creating my first model on the MNIST data set.
Can anyone help me over here? I have tried Stratified Fold, kfold and other methods to resolve this issue.
Pandas Version '0.25.1', Python Version 3.7, using Anaconda Distribution.
from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(mnist, test_size=0.2, random_state=29)

X_train, y_train = train_set.drop('label', axis=1), train_set[['label']]
X_test, y_test = test_set.drop('label', axis=1), test_set[['label']]
y_train_5 = (y_train == 5)  # True for all 5's and False otherwise
y_test_5 = (y_test == 5)

from sklearn.linear_model import SGDClassifier
sgd_clf = SGDClassifier(random_state=29)
sgd_clf.fit(X_train, y_train_5)

from sklearn.model_selection import cross_val_predict
print(X_train.shape)
print(y_train_5.shape)
cross_val_predict(sgd_clf, X_train, y_train_5, cv=3, method="decision_function")
Last line of the code block gives an error:
RuntimeWarning: Number of classes in training fold (2) does not match total number of classes (1). Results may not be appropriate for your use case. To fix this, use a cross-validation technique resulting in properly stratified folds
RuntimeWarning)
ValueError Traceback (most recent call last)
<ipython-input-39-da1ad024473a> in <module>
3 print(X_train.shape)
4 print(y_train_5.shape)
----> 5 cross_val_predict(sgd_clf, X_train, y_train_5, cv=3, method="decision_function")
~\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py in cross_val_predict(estimator, X, y, groups, cv, n_jobs, verbose, fit_params, pre_dispatch, method)
787 prediction_blocks = parallel(delayed(_fit_and_predict)(
788 clone(estimator), X, y, train, test, verbose, fit_params, method)
--> 789 for train, test in cv.split(X, y, groups))
790
791 # Concatenate the predictions
~\AppData\Local\Continuum\anaconda3\lib\site-packages\joblib\parallel.py in __call__(self, iterable)
919 # remaining jobs.
920 self._iterating = False
--> 921 if self.dispatch_one_batch(iterator):
922 self._iterating = self._original_iterator is not None
923
~\AppData\Local\Continuum\anaconda3\lib\site-packages\joblib\parallel.py in dispatch_one_batch(self, iterator)
757 return False
758 else:
--> 759 self._dispatch(tasks)
760 return True
761
~\AppData\Local\Continuum\anaconda3\lib\site-packages\joblib\parallel.py in _dispatch(self, batch)
714 with self._lock:
715 job_idx = len(self._jobs)
--> 716 job = self._backend.apply_async(batch, callback=cb)
717 # A job can complete so quickly than its callback is
718 # called before we get here, causing self._jobs to
~\AppData\Local\Continuum\anaconda3\lib\site-packages\joblib\_parallel_backends.py in apply_async(self, func, callback)
180 def apply_async(self, func, callback=None):
181 """Schedule a func to be run"""
--> 182 result = ImmediateResult(func)
183 if callback:
184 callback(result)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\joblib\_parallel_backends.py in __init__(self, batch)
547 # Don't delay the application, to avoid keeping the input
548 # arguments in memory
--> 549 self.results = batch()
550
551 def get(self):
~\AppData\Local\Continuum\anaconda3\lib\site-packages\joblib\parallel.py in __call__(self)
223 with parallel_backend(self._backend, n_jobs=self._n_jobs):
224 return [func(*args, **kwargs)
--> 225 for func, args, kwargs in self.items]
226
227 def __len__(self):
~\AppData\Local\Continuum\anaconda3\lib\site-packages\joblib\parallel.py in <listcomp>(.0)
223 with parallel_backend(self._backend, n_jobs=self._n_jobs):
224 return [func(*args, **kwargs)
--> 225 for func, args, kwargs in self.items]
226
227 def __len__(self):
~\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py in _fit_and_predict(estimator, X, y, train, test, verbose, fit_params, method)
887 n_classes = len(set(y)) if y.ndim == 1 else y.shape[1]
888 predictions = _enforce_prediction_order(
--> 889 estimator.classes_, predictions, n_classes, method)
890 return predictions, test
891
~\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py in _enforce_prediction_order(classes, predictions, n_classes, method)
933 'is not supported for decision_function '
934 'with imbalanced folds. {}'.format(
--> 935 len(classes), n_classes, recommendation))
936
937 float_min = np.finfo(predictions.dtype).min
ValueError: Only 2 class/es in training fold, but 1 in overall dataset. This is not supported for decision_function with imbalanced folds. To fix this, use a cross-validation technique resulting in properly stratified folds
I ran into a similar problem, and on further investigation found a warning message in the error log:
DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
There are two ways to solve this:
Use the hint in the warning message and change your code as:
cross_val_predict(sgd_clf, X_train, y_train_5.values.ravel(), cv=3,
                  method="decision_function")
(refer to the answer here)
Also, using the hint from "A column-vector y was passed when a 1d array was expected", I realized my mistake and did the following.
Note the line in your error log: "Number of classes in training fold (2) does not match total number of classes (1)".
I assume y_train_5 here is a DataFrame (you are probably working your way through Aurélien Géron's book).
The expected type for y_train_5 is an array-like object of shape (n,), i.e. one-dimensional, but a DataFrame is 2-dimensional, in your case (n, 1).
All you need to do is pass the Series object for your column vector, for example:
y_train_5.iloc[:,0] (I prefer this)
y_train_5.{COLUMN_NAME} (another variant)
Try running the following in your console:
> y_train_5.iloc[:,0].shape
(n,)
cross_val_predict(sgd_clf, X_train, y_train_5.iloc[:,0], cv=3,
                  method="decision_function")

Autoencoder not learning identity function

I'm somewhat new to machine learning in general, and I wanted to run a simple experiment to get more familiar with neural network autoencoders: make an extremely basic autoencoder that learns the identity function.
I'm using Keras to make life easier, so I did this first to make sure it works:
import numpy as np
from keras.models import Sequential
from keras.layers import Dense

# Weights are given as [weights, biases], so we give
# the identity matrix for the weights and a vector of zeros for the biases
weights = [np.diag(np.ones(84)), np.zeros(84)]

# X is the training data described below, with 84 features per sample
model = Sequential([Dense(84, input_dim=84, weights=weights)])
model.compile(optimizer='sgd', loss='mean_squared_error')
model.fit(X, X, nb_epoch=10, batch_size=8, validation_split=0.3)
As expected, the loss is zero, both in train and validation data:
Epoch 1/10
97535/97535 [==============================] - 27s - loss: 0.0000e+00 - val_loss: 0.0000e+00
Epoch 2/10
97535/97535 [==============================] - 28s - loss: 0.0000e+00 - val_loss: 0.0000e+00
Then I tried to do the same but without initializing the weights to the identity function, expecting that after a while of training it would learn it. It didn't. I've let it run for 200 epochs several times in different configurations, playing with different optimizers and loss functions, and adding L1 and L2 activity regularizers. The results vary, but the best I've got is still really bad: the output looks nothing like the original data, it's just roughly in the same numeric range.
The data is simply some numbers oscillating around 1.1. I don't know if an activation layer makes sense for this problem; should I be using one?
If this "neural network" of one layer can't learn something as simple as the identity function, how can I expect it to learn anything more complex? What am I doing wrong?
EDIT
To have better context, here's a way to generate a dataset very similar to the one I'm using:
X = np.random.normal(1.1090579, 0.0012380764, (139336, 84))
I'm suspecting that the variations between the values might be too small. The loss function ends up having decent values (around 1e-6), but it's not enough precision for the result to have a similar shape to the original data. Maybe I should scale/normalize it somehow? Thanks for any advice!
UPDATE
In the end, as was suggested, the issue was that the dataset had too little variation between the 84 values, so the resulting prediction was actually pretty good in absolute terms (loss function), but compared to the original data the variations were far off. I solved it by normalizing the 84 values in each sample around the sample's mean and dividing by the sample's standard deviation, then using the original mean and standard deviation to denormalize the predictions at the other end. This could be done in a few different ways, but I did it by adding the normalization/denormalization into the model itself with some Lambda layers that operate on the tensors. That way all the data processing is incorporated into the model, which makes it nicer to work with. Let me know if you would like to see the actual code.
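For reference, a minimal sketch of the normalize/denormalize idea described above, using Lambda layers in the Keras 2 functional API (layer sizes and optimizer are assumptions, not the asker's exact code):
import keras.backend as K
from keras.models import Model
from keras.layers import Input, Dense, Lambda

def normalize(x):
    # Per-sample standardization: subtract the sample mean, divide by the sample std
    mean = K.mean(x, axis=1, keepdims=True)
    std = K.std(x, axis=1, keepdims=True)
    return (x - mean) / (std + K.epsilon())

def denormalize(tensors):
    # Restore the original scale using the statistics of the raw input sample
    y, original = tensors
    mean = K.mean(original, axis=1, keepdims=True)
    std = K.std(original, axis=1, keepdims=True)
    return y * (std + K.epsilon()) + mean

inp = Input(shape=(84,))
scaled = Lambda(normalize)(inp)
hidden = Dense(84)(scaled)
out = Lambda(denormalize)([hidden, inp])

model = Model(inp, out)
model.compile(optimizer='sgd', loss='mean_squared_error')
model.fit(X, X, epochs=10, batch_size=8, validation_split=0.3)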
I believe the problem could be either the number of epochs or the way you initialize X.
I ran your code with an X of my own for 100 epochs and printed the argmax() and max values of the weights; it gets really close to the identity function.
I'm adding the code snippet that I used
from keras.models import Sequential
from keras.layers import Dense
import numpy as np
import random
import pandas as pd
X = np.array([[random.random() for r in xrange(84)] for i in xrange(1,100000)])
model = Sequential([Dense(84, input_dim=84)], name="layer1")
model.compile(optimizer='sgd', loss='mean_squared_error')
model.fit(X, X, nb_epoch=100, batch_size=80, validation_split=0.3)
l_weights = np.round(model.layers[0].get_weights()[0],3)
print l_weights.argmax(axis=0)
print l_weights.max(axis=0)
And I'm getting:
Train on 69999 samples, validate on 30000 samples
Epoch 1/100
69999/69999 [==============================] - 1s - loss: 0.2092 - val_loss: 0.1564
Epoch 2/100
69999/69999 [==============================] - 1s - loss: 0.1536 - val_loss: 0.1510
Epoch 3/100
69999/69999 [==============================] - 1s - loss: 0.1484 - val_loss: 0.1459
.
.
.
Epoch 98/100
69999/69999 [==============================] - 1s - loss: 0.0055 - val_loss: 0.0054
Epoch 99/100
69999/69999 [==============================] - 1s - loss: 0.0053 - val_loss: 0.0053
Epoch 100/100
69999/69999 [==============================] - 1s - loss: 0.0051 - val_loss: 0.0051
[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83]
[ 0.85000002 0.85100001 0.79799998 0.80500001 0.82700002 0.81900001
0.792 0.829 0.81099999 0.80800003 0.84899998 0.829 0.852
0.79500002 0.84100002 0.81099999 0.792 0.80800003 0.85399997
0.82999998 0.85100001 0.84500003 0.847 0.79699999 0.81400001
0.84100002 0.81 0.85100001 0.80599999 0.84500003 0.824
0.81999999 0.82999998 0.79100001 0.81199998 0.829 0.85600001
0.84100002 0.792 0.847 0.82499999 0.84500003 0.796
0.82099998 0.81900001 0.84200001 0.83999997 0.815 0.79500002
0.85100001 0.83700001 0.85000002 0.79900002 0.84100002 0.79699999
0.838 0.847 0.84899998 0.83700001 0.80299997 0.85399997
0.84500003 0.83399999 0.83200002 0.80900002 0.85500002 0.83899999
0.79900002 0.83399999 0.81 0.79100001 0.81800002 0.82200003
0.79100001 0.83700001 0.83600003 0.824 0.829 0.82800001
0.83700001 0.85799998 0.81999999 0.84299999 0.83999997]
When I used only 5 numbers as an input and printed the actual weights I got this:
array([[ 1., 0., -0., 0., 0.],
[ 0., 1., 0., -0., -0.],
[-0., 0., 1., 0., 0.],
[ 0., -0., 0., 1., -0.],
[ 0., -0., 0., -0., 1.]], dtype=float32)

Decision Trees (Random Forest and Random Tree) classification on a small data set. Something wrong?

I performed classification on a small data set (65x9) using decision trees (Random Forest and Random Tree). I have four classes, 8 attributes, and 65 instances.
My application is in assistive robotics. I'm extracting parameters from my sensor data that I think are relevant for classifying a user's run while they perform some task. I get the movement data from the sensor package deployed on the wheelchair. I classify certain actions, such as turning 180 degrees, and give the user a mark (from 1 to 4). From the sensor package and the software I extracted parameters such as velocity, distance, time, standard deviation of the velocity, etc., that are relevant for classifying the user's run. So my data are all numbers.
When I ran the decision tree classification I got these results:
=== Classifier model (full training set) ===
Random forest of 10 trees, each constructed while considering 4 random features.
Out of bag error: 0.5231
Time taken to build model: 0.01 seconds
=== Evaluation on training set ===
=== Summary ===
Correctly Classified Instances 64 98.4615 %
Incorrectly Classified Instances 1 1.5385 %
Kappa statistic 0.9791
Mean absolute error 0.0715
Root mean squared error 0.1243
Relative absolute error 19.4396 %
Root relative squared error 29.0038 %
Total Number of Instances 65
=== Detailed Accuracy By Class ===
                 TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
                 1        0        1          1       1          1         c1
                 1        0        1          1       1          1         c2
                 0.952    0        1          0.952   0.976      1         c3
                 1        0.019    0.917      1       0.957      1         c4
Weighted Avg.    0.985    0.003    0.986      0.985   0.985      1
=== Confusion Matrix ===
a b c d <-- classified as
14 0 0 0 | a = c1
0 19 0 0 | b = c2
0 0 20 1 | c = c3
0 0 0 11 | d = c4
This is too good. Am I doing something wrong?
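For comparison, a minimal sketch (scikit-learn rather than Weka, with placeholder random data of the same shape) of estimating accuracy with stratified cross-validation instead of evaluating on the training set; on 65 instances the cross-validated estimate is usually much closer to the out-of-bag error reported above than to the 98% training-set accuracy:
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Placeholder data standing in for the real 65x8 feature matrix and 4-class labels
rng = np.random.default_rng(0)
X = rng.normal(size=(65, 8))
y = rng.choice(["c1", "c2", "c3", "c4"], size=65)

clf = RandomForestClassifier(n_estimators=10, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(clf, X, y, cv=cv)
print("Cross-validated accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))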
