Binary Classification Problem: How to Proceed With Severe Data Imbalance? - machine-learning

The Problem
After pre-processing a raw dataset, I obtained a clean but severely imbalanced dataset: 341 observations with label 1 and only 3 observations with label 0 (more details about the dataset at the bottom).
Dataset shape: (344, 1500)
Dataset class label distribution: Counter({1: 341, 0: 3})
What can I do to proceed with this dataset for classification?
What I have tried:
Split the dataset into train and test sets with a 70:30 ratio, stratified on the class label
Train data shape: (240, 1500)
Train data class label distribution: Counter({1: 238, 0: 2})
Test data shape: (104, 1500)
Test data class label distribution: Counter({1: 103, 0: 1})
Perform oversampling on the train data using SMOTE (Synthetic Minority Oversampling Technique) with k_neighbors set to 1 (see the sketch after these results)
After SMOTE:
Train data shape: (476, 1500)
Train data class label distribution: Counter({1: 238, 0: 238})
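For reference, a minimal sketch of this split-then-oversample pipeline, assuming the features are in X and the labels in y (variable names and the random seed are mine):

from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

# stratified 70:30 split so both classes appear on each side
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# SMOTE requires k_neighbors < number of minority samples in the train set,
# hence k_neighbors=1 when only 2 minority observations remain after the split
smote = SMOTE(k_neighbors=1, random_state=42)
X_train_res, y_train_res = smote.fit_resample(X_train, y_train)
print(Counter(y_train_res))  # Counter({1: 238, 0: 238})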
I plan to train a classifier using the oversampled train data and use the test data to get the classification result.
But does this make sense? In my opinion it does not, because:
The oversampled train data is likely to overfit the model, since the many synthetic class-0 observations were generated from only 2 real observations.
The minority class has only 1 observation out of the 104 test samples, so the classifier will get high accuracy just by always predicting the majority class. (Initially I planned to perform SMOTE on the test data too, but I read that oversampling techniques should only be applied to the train data.)
I am really stuck here and could not find any relevant information on this problem.
A brief summary of the acquired multi-omics dataset:
The raw lung cancer (LUSC) dataset was obtained from http://acgt.cs.tau.ac.il/multi_omic_benchmark/download.html. It consists of 3 omics data types (gene expression, DNA methylation & miRNA expression) plus 1 clinical dataset. The clinical dataset contains the binary class label sample_type (along with other, unimportant attributes) for the samples in the 3 omics data types.
The aim is to obtain a multi-omics dataset by combining the 3 omics data types.
To obtain the multi-omics data, the 3 omics data types were concatenated with the clinical data (with sample_type as the class label) by matching sampleID across all 4 datasets. The end product is a severely imbalanced dataset of 344 observations: 341 with the Primary Tumour label (has cancer, referred to as 1) and 3 with the Solid Tissue Normal label (no cancer, referred to as 0).
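For illustration only, a rough sketch of that concatenation step, assuming each table is loaded as a pandas DataFrame keyed by a sampleID column (the file names and column names below are placeholders, not the benchmark's actual ones):

import pandas as pd

# hypothetical file names; one table per omics type plus the clinical table
gene = pd.read_csv("gene_expression.csv")    # sampleID + gene expression features
meth = pd.read_csv("dna_methylation.csv")    # sampleID + methylation features
mirna = pd.read_csv("mirna_expression.csv")  # sampleID + miRNA features
clinical = pd.read_csv("clinical.csv")[["sampleID", "sample_type"]]

# inner join on sampleID keeps only samples present in all 4 datasets
multi_omics = (gene.merge(meth, on="sampleID")
                   .merge(mirna, on="sampleID")
                   .merge(clinical, on="sampleID"))

# Primary Tumour -> 1, Solid Tissue Normal -> 0
multi_omics["label"] = (multi_omics["sample_type"] == "Primary Tumour").astype(int)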

This is more of a statistics question. In my opinion, you should not try to estimate anything on these data: with only 3 negative observations, you cannot learn what sets the 0s apart. Even for a simple logistic regression, I'd recommend having at least 30-40 observations (ideally more).
The simplest estimator based on your data would be to guess 1 every time. That already gives about 99 percent accuracy, and you cannot expect to beat it with any complex model.
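To make that baseline concrete, a small sketch (assuming X and y hold the 344 observations and labels):

from sklearn.dummy import DummyClassifier

baseline = DummyClassifier(strategy="most_frequent")  # always predicts the majority class, 1
baseline.fit(X, y)
print(baseline.score(X, y))  # 341 / 344 ≈ 0.991 accuracy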

Related

Calculating Probability of a Classification Model Prediction

I have a classification task. The training data has 50 different labels. The customer wants to differentiate the low-probability predictions, meaning that I have to classify some test data as Unclassified / Other depending on the probability (certainty) of the model.
When I test my code, the prediction result is a numpy array (I'm using different models; this one is a pre-trained BertTransformer). The prediction array doesn't contain probabilities like those returned by the Keras predict_proba() method; these are the numbers generated by the prediction method of the pretrained BertTransformer model.
[[-1.7862008 -0.7037363 0.09885322 1.5318055 2.1137428 -0.2216074
0.18905772 -0.32575375 1.0748093 -0.06001111 0.01083148 0.47495762
0.27160102 0.13852511 -0.68440574 0.6773654 -2.2712054 -0.2864312
-0.8428862 -2.1132915 -1.0157436 -1.0340284 -0.35126117 -1.0333195
9.149789 -0.21288703 0.11455813 -0.32903734 0.10503325 -0.3004114
-1.3854568 -0.01692022 -0.4388664 -0.42163098 -0.09182278 -0.28269592
-0.33082992 -1.147654 -0.6703184 0.33038092 -0.50087476 1.1643585
0.96983343 1.3400391 1.0692116 -0.7623776 -0.6083422 -0.91371405
0.10002492]]
I'm using numpy.argmax() to identify the correct label. The prediction works just fine. However, since these are not probabilities, I cannot compare the best result with a threshold value.
My question is, how can I define a threshold (say, 0.6), and then compare the probability of the argmax() element of the BertTransformer prediction array so that I can classify the prediction as "Other" if the probability is less than the threshold value?
Edit 1:
We are using 2 different models. One is Keras, and the other is BertTransformer. We have no problem with Keras since it gives the probabilities, so I'm skipping the Keras model.
The Bert model is pretrained. Here is how it is generated:
def model(self, data):
    number_of_categories = len(data['encoded_categories'].unique())
    model = BertForSequenceClassification.from_pretrained(
        "dbmdz/bert-base-turkish-128k-uncased",
        num_labels=number_of_categories,
        output_attentions=False,
        output_hidden_states=False,
    )
    # model.cuda()
    return model
The output given above is the result of the model.predict() method. We compare both models; Bert is slightly ahead, so we know that the prediction works just fine. However, we are not sure what those numbers signify or represent.
Here is the Bert documentation.
BertForSequenceClassification returns logits, i.e., the classification scores before normalization. You can normalize the scores by calling F.softmax(output, dim=-1) where torch.nn.functional was imported as F.
With thousands of labels, the normalization can be costly and you do not need it when you are only interested in argmax. This is probably why the models return the raw scores only.
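A minimal sketch of that thresholding step, assuming output is the logits array shown in the question and 0.6 is the example cut-off:

import torch
import torch.nn.functional as F

logits = torch.as_tensor(output)           # shape (1, num_labels), raw scores
probs = F.softmax(logits, dim=-1)          # normalize so the scores sum to 1
confidence, predicted = probs.max(dim=-1)  # highest probability and its label index

threshold = 0.6
label = predicted.item() if confidence.item() >= threshold else "Other"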

Time Series Classification Problem: Training EEG dataset with Sliding Window and Majority Vote

I'm working on a classification problem (emotion recognition through EEG data) with a custom dataset (not publicly available, for now) in PyTorch.
EEG data are samples of shape (C, T), with C being the number of channels (electrodes) and T the number of time-points (variable-length samples, with T roughly in the range [9000, 120000] time points). Each sample has an emotion label for emotion classification.
I'm trying to train a dataset with 400 samples. For each epoch, from each sample a random trial of shape (C, 1280) is selected with the given label.
Since the time-points used are only about 1/10 of the whole sample, I was thinking of creating a trainer that, for each epoch and each sample:
gets N trials of size (C, 1280), sliding a window of 1280 time-points over the whole sample;
gets N labels, one for each trial;
assigns to the sample the label that got the majority among the N labels.
The solution I'm currently considering is to:
create N datasets, each one sampling the EEG data at a fixed window offset;
train a model simultaneously on the N datasets (using a SequentialSampler, so that simultaneous batches are formed from the same sample);
calculate the cross-entropy using only the label that gave the majority among the N labels.
I'm not very happy with this solution: is it really necessary to create N Datasets?
Could, for example, a DataLoader in PyTorch fetch the sample N times with a sliding window instead?
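A single Dataset can already slice all the windows out of one sample, so N separate datasets should not be necessary. A rough sketch under the assumption that each EEG sample is a tensor of shape (C, T) paired with an integer label (class and variable names are mine):

import torch
from torch.utils.data import Dataset, DataLoader

class SlidingWindowEEG(Dataset):
    """Returns all sliding windows of one sample, so a majority vote can be taken per sample."""
    def __init__(self, samples, labels, window=1280, stride=1280):
        self.samples = samples  # list of tensors, each of shape (C, T)
        self.labels = labels    # list of integer emotion labels
        self.window = window
        self.stride = stride

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        x = self.samples[idx]                            # (C, T)
        windows = x.unfold(1, self.window, self.stride)  # (C, N, window)
        windows = windows.permute(1, 0, 2).contiguous()  # (N, C, window)
        return windows, self.labels[idx]

# batch_size=1 because N varies with the sample length T
# loader = DataLoader(SlidingWindowEEG(samples, labels), batch_size=1, shuffle=True)
# per-window predictions can then be aggregated with e.g. torch.mode(per_window_preds).values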

Linear Regression test data violating training data. Please explain where I went wrong

This is part of a dataset containing 1000 entries of house rental prices at different locations.
After training the model, if I send the same training data as test data, I get incorrect results. How is this even possible?
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X_loc = df[['area', 'rooms', 'location']]  # select the feature columns with a list, not a set
y_loc = df['price']
X_train, X_test, y_train, y_test = train_test_split(X_loc, y_loc, test_size=1/3, random_state=0)
regressor = LinearRegression()
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_train[0:1])  # predict on the first row of the training set
DATASET:
   price  rooms  area  location
0  22000      3  1339       140
1  45000      3  1580        72
3  72000      3  2310        72
4  40000      3  1800        41
5  35000      3  2100        57
The expected output (y_pred) should be 22000, but it shows about 29000. How can the model contradict input it was already trained on?
What you observed is exactly what is referred to as the "training error". Machine learning models are meant to find the "best" fit, the one that minimizes the total error across all data points, not the error at every individual data point.
22000 is not very far from 29000, although it is not the exact number. This is because linear regression tries to compress all the variation in your data onto one straight line.
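A tiny illustration of that point on toy data (not the asker's dataset): even at its own training points, an ordinary least-squares fit reproduces only the best overall line, not the observed values:

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = np.arange(20).reshape(-1, 1)
y = 3 * X.ravel() + rng.normal(scale=5, size=20)  # noisy linear relationship

reg = LinearRegression().fit(X, y)
print(y[0], reg.predict(X[0:1])[0])  # observed vs fitted value at a training point: they differ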
Possibly the underlying relationship is nonlinear, so applying a linear regression yields poor results. There are other reasons why a linear regression may fail, cf. https://stats.stackexchange.com/questions/393706/bad-linear-regression-results
Nonlinear data often appears when there are (statistical) interactions between features.
A generalization of linear regression is the Generalized Linear Model (GLM), which is able to handle nonlinearities through its link functions: https://en.wikipedia.org/wiki/Generalized_linear_model
In scikit-learn you can use Support Vector Regression with a polynomial or RBF kernel for a nonlinear model: https://scikit-learn.org/stable/auto_examples/svm/plot_svm_regression.html
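For example, a small sketch of an RBF-kernel SVR in scikit-learn, reusing the X_train/y_train split from the question (the C value is illustrative and should be tuned):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# feature scaling matters for kernel methods; C and gamma should be tuned, e.g. with GridSearchCV
svr = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0, gamma="scale"))
svr.fit(X_train, y_train)
y_pred = svr.predict(X_test)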
An alternative approach is to analyse the data for interactions and apply the methods described in https://en.wikipedia.org/wiki/Generalized_linear_model#Correlated_or_clustered_data, although this is complex. You could also try ridge regression under this assumption, because it can handle multicollinearity, which is one form of statistical interaction: https://ncss-wpengine.netdna-ssl.com/wp-content/themes/ncss/pdf/Procedures/NCSS/Ridge_Regression.pdf
https://statisticsbyjim.com/regression/difference-between-linear-nonlinear-regression-models/

Non-linear multivariate time-series response prediction using RNN

I am trying to predict the hygrothermal response of a wall, given the interior and exterior climate. Based on literature research, I believe this should be possible with RNN but I have not been able to get good accuracy.
The dataset has 12 input features (time-series of exterior and interior climate data) and 10 output features (time-series of hygrothermal response), both containing hourly values for 10 years. This data was created with hygrothermal simulation software; there is no missing data.
[Lists of the dataset features and targets omitted here.]
Unlike most time-series prediction problems, I want to predict the response for the full length of the input features time-series at each time-step, rather than the subsequent values of a time-series (eg financial time-series prediction). I have not been able to find similar prediction problems (in similar or other fields), so if you know of one, references are very welcome.
I think this should be possible with RNN, so I am currently using LSTM from Keras. Before training, I preprocess my data the following way:
Discard the first year of data, as the first time steps of the hygrothermal response of the wall are influenced by the initial temperature and relative humidity.
Split into training and testing set. Training set contains the first 8 years of data, the test set contains the remaining 2 years.
Normalise the training set (zero mean, unit variance) using StandardScaler from Sklearn. Normalise the test set analogously using the mean and variance from the training set.
This results in: X_train.shape = (1, 61320, 12), y_train.shape = (1, 61320, 10), X_test.shape = (1, 17520, 12), y_test.shape = (1, 17520, 10)
As these are long time-series, I use stateful LSTM and cut the time-series as explained here, using the stateful_cut() function. I only have 1 sample, so batch_size is 1. For T_after_cut I have tried 24 and 120 (24*5); 24 appears to give better results. This results in X_train.shape = (2555, 24, 12), y_train.shape = (2555, 24, 10), X_test.shape = (730, 24, 12), y_test.shape = (730, 24, 10).
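For reference, a simplified sketch of that scaling and cutting step (not the stateful_cut() function from the linked post), assuming the array shapes listed above and T_after_cut = 24:

from sklearn.preprocessing import StandardScaler

features, targets, T_after_cut = 12, 10, 24

# fit the scaler on the training inputs only, then apply it to the test inputs
scaler = StandardScaler().fit(X_train.reshape(-1, features))
X_train_s = scaler.transform(X_train.reshape(-1, features))
X_test_s = scaler.transform(X_test.reshape(-1, features))

# cut the single long sequence into consecutive chunks of T_after_cut time steps
X_train_cut = X_train_s.reshape(-1, T_after_cut, features)  # (2555, 24, 12)
X_test_cut = X_test_s.reshape(-1, T_after_cut, features)    # (730, 24, 12)
# y_train / y_test would be cut the same way into (-1, T_after_cut, targets)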
Next, I build and train the LSTM model as follows:
from keras.models import Sequential
from keras.layers import LSTM, Dense, TimeDistributed
from keras.optimizers import Adam

model = Sequential()
model.add(LSTM(128,
               batch_input_shape=(batch_size, T_after_cut, features),
               return_sequences=True,
               stateful=True,
               ))
model.add(TimeDistributed(Dense(targets)))
model.compile(loss='mean_squared_error', optimizer=Adam())
model.fit(X_train, y_train, epochs=100, batch_size=batch_size, verbose=2, shuffle=False)
Unfortunately, I don't get accurate prediction results, not even on the training set, so the model has high bias.
[Figure: prediction results of the LSTM model for all targets]
How can I improve my model? I have already tried the following:
Not discarding the first year of the dataset -> no significant difference
Differentiating the input features time-series (subtract previous value from current value) -> slightly worse results
Up to four stacked LSTM layers, all with the same hyperparameters -> no significant difference in results but longer training time
Dropout layer after LSTM layer (though this is usually used to reduce variance and my model has high bias) -> slightly better results, but difference might not be statistically significant
Am I doing something wrong with the stateful LSTM? Do I need to try different RNN models? Should I preprocess the data differently?
Furthermore, training is very slow: about 4 hours for the model above. Hence I am reluctant to do an extensive hyperparameter grid search...
In the end, I managed to solve this the following way:
Using more samples to train instead of only 1 (I used 18 samples to train and 6 to test)
Keep the first year of data, as the output time-series for all samples have the same 'starting point' and the model needs this information to learn
Standardise both input and output features (zero mean, unit variance). I found this improved prediction accuracy and training speed
Use stateful LSTM as described here, but add reset states after epoch (see below for code). I used batch_size = 6 and T_after_cut = 1460. If T_after_cut is longer, training is slower; if T_after_cut is shorter, accuracy decreases slightly. If more samples are available, I think using a larger batch_size will be faster.
Use CuDNNLSTM instead of LSTM; this sped up the training time by about 4x!
I found that more units resulted in higher accuracy and faster convergence (shorter training time). Also, I found that a GRU is as accurate as the LSTM, though it converged faster for the same number of units.
Monitor validation loss during training and use early stopping
The LSTM model is built and trained as follows:
from keras import layers
from keras.callbacks import Callback, EarlyStopping
from keras.models import Sequential
from keras.optimizers import RMSprop

def define_reset_states_batch(nb_cuts):
    class ResetStatesCallback(Callback):
        def __init__(self):
            self.counter = 0

        def on_batch_begin(self, batch, logs={}):
            # reset states when nb_cuts batches are completed
            if self.counter % nb_cuts == 0:
                self.model.reset_states()
            self.counter += 1

        def on_epoch_end(self, epoch, logs={}):
            # reset states after each epoch
            self.model.reset_states()

    return ResetStatesCallback

model = Sequential()
model.add(layers.CuDNNLSTM(256, batch_input_shape=(batch_size, T_after_cut, features),
                           return_sequences=True,
                           stateful=True))
model.add(layers.TimeDistributed(layers.Dense(targets, activation='linear')))

optimizer = RMSprop(lr=0.002)
model.compile(loss='mean_squared_error', optimizer=optimizer)

earlyStopping = EarlyStopping(monitor='val_loss', min_delta=0.005, patience=15, verbose=1, mode='auto')
ResetStatesCallback = define_reset_states_batch(nb_cuts)
model.fit(X_dev, y_dev, epochs=n_epochs, batch_size=n_batch, verbose=1, shuffle=False,
          validation_data=(X_eval, y_eval), callbacks=[ResetStatesCallback(), earlyStopping])
This gave me very satisfying accuracy (R2 over 0.98).
The resulting figure (not shown here) plots the temperature (left) and relative humidity (right) in the wall over 2 years of data not used in training, with the prediction in red and the true output in black. The residuals show that the error is very small and that the LSTM learns to capture the long-term dependencies needed to predict the relative humidity.

The proper way of using IsolationForest to detect outliers of high-dim dataset

I use the following simple IsolationForest algorithm to detect the outliers of a given dataset X of 20K samples and 16 features. I run the following:
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import train_test_split

train_X, test_X, train_y, test_y = train_test_split(X, y, train_size=.8)
clf = IsolationForest()
clf.fit(X)  # Notice I am using the entire dataset X when fitting!!
print(clf.predict(X))
I get the result:
[ 1 1 1 -1 ... 1 1 1 -1 1]
The question is: is it logically correct to fit the IsolationForest on the entire dataset X, or only on train_X?
Yes, it is logically correct to ultimately train on the entire dataset.
With that in mind, you could measure the test set performance against the training set's performance. This could tell you if the test set is from a similar distribution as your training set.
If the test set scores as anomalous compared to the training set, then you can expect future data to look similar. In that case, I would want more data, to have a more complete view of what is 'normal'.
If the test set scores similarly to the training set, I would be more comfortable with the final Isolation Forest trained on all data.
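As a sketch on top of the code in the question, one way to make that comparison is with score_samples, where lower scores mean more anomalous:

import numpy as np

clf = IsolationForest(random_state=0).fit(train_X)
train_scores = clf.score_samples(train_X)
test_scores = clf.score_samples(test_X)

# if the test scores are systematically lower, the test set looks more anomalous than the train set
print(train_scores.mean(), test_scores.mean())
print(np.mean(clf.predict(test_X) == -1))  # fraction of test points flagged as outliers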
Perhaps you could use sklearn TimeSeriesSplit CV in this fashion to get a sense for how much data is enough for your problem?
Since this is unlabeled data to the anomaly detector, the more data the better when defining 'normal'.
