How to handle noisy data to effectively build regression models? - machine-learning

My dataset has 4k rows and 10 columns. The data has many outliers and is not normally distributed. I didn't do any outlier handling or scaling/transformation. I did RFE and selected 5 features for modelling. I got a 0.93 R2 score on both train and test data, but my MSE is very high (60010869006). How do I handle the noisy data while using regression models?
Train MSE : 161428894147.16986
Test MSE : 60010869006.13406
Train MAE : 32656.965643328014
Test MAE : 44556.38750475175
Train R2 : 0.9344080790458971
Test R2 : 0.9382632258022047

Related

Zero predictions on my LSTM model though accuracy is 30%

I am building an LSTM model, with the input shape of:
x_train_text: (3500,80) " I have 3500 examples, and 80 features extracted from WordEmbedding"
y_train_text: (3500,6) "I have 6 classes, unbalanced"
x_validate_text: (1000,80)
y_validate_text: (1000,6)
Now, I trained the model and the overall accuracy was 30%. I am fine with that as I am building a simple LSTM. The result is as follows:
model.fit(x_train_text, y_train_text,
          validation_data=(x_validate_text, y_validate_text),
          epochs=10)
{'model': [['loss', 1.7275227308273315],
['accuracy', 0.24708323180675507],
['val_loss', 1.7259385585784912],
['val_accuracy', 0.2551288902759552]]}
Now, I am trying to do error analysis to see which classes are underfitting. Whenever I run model.predict(x_train_text) I get only zeros, although it is the same training dataset!
Shouldn't this be at least the same as training accuracy overall?
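For that error-analysis step, here is a minimal sketch of comparing per-class predictions on the training set, assuming a softmax output over the 6 classes and one-hot encoded labels (both assumptions about this model). Note that model.predict returns class probabilities, and with a heavily unbalanced dataset the argmax often collapses onto the majority class, which can look like "all zeros":
import numpy as np
from sklearn.metrics import confusion_matrix, classification_report

probs = model.predict(x_train_text)        # shape (3500, 6): class probabilities
y_pred = np.argmax(probs, axis=1)          # predicted class index per example
y_true = np.argmax(y_train_text, axis=1)   # one-hot labels -> class index

print(confusion_matrix(y_true, y_pred))                      # which classes get confused
print(classification_report(y_true, y_pred, zero_division=0))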

Should I use MinMaxScaler which was fit on train dataset to transform test dataset, or use a separate MinMaxScaler to fit and transform test dataset?

Assume that I have 3 datasets in an ML problem.
train dataset: used to estimate ML model parameters (training)
test dataset: used to evaluate the trained model and calculate its accuracy
prediction dataset: used only for prediction after model deployment
I don't have an evaluation dataset, and I use Grid Search with k-fold cross validation to find the best model.
Also, I have two python scripts as follows:
train.py: used to train and test ML model, load train and test dataset, save the trained model, best model is found by Grid Search.
predict.py: used to load pre-trained model & load prediction dataset, predict model output and calculate accuracy.
Before starting the training process in train.py, I use MinMaxScaler as follows:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaler.fit(x_train) # fit only on train dataset
x_train_norm = scaler.transform(x_train)
x_test_norm = scaler.transform(x_test)
In predict.py, after loading the prediction dataset, I need to use the same data pre-processing, as below:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaler.fit(x_predict)
x_predict_norm = scaler.transform(x_predict)
As you can see above, both fit and transform are done on prediction dataset. However, in train.py, fit is done on train dataset, and the same MinMaxScaler is applied to transform test dataset.
My understanding is that the test dataset is a simulation of the real data that the model is supposed to predict after deployment. Therefore, the pre-processing of the test and prediction datasets should be the same.
I think separate MinMaxScalers should be used in train.py for the train and test datasets, as follows:
from sklearn.preprocessing import MinMaxScaler
scaler_train = MinMaxScaler()
scaler_test = MinMaxScaler()
scaler_train.fit(x_train) # fit only on train dataset
x_train_norm = scaler_train.transform(x_train)
scaler_test.fit(x_test) # fit only on test dataset
x_test_norm = scaler_test.transform(x_test)
What is the difference?
The value of x_test_norm will be different if I use a separate MinMaxScaler as explained above. In that case, x_test_norm is guaranteed to lie in the range [0, 1]. However, if I transform the test dataset with a MinMaxScaler that was fit on the train dataset, x_test_norm can fall outside the range [0, 1].
Please let me know your thoughts on this.
When you run .transform(), MinMax scaling does something like (value - min) / (max - min). The values of min and max are determined when you run .fit(). So the answer is yes: you should fit the MinMaxScaler on the training dataset and then use that same scaler on the test dataset.
Just imagine a situation where, in the training dataset, some feature has max=100 and min=10, while in the test dataset it has max=10 and min=1. A separate MinMaxScaler trained on the test subset would indeed scale that feature into [0, 1], but the result would no longer be comparable to the training data: test values that are actually much smaller would end up on the same scale as the training values.
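A tiny numeric illustration of that point (the values are made up for the example):
import numpy as np
from sklearn.preprocessing import MinMaxScaler

x_train = np.array([[10.], [50.], [100.]])   # training range: min=10, max=100
x_test = np.array([[1.], [10.], [120.]])     # contains values outside that range

scaler = MinMaxScaler().fit(x_train)          # fit on train only
print(scaler.transform(x_test).ravel())
# [-0.1  0.  1.222...]  Test values may fall outside [0, 1], and that is fine:
# it simply tells the model those points lie outside the training range.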
Also, regarding Grid Search with k-fold cross-validation, you should use a Pipeline. In that case, Grid Search will automatically fit the MinMaxScaler on the k-1 training folds of each split. Here is a good example of how to organize a pipeline with mixed column types.
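A minimal sketch of that pipeline idea (the estimator and parameter grid are placeholders, not taken from the question):
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

pipe = Pipeline([
    ("scaler", MinMaxScaler()),   # re-fitted on the k-1 training folds of every CV split
    ("clf", SVC()),
])
param_grid = {"clf__C": [0.1, 1, 10]}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(x_train, y_train)      # no scaling statistics leak from the validation folds
print(search.best_params_, search.score(x_test, y_test))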

Linear Regression test data violating training data. Please explain where I went wrong

This is part of a dataset containing 1000 entries of house rental prices at different locations.
After training the model, if I send the same training data as test data, I get incorrect results. How is this even possible?
X_loc = df[['area', 'rooms', 'location']]
y_loc = df['price']
X_train, X_test, y_train, y_test = train_test_split(X_loc, y_loc, test_size=1/3, random_state=0)
regressor = LinearRegression()
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_train[0:1])   # predict on the first row of the training split
DATASET:
   price  rooms  area  location
0  22000      3  1339       140
1  45000      3  1580        72
3  72000      3  2310        72
4  40000      3  1800        41
5  35000      3  2100        57
The expected output (y_pred) should be 22000, but it's showing 29000. How can it violate the already trained input?
What you observed is exactly what is referred to as "training error". Machine learning models are meant to find the "best" fit, which minimizes the total error over all data points, not the error on every individual data point.
22000 is not very far from 29000, although it is not the exact number. This is because linear regression tries to compress all the variation in your data onto one straight line.
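To see that training error directly, you can look at the residuals on the training set itself (a short sketch reusing the variable names from the question):
import numpy as np

train_pred = regressor.predict(X_train)
residuals = y_train - train_pred            # non-zero even on the training data

print("Mean absolute training error:", np.mean(np.abs(residuals)))
print("Largest single training error:", np.max(np.abs(residuals)))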
Possibly the relationship is nonlinear, so applying a Linear Regression yields bad results. There are other reasons why a Linear Regression may fail, cf. https://stats.stackexchange.com/questions/393706/bad-linear-regression-results
Nonlinear behaviour often appears when there are (statistical) interactions between features.
A generalization of Linear Regression is the Generalized Linear Model (GLM), which can handle nonlinearities through its nonlinear link functions: https://en.wikipedia.org/wiki/Generalized_linear_model
In scikit-learn you can use Support Vector Regression with a polynomial or RBF kernel for a nonlinear model: https://scikit-learn.org/stable/auto_examples/svm/plot_svm_regression.html
An alternative approach is to analyze the data for interactions and apply the methods described in https://en.wikipedia.org/wiki/Generalized_linear_model#Correlated_or_clustered_data, but this is complex. You could also try Ridge Regression under this assumption, because it can handle multicollinearity, which is one form of statistical interaction: https://ncss-wpengine.netdna-ssl.com/wp-content/themes/ncss/pdf/Procedures/NCSS/Ridge_Regression.pdf
https://statisticsbyjim.com/regression/difference-between-linear-nonlinear-regression-models/
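For reference, the Support Vector Regression mentioned above takes only a few lines with the variables from the question's code; the kernel and hyperparameters below are just starting points, not tuned values:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Scaling matters a lot for kernel methods, hence the StandardScaler in front.
svr = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=100, epsilon=0.1))
svr.fit(X_train, y_train)
print(svr.score(X_test, y_test))   # R^2 on the held-out split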

Non-linear multivariate time-series response prediction using RNN

I am trying to predict the hygrothermal response of a wall, given the interior and exterior climate. Based on literature research, I believe this should be possible with RNN but I have not been able to get good accuracy.
The dataset has 12 input features (time-series of exterior and interior climate data) and 10 output features (time-series of hygrothermal response), both containing hourly values for 10 years. This data was created with hygrothermal simulation software; there is no missing data.
Unlike most time-series prediction problems, I want to predict the response for the full length of the input features time-series at each time-step, rather than the subsequent values of a time-series (eg financial time-series prediction). I have not been able to find similar prediction problems (in similar or other fields), so if you know of one, references are very welcome.
I think this should be possible with RNN, so I am currently using LSTM from Keras. Before training, I preprocess my data the following way:
Discard the first year of data, as the first time steps of the hygrothermal response of the wall are influenced by the initial temperature and relative humidity.
Split into training and testing set. Training set contains the first 8 years of data, the test set contains the remaining 2 years.
Normalise the training set (zero mean, unit variance) using StandardScaler from Sklearn. Normalise the test set analogously, using the mean and variance from the training set.
This results in: X_train.shape = (1, 61320, 12), y_train.shape = (1, 61320, 10), X_test.shape = (1, 17520, 12), y_test.shape = (1, 17520, 10)
As these are long time-series, I use stateful LSTM and cut the time-series as explained here, using the stateful_cut() function. I only have 1 sample, so batch_size is 1. For T_after_cut I have tried 24 and 120 (24*5); 24 appears to give better results. This results in X_train.shape = (2555, 24, 12), y_train.shape = (2555, 24, 10), X_test.shape = (730, 24, 12), y_test.shape = (730, 24, 10).
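The cutting step essentially reshapes the single long sample into consecutive windows of length T_after_cut. A rough NumPy equivalent (my own sketch, not the stateful_cut() helper from the linked post; X_full and y_full stand for the uncut (1, 61320, ·) arrays):
import numpy as np

T_after_cut = 24
features, targets = 12, 10
n_steps = X_full.shape[1]           # 61320 hourly steps after discarding year 1
nb_cuts = n_steps // T_after_cut    # 2555 consecutive windows

# (1, 61320, 12) -> (2555, 24, 12); window order is preserved for the stateful LSTM
X_train = X_full.reshape(nb_cuts, T_after_cut, features)
y_train = y_full.reshape(nb_cuts, T_after_cut, targets)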
Next, I build and train the LSTM model as follows:
from keras.models import Sequential
from keras.layers import LSTM, Dense, TimeDistributed
from keras.optimizers import Adam

model = Sequential()
model.add(LSTM(128,
               batch_input_shape=(batch_size, T_after_cut, features),
               return_sequences=True,
               stateful=True))
model.add(TimeDistributed(Dense(targets)))
model.compile(loss='mean_squared_error', optimizer=Adam())
model.fit(X_train, y_train, epochs=100, batch_size=batch_size, verbose=2, shuffle=False)
Unfortunately, I don't get accurate prediction results; not even for the training set, thus the model has high bias.
(Figure: prediction results of the LSTM model for all targets.)
How can I improve my model? I have already tried the following:
Not discarding the first year of the dataset -> no significant difference
Differencing the input feature time-series (subtracting the previous value from the current value) -> slightly worse results
Up to four stacked LSTM layers, all with the same hyperparameters -> no significant difference in results but longer training time
Dropout layer after LSTM layer (though this is usually used to reduce variance and my model has high bias) -> slightly better results, but difference might not be statistically significant
Am I doing something wrong with the stateful LSTM? Do I need to try different RNN models? Should I preprocess the data differently?
Furthermore, training is very slow: about 4 hours for the model above. Hence I am reluctant to do an extensive hyperparameter gridsearch...
In the end, I managed to solve this the following way:
Using more samples to train instead of only 1 (I used 18 samples to train and 6 to test)
Keep the first year of data, as the output time-series for all samples have the same 'starting point' and the model needs this information to learn
Standardise both input and output features (zero mean, unit variance). I found this improved prediction accuracy and training speed
Use stateful LSTM as described here, but reset the states after each epoch (see the code below). I used batch_size = 6 and T_after_cut = 1460. If T_after_cut is longer, training is slower; if T_after_cut is shorter, accuracy decreases slightly. If more samples are available, I think using a larger batch_size will be faster.
Use CuDNNLSTM instead of LSTM; this sped up training by about 4x!
I found that more units resulted in higher accuracy and faster convergence (shorter training time). I also found that a GRU is as accurate as the LSTM, though it converged faster for the same number of units.
Monitor validation loss during training and use early stopping
The LSTM model is built and trained as follows:
from keras.models import Sequential
from keras import layers
from keras.callbacks import Callback, EarlyStopping
from keras.optimizers import RMSprop

def define_reset_states_batch(nb_cuts):
    class ResetStatesCallback(Callback):
        def __init__(self):
            self.counter = 0

        def on_batch_begin(self, batch, logs={}):
            # reset states when nb_cuts batches are completed
            if self.counter % nb_cuts == 0:
                self.model.reset_states()
            self.counter += 1

        def on_epoch_end(self, epoch, logs={}):
            # reset states after each epoch
            self.model.reset_states()
    return ResetStatesCallback

model = Sequential()
model.add(layers.CuDNNLSTM(256,
                           batch_input_shape=(batch_size, T_after_cut, features),
                           return_sequences=True,
                           stateful=True))
model.add(layers.TimeDistributed(layers.Dense(targets, activation='linear')))

optimizer = RMSprop(lr=0.002)
model.compile(loss='mean_squared_error', optimizer=optimizer)

earlyStopping = EarlyStopping(monitor='val_loss', min_delta=0.005, patience=15, verbose=1, mode='auto')
ResetStatesCallback = define_reset_states_batch(nb_cuts)
model.fit(X_dev, y_dev, epochs=n_epochs, batch_size=n_batch, verbose=1, shuffle=False,
          validation_data=(X_eval, y_eval),
          callbacks=[ResetStatesCallback(), earlyStopping])
This gave me very satisfying accuracy (R2 over 0.98):
This figure shows the temperature (left) and relative humidity (right) in the wall over 2 years (data not used in training), prediction in red and true output in black. The residuals show that the error is very small and that the LSTM learns to capture the long-term dependencies to predict the relative humidity.

The proper way of using IsolationForest to detect outliers of high-dim dataset

I use the simple IsolationForest algorithm to detect the outliers of a given dataset X of 20K samples and 16 features. I run the following:
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import train_test_split

train_X, test_X, train_y, test_y = train_test_split(X, y, train_size=.8)
clf = IsolationForest()
clf.fit(X)              # notice I am using the entire dataset X when fitting!
print(clf.predict(X))
I get the result:
[ 1 1 1 -1 ... 1 1 1 -1 1]
The question is: is it logically correct to use the entire dataset X when fitting the IsolationForest, or only train_X?
Yes, it is logically correct to ultimately train on the entire dataset.
With that in mind, you could measure the test set performance against the training set's performance. This could tell you if the test set is from a similar distribution as your training set.
If the test set scores as anomalous compared to the training set, then you can expect future data to look similarly anomalous. In this case, I would want more data, to get a more complete view of what is 'normal'.
If the test set scores similarly to the training set, I would be more comfortable with the final Isolation Forest trained on all data.
Perhaps you could use sklearn TimeSeriesSplit CV in this fashion to get a sense for how much data is enough for your problem?
Since this is unlabeled data to the anomaly detector, the more data the better when defining 'normal'.
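One way to do that train/test comparison in practice (a sketch; higher score_samples values mean "more normal", and what counts as "clearly lower" is a judgment call):
import numpy as np
from sklearn.ensemble import IsolationForest

# Fit on the training split only, then compare anomaly scores of the two splits.
clf = IsolationForest(random_state=0).fit(train_X)
train_scores = clf.score_samples(train_X)
test_scores = clf.score_samples(test_X)

print("train median score:", np.median(train_scores))
print("test  median score:", np.median(test_scores))
# If the test scores are clearly lower, the test split looks anomalous relative
# to training, and more data may be needed before fitting a final model on everything.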
