How to get RMSE for a Random Survival Forest in R - random-forest

I need the RMSE for three models so I can compare them and say which one is better than the others. The models I have to run are a survival decision tree, a random survival forest, and bagging. I have run my models, but in the end I only get some predictions. I have included the random survival forest results below. What should I do to obtain the RMSE?
library(survival)
library(randomForestSRC)
dataset <- data.frame(data)
dataset
n.sample <- round(0.5 * nrow(dataset))
dataset1 <- sample(1:nrow(dataset), n.sample)
train <- data[dataset1, ]
test <- data[-dataset1, ]
set.seed(1369)
rsf0 <- rfsrc(Surv(time, status) ~ ., train, importance = TRUE, forest = TRUE,
              ensemble = "oob", mtry = NULL, block.size = 1, splitrule = "logrank")
print(rsf0)
Results:
Sample size: 821
Number of deaths: 209
Number of trees: 1000
Forest terminal node size: 15
Average no. of terminal nodes: 38.62
No. of variables tried at each split: 4
Total no. of variables: 14
Resampling used to grow trees: swor
Resample size used to grow trees: 519
Analysis: RSF
Family: surv
Splitting rule: logrank random
Number of random split points: 10
Error rate: 36.15%

I think you slightly misunderstand what survival analysis models are usually used for. Normally we want to predict the distribution of the survival time and not the survival time itself. The RMSE can only be used when the actual survival time is predicted. In your example, the models you discuss make a distribution prediction.
So firstly I've cleaned up your code slightly and added an example dataset to make it reproducible:
library(survival)
library(randomForestSRC)
# use the rats dataset to make the example reproducible
dataset <- data.frame(survival::rats)
dataset$sex <- factor(dataset$sex)
# note that you need to set.seed before you use `sample`
set.seed(1369)
# again specifying train/test split but this time as two separate sets of integers
train = sample(nrow(dataset), 0.5 * nrow(dataset))
test = setdiff(seq(nrow(dataset)), train)
# train the random forest model on the training data
rsf0 = rfsrc(Surv(time, status) ~ ., dataset[train, ], importance = TRUE, forest = TRUE,
             ensemble = "oob", mtry = NULL, block.size = 1, splitrule = "logrank")
# now make predictions on the test data
predictions = predict(rsf0, newdata = dataset[test, ])
# view the predicted survival probabilities
predictions$survival
With these probabilities, you have to decide how to convert them into survival time predictions, and then compute the RMSE yourself after first removing all censored observations. Common conversions to a survival time are to take the mean or the median of each predicted individual distribution.
As an alternative, and plugging my own package here, you could use {mlr3proba} which does this for you:
# load required packages
library(mlr3); library(mlr3proba);library(mlr3extralearners); library(mlr3pipelines)
# use the rats dataset to make the example reproducible
dataset <- data.frame(survival::rats)
dataset$sex <- factor(dataset$sex)
# note that you need to set.seed before you use `sample`
set.seed(1369)
# again specifying train/test split but this time as two separate sets of integers
train = sample(nrow(dataset), 0.5 * nrow(dataset))
test = setdiff(seq(nrow(dataset)), train)
# select the random forest model and use the `crankcompositor` to automatically
# create survival time predictions
learn = ppl("crankcompositor", lrn("surv.rfsrc"), response = TRUE, graph_learner = TRUE)
# create a task which stores your dataset
task = TaskSurv$new("data", backend = dataset, time = "time", event = "status")
# train your learner on training data
learn$train(task, row_ids = train)
# make predictions on test data
predictions = learn$predict(task, row_ids = test)
# view your survival time predictions
predictions$response
# calculate RMSE
predictions$score(msr("surv.rmse"))
This second option is more involved if you're not used to R6, but I suspect it will benefit your use-case, as it also lets you compare multiple models at the same time.

Related

Low accuracy scores on portions of test dataset, high scores on full dataset

I have a dataset of satellite spectral data collected over the four-year period 2018-2021 (separate observations for each year). I split the dataset 7:2:1 into training, validation, and test subsets. I trained a CNN (achieving good accuracy), validated it (also good accuracy), and finally tested it (good accuracy as well). I plan to use the model to classify upcoming datasets; the new ones will, of course, come from a single-year period (2022), so I wanted to simulate how it would behave. To do so, I split my test dataset into single-year observations (e.g. 2018 only). When fed to the previously trained model, these yielded pretty poor accuracy scores. This happened for all four years: model performance on single-year datasets was always significantly inferior to performance on the full four-year dataset.
Below is an example where I evaluate the model on two datasets and then on a third dataset that consists of the previous two combined.
import pandas as pd
import tensorflow as tf

# load trained model
new_model = tf.keras.models.load_model("/content/drive/MyDrive/GEEMAP/modelC2D")
# load datasets 1 and 2 and concatenate both to make dataset 3
df1 = pd.read_csv("/content/drive/MyDrive/GEEMAP/train_species/digest1.csv", index_col=0)
df2 = pd.read_csv("/content/drive/MyDrive/GEEMAP/train_species/digest2.csv", index_col=0)
df3 = pd.concat([df1, df2], axis=0, ignore_index=True)  # merged dataset consists of df1 and df2
#prepare dataset 1
df1x= df1.iloc[:,:-1] # separate values from classes
df1x=(df1x - df1x.mean())/df1x.std() #standardise
df1x= df1x.values.reshape(-1, 8, 4) #reshape to fit the model settings
df1y= df1.iloc[:,-1] # separate classes
#prepare dataset 2
df2x= df2.iloc[:,:-1]
df2x=(df2x - df2x.mean())/df2x.std()
df2x= df2x.values.reshape(-1, 8, 4)
df2y= df2.iloc[:,-1]
#prepare dataset 3
df3x= df3.iloc[:,:-1]
df3x=(df3x - df3x.mean())/df3x.std()
df3x= df3x.values.reshape(-1, 8, 4)
df3y= df3.iloc[:,-1]
results1 = new_model.evaluate(df1x, df1y)
results2 = new_model.evaluate(df2x, df2y)
results3 = new_model.evaluate(df3x, df3y)
print("dataset1 test loss, test acc:", results1)
print("dataset2 test loss, test acc:", results2)
print("dataset3 test loss, test acc:", results3)
The output I get is:
dataset1 test loss, test acc: [1.0961114168167114, 0.7297353148460388]
dataset2 test loss, test acc: [0.9534237384796143, 0.7568098306655884]
dataset3 test loss, test acc: [0.6185511350631714, 0.8320053815841675]
How is this possible? Is there something fundamental about how CNN models work that I am missing? It doesn't make sense to me that the already-trained model predicts with better accuracy when it is simply given more data to evaluate.

MLP Time Series Forecasting Model not learning the correct distribution

The goal is to build a model that can rank the stocks based on their Target Return.
The dataset I am using is structured the following way:

Date       | Stock_Id | Volume | Open | Close | High | Low  | Target
2022-01-04 | 8341     | 103422 | 2734 | 2742  | 2755 | 2730 | 0.0007304601899196
The data is chronological, and all the stocks are grouped by day. There are only 2000 stocks.
Using the Open, High, Low, Close, and Volume features, I was able to generate 65+ new features using the talib library.
The Neural Net takes in all 74 total features and is meant to predict the Target number of each stock. The Target represents the rate of change of the stock between t+2 and t+1.
What I've done:
I have normalized the data across time: for any given Stock_Id, I take all of its time-domain data, compute the mean and std, and use those to normalize that stock's data (a sketch of this step follows the list below).
The dataloader indexes into the dataset by day. This acts as a mini-batch, since a (2000, 74) tensor is returned whenever __getitem__ is called.
The train dataloader is set to shuffle, so the indexing is random.
The loss function is meant to optimize for the highest returns while keeping the standard deviation of those returns in check.
Because of the non-chronological nature of the dataloader, and because I need all outputs to compute the mean and std, the training loop uses optim.SGD but effectively performs full-batch gradient descent.
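For clarity, here is a minimal sketch of the per-stock normalization described above, assuming the data sits in a pandas DataFrame with the columns shown earlier; the function name and feature list are illustrative, not the code actually used:

import pandas as pd

def normalize_across_time(df: pd.DataFrame, feature_cols: list) -> pd.DataFrame:
    # z-score each stock's features using that stock's full-history mean and std
    out = df.copy()
    grouped = out.groupby("Stock_Id")[feature_cols]
    out[feature_cols] = (out[feature_cols] - grouped.transform("mean")) / grouped.transform("std")
    return out

# illustrative usage on the raw OHLCV columns:
# df = normalize_across_time(df, ["Volume", "Open", "Close", "High", "Low"])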
During training, the loss converges very well, but the predictions are very off.
After training, I use the test set to check the accuracy of the predictions by comparing the distribution of the predictions with the distribution of the true targets.
The target percentages mostly range between -15 and 15, whereas the predictions fall in a much narrower range, roughly -0.07 to -0.05.
I have tried running the model with different learning rates, changing the hidden layer sizes in the network, multiplying the targets by 100 to represent them as percentages, and changing the timeperiod argument in many of the talib features. But I always get a model that poorly represents the data.
More Info:
The data is from 1300 trading days with each day containing data for 2000 stocks.
I have adjusted the Open, Close, High, Low prices before normalizing.
I have tried different neural network sizes. Having input layer length 74, and output being 1, with any arbitrary number of layers and sizes in between.
This is what the training loop looks like.
loss_lst = []
for epoch in range(50):  # loop over the dataset multiple times
    optimizer.zero_grad()
    running_error = torch.tensor(0.0)
    running_return = []
    running_loss = torch.tensor(0.0)
    for i, data in enumerate(trainloader, 0):
        # get the inputs; data is a list of [inputs, labels]
        inputs, labels, date = data
        inputs = inputs.cuda().squeeze()
        labels = labels.cuda().squeeze()
        # forward pass
        outputs = net(inputs).squeeze()
        error = error_criterion(outputs, labels)  # error between prediction and label
        running_error += error.cpu()
        day_return = return_criterion(outputs, labels).cpu()
        running_return.append(day_return)  # daily returns used to compute the std for the loss function
        del outputs
        del inputs
        del labels
    avg_error = running_error / len(trainloader)
    std_return = torch.stack(running_return).cuda().std()
    loss = (mean_hp * avg_error) + (std_hp * std_return)
    loss.backward()
    optimizer.step()
    running_loss += loss.item()
    loss_lst.append(loss.item())
    print(f'[{epoch + 1}] loss: {running_loss:.10f}')
    # loss_lst.append(sum(running_loss)/len(trainloader))
    running_loss = 0.0
print('Finished Training')
Let me know if any additional clarifications are needed.

Finding "look_back" & "look_ahead" hyper-parameters for Seq2Seq models

For Seq2Seq deep learning architectures, viz. LSTM/GRU, and multivariate, multistep time-series forecasting, it's important to convert the data to 3D: (batch_size, look_back, number_features). Here look_back decides how many past data points/samples, each with number_features features from your training dataset, are considered. Similarly, look_ahead defines the number of future steps you want your model to forecast.
I have written a function to help achieve this:
def split_series_multivariate(data, n_past, n_future):
    '''
    Create training and testing splits required by Seq2Seq
    architecture(s) for multivariate, multistep and multivariate
    output time-series modeling.
    '''
    X, y = list(), list()
    for window_start in range(len(data)):
        past_end = window_start + n_past
        future_end = past_end + n_future
        if future_end > len(data):
            break
        # slice past and future parts of window-
        past, future = data[window_start: past_end, :], data[past_end: future_end, :]
        # past, future = data[window_start: past_end, :], data[past_end: future_end, 4]
        X.append(past)
        y.append(future)
    return np.array(X), np.array(y)
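As a quick sanity check, here is a usage sketch with a dummy NumPy array (the shapes are illustrative, not taken from a real dataset):

import numpy as np

# dummy data: 100 time steps, 8 features
data = np.random.rand(100, 8)
X, y = split_series_multivariate(data, n_past=30, n_future=10)
print(X.shape)  # (61, 30, 8): sliding windows of 30 past steps
print(y.shape)  # (61, 10, 8): the 10 future steps that follow each window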
But, look_back and look_ahead are hyper-parameters which need to be tuned for a given dataset.
# Define hyper-parameters for Seq2Seq modeling:
# look-back window size-
n_past = 30
# number of future steps to predict for-
n_future = 10
# number of features used-
n_features = 8
What is the best practice for choosing/finding look_back and look_ahead hyper-parameters?

What is the standard way to train a PyTorch script until convergence?

What is the standard way to detect whether a model has converged? I was going to record 5 losses with 95% confidence intervals for each loss, and halt the script if they all agreed. I assume training until convergence must already be implemented somewhere in PyTorch or PyTorch Lightning. I don't need a perfect solution, just the standard way to do this automatically, i.e. halt when converged.
My solution is easy to implement. First create a criterion and change the reduction to 'none'; it will then output a tensor of size [B]. Every time you log, record that tensor's mean and its 95% confidence interval (or the std if you prefer, but that is much less accurate). Then, every time you add a new loss with its confidence interval, keep only the most recent 5 (or 10), and check that those 5 losses are within each other's 95% CIs. If that is true, halt.
You can compute the CI with this:
import scipy.stats
from torch import Tensor

def torch_compute_confidence_interval(data: Tensor,
                                      confidence: float = 0.95
                                      ) -> Tensor:
    """
    Computes the confidence interval for a given survey of a data set.
    """
    n = len(data)
    mean: Tensor = data.mean()
    # se: Tensor = scipy.stats.sem(data)  # compute standard error
    # se, mean: Tensor = torch.std_mean(data, unbiased=True)  # compute standard error
    se: Tensor = data.std(unbiased=True) / (n**0.5)
    t_p: float = float(scipy.stats.t.ppf((1 + confidence) / 2., n - 1))
    ci = t_p * se
    return mean, ci
and you can create the criterion as follow:
loss: nn.Module = nn.CrossEntropyLoss(reduction='none')
so the train loss is now of size [B].
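A minimal sketch of the halting check itself, using the CI function above; the window size of 5 and the train_step() helper are illustrative stand-ins, not a fixed API:

from collections import deque

recent = deque(maxlen=5)  # keep the 5 most recent (mean, ci) pairs

def cis_overlap(history):
    # the recent losses "agree" if all of their 95% CIs share a common point
    lows = [m - c for m, c in history]
    highs = [m + c for m, c in history]
    return max(lows) <= min(highs)

# inside the training loop, after computing the per-sample losses of size [B]
# (train_step() is a hypothetical stand-in for the forward pass + loss):
batch_losses = train_step()
mean, ci = torch_compute_confidence_interval(batch_losses)
recent.append((mean.item(), ci.item()))
if len(recent) == recent.maxlen and cis_overlap(recent):
    print("converged: recent losses lie within each other's CIs, halting")
    # break out of the training loop here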
Note that I know how to train for a fixed number of epochs, so I am not really looking for that; I just want the halting criterion for when to stop once the model looks converged, i.e. what a person would do by eye when looking at the learning curve, but automated.
ref:
https://forums.pytorchlightning.ai/t/what-is-the-standard-way-to-halt-a-script-when-it-has-converged/1415
Set an EarlyStopping (https://pytorch-lightning.readthedocs.io/en/stable/api/pytorch_lightning.callbacks.EarlyStopping.html#pytorch_lightning.callbacks.EarlyStopping) callback in your trainer:
import pytorch_lightning as pl
from pytorch_lightning.callbacks import EarlyStopping

callbacks = [
    EarlyStopping(
        monitor="val_f1_score",
        min_delta=0.01,
        patience=10,  # NOTE: counted in validation epochs, not training epochs
        verbose=False,
        mode="max",   # an F1 score should be maximised; use "min" when monitoring a loss
    ),
]
trainer = pl.Trainer(callbacks=callbacks)
This will monitor changes in val_f1_score during training (note that you have to log this value with self.log("val_f1_score", val_f1) in your pl.LightningModule), and it will stop training when the monitored quantity has not improved by at least min_delta for more than the number of validation epochs specified as patience.
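For reference, a minimal sketch of where that logging call would live, assuming a torchmetrics-style F1 metric; the metric object and module internals are illustrative:

import pytorch_lightning as pl
import torchmetrics

class LitModel(pl.LightningModule):
    def __init__(self, model, num_classes):
        super().__init__()
        self.model = model
        # illustrative metric; any scalar you compute yourself works too
        self.val_f1 = torchmetrics.F1Score(task="multiclass", num_classes=num_classes)

    def validation_step(self, batch, batch_idx):
        x, y = batch
        logits = self.model(x)
        self.val_f1.update(logits, y)
        # EarlyStopping looks this value up by its logged name
        self.log("val_f1_score", self.val_f1, on_epoch=True)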

Non-linear multivariate time-series response prediction using RNN

I am trying to predict the hygrothermal response of a wall, given the interior and exterior climate. Based on literature research, I believe this should be possible with RNN but I have not been able to get good accuracy.
The dataset has 12 input features (time-series of exterior and interior climate data) and 10 output features (time-series of hygrothermal response), both containing hourly values for 10 years. This data was created with hygrothermal simulation software, there is no missing data.
Dataset features:
Dataset targets:
Unlike most time-series prediction problems, I want to predict the response for the full length of the input features time-series at each time-step, rather than the subsequent values of a time-series (eg financial time-series prediction). I have not been able to find similar prediction problems (in similar or other fields), so if you know of one, references are very welcome.
I think this should be possible with RNN, so I am currently using LSTM from Keras. Before training, I preprocess my data the following way:
Discard the first year of data, as the first time steps of the hygrothermal response of the wall are influenced by the initial temperature and relative humidity.
Split into training and test sets. The training set contains the first 8 years of data, the test set the remaining 2 years.
Normalise the training set (zero mean, unit variance) using StandardScaler from sklearn. Normalise the test set analogously, using the mean and variance from the training set.
This results in: X_train.shape = (1, 61320, 12), y_train.shape = (1, 61320, 10), X_test.shape = (1, 17520, 12), y_test.shape = (1, 17520, 10)
As these are long time-series, I use stateful LSTM and cut the time-series as explained here, using the stateful_cut() function. I only have 1 sample, so batch_size is 1. For T_after_cut I have tried 24 and 120 (24*5); 24 appears to give better results. This results in X_train.shape = (2555, 24, 12), y_train.shape = (2555, 24, 10), X_test.shape = (730, 24, 12), y_test.shape = (730, 24, 10).
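For the single-sample case described here, the cutting step amounts to a reshape that preserves time order; this is a simplified sketch assuming X_train and y_train are NumPy arrays with the shapes given above (the stateful_cut() function from the linked post handles the general multi-sample case):

T_after_cut = 24

# X_train: (1, 61320, 12) -> (2555, 24, 12); y_train: (1, 61320, 10) -> (2555, 24, 10)
# 61320 is divisible by 24, so the long sequence splits into consecutive chunks
X_train = X_train.reshape(-1, T_after_cut, X_train.shape[2])
y_train = y_train.reshape(-1, T_after_cut, y_train.shape[2])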
Next, I build and train the LSTM model as follows:
from keras.models import Sequential
from keras.layers import LSTM, TimeDistributed, Dense
from keras.optimizers import Adam

model = Sequential()
model.add(LSTM(128,
               batch_input_shape=(batch_size, T_after_cut, features),
               return_sequences=True,
               stateful=True,
               ))
model.add(TimeDistributed(Dense(targets)))
model.compile(loss='mean_squared_error', optimizer=Adam())
model.fit(X_train, y_train, epochs=100, batch_size=batch_size, verbose=2, shuffle=False)
Unfortunately, I don't get accurate prediction results; not even for the training set, thus the model has high bias.
The prediction results of the LSTM model for all targets
How can I improve my model? I have already tried the following:
Not discarding the first year of the dataset -> no significant difference
Differentiating the input features time-series (subtract previous value from current value) -> slightly worse results
Up to four stacked LSTM layers, all with the same hyperparameters -> no significant difference in results but longer training time
Dropout layer after LSTM layer (though this is usually used to reduce variance and my model has high bias) -> slightly better results, but difference might not be statistically significant
Am I doing something wrong with the stateful LSTM? Do I need to try different RNN models? Should I preprocess the data differently?
Furthermore, training is very slow: about 4 hours for the model above. Hence I am reluctant to do an extensive hyperparameter gridsearch...
In the end, I managed to solve this the following way:
Using more samples to train instead of only 1 (I used 18 samples to train and 6 to test)
Keep the first year of data, as the output time-series for all samples have the same 'starting point' and the model needs this information to learn
Standardise both input and output features (zero mean, unit variance). I found this improved prediction accuracy and training speed
Use stateful LSTM as described here, but reset states after each epoch (see below for code). I used batch_size = 6 and T_after_cut = 1460. If T_after_cut is longer, training is slower; if T_after_cut is shorter, accuracy decreases slightly. If more samples are available, I think using a larger batch_size will be faster.
Use CuDNNLSTM instead of LSTM; this sped up training about 4x!
I found that more units resulted in higher accuracy and faster convergence (shorter training time). I also found that the GRU is as accurate as the LSTM, though it converged faster for the same number of units.
Monitor validation loss during training and use early stopping
The LSTM model is build and trained as follows:
def define_reset_states_batch(nb_cuts):
    class ResetStatesCallback(Callback):
        def __init__(self):
            self.counter = 0

        def on_batch_begin(self, batch, logs={}):
            # reset states when nb_cuts batches are completed
            if self.counter % nb_cuts == 0:
                self.model.reset_states()
            self.counter += 1

        def on_epoch_end(self, epoch, logs={}):
            # reset states after each epoch
            self.model.reset_states()
    return ResetStatesCallback

model = Sequential()
model.add(layers.CuDNNLSTM(256, batch_input_shape=(batch_size, T_after_cut, features),
                           return_sequences=True,
                           stateful=True))
model.add(layers.TimeDistributed(layers.Dense(targets, activation='linear')))

optimizer = RMSprop(lr=0.002)
model.compile(loss='mean_squared_error', optimizer=optimizer)

earlyStopping = EarlyStopping(monitor='val_loss', min_delta=0.005, patience=15, verbose=1, mode='auto')
ResetStatesCallback = define_reset_states_batch(nb_cuts)
model.fit(X_dev, y_dev, epochs=n_epochs, batch_size=n_batch, verbose=1, shuffle=False,
          validation_data=(X_eval, y_eval), callbacks=[ResetStatesCallback(), earlyStopping])
This gave me very satisfying accuracy (R2 over 0.98):
This figure shows the temperature (left) and relative humidity (right) in the wall over 2 years (data not used in training), with the prediction in red and the true output in black. The residuals show that the error is very small and that the LSTM learns to capture the long-term dependencies needed to predict the relative humidity.
