I have a LayoutLM model trained on the FUNSD dataset, and I need to train it on the DocBank dataset.
I am trying to produce out-of-sample forecasts over a horizon with supervised models. Furthermore, they are multi-output, as there are many simultaneous univariate time series running in parallel. How can I avoid using X_test samples when predicting with this type of model?
The code is like this (any other regressor, e.g. RF or AdaBoost, would do):

import xgboost as xgb
from sklearn.multioutput import MultiOutputRegressor

multioutputregressor = MultiOutputRegressor(
    xgb.XGBRegressor(objective='reg:squarederror', verbosity=1)).fit(X_train, y_train)
y_multirf1 = multioutputregressor.predict(X_test)
Here I need to forecast univariate data. Besides, it looks like the only exogenous variable available is 'time', but it seems like a violation to use it as X (train/test). Are there any special models for supervised forecasting with out-of-sample predictions?
Thanks.
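One way to avoid feeding X_test at prediction time is to build the features from lagged values of the series themselves and forecast recursively, feeding each prediction back in as the input for the next step. Below is a minimal sketch of that idea; y_history (the observed parallel series), horizon, n_lags, and the make_lags helper are illustrative assumptions, not part of the original code.

import numpy as np
import xgboost as xgb
from sklearn.multioutput import MultiOutputRegressor

def make_lags(y, n_lags):
    # y has shape (n_timesteps, n_series); each feature row stacks the previous n_lags observations
    X = np.hstack([y[i:len(y) - n_lags + i] for i in range(n_lags)])
    return X, y[n_lags:]

n_lags = 12                                        # assumed lag window
X_train, y_train = make_lags(y_history, n_lags)    # y_history: observed (n_timesteps, n_series) array
model = MultiOutputRegressor(
    xgb.XGBRegressor(objective='reg:squarederror')).fit(X_train, y_train)

# recursive out-of-sample forecast: no X_test is needed, each prediction feeds the next step
window = y_history[-n_lags:]
forecasts = []
for _ in range(horizon):                           # horizon: assumed number of steps ahead
    y_next = model.predict(window.reshape(1, -1))[0]
    forecasts.append(y_next)
    window = np.vstack([window[1:], y_next])       # slide the lag window forward
forecasts = np.asarray(forecasts)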
If the values of my prediction column range from 5 to 50 for a particular question, should I use RandomForestRegressor or RandomForestClassifier?
The question relates to Boston house pricing.
Prediction column --> MEDV (median value of owner-occupied homes in $1000's).
Also, I have read somewhere that if the possible prediction values are known in advance we should use a classifier, otherwise a regressor.
Your prediction column has continuous values, hence this is a regression problem.
You can use a linear regression model.
A quick answer to your question is RandomForestRegressor.
You can refer to the documentation here.
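As a rough sketch (the split ratio and hyperparameters are arbitrary, and X / y stand for the Boston features and the continuous MEDV target):

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# X: feature matrix, y: continuous MEDV values in $1000's
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
reg = RandomForestRegressor(n_estimators=100, random_state=42)
reg.fit(X_train, y_train)
print(mean_squared_error(y_test, reg.predict(X_test)))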
I have a dataset of 100K rows and 100 columns, and I want to generate samples based on this existing dataset so that the output has 10M rows and 100 columns.
Any idea how to do this in Python?
I don't want oversampling methods because my dataset is already balanced.
You should first split your data into train and validation/test sets and oversample only the training data, to avoid "bleeding" samples between these datasets.
Check out these:
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
More about SMOTE:
https://imbalanced-learn.readthedocs.io/en/stable/generated/imblearn.over_sampling.SMOTE.html
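A minimal usage sketch (X, y and the split parameters here are placeholders; the key point is that SMOTE is fitted on the training portion only):

from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

# split first so the synthetic samples never leak into the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)
sm = SMOTE(random_state=0)
X_train_res, y_train_res = sm.fit_resample(X_train, y_train)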
In many examples, I see train/cross-validation dataset splits being performed with KFold, StratifiedKFold, or another pre-built dataset splitter. Keras models have a built-in validation_split kwarg that can be used for training.
model.fit(self, x, y, batch_size=32, nb_epoch=10, verbose=1, callbacks=[], validation_split=0.0, validation_data=None, shuffle=True, class_weight=None, sample_weight=None)
(https://keras.io/models/model/)
validation_split: float between 0 and 1: fraction of the training data to be used as validation data. The model will set apart this fraction of the training data, will not train on it, and will evaluate the loss and any model metrics on this data at the end of each epoch.
I am new to the field and the tools, so I don't have a good intuition of what the different splitters offer. Mainly, though, I can't find any information on how Keras' validation_split works. Can someone explain it to me, and say when a separate splitting method is preferable? The built-in kwarg seems to me like the cleanest and easiest way to split off test data, without having to architect the training loops much differently.
The difference between the two is quite subtle and they can be used in conjunction.
KFold and similar functions in scikit-learn will randomly split your data into k folds. You can then train models holding out a single fold each time and testing on the held-out fold.
validation_split takes a fraction of your data non-randomly. According to the Keras documentation it will take the fraction from the end of your data, e.g. 0.1 will hold out the final 10% of rows in the input matrix. The purpose of the validation split is to allow you to assess how the model is performing on the training set and a held out set at every epoch in the training period. If the model continues to improve on the training set but not the validation set then it is a clear sign of potential overfitting.
You could theoretically use KFold cross-validation to construct a model while also using validation_split to monitor the performance of each model. At each fold you will be generating a new validation_split from the training data.
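A minimal sketch of that combination (the network architecture, epoch count, and 0.1 fraction are illustrative assumptions, and X / y are placeholder NumPy arrays):

import numpy as np
from sklearn.model_selection import KFold
from tensorflow import keras

def build_model(input_dim):
    # small placeholder network
    model = keras.Sequential([
        keras.layers.Dense(32, activation='relu', input_shape=(input_dim,)),
        keras.layers.Dense(1)])
    model.compile(optimizer='adam', loss='mse')
    return model

kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []
for train_idx, test_idx in kf.split(X):
    model = build_model(X.shape[1])
    # validation_split holds out the last 10% of this training fold to monitor loss each epoch
    model.fit(X[train_idx], y[train_idx], epochs=10, batch_size=32,
              validation_split=0.1, verbose=0)
    fold_scores.append(model.evaluate(X[test_idx], y[test_idx], verbose=0))
print(np.mean(fold_scores))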