How can I use AutoML in GCP to predict rare event? - machine-learning

As a title, I tried to use AutoML in Google Cloud Platform to predict some rare results.
For example, suppose I have 5 types of independent variables: age, living area, income, family size, and gender. I want to predict a rare event called "purchase".
Purchases are very rare, because for 10,000 data points, I will only get 3-4 purchases. Fortunately, I got loads more than just 10,000 data points. (I got 100 million data points)
I have tried to use AutoML to model the best combination, but since this is a rare result, the model only predicts for me that the number of purchases for all types of combinations in these 5 categories is 0. May I know how to solve this problem in AutoML?

In Cloud AutoML, the model predictions and the model evaluation metrics depend on the confidence threshold that is set. By default, in Cloud AutoML, the confidence threshold is 0.5. This value can be changed in the “Evaluate” tab of the “Models” section. To evaluate your model, change the confidence threshold to see how precision and recall are affected. The best confidence threshold depends on your use case. Here are some example scenarios to learn how evaluation metrics can be used. In your case, the recall metric has to be maximized (which would result in fewer false negatives) in order to correctly predict the purchase column.
Also, the training data has to be composed of a comparable number of examples from each class in the target variable so that the model can predict values with a higher confidence. Since your training data is highly skewed, preprocessing of the data such as resampling has to be performed to handle the skewness.

Related

Random Forest important features output stability issue

I'm fitting 2 almost identical Random Forest regression models. Both models use the same data set that have 60 features and 90 data points. The only difference is they're using different targets (the target column of each model is excluded from the respective features dataframes, of course). All of the cross validation settings are same of both models (number of folds, number of iterations, scoring) and the hyperparameter grids are also identical.
I'm interested in the feature importance output. However, one of the model consistently output the same top features while the other doesn't. Does anyone know why this is the case?
You can set a seed or the parameter random_state in case you rely on sklearn.ensemble.RandomForestRegressor in order to stabilize your results.
It's quite normal to get varying feature importance since the forest is assembled randomly. Furthermore, feature importance may not be the optimal metric to evaluate actual feature importance. You could try Boruta-Algorithm/Permutation Feature Importance to get a different perspective.
Towards your actual question, maybe your regressors are better suited to predict one target variable over the other.
How do both models perform accuracy-wise on the data? This might be one possibility to explain why one model is more stable. Do feature importances remain unstable for a larger amount of trees fitted?

Why does my Random Forest Classifier perform better on test and validation data than on training data?

I'm currently training a random forest on some data I have and I'm finding that the model performs better on the validation set, and even better on the test set, than on the train set. Here are some details of what I'm doing - please let me know if I've missed any important information and I will add it in.
My question
Am I doing anything obviously wrong and do you have any advice for how I should improve my approach because I just can't believe that I'm doing it right when my model predicts significantly better on unseen data than training data!
Data
My underlying data consists of tables of features describing customer behaviour and a binary target (so this is a binary classification problem). Technically I have one such table per month and I tend to use several months of data to train and then a different month to predict (e.g. Train on Apr, May and Predict on Jun)
Generally this means I end up with a training dataset of about 100k rows and 20 features (I've previously looked into feature selection and found a set of 7 features which seem to perform best, so have been using these lately). My prediction set generally has around 50k rows.
My dataset is heavily unbalanced (approximately 2% incidence of target feature), so I'm using oversampling techniques - more on that below.
Method
I've searched around online quite a lot and this has led me to the following approach:
Take scaleable (continuous) features in the training data and standardise them (currently using sklearn StandardScaler)
Take categorical features and encode them into separate binary columns (one-hot) using Pandas get_dummies function
Remove 10% of the training data to form a validation set (I'm currently using a random seed in this process for comparability whilst I vary different things such as hyperparameters in the model)
Take the remaining 90% of training data and perform a grid search across a few parameters of the RandomForestClassifier() (currently min_samples_split, max_depth, n_estimators and max_features)
Within each hyperparameter combination from the grid I perform kfold validation with 5 folds and using a random state
Within each fold I oversample my minority class for training data only (sometimes using imbalanced-learn's RandomOverSampler() and sometimes using SMOTE() from the same package), train the model on the training data and then apply the model to the kth fold and record performance metrics (precision, recall, F1 and AUC)
Once I've been through 5 folds on each hyperparameter combination I find the best F1 score (and best precision if two combinations are tied on F1 score) and retrain a random forest on the entire 90% training data using those hyperparameters. During this step I use the same oversampling technique as I did in the kfold process
I then use this model to make predictions on the 10% of training data that I put aside earlier as a validation set, evaluating the same metrics as above
Finally I have a test set, which is actually based on data from another month, which I apply the already trained model to and evaluate the same metrics
Outcome
At the moment I'm finding that my training set achieves an F1 score of around 30%, the validation set is consistently slightly higher than this at around 36% (mostly driven by a much better precision than the training data e.g. 60% vs. 30%) and then the testing set is getting an F1 score of between 45% and 50% which is again driven by a better precision (around 65%)
Notes
Please do ask about any details I haven't mentioned; I've had my stuck in this for weeks and so have doubtless omitted some details
I've had a brief look (not a systematic analysis) of the stability of metrics between folds in the kfold validation and it seems that they aren't varying very much, so I'm fairly happy with the stability of the model here
I'm actually performing the grid search manually rather than using a Python pipeline because try as I might I couldn't get imbalanced-learn's Pipeline function to work with the oversampling functions and so I run a loop with combinations of hyperparameters, but I'm confident that this isn't impacting the results I've talked about above in an adverse way
When I apply the final model to the prediction data (and get an F1 score around 45%) I also apply it back to the training data itself out of interest and get F1 scores around 90% - 100%. I suppose this is to be expected as the model is trained and predicts on almost exactly the same data (except the 10% holdout validation set)

How to estimate the accuracy on a large dataset?

Given that I have a deep learning model(handover from former colleague). For some reason, the train/dev set was missing.
In my situation, I want to classify my dataset into 100 categories. The dataset is extremely imbalanced. The dataset size is about tens of millions
First of all, I run the model and got the prediction on the whole dataset.
Then, I sample 100 records per category(according to the prediction) and got a 10,000 test set.
Next, I labeled the ground truth of each record for the test set and calculate the precision, recall, f1 for each category and got F1-micro and F1-macro.
How to estimate the accuracy or other metrics on the whole dataset? Is it correct that I use the weighted sum of each category's precision(the weight is the proportion of prediction on the whole) to estimate?
Since the distribution of prediction category is not same as the distribution of real category, I guess the weighted approach does not work. Any one can explain it?
The issue if you take a weighted average is that if your classifier performs well on the majority class, but poorly on minority classes (which is the typical scenario), it will not be reflected in the score.
One of the recommended approaches is rather to use the balanced accuracy score (see here for the scikit learn implementation). Basically, it is an average of all recall scores: for each observation in a class, it looks at how many of were correctly classified, and averages this across all classes. This will give you a sensible overall score to report.

Classification of Stock Prices Based on Probabilities

I'm trying to build a classifier to predict stock prices. I generated extra features using some of the well-known technical indicators and feed these values, as well as values at past points to the machine learning algorithm. I have about 45k samples, each representing an hour of ohlcv data.
The problem is actually a 3-class classification problem: with buy, sell and hold signals. I've built these 3 classes as my targets based on the (%) change at each time point. That is: I've classified only the largest positive (%) changes as buy signals, the opposite for sell signals and the rest as hold signals.
However, presenting this 3-class target to the algorithm has resulted in poor accuracy for the buy & sell classifiers. To improve this, I chose to manually assign classes based on the probabilities of each sample. That is, I set the targets as 1 or 0 based on whether there was a price increase or decrease.
The algorithm then returns a probability between 0 and 1(usually between 0.45 and 0.55) for its confidence on which class each sample belongs to. I then select some probability bound for each class within those probabilities. For example: I select p > 0.53 to be classified as a buy signal, p < 0.48 to be classified as a sell signal and anything in between as a hold signal.
This method has drastically improved the classification accuracy, at some points to above 65%. However, I'm failing to come up with a method to select these probability bounds without a large validation set. I've tried finding the best probability values within a validation set of 3000 and this has improved the classification accuracy, yet the larger the validation set becomes, it is clear that the prediction accuracy in the test set is decreasing.
So, what I'm looking for is any method by which I could discern what the specific decision probabilities for each training set should be, without large validation sets. I would also welcome any other ideas as to how to improve this process. Thanks for the help!
What you are experiencing is called non-stationary process. The market movement depends on time of the event.
One way I used to deal with it is to build your model with data in different time chunks.
For example, use data from day 1 to day 10 for training, and day 11 for testing/validation, then move up one day, day 2 to day 11 for training, and day 12 for testing/validation.
you can save all your testing results together to compute an overall score for your model. this way you have lots of test data and a model that will adapt to time.
and you get 3 more parameters to tune, #1 how much data to use for train, #2 how much data for test, # per how many days/hours/data points you retrain your data.

evaluating accuracy of decision tree/forrest model

Im relatively new to ML. Ive created a decision tree model to predict prices of an item based on some criteria.
For an example, lets say the model predicts the price of a car based on a few features such as engine size, number of doors, fuel type, mileage and age.
Analysis of the data showed me that my data was not linear, so decision tree was a better fit. The model also does an ok job at predicting but before i can give it to any users, i need to quantify its accuracy.
As its non linear, R squared doesnt seem liek a good method of assessing accuracy, but im unsure what i should use.
Appreciate any advice on this.
In these cases, what you can usually do is to assess the performance of the model against a test or hold-out set (not used during the construction of the model), using a evaluation metric.
For regression problems (like the ones you are describing) there are several evaluation metrics available. The most common ones are MAE (Mean Absolute Error) and RMSE (Root Mean Squared Error)
To fully understand how good the performance of your model is, you can then compare it against other models, or against simple baselines (like predicting always the average price, or returning the price of the most similar car in the training set).

Resources