What should I use to make a prediction model for the following problem:
I have a dataset of several hundred users with 20+ features (variables that correlate with each other, with some values missing), recorded over a certain time frame for each user.
I managed to fill in the missing data with multivariate imputation using sklearn's IterativeImputer, but it did not take into account how the data changes over time (the progress within the time frame).
So now I need to make a model to predict future values based on the history of the user and other users with similar progress (vectors).
As the title says, I tried to use AutoML on Google Cloud Platform to predict a rare outcome.
For example, suppose I have 5 types of independent variables: age, living area, income, family size, and gender. I want to predict a rare event called "purchase".
Purchases are very rare, because for 10,000 data points, I will only get 3-4 purchases. Fortunately, I got loads more than just 10,000 data points. (I got 100 million data points)
I have tried to use AutoML to find the best model, but because the outcome is so rare, the model simply predicts 0 purchases for every combination of these 5 variables. How can I solve this problem in AutoML?
In Cloud AutoML, the model predictions and the model evaluation metrics depend on the confidence threshold that is set. By default, the confidence threshold is 0.5. This value can be changed in the “Evaluate” tab of the “Models” section. To evaluate your model, change the confidence threshold and see how precision and recall are affected. The best confidence threshold depends on your use case. In your case, recall should be prioritized (which results in fewer false negatives) so that the rare purchases are actually detected.
Also, the training data should contain a comparable number of examples from each class in the target variable so that the model can predict values with higher confidence. Since your training data is highly skewed, it needs preprocessing such as resampling to handle the imbalance.
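As a rough illustration of the two ideas above (lowering the confidence threshold and resampling), here is a minimal Python sketch. It does not use the AutoML API; the function names, the `purchase` label key, the 10:1 ratio, and the toy scores are all assumptions made for this example.

```python
import random


def precision_recall(y_true, scores, threshold):
    """Precision and recall when predicting positive for score >= threshold."""
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    fn = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < threshold)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def downsample_majority(rows, label_key="purchase", ratio=10, seed=0):
    """Keep every positive row and at most `ratio` negative rows per positive."""
    rng = random.Random(seed)
    positives = [r for r in rows if r[label_key] == 1]
    negatives = [r for r in rows if r[label_key] == 0]
    keep = min(len(negatives), ratio * len(positives))
    return positives + rng.sample(negatives, keep)


# Toy labels and model scores (illustrative only, not real AutoML output).
y_true = [0, 0, 0, 0, 0, 0, 1, 0, 1, 0]
scores = [0.01, 0.02, 0.05, 0.03, 0.10, 0.04, 0.30, 0.20, 0.45, 0.02]

# With the default threshold no row is predicted positive, so recall is 0.
p_default, r_default = precision_recall(y_true, scores, 0.5)
# A lower threshold recovers the rare positives in this toy example.
p_low, r_low = precision_recall(y_true, scores, 0.25)

# Rebalance a skewed toy dataset before training.
rows = [{"purchase": 1}] * 2 + [{"purchase": 0}] * 100
balanced = downsample_majority(rows)
```

Downsampling changes the class prior, so predicted probabilities from a model trained on the rebalanced data are no longer calibrated to the original purchase rate.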
I know this may be a basic question but I want to know if I am using the train, test split correctly.
Say I have data that ends at 2019, and I want to predict values in the next 5 years.
The graph I produced is provided below:
My training data covers 1996-2014 and my test data covers 2014-2019. The model's predictions fit the test data very well. I then used the test data to make predictions for 2019-2024.
Is this the correct way to do it, or should my predictions also be for 2014-2019, just like the test data?
The test/validation data is there to help you evaluate candidate models and decide which one to use. Once you have decided on a model, you should retrain it on the whole dataset (1996-2019) so that you do not lose potentially valuable information from 2014-2019. Keep in mind that when working with time series, the most recent part of the series usually matters more for the prediction than older values.
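The workflow described above (evaluate on a chronological holdout, then refit on everything before forecasting) could look roughly like the following sketch. It uses a plain least-squares trend line as a stand-in for whatever model is actually being used, and the series is synthetic placeholder data, not the asker's.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict(model, xs):
    slope, intercept = model
    return [slope * x + intercept for x in xs]


def mean_abs_error(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Synthetic placeholder series for 1996-2019.
years = list(range(1996, 2020))
values = [2.0 * (y - 1996) + 5.0 for y in years]

# Step 1: chronological split, used only to judge the model.
train = [(y, v) for y, v in zip(years, values) if y <= 2014]
test = [(y, v) for y, v in zip(years, values) if y > 2014]
candidate = fit_line([y for y, _ in train], [v for _, v in train])
holdout_mae = mean_abs_error(
    [v for _, v in test], predict(candidate, [y for y, _ in test])
)

# Step 2: once the model is chosen, refit on ALL data (1996-2019).
final_model = fit_line(years, values)

# Step 3: forecast the unseen future (2020-2024) with the refitted model.
forecast = predict(final_model, list(range(2020, 2025)))
```

The holdout error from step 1 is the estimate of how well the approach generalizes; the forecast in step 3 has no ground truth to compare against yet.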
I used a K-means clustering model in BigQuery ML to detect anomalies.
Datasets information
date DATE
trade_id INT
trade_name STRING
agent_id INT
agent_name STRING
total_item INT
Mapping: one trade can have multiple agents, depending on the date.
The model was trained on sum(total_item), using the following columns:
trade_id
trade_name
agent_id
agent_name
Number of cluster: 4
I need to find the anomalies for each trade and agent based on date.
The model is trained on the given data, and distance_from_closest_centroid is calculated for each trade and agent per date. The rows with the largest distance are considered anomalies.
Questions
1. How do I find the number of clusters to use for the model (e.g. using the elbow method to select the smallest suitable number of clusters)?
2. How should the model be built when trade data is added on a daily basis? Is it possible to build the model incrementally every day?
As the question was updated, I will sum up our discussion as an answer to further contribute to the community.
Regarding your first question, "How do I find the number of clusters to use for the model (e.g. using the elbow method to select the smallest suitable number of clusters)?":
According to the documentation, if you omit the num_clusters option, BigQuery ML will choose a reasonable default based on the total number of rows in the training data. However, if you want to select a better value yourself, you can perform hyperparameter tuning, which is the process of selecting one (or a set) of optimal hyperparameters for a learning algorithm, in your case K-means within BigQuery ML. To determine the ideal number of clusters, run the CREATE MODEL query for different values of num_clusters, then compare the error measure across the runs and pick the value where it is best. You can find the error measures in the Evaluation tab of the model, which shows the Davies–Bouldin index and the mean squared distance. A lower Davies–Bouldin index indicates better-separated clusters, while the mean squared distance keeps decreasing as clusters are added, so for that metric look for the elbow point rather than the absolute minimum.
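The tuning loop described above could be scripted along these lines. This is only a sketch: the helper names, the model and table names, and the metric values are hypothetical placeholders, and the queries are merely built as strings here rather than sent to BigQuery. It assumes ML.EVALUATE on a K-means model reports davies_bouldin_index and mean_squared_distance, as shown in the Evaluation tab.

```python
def create_model_sql(model_name, source_table, num_clusters):
    """Build a CREATE MODEL statement for one candidate number of clusters."""
    return (
        f"CREATE OR REPLACE MODEL `{model_name}_k{num_clusters}`\n"
        f"OPTIONS(model_type='kmeans', num_clusters={num_clusters}) AS\n"
        f"SELECT * FROM `{source_table}`"
    )


def evaluate_sql(model_name, num_clusters):
    """Build the query that returns the evaluation metrics for that model."""
    return f"SELECT * FROM ML.EVALUATE(MODEL `{model_name}_k{num_clusters}`)"


def best_k(davies_bouldin_by_k):
    """Pick the k with the lowest Davies-Bouldin index (lower is better)."""
    return min(davies_bouldin_by_k, key=lambda k: davies_bouldin_by_k[k])


# Hypothetical dataset/model/table names, for illustration only.
candidates = range(2, 7)
queries = [
    (create_model_sql("mydataset.trade_kmeans", "mydataset.trade_features", k),
     evaluate_sql("mydataset.trade_kmeans", k))
    for k in candidates
]

# Placeholder metric values; in practice these would come from running
# each evaluate query and reading davies_bouldin_index from the result.
davies_bouldin_by_k = {2: 1.9, 3: 1.4, 4: 1.1, 5: 1.3, 6: 1.6}
chosen_k = best_k(davies_bouldin_by_k)
```

Each query string would be run with whichever BigQuery client or console is already in use; that part is deliberately left out of the sketch.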
Your second question was "How should the model be built when trade data is added on a daily basis? Is it possible to build the model incrementally every day?"
K-means is an unsupervised learning algorithm, so you train your model on your current data and store it in a dataset. The trained model can then be applied to new data using ML.PREDICT, which assigns each new row to the cluster it belongs to.
As a bonus, the BigQuery ML documentation also explains how K-means can be used to detect anomalies in data.
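Once each row has a distance to its closest centroid, the "largest distance is the anomaly" rule from the question can be applied per day with a few lines of Python. This is a sketch under assumptions: the rows are taken to be already exported from BigQuery as dictionaries, the distance_from_closest_centroid key follows the question's naming, and the choice of `top_n` is arbitrary.

```python
from collections import defaultdict


def top_anomalies_per_date(rows, top_n=1,
                           distance_key="distance_from_closest_centroid"):
    """For each date, return the `top_n` rows farthest from their centroid."""
    by_date = defaultdict(list)
    for row in rows:
        by_date[row["date"]].append(row)
    return {
        date: sorted(group, key=lambda r: r[distance_key], reverse=True)[:top_n]
        for date, group in by_date.items()
    }


# Toy rows (illustrative values only).
rows = [
    {"date": "2020-01-01", "trade_id": 1, "agent_id": 10,
     "distance_from_closest_centroid": 0.4},
    {"date": "2020-01-01", "trade_id": 1, "agent_id": 11,
     "distance_from_closest_centroid": 3.2},
    {"date": "2020-01-02", "trade_id": 2, "agent_id": 10,
     "distance_from_closest_centroid": 0.7},
    {"date": "2020-01-02", "trade_id": 2, "agent_id": 12,
     "distance_from_closest_centroid": 0.5},
]

anomalies = top_anomalies_per_date(rows)
```

A fixed `top_n` always flags something, even on a perfectly normal day, so a distance cutoff may be a better fit if anomalies are expected to be absent on most dates.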
UPDATE:
Regarding your question about retraining the model:
Question: "I want to rebuild the model because new trade information has to be updated in my existing model. In this case is this possible to append the model with only two month of data or should we need to rebuild the entire model?"
Answer: You would have to retrain the whole model when new relevant data arrives. It is not possible to append only two months of new data to the model. However, you can and should use the warm_start option so that retraining starts from your existing model.
As per @Alexandre Moraes:
If you omit num_clusters when using K-means, BigQuery ML will choose a reasonable number based on the number of rows in the training data. In addition, you can use hyperparameter tuning to determine an optimal number of clusters: run the CREATE MODEL query for different values of num_clusters, find the error measure for each, and pick the value at which the error is lowest.
I loaded a dataset with 156 variables for a project. The goal is to build a model that predicts a test data set. I am confused about where to start. Normally I would start with a basic linear regression model, but with 156 columns/variables, how should one go about building a model? Thank you!
The question here is pretty open ended.
You need to confirm whether you are solving for regression or classification.
You need to go through some descriptive statistics of your data set to find out what kinds of values it contains. Are there outliers or missing values? Are there columns whose values are in the billions alongside columns whose values are small fractions?
If you have categorical data, what types of categories do you have? What is the frequency count of each categorical value?
Clean the data accordingly (if required).
After this, you may want to examine the correlation among these 156 variables (via Pearson's correlation or a chi-square test, depending on the data types of the variables) and see how strongly they are related.
You may then choose to drop certain variables after looking at the correlations, or perform a PCA (which retains as much of the dataset's variance as possible) to reduce the variables to fewer dimensions.
You may then fit regression or classification models (depending on your need), starting with a simple model and then adjusting as you work on improving your accuracy (or minimizing the loss).
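The correlation-based pruning step mentioned above can be sketched in plain Python as follows. The 0.95 cutoff, the function names, and the toy columns are assumptions for illustration; with real data a library such as pandas would normally do the same job.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a constant column has no defined correlation
    return cov / (sd_x * sd_y)


def drop_correlated(columns, threshold=0.95):
    """Greedily keep a column only if it is not highly correlated
    (|r| >= threshold) with any column already kept."""
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Toy numeric columns (illustrative only).
columns = {
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],  # exact multiple of "a"
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
}
kept = drop_correlated(columns)
```

The greedy pass depends on column order: of two highly correlated columns, whichever comes first is the one that survives.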
I have around 1000 data points and each data point belongs to a specific user. In total I have 80 users, so each user has around 12 data points. I can do leave one user out cross-validation by using LeaveOneGroupOut from scikit-learn.
But now I would like to use leave-one-out cross-validation, i.e. using only one data point in the test set (instead of a user). But instead of using the full remaining training set, I would like to use a slightly different training set: If data point n from user x is in the test set, then the training set should consist of the data points of all other users plus data points 1,2,...n-1 of user x. If data point 1 from user x is in the test set, then no data point from this user is in the training set.
How can this be done? I'm using a Pipeline with RandomizedSearchCV and SVM, so I would be very happy if there is a solution like LeaveOneGroupOut which I can pass to these methods.
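One possible approach, sketched below, relies on the fact that scikit-learn's `cv` parameter also accepts an iterable of (train, test) index splits. The generator name is hypothetical, and the sketch assumes that the rows of each user appear in chronological order in the data, so that "data points 1,2,...n-1 of user x" are simply that user's rows with a smaller row index.

```python
def expanding_user_splits(groups):
    """Yield (train_indices, test_indices) with one test row per split.

    The training set holds every row of all other users plus the rows of
    the test row's user that come before the test row.
    """
    n = len(groups)
    for test_i in range(n):
        user = groups[test_i]
        train = [j for j in range(n) if groups[j] != user or j < test_i]
        yield train, [test_i]


# Toy user labels, one per row (illustrative only).
groups = ["x", "x", "y", "y", "x"]
splits = list(expanding_user_splits(groups))
```

The materialized list could then be passed as `cv=splits` to RandomizedSearchCV (a list rather than the bare generator, since a generator can only be consumed once). With around 1000 rows this produces around 1000 splits per parameter setting, so the search may be slow.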