I have around 1000 data points and each data point belongs to a specific user. In total I have 80 users, so each user has around 12 data points. I can do leave-one-user-out cross-validation by using LeaveOneGroupOut from scikit-learn.
But now I would like to use leave-one-out cross-validation, i.e. using only one data point in the test set (instead of a user). But instead of using the full remaining training set, I would like to use a slightly different training set: If data point n from user x is in the test set, then the training set should consist of the data points of all other users plus data points 1,2,...n-1 of user x. If data point 1 from user x is in the test set, then no data point from this user is in the training set.
How can this be done? I'm using a Pipeline with RandomizedSearchCV and an SVM, so I would be very happy if there is a solution like LeaveOneGroupOut that I can pass to these methods.
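As far as I know there is no built-in splitter for this exact scheme, but RandomizedSearchCV accepts either a plain iterable of (train, test) index pairs or any object with split and get_n_splits methods. A minimal sketch of such a custom splitter (the class name is made up, and it assumes that within each user the data points appear in the arrays in their 1, 2, ..., n order):

import numpy as np

class LeaveOneOutWithUserHistory:
    """For the test sample at row i belonging to user g, train on every sample
    of the other users plus the rows of user g that come before row i."""

    def split(self, X, y=None, groups=None):
        groups = np.asarray(groups)
        idx = np.arange(len(groups))
        for i in idx:
            earlier_same_user = idx[(groups == groups[i]) & (idx < i)]
            other_users = idx[groups != groups[i]]
            yield np.concatenate([other_users, earlier_same_user]), np.array([i])

    def get_n_splits(self, X=None, y=None, groups=None):
        return len(np.asarray(groups))

# Hypothetical usage with a pipeline and a parameter distribution of your own:
# search = RandomizedSearchCV(pipeline, param_distributions, cv=LeaveOneOutWithUserHistory())
# search.fit(X, y, groups=user_ids)   # one user id per sample, as with LeaveOneGroupOut

Because the groups array is forwarded by fit, this drops into RandomizedSearchCV the same way LeaveOneGroupOut does.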
Related
What should I use to make a prediction model for the following problem:
I have a dataset of several hundred users with 20+ features (variables that correlate with each other, but some of the values are missing) in a certain time frame for each user.
I managed to do multivariate imputation to fill in the missing data using sklearn's IterativeImputer (but it did not take the dynamic (progress over time) structure of the data into account).
So now I need to make a model to predict future values based on the history of the user and other users with similar progress (vectors).
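For reference, a minimal sketch of the IterativeImputer step mentioned above (the array is made up); note that it models each feature from the other features and ignores any temporal ordering of the rows, which matches the limitation described:

import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 - activates IterativeImputer
from sklearn.impute import IterativeImputer

# Made-up feature matrix with missing values encoded as np.nan
X = np.array([[1.0, 2.0, np.nan],
              [3.0, np.nan, 6.0],
              [np.nan, 5.0, 9.0],
              [7.0, 8.0, 12.0]])

imputer = IterativeImputer(max_iter=10, random_state=0)
X_filled = imputer.fit_transform(X)   # each feature is modelled from the others, iteratively
print(X_filled)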
I've trained a Keras LSTM classification model on characters, saved its model architecture and weights, and now want to load it into a separate application that I can run separately from the training system, stick a REST endpoint on it, and then be able to make predictions via REST...
I haven't found - maybe poor googlefu - references to how other people are doing it, and the main vagueness I'm running into is how to load the original text index & the corresponding labels index.
i.e. the index of 1="a",2="g",3=" ",4="b", and the "original" labels of ["green","blue","red","orange"] prior to the label being 1-hot encoded...
So this is my understanding:
the weights are based on the numerical inputs that were given to the originally trained model
the numerical inputs & the generated index are based on the specific data set that was used for training
the outputs from the model that represent the classification are based on the order in which the original data set's labels were added - i.e. if green was in position 0 of the training labels, and is in position 1 of the actual "runtime" labels, then that's not gonna work... true?
which means that reusing the model + weights requires not only the actual model architecture & weights, but also the indices of the input & output data...
Is that correct? Or am I missing something major?
Because the thing then is... IF this is the case, is there a way to save & load the indices other than doing it manually?
Because if it needs to be done manually, then we kinda lose the benefits of keras' preprocessing functionality (like the Tokenizer and np_utils.to_categorical) that we WERE able to use in the training system...
Does anybody have a pattern for doing this sort of activity?
I'm currently doing something along the lines of the following (sketched in code after this list):
save the X index & the Y label array during training together with the model architecture & weights
in the prediction application, load and recreate the model with the architecture & weights
have a custom class to tokenise input words based on the X index, pad it to the right length, etc
make the prediction
take the prediction and map the highest probability item to the original Y labels array, and therefore figure out what the predicted label is
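Roughly what that pattern can look like end to end (a hedged sketch assuming tf.keras, a char-level Tokenizer, and pickle for the preprocessing state; the toy texts, labels, and file names are made up):

import pickle
import numpy as np
from tensorflow.keras.models import Sequential, load_model
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

# --- training side -------------------------------------------------------
texts = ["green apple", "blue sky", "red car", "orange juice"]   # toy data
labels = ["green", "blue", "red", "orange"]                      # label order matters
y = to_categorical(np.arange(len(labels)), num_classes=len(labels))

tokenizer = Tokenizer(char_level=True)
tokenizer.fit_on_texts(texts)
X = pad_sequences(tokenizer.texts_to_sequences(texts), maxlen=20)

model = Sequential([
    Embedding(input_dim=len(tokenizer.word_index) + 1, output_dim=8),
    LSTM(16),
    Dense(len(labels), activation="softmax"),
])
model.compile(loss="categorical_crossentropy", optimizer="adam")
model.fit(X, y, epochs=1, verbose=0)

model.save("model.h5")                        # architecture + weights
with open("preprocessing.pkl", "wb") as f:    # character index + label order
    pickle.dump({"tokenizer": tokenizer, "labels": labels}, f)

# --- prediction side (the separate application) ---------------------------
model = load_model("model.h5")
with open("preprocessing.pkl", "rb") as f:
    state = pickle.load(f)
tokenizer, labels = state["tokenizer"], state["labels"]

seq = pad_sequences(tokenizer.texts_to_sequences(["blue skies"]), maxlen=20)
pred = model.predict(seq, verbose=0)
print(labels[int(np.argmax(pred[0]))])        # map argmax back to the label name

The Tokenizer also has to_json()/tokenizer_from_json() if you prefer a text format over pickle.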
I am using z-score to normalize my data before training my model. When I do predictions on a daily basis, I tend to have very few observations each day, perhaps just a dozen or so. My question is, can I normalize the test data just by itself, or should I attach it to the entire training set to normalize it?
The reason I am asking is that the normalization is based on the mean and std_dev, which obviously might look very different if the dataset consists of only a few observations.
You need to have all of your data in the same units. Among other things, this means that you need to use the same normalization transformation for all of your input. You don't need to include the new data in the training per se -- however, keep the parameters of the normalization (for a z-score, the training mean and standard deviation; in general, the m and b of y = mx + b) and apply those to the test data as you receive them.
It's certainly not a good idea to predict on a test set using a model trained with a very different data distribution. I would use the same mean and std of your training data to normalize your test set.
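Concretely, with scikit-learn this amounts to fitting the scaler on the training data once and only calling transform on each small daily batch; the shapes and numbers below are illustrative:

import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.random.randn(1000, 5) * 3.0 + 10.0   # historical training features
X_today = np.random.randn(12, 5) * 3.0 + 10.0     # a dozen new daily observations

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)    # mean/std learned from training data only

# Reuse the stored training mean and std for the small daily batch;
# never refit the scaler on the handful of new points.
X_today_scaled = scaler.transform(X_today)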
I have implemented a recommender system based upon matrix factorization techniques. I want to evaluate it.
I want to use 10-fold-cross validation with All-but-one protocol (https://ai2-s2-pdfs.s3.amazonaws.com/0fcc/45600283abca12ea2f422e3fb2575f4c7fc0.pdf).
My data set has the following structure:
user_id,item_id,rating
1,1,2
1,2,5
1,3,0
2,1,5
...
It's confusing for me to think about how the data is going to be split, because I can't put some triples (user, item, rating) in the test set. For example, if I select the triple (2,1,5) for the test set and this is the only rating user 2 has made, there won't be any other information about this user, and the trained model won't predict any values for him.
Considering this scenario, how should I do the splitting?
You didn't specify a language or toolset so I cannot give you a concise answer that is 100% applicable to you, but here's the approach I took to solve this same exact problem.
I'm working on a recommender system using Treasure Data (i.e. Presto) and implicit observations, and ran into a problem where some users and items were not present in my matrix. I had to rewrite the algorithm that splits the observations into train and test so that every user and every item would be represented in the training data. For the description of my algorithm I assume there are more users than items; if this is not true for you, just swap the two. Here's my algorithm (a rough code sketch of it follows below).
1. Select one observation for each user.
2. For each item that has only one observation and has not already been selected in the previous step, select one observation.
3. Merge the results of the previous two steps together. This should produce a set of observations that covers all of the users and all of the items.
4. Calculate how many observations you need to fill your training set (generally 80% of the total number of observations).
5. Calculate how many observations are in the merged set from step 3.
6. The difference between steps 4 and 5 is the number of remaining observations needed to fill the training set; randomly select that many of the remaining observations.
7. Merge the sets from steps 3 and 6: this is your training set.
8. The remaining observations are your testing set.
As I mentioned, I'm doing this using Treasure Data and Presto, so the only tools I have at my disposal are SQL, common table expressions, temporary tables, and Treasure Data workflow.
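The original implementation is pure SQL, but for readers working in Python here is a rough pandas sketch of steps 1-8 above (the file and column names follow the question's user_id,item_id,rating layout and are otherwise assumptions):

import pandas as pd

df = pd.read_csv("ratings.csv")          # columns: user_id, item_id, rating
rng = 42                                 # seed so the split is reproducible

# Step 1: one observation per user
per_user = df.groupby("user_id").sample(n=1, random_state=rng)

# Step 2: items with a single observation that step 1 did not already cover
item_counts = df["item_id"].value_counts()
single_items = item_counts[item_counts == 1].index
step2 = df[df["item_id"].isin(single_items) & ~df.index.isin(per_user.index)]

# Step 3: merge the two seed sets
seed = pd.concat([per_user, step2])

# Steps 4-6: top the seed set up to ~80% of all observations with random rows
target = int(0.8 * len(df))
remaining = df.drop(seed.index)
filler = remaining.sample(n=min(max(target - len(seed), 0), len(remaining)), random_state=rng)

# Steps 7-8: training set = seed + filler, everything else is the test set
train = pd.concat([seed, filler])
test = df.drop(train.index)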
You're quite correct in your basic logic: if you have only one observation in a class, you must include that in the training set for the model to have any validity in that class.
However, dividing the input into these classes depends on the interactions among various observations. Can you identify classes of data, such as the "only rating" issue you mentioned? As you find other small classes, you'll also need to ensure that you have enough of those observations in your training data.
Unfortunately, this is a process that's tricky to automate. Most one-time applications simply have to hand-pick those observations from the data and then distribute the others per the normal divisions. This does have the drawback that the special cases are over-represented in the training set, which can detract somewhat from the normal cases in training the model.
Do you have the capability of tuning the model as you encounter later data? This is generally the best way to handle sparse classes of input.
Collaborative filtering (matrix factorization) can't produce a good recommendation for an unseen user with no feedback. Nevertheless, an evaluation should consider this case and take it into account.
One thing you can do is to report performance separately: for all test users, for just the test users with some feedback, and for just the unseen users with no feedback.
So I'd say keep the train/test split random but evaluate separately for unseen users.
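As a rough illustration with the question's user_id,item_id,rating layout, that split-out report could look like the following pandas sketch (the predicted column is a placeholder for your model's output, and the file names are assumptions):

import numpy as np
import pandas as pd

train = pd.read_csv("train.csv")             # columns: user_id, item_id, rating
test = pd.read_csv("test.csv")
test["predicted"] = 2.5                      # stand-in for the model's predictions

def rmse(frame):
    return float(np.sqrt(((frame["rating"] - frame["predicted"]) ** 2).mean()))

seen = test["user_id"].isin(train["user_id"])
print("all test users:    ", rmse(test))
print("users with history:", rmse(test[seen]))
print("unseen users:      ", rmse(test[~seen]))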
I'm using the Orange GUI, and trying to perform cross-validation. My data has 8 different groups (specified by a variable in the input data), and I'd like each fold to hold out a different group. Is this possible to do using Orange? I can select the number of folds for cross-validation, but I don't see any way of determining which data is in each one.
Cross-validation does random sampling. I don't think what you seek is possible out of the box.
If you really want to have it honor the splits you made beforehand (according to some input variable), and you aren't afraid of some manual labor, you can use the Select Rows widget to select the rows of one group (i.e. Matching Data), pass that into Test & Score as Test Data, and have all the rest of the data (i.e. Unmatched Data) as training Data. This way, you get the cross-validation for a single fold (group). Repeat, and finally average, to obtain results for all folds.
If you know some Python, there's always the Orange scripting layer you can fall back to.
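If you do drop down to Python, here is a minimal sketch of the one-fold-per-group idea using scikit-learn's LeaveOneGroupOut rather than Orange's own API (the data below is made up):

import numpy as np
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score
from sklearn.svm import SVC

X = np.random.randn(80, 5)
y = np.random.randint(0, 2, size=80)
groups = np.repeat(np.arange(8), 10)          # the grouping variable from the input data

logo = LeaveOneGroupOut()                     # one fold = one held-out group
scores = cross_val_score(SVC(), X, y, cv=logo, groups=groups)
print(scores)                                 # one score per group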