Splitting data set into training and testing sets on recommender systems - machine-learning

I have implemented a recommender system based upon matrix factorization techniques. I want to evaluate it.
I want to use 10-fold cross-validation with the all-but-one protocol (https://ai2-s2-pdfs.s3.amazonaws.com/0fcc/45600283abca12ea2f422e3fb2575f4c7fc0.pdf).
My data set has the following structure:
user_id,item_id,rating
1,1,2
1,2,5
1,3,0
2,1,5
...
I'm confused about how the data should be split, because some (user,item,rating) triples can't go into the testing set. For example, if I move the triple (2,1,5) to the testing set and it is the only rating user 2 has made, the training set will contain no other information about this user and the trained model won't be able to predict any values for him.
Considering this scenario, how should I do the splitting?

You didn't specify a language or toolset so I cannot give you a concise answer that is 100% applicable to you, but here's the approach I took to solve this same exact problem.
I'm working on a recommender system using Treasure Data (i.e. Presto) and implicit observations, and ran into a problem with my matrix where some users and items were not present. I had to re-write the algorithm to split the observations into train and test so that every user and every item would be represented in the training data. For the description of my algorithm I assume there are more users than items. If this is not true for you then just swap the two. Here's my algorithm.
1. Select one observation for each user.
2. For each item that has only one observation and has not already been selected in the previous step, select one observation.
3. Merge the results of the previous two steps. This should produce a set of observations that covers all of the users and all of the items.
4. Calculate how many observations you need to fill your training set (generally 80% of the total number of observations).
5. Calculate how many observations are in the merged set from step 3. The difference between steps 4 and 5 is the number of remaining observations necessary to fill the training set.
6. Randomly select enough of the remaining observations to fill the training set.
7. Merge the sets from steps 3 and 6: this is your training set.
8. The remaining observations are your testing set.
As I mentioned, I'm doing this using Treasure Data and Presto so the only tool I have at my disposal is SQL, common table expressions, temporary tables, and Treasure Data workflow.
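The answer above was implemented in SQL on Presto, and that code is not shown. As a rough illustration only, the same steps can be sketched in plain Python. The function name `coverage_split`, its parameters, and the tuple layout are assumptions made for this sketch. It also reads step 2 a little more broadly than written: it picks one observation for every item that step 1 left uncovered, which is what guarantees full item coverage.

```python
import random

def coverage_split(ratings, train_frac=0.8, seed=0):
    """Split (user_id, item_id, rating) triples so that every user and
    every item appears at least once in the training set.
    Hypothetical helper, not the original SQL implementation."""
    rng = random.Random(seed)
    order = list(range(len(ratings)))
    rng.shuffle(order)

    core = []                      # indices forced into the training set
    seen_users = set()
    for i in order:                # step 1: one observation per user
        user = ratings[i][0]
        if user not in seen_users:
            seen_users.add(user)
            core.append(i)

    seen_items = {ratings[i][1] for i in core}
    for i in order:                # step 2: one observation per uncovered item
        item = ratings[i][1]
        if item not in seen_items:
            seen_items.add(item)
            core.append(i)

    core_set = set(core)                        # step 3: merged coverage set
    target = round(train_frac * len(ratings))   # step 4: desired training size
    needed = max(target - len(core_set), 0)     # step 5: how many more to draw
    rest = [i for i in order if i not in core_set]
    train_idx = core + rest[:needed]            # steps 6-7 (rest is already shuffled)
    test_idx = rest[needed:]                    # step 8
    return [ratings[i] for i in train_idx], [ratings[i] for i in test_idx]
```

If the coverage set alone is larger than the requested training fraction, the sketch simply keeps it whole, so the training set can end up larger than `train_frac` on very sparse data.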

You're quite correct in your basic logic: if you have only one observation in a class, you must include that in the training set for the model to have any validity in that class.
However, dividing the input into these classes depends on the interactions among various observations. Can you identify classes of data, such as the "only rating" issue you mentioned? As you find other small classes, you'll also need to ensure that you have enough of those observations in your training data.
Unfortunately, this is a process that's tricky to automate. Most one-time applications simply have to hand-pick those observations from the data, and then distribute the others per normal divisions. This does have a problem that the special cases are over-represented in the training set, which can detract somewhat from the normal cases in training the model.
Do you have the capability of tuning the model as you encounter later data? This is generally the best way to handle sparse classes of input.

Collaborative filtering (matrix factorization) can't produce a good recommendation for an unseen user with no feedback. Nevertheless, an evaluation should take this case into account.
One thing you can do is report performance three ways: for all test users, for test users with some feedback only, and for unseen users with no feedback only.
So I'd say keep the train/test split random but evaluate unseen users separately.
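As a minimal sketch of the separate reporting suggested in this answer, the snippet below computes RMSE for all test ratings, for users that also occur in the training set, and for unseen users. The names `rmse_by_user_group` and `predict` are hypothetical; `predict(user, item)` stands for whatever scoring function the trained model exposes, including its fallback for unknown users.

```python
import math

def rmse(pairs):
    """Root mean squared error over (actual, predicted) pairs; None if empty."""
    pairs = list(pairs)
    if not pairs:
        return None
    return math.sqrt(sum((a - p) ** 2 for a, p in pairs) / len(pairs))

def rmse_by_user_group(train, test, predict):
    """train/test: lists of (user_id, item_id, rating) triples.
    predict: assumed callable (user_id, item_id) -> predicted rating."""
    train_users = {u for u, _, _ in train}
    scored = [(u, r, predict(u, i)) for u, i, r in test]
    return {
        "all": rmse((r, p) for _, r, p in scored),
        "seen": rmse((r, p) for u, r, p in scored if u in train_users),
        "unseen": rmse((r, p) for u, r, p in scored if u not in train_users),
    }
```

Reporting the three numbers side by side makes it visible how much of the overall error comes from cold-start users rather than from the factorization itself.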

Related

Feature selection needed before train-test split due to the small size of test set and small size of instance. What should be done?

I am working on an NLP project where I need to predict correct classes of short sentences -- which are instances in my case. I am using root-words as features. My dataset is not too large (about 6000 instances/sentences). Since there are too many features I used MI based feature-selection method to reduce the number of features to about 1000.
My problem is: if I split the dataset and then do feature selection on the training set only, the model/classifier is built from features available in the training set only, and most of those features are absent from the testing set. As a result, the model may perform very badly.
What should I do to resolve this issue?
I am currently selecting features first and then doing CV. I know that this approach may cause data leakage from test set to train set. But I'm still doing that because of the aforementioned issue.

Can / should I use past (e.g. monthly) label columns from a database as features in an ML prediction (no time-series!)?

The question: Is it normal / usual / professional to use the past of the labels as features?
I could not find anything reliable on this, although it is a basic question.
Edited: Please mind, this is not a time-series question, I have deleted the time-series tag now and I changed the question. This question is about features that change regularly over time, yes! But we do not create a time-series from this, as there are many other features as well which are not like the label and are also important features in the model. Now please think of using past labels as normal features without a time-series approach.
I am trying to predict a certain month of data that is available monthly, which makes it a time series, but I am not treating it as one; it is just monthly available data for various features.
It is a classification model, and I want to predict the label column for a selected month of that time series. The question is about the months preceding the selected label month.
I do not want to drop the past months of the label just because they are "almost" a label (in other words, they were the label columns of the preceding models in time). I know the label's past, so why not consider it as a feature as well?
My predictions are of course much better when I add the past labels to the features. This is logical, since the labels usually do not change much from one month to the next and can therefore be predicted very well once the model has been fed the label's past. It would be strange not to use such "past labels" as features, as any simple time-series regression would then beat the ML model.
Example: Let's say I predict the IQ test result of a person, and I use her past IQ test results as features in addition to other normal "non-label" features like age, education, and so on. I use the first 11 months of "past labels" of a year as features in addition to my normal "non-label" features. I predict the label of the 12th month.
Predicting the label of the 12th month works much better if you add the past of the labels to the features - obviously. This is because the historical labels, if there are any, are of course better indicators of the final outcome than normal columns like age and education.
Possibly related p.s.:
p.s.1: In auto-regressive models, the past of the dependent variable can well be used as independent variable, see: https://de.wikipedia.org/wiki/Regressionsanalyse
p.s.2: In ML you can perhaps just try any features and take what gives you the best results, a bit like "Good question, try them [feature selection methods] all and see what works best" in https://machinelearningmastery.com/feature-selection-in-python-with-scikit-learn/ "If the features are relevant to the outcome, the model will figure out how to use them. Or most models will." The same is said in Does the feature selection matter for learning algorithm with regularization?
p.s.3: Also probably relevant is the problem of multicollinearity: https://statisticsbyjim.com/regression/multicollinearity-in-regression-analysis/ though multicollinearity is said to be no issue for the prediction: "Multicollinearity affects the coefficients and p-values, but it does not influence the predictions, precision of the predictions, and the goodness-of-fit statistics. If your primary goal is to make predictions, and you don’t need to understand the role of each independent variable, you don’t need to reduce severe multicollinearity."
It is perfectly possible and also good practice to include past label columns as features, though it depends on your question: do you want to explain the label only with other features (on purpose), or do you want to consider other and your past label columns to get the next label predicted, as a sort of adding a time-series character to the model without using a time-series?
The sequence in time is not even important, as long as all such monthly columns are shifted consistently by the same amount of time when moving to the predicting set. The model does not care that January and February are the same column type; for the model, every feature is isolated.
Example: You can perfectly well run a random forest model on various features, including past label columns that repeat the same column type again and again and only represent different months. Any month's column can be treated as an independent new feature in the ML model. The only thing that matters is to shift all of those monthly columns by exactly the same period to get a consistent predicting set. In other words, you should obviously not replace the January column with March when you go from a training set January-June to a predicting set February-July; you must replace January with February.
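To make the consistent shift concrete, here is a small sketch under assumed names (`lag_window`, `monthly_labels`). It builds the past-label features for one entity from a list of monthly labels: a training window covering January to June with July as the target, and a predicting window shifted by exactly one month, covering February to July, whose target (August) is not yet known.

```python
def lag_window(monthly_labels, first_month, n_lags):
    """Return (past-label features, target label) for one entity.
    monthly_labels: labels ordered by month, index 0 = January.
    The target is None when that month has not been observed yet."""
    features = monthly_labels[first_month:first_month + n_lags]
    target_pos = first_month + n_lags
    target = monthly_labels[target_pos] if target_pos < len(monthly_labels) else None
    return features, target

# Hypothetical labels for January..July of one person
labels = [100, 101, 99, 102, 103, 104, 105]

train_x, train_y = lag_window(labels, first_month=0, n_lags=6)  # Jan-Jun -> Jul
pred_x, pred_y = lag_window(labels, first_month=1, n_lags=6)    # Feb-Jul -> Aug (unknown)
```

Both windows have the same length and every column moves by the same one-month offset, so the first feature is always "six months before the target" regardless of which calendar month it is.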
Update 202301: model name is "walk-forward"
This model setup is called "walk-forward", see Why isn’t out-of-time validation more ubiquitous? --> option 3 almost at the bottom of the page.
I got this from a comment at Splitting Time Series Data into Train/Test/Validation Sets.
In the following, only a training and a testing set are shown. The source writes "validation set", but it is known that these terms get mixed up all over the place, see What is the Difference Between Test and Validation Datasets?, and it must mean the testing set in the usual understanding of the term.
Thus, with the right wording, it is:
This should be the best model for labels that become features in time.
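A walk-forward scheme of the kind described here can be sketched as a generator of period indices. This is only an illustration: the function name and parameters are assumptions, and it uses an expanding training window, whereas the linked description may equally intend a sliding window of fixed length.

```python
def walk_forward_splits(n_periods, min_train, horizon=1):
    """Yield (train_periods, test_periods) index lists.
    Each test block lies strictly after its training block in time;
    the training block grows by one period per step (expanding window)."""
    for end in range(min_train, n_periods - horizon + 1):
        yield list(range(end)), list(range(end, end + horizon))

# Six monthly periods, at least three months of history before the first test
for train_periods, test_periods in walk_forward_splits(n_periods=6, min_train=3):
    print(train_periods, "->", test_periods)
```

The metric computed on each test block can then be tracked over the steps, which is the "judging the model over time" idea discussed below.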
validation set in a "walk-forward" model?
As you can see in the model, no validation set is needed since the test data must be biased "forward" in time, that is the whole idea of predicting the "step forward in time", and any validation set would have to be in that same biased artificial future - which is already the past at the time of training, but the model does not know this.
The validation happens by default, without a needed dataset split, during the walk-forward, when the model learns again and again to predict the future and the output metrics can be put against each other. As the model is to predict the time-biased future, there is no need to prove that or how the artificial future is biased and sort of "overtrained by time". It is the aim of the model to have the validation in the artificial future and predict the real future as a last step only.
But then, why not still have a validation set on top of this, at least if it is just a small k-fold validation? It could play a role if the testing set has a few strong changes that happen in small time windows and that are still important to predict, or at least hint at, but that should also not be overtrained within each training step. The validation set would hit some of these time windows and might show whether the model can handle them well enough. Any method other than k-fold would shrink the power of the model too much. The more you take away from the testing set during training, the less it can predict the future.
Wrap up:
Try it out, and in doubt, leave the validation aside and judge upon the model by checking its metrics over time, during the "walk-forward". This model is not like the others.
Thus, in the end, you can, but you do not have to, split a k-fold validation from the testing set. That would look like:
After predicting a lot of known futures, the very last step in time is then the prediction of the unknown future.
This also answers Does the training+testing set have to be different from the predicting set (so that you need to apply a time-shift to ALL columns)?.

How to classify text with Knime

I'm trying to classify some data using knime with knime-labs deep learning plugin.
I have about 16,000 products in my DB, but I know the category of only about 700 of them.
I'm trying to classify as many as possible using some DM (data mining) technique. I've downloaded some plugins for KNIME, so now I have some deep learning tools as well as some text tools.
Here is my workflow, I'll use it to explain what I'm doing:
I'm transforming the product name into a vector, then applying it.
Afterwards I train a DL4J learner with DeepMLP. (I don't really understand it all; it was the one with which I thought I got the best results.) Then I try to apply the model to the same data set.
I thought I would get the predicted classes as a result. Instead I'm getting an output_activations column that seems to hold a pair of doubles. When I sort by this column, related data end up close to each other. But I was expecting to get the classes.
Here is a print of the result table, here you can see the output with the input.
In the column selection it uses just converted_document, and des_categoria is selected as the Label Column (learner node config). In the Predictor node I checked "Append SoftMax Predicted Label?".
The nom_produto is the text column that I'm trying to use to predict the des_categoria column, which is the product category.
I'm a real newbie at DM and DL. Any help with what I'm trying to do would be awesome. Also feel free to suggest learning material about what I'm attempting to achieve.
PS: I also tried to apply it into the unclassified data (17,000 products), but I got the same result.
I won't answer with a workflow on this one because it is not going to be a simple one. However, be sure to find the text mining example on the KNIME server, i.e. the one that makes use of the bag of words approach.
The task
Product mapping to categories should be a straight-forward data mining task because the information that explains the target variable is available in a quasi-exhaustive manner. Depending on the number of categories to train though, there is a risk that you might need more than 700 instances to learn from.
Some resources
Here are some resources, only the first one being truly specialised in text mining:
1. Introduction to Information Retrieval, in particular chapter 13;
2. Data Science for Business is an excellent introduction to data mining, including text mining (chapter 10), also do not forget the chapter about similarity (chapter 6);
3. Machine Learning with R has the advantage of being accessible enough (chapter 4 provides an example of text classification with R code).
Preprocessing
First, you will have to preprocess your product labels a bit. Use KNIME's text analytics preprocessing nodes for that purpose, that is after you've transformed the product labels with Strings to Document:
Case Convert, Punctuation Erasure and Snowball Stemmer;
you probably won't need Stop Word Filter, however, there may be quasi-stop words such as "product", which you may need to remove manually with Dictionary Filter;
Be careful not to use any of the following without testing their impact first: N Chars Filter (g may be a useful word), Number Filter (numbers may indicate quantities, which may be useful for classification).
Should you encounter any trouble with the relevant nodes (e.g. Punctuation Erasure can be surprisingly tricky because of the tokenizer), you can always apply String Manipulation with regex before converting the Strings to Document.
Keep it short and simple: the lookup table
You could build a lookup table based on the 700 training instances. The book Data mining techniques as well as resource (2) present this approach in some detail. If any model performs any worse than the lookup table, you should abandon the model.
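The referenced books are not quoted here, so the following is only one plausible reading of a lookup-table baseline: map each normalised product name seen in training to its most frequent category, and return a default for names never seen. The function names and the normalisation (trim plus lower-case) are assumptions of this sketch.

```python
from collections import Counter, defaultdict

def build_lookup(examples):
    """examples: iterable of (product_name, category) pairs.
    Returns a dict from normalised name to its most frequent category."""
    votes = defaultdict(Counter)
    for name, category in examples:
        votes[name.strip().lower()][category] += 1
    return {name: counts.most_common(1)[0][0] for name, counts in votes.items()}

def lookup_predict(table, name, default=None):
    """Predict by exact match on the normalised name; fall back to default."""
    return table.get(name.strip().lower(), default)
```

A common choice for `default` is the most frequent category in the training data, which gives the floor any real model has to beat.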
Nearest neighbors
Neural networks are probably overkill for this task.
Start with a K Nearest Neighbor node (applying a string distance such as Cosine, Levenshtein or Jaro-Winkler). This approach requires the least amount of data wrangling. At the very least, it will provide an excellent baseline model, so it is most definitely worth a shot.
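Outside KNIME, the same idea can be sketched in a few lines of Python. This is an illustration of k-nearest-neighbour classification with Levenshtein distance, not a description of how the KNIME node works internally; the function names are assumptions.

```python
from collections import Counter

def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def knn_predict(train, name, k=3):
    """train: list of (product_name, category). Majority vote among the
    k training names closest to `name`."""
    nearest = sorted(train, key=lambda ex: levenshtein(name, ex[0]))[:k]
    return Counter(category for _, category in nearest).most_common(1)[0][0]
```

With only a few hundred labelled names the brute-force scan over all training examples is cheap, so no index structure is needed for a first baseline.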
You'll need to tune the parameter k and to experiment with the distance types. The Parameter Optimization Loop pair will help you with optimizing k, you can include a Cross-Validation meta node inside of the said loop to obtain an estimate of the expected performance given k instead of only one point estimate per value of k. Use Cohen's Kappa as an optimization criterion, as proposed by the resource number (3) and available via the Scorer node.
After the parameter tuning, you'll have to evaluate the relevance of your model using yet another Cross-Validation meta node, then follow up with a Loop pair including Scorer to calculate the descriptives on performance metric(s) per iteration, finally use Statistics. Kappa is a convenient metric for this task because the target variable consists of many product categories.
Don't forget to test its performance against the lookup table.
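For readers who want to see what the Scorer node's Cohen's Kappa measures, here is a minimal sketch of the statistic: observed agreement corrected for the agreement expected by chance given the class frequencies. The function name is an assumption, and returning 0.0 in the degenerate case is a choice made for this sketch only.

```python
from collections import Counter

def cohens_kappa(actual, predicted):
    """Cohen's kappa for two equally long label sequences."""
    n = len(actual)
    observed = sum(a == p for a, p in zip(actual, predicted)) / n
    actual_freq = Counter(actual)
    predicted_freq = Counter(predicted)
    # Chance agreement: probability that both pick the same class independently
    expected = sum(actual_freq[c] * predicted_freq[c] for c in actual_freq) / (n * n)
    if expected == 1:
        return 0.0  # kappa is undefined here; 0.0 is an arbitrary convention
    return (observed - expected) / (1 - expected)
```

Because the chance term grows when one category dominates, kappa penalises a model that simply predicts the majority category, which plain accuracy does not.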
What next?
Should the lookup table or k-NN work well for you, then there's nothing else to add.
Should any of those approaches fail, you might want to analyse the precise cases on which it fails. In addition, the training set may be too small, so you could manually classify another few hundred or thousand instances.
If after increasing the training set size, you are still dealing with a bad model, you can try the bag of words approach together with a Naive Bayes classifier (see chapter 13 of the Information Retrieval reference). There is no room here to elaborate on the bag of words approach and Naive Bayes but you'll find the resources here above useful for that purpose.
One last note. Personally, I find KNIME's Naive Bayes node to perform poorly, probably because it does not implement Laplace smoothing. However, KNIME's R Learner and R Predictor nodes will allow you to use R's e1071 package, as demonstrated by resource (3).
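To show why Laplace smoothing matters for a bag-of-words model, here is a small multinomial Naive Bayes sketch in Python. It is not the KNIME node nor the e1071 implementation; the names `train_nb`, `predict_nb` and the parameter `alpha` are assumptions. With `alpha=1.0`, a word that never occurred in a category gets a small non-zero probability instead of wiping out the whole score.

```python
import math
from collections import Counter, defaultdict

def train_nb(examples):
    """examples: list of (tokens, category). Returns the counts the model needs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, category in examples:
        class_counts[category] += 1
        word_counts[category].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def predict_nb(model, tokens, alpha=1.0):
    """Return the category with the highest smoothed log-probability."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best, best_score = None, -math.inf
    for category, n_docs in class_counts.items():
        total_words = sum(word_counts[category].values())
        score = math.log(n_docs / total_docs)          # class prior
        for tok in tokens:
            if tok not in vocab:
                continue                               # skip words unseen in training
            score += math.log((word_counts[category][tok] + alpha) /
                              (total_words + alpha * len(vocab)))
        if score > best_score:
            best, best_score = category, score
    return best
```

In the tiny example used to check this sketch, the name "green juice" contains one word seen only under "fruit" and one seen only under "drink"; without smoothing both categories would score zero, while with smoothing the more frequent evidence decides.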

Training Random forest with different datasets gives totally different result! Why?

I am working with a dataset which contains 12 attributes including the timestamp and one attribute as the output. Also it has about 4000 rows. Besides there is no duplication in the records. I am trying to train a random forest to predict the output. For this purpose I created two different datasets:
ONE: Randomly chose 80% of data for the training and the other 20% for the testing.
TWO: Sort the dataset based on timestamp and then the first 80% for the training and the last 20% for the testing.
Then I removed the timestamp attribute from the both dataset and used the other 11 attributes for the training and the testing (I am sure the timestamp should not be part of the training).
RESULT: I am getting totally different result for these two datasets. For the first one AUC(Area under the curve) is 85%-90% (I did the experiment several times) and for the second one is 45%-50%.
I would appreciate it if someone could help me understand why I have this huge difference.
Also, I need the test dataset to contain the latest timestamps (as in the second experiment). Is there any way to select data from the rest of the dataset for training so as to improve the training?
PS: I already tested random selection from the first 80% of the timestamps and it didn't improve the performance.
First of all, it is not clear how exactly you're testing. Second, either way, you are doing the testing wrong.
RESULT: I am getting totally different result for these two datasets. For the first one AUC(Area under the curve) is 85%-90% (I did the experiment several times) and for the second one is 45%-50%.
Is this for the training set or the test set? If the test set, that means you have poor generalization.
You are doing it wrong because you are not allowed to tweak your model so that it performs well on the same test set, because it might lead you to a model that does just that, but that generalizes badly.
You should do one of two things:
1. A training-validation-test split
Keep 60% of the data for training, 20% for validation and 20% for testing in a random manner. Train your model so that it performs well on the validation set using your training set. Make sure you don't overfit: the performance on the training set should be close to that on the validation set, if it's very far, you've overfit your training set. Do not use the test set at all at this stage.
Once you're happy, train your selected model on the training set + validation set and test it on the test set you've held out. You should get acceptable performance. You are not allowed to tweak your model further based on the results you get on this test set, if you're not happy, you have to start from scratch.
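A random 60/20/20 split as described in option 1 can be sketched as follows. The function name and parameters are assumptions; the rows can be any records, for example tuples of the 11 attributes plus the output.

```python
import random

def train_val_test_split(rows, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle rows and cut them into train, validation and test parts."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = round(len(rows) * test_frac)
    n_val = round(len(rows) * val_frac)
    test = rows[:n_test]
    val = rows[n_test:n_test + n_val]
    train = rows[n_test + n_val:]
    return train, val, test
```

After model selection on the validation part, the final model would be refit on `train + val` and scored once on `test`, as the answer describes.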
2. Use cross validation
A popular form is 10-fold cross validation: shuffle your data and split it into 10 groups of equal or almost equal size. For each of the 10 groups, train on the other 9 and test on the remaining one. Average your results on the test groups.
You are allowed to make changes on your model to improve that average score, just run cross validation again after each change (make sure to reshuffle).
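The fold construction for option 2 can be sketched like this. The generator name and its parameters are assumptions; it only produces index lists, so it works with any model and data container.

```python
import random

def k_fold_indices(n, k=10, seed=0):
    """Yield (train_indices, test_indices) for k-fold cross validation.
    Indices are shuffled once, then dealt into k groups of (almost) equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        train = [j for f, fold in enumerate(folds) if f != held_out for j in fold]
        yield train, test
```

Averaging the test-fold scores over the k iterations gives the cross-validated estimate; passing a different `seed` after each model change gives the reshuffling the answer recommends.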
Personally I prefer cross validation.
I am guessing what happens is that by sorting based on timestamp, you make your algorithm generalize poorly. Maybe the 20% you keep for testing differ significantly somehow, and your algorithm is not given a chance to capture this difference? In general, your data should be sorted randomly in order to avoid such issues.
Of course, you might also have a buggy implementation.
I would suggest you try cross validation and see what results you get then.

How to split train/test of extreme sparse dataset of recommender system?

I'm using a CF algorithm (SVD) on a real-world data set and have run into a data sparsity problem: the sparsity of the user/item rating matrix is around 0.01%. I split the data into train/test sets 80/20 and found that only a few of the users and items in the testing set also appear in the training set, so I can use only a few ratings in the testing set to calculate RMSE. Could you give me some advice on how to fix this?
In the case of recommender systems, one usually splits each user's history into train and test. In more detail:
1. For each user, we write out the items he interacted with.
2. Preferably, we order them by (increasing) time to overcome the "time-traveling issue" (a user can revisit already known items, so you don't want to test on the earlier part of the data).
3. As usual, you use the first (1-k) fraction of each user's ordered history as the train set and the rest as the test set.
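The per-user split described in this answer can be sketched as follows. The function name, the `(user, item, timestamp)` tuple layout, and the rounding rule are assumptions of this sketch. Rounding the test share down means a user with a very short history stays entirely in the training set, so no test user is unknown to the model.

```python
from collections import defaultdict

def per_user_time_split(events, test_frac=0.2):
    """events: iterable of (user, item, timestamp).
    For each user, the earliest interactions go to train and the most
    recent `test_frac` share (rounded down) goes to test."""
    by_user = defaultdict(list)
    for user, item, ts in events:
        by_user[user].append((ts, item))

    train, test = [], []
    for user, history in by_user.items():
        history.sort()                              # oldest first
        n_test = int(len(history) * test_frac)
        cut = len(history) - n_test
        train += [(user, item, ts) for ts, item in history[:cut]]
        test += [(user, item, ts) for ts, item in history[cut:]]
    return train, test
```

Items can still be missing from the training set under this scheme; if that matters, the test ratings for such items can be reported separately or dropped from the RMSE.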
