Proper evaluation of a Recommender System - machine-learning

Suppose you are given two Recommender Systems to evaluate, A and B. Model A is trained with large data, and model B with small data (this implies that A would have a larger pool of items to pick for recommendations).
How would you compare the two models? One strategy would be to evaluate each model on its own dataset ('A' on the big dataset, 'B' on the small one) using an 80/20 train/test split, and then calculate precision and recall for each. However, I'm not sure the precision and recall results are comparable in this case. What do you think?
Another approach would be to train A on the big data and B on the small data, but fix the test set (meaning the test set would be the same for both A and B). But isn't this "unfair", given that model A was trained on the big data and therefore has a larger pool of items to recommend from?
How would you compare the two models?
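To make the second strategy concrete, here is a rough sketch of what I mean by evaluating both models on the same held-out interactions. The item ids, the two recommendation lists, and the helper function are made up for illustration only:

```python
# Illustrative sketch of the "fixed test set" strategy: both recommenders are
# scored on the same held-out interactions with precision@k and recall@k,
# so the numbers are directly comparable. All data below is made up.
def precision_recall_at_k(recommended, relevant, k=10):
    top_k = recommended[:k]
    hits = len(set(top_k) & set(relevant))
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

relevant_items = ["i3", "i7", "i9"]              # held-out interactions for one test user
recs_a = ["i1", "i3", "i9", "i4", "i7"]          # model A (trained on the big dataset)
recs_b = ["i3", "i2", "i7"]                      # model B (trained on the small dataset)

print("A:", precision_recall_at_k(recs_a, relevant_items, k=5))
print("B:", precision_recall_at_k(recs_b, relevant_items, k=5))
```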

Related

Machine Learning model generalisation

I'm new to Machine Learning, and I'd like to ask a question regarding model generalization. In my case, I'm going to produce some mechanical parts, and I'm interested in controlling the input parameters to obtain certain properties in the final part.
More particularly, I'm interested in 8 parameters (say, P1, P2, ..., P8). To limit the number of pieces I need to produce while maximizing the combinations of parameters explored, I've divided the problem into 2 sets. For the first set of pieces, I'll vary the first 4 parameters (P1 ... P4) while the others are held constant. In the second set, I'll do the opposite (vary P5 ... P8 and hold P1 ... P4 constant).
So I'd like to know if it's possible to build a single model that takes the eight parameters as inputs to predict the properties of the final part. I ask because, since I'm not varying all 8 variables at once, I thought that maybe I would have to build one model for each set of parameters, and then the predictions of the two different models couldn't be related to one another.
Thanks in advance.
In most cases, two separate local models will have better accuracy than one big model. The reason is that each local model only looks at 4 features and can identify patterns among them to make predictions.
But this particular approach will most likely fail to scale. Right now you only have two sets of data, but what if that grows to 20 sets? It will not be practical for you to create and maintain 20 ML models in production.
What works best for your case will need some experimentation. Take a random sample from your data and train both setups: one big model and two local models, then evaluate their performance. Look not just at accuracy, but also at the F1 score, AUC-PR and the ROC curve to find out what works best for you. If you do not see a major performance drop, then one big model for the entire dataset will be the better option. If you know that your data will always be divided into these two sets and you don't care about scalability, then go with the two local models. A rough comparison sketch is shown below.
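For example, a comparison could look like the sketch below. The column names P1..P8, the synthetic data, the binary target and the chosen classifier are placeholders for your own data and model:

```python
# Illustrative comparison of one big model vs. two "local" models, each trained
# on the rows where its parameter block was varied. All data is synthetic.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score

rng = np.random.default_rng(0)
n = 2000
X = pd.DataFrame(rng.normal(size=(n, 8)), columns=[f"P{i}" for i in range(1, 9)])
X["set_id"] = rng.integers(0, 2, size=n)          # 0 = first set of pieces, 1 = second set
y = ((X["P1"] + X["P5"] + rng.normal(scale=0.5, size=n)) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# One big model on all eight parameters
big = RandomForestClassifier(random_state=0).fit(X_tr.drop(columns="set_id"), y_tr)
pred_big = big.predict(X_te.drop(columns="set_id"))

# Two local models, one per parameter block
pred_local = pd.Series(0, index=X_te.index)
for set_id, cols in [(0, ["P1", "P2", "P3", "P4"]), (1, ["P5", "P6", "P7", "P8"])]:
    tr_mask = X_tr["set_id"] == set_id
    te_mask = X_te["set_id"] == set_id
    local = RandomForestClassifier(random_state=0).fit(X_tr.loc[tr_mask, cols], y_tr[tr_mask])
    pred_local[te_mask] = local.predict(X_te.loc[te_mask, cols])

for name, pred in [("one big model", pred_big), ("two local models", pred_local)]:
    print(name, "accuracy:", accuracy_score(y_te, pred), "F1:", f1_score(y_te, pred))
```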

Different scenario based queries on Imputing and Machine Learning

I am new to Data Science and am learning about imputation and model training. Below are a few queries that came up when training on my datasets. Please provide answers to these.
Suppose I have a dataset with 1000 observations. One way, I train the model on the complete dataset in one go. Another way, I divide my dataset into 80% and 20% and train my model first on the 80% and then on the 20%. Is it the same or different? Basically, if I train my already trained model on new data, what does it mean?
Imputing Related
Another question is related to imputing. Imagine I have a dataset of some ship passengers, where only first-class passengers were given cabins. There is a column that holds cabin numbers (categorical), but very few observations have these cabin numbers. I know this column is important, so I cannot remove it, but because it has many missing values, most algorithms do not work. How should I handle imputing this type of column?
When imputing the validation data, do we impute with the same values that were used to impute the training data, or are the imputing values calculated again from the validation data itself?
How do I impute data in the form of a string like a ticket number (like A-123)? The column is important because the first letter indicates the passenger's class. Therefore, we cannot drop it.
Suppose I have a dataset with 1000 observations. One way, I train the model on the complete dataset in one go. Another way, I divide my dataset into 80% and 20% and train my model first on the 80% and then on the 20%. Is it the same or different?
It's hard to say whether it is good or not. Generally, if your data (splits) are drawn from the same distribution, you can perform additional training. However, not all model types are suited to it. I advise you to run some kind of cross-validation with an 80/20 split and check the error measurements before and after the additional training.
Basically, if I train my already trained model on new data, what does it mean?
If the datasets come from the same distribution, you are performing additional learning, which theoretically should have a positive influence on your model. A minimal sketch with a model that supports incremental training is shown below.
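For instance, an incremental-training sketch using a scikit-learn model that supports partial_fit (the data is made up):

```python
# Illustrative sketch: train on the first 80%, then continue training on the
# remaining 20% with a model that supports incremental learning.
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

X80, X20, y80, y20 = train_test_split(X, y, test_size=0.2, random_state=0)

clf = SGDClassifier(random_state=0)
clf.partial_fit(X80, y80, classes=np.unique(y))   # first round of training on 80%
clf.partial_fit(X20, y20)                          # additional training on the new 20%

print("accuracy on all data:", accuracy_score(y, clf.predict(X)))
```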
Imagine I have a dataset of some ship passengers, where only first-class passengers were given cabins. There is a column that holds cabin numbers (categorical), but very few observations have these cabin numbers. I know this column is important, so I cannot remove it, but because it has many missing values, most algorithms do not work. How should I handle imputing this type of column?
You need to clearly understand what you want to achieve by imputation. If only first class has values, how can you perform imputation for the second or third class? What do you need to find? The deck? The cabin number? Do you want to generate new values or impute with already existing values?
When imputing the validation data, do we impute with the same values that were used to impute the training data, or are the imputing values calculated again from the validation data itself?
Very generally, you run the imputation algorithm on all the data you have (without the target column), and then apply the same imputation values to both partitions, as in the sketch below.
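For example, with a hypothetical Age column (the data and column names are made up), this approach could look like:

```python
# Illustrative sketch of the approach described above: fit the imputer on all
# available feature data, excluding the target, then reuse it for both splits.
import pandas as pd
from sklearn.impute import SimpleImputer

train = pd.DataFrame({"Age": [22, None, 35], "Survived": [0, 1, 1]})
valid = pd.DataFrame({"Age": [None, 40], "Survived": [1, 0]})

imputer = SimpleImputer(strategy="median")
imputer.fit(pd.concat([train, valid])[["Age"]])          # whole data, target excluded

train["Age"] = imputer.transform(train[["Age"]]).ravel()
valid["Age"] = imputer.transform(valid[["Age"]]).ravel() # same imputation values for both
```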
How do I impute data in the form of a string like a ticket number (like A-123)? The column is important because the first letter indicates the passenger's class. Therefore, we cannot drop it.
If you have a finite number of cases, you can simply impute the values as strings. If not, perform feature engineering: try to predict the letter, the number, the first digit of the number, len(number) and so on. A sketch of splitting such a ticket string into features is shown below.
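For instance, a rough sketch of splitting a ticket string such as A-123 into separate features, before any imputation or modelling (the data is made up):

```python
# Illustrative feature engineering on a "letter-number" ticket string.
import pandas as pd

df = pd.DataFrame({"Ticket": ["A-123", "B-45", None, "A-7"]})

parts = df["Ticket"].str.extract(r"^(?P<letter>[A-Za-z]+)-(?P<number>\d+)$")
df["ticket_letter"] = parts["letter"]                 # e.g. the passenger class
df["ticket_number"] = pd.to_numeric(parts["number"])  # the numeric part
df["ticket_len"] = parts["number"].str.len()          # the len(number) feature
print(df)
```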

Machine Learning Predictions and Normalization

I am using z-score to normalize my data before training my model. When I do predictions on a daily basis, I tend to have very few observations each day, perhaps just a dozen or so. My question is, can I normalize the test data just by itself, or should I attach it to the entire training set to normalize it?
The reason I am asking is that the normalization is based on the mean and std dev, which obviously might look very different if my dataset consists of only a few observations.
You need to have all of your data in the same units. Among other things, this means that you need to apply the same normalization transformation to all of your input. You don't need to include the new data in the training per se; however, keep the parameters of the normalization (for a z-score, the training mean and standard deviation, i.e. the m and b of y = mx + b) and apply those to the test data as you receive them.
It's certainly not a good idea to predict on a test set using a model trained on a very different data distribution. I would use the same mean and std from your training data to normalize your test set, as in the sketch below.
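For example, a minimal sketch with scikit-learn's StandardScaler (the data is made up), where the z-score parameters come from the training set only and are reused for each small daily batch:

```python
# Illustrative sketch: fit the z-score parameters (mean, std) on the training
# data once, then reuse them for every small batch of new observations.
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(loc=50, scale=10, size=(10_000, 3))
daily_batch = rng.normal(loc=50, scale=10, size=(12, 3))   # only a dozen rows

scaler = StandardScaler().fit(X_train)       # mean/std come from training data only
X_train_z = scaler.transform(X_train)
daily_z = scaler.transform(daily_batch)      # same parameters applied to the new data
```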

How do neural networks learn functions instead of memorize them?

For a class project, I designed a neural network to approximate sin(x), but ended up with a NN that just memorized my function over the data points I gave it. My NN took in x-values with a batch size of 200. Each x-value was multiplied by 200 different weights, mapping to 200 different neurons in my first layer. My first hidden layer contained 200 neurons, each one a linear combination of the x-values in the batch. My second hidden layer also contained 200 neurons, and my loss function was computed between the 200 neurons in my second layer and the 200 values of sin(x) that the input mapped to.
The problem is, my NN perfectly "approximated" sin(x) with 0 loss, but I know it wouldn't generalize to other data points.
What did I do wrong in designing this neural network, and how can I avoid memorization and instead design my NN's to "learn" about the patterns in my data?
It is the same with any machine learning algorithm. You have a dataset from which you try to learn "the" function f(x) that actually generated the data. With real-life datasets it is impossible to recover the original function exactly, and therefore we approximate it with some function g(x).
The main goal of any machine learning algorithm is to predict unseen data as best as possible using the function g(x).
Given a dataset D you can always train a model that perfectly classifies all the datapoints (you could even use a hashmap to get 0 error on the training set), but that is overfitting, or memorization.
To avoid such things, you yourself have to make sure that the model does not memorise but actually learns the function. There are a few things that can be done; I'll write them down in an informal way.
Train, Validation, Test
If you have a large enough dataset, use Train, Validation and Test splits. Split the dataset into three parts, typically 60%, 20% and 20% for Training, Validation and Test, respectively. (These numbers can vary based on need; also, in the case of imbalanced data, check how to get stratified partitions which preserve the class ratios in every split.) Next, forget about the Test partition, keep it somewhere safe, don't touch it. Your model will be trained using the Training partition. Once you have trained the model, evaluate its performance using the Validation set. Then select another hyper-parameter configuration for your model (e.g. number of hidden layers, learning algorithm, other parameters, etc.), train the model again, and evaluate on the Validation set. Keep doing this for several such models. Then select the model which got you the best validation score.
The role of the Validation set here is to check what the model has learned. If the model has overfit, then the validation scores will be very bad, and therefore in the above process you will discard those overfit models. But keep in mind that although you did not use the Validation set to train the model directly, it was used indirectly to select the model.
Once you have selected a final model based on the Validation set, take out your Test set, as if you had just received a new dataset from real life that no one has ever seen. The model's predictions on this Test set will be an indication of how well it has "learned", as it is now trying to predict datapoints it has never seen (directly or indirectly).
It is key not to go back and tune your model based on the Test score, because once you do, the Test set starts contributing to your model selection and no longer gives an unbiased check. A minimal sketch of such a split is shown below.
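A rough sketch of this workflow, with a made-up sin-like dataset and placeholder models (the split ratios and hyper-parameter candidates are just examples):

```python
# Illustrative 60/20/20 Train/Validation/Test workflow with scikit-learn.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(-np.pi, np.pi, size=(3000, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.05, size=3000)

# 60% train, 20% validation, 20% test
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

best = None
for hidden in [(16,), (64,), (64, 64)]:            # hyper-parameter candidates
    model = MLPRegressor(hidden_layer_sizes=hidden, max_iter=2000, random_state=0)
    model.fit(X_train, y_train)
    val_err = mean_squared_error(y_val, model.predict(X_val))
    if best is None or val_err < best[0]:
        best = (val_err, model)                    # select on the Validation set only

# The Test set is touched exactly once, at the very end
print("test MSE of selected model:", mean_squared_error(y_test, best[1].predict(X_test)))
```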
Cross-validation and bootstrap sampling
On the other hand, if your dataset is small, you can use bootstrap sampling or k-fold cross-validation. These ideas are similar. For example, for k-fold cross-validation with k=5, you split the dataset into 5 parts (again, be careful about stratified sampling). Let's name the parts a, b, c, d, e. Use the partitions [a,b,c,d] to train and get the prediction scores on [e] only. Next, use the partitions [a,b,c,e] to train and get the prediction scores on [d] only, and continue 5 times, where each time you hold out one partition and train the model with the other 4. After this, take an average of these scores. This is indicative of how your model might perform if it sees new data. It is also good practice to do this multiple times and average. For example, for smaller datasets, perform a 10-times 10-fold cross-validation, which will give a pretty stable score (depending on the dataset) that is indicative of the prediction performance.
Bootstrap sampling is similar, but here you sample the same number of datapoints (it depends) with replacement from the dataset and use this sample to train. The sample will have some datapoints repeated (as it was drawn with replacement). Then use the datapoints missing from the training sample to evaluate the model. Perform this multiple times and average the performance. A cross-validation sketch is shown below.
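For example, a repeated stratified k-fold run with a made-up dataset and a placeholder classifier might look like:

```python
# Illustrative 10-times 10-fold stratified cross-validation.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

X, y = make_classification(n_samples=500, weights=[0.8, 0.2], random_state=0)

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)  # 10x 10-fold
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=cv, scoring="f1")
print("mean F1:", scores.mean(), "+/-", scores.std())
```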
Others
Other ways are to incorporate regularisation techniques in the classifier cost function itself. For example, in Support Vector Machines the cost function enforces conditions such that the decision boundary maintains a "margin", or gap, between the two class regions. In neural networks one can do similar things, for example with weight decay (L2 regularisation) or dropout (although it is not the same as in SVMs).
In neural networks you can use early stopping to stop the training. What this does is train on the Train dataset, but at each epoch evaluate the performance on the Validation dataset. If the model starts to overfit from a specific epoch onwards, the error on the Training dataset will keep decreasing, but the error on the Validation dataset will start increasing, indicating that your model is overfitting. Based on this, one can stop training, as in the sketch below.
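For instance, a minimal early-stopping sketch with scikit-learn's MLPRegressor (the data is made up):

```python
# Illustrative early stopping: a fraction of the training data is held out as a
# validation set, and training stops when the validation score stops improving.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-np.pi, np.pi, size=(2000, 1))
y = np.sin(X).ravel()

model = MLPRegressor(
    hidden_layer_sizes=(64, 64),
    early_stopping=True,        # hold out validation_fraction of the training data
    validation_fraction=0.2,
    n_iter_no_change=10,        # stop after 10 epochs without improvement
    max_iter=2000,
    random_state=0,
)
model.fit(X, y)
print("stopped after", model.n_iter_, "iterations")
```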
A large dataset from the real world tends to make overfitting less severe (citation needed). Also, if you have too many parameters in your model (too many hidden units and layers) and the model is unnecessarily complex, it will tend to overfit. A model with fewer parameters is much less likely to overfit (though it can underfit if it has too few).
In the case of your sin function task, the neural net is bound to fit the training points perfectly, as the data is exactly the sin function. Such toy tests can still really help you debug and experiment with your code.
Another important note: if you try to do a Train/Validation/Test split or k-fold cross-validation on the data generated by the sin function, then splitting it in the "usual" way will not work, as in this case we are dealing with a time series, and for those cases one can use the techniques mentioned here.
First of all, I think it's a great project to approximate sin(x). It would be great if you could share a snippet or some additional details so that we could pinpoint the exact problem.
However, I think the problem is that you are overfitting the data, and hence you are not able to generalize well to other data points.
A few tricks that might work:
Get more training points
Go for regularization
Add a test set so that you know whether you are overfitting or not.
Keep in mind that 0 loss or 100% accuracy on the training set is usually not a good sign. A small sketch of checking generalization on held-out points is shown below.
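For example, a rough sketch of points 2 and 3 above, with a held-out test set and L2 regularization (the data and the small network are made up):

```python
# Illustrative check: train a small network on sin(x) with a held-out test set,
# so memorization shows up as a gap between training and test error.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(-2 * np.pi, 2 * np.pi, size=(1000, 1))
y = np.sin(X).ravel()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = MLPRegressor(hidden_layer_sizes=(64, 64), alpha=1e-3,   # alpha = L2 regularization
                     max_iter=5000, random_state=0)
model.fit(X_train, y_train)

print("train MSE:", mean_squared_error(y_train, model.predict(X_train)))
print("test  MSE:", mean_squared_error(y_test, model.predict(X_test)))
```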

How to select training data for naive bayes classifier

I want to double-check some concepts I am uncertain of regarding the training set for classifier learning. When we select records for our training data, should we select an equal number of records per class, summing to N, or should we randomly pick N records (regardless of class)?
Intuitively I was thinking of the former, but then the prior class probabilities would be equal, which might not really be helpful?
It depends on the distribution of your classes, and the determination can only be made with domain knowledge of the problem at hand.
You can ask the following questions:
Are there any two classes that are very similar and does the learner have enough information to distinguish between them?
Is there a large difference in the prior probabilities of each class?
If so, you should probably redistribute the classes.
In my experience, there is no harm in redistributing the classes, but it's not always necessary.
It really depends on the distribution of your classes. In the case of fraud or intrusion detection, the positive class can make up less than 1% of the data.
In this case you must distribute the classes evenly in the training set if you want the classifier to learn the differences between the classes. Otherwise, you will produce a classifier that correctly classifies over 99% of the cases without ever correctly identifying a fraud case, even though identifying fraud is the whole point of creating the classifier in the first place.
Once you have a set of evenly distributed classes you can use any technique, such as k-fold, to perform the actual training.
Another example where class distributions need to be adjusted, but not necessarily in an equal number of records for each, is the case of determining upper-case letters of the alphabet from their shapes.
If you take a distribution of letters commonly used in the English language to train the classifier, there will be almost no cases, if any, of the letter Q. On the other hand, the letter O is very common. If you don't redistribute the classes to allow for the same number of Q's and O's, the classifier doesn't have enough information to ever distinguish a Q. You need to feed it enough information (i.e. more Q's) so it can determine that Q and O are indeed different letters. A small rebalancing sketch is shown below.
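For example, a rough upsampling sketch (the data and the roughly 1% positive rate are made up):

```python
# Illustrative rebalancing of a rare class by upsampling before training.
import numpy as np
import pandas as pd
from sklearn.utils import resample

rng = np.random.default_rng(0)
df = pd.DataFrame({"amount": rng.exponential(100, size=10_000)})
df["is_fraud"] = (rng.random(10_000) < 0.01).astype(int)   # ~1% positive class

majority = df[df["is_fraud"] == 0]
minority = df[df["is_fraud"] == 1]

# Upsample the minority class so both classes appear equally often
minority_up = resample(minority, replace=True, n_samples=len(majority), random_state=0)
balanced = pd.concat([majority, minority_up]).sample(frac=1, random_state=0)

print(balanced["is_fraud"].value_counts())
```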
The preferred approach is to use K-fold cross-validation for selecting the training and testing data.
Quote from wikipedia:
K-fold cross-validation
In K-fold cross-validation, the original sample is randomly partitioned into K subsamples. Of the K subsamples, a single subsample is retained as the validation data for testing the model, and the remaining K − 1 subsamples are used as training data. The cross-validation process is then repeated K times (the folds), with each of the K subsamples used exactly once as the validation data. The K results from the folds then can be averaged (or otherwise combined) to produce a single estimation. The advantage of this method over repeated random sub-sampling is that all observations are used for both training and validation, and each observation is used for validation exactly once. 10-fold cross-validation is commonly used.
In stratified K-fold cross-validation, the folds are selected so that the mean response value is approximately equal in all the folds. In the case of a dichotomous classification, this means that each fold contains roughly the same proportions of the two types of class labels.
You should always take the common approach so that your results are comparable with other scientific work. A stratified K-fold sketch is shown below.
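For example, a stratified K-fold run with a naive Bayes classifier (the data is made up) might look like:

```python
# Illustrative stratified K-fold cross-validation with a naive Bayes classifier.
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)  # preserves class ratios
scores = cross_val_score(GaussianNB(), X, y, cv=cv, scoring="f1")
print("mean F1 across folds:", scores.mean())
```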
I built an implementation of a Bayesian classifier to determine whether a sample is NSFW (not safe for work) by examining the occurrence of words in examples. When training a classifier for NSFW detection, I tried making each class in the training set contain the same number of examples. This didn't work out as well as I had planned, because one of the classes had many more words per example than the other class.
Since I was computing the likelihood of NSFW based on these words, I found that balancing out the classes based on their actual size (in MB) worked. I tried 10-fold cross-validation for both approaches (balancing by number of examples and by size of the classes) and found that balancing by the size of the data worked well.
