How to differentiate between real improvement and random noise?

I am building an automatic translator in Moses. To improve its performance, I use log-linear weight optimisation. This technique has a random component, which can slightly affect the final result (though I do not know exactly by how much).
Suppose that the current performance of the model is 25 BLEU.
Suppose now I modify the language model (e.g. change the smoothing), and I get a performance of 26 BLEU.
My question is: how can I know whether the improvement is due to the modification, or is just noise from the random component?

This is pretty much what statistics is all about. You can basically do one of two things (from the basic set of solutions; of course there are many more advanced ones):
Try to measure/model/quantify the effect of randomness. If you know what is causing it, you might be able to actually compute how much it can affect your model. If an analytical solution is not possible, you can always train 20 models with the same data/settings, gather the results and estimate the noise distribution. Once you have this, you can perform statistical tests to check whether the improvement is statistically significant (for example an ANOVA test; a minimal sketch of such a test follows this list).
A simpler approach (but more expensive in terms of data/time) is to simply reduce the variance by averaging. In short, instead of training one model (or evaluating a model once), which carries this hard-to-determine noise component, do it many times, 10 or 20, and average the results. This reduces the variance of the results in your analysis. It can (and should) be combined with the previous option: now that you have 20 results per configuration, you can again use statistical tests to see whether the two configurations are significantly different.
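As a rough sketch of such a test (in Python, assuming SciPy is available): suppose you have re-run the log-linear weight optimisation several times for both the baseline and the modified system. The BLEU scores below are made-up placeholders; the test asks whether the gap between the two systems is larger than the run-to-run noise.

    import numpy as np
    from scipy import stats

    baseline_bleu = np.array([24.8, 25.1, 24.9, 25.3, 25.0])   # hypothetical tuning re-runs
    modified_bleu = np.array([25.9, 26.2, 25.8, 26.1, 26.0])   # hypothetical tuning re-runs

    # Welch's t-test: is the mean BLEU difference larger than the run-to-run noise explains?
    t_stat, p_value = stats.ttest_ind(modified_bleu, baseline_bleu, equal_var=False)
    print(f"mean diff = {modified_bleu.mean() - baseline_bleu.mean():.2f} BLEU, p = {p_value:.4f}")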

Related

Model selection for classification with random train/test sets

I'm working with an extremely unbalanced and heterogeneous multiclass {K = 16} database for research, with a small N ~= 250. For some labels the database has a sufficient number of examples for supervised machine learning, but for others I have almost none. I'm also not in a position to expand my database, for a number of reasons.
As a first approach I divided my database into training (80%) and test (20%) sets in a stratified way and applied several classification algorithms to it, which gave some results. I repeated this procedure over 500 stratified train/test sets (since each stratified sampling takes individuals randomly within each stratum), hoping to select an algorithm (model) that performed acceptably.
Because of my database, the performance on the test set varies greatly depending on which specific examples end up in the train set. I'm dealing with runs that reach as high as 82% accuracy (high for my application) and runs as low as 40%. The median over all runs is around 67% accuracy.
When facing this situation, I'm unsure what the standard procedure (if there is any) is for selecting the best-performing model. My rationale is that the highest-scoring model may generalize better because the specific examples selected for its training set are richer, so the test set is better classified. However, I'm fully aware of the possibility that the test set is composed of "simpler" cases that are easier to classify, or that the train set comprises all the hard-to-classify cases.
Is there any standard procedure for selecting the best-performing model, given that the distribution of examples across my train/test sets causes the results to vary greatly? Am I making a conceptual mistake somewhere? Do practitioners usually just select the best-performing model without any further exploration?
I don't like the idea of using the mean/median accuracy, as obviously some models generalize better than others, but I'm by no means an expert in the field.
Confusion matrix of the predicted labels on the test set for one of the best cases: [confusion matrix image]
Confusion matrix of the predicted labels on the test set for one of the worst cases: [confusion matrix image]
They both use the same algorithm and parameters.
Good Accuracy =/= Good Model
I want to point out first that good accuracy on your test set need not equal a good model in general! In your case this mainly has to do with your extremely skewed distribution of samples.
Especially when doing a stratified split with one class overwhelmingly represented, you will likely get good results by simply predicting that one class over and over again.
A good way to see if this is happening is to look at a confusion matrix (better picture here) of your predictions.
If one class also absorbs predictions that belong to other classes, that is an indicator of a bad model. I would argue that in your case it will generally be very hard to find a good model unless you actively try to balance your classes more during training.
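As a minimal illustration of that check (the labels and predictions below are placeholders, not your data), scikit-learn's confusion_matrix makes the pattern easy to spot:

    import numpy as np
    from sklearn.metrics import confusion_matrix

    y_true = np.array([0, 0, 0, 0, 1, 1, 2, 2, 3, 3])   # hypothetical test labels
    y_pred = np.array([0, 0, 0, 0, 0, 1, 0, 2, 0, 3])   # hypothetical predictions

    # Rows are true classes, columns are predicted classes. A single column
    # collecting most of the mass means the model defaults to the majority class.
    print(confusion_matrix(y_true, y_pred))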
Use the power of Ensembles
Another idea is indeed to use ensembling over multiple models (in your case, the models resulting from the different splits), since an ensemble can be expected to generalize better.
Even if you sacrifice a lot of accuracy on paper, I would bet that the confusion matrix of an ensemble will look much better than that of a single "high accuracy" model. Especially if you disregard the models that perform extremely poorly (making sure, again, that the poor performance reflects an actually bad model and not just an unlucky split), I would expect very good generalization.
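A minimal sketch of such an ensemble, assuming scikit-learn and a synthetic stand-in for your data (the classifier choice is arbitrary): fit one model per stratified split and average their predicted class probabilities.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import StratifiedShuffleSplit

    # Synthetic stand-in for a small multiclass dataset.
    X, y = make_classification(n_samples=250, n_classes=4, n_informative=8, random_state=0)

    splitter = StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
    models = [RandomForestClassifier(random_state=i).fit(X[train], y[train])
              for i, (train, _) in enumerate(splitter.split(X, y))]

    def ensemble_predict(X_new):
        # Soft voting: average the per-model class probabilities, then take the argmax.
        probas = np.mean([m.predict_proba(X_new) for m in models], axis=0)
        return models[0].classes_[np.argmax(probas, axis=1)]

    print(ensemble_predict(X[:5]))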
Try k-fold Cross-Validation
Another common technique is k-fold cross-validation. Instead of performing your evaluation on a single 80/20 split, you divide your data into k equally sized sets and repeatedly train on k-1 of them while evaluating on the remaining one. You not only get a feeling for whether your split was reasonable (k-fold CV implementations, like the one from sklearn, usually give you the results for all the different splits), but you also get an overall score that is the average over all folds.
Note that 5-fold CV would equal a split into 5 20% sets, so essentially what you are doing now, plus the "shuffling part".
CV is also a good way to deal with little training data, in settings where you have imbalanced classes, or where you generally want to make sure your model actually performs well.
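A minimal stratified k-fold example with scikit-learn (synthetic data and an arbitrary classifier as placeholders): the per-fold scores show how sensitive the result is to the split, and their mean is the overall score mentioned above.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import StratifiedKFold, cross_val_score

    # Synthetic stand-in for a small, multiclass dataset.
    X, y = make_classification(n_samples=250, n_classes=4, n_informative=8, random_state=0)

    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=cv)
    print(scores)                       # one accuracy per fold - the spread shows split sensitivity
    print(scores.mean(), scores.std())  # overall score and its variability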

How specific should a Support Vector Machine Model be?

The whole point of using an SVM is that the algorithm will be able to decide whether an input is true or false, and so on.
I am trying to use an SVM for predictive maintenance to predict how likely a system is to overheat.
For my example, the range is 0-102°C and if the temperature reaches 80°C or above it's classed as a failure.
My inputs are arrays of 30 doubles (the last 30 readings).
I am making some sample inputs to train the SVM, and I was wondering if it is good practice to pass in very specific data to train it - e.g. passing in arrays of 80°C, 81°C ... 102°C so that the model will automatically associate these values with failure. You could pass in an array of 30 x 79°C as well and label it as a pass.
This seems like a complete way of doing it, although if you input arrays like that, would it not be the same as hardcoding a switch statement that triggers when the temperature reads 80-102°C?
Would it be a good idea to pass in these "hardcoded" style arrays or should I stick to more random inputs?
If there is a finite set of possibilities, I would really recommend using Naïve Bayes, as that method fits this kind of problem well. If you are forced to use an SVM, however, it would be rather difficult. For starters, the main idea of an SVM is classification, and the number of scenarios does not really matter; the input is seldom discrete, so there are usually infinitely many scenarios. A normally implemented SVM would only give you a classification - unless you created 100 classes, one for 1%, another for 2%, and so on, which wouldn't really solve the problem.
The conclusion is that this could work, but it would not be considered "best practice". You can imagine your 30-dimensional vector space divided into 100 small subspaces; each data point, a 30x1 vector, is a point in that vector space, and the probability is decided by which of the 100 subsets it falls in. However, having 100 classes and unclean or insufficient data will lead to very bad, poorly performing models.
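To make the point concrete, here is a small sketch (synthetic data, scikit-learn, not a recommendation): an SVM trained on windows labelled purely by the 80°C rule just re-learns that threshold, which is exactly the switch-statement concern raised in the question.

    import numpy as np
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    base = rng.uniform(40, 95, size=(500, 1))        # hypothetical per-window operating level
    X = base + rng.normal(0, 3, size=(500, 30))      # 500 windows of 30 noisy readings
    y = (X.max(axis=1) >= 80).astype(int)            # 1 = "failure" by the 80°C rule

    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf")).fit(X, y)
    print(clf.score(X, y))   # near-perfect, because the labels encode a hard threshold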
Cheers :)

Splitting training and test data

Can someone recommend the best percentage split between training and testing data in machine learning? What are the disadvantages if I split training and test data 30/70?
There is no one "right way" to split your data, unfortunately; people use different values chosen based on different heuristics, gut feeling and personal experience/preference. A good starting point is the Pareto principle (80-20).
Sometimes using a simple split isn't an option because you simply have too much data - in that case you might need to sample your data or use smaller test sets if your algorithm is computationally expensive. An important point is to choose your data randomly. The trade-off is pretty simple: less testing data means the performance estimate of your algorithm will have bigger variance; less training data means the parameter estimates will have bigger variance.
For me personally, more important than the size of the split is that you shouldn't perform your test only once on the same test split, as it might be biased (you might be lucky or unlucky with your split). Because of this so-called model variance, different splits will result in different values, so you should run tests for several configurations (for instance, run your tests X times, each time choosing a different 20% for testing) and average out the results.
With the above method you might find it troublesome to test all the possible splits. A well-established way of splitting data is so-called cross-validation, and as you can read in the wiki article there are several types of it (both exhaustive and non-exhaustive). Pay extra attention to k-fold cross-validation. A minimal sketch of the repeated-split idea follows this answer.
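Here is that sketch with scikit-learn (placeholder data and an arbitrary model): twenty different 80/20 splits, with the scores averaged.

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import ShuffleSplit, cross_val_score

    # Synthetic stand-in for your dataset.
    X, y = make_classification(n_samples=1000, random_state=0)

    cv = ShuffleSplit(n_splits=20, test_size=0.2, random_state=0)
    scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
    print(scores.mean(), scores.std())   # average performance and its variability across splits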
Read up on the various strategies for cross-validation.
A 10%-90% split is popular, as it arises from 10x cross-validation.
But you could do 3x or 4x cross validation, too. (33-67 or 25-75)
Much larger errors arise from:
having duplicates in both test and train
unbalanced data
Make sure to first merge all duplicates, and do stratified splits if you have unbalanced data.
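A small sketch of those two precautions with pandas and scikit-learn (the column names and values are hypothetical): de-duplicate first, then split with stratification.

    import pandas as pd
    from sklearn.model_selection import train_test_split

    # Hypothetical dataset with one duplicated row and unbalanced-ish labels.
    df = pd.DataFrame({
        "feature_a": [1, 1, 2, 3, 4, 5, 6, 7],
        "feature_b": [9, 9, 8, 7, 6, 5, 4, 3],
        "label":     [0, 0, 0, 0, 1, 1, 1, 1],
    })

    df = df.drop_duplicates()                    # so no row can appear in both sets
    train_df, test_df = train_test_split(
        df, test_size=0.3, stratify=df["label"], random_state=0
    )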

Input selection for neural networks

I am going to use an ANN for my work, in which I have a large dataset, say input [600x40] and output [600x6]. As one can see, the number of inputs (40) is too high for an ANN; it may get trapped in a local minimum and/or increase the CPU time dramatically. Is there any way to select the most informative inputs?
As my first try, I used the following code in Matlab to find the cross-correlation between each pair of inputs:
[rho, ~] = corr(inputs, 'rows', 'pairwise');
However, I think this simple correlation cannot identify some hidden complex relation between the inputs.
Any ideas?
First of all, 40 inputs is a very small space that should not be reduced. A large number of inputs would be 100,000, not 40. Also, 600x40 is not a big dataset, nor one that "increases the CPU time dramatically"; if it learns slowly, check your code, because the problem appears to be the code, not your data.
Furthermore, feature selection is not a good way to go; you should use it only when gathering features is actually expensive. In any other scenario you are looking for dimensionality reduction, such as PCA, LDA, etc., although as said before, your data should not be reduced - rather, you should consider getting more of it (new samples/new features).
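For completeness, a minimal PCA sketch of the dimensionality-reduction route mentioned above (random placeholder data standing in for the 600x40 inputs), even though the advice here is that this particular dataset does not need reducing.

    import numpy as np
    from sklearn.decomposition import PCA

    X = np.random.default_rng(0).normal(size=(600, 40))   # stand-in for the 600x40 inputs

    pca = PCA(n_components=0.95)       # keep components explaining 95% of the variance
    X_reduced = pca.fit_transform(X)
    print(X_reduced.shape)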
Disclaimer: I'm with lejlot on this - you should get more data and more features instead of trying to remove features. Still, that doesn't answer your question, so here we go.
Try the most basic greedy approach: remove each feature in turn, retrain your ANN (several times, of course) and see whether the results get better or worse. Keep the removal that gives the largest improvement, and repeat until removing a feature no longer improves anything. This will take a lot of time, so you may want to try it on some subset of your data (for example on 3 folds of a dataset split into 10 folds).
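A rough sketch of that loop (scikit-learn's MLPRegressor standing in for the ANN, random placeholder data, and cross-validation playing the role of the repeated retraining). It is slow by construction, as warned above.

    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.neural_network import MLPRegressor

    rng = np.random.default_rng(0)
    X = rng.normal(size=(600, 40))                              # stand-in for the 600x40 inputs
    y = X[:, :6] @ rng.normal(size=(6, 6)) + 0.1 * rng.normal(size=(600, 6))

    def score(features):
        model = MLPRegressor(hidden_layer_sizes=(10,), max_iter=500, random_state=0)
        return cross_val_score(model, X[:, features], y, cv=3).mean()

    features = list(range(X.shape[1]))
    best = score(features)
    while len(features) > 1:
        # Try dropping each remaining feature and keep the single most helpful removal.
        trials = [(score([c for c in features if c != f]), f) for f in features]
        best_trial, drop_f = max(trials)
        if best_trial <= best:
            break                                               # no removal improves the score
        best, features = best_trial, [c for c in features if c != drop_f]

    print(len(features), best)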
It's ugly, but sometimes it works.
I repeat what I said in the disclaimer - this is not the way to go.

Interpreting the parameters of the evaluate() function of an item-based recommender in Mahout

I am working with boolean values, trying to evaluate a recommendation engine in Mahout. My questions are about the selection of the "correct" parameters of the evaluate function. Apologies in advance for the lengthy post.
IRStatistics evaluate(RecommenderBuilder recommenderBuilder,
DataModelBuilder dataModelBuilder,
DataModel dataModel,
IDRescorer rescorer,
int at,
double relevanceThreshold,
double evaluationPercentage) throws TasteException;
1) Can you think of an example in which the following two parameters must be used:
- DataModelBuilder dataModelBuilder
- IDRescorer rescorer
2) For the double relevanceThreshold variable I set the value GenericRecommenderIRStatsEvaluator.CHOOSE_THRESHOLD; however, I was wondering whether a "better" model could be built by setting a different value.
3) In my project, I need to recommend at most 10 items per user. Does this mean that it doesn't make sense to set a value bigger than 10 for the variable int at?
4) Given that I don't mind waiting a long time to build the model, is it good practice to set the variable double evaluationPercentage equal to 1? Can you think of any case where 1 would not give the optimum model?
5) Why do precision/recall (note that I am working with boolean data) increase as the number of recommendations (i.e. the variable int at) increases? I observed this experimentally.
6) Where does the splitting into training and test sets take place within Mahout, and how could I change that percentage (unless this does not apply to item-based recommendations)?
Accurate recommendations alone do not guarantee users of recommender systems an effective and satisfying experience, so measurements should be taken only as a reference point. Ideally, real users would use your system against a baseline you set (like random recommendations) in an A/B test to see which performs better, but that can be troublesome and not quite practical.
Precision and recall at N recommendations are not great metrics for recommenders. You are better off using a metric like AUC (area under the curve).
Have a look at the Mahout in Action book example (link).
Letting Mahout choose a threshold is fine, but it will be more computationally expensive
Yes, if you are making 10 recommendations, evaluating at 10 makes a lot of sense
Depends on the size of your data, really. If using 100% (that is, 1.0) is fast enough, I would use that. But if you do use something less, I would strongly suggest calling RandomUtils.useTestSeed() when testing, so the sampling is done in the same manner every time you evaluate (don't use it in production, though).
Not sure; it depends on what your data looks like. But normally if precision increases, recall decreases, and vice versa. See the F1 score (also available from Mahout's IRStatistics), and the toy example after this answer.
For IRStatistics I'm not entirely sure where the split happens (or if it happens at all). Notice that it doesn't even take a percentage for the division into training and test, although there might be a default somewhere. If I were you I would go through the Mahout code and find out.
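On point 5, a toy illustration (plain Python, not Mahout code): for a fixed ranked list, recall at N can only grow as N grows, while precision at N typically flattens or falls, so seeing both rise usually says something about how the relevant items are distributed in your data.

    # One user's relevant items and a made-up ranked recommendation list.
    relevant = {"a", "b", "c", "d"}
    ranked = ["a", "x", "b", "y", "c", "z", "d", "w"]

    for n in (2, 4, 6, 8):
        hits = len(set(ranked[:n]) & relevant)
        print(n, "precision:", hits / n, "recall:", hits / len(relevant))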
