Train/Test Datasets in Machine Learning - machine-learning

I just have a general question:
In a previous job, I was tasked with building a series of non-linear models to quantify the impact of certain factors on the number of medical claims filed. We had a set of variables we would use in all models (eg: state, year, Sex, etc.). We used all of our data to build these models; meaning we never split the data into training and test data sets.
If I were to go back in time to this job and split the data into training and test data sets, what would the advantages of that approach be besides assessing the prediction accuracy of our models. What is an argument for not splitting the data and then fitting the model? Never really thought about it too much until now - curious as to why we didn't take that approach.
Thanks!

The sole purpose of setting aside a test set is to assess prediction accuracy. However, there is more to this than just checking the number and thinking "huh, that's how my model performs"!
Knowing how your model performs at a given moment gives you an important benchmark for potential improvements of the model. How will you know otherwise whether adding a feature increases model performance? Moreover, how do you know otherwise whether your model is at all better than mere random guessing? Sometimes, extremely simple models outperform the more complex ones.
Another thing is removal of features or observations. This depends a bit on the kind of models you use, but some models (e.g., k-Nearest-Neighbors) perform significantly better if you remove unimportant features from the data. Similarly, suppose you add more training data and suddenly your model's test performance drops significantly. Perhaps there is something wrong with the new observations? You should be aware of these things.
The only argument I can think of for not using a test set is that otherwise you'd have too little training data for the model to perform optimally.

Related

Removing duplicates for ML training set?

I am wondering what is the common practice (if there is any) for handling duplicate observations for machine learning training sets.
Dropping duplicate observations would surely speed up the computations so that's a benefit.
But would it not throw the model off by simplifying it? Do models take the number of duplicates into account? I have a feeling it depends on the model, but am not able to find a clear answer.
I can imagine this differing very much for your specific use case, your data, and the type of models you use.
Many models would tend towards getting a certain record right if there are many duplicates of a that record: whether it's the C4.5 algorithm behind many decision trees, or the stochastic gradient descent behind neural networks.
Removing duplicates could be a very legitimate thing to do if you learn that the duplicates are a result of faulty training data, because in that case you'd want to modify your data to represent the real world as accurately as possible.
Though if the nature of your data is just that many records are identical, but they're still legitimate data points, then for many applications you'd want your model to weigh those data points appropriately, because in the end, that's what your real-world data would look like as well.

ML algorithm suggestion for databases that change a lot with time after model training

I have a classification problem and I'm using a logistic regression (I tested it among other models and this one was the best). I look for information from game sites and test if a user has the potential to be a buyer of certain games.
The problem is that lately some sites from which I get this information (and also from where I got the information to train the model) change weekly and, with that, part of the database I use for prediction is "partially" different from the one used for training (with different information for each user, in this case). Since when these sites started to change, the model's predictive ability has dropped considerably.
To solve this, an alternative would be, of course, to retrain the model. It's something we're considering, although we'll have to do it with some frequency given the fact that the sites are changing every couple of weeks, considerably.
other solutions considered was the use of algorithms that could adapt to these changes and, with that, we could retrain the model less frequently.
Two options raised were neural networks to classify or try to adapt some genetic algorithm. However, I have read that genetic algorithms would be very expensive and are not a good option for classification problems, given the fact that they may not converge.
Does anyone have any suggestions for a modeling approach that we can test?

Build one more model after dropping the features with importance

We build a Data Science model and watch feature importance. If we drop the features and build a new model , will there be any improvement in the accuracy?. I see only one advantage is that consumer of model can pass only limited parameters to get the prediction. Are there any other advantages?
Yes!
Fewer parameters mean a faster learning and a faster prediction. Done correctly, it also means a smaller chance of overfitting your data.
The non-relevant features act as a noise, thus it will reduce the accuracy of the model.
While training the model it makes the convergence towards global minima harder, as it minimizes a more complex function.

How to build a good training data set for machine learning and predictions?

I have a school project to make a program that uses the Weka tools to make predictions on football (soccer) games.
Since the algorithms are already there (the J48 algorithm), I need just the data. I found a website that offers football game data for free and I tried it in Weka but the predictions were pretty bad so I assume my data is not structured properly.
I need to extract the data from my source and format it another way in order to make new attributes and classes for my model. Does anyone know of a course/tutorial/guide on how to properly create your attributes and classes for machine learning predictions? Is there a standard that describes the best way of choosing the attributes of a data set for training a machine learning algorithm? What's the approach on this?
here's an example of the data that I have at the moment: http://www.football-data.co.uk/mmz4281/1516/E0.csv
and here is what the columns mean: http://www.football-data.co.uk/notes.txt
The problem may be that the data set you have is too small. Suppose you have ten variables and each variable has a range of 10 values. There are 10^10 possible configurations of these variables. It is unlikely your data set will be this large let alone cover all of the possible configurations. The trick is to narrow down the variables to the most relevant to avoid this large potential search space.
A second problem is that certain combinations of variables may be more significant than others.
The J48 algorithm attempts to to find the most relevant variable using entropy at each level in the tree. each path through the tree can be thought of as an AND condition: V1==a & V2==b ...
This covers the significance due to joint interactions. But what if the outcome is a result of A&B&C OR W&X&Y? The J48 algorithm will find only one and it will be the one where the the first variable selected will have the most overall significance when considered alone.
So, to answer your question, you need to not only find a training set which will cover the most common variable configurations in the "general" population but find an algorithm which will faithfully represent these training cases. Faithful meaning it will generally apply to unseen cases.
It's not an easy task. Many people and much money are involved in sports betting. If it were as easy as selecting the proper training set, you can be sure it would have been found by now.
EDIT:
It was asked in the comments how to you find the proper algorithm. The answer is the same way you find a needle in a haystack. There is no set rule. You may be lucky and stumble across it but in a large search space you won't ever know if you have. This is the same problem as finding the optimum point in a very convoluted search space.
A short-term answer is to
Think about what the algorithm can really accomplish. The J48 (and similar) algorithms are best suited for classification where the influence of the variables on the result are well known and follow a hierarchy. Flower classification is one example where it will likely excel.
Check the model against the training set. If it does poorly with the training set then it will likely have poor performance with unseen data. In general, you should expect the model to performance against the training to exceed the performance against unseen data.
The algorithm needs to be tested with data it has never seen. Testing against the training set, while a quick elimination test, will likely lead to overconfidence.
Reserve some of your data for testing. Weka provides a way to do this. The best case scenario would be to build the model on all cases except one (Leave On Out Approach) then see how the model performs on the average with these.
But this assumes the data at hand are not in some way biased.
A second pitfall is to let the test results bias the way you build the model.For example, trying different models parameters until you get an acceptable test response. With J48 it's not easy to allow this bias to creep in but if it did then you have just used your test set as an auxiliary training set.
Continue collecting more data; testing as long as possible. Even after all of the above, you still won't know how useful the algorithm is unless you can observe its performance against future cases. When what appears to be a good model starts behaving poorly then it's time to go back to the drawing board.
Surprisingly, there are a large number of fields (mostly in the soft sciences) which fail to see the need to verify the model with future data. But this is a matter better discussed elsewhere.
This may not be the answer you are looking for but it is the way things are.
In summary,
The training data set should cover the 'significant' variable configurations
You should verify the model against unseen data
Identifying (1) and doing (2) are the tricky bits. There is no cut-and-dried recipe to follow.

Multiple cross-validation + testing on a small dataset to improve confidence

I am currently working on a very small dataset of about 25 samples (200 features) and I need to perform model selection and also have a reliable classification accuracy. I was planning to split the dataset in a training set (for a 4-fold CV) and a test set (for testing on unseen data). The main problem is that the resulting accuracy obtained from the test set is not reliable enough.
So, performing multiple time the cross-validation and testing could solve the problem?
I was planning to perform multiple times this process in order to have a better confidence on the classification accuracy. For instance: I would run one cross-validation plus testing and the output would be one "best" model plus the accuracy on the test set. The next run I would perform the same process, however, the "best" model may not be the same. By performing this process multiple times I eventually end up with one predominant model and the accuracy will be the average of the accuracies obtained on that model.
Since I never heard about a testing framework like this one, does anyone have any suggestion or critics on the algorithm proposed?
Thanks in advance.
The algorithm seems interesting but you need to make lots of passes through data and ensure that some specific model is really dominant (that it surfaces in real majority of tests, not just 'more than others'). In general, in ML a real problem is having too little data. As anyone will tell you, not the team with the most complicated algorithm wins, but the team with biggest amount of data.
In your case I would also suggest one additional approach - bootstrapping. Details are here:
what is the bootstrapped data in data mining?
Or can be googled. Long story short it is a sampling with replacement, which should help you to expand your dataset from 25 samples to something more interesting.
When the data is small like yours you should consider 'LOOCV' or leave one out cross validation. In this case you partition the data into 25 different samples where and each one a single different observatin is held out. Performance is then calcluated using the 25 individual held out predictions.
This will allow you to use the most data in your modeling and you will still have a good measure of performance.

Resources