I am working on a personal project in which I log data from my city's bike rental service in a MySQL database. A script runs every thirty minutes and logs, for every bike station, the number of free bikes it has. Then, in my database, I average the availability of each station for each day at that given time, which gives me, as of today, an approximate prediction based on 2 months of logged data.
I've read a bit on machine learning and I'd like to learn a bit. Would it be possible to train a model with my data and make better predictions with ML in the future?
The answer is very likely yes.
The first step is to have some data, and it sounds like you do. You have a response (free bikes) and some features on which it varies (time, location). You have already applied a basic conditional means model by averaging values over factors.
You might augment the data you know about locations with some calendar events like holiday or local event flags.
Prepare a data set with one row per observation, and benchmark the accuracy of your current forecasting process for a period of time on a metric like Mean Absolute Percentage Error (MAPE). Ensure your predictions (averages) for the validation period do not include any of the data within the validation period!
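As a rough illustration of that benchmark, here is a minimal sketch in plain Python. The row layout (station, weekday, half-hour slot, free bikes), the sample numbers and the function names are all assumptions made for the example, not your actual schema. Note that MAPE is undefined when the actual value is zero, which happens whenever a station is empty; skipping those rows, as done here, is only one possible choice.

```python
from collections import defaultdict
from statistics import mean

def fit_slot_averages(rows):
    """Average free bikes per (station, weekday, time slot)."""
    buckets = defaultdict(list)
    for station, weekday, slot, free_bikes in rows:
        buckets[(station, weekday, slot)].append(free_bikes)
    return {key: mean(values) for key, values in buckets.items()}

def mape(actuals, predictions):
    """Mean Absolute Percentage Error in percent, skipping zero actuals."""
    terms = [abs(a - p) / abs(a) for a, p in zip(actuals, predictions) if a != 0]
    if not terms:
        raise ValueError("MAPE is undefined when every actual value is zero")
    return 100.0 * mean(terms)

# Hypothetical log rows: (station, weekday, half-hour slot, free bikes).
history = [
    ("A", 0, 16, 10), ("A", 0, 16, 14),
    ("B", 0, 16, 4), ("B", 0, 16, 6),
]
# Held-out period: these rows are never used when computing the averages.
holdout = [("A", 0, 16, 15), ("B", 0, 16, 4)]

averages = fit_slot_averages(history)
predictions = [averages[(s, d, t)] for s, d, t, _ in holdout]
actuals = [free for _, _, _, free in holdout]
baseline_mape = mape(actuals, predictions)
print(f"baseline MAPE: {baseline_mape:.1f}%")
```

Any model you try later can be scored with the same `mape` function on the same held-out rows, so the numbers are directly comparable.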
Use the data for this period to validate other models you try.
Split off part of the remaining data into a test set, and use the rest for training. If you have a lot of data, then a common training/test split is 70/30. If the data is small you might go down to 90/10.
Learn one or more machine learning models on the training set, checking performance periodically on the test set to ensure generalization performance is still increasing. Many training algorithm implementations will manage this for you, and stop automatically when test performance starts to decrease due to overfitting. This is a big benefit of machine learning over your current straight average: the ability to learn what generalizes and throw away what does not.
Validate each model by predicting over the validation set, computing the MAPE, and comparing it to the MAPE of your original process on the same period. Good luck, and enjoy getting to know machine learning!
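To show what "stop automatically" can look like, here is a toy sketch of early stopping, assuming a one-feature linear model trained by gradient descent. It follows the naming used above, where the test set is the one monitored during training. The data, the learning rate and the `patience` parameter are made up for illustration; real libraries expose this differently.

```python
def mean_squared_error(w, b, data):
    """Mean squared error of the line y = w * x + b on (x, y) pairs."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

def train_with_early_stopping(train, test, lr=0.01, max_epochs=1000, patience=5):
    """Fit y = w * x + b, keeping the parameters with the best test error."""
    w, b = 0.0, 0.0
    best_error, best_w, best_b = mean_squared_error(w, b, test), w, b
    epochs_without_improvement = 0
    epochs_run = 0
    for _ in range(max_epochs):
        # One full-batch gradient descent step on the training set.
        n = len(train)
        grad_w = sum(2 * (w * x + b - y) * x for x, y in train) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in train) / n
        w -= lr * grad_w
        b -= lr * grad_b
        epochs_run += 1
        # Check the held-out error after every epoch.
        error = mean_squared_error(w, b, test)
        if error < best_error:
            best_error, best_w, best_b = error, w, b
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # test error stopped improving
    return best_w, best_b, best_error, epochs_run

# Hypothetical (x, y) observations.
train_set = [(0, 1.2), (1, 2.9), (2, 5.1), (3, 6.8), (4, 9.1)]
test_set = [(5, 11.2), (6, 12.9)]

initial_error = mean_squared_error(0.0, 0.0, test_set)
w, b, best_error, epochs_run = train_with_early_stopping(train_set, test_set)
print(f"stopped after {epochs_run} epochs, test MSE {best_error:.3f}")
```

The function returns the parameters from the best epoch rather than the last one, which is the usual reason for tracking the held-out error at all.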
Related
The basic process for most supervised machine learning problems is to divide the dataset into a training set and test set and then train a model on the training set and evaluate its performance on the test set. But in many (most) settings, disease diagnosis for example, more data will be available in the future. How can I use this to improve upon the model? Do I need to retrain from scratch? When might be the appropriate time to retrain if this is the case (e.g., a specific percent of additional data points)?
Let’s take the example of predicting house prices. House prices change all the time. The data you used to train a machine learning model that predicts house prices six months ago could provide terrible predictions today. For house prices, it’s imperative that you have up-to-date information to train your models.
When designing a machine learning system it is important to understand how your data is going to change over time. A well-architected system should take this into account, and a plan should be put in place for keeping your models updated.
Manual retraining
One way to maintain models with fresh data is to train and deploy your models using the same process you used to build your models in the first place. As you can imagine this process can be time-consuming. How often do you retrain your models? Weekly? Daily? There is a balance between cost and benefit. Costs in model retraining include:
Computational Costs
Labor Costs
Implementation Costs
On the other hand, as you are manually retraining your models you may discover a new algorithm or a different set of features that provide improved accuracy.
Continuous learning
Another way to keep your models up-to-date is to have an automated system to continuously evaluate and retrain your models. This type of system is often referred to as continuous learning, and may look something like this:
Save new training data as you receive it. For example, if you are receiving updated prices of houses on the market, save that information to a database.
When you have enough new data, test your machine learning model's accuracy against it.
If you see the accuracy of your model degrading over time, use the new data, or a combination of the new data and old training data to build and deploy a new model.
The benefit to a continuous learning system is that it can be completely automated.
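The three steps above can be sketched roughly as follows. The `MeanPriceModel` class is a deliberately trivial stand-in for a real model, and the names `tolerance` and `min_batch`, like the prices, are assumptions for the example rather than part of any particular system.

```python
from statistics import mean

class MeanPriceModel:
    """Stand-in model: predicts the mean of the prices it was trained on."""

    def __init__(self, prices):
        self.prediction = mean(prices)

    def error(self, prices):
        """Mean absolute error of the constant prediction on `prices`."""
        return mean(abs(p - self.prediction) for p in prices)

def continuous_learning_step(model, stored, new_batch, baseline_error,
                             tolerance=1.5, min_batch=3):
    """Run one cycle and return (model, stored data, whether we retrained)."""
    stored = stored + new_batch              # 1. save the new data
    if len(new_batch) < min_batch:           # 2. wait until there is enough
        return model, stored, False
    degraded = model.error(new_batch) > tolerance * baseline_error
    if degraded:                             # 3. retrain on old + new data
        model = MeanPriceModel(stored)
    return model, stored, degraded

# Hypothetical house prices (in thousands).
old_prices = [200, 210, 190]
model = MeanPriceModel(old_prices)
baseline_error = model.error(old_prices)

new_prices = [260, 270, 250]  # the market has moved
model, stored, retrained = continuous_learning_step(
    model, old_prices, new_prices, baseline_error)
print(retrained, model.prediction)
```

Whether to retrain on new data only or on old and new data combined depends on how quickly the old data goes stale; the sketch uses the combination.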
I just have a general question:
In a previous job, I was tasked with building a series of non-linear models to quantify the impact of certain factors on the number of medical claims filed. We had a set of variables we would use in all models (e.g., state, year, sex, etc.). We used all of our data to build these models, meaning we never split the data into training and test data sets.
If I were to go back in time to this job and split the data into training and test data sets, what would the advantages of that approach be, besides assessing the prediction accuracy of our models? What is an argument for not splitting the data and then fitting the model? I never really thought about it too much until now, and I'm curious as to why we didn't take that approach.
Thanks!
The sole purpose of setting aside a test set is to assess prediction accuracy. However, there is more to this than just checking the number and thinking "huh, that's how my model performs"!
Knowing how your model performs at a given moment gives you an important benchmark for potential improvements of the model. How will you know otherwise whether adding a feature increases model performance? Moreover, how do you know otherwise whether your model is at all better than mere random guessing? Sometimes, extremely simple models outperform the more complex ones.
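To make the "better than guessing" check concrete, here is a small sketch that scores a model's test predictions against a majority-class baseline. The labels and predictions are invented for the example.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def majority_baseline(y_train, n_predictions):
    """Always predict the most frequent label seen in training."""
    most_common_label = Counter(y_train).most_common(1)[0][0]
    return [most_common_label] * n_predictions

# Hypothetical labels: did a policy holder file a claim?
y_train = ["no"] * 8 + ["yes"] * 2
y_test = ["no", "no", "no", "yes", "no"]
model_predictions = ["no", "no", "no", "no", "no"]  # made-up model output

model_accuracy = accuracy(y_test, model_predictions)
baseline_accuracy = accuracy(y_test, majority_baseline(y_train, len(y_test)))
print(model_accuracy, baseline_accuracy)
```

Here the model's 80% accuracy looks respectable in isolation, but it is exactly what the trivial baseline achieves, so the model has learned nothing useful. Without a test set you would not see that.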
Another thing is removal of features or observations. This depends a bit on the kind of models you use, but some models (e.g., k-Nearest-Neighbors) perform significantly better if you remove unimportant features from the data. Similarly, suppose you add more training data and suddenly your model's test performance drops significantly. Perhaps there is something wrong with the new observations? You should be aware of these things.
The only argument I can think of for not using a test set is that otherwise you'd have too little training data for the model to perform optimally.
Scenario - I have data that does not have labels but I can create a function to label the data based on behavior and deploy the model so I don't have to keep labeling the data. Is this considered machine learning?
Objective: classify accounts with Volume spikes based on high medium low labels to deploy on big data (trillions of lines of data)
Data: the data I have includes the following attributes:
Account, Time, Date, Volume amount.
Method:
Create a new feature column called "spike" and create a pandas function to ID a spike greater than 5. Is this feature engineering?
Next, I create my label column and classify it as a low, medium, or high spike.
Next, I train a machine learning classifier and deploy it to label future accounts with similar patterns in big data.
Thoughts on this process? Is this approach correct for Machine learning?
1st question:
If your algorithm makes the decision, that is, puts a label on a sample, based on the set of samples that you have, I'd say it's a machine learning algorithm. But if you write code that encodes your own experience with the data, I think it's not an ML method. In brief, ML looks at the data to get patterns and insights from it. I don't know why you're doing that, but does it need to be an ML algorithm? Sometimes you can solve the problem in a very simple way, without using ML.
2nd question: I'm afraid not. Selecting your data attributes (e.g., Account, Time, Date, Volume amount), checking their correlations, trying to figure out whether one of them is dominant, etc. is a pre-ML process. Feature engineering selects the best features to present to the algorithm in order to perform the classification (in your case).
3rd question: I think it's fair enough to start playing with some ML algorithms, such as KNN, SVM, NNs, Decision Tree, etc.
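To make the distinction from the 1st answer concrete, the labelling step described in the question is a hand-written rule, roughly like the sketch below. The question does not say what "greater than 5" is measured against, so the ratio to the account's median volume, the second threshold and the account names are all assumptions for illustration.

```python
from statistics import median

def spike_ratio(volume, typical_volume):
    """How many times larger `volume` is than the account's typical volume."""
    return volume / typical_volume if typical_volume else 0.0

def label_spike(ratio, medium_at=5.0, high_at=10.0):
    """Hand-written rule: the thresholds come from the analyst, not from data."""
    if ratio >= high_at:
        return "high"
    if ratio >= medium_at:
        return "medium"
    return "low"

# Hypothetical volume history per account; the last value is the newest.
volumes = {
    "acct-1": [100, 110, 90, 1200],
    "acct-2": [50, 55, 45, 300],
    "acct-3": [70, 80, 75, 85],
}

labels = {}
for account, series in volumes.items():
    typical = median(series[:-1])  # typical volume before the newest value
    labels[account] = label_spike(spike_ratio(series[-1], typical))
print(labels)
```

Nothing here is learned from the data. A classifier trained on labels produced this way can at best reproduce the rule, so it is worth asking whether applying the function directly (for example through pandas `apply`) already solves the problem.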
I have a problem whereby our users receive the balance of an account each day, and based on the balance, perform an action.
Given the list of historical balances and resulting actions, is it possible to use machine learning to predict the future actions? Preferably in the .net platform.
Thanks.
Ark
I've never used .NET for any data analytics, but I'm sure it won't be too terribly difficult to transpose what I say here into logic in .NET.
One of the things people don't like about data science is that, in order to see whether something IS actually possible (predicting future outcomes in this case), you need to do a lot of exploring to see whether the data has enough of a pattern to be learned (by either a human or an ML algorithm).
The way to do this would be to shuffle and split the data in some way...let's say into one group with 70 percent of the data and a second with 30 percent of the data.
Once you do this, you want to train some algorithm with the first group (training set) and use the second group (test set) to verify the accuracy of your algorithm.
So how do you choose an algorithm? That's the trickiest part. Only you can say which is best for your particular scenario, given full access to the data. However, given that your output seems to be very discrete (let's say max 5 actions), this is a supervised learning classification problem. I'd do some analysis using one of these algorithms (SVM, kNN, and decision trees are a few popular ones), and use a classification metric like the F1 score or accuracy to determine how well your fitted algorithm performs on your test set.
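Since I can't offer .NET code, here is the whole procedure as a plain Python sketch that should be easy to port: shuffle, split 70/30, fit a small kNN classifier on the balance alone, and score it with macro-averaged F1. The balances, the three actions and the rule that generates them are invented for the example.

```python
import random
from collections import Counter

def shuffle_split(rows, train_fraction=0.7, seed=0):
    """Shuffle a copy of the rows and split it into (train, test)."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = round(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]

def knn_predict(train, balance, k=3):
    """Majority vote among the k training rows with the closest balance."""
    nearest = sorted(train, key=lambda row: abs(row[0] - balance))[:k]
    votes = Counter(action for _, action in nearest)
    return votes.most_common(1)[0][0]

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 over the classes in y_true."""
    scores = []
    for label in sorted(set(y_true)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denominator = 2 * tp + fp + fn
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores)

def action_for(balance):
    """Hypothetical user behaviour used only to generate example data."""
    if balance < 0:
        return "deposit"
    if balance <= 1000:
        return "hold"
    return "invest"

rows = [(balance, action_for(balance)) for balance in range(-500, 2000, 50)]
train, test = shuffle_split(rows)
y_true = [action for _, action in test]
y_pred = [knn_predict(train, balance) for balance, _ in test]
f1 = macro_f1(y_true, y_pred)
print(f"{len(train)} train rows, {len(test)} test rows, macro F1 = {f1:.2f}")
```

If the score on real data is no better than always predicting the most common action, the balances alone probably do not carry enough of a pattern.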
To perform supervised Machine Learning in .NET, the ML.NET Framework has been announced, and a preview is now available (as of 7th May 2018).
A good starting place for ML.NET is here.
I am currently working on a very small dataset of about 25 samples (200 features) and I need to perform model selection and also have a reliable classification accuracy. I was planning to split the dataset in a training set (for a 4-fold CV) and a test set (for testing on unseen data). The main problem is that the resulting accuracy obtained from the test set is not reliable enough.
So, could performing the cross-validation and testing multiple times solve the problem?
I was planning to run this process multiple times in order to have better confidence in the classification accuracy. For instance: I would run one cross-validation plus testing, and the output would be one "best" model plus the accuracy on the test set. On the next run I would perform the same process; however, the "best" model may not be the same. By performing this process multiple times, I would eventually end up with one predominant model, and the accuracy would be the average of the accuracies obtained with that model.
Since I have never heard of a testing framework like this one, does anyone have any suggestions or criticism regarding the proposed algorithm?
Thanks in advance.
The algorithm seems interesting, but you need to make lots of passes through the data and ensure that some specific model is really dominant (that it surfaces in a real majority of tests, not just 'more than others'). In general, a real problem in ML is having too little data. As anyone will tell you, it is not the team with the most complicated algorithm that wins, but the team with the most data.
In your case I would also suggest one additional approach - bootstrapping. Details are here:
what is the bootstrapped data in data mining?
It can also be googled. Long story short, it is sampling with replacement, which should help you to expand your dataset from 25 samples to something more interesting.
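A minimal sketch of bootstrap resampling, assuming you already have one correct/incorrect outcome per sample (the 18 hits and 7 misses below are made up). Each resample draws 25 items with replacement from the original 25, and the spread of the resampled accuracies gives a feel for how uncertain the estimate is.

```python
import random
from statistics import mean, stdev

def bootstrap_resamples(data, n_resamples, seed=0):
    """Draw resamples of the same size as `data`, with replacement."""
    rng = random.Random(seed)
    return [rng.choices(data, k=len(data)) for _ in range(n_resamples)]

# Hypothetical outcomes for 25 samples: 1 = classified correctly, 0 = not.
outcomes = [1] * 18 + [0] * 7

resamples = bootstrap_resamples(outcomes, n_resamples=200)
accuracies = [mean(resample) for resample in resamples]
print(f"accuracy {mean(accuracies):.2f} +/- {stdev(accuracies):.2f}")
```

Keep in mind that every resample contains only observations that were already in the original 25, so this helps with estimating variability rather than with adding new information.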
When the data is small like yours, you should consider 'LOOCV', or leave-one-out cross-validation. In this case you partition the data 25 different ways, and in each one a single different observation is held out. Performance is then calculated using the 25 individual held-out predictions.
This will allow you to use the most data in your modeling and you will still have a good measure of performance.
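As a rough sketch of the procedure, here is LOOCV in plain Python with a 1-nearest-neighbour classifier standing in for whatever model you actually use. The six two-feature samples are invented; with your data there would be 25 iterations instead of 6.

```python
def nearest_neighbor_predict(train, features):
    """Return the label of the training sample closest to `features`."""
    def squared_distance(row):
        return sum((a - b) ** 2 for a, b in zip(row[0], features))
    return min(train, key=squared_distance)[1]

def loocv_accuracy(samples):
    """Hold out each sample once, train on the rest, and score the prediction."""
    hits = 0
    for i, (features, label) in enumerate(samples):
        train = samples[:i] + samples[i + 1:]  # everything except sample i
        hits += nearest_neighbor_predict(train, features) == label
    return hits / len(samples)

# Hypothetical (features, label) samples.
samples = [
    ((0.0, 0.1), "a"), ((0.2, 0.0), "a"), ((0.1, 0.3), "a"),
    ((1.0, 1.1), "b"), ((0.9, 1.0), "b"), ((1.2, 0.9), "b"),
]
score = loocv_accuracy(samples)
print(f"LOOCV accuracy: {score:.2f}")
```

Any feature selection or model selection should happen inside the loop, on the training part only, otherwise the held-out sample leaks into the choice and the accuracy comes out too optimistic.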