Does adding a date column lead to overfitting? - machine-learning

I'm working on a dataset and want to predict whether it will rain or not. Should I include the date column? I haven't built the model yet, but I think it will lead to overfitting.

I don't think the raw datetime is a vital feature. A more useful feature could be the season, though nowadays seasonal patterns are shifting rapidly due to climate change and so on.
In any case, since this is a time-series problem, the results depend much more on the conditions of the prior days, though of course there are subtle changes that make it harder to predict.
There is some existing work you can find below:
https://pdfs.semanticscholar.org/2761/8afb77c5081d942640333528943149a66edd.pdf
(uses 2 prior days' info as features)
https://stackabuse.com/using-machine-learning-to-predict-the-weather-part-1/ (uses 3 prior days' info as features)
I think these are good starting points.
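If it helps, here is a rough sketch (not taken from either reference) of turning prior days' observations into lag features with pandas. The file name and column names ("weather.csv", "rain", "temp", "humidity") are made up for illustration:

```python
# Rough sketch: build "prior days" lag features with pandas.
# File and column names are hypothetical placeholders.
import pandas as pd

df = pd.read_csv("weather.csv", parse_dates=["date"]).sort_values("date")

# Use the previous 2 days' observations as features, as in the first paper above.
for lag in (1, 2):
    for col in ("rain", "temp", "humidity"):
        df[f"{col}_lag{lag}"] = df[col].shift(lag)

df = df.dropna()                       # the first rows have no history yet
X = df.drop(columns=["date", "rain"])  # the raw date itself is not used as a feature
y = df["rain"]                         # target: did it rain on this day?
```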

Related

What model to use for forecasting when customers will arrive?

I'm trying to build a model which will give the probability that every customer in a database will show up on a certain day (i.e. I pass in 8/25/19 and get back the list of all customers with their respective probabilities). I have the logs of all customers' transactions and their dates. I'm thinking of using some sort of RNN to do this. Is this the proper way to do it? If not, what is the best way? I want to discover the patterns and high-confidence leads for which customers will show up. There are around 400,000 records spanning 3 years.
You have time series data.
An RNN is a good starting point. Check out these step-by-step instructions for sales prediction. An RNN is an easy start and might give you really good quality. There is also an adaptation of the xgboost algorithm for time series that gives good quality as well, but might be slower.
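Not the linked tutorial itself, just a minimal sketch of how the RNN framing could look if you encode each customer's recent activity as a binary sequence. The 30-day window and the shapes are assumptions, and the data below is random placeholder data:

```python
# Minimal sketch: "will this customer show up tomorrow?" as sequence classification.
import numpy as np
import tensorflow as tf

SEQ_LEN = 30  # look back 30 days of activity per customer (an arbitrary choice)

# X: per-customer binary activity history, shape (n_customers, SEQ_LEN, 1)
# y: whether the customer showed up on the following day
# Random placeholders below; in practice these come from your transaction logs.
X = np.random.randint(0, 2, size=(1000, SEQ_LEN, 1)).astype("float32")
y = np.random.randint(0, 2, size=(1000,)).astype("float32")

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(SEQ_LEN, 1)),
    tf.keras.layers.LSTM(32),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # P(customer shows up tomorrow)
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=[tf.keras.metrics.AUC()])
model.fit(X, y, epochs=5, batch_size=64, validation_split=0.2)

probs = model.predict(X).ravel()  # one show-up probability per customer
```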
Good luck!

How to tell if you can apply Machine Learning to a project?

I am working on a personal project in which I log data from my city's bike rental service in a MySQL database. A script runs every thirty minutes and logs, for every bike station, how many free bikes it has. Then, in my database, I average each station's availability for each day at that given time, which serves, as of today, as an approximate prediction based on 2 months of data logging.
I've read a bit about machine learning and I'd like to learn more. Would it be possible to train a model with my data and make better predictions with ML in the future?
The answer is very likely yes.
The first step is to have some data, and it sounds like you do. You have a response (free bikes) and some features on which it varies (time, location). You have already applied a basic conditional means model by averaging values over factors.
You might augment the data you know about locations with some calendar events like holiday or local event flags.
Prepare a data set with one row per observation, and benchmark the accuracy of your current forecasting process for a period of time on a metric like Mean Absolute Percentage Error (MAPE). Ensure your predictions (averages) for the validation period do not include any of the data within the validation period!
Use the data for this period to validate other models you try.
Split off part of the remaining data into a test set, and use the rest for training. If you have a lot of data, then a common training/test split is 70/30. If the data is small you might go down to 90/10.
Learn one or more machine learning models on the training set, checking performance periodically on the test set to ensure generalization performance is still increasing. Many training algorithm implementations will manage this for you and stop automatically when test performance starts to decrease due to overfitting. This is a big benefit of machine learning over your current straight average: the ability to learn what generalizes and throw away what does not.
Validate each model by predicting over the validation set, computing the MAPE, and comparing each model's MAPE to that of your original process over the same period. Good luck, and enjoy getting to know machine learning!
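As an illustration, here is a simplified sketch of the MAPE benchmark and the split described above. The DataFrame and its columns ("station_id", "hour", "weekday", "free_bikes") are hypothetical stand-ins for your MySQL log, and the synthetic data is only there to make the sketch run:

```python
# Simplified sketch: benchmark an ML model against the current per-station average using MAPE.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
df = pd.DataFrame({               # synthetic stand-in for the half-hourly log
    "station_id": rng.integers(0, 20, n),
    "hour": rng.integers(0, 24, n),
    "weekday": rng.integers(0, 7, n),
    "free_bikes": rng.poisson(5, n),
})

def mape(y_true, y_pred):
    """Mean Absolute Percentage Error, ignoring rows where the true value is zero."""
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    mask = y_true != 0
    return np.mean(np.abs((y_true[mask] - y_pred[mask]) / y_true[mask])) * 100

features = ["station_id", "hour", "weekday"]
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["free_bikes"], test_size=0.3, shuffle=False  # keep time order
)

# Benchmark: the current "average per station" prediction, computed from training data only
baseline = y_train.groupby(X_train["station_id"]).mean()
baseline_pred = X_test["station_id"].map(baseline).fillna(y_train.mean())
print("Baseline MAPE:", mape(y_test, baseline_pred))

# Candidate ML model
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_train, y_train)
print("Model MAPE:   ", mape(y_test, model.predict(X_test)))
```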

Methods to Find 'Best' Cut-Off Point for a Continuous Target Variable

I am working on a machine learning scenario where the target variable is Duration of power outages.
The distribution of the target variable is severely skewed right. (You can imagine that most power outages occur and are over fairly quickly, but there are many, many outliers that can last much longer.) A lot of these power outages become less and less 'explainable' by data as the durations get longer. They become more or less 'unique outages', where events are occurring on site that are not necessarily 'typical' of other outages, nor is data recorded on the specifics of those events beyond what's already available for all other 'typical' outages.
This causes a problem when creating models. This unexplainable data mingles with the explainable part and skews the model's ability to predict.
I analyzed some percentiles to decide on a point that I considered to encompass as many outages as possible while I still believed that the duration was going to be mostly explainable. This was somewhere around the 320 minute mark and contained about 90% of the outages.
This was completely subjective to my opinion, though, and I know there has to be some kind of procedure to determine a 'best' cut-off point for this target variable. Ideally, this procedure would be robust enough to consider the trade-off of encompassing as much data as possible, rather than telling me to make my cut-off 2 hours and thus cutting out a significant number of customers, since the purpose is to provide an accurate Estimated Restoration Time to as many customers as possible.
FYI: The methods of modeling I am using that appear to be working the best right now are random forests and conditional random forests. Methods I have used in this scenario include multiple linear regression, decision trees, random forests, and conditional random forests. MLR was by far the least effective. :(
I have exactly the same problem! I hope someone more informed brings their knowledge. I wonder to what point a long duration is something we want to discard or something we want to predict!
Also, I tried treating my data by log transforming it, and the density plot shows a funny artifact on the left side of the distribution (because I only have durations as integers, not floats). I think this helps; you should also log transform the features that have similar distributions.
I finally thought that the solution should be stratified sampling or giving weights to features, but I don't know exactly how to implement that. My attempts didn't produce any good results. Perhaps my data is too stochastic!
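For what it's worth, here is a minimal sketch of fitting on a log-transformed duration target with scikit-learn's TransformedTargetRegressor. The feature names and the synthetic data are made up; only the log1p/expm1 pattern is the point:

```python
# Minimal sketch: model log(1 + duration) and map predictions back to minutes.
import numpy as np
import pandas as pd
from sklearn.compose import TransformedTargetRegressor
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "num_customers": rng.integers(1, 500, 1000),   # hypothetical features
    "crew_distance": rng.uniform(0, 50, 1000),
})
# Right-skewed, integer-valued durations, as described in the question
df["duration_minutes"] = np.rint(rng.lognormal(mean=4, sigma=1, size=1000))

model = TransformedTargetRegressor(
    regressor=RandomForestRegressor(n_estimators=200, random_state=0),
    func=np.log1p,          # train on log(1 + duration) to tame the long right tail
    inverse_func=np.expm1,  # predictions come back on the original minute scale
)
model.fit(df[["num_customers", "crew_distance"]], df["duration_minutes"])
preds = model.predict(df[["num_customers", "crew_distance"]])
```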

What kind of classifier is used in the following scenario?

If I am building a weather predictor that will predict whether it will snow tomorrow, it is very easy to just answer "NO" straight away.
Obviously, if you evaluate such a classifier on every day of the year, it would be correct with an accuracy of about 95% (considering that I build it and test it in a region where it snows very rarely).
Of course, that is such a stupid classifier even if it has an accuracy of 95% because it is obviously more important to predict if it will snow during the winter months (Jan & Feb) as opposed to any other months.
So, if I have a lot of features that I collect about the previous day to predict if it will snow the next day or not, considering that there will be a feature that says which month/week of the year it is, how can I weigh this particular feature and design the classifier to solve this practical problem?
Of course, that is such a stupid classifier even if it has an accuracy of 95% because it is obviously more important to predict if it will snow during the winter months (Jan & Feb) as opposed to any other months.
Accuracy might not be the best measurement to use in your case. Consider using precision, recall and F1 score.
how can I weigh this particular feature and design the classifier to solve this practical problem?
I don't think you should weight any particular feature in any way. You should let your algorithm do that and use cross-validation to decide on the best parameters for your model, in order to also avoid overfitting.
If you say Jan and Feb are the most important months, consider only applying your model to those two months. If that's not possible, look into giving different weights to your classes (going to snow / not going to snow) based on their frequencies. This question discusses that issue; the concept should be understandable regardless of your language of choice.
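A minimal sketch of that idea with scikit-learn, using class weights plus precision/recall/F1 instead of plain accuracy. The synthetic data just imitates a roughly 5% "snow" rate:

```python
# Minimal sketch: imbalanced "snow / no snow" classification with class weights,
# evaluated with precision, recall and F1 rather than accuracy.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in for your features; ~5% of the days are "snow" days.
X, y = make_classification(n_samples=2000, weights=[0.95], random_state=0)

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# class_weight="balanced" up-weights the rare "snow" class automatically,
# instead of hand-weighting the month feature.
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_train, y_train)

print(classification_report(y_test, clf.predict(X_test),
                            target_names=["no snow", "snow"]))
```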

Using a feature as Input vs. using it to build Several Machines on SVM

I am an undergraduate student and for my graduation thesis I am using SVM to predict the arrival time of a bus to a bus stop in its route. After doing a lot of research and reading some papers I still have a key doubt about how to model my system.
We've decided which features to use and we are in the process of gathering the data required to perform the regression, but what is confusing us is the implications or consequences of using some features as input to the SVM versus building separate machines based on some of these features.
For instance, in this paper the authors built 4 SVMs for predicting bus arrival times: one for rush hour on sunny days, one for rush hour on rainy days, one for off-rush hours on sunny days, and one for off-rush hours on rainy days.
But in a follow-up paper on the same subject, they decided to use a single SVM with the weather condition and the rush/off-rush hour as inputs instead of splitting it into 4 SVMs as before.
I feel like this is the kind of thing that is more about experience so I would like to hear from you guys if anyone has any information about when to choose one of these approaches.
Thanks in advance.
There is no other way: you have to find out on your own. This is why you have to write this thesis. Nobody starts with a perfect solution. Everyone makes mistakes. Your problem is not easy and you cannot say what will work when you have never done anything similar. Try everything you found in the literature, compare the results, develop your own ideas, ...
Most important question: what is the data like?
Second question: what model do you expect to capture this?
So if you want to use SVMs for some reason, keep in mind that their basic mechanism is linear, and they can only capture non-linear phenomena if the data is transformed by a suitable kernel.
For a particular problem at hand that means:
Do you have reason (plots, insight into the nature of the problem) to believe your problem is linear(ly separable)? Just use one linear SVM.
Do you have reason to believe your problem consists of several linear subproblems? Use a linear SVM on each of the subproblems.
Does your data seem non-linearly grouped? Try an SVM with something like an RBF kernel.
Of course, you can just plug in and try, but checking the above may increase understanding of the problem.
In your particular problem I would go for a single SVM.
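For illustration, a rough sketch of that single-SVM setup, with the weather condition and rush hour encoded as input features. All column names and numbers here are made up:

```python
# Rough sketch: one SVR with rain / rush-hour as input features instead of four
# separate machines. Column names and values are hypothetical.
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

df = pd.DataFrame({
    "distance_km":   [1.2, 3.4, 2.1, 4.0, 2.8, 3.9],
    "is_rainy":      [0, 1, 0, 1, 1, 0],
    "is_rush_hour":  [1, 1, 0, 0, 1, 0],
    "travel_time_s": [300, 900, 400, 700, 850, 620],
})

features = ["distance_km", "is_rainy", "is_rush_hour"]
model = make_pipeline(StandardScaler(), SVR(kernel="rbf"))
model.fit(df[features], df["travel_time_s"])

new_trip = pd.DataFrame([[2.5, 1, 1]], columns=features)  # a rainy rush-hour trip
print(model.predict(new_trip))
```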
With my not-so-extensive experience, I would consider breaking a problem into several SVMs for the following reasons:
1) The classes are too different, or there are classes and subclasses in your problem.
E.g. in my case: there are several types of antibodies in a microscope image, and each may be positive or negative. So instead of defining A_Pos, A_Neg, B_Pos, B_Neg, ..., I first decide whether the image is positive or negative and determine the type with a second SVM.
2) The feature extraction is too expensive, provided you have groups of classes which can be identified with fewer features. Instead of extracting all features for a single machine, you may first extract only a small subset and, if required (the result does not have a high enough probability), extract further features.
3) You need to decide whether the instance belongs to the problem at all. Build a model containing one class and all instances of the training set. If the instance to be classified is an outlier, stop. Otherwise, classify it with a second SVM containing all classes.
The keyword is "cascaded SVM".
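A minimal sketch of point 3, the cascaded idea: a one-class "does this belong to the problem at all?" gate followed by a multi-class SVM. The data is synthetic and purely illustrative:

```python
# Minimal sketch of a cascaded SVM: outlier gate first, multi-class SVM second.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import OneClassSVM, SVC

X, y = make_blobs(n_samples=300, centers=3, random_state=0)  # synthetic training data

gate = OneClassSVM(gamma="scale", nu=0.05).fit(X)  # stage 1: does it belong at all?
clf = SVC(kernel="rbf").fit(X, y)                  # stage 2: which class is it?

def cascaded_predict(x):
    x = np.atleast_2d(x)
    if gate.predict(x)[0] == -1:   # flagged as an outlier: stop here
        return None
    return clf.predict(x)[0]

print(cascaded_predict(X[0]))            # looks like the training data -> class label
print(cascaded_predict([100.0, 100.0]))  # far from the training data -> None
```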
