If I am building a weather predictor that predicts whether it will snow tomorrow, it is very easy to just answer "NO" straight away.
Obviously, if you evaluate such a classifier on every day of the year, it would be correct with an accuracy of around 95% (given that I build and test it in a region where it snows very rarely).
Of course, that is such a stupid classifier, even if it has an accuracy of 95%, because it is obviously more important to predict whether it will snow during the winter months (January and February) than in any other month.
So, suppose I collect a lot of features about the previous day to predict whether it will snow the next day, and one of those features says which month/week of the year it is. How can I weight this particular feature and design the classifier to solve this practical problem?
> Of course, that is such a stupid classifier, even if it has an accuracy of 95%, because it is obviously more important to predict whether it will snow during the winter months (January and February) than in any other month.
Accuracy might not be the best metric in your case. Consider using precision, recall, and the F1 score.
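As a quick illustration, here is a minimal sketch (assuming scikit-learn; the labels are made up) of how these metrics expose the trivial "always NO" classifier: accuracy looks fine, but precision, recall, and F1 on the snow class collapse to zero.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical ground truth for 10 days (1 = snow, 0 = no snow)
y_true = [0, 0, 0, 0, 1, 0, 0, 0, 0, 1]
# The trivial classifier always predicts "no snow"
y_pred = [0] * len(y_true)

print("accuracy :", accuracy_score(y_true, y_pred))                    # 0.8
print("precision:", precision_score(y_true, y_pred, zero_division=0))  # 0.0
print("recall   :", recall_score(y_true, y_pred))                      # 0.0
print("f1       :", f1_score(y_true, y_pred, zero_division=0))         # 0.0
```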
> How can I weight this particular feature and design the classifier to solve this practical problem?
I don't think you should weight any particular feature by hand. Let the algorithm do that, and use cross-validation to choose the best parameters for your model, which also helps you avoid overfitting.
If January and February are the most important months, consider applying your model only to those two months. If that's not possible, look into giving different weights to your classes (going to snow / not going to snow) based on their counts. This question discusses that issue - the concept should be understandable regardless of your language of choice.
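As a concrete sketch of class weighting (assuming scikit-learn; the data here is synthetic, not yours): class_weight="balanced" makes the learner pay more attention to the rare class, and cross-validation with an F1 scorer checks whether that actually helps.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic, heavily imbalanced data standing in for "snow / no snow" days
X, y = make_classification(n_samples=1000, weights=[0.95], random_state=0)

# class_weight="balanced" reweights errors inversely to class frequency
clf = LogisticRegression(class_weight="balanced", max_iter=1000)

# Score with F1 so the rare class drives model selection, not raw accuracy
print(cross_val_score(clf, X, y, cv=5, scoring="f1").mean())
```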
I'm working on a dataset and want to predict whether it will rain or not; should I include the date column? I haven't built the model yet, but I think it will lead to overfitting.
I don't think the datetime is a vital feature. A useful feature could be the season, though nowadays seasons are shifting rapidly due to climate change and so on.
In any case, since it's a time-series problem, the results depend much more on the conditions of the prior days, though of course there are subtle changes that make it harder to predict.
There are some existing works you can find below:
https://pdfs.semanticscholar.org/2761/8afb77c5081d942640333528943149a66edd.pdf (used 2 prior days' info as features)
https://stackabuse.com/using-machine-learning-to-predict-the-weather-part-1/ (used 3 prior days' info as features)
I think these are good starting points.
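If it helps, here is a minimal sketch (assuming pandas; the DataFrame and column names are hypothetical) of what "using prior days' info as features" looks like in code:

```python
import pandas as pd

df = pd.DataFrame({
    "temp": [5.0, 3.0, -1.0, -2.0, 0.5, 2.0],
    "humidity": [60, 70, 85, 90, 80, 65],
    "rain": [0, 0, 1, 1, 0, 0],  # target: did it rain that day?
})

n_lags = 2  # use the two prior days as features
for col in ["temp", "humidity"]:
    for lag in range(1, n_lags + 1):
        df[f"{col}_lag{lag}"] = df[col].shift(lag)

df = df.dropna()          # the first n_lags rows have no full history
X = df.drop(columns="rain")
y = df["rain"]
```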
I have a project where I have to predict a label (whether something is suitable for a given situation or not) based on several variables.
E.g., imagine a mango market where the price of mangoes peaks at 12 pm and gradually drops after that, or vice versa. Mangoes can be of different colors and sizes (attributes that indicate whether they are properly ripened or not). Someone wants to buy mangoes from that market. What time, which color, and what size will be the best choice for that person (considering the different factors)?
I have done a small classification project with a boosted regression tree, and the classification result is awesome. But I can't seem to figure out how to make the algorithm decide which option is the better choice (rather than only classify).
Any kind of idea will be highly appreciated. If my question is kinda dumb, I'm sorry for that :( and thank you for your time.
I am currently working on my final project at university and I have to do some machine learning. I have to say I am not experienced with ML. I have data with a timestamp, a zone number (6 zones), and a number of calls. I need to predict the number of calls, and initially I decided to use multilinear regression. However, while researching I found out about time-series analysis, and I am now wondering which one would be better for making predictions in my case.
From what I understood, time-series analysis is good for forecasting, but is it good for short-term predictions, like predicting the number of calls tomorrow or next week? I want to make short-term predictions, at most one month ahead.
I have just read so much that I got confused!
I would very much appreciate it if you could advise me on which is better.
If I provided you with data sufficient to classify a bunch of objects as apples, oranges, or bananas, how long might it take you to build an SVM that could make that classification? I appreciate that it probably depends on the nature of the data, but are we more likely talking hours, days, or weeks?
Ok. Now that you have that SVM, and you have an understanding of how the data behaves, how long would it likely take you to upgrade that SVM (or build a new one) to classify an extra class (tomatoes) as well? Seconds? Minutes? Hours?
The motivation for the question is to assess the practical suitability of SVMs in a situation where not all data is available to be sampled at any given time. Fruit are an obvious case: they change colour and availability with the season.
If you would expect SVMs to be too fiddly to be able to create inside 5 minutes on demand, despite experience with the problem domain, then suggestions of a more user-friendly form of classifier for such a situation would be appreciated.
Generally, adding a class to a one-vs-many SVM classifier requires retraining on all classes. With large data sets, this can turn out to be quite expensive. In the real world, when facing very large data sets, if performance and flexibility are more important than state-of-the-art accuracy, Naive Bayes is quite widely used (adding a class to an NB classifier requires training the new class only).
However, according to your comment, which says the data has tens of dimensions and up to thousands of samples, the problem is relatively small, so in practice an SVM retrain can be performed very fast (probably on the order of seconds to tens of seconds).
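To illustrate why adding a class to Naive Bayes is cheap, here is a rough sketch (a toy implementation, not a library API): each class is summarized independently from its own samples, so a new class needs no retraining of the others, whereas a one-vs-many SVM must be refit on all data.

```python
import numpy as np

class TinyGaussianNB:
    def __init__(self):
        self.stats = {}  # class label -> (mean, std, sample count)

    def add_class(self, label, X):
        """Train (or replace) a single class from its own samples only."""
        X = np.asarray(X, dtype=float)
        self.stats[label] = (X.mean(axis=0), X.std(axis=0) + 1e-9, len(X))

    def predict(self, x):
        total = sum(c for _, _, c in self.stats.values())
        best, best_lp = None, -np.inf
        for label, (mu, sd, count) in self.stats.items():
            # log prior + per-feature Gaussian log likelihoods (constants dropped)
            lp = np.log(count / total)
            lp += np.sum(-0.5 * ((x - mu) / sd) ** 2 - np.log(sd))
            if lp > best_lp:
                best, best_lp = label, lp
        return best

nb = TinyGaussianNB()
nb.add_class("apple", [[1.0, 0.2], [1.1, 0.3]])
nb.add_class("orange", [[2.0, 1.0], [2.2, 1.1]])
nb.add_class("tomato", [[3.0, 0.5], [3.1, 0.4]])  # new class, no retraining
print(nb.predict(np.array([3.0, 0.45])))          # -> "tomato"
```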
You need to give us more details about your problem, since there are too many different scenarios: SVMs can be trained fairly quickly (I could train one in real time in a third-person shooter game without any latency), or training can take much longer (I have a face-detector case where training took an hour).
As a rule of thumb, the training time grows with the number of samples and the dimension of each vector.
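A small sketch of that rule of thumb (assuming scikit-learn; the data is synthetic): time an RBF-kernel SVC fit as the sample count grows.

```python
import time
from sklearn.datasets import make_classification
from sklearn.svm import SVC

for n in (500, 1000, 2000, 4000):
    X, y = make_classification(n_samples=n, n_features=30, random_state=0)
    t0 = time.perf_counter()
    SVC(kernel="rbf").fit(X, y)
    print(n, "samples:", round(time.perf_counter() - t0, 3), "s")
```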
I am an undergraduate student, and for my graduation thesis I am using an SVM to predict the arrival time of a bus at a bus stop on its route. After doing a lot of research and reading some papers, I still have a key doubt about how to model my system.
We've decided which features to use and are in the process of gathering the data required to perform the regression, but what confuses us are the implications of using some of these features as inputs to a single SVM versus building separate machines based on them.
For instance, in this paper the authors built 4 SVMs to predict bus arrival times: one for rush hour on sunny days, one for rush hour on rainy days, one for off-rush hours on sunny days, and one for off-rush hours on rainy days.
But in a follow-up paper on the same subject they decided to use a single SVM with the weather condition and the rush/off-rush hour as inputs, instead of splitting it into 4 SVMs as before.
I feel like this is the kind of thing that comes down to experience, so I would like to hear from you: does anyone have any guidance on when to choose one of these approaches?
Thanks in advance.
There is no other way: you have to find out on your own. This is why you have to write this thesis. Nobody starts with a perfect solution, and everyone makes mistakes. Your problem is not easy, and you cannot say what will work when you have never done anything similar. Try everything you find in the literature, compare the results, develop your own ideas, ...
Most important question: what is the data like?
Second question: what model do you expect to capture this?
So if you want to use SVMs for some reason, keep in mind that their basic mechanism is linear; they can only capture non-linear phenomena if the data is transformed by a suitable kernel.
For a particular problem at hand that means:
Do you have reason (plots, insight into the nature of the problem) to believe your problem is linear(ly separable)? Just use one linear SVM.
Do you have reason to believe your problem consists of several linear subproblems? Use a linear SVM on each of the subproblems.
Does your data seem non-linearly grouped? Try an SVM with something like an RBF kernel.
Of course, you can just plug in and try, but checking the above may increase understanding of the problem.
In your particular problem I would go for a single SVM.
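Here is a minimal sketch of the checks above (assuming scikit-learn; the data is synthetic and intentionally non-linear): compare a linear SVM against an RBF-kernel SVM with cross-validation and keep whichever the data supports.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Concentric circles: not linearly separable by construction
X, y = make_circles(n_samples=300, noise=0.1, factor=0.4, random_state=0)

for kernel in ("linear", "rbf"):
    scores = cross_val_score(SVC(kernel=kernel), X, y, cv=5)
    print(kernel, scores.mean())  # rbf should win on this non-linear data
```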
With my not-so-extensive experience, I would consider breaking a problem into several SVMs for the following reasons:
1) The classes are too different, or there are classes and subclasses in your problem.
E.g., in my case there are several types of antibodies in a microscope image, and each may be positive or negative. So instead of defining A_Pos, A_Neg, B_Pos, B_Neg, ..., I first decide whether the image is positive or negative and determine the type with a second SVM.
2) The feature extraction is too expensive, provided you have groups of classes which may be identified with fewer features. Instead of extracting all features for a single machine, you may first extract only a small subset and, if required (if the result does not come with a high enough probability), extract further features.
3) Deciding whether the instance belongs to the problem at all (sketched below). Build a model containing one class with all instances of the training set. If the instance to be classified is an outlier, stop; otherwise classify it with a second SVM containing all classes.
The keyword is "cascaded SVM".
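Here is a minimal sketch of idea 3) (assuming scikit-learn; the data is synthetic): a one-class SVM acts as the first stage that rejects outliers, and only accepted instances reach the multiclass SVM.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import OneClassSVM, SVC

X, y = make_blobs(n_samples=300, centers=3, random_state=0)

gate = OneClassSVM(nu=0.05, gamma="scale").fit(X)  # stage 1: outlier gate
clf = SVC(kernel="rbf").fit(X, y)                  # stage 2: actual classes

def cascade_predict(x):
    x = np.asarray(x).reshape(1, -1)
    if gate.predict(x)[0] == -1:        # -1 means outlier: stop here
        return "outlier"
    return clf.predict(x)[0]            # otherwise classify normally

print(cascade_predict(X[0]))            # a training point -> its class
print(cascade_predict([100.0, 100.0]))  # far from the data -> "outlier"
```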