Which methods are best for managing and predicting and labeling data in dynamic environment? The system data distribution changes and it is not static. The system can have different normal settings and under different settings, we have different normal data distributions. Consider we have two classes. Normal and abnormal. What happens? We cannot say that we can rely on historical data and train a simple classification method to predict future observations since one day after training the model, data distribution can change and old observations will become irrelevant to new ones. Consider the following figure:
Blue distribution and red distribution are normal data but under different setting and in the training time we have just one setting. This data is for one sensor. So, suppose we train a model with blue one and also have some abnormal samples. Imagine abnormals samples as normal samples with a little bit noise or fault in measurements. Then, we want to test the model but setting changes and now we have red distribution as our test observations. So, the model misclassifies the samples.
What are the best methods for a situation like this? Please note that I have tried several clustering algorithms but they cannot manage and distinguish between normal and abnormal samples.
Any suggestion and help are highly welcomed. Thanks
There are plenty of books on time series data.
In particular, on change detection. Your example can supposedly be considered a change in mean. There are statistical models to detect this.
Basseville, Michèle, and Igor V. Nikiforov. Detection of abrupt changes: theory and application. Vol. 104. Englewood Cliffs: Prentice Hall, 1993.
Related
Suppose you're trying to use machine learning for a classification task like, let's say, looking at photographs of animals and distinguishing horses from zebras. This task would seem to be within the state of the art.
But if you take a bunch of labelled photographs and throw them at something like a neural network or support vector machine, what happens in practice is that zebras are so much rarer than horses that the system just ends up learning to say 'always a horse' because this is actually the way to minimize its error.
Minimal error that may be but it's also not a very useful result. What is the recommended way to tell the system 'I want the best guess at which photographs are zebras, even if this does create some false positives'? There doesn't seem to be a lot of discussion of this problem.
One of the things I usually do with imbalanced classes (or skewed data sets) is simply generate more data. I think this is the best approach. You could go out in the real world and gather more data of the imbalanced class (e.g. find more pictures of zebras). You could also generate more data by simply making copies or duplicating it with transformations (e.g. flip horizontally).
You could also pick a classifier that uses an alternate evaluation (performance) metric over the one usually used - accuracy. Look at precision/recall/F1 score.
Week 6 of Andrew Ng's ML course talks about this topic: link
Here is another good web page I found on handling imbalanced classes: link
With this type of unbalanced data problem, it is a good approach to learn patterns associated with each class as opposed to simply comparing classes - this can be done via unsupervised learning learning first (such as with autoencoders). A good article with this available at https://www.r-bloggers.com/autoencoders-and-anomaly-detection-with-machine-learning-in-fraud-analytics/amp/. Another suggestion - after running the classifier, the confusion matrix can be used to determine where additional data should be pursued (I.e. many zebra errors)
I have a school project to make a program that uses the Weka tools to make predictions on football (soccer) games.
Since the algorithms are already there (the J48 algorithm), I need just the data. I found a website that offers football game data for free and I tried it in Weka but the predictions were pretty bad so I assume my data is not structured properly.
I need to extract the data from my source and format it another way in order to make new attributes and classes for my model. Does anyone know of a course/tutorial/guide on how to properly create your attributes and classes for machine learning predictions? Is there a standard that describes the best way of choosing the attributes of a data set for training a machine learning algorithm? What's the approach on this?
here's an example of the data that I have at the moment: http://www.football-data.co.uk/mmz4281/1516/E0.csv
and here is what the columns mean: http://www.football-data.co.uk/notes.txt
The problem may be that the data set you have is too small. Suppose you have ten variables and each variable has a range of 10 values. There are 10^10 possible configurations of these variables. It is unlikely your data set will be this large let alone cover all of the possible configurations. The trick is to narrow down the variables to the most relevant to avoid this large potential search space.
A second problem is that certain combinations of variables may be more significant than others.
The J48 algorithm attempts to to find the most relevant variable using entropy at each level in the tree. each path through the tree can be thought of as an AND condition: V1==a & V2==b ...
This covers the significance due to joint interactions. But what if the outcome is a result of A&B&C OR W&X&Y? The J48 algorithm will find only one and it will be the one where the the first variable selected will have the most overall significance when considered alone.
So, to answer your question, you need to not only find a training set which will cover the most common variable configurations in the "general" population but find an algorithm which will faithfully represent these training cases. Faithful meaning it will generally apply to unseen cases.
It's not an easy task. Many people and much money are involved in sports betting. If it were as easy as selecting the proper training set, you can be sure it would have been found by now.
EDIT:
It was asked in the comments how to you find the proper algorithm. The answer is the same way you find a needle in a haystack. There is no set rule. You may be lucky and stumble across it but in a large search space you won't ever know if you have. This is the same problem as finding the optimum point in a very convoluted search space.
A short-term answer is to
Think about what the algorithm can really accomplish. The J48 (and similar) algorithms are best suited for classification where the influence of the variables on the result are well known and follow a hierarchy. Flower classification is one example where it will likely excel.
Check the model against the training set. If it does poorly with the training set then it will likely have poor performance with unseen data. In general, you should expect the model to performance against the training to exceed the performance against unseen data.
The algorithm needs to be tested with data it has never seen. Testing against the training set, while a quick elimination test, will likely lead to overconfidence.
Reserve some of your data for testing. Weka provides a way to do this. The best case scenario would be to build the model on all cases except one (Leave On Out Approach) then see how the model performs on the average with these.
But this assumes the data at hand are not in some way biased.
A second pitfall is to let the test results bias the way you build the model.For example, trying different models parameters until you get an acceptable test response. With J48 it's not easy to allow this bias to creep in but if it did then you have just used your test set as an auxiliary training set.
Continue collecting more data; testing as long as possible. Even after all of the above, you still won't know how useful the algorithm is unless you can observe its performance against future cases. When what appears to be a good model starts behaving poorly then it's time to go back to the drawing board.
Surprisingly, there are a large number of fields (mostly in the soft sciences) which fail to see the need to verify the model with future data. But this is a matter better discussed elsewhere.
This may not be the answer you are looking for but it is the way things are.
In summary,
The training data set should cover the 'significant' variable configurations
You should verify the model against unseen data
Identifying (1) and doing (2) are the tricky bits. There is no cut-and-dried recipe to follow.
I'm curious as to whether research been done into random forests that combine unsupervised with supervised learning in a way allowing a single algorithm to find patterns in, and work with, multiple different data sets. I have googled every possible way to find research on this, and have come up empty. Can anyone point me in the right direction?
Note: I have already asked this question in the Data Sciences forum, but it's basically a dead forum so I came here.
(also read the comments and will incorporate the content in my answer)
From what I read between the lines is that you want to use Deep networks in a transfer learning setting. However, this would not be based on decision trees.
http://jmlr.csail.mit.edu/proceedings/papers/v27/mesnil12a/mesnil12a.pdf
There are many elements in your question:
1.) Machine learning algorithms, in general, don't care about the source of your data set. So basically you can feed the learning algorithms 20 different data sets and it will use all of them. However, the data should have the same underlying concept (except in the transfer learning case see below). This means: if you combine cats/dogs data with bills data this will not work or make it much harder for the algorithms. At least all input features need to be identical (exceptions exists), e.g, it is hard to combine images with text.
2.) labeled/unlabeled: Two important terms: a data set is a set of data points with a fixed number of dimensions. Datapoint i might be described as {Xi1,....Xin} where each Xi might for example be a pixel. A label Yi is from another domain, e.g., cats and dogs
3.) unsupervised learning data without any labels. (I have the gut feeling that this is not what you want.
4.) semi-supervised learning: The idea is basically that you combine data where you have labels with data without labels. Basically you have a set of images labeled as cats and dogs {Xi1,..,Xin,Yi} and a second set which contains images with cats/dogs but no labels {Xj1,..,Xjn}. The algorithm can use this information to build better classifiers as the unlabeld data provide information on how images look in general.
3.) transfer learning (I think this come closest to what you want). The Idea is that you provide a data set of cats and dogs and learn a classifier. Afterwards you want to train the classifier with images of cats/dogs/hamster. The training does not need to start from scratch but can use the cats/dogs classifier to converge much faster
4.) feature generation / feature construction The idea is that the algoritm learns features like "eyes". This features are used in the next step to learn the classifier. I'm mainly aware of this in the context of deep learning. Where the algoritm learns in the first step concepts like edges and constructs increasingly complex features like faces cats intolerant it can describe things like "the man on the elephant. This combined with transfer learning is probably what you want. However deep learning is based on Neural networks besides a few exceptions.
5.) outlier detection you provide a data set of cats/dogs as known images. When you provide the cats/dogs/hamster classifier. The classifier tells you that it has never seen something like a hamster before.
6.) active learning The idea is that you don't provide labels for all examples (Data points) beforehand, but that the algorithms asks you to label certain data points. This way you need to label much less data.
Essentially I have a data set, that has a feature vector, and label indicating whether it is spam or non-spam.
To get the labels for this data, 2 distinct types of expert were used each using different approaches to evaluate the item, the type of expert used then also became a feature in the vector.
Training and then testing on a separate portion of the data has achieved a high degree accuracy using a Random Forest algorithm.
However, it is clear now that, the feature describing the expert who made the label will not be available in a live environment. So I have tried a number of approaches to reflect this:
Remove the feature from the set and retrain and test
Split the data into 2 distinct sets based on the feature, and then train and test 2 separate classifiers
For the test data, set the feature in question all to the same value
With all 3 approaches, the classifiers have dropped from being highly accurate, to being virtually useless.
So I am looking for any advice or intuitions as to why this has occurred and how I might approach resolving it so as to regain some of the accuracy I was previously seeing?
To be clear I have no background in machine learning or statistics and am simply using a third party c# code library as a black box to achieve these results.
Sounds like you've completely overfit to the "who labeled what" feature (and combinations of this feature with other features). You can find out for sure by inspecting the random forest's feature importances and checking whether the annotator feature ranks high. Another way to find out is to let the annotators check each other's annotations and compute an agreement score such as Cohen's kappa. A low value, say less than .5, indicates disagreement among the annotators, which makes machine learning very hard.
Since the feature will not be available at test time, there's no easy way to get the performance back.
I am an undergraduate student and for my graduation thesis I am using SVM to predict the arrival time of a bus to a bus stop in its route. After doing a lot of research and reading some papers I still have a key doubt about how to model my system.
We've decided which features to use and we are in the process of gathering the data required to perform the regression, but what is confusing us are the implications or consequences of using some features as input for the SVM or building separated machines based on some of these features.
For instance, in this paper the authors built 4 SVMs for predicting bus arrival times: one for rush hour on sunny days, rush hour on rainy days, off-rush hour on sunny days and the last one for off-rush hours and rainy days.
But on a following paper on the same subejct they decided to use a single SVM with the weather condition and the rush/off-rush hour as input instead of breaking it in 4 SVMs as before.
I feel like this is the kind of thing that is more about experience so I would like to hear from you guys if anyone has any information about when to choose one of these approaches.
Thanks in advance.
There is no other way: you have to find out on your own. This is why you have to write this thesis. Nobody starts with a perfect solution. Everyone makes mistakes. Your problem is not easy and you cannot say what will work when you have never done anything similar. Try everything you found in the literature, compare the results, develop your own ideas, ...
Most important question: what is the data like?
Second question: what model do you expect to capture this?
So if you want to use SVMs for some reason, keep in mind their basic mechanism is linear, and can only capture non-linear phenomena if data is transformed by a suitable kernel.
For a particular problem at hand that means:
Do you have reason (plots, insights in the problem nature) to believe your problem is linear(ly separable)? Just use one linear svm.
Do you have reason your problem consist of several linear subproblems? Use a linear svm on each of the subproblems.
Does your data seem non-linearly grouped? Try an svm with something like rbf kernel.
Of course, you can just plug in and try, but checking the above may increase understanding of the problem.
In your particular problem I would go for single SVM.
With my not so extensive experience, I would consider breaking a problem in several SVMs for following reasons:
1)The classes are too different, or there are classes and subclasses in your problem.
E.g. in my case: there are several types of antibodies in a microscope image and they all may be positive or negative. So instead of defining A_Pos, A_Neg, B_Pos, B_Neg, ... I decide first if the image is positive or negative and determine the type in second SVM.
2)The feature extraction is too expensive. Provided you have groups of classes, which may be identified with fever features. Instead of extracting all features for a single machine, you may first extract only a small subset, and if required (result not with high enough probability) extract further features.
3)Decide whether the instance belongs to problem at all. Make a model containing one class and all instances of training set. If the instance to be classified is an outlier, stop. Otherwise classify with 2nd SVM containing all classes.
The key-word is "cascaded SVM"