I'm new to machine learning.
I've got a huge database of sensor data from weather stations. Those sensors can be broken or have odd values. Broken sensors influences the calculations that are being done with that data.
The goal is to use machine-learning to detect if new sensor values are odd and mark them as broken if so. As said, I'm new to ML. Can somebody push me in the right direction or give feedback to my approach.
The data has a datetime and a value. The sensor values are being pushed every hour.
I appreciate any kind of help!
Since the question is pretty general in nature, I will provide some basic thoughts. Maybe you are already slightly familiar with them.
Set up a dataset that contains both broken sensors, as well as good sensors. That is the dependent variable. With that set you also have some variables that might predict the Y variable. Let's call them X.
You train a model to learn te relationship between X and Y.
You predict, based on X values where you do not know the outcome, what Y will be.
Some useful insight on the basics, is here:
https://www.youtube.com/watch?v=elojMnjn4kk&list=PL5-da3qGB5ICeMbQuqbbCOQWcS6OYBr5A
Good Luck!
You could use Isolation Forest to detect abnormal readings.
Twitter has developed a algorithm called ESD (Extreme Studentized Deviate) also useful.
https://github.com/twitter/AnomalyDetection/
However a good EDA (Exploratory data analysis) is needed to define the types of abnormality found in the readings due to faulty sensors.
1) Step kind of trend, where suddenly the value increases and remains increased or decreased as well
2) Gradual increase in the value compared to other sensors and suddenly very high increase
3) Intermittent spike in the data
Related
I am working on a machine learning scenario where the target variable is Duration of power outages.
The distribution of the target variable is severely skewed right (You can imagine most power outages occur and are over with fairly quick, but then there are many, many outliers that can last much longer) A lot of these power outages become less and less 'explainable' by data as the durations get longer and longer. They become more or less, 'unique outages', where events are occurring on site that are not necessarily 'typical' of other outages nor is data recorded on the specifics of those events outside of what's already available for all other 'typical' outages.
This causes a problem when creating models. This unexplainable data mingles in with the explainable parts and skews the models ability to predict as well.
I analyzed some percentiles to decide on a point that I considered to encompass as many outages as possible while I still believed that the duration was going to be mostly explainable. This was somewhere around the 320 minute mark and contained about 90% of the outages.
This was completely subjective to my opinion though and I know there has to be some kind of procedure in order to determine a 'best' cut-off point for this target variable. Ideally, I would like this procedure to be robust enough to consider the trade-off of encompassing as much data as possible and not telling me to make my cut-off 2 hours and thus cutting out a significant amount of customers as the purpose of this is to provide an accurate Estimated Restoration Time to as many customers as possible.
FYI: The methods of modeling I am using that appear to be working the best right now are random forests and conditional random forests. Methods I have used in this scenario include multiple linear regression, decision trees, random forests, and conditional random forests. MLR was by far the least effective. :(
I have exactly the same problem! I hope someone more informed brings his knowledge. I wander to what point is a long duration something that we want to discard or that we want to predict!
Also, I tried treating my data by log transforming it, and the density plot shows a funny artifact on the left side of the distribution ( because I only have durations of integer numbers, not floats). I think this helps, you also should log transform the features that have similar distributions.
I finally thought that the solution should be stratified sampling or giving weights to features, but I don't know exactly how to implement that. My tries didn't produce any good results. Perhaps my data is too stochastic!
Essentially I have a data set, that has a feature vector, and label indicating whether it is spam or non-spam.
To get the labels for this data, 2 distinct types of expert were used each using different approaches to evaluate the item, the type of expert used then also became a feature in the vector.
Training and then testing on a separate portion of the data has achieved a high degree accuracy using a Random Forest algorithm.
However, it is clear now that, the feature describing the expert who made the label will not be available in a live environment. So I have tried a number of approaches to reflect this:
Remove the feature from the set and retrain and test
Split the data into 2 distinct sets based on the feature, and then train and test 2 separate classifiers
For the test data, set the feature in question all to the same value
With all 3 approaches, the classifiers have dropped from being highly accurate, to being virtually useless.
So I am looking for any advice or intuitions as to why this has occurred and how I might approach resolving it so as to regain some of the accuracy I was previously seeing?
To be clear I have no background in machine learning or statistics and am simply using a third party c# code library as a black box to achieve these results.
Sounds like you've completely overfit to the "who labeled what" feature (and combinations of this feature with other features). You can find out for sure by inspecting the random forest's feature importances and checking whether the annotator feature ranks high. Another way to find out is to let the annotators check each other's annotations and compute an agreement score such as Cohen's kappa. A low value, say less than .5, indicates disagreement among the annotators, which makes machine learning very hard.
Since the feature will not be available at test time, there's no easy way to get the performance back.
I've been tasked to carry out a benchmark of an existing classifier for my company. The biggest problem currently is differentiating between different type of transportation's, like recognizing if i'm currently in a train, driving a car or bicycling so this is the main focus.
I've been reading alot about LSTM, http://en.wikipedia.org/wiki/Long_short_term_memory and its recent success in handwriting and speech-recognition, where the time between significant events could be pretty long.
So, my first thought about the problem with train/bus is that there probably isn't such a clear and short cycle as there is when walking/running for instance so long-term memory is probably crucial.
Have anyone tried anything similar with decent results?
Or is there other techniques that could potentially solve this problem better?
I've worked on mode of transportation detection using smartphone accelerometers. The main result I've found is that almost any classifier will do; the key problem is then the set of features. (This is no different from many other machine learning problems.) My feature set ended up containing time-domain and frequency-domain values, both taken from time-series sliding-window segmentation.
Another problem is that the accelerometer can be placed anywhere. On the body, it can be anywhere and in any orientation. If the user is driving, is the phone in a pocket, in a bag, on a car seat, attached to a suction-cup window mount, etc.?
If you want to avoid these problems, use GPS instead of the accelerometer. You can make relatively accurate classifications with that sensor, but the cost is the battery usage.
I've been having a bit of a debate with my adviser about this issue, and I'd like to get your opinion on it.
I have a fairly large dataset that I've used to build a classifier. I have a separate, smaller testing dataset that was obtained independently from the training set (in fact, you could say that each sample in either set was obtained independently). Each sample has a class label, along with metadata such as collection date and location.
There is no sample in the testing set that has the same metadata as any sample in the training set (as each sample was collected at a different location or time). However, it is possible that the feature vector itself could be identical to some sample in the training set. For example, there could be two virus strains that were sampled in Africa and Canada, respectively, but which both have the same protein sequence (the feature vector).
My adviser thinks that I should remove such samples from the testing set. His reasoning is that these are like "freebies" when it comes to testing, and may artificially boost the reported accuracy.
However, I disagree and think they should be included, because it may actually happen in the real world that the classifier sees a sample that it has already seen before. To remove these samples would bring us even further from reality.
What do you think?
It would be nice to know if you're talking about a couple of repetitions in million samples or 10 repetitions in 15 samples.
In general I don't find what you're doing reasonable. I think your advisor has a very good point. Your evaluation needs to be as close as possible to using your classifier outside your control -- You can't just assume your going to be evaluated on a datapoint you've already seen. Even if each data point is independent, you're going to be evaluated on never-before-seen data.
My experience is in computer vision, and it would be very highly questionable to train and test with the same picture of a one subject. In fact I wouldn't be comfortable training and testing with frames of the same video (not even the same frame).
EDIT:
There are two questions:
The distribution permits that these repetitions naturally happen. I
believe you, you know your experiment, you know your data, you're
the expert.
The issue that you're getting a boost by doing this and that this
boost is possibly unfair. One possible way to address your advisor's
concerns is to evaluate how significant a leverage you're getting
from the repeated data points. Generate 20 test cases 10 in which
you train with 1000 and test on 33 making sure there are not
repetitions in the 33, and another 10 cases in which you train with
1000 and test on 33 with repetitions allowed as they occur
naturally. Report the mean and standard deviation of both
experiments.
It depends... Your adviser suggested the common practice. You usually test a classifier on samples which have not been used for training. If the samples of the test set matching the training set are very few, your results are not going to have statistical difference because of the reappearance of the same vectors. If you want to be formal and still keep your logic, you have to prove that the reappearance of the same vectors has no statistical significance on the testing process. If you proved this theoretically, I would accept your logic. See this ebook on statistics in general, and this chapter as a start point on statistical significance and null hypothesis testing.
Hope I helped!
In as much as the training and testing datasets are representative of the underlying data distribution, I think it's perfectly valid to leave in repetitions. The test data should be representative of the kind of data you would expect your method to perform on. If you genuinely can get exact replicates, that's fine. However, I would question what your domain is where it's possible to generate exactly the same sample multiple times. Are your data synthetic? Are you using a tiny feature set with few possible values for each of your features, such that different points in input space map to the same point in feature space?
The fact that you're able to encounter the same instance multiple times is suspicious to me. Also, if you have 1,033 instances, you should be using far more than 33 of them for testing. The variance in your test accuracy will be huge. See the answer here.
Having several duplicate or very similar samples seems somewhat analogous to the distribution of the population you're attempting to classify being non-uniform. That is, certain feature combinations are more common than others, and the high occurrence of them in your data is giving them more weight. Either that, or your samples are not representative.
Note: Of course, even if a population is uniformly distributed there is always some likelihood of drawing similar samples (perhaps even identical depending on the distribution).
You could probably make some argument that identical observations are a special case, but are they really? If your samples are representative it seems perfectly reasonable that some feature combinations would be more common than others (perhaps even identical depending on your problem domain).
Hey I am really new to the field of machine learning and recently started reading the book Machine Learning by Tom Mitchell and am stuck on a particular section in the first chapter where he talks about Estimating Training values and also adjusting the weights. An explanation of the concepts of estimating training values would be great but I understand that it is not easy to explain all this so I would be really obliged if someone would be able to point me towards a resource (lecture video, or simple lecture slides, or some text snippet) that talks about the concept of estimating training data and the like.
Again I am sorry I cannot provide more information in terms of the question I am asking. The book sections are 1.2.4.1 and 1.2.4.2 in "Machine Learning by Tom Mitchell" if anyone has read this book and has had the same problem in understanding the concepts described in these sections.
Thanks in advance.
Ah. Classic textbook. My copy is a bit out of date but it looks like my section 1.2.4 deals with the same topics as yours.
First off, this is an introductory chapter that tries to be general and non-intimidating, but as a result it is also very abstract and a bit vague. At this point I wouldn't worry too much that you didn't understand the concepts, it is more likely that you're overthinking it. Later chapters will flesh out the things that seem unclear now.
Value in this context should be understood as a measure of the quality or performance of a certain state or instance, not as "values" as in numbers in general. Using his checkers example, a state with a high value is a board situation that is good/advantageous for the computer player.
The main idea here is that if you can provide every possible state that can be encountered with a value, and there is a set of rules that defines which states can be reached from the current state by doing which actions, then you can make an informed decision about which action to take.
But assigning values to states is only a trivial task for the end states of the game. The value attained at an end state is often called the reward. The goal is of course to maximize the reward. Estimating training values refers to the process of assigning guessed values to intermediate states based on the results you obtained later on in a game.
So, while playing many many training games you keep a trace of which states you encounter, and if you find that some state X leads to state Y, you can change your estimated value of X a bit, based on the current estimate for X and the current estimate of Y. This is what 'estimating the training weights' is all about. By repeated training, the model gets experienced and the estimates should converge to reliable values. It will start to avoid moves that lead to defeat, and favor moves that lead to victory. There are many different ways of doing such updates, and many different ways to represent the game state, but that is what the rest of the book is about .
I hope this helps!