Recently I came across this term, but I really have no idea what it refers to. I've searched online, but with little gain.
Thanks.
Take a sample of the time of day that you wake up on Saturdays. Some Friday nights you have a few too many drinks, so you wake up early (but go back to bed). Other days you wake up at a normal time. Other days you sleep in.
Here are the results:
[3.1, 4.8, 6.3, 6.4, 6.6, 7.3, 7.5, 7.7, 7.9, 10.1]
What is the mean time that you wake up?
Well, it's about 6.8 (o'clock, or 6:48). A touch early for me.
How good a prediction is this of when you'll wake up next Saturday? Can you quantify how wrong you are likely to be?
It's a pretty small sample, and we're not sure of the distribution of the underlying process, so it might not be a good idea to use standard parametric statistical techniques†.
Why don't we take a random sample of our sample (the same size, drawn with replacement), calculate the mean, and repeat? This will give us an estimate of how bad our estimate is.
I did this several times, and the resampled means ranged from 5.98 to 7.8.
This is called the bootstrap, and it was introduced by Bradley Efron in 1979.
A variant is called the jackknife, where you take the mean of all but one of your data points, and repeat, leaving out each point in turn. The jackknife mean is 6.8 (the same as the arithmetic mean) and ranges from 6.4 to 7.2.
Another variant is called k-fold cross-validation, where you (at random) split your data set into k equally-sized sections, calculate the mean of all but one section, and repeat k times. With 5 folds, the mean is again 6.8, and the means of the individual held-out sections ranged from 4 to 9.
† This distribution does happen to be Normal. The 95% confidence interval of the mean is 5.43 to 8.11, reasonably close to, but wider than, the bootstrap range.
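For concreteness, here is a minimal sketch of the bootstrap and jackknife on the sample above (assuming NumPy; the exact bootstrap interval varies with the seed and the number of resamples):

```python
import numpy as np

rng = np.random.default_rng(0)
wake_times = np.array([3.1, 4.8, 6.3, 6.4, 6.6, 7.3, 7.5, 7.7, 7.9, 10.1])

# Bootstrap: resample with replacement, same size as the original
# sample, take the mean, and repeat many times.
boot_means = [rng.choice(wake_times, size=len(wake_times), replace=True).mean()
              for _ in range(10_000)]

# The spread of the bootstrap means estimates how wrong the sample
# mean is likely to be; a 95% interval comes from its percentiles.
print(np.percentile(boot_means, [2.5, 97.5]))

# Jackknife: leave one observation out at a time and take the mean.
jack_means = [np.delete(wake_times, i).mean() for i in range(len(wake_times))]
print(min(jack_means), max(jack_means))  # roughly 6.4 to 7.2, as above
```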
If you don't have enough data to train your algorithm you can increase the size of your training set by (uniformly) randomly selecting items and duplicating them (with replacement).
In machine learning, bootstrapping also refers to iterative training on a known set: http://en.wikipedia.org/wiki/Bootstrapping_(machine_learning)
I'm new to machine learning.
I've got a huge database of sensor data from weather stations. Those sensors can break or report odd values, and broken sensors influence the calculations being done with that data.
The goal is to use machine learning to detect whether new sensor values are odd and, if so, mark the sensor as broken. As said, I'm new to ML. Can somebody push me in the right direction or give feedback on my approach?
The data has a datetime and a value. The sensor values are being pushed every hour.
I appreciate any kind of help!
Since the question is pretty general in nature, I will provide some basic thoughts. Maybe you are already slightly familiar with them.
Set up a dataset that contains both broken and good sensors, with a label saying which is which. That label is the dependent variable; call it Y. Alongside it you have some variables that might predict Y; let's call them X.
You train a model to learn the relationship between X and Y.
Then you predict, from X values where you do not know the outcome, what Y will be.
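As a hypothetical sketch of that workflow (the features and data below are invented for illustration; any scikit-learn classifier would do):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# X: features that might predict a broken sensor, e.g. the reading itself
# and its jump since the previous hour; Y: the known broken/good label.
n = 1000
value = rng.normal(20, 2, n)
jump = rng.normal(0, 1, n)
y = (np.abs(jump) > 2).astype(int)       # pretend big jumps mean "broken"
X = np.column_stack([value, jump])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = RandomForestClassifier().fit(X_train, y_train)  # learn X -> Y
print(model.score(X_test, y_test))       # accuracy on held-out labels
print(model.predict(X_test[:5]))         # predict Y where it is "unknown"
```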
Some useful insight on the basics is here:
https://www.youtube.com/watch?v=elojMnjn4kk&list=PL5-da3qGB5ICeMbQuqbbCOQWcS6OYBr5A
Good Luck!
You could use Isolation Forest to detect abnormal readings.
Twitter has released an anomaly-detection package based on the Seasonal Hybrid ESD (Extreme Studentized Deviate) test, which is also useful.
https://github.com/twitter/AnomalyDetection/
However, a good EDA (exploratory data analysis) is needed first, to define the types of abnormality that faulty sensors produce in the readings, for example:
1) A step change, where the value suddenly jumps and then stays at the higher (or lower) level
2) A gradual drift in the value relative to other sensors, followed by a sudden very large increase
3) Intermittent spikes in the data
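As a rough sketch of the Isolation Forest suggestion (the synthetic readings and the contamination rate are assumptions for illustration):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
readings = rng.normal(20.0, 2.0, size=(1000, 1))  # hourly sensor values
readings[::100] += 15.0                           # inject occasional spikes

clf = IsolationForest(contamination=0.01, random_state=0).fit(readings)
flags = clf.predict(readings)    # -1 = abnormal reading, 1 = normal
print(np.where(flags == -1)[0])  # indices of the flagged readings
```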
I am working on a machine learning scenario where the target variable is Duration of power outages.
The distribution of the target variable is severely skewed right. (You can imagine most power outages occur and are over with fairly quickly, but there are many, many outliers that can last much longer.) A lot of these power outages become less and less 'explainable' by the data as the durations get longer. They become, more or less, 'unique outages', where events occur on site that are not typical of other outages, and no data is recorded on the specifics of those events beyond what's already available for all the 'typical' outages.
This causes a problem when creating models: the unexplainable data mingles with the explainable part and skews the model's ability to predict.
I analyzed some percentiles to decide on a point that encompasses as many outages as possible while the duration is still, I believe, mostly explainable. This was somewhere around the 320-minute mark and contained about 90% of the outages.
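For concreteness, a sketch of that percentile analysis (the durations below are made-up stand-ins for the real outage data):

```python
import numpy as np

rng = np.random.default_rng(0)
durations = rng.lognormal(mean=4.0, sigma=1.0, size=5000)  # stand-in minutes

for q in (80, 90, 95, 99):
    print(q, np.percentile(durations, q))
# on the real data, the 90th percentile was the ~320-minute mark above
```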
That choice was completely subjective, though, and I suspect there is some procedure for determining a 'best' cut-off point for this target variable. Ideally, the procedure would be robust enough to weigh the trade-off of encompassing as much data as possible, rather than telling me to make my cut-off 2 hours and thereby excluding a significant number of customers; the whole purpose is to provide an accurate Estimated Restoration Time to as many customers as possible.
FYI: The methods of modeling I am using that appear to be working the best right now are random forests and conditional random forests. Methods I have used in this scenario include multiple linear regression, decision trees, random forests, and conditional random forests. MLR was by far the least effective. :(
I have exactly the same problem! I hope someone more informed brings their knowledge. I wonder at what point a long duration becomes something we want to discard rather than something we want to predict!
Also, I tried log-transforming my data, and the density plot shows a funny artifact on the left side of the distribution (because my durations are integers, not floats). I think this helps; you should also log-transform any features that have similarly skewed distributions.
I finally thought that the solution should be stratified sampling or giving weights to features, but I don't know exactly how to implement that; my attempts didn't produce any good results. Perhaps my data is too stochastic!
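A minimal sketch of the log-transform approach mentioned above, with made-up data standing in for the real features and durations: train on log1p of the target and invert with expm1 when predicting.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.random((500, 4))                          # stand-in features
y = rng.lognormal(mean=3.0, sigma=1.0, size=500)  # skewed durations

# Fit on the log-transformed target so extreme outliers dominate less.
model = RandomForestRegressor().fit(X, np.log1p(y))
pred_minutes = np.expm1(model.predict(X[:5]))     # back on the original scale
print(pred_minutes)
```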
I'm training a neural network in TensorFlow (using tflearn) on data that I generate. From what I can tell, in each epoch we use all of the training data. Since I can control how many examples I have, it seems like it would be best to just generate more training data until one epoch is enough to train the network.
So my question is: Is there any downside to only using one epoch, assuming I have enough training data? Am I correct in assuming that 1 epoch of a million examples is better than 10 epochs of 100,000?
Following a discussion with @Prune:
Suppose you have the possibility to generate an infinite number of labeled examples, sampled from a fixed underlying probability distribution, i.e. from the same manifold.
The more examples the network sees, the better it will learn, and especially the better it will generalize. Ideally, if you train it long enough, it could reach 100% accuracy on this specific task.
The conclusion is that only running 1 epoch is fine, as long as the examples are sampled from the same distribution.
The limitations of this strategy could be:
- if you need to store the generated examples, you might run out of memory
- to handle unbalanced classes (cf. @jorgemf's answer), you just need to sample the same number of examples for each class; e.g. if you have two classes, with a 10% chance of sampling the first one, you should create batches of examples with a 50%/50% distribution (a sketch follows below)
- it's possible that running multiple epochs might make it learn some uncommon cases better.
I disagree: reusing the same example multiple times is always worse than generating new, unseen examples. However, you might want to generate harder and harder examples over time to make your network better on uncommon cases.
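A minimal sketch of the balanced sampling mentioned in the list above, with a hypothetical per-class example generator:

```python
import random

def generate_example(cls):
    # hypothetical stand-in for "generate a labeled example of class cls"
    return {"features": [random.random()], "label": cls}

def balanced_batch(classes, batch_size):
    per_class = batch_size // len(classes)  # e.g. 50%/50% for two classes
    batch = [generate_example(c) for c in classes for _ in range(per_class)]
    random.shuffle(batch)
    return batch

batch = balanced_batch(classes=[0, 1], batch_size=32)
print(sum(ex["label"] for ex in batch))  # 16: half the batch is class 1
```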
You need training examples in order to make the network learn. Usually you don't have enough examples to make the network converge, so you need to run more than one epoch.
It is OK to use only one epoch if you have a great many examples drawn from the same distribution. But if you have 100 classes and some of them have only very few examples, you are not going to learn those classes with a single epoch. So you need balanced classes.
Moreover, it is a good idea to have a learning rate that decreases with the number of examples seen, so the network can fine-tune itself: it starts with a high learning rate and then decreases over time. If you only run for one epoch, you need to bear this in mind when setting up the schedule.
My suggestion is to run more than one epoch, mostly because the more examples you have, the more memory you need to store them. But if memory is fine, and the learning rate is adjusted based on the number of examples rather than epochs, then it is fine to run one epoch.
Edit: I am assuming you are using a learning algorithm which updates the weights of the network every batch or similar.
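A sketch of what an example-based (rather than epoch-based) learning-rate schedule could look like; the inverse-time form and the decay constant are arbitrary choices:

```python
def learning_rate(examples_seen, initial_lr=0.1, decay=1e-6):
    # Inverse-time decay: starts at initial_lr and shrinks smoothly as
    # more examples are processed, independent of epoch boundaries.
    return initial_lr / (1.0 + decay * examples_seen)

for n in (0, 100_000, 1_000_000):
    print(n, learning_rate(n))
```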
If I am building a weather predictor that will predict whether it will snow tomorrow, it is very easy to just answer "NO" straight away.
Obviously, if you evaluate such a classifier on every day of the year, it would be correct with an accuracy of 95% (considering that I build and test it in a region where it snows very rarely).
Of course, that is such a stupid classifier even if it has an accuracy of 95% because it is obviously more important to predict if it will snow during the winter months (Jan & Feb) as opposed to any other months.
So, suppose I have a lot of features collected about the previous day to predict whether it will snow the next day, including a feature that says which month/week of the year it is. How can I weight this particular feature and design the classifier to solve this practical problem?
Of course, that is such a stupid classifier even if it has an accuracy of 95% because it is obviously more important to predict if it will snow during the winter months (Jan & Feb) as opposed to any other months.
Accuracy might not be the best measurement to use in your case. Consider using precision, recall and F1 score.
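A toy sketch of why those metrics help here (the labels are made up; `zero_division=0` assumes a recent scikit-learn):

```python
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [0, 0, 0, 0, 1, 1, 0, 0, 1, 0]  # 1 = it snowed
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]  # the "always NO" classifier

# Accuracy alone looks deceptively fine (7/10 correct here), while
# precision/recall/F1 expose that it never finds a single snow day.
print(precision_score(y_true, y_pred, zero_division=0))  # 0.0
print(recall_score(y_true, y_pred))                      # 0.0
print(f1_score(y_true, y_pred))                          # 0.0
```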
How can I weight this particular feature and design the classifier to solve this practical problem?
I don't think you should weight any particular feature in any way. You should let your algorithm do that, and use cross-validation to decide on the best parameters for your model, in order to also avoid overfitting.
If you say Jan and Feb are the most important months, consider only applying your model to those two months. If that's not possible, look into giving different weights to your classes (going to snow / not going to snow) based on their counts. This question discusses that issue; the concept should be understandable regardless of your language of choice.
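A minimal sketch of that class weighting in scikit-learn, on stand-in data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(365, 3))             # stand-in daily weather features
y = (rng.random(365) < 0.05).astype(int)  # ~5% snow days: heavily unbalanced

# "balanced" reweights errors inversely to class frequency, so a missed
# snow day costs far more than a missed "no snow" day.
model = LogisticRegression(class_weight="balanced").fit(X, y)
print(model.predict(X[:10]))
```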
I've been having a bit of a debate with my adviser about this issue, and I'd like to get your opinion on it.
I have a fairly large dataset that I've used to build a classifier. I have a separate, smaller testing dataset that was obtained independently from the training set (in fact, you could say that each sample in either set was obtained independently). Each sample has a class label, along with metadata such as collection date and location.
There is no sample in the testing set that has the same metadata as any sample in the training set (as each sample was collected at a different location or time). However, it is possible that the feature vector itself could be identical to some sample in the training set. For example, there could be two virus strains that were sampled in Africa and Canada, respectively, but which both have the same protein sequence (the feature vector).
My adviser thinks that I should remove such samples from the testing set. His reasoning is that these are like "freebies" when it comes to testing, and may artificially boost the reported accuracy.
However, I disagree and think they should be included, because it may actually happen in the real world that the classifier sees a sample that it has already seen before. To remove these samples would bring us even further from reality.
What do you think?
It would be nice to know whether you're talking about a couple of repetitions in a million samples or 10 repetitions in 15 samples.
In general, I don't find what you're doing reasonable; I think your advisor has a very good point. Your evaluation needs to be as close as possible to how your classifier will be used outside your control: you can't just assume you're going to be evaluated on a data point you've already seen. Even if each data point is independent, you're going to be evaluated on never-before-seen data.
My experience is in computer vision, where it would be highly questionable to train and test with the same picture of the same subject. In fact, I wouldn't be comfortable training and testing with frames of the same video (let alone the same frame).
EDIT:
There are two questions:
1) Whether the distribution permits that these repetitions naturally happen. I believe you; you know your experiment and your data, and you're the expert.
2) Whether the boost you're getting from this is unfair. One possible way to address your advisor's concerns is to evaluate how significant a leverage you're getting from the repeated data points. Generate 20 test cases: 10 in which you train with 1,000 and test on 33, making sure there are no repetitions in the 33, and another 10 in which you train with 1,000 and test on 33 with repetitions allowed as they occur naturally. Report the mean and standard deviation of both experiments.
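A sketch of that experiment on synthetic data (everything below is made up; swap in your own classifier and samples). A small discrete feature space is used so that exact repetitions occur naturally:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

def make_sample(n):
    # 4 discrete features so identical feature vectors occur naturally
    X = rng.integers(0, 6, size=(n, 4))
    y = (X.sum(axis=1) > 10).astype(int)
    return X, y

scores_rep, scores_norep = [], []
for _ in range(10):
    X_train, y_train = make_sample(1000)
    X_test, y_test = make_sample(33)
    model = KNeighborsClassifier().fit(X_train, y_train)

    scores_rep.append(model.score(X_test, y_test))  # repetitions allowed

    # drop test rows whose feature vector also appears in training
    seen = {tuple(r) for r in X_train}
    keep = [i for i, r in enumerate(X_test) if tuple(r) not in seen]
    if keep:
        scores_norep.append(model.score(X_test[keep], y_test[keep]))

print(np.mean(scores_rep), np.std(scores_rep))      # with repetitions
print(np.mean(scores_norep), np.std(scores_norep))  # without repetitions
```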
It depends... Your adviser suggested the common practice: you usually test a classifier on samples that have not been used for training. If very few samples in the test set match the training set, your results are not going to differ statistically because of the reappearance of the same vectors. If you want to be formal and still keep your logic, you have to show that the reappearance of the same vectors has no statistically significant effect on the testing process. If you can show this, I would accept your logic. See this ebook on statistics in general, and this chapter as a starting point on statistical significance and null-hypothesis testing.
Hope I helped!
In as much as the training and testing datasets are representative of the underlying data distribution, I think it's perfectly valid to leave in repetitions. The test data should be representative of the kind of data you would expect your method to perform on. If you genuinely can get exact replicates, that's fine. However, I would question what your domain is where it's possible to generate exactly the same sample multiple times. Are your data synthetic? Are you using a tiny feature set with few possible values for each of your features, such that different points in input space map to the same point in feature space?
The fact that you're able to encounter the same instance multiple times is suspicious to me. Also, if you have 1,033 instances, you should be using far more than 33 of them for testing. The variance in your test accuracy will be huge. See the answer here.
Having several duplicate or very similar samples seems somewhat analogous to the distribution of the population you're attempting to classify being non-uniform. That is, certain feature combinations are more common than others, and the high occurrence of them in your data is giving them more weight. Either that, or your samples are not representative.
Note: Of course, even if a population is uniformly distributed there is always some likelihood of drawing similar samples (perhaps even identical depending on the distribution).
You could probably make some argument that identical observations are a special case, but are they really? If your samples are representative it seems perfectly reasonable that some feature combinations would be more common than others (perhaps even identical depending on your problem domain).