Concepts about data distributions - machine-learning

I'm a little confused here
is there a connection between data distribution and detecting novelty, I mean can data distribution differ between novelty, noise, or outlier? In order to detect them!
Another point need to be answered as well:
"training data and test data are drawn from the same distribution or the same feature space "
so when exactly does the data distribution change? And when the data distribution changes, on which set I'm supposed to focus on? where/when can this happen?

I would suggest reading this from scikit-learn. I think it is a good overview from which you can understand the difference between outlier and novelty detection. Basically novelty is a cluster of "outliers" but which being so close to each other represent a new kind of data partition probably, not simply something strange. Definitely, for the first such point, there is no way to discriminate between the two possibilities, but if you process your new data in batches and you detect high-density regions in the outlier spaces, you might suspect some novelty in the data.
For your second point, this is basically what concept drift is about.

Related

Are there any ways to build an ML model using CBIR and SIFT for image comparison in my case?

I have this project I'm working on. A part of the project involves multiple test runs during which screenshots of an application window are taken. Now, we have to ensure that screenshots taken between consecutive runs match (barring some allowable changes). These changes could be things like filenames, dates, different logos, etc. within the application window that we're taking a screenshot of.
I had the bright idea to automate the process of doing this checking. Essentially my idea was this. If I could somehow mathematically quantify the difference between a screenshot from the N-1th run and the Nth run, I could create a binary labelled dataset that mapped feature vectors of some sort to a label (0 for pass or 1 for fail if the images do not adequately match up). The reason for all of this was so that my labelled data would help make the model understand what scale of changes are acceptable, because there are so many kinds that are acceptable.
Now lets say I have access to lots of data that I have meticulously labelled, in the thousands. So far I have tried using SIFT in opencv using keypoint matching to determine a similarity score between images. But this isn't an intelligent, learning process. Is there some way I could take some information from SIFT and use it as my x-value in my dataset?
Here are my questions:
what would that be the information I need as my x-value? It needs to be something that represents the difference between two images. So maybe the difference between feature vectors from SIFT? What do I do when those vectors are of slightly different dimensions?
Am I on the right track with thinking about using SIFT? Should I look elsewhere and if so where?
Thanks for your time!
The approach that is being suggested in the question goes like this -
Find SIFT features of two consecutive images.
Use those to somehow quantify the similarity between two images (sounds reasonable)
Use this metric to first classify the images into similar and non-similar.
Use this dataset to train a NN do to the same job.
I am not completely convinced if this is a good approach. Let's say that you created the initial classifier with SIFT features. You are then using this data to train a NN. But this data will definitely have a lot of wrong labels. Because if it didn't have a lot of wrong labels, what's stopping you from using your original SIFT based classifier as your final solution?
So if your SIFT based classification is good, why even train a NN? On the other hand, if it's bad, you are giving a lot of wrong labeled data to the NN for training. I think the latter is a probably a bad idea. I say probably because there is a possibility that maybe the wrong labels just encourage the NN to generalize better, but that would require a lot of data, I imagine.
Another way to look at this is, let's say that your initial classifier is 90% accurate. That's probably the upper limit of the performance for the NN that you are looking at when talking about training it with this data.
You said that the issue that you have with your first approach is that 'it's not a an intelligent, learning process'. I think it's the wrong approach to think that the former approach is always inferior to the latter. SIFT is a powerful tool that can solve a lot of problems without all the 'black-boxness' of an NN. If this problem can be solved with sufficient accuracy using SIFT, I think going after a learning based approach is not the way to go, because again, a learning based approach isn't necessarily superior.
However, if the SIFT approach isn't giving you good enough results, definitely start thinking of NN stuff, but at that point, using the "bad" method to label the data is probably a bad idea.
Also in relation, I think you could potentially be underestimating the amount of data that is needed for this. You mentioned data in the thousands, but that's honestly, not a lot. You would need a lot more, I think.
One way I would think about instead doing this -
Do SIFT keyponits detection for a sample reference image.
Manually filter out keypoints that does not belong to the things in the image that are invariant. That is, just take keypoints at the locations in the image that is guaranteed (or very likely) to be always present.
When you get a new image, compute the keypoints and do matching with the reference image.
Set some threshold of the ratio of good matches to the total number of matches.
Depending on your application, this might give you good enough results.
If not, and if you really want your solution to be NN based, I would say you need to manually label the dataset as opposed to using SIFT.

How to treat outliers if you have data set with ~2,000 features and can't look at each feature individually

I'm wondering how one goes about treating outliers at scale. Based on my experiences, I usually need to understand why there are outliers from the first place. What causes it, are there any patterns, or it just happens randomly. I know that, theoretically, we usually define outliers as data points outside of 3 standard deviation. But in the case where data is so big that you can't treat each feature one by one, and don't know if the 3 standard deviation rule is applicable anymore because of sparsity, how do we most effectively treat the outliers.
My intuition about high dimensional data is that data is sparse so the definition of "outliers" is harder to determine. Do you guys think we would be able to just get away with using ML algorithms that are more robust to outliers (tree based models, robust SVM, etc) instead of trying to treat outliers during preprocessing step? And if we really want to treat it, what is the best way to do it?
I would first propose a frame work for understanding the data. Imagine you are handed a dataset with no explanation of what it is. Analytics could actually be used to enable us to get understanding. Usually rows are observations and columns parameters of some sort regarding the observations. You first want to have a frame work for what you are trying to achieve. Now matter is going on, all data centers around the interest of people...that is why we decided to record it in some format. Given that, we are at most interested in:
1.) Object
2.) Attributes of object
3.) Behaviors of object
4.) Preferences of object
4.) Behaviors and preferences of object over time
5.) Relationships of object to other objects
6.) Affects of attributes, behaviors, preferences and other objects on object
So you are wanting to identify these items. So you open a data set and maybe you instantly recognize a time stamp. You then see some categorical variables and start doing relationship analysis for what is one to one, one to many, many to many. You then identify continuous variables. These all come together to give a foundation for identifying what is an outlier.
If we are evaluating objects of over time....is the rare event indicative of something that happens rarely, but we want to know about. Forest fire are outlier events...but they are events of great concern. If I am analyzing machine data and having rare events, but these rare events are tied to machine failure, then it matters. Basically.....does the rare event-parameter show evidence that it correlates to something that you care about?
Now if you have so many dimensions that the above approach is not feasible to your judgement, then you are seeking dimension reduction alternatives. I am currently employing Single Value Decomposition as at technique. I am already seeing situations where I am accomplishing the same level of predictive ability with 25% of the data. Which segways into my final thought; find a mark to decide whether the outliers matter or not.
Begin with leaving them in then begin your analysis, and run the work again with them removed. What were the affects. I believe that when you are in doubt, simply do both and see how different the results are. If there is little difference than maybe you are good to go. If there is significant difference of concern, then you are wanting to take an evidenced based approach of the outlier occurring. Simply because it is rare in your data does not mean it is rare. Think of certain type crimes that are under-reported (via arrest records). Lack of data showing politicians being arrested for insider trading does not mean that politicians are not doing insider trader en masse.

Machine Learning Algorithm for Dynamic Environments

Which methods are best for managing and predicting and labeling data in dynamic environment? The system data distribution changes and it is not static. The system can have different normal settings and under different settings, we have different normal data distributions. Consider we have two classes. Normal and abnormal. What happens? We cannot say that we can rely on historical data and train a simple classification method to predict future observations since one day after training the model, data distribution can change and old observations will become irrelevant to new ones. Consider the following figure:
Blue distribution and red distribution are normal data but under different setting and in the training time we have just one setting. This data is for one sensor. So, suppose we train a model with blue one and also have some abnormal samples. Imagine abnormals samples as normal samples with a little bit noise or fault in measurements. Then, we want to test the model but setting changes and now we have red distribution as our test observations. So, the model misclassifies the samples.
What are the best methods for a situation like this? Please note that I have tried several clustering algorithms but they cannot manage and distinguish between normal and abnormal samples.
Any suggestion and help are highly welcomed. Thanks
There are plenty of books on time series data.
In particular, on change detection. Your example can supposedly be considered a change in mean. There are statistical models to detect this.
Basseville, Michèle, and Igor V. Nikiforov. Detection of abrupt changes: theory and application. Vol. 104. Englewood Cliffs: Prentice Hall, 1993.

How to prove the reliability of a predictive model to executives?

I trained data from 500 devices to predict their performance. Then I applied my trained model to a test data set for another 500 devices and show pretty good prediction results. Now my executives want me to prove this model will work well on one million devices not only on 500. Obviously we don't have data for one million devices. And if the model is not reliable, they want me to discover the required amount of train data in order to make a reliable prediction on one million devices. How should I deal with these executives who don't have a background in statistical analysis and modeling? Any suggestions? Thanks
I have suggested to #cep to write up his comment as an answer - including providing the variance and bias calculations. In any case it could be added
"Do not be quick to assume Execs are essentially incapable in terms of
technical or mathematical concepts"
While there may be Dilbert managers out there .. somewhere I have seen few of them myself. More often managers get to their positions through hard work. They are likely to be rusty - but the abilities are likely still there.
In this case whether or not they have a "background in statistical analysis and modeling" they are applying common sense.
The first thing you might do is to provide the proper context and terminology. #cel has mentioned some of it: providing concrete values for :
assumptions
what assumptions are you making about the data.
What basis is there to consider extrapolation of the limited data
why should said extrapoated results be trusted to apply to the 99.5% of untested data
data distribution
basic descriptive statistics
your take on the apriori distribution of the data. Justify why you chose it
modeling
which models/approaches were considered and why
which model you actually chose and why
how did you arrive at the hyperparameters
how you trained the model
results
statistical measures of fit and error rate

Random forest algorithms able to switch data sets

I'm curious as to whether research been done into random forests that combine unsupervised with supervised learning in a way allowing a single algorithm to find patterns in, and work with, multiple different data sets. I have googled every possible way to find research on this, and have come up empty. Can anyone point me in the right direction?
Note: I have already asked this question in the Data Sciences forum, but it's basically a dead forum so I came here.
(also read the comments and will incorporate the content in my answer)
From what I read between the lines is that you want to use Deep networks in a transfer learning setting. However, this would not be based on decision trees.
http://jmlr.csail.mit.edu/proceedings/papers/v27/mesnil12a/mesnil12a.pdf
There are many elements in your question:
1.) Machine learning algorithms, in general, don't care about the source of your data set. So basically you can feed the learning algorithms 20 different data sets and it will use all of them. However, the data should have the same underlying concept (except in the transfer learning case see below). This means: if you combine cats/dogs data with bills data this will not work or make it much harder for the algorithms. At least all input features need to be identical (exceptions exists), e.g, it is hard to combine images with text.
2.) labeled/unlabeled: Two important terms: a data set is a set of data points with a fixed number of dimensions. Datapoint i might be described as {Xi1,....Xin} where each Xi might for example be a pixel. A label Yi is from another domain, e.g., cats and dogs
3.) unsupervised learning data without any labels. (I have the gut feeling that this is not what you want.
4.) semi-supervised learning: The idea is basically that you combine data where you have labels with data without labels. Basically you have a set of images labeled as cats and dogs {Xi1,..,Xin,Yi} and a second set which contains images with cats/dogs but no labels {Xj1,..,Xjn}. The algorithm can use this information to build better classifiers as the unlabeld data provide information on how images look in general.
3.) transfer learning (I think this come closest to what you want). The Idea is that you provide a data set of cats and dogs and learn a classifier. Afterwards you want to train the classifier with images of cats/dogs/hamster. The training does not need to start from scratch but can use the cats/dogs classifier to converge much faster
4.) feature generation / feature construction The idea is that the algoritm learns features like "eyes". This features are used in the next step to learn the classifier. I'm mainly aware of this in the context of deep learning. Where the algoritm learns in the first step concepts like edges and constructs increasingly complex features like faces cats intolerant it can describe things like "the man on the elephant. This combined with transfer learning is probably what you want. However deep learning is based on Neural networks besides a few exceptions.
5.) outlier detection you provide a data set of cats/dogs as known images. When you provide the cats/dogs/hamster classifier. The classifier tells you that it has never seen something like a hamster before.
6.) active learning The idea is that you don't provide labels for all examples (Data points) beforehand, but that the algorithms asks you to label certain data points. This way you need to label much less data.

Resources