I have been reading about probability distributions lately and got confused: what actually is the difference between a probability distribution and a data distribution, or are they the same? Also, what is the importance of probability distributions in Machine Learning?
Thanks
Data distribution is a function or a listing that shows all the possible values (or intervals) of the data. This can help you decide whether the set of data you have is good enough to apply a given technique to. You want to avoid skewed data.
Probability distribution is a statistical function that describes all the possible values and likelihoods that a random variable can take within a given range. This helps you decide what type of statistical methods you can apply to your data. Example: if your data follows a Gaussian distribution, then you already know what values look like when they are 1 standard deviation away from the mean, and what the probability is of observing a value more than 1 standard deviation away.
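For instance, for any Gaussian those probabilities are fixed and can be read straight off the CDF; a small sketch using scipy (assuming scipy is available):

```python
# Sketch: for a Gaussian, the probability mass within 1 standard
# deviation of the mean is fixed (~68%), regardless of the dataset.
from scipy.stats import norm

# P(|X - mu| <= 1 sd), computed on the standard normal
p_within_1sd = norm.cdf(1) - norm.cdf(-1)
print(round(p_within_1sd, 4))  # ~0.6827

# Probability of seeing a value more than 1 sd above the mean
p_above_1sd = 1 - norm.cdf(1)
print(round(p_above_1sd, 4))  # ~0.1587
```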
NOTE: You may want to learn about how hypothesis testing is done for ML models.
Related
When we use a machine learning approach, we divide the data set into training and test data and, in effect, always take a post hoc approach: we use all the data at once and then calculate the y-value for a new query.
But is there such a thing as an ad hoc approach where we can go through feature by feature for a new query and see how our prediction changes?
The advantage of this would be that we know exactly which feature has changed the predictions and how.
I would be grateful for any advice, including literature references, as I don't really know how to google it. It is also possible that the term ad-hoc approach is not chosen correctly.
This is a very vague question. Also, why would you want to know how the prediction changes? You usually want to know which feature contributes most towards finding the 'best' prediction/correct classification. That is approached by looking at feature importance, which comes in different flavors for different algorithms.
In case that is what you were looking for, take a look at permutation feature importance, the Boruta algorithm, SHAP feature importance, feature importance for tree-based algorithms, and so on.
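For instance, permutation feature importance is available directly in scikit-learn; a minimal sketch (the dataset and model here are illustrative choices, not from your problem):

```python
# Sketch of permutation feature importance with scikit-learn:
# shuffle one feature at a time and measure how much the score drops.
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True, as_frame=True)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Features whose shuffling hurts validation R^2 the most matter most
result = permutation_importance(model, X_val, y_val, n_repeats=10,
                                random_state=0)
for name, imp in sorted(zip(X.columns, result.importances_mean),
                        key=lambda t: -t[1]):
    print(f"{name}: {imp:.3f}")
```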
While going through a kernel on Kaggle about regression, it was mentioned that the data should look like a normal distribution, but I am not getting why.
I know this question might be very basic, but please help me understand this concept.
Thanks in Advance!!
Regression models make a number of assumptions, one of which is normality. When this assumption is violated, your p-values and confidence intervals around the coefficient estimates could be wrong, leading to incorrect conclusions about the statistical significance of your predictors.
However, a common misconception is that the data (i.e. the variables/predictors) need to be normally distributed, but this is not true. These models don't make any assumptions about the distribution of the predictors.
For example, imagine a case where you have a binary predictor in a regression (Male/Female, Slow/Fast, etc.): it would be impossible for this variable to be normally distributed, and yet it is still a valid predictor to use in a regression model. The normality assumption actually refers to the distribution of the residuals, not the predictors themselves.
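A quick way to sanity-check this is to test the residuals directly. A minimal sketch with simulated data (the binary predictor, sample size, and use of the Shapiro-Wilk test are illustrative choices):

```python
# Sketch: the normality assumption applies to residuals, not predictors.
# Fit a regression with a binary predictor, then test residual normality.
import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
sex = rng.integers(0, 2, size=200)           # binary predictor (not normal)
y = 3.0 + 2.0 * sex + rng.normal(0, 1, 200)  # normally distributed errors

model = LinearRegression().fit(sex.reshape(-1, 1), y)
residuals = y - model.predict(sex.reshape(-1, 1))

# Shapiro-Wilk: a high p-value gives no evidence against normal residuals
stat, p = stats.shapiro(residuals)
print(f"Shapiro-Wilk p-value: {p:.3f}")
```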
In a classification problem, we care about the distribution of the labels in the train and validation sets. In sklearn, there is a stratify option in train_test_split to ensure that the distributions of the labels in the train and validation sets are similar.
In a regression problem, let's say we want to predict the housing price based on a bunch of features. Do we need to care about the distribution of the housing price in train and validation set?
If yes, how do we achieve this in sklearn?
Forcing features to have similar distributions in your training and validation sets assumes that the data you have is highly representative of the data you will encounter in real life (i.e. in a production environment), which is often not the case.
Also, doing so may artificially inflate your validation score compared to your test score.
Instead of adjusting feature distributions in the train and validation sets, I would suggest performing cross-validation (available in sklearn), which may be more representative of a testing situation.
This book (A. Géron, Hands-On Machine Learning with Scikit-Learn & TensorFlow, O'Reilly, 2017) provides an excellent introductory discussion of this in chapter 2. To paraphrase:
Generally, for large datasets you don't need to perform stratified sampling: your training set should be a fair representation of the range of observed instances (there are of course exceptions to this). For smaller datasets, random sampling could introduce sampling bias (i.e., disproportionately recording data from only a particular region of the expected range of target attributes), and stratified sampling is probably required.
Practically, you will need to create a new categorical feature by binning this continuous feature. You can then perform stratified sampling on this categorical feature. Make sure to remove this new categorical feature before training your model!
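A sketch of this binning trick with pandas and sklearn (the income-like target, bin edges, and column names are illustrative assumptions):

```python
# Sketch: stratify a regression split by binning the continuous target.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
df = pd.DataFrame({"median_income": rng.lognormal(1.0, 0.5, 1000)})

# Create a temporary categorical feature from the continuous one
df["income_cat"] = pd.cut(df["median_income"],
                          bins=[0, 1.5, 3.0, 4.5, 6.0, np.inf],
                          labels=[1, 2, 3, 4, 5])

train, val = train_test_split(df, test_size=0.2,
                              stratify=df["income_cat"], random_state=42)

# The bin proportions now match closely across the two sets
print(train["income_cat"].value_counts(normalize=True).sort_index())
print(val["income_cat"].value_counts(normalize=True).sort_index())

# Remember to drop the helper column before training
train = train.drop(columns="income_cat")
val = val.drop(columns="income_cat")
```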
However, to do this you will need a good understanding of your features; I doubt there is much point in performing stratified sampling on features of weak predictive power. It could even do harm if you introduce some unintentional bias into the data through non-random sampling.
Take home message:
My instinct is that stratified sampling of a continuous feature should always be information- and understanding-led. I.e., if you know a feature is a strong predictor of the target variable and you also know the sampling across its values is not uniform, you probably want to perform stratified sampling to make sure the range of values is properly represented in both the training and validation sets.
The statement of my exercise says: the distribution of feature_3 is a hint of how the data is generated. I am trying to understand what I should infer from that for the rest of my ETL or ML model.
I have plotted the Q-Q plot of this feature. The distribution seems fairly normal. What can I infer from this information for the rest of my ETL or ML model?
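As an aside, the visual impression from a Q-Q plot can be backed up numerically: scipy's probplot also returns the correlation between the sample and theoretical quantiles. A sketch with simulated data standing in for the real feature_3 column:

```python
# Sketch: quantify what a Q-Q plot shows.
# A correlation r close to 1 between sample and theoretical quantiles
# supports the visual impression of normality.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
feature_3 = rng.normal(loc=10, scale=2, size=500)  # stand-in for the real data

(osm, osr), (slope, intercept, r) = stats.probplot(feature_3, dist="norm")
print(f"Q-Q correlation r = {r:.4f}")  # close to 1.0 for normal data
```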
Many machine learning models assume an underlying data distribution in order to function well.
So, coming back to your question, there are some ML techniques that assume the data fed into them is normally (Gaussian) distributed: Gaussian naive Bayes, least-squares-based (regression) models, LDA, and QDA. So the statement you are referring to implies that your data was generated by such a process and is normally distributed.
In addition, please note that there are other algorithms (e.g. SVMs, Random Forests used for regression/classification, Decision trees, Gradient Boosted Trees etc) that do not assume any type of underlying data distribution.
Which methods are best for managing, predicting, and labeling data in a dynamic environment? The system's data distribution changes; it is not static. The system can have different normal settings, and under different settings we have different normal data distributions. Consider two classes: normal and abnormal. What happens? We cannot rely on historical data and train a simple classification method to predict future observations, since one day after training the model the data distribution can change and old observations become irrelevant to new ones. Consider the following figure:
The blue distribution and the red distribution are both normal data, but under different settings, and at training time we have just one setting. This data is for one sensor. So, suppose we train a model with the blue one and also have some abnormal samples; imagine abnormal samples as normal samples with a little bit of noise or measurement fault. Then we want to test the model, but the setting changes and now we have the red distribution as our test observations. So the model misclassifies the samples.
What are the best methods for a situation like this? Please note that I have tried several clustering algorithms, but they cannot distinguish between normal and abnormal samples.
Any suggestion and help are highly welcomed. Thanks
There are plenty of books on time series data, in particular on change detection. Your example can presumably be considered a change in mean. There are statistical models to detect this, e.g.:
Basseville, Michèle, and Igor V. Nikiforov. Detection of abrupt changes: theory and application. Vol. 104. Englewood Cliffs: Prentice Hall, 1993.
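For a concrete flavor of one such statistical model, here is a minimal one-sided CUSUM sketch for detecting an upward shift in mean; the slack k, threshold h, and the simulated two-regime signal are illustrative assumptions, not from your data:

```python
# Sketch of a one-sided CUSUM test for an upward shift in mean,
# one of the classical change-detection schemes.
import numpy as np

def cusum_mean_shift(x, target_mean, k=0.5, h=8.0):
    """Return the index of the first detected upward mean shift, or None."""
    s = 0.0
    for i, xi in enumerate(x):
        # Accumulate deviations above target_mean, minus a slack k
        s = max(0.0, s + (xi - target_mean - k))
        if s > h:
            return i
    return None

rng = np.random.default_rng(0)
# "Blue" regime for 200 samples, then the mean shifts ("red" regime)
signal = np.concatenate([rng.normal(0, 1, 200), rng.normal(3, 1, 100)])

alarm = cusum_mean_shift(signal, target_mean=0.0)
print(alarm)  # index of the first alarm
```

After the shift at index 200 the statistic grows by roughly 2.5 per sample, so the alarm fires within a handful of observations of the change.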