Which power of the feature should I train with? (regression, machine-learning)

I have an explanatory variable x and a response variable y. I am trying to find which power of the feature I should train with. You can ignore the colors for my question. The scatter data is from the sensor and the line plot is the theoretical curve from the lab, which you can also ignore for my question.

For this answer I understand you want to obtain some polynomial curve going through the croissant-shaped zone where the points are dense.
I also assume that the independent variable is on the horizontal axis, while the dependent one is on the vertical axis. Otherwise, as you can see from the blue line, there is no function that could give you this.
Now, to select the degree of the polynomial you can use stepwise regression.
This means running the regression while adding or removing one feature at a time (i.e. increasing or decreasing the degree of the polynomial in this case), and computing a score such as AIC, BIC, or even adjusted R² to assess whether it is worth adding or removing that feature.
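For example, here is a minimal sketch of that idea using numpy.polyfit on made-up data standing in for your sensor readings; the AIC below is for Gaussian errors and only valid up to an additive constant:

    import numpy as np

    # Hypothetical sensor data; replace with your own x and y arrays.
    rng = np.random.default_rng(0)
    x = np.linspace(0.1, 10, 200)
    y = 3 * x**2 + rng.normal(scale=5, size=x.size)

    n = x.size
    for degree in range(1, 6):
        coeffs = np.polyfit(x, y, deg=degree)
        rss = np.sum((y - np.polyval(coeffs, x)) ** 2)
        k = degree + 1                              # number of fitted parameters
        aic = n * np.log(rss / n) + 2 * k           # Gaussian-error AIC, up to a constant
        r2 = 1 - rss / np.sum((y - y.mean()) ** 2)
        adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k)
        print(f"degree={degree}  AIC={aic:.1f}  adjusted R^2={adj_r2:.3f}")
    # Keep increasing the degree only while AIC keeps dropping
    # (or adjusted R^2 keeps rising); stop as soon as it no longer pays off.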

Related

DONUT - Anomaly detection algorithm ignores the relationship between sliding windows?

I'm trying to understand the paper https://netman.aiops.org/wp-content/uploads/2018/05/PID5338621.pdf about Robust and Rapid Clustering of KPIs for Large-Scale Anomaly Detection.
Clustering is done using ROCKA algorithm.
Steps:
1.) Preprocessing is conducted on the raw KPI data to remove amplitude differences and standardize data.
2.) In baseline extraction step, we reduce noises, remove the extreme values (which are likely anomalies), and extract underlying shapes, referred to as baselines, of KPIs. It's done by applying moving average with a small sliding window.
3.) Clustering is then conducted on the baselines of sampled KPIs, with robustness against phase shifts and noises.
4.) Finally, we calculate the centroid of each cluster, then assign the unlabeled KPIs by their distances to these centroids.
I understand ROCKA mechanism.
Now I'm trying to understand the DONUT algorithm, which is applied for anomaly detection.
How it works:
DONUT applies sliding windows over the KPI to get short series x and tries to recognize what normal patterns x follows. The indicator is then calculated by the difference between reconstructed normal patterns and x to show the severity of anomalies. In practice, a threshold should be selected for each KPI. A data point with an indicator value larger than the threshold is regarded as an anomaly.
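To make the mechanism concrete, here is a rough sketch of the windowed scoring as I understand it; the reconstruct function below is only a stand-in for DONUT's trained variational autoencoder, and the data is made up:

    import numpy as np

    def sliding_windows(series, window):
        """Stack overlapping windows of length `window` (stride 1)."""
        return np.stack([series[i:i + window] for i in range(len(series) - window + 1)])

    def anomaly_scores(series, window, reconstruct):
        """Score each window by its reconstruction error; each window is scored independently."""
        windows = sliding_windows(series, window)
        recon = reconstruct(windows)
        return np.mean((windows - recon) ** 2, axis=1)

    # Toy usage: a per-window mean as a crude stand-in "reconstruction" model.
    kpi = np.sin(np.linspace(0, 20, 500)) + np.random.default_rng(1).normal(0, 0.05, 500)
    kpi[300] += 3.0                                   # injected anomaly
    scores = anomaly_scores(kpi, window=32,
                            reconstruct=lambda w: w.mean(axis=1, keepdims=True) * np.ones_like(w))
    threshold = scores.mean() + 3 * scores.std()      # one threshold per KPI, as in the paper
    print("suspicious windows:", np.where(scores > threshold)[0])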
Now my question is :
It seems like DONUT is not robust enough against time-related anomalies, meaning that it works on a set of sliding windows and ignores the relationship between windows. So the window becomes a very critical parameter here, and the method might generate many false positives. What am I understanding wrong here?
Please help me understand how DONUT captures the relationship between sliding windows.

Right order of doing feature selection, PCA and normalization?

I know that feature selection helps me remove features that may have low contribution. I know that PCA helps reduce possibly correlated features into one, reducing the dimensions. I know that normalization transforms features to the same scale.
But is there a recommended order to do these three steps? Logically I would think that I should weed out bad features by feature selection first, followed by normalizing them, and finally use PCA to reduce dimensions and make the features as independent from each other as possible.
Is this logic correct?
Bonus question - are there any more things to do (preprocess or transform) to the features before feeding them into the estimator?
If I were doing a classifier of some sort I would personally use this order
Normalization
PCA
Feature Selection
Normalization: You would do normalization first to get the data into reasonable bounds. If you have data (x, y) where x ranges from -1000 to +1000 and y ranges from -1 to +1, you can see that any distance metric would automatically say a change in y is less significant than a change in x. We don't know that is the case yet, so we want to normalize our data.
PCA: Uses the eigenvalue decomposition of the data to find an orthogonal basis set that describes the variance in the data points. If you have 4 characteristics, PCA can show you that only 2 of them really differentiate the data points, which brings us to the last step.
Feature Selection: Once you have a coordinate space that better describes your data, you can select which features are salient. Typically you'd use the largest eigenvalues (EVs) and their corresponding eigenvectors from PCA for your representation. Since larger EVs mean there is more variance in that data direction, you get more granularity in isolating features. This is a good method to reduce the number of dimensions of your problem.
Of course this could change from problem to problem, but it is a generic guide.
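As a rough sketch of that ordering with scikit-learn (the dataset, the component count, and the SelectKBest step are arbitrary choices for illustration):

    from sklearn.datasets import load_breast_cancer
    from sklearn.decomposition import PCA
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = load_breast_cancer(return_X_y=True)

    # Normalization -> PCA -> feature selection -> classifier, in that order.
    pipe = Pipeline([
        ("scale", StandardScaler()),                # put all features on a comparable scale
        ("pca", PCA(n_components=10)),              # orthogonal directions of maximum variance
        ("select", SelectKBest(f_classif, k=5)),    # keep the most discriminative components
        ("clf", LogisticRegression(max_iter=1000)),
    ])
    print(cross_val_score(pipe, X, y, cv=5).mean())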
Generally speaking, Normalization is needed before PCA.
The key to the problem is the order of feature selection, and it depends on the method of feature selection.
A simple kind of feature selection is to check whether the variance or standard deviation of a feature is small. If these values are relatively small, the feature may not help the classifier much. But if you normalize before doing this, the standard deviations and variances become smaller (generally less than 1), which results in very small differences in std or var between the different features. If you use zero-mean normalization, the mean of every feature becomes 0 and its std becomes 1. At this point, it can be a bad idea to normalize before feature selection.
Feature selection is flexible, and there are many ways to select features. The order of feature selection should be chosen according to the actual situation.
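A small illustration of that point, using made-up data, z-score normalisation, and variance-based selection via scikit-learn's VarianceThreshold:

    import numpy as np
    from sklearn.feature_selection import VarianceThreshold
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(0)
    X = np.column_stack([
        rng.normal(0, 10.0, 500),   # widely spread, probably informative feature
        rng.normal(0, 0.01, 500),   # nearly constant feature
    ])

    # Variance-based selection on the raw data drops the flat feature ...
    print(VarianceThreshold(threshold=0.5).fit(X).get_support())          # [ True False]
    # ... but after z-score normalisation every feature has unit variance,
    # so the same selector can no longer tell the two features apart.
    X_scaled = StandardScaler().fit_transform(X)
    print(VarianceThreshold(threshold=0.5).fit(X_scaled).get_support())   # [ True  True]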
Good answers here. One point needs to be highlighted: PCA is a form of dimensionality reduction. It will find a lower-dimensional linear subspace that approximates the data well. When the axes of this subspace align with the features that one started with, it leads to interpretable feature selection as well. Otherwise, feature selection after PCA will lead to features that are linear combinations of the original set of features, and they are difficult to interpret in terms of the original features.
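One way to see what "linear combinations of the original features" means is to inspect pca.components_ after fitting; a minimal sketch on the iris data:

    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    X, _ = load_iris(return_X_y=True)
    pca = PCA(n_components=2).fit(StandardScaler().fit_transform(X))

    # Each row is one principal component, expressed as weights over the four
    # original iris features -- a mixture of features, not a single feature.
    print(pca.components_)
    print(pca.explained_variance_ratio_)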

Pre-processing data: Normalizing data labels in regression?

Recently I was told that the labels of regression data should also be normalized for better results, but I am pretty doubtful of that. I have never tried normalizing labels in either regression or classification, which is why I don't know whether that statement is true or not. Can you please give me a clear explanation (mathematical or from experience) of this problem?
Thank you so much.
Any help would be appreciated.
When you say "normalize" labels, it is not clear what you mean (i.e. whether you mean this in a statistical sense or something else). Can you please provide an example?
On Making labels uniform in data analysis
If you are trying to neaten labels for use with the text() function, you could try the abbreviate() function to shorten them, or the format() function to align them better.
The pretty() function works well for rounding labels on plot axes. For instance, the base function hist() for drawing histograms calls on Sturges or other algorithms and then uses pretty() to choose nice bin sizes.
The scale() function will standardize values by subtracting their mean and dividing by the standard deviation, which in some circles is referred to as normalization.
On the reasons for scaling in regression (in response to comment by questor). Suppose you regress Y on covariates X1, X2, ... The reasons for scaling covariates Xk depend on the context. It can enable comparison of the coefficients (effect sizes) of each covariate. It can help ensure numerical accuracy (these days not usually an issue unless covariates on hugely different scales and/or data is big). For a readable intro see Psychosomatic medicine editors' guide. For a mathematically intense discussion see Sylvain Sardy's guide.
In particular, in Bayesian regression, rescaling is advisable to ensure convergence of MCMC estimation; e.g. see this discussion.
You mean features, not labels.
It is not necessary to normalize your features for regression or classification, even though in some cases it is a trick that can help the model converge faster. You might want to check this post.
In my experience, when using a simple model like a linear regression with only a few variables, keeping the features as they are (without normalization) is preferable, since the model is more interpretable.
It may be that what you mean is that you should scale your labels. The reason is that convergence is faster and you don't get numeric instability.
For example, if your labels are in the range (1000, 1000000) and the weights are initialized close to zero, an MSE loss would be so large that you'd likely get NaN errors.
See https://datascience.stackexchange.com/q/22776/38707 for a similar discussion.
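A rough sketch of label scaling with scikit-learn's TransformedTargetRegressor, which fits on the rescaled labels and inverts the transform at prediction time (the data here is made up):

    import numpy as np
    from sklearn.compose import TransformedTargetRegressor
    from sklearn.linear_model import SGDRegressor
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))
    y = 1_000_000 * X[:, 0] + 500_000 + rng.normal(0, 1000, 200)   # labels in the hundreds of thousands

    model = TransformedTargetRegressor(
        regressor=SGDRegressor(max_iter=2000),   # gradient-based, so sensitive to label scale
        transformer=StandardScaler(),            # fit on scaled y, predict back on the original scale
    )
    model.fit(X, y)
    print(model.predict(X[:3]), y[:3])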
For a regression problem, with algorithms including decision trees, logistic regression and linear regression, I tested two modes: 1) with label scaling using MinMaxScaler, 2) without label scaling. The result I got was that the R² score is the same in both modes, while MSE and MAE scale with the labels.
For the diabetes dataset using linear regression, the results before and after scaling are:
without scaling:
Mean Squared Error: 3424.3166
Mean Absolute Error: 46.1742
R2_score : 0.33
after scaling labels:
Mean Squared Error: 0.0332
Mean Absolute Error: 0.1438
R2_score : 0.33
Also, the linked question "scale or not scale labels in deep learning?" can be useful; it says scaling can help with faster convergence.
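A sketch of that comparison for reproduction; the exact numbers depend on the train/test split, and applying MinMaxScaler to the labels is the only difference between the two runs:

    from sklearn.datasets import load_diabetes
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import MinMaxScaler

    X, y = load_diabetes(return_X_y=True)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    def report(y_train, y_test):
        pred = LinearRegression().fit(X_tr, y_train).predict(X_te)
        print(f"MSE={mean_squared_error(y_test, pred):.4f}  "
              f"MAE={mean_absolute_error(y_test, pred):.4f}  "
              f"R2={r2_score(y_test, pred):.2f}")

    report(y_tr, y_te)                                     # raw labels

    scaler = MinMaxScaler().fit(y_tr.reshape(-1, 1))       # scale labels to [0, 1]
    report(scaler.transform(y_tr.reshape(-1, 1)).ravel(),
           scaler.transform(y_te.reshape(-1, 1)).ravel())  # MSE/MAE shrink, R2 unchanged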

Linear Regression :: Normalization (Vs) Standardization

I am using linear regression to predict data, but I am getting totally contrasting results when I normalize versus standardize the variables.
Normalization: x_new = (x - x_min) / (x_max - x_min)

Z-score standardization: x_new = (x - x_mean) / x_std
a) Also, when should I normalize vs. standardize?
b) How does normalization affect linear regression?
c) Is it okay if I don't normalize all the attributes/labels in the linear regression?
Thanks,
Santosh
Note that the results might not necessarily be so different. You might simply need different hyperparameters for the two options to give similar results.
The ideal thing is to test what works best for your problem. If you can't afford this for some reason, most algorithms will probably benefit from standardization more so than from normalization.
See here for some examples of when one should be preferred over the other:
For example, in clustering analyses, standardization may be especially crucial in order to compare similarities between features based on certain distance measures. Another prominent example is the Principal Component Analysis, where we usually prefer standardization over Min-Max scaling, since we are interested in the components that maximize the variance (depending on the question and if the PCA computes the components via the correlation matrix instead of the covariance matrix; but more about PCA in my previous article).
However, this doesn't mean that Min-Max scaling is not useful at all! A popular application is image processing, where pixel intensities have to be normalized to fit within a certain range (i.e., 0 to 255 for the RGB color range). Also, typical neural network algorithms require data on a 0-1 scale.
One disadvantage of normalization over standardization is that it loses some information in the data, especially about outliers.
The linked page also includes a plot illustrating this: min-max scaling clusters all the data very close together, which may not be what you want. It might cause algorithms such as gradient descent to take longer to converge to the same solution they would reach on a standardized data set, or it might even make it impossible.
"Normalizing variables" doesn't really make sense. The correct terminology is "normalizing / scaling the features". If you're going to normalize or scale one feature, you should do the same for the rest.
That makes sense because normalization and standardization do different things.
Normalization transforms your data into a range between 0 and 1
Standardization transforms your data such that the resulting distribution has a mean of 0 and a standard deviation of 1
Normalization and standardization are designed to achieve a similar goal, which is to create features that have similar ranges to each other. We want that so we can be sure we are capturing the true information in a feature, and that we don't overweight a particular feature just because its values are much larger than those of other features.
If all of your features are within a similar range of each other then there's no real need to standardize/normalize. If, however, some features naturally take on values that are much larger/smaller than others, then normalization/standardization is called for.
If you're going to be normalizing at least one variable/feature, I would do the same thing to all of the others as well
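A quick side-by-side of the two transforms on a tiny made-up sample, using scikit-learn's MinMaxScaler and StandardScaler:

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler, StandardScaler

    x = np.array([[1.0], [2.0], [3.0], [4.0], [6.0]])

    print(MinMaxScaler().fit_transform(x).ravel())    # squeezed into [0, 1]
    print(StandardScaler().fit_transform(x).ravel())  # mean 0, standard deviation 1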
The first question is: why do we need normalisation/standardisation?
=> Take the example of a dataset with a salary variable and an age variable.
Age can range from 0 to 90, whereas salary can range from 25 thousand to 2.5 lakh.
If we compare the difference between two people, the age difference will be below 100, whereas the salary difference will be in the thousands.
So if we don't want one variable to dominate the other, we use either normalisation or standardization. Then both age and salary will be on the same scale,
but when we use standardization or normalisation, we lose the original values as they are transformed into other values. So there is a loss of interpretability, but this is extremely important when we want to draw inferences from our data.
Normalization rescales the values into a range of [0, 1]; it is also called min-max scaling.
Standardization rescales the data to have a mean (μ) of 0 and a standard deviation (σ) of 1, i.e. it expresses each value as a z-score.
For example, if the actual data is spread between 1 and 6, the standardised data will be spread roughly between -1 and 3, whereas the normalised data will be spread between 0 and 1.
Many algorithms require you to standardise/normalise the data before passing it in. For example, in PCA, where we reduce dimensions by projecting, say, 3D data onto 1D, standardisation is required.
In image processing, on the other hand, it is common to normalise pixels before processing.
But during normalisation we lose outliers (extreme data points, either too low or too high), which is a slight disadvantage.
So it depends on our preference which one we choose, but standardisation is the one most commonly recommended.
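Continuing the age/salary example, here is a small sketch (with made-up numbers) of why the raw scales matter for anything distance-based:

    import numpy as np
    from sklearn.preprocessing import StandardScaler

    # Two made-up people: (age, salary)
    a = np.array([25.0, 50_000.0])
    b = np.array([60.0, 52_000.0])

    print(np.linalg.norm(a - b))   # ~2000: dominated entirely by the salary difference

    # Standardise both columns using some reference population statistics.
    population = np.array([[20, 30_000], [35, 60_000], [50, 120_000], [65, 250_000]], dtype=float)
    scaler = StandardScaler().fit(population)
    a_s, b_s = scaler.transform([a, b])
    print(np.linalg.norm(a_s - b_s))  # now age and salary contribute on comparable scales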
None of the mentioned transformations matter for linear regression, as they are all affine transformations.
The fitted coefficients would change, but the explained variance ultimately remains the same. So, from the linear-regression perspective, outliers remain outliers (leverage points).
These transformations also do not change the distribution: the shape of the distribution remains the same.
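A quick check of that claim with ordinary least squares in scikit-learn, on made-up data:

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 2)) * [10, 0.1] + [100, -5]      # two features on very different scales
    y = 2 * X[:, 0] - 30 * X[:, 1] + rng.normal(size=300)

    Xs = StandardScaler().fit_transform(X)
    raw = LinearRegression().fit(X, y)
    std = LinearRegression().fit(Xs, y)

    print(raw.score(X, y), std.score(Xs, y))   # R^2 matches (up to floating point)
    print(raw.coef_, std.coef_)                # only the coefficients change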
A lot of people use normalisation and standardisation interchangeably. The purpose is the same: to bring the features onto the same scale. The approach is to subtract the min value (or the mean) from each value and divide by the max minus the min (or by the standard deviation), respectively. The difference you can observe is that when subtracting the min value all results are positive, whereas when subtracting the mean you get both positive and negative values. This is also one of the factors to consider when deciding which approach to use.

Why do we maximize variance during Principal Component Analysis?

I'm trying to read through PCA and saw that the objective is to maximize the variance. I don't quite understand why. Any explanation of this or related topics would be helpful.
Variance is a measure of the "variability" of the data you have. Potentially the number of components is infinite (actually, after numerization it is at most equal to the rank of the matrix, as #jazibjamil pointed out), so you want to "squeeze" the most information in each component of the finite set you build.
If, to exaggerate, you were to select a single principal component, you would want it to account for the most variability possible: hence the search for maximum variance, so that the one component collects the most "uniqueness" from the data set.
Note that PCA does not actually increase the variance of your data. Rather, it rotates the data set in such a way as to align the directions in which it is spread out the most with the principal axes. This enables you to remove those dimensions along which the data is almost flat. This decreases the dimensionality of the data while keeping the variance (or spread) among the points as close to the original as possible.
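A small numerical illustration of that rotation-and-spread view, on a made-up correlated 2-D cloud:

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    # Correlated 2-D cloud: most of the spread lies along one diagonal direction.
    X = rng.normal(size=(1000, 2)) @ np.array([[3.0, 0.0], [2.0, 0.5]])

    pca = PCA(n_components=2).fit(X)
    print("total variance:", X.var(axis=0, ddof=1).sum())
    print("variance along the principal axes:", pca.explained_variance_)
    print("explained variance ratio:", pca.explained_variance_ratio_)
    # The first component alone carries nearly all of the spread, so dropping
    # the second axis loses very little information.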
Maximizing the component vector variances is the same as maximizing the 'uniqueness' of those vectors. Thus your vectors are as distant from each other as possible. That way, if you only use the first N component vectors, you're going to capture more of the space with highly varying vectors than with similar vectors. Think about what "principal component" actually means.
Take for example a situation where you have 2 lines that are orthogonal in a 3D space. You can capture the environment much more completely with those orthogonal lines than 2 lines that are parallel (or nearly parallel). When applied to very high dimensional states using very few vectors, this becomes a much more important relationship among the vectors to maintain. In a linear algebra sense you want independent rows to be produced by PCA, otherwise some of those rows will be redundant.
See this PDF from Princeton's CS Department for a basic explanation.
Maximizing variance basically means choosing the axes that capture the maximum spread of the data points. Why? Because the direction of these axes is what really matters: it captures the correlations in the data, and later on we compress/project the points onto those axes to get rid of some dimensions.
