I need to test whether two histograms are significantly different from each other in terms of mean and variance. Both histograms consist of only two bars. When should I use the two-sample Kolmogorov-Smirnov test, and when Pearson's chi-square test? How big should the sample size be for each? Are there better alternatives?
For future reference: KS is for continuous data, while the chi-square test works for categorical data. Therefore, for comparing two histograms, the chi-square test is the better fit. However, it also needs a sufficient sample size of at least 5 expected observations per cell.
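A minimal sketch of that comparison with SciPy, assuming each histogram is just a pair of raw counts (one number per bar); the counts below are invented:

```python
import numpy as np
from scipy.stats import chi2_contingency

hist_a = [34, 66]   # the two bars of histogram A
hist_b = [48, 52]   # the two bars of histogram B

table = np.array([hist_a, hist_b])
# correction=False gives the plain Pearson statistic (no Yates correction).
chi2, p, dof, expected = chi2_contingency(table, correction=False)

# Rule of thumb: every expected cell count should be at least 5.
assert expected.min() >= 5, "sample too small for the chi-square approximation"

print(f"chi2 = {chi2:.3f}, p = {p:.3f}, dof = {dof}")
```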
I have 96 features, and the labels are represented by 1 and -1, as input to a deep learning model.
1- PCA
Here the three axes represent the first three principal components. The blue cloud represents the label 1 and the red cloud represents the label -1.
Even if we can identify two different clouds visually, they are stuck together. I think we may face a problem during the training phase because of that.
2- t-SNE
For the same features and labels with t-SNE, we can still distinguish two clouds, but again they are stuck together.
Questions:
1- Can the fact that the two clouds of dots are stuck together affect the accuracy during the training and testing phases?
2- When we remove the red and blue colors, we have essentially only one big cloud. Is there a way to work around the problem of the two clouds being "stuck" together?
What you call sticking together means that in this space, your data isn't linearly separable. It doesn't seem to be nonlinearly separable either. With these components, I would expect you to get poor accuracy for sure.
The way to work around the problem is more or different data. You have some options.
1) What about including more principal components? Maybe 4, 5, or 10 components would solve your problem. That might not work depending on your dataset, but it's the most obvious thing to try first (see the sketch after this list).
2) You could try alternative matrix decomposition techniques. PCA isn't the only one. There's NMF, kernel PCA, LSA, and many others. Which one works best for you will fundamentally be determined by the distribution of your data.
3) Use any other type of feature selection. Frankly, 96 features isn't that many to begin with. You intend on doing deep learning? Wouldn't you normally put all 96 features into a deep learning model? There are many other ways to do feature selection besides matrix decomposition if you need to.
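A hedged scikit-learn sketch of options 1 and 2; the random matrix below is only a placeholder for your real samples x 96 features:

```python
import numpy as np
from sklearn.decomposition import PCA, KernelPCA

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 96))   # stand-in for your real feature matrix

# Option 1: keep more linear components and check what they buy you.
pca = PCA(n_components=10).fit(X)
print("cumulative explained variance:",
      np.cumsum(pca.explained_variance_ratio_).round(3))

# Option 2: a nonlinear decomposition, e.g. kernel PCA with an RBF kernel.
kpca = KernelPCA(n_components=10, kernel="rbf")
X_kpca = kpca.fit_transform(X)
print("kernel PCA embedding shape:", X_kpca.shape)
```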
Good luck.
I use example code to compare HSV histograms using EMD.
I want to find similar images in people's (mobile) picture library. It's quite common that people take several images of the same subject (in a row) with just slight changes: zooming in/out a bit, different angle, different exposure as a result of changing position, other pose, ....
I selected 4 sets of 4 similar images to test this algorithm. When comparing the images inside the sets, I get 22 EMD-L1 values between roughly 0.25 and 2.25 (average 1.47) and 2 outliers around 7.2.
When I cross-compare between sets, I get values between 2 and 15 with an average around 8.
Yes, there is a significant difference between the two ranges of results. But I was disappointed that there was no gap between these ranges, and instead a small overlap [2.0, 2.25]. I'm hoping to improve the algorithm.
How can I optimise my comparison for my particular use-case? There are various histogram forms, various histogram comparison algorithms, and then each has various parameters.
Does OpenCV implement the fastest known EMD algorithm? I was surprised that the comparison of some histograms took up to a second; especially with the relatively small bin numbers.
Then, some cross-comparisons give good EMD results, but have totally different RGB histograms. Here are two images:
My current EMD-L1 says 1.95, but the RGB histograms are totally different.
You've probably already refined your comparison method by now. But in case it's not obvious: you could divide each image into (possibly overlapping) subregions and then compute the EMD for all four parts.
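A rough sketch of that idea using OpenCV's cv2.EMD. The helper names and the hue-only histogram are my own simplifications, and the quadrants here are non-overlapping for brevity:

```python
import cv2
import numpy as np

def hue_signature(img_bgr, bins=32):
    """Build an EMD signature (weight, bin position) from a hue histogram."""
    hsv = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0], None, [bins], [0, 180]).flatten()
    hist /= hist.sum() + 1e-9                      # normalise the weights
    return np.column_stack([hist, np.arange(bins)]).astype(np.float32)

def quadrant_emd(img1, img2):
    """Average EMD-L1 over the four quadrants of the two images."""
    h, w = img1.shape[:2]
    img2 = cv2.resize(img2, (w, h))                # align sizes first
    scores = []
    for ys in (slice(0, h // 2), slice(h // 2, h)):
        for xs in (slice(0, w // 2), slice(w // 2, w)):
            emd, _, _ = cv2.EMD(hue_signature(img1[ys, xs]),
                                hue_signature(img2[ys, xs]), cv2.DIST_L1)
            scores.append(emd)
    return float(np.mean(scores))
```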
I am using linear regression to predict data, but I am getting totally contrasting results when I normalize versus standardize the variables.
Normalization: x' = (x - x_min) / (x_max - x_min)
Z-score standardization: x' = (x - x_mean) / x_std
a) Also, when to Normalize (Vs) Standardize ?
b) How Normalization affects Linear Regression?
c) Is it okay if I don't normalize all the attributes/labels in the linear regression?
Note that the results might not necessarily be so different. You might simply need different hyperparameters for the two options to give similar results.
The ideal thing is to test what works best for your problem. If you can't afford this for some reason, most algorithms will probably benefit from standardization more so than from normalization.
See here for some examples of when one should be preferred over the other:
For example, in clustering analyses, standardization may be especially crucial in order to compare similarities between features based on certain distance measures. Another prominent example is the Principal Component Analysis, where we usually prefer standardization over Min-Max scaling, since we are interested in the components that maximize the variance (depending on the question and if the PCA computes the components via the correlation matrix instead of the covariance matrix; but more about PCA in my previous article).
However, this doesn't mean that Min-Max scaling is not useful at all! A popular application is image processing, where pixel intensities have to be normalized to fit within a certain range (i.e., 0 to 255 for the RGB color range). Also, typical neural network algorithms require data on a 0-1 scale.
One disadvantage of normalization over standardization is that it loses some information in the data, especially about outliers.
Also on the linked page, there is this picture:
As you can see, scaling clusters all the data very close together, which may not be what you want. It might cause algorithms such as gradient descent to take longer to converge to the same solution they would on a standardized data set, or it might even make it impossible.
"Normalizing variables" doesn't really make sense. The correct terminology is "normalizing / scaling the features". If you're going to normalize or scale one feature, you should do the same for the rest.
Getting different results makes sense, because normalization and standardization do different things.
Normalization transforms your data into a range between 0 and 1
Standardization transforms your data such that the resulting distribution has a mean of 0 and a standard deviation of 1
Normalization/standardization are designed to achieve a similar goal, which is to create features that have similar ranges to each other. We want that so we can be sure we are capturing the true information in a feature, and so we don't overweight a particular feature just because its values are much larger than those of other features.
If all of your features are within a similar range of each other, then there's no real need to standardize/normalize. If, however, some features naturally take on values that are much larger or smaller than others, then normalization/standardization is called for.
If you're going to be normalizing at least one variable/feature, I would do the same thing to all of the others as well
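A minimal scikit-learn sketch contrasting the two transforms; the toy feature matrix (age, salary) is invented for illustration:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[25, 30_000],
              [40, 80_000],
              [60, 250_000]], dtype=float)   # columns: age, salary

X_norm = MinMaxScaler().fit_transform(X)     # each column squeezed into [0, 1]
X_std = StandardScaler().fit_transform(X)    # each column: mean 0, std 1

print(X_norm.round(2))                       # all values in [0, 1]
print(X_std.round(2))                        # values centered around 0
print(X_std.mean(axis=0).round(6), X_std.std(axis=0).round(6))
```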
The first question is: why do we need normalisation/standardisation?
=> Take the example of a dataset with a salary variable and an age variable.
Age can range from 0 to 90, whereas salary can range from 25 thousand to 2.5 lakh (250 thousand).
If we compare the difference between two people, the age difference will be below 100, while the salary difference will be in the thousands.
So if we don't want one variable to dominate the other, we use either normalisation or standardization. Then both age and salary will be on the same scale;
but when we use standardization or normalisation, we lose the original values, as they are transformed into other values. So we lose some interpretability, which matters when we want to draw inferences from our data.
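Here is that domination effect in a tiny numpy sketch (all numbers invented): before scaling, salary swamps age in any distance computation.

```python
import numpy as np

person_a = np.array([30, 50_000.0])               # (age, salary)
person_b = np.array([60, 52_000.0])
print(np.linalg.norm(person_a - person_b))        # ~2000: almost all salary

# Min-max scale both features (assume age in [0, 90], salary in
# [25_000, 250_000]); now age and salary contribute comparably.
lo = np.array([0.0, 25_000.0])
hi = np.array([90.0, 250_000.0])
a_scaled = (person_a - lo) / (hi - lo)
b_scaled = (person_b - lo) / (hi - lo)
print(np.linalg.norm(a_scaled - b_scaled))        # age now matters
```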
Normalization rescales the values into the range [0,1]; this is also called min-max scaling.
Standardization rescales data to have a mean (μ) of 0 and a standard deviation (σ) of 1. Note that this centres and scales the data, but it does not by itself make the distribution normal.
Example below:
Another example:
In the above image, you can see that our actual data (in green) is spread between 1 and 6, the standardised data (in red) is spread around -1 to 3, whereas the normalised data (in blue) is spread around 0 to 1.
Many algorithms require you to standardise/normalise the data before passing it in. For example, in PCA, where we reduce dimensionality by projecting our 3D data onto 1D (say), standardisation is required.
In image processing, on the other hand, it is common to normalise pixels before processing.
But during normalisation we lose information about outliers (extreme data points, either too low or too high), which is a slight disadvantage.
So which one we choose depends on the use case, though standardisation is often the recommended default.
None of the mentioned transformations should matter for linear regression, as these are all affine transformations.
The fitted coefficients would change, but the explained variance will ultimately remain the same. So, from the linear regression perspective, outliers remain outliers (leverage points).
And these transformations will not change the distribution either; the shape of the distribution remains the same.
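A quick synthetic check of this claim with scikit-learn: the fitted coefficients change under rescaling, but R^2 does not. The data is made up.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler, StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3)) * np.array([1.0, 100.0, 0.01])  # mixed scales
y = X @ np.array([2.0, 0.03, 500.0]) + rng.normal(size=200)

for name, Xt in [("raw", X),
                 ("normalized", MinMaxScaler().fit_transform(X)),
                 ("standardized", StandardScaler().fit_transform(X))]:
    model = LinearRegression().fit(Xt, y)
    print(f"{name:>12}: R^2 = {model.score(Xt, y):.6f}")
# All three R^2 values agree; only the coefficients differ.
```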
A lot of people use normalisation and standardisation interchangeably. The purpose is the same: to bring the features onto the same scale. The approach is to subtract the min value or the mean from each value, and divide by (max minus min) or by the standard deviation, respectively. One difference you can observe is that with the min value you get all positive values, whereas with the mean you get both positive and negative values. This is also one of the factors in deciding which approach to use.
Different papers/libraries seem to have a different way of computing the chi squared distance, for instance in OpenCV it's expressed in one way while in this paper it's expressed in a different manner.
My first question is, what's the difference between the two formulas, i.e. why in one formula we divide by the value of one bin while in the other we divide by the sum of the two bins?
Secondly, should the histograms be normalized, and if so, why? The chi-squared statistic doesn't require it, but the general consensus seems to be to normalize a histogram before using a chi-squared distance.
The documentation is wrong; the implementation inside OpenCV is correct. Take a look at this bug report.
Also, normalising a histogram does not really change its pattern or "shape"; only the scale is brought down. So as long as you're working independently of scale, which you probably are if you're looking at how much one histogram "resembles" another, normalising should only make calculations faster (hopefully).
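For reference, the two formula variants from the question can be written out directly in numpy; this is just an illustration with made-up bin counts, not OpenCV's code.

```python
import numpy as np

h1 = np.array([10.0, 20.0, 5.0, 15.0])
h2 = np.array([12.0, 15.0, 8.0, 15.0])
eps = 1e-10                                   # guard against empty bins

# OpenCV-style (asymmetric): divide by the bins of the first histogram.
chi2_asym = np.sum((h1 - h2) ** 2 / (h1 + eps))

# Symmetric variant common in papers: divide by the sum of the two bins
# (conventions differ on the 1/2 factor).
chi2_sym = 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

print(chi2_asym, chi2_sym)

# Normalising both histograms first (so each sums to 1) makes the
# distance independent of the total pixel count.
h1n, h2n = h1 / h1.sum(), h2 / h2.sum()
print(np.sum((h1n - h2n) ** 2 / (h1n + eps)))
```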
I'm trying to read up on PCA and saw that the objective is to maximize the variance. I don't quite understand why. Any explanation of this or related topics would be helpful.
Variance is a measure of the "variability" of the data you have. Potentially the number of components is infinite (actually, after numerization it is at most equal to the rank of the matrix, as @jazibjamil pointed out), so you want to "squeeze" the most information into each component of the finite set you build.
If, to exaggerate, you were to select a single principal component, you would want it to account for the most variability possible: hence the search for maximum variance, so that the one component collects the most "uniqueness" from the data set.
Note that PCA does not actually increase the variance of your data. Rather, it rotates the data set in such a way as to align the directions in which it is spread out the most with the principal axes. This enables you to remove those dimensions along which the data is almost flat. This decreases the dimensionality of the data while keeping the variance (or spread) among the points as close to the original as possible.
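A small sketch backing this up with scikit-learn: the rotation PCA applies preserves the total variance, it just concentrates it in the leading components. The data is random and elongated on purpose.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 3)) * np.array([5.0, 1.0, 0.1])

pca = PCA(n_components=3).fit(X)
X_rot = pca.transform(X)

print("total variance before:", X.var(axis=0).sum().round(3))
print("total variance after :", X_rot.var(axis=0).sum().round(3))  # same
print("per-component share  :", pca.explained_variance_ratio_.round(3))
```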
Maximizing the component vector variances is the same as maximizing the 'uniqueness' of those vectors. Thus your vectors are as distant from each other as possible. That way, if you only use the first N component vectors, you're going to capture more of the space with highly varying vectors than with similar vectors. Think about what "principal component" actually means.
Take for example a situation where you have 2 lines that are orthogonal in a 3D space. You can capture the environment much more completely with those orthogonal lines than 2 lines that are parallel (or nearly parallel). When applied to very high dimensional states using very few vectors, this becomes a much more important relationship among the vectors to maintain. In a linear algebra sense you want independent rows to be produced by PCA, otherwise some of those rows will be redundant.
See this PDF from Princeton's CS Department for a basic explanation.
Maximizing variance basically means choosing the axes that capture the maximum spread of the data points. Why? Because the direction of these axes is what really matters: it largely explains the correlations, and later on we compress/project the points along those axes to get rid of some dimensions.