Cannot comprehend output of sklearn.decomposition.PCA - machine-learning

I am a little confused about the PCA algorithm, especially the one implemented in sklearn.
When I use PCA from sklearn.decomposition with a 4000x784 matrix
X.shape = (4000,784)
pca = PCA()
pca.fit(X)
pca.explained_variance_.shape
I get
(784,)
On the other hand, when I use another dataset with shape (50,784), I get
(50,)
Am I doing something wrong?

Let's see:
explained_variance_ratio_ : array, [n_components]. Percentage of variance explained by each of the selected components. If k is not set then all components are stored and the sum of explained variances is equal to 1.0
In the first case, your data has 4000 elements with 784 components, so the attribute gives you an array of 784 values. If this is correct, then you need to transpose the second dataset.

The maximal number of components you get with PCA is equal to the smaller dimension of your X matrix, i.e. min(n_samples, n_features).
The explained_variance_ attribute shows you how much of the variance of the data is explained by each PCA component.
These array shapes are normal: you get 784 components when you have more samples than features, but only 50 when you have 50 rows of data.
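To see where the min(n_samples, n_features) limit comes from, here is a minimal NumPy-only sketch. It assumes the usual SVD formulation of PCA (centre the data, take the singular values); the helper name explained_variance and the small random matrices are made up for illustration and stand in for the 4000x784 and 50x784 datasets.

```python
import numpy as np

def explained_variance(X):
    """Variance along each principal axis, via SVD of the centred data."""
    Xc = X - X.mean(axis=0)                    # centre each feature
    s = np.linalg.svd(Xc, compute_uv=False)    # min(n_samples, n_features) singular values
    return s ** 2 / (X.shape[0] - 1)

rng = np.random.default_rng(0)
tall = rng.normal(size=(400, 64))   # more samples than features
wide = rng.normal(size=(50, 64))    # fewer samples than features

ev_tall = explained_variance(tall)
ev_wide = explained_variance(wide)
print(ev_tall.shape, ev_wide.shape)
```

The number of values is always the smaller of the two dimensions, so no transposition is needed to explain the (50,) result.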

Related

Why do we use MaxPooling 2x2? Can we use any other size like 3x3 or 5x5? And how to select which pooling to choose in what scenario?

Greetings,
I've searched everywhere on YouTube and Google, and also read some articles and research papers, but can't seem to find an exact answer to my questions.
I have a few questions regarding convolutional neural networks. First: why do we use MaxPooling of size 2x2 rather than any other size like 3x3, 4x4 ... nxn (of course smaller than the input), and can we even use a size other than 2x2? Second: why do we use MaxPooling most of the time? Does it depend on the images? For example, if we have noisy images, would MaxPooling be suitable or should we use another type of pooling?
Thank you!
MaxPool2D downsamples its input along its spatial dimensions (height and width) by taking the maximum value over an input window (of size defined by pool_size) for each channel of the input. For example, if I apply 2x2 MaxPooling2D on this array:
array = np.array([
[[5],[8]],
[[7],[2]]
])
Then the result would be 8, which is the maximum value of an element in this array.
Another example: if I apply a 2x2 MaxPooling2D with a stride of 1 on this array:
array = tf.constant([[[1.], [2.], [3.]],
[[4.], [5.], [6.]],
[[7.], [8.], [9.]]])
Then the output would be this:
([
[[5.], [6.]],
[[8.], [9.]]
])
What MaxPooling2D did here is that it slid a 2x2 window over the input and took the maximum value in each window. Note that this output corresponds to a stride of 1; with the default stride, which equals the pool size, the window moves two steps at a time and each spatial dimension is roughly halved. If you still have any problem with how this works, check this from keras and this from SO
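The windowing can be reproduced with a small NumPy sketch. This is not the Keras implementation; max_pool_2d is a hypothetical helper for a single-channel (H, W) input with no padding, written only to show how pool size and stride determine the output.

```python
import numpy as np

def max_pool_2d(x, pool=2, stride=None):
    """Max pooling over a 2-D array with 'valid' (no) padding."""
    stride = pool if stride is None else stride   # assumed default: stride = pool size
    h, w = x.shape
    out_h = (h - pool) // stride + 1
    out_w = (w - pool) // stride + 1
    out = np.empty((out_h, out_w), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            out[i, j] = x[r:r + pool, c:c + pool].max()  # maximum of the current window
    return out

x = np.arange(1.0, 10.0).reshape(3, 3)   # [[1,2,3],[4,5,6],[7,8,9]]

pooled_stride1 = max_pool_2d(x, pool=2, stride=1)   # 2x2 output
pooled_default = max_pool_2d(x, pool=2)             # 1x1 output
pooled_3x3 = max_pool_2d(x, pool=3)                 # 1x1 output
```

On this 3x3 input, a 2x2 window with stride 1 keeps four values, while a 2x2 window with stride 2 or a 3x3 window keeps only one.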
Now that it is clear that MaxPool2D downsamples the input, let's get back to your question:
Why is a 2x2 MaxPooling used everywhere and not 3x3 or 4x4?
Well, the reason is that larger windows discard more data. Applying a 3x3 MaxPooling2D on a matrix of shape (3,3,1) results in a (1,1,1) matrix, whereas applying a 2x2 MaxPooling2D with a stride of 1 on the same matrix results in a (2,2,1) matrix. Obviously a (2,2,1) matrix can keep more data than a (1,1,1) matrix. Often, a pooling size larger than 2x2 loses a great deal of data, so 2x2 is usually the better choice. This is why you see 2x2 MaxPooling2D in so many architectures, such as VGG16. Other sizes are allowed, though: ResNet50, for instance, uses a 3x3 max pooling with a stride of 2 after its first convolution.

How to calculate correlation of colours in a dataset?

In this Distill article (https://distill.pub/2017/feature-visualization/) in footnote 8 authors write:
The Fourier transforms decorrelates spatially, but a correlation will still exist
between colors. To address this, we explicitly measure the correlation between colors
in the training set and use a Cholesky decomposition to decorrelate them.
I have trouble understanding how to do that. I understand that for an arbitrary image I can calculate a correlation matrix by interpreting the image's shape as [channels, width*height] instead of [channels, height, width]. But how to take the whole dataset into account? It can be averaged over, but that doesn't have anything to do with Cholesky decomposition.
Inspecting the code confuses me even more (https://github.com/tensorflow/lucid/blob/master/lucid/optvis/param/color.py#L24). There's no code for calculating correlations, but there's a hard-coded version of the matrix (and the decorrelation happens by matrix multiplication with this matrix). The matrix is named color_correlation_svd_sqrt, which has svd inside of it, and SVD wasn't mentioned anywhere else. Also the matrix there is non-triangular, which means that it hasn't come from the Cholesky decomposition.
Clarifications on any points I've mentioned would be greatly appreciated.
I figured out the answer to your question here: How to calculate the 3x3 covariance matrix for RGB values across an image dataset?
In short, you calculate the RGB covariance matrix for the image dataset and then do the following calculations:
import torch
# dataset_rgb_cov_matrix: the 3x3 RGB covariance matrix of the dataset
U, S, V = torch.svd(dataset_rgb_cov_matrix)
epsilon = 1e-10
svd_sqrt = U @ torch.diag(torch.sqrt(S + epsilon))
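For a self-contained view of the whole recipe, here is a NumPy sketch under stated assumptions: the images array is random stand-in data of shape (N, H, W, 3), not a real training set, and every pixel of every image is treated as one RGB sample when estimating the 3x3 covariance. Variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
images = rng.random((16, 8, 8, 3))                      # stand-in dataset, values in [0, 1)
images[..., 1] = 0.5 * images[..., 0] + 0.5 * images[..., 1]  # make R and G correlated

pixels = images.reshape(-1, 3)          # every pixel of every image is one RGB sample
cov = np.cov(pixels, rowvar=False)      # 3x3 covariance over the whole dataset

U, S, Vt = np.linalg.svd(cov)
eps = 1e-10
svd_sqrt = U @ np.diag(np.sqrt(S + eps))   # a matrix square root of cov

chol = np.linalg.cholesky(cov)             # lower-triangular alternative
```

Both svd_sqrt and chol satisfy M @ M.T ≈ cov, so either can serve as the colour mixing matrix. That would explain why a matrix obtained through SVD does not have to be triangular, although whether lucid's hard-coded matrix was produced exactly this way is an assumption here.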

Bring any PyTorch cuda tensor in the range [0,1]

Suppose I have a PyTorch Cuda Float tensor x of the shape [b,c,h,w] taking on any arbitrary value allowed by Float Tensor range. I want to normalise it in the range [0,1].
I think of the following algorithm (but any other will also do).
Step 1: Find the minimum of each sample in the batch. Call it min; it has shape [b,1,1,1].
Step 2: Similarly find the maximum and call it max.
Step 3: Use y = (x-min)/max. Alternatively use y = (x-min)/(max-min). I don't know which one will be better. y should have the same shape as x.
I am using PyTorch 1.3.1.
Specifically I am unable to get the desired min using torch.min(). Same goes for max.
I am going to feed it to a pre-trained VGG to calculate a perceptual loss (after the above normalisation I will additionally bring the values to the ImageNet mean and std). For certain reasons I cannot enforce the [0,1] range during data loading: previous works in my area use a very specific normalisation algorithm which has to be followed, and it sometimes does not ensure the [0,1] bound, although the values stay somewhere in its vicinity. That is why I have to do this explicit normalisation as a precaution when computing the perceptual loss. All out-of-the-box implementations of perceptual loss I am aware of assume the data is in the [0,1] or [-1,1] range and so do not do this transformation.
Thank you very much
Not the most elegant way, but you can do that by reducing over each of the non-batch dimensions in turn with keepdim=True, which leaves tensors of shape [b,1,1,1]:
channel_min = x.min(dim=1, keepdim=True)[0].min(dim=2,keepdim=True)[0].min(dim=3, keepdim=True)[0]
channel_max = x.max(dim=1, keepdim=True)[0].max(dim=2,keepdim=True)[0].max(dim=3, keepdim=True)[0]
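To go from the per-sample minimum and maximum to the normalised tensor, one possible sketch follows. It flattens each sample instead of chaining three reductions, and it uses (x - min) / (max - min) because (x - min) / max does not guarantee the [0,1] range (for example when max is negative). The function name normalise_per_sample and the eps guard against division by zero are assumptions added for illustration.

```python
import torch

def normalise_per_sample(x, eps=1e-8):
    """Scale every sample of a [b, c, h, w] tensor into [0, 1] independently."""
    b = x.shape[0]
    flat = x.reshape(b, -1)                        # [b, c*h*w]
    x_min = flat.min(dim=1)[0].view(b, 1, 1, 1)    # per-sample minimum
    x_max = flat.max(dim=1)[0].view(b, 1, 1, 1)    # per-sample maximum
    return (x - x_min) / (x_max - x_min + eps)     # eps avoids 0/0 for constant samples

x = torch.randn(4, 3, 8, 8) * 10.0   # arbitrary-range stand-in data
y = normalise_per_sample(x)
```

The same code works on a CUDA tensor, since the result stays on the device x is on.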

Handling zero rows/columns in covariance matrix during em-algorithm

I tried to implement GMMs but I have a few problems during the em-algorithm.
Let's say I've got 3D Samples (stat1, stat2, stat3) which I use to train the GMMs.
One of my training sets for one of the GMMs has a "0" for stat1 in nearly every sample. During training I get really small numbers (like "1.4456539880060609E-124") in the first row and column of the covariance matrix, which in the next iteration of the EM algorithm leads to 0.0 in the first row and column.
I get something like this:
0.0 0.0 0.0
0.0 5.0 6.0
0.0 2.0 1.0
I need the inverse covariance matrix to calculate the density but since one column is zero I can't do this.
I thought about falling back to the old covariance matrix (and mean) or replacing every 0 with a really small number.
Or is there another simple solution to this problem?
Simply put, your data lies in a degenerate subspace of your actual input space, and a GMM in its most generic form is not well suited to such a setting. The problem is that the empirical covariance estimator you use simply fails for such data (as you said, you cannot invert it). What do you usually do? You change the covariance estimator to a constrained/regularized one, for example:
Constant-based shrinking: instead of using Sigma = Cov(X) you use Sigma = Cov(X) + eps * I, where eps is a predefined small constant and I is the identity matrix. Consequently you never have zero values on the diagonal, and it is easy to prove that for a reasonable epsilon the result is invertible.
Nicely fitted shrinking, like the Oracle Covariance Estimator or the Ledoit-Wolf Covariance Estimator, which find the best epsilon based on the data itself.
Constraining your Gaussians to, for example, the spherical family, thus N(m, sigma * I), where sigma = avg_i( cov( X[:, i] ) ) is the mean variance per dimension. This limits you to spherical Gaussians, and also solves the above issue.
There are many more solutions possible, but all are based on the same thing: change the covariance estimator in such a way that you have a guarantee of invertibility.
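The constant-based shrinking option can be sketched in a few lines of NumPy. The data below is synthetic, with the first feature forced to zero to mimic the degenerate stat1 column, and the value of eps is an arbitrary choice for illustration, not a recommendation.

```python
import numpy as np

def regularised_covariance(X, eps=1e-6):
    """Empirical covariance plus eps * I, so the result stays invertible."""
    cov = np.cov(X, rowvar=False)
    return cov + eps * np.eye(cov.shape[0])

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X[:, 0] = 0.0                                  # degenerate first feature

cov = np.cov(X, rowvar=False)                  # first row and column are exactly zero
sigma = regularised_covariance(X)
sigma_inv = np.linalg.inv(sigma)               # inversion now succeeds
```

In an EM implementation, the same eps * I term would be added to each component's covariance after every M-step.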

K means clustering for multidimensional data

Suppose the data set has 440 objects and 8 attributes (the wholesale customers data set, taken from the UCI machine learning repository). How do we calculate centroids for such a data set?
https://archive.ics.uci.edu/ml/datasets/Wholesale+customers
If I calculate the mean of the values of each row, will that be the centroid?
And how do I plot the resulting clusters in Matlab?
OK, first of all, in the dataset, 1 row corresponds to a single example in the data, you have 440 rows, which means the dataset consists of 440 examples. Each column contains the values for that specific feature (or attribute as you call it), e.g. column 1 in your dataset contains the values for the feature Channel, column 2 the values for the feature Region and so on.
K-Means
Now for K-Means Clustering, you need to specify the number of clusters (the K in K-Means). Say you want K=3 clusters, then the simplest way to initialise K-Means is to randomly choose 3 examples from your dataset (that is 3 rows, randomly drawn from the 440 rows you have) as your centroids. Now these 3 examples are your centroids.
You can think of your centroids as 3 bins, and you want to put every example from the dataset into the closest bin (closeness is usually measured by the Euclidean distance; check the function norm in Matlab).
After the first round of putting all examples into the closest bin, you recalculate the centroids by calculating the mean of all examples in their respective bins. You repeat the process of putting all the examples into the closest bin until no example in your dataset moves to another bin.
Some Matlab starting points
You load the data by X = load('path/to/the/dataset', '-ascii');
In your case X will be a 440x8 matrix.
You can calculate the Euclidean distance from an example to a centroid by
distance = norm(example - centroid1);
where both example and centroid1 have dimensionality 1x8.
Recalculating the centroids works as follows. Suppose you have done 1 iteration of K-Means and have put all examples into their respective closest bin. Say Bin1 now contains all examples that are closest to centroid1, so Bin1 has dimensionality 127x8, which means that 127 examples out of 440 are in this bin. To calculate the centroid position for the next iteration you can then do centroid1 = mean(Bin1); (mean(Bin1, 1) is the safer form if a bin could end up with a single example). You would do the same for your other bins.
As for plotting, note that your dataset contains 8 features, which means 8 dimensions, and that cannot be visualised directly. I'd suggest you create or look for a (dummy) dataset which only consists of 2 features and can therefore be visualised using Matlab's plot() function.
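The procedure described in this answer can be sketched end to end as follows, in Python with NumPy rather than Matlab. The function name kmeans, its parameters, and the random 440x8 matrix standing in for the wholesale customers data are all illustrative assumptions. Note that each centroid is the mean over the rows assigned to a bin (a column-wise mean giving one 1x8 vector), not the mean of the values within a single row.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain K-Means: random rows as initial centroids, Euclidean distance."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        # Distance of every example to every centroid, shape (n_examples, k).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)          # closest bin per example
        if np.array_equal(new_labels, labels):     # nobody moved: converged
            break
        labels = new_labels
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:                   # keep old centroid if the bin is empty
                centroids[j] = members.mean(axis=0)   # column-wise mean
    return centroids, labels

rng = np.random.default_rng(1)
X = rng.normal(size=(440, 8))        # stand-in for the 440x8 data matrix
centroids, labels = kmeans(X, k=3)
```

For plotting, the same function can be run on a 2-feature dataset and the points coloured by labels.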
