Using 4 components in Principle Component Analysis (PCA)

Using 4 components in Principle Component Analysis (PCA) - machine-learning

I don't know how to interpret the output of Principle Components Analysis (PCA) and the cumulative sum graph.
I have the following problem. I have 4 machines (A, B, C, and D), and each machine has 9 features where measurements were taken along the time. The goal is to classify the machine behavior.
I am using Principle Components Analysis to reduce the dimensionality from 9 features. By reducing the dimensionality of the original feature space by projecting it onto a smaller subspace, and then plotting the cumulative sum over all the variance, you can see that the first 4 eigenvalues correspond to approximately 82% of all variance 1. It means that there is 18% of information being lost with the dimensionality reduction.
I want to cluster machines based on the Principal Components, but the examples that I find always plot the data using 2 Principal components (PCs) 2. The 2 is the graph that I plot using just 2 PCs.
If the cumulative sum graph suggests that 4 PCs corresponds to 82% of all variance, and the examples of PCA only have 2 PCs, what should I do? Do I apply PCA a second time? Or I need to do something else that I don't know?
In [3] I show the code that I have to use PCA with 2 components and plot the graph. The A, B, C, and D are the machines.
[3]
sklearn_pca = sklearnPCA(n_components=2)
Y_sklearn = sklearn_pca.fit_transform(X_std)
data = []
for name, col in zip(machine_list, colors.values()):
trace = dict(
type='scatter',
x=Y_sklearn[y==name,0],
y=Y_sklearn[y==name,1],
mode='markers',
name=name,
marker=dict(
color=col,
size=12,
line=dict(
color='rgba(217, 217, 217, 0.14)',
width=0.5),
opacity=0.8)
)
data.append(trace)
layout = dict(
xaxis=dict(title='PC1', showline=False),
yaxis=dict(title='PC2', showline=False)
)
fig = dict(data=data, layout=layout)
iplot(fig)

Related

PCA vs averaging columns

I have a dataframe with 300 float type columns and 1 integer column which is the dependent variable. The 300 columns are of 3 kinds:
1.Kind A: columns 1 to 100
2.Kind B: columns 101 to 200
3.Kind C: columns 201 to 300
I want to reduce the number of dimensions. Should I average the values for each kind and aggregate into 3 columns(one for each type), or should I perform some dimensionality reduction techniques like PCA? What is the justification of the same?

Option 1:
Do not do dimensionality reduction if you have large number of training data (say more then 5*300 sample for training)
Option 2:
Since you know that there are 3 kinds of data, run a PCA of those three kinds separately and get say 2 features for each. i.e
f1, f2 = PCA(kind A columns)
f3, f4 = PCA(kind B columns)
f5, f6 = PCA(kind C columns)
train(f1, f2, f3, f4, f5, f6)
Option 3
Run PCA on all columns and only take number of columns which preserve 90+ variance
Do not average, averaging is bad. But if you really want to do averaging and if you know certainly that some features are important rather do weighted average. In general averaging of features for dimensionally reduction is a very bad idea.

PCA will only consider the rows which will have highest co-relation with the output / result. So not all rows will be considered as a part of process to determine the output.
So it will be better if u do averaging as it will consider all the rows and determine the output from them.
As u have a larger number of features it is better if all the features are used to determine output.

Correlation between time series

I have a dataset where a process is described as a time series made of ~2000 points and 1500 dimensions.
I would like to quantify how much each dimension is correlated with another time series measured by another method.
What is the appropriate way to do this (eventually done in python) ? I have heard that Pearson is not well suited for this task, at least without data preparation. What are your thoughts about that?
Many thanks!

A general good rule in data science is to first try the easy thing. Only when the easy thing fails should you move to something more complicated. With that in mind, here is how you would compute the Pearson correlation between each dimension and some other time series. The key function here being pearsonr:
import numpy as np
from scipy.stats import pearsonr
# Generate a random dataset using 2000 points and 1500 dimensions
n_times = 2000
n_dimensions = 1500
data = np.random.rand(n_times, n_dimensions)
# Generate another time series, also using 2000 points
other_time_series = np.random.rand(n_times)
# Compute correlation between each dimension and the other time series
correlations = np.zeros(n_dimensions)
for dimension in range(n_dimensions):
# The Pearson correlation function gives us both the correlation
# coefficient (r) and a p-value (p). Here, we only use the coefficient.
r, p = pearsonr(data[:, dimension], other_time_series)
correlations[dimension] = r
# Now we have, for each dimension, the Pearson correlation with the other time
# series!
len(correlations)
# Print the first 5 correlation coefficients
print(correlations[:5])
If Pearson correlation doesn't work well for you, you can try swapping out the pearsonr function with something else, like:
spearmanr Spearman rank-order correlation coefficient.
kendalltau Kendall’s tau, a correlation measure for ordinal data.

k-means for all data or for each feature?

I want use k-means to discretize a time series data in two values (0 or 1). My time series data is a matrix time per genes (line = time, column = gene). Ex:
t\x x1 x2 x3
1 0.122 0.324 0.723
2 0.543 0.573 0.329
3 0.901 0.445 0.343
4 0.612 0.353 0.435
5 0.192 0.233 0.023
My question: Should I use k clusters for all data of matrix or k clusters for each column (so I will have k cluster per column totalizing k.number_columns)? and my genes are independents

Either may work.
Discretising all attributes at once has the benefit of giving you only one symbol per time, i.e. a univariate series.
But on the other hand, if columns are independent, the quality could be better if you discretise them individually. Note thatfor one-dimensional data, if it is noisy, quantiles may be much better than k-means (which is sensitive to noise).

Batch Normalization in Convolutional Neural Network

I am newbie in convolutional neural networks and just have idea about feature maps and how convolution is done on images to extract features. I would be glad to know some details on applying batch normalisation in CNN.
I read this paper https://arxiv.org/pdf/1502.03167v3.pdf and could understand the BN algorithm applied on a data but in the end they mentioned that a slight modification is required when applied to CNN:
For convolutional layers, we additionally want the normalization to obey the convolutional property – so that different elements of the same feature map, at different locations, are normalized in the same way. To achieve this, we jointly normalize all the activations in a mini- batch, over all locations. In Alg. 1, we let B be the set of all values in a feature map across both the elements of a mini-batch and spatial locations – so for a mini-batch of size m and feature maps of size p × q, we use the effec- tive mini-batch of size m′ = |B| = m · pq. We learn a pair of parameters γ(k) and β(k) per feature map, rather than per activation. Alg. 2 is modified similarly, so that during inference the BN transform applies the same linear transformation to each activation in a given feature map.
I am total confused when they say
"so that different elements of the same feature map, at different locations, are normalized in the same way"
I know what feature maps mean and different elements are the weights in every feature map. But I could not understand what location or spatial location means.
I could not understand the below sentence at all
"In Alg. 1, we let B be the set of all values in a feature map across both the elements of a mini-batch and spatial locations"
I would be glad if someone cold elaborate and explain me in much simpler terms

Let's start with the terms. Remember that the output of the convolutional layer is a 4-rank tensor [B, H, W, C], where B is the batch size, (H, W) is the feature map size, C is the number of channels. An index (x, y) where 0 <= x < H and 0 <= y < W is a spatial location.
Usual batchnorm
Now, here's how the batchnorm is applied in a usual way (in pseudo-code):
# t is the incoming tensor of shape [B, H, W, C]
# mean and stddev are computed along 0 axis and have shape [H, W, C]
mean = mean(t, axis=0)
stddev = stddev(t, axis=0)
for i in 0..B-1:
out[i,:,:,:] = norm(t[i,:,:,:], mean, stddev)
Basically, it computes H*W*C means and H*W*C standard deviations across B elements. You may notice that different elements at different spatial locations have their own mean and variance and gather only B values.
Batchnorm in conv layer
This way is totally possible. But the convolutional layer has a special property: filter weights are shared across the input image (you can read it in detail in this post). That's why it's reasonable to normalize the output in the same way, so that each output value takes the mean and variance of B*H*W values, at different locations.
Here's how the code looks like in this case (again pseudo-code):
# t is still the incoming tensor of shape [B, H, W, C]
# but mean and stddev are computed along (0, 1, 2) axes and have just [C] shape
mean = mean(t, axis=(0, 1, 2))
stddev = stddev(t, axis=(0, 1, 2))
for i in 0..B-1, x in 0..H-1, y in 0..W-1:
out[i,x,y,:] = norm(t[i,x,y,:], mean, stddev)
In total, there are only C means and standard deviations and each one of them is computed over B*H*W values. That's what they mean when they say "effective mini-batch": the difference between the two is only in axis selection (or equivalently "mini-batch selection").

Some clarification on Maxim's answer.
I was puzzled by seeing in Keras that the axis you specify is the channels axis, as it doesn't make sense to normalize over the channels - as every channel in a conv-net is considered a different "feature". I.e. normalizing over all channels is equivalent to normalizing number of bedrooms with size in square feet (multivariate regression example from Andrew's ML course). This is usually not what you want - what you do is normalize every feature by itself. I.e. you normalize the number of bedrooms across all examples to be with mu=0 and std=1, and you normalize the the square feet across all examples to be with mu=0 and std=1.
This is why you want C means and stds, because you want a mean and std per channel/feature.
After checking and testing it myself I realized the issue: there's a bit of a confusion/misconception here. The axis you specify in Keras is actually the axis which is not in the calculations. i.e. you get average over every axis except the one specified by this argument. This is confusing, as it is exactly the opposite behavior of how NumPy works, where the specified axis is the one you do the operation on (e.g. np.mean, np.std, etc.).
I actually built a toy model with only BN, and then calculated the BN manually - took the mean, std across all the 3 first dimensions [m, n_W, n_H] and got n_C results, calculated (X-mu)/std (using broadcasting) and got identical results to the Keras results.
Hope this helps anyone who was confused as I was.

I'm only 70% sure of what I say, so if it does not make sense, please edit or mention it before downvoting.
About location or spatial location: they mean the position of pixels in an image or feature map. A feature map is comparable to a sparse modified version of image where concepts are represented.
About so that different elements of the same feature map, at different locations, are normalized in the same way:
some normalisation algorithms are local, so they are dependent of their close surrounding (location) and not the things far apart in the image. They probably mean that every pixel, regardless of their location, is treated just like the element of a set, independently of it's direct special surrounding.
About In Alg. 1, we let B be the set of all values in a feature map across both the elements of a mini-batch and spatial locations: They get a flat list of every values of every training example in the minibatch, and this list combines things whatever their location is on the feature map.

Firstly we need to make it clear that the depth of a kernel is determined by previous feature map's channel num, and the number of kernel in this layer determins the channel num of next feature map (the next layer).
then we should make it clear that each kernel(three dimentional usually) will generate just one channel of feature map in the next layer.
thirdly we should try to accept the idea of each points in the generated feature map (regardless of their position) are generated by the same kernel, by sliding on previous layer. So they could be seen as a distribution generated by this kernel, and they could be seen as samples of a stochastic variable. Then they should be averaged to obtain the mean and then the variance. (it not rigid, only helps to understand)
This is what they say "so that different elements of the same feature map, at different locations, are normalized in the same way"

how to handle large number of features machine learning

I developed a image processing program that identifies what a number is given an image of numbers. Each image was 27x27 pixels = 729 pixels. I take each R, G and B value which means I have 2187 variables from each image (+1 for the intercept = total of 2188).
I used the below gradient descent formula:
Repeat {
θj = θj−α/m∑(hθ(x)−y)xj
}
Where θj is the coefficient on variable j; α is the learning rate; hθ(x) is the hypothesis; y is real value and xj is the value of variable j. m is the number of training sets. hθ(x), y are for each training set (i.e. that's what the summation sign is for). Further the hypothesis is defined as:
hθ(x) = 1/(1+ e^-z)
z= θo + θ1X1+θ2X2 +θ3X3...θnXn
With this, and 3000 training images, I was able to train my program in just over an hour and when tested on a cross validation set, it was able to identify the correct image ~ 67% of the time.
I wanted to improve that so I decided to attempt a polynomial of degree 2.
However the number of variables jumps from 2188 to 2,394,766 per image! It takes me an hour just to do 1 step of gradient descent.
So my question is, how is this vast number of variables handled in machine learning? On the one hand, I don't have enough space to even hold that many variables for each training set. On the other hand, I am currently storing 2188 variables per training sample, but I have to perform O(n^2) just to get the values of each variable multiplied by another variable (i.e. the polynomial to degree 2 values).
So any suggestions / advice is greatly appreciated.

try to use some dimensionality reduction first (PCA, kernel PCA, or LDA if you are classifying the images)
vectorize your gradient descent - with most math libraries or in matlab etc. it will run much faster
parallelize the algorithm and then run in on multiple CPUs (but maybe your library for multiplying vectors already supports parallel computations)

Along with Jirka-x1's answer, I would first say that this is one of the key differences in working with image data than say text data for ML: high dimensionality.
Second... this is a duplicate, see How to approach machine learning problems with high dimensional input space?

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart