FastICA using scikit-learn - reconstruction error analysis - machine-learning

I am trying to use the FastICA procedure in scikit-learn. For validation purposes I tried to understand the difference between PCA-based and ICA-based signal reconstruction.
There are 6 observed signals, and I tried to reconstruct them using 3 independent components. The problem is that both ICA and PCA result in the same reconstruction errors, no matter what norm I use. Can someone shed light on what is happening here?
The code is below:
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA, FastICA

pca = PCA(n_components=3)
icamodel = FastICA(n_components=3, whiten=True)
Data = TrainingDataDict[YearSpan][RiskFactorNames]
PCR_Dict[YearSpan] = pd.DataFrame(pca.fit_transform(Data),
                                  columns=['PC1', 'PC2', 'PC3'], index=Data.index)
ICR_Dict[YearSpan] = pd.DataFrame(icamodel.fit_transform(Data),
                                  columns=['IC1', 'IC2', 'IC3'], index=Data.index)
'------------------------Inverse Transform of the ICs and PCs -----------'
# inverse_transform maps back to the original 6-signal space, so the
# reconstructed frames must keep all 6 original column names
PCA_New_Data_Df = pd.DataFrame(pca.inverse_transform(PCR_Dict[YearSpan]),
                               columns=Data.columns, index=Data.index)
ICA_New_Data_Df = pd.DataFrame(icamodel.inverse_transform(ICR_Dict[YearSpan]),
                               columns=Data.columns, index=Data.index)
Below is how I measure the reconstruction errors:
'-----------reconstruction errors------------------'
print('PCA reconstruction error L2 norm:', np.sqrt((PCA_New_Data_Df - Data).apply(np.square).mean()))
print('ICA reconstruction error L2 norm:', np.sqrt((ICA_New_Data_Df - Data).apply(np.square).mean()))
print('PCA reconstruction error L1 norm:', (PCA_New_Data_Df - Data).apply(np.absolute).mean())
print('ICA reconstruction error L1 norm:', (ICA_New_Data_Df - Data).apply(np.absolute).mean())
Below are descriptions of the tails of the PCs and ICs:
PC Stats : ('2003', '2005')
Kurtosis Skewness
PCR_1 -0.001075 -0.101006
PCR_2 1.057140 0.316163
PCR_3 1.067471 0.047946
IC Stats : ('2003', '2005')
Kurtosis Skewness
ICR_1 -0.221336 -0.204362
ICR_2 1.499278 0.433495
ICR_3 3.654237 0.072480
Below are the results of the reconstruction
PCA reconstruction error L2 norm:
SPTR 0.000601
SPTRMDCP 0.001503
RU20INTR 0.000788
LBUSTRUU 0.002311
LF98TRUU 0.001811
NDDUEAFE 0.000135
dtype: float64
ICA reconstruction error L2 norm:
SPTR 0.000601
SPTRMDCP 0.001503
RU20INTR 0.000788
LBUSTRUU 0.002311
LF98TRUU 0.001811
NDDUEAFE 0.000135
Even the L1 norms are the same. I am a bit confused!

Sorry for the late reply; I hope this answer can still help you.
FastICA can be seen as whitening (which can be achieved by PCA) plus an orthogonal rotation (a rotation chosen so that the estimated sources are as non-Gaussian as possible).
An orthogonal rotation does not affect the reconstruction error of the solution, and hence you get the same reconstruction error for PCA and ICA.
Rotating a PCA solution is often done within psychology (for example, a varimax rotation). The orthogonal rotation matrix in FastICA is, however, estimated with an iterative procedure (the fixed-point iteration scheme from Aapo Hyvärinen).
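A minimal sketch of this (synthetic data; all names here are made up for illustration): keep the same number of components in PCA and FastICA, and the reconstructions coincide up to numerical noise.

import numpy as np
from sklearn.decomposition import PCA, FastICA

rng = np.random.RandomState(0)
X = rng.randn(500, 6) @ rng.randn(6, 6)   # 6 correlated observed signals

pca = PCA(n_components=3)
# whiten='unit-variance' is the newer spelling; older versions used whiten=True
ica = FastICA(n_components=3, whiten='unit-variance', random_state=0)

X_pca = pca.inverse_transform(pca.fit_transform(X))
X_ica = ica.inverse_transform(ica.fit_transform(X))

# both reconstructions are the projection onto the same 3-dimensional
# principal subspace (plus the mean), so the residuals agree
print(np.allclose(X_pca, X_ica))   # True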

Related

Using ICA over MIT BIH NST dataset

In this dataset, I wanted to use signals with the same unit but different SNRs as the input signals to ICA, i.e.:
ica_input = np.array([record_118e_6(MLII),    # SNR -6 dB
                      record_118e00(MLII),    # SNR  0 dB
                      record_118e06(MLII),    # SNR  6 dB
                      record_118e12(MLII),    # SNR 12 dB
                      record_118e18(MLII)])   # SNR 18 dB
Is this a correct input to ICA?
Can I treat the signals with different SNRs as linear mixtures of the noise and the true signal?
According to this article (which describes how the dataset was generated by nst), the above input channels are linear mixtures of noise and the clean signal, and from plotting one can see that this data is clearly non-Gaussian, hence ICA can be used in this case. Please correct me if I am wrong.
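One practical detail, as a sketch (assuming ica_input is the stacked array above): scikit-learn's FastICA expects data of shape (n_samples, n_features), i.e. one column per observed mixture, so the stacked rows need to be transposed before fitting.

from sklearn.decomposition import FastICA

ica = FastICA(n_components=5, random_state=0)
sources = ica.fit_transform(ica_input.T)   # shape (n_samples, 5): one estimated source per column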

Confused about sklearn’s implementation of OSVM

I have recently started experimenting with OneClassSVM (using sklearn) for unsupervised learning, and I followed
this example.
I apologize for the silly questions, but I'm a bit confused about two things:
Should I train my SVM on both the regular examples and the outliers, or on regular examples only?
Which of the labels predicted by the OSVM represents outliers: 1 or -1?
Once again I apologize for these questions, but for some reason I cannot find this documented anywhere.
Since the example you reference is about novelty detection, the docs say:
novelty detection:
The training data is not polluted by outliers, and we are interested in detecting anomalies in new observations.
Meaning: you should train on regular examples only.
The approach is based on:
Schölkopf, Bernhard, et al. "Estimating the support of a high-dimensional distribution." Neural computation 13.7 (2001): 1443-1471.
Extract:
Suppose you are given some data set drawn from an underlying probability distribution P and you want to estimate a “simple” subset S of input space such that the probability that a test point drawn from P lies outside of S equals some a priori specified value between 0 and 1.
We propose a method to approach this problem by trying to estimate a function f that is positive on S and negative on the complement.
The above docs also say:
Inliers are labeled 1, while outliers are labeled -1.
This can also be seen in your example code, extracted:
# Generate some regular novel observations
X = 0.3 * np.random.randn(20, 2)
X_test = np.r_[X + 2, X - 2]
...
# all regular = inliers (defined above)
y_pred_test = clf.predict(X_test)
...
# -1 = outlier: each -1 counts as an error, since X_test is assumed to be inliers
n_error_test = y_pred_test[y_pred_test == -1].size
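Putting both answers together in a minimal, self-contained sketch (synthetic data; nu and gamma are arbitrary illustrative choices):

import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X_train = 0.3 * rng.randn(100, 2)                  # train on regular examples only
clf = OneClassSVM(nu=0.1, kernel='rbf', gamma=0.1).fit(X_train)

X_new = np.array([[0.0, 0.0],                      # near the training data
                  [4.0, 4.0]])                     # far from the training data
print(clf.predict(X_new))                          # expected [ 1 -1 ]: 1 = inlier, -1 = outlier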

Why add the mean back in the reconstruction in PCA?

Suppose that X is our dataset (not yet centered) and X_cent is our centered dataset (X_cent = X - mean(X)).
If we are doing the PCA projection as Z_cent = F*X_cent, where F is the matrix of principal components, it is pretty obvious that we need to add mean(X) back after the reconstruction from Z_cent.
But what if we do the PCA projection as Z = F*X? In this case we don't need to add mean(X) after the reconstruction, but it gives a different result.
I think something is wrong with this procedure (projection/reconstruction) when it is applied to non-centered data (X in our case). Can anyone explain how it works? Why can't we do the projection/reconstruction phase without this subtracting/adding of the mean?
Thank you in advance.
If you retain all principal components, then reconstruction of the centered and non-centered vectors as described in your question would be identical. The problem (as indicated in your comments) is that you are only retaining K principal components. When you drop PCs, you lose information, so the reconstruction will contain errors. Since you don't have to reconstruct the mean in one of the reconstructions, you don't introduce errors with respect to the mean there, so the reconstruction errors of the two versions will differ.
Reconstruction with fewer than all PCs isn't quite as simple as multiplying by the transpose of the eigenvector matrix (F'), because you would need to pad your transformed data with zeros; to keep things simple, I'll ignore that here. Your two reconstructions look like this:
R1 = F'*F*X
R2 = F'*F*X_cent + X_mean
= F'*F*(X - X_mean) + X_mean
= F'*F*X - F'*F*X_mean + X_mean
Since the reconstruction is lossy, in general F'*F*Y != Y for a matrix Y. If you retained all PCs, you would have R1 - R2 = 0. But since you are only retaining a subset of the PCs, your two reconstructions will differ by
R2 - R1 = X_mean - F'*F*X_mean
Your follow-up question in the comments, regarding why it's better to reconstruct X_cent instead of X, is a bit more nuanced and really depends on why you are doing PCA in the first place. The most fundamental reason is that the PCs are defined with respect to the mean, so by not centering the data prior to transforming/rotating, you aren't really decorrelating the features. Another reason is that the numeric values of the transformed data are often orders of magnitude smaller when you center the data first.
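A quick numeric sketch of the difference (plain numpy; the data and K = 2 are made up for illustration; here data points are rows, so the reconstruction is X @ F.T @ F rather than F'*F*X):

import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(200, 5) + 10.0                 # non-centered data
X_mean = X.mean(axis=0)
X_cent = X - X_mean

_, _, Vt = np.linalg.svd(X_cent, full_matrices=False)
F = Vt[:2]                                   # keep K = 2 principal components (as rows)

R1 = X @ F.T @ F                             # reconstruct without centering
R2 = X_cent @ F.T @ F + X_mean               # center, reconstruct, add the mean back

# R2 - R1 equals X_mean - X_mean @ F.T @ F; R2 typically has the smaller
# error because the mean is reproduced exactly
print(np.abs(X - R1).mean(), np.abs(X - R2).mean())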

Are there any machine learning regression algorithms that can train on ordinal data?

I have a function f(x): R^n --> R (sorry, is there a way to do LaTeX here?), and I want to build a machine learning algorithm that estimates f(x) for any input point x, based on a bunch of sample x's in a training data set. If I knew the value of f(x) for every x in the training data, this would be simple: just do a regression, or take the weighted average of nearby points, or whatever.
However, this isn't what my training data looks like. Rather, I have a bunch of pairs of points (x, y), and I know the value of f(x) - f(y) for each pair, but I don't know the absolute value of f(x) for any particular x. It seems like there ought to be a way to use this data to find an approximation of f(x), but I haven't found anything after some Googling; there are papers like this one, but they seem to assume that the training data comes as a set of discrete labels for each entity, rather than as labels over pairs of entities.
This is just me making something up, but could I try kernel density estimation over f'(x), and then integrate to get f(x)? Or is that crazy, or is there a known better technique?
You could assume that f is linear, which would simplify things - if f is linear we know that:
f(x-y) = f(x) - f(y)
For example, suppose you assume f(x) = <w, x>, making w the parameter you want to learn. What does the squared loss look like for a sample pair (x, y) with known difference d?

loss((x,y), d) = (f(x) - f(y) - d)^2
               = (<w,x> - <w,y> - d)^2
               = (<w, x-y> - d)^2
               = (<w, z> - d)^2    // where z := x - y

This is simply the squared loss for the input z = x - y.
Practically, you would construct z = x - y for each pair and then learn f using linear regression over inputs z and outputs d.
This model might be too weak for your needs, but it's probably the first thing you should try. Otherwise, as soon as you step away from the linearity assumption, you would likely arrive at a difficult non-convex optimization problem.
I don't see a way to get absolute results. Any constant in your function (f(x) = g(x) + c) will disappear, in the same way constants disappear in an integral.
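A short sketch of the pairwise trick (synthetic data; every name is made up for illustration). Note fit_intercept=False: as pointed out above, any constant offset in f cancels in f(x) - f(y) and cannot be recovered from the pairs.

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
w_true = rng.randn(4)                           # hidden linear function f(x) = <w_true, x>
X = rng.randn(300, 4)
Y = rng.randn(300, 4)
d = (X - Y) @ w_true + 0.01 * rng.randn(300)    # observed differences f(x) - f(y), plus noise

Z = X - Y                                       # one feature vector per pair
reg = LinearRegression(fit_intercept=False).fit(Z, d)
print(np.allclose(reg.coef_, w_true, atol=0.01))   # True: w recovered up to noise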

Effect of Standardization in Linear Regression: Machine Learning

As part of my assignment, I am working on a couple of datasets and finding their training errors with linear regression. I was wondering whether standardization has any effect on the training error or not. My correlation and RMSE come out equal for the datasets before and after standardization.
Thanks,
It is easy to show that for linear regression it does not matter if you just transform the input data through scaling by a; the same applies to translation, meaning that any transformation of the form X' = aX + b, for real a != 0 and any real b, has the same property (for the translation part this assumes the model includes an intercept term). With X' = aX:

w  = (X^T X)^-1 X^T y
w' = ((aX)^T (aX))^-1 (aX)^T y
   = (1/a^2) (X^T X)^-1 a X^T y
   = (1/a) w

Thus the predictions are unchanged:

X' w' = (aX) (1/a) w = X w

Consequently, the predictions on which the error is computed are exactly the same before and after scaling, so any type of loss function (that does not depend on X itself) yields exactly the same results.
However, if you scale the output variable, then the errors will change. Furthermore, if you standardize your dataset in a more complex way than just multiplying by a number (for example, by whitening, or by nearly any rotation), then your results will depend on the preprocessing. And if you use regularized linear regression (ridge regression), then even scaling the input data by a constant matters (as it changes the "meaning" of the regularization parameter).
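A quick sketch checking both claims (synthetic data; the scale factor and alpha are arbitrary illustrative choices):

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.RandomState(0)
X = rng.randn(100, 3)
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.randn(100)
X_scaled = 10.0 * X

# plain least squares: identical predictions, hence identical RMSE
ols_a = LinearRegression().fit(X, y).predict(X)
ols_b = LinearRegression().fit(X_scaled, y).predict(X_scaled)
print(np.allclose(ols_a, ols_b))               # True

# ridge: scaling the inputs changes the effective regularization strength
ridge_a = Ridge(alpha=1.0).fit(X, y).predict(X)
ridge_b = Ridge(alpha=1.0).fit(X_scaled, y).predict(X_scaled)
print(np.allclose(ridge_a, ridge_b))           # False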
