I am performing PCA on a dataset of (28 features + 1 class label) and 11M rows (samples) using the following simple code:
from sklearn.decomposition import PCA
import pandas as pd
df = pd.read_csv('HIGGS.csv', sep=',', header=None)
df_labels = df[df.columns[0]]
df_features = df.drop(df.columns[0], axis=1)
pca = PCA()
pca.fit(df_features.values)
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.shape)
transformed_data = pca.transform(df_features.values)
The values of pca.explained_variance_ratio_ (the eigenvalues normalized to sum to 1) are the following:
[0.11581302 0.09659324 0.08451179 0.07000956 0.0641502 0.05651781
0.055588 0.05446682 0.05291956 0.04468113 0.04248516 0.04108151
0.03885671 0.03775394 0.0255504 0.02181292 0.01979832 0.0185323
0.0164828 0.01047363 0.00779365 0.00702242 0.00586635 0.00531234
0.00300572 0.00135565 0.00109707 0.00046801]
Based on explained_variance_ratio_, I can't tell whether something is wrong here. The first component explains only about 11% of the variance, whereas I expected values starting at around 99%. Does this imply that the dataset needs some preprocessing, such as making sure the data are normally distributed?
A ratio of 99% for the first component would mean that the axis associated with the largest eigenvalue encodes 99% of the variance in your dataset. That is quite uncommon for any dataset; if it were the case, the problem would shrink to a 1-D classification/regression problem.
There is nothing wrong with this output. Retain the first axes that together encode around 80% of the variance and build your model.
Note: PCA is usually used to reduce the dimensionality of the problem space. Since you have only 28 variables, I recommend abandoning PCA altogether.
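To make the 80% rule of thumb concrete, here is a minimal sketch that applies it to the ratios printed in the question. The 0.80 threshold comes from the answer above; the variable names are only illustrative.

```python
import numpy as np

# explained_variance_ratio_ values copied from the question
ratios = np.array([
    0.11581302, 0.09659324, 0.08451179, 0.07000956, 0.0641502, 0.05651781,
    0.055588, 0.05446682, 0.05291956, 0.04468113, 0.04248516, 0.04108151,
    0.03885671, 0.03775394, 0.0255504, 0.02181292, 0.01979832, 0.0185323,
    0.0164828, 0.01047363, 0.00779365, 0.00702242, 0.00586635, 0.00531234,
    0.00300572, 0.00135565, 0.00109707, 0.00046801,
])

threshold = 0.80  # rule-of-thumb target, not a hard requirement
cumulative = np.cumsum(ratios)

# Position of the first component at which the running total reaches the threshold
n_components = int(np.searchsorted(cumulative, threshold) + 1)

print(n_components)                  # 13
print(cumulative[n_components - 1])  # about 0.818
```

scikit-learn can make the same choice at fit time: PCA(n_components=0.80, svd_solver="full") keeps the smallest number of components whose cumulative ratio exceeds 0.80.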
I want to check whether clustering would be helpful on my coordinates.
I'm dealing with trajectories and want to check whether all of them start in the same area (the trajectories themselves are different). The aim is to characterise the most frequent departure points.
However, sometimes there is no need for clustering. I'm using K-means here. I had thought of using the Silhouette Score, but I don't see how it can be mathematically valid when there is only one cluster. DBSCAN would not be a good fit, as the densities are not similar across the clusters I want to build.
Do you have an idea for a check that picks the best split for my data between k=1 and k=3? I'm dealing with coordinates (latitude/longitude) where the starting point is not 100% fixed but can vary within 2 km of a kind of barycentre.
Simple extract with k=2:
from pyspark.ml.feature import VectorAssembler
vecAssembler = VectorAssembler(inputCols=["lat", "lon"], outputCol="features")
df1= vecAssembler.transform(df)
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator
# Loads data.
# Trains a k-means model.
kmeans = KMeans().setK(2).setSeed(1)
model = kmeans.fit(df1.select('features'))
# Make predictions
transformed = model.transform(df1)
evaluator = ClusteringEvaluator(predictionCol='prediction', featuresCol='features', \
metricName='silhouette', distanceMeasure='squaredEuclidean')
evaluator.evaluate(transformed)
Is there a way to handle the k=1 case in PySpark, in order to derive the elbow or gap statistic?
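One way to get a k=1 point for an elbow curve without fitting a model: with a single cluster, the centroid is simply the mean of all points, so the within-cluster sum of squares can be computed directly. The sketch below uses NumPy on a few made-up coordinates (hypothetical data, not from the question); on a Spark DataFrame the same quantity could be obtained by aggregating the mean and then the squared deviations. It treats latitude/longitude as plain Euclidean coordinates, which is the same simplification the squaredEuclidean evaluator in the question makes.

```python
import numpy as np

def wcss_single_cluster(points):
    """Within-cluster sum of squares when all points form one cluster (k=1)."""
    points = np.asarray(points, dtype=float)
    centroid = points.mean(axis=0)  # the only centroid K-means could converge to
    return float(((points - centroid) ** 2).sum())

# Hypothetical (lat, lon) departure points, for illustration only
coords = np.array([
    [48.85, 2.35],
    [48.86, 2.36],
    [48.84, 2.34],
])

wcss_k1 = wcss_single_cluster(coords)
print(wcss_k1)
```

This value can then sit next to the costs of the k=2 and k=3 models on the elbow plot. As far as I know, recent PySpark versions expose that cost as model.summary.trainingCost; treat that attribute name as an assumption and check it against your Spark version.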
In machine learning and pattern recognition, if a sample i has a two-dimensional feature vector such as (length, weight), and both length and weight follow a Gaussian distribution, we can describe it with a multivariate Gaussian distribution.
It is just a 3D plot that looks like this:
where the z axis is the probability density.
But what if the sample has three or more features, x1, x2, x3, ..., xn? How do we correctly show it in a single plot?
You can use dimensionality reduction methods to visualize higher dimensional data.
https://scikit-learn.org/stable/auto_examples/manifold/plot_compare_methods.html#sphx-glr-auto-examples-manifold-plot-compare-methods-py
Convert the D-dimensional data into 2- or 3-dimensional data.
Plot the transformed data points in 2 or 3 dimensions, depending on the dimension to which the data was reduced.
Let's consider an example. Take a 10-dimensional Gaussian:
import matplotlib.pyplot as plt
import numpy as np
DIMENSION = 10
mean = np.zeros((DIMENSION,))
cov = np.eye(DIMENSION)
X = np.random.multivariate_normal(mean, cov, 5000)
Then perform dimensionality reduction (I used PCA; you can choose any other method, depending on prior knowledge of how effective the algorithm is for a particular type of data):
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn.decomposition import PCA
X_2d = PCA(n_components=2).fit_transform(X)
X_3d = PCA(n_components=3).fit_transform(X)
Then plot them:
fig = plt.figure(figsize=(12,4))
ax = fig.add_subplot(121, projection='3d')
ax.scatter(X_3d[:,0],X_3d[:,1],X_3d[:,2])
plt.title('3D')
fig.add_subplot(122)
plt.scatter(X_2d[:,0], X_2d[:,1])
plt.title('2D')
You can play with other algorithms as well. Each offers a different kind of advantage.
I hope this answers your question.
Note: In higher dimensions, phenomena like the "curse of dimensionality" also come into play, so an accurate projection into lower dimensions may not be possible. It is similar to why Greenland appears to be about the same size as Africa on some map projections.
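To gauge how much such a projection loses, one can check the share of the total variance that the kept components retain. Below is a minimal sketch for the same kind of 10-dimensional standard Gaussian, using a plain NumPy SVD instead of scikit-learn's PCA (the seed and variable names are my own choices).

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
DIMENSION = 10
X = rng.multivariate_normal(np.zeros(DIMENSION), np.eye(DIMENSION), size=5000)

# PCA via SVD of the centered data: the squared singular values are
# proportional to the variance along each principal axis
X_centered = X - X.mean(axis=0)
singular_values = np.linalg.svd(X_centered, compute_uv=False)
variance_ratio = singular_values**2 / np.sum(singular_values**2)

retained_2d = variance_ratio[:2].sum()
retained_3d = variance_ratio[:3].sum()
print(f"2-D projection keeps {retained_2d:.1%} of the variance")
print(f"3-D projection keeps {retained_3d:.1%} of the variance")
```

For an isotropic Gaussian every direction carries roughly the same variance, so two components keep only a little over 20% and three a little over 30%. The scatter plots show a round blob because most of the spread is simply not visible in the projection.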
I'm a beginner in data science and I need your help.
I'm trying out unsupervised machine learning with K-means,
but I found that the resulting clusters are not spherical. I normalized the data, removed the outliers, etc.
I tried several ways to correct it, but none of them worked.
Here are pictures:
(I took a small sample of the dataset to show you; it actually has 8000 rows)
...
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
pca = PCA(n_components=2)
principalComponents = pca.fit_transform(df)
principalDf = pd.DataFrame(data = principalComponents, columns = ['principal component 1', 'principal component 2'])
principalDf.head(5)
I used PCA to reduce the 6 dimensions to 2:
It separates the data perfectly.
Output:
Your data have 6 dimensions. You can't directly visualize data with more than 2 (or at most 3) dimensions; you need a method such as PCA or t-SNE to project them first.
from sklearn.model_selection import GridSearchCV
from sklearn import svm
params_svm = {
'kernel' : ['linear','rbf','poly'],
'C' : [0.1,0.5,1,10,100],
'gamma' : [0.001,0.01,0.1,1,10]
}
svm_clf = svm.SVC()
estimator_svm = GridSearchCV(svm_clf,param_grid=params_svm,cv=4,verbose=1,scoring='accuracy')
estimator_svm.fit(data,labels)
print(estimator_svm.best_params_)
estimator_svm.best_score_
# data.shape is (891, 9)
# labels.shape is (891,); data is a numeric 2-D array and labels a numeric 1-D array
When I use GridSearchCV with only the rbf kernel, it returns the best parameter combination in just 2.7 seconds!
But when the kernel list includes 'poly' or 'linear', either on their own or together with 'rbf', it takes far too long: there is no output even after 15-20 minutes, which suggests I am doing something wrong. I am new to (supervised) machine learning and cannot find any bug in the code. I don't understand what is going wrong behind the scenes.
Can anyone explain what I am doing wrong?
No, you are not doing anything wrong as far as your code goes. Several factors come into play here:
SVC is a complex classifier that requires computing a distance between each pair of points in the dataset.
The complexity also varies with the kernel. I am not sure, but I think it is O(n_samples^2 * n_features) for the rbf kernel and O(n_samples * n_features) for the linear kernel. So just because the rbf kernel finishes in a given time, it does not follow that the linear kernel will finish in a similar time.
Also, the time taken depends drastically on the dataset and the patterns present in it. For example, an rbf kernel may converge quickly with, say, C = 0.5, while a polynomial kernel may take drastically longer to converge for the same value of C.
Also, without the cache the running time increases a lot. In this answer, the author mentions it might increase to O(n_samples^3 * n_features).
Here is the official documentation from sklearn about SVM complexity. See this section about practical tips on using SVM as well.
You can set verbose to True to see the progress of your classifier and how it is trained.
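Part of the cost is simply the number of fits. The following sketch counts them for the grid in the question using only the standard library, and shows one possible way to trim it: gamma is only used by the 'rbf', 'poly' and 'sigmoid' kernels, so repeating the linear kernel for every gamma value adds fits that cannot change the result. The split grid is my own suggestion, not something from the answer above; GridSearchCV accepts such a list of dictionaries as param_grid.

```python
from itertools import product

params_svm = {
    "kernel": ["linear", "rbf", "poly"],
    "C": [0.1, 0.5, 1, 10, 100],
    "gamma": [0.001, 0.01, 0.1, 1, 10],
}
cv = 4

def count_candidates(grid):
    """Number of parameter combinations in one grid dictionary."""
    return len(list(product(*grid.values())))

n_candidates = count_candidates(params_svm)  # 3 * 5 * 5 = 75
n_fits = n_candidates * cv                   # 75 * 4 = 300, plus one final refit

# Split grid: the linear kernel ignores gamma, so search it over C only
split_grid = [
    {"kernel": ["linear"], "C": params_svm["C"]},
    {"kernel": ["rbf", "poly"], "C": params_svm["C"], "gamma": params_svm["gamma"]},
]
n_candidates_split = sum(count_candidates(g) for g in split_grid)  # 5 + 50 = 55

print(n_candidates, n_fits, n_candidates_split)  # 75 300 55
```

Passing split_grid as param_grid removes 20 of the 75 candidates. It does not make an individual slow fit any faster, so scaling the features, as the practical tips in the scikit-learn documentation advise, is still worth trying.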
References
GridSearchCV goes to endless execution using SVC
Computational complexity of SVM
Official Documentation of SVM for scikit-learn
I have a dataset X whose shape is (1741, 61). Using logistic regression with cross_validation I was getting around 62-65% for each split (cv =5).
I thought that making the data quadratic would increase the accuracy. However, I'm getting the opposite effect (each cross-validation split scores in the 40s, percentage-wise). So I presume I'm doing something wrong when making the data quadratic?
Here is the code I'm using:
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score  # sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn.preprocessing import PolynomialFeatures
X_scaled = preprocessing.scale(X)
poly = PolynomialFeatures(3)  # degree 3, i.e. cubic rather than quadratic terms
poly_x = poly.fit_transform(X_scaled)
classifier = LogisticRegression(penalty='l2', max_iter=200)
cross_val_score(classifier, poly_x, y, cv=5)
array([ 0.46418338, 0.4269341 , 0.49425287, 0.58908046, 0.60518732])
This makes me suspect I'm doing something wrong.
I tried transforming the raw data into quadratic features first and then scaling with preprocessing.scale, but that produced a warning:
UserWarning: Numerical issues were encountered when centering the data and might not be solved. Dataset may contain too large values. You may need to prescale your features.
warnings.warn("Numerical issues were encountered "
So I didn't bother going this route.
The other thing that's bothering me is the speed of the quadratic computations. cross_val_score takes a couple of hours to output the scores when using polynomial features. Is there any way to speed this up? I have an Intel i5-6500 CPU with 16 GB of RAM, running Windows 7.
Thank you.
Have you tried using MinMaxScaler instead of preprocessing.scale? Standard scaling outputs values both above and below 0, so values scaled to -0.1 and 0.1 will have the same squared value despite not really being similar at all. Intuitively, this seems like something that would lower the score of a polynomial fit. That said, I haven't tested this; it's just my intuition. Furthermore, be careful with polynomial fits. I suggest reading this answer to "Why use regularization in polynomial regression instead of lowering the degree?". It's a great explanation and will likely introduce you to some new techniques. As an aside, @MatthewDrury is an excellent teacher and I recommend reading all of his answers and blog posts.
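The arithmetic behind that intuition can be sketched in a few lines. This only illustrates the sign-folding effect on some made-up values; it does not show that switching scalers improves the cross-validation score, which remains untested here as well.

```python
import numpy as np

x = np.array([-2.0, -0.1, 0.1, 2.0])  # illustrative values, already centered

# Squaring centered values folds negative and positive values together
squared_centered = x**2  # -0.1 and 0.1 both become 0.01

# Min-max scaling to [0, 1] first keeps the original ordering after squaring
x_minmax = (x - x.min()) / (x.max() - x.min())
squared_minmax = x_minmax**2

print(squared_centered)
print(squared_minmax)
```

Note that PolynomialFeatures also keeps the original (degree-1) columns, so the sign information is not lost from the feature matrix as a whole, only from the squared terms.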
The question states that "the accuracy is supposed to increase" with polynomial features. That is true if the polynomial features bring the model closer to the original data-generating process. Polynomial features, especially when every feature is crossed with every other and raised to powers, may move the model further from the data-generating process; hence worse results can be expected.
By using a degree-3 polynomial in scikit-learn, the X matrix went from (1741, 61) to (1741, 41664), which is far more columns than rows.
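The column count follows from a standard counting formula: with its default settings (bias column included, no interaction-only restriction), PolynomialFeatures produces C(n + d, d) columns for n input features and degree d. A quick check with the standard library (the helper name is my own):

```python
from math import comb

def n_polynomial_features(n_features, degree):
    """Count of all monomials up to `degree` in `n_features` variables, bias column included."""
    return comb(n_features + degree, degree)

print(n_polynomial_features(61, 3))  # 41664
print(n_polynomial_features(61, 2))  # 1953
```

Dropping from degree 3 to degree 2 would therefore shrink the matrix from 41664 to 1953 columns, which is still more columns than the 1741 rows but far cheaper to fit.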
41k+ columns will take longer to solve. You should be looking at feature selection methods. As Grr says, investigate lowering the polynomial degree. Try L1 regularization, group lasso, RFE, or Bayesian methods. Consult SMEs (subject matter experts), who may be able to identify specific features that are likely polynomial. Plot the data to see which features may interact or be best modeled as polynomial terms.
I have not looked at it for a while, but I recall discussions on hierarchically well-formulated models (can you remove x1 but keep the x1 * x2 interaction?). That is probably worth investigating if your model behaves best with a hierarchically ill-formulated model.