K-means clusters are not spherical - machine-learning

I'm a beginner in data science and I need your help.
I'm trying out unsupervised machine learning with K-means, but I found that the resulting clusters are not spherical. I normalized the data, removed the outliers, etc.
I tried several ways to correct this, but none of them worked.
Here are pictures:
(I took a small sample of the dataset to show you; it actually has 8000 rows.)
...

I used PCA to reduce the 6 dimensions to 2:
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
# df is the 6-column DataFrame described above
pca = PCA(n_components=2)
principalComponents = pca.fit_transform(df)
principalDf = pd.DataFrame(data=principalComponents, columns=['principal component 1', 'principal component 2'])
principalDf.head(5)
It separates the data perfectly.
Output:

Your data has 6 dimensions. You can't visualize data with more than 2 (or at most 3) dimensions in a straightforward manner, so you need a projection such as PCA or t-SNE to visualize it.
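A minimal sketch of that workflow, assuming synthetic 6-dimensional data in place of the asker's dataset: the clustering is done in the full space, and PCA is used only to get 2-D coordinates for a plot. The variable names and the choice of two clusters are illustrative assumptions. The retained-variance figure is a reminder that cluster shapes in the projection need not match the shapes in the original space.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 6-column dataset (an assumption, not the asker's data).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(200, 6)),
    rng.normal(loc=5.0, scale=1.0, size=(200, 6)),
])

# Standardize, then cluster in the full 6-dimensional space.
X_scaled = StandardScaler().fit_transform(X)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_scaled)

# PCA is used only to obtain 2-D coordinates for plotting.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

# Fraction of the total variance that the 2-D picture retains.
retained = float(pca.explained_variance_ratio_.sum())
print(f"variance retained by the 2-D projection: {retained:.2f}")
```

Plotting X_2d coloured by labels then shows the clusters; if the retained fraction is low, the 2-D shapes should be read with caution.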

Related

Derive the right k in k-means clustering (including k = 1) in pyspark

I want to check whether clustering would be helpful for my coordinates.
I'm dealing with trajectories and want to check whether they all start in the same area (the trajectories themselves are different). The aim is to characterise the most frequent departure points.
However, sometimes there is no need for clustering. I'm using K-means here. I had thought of using the silhouette score, but I don't see how it can be mathematically valid when there is only one cluster. DBSCAN would not be a good fit, because the density differs between the clusters I want to build.
Do you have an idea for a check between k=1 and k=3 that tells me which split is best for my data? My data are coordinates (latitude/longitude), where the starting point is not 100% fixed but can vary within 2 km of a kind of barycentre.
Simple extract with k=2:
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator
from pyspark.ml.feature import VectorAssembler
# Assemble the latitude/longitude columns into a single feature vector.
vecAssembler = VectorAssembler(inputCols=["lat", "lon"], outputCol="features")
df1 = vecAssembler.transform(df)
# Train a k-means model with k=2.
kmeans = KMeans().setK(2).setSeed(1)
model = kmeans.fit(df1.select('features'))
# Make predictions.
transformed = model.transform(df1)
# Evaluate the clustering with the silhouette score.
evaluator = ClusteringEvaluator(predictionCol='prediction', featuresCol='features',
                                metricName='silhouette', distanceMeasure='squaredEuclidean')
evaluator.evaluate(transformed)
Is there a way to compute the k=1 case in PySpark, in order to derive the elbow or gap statistic?
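One observation that may help: for k=1 the only centroid is the mean of all points, so the within-cluster sum of squares needs no k-means fit at all. The sketch below illustrates this with NumPy and scikit-learn rather than PySpark (a deliberate swap to keep it self-contained), on made-up coordinates; the same quantity could presumably be obtained in PySpark by aggregating squared distances to the column means.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic departure points around one barycentre (an assumption for illustration).
rng = np.random.default_rng(1)
points = rng.normal(loc=[48.85, 2.35], scale=0.01, size=(300, 2))

# For k=1 the single centroid is the mean, so the cost is the total sum of squares.
centroid = points.mean(axis=0)
wcss = {1: float(((points - centroid) ** 2).sum())}

# For k >= 2, fit k-means and read its within-cluster sum of squares.
for k in (2, 3):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(points)
    wcss[k] = float(model.inertia_)

for k, cost in wcss.items():
    print(k, cost)
```

The resulting costs for k=1, 2, 3 are the inputs an elbow plot or a gap statistic would need. Note that plain Euclidean distance on latitude/longitude is only an approximation of distance on the ground.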

Can I visualize the output values of my linear regression model if I have 3 predictor variables and 1 target variable?

I am trying to understand whether I can visualize a 4-dimensional graph by breaking it down into lower dimensions.
For example, when we have a 2-D plane as a prediction for a 3-D graph, we can just choose a 2-D graph that shows our prediction as a line. Can I do the same for a 4-D graph? If yes, then how?
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn import linear_model
data = pd.read_csv('housing.csv')
data = data[:50]  # take just the first 50 rows of the CSV file
model = linear_model.LinearRegression() #loading the model from the library
model.fit(data[['median_income','total_rooms','households']],data.median_house_value)
# Pls add code here for visualizations
Actually, you can do one fun thing: since your object is a function from R^3 to R, you could, in principle, take your input space as a 3D cube (I am guessing your data is somewhat bounded) and then use colour to encode the prediction. This way you get a 3D coloured point cloud. You will probably need transparency to see through it, plus some interactive investigation to rotate and move around, but 4D is the highest "visualisable" dimension (as long as one dimension is "special" and can thus be encoded as a colour).
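The idea can be sketched as follows, assuming synthetic data in place of housing.csv; the coefficients, grid size, and axis labels are illustrative assumptions only. The three predictors become the spatial axes and the predicted target becomes the colour.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for three predictors and one target (an assumption).
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(50, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.05, size=50)

model = LinearRegression().fit(X, y)

# Evaluate the model on a coarse grid covering the input cube.
axis = np.linspace(0.0, 1.0, 8)
g0, g1, g2 = np.meshgrid(axis, axis, axis)
grid = np.column_stack([g0.ravel(), g1.ravel(), g2.ravel()])
pred = model.predict(grid)

# Three spatial axes for the predictors, colour for the prediction.
fig = plt.figure(figsize=(6, 5))
ax = fig.add_subplot(111, projection="3d")
sc = ax.scatter(grid[:, 0], grid[:, 1], grid[:, 2], c=pred, cmap="viridis", alpha=0.4)
ax.set_xlabel("predictor 1")
ax.set_ylabel("predictor 2")
ax.set_zlabel("predictor 3")
fig.colorbar(sc, ax=ax, label="predicted target")
```

Calling plt.show() in an interactive session lets you rotate the cloud; the low alpha value provides the transparency mentioned above.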

How to plot a multivariate Gaussian distribution with three or more dimensions

In machine learning and pattern recognition, if a sample i has two-dimensional features such as (length, weight), and both length and weight follow a Gaussian distribution, we can describe the sample with a multivariate Gaussian distribution.
It's just a 3D plot that looks like this:
where the z axis is the probability density.
But what if the sample has three or more features, x1, x2, x3, ..., xn? How do we correctly plot that in a single plot?
You can use dimensionality reduction methods to visualize higher dimensional data.
https://scikit-learn.org/stable/auto_examples/manifold/plot_compare_methods.html#sphx-glr-auto-examples-manifold-plot-compare-methods-py
Convert the D-dimensional data into 2- or 3-dimensional data.
Plot the transformed data points in 2 or 3 dimensions, depending on the dimension to which the data was reduced.
Let's consider an example. Take a 10-dimensional Gaussian:
import matplotlib.pyplot as plt
import numpy as np
DIMENSION = 10
mean = np.zeros((DIMENSION,))
cov = np.eye(DIMENSION)
X = np.random.multivariate_normal(mean, cov, 5000)
Then perform dimensionality reduction (I used PCA; you can choose any other method, depending on what you know about how effective the algorithm is for a particular type of data):
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn.decomposition import PCA
X_2d = PCA(n_components=2).fit_transform(X)
X_3d = PCA(n_components=3).fit_transform(X)
Then plot them:
fig = plt.figure(figsize=(12,4))
ax = fig.add_subplot(121, projection='3d')
ax.scatter(X_3d[:,0],X_3d[:,1],X_3d[:,2])
plt.title('3D')
fig.add_subplot(122)
plt.scatter(X_2d[:,0], X_2d[:,1])
plt.title('2D')
You can play with other algorithms as well. Each offers a different kind of advantage.
I hope this answers your question.
Note: In higher dimensions, phenomena like the "curse of dimensionality" also come into play, so an accurate projection into a lower dimension may not be possible. This is similar to why Greenland appears to be about the same size as Africa on some map projections.
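To put a number on how much such a projection loses, one option is to look at the share of variance the kept components explain. The sketch below repeats the 10-dimensional Gaussian example with a seeded generator (an assumption added for reproducibility) and reports that share for 2 and 3 components.

```python
import numpy as np
from sklearn.decomposition import PCA

DIMENSION = 10
rng = np.random.default_rng(0)
X = rng.multivariate_normal(np.zeros(DIMENSION), np.eye(DIMENSION), size=5000)

# Share of the total variance kept by a 2-D and a 3-D PCA projection.
retained_by_dim = {}
for n_components in (2, 3):
    pca = PCA(n_components=n_components).fit(X)
    retained_by_dim[n_components] = float(pca.explained_variance_ratio_.sum())

for n_components, retained in retained_by_dim.items():
    print(n_components, round(retained, 3))
```

Because this Gaussian has the same variance in every direction, each component carries roughly a tenth of the variance, so both plots discard most of the structure.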

How to load unlabelled data for sentiment classification after training SVM model?

I am trying to do sentiment classification with an sklearn SVM model. I trained the model on labeled data and got 89% accuracy. Now I want to use the model to predict the sentiment of unlabeled data. How can I do that? And after classifying the unlabeled data, how can I see whether each item was classified as positive or negative?
I used python 3.7. Below is the code.
import random
import pandas as pd
data = pd.read_csv("label data for testing .csv", header=0)
sentiment_data = list(zip(data['Articles'], data['Sentiment']))
random.shuffle(sentiment_data)
train_x, train_y = zip(*sentiment_data[:350])
test_x, test_y = zip(*sentiment_data[350:])
from nltk import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn import metrics
clf = Pipeline([
    ('vectorizer', CountVectorizer(analyzer="word",
                                   tokenizer=word_tokenize,
                                   preprocessor=lambda text: text.replace("<br />", " "),
                                   max_features=None)),
    ('classifier', LinearSVC())
])
clf.fit(train_x, train_y)
pred_y = clf.predict(test_x)
print("Accuracy : ", metrics.accuracy_score(test_y, pred_y))
print("Precision : ", metrics.precision_score(test_y, pred_y))
print("Recall : ", metrics.recall_score(test_y, pred_y))
When I run this code, I get the output:
ConvergenceWarning: Liblinear failed to converge, increase the number of iterations.
Accuracy : 0.8977272727272727
Precision : 0.8604651162790697
Recall : 0.925
What is the meaning of ConvergenceWarning?
Thanks in Advance!
What is the meaning of ConvergenceWarning?
As Pavel already mentioned, ConvergenceWarning means that max_iter was reached. You can suppress the warning as described here: How to disable ConvergenceWarning using sklearn?
Now I want to use the model to predict the sentiment of unlabeled
data. How can I do that?
You do it with the same command: pred_y = clf.predict(test_x). The only things you change are pred_y (the name is your free choice) and test_x, which should be your new, unseen data. It has to have the same format as test_x and train_x.
In your case, since you are doing:
sentiment_data = list(zip(data['Articles'], data['Sentiment']))
you are forming a list of (article, sentiment) tuples.
Then you shuffle it and unzip the first 350 rows:
train_x, train_y = zip(*sentiment_data[:350])
Here your train_x is the column data['Articles'], so this is all you have to do if you have new data:
new_data = pd.read_csv("new_data.csv", header=0)
new_y = clf.predict(new_data['Articles'])
how to see whether it is classified as positive or negative?
You can then inspect the predictions (pred_y, or new_y above): each entry will be either 1 or 0. Normally 0 should be negative, but it depends on how your dataset is set up.
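Put together, this could look like the sketch below. The training sentences, the label convention (1 = positive, 0 = negative), and the column names "predicted" and "predicted_label" are made up for illustration; the default CountVectorizer tokenizer is used instead of word_tokenize to keep the example self-contained.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Tiny made-up training set (an assumption); 1 = positive, 0 = negative.
train_x = ["great product, loved it", "terrible service, very bad",
           "really good experience", "awful and bad quality"]
train_y = [1, 0, 1, 0]

clf = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("classifier", LinearSVC()),
])
clf.fit(train_x, train_y)

# New, unlabeled articles: raw text only, no 'Sentiment' column.
new_data = pd.DataFrame({"Articles": ["good great experience", "bad awful terrible"]})
new_data["predicted"] = clf.predict(new_data["Articles"])

# Map the numeric labels to readable names (this mapping is an assumption).
new_data["predicted_label"] = new_data["predicted"].map({1: "positive", 0: "negative"})
print(new_data)
```

Keeping the predictions next to the text in one DataFrame makes it easy to read off which article received which label.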
Check out this site about model persistence. Then you just load the model and call its predict method. The model will return the predicted label. If you used any encoder (LabelEncoder, OneHotEncoder), you need to dump and load it separately.
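As a rough sketch of the save-and-reload round trip, the example below serializes a fitted pipeline with Python's pickle module, in memory for brevity; on disk, pickle.dump or joblib.dump would play the same role. The training data is made up. One caveat: pickle cannot serialize lambda functions, so a pipeline like the one in the question (which uses a lambda as preprocessor) would likely need a named function instead.

```python
import pickle
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Tiny made-up training set (an assumption); 1 = positive, 0 = negative.
train_x = ["great product, loved it", "terrible service, very bad",
           "really good experience", "awful and bad quality"]
train_y = [1, 0, 1, 0]

clf = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("classifier", LinearSVC()),
]).fit(train_x, train_y)

# Serialize the whole fitted pipeline (vectorizer and classifier together).
blob = pickle.dumps(clf)

# Later, possibly in another process: restore it and predict.
restored = pickle.loads(blob)
texts = ["good great experience", "bad awful terrible"]
same = list(restored.predict(texts)) == list(clf.predict(texts))
print("restored model agrees with the original:", same)
```

Because the vectorizer lives inside the pipeline, it is saved and restored together with the classifier; only encoders kept outside the pipeline need separate handling.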
If I were you, I'd rather take a fully data-driven approach and use a pretrained embedder. It will also work for dozens of languages out of the box, which is quite neat.
There's LASER from Facebook. There's also a PyPI package, though unofficial. It works just fine.
Nowadays there are a lot of pretrained models, so it shouldn't be that hard to reach near state-of-the-art scores.
Now I want to use the model to predict the sentiment of unlabeled data. How can I do that? and after classification of unlabeled data, how to see whether it is classified as positive or negative?
Basically, you prepare the unlabeled data in the same way as train_x or test_x was generated. Here that is a sequence of n_samples raw text strings, which you pass to clf.predict to obtain predictions. clf.predict outputs the predicted class for each sample. In your case 0 is probably negative and 1 positive, but it's hard to tell without the dataset.
What is the meaning of ConvergenceWarning?
The LinearSVC model is optimized using an iterative algorithm. The max_iter argument (1000 by default) controls the maximum number of iterations. If the stopping criterion isn't met within that many iterations, you get a ConvergenceWarning. It shouldn't bother you much, as long as you have acceptable performance in terms of accuracy or other metrics.
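To see how max_iter relates to the warning, the following sketch fits LinearSVC with a few iteration limits on synthetic data and records whether a ConvergenceWarning was raised. The data, the helper name fit_and_report, and the chosen limits are assumptions for illustration; which limits trigger the warning depends on the data.

```python
import warnings
import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.svm import LinearSVC

# Synthetic two-class data (an assumption, not the sentiment dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

def fit_and_report(max_iter):
    """Fit LinearSVC and return True if a ConvergenceWarning was raised."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        LinearSVC(max_iter=max_iter).fit(X, y)
    return any(issubclass(w.category, ConvergenceWarning) for w in caught)

for max_iter in (5, 1000, 10000):
    print(max_iter, fit_and_report(max_iter))
```

If the warning persists even with a large limit, scaling the features before fitting is a common thing to try.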

Issue with the results of PCA component values

I am performing PCA on a dataset of (28 features + 1 class label) and 11M rows (samples) using the following simple code:
from sklearn.decomposition import PCA
import pandas as pd
df = pd.read_csv('HIGGS.csv', sep=',', header=None)
df_labels = df[df.columns[0]]
df_features = df.drop(df.columns[0], axis=1)
pca = PCA()
pca.fit(df_features.values)
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.shape)
transformed_data = pca.transform(df_features.values)
The pca.explained_variance_ratio_ values (the eigenvalues, normalized to sum to 1) are the following:
[0.11581302 0.09659324 0.08451179 0.07000956 0.0641502 0.05651781
0.055588 0.05446682 0.05291956 0.04468113 0.04248516 0.04108151
0.03885671 0.03775394 0.0255504 0.02181292 0.01979832 0.0185323
0.0164828 0.01047363 0.00779365 0.00702242 0.00586635 0.00531234
0.00300572 0.00135565 0.00109707 0.00046801]
Based on the explained_variance_ratio_, I don't know whether something is wrong here. The largest component explains only 11%, whereas I expected values starting at around 99%. Does this imply that the dataset needs some preprocessing, such as making sure the data are normally distributed?
Dude, 99% for the first component would mean that the axis associated with the largest eigenvalue encodes 99% of the variance in your dataset. It is quite uncommon for any dataset to look like this; if it did, the problem would shrink to a 1-D classification/regression problem.
There is nothing wrong with this output. Retain the first axes that together encode around 80% of the variance and build your model.
Note: The PCA transformation is usually used to reduce the dimensionality of your problem space. Since you have only 28 variables, I recommend abandoning PCA altogether.
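If you do keep PCA, the number of axes needed for roughly 80% of the variance can be read off the cumulative sum of the ratios. The sketch below uses the explained_variance_ratio_ values printed in the question; the 0.80 threshold and the variable names are the only assumptions.

```python
import numpy as np

# explained_variance_ratio_ values copied from the question above.
ratios = np.array([
    0.11581302, 0.09659324, 0.08451179, 0.07000956, 0.0641502, 0.05651781,
    0.055588, 0.05446682, 0.05291956, 0.04468113, 0.04248516, 0.04108151,
    0.03885671, 0.03775394, 0.0255504, 0.02181292, 0.01979832, 0.0185323,
    0.0164828, 0.01047363, 0.00779365, 0.00702242, 0.00586635, 0.00531234,
    0.00300572, 0.00135565, 0.00109707, 0.00046801,
])

cumulative = np.cumsum(ratios)

# Smallest number of leading components whose cumulative ratio reaches the threshold.
threshold = 0.80
n_keep = int(np.searchsorted(cumulative, threshold) + 1)
print(n_keep, round(float(cumulative[n_keep - 1]), 4))
```

scikit-learn can also make this choice itself: passing a float between 0 and 1 as n_components, for example PCA(n_components=0.80), keeps just enough components to reach that share of the variance.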
