Path clustering or segmentation

I want to categorize customer paths and show each category as a percentage of customers.
The (x, y) data comes from a camera and is tagged with a customer id,
for example:
(x,y,customerid)
(10,10,1)
(10,12,1)
(11,13,1)
...
My first thought was that the best way is to use a clustering method such as k-means,
but there are some problems:
the proper number of clusters is ambiguous;
clustering must be based on the start point and the associated path (not on individual points);
a regression line (or any related method) must pass through the nth clustered point on the way from the start point to the end point.
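One way to act on the "cluster whole paths, not points" idea (a rough sketch, not the poster's method) is to group the points by customer id into ordered paths, resample each path to a fixed number of points, and cluster those fixed-length vectors with k-means. The paths dictionary below is hypothetical example data, and the number of clusters still has to be chosen (for example with a silhouette sweep):

import numpy as np
from sklearn.cluster import KMeans

def resample_path(xy, n_points=20):
    # Resample a variable-length (x, y) path to n_points spaced evenly along its arc length.
    xy = np.asarray(xy, dtype=float)
    arc = np.r_[0, np.cumsum(np.linalg.norm(np.diff(xy, axis=0), axis=1))]
    t = np.linspace(0, arc[-1], n_points)
    return np.column_stack([np.interp(t, arc, xy[:, i]) for i in range(2)])

# Hypothetical input: customer id -> list of (x, y) points in time order
paths = {1: [(10, 10), (10, 12), (11, 13), (13, 15)],
         2: [(0, 0), (2, 1), (4, 3), (6, 4)]}

X = np.array([resample_path(p).ravel() for p in paths.values()])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Share of customers per path category, in percent
percentages = {int(c): 100 * float(np.mean(labels == c)) for c in np.unique(labels)}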

Related

Clustering suggestions

I have an unlabelled dataset with 6 attributes (all numeric) and 100k data points. I want to cluster similar data points.

As part of preprocessing:
I removed attributes that are highly correlated (> 0.8).
I standardized the data (StandardScaler).
from umap import UMAP
import hdbscan

# To reduce the data to lower dimensions I used UMAP
reducer = UMAP(n_neighbors=20,
               min_dist=0,
               spread=2,
               n_components=3,
               metric='euclidean')
df_umap = reducer.fit_transform(df_scaled1)

# For clustering I used HDBSCAN
clusterer = hdbscan.HDBSCAN(min_cluster_size=30, max_cluster_size=100, prediction_data=True)
clusterer.fit(df_umap)

# Assign cluster labels back to the original dataset
df['cluster'] = clusterer.labels_
Data shape: (130351, 6)

Column a    Column b    Column c    Column d          Column e     Column f
6.000194    7.0         1059216     353069.000000     26.863543    15.891751
3.001162    3.5         1303727     396995.666667     32.508957    11.215764
6.000019    7.0         25887       3379.000000       18.004558    10.993119
6.000208    7.0         201138      59076.666667      41.140104    10.972880
6.000079    7.0         59600       4509.666667       37.469000    9.667119
df.describe(): (output not shown)
Results:
1. While some of the clusters contain very similar data points (example: cluster 1555), many of them have extreme data points associated with a single cluster (example: cluster 5423).
2. Cluster id '-1' has 36221 data points associated with it.
My questions:
Am I using the correct approach for the data I have and the result I am trying to achieve?
Is UMAP the correct choice for dimensionality reduction?
Is HDBSCAN the right choice for this clustering problem? (I chose HDBSCAN because it doesn't need the number of clusters as user input, and the maximum and minimum number of data points per cluster can be set beforehand.)
How do I tune the clustering model to achieve better cluster quality? (I am assuming that with better cluster quality the points currently assigned to cluster '-1' will also get clustered.)
Is there any method to assess cluster quality? (A rough check is sketched below.)
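On the last question, one rough way to quantify cluster quality, assuming the df_umap embedding and clusterer fitted above (a silhouette score on a UMAP embedding is only an approximate check):

import numpy as np
from sklearn.metrics import silhouette_score

# Silhouette on the reduced embedding, ignoring noise points (label -1)
mask = clusterer.labels_ != -1
print(silhouette_score(df_umap[mask], clusterer.labels_[mask]))

# HDBSCAN also exposes a DBCV-style score if fitted with gen_min_span_tree=True:
# clusterer = hdbscan.HDBSCAN(min_cluster_size=30, gen_min_span_tree=True).fit(df_umap)
# print(clusterer.relative_validity_)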

using CURE algorithm with pyspark

I'm trying to implement the CURE algorithm on my dataset (.csv).
I've computed the silhouette score and done the clustering. Now I have to pick k representative points in each cluster, and I have no idea how to even go about doing this. (The idea is to pick those representative points in each cluster and then move them a fraction closer to the centroid of that cluster. Once that is done, each data point is processed again and assigned to the closest cluster, but if I can figure out how to pick the points in a cluster I can do the rest.)
This is my code so far (I did not paste the starting code, e.g. the data loading and mapping parts):
from pyspark.ml.clustering import BisectingKMeans
from pyspark.ml.evaluation import ClusteringEvaluator

silhouette_scores = []
my_cleaned_data = assemble.transform(passed_data_im_using)
evaluator = ClusteringEvaluator(featuresCol='my_features', metricName='silhouette')
# Silhouette needs at least two clusters, so start the sweep at K = 2
for K in range(2, 10):
    bisecting_k_means_ = BisectingKMeans(featuresCol='my_features', k=K, minDivisibleClusterSize=1)
    bisecting_k_means_fit = bisecting_k_means_.fit(my_cleaned_data)
    bisecting_k_means_transform = bisecting_k_means_fit.transform(my_cleaned_data)
    evaluation_score = evaluator.evaluate(bisecting_k_means_transform)
    silhouette_scores.append(evaluation_score)

# clustering
bisecting_k_means_ = BisectingKMeans(featuresCol='my_features', k=3)
bisecting_k_means_Model = bisecting_k_means_.fit(my_cleaned_data)
# this gives the cluster centers
bisecting_k_means_Model.clusterCenters()

# Pick points in each cluster
# No idea how to do this part
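For the representative-point step, a minimal NumPy sketch of the CURE idea described above (farthest-first selection of scattered points, then shrinking them toward the centroid) could look like this. It operates on one cluster's points collected into an array; cluster_points is hypothetical example data:

import numpy as np

def cure_representatives(points, n_rep=4, shrink=0.2):
    # Pick n_rep well-scattered points from one cluster and move each of them
    # a fraction `shrink` of the way toward the cluster centroid (CURE-style).
    centroid = points.mean(axis=0)
    # Farthest-first selection: start with the point farthest from the centroid,
    # then repeatedly add the point farthest from the already-chosen set.
    reps = [points[np.argmax(np.linalg.norm(points - centroid, axis=1))]]
    while len(reps) < min(n_rep, len(points)):
        dists = np.min([np.linalg.norm(points - r, axis=1) for r in reps], axis=0)
        reps.append(points[np.argmax(dists)])
    reps = np.array(reps)
    return reps + shrink * (centroid - reps)

# Example usage on one cluster's points (hypothetical data):
cluster_points = np.random.rand(100, 3)
representatives = cure_representatives(cluster_points, n_rep=4, shrink=0.2)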

Determine location of optimal hydrogen bond donor/acceptor pair?

I have a PDB structure that I'm analyzing which has a putative binding pocket in it for some endogenous ligand.
I'd like to determine, for the relevant amino acids within, say, 3 Å of the pocket, where the optimal hydrogen bond donor/acceptor pair for the ligand would sit within the pocket. I have already done this for locating optimal pi-pi stacking (e.g. find aromatic residues, determine the plane of the ring face, go out N angstroms orthogonal to that face), but I'm struggling to work out the equivalent for hydrogen bonds.
Can this be done?
Well, I'll try to write out how I would approach it.
First of all, it's not clear to me whether your pocket is described by a grid that represents the pocket surface, or by a grid that represents the whole pocket space (let's call it the pocket cloud).
With Biopython, assuming you have a cloud described by your grid:
Loop over all the cloud-grid points:
    for every point, loop over all the PDB atoms that are H-bond donors or acceptors:
        if the distance is in the desired target range (around 3 Å, the distance
        for an optimal donor or acceptor pair):
            select the corresponding AA/atom/point
            add the point to your result list as donor, acceptor, or both, together
            with the selected atom/AA
        else:
            pass
For Biopython and distances, see here: Biopython PDB: calculate distance between an atom and a point.
H bonds are generally 2.7 to 3.3 Å.
I am not sure my logic is correct; the idea is to end up with a subset of your grid points, with red grid points where you could place a donor and blue ones where you could place an acceptor.
We are talking only about distances here; if you introduce the geometric factors of the bond, I think you would also need a ligand with its own geometry.
Of course, with this approach you would waste a lot of time on unproductive computation. If you find a way to select only the grid surface points, you could select the subset of PDB atoms that are close to the surface (3 Å) and then use the same approach as above.
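A rough Biopython sketch of the distance-only screen described above, assuming the pocket cloud is available as an (N, 3) NumPy array; the file names are hypothetical and the donor/acceptor selection by atom name is deliberately crude:

import numpy as np
from Bio.PDB import PDBParser

# Hypothetical inputs: a PDB file and an (N, 3) array of pocket grid points
structure = PDBParser(QUIET=True).get_structure("prot", "protein.pdb")
pocket_grid = np.load("pocket_cloud.npy")

# Crude donor/acceptor selection by atom name; a real run would apply
# residue-specific rules (e.g. backbone N-H as donor, backbone O as acceptor).
polar_names = {"N", "O", "OD1", "OD2", "OE1", "OE2", "ND1", "ND2",
               "NE2", "OG", "OG1", "OH", "NZ", "NE", "NH1", "NH2"}
donor_acceptor_atoms = [atom for atom in structure.get_atoms()
                        if atom.get_name() in polar_names]

hits = []  # (grid point index, atom) pairs within H-bond distance
for i, point in enumerate(pocket_grid):
    for atom in donor_acceptor_atoms:
        dist = np.linalg.norm(atom.coord - point)  # atom.coord is a NumPy array
        if 2.7 <= dist <= 3.3:  # typical heavy-atom H-bond distance range
            hits.append((i, atom))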

SPSS - Using K-means clustering after factor analysis

I am a developer that has been tasked with working out how previous results using SPSS were gathered, so we can repeat the process with some new data. We can't ask the person who did the original analysis because he is sadly no longer with us, so it has fallen to me to unravel what he did.
I am not a statistician and do not need to understand the principles involved. I really just need to know what menu items to navigate to.
We had a survey done, which asked a lot of questions of 10,000 people. A subset of 15 of these questions is being used for the analysis.
I know that factor analysis was done to reduce the data to 4 factors. K-means clustering was then used to find the cluster centers. This is what I'm after now.
I have worked out how to do the factor analysis to get the component score coefficient matrix that matches the data I have in my database. This was done by going to Analyze > Dimension Reduction > Factor. I then chose a fixed number of factors (4) from the "Extract" section, "Varimax" rotation from the "Rotation" section and checked the "Display factor score coefficient matrix" in the "Scores" section.
This gave data like this:
Matrix Value 1 Value 2 Value 3 Value 4
Q1 -0.0756 0.2134 -0.0245 -0.1236
Q2 ... ... ... ...
Q3 ... ... ... ...
...
What I have no idea of is how to proceed with this to do the k-means clustering.
The results I have in the database look like this:
Cluster centers Value 1 Value 2 Value 3 Value 4 Value 5
FAC1_1 -0.8373 -0.5766 0.2100 1.3499 0.2940
FAC2_1 ... ... ... ... ...
FAC3_1 ... ... ... ... ...
FAC4_1 ... ... ... ... ...
Now, I know that k-means clustering can be done on the original data set by using Analyze > Classify > K-means Cluster, but I don't know how to reference the factor analysis I've done.
Could someone give me some insight into how to create these cluster centers using SPSS?
In the GUI for factor analysis (Analyze > Dimension Reduction > Factor), there is a sub-dialog "Scores"; make sure "Save as variables" is checked.
This will save the factor scores in your data, i.e. the variables FAC1_1, FAC2_1, FAC3_1, FAC4_1.
It is these variables that you then need to add as input variables in the K-Means GUI.
It is better to set up your work in syntax so that if anyone else ever wants to replicate it, they can do so (and ideally your predecessor should have left his breadcrumbs in a syntax document too; I would make every attempt to find this document if there is any possibility of it existing, a file with the .sps extension).
Here's how you'd set this up in syntax and what his/her workings may have looked like:
/* Replicate the factor analysis (four factors) and save the factor score variables */.
FACTOR
/VARIABLES < INPUT THE 15 VARIABLES HERE >
/MISSING LISTWISE
/ANALYSIS < INPUT THE 15 VARIABLES HERE >
/PRINT EXTRACTION ROTATION FSCORE
/FORMAT SORT BLANK(.10)
/PLOT ROTATION
/CRITERIA FACTORS(4) ITERATE(25)
/EXTRACTION PC
/CRITERIA ITERATE(25)
/ROTATION VARIMAX
/SAVE REG(ALL)
/METHOD=CORRELATION.
/* Replicate the clustering using factor scores as inputs, generating 5 segments */.
QUICK CLUSTER FAC1_1 FAC2_1 FAC3_1 FAC4_1
/MISSING=LISTWISE
/CRITERIA=CLUSTER(5) MXITER(10) CONVERGE(0)
/METHOD=KMEANS(NOUPDATE)
/SAVE CLUSTER (Seg5)
/PRINT INITIAL.
/* Check centroids match*/.
MEANS FAC1_1 FAC2_1 FAC3_1 FAC4_1 BY Seg5 /CELLS MEAN.
If you can replicate the FACTOR score variables exactly, that is a good start. If the factor scores match but the centroids do not, it is most likely because the segment assignments are now different: even with the same inputs and methodology, if the case ordering differs from before, K-Means (QUICK CLUSTER) can and most likely will yield different segment assignments because of its starting points.
I don't know any way around this, but in principle these are the likely steps he/she took.
I have done the same kind of analysis for a project of mine. First carry out the factor analysis; once you have extracted a good amount of variance, save the factor scores (in SPSS).
To save the factor scores, go to Analyze > Dimension Reduction > Factor > Scores > Save as variables.
When you save the scores, new variables are created in the Variable view, one per component.
Once you have saved the factor scores, go to Analyze > Classify > K-Means Cluster, select the new variables (the factor scores), enter the number of clusters required, and click OK.
If you have access to the system where the original work was done, look for the journal file (typically named statistics.jnl and kept in the location specified under Edit > Options > Files).
If journaling was in effect with the append option, it will have all the commands the user ran.
I'm doing the same set of analyses for a project. Just for your information, the two-step clustering procedure offered by SPSS is more robust than K-means (Punj & Stewart 1983). With K-means, how are you going to choose K? You can also use the clValid package to find the optimal K if you insist on using K-means.
Punj, G., & Stewart, D. W. (1983). Cluster analysis in marketing research: review and suggestions for application. Journal of marketing research, 134-148.

hierarchical clustering using flann in opencv

I'm trying to use the method hierarchicalClustering from OpenCV 2.4.2.
It runs without errors, but the problem is that I don't understand the parameters it accepts, e.g. branching...
And I think that is what causes my problem: I always get just one cluster.
My input is a cv::Mat of LBPH features (for face detection); the number of rows is 12 and the number of cols is 6272.
No matter what value I use for the branching factor, I always get just one cluster, and its centroid is the mean of the rows of the input matrix groupped_one_person_features.
Could you advise?
Thanks a lot!
Here's the code:
cv::Mat groupped_one_person_features;
.... // fill groupped_one_person_features with data
int Nclusters = 50;
cv::Mat centroids(Nclusters, Features.data[0][0].cols, CV_32FC1);
int count = cv::flann::hierarchicalClustering<cvflann::L1<float>>groupped_one_person_features,centroids,cvflann::KMeansIndexParams(2000,11,cvflann::FLANN_CENTERS_KMEANSPP));
First of all, you missed a parenthesis in your last line:
int count = cv::flann::hierarchicalClustering<cvflann::L1<float>>(groupped_one_person_features,centroids,cvflann::KMeansIndexParams(2000,11,cvflann::FLANN_CENTERS_KMEANSPP));
In order, the parameters are (according to flann_base.hpp):
The points to be clustered
The computed cluster centers. Matrix should be preallocated and centers.rows is the number of clusters requested.
The clustering parameters
The distance to be used for clustering
Therefore, if you always get one cluster, it possibly means that your centroids matrix only has one row. Can you verify this?
The parameters of KMeansIndexParams are (according to kmeans_index.h):
branching factor: the number of children of a node in the tree
iterations: max iterations to perform in one kmeans clustering (kmeans tree)
centers_init: algorithm used for picking the initial cluster centers for kmeans tree
cb_index: cluster boundary index. Used when searching the kmeans tree

Resources