Clustering suggestions: I have an unlabelled dataset with 6 attributes (all numeric) and ~100k datapoints. I want to cluster similar datapoints - machine-learning

As part of preprocessing (a sketch of these two steps follows below):
I have removed attributes that are highly correlated (> 0.8).
Standardized the data (StandardScaler).
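A minimal sketch of those two preprocessing steps, assuming `df` is the original 130351 x 6 DataFrame from the post; the 0.8 threshold is the one mentioned above, everything else is illustrative:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# drop one attribute out of every pair whose absolute correlation exceeds 0.8
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.8).any()]
df_reduced = df.drop(columns=to_drop)

# standardize to zero mean / unit variance
df_scaled1 = StandardScaler().fit_transform(df_reduced)
```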
import hdbscan
from umap import UMAP

#To reduce the data to lower dimensions I used UMAP
reducer = UMAP(n_neighbors=20,   # renamed from `umap` so it doesn't shadow the umap module
               min_dist=0,
               spread=2,
               n_components=3,
               metric='euclidean')
df_umap = reducer.fit_transform(df_scaled1)

#For clustering I used HDBSCAN
clusterer = hdbscan.HDBSCAN(min_cluster_size=30, max_cluster_size=100, prediction_data=True)
clusterer.fit(df_umap)

#Assign cluster labels back to the original dataset
df['cluster'] = clusterer.labels_
Data shape: (130351, 6). A sample of the data:

Column a    Column b    Column c    Column d         Column e     Column f
6.000194    7.0         1059216     353069.000000    26.863543    15.891751
3.001162    3.5         1303727     396995.666667    32.508957    11.215764
6.000019    7.0         25887       3379.000000      18.004558    10.993119
6.000208    7.0         201138      59076.666667     41.140104    10.972880
6.000079    7.0         59600       4509.666667      37.469000    9.667119

df.describe():
Results:
1. While some of the clusters contain very similar data points (example: cluster 1555), a lot of them have extreme/outlying data points lumped into a single cluster (example: cluster 5423).
2. Cluster id '-1' (HDBSCAN's noise label) has 36,221 data points assigned to it.
My questions:
Am I using the correct approach for the data I have and the result I am trying to achieve?
Is UMAP the correct choice for dimensionality reduction?
Is HDBSCAN the right choice for this clustering problem? (I chose HDBSCAN because it doesn't need the number of clusters as user input, and the minimum and maximum number of data points per cluster can be set beforehand.)
How do I tune the clustering model to achieve better cluster quality? (I am assuming that with better cluster quality the points currently assigned to cluster '-1' will also get clustered.)
Is there any method to assess cluster quality? (A sketch of two options follows below.)
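For the last question, a minimal sketch of two common quality checks, assuming `df_umap` is the UMAP embedding produced above; the silhouette score is a generic choice, and `relative_validity_` is HDBSCAN's built-in density-based (DBCV-like) score, which requires `gen_min_span_tree=True`:

```python
import hdbscan
from sklearn.metrics import silhouette_score

clusterer = hdbscan.HDBSCAN(min_cluster_size=30, gen_min_span_tree=True)
labels = clusterer.fit_predict(df_umap)

# silhouette score over the non-noise points only (HDBSCAN labels noise as -1)
mask = labels != -1
if mask.sum() > 0 and len(set(labels[mask])) > 1:
    print("silhouette (non-noise):", silhouette_score(df_umap[mask], labels[mask]))

# HDBSCAN's own relative validity score; higher is better
print("relative validity:", clusterer.relative_validity_)
```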

Related

Time series clustering of activity of machines

I have an NxM matrix where N is the number of time intervals and M is the number of nodes in a graph.
Each cell indicates whether that node was active in that time interval.
Now I need to find groups of nodes that always appear together across the time series. Is there some approach I can use to cluster these nodes together based on their time-series activity?
In R you could do this:
# hierarchical clustering
library(dendextend) # contains color_branches()
dist_ts <- dist(t(mydata)) # distances between nodes (dist() works on rows, so transpose the N x M matrix)
hc_dist <- hclust(dist_ts)
dend_ts <- as.dendrogram(hc_dist)
# set some value for h (height within the dendrogram) here that makes sense for you
dend_100 <- color_branches(dend_ts, h = 100)
plot(dend_100)
This creates a dendrogram with colored branches.
You could do much better visualizations, but your post is pretty generic (somewhat unclear what you're asking) and you didn't indicate whether you like R at all.
As the sets may overlap, most clustering methods will not produce optimal results.
Instead, treat each time point as a transaction containing all active nodes as items. Then run frequent itemset mining to find sets of machines that are frequently active together.
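A minimal sketch of that idea in Python, assuming the mlxtend library is available; the node names and the 0.5 support threshold are made up for illustration:

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori

# each time interval becomes one transaction: the set of nodes active in that interval
transactions = [
    ["node_a", "node_b"],
    ["node_a", "node_b", "node_c"],
    ["node_b", "node_c"],
]
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# frequent itemsets = sets of nodes that are active together in many intervals
itemsets = apriori(onehot, min_support=0.5, use_colnames=True)
print(itemsets)
```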

Time series distance metric

In order to cluster a set of time series I'm looking for a smart distance metric.
I've tried some well-known metrics but none of them fit my case.
Example: let's assume that my clustering algorithm extracts these three centroids [s1, s2, s3]:
I want to put this new example [sx] in the most similar cluster:
The most similar centroid is the second one, so I need a distance function d that gives me d(sx, s2) < d(sx, s1) and d(sx, s2) < d(sx, s3).
Edit: here are the results with the metrics [cosine, euclidean, minkowski, dynamic time warping].
Edit 2: user Pietro P suggested applying the distances to the cumulated version of the time series. The solution works; here are the plots and the metrics:
Nice question! Using any standard distance on R^n (Euclidean, Manhattan or, generically, Minkowski) over those time series cannot achieve the result you want, since those metrics are independent of permutations of the coordinates of R^n (while time is strictly ordered, and that is exactly the phenomenon you want to capture).
A simple trick that does what you ask is to use the cumulated version of the time series (sum the values over time as time increases) and then apply a standard metric. Using the Manhattan metric, you would get, as the distance between two time series, the area between their cumulated versions.
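A minimal numpy sketch of that trick (`sx` and `centroids` are the series from the question; nothing else is assumed):

```python
import numpy as np

def cumulative_manhattan(s1, s2):
    # Manhattan distance between the cumulated series = area between the cumulated curves
    c1, c2 = np.cumsum(s1), np.cumsum(s2)
    return np.abs(c1 - c2).sum()

# assign sx to the most similar centroid:
# best = min(range(len(centroids)), key=lambda i: cumulative_manhattan(sx, centroids[i]))
```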
Another approach would be to utilize DTW (dynamic time warping), which is an algorithm to compute the similarity between two temporal sequences. Full disclosure: I coded a Python package for this purpose called trendypy; you can download it via pip (pip install trendypy). Here is a demo on how to utilize the package. You're basically computing the total minimum distance for different combinations to set the cluster centers.
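For reference, a plain-numpy DTW distance; this is the textbook O(n*m) dynamic-programming formulation, not the trendypy API:

```python
import numpy as np

def dtw_distance(s1, s2):
    # classic DTW with absolute-difference local cost
    n, m = len(s1), len(s2)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(s1[i - 1] - s2[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```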
What about using the standard Pearson correlation coefficient? Then you can assign the new point to the cluster with the highest coefficient.
correlation, p_value = scipy.stats.pearsonr(<new time series>, <centroid>)
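A small sketch of that assignment rule (`sx` and `centroids` are again the series from the question):

```python
import numpy as np
from scipy.stats import pearsonr

def assign_by_correlation(sx, centroids):
    # index of the centroid with the highest Pearson correlation to sx
    coefficients = [pearsonr(sx, c)[0] for c in centroids]
    return int(np.argmax(coefficients))
```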
Pietro P's answer is just a special case of applying a convolution to your time series.
If I used the kernel:
[1,1,...,1,1,1,0,0,0,0,...0,0]
I would get a cumulative series.
Adding a convolution works because you're giving each data point information about its neighbours - it is now order dependent.
It might be interesting to try a Gaussian convolution or other kernels.
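A quick numpy sketch of that equivalence; the toy series is made up, and `gaussian_filter1d` is one way to try a Gaussian kernel instead:

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

x = np.array([1.0, 2.0, 0.5, 3.0])         # toy series
kernel = np.ones(len(x))                    # the [1, 1, ..., 1] kernel from the answer
conv = np.convolve(x, kernel)[:len(x)]      # first len(x) samples of the full convolution
print(np.allclose(conv, np.cumsum(x)))      # True: it reproduces the cumulative series

smoothed = gaussian_filter1d(x, sigma=1.0)  # order-dependent but smoother alternative
```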

How do I decide or count number of hidden/tunable parameters in my design?

For my deep learning assignment I need to design an image classification network. There is a constraint in the assignment: I can have at most 500,000 hidden/tunable parameters in this design.
How can I count or observe the number of these hidden parameters, especially if I am using this TensorFlow tutorial as the initial code/design?
Thanks in advance
How can I count or observe the number of these hidden parameters, especially if I am using this TensorFlow tutorial as the initial code/design?
Instead of me doing the work for you, I'll show you how to count free parameters.
Glancing quickly, it looks like the cifar10 code uses layers of max pooling, convolution, bias, and fully connected weights. Let's review how many free parameters each of these layers adds to your architecture.
max pooling: FREE! That's right, max pooling adds no free parameters.
conv: Convolutions are defined using a shape like [1,3,3,1], where the numbers correspond to your tensor like so: [batch_size, CONV_SIZE, CONV_SIZE, FEATURE_DEPTH]. Multiply all the dimension sizes together to find the total number of free parameters. In the case of [1,3,3,1], the total is 1x3x3x1 = 9.
bias: A bias is similar to a convolution in that it is defined by a shape like [10] or [1,342,342,3]. Same thing: just multiply all the dimension sizes together to get the total free parameters. Sometimes a bias is just a single number, which means a size of 1.
fully connected: A fully connected layer usually has a 2D shape like [1024,32]. This means that it is a 2D matrix, and you calculate the total free parameters just like for the convolution. In this example, [1024,32] has 1024x32 = 32,768 free parameters.
Finally, you add up the free parameters from all the layers and that is your total number of free parameters.
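If the network is built with tf.keras (an assumption; the tutorial's exact architecture may differ), the framework can do the counting for you. A minimal sketch with a small CIFAR-10-style model:

```python
import tensorflow as tf

# a small CNN, only to illustrate the counting (not the tutorial's exact code)
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(32, 32, 3)),
    tf.keras.layers.MaxPooling2D(),        # max pooling: adds no free parameters
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(10),
])
model.summary()                            # per-layer parameter counts
print("total trainable parameters:", model.count_params())  # ~462k here, under the 500,000 budget
```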
500,000 parameters? Are you using the R, G and B values of each pixel? If yes, there are some problems:
1. too much data (long calculation time)
2. in image classification, some other image-analysis technique (preprocessing) is almost always applied before throwing the data into a NN. If you have two identical images and the second is shifted by one pixel, they can look very different to the network.
Imagine another neural network that uses two parameters, say weight and height. What happens if you swap these parameters?
Yes, during learning your image network can reduce this effect, but when I made experiments with 5x5 binary images it was very hard for the network. I started using 4 layers, but that helped only a little.
The images used for learning can be classified well, even after distortion, but move one by a single pixel and you have a problem.
If not, make experiments or use a genetic algorithm to find it.
After learning you should use some algorithm to find inputs that the network recognizes as "not important" (a big difference between the weights of this input and the rest; if an input's weights are too close to 0, the network "thinks" it is not important).

Image classification with Sift features and Knn?

Can you help me with image classification using SIFT features?
I want to classify images based on SIFT features:
Given a training set of images, extract SIFT descriptors from them.
Compute K-Means over the entire set of SIFT descriptors extracted from the training set. The "K" parameter (the number of clusters) depends on the number of SIFT descriptors you have for training, but it is usually around 500->8000 (the higher, the better).
Now you have obtained K cluster centers.
You can compute the descriptor of an image by assigning each SIFT descriptor of the image to one of the K clusters. In this way you obtain a histogram of length K.
I have 130 images in my training set, so my training set is 130*K dimensional.
I want to classify my test images; I have 1 image, so my sample is 1*K dimensional. I wrote this code: knnclassify(sample, trainingset, group).
I want to classify into 7 groups. So: knnclassify(sample(1*10), trainingset(130*10), group(7*1))
The error is: "The length of GROUP must equal the number of rows in TRAINING." What can I do?
Straight from the docs:
CLASS = knnclassify(SAMPLE,TRAINING,GROUP) classifies each row of the data in SAMPLE into one of the groups in TRAINING using the nearest-neighbor method. SAMPLE and TRAINING must be matrices with the same number of columns. GROUP is a grouping variable for TRAINING. Its unique values define groups, and each element defines the group to which the corresponding row of TRAINING belongs. GROUP can be a numeric vector, a string array, or a cell array of strings. TRAINING and GROUP must have the same number of rows.
What this means is that GROUP should be 130x1 and should indicate which group each of the training samples belongs to. unique(group) should return 7 values in your case - the seven categories represented in your training set.
If you don't already have a group vector specifying which category each image falls into, you could use kmeans to split your training set into 7 groups:
group = kmeans(trainingset,7);
knnclassify(sample, trainingset, group);
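For comparison, a minimal Python sketch of the same bag-of-visual-words pipeline, assuming OpenCV (>= 4.4 for SIFT) and scikit-learn; the paths, labels and K=500 are illustrative, and the key point mirrors the answer above: the label vector has one entry per training image:

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

def sift_descriptors(path):
    # all SIFT descriptors of one image (shape: n_keypoints x 128)
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    _, desc = cv2.SIFT_create().detectAndCompute(gray, None)
    return desc

def bovw_histogram(path, kmeans):
    # length-K histogram: assign each descriptor to its nearest visual word
    hist = np.zeros(kmeans.n_clusters)
    desc = sift_descriptors(path)
    if desc is not None:
        for word in kmeans.predict(desc):
            hist[word] += 1
    return hist

# train_paths / train_labels are hypothetical: 130 image paths and their 7-class labels
# kmeans = KMeans(n_clusters=500).fit(np.vstack([sift_descriptors(p) for p in train_paths]))
# X_train = np.array([bovw_histogram(p, kmeans) for p in train_paths])
# knn = KNeighborsClassifier(n_neighbors=1).fit(X_train, train_labels)  # one label per image
# prediction = knn.predict([bovw_histogram("test.jpg", kmeans)])
```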

Kohonen Self Organizing Maps: Determining the number of neurons and grid size

I have a large dataset I am trying to do cluster analysis on using SOM. The dataset is HUGE (~ billions of records) and I am not sure what should be the number of neurons and the SOM grid size to start with. Any pointers to some material that talks about estimating the number of neurons and grid size would be greatly appreciated.
Thanks!
Quoting from the som_make function documentation of the som toolbox
It uses a heuristic formula of 'munits = 5*dlen^0.54321'. The
'mapsize' argument influences the final number of map units: a 'big'
map has x4 the default number of map units and a 'small' map has
x0.25 the default number of map units.
dlen is the number of records in your dataset
You can also read about the classic WEBSOM which addresses the issue of large datasets
http://www.cs.indiana.edu/~bmarkine/oral/self-organization-of-a.pdf
http://websom.hut.fi/websom/doc/ps/Lagus04Infosci.pdf
Keep in mind that the map size is also an application-specific parameter. Namely, it depends on what you want to do with the generated clusters. Large maps produce a large number of small but "compact" clusters (records assigned to each cluster are quite similar). Small maps produce fewer but more generalized clusters. A "right number of clusters" doesn't exist, especially in real-world datasets. It all depends on the level of detail at which you want to examine your dataset.
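As a quick illustration of that heuristic, a sketch; the 'big'/'small' multipliers are the ones quoted above, and the billion-record figure is the question's own estimate:

```python
def default_map_units(dlen, mapsize=None):
    # som_make heuristic: munits = 5 * dlen^0.54321, scaled by the 'mapsize' argument
    munits = 5 * dlen ** 0.54321
    if mapsize == 'big':
        munits *= 4
    elif mapsize == 'small':
        munits *= 0.25
    return round(munits)

print(default_map_units(1_000_000_000))  # ~387,000 map units for a billion records
```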
I have written a function that, with the data set as input, returns the grid size. I rewrote it from the som_topol_struct() function of MATLAB's Self Organizing Maps Toolbox into an R function.
topology=function(data)
{
# For a hexagonal lattice, determine the number of neurons (munits) and their layout (msize)
D=data
# munits: number of hexagons
# dlen: number of subjects (rows)
dlen=dim(data)[1]
dim=dim(data)[2]
munits=ceiling(5*dlen^0.5) # MATLAB heuristic formula
#munits=100
#size=c(round(sqrt(munits)),round(munits/(round(sqrt(munits)))))
A=matrix(Inf,nrow=dim,ncol=dim)
for (i in 1:dim)
{
D[,i]=D[,i]-mean(D[is.finite(D[,i]),i])
}
for (i in 1:dim){
for (j in i:dim){
c=D[,i]*D[,j]
c=c[is.finite(c)];
A[i,j]=sum(c)/length(c)
A[j,i]=A[i,j]
}
}
VS=eigen(A)
eigval=sort(VS$values)
if (eigval[length(eigval)]==0 | eigval[length(eigval)-1]*munits<eigval[length(eigval)]){
ratio=1
}else{
ratio=sqrt(eigval[length(eigval)]/eigval[length(eigval)-1])}
size1=min(munits,round(sqrt(munits/ratio*sqrt(0.75))))
size2=round(munits/size1)
return(list(munits=munits,msize=sort(c(size1,size2),decreasing=TRUE)))
}
hope it helps...
Iván Vallés-Pérez
I don't have a reference for it, but I would suggest starting off by using approximately 10 SOM neurons per expected class in your dataset. For example, if you think your dataset consists of 8 separate components, go for a map with 9x9 neurons. This is completely just a ballpark heuristic though.
If you'd like the data to drive the topology of your SOM a bit more directly, try one of the SOM variants that change topology during training:
Growing SOM
Growing Neural Gas
Unfortunately these algorithms involve even more parameter tuning than plain SOM, but they might work for your application.
Kohonen has written on the issue of selecting parameters and map size for the SOM in his book "MATLAB Implementations and Applications of the Self-Organizing Map". In some cases, he suggests that the initial values can be arrived at after testing several sizes of the SOM, to check that the cluster structures are shown with sufficient resolution and statistical accuracy.
My suggestion would be the following:
SOM is distantly related to correspondence analysis. In statistics, 5*r^2 is used as a rule of thumb, where r is the number of rows/columns in a square setup.
Usually, one should use some criterion that is based on the data itself, meaning that you need a criterion for estimating homogeneity. If a certain threshold is violated, you need more nodes. For checking homogeneity you need a certain number of records per node. Again, from statistics you can learn that for simple tests (a small number of variables) you need around 20 records, and for more advanced tests on several variables at least 8 records.
Remember that the SOM represents a predictive model, so validation is the key, absolutely mandatory. Yet validation of predictive models (see the type I / type II error entry in the Wiki) is a subject of its own. The acceptable risk, as well as the risk structure, also depends fully on your purpose.
You may test the dynamics of the model's error rate by reducing its size more and more, then take the smallest one with an acceptable error.
It is a strength of the SOM to allow for empty nodes. Yet there should not be too many of them; let's say less than 5%.
Taken all together, from experience, I would recommend the following criterion: a minimum absolute number of 8..10 records per node, but such nodes should not be more than 5% of all clusters.
That 5% rule is of course a heuristic, which can however be justified by the general usage of confidence levels in statistical tests. You may choose any percentage from 1% to 5%.
