k-means clustered data: how to label newly incoming data - machine-learning

I have a data set with labels that were produced by a k-means clustering algorithm. Now there is some data (with the same data structure) from another source and I wonder what is the most sensible way to label this new, yet unseen data? I was thinking about either
calculating the distance to the prior k-means centroids and labelling each new point according to its nearest centroid
running a new algorithm (e.g. SVM) on the new data, using the old data as the training set
Unfortunately, I couldn't find anything about this particular problem. There are only a few questions about the general use of k-means as a classification model:
Can k-means clustering do classification?
How to segment new data with existing K-means model?
Thanks in advance.
Uli

You don't need an SVM for this. The first approach is more convenient. If you are using sklearn, the KMeans documentation at https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html includes an example; the predict function will do the job.
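To make the first approach concrete, here is a minimal sketch of nearest-centroid labelling in plain Python. The centroid and point values are made up for illustration, and the helper name assign_to_nearest_centroid is hypothetical. As far as I know, sklearn's KMeans.predict does the equivalent assignment against the fitted cluster_centers_, so with sklearn you would not need to write this yourself.

```python
import math

def assign_to_nearest_centroid(points, centroids):
    """Return, for each point, the index of the closest centroid (Euclidean distance)."""
    labels = []
    for p in points:
        distances = [math.dist(p, c) for c in centroids]
        labels.append(distances.index(min(distances)))
    return labels

# Hypothetical centroids saved from the earlier k-means run (made-up values).
centroids = [(0.0, 0.0), (10.0, 10.0)]

# Hypothetical new, unseen points with the same two features.
new_points = [(1.0, 2.0), (9.0, 8.5), (4.0, 4.0)]

new_labels = assign_to_nearest_centroid(new_points, centroids)
print(new_labels)
```

If the original data was scaled or otherwise transformed before clustering, the new data needs the same transformation before the distances are computed.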

Related

Clustering model like DBSCAN, OPTICS, KMEANS

I am unsure whether, after clustering with any algorithm, it is possible to segment new data based on what was learned from the previous data.
The issue is that clustering algorithms are unsupervised learning algorithms. They don't need a dependent variable to predict classes; they are used to find structures/similarities in the data points. What you can do is treat the clustered data as your supervised data.
The approach is to cluster the training data and assign the resulting cluster labels to it. Then treat it as multi-class classification data: train a new multi-class classification model on the labelled training data and validate it on the test data.
Let train and test be the datasets.
clusters <- Clustering(train)
train[y] <- clusters
model <- Classification(train[features], train[y])
prediction <- model.predict(test)
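The pseudocode above could look roughly like this in Python. This is only a sketch under assumptions: the cluster labels are taken as given (from whatever clustering step was used), the data values are made up, and a small k-nearest-neighbour vote stands in for the generic Classification step. Any other multi-class classifier could be swapped in.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Predict a label for `query` by majority vote among its k nearest training points."""
    order = sorted(range(len(train_points)),
                   key=lambda i: math.dist(train_points[i], query))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Hypothetical training points and the cluster labels a clustering step assigned to them.
train = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (5.0, 5.1), (5.2, 4.9), (4.8, 5.0)]
clusters = [0, 0, 0, 1, 1, 1]

# Hypothetical new points to label.
test = [(0.3, 0.2), (4.9, 5.3)]

prediction = [knn_predict(train, clusters, q) for q in test]
print(prediction)
```

Because the classifier only sees points and labels, this works with any clustering algorithm, including ones that have no notion of a centroid.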
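Interestingly, however, KMeans in sklearn provides fit and predict methods, so with sklearn's KMeans you can predict on the new data directly. DBSCAN, on the other hand, doesn't have a predict method, which follows from how it works: its clusters are defined by density-connected points in the fitted data rather than by centroids that new points can be compared against.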
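For a density-based result, one common workaround (not something DBSCAN itself provides) is to give a new point the label of its nearest core point if that core point lies within eps, and to call it noise otherwise. The sketch below assumes you have kept the core points, their labels and the eps value from the original run; all values and the helper name are made up. In sklearn, the fitted DBSCAN object should expose the core samples as components_ and the labels as labels_.

```python
import math

NOISE = -1  # same convention DBSCAN implementations commonly use for noise

def assign_by_core_points(core_points, core_labels, query, eps):
    """Give `query` the label of its nearest core point if within eps, else NOISE."""
    best_label, best_dist = NOISE, float("inf")
    for point, label in zip(core_points, core_labels):
        d = math.dist(point, query)
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label if best_dist <= eps else NOISE

# Hypothetical core points and labels kept from an earlier density-based clustering.
core_points = [(0.0, 0.0), (0.5, 0.0), (5.0, 5.0), (5.5, 5.0)]
core_labels = [0, 0, 1, 1]
eps = 1.0

new_points = [(0.2, 0.4), (5.3, 4.6), (2.5, 2.5)]
new_labels = [assign_by_core_points(core_points, core_labels, p, eps)
              for p in new_points]
print(new_labels)
```

Note that this never creates new clusters: a dense group of new points far from every existing core point will simply be labelled as noise.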
Clustering is an unsupervised mechanism where the number of clusters and the identity of the segments which need to be clustered are not known to the system.
Hence, what you can do is take what a model trained for clustering, classification, identification or verification has learned and apply it to your clustering use case.
If the new data is from the same domain as the training data, you will most probably end up with better clustering accuracy. (You need to choose the clustering methodology according to the type of data, e.g. for voice clustering, dominant sets and hierarchical clustering are the most promising candidates.)
If the new data is from a different domain, the selected model may fail, as it learned features that correspond to the domain of your training data.

Text Classification Technique for this scenario

I am completely new to Machine Learning algorithms and I have a quick question with respect to Classification of a dataset.
Currently there is a training data that consists of two columns Message and Identifier.
Message - Typical message extracted from Log containing timestamp and some text
Identifier - Should classify the category based on the message content.
The training data was prepared by extracting a particular category from the tool and labelling it accordingly.
Now the test data contains just the message and I am trying to obtain the Category accordingly.
Which approach is most helpful in this scenario? Is it supervised or unsupervised learning?
I have a labelled training dataset and I am trying to predict the Category for the test data.
Thanks in advance,
Adam
If your labels are exact, then you can classify using ANN, SVM etc. But if the labels are not exact, you have to cluster the data with respect to the features you have. K-means or nearest neighbour can be a starting point for clustering.
It is supervised learning, and a classification problem.
However, you obviously do not have the label column (the to-be-predicted value) for your test set. Thus, you cannot calculate error measures (such as false positive rate, accuracy etc.) for that test set.
You could, however, split the set of labeled training data that you do have into a smaller training set and a validation set. Split it 70%/30%, perhaps. Then build a prediction model from your smaller 70% training dataset and tune it on your 30% validation set. When accuracy is good enough, apply it to your test set to obtain/predict the missing values.
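A minimal sketch of such a 70%/30% split, using only the standard library. The rows are made-up (message, label) pairs and the function name is hypothetical; the fixed seed is just there to make the shuffle repeatable.

```python
import random

def split_train_validation(rows, validation_fraction=0.3, seed=0):
    """Shuffle a copy of `rows` and return (training_rows, validation_rows)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_validation = int(round(len(shuffled) * validation_fraction))
    return shuffled[n_validation:], shuffled[:n_validation]

# Hypothetical labelled data: (message, identifier) pairs.
labeled = [(f"message {i}", "error" if i % 2 else "info") for i in range(10)]

train_rows, validation_rows = split_train_validation(labeled)
print(len(train_rows), len(validation_rows))
```

Shuffling before the split matters when the rows are ordered, for example log messages sorted by time or grouped by category.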
Which techniques / algorithms to use is a different question. You do not give enough information to answer that. And even if you did you still need to tune the model yourself.
You have labels to predict, and training data.
So by definition it is a supervised problem.
Try any classifier for text, such as NB, kNN, SVM, ANN, RF, ...
It's hard to predict which will work best on your data. You will have to try and evaluate several.
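As one possible starting point, here is a small sketch of NB (multinomial naive Bayes with add-one smoothing) on whitespace-split tokens. The messages, categories and function names are invented for illustration, and real log messages would need timestamps and other noise removed first. It is a sketch, not a recommendation of NB over the other classifiers listed.

```python
import math
from collections import Counter, defaultdict

def tokenize(message):
    """Lowercase and split on whitespace."""
    return message.lower().split()

def train_naive_bayes(messages, labels):
    """Collect per-class word counts, per-class document counts and the vocabulary."""
    word_counts = defaultdict(Counter)
    class_counts = Counter(labels)
    vocabulary = set()
    for message, label in zip(messages, labels):
        tokens = tokenize(message)
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return word_counts, class_counts, vocabulary

def predict_naive_bayes(model, message):
    """Return the class with the highest log-posterior under add-one smoothing."""
    word_counts, class_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)
        total_words = sum(word_counts[label].values())
        for token in tokenize(message):
            if token not in vocabulary:
                continue  # words never seen in training carry no evidence
            count = word_counts[label][token]
            score += math.log((count + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training data: Message and Identifier columns.
messages = ["disk full on node a", "disk failure detected",
            "user login succeeded", "user logout completed"]
labels = ["storage", "storage", "auth", "auth"]

model = train_naive_bayes(messages, labels)
predictions = [predict_naive_bayes(model, m)
               for m in ["disk error on node b", "user login failed"]]
print(predictions)
```

To compare several classifiers fairly, each one would be trained on the same training portion and scored on the same held-out validation portion.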

How does APPLY_KMEANS work in Vertica

I am testing the machine learning tools in Vertica. I understand how KMEANS works, since it just divides the data into clusters. However, I do not understand how APPLY_KMEANS works on new data.
It looks to me like it acts more like a classification method, since it classifies new data into the existing clusters. So what algorithm is used (k nearest neighbor)? It's not very clear from the documentation.
k-means is a clustering algorithm (not classification!) that iterates over 2 steps:
Assignment step: assign each point to its nearest centroid
Update step: update the centroids' coordinates (each centroid moves to the mean of the points assigned to it)
When you build your k-means model, you first initialize the centroids (different strategies exist; random initialization is one of them), then you iterate until your clustering is good enough (e.g. your error is below a given threshold).
What defines your model is actually your centroids.
When using APPLY_KMEANS you will run an assignment step using data from your query and centroids from your model. Points will then be assigned to clusters depending on their distance with respect to centroids.
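The two steps and the "apply" phase can be sketched in plain Python as follows. This illustrates the general algorithm, not Vertica's implementation. The function names, data and the stopping rule (stop when no centroid moves more than a tolerance) are my own assumptions.

```python
import math

def assign(points, centroids):
    """Assignment step: index of the nearest centroid for every point."""
    return [min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points]

def update(points, labels, centroids):
    """Update step: move each centroid to the mean of the points assigned to it."""
    new_centroids = []
    for j, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == j]
        if not members:
            new_centroids.append(old)  # leave an empty cluster's centroid in place
            continue
        new_centroids.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
    return new_centroids

def fit_kmeans(points, initial_centroids, tolerance=1e-9, max_iterations=100):
    """Alternate assignment and update until the centroids stop moving."""
    centroids = list(initial_centroids)
    for _ in range(max_iterations):
        labels = assign(points, centroids)
        new_centroids = update(points, labels, centroids)
        shift = max(math.dist(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift <= tolerance:
            break
    return centroids

# Building the model: the result is just the list of centroids.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids = fit_kmeans(points, initial_centroids=[points[0], points[3]])

# Applying the model to new data: a single assignment step, no update.
applied = assign([(0.0, 0.0), (10.0, 10.0)], centroids)
print(centroids, applied)
```

The model is never modified when it is applied, which is why the same new point always lands in the same cluster.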
Hope it helps
pltrdy
Note about Clustering vs Classification:
We can be tempted to think that clustering is a kind of classification. Still, classification must only refer to supervised learning while clustering corresponds to unsupervised learning. Thus, don't do it :)

Applying PCA before sending data to SVM

Before applying SVM on my data I want to reduce its dimension by PCA. Should I separate the Train data and Test data then apply PCA on each of them separately or apply PCA on both sets combined then separate them?
Actually both provided answers are only partially right. The crucial part here is what is the exact problem you are trying to solve. There are two basic possible settings which can be considered, and both are valid under some assumptions.
Case 1
You have some data (which you split into train and test) and in the future you will get more data coming from the same distribution.
If this is the case, you should fit PCA on the train data, then fit the SVM on its projection. For testing, you just apply the already fitted PCA followed by the already fitted SVM, and you do exactly the same for new data that arrives later. This way your test error (under some "size assumptions") should approximate your expected error.
Case 2
You have some data (which you split into train and test) and in the future you will obtain a big chunk of unlabeled data and you will be able to fit your model then.
In such a case, you fit PCA on the whole data provided, learn the SVM on the labeled part (train set) and evaluate on the test set. This way, once new data arrives, you can fit PCA using both your data and the new data, and then train the SVM on your old data (as this is the only part that has labels). Under the assumption that, again, the data comes from the same distribution, everything is correct here. You use more data to fit PCA only to have a better estimator (maybe your data is really high dimensional and PCA fails with a small sample?).
You should do them separately. If you run PCA on both sets combined, then you are going to introduce a bias in your SVM. The goal of the test set is to see how your algorithm will perform without prior knowledge of the data.
Learn the Projection Matrix of PCA on the train set and use this to reduce the dimensions of the test data.
One benefit is that this way you don't have to rely on collecting sufficient data in the test set if you are applying your classifier at actual run time, where test data comes one sample at a time.
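The "fit on train, reuse on test" idea can be sketched without any library as below. To keep it short, this only finds the first principal component, using power iteration on the training covariance matrix. The data and function names are made up, and in practice a library PCA with separate fit and transform calls would normally be used instead.

```python
import math

def fit_pca_1d(train, iterations=200):
    """Fit a one-component PCA on `train`: return the feature means and top direction."""
    n, d = len(train), len(train[0])
    mean = [sum(row[j] for row in train) / n for j in range(d)]
    centered = [[row[j] - mean[j] for j in range(d)] for row in train]
    # Sample covariance matrix of the centered training data.
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    # Power iteration: repeatedly multiply by cov and renormalise.
    # Assumes the start vector is not orthogonal to the top eigenvector.
    v = [1.0 / (j + 1) for j in range(d)]
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return mean, v

def transform(rows, mean, component):
    """Project rows onto the fitted component, centering with the TRAIN mean."""
    return [sum((row[j] - mean[j]) * component[j] for j in range(len(mean)))
            for row in rows]

# Hypothetical two-feature data.
train = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.1)]
test = [(6.0, 6.0), (0.0, 0.1)]

mean, component = fit_pca_1d(train)           # learned from the train set only
train_proj = transform(train, mean, component)
test_proj = transform(test, mean, component)  # same mean and direction reused
print(component, test_proj)
```

Since transform works row by row with stored parameters, it can be applied to a single incoming sample just as easily as to a whole test set.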
Also, I think fitting PCA separately on train and test will fail. Why?
Think of PCA as giving you features, and then you learn a classifier over these features. If over time your data shifts, then the test features you get using PCA would be different, and you don't have a classifier trained on these features. Even if the set of directions/features of the PCA remain same but their order varies your classifier still fails.

How to use Test Learners and Confusion Matrix through Orange (GUI)

I'm new to the Orange GUI. I have some test data with old labels such as a cluster ID. I then use k-means clustering to generate new data with a new attribute holding the new cluster ID labels. The problem is that I don't know how to use the Orange GUI to evaluate the clustering by comparing the old and new labels:
(1) The Confusion Matrix widget (GUI) cannot connect to the output data of k-means clustering directly. I guess I need to train on my data, but I don't know how to train it and compare the trained output with the labeled data to get a confusion matrix.
(2) The ROC widget (GUI) cannot connect to it either. I suspect ROC may work once Test Learners and Confusion Matrix are working.
If you've used Orange (GUI), your help would be appreciated. I hope you can guide me on how to handle these icons and connections for evaluating the k-means clustering. Thank you!
If my description is poor, you can leave messages here and I'll check every day morning and evening. My nation adopts UTC +8 zone.
:-)
Confusion matrix and ROC analysis are widgets intended to analyze the results of the classification that come from a Test Learners widget. A typical schema for such evaluation is:
Widgets for clustering can add a column with cluster labels to the data set, but there is no widget to turn such a column into a predictor. With the current set of widgets there is no way to use unsupervised methods as learners, and hence no way to use widgets to analyze their results in a classification evaluation setup.
