I am a student working with SPSS (Statistics) for the first time. I used 1,000 rows of data to run the k-means clustering tool and obtained the results. I now want to take those results and run them against a test set (another 1,000 rows) to see how my model did.
I am not sure how to do this; any help is greatly appreciated!
Thanks
For a clustering model (or any unsupervised model), there really is no right or wrong result. There is no target variable to compare the model's output (the cluster allocation) against, so the idea of splitting the data set into training and testing partitions does not apply to these types of models the way it does for supervised models.
The best you can do is review the model's output, explore the cluster allocations, and decide whether they appear useful for the intended purpose.
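That said, if you still want to explore how the allocations look on the second 1,000 rows, here is a minimal sketch of one way to do it in Python with scikit-learn rather than SPSS (the file names, column contents, and cluster count are placeholders): fit k-means on the first partition, assign the held-out rows to the nearest centroids, and compare cluster sizes and silhouette scores across the two partitions.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Placeholder file names -- replace with your own exported data.
train = pd.read_csv("first_1000_rows.csv")
holdout = pd.read_csv("second_1000_rows.csv")

scaler = StandardScaler().fit(train)
X_train = scaler.transform(train)
X_hold = scaler.transform(holdout)

# Placeholder cluster count; use whatever you chose in SPSS.
km = KMeans(n_clusters=4, random_state=0).fit(X_train)

# Assign the held-out rows to the clusters found on the first partition.
train_labels = km.labels_
hold_labels = km.predict(X_hold)

# Explore the allocations: cluster sizes and cohesion on both partitions.
print(pd.Series(train_labels).value_counts().sort_index())
print(pd.Series(hold_labels).value_counts().sort_index())
print("silhouette (train):  ", silhouette_score(X_train, train_labels))
print("silhouette (holdout):", silhouette_score(X_hold, hold_labels))
```

This does not give a right/wrong score; it only helps you judge whether the clusters remain coherent and similarly sized on data the model has not seen.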
Related
I split my dataset into training and testing. At the end, after finding the best hyperparameters on the training dataset, should I refit the model using all the data? The goal is to reach the highest possible score on new data.
Yes, that would help your model generalize, as more data generally means better generalization.
I don't think so. If you do that, you will no longer have a valid test set. What happens when you come back to improve the model later? You will need a new test set for each model improvement, which means more labeling, and you won't be able to compare experiments across model versions because the test set won't be identical.
If you consider this model finished for good, then it is OK.
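To make the second answer concrete, here is a minimal sketch (scikit-learn, synthetic data) of the workflow it recommends: tune hyperparameters on the training portion only, report the score on the untouched test set, and only refit on all data once the model is considered final.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=2000, random_state=0)

# Keep a test set that is never used for tuning.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Hyperparameter search on the training data only (internal cross-validation).
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [100, 300], "max_depth": [None, 10]},
    cv=5,
)
search.fit(X_train, y_train)

# This score stays comparable across future model versions.
print("test accuracy:", search.score(X_test, y_test))

# Only if the model is final (no further comparisons needed), refit with the
# chosen hyperparameters on all available data before deployment.
final_model = RandomForestClassifier(**search.best_params_, random_state=0)
final_model.fit(X, y)
```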
I am trying to apply machine learning methods to predict/analyze a user's behavior. The data I have is in the following format:
data type
I am new to machine learning, so I am trying to understand whether what I am doing makes sense. In the activity column I have two possibilities, which I represent as 0 or 1. In the time column I have time in a cyclic manner, mapped to the range 0-24. At a certain time (one-hot encoded) the user performs an activity. If I use the activity column as the target and try to predict whether at a certain time the user will perform one activity or the other, does that make sense?
The reason I am trying to predict activity is that if my model predicts one activity and in real time the user does something else (which they have not been doing over the last week or so), I want to treat that as a deviation from normal behavior.
Am I on the right track? Any suggestions will be appreciated. Thanks.
I think your idea is valid, but keep in mind that machine learning models are not 100% accurate all the time; that is why accuracy is measured for a model.
If you want to create high-performance predictive models, consider deep learning models, because their performance tends to improve as the size of the training data set increases.
I think this is a great use case for a classification problem. Since you have only a few columns (features) in your dataset, I would start with a simple boosted decision tree classification algorithm.
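As an illustration of that suggestion, here is a minimal sketch (scikit-learn, made-up data) that one-hot encodes the hour of day, as described in the question, and fits a gradient-boosted tree classifier to predict the activity; the predicted probabilities can then be compared against what the user actually does.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up example data: hour of day (0-23) and the observed activity (0 or 1).
rng = np.random.default_rng(0)
hours = rng.integers(0, 24, size=(500, 1))
activity = ((hours[:, 0] >= 8) & (hours[:, 0] <= 18)).astype(int)

# One-hot encode the cyclic time feature and fit a boosted tree classifier.
model = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),
    GradientBoostingClassifier(random_state=0),
)
model.fit(hours, activity)

# Predicted probability of each activity at 3 a.m.; a large gap between this
# prediction and the observed behaviour can be flagged as a deviation.
print(model.predict_proba([[3]]))
```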
Your thinking is correct; that is basically how fraud-detection AI works in some cases. One option to pursue is a decision tree model, which may help you scale dynamically.
I was working on the same project but in a different direction; have a look, maybe it can help :) https://github.com/dmi3coder/behaiv-java.
In the normal case, I had earlier tried out naive Bayes and linear SVM to classify comments on a certain page, where I had access to training data manually labelled as spam or ham.
Now I am being asked to check whether there are ways to classify comments as spam when we don't have training data, something like producing two clusters that can be marked as spam or ham for any given data.
I would like to know some ways to approach this problem and what a good way to implement it would be.
I am still learning and experimenting. Any help will be appreciated.
Are the new comments very different from the old comments in terms of vocabulary? Words are almost all that the classifiers for this task look at.
You can always try training on your old data and applying the classifier to the new domain. You would have to label a few examples from your new domain in order to measure performance (or better, let others do the labeling to get more reliable results).
If this doesn't work well, you could try domain adaptation or look for datasets more similar to your new domain, using Google or these spam/ham corpora.
Finally, there may be some regularity or pattern in your new setting, e.g. downvotes for a comment, which may indicate spam/ham. In such cases, you could compile training data yourself. This would then be called distant supervision (you can search for papers using this keyword).
The best I could get to was this research work, which mentions active learning. What I came up with is this: I first performed k-means clustering and took the central clusters (assuming 5 clusters, I took the 3 largest) and sampled 1,000 messages from each. These I assign to the user to label. The next step is training a logistic regression on the labelled data and getting probabilities for the unlabelled data; if a probability is close to 0.5 (in the range 0.4 to 0.6), meaning the model is uncertain, I assign that message to be labelled, and the process continues.
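A rough sketch of that loop in Python with scikit-learn; the corpus and the `ask_user_to_label` function are placeholders (here faked with a simple rule) standing in for the real comments and the human labelling step.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Placeholder corpus; replace with the real unlabelled comments.
messages = (
    ["buy cheap meds now http://spam.example %d" % i for i in range(100)]
    + ["thanks for the detailed answer, very helpful %d" % i for i in range(100)]
)

def ask_user_to_label(texts):
    # Stand-in for the human labelling step: here we fake it with a rule.
    return [1 if "http" in t else 0 for t in texts]

X = TfidfVectorizer().fit_transform(messages)

# Step 1: cluster, then seed the labelled pool from the 3 largest clusters.
km = KMeans(n_clusters=5, random_state=0).fit(X)
sizes = np.bincount(km.labels_)
largest = np.argsort(sizes)[::-1][:3]
seed_idx = np.concatenate([np.where(km.labels_ == c)[0][:1000] for c in largest])

labelled_idx = list(seed_idx)
labels = ask_user_to_label([messages[i] for i in labelled_idx])

# Step 2 onwards: train, pick uncertain messages (probability 0.4 to 0.6),
# send them for labelling, and repeat.
for _ in range(5):
    clf = LogisticRegression(max_iter=1000).fit(X[labelled_idx], labels)
    unlabelled = np.setdiff1d(np.arange(len(messages)), labelled_idx)
    if len(unlabelled) == 0:
        break
    proba = clf.predict_proba(X[unlabelled])[:, 1]
    uncertain = unlabelled[(proba >= 0.4) & (proba <= 0.6)]
    if len(uncertain) == 0:
        break
    labelled_idx.extend(uncertain.tolist())
    labels.extend(ask_user_to_label([messages[i] for i in uncertain]))
```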
I'm new to the Orange GUI. I have data with old labels (a cluster ID). I then use k-means clustering to generate a new attribute containing the new cluster ID labels. The problem is that I don't know how to use the Orange GUI to evaluate the clustering by comparing the old and new labels, as follows:
(1) The Confusion Matrix widget cannot connect directly to the output data of k-means clustering. I guess I need to train a model on my data, but I don't know how to do that and then compare the predictions with the labeled data to obtain a confusion matrix.
(2) The ROC Analysis widget cannot connect to it either. I suspect ROC would work once Test Learners and Confusion Matrix are working.
If you've used the Orange GUI, your help would be much appreciated. I hope you can guide me on how to connect these widgets to evaluate the k-means clustering. Thank you!
If my description is unclear, you can leave messages here and I'll check every morning and evening. My time zone is UTC+8.
:-)
Confusion Matrix and ROC Analysis are widgets intended to analyze the results of classification that come from a Test Learners widget. A typical evaluation schema feeds the data and one or more learners into Test Learners, whose results then go to Confusion Matrix or ROC Analysis.
Widgets for clustering can add a column with cluster labels to the data set, but there is no widget to turn such a column into a predictor. With the current set of widgets there is no way to use unsupervised methods as learners, and hence no way to analyze their results in a classification-evaluation setup.
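If the goal is simply to compare the old cluster labels with the new k-means labels, one workaround is to do that step outside the Orange GUI, for example in Python with pandas and scikit-learn. A minimal sketch, assuming you export a table that contains both label columns (file and column names are placeholders):

```python
import pandas as pd
from sklearn.metrics import adjusted_rand_score

# Placeholder: a table containing both the old cluster IDs and the new
# cluster IDs produced by k-means (e.g. exported from Orange as CSV).
df = pd.read_csv("data_with_both_labels.csv")

# Contingency table: how the new clusters map onto the old ones.
print(pd.crosstab(df["old_cluster"], df["new_cluster"]))

# Agreement between the two labelings, corrected for chance (1.0 = identical).
print("adjusted Rand index:",
      adjusted_rand_score(df["old_cluster"], df["new_cluster"]))
```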
I've been working with Weka for a couple of months now.
Currently, I'm working on my machine learning course here at Ostfold University College.
I need a better way to construct a decision tree based on separate training and test sets.
Any good ideas would be a great relief.
Thanx in advance.
-Neo
You might be asking for something more specific, but in general:
You build the decision tree with the training set, and you evaluate the performance of that tree using the test set. In other words, on the test data, you call a function usually named something like *classify*, passing in the newly-built tree and a data point (within your test set) you wish to classify.
This function returns the leaf (terminal) node of your tree to which that data point belongs. Assuming the contents of that leaf are homogeneous (populated with data from a single class, not a mixture), you have in essence assigned a class label to that data point. When you compare the class label assigned by the tree with the data point's actual class label, and repeat this for all instances in your test set, you have a metric for evaluating the performance of your tree.
A rule of thumb: shuffle your data, then assign 90% to the training set and the other 10% to a test set.
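The question is about Weka, but the same build/classify/compare loop looks like this as a minimal scikit-learn sketch in Python (bundled sample data, 90/10 shuffled split as in the rule of thumb above):

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Shuffle, then keep 90% for training and 10% for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, shuffle=True, random_state=0
)

# Build the decision tree on the training set only.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# "Classify" each test instance and compare with its actual class label.
predicted = tree.predict(X_test)
print("test accuracy:", accuracy_score(y_test, predicted))
```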
Actually, I was looking for something like this: http://weka.wikispaces.com/Saving+and+loading+models
That is, a way to save a trained model, load it again, and apply it to the test set.
This is exactly what I was searching for. I hope it will be useful for anyone with a similar problem.
cheers
-Neo182
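For readers who want the same save/load step outside Weka, here is a minimal scikit-learn sketch in Python using joblib (the file name is arbitrary): train a model, persist it to disk, and load it later to classify new data without retraining.

```python
import joblib
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# Save the trained model to disk ...
joblib.dump(tree, "decision_tree.joblib")

# ... and later load it again to classify new data without retraining.
loaded = joblib.load("decision_tree.joblib")
print(loaded.predict(X[:5]))
```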