How to use Test Learners and Confusion Matrix through Orange (GUI) - machine-learning

I'm new to the Orange GUI. I have some data that already carries labels in the form of a cluster ID. I then ran k-means clustering, which produces new data with an added attribute holding the new cluster ID labels. The problem is that I don't know how to use the Orange GUI to evaluate the clustering by comparing the old and new labels. Specifically:
(1) The Confusion Matrix widget cannot connect directly to the output data of k-means clustering. I guess I need to train on my data, but I don't know how to train it and compare the trained result with the labeled data to get a confusion matrix.
(2) The ROC widget cannot connect to it either. I suspect ROC will only work once Test Learners and Confusion Matrix are working.
If you've used the Orange GUI, I would appreciate your help. I hope you can show me how to set up these widgets and connections to evaluate the k-means clustering. Thank you!
If my description is unclear, leave a message here; I'll check every morning and evening. I'm in the UTC+8 time zone.
:-)

Confusion matrix and ROC analysis are widgets intended to analyze the results of the classification that come from a Test Learners widget. A typical schema for such evaluation is:
Clustering widgets can add a column with cluster labels to the data set, but there is no widget that turns such a column into a predictor. With the current set of widgets there is no way to use unsupervised methods as learners, and hence no way to use the widgets to analyze their results in a classification evaluation setup.
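Outside the GUI, the comparison between old labels and new cluster IDs can still be done by hand. The following is a minimal sketch in plain Python, assuming the two label columns have been exported as equal-length lists. The sample values and the function names `contingency` and `best_mapping_accuracy` are made up for illustration. Because cluster IDs are arbitrary, the sketch tries every one-to-one mapping of cluster IDs onto old labels and reports the best agreement; it assumes there are no more clusters than old labels and only a handful of each.

```python
from collections import Counter
from itertools import permutations

def contingency(old_labels, cluster_ids):
    """Count how often each (old label, cluster ID) pair occurs."""
    return Counter(zip(old_labels, cluster_ids))

def best_mapping_accuracy(old_labels, cluster_ids):
    """Brute-force the one-to-one mapping of cluster IDs onto old labels
    that maximizes agreement. Assumes len(clusters) <= len(labels)."""
    table = contingency(old_labels, cluster_ids)
    labels = sorted(set(old_labels))
    clusters = sorted(set(cluster_ids))
    best_score, best_map = -1, {}
    for perm in permutations(labels, len(clusters)):
        mapping = dict(zip(clusters, perm))
        # Number of samples whose old label matches the mapped cluster ID.
        score = sum(table[(mapping[c], c)] for c in clusters)
        if score > best_score:
            best_score, best_map = score, mapping
    return best_score / len(old_labels), best_map

# Made-up example: old labels and the cluster IDs assigned by k-means.
old = ["a", "a", "a", "b", "b", "c", "c", "c"]
new = [2, 2, 0, 0, 0, 1, 1, 1]

accuracy, mapping = best_mapping_accuracy(old, new)
print(contingency(old, new))
print(accuracy, mapping)
```

The contingency counts play the role of a confusion matrix once the cluster IDs are renamed through the returned mapping.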

Related

new features in dataset

I'm in the middle of the semester and trying to understand the background of the algorithms and features.
I would like to understand some theory.
Suppose I have a dataset with N samples.
Each sample has, for example, 5 features.
I ran three kinds of classification algorithms on it, for example SVM, a decision tree, and k-means.
With all three I got good results.
Then, mysteriously, a new feature is added to the dataset. Its value for every sample is chosen at random.
I rerun the algorithms on the dataset (with the new feature).
Will the classification results change compared with the first results without the new feature? If yes, why will they change, and by how much?
In addition, if I do not have access to the dataset, how can I recognize that new feature?
Whether the results of your classification algorithm change or stay the same depends on how much information the model gains from the feature. If the feature is random noise, for instance, it will have little to no effect on your model other than slowing it down. If it contains useful information, it might improve metrics such as recall and precision. Hope this helps.
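One way to make "how much information the model gains from the feature" concrete is the information gain of a single threshold split, which is roughly what a decision tree looks at when choosing a feature. The sketch below is only an illustration under assumptions: the data is synthetic, the threshold of 0.5 is fixed by hand, and the helper names are hypothetical.

```python
import math
import random

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def information_gain(values, labels, threshold):
    """Entropy reduction obtained by splitting on values <= threshold."""
    left = [y for v, y in zip(values, labels) if v <= threshold]
    right = [y for v, y in zip(values, labels) if v > threshold]
    n = len(labels)
    remainder = sum(len(part) / n * entropy(part) for part in (left, right) if part)
    return entropy(labels) - remainder

random.seed(0)
labels = [i % 2 for i in range(200)]  # two balanced classes
# Informative feature: class value plus small jitter, separable at 0.5.
informative = [y + random.uniform(-0.4, 0.4) for y in labels]
# Noise feature: uniform random values unrelated to the class.
noise = [random.random() for _ in labels]

gain_informative = information_gain(informative, labels, 0.5)
gain_noise = information_gain(noise, labels, 0.5)
print(gain_informative, gain_noise)
```

In this synthetic setup the informative feature yields the full 1 bit of gain, while the random feature yields a gain close to zero, so a tree would rarely choose to split on it.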

Is it a bad idea to use the cluster ID from clustering text data using K-means as feature to your supervised learning model?

I am building a model that will predict the lead time of products flowing through a pipeline.
I have a lot of different features, one is a string containing a few words about the purpose of the product (often abbreviations, name of the application it will be a part of and so forth). I have previously not used this field at all when doing feature engineering.
I was thinking that it would be nice to do some type of clustering on this data, and then use the cluster ID as a feature for my model, perhaps the lead time is correlated with the type of info present in that field.
Here was my line of thinking:
1) Cleaning & tokenizing text.
2) TF-IDF
3) Clustering
But after thinking more about it, is it a bad idea? Because the clustering was based on the old data, if new words are introduced in the new data this will not be captured by the clustering algorithm, and the data should perhaps be clustered differently now. Does this mean that I would have to retrain the entire model (k-means model and then the supervised model) whenever I want to predict new data points? Are there any best practices for this?
Are there better ways of finding clusters for text data to use as features in a supervised model?
I understand the urge to run an unsupervised clustering algorithm first to see for yourself which clusters are found. And of course you can test whether that approach helps your task.
But as you have labeled data, you can pass the product description to the model without an intermediate clustering step. Your supervised algorithm will then learn for itself whether and how this feature helps with your task (of course, preprocessing such as stopword removal, cleaning, tokenizing and feature extraction still needs to be done).
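The preprocessing and feature-extraction steps mentioned above can be sketched in plain Python. This is a minimal illustration, not a production vectorizer: the stopword list, the example descriptions and the function names are all made up, and the IDF formula is one common smoothed variant. The point it shows is that the vocabulary is fixed when fitting, so words that first appear in new data are simply skipped at transform time until the features are refitted.

```python
import math
import re
from collections import Counter

STOPWORDS = {"the", "for", "of", "a", "and"}  # tiny illustrative list

def tokenize(text):
    """Lowercase, split on non-alphanumerics, drop stopwords."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS]

def fit_idf(documents):
    """Learn the vocabulary and a smoothed inverse document frequency."""
    n_docs = len(documents)
    doc_freq = Counter(tok for doc in documents for tok in set(tokenize(doc)))
    return {tok: math.log((1 + n_docs) / (1 + df)) + 1 for tok, df in doc_freq.items()}

def transform(text, idf):
    """TF-IDF weights for one text; tokens outside the fitted vocabulary are skipped."""
    counts = Counter(tok for tok in tokenize(text) if tok in idf)
    return {tok: count * idf[tok] for tok, count in counts.items()}

# Made-up training descriptions.
train_descriptions = [
    "billing app for the payments team",
    "payments gateway upgrade",
    "HR portal refresh",
]
idf = fit_idf(train_descriptions)

# "chatbot" was never seen during fitting, so it does not appear in the vector.
new_vector = transform("payments chatbot for HR", idf)
print(new_vector)
```

Sparse vectors like `new_vector` can be passed to a supervised model as features; refitting the IDF table from time to time is one way to pick up new vocabulary.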
Depending on your text descriptions, I could also imagine that some simple sequence embeddings would work for feature extraction. An embedding is a vector of, for example, 300 dimensions that describes words in such a way that "hp office printer" and "canon ink jet" end up close to each other, while "nice leatherbag" is farther away from the other two phrases. For example, pretrained fastText word embeddings are available for English. To get a single embedding for a sequence such as "hp office printer", one can take the average of the three word vectors (there are other ways to get an embedding for a whole sequence, for example doc2vec).
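The averaging idea can be sketched as follows. The 3-dimensional vectors in `TOY_VECTORS` are invented purely for illustration; real pretrained embeddings would be loaded from a model file and have far more dimensions. The function names are hypothetical as well.

```python
import math

# Hypothetical toy vectors; real word embeddings have e.g. 300 dimensions.
TOY_VECTORS = {
    "hp": [0.9, 0.1, 0.0],
    "office": [0.7, 0.2, 0.1],
    "printer": [0.8, 0.0, 0.2],
    "canon": [0.9, 0.0, 0.1],
    "ink": [0.7, 0.1, 0.2],
    "jet": [0.8, 0.1, 0.1],
    "nice": [0.1, 0.5, 0.4],
    "leatherbag": [0.0, 0.9, 0.1],
}

def phrase_embedding(phrase, vectors):
    """Average the vectors of the words in the phrase that are known."""
    known = [vectors[w] for w in phrase.lower().split() if w in vectors]
    if not known:
        raise ValueError("no known words in phrase")
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

printer = phrase_embedding("hp office printer", TOY_VECTORS)
inkjet = phrase_embedding("canon ink jet", TOY_VECTORS)
bag = phrase_embedding("nice leatherbag", TOY_VECTORS)

print(cosine(printer, inkjet))  # high: both phrases are about printers
print(cosine(printer, bag))     # low: unrelated phrases
```

The averaged vector has the same length for every phrase, so it can be used directly as a fixed-size feature block in a supervised model.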
But in the end you need to run tests to choose your features and methods!

Machine learning with my car dataset

I'm very new to machine learning.
I have a dataset produced by an F1 racing game: a user plays the game and the game gives me the data.
Using machine learning, I have to work with this data so that when a user (I know there are 10 of them) plays a game, I can recognize who is playing.
The data consists of datagram packets sent at a frequency of one every 1/10 second. The packets contain the following: time, lap time, lap distance, total distance, speed, car position, traction control, last lap time, fuel, gear, ...
I've thought of using k-means in a supervised way.
Which algorithm would be better?
The task is a multiclass classification. The very first step in any machine learning activity is to define a score metric (https://machinelearningmastery.com/classification-accuracy-is-not-enough-more-performance-measures-you-can-use/). That allows you to compare models with each other and decide which is better. Then build a baseline model with random forest and/or logistic regression, as suggested in another answer; they perform well out of the box. Then experiment with the features to understand which of them are more informative. And don't forget about visualizations: they give many hints for data wrangling, etc.
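As an example of a score metric for a multiclass task, here is a minimal sketch of macro-averaged F1 next to plain accuracy. The driver names and predictions are made up, and macro F1 is just one reasonable choice, not necessarily the right one for this dataset.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        f1_scores.append(f1)
    return sum(f1_scores) / len(f1_scores)

# Made-up example: a model that always predicts the most frequent driver.
y_true = ["driver_1"] * 8 + ["driver_2"] * 2
y_pred = ["driver_1"] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred)
print(accuracy, score)
```

Accuracy looks respectable here even though one driver is never recognized, while macro F1 drops sharply, which is the kind of difference the choice of metric is meant to expose.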
This is a somewhat broad question, so I'll try my best.
k-means is an unsupervised algorithm, meaning it finds the classes itself. It is best used when you know there are multiple classes but you don't know exactly what they are. Using it with labeled data just means computing the distance from a new vector v to each vector in the dataset and picking the label of the one with the minimum distance (or of the several closest, by majority vote). That is nearest-neighbour lookup rather than k-means, and it does not train a model in the usual sense.
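The distance-and-majority-vote procedure described above is usually called k-nearest neighbours. A minimal sketch, assuming numeric feature vectors and Euclidean distance; the points, labels and the function name `knn_predict` are made up for illustration.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Label the query by majority vote among its k closest training points."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Made-up 2-D feature vectors for two drivers.
train_points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_labels = ["driver_a"] * 3 + ["driver_b"] * 3

print(knn_predict(train_points, train_labels, (1.1, 1.0)))
print(knn_predict(train_points, train_labels, (5.0, 4.8)))
```

Every prediction scans the whole dataset, so the cost grows with the number of stored packets.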
In this case, where you do have the labels, a supervised approach will yield much better results.
I suggest trying random forest and logistic regression first. They are among the most basic and common algorithms, and they give pretty good results.
If you don't achieve the desired accuracy, you can use deep learning and build a neural network with an input layer as large as the number of values in a packet and an output layer with one node per class. In between you can use one or more hidden layers with varying numbers of nodes. This is an advanced approach, though, and you had better pick up some experience in machine learning before pursuing it.
Note: the data is a time series, and every driver has their own way of driving a car, so the data should be considered as groups of consecutive points rather than isolated points. With that view you can apply pattern matching techniques. There are also several neural network architectures built exactly for this kind of data (such as RNNs), but they are far more advanced and much more difficult to implement.
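One simple way to treat the packets as groups of points rather than individual rows is to cut the stream into fixed-size windows and summarize each window. The sketch below assumes a single speed value per packet and uses mean and standard deviation as summary features; the numbers, the window size and the function name are illustrative assumptions, not something taken from the game's data format.

```python
from statistics import mean, pstdev

def window_features(samples, window_size):
    """Split per-packet values into consecutive, non-overlapping windows
    and summarize each full window by its mean and standard deviation."""
    features = []
    for start in range(0, len(samples) - window_size + 1, window_size):
        window = samples[start:start + window_size]
        features.append((mean(window), pstdev(window)))
    return features

# Made-up speed readings, one per packet; the trailing partial window is dropped.
speeds = [100, 110, 120, 130, 200, 200, 200, 200, 90]
features = window_features(speeds, window_size=4)
print(features)
```

Each tuple can then serve as one training row for a classifier such as random forest, labeled with the driver who produced that window.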

What methods are best for clustering multidimensional data that has irregular shape?

I am new to machine learning and data analysis and I'm struggling to cluster my data. I'm working with about 40,000 observations with 6 features.
I have tried various clustering methods including K-Means, DBSCAN, and also attempted scipy hierarchical clustering with linkage. During preprocessing, missing data is imputed and all of the data is normalized. Once I complete PCA to reduce the dimensions from 6 to 4, my data looks like a crescent moon shape that can be seen below as the blue dots.
I determined that using 10 clusters for K-means would be best based on silhouette coefficient analysis and this is the result:
The result does not change much when performing PCA after the data has been clustered.
DBSCAN settles on 4 clusters by itself, but most of the data is excluded from these clusters and marked as noise.
For the hierarchical method, memory usage was too high: linkage() kept raising a memory error.
Is there any way I can cluster my data? Does the shape of my data (a crescent moon) lend itself to other modelling methods?
Don't run clustering without thinking first
Clustering algorithms must not be used as black boxes. They need to be used carefully or you get only garbage out. And to use them right, you need to understand the objective of each algorithm. K-means is a least squares approach. If you use it on badly normalized data, it fails.
Judging from your plot, there is a bad record in your database that is largely causing that "moon" shape: everything needs to be as far away as possible from that bad record.
Apart from that: 1. did you scale the data correctly for your problem? 2. did you choose the appropriate distance measure?
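To illustrate why scaling matters for a least squares method built on Euclidean distance, the sketch below finds the nearest neighbour of the first row before and after z-score standardization. The four rows are made up so that the second feature has a much larger numeric range than the first, and z-scoring is only one possible scaling, not necessarily the right one for a given problem.

```python
import math
from statistics import mean, pstdev

def standardize(rows):
    """Z-score each column: subtract the mean, divide by the standard deviation."""
    columns = list(zip(*rows))
    means = [mean(c) for c in columns]
    stds = [pstdev(c) or 1.0 for c in columns]  # guard against constant columns
    return [tuple((v - m) / s for v, m, s in zip(row, means, stds)) for row in rows]

def nearest_to_first(rows):
    """Index of the row closest to rows[0] in Euclidean distance."""
    first = rows[0]
    return min(range(1, len(rows)), key=lambda i: math.dist(first, rows[i]))

# Made-up data: feature 1 spans about 2 units, feature 2 spans hundreds.
rows = [(1.0, 100.0), (1.2, 160.0), (3.0, 110.0), (3.1, 900.0)]

raw_nearest = nearest_to_first(rows)
scaled_nearest = nearest_to_first(standardize(rows))
print(raw_nearest, scaled_nearest)
```

On the raw values the large-range feature decides which row is closest; after standardization both features contribute on a comparable scale and a different row becomes the nearest neighbour.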

Clustering or other mechanisms for implementing generic spam detection

Earlier, I tried naive Bayes and a linear SVM to classify comments on a particular page, where I had access to training data that was manually labelled as spam or ham.
Now I am being asked to check whether there are ways to classify comments as spam when we don't have any training data. Something like getting two clusters for any given data, which would then be marked as spam or ham.
I need to know some ways to approach this problem and what would be a good way to implement it.
I am still learning and experimenting. Any help will be appreciated.
Are the new comments very different from the old comments in terms of vocabulary? Words are almost everything the classifiers for this task look at.
You can always try training on your old data and applying the classifier to the new domain. You would have to label a few examples from your new domain in order to measure performance (or better, let others do the labeling in order to get more reliable results).
If this doesn't work well, you could try domain adaptation or look for datasets more similar to your new domain, using Google or looking at these spam/ham corpora.
Finally, there may be some regularity or pattern in your new setting, e.g. downvotes for a comment, which may indicate spam/ham. In such cases, you could compile training data yourself. This would then be called distant supervision (you can search for papers using this keyword).
The best I could find was this research work, which mentions active learning. So what I came up with is this: I first perform k-means clustering and take the main clusters (assuming 5 clusters, I take the top 3 ordered by size, descending) and sample 1000 messages from each. I then hand those to the user to be labelled. The next step is to train a logistic regression on the labelled data and get the predicted probabilities for the unlabelled data. If a probability is close to 0.5, say in the range 0.4 to 0.6, the prediction is uncertain, so I hand that message over to be labelled too, and the process continues.
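The selection step of that loop, picking the messages whose predicted probability falls between 0.4 and 0.6, can be sketched as below. The probabilities are made up and stand in for the output of the logistic regression; the function name and the ordering by closeness to 0.5 are illustrative choices.

```python
def select_uncertain(probabilities, low=0.4, high=0.6):
    """Indices of items whose predicted spam probability lies in [low, high],
    most uncertain (closest to 0.5) first."""
    uncertain = [i for i, p in enumerate(probabilities) if low <= p <= high]
    return sorted(uncertain, key=lambda i: abs(probabilities[i] - 0.5))

# Made-up predicted spam probabilities for six unlabelled messages.
predicted_spam_probability = [0.05, 0.48, 0.93, 0.55, 0.61, 0.40]

to_label = select_uncertain(predicted_spam_probability)
print(to_label)
```

The returned indices are the messages to send for manual labelling before the classifier is retrained on the enlarged labelled set.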
