Applying Multi-label Transformation in Rapidminer? - machine-learning

I am working on text categorization in RapidMiner and need to implement a problem transformation method to convert a multi-label data set into single-label ones, e.g. Label Powerset, but couldn't find one in RapidMiner. I am sure I am missing something, or maybe RapidMiner provides them under another name?
1) I searched and found the "Polynomial By Binomial" operator for RapidMiner, which I think uses Binary Relevance internally for problem transformation, but how can I apply the others, i.e. Label Powerset or Classifier Chains?
2) Secondly, the SVM (learner) inside the "Polynomial By Binomial" operator is applied K times (K = number of classes) and the K models are combined into a single model, but it would still classify a multi-label example (multiple labels) as a single-label example (one label). How can I get the multiple labels associated with an example?
3) Do I have to store each model generated inside "Polynomial By Binomial" and then apply each one to the testing data to find out the multiple labels associated with an example?
I am new to RapidMiner, so please excuse my mistakes.
Thanks in advance.

Polynomial by Binomial is not the way you want to go.
This operator performs something like one-vs-all (X vs. all). It enables you to solve multiclass problems with a learner that is only capable of binomial classification.
For your problem:
What you want is to transform your table like this:
before:
ID Label
1 A|B|C
2 B|C
to
ID Label
1 A
2 B
3 C
4 B
5 C
The tricky thing here is how to calculate the performance. But I think once this is clear, a combination of the Remember/Recall, Remove Duplicates and Join operators will do it.
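Outside RapidMiner, the transformation itself is easy to sketch. Here is a minimal illustration in Python with pandas (hypothetical column names matching the toy table above); keeping the original ID on each exploded row is what later makes the Join back to the examples possible:

import pandas as pd

# Toy multi-label table: one row per example, labels joined by '|'
df = pd.DataFrame({"ID": [1, 2], "Label": ["A|B|C", "B|C"]})

# Split the label string into a list, then explode to one row per label
single_label = (
    df.assign(Label=df["Label"].str.split("|"))
      .explode("Label")
      .reset_index(drop=True)
)
print(single_label)
# Each original example now contributes one row per label it carries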

Related

How to use a material number as a feature for Machine Learning?

I have a problem. I would like to use a classification algorithm, and for this I have a column materialNumber; as the name suggests, the column represents the material number.
How could I use that as a feature for my machine learning algorithm?
I cannot use it e.g. as a one-hot encoding matrix, because there are too many different material numbers (~4500 unique material numbers).
How can I use this column in a classification algorithm? Do I need to standardize/normalize it? I would like to use a RandomForest classifier.
customerId materialNumber
0 1 1234.0
1 1 4562.0
2 2 1234.0
3 2 4562.0
4 3 1547.0
5 3 1547.0
Here you can group material numbers by categorizing them. If you want to use a categorical variable in a machine learning algorithm, as you mentioned, you have to use one-hot encoding. But then, as the number of unique material number values increases, the number of columns in your data will also increase.
For example, you have a material number like this:
material_num_list=[1,2,3,4,5,6,7,8,9,10]
Suppose the numbers fall into groups of similar ones, for example:
[1,5,6,7], [2,3,8], [4,9,10]
We can assign a category to each group ourselves:
[1,5,6,7] --> A
[2,3,8] --> B
[4,9,10] --> C
As you can see, the number of categories has decreased, and we can do one-hot encoding with far fewer labels.
But the data set needs to be examined carefully here, and this grouping needs to be done in a reasonable way. It might work if you can categorize the material numbers as I mentioned.
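As a rough sketch of that idea in Python with pandas (the grouping dictionary below is completely made up; in practice it has to come from domain knowledge about the materials):

import pandas as pd

df = pd.DataFrame({
    "customerId": [1, 1, 2, 2, 3, 3],
    "materialNumber": [1234.0, 4562.0, 1234.0, 4562.0, 1547.0, 1547.0],
})

# Hypothetical grouping of similar material numbers into coarse categories
material_group = {1234.0: "A", 4562.0: "B", 1547.0: "C"}
df["materialGroup"] = df["materialNumber"].map(material_group)

# One-hot encode the (now much smaller) set of groups
features = pd.get_dummies(df[["customerId", "materialGroup"]],
                          columns=["materialGroup"])
print(features)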

How to decide the numClasses parameter to be passed to the Random Forest algorithm in Spark MLlib with pySpark

I am working on classification using the Random Forest algorithm in Spark and have a sample dataset that looks like this:
Level1,Male,New York,New York,352.888890
Level1,Male,San Fransisco,California,495.8001345
Level2,Male,New York,New York,-495.8001345
Level1,Male,Columbus,Ohio,165.22352099
Level3,Male,New York,New York,495.8
Level4,Male,Columbus,Ohio,652.8
Level5,Female,Stamford,Connecticut,495.8
Level1,Female,San Fransisco,California,495.8001345
Level3,Male,Stamford,Connecticut,-552.8234
Level6,Female,Columbus,Ohio,7000
Here the last value in each row will serve as the label and the rest serve as features. But I want to treat the label as a category and not a number, so 165.22352099 will denote a category and so will -552.8234. For this I have encoded my features as well as the label into categorical data. What I am having difficulty with is deciding what I should pass for the numClasses parameter of the Random Forest algorithm in Spark MLlib. Should it be equal to the number of unique values in my label? My label has around 10000 unique values, so if I put 10000 as the value of numClasses, wouldn't that decrease the performance dramatically?
Here is the typical signature of building a model for Random Forest in MlLib:
model = RandomForest.trainClassifier(trainingData, numClasses=2, categoricalFeaturesInfo={},
numTrees=3, featureSubsetStrategy="auto",
impurity='gini', maxDepth=4, maxBins=32)
The confusion comes from the fact that you are doing something that you should not do. Your problem is clearly regression/ranking, not classification. Why would you think about it as classification? Try to answer these two questions:
Do you have at least 100 samples for each value (10,000 * 100 = 1,000,000)?
Is there really no structure in the classes? For example, are objects with value "200" no more similar to those with value "100" or "300" than to those with value "-1000" or "+2300"?
If at least one answer is no, then you should not treat this as a classification problem.
If for some weird reason you answered yes twice, then the answer is: "yes, you should encode each distinct value as a different class", thus leading to 10000 unique classes, which leads to:
extremely imbalanced classification (RF without a balancing meta-learner will nearly always fail in such a scenario)
an extreme number of classes (there are no models able to solve it; RF certainly will not)
an extremely small dimensionality of the problem: given how few features you have, I would be surprised if you could even predict a binary classification from them. You can also see how irregular these values are: you have 3 points which differ only in the first value and yet get completely different results:
Level1,Male,New York,New York,352.888890
Level2,Male,New York,New York,-495.8001345
Level3,Male,New York,New York,495.8
So to sum up: with nearly 100% certainty this is not a classification problem. You should either:
regress on the last value (keyword: regression); see the sketch after this list
build a ranking (keyword: learning to rank)
bucket your values into at most 10 different values and then classify (keywords: imbalanced classification, sparse binary representation)
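For the regression route, the MLlib call is almost the same as the classifier shown above; a minimal sketch, assuming trainingData is an RDD of LabeledPoint with the numeric value as the label:

from pyspark.mllib.tree import RandomForest

# Regress on the numeric value instead of treating each distinct value as
# a class; for regression trees the impurity must be 'variance'.
model = RandomForest.trainRegressor(trainingData, categoricalFeaturesInfo={},
                                    numTrees=3, featureSubsetStrategy="auto",
                                    impurity='variance', maxDepth=4, maxBins=32)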

Weka - semi supervised learning - how to label data and get back the result?

I have just started to use Weka and I would like to ask you something. I have installed the collective classification package and I have a simple training data set.
X Y Label
--------------------
1 2 Class 1
3 2 Class 1
3 3 Unknown class
4 2 Unknown class
11 12 Unknown class
15 20 Unknown class
Is it possible to somehow get the data back from Weka labelled? Maybe I don't understand the semi-supervised method, because in my opinion it's used to label other data once I have labelled a small subset.
In my case, I would like to annotate several normal instances, get labels for other similar instances, and in the end detect anomalous instances.
Thank you for your advice.
My understanding is that you would like to store the predicted labels of your model into your missing labels.
What you could do is right-click on the Model after training, then select 'Visualize Classifier Errors'. In this visualization screen, set Y as the predicted class and then save the new ARFF. This datafile should then contain the predicted and class labels.
From there, you could try to replace the Missing Values with the predicted labels.
I hope this assists in the problem that you are experiencing.
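I can't show the Weka steps as code here, but to illustrate the same round trip (hand in partially labelled data, read the inferred labels back) outside Weka, here is a minimal sketch with scikit-learn's LabelPropagation instead of the collective classification package. The data is a toy version of the table above, with one extra labelled class added purely so the propagation has two classes to separate:

import numpy as np
from sklearn.semi_supervised import LabelPropagation

# Toy version of the table above; -1 marks the "Unknown class" rows.
# The last point gets a second label here purely for illustration.
X = np.array([[1, 2], [3, 2], [3, 3], [4, 2], [11, 12], [15, 20]])
y = np.array([0, 0, -1, -1, -1, 1])

model = LabelPropagation(gamma=0.1).fit(X, y)

# transduction_ holds the label inferred for every row,
# including the previously unlabelled ones.
print(model.transduction_)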

Conditional Random Field feature functions

I've been reading some papers on CRFs and am slightly confused about the feature functions. Unary (node) and binary (edge) features f are normally of the form
f(y_c, x_c) = 1{y_c = ỹ_c} f_g(x_c)
where 1{.} is the indicator function, evaluating to 1 if the enclosed condition is true and 0 otherwise, and f_g is a function of the data x_c which extracts useful attributes (features) from it.
Now it seems to me that to create CRF features the true labels (y_c) must be known. This is true for training, but in the testing phase the true class labels are unknown (since we are trying to determine their most likely values).
Am I missing something? How can this be correctly implemented?
The idea with the CRF is that it assigns a score to each setting of the labels. So what you do, notionally, is compute the scores for all possible label assignments, and whichever labeling gets the biggest score is what the CRF predicts/outputs. This only makes sense if the CRF gives different scores to different label assignments, and when you think of it that way it's clear that the labels must be involved in the feature functions for this to work.
So let's say the log-probability function for your CRF is F(x, y): it assigns a number to each combination of a data sample x and a labeling y. When you get a new data sample, the predicted labeling at test time is just argmax_y F(new_x, y). That is, you find the value of y that makes F(new_x, y) biggest, and that's the predicted labeling.
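As a toy sketch of that last step (pure Python, with a made-up scoring function F standing in for the trained CRF), prediction is just a search over labelings:

from itertools import product

labels = ["A", "B"]       # possible label values for each position
new_x = [2.0, -1.0, 3.0]  # a new, unlabelled data sample

def F(x, y):
    # Stand-in for the learned log-score: made-up node scores (label agrees
    # with the sign of the feature) plus edge scores (neighbouring positions
    # that share a label get a bonus).
    node = sum(1.0 if (xi > 0) == (yi == "A") else 0.0 for xi, yi in zip(x, y))
    edge = sum(0.5 for a, b in zip(y, y[1:]) if a == b)
    return node + edge

# Brute-force decoding: score every possible labeling and keep the argmax.
best_y = max(product(labels, repeat=len(new_x)), key=lambda y: F(new_x, y))
print(best_y)

In practice you never enumerate all labelings; for chain CRFs the argmax is computed with dynamic programming (Viterbi), but the principle is exactly the brute-force version above.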

How are binary classifiers generalised to classify data into arbitrarily large sets?

How can algorithms which partition a space into halves, such as Support Vector Machines, be generalised to label data with labels drawn from sets such as the integers?
For example, a support vector machine operates by constructing a hyperplane and then things 'above' the hyperplane take one label, and things below it take the other label.
How does this get generalised so that the labels are, for example, integers, or some other arbitrarily large set?
One option is the 'one-vs-all' approach, in which you create one classifier for each set you want to partition into, and select the set with the highest probability.
For example, say you want to classify objects with a label from {1,2,3}. Then you can create three binary classifiers:
C1 = 1 or (not 1)
C2 = 2 or (not 2)
C3 = 3 or (not 3)
If you run these classifiers on a new piece of data X, then they might return:
C1(X) = 31.6% chance of being in 1
C2(X) = 63.3% chance of being in 2
C3(X) = 89.3% chance of being in 3
Based on these outputs, you could classify X as most likely being from class 3. (The probabilities don't add up to 1 - that's because the classifiers don't know about each other).
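If you want to see this done mechanically, scikit-learn's OneVsRestClassifier wraps a binary learner in exactly this way (a small sketch with made-up data and labels from {1, 2, 3}):

import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Toy data with labels from {1, 2, 3}
X = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, 0.5],
              [3.0, 0.0], [4.0, 1.5], [5.0, 2.0]])
y = np.array([1, 1, 2, 2, 3, 3])

# One binary SVM per class; prediction picks the class whose binary
# classifier is most confident, as described above.
clf = OneVsRestClassifier(SVC()).fit(X, y)
print(clf.predict([[4.5, 1.8]]))             # most likely class
print(clf.decision_function([[4.5, 1.8]]))   # per-class confidence scores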
If your output labels are ordered (with some kind of meaningful, rather than arbitrary, ordering), you can sometimes exploit that. For example, in finance you might want to classify stocks into {BUY, SELL, HOLD}. Although you can't legitimately perform a regression on these (the data is ordinal rather than ratio data), you can assign the values -1, 0 and 1 to SELL, HOLD and BUY and then pretend that you have ratio data. Sometimes this can give good results even though it's not theoretically justified.
Another approach is the Crammer-Singer method ("On the Algorithmic Implementation of Multiclass Kernel-based Vector Machines").
Svmlight implements it here: http://svmlight.joachims.org/svm_multiclass.html.
Classification into an infinite set (such as the set of integers) is called ordinal regression. Usually this is done by mapping a range of continuous values onto an element of the set. (see http://mlg.eng.cam.ac.uk/zoubin/papers/chu05a.pdf, Figure 1a)
