How to apply Information Gain in RapidMiner with a separate test set? - machine-learning

I am working on text classification in RapidMiner. I have separate training and test splits. I applied Information Gain to a dataset using n-fold cross-validation, but I am confused about how to apply it to a separate test set. The image is attached below.
In the figure I have connected the word list output from the first "Process Documents From Files" operator (used for training) to the second "Process Documents From Files" operator (used for testing). However, I want to apply the reduced feature set to the second "Process Documents From Files", which should presumably come from the "Select By Weight" (reduced dimensions) operator; but that operator returns weights, which I cannot feed into the second "Process Documents From Files". I searched a lot but did not manage to find anything that satisfies my need.
Is it really possible in RapidMiner to have separate test/train splits and apply feature selection?
Is there any way to convert these weights into a word list? Please don't say "write it to the repository" (I can't do that).
In such a scenario, when I have different test/train splits and need to apply feature selection, how do I make sure that the test and train splits have the same dimension vectors?
I am really stuck on this, kindly help...

Immediately after the lower Process Documents operator, insert a new Select By Weight operator before the Apply Model operator. Use a Multiply operator to copy the weights output by the Weight By Information Gain operator and connect this copy to the weights input of the new Select By Weight operator. That way the test set is reduced to exactly the attributes selected on the training side.
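For reference, the same fit-on-train, apply-to-test logic looks like this in scikit-learn (a hypothetical sketch with made-up texts; mutual information stands in for Information Gain):

    # Learn feature weights on the training split only, then keep exactly
    # the same reduced feature set on the test split.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_selection import SelectKBest, mutual_info_classif

    train_texts = ["cheap pills now", "meeting at noon", "win money fast"]
    train_labels = [1, 0, 1]
    test_texts = ["free money", "lunch meeting"]

    vectorizer = CountVectorizer()
    X_train = vectorizer.fit_transform(train_texts)  # word list built on train only
    X_test = vectorizer.transform(test_texts)        # same vocabulary reused

    selector = SelectKBest(mutual_info_classif, k=2)  # weights learned on train only
    X_train_sel = selector.fit_transform(X_train, train_labels)
    X_test_sel = selector.transform(X_test)           # same columns kept on test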

Related

Decision Tree for Choosing Most Likely Option?

I'm trying to find the right ML algorithm. Let's say I have three data columns. I have a binary outcome for each column (either the column belongs to the Group A classification or it does not), BUT in each set of three data columns that I feed in, exactly ONE column belongs to Group A.
Which algorithm can I choose to select the ONE BEST result of the three each time? Can I do this with a decision tree?
A decision tree such as ID3 can be suitable for this simple problem; the best way to find out is to run it on the data and check its output predictions.
ID3 does have a problem with overfitting, though.
Basically, every classifier can do a good job on this task. If the data is linearly separable, even an SVM is a good choice. I'd also suggest trying a basic neural network with 1-2 nodes in the output layer for two-group classification.
All of them are implemented in various packages and are fairly easy to use (in almost any programming language).
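Since exactly one of the three columns belongs to Group A, a natural trick is to score all three with a probabilistic classifier and take the argmax within each triple. A minimal scikit-learn sketch with made-up numbers (the shapes and features are purely illustrative):

    # Hypothetical sketch: score each candidate column, then pick the single
    # highest P(Group A) within every group of three.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # One row per candidate column; y = 1 if that candidate is Group A.
    X = np.array([[0.2, 1.3], [0.9, 0.1], [0.4, 0.7],
                  [1.1, 0.2], [0.3, 1.0], [0.5, 0.5]])
    y = np.array([1, 0, 0, 0, 1, 0])

    clf = LogisticRegression().fit(X, y)

    # Group the candidates in threes and take the argmax of P(Group A),
    # which enforces "exactly one per triple" by construction.
    probs = clf.predict_proba(X)[:, 1]
    winners = probs.reshape(-1, 3).argmax(axis=1)  # chosen column per triple
    print(winners)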

Can I prevent the J48 classifier from splitting on the same field more than x times?

Using a dataset, Weka and the J48 classifier I've got the following tree:
And it splits a lot on 'NumTweets' on the right side. Can I prevent J48 from doing more than a specified number of splits on one field? This is obviously overfitting my data on a specific field. Ideally I'd want it to reuse the same field in a branch only 3-4 times. Is there any way I can do this?
Thanks in advance!
To answer your first question: No, the WEKA explorer does not offer split limits on a specific attribute. This can only be done manually in code.
With that said, there are several things you can try here to limit the tree size/reduce overfitting.
You could try REPTree instead of J48. It uses the same splitting criteria as J48 but uses reduced-error pruning, and it has an option to limit the depth of the tree.
Decreasing the J48 pruning confidence (-C parameter) will result in more pruning and thus smaller tree size.
You can try to play around with the minNumObj (minimal number of instances reaching each leaf) parameter.
No. But you could set the J48 minNumObj config parameter higher. (The default value is 2.) This sets a constraint on the minimum number of data elements that each leaf node will have to contain.
This way (by trial and error) you can balance and/or simplify the decision tree to some extent.
Maybe you can drop or ignore the annoying attribute. Maybe discretizing NumTweets into bins (e.g. <1 tweet/day, 1-10 tweets/day, >10 tweets/day) would also help? This could be done with a Discretize filter on the Preprocess tab.
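For what it's worth, the same knobs exist outside Weka. A rough scikit-learn analogue (note: this is CART, not C4.5, so it only approximates J48's behaviour; min_samples_leaf plays the role of minNumObj and max_depth matches REPTree's depth limit):

    # Limiting tree complexity in scikit-learn (CART, used here only to
    # illustrate the same ideas; built-in iris data keeps it runnable).
    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    tree = DecisionTreeClassifier(
        min_samples_leaf=10,  # every leaf must cover at least 10 instances
        max_depth=4,          # hard depth cap, indirectly limits repeated splits
    ).fit(X, y)
    print(tree.get_depth(), tree.get_n_leaves())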

How to classify text with Knime

I'm trying to classify some data using KNIME with the KNIME Labs deep learning plugin.
I have about 16,000 products in my DB, but only about 700 of them have a known category.
I'm trying to classify as many as possible using some DM (data mining) technique. I've downloaded some plugins for KNIME, so I now have some deep learning tools as well as some text tools.
Here is my workflow; I'll use it to explain what I'm doing:
I'm transforming the product name into a vector, then feeding that in.
After that, I train a DL4J Learner with DeepMLP (I don't really understand it all; it was the one that I thought gave the best results). Then I try to apply the model to the same data set.
I thought I would get the predicted classes as a result, but I'm getting an output_activations column that appears to contain a pair of doubles. When I sort this column, related data ends up close together, but I was expecting to get the classes.
Here is a print of the result table, where you can see the output together with the input.
In the column selection it's using just converted_document, and I selected des_categoria as the Label Column (learner node config). In the Predictor node I checked "Append SoftMax Predicted Label?".
nom_produto is the text column I'm trying to use to predict the des_categoria column, which is the product category.
I'm a real newbie at DM and DL. Any help with what I'm trying to do would be awesome. Also feel free to suggest some learning material about what I'm attempting to achieve.
PS: I also tried to apply it to the unclassified data (17,000 products), but I got the same result.
I won't answer with a workflow on this one because it is not going to be a simple one. However, be sure to find the text mining example on the KNIME server, i.e. the one that makes use of the bag of words approach.
The task
Product mapping to categories should be a straightforward data mining task, because the information that explains the target variable is available in a quasi-exhaustive manner. Depending on the number of categories to train, though, there is a risk that you might need more than 700 instances to learn from.
Some resources
Here are some resources, only the first one being truly specialised in text mining:
Introduction to Information Retrieval, in particular chapter 13;
Data Science for Business is an excellent introduction to data mining, including text mining (chapter 10), also do not forget the chapter about similarity (chapter 6);
Machine Learning with R has the advantage of being accessible enough (chapter 4 provides an example of text classification with R code).
Preprocessing
First, you will have to preprocess your product labels a bit. Use KNIME's text analytics preprocessing nodes for that purpose, that is after you've transformed the product labels with Strings to Document:
Case Convert, Punctuation Erasure and Snowball Stemmer;
you probably won't need Stop Word Filter, however, there may be quasi-stop words such as "product", which you may need to remove manually with Dictionary Filter;
Be careful not to use any of the following without testing their impact first: N Chars Filter (g may be a useful word), Number Filter (numbers may indicate quantities, which may be useful for classification).
Should you encounter any trouble with the relevant nodes (e.g. Punctuation Erasure can be surprisingly tricky thanks to the tokenizer), you can always apply String Manipulation with regex before converting the Strings to Document.
Keep it short and simple: the lookup table
You could build a lookup table based on the 700 training instances. The book Data mining techniques as well as resource (2) present this approach in some detail. If any model performs any worse than the lookup table, you should abandon the model.
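As a rough illustration (with invented product names), the lookup table is little more than a dictionary with a majority-class fallback:

    # Lookup-table baseline in plain Python; the data is made up.
    from collections import Counter

    train = [("red apple 1kg", "fruit"), ("green apple", "fruit"),
             ("whole milk 1l", "dairy")]
    lookup = dict(train)
    default = Counter(c for _, c in train).most_common(1)[0][0]  # majority class

    def predict(label):
        return lookup.get(label, default)

    print(predict("red apple 1kg"))   # exact hit -> 'fruit'
    print(predict("unseen product"))  # fallback to the majority class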
Nearest neighbors
Neural networks are probably overkill for this task.
Start with a K Nearest Neighbor node (applying a string distance such as Cosine, Levenshtein or Jaro-Winkler). This approach requires the least amount of data wrangling. At the very least, it will provide an excellent baseline model, so it is most definitely worth a shot.
You'll need to tune the parameter k and experiment with the distance types. The Parameter Optimization Loop pair will help you optimize k; you can include a Cross-Validation meta node inside that loop to obtain an estimate of the expected performance for each value of k, instead of only one point estimate per value. Use Cohen's Kappa as the optimization criterion, as proposed by resource (3) and available via the Scorer node.
After the parameter tuning, you'll have to evaluate the relevance of your model using yet another Cross-Validation meta node, then follow up with a Loop pair including the Scorer to calculate descriptive statistics on your performance metric(s) per iteration, and finally use Statistics. Kappa is a convenient metric for this task because the target variable consists of many product categories.
Don't forget to test its performance against the lookup table.
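Outside KNIME, a minimal sketch of that baseline could look as follows in scikit-learn; the product names are invented, and character n-gram TF-IDF with cosine distance stands in for the string distances above (tuning k and scoring with Kappa, as described, are omitted for brevity):

    # Character n-gram TF-IDF + cosine k-NN as a quick text baseline.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import make_pipeline

    names = ["red apple 1kg", "green apple", "apple juice 1l",
             "orange juice", "whole milk 1l", "skim milk"]
    cats = ["fruit", "fruit", "drinks", "drinks", "dairy", "dairy"]

    knn = make_pipeline(
        TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
        KNeighborsClassifier(n_neighbors=1, metric="cosine"),
    )
    knn.fit(names, cats)
    print(knn.predict(["fresh milk 2l"]))  # expected: ['dairy']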
What next ?
Should the lookup table or k-NN work well for you, then there's nothing else to add.
Should any of those approaches fail, you might want to analyse the precise cases on which they fail. In addition, the training set size may be too low, so you could manually classify another few hundred or thousand instances.
If after increasing the training set size you are still dealing with a bad model, you can try the bag of words approach together with a Naive Bayes classifier (see chapter 13 of the Information Retrieval reference). There is no room here to elaborate on the bag of words approach and Naive Bayes, but you'll find the resources above useful for that purpose.
One last note. Personally, I find KNIME's Naive Bayes node to perform poorly, probably because it does not implement Laplace smoothing. However, KNIME's R Learner and R Predictor nodes will allow you to use R's e1071 package, as demonstrated by resource (3).
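For illustration, in scikit-learn the smoothing that the node seems to lack is simply the alpha parameter of MultinomialNB (toy data below):

    # Bag of words + multinomial Naive Bayes; alpha=1.0 is Laplace smoothing.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    names = ["red apple 1kg", "green apple", "whole milk 1l", "skim milk"]
    cats = ["fruit", "fruit", "dairy", "dairy"]

    model = make_pipeline(CountVectorizer(), MultinomialNB(alpha=1.0))
    model.fit(names, cats)
    print(model.predict(["apple 2kg"]))  # expected: ['fruit']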

modeling feature set with text documents

Example:
I have m sets of ~1000 text documents, ~10 are predictive of a binary result, roughly 990 aren't.
I want to train a classifier to take a set of documents and predict the binary result.
Assume for discussion that the documents each map the text to 100 features.
How is this modeled in terms of training examples and features? Do I merge all the text together and map it to a fixed set of features? Do I have 100 features per document * ~1000 documents (100,000 features) and one training example per set of documents? Do I classify each document separately and analyze the resulting set of confidences as they relate to the final binary prediction?
The most common way to handle text documents is with a bag-of-words model. The class proportions are irrelevant. Each word gets mapped to a unique index, and the value at that index equals the number of times that token occurs (there are smarter things to do). The number of features/dimensions is then the number of unique tokens/words in your corpus. There are many issues with this, and some of them are discussed here. But it works well enough for many things.
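A minimal sketch of that mapping (toy documents):

    # Bag of words: each unique token gets a column index; the value is
    # how many times that token occurs in the document.
    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["the cat sat", "the cat sat on the mat"]
    vec = CountVectorizer()
    X = vec.fit_transform(docs)
    print(vec.vocabulary_)  # token -> column index
    print(X.toarray())      # per-document token counts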
I would want to approach it as a two stage problem.
Stage 1: predict the relevancy of a document from the set of 1000. For best combination with stage 2, use something probabilistic (logistic regression is a good start).
Stage 2: Define features on the output of stage 1 to determine the answer to the ultimate question. These could be things like the counts of words for the n most relevant docs from stage 1, the probability of the most probable document, the 99th percentile of those probabilities, variances in probabilities, etc. Whatever you think will get you the correct answer (experiment!)
The reason for this is as follows: concatenating documents together will drown you in irrelevant information. You'll spend ages trying to figure out which words/features allow actual separation between the classes.
On the other hand, if you concatenate feature vectors together, you'll run into an exchangeability problem. By that I mean: word 1 in document 1 will be in position 1, word 1 in document 2 will be in position 1001, in document 3 it will be in position 2001, etc., and there will be no way to know that the features are all related. Furthermore, an alternate ordering of the documents would change the positions in the feature vector, and your learning algorithm won't be aware of this. Equally valid presentations of the document order will lead to completely different results in an entirely non-deterministic and unsatisfying way (unless you spend a long time designing a custom classifier that isn't afflicted with this problem, which might ultimately be necessary, but it's not the thing I'd start with).
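A rough sketch of the two-stage idea, with entirely synthetic data and arbitrary set sizes (everything here is illustrative):

    # Stage 1: per-document relevance probabilities; Stage 2: fixed-length
    # per-set features built from those probabilities.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    n_sets, docs_per_set, n_feat = 20, 50, 100

    docs = rng.random((n_sets * docs_per_set, n_feat))        # per-doc features
    doc_relevant = rng.integers(0, 2, n_sets * docs_per_set)  # per-doc labels
    set_label = rng.integers(0, 2, n_sets)                    # final binary outcome

    stage1 = LogisticRegression(max_iter=1000).fit(docs, doc_relevant)
    p = stage1.predict_proba(docs)[:, 1].reshape(n_sets, docs_per_set)

    feats = np.column_stack([
        np.sort(p, axis=1)[:, -5:],    # scores of the 5 most relevant docs
        np.percentile(p, 99, axis=1),  # 99th percentile of relevance
        p.var(axis=1),                 # spread of relevance scores
    ])
    stage2 = LogisticRegression(max_iter=1000).fit(feats, set_label)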

Weka: Results of each fold in 10-fold CV

For the Weka Explorer (GUI), when we do a 10-fold CV on any given ARFF file, what the Explorer provides (as far as I can see) is the average result over all 10 folds.
Q. Is there any way to get the results of each fold? For instance, I need the error rates (incorrectly identified instances) for each fold.
Help appreciated.
I think this is possible using Weka's GUI. You need to use the Experimenter though instead of the Explorer. Here are the steps:
Open the Experimenter from the GUI Chooser
Create a new experiment (New button, top-right)
[optional] Enter a filename and location in the Results Destination to save the results to
Set the Number of (cross-validation) folds to your liking (start experimenting with 2 folds for easy results)
Add your dataset (if your dataset needs preprocessing then you should do this in the Explorer first and then save the preprocessed dataset)
Set the Number of repetitions (I recommend 1 to start off with)
Add the algorithm(s) you want to test (again start easy, start with one algorithm)
Go to the Run tab and Start the experiment and wait till it finishes
Go to the Analyse tab and import the experiment results by clicking Experiment (top-right)
For Row select: Fold
For Column select: Percent_incorrect or Number_incorrect (or any other measure you want to see)
You now see the specified results for each fold
The Weka Explorer does not have an option to give the results for individual folds when using the cross-validation option, but there are some workarounds. If you explicitly don't want to change any code, you need to do some manual fiddling, but I think this gives more or less what you want:
Instead of Cross-validation, select Percentage split and set it to 90%
Start classifier
Click More options... and change the Random seed for XVal / % Split value to something you haven't used before.
Repeat ten times.
This is not exactly equivalent to 10-fold cross-validation though, since the pseudo-folds you make this way might overlap.
An alternative that is equivalent to cross-validation, but more cumbersome, would be to make the 10 folds manually using the unsupervised instance filters RemoveFolds or RemoveRange.
Generate and save 10 training sets and 10 test sets. Then, for every fold, load the training set, select Supplied test set in the Classify tab, and select the appropriate test fold.
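If scripting is an option at all, per-fold error counts are a short loop outside Weka as well; here is a sketch in scikit-learn (built-in dataset and a decision tree used purely as stand-ins):

    # Per-fold error counts with explicit 10-fold cross-validation.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import StratifiedKFold
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
    for i, (tr, te) in enumerate(cv.split(X, y)):
        model = DecisionTreeClassifier(random_state=1).fit(X[tr], y[tr])
        errors = (model.predict(X[te]) != y[te]).sum()
        print(f"fold {i}: {errors} incorrectly classified of {len(te)}")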
