Weka Random Forest property definitions - random-forest

I am using Weka for classifying a binary dataset and would like to use RandomForest as the classifier. I found that RandomForest has a number of properties, and I would like to know what those properties mean and how they work during classification. Weka shares some details, which can be found in the More tab of RandomForest's property window.
I am attaching the screenshot here.
These are the definitions given for some of those properties:
bagSizePercent -- Size of each bag, as a percentage of the training set size.
numIterations -- The number of trees in the random forest.
outputOutOfBagComplexityStatistics -- Whether to output complexity-based statistics when out-of-bag evaluation is performed.
numFeatures -- Sets the number of randomly chosen attributes. If 0, int(log_2(#predictors) + 1) is used.
and so on.
Those one-line descriptions do not clear up my doubts. It would be very helpful if someone could share an in-depth view of these properties.
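For what it's worth, the same ideas exist outside Weka. Below is a rough scikit-learn analogue of the properties listed above; the parameter names are sklearn's, not Weka's, and the mapping is only approximate, so treat it as an illustrative sketch rather than Weka code.

```python
# Rough scikit-learn analogue of the Weka RandomForest properties above.
# Parameter names differ from Weka; the mapping is approximate.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=42)

clf = RandomForestClassifier(
    n_estimators=100,     # ~ numIterations: number of trees in the forest
    max_samples=1.0,      # ~ bagSizePercent: each bag is 100% of the training set size
    bootstrap=True,       #   bags are drawn with replacement
    max_features="sqrt",  # ~ numFeatures: attributes randomly tried at each split
                          #   (Weka's default of 0 uses int(log_2(#predictors) + 1))
    oob_score=True,       # evaluate on the out-of-bag samples, as Weka also can
    random_state=42,
)
clf.fit(X, y)
print("OOB accuracy:", clf.oob_score_)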

Related

Multi-label classification using SVM for text data

I have data in an Excel file that I need to use to perform multi-label classification using SVM. It has two columns, as shown below: 'tweet' (values A,B,C,D,E,F,G) and 'category' (label sets drawn from X,Y,Z).
tweet   category
A       X
B       Y
C       Z
D       X,Y
E       Y,Z
F       X,Y,Z
G       X,Z
Given a tweet, I want to train my model to predict the category it belongs to. Both the tweets and the categories are text. I am trying to use Weka's LibSVM classifier to do the classification, as I read that it does multi-label classification. I converted the CSV file to an ARFF file and loaded it in Weka. I then ran the "LibSVM" classifier. However, I am getting very poor results, as shown below. Any idea what I am doing wrong? Is multi-label text classification even possible with "LibSVM"?
Correctly Classified Instances 82 25.9494 %
Incorrectly Classified Instances 234 74.0506 %
Kappa statistic 0
Mean absolute error 0.0423
Root mean squared error 0.2057
Relative absolute error 89.9823 %
Root relative squared error 134.3377 %
Total Number of Instances 316
SVM can definitely be used for multiclass classification.
I have not used Weka's LibSVM before, but if you haven't already, you will need to do some data cleaning before you feed text into any sort of classifier.
The type of cleaning also depends on your classification task, but you can look into the following techniques, which are used in practice for text analysis (a minimal sketch follows the list):
1) Remove Twitter handles from your text.
2) Remove stop words or words that you know for sure do not impact your classification. Maybe you can preserve only pronouns and remove all other words. You can use POS tagging to perform this task. More info here.
3) Remove punctuation.
4) Use n-grams to get contextual meaning out of your text. This site has some good explanation of how that works. Essentially, this means you would treat a sequence of words as a feature rather than using a single word as a data point in your model. Mind you, this might increase the amount of memory your model occupies while training.
5) Remove words that occur either too frequently or too rarely in your data set.
6) Balance your classes, or categories in your case. This means that before training your model, you should make sure the training data has a similar number of X, Y and Z categories. It is possible that your training data had a lot of tweets that classify to X and Y, but your test set had tweets that mostly mapped to the Z category.
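Here is a minimal preprocessing and training sketch along the lines of the list above, using scikit-learn rather than Weka for brevity. The file name tweets.csv and the column names are assumptions, and it is single-label for simplicity.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split

# Assumed input: a CSV with "tweet" and "category" columns.
df = pd.read_csv("tweets.csv")
df["tweet"] = (df["tweet"]
               .str.replace(r"@\w+", " ", regex=True)       # 1) drop Twitter handles
               .str.replace(r"[^\w\s]", " ", regex=True))   # 3) drop punctuation

vec = TfidfVectorizer(
    stop_words="english",   # 2) remove stop words
    ngram_range=(1, 2),     # 4) unigrams + bigrams for some context
    min_df=2, max_df=0.9,   # 5) drop very rare and very frequent terms
)
X = vec.fit_transform(df["tweet"])
y = df["category"]

# 6) stratify keeps the class balance similar in the train and test splits
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)
clf = LinearSVC().fit(X_tr, y_tr)
print("accuracy:", clf.score(X_te, y_te))
```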

Does a random forest randomly sample the data for each tree?

I appreciate that bagging randomly resamples the training set for each tree, and that random forests randomly select a subset of features for each tree.
My question, though, is: does a random forest also resample the training set as well as taking a random subset of features? Is it, in effect, doubly random?
The answer is yes, most of the time, if you want it to be.
Random forests bootstrap the data and randomly select features.
Bootstrapping means that it samples a dataset of the same size as the original dataset, but with replacement. So if you have N data points, each tree will use N data points, but some may be duplicated (as they are sampled one by one with replacement).
However, it really is up to you what you do. In the sklearn implementation, the default is to bootstrap, but you can set bootstrap=False, and then you only have the random feature selection.
See the documentation here:
http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
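To make the "double randomness" concrete, here is a small sketch (assuming scikit-learn): a manual bootstrap sample to show what sampling with replacement looks like, and the bootstrap flag that turns row resampling on or off while keeping the per-split feature randomness.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=8, random_state=0)

# What "bootstrap" means: a sample of the same size N, drawn with replacement,
# so some rows appear more than once and roughly a third are left out.
rng = np.random.default_rng(0)
idx = rng.choice(len(X), size=len(X), replace=True)
print(len(np.unique(idx)), "unique rows out of", len(idx))  # ~63% unique on average

# Double randomness: bootstrapped rows plus a random feature subset at each split.
rf = RandomForestClassifier(bootstrap=True, max_features="sqrt", random_state=0).fit(X, y)

# Feature randomness only: every tree sees the full training set.
rf_full = RandomForestClassifier(bootstrap=False, max_features="sqrt", random_state=0).fit(X, y)
```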

Text Classification Technique for this scenario

I am completely new to Machine Learning algorithms and I have a quick question with respect to Classification of a dataset.
Currently there is a training data that consists of two columns Message and Identifier.
Message - Typical message extracted from Log containing timestamp and some text
Identifier - Should classify the category based on the message content.
The training data was prepared by extracting a particular category from the tool and labelling it accordingly.
Now the test data contains just the message and I am trying to obtain the Category accordingly.
Which approach is most helpful in this scenario? Is it supervised or unsupervised learning?
I have a labelled training dataset and I am trying to predict the category for the test data.
Thanks in advance,
Adam
If your labels are exact, then you can classify using ANN, SVM, etc. But if the labels are not exact, you have to cluster the data with respect to the features you have. K-means or nearest neighbour can be a starting point for clustering.
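If the labels really were unreliable and you had to fall back to clustering, a minimal k-means sketch might look like this (scikit-learn shown for illustration; the example messages are made up, and the answers below note that the problem as stated is supervised):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Toy stand-ins for log messages; real data would come from your extract.
messages = ["2023-01-01 12:00 disk full on /var",
            "2023-01-01 12:05 disk quota exceeded",
            "2023-01-01 12:10 user alice logged in",
            "2023-01-01 12:15 user bob logged out"]

X = TfidfVectorizer().fit_transform(messages)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)  # cluster ids, e.g. [0 0 1 1]; you would then inspect and name the clusters
```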
It is supervised learning, and a classification problem.
However, obviously you do not have the label column (the to-be-predicted value) for your testset. Thus, you cannot calculate error measures (such as False Positive Rate, Accuracy etc) for that test set.
You could, however, split the set of labeled training data that you do have into a smaller training set and a validation set. Split it 70%/30%, perhaps. Then build a prediction model from your smaller 70% training dataset. Then tune it on your 30% validation set. When accuracy is good enough, then apply it on your testset to obtain/predict the missing values.
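For instance, assuming scikit-learn and made-up message/category values, the 70%/30% split could be sketched like this:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Toy labelled training data (stand-ins for your Message/Identifier columns).
messages = ["disk full on /var", "quota exceeded on /home",
            "user alice logged in", "user bob logged out",
            "connection timeout", "network unreachable"] * 10
labels   = ["DISK", "DISK", "AUTH", "AUTH", "NETWORK", "NETWORK"] * 10

X = TfidfVectorizer().fit_transform(messages)

# 70% training / 30% validation split, as described above.
X_tr, X_val, y_tr, y_val = train_test_split(X, labels, test_size=0.3,
                                            stratify=labels, random_state=0)

model = MultinomialNB().fit(X_tr, y_tr)
print("validation accuracy:", accuracy_score(y_val, model.predict(X_val)))
# Once this looks good enough, refit on all labelled data and predict the unlabelled test set.
```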
Which techniques / algorithms to use is a different question. You do not give enough information to answer that. And even if you did you still need to tune the model yourself.
You have labels to predict, and training data.
So by definition it is a supervised problem.
Try any classifier for text, such as NB, kNN, SVM, ANN, RF, ...
It's hard to predict which will work best on your data. You will have to try and evaluate several.

labelling of dataset in machine learning

I have a question about some basic concepts of machine learning. The examples I have looked at only give a brief overview. For training the system, a feature vector is given as input. In the case of supervised learning, the dataset is labelled. I am confused about labelling. For example, if I have to distinguish between two types of pictures, I will provide a feature vector, and on the output side for testing I'll provide 1 for type A and 2 for type B. But if I want to extract a region of interest from a dataset of images, how will I label my data to extract the ROI using SVM? I hope I have been able to convey my confusion. Thanks in anticipation.
In supervised learning, such as SVMs, the dataset should be composed as follows:
<i-th feature vector><i-th label>
where i goes from 1 to the number of patterns (also called examples or observations) in your training set. Each such tuple represents a single record in your training set that can be used to train the SVM classifier.
So you basically have a set composed of such tuples, and if you have just 2 labels (a binary classification problem) you can easily use an SVM. The SVM model is trained on the training set and the training labels, and once the training phase has finished you can use another set (called the validation set or test set), which is structured in the same way as the training set, to test the accuracy of your SVM.
In other words, the SVM workflow should be structured as follows (a minimal sketch appears after the list):
train the SVM using the training set and the training labels
predict the labels for the validation set using the model trained in the previous step
if you know what the actual validation labels are, you can match the predicted labels against them and check how many were correctly predicted. The ratio between the number of correctly predicted labels and the total number of labels in the validation set is a scalar in [0, 1] and is called the accuracy of your SVM model.
if you're interested in the ROI, you might want to check the trained SVM parameters (mainly the weights and bias) to reconstruct the separating hyperplane.
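A compact sketch of that workflow with a linear SVM (scikit-learn used for illustration; Weka's SMO/LibSVM follow the same train/predict/score pattern):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=300, n_features=5, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=1)

svm = LinearSVC().fit(X_train, y_train)   # 1. train on the training set and its labels
pred = svm.predict(X_val)                 # 2. predict labels for the validation set
print(accuracy_score(y_val, pred))        # 3. fraction of correct predictions, in [0, 1]
print(svm.coef_, svm.intercept_)          # 4. weights and bias of the separating hyperplane
```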
It is also important to know that the training set records must be correctly labelled a priori: if the training labels are not correct, the SVM will never be able to correctly predict the output for previously unseen patterns. You do not have to label your data according to the ROI you want to extract; the data must be correctly labelled a priori. The SVM will have the entire set of type A pictures and the set of type B pictures and will learn the decision boundary that separates them. You should not trick the labels: if you do, you're not doing classification and/or machine learning and/or pattern recognition; you're basically tricking the results.

Using weka to classify sensor data

I am working on a classification problem involving different sensors. Each sensor collects a set of numeric values.
I think it's a classification problem and want to use Weka as the ML tool for it. But I am not sure how to get Weka to deal with the input values, and which classifier will best fit this problem (one instance of a feature is a set of numeric values)?
For example, I have three sensors A, B, C. Can I define 5 collected readings from all sensors as one instance? For instance, one instance of A is {1,2,3,4,5,6,7}, one instance of B is {3,434,534,213,55,4,7}, and one instance of C is {424,24,24,13,24,5,6}.
Thanks a lot for your time reviewing my question.
Commonly the first classifier to try is Naive Bayes (you can find it under the "Bayes" directory in Weka) because it's fast, parameterless, and its classification accuracy is hard to beat whenever the training sample is small.
Random Forest (you can find it under the "Tree" directory in Weka) is another pleasant classifier, since it processes almost any data. Just run it and see whether it gives better results. It may just be necessary to increase the number of trees from the default 10 to some higher value. Since you have 7 attributes, 100 trees should be enough.
Then I would try k-NN (you can find it under the "Lazy" directory in Weka, where it's called "IBk") because it commonly ranks among the best single classifiers for a wide range of datasets. The only issues with k-NN are that it scales badly for large datasets (> 1GB) and that k, the number of neighbors, needs fine-tuning. This value is set to 1 by default, but with an increasing number of training samples it's commonly better to set it to some higher integer value in the range from 2 to 60.
And finally, for some datasets where both Naive Bayes and k-NN perform poorly, it's best to use SVM (under "Functions", it's called "LibSVM"). However, it can be a hassle to set up all the parameters of the SVM to get competitive results, hence I leave it to the end, when I already know what classification accuracies to expect. This classifier may not be the most convenient if you have more than two classes to classify.
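If you want to compare those suggestions quickly outside the Weka GUI, a sketch like the following (scikit-learn equivalents on toy data, purely illustrative) runs all four with 10-fold cross-validation:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Toy data with 7 attributes, as in the question.
X, y = make_classification(n_samples=500, n_features=7, random_state=0)

classifiers = [
    ("Naive Bayes",         GaussianNB()),
    ("Random Forest (100)", RandomForestClassifier(n_estimators=100, random_state=0)),
    ("k-NN (k=5)",          KNeighborsClassifier(n_neighbors=5)),
    ("SVM (RBF)",           SVC()),
]
for name, clf in classifiers:
    scores = cross_val_score(clf, X, y, cv=10)
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```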
