I'm working on a classification task in Weka and have run into the problem that the class I want to predict has one very frequent value (about 85% of instances). This leads many learning algorithms to simply predict that majority value for every new instance.
How can I deal with this problem? Does it just mean that I haven't found features that are predictive enough, or is there something specific I can do about the imbalance?
I guess this is a pretty common problem, but I was not able to find a solution to it here.
You need to "SMOTE" your data. First figure out how many more instances of the minority class you need. In my case I wanted roughly a 50/50 ratio, so I needed to oversample by 1300 percent. This tutorial will help if you are using the GUI: http://www.youtube.com/watch?v=w14ha2Fmg6U If you are doing this from the command line, the following Weka command will get you going:
#Weka 3.7.7
java weka.Run -no-scan weka.filters.supervised.instance.SMOTE \
-c last -K 25 -P 1300.0 -S 1 -i input.arff -o output.arff
The -K option sets the number of nearest neighbours to consider when generating the synthetic instances. The default is 5, but 25 worked best for my dataset.
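For reference, here is the arithmetic behind the -P value, using made-up class counts (your own counts will differ):
# Hypothetical: 8500 majority instances (85%) vs 1500 minority (15%).
# To reach roughly 50/50 you need 8500 - 1500 = 7000 extra minority
# instances, i.e. 7000 / 1500 = 4.67, so -P 467:
java weka.Run -no-scan weka.filters.supervised.instance.SMOTE \
-c last -K 5 -P 467.0 -S 1 -i input.arff -o output.arff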
We have extracted features from search engine query log data, and the feature file (in Vowpal Wabbit's input format) amounts to 90.5 GB. The reason for this huge size is necessary redundancy in our feature construction. Vowpal Wabbit claims to be able to handle TBs of data in a matter of hours. In addition to that, VW uses a hash function which takes almost no RAM. But when we run logistic regression using VW on our data, it uses up all of the RAM within a few minutes and then stalls.
This is the command we use:
vw -d train_output --power_t 1 --cache_file train.cache -f data.model \
--compressed --loss_function logistic --adaptive --invariant \
--l2 0.8e-8 --invert_hash train.model
Here train_output is the input file we want to train VW on, and train.model is the model we expect to obtain after training.
Any help is welcome!
I've found the --invert_hash option to be extremely costly; try running without it. You can also try turning on --l1 regularization to reduce the number of coefficients kept in the model.
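For example, a sketch of the same training run with --invert_hash dropped and L1 regularization switched on (the --l1 value here is just a placeholder to tune):
vw -d train_output --power_t 1 --cache_file train.cache -f data.model \
--compressed --loss_function logistic --adaptive --invariant \
--l2 0.8e-8 --l1 1e-6
If you still need to inspect weights afterwards, --readable_model is a much cheaper alternative to --invert_hash, at the cost of printing hashed feature indices instead of original feature names.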
How many features do you have in your model? How many features per row are there?
Using Vowpal Wabbit and its Python interactor for active learning, I've gotten to the point of being able to send messages back and forth between the client and the server, but I am having problems with seeding.
When I seed the model with the current command:
python active_interactor.py --verbose -o labelled.txt --seed data.seed localhost 12345 unlabelled.txt
The interactor sends the examples to the server (I know this because the server updates the model and produces debug output), but when it feeds in the unlabelled examples and asks for labels in response, the predictions are always 0.
My question is: is the model not being seeded? If not, why are the predictions always 0 even though there is a model?
It should be noted that the same data can be successfully used to create a passive model that gives non-zero predictions, so I do not think the problem is with the training data.
---UPDATE---
After looking at the tests, we changed the vw server command to match them, paying attention to two parameters that had previously been left at their defaults, namely initial_t and -l:
vw -f final.model --active_learning --active_mellowness 0.000001 --daemon --port 12345 --initial_t 10 -l 10
After doing this, predictions are produced. It also works when -l is left at its default. We will now do a grid search to find the best parameters. One question, though: why did low values of initial_t lead to no predictions?
I'm using Mahout's (v0.7) parallel FPG algorithm in CLI mode to generate frequent patterns. The algorithm works fine and generates the frequent patterns correctly.
The problem I'm having is that the algorithm always uses one reducer in the second stage of pattern generation. It creates one mapper per input split, but in the second stage all the mappers send their output to a single reducer, which significantly slows down the algorithm.
I even tried setting the -Dmapred.reduce.tasks parameter to override the default number of reducers, but it did not work.
I would like to split the work of the second stage to multiple reducers, if possible.
The Mahout FPG command I use:
mahout fpg \
-i /path/to/input \
-o /path/to/output \
-s 5 \
-k 100 \
-method mapreduce
You can change the number of mappers and reducers by adding this at the end of your command:
-Dmapred.map.tasks=1000 -Dmapred.reduce.tasks=1000
In my case I couldn't change the number of mappers with this parameter, but I was always able to control the number of reducers.
It's possible that your data currently fits into a single mapper split.
You can make the split size smaller with something like:
-Dmapred.max.split.size=1048576
(That would reduce the split size to 1024*1024 = 1 MB; I've gone even smaller with Mahout before, e.g. -Dmapred.max.split.size=131072 for 128 KB splits on very CPU-intensive jobs.)
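Putting those options together with the command from the question, a sketch (the reducer count and split size here are illustrative and should be tuned to your cluster; note that some Mahout/Hadoop versions are picky about where the -D options are placed):
mahout fpg \
-Dmapred.reduce.tasks=16 \
-Dmapred.max.split.size=1048576 \
-i /path/to/input \
-o /path/to/output \
-s 5 \
-k 100 \
-method mapreduce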
I am trying to solve a numeric classification problem with numeric attributes in WEKA using linear regression, and then I want to test my model on the existing dataset with "Re-evaluate model on current test set".
As a result of the evaluation I am getting the summary:
Correlation coefficient 0.9924
Mean absolute error 1.1017
Root mean squared error 1.2445
Total Number of Instances 17
But I don't get the actual and predicted values for each instance, as shown here: http://weka.wikispaces.com/Making+predictions
How can I get WEKA to produce the output I need?
Thank you.
To answer my own question: for a trained and tested model, right-click on the model in the result list and choose "Visualize classifier errors". There, use the Save option to save the actual and predicted values.
Are you using the command line interface (CLI) or the GUI?
If the CLI, the command given in the link above works fine:
java weka.classifiers.trees.J48 -T unclassified.arff -l j48.model -p 0
So when you train the model you save it as *.model (j48.model) and later use it to evaluate on the test data (unclassified.arff).
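Since you are using linear regression rather than J48, the analogous pair of commands would be roughly as follows (the file names are placeholders):
java weka.classifiers.functions.LinearRegression -t train.arff -d lr.model
java weka.classifiers.functions.LinearRegression -T test.arff -l lr.model -p 0
The first command trains on train.arff and saves the model to lr.model; the second loads that model and prints the predicted value for each instance in test.arff.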
I want to classify using libsvm. I have 9 training sets; each set has 144000 labelled instances, and each instance has a variable number of features. It is taking about 12 hours to train one set (./svm-train with probability estimates). As I don't have much time, I would like to run more than one set at a time, but I'm not sure if I can do this. Can I run all 9 processes simultaneously in different terminals?
./svm-train -b 1 feat1.txt
./svm-train -b 1 feat2.txt
.
.
.
./svm-train -b 1 feat9.txt
(I'm using Fedora Core 5.)
You can tell libsvm to use OpenMP for parallelization. See this libsvm FAQ entry: http://www.csie.ntu.edu.tw/~cjlin/libsvm/faq.html#f432
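From memory of that FAQ entry (verify against the link, since it also involves a small #pragma change in svm.cpp), the build-side change is roughly:
# In the libsvm Makefile, add -fopenmp to the compiler flags:
CFLAGS = -Wall -Wconversion -O3 -fPIC -fopenmp
# At run time, pick the number of threads:
export OMP_NUM_THREADS=8
./svm-train -b 1 feat1.txt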
As Adam said, it depends on how many cores and processors your system has available. If that's insufficient, why not spin up a few EC2 instances to run on?
The Infochimps MachetEC2 public AMI comes with most of the tools you'll need: http://blog.infochimps.org/2009/02/06/start-hacking-machetec2-released/
Yes, but unless you have a multi-core or multi-processor system, it may not save you much time.
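If you would rather launch them from one terminal instead of nine, a simple shell loop works (a sketch, assuming the files are named feat1.txt through feat9.txt as above):
# Start all nine trainings in the background, then wait for them all to finish:
for i in $(seq 1 9); do
  ./svm-train -b 1 "feat${i}.txt" &
done
wait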