Vowpal Wabbit -- Active Learning, predictions always 0 even with seed data - machine-learning

Using Vowpal Wabbit and its python interactor for active learning I've got up to the point of being able to send messages back and forth to the server from the client but I am having problems with seeding.
When I seed the model with the current command:
python active_interactor.py --verbose -o labelled.txt --seed data.seed localhost 12345 unlabelled.txt
The interactor sends the examples to the server (and I know this because the server updates the models and the debug information is produced) but when it feeds the unlabelled examples and asks for a label as a response, the predictions are always 0.
My question is: is the model not being seeded? If not, why are the predictions always 0 even though there is a model?
It should be noted that the same data can be successfully used to create a passive model that gives non-0 predictions, so I do not think the problem is with the training data.
---UPDATE---
Upon looking at the tests, we went ahead and changed the vw server to match the test with two parameters in mind that were left as their defaults beforehand, namely initial_t and l.
vw -f final.model --active_learning --active_mellowness 0.000001 --daemon --port 12345 --initial_t 10 -l 10
Once doing this, predictions are produced. This also works when -l is it's default. We will now do a grid search to find out the best possible parameters. One question though, what is the reason why low values of initial_t led to no predictions?

Related

ClearML multiple tasks in single script changes logged value names

I trained multiple models with different configuration for a custom hyperparameter search. I use pytorch_lightning and its logging (TensorboardLogger).
When running my training script after Task.init() ClearML auto-creates a Task and connects the logger output to the server.
I log for each straining stage train, val and test the following scalars at each epoch: loss, acc and iou
When I have multiple configuration, e.g. networkA and networkB the first training log its values to loss, acc and iou, but the second to networkB:loss, networkB:acc and networkB:iou. This makes values umcomparable.
My training loop with Task initalization looks like this:
names = ['networkA', networkB']
for name in names:
task = Task.init(project_name="NetworkProject", task_name=name)
pl_train(name)
task.close()
method pl_train is a wrapper for whole training with Pytorch Ligtning. No ClearML code is inside this method.
Do you have any hint, how to properly use the usage of a loop in a script using completly separated tasks?
Edit: ClearML version was 0.17.4. Issue is fixed in main branch.
Disclaimer I'm part of the ClearML (formerly Trains) team.
pytorch_lightning is creating a new Tensorboard for each experiment. When ClearML logs the TB scalars, and it captures the same scalar being re-sent again, it adds a prefix so if you are reporting the same metric it will not overwrite the previous one. A good example would be reporting loss scalar in the training phase vs validation phase (producing "loss" and "validation:loss"). It might be the task.close() call does not clear the previous logs, so it "thinks" this is the same experiment, hence adding the prefix networkB to the loss. As long as you are closing the Task after training is completed you should have all experiments log with the same metric/variant (title/series). I suggest opening a GitHub issue, this should probably be considered a bug.

Random Forest - Verbose and Speed

I am trying to build a randomforest on a data set with 120k rows and 518 columns.
I have two questions:
1. I want to see the progress and logs of building the forest. Is verbose option deprecated in randomForest function?
2. How to increase the speed? Right now it takes more than 6 hours to build a random forest with 1000 trees.
H2O cluster is initialized with below settings:
hadoop jar h2odriver.jar -Dmapreduce.job.queuename=devclinical
-output temp3p -nodes 20 -nthreads -1 -mapperXmx 32g
h2o.init(ip = h2o_ip, port = h2o_port, startH2O = FALSE,
nthreads=-1,max_mem_size = "64G", min_mem_size="4G" )
Depending on congestion of your network and the busyness level of your hadoop nodes, it may finish faster with fewer nodes. For example, if 1 of the 20 nodes you requested is totally slammed by some other jobs, then that node may lag, and the work from that node is not rebalanced to other nodes.
A good way to see what is going on is to connect to H2O Flow in a browser and run the WaterMeter. This will show you CPU activity in your cluster.
You can compare the activity before you start your RF and after you start your RF.
If even before you start your RF the nodes are extremely busy then you may be out of luck and just have to wait. If even after you start your RF the nodes are not busy at all, then the network communication may be too high and fewer nodes would be better.
You'll also want to to look at the H2O logs and see how the dataset got parsed, datatype-wise, and the speed at which individual trees are built. And if your response column is a categorical and you're doing multinomial, each tree is really N trees where N is the number of levels in the response column.
[ Unfortunately, the "it's too slow" complaint is way too generic to say much more. ]
That sounds like a long time to train a Random Forest on a dataset with only 120k x 518 columns. As Tom said above, it might have to do with the congestion on your Hadoop cluster and possibly that this cluster that is way too big for this task. You should be able to train a dataset that size on a single machine (no multi-node cluster necessary).
If possible, try training the model on your laptop for a comparison. If there is nothing you can do to improve the Hadoop environment, this may be a better option for training.
For your other question about a verbose option -- I don't remember there ever being this option in H2O's Random Forest. You can view the progress of models as they build in H2O Flow, the GUI. When you click on a model to view it, there is a "Refresh" button that will allow you to check on the progress of the model at it trains.

How to deal with frequent classes?

I'm working on a classification task in Weka and got the problem that my class to predict has one value that is very frequent (about 85%). This leads to a lot of learning algorithms just predicting this frequent value of this class for a new dataset.
How can I deal with this problem? Does it just mean that I didn't find features that work well enough in predicting something better? Or is there something specific I can do to solve this problem?
I guess this is a pretty common problem, but I was not able to find a solution to it here.
You need to "SMOTE" your data. First figure out how many more instances of the minority case you need. In my case I wanted to get around a 50/50 ratio so I needed to over sample by 1300 percent. This tutorial will help if you are using the GUI: http://www.youtube.com/watch?v=w14ha2Fmg6U If you are doing this from the command line using Weka, the following command will get you going:
#Weka 3.7.7
java weka.Run -no-scan weka.filters.supervised.instance.SMOTE \
-c last -K 25 -P 1300.0 -S 1 -i input.arff -o output.arff
The -K option is the number of neighbors to take into account when smoting the data. The default is 5, but 25 worked best for my dataset.

How to use WEKA Machine Learning for a Bayes Neural Network and J48 Decision Tree

I am trying to figure out WEKA and perform some experiments with data that I have.
Basically what I want to do is take Data Set 1, use it as a training set. Run a J48 Decision Tree on it. Then take Data Set 2 and run the trained tree on it, with the output of the original data set with a extra column for what the prediction was.
Then do the same thing again with the Bayes Neural Network.
Can someone point me to a link of detail instructions on how exactly I would accomplish this? I seem to be missing some steps and cannot get the output of the original data set with the extra column.
Here is one way to do it with the command-line. This information is found in Chapter 1 ("A command-line primer") of the Weka manual that comes with the software.
java weka.classifiers.trees.J48 -t training_data.arff -T test_data.arff -p 1-N
where:
-t <training_data.arff> specifies the training data in ARFF format
-T <test_data.arff> specifies the test data in ARFF format
-p 1-N specifies that you want to output the feature vector and the prediction,
where N is the number of features in your feature vector.
For example, here I am using soybean.arff for both training and testing. There are 35 features in the feature vector:
java weka.classifiers.trees.J48 -t soybean.arff -T soybean.arff -p 1-35
The first few lines of the output look like:
=== Predictions on test data ===
inst# actual predicted error prediction (date,plant-stand,precip,temp,hail,crop-hist,area-damaged,severity,seed-tmt,germination,plant-growth,leaves,leafspots-halo,leafspots-marg,leafspot-size,leaf-shread,leaf-malf,leaf-mild,stem,lodging,stem-cankers,canker-lesion,fruiting-bodies,external-decay,mycelium,int-discolor,sclerotia,fruit-pods,fruit-spots,seed,mold-growth,seed-discolor,seed-size,shriveling,roots)
1 1:diaporth 1:diaporth 0.952 (october,normal,gt-norm,norm,yes,same-lst-yr,low-areas,pot-severe,none,90-100,abnorm,abnorm,absent,dna,dna,absent,absent,absent,abnorm,no,above-sec-nde,brown,present,firm-and-dry,absent,none,absent,norm,dna,norm,absent,absent,norm,absent,norm)
2 1:diaporth 1:diaporth 0.952 (august,normal,gt-norm,norm,yes,same-lst-two-yrs,scattered,severe,fungicide,80-89,abnorm,abnorm,absent,dna,dna,absent,absent,absent,abnorm,yes,above-sec-nde,brown,present,firm-and-dry,absent,none,absent,norm,dna,norm,absent,absent,norm,absent,norm)
The columns are: (1) data instance number; (2) ground truth label; (3) predicted label; (4) error; (5) prediction confidence; and (6) feature vector.

How to reavaluate model in WEKA?

I am trying to solve a numeric classification problem with numeric attributes in WEKA using linear regression and then I want to test my model on the existing dataset with ""re-evaluate model on current test dataset.
As a result of the evaluation I am getting the summary:
Correlation coefficient 0.9924
Mean absolute error 1.1017
Root mean squared error 1.2445
Total Number of Instances 17
But I don't have results as it is shown here: http://weka.wikispaces.com/Making+predictions
How to bring WEKA to the result I need?
Thank you.
To answer my question - for trained and tested model, right click on the model and go to visualize classifier error. there use save option to save actual and predicted values.
Are you using command line interface (CLI) or GUI.
If CLI, the command given in the above link works pretty fine
java weka.classifiers.trees.J48 -T unclassified.arff -l j48.model -p 0
So when you train the model you save it as *.model (j48.model) and later use it to evaluate on test data (unclassified.arff)

Resources