F1 score for a Random Forest model

I have built a Random Forest model (H2O library) and then checked its accuracy on some test data. I would like to use the F1 score as a measure of the model's success, but I cannot find a way to retrieve it in the documentation.
I know that it is possible, because this appears elsewhere:
performance = best_nn.model_performance(test_data = test)
F1 = performance.F1()
However, in my case, for some reason, the performance object does not have F1 as a method.
What is wrong, and how is it possible to retrieve it?
Environment:
H2O cluster uptime: 7 mins 29 secs
H2O cluster timezone: Asia/Jerusalem
H2O data parsing timezone: UTC
H2O cluster version: 3.22.0.2
H2O cluster version age: 10 days
H2O cluster name: H2O_from_python_user_24aghd
H2O cluster total nodes: 1
H2O cluster free memory: 894 Mb
H2O cluster total cores: 4
H2O cluster allowed cores: 4
H2O cluster status: locked, healthy
H2O connection url: http://localhost:54321
H2O connection proxy: None
H2O internal security: False
H2O API Extensions: Algos, AutoML, Core V3, Core V4
Python version: 2.7.15 final

It seems that I have found the reason, and it is rather a simple one:
F1 is only defined for binomial models, i.e. models whose response variable has exactly two classes. Mine had more,
so H2O did not offer the metric.
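For a multinomial model you can still compute an F1-style summary yourself. Below is a minimal sketch that pulls the predictions into pandas and uses scikit-learn's macro-averaged F1; the response column name "label" is a placeholder for your own frame.
from sklearn.metrics import f1_score

# Score the test frame with the trained model (returns an H2OFrame).
preds = best_nn.predict(test)

# Move the predicted and true labels into pandas for scikit-learn.
y_col = "label"  # placeholder: your response column name
y_pred = preds["predict"].as_data_frame()["predict"]
y_true = test[y_col].as_data_frame()[y_col]

# Macro-averaged F1 treats every class equally; use average="weighted"
# to weight classes by their support instead.
print(f1_score(y_true, y_pred, average="macro"))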

Related

Detectron2 - Same Code&Data // Different platforms // highly divergent results

I use different hardware to benchmark multiple possibilities. The code runs in a Jupyter notebook.
When I evaluate the different losses, I get highly divergent results.
I also checked the full config with cfg.dump() - it is completely consistent across platforms.
Detectron2 Parameters:
cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-Detection/retinanet_R_101_FPN_3x.yaml"))
cfg.DATASETS.TRAIN = ("dataset_train",)
cfg.DATASETS.TEST = ("dataset_test",)
cfg.DATALOADER.NUM_WORKERS = 2
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-Detection/retinanet_R_101_FPN_3x.yaml") # Let training initialize from model zoo
cfg.SOLVER.IMS_PER_BATCH = 2
cfg.SOLVER.BASE_LR = 0.00025 # 0.00125 pick a good LR
cfg.SOLVER.MAX_ITER = 1200 # 300 iterations seems good enough for this toy dataset; you will need to train longer for a practical dataset
cfg.SOLVER.STEPS = [] # do not decay learning rate
cfg.MODEL.ROI_HEADS.BATCH_SIZE_PER_IMAGE = 512 # faster, and good enough for this toy dataset (default: 512)
#cfg.MODEL.ROI_HEADS.NUM_CLASSES = 25 # only has one class (ballon). (see https://detectron2.readthedocs.io/tutorials/datasets.html#update-the-config-for-new-datasets)
cfg.MODEL.RETINANET.NUM_CLASSES = 3
# NOTE: this config value is the number of classes, but a few popular unofficial tutorials incorrectly use num_classes+1 here.
cfg.OUTPUT_DIR = "/content/drive/MyDrive/Colab_Notebooks/testrun/output"
cfg.TEST.EVAL_PERIOD = 25
cfg.SEED=5
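For context, this is roughly how the config above is consumed in each environment; a minimal sketch assuming COCO-format annotations whose paths (shown as placeholders) match the registered dataset names, and the standard DefaultTrainer:
import os
from detectron2.data.datasets import register_coco_instances
from detectron2.engine import DefaultTrainer

# Register the splits referenced by cfg.DATASETS.TRAIN / cfg.DATASETS.TEST above.
# The annotation and image paths are placeholders.
register_coco_instances("dataset_train", {}, "annotations/train.json", "images/train")
register_coco_instances("dataset_test", {}, "annotations/test.json", "images/test")

os.makedirs(cfg.OUTPUT_DIR, exist_ok=True)
trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()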
1. Environment: Azure
Microsoft Azure - Machine Learning
STANDARD_NC6
Torch: 1.9.0+cu111
Results:
Training Log: Log Azure
2. Environment: Colab
GoogleColab free
Torch: 1.9.0+cu111
Results:
Training Log: Log Colab
EDIT:
3. Environment: Ubuntu
Ubuntu 22.04
RTX 3080
Torch: 1.9.0+cu111
Results:
Training Log: https://pastebin.com/PwXMz4hY
New dataset
The issue is not reproducible with a larger dataset.

Colab unable to load cache

I am trying to train a YOLOv5 neural network for recognizing vehicles. However, when it is trained on Google Colab, it always stops here:
train: Scanning 'MyDataset/train/labels.cache' for images and labels... 26559 found, 0 missing, 0 empty, 0 corrupted: 100% 26559/26559 [00:00<?, ?it/s]
train: Caching images (8.5GB): 62% 16425/26559 [00:46<00:30, 330.41it/s]C
CPU times: user 850 ms, sys: 162 ms, total: 1.01 s
Wall time: 1min 26s
I followed the tutorial from Roboflow. When I switched to the smaller dataset provided by Roboflow, the training was able to proceed. I'm a Colab Pro+ user, so it shouldn't be a matter of not having enough memory.
I switched to a smaller dataset and now it loads without any problems.
train: Caching images (4.6GB): 100% 8853/8853 [00:18<00:00, 483.20it/s]
Then it started training smoothly.
I think it is indeed a matter of too much data; however, Colab is not giving me any indication that it is running out of memory.
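One workaround worth trying is not to cache the full image set into RAM at all, or to cache it to disk instead. Recent YOLOv5 releases expose this through train.py's --cache flag; a sketch of the Colab cell, with the dataset YAML path and hyperparameters as placeholders:
# Omit --cache entirely to skip caching, or pass "--cache disk" (supported in
# newer YOLOv5 versions) to spill the image cache to disk instead of RAM.
!python train.py --img 640 --batch 16 --epochs 100 --data MyDataset/data.yaml --weights yolov5s.pt --cache disk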

Does TensorFlow Serving run inference with a cache?

I serve my TF model with TensorFlow Serving 2.1.0 through Docker and stress-test it with JMeter. There is a problem: TPS hits 4400 when testing with a single repeated input, while it reaches only 1700 when the inputs are read from a txt file of different examples. The model is a BiLSTM that I trained without any cache settings. All experiments run against a local server rather than over the network.
Metrics:
In the single-data test, 30 request threads send an identical request back-to-back (no think time) for 10 minutes.
TPS: 4491
CPU occupied: 2100%
99% latency line (ms): 17
Error rate: 0
In the multiple-data test, 30 request threads read their requests from a txt file, a dataset with 9,740,000 different examples.
TPS: 1711
CPU occupied: 2300%
99% latency line (ms): 42
Error rate: 0
Hardware:
CPU cores: 12
Logical processors: 24
Intel(R) Xeon(R) Silver 4214 CPU @ 2.20GHz
Is there a cache in TensorFlow Serving?
Why is TPS roughly three times higher with a single repeated input than with varied inputs in this stress test?
I've solved the problem. The request threads all read from the same file, so they have to wait on each other, and that contention costs CPU on the machine running JMeter, which caps the throughput.
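As a client-side sanity check, a load generator can pre-load the inputs into memory once so the worker threads never contend on the file. A minimal sketch against TF Serving's REST API, assuming the model is served under the name "bilstm" on the default REST port 8501 and that each line of the file is already a JSON instance (both assumptions; adjust to your model's signature):
import json
import requests
from concurrent.futures import ThreadPoolExecutor

URL = "http://localhost:8501/v1/models/bilstm:predict"  # model name is an assumption

# Read the whole file once, up front, so worker threads share an in-memory list
# instead of contending on file I/O.
with open("inputs.txt") as f:
    instances = [json.loads(line) for line in f]

def send(instance):
    # TF Serving's REST predict API expects a JSON body of the form {"instances": [...]}.
    resp = requests.post(URL, json={"instances": [instance]})
    resp.raise_for_status()
    return resp.elapsed.total_seconds()  # time until the response headers arrived

with ThreadPoolExecutor(max_workers=30) as pool:
    latencies = list(pool.map(send, instances[:10000]))

print("mean latency (s):", sum(latencies) / len(latencies))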

Unbalanced model, confused as to what steps to take

This is my first data mining project. I am using SAS Enterprise Miner to train and test a classifier.
I have 3 files at my disposal:
Training file: 85 input variables and 1 target variable, with 5800+ observations
Prediction file: 85 input variables with 4000 observations
Verification file: 1 variable containing the correct predictions for the second file. Since this is an academic project, this file is here to tell us whether we are doing a good job or not.
My problem is that the dataset is unbalanced (95% 0s and 5% 1s for the target variable in the training file). So, naturally, I tried to re-sample the data using the "sampling node" as described in the following link.
Here are the 2 approaches I used; they give slightly different results, but the general, unsatisfactory outcome is the same:
Without resampling: the model predicts fewer than ten solicited individuals (target variable = 1) out of 4000 observations
With resampling: the model predicts about 1500 solicited individuals out of 4000 observations
I am looking for 100 to 200 solicited individuals for the model to be considered acceptable.
Why do you think our predictions are so far off, and how can we remedy this situation?
Here is a screenshot of both models.
There are some techniques for dealing with unbalanced data. One approach I remember from many years ago works like this:
say you have 100 solicited (minority) observations, making up 5% of all your observations
cluster the non-solicited (majority) class into 20 groups (each with roughly 100 non-solicited observations) using a clustering algorithm such as k-means, mean shift, DBSCAN, etc.
then, for each cluster of majority observations, create a dataset that also contains all 100 solicited (minority) observations; this gives you 20 datasets, each balanced with 100 solicited and 100 non-solicited observations
train on each balanced dataset and create a model for each of them
at prediction time, score with all 20 models and vote; for example, if 15 out of 20 models say it is solicited, it is solicited
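A minimal sketch of that procedure in Python with scikit-learn, as a tool-agnostic illustration (the feature matrix X, binary target y, and the choice of k-means plus decision trees are all illustrative assumptions):
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

def fit_cluster_ensemble(X, y, n_clusters=20, random_state=0):
    # X, y are numpy arrays; y is 1 for the minority (solicited) class, 0 otherwise.
    X_min, X_maj = X[y == 1], X[y == 0]
    # Partition the majority class into n_clusters groups.
    labels = KMeans(n_clusters=n_clusters, random_state=random_state).fit_predict(X_maj)
    models = []
    for c in range(n_clusters):
        # One balanced dataset per cluster: that cluster's majority rows plus all minority rows.
        X_bal = np.vstack([X_maj[labels == c], X_min])
        y_bal = np.concatenate([np.zeros((labels == c).sum()), np.ones(len(X_min))])
        models.append(DecisionTreeClassifier(random_state=random_state).fit(X_bal, y_bal))
    return models

def predict_by_vote(models, X, threshold=0.5):
    # Majority vote across the ensemble, e.g. 15 of 20 models -> solicited.
    votes = np.mean([m.predict(X) for m in models], axis=0)
    return (votes >= threshold).astype(int)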

Random Forest - Verbose and Speed

I am trying to build a random forest on a data set with 120k rows and 518 columns.
I have two questions:
1. I want to see the progress and logs of building the forest. Is the verbose option deprecated in the randomForest function?
2. How can I increase the speed? Right now it takes more than 6 hours to build a random forest with 1000 trees.
H2O cluster is initialized with below settings:
hadoop jar h2odriver.jar -Dmapreduce.job.queuename=devclinical
-output temp3p -nodes 20 -nthreads -1 -mapperXmx 32g
h2o.init(ip = h2o_ip, port = h2o_port, startH2O = FALSE,
         nthreads = -1, max_mem_size = "64G", min_mem_size = "4G")
Depending on congestion of your network and the busyness level of your hadoop nodes, it may finish faster with fewer nodes. For example, if 1 of the 20 nodes you requested is totally slammed by some other jobs, then that node may lag, and the work from that node is not rebalanced to other nodes.
A good way to see what is going on is to connect to H2O Flow in a browser and run the WaterMeter. This will show you CPU activity in your cluster.
You can compare the activity before you start your RF and after you start your RF.
If the nodes are extremely busy even before you start your RF, then you may be out of luck and just have to wait. If the nodes are not busy at all even after you start your RF, then the network communication overhead may be too high and fewer nodes would be better.
You'll also want to look at the H2O logs and see how the dataset got parsed, datatype-wise, and the speed at which individual trees are built. And if your response column is categorical and you're doing multinomial classification, each tree is really N trees, where N is the number of levels in the response column.
[ Unfortunately, the "it's too slow" complaint is way too generic to say much more. ]
That sounds like a long time to train a Random Forest on a dataset of only 120k rows x 518 columns. As Tom said above, it might have to do with congestion on your Hadoop cluster, and possibly that the cluster is way too big for this task. You should be able to train on a dataset that size on a single machine (no multi-node cluster necessary).
If possible, try training the model on your laptop for a comparison. If there is nothing you can do to improve the Hadoop environment, this may be a better option for training.
For your other question about a verbose option -- I don't remember there ever being such an option in H2O's Random Forest. You can view the progress of models as they build in H2O Flow, the GUI. When you click on a model to view it, there is a "Refresh" button that lets you check on the progress of the model as it trains.
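For the single-machine comparison, a minimal sketch using H2O's Python API (the file path, response column name, and memory setting are placeholders; the R equivalent with h2o.randomForest is analogous):
import h2o
from h2o.estimators import H2ORandomForestEstimator

# Start a local, single-node H2O instance instead of the Hadoop cluster.
h2o.init(nthreads=-1, max_mem_size="16G")

df = h2o.import_file("data.csv")          # placeholder path
y = "response"                            # placeholder response column
x = [c for c in df.columns if c != y]
df[y] = df[y].asfactor()                  # classification

rf = H2ORandomForestEstimator(ntrees=1000, seed=1)
rf.train(x=x, y=y, training_frame=df)
print(rf.model_performance())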
