I have a Telegraf configuration set up in a Docker base image, and I am seeing an issue with one of the services where Jolokia metrics are captured but not written to InfluxDB. Other services using the same base image have no issue writing to InfluxDB.
telegraf.log file for service with issue:
2017/09/25 21:25:00 Gathered metrics, (5s interval), from 1 inputs in 115.245395ms
2017/09/25 21:25:05 Gathered metrics, (5s interval), from 1 inputs in 83.221324ms
2017/09/25 21:25:10 Gathered metrics, (5s interval), from 1 inputs in 75.461556ms
2017/09/25 21:25:15 Gathered metrics, (5s interval), from 1 inputs in 99.841166ms
2017/09/25 21:25:20 Gathered metrics, (5s interval), from 1 inputs in 62.729338ms
telegraf.log file for service without issue:
2017/09/25 20:45:20 Gathered metrics, (5s interval), from 1 inputs in 480.84182ms
2017/09/25 20:45:25 Gathered metrics, (5s interval), from 1 inputs in 481.822055ms
2017/09/25 20:45:30 Wrote 2 metrics to output influxdb in 57.553898ms
2017/09/25 20:45:30 Gathered metrics, (5s interval), from 1 inputs in 481.855258ms
2017/09/25 20:45:35 Gathered metrics, (5s interval), from 1 inputs in 481.826305ms
2017/09/25 20:45:40 Wrote 2 metrics to output influxdb in 62.126203ms
2017/09/25 20:45:40 Gathered metrics, (5s interval), from 1 inputs in 481.883574ms
2017/09/25 20:45:45 Gathered metrics, (5s interval), from 1 inputs in 481.851454ms
2017/09/25 20:45:50 Wrote 2 metrics to output influxdb in 70.463902ms
I'd appreciate any pointers on the root cause of this issue. I can post the telegraf.conf file if needed.
Thanks,
Maddy
I have a dataset of around 1M rows with a high class imbalance (743 / 1072780). I am training an XGBoost model in h2o with the following parameters, and it looks like it is overfitting:
from h2o.estimators import H2OXGBoostEstimator

H2OXGBoostEstimator(
    max_depth=10,
    subsample=0.7,
    ntrees=200,
    learn_rate=0.5,
    min_rows=3,
    col_sample_rate_per_tree=0.75,
    reg_lambda=2.0,
    reg_alpha=2.0,
    sample_rate=0.5,
    booster='gbtree',
    nfolds=10,
    keep_cross_validation_predictions=True,
    stopping_metric='AUCPR',
    min_split_improvement=1e-5,
    categorical_encoding='OneHotExplicit',
    weights_column="Products",
)
The output is:
Training data AUCPR: 0.6878932664592388 Validation data AUCPR: 0.04033158660014747
Training data AUC: 0.9992170372214433 Validation data AUC: 0.7000804189162043
Training data MSE: 0.0005722912424124134 Validation data MSE: 0.0010002949568585474
Training data RMSE: 0.023922609439866994 Validation data RMSE: 0.03162743993526108
Training data Gini: 0.9984340744428866 Validation data Gini: 0.40016083783240863
Confusion Matrix for Training Data:
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.15900755567210062:
       0       1    Error   Rate
-----  ------  ---  ------  ----------------
0      709201  337  0.0005  (337.0/709538.0)
1      189     516  0.2681  (189.0/705.0)
Total  709390  853  0.0007  (526.0/710243.0)
Confusion Matrix for Validation Data:
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.047459165255228676:
       0       1    Error   Rate
-----  ------  ---  ------  ----------------
0      202084  365  0.0018  (365.0/202449.0)
1      140     52   0.7292  (140.0/192.0)
Total  202224  417  0.0025  (505.0/202641.0)
I am using h2o version 3.32.0.1 (it's a requirement), and XGBoost in h2o doesn't support the balance_classes or scale_pos_weight hyperparameters.
What could cause such a large gap between training and validation performance? Also, what could be changed for such an imbalanced dataset to improve the performance?
Training on such a severely imbalanced dataset is pointless. I would try a combination of upsampling and downsampling to get a more balanced dataset that does not get too small.
This may be the worst class imbalance I have ever seen in a problem.
If you can subset your majority class, not to the point that it is balanced but until the imbalance is less severe while still being representative (e.g., 15%/85% minority/majority), you'll have more luck with other conventional techniques or a mixture of them (e.g., upsampling and augmentation). Can the data logically be subset to help with the imbalance? For example, if the data ranges back several years, you could use only the last year's worth of data. I'd also manually optimize the decision threshold for the minority class, for example against the true positive rate.
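A minimal sketch of the two suggestions above, in plain Python and not tied to h2o: downsample the majority class to roughly a 15/85 split, then pick a decision threshold that reaches a chosen true positive rate. The function names, the `y` label key, and the 0/1 label encoding are assumptions made for illustration.

```python
import math
import random

def downsample_majority(rows, label_key="y", minority_share=0.15, seed=0):
    """Keep every minority row (label 1) and a random subset of majority rows
    (label 0) so the minority makes up about `minority_share` of the result."""
    minority = [r for r in rows if r[label_key] == 1]
    majority = [r for r in rows if r[label_key] == 0]
    # Majority rows needed so that minority / (minority + majority) == minority_share.
    keep = round(len(minority) * (1 - minority_share) / minority_share)
    keep = min(keep, len(majority))
    rng = random.Random(seed)
    result = minority + rng.sample(majority, keep)
    rng.shuffle(result)
    return result

def threshold_for_recall(scores, labels, target_recall=0.8):
    """Highest threshold t such that predicting 1 when score >= t
    recovers at least `target_recall` of the positive examples."""
    positives = sorted((s for s, y in zip(scores, labels) if y == 1), reverse=True)
    if not positives:
        raise ValueError("no positive examples to calibrate on")
    needed = max(1, math.ceil(target_recall * len(positives)))
    return positives[needed - 1]
```

Only the training split would be downsampled; validation data should keep the original class ratio so that metrics such as AUCPR still reflect the real problem.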
When I serve my TF model with TensorFlow Serving 2.1.0 through Docker and stress test it with JMeter, I run into a problem: TPS reaches about 4400 when testing with a single data example, but only about 1700 when testing with multiple examples read from a txt file. The model is a BiLSTM that I trained without any cache setting. All experiments run on a local server rather than over the network.
Metrics:
In the single-data task, 30 request threads send HTTP requests with identical data, with no interval between requests, for 10 minutes.
TPS: 4491
CPU occupied: 2100%
99% Latency Line (ms): 17
error rate: 0
In the multiple-data task, 30 request threads send HTTP requests with data read from a txt file, a dataset with 9740000 different examples.
TPS: 1711
CPU occupied: 2300%
99% Latency Line (ms): 42
error rate: 0
Hardware:
CPU cores:12
processor: 24
Intel(R) Xeon(R) Silver 4214 CPU @ 2.20GHz
Is there a cache in Tensorflow Serving?
Why is TPS about three times higher when stress testing with a single example than when testing with varied data?
I've solved the problem. Request threads reading from the same file have to wait for one another, and that waiting costs CPU time on the JMeter side.
From here and other sources, TotalIterations = TotalExamples / BatchSize * Epochs. I have a CNN with 54 layers in total, 36000 examples, a batch size of 12, and 3 epochs. By the formula I should get 9000 iterations, but the training UI shows 3 x 9000 = 27000 at the end.
I tried a smaller dataset and, lo and behold, the total is again 3 times whatever the formula gives.
I don't understand where the 3 is coming from. Is it because my images have 3 channels (RGB), so each channel is counted as an example? (I wonder why, if that is the case, since CNNs are supposed to consume data in volumes.)
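As a quick check of the arithmetic in the question, the sketch below evaluates the formula with the stated numbers and lists which settings would reproduce the observed count under that same formula. It only restates the numbers; it does not identify what the training tool actually does, and the variable names are illustrative.

```python
import math

def total_iterations(examples: int, batch_size: int, epochs: int) -> int:
    """Iterations = ceil(examples / batch_size) * epochs."""
    return math.ceil(examples / batch_size) * epochs

expected = total_iterations(36000, 12, 3)   # value the formula predicts
observed = 27000                            # value reported by the training UI
factor = observed / expected

# Settings that would reproduce the observed count under the same formula.
implied_batch_size = 36000 * 3 / observed   # if the epoch count really is 3
implied_epochs = observed * 12 / 36000      # if the batch size really is 12

print(expected, factor, implied_batch_size, implied_epochs)
```

With these numbers the observed count matches either a batch size of 4 or 9 epochs, so comparing the batch size and epoch count the tool reports against the configured values is one way to narrow down where the factor comes from.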
On my MacBook Pro 13" I have the Blackmagic eGPU (AMD Radeon Pro 580) connected via USB-C. This should theoretically speed up my model training with Turi Create enormously.
For a small model, in my case 15 labeled images (4k x 3k) and 500 iterations, training takes about 2 hours with the eGPU. With the CPU alone it takes 4 hours, so the GPU speeds things up, but not dramatically.
The Turi Create guide says that an object detection model with ~700 images and 4000 iterations is trained in 1 hour, so way faster.
With Create ML I observe a performance increase of at least 5x for transfer learning during the feature detection phase when using the eGPU.
Is this a problem of the framework itself?
Can I optimize the data or training parameters for better usage of the eGPU?
Is the data too small or the resolution too big to have optimal GPU usage over USB-C?
Class : ObjectDetector
Schema
------
Model : darknet-yolo
Number of classes : 4
Non-maximum suppression threshold : 0.45
Input image shape : (3, 416, 416)
Training summary
----------------
Training time : 1h 29m 8s
Training epochs : 1066
Training iterations : 500
Number of examples (images) : 15
Number of bounding boxes (instances) : 49
Final loss (specific to model) : 1.808
It is the image size/resolution (4k x 3k) that creates the bottleneck for the GPU. Scaling the images down (and adjusting the labels accordingly) gets the full speed of the eGPU (100x vs CPU).
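A minimal sketch of the label adjustment, assuming annotations in the dictionary layout Turi Create uses for object detection (a `label` plus `coordinates` with a center `x`, `y` and a `width`, `height` in pixels). The helper names and the `max_side` default are hypothetical, and the image itself still has to be resized by the same factor with an image library of your choice.

```python
def fit_scale(width, height, max_side=1024):
    """Scale factor (at most 1.0) that brings the longer image side down to max_side."""
    return min(1.0, max_side / max(width, height))

def scale_annotations(annotations, scale):
    """Return a copy of the annotations with all pixel coordinates multiplied by scale."""
    scaled = []
    for ann in annotations:
        coords = ann["coordinates"]
        scaled.append({
            "label": ann["label"],
            "coordinates": {
                key: coords[key] * scale
                for key in ("x", "y", "width", "height")
            },
        })
    return scaled

# Example: a 4000 x 3000 image reduced so that its longer side is 1000 px.
scale = fit_scale(4000, 3000, max_side=1000)
boxes = [{"label": "cat",
          "coordinates": {"x": 2000, "y": 1500, "width": 400, "height": 200}}]
print(scale, scale_annotations(boxes, scale))
```

Since the model summary reports an input shape of (3, 416, 416), images far larger than that are scaled down before training anyway, so resizing once up front avoids repeating that work on every iteration.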
This is my first data mining project. I am using SAS Enterprise Miner to train and test a classifier.
I have 3 files at my disposal,
Training file : 85 input variables and 1 target variable, with 5800+ observations
Prediction file : 85 input variables with 4000 observations
Verification file : 1 variable containing the correct predictions for the second file. Since this is an academic project, this file is here to tell us if we are doing a good job or not.
My problem is that the dataset is unbalanced (95% of 0s and 5% of 1s for the target variable in the training file). So naturally, I tried to resample the data using the "sampling node" as described in the following link
Here are the 2 approaches I used. They give slightly different results, but here is the general unsatisfactory result I am getting:
Without resampling : The model predicts fewer than ten solicited individuals (target variable = 1) out of 4000 observations
With resampling : The model predicts about 1500 solicited individuals out of 4000 observations.
I am looking for 100 to 200 solicited individuals for the model to be considered acceptable.
Why do you think our predictions are so far off, and how can we remedy this situation?
Here is a screen shot of both models
There are some techniques for dealing with unbalanced data. One that I remember from many years ago was this approach:
say you have 100 solicited (minority) observations, which are 5% of all your observations
cluster the non-solicited (majority) class into 20 groups (each with about 100 non-solicited observations) with a clustering algorithm such as k-means, mean shift, DBSCAN, and so on
then, for each group of clustered majority observations, create a dataset that also contains all 100 solicited (minority) observations. This gives you 20 datasets, each of which is balanced with 100 solicited and about 100 non-solicited observations
train on each balanced dataset to create one model per dataset
at prediction time, run all 20 models and take a vote. For example, if 15 out of 20 models say an observation is solicited, it is solicited
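The steps above can be sketched in plain Python as follows. Two stand-ins are used to keep the example short and dependency-free, and neither is part of the original suggestion: the majority class is split into random chunks instead of being clustered, and each per-group model is a simple nearest-centroid classifier. All function names are hypothetical.

```python
import random

def centroid(points):
    """Component-wise mean of a list of equal-length feature vectors."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]

def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def split_majority(majority, n_groups, seed=0):
    """Shuffle the majority rows and deal them into n_groups chunks
    (a stand-in for the clustering step)."""
    rows = list(majority)
    random.Random(seed).shuffle(rows)
    return [rows[i::n_groups] for i in range(n_groups)]

def train_ensemble(minority, majority, n_groups=20, seed=0):
    """Train one nearest-centroid model per balanced subset.
    Each model is a pair (minority centroid, centroid of one majority group)."""
    positive = centroid(minority)
    groups = split_majority(majority, n_groups, seed)
    return [(positive, centroid(group)) for group in groups if group]

def predict(models, x, min_votes):
    """Return 1 if at least min_votes models place x closer to the minority centroid."""
    votes = sum(1 for pos, neg in models if sq_dist(x, pos) < sq_dist(x, neg))
    return 1 if votes >= min_votes else 0
```

For example, `models = train_ensemble(minority_rows, majority_rows)` followed by `predict(models, row, min_votes=15)` reproduces the "15 out of 20" rule; raising or lowering `min_votes` is then a direct way to steer how many observations end up predicted as solicited.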