Negative probability in GMM - machine-learning

I am so confused. I have tested a program for myself by following MATLAB code :
feature_train=[1 1 2 1.2 1 1 700 709 708 699 678];
No_of_Clusters = 2;
No_of_Iterations = 10;
[m,v,w]=gaussmix(feature_train,[],No_of_Iterations,No_of_Clusters);
feature_ubm=[1000 1001 1002 1002 1000 1060 70 79 78 99 78 23 32 33 23 22 30];
No_of_Clusters = 3;
No_of_Iterations = 10;
[mubm,vubm,wubm]=gaussmix(feature_ubm,[],No_of_Iterations,No_of_Clusters);
feature_test=[2 2 2.2 3 1 600 650 750 800 658];
[lp_train,rp,kh,kp]=gaussmixp(feature_test,m,v,w);
[lp_ubm,rp,kh,kp]=gaussmixp(feature_test,mubm,vubm,wubm);
However, the result is wondering me because the feature_test must be classified in feature_train not feature_ubm. As you see below the probability of feature_ubm is more than feature_train!?!
Can anyone explain for me what is the problem ?
Is the problem related to gaussmip and gaussmix MATLAB functions ?
sum(lp_ubm)
ans =
-3.4108e+06
sum(lp_train)
ans =
-1.8658e+05

As you see below the probability of feature_ubm is more than feature_train!?!
You see exactly the opposite, despite the absolute value of ubm is big, you are considering negative numbers and
sum(lp_train) > sum(lp_ubm)
hense
P(test|train) > P(test|ubm)
So your test chunk is correctly classified as train, not as ubm.

Related

Calculate the longest continuous running time of a device

I have a table created with the following script:
n=15
ts=now()+1..n * 1000 * 100
status=rand(0 1 ,n)
val=rand(100,n)
t=table(ts,status,val)
select * from t order by ts
where
ts is the time, status indicates the device status (0: down; 1: running), and val indicates the running time.
Suppose I have the following data:
ts status val
2023.01.03T18:17:17.386 1 58
2023.01.03T18:18:57.386 0 93
2023.01.03T18:20:37.386 0 24
2023.01.03T18:22:17.386 1 87
2023.01.03T18:23:57.386 0 85
2023.01.03T18:25:37.386 1 9
2023.01.03T18:27:17.386 1 46
2023.01.03T18:28:57.386 1 3
2023.01.03T18:30:37.386 0 65
2023.01.03T18:32:17.386 1 66
2023.01.03T18:33:57.386 0 56
2023.01.03T18:35:37.386 0 42
2023.01.03T18:37:17.386 1 82
2023.01.03T18:38:57.386 1 95
2023.01.03T18:40:37.386 0 19
So how do I calculate the longest continuous running time? For example, both the 7th and 8th records have the status 1, I want to sum their val values. Or the 14th-15th records, I want to sum up their val values.
You can use the built-in function segment to group the consecutive identical values. The full script is as follows:
select first(ts), sum(iif(status==1, val, 0)) as total_val
from t
group by segment(status)
having sum(iif(status==1, val, 0)) > 0
The result:
segment_status first_ts total_val
0 2023.01.03T18:17:17.386 58
3 2023.01.03T18:22:17.386 87
5 2023.01.03T18:25:37.386 58
9 2023.01.03T18:32:17.386 66
12 2023.01.03T18:37:17.386 177

Terrible performance using XGBoost H2O

Very different Model Performance using XGBoost on H2O
I am training a XGBoost model using 5-fold croos validation on a very imbalanced binary classification problem. The dataset has 1200 columns (multi-document word2vec document embeddings).
The only parameters specified to train the XGBoost model were:
min_split_improvement = 1e-5
seed=1
nfolds = 5
The reported performance on train data was extremely high (probably overfitting!!!):
Confusion Matrix (Act/Pred) for max f1 # threshold = 0.2814398407936096:
A D Error Rate
----- ----- --- ------- -------------
A 16858 2 0.0001 (2.0/16860.0)
D 0 414 0 (0.0/414.0)
Total 16858 416 0.0001 (2.0/17274.0)
AUC: 0.9999991404060721
The performance on cross validation data was terrible:
Confusion Matrix (Act/Pred) for max f1 # threshold = 0.016815993119962513:
A D Error Rate
----- ----- --- ------- ----------------
A 16003 857 0.0508 (857.0/16860.0)
D 357 57 0.8623 (357.0/414.0)
Total 16360 914 0.0703 (1214.0/17274.0)
AUC: 0.6015883863129724
I know H2O cross validation generates an extra model using the whole data available and different performances are expected.
But, could be the cause that generated too bad performance on the resulting model?
Ps: XGBoost on a multi node H2O cluster with OMP
Model Type: classifier
Performance do modelo < XGBoost_model_python_1575650180928_617 >:
ModelMetricsBinomial: xgboost
** Reported on train data. **
MSE: 0.0008688085383330077
RMSE: 0.029475558320971762
LogLoss: 0.00836528606162877
Mean Per-Class Error: 5.931198102016033e-05
AUC: 0.9999991404060721
pr_auc: 0.9975495622569983
Gini: 0.9999982808121441
Confusion Matrix (Act/Pred) for max f1 # threshold = 0.2814398407936096:
A D Error Rate
----- ----- --- ------- -------------
A 16858 2 0.0001 (2.0/16860.0)
D 0 414 0 (0.0/414.0)
Total 16858 416 0.0001 (2.0/17274.0)
Maximum Metrics: Maximum metrics at their respective thresholds
metric threshold value idx
--------------------------- ----------- -------- -----
max f1 0.28144 0.99759 195
max f2 0.28144 0.999035 195
max f0point5 0.553885 0.998053 191
max accuracy 0.28144 0.999884 195
max precision 0.990297 1 0
max recall 0.28144 1 195
max specificity 0.990297 1 0
max absolute_mcc 0.28144 0.997534 195
max min_per_class_accuracy 0.28144 0.999881 195
max mean_per_class_accuracy 0.28144 0.999941 195
max tns 0.990297 16860 0
max fns 0.990297 413 0
max fps 0.000111383 16860 399
max tps 0.28144 414 195
max tnr 0.990297 1 0
max fnr 0.990297 0.997585 0
max fpr 0.000111383 1 399
max tpr 0.28144 1 195
Gains/Lift Table: Avg response rate: 2.40 %, avg score: 2.42 %
group cumulative_data_fraction lower_threshold lift cumulative_lift response_rate score cumulative_response_rate cumulative_score capture_rate cumulative_capture_rate gain cumulative_gain
-- ------- -------------------------- ----------------- ------- ----------------- --------------- ----------- -------------------------- ------------------ -------------- ------------------------- ------- -----------------
1 0.0100151 0.873526 41.7246 41.7246 1 0.907782 1 0.907782 0.417874 0.417874 4072.46 4072.46
2 0.0200301 0.776618 41.7246 41.7246 1 0.834968 1 0.871375 0.417874 0.835749 4072.46 4072.46
3 0.0300452 0.0326301 16.4004 33.2832 0.393064 0.303206 0.797688 0.681985 0.164251 1 1540.04 3228.32
4 0.0400023 0.0224876 0 24.9986 0 0.0263919 0.599132 0.518799 0 1 -100 2399.86
5 0.0500174 0.0180858 0 19.9931 0 0.0201498 0.479167 0.418953 0 1 -100 1899.31
6 0.100035 0.0107386 0 9.99653 0 0.0136044 0.239583 0.216279 0 1 -100 899.653
7 0.149994 0.00798337 0 6.66692 0 0.00922284 0.159784 0.147313 0 1 -100 566.692
8 0.200012 0.00629476 0 4.99971 0 0.00709438 0.119826 0.112249 0 1 -100 399.971
9 0.299988 0.00436827 0 3.33346 0 0.00522157 0.0798919 0.0765798 0 1 -100 233.346
10 0.400023 0.00311204 0 2.49986 0 0.00370085 0.0599132 0.0583548 0 1 -100 149.986
11 0.5 0.00227535 0 2 0 0.00267196 0.0479333 0.0472208 0 1 -100 100
12 0.599977 0.00170271 0 1.66673 0 0.00197515 0.039946 0.0396813 0 1 -100 66.6731
13 0.700012 0.00121528 0 1.42855 0 0.00145049 0.0342375 0.034218 0 1 -100 42.8548
14 0.799988 0.000837358 0 1.25002 0 0.00102069 0.0299588 0.0300692 0 1 -100 25.0018
15 0.899965 0.000507632 0 1.11115 0 0.000670878 0.0266306 0.0268033 0 1 -100 11.1154
16 1 3.35288e-05 0 1 0 0.00033002 0.0239667 0.0241551 0 1 -100 0
Performance da validação cruzada (xval) do modelo < XGBoost_model_python_1575650180928_617 >:
ModelMetricsBinomial: xgboost
** Reported on cross-validation data. **
MSE: 0.023504756648164406
RMSE: 0.15331261085822134
LogLoss: 0.14134815775808462
Mean Per-Class Error: 0.4160864407653825
AUC: 0.6015883863129724
pr_auc: 0.04991836222189148
Gini: 0.2031767726259448
Confusion Matrix (Act/Pred) for max f1 # threshold = 0.016815993119962513:
A D Error Rate
----- ----- --- ------- ----------------
A 16003 857 0.0508 (857.0/16860.0)
D 357 57 0.8623 (357.0/414.0)
Total 16360 914 0.0703 (1214.0/17274.0)
Maximum Metrics: Maximum metrics at their respective thresholds
metric threshold value idx
--------------------------- ----------- --------- -----
max f1 0.016816 0.0858434 209
max f2 0.00409934 0.138433 318
max f0point5 0.0422254 0.0914205 127
max accuracy 0.905155 0.976323 3
max precision 0.99221 1 0
max recall 9.60076e-05 1 399
max specificity 0.99221 1 0
max absolute_mcc 0.825434 0.109684 5
max min_per_class_accuracy 0.00238436 0.572464 345
max mean_per_class_accuracy 0.00262155 0.583914 341
max tns 0.99221 16860 0
max fns 0.99221 412 0
max fps 9.60076e-05 16860 399
max tps 9.60076e-05 414 399
max tnr 0.99221 1 0
max fnr 0.99221 0.995169 0
max fpr 9.60076e-05 1 399
max tpr 9.60076e-05 1 399
Gains/Lift Table: Avg response rate: 2.40 %, avg score: 0.54 %
group cumulative_data_fraction lower_threshold lift cumulative_lift response_rate score cumulative_response_rate cumulative_score capture_rate cumulative_capture_rate gain cumulative_gain
-- ------- -------------------------- ----------------- -------- ----------------- --------------- ----------- -------------------------- ------------------ -------------- ------------------------- --------- -----------------
1 0.0100151 0.0540408 4.34129 4.34129 0.104046 0.146278 0.104046 0.146278 0.0434783 0.0434783 334.129 334.129
2 0.0200301 0.033963 2.41183 3.37656 0.0578035 0.0424722 0.0809249 0.094375 0.0241546 0.0676329 141.183 237.656
3 0.0300452 0.0251807 2.17065 2.97459 0.0520231 0.0292894 0.0712909 0.0726798 0.0217391 0.089372 117.065 197.459
4 0.0400023 0.02038 2.18327 2.77762 0.0523256 0.0225741 0.0665702 0.0602078 0.0217391 0.111111 118.327 177.762
5 0.0500174 0.0174157 1.92946 2.60779 0.0462428 0.0188102 0.0625 0.0519187 0.0193237 0.130435 92.9463 160.779
6 0.100035 0.0103201 1.59365 2.10072 0.0381944 0.0132217 0.0503472 0.0325702 0.0797101 0.210145 59.3649 110.072
7 0.149994 0.00742152 1.06366 1.7553 0.0254925 0.00867473 0.0420687 0.0246112 0.0531401 0.263285 6.3664 75.5301
8 0.200012 0.00560037 1.11073 1.59411 0.0266204 0.00642966 0.0382055 0.0200645 0.0555556 0.318841 11.0725 59.4111
9 0.299988 0.00366149 1.30465 1.49764 0.0312681 0.00452583 0.0358935 0.0148859 0.130435 0.449275 30.465 49.7642
10 0.400023 0.00259159 1.13487 1.40692 0.0271991 0.00306994 0.0337192 0.0119311 0.113527 0.562802 13.4872 40.6923
11 0.5 0.00189 0.579844 1.24155 0.0138969 0.00220612 0.0297557 0.00998654 0.057971 0.620773 -42.0156 24.1546
12 0.599977 0.00136983 0.990568 1.19972 0.0237406 0.00161888 0.0287534 0.0085922 0.0990338 0.719807 -0.943246 19.9724
13 0.700012 0.000980029 0.676094 1.1249 0.0162037 0.00116698 0.02696 0.0075311 0.0676329 0.78744 -32.3906 12.4895
14 0.799988 0.00067366 0.797286 1.08395 0.0191083 0.000820365 0.0259787 0.00669244 0.0797101 0.86715 -20.2714 8.39529
15 0.899965 0.000409521 0.797286 1.05211 0.0191083 0.000540092 0.0252155 0.00600898 0.0797101 0.94686 -20.2714 5.21072
16 1 2.55768e-05 0.531216 1 0.0127315 0.000264023 0.0239667 0.00543429 0.0531401 1 -46.8784 0
For the non cross-validation case, try splitting your data up front into training and validation frames.
I expect you will get a worse AUC for the validation case.
Although for highly imbalanced cases, sometimes you just need to go by the error rate for each class.
Since there are so many true negatives, that can dominate the AUC (vast majority of predictions are correctly predicting “not interesting”). Some people will upsample the minority class in this situation using row weights to make the model more sensitive to them.

How to calculate png size based on dimension and bit depth

I'm trying to generate white png (jpg, gif as well) files using imagemagick. I have to calculate image's dimension based on size(kb) and bit depth(1).
I'm using this command on my windows machine:
magick -size "width" x "height" canvas:black white.png
I'm getting the following results
1 x 1 = 258 bytes;
2 x 2 = 260;
9 x 9 = 262;
17 x 17 = 263;
33 x 33 = 264;
40 x 40 = 263;
41 x 41 = 265;
65 x 65 = 267;
66 x 66 = 268;
What I understood from the results above is that minimal size is 256 + 1 (width) + 1 (height). So 1 x 1 file's size would be 258, 2 x 2 = 260. Results that goes next to these two seems not logical for me, why 33x33 is bigger than 40x40?
I have read png specification but couldn't figure out the formula how to calculate png (or other formats) size?

Testing accuracy more than training accuracy

I am building a tuned random forest model for multiclass classification.
I'm getting the following results
Training accuracy(AUC) :0.9921996
Testing accuracy(AUC) :0.992237664
I saw a question related to this on this website and the common answer seems to be that the dataset must be small and your model got lucky
But in my case I have about 300k training data points and 100k testing data points
Also my classes are well balanced
> summary(train$Bucket)
0 1 TO 30 121 TO 150 151 TO 180 181 TO 365 31 TO 60 366 TO 540 541 TO 730 61 TO 90
166034 32922 4168 4070 15268 23092 8794 6927 22559
730 + 91 TO 120
20311 11222
> summary(test$Bucket)
0 1 TO 30 121 TO 150 151 TO 180 181 TO 365 31 TO 60 366 TO 540 541 TO 730 61 TO 90
55344 10974 1389 1356 5090 7698 2932 2309 7520
730 + 91 TO 120
6770 3741
Is it possible for a model to fit this well on a large testing data? Please answer if I can do something to cross verify that my model is indeed fitting really well.
My complete code
split = sample.split(Book2$Bucket,SplitRatio =0.75)
train = subset(Book2,split==T)
test = subset(Book2,split==F)
traintask <- makeClassifTask(data = train,target = "Bucket")
rf <- makeLearner("classif.randomForest")
params <- makeParamSet(makeIntegerParam("mtry",lower = 2,upper = 10),makeIntegerParam("nodesize",lower = 10,upper = 50))
#set validation strategy
rdesc <- makeResampleDesc("CV",iters=5L)
#set optimization technique
ctrl <- makeTuneControlRandom(maxit = 5L)
#start tuning
tune <- tuneParams(learner = rf ,task = traintask ,resampling = rdesc ,measures = list(acc) ,par.set = params ,control = ctrl ,show.info = T)
rf.tree <- setHyperPars(rf, par.vals = tune$x)
tune$y
r<- train(rf.tree, traintask)
getLearnerModel(r)
testtask <- makeClassifTask(data = test,target = "Bucket")
rfpred <- predict(r, testtask)
performance(rfpred, measures = list(mmce, acc))
The difference is of order 1e-4, nothing is wrong, it is a regular, statistical error (variance of the result). Nothing to worry about. This literally means that a difference is about 0.0001 * 100,000 = 10 samples ... 10 samples out of 100k.

How much value do the 8 bit variable holds?

Probably answer is 256 but I am not satisfied with it.
Suppose a variable has 8 bits , its mean its 8th bit can hold the value 256 . But it also has other seven bits . Wouldn't the total value be the sum of all bits?
To me final value that 8 bit variable holds would be the sum of all bits. But it doesn't. Why?
The max value 8 bits can hold is: 11111111 which is equal to 255. If you have a signed value, the max value it can hold is 127, the left-most bit is used for sign.
The binary 10000000 equals 128 (2 ^ 7), not 256. That's where your confusion lays I think.
00000001 = 2 ^ 0 = 1
00000010 = 2 ^ 1 = 2
00000100 = 2 ^ 2 = 4
00001000 = 2 ^ 3 = 8
00010000 = 2 ^ 4 = 16
00100000 = 2 ^ 5 = 32
01000000 = 2 ^ 6 = 64
10000000 = 2 ^ 7 = 128
The value is indeed the sum of all bits set to 1, but the place value of the eighth bit is 27 (128), not 256 as you suggest - the least significant bit is 20 (i.e. 1), so for eight bits the MSB is 27. You appear to have started from 21 (2) .
For an unsigned integer:
Bit 0 = 20 = 1
Bit 1 = 21 = 2
Bit 2 = 22 = 4
Bit 3 = 23 = 8
Bit 4 = 24 = 16
Bit 5 = 25 = 32
Bit 6 = 26 = 64
Bit 7 = 27 = 128
Sum of all ones = 255 - not 256 as you suggest: 0 to 255 = 28 (256) values.
For a two's complement signed 8 bit type:
Bit 7 = -27 = -128
Sum of all ones = -1,
while if Bit 8 = 0, sum = +127,
and all zeros except bit 8 = -128.
(-128 to +127 = 28 (256) values).
Either way an 8 bit integer signed or otherwise has 28 (256) possible bit patterns.

Resources