Problem with missing values: replacement does not work for every missing value? - normalization

I want my missing values to be replaced by the mode of the given data, but my code replaces only one of the missing values. Why?
my real data is:
0 NaN
1 NaN
2 normal
3 normal
4 normal
...
395 normal
396 normal
397 normal
398 normal
399 normal
Name: rbc, Length: 400, dtype: object
my code is:
rbc = data_penyakit['rbc'].mode()
rbc = data_penyakit['rbc'].mask(pd.isna, rbc)
rbc
and the result is
0 normal
1 NaN
2 normal
3 normal
4 normal
...
395 normal
396 normal
397 normal
398 normal
399 normal
Name: rbc, Length: 400, dtype: object
Why is the second missing value not replaced?

mode() is giving back NaN as the second most frequent item: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.mode.html (and mask aligns that replacement Series on the index, so only rows whose index appears in the mode result get a non-NaN replacement).
So how about
fill = data_penyakit['rbc'].mode().iloc[0]
rbc.fillna(value=fill, inplace=True)
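For a self-contained illustration of this fix (a tiny made-up series standing in for data_penyakit['rbc']; the real column only shows 'normal' values, 'abnormal' is invented here):
import pandas as pd

rbc = pd.Series([None, None, 'normal', 'normal', 'abnormal'], name='rbc')
fill = rbc.mode().iloc[0]   # take only the first (most frequent) value, here 'normal'
rbc = rbc.fillna(fill)      # both missing entries are now replaced
print(rbc)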

Related

RLlib PPO continuous actions seem to become nan after total_loss = inf?

After some amount of training on a custom multi-agent environment using RLlib's (1.4.0) PPO network, I found that my continuous actions turn into NaN (explode?), which is probably caused by a bad gradient update, which in turn depends on the loss/objective function.
As I understand it, PPO's loss function relies on three terms:
The PPO Gradient objective [depends on outputs of old policy and new policy, the advantage, and the "clip" parameter=0.3, say]
The Value Function Loss
The Entropy Loss [mainly there to encourage exploration]
Total Loss = PPO Gradient objective (clipped) - vf_loss_coeff * VF Loss + entropy_coeff * entropy.
I have set the entropy coefficient to 0, so I am focusing on the other two terms contributing to the total loss. As seen in the progress table below, the problem area is where the total loss becomes inf. The only change I found is that the policy loss was negative on every row until row #445.
So my question is: can anyone explain what the policy loss is supposed to look like, and whether this is normal? How do I resolve this issue of continuous actions becoming NaN after a while? Is it just a question of lowering the learning rate?
EDIT
Here's a link to the related question (if you need more context)
END OF EDIT
I would really appreciate any tips! Thank you!
row   Total loss            policy loss                VF loss
430   6.068537              -0.053691725999999995      6.102932
431   5.9919114             -0.046943977000000005      6.0161843
432   8.134636              -0.05247503                8.164852
433   4.222730599999999     -0.048518334               4.2523246
434   6.563492              -0.05237444                6.594456
435   8.171028999999999     -0.048245672               8.198222999999999
436   8.948264              -0.048484523               8.976327000000001
437   7.556602000000001     -0.054372005               7.5880575
438   6.124418              -0.05249534                6.155608999999999
439   4.267647              -0.052565258               4.2978816
440   4.912957700000001     -0.054498855               4.9448576
441   16.630292999999998    -0.043477765999999994      16.656229
442   6.3149705             -0.057527818               6.349851999999999
443   4.2269225             -0.05446908599999999       4.260793700000001
444   9.503102              -0.052135203               9.53277
445   inf                   0.2436709                  4.410831
446   nan                   -0.00029848056             22.596403
447   nan                   0.00013323531              0.00043436907999999994
448   nan                   1.5656527000000002e-05     0.0002645221
449   nan                   1.3344318000000001e-05     0.0003139485
450   nan                   6.941916999999999e-05      0.00025863337
451   nan                   0.00015686743              0.00013607396
452   nan                   -5.0206604e-06             0.00027541115000000003
453   nan                   -4.5543664e-05             0.0004247162
454   nan                   8.841756999999999e-05      0.00020278389999999998
455   nan                   -8.465959e-05              9.261127e-05
456   nan                   3.8680790000000003e-05     0.00032097592999999995
457   nan                   2.7373152999999996e-06     0.0005146417
458   nan                   -6.271608e-06              0.0013273798000000001
459   nan                   -0.00013192794             0.00030621013
460   nan                   0.00038987884              0.00038019830000000004
461   nan                   -3.2747877999999998e-06    0.00031471922
462   nan                   -6.9349815e-05             0.00038836736000000006
463   nan                   -4.666238e-05              0.0002851575
464   nan                   -3.7067155e-05             0.00020161088
465   nan                   3.0623291e-06              0.00019258813999999998
466   nan                   -8.599938e-06              0.00036465342000000005
467   nan                   -1.1529375e-05             0.00016500981
468   nan                   -3.0851965e-07             0.00022042097
469   nan                   -0.0001133984              0.00030230957999999997
470   nan                   -1.0735256e-05             0.00034000343000000003
It appears that RLlib's PPO configuration of grad_clip is way too big (grad_clip=40). I changed it to grad_clip=4 and it worked.
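As an illustration only (not the poster's exact setup), lowering grad_clip in RLlib's dict-style config might look like this; "Pendulum-v0" is just a stand-in for the custom multi-agent environment from the question:
import ray
from ray.rllib.agents import ppo

config = ppo.DEFAULT_CONFIG.copy()
config["grad_clip"] = 4.0        # the answer above reports that 40 was far too large
config["entropy_coeff"] = 0.0    # as in the question

ray.init()
trainer = ppo.PPOTrainer(config=config, env="Pendulum-v0")
for _ in range(5):
    print(trainer.train()["episode_reward_mean"])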
I met the same problem when running the RLlib example; I also posted my problem in this issue. I am also running PPO in a continuous and bounded action space. PPO outputs actions that are quite large and finally crashes due to a NaN-related error.
For me, it seems that when the log_std of the action's normal distribution is too large, very large actions (about 1e20) appear. I copied the loss-calculation code from RLlib's (v1.10.0) ppo_torch_policy.py and pasted it below.
logp_ratio = torch.exp(
    curr_action_dist.logp(train_batch[SampleBatch.ACTIONS]) -
    train_batch[SampleBatch.ACTION_LOGP])

action_kl = prev_action_dist.kl(curr_action_dist)
mean_kl_loss = reduce_mean_valid(action_kl)

curr_entropy = curr_action_dist.entropy()
mean_entropy = reduce_mean_valid(curr_entropy)

surrogate_loss = torch.min(
    train_batch[Postprocessing.ADVANTAGES] * logp_ratio,
    train_batch[Postprocessing.ADVANTAGES] * torch.clamp(
        logp_ratio, 1 - self.config["clip_param"],
        1 + self.config["clip_param"]))
For such large actions, the log probability curr_action_dist.logp(train_batch[SampleBatch.ACTIONS]) computed by <class 'torch.distributions.normal.Normal'> will be -inf, and then curr_action_dist.logp(train_batch[SampleBatch.ACTIONS]) - train_batch[SampleBatch.ACTION_LOGP] returns NaN. torch.min and torch.clamp still keep the NaN output (refer to the docs).
So, in conclusion, I guess the NaN is caused by the -inf log probability of very large actions, which torch then fails to clip according to the "clip" parameter.
The difference is that I do not set entropy_coeff to zero. In my case, the std is encouraged to be as large as possible, since the entropy is computed for the full normal distribution instead of the distribution restricted to the action space. I am not sure whether you get a large σ as I do. In addition, I am using PyTorch; things may be different for TF.
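To make that mechanism concrete, here is a small PyTorch sketch of my own (not RLlib code) showing how a huge action gives an -inf log probability and how the resulting NaN survives clamp and min:
import torch
from torch.distributions import Normal

dist = Normal(torch.tensor(0.0), torch.tensor(1.0))
huge_action = torch.tensor(1e20)                  # an exploded action like those described above
logp_new = dist.log_prob(huge_action)             # -inf: the squared term overflows float32
logp_old = dist.log_prob(huge_action)             # also -inf
ratio = torch.exp(logp_new - logp_old)            # -inf - (-inf) = nan, and exp(nan) = nan
clipped = torch.clamp(ratio, 1 - 0.3, 1 + 0.3)    # clamp does not remove the nan
print(ratio, clipped, torch.min(ratio, clipped))  # nan nan nan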

GIMP palette file (.gpl) format / syntax?

I'm looking for the exact specifications of this file format. Anyone got a link? Or want to comment?
I have spent the better part of the day searching, yet I keep getting directed back to the GIMP online user manual. It says "look at a .gpl file and you will see it is easy" to build one manually with a text editor. I don't actually have GIMP, but I see examples online. Yep, easy. EXCEPT:
• What meaning do the color names ultimately have? Are they purely semantic, or does a program rely on them? If the latter, then what if there are two (2) or more colors with the same name?
• What does the "Columns" line do?
I've seen examples that have no "Columns" line.
I've seen examples with values of 0, 4, and 16, yet this does not correspond in any way I can see to the color data. I see 3 columns of decimal sRGB values and an optional 4th column with the color name; I seem to remember the example with "Columns 4" had no color names, only the 3 RGB columns.
• Do columns of RGB values need to "line up"? Or will the following example from my output algorithm work? (from the Crayola palette):
159 129 112 Beaver
253 124 110 Bittersweet
0 0 0 Black
172 229 238 Blizzard Blue
31 117 254 Blue
162 162 208 Blue Bell
102 153 204 Blue Gray
13 152 186 Blue Green
• Does this format accept sRGBA colors? If so, how is the "A" value defined (0-1, 0%-100%, 0-127, 0-255, etc.)? I seem to remember that when creating .png files with PHP, the "A" value was 7-bit.
• How exactly do you add comments / metadata?
Today I saw an example saying that lines beginning with # are comments, or that anything after a # on a line is a comment. Yesterday I thought (maybe I'm confused) I saw an example saying that comment lines begin with ;
• Is any other data-format supported?
Originally I thought the text line just before the color data, which I see in every example, indicated the format ("#" signifying decimal sRGB), until today, when I realized it is just an empty comment line.
• What line ending character(s) can / must I use?
\n
\r
• What character-encodings can I use? ASCII only? ¿UTF-8 ☺ with extended ♪♫ charset (¡hopefully!)?
• Anything I'm missing? Any other options available?
Here is an example from http://gimpchat.com/viewtopic.php?f=8&t=3375#
GIMP Palette
Name: bugslife_final.png-10
Columns: 16
#
191 180 180 Index 0
163 158 157 Index 1
145 136 132 Index 2
130 125 112 Index 3
… … …
56 50 49 Index 29
41 38 38 Index 30
23 23 23 Index 31
242 245 213 Index 32
227 232 181 Index 33
210 217 147 Index 34
195 204 118 Index 35
… … …
0 0 0 Index 251
0 0 0 Index 252
0 0 0 Index 253
0 0 0 Index 254
0 0 0 Index 255
Aloha!
Looking at the source code:
• Columns is just an indication for display in the palette editor.
• Comments must start with a #. In non-empty lines that don't, the first three tokens are parsed as numbers.
• There is no alpha support.
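Based on those points, a minimal parser sketch of my own (not GIMP's code) might look like this: it skips the header lines and "#" comments, reads the first three tokens of each remaining line as R, G, B, and keeps the rest as an optional name.
def parse_gpl(path):
    colors = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue                      # comment or blank line
            if line.startswith(("GIMP Palette", "Name:", "Columns:")):
                continue                      # header lines
            tokens = line.split()
            r, g, b = (int(t) for t in tokens[:3])
            name = " ".join(tokens[3:]) or None   # color name is optional
            colors.append((r, g, b, name))
    return colors

# colors = parse_gpl("bugslife_final.gpl")   # hypothetical file name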

GridSearchCV freezing with linear svm

I have a problem with GridSearchCV freezing (the CPU is active but the program is not advancing) with a linear SVM (with an RBF SVM it works fine).
Depending on the random_state I use for splitting my data, the freeze happens at different CV split points and for different numbers of PCA components.
The features of one sample look like the following (there are about 39 features):
[1 117 137 2 80 16 2 39 228 88 5 6 0 10 13 6 22 23 1 227 246 7 1.656934307 0 5 0.434195726 0.010123735 0.55568054 5 275 119.48398 0.9359527 0.80484825 3.1272728 98 334 526 0.13454546 0.10181818]
Another sample's features:
[23149 4 31839 9 219 117 23 5 31897 12389 108 2 0 33 23 0 0 18 0 0 0 23149 0 0 74 0.996405221 0.003549844 4.49347E-05 74 5144 6.4480677 0.286384 0.9947901 3.833787 20 5135 14586 0.0060264384 0.011664075]
If I delete the last 10 features I don't have this problem (before I added these 10 new features, my code worked fine). I did not check other combinations of the 10 new features to see whether a specific feature is causing the problem.
I also use StandardScaler to scale the features but still face this issue. The problem occurs less often with MinMaxScaler (but I read somewhere that it is not a good choice for SVMs).
I also set n_jobs to different values; it advances a little further but then freezes again.
What do you suggest?
I followed part of this code to write my code:
TypeError grid search
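For reference, a minimal self-contained sketch of the kind of setup described above (random stand-in data and a placeholder parameter grid; note that sklearn's SVC defaults to max_iter=-1, i.e. no iteration cap, so a hard fit can look like a freeze):
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

X = np.random.rand(200, 39)                  # stand-in for the real 39-feature data
y = np.random.randint(0, 2, 200)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA()),
    ("svm", SVC(kernel="linear")),
])
param_grid = {
    "pca__n_components": [10, 20, 30],       # placeholder values
    "svm__C": [0.1, 1, 10],
}
search = GridSearchCV(pipe, param_grid, cv=5, n_jobs=1, verbose=2)
search.fit(X, y)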

NMF Sparse Matrix Analysis (using SKlearn)

Just looking for some brief advice to put me back on the right track. I have been working on a solution to a problem where I have a very sparse input matrix (~25% of the entries filled, the rest are 0s) stored in a sparse.coo_matrix:
sparse_matrix = sparse.coo_matrix((value, (rater, blurb))).toarray()
After some work on building this array from my data set and messing around with some other options, I currently have my NMF model fitter function defined as follows:
import numpy as np
from sklearn.decomposition import NMF

def nmf_model(matrix):
    model = NMF(init='nndsvd', random_state=0)
    W = model.fit_transform(matrix)   # (n_rows, n_components)
    H = model.components_             # (n_components, n_cols)
    result = np.dot(W, H)             # reconstructed matrix
    return result
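A self-contained way to exercise this function, with made-up (rater, blurb, value) arrays standing in for the real data (duplicate (rater, blurb) pairs are summed by coo_matrix):
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
rater = rng.integers(0, 40, size=400)                 # made-up row indices
blurb = rng.integers(0, 30, size=400)                 # made-up column indices
value = rng.integers(1, 11, size=400).astype(float)   # ratings between 1 and 10

sparse_matrix = sparse.coo_matrix((value, (rater, blurb)), shape=(40, 30)).toarray()
result = nmf_model(sparse_matrix)
print(sparse_matrix[35, 18], result[35, 18])          # original entry vs. its reconstruction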
Now, the issue is that my output doesn't seem to account for the 0 values correctly. Any value that was a 0 gets bumped to some value less than 1, and my known values fluctuate from the actual quite a bit (all data are ratings between 1 and 10). Can anyone spot what I am doing wrong? From the scikit-learn documentation, I assumed using the nndsvd initialization would help account for the empty values correctly. Sample output:
#Row / Column / New Value
35 18 6.50746917334 #Actual Value is 6
35 19 0.580996641675 #Here down are all "estimates" of my function
35 20 1.26498699492
35 21 0.00194119935464
35 22 0.559623469753
35 23 0.109736902936
35 24 0.181657421405
35 25 0.0137801897011
35 26 0.251979684515
35 27 0.613055371646
35 28 6.17494590041 #Actual values is 5.5
Appreciate any advice any more experienced ML coders can offer!

How correlation help in matching two images?

It is known that when searching the entire image for the window that best matches the current window, the location where the correlation is maximised is the matching window.
[22 12 14] (window)
(image)
[22 12 34 54 ]
[112 34 54 111 ]
[12 22 12 34 ]
[11 22 12 14 ]
But correlation is the sum of products of the corresponding values in the two windows. So, windows with high intensity values will always give a better match; e.g. in the above example we get a higher correlation value for the 2nd row.
Probably you need normalized cross-correlation; then the maximum will be in the 4th row.
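As an illustration (my own sketch, not a library routine), normalized cross-correlation subtracts each patch's mean and divides by the norms, so bright regions no longer dominate; sliding the window over each row of the example image puts the maximum in the 4th row:
import numpy as np

def ncc(a, b):
    a = a - a.mean()
    b = b - b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a.dot(b) / denom) if denom else 0.0

window = np.array([22, 12, 14], dtype=float)
image = np.array([[22, 12, 34, 54],
                  [112, 34, 54, 111],
                  [12, 22, 12, 34],
                  [11, 22, 12, 14]], dtype=float)

for r, row in enumerate(image):
    scores = [ncc(window, row[c:c + window.size]) for c in range(row.size - window.size + 1)]
    print(r, [round(s, 3) for s in scores])   # row 3 contains a perfect match (score 1.0)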

Resources