fill nan with "ND" if before the nan there is a value - fillna

I have a dataframe like this:
     A    B    C
1   32  nan  nan
2   32  nan  nan
3  nan  nan   14
4  nan  nan  nan
My desired output is the following:
     A    B    C
1   32  nan  nan
2   32  nan  nan
3   ND  nan   14
4   ND  nan   ND
Basically, I need:
To leave columns that contain only nan values untouched (see column B)
To fill nan with "ND" only after a value, not before it (see column C)
Could you help me, please?
Thanks

I found the following solution:
df = df.where(df.ffill().isna(), df.fillna("ND"))
However, if you try to run this line you will get the following error:
TypeError: <U2 cannot be converted to an IntegerDtype
I worked around it by filling with a sentinel integer first and then replacing the sentinel with "ND":
df = df.where(df.ffill().isna(), df.fillna(123456789123456789))
df = df.replace(123456789123456789, "ND")
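An alternative that avoids the sentinel value can be sketched as follows. This is only a sketch under the assumption that losing the numeric dtype is acceptable: the frame is cast to object so that strings and numbers can share a column, and the example frame is rebuilt here from the question's data.

```python
import numpy as np
import pandas as pd

# Example frame rebuilt from the question (index 1..4, columns A, B, C).
df = pd.DataFrame(
    {
        "A": [32, 32, np.nan, np.nan],
        "B": [np.nan] * 4,
        "C": [np.nan, np.nan, 14, np.nan],
    },
    index=[1, 2, 3, 4],
)

# True where a cell is NaN but an earlier row of the same column holds a value.
nan_after_value = df.isna() & df.ffill().notna()

# Cast to object first so that "ND" can sit next to numbers without a dtype error.
result = df.astype(object).mask(nan_after_value, "ND")
print(result)
```

Column B never has a value to forward-fill, so its mask is False everywhere and it stays all nan.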

Related

RLlib PPO continuous actions seem to become nan after total_loss = inf?

After some amount of training on a custom multi-agent environment using RLlib's (1.4.0) PPO implementation, I found that my continuous actions turn into nan (they explode?). This is probably caused by a bad gradient update, which in turn depends on the loss/objective function.
As I understand it, PPO's loss function relies on three terms:
The PPO Gradient objective [depends on outputs of old policy and new policy, the advantage, and the "clip" parameter=0.3, say]
The Value Function Loss
The Entropy Loss [mainly there to encourage exploration]
Total Loss = PPO Gradient objective (clipped) - vf_loss_coeff * VF Loss + entropy_coeff * entropy.
I have set the entropy coefficient to 0, so I am focusing on the other two terms that contribute to the total loss. In the progress table below, the problem area is the row where the total loss becomes inf. The only change I can see is that the policy loss was negative in every row until row #445.
So my question is: Can anyone explain what policy loss is supposed to look like and if this is normal? How do I resolve this issue with continuous actions becoming nan after a while? Is it just a question of lowering the learning rate?
EDIT
Here's a link to the related question (if you need more context)
END OF EDIT
I would really appreciate any tips! Thank you!
Row  Total loss          policy loss             VF loss
430  6.068537            -0.053691725999999995   6.102932
431  5.9919114           -0.046943977000000005   6.0161843
432  8.134636            -0.05247503             8.164852
433  4.222730599999999   -0.048518334            4.2523246
434  6.563492            -0.05237444             6.594456
435  8.171028999999999   -0.048245672            8.198222999999999
436  8.948264            -0.048484523            8.976327000000001
437  7.556602000000001   -0.054372005            7.5880575
438  6.124418            -0.05249534             6.155608999999999
439  4.267647            -0.052565258            4.2978816
440  4.912957700000001   -0.054498855            4.9448576
441  16.630292999999998  -0.043477765999999994   16.656229
442  6.3149705           -0.057527818            6.349851999999999
443  4.2269225           -0.05446908599999999    4.260793700000001
444  9.503102            -0.052135203            9.53277
445  inf                 0.2436709               4.410831
446  nan                 -0.00029848056          22.596403
447  nan                 0.00013323531           0.00043436907999999994
448  nan                 1.5656527000000002e-05  0.0002645221
449  nan                 1.3344318000000001e-05  0.0003139485
450  nan                 6.941916999999999e-05   0.00025863337
451  nan                 0.00015686743           0.00013607396
452  nan                 -5.0206604e-06          0.00027541115000000003
453  nan                 -4.5543664e-05          0.0004247162
454  nan                 8.841756999999999e-05   0.00020278389999999998
455  nan                 -8.465959e-05           9.261127e-05
456  nan                 3.8680790000000003e-05  0.00032097592999999995
457  nan                 2.7373152999999996e-06  0.0005146417
458  nan                 -6.271608e-06           0.0013273798000000001
459  nan                 -0.00013192794          0.00030621013
460  nan                 0.00038987884           0.00038019830000000004
461  nan                 -3.2747877999999998e-06 0.00031471922
462  nan                 -6.9349815e-05          0.00038836736000000006
463  nan                 -4.666238e-05           0.0002851575
464  nan                 -3.7067155e-05          0.00020161088
465  nan                 3.0623291e-06           0.00019258813999999998
466  nan                 -8.599938e-06           0.00036465342000000005
467  nan                 -1.1529375e-05          0.00016500981
468  nan                 -3.0851965e-07          0.00022042097
469  nan                 -0.0001133984           0.00030230957999999997
470  nan                 -1.0735256e-05          0.00034000343000000003
It appears that the grad_clip value in RLlib's PPO configuration was far too large (grad_clip=40). I changed it to grad_clip=4 and it worked.
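To make the effect of a smaller grad_clip concrete, here is a minimal NumPy sketch. It assumes that grad_clip acts as a cap on the global gradient norm; the helper clip_by_global_norm and the example gradients are hypothetical illustrations, not RLlib code.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale all gradients together so their joint L2 norm is at most max_norm."""
    total_norm = float(np.sqrt(sum(float(np.sum(g ** 2)) for g in grads)))
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [g * scale for g in grads], total_norm

# Hypothetical gradients for two parameter tensors; their joint norm is 50.
grads = [np.array([30.0, 40.0]), np.array([0.0])]

clipped_40, norm = clip_by_global_norm(grads, 40.0)  # mild rescaling
clipped_4, _ = clip_by_global_norm(grads, 4.0)       # much smaller update
print(norm, clipped_40[0], clipped_4[0])
```

Under this assumption, a cap of 4 shrinks a single bad update by a further factor of ten compared with a cap of 40, which may be enough to keep the policy parameters finite.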
I met the same problem when running the RLlib example, and I also posted it in this issue. I am also running PPO with a continuous, bounded action space. PPO outputs actions that are quite large, and training finally crashes due to a NaN-related error.
For me, it seems that large actions (about 1e20) appear when the log_std of the normal action distribution is too large. I copied the loss computation from ppo_torch_policy.py in RLlib (v1.10.0) and pasted it below.
logp_ratio = torch.exp(
    curr_action_dist.logp(train_batch[SampleBatch.ACTIONS]) -
    train_batch[SampleBatch.ACTION_LOGP])
action_kl = prev_action_dist.kl(curr_action_dist)
mean_kl_loss = reduce_mean_valid(action_kl)
curr_entropy = curr_action_dist.entropy()
mean_entropy = reduce_mean_valid(curr_entropy)
surrogate_loss = torch.min(
    train_batch[Postprocessing.ADVANTAGES] * logp_ratio,
    train_batch[Postprocessing.ADVANTAGES] * torch.clamp(
        logp_ratio, 1 - self.config["clip_param"],
        1 + self.config["clip_param"]))
For such large actions, the log-probability curr_action_dist.logp(train_batch[SampleBatch.ACTIONS]) computed by <class 'torch.distributions.normal.Normal'> is -inf. The difference curr_action_dist.logp(train_batch[SampleBatch.ACTIONS]) - train_batch[SampleBatch.ACTION_LOGP] then evaluates to NaN (as happens, for instance, when both terms are -inf), and torch.min and torch.clamp both keep the NaN in their output (refer to the docs).
So in conclusion, I guess that the NaN is caused by the -inf log-probability of very large actions, which torch then fails to clip according to the "clip" parameter.
The difference is that I do not set entropy_coeff to zero. In my case, the standard deviation is encouraged to be as large as possible, since the entropy is computed for the full normal distribution instead of the distribution restricted to the action space. I am not sure whether you get a large σ as I do. In addition, I am using PyTorch; things may be different for TF.

problem with missing value. Does not work for every missing value?

I want my missing values to be replaced by the mode of the given data, but my code replaces only one of the missing values. Why?
My real data is:
0 NaN
1 NaN
2 normal
3 normal
4 normal
...
395 normal
396 normal
397 normal
398 normal
399 normal
Name: rbc, Length: 400, dtype: object
My code is:
rbc = data_penyakit['rbc'].mode()
rbc = data_penyakit['rbc'].mask(pd.isna, rbc)
rbc
and the result is:
0 normal
1 NaN
2 normal
3 normal
4 normal
...
395 normal
396 normal
397 normal
398 normal
399 normal
Name: rbc, Length: 400, dtype: object
Why is the second missing value not replaced?
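mode() returns a Series, not a single value (here it has one entry, at index 0). mask aligns that Series with your column by index, so only row 0 finds a replacement and row 1 stays NaN. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.mode.html
So how about taking the first mode as a scalar and filling with that:
fill = data_penyakit['rbc'].mode().iloc[0]
rbc = data_penyakit['rbc'].fillna(fill)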
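The difference between passing the result of mode() directly and passing its first element can be seen on a small example. This is a sketch on a made-up five-row Series, not the asker's data.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the 'rbc' column.
rbc = pd.Series([np.nan, np.nan, "normal", "normal", "abnormal"], name="rbc")

modes = rbc.mode()  # a Series; NaN is dropped by default

# Passing the Series: values are matched by index label, so only row 0 is filled.
aligned = rbc.mask(rbc.isna(), modes)

# Passing a scalar: every missing value is filled.
filled = rbc.fillna(modes.iloc[0])

print(aligned.tolist())
print(filled.tolist())
```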

How to create a dask time series data frame with multiple columns

I'm having trouble creating a dask time series dataframe that calculates the mean per hour over multiple columns.
This is an example of my input csv file:
name,date_time,num
dan,2019-01-02 00:00:00,3
ben,2019-01-02 00:00:00,7
dan,2019-01-02 02:00:00,13
dan,2019-01-02 10:00:00,9
dan,2019-01-02 10:01:00,3
ben,2019-01-02 14:22:00,66
ben,2019-01-02 14:37:00,37
I can produce the desired output using pandas:
import pandas as pd
from matplotlib import pyplot
df = pd.read_csv('my_file.csv')
df['timestamp'] = pd.to_datetime(df.date_time)
df = df.set_index(df.timestamp) # set a datetime index
df = df.groupby('name').resample('H')['num'].mean().unstack('name')
df.fillna(0).plot()
Desired output
name                  ben   dan
timestamp
2019-01-02 00:00:00   7.0   3.0
2019-01-02 01:00:00   NaN   NaN
2019-01-02 02:00:00   NaN  13.0
2019-01-02 03:00:00   NaN   NaN
2019-01-02 04:00:00   NaN   NaN
2019-01-02 05:00:00   NaN   NaN
2019-01-02 06:00:00   NaN   NaN
2019-01-02 07:00:00   NaN   NaN
2019-01-02 08:00:00   NaN   NaN
2019-01-02 09:00:00   NaN   NaN
2019-01-02 10:00:00   NaN   6.0
2019-01-02 11:00:00   NaN   NaN
2019-01-02 12:00:00   NaN   NaN
2019-01-02 13:00:00   NaN   NaN
2019-01-02 14:00:00  51.5   NaN
My attempt to produce the same dataframe with dask
from dask import dataframe as dd
from matplotlib import pyplot
ddf = dd.read_csv('my_file.csv')
# setting an index
ddf['timestamp'] = dd.to_datetime(ddf.date_time)
ddf = ddf.set_index(ddf.timestamp)
ddf = ddf.repartition(freq='MS')  # repartition returns a new dataframe
ddf.groupby('name').resample('H')['num'].mean()
When I run the code above I get this error:
AttributeError: 'Column not found: resample'
This has me really stumped and any help would be appreciated.
It looks like dask dataframe does not implement a groupby-resample operation. It sounds like you have a feature request. I recommend raising an issue at https://github.com/dask/dask/issues/new
See https://docs.dask.org/en/latest/support.html#asking-for-help for guidance on where to ask for help.
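Until such a feature exists, one possible workaround is to replace the groupby-resample with a plain two-key groupby on the name and the timestamp rounded down to the hour. The sketch below uses pandas only, to stay self-contained; the assumption is that the same two steps (rounding with dt.floor, then grouping on two columns) carry over to a dask dataframe, with the final gap-filling done on the small computed result.

```python
import io
import pandas as pd

CSV = """name,date_time,num
dan,2019-01-02 00:00:00,3
ben,2019-01-02 00:00:00,7
dan,2019-01-02 02:00:00,13
dan,2019-01-02 10:00:00,9
dan,2019-01-02 10:01:00,3
ben,2019-01-02 14:22:00,66
ben,2019-01-02 14:37:00,37
"""

df = pd.read_csv(io.StringIO(CSV), parse_dates=["date_time"])

# Step 1: round every timestamp down to the start of its hour.
df["hour"] = df["date_time"].dt.floor("h")

# Step 2: an ordinary two-key groupby stands in for groupby(...).resample(...).
hourly = df.groupby(["name", "hour"])["num"].mean().unstack("name")

# Step 3: on the small aggregated result, insert the empty hours as NaN rows.
hourly = hourly.asfreq("h")
print(hourly)
```

Unlike resample, the groupby alone only produces hours that actually contain data, which is why the asfreq step is needed to match the desired output.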

Multi-class classification in sparse dataset

I have a dataset of factory workstations.
Two kinds of error are recorded for the same time intervals:
The user selects an error type and a time interval (dependent variable, y)
The machines produce errors during production (independent variables, x)
There are 8 unique user-selected error types in total, and I tried to predict them from the machine-produced errors (188 types in total) plus some other numerical features such as average machine speed, machine volume, etc.
Each row represents a user-selected error in a particular time interval.
For example, in the first row the user selected the time interval:
2018-01-03 12:02:00 - 2018-01-03 12:05:37
and m_er_1 (machine error 1) also occurred 12 times in the same time interval.
m_er_1_dur (machine error 1 duration) is the total duration of that machine error in seconds.
I matched the two tables, and the result looks like this:
user_error  m_er_1  m_er_2  m_er_3  ...  m_er_188  avg_m_speed  ..  m_er_1_dur
A               12       0       0              0          150             217
B                0       0       2              0           10               0
A                3       0       0              6           34              37
A                0       0       0              0            5               0
D                0       0       0              0            3               0
E                0       0       0              0         1000               0
In the end, I have 1900 rows and 390 columns (376 (188 machine error counts + 188 machine error durations) + 14 numerical features), and because of the machine errors it is a sparse dataset with lots of zeros.
There are no outliers and no nan values. I normalized the data and tried several classification algorithms (SVM, Logistic Regression, MLPC, XGBoost, etc.).
I also tried PCA, but it didn't work well: with 165 components the explained_variance_ratio is around 0.95.
But the scores are very low: for logistic regression the accuracy is 0.55 and the MCC is around 0.1; recall, F1 and precision are also very low.
Are there some steps that I am missing? What would you suggest for multiclass classification on a sparse dataset?
Thanks in advance

GLMM glmer and glmmADMB - comparison error

I am trying to test whether the number of seeds obtained differs among five populations under different applied treatments, with maternal plant and paternal plant as random effects. First I tried to fit a glmer model.
dat <- dat[, c(12, 7, 6, 13, 8, 11)]
dat$parents <- factor(paste(dat$mother, dat$father, sep = "_"))
compareTreat <- function(d)
{
  d$treatment <- factor(d$treatment)
  print(tapply(d$pop, list(d$pop, d$treatment), length))
  print(summary(fit <- glmer(seed_no ~ treatment + (1 | pop/mother) +
                               (1 | pop/father), data = d, family = "poisson")))
}
Then I compared two treatments in two populations (pop 64 and pop 121 in this case). The other populations do not have these particular treatments, so I get NA values for those.
compareTreat(subset(dat,treatment%in%c("IE 5x","IE 7x")&pop%in%c(64,121)))
This is the output:
    IE 5x IE 7x
10     NA    NA
45     NA    NA
64     31    27
121    33    28
144    NA    NA
Generalized linear mixed model fit by maximum likelihood (Laplace
Approximation) [glmerMod]
Family: poisson ( log )
Formula: seed_no ~ treatment + (1 | pop/mother) + (1 | pop/father)
Data: d
     AIC      BIC   logLik deviance df.resid
   592.5    609.2   -290.2    580.5      113
Scaled residuals:
    Min      1Q  Median      3Q     Max
-1.8950 -0.8038 -0.2178  0.4440  1.7991
Random effects:
 Groups     Name        Variance  Std.Dev.
 father.pop (Intercept) 3.566e-01 5.971e-01
 mother.pop (Intercept) 9.456e-01 9.724e-01
 pop        (Intercept) 1.083e-10 1.041e-05
 pop.1      (Intercept) 1.017e-10 1.008e-05
Number of obs: 119, groups: father:pop, 81; mother:pop, 24; pop, 2
Fixed effects:
               Estimate Std. Error z value Pr(>|z|)
(Intercept)     0.74664    0.24916   2.997  0.00273 **
treatmentIE 7x -0.05789    0.17894  -0.324  0.74629
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Fixed Effects:
            (Intr)
tretmntIE7x -0.364
It seems there are no differences between treatments. But as there are many zeros in the data, a zero-inflated model would be worth trying. I tried glmmADMB and wrote the script like this:
compareTreat <- function(d)
{
  d$treatment <- factor(d$treatment)
  print(tapply(d$pop, list(d$pop, d$treatment), length))
  print(summary(fit_zip <- glmmadmb(seed_no ~ treatment + (1 | pop/mother) +
                                      (1 | pop/father), data = d, family = "poisson",
                                    zeroInflation = TRUE)))
}
Then I compared the treatments again, using the same call as before:
compareTreat(subset(dat,treatment%in%c("IE 5x","IE 7x")&pop%in%c(64,121)))
But in this case, the output is:
    IE 5x IE 7x
10     NA    NA
45     NA    NA
64     31    27
121    33    28
144    NA    NA
Error in pop:father : NA/NaN argument
In addition: Warning messages:
1: In pop:father :
numerical expression has 119 elements: only the first used
2: In pop:father :
numerical expression has 119 elements: only the first used
3: In eval(parse(text = x), data) : NAs introduced by coercion
Called from: eval(parse(text = x), data)
I tried changing everything I could think of, but I still don't know where the problem is.
If I remove the (1|pop/father) term from the glmmadmb script, the model runs, but that does not feel correct. I wonder whether the mistake is in the function code before the glmmadmb call (although it worked fine with the glmer model) or in the comparison itself after the model. I also tried removing NAs with na.omit in case that was the issue, but it made no difference. Why does the script stop instead of continuing to run?
I am a beginner student working in RStudio; my R version is 3.4.2, called "Short Summer". If someone with experience could point me in the right direction, I would be very grateful!
H.
