GLMM glmer and glmmADMB comparison error

I am trying to test whether there are differences in the number of seeds obtained in five different populations under different treatments, with maternal plant and paternal plant as random effects. First I tried to fit a glmer model.
# keep only the columns of interest
dat <- dat[, c(12, 7, 6, 13, 8, 11)]
# combined parental identifier
dat$parents <- factor(paste(dat$mother, dat$father, sep = "_"))

compareTreat <- function(d) {
  d$treatment <- factor(d$treatment)
  # cross-tabulation of population by treatment
  print(tapply(d$pop, list(d$pop, d$treatment), length))
  # Poisson GLMM with mother and father nested within population
  print(summary(fit <- glmer(seed_no ~ treatment + (1 | pop/mother) +
                               (1 | pop/father),
                             data = d, family = "poisson")))
}
Then, I compared two treatments in two populations (pop 64 and pop 121, in this case). The other populations do not have these particular treatments, so I get NA values for them.
compareTreat(subset(dat, treatment %in% c("IE 5x", "IE 7x") & pop %in% c(64, 121)))
This is the output:
    IE 5x IE 7x
10     NA    NA
45     NA    NA
64     31    27
121    33    28
144    NA    NA
Generalized linear mixed model fit by maximum likelihood (Laplace Approximation) [glmerMod]
 Family: poisson  ( log )
Formula: seed_no ~ treatment + (1 | pop/mother) + (1 | pop/father)
   Data: d

     AIC      BIC   logLik deviance df.resid
   592.5    609.2   -290.2    580.5      113

Scaled residuals:
    Min      1Q  Median      3Q     Max
-1.8950 -0.8038 -0.2178  0.4440  1.7991

Random effects:
 Groups     Name        Variance  Std.Dev.
 father:pop (Intercept) 3.566e-01 5.971e-01
 mother:pop (Intercept) 9.456e-01 9.724e-01
 pop        (Intercept) 1.083e-10 1.041e-05
 pop.1      (Intercept) 1.017e-10 1.008e-05
Number of obs: 119, groups: father:pop, 81; mother:pop, 24; pop, 2

Fixed effects:
               Estimate Std. Error z value Pr(>|z|)
(Intercept)     0.74664    0.24916   2.997  0.00273 **
treatmentIE 7x -0.05789    0.17894  -0.324  0.74629
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Correlation of Fixed Effects:
            (Intr)
tretmntIE7x -0.364
It seems there are no differences between treatments. But as there are many zeros in the data, a zero-inflated model would be worth trying. I tried glmmADMB, and I wrote the script like this:
compareTreat <- function(d) {
  d$treatment <- factor(d$treatment)
  print(tapply(d$pop, list(d$pop, d$treatment), length))
  # same model as above, but zero-inflated
  print(summary(fit_zip <- glmmadmb(seed_no ~ treatment + (1 | pop/mother) +
                                      (1 | pop/father),
                                    data = d, family = "poisson",
                                    zeroInflation = TRUE)))
}
Then I compared the treatments again, without changing the call:
compareTreat(subset(dat, treatment %in% c("IE 5x", "IE 7x") & pop %in% c(64, 121)))
But in this case, the output is:
    IE 5x IE 7x
10     NA    NA
45     NA    NA
64     31    27
121    33    28
144    NA    NA
Error in pop:father : NA/NaN argument
In addition: Warning messages:
1: In pop:father :
numerical expression has 119 elements: only the first used
2: In pop:father :
numerical expression has 119 elements: only the first used
3: In eval(parse(text = x), data) : NAs introduced by coercion
Called from: eval(parse(text = x), data)
I have tried changing everything I could think of, but I still don't know where the problem is.
If I remove (1|pop/father) from the glmmadmb call, the model runs, but that does not feel correct. I wonder whether the mistake is in the function before the glmmadmb call (although the same code worked fine with glmer), or in the comparison itself after the model. I also tried removing NAs with na.omit in case that was the issue, but it made no difference. Why does the script stop instead of continuing?
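One hint I found in the warnings: "numerical expression has 119 elements" suggests that R evaluated pop:father with the sequence operator ":", which would happen if pop and father are plain numeric vectors (glmmADMB seems to evaluate the grouping terms itself, unlike glmer). Would converting everything to factors and spelling out the nesting explicitly help? Something like this (an untested sketch):

# sketch of a possible workaround: make the grouping variables factors so
# that glmmADMB does not coerce them, and build the nested groupings by hand
d$pop    <- factor(d$pop)
d$mother <- factor(d$mother)
d$father <- factor(d$father)
d$pop_mother <- interaction(d$pop, d$mother, drop = TRUE)
d$pop_father <- interaction(d$pop, d$father, drop = TRUE)
fit_zip <- glmmadmb(seed_no ~ treatment + (1 | pop) + (1 | pop_mother) +
                      (1 | pop_father),
                    data = d, family = "poisson", zeroInflation = TRUE)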
I am a student and a beginner with R and RStudio; my R version is 3.4.2 ("Short Summer"). If someone with experience could point me in the right direction I would be very grateful!
H.

Related

Drift Detection in categorical variables of high cardinality (10000+)

I am trying to solve a drift detection problem where I have to find the drift in high-cardinality (10000+) categorical variables such as ip_address, zipcode, and cities. I have data points on the order of millions. I have tried the following methods:
chi-square test from the evidently Python package: https://github.com/evidentlyai/evidently/blob/main/src/evidently/analyzers/stattests/chisquare_stattest.py
maximum mean discrepancy (MMD) test with a Gaussian RBF kernel from alibi-detect: https://github.com/SeldonIO/alibi-detect/blob/master/alibi_detect/cd/mmd.py
I have faced the following problems while applying these methods to my data:
In the chi-square test, there is a constraint that we should have the same set of categories in both the training and inference datasets. This is highly unlikely for features like IP address and zip code. Some categories are present in the training data but not in the inference data; for those I get no observed frequency, and as a workaround I can assume their observed frequency is 0.
But there are also categories that are newly introduced in the inference dataset and have no presence in the training dataset. For those I cannot obtain an expected frequency from the training data, so I would have 0 in the denominator of the chi-square formula and the test statistic would be NaN. As a workaround, I can set their minimum expected frequency to 1. But I wonder whether this is the correct way to approach drift detection.
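To make that workaround concrete, here is a small sketch in R (the pseudocount of 1 is exactly the assumption in question, not an established rule):

# sketch: chi-square drift check over the union of categories, with a
# pseudocount so categories unseen in one window don't produce 0s or NaNs
drift_chisq <- function(ref, cur, pseudo = 1) {
  cats     <- union(unique(ref), unique(cur))
  observed <- table(factor(cur, levels = cats)) + pseudo
  expected <- table(factor(ref, levels = cats)) + pseudo
  chisq.test(as.vector(observed), p = as.vector(expected) / sum(expected))
}

With 10000+ categories and millions of rows, though, such a test will be extremely sensitive, which is exactly the larger problem described next.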
Moreover, the larger problem is the following:
These categorical features can take any value from a very large set, and I have no control over which values occur: the users of the system can log in from any IP address and any zip code. This makes it very difficult to find the real drift in the data, and methods like the chi-square test will almost always report a significant result for such features. Is there any drift detection method that can handle such features, taking into account their high cardinality and this open-ended nature of the data?
In the MMD test with a Gaussian RBF kernel, we calculate pairwise distances between the two datasets. X is my training dataset, which has 15 million records and 10 features, and Y is my inference dataset, which has 10 million records and 10 features. When I perform the MMD test on these datasets, I get the following:
a) K_XX = within similarity of X
b) K_YY = within similarity of Y
c) K_XY = cross similarity between X and Y
Computing K_XX requires a matrix of size 15 million x 15 million. This gives me a ResourceExhaustedError:
ResourceExhaustedError Traceback (most recent call last)
<command-574623167633967> in <module>
1 from alibi_detect.cd import MMDDrift
---> 2 detector = MMDDrift(x_ref=X, backend='tensorflow')
3 res = detector.predict(x=Y)
/databricks/python/lib/python3.8/site-packages/alibi_detect/utils/warnings.py in wrapper(*args, **kwargs)
15 def wrapper(*args, **kwargs):
16 _rename_kwargs(f.__name__, kwargs, aliases)
---> 17 return f(*args, **kwargs)
18 return wrapper
19 return deco
/databricks/python/lib/python3.8/site-packages/alibi_detect/cd/mmd.py in __init__(self, x_ref, backend, p_val, x_ref_preprocessed, preprocess_at_init, update_x_ref, preprocess_fn, kernel, sigma, configure_kernel_from_x_ref, n_permutations, device, input_shape, data_type)
101 if backend == 'tensorflow' and has_tensorflow:
102 kwargs.pop('device', None)
--> 103 self._detector = MMDDriftTF(*args, **kwargs) # type: ignore
104 else:
105 self._detector = MMDDriftTorch(*args, **kwargs) # type: ignore
/databricks/python/lib/python3.8/site-packages/alibi_detect/utils/warnings.py in wrapper(*args, **kwargs)
15 def wrapper(*args, **kwargs):
16 _rename_kwargs(f.__name__, kwargs, aliases)
---> 17 return f(*args, **kwargs)
18 return wrapper
19 return deco
/databricks/python/lib/python3.8/site-packages/alibi_detect/cd/tensorflow/mmd.py in __init__(self, x_ref, p_val, x_ref_preprocessed, preprocess_at_init, update_x_ref, preprocess_fn, kernel, sigma, configure_kernel_from_x_ref, n_permutations, input_shape, data_type)
86 # compute kernel matrix for the reference data
87 if self.infer_sigma or isinstance(sigma, tf.Tensor):
---> 88 self.k_xx = self.kernel(self.x_ref, self.x_ref, infer_sigma=self.infer_sigma)
89 self.infer_sigma = False
90 else:
/databricks/python/lib/python3.8/site-packages/keras/utils/traceback_utils.py in error_handler(*args, **kwargs)
65 except Exception as e: # pylint: disable=broad-except
66 filtered_tb = _process_traceback_frames(e.__traceback__)
---> 67 raise e.with_traceback(filtered_tb) from None
68 finally:
69 del filtered_tb
/databricks/python/lib/python3.8/site-packages/alibi_detect/utils/tensorflow/kernels.py in call(self, x, y, infer_sigma)
75 y = tf.cast(y, x.dtype)
76 x, y = tf.reshape(x, (x.shape[0], -1)), tf.reshape(y, (y.shape[0], -1)) # flatten
---> 77 dist = distance.squared_pairwise_distance(x, y) # [Nx, Ny]
78
79 if infer_sigma or self.init_required:
/databricks/python/lib/python3.8/site-packages/alibi_detect/utils/tensorflow/distance.py in squared_pairwise_distance(x, y, a_min, a_max)
28 x2 = tf.reduce_sum(x ** 2, axis=-1, keepdims=True)
29 y2 = tf.reduce_sum(y ** 2, axis=-1, keepdims=True)
---> 30 dist = x2 + tf.transpose(y2, (1, 0)) - 2. * x @ tf.transpose(y, (1, 0))
31 return tf.clip_by_value(dist, a_min, a_max)
32
ResourceExhaustedError: Exception encountered when calling layer "gaussian_rbf_20" (type GaussianRBF).
OOM when allocating tensor with shape[14335347,14335347] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu [Op:AddV2]
The error seems pretty obvious given the quadratic complexity of MMD. Is there any way to mitigate this issue, given that the number of records is in the millions and the categorical features have high cardinality?
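One pragmatic mitigation is to estimate MMD on random subsamples and average over several draws, so that no kernel matrix ever exceeds a manageable size; equivalently, you could pass a subsample as x_ref to MMDDrift. A rough sketch of the subsampling idea in R (sizes and bandwidth are illustrative, not tuned):

# sketch: biased MMD^2 estimate on random subsamples, averaged over repeats,
# so each kernel matrix is only n x n (here 2000 x 2000)
rbf <- function(A, B, sigma) {
  d2 <- outer(rowSums(A^2), rowSums(B^2), "+") - 2 * A %*% t(B)
  exp(-d2 / (2 * sigma^2))
}
mmd2_sub <- function(X, Y, n = 2000, reps = 10, sigma = 1) {
  est <- replicate(reps, {
    x <- X[sample(nrow(X), n), , drop = FALSE]
    y <- Y[sample(nrow(Y), n), , drop = FALSE]
    mean(rbf(x, x, sigma)) + mean(rbf(y, y, sigma)) -
      2 * mean(rbf(x, y, sigma))
  })
  mean(est)
}

For an actual test you would still add a permutation step on the pooled subsamples, but the memory footprint stays quadratic in n rather than in the full dataset size.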

Minimum number of states in DFA

What is the minimum number of states in a DFA accepting strings (in base 3, i.e., ternary form) congruent to 5 modulo 6?
I have tried but couldn't do it.
At first sight, it seems to need 6 states, but it can be minimised further.
Let's first see the state transition table. From state q_r, reading digit d leads to the state for (3r + d) mod 6:

      0    1    2
q0   q0   q1   q2
q1   q3   q4   q5
q2   q0   q1   q2
q3   q3   q4   q5
q4   q0   q1   q2
q5   q3   q4   q5

Here, the states q0, q1, q2, ..., q5 correspond to residues 0, 1, 2, ..., 5 modulo 6. q0 is our initial state, and since we need residue 5, our final state will be q5.
A few observations drawn from the above state transition table:
states q0, q2 and q4 have exactly the same transitions;
states q1, q3 and q5 have exactly the same transitions.
States which make transitions to the same states on the same inputs can be merged into a single state.
Note: final and non-final states can never be merged.
Therefore, we can merge q0, q2, q4 together and q1, q3 together, leaving q5 on its own (it is the only final state).
The final minimal DFA has 3 states:

     0   1   2
A    A   B   A     (initial; A = {q0, q2, q4})
B    B   A   C     (B = {q1, q3})
C    B   A   C     (final; C = {q5})
Let's look at a few strings in the language:
12 = 1*3 + 2 = 5 ≡ 5 (mod 6)
102 = 1*9 + 0*3 + 2 = 11 ≡ 5 (mod 6)
122 = 1*9 + 2*3 + 2 = 17 ≡ 5 (mod 6)
212 = 2*9 + 1*3 + 2 = 23 ≡ 5 (mod 6)
1002 = 1*27 + 0*9 + 0*3 + 2 = 29 ≡ 5 (mod 6)
We notice that all the strings end in 2. This makes sense: since 6 is a multiple of 3, any number congruent to 5 (mod 6) is congruent to 2 (mod 3), and the last base-3 digit of a number is precisely its value mod 3. Based on this, we can first try the related problem of strings congruent to 3 modulo 6:
10 = 3
100 = 9
120 = 15
210 = 21
1000 = 27
There's no real pattern emerging, but consider this: every base-3 number ending in 0 is definitely divisible by 3. The even ones are also divisible by 6; so the odd numbers whose base-3 representation ends in 0 must be congruent to 3 mod 6. Because every power of 3 is odd and the digit 2 contributes an even amount (2*3^i), the number is odd exactly when the number of 1s in the string is odd.
Translating the same reasoning back to 5 (mod 6), our conditions are:
the string begins with a 1 or a 2 (no leading 0s);
the string has an odd number of 1s;
the string ends with 2;
the string can contain any number of 2s and 0s.
To get the minimum number of states in such a DFA, we can use the Myhill-Nerode theorem beginning with the empty string:
the empty string can be followed by any string in the language. Call its equivalence class [e].
the string 0 cannot be followed by anything since valid base-3 representations don't have leading 0s. Call its equivalence class [0].
the string 1 must be followed with stuff that has an even number of 1s in it ending with a 2. Call its equivalence class [1].
the string 2 can be followed by anything in the language. Indeed, you can verify that putting a 2 at the front of any string in the language gives another string in the language. However, it can also be followed by strings beginning with 0. Therefore, its class is new: [2].
the string 00 can't be followed by anything to fix it; its class is the same as its prefix 0, [0]. Same for the string 01.
the string 10 can be followed by any string with an even number of 1s that ends in a 2; it is therefore equivalent to the class [1].
the string 11 can be followed by any string in the language whatsoever; indeed, you can verify that prepending 11 to any string in the language gives another string in the language. However, it can also be followed by strings beginning with 0. Therefore, its class is the same as [2].
12 can be followed by a string with an even number of 1s ending in 2, as well as by the empty string (since 12 is in fact in the language). This is a new class, [12].
21 is equivalent to 1; class [1]
22 is equivalent to 2; class [2]
20 is equivalent to 2; class [2]
120 is indistinguishable from 1; its class is [1].
121 is indistinguishable from [2].
122 is indistinguishable from [12].
We have seen no new equivalence classes on strings of length 3; so, we know we have seen all the equivalence classes. They are the following:
[e]: any string in the language can follow this
[0]: no string can follow this
[1]: a string with an even number of 1s ending in 2 can follow this
[2]: same as [e] but also strings beginning with 0
[12]: same as [1] but also the empty string
This means that a minimal DFA for our language has five states. Here is the DFA:
[0]
^
|
0
|
----->[e]--2-->[2]<-\
| ^ |
| | |
1 __1__/ /
| / /
| | 1
V V |
[1]--2-->[12]
^ |
| |
\___0___/
(transitions not pictured are self-loops on the respective states).
Note: I expected this DFA to have 6 states, as Welbog pointed out in the other answer, so I might have missed an equivalence class. However, the DFA seems right after checking a few examples and thinking about what it's doing: you can only reach the accepting state [12] by seeing a 2 as the last symbol (definitely necessary), you can only reach [12] from [1] (or by staying in [12] on another 2), and you must have seen an odd number of 1s to get to [1]…
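One way to gain more confidence is to simulate the DFA and compare it against direct arithmetic. A small sketch in R (transitions transcribed from the diagram above, with [0] as the dead state):

# transitions of the 5-state DFA; element i of each vector is the target
# state on digit i-1, and the accepting state is "12"
delta <- list(
  "e"    = c("dead", "1", "2"),
  "dead" = c("dead", "dead", "dead"),
  "1"    = c("1", "2", "12"),
  "2"    = c("2", "1", "2"),
  "12"   = c("1", "2", "12")
)
accepts <- function(digits) {
  s <- "e"
  for (d in digits) s <- delta[[s]][d + 1]
  s == "12"
}
to_base3 <- function(n) {
  d <- integer(0)
  while (n > 0) { d <- c(n %% 3, d); n <- n %/% 3 }
  d
}
# TRUE if the DFA agrees with n %% 6 == 5 for the first 500 integers
all(sapply(1:500, function(n) accepts(to_base3(n)) == (n %% 6 == 5)))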
The minimum number of states for almost all modulus problems is the modulus itself. The general strategy is one state per residue, as transitions between residues are independent of what the previous digits were. For example, if you're in state r4 (representing x ≡ 4 (mod 6)) and you encounter a 1 as your next input digit, your new value is 4*3 + 1 = 13 ≡ 1 (mod 6), so the transition from r4 on input 1 is to r1. You'll find that the start state and r0 can be merged, for a total of 6 states.
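This construction is mechanical enough to write down directly; a sketch in R (base 3, modulus 6, as in the question):

# reading digit d in base 3 multiplies the value by 3 and adds d,
# so residue r maps to (3*r + d) mod 6
next_state <- function(r, d) (3 * r + d) %% 6
# full transition table: rows are states r0..r5, columns are digits 0..2
outer(0:5, 0:2, next_state)

The repeated rows of this table are exactly why q0/q2/q4 and q1/q3/q5 merge in the other answer.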

missing data in time series

I am new to this field and I am trying to explore time-series data: find the missing values, count them, study the distribution of their gap lengths, and fill in the gaps. I have, let's say, 10 .txt files, and each file has 2 columns, as follows:
C1 C2
944 0
920 1
920 2
928 3
912 7
920 8
920 9
880 10
888 11
920 12
944 13
and so on, up to (say) index 100; the 10 files do not necessarily have the same number of observations.
In this example the missing values are 4, 5 and 6 in C2 (gaps do not necessarily appear in every file), together with the corresponding values of the first column C1 (C1 is measured in milliseconds, so the value 928 ms is not a time neighbor of 912 ms). I want to find those gaps (the total number of missing values across all 10 files) and show a histogram of their lengths.
I wrote a piece of code in R, but the problem is that I don't get the exact total number of missing values that I should:
path = "files path"
out.file<-data.frame(TS = 0, Index = 0, File = '')
file.names <- dir(path, pattern =".txt")
for(i in 1:length(file.names)){
file <- cbind(read.table(file.names[i],
header=F,
sep ="\t",
stringsAsFactors=FALSE),
file.names[i])
colnames(file) <- c('TS', 'Index', 'File')
out.file <- rbind(out.file, file)
}
d = dim(out.file)[1]
misDa = 0
for(i in 2:(d-1)){
if(abs(out.file$Index[i]-out.file$Index[i+1]) > 1)
misDa = misDa+1
}
It is hard to give specific hints without a more extensive example of your data that contains some of the actual gaps.
If you are using R (as it seems), the naniar and imputeTS packages offer nice functions for missing-data visualization: naniar is especially good for multivariate data, and imputeTS is especially good for time-series data (see each package's documentation for plot examples).
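As for the counting itself, here is a base-R sketch (assuming, as in your example, that the second column is an integer index, so a jump of k+1 between consecutive indices means k missing observations):

# gap lengths per file, never comparing indices across file boundaries
gap_lengths <- unlist(lapply(file.names, function(f) {
  idx <- read.table(file.path(path, f), header = FALSE, sep = "\t")[[2]]
  jumps <- diff(idx)
  jumps[jumps > 1] - 1    # one entry per gap: how many values are missing
}))
sum(gap_lengths)          # total number of missing values across all files
hist(gap_lengths, xlab = "gap length", main = "Distribution of gap lengths")

Note two differences from your loop: this counts the length of each gap (the index jump minus 1) rather than just the number of jumps, and it works file by file, so the index reset between files is not mistaken for a gap.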

RandomForestSRC error and vimp

I am trying to perform a random forest survival analysis following the randomForestSRC vignette in R. I have a data frame containing 59 variables, 14 of them numeric and the rest factors. Two of the numeric ones are TIME (days until death) and DIED (0/1, dead or not). I'm running into 2 problems:
trainrfsrc <- rfsrc(Surv(TIME, DIED) ~ .,
                    data = train, nsplit = 10, na.action = "na.impute")
trainrfsrc gives: Error rate: 17.07%
That works fine. However, exploring the error rate with:
plot(gg_error(trainrfsrc)) + coord_cartesian(ylim = c(.09, .31))
returns:
geom_path: Each group consists of only one observation. Do you need to adjust the group aesthetic?
or:
a <- gg_error(trainrfsrc)
a
   error ntree
1     NA     1
2     NA     2
3     NA     3
4     NA     4
5     NA     5
6     NA     6
7     NA     7
8     NA     8
9     NA     9
10    NA    10
The error is NA for all 1000 trees. How come there's no error rate for each number of trees tried?
The second problem comes when trying to explore the most important variables using VIMP, such as:
plot(gg_vimp(trainrfsrc)) + theme(legend.position = c(.8, .2)) + labs(fill = "VIMP > 0")
it returns:
In gg_vimp.rfsrc(trainrfsrc) : rfsrc object does not contain VIMP information. Calculating...
Any ideas? Thanks
Setting err.block = 1 (or some integer between 1 and ntree) should fix the problem of the error rate returning NA. You can check the help file for rfsrc to read more about err.block.
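A sketch of the full call (err.block is available in recent versions of randomForestSRC; importance = TRUE additionally stores VIMP in the fitted object, which should also silence the gg_vimp recomputation warning):

trainrfsrc <- rfsrc(Surv(TIME, DIED) ~ .,
                    data = train, nsplit = 10, na.action = "na.impute",
                    err.block = 1, importance = TRUE)
plot(gg_error(trainrfsrc))   # per-tree error curve, no more NAs
plot(gg_vimp(trainrfsrc))    # VIMP already present, no recomputation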

Classification Supervised Training Confusion

So I am new to supervised machine learning, but I've been reading books and articles about it, and I'm stuck on a problem. (Not stuck exactly, but I don't understand the logic behind classification algorithms.) I am trying to classify records as wrong or not wrong based on historical data.
So this is the original data (training data):
Name  Office  Age  isWrong
F1    1       32   0
F2    2       61   1
F3    1       35   0
F4    0       25   0
F5    1       36   0
F6    2       52   0
F7    2       48   0
F8    1       17   1
F9    2       51   0
F10   0       24   0
F11   4       34   1
F12   0       21   0
F13   2       51   0
F14   0       27   0
F15   3       37   1
(only showing top 15 results of 200 results)
A wrong record is any record which reports an age LOWER than 18 or HIGHER than 60, or an office location that is NOT in {0, 1, 2}. I have more records that display a 1 when any of the mentioned conditions are met. I trained my model with this dataset, and I created a test dataset to check the results. However, I end up with 0 in the prediction column of every record. I used a Naïve Bayes approach because it assumes independence between the feature variables, which is my case (no relationship between office number and age). I know there are other methods, like logistic regression and SVC (SVM), but I assume those require some degree of relationship between the feature variables. Despite that, I still tried those two approaches and got the same results. Am I doing something wrong? Do I need to specify something before training my model?
Here is what I did (very simple):
NaiveBayes nb = new NaiveBayes().setLabelCol("isWrong");
NaiveBayesModel nbm = nb.fit(dataset);
nbm.transform(dataset2).show();
Here is dataset2 (top 15):
Name Office Age
F1 9 36 //wrong, office is 9
F2 2 20
F3 1 17
F4 2 43
F5 2 90 // wrong, age is >60
F6 1 36
F7 1 40
F8 2 52
F9 2 49
F10 1 38
F11 0 28
F12 0 18
F13 1 40
F14 1 31
F15 2 45
But like I said, the prediction column displays 0 every time. Any idea why?
I don't know why you are opting for transform(); it just tries to cast the result dtype to the same one the original column has.
To get the probabilities you should be using the function:
predict_proba(X): return probability estimates for the test vector X.
The following code should work in your scenario:
NaiveBayes nb = new NaiveBayes().setLabelCol("isWrong");
nb.fit(dataset);
nb.predict_proba(dataset2);
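Independent of the API question, it may be worth sanity-checking whether naive Bayes can represent this labelling rule at all. A small sketch in R with the e1071 package, on hypothetical simulated data mirroring the rule described in the question:

# hypothetical sanity check: simulate data following the stated rule
# (wrong = age < 18 or age > 60, or office not in {0, 1, 2}) and see
# how naive Bayes handles it
library(e1071)
set.seed(1)
n <- 200
office  <- sample(0:4, n, replace = TRUE, prob = c(.3, .3, .3, .05, .05))
age     <- sample(15:70, n, replace = TRUE)
isWrong <- factor(as.integer(age < 18 | age > 60 | !(office %in% 0:2)))
train   <- data.frame(office, age, isWrong)
m <- naiveBayes(isWrong ~ office + age, data = train)
table(predicted = predict(m, train), actual = train$isWrong)

If the confusion matrix is dominated by predicted 0s here as well, that points at the model class and the class imbalance (a single Gaussian per class fits a two-sided rule like 18 <= age <= 60 poorly), not at your Spark code.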

Resources