ML classification: how to handle a cell that has two pieces of information?

I have a situation I can't get out of. I'm pretty new to machine learning and its community.
I'm trying to build a classification model, but here is my problem:
Say I have two X columns (features; text or integers) and one Y column (which I'm trying to predict).
One of these X columns originated from a dataset that has duplicate rows, but some of the information in the duplicates differs and is important for my work.
Let me give an example:
Product No    Variable 1       Y
1             apple            result1
2             orange           result2
3             banana, apple    result1
4             blueberry        result3
5             banana           result5
As you can see, row 3 contains two pieces of information that both have value to me. How can I handle this situation in a classification model? Sorry if it's obvious; I'm new to ML :)
Edit: note that the Variable 1 column is huge, with roughly a thousand distinct values. Of course the model has more than one feature; the real model is already very high-dimensional.

Related

Specifying emmeans for negative binomial model

I've been trying out different models for my count data, which is zero-inflated, and after comparing the Akaike information criterion (AIC) and the rootograms of the candidate models, the negative binomial generalized linear model looks like the best fit. I'd like to run a post hoc test on my data, but I keep hitting errors and I'm not sure how to frame the code to get what I'm looking for.
I used the following code for the model, where "flwrs" is my response variable (the count data), "tre" is the treatment group (of which there are four: F400, F800, GE400, GE800), and "wa" is weeks after transplant (ranging from 1 to 26).
negbin <- glm.nb(flwrs ~ tre*wa + (1 | replic), data = test)
summary(negbin)
Anova(negbin)
With the anova I see p < 2.2e-16 for tre, wa, and tre:wa. I'd like to explore the significance of tre and wa in particular.
Analysis of Deviance Table (Type II tests)

Response: flwrs
            LR Chisq  Df  Pr(>Chisq)
tre            263.6   3   < 2.2e-16 ***
wa            4423.8  25   < 2.2e-16 ***
1 | replic         0
tre:wa         531.1  72   < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
When I try to run a post hoc with emmeans, though, I keep getting errors that I can't work through.
This is what I tried, based on what I read about emmeans in the help tab of R:
emm <- emmeans(negbin, ~ tre | wa)
pairs(emm, adjust = "tukey")
The response is this:
> emm <- emmeans(negbin, ~ tre | wa)
Error in `[.data.frame`(tbl, , vars, drop = FALSE) :
undefined columns selected
Error in (function (object, at, cov.reduce = mean, cov.keep = get_emm_option("cov.keep"), :
Perhaps a 'data' or 'params' argument is needed
> pairs(emm, adjust = "tukey")
Error in if (nc < 2L) stop("only one column in the argument to 'pairs'") :
argument is of length zero
So I'm not sure how to specify parameters or define the columns. Mainly I'm hoping to get p values for the different tre and wa combinations so that I can add significance markers to a plot I created. I'm very new to statistical modelling in R, so any suggestions about what I'm doing are welcome!
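Based on the second error message, perhaps the data frame needs to be passed explicitly (just a guess at the 'data' argument it mentions):
emm <- emmeans(negbin, ~ tre | wa, data = test)
pairs(emm, adjust = "tukey")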

Accounting for time with repeated-measures in lmer when not interested in time

I am trying to conduct a repeated-measures mixed-effects test with lmer and lmerTest, but I am not sure if I am doing it appropriately.
I have 6 sites with 3 plots per site, sampled once per year for 24 consecutive years. I have several environmental and species variables, but for simplicity let's say I have two environmental variables (depth and temperature) and two species (species 1 and species 2). I am not interested in the time variable, changes over time, or their interactions, as this system has strong wet/dry seasonality in which the effects of the dry season outweigh any carry-over effects of species from the prior year. I do not necessarily have data for all variables and plots every year; some plots were not sampled at times.
The question is whether species2 (a predator) has any effect on populations of species1, relative to the environmental variables.
Is it appropriate to include year as its own random effect in the model, along with plot within site?
model1 <- lmer(species1 ~ depth + temperature + species2 + (1|year) + (1|site/plot), data=data)
For this particular analysis there were 435 total observations (plot/year), but I worry that this is not an appropriate repeated-measures analysis.
anova(model1)
Type III Analysis of Variance Table with Satterthwaite's method
             Sum Sq  Mean Sq  NumDF   DenDF  F value     Pr(>F)
depth        0.0221   0.0221      1  145.75   0.0908     0.7635
temperature  9.0213   9.0213      1  422.19  37.0429  2.596e-09 ***
species2     0.0597   0.0597      1  418.95   0.2450     0.6208
This does not seem right. Is there a better way to incorporate year, or should I include year at all?
If I exclude year, why does the DenDF for depth change so drastically?
model1 <- lmer(species1 ~ depth + temperature + species2 + (1|year) + (1|site/plot), data=data)
Type III Analysis of Variance Table with Satterthwaite's method
              Sum Sq  Mean Sq  NumDF   DenDF   F value     Pr(>F)
depth          2.599    2.599      1  431.77    7.1096   0.007955 **
temperature   58.788   58.788      1  432.10  160.7955  < 2.2e-16 ***
species2       0.853    0.853      1  429.62    2.3336   0.127343
summary(M1)
Linear mixed model fit by maximum likelihood. t-tests use Satterthwaite's method ['lmerModLmerTest']
Formula: species1 ~ depth + temperature + species2 + (1 | site/plot)
   Data: data

     AIC      BIC   logLik deviance df.resid
   833.4    861.9   -409.7    819.4      428

Scaled residuals:
     Min       1Q   Median       3Q      Max
-2.20675 -0.66119 -0.07051  0.52722  2.99942

Random effects:
 Groups    Name        Variance  Std.Dev.
 plot:site (Intercept) 0.0003221 0.01795
 site      (Intercept) 0.2051143 0.45290
 Residual              0.3656072 0.60465
Number of obs: 435, groups: plot:site, 24; site, 6

Fixed effects:
              Estimate Std. Error         df t value Pr(>|t|)
(Intercept) -0.538258   0.325072  50.071940  -1.656  0.10401
depth        0.006338   0.002377 431.768539   2.666  0.00796 **
temperature  0.391023   0.030837 432.101095  12.681  < 2e-16 ***
species2    -0.353264   0.231252 429.615226  -1.528  0.12734
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Correlation of Fixed Effects:
            (Intr) depth  temp
depth       -0.316
temperature -0.467 -0.204
species2    -0.544  0.040  0.007
I may have asked more questions than I answered, but I hope some of this is helpful.
"The question is whether species2 (a predator) has any effect on populations of species1, relative to the environmental variables."
I think when you word it this way, it is not entirely clear. Are you interested in the effect that species2 has on species1 depending on the environmental variables (in other words, can the effect of species2 on species1 change depending on depth or temperature)? Or do you mean you would like to compare the effect of species2 on species1 to the effects of depth or temperature on species1? Or what do you mean, exactly, by "relative to the environmental variables"?
Yes, (1|year) + (1|site/plot) specifies a random intercept for year and for plot within site. If you wanted a variable to be able to vary over each group (i.e., have a random slope), you would write something like (temperature|year) + (1|site/plot) if you thought the effect of temperature on species1 might differ between years.
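For concreteness, a minimal sketch of that random-slope model, reusing the variable names from the question:
library(lme4)
library(lmerTest)
# Random intercepts for year and plot-within-site, plus a random slope
# letting the temperature effect vary across years:
model_slope <- lmer(species1 ~ depth + temperature + species2 +
                      (temperature | year) + (1 | site/plot), data = data)
summary(model_slope)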
Exactly how you specify the model is going to be based on your knowledge of the biological system and your knowledge of statistics. Based on the information in your question, this random effects formulation that you have suggested appears completely reasonable to me. Yes, this is allowing you to account for grouped data (grouped by each year and by each plot within site). It is possible that with only 435 observations you may have convergence issues with an overly complex model, which you may or may not have - just something to look out for.
I am not sure what you mean by "this does not seem right" - what are you expecting to see? What is missing?
I am seeing the same model twice (below) with different values in the output - is there a copy-and-paste error here, or am I missing something? The values shouldn't differ with the same model structure.
model1 <- lmer(species1 ~ depth + temperature + species2 + (1|year) + (1|site/plot), data=data)
You haven't removed year in the line above, but you have below it, in the summary(M1) call.
My simple answer to the year question would be yes, I would include year. Every year is so different in any biological dataset I have seen that it is worth including as a random intercept at least - exactly as you have done. If the variance of the year random effect is estimated to be zero, then the term behaves as if it weren't there in the first place. At that point you can choose to fit year as a fixed effect instead if you would still like to account for the grouped nature of the data.
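In code, that fallback could look like this (a sketch; factor() is there in case year is stored as a number):
# Year as a fixed factor instead of a random intercept:
model_yearfixed <- lmer(species1 ~ depth + temperature + species2 +
                          factor(year) + (1 | site/plot), data = data)
anova(model_yearfixed)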
Also, there are lots of resources on this. Some examples:
Bolker, Benjamin M., Mollie E. Brooks, Connie J. Clark, Shane W. Geange, John R. Poulsen, M. Henry H. Stevens, and Jada-Simone S. White. "Generalized linear mixed models: a practical guide for ecology and evolution." Trends in ecology & evolution 24, no. 3 (2009): 127-135.
Harrison, Xavier A., Lynda Donaldson, Maria Eugenia Correa-Cano, Julian Evans, David N. Fisher, Cecily ED Goodwin, Beth S. Robinson, David J. Hodgson, and Richard Inger. "A brief introduction to mixed effects modelling and multi-model inference in ecology." PeerJ 6 (2018): e4794.
https://peerj.com/articles/4794/

Turing Machine: take a mod of two numbers?

Design a Turing machine that takes as input two non-negative numbers and performs the mod operation on them; for example, mod(3,7) = 3 and mod(7,3) = 1. Clearly specify any assumptions about the format of the TM's input and output.
Input is two positive integers X and Y in unary, separated by a separator symbol. Output is a single number Z in unary. The TM is deterministic, with a single one-sided tape.
First, move right to find the separator. Then bounce back and forth between the end of X and the beginning of Y, marking pairs of symbols. If you run out of X before running out of Y, then X < Y and X mod Y = X: erase the separator and everything after it, change all marked tape symbols back to your unary digit, and halt-accept. If you run out of Y before X, then X >= Y, so X mod Y = (X - Y) mod Y: change the marked symbols of X to the erased/separator symbol, restore the marked symbols of Y to the unary digit, and repeat. (If X and Y run out at the same time, then X = Y on this pass and the remainder is 0; handle it as the second case, so the next pass starts with an empty X and the machine halts with output 0.)
Here's how your 2 mod 3 gets processed:
#110111#
#1a0b11#
#aa0bb1#
#aa#####
#11#####
Here's how 3 mod 2 gets processed:
#111011#
#11a0b1#
#1aa0bb#
#100011#
#a000b1#
#a######
#1######
[The complete machine was given as a figure, which is not reproduced here; in that figure, # is the blank (empty) symbol, and the idea is the same as described above.]
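To sanity-check the marking procedure, here is a hypothetical R simulation of it (it mirrors the tape operations rather than an actual state-transition table, and assumes well-formed input with X, Y >= 1):
# Tape alphabet: "1" (unary digit), "0" (separator/erased), "a"/"b" (marks).
tm_mod <- function(x, y) {
  tape <- c(rep("1", x), "0", rep("1", y))        # X, separator, Y in unary
  repeat {
    sep <- which(tape == "0")[1]                  # leftmost separator cell
    xs  <- which(tape[seq_len(sep - 1)] == "1")   # unmarked cells of X
    ys  <- sep + which(tape[(sep + 1):length(tape)] == "1")  # unmarked cells of Y
    if (length(xs) == 0 && length(ys) == 0) return(0)  # X == Y on this pass
    if (length(xs) == 0) {                        # ran out of X first: X < Y
      left <- tape[seq_len(sep - 1)]              # erase separator and after
      return(sum(left %in% c("1", "a")))          # restore marks, count digits
    }
    if (length(ys) == 0) {                        # ran out of Y first: X >= Y
      tape[tape == "a"] <- "0"                    # marked X cells become erased
      tape[tape == "b"] <- "1"                    # restore Y, then repeat
      next
    }
    tape[max(xs)] <- "a"                          # mark a pair: last X digit...
    tape[min(ys)] <- "b"                          # ...and first Y digit
  }
}
tm_mod(2, 3)  # 2
tm_mod(3, 2)  # 1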

Data Science: Scoring methodology

I am looking for a methodology to assign a risk score to an individual based on certain events. I want a 0-100 scale with an exponential assignment: for example, for one event a day the score may rise to 25, for 2 events it may rise to 50-60, and for 3-4 events a day the score for the day would be 100.
I tried to Google it, but since I am not aware of the right terminology I keep ending up on random topics. :(
Is there any mathematical terminology for this kind of scoring system? What are the most common methods you might know?
P.S.: Expert/experienced data scientist advice highly appreciated ;)
I would start by writing down some qualifications:
0 events trigger a score of 0.
The non-edge event counts are where the threshold for a score of 100 should live.
Any count at or past the threshold scores 100.
If so, here's a (very) simplified example:
Stage Data:
userid <- c("a1","a2","a3","a4","a11","a12","a13","a14","u2","wtf42","ub40","foo","bar","baz","blue","bop","bob","boop","beep","mee","r")
events <- c(0,0,0,0,0,0,0,0,0,0,0,0,1,2,3,2,3,6,122,13,1)
df1 <- data.frame(userid,events)
Optional: normalize events to lie in (1, 2].
This can be helpful because of logarithmic properties. (Otherwise, given the assumed function score = events ^ exponent used in this example, 1 event will always yield a score of 1.) Normalization also lets you control sensitivity, but it must be done carefully since we are dealing with exponents and logarithms. I am not using it in the example:
normevents <- (events-mean(events))/((max(events)-min(events))*2)+1.5
Set the quantile threshold for max score:
MaxScoreThreshold <- 0.25
Get the non-edge quantiles of the events distribution:
qts <- quantile(events[events>min(events) & events<max(events)], c(seq(from=0, to=100,by=5)/100))
Find the event quantity that gives a score of 100, using the set threshold:
MaxScoreEvents <- quantile(qts,MaxScoreThreshold)
Find the exponent of your exponential function
Given that:
Score = events ^ exponent
events is a natural number (an integer > 0; we took care of this by omitting the edges)
exponent > 1
Exponent Calculation:
exponent <- log(100)/log(MaxScoreEvents)
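As a quick sanity check (a hypothetical line, assuming MaxScoreEvents > 1), raising the threshold quantity to the exponent should recover the maximum score:
MaxScoreEvents ^ exponent  # ~100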
Generate the scores:
# Cap each raw score at [0, 100] and round up to an integer:
df1$Score <- apply(as.matrix(events ^ exponent), 1, FUN = function(x) {
  if (x > 100) {
    result <- 100
  } else if (x < 0) {
    result <- 0
  } else {
    result <- x
  }
  return(ceiling(result))
})
df1
Resulting Data Frame:
   userid events Score
1      a1      0     0
2      a2      0     0
3      a3      0     0
4      a4      0     0
5     a11      0     0
6     a12      0     0
7     a13      0     0
8     a14      0     0
9      u2      0     0
10  wtf42      0     0
11   ub40      0     0
12    foo      0     0
13    bar      1     1
14    baz      2   100
15   blue      3   100
16    bop      2   100
17    bob      3   100
18   boop      6   100
19   beep    122   100
20    mee     13   100
21      r      1     1
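As an aside, the capping step can also be written without apply; a vectorized equivalent (same behavior):
df1$Score <- ceiling(pmin(pmax(events ^ exponent, 0), 100))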
Under the assumption that your data is larger and has more event categories, the score won't snap to 100 so quickly; it is also a function of the threshold.
I would rely on the data to define the parameters - the threshold in this case.
If you have prior data on which users really did whatever it is your score assesses, you can do supervised learning: set the threshold wherever the ratio exceeds 50%, for example. Or, if the plot of events against the probability of 'success' looks like the cumulative distribution function of a normal distribution, I'd set the threshold wherever it first hits 45 degrees.
You could also use logistic regression if you have prior data, but instead of feeding the output of the regression through the logistic function, use that number directly as your score. You can normalize it to be within 0-100.
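A rough sketch of that idea, assuming a hypothetical prior data frame prior_df with an events column and a binary outcome column:
# Fit a logistic model on labelled prior data, then use the linear
# predictor (type = "link", i.e. before the logistic transform) as a score:
fit <- glm(outcome ~ events, data = prior_df, family = binomial)
raw <- predict(fit, newdata = df1, type = "link")
df1$RegScore <- round(100 * (raw - min(raw)) / (max(raw) - min(raw)))  # rescale to 0-100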
It's not always easy to write a data science question. I made many assumptions about what you are looking for; I hope this is the general direction.

Handling features not correlated with output prediction?

I do regression analysis with multiple features; the number of features is 20-23. For now, I check each feature's correlation with the output variable. Some features have a correlation coefficient close to 1 or -1 (highly correlated); some have a correlation coefficient near 0. My question is: do I have to remove a feature if its correlation coefficient is close to 0? Or can I keep it, with the only problem being that it won't have a noticeable effect on the regression model? Or is removing such features obligatory?
In short
High (absolute) correlation between a feature and the output implies that this feature should be valuable as a predictor.
Lack of correlation between a feature and the output implies nothing.
More details
Pairwise correlation only shows you how one thing moves with another; it says nothing about how well a feature works in combination with the others. So if your model is not trivial, you should not drop variables just because they are not correlated with the output. I will give you an example that shows why.
Consider the following sample: we have two features (X, Y) and one output value (Z; say red is 1, black is 0):
X Y Z
1 1 1
1 2 0
1 3 0
2 1 0
2 2 1
2 3 0
3 1 0
3 2 0
3 3 1
Let us compute the correlations:
CORREL(X, Z) = 0
CORREL(Y, Z) = 0
So... should we drop both variables? One of them? If we drop either variable, our problem becomes completely impossible to model! The "magic" lies in the fact that there is actually a "hidden" relation in the data:
|X-Y|
0
1
2
1
0
1
2
1
0
And
CORREL(|X-Y|, Z) = -0.8528028654
Now this is a good predictor!
You can actually get a perfect regressor (interpolator) through
Z = 1 - sign(|X-Y|)
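A quick R check of the numbers above (reconstructing the 3x3 grid from the table):
X <- rep(1:3, each = 3)
Y <- rep(1:3, times = 3)
Z <- as.numeric(X == Y)            # 1 on the diagonal, 0 elsewhere
cor(X, Z)                          # 0
cor(Y, Z)                          # 0
cor(abs(X - Y), Z)                 # -0.8528...
all(Z == 1 - sign(abs(X - Y)))     # TRUE: the perfect interpolator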
