I am analyzing data from an experiment.
I have three groups (GROUP, one between-subjects factor) to compare on a cognitive task.
The task follows a three-way full factorial design (2x3x3): all subjects are presented with two stimuli (factor1); for each stimulus there are three conditions (factor2), and for each condition three positions on the screen (factor3). For each combination of factors there are N trials, which are averaged to give mean accuracy (ACC) and mean reaction time (RT).
I want to build a linear mixed model in SPSS.
I tried the following syntax in SPSS 22:
MIXED ACC BY GROUP FACTOR1 FACTOR2 FACTOR3 GENDER WITH RT Age
/FIXED = GROUP FACTOR1 FACTOR2 FACTOR3 GROUP*FACTOR1 GROUP*FACTOR2 GROUP*FACTOR3 GENDER AGE RT | SSTYPE(3)
/RANDOM= INTERCEPT | SUBJECT(SUBID) COVTYPE(VC)
Given that I have averaged accuracy across trials for each combination, should I include a REPEATED subcommand as well? If so, what is the difference between the following
/REPEATED= FACTOR1 FACTOR2 FACTOR3 | SUBJECT(SUBID) COVTYPE(UN)
and the following specification?
/REPEATED= FACTOR1*FACTOR2*FACTOR3 | SUBJECT(SUBID) COVTYPE(UN)
In other words, what is the difference between including and omitting the asterisks?
Thanks for your comments,
Alessandro
You have two questions here: (1) a statistical question about what type of analysis is appropriate, and (2) a code question.
(1) Very briefly, if you're going to use linear mixed models, I think you should use all the data, and not average across your N trials within each combination of factors. Those N trials are your repeated measurements.
(2) The IBM KnowledgeCenter page on the REPEATED subcommand states
Specify a list of variable names (of any type) connected by asterisks
(repeated measure) following the REPEATED subcommand.
which suggests that
/REPEATED= FACTOR1 FACTOR2 FACTOR3 | SUBJECT(SUBID) COVTYPE(UN)
should be a syntax error. It isn't, so I looked at the Model Information table in the output. For both REPEATED specifications, the Repeated Effects section of that table lists FACTOR1*FACTOR2*FACTOR3 as the effect.
Based on this, it's safe to say that the SPSS syntax parser interprets
/REPEATED= FACTOR1 FACTOR2 FACTOR3 | SUBJECT(SUBID) COVTYPE(UN)
to be equivalent to
/REPEATED= FACTOR1*FACTOR2*FACTOR3 | SUBJECT(SUBID) COVTYPE(UN)
I have a pretrained word2vec model in pyspark and I would like to know how big its vocabulary is (and perhaps get a list of the words in the vocabulary).
Is this possible? I would guess it has to be stored somewhere since it can predict for new data, but I couldn't find a clear answer in the documentation.
I tried w2v_model.getVectors().count() but the result (970) seems too small for my use case. In case it is relevant, I'm using short-text data and my dataset has tens of millions of messages, each having from 10 to 30 or 40 words. I am using min_count=50.
Not quite sure why you doubt the result of .getVectors().count(), which gives the desired result indeed, as shown in the documentation link you have provided yourself.
Here is the example posted there, with a vocabulary of just three (3) tokens - a, b, and c:
from pyspark.ml.feature import Word2Vec
sent = ("a b " * 100 + "a c " * 10).split(" ") # 3-token vocabulary
doc = spark.createDataFrame([(sent,), (sent,)], ["sentence"])
word2Vec = Word2Vec(vectorSize=5, seed=42, inputCol="sentence", outputCol="model")
model = word2Vec.fit(doc)
So, unsurprisingly, it is
model.getVectors().count()
# 3
and asking for the vectors themselves
model.getVectors().show()
gives
+----+--------------------+
|word| vector|
+----+--------------------+
| a|[0.09511678665876...|
| b|[-1.2028766870498...|
| c|[0.30153277516365...|
+----+--------------------+
In your case, with min_count=50 (the parameter is called minCount in pyspark's Word2Vec), every word that appears fewer than 50 times in your corpus will not be represented; reducing this number will result in more vectors.
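To see how a minimum-count threshold shrinks a vocabulary, here is a small plain-Python sketch. It does not use pyspark and is not how Word2Vec is implemented internally; the helper name `vocabulary` and the toy corpus are made up for illustration.

```python
from collections import Counter


def vocabulary(tokens, min_count):
    """Return the set of tokens that occur at least min_count times."""
    counts = Counter(tokens)
    return {tok for tok, n in counts.items() if n >= min_count}


# Toy corpus: "a" occurs 110 times, "b" 100 times, "c" 10 times.
tokens = ("a b " * 100 + "a c " * 10).split()

# Raising the threshold can only remove words from the vocabulary.
for threshold in (5, 50, 105):
    print(threshold, sorted(vocabulary(tokens, threshold)))
```

With short messages and a threshold of 50, a long tail of rare words is dropped in the same way, which may explain a smaller vocabulary than expected.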
I am trying to conduct a repeated-measures mixed-effects test with lmer and lmerTest, but I am not sure if I am doing it appropriately.
I have 6 sites with 3 plots per site that have been sampled once per year for 24 consecutive years. I have several environmental and species variables, but for simplicity, let's say I have two environmental variables (depth and temperature) and two species (species 1 and species 2). I am not interested in the time variable, changes with time, or the interactions, as this system has strong wet/dry seasonality where the effects of the dry season outweigh carry over effects of species from the prior year. I do not necessarily have data for all variables and plots every year, with some plots not sampled at times.
The question is whether species2 (a predator) has any effect on populations of species1, relative to the environmental variables.
Is it appropriate to include year as its own random effect in the model, along with plot within site?
model1 <- lmer(species1 ~ depth + temperature + species2 + (1|year) + (1|site/plot), data=data)
For this particular analysis, there were 435 total observations (plot/year), but I worry that it is not appropriately conducting repeated-measures.
anova(model1)
Type III Analysis of Variance Table with Satterthwaite's method
Sum Sq Mean Sq NumDF DenDF F value Pr(>F)
depth 0.0221 0.0221 1 145.75 0.0908 0.7635
temperature 9.0213 9.0213 1 422.19 37.0429 2.596e-09 ***
species2 0.0597 0.0597 1 418.95 0.2450 0.6208
This does not seem right. Is there a better way to incorporate year, or should I include year at all?
If I exclude year, why does the DenDF for depth change so drastically?
model1 <- lmer(species1 ~ depth + temperature + species2 + (1|year) + (1|site/plot), data=data)
Type III Analysis of Variance Table with Satterthwaite's method
Sum Sq Mean Sq NumDF DenDF F value Pr(>F)
depth 2.599 2.599 1 431.77 7.1096 0.007955 **
temperature 58.788 58.788 1 432.10 160.7955 < 2.2e-16 ***
species2 0.853 0.853 1 429.62 2.3336 0.127343
summary(M1)
Linear mixed model fit by maximum likelihood . t-tests use Satterthwaite's method ['lmerModLmerTest']
Formula: species1 ~ depth + temperature + species2 + (1 | site/plot)
Data: data
AIC BIC logLik deviance df.resid
833.4 861.9 -409.7 819.4 428
Scaled residuals:
Min 1Q Median 3Q Max
-2.20675 -0.66119 -0.07051 0.52722 2.99942
Random effects:
Groups Name Variance Std.Dev.
plot:site (Intercept) 0.0003221 0.01795
site (Intercept) 0.2051143 0.45290
Residual 0.3656072 0.60465
Number of obs: 435, groups: plot:site, 24; site, 6
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) -0.538258 0.325072 50.071940 -1.656 0.10401
depth 0.006338 0.002377 431.768539 2.666 0.00796 **
temperature 0.391023 0.030837 432.101095 12.681 < 2e-16 ***
species2 -0.353264 0.231252 429.615226 -1.528 0.12734
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Fixed Effects:
(Intr) depth temp
depth -0.316
temperature -0.467 -0.204
specie2 -0.544 0.040 0.007
I may have asked more questions than I answered, but I hope some of this is helpful.
"The question is whether species2 (a predator) has any effect on populations of species1, relative to the environmental variables."
I think when you word it this way, it is not entirely clear. Are you interested in the effect that species2 has on species1 depending on what the environmental variables are (in other words, can the effect of species2 on species1 change depending on depth or temperature)? Or do you mean you would like to compare the effect of species2 on species1 with the effects of depth or temperature on species1? Or what do you mean, exactly, by "relative to the environmental variables"?
Yes, (1|year) + (1|site/plot) specifies a random intercept for year and for plot within site. If you wanted the effect of a variable to vary across groups (i.e. a random slope), you would write something like (temperature|year) + (1|site/plot), for example if you thought the effect of temperature on species1 might differ between years.
Exactly how you specify the model is going to be based on your knowledge of the biological system and your knowledge of statistics. Based on the information in your question, this random effects formulation that you have suggested appears completely reasonable to me. Yes, this is allowing you to account for grouped data (grouped by each year and by each plot within site). It is possible that with only 435 observations you may have convergence issues with an overly complex model, which you may or may not have - just something to look out for.
I am not sure what you mean by "this does not seem right" - what are you expecting to see? What is missing?
I am seeing the same model twice (below) with different values in the output. Is there a copy-and-paste error here, or am I missing something? The values shouldn't differ for the same model structure.
model1 <- lmer(species1 ~ depth + temperature + species2 + (1|year) + (1|site/plot), data=data)
You haven't removed year in the above line, but have below this in the summary(M1) call.
My simple answer to the year question would be yes, I would include year. Every year is so different in any biological dataset I have seen that it is worth including as a random intercept at least, exactly as you have done. If the variance of the year random effect is estimated to be zero, the model behaves as if the term were not there in the first place. At that point you can choose to fit year as a fixed effect instead if you would still like to account for the grouped nature of the data.
Also, there are lots of resources on this. Some examples:
Bolker, Benjamin M., Mollie E. Brooks, Connie J. Clark, Shane W. Geange, John R. Poulsen, M. Henry H. Stevens, and Jada-Simone S. White. "Generalized linear mixed models: a practical guide for ecology and evolution." Trends in ecology & evolution 24, no. 3 (2009): 127-135.
Harrison, Xavier A., Lynda Donaldson, Maria Eugenia Correa-Cano, Julian Evans, David N. Fisher, Cecily ED Goodwin, Beth S. Robinson, David J. Hodgson, and Richard Inger. "A brief introduction to mixed effects modelling and multi-model inference in ecology." PeerJ 6 (2018): e4794.
https://peerj.com/articles/4794/
I have the following bayes net with me.
I want to find P(+h|+e). So I have to find A = P(+h,+e) and B = P(+e) to find P(+h|+e). I wanted to use variable elimination to find the probability. Taking different orders gives me different probabilities. How should I choose the order of variable elimination for an accurate calculation of P(+h|+e)?
Will it be okay if I calculate P(+h,+u,+e) and eliminate +u instead of finding P(+i, +h, +t, +u, +e) and eliminating +i, +t and +u to find P(+h,+e)?
How do I calculate P(+e)?
1. P(h|e) is the conditional probability P(cause | effect): we are using an effect to infer the cause (the diagnostic direction).
By Bayes' rule, P(c | e) = P(e | c)P(c) / P(e), so here P(h | e) = P(e | h)P(h) / P(e) = P(h, e) / P(e)
So to calculate P(h,e) you would have to calculate the joint distribution over all the variables and marginalise out every variable other than h and e, since they are all relevant to the query and evidence variables.
Starting from the full joint P(i, +h, t, u, +e) and summing out i, t and u would be the correct choice.
To calculate P(+e) we would need only its parents, i.e. Good test taker (t) and Understands the material (u). So we need the underlying conditional distribution P(e | t, u), and we marginalise out the variables t and u.
P(+e)
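= Sum_t( Sum_u( P(+e, t, u)))
= Sum_t( Sum_u( P(+e | t, u) P(t, u)))
= P(+e | +t,+u)P(+t)P(+u) + P(+e | +t,-u)P(+t)P(-u) + P(+e | -t,+u)P(-t)P(+u) + P(+e | -t,-u)P(-t)P(-u)
The last step writes P(t, u) as P(t)P(u), which holds only if t and u are independent; if they share an ancestor in the network, keep P(t, u) and obtain it by summing out that ancestor.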
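As a rough numerical sketch of the sum above: the probabilities below are invented purely for illustration (the actual conditional probability tables are not given here), and the code treats t and u as independent, exactly as the expansion does. It also sums in both orders to show that the order of elimination changes only the amount of work, not the answer, so different results from different orders usually point to an arithmetic slip.

```python
# Hypothetical numbers, chosen only for illustration.
p_t = {True: 0.6, False: 0.4}   # P(t)
p_u = {True: 0.7, False: 0.3}   # P(u)
p_e_given = {                   # P(+e | t, u)
    (True, True): 0.9,
    (True, False): 0.5,
    (False, True): 0.7,
    (False, False): 0.3,
}


def term(t, u):
    """One summand P(+e | t, u) P(t) P(u), assuming t and u are independent."""
    return p_e_given[(t, u)] * p_t[t] * p_u[u]


values = (True, False)

# Sum out u first, then t.
p_e_u_first = sum(sum(term(t, u) for u in values) for t in values)
# Sum out t first, then u.
p_e_t_first = sum(sum(term(t, u) for t in values) for u in values)

print(p_e_u_first, p_e_t_first)
```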
I have 77 subjects, 1 continuous DV (activation), 2 continuous IVs (score1 and score2) and 1 categorical IV (condition) with 2 levels. Each subject undergoes both conditions.
I code the model as:
MIXED activation BY condition WITH score1 score2
/CRITERIA=CIN(95) MXITER(1000) MXSTEP(10) SCORING(1) SINGULAR(0.000000000001) HCONVERGE(0,
ABSOLUTE) LCONVERGE(0, ABSOLUTE) PCONVERGE(0.000001, ABSOLUTE)
/FIXED=condition score1 score2 condition*score1 condition*score2 | SSTYPE(3)
/METHOD=ML
/PRINT=DESCRIPTIVES G SOLUTION TESTCOV
/REPEATED=condition | SUBJECT(subject) COVTYPE(ID)
/EMMEANS=TABLES(condition) COMPARE ADJ(BONFERRONI)
Which commands should I use to investigate the interaction between condition(0, 1) and score1 (continuous)?
If you can get the regression coefficients for the fixed part of the model (your /PRINT subcommand already requests SOLUTION), the coefficient for condition*score1 equals the difference between the score1 slopes in the two conditions. Its test is a test of the null hypothesis that the slopes are equal, i.e., that the effect of score1 is the same in both conditions.
Use the analogous method for condition*score2.
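To illustrate why the interaction coefficient is the difference in slopes, here is a minimal sketch using ordinary least squares on simulated data. It is not a mixed model and ignores the repeated-measures structure; the simulated values and the helper `slope` are assumptions made only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
condition = rng.integers(0, 2, size=n)   # 0/1 dummy for the two conditions
score1 = rng.normal(size=n)
# Simulated outcome: slope 0.5 in condition 0 and 1.5 in condition 1.
activation = (1.0 + 0.3 * condition + 0.5 * score1
              + 1.0 * condition * score1
              + rng.normal(scale=0.5, size=n))

# Full model: intercept, condition, score1, condition*score1.
X = np.column_stack([np.ones(n), condition, score1, condition * score1])
beta, *_ = np.linalg.lstsq(X, activation, rcond=None)


def slope(mask):
    """Slope of activation on score1 within one condition."""
    return np.polyfit(score1[mask], activation[mask], 1)[0]


slope_diff = slope(condition == 1) - slope(condition == 0)
print(beta[3], slope_diff)   # the two numbers agree up to rounding error
```

As far as I know, SPSS takes the last category of a BY factor as the reference level, so the sign of the reported interaction coefficient may be the reverse of the one computed here.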
I have no clue about data mining or data analysis or statistical analysis but I think what I need is finding "clusters in a matrix". I have a data set of ~20k records and each has ~40 characteristics all of which are either turned on or off.
+--------+------+------+------+------+------+------+
| record | hasA | hasB | hasC | hasD | hasE | hasF |
+--------+------+------+------+------+------+------+
| foo | 1 | 0 | 1 | 0 | 0 | 0 |
| bar | 1 | 1 | 0 | 0 | 1 | 1 |
| baz | 1 | 1 | 1 | 0 | 0 | 0 |
+--------+------+------+------+------+------+------+
I'm quite convinced most of those 20k records have characteristics that fall into one of several categories. There must be means to determine how similar record 'foo' is to record 'bar'.
So, what is it that I'm actually looking at? What algorithm am I looking for?
Transform each record r into a binary vector v(r) so that the i-th component of v(r) is set to 1 if r has the i-th characteristic, and 0 otherwise.
Now run a hierarchical clustering algorithm on this set of vectors under the Hamming distance or the Jaccard distance, whichever you think is more appropriate; also make sure there's a notion of distance between clusters defined in terms of the underlying distance (see linkage criteria).
Then decide where to cut the resulting dendrogram based on common sense. Where you cut the dendrogram will affect the number of clusters.
One downside of hierarchical clustering is that it's rather slow. It takes O(n^3) time in general, so it would take quite a while on a large data set. For single- and complete-linkages you can bring the time down to O(n^2).
Hierarchical clustering is very easy to implement in languages such as Python. You can also use the implementation from the scipy library.
Example: Hierarchical Clustering in Python
Here's a code snippet to get you started. I assume S is the set of records transformed into binary vectors (i.e. each list in S corresponds to a record from your data set).
import matplotlib.pyplot as plt
import scipy.cluster.hierarchy as sch
from scipy.spatial.distance import pdist
# This is the set of binary vectors, each of which would
# correspond to a record in your case.
S = [
[0, 0, 0, 1, 1], # 0
[0, 0, 0, 0, 1], # 1
[0, 0, 0, 1, 0], # 2
[1, 1, 1, 0, 0], # 3
[1, 0, 1, 0, 0], # 4
[0, 1, 1, 0, 0]] # 5
# Pairwise Hamming distances (condensed form), then complete linkage.
Z = sch.linkage(pdist(S, metric='hamming'), method='complete')
# Plot the dendrogram; leaf labels are the row indices of S.
P = sch.dendrogram(Z)
plt.show()
The result is as you'd expect: cut at 0.5 to get two clusters, one containing the first three vectors (which have zeros at the beginning and ones at the end) and the other containing the last three vectors (which have ones at the beginning and zeros at the end). Here's the image:
Hierarchical clustering starts with each vector being its own cluster. In each successive step it merges the two closest clusters. It repeats this until there is a single cluster left.
The dendrogram essentially encodes the whole clustering process. At the beginning each vector is its own cluster. Then {3} and {5} merge into {3,5} and {0} and {2} merge into {0,2}. Next, {4} and {3,5} merge into {3,4,5}, and {1} and {0,2} merge into {0,1,2}. Finally, {0,1,2} and {3,4,5} merge into {0,1,2,3,4,5}.
From the dendrogram you can usually see at which point it makes the most sense to cut---this will define your clusters.
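Once you have chosen a cut height, you can turn it into cluster labels programmatically. The sketch below repeats the six example vectors and uses scipy's `fcluster` with the cut height 0.5 mentioned above; the threshold is specific to this toy data and would need to be chosen afresh for real records.

```python
import scipy.cluster.hierarchy as sch
from scipy.spatial.distance import pdist

# The same six binary vectors as in the example above.
S = [
    [0, 0, 0, 1, 1],  # 0
    [0, 0, 0, 0, 1],  # 1
    [0, 0, 0, 1, 0],  # 2
    [1, 1, 1, 0, 0],  # 3
    [1, 0, 1, 0, 0],  # 4
    [0, 1, 1, 0, 0],  # 5
]

Z = sch.linkage(pdist(S, metric="hamming"), method="complete")

# Cut the tree at height 0.5: vectors whose merge height is at most 0.5
# end up in the same flat cluster.
labels = sch.fcluster(Z, t=0.5, criterion="distance")
print(labels)  # one integer cluster id per row of S
```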
I encourage you to experiment with various distances (e.g. Hamming distance, Jaccard distance) and linkages (e.g. single linkage, complete linkage), and various representations (e.g. binary vectors).
Are you sure you want cluster analysis?
To find similar records you don't need cluster analysis. Simply find similar records with any distance measure such as Jaccard similarity or Hamming distance (both of which are for binary data). Or cosine distance, so that you can use e.g. Lucene to find similar records fast.
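As a small sketch of such a comparison, the two measures can be written in a few lines of plain Python. The vectors are the rows foo, bar and baz from the table in the question; the function names are just illustrative.

```python
def hamming_distance(a, b):
    """Fraction of positions at which the two binary vectors differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def jaccard_similarity(a, b):
    """Number of shared 1s divided by positions where either vector has a 1."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 1.0


# Columns: hasA, hasB, hasC, hasD, hasE, hasF
foo = [1, 0, 1, 0, 0, 0]
bar = [1, 1, 0, 0, 1, 1]
baz = [1, 1, 1, 0, 0, 0]

print(hamming_distance(foo, bar), jaccard_similarity(foo, bar))
print(hamming_distance(foo, baz), jaccard_similarity(foo, baz))
```

Note that Hamming distance counts shared zeros as agreement while Jaccard ignores them, which matters when most characteristics are switched off for most records.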
To find common patterns, the use of frequent itemset mining may yield much more meaningful results, because these can work on a subset of attributes only. For example, in a supermarket, the columns Noodles, Tomato, Basil, Cheese may constitute a frequent pattern.
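To give a flavour of the idea, here is a deliberately naive sketch that counts how often each pair of characteristics occurs together. It is a brute-force pair count, not Apriori or FP-growth, and the function name `frequent_pairs` and the support threshold are assumptions made for illustration; the records are the three rows from the question's table.

```python
from collections import Counter
from itertools import combinations

# Each record is represented by the set of characteristics it has.
records = {
    "foo": {"hasA", "hasC"},
    "bar": {"hasA", "hasB", "hasE", "hasF"},
    "baz": {"hasA", "hasB", "hasC"},
}


def frequent_pairs(itemsets, min_support):
    """Return pairs of items that co-occur in at least min_support itemsets."""
    counts = Counter()
    for items in itemsets:
        # Sort so that each pair is always counted under the same key.
        counts.update(combinations(sorted(items), 2))
    return {pair: n for pair, n in counts.items() if n >= min_support}


pairs = frequent_pairs(records.values(), min_support=2)
print(pairs)
```

With about 40 characteristics there are only 780 possible pairs, so this brute-force count is feasible; for larger patterns a dedicated frequent itemset algorithm is the better tool.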
Most clustering algorithms attempt to divide the data into k groups. While this at first appears a good idea (get k target groups) it rarely matches what real data contains. For example customers: why would every customer belong to exactly one audience? What if the audiences are e.g. car lovers, gun lovers, football lovers, soccer moms - are you sure you don't want to allow overlap of these groups?
Furthermore, a problem with cluster analysis is that it's incredibly easy to use badly. It does not "fail hard": you always get a result, and you might not realize that it's a bad result.