I am running a path analysis to examine its invariance across two different cultures. When I examine the fit and the parameters of the path model separately for each culture, I find that some paths are significant in both cultures, whereas others are non-significant in one culture or the other. Should I first respecify the path models by deleting the non-significant paths from both samples and then go on to examine configural or metric invariance, or is it OK to continue the analysis with the path models unchanged, provided that the fit indices are satisfactory?
I was wondering if anyone knows how SPSS builds the interaction terms and calculates the significance of predictors behind the scenes in a GLM? From my understanding, it dummy codes variables and treats the one that comes alphabetically last as the reference group.
The reason I'm asking is that I have a GLM model with 3 continuous predictors and two categorical predictors (dummy coded). When I build all the 2-way and 3-way interactions with syntax, i.e.:
Age_Centred Age_Centred*Dx Age_Centred*gender Age_Centred*Dx*gender BMI_Centred BMI_Centred*Dx BMI_Centred*gender BMI_Centred*Dx*gender BPS_Centred BPS_Centred*Dx BPS_Centred*gender BPS_Centred*Dx*gender Dx Dx*gender Dx*ICV_Centred Dx*ICV_Centred*gender gender ICV_Centred ICV_Centred*gender.
vs. manually creating all the variables by hand, i.e.:
Age_Centred Age_Centred_Dx Age_Centred_gender Age_Centred_gender_Dx BMI_Centred BMI_Centred_Dx BMI_Centred_gender BMI_Centred_gender_Dx BPS_Centred BPS_Centred_Dx BPS_Centred_gender BPS_Centred_gender_Dx Dx gender_Dx ICV_Dx ICV_Centred_Dx_gender gender ICV_Centred ICV_gender.
I end up with a model that has the same intercept, overall significance, and R squared; however, the individual significance of the predictors changes (refer to the output below). To troubleshoot, I've tried flipping the reference groups when manually creating the variables, but that still does not replicate the results. I've had another statistician try the same thing, and they ended up at the same point I did. Does it have to do with some of the parameters being redundant?
Building the terms via syntax:
Physically creating the variables by multiplying them together:
All the details one might reasonably want about how GLM (and UNIANOVA, which shares the same underlying code) parameterizes models, estimates parameters, and conducts hypothesis tests are in the IBM SPSS Statistics Algorithms manual, which can be downloaded as a PDF from ftp://public.dhe.ibm.com/software/analytics/spss/documentation/statistics/26.0/en/client/Manuals/IBM_SPSS_Statistics_Algorithms.pdf. (Note that this is a large file, about 78 MB; clicking on the link starts a download.) In addition to the information in the GLM chapter, appendices F (Indicator Method) and H (Sums of Squares) are relevant, respectively, for building the design matrix and for specifying the linear combinations of model parameters used to compute sums of squares for testing hypotheses.
In building the design matrix, categorical predictors (factors) are indeed represented by sets of indicator (0-1) variables. For a factor with k levels, k indicator variables are created, one for each observed level of the factor. The procedure does not explicitly treat the last category (sorted in ascending order, alphabetical for strings) as a reference category, though in simpler models the effect of what's done is essentially the same. If there is an intercept in the model, then the kth indicator will be redundant (linearly dependent on the intercept and the preceding k-1 indicators). The estimation algorithm used in GLM/UNIANOVA sets the row and column of the cross-product matrix representing the redundant column in the design matrix to 0s and aliases the corresponding parameter estimate to 0. The results are similar to a reparameterization approach that treats the last category as a reference category, except that you have to remember the aliased parameter is there if you want to specify a linear combination of the parameters to estimate.
If you suppress the intercept, then for the first factor entered into the model the kth indicator would not be redundant (unless the factor is preceded by an unusual covariate or set of covariates). Any subsequent factors included in the model would involve redundant parameters, as would any interactions among factors, whether or not an intercept is included. Interactions among factors are created by multiplying the 0s and 1s for each level of one factor by those for each level of the other factor. So for an interaction of two two-level factors, four columns are generated, of which typically the last three are redundant.
Covariates are entered simply by copying the values of the variables into the design matrix. Interactions involving covariates and other covariates multiply values for the columns involved within each row, and interactions involving covariates and factors multiply covariates (or products of them) by the indicator variables for the factor(s). Usually covariate-by-covariate terms do not involve redundancies, but factor-by-covariate terms do.
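To see the indicator coding and the redundancy concretely, here is a small illustrative sketch (Python with pandas/NumPy, not SPSS, and with made-up data) of what the design-matrix columns look like for a two-level factor with an intercept and a factor-by-covariate interaction:

```python
import numpy as np
import pandas as pd

# Illustrative (made-up) data: a two-level factor and a centred covariate.
df = pd.DataFrame({
    "gender": ["0", "0", "1", "1"],
    "Age_Centred": [-2.0, 1.0, 0.5, 0.5],
})

# GLM-style indicator coding: one 0/1 column per observed level of the factor.
G = pd.get_dummies(df["gender"], prefix="gender", dtype=float)  # gender_0, gender_1
intercept = np.ones(len(df))

# With an intercept, the last indicator is linearly dependent on the intercept
# and the preceding indicator (gender_1 = intercept - gender_0), so its
# parameter gets aliased to 0.
X = np.column_stack([intercept, G["gender_0"], G["gender_1"]])
print(np.linalg.matrix_rank(X))  # 2, not 3: one column is redundant

# A factor-by-covariate interaction multiplies the covariate into each
# indicator column; with the covariate's own column also in the model,
# one of these interaction columns is again redundant.
interaction = G.mul(df["Age_Centred"], axis=0)
print(interaction)
```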
To get to the specifics of what's going on with your data, I can't replicate your exact results without your data, but I am able to replicate the patterns shown if I assume you've used the binary Dx variable as a covariate and the binary gender variable as a factor in each analysis. (There seem to actually be four continuous predictors in your model rather than three, but that doesn't affect anything of importance for understanding what's going on.)
There are two aspects of the situation to consider. One is the parameterization: how the two ways of entering the variables into the model treat them, and whether or not they produce the same parameter estimates. The second is how the model specification determines the Type III tests shown in the ANOVA tables.
If I'm understanding things correctly based on what you've posted here, then if you compare parameter estimates for the two analyses, you should find that the estimates for the intercepts and the non-redundant estimates for gender ([gender=0]) are the same and have the same standard errors. For the terms involving just covariates or products of covariates, I expect you will find that the parameter estimates differ between the two analyses and produce different t statistics. For interactions involving gender and covariates (which is all the other variables or products created outside the procedure), I expect the estimates will be the same in magnitude and opposite in sign, with the same standard errors.
None of the estimates or tests here are wrong. The models fitted involve interaction effects. An interaction means that the effect of one variable varies by the levels of the other variable(s) in the interaction, and in order to estimate the same simple effects you have to parameterize the model in the same way, at least as far as the non-redundant parameters are concerned. However, to get the Type III tests for all terms to be identical, it's not always enough to have the same parameter estimates and standard errors. Type III tests involve a concept called containment that must also be considered.
For two effects in a model, effect A is contained in effect B if:
A and B contain the same covariate terms, if any.
B contains all factor effects in A, and at least one more (with the intercept being contained in all factor-only effects).
In your original model, the intercept is contained in the gender effect, gender is not contained in any other effects, and all the covariate main effects and two-way interactions among covariates are contained in the interactions between those terms and gender, while the three-way interactions (which include gender) are not contained in any other effects.
Type III sums of squares (not invented by SPSS, but by our friends at SAS) are based on linear combinations of parameters where a given effect is adjusted for any effects that do not contain it, and made orthogonal to any effects that contain it. The practical application of these rules is complicated (see Appendix H of the algorithms).
If you recode the gender variable to swap the 0 and 1 values, specify it as a covariate along with all the other variables, and fit the same model, you should be able to match all the non-redundant parameter estimates from the original model, along with their standard errors and t statistics. However, because the containment relationships in the original model are no longer there, the Type III tests for the terms not involving gender (which were previously contained in terms involving gender) will not match up.
The bottom line is that all of the results are translatable and all correct for what's being done, and that in order to make much sense out of individual terms you have to focus carefully on what's being estimated in a given parameterization, as well as on the containment relationships. The difficult part gets simpler when you take seriously the fact that when variable X is involved in interaction terms, there is no single estimate of the effect of X. Any estimate is conditional on where you fix the value(s) of the terms with which X interacts.
I'm attempting to build a predictive model that can map text-based vendor-provided descriptions of a service to around 800 standardized service codes, based on a training set of about 13,000 correctly mapped services.
Each standardized service code also has a standardized description, which is usually similar to the vendor-provided description (i.e., some of the words used are the same), but not identical. Descriptions are typically 3-10 words in length.
My main issue is that I'm not sure what type of estimator will be appropriate for this problem.
I have tried using simple fuzzy matching approaches, including:
Counting matching words/characters between the vendor-provided and standardized service descriptions and selecting the one with the most matches
Trying to find the standardized service description with the minimum Levenshtein distance
These have not worked particularly well because the vendor-provided and standardized descriptions often use synonymous but different word choices.
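For reference, a minimal sketch of the two heuristics described above, using only the Python standard library; the lookup table and codes (standard_descriptions, SVC001, etc.) are invented placeholders:

```python
from difflib import SequenceMatcher

# Hypothetical standardized code -> description lookup (placeholder data).
standard_descriptions = {
    "SVC001": "annual boiler inspection and safety check",
    "SVC002": "replace kitchen tap washer",
}

def word_overlap(a: str, b: str) -> int:
    """Number of words the two descriptions share."""
    return len(set(a.lower().split()) & set(b.lower().split()))

def char_similarity(a: str, b: str) -> float:
    """Character-level similarity ratio (a stand-in for edit distance)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def best_match(vendor_text: str) -> str:
    # Pick the code whose standardized description scores highest,
    # using word overlap first and character similarity as a tie-breaker.
    return max(
        standard_descriptions,
        key=lambda code: (word_overlap(vendor_text, standard_descriptions[code]),
                          char_similarity(vendor_text, standard_descriptions[code])),
    )

print(best_match("yearly boiler safety inspection"))  # SVC001
```

As noted above, this kind of surface matching breaks down as soon as the vendor uses synonyms that share no characters with the standardized wording.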
I have also considered using a decision tree, but it seems infeasible given 800+ possible outcomes.
Which type of estimator can I use to solve this problem?
I am interested in using DBpedia Spotlight. However, we need to supply values for two parameters, confidence and support. What do these two parameters really mean?
I want to identify the significant, prominent n-grams in the text. In that case, what is the usual recommendation (rule of thumb) for the confidence and support parameters?
When you ask DBpedia Spotlight to annotate text (finding entities/topics), it searches for n-grams that have URIs on DBpedia (n-grams that are Wikipedia titles). Those n-grams are called DBpedia resources.
Support: this is the Resource Prominence parameter; it helps you ignore unimportant or uninformative resources. When you set it to a value X, resources with fewer than X Wikipedia in-links are ignored and not returned to you.
Confidence: this is the Disambiguation Confidence parameter, a threshold between 0 and 1. The higher you set it, the more trustworthy the annotations you get, but you risk losing some correct ones.
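As a rough illustration of where the two parameters go, here is a sketch of a call to the public DBpedia Spotlight REST endpoint (assuming the api.dbpedia-spotlight.org service; if you run your own server, substitute its URL, and adjust the response field names if your version's JSON output differs):

```python
import requests

# Assumed public DBpedia Spotlight endpoint; replace with your own server's URL if self-hosting.
URL = "https://api.dbpedia-spotlight.org/en/annotate"

params = {
    "text": "Berlin is the capital of Germany.",
    "confidence": 0.5,  # disambiguation confidence threshold (0-1)
    "support": 20,      # minimum number of Wikipedia in-links for a resource
}
resp = requests.get(URL, params=params, headers={"Accept": "application/json"})
resp.raise_for_status()

# Print the annotated resources (surface form, DBpedia URI, similarity score).
for res in resp.json().get("Resources", []):
    print(res["@surfaceForm"], res["@URI"], res["@similarityScore"])
```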
Choosing values of those (or any other) parameters depends on your use case.
Examples:
If you have a test set or gold standard for the type of n-grams you are interested in, you can tune your choice until the results on your gold standard are good enough.
If you only care about retrieving the top-N n-grams to infer the topic of the text, choose high values to get a few (mostly correct) n-grams and sort them by confidence.
If you want to get as many n-grams as possible and your task won't be affected or biased by mistakes, set low values.
I am reading about n-grams and I am wondering whether there are cases in practice where unigrams would be preferred over bigrams (or higher-order n-grams). As I understand it, the larger n is, the greater the complexity of calculating the probabilities and establishing the vector space. But apart from that, are there other reasons (e.g. related to the type of data)?
This boils down to data sparsity: as your n-gram length increases, the number of times you will see any given n-gram decreases. In the most extreme example, if you have a corpus where the maximum document length is n tokens and you are looking for an m-gram where m = n + 1, you will of course have no data points at all, because it's simply not possible to have a sequence of that length in your data set. The sparser your data set, the worse you can model it. For this reason, although a higher-order n-gram model in theory contains more information about a word's context, it cannot easily generalize to other data sets (known as overfitting), because the number of events (i.e. n-grams) it has seen during training becomes progressively smaller as n increases. On the other hand, a lower-order model lacks contextual information and so may underfit your data.
For this reason, if you have a relatively large number of token types (i.e. the vocabulary of your text is very rich) but each of these types has a very low frequency, you may get better results with a lower-order n-gram model. Similarly, if your training data set is very small, you may do better with a lower-order n-gram model. However, assuming that you have enough data to avoid overfitting, you then get better separability of your data with a higher-order model.
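One quick way to see this sparsity effect on your own corpus is to count how many distinct n-grams occur only once as n grows; here is a minimal sketch with toy sentences standing in for real data:

```python
from collections import Counter

# Toy corpus; replace with your own tokenized sentences.
sentences = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
    "a cat and a dog sat together".split(),
]

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

for n in (1, 2, 3):
    counts = Counter(g for s in sentences for g in ngrams(s, n))
    singletons = sum(1 for c in counts.values() if c == 1)
    print(f"n={n}: {len(counts)} distinct n-grams, {singletons} seen only once")
```

The proportion of n-grams seen only once grows quickly with n, which is exactly the sparsity problem described above.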
Usually, n-grams with n greater than 1 are better, as they carry more information about the context in general. However, unigrams are sometimes computed alongside bigrams and trigrams and used as a fallback for them. Unigrams are also useful if you want higher recall than precision, for instance when searching for all possible uses of the verb "make".
Let's use statistical machine translation as an example:
Intuitively, the best scenario is that your model has seen the full sentence (let's say a 6-gram) before and knows its translation as a whole. If that is not the case, you try to divide it into smaller n-grams, keeping in mind that the more you know about a word's surroundings, the better the translation. For example, if you want to translate "Tom Green" into German and your model has seen the bigram, it will know it is a person's name and should remain as it is; but if your model has never seen it, you fall back to unigrams and translate "Tom" and "Green" separately, so "Green" gets translated as the colour "Grün", and so on.
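The fallback idea can be sketched as a simple lookup with backoff; the phrase tables below are invented toy entries, not a real SMT system:

```python
# Toy phrase tables (invented entries) illustrating bigram-to-unigram backoff.
bigram_table = {("tom", "green"): "Tom Green"}    # known name: keep as-is
unigram_table = {"tom": "Tom", "green": "grün"}   # "green" alone is the colour

def translate(tokens):
    out, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if len(pair) == 2 and pair in bigram_table:
            out.append(bigram_table[pair])        # use the larger context
            i += 2
        else:
            out.append(unigram_table.get(tokens[i], tokens[i]))  # back off to unigrams
            i += 1
    return " ".join(out)

print(translate("tom green".split()))  # Tom Green (bigram known, name preserved)
print(translate("green tea".split()))  # grün tea (no bigram known, colour translation)
```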
Also, in search, knowing more about the surrounding context makes the results more accurate.
I am trying to develop a music-focused search engine for my final-year project. I have been doing some research on the Internet about Latent Semantic Analysis and how it works, but I am having trouble understanding where LSI sits exactly in the whole system of a search engine.
Should it be used after a web crawler has finished looking up web pages?
I don't know much about music retrieval, but in text retrieval, LSA is only relevant if the search engine is making use of the vector space model of information retrieval. Most common search engines, such as Lucene, break each document up into words (tokens), remove stop words and put the rest of them into the index, each usually associated with a term weight indicating the importance of the term within the document.
Now the list of (token,weight) pairs can be viewed as a vector representing the document. If you combine all of these vectors into a huge matrix and apply the LSA algorithm to that (after crawling and tokenising, but before indexing), you can use the result of the LSA algorithm to transform the vectors of all documents before indexing them.
Note that in the original vectors, the tokens represented the dimensions of the vector space. LSA will give you a new set of dimensions, and you'll have to index those (e.g. in the form of auto-generated integers) instead of the tokens.
Furthermore, you will have to transform the query into a vector of (token,weight) pairs, too, and then apply the LSA-based transformation to that vector as well.
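Putting those steps together, a minimal scikit-learn sketch (with placeholder documents rather than crawled pages, and not a production indexer) of the LSA transformation for both documents and query might look like this:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder documents standing in for crawled, tokenized pages.
docs = [
    "guitar chords and tabs for beginners",
    "classical piano sheet music downloads",
    "learn to play acoustic guitar songs",
]

# (token, weight) vectors -> term-document matrix with stop words removed.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

# LSA: project onto a small number of latent dimensions; these, not the
# tokens, are what you would index (e.g. as auto-generated integer ids).
lsa = TruncatedSVD(n_components=2, random_state=0)
X_lsa = lsa.fit_transform(X)

# The query receives the same two transformations before matching.
q = vectorizer.transform(["beginner guitar lessons"])
q_lsa = lsa.transform(q)
print(cosine_similarity(q_lsa, X_lsa))  # similarity of the query to each document
```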
I am unsure if anybody actually does all of this in any real-life text retrieval engine. One problem is that performing the LSA algorithm on the matrix of all document vectors consumes a lot of time and memory. Another problem is handling updates, i.e. when a new document is added, or an existing one changes. Ideally, you'd recompute the matrix, re-run LSA, and then modify all existing document vectors and re-generate the entire index. Not exactly scalable.