SPSS version 23, MIXED module: maximum dummy variables? - spss

I am using the MIXED routine, repeated measures. I have 10 dummy variables (0/1) and 8 scaled variables for fixed effects. The results keep showing that one of the dummy variables is redundant. I played around moving the order in which the dummy and scaled variables are listed. Usually the last listed dummy variable gets flagged as being redundant. Is there a maximum number of dummy variables that should be included in the model? Eight of the dummy variables refer to 8 geographical regions of a country.

To understand why SPSS 'kicks out' one of the dummy variables, you should look at the origin of these dummies.
Let's say we have a dependent y belonging to a sample of objects. These objects come from 8 regions, x. In a flat regression model, we model the relation between y and x:
y = a + bx + e.
We want to know the value of b. But x is a nominal variable, so the categories or regions are not numbers, but names. Names don't fit in the above equation.
You have probably recoded x into dummies x1, x2 to x8. Now look at the records in your data and their scores for x and the dummy variables. Here's an example of one record:
x x1 x2 x3 x4 x5 x6 x7 x8
8 0 0 0 0 0 0 0 1
If you look at the dummy variables one by one, and you get to x7, you know that the first 7 are al zeroes. For this record, you therefore already know that x8 must be 1. This is what SPSS means when it 'kicks out' redundant variable. This phenomenon is called perfect collinearity. The information in the last dummy you add to the model is redundant, because it is already in there.
In conclusion: leave out one of the dummies. The dummy variable you leave out will serve as the reference category in your model. For each of the other dummies, you will calculate the coefficient that tells you how big the records or objects with a given value/category of x differ from the reference category that was left out.
There are different ways to code your dummy variables in such a way that you use the mean as reference, in stead of one of the categories. Take a look at dummy coding on Wikipedia.
I also like this article that explains how degrees of freedom work. Although I hadn't mentioned this term before, it does touch on the very same idea of how dummy coding works.

Related

gretl - dummy interactions

There does not seem to be an "easy" way (such as in R or python) to create interaction terms between dummy variables in gretl ?
Do we really need to code those manually which will be difficult for many levels? Here is a minimal example of manual coding:
open credscore.gdt
SelfemplOwnRent=OwnRent*Selfempl
# model 1
ols Acc 0 OwnRent Selfempl SelfemplOwnRent
Now my manual interaction term will not work for factors with many levels and in fact does not even do the job for binary variables.
Thanks,
ML
One way of doing this is to use lists. Use the dummify-command for generating dummies for each level and the ^-operator for creating the interactions. Example:
open griliches.gdt
discrete med
list X = dummify(med)
list D = dummify(mrt)
list INT = X^D
ols lw 0 X D INT
The command discrete turns your variable into a discrete variable and allows to use dummify (this step is not necessary if your variable is already discrete). Now all interactions terms are stored in the list INT and you can easily assess them in the following ols-command.
#Markus Loecher on your second question:
You can always use the rename command to rename a series. So you would have to loop over all elements in list INT to do so. However, I would rather suggest to rename both input series, in the above example mrt and med respectively, before computing the interaction terms if you want shorter series names.

how to set a pattern in a variable using Z3Py

I'm pretty new in Z3, but a thing that my problem could be resolved with it.
I have two variables A and B and two pattern like this:
pattern_1: 1010x11x
pattern_2: x0x01111
where 1 and 0 are the bits zero and one, and x (dont care) cold be the bit 0 or 1.
I would like to use Z3Py for check if A with the pattern_1 and B with the pattern_2 can be true at the same time.
In this case if A = 10101111 and B = 10101111 than A and B cold be true ate the same time.
Can anyone help me with this?? It is possible resolve this with Z3Py
revised answer after clarification
Here's one way you could represent those constraints. There is an operation called Extract that can be applied to bit-vector terms. It is defined as follows:
def Extract(high, low, a):
"""Create a Z3 bit-vector extraction expression."""
where high is the high bit to be extracted, low is the low bit to be extracted, and a is the bitvector. This function represents the bits of a between high and low, inclusive.
Using the Extract function you can constrain each bit of whatever term you want to check so that it matches the pattern. For example, if the seventh bit of D must be a 1, then you can write s.add(Extract(7, 7, D) == 1). Repeat this for each bit in a pattern that isn't an x.

Trying to understand the VITERBI algorithm a bit better

I'm currently trying to implement the viterbi algorithm in python, more specifically the version presented in an online course.
As it stands, the algorithm is presented that way:
given a sentence with K tokens, we have to generate K tags .
We assume that tag K-1 = tag K-2 = '*', then for k going from 0 to K,
we set the tag for the token as follows :
tag(WORD_k) = argmax(p(k-1, tag_k-2, tag_k-1) * e( word_k, tag_k) * q(tag_k, tag_k-1, tag_k-1))
From my understanding this is straightforward because the p parameters are already calculated on each step (we go from 1 forward, and we already know p0), and max for the e and q params can be calculated by one iteration through the tags (since we can't come up with 2 different tags, we basically have to find the tag T for which the q * e product is maximal, and return that). This saves a lot of time, since we are almost at linear time in terms in big O notation, instead of exponential complexity, which we would get if we iterated through all possible word/tag combinations.
Am I getting the core of the algorithm correctly or am I missing something out?
Thanks in advance
since we can't come up with 2 different tags, we basically have to
find the tag T for which the q * e product is maximal, and return that
Yeah, sounds about right. q is the trigram (transition) probability and e is named the emission probability. As you said is unchanged between different paths in each stage, so the max is only dependent on the other two.
Each tag sequence should start with two asterisks at positions -2 and -1. So the first assumption is correct:
If we assume to be the maximum probability that the last two tags at position k are u and v, based on what we just said about the beginning asterisks, the base case would be
.
You had two errors in the general case though. The emission probability is a conditional. Also in the trigram, is repeated two times and the formula given is incorrect:

How do I winsorize data in SPSS?

Does anyone know how to winsorize data in SPSS? I have outliers for some of my variables and want to winsorize them. Someone taught me how to do use the Transform -> compute variable command, but I forgot what to do. I believe they told me to just compute the square root of the subjects measurement that I want to winsorize. Could someone please elucidate this process for me?
There is a script online to do it already it appears. It could perhaps be simplified (the saving to separate files is totally unnecessary), but it should do the job. If you don't need a script and you know the values of the percentiles you need it would be as simple as;
Get the estimates for the percentiles for variable X (here I get the 5th and 95th percentile);
freq var X /format = notable /percentiles = 5 95.
Then lets say (just by looking at the output) the 5th percentile is equal to 100 and the 95th percentile is equal to 250. Then lets make a new variable named winsor_X replacing all values below the 5th and 95th percentile with the associated percentile.
compute winsor_X = X.
if X <= 100 winsor_X = 100.
if X >= 250 winsor_X = 250.
You could do the last part a dozen different ways, but hopefully that is clear enough to realize what is going on when you winsorize a variable.

How to display the results of multiple comparisons

If you compare two sets of data (such as two files), the differences between these sets can be displayed in two columns, or two panes, such as WinMerge does.
But are there any visual paradigms to display the differences between multiple data sets?
Update
The starting point of my question was the assumption that displaying differences between 2 files is relatively easy, as I mentioned WinMerge, whereas comparing 3 or more text files turns out to be more complicated, as there will be more and more differences between, say, different versions of a document that have been created over time.
How would you highlight parts of the file that are the same in 2 versions, but different from other versions?
The data sets I have in mind are objects (A, B, C, ...) which may or may not exist and have properties (a, b, c, ...) which may be set or not set.
Example:
Set 1: A(a, b, c), B(b, c), C(c)
Set 2: A(a, b, c), B(b), C(c)
Set 3: A(a, b), B(b)
If you compare 2 sets, e.g. 1 and 2, the difference would be in B(c). Comparing sets 2 and 3 results in the difference A(c) and C().
If you compare all 3 sets, you end up with 3 comparisons (n * (n-1) / 2)
I have a different view than some of those who provided Answers--i.e., that you need to further specify the problem. The abstraction level is about right. Further specification would make the problem easier, but the solution less useful.
A couple of years ago, i saw a graphic on ProgrammableWeb--it compared the results from a search on Yahoo with the results from the same search on Google. There's a lot of information to covey: some results are in both sets, some in just one, and the common results will have different positions in the respective engine's results, which somehow has to be shown.
I like the graphic and reimplemented it in Matplotlib (a Python scientific plotting library). Below is an example using some random points as well as python code i used to generate it:
from matplotlib import pyplot as PLT
xvals = NP.array([(2,3), (5,7), (8,6), (1.5,1.8), (3.0,3.8), (5.3,5.2),
(3.7,4.1), (2.9, 3.7), (8.4, 6.1), (7.1, 6.4)])
yvals = NP.tile( NP.array([5,3]), [10,1] )
fig = PLT.figure()
ax1 = fig.add_subplot(111)
ax1.plot(x, y, "-", lw=3, color='b')
ax1.plot(x, y2, "-", lw=3, color='b')
for a, b in zip(xvals, yvals) : ax1.plot(a,b,'-o',ms=8,mfc='orange', color='g')
PLT.axis("off")
PLT.show()
This model has some interesting features: (i) it actually deals with 'similarity' on a per-item basis (the vertically-oriented line connecting the dots) rather than aggregate similarity; (ii) the degree of similarity between two data points is proportional to the angle of the line connecting them--90 degrees if they are equal, with a decreasing angle as the difference increases; this is very intuitive; (iii) cases in which a point in one data set is not present in the second data set are easy to show--a point will appear on one of the two lines but without a line connecting it to a point on the other line.
This model works well for comparing search results because each search result has a 'score' (its index, or order in the Results List). For other types of data, you might have to assign a score to each data point--a similarity metric might i suppose (in a sense, that's actually what the search result order is, an distance from the top of the list)
Since there has been so much work into displaying a diff of two files, you might start by expressing your 'multiple data sets' in an appropriate text format, then using whatever you want to show a diff between those text formats.
But you should tell us more about your data sets!
I experimented a bit, and implemented two displays:
Matrix
Timeline
I agree with Peter, you should specify what type your data is and what you wish to bring out in the comparison.
Depending on the nature of the data/comparison you can consider different visualisations. Is your data ordered or unordered? How many things are you comparing, i.e. fine grain or gross comparison?
Examples:
Visualizing a comparison of unordered data could just be plotting the two histograms of your sets (i.e. distributions):
image source
On the other hand, comparing a huge ordered dataset like DNA can be done innovatively.
Also, check out visual complexity, it's a great resource for interesting visualization.

Resources