Variable for the number of cases SPSS

In my SPSS syntax script I compute a bunch of formulas for each case.
Let's say this is my data:
id value
1 34
2 12
3 94
I now want to compute a new variable for which I need the number of cases in the file (the number of ids).
In this example the number of cases would be 3.
Is there a command for this? Thanks.

You can use the AGGREGATE command without a break variable to return the number of cases in the dataset. Example below:
AGGREGATE OUTFILE=* MODE=ADDVARIABLES /NumberOfCases=N.
COMPUTE NewVar = value / NumberOfCases.
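For readers who want to check the arithmetic outside SPSS, here is a minimal Python sketch of the same idea, using the three example rows from the question. The names `rows`, `number_of_cases` and `new_var` are made up for this illustration and are not part of the answer.

```python
# Example data from the question: (id, value) pairs.
rows = [(1, 34), (2, 12), (3, 94)]

# Counterpart of AGGREGATE without a break variable:
# a single count for the whole file.
number_of_cases = len(rows)

# Counterpart of the COMPUTE step: every case is divided by the same count.
new_var = [value / number_of_cases for _, value in rows]

for (case_id, value), result in zip(rows, new_var):
    print(case_id, value, round(result, 3))
```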


Selecting a cut-off score in SPSS

I have 5 variables for one questionnaire about social support. I want to define the group with low vs. high support. According to the authors low support is defined as a sum score <= 18 AND two items scoring <= 3.
It would be great to get a dummy variable which shows which people are low vs high in support.
How can I do this in the syntax?
Thanks ;)
Assuming your variables are named Var1, Var2, ..., Var5, and that they are consecutive in the dataset, this should work:
recode Var1 to Var5 (1 2 3=1)(4 thru hi=0) into L1 to L5.
compute LowSupport = sum(Var1 to Var5) <= 18 and sum(L1 to L5)>=2.
The new variable LowSupport will have value 1 for rows that meet the criteria you defined and 0 for other rows.
Note: If your variables are not consecutive you'll have to list all of them instead of using Var1 to Var5.
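To make the rule itself easy to check, here is a small Python sketch of the same logic: a sum score of at most 18 and at least two items scoring 3 or less. The function name `low_support` and its parameters are hypothetical, and missing values are not handled here.

```python
def low_support(items, sum_cutoff=18, item_cutoff=3, min_low_items=2):
    """Return 1 for low support, 0 otherwise."""
    total = sum(items)
    # Count the items that score at or below the item cut-off.
    low_items = sum(1 for score in items if score <= item_cutoff)
    return int(total <= sum_cutoff and low_items >= min_low_items)


print(low_support([3, 2, 4, 4, 4]))  # sum 17, two items <= 3
print(low_support([4, 4, 4, 4, 2]))  # sum 18, only one item <= 3
print(low_support([5, 5, 3, 3, 5]))  # sum 21, above the sum cut-off
```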

How to equalize the number of rows per unit in an SPSS file

I have a file with a different number of rows for every "unit", and I'd like all the units to have the same number of rows, by adding the right number of empty rows per unit in the data.
For example:
data list list/ unit serial someData.
begin data.
1 1 54
2 1 57
2 2 87
2 3 91
3 1 17
3 2 43
end data.
What I'd like to get is this:
1 1 54
1 2 .
1 3 .
2 1 57
2 2 87
2 3 91
3 1 17
3 2 43
3 3 .
I've worked with simple workarounds, for example casestovars => varstocases (keeping nulls), or preparing a base file with all the lines with unit names and serials, and then matching it with the data file so I end up with all the lines and all the data.
Could anyone suggest a more direct (\elegant\efficient\simple) approach?
A Cartesian product is what you need here.
Using your example data, and after downloading the STATS CARTPROD custom extension command, you can solve it as below:
data list list/ unit serial someData.
begin data.
1 1 54
2 1 57
2 2 87
2 3 91
3 1 17
3 2 43
end data.
STATS CARTPROD VAR1=unit VAR2=serial /SAVE OUTFILE="C:\Temp\dsCart".
SORT CASES BY unit serial.
MATCH FILES FILE=* /BY unit serial /FIRST=Primary.
SELECT IF Primary.
MATCH FILES FILE=* /FILE=ds0 /BY unit serial /DROP=Primary.
I'm not sure how efficient this Custom Extension Command is so you may want to experiment with different flavours of using STATS CARTPROD. An alternative approach would be to create two datasets (left and right) with your unique unit and serial values and then process these through the STATS CARTPROD command.
You already mentioned it: creating a base file with all the lines with unit names and serials, and then matching it with the data file would be a simple approach. I'd like to outline this one here for other readers.
So for the question's example you would create the base data set like this:
INPUT PROGRAM.
LOOP #i = 1 to 3. /* 3 = maximum value of unit.
LOOP #j = 1 to 3. /* 3 = maximum value of serial.
COMPUTE unit = #i.
COMPUTE serial = #j.
END CASE.
END LOOP.
END LOOP.
END FILE.
END INPUT PROGRAM.
The data set will look like this.
unit serial
1 1
1 2
1 3
2 1
2 2
2 3
3 1
3 2
3 3
The following match files command will bring the wanted result.
MATCH FILES
/FILE = base
/FILE = data1
/BY unit serial.
If you want the code to be more flexible regarding the maximum value of "unit" and "serial" you can make use of the Python extension:
begin program.
import spss, spssdata
# list of variable names
variables = ["unit", "serial"]
# fetch variable data, then close the cursor before submitting syntax
curs = spssdata.Spssdata(variables)
data = curs.fetchall()
curs.CClose()
# get maximum of 'unit' and 'serial'
maxunit = max([int(i[0]) for i in data])
maxserial = max([int(i[1]) for i in data])
# create base data set
spss.Submit('''
INPUT PROGRAM.
LOOP #i = 1 to {maxu}.
LOOP #j = 1 to {maxs}.
COMPUTE unit = #i.
COMPUTE serial = #j.
END CASE.
END LOOP.
END LOOP.
END FILE.
END INPUT PROGRAM.
DATASET NAME base.
'''.format(maxu=maxunit, maxs=maxserial))
end program.
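The base-file-and-match idea can also be sketched in plain Python, outside SPSS, to show what the result should look like. This is only an illustration on the question's example data; the names `data` and `filled` are made up, and `None` stands in for an SPSS system-missing value.

```python
from itertools import product

# Example data from the question: (unit, serial) -> someData.
data = {(1, 1): 54, (2, 1): 57, (2, 2): 87, (2, 3): 91, (3, 1): 17, (3, 2): 43}

units = sorted({unit for unit, _ in data})
max_serial = max(serial for _, serial in data)

# Base grid: every unit paired with every serial from 1 to the maximum.
# Combinations that are absent from the data get None.
filled = [
    (unit, serial, data.get((unit, serial)))
    for unit, serial in product(units, range(1, max_serial + 1))
]

for row in filled:
    print(*row)
```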

SPSS macro for splitting single numeric variables to multiple variables

I have a variable named A in an SPSS database.
I want to split this variable into chunks of 2 digits and create multiple variables as follows.
A_1 A_2 A_3 A_4 A_5
--- --- --- --- ---
10 21 02
23 45 32 12
14 23 78
23 67 89 06 54
23 45
Can anyone write an SPSS macro to perform this operation?
Using STRING manipulations (after converting the NUMERIC field to STRING, if necessary), specifically SUBSTR you can extract out pairs of digits as you wish.
/* Simulate data */.
data list list / x (f8.0).
begin data.
102102
23453212
142378
2367890654
2345
end data.
dataset name dsSim.
If you have a known maximum value, in your example a value 10 digits long, then you'll need 5 variables to store the pairs of digits, which the following does:
set mxwarns 0 /* temporarily suppress warning messages */ .
string #xstr (a10).
compute #xstr=ltrim(string(x,f18.0)).
compute A_1=number(substr(#xstr,1,2), f8.0).
compute A_2=number(substr(#xstr,3,2), f8.0).
compute A_3=number(substr(#xstr,5,2), f8.0).
compute A_4=number(substr(#xstr,7,2), f8.0).
compute A_5=number(substr(#xstr,9,2), f8.0).
However, you may prefer to code something like this more dynamically (using python) where the code itself would read the maximum value in the data and create as many variables as needed.
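begin program.
import spss, spssdata, math
spss.Submit("set mprint on.")
# get maximum value
spss.Submit("""
dataset declare dsAgg.
aggregate outfile=dsAgg /MaxX=max(x).
dataset activate dsAgg.
""")
curs = spssdata.Spssdata()
maxvalue = curs.fetchone()[0]
curs.CClose()
spss.Submit("""
dataset close dsAgg.
dataset activate dsSim.
""")
# number of digits in the largest value
ndigits = len(str(int(maxvalue)))
cmd = """
set mxwarns 0.
string #xstr (a10).
compute #xstr=ltrim(string(x,f18.0))."""
# one compute per pair of digits; j is the start position of pair i
for i in range(1,int(math.ceil(ndigits/2.0))+1):
    j = (i-1)*2 + 1
    cmd+="\ncompute B_%(i)s=number(substr(#xstr,%(j)s,2), f8.0)." % locals()
spss.Submit(cmd)
spss.Submit("set mprint off.")
end program.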
You would need to weigh up the pros and cons of each method to assess which suits you best, given how you anticipate your data will arrive and how you then go on to work with it later. I haven't attempted to wrap either of these up in a macro, but that could just as easily be done.
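The digit-pair logic can be checked on its own with a short standalone Python sketch, independent of SPSS. The function name `split_pairs` is hypothetical. As with the numeric conversion in the SPSS code, a pair such as 02 becomes the number 2.

```python
def split_pairs(value, width=2):
    """Split a non-negative integer into chunks of `width` digits, left to right."""
    digits = str(int(value))
    return [int(digits[i:i + width]) for i in range(0, len(digits), width)]


# Values corresponding to the rows of the question's example table.
for x in (102102, 23453212, 142378, 2367890654, 2345):
    print(x, split_pairs(x))
```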

Generating means of a variable using dummy variables & foreach in Stata

My dataset includes TWO main variables X and Y.
Variable X represents distinct codes (e.g. 001X01, 001X02, etc) for multiple computer items with different brands.
Variable Y represents the tax charged for each code of variable X (e.g. 15 = 15% for 001X01) at a store.
I've created categories for these computer items using dummy variables (e.g. HD dummy variable for Hard-Drives, takes value of 1 when variable X represents a HD, etc). I have a list of over 40 variables (two of them representing X and Y, and the rest is a bunch of dummy variables for the different categories I've created for computer items).
I would like to display the averages of all these categories using a loop in Stata, but I'm not sure how to do this.
For example the code:
mean Y if HD == 1
Mean estimation                    Number of obs = 5
             |       Mean   Std. Err.   [95% Conf. Interval]
         Tax |        7.1    2.537716   1.154172    15.24583
gives me the mean Tax for the category representing Hard Drives. How can I use a loop in Stata to automatically display all the mean Taxes charged for each category? I would do it by hand without a problem, but I want to repeat this process for multiple years, so I would like to use a loop for each year in order to come up with this output.
My goal is to create a separate Excel file with each of the computer categories I've created (38 total) and the average tax for each category by year.
Why bother with the loop and creating the indicator variables? If I understand correctly, your initial dataset allows the use of a simple collapse:
clear all
set more off
input ///
code tax str10 categ
1 0.15 "hd"
2 0.25 "pend"
3 0.23 "mouse"
4 0.29 "pend"
5 0.16 "pend"
6 0.50 "hd"
7 0.54 "monitor"
8 0.22 "monitor"
9 0.21 "mouse"
10 0.76 "mouse"
end
collapse (mean) tax, by(categ)
To take the result to Excel you can try export excel or putexcel.
Run help collapse and help export excel for details.
Because you insist, below is an example that gives the same result using loops.
I assume the same data input as before. Some testing using this example database
with expand 1000000, shows that speed is virtually the same. But almost surely,
you (including your future you) and your readers will prefer collapse.
It is much clearer, cleaner and more concise. It is even prettier.
levelsof categ, local(parts)
gen mtax = .
quietly {
    foreach part of local parts {
        summarize tax if categ == "`part'", meanonly
        replace mtax = r(mean) if categ == "`part'"
    }
}
bysort categ: keep if _n == 1
keep categ mtax
Stata has features that make it quite different from other languages. Once you
start getting a hold of it, you will find that many things done with loops elsewhere
can be made loop-less in Stata. In many cases, the latter style will be preferred.
See the corresponding help files using help <command>, and if you are not familiar with saved results (e.g. r(mean)), type help return.
A supplement to Roberto's excellent answer: After collapse, you will need a loop to export the results to Excel.
levelsof categ, local(levels)
foreach x of local levels {
    export excel `x', replace
}
I prefer to use numerical codes for variables such as your category variable. I then assign them value labels. Here's a version of Roberto's code which does this and which, for closer correspondence to your problem, adds a "year" variable:
input code tax categ year
1 0.15 1 1999
2 0.25 2 2000
3 0.23 3 2013
4 0.29 1 2010
5 0.16 2 2000
6 0.50 1 2011
7 0.54 4 2000
8 0.22 4 2003
9 0.21 3 2004
10 0.76 3 2005
end
#delim ;
label define catl
1 hd
2 pend
3 mouse
4 monitor
;
#delim cr
label values categ catl
collapse (mean) tax, by(categ year)
levelsof categ, local(levels)
foreach x of local levels {
    export excel `:label (categ) `x'', replace
}
The #delim ; command makes it possible to easily list each code on a separate line. The "label" function in the export statement is an extended macro function that inserts a value label into the file name.
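For comparison only, the grouping that collapse (mean) tax, by(categ year) performs can be sketched in plain Python on the same example rows. This is not Stata code and not part of either answer; the names `rows`, `labels` and `means` are made up for the illustration.

```python
from collections import defaultdict

# (code, tax, categ, year) rows from the example above.
rows = [
    (1, 0.15, 1, 1999), (2, 0.25, 2, 2000), (3, 0.23, 3, 2013),
    (4, 0.29, 1, 2010), (5, 0.16, 2, 2000), (6, 0.50, 1, 2011),
    (7, 0.54, 4, 2000), (8, 0.22, 4, 2003), (9, 0.21, 3, 2004),
    (10, 0.76, 3, 2005),
]
# Counterpart of the value labels defined with label define.
labels = {1: "hd", 2: "pend", 3: "mouse", 4: "monitor"}

# Accumulate [sum of tax, number of rows] per (category, year) group.
totals = defaultdict(lambda: [0.0, 0])
for _, tax, categ, year in rows:
    entry = totals[(labels[categ], year)]
    entry[0] += tax
    entry[1] += 1

means = {key: total / count for key, (total, count) in totals.items()}

for (categ, year), mean_tax in sorted(means.items()):
    print(categ, year, round(mean_tax, 4))
```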

Constrained Sequence to Index Mapping

I'm puzzling over how to map a set of sequences to consecutive integers.
All the sequences follow this rule:
A_0 = 1
A_n >= 1
A_n <= max(A_0 .. A_n-1) + 1
I'm looking for a solution that will be able to, given such a sequence, compute an integer for doing a lookup into a table and, given an index into the table, generate the sequence.
Example: for length 3, there are 5 valid sequences. A fast function for doing the following map (preferably in both directions) would be a good solution:
1,1,1 0
1,1,2 1
1,2,1 2
1,2,2 3
1,2,3 4
The point of the exercise is to get a packed table with a 1-1 mapping between valid sequences and cells.
The size of the set is bounded only by the number of unique sequences possible.
I don't know yet what the length of the sequence will be, but it will be a small (<12) constant known in advance.
I'll get to this sooner or later, but thought I'd throw it out for the community to have "fun" with in the meantime.
these are different valid sequences
these are not
There is a natural indexing of the sequences, but it is not so easy to calculate.
Let us look only at A_n for n>0, since A_0 = 1 is fixed.
Indexing is done in 2 steps.
Part 1:
Group sequences by the places where A_n = max(A_0 .. A_n-1) + 1. Call these places steps.
The values at the steps are consecutive numbers (2,3,4,5,...).
At a non-step place k we can put any number from 1 up to the current maximum, which is 1 plus the number of steps with index less than k.
Each group can be represented as a binary string where 1 is a step and 0 is a non-step. E.g. 001001010 means the group 112aa3b4c, with a<=2, b<=3, c<=4. Because groups are indexed by a binary number, there is a natural indexing of groups, from 0 to 2^length - 1. Let's call the value of a group's binary representation the group order.
Part 2:
Index the sequences inside a group. Since the group defines the step positions, only the numbers at non-step positions vary, and they vary within defined ranges. So it is easy to index a sequence within its group, using the lexicographical order of the variable places.
It is also easy to calculate the number of sequences in one group. It is a number of the form 1^i_1 * 2^i_2 * 3^i_3 * ..., where i_m counts the non-step places at which the current maximum is m.
This gives a 2 part key: <Steps, Group>, which then needs to be mapped to the integers. To do that we have to find how many sequences are in groups whose order is less than some value. For that, let's first find how many sequences there are in groups of a given length. That can be computed by passing through all groups and summing the number of sequences, or similarly with a recurrence. Let T(l, n) be the number of sequences of length l (A_0 is omitted) where the first element can be at most n+1. Then the following holds:
T(l,n) = n*T(l-1,n) + T(l-1,n+1)
T(1,n) = n + 1
Because l + n <= sequence length + 1, there are only ~sequence_length^2/2 values of T(l,n), which can be easily calculated.
Next is to calculate the number of sequences in groups of order less than or equal to a given value. That can be done by summing T(l,n) values. E.g. the number of sequences in groups with order <= 1001010 binary is equal to
T(7,1) + # for 1000000
2^2 * T(4,2) + # for 001000
2^2 * 3 * T(2,3) # for 010
This will give a mapping but the direct implementation for combining the key parts is >O(1) at best. On the other hand, the Steps portion of the key is small and by computing the range of Groups for each Steps value, a lookup table can reduce this to O(1).
I'm not 100% sure about the formula above, but it should be something like it.
With these remarks and the recurrence it is possible to write the functions sequence -> index and index -> sequence. But it is not so trivial :-)
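As a rough illustration, here is a Python sketch of the two functions. Note that it does not build the <Steps, Group> key described above. Instead it ranks sequences in plain lexicographic order, which is the order used in the question's example table, and it relies on the same kind of recurrence as T(l,n), with the base case that zero remaining places can be completed in exactly one way. The function names are made up, and the input is assumed to be a valid sequence.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def completions(remaining, current_max):
    """Ways to fill `remaining` more places when the running maximum is `current_max`."""
    if remaining == 0:
        return 1
    # current_max non-step choices keep the maximum; one step choice raises it.
    return (current_max * completions(remaining - 1, current_max)
            + completions(remaining - 1, current_max + 1))


def sequence_to_index(seq):
    """Lexicographic rank of a valid sequence (A_0 = 1) among sequences of its length."""
    index, current_max = 0, 1
    for pos in range(1, len(seq)):
        remaining = len(seq) - 1 - pos
        # Every smaller value at this place is a non-step value.
        index += (seq[pos] - 1) * completions(remaining, current_max)
        if seq[pos] == current_max + 1:
            current_max += 1
    return index


def index_to_sequence(index, length):
    """Inverse of sequence_to_index for sequences of the given length."""
    seq, current_max = [1], 1
    for pos in range(1, length):
        remaining = length - 1 - pos
        block = completions(remaining, current_max)
        if index < current_max * block:
            # Non-step: the value lies in 1..current_max.
            value, index = index // block + 1, index % block
        else:
            # Step: the value is current_max + 1.
            index -= current_max * block
            value = current_max + 1
            current_max = value
        seq.append(value)
    return seq


# Reproduce the length-3 table from the question.
for i in range(completions(2, 1)):
    print(index_to_sequence(i, 3), i)
```

With the length known in advance and below 12, `completions` only ever sees a few dozen distinct arguments, so the cached table stays tiny and each conversion takes a number of steps proportional to the sequence length.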
I think a hash without sorting should be the thing.
As A_0 is always 1, maybe we can think of the sequence as a number in base 12 and use its base 10 value as the key for lookup. (Still not sure about this.)
This is a Python function which can do the job for you, assuming you have these values stored in a file and you pass the lines to the function:
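def valid_lines(lines):
    for line in lines:
        # parse a line such as "1,2,1" into [1, 2, 1]
        seq = [int(x) for x in line.strip().split(",")]
        # rule: A_0 = 1, A_n >= 1, A_n <= max(A_0 .. A_n-1) + 1
        if seq[0] == 1 and all(1 <= seq[n] <= max(seq[:n]) + 1
                               for n in range(1, len(seq))):
            yield seq

with open('/tmp/numbers.txt') as lines:
    for valid_line in valid_lines(lines):
        print(valid_line)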
Given the sequence, I would sort it, then use the hash of the sorted sequence as the index of the table.
