I have a variable named A in SPSS database.
A
--
102102
23453212
142378
2367890654
2345
45
I want to split this variable by 2 lengths and create multiple variables as follows.
A_1 A_2 A_3 A_4 A_5
--- --- --- --- ---
10 21 02
23 45 32 12
14 23 78
23 67 89 06 54
23 45
45
Can anyone write SPSS macro to compute this operation?
Using STRING manipulations (after converting the NUMERIC field to STRING, if necessary), specifically SUBSTR you can extract out pairs of digits as you wish.
/* Simulate data */.
data list list / x (f8.0).
begin data.
102102
23453212
142378
2367890654
2345
45
end data.
dataset name dsSim.
If you have a known maximum value, in your example a value of 10 digits long then you'll need 5 variables to store the pairs of digits, which the follow does:
preserve.
set mxwarns 0 /* temporarily supress warning messages */ .
string #xstr (a10).
compute #xstr=ltrim(string(x,f18.0)).
compute A_1=number(substr(#xstr,1,2), f8.0).
compute A_2=number(substr(#xstr,3,2), f8.0).
compute A_3=number(substr(#xstr,5,2), f8.0).
compute A_4=number(substr(#xstr,7,2), f8.0).
compute A_5=number(substr(#xstr,9,2), f8.0).
exe.
restore.
However, you may prefer to code something like this more dynamically (using python) where the code itself would read the maximum value in the data and create as many variables as needed.
begin program.
import spssdata, math
spss.Submit("set mprint on.")
# get maximum value
spss.Submit("""
dataset declare dsAgg.
aggregate outfile=dsAgg /MaxX=max(x).
dataset activate dsAgg.
""")
maxvalue = spssdata.Spssdata().fetchone()[0]
ndigits=math.floor(math.log(maxvalue,10))+1
cmd="""
dataset close dsAgg.
dataset activate dsSim.
preserve.
set mxwarns 0.
string #xstr (a10).
compute #xstr=ltrim(string(x,f18.0)).
"""
for i in range(1,int(math.ceil(ndigits/2))+1):
j=(i-1)*2+1
cmd+="\ncompute B_%(i)s=number(substr(#xstr,%(j)s,2), f8.0)." % locals()
cmd+="\nexe.\nrestore."
spss.Submit(cmd)
spss.Submit("set mprint off.")
end program.
You would need to weigh up the pros on cons of each method to asses which suits you best, for how you anticipate your data to arrive and how you then go onto work with in later. I haven't attempted to wrap either of these up in a macro but that could just as easily be done.
Related
I need to compute 61 new variables using a simple math equation based on 4 sets of 61 existing variables. I know I can write 61 compute statements. Is there a more elegant way of creating these variables? Here's how the 61 statements would look:
COMPUTE score_1 = factor_1 * (a_1 + b_1) + c_1.
...
COMPUTE score_61 = factor_61 * (a_61 + b_61) + c_61.
EXECUTE.
Thanks in advance.
recode accept to and numbers my new variables (recode raw1 to raw61 (1=0) (2=1) into a_1 to a_61.) Can I do the same here?
You can use a do repeat structure
DO REPEAT score=score_1 score_2 ... score_61
/factor = factor_1 factor_2 ... factor_61
/a=a_1 a_2 ... a_61
/b=b_1 b_2 ... b_61
/c=c_1 c_2 ... c_61.
COMPUTE score=factor*(a+b)+c.
END REPEAT.
EXECUTE.
In the fortunate event that your variables are in set order (i.e. - all factors are consecutive, all a are consecutive, etc. you may reference them usingto like this:
/factor = factor_1 TO factor_61
otherwise, you need to enumerate them one by one. Hope this helps
I have a file with a different number of rows for every "unit", and I'd like all the units to have the same number of rows, by adding the right number of empty rows per unit in the data.
For example:
data list list/ unit serial someData.
begin data.
1 1 54
2 1 57
2 2 87
2 3 91
3 1 17
3 2 43
end data.
what i'd like to get to is this:
1 1 54
1 2 .
1 3 .
2 1 57
2 2 87
2 3 91
3 1 17
3 2 43
3 3 .
I've worked with simple workarounds, for example casestovars => varstocases (keeping nulls), or preparing a base file with all the lines with unit names and serials, and then matching it with the data file so I end up with all the lines and all the data.
Could anyone suggest a more direct (\elegant\efficient\simple) approach?
Thanks!
Cartesian product is what you require here.
Using your example data and downloading the Custom Extension Command, you can solve as below:
data list list/ unit serial someData.
begin data.
1 1 54
2 1 57
2 2 87
2 3 91
3 1 17
3 2 43
end data.
DATASET NAME ds0.
DATASET ACTIVATE ds0.
STATS CARTPROD VAR1=unit VAR2=serial /SAVE OUTFILE="C:\Temp\dsCart".
SORT CASES BY unit serial.
MATCH FILES FILE=* /BY unit serial /FIRST=Primary.
SELECT IF Primary.
MATCH FILES FILE=* /FILE=ds0 /BY unit serial /DROP=Primary.
EXE.
I'm not sure how efficient this Custom Extension Command is so you may want to experiment with different flavours of using STATS CARTPROD. An alternative approach would be to create two datasets (left and right) with your unique unit and serial values and then process these through the STATS CARTPROD command.
You already mentioned it: creating a base file with all the lines with unit names and serials, and then matching it with the data file would be a simple approach. I'd like to outline this one here for other readers.
So for the questions example you would create the base data set like this:
INPUT PROGRAM.
LOOP #i = 1 to 3. /* 3 = maximum value of unit.
LOOP # = 1 to 3. /* 3 = maximum value of serial.
COMPUTE unit = #i.
COMPUTE serial = #j.
END CASE.
END LOOP.
END LOOP.
END FILE.
END INPUT PROGRAM.
DATASET NAME base.
EXECUTE.
The data set will look like this.
unit serial
1 1
1 2
1 3
2 1
2 2
2 3
3 1
3 2
3 3
The following match files command will bring the wanted result.
MATCH FILES
/FILE base
/FILE data1
/BY unit serial.
If you want the code be more flexible regarding the maximum value of "unit" and "serial" you can make use of the python extension:
BEGIN PROGRAM.
import spss, spssdata
# list of variable names
variables = ["unit", "serial"]
#fetch variable data
data = spssdata.Spssdata(variables).fetchall()
# get maximum of 'unit' and 'serial'
maxunit = max([int(i[0]) for i in data])
maxserial = max([int(i[1]) for i in data])
# create base data set
spss.Submit('''
INPUT PROGRAM.
LOOP #i = 1 to {maxu}.
LOOP #j = 1 to {maxs}.
COMPUTE unit = #i.
COMPUTE serial = #j.
END CASE.
END LOOP.
END LOOP.
END FILE.
END INPUT PROGRAM.
DATASET NAME base.
EXECUTE.
'''.format(maxu=maxunit, maxs=maxserial))
END PROGRAM.
My dataset includes TWO main variables X and Y.
Variable X represents distinct codes (e.g. 001X01, 001X02, etc) for multiple computer items with different brands.
Variable Y represents the tax charged for each code of variable X (e.g. 15 = 15% for 001X01) at a store.
I've created categories for these computer items using dummy variables (e.g. HD dummy variable for Hard-Drives, takes value of 1 when variable X represents a HD, etc). I have a list of over 40 variables (two of them representing X and Y, and the rest is a bunch of dummy variables for the different categories I've created for computer items).
I would like to display the averages of all these categories using a loop in Stata, but I'm not sure how to do this.
For example the code:
mean Y if HD == 1
Mean estimation Number of obs = 5
--------------------------------------------------------------
| Mean Std. Err. [95% Conf. Interval]
-------------+------------------------------------------------
Tax | 7.1 2.537716 1.154172 15.24583
gives me the mean Tax for the category representing Hard Drives. How can I use a loop in Stata to automatically display all the mean Taxes charged for each category? I would do it by hand without a problem, but I want to repeat this process for multiple years, so I would like to use a loop for each year in order to come up with this output.
My goal is to create a separate Excel file with each of the computer categories I've created (38 total) and the average tax for each category by year.
Why bother with the loop and creating the indicator variables? If I understand correctly, your initial dataset allows the use of a simple collapse:
clear all
set more off
input ///
code tax str10 categ
1 0.15 "hd"
2 0.25 "pend"
3 0.23 "mouse"
4 0.29 "pend"
5 0.16 "pend"
6 0.50 "hd"
7 0.54 "monitor"
8 0.22 "monitor"
9 0.21 "mouse"
10 0.76 "mouse"
end
list
collapse (mean) tax, by(categ)
list
To take to Excel you can try export excel or put excel.
Run help collapse and help export for details.
Edit
Because you insist, below is an example that gives the same result using loops.
I assume the same data input as before. Some testing using this example database
with expand 1000000, shows that speed is virtually the same. But almost surely,
you (including your future you) and your readers will prefer collapse.
It is much clearer, cleaner and concise. It is even prettier.
levelsof categ, local(parts)
gen mtax = .
quietly {
foreach part of local parts {
summarize tax if categ == "`part'", meanonly
replace mtax = r(mean) if categ == "`part'"
}
}
bysort categ: keep if _n == 1
keep categ mtax
Stata has features that make it quite different from other languages. Once you
start getting a hold of it, you will find that many things done with loops elsewhere,
can be made loop-less in Stata. In many cases, the latter style will be preferred.
See corresponding help files using help <command> and if you are not familiarized with saved results (e.g. r(mean)), type help return.
A supplement to Roberto's excellent answer: After collapse, you will need a loop to export the results to excel.
levelsof categ, local(levels)
foreach x of local levels {
export excel `x', replace
}
I prefer to use numerical codes for variables such as your category variable. I then assign them value labels. Here's a version of Roberto's code which does this and which, for closer correspondence to your problem, adds a "year" variable
input code tax categ year
1 0.15 1 1999
2 0.25 2 2000
3 0.23 3 2013
4 0.29 1 2010
5 0.16 2 2000
6 0.50 1 2011
7 0.54 4 2000
8 0.22 4 2003
9 0.21 3 2004
10 0.76 3 2005
end
#delim ;
label define catl
1 hd
2 pend
3 mouse
4 monitor
;
#delim cr
label values categ catl
collapse (mean) tax, by(categ year)
levelsof categ, local(levels)
foreach x of local levels {
export excel `:label (categ) `x'', replace
}
The #delim ; command makes it possible to easily list each code on a separate line. The"label" function in the export statement is an extended macro function to insert a value label into the file name.
In my SPSS Syntax Script I compute a bunch of formulas for each cases.
Let' say this is my data:
id value
1 34
2 12
3 94
I now compute a new variable where I need the number of cases in the file (number of ids)
So
COMPUTE newvar = value/ NUMBER OF CASES
in this example NUMBER OF CASES would be 3.
Is there a command for this? thx
You can use the AGGREGATE command without a break variable to return the number of cases in the dataset. Example below:
DATA LIST FREE / ID Value.
BEGIN DATA
1 34
2 12
3 94
END DATA.
AGGREGATE OUTFILE=* MODE=ADDVARIABLES
/BREAK
/NumberOfCases=N.
COMPUTE NewVar = Value/NumberOfCases.
I'm trying to solve some huffman coding problems, but I always get different values for the codewords (values not lengths).
for example, if the codeword of character 'c' was 100, in my solution it is 101.
Here is an example:
Character Frequency codeword my solution
A 22 00 10
B 12 100 010
C 24 01 11
D 6 1010 0110
E 27 11 00
F 9 1011 0111
Both solutions have the same length for codewords, and there is no codeword that is prefix of another codeword.
Does this make my solution valid ? or it has to be only 2 solutions, the optimal one and flipping the bits of the optimal one ?
There are 96 possible ways to assign the 0's and 1's to that set of lengths, and all would be perfectly valid, optimal, prefix codes. You have shown two of them.
There exist conventions to define "canonical" Huffman codes which resolve the ambiguity. The value of defining canonical codes is in the transmission of the code from the compressor to the decompressor. As long as both sides know and agree on how to unambiguously assign the 0's and 1's, then only the code length for each symbol needs to be transmitted -- not the codes themselves.
The deflate format starts with zero for the shortest code, and increments up. Within each code length, the codes are ordered by the symbol values, i.e. sorting by symbol. So for your code that canonical Huffman code would be:
A - 00
C - 01
E - 10
B - 110
D - 1110
F - 1111
So there the two bit codes are assigned in the symbol order A, C, E, and similarly, the four bit codes are assigned in the order D, F. Shorter codes are assigned before longer codes.
There is a different and interesting ambiguity that arises in finding the code lengths. Depending on the order of combination of equal frequency nodes, i.e. when you have a choice of more than two lowest frequency nodes, you can actually end up with different sets of code lengths that are exactly equally optimal. Even though the code lengths are different, when you multiply the lengths by the frequencies and add them up, you get exactly the same number of bits for the two different codes.
There again, the different codes are all optimal and equally valid. There are ways to resolve that ambiguity as well at the time the nodes to combine are chosen, where the benefit can be minimizing the depth of the tree. That can reduce the table size for table-driven Huffman decoding.
For example, consider the frequencies A: 2, B: 2, C: 1, D: 1. You first combine C and D to get 2. Then you have A, B, and C+D all with frequency 2. Now you can choose to combine either A and B, or C+D with A or B. This gives two different sets of bit lengths. If you combine A and B, you get lengths: A-2, B-2, C-2, and D-2. If you combine C+D with B, you get A-1, B-2, C-3, D-3. Both are optimal codes, since 2x2 + 2x2 + 1x2 + 1x2 = 2x1 + 2x2 + 1x3 + 1x3 = 12, so both codes use 12 bits to represent those symbols that many times.
The problem is, that there is no problem.
You huffman tree is valid, it also gives the exactly same results after encoding and decoding. Just think if you would build a huffman tree by hand, there are always more ways to combine items with equal (or least difference) value. E.g. if you have A B C (everyone frequency 1), you can at first combine A and B, and the result with C, or at first B and C, and the result with a.
You see, there are more correct ways.
Edit: Even with only one possible way to combine the items by frequency, you can get different results because you can assign 1 for the left or for the right branch, so you would get different (correct) results.