Generating means of a variable using dummy variables & foreach in Stata - foreach

My dataset includes TWO main variables X and Y.
Variable X represents distinct codes (e.g. 001X01, 001X02, etc) for multiple computer items with different brands.
Variable Y represents the tax charged for each code of variable X (e.g. 15 = 15% for 001X01) at a store.
I've created categories for these computer items using dummy variables (e.g. HD dummy variable for Hard-Drives, takes value of 1 when variable X represents a HD, etc). I have a list of over 40 variables (two of them representing X and Y, and the rest is a bunch of dummy variables for the different categories I've created for computer items).
I would like to display the averages of all these categories using a loop in Stata, but I'm not sure how to do this.
For example the code:
mean Y if HD == 1
Mean estimation Number of obs = 5
--------------------------------------------------------------
| Mean Std. Err. [95% Conf. Interval]
-------------+------------------------------------------------
Tax | 7.1 2.537716 1.154172 15.24583
gives me the mean Tax for the category representing Hard Drives. How can I use a loop in Stata to automatically display all the mean Taxes charged for each category? I would do it by hand without a problem, but I want to repeat this process for multiple years, so I would like to use a loop for each year in order to come up with this output.
My goal is to create a separate Excel file with each of the computer categories I've created (38 total) and the average tax for each category by year.

Why bother with the loop and creating the indicator variables? If I understand correctly, your initial dataset allows the use of a simple collapse:
clear all
set more off
input ///
code tax str10 categ
1 0.15 "hd"
2 0.25 "pend"
3 0.23 "mouse"
4 0.29 "pend"
5 0.16 "pend"
6 0.50 "hd"
7 0.54 "monitor"
8 0.22 "monitor"
9 0.21 "mouse"
10 0.76 "mouse"
end
list
collapse (mean) tax, by(categ)
list
To take to Excel you can try export excel or put excel.
Run help collapse and help export for details.
Edit
Because you insist, below is an example that gives the same result using loops.
I assume the same data input as before. Some testing using this example database
with expand 1000000, shows that speed is virtually the same. But almost surely,
you (including your future you) and your readers will prefer collapse.
It is much clearer, cleaner and concise. It is even prettier.
levelsof categ, local(parts)
gen mtax = .
quietly {
foreach part of local parts {
summarize tax if categ == "`part'", meanonly
replace mtax = r(mean) if categ == "`part'"
}
}
bysort categ: keep if _n == 1
keep categ mtax
Stata has features that make it quite different from other languages. Once you
start getting a hold of it, you will find that many things done with loops elsewhere,
can be made loop-less in Stata. In many cases, the latter style will be preferred.
See corresponding help files using help <command> and if you are not familiarized with saved results (e.g. r(mean)), type help return.

A supplement to Roberto's excellent answer: After collapse, you will need a loop to export the results to excel.
levelsof categ, local(levels)
foreach x of local levels {
export excel `x', replace
}
I prefer to use numerical codes for variables such as your category variable. I then assign them value labels. Here's a version of Roberto's code which does this and which, for closer correspondence to your problem, adds a "year" variable
input code tax categ year
1 0.15 1 1999
2 0.25 2 2000
3 0.23 3 2013
4 0.29 1 2010
5 0.16 2 2000
6 0.50 1 2011
7 0.54 4 2000
8 0.22 4 2003
9 0.21 3 2004
10 0.76 3 2005
end
#delim ;
label define catl
1 hd
2 pend
3 mouse
4 monitor
;
#delim cr
label values categ catl
collapse (mean) tax, by(categ year)
levelsof categ, local(levels)
foreach x of local levels {
export excel `:label (categ) `x'', replace
}
The #delim ; command makes it possible to easily list each code on a separate line. The"label" function in the export statement is an extended macro function to insert a value label into the file name.

Related

Loop to select which variable to omit from analysis

I have datasets with a large number of variables and I need to run PCA over these datasets with one variable removed each time. Below are 20 variables for an example dataset. I would like to run PCA with one variable removed from each PCA solution. For example, the first PCA solution will include all variables excluding Var_1_GroupA, the second will include all variables excluding Var_2_GroupA, etc. I am familiar with using macros to write loops but unsure how to complete the following task using macros or code in python.
Var_1_GroupA
Var_2_GroupA
Var_1_GroupB
Var_2_GroupB
Var_3_GroupB
Var_1_GroupC
Var_2_GroupC
Var_3_GroupC
Var_4_GroupC
Var_5_GroupC
Var_1_GroupD
Var_1_GroupE
new_Var_1_GroupA
new_Var_1_GroupB
new_Var_1_GroupC
new_Var_2_GroupC
Var_1_GroupF
Var_1_GroupG
Var_1_GroupH
Var_2_GroupH
In the example below I create 10 variables, and then run a simple means command with a different set of variables each time - excluding one of the variables at a time. You can edit the code to match your variables and your analysis code.
data list list/var1 to var10 (10F1).
begin data
1 2 3 4 5 6 7 8 9 9
5 4 3 6 3 8 1 2 5 8
0 8 6 4 2 1 3 5 7 9
end data.
dataset name wrk.
define !loopit (!pos=!cmdend)
!do !a !in(!1)
means
!do !b !in(!1) !if (!b<>!a) !then !b !ifend !doend
.
!doend
!enddefine.
!loopit var1 var2 var3 var4 var5 var6 var7 var8 var9 var10 .
note you vave to list the variable names in the macro call, can't use var1 to var10.
If you run into trouble while adapting this to your exact needs, these are very helpful in debugging macros:
set mexpand=on.
set mprint=on.

Selecting a cut-off score in SPSS

I have 5 variables for one questionnaire about social support. I want to define the group with low vs. high support. According to the authors low support is defined as a sum score <= 18 AND two items scoring <= 3.
It would be great to get a dummy variable which shows which people are low vs high in support.
How can I do this in the syntax?
Thanks ;)
Assuming your variables are named Var1, Var2 .... Var5, and that they are consecutive in the dataset, this should work:
recode Var1 to Var5 (1 2 3=1)(4 thr hi=0) into L1 to L5.
compute LowSupport = sum(Var1 to Var5) <= 18 and sum(L1 to L5)>=2.
execute.
New variable LowSupport will have value 1 for rows that have the parameters you defined and 0 for other rows.
Note: If your variables are not consecutive you'll have to list all of them instead of using Var1 to var5.

How to reference a particular row for an existing variable in SPSS syntax?

I have 2 variables, one for raw p-values and another for adjusted p-values. I need to compute a new variable based on the values of these two variables. What I need to do isn't too complicated, but I have a hard time doing it in SPSS because I can't figure out how I can reference a particular row for an existing variable in SPSS syntax.
The first column lists raw p-values in ascending order. The next column lists adjusted p-values, but these adjusted p-values are still incomplete. I need to compare two adjacent p-values in the adjusted p-values column (e.g., row 1 and 2, row 2 and 3, row 3 and 4, and so forth), and take the p-values whichever is smaller in each of these comparisons and enter those p-values into the following column as values for a new variable.
However, that's not the end of the story. One more condition has to be met. That is, the new p-values have to be in the same order as the raw p-values. However, I cannot ensure this if I start the comparisons from the top row. You can see that (i') is greater than (h') and (g'), and (d') is greater than (c'), (b'), and (a') in the example below (picture).
In order to solve this issue, I would need to start the comparison of the adjusted p-values from the bottom. In addition, I would need to compare the adjusted p-values to the new p-values of one row below. One exception is that I can simply use the value of (a) as the value of (a') since the value of (a) should always be the greatest of all the p-values as a rule. Then, for (b') , I need to compare (b) and (a') and enter whichever is smaller as (b'). For (c'), I need to compare (c) and (b') and enter whichever is smaller as (c'), and so forth. By doing this way, (d') would be 0.911 and (i') would be 0.017.
Sorry for this long post, but I would really appreciate if I can get some help to do this task in SPSS.
Thank you in advance for your help.
Raw p-values | Adjusted p-values (Temporal)| New p-values (Final)
-------------|-----------------------------|---------------------
0.002 | 0.030 (i) | 0.025 (i')
0.003 | 0.025 (h) | 0.017 (h')
0.004 | 0.017 (g) | 0.017 (g')
0.005 | 0.028 (f) | 0.028 (f')
0.023 | 0.068 (e) | 0.068 (e')
0.450 | 1.061 (d) | 1.061 (d')
0.544 | 1.145 (c) | 0.911 (c')
0.850 | 0.911 (b) | 0.911 (b')
0.974 | 0.974 (a) | 0.974 (a')
Another tool that may be convenient is the SHIFT VALUES command. It can move one or more columns of data either forward or backward.
I wonder whether the purpose of this has to do with adjusting p values for multiple testing corrections as with Benjamin-Hochberg FDR or others similar. If that is the case, you might find the STATS PADJUST (Analyze > Descriptives > Calculate adjusted p values) extension command useful. It offers six adjustment methods. You can install it from the Utilities (pre-V24) or Extensions (V24+) menu.
To get you started, here are a few tools that can help you with this task:
The LAG function
you can compare values in this line and the previous one, for example, the following will compare the Pval in each line to the one in the previous one, and put the smaller of the two in the NewPval:
compute NewPVal=min(Pval, lag(Pval)).
If you want to do the same process only start from the bottom, you can easily sort your data in reverse order and do the same.
CREATE + LEAD
if you want to make comparisons to the next line instead of the previous line, you should first create a "lead" variable and then compare to it.
for example, the following syntax will create a new variable that for each line contains the value of Pval in the next line, and then chooses the smaller of the two for the NewPval:
create /LeadPval=LEAD(Pval 1).
compute NewPVal=min(Pval, LeadPval).
Using case numbers
You can use case numbers (line numbers) in calculations and in conditions. For example, the following syntax will let you make different calculations in the first line and the following ones:
if $casenum=1 NewPval=Pval.
if $casenum>1 NewPVal=min(Pval, lag(Pval)).

Possible to use less/greater than operators with IF ANY?

Is it possible to use <,> operators with the if any function? Something like this:
select if (any(>10,Q1) AND any(<2,Q2 to Q10))
You definitely need to create an auxiliary variable to do this.
#Jignesh Sutar's solution is one that works fine. However there are often multiple ways in SPSS to accomplish a certain task.
Here is another solution where the COUNT command comes in handy.
It is important to note that the following solution assumes that the values of the variables are integers. If you have float values (1.5 for instance) you'll get a wrong result.
* count occurrences where Q2 to Q10 is less then 2.
COUNT #QLT2 = Q2 TO Q10 (LOWEST THRU 1).
* select if Q1>10 and
* there is at least one occurrence where Q2 to Q10 is less then 2.
SELECT (Q1>10 AND #QLT2>0).
There is also a variant for this sort of solution that deals with float variables correctly. But I think it is less intuitive though.
* count occurrences where Q2 to Q10 is 2 or higher.
COUNT #QGE2 = Q2 TO Q10 (2 THRU HIGHEST).
* select if Q1>10 and
* not every occurences of (the 9 variables) Q2 to Q10 is two or higher.
SELECT IF (Q1>10 AND #QGE2<9).
Note: Variables beginning with # are temporary variables. They are not stored in the data set.
I don't think you can (would be nice if you could - you can do something similar in Excel with COUNTIF & SUMIF IIRC).
You've have to construct a new variable which tests the multiple ANY less than condition, as per below example:
input program.
loop #j = 1 to 1000.
compute ID=#j.
vector Q(10).
loop #i = 1 to 10.
compute Q(#i) = trunc(rv.uniform(-20,20)).
end loop.
end case.
end loop.
end file.
end input program.
execute.
vector Q=Q2 to Q10.
loop #i=1 to 9 if Q(#i)<2.
compute #QLT2=1.
end loop if Q(#i)<2.
select if (Q1>10 and #QLT2=1).
exe.

Moving Average across Variables in Stata

I have a panel data set for which I would like to calculate moving averages across years.
Each year is a variable for which there is an observation for each state, and I would like to create a new variable for the average of every three year period.
For example:
P1947=rmean(v1943 v1944 v1945), P1947=rmean(v1944 v1945 v1946)
I figured I should use a foreach loop with the egen command, but I'm not sure about how I should refer to the different variables within the loop.
I'd appreciate any guidance!
This data structure is quite unfit for purpose. Assuming an identifier id you need to reshape, e.g.
reshape long v, i(id) j(year)
tsset id year
Then a moving average is easy. Use tssmooth or just generate, e.g.
gen mave = (L.v + v + F.v)/3
or (better)
gen mave = 0.25 * L.v + 0.5 * v + 0.25 * F.v
More on why your data structure is quite unfit: Not only would calculation of a moving average need a loop (not necessarily involving egen), but you would be creating several new extra variables. Using those in any subsequent analysis would be somewhere between awkward and impossible.
EDIT I'll give a sample loop, while not moving from my stance that it is poor technique. I don't see a reason behind your naming convention whereby P1947 is a mean for 1943-1945; I assume that's just a typo. Let's suppose that we have data for 1913-2012. For means of 3 years, we lose one year at each end.
forval j = 1914/2011 {
local i = `j' - 1
local k = `j' + 1
gen P`j' = (v`i' + v`j' + v`k') / 3
}
That could be written more concisely, at the expense of a flurry of macros within macros. Using unequal weights is easy, as above. The only reason to use egen is that it doesn't give up if there are missings, which the above will do.
FURTHER EDIT
As a matter of completeness, note that it is easy to handle missings without resorting to egen.
The numerator
(v`i' + v`j' + v`k')
generalises to
(cond(missing(v`i'), 0, v`i') + cond(missing(v`j'), 0, v`j') + cond(missing(v`k'), 0, v`k')
and the denominator
3
generalises to
!missing(v`i') + !missing(v`j') + !missing(v`k')
If all values are missing, this reduces to 0/0, or missing. Otherwise, if any value is missing, we add 0 to the numerator and 0 to the denominator, which is the same as ignoring it. Naturally the code is tolerable as above for averages of 3 years, but either for that case or for averaging over more years, we would replace the lines above by a loop, which is what egen does.
There is a user written program that can do that very easily for you. It is called mvsumm and can be found through findit mvsumm
xtset id time
mvsumm observations, stat(mean) win(t) gen(new_variable) end

Resources