String comparison against factors in Stata

Suppose I have a factor variable with labels "a" "b" and "c" and want to see which observations have a label of "b". Stata refuses to parse
gen isb = myfactor == "b"
Sure, there is literally a "type mismatch", since my factor is encoded as an integer and so cannot be compared to the string "b". However, it wouldn't kill Stata to (i) perform the obvious parse or (ii) provide a translator function so I can write the comparison as label(myfactor) == "b". Using decode to (re)create a string variable defeats the purpose of encoding, which is to save space and make computations more efficient, right?
I hadn't really expected the comparison above to work, but I at least figured there would be a one- or two-line approach. Here is what I have found so far. There is a nice macro ("extended") function that maps the other way (from an integer to a label, seen below as local labi: label ...). Here's the solution using it:
// sample data
clear
input str5 mystr int mynum
a 5
b 5
b 6
c 4
end
encode mystr, gen(myfactor)
// first, how many groups are there?
by myfactor, sort: gen ng = _n == 1
replace ng = sum(ng)
scalar ng = ng[_N]
drop ng
// now, which code corresponds to "b"?
forvalues i = 1/`=ng'{
local labi: label myfactor `i'
if "b" == "`labi'" {
scalar bcode = `i'
continue, break
}
}
di bcode
The second step is what irks me, but I'm sure there's also a faster, more idiomatic way of performing the first step. Can I grab the length of the label vector, for example?

An example:
clear all
set more off
sysuse auto
gen isdom = 1 if foreign == "Domestic":`:value label foreign'
list foreign isdom in 1/60
This creates a variable called isdom that equals 1 if foreign's value label is "Domestic" and is missing otherwise. It uses an extended macro function.
From [U] 18.3.8 Macro expressions:
Also, typing
command that makes reference to `:extended macro function'
is equivalent to
local macroname : extended macro function
command that makes reference to `macroname'
This explains one of the two : in the offered syntax. The other can be explained by
... to specify value labels directly in an expression, rather than through
the underlying numeric value ... You specify the label in double quotes
(""), followed by a colon (:), followed by the name of the value
label.
The quote is from Stata tip 14: Using value labels in expressions, by Kenneth Higbee, The Stata Journal (2004). Freely available at http://www.stata-journal.com/sjpdf.html?articlenum=dm0009
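For readers who think more easily in another language, here is a rough Python sketch of what the "label":labelname lookup does conceptually. The dictionary value_label and the helper code_for are made up for illustration; this is an analogy under those assumptions, not a description of Stata's internals.

```python
# Hypothetical stand-in for a Stata value label: integer code -> text label.
value_label = {1: "a", 2: "b", 3: "c"}

# Encoded variable: stored as small integers, displayed via the label.
myfactor = [1, 2, 2, 3]

def code_for(label, mapping):
    """Return the integer code attached to `label` (reverse lookup)."""
    for code, text in mapping.items():
        if text == label:
            return code
    raise KeyError(label)

# Look the code up once, then compare integers; the data are never decoded.
bcode = code_for("b", value_label)
isb = [int(v == bcode) for v in myfactor]
print(bcode, isb)  # 2 [0, 1, 1, 0]
```

The point of the design is that the string comparison happens once against the label table, not once per observation.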
Edit
On computing the number of distinct observations, another way is:
by myfactor, sort: gen ng = _n == 1
count if ng
scalar sc_ng = r(N)
display sc_ng
But yours is fine. In fact, it is documented here: http://www.stata.com/support/faqs/data-management/number-of-distinct-observations/, along with more methods and comments.


Var in var = var in var + 1

I am still new to Lua and have one question about var in var.
How do I calculate this?
A=1
X=A
X=X+1
As you can see:
This calculation would result in
A=A+1
But this does not work for me.
I guess I have to format the vars in some way.
I want to do this because I want to be able to change a var in another var when necessary.
The = operator does two things:
Evaluate the right-hand side
Assign the result to the variable on the left-hand side
To illustrate, consider this example:
A = 1 -- A is now 1
X = A + A + A -- X is now 3, and A hasn't changed
X = X + 1 -- X is now 4, and A hasn't changed
Now let's look at your original code and write out its meaning in plain language.
A=1 -- Create a variable 'A' and assign it the value of one
X=A -- Create the variable 'X' and assign it the current value of 'A'
X=X+1 -- Change 'X' by assigning it the current value of 'X' plus one
Notice how these comments read like "instructions" to a computer, rather than math equations. Lua (and programming in general) should be interpreted as a set of instructions executed from top to bottom.
However, as Egor Skriptunoff alludes to in earlier comments, tables behave differently. See Programming in Lua - Chapter 2.5 for a more detailed explanation of how tables are different.
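The same distinction can be sketched in Python, whose numbers and lists behave much like Lua's numbers and tables for this purpose. This is an analogy, not Lua code, and the variable names are only illustrative.

```python
# Numbers: assignment copies the value, so changing x leaves a alone.
a = 1
x = a
x = x + 1

# Lists (like Lua tables): assignment copies the reference, so both
# names point at the same object and a change is visible through either.
t = [1]
u = t
u[0] = u[0] + 1

print(a, x)  # 1 2
print(t, u)  # [2] [2]
```

So if you want a change made through one name to show up under another, both names have to refer to the same table, with the value stored inside it.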

Other ways to call/eval dynamic strings in Lua?

I am working with a third party device which has some implementation of Lua, and communicates in BACnet. The documentation is pretty janky, not providing any sort of help for any more advanced programming ideas. It's simply, "This is how you set variables...". So, I am trying to just figure it out, and hoping you all can help.
I need to set a long list of variables to certain values. I have a userdata 'ME', with a bunch of variables named MVXX (e.g. - MV21, MV98, MV56, etc).
(This is all kind of background for BACnet.) Variables in BACnet all have 17 'priorities', i.e., every BACnet variable is actually a sort of list of 17 values, with priority 16 being the default. So, typically, if I were to say ME.MV12 = 23, that would set MV12's priority-16 to the desired value of 23.
However, I need to set priority 17. I can do this in the provided Lua implementation, by saying ME.MV12_PV[17] = 23. I can set any of the priorities I want by indexing that PV. (Corollaries - what is PV? What is the underscore? How do I get to these objects? Or are they just interpreted from Lua to some function in C on the backend?)
All this being said, I need to make that variable name dynamic, so that I can set whichever value I need to set, based on some other code. I have made several attempts.
This tells me the object(MV12_PV[17]) does not exist:
x = 12
ME["MV" .. x .. "_PV[17]"] = 23
But this works fine, setting priority 16 to 23:
x = 12
ME["MV" .. x] = 23
I was trying to attempt some sort of what I think is called an evaluation, or eval. But, this just prints out function followed by some random 8 digit number:
x = 12
test = assert(loadstring("MV" .. x .. "_PV[17] = 23"))
print(test)
Any help? Apologies if I am unclear - tbh, I am so far behind the 8-ball I am pretty much grabbing at straws.
Underscores can be part of Lua identifiers (variable and function names). They are just part of the variable name (like letters are) and aren't a special Lua operator like [ and ] are.
In the expression ME.MV12_PV[17] we have ME being an object with a bunch of fields, ME.MV12_PV being an array stored in the "MV12_PV" field of that object and ME.MV12_PV[17] is the 17th slot in that array.
If you want to access fields dynamically, the thing to know is that accessing a field with dot notation in Lua is equivalent to using bracket notation and passing in the field name as a string:
-- The following are all equivalent:
x.foo
x["foo"]
local fieldname = "foo"
x[fieldname]
So in your case you might want to try doing something like this:
local n = 12
ME["MV"..n.."_PV"][17] = 23
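To see why the first attempt failed and why this one should work, here is a small Python analogy. The dictionary me merely imitates the device's ME object and is an assumption; the real object is userdata implemented by the vendor and may behave differently.

```python
# Hypothetical stand-in for ME: each field name maps to a priority array.
# Index 0 is unused so that slots 1..17 line up with the Lua indices.
me = {"MV12_PV": [None] * 18}

n = 12

# Wrong: the brackets become part of the field name, so no such key exists.
bad_key = "MV" + str(n) + "_PV[17]"
print(bad_key in me)  # False

# Right: build only the field name, then index the result separately.
me["MV" + str(n) + "_PV"][17] = 23
print(me["MV12_PV"][17])  # 23
```

The string you build should stop at the field name; the [17] is a second, separate indexing step applied to whatever that field returns.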
BACnet "Commandable" Objects (e.g. Binary Output, Analog Output, and optionally Binary Value, Analog Value and a handful of others) actually have 16 priorities (1-16). The "17th" you are referring to may be the "Relinquish Default", a value that is used if all 16 priorities are set to NULL or "Relinquished".
Perhaps your system will allow you to write to a BACnet Property called "Relinquish Default".

Stata: perform a foreach loop to calculate kappa across a large data file

I have a data file in Stata with 50 variables
j-r-hp j-p-hp j-m-hp p-c-hp p-r-hp p-p-hp p-m-hp ... etc,
I want to perform a weighted kappa between pairs, so that the first might be
kap j-r-hp j-p-hp, wgt(w2)
and the next would be
kap j-r-hp j-m-hp, wgt(w2)
I am new to Stata. Is there a straightforward way to use a loop for this, like a foreach loop?
Your variable names are not legal names in Stata, so I've changed the hyphens to underscores in the example below. Also, I don't know what it means to 'perform a weighted kappa', so my answer uses random normal variables and the corr[elate] command. You can use the results that Stata leaves behind in r() (see return list) to gather the results for the separate analyses.
The idea is to gather the variables in a list using a local, then to loop over each element in that list (but skipping the repeated pairs using continue). If you have many variables with structured names, you could instead use ds, which leaves the list of names in r(varlist). Have a look at the help file for macros (help macro and help extended_fcn), especially the section on 'Macro extended functions for parsing'. Hope this helps.
clear
set obs 100
local vars j_r_hp j_p_hp j_m_hp p_c_hp p_r_hp p_p_hp p_m_hp
foreach var of local vars {
gen `var'=rnormal()
}
forval ii=1/`: word count `vars'' {
forval jj=1/`: word count `vars'' {
if `ii'<`jj' continue
corr `: word `ii' of `vars'' `: word `jj' of `vars''
}
}
You can take advantage of the user-written command tuples (run ssc install tuples):
clear
set more off
*----- example data -----
set obs 100
local vars j_r_hp j_p_hp j_m_hp p_c_hp p_r_hp p_p_hp p_m_hp
foreach var of local vars {
gen `var' = abs(round(rnormal()*100))
}
*----- what you want -----
tuples `vars', min(2) max(2)
forvalues i = 1/`ntuples' {
display _newline(3) "variables `tuple`i''"
kappa `tuple`i''
}
How you get the variables names together to feed them into tuples will depend on the dataset.
This is a variation on the helpful answer by @Matthijs, but it really won't fit well into a comment. The main extra twists are:
The use of tokenize to avoid repeated use of word # of. After tokenize the separate words of the argument (here separate variable names) are held in macros 1 up. Thus tokenize a b c puts a in local macro 1, b in local macro 2 and c in local macro 3. Nested macro references are treated exactly like parenthesised expressions in elementary algebra; what is on the inside is evaluated first.
Focusing directly on part of the notional matrix of results on one side of the diagonal. The small trick is to ensure that one matrix subscript exceeds the other subscript.
Random normal input doesn't make sense for kap, but you will be using your own data anyway.
clear
set obs 100
local vars j_r_hp j_p_hp j_m_hp p_c_hp p_r_hp p_p_hp p_m_hp
foreach var of local vars {
gen `var' = rnormal()
}
tokenize `vars'
local p : word count `vars'
local pm1 = `p' - 1
forval i = 1/`pm1' {
local ip1 = `i' + 1
forval j = `ip1'/`p' {
di "``i'' and ``j''"
kap ``i'' ``j''
di
}
}
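For comparison, the index pattern in the loops above (i from 1 to p-1, j from i+1 to p) can be checked in Python against itertools.combinations. This is only a sketch to show that the pattern yields each unordered pair exactly once; the names are the renamed variables used in this thread.

```python
from itertools import combinations

names = ["j_r_hp", "j_p_hp", "j_m_hp", "p_c_hp", "p_r_hp", "p_p_hp", "p_m_hp"]
p = len(names)

# Same pattern as the nested forval loops: j always exceeds i,
# so each unordered pair appears exactly once and never (x, x).
pairs = [(names[i], names[j]) for i in range(p - 1) for j in range(i + 1, p)]

# The standard library produces the same pairs in the same order.
assert pairs == list(combinations(names, 2))
print(len(pairs))  # 21, i.e. 7 * 6 / 2
```

With 50 variables the same pattern gives 50 * 49 / 2 = 1225 pairs, which is worth knowing before you run kap on all of them.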
I thought I might add my own answer in addition to highlight a few things.
The first thing to note is that for a new user, the most "straightforward" way to do it would likely involve hard-coding all variables into a local to use in a loop (as other answers suggest), or referencing them using a wildcard and writing more than one loop for each group. See the example below on how you might use a wildcard:
clear *
sysuse auto
/* Rename variables to match your .dta file and identify groups */
rename (price mpg rep78) (j_r_hp j_p_hp j_m_hp)
rename (headroom trunk weight) (p_c_hp p_r_hp p_m_hp)
rename (length turn displacement foreign) (z_r_hp z_m_hp z_p_hp z_c_hp)
/* Loop over all variables beginning with j and ending hp */
foreach x of varlist j*hp {
foreach i of varlist j*hp {
if "`x'" != "`i'" & "`i'" >= "`x'"{ // This section ensures you get only
// unique pairs of x & i
kap `x' `i'
}
}
}
/* Loop over all variables beginning with p and ending hp */
foreach x of varlist p*hp {
* something involving x
}
* etc.
Now, depending on how many groups you have or how many variables you have, this might not seem straightforward after all.
This brings up the second thing I would like to mention. In cases where hard-coding many variables or many repeated commands becomes cumbersome, I tend to favor a programmatic solution. This will often involve writing more code up front, but in many cases tends to be at least quasi-generalizable, and will allow you to easily evaluate hundreds of variables if you ever have the need without having to write them all out.
The code below uses the returned results from describe, along with some foreach loops and some extended macro functions to execute the kappa command over your variables without having to store them in a local manually.
clear *
sysuse auto
rename (price mpg rep78) (j_r_hp j_p_hp j_m_hp)
rename (headroom trunk weight) (p_c_hp p_r_hp p_m_hp)
rename (length turn displacement foreign) (z_r_hp z_m_hp z_p_hp z_c_hp)
/*
use gear_ratio as an arbitrary weight, order it first to easily extract
from the local containing varlist
*/
order gear_ratio, first
qui describe, varlist
local Varlist `r(varlist)' // store varlist in a local macro
preserve // preserve data so changes can be reverted
foreach x of local Varlist {
capture confirm numeric variable `x'
if _rc {
drop `x' // Keep only numeric variables to use in kappa
}
}
qui describe, varlist // replace the local macro varlist with now numeric only variables
local Varlist `r(varlist)'
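local weight gear_ratio // the list operator expects macro names, so store the weight variable's name first
local vars : list Varlist - weight // remove weight from analysis varlist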
foreach x of local vars {
foreach i of local vars {
if "`x'" != "`i'" & "`i'" >= "`x'" {
gettoken leftx : x, parse("_")
gettoken lefti : i, parse("_")
if "`leftx'" == "`lefti'" {
kap `x' `i'
}
}
}
}
restore
There of course will be a learning curve here for new users but I've found the use of macros, loops and returned results to be wonderfully effective in adding flexibility to my programs and do files - I would highly suggest anybody using Stata at least studies the basics of these three topics.

How to refactor string containing variable names into booleans?

I have an SPSS variable containing lines like:
|2|3|4|5|6|7|8|10|11|12|13|14|15|16|18|20|21|22|23|24|25|26|27|28|29|
Every line starts with a pipe and ends with one. I need to refactor it into boolean variables like the following:
var var1 var2 var3 var4 var5
|2|4|5| 0 1 0 1 1
I have tried to do it with a loop like:
loop # = 1 to 72.
compute var# = SUBSTR(var,2#,1).
end loop.
exe.
My code won't work with numbers of two or more digits, and it won't place the values into their respective variables. I've also tried nesting char.substr(var,char.rindex(var,'|') + 1) in another loop, with no luck, because it still won't let me recognize the variable number.
How can I do it?
This looks like a nice job for the DO REPEAT command. However the type conversion is somewhat tricky:
DO REPEAT var#i=var1 TO var72
/i=1 TO 72.
COMPUTE var#i = (CHAR.INDEX(var,CONCAT("|",LTRIM(STRING(i,F2.0)),"|"))>0).
END REPEAT.
Explanation: Let's go from the inside to the outside:
STRING(value,F2.0) converts the numeric value into a string of two characters (with a leading white space where the number consists of just one digit), e.g. 2 -> " 2".
LTRIM() removes the leading whitespaces, e.g. " 2" -> "2".
CONCAT() concatenates strings. In the above code it adds the "|" before and after the number, e.g. "2" -> "|2|"
CHAR.INDEX(stringvar,searchstring) returns the position at which the searchstring was found. It returns 0 if the searchstring wasn't found.
CHAR.INDEX(stringvar,searchstring)>0 returns a boolean value indicating if the searchstring was found or not.
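The same logic, written out in Python purely to illustrate the steps above (the function name flags is made up here; this is not SPSS syntax):

```python
def flags(text, n_vars):
    """Return a 0/1 list; position i-1 is 1 if "|i|" occurs in text."""
    return [int("|%d|" % i in text) for i in range(1, n_vars + 1)]

print(flags("|2|4|5|", 5))    # [0, 1, 0, 1, 1]
print(flags("|12|", 12)[:2])  # [0, 0]
```

Wrapping the number in pipes on both sides is what makes multi-digit values safe: "|1|" and "|2|" do not match inside "|12|".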
It's easier to do the manipulations in Python than native SPSS syntax.
You can use SPSSINC TRANS extension for this purpose.
/* Example data*/.
data list free / TextStr (a99).
begin data.
"|2|3|4|5|6|7|8|10|11|12|13|14|15|16|18|20|21|22|23|24|25|26|27|28|29|"
end data.
/* defining function to achieve task */.
begin program.
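def runTask(x):
    # Collect the numbers between the pipes, skipping empty pieces
    numbers = [int(i) for i in x.strip().split("|") if i.strip()]
    # 1 if the position occurs in the string, else 0 (runs under Python 2 and 3)
    answer = [1 if i in numbers else 0 for i in range(1, max(numbers) + 1)]
    return answer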
end program.
/* Run job*/.
spssinc trans result = V1 to V30 type=0 /formula "runTask(TextStr)".
exe.

Lua base converter

I need a base converter function for Lua. I need to convert from base 10 to base 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, ..., 36. How can I do this?
In the string to number direction, the function tonumber() takes an optional second argument that specifies the base to use, which may range from 2 to 36 with the obvious meaning for digits in bases greater than 10.
In the number to string direction, this can be done slightly more efficiently than Nikolaus's answer by something like this:
local floor,insert = math.floor, table.insert
function basen(n,b)
n = floor(n)
if not b or b == 10 then return tostring(n) end
local digits = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"
local t = {}
local sign = ""
if n < 0 then
sign = "-"
n = -n
end
repeat
local d = (n % b) + 1
n = floor(n / b)
insert(t, 1, digits:sub(d,d))
until n == 0
return sign .. table.concat(t,"")
end
This creates fewer garbage strings to collect by using table.concat() instead of repeated calls to the string concatenation operator (..). Although it makes little practical difference for strings this small, this idiom should be learned because otherwise building a buffer in a loop with the concatenation operator will actually tend to O(n^2) performance while table.concat() has been designed to do substantially better.
There is an unanswered question as to whether it is more efficient to push the digits on a stack in the table t with calls to table.insert(t,1,digit), or to append them to the end with t[#t+1]=digit, followed by a call to string.reverse() to put the digits in the right order. I'll leave the benchmarking to the student. Note that although the code I pasted here does run and appears to get correct answers, there may be other opportunities to tune it further.
For example, the common case of base 10 is culled off and handled with the built in tostring() function. But similar culls can be done for bases 8 and 16 which have conversion specifiers for string.format() ("%o" and "%x", respectively).
Also, neither Nikolaus's solution nor mine handle non-integers particularly well. I emphasize that here by forcing the value n to an integer with math.floor() at the beginning.
Correctly converting a general floating point value to any base (even base 10) is fraught with subtleties, which I leave as an exercise to the reader.
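For comparison, here is a minimal Python sketch of the same idea: collect the digits in a list and join once at the end, which is the Python counterpart of the table.concat() idiom. The names DIGITS and basen simply mirror the Lua code above; integers only, and it appends then reverses instead of inserting at the front.

```python
DIGITS = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"

def basen(n, b=10):
    """Format integer n in base b (2..36), collecting digits in a list."""
    if not 2 <= b <= 36:
        raise ValueError("base must be between 2 and 36")
    sign = "-" if n < 0 else ""
    n = abs(n)
    out = []
    while True:
        n, d = divmod(n, b)
        out.append(DIGITS[d])  # least significant digit first
        if n == 0:
            break
    # Reverse once and join once instead of concatenating in the loop.
    return sign + "".join(reversed(out))

print(basen(1234, 8))            # 2322
print(basen(255, 16))            # FF
print(basen(-5, 2))              # -101
print(int(basen(1234, 36), 36))  # 1234
```

The last line round-trips through int(s, base), which plays the role that tonumber(s, base) plays in Lua for the string-to-number direction.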
You can use a loop to convert an integer into a string in the required base. For bases below 10, use the following code; if you need a larger base, add a line that maps the result of x % base to a character (using an array, for example).
x = 1234
r = ""
base = 8
while x > 0 do
r = "" .. (x % base ) .. r
x = math.floor(x / base)
end
print( r );
