Modify values programmatically SPSS - spss

I have a file with more than 250 variables and more than 100 cases. Some of these variables have an error in decimal dot (20445.12 should be 2.044512).
I want to modify programatically these data, I found a possible way in a Visual Basic editor provided by SPSS (I show you a screen shot below), but I have an absolute lack of knowledge.
How can I select a range of cells in this language?
How can I store the cell once modified its data?
--- EDITED NEW DATA ----
Thank you for your fast reply.
The problem now its the number of digits that number has. For example, error data could have the following format:
Case A) 43998 (five digits) ---> 4.3998 as correct value.
Case B) 4399 (four digits) ---> 4.3990 as correct value, but parsed as 0.4399 because 0 has been removed when file was created.
Is there any way, like:
IF (NUM < 10000) THEN NUM = NUM / 1000 ELSE NUM = NUM / 10000
Or something like IF (Number_of_digits(NUM)) THEN ...
Thank you.

there's no need for VB script, go this way:
open a syntax window, paste the following code:
do repeat vr=var1 var2 var3 var4.
compute vr=vr/10000.
end repeat.
save outfile="filepath\My corrected data.sav".
exe.
Replace var1 var2 var3 var4 with the names of the actual variables you need to change. For variables that are contiguous in the file you may use var1 to var4.
Replace vr=vr/10000 with whatever mathematical calculation you would like to use to correct the data.
Replace "filepath\My corrected data.sav" with your path and file name.
WARNING: this syntax will change the data in your file. You should make sure to create a backup of your original in addition to saving the corrected data to a new file.

Related

Splitting a string variable delimited list into individual binary variables in SPSS

I have a string variable created from a checkbox questions (Which of the following assets do you own?)
I am trying to create individual binary variables for each type of asset based on whether that number is present in the string list.
The syntax I am using cannot differentiate between 1 and 11.
do repeat wrd="1," ",2," ",3," ",4," ",5," ",6," ",7,"/NewVar= W3_CG_asset_TV_1 W3_CG_asset_radio_2 W3_CG_asset_payTV_3 W3_CG_asset_tel_4 W3_CG_asset_cellphone_5
W3_CG_asset_fridge_6 W3_CG_asset_freezer_7.
compute NewVar=char.index(W3_CG_HouseExpen1, wrd)>0.
end repeat.
do repeat wrd= ",8," ",9," ",10," ",11," ",12," ",13," ",14," ",15," ",16," ",17," ",18," ",19," /NewVar= W3_CG_asset_electricstove_8 W3_CG_asset_primusstove_9
W3_CG_asset_gasstove_10 W3_CG_asset_electrickettle_11 W3_CG_asset_microwave_12 W3_CG_asset_computer_13 W3_CG_asset_electricity_14 W3_CG_asset_geyser_15
W3_CG_asset_washingmachine_16 W3_CG_asset_workingvehicle_17 W3_CG_asset_bicycle_18 W3_CG_asset_donkeyhorse_19.
compute NewVar=char.index(W3_CG_HouseExpen1, wrd)>0.
end repeat.
I have tested this on SPSS 28.
make sure the column W3_CG_HouseExpen1 is string and length of it long enough to hold the data.
Then I added execute
data list list/W3_CG_HouseExpen1 (a50).
begin data
"1,2,11,12,"
"2,12,"
"1,2,"
"1,11,12,"
end data.
do repeat
wrd="1," ",2," ",3," ",4," ",5," ",6," ",7," ",8," ",9," ",10," ",11," ",12," ",13," ",14," ",15," ",16," ",17," ",18," ",19,"
/NewVar = W3_CG_asset_TV_1 W3_CG_asset_radio_2 W3_CG_asset_payTV_3 W3_CG_asset_tel_4 W3_CG_asset_cellphone_5 W3_CG_asset_fridge_6 W3_CG_asset_freezer_7
W3_CG_asset_electricstove_8 W3_CG_asset_primusstove_9 W3_CG_asset_gasstove_10 W3_CG_asset_electrickettle_11 W3_CG_asset_microwave_12 W3_CG_asset_computer_13 W3_CG_asset_electricity_14 W3_CG_asset_geyser_15
W3_CG_asset_washingmachine_16 W3_CG_asset_workingvehicle_17 W3_CG_asset_bicycle_18 W3_CG_asset_donkeyhorse_19.
compute NewVar=char.index(W3_CG_HouseExpen1, wrd)>0.
end repeat.
EXECUTE.
My suggestion is to run through this in reverse, erasing the values you've already recognized. So if you've got "11" and erased it, when you later search for "1" you won't find it in an "11".
I recreated a tiny exaple dataset to demonstrate on (EDIT-improved example):
data list list/W3_CG_HouseExpen1 (a50) .
begin data
"1,2,11,12,"
"11,12,"
"2,11,"
end data.
Now I do the whole process on a copy of the original W3_CG_HouseExpen1 variable so I can eat it away without damage to the original data:
string #temp(a50).
compute #temp=W3_CG_HouseExpen1.
do repeat wrd="12," "11," "2," "1," /NewVar= W3_12 W3_11 W3_2 W3_1.
compute NewVar=char.index(#temp, wrd)>0.
compute #temp=replace(#temp, wrd, ""). /*deleting the search string from the full string.
end repeat.
exe.

Stata: using foreach to rename numeric variables

I have a large dataset where subsets of variables have been entered with the same prefix, followed by an underscore and some details. They are all binary YN and the variables are all doubles. For example, I have the variables onsite_healthclinic and onsite_CBO where values can only be 1 or 0.
I want to rename them all according to the question they are on the survey I'm working off of (so the above variables would become q0052_healthclinic and q0052_CBO), but if I use the code below using substr I (obviously) get type mismatch:
foreach var in onsite_healthclinic onsite_CBO {
local new = substr(`var', 8, .)
rename `new' q0052_`new'
}
My question is, is there another command other than substr that I can use so that I don't have to either a) convert all of the variables to strings first; or b) rename them all manually (there are ~20 in each subset, so while doable, it's a waste of time).
There is no need for a loop here at all. Although the essential answer is one line long I give here a complete, self-contained answer.
clear
set obs 1
foreach v in onsite_healthclinic onsite_CBO {
gen `v' = 1
}
rename onsite_* q0052_*
describe, fullnames
This answer implies that you've not studied the help under rename groups.
Will this work?
foreach var in onsite_healthclinic onsite_CBO {
local new = substr("`var'", 8, .)
rename onsite_`new' q0052_`new'
}
I added quotes around the call to the local var in the substr function and added onsite_ to the rename and that seemed to work.

Logic behind COBOL code

I am not able to understand what is the logic behind these lines:
COMPUTE temp = RESULT - 1.843E19.
IF temp IS LESS THAN 1.0E16 THEN
Data definition:
000330 01 VAR1 COMP-1 VALUE 3.4E38. // 3.4 x 10 ^ 38
Here are those lines in context (the sub-program returns a square root):
MOVE VAR1 TO PARM1.
CALL "SQUAREROOT_ROUTINE" USING
BY REFERENCE PARM1,
BY REFERENCE RESULT.
COMPUTE temp = RESULT - 1.843E19.
IF temp IS LESS THAN 1.0E16 THEN
DISPLAY "OK"
ELSE
DISPLAY "False"
END-IF.
These lines are just trying to test if the result returned by the SQUAREROOT_ROUTINE is correct. Since the program is using float-values and rather large numbers this might look a bit complicated. Let's just do the math:
You start with 3.4E38, the squareroot is 1.84390889...E19.
By subtracting 1.843E19 (i.e. the approximate result) and comparing the difference against 1.0E16 the program is testing whether the result is between 1.843E19 and 1.843E19+1.0E16 = 1.844E19.
Not that this test would not catch an error if the result from SQUAREROOT_ROUTINE was too low instead of too high. To catch both types of wrong results you should compare the absolute value of the difference against the tolerance.
You might ask "Why make things so complicated"? The thing is that float-values usually are not exact and depending on the used precision you will get sightly different results due to rounding-errors.
well the logic itself is very straight forward, you are subtracting 1.843*(10^19) from the result you get from the SQUAREROOT_ROUTINE and putting that value in the variable called temp and then If the value of temp is less than 1.0*(10^16) you are going to print a line out to the SYSOUT that says "OK", otherwise you are going to print out "False" (if the value was equal to or greater than).
If you mean the logic as to why this code exists, you will need to talk to the author of the code, but it looks like a debugging display that was left in the program.

How to refactor string containing variable names into booleans?

I have an SPSS variable containing lines like:
|2|3|4|5|6|7|8|10|11|12|13|14|15|16|18|20|21|22|23|24|25|26|27|28|29|
Every line starts with pipe, and ends with one. I need to refactor it into boolean variables as the following:
var var1 var2 var3 var4 var5
|2|4|5| 0 1 0 1 1
I have tried to do it with a loop like:
loop # = 1 to 72.
compute var# = SUBSTR(var,2#,1).
end loop.
exe.
My code won't work with 2 or more digits long numbers and also it won't place the values into their respective variables, so I've tried nest the char.substr(var,char.rindex(var,'|') + 1) into another loop with no luck because it still won't allow me to recognize the variable number.
How can I do it?
This looks like a nice job for the DO REPEAT command. However the type conversion is somewhat tricky:
DO REPEAT var#i=var1 TO var72
/i=1 TO 72.
COMPUTE var#i = CHAR.INDEX(var,CONCAT("|",LTRIM(STRING(i,F2.0)),"|"))>0).
END REPEAT.
Explanation: Let's go from the inside to the outside:
STRING(value,F2.0) converts the numeric values into a string of two digits (with a leading white space where the number consist of just one digit), e.g. 2 -> " 2".
LTRIM() removes the leading whitespaces, e.g. " 2" -> "2".
CONCAT() concatenates strings. In the above code it adds the "|" before and after the number, e.g. "2" -> "|2|"
CHAR.INDEX(stringvar,searchstring) returns the position at which the searchstring was found. It returns 0 if the searchstring wasn't found.
CHAR.INDEX(stringvar,searchstring)>0 returns a boolean value indicating if the searchstring was found or not.
It's easier to do the manipulations in Python than native SPSS syntax.
You can use SPSSINC TRANS extension for this purpose.
/* Example data*/.
data list free / TextStr (a99).
begin data.
"|2|3|4|5|6|7|8|10|11|12|13|14|15|16|18|20|21|22|23|24|25|26|27|28|29|"
end data.
/* defining function to achieve task */.
begin program.
def runTask(x):
numbers=map(int,filter(None,[i.strip() for i in x.lstrip('|').split("|")]))
answer=[1 if i in numbers else 0 for i in xrange(1,max(numbers)+1)]
return answer
end program.
/* Run job*/.
spssinc trans result = V1 to V30 type=0 /formula "runTask(TextStr)".
exe.

string comparison against factors in Stata

Suppose I have a factor variable with labels "a" "b" and "c" and want to see which observations have a label of "b". Stata refuses to parse
gen isb = myfactor == "b"
Sure, there is literally a "type mismatch", since my factor is encoded as an integer and so cannot be compared to the string "b". However, it wouldn't kill Stata to (i) perform the obvious parse or (ii) provide a translator function so I can write the comparison as label(myfactor) == "b". Using decode to (re)create a string variable defeats the purpose of encoding, which is to save space and make computations more efficient, right?
I hadn't really expected the comparison above to work, but I at least figured there would be a one- or two-line approach. Here is what I have found so far. There is a nice macro ("extended") function that maps the other way (from an integer to a label, seen below as local labi: label ...). Here's the solution using it:
// sample data
clear
input str5 mystr int mynum
a 5
b 5
b 6
c 4
end
encode mystr, gen(myfactor)
// first, how many groups are there?
by myfactor, sort: gen ng = _n == 1
replace ng = sum(ng)
scalar ng = ng[_N]
drop ng
// now, which code corresponds to "b"?
forvalues i = 1/`=ng'{
local labi: label myfactor `i'
if "b" == "`labi'" {
scalar bcode = `i'
break
}
}
di bcode
The second step is what irks me, but I'm sure there's a also faster, more idiomatic way of performing the first step. Can I grab the length of the label vector, for example?
An example:
clear all
set more off
sysuse auto
gen isdom = 1 if foreign == "Domestic":`:value label foreign'
list foreign isdom in 1/60
This creates a variable called isdom and it will equal 1 if foreigns's value label is equal to "Domestic". It uses an extended macro function.
From [U] 18.3.8 Macro expressions:
Also, typing
command that makes reference to `:extended macro function'
is equivalent to
local macroname : extended macro function
command that makes reference to `macroname'
This explains one of the two : in the offered syntax. The other can be explained by
... to specify value labels directly in an expression, rather than through
the underlying numeric value ... You specify the label in double quotes
(""), followed by a colon (:), followed by the name of the value
label.
The quote is from Stata tip 14: Using value labels in expressions, by Kenneth Higbee, The Stata Journal (2004). Freely available at http://www.stata-journal.com/sjpdf.html?articlenum=dm0009
Edit
On computing the number of distinct observations, another way is:
by myfactor, sort: gen ng = _n == 1
count if ng
scalar sc_ng = r(N)
display sc_ng
But yours is fine. In fact, it is documented here: http://www.stata.com/support/faqs/data-management/number-of-distinct-observations/, along with more methods and comments.

Resources