ORIGINAL QUESTION
While importing a large dataset at work, I noticed some unexpected behaviour where Stata appears to "forget" a local macro when using the append command.
This seems all the more strange as it appears to be a phenomenon specific to this one command (I tested with save and the code worked as expected).
*****************
** SET UP
*****************
local datasets "auto.dta auto2.dta"
global data "/Users/Seansmac/Desktop/stata_question"
save "$data/test_data.dta", replace emptyok
** Save Stata data in two Excel sheets. This replicates the status of my raw data at work.
foreach dataset in `datasets' {
    di "`dataset' loaded"
    sysuse `dataset', clear
    gen data_name = "`dataset'"
    tab data_name
    export excel using "$data/auto_excel.xlsx", sheet("`dataset'") first(variables) sheetreplace
    di "`dataset' saved in excel"
}
*****************
** Demo of Problem
*****************
import excel "$data/auto_excel.xlsx", desc
local worksheets `r(N_worksheet)'
di `worksheets'
forvalues i = 1/`worksheets' {
    di " Sheet number `i'"
    local shtname`i' `r(worksheet_`i')'
    di "loading database: `shtname`i''"
    import excel "$data/auto_excel.xlsx", sheet("`shtname`i''") clear firstrow
    di "database: `shtname`i'' loaded"
    append using "$data/test_data.dta", force
    di "database: `shtname`i'' appended"
}
***** Show only the same data was appended twice
use "$data/test_data.dta", clear
tab data_name
** I include this tab to demonstrate that only one of the two data sets is appended.
*****************
** END
*****************
Apologies if the example is a little cluttered, but I often find it helpful to use display when working with locals. To run the code, all that is required is that you change the global data.
To maintain fidelity with my problem at work, I include the import excel section; I don't understand the problem sufficiently to make the example any more minimal.
ORIGINAL QUESTION EDITED
Below are two code chunks. The first demonstrates that the append command appears to work as I would expect it to (I note that I had forgotten to use save in my original question). This chunk demonstrates that although appending an empty dataset to a loaded dataset may not be entirely intuitive, it still works fine. The advantage of this method is that it eliminates the need for a conditional statement when loading files.
In the second code chunk, I try to use the append command in the same basic way, but this time in a loop. This code chunk is copied and pasted from Pearly Spencer's answer, with three minor changes:
I save an empty dataset at the beginning
I comment-out the logical if statements
I append with the empty dataset instead (and then save it, so the second time round it shouldn't be empty).
The local macro which Stata "forgets" is shtname. If you examine the display statements, nothing is printed after the first loop. This is the location of my question. To further demonstrate this, the tab command at the end of the script shows that the variable data_name has 148 observations of auto.dta and none of auto2.dta. This shows the same (i.e. the first) dataset was appended twice. This suggests (to me) that the append part of the script works fine, but there is a problem with the local macro, shtname.
* DEMONSTRATE APPEND APPEARS TO WORK *
clear all
cd "[**INSERT CD**]"
*Create empty data set to append later
save "test_data_noloop.dta", replace emptyok
* load first dataset
sysuse auto.dta, clear
* Gen a variable indicating what dataset it is
gen dataset = "auto_1"
* append data with empty dataset
append using "test_data_noloop.dta"
save "test_data_noloop.dta", replace
clear
*load second dataset
sysuse auto2.dta, clear
* gen dataset variable again
gen dataset = "auto_2"
* append with the previously saved dataset
append using "test_data_noloop.dta"
* Demonstrate both datasets have been appended
tab dataset
* FOLLOW PEARLY SPENCER, WITH SMALL ADJUSTMENTS *
clear all
cd "[**INSERT CD**]"
local datasets "auto.dta auto2.dta"
** Added the following line
save "test_data.dta", replace emptyok
foreach dataset in `datasets' {
    display "`dataset' loaded"
    sysuse `dataset', clear
    generate data_name = "`dataset'"
    tab data_name
    export excel using "auto_excel.xlsx", sheet("`dataset'") first(variables) sheetreplace
    display "`dataset' saved in excel"
}
import excel "auto_excel.xlsx", desc
local worksheets `r(N_worksheet)'
display `worksheets'
forvalues i = 1/`worksheets' {
    display " Sheet number `i'"
    local shtname`i' `r(worksheet_`i')'
    display "loading database: `shtname`i''"
    import excel "auto_excel.xlsx", sheet("`shtname`i''") clear firstrow
    *if `i' == 1 save "test_data.dta", replace
    display "database: `shtname`i'' loaded"
    *if `i' > 1 {
    append using "test_data.dta", force
    save "test_data.dta", replace
    *}
    display "database: `shtname`i'' appended"
}
use "test_data.dta", clear
tab data_name
To address some remarks in the comments, auto2.dta can be found by typing sysuse dir into the console. It is thus a dataset available to all. I have tried my best to keep my code replicable, and unless I am mistaken all that needs to be done is to set the working directory for the above code to work.
Secondly, I have tried hard to ensure I haven't made a stupid logical error (as mentioned above, I realise I omitted saving my file in my original question, which would indeed mean I was appending an empty dataset each time). That said, it may be the case that I've looked at this problem for so long I can no longer see the wood for the trees; so please go easy if it's still a one-liner type issue!
Finally, I never said Stata is forgetting the local macro, merely that it appears to do so. Hence I ask the question to understand what is going on (or, more likely, where I've made the mistake).
SCREEN SHOT OF MY OUTPUT
See red marks where locals are not being displayed.
***EDIT #3
This image appears to show the (as yet) unexplained behaviour stems from append and not import.
My understanding is that you want to append successive Excel sheets into one Stata dataset.
Below is a working version of your toy example.
Set up:
clear all
local datasets "auto.dta auto2.dta"
Save Stata data in two excel sheets:
foreach dataset in `datasets' {
    display "`dataset' loaded"
    sysuse `dataset', clear
    generate data_name = "`dataset'"
    tab data_name
    export excel using "auto_excel.xlsx", sheet("`dataset'") first(variables) sheetreplace
    display "`dataset' saved in excel"
}
Demonstrate the solution to the problem:
import excel "auto_excel.xlsx", desc
local worksheets `r(N_worksheet)'
display `worksheets'
forvalues i = 1/`worksheets' {
    display " Sheet number `i'"
    local shtname`i' `r(worksheet_`i')'
    display "loading database: `shtname`i''"
    import excel "auto_excel.xlsx", sheet("`shtname`i''") clear firstrow
    if `i' == 1 save "test_data.dta", replace
    display "database: `shtname`i'' loaded"
    if `i' > 1 {
        append using "test_data.dta", force
        save "test_data.dta", replace
    }
    display "database: `shtname`i'' appended"
}
Show the results:
use "test_data.dta", clear
tab data_name
(I broke the code into different snippets for better legibility.)
EDIT:
Stata can use only one dataset at a time. The way the append command works is by 'attaching' the data from the specified external dataset to the data already loaded into memory (i.e. in use). The reason your version of the example is not giving you the desired outcome is that you are trying to append an empty dataset every time after you import an Excel sheet. This is an error in logic.
EDIT 2:
The output generated:
. clear all
. local datasets "auto.dta auto2.dta"
.
. foreach dataset in `datasets'{
2. display "`dataset' loaded"
3. sysuse `dataset', clear
4. generate data_name = "`dataset'"
5. tab data_name
6. export excel using "auto_excel.xlsx", sheet("`dataset'") first(variables) sheetreplace
7. display "`dataset' saved in excel"
8. }
auto.dta loaded
(1978 Automobile Data)
data_name | Freq. Percent Cum.
------------+-----------------------------------
auto.dta | 74 100.00 100.00
------------+-----------------------------------
Total | 74 100.00
file auto_excel.xlsx saved
auto.dta saved in excel
auto2.dta loaded
(1978 Automobile Data)
data_name | Freq. Percent Cum.
------------+-----------------------------------
auto2.dta | 74 100.00 100.00
------------+-----------------------------------
Total | 74 100.00
file auto_excel.xlsx saved
auto2.dta saved in excel
.
. import excel "auto_excel.xlsx", desc
Sheet | Range
----------+----------
auto.dta | A1:M75
auto2.dta | A1:M75
.
. local worksheets `r(N_worksheet)'
. display `worksheets'
2
.
. forvalues i = 1 / `worksheets' {
2. display " Sheet number `i'"
3. local shtname`i' `r(worksheet_`i')'
4. display "loading database: `shtname`i''"
5. import excel "auto_excel.xlsx", sheet("`shtname`i''") clear firstrow
6. if `i' == 1 save "test_data.dta", replace
7. display "database: `shtname`i'' loaded"
8. if `i' > 1 {
9. append using "test_data.dta", force
10. save "test_data.dta", replace
11. }
12. display "database: `shtname`i'' appended"
13. }
Sheet number 1
loading database: auto.dta
file test_data.dta saved
database: auto.dta loaded
database: auto.dta appended
Sheet number 2
loading database: auto2.dta
database: auto2.dta loaded
(note: variable rep78 was byte in the using data, but will be str9 now)
file test_data.dta saved
database: auto2.dta appended
.
. use "test_data.dta", clear
. tab data_name
data_name | Freq. Percent Cum.
------------+-----------------------------------
auto.dta | 74 50.00 50.00
auto2.dta | 74 50.00 100.00
------------+-----------------------------------
Total | 148 100.00
.
end of do-file
Related
I have 2 txt files.
The 1st txt file is like this:
sequence_id description
Solyc01g005420.2.1 No description available
Solyc01g006950.3.1 "31.4 cell.vesicle transport Encodes a syntaxin localized at the plasma membrane (SYR1 Syntaxin Related Protein 1 also known as SYP121 PENETRATION1/PEN1). SYR1/PEN1 is a member of the SNARE superfamily proteins. SNARE proteins are involved in cell signaling vesicle traffic growth and development. SYR1/PEN1 functions in positioning anchoring of the KAT1 K+ channel protein at the plasma membrane. Transcription is upregulated by abscisic acid suggesting a role in ABA signaling. Also functions in non-host resistance against barley powdery mildew Blumeria graminis sp. hordei. SYR1/PEN1 is a nonessential component of the preinvasive resistance against Colletotrichum fungus. Required for mlo resistance. syntaxin of plants 121 (SYP121)"
Solyc01g007770.2.1 No description available
Solyc01g008560.3.1 No description available
Solyc01g068490.3.1 20.1 stress.biotic Encodes a protein containing a U-box and an ARM domain. senescence-associated E3 ubiquitin ligase 1 (SAUL1)
..
.
The 2nd txt file has the gene IDs:
Solyc02g080050.2.1
Solyc09g083200.3.1
Solyc05g050380.3.1
Solyc09g011490.3.1
Solyc04g051490.3.1
Solyc08g006470.3.1
Solyc01g107810.3.1
Solyc03g095770.3.1
Solyc12g006370.2.1
Solyc03g033840.3.1
Solyc02g069250.3.1
Solyc02g077040.3.1
Solyc03g093890.3.1
..
.
.
Each txt file has a lot more lines than the ones I show. I just wanted to know what grep command I should use so that I only get the genes that are in the 2nd txt file, extracted from the 1st file with the description next to them.
thanks
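A sketch of one way to do this, assuming hypothetical file names descriptions.txt (the 1st file) and ids.txt (the 2nd file): grep's -f option reads a list of patterns from a file, and -F makes those patterns fixed strings instead of regular expressions.

```shell
# Hypothetical file names: descriptions.txt stands in for the 1st file
# (ID plus description), ids.txt for the 2nd (one gene ID per line).
# Tiny versions of the two files, for illustration only:
printf '%s\n' \
  'Solyc01g005420.2.1 No description available' \
  'Solyc01g006950.3.1 syntaxin related protein' \
  'Solyc01g007770.2.1 No description available' > descriptions.txt
printf '%s\n' 'Solyc01g005420.2.1' 'Solyc01g007770.2.1' > ids.txt

# -F treats each ID as a fixed string (otherwise the dots in the IDs
#    would match any character as regex metacharacters)
# -f reads the patterns, one per line, from ids.txt
grep -F -f ids.txt descriptions.txt > matched.txt
cat matched.txt
```

This keeps only the description lines whose ID appears in the second file. If some IDs could be substrings of other IDs, adding -w restricts grep to whole-word matches.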
Currently two Avro files are generated for a 10 KB file. If I follow the same approach with my actual file (30 MB+), I will get n number of files.
So I need a solution to generate only one or two .avro files even if the source file is large.
Also, is there any way to avoid the manual declaration of column names?
Current approach:
spark-shell --packages com.databricks:spark-csv_2.10:1.5.0,com.databricks:spark-avro_2.10:2.0.1
import org.apache.spark.sql.types.{StructType, StructField, StringType}
// Manual schema declaration of the 'co' and 'id' column names and types
val customSchema = StructType(Array(
StructField("ind", StringType, true),
StructField("co", StringType, true)))
val df = sqlContext.read.format("com.databricks.spark.csv").option("comment", "\"").option("quote", "|").schema(customSchema).load("/tmp/file.txt")
df.write.format("com.databricks.spark.avro").save("/tmp/avroout")
// Note: /tmp/file.txt is input file/dir, and /tmp/avroout is the output dir
Try specifying the number of partitions of your DataFrame while writing the data as Avro or any other format. To fix this, use the repartition or coalesce DataFrame functions.
df.coalesce(1).write.format("com.databricks.spark.avro").save("/tmp/avroout")
so that it writes only one file in "/tmp/avroout".
Hope this helps!
I have a file with more than 250 variables and more than 100 cases. Some of these variables have an error in the decimal dot (20445.12 should be 2.044512).
I want to modify these data programmatically. I found a possible way in a Visual Basic editor provided by SPSS (I show a screenshot below), but I have an absolute lack of knowledge of that language.
How can I select a range of cells in this language?
How can I store the cell once modified its data?
--- EDITED NEW DATA ----
Thank you for your fast reply.
The problem now is the number of digits that the number has. For example, the erroneous data could have the following format:
Case A) 43998 (five digits) ---> 4.3998 as correct value.
Case B) 4399 (four digits) ---> 4.3990 as correct value, but parsed as 0.4399 because the trailing 0 was removed when the file was created.
Is there any way, like:
IF (NUM < 10000) THEN NUM = NUM / 1000 ELSE NUM = NUM / 10000
Or something like IF (Number_of_digits(NUM)) THEN ...
Thank you.
There's no need for a VB script; go this way:
Open a syntax window and paste the following code:
do repeat vr=var1 var2 var3 var4.
compute vr=vr/10000.
end repeat.
save outfile="filepath\My corrected data.sav".
exe.
Replace var1 var2 var3 var4 with the names of the actual variables you need to change. For variables that are contiguous in the file you may use var1 to var4.
Replace vr=vr/10000 with whatever mathematical calculation you would like to use to correct the data.
Replace "filepath\My corrected data.sav" with your path and file name.
WARNING: this syntax will change the data in your file. You should make sure to create a backup of your original in addition to saving the corrected data to a new file.
I am using a shell script to extract the data from the 'extr' table. The extr table is a very big table with 410 columns and 61047 rows of data. The size of one record is around 5 KB.
The script is as follows:
#!/usr/bin/ksh
sqlplus -s \/ << rbb
set pages 0
set head on
set feed off
set num 20
set linesize 32767
set colsep |
set trimspool on
spool extr.csv
select * from extr;
/
spool off
rbb
#-------- END ---------
One fine day the extr.csv file had 2 records with an incorrect number of columns (i.e. one record with more columns and the other with fewer). Upon investigation I came to know that two duplicate records were repeated in the file. The primary key of the records should ideally be unique in the file, but in this case 2 records were repeated. Also, the shift in the columns was abrupt.
Small example of the output file:
5001|A1A|AAB|190.00|105|A
5002|A2A|ABB|180.00|200|F
5003|A3A|AAB|153.33|205|R
5004|A4A|ABB|261.50|269|F
5005|A5A|AAB|243.00|258|G
5006|A6A|ABB|147.89|154|H
5003|A7A|AAB|249.67|AAB|153.33|205|R
5004|A8A|269|F
5009|A9A|AAB|368.00|358|S
5010|AAA|ABB|245.71|215|F
Here the primary key records for 5003 and 5004 have reappeared in place of 5007 and 5008. Also, the duplicate records have shifted the records of 5007 and 5008 by appending/cutting down their columns.
I need your help in analysing why this happened. Why were the 2 rows extracted multiple times? Why were the other 2 rows missing from the file? And why were the records shifted?
Note: This script has been working fine for the last two years and has never failed except for the one time mentioned above. It ran successfully during the next run. Recently we added one more program which accesses the extr table with a cursor (select only).
I reproduced a similar behaviour.
;-> cat input
5001|A1A|AAB|190.00|105|A
5002|A2A|ABB|180.00|200|F
5003|A3A|AAB|153.33|205|R
5004|A4A|ABB|261.50|269|F
5005|A5A|AAB|243.00|258|G
5006|A6A|ABB|147.89|154|H
5009|A9A|AAB|368.00|358|S
5010|AAA|ABB|245.71|215|F
See the input file as your database.
Now I write a script that accesses "the database" and adds some random pauses.
;-> cat writeout.sh
# Start this script twice
while IFS=\| read a b c d e f; do
# I think you need \c to skip the \n, but I do it differently this time
echo "$a|$b|$c|$d|" | tr -d "\n"
(( sleeptime = RANDOM % 5 ))
sleep ${sleeptime}
echo "$e|$f"
done < input >> output
EDIT: Removed cat input | in script above, replaced by < input
Start this script twice in the background
;-> ./writeout.sh &
;-> ./writeout.sh &
Wait until both jobs are finished and see the result
;-> cat output
5001|A1A|AAB|190.00|105|A
5002|A2A|ABB|180.00|200|F
5003|A3A|AAB|153.33|5001|A1A|AAB|190.00|105|A
5002|A2A|ABB|180.00|205|R
5004|A4A|ABB|261.50|269|F
5005|A5A|AAB|243.00|200|F
5003|A3A|AAB|153.33|258|G
5006|A6A|ABB|147.89|154|H
5009|A9A|AAB|368.00|358|S
5010|AAA|ABB|245.71|205|R
5004|A4A|ABB|261.50|269|F
5005|A5A|AAB|243.00|258|G
5006|A6A|ABB|147.89|215|F
154|H
5009|A9A|AAB|368.00|358|S
5010|AAA|ABB|245.71|215|F
When I edit the last line of writeout.sh into done > output I do not see the problem, but that might be due to buffering and the small amount of data.
I still don't know exactly what happened in your case, but it really seems like 2 programs writing simultaneously to the same file.
A job in TWS could have been restarted manually, 2 scripts in your masterscript might write to the same file, or something else.
Preventing this in the future can be done using some locking / checks (when the output file exists, quit and return an error code to TWS).
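As a sketch of such a check (all path names below are hypothetical), a mkdir-based lock is a portable way to guarantee that only one copy of the extract script runs at a time: mkdir either creates the directory or fails, atomically, so two concurrent runs cannot both pass the check.

```shell
#!/bin/sh
# Hypothetical lock location; mkdir creates it or fails atomically,
# so two concurrent runs cannot both acquire the lock.
LOCKDIR="/tmp/extr.lock.d"

if ! mkdir "$LOCKDIR" 2>/dev/null; then
    echo "another extract is already running; aborting" >&2
    exit 1                      # non-zero return code for TWS to detect
fi
# Remove the lock on exit so a finished (or interrupted) run
# does not block the next one.
trap 'rmdir "$LOCKDIR"' EXIT

# ... the sqlplus / spool section would run here ...
echo "extract finished"
```

On Linux, flock(1) is an alternative worth considering: it ties the lock to an open file descriptor, so even a process killed with -9 cannot leave a stale lock behind, whereas the mkdir pattern relies on the trap firing.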
My dataset includes TWO main variables X and Y.
Variable X represents distinct codes (e.g. 001X01, 001X02, etc) for multiple computer items with different brands.
Variable Y represents the tax charged for each code of variable X (e.g. 15 = 15% for 001X01) at a store.
I've created categories for these computer items using dummy variables (e.g. an HD dummy variable for hard drives, which takes the value 1 when variable X represents a HD, etc). I have a list of over 40 variables (two of them representing X and Y, and the rest are dummy variables for the different categories I've created for computer items).
I would like to display the averages of all these categories using a loop in Stata, but I'm not sure how to do this.
For example the code:
mean Y if HD == 1
Mean estimation Number of obs = 5
--------------------------------------------------------------
| Mean Std. Err. [95% Conf. Interval]
-------------+------------------------------------------------
Tax | 7.1 2.537716 1.154172 15.24583
gives me the mean Tax for the category representing Hard Drives. How can I use a loop in Stata to automatically display all the mean Taxes charged for each category? I would do it by hand without a problem, but I want to repeat this process for multiple years, so I would like to use a loop for each year in order to come up with this output.
My goal is to create a separate Excel file with each of the computer categories I've created (38 total) and the average tax for each category by year.
Why bother with the loop and creating the indicator variables? If I understand correctly, your initial dataset allows the use of a simple collapse:
clear all
set more off
input ///
code tax str10 categ
1 0.15 "hd"
2 0.25 "pend"
3 0.23 "mouse"
4 0.29 "pend"
5 0.16 "pend"
6 0.50 "hd"
7 0.54 "monitor"
8 0.22 "monitor"
9 0.21 "mouse"
10 0.76 "mouse"
end
list
collapse (mean) tax, by(categ)
list
To take to Excel you can try export excel or put excel.
Run help collapse and help export for details.
Edit
Because you insist, below is an example that gives the same result using loops. I assume the same data input as before. Some testing using this example database with expand 1000000 shows that speed is virtually the same. But almost surely, you (including your future you) and your readers will prefer collapse. It is much clearer, cleaner and more concise. It is even prettier.
levelsof categ, local(parts)
gen mtax = .
quietly {
    foreach part of local parts {
        summarize tax if categ == "`part'", meanonly
        replace mtax = r(mean) if categ == "`part'"
    }
}
bysort categ: keep if _n == 1
keep categ mtax
Stata has features that make it quite different from other languages. Once you start getting the hang of it, you will find that many things done with loops elsewhere can be made loop-less in Stata. In many cases, the latter style will be preferred.
See corresponding help files using help <command> and if you are not familiarized with saved results (e.g. r(mean)), type help return.
A supplement to Roberto's excellent answer: after collapse, you will need a loop to export the results to Excel.
levelsof categ, local(levels)
foreach x of local levels {
    export excel using "`x'.xlsx", replace
}
I prefer to use numeric codes for variables such as your category variable, and then assign them value labels. Here's a version of Roberto's code which does this and which, for closer correspondence to your problem, adds a "year" variable:
input code tax categ year
1 0.15 1 1999
2 0.25 2 2000
3 0.23 3 2013
4 0.29 1 2010
5 0.16 2 2000
6 0.50 1 2011
7 0.54 4 2000
8 0.22 4 2003
9 0.21 3 2004
10 0.76 3 2005
end
#delim ;
label define catl
1 hd
2 pend
3 mouse
4 monitor
;
#delim cr
label values categ catl
collapse (mean) tax, by(categ year)
levelsof categ, local(levels)
foreach x of local levels {
    export excel using "`:label (categ) `x''.xlsx", replace
}
The #delim ; command makes it possible to list each code on a separate line. The "label" function in the export statement is an extended macro function that inserts a value label into the file name.