Parsing and generalising a Stata program for renaming variables - foreach

I'm learning how to write a program in Stata for the first time and I'm having difficulty generalising my program so I can parse an arbitrary list of variables when renaming variables in datasets.
I'm working with two datasets. First one is panel dataset containing the life satisfaction of interviewees in a survey over a duration of 26 years (same dataset as in my previous question). The variables are originally named in this format: ap6801 bp9301 cp9601 and all the way to zp15701. ap6801 contains the respondents' life satisfaction for the year 1985, bp9301 contains it for 1986, and so on.
I wrote the following program to rename the variables so instead of ap6801 it would be lsat1985.
program myprogram
local mcode 1984
foreach stub in a b c d e f g h i j k l m n o p q r s t u v w x y z {
local mcode = `mcode' + 1
rename `stub'* lsat`mcode'
}
Now, I want to modify and generalise this program so that I can use it on my second dataset and with arbitrary numbers. The second dataset consists of variables abetto bbetto cbetto all the way until zbetto. These variables indicate whether a specific person has been interviewed in a specific year and, if not, why not. abetto corresponds to year 1985, bbetto corresponds to year 1986, and so on.
My goal is to write a generalised version of the program so that when I enter an arbitrary list of variables and other information (for example lsat, and a list of numbers (eg: 1985-2010)): myprogram ap6801-zp15701 , the variables will be renamed lsat1985 lsat1986... lsat2010
I'm guessing the program will have the following basic structure:
program myprogram
syntax varlist
foreach x of varlist {
}
Within the loop, there might be local letter = substr("`x'",1,1) to identify the first letter of the variables (a, b, c, d...). Next step would be to link the letters of the alphabet to numbers that will be specified by the user, and a rename command that renames the variable in the format: lsat/betto year. I'm having a difficult time putting all that together in code.
I'm new to Stata and programming, so any help is appreciated!

Your program only works in your extraordinary circumstances. (Trivially, there is no end statement.) It will fail if any of the wildcards a* b* and so on does not occur as corresponding variable names in the data. Other way round, each rename is from a wildcard, say a*, to one variable name, say lsat1985, and this will work if and only if there is precisely one variable in each case.
More generally, this is an example of premature programming. Prejudice alert: It's not good style to write programs for such highly specific tasks, wiring in a specific variable name prefix, and for such highly unusual circumstances. At most, this is territory for code in a do-file. But if you allow variable name abbreviation, this should work. Good style here implies explaining the circumstances.
* each wildcard a* b* ... is matched by a single variable
set varabbrev on
local mcode = 1985
foreach pre in `c(alpha)' {
rename `pre' lsat`mcode'
local ++mcode
}
Note that there is no need to type out all the lower case letters a to z. Stata holds all those in one place as c(alpha). You are not expected to know that. For completeness, and this may be irrelevant to your problem, note that Stata variable names can start with other characters, most notably underscore _. I am left wondering about data for years from 2011 on.
It's hard not to write this without running the risk of seeming obnoxiously patronising. And the commentary may seem entirely unfair, as your goal is precisely that of generalising the program! However, in Stata it is often a better idea to write do-files first and move to a program when, and only when, you have used a do-file in so many different circumstances that the need for a more general program becomes evident. So, I won't answer your question on a more general program. It's not obviously a good idea and even if it were it is hard for an outsider to know what that program should look like. You have given one example where the suffix looks unpredictable (6801 9301 9601 and so on) and one where it is predictable (betto) and we can have no idea what else is true (unless freakishly someone here recognises your dataset). A program written for a dataset that is just 26 variables asomething to zsomething is possible, but would you ever use it more than a few times?

Related

How can I Read the file with Lua doing search filter and store the entries as a variable

So I want to read a text file with the following format:
Bob, G92f22f, Fggggfdff32
Rob, f3h9123, fdsgfdsg3
Sally, f2g4g, g3g3hgdsd
I want a simple Lua program that can filter out say "bob" and then say throw the data into a variable to use in a program.
a = Bob
b = G92f22f
c = Fggggfdff32
I assume then I could do
print(a,b,c)
Still quite new to Lua having a heck of a time with anything read / variable though.
You'll want to look at the io and string modules; they handle things like reading from / writing to files and doing pattern matching on strings.
Luas pattern matching is a bit simple compared to the regular expressions most modern languages use, but from what I see in your example, you could probably match a "word" as [^, ]+, that is, one or more characters that aren't a comma or a space

Source code logic evaluation

I was given a fragment of code (a function called bubbleSort(), written in Java, for example). How can I, or rather my program, tell if a given source code implements a particular sorting algorithm the correct way (using bubble method, for instance)?
I can enforce a user to give a legitimate function by analyzing function signature: making sure the the argument and return value is an array of integers. But I have no idea how to determine that algorithm logic is being done the right way. The input code could sort values correctly, but not in an aforementioned bubble method. How can my program discern that? I do realize a lot of code parsing would be involved, but maybe there's something else that I should know.
I hope I was somewhat clear.
I'd appreciate if someone could point me in the right direction or give suggestions on how to tackle such a problem. Perhaps there are tested ways that ease the evaluation of program logic.
In general, you can't do this because of the Halting problem. You can't even decide if the function will halt ("return").
As a practical matter, there's a bit more hope. If you are looking for a bubble sort, you can decide that it has number of parts:
a to-be-sorted datatype S with a partial order,
a container data type C with single instance variable A ("the array")
that holds the to-be-sorted data
a key type K ("array index") used to access the container that has a partial order
such that container[K] is type S
a comparison of two members of container, using key A and key B
such that A < B according to the key partial order, that determines
if container[B]>container of A
a swap operation on container[A], container[B] and some variable T of type S, that is conditionaly dependent on the comparison
a loop wrapped around the container that enumerates keys in according the partial order on K
You can build bits of code that find each of these bits of evidence in your source code, and if you find them all, claim you have evidence of a bubble sort.
To do this concretely, you need standard program analysis machinery:
to parse the source code and build an abstract syntax tree
build symbol tables (ST) that know the type of each identifier where it is used
construct a control flow graph (CFG) so that you check that various recognized bits occur in appropriate ordering
construct a data flow graph (DFG), so that you can determine that values recognized in one part of the algorithm flow properly to another part
[That's a lot of machinery just to get started]
From here, you can write ad hoc code procedural code to climb over the AST, ST, CFG, DFG, to "recognize" each of the individual parts. This is likely to be pretty messy as each recognizer will be checking these structures for evidence of its bit. But, you can do it.
This is messy enough, and interesting enough, so there are tools which can do much of this.
Our DMS Software Reengineering Toolkit is one. DMS already contains all the machinery to do standard program analysis for several languages. DMS also has a Dataflow pattern matching language, inspired by Rich and Water's 1980's "Programmer's Apprentice" ideas.
With DMS, you can express this particular problem roughly like this (untested):
dataflow pattern domain C;
dataflow pattern swap(in out v1:S, in out v2:S, T:S):statements =
" \T = \v1;
\v1 = \v2;
\v2 = \T;";
dataflow pattern conditional_swap(in out v1:S, in out v2:S,T:S):statements=
" if (\v1 > \v2)
\swap(\v1,\v2,\T);"
dataflow pattern container_access(inout container C, in key: K):expression
= " \container.body[\K] ";
dataflow pattern size(in container:C, out: integer):expression
= " \container . size "
dataflow pattern bubble_sort(in out container:C, k1: K, k2: K):function
" \k1 = \smallestK\(\);
while (\k1<\size\(container\)) {
\k2 = \next\(k1);
while (\k2 <= \size\(container\) {
\conditionalswap\(\container_access\(\container\,\k1\),
\container_access\(\container\,\k2\) \)
}
}
";
Within each pattern, you can write what amounts to the concrete syntax of the chosen programming language ("pattern domain"), referencing dataflows named in the pattern signature line. A subpattern can be mentioned inside another; one has to pass the dataflows to and from the subpattern by naming them. Unlike "plain old C", you have to pass the container explicitly rather than by implicit reference; that's because we are interested in the actual values that flow from one place in the pattern to another. (Just because two places in the code use the same variable, doesn't mean they see the same value).
Given these definitions, and ask to "match bubble_sort", DMS will visit the DFG (tied to CFG/AST/ST) to try to match the pattern; where it matches, it will bind the pattern variables to the DFG entries. If it can't find a match for everything, the match fails.
To accomplish the match, each of patterns above is converted essentially into its own DFG, and then each pattern is matched against the DFG for the code using what is called a subgraph isomorphism test. Constructing the DFG for the patter takes a lot of machinery: parsing, name resolution, control and data flow analysis, applied to fragments of code in the original language, intermixed with various pattern meta-escapes. The subgraph isomorphism is "sort of easy" to code, but can be very expensive to run. What saves the DMS pattern matchers is that most patterns have many, many constraints [tech point: and they don't have knots] and each attempted match tends to fail pretty fast, or succeed completely.
Not shown, but by defining the various bits separately, one can provide alternative implementations, enabling the recognition of variations.
We have used this to implement quite complete factory control model extraction tools from real industrial plant controllers for Dow Chemical on their peculiar Dowtran language (meant building parsers, etc. as above for Dowtran). We have version of this prototyped for C; the data flow analysis is harder.

Can you "teach" computers to do algebra using variable expressions (eg aX+bX=(a+b)X)

Let's say in the example lower case is constant and upper case is variable.
I'd like to have programs that can "intelligently" do specified tasks like algebra, but teaching the program new methods should be easy using symbols understood by humans. For example if the program told these facts:
aX+bX=(a+b)X
if a=bX then X=a/b
Then it should be able to perform these operations:
2a+3a=5a
3x+3x=6x
3x=1 therefore x=1/3
4x+2x=1 -> 6x=1 therefore x= 1/6
I was trying to do similar things with Prolog as it can easily "understand" variables, but then I had too many complications, mainly because two describing a relationship both ways results in a crash. (not easy to sort out)
To summarise: I want to know if a program which can be taught algebra by using mathematic symbols only. I'd like to know if other people have tried this and how complicated it is expected to be. The purpose of this is to make programming easier (runtime is not so important)
It depends on what do you want machine to do and how intelligent it should be.
Your question is mostly about AI but not ML. AI deals with formalization of "human" tasks while ML (though being a subset of AI) is about building models from data.
Described program may be implemented like this:
Each fact form a pattern. Program given with an expression and some patterns can try to apply some of them to expression and see what happens. If you want your program to be able to, for example, solve quadratic equations given rule like ax² + bx + c = 0 → x = (-b ± sqrt(b²-4ac))/(2a) then it'd be designed as follows:
Somebody gives a set of rules. Rule consists of a pattern and an outcome (solution or equivalent form). Think about the pattern as kind of a regular expression.
Then the program is asked to show some intelligence and prove its knowledge via doing something with a given expression. Here comes the major part:
you build a graph of expressions by applying possible rules (if a pattern is applicable to an expression you add new vertex with the corresponding outcome).
Then you run some path-search algorithm (A*, for example) to find sequence of transformations leading to the form like x = ...
I think this is an interesting question, although it off topic in SO (tool recommendation)
But nevertheless, because it captured my imagination, I wrote couple of function using R that can solve stuff like that quite easily
First, you'll have to install R, after words you'll need to download package called stringr
So in R console run
install.packages("stringr")
library(stringr)
And then you can define the following functions that I wrote
FirstFunc <- function(temp){
paste0(eval(parse(text = gsub("[A-Z]", "", temp))), unique(str_extract_all(temp, "[A-Z]")[[1]]))
}
SecondFunc <- function(temp){
eval(parse(text = strsplit(temp, "=")[[1]][2])) / eval(parse(text = gsub("[[:alpha:]]", "", strsplit(temp, "=")[[1]][1])))
}
Now, the first function will solve equations like
aX+bX=(a+b)X
While the second will solve equations like
4x+2x=1
For example
FirstFunc("3X+6X-2X-3X")
will return
"4X"
Now this functions is pretty primitive (mostly for the propose of illustration) and will solve equation that contain only one variable type, something like FirstFunc("3X-2X-2Y") won't give the correct result (but the function could be easily modified)
The second function will solve stuff like
SecondFunc("4x-2x=1")
will return
0.5
or
SecondFunc("4x+2x*3x=1")
will return
0.1
Note that this function also works only for one unknown variable (x) but could be easily modified too

How to convert Latex formula to C/C++ code?

I need to convert a math formula written in the Latex style to the function of a C/C++ code.
For example:
y = sin(x)^2 would become something like
double y = sin(x) * sin(x);
or
double y = pow(sin(x), 2);
where x is a variable defined somewhere before.
I mean that it should convert the latex formula to the C/C++ syntax. So that if there is a function y = G(x, y)^F(x) it doesn't matter what is G(x,y) and F(x),
it is a problem of the programmer to define it. It will just generate
double y = pow(G(x, y), F(x));
When the formula is too complicated it will take some time to make include it in the C/C++ formula. Is there any way to do this conversion?
Emacs' built-in calculator calc-mode can do this (and much more). Your examples can be converted like this:
Put the formula in some emacs buffer
$ y = sin(x)^2 $
With the cursor in the formula, activate calc-embedded mode
M-x calc-embedded
Switch the display language to C:
M-x calc-c-language
There you are:
$ y == pow(sin(x), 2) $
Note that it interprets the '=' sign in latex as an equality, which results in '==' for C. The latex equivalent to Cs assignment operator '=' would be '\gets'.
More on this topic on Turong's blog
I know the question is too old, but I'll just add a reply anyway as a think it might help someone else later. The question popped up a lot for me in my searches.
I'm working on a tool that does something similar, in a public git repo
You'll have to put some artificial limitations on your latex input, that's out of question.
Currently the tool I wrote only supports mul, div, add, sub, sqrt, pow, frac and sum as those are the only set of operations I need to handle, and the imposed limitations can be a bit loose by providing a preprocessor (see preproc.l for an [maybe not-so-good] example) that would clean away the raw latex input.
A mathematical equation, such as the ones in LaTeX, and a C expression are not interchangeable. The former states a relation between two terms, the latter defines an entity that can be evaluated, unambiguously yielding one value. a = b in C means 'take the value in variable b and store it in variable a', wheres in Math, it means 'in the current context, a and b are equal'. The first describes a computation process, the second describes a static fact. Consequently, the Math equation can be reversed: a = b is equivalent to b = a, but doing the same to the C equation yields something quite different.
To make matters worse, LaTeX formulae only contain the information needed to render the equations; often, this is not enough to capture their meaning.
Of course some LaTeX formulae, like your example, can be converted into C computations, but many others cannot, so any automated way of doing so would only make limited sense.
I'm not sure there is a simple answer, because mathematical formulaes (in LaTeX documents) are actually ambiguous, so to automate their translation to some code requires automating their understanding.
And the MathML standard has, IIRC, two forms representing formulaes (one for displaying, another for computing) and there is some reason for that.

Issues with ANDs and ORs (COBOL)

I can't seem to get this one part right. I was given a input file with a bunch of names, some of which I need to skip, with extra information on each one. I was trying use ANDs and ORs to skip over the names I did not need and I came up with this.
IF DL-CLASS-STANDING = 'First Yr' OR 'Second Yr' AND
GRAD-STAT-IN = ' ' OR 'X'
It got rid of all but one person, but when I tried to add another set of ANDs and ORs the program started acting like the stipulations where not even there.
Did I make it too complex for the compiler? Is there an easier way to skip over things?
Try adding some parentheses to group things logically:
IF (DL-CLASS-STANDING = 'First Yr' OR 'Second Yr') AND
(GRAD-STAT-IN = ' ' OR 'X')
You may want to look into fully expanding that abbreviated expression since the expansion may not be what you think when there's a lot of clauses - it's often far better to be explicit.
However, what I would do is use the 88 level variables to make this more readable - these were special levels to allow conditions to be specified in the data division directly rather than using explicit conditions in the code.
In other words, put something like this in your data division:
03 DL-CLASS-STANDING PIC X(20).
88 FIRST-YEAR VALUE 'First Yr'.
88 SECOND-YEAR VALUE 'Second Yr'.
03 GRAD-STAT-IN PIC X.
88 GS-UNKNOWN VALUE ' '.
88 GS-NO VALUE 'X'.
Then you can use the 88 level variables in your expressions:
IF (FIRST-YEAR OR SECOND-YEAR) AND (GS-UNKNOWN OR GS-NO) ...
This is, in my opinion, more readable and the whole point of COBOL was to look like readable English, after all.
The first thing to note is that the code shown is the code which was working, and the amended code which did not give the desired result was never shown. As an addendum, why, if only one person were left, would more selection be necessary? To sum up that, the actual question is unclear beyond saying "I don't know how to use OR in COBOL. I don't know how to use AND in COBOL".
Beyond that, there were two actual questions:
Did I make it too complex for the compiler?
Is there an easier way to skip over things [is there a clearer way to write conditions]?
To the first, the answer is No. It is very far from difficult for the compiler. The compiler knows exactly how to handle any combinations of OR, AND (and NOT, which we will come to later). The problem is, can the human writer/reader code a condition successfully such that the compiler will know what they want, rather than just giving the result from the compiler following its rules (which don't account for multiple possible human interpretations of a line of code)?
The second question therefore becomes:
How do I write a complex condition which the compiler will understand in an identical way to my intention as author and in an identical way for any reader of the code with some experience of COBOL?
Firstly, a quick rearrangement of the (working) code in the question:
IF DL-CLASS-STANDING = 'First Yr' OR 'Second Yr'
AND GRAD-STAT-IN = ' ' OR 'X'
And of the suggested code in one of the answers:
IF (DL-CLASS-STANDING = 'First Yr' OR 'Second Yr')
AND (GRAD-STAT-IN = ' ' OR 'X')
The second version is clearer, but (or and) it is identical to the first. It did not make that code work, it allowed that code to continue to work.
The answer was addressing the resolution of the problem of a condition having its complexity increased: brackets/parenthesis (simply simplifying the complexity is another possibility, but without the non-working example it is difficult to make suggestions on).
The original code works, but when it needs to be more complex, the wheels start to fall off.
The suggested code works, but it does not (fully) resolve the problem of extending the complexity of the condition, because, in minor, it repeats the problem, within the parenthesis, of extending the complexity of the condition.
How is this so?
A simple condition:
IF A EQUAL TO "B"
A slightly more complex condition:
IF A EQUAL TO "B" OR "C"
A slight, but not complete, simplification of that:
IF (A EQUAL TO "B" OR "C")
If the condition has to become more complex, with an AND, it can be simple for the humans (the compiler does not care, it cannot be fooled):
IF (A EQUAL TO "B" OR "C")
AND (E EQUAL TO "F")
But what of this?
IF (A EQUAL TO "B" OR "C" AND E EQUAL TO "F")
Placing the AND inside the brackets has allowed the original problem for humans to be replicated. What does that mean, and how does it work?
One answer is this:
IF (A EQUAL TO ("B" OR "C") AND E EQUAL TO "F")
Perhaps clearer, but not to everyone, and again the original problem still exists, in the minor.
So:
IF A EQUAL TO "B"
OR A EQUAL TO "C"
Simplified, for the first part, but still that problem in the minor (just add AND ...), so:
IF (A EQUAL TO "B")
OR (A EQUAL TO "C")
Leading to:
IF ((A EQUAL TO "B")
OR (A EQUAL TO "C"))
And:
IF ((A EQUAL TO "B")
OR (A EQUAL TO C))
Now, if someone wants to augment with AND, it is easy and clear. If done at the same level as one of the condition parts, it solely attaches to that. If done at the outermost level, it attaches to both (all).
IF (((A EQUAL TO "B")
AND (E EQUAL TO "F"))
OR (A EQUAL TO "C"))
or
IF (((A EQUAL TO "B")
OR (A EQUAL TO "C"))
AND (E EQUAL TO "F"))
What if someone wants to insert the AND inside the brackets? Well, because inside the brackets it is simple, and people don't tend to do that. If what is inside the brackets is already complicated, it does tend to be added. It seems that something which is simple through being on its own tends not to be made complicated, whereas something which is already complicated (more than one thing, not on its own) tends to be made more complex without too much further thought.
COBOL is an old language. Many old programs written in COBOL are still running. Many COBOL programs have to be amended, or just read to understand something, and that many times over their lifetimes of many years.
When changing code, by adding something to a condition, it is best if the original parts of the condition do not need to be "disturbed". If complexity is left within brackets, it is more likely that code needs to be disturbed, which increases the amount of time in understanding (it is more complex) and changing (more care is needed, more testing necessary, because the code is disturbed).
Many old programs will be examples of bad practice. There is not much to do about that, except to be careful with them.
There isn't any excuse for writing new code which requires more maintenance and care in the future than is absolutely necessary.
Now, the above examples may be considered long-winded. It's COBOL, right? Lots of typing? But COBOL gives immense flexibility in data definitions. COBOL has, as part of that, the Level 88, the Condition Name.
Here are data definitions for part of the above:
01 A PIC X.
88 PARCEL-IS-OUTSIZED VALUE "B" "C".
01 F PIC X.
88 POSTAGE-IS-SUFFICIENT VALUE "F".
The condition becomes:
IF PARCEL-IS-OUTSIZED
AND POSTAGE-IS-SUFFICIENT
Instead of just literal values, all the relevant literal values now have a name, so that the coder can indicate what they actually mean, as well as the actual values which carry that meaning. If more categories should be added to PARCEL-IS-OUTSIZED, the VALUE clause on the 88-level is extended.
If another condition is to be combined, it is much more simple to do so.
Is this all true? Well, yes. Look at it this way.
COBOL operates on the results of a condition where coded.
If condition
Simple conditions can be compounded through the use of brackets, to make a condition:
If condition = If (condition) = If ((condition1) operator (condition2))...
And so on, to the limits of the compiler.
The human just has to deal with the condition they want for the purpose at hand. For general logic-flow, look at the If condition. For verification, look at the lowest detail. For a subset, look at the part of the condition relevant to the sub-set.
Use simple conditions. Make conditions simple through brackets/parentheses. Make complex conditions, where needed, by combining simple conditions. Use condition-names for comparisons to literal values.
OR and AND have been treated so far. NOT is often seen as something to treat warily:
IF NOT A EQUAL TO B
IF A NOT EQUAL TO B
IF (NOT (A EQUAL TO B)), remembering that this is just IF condition
So NOT is not scary, if it is made simple.
Throughout, I've been editing out spaces. Because the brackets are there, I like to make them in-your-face. I like to structure and indent conditions, to emphasize the meaning I have given them.
So:
IF ( ( ( condition1 )
OR ( condition2 ) )
AND
( ( condition3 )
OR ( condition4 ) ) )
(and more sculptured than that as well). By structuring, I hope that a) I mess up less and b) when/if I do mess up, someone has a better chance of noticing it.
If conditions are not simplified, then understanding the code is more difficult. Changing the code is more difficult. For people learning COBOL, keeping things simple is a long-term benefit to all.
As a rule, I avoid the use of AND if at all possible. Nested IF's work just as well, are easier to read, and with judicious use of 88-levels, do not have to go very deep. This seems so much easier to read, at least in my experience:
05 DL-CLASS-STANDING PIC X(20) VALUE SPACE.
88 DL-CLASS-STANDING-VALID VALUE 'First Yr' 'Second Yr'.
05 GRAD-STAT-IN PIC X VALUE SPACE.
88 GRAD-STAT-IN-VALID VALUE SPACE 'N'.
Then the code is as simple as this:
IF DL-CLASS-STANDING-VALID
IF GRAD-STAT-IN-VALID
ACTION ... .

Resources