I would like to know what LR-attributed parsers can do and how it is implemented.
yacc generated parsers allow inherited attributes when the source of the attribute is a sibling that is located to the left using the $0, $-1, etc. specification syntax. With S -> A B B would be able to inherit a synthesized attribute from A but would not be able to inherit something from S. I think this is done by looking down 1 element from B in the stack which would be A.
Now zyacc doc says that they allow for LR-attributes grammars which is I guess more or less the same as yacc allows. Only that with zyacc those attributes are specified with the nonterminal (like parameters) and not just accessed within the semantic action. Are there any other difference like LR-attributes are mightier than the yacc inherited attributes or like LR-attributes are implemented differently (not just looking down the stack).
The point of LR attributed grammars is to make information seen in the left context,
available to the right extension.
Imagine your grammar had
R -> X S Y;
S -> A B;
You've already agreed that S can see attributes synthesized from X. In fact, those attributes can be available on the completion of parsing of X. Done properly, those attributes should be available to A and B, as they are parsed, as inherited attributes from S.
YACC doesn't implement any of this to my knowledge, unless you want to count the existence of the parse tree for X as being a "synthesized" attribute of parsing X.
How you implement attributed grammars depends on what you want to do. My company's main product, DMS, uses attribute grammars heavily, with no direction constraints. We simply build the full tree and the propagate the attributes as needed.
What we do is pre-compute, for each node type, the set of attributes [and their types] it may inherit, and the set it may synthesize, and synthesize a struct for each. At attribute evaluation time, these structs are associated with tree nodes via a very fast access hash table. For each node type, we inspect the dataflows (which child consumes which inherited attribute, which children use synthesized attributes from which other children). From that we compute an order of execution to cause all the attributes to be computed in a safe (generated-before-consumed) order and generate a procedure to accomplish this for that node type, which calls children procedures. Attribute evaluation then consists of calling the generated procedure for the grammar root. (In fact, we actually generate a partial order for evaluating the children, and generate a partial-order parallel call using DMS's implementation parallel programming language's capabilities, ensuring fast evaluation using multiple cores on very big ASTs).
There isn't any reason you couldn't limit this process to LR attributes. (Someday we'll push LR-compatible attributes into the parsing phase to allow their use in semantic checks).
It shouldn't surprise you that the device that generates the attribute evaluation process, is itself an attribute-evaluator that operates on grammars. Bootstrapping that was a bit of fun.
Related
I am playing around with Tatsu to implement a parser for a language used in the semiconductor industry. This language requires that variables be defined before usage. So for example:
SignalGroup { A: In; B: Out};
Pattern {
V {A=1, B=1 }
V {A=1, B=0 }
};
In this case, the SignalGroup block must come before the Pattern block. How do I enforce/implement this "ordering" when writing the grammer in TatSu?
Although for some languages it is possible to write grammars that verify if the same symbol appears on different places, the grammars usually end up being too complicated to be useful.
Compilers (translators) are usually implemented with separate lexical, syntactical, and semantic analyzer components. There are several reasons for that:
Each component is so well focused that it is clearer and easier to write.
Each component is very efficient
The most common errors (which are exactly lexical, syntactical, and semantic) can be reported earlier
With those components in mind, checking if a symbol has ben previously defined belongs to the semantic (meaning) aspect of the program, and the way to check is to keep a symbol table that is filled when the definition parts of the input are being parsed, and queried on the use parts of the input are being parsed.
In TatSu in particular the different components are well separated, yet run in parallel. For your requirement you just need to use the simplest grammar that allows for the semantic actions that store and query the symbols. By raising FailedSemantics from within semantic actions, any semantic errors will be reported exactly as the lexical and syntactical ones so the user doesn't have to think about which component flagged each error.
If you use the Python parser generation in TatSu, the translator will generate the skeleton of a semantic actions class as part of the output.
I know what is a Parse Tree and what is an Abstract Tree but I after reading some about Annotated Parse Tree(as we draw detailed tree which is same as Parse Tree), I feel that they are same as Parse Tree.
Can anyone please explain differences among these three in detail ?
Thanks.
AN ANNOTATED PARSE TREE is a parse tree showing the values of the attributes at each node. The process of computing the attribute values at the nodes is called annotating or decorating the parse tree.
For example: Refer link below, it is annotated parse tree for 3*5+4n
https://i.stack.imgur.com/WAwdZ.png
A parse tree is a representation of how a source text (of a program) has been decomposed to demonstate it matches a grammar for a language. Interior nodes in the tree are language grammar nonterminals (BNF rule left hand side tokens), while leaves of the tree are grammar terminals (all the other tokens) in the order required by grammar rules.
An annotated parse tree is one in which various facts about the program have been attached to parse tree nodes. For example, one might compute the set of identifiers that each subtree mentions, and attach that set to the subtree. Compilers have to store information they have collected about the program somewhere; this is a convenient place to store information which is derivable form the tree.
An activation tree is conceptual snapshot of the result of a set of procedures calling one another at runtime. Node in such a tree represent procedures which have run; childen represent procedures called by their parent.
So a key difference between (annotated) parse trees and activation trees is what they are used to represent: compile time properties vs. runtime properties.
An annotated parse tree lets you intergrate the entire compilation into the parse tree structure. CM Modula-3 does that if im not mistaken.
To build an APT, simply declare an abstract base class of nodes, subclass each production on it and declare the child nodes as field variables.
I was given a fragment of code (a function called bubbleSort(), written in Java, for example). How can I, or rather my program, tell if a given source code implements a particular sorting algorithm the correct way (using bubble method, for instance)?
I can enforce a user to give a legitimate function by analyzing function signature: making sure the the argument and return value is an array of integers. But I have no idea how to determine that algorithm logic is being done the right way. The input code could sort values correctly, but not in an aforementioned bubble method. How can my program discern that? I do realize a lot of code parsing would be involved, but maybe there's something else that I should know.
I hope I was somewhat clear.
I'd appreciate if someone could point me in the right direction or give suggestions on how to tackle such a problem. Perhaps there are tested ways that ease the evaluation of program logic.
In general, you can't do this because of the Halting problem. You can't even decide if the function will halt ("return").
As a practical matter, there's a bit more hope. If you are looking for a bubble sort, you can decide that it has number of parts:
a to-be-sorted datatype S with a partial order,
a container data type C with single instance variable A ("the array")
that holds the to-be-sorted data
a key type K ("array index") used to access the container that has a partial order
such that container[K] is type S
a comparison of two members of container, using key A and key B
such that A < B according to the key partial order, that determines
if container[B]>container of A
a swap operation on container[A], container[B] and some variable T of type S, that is conditionaly dependent on the comparison
a loop wrapped around the container that enumerates keys in according the partial order on K
You can build bits of code that find each of these bits of evidence in your source code, and if you find them all, claim you have evidence of a bubble sort.
To do this concretely, you need standard program analysis machinery:
to parse the source code and build an abstract syntax tree
build symbol tables (ST) that know the type of each identifier where it is used
construct a control flow graph (CFG) so that you check that various recognized bits occur in appropriate ordering
construct a data flow graph (DFG), so that you can determine that values recognized in one part of the algorithm flow properly to another part
[That's a lot of machinery just to get started]
From here, you can write ad hoc code procedural code to climb over the AST, ST, CFG, DFG, to "recognize" each of the individual parts. This is likely to be pretty messy as each recognizer will be checking these structures for evidence of its bit. But, you can do it.
This is messy enough, and interesting enough, so there are tools which can do much of this.
Our DMS Software Reengineering Toolkit is one. DMS already contains all the machinery to do standard program analysis for several languages. DMS also has a Dataflow pattern matching language, inspired by Rich and Water's 1980's "Programmer's Apprentice" ideas.
With DMS, you can express this particular problem roughly like this (untested):
dataflow pattern domain C;
dataflow pattern swap(in out v1:S, in out v2:S, T:S):statements =
" \T = \v1;
\v1 = \v2;
\v2 = \T;";
dataflow pattern conditional_swap(in out v1:S, in out v2:S,T:S):statements=
" if (\v1 > \v2)
\swap(\v1,\v2,\T);"
dataflow pattern container_access(inout container C, in key: K):expression
= " \container.body[\K] ";
dataflow pattern size(in container:C, out: integer):expression
= " \container . size "
dataflow pattern bubble_sort(in out container:C, k1: K, k2: K):function
" \k1 = \smallestK\(\);
while (\k1<\size\(container\)) {
\k2 = \next\(k1);
while (\k2 <= \size\(container\) {
\conditionalswap\(\container_access\(\container\,\k1\),
\container_access\(\container\,\k2\) \)
}
}
";
Within each pattern, you can write what amounts to the concrete syntax of the chosen programming language ("pattern domain"), referencing dataflows named in the pattern signature line. A subpattern can be mentioned inside another; one has to pass the dataflows to and from the subpattern by naming them. Unlike "plain old C", you have to pass the container explicitly rather than by implicit reference; that's because we are interested in the actual values that flow from one place in the pattern to another. (Just because two places in the code use the same variable, doesn't mean they see the same value).
Given these definitions, and ask to "match bubble_sort", DMS will visit the DFG (tied to CFG/AST/ST) to try to match the pattern; where it matches, it will bind the pattern variables to the DFG entries. If it can't find a match for everything, the match fails.
To accomplish the match, each of patterns above is converted essentially into its own DFG, and then each pattern is matched against the DFG for the code using what is called a subgraph isomorphism test. Constructing the DFG for the patter takes a lot of machinery: parsing, name resolution, control and data flow analysis, applied to fragments of code in the original language, intermixed with various pattern meta-escapes. The subgraph isomorphism is "sort of easy" to code, but can be very expensive to run. What saves the DMS pattern matchers is that most patterns have many, many constraints [tech point: and they don't have knots] and each attempted match tends to fail pretty fast, or succeed completely.
Not shown, but by defining the various bits separately, one can provide alternative implementations, enabling the recognition of variations.
We have used this to implement quite complete factory control model extraction tools from real industrial plant controllers for Dow Chemical on their peculiar Dowtran language (meant building parsers, etc. as above for Dowtran). We have version of this prototyped for C; the data flow analysis is harder.
I have what I imagine will be a fairly involved technical challenge: I want to be able to reliably alpha-rename identifiers in multiple languages (as many as possible). This will require special consideration for each language, and I'm asking for advice for how to minimize the amount of work I need to do by sharing code. Something like a unified parsing or abstract syntax framework that already has support for many languages would be great.
For example, here is some python code:
def foo(x):
def bar(y):
return x+y
return bar
An alpha renaming of x to y changes the x to a y and preserves semantics. So it would become:
def foo(y):
def bar(y1):
return y+y1
return bar
See how we needed to rename y to y1 in order to keep from breaking the code? That is why this is a hard problem. It seems like the program would have to have a pretty good knowledge of what constitutes a scope, rather than just doing, say, a string search and replace.
I would also like to preserve as much of the formatting as possible: comments, spacing, indentation. But that is not 100% necessary, it would just be nice.
Any tips?
To do this safely, you need to be able to to determine
all the identifiers (and those things that are not, e.g., the middle of a comment) in your code
the scopes of validity for each identifer
the ability to substitute a new identifier for an old one in the text
the ability to determine if renaming an identifier causes another name to be shadowed
To determine identifiers accurately, you need a least a langauge-accurate lexer. Identifiers in PHP look different than the do in COBOL.
To determine scopes of validity, you have to be determine program structure in practice, since most "scopes" are defined by such structure. This means you need a langauge-accurate parser; scopes in PHP are different than scopes in COBOL.
To determine which names are valid in which scopes, you need to know the language scoping rules. Your language may insist that the identifier X will refer to different Xes depending on the context in which X is found (consider object constructors named X with different arguments). Now you need to be able to traverse the scope structures according to the naming rules. Single inheritance, multiple inheritance, overloading, default types all will pretty much require you to build a model of the scopes for the programs, insert the identifiers and corresponding types into each scope, and then climb from the point of encounter of an identifier in the program text through the various scopes according to the language semantics. You will need symbol tables, inheritance linkages, ASTs, and the ability to navigage all of these. These structures are different from PHP and COBOL, but they share lots of common ideas so you likely need a library with the common concept support.
To rename an identifier, you have to modify the text. In a million lines of code, you need to point carefully. Modifying an AST node is one way to point carefully. Actually, you need to modify all the identifiers that correspond to the one being renamed; you have to climb over the tree to find them all, or record in the AST where all the references exist so they can be found easily. After modifyingy the tree you have to regenerate the source text after modifying the AST. That's a lot of machinery; see my SO answer on how to prettyprint ASTs preseriving all of the stuff you reasonably suggest should be preserved.
(Your other choice is to keep track in the AST of where the text for the string is,
and the read/patch/write the file.)
Before you update the file, you need to check that you haven't shadowed something. Consider this code:
{ local x;
x=1;
{local y;
y=2;
{local z;
z=y
print(x);
}
}
}
We agree this code prints "1". Now we decide to rename y to x.
We've broken the scoping, and now the print statement which referred
conceptually to the outer x refers to an x captured by the renamed y. The code now prints "2", so our rename broke it. This means that one must check all the other identifiers in scopes in which the renamed variable might be found, to see if the new name "captures" some name we weren't expecting. (This would be legal if the print statement printed z).
This is a lot of machinery.
Yes, there is a framework that has almost all of this as well as a number of robust language front ends. See our DMS Software Reengineering Toolkit. It has parsers producing ASTs, prettyprinters to produce text back from ASTs, generic symbol table management machinery (including support for multiple inheritance), AST visiting/modification machinery. Ithas prettyprinting machinery to turn ASTs back into text. It has front ends for C, C++, COBOL and Java that implement name and type resolution (e.g. instanting symbol table scopes and identifier to symbol table entry mappings); it has front ends for many other langauges that don't have scoping implemented yet.
We've just finished an exercise in implementing "rename" for Java. (All the above issues of course appeared). We about about to start one for C++.
You could try to create Xtext based implementations for the involved languages. The Xtext framework provides reliable infrastructure for cross language rename refactoring. However, you'll have to provide a grammar a at least a "good enough" scope resolution for each language.
Languages mostly guarantee tokens will be unique, whatever the context. A naive first approach (and this will break many, many pieces of code) would be:
cp file file.orig
sed -i 's/\b(newTokenName)\b/TEMPTOKEN/g' file
sed -i 's/\b(oldTokenName)\b/newTokenName/g' file
With GNU sed, this will break on PHP. Rewriting \b to a general token match, like ([^a-zA-Z~$-_][^a-zA-Z0-9~$-_]) would work on most C, Java, PHP, and Python, but not Perl (need to add # and % to the token characters. Beyond that, it would require a plugin architecture that works for any language you wanted to add. At some point, there will be two languages whose variable and function naming rules will be incompatible, and at that point, you'll need to do more and more in the plugin.
I came across the original statement of the Liskov Substitution Principle on Ward's wiki tonight:
What is wanted here is something like the following substitution property: If for each object o1 of type S there is an object o2 of type T such that for all programs P defined in terms of T, the behavior of P is unchanged when o1 is substituted for o2 then S is a subtype of T." - Barbara Liskov, Data Abstraction and Hierarchy, SIGPLAN Notices, 23,5 (May, 1988).
I've always been crap at parsing predicate logic (I failed Calc IV the first time though), so while I kind of understand how the above translates to:
Functions that use pointers or references to base classes must be able to use objects of derived classes without knowing it.
what I don't understand is why the property Liskov describes implies that S is a subtype of T and not the other way around.
Maybe I just don't know enough yet about OOP, but why does Liskov's statement only allow for the possibility S -> T, and not T -> S?
the hypothetical set of programs P (in terms of T) is not defined in terms of S, and therefore it doesn't say much about S. On the other hand, we do say that S works as well as T in that set of programs P, and so we can draw conclusions about S and its relationship to T.
One way to think about this is that P demands certain properties of T. S happens to satisfy those properties. Perhaps you could even say that 'every o1 in S is also in T'. This conclusion is used to define the word subtype.
As pointed by sepp2k there are multiple views explaining this in the other post.
Here are my two cents about it.
I like to look at it this way
If for each object o1 of type TallPerson there is an object o2 of type Person such that for all programs P defined in terms of Person, the behavior of P is unchanged when o1 is substituted for o2 then TallPerson is a subtype of Person. (replacing S with TallPerson and T with Person)
We typically have a perspective of objects deriving some base class have more functionality, since its extended. However with more functionality we are specializing it and reducing the scope in which they can be used thus becoming subtypes to its base class(broader type).
A derived class inherits its base class's public interface and is expected to either use the inherited implementation or to provide an implementation that behaves similarly (e.g., a Count() method should return the number of elements regardless of how those elements are stored.)
A base class wouldn't necessarily have the interface of any (let alone all) of its derived classes so it wouldn't make sense to expect an arbitrary reference to a base class to be substitutable for a specified derived class. Even if it appears that only the subset of the interface that supported by the base class's interface is required, that might not be the case (e.g., it could be that a shadowing method in the specific derived class is referred to).