Generating parse tree from CYK algorithm - parsing

I use CYK algorithm (already implemented it in Java) to see if a string recognized according to a specific grammar. Now I need to generate a parse tree for the string, is the a way to generate the tree from the matrix which I use when using the CYK algorithm?

When implementing CYK as just a recognizer, then the boxes in the chart are generally just a set of bits (or other boolean values) that correspond to the productions that might apply at that point. That doesn't leave you enough information to reconstruct the parse tree.
If you instead store a set of objects, those objects include the non-terminal and keep track of the two productions that were combined. When you're done, you check if your final box contains an object which represents a start symbol production. If it does, you can follow the pointers back to reconstruct the parse tree.

Related

Store NFA into data structure

I am provided with a NFA and I need to use a data structure (I can not use recursive descent parser) for storing it. Once the NFA is stored in a data structure I am given a string to check if the string is valid according to the NFA given or not.
Can someone please suggest a data structure for storing a NFA? Also if there are any opensource c language examples that would help a lot.
An NFA is just a set of triples State x Input -> State. It's usually convenient to represent a State with a small integer in a consecutive range starting at 0 (or some other defined starting point). Input symbols can also be mapped onto small integers, either directly (ascii code if all the transitions are ascii characters) or by keeping an inventory while you read the machine. Making a list of triples is highly inefficient and making a hash table is overkill; a plausible intermediate is a two-dimensional array. Remember that the machine is Nondeterministic, so a given [state, input symbol] pair might map to a set of next states.
You can determinize the NFA into a DFA using the Subset Construction. That simplifies the data structure but it can also blow up exponentially in size.

LL top-down parser, from CST to AST

I am currently learning about syntax analysis, and more especially, top-down parsing.
I know the terminology and the difference with bottom-up LR parsers, and since top-down LL parsers are easier to implement by hand, I am looking forward to make my own.
I have seen two kinds of approach:
The recursive-descent one using a collection of recursive functions.
The stack-based and table-driven automaton as shown here on Wikipedia.
I am more interested by the latter, for its power and its elimination of call-stack recursion. However, I don't understand how to build the AST from the implicit parse tree.
This code example of a stack-based finite automaton show the parser analyzing the input buffer, only giving a yes/no answer if the syntax has been accepted.
I have heard of stack annotations in order to build the AST, but I can't figure out how to implement them. Can someone provide a practical implementation of such technique ?
"Top-down" and "bottom-up" are excellent descriptions of the two parsing strategies, because they describe precisely how the syntax tree would be constructed if it were constructed. (You can also think of it as the traversal order over the implicit parse tree but here we're actually interested in real parse trees.)
It seems clear that there is an advantage to bottom-up tree construction. When it is time to add a node to the tree, you already know what its children are. You can construct the node fully-formed in one (functional) action. All the child information is right there waiting for you, so you can add semantic information to the node based on the semantic information of its children, even using the children in an order other than left-to-right.
By contrast, the top-down parser constructs the node without any children, and then needs to add each child in turn to the already constructed node. That's certainly possible, but it's a bit ugly. Also, the incremental nature of the node constructor means that semantic information attached to the node also needs to be computed incrementally, or deferred until the node is fully constructed.
In many ways, this is similar to the difference between evaluating expressions written in Reverse Polish Notation (RPN) from expressions written in (Forward) Polish Notation [Note 1]. RPN was invented precisely to ease evaluation, which is possible with a simple value stack. Forward Polish expressions can be evaluated, obviously: the easiest way is to use a recursive evaluator but in environments where the call stack can not be relied upon it is possible to do it using an operator stack, which effectively turns the expression into RPN on the fly.
So that's probably the mechanism of choice for building syntax trees from top-down parsers as well. We add a "reduction" marker to the end of every right-hand side. Since the marker goes at the end of the right-hand side, so it is pushed first.
We also need a value stack, to record the AST nodes (or semantic values) being constructed.
In the basic algorithm, we now have one more case. We start by popping the top of the parser stack, and then examine this object:
The top of the parser stack was a terminal. If the current input symbol is the same terminal, we remove the input symbol from the input, and push it (or its semantic value) onto the value stack.
The top of the parser stack was a marker. The associated reduction action is triggered, which will create the new AST node by popping an appropriate number of values from the value stack and combining them into a new AST node which is then pushed onto the value stack. (As a special case, the marker action for the augmented start symbol's unique production S' -> S $ causes the parse to be accepted, returning the (only) value in the value stack as the AST.)
The top of the parser stack was a non-terminal. We then identify the appropriate right-hand side using the current input symbol, and push that right-hand side (right-to-left) onto the parser stack.
You need to understand the concept behind. You need to understand the concept of pushdown automaton. After you understand how to make computation on paper with pencil you will be able to understand multiple ways to implement its idea, via recursive descent or with stack. The ideas are the same, when you use recursive descent you implicitly have the stack that the program use for execution, where the execution data is combined with the parsing automaton data.
I suggest you to start with the course taught by Ullman (automata) or Dick Grune, this one is the best focused on parsing. (the book of Grune is this one), look for the 2nd edition.
For LR parsing the essential is to understand the ideas of Earley, from these ideas Don Knuth created the LR method.
For LL parsing, the book of Grune is excellent, and Ullman presents the computation on paper, the math background of parsing that is essential to know if you want to implement your own parsers.
Concerning the AST, this is the output of parsing. A parser will generate a parsing tree that is transformed in AST or can generate and output directly the AST.

simplify equations/expressions using Javacc/jjtree

I have created a grammar to read a file of equations then created AST nodes for each rule.My question is how can I do simplification or substitute vales on the equations that the parser is able to read correctly. in which stage? before creating AST nodes or after?
Please provide me with ideas or tutorials to follow.
Thank you.
I'm assuming you equations are something like simple polynomials over real-value variables, like X^2+3*Y^2
You ask for two different solutions to two different problems that start with having an AST for at least one equation:
How to "substitute values" into the equation and compute the resulting value, e.g, for X==3 and Y=2, substitute into the AST for the formula above and compute 3^2+3*2^2 --> 21
How to do simplification: I assume you mean algebraic simplification.
The first problem of substituting values is fairly easy if yuo already have the AST. (If not, parse the equation to produce the AST first!) Then all you have to do is walk the AST, replacing every leaf node containing a variable name with the corresponding value, and then doing arithmetic on any parent nodes whose children now happen to be numbers; you repeat this until no more nodes can be arithmetically evaluated. Basically you wire simple arithmetic into a tree evaluation scheme.
Sometimes your evaluation will reduce the tree to a single value as in the example, and you can print the numeric result My SO answer shows how do that in detail. You can easily implement this yourself in a small project, even using JavaCC/JJTree appropriately adapted.
Sometimes the formula will end up in a state where no further arithmetic on it is possible, e.g., 1+x+y with x==0 and nothing known about y; then the result of such a subsitution/arithmetic evaluation process will be 1+y. Unfortunately, you will only have this as an AST... now you need to print out the resulting AST in order for the user to see the result. This is harder; see my SO answer on how to prettyprint a tree. This is considerably more work; if you restrict your tree to just polynomials over expressions, you can still do this in small project. JavaCC will help you with parsing, but provides zero help with prettyprinting.
The second problem is much harder, because you must not only accomplish variable substitution and arithmetic evaluation as above, but you have to somehow encode knowledge of algebraic laws, and how to match those laws to complex trees. You might hardwire one or two algebraic laws (e.g., x+0 -> x; y-y -> 0) but hardwiring many laws this way will produce an impossible mess because of how they interact.
JavaCC might form part of such an answer, but only a small part; the rest of the solution is hard enough so you are better off looking for an alternative rather than trying to build it all on top of JavaCC.
You need a more organized approach for this: a Program Transformation System (PTS). A typical PTS will allow you specify
a grammar for an arbitrary language (in your case, simply polynomials),
automatically parses instance to ASTs and can regenerate valid text from the AST. A good PTS will let you write source-to-source transformation rules that the PTS will apply automatically the instance AST; in your case you'd write down the algebraic laws as source-to-source rules and then the PTS does all the work.
An example is too long to provide here. But here I describe how to define formulas suitable for early calculus classes, and how to define algebraic rules that simply such formulas including applying some class calculus derivative laws.
With sufficient/significant effort, you can build your own PTS on top of JavaCC/JJTree. This is likely to take a few man-years. Easier to get a PTS rather than repeat all that work.

What is difference between Parse Tree, Annotated Parse Tree and Activation Tree ?(compiler)

I know what is a Parse Tree and what is an Abstract Tree but I after reading some about Annotated Parse Tree(as we draw detailed tree which is same as Parse Tree), I feel that they are same as Parse Tree.
Can anyone please explain differences among these three in detail ?
Thanks.
AN ANNOTATED PARSE TREE is a parse tree showing the values of the attributes at each node. The process of computing the attribute values at the nodes is called annotating or decorating the parse tree.
For example: Refer link below, it is annotated parse tree for 3*5+4n
https://i.stack.imgur.com/WAwdZ.png
A parse tree is a representation of how a source text (of a program) has been decomposed to demonstate it matches a grammar for a language. Interior nodes in the tree are language grammar nonterminals (BNF rule left hand side tokens), while leaves of the tree are grammar terminals (all the other tokens) in the order required by grammar rules.
An annotated parse tree is one in which various facts about the program have been attached to parse tree nodes. For example, one might compute the set of identifiers that each subtree mentions, and attach that set to the subtree. Compilers have to store information they have collected about the program somewhere; this is a convenient place to store information which is derivable form the tree.
An activation tree is conceptual snapshot of the result of a set of procedures calling one another at runtime. Node in such a tree represent procedures which have run; childen represent procedures called by their parent.
So a key difference between (annotated) parse trees and activation trees is what they are used to represent: compile time properties vs. runtime properties.
An annotated parse tree lets you intergrate the entire compilation into the parse tree structure. CM Modula-3 does that if im not mistaken.
To build an APT, simply declare an abstract base class of nodes, subclass each production on it and declare the child nodes as field variables.

NLP - How would you parse highly noisy sentence (with Earley parser)

I need to parse a sentence. Now I have an implemented Earley parser and a grammar for it. And everything works just fine when a sentence has no misspellings. But the problem is a lot of sentences I have to deal with are highly noisy. I wonder if there's an algorithm which combines parsing with errors correction? Possible errors are:
typos 'cheker' instead of 'checker'
typos like 'spellchecker' instead of 'spell checker'
contractions like 'Ear par' instead 'Earley parser'
If you know an article which can answer my question I would appriciate a link to it.
I assume you are using a tagger (or lexer) stage that is applied before the Earley parser, i.e. an algorithm that splits the input string into tokens and looks each token up in a dictionary to determine its part-of-speech (POS) tag(s):
John --> PN
loves --> V
a --> DT
woman --> NN
named --> JJ,VPP
Mary --> PN
It should be possible to build some kind of approximate string lookup (aka fuzzy string lookup) into that stage, so when it is presented with a misspelled token, such as 'lobes' instead of 'loves', it will not only identify the tags found by exact string matching ('lobes' as a noun plural of 'lobe'), but also tokens that are similar in shape ('loves' as third-person singular of verb 'love').
This will imply that you generally get a larger number of candidate tags for each token, and therefore a larger number of possible parse results during parsing. Whether or not this will produce the desired result depends on how comprehensive the grammar is, and how good the parser is at identifying the correct analysis when presented with many possible parse trees. A probabilistic parser may be better for this, as it assigns every candidate parse tree a probability (or confidence score), which may be used to select the most likely (or best) analysis.
If this is the solution you'd like to try, there are several possible implementation strategies. Firstly, if the tokenization and tagging is performed as a simple dictionary lookup (i.e. in the style of a lexer), you may simply use a data structure for the dictionary that enables approximate string matching. General methods for approximate string comparison are described in Approximate string matching algorithms, while methods for approximate string lookup in larger dictionaries are discussed in Quickly compare a string against a Collection in Java.
If, however, you use an actual tagger, as opposed to a lexer, i.e. something that performs POS disambiguation in addition to mere dictionary lookup, you will have to build the approximate dictionary lookup into that tagger. There must be a dictionary lookup function, which is used to generate candidate tags before disambiguation is applied, somewhere in the tagger. That dictionary lookup will have to be replaced with one that enables approximate string lookup.

Resources