I have a grammar for a three-address code assignment operation.
And I have a sample input and output.
But I don't understand how the second sample output comes from this grammar. Would anybody give the parse tree/derivation?
I'm trying to figure out how I can best parse just a subset of a given language with ANTLR. For example, say I'm looking to parse U-SQL. Really, I'm only interested in parsing certain parts of the language, such as query statements. I couldn't be bothered with parsing the many other features of the language. My current approach has been to design my lexer / parser grammar as follows:
// ...
statement
: queryStatement
| undefinedStatement
;
// ...
undefinedStatement
: (.)+?
;
// ...
UndefinedToken
: (.)+?
;
The gist is, I add a fall-back parser rule and lexer rule for undefined structures and tokens. I imagine later, when I go to walk the parse tree, I can simply ignore the undefined statements in the tree, and focus on the statements I'm interested in.
This seems like it would work, but is this an optimal strategy? Are there more elegant options available? Thanks in advance!
Parsing a subpart of a language is super easy. Usually you have a top-level rule which you call to parse the full input with the entire grammar.
For the subpart, call the generated function that parses only that subrule instead, like:
const expression = parser.statement();
I use this approach frequently when I want to parse stored procedures or data types only.
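With the Java target, for instance, that could look like the following (USqlLexer/USqlParser stand in for whatever classes ANTLR generated from your grammar):

import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;

// Build the usual pipeline, but invoke the subrule's generated method
// instead of the top-level entry rule.
USqlLexer lexer = new USqlLexer(CharStreams.fromString(input));
CommonTokenStream tokens = new CommonTokenStream(lexer);
USqlParser parser = new USqlParser(tokens);
USqlParser.StatementContext tree = parser.statement();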
Keep in mind, however, that subrules usually are not terminated with the EOF token (as the top-level rule should be). This means you get no syntax error if there is more than the subelement in the token stream (the parser just stops when the subrule has matched completely). If that's a problem for you, then add a copy of the subrule you want to parse, give it a dedicated name and end it with EOF, like this:
dataTypeDefinition: // For external use only. Don't reference this in the normal grammar.
dataType EOF
;
dataType: // type in sql_yacc.yy
type = (
...
Check the MySQL grammar for more details.
This general idea -- to parse the interesting bits of an input and ignore the sea of surrounding tokens -- is usually called "island parsing". There's an example of an island parser in the ANTLR reference book, although I don't know if it is directly applicable.
The tricky part of island parsing is getting the island boundaries right. If you miss a boundary, or recognise as a boundary something which isn't, then your parse will fail disastrously. So you need to understand the input at least well enough to be able to detect where the islands are. In your example, that might mean recognising a SELECT statement, for example. However, you cannot blindly recognise the string of letters SELECT because that string might appear inside a string constant or a comment or some other context in which it was never intended to be recognised as a token at all.
I suspect that if you are going to parse queries, you'll basically need to be able to recognise any token. So it's not going to be a sea of uninspected input characters; you can view it as a sea of recognised but unparsed tokens. In that case, it should be reasonably safe to parse a non-query statement as a keyword followed by arbitrary tokens other than ; and ending with a ;. (But you might need to recognise nested blocks; I don't really know what the possibilities are.)
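To make that last idea concrete, here is a rough Java sketch (the token-type parameters are hypothetical, not names from any real grammar) that slices a fully-lexed token list into statements at top-level semicolons while tracking BEGIN/END nesting:

import org.antlr.v4.runtime.Token;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: split recognised-but-unparsed tokens into statements
// at top-level ';' so that non-query statements can be skipped as a unit.
static List<List<Token>> splitStatements(List<Token> tokens, int semiType,
                                         int beginType, int endType) {
    List<List<Token>> statements = new ArrayList<>();
    List<Token> current = new ArrayList<>();
    int depth = 0; // nesting level of BEGIN ... END blocks
    for (Token t : tokens) {
        current.add(t);
        if (t.getType() == beginType) depth++;
        else if (t.getType() == endType) depth--;
        else if (t.getType() == semiType && depth == 0) {
            statements.add(current);
            current = new ArrayList<>();
        }
    }
    if (!current.isEmpty()) statements.add(current);
    return statements;
}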
I've been playing around with ANTLR to do a kind of Excel formula validation. ANTLR looks pretty nice; however, I have some doubts about the way it works.
Imagine I have a grammar that already knows about all kinds of tokens needed to perform Excel formula validation (rule references, operations, etc.). In this grammar, there is no valid token for currency symbols (€, £, etc.), though I have an 'ERROR_CHAR' token that matches anything: ERROR_CHAR: .;
Here's what I want to know about an example input: =€€€+SUM(1,2)
The formula is not valid
All the tokens after €€€ are valid and there are rules for them -> +SUM(1,2)
My parser only knows that € is invalid, but doesn't know about a sequence of ERROR_CHAR tokens such as €€€, and so the whole input is wrong and all subsequent tokens are caught by the error listener. I assume that this is because, based on my parser rules, I am not saying that ERROR_CHAR could be present anywhere in the input.
I don't want to skip those tokens, because I'd like to highlight the position of the error and I am already skipping whitespaces.
Do you have any idea how could I handle this?
If you just want to highlight the position of an error, let ANTLR detect errors and their positions. It does it quite well. No grammar changes required.
Use an ErrorListener to detect errors and handle them.
You can find more information here: Handling errors in ANTLR4
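For example, with the Java target a minimal listener might look like this (a sketch; wire the reported positions into whatever highlighting mechanism you use):

import org.antlr.v4.runtime.BaseErrorListener;
import org.antlr.v4.runtime.RecognitionException;
import org.antlr.v4.runtime.Recognizer;

// Reports each syntax error with its line and column for highlighting.
class HighlightingErrorListener extends BaseErrorListener {
    @Override
    public void syntaxError(Recognizer<?, ?> recognizer, Object offendingSymbol,
                            int line, int charPositionInLine, String msg,
                            RecognitionException e) {
        System.err.printf("error at %d:%d: %s%n", line, charPositionInLine, msg);
    }
}

// Usage: replace the default console listener with the custom one.
// parser.removeErrorListeners();
// parser.addErrorListener(new HighlightingErrorListener());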
I have a file in which each line represents a series of concatenated strings, like this:
302007030064201410241
30210704006426141
1021070400642614134
Each line starts with an operation code, and each operation has known rules for parsing the remaining part of the line.
What would be a good strategy for parsing these numbers? Any sample to start from would be great.
IMO, ANTLR won't be much use here, since the different pieces of information to parse all look alike: every token is just a digit.
Write a little state machine manually:
Read digits in a loop until that digit and its predecessors form a known "operation code" (this is simpler if all codes have the same length: you could wrap it in a function);
then, depending on that code (e.g. in a switch), call its specific decoding logic in a dedicated function, as in the sketch below.
Your resulting parser will look like a recursive descent parser.
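A minimal Java sketch of that idea (the codes 302 and 102 and the decode stubs are purely illustrative; substitute your real operation codes and rules):

import java.util.Set;

// Hypothetical sketch: accumulate digits until they form a known operation
// code, then dispatch to that operation's dedicated decoding function.
static void parseLine(String line) {
    Set<String> knownCodes = Set.of("302", "102"); // illustrative codes only
    int pos = 0;
    StringBuilder code = new StringBuilder();
    while (pos < line.length() && !knownCodes.contains(code.toString())) {
        code.append(line.charAt(pos++));
    }
    switch (code.toString()) {
        case "302" -> decode302(line.substring(pos));
        case "102" -> decode102(line.substring(pos));
        default    -> throw new IllegalArgumentException("unknown code: " + code);
    }
}

static void decode302(String rest) { /* apply the known rules for code 302 */ }
static void decode102(String rest) { /* apply the known rules for code 102 */ }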
I have modified the PL/SQL parser by Porcelli (https://github.com/porcelli/plsql-parser). I am using this parser to parse PL/SQL files. After successful parsing, I am printing the AST. Now, I want to edit the AST and print back the original PL/SQL source with the edited information. How can I achieve this? How can I get back the source file from the AST with comments, newlines and whitespace? The formatting should also remain as in the original file.
Any lead towards this would be helpful.
Each node in an AST comes with an index member which gives you the token position in the input stream (token stream, actually). When you examine the indexes in your AST you will see that not all indexes appear there (there are holes in the occurring indexes). These are the positions that have been filtered out (usually the whitespaces and comments).
Your input stream however is able to give you a token at a given index and, important, to give you every found token, regardless of the channel it is in. So, your strategy could be to iterate over the tokens from your token stream and print them out as they come along. Additionally, you can inspect your AST for the current index and see if instead a different output must be generated or additional output must be appended.
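With the Java runtime this loop only takes a few lines. A sketch, assuming tokens is the BufferedTokenStream your parser consumed:

import org.antlr.v4.runtime.BufferedTokenStream;
import org.antlr.v4.runtime.Token;

// Print every token, including those on hidden channels (whitespace, comments).
static void printAll(BufferedTokenStream tokens) {
    tokens.fill(); // make sure the entire stream is loaded
    for (Token t : tokens.getTokens()) {
        if (t.getType() == Token.EOF) break;
        // Here you could look up t.getTokenIndex() in your AST and emit
        // modified text instead of the original token text.
        System.out.print(t.getText());
    }
}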
The simple answer is "walk the tree, and spit out text that corresponds to the nodes".
ANTLR offers "StringTemplates" as a basic kind of help, but in fact there's a lot of fine detail that needs to be addressed: indentation, literals and their formats, comments, ...
See my SO answer on Compiling an AST back to source code for a lot more detail.
One thing not addressed there is the general need to reproduce the original character encoding of the file (if you can, sometimes you can't, e.g., you had an ASCII file but inserted a string containing a Unicode character).
This is a follow up to a previous question I asked How to encode FIRST & FOLLOW sets inside a compiler, but this one is more about the design of my program.
I am implementing the Syntax Analysis phase of my compiler by writing a recursive descent parser. I need to be able to take advantage of the FIRST and FOLLOW sets so I can handle errors in the syntax of the source program more efficiently. I have already calculated the FIRST and FOLLOW sets for all of my non-terminals, but am having trouble deciding where to logically place them in my program and what the best data structure would be for doing so.
Note: all code will be pseudo code
Option 1) Use a map, and map each non-terminal's name to the two sets that contain its FIRST and FOLLOW sets:
class ParseConstants
Map firstAndFollowMap = #create a map .....
firstAndFollowMap.put("<program>", [FIRST_SET, FOLLOW_SET])
end
This seems like a viable option, but inside my parser I would then need sorta ugly code like this to retrieve the FIRST and FOLLOW sets and pass them to the error function:
#processes the <program> non-terminal
def program
List list = firstAndFollowMap.get("<program>")
Set FIRST = list.get(0)
Set FOLLOW = list.get(1)
error(current_symbol, FOLLOW)
end
Option 2) Create a class for each non-terminal and have a FIRST and FOLLOW property:
class Program
FIRST = .....
FOLLOW = ....
end
this leads to code that looks a little nicer:
#processes the <program> non-terminal
def program
error(current_symbol, Program.FOLLOW)
end
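(In concrete Java terms, I imagine Option 2 would come out roughly like this; TokenType and the set contents are just placeholders for my real token types:)

import java.util.EnumSet;
import java.util.Set;

// Placeholder token enum; in the real compiler this would be my token types.
enum TokenType { BEGIN, ID, SEMI, EOF_TOKEN }

// One class per non-terminal, exposing its FIRST and FOLLOW sets as constants.
final class Program {
    static final Set<TokenType> FIRST  = EnumSet.of(TokenType.BEGIN, TokenType.ID);
    static final Set<TokenType> FOLLOW = EnumSet.of(TokenType.EOF_TOKEN);
}

// in the parser: error(currentSymbol, Program.FOLLOW)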
These are the two options I thought up. I would love to hear any other suggestions for encoding these two sets, and any critiques of or additions to the two approaches I posted would be helpful.
Thanks
I have also posted this question here: http://www.coderanch.com/t/570697/java/java/Encode-FIRST-FOLLOW-sets-recursive
You don't really need the FIRST and FOLLOW sets themselves; you compute them in order to build the parse table. That is a table of {<non-terminal, token> -> <action, rule>} if LL(k) (meaning: given a non-terminal on the stack and a token in the input, which action to take and, if it applies, which rule to apply), or a table of {<state, token> -> <action, state>} if (C|LA|)LR(k) (meaning: given a state on the stack and a token in the input, which action to take and which state to go to).
After you have this table, you don't need FIRST and FOLLOW any more.
If you are writing a semantic analyzer, you must assume the parser is working correctly. Phrase level error handling (which means handling parse errors) is totally orthogonal to semantic analysis.
This means that in case of a parse error, the phrase level error handler (PLEH) tries to fix the error. If it can't, parsing stops. If it can, the semantic analyzer shouldn't know whether there was an error that was fixed, or no error at all!
You can take a look at my compiler library for examples.
About phrase level error handling: again, you don't need FIRST and FOLLOW. Let's talk about LL(k) for now (simply because I haven't thought much about LR(k) yet). After you build the grammar table, you have many entries, like I said, of this form:
<non-terminal, token> -> <action, rule>
Now, when you parse, you take whatever is on top of the stack. If it is a terminal, you must match it against the input. If it doesn't match, the phrase level error handler kicks in:
Role one: handle missing terminals - simply generate a fake terminal of the type you need in your lexer and have the parser retry. You can do other things as well (for example, look ahead in the input: if the token you want is there, drop one token from the lexer).
If what you get from the stack is a non-terminal (T) instead, you must look at your lexer, get the lookahead and consult your table. If the entry <T, lookahead> exists, then you're good to go: follow the action and push to/pop from the stack. If, however, no such entry exists, again the phrase level error handler kicks in:
Role two: handle unexpected terminals - you can do many things to get past this. What you do depends on what T and lookahead are, and on your expert knowledge of your grammar.
Examples of the things you can do are:
Fail! You can do nothing
Ignore this terminal. This means that you push lookahead onto the stack (after pushing T back again) and have the parser continue. The parser would first match lookahead, throw it away and continue. Example: if you have an expression like *1+2/0.5, you can drop the unexpected * this way.
Change lookahead to something acceptable, push T back and retry. For example, an expression like 5id = 10; could be illegal because you don't accept ids that start with numbers; you could replace it with _5id, for example, and continue.
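Putting these pieces together, here's a rough Java sketch of the table-driven LL(1) loop with the two error-handling hooks; all of the types and handler bodies here are illustrative stand-ins, not a complete implementation:

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;
import java.util.Map;

// Illustrative symbol abstraction covering both terminals and non-terminals.
interface Symbol { boolean isTerminal(); }

class Ll1Parser {
    // <non-terminal, lookahead token> -> rule (right-hand side to push)
    Map<Symbol, Map<Symbol, List<Symbol>>> table;

    void parse(List<Symbol> input, Symbol start) {
        Deque<Symbol> stack = new ArrayDeque<>();
        stack.push(start);
        int pos = 0;
        while (!stack.isEmpty() && pos < input.size()) {
            Symbol top = stack.pop();
            Symbol lookahead = input.get(pos);
            if (top.isTerminal()) {
                if (top.equals(lookahead)) { pos++; continue; } // matched
                handleMissingTerminal(top, stack);              // role one
            } else {
                List<Symbol> rule = table.getOrDefault(top, Map.of()).get(lookahead);
                if (rule != null) {
                    // Push the rule's symbols so the leftmost ends up on top.
                    for (int i = rule.size() - 1; i >= 0; i--) stack.push(rule.get(i));
                } else {
                    handleUnexpectedTerminal(top, lookahead, stack); // role two
                }
            }
        }
    }

    void handleMissingTerminal(Symbol expected, Deque<Symbol> stack) {
        // e.g. pretend the expected terminal was seen and carry on
    }

    void handleUnexpectedTerminal(Symbol t, Symbol lookahead, Deque<Symbol> stack) {
        // e.g. fail, skip the lookahead, or repair it as described above
    }
}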