I'm trying to figure out how I can best parse just a subset of a given language with ANTLR. For example, say I'm looking to parse U-SQL. Really, I'm only interested in parsing certain parts of the language, such as query statements. I couldn't be bothered with parsing the many other features of the language. My current approach has been to design my lexer / parser grammar as follows:
// ...
statement
: queryStatement
| undefinedStatement
;
// ...
undefinedStatement
: (.)+?
;
// ...
UndefinedToken
: (.)+?
;
The gist is, I add a fall-back parser rule and lexer rule for undefined structures and tokens. I imagine later, when I go to walk the parse tree, I can simply ignore the undefined statements in the tree, and focus on the statements I'm interested in.
This seems like it would work, but is this an optimal strategy? Are there more elegant options available? Thanks in advance!
Parsing a subpart of a grammar is super easy. Usually you have a top level rule which you call to parse the full input with the entire grammar.
For the subpart use the function that parses only a subrule like:
const expression = parser.statement();
I use this approach frequently when I want to parse stored procedures or data types only.
Keep in mind however, that subrules usually are not termined with the EOF token (as the top level rule should be). This will cause no syntax error if more than the subelement is in the token stream (the parser just stops when the subrule has matched completely). If that's a problem for you then add a copy of the subrule you wanna parse, give it a dedicated name and end it with EOF, like this:
dataTypeDefinition: // For external use only. Don't reference this in the normal grammar.
dataType EOF
;
dataType: // type in sql_yacc.yy
type = (
...
Check the MySQL grammar for more details.
This general idea -- to parse the interesting bits of an input and ignore the sea of surrounding tokens -- is usually called "island parsing". There's an example of an island parser in the ANTLR reference book, although I don't know if it is directly applicable.
The tricky part of island parsing is getting the island boundaries right. If you miss a boundary, or recognise as a boundary something which isn't, then your parse will fail disastrously. So you need to understand the input at least well enough to be able to detect where the islands are. In your example, that might mean recognising a SELECT statement, for example. However, you cannot blindly recognise the string of letters SELECT because that string might appear inside a string constant or a comment or some other context in which it was never intended to be recognised as a token at all.
I suspect that if you are going to parse queries, you'll basically need to be able to recognise any token. So it's not going to be sea of uninspected input characters. You can view it as a sea of recognised but unparsed tokens. In that case, it should be reasonably safe to parse a non-query statement as a keyword followed by arbitrary tokens other than ; and ending with a ;. (But you might need to recognise nested blocks; I don't really know what the possibilities are.)
Related
Most interpreters let you type the following at their console:
>> a = 2
>> a+3
5
>>
My question is what mechanisms are usually used to handle this syntax? Somehow the parser is able to distinguish between an assignment and an expression even though they could both start with a digit or letter. It's only when we retrieve the second token that you know if you have an assignment or not. In the past, I've looked ahead two tokens and if the second token isn't an equals I push the tokens back into the lexical stream and assume it's an expression. I suppose one could treat the assignment as an expression which I think some languages do. I thought of using left-factoring but I can't see it working.
eg
assignment = variable A
A = '=' expression | empty
Update I found this question on StackOverflow which address the same question: How to modify parsing grammar to allow assignment and non-assignment statements?
From how you're describing your approach - doing a few tokens of lookahead to decide how to handle things - it sounds like you're trying to write some sort of top-down parser along the lines of an LL(1) or an LL(2) parser, and you're trying to immediately decide whether the expression you're parsing is a variable assignment or an arithmetical expression. There are several ways that you could parse expressions like these quite naturally, and they essentially involve weakening one of those two assumptions.
The first way we could do this would be to switch from using a top-down parser like an LL(1) or LL(2) parser to something else like an LR(0) or SLR(1) parser. Those parsers work bottom-up by reading larger prefixes of the input string before deciding what they're looking at. In your case, a bottom-up parser might work by seeing the variable and thinking "okay, I'm either going to be reading an expression to print or an assignment statement, but with what I've seen so far I can't commit to either," then scanning more tokens to see what comes next. If they see an equals sign, great! It's an assignment statement. If they see something else, great! It's not. The nice part about this is that if you're using a standard bottom-up parsing algorithm like LR(0), SLR(1), LALR(1), or LR(1), you should probably find that the parser generally handles these sorts of issues quite well and no special-casing logic is necessary.
The other option would be to parse the entire expression assuming that = is a legitimate binary operator like any other operation, and then check afterwards whether what you parsed is a legal assignment statement or not. For example, if you use Dijkstra's shunting-yard algorithm to do the parsing, you can recover a parse tree for the overall expression, regardless of whether it's an arithmetical expression or an assignment. You could then walk the parse tree to ask questions like
if the top-level operation is an assignment, is the left-hand side a single variable?
if the top-level operation isn't an assignment, are there nested assignment statements buried in here that we need to get rid of?
In other words, you'd parse a broader class of statements than just the ones that are legal, and then do a postprocessing step to toss out anything that isn't valid.
I'm currently in the process of creating a programming language. I've laid out my entire design and am in progress of creating the Lexer for it. I have created numerous lexers and lexer generators in the past, but have never come to adopt the "standard", if one exists.
Is there a specific way a lexer should be created to maximise capability to use it with as many parsers as possible?
Because the way I design mine, they look like the following:
Code:
int main() {
printf("Hello, World!");
}
Lexer:
[
KEYWORD:INT, IDENTIFIER:"main", LEFT_ROUND_BRACKET, RIGHT_ROUNDBRACKET, LEFT_CURLY_BRACKET,
IDENTIFIER:"printf", LEFT_ROUND_BRACKET, STRING:"Hello, World!", RIGHT_ROUND_BRACKET, COLON,
RIGHT_CURLY_BRACKET
]
Is this the way Lexer's should be made? Also as a side-note, what should my next step be after creating a Lexer? I don't really want to use something such as ANTLR or Lex+Yacc or Flex+Bison, etc. I'm doing it from scratch.
If you don't want to use a parser generator [Note 1], then it is absolutely up to you how your lexer provides information to your parser.
Even if you do use a parser generator, there are many details which are going to be project-dependent. Sometimes it is convenient for the lexer to call the parser with each token; other times is is easier if the parser calls the lexer; in some cases, you'll want to have a driver which interacts separately with each component. And clearly, the precise datatype(s) of your tokens will vary from project to project, which can have an impact on how you communicate as well.
Personally, I would avoid use of global variables (as in the original yacc/lex protocol), but that's a general style issue.
Most lexers work in streaming mode, rather than tokenizing the entire input and then handing the vector of tokens to some higher power. Tokenizing one token at a time has a number of advantages, particularly if the tokenization is context-dependent, and, let's face it, almost all languages have some impurity somewhere in their syntax. But, again, that's entirely up to you.
Good luck with your project.
Notes:
Do you also forgo the use of compilers and write all your code from scratch in assembler or even binary?
Is there a specific way a lexer should be created to maximise capability to use it with as many parsers as possible?
In the lexers I've looked at, the canonical API is pretty minimal. It's basically:
Token readNextToken();
The lexer maintains a reference to the source text and its internal pointers into where it is currently looking. Then, every time you call that, it scans and returns the next token.
The Token type usually has:
A "type" enum for which kind of token it is: string, operator, identifier, etc. There are usually special kinds for "EOF", meaning a special terminator token that is produced after the end of the input, and "ERROR" for the rare cases where a syntax error comes from the lexical grammar. This is mainly just unterminated string literals or totally unknown characters in the source.
The source text of the token.
Sometimes literals are converted to their proper value representation during lexing in which case you'll have that value too. So a number token would have "123" as text but also have the numeric value 123. Or you can do that during parsing/compilation.
Location within the source file of the token. This is for error reporting. Usually 1-based line and column, but can also just be start and end byte offsets. The latter is a little faster to produce and can be converted to line and column lazily if needed.
Depending on your grammar, you may need to be able to rewind the lexer too.
Suppose I already have a complete YACC grammar. Let that be C grammar for example. Now I want to create a separate parser for domain-specific language, with simple grammar, except that it still needs to parse complete C type declarations. I wouldn't like to duplicate long rules from the original grammar with associated handling code, but instead would like to call out to the original parser to handle exactly one rule (let's call it "declarator").
If it was a recursive descent parser, there would be a function for each rule, easy to call in. But what about YACC with its implicit stack automaton?
Basically, no. Composing LR grammars is not easy, and bison doesn't offer much help.
But all is not lost. Nothing stops you from including the entire grammar (except the %start declaration), and just using part of it, except for one little detail: bison will complain about useless productions.
If that's a show-stopper for you, then you can use a trick to make it possible to create a grammar with multiple start rules. In fact, you can create a grammar which lets you specify the start symbol every time you call the parser; it doesn't even have to be baked in. Then you can tuck that into a library and use whichever parser you want.
Of course, this also comes at a cost: the cost is that the parser is bigger than it would otherwise need to be. However, it shouldn't be any slower, or at least not much -- there might be some cache effects -- and the extra size is probably insignificant compared to the rest of your compiler.
The hack is described in the bison FAQ in quite a lot of detail, so I'll just do an outline here: for each start production you want to support, you create one extra production which starts with a pseudo-token (that is, a lexical code which will never be generated by the lexer). For example, you might do the following:
%start meta_start
%token START_C START_DSL
meta_start: START_C c_start | START_DSL dsl_start;
Now you just have to arrange for the lexer to produce the appropriate START token when it first starts up. There are various ways to do that; the FAQ suggests using a global variable, but if you use a re-entrant flex scanner, you can just put the desired start token in the scanner state (along with a flag which is set when the start token has been sent).
I'm writing a program where I need to parse a JavaScript source file, extract some facts, and insert/replace portions of the code. A simplified description of the sorts of things I'd need to do is, given this code:
foo(['a', 'b', 'c']);
Extract 'a', 'b', and 'c' and rewrite the code as:
foo('bar', [0, 1, 2]);
I am using ANTLR for my parsing needs, producing C# 3 code. Somebody else had already contributed a JavaScript grammar. The parsing of the source code is working.
The problem I'm encountering is figuring out how to actually properly analyze and modify the source file. Each approach that I try to take in actually solving the problem leads me to a dead end. I can't help but think that I'm not using the tool as it's intended or am just too much of a novice when it comes to dealing with ASTs.
My first approach was to parse using a TokenRewriteStream and implement the EnterRule_* partial methods for the rules I'm interested in. While this seems to make modifying the token stream pretty easy, there is not enough contextual information for my analysis. It seems that all I have access to is a flat stream of tokens, which doesn't tell me enough about the entire structure of code. For example, to detect whether the foo function is being called, simply looking at the first token wouldn't work because that would also falsely match:
a.b.foo();
To allow me to do more sophisticated code analysis, my second approach was to modify the grammar with rewrite rules to produce more of a tree. Now, the first sample code block produces this:
Program
CallExpression
Identifier('foo')
ArgumentList
ArrayLiteral
StringLiteral('a')
StringLiteral('b')
StringLiteral('c')
This is working great for analyzing the code. However, now I am unable to easily rewrite the code. Sure, I could modify the tree structure to represent the code I want, but I can't use this to output source code. I had hoped that the token associated with each node would at least give me enough information to know where in the original text I would need to make the modifications, but all I get are token indexes or line/column numbers. To use the line and column numbers, I would have to make an awkward second pass through the source code.
I suspect I'm missing something in understanding how to properly use ANTLR to do what I need. Is there a more proper way for me to solve this problem?
What you are trying to do is called program transformation, that is, the automated generation of one program from another. What you are doing "wrong" is assuming is parser is all you need, and discovering that it isn't and that you have to fill in the gap.
Tools that do that this well have parsers (to build ASTs), means to modify the ASTs (both procedural and pattern directed), and prettyprinters which convert the (modified) AST back into legal source code. You seem to be struggling with the the fact that ANTLR doesn't come with prettyprinters; that's not part of its philosophy; ANTLR is a (fine) parser-generator. Other answers have suggested using ANTLR's "string templates", which are not by themselves prettyprinters, but can be used to implement one, at the price of implementing one. This harder to do than it looks; see my SO answer on compiling an AST back to source code.
The real issue here is the widely made but false assumption that "if I have a parser, I'm well on my way to building complex program analysis and transformation tools." See my essay on Life After Parsing for a long discussion of this; basically, you need a lot more tooling that "just" a parser to do this, unless you want to rebuild a significant fraction of the infrastructure by yourself instead of getting on with your task. Other useful features of practical program transformation systems include typically source-to-source transformations, which considerably simplify the problem of finding and replacing complex patterns in trees.
For instance, if you had source-to-source transformation capabilities (of our tool, the DMS Software Reengineering Toolkit, you'd be able to write parts of your example code changes using these DMS transforms:
domain ECMAScript.
tag replace; -- says this is a special kind of temporary tree
rule barize(function_name:IDENTIFIER,list:expression_list,b:body):
expression->expression
= " \function_name ( '[' \list ']' ) "
-> "\function_name( \firstarg\(\function_name\), \replace\(\list\))";
rule replace_unit_list(s:character_literal):
expression_list -> expression_list
replace(s) -> compute_index_for(s);
rule replace_long_list(s:character_list, list:expression_list):
expression_list -> expression_list
"\replace\(\s\,\list)-> "compute_index_for\(\s\),\list";
with rule-external "meta" procedures "first_arg" (which knows how to compute "bar" given the identifier "foo" [I'm guessing you want to do this), and "compute_index_for" which given a string literals, knows what integer to replace it with.
Individual rewrite rules have parameter lists "(....)" in which slots representing subtrees are named, a left-hand side acting as a pattern to match, and an right hand side acting as replacement, both usually quoted in metaquotes " which seperates rewrite-rule language text from target-language (e.g. JavaScript) text. There's lots of meta-escapes ** found inside the metaquotes which indicate a special rewrite-rule-language item. Typically these are parameter names, and represent whatever type of name tree the parameter represents, or represent an external meta procedure call (such as first_arg; you'll note the its argument list ( , ) is metaquoted!), or finally, a "tag" such as "replace", which is a peculiar kind of tree that represent future intent to do more transformations.
This particular set of rules works by replacing a candidate function call by the barized version, with the additional intent "replace" to transform the list. The other two transformations realize the intent by transforming "replace" away by processing elements of the list one at a time, and pushing the replace further down the list until it finally falls off the end and the replacement is done. (This is the transformational equivalent of a loop).
Your specific example may vary somewhat since you really weren't precise about the details.
Having applied these rules to modify the parsed tree, DMS can then trivially prettyprint the result (the default behavior in some configurations is "parse to AST, apply rules until exhaustion, prettyprint AST" because this is handy).
You can see a complete process of "define language", "define rewrite rules", "apply rules and prettyprint" at (High School) Algebra as a DMS domain.
Other program transformation systems include TXL and Stratego. We imagine DMS as the industrial strength version of these, in which we have built all that infrastructure including many standard language parsers and prettyprinters.
So it's turning out that I can actually use a rewriting tree grammar and insert/replace tokens using a TokenRewriteStream. Plus, it's actually really easy to do. My code resembles the following:
var charStream = new ANTLRInputStream(stream);
var lexer = new JavaScriptLexer(charStream);
var tokenStream = new TokenRewriteStream(lexer);
var parser = new JavaScriptParser(tokenStream);
var program = parser.program().Tree as Program;
var dependencies = new List<IModule>();
var functionCall = (
from callExpression in program.Children.OfType<CallExpression>()
where callExpression.Children[0].Text == "foo"
select callExpression
).Single();
var argList = functionCall.Children[1] as ArgumentList;
var array = argList.Children[0] as ArrayLiteral;
tokenStream.InsertAfter(argList.Token.TokenIndex, "'bar', ");
for (var i = 0; i < array.Children.Count(); i++)
{
tokenStream.Replace(
(array.Children[i] as StringLiteral).Token.TokenIndex,
i.ToString());
}
var rewrittenCode = tokenStream.ToString();
Have you looked at the string template library. It is by the same person who wrote ANTLR and they are intended to work together. It sounds like it would suit do what your looking for ie. output matched grammar rules as formatted text.
Here is an article on translation via ANTLR
So I'm doing a Parser, where I favor flexibility over speed, and I want it to be easy to write grammars for, e.g. no tricky workaround rules (fake rules to solve conflicts etc, like you have to do in yacc/bison etc.)
There's a hand-coded Lexer with a fixed set of tokens (e.g. PLUS, DECIMAL, STRING_LIT, NAME, and so on) right now there are three types of rules:
TokenRule: matches a particular token
SequenceRule: matches an ordered list of rules
GroupRule: matches any rule from a list
For example, let's say we have the TokenRule 'varAccess', which matches token NAME (roughly /[A-Za-z][A-Za-z0-9_]*/), and the SequenceRule 'assignment', which matches [expression, TokenRule(PLUS), expression].
Expression is a GroupRule matching either 'assignment' or 'varAccess' (the actual ruleset I'm testing with is a bit more complete, but that'll do for the example)
But now let's say I want to parse
var1 = var2
And let's say the Parser begins with rule Expression (the order in which they are defined shouldn't matter - priorities will be solved later). And let's say the GroupRule expression will first try 'assignment'. Then since 'expression' is the first rule to be matched in 'assignment', it will try to parse an expression again, and so on until the stack is filled up and the computer - as expected - simply gives up in a sparkly segfault.
So what I did is - SequenceRules add themselves as 'leafs' to their first rule, and become non-roôt rules. Root rules are rules that the parser will first try. When one of those is applied and matches, it tries to subapply each of its leafs, one by one, until one matches. Then it tries the leafs of the matching leaf, and so on, until nothing matches anymore.
So that it can parse expressions like
var1 = var2 = var3 = var4
Just right =) Now the interesting stuff. This code:
var1 = (var2 + var3)
Won't parse. What happens is, var1 get parsed (varAccess), assign is sub-applied, it looks for an expression, tries 'parenthesis', begins, looks for an expression after the '(', finds var2, and then chokes on the '+' because it was expecting a ')'.
Why doesn't it match the 'var2 + var3' ? (and yes, there's an 'add' SequenceRule, before you ask). Because 'add' isn't a root rule (to avoid infinite recursion with the parse-expresssion-beginning-with-expression-etc.) and that leafs aren't tested in SequenceRules otherwise it would parse things like
reader readLine() println()
as
reader (readLine() println())
(e.g. '1 = 3' is the expression expected by add, the leaf of varAccess a)
whereas we'd like it to be left-associative, e.g. parsing as
(reader readLine()) println()
So anyway, now we've got this problem that we should be able to parse expression such as '1 + 2' within SequenceRules. What to do? Add a special case that when SequenceRules begin with a TokenRule, then the GroupRules it contains are tested for leafs? Would that even make sense outside that particular example? Or should one be able to specify in each element of a SequenceRule if it should be tested for leafs or not? Tell me what you think (other than throw away the whole system - that'll probably happen in a few months anyway)
P.S: Please, pretty please, don't answer something like "go read this 400pages book or you don't even deserve our time" If you feel the need to - just refrain yourself and go bash on reddit. Okay? Thanks in advance.
LL(k) parsers (top down recursive, whether automated or written by hand) require refactoring of your grammar to avoid left recursion, and often require special specifications of lookahead (e.g. ANTLR) to be able to handle k-token lookahead. Since grammars are complex, you get to discover k by experimenting, which is exactly the thing you wish to avoid.
YACC/LALR(1) grammars aviod the problem of left recursion, which is a big step forward. The bad news is that there are no real programming langauges (other than Wirth's original PASCAL) that are LALR(1). Therefore you get to hack your grammar to change it from LR(k) to LALR(1), again forcing you to suffer the experiments that expose the strange cases, and hacking the grammar reduction logic to try to handle K-lookaheads when the parser generators (YACC, BISON, ... you name it) produce 1-lookahead parsers.
GLR parsers (http://en.wikipedia.org/wiki/GLR_parser) allow you to avoid almost all of this nonsense. If you can write a context free parser, under most practical circumstances, a GLR parser will parse it without further effort. That's an enormous relief when you try to write arbitrary grammars. And a really good GLR parser will directly produce a tree.
BISON has been enhanced to do GLR parsing, sort of. You still have to write complicated logic to produce your desired AST, and you have to worry about how to handle failed parsers and cleaning up/deleting their corresponding (failed) trees. The DMS Software Reengineering Tookit provides standard GLR parsers for any context free grammar, and automatically builds ASTs without any additional effort on your part; ambiguous trees are automatically constructed and can be cleaned up by post-parsing semantic analyis. We've used this to do define 30+ language grammars including C, including C++ (which is widely thought to be hard to parse [and it is almost impossible to parse with YACC] but is straightforward with real GLR); see C+++ front end parser and AST builder based on DMS.
Bottom line: if you want to write grammar rules in a straightforward way, and get a parser to process them, use GLR parsing technology. Bison almost works. DMs really works.
My favourite parsing technique is to create recursive-descent (RD) parser from a PEG grammar specification. They are usually very fast, simple, and flexible. One nice advantage is you don't have to worry about separate tokenization passes, and worrying about squeezing the grammar into some LALR form is non-existent. Some PEG libraries are listed [here][1].
Sorry, I know this falls into throw away the system, but you are barely out of the gate with your problem and switching to a PEG RD parser, would just eliminate your headaches now.