Can ANTLR 4 parse a non prefix-free language? - parsing

I've recently begun using ANTLR 4. After quite a lot of testing, I wonder if ANTLR 4 can parse a non prefix-free context-free language. That's because its parsers seem to only make the parse tree out of the longest-from-the-start valid substring of input string.
For example, in the parsing stage of this grammar
grammar Hello;
r : 'hello' ID ;
ID : [a-z]+ ;
WS : [ \t\r\n]+ -> skip ;
and this input string
hello sir hello sir
, the parser seems to only parse one hello sir. So, the input string seems to be valid in the grammar, but actually it's not.
Being aware of the fact that I might know about ANTLR enough, I've googled quite a lot about the problem but got no results. Although I might haven't searched it in the right way, now I think it's better to just ask the pros.
Thanks in advance.

Related

Adding semantics using prolog

to compile a syntax portion of a language we need to parse syntactically and lexiqually
then a step that is not trivial: adding semantic
what I do is to use DCG with some predicates in prolog for the first two step, but now I want to know how it done for the semantic ?
how I consider this task?
separate to the parser or mix it with them?...
EDIT :
Usually, a semantic of simple expression with DCG be written as;
expr (Z) -> term(X) "+", expr(Y), {Z is X + Y}.
but if the language is based on logical formulas such as B Method , problem become more complicated,
and what I need is some tips to overcome this problem
sorry about the wrong expressions
I am very appreciated to your opinions

Operator precedence parsing

I have a grammar which has the following productions:
S-> if e then S else | while e do S| begin L end
|s
L-> S; L|S
I am supposed to construct the operator precedence parsing table for the above. But I'm little confused about how to decide the precedence of various terminals here. Till now, we used to work on normal operators (like, +,I,(,id etc). But how to decide in this? I googled to find how to parse if-else grammar using operator precedence parser, but couldn't find any link explaining the same. I actually need to design the error correcting routines for parsing this grammar using operator precedence and SLR parser. Any help will be appreciated (a question from the book Compiler Design, Aho Ullman)!
Thanks in advance!!
Answering my own question for people who want to learn, read this pdf. It presents a method to do the parsing as per operator precedence parsing for all general operators.

Building a parser (Part I)

I'm making my own javascript-based programming language (yeah, it is crazy, but it's for learn only... maybe?). Well, I'm reading about parsers and the first pass is to convert the code source to tokens, like:
if(x > 5)
return true;
Tokenizer to:
T_IF "if"
T_LPAREN "("
T_IDENTIFIER "x"
T_GT ">"
T_NUMBER "5"
T_RPAREN ")"
T_IDENTIFIER "return"
T_TRUE "true"
T_TERMINATOR ";"
I don't know if my logic is correct for that for while. On my parser it is even better (or not?) and translate to it (yeah, multidimensional array):
T_IF "if"
T_EXPRESSION ...
T_IDENTIFIER "x"
T_GT ">"
T_NUMBER "5"
T_CLOSURE ...
T_IDENTIFIER "return"
T_TRUE "true"
I have some doubts:
Is my way better or worse that the original way? Note that my code will be read and compiled (translated to another language, like PHP), instead of interpreted all the time.
After I tokenizer, what I need do exactly? I'm really lost on this pass!
There are some good tutorial to learn how I can do it?
Well, is that. Bye!
Generally, you want to separate the functions of the tokeniser (also called a lexer) from other stages of your compiler or interpreter. The reason for this is basic modularity: each pass consumes one kind of thing (e.g., characters) and produces another one (e.g., tokens).
So you’ve converted your characters to tokens. Now you want to convert your flat list of tokens to meaningful nested expressions, and this is what is conventionally called parsing. For a JavaScript-like language, you should look into recursive descent parsing. For parsing expressions with infix operators of different precedence levels, Pratt parsing is very useful, and you can fall back on ordinary recursive descent parsing for special cases.
Just to give you a more concrete example based on your case, I’ll assume you can write two functions: accept(token) and expect(token), which test the next token in the stream you’ve created. You’ll make a function for each type of statement or expression in the grammar of your language. Here’s Pythonish pseudocode for a statement() function, for instance:
def statement():
if accept("if"):
x = expression()
y = statement()
return IfStatement(x, y)
elif accept("return"):
x = expression()
return ReturnStatement(x)
elif accept("{")
xs = []
while True:
xs.append(statement())
if not accept(";"):
break
expect("}")
return Block(xs)
else:
error("Invalid statement!")
This gives you what’s called an abstract syntax tree (AST) of your program, which you can then manipulate (optimisation and analysis), output (compilation), or run (interpretation).
Most toolkits split the complete process into two separate parts
lexer (aka. tokenizer)
parser (aka. grammar)
The tokenizer will split the input data into tokens. The parser will only operate on the token "stream" and build the structure.
Your question seems to be focused on the tokenizer. But your second solution mixes the grammar parser and the tokenizer into one step. Theoretically this is also possible but for a beginner it is much easier to do it the same way as most other tools/framework: keep the steps separate.
To your first solution: I would tokenize your example like this:
T_KEYWORD_IF "if"
T_LPAREN "("
T_IDENTIFIER "x"
T_GT ">"
T_LITARAL "5"
T_RPAREN ")"
T_KEYWORD_RET "return"
T_KEYWORD_TRUE "true"
T_TERMINATOR ";"
In most languages keywords cannot be used as method names, variable names and so on. This is reflected already on the tokenizer level (T_KEYWORD_IF, T_KEYWORD_RET, T_KEYWORD_TRUE).
The next level would take this stream and - by applying a formal grammar - would build some datastructure (often called AST - Abstract Syntax Tree) which might look like this:
IfStatement:
Expression:
BinaryOperator:
Operator: T_GT
LeftOperand:
IdentifierExpression:
"x"
RightOperand:
LiteralExpression
5
IfBlock
ReturnStatement
ReturnExpression
LiteralExpression
"true"
ElseBlock (empty)
Implementing the parser by hand is usually done by some frameworks. Implementing something like that by hand and efficiently is usually done at a university in the better part of a semester. So you really should use some kind of framework.
The input for a grammar parser framework is usually a formal grammar in some kind of BNF. Your "if" part migh look like this:
IfStatement: T_KEYWORD_IF T_LPAREN Expression T_RPAREN Statement ;
Expression: LiteralExpression | BinaryExpression | IdentifierExpression | ... ;
BinaryExpression: LeftOperand BinaryOperator RightOperand;
....
That's only to get the idea. Parsing a realworld-language like Javascript correctly is not an easy task. But funny.
Is my way better or worse that the original way? Note that my code will be read and compiled (translated to another language, like PHP), instead of interpreted all the time.
What's the original way ? There are many different ways to implement languages. I think yours is fine actually, I once tried to build a language myself that translated to C#, the hack programming language. Many language compilers translate to an intermediate language, it's quite common.
After I tokenizer, what I need do exactly? I'm really lost on this pass!
After tokenizing, you need to parse it. Use some good lexer / parser framework, such as the Boost.Spirit, or Coco, or whatever. There are hundreds of them. Or you can implement your own lexer, but that takes time and resources. There are many ways to parse code, I generally rely on recursive descent parsing.
Next you need to do Code Generation. That's the most difficult part in my opinion. There are tools for that too, but you can do it manually if you want to, I tried to do it in my project, but it was pretty basic and buggy, there's some helpful code here and here.
There are some good tutorial to learn how I can do it?
As I suggested earlier, use tools to do it. There are a lot of pretty good well-documented parser frameworks. For further information, you can try asking some people who know about this stuff. #DeadMG , over at the Lounge C++ is building a programming language called "Wide". You may try consulting him.
Let's say I have this statement in a programming language:
if (0 < 1) then
print("Hello")
The lexer will translate it into:
keyword: if
num: 0
op: <
num: 1
keyword: then
keyword: print
string: "Hello"
The parser will then take the information (aka "Token Stream") and make this:
if:
expression:
<:
0, 1
then:
print:
"Hello"
I don't know if this will help or not, but I hope it does.

Does the recognition of numbers belong in the scanner or in the parser?

When you look at the EBNF description of a language, you often see a definition for integers and real numbers:
integer ::= digit digit* // Accepts numbers with a 0 prefix
real ::= integer "." integer (('e'|'E') integer)?
(Definitions were made on the fly, I have probably made a mistake in them).
Although they appear in the context-free grammar, numbers are often recognized in the lexical analysis phase. Are they included in the language definition to make it more complete and it is up to the implementer to realize that they should actually be in the scanner?
Many common parser generator tools -- such as ANTLR, Lex/YACC -- separate parsing into two phases: first, the input string is tokenized. Second, the tokens are combined into productions to create a concrete syntax tree.
However, there are alternative techniques that do not require tokenization: check out backtracking recursive-descent parsers. For such a parser, tokens are defined in a similar way to non-tokens. pyparsing is a parser generator for such parsers.
The advantage of the two-step technique is that it usually produces more efficient parsers -- with tokens, there's a lot less string manipulation, string searching, and backtracking.
According to "The Definitive ANTLR Reference" (Terence Parr),
The only difference between [lexers and parsers] is that the parser recognizes grammatical structure in a stream of tokens while the lexer recognizes structure in a stream of characters.
The grammar syntax needs to be complete to be precise, so of course it includes details as to the precise format of identifiers and the spelling of operators.
Yes, the compiler engineer decides but generally it is pretty obvious. You want the lexer to handle all the character-level detail efficiently.
There's a longer answer at Is it a Lexer's Job to Parse Numbers and Strings?

Can CSP(Communicating sequential processes) parser be generated using ANTLR?

Can I write a parser for Communicating sequential processes(CSP) in ANTLR? I think it uses left recursion like in statement
VMS = (coin → (choc → VMS))
complete language specification can be found at CSPM : A Reference Manua
so it is not a LL grammer. Am I right?
In general, even if you have a grammar with left recursion, you can refactor the grammar to remove it. So ANTLR is reasonably likely to be able to process your grammar. There's no a priori reason you can't write a CSP grammar for ANTLR.
Whether the one you have is suitable is another question.
If your quoted phrase is a grammar rule, it doesn't have left recursion. (If it is, I don't understand the syntax of your grammar rules, especially why the parentheses [terminals?] would be unbalanced; that's pretty untradional.)
So ANTLR should be able to process it, modulo converting it to ANTRL grammar rule syntax.
You didn't show the rest of the grammar, so one can't have an opinion about the rest of it.
In the case above does not have left recursion. It would looks something like. Note this is a simplified version, CSP is much more complicated. I'm just showing it is possible.
assignment : PROCNAME EQ process
;
process : LPAREN EVENT ARROW process RPAREN
| PROCNAME
;
Besides, you can factor out left recursion with ANTLRWorks 'Remove Left Recursion' function.
CSP's are definitely possible in ANTLR, check http://www.carnap.info/ for an example.

Resources