Difference between the Earley and LALR parsers in Lark?

I have a simple grammar which parses key:value pairs, section by section, with sections separated by a blank line:
k1:1
k2:x

k3:3
k4:4
The grammar I have for it is:
start: section (_sep section)*
_sep: _NEWLINE _NEWLINE+
section: item (_NEWLINE item)*
item: NAME ":" VALUE
_NEWLINE: /\r?\n[\t ]*/
VALUE: /\w+/
NAME: /\w+/
However, the grammar works when using the Earley parser, but not with the LALR parser.
Here is the code I use:
from lark import Lark
import logging
from pathlib import Path
logging.basicConfig(level=logging.DEBUG)
my_grammar = Path("my_grammar.lark").read_text()
print(my_grammar)
early = Lark(my_grammar, debug=True)
print(my_grammar)
lalr = Lark(my_grammar, parser='lalr', debug=True)
text = """
k1:1
k2:x
k3:3
k4:4
"""
print(text.strip())
print(early.parse(text.strip()).pretty())
print(lalr.parse(text.strip()).pretty())
The Earley parser gives me the expected result:
start
  section
    item
      k1
      1
    item
      k2
      x
  section
    item
      k3
      3
    item
      k4
      4
but the LALR parser did not:
lark.exceptions.UnexpectedCharacters: No terminal defined for '
' at line 3 col 1
^
Expecting: {'NAME'}
PS: the problem is with _NEWLINE.
A Lark grammar configures both the lexer and the parser in one file. In my grammar above, each newline is tokenized as its own _NEWLINE, so the blank line between sections becomes a run of _NEWLINE tokens (_NEWLINE _NEWLINE ...), which confuses the LALR parser.
Changing _sep to /\r?\n[\t ]*(\r?\n[\t ]*)+/ makes multiple newlines tokenize as a single token, and the LALR(1) parser can work on it smoothly.
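A minimal sketch of that fix, with the separator pulled out into a named terminal (the _SEP name and the test snippet below are mine, for illustration only):
from lark import Lark

# The separator is a single terminal, so one or more blank lines reach the
# LALR(1) parser as one _SEP token instead of a run of _NEWLINE tokens.
grammar = r"""
start: section (_SEP section)*
section: item (_NEWLINE item)*
item: NAME ":" VALUE

_SEP: /\r?\n[\t ]*(\r?\n[\t ]*)+/
_NEWLINE: /\r?\n[\t ]*/
VALUE: /\w+/
NAME: /\w+/
"""

lalr = Lark(grammar, parser="lalr")
text = "k1:1\nk2:x\n\nk3:3\nk4:4"
print(lalr.parse(text).pretty())  # two sections, same tree as with Earley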
While I got it working, I am still curious how the Earley parser gets it right.

Related

Problem with parsing file ending on a newline

It seems a bit of a trivial question, but I am stuck on parsing the end of file (EOF) using my own island grammar. I am using the new VS Code extension, by the way.
I've mostly been using the examples from the basic recipes and have a simple grammar with the following layout rules:
layout Whitespace = [\t-\n\r\ ]*;
lexical IntegerLiteral = [0-9]+ !>> [0-9];
lexical Comment = "%%" ![\n]* $;
Using this, and some rules it parses some simple files, but will give a parse error anytime a file ends in a newline. (newlines in between lines are no problem).
Am I missing something obvious?
Thanks!
It sounds a bit like your grammar is missing a start nonterminal. All grammar rules get whitespace in between their constituent symbols but not at the start or the end.
A start nonterminal is the exception:
start syntax Islands = Island+;
Islands parseIslands(loc input)
= parse(#start[Islands], input).top;
Passing the start nonterminal to parse will allow the file to start and end with whitespace, and using the .top field you can ignore that whitespace from the parse tree again by projecting out the middle Islands tree.
Island grammars tend to be a complex beast, so without sharing the full grammar and input string, it might be a bit hard to answer this question. But I'll share some generic feedback.
The layout production might be ambiguous if any other part of your language has optional parts. Rascal's parsing is non-greedy. So if you have:
lexical A = "a";
lexical B = "b";
lexical C = "c";
syntax A = A? B? C;
After fusing in the layouts, this becomes:
A` = A? Whitespace? B? Whitespace? C;
Now, since the whitespace is not eating all characters, the grammar is ambiguous: the parser can "bind" a whitespace either between the A and B or between the B and C. So in most cases you want to make sure it's a greedy match by adding a follow restriction:
layout Whitespace = [\t-\n \r \ ]* !>> [\t-\n \r \ ];
Also, I fixed a bug: the layout definition didn't include a space as valid whitespace. Rascal allows spaces inside a character class (for readability), so if you need to add a literal space, you have to write \ .
For the rest, it looks okay, but like I started with, island grammars are a bit harder to debug without both the full syntax, and what you want to have as water and what as island.

How to prefix numbers with optional letter?

I am building on an initial Xtext project built using Gradle.
ext.xtextVersion = '2.20.0'
I have following xtext grammar:
grammar com.exampe.Rule with org.eclipse.xtext.common.Terminals hidden(WS, ML_COMMENT, SL_COMMENT)
import "http://www.eclipse.org/emf/2002/Ecore" as ecore
generate rule "http://www.example.com/Rule"
Rule:
{Number} (other?='o')? number=INT
;
This does NOT parse o19.
Then, the Rule is changed to following:
Rule:
{Number} (other?='*')? number=INT
;
This DOES parse *19.
I did not find anything about letters being treated differently from symbols.
What is going wrong here? How can I get o19 to parse?
o19 is tokenized by the terminal rule ID, which you imported by inheriting from org.eclipse.xtext.common.Terminals. In Xtext, the lexer runs independently of the parser (it is context-insensitive) and tokenizes the text into keywords and terminal rule calls.
You have to add a terminal rule for such cases.
terminal PREFIXED_INT:
'o' INT;
But I don't know whether it's a good idea in terms of readability if you keep the ID rule as well. Readers of your code might be misled.

Alphabetic characters not recognized in tatsu parse

I have defined a very simple grammar, but tatsu does not behave as expected.
I have added a "start" rule and terminated it with a "$" character, but I still see the same behavior.
If I define the "fingering" rule with a regular expression (digit = /[1-5x]/) instead of the individual terminal symbols, the problem disappears. But shouldn't the old-school BNF-like syntax below work?
from pprint import pprint
from tatsu import parse
GRAMMAR = """
@@grammar :: test
@@nameguard :: False
start = sequence $ ;
sequence = {digit}+ ;
digit = 'x' | '1' | '2' | '3' | '4' | '5' ;"""
test = "23"
ast = parse(GRAMMAR, test)
pprint(ast) # Prints ['2', '3']
test = "xx"
ast = parse(GRAMMAR, test)
pprint(ast) # Throws tatsu.exceptions.FailedParse: (1:1) no available options :
The "xx" test should produce "['x', 'x']" and not throw an exception.
What am I missing?
You probably need to check interactions with @@nameguard, which is turned on by default.
For the first version of the grammar, use:
@@nameguard :: False
You can also consider the definitions of @@whitespace and @@namechars that best suit the language and grammar.
Okay, I think there is a problem with @@nameguard. See https://github.com/neogeny/TatSu/issues/95. The easy workaround for the time being is to use a pattern expression in lieu of individual alphabetic terminals. Also, when @@nameguard is fixed, the documentation should clarify that it only relates to alphanumerics that begin with an alphabetic character. Clearly, we did not need @@nameguard for the numeric terminals here.
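A minimal sketch of that pattern-based workaround (the grammar name is illustrative): with the alphabetic alternative folded into a regular expression, @@nameguard never comes into play.
from pprint import pprint
from tatsu import parse

# digit is a pattern, so nameguard (which only affects alphanumeric
# token literals) does not apply to it.
GRAMMAR = """
@@grammar :: Test

start = sequence $ ;
sequence = {digit}+ ;
digit = /[1-5x]/ ;
"""

pprint(parse(GRAMMAR, "23"))  # ['2', '3']
pprint(parse(GRAMMAR, "xx"))  # ['x', 'x'] instead of a FailedParse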

Running Antlr4 parser with lexer grammar gets token recognition errors

I'm trying to create a grammar to parse Solr queries (only mildly relevant and you don't need to know anything about solr to answer the question -- just know more than I do about antlr 4.7). I'm basing it on the QueryParser.jj file from solr 6. I looked for an existing one, but there doesn't seem to be one that isn't old and out-of-date.
I'm stuck because when I try to run the parser I get "token recognition error"s.
The lexer I created uses lexer modes, which, as I understand it, means I need a separate lexer grammar file. So I have a parser file and a lexer file.
I whittled it down to a simple example to show what I'm seeing. Maybe someone can tell me what I'm doing wrong. Here's the parser (Junk.g4):
grammar Junk;
options {
language = Java;
tokenVocab=JLexer;
}
term : TERM '\r\n';
I can't use an import because of the lexer modes in the lexer file I'm trying to create (the tokens in the modes become "undefined" if I use an import). That's why I reference the lexer file with the tokenVocab parameter (as shown in the XML example on GitHub).
Here's the lexer (JLexer.g4):
lexer grammar JLexer;
TERM : TERM_START_CHAR TERM_CHAR* ;
TERM_START_CHAR : [abc] ;
TERM_CHAR : [efg] ;
WS : [ \t\n\r\u3000]+ -> skip;
If I copy the lexer code into the parser, then things work as expected (e.g., "aeee" is a term). Also, if I run the lexer file with grun (specifying tokens as the target), then the string parses as a TERM (as expected).
If I run the parser ("grun Junk term -tokens"), then I get:
line 1:0 token recognition error at: 'a'
line 1:1 token recognition error at: 'e'
line 1:2 token recognition error at: 'e'
line 1:3 token recognition error at: 'e'
[#0,4:5='\r\n',<'
'>,1:4]
I "compile" the lexer first, then "compile" the parser and then javac the resulting java files. I do this in a batch file, so I'm pretty confident that I'm doing this every time.
I don't understand what I'm doing wrong. Is it the way I'm running grun? Any suggestions would be appreciated.
Always trust your intuition! There is some convention internal to grun :-) See TestRig.java, ca. lines 125 and 150. It would have been a lot nicer if some additional CLI args were also added.
When the lexer and parser are compiled separately, the grammar name (insofar as TestRig goes) would in your case be "Junk", and the two files must be named "JunkLexer.g4" and "JunkParser.g4". Accordingly, the header of the parser file JunkParser.g4 should be modified too:
parser grammar JunkParser;
options { tokenVocab=JunkLexer; }
... stuff
Now you can run your tests
> antlr4 JunkLexer
> antlr4 JunkParser
> javac Junk*.java
> grun Junk term -tokens
aeee
^Z
[#0,0:3='aeee',<TERM>,1:0]
[#1,6:5='<EOF>',<EOF>,2:0]
>

Antlr4 lexer predicate needed?

I am trying to parse this piece of text:
:20: test :254:
aapje
:21: rest
...
:20: and :21: are special tags because they start a line. :254: should be treated as 'normal' text, as it does not start on a new line.
I would like the result to be
(20, 'test :254: \naapje')
(21, 'rest')
Lines are terminated using either \r\n or '\n'
I started out trying to ignore the whitespace, but then I match the ':254:' tag as well. So I have to create something that uses the whitespace information.
What I would like to be able to do is something like this:
lexer grammar MT9740_lexer;
InTagNewLine : '\r\n' ~':';
ReadNewLine :'\r\n' ;
But the first would consume the ':'. How can I still generate these tokens? Or is there a smarter approach?
What I understand is that you're looking for a lexer rule that matches the start of a line. This lexer rule tokenizes your :20: or :21: only when it appears at the start of a line:
SOL : {getCharPositionInLine() == 0}? ':' [0-9]+ ':' ;
Your parser rules can then look for this SOL token before parsing the rest of the line.
