Why does *any* not backtrack in this example? - dart

I'm trying to understand why in the following example I do not get a match on f2. Contrast that with f1 which does succeed as expected.
import 'package:petitparser/petitparser.dart';
import 'package:petitparser/debug.dart';
main() {
  showIt(p, s, [tag = '']) {
    var result = p.parse(s);
    print('''($tag): $result ${result.message}, ${result.position} in:
$s
123456789123456789
''');
  }

  final id = letter() & word().star();
  final f1 = id & char('(') & letter().star() & char(')');
  final f2 = id & char('(') & any().star() & char(')');
  showIt(f1, 'foo(a)', 'as expected');
  showIt(f2, 'foo(a)', 'why any not matching a?');

  final re1 = new RegExp(r'[a-zA-Z]\w*\(\w*\)');
  final re2 = new RegExp(r'[a-zA-Z]\w*\(.*\)');
  print('foo(a)'.contains(re1));
  print('foo(a)'.contains(re2));
}
The output:
(as expected): Success[1:7]: [f, [o, o], (, [a], )] null, 6 in:
foo(a)
123456789123456789
(why any not matching a?): Failure[1:7]: ")" expected ")" expected, 6 in:
foo(a)
123456789123456789
true
true
I'm pretty sure the reason has to do with the fact that any matches the closing paren. But when it then looks for the closing paren and can't find it, shouldn't it:
backtrack the last character
assume the any().star() succeeded with just the 'a'
accept the final paren and succeed
Also, in contrast, I show the analogous regexps, which do backtrack this way.

As you analyzed correctly, the any parser in your example consumes the closing parenthesis. And the star parser wrapping the any parser is eagerly consuming as much input as possible.
Backtracking as you describe is not automatically done by PEGs (parsing expression grammars). Only the ordered choice backtracks automatically.
There are multiple ways to fix your example. The most straightforward is to not let any match the closing parenthesis:
id & char('(') & char(')').neg().star() & char(')')
or
id & char('(') & pattern('^)').star() & char(')')
Alternatively, you can use the starLazy operator. Its implementation uses the star and ordered-choice operators. An explanation can be found here.
id & char('(') & any().starLazy(char(')')) & char(')')
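To make the difference concrete, here is a minimal sketch in Python (not PetitParser's actual implementation) of a greedy star versus a lazy star built from ordered choice, in the style starLazy uses. Parsers are functions from `(input, position)` to `(new_position, value)` or `None` on failure; all names here are mine.

```python
def char(c):
    def parse(s, i):
        if i < len(s) and s[i] == c:
            return i + 1, c
        return None
    return parse

def any_char():
    def parse(s, i):
        if i < len(s):
            return i + 1, s[i]
        return None
    return parse

def star(p):
    # Greedy PEG star: consumes as much as possible and never gives any back.
    def parse(s, i):
        out = []
        while (r := p(s, i)) is not None:
            i, v = r
            out.append(v)
        return i, out
    return parse

def star_lazy(p, limit):
    # Lazy star via ordered choice: at each step, first try the limit
    # (without consuming it); only if that fails, consume one item with p.
    def parse(s, i):
        out = []
        while True:
            if limit(s, i) is not None:
                return i, out
            r = p(s, i)
            if r is None:
                return None
            i, v = r
            out.append(v)
    return parse

def seq(*ps):
    def parse(s, i):
        out = []
        for p in ps:
            r = p(s, i)
            if r is None:
                return None
            i, v = r
            out.append(v)
        return i, out
    return parse

greedy = seq(char('('), star(any_char()), char(')'))
lazy   = seq(char('('), star_lazy(any_char(), char(')')), char(')'))

print(greedy('(a)', 0))  # None: star ate the ')' and there is no backtracking
print(lazy('(a)', 0))    # (3, ['(', ['a'], ')'])
```

The greedy version fails for exactly the reason described in the answer; the lazy version stops consuming as soon as the limit parser would match.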

Related

Using Alex in Haskell to make a lexer that parses Dice Rolls

I'm making a parser for a DSL in Haskell using Alex + Happy.
My DSL uses dice rolls as part of the possible expressions.
Sometimes I have an expression that I want to parse that looks like:
[some code...] 3D6 [... rest of the code]
Which should translate roughly to:
TokenInt {... value = 3}, TokenD, TokenInt {... value = 6}
My DSL also uses variables (basically, Strings), so I have a special token that handle variable names.
So, with these tokens:
"D" { \pos str -> TokenD pos }
$alpha [$alpha $digit \_ \']* { \pos str -> TokenName pos str}
$digit+ { \pos str -> TokenInt pos (read str) }
The result I'm getting when using my parse now is:
TokenInt {... value = 3}, TokenName { ... , name = "D6"}
Which means that my lexer "reads" an Integer and a Variable named "D6".
I have tried many things; for example, I changed the token D to:
$digit "D" $digit { \pos str -> TokenD pos }
But that just consumes the digits :(
Can I parse the dice roll with the numbers?
Or at least parse TokenInt-TokenD-TokenInt?
PS: I'm using PosN as a wrapper, not sure if relevant.
The way I'd go about it would be to extend the TokenD type to TokenD Int Int, so, using the basic wrapper for convenience, I would do
$digit+ D $digit+ { dice }
...
dice :: String -> Token
dice s = TokenD (read $ head ls) (read $ last ls)
where ls = split 'D' s
split can be found here.
This is an extra step that would usually be done during syntactic analysis, but it doesn't hurt much here.
Also I can't make Alex parse $alpha for TokenD instead of TokenName. If we had Di instead of D that'd be no problem. From Alex's docs:
When the input stream matches more than one rule, the rule which matches the longest prefix of the input stream wins. If there are still several rules which match an equal number of characters, then the rule which appears earliest in the file wins.
But then your code should work. I don't know if this is an issue with Alex.
I decided that I could survive with variables starting with lowercase letters (like Haskell variables), so I changed my lexer to parse variables only if they start with a lowercase letter.
That also solved some possible problems with some other reserved words.
I'm still curious to know if there were other solutions, but the problem in itself was solved.
Thank you all!
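The behaviour the asker observed follows directly from the maximal-munch rule quoted above. Here is a Python sketch of that rule (not Alex itself; the `tokenize` helper and the token names are mine) showing why `NAME` swallows "D6", and how a rule that matches the whole roll wins instead:

```python
import re

# Maximal munch: at each position, the rule matching the longest prefix
# wins; ties go to the rule listed first (as in Alex).
def tokenize(rules, s):
    tokens, i = [], 0
    while i < len(s):
        if s[i].isspace():
            i += 1
            continue
        best = max(
            ((name, m) for name, rx in rules
             if (m := rx.match(s, i))),
            key=lambda nm: len(nm[1].group()),
            default=None,
        )
        if best is None:
            raise ValueError(f'lex error at position {i}')
        name, m = best
        tokens.append((name, m.group()))
        i = m.end()
    return tokens

naive = [
    ('D',    re.compile(r'D')),
    ('NAME', re.compile(r"[a-zA-Z][a-zA-Z0-9_']*")),
    ('INT',  re.compile(r'[0-9]+')),
]
# NAME matches the longer prefix "D6", so the D rule never fires:
print(tokenize(naive, '3D6'))   # [('INT', '3'), ('NAME', 'D6')]

fixed = [
    ('DICE', re.compile(r'[0-9]+D[0-9]+')),  # whole roll as one token
    ('NAME', re.compile(r"[a-zA-Z][a-zA-Z0-9_']*")),
    ('INT',  re.compile(r'[0-9]+')),
]
print(tokenize(fixed, '3D6'))   # [('DICE', '3D6')]
```

This mirrors both the accepted fix (lex the whole roll and split it later) and the asker's workaround (restricting what `NAME` may start with would work too).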

Lexer/Parser design for data file

I am writing a small program which needs to preprocess some data files that are inputs to another program. Because of this I can't change the format of the input files and I have run into a problem.
I am working in a language that doesn't have libraries for this sort of thing and I wouldn't mind the exercise so I am planning on implementing the lexer and parser by hand. I would like to implement a Lexer based roughly on this which is a fairly simple design.
The input file I need to interpret has a section which contains chemical reactions. The different chemical species on each side of the reaction are separated by '+' signs, but the names of the species can also have + characters in them (symbolizing electric charge). For example:
N2+O2=>NO+NO
N2++O2-=>NO+NO
N2+ + O2- => NO + NO
are all valid and the tokens output by the lexer should be
'N2' '+' 'O2' '=>' 'NO' '+' 'NO'
'N2+' '+' 'O2-' '=>' 'NO' '+' 'NO'
'N2+' '+' 'O2-' '=>' 'NO' '+' 'NO'
(note that the last two are identical). I would like to avoid look-ahead in the lexer for simplicity. The problem is that the lexer would start reading any of the above inputs, but when it got to the third character (the first '+'), it would have no way to know whether that was part of the species name or a separator between reactants.
To fix this I thought I would just split it off so the second and third examples above would output:
'N2' '+' '+' 'O2-' '=>' 'NO' '+' 'NO'
The parser then would simply use the context, realize that two '+' tokens in a row means the first is part of the previous species name, and would correctly handle all three of the above cases. The problem with this is that now imagine I try to lex/parse
N2 + + O2- => NO + NO
(note the space between 'N2' and the first '+'). This is invalid syntax, however the lexer I just described would output exactly the same token outputs as the second and third examples and my parser wouldn't be able to catch the error.
So possible solutions as I see it:
implement a lexer with at least one character of look-ahead
include tokens for whitespace
include leading white space in the '+' token
create a "combined" token that includes both the species name and any trailing '+' without white space between, then letting the parser sort out whether the '+' is actually part of the name or not.
Since I am very new to this kind of programming I am hoping someone can comment on my proposed solutions (or suggest another). My main reservation about the first solution is I simply do not know how much more complicated implementing a lexer with look ahead is.
You don't mention your implementation language, but with an input syntax as relatively simple as the one you outline, I don't think having logic along the lines of the following pseudo-code would be unreasonable.
string GetToken()
{
    string token = GetAlphaNumeric();  // assumed to ignore (eat) white-space
    var ch = GetChar();                // assumed to ignore (eat) white-space
    if (ch == '+')
    {
        var ch2 = GetChar();           // one character of look-ahead
        if (ch2 == '+')
            token += '+';              // "name++": the first '+' is charge
        else
            PutChar(ch2);
    }
    PutChar(ch);                       // leave the separator (or other char) for the next call
    return token;
}
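The same one-character look-ahead idea can be sketched as a complete lexer (Python here for concreteness, not the asker's unnamed language; the `lex_reaction` helper and token names are my own). The rule: a '+' or '-' glued to the end of a name is charge, unless the next character starts a new name, in which case the '+' is a separator. The invalid `N2 + + O2-` then lexes to two consecutive PLUS tokens, which the parser can reject:

```python
def lex_reaction(s):
    # One-character look-ahead lexer for reaction strings.
    tokens, i, n = [], 0, len(s)
    while i < n:
        c = s[i]
        if c.isspace():
            i += 1
        elif c.isalnum():
            j = i
            # Extend the name over alphanumerics, and over '+'/'-' only
            # when the character after it does NOT start another name.
            while j < n and (s[j].isalnum() or
                             (s[j] in '+-' and
                              not (j + 1 < n and s[j + 1].isalnum()))):
                j += 1
            tokens.append(('SPECIES', s[i:j]))
            i = j
        elif c == '+':
            tokens.append(('PLUS', '+'))
            i += 1
        elif s.startswith('=>', i):
            tokens.append(('ARROW', '=>'))
            i += 2
        else:
            raise ValueError(f'unexpected {c!r} at position {i}')
    return tokens

print(lex_reaction('N2+O2=>NO+NO'))
print(lex_reaction('N2++O2-=>NO+NO'))
print(lex_reaction('N2+ + O2- => NO + NO'))   # same tokens as the line above
print(lex_reaction('N2 + + O2- => NO + NO'))  # double PLUS: parser can reject
```

In practice this shows the look-ahead version is only a few lines longer than a lexer without it, which addresses the asker's main reservation.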

Match newlines only under specific conditions

I am writing a parser for a language similar to JavaScript, with its semicolon insertion, e.g.:
var x = 1 + 2;
x;
and
var x = 1 + 2
x
and even
var x = 1 +
2
x
are the same.
For now my lexer matches a newline (\n) only when it occurs after a token other than a semicolon. That plays nicely with basic situations like 1 and 2, but how can I deal with the third situation, i.e. a newline in the middle of an expression? I can't match the newline every time, because that would pollute my parser (inserting alternatives with newline tokens everywhere), and I also cannot ignore it entirely, because it is a statement terminator. Basically, it would be best to somehow check during parsing, at the end of a statement, whether there was a newline character or a semicolon there.
This has gone unanswered for a while. I cannot see why you cannot make the statement separator a newline *or* a semicolon. A bit like this:
whitespace [ \t]+
%%
{whitespace} /* Skip */
;[\n]* return(SEMICOLON);
[\n]+ return(SEMICOLON);
Then your grammar is not messed up at all, as you only get the semicolon in the grammar.
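The effect of those two rules can be sketched outside of lex (a Python regex sketch; the SKIP/WORD token names are placeholders of mine): a semicolon followed by any newlines, or a bare run of newlines, both collapse into a single SEMICOLON token, so the parser never sees doubled separators.

```python
import re

TOKEN_RE = re.compile(r'''
    (?P<SKIP>[ \t]+)             # skip horizontal whitespace
  | (?P<SEMICOLON>;\n*|\n+)      # ';' plus trailing newlines, or newline run
  | (?P<WORD>[^;\n \t]+)         # anything else, as an opaque word
''', re.VERBOSE)

def tokenize(src):
    out = []
    for m in TOKEN_RE.finditer(src):
        if m.lastgroup != 'SKIP':
            out.append((m.lastgroup, m.group()))
    return out

print([k for k, _ in tokenize('var x = 1 + 2;\nx\n')])
# one SEMICOLON after '2', one after the final 'x'
```

Note this mirrors only what the answer's rules do: it does not by itself handle a newline inside an expression (the `1 +\n2` case), which still needs support in the grammar or lexer state.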

Spirit: Allowing a character at the begining but not in the middle

I'm trying to write a parser for JavaScript identifiers; so far this is what I have:
// All these rules have string as their attribute.
identifier_ = identifier_start
    >>
    *(
        identifier_part >> -(qi::char_(".") > identifier_part)
    )
    ;
identifier_part  = +(qi::alnum | qi::char_("_"));
identifier_start = qi::char_("a-zA-Z$_");
This parser work fine for the list of "good identifiers" in my tests:
"x__",
"__xyz",
"_",
"$",
"foo4_.bar_3",
"$foo.bar",
"$foo",
"_foo_bar.foo",
"_foo____bar.foo"
but I'm having trouble with one of the bad identifiers: foo$bar. This is supposed to fail, but it succeeds! And the synthesized attribute has the value "foo".
Here is the debug output for foo$bar:
<identifier_>
<try>foo$bar</try>
<identifier_start>
<try>foo$bar</try>
<success>oo$bar</success>
<attributes>[[f]]</attributes>
</identifier_start>
<identifier_part>
<try>oo$bar</try>
<success>$bar</success>
<attributes>[[f, o, o]]</attributes>
</identifier_part>
<identifier_part>
<try>$bar</try>
<fail/>
</identifier_part>
<success>$bar</success>
<attributes>[[f, o, o]]</attributes>
</identifier_>
What I want is for the parser to fail when parsing foo$bar but not when parsing $foobar.
What am I missing?
You don't require the parser to consume all input.
When a rule stops matching before the $ sign, it returns with success, because nothing says it can't be followed by a $ sign. So you want to assert that it isn't followed by a character that could be part of an identifier:
identifier_ = identifier_start
    >>
    *(
        identifier_part >> -(qi::char_(".") > identifier_part)
    ) >> !identifier_start
    ;
A related directive is distinct, from the Spirit Qi repository: http://www.boost.org/doc/libs/1_55_0/libs/spirit/repository/doc/html/spirit_repository/qi_components/directives/distinct.html
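The same fix can be mirrored with a plain regex (an analogue, not Spirit): the trailing negative look-ahead `(?![A-Za-z0-9_$])` plays the role of `!identifier_start`, failing the match when an identifier character follows whatever prefix was consumed.

```python
import re

# Identifier grammar from the question, plus the negative look-ahead fix.
IDENT = re.compile(
    r"[A-Za-z$_][A-Za-z0-9_]*"       # identifier_start, then identifier_part
    r"(?:\.[A-Za-z0-9_]+)*"          # optional dotted parts
    r"(?![A-Za-z0-9_$])"             # must not be followed by an identifier char
)

def parses(s):
    m = IDENT.match(s)
    return m.group() if m else None

print(parses('foo$bar'))   # None: every prefix is followed by an identifier char
print(parses('$foo.bar'))  # '$foo.bar'
```

Without the look-ahead, `IDENT.match('foo$bar')` would happily return 'foo', which is exactly the partial-success behaviour seen in the debug output above.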

How to retrieve value from optional parser in Parsec?

Sorry if it's a novice question - I want to parse something defined by
Exp ::= Mandatory_Part Optional_Part0 Optional_Part1
I thought I could do this:
proc :: Parser String
proc = do {
    ; str  <- parserMandatoryPart
    ; str0 <- optional(parserOptionalPart0) -- (1)
    ; str1 <- optional(parserOptionalPart1) -- (2)
    ; return str++str0++str1
    }
I want to get str0/str1 if optional parts are present, otherwise, str0/str1 would be "".
But (1) and (2) won't work since optional() doesn't allow extracting result from its parameters, in this case, parserOptionalPart0/parserOptionalPart1.
Now What would be the proper way to do it?
Many thanks!
Billy R
The function you're looking for is optionMaybe: it returns Nothing if the parser failed, and the result wrapped in Just if it succeeded. For supplying a default value directly, there is also option.
From the docs:
option x p tries to apply parser p. If p fails without consuming input, it returns the value x, otherwise the value returned by p.
So you could do:
proc :: Parser String
proc = do
    str  <- parserMandatoryPart
    str0 <- option "" parserOptionalPart0
    str1 <- option "" parserOptionalPart1
    return (str ++ str0 ++ str1)
Watch out for the "without consuming input" part. You may need to wrap either or both optional parsers with try.
I've also adjusted your code style to be more standard, and fixed an error on the last line. return isn't a keyword; it's an ordinary function. So return a ++ b is (return a) ++ b, i.e. almost never what you want.
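The shape of `option` is easy to see with toy combinators (a Python sketch mirroring the Haskell; `lit` and the single-letter grammar are invented stand-ins for the real parsers): try the parser, and if it fails, return the default at the same position instead of failing.

```python
def lit(t):
    # Parse a literal string; return (new_pos, t) or None on failure.
    def parse(s, i):
        if s.startswith(t, i):
            return i + len(t), t
        return None
    return parse

def option(default, p):
    # Like Parsec's option: if p fails, succeed with the default value
    # (here p never consumes input on failure, so no `try` analogue needed).
    def parse(s, i):
        r = p(s, i)
        return r if r is not None else (i, default)
    return parse

def proc(s):
    i, a = lit('a')(s, 0)              # mandatory part (raises if missing)
    i, b = option('', lit('b'))(s, i)  # optional part 0
    i, c = option('', lit('c'))(s, i)  # optional part 1
    return a + b + c

print(proc('abc'))  # 'abc'
print(proc('ac'))   # 'ac'
print(proc('a'))    # 'a'
```

The caveat in the answer about `try` applies in Parsec because a parser there can fail after consuming input; in that case `option`'s default is not used and the whole parse fails.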
