How to recognise single new line tokens in a flex/bison based parser and ignore multiple new lines?

I want my bison-based parser to recognise a single newline token such as '\n' but ignore runs of multiple newlines, so that newlines play no role in the overall grammar except where I want exactly one newline to follow a pattern. For example, a definition should be followed by one newline, and any further newlines should be ignored.
So far my lexer just includes a rule of the form [\n] { }, which ignores newlines. To recognise single newline tokens I tried [\n{1}] {return '\n';}, but it doesn't seem to work as intended.
Any help is appreciated.

The first problem is that [\n{1}] doesn't do what you think. It is a character class, so it means: "recognise one character that is a newline, an opening curly bracket, a 1, or a closing curly bracket".
To solve this, it helps to understand how flex chooses between competing rules.
The rule that produces the longest match has priority.
If two rules match text of the same length, the rule listed first has priority.
Try the following:
[\n] {return '\n';}
[\n]+ {}
A single newline matches both rules with the same length, so the first rule is used (the token is returned). A run of more than one newline is matched in full only by the second rule, which gives the longer match, so the whole run is ignored.
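The selection rule described above can be imitated outside flex. The following is a minimal Python sketch, not flex itself: the RULES table, the token names and the catch-all TEXT rule are assumptions added only so that the demonstration can scan ordinary input.

```python
import re

# Rules in the order they would appear in the lexer file: (pattern, action).
# An action of None means "match and discard", like an empty flex action.
RULES = [
    (re.compile(r"\n"), "NEWLINE"),   # [\n]   {return '\n';}
    (re.compile(r"\n+"), None),       # [\n]+  {}
    (re.compile(r"[^\n]+"), "TEXT"),  # hypothetical catch-all for the demo
]

def tokenize(text):
    """Scan text with longest-match selection; ties go to the earlier rule."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_len, best_action = 0, None
        for pattern, action in RULES:
            m = pattern.match(text, pos)
            # Only a strictly longer match replaces the current best,
            # so on a tie the rule listed first is kept.
            if m and len(m.group()) > best_len:
                best_len, best_action = len(m.group()), action
        if best_len == 0:
            raise ValueError(f"no rule matches at position {pos}")
        if best_action is not None:
            tokens.append(best_action)
        pos += best_len
    return tokens
```

With these rules a lone newline yields a NEWLINE token, while a run of two or more newlines is swallowed entirely by the second rule.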

Related

How to replace some characters of input file, before it getting lexed in flex?

How can I replace all occurrences of some character or character sequence with another character or character sequence before flex lexes the input? For example, I want B\65R to match the identifier rule, because it is equivalent to BAR in my grammar. Essentially, I want to turn each \dd sequence into its equivalent ASCII character and then lex the result (\65 -> A, \66 -> B, …).
I know I can first search the entire file for \dd sequences, replace each with the equivalent character, and then feed the result to flex. But I wonder whether there is a better way, such as a rule that matches \dd and replaces it with the corresponding character in the input stream, so that I don't have to process the entire file twice.
Several options...
You could run something along the lines of the following, so that flex reads from a filter that substitutes "\dd" by "chr(dd)" (untested):
yyin = popen("perl -pe 's/\\\\(\\d\\d)/chr($1)/e'", "r");
yylex()....
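For illustration, the substitution that such a filter performs can be sketched in Python rather than Perl. The function name decode_decimal_escapes is hypothetical, and the sketch assumes that every escape is a backslash followed by exactly two decimal digits, as in the question.

```python
import re

def decode_decimal_escapes(text):
    """Replace each backslash followed by two digits with chr(dd)."""
    # The callback converts the captured digits to an integer code point.
    return re.sub(r"\\(\d\d)", lambda m: chr(int(m.group(1))), text)
```

Applied to the example from the question, B\65R becomes BAR before any lexing takes place.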

How would I create a parser which consumes a character that is also at the beginning and end

How would I create a parser that allows a character which also happens to be the same as the begin/end character. Using the following example:
'Isn't it hot'
The second single-quote should be accepted as part of the content that is between the beginning and ending single-quote. I created a parser like this:
char("'").seq((word()|char("'")|whitespace()).plus()).seq(char("'"))
but it fails as:
Failure[1:15]: "'" expected
If I use any()|char("'") instead, it greedily consumes the ending single-quote, which causes an error as well.
Would I need to create an actual Grammar class? I have attempted to create one but can't figure out how to make a Parser that doesn't try to consume the end marker greedily.
The problem is that plus() is greedy and blind. This means the repetition consumes as much input as possible, but does not consider what comes afterwards. In your example, everything up to the end of the input is consumed, but then the last quote in the sequence cannot be matched anymore.
You can solve the problem by using the non-blind variation plusGreedy(Parser) instead:
char("'")
.seq((word() | char("'") | whitespace()).plusGreedy(char("'")))
.seq(char("'"));
This consumes the input as long as there is still a char("'") left that can be consumed afterwards.
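The difference between the blind repetition and the limited one can be sketched without PetitParser. The Python below is only an approximation of the behaviour described above, not the library's implementation: the function names are hypothetical, and is_content_char stands in for word() | char("'") | whitespace().

```python
def is_content_char(ch):
    # Rough stand-in for word() | char("'") | whitespace().
    return ch.isalnum() or ch == "_" or ch == "'" or ch.isspace()

def parse_blind(text):
    """Quote, then as many content characters as possible, then a quote."""
    if not text.startswith("'"):
        return None
    pos = 1
    while pos < len(text) and is_content_char(text[pos]):
        pos += 1  # never checks what has to follow the repetition
    if pos >= 2 and pos < len(text) and text[pos] == "'":
        return text[1:pos]
    return None  # the closing quote was swallowed by the repetition

def parse_with_limit(text):
    """Same repetition, but it gives characters back until a quote follows."""
    if not text.startswith("'"):
        return None
    end = 1
    while end < len(text) and is_content_char(text[end]):
        end += 1
    # Try the longest content first, keeping at least one content character.
    for stop in range(end, 1, -1):
        if stop < len(text) and text[stop] == "'":
            return text[1:stop]
    return None
```

On the input 'Isn't it hot' the blind version fails at the end of the input, while the limited version returns the text between the outer quotes.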

How to match `\b` in regex in PetitParserDart?

\b is the "word boundary" in regular expressions. How can I match it in PetitParserDart?
I tried:
pattern("\b") & word().plus() & pattern("\b")
But it doesn't match anything. The pattern I want is the equivalent of the regular expression \b\w+\b.
My real problem is:
I want to treat render as a token only if it is a standalone word.
The following should match:
render
to render the page
render()
#render[it]
The following should not match:
rerender
rendering
render123
I can't use string("render").trim() here since it will eat up the spaces around it. So I want \b, but it does not seem to be supported by PetitParserDart.
The parser returned by pattern only looks at a single character. Have a look at the tests for some examples.
A first approximation of the regular expression \b\w+\b would be:
word().neg() & word().plus() & word().not()
However, this requires a non-word character at the beginning of the parsed string. You can avoid this problem by removing word().neg() and making sure that the caller starts at a valid place.
The problem you describe is common when using parsing expression grammars. You can typically solve it by reordering the choices accordingly, or by using the logical predicates like and() and not(). For example the Smalltalk grammar defines the token true as follows:
def('trueToken', _token('true') & word().not());
This prevents the token parser from accidentally consuming part of a variable called trueblood.
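The effect of the not() predicate can be sketched in plain Python. This is an illustration of the idea only, not PetitParser code: keyword_at and contains_keyword are hypothetical helpers, and the check on the preceding character stands in for "the caller starts at a valid place".

```python
def is_word_char(ch):
    return ch.isalnum() or ch == "_"

def keyword_at(text, pos, keyword="render"):
    """True if keyword starts at pos and stands alone as a word."""
    end = pos + len(keyword)
    if text[pos:end] != keyword:
        return False
    # Counterpart of word().not(): fail if a word character follows.
    if end < len(text) and is_word_char(text[end]):
        return False
    # The caller must start at a valid place: no word character just before.
    if pos > 0 and is_word_char(text[pos - 1]):
        return False
    return True

def contains_keyword(text, keyword="render"):
    """True if the keyword occurs anywhere in text as a standalone word."""
    return any(keyword_at(text, i, keyword) for i in range(len(text)))
```

Neither check consumes any input, which is why the spaces around the keyword are left alone, unlike with trim().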

Parsing optional semicolon at statement end

I am writing a parser for a C-like language.
So far, it can parse code like:
a = 1;
b = 2;
Now I want to make the semicolon at the end of line optional.
The original YACC rule was:
stmt: expr ';' { ... }
Newlines are handled by a lexer that I wrote myself (the code is simplified):
rule(/\r\n|\r|\n/) { increase_lineno(); return :PASS }
The instruction :PASS here is equivalent to returning nothing in LEX: it drops the currently matched text and moves on to the next match, just as is usually done with whitespace.
Because of this, I can't just simply change my YACC rule into:
stmt: expr end_of_stmt { ... }
;
end_of_stmt: ';'
| '\n'
;
So I chose to have the parser change the lexer's state dynamically.
Like this:
stmt: expr { state = :STATEMENT_END } ';' { ... }
I also added a lexer rule that matches a newline in the new state:
rule(/\r\n|\r|\n/, :STATEMENT_END) { increase_lineno(); state = nil; return ';' }
This means that when the lexer is in the :STATEMENT_END state, it first increases the line number as usual, then resets the state to the initial one, and then pretends that the newline is a semicolon.
Strangely, it doesn't actually work with the following code:
a = 1
b = 2
Debugging showed that the parser does not actually get a ';' as expected when the newline after the number 1 is scanned, and the state-specific rule is never executed.
The code that sets the new state runs only after the lexer has already scanned the newline and returned nothing. In other words, the work is done in the following order:
scan a, = and 1
scan newline and skip, so get the next value b
the inserted code({ state = :STATEMENT_END }) is executed
raising error -- unexpected b here
This is what I expect:
scan a, = and 1
found that it matches the rule expr, so reduce into stmt
execute the inserted code to set the new lexer state
scan the newline and return a ; according the new state matching rule
continue to scan & parse the following line
After some investigation, I suspect the cause is that YACC uses LALR(1): the parser reads one token ahead. When the lexer scans the newline, the state has not been set yet, so it cannot return the correct token.
My question is: how can I make it work as expected? I have no idea how to do this.
Thanks.
The first thing to recognize is that having optional line terminators like this introduces ambiguity into your language, and so you first need to decide which way you want to resolve the ambiguity. In this case, the main ambiguity comes from operators that may be either infix or prefix. For example:
a = b
-c;
Do you want to treat the above as a single expr-statement, or as two separate statements with the first semicolon elided? A similar potential ambiguity occurs with function call syntax in a C-like language:
a = b
(c);
If you want these to resolve as two statements, you can use the approach you've tried; you just need to set the state one token earlier. This gets tricky as you DON'T want to set the state if you have unclosed parentheses, so you end up needing an additional state variable to record the paren nesting depth, and only set the insert-semi-before-newline state when that is 0.
If you want to resolve the above cases as one statement, things get tricky, as you actually need more lookahead to decide when a newline should end a statement -- at the very least you need to look at the token AFTER the newline (and any comments or other ignored stuff). In this case you can have the lexer do the extra lookahead. If you were using flex (which you're apparently not?), I would suggest either using the / operator (which does lookahead directly), or defer returning the semicolon until the lexer rule that matches the next token.
In general, when doing this kind of token state recording, I find it easiest to do it entirely within the lexer where possible, so you don't need to worry about the extra token of lookahead sometimes (but not always) done by the parser. In this specific case, an easy approach would be to have the lexer record the parenthesis level (+1 for (, -1 for )) and the last token returned. Then, in the newline rule, if the paren level is 0 and the last token was something that could end an expression (an ID, a constant, a ) or a postfix-only operator), return the extra ; token.
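A minimal sketch of that lexer-only approach, written in Python rather than in the asker's Ruby-based lexer, might look as follows. The token pattern and the token kinds (NL, WS, NUM, ID, OP) are assumptions made for the illustration, and postfix operators are left out.

```python
import re

# Hypothetical token classes for a tiny C-like language.
TOKEN = re.compile(
    r"(?P<NL>\r\n|\r|\n)|(?P<WS>[ \t]+)|(?P<NUM>\d+)"
    r"|(?P<ID>[A-Za-z_]\w*)|(?P<OP>.)"
)

def tokenize(source):
    """Return token texts, inserting ';' at newlines that end a statement."""
    tokens = []
    depth = 0    # +1 for "(", -1 for ")"
    last = None  # kind (ID, NUM) or text (operators) of the last token
    for m in TOKEN.finditer(source):
        kind, text = m.lastgroup, m.group()
        if kind == "WS":
            continue
        if kind == "NL":
            # Insert ';' only outside parentheses and only after a token
            # that can end an expression.
            if depth == 0 and last in ("ID", "NUM", ")"):
                tokens.append(";")
                last = ";"
            continue
        if text == "(":
            depth += 1
        elif text == ")":
            depth -= 1
        tokens.append(text)
        last = text if kind == "OP" else kind
    return tokens
```

Because the decision is made inside the lexer, the parser's one-token lookahead no longer matters. Note that this sketch resolves the a = b / -c; example above as two statements.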
An alternate approach is to have the lexer return NEWLINE as its own token. You would then change the parser to accept stmt: expr NEWLINE as well as optional newlines between most other tokens in the grammar. This exposes the ambiguity directly to the parser (it's now not LALR(1)), so you need to resolve it either by using yacc's operator precedence rules (tricky and error prone), or using something like bison's %glr-parser option or btyacc's backtracking ability to deal with the ambiguity directly.
What you are attempting is certainly possible.
Ruby, in fact, does exactly this, and it has a yacc parser. Newlines soft-terminate statements, semicolons are optional, and statements are automatically continued on multiple lines "if they need it".
Communicating between the parser and lexical analyzer may be necessary, and yes, legacy yacc is LALR(1).
I don't know exactly how Ruby does it. My guess has always been that it doesn't actually communicate (much) but rather the lexer recognizes constructs that obviously aren't finished and silently just treats newlines as spaces until the parens and brackets balance. It must also notice when lines end with binary operators or commas and eat those newlines too.
Just a guess, but I believe this technique would work. And Ruby is open source... if you want to see exactly how Matz did it.

How to go backward to a certain position in Flex?

For example, my lexer recognizes a function call pattern:
//i.e. hello(...), foo(...), bar(...)
FUNCALL [a-zA-Z0-9]*[-_]*[a-zA-Z0-9]+[-_*][a-zA-Z0-9]*\(.\)
Flex recognizes the pattern, but it moves past the last character of the match (i.e. after storing foo(...) in yytext, the lexer points to the character that follows foo(...)).
How can I reset the lexer position to the beginning of the function pattern? That is, after recognizing foo(..), I want the lexer to point to the start of foo(..), so that I can start tokenizing it.
I need to do this because only one token can be returned per matched pattern: after matching foo(...), the return statement can give back foo, ( or ), but not all of them.
Flex has a trailing context pattern match (manual excerpt below). Read and understand the limitations before you use it.
`r/s'
an `r' but only if it is followed by an `s'. The text matched by
`s' is included when determining whether this rule is the longest
match, but is then returned to the input before the action is
executed. So the action only sees the text matched by `r'. This
type of pattern is called "trailing context". (There are some
combinations of `r/s' that flex cannot match correctly. *Note
Limitations::, regarding dangerous trailing context.)
Presumably something like this:
FUNCALL [a-zA-Z0-9]*[-_]*[a-zA-Z0-9]+[-_*][a-zA-Z0-9]*/\(.\)
You may find that it makes more sense to change your parser so you don't need to do this.
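As a rough analogue of trailing context, the Python sketch below uses a regular-expression lookahead, which checks for the opening parenthesis without consuming it. It is not flex, the identifier pattern is simplified compared with the FUNCALL definition above, and the token names are assumptions.

```python
import re

# The lookahead (?=\() plays the role of flex's r/s: the "(" must follow,
# but it is left in the input for the next token.
FUNC_NAME = re.compile(r"[A-Za-z_]\w*(?=\()")
IDENT = re.compile(r"[A-Za-z_]\w*")

def tokenize(source):
    """Split source into (kind, text) pairs, one token per match."""
    tokens, pos = [], 0
    while pos < len(source):
        if source[pos].isspace():
            pos += 1
            continue
        m = FUNC_NAME.match(source, pos)
        if m:
            tokens.append(("FUNCNAME", m.group()))
            pos = m.end()  # the "(" has not been consumed
            continue
        m = IDENT.match(source, pos)
        if m:
            tokens.append(("ID", m.group()))
            pos = m.end()
            continue
        tokens.append(("CHAR", source[pos]))
        pos += 1
    return tokens
```

The function name, the parentheses and the argument each come back as separate tokens, so nothing has to be pushed back after the match.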
