How would I create a parser which consumes a character that is also at the beginning and end - dart

How would I create a parser that allows a character which also happens to be the same as the begin/end character? Using the following example:
'Isn't it hot'
The second single-quote should be accepted as part of the content that is between the beginning and ending single-quote. I created a parser like this:
char("'").seq((word()|char("'")|whitespace()).plus()).seq(char("'"))
but it fails as:
Failure[1:15]: "'" expected
If I use "any()|char("'") then it greedily consumes the ending single-quote causing an error as well.
Would I need to create an actual Grammar class? I have attempted to create one but can't figure out how to make a Parser that doesn't try to consume the end marker greedily.

The problem is that plus() is greedy and blind. This means the repetition consumes as much input as possible, but does not consider what comes afterwards. In your example, everything up to the end of the input is consumed, but then the last quote in the sequence cannot be matched anymore.
You can solve the problem by using the non-blind variation plusGreedy(Parser) instead:
char("'")
.seq((word() | char("'") | whitespace()).plusGreedy(char("'")))
.seq(char("'"));
This consumes the input as long as there is still a char("'") left that can be consumed afterwards.
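For illustration, here is a minimal sketch of the fixed parser run against the original input (this assumes the petitparser Dart package; the main wrapper and variable name are mine):
import 'package:petitparser/petitparser.dart';

void main() {
  // plusGreedy() backs off until the trailing char("'") can still match,
  // so the inner quote in "Isn't" stays part of the content.
  final quoted = char("'")
      .seq((word() | char("'") | whitespace()).plusGreedy(char("'")))
      .seq(char("'"));

  print(quoted.parse("'Isn't it hot'").isSuccess); // true
}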

Related

How to recognise single new line tokens in a flex/bison based parser and ignore multiple new lines?

I want my bison-based parser to recognise single newline tokens like '\n' but ignore multiple newlines, so they don't play a role in the overall grammar except in situations where I want just a single newline after a pattern, for example leaving a newline after a definition but ignoring any others.
So far in my lexer I just include a [\n] { } style rule, which ignores newlines. I want to recognise single newline tokens, so I tried [\n{1}] {return '\n';}, but it doesn't seem to work as intended.
Any help is appreciated.
The first problem is that [\n{1}] doesn't do what you think. It means: "recognise one character that can be a newline, an opening curly bracket, a one, or a closing curly bracket".
To solve this, it helps to understand how Flex decides which rule has priority.
The pattern with the longer match has priority.
If two patterns match the same length, the one listed first has priority.
Try the following:
[\n] {return '\n';}
[\n]+ {}
A single newline matches both patterns with the same length, so the first rule wins and the token is returned. A run of newlines matches the second rule with a longer match, so that rule wins and the newlines are ignored.

whitespace in flex patterns leads to "unrecognized rule"

The flex info manual allows whitespace in regular expressions using the "x" modifier in the (?r-s:pattern) form. It specifically offers a simple example (without whitespace):
(?:foo) same as (foo)
but the following program fails to compile with the error "unrecognized rule":
BAD (?:foo)
%%
{BAD} {}
I cannot find any form of (? that is acceptable as a rule pattern. Is the manual in error, or do I misunderstand?
The example in your question does not seem to reflect the question itself, since it shows neither the use of whitespace nor an x flag. So I'm going to assume that the pattern which is failing for you is something like
BAD (?x:two | lines |
of | words)
%%
{BAD} { }
And, indeed, that will not work. Although you can use extended format in a pattern, you can only use it in a definition if it doesn't contain a newline. The definition terminates at the last non-whitespace character on the definition line.
Anyway, definitions are overused. You could write the above as
%%
(?x:two | lines |
of | words ) { }
Which saves anyone reading your code from having to search for a definition.
I do understand that you might want to use a very long pattern in a rule, which is awkward, particularly if you want to use it twice. Regardless of the issue with newlines, this tends to run into problems with Flex's definition length limit (2047 characters). My approach has been to break the very long pattern into a series of definitions, and then define another symbol which concatenates the pieces.
Before v2.6, Flex did not chop whitespace off the end of the definition line, which also led to mysterious "unrecognized rule" errors. The manual seems to still reflect the v2.5 behaviour:
The definition is taken to begin at the first non-whitespace character following the name and continuing to the end of the line.

Handling arbitrary text blocks in an Xtext grammar

In an effort to better understand Xtext, I'm working on writing a grammar and have hit a roadblock. I've boiled it down to the following scenario. I have some input such as this:
thing {abc}
{def}
There may be keywords (e.g. 'thing') followed by other language elements (e.g. ID) in braces. Or there can just be a block of content inside braces. This content should simply be passed along to the parser en masse.
If I try something like this:
Model: (things+=AThing | blocks+=ABlock)*;
AThing : 'thing' '{' name = ID '}';
ABlock : block=BLOCK;
terminal BLOCK:'{' -> '}';
and parse the sample text above, I get an error:
'mismatched input '{abc}' expecting '{'' on ABlock, offset 6, length 5
So, '{abc}' is being matched by the BLOCK terminal rule, which I understand. But how do I alter the grammar to properly handle the sample input? I've been wrestling with this problem for a while and have come up empty. So it's either something very simple that I've missed, or the problem is really complex and I don't realize it. Any enlightenment would be greatly appreciated.
Parsing happens in two stages: tokenizing and parsing proper. In the first, the text input is divided into tokens; in the second, the tokens are matched against the grammar rules. Broadly, something like this (with some arbitrary language):
1st phase:
text:    class   X    {    this   ;    }
         -----   ---  ---  ----   ---  ---
tokens:  ID      ID   LB   ID     SC   RB
2nd phase:
Is there a rule that starts with a 'class' string?
  YES: Is the next expected token an ID?
    YES: Is the next expected token a LB?
      ...
    NO: Is there another rule that starts with 'class'?
      ...
  NO: Is there a rule that starts with an ID token?
    ...
The real implementation is a bit more complex, but I hope you get the idea.
The issue with your grammar is that your terminal BLOCK rule is used during the first phase, hence you get
thing   {abc}   {def}
-----   -----   -----
ID      BLOCK   BLOCK
That is why the error message says it found '{abc}' and not a '{'. The parser matched the thing keyword and was expecting the next token to be a '{', but it got a BLOCK instead.
If you want arbitrary text inside the block, I don't think you can use '{' to identify the name of things.
This looks like what is mentioned here:
A quite common case requiring backtracking is when your language uses the same delimiter pair for two different semantics
So the simplest solution seems to be to use different delimiters. Otherwise you may have to look into enabling backtracking.

Rails strip all except numbers commas and decimal points

Hi, I've been struggling with this for the last hour and am no closer. How exactly do I strip everything except numbers, commas and decimal points from a Rails string? The closest I have so far is:
rate = rate.gsub!(/[^0-9]/i, '')
This strips everything but the numbers. When I try to add commas to the expression, everything gets stripped. I got the above from somewhere else and, as far as I can gather:
^ = not
Everything to the left of the comma gets replaced by what's in the '' on the right
No idea what the /i does
I'm very new to gsub. Does anyone know of a good tutorial on building expressions?
Thanks
Try:
rate = rate.gsub(/[^0-9,\.]/, '')
Basically, you know that ^ means "not" when it is inside the character class brackets [] which you are using, so you can just add the comma to the list. The decimal point needs to be escaped with a backslash because in regular expressions it is a special character that means "match any character".
Also, be aware of whether you are using gsub or gsub!
gsub! has the bang, so it edits the instance of the string you're passing in, rather than returning another one.
So if using gsub! it would be:
rate.gsub!(/[^0-9,\.]/, '')
And rate would be altered.
If you do not want to alter the original variable, then you can use the version without the bang (and assign it to a different var):
cleaned_rate = rate.gsub(/[^0-9,\.]/, '')
I'd just google for tutorials. I haven't used one. Regexes are a LOT of time and trial and error (and table-flipping).
This is a cool tool to use with a mini cheat-sheet on it for ruby that allows you to quickly edit and test your expression:
http://rubular.com/
You can just add the comma and period in the square-bracketed expression:
rate.gsub(/[^0-9,.]/, '')
You don't need the i for case-insensitivity for numbers and symbols.
There's lots of info on regular expressions, regex, etc. Maybe search for those instead of gsub.
You can use this:
rate = rate.gsub(/[^0-9\.\,]/, '')
Also check this out to learn more about regular expressions:
http://www.regexr.com/

How to match `\b` in regex in PetitParserDart?

\b is the "word boundary" in regular expressions; how do I match it in PetitParserDart?
I tried:
pattern("\b") & word().plus() & pattern("\b")
But it doesn't match anything. The pattern I want, in regular-expression terms, is \b\w+\b.
My real problem is:
I want to treat render as a token, but only if it's a standalone word.
The following should match:
render
to render the page
render()
#render[it]
The following should not match:
rerender
rendering
render123
I can't use string("render").trim() here since it will eat up the spaces around it. So I want the \b, but it doesn't seem to be supported by PetitParserDart.
The parser returned by pattern only looks at a single character. Have a look at the tests for some examples.
A first approximation of the regular expression \b\w+\b would be:
word().neg() & word().plus() & word().not()
However, this requires a non-word character at the beginning of the parsed string. You can avoid this problem by removing word().neg() and making sure that the caller starts at a valid place.
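As a hedged illustration of that caveat (the variable name and the test inputs are mine, assuming the petitparser Dart package):
import 'package:petitparser/petitparser.dart';

void main() {
  // Approximation of \b\w+\b: a non-word character, one or more word
  // characters, then a position not followed by a word character.
  final boundedWord = word().neg() & word().plus() & word().not();

  print(boundedWord.parse(' render(').isSuccess); // true: bounded on both sides
  print(boundedWord.parse('render(').isSuccess);  // false: no leading non-word character
}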
The problem you describe is common when using parsing expression grammars. You can typically solve it by reordering the choices accordingly, or by using the logical predicates like and() and not(). For example the Smalltalk grammar defines the token true as follows:
def('trueToken', _token('true') & word().not());
This prevents the token parser from accidentally consuming part of a variable called trueblood.
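Applied to the render example from the question, a minimal sketch along the same lines (the names and test inputs here are only illustrative):
import 'package:petitparser/petitparser.dart';

void main() {
  // 'render' only counts when it is not followed by another word character.
  final renderToken = string('render') & word().not();

  print(renderToken.parse('render()').isSuccess);  // true
  print(renderToken.parse('render123').isSuccess); // false
  print(renderToken.parse('rendering').isSuccess); // false
}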
