Spirit: Allowing a character at the beginning but not in the middle - parsing

I'm trying to write a parser for JavaScript identifiers; so far this is what I have:
// All these rules have string as attribute.
identifier_ = identifier_start
>>
*(
identifier_part >> -(qi::char_(".") > identifier_part)
)
;
identifier_part = +(qi::alnum | qi::char_("_"));
identifier_start = qi::char_("a-zA-Z$_");
This parser works fine for the list of "good identifiers" in my tests:
"x__",
"__xyz",
"_",
"$",
"foo4_.bar_3",
"$foo.bar",
"$foo",
"_foo_bar.foo",
"_foo____bar.foo"
but I'm having trouble with one of the bad identifiers: foo$bar. This is supposed to fail, but it succeeds!! And the synthesized attribute has the value "foo".
Here is the debug output for foo$bar:
<identifier_>
<try>foo$bar</try>
<identifier_start>
<try>foo$bar</try>
<success>oo$bar</success>
<attributes>[[f]]</attributes>
</identifier_start>
<identifier_part>
<try>oo$bar</try>
<success>$bar</success>
<attributes>[[f, o, o]]</attributes>
</identifier_part>
<identifier_part>
<try>$bar</try>
<fail/>
</identifier_part>
<success>$bar</success>
<attributes>[[f, o, o]]</attributes>
</identifier_>
What I want is for the parser to fail when parsing foo$bar but not when parsing $foobar.
What am I missing?

You don't require the parser to consume all input.
When a rule stops matching before the $ sign, it returns with success, because nothing says it can't be followed by a $ sign. So, you would like to assert that it isn't followed by a character that could be part of an identifier:
identifier_ = identifier_start
>>
*(
identifier_part >> -(qi::char_(".") > identifier_part)
) >> !identifier_start
;
A related directive is the distinct directive from the Qi repository: http://www.boost.org/doc/libs/1_55_0/libs/spirit/repository/doc/html/spirit_repository/qi_components/directives/distinct.html
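The same point, that an unanchored parse happily stops early, can be illustrated outside of Spirit. Here is a quick Python regex sketch; the pattern is only an approximation of the rules above, written just to show the effect of requiring all input:
import re
# Roughly mirrors identifier_start followed by identifier_part / "." repetitions.
ident = re.compile(r"[A-Za-z$_](?:[A-Za-z0-9_]+(?:\.[A-Za-z0-9_]+)?)*")
print(ident.match("foo$bar").group(0))      # 'foo'     -- a prefix match still "succeeds"
print(ident.fullmatch("foo$bar"))           # None      -- requiring all input makes it fail
print(ident.fullmatch("$foobar").group(0))  # '$foobar' -- still accepted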

Alphabetic characters not recognized in tatsu parse

I have defined a very simple grammar, but tatsu does not behave as expected.
I have added a "start" rule and terminated it with a "$" character, but I still see the same behavior.
If I define the "fingering" rule with a regular expression (digit = /[1-5x]/) instead of the individual terminal symbols, the problem disappears. But shouldn't the old-school BNF-like syntax below work?
from pprint import pprint
from tatsu import parse
GRAMMAR = """
@@grammar :: test
@@nameguard :: False
start = sequence $ ;
sequence = {digit}+ ;
digit = 'x' | '1' | '2' | '3' | '4' | '5' ;"""
test = "23"
ast = parse(GRAMMAR, test)
pprint(ast) # Prints ['2', '3']
test = "xx"
ast = parse(GRAMMAR, test)
pprint(ast) # Throws tatsu.exceptions.FailedParse: (1:1) no available options :
The "xx" test should produce "['x', 'x']" and not throw an exception.
What am I missing?
You probably need to check interactions with @@nameguard, which is turned on by default.
For the first version of the grammar, use:
@@nameguard :: False
You can also consider the definitions of @@whitespace and @@namechars that best suit the language and grammar.
Okay, I think there is a problem with @@nameguard. See https://github.com/neogeny/TatSu/issues/95. The easy workaround for the time being is to use a pattern expression in lieu of individual alphabetic terminals. Also, when @@nameguard is fixed, the documentation should clarify that it only relates to alphanumerics that begin with an alphabetic. Clearly, we did not need @@nameguard for the numeric terminals here.
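For reference, a minimal sketch of that workaround, assuming the same test grammar as above with digit rewritten as a pattern expression:
from pprint import pprint
from tatsu import parse
# Same test grammar as above, but digit is a pattern expression instead of
# individual alphabetic terminals, which sidesteps the @@nameguard problem.
GRAMMAR = """
@@grammar :: test
@@nameguard :: False
start = sequence $ ;
sequence = {digit}+ ;
digit = /[1-5x]/ ;"""
pprint(parse(GRAMMAR, "23"))  # ['2', '3']
pprint(parse(GRAMMAR, "xx"))  # ['x', 'x'] instead of FailedParse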

How to replace double quotes in Erlang

This is probably a rather trivial question for the Erlang experts - I'm trying to have my ejabberd server store offline messages (in a Riak db), which inherently contain double quotes (") around various values, etc. I get a format error when I try to create a Riak database object from them, and testing shows that replacing the double quotes with an escape character (\") corrects the issue. The question is: how can I do this replacement manually?
I tried the following code, but somehow it doesn't work.
(ejabberd#xxx-xx-xx-xxx)4> re:replace(""hello"", """, "\"", [{return, list}, global]).
* 1: syntax error before: hello
So essentially I'm trying to replace the embedded " around the hello word with \".
I don't know Erlang, but you probably need something like this:
"\"hello\"", "\"", "\\\""
You must escape both " and \ in the replacement string.
The Erlang literal syntax for strings uses the "\" (backslash)
character as an escape code. You need to escape backslashes in literal
strings, both in your code and in the shell, with an additional
backslash, i.e.: "\\".
Example:
Let's work through an example. I use the Erlang $ syntax, which gives the ASCII integer code of a character, to show what is happening behind each string (which is basically a list of integers).
Subject = [$"] ++ "hello" ++ [$"] = "\"hello\"".
Target = [$"] = "\"".
Replacement = [$\\, $\\, $"] = "\\\\\"".
Result = re:replace(Subject, Target, Replacement, [{return, list}, global]).
Now, by taking the lengths of Subject and Result, we can see the difference:
7 = length(Subject). %% => 7 characters: " h e l l o "
9 = length(Result). %% => 9 characters: \ " h e l l o \ "

Why does *any* not backtrack in this example?

I'm trying to understand why in the following example I do not get a match on f2. Contrast that with f1 which does succeed as expected.
import 'package:petitparser/petitparser.dart';
import 'package:petitparser/debug.dart';
main() {
showIt(p, s, [tag = '']) {
var result = p.parse(s);
print('''($tag): $result ${result.message}, ${result.position} in:
$s
123456789123456789
''');
}
final id = letter() & word().star();
final f1 = id & char('(') & letter().star() & char(')');
final f2 = id & char('(') & any().star() & char(')');
showIt(f1, 'foo(a)', 'as expected');
showIt(f2, 'foo(a)', 'why any not matching a?');
final re1 = new RegExp(r'[a-zA-Z]\w*\(\w*\)');
final re2 = new RegExp(r'[a-zA-Z]\w*\(.*\)');
print('foo(a)'.contains(re1));
print('foo(a)'.contains(re2));
}
The output:
(as expected): Success[1:7]: [f, [o, o], (, [a], )] null, 6 in:
foo(a)
123456789123456789
(why any not matching a?): Failure[1:7]: ")" expected ")" expected, 6 in:
foo(a)
123456789123456789
true
true
I'm pretty sure the reason has to do with the fact that any matches the closing paren. But when it then looks for the closing paren and can't find it, shouldn't it:
backtrack the last character
assume the any().star() succeeded with just the 'a'
accept the final paren and succeed
Also, in contrast, I show the analogous regexps that do this.
As you analyzed correctly, the any parser in your example consumes the closing parenthesis. And the star parser wrapping the any parser is eagerly consuming as much input as possible.
Backtracking as you describe is not automatically done by PEGs (parsing expression grammars). Only the ordered choice backtracks automatically.
To fix your example there are multiple possibilities. The most straightforward one is to not let any match the closing parenthesis:
id & char('(') & char(')').neg().star() & char(')')
or
id & char('(') & pattern('^)').star() & char(')')
Alternatively you can use the starLazy operator. Its implementation uses the star and ordered choice operators. An explanation can be found here.
id & char('(') & any().starLazy(char(')')) & char(')')

Does anyone have an efficient R3 function that mimics the behaviour of find/any in R2?

Rebol2 has an /ANY refinement on the FIND function that can do wildcard searches:
>> find/any "here is a string" "s?r"
== "string"
I use this extensively in tight loops that need to perform well. But the refinement was removed in Rebol3.
What's the most efficient way of doing this in Rebol3? (I'm guessing a parse solution of some sort.)
Here's a stab at handling the "*" case:
like: funct [
series [series!]
search [series!]
][
rule: copy []
remove-each s b: parse/all search "*" [empty? s]
foreach s b [
append rule reduce ['to s]
]
append rule [to end]
all [
parse series rule
find series first b
]
]
used as follows:
>> like "abcde" "b*d"
== "bcde"
I had edited your question for "clarity" and changed it to say 'was removed'. That made it sound like it was a deliberate decision. Yet it actually turns out it may just not have been implemented.
BUT if anyone asks me, I don't think it should be in the box...and not just because it's a lousy use of the word "ANY". Here's why:
You're looking for patterns in strings...so if you're constrained to using a string to specify that pattern you get into "meta" problems. Let's say I want to extract the word *Rebol* or ?Red?; now there has to be escaping and things get ugly all over again. Back to RegEx. :-/
So what you might actually want isn't a STRING! pattern like s?r but a BLOCK! pattern like ["s" ? "r"]. This would permit constructs like ["?" ? "?"] or [{?} ? {?}]. That's better than rehashing the string hackery that every other language uses.
And that's what PARSE does, albeit in a slightly-less-declarative way. It also uses words instead of symbols, as Rebol likes to do. [{?} skip {?}] is a match rule where skip is an instruction that moves the parse position past any single element of the parse series between the question marks. It could also do so if it were parsing a block as input, and would match [{?} 12-Dec-2012 {?}].
I don't know entirely what the behavior of /ANY would-or-should be with something like "ab??cd e?*f"... if it provided alternate pattern logic or what. I'm assuming the Rebol2 implementation is brief? So likely it only matches one pattern.
To set a baseline, here's a possibly-lame PARSE solution for the s?r intent:
>> parse "here is a string" [
some [ ; match rule repeatedly
to "s" ; advance to *before* "s"
pos: ; save position as potential match
skip ; now skip the "s"
[ ; [sub-rule]
skip ; ignore any single character (the "?")
"r" ; match the "r", and if we do...
return pos ; return the position we saved
| ; | (otherwise)
none ; no-op, keep trying to match
]
]
fail ; have PARSE return NONE
]
== "string"
If you wanted it to be s*r you would change the skip "r" return pos into a to "r" return pos.
On an efficiency note, I'll mention that it is indeed the case that characters are matched against characters faster than strings. So to #"s" and #"r" to end make a measurable difference in the speed when parsing strings in general. Beyond that, I'm sure others can do better.
The rule is certainly longer than "s?r". But it's not that long when comments are taken out:
[some [to #"s" pos: skip [skip #"r" return pos | none]] fail]
(Note: It does leak pos: as written. Is there a USE in PARSE, implemented or planned?)
Yet a nice thing about it is that it offers hook points at all the moments of decision, and without the escaping defects a naive string solution has. (I'm tempted to give my usual "Bad LEGO alligator vs. Good LEGO alligator" speech.)
But if you don't want to code in PARSE directly, it seems the real answer would be some kind of "Glob Expression"-to-PARSE compiler. It might be the best interpretation of glob Rebol would have, because you could do a one-off:
>> parse "here is a string" glob "s?r"
== "string"
Or if you are going to be doing the match often, cache the compiled expression. Also, let's imagine our block form uses words for literacy:
s?r-rule: glob ["s" one "r"]
pos-1: parse "here is a string" s?r-rule
pos-2: parse "reuse compiled RegEx string" s?r-rule
It might be interesting to see such a compiler for regex as well. These also might accept not only string input but also block input, so that both "s.r" and ["s" . "r"] were legal...and if you used the block form you wouldn't need escaping and could write ["." . "."] to match ".A."
Fairly interesting things would be possible. Given that in RegEx:
(abc|def)=\g{1}
matches abc=abc or def=def
but not abc=def or def=abc
Rebol could be modified to take either the string form or compile into a PARSE rule with a form like:
regex [("abc" | "def") "=" (1)]
Then you get a dialect variation that doesn't need escaping. Designing and writing such compilers is left as an exercise for the reader. :-)
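To make the exercise a bit more concrete, here is a minimal sketch of the glob-to-pattern idea in Python rather than Rebol (glob_to_search is a made-up helper; it only handles * and ?, escaping everything else literally):
import re
def glob_to_search(pattern):
    # Translate a tiny glob pattern (* and ?) into a searching regex;
    # all other characters are escaped, so the text itself needs no escaping.
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")
        elif ch == "?":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))
m = glob_to_search("s?r").search("here is a string")
print(m.group(0) if m else None)  # 'str'; m.start() is the position FIND/ANY would return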
I've broken this into two functions: one that creates a rule to match the given search value, and the other to perform the search. Separating the two allows you to reuse the same generated parse block where one search value is applied over multiple iterations:
expand-wildcards: use [literal][
literal: complement charset "*?"
func [
{Creates a PARSE rule matching VALUE expanding * (any characters) and ? (any one character)}
value [any-string!] "Value to expand"
/local part
][
collect [
parse value [
; empty search string FAIL
end (keep [return (none)])
|
; only wildcard return HEAD
some #"*" end (keep [to end])
|
; everything else...
some [
; single char matches
#"?" (keep 'skip)
|
; textual match
copy part some literal (keep part)
|
; indicates the use of THRU for the next string
some #"*"
; but first we're going to match single chars
any [#"?" (keep 'skip)]
; it's optional in case there's a "*?*" sequence
; in which case, we're going to ignore the first "*"
opt [
copy part some literal (
keep 'thru keep part
)
]
]
]
]
]
]
like: func [
{Finds a value in a series and returns the series at the start of it.}
series [any-string!] "Series to search"
value [any-string! block!] "Value to find"
/local skips result
][
; shortens the search a little where the search starts with a regular char
skips: switch/default first value [
#[none] #"*" #"?" ['skip]
][
reduce ['skip 'to first value]
]
any [
block? value
value: expand-wildcards value
]
parse series [
some [
; we have our match
result: value
; and return it
return (result)
|
; step through the string until we get a match
skips
]
; at the end of the string, no matches
fail
]
]
Splitting the function also gives you a base to optimize the two different concerns: finding the start and matching the value.
I went with PARSE as, even though * and ? are seemingly simple rules, there is nothing quite as expressive and quick as PARSE for effectively implementing such a search.
It might yet be worth considering, as per #HostileFork, a dialect instead of strings with wildcards, indeed to the point where Regex is replaced by a compile-to-parse dialect, but that is perhaps beyond the scope of the question.

How to resolve Xtext variables' names and keywords statically?

I have a grammar describing an assembler dialect. In the code section the programmer can refer to registers from a certain list and to defined variables. I also have a rule matching both [reg0++413] and [myVariable++413]:
BinaryBiasInsideFetchOperation:
'['
v = (Register|[IntegerVariableDeclaration]) ( gbo = GetBiasOperation val = (Register|IntValue|HexValue) )?
']'
;
But when I try to compile it, Xtext throws a warning:
Decision can match input such as "'[' '++' 'reg0' ']'" using multiple alternatives: 2, 3. As a result, alternative(s) 3 were disabled for that input
Splitting the rules, I've noticed that
BinaryBiasInsideFetchOperation:
'['
v = Register ( gbo = GetBiasOperation val = (Register|IntValue|HexValue) )?
']'
;
BinaryBiasInsideFetchOperation:
'['
v = [IntegerVariableDeclaration] ( gbo = GetBiasOperation val = (Register|IntValue|HexValue) )?
']'
;
work well separately, but not at the same time. When I try to compile both of them, Xtext writes a number of errors saying that registers from the list could be processed ambiguously. So:
1) Am I right that the rule part v = (Register|[IntegerVariableDeclaration]) matches any IntegerVariable name, including an empty one, but the rule v = [IntegerVariableDeclaration] matches only nonempty names?
2) Is it correct that when I try to compile the separate rules together, Xtext thinks that [IntegerVariableDeclaration] can conflict with Register?
3) How to resolve this ambiguity?
edit: here are the definitions
Register:
areg = ('reg0' | 'reg1' | 'reg2' | 'reg3' | 'reg4' | 'reg5' | 'reg6' | 'reg7' )
;
IntegerVariableDeclaration:
section = SectionServiceWord? name=ID ':' type = IntegerType ('[' size = IntValue ']')? ( value = IntegerVariableDefinition )? ';'
;
ID is a standard terminal which parses a single word, a.k.a. an identifier.
No, (Register|[IntegerVariableDeclaration]) can't match empty. Actually, [IntegerVariableDeclaration] is the same as [IntegerVariableDeclaration|ID]; it matches the ID rule.
Yes, I think you can't split your rules.
I can't reproduce your problem (I need the full grammar), but in order to solve your problem you should look at this article about Xtext grammar debugging:
Compile grammar in debug mode by adding the following line into your workflow.mwe2
fragment = org.eclipse.xtext.generator.parser.antlr.DebugAntlrGeneratorFragment {}
Open the generated ANTLR debug grammar with ANTLRWorks and check the diagram.
In addition to Fabien's answer, I'd like to add that an omnimatching rule like
AnyId:
name = ID
;
instead of
(Register|[IntegerVariableDeclaration])
solves the problem. One needs to dynamically check whether AnyId.name is a Register, a Variable, or something else like a Constant.
