I've written a lexer and parser in ml-ulex and ml-antlr (running SML/NJ).
I have the following rule in my lexer:
<TRUSTED> [\"] => ( YYBEGIN INQUOTE ; Tokens.ValidText yytext );
It's identified correctly, but yytext has several additional '\' (backslashes) in the string. I get \\\\\\\" instead of \".
To make things more perplexing, I replaced the rule in my lexer with this one:
<TRUSTED> [\"] => ( YYBEGIN INQUOTE ; Tokens.DBLQUOTE );
And in my grammar I had the token
prod: DBLQUOTE => ( "\"" );
With the same results as before....
If I replace the rule with
prod: DBLQUOTE => ( "*test*" );
Then I get exactly one test as my string.
I've tried using String.fromString on the resulting string, and it doesn't do the trick. I'm not sure what's happening or why.
I'm using ANTLR4 to try to parse code that has asterisk-leading comments, like:
* This is a comment
I was initially having issues with multiplication expressions getting mistaken for these comments, so decided to make my lexer rule:
LINE_COMMENT : '\r\n' '*' ~[\r\n]* ;
This forces there to be a newline so it doesn't see 2 * 3, with '* 3' being a comment.
This worked just fine until I had code that starts with a comment on the first line, which does not have a newline to begin with. For example:
* This is the first line of the code's file\r\n
* This is the second line of the code's file\r\n
I have also tried the {getCharPositionInLine==x}? to make sure that it only recognizes a comment if there is an asterisk or spaces/tabs coming first in the current line. This works when using
antlr4 *.g4
, but will not work with my JavaScript parser generated using
antlr4 -Dlanguage=JavaScript *.g4
Is there a way to get the same results of {getCharPositionInLine==x}? with my JavaScript parser or some way to prevent multiplication from being recognized as a comment? I should also mention that this coding language doesn't use semicolons at the end of lines.
I've tried playing around with this simple grammar, but I haven't had any luck.
grammar wow;
program : expression | Comment ;
expression : expression '*' expression
| NUMBER ;
Comment : '*' ~[\r\n]*;
NUMBER : [0-9]+ ;
Asterisk : '*' ;
Space : ' ' -> skip;
and using a test file: test.txt
5 * 5
Make the comment rule match at least one more non-whitespace character, otherwise it could match the same content as the Asterisk rule, like so:
Comment: '*' ' '* ~[\r\n]+;
Do comments have to be at the beginning of line?
If so, you can check it with this._tokenStartCharPositionInLine == 0 and use a lexer rule like this:
Comment : '*' ~[\r\n]* {this._tokenStartCharPositionInLine == 0}?;
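As a rough illustration of the start-of-line check outside ANTLR, here is a minimal Python sketch. It is not ANTLR code: it only assumes that "column 0" can be modelled with ^ under re.MULTILINE, and the name LINE_COMMENT and the sample text are made up for the example.

```python
import re

# Assumption: '^' with re.MULTILINE stands in for the column-0 predicate.
# A '*' is only treated as a comment start when it is the first character of a line.
LINE_COMMENT = re.compile(r"^\*[^\r\n]*", re.MULTILINE)

source = "* first line comment\r\n5 * 5\r\n* second comment\r\n"

# The '*' inside "5 * 5" is not at column 0, so it is not picked up.
comments = LINE_COMMENT.findall(source)
print(comments)
```

The multiplication on the second line is left alone because its asterisk is not the first character of the line.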
If not, you need to keep track of the previous token: a '*' can only be a multiplication when the preceding token is one that can be multiplied (for example, your NUMBER rule). So you would write something like this (Java code):
@lexer::members {
private static final Set<Integer> MULTIPLIABLE_TOKENS = new HashSet<>();
static {
MULTIPLIABLE_TOKENS.add(NUMBER);
}
private boolean canBeMultiplied = false;
@Override
public void emit(final Token token) {
final int type = token.getType();
if (type != Whitespace && type != Newline) { // skip ws tokens from consideration
canBeMultiplied = MULTIPLIABLE_TOKENS.contains(type);
}
super.emit(token);
}
}
Comment : {!canBeMultiplied}? '*' ~[\r\n]*;
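The same idea can be tried out without ANTLR. The following is a minimal hand-written Python sketch, not the ANTLR runtime: tokenize, MULTIPLIABLE and TOKEN_RE are names invented for this example, and, like the Java snippet, it leaves the flag unchanged on whitespace and newlines.

```python
import re

# Token kinds after which '*' means multiplication (assumption: only NUMBER).
MULTIPLIABLE = {"NUMBER"}

TOKEN_RE = re.compile(
    r"(?P<NUMBER>[0-9]+)|(?P<STAR>\*)|(?P<WS>[ \t]+)|(?P<NEWLINE>\r?\n)"
)


def tokenize(text):
    """Hypothetical lexer: '*' starts a comment unless the previous token is multipliable."""
    tokens = []
    can_be_multiplied = False  # plays the role of the canBeMultiplied field
    pos = 0
    while pos < len(text):
        if text[pos] == "*" and not can_be_multiplied:
            # Comment: consume up to, but not including, the line break.
            end = pos
            while end < len(text) and text[end] not in "\r\n":
                end += 1
            kind, value = "COMMENT", text[pos:end]
        else:
            match = TOKEN_RE.match(text, pos)
            if match is None:
                raise ValueError(f"unexpected character at offset {pos}")
            kind, value, end = match.lastgroup, match.group(), match.end()
        pos = end
        if kind in ("WS", "NEWLINE"):
            continue  # skipped: neither emitted nor allowed to change the flag
        can_be_multiplied = kind in MULTIPLIABLE
        tokens.append((kind, value))
    return tokens


print(tokenize("5 * 5"))
print(tokenize("* This is a comment"))
```

Whether a newline should reset the flag is a design decision for the real grammar; this sketch simply follows the Java version above.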
UPDATE
If you need the equivalent functions for JavaScript, take a look at the runtime sources -> Lexer.js
I am attempting to create a strongly defined xml language, but have run into trouble on element values between element tags. I want them to be treated like a string except they are not wrapped in quotes. Here is a basic grammar I created to demonstrate the idea:
grammar org.xtext.example.myxml.MyXml hidden(WS)
generate myXml "http://www.xtext.org/example/myxml/MyXml"
import "http://www.eclipse.org/emf/2002/Ecore" as ecore
Element:
{Element}
'<Element' attributes+=ElementAttribute* ('/>' | '>'
subElement+=SubElement*
'</Element' '>')
;
SubElement:
{SubElement}
'<SubElement' attributes+=SubElementAttribute* ('/>' | '>'
value=ElementValue
'</SubElement' '>')
;
ElementAttribute:
NameAttribute | TypeAttribute
;
SubElementAttribute:
NameAttribute
;
TypeAttribute:
'type' '=' type=STRING
;
NameAttribute:
'name' '=' name=STRING
;
ElementValue hidden():
value=ID
;
terminal STRING:
'"' ( '\\' . /* 'b'|'t'|'n'|'f'|'r'|'u'|'"'|"'"|'\\' */ | !('\\'|'"') )* '"' |
"'" ( '\\' . /* 'b'|'t'|'n'|'f'|'r'|'u'|'"'|"'"|'\\' */ | !('\\'|"'") )* "'"
;
terminal WS: (' '|'\t'|'\r'|'\n')+;
terminal ID: '^'?('a'..'z'|'A'..'Z'|'_'|'0'..'9'|':'|'-'|'('|')')*;
Here is a test to demonstrate its usage:
@Test
def void parseXML() {
val result = parseHelper.parse('''
<Element type="myType" name="myName">
<SubElement>some string:like-stuff here </SubElement>
</Element>
''')
Assert.assertNotNull(result)
val errors = result.eResource.errors
for (error : errors) {
println(error.message)
}
}
The error I get from this exact code is mismatched input 'string:like-stuff' expecting '</SubElement'
Obviously this will not work because ID does not allow white space. Adding white space to ID fixes the above error, but causes other parsing issues. So my question is: how can I parse the element value into a string-like representation without causing ambiguity for the parser in other areas? The only way I have been able to get this to work in any form in my full language is by turning the ElementValue into a list of IDs separated by white space. (I could not get that to work in this minimal example, however; I'm not sure what is different.)
I would not really recommend it because Xtext is usually not the best fit for XML parsing, but it would probably be possible by turning ElementValue into a datatype rule that allows everything that doesn't create an ambiguity.
Something along the lines of:
ElementValue returns ecore::EString hidden(): (ID|WS|STRING|UNMATCHED)+ ;
and at the end of the grammar:
terminal UNMATCHED: .;
You will probably want to make SubElement.value optional to allow for an empty element.
value=ElementValue?
I'm trying to write a parser for JavaScript identifiers. So far this is what I have:
// All these rules have string as attribute.
identifier_ = identifier_start
>>
*(
identifier_part >> -(qi::char_(".") > identifier_part)
)
;
identifier_part = +(qi::alnum | qi::char_("_"));
identifier_start = qi::char_("a-zA-Z$_");
This parser works fine for the list of "good identifiers" in my tests:
"x__",
"__xyz",
"_",
"$",
"foo4_.bar_3",
"$foo.bar",
"$foo",
"_foo_bar.foo",
"_foo____bar.foo"
but I'm having trouble with one of the bad identifiers: foo$bar. This is supposed to fail, but it succeeds! And the synthesized attribute has the value "foo".
Here is the debug output for foo$bar:
<identifier_>
<try>foo$bar</try>
<identifier_start>
<try>foo$bar</try>
<success>oo$bar</success>
<attributes>[[f]]</attributes>
</identifier_start>
<identifier_part>
<try>oo$bar</try>
<success>$bar</success>
<attributes>[[f, o, o]]</attributes>
</identifier_part>
<identifier_part>
<try>$bar</try>
<fail/>
</identifier_part>
<success>$bar</success>
<attributes>[[f, o, o]]</attributes>
</identifier_>
What I want is for the parser to fail when parsing foo$bar but not when parsing $foobar.
What am I missing?
You don't require the parser to consume all input.
When a rule stops matching before the $ sign, it returns with success, because nothing says it can't be followed by a $ sign. So, you would like to assert that it isn't followed by a character that could be part of an identifier:
identifier_ = identifier_start
>>
*(
identifier_part >> -(qi::char_(".") > identifier_part)
) >> !identifier_start
;
A related directive is distinct, from the Qi repository: http://www.boost.org/doc/libs/1_55_0/libs/spirit/repository/doc/html/spirit_repository/qi_components/directives/distinct.html
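To see the effect of the trailing negative check in isolation, here is a small Python sketch that uses a regular expression instead of Spirit. It is only an approximation of the rules above: qi::alnum is assumed to be ASCII letters and digits, and the names START, PART, BODY, prefix_only and strict are invented for the example.

```python
import re

# Regex approximations of the Spirit rules (assumption: ASCII-only alnum).
START = r"[A-Za-z$_]"
PART = r"[A-Za-z0-9_]+"
BODY = rf"{START}(?:{PART}(?:\.{PART})?)*"

# Without the check: happily matches a prefix and reports success.
prefix_only = re.compile(BODY)

# With the check: the match must not be followed by an identifier-start character.
strict = re.compile(BODY + rf"(?!{START})")

print(prefix_only.match("foo$bar").group())  # prefix match, like the original rule
print(strict.match("foo$bar"))               # rejected
print(strict.match("$foo").group())          # still accepted
```

The negative lookahead plays the role of !identifier_start: it consumes nothing, it only vetoes a match that stops right before a character that could start an identifier.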
I'm trying to build a grammar for a recognizer of a SPICE-like language using Antlr-3.1.3 (I use this version because of the Python target). I don't have experience with parsers. I've found a master's thesis where the student has done the syntactic analysis of the SPICE 2G6 language and built a parser using the LEX and YACC compiler writing tools. (http://digitool.library.mcgill.ca/R/?func=dbin-jump-full&object_id=60667&local_base=GEN01-MCG02) In chapter 4, he describes a grammar in Backus-Naur form for the SPICE 2G6 language, and appends to the work the LEX and YACC code files of the parser.
I'm basing my work on this thesis to create a simpler grammar for a recognizer of a more restrictive SPICE language.
I read the Antlr manual, but could not figure out how to solve two problems, that the code snippet below illustrates.
grammar Najm_teste;
resistor
: RES NODE NODE VALUE 'G2'? COMMENT? NEWLINE
;
// START:tokens
RES : ('R'|'r') DIG+;
NODE : DIG+; // non-negative integer
VALUE : REAL; // non-negative real
fragment
SIG : '+'|'-';
fragment
DIG : '0'..'9';
fragment
EXP : ('E'|'e') SIG? DIG+;
fragment
FLT : (DIG+ '.' DIG+)|('.' DIG+)|(DIG+ '.');
fragment
REAL : (DIG+ EXP?)|(FLT EXP?);
COMMENT : '%' ( options {greedy=false;} : . )* NEWLINE;
NEWLINE : '\r'? '\n';
WS : (' '|'\t')+ {$channel=HIDDEN;};
// END:tokens
In the grammar above, the token NODE is a subset of the set represented by the VALUE token. The grammar correctly interprets an input like "R1 5 0 1.1\n", but cannot interpret an input like "R1 5 0 1\n", because it maps "1" to the token NODE instead of the token VALUE, as NODE comes before VALUE in the tokens section. Given such inputs, does anyone have an idea of how I can map the "1" to the correct token VALUE, or a suggestion of how I can alter the grammar so that it correctly interprets the input?
The second problem is the presence of a comment at the end of a line. The NEWLINE token delimits both (1) the end of a comment and (2) the end of a line of code. When I include a comment at the end of a line of code, two newline characters are necessary for the parser to correctly recognize the line of code; otherwise, just one newline character is necessary. How could I improve this?
Thanks!
Problem 1
The lexer does not "listen" to the parser. The lexer simply creates tokens that contain as many characters as possible. In case two tokens match the same number of characters, the token defined first will "win". In other words, "1" will always be tokenized as a NODE, even though the parser is trying to match a VALUE.
You can do something like this instead:
resistor
: RES NODE NODE value 'G2'? COMMENT? NEWLINE
;
value : NODE | REAL;
// START:tokens
RES : ('R'|'r') DIG+;
NODE : DIG+;
REAL : (DIG+ EXP?) | (FLT EXP?);
...
I.e., I removed VALUE, added value and removed fragment from REAL.
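The longest-match and first-rule-wins behaviour described under Problem 1 can be sketched in a few lines of Python. This is a toy tokenizer, not what ANTLR generates: RULES and tokenize are names made up for the example, and the regular expressions are hand translations of the RES, NODE and REAL rules.

```python
import re

# Rule order matters only for ties, as in the lexer described above.
RULES = [
    ("RES", re.compile(r"[Rr][0-9]+")),
    ("NODE", re.compile(r"[0-9]+")),
    ("REAL", re.compile(
        r"(?:[0-9]+\.[0-9]+|\.[0-9]+|[0-9]+\.|[0-9]+)(?:[Ee][+-]?[0-9]+)?")),
    ("WS", re.compile(r"[ \t]+")),
]


def tokenize(text):
    """Pick the longest match at each position; on a tie the earlier rule wins."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_kind, best_len = None, 0
        for kind, pattern in RULES:
            match = pattern.match(text, pos)
            # Strict '>' keeps the earlier rule when two matches are equally long.
            if match and len(match.group()) > best_len:
                best_kind, best_len = kind, len(match.group())
        if best_kind is None:
            raise ValueError(f"unexpected character at offset {pos}")
        if best_kind != "WS":
            tokens.append((best_kind, text[pos:pos + best_len]))
        pos += best_len
    return tokens


print(tokenize("R1 5 0 1"))    # the final "1" comes out as NODE
print(tokenize("R1 5 0 1.1"))  # "1.1" is longer than "1", so REAL wins
```

This is why the parser rule value : NODE | REAL is needed: the tokenizer never looks at what the parser expects.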
Problem 2
Do not let the comment match the line break:
COMMENT : '%' ~('\r' | '\n')*;
where ~('\r' | '\n')* matches zero or more chars other than line break characters.
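A quick way to convince yourself that such a comment pattern leaves the line break alone is to try the equivalent regular expression in Python. This is only an illustration with an invented sample line; COMMENT and NEWLINE here are Python patterns, not the ANTLR tokens.

```python
import re

COMMENT = re.compile(r"%[^\r\n]*")  # same idea as '%' ~('\r' | '\n')*
NEWLINE = re.compile(r"\r?\n")

line = "R1 5 0 1.1 % load resistor\n"

comment = COMMENT.search(line)
# The line break is still available right after the comment ends.
newline = NEWLINE.match(line, comment.end())
print(repr(comment.group()), repr(newline.group()))
```

Because the comment stops before the line break, a single newline is enough to end the line whether or not a comment is present.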
I have a regular expression which looks something like this:
(\bee[0-9]{9}in\b)|(\bee[0-9]{9}[a-zA-Z]{2}\b)
Now if the input string is ee123456789ab then the second part of | matches the string. But if the input string is ee123456789in, the first part of | consumes the whole string and the second part doesn't get a chance to match the string. I want both parts of | to have their chance to match the string so that I come to know that both parts were able to match the string. Is it even possible to do that using a regular expression?
You can use lookahead assertions:
^(?=(ee[0-9]{9}in$)?)(?=(ee[0-9]{9}[a-zA-Z]{2}$)?)
This will capture a match in both \1 and \2; if either of the two is empty, then the corresponding part of the regex has not matched.
I've changed the word boundary anchors to start/end of string anchors since you're testing against the entire string, not just substrings.
In Python:
>>> import re
>>> r = re.compile(r"^(?=(ee[0-9]{9}in$)?)(?=(ee[0-9]{9}[a-zA-Z]{2}$)?)")
>>> m = r.match("ee123456789ab")
>>> m.group(1)
>>> m.group(2)
'ee123456789ab'
>>> m = r.match("ee123456789in")
>>> m.group(1)
'ee123456789in'
>>> m.group(2)
'ee123456789in'
Explanation:
^ # Start of string
(?= # Look ahead to see if it's possible to match...
( # and capture...
ee[0-9]{9}in # regex 1
$ # (end of string)
)? # (make the match optional)
) # End of lookahead
(?= # Second lookahead, same idea...
(
ee[0-9]{9}[a-zA-Z]{2}
$
)?
)
It's not possible with regular expressions. If any part of it matches, it's considered a match. You would have to do it with two different expressions and see if both succeeded.
An OR is an OR no matter what; you can't get around that.
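The two-expression approach suggested here could look like the following Python sketch. The dictionary keys in_suffix and two_letters and the helper matching_parts are names made up for the example, and fullmatch is used on the assumption that the whole string is being tested.

```python
import re

# One compiled pattern per alternative of the original expression.
PATTERNS = {
    "in_suffix": re.compile(r"ee[0-9]{9}in"),
    "two_letters": re.compile(r"ee[0-9]{9}[a-zA-Z]{2}"),
}


def matching_parts(text):
    """Return the names of every alternative that matches the whole string."""
    return [name for name, pattern in PATTERNS.items() if pattern.fullmatch(text)]


print(matching_parts("ee123456789ab"))
print(matching_parts("ee123456789in"))
```

Each alternative is tested independently, so one match can never hide the other.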
As @Tim mentioned, it can be done with lookahead(s):
You can stand still and look at the same text more than once.
So, one way is to look at each expression without moving,
each expression is optional. -
(?= ( ee [0-9]{9} in )? )
(?= ( ee [0-9]{9} [a-zA-Z]{3} )? )
This is bad because, although the position will advance after the last
expression, it will only advance 1 inter-character position. It also
allows overlap when searching in a global context.
Searches can be sped up by consuming a character -
(?= ( ee [0-9]{9} in )? )
(?= ( ee [0-9]{9} [a-zA-Z]{3} )? )
.
The engine does an optimization when something is consumed,
advances in chunks (unknown how it decides).
If you have other expressions included with these, it requires that the
position be advanced past here or nothing will match. This could also eliminate
overlapped matching of text (if that's a goal).
It's actually hard to avoid overlap unless you know for sure one expression will
be longer than the other. If that's the case then you could always do a conditional
(if available) to consume the larger text -
(?= ( ee [0-9]{9} in )? )
(?= ( ee [0-9]{9} [a-zA-Z]{3} )? )
(?(2) \2 | \1 )
And, if you know one is a subset of the other, you could just do this -
(?= ( ee [0-9]{9} in )? ) ( ee [0-9]{9} [a-zA-Z]{3} )
Either way, depending on the expressions, much thought has to go into designing
consumption into the regex to avoid overlap.
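As a rough check of the conditional variant, here is a Python sketch. Python's re module supports both verbose mode and (?(2)...|...) conditionals, and a backreference to a group that did not participate simply fails to match. The sample text is invented, and the {3} quantifier is kept exactly as written in the patterns above.

```python
import re

PATTERN = re.compile(
    r"""
    (?= ( ee [0-9]{9} in )? )            # group 1: optional, consumes nothing
    (?= ( ee [0-9]{9} [a-zA-Z]{3} )? )   # group 2: optional, consumes nothing
    (?(2) \2 | \1 )                      # consume the longer text if group 2 matched
    """,
    re.VERBOSE,
)

text = "ee123456789inx ee123456789in"

# Each result is (consumed text, group 1, group 2).
results = [(m.group(0), m.group(1), m.group(2)) for m in PATTERN.finditer(text)]
for row in results:
    print(row)
```

Positions where neither group matched produce no result at all, because the \1 branch cannot match an unset group, so the scan moves on without reporting empty matches.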