Xtext terminal overlapping

Xtext terminal overlapping - editor

I am new to Xtext and I am facing following issue:
Under every "error id :" line i can expect every printable character with spaces/tabs between. My language is indent-based so this "terminal" cannot start with space character.
Edit/:
Example code for this language would look like this:
package somepkg:
error UNKNOWN:
Unknown error.
error ZERO_DIVISION:
Do not divide by zero you {0} donkey!.
Closest i get to this language specification is this:
grammar com.example.lang.ermsglang.Ermsglang with org.eclipse.xtext.xbase.Xbase hidden(WS)
import "http://www.eclipse.org/emf/2002/Ecore" as ecore
generate ermsglang "http://www.example.com/lang/ermsglang/Ermsglang"
Model:
{Model}
'package' name=ENAME ':'
(BEGIN
(expressions+=Error)+
END)?
;
Error:
{Error}
'error' name=ENAME ':'
(BEGIN
(expressions+=Anything)+
END)?
;
Anything:
(ENAME|EMSG|INT)
;
//Terminals must be disjunctive
terminal ENAME:
('_'|'A'..'Z') ('_'|'A'..'Z')*
;
terminal EMSG:
('!'..'/'|':'..'#'|'['..'~')+
;
terminal SL_COMMENT:
'#' !('\n'|'\r')* ('\r'? '\n')?
;
// The following synthetic tokens are used for the indentation-aware blocks
terminal BEGIN: 'synthetic:BEGIN'; // increase indentation
terminal END: 'synthetic:END'; // decrease indentation
But still, this allows either ENAME or EMSG or INT terminals, so you cant mix for example numbers with characters. Problem is terminals have to be disjunctive so if i modify rule "ANYTHING" like this:
terminal ANYTHING:
(ENAME|EMSG|INT)+
;
or
Anything:
(ENAME|EMSG|INT)+
;
will be a problem with lexer/parser which cannot determine which terminal is which. How to deal with this situation? Thanks.
//Edit: Thank to Christian for working example, there is still one problem with SL_COMMENT, in this example second error keyword is highlighted with message
missing RULE_END at 'error'
package A :
error B :
a
#bopsa Akfkfndsfio
error A_C_S :
:aasdasdasd

the follwoing grammar works for me
grammar org.xtext.example.mydsl3.MyDsl hidden (WS, SL_COMMENT)
generate myDsl "http://www.xtext.org/example/mydsl3/MyDsl"
import "http://www.eclipse.org/emf/2002/Ecore" as ecore
Model:
{Model}
'package' name=ENAME ':'
(BEGIN
(expressions+=Error)+
END)?
;
Error:
{Error}
'error' name=ENAME ':'
(BEGIN
(expressions+=Anything)+
END)?
;
Anything:
(ENAME|EMSG|INT|':')
;
//Terminals must be disjunctive
terminal ENAME:
('_'|'A'..'Z'|'a'..'z') ('_'|'A'..'Z'|'a'..'z')*
;
terminal INT returns ecore::EInt: ('0'..'9')+;
terminal EMSG:
('!'..'/'|';'..'#'|'['..'~')+
;
terminal SL_COMMENT:
'#' !('\n'|'\r')* ('\r'? '\n')?
;
// The following synthetic tokens are used for the indentation-aware blocks
terminal BEGIN: 'synthetic:BEGIN'; // increase indentation
terminal END: 'synthetic:END'; // decrease indentation
terminal WS : (' '|'\t'|'\r'|'\n')+;
terminal ANY_OTHER: .;

Related

How to deal with string-like values between xml tags in an xtext grammar

I am attempting to create a strongly defined xml language, but have run into trouble on element values between element tags. I want them to be treated like a string except they are not wrapped in quotes. Here is a basic grammar I created to demonstrate the idea:
grammar org.xtext.example.myxml.MyXml hidden(WS)
generate myXml "http://www.xtext.org/example/myxml/MyXml"
import "http://www.eclipse.org/emf/2002/Ecore" as ecore
Element:
{Element}
'<Element' attributes+=ElementAttribute* ('/>' | '>'
subElement+=SubElement*
'</Element' '>')
;
SubElement:
{SubElement}
'<SubElement' attributes+=SubElementAttribute* ('/>' | '>'
value=ElementValue
'</SubElement' '>')
;
ElementAttribute:
NameAttribute | TypeAttribute
;
SubElementAttribute:
NameAttribute
;
TypeAttribute:
'type' '=' type=STRING
;
NameAttribute:
'name' '=' name=STRING
;
ElementValue hidden():
value=ID
;
terminal STRING:
'"' ( '\\' . /* 'b'|'t'|'n'|'f'|'r'|'u'|'"'|"'"|'\\' */ | !('\\'|'"') )* '"' |
"'" ( '\\' . /* 'b'|'t'|'n'|'f'|'r'|'u'|'"'|"'"|'\\' */ | !('\\'|"'") )* "'"
;
terminal WS: (' '|'\t'|'\r'|'\n')+;
terminal ID: '^'?('a'..'z'|'A'..'Z'|'_'|'0'..'9'|':'|'-'|'('|')')*;
Here is a test to demonstrate its usage:
#Test
def void parseXML() {
val result = parseHelper.parse('''
<Element type="myType" name="myName">
<SubElement>some string:like-stuff here </SubElement>
</Element>
''')
Assert.assertNotNull(result)
val errors = result.eResource.errors
for (error : errors) {
println(error.message)
}
}
The error I get from this exact code is mismatched input 'string:like-stuff' expecting '</SubElement'
Obviously this will not work because ID does not allow for white space, adding white space to ID fixes the above error, but causes other issues parsing. So my question is how can I parse the element value into a string-like representation without causing ambiguity for the parser in other areas. The only way I have been able to get this to work in any form in my full language is by turning the ElementValue into a list of ID's separated by white space. (I could not get it to work on this minimal example however, not sure what is different)

I would not really recommend it because Xtext is usually not the best fit for XML parsing, but it would probably be possible by turning ElementValue into a datatype rule that allows everything that doesn't create an ambiguity.
Something along the lines of:
ElementValue returns ecore::EString hidden(): (ID|WS|STRING|UNMATCHED)+ ;
and at the end of the grammar:
terminal UNMATCHED: .;
You will probably want to make SubElement.value optional to allow for an empty element.
value=ElementValue?

Antlr4 token existence messing up parsing

first time poster so my greatest apologies if I break the rules.
I'm using Antlr4 to create a log parser and I'm running into some issues that I don't understand.
I'm trying to parse the following input log sequence:
USA1-RR-SRX240-EDGE-01 created 10.20.30.40/50985->11.12.13.14/443
With the following grammar:
grammar Juniper;
WS : (' '|'\t')+ -> skip ;
NL : '\r'? '\n' -> skip ;
fragment DIGIT : '0'..'9' ;
NUMBER : DIGIT+ ;
IPADDRESS : NUMBER '.' NUMBER '.' NUMBER '.' NUMBER ;
SLASH : '/' -> skip ;
RIGHTARROW : '->' -> skip ;
CREATED: 'created' -> skip ;
HOSTNAME : [a-zA-Z0-9\-]+ ;
/* Input sample for rule: USA1-RR-SRX240-EDGE-01 created 10.20.30.40/50985->11.12.13.14/443 */
testcase : HOSTNAME WS CREATED WS IPADDRESS SLASH NUMBER RIGHTARROW IPADDRESS SLASH NUMBER NL;
It's failing and I can't for the life of me figure out why. I know that the token recognition error has something to do with the token that I've defined for HOSTNAME containing the dash in the character class but I'm not sure how to fix it.
$ antlr4 Juniper.g4 && javac Juniper*.java && grun Juniper testcase -tree
USA1-RR-SRX240-EDGE-01 created 10.20.30.40/50985->11.12.13.14/443
line 1:48 token recognition error at: '>'
line 1:30 mismatched input '10.20.30.40' expecting WS
(testcase SA1-RR-SRX240-EDGE-01 10.20.30.40 50985- 11.12.13.14 443)
Please note the second line of the above output is data that I paste into grun and then hit enter and hit control+D.
Any assistance on this would be highly appreciated, been banging me head against the keyboard on this for a bit now.

The problem with recognizing -> is that HOSTNAME matches any sequence of letters, numbers and dashes, and that includes 50985-. Since that match is longer than what NUMBER would match (50985), HOSTNAME wins. That's evidently not what you want.
Parsing log lines generally requires a context-sensitive scanner, and standard parser generators -- which are more oriented towards parsing programming languages -- are not always the ideal tool. In this case, for example, HOSTNAME cannot appear in the context in which it is being recognized, so it shouldn't even be in the list of possible tokens.
Of course, you could define a token which consisted of an ip number and port separated by a slash, which would solve the ambiguity, but (in my opinion) that would be suboptimal because you'll end up rescanning that token to parse it.

ANTLR4 - parse a file line-by-line

I try to write a grammar to parse a file line by line.
My grammar looks like this:
grammar simple;
parse: (line NL)* EOF
line
: statement1
| statement2
| // empty line
;
statement1 : KW1 (INT|FLOAT);
statement2 : KW2 INT;
...
NL : '\r'? '\n';
WS : (' '|'\t')-> skip; // toss out whitespace
If the last line in my input file does not have a newline, I get the following error message:
line xx:37 missing NL at <EOF>
Can somebody please explain, how to write a grammar that actually accepts the last line without newline

Simply don't require NL to fall after the last line. This form is efficient, and simplified based on the fact that line can match the empty sequence (essentially the last line will always be the empty line).
// best if `line` can be empty
parse
: line (NL line)* EOF
;
If line was not allowed to be empty, the following rule is efficient and performs the same operation:
// best if `line` cannot be empty
parse
: (line (NL line)* NL?)? EOF
;
The following rule is equivalent for the case where line cannot be empty, but is much less efficient. I'm including it because I frequently see people write rules in this form where it's easy to see that the specific NL token which was made optional is the one following the last line.
// INEFFICIENT!
parse
: (line NL)* (line NL?)? EOF
;

Grammar for a recognizer of a spice-like language

I'm trying to build a grammar for a recognizer of a spice-like language using Antlr-3.1.3 (I use this version because of the Python target). I don't have experience with parsers. I've found a master thesis where the student has done the syntactic analysis of the SPICE 2G6 language and built a parser using the LEX and YACC compiler writing tools. (http://digitool.library.mcgill.ca/R/?func=dbin-jump-full&object_id=60667&local_base=GEN01-MCG02) In chapter 4, he describes a grammar in Backus-Naur form for the SPICE 2G6 language, and appends to the work the LEX and YACC code files of the parser.
I'm basing myself in this work to create a simpler grammar for a recognizer of a more restrictive spice language.
I read the Antlr manual, but could not figure out how to solve two problems, that the code snippet below illustrates.
grammar Najm_teste;
resistor
: RES NODE NODE VALUE 'G2'? COMMENT? NEWLINE
;
// START:tokens
RES : ('R'|'r') DIG+;
NODE : DIG+; // non-negative integer
VALUE : REAL; // non-negative real
fragment
SIG : '+'|'-';
fragment
DIG : '0'..'9';
fragment
EXP : ('E'|'e') SIG? DIG+;
fragment
FLT : (DIG+ '.' DIG+)|('.' DIG+)|(DIG+ '.');
fragment
REAL : (DIG+ EXP?)|(FLT EXP?);
COMMENT : '%' ( options {greedy=false;} : . )* NEWLINE;
NEWLINE : '\r'? '\n';
WS : (' '|'\t')+ {$channel=HIDDEN;};
// END:tokens
In the grammar above, the token NODE is a subset of the set represented by the VALUE token. The grammar correctly interprets an input like "R1 5 0 1.1/n", but cannot interpret an input like "R1 5 0 1/n", because it maps "1" to the token NODE, instead of mapping it to the token VALUE, as NODE comes before VALUE in the tokens section. Given such inputs, does anyone has an idea of how can I map the "1" to the correct token VALUE, or a suggestion of how can I alter the grammar so that I can correctly interpret the input?
The second problem is the presence of a comment at the end of a line. Because the NEWLINE token delimits: (1) the end of a comment; and (2) the end of a line of code. When I include a comment at the end of a line of code, two newline characters are necessary to the parser correctly recognize the line of code, otherwise, just one newline character is necessary. How could I improve this?
Thanks!

Problem 1
The lexer does not "listen" to the parser. The lexer simply creates tokens that contain as much characters as possible. In case two tokens match the same amount of characters, the token defined first will "win". In other words, "1" will always be tokenized as a NODE, even though the parser is trying to match a VALUE.
You can do something like this instead:
resistor
: RES NODE NODE value 'G2'? COMMENT? NEWLINE
;
value : NODE | REAL;
// START:tokens
RES : ('R'|'r') DIG+;
NODE : DIG+;
REAL : (DIG+ EXP?) | (FLT EXP?);
...
E.g., I removed VALUE, added value and removed fragment from REAL
Problem 2
Do not let the comment match the line break:
COMMENT : '%' ~('\r' | '\n')*;
where ~('\r' | '\n')* matches zero or more chars other than line break characters.

antlr html pcdata

Im trying to write very simple HTML parser with ANTLR and Im facing problem, that ~ rule which should match all until specified character is not working.
My lexer grammar:
lexer grammar HtmlParserLexer;
HTML: OHTML PCDATA CHTML;
PCDATA :(~'<') ; //match all until <
OHTML: '<html>';
CHTML: '</html>';
Im trying to match:
<html>foo bar</html>
Error from Eclipse ANTLR plugin Interpreter:
MismatchedTokenException: line 1:7 mismatched input UNKNOW expecting '<'
Which means, that my grammar ignore PCDATA rule and I dont know why.
Thanks in advance for your help.

The rule PCDATA :(~'<') ; matches a single character other than '<'. You'll need to repeat it once or more: PCDATA :(~'<')+ ; (notice the +).
You may also want to allow <html></html> (nothing in between<html> and </html>). In that case, you shouldn't change PCDATA :(~'<')+ ; into PCDATA :(~'<')* ;, but do this instead:
HTML: OHTML PCDATA? CHTML;
PCDATA : (~'<')+ ;
because you shouldn't create lexer rules that could potentially match an empty string.

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart

Xtext terminal overlapping - editor

Related

How to deal with string-like values between xml tags in an xtext grammar

Antlr4 token existence messing up parsing

ANTLR4 - parse a file line-by-line

Grammar for a recognizer of a spice-like language

antlr html pcdata

Categories

Resources