I am parsing a language with some difficult syntax, and I need help or suggestions for tackling it.
Here's a typical line:
IF CLCI((ZNTEM+CHRCNT),1,1H())) EQ 0
The difficult bit is nH(...any characters within the character set...), where n is 1 in this example and the single character in question is a ')'. My lexer has:
fragment Lp: '(';
fragment Rp: ')';
LP: Lp;
RP: Rp; etc....
My current non-working solution is to switch modes in the lexer, because then I can define all the special characters to consume:
// Default mode rules
STRING_SUB: INT 'H' LP -> pushMode(ISLAND) ; // switch to ISLAND mode
and then to switch back
// Special mode of INT H ( ID )
// ID is the string substitute, which can include spaces, backslashes and other special chars
mode ISLAND;
ISLAND_CLOSE : RP -> popMode ; // back to normal mode
ID : SpecialChars+; // match/send ID in tag to parser
fragment SpecialChars: '\u0020'..'\u0027' | '\u002A'..'\u0060' | '\u0061'..'\u007E' | '¦';
But obviously the popMode trigger is the ')', which fails in this particular example because the payload is itself an RP. Any suggestions?
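To make the difficulty concrete: if n really is the number of payload characters, as the example suggests, the payload has to be read by count rather than by looking for a closing delimiter. The following is only a rough Python sketch of that idea, outside ANTLR; scan_hollerith is a hypothetical helper, and the assumption that n counts the characters between the parentheses comes from the single example above.

```python
import re

# Hypothetical helper, not part of ANTLR: recognise nH(payload),
# assuming n is the number of payload characters.
_PREFIX = re.compile(r"(\d+)H\(")

def scan_hollerith(text, pos=0):
    """Return (payload, end_index) for an nH(...) constant at pos, or None."""
    m = _PREFIX.match(text, pos)
    if m is None:
        return None
    n = int(m.group(1))
    start = m.end()
    end = start + n
    payload = text[start:end]
    # The payload is taken by count, so it may itself contain ')'.
    if len(payload) != n or text[end:end + 1] != ")":
        return None
    return payload, end + 1

line = "IF CLCI((ZNTEM+CHRCNT),1,1H())) EQ 0"
print(scan_hollerith(line, line.index("1H(")))
```

Inside an ANTLR lexer the same counting would presumably have to live in a lexer action or predicate, since a plain mode switch has no notion of "exactly n characters".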
I'm using ANTLR4 to try to parse code that has asterisk-leading comments, like:
* This is a comment
I was initially having issues with multiplication expressions getting mistaken for these comments, so decided to make my lexer rule:
LINE_COMMENT : '\r\n' '*' ~[\r\n]* ;
This forces there to be a newline so it doesn't see 2 * 3, with '* 3' being a comment.
This worked just fine until I had code that starts with a comment on the first line, which does not have a newline to begin with. For example:
* This is the first line of the code's file\r\n
* This is the second line of the code's file\r\n
I have also tried the {getCharPositionInLine==x}? to make sure that it only recognizes a comment if there is an asterisk or spaces/tabs coming first in the current line. This works when using
antlr4 *.g4
but it does not work with my JavaScript parser, generated using
antlr4 -Dlanguage=JavaScript *.g4
Is there a way to get the same results of {getCharPositionInLine==x}? with my JavaScript parser or some way to prevent multiplication from being recognized as a comment? I should also mention that this coding language doesn't use semicolons at the end of lines.
I've tried playing around with this simple grammar, but I haven't had any luck.
grammar wow;
program : expression | Comment ;
expression : expression '*' expression
| NUMBER ;
Comment : '*' ~[\r\n]*;
NUMBER : [0-9]+ ;
Asterisk : '*' ;
Space : ' ' -> skip;
and using a test file: test.txt
5 * 5
Make the comment rule match at least one more non-whitespace character, otherwise it could match the same content as the Asterisk rule, like so:
Comment: '*' ' '* ~[ \r\n] ~[\r\n]*;
Do comments have to be at the beginning of a line?
If so, you can check that with this._tokenStartCharPositionInLine == 0 and use a lexer rule like this:
Comment : '*' ~[\r\n]* {this._tokenStartCharPositionInLine == 0}?;
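The effect of the column check can be pictured outside ANTLR as follows. This is a rough Python sketch, not the ANTLR runtime API; classify_asterisks is a hypothetical helper that treats a '*' in column 0 as the start of a comment and any other '*' as a multiplication sign.

```python
def classify_asterisks(source):
    """Return a list of (line_no, column, kind) for the asterisks in source.

    kind is 'comment' when the asterisk is in column 0, else 'multiply'.
    """
    result = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        if line.startswith("*"):
            result.append((line_no, 0, "comment"))
            continue  # the rest of the line belongs to the comment
        for col, ch in enumerate(line):
            if ch == "*":
                result.append((line_no, col, "multiply"))
    return result

sample = "* first-line comment\n5 * 5\n* another comment * with a star\n"
print(classify_asterisks(sample))
```

Note that a comment on the very first line is handled without needing a preceding line break.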
If not, you should track the previous tokens to know whether a multiplication is possible at this point (for example, after your NUMBER rule), so you could write something like this (Java code):
@lexer::header {
import java.util.HashSet;
import java.util.Set;
}
@lexer::members {
    // Token types after which a '*' is a multiplication, not a comment start
    private static final Set<Integer> MULTIPLIABLE_TOKENS = new HashSet<>();
    static {
        MULTIPLIABLE_TOKENS.add(NUMBER);
    }
    private boolean canBeMultiplied = false;
    @Override
    public void emit(final Token token) {
        final int type = token.getType();
        // Whitespace and Newline are assumed to be token rules of the grammar;
        // they leave the state unchanged
        if (type != Whitespace && type != Newline) {
            canBeMultiplied = MULTIPLIABLE_TOKENS.contains(type);
        }
        super.emit(token);
    }
}
Comment : {!canBeMultiplied}? '*' ~[\r\n]*;
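The same bookkeeping can be tried out in a few lines of Python. This is only a sketch of the idea, not ANTLR code: tokenize, TOKEN, COMMENT and MULTIPLIABLE are hypothetical names, and the token set is reduced to numbers, asterisks and whitespace.

```python
import re

TOKEN = re.compile(
    r"(?P<NUMBER>\d+)|(?P<STAR>\*)|(?P<NEWLINE>\r?\n)|(?P<WS>[ \t]+)"
)
COMMENT = re.compile(r"\*[^\r\n]*")
MULTIPLIABLE = {"NUMBER"}  # token kinds that may be followed by a '*' operator

def tokenize(source):
    tokens = []
    can_be_multiplied = False
    pos = 0
    while pos < len(source):
        m = TOKEN.match(source, pos)
        if m is None:
            raise ValueError(f"unexpected character at {pos}")
        kind, text = m.lastgroup, m.group()
        if kind == "STAR" and not can_be_multiplied:
            # No operand before the asterisk: the rest of the line is a comment.
            kind, text = "COMMENT", COMMENT.match(source, pos).group()
        if kind not in ("WS", "NEWLINE"):
            # Whitespace and line breaks leave the flag untouched.
            can_be_multiplied = kind in MULTIPLIABLE
            tokens.append((kind, text))
        pos += len(text)
    return tokens

print(tokenize("* note\n5 * 5"))
```

Because line breaks do not reset the flag, a comment line directly after a line that ends in a number would be read as a multiplication in this sketch; resetting the flag on a line break would be one way to handle that.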
UPDATE
If you need the equivalent functions for JavaScript, take a look at the runtime sources, in particular Lexer.js.
I am attempting to create a strongly defined XML language, but have run into trouble with element values between element tags. I want them to be treated like strings, except that they are not wrapped in quotes. Here is a basic grammar I created to demonstrate the idea:
grammar org.xtext.example.myxml.MyXml hidden(WS)
generate myXml "http://www.xtext.org/example/myxml/MyXml"
import "http://www.eclipse.org/emf/2002/Ecore" as ecore
Element:
{Element}
'<Element' attributes+=ElementAttribute* ('/>' | '>'
subElement+=SubElement*
'</Element' '>')
;
SubElement:
{SubElement}
'<SubElement' attributes+=SubElementAttribute* ('/>' | '>'
value=ElementValue
'</SubElement' '>')
;
ElementAttribute:
NameAttribute | TypeAttribute
;
SubElementAttribute:
NameAttribute
;
TypeAttribute:
'type' '=' type=STRING
;
NameAttribute:
'name' '=' name=STRING
;
ElementValue hidden():
value=ID
;
terminal STRING:
'"' ( '\\' . /* 'b'|'t'|'n'|'f'|'r'|'u'|'"'|"'"|'\\' */ | !('\\'|'"') )* '"' |
"'" ( '\\' . /* 'b'|'t'|'n'|'f'|'r'|'u'|'"'|"'"|'\\' */ | !('\\'|"'") )* "'"
;
terminal WS: (' '|'\t'|'\r'|'\n')+;
terminal ID: '^'?('a'..'z'|'A'..'Z'|'_'|'0'..'9'|':'|'-'|'('|')')*;
Here is a test to demonstrate its usage:
@Test
def void parseXML() {
val result = parseHelper.parse('''
<Element type="myType" name="myName">
<SubElement>some string:like-stuff here </SubElement>
</Element>
''')
Assert.assertNotNull(result)
val errors = result.eResource.errors
for (error : errors) {
println(error.message)
}
}
The error I get from this exact code is: mismatched input 'string:like-stuff' expecting '</SubElement'
Obviously this will not work because ID does not allow white space. Adding white space to ID fixes the above error, but causes other parsing issues. So my question is: how can I parse the element value into a string-like representation without causing ambiguity for the parser in other areas? The only way I have been able to get this to work in any form in my full language is by turning ElementValue into a list of IDs separated by white space. (I could not get that to work on this minimal example, however; I'm not sure what is different.)
I would not really recommend it because Xtext is usually not the best fit for XML parsing, but it would probably be possible by turning ElementValue into a datatype rule that allows everything that doesn't create an ambiguity.
Something along the lines of:
ElementValue returns ecore::EString hidden(): (ID|WS|STRING|UNMATCHED)+ ;
and at the end of the grammar:
terminal UNMATCHED: .;
You will probably want to make SubElement.value optional to allow for an empty element.
value=ElementValue?
I have the following island grammar, which works fine (and, I think, as expected):
lexer grammar FastTestLexer;
// Default mode rules (the SEA)
OPEN1 : '#' -> mode(ISLAND) ; // switch to ISLAND mode
OPEN2 : '##' -> mode(ISLAND);
OPEN3 : '###' -> mode(ISLAND);
OPEN4 : '####' -> mode(ISLAND);
LISTING_OPEN : '~~~~~' -> mode(LISTING);
NL : [\r\n]+;
TEXT : ~('#'|'~')+; // ~('#'|'~')+ ; // clump all text together
mode ISLAND;
CLOSE1 : '#' -> mode(DEFAULT_MODE) ; // back to SEA mode
CLOSE2 : '##' -> mode(DEFAULT_MODE) ; // back to SEA mode
CLOSE3 : '###' -> mode(DEFAULT_MODE) ; // back to SEA mode
CLOSE4 : '####' -> mode(DEFAULT_MODE) ; // back to SEA mode
INLINE : ~'#'+ ; // clump all text together
mode LISTING;
LISTING_CLOSE : '~~~~~' -> mode(DEFAULT_MODE);
INLINE_LISTING : ~'~'+; //~('~'|'#')+;
And the parser grammar:
parser grammar FastTextParser;
options { tokenVocab=FastTestLexer; } // use tokens from FastTestLexer.g4
dnpMD
: subheadline NL headline NL lead (subheading | listing | text | NL)*
;
headline
: OPEN1 INLINE CLOSE1
;
subheadline
: OPEN2 INLINE CLOSE2
;
lead
: OPEN3 INLINE CLOSE3
;
subheading
: OPEN4 INLINE CLOSE4
;
listing
: LISTING_OPEN INLINE_LISTING LISTING_CLOSE
;
text
: TEXT
;
Input text like this works fine:
## Heading2 ##
# Heading1 #
### Heading3 ###
fffff
#### Heading4 ####
I'm a line.
~~~~~
ffffff
~~~~~
I'm a line, too.
#### Heading4a ####
The TEXT lexer token matches all the text, except of course '#' and '~', so the parser knows when headings and listings are coming.
My problem is that both characters, '#' and '~', should be allowed within the text. The single '#' is only needed for the heading, and this parser rule is not active within the body (there is just one heading at the beginning of the document).
Is there a way to allow '#' and '~' within the text without escaping? My first thought was to disallow '##' within the text:
TEXT : ~('##'|'~')+;
But multiple characters are not allowed there. :(
Maybe someone can give me a hint. But I think this isn't solvable at all, at least not with ANTLR4. Maybe there's another technology.
You could try to do more work in the parser and less in the lexer. Allow # and ~ inside text and not inside TEXT, something similar to:
text
: TEXT
| OPEN1
| TEXT text
| OPEN1 text
;
Adjust the rules for the headlines etc. accordingly.
That way, the lexer does not have to decide what a # (or ~) means, which can be relatively hard because the lexer does not really know the context; it only reports that it has seen a hash sign. Instead, the parser, which knows the context in which the sign appears, decides on its meaning.
I'm going crazy trying to generate a parser grammar with ANTLR.
I've got a plain text file like:
Diagram : VW 503 FSX 09/02/2015 12/02/2015 STP
Fleet : AAAA
OFF :
AAA 05+44 5R06
KKK 05+55 06.04 1R06 5530
ZZZ 06.24 06.30 1R06 5530
YYY 07.53 REVRSE
YYY 08.23 9G98 5070
WORKS :
MILES :(LD) 1288.35 (ETY) 3.18 (TOT) 1291.53
Each "Diagram" entity is contained between "Diagram :" and the "(TOT)" before EOF.
Multiple "Diagram" entities can be present in the same plain text file.
I've done some tests with ANTLR:
grammar Hello2;
xxxt : diagram+;
diagram : DIAGRAM_ini txt fleet LEGS+ DIAGRAM_end;
txt : TEXT;
fleet : FLEET_INI txt;
num : NUMBER;
// Lexer Rules
DIAGRAM_ini : 'Diagram :';
DIAGRAM_end : '(TOT)' ;
LEGS : ('AAA' | 'KKK' | 'ZZZ' | 'YYY') ;
FLEET_INI : 'Fleet :';
TEXT : ('a'..'z')+ ;
NUMBER: ('0'..'9') ;
WHITESPACE : ( '\t' | ' ' | '\r' | '\n'| '\u000C' )+ -> skip ;
My Goal is to be able to parse Diagrams recursively, and gather all LEGS text/number.
Any help/tips is much more than appreciated!
Many Thanks
Regs
S.
I suggest not parsing the file like you did. This file does not define a language with words and grammar, but rather a formatted text of chars:
The formatting conventions are rather weak
The labels before the colon cannot serve as tokens since they may reappear in the body (AAA (=label) vs AAAA (=body))
The tokens must be very primitive to fit these requirements
Solution with ANTLR
You need a weaker grammar to solve this problem, e.g.
grammar diagrams;
diagrams : diagram+ ;
diagram : section+ ;
section : WORD ':' body? ;
body : textline+;
textline : (WORD | NUMBER | SIGNS)* ('\r' | '\n')+;
WORD : LETTER+ ;
NUMBER : DIGIT+ ;
SIGNS : SIGN+ ;
WHITESPACE : ( '\t' | ' ' )+ -> skip ;
fragment LETTER : ('a'..'z' | 'A'..'Z') ;
fragment SIGN : ('.'|'+'|'('|')'|'/') ;
fragment DIGIT : ('0'..'9') ;
Run a visitor on the parsing result:
to build up the normalized text of the body
to filter the LEGS lines out of the body
to parse a LEGS line with another parser (a regexp parser would be sufficient here, but you could also define another ANTLR parser)
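The last two steps could look roughly like the Python sketch below. It is a sketch only: extract_legs and LEG_LINE are hypothetical names, and it assumes that a LEGS line starts with a three-letter uppercase code followed by whitespace-separated fields and contains no colon (which is what separates it from the section headers in the sample file).

```python
import re

# Assumption: a LEGS line is a three-letter uppercase code followed by
# whitespace-separated fields, and contains no colon (section headers do).
LEG_LINE = re.compile(r"^([A-Z]{3})\s+([^:]+)$")

def extract_legs(body):
    """Return a list of (code, fields) for every line that looks like a leg."""
    legs = []
    for raw in body.splitlines():
        m = LEG_LINE.match(raw.strip())
        if m:
            legs.append((m.group(1), m.group(2).split()))
    return legs

sample = """Fleet : AAAA
OFF :
AAA 05+44 5R06
KKK 05+55 06.04 1R06 5530
YYY 07.53 REVRSE
WORKS :
"""
print(extract_legs(sample))
```

What the individual fields mean (times, codes, numbers) is not stated in the question, so the sketch leaves them as plain strings.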
Another alternative:
Try out Packrat parsing (e.g. parboiled)
it is more comprehensible (especially for people with little experience in compiler construction)
it is a better match for your grammar design
parboiled is pure Java (grammar specified in Java)
Disadvantages:
Whitespace handling must be done in Parser Rules
Debugging/Error Messages are a problem (with all packrat parsers)
I'm trying to build a grammar for a recognizer of a SPICE-like language using ANTLR 3.1.3 (I use this version because of the Python target). I don't have experience with parsers. I've found a master's thesis in which the student did the syntactic analysis of the SPICE 2G6 language and built a parser using the LEX and YACC compiler-writing tools (http://digitool.library.mcgill.ca/R/?func=dbin-jump-full&object_id=60667&local_base=GEN01-MCG02). In chapter 4, he describes a grammar in Backus-Naur form for the SPICE 2G6 language, and appends the LEX and YACC code files of the parser to the work.
I'm basing my work on it to create a simpler grammar for a recognizer of a more restrictive SPICE language.
I read the Antlr manual, but could not figure out how to solve two problems, that the code snippet below illustrates.
grammar Najm_teste;
resistor
: RES NODE NODE VALUE 'G2'? COMMENT? NEWLINE
;
// START:tokens
RES : ('R'|'r') DIG+;
NODE : DIG+; // non-negative integer
VALUE : REAL; // non-negative real
fragment
SIG : '+'|'-';
fragment
DIG : '0'..'9';
fragment
EXP : ('E'|'e') SIG? DIG+;
fragment
FLT : (DIG+ '.' DIG+)|('.' DIG+)|(DIG+ '.');
fragment
REAL : (DIG+ EXP?)|(FLT EXP?);
COMMENT : '%' ( options {greedy=false;} : . )* NEWLINE;
NEWLINE : '\r'? '\n';
WS : (' '|'\t')+ {$channel=HIDDEN;};
// END:tokens
In the grammar above, the token NODE is a subset of the set represented by the VALUE token. The grammar correctly interprets an input like "R1 5 0 1.1\n", but cannot interpret an input like "R1 5 0 1\n", because it maps "1" to the token NODE instead of VALUE, as NODE comes before VALUE in the tokens section. Given such inputs, does anyone have an idea of how I can map the "1" to the correct token VALUE, or a suggestion for how I can alter the grammar so that it interprets the input correctly?
The second problem is the presence of a comment at the end of a line. The NEWLINE token delimits both (1) the end of a comment and (2) the end of a line of code. When I include a comment at the end of a line of code, two newline characters are necessary for the parser to recognize the line correctly; otherwise, just one newline character is necessary. How could I improve this?
Thanks!
Problem 1
The lexer does not "listen" to the parser. The lexer simply creates tokens that contain as many characters as possible. If two rules match the same number of characters, the rule defined first "wins". In other words, "1" will always be tokenized as a NODE, even though the parser is trying to match a VALUE.
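The longest-match rule with its first-rule tie-break can be imitated in a few lines of Python. This is just a sketch of the principle, not how ANTLR is implemented; RULES and next_token are hypothetical names, and the REAL pattern is a regex rendering of the grammar's FLT and EXP fragments.

```python
import re

# Rules in definition order, mirroring NODE before REAL in the grammar.
RULES = [
    ("NODE", re.compile(r"\d+")),
    ("REAL", re.compile(r"(?:\d+\.\d*|\.\d+|\d+)(?:[Ee][+-]?\d+)?")),
]

def next_token(text, pos=0):
    """Pick the longest match; on a tie the rule defined first wins."""
    best = None
    for name, pattern in RULES:
        m = pattern.match(text, pos)
        # Strictly longer only, so an earlier rule keeps the token on a tie.
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best

print(next_token("1"))    # both rules match one character
print(next_token("1.1"))  # only REAL can consume all three characters
```

Swapping the two entries in RULES would turn "1" into a REAL, which is why the order of the lexer rules matters.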
You can do something like this instead:
resistor
: RES NODE NODE value 'G2'? COMMENT? NEWLINE
;
value : NODE | REAL;
// START:tokens
RES : ('R'|'r') DIG+;
NODE : DIG+;
REAL : (DIG+ EXP?) | (FLT EXP?);
...
I.e., I removed VALUE, added the parser rule value, and removed fragment from REAL.
Problem 2
Do not let the comment match the line break:
COMMENT : '%' ~('\r' | '\n')*;
where ~('\r' | '\n')* matches zero or more chars other than line break characters.
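A quick way to convince yourself of the effect is to try the equivalent regular expression in Python. This is a sketch only; split_comment is a hypothetical helper and COMMENT is a regex counterpart of the lexer rule above.

```python
import re

COMMENT = re.compile(r"%[^\r\n]*")  # counterpart of '%' ~('\r' | '\n')*
NEWLINE = re.compile(r"\r?\n")

def split_comment(line):
    """Return (comment, rest) where rest is whatever follows the comment."""
    m = COMMENT.search(line)
    if m is None:
        return None, line
    return m.group(), line[m.end():]

comment, rest = split_comment("R1 5 0 1.1 % load resistor\n")
print(repr(comment), repr(rest))
```

The line break stays in the input after the comment, so a single NEWLINE is still available to end the resistor line.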