Antlr3 parser path command shell - path

I need to parse shell commands such as:
cp /home/test /home/test2
My problem is parsing the paths correctly.
I defined a rule for paths (I cannot use a token for the path; I need to define it in the parser):
path : ('/' ID)+;
with
ID: ('A'..'Z' | 'a'..'z')+;
WS: (' ') {$channel = HIDDEN;};
I need to keep the WS token hidden, but this causes the two paths in this example to be treated as a single path.
How can I solve this problem?
Thanks

With a little playing around in ANTLRWorks I was able to get this to work:
commands
: command+ EOF;
command
: (CMD first=path second=path '\n') {System.out.println("Command found, first path:" + $first.text + ", and second path:" + $second.text + "\n");};
path : FILE {System.out.println("file is:" + $FILE.text);};
fragment
ID: ('A'..'Z'|'a'..'z')('A'..'Z'|'a'..'z'|'0'..'9')+;
CMD
: ID;
FILE
: ('/' ID)+;
WS: (' '|'\t'|'\r'|'\n') {$channel = HIDDEN;};
Note that I had to create a few more lexer rules and then try out different parser rules. I used the Java target; you can use whatever target you want.
Oh yeah, each command has to be on a separate line because of the '\n' in the command rule.
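For what it's worth, here is a rough plain-Java sketch (no ANTLR involved) of why a lexer-level FILE token keeps the two paths apart while a parser-level ('/' ID)+ rule does not. The class and method names (PathTokenDemo, slashAndIdTokens, consumePath, fileTokens) are made up for illustration, and the regular expressions only approximate the ID and FILE rules above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PathTokenDemo {
    // Approximation of the ID rule above: a letter followed by one or more letters or digits.
    static final String ID = "[A-Za-z][A-Za-z0-9]+";

    // Parser-level path: '/' and ID are separate tokens and whitespace is thrown away,
    // which is roughly what a hidden WS channel looks like from the parser's side.
    static List<String> slashAndIdTokens(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = Pattern.compile("/|" + ID).matcher(input);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    // Greedy ('/' ID)+ starting at index start; returns how many tokens it consumes.
    static int consumePath(List<String> tokens, int start) {
        int i = start;
        while (i + 1 < tokens.size()
                && tokens.get(i).equals("/")
                && tokens.get(i + 1).matches(ID)) {
            i += 2;
        }
        return i - start;
    }

    // Lexer-level path: one FILE token per ('/' ID)+ run; the space ends the match.
    static List<String> fileTokens(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = Pattern.compile("(?:/" + ID + ")+").matcher(input);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String input = "cp /home/test /home/test2";
        List<String> tokens = slashAndIdTokens(input);
        System.out.println(tokens);                 // [cp, /, home, /, test, /, home, /, test2]
        System.out.println(consumePath(tokens, 1)); // 8: both paths are swallowed as one
        System.out.println(fileTokens(input));      // [/home/test, /home/test2]
    }
}
```

With whitespace dropped, the parser-level rule sees eight '/' and ID tokens in a row and cannot tell where the first path ends. The FILE pattern stops at the space because the lexer still sees it.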

Ok, based on your comment, how about something like this:
commands
: command+ EOF;
command
: (ID ' ' (path)+ ' ' (path)+ '\n') {System.out.println("Command found:" + $command.text + "\n");};
path :
('/' ID)+ {System.out.println("path is:" + $path.text);};
ID: ('A'..'Z'|'a'..'z')('A'..'Z'|'a'..'z'|'0'..'9')+;
WS: (' '|'\t'|'\r'|'\n') {$channel = HIDDEN;};
Again, I was able to get this working in ANTLRWorks quickly, and it appears to work with the cp command listed above. But personally I don't like this as much, since your path is now a list of four tokens that I could not quickly and easily split apart. So you might need a rule between command and path (since I would assume your shell has some commands that work with files while others work on directories).
I am also hoping the ID and WS lexer rules are what you want.

How to match [BOF]"Begin of file" in Antlr4 Lexer?

In one Antlr4 syntax, I need the comment (// xxxx) to be always at the start of a line.
The following grammar works fine for most cases.
grammar com;
comment: COMMENT;
COMMENT
: '\n' '//' .*? '\n'
;
By design, it will match \n//comment\n but not //comment\n. But I also want it to match <BOF>//comment\n. How can I implement it?
You may find that this check is better handled after parsing, in a semantic validation pass over your parse tree. (NOTE: A parser is not required to recognize ONLY valid input, just to interpret the input correctly in the only way it can be understood.)
For example, does "// might be a comment" have some other, alternate interpretation if it's not at the beginning of the line?
If not, I would probably just accept the // comment ...\n as a token regardless of its position in the line.
Then, once you have the parse tree, you can check that your comments always have a column of 0. Doing it this way, your grammar is not tied to a particular target language and, perhaps more importantly, you can give a "nice" error message like "Comments must begin in the first column of a line".
If you try to handle this in the lexer (or parser), then, if the comment is NOT in the correct column, you'll get a much more obscure recognition error that will be more difficult for users to understand.
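A minimal plain-Java sketch of that post-parse check, assuming made-up names (CommentColumnCheck, CommentToken, findComments, validate). CommentToken stands in for the line and column information that an ANTLR Token exposes via getLine() and getCharPositionInLine(), and findComments is only a crude substitute for a lexer that accepts // comments at any column:

```java
import java.util.ArrayList;
import java.util.List;

class CommentColumnCheck {
    // Hypothetical stand-in for the position info carried by an ANTLR Token.
    static final class CommentToken {
        final String text;
        final int line;   // 1-based, like Token.getLine()
        final int column; // 0-based, like Token.getCharPositionInLine()

        CommentToken(String text, int line, int column) {
            this.text = text;
            this.line = line;
            this.column = column;
        }
    }

    // Accepts a "//" comment anywhere on a line and records where it starts.
    static List<CommentToken> findComments(String source) {
        List<CommentToken> comments = new ArrayList<>();
        String[] lines = source.split("\n", -1);
        for (int i = 0; i < lines.length; i++) {
            int column = lines[i].indexOf("//");
            if (column >= 0) {
                comments.add(new CommentToken(lines[i].substring(column), i + 1, column));
            }
        }
        return comments;
    }

    // The semantic pass: every comment must start in column 0.
    static List<String> validate(List<CommentToken> comments) {
        List<String> errors = new ArrayList<>();
        for (CommentToken c : comments) {
            if (c.column != 0) {
                errors.add("line " + c.line + ":" + c.column
                        + " Comments must begin in the first column of a line");
            }
        }
        return errors;
    }

    public static void main(String[] args) {
        String input = "// start\n// middle\n?//...\n// end";
        for (String error : validate(findComments(input))) {
            System.out.println(error);
        }
    }
}
```

In a real ANTLR setup the same check would run over the COMMENT tokens in the token stream or parse tree instead of over raw lines.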
That is not possible in a language agnostic way. You will have to add target specific code in your grammar and use a predicate to check if the char position is 0:
COMMENT
: {getCharPositionInLine() == 0}? '//' ~[\r\n]*
;
OTHER
: .
;
If you now tokenize the input:
// start
// middle
?//...
// end
with the Java code:
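import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.Token;

// comLexer is the lexer class ANTLR generates from the grammar above
String input = "// start\n// middle\n?//...\n// end";
comLexer lexer = new comLexer(CharStreams.fromString(input));
CommonTokenStream stream = new CommonTokenStream(lexer);
stream.fill(); // force the lexer to tokenize the whole input
for (Token t : stream.getTokens()) {
    System.out.printf("%-10s'%s'%n",
        comLexer.VOCABULARY.getSymbolicName(t.getType()),
        t.getText().replace("\n", "\\n"));
}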
the following will be printed to your console:
COMMENT   '// start'
OTHER     '\n'
COMMENT   '// middle'
OTHER     '\n'
OTHER     '?'
OTHER     '/'
OTHER     '/'
OTHER     '.'
OTHER     '.'
OTHER     '.'
OTHER     '\n'
COMMENT   '// end'
EOF       '<EOF>'
Note that I also removed the \n at the end of the COMMENT, otherwise a comment at the end of the input would not be matched.
EDIT
How can I do it with JavaScript? I cannot find good examples on the internet.
By looking at the JavaScript source, it looks like {this.column === 0}? is the JavaScript equivalent of {getCharPositionInLine() == 0}?
By the way, does the IntelliJ plugin support predicates? If it does, does it support only Java?
No, the IntelliJ plugin ignores predicates. After all, the code inside a predicate can be any arbitrary chunk of code, making it quite hard to support.

Xtext terminal overlapping

I am new to Xtext and I am facing following issue:
Under every "error id :" line I can expect any printable characters, with spaces/tabs in between. My language is indentation-based, so this "terminal" cannot start with a space character.
Edit:
Example code for this language would look like this:
package somepkg:
error UNKNOWN:
Unknown error.
error ZERO_DIVISION:
Do not divide by zero you {0} donkey!.
The closest I have gotten to this language specification is this:
grammar com.example.lang.ermsglang.Ermsglang with org.eclipse.xtext.xbase.Xbase hidden(WS)
import "http://www.eclipse.org/emf/2002/Ecore" as ecore
generate ermsglang "http://www.example.com/lang/ermsglang/Ermsglang"
Model:
{Model}
'package' name=ENAME ':'
(BEGIN
(expressions+=Error)+
END)?
;
Error:
{Error}
'error' name=ENAME ':'
(BEGIN
(expressions+=Anything)+
END)?
;
Anything:
(ENAME|EMSG|INT)
;
//Terminals must be disjunctive
terminal ENAME:
('_'|'A'..'Z') ('_'|'A'..'Z')*
;
terminal EMSG:
('!'..'/'|':'..'#'|'['..'~')+
;
terminal SL_COMMENT:
'#' !('\n'|'\r')* ('\r'? '\n')?
;
// The following synthetic tokens are used for the indentation-aware blocks
terminal BEGIN: 'synthetic:BEGIN'; // increase indentation
terminal END: 'synthetic:END'; // decrease indentation
But this still only allows ENAME, EMSG, or INT terminals, so you can't mix, for example, numbers with characters. The problem is that terminals must not overlap, so if I modify the rule "ANYTHING" like this:
terminal ANYTHING:
(ENAME|EMSG|INT)+
;
or
Anything:
(ENAME|EMSG|INT)+
;
there will be a problem with the lexer/parser, which cannot determine which terminal is which. How should I deal with this situation? Thanks.
//Edit: Thanks to Christian for the working example. There is still one problem with SL_COMMENT: in this example the second error keyword is highlighted with the message
missing RULE_END at 'error'
package A :
error B :
a
#bopsa Akfkfndsfio
error A_C_S :
:aasdasdasd
The following grammar works for me:
grammar org.xtext.example.mydsl3.MyDsl hidden (WS, SL_COMMENT)
generate myDsl "http://www.xtext.org/example/mydsl3/MyDsl"
import "http://www.eclipse.org/emf/2002/Ecore" as ecore
Model:
{Model}
'package' name=ENAME ':'
(BEGIN
(expressions+=Error)+
END)?
;
Error:
{Error}
'error' name=ENAME ':'
(BEGIN
(expressions+=Anything)+
END)?
;
Anything:
(ENAME|EMSG|INT|':')
;
//Terminals must be disjunctive
terminal ENAME:
('_'|'A'..'Z'|'a'..'z') ('_'|'A'..'Z'|'a'..'z')*
;
terminal INT returns ecore::EInt: ('0'..'9')+;
terminal EMSG:
('!'..'/'|';'..'#'|'['..'~')+
;
terminal SL_COMMENT:
'#' !('\n'|'\r')* ('\r'? '\n')?
;
// The following synthetic tokens are used for the indentation-aware blocks
terminal BEGIN: 'synthetic:BEGIN'; // increase indentation
terminal END: 'synthetic:END'; // decrease indentation
terminal WS : (' '|'\t'|'\r'|'\n')+;
terminal ANY_OTHER: .;

XTEXT: Avoiding grammar match when used as a parameter

I'm still new to Xtext, so my apologies if this is a simple question.
I have a custom scripting language, that I am attempting to use XTEXT for syntax checking only. The language has one command per line, and has the format:
COMMAND:PARAMETERS
I have run into an issue when a parameter for a command is also a command keyword. The relevant part of the grammar file:
Model:
(commands += AbstractCommand)*
;
AbstractCommand:
Command1 | Command2
;
Command1:
command = 'command1' ':' value = Parameter
;
Command2:
command = 'command2' ':' value = Parameter
;
Parameter:
value = QualifiedParameter
;
QualifiedParameter:
(ID | ' ' | INT | '.' | '-' )+
;
The problem arises when one of the commands uses another command as its parameter. The rules of the language don't allow an actual second command on the same line. In this case, it is just plain text that happens to have the same value as an existing command. For example, assume Command1 and Command2 each expect a complete sentence as their parameter. Some sample valid commands would be:
Command1:This is a sentence
Command2:This is also a sentence
Command1:This sentence has Command2 in it
All 3 commands are valid, but the last line will generate an error "missing ":" at " ", because "Command2" has its own rules for parsing.
I've been reading the XTEXT documentation, and it seems like I can use first token set predicates to avoid reading the second token when the first is identified, but I cannot find any examples of this.
I am not sure I understand your question. Maybe what you are looking for is the following:
Model: greetings+=Greeting*;
Greeting: "Hello" name=MyID "!";
MyID: "Hello" | ID;
This now allows you to parse:
Hello You!
Hello Hello!

Antlr4 token existence messing up parsing

First-time poster, so my apologies if I break the rules.
I'm using Antlr4 to create a log parser and I'm running into some issues that I don't understand.
I'm trying to parse the following input log sequence:
USA1-RR-SRX240-EDGE-01 created 10.20.30.40/50985->11.12.13.14/443
With the following grammar:
grammar Juniper;
WS : (' '|'\t')+ -> skip ;
NL : '\r'? '\n' -> skip ;
fragment DIGIT : '0'..'9' ;
NUMBER : DIGIT+ ;
IPADDRESS : NUMBER '.' NUMBER '.' NUMBER '.' NUMBER ;
SLASH : '/' -> skip ;
RIGHTARROW : '->' -> skip ;
CREATED: 'created' -> skip ;
HOSTNAME : [a-zA-Z0-9\-]+ ;
/* Input sample for rule: USA1-RR-SRX240-EDGE-01 created 10.20.30.40/50985->11.12.13.14/443 */
testcase : HOSTNAME WS CREATED WS IPADDRESS SLASH NUMBER RIGHTARROW IPADDRESS SLASH NUMBER NL;
It's failing and I can't for the life of me figure out why. I know that the token recognition error has something to do with the token that I've defined for HOSTNAME containing the dash in the character class but I'm not sure how to fix it.
$ antlr4 Juniper.g4 && javac Juniper*.java && grun Juniper testcase -tree
USA1-RR-SRX240-EDGE-01 created 10.20.30.40/50985->11.12.13.14/443
line 1:48 token recognition error at: '>'
line 1:30 mismatched input '10.20.30.40' expecting WS
(testcase SA1-RR-SRX240-EDGE-01 10.20.30.40 50985- 11.12.13.14 443)
Please note the second line of the above output is data that I paste into grun and then hit enter and hit control+D.
Any assistance on this would be highly appreciated; I've been banging my head against the keyboard on this for a bit now.
The problem with recognizing -> is that HOSTNAME matches any sequence of letters, numbers and dashes, and that includes 50985-. Since that match is longer than what NUMBER would match (50985), HOSTNAME wins. That's evidently not what you want.
Parsing log lines generally requires a context-sensitive scanner, and standard parser generators -- which are more oriented towards parsing programming languages -- are not always the ideal tool. In this case, for example, HOSTNAME cannot appear in the context in which it is being recognized, so it shouldn't even be in the list of possible tokens.
Of course, you could define a token consisting of an IP address and a port separated by a slash, which would solve the ambiguity, but (in my opinion) that would be suboptimal because you'll end up rescanning that token to parse it.
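To make the longest-match behaviour concrete, here is a small plain-Java sketch that imitates it with regular expressions. It is not ANTLR code: the class name MaximalMunchDemo and its tokenize method are invented for illustration, the patterns only approximate the lexer rules from the question, and skipped tokens are kept in the output so they can be seen.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MaximalMunchDemo {
    // Regex approximations of the lexer rules, in the order they appear in the grammar.
    static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("WS", Pattern.compile("[ \\t]+"));
        RULES.put("NUMBER", Pattern.compile("[0-9]+"));
        RULES.put("IPADDRESS", Pattern.compile("[0-9]+\\.[0-9]+\\.[0-9]+\\.[0-9]+"));
        RULES.put("SLASH", Pattern.compile("/"));
        RULES.put("RIGHTARROW", Pattern.compile("->"));
        RULES.put("CREATED", Pattern.compile("created"));
        RULES.put("HOSTNAME", Pattern.compile("[a-zA-Z0-9\\-]+"));
    }

    // At each position, the rule with the longest match wins; ties go to the earlier rule.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String bestRule = null;
            int bestEnd = pos;
            for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
                Matcher m = rule.getValue().matcher(input);
                m.region(pos, input.length());
                if (m.lookingAt() && m.end() > bestEnd) {
                    bestRule = rule.getKey();
                    bestEnd = m.end();
                }
            }
            if (bestRule == null) {
                // No rule matches here: report the character and move on.
                tokens.add("ERROR:" + input.charAt(pos));
                pos++;
            } else {
                tokens.add(bestRule + ":" + input.substring(pos, bestEnd));
                pos = bestEnd;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [HOSTNAME:50985-, ERROR:>, IPADDRESS:11.12.13.14]
        System.out.println(tokenize("50985->11.12.13.14"));
    }
}
```

HOSTNAME takes 50985- because that match is one character longer than NUMBER's, which leaves a lone > that no rule can match.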

Antlr4 ignores tokens

In ANTLR 4 I try to parse a text file, but some of my defined tokens are constantly ignored in favor of others. I produced a small example to show what I mean:
File to parse:
hello world
hello world
Grammar:
grammar TestLexer;
file : line line;
line : 'hello' ' ' 'world' '\n';
LINE : ~[\n]+? '\n';
The ANTLR book explains that 'hello' would become an implicit token, which is placed before the LINE token, and that token order matters. So I'd expect that the parser would NOT match the LINE token, but it does, as the resulting tree shows:
How can I fix this, so that I get the actual implicit tokens?
Btw. I also tried to write explicit tokens before LINE, but that didn't change anything.
Found it myself:
It seems that ANTLR prefers the longest token.
Since LINE always matches a whole line, it is always preferred.
To still include a "joker" (catch-all) token in a grammar, it should match only a single symbol.
In my case
grammar TestLexer;
file : line line;
line : 'hello' ' ' 'world' '\n';
LINE : ~[\n];
would work.
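A quick way to convince yourself, sketched in plain Java with regular expressions instead of ANTLR. The class LineRuleDemo, its winner method, and the ORIGINAL and FIXED tables are names invented for this illustration; only the 'hello' literal and the two LINE variants are modelled.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LineRuleDemo {
    // Each entry is {rule name, regex approximation}; implicit literals come first.
    static final String[][] ORIGINAL = {
        {"'hello'", "hello"},
        {"LINE", "[^\\n]+?\\n"},
    };
    static final String[][] FIXED = {
        {"'hello'", "hello"},
        {"LINE", "[^\\n]"},
    };

    // Returns the rule that wins at the start of input:
    // the longest match, with ties going to the rule listed first.
    static String winner(String[][] rules, String input) {
        String best = null;
        int bestLength = 0;
        for (String[] rule : rules) {
            Matcher m = Pattern.compile(rule[1]).matcher(input);
            if (m.lookingAt() && m.end() > bestLength) {
                best = rule[0];
                bestLength = m.end();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(winner(ORIGINAL, "hello world\n")); // LINE
        System.out.println(winner(FIXED, "hello world\n"));    // 'hello'
    }
}
```

Rule order only matters when two rules match the same number of characters, which is why putting explicit tokens before LINE made no difference.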
