I have an input file with multiple lines and fields separated by spaces. My definition files are:
scanner.xrl:
Definitions.
DIGIT = [0-9]
ALPHANUM = [0-9a-zA-Z_]
Rules.
(\s|\t)+ : skip_token.
\n : {end_token, {new_line, TokenLine}}.
{ALPHANUM}+ : {token, {string, TokenLine, TokenChars}}.
Erlang code.
parser.yrl:
Nonterminals line.
Terminals string.
Rootsymbol line.
Endsymbol new_line.
line -> string : ['$1'].
line -> string line: ['$1'|'$2'].
Erlang code.
When I run it as is, the first line is parsed and then parsing stops:
1> A = <<"a b c\nd e\nf\n">>.
2> {ok, T, _} = scanner:string(binary_to_list(A)).
{ok,[{string,1,"a"},
{string,1,"b"},
{string,1,"c"},
{new_line,1},
{string,2,"d"},
{string,2,"e"},
{new_line,2},
{string,3,"f"},
{new_line,3}],
4}
3> parser:parse(T).
{ok,[{string,1,"a"},{string,1,"b"},{string,1,"c"}]}
If I remove the Endsymbol line from parser.yrl and change the scanner.xrl file as follows:
Definitions.
DIGIT = [0-9]
ALPHANUM = [0-9a-zA-Z_]
Rules.
(\s|\t|\n)+ : skip_token.
{ALPHANUM}+ : {token, {string, TokenLine, TokenChars}}.
Erlang code.
All my lines are parsed as a single item:
1> A = <<"a b c\nd e\nf\n">>.
<<"a b c\nd e\nf\n">>
2> {ok, T, _} = scanner:string(binary_to_list(A)).
{ok,[{string,1,"a"},
{string,1,"b"},
{string,1,"c"},
{string,2,"d"},
{string,2,"e"},
{string,3,"f"}],
4}
3> parser:parse(T).
{ok,[{string,1,"a"},
{string,1,"b"},
{string,1,"c"},
{string,2,"d"},
{string,2,"e"},
{string,3,"f"}]}
What would be the proper way to signal to the parser that each line should be treated as a separate item? I would like my result to look something like:
{ok,[[{string,1,"a"},
{string,1,"b"},
{string,1,"c"}],
[{string,2,"d"},
{string,2,"e"}],
[{string,3,"f"}]]}
Here is a lexer/parser pair that does the job (with only one shift/reduce conflict). I think it will solve your problem; you only need to clean up the tokens as you prefer.
I'm pretty sure there is an easier and faster way to do it, but back in my own lexer-fighting days it was so hard to find any information at all that I hope this will give you an idea of how to proceed with parsing in Erlang.
scanner.xrl
Definitions.
DIGIT = [0-9]
ALPHANUM = [0-9a-zA-Z_]
Rules.
(\s|\t)+ : skip_token.
\n : {token, {line, TokenLine}}.
{ALPHANUM}+ : {token, {string, TokenLine, TokenChars}}.
Erlang code.
parser.yrl
Nonterminals
Lines
Line
Strings.
Terminals string line.
Rootsymbol Lines.
Lines -> Line Lines : lists:flatten(['$1', '$2']).
Lines -> Line : lists:flatten(['$1']).
Line -> Strings line : {line, lists:flatten(['$1'])}.
Line -> Strings : {line, lists:flatten(['$1'])}.
Strings -> string Strings : lists:append(['$1'], '$2').
Strings -> string : lists:flatten(['$1']).
Erlang code.
output
{ok,[{line,[{string,1,"a"},{string,1,"b"},{string,1,"c"}]},
{line,[{string,2,"d"},{string,2,"e"}]},
{line,[{string,3,"f"}]}]}
The parser flow is the following:
- The root is defined as the abstract Lines.
- Lines consists of "Line + Lines", or simply Line, which gives the looping.
- Line consists of "Strings + line", or simply Strings when it is the end of the file.
- Strings consists of a single string, or "string + Strings" when there are several strings in a row.
- The line terminal is the '\n' symbol.
Please allow me to give a few comments on issues I've discovered in the original code.
You should treat the whole file as a nested list rather than parsing line by line; this is why the Lines/Line nonterminals are provided.
Terminals are tokens that will not be analysed any further; Nonterminals are composite and are expanded further by the parser.
Related
In several projects I have run into a similar effect in my grammars.
I have the need to parse something like Key="Value"
So I create a grammar (simplest I could make to show the effect):
grammar test;
KEY : [a-zA-Z0-9]+ ;
VALUE : DOUBLEQUOTE [ _a-zA-Z0-9.-]+ DOUBLEQUOTE ;
DOUBLEQUOTE : '"' ;
EQUALS : '=' ;
entry : key=KEY EQUALS value=VALUE;
I can now parse thing="One Two Three" and in my code I receive
key = thing
value = "One Two Three"
In all of my projects I end up with an extra step to strip those " from the value.
Usually something like this (I use Java)
String value = ctx.value.getText();
value = value.substring(1, value.length()-1);
In my real grammars I find it very hard to move the check of the surrounding " into the parser.
Is there a clean way to already drop the " by doing something in the lexer/parser?
Essentially I want ctx.value.getText() to return One Two Three instead of "One Two Three".
Update:
I have been playing with the excellent answer provided by Bart Kiers and found this variation which does exactly what I was looking for.
By putting the DOUBLEQUOTE tokens on a hidden channel, they are consumed by the lexer but hidden from the parser.
TestLexer.g4
lexer grammar TestLexer;
KEY : [a-zA-Z0-9]+;
DOUBLEQUOTE : '"' -> channel(HIDDEN), pushMode(STRING_MODE);
EQUALS : '=';
mode STRING_MODE;
STRING_DOUBLEQUOTE
: '"' -> channel(HIDDEN), type(DOUBLEQUOTE), popMode
;
STRING
: [ _a-zA-Z0-9.-]+
;
and
TestParser.g4
parser grammar TestParser;
options { tokenVocab=TestLexer; }
entry : key=KEY EQUALS value=STRING ;
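A quick way to verify, mirroring the test driver from the answer below (here entry.value is the Token field generated for the value=STRING label):
Lexer lexer = new TestLexer(CharStreams.fromString("Key=\"One Two Three\""));
TestParser parser = new TestParser(new CommonTokenStream(lexer));
TestParser.EntryContext entry = parser.entry();
System.out.println(entry.value.getText()); // prints: One Two Three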
Try this:
VALUE
: DOUBLEQUOTE [ _a-zA-Z0-9.-]+ DOUBLEQUOTE
{setText(getText().substring(1, getText().length()-1));}
;
Needless to say: this ties your grammar to Java, and (depending on how much embedded Java code you have) your grammar will be hard to port to some other target language.
EDIT
Once a token is created, there is no built-in way to split it up (other than manipulating its text in embedded actions, as I demonstrated). What you're looking for can be done, but it means rewriting your grammar so that a string literal is not constructed as a single token. This can be done by using lexical modes, so that the string is assembled in the parser instead.
A quick demo:
TestLexer.g4
lexer grammar TestLexer;
KEY : [a-zA-Z0-9]+;
DOUBLEQUOTE : '"' -> pushMode(STRING_MODE);
EQUALS : '=';
mode STRING_MODE;
STRING_DOUBLEQUOTE
: '"' -> type(DOUBLEQUOTE), popMode
;
STRING_ATOM
: [ _a-zA-Z0-9.-]
;
TestParser.g4
parser grammar TestParser;
options { tokenVocab=TestLexer; }
entry : key=KEY EQUALS value;
value : DOUBLEQUOTE string_atoms DOUBLEQUOTE;
string_atoms : STRING_ATOM*;
If you now run the Java code:
Lexer lexer = new TestLexer(CharStreams.fromString("Key=\"One Two Three\""));
TestParser parser = new TestParser(new CommonTokenStream(lexer));
TestParser.EntryContext entry = parser.entry();
System.out.println(entry.value().string_atoms().getText());
this will be printed:
One Two Three
I'm making a parser for a DSL in Haskell using Alex + Happy.
My DSL uses dice rolls as part of the possible expressions.
Sometimes I have an expression that I want to parse that looks like:
[some code...] 3D6 [... rest of the code]
Which should translate roughly to:
TokenInt {... value = 3}, TokenD, TokenInt {... value = 6}
My DSL also uses variables (basically, Strings), so I have a special token that handle variable names.
So, with these tokens:
"D" { \pos str -> TokenD pos }
$alpha [$alpha $digit \_ \']* { \pos str -> TokenName pos str}
$digit+ { \pos str -> TokenInt pos (read str) }
The result I'm getting when using my parser now is:
TokenInt {... value = 3}, TokenName { ... , name = "D6"}
Which means that my lexer "reads" an Integer and a Variable named "D6".
I have tried many things; for example, I changed the token D to:
$digit "D" $digit { \pos str -> TokenD pos }
But that just consumes the digits :(
Can I parse the dice roll with the numbers?
Or at least parse TokenInt-TokenD-TokenInt?
PS: I'm using PosN as a wrapper, not sure if relevant.
The way I'd go about it would be to extend the TokenD type to TokenD Int Int. Using the basic wrapper for convenience, I would do:
$digit+ D $digit+ { dice }
...
dice :: String -> Token
dice s = TokenD (read a) (read b)
  where (a, _ : b) = break (== 'D') s  -- e.g. "3D6" becomes TokenD 3 6
Here break from the Prelude splits the matched string at the 'D', so no extra helper is needed.
This is an extra step that would usually be done during syntactic analysis, but it doesn't hurt much here.
Also, I can't make Alex produce TokenD instead of TokenName when the D is followed by $alpha characters. If we had Di instead of D, that'd be no problem. From Alex's docs:
When the input stream matches more than one rule, the rule which matches the longest prefix of the input stream wins. If there are still several rules which match an equal number of characters, then the rule which appears earliest in the file wins.
By those rules your original code should work, so I don't know if this is an issue with Alex.
I decided that I could survive with variables starting with lowercase letters (like Haskell variables), so I changed my lexer to parse variables only if they start with a lowercase letter.
That also solved some possible problems with some other reserved words.
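For reference, the changed rule might look something like this (a sketch, assuming a $lower = [a-z] macro alongside the existing $alpha and $digit):
$lower [$alpha $digit \_ \']* { \pos str -> TokenName pos str }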
I'm still curious to know whether there were other solutions, but the problem itself is solved.
Thank you all!
I am trying to parse a specific log file with Leex/Yecc in Elixir. After many hours I got the easiest scenario to work. However, I want to go to the next step, but I cannot figure out how to do so.
First, here is an example of the log format:
[!] plugin error detected
| check the version of the plugin
My simple try was with the first line format only, but with multiple entries of it, like this:
[!] plugin error detected
[!] plugin error 2 detected
[!] plugin error 3 detected
That worked and gave me a nice map containing the text and the log line type (warning):
iex(20)> LogParser.parse("[!] a big warning\n[!] another warning")
[%{text: "a big warning", type: :warning},
%{text: "another warning", type: :warning}]
That is perfect. But as seen above, a log line can continue on the next line, indicated with a pipe character |. My lexer has the pipe character and the parser can understand it, but what I want is for the continuation to be appended to the text value of my map. For now it just ends up as a bare string in the result list. So instead of:
[%{text: "a big warning ", type: :warning}, " continues on next line"]
I need:
[%{text: "a big warning continues on next line", type: :warning}]
I looked at examples on the net, but most of them have really clear 'end' tokens, such as a closing tag or a closing bracket, and even then it is not really clear to me how to add properties so that the eventual AST is correct.
For completeness, here is my lexer:
Definitions.
Char = [a-zA-Z0-9\.\s\,\[\]]
Word = [^\t\s\.#"=]+
Space = [\s\t]
New_Line = [\n]
%New_Line = \n|\r\n|\r
Type_Regular = \[\s\]\s
Type_Warning = \[!\]\s
Pipe = \|
Rules.
{Type_Regular} : {token, {type_regular, TokenLine}}.
{Type_Warning} : {token, {type_warning, TokenLine}}.
{Char} : {token, {char, TokenLine, TokenChars}}.
{Space} : skip_token.
{Pipe} : {token, {pipe, TokenLine}}.
{New_Line} : skip_token.
Erlang code.
And my parser:
Nonterminals lines line line_content chars.
Terminals type_regular type_warning char pipe.
Rootsymbol lines.
lines -> line lines : ['$1'|['$2']].
lines -> line : '$1'.
line -> pipe line_content : '$2'.
line -> type_regular line_content : #{type => regular, text => '$2'}.
line -> type_warning line_content : #{type => warning, text => '$2'}.
line_content -> chars : '$1'.
line_content -> pipe chars : '$1'.
chars -> char chars : unicode:characters_to_binary([get_value('$1')] ++ '$2').
chars -> char : unicode:characters_to_binary([get_value('$1')]).
Erlang code.
get_value({_, _, Value}) -> Value.
If you got even this far, thank you already! If anyone could help out, even bigger thanks!
I'd suggest adding a line_content rule to handle multiple lines separated by pipes, and removing the line -> pipe line_content rule.
You also have an unnecessary [] around '$2' in the lines clause, and the single-line clause should return a list, both to be consistent with the return value of the previous clause and so you don't end up with improper lists.
With these four changes,
-lines -> line lines : ['$1'|['$2']].
+lines -> line lines : ['$1'|'$2'].
-lines -> line : '$1'.
+lines -> line : ['$1'].
-line -> pipe line_content : '$2'.
line -> type_regular line_content : #{type => regular, text => '$2'}.
line -> type_warning line_content : #{type => warning, text => '$2'}.
line_content -> chars : '$1'.
-line_content -> pipe chars : '$1'.
+line_content -> line_content pipe chars : <<'$1'/binary, '$3'/binary>>.
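For reference, here is the full parser.yrl with those changes applied:
Nonterminals lines line line_content chars.
Terminals type_regular type_warning char pipe.
Rootsymbol lines.
lines -> line lines : ['$1'|'$2'].
lines -> line : ['$1'].
line -> type_regular line_content : #{type => regular, text => '$2'}.
line -> type_warning line_content : #{type => warning, text => '$2'}.
line_content -> chars : '$1'.
line_content -> line_content pipe chars : <<'$1'/binary, '$3'/binary>>.
chars -> char chars : unicode:characters_to_binary([get_value('$1')] ++ '$2').
chars -> char : unicode:characters_to_binary([get_value('$1')]).
Erlang code.
get_value({_, _, Value}) -> Value.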
I can parse multiline text just fine:
Belino.parse("[!] Look at the error")
Belino.parse("[!] plugin error detected
| check the version of the plugin")
Belino.parse("[!] a
| warning
[ ] a
| regular
[ ] another
| regular
[!] and another
| warning")
Output:
[%{text: "Look at the error", type: :warning}]
[%{text: "plugin error detected check the version of the plugin",
type: :warning}]
[%{text: "a warning ", type: :warning}, %{text: "a regular ", type: :regular},
%{text: "another regular ", type: :regular},
%{text: "and another warning", type: :warning}]
I'm trying to write a grammar that parses a file line by line.
My grammar looks like this:
grammar simple;
parse: (line NL)* EOF;
line
: statement1
| statement2
| // empty line
;
statement1 : KW1 (INT|FLOAT);
statement2 : KW2 INT;
...
NL : '\r'? '\n';
WS : (' '|'\t')-> skip; // toss out whitespace
If the last line in my input file does not have a newline, I get the following error message:
line xx:37 missing NL at <EOF>
Can somebody please explain how to write a grammar that accepts a last line without a newline?
Simply don't require NL to fall after the last line. The following form is efficient, and is simplified by the fact that line can match the empty sequence (essentially, the last line will always be the empty line).
// best if `line` can be empty
parse
: line (NL line)* EOF
;
If line were not allowed to be empty, the following rule is efficient and performs the same operation:
// best if `line` cannot be empty
parse
: (line (NL line)* NL?)? EOF
;
The following rule is equivalent for the case where line cannot be empty, but it is much less efficient. I'm including it because I frequently see people write rules in this form, where it's easy to see that the specific NL token made optional is the one following the last line.
// INEFFICIENT!
parse
: (line NL)* (line NL?)? EOF
;
I'm trying to build a grammar for a recognizer of a SPICE-like language using Antlr-3.1.3 (I use this version because of the Python target). I don't have experience with parsers. I've found a master's thesis in which the student did the syntactic analysis of the SPICE 2G6 language and built a parser using the LEX and YACC compiler-writing tools (http://digitool.library.mcgill.ca/R/?func=dbin-jump-full&object_id=60667&local_base=GEN01-MCG02). In chapter 4 he describes a grammar in Backus-Naur form for the SPICE 2G6 language, and appends the LEX and YACC code files of the parser to the work.
I'm basing my work on this thesis to create a simpler grammar for a recognizer of a more restricted SPICE language.
I've read the Antlr manual, but could not figure out how to solve the two problems that the code snippet below illustrates.
grammar Najm_teste;
resistor
: RES NODE NODE VALUE 'G2'? COMMENT? NEWLINE
;
// START:tokens
RES : ('R'|'r') DIG+;
NODE : DIG+; // non-negative integer
VALUE : REAL; // non-negative real
fragment
SIG : '+'|'-';
fragment
DIG : '0'..'9';
fragment
EXP : ('E'|'e') SIG? DIG+;
fragment
FLT : (DIG+ '.' DIG+)|('.' DIG+)|(DIG+ '.');
fragment
REAL : (DIG+ EXP?)|(FLT EXP?);
COMMENT : '%' ( options {greedy=false;} : . )* NEWLINE;
NEWLINE : '\r'? '\n';
WS : (' '|'\t')+ {$channel=HIDDEN;};
// END:tokens
In the grammar above, the token NODE is a subset of the set represented by the VALUE token. The grammar correctly interprets an input like "R1 5 0 1.1\n", but cannot interpret an input like "R1 5 0 1\n", because it maps "1" to the token NODE instead of the token VALUE, as NODE comes before VALUE in the tokens section. Given such inputs, does anyone have an idea of how I can map the "1" to the correct token VALUE, or a suggestion of how I can alter the grammar so that it correctly interprets this input?
The second problem is the presence of a comment at the end of a line of code: the NEWLINE token delimits both (1) the end of a comment and (2) the end of a line of code. When I include a comment at the end of a line of code, two newline characters are necessary for the parser to correctly recognize the line; otherwise, just one newline character is needed. How could I improve this?
Thanks!
Problem 1
The lexer does not "listen" to the parser. The lexer simply creates tokens that contain as many characters as possible. If two rules match the same number of characters, the rule defined first "wins". In other words, "1" will always be tokenized as a NODE, even when the parser is trying to match a VALUE.
You can do something like this instead:
resistor
: RES NODE NODE value 'G2'? COMMENT? NEWLINE
;
value : NODE | REAL;
// START:tokens
RES : ('R'|'r') DIG+;
NODE : DIG+;
REAL : (DIG+ EXP?) | (FLT EXP?);
...
In short: I removed the VALUE token, added the parser rule value, and removed the fragment keyword from REAL.
Problem 2
Do not let the comment match the line break:
COMMENT : '%' ~('\r' | '\n')*;
where ~('\r' | '\n')* matches zero or more chars other than line break characters. Since COMMENT no longer consumes the line break, the NEWLINE in the resistor rule terminates the line whether or not a comment is present, so a single newline suffices.