Append item to map with Yecc parser in Elixir/Erlang - erlang

I am trying to parse a specific log file with Leex/Yecc in Elixir. After many hours I got the simplest scenario to work. However, I want to go to the next step, but I cannot figure out how to do so.
First, here is an example of the log format:
[!] plugin error detected
| check the version of the plugin
My first attempt handled only the first line format, but with multiple entries of it, like this:
[!] plugin error detected
[!] plugin error 2 detected
[!] plugin error 3 detected
That worked and gave me a nice map containing the text and the log line type (warning):
iex(20)> LogParser.parse("[!] a big warning\n[!] another warning")
[%{text: "a big warning", type: :warning},
%{text: "another warning", type: :warning}]
That is perfect. But as seen above, a log line can continue on the next line, indicated by a pipe character |. My lexer has a token for the pipe character and the parser can recognize it, but what I want is for the continuation line to be appended to the text value of my map. For now it is just appended as a separate string in the result list. So instead of:
[%{text: "a big warning ", type: :warning}, " continues on next line"]
I need:
[%{text: "a big warning continues on next line", type: :warning}]
I looked at examples on the net, but most of them have really clear 'end' tokens, such as a closing tag or a closing bracket, and even then it is not really clear to me how to add properties so that the eventual AST is correct.
For completeness, here is my lexer:
Definitions.
Char = [a-zA-Z0-9\.\s\,\[\]]
Word = [^\t\s\.#"=]+
Space = [\s\t]
New_Line = [\n]
%New_Line = \n|\r\n|\r
Type_Regular = \[\s\]\s
Type_Warning = \[!\]\s
Pipe = \|
Rules.
{Type_Regular} : {token, {type_regular, TokenLine}}.
{Type_Warning} : {token, {type_warning, TokenLine}}.
{Char} : {token, {char, TokenLine, TokenChars}}.
{Space} : skip_token.
{Pipe} : {token, {pipe, TokenLine}}.
{New_Line} : skip_token.
Erlang code.
And my parser:
Nonterminals lines line line_content chars.
Terminals type_regular type_warning char pipe.
Rootsymbol lines.
lines -> line lines : ['$1'|['$2']].
lines -> line : '$1'.
line -> pipe line_content : '$2'.
line -> type_regular line_content : #{type => regular, text => '$2'}.
line -> type_warning line_content : #{type => warning, text => '$2'}.
line_content -> chars : '$1'.
line_content -> pipe chars : '$1'.
chars -> char chars : unicode:characters_to_binary([get_value('$1')] ++ '$2').
chars -> char : unicode:characters_to_binary([get_value('$1')]).
Erlang code.
get_value({_, _, Value}) -> Value.
If you got even this far, thank you already! If anyone could help out, even bigger thanks!

I'd suggest adding a line_content rule to handle multiple lines separated by pipes, and removing the rule line -> pipe line_content : '$2'.
You also have an unnecessary [] around '$2' in the first lines clause. The single-line lines clause should also return a list, both to be consistent with the return value of the previous clause and so you don't end up with improper lists.
With these four changes,
-lines -> line lines : ['$1'|['$2']].
+lines -> line lines : ['$1'|'$2'].
-lines -> line : '$1'.
+lines -> line : ['$1'].
-line -> pipe line_content : '$2'.
line -> type_regular line_content : #{type => regular, text => '$2'}.
line -> type_warning line_content : #{type => warning, text => '$2'}.
line_content -> chars : '$1'.
-line_content -> pipe chars : '$1'.
+line_content -> line_content pipe chars : <<'$1'/binary, '$3'/binary>>.
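Putting the four changes together, the complete parser.yrl would look like this (a sketch; it assumes the lexer from the question is unchanged):

```erlang
Nonterminals lines line line_content chars.
Terminals type_regular type_warning char pipe.
Rootsymbol lines.

lines -> line lines : ['$1'|'$2'].
lines -> line : ['$1'].

line -> type_regular line_content : #{type => regular, text => '$2'}.
line -> type_warning line_content : #{type => warning, text => '$2'}.

line_content -> chars : '$1'.
line_content -> line_content pipe chars : <<'$1'/binary, '$3'/binary>>.

chars -> char chars : unicode:characters_to_binary([get_value('$1')] ++ '$2').
chars -> char : unicode:characters_to_binary([get_value('$1')]).

Erlang code.

get_value({_, _, Value}) -> Value.
```

The left-recursive line_content rule is what lets any number of pipe continuations fold into a single binary.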
I can parse multiline text just fine:
Belino.parse("[!] Look at the error")
Belino.parse("[!] plugin error detected
| check the version of the plugin")
Belino.parse("[!] a
| warning
[ ] a
| regular
[ ] another
| regular
[!] and another
| warning")
Output:
[%{text: "Look at the error", type: :warning}]
[%{text: "plugin error detected check the version of the plugin",
type: :warning}]
[%{text: "a warning ", type: :warning}, %{text: "a regular ", type: :regular},
%{text: "another regular ", type: :regular},
%{text: "and another warning", type: :warning}]

Related

ANTLR4 : mismatched input while trying parse ':'

I am trying to parse message using antlr4
:12B:DOCUMENT:some nice text
DOCUMENT2:some nice text
this is the expected output from the parser
heading -> 12B
subheading -> DOCUMENT
subheading -> DOCUMENT2
TEXT -> some nice text
TEXT -> some nice text
but on trying to extract the heading with the following grammar
grammar Simple;
para : heading* EOF;
header : heading text ;
heading : COLEN HEAD COLEN;
text : TEXT;
/* tokens */
TEXT : ~[\t]+ ;
HEAD : [0-9A-Z]+ ;
COLEN : ':';
but on supplying the input I am getting the following error
line 1:0 mismatched input ':12:nithin\n' expecting ':'
Could someone please tell me the possible cause and a solution for parsing this? If I've missed anything, or over- or under-emphasized a specific point, please let me know in the comments. Thank you so much in advance for your time.

Proper way to parse multiple items

I have an input file with multiple lines and fields separated by space. My definition files are:
scanner.xrl:
Definitions.
DIGIT = [0-9]
ALPHANUM = [0-9a-zA-Z_]
Rules.
(\s|\t)+ : skip_token.
\n : {end_token, {new_line, TokenLine}}.
{ALPHANUM}+ : {token, {string, TokenLine, TokenChars}}.
Erlang code.
parser.yrl:
Nonterminals line.
Terminals string.
Rootsymbol line.
Endsymbol new_line.
line -> string : ['$1'].
line -> string line: ['$1'|'$2'].
Erlang code.
When running it as it is, the first line is parsed and then it stops:
1> A = <<"a b c\nd e\nf\n">>.
2> {ok, T, _} = scanner:string(binary_to_list(A)).
{ok,[{string,1,"a"},
{string,1,"b"},
{string,1,"c"},
{new_line,1},
{string,2,"d"},
{string,2,"e"},
{new_line,2},
{string,3,"f"},
{new_line,3}],
4}
3> parser:parse(T).
{ok,[{string,1,"a"},{string,1,"b"},{string,1,"c"}]}
If I remove the Endsymbol line from parser.yrl and change the scanner.xrl file as follow:
Definitions.
DIGIT = [0-9]
ALPHANUM = [0-9a-zA-Z_]
Rules.
(\s|\t|\n)+ : skip_token.
{ALPHANUM}+ : {token, {string, TokenLine, TokenChars}}.
Erlang code.
All my line are parsed as a single item:
1> A = <<"a b c\nd e\nf\n">>.
<<"a b c\nd e\nf\n">>
2> {ok, T, _} = scanner:string(binary_to_list(A)).
{ok,[{string,1,"a"},
{string,1,"b"},
{string,1,"c"},
{string,2,"d"},
{string,2,"e"},
{string,3,"f"}],
4}
3> parser:parse(T).
{ok,[{string,1,"a"},
{string,1,"b"},
{string,1,"c"},
{string,2,"d"},
{string,2,"e"},
{string,3,"f"}]}
What would be the proper way to signal to the parser that each line should be treated as a separate item? I would like my result to look something like:
{ok,[[{string,1,"a"},
{string,1,"b"},
{string,1,"c"}],
[{string,2,"d"},
{string,2,"e"}],
[{string,3,"f"}]]}
Here is a correct lexer/parser pair that does the job (with only one shift/reduce conflict); I think it will solve your problem, and you only need to clean up the tokens as you prefer.
I'm pretty sure there is a much easier and faster way to do it, but during my "lexer fighting times" it was so hard to find even basic information that I hope this will give you an idea of how to proceed with parsing in Erlang.
scanner.xrl
Definitions.
DIGIT = [0-9]
ALPHANUM = [0-9a-zA-Z_]
Rules.
(\s|\t)+ : skip_token.
\n : {token, {line, TokenLine}}.
{ALPHANUM}+ : {token, {string, TokenLine, TokenChars}}.
Erlang code.
parser.yrl
Nonterminals
Lines
Line
Strings.
Terminals string line.
Rootsymbol Lines.
Lines -> Line Lines : lists:flatten(['$1', '$2']).
Lines -> Line : lists:flatten(['$1']).
Line -> Strings line : {line, lists:flatten(['$1'])}.
Line -> Strings : {line, lists:flatten(['$1'])}.
Strings -> string Strings : lists:append(['$1'], '$2').
Strings -> string : lists:flatten(['$1']).
Erlang code.
output
{ok,[{line,[{string,1,"a"},{string,1,"b"},{string,1,"c"}]},
{line,[{string,2,"d"},{string,2,"e"}]},
{line,[{string,3,"f"}]}]}
The parser flow is the following:
The root is defined as the nonterminal "Lines"
"Lines" consists of "Line Lines" or simply "Line", which gives the looping
"Line" consists of "Strings line", or simply "Strings" at the end of the file
"Strings" consists of a single 'string', or "'string' Strings" when there are multiple strings
'line' is the '\n' token
Please allow me to add a few comments on issues I noticed in the original code.
You should treat the whole file as a nested structure rather than parsing line by line; this is why the Lines/Line nonterminals are provided
"Terminals" are tokens that will not be analysed for containing any other tokens; "Nonterminals" are evaluated further, as they are composite

FsLex aborts with parse error on '{'

My Lexer is supposed to distinguish brackets and maintain a stack of opened brackets during lexing. For this I specified a helper function in my fsl file like this:
let updateBracketStack sign = // whenever a bracket is parsed, update the stack accordingly
match sign with
| '[' -> push sign
| '{' -> push sign
| ']' -> if top() = '[' then pop() else ()
| '}' -> if top() = '{' then pop() else ()
| _ -> ()
The stack of course is a ref of char list. And push, top, pop are implemented accordingly.
The problem is that everything worked up until I added the { character. Now FsLex simply dies with the error: parse error
If I change the characters to strings, i.e. write "{", FsLex is fine again, so a workaround would be to change the implementation to a stack of strings instead of characters.
My question is, however, where does this behaviour come from? Is this a bug in FsLex?
FsLex's parser is generated using FsLexYacc. The message "parse error" means that lexing of your .fsl file succeeded up to the error position, but parsing failed at that position. To find the root cause you would need to post the full input text you feed to FsLex.
This is only a guess: FsLex could be confused by the '{' character, since it is also the opening token for an embedded code block. Or your input text contains some special Unicode character that merely looks like whitespace in the editor.
One possible workaround is to create another module in a separate .fs file, e.g. a LexHelper module in LexHelper.fs, place your helper functions in it, and open it from the .fsl file.
EDIT
Looking at the source code of FsLexYacc, it does not handle a '}' character enclosed in single quotes inside embedded F# code, but it does handle one enclosed in double quotes.
https://github.com/fsprojects/FsLexYacc/blob/master/src/FsLex/fslexlex.fsl

ANTLR4 - parse a file line-by-line

I try to write a grammar to parse a file line by line.
My grammar looks like this:
grammar simple;
parse: (line NL)* EOF;
line
: statement1
| statement2
| // empty line
;
statement1 : KW1 (INT|FLOAT);
statement2 : KW2 INT;
...
NL : '\r'? '\n';
WS : (' '|'\t')-> skip; // toss out whitespace
If the last line in my input file does not have a newline, I get the following error message:
line xx:37 missing NL at <EOF>
Can somebody please explain, how to write a grammar that actually accepts the last line without newline
Simply don't require an NL after the last line. The following form is efficient, and is simplified by the fact that line can match the empty sequence (essentially, the last line will always be the empty line).
// best if `line` can be empty
parse
: line (NL line)* EOF
;
If line was not allowed to be empty, the following rule is efficient and performs the same operation:
// best if `line` cannot be empty
parse
: (line (NL line)* NL?)? EOF
;
The following rule is equivalent for the case where line cannot be empty, but is much less efficient. I'm including it because I frequently see rules written in this form, where it's easy to see that the specific NL token made optional is the one following the last line.
// INEFFICIENT!
parse
: (line NL)* (line NL?)? EOF
;

How to not require spaces in ANTLR4

I am using ANTLR4 to try and parse the following text:
ex1, ex2: examples
var1,var2,var3: variables
Since the second line does not have whitespace after the commas, it doesn't parse correctly. If I add in the whitespace, then it works. The rules I am currently using to parse this:
line : list ':' name;
list : listitem (',' listitem)*;
listitem : [a-zA-Z0-9]+;
name : [a-zA-Z0-9]+;
This works perfectly for lines like line 1, but fails on lines like line 2. If there are parentheses or pretty much any punctuation, it wants whitespace after the punctuation, and I can't always guarantee that about the input.
Does anyone know how to fix this?
First add explicit lexer rules (starting with a capital letter). Then add a lexer rule for whitespace and ignore the whitespace:
line : list ':' name;
list : listitem (',' listitem)*;
listitem : Identifier;
name : Identifier;
Identifier : [a-zA-Z0-9]+; // only one lexer rule for name and listitem, since an Identifier may be a name or a listitem depending only on its position
WhiteSpace : (' '|'\t') -> skip;
NewLine : ('\r'?'\n'|'\r') -> skip; // or don't skip if you need it as a statement terminator
