Haskell: Why does my parser fail?

As an exercise from the Haskell Book by bitemyapp, I need to write a parser for the given log.
So I made this: https://gist.github.com/RoelofWobben/79058b1a6a5c24f08a495045c7a685f9
But when I test it with `parseString parseMultipleDays myLog`, I see this error message:
Failure (ErrInfo {_errDoc = (interactive):3:1: error: expected: new-line
# 2025-02-05
^ , _errDeltas = [Lines 2 0 20 0]})
Can anyone give me a hint where the bug is? If needed, I can make a repo of the code I have, with some tests.
I use trifecta because that is the library explained in the chapter.

string "--" *> manyTill anyChar newline *> newline
manyTill already consumes the terminator, so the above defines a comment to be "--", followed by anything, followed by two newlines.
Your input only contains one newline after the comment, so you get an error telling you that the parser expected a second newline instead of the #.
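The effect can be modeled outside Haskell. The following is a minimal Python sketch, not trifecta's implementation: `many_till` and `literal` are hypothetical helpers that only mimic the behavior described above, and the sample `log` string is made up for illustration.

```python
# Minimal model of manyTill semantics; not trifecta's real implementation.

def many_till(text, pos, terminator):
    """Collect characters until `terminator` matches, then consume it too."""
    end = text.find(terminator, pos)
    if end == -1:
        raise ValueError(f"expected {terminator!r} after position {pos}")
    return text[pos:end], end + len(terminator)

def literal(text, pos, expected):
    """Consume `expected` at `pos` or fail."""
    if not text.startswith(expected, pos):
        raise ValueError(f"expected {expected!r} at position {pos}")
    return pos + len(expected)

def comment_buggy(text, pos):
    # Models: string "--" *> manyTill anyChar newline *> newline
    pos = literal(text, pos, "--")
    _, pos = many_till(text, pos, "\n")  # already eats the newline
    return literal(text, pos, "\n")      # demands a second newline

def comment_fixed(text, pos):
    # Models: string "--" *> manyTill anyChar newline
    pos = literal(text, pos, "--")
    _, pos = many_till(text, pos, "\n")
    return pos

log = "-- a comment\n# 2025-02-05\n"  # made-up sample input
```

With this input, `comment_fixed(log, 0)` stops right before the `#`, while `comment_buggy(log, 0)` raises because it finds `#` where it wants a second newline.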

Related

How to match [BOF]"Begin of file" in Antlr4 Lexer?

In an ANTLR4 grammar, I need the comment (// xxxx) to always be at the start of a line.
The following grammar works fine for most cases.
grammar com;
comment: COMMENT;
COMMENT
: '\n' '//' .*? '\n'
;
By design, it will match \n//comment\n but not //comment\n. But I also want it to match <BOF>//comment\n. How can I implement it?
You may find that this check is better handled after parsing, in a semantic validation pass over your parse tree. (NOTE: a parser is not required to recognize ONLY valid input; it only has to interpret the input it does accept in the one correct way.)
For example, does "// might be a comment" have some other, alternate interpretation if it's not at the beginning of the line?
If not, I would probably just accept the // comment ...\n as a token regardless of its position in the line.
Then, once you have the parse tree, you can check that your comments always have a column of 0. Doing it this way, your grammar is not tied to a particular target language, and, perhaps more importantly, you can give a "nice" error message like "Comments must begin in the first column of a line".
If you try to handle this in the Lexer (or parser), then, if it's NOT in the correct column, you'll get a much more obtuse recognition error that will be more difficult for users to understand.
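As a rough illustration of that validation pass, here is a Python sketch that does not use ANTLR at all. The regular expression and the function names `find_comments` and `validate_comments` are assumptions made for this example; in a real project the positions would come from the tokens in the parse tree.

```python
import re

# Accept a // comment wherever it starts; position is checked afterwards.
COMMENT = re.compile(r"//[^\r\n]*")

def find_comments(source):
    """Yield (line_number, column, text) for the first // comment on each line."""
    for line_number, line in enumerate(source.splitlines(), start=1):
        match = COMMENT.search(line)
        if match:
            yield line_number, match.start(), match.group()

def validate_comments(source):
    """Return readable error messages for comments that are not in column 0."""
    return [
        f"line {line_number}:{column}: comments must begin in the first column of a line"
        for line_number, column, _ in find_comments(source)
        if column != 0
    ]

source = "// start\nx = 1 // trailing\n// end"
errors = validate_comments(source)
```

Here `errors` contains a single message pointing at line 2, column 6, which is the kind of targeted diagnostic a lexer-level rejection cannot easily produce.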
That is not possible in a language agnostic way. You will have to add target specific code in your grammar and use a predicate to check if the char position is 0:
COMMENT
: {getCharPositionInLine() == 0}? '//' ~[\r\n]*
;
OTHER
: .
;
If you now tokenize the input:
// start
// middle
?//...
// end
with the Java code:
String input = "// start\n// middle\n?//...\n// end";
comLexer lexer = new comLexer(CharStreams.fromString(input));
CommonTokenStream stream = new CommonTokenStream(lexer);
stream.fill();
for (Token t : stream.getTokens()) {
    System.out.printf("%-10s'%s'%n",
        comLexer.VOCABULARY.getSymbolicName(t.getType()),
        t.getText().replace("\n", "\\n"));
}
the following will be printed to your console:
COMMENT   '// start'
OTHER     '\n'
COMMENT   '// middle'
OTHER     '\n'
OTHER     '?'
OTHER     '/'
OTHER     '/'
OTHER     '.'
OTHER     '.'
OTHER     '.'
OTHER     '\n'
COMMENT   '// end'
EOF       '<EOF>'
Note that I also removed the \n at the end of the COMMENT, otherwise a comment at the end of the input would not be matched.
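To see what the predicate does without generating a lexer, the same decision can be imitated by hand. This Python sketch is only a model of the two rules above, not ANTLR's algorithm; the function name `tokenize` is hypothetical.

```python
def tokenize(source):
    """Emit COMMENT only when '//' starts at column 0; any other char is OTHER."""
    tokens = []
    pos = 0
    column = 0
    while pos < len(source):
        if column == 0 and source.startswith("//", pos):
            # Same shape as: '//' ~[\r\n]*
            end = pos
            while end < len(source) and source[end] not in "\r\n":
                end += 1
            tokens.append(("COMMENT", source[pos:end]))
            column += end - pos
            pos = end
        else:
            char = source[pos]
            tokens.append(("OTHER", char))
            # A newline resets the column, like the lexer's char position in line.
            column = 0 if char == "\n" else column + 1
            pos += 1
    tokens.append(("EOF", "<EOF>"))
    return tokens

tokens = tokenize("// start\n// middle\n?//...\n// end")
```

The resulting token sequence has the same shape as the console output above: the `//` after `?` sits at column 1, so it falls through to two OTHER tokens.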
EDIT
How can I do it with JavaScript? I cannot find good examples on the internet.
By looking at the JavaScript source, it looks like {this.column === 0}? is the JavaScript equivalent of {getCharPositionInLine() == 0}?
By the way, does the IntelliJ plugin support predicates? If it does, does it support only Java?
No, the IntelliJ plugin ignores predicates. After all, the code inside a predicate can be any arbitrary chunk of code, making it quite hard to support.

Problem with parsing file ending on a newline

It seems a bit like a trivial question, but I am stuck on parsing the end of file EOF using my own island grammar. I am using the new VScode extension btw.
I've mostly been using the examples from the basic recipes and have a simple grammar with the following layout rules:
layout Whitespace = [\t-\n\r\ ]*;
lexical IntegerLiteral = [0-9]+ !>> [0-9];
lexical Comment = "%%" ![\n]* $;
Using this, and some rules it parses some simple files, but will give a parse error anytime a file ends in a newline. (newlines in between lines are no problem).
Am I missing something obvious?
Thanks!
It sounds a bit like your grammar is missing a start nonterminal. All grammar rules get whitespace in between their constituent symbols but not at the start or the end.
A start nonterminal is the exception:
start syntax Islands = Island+;
Islands parseIslands(loc input)
= parse(#start[Islands], input).top;
Passing the start nonterminal to parse will allow the file to start and end with whitespace, and using the .top field you can ignore that whitespace from the parse tree again by projecting out the middle Islands tree.
Island grammars tend to be a complex beast, so without sharing the full grammar and input string, it might be a bit hard to answer this question. But I'll share some generic feedback.
The layout production might be ambiguous if any other part of your language has optional parts. Rascal's parsing is non-greedy. So if you have:
lexical A = "a";
lexical B = "b";
lexical C = "c";
syntax A = A? B? C;
After fusing in the layouts, this becomes:
A` = A? Whitespace? B? Whitespace? C;
Now, since whitespace is not eating all characters, the grammar is ambiguous, as the parser can "bind" whitespace between the A and B, or between the B and C. So in most cases, you want to make sure it's a greedy match by adding a follow restriction:
layout Whitespace = [\t-\n \r \ ]* !>> [\t-\n \r \ ];
Also, I fixed a bug: the layout definition didn't include a space as valid whitespace. Rascal allows spaces inside a character class (for readability), so to match a literal space you have to write \ (a backslash followed by a space).
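The ambiguity can be made concrete by counting. The Python sketch below is only a counting model, not Rascal's parser: it assumes the optional B is absent, so a single run of whitespace has to be divided between the two adjacent layout positions, and the function name `layout_splits` is hypothetical.

```python
def layout_splits(run, follow_restriction):
    """List the ways a whitespace run can be divided over two adjacent layout slots."""
    ways = []
    for cut in range(len(run) + 1):
        first, second = run[:cut], run[cut:]
        # With `!>> [whitespace]`, the first slot may not stop while
        # whitespace still follows it, so it must take the whole run.
        if follow_restriction and second:
            continue
        ways.append((first, second))
    return ways

without_restriction = layout_splits("  ", follow_restriction=False)
with_restriction = layout_splits("  ", follow_restriction=True)
```

For a run of two spaces there are three possible divisions without the follow restriction and exactly one with it, which is why the restricted layout no longer produces ambiguous parses in this situation.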
For the rest, it looks okay, but like I started with, island grammars are a bit harder to debug without both the full syntax, and what you want to have as water and what as island.

Match Symbol specific number of times

When defining a syntax, it is possible to match 1 or more times (+) or 0 or more times (*) similarly to how it is done in regex. However, I have not found in the rascal documentation if it is possible to also match a Symbol a specific amount of times. In regex (and Rascal patterns) this is done with an integer between two curly brackets but this doesn't seem to work for syntax definition. Ideally, I'd want something like:
lexical Line = [0-9.]+;
syntax Sym = sym: {Line Newline}{5};
Which would only try to match the first 5 lines of the text below:
..0..
11.11
44.44
1.11.1
33333
55555
No, this meta syntax does not exist in Rascal. We did not add it.
You could write an over-estimation like this and have a post-parse filter reject more than 5 items:
syntax Sym = fiveLines: (Line NewLine)+ lines;
visit (myParseTree) {
  // bind the matched tree to x so its source location can be reported
  case x:(Sym) `<(Line NewLine)+ lines>` :
    if (length(lines) != 5) throw ParseError(x.src);
}
Or unfold the loop like so:
syntax Sym
= Line NewLine
Line NewLine
Line NewLine
Line NewLine
Line NewLine
;
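The "over-estimate, then filter" idea is independent of Rascal. Here is a minimal Python sketch of it; `parse_lines` and `parse_exactly` are hypothetical names, and the regular expression simply mirrors the `[0-9.]+` character class from the question.

```python
import re

LINE = re.compile(r"[0-9.]+")

def parse_lines(text):
    """Over-estimating parse: accept one or more lines of digits and dots."""
    lines = text.splitlines()
    if not lines:
        raise ValueError("expected at least one line")
    for number, line in enumerate(lines, start=1):
        if not LINE.fullmatch(line):
            raise ValueError(f"line {number}: expected [0-9.]+")
    return lines

def parse_exactly(text, count=5):
    """Post-parse filter: reject results that do not have exactly `count` lines."""
    lines = parse_lines(text)
    if len(lines) != count:
        raise ValueError(f"expected {count} lines, found {len(lines)}")
    return lines

five = "..0..\n11.11\n44.44\n1.11.1\n33333\n"
six = five + "55555\n"
```

`parse_exactly(five)` returns the five lines, while `parse_exactly(six)` raises, because the permissive first stage accepts six lines and the filter then rejects the count.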
Repetition with an integer parameter sounds like a good feature request for us to consider, if you need it badly. We only have to consider what it means for Rascal's type system; for the parser generator it's a simple rule to add.

Antlr4 ignores tokens

In ANTLR 4 I try to parse a text file, but some of my defined tokens are constantly ignored in favor of others. I produced a small example to show what I mean:
File to parse:
hello world
hello world
Grammar:
grammar TestLexer;
file : line line;
line : 'hello' ' ' 'world' '\n';
LINE : ~[\n]+? '\n';
The ANTLR book explains that 'hello' would become an implicit token, which is placed before the LINE token, and that token order matters. So I'd expect that the parser would NOT match the LINE token, but it does, as the resulting tree shows:
How can I fix this, so that I get the actual implicit tokens?
Btw. I also tried to write explicit tokens before LINE, but that didn't change anything.
Found it myself:
The ANTLR lexer prefers the longest match; rule order only breaks ties between matches of equal length.
Since LINE always matches a whole line, it always wins over the shorter implicit tokens.
To still include a "joker" token in the grammar, it should match a single character only.
In my case
grammar TestLexer;
file : line line;
line : 'hello' ' ' 'world' '\n';
LINE : ~[\n];
would work.
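The longest-match rule is easy to reproduce in a few lines. The Python sketch below is a simplified model, not ANTLR's lexer: the rule names HELLO, SPACE, WORLD and NL are hypothetical stand-ins for the implicit tokens, and Python regular expressions stand in for the lexer rules.

```python
import re

def next_token(rules, text, pos):
    """Pick the rule with the longest match at `pos`; earlier rules win ties."""
    best = None
    for name, pattern in rules:
        match = re.compile(pattern).match(text, pos)
        if match and match.end() > pos:
            # Strictly longer only, so the first-defined rule keeps a tie.
            if best is None or len(match.group()) > len(best[1]):
                best = (name, match.group())
    return best

def tokenize(rules, text):
    """Split `text` into (rule_name, lexeme) pairs using longest match."""
    pos, tokens = 0, []
    while pos < len(text):
        token = next_token(rules, text, pos)
        if token is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append(token)
        pos += len(token[1])
    return tokens

implicit = [("HELLO", r"hello"), ("SPACE", r" "), ("WORLD", r"world"), ("NL", r"\n")]
whole_line = implicit + [("LINE", r"[^\n]+?\n")]  # like LINE : ~[\n]+? '\n';
single_char = implicit + [("LINE", r"[^\n]")]     # like LINE : ~[\n];

greedy_result = tokenize(whole_line, "hello world\n")
fixed_result = tokenize(single_char, "hello world\n")
```

With the whole-line rule, the entire input becomes one LINE token. With the single-character rule, `hello` and `world` are longer than LINE's one character, and the space ties with LINE but is defined first, so the implicit tokens are produced.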

ANTLR4 - parse a file line-by-line

I am trying to write a grammar to parse a file line by line.
My grammar looks like this:
grammar simple;
parse: (line NL)* EOF;
line
: statement1
| statement2
| // empty line
;
statement1 : KW1 (INT|FLOAT);
statement2 : KW2 INT;
...
NL : '\r'? '\n';
WS : (' '|'\t')-> skip; // toss out whitespace
If the last line in my input file does not have a newline, I get the following error message:
line xx:37 missing NL at <EOF>
Can somebody please explain how to write a grammar that actually accepts the last line without a newline?
Simply don't require NL to fall after the last line. This form is efficient, and simplified based on the fact that line can match the empty sequence (essentially the last line will always be the empty line).
// best if `line` can be empty
parse
: line (NL line)* EOF
;
If line was not allowed to be empty, the following rule is efficient and performs the same operation:
// best if `line` cannot be empty
parse
: (line (NL line)* NL?)? EOF
;
The following rule is equivalent for the case where line cannot be empty, but is much less efficient. I'm including it because I frequently see people write rules in this form where it's easy to see that the specific NL token which was made optional is the one following the last line.
// INEFFICIENT!
parse
: (line NL)* (line NL?)? EOF
;
