Match Symbol specific number of times - rascal

When defining a syntax, it is possible to match 1 or more times (+) or 0 or more times (*) similarly to how it is done in regex. However, I have not found in the rascal documentation if it is possible to also match a Symbol a specific amount of times. In regex (and Rascal patterns) this is done with an integer between two curly brackets but this doesn't seem to work for syntax definition. Ideally, I'd want something like:
lexical Line = [0-9.]+;
syntax Sym = sym: {Line Newline}{5};
Which would only try to match the first 5 lines of the text below:
..0..
11.11
44.44
1.11.1
33333
55555

No this meta syntax does not exist in Rascal. We did not add it.
You could write an over-estimation like this and have a post-parse filter reject more than 5 items:
syntax Sym = fiveLines: (Line NewLine)+ lines
visit (myParseTree) {
case (Sym) `<(Line NewLine)+ lines>` :
throw ParseError(x.src) when length(lines) != 5;
}
Or unfold the loop like so:
syntax Sym
= Line NewLine
Line NewLine
Line NewLine
Line NewLine
Line NewLine
;
Repetition with an integer parameter sounds like a good feature request for us the consider, if you need it badly. We only have to consider what it means for Rascal's type-system; for the parser generator its a simple rule to add.

Related

Problem with parsing file ending on a newline

It seems a bit like a trivial question, but I am stuck on parsing the end of file EOF using my own island grammar. I am using the new VScode extension btw.
I've mostly been using the examples from the basic recipes and have a simple grammar with the following layout rules:
layout Whitespace = [\t-\n\r\ ]*;
lexical IntegerLiteral = [0-9]+ !>> [0-9];
lexical Comment = "%%" ![\n]* $;
Using this, and some rules it parses some simple files, but will give a parse error anytime a file ends in a newline. (newlines in between lines are no problem).
Am is missing something obvious?
Thanks!
It sounds a bit like your grammar is missing a start nonterminal. All grammar rules get whitespace in between their constituent symbols but not at the start or the end.
A start nonterminal is the exception:
start syntax Islands = Island+;
Islands parseIslands(loc input)
= parse(#start[Islands], input).top;
Passing the start nonterminal to parse will allow the file to start and end with whitespace, and using the .top field you can ignore that whitespace from the parse tree again by projecting out the middle Islands tree.
Island grammars tend to be a complex beast, so without sharing the full grammar and input string, it might be a bit hard to answer this question. But I'll share some generic feedback.
he layout production might be ambiguous, if any other part of your language has optional parts. Rascal's parsing is non-greedy. So if you have:
lexical A = "a";
lexical B = "b";
lexical C = "c";
syntax A = A? B? C;
After fusing in the layouts, this becomes:
A` = A? Whitespace? B? Whitespace? C;
Now since whitespace is not eating all characters, the grammar is ambigous, as the parser can "bind" a whitespace between the A and B, or between the B and C. So in most cases, you want to make sure it's a greedy match by adding a follow restriction:
layout Whitespace = [\t-\n \r \ ]* !>> [\t-\n \r \ ];
Also, I fixed a bug, the layout definition didn't include a space as valid whitespace. Rascal allows for spaces in the character class (for readability), so in case we need to add a space, you have to say \ .
For the rest, it looks okay, but like I started with, island grammars are a bit harder to debug without both the full syntax, and what you want to have as water and what as island.

FParsec - how to escape a separator

I'm working on an EDI file parser, and I'm having considerable difficulty implementing an escape for the 'segment terminator'. For anyone fortunate enough to not work with EDI, the segment terminator (usually an apostrophe) is the deliter between segments, which are like cells.
The desired behaviour looks something like this:
ABC+123'DEF+567' -> ["ABC+123", "DEF+567"]
ABC+123?'DEF+567' -> ["ABC+123?'DEF+567"]
Using FParsec, without escaping the apostrophe (and, for simplicity, ignoring parameterisation), the parser looks something like this:
let pSegment = //logic to parse the contents of a segment
let pAllSegments = sepEndBy pSegment (str "'")
This approach with the above example would yield ["ABC+123?", "DEF+567"].
My next consideration was to use a regex:
let pAllSegments = sepEndBy pSegment (regex #"[^\?]'")
The problem here is that the character prior to the apostrophe is also consumed, leading to incomplete messages.
I'm fairly certain I just don't understand FParsec well enough here. Does anyone have any pointers?
The issue is in the parse contents step.
The parser is working 'bottom up'. It finds the contents of the segments, which are not permitted to contain the terminator, then finds that all these segments are separated by the terminator, and constructs the list.
My error was in the pSegment step, which was using a parameterised version of (?:[A-Za-z0-9 \\.]|\?[\?\+:\?])*. See that second ?? That should have been a '.

Why are redundant parenthesis not allowed in syntax definitions?

This syntax module is syntactically valid:
module mod1
syntax Empty =
;
And so is this one, which should be an equivalent grammar to the previous one:
module mod2
syntax Empty =
( )
;
(The resulting parser accepts only empty strings.)
Which means that you can make grammars such as this one:
module mod3
syntax EmptyOrKitchen =
( ) | "kitchen"
;
But, the following is not allowed (nested parenthesis):
module mod4
syntax Empty =
(( ))
;
I would have guessed that redundant parenthesis are allowed, since they are allowed in things like expressions, e.g. ((2)) + 2.
This problem came up when working with the data types for internal representation of rascal syntax definitions. The following code will create the same module as in the last example, namely mod4 (modulo some whitespace):
import Grammar;
import lang::rascal::format::Grammar;
str sm1 = definition2rascal(\definition("unknown_main",("the-module":\module("unknown",{},{},grammar({sort("Empty")},(sort("Empty"):prod(sort("Empty"),[
alt({seq([])})
],{})))))));
The problematic part of the data is on its own line - alt({seq([])}). If this code is changed to seq([]), then you get the same syntax module as mod2. If you further delete this whole expression, i.e. so that you get this:
str sm3 =
definition2rascal(\definition("unknown_main",("the-module":\module("unknown",{},{},grammar({sort("Empty")},(sort("Empty"):prod(sort("Empty"),[
], {})))))));
Then you get mod1.
So should such redundant parenthesis by printed by the definition2rascal(...) function? And should it matter with regards to making the resulting module valid or not?
Why they are not allowed is basically we wanted to see if we could do without. There is currently no priority relation between the symbol kinds, so in general there is no need to have a bracket syntax (like you do need to + and * in expressions).
Already the brackets have two different semantics, one () being the epsilon symbol and two (Sym1 Sym2 ...) being a nested sequence. This nested sequence is defined (syntactically) to expect at least two symbols. Now we could without ambiguity introduce a third semantics for the brackets with a single symbol or relax the requirement for sequence... But we reckoned it would be confusing that in one case you would get an extra layer in the resulting parse tree (sequence), while in the other case you would not (ignored superfluous bracket).
More detailed wise, the problem of printing seq([]) is not so much a problem of the meta syntax but rather that the backing abstract notation is more relaxed than the concrete notation (i.e. it is a bigger language or an over-approximation). The parser generator will generate a working parser for seq([]). But, there is no Rascal notation for an empty sequence and I guess the pretty printer should throw an exception.

Lua pattern help (Double parentheses)

I have been coding a program in Lua that automatically formats IRC logs from a roleplay. In the roleplay logs there is a specific guideline for "Out of character" conversation, which we use double parentheses for. For example: ((<Things unrelated to roleplay go here>)). I have been trying to have my program remove text between double brackets (and including both brackets). The code is:
ofile = io.open("Output.txt", "w")
rfile = io.open("Input.txt", "r")
p = rfile:read("*all")
w = string.gsub(p, "%(%(.*?%)%)", "")
ofile:write(w)
The pattern here is > "%(%(.*?%)%)" I've tried multiple variations of the pattern. All resulted in fruitless results:
1. %(%(.*?%)%) --Wouldn't do anything.
2. %(%(.*%)%) --Would remove *everything* after the first OOC message.
Then, my friend told me that prepending the brackets with percentages wouldn't work, and that I had to use backslashes to 'escape' the parentheses.
3. \(\(.*\)\) --resulted in the output file being completely empty.
4. (\(\(.*\)\)) --Same result as above.
5. (\(\(.*?\)\) --would for some reason, remove large parts of the text for no apparent reason.
6. \(\(.*?\)\) --would just remove all the text except for the last line.
The short, absolute question:
What pattern would I need to use to remove all text between double parentheses, and remove the double parentheses themselves too?
You're friend is thinking of regular expressions. Lua patterns are similar, but different. % is the correct escape character.
Your pattern should be %(%(.-%)%). The - is similar to * in that it matches any number of the preceding sequence, but while * tries to match as many characters as it can (it's greedy), - matches the least amount of characters possible (it's non-greedy). It won't go overboard and match extra double-close-parenthesis.

Match newlines only under specific conditions

I am writing parser of language similar to javascript with its semicolon insertion ex:
var x = 1 + 2;
x;
and
var x = 1 + 2
x
and even
var x = 1 +
2
x
are the same.
For now my lexer matches newline (\n) only when it occurs after token different that semicolon. That plays nice with basic situations like 1 and 2 but how i can deal with third situation? i.e. new line happening in the middle of expression. I can't match new line every time because it would pollute my parser (inserting alternatives with newlines token everywhere) and I also cannot match them at all because it is statement terminator. Basically I would be the best to somehow check during parsing end of the statement if there was a new line character or semicolon there.
This has gone unanswered for a while. I cannot see why you cannot make a statement separator a newline **or* a semicolon. A bit like this:
whitespace [ \t]+
%%
{whitespace} /* Skip */
;[\n]* return(SEMICOLON);
[\n]+ return(SEMICOLON);
Then you're grammar is not messed up at all, as you only get the semicolon in the grammar.

Resources