PEGKit combine matched symbols on stack - parsing

I'm writing a grammar for PEGKit to parse a Twine-exported Twee file. This is my first time using PEGKit and I'm trying to get to grips with how it works.
Here is the Twee source file I'm parsing:
:: Passage One
P1 Line One
P1 Line Two
:: Passage Two
P2 Line One
P2 Line Two
Currently I've worked out how to parse the above using the following grammar:
@before {
    PKTokenizer *t = self.tokenizer;
    [t.symbolState add:@"::"];
    [t.commentState addSingleLineStartMarker:@"::"];
    // New lines as symbols
    [t.whitespaceState setWhitespaceChars:NO from:'\n' to:'\n'];
    [t.whitespaceState setWhitespaceChars:NO from:'\r' to:'\r'];
    [t setTokenizerState:t.symbolState from:'\n' to:'\n'];
    [t setTokenizerState:t.symbolState from:'\r' to:'\r'];
}
start = passage+;
passage = passageTitle contentLine*;
passageTitle = passageStart Word+ eol+;
contentLine = singleLine eol+;
singleLine = Word+;
passageStart = '::'!;
eol = '\n'! | '\r'!;
and the result I get is:
[Passage, One, P1, Line, One, P1, Line, Two, Passage, Two, P2, Line, One, P2, Line, Two]::/Passage/One/
/P1/Line/One/
/P1/Line/Two/
/
/::/Passage/Two/
/P2/Line/One/
/P2/Line/Two/
^
Ideally, I'd like the parser to combine the words matched for the passageTitle into a single string, similar to how the built-in PEGKit QuotedString grammar works. I would also like the words matched for a contentLine to be combined in the same way.
So, eventually, I would have this on the stack:
[Passage One, P1 Line One, P1 Line Two, Passage Two, P2 Line One, P2 Line Two]
Any thoughts on how to achieve this would be appreciated.

Creator of PEGKit here.
I understand your ultimate strategy (to collect/combine lines as single string objects), and agree that it makes sense. However, I disagree with your proposed tactic for achieving it (altering tokenization to try to combine what are essentially multiple separate tokens into single tokens).
Combining lines into convenient string objects makes sense, but altering tokenization to achieve that doesn't make sense IMO (at least not with a recursive descent parsing kit like PEGKit) when the lines in question don't have obvious 'bracketing' characters like quotes or brackets.
You could treat the passageTitle lines starting with :: as single-line Comment tokens, but I probably wouldn't, since I gather they are semantically not comments.
So instead of merging multiple tokens via the tokenizer, you should merge multiple tokens in the more natural way for PEGKit: in the parser delegate callbacks.
We have two different cases to deal with here:
The passageTitle lines
The contentLine lines
In your grammar, remove this line so we won't be treating passageTitles as Comment tokens (you didn't have that completely correctly configured anyhow, but never mind that):
[t.commentState addSingleLineStartMarker:@"::"];
And also in your grammar, remove the ! from your passageStart rule so that those tokens won't be discarded:
passageStart = '::';
That's all for the grammar. Now, in your parser delegate, implement the two necessary callback methods for the title and content lines. In each callback, pull all of the necessary tokens off the PKAssembly's stack and merge them into a single string (in reverse).
@interface TweeDelegate : NSObject
@end

@implementation TweeDelegate

- (void)parser:(PKParser *)p didMatchPassageTitle:(PKAssembly *)a {
    NSArray *toks = [a objectsAbove:[PKToken tokenWithTokenType:PKTokenTypeSymbol stringValue:@"::" doubleValue:0.0]];
    [a pop]; // discard `::`

    NSMutableString *buf = [NSMutableString string];
    for (PKToken *tok in [toks reverseObjectEnumerator]) {
        [buf appendFormat:@"%@ ", tok.stringValue];
    }
    CFStringTrimWhitespace((CFMutableStringRef)buf);

    NSLog(@"Title: %@", buf); // Passage One
}

- (void)parser:(PKParser *)p didMatchContentLine:(PKAssembly *)a {
    NSArray *toks = [a objectsAbove:nil];

    NSMutableString *buf = [NSMutableString string];
    for (PKToken *tok in [toks reverseObjectEnumerator]) {
        [buf appendFormat:@"%@ ", tok.stringValue];
    }
    CFStringTrimWhitespace((CFMutableStringRef)buf);

    NSLog(@"Content: %@", buf); // P1 Line One
}

@end
I receive the following output:
Title: Passage One
Content: P1 Line One
Content: P1 Line Two
Title: Passage Two
Content: P2 Line One
Content: P2 Line Two
As for what to do with these strings once you have created them, I'll leave that up to you :). Hope that helps.
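For completeness, driving the parser with this delegate might look roughly like the following. This is a sketch from memory, not part of the answer above: TweeParser stands in for whatever parser class you generated from the grammar, and story.twee for your input file.
TweeDelegate *del = [[TweeDelegate alloc] init];
TweeParser *parser = [[TweeParser alloc] initWithDelegate:del]; // hypothetical class generated from the grammar above

NSError *err = nil;
NSString *source = [NSString stringWithContentsOfFile:@"story.twee"
                                             encoding:NSUTF8StringEncoding
                                                error:&err];
PKAssembly *result = [parser parseString:source error:&err];
if (!result) NSLog(@"parse error: %@", err);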

Related

Implement heredocs with trim indent using PEG.js

I'm working on a language similar to Ruby, called Gaiman, and I'm using PEG.js to generate the parser.
Do you know if there is a way to implement heredocs with proper indentation?
xxx = <<<END
hello
world
END
the output should be:
"hello
world"
I need this because this code doesn't look very nice:
def foo(arg) {
    if arg == "here" then
        return <<<END
xxx
xxx
END
    end
end
this is a function where the user wants to return:
"xxx
xxx"
I would prefer the code to look like this:
def foo(arg) {
    if arg == "here" then
        return <<<END
        xxx
        xxx
        END
    end
end
If I trim all the lines, the user will not be able to use a string with leading spaces when they want one. Does anyone know if PEG.js allows this?
I don't have any code for heredocs yet; I just want to be sure that what I want is possible.
EDIT:
So I've tried to implement heredocs, and the problem is that PEG doesn't allow back-references.
heredoc = "<<<" marker:[\w]+ "\n" text:[\s\S]+ marker {
    return text.join('');
}
It says that the marker is not defined. As for trimming, I think I can use the location() function.
I don't think that's a reasonable expectation for a parser generator; few if any would be equal to the challenge.
For a start, recognising the here-string syntax is inherently context-sensitive, since the end-delimiter must be a precise copy of the delimiter provided after the <<< token. So you would need a custom lexical analyser, and that means that you need a parser generator which allows you to use a custom lexical analyser. (So a parser generator which assumes you want a scannerless parser might not be the optimal choice.)
Recognising the end of the here-string token shouldn't be too difficult, although you can't do it with a single regular expression. My approach would be to use a custom scanning function which breaks the here-string into a series of lines, concatenating them as it goes until it reaches a line containing only the end-delimiter.
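To make that concrete, a scanning function of that shape could look roughly like this (a JavaScript sketch; scanHeredoc and its interface are invented for illustration, not part of PEG.js or any other tool):
// Consume lines starting at pos until one consists solely of the end
// delimiter; return the accumulated text and the position just past it.
function scanHeredoc(input, pos, delimiter) {
  const lines = [];
  let i = pos;
  while (i < input.length) {
    let end = input.indexOf("\n", i);
    if (end === -1) end = input.length;
    const line = input.slice(i, end);
    if (line.trim() === delimiter) {
      return { text: lines.join("\n"), next: end + 1 };
    }
    lines.push(line);
    i = end + 1;
  }
  throw new Error("unterminated here-string: missing " + delimiter);
}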
Once you've recognised the text of the literal, all you need to normalise the spaces in the way you want is the column number at which the <<< starts. With that, you can trim each line in the string literal. So you only need a lexical scanner which accurately reports token position. Trimming wouldn't normally be done inside the generated lexical scanner; rather, it would be the associated semantic action. (Equally, it could be a semantic action in the grammar. But it's always going to be code that you write.)
When you trim the literal, you'll need to deal with the cases in which it is impossible, because the user has not respected the indentation requirement. And you'll need to do something with tab characters; getting those right probably means that you'll want a lexical scanner which computes visible column positions rather than character offsets.
I don't know if peg.js corresponds with those requirements, since I don't use it. (I did look at the documentation, and failed to see any indication as to how you might incorporate a custom scanner function. But that doesn't mean there isn't a way to do it.) I hope that the discussion above at least lets you check the detailed documentation for the parser generator you want to use, and otherwise find a different parser generator which will work for you in this use case.
Here is an implementation of heredocs in Peggy, the successor to PEG.js (which is no longer maintained). This code was based on the GitHub issue.
heredoc = "<<<" begin:marker "\n" text:($any_char+ "\n")+ _ end:marker (
&{ return begin === end; }
/ '' { error(`Expected matched marker "${begin}", but marker "${end}" was found`); }
) {
const loc = location();
const min = loc.start.column - 1;
const re = new RegExp(`\\s{${min}}`);
return text.map(line => {
return line[0].replace(re, '');
}).join('\n');
}
any_char = (!"\n" .)
marker_char = (!" " !"\n" .)
marker "Marker" = $marker_char+
_ "whitespace"
= [ \t\n\r]* { return []; }
EDIT: the above didn't work when another piece of code followed the heredoc; here is a better grammar:
{ let heredoc_begin = null; }

heredoc = "<<<" beginMarker "\n" text:content endMarker {
    const loc = location();
    const min = loc.start.column - 1;
    const re = new RegExp(`^\\s{${min}}`, 'mg');
    return {
        type: 'Literal',
        value: text.replace(re, '')
    };
}

__ = (!"\n" !" " .)
marker 'Marker' = $__+
beginMarker = m:marker { heredoc_begin = m; }
endMarker = "\n" " "* end:marker &{ return heredoc_begin === end; }
content = $(!endMarker .)*
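Assuming the grammar above is saved as heredoc.pegjs with heredoc as its start rule (the file name is mine; peggy.generate is Peggy's documented API), wiring it up could look like this:
const fs = require("fs");
const peggy = require("peggy");

// Compile the grammar into a parser at runtime.
const parser = peggy.generate(fs.readFileSync("heredoc.pegjs", "utf8"));

const result = parser.parse("<<<END\nhello\nworld\nEND");
console.log(result); // { type: 'Literal', value: 'hello\nworld' }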

How to achieve capturing groups in flex lex?

I wanted to match a string which starts with a '#', then matches everything until the character that follows the '#' occurs again. This can be achieved using capturing groups like this: #(.)[^(?1)]*(?1) (EDIT: this regex is also erroneous). It matches #$foo$, does not match #%bar&, and matches the first 6 characters of #"foo"bar.
But since flex lex does not support capturing groups, what is the workaround here?
As you say, (f)lex does not support capturing groups, and it certainly doesn't support backreferences.
So there is no simple workaround, but there are workarounds. Here are a few possibilities:
You can read the input one character at a time using the input() function, until you find the matching character (but you have to create your own buffer to store the characters, because characters read by input() are not added to the current token). This is not the most efficient approach, because reading one character at a time is a bit clunky, but it's the only interface (f)lex offers. (The following snippet assumes you have some kind of expandable stringBuilder; if you are using C++, this would just be replaced with a std::string.)
#.  { StringBuilder sb = string_builder_new();
      int delim = yytext[1];
      for (;;) {
          int next = input();
          if (next == delim) break;
          if (next == EOF)  { /* Signal error */ break; }
          string_builder_addchar(sb, next);
      }
      yylval = string_builder_release(sb);
      return DELIMITED_STRING;
    }
Even less efficiently, but perhaps more conveniently, you can get (f)lex to accumulate the characters in yytext using yymore(), matching one character at a time in a start condition:
%x DELIMITED
%%
    int delim;

#.                 { delim = yytext[1]; BEGIN(DELIMITED); }
<DELIMITED>.|\n    { /* with yymore(), the newest character is at the end of yytext */
                     if (yytext[yyleng - 1] == delim) {
                         yylval = strndup(yytext, yyleng - 1); /* drop the delimiter */
                         BEGIN(INITIAL);
                         return DELIMITED_STRING;
                     }
                     yymore();
                   }
<DELIMITED><<EOF>> { /* Signal unterminated string error */ }
The most efficient solution (in (f)lex) is to just write one rule for each possible delimiter. While that's a lot of rules, they could be easily generated with a small script in whatever scripting language you prefer. And, actually, there are not that many rules, particularly if you don't allow alphabetic and non-printing characters to be delimiters. This has the additional advantage that if you want Perl-like parenthetic delimiters (#(Hello) instead of #(Hello(), you can just modify the individual pattern to suit (as I've done below). [Note 1] Since all the actions are the same, it might be easier to use a macro for the action, making it easier to modify.
    /* Ordinary punctuation */
#:[^:]*:    { yylval = strndup(yytext + 2, yyleng - 3); return DELIMITED_STRING; }
#![^!]*!    { yylval = strndup(yytext + 2, yyleng - 3); return DELIMITED_STRING; }
#\.[^.]*\.  { yylval = strndup(yytext + 2, yyleng - 3); return DELIMITED_STRING; }
    /* Matched pairs */
#<[^>]*>    { yylval = strndup(yytext + 2, yyleng - 3); return DELIMITED_STRING; }
#\[[^]]*]   { yylval = strndup(yytext + 2, yyleng - 3); return DELIMITED_STRING; }
    /* Trap errors */
#           { /* Report unmatched or invalid delimiter error */ }
If I were writing a script to generate these rules, I would use hexadecimal escapes for all the delimiter characters rather than trying to figure out which ones needed escapes.
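For instance, a small Python generator along these lines could emit all of the rules (a sketch; the skipped character set and the shared action are my assumptions):
# Emit one flex rule per printable, non-alphanumeric delimiter, using hex
# escapes so no delimiter needs special quoting inside the pattern.
ACTION = '{ yylval = strndup(yytext + 2, yyleng - 3); return DELIMITED_STRING; }'
PAIRS = {'(': ')', '[': ']', '{': '}', '<': '>'}  # Perl-like matched pairs

for code in range(0x21, 0x7f):
    ch = chr(code)
    if ch.isalnum() or ch == '#':
        continue  # alphanumerics make poor delimiters; '#' is the introducer
    close = PAIRS.get(ch, ch)
    print('#\\x%02x[^\\x%02x]*\\x%02x %s' % (code, ord(close), ord(close), ACTION))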
Notes:
1. Perl requires nested balanced parentheses in constructs like that. But you can't do that with regular expressions; if you wanted to reproduce Perl behaviour, you'd need to use some variation on one of the other suggestions. I'll try to revisit this answer later to address that feature.

lua tables - string representation

As a follow-up question to lua tables - allowed values and syntax:
I need a table that maps large numbers to strings. The catch seems to be that strings with punctuation are not allowed:
local Names = {
    [7022003001] = fulsom jct, OH
    [7022003002] = kennedy center, NY
}
but neither are quotes:
local Names = {
    [7022003001] = "fulsom jct, OH"
    [7022003002] = "kennedy center, NY"
}
I have even tried without any spaces:
local Names = {
    [7022003001] = fulsomjctOH
    [7022003002] = kennedycenterNY
}
When this module is loaded, Wireshark complains that "}" is expected to close "{" at line . How can I implement a table with a string that contains spaces and punctuation?
As per Lua Reference Manual - 3.1 - Lexical Conventions:
A short literal string can be delimited by matching single or double quotes, and can contain the (...) C-like escape sequences (...).
That means the short literal string in Lua is:
local foo = "I'm a string literal"
That matches your second example. The reason it fails is that it lacks a separator between table members:
local Names = {
    [7022003001] = "fulsom jct, OH",
    [7022003002] = "kennedy center, NY"
}
You can also add a trailing separator after the last member.
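Put together, with the separators in place, the table loads and lookups behave as expected:
local Names = {
    [7022003001] = "fulsom jct, OH",
    [7022003002] = "kennedy center, NY",  -- trailing separator is allowed
}

print(Names[7022003001])  -- fulsom jct, OH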
The more detailed description of the table constructor can be found in 3.4.9 - Table Constructors. It could be summed up by the example provided there:
a = { [f(1)] = g; "x", "y"; x = 1, f(x), [30] = 23; 45 }
I really, really recommend using the Lua Reference Manual; it is an amazing helper.
I also highly encourage you to read some basic tutorials, e.g. Learn Lua in 15 minutes. They should give you an overview of the language you are trying to use.

Why does this Rascal pattern matching code use so much memory and time?

I'm trying to write what I would think of as an extremely simple piece of code in Rascal: Testing if list A contains list B.
Starting out with some very basic code to create a list of strings:
public list[str] makeStringList(int Start, int End)
{
    return [ "some string with number <i>" | i <- [Start..End]];
}

public list[str] toTest = makeStringList(0, 200000);
My first try was 'inspired' by the sorting example in the tutor:
public void findClone(list[str] In, str S1, str S2, str S3, str S4, str S5, str S6)
{
    switch(In)
    {
        case [*str head, str i1, str i2, str i3, str i4, str i5, str i6, *str tail]:
        {
            if(S1 == i1 && S2 == i2 && S3 == i3 && S4 == i4 && S5 == i5 && S6 == i6)
            {
                println("found duplicate\n\t<i1>\n\t<i2>\n\t<i3>\n\t<i4>\n\t<i5>\n\t<i6>");
            }
            fail;
        }
        default:
            return;
    }
}
Not very pretty, but I expected it to work. Unfortunately, the code runs for about 30 seconds before crashing with an "out of memory" error.
I then tried a better-looking alternative:
public void findClone2(list[str] In, list[str] whatWeSearchFor)
{
    for ([*str head, *str mid, *str end] := In)
        if (mid == whatWeSearchFor)
            println("gotcha");
}
with approximately the same result (it seems to run a little longer before running out of memory).
Finally, I tried a 'good old' C-style approach with a for-loop:
public void findClone3(list[str] In, list[str] whatWeSearchFor)
{
    cloneLength = size(whatWeSearchFor);
    inputLength = size(In);
    if(inputLength < cloneLength) return;
    loopLength = inputLength - cloneLength + 1;
    for(int i <- [0..loopLength])
    {
        isAClone = true;
        for(int j <- [0..cloneLength])
        {
            if(In[i+j] != whatWeSearchFor[j])
                isAClone = false;
        }
        if(isAClone) println("Found clone <whatWeSearchFor> on lines <i> through <i+cloneLength-1>");
    }
}
To my surprise, this one works like a charm. No out of memory, and results in seconds.
I get that my first two attempts probably create a lot of temporary string objects that all have to be garbage collected, but I can't believe that the only solution that worked really is the best solution.
Any pointers would be greatly appreciated.
My relevant eclipse.ini settings are
-XX:MaxPermSize=512m
-Xms512m
-Xss64m
-Xmx1G
We'll need to take a look to see why this is happening. Note that, if you want to use pattern matching, this is maybe a better way to write it:
public void findClone(list[str] In, str S1, str S2, str S3, str S4, str S5, str S6) {
    switch(In) {
        case [*str head, S1, S2, S3, S4, S5, S6, *str tail]: {
            println("found duplicate\n\t<S1>\n\t<S2>\n\t<S3>\n\t<S4>\n\t<S5>\n\t<S6>");
        }
        default:
            return;
    }
}
If you do this, you are taking advantage of Rascal's matcher to actually find the matching strings directly, versus your first example, in which any string would match but you then needed a number of separate comparisons to see if the match represented the combination you were looking for. If I run this on 110145 through 110150 it takes a while, but it works and doesn't seem to grow beyond the heap space you allocated to it.
Also, is there a reason you are using fail? Is this to continue searching?
It's an algorithmic issue, as Mark Hills said. In Rascal, some short code can still entail a lot of nested loops, almost implicitly. Basically, every * splice operator on a fresh variable that you use on the pattern side of a list match generates one level of loop nesting, except for the last one, which just matches the rest of the list.
In your findClone2 code you are first generating all combinations of sublists and then filtering them using the if construct. So it's a correct algorithm, but probably a slow one. This is your code:
void findClone2(list[str] In, list[str] whatWeSearchFor)
{
    for ([*str head, *str mid, *str end] := In)
        if (mid == whatWeSearchFor)
            println("gotcha");
}
You see how it has a nested loop over In, because it has two effective * operators in the pattern. The code therefore runs in O(n^2), where n is the length of In; i.e., it has quadratic runtime behaviour in the size of the In list. In is a big list, so this matters.
In the following new code, we filter first while generating answers, using fewer lines of code:
public void findCloneLinear(list[str] In, list[str] whatWeSearchFor)
{
    for ([*str head, *whatWeSearchFor, *str end] := In)
        println("gotcha");
}
The second * operator does not generate a new loop because it is not fresh: it just "pastes" the given list values into the pattern. So now there is actually only one effective * which generates a loop, namely the first one, on head. That one makes the algorithm loop over the list. The second * tests whether the elements of whatWeSearchFor are all right there in the list after head (this is linear in the size of whatWeSearchFor), and then the last * just completes the list, allowing for more stuff to follow.
It's also nice to know where the clone is sometimes:
public void findCloneLinear(list[str] In, list[str] whatWeSearchFor)
{
    for ([*head, *whatWeSearchFor, *_] := In)
        println("gotcha at <size(head)>");
}
Rascal does not have an optimising compiler (yet) which might internally transform your algorithms into equivalent optimised ones. So, as a Rascal programmer, you are still expected to know the effect of loops on your algorithm's complexity, and to know that * is a very short notation for a loop.

Finding follow sets - infinite recursion

While finding FOLLOW sets, rules such as A -> aA can lead to infinite recursion. Is there any coding technique to avoid it?
Note that the above is just an example; in practice such recursion could happen indirectly as well.
Here is my sample C code for finding follow sets. The grammar is stored as an array of linked lists. Please tell me if the code is unclear at any point.
set findFollowSet(char nonTerminal[], Grammar G, hashTable2 h) //later assume that all first sets are already in the hashtable
{
    LINK temp1 = find2(h, nonTerminal);
    set s = createEmptySet();
    set temp = createEmptySet();
    char lhs[80] = "\0";
    int i;

    //special case: it's not on the right side of any grammar rule
    if(temp1->numRightSideOf == 0)
        return insert(s, "$");

    for(i = 0; i < temp1->numRightSideOf; i++)
    {
        link l = G.rules[temp1->rightSideOf[i]];
        strcpy(lhs, l->symbol); //storing the lhs in case the nonterminal appears at the rightmost end of the rule
        printf("!!!!! %s\n", lhs);
        sleep(1);

        //finding nonTerminal in G
        while(l != NULL)
        {
            if(strcmp(l->symbol, nonTerminal) == 0)
                break;
            l = l->next;
        }

        //found the nonTerminal in G
        if(l->next != NULL)
        {
            temp = findFirstSet(l->next, G, h);
            temp = removeElement(temp, "EPSILON");
        }
        else //it's at the rightmost end of the rule
            temp = findFollowSet(lhs, G, h);

        s = setUnion(s, temp);
        destroySet(temp);
    }
    return s;
}
FIRST and FOLLOW sets are defined recursively, so you need to find the recursive closure. What this means in practice is that you don't find the FOLLOW set for a single non-terminal -- you find all the FOLLOW sets for all the non-terminals simultaneously, by starting with all sets empty and going over the grammar adding symbols to different sets, until no more symbols can be added to any set. So you end up with something like:
FOLLOW[*] = {}; // all follow sets start empty
done = false;
while (!done)
    done = true;
    for (R : each rule in the grammar)
        A = LHS[R];
        tmp = FOLLOW[A];
        for (S : each symbol in RHS[R] from right to left)
            if (S is terminal)
                tmp = {S};
            else
                if (!(FOLLOW[S] contains tmp))
                    done = false
                    FOLLOW[S] |= tmp
                if (epsilon in FIRST[S])
                    tmp |= FIRST[S] - epsilon
                else
                    tmp = FIRST[S]
OK, I got an answer, but it's inefficient, so if anyone wants to suggest something more efficient, please feel welcome.
Just store the recursion stack explicitly and, at each recursive call, check whether the entry already exists in the stack.
Mind you, you need to check the entire stack, not just the top of it.
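To make that concrete, here is a minimal, self-contained C sketch of such a guard (illustrative names, not the original code): the chain of non-terminals currently being expanded lives in an explicit array, and a call for a non-terminal already somewhere on that chain returns immediately instead of recursing forever.
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define MAX_DEPTH 128

static const char *stack[MAX_DEPTH]; /* non-terminals currently being expanded */
static int depth = 0;

/* Check the entire stack, not just the top. */
static bool inProgress(const char *nonTerminal) {
    for (int i = 0; i < depth; i++)
        if (strcmp(stack[i], nonTerminal) == 0)
            return true;
    return false;
}

static void follow(const char *nonTerminal) {
    if (inProgress(nonTerminal))
        return; /* cycle detected: cut the recursion */
    stack[depth++] = nonTerminal;
    printf("computing FOLLOW(%s)\n", nonTerminal);
    /* For A -> aA, the rightmost A asks for FOLLOW(A) again;
       the guard above turns that call into a no-op. */
    if (strcmp(nonTerminal, "A") == 0)
        follow("A");
    depth--;
}

int main(void) { follow("A"); return 0; }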
