When using a Raku grammar, can one limit parsing time?

The following Raku program defines a grammar Parser, then attempts to use it to parse the string baa. However, program execution takes too long. Is there a way to limit the amount of execution time devoted to parsing, so that parsing can be deemed to have exceeded the desired limit and timed out?
grammar Parser
{
    token TOP { <S> }
    token S { '' | <S> <S> | 'a' <S> 'b' | <S> 'a' | 'b' <S> 'b' }
}
sub MAIN()
{
    say Parser.parse( 'baa' ).Bool ; # Would like True, False, or Timeout
} # end sub MAIN
Also, might there be plans to have Raku implement the Adaptive LL(*) parsing of ANTLR? Version 4.11.1 of ANTLR has code generation targets including Java, Python, and others, but not Raku.

There is currently no way to stop parsing other than by exiting the process. If that is OK in your situation, then you could do something like:
start {
    sleep 10; # however much time you want to give it
    note "Sorry, it took too long";
    exit 1;
}
If that is not an option, there are several variations on the above theme, such as putting the grammar parsing into a start block and waiting for the promise to be kept or broken.

Related

Is there a way to insert phases between the lexer and parser in ANTLR?

I am writing a lexer/parser for a language that allows abbreviations (and globs) for its keywords. And, I am trying to determine the best way to do it.
And one thought that occurs to me, is to insert a phase between the lexer and the parser, where the lexer recognizes the general class, e.g. is this a "command name" or is it an "option" and then passes those general tokens to a second phase which does further analysis and recognizes which command name it is and passes that on as the token type to the parser.
It will make the parser simple. I will only have to deal with well formed command names. Every token will be clear what it means.
It will keep the lexer simple. It will only have to divide things into classes. This is a simple name. This is a glob. This is an option name (starts with a dash).
The phase in the middle will also be relatively simple. The simple name (and option forms) will only have to deal with strings. The glob form can use standard glob techniques to match the glob against the legal candidates, which are in the tables for the simple names and options.
The question is how to insert that phase into ANTLR, so that I call the lexer and it creates tokens and the intermediate phase massages them and then the parser gets the tokens the intermediate phase has categorized.
Is there a known solution for this?
Something like:
lexer grammar simple
letter: [A-Z][a-z];
digit: [0-9];
glob-char: [*?];
name: letter (letter | digit)*;
option: '-'name;
glob: (glob-char|letter)(glob-char|letter|digit)*;
glob-option: '-'glob;
filter grammar name;
end: 'e' | 'end';
generate: 'ge' | 'generate';
goto: 'go' | 'goto';
help: 'h' | 'help';
if: 'i' | 'if';
then: 't' | 'then';
parser grammar simple;
The user (the programmer writing in the language I am parsing) needs to be able to write g*te and have it match generate.
The phase between the lexer and the parser when it sees a glob needs to look at the glob (and the list of keywords) and see if only one of them matches the glob and if so, return that keyword. The stuff I listed in the "filter grammar" is the stuff that builds the list of keywords globs can match. I have found code on the web that matches globs to a list of names. That part isn't hard.
And I've since found in the ANTLR documentation how to run arbitrary code on matching a token and how to change the resulting token's type. (See my answer.)
It looks like you can use lexerCustomActions to achieve the desired effect. Something like the following.
In your lexer:
GLOB: [-A-Za-z0-9_.]* '*' [-A-Za-z0-9_.*]* { setType(lexGlob(getText())); } ;
In your Java code (or whatever language you are using):
int lexGlob(String origText) {
return xyzzy; // some code that computes the right kind of token type
}
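The lookup that lexGlob stands for can be sketched independently of ANTLR. The following is a minimal Python sketch, not the answerer's code: the helper name resolve_keyword is hypothetical, the keyword table is copied from the question's "filter grammar", and matching relies on the standard fnmatch module. It returns a keyword only when exactly one candidate matches the glob.

```python
from fnmatch import fnmatchcase

# Keyword table taken from the question's "filter grammar" (an assumption
# about what the real table would contain).
KEYWORDS = ["end", "generate", "goto", "help", "if", "then"]


def resolve_keyword(glob, keywords=KEYWORDS):
    """Return the single keyword matching `glob`, or None if zero or several match."""
    matches = [kw for kw in keywords if fnmatchcase(kw, glob)]
    return matches[0] if len(matches) == 1 else None


print(resolve_keyword("g*te"))  # unique match
print(resolve_keyword("g*"))    # ambiguous: generate and goto both match
```

In a real lexer action the returned keyword would then be mapped to the corresponding token type, and an ambiguous or unmatched glob would be reported as an error.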

Parse a block where each line starts with a specific symbol

I need to parse a block of code which looks like this:
* Block
| Line 1
| Line 2
| ...
It is easy to do:
block : head lines;
head : '*' line;
lines : lines '|' line
| '|' line
;
Now I wonder, how can I add nested blocks, e.g.:
* Block
| Line 1
| * Subblock
| | Line 1.1
| | ...
| Line 2
| ...
Can this be expressed as a LALR grammar?
I can, of course, parse the top-level blocks and then run my parser again to deal with each of these top-level blocks. However, I'm just learning this topic, so it would be interesting for me to avoid such an approach.
The nested-block language is not context-free [Note 2], so it cannot be parsed with an LALR(k) parser.
However, nested parenthesis languages are, of course, context-free and it is relatively easy to transform the input into a parenthetic form by replacing the initial | sequences in the lexical scanner. The transformation is simple:
when the initial sequence of |s is longer than on the previous line, insert a BEGIN_BLOCK. (The initial sequence must be exactly one | longer; otherwise it is presumably a syntax error.)
when the initial sequence of |s is shorter than on the previous line, insert enough END_BLOCKs to bring the expected length to the correct value.
The |s themselves are not passed through to the parser.
This is very similar to the INDENT/DEDENT strategy used to parse layout-aware languages like Python and Haskell. The main difference is that here we don't need a stack of indent levels.
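The transformation described above can be illustrated outside of flex. This is a minimal Python sketch under stated assumptions: the function name tokenize and the (kind, text) tuple shape are hypothetical, lines whose content starts with * are treated as heads, and pending END_BLOCKs are flushed at end of input (a detail the description above does not spell out).

```python
import re

# Leading run of "|" characters, each optionally followed by blanks.
BAR_PREFIX = re.compile(r"(?:\|[ \t]*)*")


def tokenize(text):
    """Turn bar-prefixed lines into HEAD/LINE tokens with BEGIN_BLOCK/END_BLOCK markers."""
    depth = 0
    tokens = []
    for raw in text.splitlines():
        prefix = BAR_PREFIX.match(raw).group(0)
        body = raw[len(prefix):].strip()
        if not body:
            continue  # ignore blank lines, with or without a bar prefix
        new_depth = prefix.count("|")
        if new_depth > depth:
            if new_depth != depth + 1:
                raise SyntaxError(f"block nested too deeply: {raw!r}")
            tokens.append(("BEGIN_BLOCK", None))
            depth = new_depth
        while new_depth < depth:
            tokens.append(("END_BLOCK", None))
            depth -= 1
        if body.startswith("*"):
            tokens.append(("HEAD", body[1:].strip()))
        else:
            tokens.append(("LINE", body))
    # Close any blocks still open at end of input.
    tokens.extend([("END_BLOCK", None)] * depth)
    return tokens


sample = "* Block\n| Line 1\n| * Subblock\n| | Line 1.1\n| Line 2\n"
for token in tokenize(sample):
    print(token)
```

Only a single depth counter is kept, which reflects the point that no stack of indent levels is needed here.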
Once that transformation is finished, the grammar will look something like:
content: /* empty */
| content line
| content block
block : head BEGIN_BLOCK content END_BLOCK
| head
head : '*' line
A rough outline of a flex implementation would be something like this: (see Note 1, below).
%x INDENT CONTENT
%%
  static int depth = 0, new_depth = 0;
  /* Handle pending END_BLOCKs */
  send_end:
    if (new_depth < depth) {
      --depth;
      return END_BLOCK;
    }
^"|"[[:blank:]]*   { new_depth = 1; BEGIN(INDENT); }
^.                 { new_depth = 0; yyless(0); BEGIN(CONTENT);
                     goto send_end; }
^\n                /* Ignore blank lines */
<INDENT>{
  "|"[[:blank:]]*  ++new_depth;
  .                { yyless(0); BEGIN(CONTENT);
                     if (new_depth > depth) {
                       ++depth;
                       if (new_depth > depth) { /* Report syntax error */ }
                       return BEGIN_BLOCK;
                     } else goto send_end;
                   }
  \n               BEGIN(INITIAL); /* Maybe you care about this blank line? */
}
  /* Put whatever you use here to lexically scan the lines */
<CONTENT>{
  \n               BEGIN(INITIAL);
}
Notes:
1. Not everyone will be happy with the goto, but it saves some code duplication. The fact that the state variables (depth and new_depth) are local static variables makes the lexer non-reentrant and non-restartable (after an error). That's only acceptable for toy code; for anything real, you should make the lexical scanner re-entrant and put the state variables into the extra data structure.
2. The terms "context-free" and "context-sensitive" are technical descriptions of grammars, and are therefore a bit misleading. Intuitions based on what the words seem to mean are often wrong. One very common source of context-sensitivity is a language where validity depends on two different derivations of the same non-terminal producing the same token sequence. (Assuming the non-terminal could derive more than one token sequence; otherwise, the non-terminal could be eliminated.)
There are lots of examples of such context-sensitivity in normal programming languages; usually, the grammar will allow these constructs and the check will be performed later in some semantic analysis phase. These include the requirement that an identifier be declared (two derivations of IDENTIFIER produce the same string) or the requirement that a function be called with the correct number of parameters (here, it is only necessary that the length of the derivations of the non-terminals match, but that is sufficient to trigger context-sensitivity).
In this case, the requirement is that two instances of what might be called bar-prefix in consecutive lines produce the same string of |s. In this case, since the effect is really syntactic, deferring to a later semantic analysis defeats the point of parsing. Whether the other examples of context-sensitivity are "syntactic" or "semantic" is a debate which produces a surprising amount of heat without casting much light on the discussion.
If you write an explicit end-of-block token, things become clearer:
*Block{
|Line 1
*SubBlock{
| line 1.1
| line 1.2
}
|Line 2
|...
}
and the grammar becomes:
block : '*' ID '{' lines '}'
lines : lines '|' line
| lines block
|

ANTLR 3 grammar omitted part of input source code

I am using this ANTLR 3 grammar and ANTLRWorks for testing that grammar.
But I can't figure out why some parts of my input text are omitted.
I would like to rewrite this grammar and display every element (lparen, keywords, semicolon,..) of the source file (input) in AST / CST.
I've tried everything, but without success. Can someone who is experienced with ANTLR help me?
Parse tree:
I've managed to narrow it down to the semic rule:
/*
This rule handles semicolons reported by the lexer and situations where the ECMA 3 specification states there should be semicolons automatically inserted.
The auto semicolons are not actually inserted but this rule behaves as if they were.
In the following situations an ECMA 3 parser should auto insert absent but grammatically required semicolons:
- the current token is a right brace
- the current token is the end of file (EOF) token
- there is at least one end of line (EOL) token between the current token and the previous token.
The RBRACE is handled by matching it but not consuming it.
The EOF needs no further handling because it is not consumed by default.
The EOL situation is handled by promoting the EOL or MultiLineComment with an EOL present from off channel to on channel
and thus making it parseable instead of handling it as white space. This promoting is done in the action promoteEOL.
*/
semic
@init
{
// Mark current position so we can unconsume a RBRACE.
int marker = input.mark();
// Promote EOL if appropriate
promoteEOL(retval);
}
: SEMIC
| EOF
| RBRACE { input.rewind(marker); }
| EOL | MultiLineComment // (with EOL in it)
;
So, the EVIL semicolon insertion strikes again!
I'm not really sure, but I think these mark/rewind calls are getting out of sync. The @init block is executed both when the rule is entered for branch selection and when it is entered for actual matching. It's actually creating a lot of marks but not cleaning them up. But I don't know why it messes up the parse tree like that.
Anyway, here's a working version of the same rule:
semic
@init
{
// Promote EOL if appropriate
promoteEOL(retval);
}
: SEMIC
| EOF
| { int pos = input.index(); } RBRACE { input.seek(pos); }
| EOL | MultiLineComment // (with EOL in it)
;
It's much simpler and doesn't use the mark/rewind mechanism.
But there's a catch: the semic rule in the parse tree will have a child node } in the case of a semicolon insertion before a closing brace. Try to remove the semicolon after i-- and see the result. You'll have to detect this and handle it in your code. semic should either contain a ; token, or contain EOL (which means a semicolon got silently inserted at this point).

bison and grammar: replaying the parse stack

I have not messed with building languages or parsers in a formal way since grad school and have forgotten most of what I knew back then. I now have a project that might benefit from such a thing but I'm not sure how to approach the following situation.
Let's say that in the language I want to parse there is a token that means "generate a random floating point number" in an expression.
exp: NUMBER
{$$ = $1;}
| NUMBER PLUS exp
{$$ = $1 + $3;}
| R PLUS exp
{$$ = random() + $3;}
;
I also want a "list" generating operator that will reevaluate an "exp" some number of times. Maybe like:
listExp: NUMBER COLON exp
{
for (int i = 0; i < $1; i++) {
print $3;
}
}
;
The problem I see is that "exp" will have already been reduced by the time the loop starts. If I have the input
2 : R + 2
then I think the random number will be generated as the "exp" is parsed and 2 added to it -- let's say the result is 2.0055. Then in the list expression I think 2.0055 would be printed out twice.
Is there a way to mark the "exp" before evaluation and then parse it as many times as the list loop count requires? The idea being to get a different random number in each evaluation.
Your best bet is to build an AST and evaluate the entire AST at the end of the parse. In-line evaluation is only possible for very simple (i.e. "calculator-like") projects.
Instead of an AST, you could construct code for a stack- or three-address- virtual machine. That's generally more efficient, particularly if you intend to execute the code frequently, but the AST is a lot simpler to construct, and executing it is a single depth-first scan.
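To make the AST suggestion concrete, here is a minimal Python sketch of the idea rather than bison code. The node classes Num, Rand and Add and the helpers evaluate and eval_list are hypothetical names; the assumption is that the parser actions build these nodes instead of computing values, so the random draw happens each time the tree is walked.

```python
import random
from dataclasses import dataclass


@dataclass
class Num:
    value: float


@dataclass
class Rand:
    pass


@dataclass
class Add:
    left: object
    right: object


def evaluate(node, rng):
    """Walk the tree depth-first; a Rand node draws a fresh number on every visit."""
    if isinstance(node, Num):
        return node.value
    if isinstance(node, Rand):
        return rng.random()
    if isinstance(node, Add):
        return evaluate(node.left, rng) + evaluate(node.right, rng)
    raise TypeError(f"unknown node: {node!r}")


def eval_list(count, node, rng):
    """Re-evaluate the same expression tree `count` times."""
    return [evaluate(node, rng) for _ in range(count)]


# Tree the parser actions would build for the input "2 : R + 2".
tree = Add(Rand(), Num(2))
print(eval_list(2, tree, random.Random()))
```

Because the tree is evaluated once per loop iteration, each element of the list gets its own random number instead of repeating a value fixed at parse time.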
Depending on your language design there are at least 5 different points at which a token in the language could be bound to a value. They are:
1. Pre-processor (like C #define)
2. Lexer: recognise tokens
3. Parser: recognise token structure, output AST
4. Semantic analysis: analyse AST, assign types and conversions etc
5. Code generation: output executable code or execute code directly.
If you have a token that can occur multiple times and you want to assign it a different random value each time, then phase 4 is the place to do it. If you generate an AST, walk the tree and assign the values. If you go straight to code generation (or an interpreter) do it then.

bison error recovery

I have found out that I can use 'error' in a grammar rule as a mechanism for error recovery. So if there is an error, the parser must discard the current line and resume parsing from the next line. An example from the bison manual to achieve this could be something like this:
stmts:
exp
|stmts exp
| error '\n'
But I cannot use that, because I had to make flex ignore '\n' in my scanner so that an expression is not restricted to a single line. How can I make the parser, when it encounters an error, continue parsing from the following line, given that there is no special character (such as a semicolon) to mark the end of an expression and there is no 'newline' token?
Thanks..
Since you've eliminated the marker used by the example, you're going to have to pull a stunt to get the equivalent effect.
I think you can use this:
stmts:
exp
| stmts exp
| error { eat_to_newline(); }
Where eat_to_newline() is a function in the scanner (source file) that arranges to discard any saved tokens and read up to the next newline.
#include <stdio.h>

extern void eat_to_newline(void);

/* Discard input up to and including the next newline (or EOF). */
void eat_to_newline(void)
{
    int c;
    while ((c = getchar()) != EOF && c != '\n')
        ;
}
It probably needs to be a little more complex than that, but not a lot more complex than that. You might need to use yyerrok; (and, as the comment reminds me, yyclearin; too) after calling eat_to_newline().
