Aliasing frequently used patterns in Lex - parsing

I have one regexp which is used in several rules. Can I define an alias for it, so that the regexp is defined in one place and reused across the code?
Example:
[A-Za-z0-9][A-Za-z0-9_-]*   (expression)
NAME                        (alias)
...
%%
NAME[=]NAME {
    // Do something.
}
%%

A definition goes in the definitions section of your lex input file (before the first %%), and you use it in a regular expression by putting its name inside curly braces ({…}). For example:
name [A-Za-z0-9][A-Za-z0-9_-]*
%%
{name}[=]{name} { /* Do something */ }
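Putting the two pieces together, a complete input file might look like this (the printf action, main, and yywrap stub are illustrative additions, not part of the original answer):

```lex
%{
#include <stdio.h>
%}
name    [A-Za-z0-9][A-Za-z0-9_-]*
%%
{name}[=]{name}   { printf("assignment: %s\n", yytext); }
.|\n              { /* ignore everything else */ }
%%
/* yywrap: tell flex there is no further input after EOF */
int yywrap(void) { return 1; }

int main(void) {
    yylex();
    return 0;
}
```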

Related

How to implement parser for a grammar in Java

I have defined a grammar and now I'm implementing a parser for it.
The program should start with the keyword main followed by an opening curly bracket, followed in turn by a (possibly empty) sequence of statements, and terminated by a closing curly bracket. My question is: how do I define the program in the parser? I have tried several different ways, including the one below, but it doesn't seem to be correct when I test it.
public void program() {
    // Program -> MAIN LCBR Statement* RCBR
    eat("MAIN");
    eat("LCBR");
    while (lex.token().type != "RCBR") {
        statement();
    }
}
Any suggestions would be appreciated!
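Two things stand out in the snippet: Java compares String contents with .equals() (== and != compare references), and the closing RCBR token is never consumed after the loop. A sketch of the corrected shape, using a minimal hypothetical token stream (Token, the STMT token type, and the queue-based lexer are stand-ins, not from the original):

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Minimal stand-in for the asker's lexer output; names are hypothetical.
class Token {
    final String type;
    Token(String type) { this.type = type; }
}

class Parser {
    private final Deque<Token> tokens = new ArrayDeque<>();

    Parser(List<String> types) {
        for (String t : types) tokens.add(new Token(t));
    }

    private Token peek() { return tokens.peek(); }

    private void eat(String expected) {
        Token t = tokens.poll();
        if (t == null || !t.type.equals(expected))
            throw new RuntimeException("expected " + expected);
    }

    // Program -> MAIN LCBR Statement* RCBR
    public void program() {
        eat("MAIN");
        eat("LCBR");
        // Compare token types with equals(), not != (which compares references).
        while (peek() != null && !peek().type.equals("RCBR")) {
            statement();
        }
        eat("RCBR"); // consume the closing curly bracket
    }

    private void statement() { eat("STMT"); }

    public static void main(String[] args) {
        new Parser(Arrays.asList("MAIN", "LCBR", "STMT", "STMT", "RCBR")).program();
        System.out.println("parsed");
    }
}
```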

How can you write a customizable grammar?

For a chat bot I'm writing, I want to make its parser customizable, so people don't need to modify the bot itself to add hooks for whatever types of chat messages they want to handle. The parser uses a grammar. At the moment, I handle this with a class that looks something like this:
class Rule {
    has Regex:D $.matcher is required;
    has &.parser is required;

    method new(::?CLASS:_: Regex:D $matcher, &parser) {
        self.bless: :$matcher, :&parser
    }

    method match(::?CLASS:D: Str:D $target --> Replier:_) {
        $target ~~ $!matcher;
        $/.defined ?? &!parser(self, $/) !! Nil
    }
}
An array of these would then be looped through from the parser's actions class. This lets people add their own "rules" for the parser, which solves my problem, but it's clunky, and it's reinventing grammars! What I really want is for people to be able to write something like a slang for my parser. augment could be used for this, but it wouldn't help here: a user may want to change how they augment the parser at runtime, while augment is handled at compile time. How can this be done?
All this takes is 5 or 10 lines of boilerplate, depending on whether or not you use an actions class.
If you take a look at Metamodel::GrammarHOW, as of writing, you'll find this:
class Perl6::Metamodel::GrammarHOW
    is Perl6::Metamodel::ClassHOW
    does Perl6::Metamodel::DefaultParent
{
}
Grammars are an extension of classes! This means it's possible to declare metamethods in them. Building on How can classes be made parametric in Perl 6?, if the user provides roles for the grammar and actions class, they can be mixed in before parsing via parameterization. If you've written a slang before, this might sound familiar; mixing in roles like this is how $*LANG.refine_slang works!
If you want a token in a grammar to be augmentable, you would make it a proto token. All that would be needed afterwards is a parameterize metamethod that mixes in its argument, which would be a role of some kind:
grammar Foo::Grammar {
    token TOP { <foo> }

    proto token foo {*}
    token foo:sym<foo> { <sym> }

    method ^parameterize(Foo::Grammar:U $this is raw, Mu $grammar-role is raw --> Foo::Grammar:U) {
        my Foo::Grammar:U $mixin := $this.^mixin: $grammar-role;
        $mixin.^set_name: $this.^name ~ '[' ~ $grammar-role.^name ~ ']';
        $mixin
    }
}
class Foo::Actions {
    method TOP($/) { make $<foo>.made; }
    method foo:sym<foo>($/) { make ~$<sym>; }

    method ^parameterize(Foo::Actions:U $this is raw, Mu $actions-role is raw --> Foo::Actions:U) {
        my Foo::Actions:U $mixin := $this.^mixin: $actions-role;
        $mixin.^set_name: $this.^name ~ '[' ~ $actions-role.^name ~ ']';
        $mixin
    }
}
Then the roles to mix in can be declared like so:
role Bar::Grammar {
    token foo:sym<bar> { <sym> }
}

role Bar::Actions {
    method foo:sym<bar>($/) { make ~$<sym>; }
}
Now Foo can be augmented with Bar before parsing if desired:
Foo::Grammar.subparse: 'foo', actions => Foo::Actions.new;
say $/ && $/.made; # OUTPUT: foo
Foo::Grammar.subparse: 'bar', actions => Foo::Actions.new;
say $/ && $/.made; # OUTPUT: #<failed match>
Foo::Grammar[Bar::Grammar].subparse: 'foo', actions => Foo::Actions[Bar::Actions].new;
say $/ && $/.made; # OUTPUT: foo
Foo::Grammar[Bar::Grammar].subparse: 'bar', actions => Foo::Actions[Bar::Actions].new;
say $/ && $/.made; # OUTPUT: bar
Edit: the mixin metamethod can accept any number of roles as arguments, and parameterization can work with any signature. This means you can make parameterizations of grammars or actions classes accept any number of roles if you tweak the parameterize metamethod a bit:
method ^parameterize(Mu $this is raw, *@roles --> Mu) {
    my Mu $mixin := $this.^mixin: |@roles;
    $mixin.^set_name: $this.^name ~ '[' ~ @roles.map(*.^name).join(', ') ~ ']';
    $mixin
}
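With a variadic parameterize, several roles can be mixed in at once; for example (a hypothetical second pair of roles, Baz::Grammar and Baz::Actions, declared the same way as Bar's):

```raku
Foo::Grammar[Bar::Grammar, Baz::Grammar].subparse: 'bar',
    actions => Foo::Actions[Bar::Actions, Baz::Actions].new;
say $/ && $/.made;
```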

Flex-lexer: Write state defines to a different file

I want to use the start states of flex inside functions (and external files). Therefore I need the state definitions to be inside an external header file.
Is there any way of letting the definitions be written to an external file?
The code below shows an example of using the states inside functions defined in the .l file:
lexer.l
%{
void changeState() {
    YY_START = MY_STATE;
}
%}
%x MY_STATE
%%
[ rules ]
%%
The following should work:
lexer.l
%x MY_STATE
%%
[ rules ]
%%
void changeState() {
    BEGIN(MY_STATE);
}
Don't forget that the upper section is actually only for declarations. Definitions should go in the last section; that way, they are placed after the #define section.
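Since flex writes the start-condition numbers as #defines into the generated scanner itself, code in other translation units cannot call BEGIN directly; one common workaround (a sketch, with illustrative file and function names) is to export small wrapper functions from the user-code section and declare them in a hand-written header:

```c
/* at the end of lexer.l, after the second %% */
void enter_my_state(void) { BEGIN(MY_STATE); }

/* states.h -- hand-written, included by other .c files */
void enter_my_state(void);
```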

Interpolation in Concrete Syntax Matching

I'm working with a Java 8 grammar and I want to find occurrences of a method invocation, more specifically it.hasNext(), when it is an Iterator.
This works:
visit(unit) {
    case (MethodInvocation)`it . <TypeArguments? ta> hasNext()`: {
        println("found");
    }
}
Ideally I would like to match with any identifier, not just it.
So I tried using String interpolation, which compiles but doesn't match:
str iteratorId = "it";
visit(unit) {
    case (MethodInvocation)`$iteratorId$ . <TypeArguments? ta> hasNext()`: {
        println("found");
    }
}
I also tried several other ways, including pattern variable uses (as seen in the docs) but I can't get this to work.
Is this kind of matching possible in Rascal? If yes, how can it be done?
The answer specifically depends on the grammar you are using, which I did not look up, but in general in concrete syntax fragments this notation is used for placeholders: <NonTerminal variableName>
So your pattern should look something like the following:
str iteratorId = "it";
visit(unit) {
    case (MethodInvocation)`<MethodName name>.<TypeArguments? ta>hasNext()`:
        if (iteratorId == "<name>") println("bingo!");
}
That is assuming that MethodName is indeed a non-terminal in your Java8 grammar and part of the syntax rule for method invocations.

Parse a list of subroutines

I have written parser_sub.mly and lexer_sub.mll, which can parse a subroutine. A subroutine is a block of statements enclosed between Sub and End Sub.
Actually, the raw file I would like to deal with contains a list of subroutines and some useless text. Here is an example:
' a example file
Sub f1()
...
End Sub
haha
' hehe
Sub f2()
...
End Sub
So I need to write parser.mly and lexer.mll which can parse this file by ignoring all the comments and stray text (e.g. haha, ' hehe, etc.), calling parser_sub.main, and returning a list of subroutines.
Could anyone tell me how to let the parser ignore all the useless sentences (sentences outside a Sub and End Sub)?
Here is a part of parser.mly I tried to write:
%{
  open Syntax
%}

%start main
%type <Syntax.ev> main
%%
main:
  subroutine_declaration* { $1 };

subroutine_declaration:
  SUB name = subroutine_name LPAREN RPAREN EOS
  body = procedure_body?
  END SUB
    { { subroutine_name = name;
        procedure_body_EOS_opt = body; } }
The rules and parsing for procedure_body are complex and are already defined in parser_sub.mly and lexer_sub.mll, so how can I keep parser.mly and lexer.mll from repeating that definition and just call parser_sub.main?
Maybe we can set some flag when we are inside subroutine:
sub_starts:
  SUB { inside := true };
sub_ends:
  ENDSUB { inside := false };
subroutine_declaration:
  sub_starts name body sub_ends { ... }
And when this flag is not set you just skip any input?
If the stuff you want to skip can have any form (not necessarily valid tokens of your language), you pretty much have to solve this by hacking your lexer, as Kakadu suggests. This may be the easiest approach in any case.
If the filler (stuff to skip) consists of valid tokens, and you want to skip using a grammar rule, it seems to me the main problem is to define a nonterminal that matches any token other than END. This will be unpleasant to keep up to date, but seems possible.
Finally you have the problem that your end marker is two symbols, END SUB. You have to handle the case where you see END not followed by SUB. This is even trickier because SUB is your beginning marker also. Again, one way to simplify this would be to hack your lexer so that it treats END SUB as a single token. (Usually this is trickier than you'd expect, say if you want to allow comments between END and SUB.)
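In ocamllex, that lexer hack might look like the following sketch (SUB, ENDSUB, and EOF are assumed to be tokens from parser.mly; a real lexer would only skip filler while outside a subroutine, e.g. guarded by Kakadu's flag):

```ocaml
(* lexer.mll -- sketch only *)
rule token = parse
  | [' ' '\t' '\r' '\n']+      { token lexbuf }
  (* treat the two-word end marker as a single token *)
  | "End" [' ' '\t']+ "Sub"    { ENDSUB }
  | "Sub"                      { SUB }
  | eof                        { EOF }
  | _                          { token lexbuf }  (* skip filler outside Sub .. End Sub *)
```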