Specify common grammar actions for rules of different arity - parsing

I'm trying to write a parser for a simple DSL which has a good dozen statements of the form `<statementName> <param1> <param2> ... ;`, where the number of parameters varies. The structure of the statements is very similar (each matches the statement name string followed by a series of tokens given by name), and so is the structure of the made results (each stores the statement name and a hash of the parameters). I'd therefore like to know how I could specify the wanted result structure without having to repeat myself for each statement's action method.
Pseudo-code of an action class that would help me to specify such a result structure:
class FooActions {
    method *_stmt ($/) {
        # result[0] = make string of statement name $/[0];
        # result[1] = make hash of $/[1..] with the keys being the name of the rule
        #   at index (i.e. "var" for `<var=identifier>` and "type" for `<type>`, etc.)
        #   and values being the `.made` results for the rules at index (see below);
        return #result;
    }

    method identifier ($/) { return ~$/ }
    method number ($/) { return +$/ }
    method type ($/) { return ~$/ }
}
Test file:
use v6;
use Test;
use Foo;

my $s;

$s = 'GoTo 2 ;';
is-deeply Foo::FooGrammar.parse($s).made, ('GoTo', {pos => 2});

$s = 'Set foo 3 ;';
is-deeply Foo::FooGrammar.parse($s).made, ('Set', {var => 'foo', target => 3});

$s = 'Get bar Long ;';
is-deeply Foo::FooGrammar.parse($s).made, ('Get', {var => 'bar', type => 'Long'});

$s = 'Set foo bar ;';
is-deeply Foo::FooGrammar.parse($s).made, ('Set', {var => 'foo', target => 'bar'});
Grammar:
use v6;
unit package Foo;

grammar FooGrammar is export {
    rule TOP { <stmt> ';' }

    rule type { 'Long' | 'Int' }
    rule number { \d+ }
    rule identifier { <alpha> \w* }
    rule numberOrIdentifier { <number> || <identifier> }

    rule goto_stmt { 'GoTo' <pos=number> }
    rule set_stmt { 'Set' <var=identifier> <target=numberOrIdentifier> }
    rule get_stmt { 'Get' <var=identifier> <type> }

    rule stmt { <goto_stmt> || <set_stmt> || <get_stmt> }
}

This approach represents the statements as a proto-regex and uses `<sym>` to avoid repeating the statement keywords (GoTo etc.).
Individual statements don't have action methods. They are handled at the next level (TOP), which uses the `caps` method on the match to convert it to a hash.
The `<sym>` capture is used to extract the keyword; the remainder of the line is converted to a hash. Solution follows:
Grammar and Actions:
use v6;
unit package Foo;

grammar Grammar is export {
    rule TOP { <stmt> ';' }

    token type { 'Long' | 'Int' }
    token number { \d+ }
    token identifier { <alpha>\w* }
    rule numberOrIdentifier { <number> || <identifier> }

    proto rule stmt {*}
    rule stmt:sym<GoTo> { <sym> <pos=.number> }
    rule stmt:sym<Set>  { <sym> <var=.identifier> <target=.numberOrIdentifier> }
    rule stmt:sym<Get>  { <sym> <var=.identifier> <type> }
}

class Actions {
    method number($/)     { make +$/ }
    method identifier($/) { make ~$/ }
    method type($/)       { make ~$/ }
    method numberOrIdentifier($/) { make ($<number> // $<identifier>).made }

    method TOP($/) {
        my %caps = $<stmt>.caps;
        my $keyw = .Str given %caps<sym>:delete;
        my %args = %caps.pairs.map: {.key => .value.made};
        make ($keyw, %args);
    }
}
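To illustrate (an assumed, abbreviated trace rather than actual program output): for the input 'Set foo 3 ;', $<stmt>.caps yields pairs along the lines of

    sym    => 「Set」
    var    => 「foo」
    target => 「3」

so deleting sym and mapping .made over the remaining pairs gives ('Set', {var => 'foo', target => 3}).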
Tests:
use v6;
use Test;
use Foo;

my $actions = Foo::Actions.new;
my $s;

$s = 'GoTo 2 ;';
is-deeply Foo::Grammar.parse($s, :$actions).made, ('GoTo', {pos => 2});

$s = 'Set foo 3;';
is-deeply Foo::Grammar.parse($s, :$actions).made, ('Set', {var => 'foo', target => 3});

$s = 'Get bar Long ;';
is-deeply Foo::Grammar.parse($s, :$actions).made, ('Get', {var => 'bar', type => 'Long'});

$s = 'Set foo bar ;';
is-deeply Foo::Grammar.parse($s, :$actions).made, ('Set', {var => 'foo', target => 'bar'});

Related

How can I tell pest.rs to flatten grammar?

Let's say I have a rule like,
key = { ASCII_ALPHA ~ ( ASCII_ALPHA | "_" )+ }
value = { (!NEWLINE ~ ANY)+ }
keyvalue = { key ~ "=" ~ value? }
option = { key }
This supports:
K=V
K=
K
which I want in order to set/unset a key or to specify an option. What I don't like is the syntax for option, which produces an AST like this:
Pair {
    rule: option,
    span: Span {
        str: "check_local_user",
        start: 302,
        end: 318,
    },
    inner: [
        Pair {
            rule: key,
            span: Span {
                str: "check_local_user",
                start: 302,
                end: 318,
            },
            inner: [],
        },
    ],
}
I don't like that my option has inner with key; I just want the option to have the same grammar as a key. Is there any method in Pest.rs to write the grammar such that
inner = { myStuff }
outer = { inner }
gets flattened to
outer = { myStuff }
Using the atomic rule modifier `@`, I could accomplish this:
option = @{ key }
It's documented as:
Any rules called by atomic rules do not generate token pairs.
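If so, the parse of "check_local_user" as an option should now come back as a single flat pair with no inner pairs, roughly (a sketch, not captured output):

Pair {
    rule: option,
    span: Span {
        str: "check_local_user",
        start: 302,
        end: 318,
    },
    inner: [],
}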

How to make my Go parser code more readable

I'm writing a recursive descent parser in Go for a simple made-up language, so I'm designing the grammar as I go. My parser works, but I wanted to ask if there are any best practices for how I should lay out my code, or when I should put code in its own function, etc., to make it more readable.
I've been building the parser by following the simple rules I've learned so far, i.e. each non-terminal is its own function. Even though my code works, I think it looks really messy and unreadable.
I've included the code for the assignment non-terminal and the grammar above the function.
I've taken out most of the error handling to keep the function smaller.
Here are some examples of what that code can parse:
a = 10
a,b,c = 1,2,3
a int = 100
a,b string = "hello", "world"
Can anyone give me some advice as to how I can make my code more readable please?
// assignment : variable_list '=' expr_list
//            | variable_list type
//            | variable_list type '=' expr_list
func (p *Parser) assignment() ast.Noder {
    assignment := &ast.AssignmentNode{}
    assignment.Left = p.variable_list()

    // This if-statement deals with rule 2 or 3
    if p.currentToken.Type != token.ASSIGN {
        // Static variable declaration
        // Could be a declaration or an assignment
        // Only static variables can be declared without providing a value
        assignment.IsStatic = true
        assignment.Type = p.var_type().Value
        assignment.Right = nil
        p.nextToken()

        // Rule 2 is finished at this point in the code
        // This if-statement is for rule 3
        if p.currentToken.Type == token.ASSIGN {
            assignment.Operator = p.currentToken
            p.nextToken()
            assignment.Right = p.expr_list()
        }
    } else {
        // This deals with rule 1
        assignment.Operator = p.currentToken
        p.nextToken()
        assignment.Right = p.expr_list()
    }

    if assignment.Right == nil {
        for i := 0; i < len(assignment.Left); i++ {
            assignment.Right = append(assignment.Right, nil)
        }
    }

    if len(assignment.Left) != len(assignment.Right) {
        p.FoundError(p.syntaxError("variable mismatch, " + strconv.Itoa(len(assignment.Left)) + " on left but " + strconv.Itoa(len(assignment.Right)) + " on right,"))
    }

    return assignment
}
How can I make my code more readable?
For readability, a prerequisite for correct, maintainable code, structure the function so that it mirrors the grammar, one alternative at a time:
// assignment : variable_list '=' expr_list
//            | variable_list type
//            | variable_list type '=' expr_list
func (p *Parser) assignment() ast.Noder {
    assignment := &ast.AssignmentNode{}

    // variable_list
    assignment.Left = p.variable_list()

    // type
    if p.currentToken.Type != token.ASSIGN {
        // Static variable declaration
        // Could be a declaration or an assignment
        // Only static variables can be declared without providing a value
        assignment.IsStatic = true
        assignment.Type = p.var_type().Value
        p.nextToken()
    }

    // '=' expr_list
    if p.currentToken.Type == token.ASSIGN {
        assignment.Operator = p.currentToken
        p.nextToken()
        assignment.Right = p.expr_list()
    }

    // variable_list [expr_list]
    if assignment.Right == nil {
        for i := 0; i < len(assignment.Left); i++ {
            assignment.Right = append(assignment.Right, nil)
        }
    }

    if len(assignment.Left) != len(assignment.Right) {
        p.FoundError(p.syntaxError(fmt.Sprintf(
            "variable mismatch, %d on left but %d on right,",
            len(assignment.Left), len(assignment.Right),
        )))
    }

    return assignment
}
Note: this is likely inefficient and overly complicated:
for i := 0; i < len(assignment.Left); i++ {
    assignment.Right = append(assignment.Right, nil)
}
What is the type of assignment.Right?
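If it is a slice such as []ast.Noder (an assumption; the declaration isn't shown in the question), the padding loop can be replaced with a single allocation, since make already fills the slice with nil interface values:

// Sketch, assuming Right is declared as []ast.Noder.
if assignment.Right == nil {
    assignment.Right = make([]ast.Noder, len(assignment.Left))
}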
As far as how to make your code more readable, there is not always a cut-and-dried answer. I personally find that code is more readable when function names can take the place of comments. A lot of people recommend the book "Clean Code" by Robert C. Martin; he pushes this throughout the book: small functions that have one purpose and are self-documenting (via the function name).
Of course, as I said before, this is a subjective topic. I took a crack at it and came up with the code below, which I personally feel is more readable. It uses function names to document what is going on, so the reader doesn't need to dig into every single statement, just the high-level function names, unless they need all of the details.
// assignment : variable_list '=' expr_list
//            | variable_list type
//            | variable_list type '=' expr_list
func (p *Parser) assignment() ast.Noder {
    assignment := &ast.AssignmentNode{}
    assignment.Left = p.variable_list()

    // This if-statement deals with rule 2 or 3
    if p.currentToken.Type != token.ASSIGN {
        // Static variable declaration
        // Could be a declaration or an assignment
        // Only static variables can be declared without providing a value
        p.parseStaticStatement(assignment)
    } else {
        p.parseVariableAssignment(assignment)
    }

    if assignment.Right == nil {
        assignment.AppendDefaultValues()
    }

    p.checkForUnbalancedAssignment(assignment)

    return assignment
}

func (p *Parser) parseStaticStatement(assignment *ast.AssignmentNode) {
    assignment.IsStatic = true
    assignment.Type = p.var_type().Value
    assignment.Right = nil
    p.nextToken()

    // Rule 2 is finished at this point in the code
    // This if-statement is for rule 3
    if p.currentToken.Type == token.ASSIGN {
        p.parseStaticAssignment(assignment)
    }
}

func (p *Parser) parseStaticAssignment(assignment *ast.AssignmentNode) {
    assignment.Operator = p.currentToken
    p.nextToken()
    assignment.Right = p.expr_list()
}

func (p *Parser) parseVariableAssignment(assignment *ast.AssignmentNode) {
    // This deals with rule 1
    assignment.Operator = p.currentToken
    p.nextToken()
    assignment.Right = p.expr_list()
}

// AppendDefaultValues must live in the ast package, since Go only allows
// methods to be declared in the package that defines the receiver type.
func (a *AssignmentNode) AppendDefaultValues() {
    for i := 0; i < len(a.Left); i++ {
        a.Right = append(a.Right, nil)
    }
}

func (p *Parser) checkForUnbalancedAssignment(assignment *ast.AssignmentNode) {
    if len(assignment.Left) != len(assignment.Right) {
        p.FoundError(p.syntaxError("variable mismatch, " + strconv.Itoa(len(assignment.Left)) + " on left but " + strconv.Itoa(len(assignment.Right)) + " on right,"))
    }
}
I hope that you find this helpful. I am more than willing to answer any further questions that you may have if you leave a comment on my response.

How to set a JvmModelInferrer method body from an XExpression and append boilerplate code

In a JvmModelInferrer, when constructing the body of a method or constructor, how do you insert both an XExpression from the grammar
body = op.body
and additional "boilerplate" code, for example
body = [
    append(
        '''
        System.out.println("BOILERPLATE");
        '''
    )
]
I can achieve either but not both.
For a minimal working example, consider the following canonical Xbase grammar,
grammar org.example.xbase.entities.Entities with org.eclipse.xtext.xbase.Xbase

generate entities "http://www.example.org/xbase/entities/Entities"

Model:
    importSection=XImportSection?
    entities+=Entity*;

Entity:
    'entity' name=ID ('extends' superType=JvmParameterizedTypeReference)? '{'
        attributes += Attribute*
        operations += Operation*
    '}';

Attribute:
    'attr' (type=JvmTypeReference)? name=ID ('=' initexpression=XExpression)? ';';

Operation:
    'op' (type=JvmTypeReference)? name=ID
    '(' (params+=FullJvmFormalParameter (',' params+=FullJvmFormalParameter)*)? ')'
    body=XBlockExpression;
and JvmModelInferrer,
package org.example.xbase.entities.jvmmodel

import com.google.inject.Inject
import org.eclipse.xtext.xbase.jvmmodel.AbstractModelInferrer
import org.eclipse.xtext.xbase.jvmmodel.IJvmDeclaredTypeAcceptor
import org.eclipse.xtext.xbase.jvmmodel.JvmTypesBuilder
import org.example.xbase.entities.entities.Entity

class EntitiesJvmModelInferrer extends AbstractModelInferrer {

    @Inject extension JvmTypesBuilder

    def dispatch void infer(Entity entity, IJvmDeclaredTypeAcceptor acceptor, boolean isPreIndexingPhase) {
        acceptor.accept(entity.toClass("entities." + entity.name)) [
            documentation = entity.documentation
            if (entity.superType != null)
                superTypes += entity.superType.cloneWithProxies
            entity.attributes.forEach [ a |
                val type = a.type ?: a.initexpression?.inferredType
                members += a.toField(a.name, type) [
                    documentation = a.documentation
                    if (a.initexpression != null)
                        initializer = a.initexpression
                ]
                members += a.toGetter(a.name, type)
                members += a.toSetter(a.name, type)
            ]
            entity.operations.forEach [ op |
                members += op.toMethod(op.name, op.type ?: inferredType) [
                    documentation = op.documentation
                    for (p : op.params) {
                        parameters += p.toParameter(p.name, p.parameterType)
                    }
                    // body = [
                    //     append(
                    //         '''
                    //         System.out.println("BOILERPLATE");
                    //         '''
                    //     )
                    // ]
                    body = op.body
                ]
            ]
        ]
    }
}
As the comments suggest, I would like to insert "boilerplate" code into the body of the method, before the XExpression itself. While I can insert the boilerplate, or the expression, I cannot work out how to do both.
This does not work. The only thing you can do is to infer two methods:
methodWithBoilerplate() {
    // pre
    methodWithoutBoilerplate()
    // post
}
methodWithoutBoilerplate() {
    // user code goes here
}
For the first, use body = '''code here'''; for the second, use body = op.body.
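A rough sketch of that workaround in the inferrer above (the Impl suffix is invented for illustration, and this assumes op declares an explicit return type rather than relying on inference):

entity.operations.forEach [ op |
    // wrapper: boilerplate plus a call to the delegate
    members += op.toMethod(op.name, op.type) [
        for (p : op.params)
            parameters += p.toParameter(p.name, p.parameterType)
        body = '''
            System.out.println("BOILERPLATE");
            «IF op.type.simpleName != 'void'»return «ENDIF»«op.name»Impl(«FOR p : op.params SEPARATOR ', '»«p.name»«ENDFOR»);
        '''
    ]
    // delegate: holds the user's XExpression untouched
    members += op.toMethod(op.name + 'Impl', op.type) [
        for (p : op.params)
            parameters += p.toParameter(p.name, p.parameterType)
        body = op.body
    ]
]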

Parsing alphanumeric identifiers with underscores and calling from top level parser

Trying to do something similar to this question, except allowing underscores from the second character onwards, not just camel case.
I can test the parser in isolation successfully, but when it is composed into a higher-level parser, I get errors.
Take the following example:
#![allow(dead_code)]
#[macro_use]
extern crate nom;

use nom::*;

type Bytes<'a> = &'a [u8];

#[derive(Clone, PartialEq, Debug)]
pub enum Statement {
    IF,
    ELSE,
    ASSIGN((String)),
    IDENTIFIER(String),
    EXPRESSION,
}

fn lowerchar(input: Bytes) -> IResult<Bytes, char> {
    if input.is_empty() {
        IResult::Incomplete(Needed::Size(1))
    } else if (input[0] as char) >= 'a' && 'z' >= (input[0] as char) {
        IResult::Done(&input[1..], input[0] as char)
    } else {
        IResult::Error(error_code!(ErrorKind::Custom(1)))
    }
}

named!(identifier<Bytes, Statement>, map_res!(
    recognize!(do_parse!(
        lowerchar >>
        // alt_complete! is not having the effect it's supposed to,
        // so the alternatives need to be ordered from shortest to longest
        many0!(alt!(
            complete!(is_a!("_"))
            | complete!(take_while!(nom::is_alphanumeric))
        )) >> ()
    )),
    |id: Bytes| {
        //println!("{:?}", std::str::from_utf8(id).unwrap().to_string());
        Result::Ok::<Statement, &str>(
            Statement::IDENTIFIER(std::str::from_utf8(id).unwrap().to_string())
        )
    }
));

named!(expression<Bytes, Statement>, alt_complete!(
    identifier //=> { |e: Statement| e }
    //| assign_expr //=> { |e: Statement| e }
    | if_expr //=> { |e: Statement| e }
));

named!(if_expr<Bytes, Statement>, do_parse!(
    if_statement: preceded!(
        tag!("if"),
        delimited!(tag!("("), expression, tag!(")"))
    ) >>
    //if_statement: delimited!(tag!("("), tag!("hello"), tag!(")")) >>
    if_expr: expression >>
    //else_statement: opt_res!(tag!("else")) >>
    (Statement::IF)
));

#[cfg(test)]
mod tests {
    use super::*;
    use IResult::*;
    //use IResult::Done;

    #[test]
    fn ident_token() {
        assert_eq!(identifier(b"iden___ifiers"), Done::<Bytes, Statement>(b"", Statement::IDENTIFIER("iden___ifiers".to_string())));
        assert_eq!(identifier(b"iden_iFIErs"), Done::<Bytes, Statement>(b"", Statement::IDENTIFIER("iden_iFIErs".to_string())));
        assert_eq!(identifier(b"Iden_iFIErs"), Error(ErrorKind::Custom(1))); // Supposed to fail since not valid
        assert_eq!(identifier(b"_den_iFIErs"), Error(ErrorKind::Custom(1))); // Supposed to fail since not valid
    }

    #[test]
    fn if_token() {
        assert_eq!(if_expr(b"if(a)a"), Error(ErrorKind::Alt)); // Should have passed
        assert_eq!(if_expr(b"if(hello)asdas"), Error(ErrorKind::Alt)); // Should have passed
    }

    #[test]
    fn expr_parser() {
        assert_eq!(expression(b"iden___ifiers"), Done::<Bytes, Statement>(b"", Statement::IDENTIFIER("iden___ifiers".to_string())));
        assert_eq!(expression(b"if(hello)asdas"), Error(ErrorKind::Alt)); // Should have been able to recognise an IF statement via expression parser
    }
}

PEG for Python style indentation

How would you write a Parsing Expression Grammar in any of the following parser generators (PEG.js, Citrus, Treetop) that can handle Python/Haskell/CoffeeScript style indentation:
Examples of a not-yet-existing programming language:
square x =
    x * x

cube x =
    x * square x

fib n =
    if n <= 1
        0
    else
        fib(n - 2) + fib(n - 1) # some cheating allowed here with brackets
Update:
Don't try to write an interpreter for the examples above. I'm only interested in the indentation problem. Another example might be parsing the following:
foo
    bar = 1
    baz = 2
tap
    zap = 3

# should yield (ruby style hashmap):
# {:foo => { :bar => 1, :baz => 2}, :tap => { :zap => 3 } }
Pure PEG cannot parse indentation.
But peg.js can.
I did a quick-and-dirty experiment (being inspired by Ira Baxter's comment about cheating) and wrote a simple tokenizer.
For a more complete solution (a complete parser) please see this question: Parse indentation level with PEG.js
/* Initializations */
{
  function start(first, tail) {
    var done = [first[1]];
    for (var i = 0; i < tail.length; i++) {
      done = done.concat(tail[i][1][0]);
      done.push(tail[i][1][1]);
    }
    return done;
  }

  var depths = [0];

  function indent(s) {
    var depth = s.length;

    if (depth == depths[0]) return [];

    if (depth > depths[0]) {
      depths.unshift(depth);
      return ["INDENT"];
    }

    var dents = [];
    while (depth < depths[0]) {
      depths.shift();
      dents.push("DEDENT");
    }

    if (depth != depths[0]) dents.push("BADDENT");

    return dents;
  }
}

/* The real grammar */
start   = first:line tail:(newline line)* newline? { return start(first, tail) }
line    = depth:indent s:text { return [depth, s] }
indent  = s:" "* { return indent(s) }
text    = c:[^\n]* { return c.join("") }
newline = "\n" {}
depths is a stack of indentations. indent() gives back an array of indentation tokens and start() unwraps the array to make the parser behave somewhat like a stream.
peg.js produces for the text:
alpha
    beta
    gamma
        delta
epsilon
    zeta
  eta
theta
    iota
these results:
[
  "alpha",
  "INDENT",
  "beta",
  "gamma",
  "INDENT",
  "delta",
  "DEDENT",
  "DEDENT",
  "epsilon",
  "INDENT",
  "zeta",
  "DEDENT",
  "BADDENT",
  "eta",
  "theta",
  "INDENT",
  "iota",
  "DEDENT",
  "",
  ""
]
This tokenizer even catches bad indents.
I think an indentation-sensitive language like that is context-sensitive. I believe PEG can only do context-free languages.
Note that, while nalply's answer is certainly correct that PEG.js can do it via external state (i.e. the dreaded global variables), it can be a dangerous path to walk down (worse than the usual problems with global variables). Some rules can initially match (and then run their actions), but parent rules can fail, causing the action run to be invalid. If external state is changed in such an action, you can end up with invalid state. This is super awful, and could lead to tremors, vomiting, and death. Some issues and solutions to this are in the comments here: https://github.com/dmajda/pegjs/issues/45
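A minimal sketch of the hazard (a hypothetical PEG.js grammar; the action's side effect survives backtracking):

{ var count = 0; }
start = item "x" / item "y"
item  = "a" { count++; return count; }

Parsing "ay" leaves count at 2, not 1: the first alternative matches item and runs its action before failing on "x", and the backtracked second alternative matches item and runs it again.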
So what we are really doing here with indentation is creating something like C-style blocks, which often have their own lexical scope. If I were writing a compiler for a language like that, I think I would try to have the lexer keep track of the indentation. Every time the indentation increases, it could insert a '{' token. Likewise, every time it decreases, it could insert a '}' token. Then writing an expression grammar with explicit curly braces to represent lexical scope becomes more straightforward.
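For illustration, such a pre-pass might look like this (a plain JavaScript sketch, independent of any particular parser generator):

// Sketch: turn indentation changes into explicit '{' and '}' tokens.
function braceize(source) {
  var depths = [0], out = [];
  source.split("\n").forEach(function (line) {
    if (!line.trim()) return;                   // ignore blank lines
    var depth = line.match(/^ */)[0].length;
    if (depth > depths[depths.length - 1]) {    // deeper: open a block
      depths.push(depth);
      out.push("{");
    }
    while (depth < depths[depths.length - 1]) { // shallower: close blocks
      depths.pop();
      out.push("}");
    }
    out.push(line.trim());
  });
  while (depths.length > 1) {                   // close any blocks still open at EOF
    depths.pop();
    out.push("}");
  }
  return out.join("\n");
}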
You can do this in Treetop by using semantic predicates. In this case you need a semantic predicate that detects closing a white-space indented block due to the occurrence of another line that has the same or lesser indentation. The predicate must count the indentation from the opening line, and return true (block closed) if the current line's indentation has finished at the same or shorter length. Because the closing condition is context-dependent, it must not be memoized.
Here's the example code I'm about to add to Treetop's documentation. Note that I've overridden Treetop's SyntaxNode inspect method to make it easier to visualise the result.
grammar IndentedBlocks
  rule top
    # Initialise the indent stack with a sentinel:
    &{|s| @indents = [-1] }
    nested_blocks
    {
      def inspect
        nested_blocks.inspect
      end
    }
  end

  rule nested_blocks
    (
      # Do not try to extract this semantic predicate into a new rule.
      # It will be memo-ized incorrectly because @indents.last will change.
      !{|s|
        # Peek at the following indentation:
        save = index; i = _nt_indentation; index = save
        # We're closing if the indentation is less or the same as our enclosing block's:
        closing = i.text_value.length <= @indents.last
      }
      block
    )*
    {
      def inspect
        elements.map{|e| e.block.inspect}*"\n"
      end
    }
  end

  rule block
    indented_line  # The block's opening line
    &{|s|          # Push the indent level to the stack
      level = s[0].indentation.text_value.length
      @indents << level
      true
    }
    nested_blocks  # Parse any nested blocks
    &{|s|          # Pop the indent stack
      # Note that under no circumstances should "nested_blocks" fail, or the stack will be mis-aligned
      @indents.pop
      true
    }
    {
      def inspect
        indented_line.inspect +
          (nested_blocks.elements.size > 0 ? (
            "\n{\n" +
            nested_blocks.elements.map { |content|
              content.block.inspect+"\n"
            }*'' +
            "}"
          )
          : "")
      end
    }
  end

  rule indented_line
    indentation text:((!"\n" .)*) "\n"
    {
      def inspect
        text.text_value
      end
    }
  end

  rule indentation
    ' '*
  end
end
Here's a little test driver program so you can try it easily:
require 'polyglot'
require 'treetop'
require 'indented_blocks'
parser = IndentedBlocksParser.new
input = <<END
def foo
here is some indented text
here it's further indented
and here the same
but here it's further again
and some more like that
before going back to here
down again
back twice
and start from the beginning again
with only a small block this time
END
parse_tree = parser.parse input
p parse_tree
I know this is an old thread, but I just wanted to add some PEG.js code to the answers. This code will parse a piece of text and "nest" it into a sort of "AST-ish" structure. It only goes one level deep, and it looks ugly; furthermore, it does not really use the return values to create the right structure, but keeps an in-memory tree of your syntax and returns that at the end. This might well become unwieldy and cause some performance issues, but at least it does what it's supposed to.
Note: Make sure you have tabs instead of spaces!
{
  var indentStack = [],
      rootScope = {
        value: "PROGRAM",
        values: [],
        scopes: []
      };

  function addToRootScope(text) {
    // Here we wiggle with the form and append the new
    // scope to the rootScope.
    if (!text) return;

    if (indentStack.length === 0) {
      rootScope.scopes.unshift({
        text: text,
        statements: []
      });
    }
    else {
      rootScope.scopes[0].statements.push(text);
    }
  }
}

/* Add some grammar */

start
  = lines: (line EOL+)*
    {
      return rootScope;
    }

line
  = line: (samedent t:text { addToRootScope(t); }) &EOL
  / line: (indent t:text { addToRootScope(t); }) &EOL
  / line: (dedent t:text { addToRootScope(t); }) &EOL
  / line: [ \t]* &EOL
  / EOF

samedent
  = i:[\t]* &{ return i.length === indentStack.length; }
    {
      console.log("s:", i.length, " level:", indentStack.length);
    }

indent
  = i:[\t]+ &{ return i.length > indentStack.length; }
    {
      indentStack.push("");
      console.log("i:", i.length, " level:", indentStack.length);
    }

dedent
  = i:[\t]* &{ return i.length < indentStack.length; }
    {
      for (var j = 0; j < i.length + 1; j++) {
        indentStack.pop();
      }
      console.log("d:", i.length + 1, " level:", indentStack.length);
    }

text
  = numbers: number+ { return numbers.join(""); }
  / txt: character+ { return txt.join(""); }

number
  = $[0-9]

character
  = $[ a-zA-Z->+]

__
  = [ ]+

_
  = [ ]*

EOF
  = !.

EOL
  = "\r\n"
  / "\n"
  / "\r"
