How to make my go parser code more readable - parsing

I'm writing a recursive descent parser in Go for a simple made-up language, so I'm designing the grammar as I go. My parser works but I wanted to ask if there are any best practices for how I should lay out my code or when I should put code in its own function etc ... to make it more readable.
I've been building the parser by following the simple rules I've learned so far ie. each non-terminal is it's own function, even though my code works I think looks really messy and unreadable.
I've included the code for the assignment non-terminal and the grammar above the function.
I've taken out most of the error handling to keep the function smaller.
Here's some examples of what that code can parse:
a = 10
a,b,c = 1,2,3
a int = 100
a,b string = "hello", "world"
Can anyone give me some advice as to how I can make my code more readable please?
// assignment : variable_list '=' expr_list
// | variable_list type
// | variable_list type '=' expr_list
func (p *Parser) assignment() ast.Noder {
assignment := &ast.AssignmentNode{}
assignment.Left = p.variable_list()
// This if-statement deals with rule 2 or 3
if p.currentToken.Type != token.ASSIGN {
// Static variable declaration
// Could be a declaration or an assignment
// Only static variables can be declared without providing a value
assignment.IsStatic = true
assignment.Type = p.var_type().Value
assignment.Right = nil
p.nextToken()
// Rule 2 is finished at this point in the code
// This if-statement is for rule 3
if p.currentToken.Type == token.ASSIGN {
assignment.Operator = p.currentToken
p.nextToken()
assignment.Right = p.expr_list()
}
} else {
// This deals with rule 1
assignment.Operator = p.currentToken
p.nextToken()
assignment.Right = p.expr_list()
}
if assignment.Right == nil {
for i := 0; i < len(assignment.Left); i++ {
assignment.Right = append(assignment.Right, nil)
}
}
if len(assignment.Left) != len(assignment.Right) {
p.FoundError(p.syntaxError("variable mismatch, " + strconv.Itoa(len(assignment.Left)) + " on left but " + strconv.Itoa(len(assignment.Right)) + " on right,"))
}
return assignment
}

how I can make my code more readable?
For readability, a prerequisite for correct, maintainable code,
// assignment : variable_list '=' expr_list
// | variable_list type
// | variable_list type '=' expr_list
func (p *Parser) assignment() ast.Noder {
assignment := &ast.AssignmentNode{}
// variable_list
assignment.Left = p.variable_list()
// type
if p.currentToken.Type != token.ASSIGN {
// Static variable declaration
// Could be a declaration or an assignment
// Only static variables can be declared without providing a value
assignment.IsStatic = true
assignment.Type = p.var_type().Value
p.nextToken()
}
// '=' expr_list
if p.currentToken.Type == token.ASSIGN {
assignment.Operator = p.currentToken
p.nextToken()
assignment.Right = p.expr_list()
}
// variable_list [expr_list]
if assignment.Right == nil {
for i := 0; i < len(assignment.Left); i++ {
assignment.Right = append(assignment.Right, nil)
}
}
if len(assignment.Left) != len(assignment.Right) {
p.FoundError(p.syntaxError(fmt.Sprintf(
"variable mismatch, %d on left but %d on right,",
len(assignment.Left), len(assignment.Right),
)))
}
return assignment
}
Note: This likely inefficient and overly complicated:
for i := 0; i < len(assignment.Left); i++ {
assignment.Right = append(assignment.Right, nil)
}
What is the type of assignment.Right?

As far as how to make your code more readable, there is not always a cut and dry answer. I personally find that code is more readable when you can use function names in place of comments in the code. A lot of people like to recommend the book "Clean Code" by Robert C. Martin. He pushes this throughout the book, small functions that have one purpose and are self documenting (via the function name).
Of course, as I said before this is a subjective topic. I took a crack at it, and came up with the code below, which I personally feel is more readable. It also uses the function names to document what is going on. That way the reader doesn't necessarily need to dig into every single statement in the code, but rather just the high level function names if they don't need all of the details.
// assignment : variable_list '=' expr_list
// | variable_list type
// | variable_list type '=' expr_list
func (p *Parser) assignment() ast.Noder {
assignment := &ast.AssignmentNode{}
assignment.Left = p.variable_list()
// This if-statement deals with rule 2 or 3
if p.currentToken.Type != token.ASSIGN {
// Static variable declaration
// Could be a declaration or an assignment
// Only static variables can be declared without providing a value
p.parseStaticStatement(assignment)
} else {
p.parseVariableAssignment(assignment)
}
if assignment.Right == nil {
assignment.appendDefaultValues()
}
p.checkForUnbalancedAssignment(assignment)
return assignment
}
func (p *Parser) parseStaticStatement(assignment *ast.AssingmentNode) {
assignment.IsStatic = true
assignment.Type = p.var_type().Value
assignment.Right = nil
p.nextToken()
// Rule 2 is finished at this point in the code
// This if-statement is for rule 3
if p.currentToken.Type == token.ASSIGN {
a.parseStaticAssignment()
}
}
func (p *Parser) parseStaticAssignment(assignment *ast.AssignmentNode) {
assignment.Operator = p.currentToken
p.nextToken()
assignment.Right = p.expr_list()
}
func (p *Parser) parseVariableAssignment(assignment *ast.AssignmentNode) {
// This deals with rule 1
assignment.Operator = p.currentToken
p.nextToken()
assignment.Right = p.expr_list()
}
func (a *ast.AssignmentNode) appendDefaultValues() {
for i := 0; i < len(assignment.Left); i++ {
assignment.Right = append(assignment.Right, nil)
}
}
func (p *Parser) checkForUnbalancedAssignment(assignment *ast.AssignmentNode) {
if len(assignment.Left) != len(assignment.Right) {
p.FoundError(p.syntaxError("variable mismatch, " + strconv.Itoa(len(assignment.Left)) + " on left but " + strconv.Itoa(len(assignment.Right)) + " on right,"))
}
}
I hope that you find this helpful. I am more than willing to answer any further questions that you may have if you leave a comment on my response.

Related

Match brackets the kotlin way

I'm giving Kotlin a go; coding contently, I have an ArrayList of chars which i want to classify depending on how brackets are matched:
(abcde) // ok characters other than brackets can go anywhere
)abcde( // ok matching the brackets 'invertedly' are ok
(({()})) // ok
)()()([] // ok
([)] // bad can't have different overlapping bracket pairs
((((( // bad all brackets need to have a match
My solution comes out(recursive):
//charList is a property
//Recursion starter'upper
private fun classifyListOfCharacters() : Boolean{
var j = 0
while (j < charList.size ) {
if (charList[j].isBracket()){
j = checkMatchingBrackets(j+1, charList[j])
}
j++
}
return j == commandList.size
}
private fun checkMatchingBrackets(i: Int, firstBracket :Char) : Int{
var j = i
while (j < charList.size ) {
if (charList[j].isBracket()){
if (charList[j].matchesBracket(firstBracket)){
return j //Matched bracket normal/inverted
}
j = checkMatchingBrackets(j+1, charList[j])
}
j++
}
return j
}
This works, but is this how you do it in Kotlin? It feels like I've coded java in Kotlin syntax
Found this Functional languages better at recursion, I've tried thinking in terms of manipulating functions and sending them down the recursion but to no avail. I'd be glad to be pointed in the right direction, code, or some pseudo-code of a possible refactoring.
(Omitted some extension methods regarding brackets, I think it's clear what they do)
Another, possibly a simpler approach to this problem is maintaining a stack of brackets while you iterate over the characters.
When you encounter another bracket:
If it matches the top of the stack, you pop the top of the stack;
If it does not match the top of the stack (or the stack is empty), you push it onto the stack.
If any brackets remain on the stack at the end, it means they are unmatched, and the answer is false. If the stack ends up empty, the answer is true.
This is correct, because a bracket at position i in a sequence can match another one at position j, only if there's no unmatched bracket of a different kind between them (at position k, i < k < j). The stack algorithm simulates exactly this logic of matching.
Basically, this algorithm could be implemented in a single for-loop:
val stack = Stack<Char>()
for (c in charList) {
if (!c.isBracket())
continue
if (stack.isNotEmpty() && c.matchesBracket(stack.peek())) {
stack.pop()
} else {
stack.push(c)
}
}
return stack.isEmpty()
I've reused your extensions c.isBracket(...) and c.matchesBracket(...). The Stack<T> is a JDK class.
This algorithm hides the recursion and the brackets nesting inside the abstraction of the brackets stack. Compare: your current approach implicitly uses the function call stack instead of the brackets stack, but the purpose is the same: it either finds a match for the top character or makes a deeper recursive call with another character on top.
Hotkey's answer (using a for loop) is great. However, you asked for an optimized recursion solution. Here is an optimized tail recursive function (Note the tailrec modifier before the function):
tailrec fun isBalanced(input: List<Char>, stack: Stack<Char>): Boolean = when {
input.isEmpty() -> stack.isEmpty()
else -> {
val c = input.first()
if (c.isBracket()) {
if (stack.isNotEmpty() && c.matchesBracket(stack.peek())) {
stack.pop()
} else {
stack.push(c)
}
}
isBalanced(input.subList(1, input.size), stack)
}
}
fun main(args: Array<String>) {
println("check: ${isBalanced("(abcde)".toList(), Stack())}")
}
This function calls itself until the input becomes empty and returns true if the stack is empty when the input becomes empty.
If we look at the decompiled Java equivalent of the generated bytecode, this recursion has been optimized to an efficient while loop by the compiler so we won't get StackOverflowException (removed Intrinsics null checks):
public static final boolean isBalanced(#NotNull String input, #NotNull Stack stack) {
while(true) {
CharSequence c = (CharSequence)input;
if(c.length() == 0) {
return stack.isEmpty();
}
char c1 = StringsKt.first((CharSequence)input);
if(isBracket(c1)) {
Collection var3 = (Collection)stack;
if(!var3.isEmpty() && matchesBracket(c1, ((Character)stack.peek()).charValue())) {
stack.pop();
} else {
stack.push(Character.valueOf(c1));
}
}
input = StringsKt.drop(input, 1);
}
}

Retrieve LHS/RHS value of operator

I'm looking to do something similar to this how to get integer variable name and its value from Expr* in clang using the RecursiveASTVisitor
The goal is to first retrieve all assignment operations then perform my own checks on them, to do taint analysis.
I've overridden the VisitBinaryOperator as such
bool VisitBinaryOperator (BinaryOperator *bOp) {
if ( !bOP->isAssignmentOp() ) {
return true;
}
Expr *LHSexpr = bOp->getLHS();
Expr *RHSexpr = bOp->getRHS();
LHSexpr->dump();
RHSexpr->dump();
}
This RecursiveASTVisitor is being run on Objective C codes, so I do not know what the LHS or RHS type will evaluate to (could even be a function on the RHS?)
Would it be possible to get the text representation of what is on the LHS/RHS out from clang in order to perform regex expression on them??
Sorry, I found something similar that works for this particular case.
Solution:
bool VisitBinaryOperator (BinaryOperator *bOp) {
if ( !bOP->isAssignmentOp() ) {
return true;
}
Expr *LHSexpr = bOp->getLHS();
Expr *RHSexpr = bOp->getRHS();
std::string LHS_string = convertExpressionToString(LHSexpr);
std::string RHS_string = convertExpressionToString(RHSexpr);
return true;
}
std::string convertExpressionToString(Expr *E) {
SourceManager &SM = Context->getSourceManager();
clang::LangOptions lopt;
SourceLocation startLoc = E->getLocStart();
SourceLocation _endLoc = E->getLocEnd();
SourceLocation endLoc = clang::Lexer::getLocForEndOfToken(_endLoc, 0, SM, lopt);
return std::string(SM.getCharacterData(startLoc), SM.getCharacterData(endLoc) - SM.getCharacterData(startLoc));
}
Only thing I'm not very sure about is why _endLoc is required to compute endLoc and how is the Lexer actually working.
EDIT:
Link to the post I found help Getting the source behind clang's AST

How to match the start and end of a block

I want to define a special code block, which may starts by any combination of characters of {[<#, and the end will be }]>#.
Some example:
{
block-content
}
{##
block-content
##}
#[[<{###
block-content
###}>]]#
Is it possible with petitparser-dart?
Yes, back-references are possible, but it is not that straight-forward.
First we need a function that can reverse our delimiter. I came up with the following:
String reverseDelimiter(String token) {
return token.split('').reversed.map((char) {
if (char == '[') return ']';
if (char == '{') return '}';
if (char == '<') return '>';
return char;
}).join();
}
Then we have to declare the stopDelimiter parser. It is undefined at this point, but will be replaced with the real parser as soon as we know it.
var stopDelimiter = undefined();
In the action of the startDelimiter we replace the stopDelimiter with a dynamically created parser as follows:
var startDelimiter = pattern('#<{[').plus().flatten().map((String start) {
stopDelimiter.set(string(reverseDelimiter(start)).flatten());
return start;
});
The rest is trivial, but depends on your exact requirements:
var blockContents = any().starLazy(stopDelimiter).flatten();
var parser = startDelimiter & blockContents & stopDelimiter;
The code above defines the blockContents parser so that it reads through anything until the matching stopDelimiter is encountered. The provided examples pass:
print(parser.parse('{ block-content }'));
// Success[1:18]: [{, block-content , }]
print(parser.parse('{## block-content ##}'));
// Success[1:22]: [{##, block-content , ##}]
print(parser.parse('#[[<{### block-content ###}>]]#'));
// Success[1:32]: [#[[<{###, block-content , ###}>]]#]
The above code doesn't work if you want to nest the parser. If necessary, that problem can be avoided by remembering the previous stopDelimiter and restoring it.

ANTLR Tree Grammar for loops

I'm trying to implement a parser by directly reading a treeWalker and implementing the commands needed for the compiler on the fly. So if I have a command like:
statement
:
^('WRITE' expression)
{
//Here is the command that is created by my Tree Parser
ch.emitRO("OUT",0,0,0,"write out the value of ac");
//and then I handle it in my other classes
}
;
I want it to write OUT 0,0,0; to a file. That's my grammar.
I have a problem though with the loop section in my grammar it is:
'WHILE'^ expression 'DO' stat_seq 'ENDDO'
and in the tree parser:
doWhileStatement
:
^('WHILE' expression 'DO' stat_seq 'ENDDO')
;
What I want to do is directly parse the code from the while loop into the commands I need. I came up with this solution but it doesn't work:
doWhileStatement
:
^('WHILE' e=expression head='DO'
{
int loopHead =((CommonTree) head).getTokenStartIndex();
}
stat_seq
{
if ($e.result==1) {
input.seek(loopHead);
doWhileStatement();
}
}
'ENDDO')
;
for the record here are some of the other commands I've written:
(ignore the code written in brackets, it's for the generation of the commands in a text file.)
stat_seq
:
(statement)+
;
statement
:
^(':=' ID e=expression) { variables.put($ID.text,e); }
| ^('WRITE' expression)
{
ch.emitRM("LDC",ac,$expression.result,0,"pass the expression value to the ac reg");
ch.emitRO("OUT",ac,0,0,"write out the value of ac");
}
| ^('READ' ID)
{
ch.emitRO("IN",ac,0,0,"read value");
}
| ^('IF' expression 'THEN'
{
ch.emitRM("LDC",ac1,$expression.result,0,"pass the expression result to the ac reg");
int savedLoc1 = ch.emitSkip(1);
}
sseq1=stat_seq
'ELSE'
{
int savedLoc2 = ch.emitSkip(1);
ch.emitBackup(savedLoc1);
ch.emitRM("JEQ",ac1,savedLoc2+1,0,"skip as many places as needed depending on the expression");
ch.emitRestore();
}
sseq2=stat_seq
{
int savedLoc3 = ch.emitSkip(0);
ch.emitBackup(savedLoc2);
ch.emitRM("LDC",PC_REG,savedLoc3,0,"skip for the else command");
ch.emitRestore();
}
'ENDIF')
| doWhileStatement
;
Any help would be appreciated, thank you
I found it for everyone who has the same problem I did it like this and it's working:
^('WHILE'
{int c = input.index();}
expression
{int s=input.index();}
.* )// .* is a sequence of statements
{
int next = input.index(); // index of node following WHILE
input.seek(c);
match(input, Token.DOWN, null);
pushFollow(FOLLOW_expression_in_statement339);
int condition = expression();
state._fsp--;
//there is a problem here
//expression() seemed to be reading from the grammar file and I couldn't
//get it to read from the tree walker rule somehow
//It printed something like no viable alt at input 'DOWN'
//I googled it and found this mistake
// So I copied the code from the normal while statement
// And pasted it here and it works like a charm
// Normally there should only be int condition = expression()
while ( condition == 1 ) {
input.seek(s);
stat_seq();//stat_seq is a sequence of statements: (statement ';')+
input.seek(c);
match(input, Token.DOWN, null); //Copied value from EvaluatorWalker.java
//cause couldn't find another way to do it
pushFollow(FOLLOW_expression_in_statement339);
condition = expression();
state._fsp--;
System.out.println("condition:"+condition + " i:"+ variables.get("i"));
}
input.seek(next);
}
I wrote the problem at the comments of my code. If anyone can help me out and answer this for me how to do it I would be grateful. It's so weird that there is nearly no feedback on a correct way to implement loops within a tree grammar on the fly.
Regards,
Alex

PEG for Python style indentation

How would you write a Parsing Expression Grammar in any of the following Parser Generators (PEG.js, Citrus, Treetop) which can handle Python/Haskell/CoffeScript style indentation:
Examples of a not-yet-existing programming language:
square x =
x * x
cube x =
x * square x
fib n =
if n <= 1
0
else
fib(n - 2) + fib(n - 1) # some cheating allowed here with brackets
Update:
Don't try to write an interpreter for the examples above. I'm only interested in the indentation problem. Another example might be parsing the following:
foo
bar = 1
baz = 2
tap
zap = 3
# should yield (ruby style hashmap):
# {:foo => { :bar => 1, :baz => 2}, :tap => { :zap => 3 } }
Pure PEG cannot parse indentation.
But peg.js can.
I did a quick-and-dirty experiment (being inspired by Ira Baxter's comment about cheating) and wrote a simple tokenizer.
For a more complete solution (a complete parser) please see this question: Parse indentation level with PEG.js
/* Initializations */
{
function start(first, tail) {
var done = [first[1]];
for (var i = 0; i < tail.length; i++) {
done = done.concat(tail[i][1][0])
done.push(tail[i][1][1]);
}
return done;
}
var depths = [0];
function indent(s) {
var depth = s.length;
if (depth == depths[0]) return [];
if (depth > depths[0]) {
depths.unshift(depth);
return ["INDENT"];
}
var dents = [];
while (depth < depths[0]) {
depths.shift();
dents.push("DEDENT");
}
if (depth != depths[0]) dents.push("BADDENT");
return dents;
}
}
/* The real grammar */
start = first:line tail:(newline line)* newline? { return start(first, tail) }
line = depth:indent s:text { return [depth, s] }
indent = s:" "* { return indent(s) }
text = c:[^\n]* { return c.join("") }
newline = "\n" {}
depths is a stack of indentations. indent() gives back an array of indentation tokens and start() unwraps the array to make the parser behave somewhat like a stream.
peg.js produces for the text:
alpha
beta
gamma
delta
epsilon
zeta
eta
theta
iota
these results:
[
"alpha",
"INDENT",
"beta",
"gamma",
"INDENT",
"delta",
"DEDENT",
"DEDENT",
"epsilon",
"INDENT",
"zeta",
"DEDENT",
"BADDENT",
"eta",
"theta",
"INDENT",
"iota",
"DEDENT",
"",
""
]
This tokenizer even catches bad indents.
I think an indentation-sensitive language like that is context-sensitive. I believe PEG can only do context-free langauges.
Note that, while nalply's answer is certainly correct that PEG.js can do it via external state (ie the dreaded global variables), it can be a dangerous path to walk down (worse than the usual problems with global variables). Some rules can initially match (and then run their actions) but parent rules can fail thus causing the action run to be invalid. If external state is changed in such an action, you can end up with invalid state. This is super awful, and could lead to tremors, vomiting, and death. Some issues and solutions to this are in the comments here: https://github.com/dmajda/pegjs/issues/45
So what we are really doing here with indentation is creating something like a C-style blocks which often have their own lexical scope. If I were writing a compiler for a language like that I think I would try and have the lexer keep track of the indentation. Every time the indentation increases it could insert a '{' token. Likewise every time it decreases it could inset an '}' token. Then writing an expression grammar with explicit curly braces to represent lexical scope becomes more straight forward.
You can do this in Treetop by using semantic predicates. In this case you need a semantic predicate that detects closing a white-space indented block due to the occurrence of another line that has the same or lesser indentation. The predicate must count the indentation from the opening line, and return true (block closed) if the current line's indentation has finished at the same or shorter length. Because the closing condition is context-dependent, it must not be memoized.
Here's the example code I'm about to add to Treetop's documentation. Note that I've overridden Treetop's SyntaxNode inspect method to make it easier to visualise the result.
grammar IndentedBlocks
rule top
# Initialise the indent stack with a sentinel:
&{|s| #indents = [-1] }
nested_blocks
{
def inspect
nested_blocks.inspect
end
}
end
rule nested_blocks
(
# Do not try to extract this semantic predicate into a new rule.
# It will be memo-ized incorrectly because #indents.last will change.
!{|s|
# Peek at the following indentation:
save = index; i = _nt_indentation; index = save
# We're closing if the indentation is less or the same as our enclosing block's:
closing = i.text_value.length <= #indents.last
}
block
)*
{
def inspect
elements.map{|e| e.block.inspect}*"\n"
end
}
end
rule block
indented_line # The block's opening line
&{|s| # Push the indent level to the stack
level = s[0].indentation.text_value.length
#indents << level
true
}
nested_blocks # Parse any nested blocks
&{|s| # Pop the indent stack
# Note that under no circumstances should "nested_blocks" fail, or the stack will be mis-aligned
#indents.pop
true
}
{
def inspect
indented_line.inspect +
(nested_blocks.elements.size > 0 ? (
"\n{\n" +
nested_blocks.elements.map { |content|
content.block.inspect+"\n"
}*'' +
"}"
)
: "")
end
}
end
rule indented_line
indentation text:((!"\n" .)*) "\n"
{
def inspect
text.text_value
end
}
end
rule indentation
' '*
end
end
Here's a little test driver program so you can try it easily:
require 'polyglot'
require 'treetop'
require 'indented_blocks'
parser = IndentedBlocksParser.new
input = <<END
def foo
here is some indented text
here it's further indented
and here the same
but here it's further again
and some more like that
before going back to here
down again
back twice
and start from the beginning again
with only a small block this time
END
parse_tree = parser.parse input
p parse_tree
I know this is an old thread, but I just wanted to add some PEGjs code to the answers. This code will parse a piece of text and "nest" it into a sort of "AST-ish" structure. It only goes one deep and it looks ugly, furthermore it does not really use the return values to create the right structure but keeps an in-memory tree of your syntax and it will return that at the end. This might well become unwieldy and cause some performance issues, but at least it does what it's supposed to.
Note: Make sure you have tabs instead of spaces!
{
var indentStack = [],
rootScope = {
value: "PROGRAM",
values: [],
scopes: []
};
function addToRootScope(text) {
// Here we wiggle with the form and append the new
// scope to the rootScope.
if (!text) return;
if (indentStack.length === 0) {
rootScope.scopes.unshift({
text: text,
statements: []
});
}
else {
rootScope.scopes[0].statements.push(text);
}
}
}
/* Add some grammar */
start
= lines: (line EOL+)*
{
return rootScope;
}
line
= line: (samedent t:text { addToRootScope(t); }) &EOL
/ line: (indent t:text { addToRootScope(t); }) &EOL
/ line: (dedent t:text { addToRootScope(t); }) &EOL
/ line: [ \t]* &EOL
/ EOF
samedent
= i:[\t]* &{ return i.length === indentStack.length; }
{
console.log("s:", i.length, " level:", indentStack.length);
}
indent
= i:[\t]+ &{ return i.length > indentStack.length; }
{
indentStack.push("");
console.log("i:", i.length, " level:", indentStack.length);
}
dedent
= i:[\t]* &{ return i.length < indentStack.length; }
{
for (var j = 0; j < i.length + 1; j++) {
indentStack.pop();
}
console.log("d:", i.length + 1, " level:", indentStack.length);
}
text
= numbers: number+ { return numbers.join(""); }
/ txt: character+ { return txt.join(""); }
number
= $[0-9]
character
= $[ a-zA-Z->+]
__
= [ ]+
_
= [ ]*
EOF
= !.
EOL
= "\r\n"
/ "\n"
/ "\r"

Resources