I'm looking to do something similar to the question "How to get integer variable name and its value from Expr* in clang", using the RecursiveASTVisitor.
The goal is to first retrieve all assignment operations and then run my own checks on them for taint analysis.
I've overridden VisitBinaryOperator like this:
bool VisitBinaryOperator(BinaryOperator *bOp) {
    // Skip everything that is not an assignment.
    if (!bOp->isAssignmentOp()) {
        return true;
    }
    Expr *LHSexpr = bOp->getLHS();
    Expr *RHSexpr = bOp->getRHS();
    LHSexpr->dump();
    RHSexpr->dump();
    return true; // keep traversing the rest of the AST
}
This RecursiveASTVisitor is being run on Objective-C code, so I do not know what the LHS or RHS will evaluate to (the RHS could even be a function call).
Would it be possible to get the text representation of the LHS/RHS out of clang, in order to run regular expressions on it?
Sorry, I found something similar that works for this particular case.
Solution:
bool VisitBinaryOperator(BinaryOperator *bOp) {
    if (!bOp->isAssignmentOp()) {
        return true;
    }
    Expr *LHSexpr = bOp->getLHS();
    Expr *RHSexpr = bOp->getRHS();
    std::string LHS_string = convertExpressionToString(LHSexpr);
    std::string RHS_string = convertExpressionToString(RHSexpr);
    return true;
}
std::string convertExpressionToString(Expr *E) {
    SourceManager &SM = Context->getSourceManager();
    clang::LangOptions lopt;
    // Newer clang releases spell these getBeginLoc()/getEndLoc().
    SourceLocation startLoc = E->getLocStart();
    // Start of the expression's last token, not one past its end.
    SourceLocation _endLoc = E->getLocEnd();
    // Step past that last token to get the exclusive end of the text.
    SourceLocation endLoc = clang::Lexer::getLocForEndOfToken(_endLoc, 0, SM, lopt);
    return std::string(SM.getCharacterData(startLoc), SM.getCharacterData(endLoc) - SM.getCharacterData(startLoc));
}
The only thing I'm not sure about is why _endLoc is needed to compute endLoc, and how the Lexer actually works.
EDIT:
Link to the post that helped me: Getting the source behind clang's AST
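On the _endLoc question: clang source ranges are token ranges, so getLocEnd() points at the first character of the expression's last token, not one past the end of the text. Lexer::getLocForEndOfToken re-lexes that token to measure its length and returns the location just past it. The following is only a toy model of that idea in plain C++, with no clang dependency. The names measure_token_length and text_of_range are hypothetical, offsets stand in for SourceLocation, and the token rule (letters, digits, underscore) is a deliberate simplification.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Toy stand-in for re-lexing: length of the identifier/number token that
// starts at offset `start` (hypothetical helper, not a clang API).
std::size_t measure_token_length(const std::string& src, std::size_t start) {
    std::size_t len = 0;
    while (start + len < src.size()) {
        const unsigned char c = static_cast<unsigned char>(src[start + len]);
        if (!std::isalnum(c) && c != '_') break;
        ++len;
    }
    return len;
}

// Given the start offsets of the first and last token of an expression,
// return the full expression text, including the whole last token.
std::string text_of_range(const std::string& src,
                          std::size_t first_token_start,
                          std::size_t last_token_start) {
    // Without this adjustment the last token would be cut off entirely.
    const std::size_t end =
        last_token_start + measure_token_length(src, last_token_start);
    return src.substr(first_token_start, end - first_token_start);
}
```

For the source text "x = alpha + beta;" the expression starts at offset 4 and its last token starts at offset 12. Cutting at the last token's start would drop "beta", which is the same reason the clang version needs the extra Lexer call.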
I'm writing a recursive descent parser in Go for a simple made-up language, so I'm designing the grammar as I go. My parser works, but I wanted to ask whether there are any best practices for how I should lay out my code, or when I should put code in its own function, to make it more readable.
I've been building the parser by following the simple rules I've learned so far, i.e. each non-terminal is its own function. Even though my code works, I think it looks really messy and unreadable.
I've included the code for the assignment non-terminal and the grammar above the function.
I've taken out most of the error handling to keep the function smaller.
Here are some examples of what that code can parse:
a = 10
a,b,c = 1,2,3
a int = 100
a,b string = "hello", "world"
Can anyone give me some advice as to how I can make my code more readable please?
// assignment : variable_list '=' expr_list
// | variable_list type
// | variable_list type '=' expr_list
func (p *Parser) assignment() ast.Noder {
assignment := &ast.AssignmentNode{}
assignment.Left = p.variable_list()
// This if-statement deals with rule 2 or 3
if p.currentToken.Type != token.ASSIGN {
// Static variable declaration
// Could be a declaration or an assignment
// Only static variables can be declared without providing a value
assignment.IsStatic = true
assignment.Type = p.var_type().Value
assignment.Right = nil
p.nextToken()
// Rule 2 is finished at this point in the code
// This if-statement is for rule 3
if p.currentToken.Type == token.ASSIGN {
assignment.Operator = p.currentToken
p.nextToken()
assignment.Right = p.expr_list()
}
} else {
// This deals with rule 1
assignment.Operator = p.currentToken
p.nextToken()
assignment.Right = p.expr_list()
}
if assignment.Right == nil {
for i := 0; i < len(assignment.Left); i++ {
assignment.Right = append(assignment.Right, nil)
}
}
if len(assignment.Left) != len(assignment.Right) {
p.FoundError(p.syntaxError("variable mismatch, " + strconv.Itoa(len(assignment.Left)) + " on left but " + strconv.Itoa(len(assignment.Right)) + " on right,"))
}
return assignment
}
how I can make my code more readable?
For readability, which is a prerequisite for correct, maintainable code, consider restructuring the function so that each step mirrors one part of the grammar:
// assignment : variable_list '=' expr_list
// | variable_list type
// | variable_list type '=' expr_list
func (p *Parser) assignment() ast.Noder {
assignment := &ast.AssignmentNode{}
// variable_list
assignment.Left = p.variable_list()
// type
if p.currentToken.Type != token.ASSIGN {
// Static variable declaration
// Could be a declaration or an assignment
// Only static variables can be declared without providing a value
assignment.IsStatic = true
assignment.Type = p.var_type().Value
p.nextToken()
}
// '=' expr_list
if p.currentToken.Type == token.ASSIGN {
assignment.Operator = p.currentToken
p.nextToken()
assignment.Right = p.expr_list()
}
// variable_list [expr_list]
if assignment.Right == nil {
for i := 0; i < len(assignment.Left); i++ {
assignment.Right = append(assignment.Right, nil)
}
}
if len(assignment.Left) != len(assignment.Right) {
p.FoundError(p.syntaxError(fmt.Sprintf(
"variable mismatch, %d on left but %d on right,",
len(assignment.Left), len(assignment.Right),
)))
}
return assignment
}
Note: This is likely inefficient and overly complicated:
for i := 0; i < len(assignment.Left); i++ {
assignment.Right = append(assignment.Right, nil)
}
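What is the type of assignment.Right? If it is a slice whose element type has nil as its zero value (pointers or interfaces), a single make call with length len(assignment.Left) would produce the same nil-filled slice without the loop.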
As for how to make your code more readable, there is not always a cut-and-dried answer. I personally find that code is more readable when you can use function names in place of comments in the code. A lot of people like to recommend the book "Clean Code" by Robert C. Martin. He pushes this throughout the book: small functions that have one purpose and are self-documenting (via the function name).
Of course, as I said before this is a subjective topic. I took a crack at it, and came up with the code below, which I personally feel is more readable. It also uses the function names to document what is going on. That way the reader doesn't necessarily need to dig into every single statement in the code, but rather just the high level function names if they don't need all of the details.
// assignment : variable_list '=' expr_list
// | variable_list type
// | variable_list type '=' expr_list
func (p *Parser) assignment() ast.Noder {
assignment := &ast.AssignmentNode{}
assignment.Left = p.variable_list()
// This if-statement deals with rule 2 or 3
if p.currentToken.Type != token.ASSIGN {
// Static variable declaration
// Could be a declaration or an assignment
// Only static variables can be declared without providing a value
p.parseStaticStatement(assignment)
} else {
p.parseVariableAssignment(assignment)
}
if assignment.Right == nil {
assignment.appendDefaultValues()
}
p.checkForUnbalancedAssignment(assignment)
return assignment
}
func (p *Parser) parseStaticStatement(assignment *ast.AssignmentNode) {
	assignment.IsStatic = true
	assignment.Type = p.var_type().Value
	assignment.Right = nil
	p.nextToken()
	// Rule 2 is finished at this point in the code
	// This if-statement is for rule 3
	if p.currentToken.Type == token.ASSIGN {
		p.parseStaticAssignment(assignment)
	}
}
func (p *Parser) parseStaticAssignment(assignment *ast.AssignmentNode) {
assignment.Operator = p.currentToken
p.nextToken()
assignment.Right = p.expr_list()
}
func (p *Parser) parseVariableAssignment(assignment *ast.AssignmentNode) {
// This deals with rule 1
assignment.Operator = p.currentToken
p.nextToken()
assignment.Right = p.expr_list()
}
// Go only allows methods on types defined in the same package, so this
// method has to be declared inside package ast.
func (a *AssignmentNode) appendDefaultValues() {
	for i := 0; i < len(a.Left); i++ {
		a.Right = append(a.Right, nil)
	}
}
func (p *Parser) checkForUnbalancedAssignment(assignment *ast.AssignmentNode) {
if len(assignment.Left) != len(assignment.Right) {
p.FoundError(p.syntaxError("variable mismatch, " + strconv.Itoa(len(assignment.Left)) + " on left but " + strconv.Itoa(len(assignment.Right)) + " on right,"))
}
}
I hope that you find this helpful. I am more than willing to answer any further questions that you may have if you leave a comment on my response.
I have a fairly simple problem that I somehow cannot find an answer for. While working on parsing a larger grammar, I discovered that parsing any string longer than 15 characters makes the parser report failure. The parser looks like this:
namespace parser {
    template <typename Iterator>
    struct p_grammar : qi::grammar<Iterator, standard::space_type> {
        p_grammar() : p_grammar::base_type(spec) {
            spec = "qwertyuiopasdfgh";
        }
        qi::rule<Iterator, standard::space_type> spec;
    };
} // namespace parser
And will be run from within another function:
void MainWindow::parserTest() {
typedef parser::p_grammar<std::string::const_iterator> p_grammar;
p_grammar grammar;
using boost::spirit::standard::space;
std::string::const_iterator iter = editor->toPlainText().toStdString().begin();
std::string::const_iterator end = editor->toPlainText().toStdString().end();
if ( phrase_parse(iter,end,grammar,space) ) {
outputLog->append("Parsing succesfull");
} else {
outputLog->append("Parsing failed");
}
}
Removing the last character in "qwertyuiopasdfgh", so only 15 characters are present, makes it parse without failure.
Feel like I'm overlooking something obvious here.
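You should be using valid iterators:
std::string value = editor->toPlainText().toStdString();
std::string::const_iterator iter = value.begin(), end = value.end();
You were using iterators into a temporary that wasn't stored. Each call to toStdString() returns a new std::string that is destroyed at the end of the statement, so iter and end dangle and do not even refer to the same string. That is undefined behavior; the 15-character threshold is most likely a side effect of the small-string optimization in your standard library.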
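The fix can be tried without Qt or Boost. In the sketch below, make_text() is a hypothetical stand-in for editor->toPlainText().toStdString(): like that call chain, it returns a fresh std::string by value every time. The helper matches() is also an assumption, standing in for the real parse.

```cpp
#include <algorithm>
#include <cassert>
#include <string>

// Hypothetical stand-in for editor->toPlainText().toStdString():
// every call returns a brand-new std::string by value.
std::string make_text() { return "qwertyuiopasdfgh"; }

// Stand-in for the parser: true when [first, last) spells out `expected`.
bool matches(std::string::const_iterator first,
             std::string::const_iterator last,
             const std::string& expected) {
    return std::equal(first, last, expected.begin(), expected.end());
}

// Broken pattern (do not use): both temporaries die at the end of their
// statement, so the iterators dangle and come from two different strings.
//   std::string::const_iterator iter = make_text().begin();
//   std::string::const_iterator end  = make_text().end();

bool parse_stored_copy() {
    // Keep one string alive in a named variable while the iterators are used.
    const std::string value = make_text();
    std::string::const_iterator iter = value.begin();
    std::string::const_iterator end = value.end();
    return matches(iter, end, "qwertyuiopasdfgh");
}
```

Because value outlives both iterators, the 16-character input is handled like any other length.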
I am trying to learn the Dart language by porting the exercises my school gives for C programming.
The very first exercise in our C pool is to write a function print_alphabet() that prints the alphabet in lowercase; it is forbidden to print the alphabet directly.
In POSIX C, the straightforward solution would be:
#include <unistd.h>
void print_alphabet(void)
{
char c;
c = 'a';
while (c <= 'z')
{
write(STDOUT_FILENO, &c, 1);
c++;
}
}
int main(void)
{
print_alphabet();
return (0);
}
However, as far as I know, the current version of Dart (1.1.1) does not have an easy way of dealing with characters. The furthest I got (for my very first version) is this:
void print_alphabet()
{
var c = "a".codeUnits.first;
var i = 0;
while (++i <= 26)
{
print(c.toString());
c++;
}
}
void main() {
print_alphabet();
}
Which prints the ASCII value of each character, one per line, as a string ("97" ... "122"). Not really what I intended…
I am trying to search for a proper way of doing this. But the lack of a char type like the one in C is giving me a bit of a hard time, as a beginner!
Dart does not have character types.
To convert a code point to a string, you use the String constructor String.fromCharCode:
int c = "a".codeUnitAt(0);
int end = "z".codeUnitAt(0);
while (c <= end) {
print(String.fromCharCode(c));
c++;
}
For simple stuff like this, I'd use "print" instead of "stdout", if you don't mind the newlines.
There is also:
int char_a = 'a'.codeUnitAt(0);
print(String.fromCharCodes(new Iterable.generate(26, (x) => char_a + x)));
or, using newer list literal syntax:
int char_a = 'a'.codeUnitAt(0);
int char_z = 'z'.codeUnitAt(0);
print(String.fromCharCodes([for (var i = char_a; i <= char_z; i++) i]));
As I was finalizing my post and rephrasing my question’s title, I am no longer barking up the wrong tree thanks to this question about stdout.
It seems that one proper way of writing characters is to use stdout.writeCharCode from the dart:io library.
import 'dart:io';
void ft_print_alphabet()
{
var c = "a".codeUnits.first;
while (c <= "z".codeUnits.first)
stdout.writeCharCode(c++);
}
void main() {
ft_print_alphabet();
}
I still have no clue about how to manipulate character types, but at least I can print them.
I've been trying to implement a BASIC language interpreter (in C/C++) but I haven't found any book or (thorough) article which explains the process of parsing the language constructs. Some commands are rather complex and hard to parse, especially conditionals and loops, such as IF-THEN-ELSE and FOR-STEP-NEXT, because they can mix variables with constants and entire expressions and code and everything else, for example:
10 IF X = Y + Z THEN GOTO 20 ELSE GOSUB P
20 FOR A = 10 TO B STEP -C : PRINT C$ : PRINT WHATEVER
30 NEXT A
It seems like a nightmare to parse something like that and make it work. And to make things worse, programs written in BASIC can easily be a tangled mess. That's why I need some advice, such as a book to read, to get a clearer picture of this subject. What can you suggest?
You've picked a great project - writing interpreters can be lots of fun!
But first, what do we even mean by an interpreter? There are different types of interpreters.
There is the pure interpreter, where you simply interpret each language element as you find it. These are the easiest to write, and the slowest.
A step up would be to convert each language element into some sort of internal form, and then interpret that. This is still pretty easy to write.
The next step would be to actually parse the language, generate a syntax tree, and then interpret that. This is somewhat harder to write, but once you've done it a few times, it becomes pretty easy.
Once you have a syntax tree, you can fairly easily generate code for a custom stack virtual machine. A much harder project is to generate code for an existing virtual machine, such as the JVM or CLR.
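To make the "custom stack virtual machine" idea a little more concrete, here is a minimal sketch in C++. Everything in it is an assumption made for illustration: the three-instruction set (Push, Add, Mul), the names Instruction, run and sample_program, and the choice of plain int values. A real BASIC would need many more instructions.

```cpp
#include <cassert>
#include <stdexcept>
#include <vector>

// Hypothetical instruction set for a tiny stack machine.
enum class Op { Push, Add, Mul };

struct Instruction {
    Op op;
    int operand;  // only meaningful for Push
};

// Executes the program and returns the single value left on the stack.
int run(const std::vector<Instruction>& program) {
    std::vector<int> stack;
    for (const Instruction& ins : program) {
        if (ins.op == Op::Push) {
            stack.push_back(ins.operand);
            continue;
        }
        if (stack.size() < 2) throw std::runtime_error("stack underflow");
        const int rhs = stack.back(); stack.pop_back();
        const int lhs = stack.back(); stack.pop_back();
        stack.push_back(ins.op == Op::Add ? lhs + rhs : lhs * rhs);
    }
    if (stack.size() != 1) throw std::runtime_error("malformed program");
    return stack.back();
}

// What a code generator might emit for 2 + 3 * 4 by visiting the syntax
// tree in post-order (operands first, then the operator).
std::vector<Instruction> sample_program() {
    return {{Op::Push, 2}, {Op::Push, 3}, {Op::Push, 4},
            {Op::Mul, 0}, {Op::Add, 0}};
}
```

The post-order walk is what makes code generation from a syntax tree fairly mechanical: each node emits code for its children and then one instruction for itself.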
In programming, like most engineering endeavors, careful planning greatly helps, especially with complicated projects.
So the first step is to decide which type of interpreter you wish to write. If you have not read any compiler books (I always recommend Niklaus Wirth's "Compiler Construction" as one of the best introductions to the subject; it is now freely available on the web in PDF form), I would recommend that you go with the pure interpreter.
But you still need to do some additional planning. You need to rigorously define what it is you are going to be interpreting. EBNF is great for this. For a gentle introduction to EBNF, read the first three parts of A Simple Compiler at http://www.semware.com/html/compiler.html. It is written at the high school level, and should be easy to digest. Yes, I tried it on my kids first :-)
Once you have defined what it is you want to be interpreting, you are ready to write your interpreter.
Abstractly, your simple interpreter will be divided into a scanner (technically, a lexical analyzer), a parser, and an evaluator. In the simple pure interpreter case, the parser and evaluator will be combined.
Scanners are easy to write, and easy to test, so we won't spend any time on them. See the aforementioned link for info on crafting a simple scanner.
Let's (for example) define your goto statement:
gotostmt -> 'goto' integer
integer -> [0-9]+
This tells us that when we see the token 'goto' (as delivered by the scanner), the only thing that can follow is an integer. And an integer is simply a string of digits.
In pseudo code, we might handle this like so:
(token is the current token, i.e. the element most recently returned by the scanner)
loop
    if token == "goto"
        goto_stmt()
    elseif token == "gosub"
        gosub_stmt()
    elseif token == .....
endloop
proc goto_stmt()
    expect("goto") -- redundant, but used to skip over goto
    if is_numeric(token)
        -- now, somehow set the instruction pointer at the requested line
    else
        error("expecting a line number, found '%s'\n", token)
    end
end
proc expect(s)
    if s == token
        getsym()
        return true
    end
    error("Expecting '%s', found: '%s'\n", s, token)
end
See how simple it is? Really, the only hard thing to figure out in a simple interpreter is the handling of expressions. A good recipe for handling those is at: http://www.engr.mun.ca/~theo/Misc/exp_parsing.htm Combined with the aforementioned references, you should have enough to handle the sort of expressions you would encounter in BASIC.
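Since expressions are the hard part, here is a rough sketch of precedence climbing, one common technique for parsing infix expressions, written in C++. It is not taken from the interpreter in this answer. The class name ExprEvaluator and its members are hypothetical, and it only handles non-negative integer literals, the operators + - * / (all left-associative), and parentheses.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>

class ExprEvaluator {
public:
    explicit ExprEvaluator(const std::string& text) : src_(text), pos_(0) {}

    long evaluate() {
        const long value = parse_expression(1);
        skip_spaces();
        if (pos_ != src_.size()) throw std::runtime_error("unexpected trailing input");
        return value;
    }

private:
    // Binding strength of a binary operator; 0 means "not an operator".
    static int precedence(char op) {
        if (op == '+' || op == '-') return 1;
        if (op == '*' || op == '/') return 2;
        return 0;
    }

    void skip_spaces() {
        while (pos_ < src_.size() &&
               std::isspace(static_cast<unsigned char>(src_[pos_]))) ++pos_;
    }

    // primary -> integer | '(' expression ')'
    long parse_primary() {
        skip_spaces();
        if (pos_ < src_.size() && src_[pos_] == '(') {
            ++pos_;
            const long value = parse_expression(1);
            skip_spaces();
            if (pos_ >= src_.size() || src_[pos_] != ')')
                throw std::runtime_error("expected ')'");
            ++pos_;
            return value;
        }
        if (pos_ >= src_.size() ||
            !std::isdigit(static_cast<unsigned char>(src_[pos_])))
            throw std::runtime_error("expected a number");
        long value = 0;
        while (pos_ < src_.size() &&
               std::isdigit(static_cast<unsigned char>(src_[pos_])))
            value = value * 10 + (src_[pos_++] - '0');
        return value;
    }

    // Consumes operators whose precedence is at least min_prec.
    long parse_expression(int min_prec) {
        long left = parse_primary();
        for (;;) {
            skip_spaces();
            if (pos_ >= src_.size()) return left;
            const char op = src_[pos_];
            const int prec = precedence(op);
            if (prec == 0 || prec < min_prec) return left;
            ++pos_;
            // prec + 1 makes the operator left-associative.
            const long right = parse_expression(prec + 1);
            switch (op) {
                case '+': left += right; break;
                case '-': left -= right; break;
                case '*': left *= right; break;
                case '/':
                    if (right == 0) throw std::runtime_error("division by zero");
                    left /= right;
                    break;
            }
        }
    }

    std::string src_;
    std::size_t pos_;
};
```

In a BASIC interpreter, parse_primary would also accept variable names and function calls, and the scanner would hand over tokens instead of raw characters; the loop in parse_expression stays the same.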
Ok, time for a concrete example. This is from a larger 'pure interpreter' that handles an enhanced version of Tiny BASIC (but big enough to run Tiny Star Trek :-) )
/*------------------------------------------------------------------------
Simple example, pure interpreter, only supports 'goto'
------------------------------------------------------------------------*/
#include <stdio.h>
#include <stdlib.h>
#include <stdarg.h>
#include <string.h>
#include <setjmp.h>
#include <ctype.h>
enum {False=0, True=1, Max_Lines=300, Max_Len=130};
char *text[Max_Lines+1]; /* array of program lines */
int textp; /* used by scanner - ptr in current line */
char tok[Max_Len+1]; /* the current token */
int cur_line; /* the current line number */
int ch; /* current character */
int num; /* populated if token is an integer */
jmp_buf restart;
int error(const char *fmt, ...) {
va_list ap;
char buf[200];
va_start(ap, fmt);
vsnprintf(buf, sizeof buf, fmt, ap); /* bounded: cannot overflow buf */
va_end(ap);
printf("%s\n", buf);
longjmp(restart, 1);
return 0;
}
int is_eol(void) {
return ch == '\0' || ch == '\n';
}
void get_ch(void) {
ch = text[cur_line][textp];
if (!is_eol())
textp++;
}
void getsym(void) {
char *cp = tok;
while (ch <= ' ') {
if (is_eol()) {
*cp = '\0';
return;
}
get_ch();
}
if (isalpha(ch)) {
for (; !is_eol() && isalpha(ch); get_ch()) {
*cp++ = (char)ch;
}
*cp = '\0';
} else if (isdigit(ch)) {
for (; !is_eol() && isdigit(ch); get_ch()) {
*cp++ = (char)ch;
}
*cp = '\0';
num = atoi(tok);
} else
error("What? '%c'", ch);
}
void init_getsym(const int n) {
cur_line = n;
textp = 0;
ch = ' ';
getsym();
}
void skip_to_eol(void) {
tok[0] = '\0';
while (!is_eol())
get_ch();
}
int accept(const char s[]) {
if (strcmp(tok, s) == 0) {
getsym();
return True;
}
return False;
}
int expect(const char s[]) {
return accept(s) ? True : error("Expecting '%s', found: %s", s, tok);
}
int valid_line_num(void) {
if (num > 0 && num <= Max_Lines)
return True;
return error("Line number must be between 1 and %d", Max_Lines);
}
void goto_line(void) {
if (valid_line_num())
init_getsym(num);
}
void goto_stmt(void) {
if (isdigit(tok[0]))
goto_line();
else
error("Expecting line number, found: '%s'", tok);
}
void do_cmd(void) {
for (;;) {
while (tok[0] == '\0') {
if (cur_line == 0 || cur_line >= Max_Lines)
return;
init_getsym(cur_line + 1);
}
if (accept("bye")) {
printf("That's all folks!\n");
exit(0);
} else if (accept("run")) {
init_getsym(1);
} else if (accept("goto")) {
goto_stmt();
} else {
error("Unknown token '%s' at line %d", tok, cur_line); return;
}
}
}
int main() {
int i;
for (i = 0; i <= Max_Lines; i++) {
text[i] = calloc(Max_Len + 1, sizeof(char)); /* element count first, then element size */
}
setjmp(restart);
for (;;) {
printf("> ");
if (fgets(text[0], Max_Len, stdin) == NULL)
return 0; /* EOF or read error: stop instead of looping forever */
if (text[0][0] != '\0') {
init_getsym(0);
if (isdigit(tok[0])) {
if (valid_line_num())
strcpy(text[num], &text[0][textp]);
} else
do_cmd();
}
}
}
Hopefully, that will be enough to get you started. Have fun!
I will probably get flamed for saying this, but:
First, I am actually working on a standalone library (as a hobby) that is made of:
A tokenizer that builds a flat list of tokens from the source text, in the same order as the text (lexemes created from the text flow).
A hand-written parser (syntax analysis; a pseudo-compiler).
There is neither "pseudo-code" nor a "virtual CPU/machine".
Instructions (such as 'return', 'if', 'for', 'while', and arithmetic expressions) are represented by a base C++ struct/class, and each instruction is the object itself. The base object, which I call atom, has a virtual method called "eval" (among other common members) that performs the execution/branching itself. So whether I have an 'if' statement with its possible branches (a single statement or a block of statements) for the true or false condition, it is invoked through the base virtual atom::eval(), and so on for everything that is an atom.
Even 'objects' such as variables are atoms. For these, eval() simply returns the value from a variant container referenced by the atom itself: a pointer refers either to the atom's own local variant instance or to a variant held by another atom created in a given block/stack. So atoms are 'in-place' instructions/objects.
As of now, as an example, a chunk of not really meaningful 'code' like the one below just works:
r = 5!; // 5! : (factorial of 5 )
Response = 1 + 4 - 6 * --r * ((3+5)*(3-4) * 78);
if (Response != 1){ /* '<>' also is not equal op. */
return r^3;
}
else{
return 0;
}
Expressions (arithmetic) are built into a binary expression tree:
A = b+c; =>
=
/ \
A +
/ \
b c
So the 'instruction'/statement for an expression like the one above is the tree-entry atom, which in this case is the '=' (binary) operator.
The tree is built with atom::r0, r1, r2:
atom 'A' :
r0
|
A
/ \
r1 r2
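As a rough illustration of the "every node evaluates itself" design described above, here is a minimal C++ sketch. It is not the library's actual code: the names Atom, Variable, Add, Assign, Environment and eval_sample are hypothetical, values are plain double instead of a variant, and children are owned through std::unique_ptr rather than the r0/r1/r2 links.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Variable storage; the post uses a variant container, a plain double
// keeps this sketch short.
using Environment = std::map<std::string, double>;

// Base node: every instruction or object knows how to evaluate itself.
struct Atom {
    virtual ~Atom() = default;
    virtual double eval(Environment& env) const = 0;
};

struct Variable : Atom {
    std::string name;
    explicit Variable(std::string n) : name(std::move(n)) {}
    double eval(Environment& env) const override { return env[name]; }
};

struct Add : Atom {
    std::unique_ptr<Atom> lhs, rhs;
    Add(std::unique_ptr<Atom> l, std::unique_ptr<Atom> r)
        : lhs(std::move(l)), rhs(std::move(r)) {}
    double eval(Environment& env) const override {
        return lhs->eval(env) + rhs->eval(env);
    }
};

struct Assign : Atom {
    std::string target;
    std::unique_ptr<Atom> value;
    Assign(std::string t, std::unique_ptr<Atom> v)
        : target(std::move(t)), value(std::move(v)) {}
    double eval(Environment& env) const override {
        const double result = value->eval(env);
        env[target] = result;  // store, then yield the assigned value
        return result;
    }
};

// Builds the tree for "A = b + c" and evaluates it from the root '='.
double eval_sample(Environment& env) {
    const Assign root("A", std::make_unique<Add>(
                               std::make_unique<Variable>("b"),
                               std::make_unique<Variable>("c")));
    return root.eval(env);
}
```

Calling eval() on the root is the whole "execution": there is no separate bytecode or virtual machine, which matches the design choice described in the post.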
Regarding the 'full-duplex' mechanism between the C++ runtime and the 'script' library, I've made class_adaptor and adaptor<>:
ex.:
template<typename R, typename ...Args> adaptor_t<T,R, Args...>& import_method(const lstring& mname, R (T::*prop)(Args...)) { ... }
template<typename R, typename ...Args> adaptor_t<T,R, Args...>& import_property(const lstring& mname, R (T::*prop)(Args...)) { ... }
Second: I know there are plenty of tools and libraries out there, such as lua, boost::bind<*>, QML, JSON, etc. But in my situation, I need to create my very own independent library for "live scripting". I was afraid that my 'interpreter' would take a huge amount of RAM, but I am surprised that it is not as big as when using QML, jscript or even lua :-)
Thank you :-)
Don't bother with hacking a parser together by hand. Use a parser generator. lex + yacc is the classic lexer/parser generator combination, but a Google search will reveal plenty of others.
I'm trying to implement a parser by directly reading a treeWalker and implementing the commands needed for the compiler on the fly. So if I have a command like:
statement
:
^('WRITE' expression)
{
//Here is the command that is created by my Tree Parser
ch.emitRO("OUT",0,0,0,"write out the value of ac");
//and then I handle it in my other classes
}
;
I want it to write OUT 0,0,0; to a file. That's my grammar.
I have a problem, though, with the loop section. In my grammar it is:
'WHILE'^ expression 'DO' stat_seq 'ENDDO'
and in the tree parser:
doWhileStatement
:
^('WHILE' expression 'DO' stat_seq 'ENDDO')
;
What I want to do is directly parse the code from the while loop into the commands I need. I came up with this solution but it doesn't work:
doWhileStatement
:
^('WHILE' e=expression head='DO'
{
int loopHead =((CommonTree) head).getTokenStartIndex();
}
stat_seq
{
if ($e.result==1) {
input.seek(loopHead);
doWhileStatement();
}
}
'ENDDO')
;
for the record here are some of the other commands I've written:
(ignore the code written in brackets, it's for the generation of the commands in a text file.)
stat_seq
:
(statement)+
;
statement
:
^(':=' ID e=expression) { variables.put($ID.text,e); }
| ^('WRITE' expression)
{
ch.emitRM("LDC",ac,$expression.result,0,"pass the expression value to the ac reg");
ch.emitRO("OUT",ac,0,0,"write out the value of ac");
}
| ^('READ' ID)
{
ch.emitRO("IN",ac,0,0,"read value");
}
| ^('IF' expression 'THEN'
{
ch.emitRM("LDC",ac1,$expression.result,0,"pass the expression result to the ac reg");
int savedLoc1 = ch.emitSkip(1);
}
sseq1=stat_seq
'ELSE'
{
int savedLoc2 = ch.emitSkip(1);
ch.emitBackup(savedLoc1);
ch.emitRM("JEQ",ac1,savedLoc2+1,0,"skip as many places as needed depending on the expression");
ch.emitRestore();
}
sseq2=stat_seq
{
int savedLoc3 = ch.emitSkip(0);
ch.emitBackup(savedLoc2);
ch.emitRM("LDC",PC_REG,savedLoc3,0,"skip for the else command");
ch.emitRestore();
}
'ENDIF')
| doWhileStatement
;
Any help would be appreciated, thank you
I found a solution. For everyone who has the same problem: I did it like this, and it's working:
^('WHILE'
{int c = input.index();}
expression
{int s=input.index();}
.* )// .* is a sequence of statements
{
int next = input.index(); // index of node following WHILE
input.seek(c);
match(input, Token.DOWN, null);
pushFollow(FOLLOW_expression_in_statement339);
int condition = expression();
state._fsp--;
//there is a problem here
//expression() seemed to be reading from the grammar file and I couldn't
//get it to read from the tree walker rule somehow
//It printed something like no viable alt at input 'DOWN'
//I googled it and found this mistake
// So I copied the code from the normal while statement
// And pasted it here and it works like a charm
// Normally there should only be int condition = expression()
while ( condition == 1 ) {
input.seek(s);
stat_seq();//stat_seq is a sequence of statements: (statement ';')+
input.seek(c);
match(input, Token.DOWN, null); //Copied value from EvaluatorWalker.java
//cause couldn't find another way to do it
pushFollow(FOLLOW_expression_in_statement339);
condition = expression();
state._fsp--;
System.out.println("condition:"+condition + " i:"+ variables.get("i"));
}
input.seek(next);
}
I described the remaining problem in the comments of my code. If anyone can tell me the proper way to do this, I would be grateful. It's weird that there is nearly no feedback on a correct way to implement loops within a tree grammar on the fly.
Regards,
Alex
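For comparison, an alternative to rewinding the node stream is to turn the loop into an object that re-evaluates its condition itself. The sketch below shows that idea in plain C++ and is independent of ANTLR. The names Variables, Expression, Statement, make_while and run_counting_loop are all hypothetical, and the convention that a condition value of 1 means "true" is borrowed from the code above.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

using Variables = std::map<std::string, int>;
using Expression = std::function<int(Variables&)>;   // yields 1 for "true"
using Statement = std::function<void(Variables&)>;

// Builds a loop statement from an already-parsed condition and body.
// Nothing is re-parsed at run time: the condition is simply called again
// before every iteration.
Statement make_while(Expression condition, std::vector<Statement> body) {
    return [condition = std::move(condition),
            body = std::move(body)](Variables& vars) {
        while (condition(vars) == 1) {
            for (const Statement& statement : body) statement(vars);
        }
    };
}

// Demo: the equivalent of "WHILE i < limit DO i := i + 1 ENDDO".
int run_counting_loop(int limit) {
    Variables vars{{"i", 0}};
    Expression condition = [limit](Variables& v) {
        return v["i"] < limit ? 1 : 0;
    };
    std::vector<Statement> body;
    body.push_back([](Variables& v) { v["i"] += 1; });
    const Statement loop = make_while(condition, body);
    loop(vars);
    return vars["i"];
}
```

With this shape the tree walker only has to build the condition and body once per WHILE node; the seek/match/pushFollow bookkeeping is no longer needed for the loop itself.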