Clang - how to retrieve "Expr" as string?

I am using Clang/LibTooling (an ASTConsumer with a Matcher) to visit ALL return statements (ReturnStmt). I need to extract the expression that comes after the keyword return, in string form, so that I can pass it to the macro I am replacing the return statement with.
For example, I want to replace the following line:
return somefunc() + 1;
with
FUNCTION_EXIT(somefunc() + 1); // FUNCTION_EXIT is a C macro
The macro will return from the function after doing some logging.
I am using ReturnStmt::getRetValue(), which returns an Expr, and I have tried to get it in string form (so that it can be passed to the macro), but I haven't found a way yet. Is there a way to stringify an Expr?

Clang has a strict separation of concerns between the abstract syntax tree (AST) and the actual source code. The component that converts between these is the Lexer. To get the raw source for an Expr e:
const std::string text = Lexer::getSourceText(
    CharSourceRange::getTokenRange(e.getSourceRange()),
    source_manager, lang_opts).str();
Note that the SourceManager and LangOptions are available from the ASTContext. If the code you're parsing has macros then things get more complicated because you have to care about spelling location versus expansion location; SourceManager has a bunch of different functions to convert between these.
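For the original return-statement use case, here is a minimal sketch of a MatchFinder callback that does the rewrite. The class name ReturnRewriter and the use of Rewriter are my own illustrative choices, not something the question prescribes:

#include "clang/ASTMatchers/ASTMatchFinder.h"
#include "clang/Lex/Lexer.h"
#include "clang/Rewrite/Core/Rewriter.h"

using namespace clang;
using namespace clang::ast_matchers;

// Registered with: Finder.addMatcher(returnStmt().bind("ret"), &Callback);
class ReturnRewriter : public MatchFinder::MatchCallback {
public:
  explicit ReturnRewriter(Rewriter &R) : Rewrite(R) {}

  void run(const MatchFinder::MatchResult &Result) override {
    const auto *Ret = Result.Nodes.getNodeAs<ReturnStmt>("ret");
    const Expr *RetVal = Ret ? Ret->getRetValue() : nullptr;
    if (!RetVal)
      return; // a plain `return;` has no expression to stringify

    // Stringify the returned expression via the Lexer, as above.
    StringRef Text = Lexer::getSourceText(
        CharSourceRange::getTokenRange(RetVal->getSourceRange()),
        *Result.SourceManager, Result.Context->getLangOpts());

    // Replace `return <expr>` with the macro call; the statement's
    // source range excludes the trailing semicolon, so it is kept.
    Rewrite.ReplaceText(Ret->getSourceRange(),
                        ("FUNCTION_EXIT(" + Text + ")").str());
  }

private:
  Rewriter &Rewrite;
};

Run over the example, this turns return somefunc() + 1; into FUNCTION_EXIT(somefunc() + 1);.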
Good luck!

Related

Dart regular expression using `$1`

I want to use $1 in a Dart regular expression to get the grouping in the expression.
Just like javascript regular expressions.
void main() {
// javascript
// 'hello-world'.replace(/-(\w)/, '+$1'); // hello+world
print('hello-world'.replaceAll(RegExp(r'-(\w)'), '+\$1')); // hello+$1orld
}
Completely unexpected.
Expected result: hello+world. Actual result: hello+$1orld.
The Dart replaceAll method does not treat its second argument as a replacement pattern.
Rather than introduce a new language for specifying replacements in a string, Dart allows you to use a normal Dart expression to compute the replacement.
Dart has a short syntax for functions, so this is not much longer than the string, but it allows you to insert arbitrarily complicated computations when necessary.
So, use the replaceAllMapped function and write the replacement as follows:
print('hello-world'.replaceAllMapped(RegExp(r'-(\w)'), (m) => '+${m[1]}'));
For this particular example, the RegExp is so simple that you can replace with a plain string:
print('hello-world'.replaceAll(RegExp(r'-(?=\w)'), '+'));
This replaces a - with a + only when the - is followed by a word character.

How to use context free grammars?

Could someone help me with using context-free grammars? Up until now I've used regular expressions to remove comments, block comments and empty lines from a string so that it can be used to count the physical lines of code (PLOC). This seems to be extremely slow, so I was looking for a different, more efficient method.
I saw the following post: What is the best way to ignore comments in a java file with Rascal?
I have no idea how to use this, and the documentation doesn't get me far either. When I try to define the line used in the post, I immediately get an error.
lexical SingleLineComment = "//" ~[\n] "\n";
Could someone help me out with this and also explain a bit about how to set up such a context-free grammar and then actually extract the wanted data?
Kind regards,
Bob
First, this will help: the ~ is not part of Rascal's CFG notation; the negation of a character class is written like so: ![\n].
Using a context-free grammar in Rascal takes three steps:
Write it, like for example the syntax definition of the Func language here: http://docs.rascal-mpl.org/unstable/Recipes/#Languages-Func
Use it to parse input, like so:
// This is the basic parse command, but be careful it will not accept spaces and newlines before and after the TopNonTerminal text:
Prog myParseTree = parse(#Prog, "example string");
// you can do the same directly to an input file:
Prog myParseTree = parse(#Prog, |home:///myProgram.func|);
// if you need to accept layout before and after the program, use a "start nonterminal":
start[Prog] myParseTree = parse(#start[Prog], |home:///myProgram.func|);
Prog myProgram = myParseTree.top;
// shorthand for parsing stuff:
myProgram = [Prog] "example";
myProgram = [Prog] |home:///myLocation.txt|;
Once you have the tree you can start using visit and / deep match to extract information from the tree, or write recursive functions if you like. Examples can be found here: http://docs.rascal-mpl.org/unstable/Recipes/#Languages-Func, but here are some common idioms as well to extract information from a parse tree:
// produces the source location of each node in the tree:
myParseTree@\loc
// produces a set of all nodes of type Stat
{ s | /Stat s := myParseTree }
// pattern match an if-then-else and bind the three expressions and collect them in a set:
{ e1, e2, e3 | (Stat) `if <Exp e1> then <Exp e2> else <Exp e3> end` <- myExpressionList }
// collect all locations of all sub-trees (every parse tree is of a non-terminal type, which is a sub-type of Tree; it uses |unknown:///| for small sub-trees which have not been annotated for efficiency's sake, like literals and character classes):
[ t@\loc ? |unknown:///| | /Tree t := myParseTree ]
That should give you a start. I'd go try out some stuff and look at more examples. Writing a grammar is a nice thing to do, but it does require some trial and error, like writing a regex, only more so.
For the grammar you might be writing, which finds source code comments but leaves the rest as "any character", you will need longest-match disambiguation a lot:
lexical Identifier = [a-z]+ !>> [a-z]; // means do not accept an Identifier if there is still [a-z] to add to it; so only the longest possible Identifier will match.
This kind of context-free grammar is called an "Island Grammar" metaphorically, because you will write precise rules for the parts you want to recognize (the comments are "Islands") while leaving the rest as everything else (the rest is "Water"). See https://dl.acm.org/citation.cfm?id=837160

How can I handle previously declared constants with a parser expression grammar?

Say I have the following, in a toy DSL:
int foo(int bar = 0);
With a tool such as rust-peg, I could define some simple parsing expression grammar (PEG) rules to match it (assume appropriate structs FnProto and Arg):
function -> FnProto
= t:type " " n:name "(" v:arglist ");"
{ FnProto { return_type:t, name:n, args:v } }
arglist -> Vec<Arg>
= arg ** ","
arg -> Arg
= t:type " " n:name " = " z:integer { Arg { typename:t, name:n, value:z } }
type -> String
= "int" { match_str.to_string() }
name -> String
= [a-zA-Z_]+[a-zA-Z0-9_] { match_str.to_string() }
integer -> i64
= "-"? [0-9]+ { match_str.parse().unwrap() }
In practice such simple rules are insufficient, but they will serve to illustrate my point.
Now consider the following situation, where the default value of bar is a constant defined previously in the same file:
int BAZ = 0xDEADBEEF;
int foo(int bar = BAZ);
Now the rule for parsing functions needs to accept not only integer literals as default argument values, but also any previously declared constants.
I could do one pass to parse constants and substitute the appropriate values in a second pass, but do I really have to resort to two passes? Is there some way I can refer to previously parsed data from within a rule?
You are confusing "parsing" (the recognition of a valid program, perhaps including capture of a representation of it [e.g., as an AST]) with semantic analysis and/or execution.
Your parser should define what is legal to say, syntactically, in the language. Nothing less, and nothing more. You might be able to write some programs that are semantic nonsense that the parser will not complain about.
Having parsed the text, you now need "other passes" over the parsed data (not the source text) to build classic compiler structures such as symbol tables, and to check that all uses of symbols are valid. To do those other passes, you could arguably reparse the text but you've done that already once by assumption. The standard solution here is to have the first parse build an abstract syntax tree (AST) representing the essential details of the program. Those "other passes" operate by walking the AST rather than parsing the source text again.
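To make that concrete, here is a hedged C++ sketch of such an "other pass": it resolves symbolic default values against a symbol table built from the previously parsed constant declarations. The AST shapes are invented stand-ins for whatever your parser actually captures:

#include <map>
#include <stdexcept>
#include <string>
#include <variant>

// Invented AST shape: a default value is either an integer literal
// or a constant name that the parser recorded without resolving.
struct DefaultValue { std::variant<long, std::string> v; };

// Second pass: resolve a default value against the symbol table that
// an earlier AST walk filled from declarations like `int BAZ = 0xDEADBEEF;`.
long resolve(const DefaultValue &d,
             const std::map<std::string, long> &symbols) {
    if (const long *lit = std::get_if<long>(&d.v))
        return *lit;                          // literal: already a value
    const std::string &name = std::get<std::string>(d.v);
    auto it = symbols.find(name);
    if (it == symbols.end())
        throw std::runtime_error("undefined constant: " + name);
    return it->second;                        // e.g. BAZ -> 0xDEADBEEF
}

The parser's only job is to record the name BAZ; the lookup happens on the parsed data, not on the source text.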
This is all classic and taught in standard compiler classes and books. If you are serious about building a programming language, you will need this background.

Is it legitimate for a tokenizer to have a stack?

I have designed a new language for which I want to write a reasonable lexer and parser.
For the sake of brevity, I have reduced this language to a minimum so that my questions are still open.
The language has implicit and explicit strings, arrays and objects. An implicit string is just a sequence of characters that does not contain <, {, [ or ]. An explicit string looks like <salt<text>salt> where salt is an arbitrary identifier (i.e. [a-zA-Z][a-zA-Z0-9]*) and text is an arbitrary sequence of characters that does not contain the salt.
An array starts with [, followed by objects and/or strings and ends with ].
All characters within an array that don't belong to an array, object or explicit string do belong to an implicit string and the length of each implicit string is maximal and greater than 0.
An object starts with { and ends with } and consists of properties. A property starts with an identifier, followed by a colon, then optional whitespaces and then either an explicit string, array or object.
So the string [ name:test <xml<<html>[]</html>>xml> {name:<b<test>b>}<b<bla>b> ] represents an array with 6 items: " name:test ", "<html>[]</html>", " ", { name: "test" }, "bla" and " " (the object is notated in JSON).
As one can see, this language is not context free due to the explicit string (which I don't want to give up). However, the syntax tree is unambiguous.
So my question is: is a property a token that may be returned by a tokenizer? Or should the tokenizer return T_identifier, T_colon when it reads an object property?
The real language even allows prefixes in the identifier of a property, e.g. ns/name:<a<test>a> where ns is the prefix for a namespace.
Should the tokenizer return T_property_prefix("ns"), T_property_prefix_separator, T_property_name("name"), T_property_colon or just T_property("ns/name") or even T_identifier("ns"), T_slash, T_identifier("name"), T_colon?
If the tokenizer should recognize properties (which would be useful for syntax highlighters), it must have a stack, because name: is not a property if it is in an array. To decide whether bla: in [{foo:{bar:[test:baz]} bla:{}}] is a property or just an implicit string, the tokenizer must track when it enters and leaves an object or array.
Thus, the tokenizer would not be a finite state machine any more.
Or does it make sense to have two tokenizers - the first, which separates whitespace from alphanumeric character sequences and special characters like : or [, and the second, which uses the first to build more semantic tokens? The parser could then operate on top of the second tokenizer.
In any case, the tokenizer needs unbounded lookahead to see where an explicit string ends. Or should the detection of the end of an explicit string happen inside the parser?
Or should I use a parser generator for my undertaking? Since my language is not context free, I don't think there is an appropriate parser generator.
Thanks in advance for your answers!
flex can be requested to provide a context stack, and many flex scanners use this feature. So, while it may not fit with a purist view of how a scanner scans, it is a perfectly acceptable and supported feature. See this chapter of the flex manual for details on how to have different lexical contexts (called "start conditions"); at the very end is a brief description of the context stack. (Don't miss the sentence which notes that you need %option stack to enable the stack.) [See Note 1]
Slightly trickier is the requirement to match strings with variable end markers. flex does not have any variable match feature, but it does allow you to read one character at a time from the scanner input, using the function input(). That's sufficient for your language (at least as described).
Here's a rough outline of a possible scanner:
%option stack
%x SC_OBJECT
%%
/* initial/array context only */
[^][{<]+ yylval = strdup(yytext); return STRING;
/* object context only */
<SC_OBJECT>{
[}] yy_pop_state(); return '}';
[[:alpha:]][[:alnum:]]* yylval = strdup(yytext); return ID;
[:/] return yytext[0];
[[:space:]]+ /* Ignore whitespace */
}
/* either context */
<*>{
[][] return yytext[0]; /* char class with [] */
[{] yy_push_state(SC_OBJECT); return '{';
"<"[[:alpha:]][[:alnum:]]*"<" {
/* We need to save a copy of the salt because yytext could
* be invalidated by input().
*/
char* salt = strdup(yytext);
char* saltend = salt + yyleng;
char* match = salt;
/* The string accumulator code is *not* intended
* to be a model for how to write string accumulators.
*/
yylval = NULL;
size_t length = 0;
/* change salt to what we're looking for */
*salt = *(saltend - 1) = '>';
while (match != saltend) {
int ch = input();
if (ch == EOF) {
yyerror("Unexpected EOF");
/* free the temps and do something */
}
if (ch == *match) ++match;
else if (ch == '>') match = salt + 1;
else match = salt;
/* Don't do this in real code */
yylval = realloc(yylval, ++length);
yylval[length - 1] = ch;
}
/* Get rid of the terminator */
yylval[length - yyleng] = 0;
free(salt);
return STRING;
}
. yyerror("Invalid character in object");
}
I didn't test that thoroughly, but here is what it looks like with your example input:
[ name:test <xml<<html>[]</html>>xml> {name:<b<test>b>}<b<bla>b> ]
Token: [
Token: STRING: -- name:test --
Token: STRING: --<html>[]</html>--
Token: STRING: -- --
Token: {
Token: ID: --name--
Token: :
Token: STRING: --test--
Token: }
Token: STRING: --bla--
Token: STRING: -- --
Token: ]
Notes
In your case, unless you wanted to avoid having a parser, you don't actually need a stack since the only thing that needs to be pushed onto the stack is an object context, and a stack with only one possible value can be replaced with a counter.
Consequently, you could just remove the %option stack and define a counter at the top of the scanner. Instead of pushing the start condition, you increment the counter and set the start condition; instead of popping, you decrement the counter and reset the start condition if it drops to 0.
%%
/* Indented text before the first rule is inserted at the top of yylex */
int object_count = 0;
<*>[{] ++object_count; BEGIN(SC_OBJECT); return '{';
<SC_OBJECT>[}] if (!--object_count) BEGIN(INITIAL); return '}';
Reading the input one character at a time is not the most efficient. Since in your case a string terminator must start with >, it would probably be better to define a separate "explicit string" context, in which you recognize [^>]+ and [>]. The second of these would do the character-at-a-time match, as with the above code, but would terminate instead of looping if it found a non-matching character other than >. However, the simple code presented may turn out to be fast enough, and anyway it was just intended to be good enough to do a test run.
I think the traditional way to parse your language would be to have the tokenizer return T_identifier("ns"), T_slash, T_identifier("name"), T_colon for ns/name:
Anyway, I can see three reasonable ways you could implement support for your language:
Use lex/flex and yacc/bison. The tokenizers generated by lex/flex do not have a stack, so you should be using T_identifier and not T_context_specific_type. I haven't tried this approach, so I can't say definitively whether your language can be parsed by lex/flex and yacc/bison; my suggestion is to try it and see. You may find information about the lexer hack useful: http://en.wikipedia.org/wiki/The_lexer_hack
Implement a hand-built recursive descent parser. Note that this can easily be built without separate lexer/parser stages. So, if the lexemes depend on context, it is easy to handle with this approach (see the sketch after this answer).
Implement your own parser generator which turns lexemes on and off based on the context of the parser. So, the lexer and the parser would be integrated together using this approach.
I once worked for a major network security vendor where deep packet inspection was performed using approach (3), i.e. we had a custom parser generator. The reason for this is that approach (1) doesn't work for two reasons: firstly, data can't be fed to lex/flex and yacc/bison incrementally, and secondly, HTTP can't be parsed with lex/flex and yacc/bison because the meaning of the string "HTTP" depends on its location, i.e. it could be a header value or the protocol specifier. Approach (2) didn't work there either because data can't be fed incrementally to recursive descent parsers.
I should add that if you want to have meaningful error messages, a recursive descent parser approach is heavily recommended. My understanding is that the current version of gcc uses a hand-built recursive descent parser.
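To illustrate approach (2), here is a deliberately simplified recursive-descent sketch in C++ (no error handling; all names are invented for the example). Because each parse function reads the raw input directly, name: is treated as a property inside parseObject but as part of an implicit string inside parseArray, and the variable end marker of an explicit string needs no special lexer support at all:

#include <cctype>
#include <cstdio>
#include <string>

struct Parser {
    std::string src;
    size_t pos = 0;

    char peek() const { return pos < src.size() ? src[pos] : '\0'; }
    void skipSpace() { while (std::isspace((unsigned char)peek())) ++pos; }

    void parseArray() {
        ++pos;                                    // consume '['
        while (peek() && peek() != ']') {
            if (peek() == '{')      parseObject();
            else if (peek() == '<') parseExplicitString();
            else                    parseImplicitString();
        }
        ++pos;                                    // consume ']'
    }

    void parseObject() {
        ++pos;                                    // consume '{'
        for (;;) {
            skipSpace();
            if (!peek() || peek() == '}') break;
            std::string name;                     // property identifier
            while (std::isalnum((unsigned char)peek()))
                name += src[pos++];
            ++pos;                                // consume ':'
            skipSpace();
            std::printf("property: %s\n", name.c_str());
            if (peek() == '{')      parseObject();
            else if (peek() == '[') parseArray();
            else                    parseExplicitString();
        }
        ++pos;                                    // consume '}'
    }

    void parseImplicitString() {
        std::string s;
        while (peek() && peek() != '<' && peek() != '{' &&
               peek() != '[' && peek() != ']')
            s += src[pos++];
        std::printf("implicit string: --%s--\n", s.c_str());
    }

    void parseExplicitString() {
        ++pos;                                    // consume outer '<'
        std::string salt;
        while (peek() && peek() != '<')
            salt += src[pos++];                   // read the salt
        ++pos;                                    // consume inner '<'
        std::string end = ">" + salt + ">";
        size_t stop = src.find(end, pos);         // unbounded lookahead is trivial here
        std::printf("explicit string: --%s--\n",
                    src.substr(pos, stop - pos).c_str());
        pos = stop == std::string::npos ? src.size() : stop + end.size();
    }
};

int main() {
    Parser p{"[ name:test <xml<<html>[]</html>>xml> {name:<b<test>b>}<b<bla>b> ]"};
    p.parseArray();
}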

Evaluating constant expressions in clang tools

I'm writing a Clang tool and I'm trying to figure out how to evaluate a string literal given access to the program's AST. Given the following program:
class DHolder {
public:
DHolder(std::string s) {}
};
DHolder x("foo");
I have the following code in the Clang tool:
const CXXConstructExpr *ctor = ... // constructs `x` above
const Expr *expr = ctor->getArg(0); // the "foo" expression
???
How can I get from the Expr representing the "foo" string literal to an actual C++ string in my tool? I've tried to do something like:
// From ExprConstant.cpp
Evaluate(result, info, expr);
but I don't know how to initialize the result and info parameters.
Any clues?
I realize this is an old question, but I ran into this a moment ago when I could not use stringLiteral() to bind to any arguments (the code is not C++11). For example, I have a CXXMemberCallExpr:
addProperty(object, char*, char*, ...); // has 7 arguments, N=[0,6]
The AST dump shows that ahead of the StringLiteral is a CXXBindTemporaryExpr. So in order for my memberCallExpr query to bind using hasArgument(N,expr()), I wrapped my query with bindTemporaryExpr() (shown here on separate lines for readability):
memberCallExpr(
hasArgument(6, bindTemporaryExpr(
hasDescendant(stringLiteral().bind("argument"))
)
)
)
The proper way to do this is to use the AST matchers to match the string literal and bind a name to it so it can be later referenced, like this:
StatementMatcher m =
constructExpr(hasArgument(0, stringLiteral().bind("myLiteral"))).bind("myCtor");
and then in the match callback do this:
const CXXConstructExpr *ctor =
    result.Nodes.getNodeAs<CXXConstructExpr>("myCtor");
const StringLiteral *literal =
    result.Nodes.getNodeAs<StringLiteral>("myLiteral");
The literal can then be accessed through
literal->getString().str();
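A side note, hedged: if the constructor argument were a constant expression rather than a plain string literal, Expr has public evaluation entry points, so there is no need to reach into ExprConstant.cpp. A minimal sketch for the integer case, assuming the same match callback (result and ctor as above):

// Evaluate a compile-time-constant argument such as `1 + 2`.
// For string literals, the matcher approach above is simpler.
const Expr *arg = ctor->getArg(0);
Expr::EvalResult eval;
if (arg->EvaluateAsRValue(eval, *result.Context) && eval.Val.isInt())
    llvm::outs() << eval.Val.getInt() << "\n";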
