I'm writing a Clang tool and I'm trying to figure out how to evaluate a string literal given access to the program's AST. Given the following program:
class DHolder {
public:
DHolder(std::string s) {}
};
DHolder x("foo");
I have the following code in the Clang tool:
const CXXConstructExpr *ctor = ... // constructs `x` above
const Expr *expr = ctor->getArg(0); // the "foo" expression
???
How can I get from the Expr representing the "foo" string literal to an actual C++ string in my tool? I've tried to do something like:
// From ExprConstant.cpp
Evaluate(result, info, expr);
but I don't know how to initialize the result and info parameters.
Any clues?
I realize this is an old question, but I ran into this a moment ago when I could not use stringLiteral() to bind to any arguments (the code is not C++11). For example, I have a CXXMemberCallExpr:
addProperty(object, char*, char*, ...); // has 7 arguments, N=[0,6]
The AST dump shows that ahead of the StringLiteral is a CXXBindTemporaryExpr. So in order for my memberCallExpr query to bind using hasArgument(N,expr()), I wrapped my query with bindTemporaryExpr() (shown here on separate lines for readability):
memberCallExpr(
hasArgument(6, bindTemporaryExpr(
hasDescendant(stringLiteral().bind("argument"))
)
)
)
The proper way to do this is to use the AST matchers to match the string literal and bind a name to it so it can be later referenced, like this:
StatementMatcher m =
constructExpr(hasArgument(0, stringLiteral().bind("myLiteral"))).bind("myCtor");
and then in the match callback do this:
const CXXConstructExpr *ctor =
result.Nodes.getNodeAs<CXXConstructExpr>("myCtor");
const StringLiteral *optNameLiteral =
result.Nodes.getNodeAs<StringLiteral>("myLiteral");
The literal can then be accessed through
optNameLiteral->getString().str();
I am using Clang/libtooling (ASTConsumer with a Matcher) to visit ALL return statements (ReturnStmt). I need to extract the expression that comes after the keyword return in string form so that I can put it in the macro that I am replacing the return statement with.
For example, I want to replace the following line:
return somefunc() + 1;
with
FUNCTION_EXIT(somefunc() + 1); // FUNCTION_EXIT is a C macro
The macro will return from the function after doing some logging.
I am using ReturnStmt::getRetValue() that returns an Expr and tried to get it in string form (so that it can be passed to the macro), but I haven't found a way yet. Is there a way to stringify Expr?
Clang has a strict separation of concerns between the abstract syntax tree (AST) and the actual source code. The component that converts between these is the Lexer. To get the raw source for an Expr e:
const std::string text = Lexer::getSourceText(CharSourceRange::getTokenRange(e.getSourceRange()), source_manager, opt).str();
Note that the SourceManager and LangOptions are available from the ASTContext. If the code you're parsing has macros then things get more complicated because you have to care about spelling location versus expansion location; SourceManager has a bunch of different functions to convert between these.
Good luck!
I need some help regarding a problem I face in my flex code.
My task: To write a flex code which recognizes the declaration part of a programming language, described below.
Let a programming language PL. Its variable definition part is described as follows:
A declaration part must begin with the keyword "var". After the keyword come one or more variable names separated by commas ",", then a colon ":", then the variable type (real, boolean, integer or char in my example), and finally a semicolon ";". Further declarations may follow on new lines, each in the same form (variable names separated by commas, a colon, a type, a semicolon), but the "var" keyword must not be repeated at the start of those lines: it is written only once.
E.g.
var number_of_attendants, sum: integer;
ticket_price: real;
symbols: char;
Concretely, I do not know how to require that every declaration part starts with the 'var' keyword. At the moment, if I begin a declaration part by declaring a variable directly, say x (without "var" at the beginning of the line), no error is reported, which is not what I want.
My current flex code below:
%{
#include <stdio.h>
%}
VAR_DEFINER "var"
VAR_NAME [a-zA-Z][a-zA-Z0-9_]*
VAR_TYPE "real"|"boolean"|"integer"|"char"
SUBEXPRESSION [{VAR_NAME}[","{VAR_NAME}]*":"[ \t\n]*{VAR_TYPE}";"]+
EXPRESSION {VAR_DEFINER}{SUBEXPRESSION}
%%
^{EXPRESSION} {
printf("This is not a well-syntaxed expression!\n");
return 0;
}
{EXPRESSION} printf("This is a well-syntaxed expression!\n");
";"[ \t\n]*{VAR_DEFINER} {
printf("The keyword 'var' is defined once at the beginning of a new line. You can not use it again\n");
return 0;
}
{VAR_DEFINER} printf("A keyword: %s\n", yytext);
^{VAR_DEFINER} printf("Each and every declaration part must start with the 'var' keyword.\n");
{VAR_TYPE}";" printf("The variable type is: %s\n", yytext);
{VAR_NAME} printf("A variable name: %s\n", yytext);
","/[ \t\n]*{VAR_NAME} /* eat up commas */
":"/[ \t\n]*{VAR_TYPE}";" /* eat up single colon */
[ \t\n]+ /* eat up whitespace */
. {
printf("Unrecognized character: %s\n", yytext);
return 0;
}
%%
int main(int argc, char **argv)
{
++argv, --argc;
if (argc > 0)
yyin = fopen(argv[0],"r");
else
yyin = stdin;
yylex();
}
I hope I have made the problem as clear as possible.
I am looking forward to reading your answers!
You seem to be trying to do too much in the scanner. Do you really have to do everything in Flex? In other words, is this an exercise to learn advanced use of Flex, or is it a problem that may be solved using more appropriate tools?
I've read that the first Fortran compiler took 18 staff-years to create, back in the 1950s. Today, "a substantial compiler can be implemented even as a student project in a one-semester compiler design course", as the Dragon Book from 1986 says. One of the main reasons for this increased efficiency is that we have learned how to divide the compiler into modules that can be constructed separately. The first two such parts, or phases, of a typical compiler are the scanner and the parser.
The scanner, or lexical analyzer, can be generated by Flex from a specification file, or constructed otherwise. Its job is to read the input, which consists of a sequence of characters, and split it into a sequence of tokens. A token is the smallest meaningful part of the input language, such as a semicolon, the keyword var, the identifier number_of_attendants, or the operator <=. You should not use the scanner to do more than that.
Here is how I would write a simplified Flex specification for your tokens:
[ \t\n] { /* Ignore all whitespace */ }
var { return VAR; }
real { return REAL; }
boolean { return BOOLEAN; }
integer { return INTEGER; }
char { return CHAR; }
[a-zA-Z][a-zA-Z0-9_]* { return VAR_NAME; }
. { return yytext[0]; }
The sequence of tokens is then passed on to the parser, or syntactical analyzer. The parser compares the token sequence with the grammar for the language. For example, the input var number_of_attendants, sum : integer; consists of the keyword var, a comma-separated list of variables, a colon, a data type keyword, and a semicolon. If I understand what your input is supposed to look like, perhaps this grammar would be correct:
program : VAR typedecls ;
typedecls : typedecl | typedecls typedecl ;
typedecl : varlist ':' var_type ';' ;
varlist : VAR_NAME | varlist ',' VAR_NAME ;
var_type : REAL | BOOLEAN | INTEGER | CHAR ;
This grammar happens to be written in a format that Bison, a parser-generator that often is used together with Flex, can understand.
If you separate your solution into a lexical part, using Flex, and a grammar part, using Bison, your life is likely to be much simpler and happier.
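To make the division of labour concrete without requiring Flex and Bison, here is a minimal hand-written sketch in C++ (C++11 or later) of the same two phases: a scanner that only produces tokens, and a recursive-descent parser that follows the grammar above. This is not what Flex or Bison would generate; the names Tok, scan, Parser and is_valid_declaration are made up for this illustration.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Token kinds, mirroring the Flex rules above (hypothetical names).
enum class Tok { Var, Type, Name, Comma, Colon, Semi, Unknown, End };

// Scanner: splits the input into tokens and does nothing more.
std::vector<Tok> scan(const std::string &in) {
    std::vector<Tok> out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char c = static_cast<unsigned char>(in[i]);
        if (std::isspace(c)) { ++i; continue; }            // ignore whitespace
        if (std::isalpha(c)) {                             // keyword or identifier
            std::size_t start = i;
            while (i < in.size() &&
                   (std::isalnum(static_cast<unsigned char>(in[i])) || in[i] == '_'))
                ++i;
            const std::string word = in.substr(start, i - start);
            if (word == "var") out.push_back(Tok::Var);
            else if (word == "real" || word == "boolean" ||
                     word == "integer" || word == "char") out.push_back(Tok::Type);
            else out.push_back(Tok::Name);
            continue;
        }
        out.push_back(c == ',' ? Tok::Comma : c == ':' ? Tok::Colon
                    : c == ';' ? Tok::Semi : Tok::Unknown);
        ++i;
    }
    out.push_back(Tok::End);                               // sentinel for the parser
    return out;
}

// Parser: checks the token sequence against the grammar, one function per rule.
class Parser {
public:
    explicit Parser(std::vector<Tok> toks) : toks_(std::move(toks)) {}

    // program : VAR typedecls
    bool program() {
        if (!accept(Tok::Var)) return false;   // 'var' must come first
        if (!typedecl()) return false;         // at least one declaration
        while (peek() != Tok::End)
            if (!typedecl()) return false;     // a repeated 'var' fails here
        return true;
    }

private:
    // typedecl : varlist ':' var_type ';'
    bool typedecl() {
        return varlist() && accept(Tok::Colon) && accept(Tok::Type) && accept(Tok::Semi);
    }
    // varlist : VAR_NAME | varlist ',' VAR_NAME
    bool varlist() {
        if (!accept(Tok::Name)) return false;
        while (accept(Tok::Comma))
            if (!accept(Tok::Name)) return false;
        return true;
    }
    Tok peek() const {
        assert(pos_ < toks_.size());           // the End sentinel is never consumed
        return toks_[pos_];
    }
    bool accept(Tok t) {
        if (peek() != t) return false;
        ++pos_;
        return true;
    }
    std::vector<Tok> toks_;
    std::size_t pos_ = 0;
};

// Convenience wrapper: scan, then parse.
bool is_valid_declaration(const std::string &src) {
    return Parser(scan(src)).program();
}
```

With this split, is_valid_declaration("x: integer;") is false because program() insists on the var token first, and a second var on a later line is rejected because varlist() expects a variable name there. Neither check needs any special handling in the scanner.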
I love the simplicity of types like
type Code = Code of string
But I would like to put some restrictions on string (in this case - do not allow empty or spaces-only strings). Something like
type nonemptystring = ???
type Code = Code of nonemptystring
How do I define this type in F# idiomatic way? I know I can make it a class with constructor or a restricted module with factory function, but is there an easy way?
A string is essentially a sequence of char values (in Haskell, BTW, String is a type alias for [Char]). A more general question, then, would be if it's possible to statically declare a list as having a given size.
Such a language feature is known as Dependent Types, and F# doesn't have it. The short answer, therefore, is that this is not possible to do in a declarative fashion.
The easiest, and probably also most idiomatic, way, then, would be to define Code as a single-case Discriminated Union:
type Code = Code of string
In the module that defines Code, you'd also define a function that clients can use to create Code values:
let tryCreateCode candidate =
if System.String.IsNullOrWhiteSpace candidate
then None
else Some (Code candidate)
This function contains the run-time logic that prevents clients from creating empty Code values:
> tryCreateCode "foo";;
val it : Code option = Some (Code "foo")
> tryCreateCode "";;
val it : Code option = None
> tryCreateCode " ";;
val it : Code option = None
What prevents a client from creating an invalid Code value, then? For example, wouldn't a client be able to circumvent the tryCreateCode function and simply write Code ""?
This is where signature files come in. You create a signature file (.fsi), and in that declare types and functions like this:
type Code
val tryCreateCode : string -> Code option
Here, the Code type is declared, but its 'constructor' isn't. This means that you can't directly create values of this type. This, for example, doesn't compile:
Code ""
The error given is:
error FS0039: The value, constructor, namespace or type 'Code' is not defined
The only way to create a Code value is to use the tryCreateCode function.
As given here, you can no longer access the underlying string value of Code, unless you also provide a function for that:
let toString (Code x) = x
and declare it in the same .fsi file as above:
val toString : Code -> string
That may look like a lot of work, but is really only six lines of code, and three lines of type declaration (in the .fsi file).
Unfortunately there isn't a convenient syntax for declaring a restricted subset of types, but I would leverage active patterns to do this. As you rightly say, you can make a type and check its validity when you construct it:
/// String type which can't be null or whitespace
type FullString (string) =
let string =
match (System.String.IsNullOrWhiteSpace string) with
|true -> invalidArg "string" "string cannot be null or whitespace"
|false -> string
member this.String = string
Now, constructing this type naively may throw runtime exceptions and we don't want that! So let's use active patterns:
let (|FullStr|WhitespaceStr|NullStr|) (str : string) =
match str with
|null -> NullStr
|str when System.String.IsNullOrWhiteSpace str -> WhitespaceStr
|str -> FullStr(FullString(str))
Now we have something that we can use with pattern matching syntax to build our FullStrings. This function is safe at runtime because we only create a FullString if we're in the valid case.
You can use it like this:
let printString str =
match str with
|NullStr -> printfn "The string is null"
|WhitespaceStr -> printfn "The string is whitespace"
|FullStr fstr -> printfn "The string is %s" (fstr.String)
Say I have the following, in a toy DSL:
int foo(int bar = 0);
With a tool such as rust-peg, I could define some simple parser expression grammar (PEG) rules to match it (assume appropriate structs FnProto and Arg):
function -> FnProto
= t:type " " n:name "(" v:arglist ");"
{ FnProto { return_type:t, name:n, args:v } }
arglist -> Vec<Arg>
= arg ** ","
arg -> Arg
= t:type " " n:name " = " z:integer { Arg { typename:t, name:n, value:z } }
type -> String
= "int" { match_str.to_string() }
name -> String
= [a-zA-Z_][a-zA-Z0-9_]* { match_str.to_string() }
integer -> i64
= "-"? [0-9]+ { match_str.parse().unwrap() }
In practice such simple rules are insufficient, but they will serve to illustrate my point.
Now consider the following situation, where the default value of bar is a constant defined previously in the same file:
int BAZ = 0xDEADBEEF;
int foo(int bar = BAZ);
Now the rule for parsing functions needs to accept not only integer literals as default argument values, but also any previously declared constants.
I could do one pass to parse constants and substitute the appropriate values in a second pass, but do I really have to resort to two passes? Is there some way I can refer to previously parsed data from within a rule?
You are confusing "parsing" (the recognition of a valid program, perhaps including capture of a representation of it [e.g., as an AST]) with semantic analysis and/or execution.
Your parser should define what is legal to say, syntactically, in the language. Nothing less, and nothing more. You might be able to write some programs that are semantic nonsense that the parser will not complain about.
Having parsed the text, you now need "other passes" over the parsed data (not the source text) to build classic compiler structures such as symbol tables, and to check that all uses of symbols are valid. To do those other passes, you could arguably reparse the text but you've done that already once by assumption. The standard solution here is to have the first parse build an abstract syntax tree (AST) representing the essential details of the program. Those "other passes" operate by walking the AST rather than parsing the source text again.
This is all classic and taught in standard compiler classes and books. If you are serious about building a programming language, you will need this background.
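As a rough illustration of such a second pass, here is a minimal C++17 sketch. It assumes the parser has already produced a flat list of top-level items in which a default value is either an integer literal or an unresolved name; the pass then walks that list once, fills a symbol table, and substitutes constants. The types DefaultValue, ConstDecl, Item and the function resolve are invented for this example (FnProto and Arg follow the question's structs), and a real language would need scoping, types and better error reporting.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <optional>
#include <string>
#include <utility>
#include <variant>
#include <vector>

// A default value as the parser sees it: a literal, or a name to resolve later.
using DefaultValue = std::variant<std::int64_t, std::string>;

struct ConstDecl { std::string name; std::int64_t value; };
struct Arg       { std::string type, name; DefaultValue value; };
struct FnProto   { std::string return_type, name; std::vector<Arg> args; };

// Top-level items in source order; this is the whole "AST" of the toy DSL.
using Item = std::variant<ConstDecl, FnProto>;

// Semantic pass: walks the AST (not the source text), records constants in a
// symbol table, and replaces each named default with the constant's value.
// Returns std::nullopt if a name is used before it has been declared.
std::optional<std::vector<FnProto>> resolve(const std::vector<Item> &ast) {
    std::map<std::string, std::int64_t> symbols;
    std::vector<FnProto> out;
    for (const Item &item : ast) {
        if (const auto *c = std::get_if<ConstDecl>(&item)) {
            symbols[c->name] = c->value;               // declare the constant
            continue;
        }
        FnProto fn = std::get<FnProto>(item);          // copy, then patch defaults
        for (Arg &arg : fn.args) {
            if (const auto *name = std::get_if<std::string>(&arg.value)) {
                const auto it = symbols.find(*name);
                if (it == symbols.end()) return std::nullopt;  // undeclared name
                arg.value = it->second;
            }
            // After this pass every default is a plain integer.
            assert(std::holds_alternative<std::int64_t>(arg.value));
        }
        out.push_back(std::move(fn));
    }
    return out;
}
```

For the example in the question, the parser would emit a ConstDecl for BAZ followed by a FnProto whose argument bar carries the name "BAZ"; resolve then yields a FnProto whose default is 0xDEADBEEF. The grammar itself only has to accept "integer or name" in that position and never needs to look anything up.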
Using F# in Visual Studio 2012, this code compiles:
let ``foo.bar`` = 5
But this code does not:
type ``foo.bar`` = class end
Invalid namespace, module, type or union case name
According to section 3.4 of the F# language specification:
Any sequence of characters that is enclosed in double-backtick marks (`` ``),
excluding newlines, tabs, and double-backtick pairs themselves, is treated
as an identifier.
token ident =
| ident-text
| `` [^ '\n' '\r' '\t']+ | [^ '\n' '\r' '\t'] ``
Section 5 defines type as:
type :=
( type )
type -> type -- function type
type * ... * type -- tuple type
typar -- variable type
long-ident -- named type, such as int
long-ident<types> -- named type, such as list<int>
long-ident< > -- named type, such as IEnumerable< >
type long-ident -- named type, such as int list
type[ , ... , ] -- array type
type lazy -- lazy type
type typar-defns -- type with constraints
typar :> type -- variable type with subtype constraint
#type -- anonymous type with subtype constraint
... and Section 4.2 defines long-ident as:
long-ident := ident '.' ... '.' ident
As far as I can tell from the spec, types are named with long-idents, and long-idents can be idents. Since idents support double-backtick-quoted punctuation, it therefore seems like types should too.
So am I misreading the spec? Or is this a compiler bug?
It definitely looks like the specification is not synchronized with the actual implementation, so there is a bug on one side or the other.
When you use an identifier in double backticks, the compiler treats it as a name and simply generates a type (or member) with the name you specified in backticks. It does not do any name mangling to make sure that the identifier is a valid type/member name.
This means that it is not too surprising that you cannot use identifiers that would clash with some standard meaning in the compiled code. In your example, it is dot, but here are a few other examples:
type ``Foo.Bar``() = // Dot is not allowed because it represents namespace
member x.Bar = 0
type ``Foo`1``() = // Single backtick is used to compile generic types
member x.Bar = 0
type ``Foo+Bar``() = // + is used in the name of a nested type
member x.Bar = 0
The above examples are not allowed as type names (because they clash with some standard meaning), but you can use them in let-bindings, because there are no such restrictions on variable names:
let ``foo`1`` = 0
let ``foo.bar`` = 2
let ``foo+bar`` = 1
This is definitely something that should be explained in the documentation & the specification, but I hope this helps to clarify what is going on.