lex & yacc get current position - parsing

In lex & yacc there is a macro called YY_INPUT which can be redefined, for example like this:
#define YY_INPUT(buf, result, maxlen) do { \
    const int n = gzread(gz_yyin, buf, maxlen); \
    if (n < 0) { \
        int errNumber = 0; \
        reportError(gzerror(gz_yyin, &errNumber)); \
    } \
    result = n > 0 ? n : YY_NULL; \
} while (0)
I have a grammar rule which calls the YYACCEPT macro.
If I call gztell (or ftell) after YYACCEPT, I get the wrong number, because the parser has already read some unnecessary data.
So how can I get the current position when one of my rules calls YYACCEPT? (One bad solution would be to read the input character by character.)
(I have already tried something like this:
#define YY_USER_ACTION do { \
    current_position += yyleng; \
} while (0)
but it doesn't seem to work.)

You have to keep track of the offset yourself. A simple but annoying solution is to put:
offset += yyleng;
in every flex action. Fortunately, you can do this implicitly by defining the YY_USER_ACTION macro, which is executed just before the token action.
That might still not be right for your grammar, because bison often reads one token ahead. So you'll also need to attach the value of offset to each lexical token, most conveniently using the location facility (yylloc).
Edit: added more details on location tracking.
The following has not been tested. You should read the sections in both the flex and the bison manual about location tracking.
The yylloc global variable and its default type are included in the generated bison code if you use the --locations command line option or the %locations directive, or if you simply refer to a location value in some rule using the @ syntax, which is analogous to the $ syntax (that is, @n is the location value of the right-hand-side object whose semantic value is $n). Unfortunately, the default type for yylloc uses ints, which are not wide enough to hold a file offset, although you might not be planning on parsing files for which this matters. In any event, it's easy enough to change; you merely have to #define the YYLTYPE macro at the top of your bison file. The default YYLTYPE is:
typedef struct YYLTYPE
{
    int first_line;
    int first_column;
    int last_line;
    int last_column;
} YYLTYPE;
For a minimal modification, I'd suggest keeping the member names unchanged; otherwise you'll also need to fix the YYLLOC_DEFAULT macro in your bison file. The default YYLLOC_DEFAULT ensures that non-terminals get a location value whose first_line and first_column members come from the first element in the non-terminal's RHS, and whose last_line and last_column members come from the last element. Since it is a macro, it will work with any assignable type for the various members, so it is sufficient to change the column members to long, size_t or off_t, as you feel appropriate:
#define YYLTYPE yyltype
typedef struct yyltype {
    int first_line;
    off_t first_column;
    int last_line;
    off_t last_column;
} yyltype;
Then in your flex input, you could define the YY_USER_ACTION macro:
off_t offset;
extern YYLTYPE yylloc;

#define YY_USER_ACTION \
    offset += yyleng; \
    yylloc.last_line = yylineno; \
    yylloc.last_column = offset;
With all that done and appropriate initialization, you should be able to use the appropriate @n.last_column in the accepting rule to extract the offset of the end of the last token in the accepted input.
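For example, a minimal sketch of such a rule might look like this (start, input, END_MARKER and accepted_offset are all hypothetical names, not taken from your grammar):
start
    : input END_MARKER {
          /* @2 is the location shifted with END_MARKER; with the YY_USER_ACTION
             above, its last_column member holds the byte offset just past that
             token. accepted_offset is assumed to be a global off_t. */
          accepted_offset = @2.last_column;
          YYACCEPT;
      }
    ;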

Related

Prefix and postfix operator

#include <stdio.h>

int main()
{
    int x = 5, y;
    y = x+++x;
    printf("%d", x);
    printf("%d", y);
}
What I found is that the postfix increment has higher precedence than the prefix increment. Hence
y = x+++x;
is
y = (x++) + x;
so I expected y = 10 and x = 6. But when I execute the program I get y = 11, x = 6.
Please correct me if I am understanding anything wrong.
Your tokenization is correct: the lexer uses "maximal munch", so x+++x is read as x ++ + x, i.e. (x++) + x, not x + (++x).
The catch is what happens afterwards. The expression both modifies x (through x++) and reads x again as the right-hand operand, and there is no sequence point between the two, so the behaviour is undefined. Your compiler happened to perform the increment before reading the second operand, giving 5 + 6 = 11 and x = 6; another compiler could just as well read both operands first and produce 10. Neither result is guaranteed.
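One well-defined way to spell out the ordering this particular compiler chose is the following sketch (old is just an illustrative temporary, not something the compiler actually creates):
int x = 5, y;
int old = x;    /* the value that x++ yields              */
x = x + 1;      /* the side effect of x++, made explicit  */
y = old + x;    /* 5 + 6 == 11 under this ordering; reading x
                   before the increment would give 10 instead */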

Clang: How to get the macro name used for size of a constant size array declaration

TL;DR:
How do I get the macro name used for the size of a constant-size array declaration, starting from a callExpr -> arg_0 -> DeclRefExpr?
Detailed Problem statement:
Recently I started working on a challenge which requires a source-to-source transformation tool for modifying specific function calls with an additional argument. Researching the ways I could achieve this introduced me to the amazing toolset Clang. I've been learning how to use the different tools provided in libtooling to achieve my goal, but now I'm stuck on a problem and would appreciate your help.
Consider the program below (a dummy version of my sources). My goal is to rewrite all calls to the strcpy function with the safe version strcpy_s and add an additional parameter to the new call, i.e. the destination pointer's maximum size. So, for the program below, my refactored call would be
strcpy_s(inStr, STR_MAX, argv[1]);
I wrote a RecursiveVisitor class and I am inspecting all function calls in the VisitCallExpr method. To get the max size of the dest argument I take the VarDecl of the first argument and try to get the size (ConstArrayType). Since the source file is already preprocessed, I see 2049 as the size, but what I need is the macro STR_MAX in this case. How can I get that?
(I create Replacements with this info and use RefactoringTool to apply them afterwards.)
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

#define STR_MAX 2049

int main(int argc, char **argv){
    char inStr[STR_MAX];
    if (argc > 1) {
        // Clang tool required to transform the call below into strncpy_s(inStr, STR_MAX, argv[1], strlen(argv[1]));
        strcpy(inStr, argv[1]);
    } else {
        printf("\n not enough args");
        return -1;
    }
    printf("got [%s]", inStr);
    return 0;
}
As you noticed correctly, the source code is already preprocessed and has all the macros expanded. Thus, the AST will simply have an integer expression as the size of the array.
A little bit of information on source locations
NOTE: you can skip it and proceed straight to the solution below
The information about expanded macros is contained in source locations of AST nodes and usually can be retrieved using Lexer (Clang's lexer and preprocessor are very tightly connected and can be even considered one entity). It's a bare minimum and not very obvious to work with, but it is what it is.
As you are looking for a way to get the original macro name for a replacement, you only need the spelling (i.e. the way it was written in the original source code) and you don't need to care much about macro definitions, function-style macros and their arguments, etc.
Clang has two different kinds of source ranges: the token-based SourceRange (a pair of SourceLocations), which can be found pretty much everywhere throughout the AST, and CharSourceRange. The token-based kind refers to positions in terms of tokens, which explains why begin and end positions can be somewhat counterintuitive:
// clang::DeclRefExpr
//
// ┌─ begin location
foo(VeryLongButDescriptiveVariableName);
// └─ end location
// clang::BinaryOperator
//
// ┌─ begin location
int Result = LHS + RHS;
// └─ end location
As you can see, this kind of location points to the beginning of the corresponding token. A CharSourceRange, on the other hand, delimits the text directly in terms of characters.
So, in order to get the original text of the expression, we need to convert the token-based range into a CharSourceRange and get the corresponding text from the source.
The solution
I've modified your example to show other cases of macro expansions as well:
#define STR_MAX 2049
#define BAR(X) X

int main() {
    char inStrDef[STR_MAX];
    char inStrFunc[BAR(2049)];
    char inStrFuncNested[BAR(BAR(STR_MAX))];
}
The following code:
// clang::VarDecl *VD;
// clang::ASTContext *Context;
auto &SM = Context->getSourceManager();
auto &LO = Context->getLangOpts();

auto DeclarationType = VD->getTypeSourceInfo()->getTypeLoc();
if (auto ArrayType = DeclarationType.getAs<ConstantArrayTypeLoc>()) {
    auto *Size = ArrayType.getSizeExpr();
    auto CharRange = Lexer::getAsCharRange(Size->getSourceRange(), SM, LO);
    // Lexer gets the text for [start, end), and we want it to grab the end as well
    CharRange.setEnd(CharRange.getEnd().getLocWithOffset(1));
    auto StringRep = Lexer::getSourceText(CharRange, SM, LO);
    llvm::errs() << StringRep << "\n";
}
produces this output for the snippet:
STR_MAX
BAR(2049)
BAR(BAR(STR_MAX))
I hope this information is helpful. Happy hacking with Clang!

How does the data structure for a lexical analysis look?

I know the lexical analyser tokenizes the input and stores it in a stream, or at least that is what I understood. Unfortunately nearly all articles I have read only talk about lexing simple expressions. What I am interested in is how to tokenize something like:
if (fooBar > 5) {
    for (var i = 0; i < alot.length; i++) {
        fooBar += 2 + i;
    }
}
Please note that this is pseudo code.
Question: I would like to know what the data structure for the tokens created by the lexer looks like. I really have no idea for the example I gave above, where the code is nested. Some example would be nice.
First of all, tokens are not necessarily stored. Some compilers do store the tokens in a table or other data structure, but for a simple compiler (if there is such a thing) it's sufficient in most cases that the lexer can return the type of the next token to be parsed; in some cases the parser might then ask the lexer for the actual text that the token is made up of.
If we use your sample code,
if (fooBar > 5) {
    for (var i = 0; i < alot.length; i++) {
        fooBar += 2 + i;
    }
}
The type of the first token in this sample might be defined as TOK_IF corresponding to the "if" keyword. The next token might be TOK_LPAREN, then TOK_IDENT, then TOK_GREATER, then TOK_INT_LITERAL, and so on. What exactly the types should be is defined by you as the author of the lexer (or tokenizer) code. (Note that there are about a million different tools to help you avoid the somewhat tedious task of coming up with these details by hand.)
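For instance, a hypothetical token-type enumeration for this sample might look like the sketch below (the names are made up; yours can be anything):
/* Hypothetical token types for the sample snippet above */
enum TokenType {
    TOK_IF, TOK_FOR, TOK_VAR,            /* keywords                        */
    TOK_IDENT, TOK_INT_LITERAL,          /* tokens that need their text too */
    TOK_LPAREN, TOK_RPAREN, TOK_LBRACE, TOK_RBRACE,
    TOK_GREATER, TOK_LESS,
    TOK_ASSIGN, TOK_PLUS_ASSIGN, TOK_PLUS, TOK_INCREMENT,
    TOK_DOT, TOK_SEMICOLON,
    TOK_EOF
};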
Except for TOK_IDENT and TOK_INT_LITERAL the tokens we've seen so far are defined entirely by their type. For these two, we would need to be able to ask the lexer for the underlying text so that we can evaluate the value of the token.
So a tiny excerpt of the parser dealing with an IF statement in pseudo-code might look something like:
...
switch (lexer.GetNextTokenType())
    case TOK_IF:
    {
        // "if" statement
        if (lexer.GetNextTokenType() != TOK_LPAREN)
            throw SyntaxError('( expected');
        ParseRelationalExpression(lexer);
        if (lexer.GetNextTokenType() != TOK_RPAREN)
            throw SyntaxError(') expected');
        ...
and so on.
If the compiler does choose to actually store the tokens for later reference (and some compilers do, e.g. to allow for more efficient backtracking), one way would be to use a structure similar to the following:
struct {
    int TokenType;
    char* TokenStart;
    int TokenLength;
}
The container for these might be a linked list or std::vector (assuming C++).
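For illustration, assuming the struct above were given a tag, say Token, and source pointed at the raw input buffer, a stored token list for the sample might begin like this (all of these names are hypothetical):
/* Flat token list for "if (fooBar > 5) {" and onwards;
   TokenStart/TokenLength point back into the source buffer */
struct Token tokens[] = {
    { TOK_IF,          source + 0,  2 },  /* "if"     */
    { TOK_LPAREN,      source + 3,  1 },  /* "("      */
    { TOK_IDENT,       source + 4,  6 },  /* "fooBar" */
    { TOK_GREATER,     source + 11, 1 },  /* ">"      */
    { TOK_INT_LITERAL, source + 13, 1 },  /* "5"      */
    { TOK_RPAREN,      source + 14, 1 },  /* ")"      */
    { TOK_LBRACE,      source + 16, 1 },  /* "{"      */
    /* ... one entry per token ... */
};
Note that the nesting in the source does not show up here at all: the lexer produces a flat sequence, and the braces are just ordinary tokens. Recovering the nested structure is the parser's job.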

Why, when I use #define for an int, do I need to wrap it in brackets?

This is my example I've found:
#define kNumberOfViews (37)
#define kViewsWide (5)
#define kViewMargin (2.0)
Why it cannot be like that?
#define kNumberOfViews 37
#define kViewsWide 5
#define kViewMargin 2.0
And what does the k in front mean? Is there a guide for it?
It is not really required in your example, but using parentheses in defines is a useful habit: it makes sure the define states exactly what you mean and protects it from side effects when it is used in code.
E.g.
#define VAR1 40
#define VAR2 20
#define SAVETYPING1 VAR1-VAR2
#define SAVETYPING2 (VAR1-VAR2)
Then in your code
foo(4*SAVETYPING1); // comes out as foo(140)
Is not the same as
foo(4*SAVETYPING2); // comes out as foo(80)
As for what the k prefix means: it is used for constants. There is plenty of discussion on the origins here:
Objective C - Why do constants start with k
#define SOME_VALUE 1234
It is a preprocessor directive. It means that before your code is compiled, all occurrences of SOME_VALUE will be replaced by 1234. An alternative to this would be
const int kSomeValue = 1234;
For discussion about advantages of one or the other see
#define vs const in Objective-C
As for the brackets: in more complex cases they are necessary precisely because the preprocessor does plain textual copy-paste with #define. Consider this example:
#define BIRTH_YEAR 1990
#define CURRENT_YEAR 2015
#define AGE CURRENT_YEAR - BIRTH_YEAR
...
// later in the code
int ageInMonths = AGE * 12;
Here one might expect that ageInMonths = 25 * 12, but instead it is computed as ageInMonths = 2015 - 1990 * 12 = 2015 - (1990 * 12). That is why the correct definition of AGE should have been
#define AGE (CURRENT_YEAR - BIRTH_YEAR)
As for naming conventions, AFAIK #define constants are written in all capitals with underscores, and const constants use camelCase names with a leading k.
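For example (the names here are purely illustrative):
#define MAX_BUFFER_SIZE 1024              /* macro constant: capitals with underscores   */
const int kMaxBufferSize = 1024;          /* const constant: camelCase with a leading k  */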
k is just a Hungarian-notation convention to indicate that the value is a constant. Personally I find it dumb, but it is a convention that many people follow. It isn't required for the code to work at all.
I am not sure why the examples you saw had parens around them; for simple literal values like these there is no need for parentheses. They only start to matter when the macro expands to an expression, as the other answers show.

Precedence and associativity of prefix and postfix in C

#include <stdio.h>

int main()
{
    char arr[] = "geeksforgeeks";
    char *ptr = arr;
    while (*ptr != '\0')
        ++*ptr++;
    printf("%s %s", arr, ptr);
    getchar();
    return 0;
}
The statement inside the while loop, ++*ptr++, behaves in a way that I don't understand. The post-increment should have been evaluated first because of its higher precedence, and the first output character should have been f (incrementing e). But that does not happen. To understand it I changed the statement to ++*(ptr++), so that it might give the output I expect (ffltgpshfflt is the output I expected, but the actual output is hffltgpshfflt). But still the output does not change. The () operator has higher precedence than the pre-increment, so why does the output not change?
We have:
++*ptr++
First, the postfix operator is applied, as you said. However, per the definition of the postfix increment operator, ptr++ evaluates to the original value of ptr and increases ptr by 1. The expression does not evaluate to the increased value, but to the original one.
So *(ptr++) evaluates to the same value as *ptr, the former just also increases ptr. Therefore, the first element in the array is modified in the first pass of the algorithm.
The parentheses don't matter because the postfix increment already has precedence.
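Spelled out, ++*ptr++ is equivalent to this more explicit loop (just a sketch for clarity):
while (*ptr != '\0') {
    ++*ptr;   /* increment the character ptr currently points to */
    ptr++;    /* then advance the pointer to the next character  */
}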
If you replace this with:
++*++ptr
you get
gffltgpshfflt
where the order of application of the operators is the same; the difference is that prefix ++ works differently from postfix ++ in that it evaluates to the incremented value. Note that this also clobbers the null terminator: ptr is advanced onto the terminator and the terminator itself is incremented before the loop condition gets to check it again.
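Spelled out, ++*++ptr corresponds to this sketch:
while (*ptr != '\0') {
    ++ptr;    /* advance the pointer first                             */
    ++*ptr;   /* then increment the char it now points to; on the last */
              /* iteration this is the '\0' terminator                 */
}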
