I'm new to Clang and am trying to write some clang-tidy checks. I want to find something that works as a "variable table", so that I can check whether some names are well-formed.
My intuition is like this:
Writing a redefinition will sometimes cause an error, which Clang's diagnostics report, like:
int main(){
int x;
int x; // error: redefinition
return 0;
}
From my perspective, Clang may keep a dynamic variable table to check whether a new definition is compatible, an overload, or an error.
I dug into the Clang source code and found the following:
IdentifierTable, which is kept by the preprocessor and records all identifiers, but does not do any semantic legality checking.
DeclContext, which seems to be an interface for users and a product of semantic checking.
My questions are:
How does Clang do the legality checking?
Can I get the variable table (if such a thing exists)?
If I cannot, how can I know which variables are reachable from a given location?
Thanks for your suggestions!
TL;DR: see Answers below.
Discussion
All of your questions relate to one term from the C standard, identifier, defined in C99-6.2.1-p1:
An identifier can denote an object; a function; a tag or a member of a structure, union, or
enumeration; a typedef name; a label name; a macro name; or a macro parameter.
Each identifier has its own scope, according to C99-6.2.1-p2:
For each different entity that an identifier designates, the identifier is visible (i.e., can be
used) only within a region of program text called its scope.
Since you are interested in variables inside a function (i.e., int x), they have block scope.
There is a process called linkage for identifiers that are declared in different scopes or more than once in the same scope, according to C99-6.2.2-p2:
An identifier declared in different scopes or in the same scope more than once can be
made to refer to the same object or function by a process called linkage.
This is exactly what imposes the constraint that one object has only one definition, or in your words, the definition legality check. Therefore, compiling the following code
/* file_1.c */
int a = 123;
/* file_2.c */
int a = 456;
would cause a linkage error:
% clang file_*
...
ld: 1 duplicate symbol
clang: error: linker command failed with exit code 1
However, in your case the identifiers are inside the same function body, which looks more like the following:
/* file.c */
int main(){
int b;
int b=1;
}
Here identifier b has a block scope, which shall have no linkage, according to C99-6.2.2-p6:
The following identifiers have no linkage: an identifier declared to be anything other than
an object or a function; an identifier declared to be a function parameter; a block scope
identifier for an object declared without the storage-class specifier extern.
Having no linkage means that the rules mentioned above do not apply to it; that is, the error should not be a linkage error.
It is natural to think of it as a redefinition error. The rule of that name, the One Definition Rule, belongs to C++ and NOT to C (check this or this for more details). In C, duplicate identifiers with no linkage in the same block scope violate a constraint in C99-6.7-p3 (an identifier with no linkage shall have no more than one declaration in the same scope and name space, except for tags), so the compiler must issue a diagnostic, but the wording of that diagnostic is left to the implementation. This might be the reason why the errors clang reports for the above code (file.c) differ from the ones reported by gcc, as shown below:
(note the term 'with no linkage' used by gcc)
# ---
# GCC (gcc version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04))
# ---
$ gcc file.c
file.c: In function ‘main’:
file.c:4:6: error: redeclaration of ‘b’ with no linkage
int b=1;
^
file.c:3:6: note: previous declaration of ‘b’ was here
int b;
^
# ---
# CLANG (Apple clang version 13.0.0 (clang-1300.0.29.3))
# ---
% clang file.c
file.c:4:6: error: redefinition of 'b'
int b=1;
^
file.c:3:6: note: previous definition is here
int b;
^
1 error generated.
Answers
With all of the above, I think there is enough to answer your questions:
How does clang perform the definition legality check?
For global variables, both clang and gcc follow the C standard rules; that is, they handle the so-called "redefinition errors" through the process called linkage. For local variables, which have no linkage, the standard only requires a diagnostic for a second declaration in the same scope (C99-6.7-p3); how that diagnostic is worded is up to the implementation.
In fact, both treat the "redefinition" as an error. Although variable names inside a function body vanish after compilation (you can verify this in the assembly output), it is undoubtedly more natural and helpful to require them to be unique.
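To make the idea of a per-block "variable table" concrete, here is a minimal sketch in C. It is not how clang or gcc implement the check; the names scope_table, scope_declare and so on are made up for illustration. It only shows the rule in question: a second declaration in the same block is rejected, while shadowing a name from an outer block is allowed.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical toy "variable table": one stack of names, each tagged
   with the block depth at which it was declared. */
#define MAX_SYMS 64

struct scope_table {
    const char *name[MAX_SYMS];
    int depth[MAX_SYMS];
    int count;      /* number of live entries */
    int cur_depth;  /* current block nesting level */
};

void scope_init(struct scope_table *t) { t->count = 0; t->cur_depth = 0; }

void scope_enter(struct scope_table *t) { t->cur_depth++; }

/* Leaving a block drops every name that was declared inside it. */
void scope_leave(struct scope_table *t) {
    assert(t->cur_depth > 0);
    while (t->count > 0 && t->depth[t->count - 1] == t->cur_depth)
        t->count--;
    t->cur_depth--;
}

/* Returns 1 on success, 0 if the name already exists in the current
   block (the "redefinition" case). Outer names may be shadowed. */
int scope_declare(struct scope_table *t, const char *name) {
    for (int i = t->count - 1; i >= 0 && t->depth[i] == t->cur_depth; i--)
        if (strcmp(t->name[i], name) == 0)
            return 0;
    assert(t->count < MAX_SYMS);
    t->name[t->count] = name;
    t->depth[t->count] = t->cur_depth;
    t->count++;
    return 1;
}

/* Returns 1 if the name is visible from the current position. */
int scope_visible(const struct scope_table *t, const char *name) {
    for (int i = t->count - 1; i >= 0; i--)
        if (strcmp(t->name[i], name) == 0)
            return 1;
    return 0;
}
```

Declaring "b" twice at the same depth makes the second scope_declare call return 0, which corresponds to the error both compilers report for file.c; calling scope_enter first makes the same declaration succeed as a shadowing one.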
Can I get the variable table (if such a thing exists)?
I do not know much about clang internals, but from the standard quoted above and an analysis of the compilation process, we can infer that IdentifierTable might not fit your needs well, since it exists in the "preprocessing" stage, which comes before the "linking" stage. To see how the clang toolchain deals with duplicate variables (or more formally, symbols) and how it stores them, you might want to look at the lld project, or in particular, its SymbolTable.
Related
I'm trying to find any information about the parentheses syntax for macro arguments in the GNU Assembler. E.g. I have the following code:
.macro do_block, enc, in, rounds, rk, rkp, i
eor \in\().16b, \in\().16b, v15.16b
...
(taken from here)
What do the parentheses in \in\().16b mean? Where can I find documentation for this syntax?
Okay, I've found the answer. This is a special syntax that separates a macro argument name from the text that follows it.
From the documentation:
Note that since each of the macargs can be an identifier exactly as any other one permitted by the target architecture, there may be occasional problems if the target hand-crafts special meanings to certain characters when they occur in a special position. For example:
...
problems might occur with the period character (‘.’) which is often allowed inside opcode names (and hence identifier names). So for example constructing a macro to build an opcode from a base name and a length specifier like this:
.macro opcode base length
\base.\length
.endm
and invoking it as ‘opcode store l’ will not create a ‘store.l’ instruction but instead generate some kind of error as the assembler tries to interpret the text \base.\length.
The string \() can be used to separate the end of a macro argument from the following text. eg:
.macro opcode base length
\base\().\length
.endm
I was reading about some common parsing concepts in compilers and came across the lookahead symbol and the read-ahead symbol. I searched and read about them, but I am stuck on why we need both of them. I would be grateful for any suggestions.
Lookahead symbol: when the node being considered in the parse tree is for a terminal, and the terminal matches the lookahead symbol, then we advance in both the parse and the input.
Read-ahead symbol: the lexical analyzer may need to read some characters before it can decide on the token to be returned.
One of these is about parsing and refers to the next token to be produced by the lexical scanner. The other one, which is less formal, is about lexical analysis and refers to the next character in the input stream. It should be clear which is which.
Note that while most parsers only require a single lookahead token, it is not uncommon for lexical analysis to have to backtrack, which is equivalent to examining several unconsumed input characters.
I hope I got your question right.
Consider C.
It has several punctuators that begin the same way:
+, ++, +=
-, --, -=, ->
<, <=, <<, <<=
...
In order to figure out which one it is when you see the first + or - or <, you need to look ahead one character in the input (and then maybe one more for <<=).
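The character-level lookahead described above can be sketched as follows. This is a hypothetical fragment covering only the '<' family, not a full lexer; the token names and the function scan_less are invented for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical token kinds for the '<' family only. */
enum tok { TOK_LT, TOK_LE, TOK_SHL, TOK_SHL_ASSIGN, TOK_OTHER };

/* Scans one punctuator starting at s[0] and stores the number of
   characters consumed in *len. s must be NUL-terminated, so peeking at
   s[1] (and then s[2]) is safe whenever the previous character was '<'. */
enum tok scan_less(const char *s, size_t *len) {
    assert(s != NULL && len != NULL);
    if (s[0] != '<') { *len = 0; return TOK_OTHER; }
    if (s[1] == '=') { *len = 2; return TOK_LE; }          /* "<="  */
    if (s[1] == '<') {
        if (s[2] == '=') { *len = 3; return TOK_SHL_ASSIGN; } /* "<<=" */
        *len = 2; return TOK_SHL;                           /* "<<"  */
    }
    *len = 1; return TOK_LT;                                /* "<"   */
}
```

The scanner never commits to a token until it has looked at the next one or two characters, which is exactly the read-ahead the question asks about.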
A similar thing can happen at a higher level:
{
ident1 ident2;
ident3;
ident4:;
}
Here ident1, ident3 and ident4 can begin a declaration, an expression or a label. You can't tell which one immediately. You can consult your existing declarations to see if ident1 or ident3 is already known (as a type or variable/function/enumeration), but it's still ambiguous because a colon may follow and if it does, it's a label because it's permitted to use the same identifier for both a label and a type/variable/function/enumeration (those two name spaces do not intersect), e.g.:
{
typedef int ident1;
ident1 ident2; // same as int ident2
int ident3 = 0;
ident3; // unused expression of value 0
ident1:; // unused label
ident2:; // unused label
ident3:; // unused label
}
So, you may very well need to look ahead a character or a token (or "unread" one) to deal with situations like these.
I am confused as to whether the following is allowed:
(I am using declaration in the forloop rule; however, declaration also defines how to declare other things. Could this be error-checked later in the compiler? Am I clear?)
declaration :
operand ASSIGNMENTOPERATOR variable var_type CONST?
|operations ASSIGNMENTOPERATOR variable var_type CONST?
|funcall ASSIGNMENTOPERATOR variable var_type CONST?
|(funcall|operand|NOINDEXARRAY) ASSIGNMENTOPERATOR variable var_type ARRAY CONST? ;
forloop :
block
(LPARENS ((number_operation ASSIGNMENTOPERATOR variable)|number_functions)
SEMICOLON bool_operation
SEMICOLON declaration
RPARENS
)
'for'
;
UPDATE: I know that it would work when I supply the right type of declaration inside the for loop. The question is what happens if I don't?
It seems what you have in mind is a semantic phase, which is very typical in parser setups. Parsing the input is only a small part of the work. Usually you have a step after that to validate your parse tree (e.g. look for duplicate variable names or unknown symbols and check other conditions). This is usually called the semantic phase (parsing is the syntactic phase).
You can use this semantic phase for all kinds of error checking, including your declaration check (whatever you want to check there; that's not clear from your question).
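As a rough illustration of such a semantic check, the sketch below is written in C rather than as an ANTLR listener, and every name in it (decl_kind, for_node, check_for_loop) is hypothetical. It assumes the parser recorded which alternative of declaration was matched, and it assumes, purely for the example, that only a plain scalar declaration is acceptable in the loop header.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical kinds a parser might record for the 'declaration' rule. */
enum decl_kind { DECL_SCALAR, DECL_CONST_SCALAR, DECL_ARRAY };

struct decl_node {
    enum decl_kind kind;
    const char *variable;
};

struct for_node {
    const struct decl_node *update; /* the declaration in the loop header */
};

/* Semantic check run after parsing: returns NULL if the loop is fine,
   otherwise a message describing the problem. Which kinds are rejected
   is an assumption made for this example. */
const char *check_for_loop(const struct for_node *loop) {
    assert(loop != NULL && loop->update != NULL);
    switch (loop->update->kind) {
    case DECL_SCALAR:
        return NULL;
    case DECL_CONST_SCALAR:
        return "loop update declares a constant";
    case DECL_ARRAY:
        return "loop update declares an array";
    }
    return "unknown declaration kind";
}
```

The grammar stays permissive and accepts any declaration; the tree walk afterwards decides which ones make sense in that position.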
I am using Apple's Unified Logging for the first time, and have had some success. However, I can't get it to work with the suggested %{timeval}.*P custom format specifier.
My first attempt was something like:
struct timeval some_time;
// ... populate `some_time`
os_log_info(OS_LOG_DEFAULT, "a thing happened at %{timeval}.*P", some_time);
But clang reports an error: field precision should have type 'int', but argument has type 'struct timeval'.
I believe the problem is that clang doesn't understand the os_log formatting rules, and if I could work out how to have clang suppress the error via clang diagnostic push etc., I would. The underlying macro OS_LOG_CALL_WITH_FORMAT appears to attempt that, without luck.
Am I misusing the timeval format specifier?
This is with Xcode 8.3.1, I haven't tried earlier versions of Xcode.
TL;DR
Working code
os_log_info(OS_LOG_DEFAULT, "a thing happened at %{timeval}.16P", &some_time);
OR
os_log_info(OS_LOG_DEFAULT, "a thing happened at %{timeval}.*P", (int)sizeof(some_time), &some_time);
The problem here is that the wildcard (*) in the format string consumes an int argument. So %{timeval}.*P in fact takes 2 arguments: an int and a pointer to the timeval struct.
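The same convention exists in standard printf, where a * precision also takes its value from an extra int argument. The sketch below uses plain snprintf with %.*s, not os_log, so it only illustrates the argument order; format_prefix is a made-up helper name.

```c
#include <assert.h>
#include <stdio.h>

/* Formats the first n characters of text into out. The '*' in "%.*s"
   consumes one int argument (the precision) before the string itself,
   just as "%.*P" expects an int length before the pointer. */
int format_prefix(char *out, size_t out_size, int n, const char *text) {
    assert(out != NULL && text != NULL);
    return snprintf(out, out_size, "%.*s", n, text);
}
```

Leaving out the int and passing only the string would be the same mistake as passing only the struct to %{timeval}.*P.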
Unfortunately, this is very badly documented, and even Apple's sample code for os_log doesn't have examples of it.
So far the best way to see examples is to look at the tests in the LLVM source code.
format-strings-oslog.m
__builtin_os_log_format(buf, "%d", i);
__builtin_os_log_format(buf, "%P", p); // expected-warning {{using '%P' format specifier without precision}}
__builtin_os_log_format(buf, "%.10P", p);
__builtin_os_log_format(buf, "%.*P", p); // expected-warning {{field precision should have type 'int', but argument has type 'void *'}}
__builtin_os_log_format(buf, "%.*P", i, p);
__builtin_os_log_format(buf, "%.*P", i, i); // expected-warning {{format specifies type 'void *' but the argument has type 'int'}}
The timeval format specifier is correct according to the documentation. Apparently Clang is not yet aware of these new format strings as of Xcode 8.3.2. Deactivating the check via #pragma clang diagnostic ignored "-Wformat" would be a futile attempt.
Apparently there is no solution to this; in the meantime, avoid using these new fancy format specifiers. Specifically, avoid format strings that end with .*P. However, the d formats seem to work.
I've recently been writing a parser for a language based on C. I'm using CUP (Yacc for Java).
I want to implement "The lexer hack" (http://eli.thegreenplace.net/2011/05/02/the-context-sensitivity-of-c%E2%80%99s-grammar-revisited/ or https://en.wikipedia.org/wiki/The_lexer_hack) to distinguish typedef names from variable/function names etc. To allow declaring a variable with the same name as a type declared earlier (example from the first link):
typedef int AA;
void foo() {
AA aa; /* OK - define variable aa of type AA */
float AA; /* OK - define variable AA of type float */
}
we have to introduce some new productions in which a variable/function name can be either IDENTIFIER or TYPENAME. And this is where the difficulties occur: conflicts in the grammar.
I was trying not to use this messy Yacc grammar for gcc 3.4 (http://yaxx.googlecode.com/svn-history/r2/trunk/gcc-3.4.0/gcc/c-parse.y), but this time I have no idea how to resolve the conflicts on my own. I took a look at the Yacc grammar:
declarator:
after_type_declarator
| notype_declarator
;
after_type_declarator:
...
| TYPENAME
;
notype_declarator:
...
| IDENTIFIER
;
fndef:
declspecs_ts setspecs declarator
// some action code
// the rest of production
...
setspecs: /* empty */
// some action code
declspecs_ts means declaration_specifiers where
"Whether a type specifier has been seen; after a type specifier, a typedef name is an identifier to redeclare (_ts or _nots)."
From declspecs_ts we can reach
typespec_nonreserved_nonattr:
TYPENAME
...
;
At first glance I can't see why no shift/reduce conflicts appear!
setspecs is empty, so declspecs_ts is followed directly by declarator, and we would expect the parser to be confused about whether TYPENAME comes from declspecs_ts or from declarator.
Can anyone explain this briefly (or even precisely)? Thanks in advance!
EDIT:
Useful link: http://www.gnu.org/software/bison/manual/bison.html#Semantic-Tokens
I can't speak for the specific code.
But the basic trick is that the C lexer inspects every IDENTIFIER and decides whether it might be the name of a typedef. If so, it changes the token type to TYPENAME and hands it to the parser.
How is the lexer to know what identifiers are typedefs? The parser must in effect tell it, by capturing typedef information as it runs. Somewhere in the grammar related to declarations, there must be an action to provide this information. I would have expected it to be attached to the grammar rules for, well, typedef declarations.
You didn't show what "setspecs" does; maybe that's the place. A common trick used with LR parser generators is to introduce a grammar rule E with an empty right-hand side (your example "setspecs"?), to be invoked in the middle of some other grammar rule (your example "fndef") just to enable access to a semantic action in the middle of processing that rule.
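The feedback loop between parser and lexer can be sketched like this. It is a simplified illustration, not gcc's code: register_typedef and classify_identifier are hypothetical names, and the table is a single global list that ignores scopes, which a real implementation would have to track.

```c
#include <assert.h>
#include <string.h>

enum token_kind { IDENTIFIER, TYPENAME };

/* Hypothetical table of typedef names that the parser fills in. */
#define MAX_TYPEDEFS 32
static const char *typedef_names[MAX_TYPEDEFS];
static int typedef_count = 0;

/* Called from the parser action attached to a typedef declaration. */
void register_typedef(const char *name) {
    assert(typedef_count < MAX_TYPEDEFS);
    typedef_names[typedef_count++] = name;
}

/* Called by the lexer for every identifier-shaped lexeme: the token
   kind depends on what the parser has registered so far. */
enum token_kind classify_identifier(const char *lexeme) {
    for (int i = 0; i < typedef_count; i++)
        if (strcmp(typedef_names[i], lexeme) == 0)
            return TYPENAME;
    return IDENTIFIER;
}
```

Before "typedef int AA;" has been reduced, the lexeme AA is classified as IDENTIFIER; afterwards the same lexeme comes back as TYPENAME, so the grammar never has to guess.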
This whole trick is to get around parsing ambiguity if you can't tell typedefs from other identifiers. If your parser tolerates ambiguity, you don't need this hack at all; just parse, and build ASTs with both (sub)parses. After you acquire the AST, a tree walk can find type information and eliminate inconsistent subparses. We do this with GLR for both C and C++, and it beautifully separates parsing from name resolution.