Clang: How to get the macro name used for size of a constant size array declaration - clang

TL;DR;
How to get the macro name used for size of a constant size array declaration, from a callExpr -> arg_0 -> DeclRefExpr.
Detailed Problem statement:
Recently I started working on a challenge which requires source to source transformation tool for modifying
specific function calls with an additional argument. Reasearching about the ways i can acheive introduced me
to this amazing toolset Clang. I've been learning how to use different tools provided in libtooling to
acheive my goal. But now i'm stuck at a problem, seek your help here.
Considere the below program (dummy of my sources), my goal is to rewrite all calls to strcpy
function with a safe version of strcpy_s and add an additional parameter in the new function call
i.e - destination pointer maximum size. so, for the below program my refactored call would be like
strcpy_s(inStr, STR_MAX, argv[1]);
I wrote a RecursiveVisitor class and inspecting all function calls in VisitCallExpr method, to get max size
of the dest arg i'm getting VarDecl of the first agrument and trying to get the size (ConstArrayType). Since
the source file is already preprocessed i'm seeing 2049 as the size, but what i need is the macro STR_MAX in
this case. how can i get that?
(Creating replacements with this info and using RefactoringTool replacing them afterwards)
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#define STR_MAX 2049
int main(int argc, char **argv){
char inStr[STR_MAX];
if(argc>1){
//Clang tool required to transaform the below call into strncpy_s(inStr, STR_MAX, argv[1], strlen(argv[1]));
strcpy(inStr, argv[1]);
} else {
printf("\n not enough args");
return -1;
}
printf("got [%s]", inStr);
return 0;
}

As you noticed correctly, the source code is already preprocessed and it has all the macros expanded. Thus, the AST will simply have an integer expression as the size of array.
A little bit of information on source locations
NOTE: you can skip it and proceed straight to the solution below
The information about expanded macros is contained in source locations of AST nodes and usually can be retrieved using Lexer (Clang's lexer and preprocessor are very tightly connected and can be even considered one entity). It's a bare minimum and not very obvious to work with, but it is what it is.
As you are looking for a way to get the original macro name for a replacement, you only need to get the spelling (i.e. the way it was written in the original source code) and you don't need to carry much about macro definitions, function-style macros and their arguments, etc.
Clang has two types of different locations: SourceLocation and CharSourceLocation. The first one can be found pretty much everywhere through the AST. It refers to a position in terms of tokens. This explains why begin and end positions can be somewhat counterintuitive:
// clang::DeclRefExpr
//
// ┌─ begin location
foo(VeryLongButDescriptiveVariableName);
// └─ end location
// clang::BinaryOperator
//
// ┌─ begin location
int Result = LHS + RHS;
// └─ end location
As you can see, this type of source location points to the beginning of the corresponding token. CharSourceLocation on the other hand, points directly to the characters.
So, in order to get the original text of the expression, we need to convert SourceLocation's to CharSourceLocation's and get the corresponding text from the source.
The solution
I've modified your example to show other cases of macro expansions as well:
#define STR_MAX 2049
#define BAR(X) X
int main() {
char inStrDef[STR_MAX];
char inStrFunc[BAR(2049)];
char inStrFuncNested[BAR(BAR(STR_MAX))];
}
The following code:
// clang::VarDecl *VD;
// clang::ASTContext *Context;
auto &SM = Context->getSourceManager();
auto &LO = Context->getLangOpts();
auto DeclarationType = VD->getTypeSourceInfo()->getTypeLoc();
if (auto ArrayType = DeclarationType.getAs<ConstantArrayTypeLoc>()) {
auto *Size = ArrayType.getSizeExpr();
auto CharRange = Lexer::getAsCharRange(Size->getSourceRange(), SM, LO);
// Lexer gets text for [start, end) and we want him to grab the end as well
CharRange.setEnd(CharRange.getEnd().getLocWithOffset(1));
auto StringRep = Lexer::getSourceText(CharRange, SM, LO);
llvm::errs() << StringRep << "\n";
}
produces this output for the snippet:
STR_MAX
BAR(2049)
BAR(BAR(STR_MAX))
I hope this information is helpful. Happy hacking with Clang!

Related

zig structs, pointers, field access

I was trying to implement vector algebra with generic algorithms and ended up playing with iterators. I have found two examples of not obvious and unexpected behaviour:
if I have pointer p to a struct (instance) with field fi, I can access the field as simply as p.fi (rather than p.*.fi)
if I have a "member" function fun(this: *Self) (where Self = #This()) and an instance s of the struct, I can call the function as simply as s.fun() (rather than (&s).fun())
My questions are:
is it documented (or in any way mentioned) somewhere? I've looked through both language reference and guide from ziglearn.org and didn't find anything
what is it that we observe in these examples? syntactic sugar for two particular cases or are there more general rules from which such behavior can be deduced?
are there more examples of weird pointers' behaviour?
For 1 and 2, you are correct. In Zig the dot works for both struct values and struct pointers transparently. Similarly, namespaced functions also do the right thing when invoked.
The only other similar behavior that I can think of is [] syntax used on arrays. You can use both directly on an array value and an array pointer interchangeably. This is somewhat equivalent to how the dot operates on structs.
const std = #import("std");
pub fn main() !void {
const arr = [_]u8{1,2,3};
const foo = &arr;
std.debug.print("{}", .{arr[2]});
std.debug.print("{}", .{foo[2]});
}
AFAIK these are the only three instances of this behavior. In all other cases if something asks for a pointer you have to explicitly provide it. Even when you pass an array to a function that accepts a slice, you will have to take the array's pointer explicitly.
The authoritative source of information is the language reference but checking it quickly, it doesn't seem to have a dedicated paragraph. Maybe there's some example that I missed though.
https://ziglang.org/documentation/0.8.0/
I first learned this syntax by going through the ziglings course, which is linked to on ziglang.org.
in exercise 43 (https://github.com/ratfactor/ziglings/blob/main/exercises/043_pointers5.zig)
// Note that you don't need to dereference the "pv" pointer to access
// the struct's fields:
//
// YES: pv.x
// NO: pv.*.x
//
// We can write functions that take pointer arguments:
//
// fn foo(v: *Vertex) void {
// v.x += 2;
// v.y += 3;
// v.z += 7;
// }
//
// And pass references to them:
//
// foo(&v1);
The ziglings course goes quite in-depth on a few language topics, so it's definitely work checking out if you're interested.
With regards to other syntax: as the previous answer mentioned, you don't need to dereference array pointers. I'm not sure about anything else (I thought function pointers worked the same, but I just ran some tests and they do not.)

lex & yacc get current position

In lex & yacc there is a macro called YY_INPUT which can be redefined, for example in a such way
#define YY_INPUT(buf,result,maxlen) do { \
const int n = gzread(gz_yyin, buf, maxlen); \
if (n < 0) { \
int errNumber = 0; \
reportError( gzerror(gz_yyin, &errNumber)); } \
\
result = n > 0 ? n : YY_NULL; \
} while (0)
I have some grammar rule which called YYACCEPT macro.
If after YYACCEPT I called gztell (or ftell), then I got a wrong number, because parser already read some unnecessary data.
So how I can get current position if I have some rule which called YYACCEPT in it(one bad solution will be to read character by character)
(I have already done something like this:
#define YY_USER_ACTION do { \
current_position += yyleng; \
} while (0)
but seems its not work
)
You have to keep track of the offset yourself. A simple but annoying solution is to put:
offset += yyleng;
in every flex action. Fortunately, you can do this implicitly by defining the YY_USER_ACTION macro, which is executed just before the token action.
That might still not be right for your grammar, because bison often reads one token ahead. So you'll also need to attach the value of offset to each lexical token, most conveniently using the location facility (yylloc).
Edit: added more details on location tracking.
The following has not been tested. You should read the sections in both the flex and the bison manual about location tracking.
The yylloc global variable and its default type are included in the generated bison code if you use the --locations command line option or the %locations directive, or if you simply refer to a location value in some rule, using the # syntax, which is analogous to the $ syntax (that is, #n is the location value of the right-hand-side object whose semantic value is $n). Unfortunately, the default type for yylloc uses ints, which are not wide enough to hold a file offset, although you might not be planning on parsing files for which this matters. In any event, it's easy enough to change; you merely have to #define the YYLTYPE macro at the top of your bison file. The default YYLTYPE is:
typedef struct YYLTYPE
{
int first_line;
int first_column;
int last_line;
int last_column;
} YYLTYPE;
For a minimum modification, I'd suggest keeping the names unchanged; otherwise you'll also need to fix the YYLLOC_DEFAULT macro in your bison file. The default YYLLOC_DEFAULT ensures that non-terminals get a location value whose first_line and first_column members come from the first element in the non-terminal's RHS, and whose last_line and last_column members come from the last element. Since it is a macro, it will work with any assignable type for the various members, so it will be sufficient to change the column members to long, size_t or offset_t, as you feel appropriate:
#define YYLTYPE yyltype;
typedef struct yyltype {
int first_line;
offset_t first_column;
int last_line;
offset_t last_column;
} yyltype;
Then in your flex input, you could define the YY_USER_ACTION macro:
offset_t offset;
extern YYLTYPE yylloc;
#define YY_USER_ACTION \
offset += yyleng; \
yylloc.last_line = yylineno; \
yylloc.last_column = offset;
With all that done and appropriate initialization, you should be able to use the appropriate #n.last_column in the ACCEPT rule to extract the offset of the end of the last token in the accepted input.

Import and write GeoTIFF in Octave

I am using MATLAB in my office and Octave when I am at home. Although they are very similar, I was trying to do something I would expected to be very easy and obvious, but found it really annoying. I can't find out how to import TIFF images in Octave. I know the MATLAB geotiffread function is not present, but I thought there would be another method.
I could also skip importing them, as I can work with the imread function in some cases, but then the second problem would be that I can't find a way to write a georeferenced TIFF file (in MATLAB I normally call geotiffwrite with geotiffinfo inputs inside). My TIFF files are usually 8 bit unsigned integer or 32 bit signed integer. I hope someone can suggest a way to solve this problem. I also saw this thread but did not understand if it is possible to use the code proposed by Ashish in Octave.
You may want to look at the mapping library in Octave.
You can also use the raster functions to work with GeoTiffs
Example:
pkg load mapping
filename=”C:\\sl\SDK\\DTED\\n45_w122_1arc_v2.tif”
rasterinfo (filename)
rasterdraw (filename)
The short answer is you can't do it in Octave out of the box. But this is not because it is impossible to do it. It is simply because no one has yet bothered to implement it. As a piece of free software, Octave has the features that its users are willing to spend time or money implementing.
About writing of signed 32-bit images
As of version 3.8.1, Octave uses either GraphicsMagick or ImageMagick to handle the reading and writing of images. This introduces some problems. The number 1 is that your precision is limited to how you built GraphicsMagick (its quantum-depth option). In addition, you can only write unsigned integers. Hopefully this will change in the future but since not many users require it, it's been this way until now.
Dealing with geotiff
Provided you know C++, you can write this functions yourself. This shouldn't be too hard since there is already libgeotiff, a C library for it. You would only need to write a wrapper as an Octave oct function (of course, if you don't know C or C++, then this "only" becomes a lot of work).
Here is the example oct file code which needs to be compiled. I have taken reference of https://gerasimosmichalitsianos.wordpress.com/2018/01/08/178/
#include <octave/oct.h>
#include "iostream"
#include "fstream"
#include "string"
#include "cstdlib"
#include <cstdio>
#include "gdal_priv.h"
#include "cpl_conv.h"
#include "limits.h"
#include "stdlib.h"
using namespace std;
typedef std::string String;
DEFUN_DLD (test1, args, , "write geotiff")
{
NDArray maindata = args(0).array_value ();
const dim_vector dims = maindata.dims ();
int i,j,nrows,ncols;
nrows=dims(0);
ncols=dims(1);
//octave_stdout << maindata(i,0);
NDArray transform1 = args(1).array_value ();
double* transform = (double*) CPLMalloc(sizeof(double)*6);
float* rowBuff = (float*) CPLMalloc(sizeof(float)*ncols);
//GDT_Float32 *rowBuff = CPLMalloc(sizeof(GDT_Float32)*ncols);
String tiffname;
tiffname = "nameoftiff2.tif";
cout<<"The transformation matrix is";
for (i=0; i<6; i++)
{
transform[i]=transform1(i);
cout<<transform[i]<<" ";
}
GDALAllRegister();
CPLPushErrorHandler(CPLQuietErrorHandler);
GDALDataset *geotiffDataset;
GDALDriver *driverGeotiff;
GDALRasterBand *geotiffBand;
OGRSpatialReference oSRS;
char **papszOptions = NULL;
char *pszWKT = NULL;
oSRS.SetWellKnownGeogCS( "WGS84" );
oSRS.exportToWkt( &pszWKT );
driverGeotiff = GetGDALDriverManager()->GetDriverByName("GTiff");
geotiffDataset = (GDALDataset *) driverGeotiff->Create(tiffname.c_str(),ncols,nrows,1,GDT_Float32,NULL);
geotiffDataset->SetGeoTransform(transform);
geotiffDataset->SetProjection(pszWKT);
//CPLFree( pszSRS_WKT );
cout<<" \n Number of rows and columns in array are: \n";
cout<<nrows<<" "<<ncols<<"\n";
for (i=0; i<nrows; i++)
{
for (j=0; j <ncols; j++)
rowBuff[j]=maindata(i,j);
//cout<<rowBuff[0]<<"\n";
geotiffDataset->GetRasterBand(1)->RasterIO(GF_Write,0,i,ncols,1,rowBuff,ncols,1,GDT_Float32,0,0);
}
GDALClose(geotiffDataset) ;
CPLFree(transform);
CPLFree(rowBuff);
CPLFree(pszWKT);
GDALDestroyDriverManager();
return octave_value_list();
}
it can be compiled and run using following
mkoctfile -lgdal test1.cc
aa=rand(50,53);
b=[60,1,0,40,0,-1];
test1(aa,b);

compilers - Instruction Selection for type declarations in AST

I'm learning compilers and creating a code generator for a simple language that deals with two types: characters and integers.
After the user input has been scanned by the scanner and then parsed by the parser, I get an AST representation of the input. I have made a code generation for an even simpler language which only processes expressions with integers, operators and variables.
However with this new language I sometimes get a subtree for a type declaration, like this:
(IS TYPE (x) (INT))
which says x is of type INT.
Should there be a case in my code generator which deals with these type declarations? Or is this simply for the semantic analyzer to type check, so I should just assume the types have been checked and ignore this part of the tree and simply assign the value for x?
Both situations are possible, you need to describe more about your language, to see if you really need to add that feature to your code generator, or skip it as unnecessary, and avoid extra work with this difficult and interesting topic of designing a programming language.
Is you "code generator" a program that recieves as an input code in a programming language (maybe small one) and outputs code in another programming language (maybe small one) ?
This tool is usually called a "translator".
Is you "code generator" a program that receive as an input a programming language and outputs assembler / bytecode like programming language ?
This tool is usually called a "compiler".
Note: "pile" is a synonym for "stack".
Usually an A.S.T., stores the type of an operation, or function call. Example, in c:
...
int a = 3;
int b = 5;
float c = (float)(a * b);
...
The last line, generates an A.S.T. similar to this, (skip A.S.T. for other lines):
..................................................................
..................................................................
......................+--------------+............................
......................| [root] |............................
......................| (no type) = |............................
......................+------+-------+............................
.............................|....................................
.................+-----------+------------+.......................
.................|........................|.......................
...........+-----+-----+....+-------------+-------------+.........
...........| (int) c |....| (float) (cast operation) |.........
...........+-----------+....+-------------+-------------+.........
..........................................|.......................
....................................+-----+-----+.................
....................................| (int) () |.................
....................................+-----+-----+.................
..........................................|.......................
....................................+-----+-----+.................
....................................| (int) * |.................
....................................+-----+-----+.................
..........................................|.......................
..............................+-----------+-----------+...........
..............................|.......................|...........
........................+-----+-----+...........+-----+-----+.....
........................| (int) a |...........| (float) b |.....
........................+-----------+...........+-----------+.....
..................................................................
..................................................................
Note that the "(float)" cast its like an operator or a function,
similar to your question.
Good Luck.
If this is a declaration
(IS TYPE (x) (INT))
then x should be laid out in memory. In the case of C and automatic variables, local auto variables are allocated on stack. To allocate needed size of stack you should know sizes of all local vars and sizes are from types.
If this variable is stored in a register, you should select a register of needed size (think about x86 with: AL, AX, EAX, RAX - the same register with different sizes), if your target has such.
Also, type is needed when there is an ambiguous operation in AST, which can operate on different data sizes (e.g. char, short, int - or 8-bit, 16-bit, 32-bit, etc). And for some assemblers, size of data is encoded into instruction itself; so codegen should remember sizes of variables.
Or, if the type of operation was not recorded in AST, the ADD:
(ADD (x) (y))
may mean both float and int additions (ADD or FADD instructions), so types of x and y are needed in codegen to select right variant.

Using read() system call of UNIX to find the user given pattern

I am trying to emulate grep pattern of UNIX using a C program( just for learning ). The code that i have written is giving me a run time error..
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#define MAXLENGTH 1000
char userBuf[MAXLENGTH];
int main ( int argc, char *argv[])
{
int numOfBytes,fd,i;
if (argc != 2)
printf("Supply correct number of arguments.\n");
//exit(1);
fd =open("pattern.txt",O_RDWR);
if ( fd == -1 )
printf("File does not exist.\n");
//exit(1);
while ( (numOfBytes = read(fd,userBuf,MAXLENGTH)) > 0 )
;
printf("NumOfBytes = %d\n",numOfBytes);
for(i=0;userBuf[i] != '\0'; ++i)
{
if ( strstr(userBuf,argv[1]) )
printf("%s\n",userBuf);
}
}
The program is printing infinitely, the lines containing the pattern . I tried debugging , but couldn't figure out the error. Please let me know where am i wrong.,
Thanks
Say the string is "fooPATTERN". Your first time through the loop, you check for the pattern in "fooPATTERN" and find it. Then your second time through the loop, you check for the pattern in "ooPATTERN" and find it again. Then your third time, you check for the pattern in "oPATTERN" and find it again.
Since you're doing this to learn, I won't tell you much more. You can decide how best to solve it. There are at least two fundamentally different ways you could solve it. One is to do less on each pass of the loop to ensure you only find it once. The other is to make sure your next pass of the loop is past any pattern that was found.
One thing to think about: If the pattern is 'oo' and the string is 'ooo', how many patterns should be found? 1 or 2?
The 'read' does not delimit the data with a null character.
The while loop should encompase the for loop - it doesn't
First, you shouldn't be using raw Unix i/o with open and read if you're just learning C. Start with standard C i/o with fopen and fread/fscanf/fgets and so forth.
Second, you're reading in successive pieces of the file into the same buffer, overwriting the buffer each time, and only ever processing the last contents of the buffer.
Third, nothing guarantees that your buffer will be zero-terminated when you read into it with read(). In fact, it usually won't be.
Fourth, you're not using the i variable in the body of your loop. I can't tell exactly what you were shooting for here, but doing the same thing on the same data umpteen thousand times surely wasn't it.
Fifth, always compile with the fullest warning settings you can abide -- at lest -Wall with GCC. It should have complained that you call read() without including <unistd.h>.

Resources