Pre-Processing using m4 - preprocessor

I am writing a pre-processor for Free-Pascal (Course Work) using m4. I was reading the thread at stackoverflow here and from there reached a blog which essentially shows the basic usage of m4 for pre-processing for C. The blogger uses a testing C file test.c.m4 like this:
#include <stdio.h>
define(`DEF', `3')
int main(int argc, char *argv[]) {
    printf("%d\n", DEF);
    return 0;
}
and generates processed C file like this using m4, which is fine.
$ m4 test.c.m4 > test.c
$ cat test.c
#include <stdio.h>
int main(int argc, char *argv[]) {
    printf("%d\n", 3);
    return 0;
}
My doubts are:
1. The programmer will write the code where the line
define(`DEF', `3')
would be
#define DEF 3
so what converts this line into the m4 form above? We could use a tool like sed or awk for that conversion, but then what is the use of m4? What m4 does here could be implemented with sed as well.
It would be very helpful if someone can tell me how to convert the programmer's code into a file that can be used by m4.
2. I had another issue using m4. Comments in languages like C are removed before preprocessing, so can this be done using m4? I looked for m4 commands that would let me replace comments using a regex and found regexp(), but it requires the string to be searched as an argument, which is not available in this case. How can I achieve this?
Sorry if this is a naive question. I read the documentation of m4 but could not find a solution.

m4 is the tool that will convert DEF to 3 in this case. It is true that sed or awk could serve the same purpose for this simple case but m4 is a much more powerful tool because it a) allows macros to be parameterized, b) includes conditionals, c) allows macros to be redefined through the input file, and much more. For example, one could write (in the file for.pas.m4, inspired by ratfor):
define(`LOOP',`for $1 := 1 to $2 do')dnl
LOOP(i,10)
... which produces the following output ready for the Pascal compiler when processed by m4 for.pas.m4:
for i := 1 to 10 do
Removing general Pascal comments using m4 would not be possible but creating a macro to include a comment that will be deleted by `m4' in processing is straightforward:
define(`NOTE',`dnl')dnl
NOTE(`This is a comment')
x := 3;
... produces:
x := 3;
Frequently-used macros that are to be expanded by m4 can be put in a common file that can be included at the start of any Pascal file that uses them, making it unnecessary to define all the required macros in every Pascal file. See include (file) in the m4 manual.
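On the first question (turning the programmer's `#define DEF 3` lines into input m4 understands), the conversion step can be a small script that runs before m4. The following is only a minimal sketch in Python, not part of the answer above: the function name `to_m4` is hypothetical, it handles object-like macros only, and it assumes the macro value contains no m4 quote characters (a backquote or apostrophe would need escaping or a changequote).

```python
import re

# Matches object-like macros only, e.g. "#define DEF 3".
# Function-like macros such as "#define MAX(a, b) ..." are deliberately left alone.
_DEFINE_RE = re.compile(r"^\s*#\s*define\s+([A-Za-z_]\w*)(?:\s+(.*?))?\s*$")


def to_m4(source: str) -> str:
    """Rewrite '#define NAME value' lines as m4 define() calls (hypothetical helper)."""
    out = []
    for line in source.splitlines():
        match = _DEFINE_RE.match(line)
        if match:
            name, value = match.group(1), match.group(2) or ""
            # `...' is m4's default quoting; dnl swallows the trailing newline.
            out.append(f"define(`{name}', `{value}')dnl")
        else:
            out.append(line)
    return "\n".join(out) + "\n"
```

The converted text can then be piped into m4, which performs the actual macro expansion; the script only changes the surface syntax of the definitions.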


Clang: How to get the macro name used for size of a constant size array declaration

How to get the macro name used for size of a constant size array declaration, from a callExpr -> arg_0 -> DeclRefExpr.
Detailed Problem statement:
Recently I started working on a challenge that requires a source-to-source transformation tool for modifying
specific function calls with an additional argument. Researching ways to achieve this introduced me
to this amazing toolset, Clang. I've been learning how to use the different tools provided in LibTooling to
achieve my goal, but now I'm stuck on a problem and seek your help.
Consider the program below (a dummy of my sources). My goal is to rewrite all calls to the strcpy
function with the safe version, strcpy_s, and add an additional parameter to the new function call:
the maximum size of the destination buffer. So for the program below, my refactored call would be
strcpy_s(inStr, STR_MAX, argv[1]);
I wrote a RecursiveASTVisitor class and inspect all function calls in the VisitCallExpr method. To get the max size
of the destination argument, I get the VarDecl of the first argument and try to get its size (ConstantArrayType). Since
the source file is already preprocessed, I see 2049 as the size, but what I need is the macro STR_MAX in
this case. How can I get that?
(I create replacements with this info and apply them afterwards using RefactoringTool.)
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#define STR_MAX 2049
int main(int argc, char **argv){
    char inStr[STR_MAX];
    if (argc > 1) {
        //Clang tool required to transform the below call into strncpy_s(inStr, STR_MAX, argv[1], strlen(argv[1]));
        strcpy(inStr, argv[1]);
    } else {
        printf("\n not enough args");
        return -1;
    }
    printf("got [%s]", inStr);
    return 0;
}
As you noticed correctly, the source code is already preprocessed and it has all the macros expanded. Thus, the AST will simply have an integer expression as the size of array.
A little bit of information on source locations
NOTE: you can skip it and proceed straight to the solution below
The information about expanded macros is contained in source locations of AST nodes and usually can be retrieved using Lexer (Clang's lexer and preprocessor are very tightly connected and can be even considered one entity). It's a bare minimum and not very obvious to work with, but it is what it is.
As you are looking for a way to get the original macro name for a replacement, you only need to get the spelling (i.e. the way it was written in the original source code), and you don't need to care much about macro definitions, function-style macros and their arguments, etc.
Clang has two kinds of source ranges: token ranges (SourceRange, built from a pair of SourceLocations) and character ranges (CharSourceRange). The first kind can be found pretty much everywhere in the AST. Its end location refers to the beginning of the last token, not to the last character. This explains why begin and end positions can be somewhat counterintuitive:
// clang::DeclRefExpr
// ┌─ begin location
// └─ end location
// clang::BinaryOperator
//           ┌─ begin location
int Result = LHS + RHS;
//                 └─ end location
As you can see, both ends of a token range point to the beginning of the corresponding token. A character range, on the other hand, points directly at the characters.
So, in order to get the original text of the expression, we need to convert the SourceRange into a CharSourceRange and get the corresponding text from the source.
The solution
I've modified your example to show other cases of macro expansions as well:
#define STR_MAX 2049
#define BAR(X) X
int main() {
    char inStrDef[STR_MAX];
    char inStrFunc[BAR(2049)];
    char inStrFuncNested[BAR(BAR(STR_MAX))];
}
The following code:
// clang::VarDecl *VD;
// clang::ASTContext *Context;
auto &SM = Context->getSourceManager();
auto &LO = Context->getLangOpts();
auto DeclarationType = VD->getTypeSourceInfo()->getTypeLoc();
if (auto ArrayType = DeclarationType.getAs<ConstantArrayTypeLoc>()) {
  auto *Size = ArrayType.getSizeExpr();
  // Lexer gets text for [start, end), so convert the token range to a
  // character range that covers the last token as well
  auto CharRange = Lexer::getAsCharRange(Size->getSourceRange(), SM, LO);
  auto StringRep = Lexer::getSourceText(CharRange, SM, LO);
  llvm::errs() << StringRep << "\n";
}
produces this output for the snippet:
I hope this information is helpful. Happy hacking with Clang!

how to make clang compile to LLVM IR with textual labels for simple function

Hello I have to parse some LLVM IR code for a compiler course. I am very new to LLVM.
I have clang and LLVM on my computer, and when I compile a simple C program:
#include <stdio.h>
int main(int argc, char *argv[])
{
    for (int i = 0; i < 10; i++) {
    }
    return 0;
}
using command: clang -cc1 test.c -emit-llvm
I get llvm IR with what I believe are called implicit blocks:
; <label>:4 ; preds = %9, %0
However my parser also needs to handle llvm IR with textual labels:
for.cond: ; preds = %for.inc, %entry
My problem is that I do not know how to generate such IR and was hoping someone show me how.
I tried Google and such, but I couldn't find appropriate information. Thanks in advance.
The accepted answer is no longer valid, nor is it a good way to achieve the stated goal.
In case someone stumbles upon this question, like I did, I'm providing the answer.
clang-8 -S -fno-discard-value-names -emit-llvm test.c
Alternatively, use this site with "Show detailed bytecode analysis" checked.
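For the parsing side of the question, both label styles can be recognized with two patterns. This is a rough sketch in Python rather than anything from the answers above: the function name `parse_label` is hypothetical, it only covers the two forms quoted in the question, and it does not handle quoted label names.

```python
import re

# Implicit (numbered) block as quoted in the question: "; <label>:4 ; preds = %9, %0"
_IMPLICIT_RE = re.compile(r"^;\s*<label>:(\d+)")
# Textual label definition: "for.cond: ; preds = %for.inc, %entry"
_NAMED_RE = re.compile(r"^([-a-zA-Z$._0-9]+):")
# Optional predecessor list in the trailing comment
_PREDS_RE = re.compile(r";\s*preds\s*=\s*(.*)$")


def parse_label(line: str):
    """Return (label, [predecessors]) for a block-label line, else None."""
    line = line.strip()
    match = _IMPLICIT_RE.match(line) or _NAMED_RE.match(line)
    if match is None:
        return None
    preds = []
    preds_match = _PREDS_RE.search(line)
    if preds_match:
        # Drop the leading '%' so predecessors compare equal to label names.
        preds = [p.strip().lstrip("%")
                 for p in preds_match.group(1).split(",") if p.strip()]
    return match.group(1), preds
```

Returning the label without the `%` sigil keeps numbered and textual blocks in the same namespace, so the rest of a parser does not need to care which style the IR was printed in.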

Import and write GeoTIFF in Octave

I am using MATLAB in my office and Octave when I am at home. Although they are very similar, I was trying to do something I would have expected to be easy and obvious, but found it really annoying: I can't find out how to import GeoTIFF images in Octave. I know the MATLAB geotiffread function is not present, but I thought there would be another method.
I could also skip importing them, as I can work with the imread function in some cases, but then the second problem would be that I can't find a way to write a georeferenced TIFF file (in MATLAB I normally call geotiffwrite with geotiffinfo inputs inside). My TIFF files are usually 8 bit unsigned integer or 32 bit signed integer. I hope someone can suggest a way to solve this problem. I also saw this thread but did not understand if it is possible to use the code proposed by Ashish in Octave.
You may want to look at the mapping library in Octave.
You can also use the raster functions to work with GeoTiffs
pkg load mapping
rasterinfo (filename)
rasterdraw (filename)
The short answer is you can't do it in Octave out of the box. But this is not because it is impossible to do it. It is simply because no one has yet bothered to implement it. As a piece of free software, Octave has the features that its users are willing to spend time or money implementing.
About writing of signed 32-bit images
As of version 3.8.1, Octave uses either GraphicsMagick or ImageMagick to handle the reading and writing of images. This introduces some problems. The first is that your precision is limited by how GraphicsMagick was built (its quantum-depth option). In addition, you can only write unsigned integers. Hopefully this will change in the future, but since not many users require it, it has been this way until now.
Dealing with geotiff
Provided you know C++, you can write these functions yourself. This shouldn't be too hard since there is already libgeotiff, a C library for it. You would only need to write a wrapper as an Octave oct function (of course, if you don't know C or C++, then this "only" becomes a lot of work).
Here is the example oct file code which needs to be compiled. I have taken reference of
#include <octave/oct.h>
#include "iostream"
#include "fstream"
#include "string"
#include "cstdlib"
#include <cstdio>
#include "gdal_priv.h"
#include "cpl_conv.h"
#include "limits.h"
#include "stdlib.h"
using namespace std;
typedef std::string String;
DEFUN_DLD (test1, args, , "write geotiff")
{
  // args(0): 2-D data matrix; args(1): 6-element geotransform vector
  NDArray maindata = args(0).array_value ();
  const dim_vector dims = maindata.dims ();
  int i,j,nrows,ncols;
  nrows = dims(0);
  ncols = dims(1);
  //octave_stdout << maindata(i,0);
  NDArray transform1 = args(1).array_value ();
  double* transform = (double*) CPLMalloc(sizeof(double)*6);
  float* rowBuff = (float*) CPLMalloc(sizeof(float)*ncols);
  //GDT_Float32 *rowBuff = CPLMalloc(sizeof(GDT_Float32)*ncols);
  String tiffname;
  tiffname = "nameoftiff2.tif";
  cout<<"The transformation matrix is";
  for (i=0; i<6; i++)
  {
    transform[i] = transform1(i);
    cout<<transform[i]<<" ";
  }
  GDALAllRegister();
  GDALDataset *geotiffDataset;
  GDALDriver *driverGeotiff;
  GDALRasterBand *geotiffBand;
  OGRSpatialReference oSRS;
  char **papszOptions = NULL;
  char *pszWKT = NULL;
  oSRS.SetWellKnownGeogCS( "WGS84" );
  oSRS.exportToWkt( &pszWKT );
  driverGeotiff = GetGDALDriverManager()->GetDriverByName("GTiff");
  geotiffDataset = (GDALDataset *) driverGeotiff->Create(tiffname.c_str(),ncols,nrows,1,GDT_Float32,NULL);
  geotiffDataset->SetGeoTransform(transform);
  geotiffDataset->SetProjection(pszWKT);
  //CPLFree( pszSRS_WKT );
  cout<<" \n Number of rows and columns in array are: \n";
  cout<<nrows<<" "<<ncols<<"\n";
  geotiffBand = geotiffDataset->GetRasterBand(1);
  for (i=0; i<nrows; i++)
  {
    for (j=0; j <ncols; j++)
      rowBuff[j] = maindata(i,j);
    // write one row of the band at a time
    geotiffBand->RasterIO(GF_Write,0,i,ncols,1,rowBuff,ncols,1,GDT_Float32,0,0);
  }
  GDALClose(geotiffDataset) ;
  CPLFree(transform);
  CPLFree(rowBuff);
  CPLFree(pszWKT);
  return octave_value_list();
}
it can be compiled and run using following
mkoctfile -lgdal

c stream buffer

I am using C and need a stream buffer mechanism that I can write arbitrary bytes to and read bytes from. I would prefer something that is platform independent (or that can at least run on OS X and Linux). Is anyone aware of any permissively licensed, lightweight libraries or code that I can drop in?
I've used buffers within libevent and I may end up going that route, but it seems overkill to have libevent as a dependency when I don't do any sort of event based io.
If you don't mind depending on C++ and possibly some bits of STL, you can use std::stringstream. It shouldn't be too difficult to write a thin C wrapper around it.
Is setbuf(3) (and its aliases) the 'mechanism' you are searching for?
Please consider the following example:
#include <stdio.h>
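int main()
{
    char buf[256];
    /* route stderr through our own buffer */
    setbuffer(stderr, buf, 256);
    fprintf(stderr, "Error: no more oxygen.\n");
    /* the message is still sitting in buf, so it can be edited before it is flushed */
    buf[1] = 'R';
    buf[2] = 'R';
    buf[3] = 'O';
    buf[4] = 'R';
    return 0;
}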

Parsing really big log files (>1Gb, <5Gb)

I need to parse very large log files (>1Gb, <5Gb) - actually I need to strip the data into objects so I can store them in a DB. The log file is sequential (no line breaks), like:
I need to strip this into the table:
The process needs to be as fast as possible. I'm considering using Perl, but any suggestions using C/C++ would be really welcome. Any ideas?
Write a prototype in Perl and compare its performance against how fast you can read data off of the storage medium. My guess is that you'll be I/O bound, which means that using C won't offer a performance boost.
This presentation about the use of Python generators blew my mind:
David M. Beazley shows how to process multi-gigabyte log files by basically defining a generator for each processing step. The generators are then 'plugged' into each other until you have some simple utility functions
lines = lines_from_dir("access-log*","www")
log = apache_log(lines)
for r in log:
    print r
which can then be used for all sorts of querying:
stat404 = set(r['request'] for r in log
                  if r['status'] == 404)
large = (r for r in log
           if r['bytes'] > 1000000)
for r in large:
    print r['request'], r['bytes']
He also shows that performance compares well to the performance of standard unix tools like grep, find etc.
Of course this being Python, it's much easier to understand and most importantly easier to customise or adapt to different problem sets than perl or awk scripts.
(The code examples above are copied from the presentation slides.)
Lex handles this sort of thing amazingly well.
But really, use AWK. Its performance is not bad, even compared with Perl, etc. Of course Map/Reduce would work quite well, but what about the overhead of splitting the file into appropriate chunks?
The key won't be the language because the problem is I/O bound, so pick the language that you feel most comfortable with.
The key is how it is coded. You'll be fine as long as you don't load the whole file in memory -- load chunks at a time, and save the data chunks at a time, it will be more efficient.
Java has a PushbackInputStream that may make this easier to code. The idea is that you guess how much to read, and if you read too little, then push the data back, and read a larger chunk.
Then when you've read too much, process the data and then push back the remaining bit and continue to the next iteration of the loop.
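The chunked approach can be sketched without a pushback stream by carrying the incomplete tail of each chunk over to the next read. The following is a minimal illustration in Python, not the Java technique named above. The function names are hypothetical, and the record layout (semicolon-terminated KEY=VALUE fields, each record starting with TIMESTAMP) is an assumption taken from the regular expression used in another answer in this thread, since the sample data is not shown.

```python
import io


def iter_fields(stream, delimiter=b";", chunk_size=64 * 1024):
    """Yield delimiter-terminated fields from a binary stream, one chunk at a time."""
    pending = b""  # incomplete field carried over from the previous chunk
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        pending += chunk
        # Everything before the last delimiter is complete; the rest waits.
        *complete, pending = pending.split(delimiter)
        yield from complete
    if pending:
        yield pending  # trailing data without a final delimiter


def iter_records(stream, first_key=b"TIMESTAMP", **kwargs):
    """Group KEY=VALUE fields into dicts; a new record starts at first_key (assumed)."""
    record = {}
    for field in iter_fields(stream, **kwargs):
        key, _, value = field.partition(b"=")
        if key == first_key and record:
            yield record
            record = {}
        record[key.decode()] = value.decode()
    if record:
        yield record
```

Because both functions are generators, memory use stays bounded by the chunk size plus one record, and the resulting dicts can be batched into database inserts as they arrive. For a real file, `open(path, "rb")` would take the place of the in-memory stream used for testing.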
Something like this should work.
use strict;
use warnings;

my $filename = shift @ARGV;
open my $io, '<', $filename or die "Can't open $filename";

my ($match_buf, $read_buf, $count);

# read fixed-size chunks and strip complete records off the front of the buffer
while (($count = sysread($io, $read_buf, 1024, 0)) != 0) {
    $match_buf .= $read_buf;
    while ($match_buf =~ s{TIMESTAMP=(\d{14});PARAM1=([^;]+);PARAM2=([^;]+);PARAM3=([^;]+);}{}) {
        my ($timestamp, @params) = ($1, $2, $3, $4);
        print $timestamp ."\n";
        last unless $timestamp;
    }
}
This is easily handled in Perl, Awk, or C. Here's a start on a version in C for you:
#include <stdio.h>
#include <string.h>
#include <err.h>

int main(int argc, char **argv)
{
    const char *filename = "noeol.txt";
    FILE *f;
    char buffer[1024], *s, *p;
    char line[1024];
    size_t n;

    if ((f = fopen(filename, "r")) == NULL)
        err(1, "cannot open %s", filename);
    while (!feof(f)) {
        n = fread(buffer, 1, sizeof buffer, f);
        if (n == 0) {
            if (ferror(f))
                err(1, "error reading %s", filename);
            continue;
        }
        /* split the chunk on ';'; a field cut off at the end of the chunk is not handled yet */
        for (s = p = buffer; (size_t)(p - buffer) < n; p++) {
            if (*p == ';') {
                *p = '\0';
                strncpy(line, s, p-s+1);
                s = p + 1;
                if (strncmp("TIMESTAMP", line, 9) != 0)
                    printf("%s\n", line);
            }
        }
    }
    fclose(f);
    return 0;
}
Sounds like a job for sed:
sed -e 's/;\?[A-Z0-9]*=/|/g' -e 's/\(^\|\)\|\(;$\)//g' < input > output
You might want to take a look at Hadoop (java) or Hadoop Streaming (runs Map/Reduce jobs with any executable or script).
If you code your own solution, you will probably benefit from reading larger chunks of data from the file and processing them in batches (rather than using, say, readline()) and looking for the newline marking the end of each row. With this approach, you need to be mindful that you may not have retrieved the entirety of the last line, so some logic would be required to handle that.
I don't know what performance benefits you'd realize, since I haven't tested it, but I've leveraged similar techniques with success.
I know this is an exotic language and may not be the best solution, but when I have ad hoc data, I consider PADS.
