Using read() system call of UNIX to find the user given pattern - grep

I am trying to emulate the grep utility of UNIX with a C program (just for learning). The code I have written gives me a runtime error.
#include <fcntl.h>
#include <stdio.h>
#include <string.h>

#define MAXLENGTH 1000

char userBuf[MAXLENGTH];

int main(int argc, char *argv[])
{
    int numOfBytes, fd, i;

    if (argc != 2)
        printf("Supply correct number of arguments.\n");
        //exit(1);
    fd = open("pattern.txt", O_RDWR);
    if (fd == -1)
        printf("File does not exist.\n");
        //exit(1);
    while ((numOfBytes = read(fd, userBuf, MAXLENGTH)) > 0)
        ;
    printf("NumOfBytes = %d\n", numOfBytes);
    for (i = 0; userBuf[i] != '\0'; ++i)
    {
        if (strstr(userBuf, argv[1]))
            printf("%s\n", userBuf);
    }
}
The program prints the lines containing the pattern infinitely. I tried debugging but couldn't figure out the error. Please let me know where I am wrong.
Thanks

Say the string is "fooPATTERN". Your first time through the loop, you check for the pattern in "fooPATTERN" and find it. Then your second time through the loop, you check for the pattern in "ooPATTERN" and find it again. Then your third time, you check for the pattern in "oPATTERN" and find it again.
Since you're doing this to learn, I won't tell you much more. You can decide how best to solve it. There are at least two fundamentally different ways you could solve it. One is to do less on each pass of the loop to ensure you only find it once. The other is to make sure your next pass of the loop is past any pattern that was found.
One thing to think about: If the pattern is 'oo' and the string is 'ooo', how many patterns should be found? 1 or 2?

The read() call does not delimit the data with a null character.
The while loop should encompass the for loop - it doesn't.

First, you shouldn't be using raw Unix i/o with open and read if you're just learning C. Start with standard C i/o with fopen and fread/fscanf/fgets and so forth.
Second, you're reading in successive pieces of the file into the same buffer, overwriting the buffer each time, and only ever processing the last contents of the buffer.
Third, nothing guarantees that your buffer will be zero-terminated when you read into it with read(). In fact, it usually won't be.
Fourth, you're not using the i variable in the body of your loop. I can't tell exactly what you were shooting for here, but doing the same thing on the same data umpteen thousand times surely wasn't it.
Fifth, always compile with the fullest warning settings you can abide -- at least -Wall with GCC. It should have complained that you call read() without including <unistd.h>.
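
To tie those points together, here is a minimal sketch of the standard-I/O approach, as a hypothetical rewrite (it takes the pattern and the file name on the command line rather than hard-coding pattern.txt): fgets reads one line at a time, bounds the read, zero-terminates the buffer, and each line is tested against the pattern exactly once.

#include <stdio.h>
#include <string.h>

int main(int argc, char *argv[])
{
    char line[1000];
    FILE *fp;

    if (argc != 3) {
        fprintf(stderr, "usage: %s pattern file\n", argv[0]);
        return 1;
    }
    if ((fp = fopen(argv[2], "r")) == NULL) {
        perror(argv[2]);
        return 1;
    }
    /* fgets stops at each newline and always zero-terminates the buffer */
    while (fgets(line, sizeof line, fp) != NULL) {
        if (strstr(line, argv[1]) != NULL)   /* test each line exactly once */
            fputs(line, stdout);
    }
    fclose(fp);
    return 0;
}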


Clang: How to get the macro name used for size of a constant size array declaration

TL;DR:
How do I get the macro name used as the size in a constant-size array declaration, starting from a CallExpr -> arg_0 -> DeclRefExpr?
Detailed Problem statement:
Recently I started working on a challenge which requires a source-to-source transformation tool for modifying specific function calls with an additional argument. Researching the ways I could achieve this introduced me to the amazing Clang toolset. I've been learning how to use the different tools provided in libtooling to achieve my goal, but now I'm stuck at a problem and would appreciate your help.
Consider the program below (a dummy of my sources). My goal is to rewrite all calls to the strcpy function with the safe version strcpy_s and add an additional parameter to the new function call, i.e. the destination pointer's maximum size. So for the program below my refactored call would look like
strcpy_s(inStr, STR_MAX, argv[1]);
I wrote a RecursiveASTVisitor class and inspect all function calls in VisitCallExpr. To get the max size of the dest argument I take the VarDecl of the first argument and try to get the size (ConstantArrayType). Since the source file is already preprocessed I see 2049 as the size, but what I need is the macro STR_MAX in this case. How can I get that?
(I create Replacements with this info and apply them afterwards using RefactoringTool.)
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

#define STR_MAX 2049

int main(int argc, char **argv){
    char inStr[STR_MAX];
    if (argc > 1) {
        // Clang tool required to transform the below call into strncpy_s(inStr, STR_MAX, argv[1], strlen(argv[1]));
        strcpy(inStr, argv[1]);
    } else {
        printf("\n not enough args");
        return -1;
    }
    printf("got [%s]", inStr);
    return 0;
}
As you noticed correctly, the source code is already preprocessed and it has all the macros expanded. Thus, the AST will simply have an integer expression as the size of array.
A little bit of information on source locations
NOTE: you can skip it and proceed straight to the solution below
The information about expanded macros is contained in source locations of AST nodes and usually can be retrieved using Lexer (Clang's lexer and preprocessor are very tightly connected and can be even considered one entity). It's a bare minimum and not very obvious to work with, but it is what it is.
As you are looking for a way to get the original macro name for a replacement, you only need the spelling (i.e. the way it was written in the original source code); you don't need to care much about macro definitions, function-style macros and their arguments, etc.
Clang describes source positions in two ways: SourceLocation, which you will find pretty much everywhere throughout the AST and which refers to positions in terms of tokens, and character-based ranges (CharSourceRange). The token-based view explains why begin and end positions can be somewhat counterintuitive:
// clang::DeclRefExpr
//
//  ┌─ begin location
foo(VeryLongButDescriptiveVariableName);
//  └─ end location

// clang::BinaryOperator
//
//           ┌─ begin location
int Result = LHS + RHS;
//                 └─ end location
As you can see, this type of source location points to the beginning of the corresponding token. A CharSourceRange, on the other hand, delimits the characters directly.
So, in order to get the original text of the expression, we need to convert the node's token-based SourceRange into a CharSourceRange and get the corresponding text from the source.
The solution
I've modified your example to show other cases of macro expansions as well:
#define STR_MAX 2049
#define BAR(X) X

int main() {
    char inStrDef[STR_MAX];
    char inStrFunc[BAR(2049)];
    char inStrFuncNested[BAR(BAR(STR_MAX))];
}
The following code:
// clang::VarDecl *VD;
// clang::ASTContext *Context;
auto &SM = Context->getSourceManager();
auto &LO = Context->getLangOpts();
auto DeclarationType = VD->getTypeSourceInfo()->getTypeLoc();
if (auto ArrayType = DeclarationType.getAs<ConstantArrayTypeLoc>()) {
    auto *Size = ArrayType.getSizeExpr();
    auto CharRange = Lexer::getAsCharRange(Size->getSourceRange(), SM, LO);
    // Lexer gets text for [start, end) and we want it to grab the end as well
    CharRange.setEnd(CharRange.getEnd().getLocWithOffset(1));
    auto StringRep = Lexer::getSourceText(CharRange, SM, LO);
    llvm::errs() << StringRep << "\n";
}
produces this output for the snippet:
STR_MAX
BAR(2049)
BAR(BAR(STR_MAX))
I hope this information is helpful. Happy hacking with Clang!

Save vector to file during debug session (Xcode)

My application has crashed in an assert, and the debugger is attached. To be able to reproduce the crash I want to save a C++ vector with 397 struct{uint64_t, uint64_t} elements to file.
My first approach was to try to print the vector. I can print the vector to the console, but it seems like only the first 256 values are written. Is it possible to remove the 256 element restriction?
I've also searched for a way to save the vector to file from within the debugger, but I've not found any way. I've not even found a way to save a memory region, but I guess that must be possible...
Since you mentioned that you're stopped in the debugger in Xcode, I'll assume you're debugging with lldb. You can use the expression command to execute essentially arbitrary code when you're stopped in the debugger, for example:
expression for(int j = 0; j < 10; j++) { (void)NSLog(@"%d", j); }
will execute a for loop and print the numbers 0 through 9. You should be able to use a similar technique to iterate over your vector and write it to a file. You can combine multiple expressions using semicolons, just as if you were writing normal code (well, except for newlines). For example, this will write "Hello, world!" to a temporary file at /tmp/vector.dat; not exactly what you want, but I think you'll get the idea:
expression FILE *fp = (FILE*)fopen("/tmp/vector.dat", "w"); (void)fprintf(fp, "Hello, world!\n"); (void)fclose(fp);
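Applying that to the vector in the question, something along these lines should dump all 397 elements to a file (elems, a and b are placeholder names; substitute your vector and struct member names):
expression FILE *fp = (FILE*)fopen("/tmp/vector.dat", "w"); for (size_t i = 0; i < elems.size(); ++i) { (void)fprintf(fp, "%llu %llu\n", (unsigned long long)elems[i].a, (unsigned long long)elems[i].b); } (void)fclose(fp);
As for the 256-element cap when printing to the console, that is lldb's default child-count limit; settings set target.max-children-count 400 (or any larger value) should raise it.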

how to get the available memory on the device

I'm trying to find out how much free memory I have on the device. To do this I call the CUDA function cuMemGetInfo from Fortran code, but it returns negative values for the amount of free memory, so there's clearly something wrong.
Does anyone know how I can do that?
Thanks
EDIT:
Sorry, in fact my question was not very clear. I'm using OpenACC in Fortran and I call the C++ CUDA function cudaMemGetInfo. I could finally fix the code; the problem was indeed the kind of variables that I was using. Switching to size_t fixed everything. This is the interface in Fortran that I'm using:
interface
    subroutine get_dev_mem(total, free) bind(C, name="get_dev_mem")
        use iso_c_binding
        integer(kind=c_size_t) :: total, free
    end subroutine get_dev_mem
end interface
and this is the CUDA code:
#include <cuda.h>
#include <cuda_runtime.h>

extern "C" {
    void get_dev_mem(size_t& total, size_t& free)
    {
        cuMemGetInfo(&free, &total);
    }
}
There's one last question: I pushed an array onto the GPU and checked its size using cuMemGetInfo, then I computed its size by counting the number of bytes, but I don't get the same answer. Why? In the first case it is 3052 MB large, in the latter 3051 MB. Could this difference of 1 MB be the size of the array descriptor? Here is the code that I used:
integer, parameter:: long = selected_int_kind(12)
integer(kind=c_size_t) :: total, free1,free2
real(8), dimension(:),allocatable::a
integer(kind=long)::N, eight, four
allocate(a(four*N))
!some OpenACC stuff in order to init the gpu
call get_dev_mem(total,free1)
!$acc data copy(a)
call get_dev_mem(total,free2)
print *,"size a in the gpu = ",(free1-free2)/1024/1024, " mb"
print *,"size a in theory = ", (eight*four*N)/1024/1024, " mb"
!$acc end data
deallocate(a)
Right, so, like commenters have suggested, we're not sure exactly what you're running, but filling in the missing details by guessing, here's a shot:
Most CUDA API calls return a status code (or error code if you will); this is true both in C/C++ and in Fortran, as we can see in the Portland Group's CUDA Fortran Manual:
Most of the runtime API routines are integer functions that return an error code; they return a value of zero if the call was successful, and a nonzero value if there was an error. To interpret the error codes, refer to “Error Handling,” on page 48.
This is the case for cudaMemGetInfo() specifically:
integer function cudaMemGetInfo( free, total )
integer(kind=cuda_count_kind) :: free, total
The two integers for free and total are cuda_count_kind, which if I am not mistaken are effectively unsigned... anyway, I would guess that what you're getting is an error code. Have a look at the Error Handling section on page 48 of the manual.
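On the C side the same advice applies: cuMemGetInfo returns a CUresult, and the free/total values are only meaningful when it returns CUDA_SUCCESS. A minimal sketch of a checked query (driver API; it assumes a CUDA context already exists, e.g. one created by the OpenACC runtime):

#include <stdio.h>
#include <cuda.h>

/* Query free/total device memory, but only trust the values if the call
   succeeded; a CUDA context must already exist. */
void print_dev_mem(void)
{
    size_t free_bytes = 0, total_bytes = 0;
    CUresult rc = cuMemGetInfo(&free_bytes, &total_bytes);

    if (rc != CUDA_SUCCESS) {
        fprintf(stderr, "cuMemGetInfo failed with error code %d\n", (int)rc);
        return;
    }
    printf("free: %zu bytes, total: %zu bytes\n", free_bytes, total_bytes);
}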

char buffers comparison

I have two char buffers and I am trying to compare parts of them. I am having a weird problem. I have the following code:
char buffer1[50], buffer2[60];

// Get buffer1 and buffer2 from the network by reading sockets

for (int i = 0; i < 20; i++)
{
    if (buffer1[15+i] != buffer2[25+i])
    {
        printf("%c", buffer1[15+i]);
        printf("%c", buffer2[25+i]);
        printf("%02x", (unsigned char)buffer1[15+i]);
        printf("%02x", (unsigned char)buffer2[25+i]);
        break;
    }
}
The above code is a simplified version of my actual code, which I didn't copy-paste here because it's too long. Just in case it helps: I got those two buffers over the network by reading sockets.
The problem is that the loop breaks even when both buffers are the same. To check what is in the buffers, I added the two print statements inside the if statement. And the weird thing is, the printf statements print the same value for both buffers, for %c as well as %02x, but the comparison fails and the loop breaks.
(Disclaimer: I'm not a C/++ expert)
It seems to me like the data is changing while you're looking at it. Two quick questions come to mind:
If you run this in the debugger, and go over the loop step-by-step, does it still happen? If it doesn't, then I strongly suspect my second question will lead you to the answer.
Is the read operation asynchronous? It seems like data is still being read while you're inside the for loop, meaning you didn't wait for the read to finish.
The only thing I can see is a timing issue: if they are not the same in the if statement but they are the same in the print statements, someone changed them in between.
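One way to test that theory is to take a private snapshot of both regions at a single instant and compare the copies instead of the live buffers. A hypothetical helper, using the same offsets and length as the question:

#include <stdio.h>
#include <string.h>

/* Copy the two 20-byte regions once, then compare the copies. If the
   snapshots match even though the live comparison failed, the buffers
   were being modified while the loop ran. */
static void compare_snapshot(const char *buffer1, const char *buffer2)
{
    char snap1[20], snap2[20];

    memcpy(snap1, buffer1 + 15, sizeof snap1);
    memcpy(snap2, buffer2 + 25, sizeof snap2);

    if (memcmp(snap1, snap2, sizeof snap1) == 0)
        printf("snapshots match: the data was changing under the loop\n");
    else
        printf("snapshots really differ\n");
}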

Parsing really big log files (>1Gb, <5Gb)

I need to parse very large log files (>1Gb, <5Gb) - actually I need to strip the data into objects so I can store them in a DB. The log file is sequential (no line breaks), like:
TIMESTAMP=20090101000000;PARAM1=Value11;PARAM2=Value21;PARAM3=Value31;TIMESTAMP=20090101000100;PARAM1=Value11;PARAM2=Value21;PARAM3=Value31;TIMESTAMP=20090101000152;PARAM1=Value11;PARAM2=Value21;PARAM3=Value31;...
I need to strip this into the table:
TIMESTAMP | PARAM1 | PARAM2 | PARAM3
The process needs to be as fast as possible. I'm considering using Perl, but any suggestions using C/C++ would be really welcome. Any ideas?
Best regards,
Arthur
Write a prototype in Perl and compare its performance against how fast you can read data off of the storage medium. My guess is that you'll be I/O bound, which means that using C won't offer a performance boost.
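One way to get that baseline is to time a read-only pass over the file and compare it with your parser's throughput; if the two are close, a faster language won't buy you much. A rough sketch in C (uses POSIX clock_gettime for wall-clock time):

#define _POSIX_C_SOURCE 199309L   /* for clock_gettime */
#include <stdio.h>
#include <time.h>

/* Baseline: read the whole file sequentially in 1 MB chunks, do no parsing,
   and report MB/s. Compare the real parser against this number. */
int main(int argc, char **argv)
{
    static char buf[1 << 20];
    struct timespec t0, t1;
    size_t n, total = 0;
    FILE *f;

    if (argc != 2) {
        fprintf(stderr, "usage: %s logfile\n", argv[0]);
        return 1;
    }
    if ((f = fopen(argv[1], "rb")) == NULL) {
        perror(argv[1]);
        return 1;
    }
    clock_gettime(CLOCK_MONOTONIC, &t0);
    while ((n = fread(buf, 1, sizeof buf, f)) > 0)
        total += n;
    clock_gettime(CLOCK_MONOTONIC, &t1);
    fclose(f);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%zu bytes in %.2f s (%.1f MB/s)\n",
           total, secs, total / (1024.0 * 1024.0) / secs);
    return 0;
}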
This presentation about the use of Python generators blew my mind:
http://www.dabeaz.com/generators-uk/
David M. Beazley shows how to process multi-gigabyte log files by basically defining a generator for each processing step. The generators are then 'plugged' into each other until you have some simple utility functions
lines = lines_from_dir("access-log*","www")
log = apache_log(lines)
for r in log:
    print r
which can then be used for all sorts of querying:
stat404 = set(r['request'] for r in log
              if r['status'] == 404)

large = (r for r in log
         if r['bytes'] > 1000000)

for r in large:
    print r['request'], r['bytes']
He also shows that performance compares well to the performance of standard unix tools like grep, find etc.
Of course this being Python, it's much easier to understand and most importantly easier to customise or adapt to different problem sets than perl or awk scripts.
(The code examples above are copied from the presentation slides.)
Lex handles this sort of thing amazingly well.
But really, use AWK. Its performance is not bad, even compared with Perl, etc. Of course Map/Reduce would work quite well, but what about the overhead of splitting the file into appropriate chunks?
Try AWK
The key won't be the language because the problem is I/O bound, so pick the language that you feel most comfortable with.
The key is how it is coded. You'll be fine as long as you don't load the whole file into memory -- load it a chunk at a time and save the data a chunk at a time; that will be more efficient.
Java has a PushbackInputStream that may make this easier to code. The idea is that you guess how much to read, and if you read too little, then push the data back, and read a larger chunk.
Then when you've read too much, process the data and then push back the remaining bit and continue to the next iteration of the loop.
Something like this should work.
use strict;
use warnings;

my $filename = shift @ARGV;

open my $io, '<', $filename or die "Can't open $filename";
my ($match_buf, $read_buf, $count);

while (($count = sysread($io, $read_buf, 1024, 0)) != 0) {
    $match_buf .= $read_buf;
    while ($match_buf =~ s{TIMESTAMP=(\d{14});PARAM1=([^;]+);PARAM2=([^;]+);PARAM3=([^;]+);}{}) {
        my ($timestamp, @params) = ($1, $2, $3, $4);
        print $timestamp . "\n";
        last unless $timestamp;
    }
}
This is easily handled in Perl, Awk, or C. Here's a start on a version in C for you:
#include <stdio.h>
#include <string.h>
#include <err.h>

int
main(int argc, char **argv)
{
    const char *filename = "noeol.txt";
    FILE *f;
    char buffer[1024], *s, *p;
    char line[1024];
    size_t n;

    if ((f = fopen(filename, "r")) == NULL)
        err(1, "cannot open %s", filename);

    while (!feof(f)) {
        n = fread(buffer, 1, sizeof buffer, f);
        if (n == 0) {
            if (ferror(f))
                err(1, "error reading %s", filename);
            else
                continue;
        }
        for (s = p = buffer; p - buffer < n; p++) {
            if (*p == ';') {
                *p = '\0';
                strncpy(line, s, p - s + 1);
                s = p + 1;
                if (strncmp("TIMESTAMP", line, 9) != 0)
                    printf("\t");
                printf("%s\n", line);
            }
        }
    }
    fclose(f);
}
Sounds like a job for sed:
sed -e 's/;\?[A-Z0-9]*=/|/g' -e 's/\(^|\)\|\(;$\)//g' < input > output
You might want to take a look at Hadoop (java) or Hadoop Streaming (runs Map/Reduce jobs with any executable or script).
If you code your own solution, you will probably benefit from reading larger chunks of data from the file and processing them in batches (rather than using, say, readline()) and looking for the newline marking the end of each row. With this approach, you need to be mindful that you may not have retrieved the entirety of the last line, so some logic would be required to handle that.
I don't know what performance benefits you'd realize, since I haven't tested it, but I've leveraged similar techniques with success.
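To make the "partial last record" handling concrete, here is a rough C sketch of a chunk-and-carry-over loop; the file name and the per-field handler are placeholders, and it assumes every field is terminated by ';' as in the sample (a single field larger than the buffer is not handled):

#include <stdio.h>
#include <string.h>

/* Placeholder: do something useful with one "NAME=value" field. */
static void handle_field(const char *field) { puts(field); }

int main(void)
{
    static char buf[1 << 20];
    size_t leftover = 0, n;
    FILE *f = fopen("big.log", "rb");   /* placeholder file name */

    if (f == NULL) { perror("big.log"); return 1; }

    while ((n = fread(buf + leftover, 1, sizeof buf - leftover - 1, f)) > 0) {
        char *start = buf, *semi;
        buf[leftover + n] = '\0';
        /* Process only complete ';'-terminated fields in this chunk. */
        while ((semi = strchr(start, ';')) != NULL) {
            *semi = '\0';
            handle_field(start);
            start = semi + 1;
        }
        leftover = strlen(start);       /* partial field at the end... */
        memmove(buf, start, leftover);  /* ...is carried into the next chunk */
    }
    fclose(f);
    return 0;
}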
I know this is an exotic language and maybe not the best solution for this, but when I have ad hoc data, I consider PADS.
