Parsing really big log files (>1 GB, <5 GB)

I need to parse very large log files (>1 GB, <5 GB) - actually I need to strip the data into objects so I can store them in a DB. The log file is sequential (no line breaks), like:
TIMESTAMP=20090101000000;PARAM1=Value11;PARAM2=Value21;PARAM3=Value31;TIMESTAMP=20090101000100;PARAM1=Value11;PARAM2=Value21;PARAM3=Value31;TIMESTAMP=20090101000152;PARAM1=Value11;PARAM2=Value21;PARAM3=Value31;...
I need to strip this into the table:
TIMESTAMP | PARAM1 | PARAM2 | PARAM3
The process needs to be as fast as possible. I'm considering using Perl, but any suggestions using C/C++ would be really welcome. Any ideas?
Best regards,
Arthur

Write a prototype in Perl and compare its performance against how fast you can read data off the storage medium. My guess is that you'll be I/O bound, which means that using C won't offer a performance boost.
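If you want a quick baseline for that comparison, here is a minimal C-style sketch that just times raw sequential reads (the file name and chunk size are placeholders); any parser can only be slower than this:

#include <chrono>
#include <cstddef>
#include <cstdio>

int main()
{
    const char *filename = "big.log";          // hypothetical path
    FILE *f = std::fopen(filename, "rb");
    if (!f) { std::perror(filename); return 1; }

    static char buf[1 << 20];                  // 1 MiB chunks
    size_t total = 0, n;
    auto start = std::chrono::steady_clock::now();
    while ((n = std::fread(buf, 1, sizeof buf, f)) > 0)
        total += n;                            // read and discard
    std::chrono::duration<double> secs = std::chrono::steady_clock::now() - start;

    std::printf("%zu bytes in %.2f s (%.1f MB/s)\n",
                total, secs.count(), total / secs.count() / 1e6);
    std::fclose(f);
}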

This presentation about the use of Python generators blew my mind:
http://www.dabeaz.com/generators-uk/
David M. Beazley shows how to process multi-gigabyte log files by basically defining a generator for each processing step. The generators are then 'plugged' into each other until you end up with some simple utility functions:
lines = lines_from_dir("access-log*", "www")
log = apache_log(lines)
for r in log:
    print r
which can then be used for all sorts of querying:
stat404 = set(r['request'] for r in log
              if r['status'] == 404)
large = (r for r in log
         if r['bytes'] > 1000000)
for r in large:
    print r['request'], r['bytes']
He also shows that performance compares well to the performance of standard unix tools like grep, find etc.
Of course, this being Python, it's much easier to understand and, most importantly, easier to customise or adapt to different problem sets than Perl or AWK scripts.
(The code examples above are copied from the presentation slides.)

Lex handles this sort of thing amazingly well.

But really, use AWK. Its performance is not bad, even compared with Perl, etc. Of course Map/Reduce would work quite well, but what about the overhead of splitting the file into appropriate chunks?
Try AWK.

The key won't be the language because the problem is I/O bound, so pick the language that you feel most comfortable with.
The key is how it is coded. You'll be fine as long as you don't load the whole file into memory -- load chunks at a time, and save the data chunks at a time; it will be more efficient.
Java has a PushbackInputStream that may make this easier to code. The idea is that you guess how much to read, and if you read too little, push the data back and read a larger chunk.
Then when you've read too much, process the data, push back the remaining bit, and continue with the next iteration of the loop.
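The same push-back idea works without PushbackInputStream: carry the unconsumed tail of each chunk into the next read. A minimal C-style sketch, assuming ';'-terminated fields as in the question (the buffer size and the toy consume_fields helper are made up; a real version would build DB rows instead of printing):

#include <stdio.h>
#include <string.h>

/* Toy stand-in for real parsing: print every complete ';'-terminated
   field in buf[0..len) and report how many bytes were consumed. */
static size_t consume_fields(char *buf, size_t len)
{
    size_t end = len;
    while (end > 0 && buf[end - 1] != ';')   /* back past a partial field */
        end--;
    size_t start = 0;
    for (size_t i = 0; i < end; i++) {
        if (buf[i] == ';') {
            printf("%.*s\n", (int)(i - start), buf + start);
            start = i + 1;
        }
    }
    return end;
}

static void process_file(FILE *f)
{
    char buf[64 * 1024];    /* fields longer than this are not handled */
    size_t have = 0;        /* bytes carried over from the last read */

    for (;;) {
        size_t n = fread(buf + have, 1, sizeof buf - have, f);
        if (n == 0 && have == 0)
            break;                             /* EOF, nothing left over */
        have += n;

        size_t used = consume_fields(buf, have);
        memmove(buf, buf + used, have - used); /* "push back" the tail */
        have -= used;

        if (n == 0)
            break;                  /* EOF with an incomplete trailing field */
    }
}

int main(void)
{
    process_file(stdin);            /* e.g. ./a.out < big.log */
    return 0;
}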

Something like this should work.
use strict;
use warnings;

my $filename = shift @ARGV;
open my $io, '<', $filename or die "Can't open $filename: $!";

my ($match_buf, $read_buf) = ('', '');
while (sysread($io, $read_buf, 1024)) {
    $match_buf .= $read_buf;
    # Strip complete records off the front of the buffer; a partial
    # record stays behind until the next sysread completes it.
    while ($match_buf =~ s{TIMESTAMP=(\d{14});PARAM1=([^;]+);PARAM2=([^;]+);PARAM3=([^;]+);}{}) {
        my ($timestamp, @params) = ($1, $2, $3, $4);
        print $timestamp . "\n";
    }
}

This is easily handled in Perl, Awk, or C. Here's a start on a version in C for you:
#include <err.h>
#include <stdio.h>
#include <string.h>

int
main(int argc, char **argv)
{
    const char *filename = "noeol.txt";
    FILE *f;
    char buffer[1024], *s, *p;
    char line[1024];
    size_t n;

    if ((f = fopen(filename, "r")) == NULL)
        err(1, "cannot open %s", filename);
    while ((n = fread(buffer, 1, sizeof buffer, f)) > 0) {
        /* Split this chunk on ';' field delimiters.  Note: a field that
           straddles a chunk boundary is dropped here; a full version
           would carry the tail over into the next read. */
        for (s = p = buffer; (size_t)(p - buffer) < n; p++) {
            if (*p == ';') {
                *p = '\0';
                strncpy(line, s, p - s + 1);
                s = p + 1;
                /* Indent every field except the TIMESTAMP starting a row. */
                if (strncmp("TIMESTAMP", line, 9) != 0)
                    printf("\t");
                printf("%s\n", line);
            }
        }
    }
    if (ferror(f))
        err(1, "error reading %s", filename);
    fclose(f);
    return 0;
}

Sounds like a job for sed:
sed -e 's/;\?[A-Z0-9]*=/|/g' -e 's/^|//' -e 's/;$//' < input > output

You might want to take a look at Hadoop (java) or Hadoop Streaming (runs Map/Reduce jobs with any executable or script).

If you code your own solution, you will probably benefit from reading larger chunks of data from the file and processing them in batches (rather than using, say, readline()), looking for the delimiter that marks the end of each record. With this approach, you need to be mindful that you may not have retrieved the entirety of the last record, so some logic is required to handle that.
I don't know what performance benefit you'd realize, since I haven't tested it, but I've leveraged similar techniques with success.

I know this is an exotic language and may not be the best solution, but when I have ad hoc data, I consider PADS.

Related

Loading/Storing to XMFLOAT4 faster than using XMVECTOR?

I'm going through the DirectX Math/XNA Math library, and I got curious when I read about the alignment requirements for XMVECTOR (Now DirectX::XMVECTOR), and how it is expected of you to use XMFLOAT* for members instead, using XMLoad* and XMStore* when performing mathematical operations. I was specifically curious about the tradeoffs, so I did an experiment, as I'm sure many others have, and tested to see exactly how much you lose having to load and store the vectors for each operation. This is the resulting code:
#include <Windows.h>
#include <chrono>
#include <cstdint>
#include <DirectXMath.h>
#include <iostream>

using std::chrono::high_resolution_clock;

#define TEST_COUNT 1000000000l

int main(void)
{
    DirectX::XMVECTOR v1 = DirectX::XMVectorSet(1, 2, 3, 4);
    DirectX::XMVECTOR v2 = DirectX::XMVectorSet(2, 3, 4, 5);
    DirectX::XMFLOAT4 x{ 1, 2, 3, 4 };
    DirectX::XMFLOAT4 y{ 2, 3, 4, 5 };
    high_resolution_clock::time_point start, end;
    std::chrono::milliseconds duration;

    // Test with just the XMVECTOR
    start = high_resolution_clock::now();
    for (uint64_t i = 0; i < TEST_COUNT; i++)
    {
        v1 = DirectX::XMVectorAdd(v1, v2);
    }
    end = high_resolution_clock::now();
    duration = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);

    DirectX::XMFLOAT4 z;
    DirectX::XMStoreFloat4(&z, v1);
    std::cout << std::endl << "z = " << z.x << ", " << z.y << ", " << z.z << std::endl;
    std::cout << duration.count() << " milliseconds" << std::endl;

    // Now try with load/store
    start = high_resolution_clock::now();
    for (uint64_t i = 0; i < TEST_COUNT; i++)
    {
        v1 = DirectX::XMLoadFloat4(&x);
        v2 = DirectX::XMLoadFloat4(&y);
        v1 = DirectX::XMVectorAdd(v1, v2);
        DirectX::XMStoreFloat4(&x, v1);
    }
    end = high_resolution_clock::now();
    duration = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);

    std::cout << std::endl << "x = " << x.x << ", " << x.y << ", " << x.z << std::endl;
    std::cout << duration.count() << " milliseconds" << std::endl;
}
Running a debug build yields the output:
z = 3.35544e+007, 6.71089e+007, 6.71089e+007
25817 milliseconds
x = 3.35544e+007, 6.71089e+007, 6.71089e+007
84344 milliseconds
Okay, so about thrice as slow, but does anyone really take perf tests on debug builds seriously? Here are the results when I do a release build:
z = 3.35544e+007, 6.71089e+007, 6.71089e+007
1980 milliseconds
x = 3.35544e+007, 6.71089e+007, 6.71089e+007
670 milliseconds
Like magic, XMFLOAT4 runs almost three times faster! Somehow the tables have turned. Looking at the code, this makes no sense to me; the second part runs a superset of the commands that the first part runs! There must be something going wrong, or something I am not taking into account. It is difficult to believe that the compiler was able to optimize the second part nine-fold over the much simpler, and theoretically more efficient, first part.
The only reasonable explanations I have involve either (1) cache behavior, (2) some crazy out-of-order execution that XMVECTOR can't take advantage of, (3) the compiler making some insane optimizations, or (4) using XMVECTOR directly having some implicit inefficiency that was optimized away when using XMFLOAT4. That is, the default way the compiler loads and stores XMVECTORs from memory is less efficient than XMLoad* and XMStore*.
I attempted to inspect the disassembly, but I'm not all that familiar with x86 and/or SSE2, and Visual Studio does some crazy optimizations that make it difficult to follow along with the source code. I also tried the Visual Studio performance analysis tool, but that didn't help, as I can't figure out how to make it show the disassembly instead of the code. The only useful information I get out of it is that the first call to XMVectorAdd accounts for ~48.6% of all cycles while the second call to XMVectorAdd accounts for ~4.4% of all cycles.
EDIT:
After some more debugging, here is the assembly for the code that gets run inside of the loop. For the first part:
002912E0 movups xmm1,xmmword ptr [esp+18h] <-- HERE
002912E5 add ecx,1
002912E8 movaps xmm0,xmm2 <-- HERE
002912EB adc esi,0
002912EE addps xmm0,xmm1
002912F1 movups xmmword ptr [esp+18h],xmm0 <-- HERE
002912F6 jne main+60h (0291300h)
002912F8 cmp ecx,3B9ACA00h
002912FE jb main+40h (02912E0h)
And for the second part:
00291400 add ecx,1
00291403 addps xmm0,xmm1
00291406 adc esi,0
00291409 jne main+173h (0291413h)
0029140B cmp ecx,3B9ACA00h
00291411 jb main+160h (0291400h)
Note that these two loops are indeed nearly identical. The only difference is that the first for loop appears to be the one doing the loading and storing! It would appear that Visual Studio made a ton of optimizations since x and y were on the stack. Changing them both to be on the heap (thus forcing the writes to happen) makes the machine code identical. Is this generally the case? Is there really no negative side effect to using the storage classes? Other than the fully optimized versions, I suppose.
If you define
DirectX::XMVECTOR v3 = DirectX::XMVectorSet(2, 3, 4, 5);
and use v3 instead of v1 as the result:
...
for (uint64_t i = 0; i < TEST_COUNT; i++)
{
    v3 = DirectX::XMVectorAdd(v1, v2);
}
you get code that is faster than the second part, which uses XMLoadFloat4 and XMStoreFloat4.
Firstly, don't use Visual Studio's high_resolution_clock for perf timing. You should use QueryPerformanceCounter instead. See Connect.
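For reference, a small sketch of QueryPerformanceCounter-based timing (the time_ms helper is made up; wrap whatever loop you are measuring in the callable):

#include <Windows.h>

// Times fn() with QueryPerformanceCounter and returns milliseconds.
template <typename F>
double time_ms(F fn)
{
    LARGE_INTEGER freq, t0, t1;
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&t0);
    fn();
    QueryPerformanceCounter(&t1);
    return 1000.0 * (double)(t1.QuadPart - t0.QuadPart) / (double)freq.QuadPart;
}

// Usage: double ms = time_ms([&]{ /* loop under test */ });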
SIMD performance is difficult to measure in these micro-tests because the overhead of loading up vector data can often dominate with such trivial ALU usage. You really need to do something substantial with the data to see the benefits. Also keep in mind that, depending on your compiler settings, the compiler itself may be using the 'scalar' SIMD functionality or even auto-vectorizing such trivial code loops.
You are also seeing some issues with the way you are generating your test data. You should create something larger than a single vector on the heap and use that as your source/dest.
PS: The best way to create 'static' XMVECTOR data is to use the XMVECTORF32 type.
static const DirectX::XMVECTORF32 v1 = { 1, 2, 3, 4 };
Note that if you want to have the load/save conversions between XMVECTOR and XMFLOATx to be "automatic", take a look at SimpleMath in the DirectX Tool Kit. You just use types like SimpleMath::Vector4 in your data structures, and the implicit conversion operators take care of calling XMLoadFloat4 / XMStoreFloat4 for you.
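For instance, a hedged sketch of the SimpleMath approach (assuming the DirectX Tool Kit headers are on your include path; the accumulate function is made up):

#include "SimpleMath.h"

using DirectX::SimpleMath::Vector4;

// operator+ implicitly does the XMLoadFloat4 / XMVectorAdd /
// XMStoreFloat4 dance, so call sites stay free of explicit loads/stores.
Vector4 accumulate(const Vector4& a, const Vector4& b)
{
    return a + b;
}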

OpenCV parallel_for not using multiple processors

I just saw in the new OpenCV 2.4.3 that they added a universal parallel_for. So following this example, I tried to implement it myself. I got it all functioning with my code, but when I timed its processing vs a similar loop done in a typical serial fashion with a regular "for" command, the results were insignificantly faster, or often a tiny bit slower!
I thought maybe this had something to do with my pushing into vectors or something (I'm a pretty big noob to parallel processing), so I set up a test loop of just running through a big number and it still doesn't work.
Code:
class Parallel_Test : public cv::ParallelLoopBody
{
private:
    double* const mypointer;
public:
    Parallel_Test(double* pointer)
        : mypointer(pointer){
    }
    void operator() (const Range& range) const
    {
        //This constructor needs to be here otherwise it is considered an abstract class.
        // qDebug()<<"This should never be called";
    }
    void operator ()(const cv::BlockedRange& range) const
    {
        for (int x = range.begin(); x < range.end(); ++x){
            mypointer[x]=x;
        }
    }
};

//TODO Loop pixels in parallel
double t = (double)getTickCount();

//TEST PARALELL LOOPING AT ALL
double data1[1000000];
cv::parallel_for(BlockedRange(0, 1000000), Parallel_Test(data1));
t = ((double)getTickCount() - t)/getTickFrequency();
qDebug() << "Parallel TEST time " << t << endl;

t = (double)getTickCount();
for(int i =0; i<1000000; i++){
    data1[i]=i;
}
t = ((double)getTickCount() - t)/getTickFrequency();
qDebug() << "SERIAL Scan time " << t << endl;
output:
Parallel TEST time 0.00415479
SERIAL Scan time 0.00204597
Wow! I found the answer! "parallel_for" and "parallel_for_" (with a trailing underscore!) are totally different. You need the trailing underscore to make it work! Otherwise it will just run your loop in serial, and you will have to use a BlockedRange instead of a Range! AHH!
Thanks to @Daniil Osokin and especially @Vladislav Vinogradov for pointing this out!
So again, your code will need to look something like this:
cv::parallel_for_(Range(0, 1000000), Parallel_Test(data1));
More updated details at: http://answers.opencv.org/question/3730/how-to-use-parallel_for/
The problem is most likely that your loop body is too small.
It appears all you are doing is assigning a pointer in one vector to another.
You really need to think of a parallel for as an inefficient for loop; that is, the work inside each iteration needs to be large enough so that you wouldn't dream of getting speedups by unrolling the loop, because in addition to the usual decrement, compare, and jump, you also have a few interlocked instructions, perhaps a virtual function call or two, and some allocations.
So instead of copying a pointer, try doing a good amount of real math or work on a large array of data.
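For illustration, a sketch of a loop body with enough arithmetic per element that the threading overhead no longer dominates (the class name and the inner-loop math are made up):

#include <opencv2/core/core.hpp>
#include <cmath>

class Parallel_Heavy : public cv::ParallelLoopBody
{
    double* const out;
public:
    explicit Parallel_Heavy(double* p) : out(p) {}
    void operator()(const cv::Range& range) const
    {
        for (int i = range.start; i < range.end; ++i) {
            double acc = 0;
            // Enough real work per element to amortize the scheduling cost.
            for (int k = 1; k <= 1000; ++k)
                acc += std::sin(i * 0.001 * k);
            out[i] = acc;
        }
    }
};

// Usage: cv::parallel_for_(cv::Range(0, 1000000), Parallel_Heavy(data));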

c stream buffer

I am using C and need a stream-buffer mechanism that I can write arbitrary bytes to and read bytes from. I would prefer something that is platform independent (or that can at least run on OS X and Linux). Is anyone aware of any permissive lightweight libraries or code that I can drop in?
I've used the buffers within libevent and I may end up going that route, but it seems overkill to have libevent as a dependency when I don't do any sort of event-based I/O.
If you don't mind depending on C++ and possibly some bits of STL, you can use std::stringstream. It shouldn't be too difficult to write a thin C wrapper around it.
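For instance, a minimal sketch of such a wrapper (all names are made up; error handling omitted):

#include <sstream>
#include <cstddef>

extern "C" {

typedef void* cbuf_t;

// Opaque handle over a std::stringstream acting as a byte FIFO.
cbuf_t cbuf_create(void)      { return new std::stringstream; }
void   cbuf_destroy(cbuf_t b) { delete static_cast<std::stringstream*>(b); }

void cbuf_write(cbuf_t b, const void* data, size_t len)
{
    static_cast<std::stringstream*>(b)
        ->write(static_cast<const char*>(data), (std::streamsize)len);
}

// Returns the number of bytes actually read.  (A short read sets
// failbit; call clear() on the stream before writing to it again.)
size_t cbuf_read(cbuf_t b, void* out, size_t len)
{
    std::stringstream* ss = static_cast<std::stringstream*>(b);
    ss->read(static_cast<char*>(out), (std::streamsize)len);
    return static_cast<size_t>(ss->gcount());
}

} // extern "C"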
Is setbuf(3) (and its aliases) the 'mechanism' you are searching for?
Please consider the following example:
#include <stdio.h>

int main()
{
    char buf[256];
    setbuffer(stderr, buf, 256);   /* setbuffer() is the BSD cousin of setbuf(3) */
    fprintf(stderr, "Error: no more oxygen.\n");
    /* The message is still sitting in buf, so we can rewrite it
       before it reaches the screen. */
    buf[1] = 'R';
    buf[2] = 'R';
    buf[3] = 'O';
    buf[4] = 'R';
    fflush(stderr);
}

Using read() system call of UNIX to find the user given pattern

I am trying to emulate the grep pattern matching of UNIX using a C program (just for learning). The code that I have written is giving me a runtime error:
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#define MAXLENGTH 1000

char userBuf[MAXLENGTH];

int main ( int argc, char *argv[])
{
    int numOfBytes,fd,i;
    if (argc != 2)
        printf("Supply correct number of arguments.\n");
        //exit(1);
    fd =open("pattern.txt",O_RDWR);
    if ( fd == -1 )
        printf("File does not exist.\n");
        //exit(1);
    while ( (numOfBytes = read(fd,userBuf,MAXLENGTH)) > 0 )
        ;
    printf("NumOfBytes = %d\n",numOfBytes);
    for(i=0;userBuf[i] != '\0'; ++i)
    {
        if ( strstr(userBuf,argv[1]) )
            printf("%s\n",userBuf);
    }
}
The program prints the lines containing the pattern infinitely. I tried debugging, but couldn't figure out the error. Please let me know where I am wrong.
Thanks
Say the string is "fooPATTERN". Your first time through the loop, you check for the pattern in "fooPATTERN" and find it. Then your second time through the loop, you check for the pattern in "ooPATTERN" and find it again. Then your third time, you check for the pattern in "oPATTERN" and find it again.
Since you're doing this to learn, I won't tell you much more. You can decide how best to solve it. There are at least two fundamentally different ways you could solve it. One is to do less on each pass of the loop to ensure you only find it once. The other is to make sure your next pass of the loop is past any pattern that was found.
One thing to think about: If the pattern is 'oo' and the string is 'ooo', how many patterns should be found? 1 or 2?
The 'read' does not delimit the data with a null character.
The while loop should encompass the for loop - it doesn't.
First, you shouldn't be using raw Unix i/o with open and read if you're just learning C. Start with standard C i/o with fopen and fread/fscanf/fgets and so forth.
Second, you're reading in successive pieces of the file into the same buffer, overwriting the buffer each time, and only ever processing the last contents of the buffer.
Third, nothing guarantees that your buffer will be zero-terminated when you read into it with read(). In fact, it usually won't be.
Fourth, you're not using the i variable in the body of your loop. I can't tell exactly what you were shooting for here, but doing the same thing on the same data umpteen thousand times surely wasn't it.
Fifth, always compile with the fullest warning settings you can abide -- at least -Wall with GCC. It should have complained that you call read() without including <unistd.h>.
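Putting those points together, a sketch of a corrected read loop (still chunk-based, so a pattern that straddles a chunk boundary is missed; handling that is left as an exercise):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s pattern\n", argv[0]);
        return 1;
    }

    int fd = open("pattern.txt", O_RDONLY);
    if (fd == -1) {
        perror("pattern.txt");
        return 1;
    }

    char buf[1000 + 1];                     /* +1 for the terminator */
    ssize_t n;
    while ((n = read(fd, buf, sizeof buf - 1)) > 0) {
        buf[n] = '\0';                      /* read() does not do this */
        if (strstr(buf, argv[1]))           /* check each chunk as it arrives */
            printf("match in this chunk\n");
    }
    close(fd);
    return 0;
}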

Ordering Output in MPI

In a simple MPI program I have used a column-wise division of a large matrix.
How can I order the output so that each matrix appears next to the others, in order?
I have tried this simple code, but the effect is quite different from what I wanted:
for(int i=0;i<10;i++)
{
    for(int k=0;k<numprocs;k++)
    {
        if (my_id==k){
            for(int j=1;j<10;j++)
                printf("%d",data[i][j]);
        }
        MPI_Barrier(com);
    }
    if(my_id==0)
        printf("\n");
}
It seems that each process has its own stdout, so it is impossible to have ordered lines of output without sending all the data to one master that prints it out. Is my guess right, or am I doing something wrong?
You guessed right. The MPI standard does not specify how stdout from different nodes should be collected for printing at the originating process. It is often the case that when multiple processes are doing prints the output will get merged in an unspecified way. fflush doesn't help.
If you want the output ordered in a certain way, the most portable method would be to send the data to the master process for printing.
For example, in pseudocode:
if (rank == 0) {
    print_col(0);
    for (i = 1; i < comm_size; i++) {
        MPI_Recv(buffer, .... i, ...);
        print_col(i);
    }
} else {
    MPI_Send(data, ..., 0, ...);
}
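A compilable version of that sketch might look like this (it assumes every rank owns an equally sized block of doubles; the function and variable names are made up):

#include <mpi.h>
#include <cstdio>
#include <vector>

// Rank 0 prints its own block, then receives and prints each other
// rank's block in rank order, so the output order is deterministic.
void print_ordered(const std::vector<double>& block, int rank, int size)
{
    if (rank == 0) {
        std::vector<double> recv(block.size());
        for (int src = 0; src < size; ++src) {
            const double* data = block.data();
            if (src != 0) {
                MPI_Recv(recv.data(), (int)recv.size(), MPI_DOUBLE,
                         src, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                data = recv.data();
            }
            for (std::size_t j = 0; j < block.size(); ++j)
                std::printf("%g ", data[j]);
            std::printf("\n");
        }
    } else {
        MPI_Send((void*)block.data(), (int)block.size(),
                 MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    }
}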
Another method that can sometimes work is to use barriers to lock-step the processes so that each one prints in turn. This of course depends on the MPI implementation and how it handles stdout.
for(i = 0; i < comm_size; i++) {
    MPI_Barrier(MPI_COMM_WORLD);
    if (i == rank) {
        printf(...);
    }
}
Of course, in production code where the data is too large to print sensibly anyway, the data is eventually combined by having each process write to a separate file and merging them afterwards, or by using MPI I/O (defined in the MPI-2 standard) to coordinate parallel writes.
I have produced ordered output to a file before, using this exact method. You could try printing to a temporary file, printing the contents of said file, and then deleting it.
Have the root processor do all of the printing. Use MPI_Send/MPI_Recv or MPI_Gather (or whatever) to send the data in turn from each processor to the root.
To solve this problem you can use a short sleep. I use one, and then it works 99% of the time:
printf("text nr 1\n");
MPI_Barrier(MPI_COMM_WORLD);
usleep(100);
printf("text nr 2\n");
It's not very elegant, but it works.
