Does awk store an output file in RAM?

I was doing some simple parsing like awk '{print $3 > "file1.txt"}'.
What I noticed was that awk was taking up far too much RAM (the input files were huge). Does streaming awk output to a file consume memory? Does it work like a streamed write, or does the file remain open until the program terminates?
The exact command that I gave was:
for i in ../../*.txt; do j=${i#*/}; mawk -v f=${j%.txt} '{if(NR%8<=4 && NR%8!=0){print >f"_1.txt" } else{print >f"_2.txt"}}' $i & done
As is evident, I used mawk. The five input files were around 6 GB each, and when I ran top I saw each mawk process taking up about 22% of memory (~5 GB) at its peak. I noticed it because my system was hanging from low memory.
I am quite sure that redirection outside awk consumes negligible memory: I have done it several times with files much larger than these and operations more complex than this, and I never faced this problem. Since I had to copy different sections of the input files to different output files, I used redirection inside awk.
I know there are other ways to implement this task, and in any case the job got done without much trouble. All I am interested in is how awk works when writing to a file.
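For what it's worth, here is a sketch of the same split with an explicit fflush() after every record (assuming your mawk has the fflush() built-in, as current mawk and gawk do); if memory still climbs with this version, the growth is not coming from awk buffering its output:
for i in ../../*.txt; do
    j=${i#*/}
    mawk -v f="${j%.txt}" '
        {
            # same NR%8 rule as above, just with the file name in a variable
            out = (NR % 8 <= 4 && NR % 8 != 0) ? f "_1.txt" : f "_2.txt"
            print > out
            fflush(out)            # push the record to the OS right away
        }' "$i" &
done
wait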
I am not sure if this question is better suited for Superuser.

Related

GNU Parallel -- How to understand "block-size" setting, and guess what to set it to?

How do I set the block-size parameter when running grep with GNU parallel on a single machine with multiple cores, based on the large_file size, the small_file size, and the machine I'm using, to get the fastest performance possible (or please correct me if there is something else I'm missing here)? What performance issues or speed bottlenecks will I run into when setting it too high or too low? I understand what block-size does, in that it splits large_file into chunks and sends those chunks to each job, but I'm still missing how and why that would impact the speed of execution.
The command in question:
parallel --pipepart --block 100M --jobs 10 -a large_file.csv grep -f small_file.csv
where large_file.csv has in it:
123456 1
234567 2
345667 22
and
where small_file.csv has in it:
1$
2$
and so on...
Thank you!
parallel --pipepart --block -1 --jobs 10 -a large_file.csv grep -f small_file.csv
--block -1 will split large_file.csv into one block per jobslot (here 10 chunks). The splitting will be done on the fly, so it will not be read into RAM to do the splitting.
Splitting into n evenly sized blocks (where n = number of jobs to run in parallel) often makes sense if the time spent per line is roughly the same. If it varies a lot (say, some lines take 100 times longer to process than others), then it may make more sense to chop into more bits. E.g. --block -10 will split into 10 times as many blocks as --block -1.
The optimal value can seldom be guessed in advance, because it may also depend on how fast your disk is. So try different values and identify where the bottleneck is. It is typically one of disk I/O, CPU, RAM, command startup time.
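One rough way to do that is to time a handful of candidate block sizes against the real data and keep the fastest; the values below are only examples, and the matches are thrown away here because only the timing matters:
for b in -1 -5 -10 100M 500M; do
    echo "--block $b"
    time parallel --pipepart --block "$b" --jobs 10 \
        -a large_file.csv grep -f small_file.csv > /dev/null
done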

Cannot allocate memory for gawk coprocess

I have a gawk program that dies when trying to start a coprocess. Error message is "fatal: can't open two way pipe `...' for input/output (Cannot allocate memory)". Memory usage of the gawk process at the time of starting the coprocess is around 50%.
The gawk program is structured as follows:
BEGIN {
    ## Read big file into memory -- takes about 50% of memory
    while ((getline < "bigFile") > 0) {
        list[$0]
    }
}
{
    print |& "cat"
}
I assume that starting the coprocess involves a fork(), which would double memory usage, and thus cause an error?
If I force the coprocess to start before loading the file into memory, it starts without a problem. But the best way I know of to force the coprocess to start is to write an empty line to it:
print "" |& "cat"
And this obviously is not ideal. (Though I can live with it, if there's no better way around this problem.)
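Spelled out, that workaround looks something like this (with "cat" standing in for the real coprocess command; the dummy line is read back so it does not end up in the coprocess stream):
BEGIN {
    ## Force the coprocess to start now, while memory usage is still low;
    ## the fork() happens on the first use of |& with "cat".
    print "" |& "cat"
    "cat" |& getline junk          # read the dummy line back and discard it

    ## Read big file into memory -- takes about 50% of memory
    while ((getline < "bigFile") > 0) {
        list[$0]
    }
}
{
    print |& "cat"
}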
Any ideas on cleaner solutions to this problem?

Completely restore a binary from memory?

I want to know if it's possible to completely restore the binary running in memory.
This is what I've tried:
First read /proc/PID/maps, then dump all the relevant regions with gdb (ignoring the libraries).
grep sleep /proc/1524/maps | awk -F '[- ]' \
'{print "dump memory sleep." $1 " 0x" $1 " 0x" $2 }' \
| gdb -p 1524
Then I concatenate all dumps in order:
cat sleep.* > sleep-bin
But the resulting file is very different from /bin/sleep.
The differences seem to be the relocation table and other uninitialized data, so is it impossible to fix up a memory dump, i.e. make it runnable again?
Disclaimer: I'm a Windows guy and don't know much about Linux process internals or the ELF format, but I hope I can help!
I would say it's definitely possible to do, but not for ALL programs. The OS loader only loads the parts of the executable that sit in well-defined places in the file. For example, some uninstallers store data appended to the end of the executable file; this is never loaded into memory, so it is information you cannot restore just by dumping memory.
Another problem is that the in-memory image is free to be modified by anything on the system that has the rights to do so, though no normal program would do something like that.
The starting point would be to find the ELF headers of your executable module in memory and dump that. It will contain pretty much all the data you need for your task. For example:
the number of sections and where they are in memory and in the file
how sections in the file are mapped to sections in virtual memory (they usually have different base addresses and sizes!)
where the relocation data is
For the relocations you would have to read up on how the reloc data is stored and processed in the ELF format. Once you know that, it should be pretty easy to undo the changes in your dump.
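As a first sanity check on a dump that starts with the ELF header (sleep-bin from the question, assuming the first mapped region was captured), readelf will show whether the header and program headers survived and how file offsets map to virtual addresses; the section headers usually will not be there, since they live at the end of the file and are never loaded:
readelf -h sleep-bin    # ELF header: type, entry point, program header offset
readelf -l sleep-bin    # program headers: file offsets vs. virtual addresses
readelf -S sleep-bin    # section headers -- typically absent in a memory dump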

popen() system call hangs in HP-Ux 11.11

I have a program which calculates the 'Printer Queues Total' value using /usr/bin/lpstat through a popen() call.
#include <stdio.h>
#include <string.h>
#include <errno.h>

int main(void)
{
    int n = 0;
    FILE *fp = NULL;
    printf("Before popen()");
    /* count the printer queue lines reported by lpstat */
    fp = popen("/usr/bin/lpstat -o | grep '^[^ ]*-[0-9]*[ \t]' | wc -l", "r");
    printf("After popen()");
    if (fp == NULL)
    {
        printf("Failed to start lpstat - %s", strerror(errno));
        return -1;
    }
    printf("Before fscanf");
    fscanf(fp, "%d", &n);
    printf("After fscanf");
    printf("Before pclose()");
    pclose(fp);
    printf("After pclose()");
    printf("Value=%d", n);
    printf("=== END ===");
    return 0;
}
Note: On the command line, the /usr/bin/lpstat command itself hangs for some time, as there are many printers available on the network.
The problem here is that execution hangs at the popen() call, whereas I would expect it to hang at fscanf(), which reads the output from the file stream fp.
If anybody can tell me the reason for the hang at popen(), it will help me modify the program to work for my requirement.
Thanks for taking the time to read this post.
What people expect does not always have a basis in reality :-)
The command you're running doesn't actually generate any output until it's finished. That would be why it would seem to be hung in the popen rather than the fscanf.
There are two possible reasons for that which spring to mind immediately.
The first is that it's implemented this way, with popen capturing the output in full before delivering the first line. Based on my knowledge of UNIX, this seems unlikely but I can't be sure.
Far more likely is the impact of the pipe. One thing I've noticed is that some filters (like grep) batch up their lines for efficiency. So, while lpstat itself may be spewing forth its lines immediately (well, until it gets to the delay bit anyway), the fact that grep holds on to the lines until it has a big enough block may be causing the delay.
In fact, it's almost certainly the pipe-through-wc, which cannot generate any output until all lines are received from lpstat (you cannot figure out how many lines there are until all the lines have been received). So, even if popen just waited for the first character to be available, that would seem to be where the hang was.
It would be a simple matter to test this by simply removing the pipe-through-grep-and-wc bit and seeing what happens.
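You can see the same effect at the shell: the command below prints nothing for about five seconds, then emits 3 all at once, because wc -l cannot report a count until it has seen end-of-file on its input.
{ echo a; echo b; sleep 5; echo c; } | wc -l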
Just one other point I'd like to raise: your printf statements are not followed by newlines and, even if they were, there are circumstances where the output may still be fully buffered (so you probably wouldn't see anything until the program exited, or the buffer filled up).
I would start by changing them to the form:
printf ("message here\n"); fflush (stdout); fsync (fileno (stdout));
to ensure they're flushed fully before continuing. I'd hate this to be a simple misunderstanding of a buffering issue :-)
It sounds as if popen may be hanging whilst lpstat attempts to retrieve information from remote printers. There is a fair amount of discussion on this particular problem. Have a look at that thread, and especially the ones that are linked from that.

Finding what hard drive sectors occupy a file

I'm looking for a nice easy way to find out which sectors a given file occupies. My language preference is C#.
From my A-Level Computing class I was taught that a hard drive has a lookup table in the first few KB of the disk. In this table there is a linked list for each file detailing which sectors that file occupies. So I'm hoping there's a convenient way to look in this table for a certain file and see what sectors it occupies.
I have tried Googling, but I'm finding nothing useful. Maybe I'm not searching for the right thing, but I can't find anything at all.
Any help is appreciated, thanks.
About Drives
The physical geometry of modern hard drives is no longer directly accessible by the operating system. Early hard drives were simple enough that it was possible to address them according to their physical structure, cylinder-head-sector. Modern drives are much more complex and use systems like zone bit recording, in which not all tracks have the same number of sectors. It's no longer practical to address them according to their physical geometry.
from the fdisk man page:
If possible, fdisk will obtain the disk geometry automatically. This is not necessarily the physical disk geometry (indeed, modern disks do not really have anything like a physical geometry, certainly not something that can be described in simplistic Cylinders/Heads/Sectors form)
To get around this problem, modern drives are addressed using Logical Block Addressing (LBA), which is what the operating system knows about. LBA is an addressing scheme in which the entire disk is represented as a linear sequence of blocks, each block being a uniform number of bytes (usually 512 or larger).
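You can see that LBA view directly; for example, blockdev reports a drive's size as a count of 512-byte sectors and its advertised logical sector size, with no cylinders or heads in sight (replace /dev/sda with your drive):
sudo blockdev --getsz /dev/sda   # total size as a number of 512-byte sectors
sudo blockdev --getss /dev/sda   # logical sector (block) size in bytes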
About Files
In order to understand where a "file" is located on a disk (at the LBA level) you will need to understand what a file is. This is going to be dependent on what file system you are using. In Unix style file systems there is a structure called an inode which describes a file. The inode stores all the attributes a file has and points to the LBA location of the actual data.
Ubuntu Example
Here's an example of finding the LBA location of file data.
First get your file's inode number
$ ls -i
659908 test.txt
Run the file system debugger. "yourPartition" will be something like sda1; it is the partition that your file system is located on.
$ sudo debugfs /dev/yourPartition
debugfs: stat <659908>
Inode: 659908 Type: regular Mode: 0644 Flags: 0x80000
Generation: 3039230668 Version: 0x00000000:00000001
...
...
Size of extra inode fields: 28
EXTENTS:
(0): 266301
The number under "EXTENTS", 266301, is the logical block within the file system where your file's data is located. If your file is large, there will be multiple blocks listed. There's probably an easier way to get that number, but I couldn't find one.
To validate that we have the right block, use dd to read it off the disk. To find out your file system's block size, use dumpe2fs:
dumpe2fs -h /dev/yourPartition | grep "Block size"
Then put your block size in the ibs= parameter, and the extent logical block in the skip= parameter, and run dd like this:
sudo dd if=/dev/yourPartition of=success.txt ibs=4096 count=1 skip=266301
success.txt should now contain the original file's contents.
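If you want that position as an absolute sector on the whole disk rather than a block within the file system, you can convert it by hand. A rough sketch, assuming the 4096-byte block size from above and a partition start sector taken from fdisk -l (2048 here is only a placeholder):
BLOCK=266301       # extent block reported by debugfs
BLOCK_SIZE=4096    # file system block size from dumpe2fs
PART_START=2048    # partition's first sector, from: sudo fdisk -l /dev/yourDisk
echo $(( PART_START + BLOCK * BLOCK_SIZE / 512 ))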
sudo hdparm --fibmap file
Works for ext, vfat and NTFS, and maybe more.
FIBMAP is also available to C programs as a Linux ioctl.
