GNU Parallel -- How to understand "block-size" setting, and guess what to set it to? - grep

How do I set the block-size parameter when running grep using GNU parallel on a single machine with multiple cores, based on the "large_file" filesize, "small_file" filesize and machine I'm using to get the fastest performance possible (or please correct me if there is something else that im missing here)? What are the performance issues/speed bottlenecks I'll run into when setting it too high or too low? I understand what block-size does, in that it blocks out the large_file in chunks, and sends those chunks to each job, but I'm still missing the potential for how and why that would impact speed of execution.
The command in question:
parallel --pipepart --block 100M --jobs 10 -a large_file.csv grep -f small_file.csv
where large_file.csv has in it:
123456 1
234567 2
345667 22
and
where small_file.csv has in it:
1$
2$
and so on...
Thank you!

parallel --pipepart --block -1 --jobs 10 -a large_file.csv grep -f small_file.csv
--block -1 will split large_file.csv into one block per jobslot (here 10 chunks). The splitting will be done on the fly, so it will not be read into RAM to do the splitting.
Splitting into n evenly sized blocks (where n = number of jobs to run in parallel) often makes sense if the time spent per line is roughly the same. If it varies a lot (say, some lines take 100 times longer to process than others), then it may make more sense to chop into more bits. E.g. --block -10 will split into 10 times as many blocks as --block -1.
The optimal value can seldom be guessed in advance, because it may also depend on how fast your disk is. So try different values and identify where the bottleneck is. It is typically one of disk I/O, CPU, RAM, command startup time.

Related

Snakemake limit the memory usage of jobs

I need to run 20 genomes with a snakemake. So I am using basic steps like alignment, markduplicates, realignment, basecall recalibration and so on in snakemake. The machine I am using has up to 40 virtual cores and 70G memory and I run the program like this.
snakemake -s Snakefile -j 40
This works fine, but as soon as It runs markduplicates along other programs, it stops as I think it overloads the 70 available giga and crashes.
Is there a way to set in snakemake the memory limit to 60G in total for all programs running? I would like snakemake runs less jobs in order to stay under 60giga, is some of the steps require a lot of memory. The command line below crashed as well and used more memorya than allocated.
snakemake -s Snakefile -j 40 --resources mem_mb=60000
It's not enough to specify --resources mem_mb=60000 on the command line, you need also to specify mem_mb for the rules you want to keep in check. E.g.:
rule markdups:
input: ...
ouptut: ...
resources:
mem_mb= 20000
shell: ...
rule sort:
input: ...
ouptut: ...
resources:
mem_mb= 1000
shell: ...
This will submit jobs in such way that you don't exceed a total of 60GB at any one time. E.g. this will keep running at most 3 markdups jobs, or 2 markdups jobs and 20 sort jobs, or 60 sort jobs.
Rules without mem_mb will not be counted towards memory usage, which is probably ok for rules that e.g. copy files and do not need much memory.
How much to assign to each rule is mostly up to your guess. top and htop commands help in monitoring jobs and figuring out how much memory they need. More elaborated solutions could be devised but I'm not sure it's worth it... If you use a job scheduler like slurm, the log files should give you the peak memory usage of each job so you can use them for future guidance. Maybe others have better suggestions.

Array#product RangeError: too big to product

I have 93 arrays. Each array has about 18 values in average
I need to make a product of these arrays.
So I have my two dimension array that store these 93 arrays.
Here is what I try to do
DATASET.first.product(*DATASET[1..-1])
Ruby returns
RangeError: too big to product
Does anyone know some workaround to figure out of it?
Some ways to chunk them?
What you want is impossible.
The product of 93 arrays with ~18 elements each is an array with approximately 549975033204266172374216967425209467080301768557741749051999338598022831065169332830885722071173603516904554174087168 elements, each of which is a 93-element array.
This means you need 549975033204266172374216967425209467080301768557741749051999338598022831065169332830885722071173603516904554174087168 * 93 * 64bit of memory to store it, which is roughly 409181424703974032246417423764355843507744515806959861294687507916928986312485983626178977220953161016576988305520852992 bytes. That is about 40 orders of magnitude more than the number of particles in the universe. In other words, even if you were to convert the entire universe into RAM, you would still need to find a way to store on the order of 827180612553027 yobibyte on each and every particle in the universe; that is about 6000000000000000000000000 times the information content of the World Wide Web and 10000000000000000000000 times the information content of the dark web.
Does anyone know some workaround to figure out of it? Some ways to chunk them?
Even if you process them in chunks, that doesn't change the fact that you still need to process 51147678087996754030802177970544480438468064475869982661835938489616123289060747953272372152619145127072123538190106624 elements. Even if you were able to process one element per CPU instruction (which is unrealistic, you will probably need dozens if not hundreds of instructions), and even if each instruction only takes one clock cycle (which is unrealistic, on current mainstream CPUs, each instruction takes multiple clock cycles), and even if you had a terahertz CPU (which is unrealistic, the fastest current CPUs top out at 5 GHz), and even if your CPU had a million cores (which is unrealistic, even GPUs only have a couple of thousand extremely simple cores), and even if your motherboard had a million sockets (which is unrealistic, mainstream motherboards only have a maximum of 4 sockets, and even the biggest supercomputers only have 10 million cores in total), and even if you had a million of those computers in a cluster, and even if you had a million of those clusters in a supercluster, and even if you had a million friends that also have a supercluster like this, it would still take you about 1621000000000000000000000000000000000000000000000000000000000000000000 years to iterate through them.
Right, so as it is hopefully clear that this should not be attempted I'll take a risk and attempt solving your actual problem.
You've mentioned in the comments that you need this array for property testing - I'll take a massive leap of faith here and assume you want to test that every possible combination satisfies some conditions - and this is the mistake here, as the amount of possible combination is just... large...
Instead, you can test that some of the combinations works. You can easily generate a short, randomized list of combinations using:
Array.new(num) { DATASET.map(&:sample) }
Where num is a number of combinations you want to test. Note that there is a chance that some of the entries will be duplicated - but given your dataset size the chances would be comparable with colliding uuids and can be safely ignored.
Generating such a subset of possible solutions is much easier, faster and, most importantly, possible. Since the output is randomized, it will test slightly different combination on each run, so remember to have some randomization setup in your test suite if you want to be able to recreate failures.

GNU parallel saturates one server instead of distributing jobs equally

I am using GNU parallel 20160222. I have four servers configured in my ~/.parallel/sshloginfile:
48/big1
48/big2
8/small1
8/small2
when I run, say, 32 jobs, I'd expect parallel to start eight on each server. Or even better, two or three each on small1 and small2, and twelve or so each on big1 and big2. But what it is doing is starting 8 jobs on small2 and the remaining jobs locally.
Here is my invocation (I actually use a --profile but I removed it for simplicity):
parallel --verbose --workdir . --sshdelay 0.2 --controlmaster --sshloginfile .. \
"my_cmd {} | gzip > {}.gz" ::: $(seq 1 32)
Here is the main question:
Is there an option missing that would do a more equal allocation of jobs?
Here is another related question:
Is there a way to specify --memfree, --load, etc. per server? Especially --memfree.
I recall GNU Parallel used to fill job slots "from one end". This did not matter if you had way more jobs than job slots: All job slots (both local and remote) would fill up.
It did, however, matter if you had fewer jobs. So it was changed, so GNU Parallel today gives jobs to sshlogins in a round robin fashion - thus spreading it more evenly.
Unfortunately I do not recall which version this change was done. But your can tell if you version does it by running:
parallel -vv -t
and look at which sshlogin is being used.
Re: --memfree
You can build your own using --limit.
I am curious why you want different limits for different servers. The idea behind --memfree is that it is set to the amount of RAM that a single job takes. So if there is enough RAM for a single job, a new job should be started - no matter the server.
You clearly have another situation, so explain about that.
Re: upgrading
Look into parallel --embed.

In SPMD using GNU parallel, is processing the smallest files first the most efficient way?

This is pretty straight forward:
Say I have many files in the folder data/ to process via some executable ./proc. What is the simplest way to maximize efficiency? I have been doing this to gain some efficiency:
ls --sort=size data/* | tac | parallel ./proc
which lists the data according to size, then tac (reverse of cat) flips the order of that output so the smallest files are processed first. Is this the most efficient solution? If not, how can the efficiency be improved (simple solutions preferred)?
I remember that sorting like this leads to better efficiency since larger jobs don't block up the pipeline, but aside from examples I can't find or remember any theory behind this, so any references would be greatly appreciated!
If you need to run all jobs and want to optimize for time to complete them all, you want them to finish the same time. In that case you should run the small jobs last. Otherwise you may have the situation where all cpus are done except one that just started on the last big job. Here you will waste CPU time for all CPUs except the one.
Here are 8 jobs: 7 take 1 second, one takes 5:
1 2 3 4 55555 6 7 8
On a dual core small jobs first:
1368
24755555
On a dual core big jobs first:
555557
123468

Relation between Linux /proc/meminfo and /sys/devices/system/node/nodex/meminfo

I'd like to get the amount of "free memory" per NUMA node.
When dealing with a whole machine, one usually parses /proc/meminfo like free does (the number wanted is MemFree + Buffers + Cached).
There also exist /sys/devices/system/node/nodex/meminfo, which seem to display numbers per NUMA node. Does anybody know how these numbers can be correlated to the content of /proc/meminfo? My trivial assumption would be to sum up some numbers for all NUMA nodes in the system, and the result is equal to some number in /proc/meminfo. But so far I failed to figure out the relationships, especially for page caches.
The code for proc is in fs/proc/meminfo.c, for the sysfs files it's in drivers/base/node.c. Comparing them might give you some hints.
Note that you'll probably never get the numbers to add up 100%, because you can't atomically read the content of all the files, so the values will change while you're reading them.
There also seems to be an inconsistency in the total RAM reported via both methods. One explanation for that is that free_init_mem doesn't appear to be NUMA aware, and increments total_ram_pages but does not do any NUMA accounting.

Resources