Snakemake: limit the memory usage of jobs

I need to process 20 genomes with Snakemake, using basic steps like alignment, mark duplicates, realignment, base call recalibration and so on. The machine I am using has 40 virtual cores and 70 GB of memory, and I run the pipeline like this:
snakemake -s Snakefile -j 40
This works fine, but as soon as it runs mark duplicates alongside other programs, it stops; I think it exceeds the available 70 GB and crashes.
Is there a way to set a total memory limit of 60 GB in Snakemake for all running programs? I would like Snakemake to run fewer jobs in order to stay under 60 GB, since some of the steps require a lot of memory. The command line below crashed as well and used more memory than allocated.
snakemake -s Snakefile -j 40 --resources mem_mb=60000

It's not enough to specify --resources mem_mb=60000 on the command line; you also need to specify mem_mb for the rules you want to keep in check. E.g.:
rule markdups:
    input: ...
    output: ...
    resources:
        mem_mb=20000
    shell: ...

rule sort:
    input: ...
    output: ...
    resources:
        mem_mb=1000
    shell: ...
This will schedule jobs in such a way that you don't exceed a total of 60 GB at any one time. E.g. it will keep running at most 3 markdups jobs, or 2 markdups jobs and 20 sort jobs, or 60 sort jobs.
Rules without mem_mb will not be counted towards memory usage, which is probably ok for rules that e.g. copy files and do not need much memory.
How much to assign to each rule is mostly up to your guess. The top and htop commands help in monitoring jobs and figuring out how much memory they need. More elaborate solutions could be devised, but I'm not sure it's worth it. If you use a job scheduler like SLURM, the log files should give you the peak memory usage of each job, so you can use those for future guidance. Maybe others have better suggestions.
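If you prefer measurements over guesses, one option is Snakemake's benchmark directive, which writes a tab-separated file per job that includes max_rss, the peak resident memory in MB. A minimal sketch, assuming a reasonably recent Snakemake; the benchmark path and the {sample} wildcard are placeholders:
rule markdups:
    input: ...
    output: ...
    benchmark:
        "benchmarks/markdups/{sample}.tsv"   # one TSV per job with runtime and peak memory (max_rss, MB)
    resources:
        mem_mb=20000
    shell: ...
After a first run you can read max_rss from the benchmark files and set mem_mb accordingly, with some headroom.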

Related

GNU parallel saturates one server instead of distributing jobs equally

I am using GNU parallel 20160222. I have four servers configured in my ~/.parallel/sshloginfile:
48/big1
48/big2
8/small1
8/small2
When I run, say, 32 jobs, I'd expect parallel to start eight on each server, or even better two or three each on small1 and small2 and twelve or so each on big1 and big2. But what it actually does is start 8 jobs on small2 and run the remaining jobs locally.
Here is my invocation (I actually use a --profile but I removed it for simplicity):
parallel --verbose --workdir . --sshdelay 0.2 --controlmaster --sshloginfile .. \
"my_cmd {} | gzip > {}.gz" ::: $(seq 1 32)
Here is the main question:
Is there an option missing that would do a more equal allocation of jobs?
Here is another related question:
Is there a way to specify --memfree, --load, etc. per server? Especially --memfree.
I recall GNU Parallel used to fill job slots "from one end". This did not matter if you had far more jobs than job slots: all job slots (both local and remote) would fill up.
It did, however, matter if you had fewer jobs. So it was changed: GNU Parallel today gives jobs to sshlogins in a round-robin fashion, thus spreading them more evenly.
Unfortunately I do not recall in which version this change was made, but you can tell whether your version does it by running:
parallel -vv -t
and looking at which sshlogin is being used.
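For example, as a hypothetical quick check (echo stands in for your real command, and --slf .. points at the default ~/.parallel/sshloginfile):
parallel -vv -t --slf .. echo ::: 1 2 3 4
The verbose/trace output should show the ssh commands and therefore which sshlogin each job goes to.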
Re: --memfree
You can build your own using --limit (see the sketch below).
I am curious why you want different limits for different servers. The idea behind --memfree is that it is set to the amount of RAM that a single job takes. So if there is enough RAM for a single job, a new job should be started - no matter the server.
You clearly have a different situation, so please explain more about it.
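As a very rough sketch of building your own check with --limit (the helper script name, the 4 GB threshold, and the exact exit-code convention are my assumptions; check the --limit section of your version's man page before relying on it):
#!/bin/bash
# check_mem.sh: exit 0 if enough memory appears free to start another job, non-zero otherwise
free_mb=$(free -m | awk '/^Mem:/ {print $7}')   # "available" column of free -m
[ "$free_mb" -ge 4096 ]
and then something like:
parallel --slf .. --limit ./check_mem.sh "my_cmd {} | gzip > {}.gz" ::: $(seq 1 32)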
Re: upgrading
Look into parallel --embed.

Drake Installation Freeze

I am trying to install the Python bindings of Drake. After make -j it freezes. I believe I have done everything correctly in the previous steps. Can anyone help? I am running Ubuntu 18.04 with Python 3.6.9.
Thank you in advance. It looks like this.
[Screenshot: frozen terminal]
Use make (no -j flag) or make -j1, because Bazel (which is called internally during the build) handles the parallelism of the build (and of the tests) and by default sets the number of jobs to the number of cores (apparently 8 in your case).
To reduce the number of jobs to fewer than the number of cores, create a file named user.bazelrc at the root of the repository (same level as the WORKSPACE file) with the content
test --jobs=N
for some N less than the number of cores that you have.
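For example, to cap parallelism at 4 jobs, user.bazelrc could contain the following (the build line is my own addition; the answer above only mentions test, so treat it as an assumption):
build --jobs=4
test --jobs=4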
See also https://docs.bazel.build/versions/master/guide.html#bazelrc.
From the screenshot, it doesn't look like the Drake build system is doing anything wrong, but make -j is probably trying to do too many things in parallel. Try starting with -j4 and, if it still freezes, go down to 2, and so on.
It is possibly running out of memory.
A hacky solution is to change the CMakeLists.txt file to set the maximum number of jobs Bazel uses by adding --jobs N (where N is the number of jobs you allow to run concurrently) after ${BAZEL_TARGETS}, like so:
ExternalProject_Add(drake_cxx_python
  SOURCE_DIR "${PROJECT_SOURCE_DIR}"
  CONFIGURE_COMMAND :
  BUILD_COMMAND
    ${BAZEL_ENV}
    "${Bazel_EXECUTABLE}"
    ${BAZEL_STARTUP_ARGS}
    build
    ${BAZEL_ARGS}
    ${BAZEL_TARGETS}
    --jobs 1
  BUILD_IN_SOURCE ON
  BUILD_ALWAYS ON
  INSTALL_COMMAND
    ${BAZEL_ENV}
    "${Bazel_EXECUTABLE}"
    ${BAZEL_STARTUP_ARGS}
    run
    ${BAZEL_ARGS}
    ${BAZEL_TARGETS}
    --
    ${BAZEL_TARGETS_ARGS}
  USES_TERMINAL_BUILD ON
  USES_TERMINAL_INSTALL ON
)

GNU Parallel -- How to understand "block-size" setting, and guess what to set it to?

How do I set the block-size parameter when running grep with GNU parallel on a single machine with multiple cores, based on the size of large_file, the size of small_file, and the machine I'm using, to get the fastest performance possible (or please correct me if there is something else I'm missing here)? What are the performance issues/speed bottlenecks I'll run into when setting it too high or too low? I understand what block-size does: it splits large_file into chunks and sends those chunks to each job. But I'm still missing how and why that would impact the speed of execution.
The command in question:
parallel --pipepart --block 100M --jobs 10 -a large_file.csv grep -f small_file.csv
where large_file.csv has in it:
123456 1
234567 2
345667 22
and
where small_file.csv has in it:
1$
2$
and so on...
Thank you!
parallel --pipepart --block -1 --jobs 10 -a large_file.csv grep -f small_file.csv
--block -1 will split large_file.csv into one block per jobslot (here 10 chunks). The splitting will be done on the fly, so it will not be read into RAM to do the splitting.
Splitting into n evenly sized blocks (where n = number of jobs to run in parallel) often makes sense if the time spent per line is roughly the same. If it varies a lot (say, some lines take 100 times longer to process than others), then it may make more sense to chop into more bits. E.g. --block -10 will split into 10 times as many blocks as --block -1.
The optimal value can seldom be guessed in advance, because it may also depend on how fast your disk is. So try different values and identify where the bottleneck is. It is typically one of disk I/O, CPU, RAM, command startup time.
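A rough way to do that, just as a sketch (the block values here are arbitrary, and the redirect to /dev/null is only there so you time the work itself):
for b in -1 -2 -10 100M 500M; do
    echo "--block $b"
    time parallel --pipepart --block $b --jobs 10 -a large_file.csv grep -f small_file.csv > /dev/null
done
While it runs, watch CPU usage with top and disk I/O with a tool like iostat to see which resource saturates first.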

In SPMD using GNU parallel, is processing the smallest files first the most efficient way?

This is pretty straightforward:
Say I have many files in the folder data/ to process via some executable ./proc. What is the simplest way to maximize efficiency? I have been doing this to gain some efficiency:
ls --sort=size data/* | tac | parallel ./proc
which lists the data according to size, then tac (reverse of cat) flips the order of that output so the smallest files are processed first. Is this the most efficient solution? If not, how can the efficiency be improved (simple solutions preferred)?
I remember that sorting like this leads to better efficiency since larger jobs don't block up the pipeline, but aside from examples I can't find or remember any theory behind this, so any references would be greatly appreciated!
If you need to run all the jobs and want to minimize the total time to complete them, you want them to finish at the same time. In that case you should run the small jobs last. Otherwise you may end up with all CPUs done except one that has just started on the last big job; every CPU but that one then sits idle.
Here are 8 jobs: 7 take 1 second each, one takes 5 seconds:
1 2 3 4 55555 6 7 8
On a dual core, small jobs first (everything done after 8 seconds):
1368
24755555
On a dual core, big job first (everything done after 6 seconds):
555557
123468
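Applied to the command in the question, that means dropping tac so the largest files start first (ls --sort=size already lists largest first):
ls --sort=size data/* | parallel ./proc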

How to estimate memory requirement for submitting a job to a cluster running SGE?

I am trying to submit a job to a cluster [running Sun Grid Engine (SGE)]. The job kept being terminated with the following report:
Job 780603 (temp_new) Aborted
Exit Status = 137
Signal = KILL
User = heaswara
Queue = std.q#comp-0-8.local
Host = comp-0-8.local
Start Time = 08/24/2013 13:49:05
End Time = 08/24/2013 16:26:38
CPU = 02:46:38
Max vmem = 12.055G
failed assumedly after job because:
job 780603.1 died through signal KILL (9)
The resource requirements I had set were:
#$ -l mem_free=10G
#$ -l h_vmem=12G
mem_free is the amount of memory my job requires, and h_vmem is the upper bound on the amount of memory the job is allowed to use. I suspect my job is being terminated because it requires more than that threshold (12G).
Is there a way to estimate how much memory will be required for my operation? I am trying to figure out what should be the upper bound.
Thanks in advance.
It depends on the nature of the job itself. If you know anything about the program that is being run (for example, because you wrote it), you should be able to estimate how much memory it is going to want. If not, your only recourse is to run it without the limit and see how much it actually uses.
I have a bunch of FPGA build and simulation jobs that I run. After each job, I track how much memory was actually used. I can use this historical information to estimate how much it might use in the future (I pad by 10% in case there are some weird changes in the source). I still have to redo the calculations whenever the vendor delivers a new version of the tools, though, as the memory footprint quite often changes dramatically.
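Two common ways to capture actual peak usage after a run (assuming GNU time is installed and SGE accounting is enabled; the program name is a placeholder, and the job id is the one from the report above):
/usr/bin/time -v ./my_program args 2> time.log    # look for "Maximum resident set size"
qacct -j 780603 | grep -E 'maxvmem|ru_maxrss'     # peak memory as recorded by SGE accounting
Running once with a generous h_vmem and then reading these numbers back is usually the quickest way to pick a realistic limit.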
