In SPMD using GNU parallel, is processing the smallest files first the most efficient way?

This is pretty straightforward:
Say I have many files in the folder data/ to process via some executable ./proc. What is the simplest way to maximize efficiency? I have been doing this to gain some efficiency:
ls --sort=size data/* | tac | parallel ./proc
which lists the data according to size, then tac (reverse of cat) flips the order of that output so the smallest files are processed first. Is this the most efficient solution? If not, how can the efficiency be improved (simple solutions preferred)?
I remember that sorting like this leads to better efficiency since larger jobs don't block up the pipeline, but aside from examples I can't find or remember any theory behind this, so any references would be greatly appreciated!

If you need to run all the jobs and want to optimize for the time to complete them all, you want them to finish at the same time. In that case you should run the small jobs last. Otherwise you may get the situation where all CPUs are done except one that has just started on the last big job; every CPU but that one then sits idle.
Here are 8 jobs; 7 take 1 second, one (job 5) takes 5 seconds. Each digit below represents one second of work:
1 2 3 4 55555 6 7 8
On a dual core, small jobs first (everything is done after 8 seconds):
core1: 1368
core2: 24755555
On a dual core, big jobs first (everything is done after 6 seconds):
core1: 555557
core2: 123468
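In practice that means reversing the order in the original command. A minimal sketch, assuming processing time grows roughly with file size (and that no file name contains whitespace, since this parses ls output):
ls --sort=size data/* | parallel ./proc
ls --sort=size already lists the largest files first, so simply dropping the tac starts the big jobs first.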

Related

Snakemake limit the memory usage of jobs

I need to run 20 genomes with Snakemake. So I am using basic steps like alignment, markduplicates, realignment, base-call recalibration and so on in Snakemake. The machine I am using has up to 40 virtual cores and 70 GB of memory, and I run the program like this:
snakemake -s Snakefile -j 40
This works fine, but as soon as it runs markduplicates alongside other programs, it stops; I think it overloads the available 70 GB and crashes.
Is there a way to set a memory limit in Snakemake of 60 GB in total for all running programs? I would like Snakemake to run fewer jobs in order to stay under 60 GB, as some of the steps require a lot of memory. The command line below crashed as well and used more memory than allocated:
snakemake -s Snakefile -j 40 --resources mem_mb=60000
It's not enough to specify --resources mem_mb=60000 on the command line; you also need to specify mem_mb for the rules you want to keep in check. E.g.:
rule markdups:
    input: ...
    output: ...
    resources:
        mem_mb = 20000
    shell: ...

rule sort:
    input: ...
    output: ...
    resources:
        mem_mb = 1000
    shell: ...
This will submit jobs in such a way that you don't exceed a total of 60 GB at any one time. E.g. it will keep running at most 3 markdups jobs, or 2 markdups jobs and 20 sort jobs, or 60 sort jobs.
Rules without mem_mb will not count towards memory usage, which is probably fine for rules that e.g. copy files and do not need much memory.
How much to assign to each rule is mostly guesswork; the top and htop commands help in monitoring jobs and figuring out how much memory they need. More elaborate solutions could be devised, but I'm not sure it's worth it. If you use a job scheduler like Slurm, the log files should give you the peak memory usage of each job, which you can use for future guidance. Maybe others have better suggestions.
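If GNU time is installed, one quick way to calibrate mem_mb is to run a single representative job under it and read off the peak resident set size. A minimal sketch (markdups_cmd and sample1.bam are made-up stand-ins for your actual command and input; the reported value is in kilobytes):
/usr/bin/time -v ./markdups_cmd sample1.bam 2> markdups.time
grep 'Maximum resident set size' markdups.time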

GNU Parallel -- How to understand "block-size" setting, and guess what to set it to?

How do I set the block-size parameter when running grep using GNU parallel on a single machine with multiple cores, based on the large_file size, the small_file size and the machine I'm using, to get the fastest performance possible (or please correct me if there is something else I'm missing here)? What are the performance issues/speed bottlenecks I'll run into when setting it too high or too low? I understand what block-size does, in that it chops large_file into chunks and sends those chunks to each job, but I'm still missing how and why that would impact the speed of execution.
The command in question:
parallel --pipepart --block 100M --jobs 10 -a large_file.csv grep -f small_file.csv
where large_file.csv has in it:
123456 1
234567 2
345667 22
and
where small_file.csv has in it:
1$
2$
and so on...
Thank you!
parallel --pipepart --block -1 --jobs 10 -a large_file.csv grep -f small_file.csv
--block -1 will split large_file.csv into one block per jobslot (here 10 chunks). The splitting is done on the fly, so the file is not read into RAM to do the splitting.
Splitting into n evenly sized blocks (where n = number of jobs to run in parallel) often makes sense if the time spent per line is roughly the same. If it varies a lot (say, some lines take 100 times longer to process than others), then it may make more sense to chop into more bits. E.g. --block -10 will split into 10 times as many blocks as --block -1.
The optimal value can seldom be guessed in advance, because it may also depend on how fast your disk is. So try different values and identify where the bottleneck is. It is typically one of disk I/O, CPU, RAM, command startup time.
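Since the optimal value depends on the data and the hardware, one practical approach is simply to time a few candidate values and compare. A minimal sketch, reusing the command from the question (output discarded so writing results does not skew the comparison):
for b in -1 -2 -10 100M; do
  echo "--block $b"
  time parallel --pipepart --block $b --jobs 10 -a large_file.csv grep -f small_file.csv > /dev/null
done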

PPO Update Schedule in OpenAI Baselines Implementations

I'm trying to read through the PPO1 code in OpenAI's Baselines implementation of RL algorithms (https://github.com/openai/baselines) to gain a better understanding of how PPO works, how one might go about implementing it, etc.
I'm confused as to the difference between the "optim_batchsize" and the "timesteps_per_actorbatch" arguments that are fed into the "learn()" function. What are these hyper-parameters?
In addition, I see in the "run_atari.py" file that the "make_atari" and "wrap_deepmind" functions are used to wrap the environment. In the "make_atari" function, it uses the "EpisodicLifeEnv", which ends the episode once a life is lost. On average, I see that the episode length at the beginning of training is about 7-8 timesteps, but the batch size is 256, so I don't see how any updates can occur. Thanks in advance for your help.
I've been going through it on my own as well... their code is a nightmare!
optim_batchsize is the batch size used when optimizing the policy, while timesteps_per_actorbatch is the number of time steps the agent runs between optimizations.
On the episodic thing, I am not sure. Two ways it could happen: one is waiting until the 256 entries are filled before actually updating; the other is filling the batch with dummy data that does nothing, effectively only updating on the 7 or 8 steps that the episode lasted.

Talend - Memory issues. Working with big files

Before admins start eating me alive, I would like to say in my defense that I cannot comment on the original posts because I do not have enough reputation; therefore, I have to ask about this again.
I have issues running a job in Talend (Open Studio for Big Data). I have an archive of 3 GB. I do not consider this too much, since I have a computer with 32 GB of RAM.
While trying to run my job, first I got an error related to a heap memory issue, then it changed to a garbage collector error, and now it doesn't even give me an error (it just does nothing and then stops).
I found these solutions:
a) Talend performance
@Kailash commented that parallelization is only available if I subscribe to one of the Talend Platform solutions. My comment/question: so is there no other similar option to parallelize a job with a 3 GB archive?
b) Talend 10 GB input and lookup out of memory error
@54l3d mentioned that it's an option to split the lookup file into manageable chunks (maybe 500 MB), then perform the join in many stages, one for each chunk. My comment/cry for help/question: how can I do that? I do not know how to split the lookup; can someone explain this to me a bit more graphically?
c) How to push a big file data in talend?
Just to mention that I also went through "c", but I don't have any comment about it.
The job I am performing (thanks to @iMezouar) looks like this:
1) I have an input, MySQLInput, coming from a DB in MySQL (3 GB)
2) I used tFirstRows to make the process easier (not working)
3) I used tSplitRow to transform the data from many similar columns into only one column
4) MySQLOutput
Thanks again for reading me and double thanks for answering.
From what I understand, your query returns a lot of data (3 GB), and that is causing an error in your job. I suggest the following:
1. Filter the data on the database side: replace tSampleRow with a WHERE clause in your tMysqlInput component in order to retrieve fewer rows in Talend.
2. The MySQL JDBC driver by default retrieves all data into memory, so you need to use the stream option in tMysqlInput's advanced settings in order to stream rows.
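As for splitting the lookup file into chunks (option b in the question), a hedged sketch from the shell, assuming GNU split and a line-based lookup file (lookup.csv and the lookup_chunk_ prefix are made-up names):
split -C 500m -d lookup.csv lookup_chunk_
This produces lookup_chunk_00, lookup_chunk_01, and so on, each at most 500 MB and split on line boundaries; you would then run the join once per chunk and concatenate the results.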

erlang:now/0 is faster than os:timestamp/0?

As said in the manual (http://www.erlang.org/erldoc?q=erlang:now):
If you do not need the return value to be unique and monotonically increasing, use os:timestamp/0 instead to avoid some overhead.
So os:timestamp/0 should be faster than erlang:now/0.
But I tested on my PC with timer:tc/3, and for 10,000,000 calls the time spent in microseconds is:
erlang:now    951000
os:timestamp  1365000
Why is erlang:now/0 faster than os:timestamp/0?
My OS: Windows 7 x64; Erlang version: R16B01.
----------------- edit -----------------
I wrote another test running in parallel (100 threads); os:timestamp/0 performed better in parallel. Here are the data:
----- single thread -----
erlang:now    95000
os:timestamp  147000
----- multi thread -----
erlang:now    333000
os:timestamp  91000
So I think the "some overhead" applies to the parallel case.
I've always thought that the 'some overhead' comment was darkly amusing. The way erlang:now/0 achieves its trick of providing guaranteed unique, monotonically increasing values is to take out a per-VM global lock. In a serial test you won't notice anything, but when you've got a lot of parallel code running, you may.
The function os:timestamp/0 doesn't take out a lock and may return the same value in two processes.
This was recently discussed on the erlang-questions mailing list ("erlang:now() vs os:timestamp()", 3rd April 2013), where two interesting results emerged:
- erlang:now seems to be faster than os:timestamp in interpreted code (as opposed to compiled code, where os:timestamp is faster).
- If you benchmark them, you should measure the time taken using os:timestamp instead of erlang:now, since erlang:now forces the clock to advance.
Apart from the excellent answer by troutwine, the reason why erlang:now() is faster in a serial test is probably that it avoids a kernel call: when you call it faster than the clock advances, it only has to bump its own counter to keep the values unique and increasing, so it does not hit the kernel as often.
But note that your test is deceiving until you add more than a single core. Then os:timestamp(), as troutwine writes, will outperform erlang:now().
Also note you are on a weak platform, namely Windows, which usually affects performance in non-trivial ways.
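For anyone who wants to reproduce the serial numbers from the shell, a minimal sketch (assumes erl is on the PATH; the loop count is reduced to 1,000,000, and note that erlang:now/0 is deprecated on modern Erlang/OTP releases):
erl -noshell -eval '
    Loop = fun(F) -> [F() || _ <- lists:seq(1, 1000000)], ok end,
    {TNow, _} = timer:tc(fun() -> Loop(fun erlang:now/0) end),
    {TTs, _} = timer:tc(fun() -> Loop(fun os:timestamp/0) end),
    io:format("erlang:now:   ~p us~nos:timestamp: ~p us~n", [TNow, TTs])
' -s init stop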
