GNU parallel saturates one server instead of distributing jobs equally - gnu-parallel

I am using GNU parallel 20160222. I have four servers configured in my ~/.parallel/sshloginfile:
48/big1
48/big2
8/small1
8/small2
when I run, say, 32 jobs, I'd expect parallel to start eight on each server. Or even better, two or three each on small1 and small2, and twelve or so each on big1 and big2. But what it is doing is starting 8 jobs on small2 and the remaining jobs locally.
Here is my invocation (I actually use a --profile but I removed it for simplicity):
parallel --verbose --workdir . --sshdelay 0.2 --controlmaster --sshloginfile .. \
"my_cmd {} | gzip > {}.gz" ::: $(seq 1 32)
Here is the main question:
Is there an option missing that would do a more equal allocation of jobs?
Here is another related question:
Is there a way to specify --memfree, --load, etc. per server? Especially --memfree.

I recall GNU Parallel used to fill job slots "from one end". This did not matter if you had way more jobs than job slots: All job slots (both local and remote) would fill up.
It did, however, matter if you had fewer jobs. So it was changed, so GNU Parallel today gives jobs to sshlogins in a round robin fashion - thus spreading it more evenly.
Unfortunately I do not recall which version this change was done. But your can tell if you version does it by running:
parallel -vv -t
and look at which sshlogin is being used.
Re: --memfree
You can build your own using --limit.
I am curious why you want different limits for different servers. The idea behind --memfree is that it is set to the amount of RAM that a single job takes. So if there is enough RAM for a single job, a new job should be started - no matter the server.
You clearly have another situation, so explain about that.
Re: upgrading
Look into parallel --embed.

Related

How to run a python script in the background on azure

I have a uni project in which I have to run a number of machine learning algorithms like SVM, ME, Naive bayes, etc... and perform a grid search on them, to find the optimal sets of hyper-parameters. Running all these would take an exceedingly long amount of time (48-168 hours total but run- in batches) and considering my computer becomes more or less unusable while I run them, I was attempting to find a solution which allowed me to run my code externally. The scripts I have to run are in python and my plan was to run them on azure to make use of its "Azure for students" $100 credit.
My original plan was to use azure's ml notebook section and then run the python scripts in the terminal they provide. My problem with this route is as far as I can tell, when the browser closes, the computation stops which is a problem. I looked into it, and I found some articles mentioning a combination of 'ctrl-z', 'bg', and 'disown', to disconnect the process from the shell but I thought there should definitely be a better way to do it. (I also wasn't sure how this worked in my case where there were 8 processes running at once using gridsearchcv's n_jobs=-1 feature).
I then realized a better way to do this would be to use pipelines. My intent was to create a number of pipelines of the form:
(Import data in xlsx file) -> (python script to run ML) -> (export data to working directory)
And then run them until all the work is completed. In the first stage I used the parameters,
And I got the error,
My intention was to have the excel file pipe into the python script as a data frame but this implantation (and all the others I've tried) isn't working.
My question first question is, how do I get the excel data to pipe into the python script properly?
My second question is, is there a better way to go about doing this? Would running it on the shell be an easier way to do it? If so, how do ensure it runs while my browser is closed? Are there other services that would be better? My main metrics for this are price (Cheap) and time limit (ability to run for long time) but any suggestions would be greatly appreciated.
I also tried using google colab, this worked but it felt slower than running on my computer.
To run a grid search with AzureML, you would use the Sweep job. The simplest way to kick of a Sweep is via the CLI. See here for an example.
$schema: https://azuremlschemas.azureedge.net/latest/sweepJob.schema.json
type: sweep
trial:
command: >-
python hello-sweep.py
--A ${{inputs.A}}
--B ${{search_space.B}}
--C ${{search_space.C}}
code: src
environment: azureml:AzureML-sklearn-1.0-ubuntu20.04-py38-cpu#latest
inputs:
A: 0.5
compute: azureml:cpu-cluster
sampling_algorithm: random
search_space:
B:
type: choice
values: ["hello", "world", "hello_world"]
C:
type: uniform
min_value: 0.1
max_value: 1.0
objective:
goal: minimize
primary_metric: random_metric
limits:
max_total_trials: 4
max_concurrent_trials: 2
timeout: 3600
display_name: hello-sweep-example
experiment_name: hello-sweep-example
description: Hello sweep job example.
You can start that job using the AzureML v2 CLI with the following command:
az ml job create -f hello-sweep.yml
That will create max_total_trials number of jobs for different parameter combinations as defined in the search_space governed by the sampling_algorithm, which can be random, grid or bayesian.
The actual job that is started is defined under trial. You need a program or script of some sort that you can execute via a command line and that can take parameters via that command line. command is that command that is executed, code is a folder on the local machine that contains the script/program you want to run and environment is a registered environment in your workspace. azureml:AzureML-sklearn-1.0-ubuntu20.04-py38-cpu#latest is one that is predefined in AzureML, but you can also create your own.
If you prefer Python, here is the same thing done in Python.
See here for a blog post on How to do hyperparameter tuning using Azure ML.

Snakemake limit the memory usage of jobs

I need to run 20 genomes with a snakemake. So I am using basic steps like alignment, markduplicates, realignment, basecall recalibration and so on in snakemake. The machine I am using has up to 40 virtual cores and 70G memory and I run the program like this.
snakemake -s Snakefile -j 40
This works fine, but as soon as It runs markduplicates along other programs, it stops as I think it overloads the 70 available giga and crashes.
Is there a way to set in snakemake the memory limit to 60G in total for all programs running? I would like snakemake runs less jobs in order to stay under 60giga, is some of the steps require a lot of memory. The command line below crashed as well and used more memorya than allocated.
snakemake -s Snakefile -j 40 --resources mem_mb=60000
It's not enough to specify --resources mem_mb=60000 on the command line, you need also to specify mem_mb for the rules you want to keep in check. E.g.:
rule markdups:
input: ...
ouptut: ...
resources:
mem_mb= 20000
shell: ...
rule sort:
input: ...
ouptut: ...
resources:
mem_mb= 1000
shell: ...
This will submit jobs in such way that you don't exceed a total of 60GB at any one time. E.g. this will keep running at most 3 markdups jobs, or 2 markdups jobs and 20 sort jobs, or 60 sort jobs.
Rules without mem_mb will not be counted towards memory usage, which is probably ok for rules that e.g. copy files and do not need much memory.
How much to assign to each rule is mostly up to your guess. top and htop commands help in monitoring jobs and figuring out how much memory they need. More elaborated solutions could be devised but I'm not sure it's worth it... If you use a job scheduler like slurm, the log files should give you the peak memory usage of each job so you can use them for future guidance. Maybe others have better suggestions.

Drake Installation Freeze

I am trying to install the python-binding of drake. After make --j it freezes. I believe I have done everything correctly for the previous steps. Can anyone help? I am running on Ubuntu 18.04 with python 3.6.9.
Thank you in advance. It looks like this.
Frozen Terminal
Use make (no -j flag) or make -j1 because bazel (which is called internally during the build) handles the parallelism of the build (and of tests) and will set the number of jobs to the number of cores by default (appears to be 8 in your case).
To adjust the parallelism to reduce the number of jobs to less than the number of cores, create a file named user.bazelrc at the root of the repository (same level as the WORKSPACE file) with the content
test --jobs=N
for some N less than the number of cores that you have.
See also https://docs.bazel.build/versions/master/guide.html#bazelrc.
From the screen shot, it doesn't look like the drake build system is doing anything wrong. But make -j is probably trying to do too many things in parallel. Try starting with -j4 and if it still freezes, go down to 2, etc.
Possibly out of memory..
A hacky solution is to change the CMakeLists.txt file to set the max number of jobs bazel uses by adding --jobs N (where N is the number of jobs you allow concurrently) after ${BAZEL_TARGETS} like so
ExternalProject_Add(drake_cxx_python
SOURCE_DIR "${PROJECT_SOURCE_DIR}"
CONFIGURE_COMMAND :
BUILD_COMMAND
${BAZEL_ENV}
"${Bazel_EXECUTABLE}"
${BAZEL_STARTUP_ARGS}
build
${BAZEL_ARGS}
${BAZEL_TARGETS}
--jobs 1
BUILD_IN_SOURCE ON
BUILD_ALWAYS ON
INSTALL_COMMAND
${BAZEL_ENV}
"${Bazel_EXECUTABLE}"
${BAZEL_STARTUP_ARGS}
run
${BAZEL_ARGS}
${BAZEL_TARGETS}
--
${BAZEL_TARGETS_ARGS}
USES_TERMINAL_BUILD ON
USES_TERMINAL_INSTALL ON
)

GNU Parallel -- How to understand "block-size" setting, and guess what to set it to?

How do I set the block-size parameter when running grep using GNU parallel on a single machine with multiple cores, based on the "large_file" filesize, "small_file" filesize and machine I'm using to get the fastest performance possible (or please correct me if there is something else that im missing here)? What are the performance issues/speed bottlenecks I'll run into when setting it too high or too low? I understand what block-size does, in that it blocks out the large_file in chunks, and sends those chunks to each job, but I'm still missing the potential for how and why that would impact speed of execution.
The command in question:
parallel --pipepart --block 100M --jobs 10 -a large_file.csv grep -f small_file.csv
where large_file.csv has in it:
123456 1
234567 2
345667 22
and
where small_file.csv has in it:
1$
2$
and so on...
Thank you!
parallel --pipepart --block -1 --jobs 10 -a large_file.csv grep -f small_file.csv
--block -1 will split large_file.csv into one block per jobslot (here 10 chunks). The splitting will be done on the fly, so it will not be read into RAM to do the splitting.
Splitting into n evenly sized blocks (where n = number of jobs to run in parallel) often makes sense if the time spent per line is roughly the same. If it varies a lot (say, some lines take 100 times longer to process than others), then it may make more sense to chop into more bits. E.g. --block -10 will split into 10 times as many blocks as --block -1.
The optimal value can seldom be guessed in advance, because it may also depend on how fast your disk is. So try different values and identify where the bottleneck is. It is typically one of disk I/O, CPU, RAM, command startup time.

Openwhisk: scaling issues on the distributed setup

I setup a distributed openwhisk installation on a few vitual machines as described here https://github.com/apache/incubator-openwhisk/blob/master/ansible/README_DISTRIBUTED.md (had to also install some dependencies on VMs manually because they were expected but not installed by default).
My host file looks like this:
; the first parameter in a host is the inventory_hostname
; used for local actions only
ansible ansible_connection=local
[registry]
xxx.xx.xx.173 ansible_host=xxx.xx.xx.173
[edge]
xxx.xx.xx.176 ansible_host=xxx.xx.xx.176
[apigateway:children]
edge
[redis:children]
edge
[controllers]
xxx.xx.xx.174 ansible_host=xxx.xx.xx.174
xxx.xx.xx.175 ansible_host=xxx.xx.xx.175
[kafkas]
xxx.xx.xx.176 ansible_host=xxx.xx.xx.176
[zookeepers:children]
kafkas
[invokers]
xxx.xx.xx.174 ansible_host=xxx.xx.xx.174
xxx.xx.xx.175 ansible_host=xxx.xx.xx.175
[db]
xxx.xx.xx.176 ansible_host=xxx.xx.xx.176
Everything seems to be running fine in general, I can create actions, invoke them etc.
On the two VMs which host invokers and controllers I turned on htop to check the CPU usage and tried running a python script invoking the same action (prime number calculation which takes time for a large enough input) multiple times in parallel.
The result seems to be that the first invoker works on 100% CPU while the calculation is happening, while the second one is still idling on 5-7% CPU. I also tried different ways of distributing components across multiple VMs, e.g. setting invokers on two machines and one controller separately on another machine, but the result is the same.
What could be the reason for this? And what could be the proper use case to make Openwhisk get the second invoker involved?
In a small deployment, there is a fraction of the invoker pool allocated strictly for docker actions. This is called the blackbox fraction which is 10% by default (with a minimum of 1 invoker, which is why you see one loaded invoker and one idle).
This recent pull request allows all the invokers to be used for small numbers of invokers (up to the reciprocal of the blackbox fraction): https://github.com/apache/incubator-openwhisk/pull/3751

Resources