I am running Dask on a SLURM-managed cluster.
dask-ssh --nprocs 2 --nthreads 1 --scheduler-port 8786 --log-directory `pwd` --hostfile hostfile.$JOBID &
sleep 10
# We need to tell dask Client (inside python) where the scheduler is running
scheduler="`hostname`:8786"
echo "Scheduler is running at ${scheduler}"
export ARL_DASK_SCHEDULER=${scheduler}
echo "About to execute $CMD"
eval $CMD
# Wait for dash-ssh to be shutdown from the python
wait %1
I create a Client inside my python code and then when finished, I shut it down.
c=Client(scheduler_id)
...
c.shutdown()
My reading of the dask-ssh help is that the shutdown will shutdown all workers and then the scheduler. But it does not stop the background dask-ssh and so eventually the job timeouts.
I've tried this interactively in the shell. I cannot see how to stop the scheduler.
I would appreciate any help.
Thanks,
Tim
Recommendation with --scheduler-file
First, when setting up with SLURM you might consider using the --scheduler-file option, which allows you to coordinate the scheduler address using your NFS (which I assume you have given that you're using SLURM). Recommend reading this doc section: http://distributed.readthedocs.io/en/latest/setup.html#using-a-shared-network-file-system-and-a-job-scheduler
dask-scheduler --scheduler-file /path/to/scheduler.json
dask-worker --scheduler-file /path/to/scheduler.json
dask-worker --scheduler-file /path/to/scheduler.json
>>> client = Client(scheduler_file='/path/to/scheduler.json')
Given this it also becomes easier to use the sbatch or qsub command directly. Here is an example with SGE's qsub
# Start a dask-scheduler somewhere and write connection information to file
qsub -b y /path/to/dask-scheduler --scheduler-file /path/to/scheduler.json
# Start 100 dask-worker processes in an array job pointing to the same file
qsub -b y -t 1-100 /path/to/dask-worker --scheduler-file /path/to/scheduler.json
Client.shutdown
It looks like client.shutdown only shuts down the client. You're correct that this is inconsistent with the docstring. I've raised an issue here: https://github.com/dask/distributed/issues/1085 for tracking further developments.
In the meantime
These three commands should suffice to tear down the workers, close the scheduler, and stop the scheduler process
client.loop.add_callback(client.scheduler.retire_workers, close_workers=True)
client.loop.add_callback(client.scheduler.terminate)
client.run_on_scheduler(lambda dask_scheduler: dask_scheduler.loop.stop())
What people usually do
Typically people start and stop clusters with whatever means that they started them. This might involve using SLURM's kill command. We should make the client-focused way more consistent though regardless.
Related
I am trying to submit a job on high compute cluster that needs to run a python code lets say 10000 times. I used gnu parallel but then IT team sent me a mail stating that my job is creating too many ssh login logs in their monitoring system. They asked me to use job arrays instead. My code takes about 12 seconds to run. I believe I need to use #PBS -J statement in my PBS script. Then, I am not sure if it will run in parallel. I need to execute my code lets say on 10 nodes 16 cores each i.e. 160 instances of my code running in parallel. How can I parallelize it i.e. run many instances of my code at a given time utilizing all the resources I have?
Below is the initial pbs script with gnu parallel:
#!/bin/bash
#PBS -P My_project
#PBS -N my_job
#PBS -l select=10:ncpus=16:mem=4GB
#PBS -l walltime=01:30:00
module load anaconda
module load parallel
cd $PBS_O_WORKDIR
JOBSPERNODE=16
parallel --joblog jobs.log --wd $PBS_O_WORKDIR -j $JOBSPERNODE --sshloginfile $PBS_NODEFILE --env PATH "python $PBS_O_WORKDIR/xyz.py" :::: inputs.txt
inputs.txt is a fie with integer values 0-9999 in each line which is fed to my python code as an argument. Code is highly independent and output of one instance does not affect another.
a little late but thought I'd answer anyway.
Arrays will run in parallel, but the number of jobs running at once will depend on the availability of nodes and the limit of jobs per user per queue. Essentially, each HPC will be slightly different.
Adding #PBS -J 1-10000 will create an array of 10000 jobs, and assuming the syntax is the same as the HPC I use, something like ID=$(sed -n "${PBS_ARRAY_INDEX}p" /path/to/inputs.txt) will then be the integers from inputs.txt whereby PBS array number 123 will return the 123rd line of inputs.txt.
Alternatively, since you're on an HPC, if the jobs are only taking 12 seconds each, and you have 10000 iterations, then a for loop will also complete the entire process in 33.33 hours.
I have a python flask app running on uWSGI with a config file that specifics it to spawn multiple workers (which I am assuming are identical processes).
Everything works well except for one part: the python app runs a bash command to download an update a database every day using a scheduler, which needs to run only once but multiple processes means that it runs multiple times at the same time, thus corrupting the downloaded file.
Is there a way to run this bash command on only one instance of uWSGI workers? I can't run the bash command as a separate cron job (the database update has to integrate seamlessly with the app).
Check The uWSGI cron-like interface
uWSGI’s master has an internal cron-like facility that can generate
events at predefined times. You can use it
You can set the options for example to:
[uwsgi]
; every two hours
cron = 0 -2 -1 -1 -1 /usr/bin/backup_my_home --recursive
Is that sufficient?
I would like to start two different services in my Docker container and exit the container as soon as one of them exits. I looked at supervisor, but I can't find how to get it to quit as soon as one of the managed applications exits. It tries to restart them up to three times, as is the standard setting and then just sits there doing nothing. Is supervisor able to do this or is there any other tool for this? A bonus would be if there also was a way to let both managed programs write to stdout, tagged with their application name, e.g.:
[Program 1] Some output
[Program 2] Some other output
[Program 1] Output again
Since you asked if there was another tool... we designed and wrote a powerful replacement for supervisord that is designed specifically for Docker. It automatically terminates when all applications quit, as well as has special service settings to control this behavior, plus will redirect stdout with tagged syslog-compatible output lines as well. It's open source, and being used in production.
Here is a quick start for Docker: http://garywiz.github.io/chaperone/guide/chap-docker-simple.html
There is also a complete set of tested base-images which are a good example at: https://github.com/garywiz/chaperone-docker, but these might be overkill and the earlier quickstart may do the trick.
I found solutions to both of my requirements by reading through the docs some more.
Exit supervisord on application exit
This can be achieved by using a custom eventlistener. I had to add the following segment into my supervisord configuration file:
[eventlistener:shutdownevent]
command=/shutdownhandler.sh
events=PROCESS_STATE_EXITED
supervisord will start the referenced script and upon the given event being triggered (PROCESS_STATE_EXITED is triggered after the exit of one of the managed programs and it not restarting automatically) will send a line containing data about the event on the scripts stdin.
The referenced shutdownhandler-script contains:
#!/bin/bash
while :
do
echo -en "READY\n"
read line
kill $(cat /supervisord.pid)
echo -en "RESULT 2\nOK"
done
The script has to indicate being ready by sending "READY\n" on its stdout, after which it may receive an event data line on its stdin. For my use case upon receival of a line (meaning one of the managed programs has exited), a SIGTERM is sent to the supervisord process being found by the pid it leaves in its pid file (situated in the root directory by default). For technical completeness, I also included a positive answer for the eventlistener, though that one should never matter.
Tagged output on stdout
I did this by simply starting a tail process in the background before starting supervisord, tailing the programs output log and piping the lines through ts (from the moreutils package) to prepend a tag to it. This way it shows up via docker logs with an easy way to see which program actually wrote the line.
tail -fn0 /var/log/supervisor/program1.log | ts '[Program 1]' &
In a rails application (or sinatra), if I make a call to a shell command, under what context does this command run?
I'm not sure if I am asking my question correctly, but does it run in the same thread as the rails process?
When you shell out, is it possible to make this a asychronous call? If yes, does this mean at the operating system level it will start a new thread? Can it start in a pool of threads instead of a new thread?
If you are using system('cmd') or simply backticks:
`cmd`
Then the command will be executed in the context of a subshell.
If you wish to run multiple of these at a time, you can use Rubys fork functionality:
fork { system('cmd') }
fork { system('cmd') }
This will create multiple subprocessess which run the individual commands in their respective subshells.
Read up on forking here: http://www.ruby-doc.org/core-2.0/Process.html#method-c-fork
It's more than just a new thread, it's a completely separate process. It will be synchronous and control will not return to Ruby until the command has completed. If you want a fire-and-forget solution, you can simply background the task:
$ irb
irb(main):001:0> system("sleep 30 &")
=> true
irb(main):002:0>
$ ps ax | grep sleep
3409 pts/4 S 0:00 sleep 30
You can start as many processes as you want via system("foo &") or`foo &`.
If you want more control over launching background processes from Ruby, including properly detaching ttys and a host of other things, check out the daemons gem. That's more suitable for long-running processes that you want to manage, with PID files, etc., but it's also possible to just launch tasks with it.
There are alternative solutions for managing background processes depending on your needs. The resque gem is popular for queuing and managing background jobs. It requires Redis and some setup, but it's good if you need that level of control.
I need a shell script to send a HUP to the parent and child processes.
I am using freeBSD with tcsh? #/bin/sh
Somehow, I need to pipe the PID output from pgrep to kill -HUP in a loop in a shell script.
Ultimately I want to run this script as a cron job.
I just don't have the skills yet.
Thanks - Brad
(This isn't a complete answer, but I can't make comments without at least 50 reputation apparently).
First of all, /bin/sh on FreeBSD is a Boune-compatible shell, not tcsh (which is /bin/tcsh). A start would be something like the following:
#!/bin/sh
for pid in $(pgrep <process name>); do kill -HUP $pid; done
Without more details, I can't really say much more.