Freeing unused allocated nodes on a SLURM cluster - gnu-parallel

I'm running batches of serial programs on a (very) inhomogeneous SLURM cluster (version 2.6.6-2), using GNU parallel to do the distribution. The problem I'm having is that some nodes finish their tasks much faster than others, and I end up with situations where, for example, a job has 4 nodes allocated but only uses 1 for half of the simulation.
Is there any way, without administrator privileges, to free one of these unused nodes? I can mitigate the problem by running 4 jobs on individual nodes, or with files containing lists of homogeneous nodes, but it's still far from ideal.
For reference, here are the script files that I'm using (adapted from here):
job.sh
#!/bin/sh
#SBATCH --job-name=test
#SBATCH --time=96:00:00
#SBATCH --ntasks=16
#SBATCH --mem-per-cpu=1024
#SBATCH --ntasks-per-node=4
#SBATCH --partition=normal
# --delay .2 prevents overloading the controlling node
# -j is the number of tasks parallel runs so we set it to $SLURM_NTASKS
# --joblog makes parallel create a log of tasks that it has already run
# --resume makes parallel use the joblog to resume from where it has left off
# the combination of --joblog and --resume allow jobs to be resubmitted if
# necessary and continue from where they left off
parallel="parallel --delay .2 -j $SLURM_NTASKS"
$parallel < command_list.sh
command_list.sh
srun --exclusive -N1 -n1 nice -19 ./a.out config0.dat
srun --exclusive -N1 -n1 nice -19 ./a.out config1.dat
srun --exclusive -N1 -n1 nice -19 ./a.out config2.dat
...
srun --exclusive -N1 -n1 nice -19 ./a.out config31.dat
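Writing 32 near-identical srun lines by hand is error-prone; the list can be generated instead. A minimal sketch (it prints the lines to stdout; redirect the output to command_list.sh to use it):

```shell
# Generate the srun line for each of the 32 config files.
for i in $(seq 0 31); do
    printf 'srun --exclusive -N1 -n1 nice -19 ./a.out config%d.dat\n' "$i"
done
```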

You can use the scontrol command to downsize your job:
scontrol update JobId=# NumNodes=#
I am not sure however how Slurm chooses the nodes to dismiss. You might need to choose them by hand and write
scontrol update JobId=# NodeList=<names>
See Question 24 in the Slurm FAQ.
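For example, to shrink a job down to the nodes that are still busy, you would name the nodes to keep. The job id and node names below are made up, and the sketch only builds and prints the command so it can be inspected before actually running it:

```shell
# Hypothetical values - substitute your real job id and the nodes to keep.
job_id=12345
keep_nodes="node[01-03]"
cmd="scontrol update JobId=${job_id} NodeList=${keep_nodes}"
echo "$cmd"
```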

Related

How to monitor and log networking latency between group of docker containers?

I have a setup of 10 docker containers from different images in a swarm on 3 machines. I need to monitor and log network latency / packet delays between each container. Is there a right tool for this?
I can implement something like
while true; do for host in "${my_hosts[@]}"; do ping -c 1 "$host" >> latency.log; done; done
and launch it on each machine, tailing latency.log into a monitor like Prometheus. But it feels like reinventing a square wheel.
I hope I understand what you need; I'm implementing something like this myself.
I tested netdata with Prometheus and Grafana, and Metricbeat/Filebeat with Elasticsearch and Kibana.
We chose Elastic (the ELK stack) because the same DB can handle both metrics and textual data.
Hope I gave you some directions.
What I have at the end is a setup that:
Shares hosts between containers by volume,
Measures latency feeding hosts to fping,
Writes fping output to log file,
Serves this log file to Prometheus by mtail.
I've implemented a wrapper around fping to make it work with mtail:
#!/usr/bin/env bash
# Wraps `fping -lDe` to emit output for multiple hosts one line at a time (for the `mtail` parser).
# The default `fping -lDe` behavior produces X lines at a time, where X = number of hosts to ping.
# Expects a hosts file with a `# docker` section as described in the usage guide.
echo "Measuring time delays to docker hosts from '$1'"
# take hostnames after the '# docker' comment line
hosts=$(sed -n '/# docker/,$p' "$1" | sed 1d)
trap "exit" INT # exit the loop on SIGINT
# start `fping` and write its output to stdout line by line
# ($hosts is intentionally unquoted so each host becomes a separate argument)
stdbuf -oL fping -lDe $hosts |
while IFS= read -r line
do
    echo "$line"
done
And here is the mtail parser for the log file:
gauge delay by host
gauge loss by host
# [<epoch_time>] <host> : [2], 84 bytes, <delay> ms (0.06 avg, 0% loss)
/\[(?P<epoch_time>\d+\.\d+)\] (?P<host>[\w\d\.]*) : \[\d+\], \d+ \w+, (?P<delay>\d+\.\d+) ms \(\d+\.\d+ avg, (?P<loss>\d+)% loss\)/ {
delay[$host]=$delay
loss[$host]=$loss
}
Now you can add fping and mtail to your images to serve delay and loss metrics to Prometheus.
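To sanity-check the pattern before wiring it into mtail, you can run a sample line in the `fping -lDe` format through a roughly equivalent sed extraction. The host and timing values below are invented:

```shell
# A made-up sample line in the format the mtail regex expects.
line='[1510000000.123456] 172.17.0.2 : [2], 84 bytes, 0.52 ms (0.48 avg, 0% loss)'
# Pull out the same <delay> and <loss> fields the mtail program captures.
delay=$(printf '%s\n' "$line" | sed -E 's/.* ([0-9]+\.[0-9]+) ms .*/\1/')
loss=$(printf '%s\n' "$line" | sed -E 's/.*, ([0-9]+)% loss\).*/\1/')
echo "delay=$delay loss=$loss"
```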
References:
mtail: https://github.com/google/mtail
fping: https://fping.org/

Does `strace -f` work differently when run inside a docker container?

Assume the following:
I have a program myprogram inside a docker container
I'm running the docker container with
docker run --privileged=true my-label/my-container
Inside the container - the program is being run with:
strace -f -e trace=desc ./myprogram
What I see is that the strace (despite having the -f on) doesn't follow all the child processes.
I see the following output from strace
[pid 10] 07:36:46.668931 write(2, "..\n"..., 454 <unfinished ...>
<stdout of ..>
<stdout other output - but I don't see the write commands - so probably from a child process>
[pid 10] 07:36:46.669684 write(2, "My final output\n", 24 <unfinished ...>
<stdout of My final output>
What I want to see is the other write commands.
I should see the other write commands, because I'm using -f.
What I think is happening is that running inside docker makes the process handling and security different.
My question is: Does strace -f work differently when run inside a docker container?
Note that this application starts and stops in 2 seconds - so the tracing tool has to follow the application lifecycle - like strace does. Connecting to a server background process won't work.
It turns out strace truncates string output: you have to explicitly tell it that you want more than the first 32 characters (the default string size). You do this with -s 800.
strace -s 800 -ff ./myprogram
You can also get all the write commands by asking strace explicitly with -e write.
strace -s 800 -ff -e write ./myprogram

Set up a sharded solr collection using solrcloud

I would like to set up a 6-shard Solr collection on 3 Windows machines.
I tried bin\solr -e cloud and set up 2 machines, 6 shards and 1 replica. When stopping and starting 2 cores on one machine (each using a different hard disk) I get 6 shards; 3 for each core.
When I start another core on another machine nothing happens, the 3rd one doesn't do anything.
When I start another core on the same machine using the same config in another directory nothing happens, the core starts but has no collections and the 2 cores first started still have 3 shards each.
For example: I start the 3rd one with:
bin\solr start -c -p 7576 -z localhost:9983 -s server/solr/collection/node3/solr
Or start on another machine:
bin\solr start -c -p 7576 -z zookeeper:9983 -s server/solr/collection/node3/solr
Is there some documentation out there that doesn't use the "convenient" bin\solr script? I've spent the entire day trying to reverse engineer it to figure out how to set up ZooKeeper/Solr so that each added Solr core becomes a shard, until 6 shards are reached.
I think I found the answer: bin\solr -e cloud starts up the cores and assigns data to them.
After running the standard bin\solr -e cloud with 2 cores, a collection with 6 shards and 1 replica, I stop them all with bin\solr stop -all.
Then copy solr-5.2.1\example\cloud\node1 as solr-5.2.1\example\cloud\node3, delete the files in solr-5.2.1\example\cloud\node3\logs, and let node3 have gettingstarted_shard6_replica1 (leave that directory in solr-5.2.1\example\cloud\node3\solr and remove it from solr-5.2.1\example\cloud\node1\solr).
Start up 3 cores:
bin\solr start -c -p 8983 -s example\cloud\node1\solr
bin\solr start -cloud -p 7574 -z localhost:9983 -s example\cloud\node2\solr
bin\solr start -cloud -p 7575 -z localhost:9983 -s example\cloud\node3\solr
And now I can see that the 3rd Solr instance has gettingstarted_shard6_replica1.

How to stop parallel from reporting "No more processes" with "-X" option?

Working off this example: http://www.gnu.org/software/parallel/man.html#EXAMPLE:-Speeding-up-fast-jobs
When I run:
seq -w 0 9999 | parallel touch pict{}.jpg
seq -w 0 9999 | parallel -X touch pict{}.jpg
Success! However, add another 9 and BOOM:
$ seq -w 0 99999 | parallel --eta -X touch pict{}.jpg
parallel: Warning: No more processes: Decreasing number of running jobs to 3. Raising ulimit -u or /etc/security/limits.conf may help.
Computers / CPU cores / Max jobs to run
1:local / 4 / 3
parallel: Warning: No more processes: Decreasing number of running jobs to 2. Raising ulimit -u or /etc/security/limits.conf may help.
parallel: Warning: No more processes: Decreasing number of running jobs to 1. Raising ulimit -u or /etc/security/limits.conf may help.
parallel: Error: No more processes: cannot run a single job. Something is wrong.
I would expect parallel -X to run no more jobs than I have cpu cores, and to cram as many parameters onto each job as the max command line length permits. How am I running out of processes?
My environment:
OSX Yosemite
ulimit -u == 709
GNU parallel 20141122
GNU bash, version 3.2.53(1)-release (x86_64-apple-darwin14)
Your expectation is 100% correct. What you are seeing is clearly a bug - probably due to GNU Parallel not being well tested on OSX. Please follow http://www.gnu.org/software/parallel/man.html#REPORTING-BUGS and file a bug report.

Convert check_load in Nagios to Zabbix

Hi, I have just built my Zabbix server and am in the process of configuring some checks currently set up in Nagios.
One of these checks is check_load. Can anyone explain what this check means in Nagios and how I can replicate it in Zabbix?
In Nagios, check_load monitors server load. Server load is a good indication of what your overall utilisation looks like: http://en.wikipedia.org/wiki/Load_(computing)
You can view server load easily on most *nix servers using the top command. The 3 numbers at the top right show your 1-, 5- and 15-minute load averages. As a brief guide, the load should be less than your number of processors. So for instance, if you have a 4-CPU server then I would expect your load average to sit below 4.00.
I recently did a quick load monitor in Nagios script format for http://www.dataloop.io
It was done quickly and needs a fair bit of work to run across other systems, but it gives a feel for how to scrape the output of top:
#!/bin/bash
# Scrape the load averages from the first line of `top` output.
# NB: the field positions shift with the uptime string, so this is fragile.
onemin=$(top -b -n1 | sed -n '1p' | cut -d ' ' -f 13 | sed 's/,//')
fivemin=$(top -b -n1 | sed -n '1p' | cut -d ' ' -f 14 | sed 's/,//')
fifteenmin=$(top -b -n1 | sed -n '1p' | cut -d ' ' -f 15 | sed 's/,//')
int_fifteenmin=$(printf "%.0f" "$fifteenmin")
alert=10
if [ "$int_fifteenmin" -gt "$alert" ]
then
    echo "CRITICAL | 1min=$onemin;;;; 5min=$fivemin;;;; 15min=$fifteenmin;;;;"
    exit 2
fi
echo "OK | 1min=$onemin;;;; 5min=$fivemin;;;; 15min=$fifteenmin;;;;"
exit 0
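As a lighter-weight alternative, the same three load averages can be read straight from /proc/loadavg, which avoids parsing top's header line entirely. A minimal sketch, assuming a Linux /proc filesystem:

```shell
# /proc/loadavg starts with the 1, 5 and 15 minute load averages.
read -r onemin fivemin fifteenmin _ < /proc/loadavg
echo "1min=$onemin 5min=$fivemin 15min=$fifteenmin"
```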
Hope this explains enough for you to create a Zabbix equivalent.
In Zabbix, it is a Zabbix agent built-in check. Search for system.cpu.load here.
As for what it measures, the already-posted link to the Wikipedia article is a great read.
