Group specific MPI ranks to a single processing unit - binding

I have an MPI application (let's say -np 6) in which I know ahead of time that ranks 0, 2, and 3 are computationally very light compared to ranks 1, 4, and 5. I want to conserve resources by pinning ranks 0, 2, and 3 to the same physical processing unit, and then pin ranks 1, 4, and 5 each to a physical processing unit of their own.
I know there are many flavors of MPI out there and the syntax varies, but I cannot find anything that actually dictates the location of individual ranks, rather than just specifying a uniform 2 ppn or something to that effect. I have to imagine this is possible; I am just not sure what it falls under: pinning? binding? mapping? etc.
Thanks for the help!

Open MPI supports what it calls rankfiles, which specify the mapping of each rank to a host and a processing element on that host. You can see more in the man page for mpiexec (the link is to the documentation for v2.1 that comes with, e.g., Ubuntu 18.04 LTS, but it is essentially the same in newer versions too), but assuming you run everything on a single host with at least 4 CPU cores, the rankfile will look something like:
rank 0=hostname slot=0
rank 1=hostname slot=1
rank 2=hostname slot=0
rank 3=hostname slot=0
rank 4=hostname slot=2
rank 5=hostname slot=3
where hostname is the host name, possibly localhost.
Here is an example:
First, a small utility script show_affinity that displays the CPU affinity of the current MPI rank:
#!/bin/bash
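# Print this process's MPI rank (Open MPI exports it as OMPI_COMM_WORLD_RANK)
# together with the list of CPUs the Linux kernel allows it to run on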
echo "$OMPI_COMM_WORLD_RANK: $(grep Cpus_allowed_list /proc/self/status)"
Second, a sample rankfile:
rank 0=localhost slot=0
rank 1=localhost slot=1
rank 2=localhost slot=0
rank 3=localhost slot=0
rank 4=localhost slot=2
rank 5=localhost slot=3
MPI launch of show_affinity using that rankfile:
$ mpiexec -H localhost -rf rankfile ./show_affinity
0: Cpus_allowed_list: 0-1
1: Cpus_allowed_list: 2-3
2: Cpus_allowed_list: 0-1
3: Cpus_allowed_list: 0-1
4: Cpus_allowed_list: 4-5
5: Cpus_allowed_list: 6-7
The CPU has hyperthreading enabled, so each rank gets bound to both hardware threads of its assigned core.
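If you want to see why rank 0 reports 0-1, i.e. which logical CPUs are hyperthread siblings on the same physical core, a plain Linux check (nothing Open MPI specific) is:
# Logical CPUs that share the same CORE value are hyperthread siblings
lscpu -e=CPU,CORE,SOCKET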

Related

GNU Parallel -- How to understand "block-size" setting, and guess what to set it to?

How do I set the block-size parameter when running grep via GNU parallel on a single machine with multiple cores, based on the size of large_file, the size of small_file, and the machine I'm using, to get the fastest performance possible (or please correct me if there is something else that I'm missing here)? What are the performance issues/speed bottlenecks I'll run into when setting it too high or too low? I understand what block-size does, in that it chops large_file into chunks and sends those chunks to each job, but I'm still missing how and why that would impact execution speed.
The command in question:
parallel --pipepart --block 100M --jobs 10 -a large_file.csv grep -f small_file.csv
where large_file.csv has in it:
123456 1
234567 2
345667 22
and
where small_file.csv has in it:
1$
2$
and so on...
Thank you!
parallel --pipepart --block -1 --jobs 10 -a large_file.csv grep -f small_file.csv
--block -1 will split large_file.csv into one block per jobslot (here 10 chunks). The splitting is done on the fly, so the file is not read into RAM to do the splitting.
Splitting into n evenly sized blocks (where n = number of jobs to run in parallel) often makes sense if the time spent per line is roughly the same. If it varies a lot (say, some lines take 100 times longer to process than others), then it may make more sense to chop into more bits. E.g. --block -10 will split into 10 times as many blocks as --block -1.
The optimal value can seldom be guessed in advance, because it may also depend on how fast your disk is. So try different values and identify where the bottleneck is; it is typically one of disk I/O, CPU, RAM, or command startup time.
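A rough way to find that bottleneck is simply to time a few candidate block sizes on your real data. A sketch, reusing the command from the question and throwing the output away since only the timing matters here:
# Compare wall-clock time for a few --block settings
for b in 100M -1 -10; do
    echo "--block $b"
    time parallel --pipepart --block "$b" --jobs 10 -a large_file.csv \
        grep -f small_file.csv > /dev/null
done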

Lua and Torch issues with GPU

I am trying to run the Lua-based program from OpenNMT. I have followed the procedure from here: http://forum.opennmt.net/t/text-summarization-on-gigaword-and-rouge-scoring/85
I have used the command:
th train.lua -data textsum-train.t7 -save_model textsum1 -gpuid 0 1 2 3 4 5 6 7
I am using 8 GPUs, but the process is still very slow, as if it were running on the CPU. Kindly let me know what the solution might be for optimizing GPU usage.
Here are the stats of the GPU usage (screenshot omitted):
Kindly let me know how I can make the process run faster and use the GPUs fully. I have 11 GB of GPU memory available, but the process only consumes 2 GB or less, hence it is very slow.
As per the OpenNMT documentation, you need to remove the 0 right after the -gpuid option, since 0 stands for the CPU and you are effectively reducing the training speed to that of a CPU-powered run.
To use data parallelism, assign a list of GPU identifiers to the -gpuid option. For example:
th train.lua -data data/demo-train.t7 -save_model demo -gpuid 1 2 4
will use the first, the second and the fourth GPU of the machine as returned by the CUDA API.
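As a way to double-check the fix, and assuming your eight GPUs show up as ids 1 through 8 in OpenNMT's numbering (an assumption based on the 1-based ids in the quoted documentation), the corrected launch and a quick utilization check might look like:
# Corrected launch: drop the 0 (CPU) and list only real GPU ids
th train.lua -data textsum-train.t7 -save_model textsum1 -gpuid 1 2 3 4 5 6 7 8
# In a second terminal, watch GPU utilization and memory, refreshed every second
watch -n 1 nvidia-smi
If nvidia-smi still shows the GPUs mostly idle after this change, the bottleneck is likely elsewhere (e.g. data loading or batch size), but that is beyond the gpuid fix.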

How to vectorize/group together many signals generated from Qsys to Altera Quartus

In Altera Qsys, I am using ten input parallel ports (let's name them pio1 to pio10); each port is 12 bits wide. These parallel ports obtain values from the VHDL block in the Quartus schematic. In the schematic BDF, I can see pio1 to pio10 on the Nios II system symbol, so I can connect these PIOs to other blocks in my BDF.
My question is: how do I vectorize pio1 to pio10? Instead of seeing all ten PIOs coming out of the Nios II system symbol one line at a time, what should I do to group them so that I only see one instead of ten? That single PIO I could name pio[1..10][1..12], where the first bracket means pio1 to pio10 and the second bracket means bit 1 to bit 12, because each parallel port has 12 bits.
Could you please let me know how could I do that?

In SPMD using GNU parallel, is processing the smallest files first the most efficient way?

This is pretty straightforward:
Say I have many files in the folder data/ to process via some executable ./proc. What is the simplest way to maximize efficiency? I have been doing this to gain some efficiency:
ls --sort=size data/* | tac | parallel ./proc
which lists the data according to size, then tac (reverse of cat) flips the order of that output so the smallest files are processed first. Is this the most efficient solution? If not, how can the efficiency be improved (simple solutions preferred)?
I remember that sorting like this leads to better efficiency since larger jobs don't block up the pipeline, but aside from examples I can't find or remember any theory behind this, so any references would be greatly appreciated!
If you need to run all jobs and want to optimize the time to complete them all, you want them to finish at roughly the same time. In that case you should run the small jobs last. Otherwise you may end up in the situation where all CPUs are done except one that has just started on the last big job; then you waste CPU time on every CPU except that one.
Here are 8 jobs: 7 take 1 second, one takes 5:
1 2 3 4 55555 6 7 8
On a dual core, small jobs first:
1368
24755555
On a dual core, big jobs first:
555557
123468
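Turned back into a command, and assuming ./proc's run time roughly tracks file size (so size is a usable proxy for job length), that schedule means dropping the tac from the question so the largest files are dispatched first:
# ls --sort=size lists the largest files first, so the long jobs start early
# and the small files fill in the gaps at the end
ls --sort=size data/* | parallel ./proc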

Is there a monitoring tool like xentop that will track historical data?

I'd like to view historical data for guest cpu/memory/IO usage, rather than just current usage.
There is a Perl program I have written, xenstat.pl, that does this.
It also supports logging to a URL.
Features:
perl xenstat.pl -- generate cpu stats every 5 secs
perl xenstat.pl 10 -- generate cpu stats every 10 secs
perl xenstat.pl 5 2 -- generate cpu stats every 5 secs, 2 samples
perl xenstat.pl d 3 -- generate disk stats every 3 secs
perl xenstat.pl n 3 -- generate network stats every 3 secs
perl xenstat.pl a 5 -- generate cpu avail (e.g. cpu idle) stats every 5 secs
perl xenstat.pl 3 1 http://server/log.php -- gather 3 secs cpu stats and send to URL
perl xenstat.pl d 4 1 http://server/log.php -- gather 4 secs disk stats and send to URL
perl xenstat.pl n 5 1 http://server/log.php -- gather 5 secs network stats and send to URL
Sample output:
[server~]# xenstat 5
cpus=2
40_falcon 2.67% 2.51 cpu hrs in 1.96 days ( 2 vcpu, 2048 M)
52_python 0.24% 747.57 cpu secs in 1.79 days ( 2 vcpu, 1500 M)
54_garuda_0 0.44% 2252.32 cpu secs in 2.96 days ( 2 vcpu, 750 M)
Dom-0 2.24% 9.24 cpu hrs in 8.59 days ( 2 vcpu, 564 M)
40_falc 52_pyth 54_garu Dom-0 Idle
2009-10-02 19:31:20 0.1 0.1 82.5 17.3 0.0 *****
2009-10-02 19:31:25 0.1 0.1 64.0 9.3 26.5 ****
2009-10-02 19:31:30 0.1 0.0 50.0 49.9 0.0 *****
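If the goal is a continuous history rather than a one-off report, the URL-logging mode can simply be run in a loop. A sketch, assuming xenstat.pl is in the current directory and reusing the placeholder http://server/log.php endpoint from the feature list above:
# Push one 60-second CPU sample to the logging URL, repeatedly, in the background
while true; do
    perl xenstat.pl 60 1 http://server/log.php
done &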
Try Nagios or Munin.
Xentop is a tool to monitor the domains (VMs) running under Xen. VMware's ESX has a similar tool (I believe it's called esxtop).
The problem is that you'd like to see the historical CPU/Mem usage for domains on your Xen system, correct?
As with all virtualization layers, there are two views of this information relevant to admins: the burden imposed by the domain on the host, and what the domain thinks its own load is. If the domain thinks it is running low on resources but the host does not, it is easy to allocate more resources to the domain from the host. If the host runs out of resources, you'll need to optimize or turn off some of the domains.
Unfortunately, I don't know of any free tools to do this. XenSource provides a rich XML-RPC API to control and monitor their systems. You could easily build something from that.
If you only care about the domain's view of its own resources, I'm sure there are plenty of monitoring tools already available that fit your needs.
As a disclaimer, I should mention that the company I work for, Leostream, builds virtualization management software. Unfortunately, it does not really do utilization monitoring.
Hope this helps.
Both Nagios and Munin seem to have plugins/support for Xen data collection.
A Xen Virtual Machine Monitor Plugin for Nagios
munin plugins
