How to discover the high-performance network interface on a linux HPC cluster? - network-programming

I have a distributed program which communicates with ZeroMQ that runs on HPC clusters.
ZeroMQ uses TCP sockets, so by default on HPC clusters the communications will use the admin network. To avoid this, I have introduced an environment variable, read by my code, that forces communication onto a particular network interface.
With Infiniband (IB), usually it is ib0. But there are cases where another IB interface is used for the parallel file system, or on Cray systems the interface is ipogif, on some non-HPC systems it can be eth1, eno1, p4p2, em2, enp96s0f0, or whatever...
The problem is that I need to ask the administrator of the cluster the name of the network interface to use, while codes using MPI don't need to because MPI "knows" which network to use.
What is the most portable way to discover the name of the high-performance network interface on a linux HPC cluster? (I don't mind writing a small MPI program for this if there is no simple way)

There is no simple way and I doubt a complete solution exists. For example, Open MPI comes with an extensive set of ranked network communication modules and tries to instantiate all of them, selecting in the end the one that has the highest rank. The idea is that ranks somehow reflect the speed of the underlying network and that if a given network type is not present, its module will fail to instantiate, so faced with a system that has both Ethernet and InfiniBand, it will pick InfiniBand as its module has higher precedence. This is why larger Open MPI jobs start relatively slowly, and the approach is definitely not foolproof - in some cases one has to intervene and manually select the right modules, especially if the node has several network interfaces or InfiniBand HCAs and not all of them provide node-to-node connectivity. This is usually configured system-wide by the system administrator or the vendor and is why MPI "just works" (pro tip: in a not-so-small number of cases it actually doesn't).
You may copy the approach taken by Open MPI and develop a set of detection modules for your program. For TCP, spawn two or more copies on different nodes, list their active network interfaces and the corresponding IP addresses, match the network addresses and bind on all interfaces on one node, then try to connect to it from the other node(s). Upon successful connection, run something like the TCP version of NetPIPE to measure the network speed and latency and pick the fastest network. Once you've gotten this information from the initial small set of nodes, it is very likely that the same interface is used on all other nodes too, since most HPC systems are as homogeneous as possible when it comes to their nodes' network configuration.
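The address-matching step described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the `shared_subnets` helper, the dictionary input format and the example addresses are all hypothetical, and the actual connection and speed tests between nodes are not shown.

```python
import ipaddress


def shared_subnets(node_a, node_b):
    """Return (iface_a, iface_b, network) triples for subnets both nodes sit on.

    node_a and node_b map interface names to CIDR strings such as
    {"ib0": "10.10.0.5/16"}. This input format is an assumption of the sketch.
    """
    matches = []
    for iface_a, cidr_a in node_a.items():
        net_a = ipaddress.ip_interface(cidr_a).network
        if net_a.is_loopback:
            continue  # loopback never provides node-to-node connectivity
        for iface_b, cidr_b in node_b.items():
            net_b = ipaddress.ip_interface(cidr_b).network
            if net_a == net_b:
                matches.append((iface_a, iface_b, str(net_a)))
    return matches


# Hypothetical addresses gathered on two nodes (e.g. parsed from `ip -o -4 addr`).
node1 = {"lo": "127.0.0.1/8", "eno1": "192.168.1.10/24", "ib0": "10.10.0.5/16"}
node2 = {"lo": "127.0.0.1/8", "eno1": "192.168.1.11/24", "ib0": "10.10.0.6/16"}

# Each entry is a candidate network to connect over and then benchmark.
candidates = shared_subnets(node1, node2)
```

Each candidate pair would still have to be checked with a real connection attempt and a throughput/latency measurement before one is chosen.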
If there is a working MPI implementation installed, you can use it to launch the test program. You may also enable debug logging in the MPI library and parse the output, but this will require that the target system has an MPI implementation supported by your log parser. Also, most MPI libraries use native InfiniBand or whatever high-speed network API there is and will not tell you which is the IP-over-whatever interface, because they won't use it at all (unless configured otherwise by the system administrator).

Q : What is the most portable way to discover the name of the high-performance network interface on a linux HPC cluster?
This seems to be a gray zone: a multi-faceted problem that mixes site-specific (technical) hardware interface naming with the non-technical, weakly administered conventions for how each interface is meant to be used.
As-is State :
ZeroMQ can (as per RFC 37/ZMTP v3.0+) specify <hardware(interface)>:<port>/<service> details :
zmq_bind (server_socket, "tcp://eth0:6000/system/name-service/test");
And:
zmq_connect (client_socket, "tcp://192.168.55.212:6000/system/name-service/test");
yet has no means, to my knowledge, to reverse-engineer the primary use of such an interface in the holistic context of the HPC site and its hardware configuration.
It seems to me that your idea is a reasonable one: pre-test the administrative mappings with an MPI tool first, then let the ZeroMQ deployment use these externally detected configuration details (if they are indeed auto-detectable, as you assumed above) to select the proper (preferred) interface.
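Once an interface name has been obtained externally, feeding it to ZeroMQ can be as small as the sketch below. The variable name `MYAPP_NET_IFACE` and the `bind_endpoint` helper are hypothetical; the fallback to `*` (all interfaces) is a design choice of this sketch, not something the question prescribes.

```python
import os


def bind_endpoint(port, env_var="MYAPP_NET_IFACE", environ=None):
    """Build a ZeroMQ TCP bind endpoint from an externally supplied interface.

    env_var is a hypothetical variable name. When it is unset or empty the
    wildcard "*" is used, which makes ZeroMQ bind on all interfaces.
    """
    environ = os.environ if environ is None else environ
    iface = environ.get(env_var, "").strip() or "*"
    return f"tcp://{iface}:{port}"


# With pyzmq installed, the result would be passed to socket.bind(), e.g.
#   sock = zmq.Context.instance().socket(zmq.REP)
#   sock.bind(bind_endpoint(6000))
```

Keeping the lookup in one small function means the detection method (environment variable, MPI probe, or a site configuration file) can be swapped later without touching the socket code.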
The Safe Way to Go :
Asking the HPC-infrastructure Support Team ( who is responsible for knowing all of the above and trained to help Scientific Teams to use the HPC in the most productive manner ) would be my preferred way to go.
Disclaimer :
Sorry in case this did not help your will to read & auto-detect all the needed configuration details ( a universal BlackBox-HPC-ecosystem detection and auto-configuration strategy would hardly be a trivial one-liner, I guess, wouldn't it? )


Guidance on when to choose virtual machines or physical machines over containers

There are many articles and videos comparing containers, virtual machines and physical machines. However, almost all of the information is theoretical: containers are fast, VMs are secure, etc. I could not find a description of specific use cases, or guidance on when to choose virtual machines or physical machines rather than containers. So, currently I cannot imagine a situation in which somebody would recommend not using containers.
Question:
Could you please list specific applications or solutions when you would recommend using VMs, but not container?
Could you please list specific applications or solutions when you would recommend using OS over bare metal, but not containers or VMs?
Here is example of answer I would appreciate to get (note, that I am not sure if this information is correct):
Use case 1: Edge Router
Edge router is a router which connects organizational network to the Internet. Also, in this case it is assumed, that vendor of the router provides it not as device but as a software package (virtualized router).
An edge router will most probably be a target of attacks, so security requirements come first.
Containers are not recommended in this case. By default containers provide a mediocre level of security. Strong security can be achieved with complex configuration (what configuration?) but this is more difficult than in the case of a VM or bare metal. In addition, a high security level may require a specially hardened Linux kernel, but container technology does not allow adjusting the kernel configuration.
Virtual machines would be a good choice if the router vendor provides the software as a VM image, or when the organization has many edge routers (for example, many offices with internet access points) and has (or is ready to create) a well-established process for preparing VM images. In this case using VMs will simplify rollout, update and healing of the virtualized edge router. A VM also provides a high security level; nevertheless it is still recommended to place such a VM on a separate server and not to share that server with other applications/VMs, to avoid cross-VM attacks.
A physical machine would be a good choice if the router vendor provides the router's software as an application package (not as a VM) such as .rpm, and the rollout, update and healing processes are not expected to take much effort; this might be the case when the company has few routers (so updates can be performed manually or automated with tools like Ansible), and a couple of hours of planned and unplanned downtime is acceptable.
Use case 2: ...
Thank you in advance.
The question is a bit vague so I'll try my best:
You'd usually allocate work to containers when you have a few separate applications with limited physical resources and you'd like to run each of them in its own environment (different runtime version, architecture and dependencies), which would be cumbersome to manage on a single machine (physical or virtual).
You'd use a VM when you specifically want a feature that containers can't provide, or that would be a headache to set up with containers, and that a quick and easy VM could solve (and again you have limited resources that you'd like to share between use cases).
And finally, you'd use a physical machine when performance is of the essence, such as I/O requests and the latency around them.
You can also mix and match to meet the needs of each tier:
We need to run many applications for which VMs would be too much overhead, and containers would make handling them more automated and streamlined, so we use containers with k8s; on the other hand, we want the local storage offered to those containers to be very fast, so we run the k8s cluster on physical machines.
If recoverability were of the essence, we would use VMs because of the option of snapshotting VM state over time.
It's all a big LEGO set you can mix and match depending on your use case and needs

When to write a Custom Kernel Module

Problem Statement:
I have a very high bandwidth data link that is UDP based. The source of this data is not configurable, and it sends a stream of datagrams over UDP. We have code that uses the standard methods for receiving data on the UDP socket, and it works adequately. I wanted to know:
Does there exist a command interface to extract multiple UDP datagrams at a time, to improve efficiency?
If one doesn't exist, does it make sense to create a kernel module to provide the capability?
I am a novice, and I wanted to understand what thought process has to happen before writing your own kernel module seems appropriate. I know that such a surgical procedure isn't meant to be done lightly, but there must be a set of criteria where that action is prudent. Maybe not in my case, but in general.
HW / Kernel Module Perspective
A typical network adapter these days would be capable of distributing received packets across multiple hardware Rx queues thus letting the host run multiple software Rx queues bound to different CPU cores reading out packets in parallel. From a single HW/SW queue perspective, the host may poll it for new packets (see Linux NAPI), with each poll ideally yielding a batch of packets, and, alternatively, the host may still use interrupt-driven approach for Rx signalling with interrupt coalescing turned on for improved efficiency.
Existing NIC drivers in Linux kernel strive to stick with the most performant techniques, and the kernel itself should be able to leverage all of that properly.
Userland / Application Perspective
There's PACKET_MMAP interface provided by Linux kernel for improved Rx/Tx efficiency on the application side. Long story short, an application can set up a memory buffer shared between the kernel- and userspace and read out incoming packets from it, ideally in batches, or blocks, thus avoiding costly kernel-to-userspace copies and context switches so customary when using regular methods.
For added efficiency, the application may have multiple sockets bound to the NIC in separate threads / processes and demand that packet reception be load balanced across these sockets (see AF_PACKET fanout mode description).
DPDK Perspective
Kernel bypass framework that allows an application to seize full control of a network adapter by means of a vendor-specific poll-mode driver, or PMD, effectively running in userspace as part of the application and by its very nature not needing any kernel-to-userspace copies, context switches and, most likely, locking. Multi-queue receive operation, load balancing (round robin, RSS, you name it) and more cutting edge offloads are likely to be available, too (it's vendor specific).
Summary
The short of it is that, given that multiple network acceleration techniques already exist, one should not need to write a kernel module to solve the problem in question. By the looks of it, your application, which, as you say, uses standard methods, is not aware of the PACKET_MMAP technique, so I'd be tempted to suggest looking at that one closely. The DPDK approach might require that the application be effectively re-implemented from scratch, so I would go for PACKET_MMAP first as the low-hanging fruit.
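Before moving to PACKET_MMAP, a cheaper stopgap on a plain socket is to read everything that is already queued each time the socket becomes readable, instead of one datagram per wakeup. The sketch below shows that pattern only; it is not PACKET_MMAP and still costs one system call per datagram. The helper names, batch size and buffer size are assumptions.

```python
import socket


def make_receiver(host="127.0.0.1", port=0):
    """Create a non-blocking UDP socket bound to host:port (0 = any free port)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    sock.setblocking(False)
    return sock


def drain_datagrams(sock, max_batch=64, bufsize=65535):
    """Return up to max_batch datagrams that are already queued, without blocking."""
    batch = []
    while len(batch) < max_batch:
        try:
            data, _addr = sock.recvfrom(bufsize)
        except BlockingIOError:
            break  # receive queue is empty for now
        batch.append(data)
    return batch
```

A typical loop would wait in `select` or `poll` for readability and then call `drain_datagrams` once per wakeup, handing the whole batch to the processing code.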

Can I write a file to a specific cluster location?

You know, when an application opens a file and writes to it, the system chooses which cluster it will be stored in. I want to choose it myself! Let me tell you what I really want to do... In fact, I don't necessarily want to write anything. I have an HDD with a BAD range of clusters in the middle, and I want to mark that space as occupied by a file, and eventually set that file as hidden, unmovable and system (like the page file in Windows) so that the space won't be accessed anymore. Any ideas on how to do that?
Later Edit:
I think THIS is my last hope. I just found it, but I need to investigate... Maybe a file could be created anywhere and then relocated to the desired cluster. But that requires writing, and the function may fail if that cluster is bad.
I believe the answer to your specific question: "Can I write a file to a specific cluster location" is, in general, "No".
The reason for that is that the architecture of modern operating systems is layered so that the underlying disk store is accessed at a lower level than you can access, and of course disks can be formatted in different ways so there will be different kernel mode drivers that support different formats. Even so, an intelligent disk controller can remap the addresses used by the kernel mode driver anyway. In short there are too many levels of possible redirection for you to be sure that your intervention is happening at the correct level.
If you are talking about Windows - which you haven't stated but which appears to be assumed - then you need to be looking at storage drivers in the kernel (see https://learn.microsoft.com/en-us/windows-hardware/drivers/storage/). I think the closest you could reasonably come would be to write your own Installable File System driver (see https://learn.microsoft.com/en-us/windows-hardware/drivers/ddi/_ifsk/). This is really a 'filter' as it sits in the IO request chain and can intercept and change IO Request Packets (IRPs). Of course this would run in the kernel, not in userspace, and normally this would be written in C and I note your question is tagged for Delphi.
Your IFS driver can sit at different levels in the request chain. I have used this technique to intercept calls to specific file system locations (paths / file names) and alter the IRP so as to virtualise the request - even calling back to user space from the kernel to resolve how the request should be handled. Using the provided examples, implementing basic functionality with an IFS driver is not too involved, because it's a filter and not a complete storage system.
However the very nature of this approach means that another filter can also alter what you are doing in your driver.
You could look at replacing the file system driver that interfaces to the hardware, but I think that's likely to be an excessive task under the circumstances ... and as pointed out already by @fpiette the disk controller hardware can remap your request anyway.
In the days of MSDOS the access to the hardware was simpler and provided by the BIOS which could be hooked to allow the requests to be intercepted. Modern environments aren't that simple anymore. The IFS approach does allow IO to be hooked, but it does not provide the level of control you need.
EDIT regarding suggestion by the OP of using FSCTL_MOVE_FILE
For a simple environment this may well do what you want; it is designed to support a defragmentation process.
However, I still think there's no guarantee that this will actually do what you want.
You will note that the page you have linked to states that it moves one or more virtual clusters of a file from one logical cluster to another within the same volume.
This is a code that's passed to the underlying storage drivers which I have referred to above. What the storage layer does is up to the storage layer and will depend on the underlying technology. With more advanced storage there's no guarantee this actually addresses the physical locations which I believe your question is asking about.
However that's entirely dependent on the underlying storage system. For some types of storage relocation by the OS may not be honoured in the same way. As an example consider an enterprise storage array that has a built in data-tiering function. Without the awareness of the OS data will be relocated within the storage based on the tiering algorithms. Also consider that there are technologies which allow data to be directly accessed (like NVMe) and that you are working with 'virtual' and 'logical' clusters, not physical locations.
However, you may well find that in a simple case, with support in the underlying drivers and no remapping done outside the OS and kernel, this does what you need.
Since your problem is to mark bad clusters, you don't need to write any program. Use the command line utility CHKDSK that Windows provides.
In an elevated command prompt (Run as administrator), run the command:
chkdsk /r c:
The check will be done on the next reboot.
Don't forget to read the documentation.

What is the fastest way of communicating from PC to a Controller using LabVIEW?

I am working on a project wherein I need to communicate 8 boolean outputs to the controller based on the result generated by a program built using LabVIEW on a PC.
I have discussed this with a few colleagues, who suggest using a parallel-port data bus with TTL signals to communicate with a micro-controller, which they say will give maximum transfer speed.
I understand that it is a cost-effective solution, but will it be the fastest way to communicate with a micro-controller? Also, since it is a legacy technology with limited availability on standard PCs, I would have to buy an additional PCI-E card with a parallel port interface.

erlang general question on socket

I have a question about a project I should implement for my Distributed System course.
The project consists of designing and implementing a library that provides a reliable multicast service to user processes. All processes belong to a group, and a message is sent by a member process to all members of the group. The sender is excluded from the recipient list.
This seems to me quite easy to implement in Erlang, due to its message passing structure... More points are given if you use RPC calls instead of a normal socket-based implementation.
Now my question is this: one of the mandatory points of this project requires that sockets aren't kept open when there is no communication going on between processes...
Our course is held in C, but we are free to use any language we like...can I satisfy this constraint using erlang nodes and rpc calls?
thanks in advance
Yes. The rpc module even has multicall, which takes a list of nodes and will do exactly what you described. It won't hold your sockets open when it's not using them either.
Despite what the other answers say, Erlang's default behavior does not satisfy your constraints.
A typical network of Erlang nodes using Erlang distribution will remain densely connected (every node connected to every other node) with TCP sockets open even when you're not using them. You will either have to use -connect_all false and manage opening/closing the connections to other nodes yourself, or you will have to develop your own distribution protocol. I would recommend the latter, especially since you are learning. The trick to make it easy is to use term_to_binary and binary_to_term.
