Get the number of cores in Erlang with Linux - erlang

I am writing a concurrent program and need to know the number of cores on the system so that the program knows how many processes to start.
Is there a command to get this from inside Erlang code?
Thanks.

You can use
erlang:system_info(logical_processors_available)
to get the number of cores that can be used by the Erlang runtime system.

There is also:
erlang:system_info(schedulers_online)
which tells you how many scheduler threads are actually running.

To get the number of available cores, use the logical_processors flag to erlang:system_info/1:
1> erlang:system_info(logical_processors).
8
There are two companion flags to this one: logical_processors_online shows how many logical processors are online, and logical_processors_available shows how many are available to the Erlang runtime system (this is at most the number online). Any of the three returns the atom unknown when the runtime cannot detect the value.
To know how to parallelize your code, you should rely on schedulers_online which will return the number of actual Erlang schedulers that are available in your current VM instance:
1> erlang:system_info(schedulers_online).
8
Note however that parallelizing on this value alone might not be enough. Sometimes you have other processes running that need some CPU time and sometimes your algorithm would benefit from even more parallelism (waiting on IO for example). A rule of thumb is to use the value obtained from schedulers_online as a multiplier for parallelism, but always test with different multiples to see what works best for your application.
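As a minimal sketch of the rule of thumb above (the module name par_width and the function names are made up for illustration, not part of OTP), the worker count can be derived from schedulers_online and a multiplier that you tune by testing:

```erlang
-module(par_width).
-export([cores/0, workers/1]).

%% Logical processors available to the runtime. system_info/1 returns the
%% atom unknown when the value cannot be detected, so fall back to the
%% number of online schedulers in that case.
cores() ->
    case erlang:system_info(logical_processors_available) of
        unknown -> erlang:system_info(schedulers_online);
        N when is_integer(N) -> N
    end.

%% Worker count as a multiple of the online schedulers. The multiplier is
%% an assumption to be tuned per application (e.g. higher for IO-bound work).
workers(Multiplier) when is_integer(Multiplier), Multiplier > 0 ->
    Multiplier * erlang:system_info(schedulers_online).
```

Calling par_width:workers(1) gives one worker per scheduler; trying 2, 4 and so on against a real workload is how you would find the multiple that works best.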

How this information is exposed will be very operating system specific (unless you happen to be writing an operating system of course).
You didn't say what operating system you're working on. In the case of Linux, you can get the data from /proc/cpuinfo, however there are subtleties with the meaning of hyperthreading and the issue of multiple cores on the same die using a shared L2 cache (effectively you've got a NUMA architecture).

Related

Does an Operating System check every Instruction?

Not sure if anyone here can answer this.
I've learned that an Operating System checks if an instruction of a program changes something outside of its allocated memory, and if it does then the OS won't allow the program to do this.
But, if the OS has to check this for every instruction, won't this take up at least 5/6 of the CPU? I tried to replicate this, and this is how many clock cycles I've come up with to check this for every instruction.
If I've understood something wrong, please correct me, because I can't imagine that an OS takes up that much of the CPU.
There are several safeguards in place to ensure a non-privileged process behaves. I will discuss two of them in the context of the x86_64 architecture, but these concepts (mostly) extend to other major platforms.
Privilege Levels
A two-bit field in a CPU register (the low two bits of the CS segment register on x86_64) indicates the current privilege level. These privilege levels are often called rings, where ring 0 corresponds to the kernel (i.e. highest privilege) and ring 3 corresponds to a userspace process (i.e. lowest privilege). There are other rings, but they're not relevant to this introduction.
Certain instructions in x86_64 may only be executed by privileged processes. The current ring must be 0 to execute a privileged instruction. If you try to execute this instruction without the correct privileges, the processor raises a general protection fault. The kernel synchronously processes this interrupt, and will almost certainly kill the userspace process.
The ring level can only be changed while in ring 0, so the userspace process can't simply change from ring 3 to ring 0 by itself.
Execute Permission in Page Tables
All instructions to be executed are stored in memory. Many architectures (including x86_64) use page tables to store mappings from virtual addresses to physical addresses. These page tables have several bookkeeping bits as well, one of which controls execute permission (on x86_64 it is the NX, or no-execute, bit, so the sense is inverted). If the page holding the instruction being fetched is not executable, the processor raises a page fault. As with the general protection fault, the kernel handles this exception synchronously and will likely kill the offending process.
When are these execute bits set? They can be dynamically set via mmap(2), but in most cases the compiler emits special CODE sections in the binaries it generates, and when the OS loads the binary into memory it sets the execute bit in the page table entries for the pages that correspond to the CODE sections.
Who's checking these bits?
You're right to ask about the performance penalty of an OS checking these bits for every single instruction. If the OS were doing this, it would be prohibitively expensive. Instead, the processor supports privilege levels and page tables (with the execute bit). The OS can set these bits, and rely on the processor to generate interrupts when a process acts outside its privileges.
These hardware checks are very fast.

MPI/Pthread program does not scale

I have an MPI/Pthread program in which each MPI process runs on a separate computing node. Within each MPI process, a certain number of Pthreads (1-8) are launched. However, no matter how many Pthreads are launched within an MPI process, the overall performance is pretty much the same. I suspect all the Pthreads are running on the same CPU core. How can I assign threads to different CPU cores?
Each computing node has 8 cores (two quad-core Nehalem processors).
Open MPI 1.4
Linux x86_64
Questions like this are often dependent on the problem at hand. Most likely, you are running into a resource lock issue (where the threads are competing for a lock) -- this would look like only one core was doing any work, because only one thread can (effectively) do any work at any given time.
Setting CPU affinity for a certain thread is not a good solution. You should allow for the OS scheduler to optimally determine the physical core assignment for a given pthread.
Look at your code and try to figure out where you are locking where you shouldn't be, or if you've come up with a correct parallel solution to the problem at hand. You should also test a version of the program using only pthreads (not MPI) and see if scaling is achieved.

Is it better to start multiple erlang nodes per machine, or just one per machine?

Preface: When I say "machine" below, I mean either a physical dedicated server, or a virtual private server. When I say "node" I mean, an instance of the erlang virtual machine, of which there could be multiple running as separate processes under a single unix kernel.
I've got a project that involves multiple erlang/OTP applications. The applications will be running together and talking to each other on the same machine. They will all be hitting the disk, using memory and spawning erlang processes. They will also be using network resources because they will be talking to similar machines with the same set of applications running on them in a cluster.
Almost all of this communication is via HTTP. Thus I could separate each erlang OTP application into a separate instance of the erlang VM on the same machine and they could still talk to each other.
My question is: Is it better to have them running all under one erlang VM so that this erlang VM process can allocate access to resources among them, and schedule the execution of the various erlang processes.
Or is it better to have separate erlang nodes on a given server?
If one is better than the other, why?
I'm assuming running all of these apps in a single erlang vm which is given, essentially, full run of the server, will result in better performance. The OS is just managing the disk and ram at the low level, and only has one significant process (the erlang VM) to switch with... and the erlang VM is probably smarter about allocating resources when it has the holistic view of all the erlang processes.
This may be something that I need to test, but I'm not in a position to do so effectively in the near term.
The answer is: it depends.
Advantages of using a single node:
Memory is controlled by a single Erlang VM. It is way easier.
Inter-application communication (if using erlang-messaging) is faster.
Fewer operating system context switches happen.
Advantages of using multiple nodes:
If the system is linking in C code to the VM, death of one node due to a bug in C will not kill the others.
Agree with #I GIVE CRAP ANSWERS
I would go with one VM. Here is why:
dynamic handling of run time queues belonging to schedulers (with varied origin of CPU load it's important)
fewer VMs to monitor
better understanding of memory allocation and easier to spot malicious process (can compare all of them at once)
much easier inter app supervision
I wouldn't worry about a VM crash - you need to be prepared for that anyway. Heart works especially well in a cluster of equal units.
We've always used one VM per application because it's easier to manage.
The scheduler and SMP support in Erlang have come a long way in the past few years, so there isn't as much reason as there used to be to run multiple VMs on the same node.
I agree with the previous answers, but there is one scenario where having multiple nodes per CPU is the answer: when a heavy task hits the node. A task may take several minutes to complete, and in that case a gen_server will hold the node until the task completes.

How does Erlang pass messages between processes on the same node?

Between nodes, messages are (must be) passed over TCP/IP. However, by what mechanism are they passed between processes running on the same node? Is TCP/IP used in this case as well? Unix domain sockets? What is the difference in performance between "within node" and "between node" message passing?
by what mechanism are they passed between processes running on the same node?
Because Erlang processes on the same node are all running within a single native process — the BEAM emulator — message structures are simply copied into the receiver's message queue. The message structure is copied, rather than simply referenced, for all the standard no-side-effects functional programming reasons.
See erts_send_message() in erts/emulator/beam/erl_message.c in the Erlang sources for more detail. In R15B01, the bits most relevant to your question start at line 980 or so, with the call to erts_queue_message().
If you did choose to run multiple BEAM emulators on a single physical machine, I would guess messages get sent between them the same way as between different physical machines. There's probably no good reason to do that now that BEAM has good SMP support, though.
What is the difference in performance between "within node" and "between node" message passing?
A simple benchmark on your actual hardware would be more useful to you than anecdotal evidence from others.
If you want generalities, however, observe that memory bandwidths are around 20 GByte/sec these days, and that you're unlikely to have a network link faster than 10 Gbit/sec between nodes. That means that while there may be many differences between your actual application and any simple benchmark you perform or find, these differences probably cannot swamp an order of magnitude difference in transfer rate.
If you "only" have a 1 Gbit/sec end-to-end network link between nodes, intranode transfers will probably be over two orders of magnitude faster than internode transfers.
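The simple benchmark suggested above could be sketched as a ping-pong loop between two processes on the same node. This is only an illustrative sketch: the module name pingpong and its functions are hypothetical, and the tiny atom messages used here say nothing about the cost of copying large terms.

```erlang
-module(pingpong).
-export([run/1]).

%% Send N ping messages to an echo process on the same node, waiting for
%% each pong, and return the average round-trip time in microseconds.
run(N) when is_integer(N), N > 0 ->
    Echo = spawn(fun echo/0),
    {Micros, ok} = timer:tc(fun() -> loop(Echo, N) end),
    Echo ! stop,
    Micros / N.

%% One round trip per iteration.
loop(_Echo, 0) ->
    ok;
loop(Echo, N) ->
    Echo ! {self(), ping},
    receive pong -> ok end,
    loop(Echo, N - 1).

%% Reply pong to every ping until told to stop.
echo() ->
    receive
        {From, ping} -> From ! pong, echo();
        stop -> ok
    end.
```

To compare against between-node messaging, the same loop could be pointed at an echo process spawned on another node with spawn/2, and the payload replaced with terms of the size your application actually sends.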
"All data in messages between Erlang processes is copied, with the exception of refc binaries on the same Erlang node.":
http://erlang.org/doc/efficiency_guide/processes.html#id2265332

Erlang Documentation/SMP: single-node and multi-node per machine or per application, and the confusion that may follow

I'm studying Erlang's process model at the moment. I have hit a snag in a tech report (section 3, paragraph 2) on Erlang:
This explains why it in some cases can be more efficient to run several SMP VM's
with one scheduler each instead on one SMP VM with several schedulers. Of course
the running of several VM's require that the application can run in many parallel tasks
which has no or very little communication with each other.
Now this paragraph is confusing me; I can see the single-process, multiple-scheduler scenario, but I am failing to see multiple processes with a single scheduler each. Presumably each process would have a different node name, and this would mean a given application, without modification, cannot be used with this model; the virtue of not requiring modification has been mentioned as a key feature of SMP in the report. If the multiple processes have the same node name, then performance would be disastrous due to inter-Erlang-process messaging storms -- this assumes the use of in-memory Mnesia. Is there some process model that is not introduced in the article and that I am missing here?
What is the author trying to say here? Is he trying to suggest that an application would have to be rewritten (to take multiple unique node names into account) for the multi-process single-scheduler case?
-- edit 1: Clarification of Source of Problem --
The question has been answered through discussion; the following is an outline of the trouble I had.
The issue for this question has been that the documentation, as I recall, does not touch on a scenario of running multiple Erlang emulators per physical machine -- it has always been shown that the emulator represents your physical machine (in industrial usage); also, the scenario of having to explicitly partition a program for computational efficiency has never been considered. This sudden introduction has been the source of my woe.
The convention is still biased towards creating LOTS of processes, and the future holds many improvements for Erlang's SMP emulator, so a single node per machine is still a very viable option, assuming favourable application design.
Rewrite after reading article:
This explains why it in some cases can
be more efficient to run several SMP
VM's with one scheduler each instead
on one SMP VM with several schedulers.
A non-SMP VM has no locks, so it runs fast.
A single-scheduler SMP VM is 10% slower, due to the cost of checking locks.
A multiple-scheduler SMP VM is slower again, due to using and waiting for locks.
Of course the running of several VM's
require that the application can run
in many parallel tasks which has no or
very little communication with each
other.
I think: nodes on the same server have to have different names.
Messaging between VM nodes will be slower than messaging within a single VM node, because it crosses operating system process boundaries.
If you have multiple schedulers in a single VM, they will inevitably contend over various resources (e.g. ets meta table, atom-table, scheduler run-queue during migration, etc.) because of the inner architecture. If you have a single scheduler, contention will obviously not occur. Lock checking and acquiring will still be done though, so running a non SMP VM instead shall yield even better performance (but requires a rebuilding of the VM from source).
Take a four-core machine for example. Option one means that you run four instances of the Erlang VM, each with a single scheduler, affinity set to different processor cores. Option two means running a single Erlang VM with four schedulers, each scheduler's affinity set to different processor cores.
If you have a whole lot of independent processes to run, option one will result in better performance, because the four cores will be fully utilized (theoretically). In contrast, in option two, this won't be possible, because lock contention between the schedulers will make execution on the cores wait for each other every now and then.
On the other hand, if your processes need to chatter a lot, option two is the way to go, because communication between processes inside one VM is way cheaper than communication between different VMs. You gain more with this than you lose with lock contention.
I believe the answer is in the preceding paragraph:
The SMP VM with only one scheduler is slightly slower (10%) than the non SMP VM. This is because the SMP VM need to use locks for all shared datastructures. But as long as there are no lock-conflicts the overhead caused by locking is not that high (it is the lock conflicts that takes time).
Scheduler's reliance on locks for shared data structures can impose an overhead on a given system. It seems to follow that having multiple schedulers on one SMP VM imposes a collectively greater overhead.
There are some advantages to running several nodes on one physical machine.
1) Resource locking overhead as mentioned.
2) Fail-over. In telecom products you really don't want to have the beam come crashing down on you. If you have NIFs or linked-in drivers in your system this might occur.
3) Memory locality. Running a few nodes gives you a poor man's way to force processes onto a few cores. This could be a big boost, typically for NUMA architectures but also for SMP. The scheduler doesn't take NUMA into account (yet). You can spawn a process on a specific scheduler and lock it there so that it won't migrate, but that is an undocumented feature ... or it was removed altogether. I forget.
With several nodes you will of course need a load balancer between the nodes, but that is the usual way to do it anyway: some logic that supervises the nodes.
However, the numbers from the EUC papers are over a year old [#] and I wouldn't recommend a multi-node approach if you don't really need it. The runtime system is much better at handling these types of problems today. A lot of lock overhead has been removed and the mrq-scheduler has been improved.
# 2009's numbers look like this.
Edit:
Regarding 3), the spawn feature I mentioned is
spawn_opt(fun() -> ... end, [{scheduler, Id}]) -> pid()
where Id is an integer and refers to a specific scheduler.
I wouldn't recommend using it, since it is undocumented.
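For the curious, a hedged sketch of how one might check whether the option is honoured on a given runtime. The {scheduler, Id} option is the undocumented one described above and may be absent or behave differently in your OTP release; the module name sched_pin and the function where/1 are made up for illustration. erlang:system_info(scheduler_id) is the documented call that reports which scheduler the calling process is running on.

```erlang
-module(sched_pin).
-export([where/1]).

%% Try to spawn a process bound to scheduler Id and report which scheduler
%% it actually ran on. Returns unsupported if the runtime rejects the
%% (undocumented, possibly removed) {scheduler, Id} option with badarg.
where(Id) when is_integer(Id), Id > 0 ->
    Parent = self(),
    Report = fun() -> Parent ! {ran_on, erlang:system_info(scheduler_id)} end,
    try spawn_opt(Report, [{scheduler, Id}]) of
        _Pid ->
            receive
                {ran_on, SchedulerId} -> {ok, SchedulerId}
            after 1000 ->
                timeout
            end
    catch
        error:badarg -> unsupported
    end.
```

If sched_pin:where(1) returns {ok, 1}, the binding took effect on that runtime; any other result is a reason to stay away from the option, which fits the advice above.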
