The answer to this question says that Erlang PIDs are actually 28-bit integers, the first 10 of which are the node number (always 0 for the local node), and the next 18 of which are an index into the global process table. So, if my understanding is correct, and assuming we're only working on a single node, the maximum number of unique PIDs is 2^18, or about 262,000. Is this then the maximum number of processes that I can spawn on a single Erlang node over time? If I have a very long-running Erlang node, will the VM crash as soon as I spawn my (2^18+1)th process, or do old, unused PIDs get reused? If so, how is that reuse implemented at the VM level?
The answer to the other question refers to an older version of the Erlang runtime; the representation changed after R9 (R17 is the latest at the time of writing). According to the implementation, a process identifier uses 28 bits for the internal identifier.
PIDs are recycled once the process has died and any monitors have been notified, so 2^28 is the upper limit on the number of simultaneous processes on a node, not on the number of processes ever spawned.
The default process limit is 2^18 and can be raised with the +P option to erl; see the erl options documentation.
Note: the documentation says the upper limit is 2^27 processes, which is not consistent with the code.
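For a quick sanity check, here is a minimal shell sketch; the output values are illustrative and depend on the release and on the flags the VM was started with:

%% Default limit is 2^18 = 262144 on recent releases; start the VM with e.g.
%% `erl +P 1000000` to raise it (the actual limit chosen may be rounded up).
1> erlang:system_info(process_limit).
262144
%% PIDs of dead processes are recycled, so spawning short-lived processes in a
%% loop does not exhaust the PID space over time:
2> [spawn(fun() -> ok end) || _ <- lists:seq(1, 3)].
[<0.88.0>,<0.89.0>,<0.90.0>]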
I am trying to use some of the uncore hardware counters, such as skx_unc_imc0-5::UNC_M_WPQ_INSERTS, which is supposed to count the number of allocations into the Write Pending Queue. The machine has 2 Intel Xeon Gold 5218 CPUs (Cascade Lake), with 2 memory controllers per CPU. The Linux version is 5.4.0-3-amd64. I have the following simple loop and I am reading this counter for it. The array elements are 64 bytes in size, equal to a cache line.
// each array element is 64 bytes, i.e. one cache line, so every iteration
// writes to a different cache line
for (int i = 0; i < 1000000; i++) {
    array[i].value = 2;
}
For this loop, when I map the memory to the DRAM NUMA node, the counter gives around 150,000 as a result, which maybe makes sense: there are 6 channels in total for the 2 memory controllers in front of this NUMA node, which use DRAM DIMMs in interleaving mode. There is one separate WPQ per channel, I believe, so skx_unc_imc0 sees 1/6 of all the stores. The skx_unc_imc0-5 counters that I got with papi_native_avail are supposedly one per channel.
The unexpected result is when, instead of mapping to the DRAM NUMA node, I map the program's memory to Non-Volatile Memory, which is presented as a separate NUMA node on the same socket. There are 6 NVM DIMMs per socket, which form one interleaved region. So when writing to NVM, there should similarly be 6 different channels in use, each with the same single WPQ in front of it, which should again get 1/6 of the write inserts.
But UNC_M_WPQ_INSERTS returns only around 1,000 as a result on NV memory. I don't understand why; I expected it to similarly report around 150,000 writes into the WPQ.
Am I interpreting/understanding something wrong? Or are there two different WPQs per channel, depending on whether the write goes to DRAM or NVM? Or what else could be the explanation?
It turns out that UNC_M_WPQ_INSERTS counts the number of allocations into the Write Pending Queue only for writes to DRAM.
Intel has added a corresponding hardware counter for persistent memory: UNC_M_PMM_WPQ_INSERTS, which counts write requests allocated in the PMM Write Pending Queue for Intel Optane DC persistent memory.
However, no such native event shows up in papi_native_avail, which means it cannot be monitored with PAPI yet. In Linux 5.4, some of the PMM counters can be found directly in perf list uncore, such as unc_m_pmm_bandwidth.write - Intel Optane DC persistent memory bandwidth write (MB/sec), derived from unc_m_pmm_wpq_inserts, unit: uncore_imc. This implies that even though UNC_M_PMM_WPQ_INSERTS is not directly listed as a perf event, it does exist on the machine.
As described here, the EventCode for this counter is 0xE7, so it can be used with perf as a raw hardware event descriptor: perf stat -e uncore_imc/event=0xe7/. It does not, however, seem to support event modifiers to restrict counting to user space. After pinning the thread to the same socket as the NVM NUMA node, for the program that basically only runs the loop described in the question, the result from perf more or less makes sense:
Performance counter stats for 'system wide':

         1,035,380      uncore_imc/event=0xe7/
So far this seems to be the best guess.
I have four Erlang nodes working together on a multi-process application.
In my setup, one node runs a monitor process that draws the locations of the processes on the area, and the three other nodes handle the processes' location and movement. On the monitor I use an ETS table to store the locations, with the process PID as the key. I have noticed that the nodes create processes with the same PIDs, which obviously interferes with the management of the entire system.
I have tried to connect the nodes with:
net_adm:ping(...).
net_kernel:connect(...).
I was hoping that once the nodes were aware of each other they would hand out different PIDs, but that did not work.
The PIDs may be printed the same, e.g. <0.42.0>, but that's just an output convention: PIDs on the local node are printed with the first number being 0. If you'd send this PID to another node and print it there, it would be printed as <2265.42.0> or similar. PIDs are always associated with the name of the node where the process is running, and you can extract it with node(Pid). Therefore, PIDs from different nodes will never compare equal.
This answer goes into more details about the structure of a PID.
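A minimal shell sketch on the monitor node (the node name, the PID, and the locations table with its X/Y values are made up for illustration):

%% Pid was received in a message from one of the worker nodes:
1> node(Pid).
'worker1@host'
%% PIDs from different nodes never compare equal, even if they happen to print
%% the same on their home nodes, so the PID itself is already a safe ETS key.
%% If you want the owning node to be explicit in the key as well:
2> ets:insert(locations, {{node(Pid), Pid}, {X, Y}}).
true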
I'm looking for an automatic way to do my load balancing, and the pool module attracted me.
As the manual says,
pool can be used to run a set of Erlang nodes as a pool of computational processors.
It is organized as a master and a set of slave nodes and includes the following features:
The slave nodes send regular reports to the master about their current load.
Queries can be sent to the master to determine which node will have the least load.
The BIF statistics(run_queue) is used for estimating future loads.
It returns the length of the queue of ready to run processes in the Erlang runtime system.
How frequently do the slave nodes send these reports, and how much load does the reporting add?
Is this a proper way to do load balancing?
Reports are sent every 2 seconds and use information gathered from statistics(run_queue) to determine the node with the least load. run_queue returns the queue size of the current node's scheduler.
When you call pool:get_node/0 you get the node with the lowest number of tasks waiting to be executed on its scheduler. Keep in mind that the nodes are kept in sorted order, so calls to pool:get_node/0 do not query the nodes directly, but rather rely on information that can be up to 2 seconds old.
If you need a load balanced pool of nodes, pool works great.
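A rough usage sketch, assuming a .hosts.erlang file that lists the hosts to start slaves on (mypool, my_mod and my_fun are placeholders):

%% Start the pool and run a job on the currently least-loaded node.
Nodes = pool:start(mypool),               %% starts the slave nodes, returns their names
BestNode = pool:get_node(),               %% least-loaded node (info may be up to ~2 s old)
Pid = pool:pspawn(my_mod, my_fun, []),    %% spawns my_mod:my_fun/0 on the least-loaded node
io:format("nodes: ~p best: ~p job: ~p~n", [Nodes, BestNode, Pid]).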
Here is some more info from the pool.erl source:
%% Supplies a computational pool of processors.
%% The chief user interface function here is get_node()
%% Which returns the name of the nodes in the pool
%% with the least load !!!!
%% This function is callable from any node including the master
%% That is part of the pool
%% nodes are scheduled on a per usgae basis and per load basis,
%% Whenever we use a node, we put at the end of the queue, and whenever
%% a node report a change in load, we insert it accordingly
Is there a limit to the number of processes that can be registered globally? Or is this only limited by memory / the maximum number of atoms?
Ubuntu 12.04 and Erlang R15B01.
Good question! I'd bet on the number of atoms, if you take into account the following. The Efficiency Guide has a section on system limits:
Processes
The maximum number of simultaneously alive Erlang processes is by default 32768. This limit can be raised up to at most 268435456 processes at startup (see documentation of the system flag +P in the erl(1) documentation). The maximum limit of 268435456 processes will at least on a 32-bit architecture be impossible to reach due to memory shortage.
Distributed nodes
Known nodes
A remote node Y has to be known to node X if there exist any pids, ports, references, or funs (Erlang data types) from Y on X, or if X and Y are connected. The maximum number of remote nodes simultaneously/ever known to a node is limited by the maximum number of atoms available for node names. All data concerning remote nodes, except for the node name atom, are garbage-collected.
Also, the erl manual section describes the flag you can use to alter the number of processes in your node:
+P Number
Sets the maximum number of concurrent processes for this system. Number must be in the range 16..134217727. Default is 32768.
Since you can alter the number of concurrent processes per node, but you can't alter the number of allowed atoms, and globally registered names are atoms that are replicated on every node, the atom limit should be what bounds the total number of globally registered processes.
Hope it helps :)
EDIT: Actually, turns out you can change the number of allowed atoms :)
Atoms
By default, the maximum number of atoms is 1048576. This limit can be raised or lowered using the +t option.
+t size
Set the maximum number of atoms the VM can handle. Default is 1048576.
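A small shell sketch tying the two limits together; the values shown are just the defaults quoted above, and erlang:system_info(atom_limit) needs a reasonably recent OTP release:

%% Maximum number of simultaneous processes on this node (raised with +P):
1> erlang:system_info(process_limit).
32768
%% Maximum number of atoms (raised or lowered with +t):
2> erlang:system_info(atom_limit).
1048576
%% Names currently registered globally; each one is an atom known on every node:
3> length(global:registered_names()).
0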
When I ran my WebSocket test, I found the following interesting memory usage results:
Server started, no connections
[{total,573263528},
{processes,17375688},
{processes_used,17360240},
{system,555887840},
{atom,472297},
{atom_used,451576},
{binary,28944},
{code,3774097},
{ets,271016}]
44 processes,
System: 705M,
Erlang residence: 519M
100K Connections
[{total,762564512},
{processes,130105104},
{processes_used,130089656},
{system,632459408},
{atom,476337},
{atom_used,456484},
{binary,50160},
{code,3925064},
{ets,7589160}]
100044 processes,
System: 1814M,
Erlang residence: 950M
200K Connections
(server restarted and connections created from 0, not continued from case 2)
[{total,952040232},
{processes,243161192},
{processes_used,243139984},
{system,708879040},
{atom,476337},
{atom_used,456484},
{binary,70856},
{code,3925064},
{ets,14904760}]
200044 processes,
System: 3383M,
Erlang residence: 1837M
The figures labelled "System:" and "Erlang residence:" are provided by htop; the others are the output of the memory() call in the Erlang shell. Look at the total and the Erlang resident memory: when there are no connections the two are roughly the same, with 100K connections resident memory is a little larger than total, and with 200K connections resident memory is almost double the total.
Can anybody explain?
The most probable answer to your question is memory fragmentation.
Allocating OS memory is expensive, so Erlang tries to manage memory for you.
When Erlang allocates memory, it creates an entity called a "carrier", which consists of many "blocks". Erlang's memory(total) reports the sum of all block sizes (the memory actually used), while the OS reports the sum of all carrier sizes (memory used plus memory preallocated). Both sums can be read from the Erlang VM. If (sum of block sizes)/(sum of carrier sizes) << 1, the VM has a hard time freeing carriers: there may be many big carriers with only a couple of blocks still in use. You can read the raw numbers with erlang:system_info({allocator, Type}), but there is an easier way, using the Recon library:
http://ferd.github.io/recon/recon_alloc.html
Firstly check:
recon_alloc:fragmentation(current).
and next:
recon_alloc:fragmentation(max).
This should explain the difference between the total memory reported by the Erlang VM and by the OS. If you are sending many small messages over the websockets, you can decrease the fragmentation by running Erlang with two options:
erl +MBas aobf +MBlmbcs 512
The first option changes the block allocation strategy from best fit to address order best fit, which can help squeeze more blocks into the first carriers; the second one decreases the maximum multiblock carrier size, which makes the carriers smaller (and should make them easier to free).
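A quick sketch of how to compare the two views from the shell once Recon is loaded (the numbers here are made up):

%% Sum of all block sizes, i.e. memory the VM actually uses
%% (should be close to erlang:memory(total)):
1> recon_alloc:memory(used).
952040232
%% Sum of all carrier sizes, i.e. memory taken from the OS
%% (should be close to the resident size reported by htop):
2> recon_alloc:memory(allocated).
1926343112
%% Their ratio; a low value means heavy fragmentation:
3> recon_alloc:memory(usage).
0.494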