How do Erlang atoms work?

While looking for documentation on the details, I did not find much beyond:
There is one atom table per Erlang runtime instance.
An atom's string literal is stored only once.
An atom takes one word.
To me, this leaves a lot unclear.
1. Is the atom word value always the same, regardless of the order in which modules are loaded into a runtime instance? If modules A and B both define/reference some atoms, will the value of an atom change from session to session, depending on whether A or B was loaded first?
2. When matching against an atom inside a module, is there some "atom literal to atom value" resolution taking place? Do modules have their own module-local atom-value lookup table, which gets filled in when the module is loaded?
3. In a distributed scenario where two Erlang runtime instances communicate with each other, is there some "sync-atom-tables" action going on? Or do atoms get serialized as string literals instead of as words?

An atom is simply an ID maintained by the VM. The ID is represented as a machine-word-sized integer of the underlying architecture, e.g. 4 bytes on 32-bit systems and 8 bytes on 64-bit systems. See the usage in the LYSE book.
The same atom in the same running VM is always mapped to the same ID (integer). For example the following tuple:
{apple, pear, cherry, apple}
could be stored as the following tuple in the actual Erlang memory:
{1, 2, 3, 1}
All atoms are stored in one big table which is never garbage-collected, i.e. once an atom is created in a running VM it stays in the table until the VM is shut down.
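To make the idea concrete, here is a toy model of such a table written in Python. It is only a sketch: the class name AtomTable, its methods and the IDs starting at 0 are my own assumptions for illustration and say nothing about how the BEAM actually lays out its table.

```python
class AtomTable:
    """Toy model of a per-VM atom table: atom name <-> small integer ID."""

    def __init__(self):
        self._ids = {}    # atom name -> ID
        self._names = []  # ID -> atom name (entries are never removed)

    def intern(self, name):
        # Insert the name on first use; every later use returns the same ID.
        if name not in self._ids:
            self._ids[name] = len(self._names)
            self._names.append(name)
        return self._ids[name]

    def name_of(self, atom_id):
        return self._names[atom_id]


# One running "VM": the same atom always maps to the same ID.
vm = AtomTable()
term = tuple(vm.intern(n) for n in ("apple", "pear", "cherry", "apple"))
print(term)  # (0, 1, 2, 0)

# Another "VM run" in which other atoms happened to be created first.
vm2 = AtomTable()
for n in ("ok", "error"):
    vm2.intern(n)
term2 = tuple(vm2.intern(n) for n in ("apple", "pear", "cherry", "apple"))
print(term2)  # (2, 3, 4, 2)
```

Within one table, comparing two atoms is just comparing two integers. Across tables, the integers mean nothing without the names.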
Answering your questions:
1. No. The ID of the atom can change between VM runs. If you shut down the VM and reload the tuple above, the system might end up with the following IDs:
{50, 51, 52, 50}
depending on what other atoms have been created before it was loaded. Atoms only live as long as the VM.
2. No. There is only one table of atoms per VM. All literal atoms in the module are mapped to their IDs when the module is loaded. If a particular atom doesn't yet exist in that table, it is inserted and stays there until the VM restarts.
3. No. Atom tables are per VM and they are separate. Consider a situation where two VMs are started at the same time but don't know of each other. Atoms created in each VM may have different IDs in the table. If one node later learns about the other, the same atom may well have different IDs on the two nodes. The tables can't be easily synchronized or merged. But atoms aren't simply sent as text representations to the other node either. They are "compressed" into a form of cache and sent all together in the header. See the distribution header in the description of the communication protocol. Basically, the header contains the atoms used in later terms with their IDs and textual representation. Then each term references the atom by the ID specified in the header rather than passing the same text each time.
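The header idea from point 3 can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the real wire format: the function names encode_message and decode_message are hypothetical, and the actual distribution header is a binary structure with a cache that persists across messages.

```python
def encode_message(terms):
    """Replace repeated atom names by indices into a per-message header.

    `terms` is a list of tuples of atom names (strings). Returns
    (header, body): header lists each distinct name once, body has the
    same shape as `terms` but holds header indices instead of names.
    """
    header = []
    index = {}
    body = []
    for term in terms:
        encoded = []
        for name in term:
            if name not in index:
                index[name] = len(header)
                header.append(name)
            encoded.append(index[name])
        body.append(tuple(encoded))
    return header, body


def decode_message(header, body):
    # The receiver resolves indices through the header; interning the names
    # into its own atom table is not modelled here.
    return [tuple(header[i] for i in term) for term in body]


terms = [("ok", "apple"), ("error", "apple"), ("ok", "pear")]
header, body = encode_message(terms)
print(header)  # ['ok', 'apple', 'error', 'pear']
print(body)    # [(0, 1), (2, 1), (0, 3)]
print(decode_message(header, body) == terms)  # True
```

Each atom's text crosses the wire once, and the sender's internal IDs never need to match the receiver's.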

To get really basic without going into implementation, an atom is a literal "thing" with a name. Its value is always itself and it knows its own name. You generally use it when you want a tag, like the atoms ok and error. Atoms are unique in the sense that there is only one atom foo in the system, and each time I refer to foo, I am referring to this same unique foo, irrespective of whether the references are in the same module or come from the same process. There is always only one foo.
A bit of implementation. Atoms are stored in a global atom table, and when you create a new atom, it is inserted into the table if it is not already there. This makes comparing atoms for equality very fast as you just check if the two atoms refer to the same slot in the atom table.
While separate instances of the VM, nodes, have separate atom tables, the communication between the nodes in distributed erlang is optimised for this, so very often you don't need to send the actual atom name between nodes.

Related

Lua Virtual Machine Register size

In the register-based Lua virtual machine, are the registers a fixed size?
Or is it a dynamic structure?
I found a bytecode example (page 17 of the document linked below) where the constant string "hello" is loaded into a register, so it must be dynamic? Isn't this uncommon for registers?
http://luaforge.net/docman/83/98/ANoFrillsIntroToLua51VMInstructions.pdf
Each register contains a Lua value. Lua values are implemented in C as tagged unions. See also: The Implementation Of Lua 5.0. This tagged union stores small types (booleans, numbers) by value and everything else (strings, tables, functions, etc.) as a pointer. So the size of a register is constant, though larger than one native machine word.
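A rough way to see why the size stays constant is to model the tagged union with Python's ctypes. The field names and tag values here are made up for illustration, loosely following the description above rather than copying Lua's headers.

```python
import ctypes


class Value(ctypes.Union):
    # Small types are stored by value, everything else as a pointer.
    _fields_ = [
        ("b", ctypes.c_int),      # boolean
        ("n", ctypes.c_double),   # number
        ("gc", ctypes.c_void_p),  # pointer to a string, table, function, ...
    ]


class TaggedValue(ctypes.Structure):
    _fields_ = [
        ("value", Value),
        ("tag", ctypes.c_int),    # says which union member is valid
    ]


TAG_NUMBER, TAG_STRING = 1, 2     # hypothetical tag values

num = TaggedValue()
num.tag = TAG_NUMBER
num.value.n = 3.5

text = ctypes.create_string_buffer(b"hello")  # the string lives elsewhere
s = TaggedValue()
s.tag = TAG_STRING
s.value.gc = ctypes.addressof(text)           # the register only holds a pointer

print(ctypes.sizeof(num) == ctypes.sizeof(s))
print(ctypes.sizeof(TaggedValue) > ctypes.sizeof(ctypes.c_void_p))
```

Loading "hello" into a register therefore stores a pointer plus a tag, not the characters themselves.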

Listing available records available to a process in Erlang

Records are compile-time structures. The record_info and is_record functions recognise the compiled records and their structures. Is there a way to ask the VM which records have been defined and are available to the process? I am interested in getting the internal tuple representation for every record definition.
What I want to do is something like:
-record(car, {make = honda}).
get_record(Car) ->
    %% Some magic here to end up having something like
    %% {car,{make,honda}} or, even better, #car{} when Car = 'car'
As you said, records are only a compile-time construct, so once compiled, records are only tuples. This would suggest that no record information is left at runtime, but since you mentioned those two functions I was curious and checked how they work.
According to this record_info/2 is a pseudo function made available only during compilation, so it doesn't need any run time information on records.
On the other hand the description of is_record(Term, RecordTag) states that this BIF (built-in function) only "returns true if Term is a tuple and its first element is RecordTag, false otherwise", so it is actually only checking the structure and first element of the tuple.
Based on this, I would guess that there is no record information made available during runtime. This thread confirms the unavailability of record_info/2 during runtime.
I have used Dynarec (https://github.com/dieswaytoofast/dynarec.git) successfully in a data mapping module for one of the apps I am currently working on. It is a parse transform, though, not a run-time VM tool. It compiles information on each defined record, as well as information about the fields of each record. In my case, I use it to dynamically map incoming data to record data. This module may get you what you need. YMMV. Good luck.
As others have said, records are purely compile-time and there is no runtime information about records. Erlang just sees tuples. For example, calls to the record_info/2 pseudo-function are expanded to data at compile time: a list of atoms for the fields argument and an integer for the size argument.

How to create a string outside of Erlang that represents a DICT Term?

I want to construct a string in Java that represents a DICT term and that will be passed to an Erlang process to be converted back into an Erlang term (string-to-term).
I can achieve this easily for ORDDICTs, since they are structured as a simple sorted list of key/value tuples such as: [ {field1 , "value1"} , {field2 , "value2"} ]
But DICTs are compiled into a specific term that I want to reverse-engineer. I am aware this structure can change in new releases, but the benefits for performance and ease of integration with Java would outweigh this. Unfortunately Erlang's JInterface is based on simple data structures. An efficient DICT type would be of great use.
A simple dict gets defined as follows:
D1 = dict:store("field1","AAA",dict:new()).
{dict,1,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
{{[],[],[],[],[],[],[],[],
[["field1",65,65,65]],
[],[],[],[],[],[],[]}}}
As can be seen above, there are some values whose meaning I do not understand (the numbers 1,16,16,8,80,48 and a set of empty lists, which likely represent something as well).
Adding two other rows (key-value pairs) causes the data to look like:
D3 = dict:store("field3","CCC",D2).
{dict,3,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
{{[],[],
[["field3",67,67,67]],
[],[],[],[],[],
[["field1",65,65,65]],
[],[],[],[],
[["field2",66,66,66]],
[],[]}}}
From the above I can notice that:
the first number (3) represents the number of items in the DICT.
the second number (16) shows the number of list slots in the first tuple of lists.
the third number (16) shows the number of list slots in the second tuple of lists, which is where the values ended up being placed (in the middle).
the fourth number (8) appears to be the number of slots in the second row of tuples from where the values are placed (a sort of index pointer).
the remaining numbers (80 and 48)... no idea...
adding a key "field0" places it not at the end but just after "field1"'s data. This indicates the indexing approach.
So the question is: is there a way (an algorithm) to reliably create a DICT string directly from outside of Erlang?
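The comprehensive specification of how dict is implemented can be found simply in the dict.erl source code. There, the leading numbers are the fields of the dict record: size (number of elements), n (number of active slots), maxn (maximum slots), bso (buddy slot offset), exp_size and con_size (the sizes at which the table expands and contracts), followed by an empty segment and the tuple of segments.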
But I'm not sure replicating dict.erl's implementation in Java is worthwhile. This would only make sense if you want a fast dict like data structure that you need to pass often between Java and Erlang code. It might make more sense to use a Key-Value store both from Erlang and Java without passing it directly around. Depending on your application this could be e.g. riak or maybe even connect your different language worlds with RabbitMQ. Both examples are implemented in Erlang and are easily accessible from both worlds.
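For comparison, the orddict form mentioned in the question is easy to generate outside Erlang, because it is just a key-sorted list of tuples. Here is a minimal sketch in Python (not Java, but the logic carries over directly). The helper names are hypothetical, it assumes plain ASCII atom keys and string values, and it does not handle reserved words such as `end` used as keys.

```python
import re

# Unquoted Erlang atoms start with a lowercase letter.
_ATOM_RE = re.compile(r"[a-z][A-Za-z0-9_@]*\Z")


def erl_atom(name):
    # Quote anything that is not a plain unquoted atom.
    if _ATOM_RE.match(name):
        return name
    return "'" + name.replace("\\", "\\\\").replace("'", "\\'") + "'"


def erl_string(text):
    # Escape backslashes and double quotes; other characters pass through.
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


def orddict_term(mapping):
    """Render {atom_name: string} as an orddict literal with sorted keys."""
    pairs = ",".join(
        "{%s,%s}" % (erl_atom(k), erl_string(v))
        for k, v in sorted(mapping.items())
    )
    # The trailing dot is what the Erlang term parser expects.
    return "[" + pairs + "]."


print(orddict_term({"field2": "value2", "field1": "value1"}))
# [{field1,"value1"},{field2,"value2"}].
```

On the Erlang side, note that turning such a string into a term creates atoms for the keys, so it should only be fed trusted input.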

Erlang: binary_to_atom filling up atom table space security issue

I heard that an atom table can fill up in Erlang, leaving the system open for DDoS unless you increase the number of atoms that can be created. It looks like binary_to_existing_atom/2 is the solution to this.
Can anyone explain exactly how binary_to_atom/2 is a security implication and how binary_to_existing_atom/2 solves this problem?
When an atom is first used it is given an internal number and put in an array in the VM. This array has a fixed maximum size (1,048,576 entries by default, adjustable with the +t emulator flag) and can fill up if enough different atoms are used. binary_to_existing_atom will only convert a binary string to an atom which already exists in the array; if it does not exist, the call fails.
If you are converting input data directly to atoms without doing any sanity checks, an external client could send <<"a">>, <<"b">> and so on until the array is full, at which point the VM will crash.
Another way to avoid this is to simply not use binary_to_atom and instead pattern match on different binaries and return the desired atom.
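The difference between the two conversions can be simulated with a toy bounded table in Python. This is only a model under my own assumptions: the class and method names are invented, the limit of 4 is absurdly small on purpose, and a Python exception stands in for what would be a crash of the whole VM.

```python
class AtomTableFull(Exception):
    """Stands in for the VM crash that a real overflow would cause."""


class BoundedAtomTable:
    def __init__(self, limit):
        self.limit = limit
        self._ids = {}

    def to_atom(self, name):
        # Models binary_to_atom: creates the atom if it is missing.
        if name not in self._ids:
            if len(self._ids) >= self.limit:
                raise AtomTableFull("atom table full")
            self._ids[name] = len(self._ids)
        return self._ids[name]

    def to_existing_atom(self, name):
        # Models binary_to_existing_atom: never creates anything.
        if name not in self._ids:
            raise ValueError("badarg: no such atom: %r" % name)
        return self._ids[name]

    def __len__(self):
        return len(self._ids)


def flood(convert, count):
    """Feed `count` distinct attacker-chosen names through `convert`."""
    rejected = 0
    for i in range(count):
        try:
            convert("junk%d" % i)
        except ValueError:
            rejected += 1
    return rejected


safe = BoundedAtomTable(limit=4)
safe.to_atom("ok")
safe.to_atom("error")
rejected = flood(safe.to_existing_atom, 100)
print(len(safe), rejected)  # 2 100

unsafe = BoundedAtomTable(limit=4)
unsafe.to_atom("ok")
unsafe.to_atom("error")
try:
    flood(unsafe.to_atom, 100)
    overflowed = False
except AtomTableFull:
    overflowed = True
print(len(unsafe), overflowed)  # 4 True
```

With the "existing" variant each bad input fails individually and the table never grows. With the creating variant the attacker decides how large the table gets.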
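Calling list_to_atom/1 or binary_to_atom/2 on untrusted input is a very serious bug in Erlang code.
Always wrap the conversion in a helper function like this:
to_atom(X) when is_list(X) ->
    try list_to_existing_atom(X) of
        Atom -> Atom
    catch
        %% Note: this fallback still creates a new atom for unknown input,
        %% so on its own it does not bound the growth of the atom table.
        _Error:_ErrorReason -> list_to_atom(X)
    end.
This way, if the atom already exists in the atom table, the try body returns it without creating anything. The atom is only created the first time this function is called with a given name.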

id values of different variables in python 3

I am able to understand immutability in Python (surprisingly simple too). Let's say I assign a number to a variable:
x = 42
print(id(x))
print(id(42))
On both counts, the value I get is
505494448
My question is, does the Python interpreter allot ids to all the numbers, letters and True/False in memory before the environment loads? If it doesn't, how are the ids kept track of? Or am I looking at this in the wrong way? Can someone explain it please?
What you're seeing is an implementation detail (an internal optimization) called interning. This is a technique (used by implementations of a number of languages including Java and Lua) which aliases names or variables to be references to single object instances where that's possible or feasible.
You should not depend on this behavior. It's not part of the language's formal specification and there are no guarantees that separate literal references to a string or integer will be interned nor that a given set of operations (string or numeric) yielding a given object will be interned against otherwise identical objects.
The CPython implementation keeps the small integers from -5 through 256 as preallocated immutable objects. I suspect that other very high level language run-time libraries are likely to include similar optimizations: small integers are used very frequently by most non-trivial fragments of code.
In terms of how such things are implemented ... for strings and larger integers it would make sense for Python to maintain these as dictionaries. Thus any expression yielding an integer (and perhaps even floats) and strings (at least sufficiently short strings) would be hashed, looked up in the appropriate (internal) object dictionary, added if necessary and then returned as references to the resulting object.
You can do your own similar interning of any sorts of custom object you like by wrapping the instantiation in your own calls to your own class static dictionary.
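A minimal sketch of that do-it-yourself interning might look like the following. The class Color and its `_instances` table are made-up names for illustration. The idea is simply to route construction through a class-level dictionary.

```python
class Color:
    _instances = {}  # class-level table: name -> the single Color object

    def __new__(cls, name):
        # Return the existing instance for this name, creating it on first use.
        try:
            return cls._instances[name]
        except KeyError:
            obj = super().__new__(cls)
            obj.name = name
            cls._instances[name] = obj
            return obj


a = Color("red")
b = Color("red")
c = Color("blue")
print(a is b)          # True
print(a is c)          # False
print(id(a) == id(b))  # True
```

Unlike the interpreter's built-in interning, this behavior is under your control, so code may rely on `a is b` here. Note that the table keeps every instance alive for the life of the class.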

Resources