Erlang: binary_to_atom filling up atom table space security issue

I heard that an atom table can fill up in Erlang, leaving the system open for DDoS unless you increase the number of atoms that can be created. It looks like binary_to_existing_atom/2 is the solution to this.
Can anyone explain exactly why binary_to_atom/2 has security implications and how binary_to_existing_atom/2 solves this problem?

When an atom is first used, it is given an internal number and stored in a table in the VM. This table has a fixed maximum size (1,048,576 entries by default, adjustable with the +t emulator flag) and can fill up if enough different atoms are used. binary_to_existing_atom/2 only converts a binary to an atom that already exists in the table; if the atom does not exist, the call fails.
If you convert input data directly to atoms without any sanity checks, an external client can keep sending distinct binaries (<<"a">>, <<"b">>, and so on) until the table is full, at which point the VM will crash.
Another way to avoid this is to simply not use binary_to_atom and instead pattern match on different binaries and return the desired atom.
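The pattern-matching approach can be sketched as follows. The module name safe_atom, the function parse_status/1 and the three status values are hypothetical, chosen only for illustration:

```erlang
-module(safe_atom).
-export([parse_status/1]).

%% Map a fixed whitelist of binaries to atoms. No atom is created at run
%% time: every atom below already exists once this module is loaded.
parse_status(<<"ok">>)      -> {ok, ok};
parse_status(<<"error">>)   -> {ok, error};
parse_status(<<"pending">>) -> {ok, pending};
%% Anything else is rejected and stays a binary.
parse_status(Other) when is_binary(Other) ->
    {error, {unknown_status, Other}}.
```

Unknown input never reaches the atom table, so the number of atoms this code can introduce is fixed at compile time.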

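Calling list_to_atom/1 or binary_to_atom/2 on untrusted input is a very serious bug in Erlang code.
A wrapper function like this is sometimes suggested:
to_atom(X) when is_list(X) ->
    try list_to_existing_atom(X) of
        Atom -> Atom
    catch
        error:badarg -> list_to_atom(X)
    end.
If the atom already exists in the atom table, the try body returns it without creating anything. Note, however, that the catch clause still creates a new atom for every previously unseen string, so this function behaves just like list_to_atom/1 and does not limit atom table growth. To protect against untrusted input, let list_to_existing_atom/1 fail (or return an error) instead of falling back to list_to_atom/1.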
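If the goal is to keep untrusted input from growing the atom table, a stricter variant returns an error instead of falling back to list_to_atom/1. This is a minimal sketch; the module name atom_guard and the {error, unknown_atom} return shape are assumptions made for illustration:

```erlang
-module(atom_guard).
-export([to_existing_atom/1]).

%% Convert a binary to an atom only if that atom already exists.
%% Never adds an entry to the atom table.
to_existing_atom(Bin) when is_binary(Bin) ->
    try binary_to_existing_atom(Bin, utf8) of
        Atom -> {ok, Atom}
    catch
        %% binary_to_existing_atom/2 raises badarg for unknown atoms
        error:badarg -> {error, unknown_atom}
    end.
```

The caller decides what to do with {error, unknown_atom}, for example rejecting the request.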

Related

What are the steps in doing incrementation in erlang?

increment([]) -> [];
increment([H|T]) -> [H+1|increment(T)].
decrement([]) -> [];
decrement([H|T]) -> [H-1|decrement(T)].
So I have this code, but I don't understand how it works compared to how the same thing would work in Java.
Java and Erlang are different beasts. I don't recommend trying to make comparisons to Java when learning Erlang, especially if Java is the only language you know so far. The code you've posted is a good example of the paradigm known as "functional programming". I'd suggest doing some reading on that subject to help you understand what's going on. To try to break this down as far as Erlang goes, you need to understand that an Erlang function is completely different from a Java method.
In Java, your method signature is composed of the method name and the types of its arguments. The return type can also be significant. A Java increment method like the function you wrote might be written like List<Integer> increment(List<Integer> input). The body of the Java method would probably iterate through the list an element at a time and set each element to itself plus one:
List<Integer> increment(List<Integer> input) {
    for (int i = 0; i < input.size(); i++) {
        input.set(i, input.get(i) + 1);
    }
    return input;
}
Erlang has almost nothing in common with this. To begin with, an Erlang function's "signature" is the name and arity of the function. Arity means how many arguments the function accepts. So your increment function is known as increment/1, and that's its unique signature. The way you write the argument list inside the parentheses after the function name has less to do with argument types than with the pattern of the data passed to it. A function like increment([]) -> ... can only successfully be called by passing it [], the empty list. Likewise, the function increment([Item]) -> ... can only be successfully called by passing it a list with one item in it, and increment([Item1, Item2]) -> ... must be passed a list with two items in it.
This concept of matching data to patterns is quite aptly known as "pattern matching", and you'll find it in many functional languages. In Erlang functions, it's used to select which head of the function to execute. This bears a rough similarity to Java's method overloading, where you can have many methods with the same name but different argument types; however, a pattern in an Erlang function head can also bind variables to different pieces of the arguments that match the pattern.
In your code example, the function increment/1 has two heads. The first head is executed only if you pass an empty list to the function. The second head is executed only if you pass a non-empty list to the function. When that happens, two variables, H and T, are bound. H is bound to the first item of the list, and T is bound to the rest of the list, meaning all but the first item. That's because the pattern [H|T] matches a non-empty list, including a list with one element, in which case T would be bound to the empty list. The variables thus bound can be used in the body of the function.
The bodies of your functions are a very typical form of iterating a list in Erlang to produce a new list. It's typical because of another important difference from Java, which is that Erlang data is immutable. That means there's no such concept as "setting an element of a list" like I did in the Java code above. If you want to change a list, you have to build a new one, which is what your code does. It effectively says:
- The result of incrementing the empty list is the empty list.
- The result of incrementing a non-empty list is:
    1. Take the first element of the list: H.
    2. Increment the rest of the list: increment(T).
    3. Prepend H+1 to the result of incrementing the rest of the list.
Note that you want to be careful about how you build lists in Erlang, or you can end up wasting a lot of resources. The List Handling User's Guide is a good place to learn about that. Also note that this code uses a concept known as "recursion", meaning that the function calls itself. In many popular languages, including Java, recursion is of limited usefulness because each new function call adds a stack frame, and your available memory space for stack frames is relatively limited. Erlang and many functional languages support a thing known as "tail call elimination", which is a feature that allows properly written code to recurse indefinitely without exhausting any resources.
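Note that the increment/1 from the question is not itself tail recursive, because the cons [H+1|...] still has to happen after the recursive call returns. A tail-recursive variant, sketched here with a hypothetical module name incr and an accumulator argument that is not in the original code, could look like this:

```erlang
-module(incr).
-export([increment/1]).

%% Public entry point: start with an empty accumulator.
increment(List) -> increment(List, []).

%% The recursive call is the last thing each clause does, so the
%% VM can reuse the current stack frame (tail call elimination).
increment([], Acc) ->
    %% The accumulator was built back to front; restore the order.
    lists:reverse(Acc);
increment([H|T], Acc) ->
    increment(T, [H + 1 | Acc]).
```

Both versions return the same result. The body-recursive form in the question is perfectly acceptable for lists; the accumulator form simply shows what "properly written" means in the context of tail calls.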
Hopefully this helps explain things. If you can ask a more specific question, you might get a better answer.

How do Erlang atoms work?

Trying to find documentation on the details, I did not find much beyond:
- There is one atom table per Erlang runtime instance.
- An atom's string literal is only stored once.
- Atoms take 1 word.
To me, this leaves a lot of things unclear.
Is the atom word value always the same, independent of the sequence modules are loaded into a runtime instance? If modules A and B both define/reference some atoms, will the value of the atom change from session to session, depending on whether A or B was loaded first?
When matching for an atom inside a module, is there some "atom literal to atom value" resolution taking place? Do modules have some own module-local atom-value-lookup table, which gets filled in at load-time of a module?
In a distributed scenario where 2 erlang runtime instances communicate with each other. Is there some "sync-atom-tables" action going on? Or do atoms get serialized as string literals, instead of as words?
An atom is simply an ID maintained by the VM. The ID is represented as a machine integer of the underlying architecture, e.g. 4 bytes on 32-bit systems and 8 bytes on 64-bit systems. See the usage in the LYSE book.
The same atom in the same running VM is always mapped to the same ID (integer). For example the following tuple:
{apple, pear, cherry, apple}
could be stored as the following tuple in the actual Erlang memory:
{1, 2, 3, 1}
All atoms are stored in one big table which is never garbage-collected, i.e. once an atom is created in a running VM it stays in the table until the VM is shut down.
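How full that table currently is can be checked from inside a running node. The sketch below uses erlang:system_info/1 with the atom_count and atom_limit items (available since OTP 20); the module name atom_stats is hypothetical:

```erlang
-module(atom_stats).
-export([usage/0]).

%% Return {AtomsCurrentlyInTable, MaximumAllowedAtoms} for this VM.
usage() ->
    Count = erlang:system_info(atom_count),
    Limit = erlang:system_info(atom_limit),
    {Count, Limit}.
```

Because the table is never garbage-collected, the first number only ever grows during the lifetime of the VM.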
Answering your questions:
1. No. The ID of the atom will change between VM runs. If you shut down the VM and reload the tuple above, the system might end up with the following IDs:
{50, 51, 52, 50}
depending on what other atoms have been created before it was loaded. Atoms only live as long as the VM.
2. No. There is only one table of atoms per VM. All literal atoms in the module are mapped to their IDs when the module is loaded. If a particular atom doesn't yet exist in that table, it is inserted and stays there until the VM restarts.
3. No. Atom tables are per VM and they are separate. Consider a situation where two VMs are started at the same time but don't know of each other. Atoms created in each VM may have different IDs in the table. If at some point one node learns about the other, the same atom may still have a different ID on each node. The tables can't be easily synchronized or merged. But atoms aren't simply sent as text representations to the other node either. They are "compressed" into a form of cache and sent all together in the header. See the distribution header in the description of the communication protocol. Basically, the header contains the atoms used in later terms with their IDs and textual representation. Then each term references the atom by the ID specified in the header rather than passing the same text each time.
To get really basic without going into implementation, an atom is a literal "thing" with a name. Its value is always itself and it knows its own name. You generally use it when you want the tag, like the atoms ok and error. Atoms are unique in the sense that there is only one atom foo in the system, and each time I refer to foo, I am referring to this same unique foo irrespective of whether they are in the same module, or whether they come from the same process. There is always only one foo.
A bit of implementation. Atoms are stored in a global atom table, and when you create a new atom, it is inserted into the table if it is not already there. This makes comparing atoms for equality very fast as you just check if the two atoms refer to the same slot in the atom table.
While separate instances of the VM, nodes, have separate atom tables, the communication between the nodes in distributed Erlang is optimised for this, so very often you don't need to send the actual atom name between nodes.
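That atoms cross VM boundaries by name rather than by table index can be seen in the external term format produced by term_to_binary/1. This is only a sketch of that one point: it does not show the distribution protocol's atom cache, and the module name atom_wire is hypothetical. The carries_name/1 check assumes an atom with an ASCII name:

```erlang
-module(atom_wire).
-export([roundtrip/1, carries_name/1]).

%% Encode an atom to the external term format and decode it again.
%% The decoded atom is looked up (or created) by name in the local table.
roundtrip(Atom) when is_atom(Atom) ->
    binary_to_term(term_to_binary(Atom)).

%% True if the atom's name appears verbatim inside the encoded binary,
%% i.e. the encoding carries the text, not a VM-specific ID.
carries_name(Atom) when is_atom(Atom) ->
    Name = atom_to_binary(Atom, utf8),
    binary:match(term_to_binary(Atom), Name) =/= nomatch.
```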

Listing available records available to a process in Erlang

Records are compile time structures. The record_info and is_record recognise the compiled records and their structures. Is there a way to ask the VM what records have been defined that are available to the process? I am interested in getting the internal tuple representation for every record definition.
What I want to do is something like:
-record(car,{make=honda}).
get_record(Car) ->
    %% Some magic here to end up with something like
    %% {car,{make,honda}} or, even better, #car{} when Car = 'car'
As you said records are only a compile time construct, so once compiled records are only tuples, this would suggest no available information is left during runtime, but since you mentioned those two functions I was curious and I checked how they worked.
According to this record_info/2 is a pseudo function made available only during compilation, so it doesn't need any run time information on records.
On the other hand the description of is_record(Term, RecordTag) states that this BIF (built-in function) only "returns true if Term is a tuple and its first element is RecordTag, false otherwise", so it is actually only checking the structure and first element of the tuple.
Based on this, I would guess that there is no record information made available during runtime. This thread confirms the unavailability of record_info/2 during runtime.
I have used Dynarec (https://github.com/dieswaytoofast/dynarec.git) successfully in a data mapping module for one of the apps I am currently working on. It is a parse transform, though, not a run-time VM tool. It compiles information on each defined record, as well as information about the fields for each record. In my case, I use it to dynamically map incoming data to record data. This module may get you what you need. YMMV. Good luck.
As others have said, records are purely compile time and there is no runtime information about records. Erlang just sees tuples. For example, the record_info/2 pseudo functions are expanded to data at compile time: a list of atoms for the fields argument and an integer for the size argument.
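A small sketch of what the compiler produces for the car record from the question. The module name car_rec and the function info/0 are hypothetical:

```erlang
-module(car_rec).
-export([info/0]).

-record(car, {make = honda}).

%% record_info/2 is expanded by the compiler, so both arguments must be
%% literal atoms; it cannot be called with a variable record name.
info() ->
    Fields = record_info(fields, car),  % list of field names
    Size   = record_info(size, car),    % number of fields + 1 (the tag)
    Value  = #car{},                    % at run time this is a plain tuple
    {Fields, Size, Value}.
```

Calling car_rec:info() returns {[make], 2, {car, honda}}: the record value is just a tagged tuple, and nothing about its field names survives except where record_info/2 was expanded at compile time.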

What is the process for saving erlang values to a file and loading them back?

For example, I have a list containing a lot of other Erlang types that I want to save to a file. Then I want to load it back into a process. What would I use? io_lib:format("~P", [Term]) with io:write and then file:consult?
Yes. Note that you need a trailing dot for each term, and that file:consult returns a list of all dot-terminated terms in the file. So if you only have one term, the code would look like:
ok = file:write_file("myfile", io_lib:format("~p.~n", [Term])),
{ok, [Term]} = file:consult("myfile").
As an alternative to legoscia's solution, you can also write the result of erlang:term_to_binary/1 to a file and read it back with erlang:binary_to_term/1. There are a few caveats with this approach, though:
- The file will not be human-readable (at least not easily).
- You can't store multiple terms easily because erlang:term_to_binary/1 can produce null characters and newlines, which can create problems with parsing. There are a few ways to get around this, though:
    - base64 encode the terms and separate them by newlines
    - store your terms inside another term. For instance, if you have three terms you want to store, use erlang:term_to_binary({T1, T2, T3})
- There's no handy file:consult equivalent for term_to_binary, so you have to explicitly read the file (as a binary) and then run binary_to_term.
So why would you bother with erlang:term_to_binary/1 at all? Two reasons:
- Space efficiency (in most cases)
- Parsing speed (faster to parse term_to_binary output than a human-readable term)
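A minimal sketch of the term_to_binary approach for a single term. The module name term_file and the functions save/2 and load/1 are hypothetical helpers, not part of OTP:

```erlang
-module(term_file).
-export([save/2, load/1]).

%% Serialize Term to the external term format and write it to Path.
%% Returns ok or {error, Reason} from file:write_file/2.
save(Path, Term) ->
    file:write_file(Path, term_to_binary(Term)).

%% Read the whole file as a binary and decode it back into a term.
load(Path) ->
    case file:read_file(Path) of
        {ok, Bin}       -> {ok, binary_to_term(Bin)};
        {error, Reason} -> {error, Reason}
    end.
```

If the file might come from an untrusted source, binary_to_term(Bin, [safe]) is worth considering, since it refuses to create new atoms while decoding.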

F# interactive development

Coming from a Matlab and R background where the development process is very interactive (select, run selection, fix, select, run selection, fix, etc), I'm trying to figure out how F# handles this style of development, which seems pretty important in scientific applications. Here are few things that just immediately come to mind to somebody new to F#:
Selecting multiple lines gives different results than one line at a time.
let add x y = x + y
add 4.1 2.3
Selecting both lines results in float -> float -> float, whereas selecting only the first line results in int -> int -> int. More generally, Matlab/R users are used to results printing out after each statement, not at the end.
Shadow copying can become burdensome.
let file = open2GBfile("file.txt")
process file
If you run this interactively over and over again, the 2GB file is shadow copied and you will quickly run out of memory. Making file mutable doesn't seem like the appropriate solution, since the final run of the program will never change it.
Given these issues, is it impossible for an fsi.exe-based system to support Matlab/R style interactive development?
[Edit: I am guessing about 2. Do objects get marked for deletion as soon as they are shadowed?]
I wouldn't expect F# to be a drop-in replacement for Matlab/R, because unlike them, F# is a general purpose programming language. Not everything you need for a specific type of work will be in the standard libraries. But that doesn't mean that the "interactive development" you describe is impossible, it may just require some effort up-front to build the library functions you depend on.
For #1, as was mentioned earlier, adding type annotations is unfortunately necessary in some cases, but also the inline keyword and "hat-types" can give you duck-typing.
For #2, I'm not clear on what your open and process functions do, versus what you want them to do. For example, the open function could:
- Read the entire file at once, return the data as an array/list/etc, and then close the file.
- Return a FileStream object, which you're calling process on but forget to close.
- Return a sequence expression so you can lazily iterate over the file contents.
- Memoize the result of one of the above, so that subsequent calls just return the cached result.
- One of the gazillion other ways to create an abstraction over file access.
Some of these are better suited for your task than others. Compared to Matlab & R, a general purpose language like F# gives you more ways to shoot yourself in the foot. But that's because it gives you more ways to do everything.
To #1:
In FSI, you'll have to type ;; at the end of each statement and get the results directly:
> 1 + 2;;
val it : int = 3
Generally, an F# code file should be seen as a collection of individual functions that you call and evaluate interactively, not as a series of steps that produce values to be shown.
To #2:
This seems to be a problem of the code itself: Make file a function, so the reading/copying is only done when and where really needed (otherwise the let binding would be evaluated in the beginning).

Resources