Are atoms copied from one process to another when I send an atom as a message? My thinking is that since the atom already exists in the VM, it does not have to be copied. I understand that binaries are more efficient when sent from one process to another.
If I am sending a trigger message, a constant message, from one process to another, which is better to use: an atom or a binary?
Use what is the most correct semantically. In general, don't worry about performance unless you have benchmarked and you are certain your code could benefit from optimizations. If you use what is the most correct semantically, it will likely be the fastest anyway.
That said, what is the most correct semantically?
Atoms are useful for tagging or identifying terms that are static, that won't change. So if you want to tell a process to do some work, you could write: send(pid, :do_some_work). Then the other process can easily match on the atom and perform the required work (comparisons on atoms are super fast).
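A minimal sketch of such a trigger message (the :do_some_work tag, the Worker module, and the payload clause are illustrative names, not from the original):

    defmodule Worker do
      # Receive loop that matches on a static atom tag.
      def loop do
        receive do
          :do_some_work ->
            IO.puts("doing the work")
            loop()

          {:process, payload} when is_binary(payload) ->
            # Dynamic content travels as a binary, not as an atom.
            IO.puts("got #{byte_size(payload)} bytes")
            loop()
        end
      end
    end

    pid = spawn(Worker, :loop, [])
    send(pid, :do_some_work)
    send(pid, {:process, "some dynamic content"})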
However, if you are passing dynamic content, then surely you want to use binaries. It would actually be unsafe to convert arbitrary binaries to atoms, and atoms also have a size limit: you can't have an atom that is 1 kilobyte long.
Finally, it is worth pointing out that atoms are by far the fastest to copy: an atom is represented as an integer and takes 1 word, so you are copying just one word around.
Binaries, even though they are shared after a certain size, take at least 3 words.
More information: Why is useful to have a atom type (like in elixir, erlang)?
The Erlang VM efficiency guide: http://www.erlang.org/doc/efficiency_guide/advanced.html
The question is broad; here is my specific context:
I only use term_to_binary to dump terms to binary for storage in PostgreSQL, and read them back with binary_to_term (a round trip sketched below).
I don't use term_to_binary to produce any identifier or to compare data.
My data types (from Elixir) are only map, list, string, number, nil, and boolean (i.e. no functions, no atoms, no structs).
Why not jsonb? It's ridiculously slow; Erlang term <-> binary is much, much faster (more than 10x).
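For context, a minimal sketch of that round trip (the map contents are made up; the binary would be written to and read back from a bytea column):

    # Encode an Elixir term to the external term format before writing it
    # to a bytea column, and decode it again after reading it back.
    term = %{"user" => "mary", "scores" => [1, 2, 3], "active" => true}
    bin = :erlang.term_to_binary(term)

    # ... INSERT bin into PostgreSQL, later SELECT it back as stored ...
    stored = bin

    # The decoded value compares equal to the original term.
    ^term = :erlang.binary_to_term(stored)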
Will binary_to_term always be able to read binaries produced by any previous version of term_to_binary?
Thanks!
You can control it with options, but no hard guarantee is provided. The format has changed over time, but Erlang has always provided an option to read legacy formats for backward compatibility.
More info in erlang docs.
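As an illustration of such options (a sketch; minor_version is a real term_to_binary/2 option that selects between encodings, e.g. the legacy and the newer float representation):

    term = {:ok, 3.14, [1, 2, 3]}

    # Default encoding on the current VM.
    bin_default = :erlang.term_to_binary(term)

    # Force the older minor version (e.g. the legacy float encoding), which
    # very old binary_to_term implementations can still read.
    bin_legacy = :erlang.term_to_binary(term, minor_version: 0)

    # Both decode back to a term equal to the original, even though the
    # bytes of the two binaries may differ.
    ^term = :erlang.binary_to_term(bin_default)
    ^term = :erlang.binary_to_term(bin_legacy)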
On the Support, Compatibility, Deprecations, and Removal page, the external term format is not specifically mentioned. The distribution protocol is, though:
Erlang Distribution
Erlang nodes can communicate across at least two preceding and two subsequent releases.
And since the distribution protocol relies on the external term format, it's probably safe to assume that binary_to_term will be able to read data from at least two major releases back.
Well, the title says it all: I'm wondering why the BEAM doesn't garbage collect atoms. I'm aware of the question How Erlang atoms can be garbage collected but, while related, it doesn't answer the why.
Because that is not possible (or at least very hard) to do in the current design. Atoms are an important part of:
modules, as module names are atoms
function names, which also are atoms
distributed Erlang, which also uses atoms extensively
Especially the last point makes it hard. Imagine for a second that we had a GC for atoms. What would happen if a GC run kicked in in the middle of a distributed call where we send some atoms over the wire? All of that makes atoms quite essential to how the VM works. Making them GCed would not only make the implementation of the VM much more complex, it would also make code slower: today atoms do not need to be copied between processes and, since they aren't GCed, they can be completely omitted from the GC mark step, and both of those advantages would be lost.
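As an aside, you can observe the consequence of atoms never being collected directly (a sketch, assuming OTP 20 or later, where these system_info items exist):

    # Inspect the atom table: atoms are interned once and never collected,
    # so the count only ever grows toward a fixed limit.
    count = :erlang.system_info(:atom_count)
    limit = :erlang.system_info(:atom_limit)
    IO.puts("atoms in use: #{count} / #{limit}")

    # Creating atoms from untrusted input permanently consumes table entries,
    # which is why String.to_existing_atom/1 is preferred over String.to_atom/1
    # for external data.
    _safe = String.to_existing_atom("ok")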
I also want to compare the result of concurrent parsing with serial parsing.
How can I do this? Which is the simplest approach?
As others have said, YACC (based on LR parsing) by itself isn't "concurrent".
(It is my understanding that YACC isn't even re-entrant, which will make it pretty hard to use in a multithreaded context no matter what you do. The non-reentrancy can presumably be overcome with mere sweat, so this is an annoyance, not a show stopper.)
One idea is to construct a pipeline, allowing a lexer to generate lexemes into a stream as fast as it can, and letting the actual parser read from that stream. This can get you at best a factor of 2. You might be able to do this with YACC relatively easily, modulo setting up the communicating threads.
McKeeman et al have implemented parallel LR parsing by dividing the file into N chunks for N threads and claim to have gotten good results. The approach isn't simple, because dividing parsing of a single file into parallel chunks of about the same size, and stitching those chunks together, isn't easy. I doubt that one could easily hack up YACC to do this.
A screwy idea is to parse a file from both ends toward the middle.
It's easy enough to define the backward parser's grammar from the "natural" forward one: just reverse the content of every grammar rule. Nothing is ever that easy, though; this might introduce ambiguities not present in the forward parser. This paper combines McKeeman's idea of breaking the file into chunks with bidirectional parsing of each chunk, enabling one to find a lot of parallelism in a big file.
Easier to do is to parse individual files in parallel, using whatever parsing technology you have. This parallelizes relatively well, although the parsing time for individual files may not be even, so this is best done with some kind of worklist and a team of parser threads taking work from that worklist. You could probably organize this with YACC.
Our DMS Software Reengineering Toolkit (DMS) uses a generalization of this when reading large systems of Java sources and classes. Parsing by itself isn't very useful; you almost always need to build symbol tables, too. DMS reading Java thus parallelizes parsing, AST building, and symbol-table construction.
Starting from a base set of filenames, it launches parses in parallel grains (multiplexed by DMS on top of OS threads); when a parse completes, the parsed tree is handed to the name resolver, which splits into a parallel grain per parallel scope encountered. Nested scopes cause a tree of grains to be generated. Further work is gated by treating resolution of a scope as a future (event); while the scope is being resolved, more Java parse/name-resolution activities may be launched; when a scope is resolved, an event is signalled and grains waiting for scope completion can then inspect the scope contents to resolve their own names. The tangle of (potential) parallelism in the middle of this is almost frightening :-} but it is managed by DMS's underlying parallel programming language, PARLANSE, which uses work-stealing to balance the load across the threads.
Experience shows that doing this in production with 8 cores leads to a 5x speedup over sequential processing for a few thousand source files (which Java typically keeps small). We think we know where the bottleneck is: parts of the name resolver are more expensive than they should be because of excess synchronization in an attribute grammar. I suspect we can get this closer to 8x. So, parallel parsing and name resolution can be pretty effective.
We don't do quite as well with C++14, because of the dependencies of individual files on the #includes they read, often in different preprocessor configurations.
If the input allows splitting into chunks (e.g. log file lines) which can be parsed in parallel, then you can parse a certain number of such chunks in parallel using a producer/consumer queue and then join the parse results.
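A minimal sketch of that producer/consumer idea in Elixir (assuming each line can be parsed independently; parse_line and the file path are placeholder names):

    # Parse independent chunks (here: log file lines) in parallel and join
    # the results. parse_line is a placeholder for your per-chunk parser.
    defmodule ParallelParse do
      def parse_file(path, parse_line) do
        path
        |> File.stream!()
        |> Task.async_stream(parse_line, max_concurrency: System.schedulers_online())
        |> Enum.map(fn {:ok, result} -> result end)
      end
    end

    # Example: split each line on spaces, using all available cores.
    # results = ParallelParse.parse_file("app.log", &String.split(&1, " "))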
We have a few clusters in our infrastructure that are not connected via Erlang distribution, and we currently use term_to_binary to encode Erlang terms for messages between the clusters. On the receiving side we use binary_to_term(Bin, [safe]) so that only existing atoms are converted (should there be any in the message).
Occasionally (especially after starting a new cluster/stack), we run into the problem that a message contains atoms that are only partially known, i.e. the sending cluster knows an atom but the receiving one does not. This can happen for various reasons; the most common is that the receiving node simply has not loaded a module containing some record definition. We currently employ some nasty workarounds which basically amount to maintaining a short-ish list of potentially used atoms, but we're not quite happy with this error-prone approach.
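To illustrate the failure mode (a sketch with a hand-crafted binary; the atom name is made up): binary_to_term/2 with the safe option refuses to create atoms that don't already exist on the decoding node.

    # Hand-craft an external-term-format binary that references an atom this
    # node has (almost certainly) never seen: 131 is the format version,
    # 100 is the legacy ATOM_EXT tag, followed by a 16-bit length and the name.
    name = "atom_unknown_to_this_node_#{:rand.uniform(1_000_000)}"
    bin = <<131, 100, byte_size(name)::16, name::binary>>

    # Plain binary_to_term/1 would happily create the atom; with :safe the
    # decode raises instead, which is the error described above.
    try do
      :erlang.binary_to_term(bin, [:safe])
    rescue
      ArgumentError -> IO.puts("message contains an unknown atom, rejecting")
    end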
Is there a smart way to share atoms between these clusters? Or is it recommended to not use the binary format for such purposes?
Looking forward to your insights.
I would think hard about why nodes outside your Erlang distribution are sending atom values in the first place. Most likely there is some adjustment that can be made to the protocol being used to communicate -- or, most often, there is simply no real protocol defined and the actual protocol in use evolved organically over time.
Not knowing any details of the situation, there are two solutions to this:
Go deep and use an abstract serialization technique like ASN.1 or JSON or whatever, using binary strings instead of atoms. This makes the most sense when you have a largish set of well understood, structured data to send (which may wrap unstructured or opaque data).
Remain shallow and instead write a functional API interface for the processes/modules you will be sending to/calling first, to make sure you fully understand what your protocol actually is, and then back that up by making each interface call correspond to a matching network message which, when received, dispatches the same procedures an API function call would have.
The basic problem is the idea of nodes outside the Erlang distribution being able to generate atoms that the cluster may not be aware of. This is a somewhat sticky problem. In many cases, where you are using atoms you can instead use binaries to similar effect and retain the same semantics without confusing the runtime. It's the difference between {<<"new_message">>, Data} and {new_message, Data}; matching within a function head works the same way, just slightly more noisily syntactically.
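A sketch of that difference (the message names are illustrative): matching on a binary tag in a function head works just like matching on an atom tag, without the risk of unknown atoms.

    defmodule Handler do
      # Atom-tagged message: concise, but the tag must already exist as an atom.
      def handle({:new_message, data}), do: {:ok, data}

      # Binary-tagged message: slightly noisier, but any tag arrives safely
      # without touching the atom table.
      def handle({<<"new_message">>, data}), do: {:ok, data}

      def handle(_other), do: :ignored
    end

    Handler.handle({:new_message, "hi"})       #=> {:ok, "hi"}
    Handler.handle({<<"new_message">>, "hi"})  #=> {:ok, "hi"}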
The Erlang external term format has changed at least once (but this change appears to predate the history stored in the Erlang/OTP github repository); clearly, it could change in the future.
However, as a practical matter, is it generally considered safe to assume that this format is stable now? By "stable," I mean specifically that, for any term T, term_to_binary will return the same binary in any current or future version of Erlang (not merely whether it will return a binary that binary_to_term will convert back to a term identical to T). I'm interested in this property because I'd like to store hashes of arbitrary Erlang terms on disk and I want identical terms to have the same hash value now and in the future.
If it isn't safe to assume that the term format is stable, what are people using for efficient and stable term serialization?
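For concreteness, this is the kind of usage in question (a sketch; whether the stored hash remains valid across OTP releases is exactly what is being asked):

    # Hash an arbitrary term by hashing its external-term-format encoding.
    # This is only stable across OTP releases if term_to_binary/1 stays
    # byte-stable for the terms involved, which is what is in doubt here.
    term = %{"id" => 42, "tags" => ["a", "b"]}
    hash = :crypto.hash(:sha256, :erlang.term_to_binary(term))
    Base.encode16(hash, case: :lower)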
It's been stated that Erlang will provide compatibility for at least 2 major releases. That would mean that BEAM files, the distribution protocol, the external term format, etc. from R14 will at least work up to R16.
"We have as a strategy to at least support backwards compatibility 2 major releases back in time."
"In general, we only break backward compatibility in major releases
and only for a very good reason and usually after first deprecating
the feature one or two releases beforehand."
erlang:phash2 is guaranteed to be a stable hash of an Erlang term.
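For example (a sketch; phash2/2 takes an optional range argument that caps the result):

    term = %{"id" => 42, "tags" => ["a", "b"]}

    # Same term, same hash, on any node and across releases (per the docs).
    h1 = :erlang.phash2(term)

    # An explicit range caps the result at range - 1 (here a 32-bit space).
    h2 = :erlang.phash2(term, 4_294_967_296)

    IO.inspect({h1, h2})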
I don't think OTP makes the guarantee that term_to_binary(T) in vX =:= term_to_binary(T) in vY. Lots of things could change if new term codes are introduced for optimized representations of things, or if we need to add unicode strings to the ETF, or in the vanishingly unlikely future in which we introduce a new fundamental datatype. For an example of a change that has happened in the external representation only (stored terms compare equal, but are not byte-equal), see float_ext vs. new_float_ext.
In practical terms, if you stick to atoms, lists, tuples, integers, floats and binaries, then you're probably safe with term_to_binary for quite some time. If the time comes that their ETF representation changes, then you can always write your own version of term_to_binary that doesn't change with the ETF.
For data serialization, I usually choose between Google Protocol Buffers and JSON. Both of them are very stable. For working with these formats from Erlang I use Piqi, Erlson and mochijson2.
A big advantage of Protobuf and JSON is that they can be used from other programming languages by design, whereas the Erlang external term format is more or less specific to Erlang.
Note that JSON string representation is implementation-dependent (escaped characters, floating point precision, whitespace, etc.) and for this reason it may not be suitable for your use-case.
Protobuf is less straightforward to work with compared to schemaless formats but it is a very well-designed and powerful tool.
Here are a couple of other schemaless binary serialization formats to consider. I don't know how stable they are. It may turn out that Erlang external term format is more stable.
https://github.com/uwiger/sext
https://github.com/TonyGen/bson-erlang