Sharing atoms between HTTP-connected Erlang clusters

We have a few non-Erlang-connected clusters in our infrastructure and currently use term_to_binary to encode Erlang terms for messages between the clusters. On the receiving side we use binary_to_term(Bin, [safe]) to only convert to existing atoms (should there be any in the message).
Occasionally (especially after starting a new cluster/stack), we run into the problem that a message contains atoms that are only partially known, i.e. the sending cluster knows the atom but the receiving one does not. This can happen for various reasons, the most common being that the receiving node simply has not yet loaded a module containing some record definition. We currently employ some nasty work-arounds which basically amount to maintaining a short-ish list of potentially used atoms, but we're not quite happy with this error-prone approach.
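For illustration, this is roughly how the failure shows up (a sketch; the atom name is made up):

    %% On the sending cluster, where the atom already exists:
    Bin = term_to_binary({some_record_tag, 42}).

    %% On a receiving node whose atom table has never seen some_record_tag,
    %% the safe decode refuses to create the atom and raises badarg:
    binary_to_term(Bin, [safe]).
    %% ** exception error: bad argument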
Is there a smart way to share atoms between these clusters? Or is it recommended to not use the binary format for such purposes?
Looking forward to your insights.

I would think hard about why non-Erlang nodes are sending atom values in the first place. Most likely there is some adjustment that can be made to the protocol being used to communicate -- or most often there is simply not a real protocol defined and the actual protocol in use evolved organically over time.
Without knowing the details of the situation, I see two solutions to this:
Go deep and use an abstract serialization technique like ASN.1 or JSON or whatever, using binary strings instead of atoms. This makes the most sense when you have a largish set of well-understood, structured data to send (which may wrap unstructured or opaque data).
Remain shallow and instead write a functional API for the processes/modules you will be sending to or calling. Do that first, to make sure you fully understand what your protocol actually is, and then back it up by making each interface call correspond to a matching network message which, when received, dispatches the same procedures an API function call would have.
The basic problem is the idea of non-Erlang nodes being able to generate atoms that the cluster may not be aware of. This is a somewhat sticky problem. In many of the places where you are using atoms you can instead use binaries to similar effect and retain the same semantics without confusing the runtime. It's the difference between {<<"new_message">>, Data} and {new_message, Data}; matching within a function head works the same way, just slightly noisier syntactically.
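A minimal sketch of that difference (the handler and helper names here are made up):

    %% handlers keyed on binary tags; nothing has to pre-exist in the
    %% receiving node's atom table for these messages to be decoded
    handle({<<"new_message">>, Data}) -> store_message(Data);
    handle({<<"ping">>, From})        -> reply(From, <<"pong">>).

    %% the atom-tagged equivalent; decoding a message that carries these
    %% atoms with binary_to_term(Bin, [safe]) fails on a node where the
    %% atoms do not already exist
    handle2({new_message, Data}) -> store_message(Data);
    handle2({ping, From})        -> reply(From, pong).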

Related

How is IOTA on the Tangle quantum proof?

I do understand the Tangle has a graph-based data structure, i.e. it forms a directed acyclic graph. It is not a Merkle tree like a typical blockchain. But I could not figure out whether this structure makes it quantum proof or not. Are no mining and peer verification enough to make a distributed ledger quantum proof?
I asked a very similar thing here https://bitcoin.stackexchange.com/questions/55202/iota-quantum-resistance
The way the ledger is organized (a linked list as in a blockchain, or a DAG as in the Tangle) has no impact for sure. There is still some sort of PoW (when you submit a new transaction) but that is also irrelevant.
Basically, with a quantum computer, cryptographic one-way hash functions (like SHA-2, SHA-3, BLAKE2) are still OK with a few caveats, and the same goes for block ciphers (like AES). Traditional public-key cryptography (RSA, DSA, Diffie-Hellman and the elliptic-curve versions) is however NOT secure anymore. So you can't have signatures (which are a quite necessary thing for cryptocurrencies). There are some complicated workaround constructions, but the simplest is one based on hash functions (Lamport OTS). More references are in my question. Note that I still don't know how exactly IOTA does this. Basically I got stuck at reading about their Curl hash function.
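To make the hash-based idea concrete, here is a toy Lamport one-time signature sketch (written in Erlang purely for illustration; this is NOT how IOTA actually does it):

    -module(lamport).
    -export([keypair/0, sign/2, verify/3]).

    %% private key: 256 pairs of random 32-byte secrets;
    %% public key: the SHA-256 hash of each secret
    keypair() ->
        Priv = [{crypto:strong_rand_bytes(32), crypto:strong_rand_bytes(32)}
                || _ <- lists:seq(1, 256)],
        Pub  = [{crypto:hash(sha256, A), crypto:hash(sha256, B)} || {A, B} <- Priv],
        {Priv, Pub}.

    %% signing reveals one secret per message-digest bit; the key must
    %% therefore never be reused (hence "one-time signature")
    sign(Msg, Priv) ->
        Bits = [Bit || <<Bit:1>> <= crypto:hash(sha256, Msg)],
        [case Bit of 0 -> SecA; 1 -> SecB end
         || {Bit, {SecA, SecB}} <- lists:zip(Bits, Priv)].

    verify(Msg, Sig, Pub) ->
        Bits = [Bit || <<Bit:1>> <= crypto:hash(sha256, Msg)],
        lists:all(fun({Bit, Secret, {HashA, HashB}}) ->
                      Expected = case Bit of 0 -> HashA; 1 -> HashB end,
                      crypto:hash(sha256, Secret) =:= Expected
                  end, lists:zip3(Bits, Sig, Pub)).

The security of such a scheme rests only on the hash function being one-way, which is exactly the property that (with caveats) survives quantum attacks.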

Custom PCollectionView

There have been a number of times when I wanted to create a custom PCollectionView. Is this possible? For now, the only workaround I have is to create a PTransform, return a PCollection, and then apply a PCollectionView.asSingleton() transform, but I've noticed (at least several months ago) that this is much slower than running a native PCollectionView transform, such as View.AsList(). And since I'll be calling this PCollectionView method millions of times, it makes a difference if it takes a few milliseconds vs say a second.
How do you want to view the contents of your PCollection? The answer to this question will determine how you should approach things.
Cloud Dataflow (more generally, any Apache Beam backend) has a few ways that it will materialize your PCollection to allow you to efficiently access it as a side input. So list, singleton, map, and multimap are each pretty efficient for their usual access patterns (iteration, key lookup, etc). The architecture of Dataflow (now Beam) is such that you can define custom views, but if it requires a new access pattern then it will require backend support to be efficient.
Also, you might care to know that after the first access to a singleton side input, the value will usually be cached.

Sending messages among Erlang processes: Atoms vs Binaries

Are atoms copied from one process to another when I send an atom as a message? My thinking is that since the atom already exists in the VM, it does not have to make a copy. I understand that binaries are more efficient when sending from one process to another.
If I am sending a trigger message, a constant message from one process to another, which is better to use: an atom or a binary?
Use what is the most correct semantically. In general, don't worry about performance unless you have benchmarked and you are certain your code could benefit from optimizations. If you use what is the most correct semantically, it will likely be the fastest anyway.
That said, what is the most correct semantically?
Atoms are useful for tagging or identifying terms that are static, that won't change. So if you want to tell a process to do some work, you could write: send(pid, :do_some_work). Then the other process can easily match on the atom and perform the required work (comparisons with atoms are super fast).
However, if you are passing dynamic content, then surely you want to use binaries. It would actually be unsafe to convert arbitrary binaries to atoms, and atoms also have a size limit; you can't have an atom that is 1 kilobyte long.
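As a rough sketch of both cases in Erlang terms (the worker loop and helper functions are made up):

    %% a worker triggered by a static atom, fed dynamic content as binaries
    loop(State) ->
        receive
            do_some_work ->                        %% static trigger: an atom
                loop(do_work(State));
            {text, Bin} when is_binary(Bin) ->     %% dynamic content: a binary
                loop(handle_text(Bin, State))
        end.

    %% elsewhere, sending the trigger:
    trigger(WorkerPid) ->
        WorkerPid ! do_some_work.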
Finally, it is worth pointing out that atoms are by far the fastest to copy. Atoms are represented as integers and take 1 word, so you are copying just 1 word around.
Binaries, even though they are shared after a certain size, take at least 3 words.
More information: Why is useful to have a atom type (like in elixir, erlang)?
The Erlang VM efficiency guide: http://www.erlang.org/doc/efficiency_guide/advanced.html

Erlang OTP I/O - A few questions

I have read that one of Erlang's biggest adopters is the telecom industry. I'm assuming that they use it to send binary data between their nodes and provide for easy redundancy, efficiency, and parallelism.
Does Erlang actually send just the binary to a central node?
Is it directly responsible for parsing the binary data into actual voice? Or is it fed to another language/program via ports?
Is it responsible for the speed of a telephone call, speed as in the delay between me saying something and you hearing it?
Is it possible that Erlang is used solely for the ease of parallel behavior, and C++ or similar for processing speed in sequential functions?
I can only guess at how things are implemented in actual telecom switches, but I can recommend an approach to take:
First, you implement everything in Erlang, including much of the low-level stuff. This probably won't scale that much since signal processing is very costly. As a prototype however, it works and you can make calls and whatnot.
Second, you decide on what to do with the performance bottlenecks. You can push them to C(++) and get a factor of roughly 10 or you can push them to an FPGA and get a factor of roughly 100. Finally you can do CMOS work and get a factor of 1000. The price of the latter approach is also much steeper, so you decide what you need and go buy that.
Erlang remains in control of the control backplane, in the sense of what happens when you push buttons, the call setup, and so on. But once a call has been allocated, we hand over the channel to the lower layer. ATM switching is easier here because once the connection is set you don't need to change it (ATM is connection-oriented, IP is packet-oriented).
Erlang's distribution features are there primarily for providing redundancy in the control backplane. That is, we synchronize tables of call setups and so on between multiple nodes to facilitate node takeover in case of hardware failure.
The trick is to use ports and NIFs post-prototype to speed up the slower parts of the program.
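A rough sketch of the port half of that (the external program and its framing are assumptions, not anything standard):

    %% talk to an external codec over a port, using 4-byte length-prefixed packets
    start_codec() ->
        open_port({spawn_executable, "/usr/local/bin/my_codec"},
                  [{packet, 4}, binary, exit_status]).

    %% assumes it is called by the process that opened the port,
    %% since port replies are delivered to the port owner
    encode_frame(Port, Frame) when is_binary(Frame) ->
        true = port_command(Port, Frame),
        receive
            {Port, {data, Encoded}}       -> {ok, Encoded};
            {Port, {exit_status, Status}} -> {error, {codec_died, Status}}
        after 5000 ->
            {error, timeout}
        end.

A NIF would instead link the C code into the VM itself, which avoids the message round-trip but takes the whole node down if the C code crashes.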

serialization/deserialization vs raw data buffer

Greetings,
I am working on a distributed pub-sub system expected to have minimal latency. I now have to choose between using serialization/deserialization and a raw data buffer. I prefer the raw data approach as there's almost no overhead, which keeps the latency low. But my colleague argues that I should use marshaling, as the parser will be less complex and less buggy. I stated my concern about latency, but he said it's gonna be worth it in the long run and there are even FPGA devices to accelerate it.
What's your opinions on this?
TIA!
Using a 'raw data' approach, hardcoded in one language for one platform, causes a few problems when you try to write code on another platform in another language (or sometimes even for the same language/platform with a different compiler, if your fields don't have natural alignment).
I recommend using an IDL to describe your message formats. If you pick one that is reducible to 'raw data' in your language of choice, then the generated code to access message fields for that language won't do anything but access variables in a structure that is memory-overlaid on your data buffer, but it still represents the metadata in a platform- and language-neutral way, so the generated code for other platforms can do the more complicated parsing where it is needed.
The downside of picking something that is reducible to a C structure overlay is that it doesn't handle optional fields, doesn't handle variable-length arrays, and may not handle backwards-compatible extensions in the future (unless you just tack them onto the end of the structure). I'd suggest you read about Google's protocol buffers if you haven't yet.
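To make the trade-off concrete, here is what a 'raw data' style parse of one hypothetical fixed-layout message looks like in, say, Erlang bit syntax (the field names, sizes and byte order are all made up); every peer has to agree on exactly this layout, which is precisely what an IDL would pin down for you:

    %% hypothetical wire format: 2-byte version, 2-byte type,
    %% 8-byte timestamp, 4-byte payload length, then the payload
    parse(<<Version:16/little, Type:16/little, Timestamp:64/little,
            Len:32/little, Payload:Len/binary, _Rest/binary>>) ->
        {ok, #{version => Version, type => Type,
               timestamp => Timestamp, payload => Payload}};
    parse(_Other) ->
        {error, malformed}.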

Resources