Well, the title says it all: I'm wondering why the BEAM doesn't garbage collect atoms. I'm aware of the question How Erlang atoms can be garbage collected but, while related, it doesn't answer the why.
Because that is not possible (or at least very hard) to do in the current design. Atoms are an important part of:
modules, as module names are atoms
function names, which are also atoms
distributed Erlang, which also uses atoms extensively
Especially the last point makes it hard. Imagine for a second that we had a GC for atoms. What would happen if a GC cleanup ran in the middle of a distributed call where we send some atoms over the wire? All of that makes atoms quite essential to how the VM works, and making them GCed would not only make the implementation of the VM much more complex, it would also make code slower: atoms do not need to be copied between processes, and because they aren't GCed they can be skipped entirely in the GC mark step.
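To make the "never collected" part concrete, here is a small, illustrative Elixir snippet. It assumes an OTP release that exposes :erlang.system_info(:atom_count) (OTP 20 or later); the atom names are invented for the example.

    # Observe that the atom table only ever grows.
    before = :erlang.system_info(:atom_count)

    # Dynamically create a few atoms that did not exist before this run.
    Enum.each(1..5, fn i -> String.to_atom("runtime_atom_#{i}") end)

    after_count = :erlang.system_info(:atom_count)
    IO.puts("atoms before: #{before}, after: #{after_count}")
    # The count goes up and never comes back down, no matter how long you
    # wait: there is no collector walking the atom table.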
As I understand it, when a managed language (like Haxe) can and wants to compile to a non-managed language (like C++), it includes some form of garbage collector in the runtime.
I was wondering if it would be possible to completely abstract away memory management in the intermediate representation / abstract syntax tree, so that a garbage collector would not be needed and the default behavior (stack allocations live until end of scope and heap allocations live until freed) could be used?
Thank you!
If I understood you correctly, you're asking whether it's possible to take a garbage-collected language and compile it to an equivalent program in a non-garbage-collected language without introducing memory errors or leaks, just by adding frees in the right places (i.e. no reference counting, no tracking of references, no garbage collection algorithm, nor anything else at run time that could be considered garbage collection).
No, that is not possible. To do something like this, you'd have to be able to statically answer the question "What's the point in the program, after which a given object is no longer referenced", which is a non-trivial semantic property and thus undecidable per Rice's theorem.
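As an illustration of why this is undecidable in general, consider the sketch below. It is written in Elixir, which is itself garbage collected, so read it purely as an analogy; the module and function names are invented. Whether the allocated data is still referenced after the call depends entirely on runtime input, so no static analysis can decide where a hypothetical free(data) would belong.

    defmodule EscapeExample do
      # Whether data is still referenced after handle/2 returns depends on
      # runtime input, so no compiler can statically place a "free" for it.
      def handle(input, registry_pid) do
        data = load_blob(input)

        if String.starts_with?(input, "keep:") do
          # The reference escapes into a long-lived process ...
          send(registry_pid, {:store, data})
        else
          # ... or it is used once here and never referenced again.
          IO.inspect(byte_size(data))
        end
      end

      # Placeholder for "allocate something big".
      defp load_blob(input), do: :crypto.strong_rand_bytes(1024) <> input
    end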
You could define a sufficiently restricted subset of the language (something like "only one live variable may hold a strong reference to an object at a time and anything else must use weak references"), but programming in that subset would be so different from programming in the original language¹ that there wouldn't be much of a point in doing that.
¹ And perhaps more importantly: it would be highly unlikely that existing code would conform to that subset. So if there's a compiler that can compile my code to efficient GC-free native code, but only if I completely re-write my code to fit an awkward subset of the language, why wouldn't I just re-write the project in Rust instead? Especially since interop with libraries that aren't written in the subset would probably be infeasible as well.
I'm attempting to build my first C-like programming language, most likely as an interpreter, and I've just finished the first step, i.e. the lexer.
I've thought about taking the lazy route of simply lexing the entire source code stream all at once and then having the parser process the data.
I've noticed that many other compilers and interpreters only lex during parsing when the parser module asks for another token.
Is it quicker, in terms of performance, to lex the source code all at once and then parse the resulting tokens, or to lex and parse tokens one at a time?
"faster" is a bit of a fuzzy word. There are different kinds of speed (latency, absolute start-to-finish duration, compile speed, execution speed), and depending on how you implement your language's front-end and backend, either approach can be faster.
Also, faster is not always better. If your parser is technically faster but uses too much memory, it could crash or at least end up swapping, which would slow it down again. If your parser is lightning-fast but produces inefficient code, your users will pay for your faster development speed. You'll have to write actual code and run it in a profiler to tell what is really better, and decide which criteria are important to you.
Tokenizing/Lexing everything at once at the start means you might be able to optimize memory allocation and thus take less time resizing your token list etc., but it also means the entire file has to be lexed before it can even be partially parsed.
OTOH if you lex and parse as needed, you may have to grow your token arrays in small steps more often, so you'll pay a price in repeated allocations, but in the case of e.g. an interpreted language like JavaScript, you may only have to parse the parts that are actually used for this run-through.
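Purely as an illustration of the two shapes (in Elixir rather than in your C-like language, and with whitespace-separated words standing in for real lexemes), here is a minimal sketch of an eager, list-producing lexer next to a lazy, stream-producing one:

    defmodule TinyLexer do
      # Grossly simplified "lexer": whitespace-separated words become tokens.
      # A real lexer would track positions, literals, operators, etc.

      # Eager: the whole source is tokenized up front into one list.
      def lex_all(source) do
        source
        |> String.split()
        |> Enum.map(&{:token, &1})
      end

      # Lazy: tokens are produced only as the consumer (the parser) asks.
      def lex_stream(source) do
        source
        |> String.splitter([" ", "\n", "\t"], trim: true)
        |> Stream.map(&{:token, &1})
      end
    end

    source = "let x = 1 + 2"

    # Eager: everything is materialized immediately.
    tokens = TinyLexer.lex_all(source)

    # Lazy: only the first three tokens are ever computed here.
    first_three = source |> TinyLexer.lex_stream() |> Enum.take(3)

    IO.inspect({length(tokens), first_three})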
So a lot of it depends on the details of your language and the hardware you expect to run on. On embedded systems with little memory and no swap, you may have no choice but to lex progressively, as the whole program source might not fit in memory. If your language's syntax needs a lot of lookahead, you may not see any benefit from progressive lexing because you're reading it all anyway...
Are atoms copied from one process to another when I send an atom as a message? My thinking is that since the atom already exists in the VM, it does not have to be copied. I understand that binaries are more efficient to send from one process to another.
If I am sending a trigger message (a constant message) from one process to another, which is better to use: an atom or a binary?
Use what is the most correct semantically. In general, don't worry about performance unless you have benchmarked and you are certain your code could benefit from optimizations. If you use what is the most correct semantically, it will likely be the fastest anyway.
That said, what is the most correct semantically?
Atoms are useful for tagging or identifying terms that are static, that won't change. So if you want to tell a process to do some work, you could write: send(pid, :do_some_work). Then the other process can easily match on the atom and perform the required work (comparisons on atoms are super fast).
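As a minimal sketch of that tagging pattern (the Worker module and the :stop message are made up for the example):

    defmodule Worker do
      # A tiny worker process that matches on atom-tagged messages.
      def start do
        spawn(fn -> loop() end)
      end

      defp loop do
        receive do
          :do_some_work ->
            IO.puts("doing the work")
            loop()

          :stop ->
            :ok
        end
      end
    end

    pid = Worker.start()
    send(pid, :do_some_work)
    send(pid, :stop)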
However, if you are passing dynamic content, then you surely want to use binaries. It would actually be unsafe to convert binaries to atoms, since dynamically created atoms are never garbage collected and the atom table has a limited size. Atoms also have a length limit (255 characters), so you can't have an atom that is 1 kilobyte long.
Finally, it is worth pointing out that atoms are by far the fastest to copy. An atom is represented as an integer and takes one word, so you are copying just one word around.
Binaries, even though they are shared after a certain size, take at least 3 words.
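If you want to poke at this yourself, :erts_debug.flat_size/1 reports how many heap words a term occupies. It is an undocumented debugging helper, so treat the numbers as illustrative; they vary with word size and OTP version.

    IO.inspect(:erts_debug.flat_size(:do_some_work))
    # typically 0 extra heap words: an atom is an immediate, the whole
    # term fits in the single word that refers to it
    IO.inspect(:erts_debug.flat_size(<<"do_some_work">>))
    # a handful of words: binary header plus the payload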
More information: Why is useful to have a atom type (like in elixir, erlang)?
The Erlang VM efficiency guide: http://www.erlang.org/doc/efficiency_guide/advanced.html
This post on Erlang scalability says there's an overhead for every call, cast or message to a gen_server. How much overhead is it, and what is it for?
The cost that is being referenced is the cost of a (relatively blind) function call to an external module. This happens because everything in the gen_* abstractions is a callback to externally defined functions (the functions you write in your callback module), not function calls that can be optimized by the compiler within a single module. Part of that cost is the resolution of the call (finding the right code to execute -- the same reason each "dot" in.a.long.function.or.method.call in Python or Java raises the cost of resolution) and another part is the actual call itself.
BUT
This is not something you can calculate as a simple quantity and then multiply out to get a meaningful answer about the cost of operations across your system.
There are too many variables, points of constraint, and unexpectedly cheap elements in a massively concurrent system like Erlang where the hardest parts of concurrency are abstracted away (scheduling related issues) and the most expensive elements of parallel processing are made extremely cheap (context switching, process spawn/kill and process:memory ratio).
The only way to really know anything about a massively concurrent system, which by its very nature will exhibit emergent behavior, is to write one and measure it in actual operation. You could write exactly the same program in pure Erlang once and then again as an OTP application using gen_* abstractions and measure the difference in performance that way -- but the benchmark numbers would only mean anything to that particular program and probably only mean anything under that particular load profile.
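As a sketch of what such a comparison could look like, here is a bare receive loop with a hand-rolled synchronous call next to the same behaviour as a GenServer. It is written in Elixir for brevity (the question is about Erlang's gen_server, but the callback structure is the same through Elixir's GenServer wrapper), and the numbers it prints only mean anything for your own program and load profile.

    defmodule BarePing do
      # A bare receive loop with a hand-rolled synchronous "call"
      # (no monitors or timeouts, unlike the real gen_server machinery).
      def start, do: spawn(fn -> loop() end)

      defp loop do
        receive do
          {:ping, from, ref} ->
            send(from, {ref, :pong})
            loop()
        end
      end

      def call(pid) do
        ref = make_ref()
        send(pid, {:ping, self(), ref})

        receive do
          {^ref, reply} -> reply
        end
      end
    end

    defmodule GenPing do
      use GenServer

      def start, do: GenServer.start_link(__MODULE__, :ok)

      @impl true
      def init(:ok), do: {:ok, nil}

      @impl true
      def handle_call(:ping, _from, state), do: {:reply, :pong, state}
    end

    bare = BarePing.start()
    {:ok, gen} = GenPing.start()

    n = 100_000
    {t_bare, _} = :timer.tc(fn -> for _ <- 1..n, do: BarePing.call(bare) end)
    {t_gen, _} = :timer.tc(fn -> for _ <- 1..n, do: GenServer.call(gen, :ping) end)

    IO.puts("bare loop: #{t_bare} us, GenServer: #{t_gen} us for #{n} calls")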
All this taken in mind... the numbers that really matter when we start splitting hairs in the Erlang world are the reduction budget costs the scheduler takes into account. Lukas Larsson at Erlang Solutions put out a video a while back about the scheduler, detailing how these costs impact the system, what they are, and how to tweak the values under certain circumstances (Understanding the Erlang Scheduler). Aside from external resources (iops delay, network problems, NIF madness, etc.) that have nothing to do with Erlang/OTP, the overwhelming factor is the behavior of the scheduler, not the "cost of a function call".
In all cases, though, the only way to really know is to write a prototype that represents the basic behavior you expect in your actual system and test it.
I also want to compare the results of concurrent parsing with serial parsing.
How to do this? Which one is the simplest approach?
As others have said, YACC (based on LR parsing) by itself isn't "concurrent".
(It is my understanding that YACC isn't even re-entrant, which will make it pretty hard to use in a multithreaded context no matter what you do. The non-reentrancy can presumably be overcome with mere sweat, so this is an annoyance, not a show-stopper.)
One idea is to construct a pipeline, allowing a lexer to generate lexemes into a stream as fast as it can and letting the actual parser read from the stream. This can get you at best a factor of 2. You might be able to do this with YACC relatively easily, modulo setting up communicating threads.
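To show the shape of that pipeline (not YACC itself), here is a toy Elixir sketch with a trivial whitespace "lexer" and a token-counting "parser" standing in for the real ones; the parser starts consuming tokens while the lexer is still producing them.

    defmodule PipelineDemo do
      # Producer: lexes the input and streams tokens to the parser as
      # they are found.
      def lexer(source, parser) do
        source
        |> String.split()
        |> Enum.each(fn word -> send(parser, {:token, word}) end)

        send(parser, :eof)
      end

      # Consumer: "parses" (here: just counts) tokens as they arrive,
      # without waiting for the whole file to be lexed first.
      def parser(reply_to) do
        count = consume(0)
        send(reply_to, {:parsed, count})
      end

      defp consume(count) do
        receive do
          {:token, _word} -> consume(count + 1)
          :eof -> count
        end
      end
    end

    main = self()
    parser = spawn(fn -> PipelineDemo.parser(main) end)
    spawn(fn -> PipelineDemo.lexer("a b c d e", parser) end)

    receive do
      {:parsed, n} -> IO.puts("parsed #{n} tokens")
    end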
McKeeman et al have implemented parallel LR parsing by dividing the file into N chunks for N threads and claim to have gotten good results. The approach isn't simple, because dividing parsing of a single file into parallel chunks of about the same size, and stitching those chunks together, isn't easy. I doubt that one could easily hack up YACC to do this.
A screwy idea is to parse a file from both ends toward the middle.
It's easy enough to define the backward parser's grammar from the "natural" forward one: just reverse the content of every grammar rule. Nothing is easy, though; this might introduce ambiguities not present in the forward parser. This paper combines McKeeman's idea of breaking the file into chunks with bidirectional parsing of each chunk, enabling one to find a lot of parallelism in a big file.
Easier to do is to parallelize across individual files, parsing each with whatever parsing technology you have. This parallelizes relatively well, although the parsing time of individual files may be uneven, so it is best done with some kind of worklist and a team of parser threads taking work from that worklist. You could probably organize YACC to do this.
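A minimal sketch of that worklist shape, in Elixir: a bounded pool of workers pulls file names from the queue and parses each one independently. parse_file/1 here is only a placeholder for whatever real parser you use.

    defmodule ParallelFiles do
      def parse_all(paths, workers \\ System.schedulers_online()) do
        paths
        |> Task.async_stream(&parse_file/1,
          max_concurrency: workers,   # size of the worker pool
          ordered: false,             # uneven file sizes: take results as ready
          timeout: :infinity
        )
        |> Enum.map(fn {:ok, result} -> result end)
      end

      # Placeholder parser: replace with the real thing.
      defp parse_file(path), do: {path, byte_size(File.read!(path))}
    end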
Our DMS Software Reengineering Toolkit (DMS) uses a generalization of this when reading large systems of Java sources and classes. Parsing by itself isn't very useful; you almost always need to build symbol tables, too. DMS reading Java thus parallelizes parsing, AST building and symbol table construction. Starting from a base set of filenames, it launches parses in parallel grains (multiplexed by DMS on top of OS threads); when a parser completes, the parsed tree is handed to the name resolver, which splits into a parallel grain per parallel scope encountered. Nested scopes cause a tree of grains to be generated. Further work is gated by treating resolution of a scope as a future (event); while the scope is being resolved, more Java file parse/name-resolution activities may be launched; when a scope is resolved, an event is signalled and grains waiting for that scope's completion can then inspect its content to resolve their own names. The tangle of (potential) parallelism in the middle of this is almost frightening :-} but is managed by DMS's underlying parallel programming language, PARLANSE, which uses work-stealing to balance the load across the threads.
Experience shows that doing this in production with 8 cores leads to a 5x speedup over sequential for a few thousand source files (typically small, as Java files tend to be). We think we know where the bottleneck is; there are parts of the name resolver that are more expensive than they should be, in terms of excess synchronization in an attribute grammar. I suspect we can get this closer to 8x. So parallel parsing and name resolution can be pretty effective.
We don't do quite as well with C++14, because of all the dependencies of individual files on #includes that it reads, often in different preprocessor configurations.
If the input allows splitting into chunks, like, e.g., log file lines, which can be parsed in parallel, then you can parse a certain number of such chunks in parallel using a producer/consumer queue and then join the parse results.
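A minimal sketch of that idea in Elixir, for line-oriented input such as a log file: a lazy file stream acts as the producer, chunks of lines are parsed by concurrent tasks, and the results are joined in order. parse_line/1 is a placeholder for the real per-line parser.

    defmodule ChunkedParse do
      def parse(path, chunk_size \\ 1000) do
        path
        |> File.stream!()                      # producer: stream lines lazily
        |> Stream.chunk_every(chunk_size)      # group them into parseable chunks
        |> Task.async_stream(fn chunk ->       # consumers: one task per chunk
          Enum.map(chunk, &parse_line/1)
        end)
        |> Enum.flat_map(fn {:ok, parsed} -> parsed end)  # join results in order
      end

      # Placeholder per-line parser.
      defp parse_line(line), do: String.split(line, " ")
    end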