Is Erlang the C of the clustered computing world?

Erlang seems to be very low-level and performant over networks, but it does not have a very rich type system or many of the features that other functional languages offer. So it seems to me that it will become the lowest-level development language for clustered programming, until something else comes along that offers a decent clustered VM AND high-level constructs. Any thoughts on this?

C is the C of clustered computing.
At least, every HPC cluster I've seen had lots of C and Fortran running MPI, and never Erlang.
If anything, the trend seems to be towards language-agnostic grid standards rather than Erlang's specific messaging protocol. Interpreted languages are getting a foot in the door for gluing the heavy lifting together, a role for which Erlang might be a good match; but if you're spending hundreds of thousands of pounds a year running a cluster, you don't want the CPU time taken up running interpreted bytecode for anything that could be converted to a faster language.

Related

Why doesn't spark.ml implement any of the spark.mllib algorithms?

Following the Spark MLlib Guide we can read that Spark has two machine learning libraries:
spark.mllib, built on top of RDDs.
spark.ml, built on top of Dataframes.
According to this and this question on StackOverflow, DataFrames are better (and newer) than RDDs and should be used whenever possible.
The problem is that I want to use common machine learning algorithms (e.g. Frequent Pattern Mining, Naive Bayes, etc.) and spark.ml (for DataFrames) doesn't provide such methods; only spark.mllib (for RDDs) provides these algorithms.
If DataFrames are better than RDDs and the referred guide recommends the use of spark.ml, why aren't common machine learning methods implemented in that lib?
What's the missing point here?
Spark 2.0.0
Currently Spark is moving strongly towards the DataFrame API, with ongoing deprecation of the RDD API. While the number of native "ML" algorithms is growing, the main points highlighted below are still valid, and internally many stages are still implemented directly using RDDs.
See also: Switch RDD-based MLlib APIs to maintenance mode in Spark 2.0
Spark < 2.0.0
I guess that the main missing point is that spark.ml algorithms in general don't operate on DataFrames. So in practice it is more a matter of having an ml wrapper than anything else. Even a native ML implementation (like ml.recommendation.ALS) uses RDDs internally.
Why not implement everything from scratch on top of DataFrames? Most likely because only a very small subset of machine learning algorithms can actually benefit from the optimizations currently implemented in Catalyst, not to mention be efficiently and naturally implemented using the DataFrame API / SQL.
– The majority of ML algorithms require an efficient linear algebra library, not tabular processing. Using a cost-based optimizer for linear algebra could be an interesting addition (I think Flink already has one), but for now there seems to be nothing to gain here.
– The DataFrame API gives you very little control over the data. You cannot use a partitioner*, you cannot access multiple records at a time (a whole partition, for example), you're limited to a relatively small set of types and operations, you cannot use mutable data structures, and so on.
– Catalyst applies local optimizations. If you pass it a SQL query / DSL expression, it can analyze it, reorder it, and apply early projections. All of that is great, but typical scalable algorithms require iterative processing. So what you really want to optimize is a whole workflow, and DataFrames alone are not faster than plain RDDs; depending on the operation they can actually be slower.
– Iterative processing in Spark, especially with joins, requires fine-grained control over the number of partitions; otherwise weird things happen. DataFrames give you no control over partitioning. Also, DataFrame / Dataset don't provide native checkpoint capabilities (fixed in Spark 2.1), which makes iterative processing almost impossible without ugly hacks.
– Ignoring low-level implementation details, some groups of algorithms, like FPM, don't fit very well into the model defined by ML pipelines.
– Many optimizations are limited to native types, not UDT extensions like VectorUDT.
There is one more problem with DataFrames, which is not really related to machine learning: when you decide to use a DataFrame in your code, you give away almost all the benefits of static typing and type inference. It is highly subjective whether you consider that a problem, but one thing is for sure: it doesn't feel natural in the Scala world.
Regarding "better, newer and faster", I would take a look at Deep Dive into Spark SQL's Catalyst Optimizer, in particular the part about quasiquotes:
The following figure shows that quasiquotes let us generate code with performance similar to hand-tuned programs.
* This has changed in Spark 1.6, but it is still limited to the default HashPartitioning.

Machine Learning & Big Data [closed]

To begin, I would like to describe my current position and the goal I would like to achieve.
I am a researcher working on machine learning. So far I have gone through several theoretical courses covering machine learning algorithms and social network analysis, and have therefore gained some theoretical concepts useful for implementing machine learning algorithms and feeding in real data.
On simple examples the algorithms work well and the running time is acceptable, whereas big data is a problem when trying to run the algorithms on my PC. Regarding software, I have enough experience to implement any algorithm from articles or design my own using whatever language or IDE (so far I have used Matlab, Java with Eclipse, .NET...), but so far I haven't had much experience with setting up infrastructure. I have started to learn about Hadoop, NoSQL databases, etc., but I am not sure which strategy would be best given my learning-time constraints.
The final goal is to be able to set up a working platform for analyzing big data, focused on implementing my own machine learning algorithms, and to put it all together into production, ready to solve useful questions by processing big data.
As the main focus is on implementing machine learning algorithms, I would like to ask whether there is any existing running platform that offers enough CPU resources to feed large data in, upload my own algorithms, and simply process the data without thinking about distributed processing.
Whether or not such a platform exists, I would like to gain a picture big enough to be able to work in a team that could put into production a whole system tailored to specific customer demands. For example, a retailer would like to analyze daily purchases, so all the daily records have to be uploaded to some infrastructure capable of processing the data with custom machine learning algorithms.
To put all of the above into a simple question: how do you design a custom data mining solution for real-life problems, with the main focus on machine learning algorithms, and put it into production, if possible using existing infrastructure and, if not, by designing a distributed system (with Hadoop or some other framework)?
I would be very thankful for any advice or suggestions about books or other helpful resources.
First of all, your question needs to define more clearly what you mean by Big Data.
Indeed, Big Data is a buzzword that may refer to problems of various sizes. I tend to define Big Data as the category of problems where the data size or the computation time is big enough for "the hardware abstractions to become broken", which means that a single commodity machine cannot perform the computations without intensive care over computation and memory.
The scale threshold beyond which data becomes Big Data is therefore unclear and is sensitive to your implementation. Is your algorithm bounded by hard-drive bandwidth? Does it have to fit into memory? Did you try to avoid unnecessary quadratic costs? Did you make any effort to improve cache efficiency, etc.?
From several years of experience running medium large-scale machine learning challenges (on up to 250 commodity machines), I strongly believe that many problems that seem to require a distributed infrastructure can actually be run on a single commodity machine if the problem is expressed correctly. For example, you mention large-scale data for retailers. I have been working on this exact subject for several years, and I often managed to make all the computations run on a single machine, given a bit of optimisation. My company has been working on a simple custom data format that allows one year of all the data from a very large retailer to be stored within 50 GB, which means a single commodity hard drive could hold 20 years of history. You can have a look, for example, at: https://github.com/Lokad/lokad-receiptstream
From my experience, it is worth spending time trying to optimize the algorithm and memory use so that you can avoid resorting to a distributed architecture. Indeed, distributed architectures come with a triple cost. First of all, the strong knowledge requirements. Secondly, a large complexity overhead in the code. Finally, distributed architectures come with a significant latency overhead (with the exception of local multi-threaded distribution).
From a practitioner's point of view, being able to perform a given data mining or machine learning algorithm in 30 seconds is one of the key factors for efficiency. I have noticed that when some computation, whether sequential or distributed, takes 10 minutes, my focus and efficiency tend to drop quickly, as it becomes much more complicated to iterate and test new ideas quickly. The latency overhead introduced by many of the distributed frameworks is such that you will inevitably be in this low-efficiency scenario.
If the scale of the problem is such that even with strong effort you cannot perform it on a single machine, then I strongly suggest resorting to off-the-shelf distributed frameworks instead of building your own. One of the most well known is the MapReduce abstraction, available through Apache Hadoop. Hadoop can be run on clusters of ten thousand nodes, probably much more than you will ever need. If you do not own the hardware, you can "rent" the use of a Hadoop cluster, for example through Amazon Elastic MapReduce.
Unfortunately, the MapReduce abstraction is not suited to all Machine Learning computations.
As far as Machine Learning is concerned, MapReduce is a rigid framework and numerous cases have proved to be difficult or inefficient to adapt to this framework:
– The MapReduce framework is in itself related to functional programming. The Map procedure is applied to each data chunk independently. Therefore, the MapReduce framework is not suited to algorithms where the application of the Map procedure to some data chunks needs the results of the same procedure on other data chunks as a prerequisite. In other words, the MapReduce framework is not suited when the computations between the different pieces of data are not independent and impose a specific chronology.
– MapReduce is designed to provide a single execution of the map and reduce steps and does not directly provide iterative calls. It is therefore not directly suited to the numerous machine-learning problems implying iterative processing (Expectation-Maximisation (EM), Belief Propagation, etc.). Implementing these algorithms in a MapReduce framework means the user has to engineer a solution that organizes result retrieval and the scheduling of the multiple iterations, so that each map iteration is launched after the reduce phase of the previous iteration is completed and is fed with the results that the previous reduce phase produced.
– Most MapReduce implementations have been designed to address production needs and robustness. As a result, the primary concern of the framework is to handle hardware failures and to guarantee the computation results. MapReduce efficiency is therefore partly lowered by these reliability constraints. For example, the serialization of computation results to hard disk turns out to be rather costly in some cases.
– MapReduce is not suited to asynchronous algorithms.
The questioning of the MapReduce framework has led to richer distributed frameworks where more control and freedom are left to the framework user, at the price of more complexity for that user. Among these frameworks, GraphLab and Dryad (both based on Directed Acyclic Graphs of computations) are well known.
As a consequence, there is no "one size fits all" framework, just as there is no "one size fits all" data storage solution.
To start with Hadoop, you can have a look at the book Hadoop: The Definitive Guide by Tom White
If you are interested in how large-scale frameworks fit Machine Learning requirements, you may be interested in the second chapter (in English) of my PhD thesis, available here: http://tel.archives-ouvertes.fr/docs/00/74/47/68/ANNEX/texfiles/PhD%20Main/PhD.pdf
If you provide more insight about the specific challenge you want to deal with (type of algorithm, size of the data, time and money constraints, etc.), we probably could provide you a more specific answer.
Edit: another reference that could prove to be of interest: Scaling-up Machine Learning
I had to implement a couple of data mining algorithms to work with Big Data too, and I ended up using Hadoop.
I don't know if you are familiar with Mahout (http://mahout.apache.org/), which already has several algorithms ready to use with Hadoop.
Nevertheless, if you want to implement your own algorithm, you can still adapt it to Hadoop's MapReduce paradigm and get good results. This is an excellent book on how to adapt artificial intelligence algorithms to MapReduce:
Mining of Massive Datasets - http://infolab.stanford.edu/~ullman/mmds.html
This seems to be an old question. However, given your use case, the main frameworks focusing on machine learning in the Big Data domain are Mahout, Spark (MLlib), H2O, etc. To run machine learning algorithms on Big Data you have to convert them to parallel programs based on the MapReduce paradigm. This is a nice article giving a brief introduction to the major (not all) Big Data frameworks:
http://www.codophile.com/big-data-frameworks-every-programmer-should-know/
I hope this will help.

What is the best programming language to implement neural networks?

I'm not looking for a Neural Networks library, since I'm creating new kinds of networks. For that I need a good "dataflow" language.
Of course you can do this in C, C++, Java and co., but dealing with the multithreading etc. from scratch would be a nightmare.
At the other extreme, languages like Oz or Erlang seem better adapted, but they don't have many libraries and they are harder to master (it's easy to play with them, but is it OK to create complete software?).
What would you suggest?
I watched an interesting conference presentation about using Erlang for Neural Networks. You might want to check it out:
From Telecom Networks to Neural Networks; Erlang, as the unintentional Neural Network Programming Language
I also know that the presented system is going to be open-sourced any day now, according to the author's tweet.
Erlang is very well suited for NN.
Neurons can be modeled by processes (there is no problem with having millions of them).
Connections/synapses can be represented by the PIDs of target neurons. It is very easy to initialize such a network as part of a standard init procedure in OTP. Communication would be realized by message passing.
Maybe it would be good to have a global address space in ETS/Mnesia (the built-in data stores) in order to do dynamic reconfiguration of the network structure.
Pattern matching in a receive block can determine what kind of signal a neuron receives and modify it on the fly.
It would be very easy to monitor such a network.
Also consider that an Erlang NN would be 'live' all the time: you would be able to query neurons, layers, routers, etc. at any time.
In C/C++ you just read the current state of arrays/data structures.
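To make the process-per-neuron idea concrete, here is a minimal sketch under assumptions of my own (the module name, message shapes and maps-based weights are all invented for illustration, not taken from any real NN project):

    -module(neuron).
    -export([start/2]).

    %% Weights maps each input PID to a weight; Outputs lists the PIDs
    %% of the next layer.
    start(Weights, Outputs) ->
        spawn(fun() -> loop(Weights, Outputs, 0.0) end).

    loop(Weights, Outputs, Acc) ->
        receive
            {input, From, X} ->
                %% Pattern matching classifies the incoming signal.
                W = maps:get(From, Weights, 0.0),
                loop(Weights, Outputs, Acc + W * X);
            fire ->
                Out = math:tanh(Acc),            % toy activation function
                lists:foreach(fun(Pid) -> Pid ! {input, self(), Out} end,
                              Outputs),
                loop(Weights, Outputs, 0.0);
            {reweight, NewWeights} ->
                %% Reconfigure the live network without stopping it.
                loop(NewWeights, Outputs, Acc)
        end.

Since every neuron is addressable by its PID, querying or rewiring the "live" network is just another message.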
Regarding performance, we all know that C/C++ is orders of magnitude faster than Erlang, but the NN topic is tricky.
If the network holds very few neurons in a very wide address space in a regular array, iterating over it again and again can be costly (in C). The equivalent situation in Erlang would be solved by a single query to the root (input layer) neurons, which would propagate the query directly to well-addressed neighbours.
DXNN1 and DXNN2, which were built and introduced in the textbook Handbook of Neuroevolution Through Erlang (http://www.amazon.com/Handbook-Neuroevolution-Through-Erlang-Gene/dp/1461444624/ref=zg_bs_760204_22), are open source and available at: https://github.com/CorticalComputer
If you are interested in dataflow programming and multithreading then I would suggest National Instruments LabVIEW. In this case you don't need to worry about multithreading, since it is already there, and you can also use OOP, which is now native in LabVIEW. LabVIEW OOP is also based purely on the dataflow programming paradigm.
If you have any Java experience, then use Scala which is a JVM language that is based on the same concept of "actors" as Erlang. But it is less strict than Erlang and can easily use any existing Java libraries.
Then, when you find a computationally expensive task that would work better in Erlang, you can use Erlang's jinterface library to communicate between your Scala code and your distributed Erlang nodes.
Using Java does not mean dealing with multithreading from scratch: just use one of the numerous Java actor libraries.
It's not a language in and of itself, but Emergent is very powerful and can be highly customized (it has a full scripting language).
It's open source, too, which could be helpful as a guide if you need to make your own version for your novel architectures.
Why reinvent the wheel? Try PyBrain. It's free and very comprehensive:
Quickstart
Another big plus for Erlang is full integration with Drakon
http://drakon-editor.sourceforge.net/drakon-erlang/intro.html
It all depends on your application. C++ and Python are good programming languages for machine learning.

Robotic Applications In Erlang

I want to use Erlang to implement a robotic application. Most current real-world applications implemented in Erlang are web based. The robot implemented by Prof. Corrado didn't utilize the concurrency of Erlang, which is the heart of the language, and concentrated more on artificial intelligence (according to my understanding of that project).
Some ideas that come to my mind are soccer robots or multiple robots cleaning a room, but in such systems the robots can be programmed in C (or any other language) and controlled using MATLAB. MATLAB helps with image processing (the vision system) and with solving complex array calculations, so what is the point of using Erlang? (Please correct me if I am wrong.)
Can somebody suggest a robotic application that would utilize Erlang's features, especially concurrency, where one can argue that Erlang is better suited for it than other languages?
A little more detail in the answers would help me a lot.
Erlang might not be the best approach to every robot application. But you can go for the less ambitious thesis of demonstrating that Erlang supports a computing model that is important in many robotics applications. The points in favour of using Erlang for robotics include:
concurrency to model and monitor a concurrent world;
distribution of sensor, actuator and computing resources;
state machines to tie the behaviors together; and
supervisors for fault tolerance (a minimal sketch of this point follows below).
Anything can be done in any language, but Erlang makes some things convenient, particularly at the architectural level.
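To illustrate the last two points, here is a hedged sketch of a supervision tree (the motor, sonar and planner child modules are hypothetical placeholders, not real libraries):

    -module(robot_sup).
    -behaviour(supervisor).
    -export([start_link/0, init/1]).

    start_link() ->
        supervisor:start_link({local, ?MODULE}, ?MODULE, []).

    %% One child per subsystem; if, say, the motor driver crashes, the
    %% supervisor restarts it without disturbing the sensor processes.
    init([]) ->
        SupFlags = #{strategy => one_for_one, intensity => 3, period => 10},
        Children = [#{id => motor,   start => {motor,   start_link, []}},
                    #{id => sonar,   start => {sonar,   start_link, []}},
                    #{id => planner, start => {planner, start_link, []}}],
        {ok, {SupFlags, Children}}.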
Chapter 14 of Concurrent Programming in Erlang, for example, models a lift control system with one process for each lift and one for each floor, and later discusses the process structure for a satellite control system. Lifts or satellites maybe aren't very robot-like, but the principles are the same.
The Erlang & Robotics work by Corrado Santoro et al. makes plenty of use of concurrency. Their 2007 mobile robot project has a bunch of different (concurrent) OTP behaviours that range from low-level I/O to high-level planning. Teaching Erlang using robotics and Player/Stage is another recent work.
Your robot soccer and cleaning robot ideas are fine and have plenty of scope for concurrency and inter-robot communication. But you don't just do an arbitrary robot application of that size: either you have a team and some specific robots to work on, or you get yourself a simulator (get a simulator either way).
Try simulating a number of robots that steer towards each other until they all collide, each robot running in a process of its own (a toy version is sketched below). When that works, replace the task and add processes that (pretend to) control motors, feel walls, see the environment, understand user commands, break down, etc., and exchange messages with other robots and planning processes.
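A hedged toy version of that exercise (every name here is invented for illustration): each robot is a process that steers towards a common point and reports to a coordinator when it arrives.

    -module(swarm).
    -export([start/1]).

    %% Spawn N robot processes that all steer towards the same point
    %% and tell the coordinator when they "collide" with it.
    start(N) ->
        Self = self(),
        lists:foreach(
          fun(Id) ->
                  spawn(fun() -> robot(Id, {Id * 10.0, Id * 5.0}, Self) end)
          end,
          lists:seq(1, N)),
        wait(N).

    robot(Id, {X, Y}, Coordinator) when abs(X) + abs(Y) < 1.0 ->
        Coordinator ! {arrived, Id};
    robot(Id, {X, Y}, Coordinator) ->
        timer:sleep(100),                       % one simulation tick
        robot(Id, {X * 0.9, Y * 0.9}, Coordinator).

    wait(0) -> all_arrived;
    wait(N) ->
        receive
            {arrived, Id} ->
                io:format("robot ~p arrived~n", [Id]),
                wait(N - 1)
        end.

Replacing the steering function with motor-control and sensor processes is then a local change, which is exactly the architectural convenience described above.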
Read up on robotic systems architectures to understand that such designs are common, and why. Then ask yourself: did Erlang facilitate this type of programming?
The only way to understand where Erlang is best and when it's best is to program in Erlang for a while. It's not about what Erlang does best; it's more about understanding how a functional design pattern using Erlang's lightweight spawns and OTP compares to trying to solve the same problems in an imperative language. A short bulleted pros-and-cons list will not do it justice. Go's goroutines and channels, Haskell and D can act similarly without any problem. Erlang is particularly good at being distributed. In newer concurrent languages you don't really have to work too hard to make things concurrent. In Erlang in particular, if your spawned processes share the CPU-intensive tasks evenly, you are using concurrency. You won't find much information on Erlang for robotics, but you'll find three books with lots of examples on making things distributed and concurrent, which tackle many of the same problems. The other concurrent languages don't usually specialize in such a way. Many useful primitives are built into Erlang.
A lot of posts have been made about doing similar things in other languages, but sticking with Erlang saves a lot of work:
http://www.google.com/search?sourceid=chrome&ie=UTF-8&q=%22erlang+style%22+%22(concurrency|message+passing)%22
http://groups.google.com/group/golang-dev/browse_thread/thread/e120a586441b9b24/806eab93bd5281a0?#806eab93bd5281a0

Why is MPI considered harder than shared memory and Erlang considered easier, when they are both message-passing?

There's a lot of interest these days in Erlang as a language for writing parallel programs on multicore. I've heard people argue that Erlang's message-passing model is easier to program than the dominant shared-memory models such as threads.
Conversely, in the high-performance computing community the dominant parallel programming model has been MPI, which also implements a message-passing model. But in the HPC world, this message-passing model is generally considered very difficult to program in, and people argue that shared memory models such as OpenMP or UPC are easier to program in.
Does anybody know why there is such a difference in the perception of message-passing vs. shared memory in the IT and HPC worlds? Is it due to some fundamental difference in how Erlang and MPI implement message passing that makes Erlang-style message-passing much easier than MPI? Or is there some other reason?
I agree with all previous answers, but I think a key point that is not made totally clear is that one reason that MPI might be considered hard and Erlang easy is the match of model to the domain.
Erlang is based on the concept of local memory, asynchronous message passing, and shared state solved by some form of global database that all threads can get to. It is designed for applications that do not move a whole lot of data around, and that are not supposed to explode out to 100k separate nodes that need coordination.
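That "global database" in practice usually means ETS or Mnesia. A minimal sketch of the idea (module and function names are mine): a public named ETS table acts as node-local shared state that any process can read or write alongside message passing.

    -module(blackboard).
    -export([new/0, write/2, read/1]).

    %% Create a public, named ETS table visible to every process on the node.
    new() ->
        ets:new(?MODULE, [named_table, public, set]),
        ok.

    write(Key, Value) ->
        true = ets:insert(?MODULE, {Key, Value}),
        ok.

    read(Key) ->
        case ets:lookup(?MODULE, Key) of
            [{Key, Value}] -> {ok, Value};
            []             -> not_found
        end.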
MPI is based on local memory and message passing, and is intended for problems where moving data around is a key part of the domain. High-performance computing is very much about taking the dataset for a problem and splitting it up among a host of compute resources, and that is pretty hard work in a message-passing system, as data has to be explicitly distributed with balancing in mind. Essentially, MPI can be viewed as a grudging admission that shared memory does not scale. And it is targeting high-performance computation spread across 100k processors or more.
Erlang is not trying to achieve the highest possible performance, but rather to decompose a naturally parallel problem into its natural threads. It was designed with a totally different type of programming task in mind compared to MPI.
So Erlang is best compared to pthreads and other rather local heterogeneous thread solutions, rather than to MPI, which is really aimed at a very different (and to some extent inherently harder) problem set.
Parallelism in Erlang is still pretty hard to implement. By that I mean that you still have to figure out how to split up your problem, but there are a few minor things that ease this difficulty compared to some MPI library in C or C++.
First, since Erlang's message-passing is a first-class language feature, the syntactic sugar makes it feel easier.
Also, Erlang's libraries are all built around its message passing. This support structure helps give you a boost into parallel-processing land. Take a look at the components of OTP like gen_server, gen_fsm, gen_event. These are very easy-to-use structures that can help your program become parallel.
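As a small illustration of how little ceremony gen_server needs (the counter example is mine, not from the OTP documentation), a worker that owns some state and serializes all access to it through its mailbox is only a few lines:

    -module(counter).
    -behaviour(gen_server).
    -export([start_link/0, bump/0, read/0]).
    -export([init/1, handle_call/3, handle_cast/2]).

    start_link() -> gen_server:start_link({local, ?MODULE}, ?MODULE, 0, []).

    bump() -> gen_server:cast(?MODULE, bump).   %% asynchronous message
    read() -> gen_server:call(?MODULE, read).   %% synchronous request/reply

    %% The callbacks run in the server process; the state is never
    %% touched concurrently, so no locks are needed.
    init(N) -> {ok, N}.
    handle_call(read, _From, N) -> {reply, N, N}.
    handle_cast(bump, N) -> {noreply, N + 1}.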
I think it's more the robustness of the available standard library that differentiates Erlang's message passing from MPI implementations, not really any specific feature of the language itself.
Usually concurrency in HPC means working on large amounts of data. This kind of parallelism is called data parallelism, and it is indeed easier to implement with a shared-memory approach like OpenMP, because the operating system takes care of things like the scheduling and placement of tasks, which one would have to implement oneself when using a message-passing paradigm.
In contrast, Erlang was designed to cope with task parallelism encountered in telephone systems, where different pieces of code have to be executed concurrently with only a limited amount of communication and strong requirements for fault tolerance and recovery.
This model is similar to what most people use PThreads for. It fits applications like web servers, where each request can be handled by a different thread, while HPC applications do pretty much the same thing on huge amounts of data which also have to be exchanged between workers.
I think it has something to do with the mind-set when you're programming with MPI versus with Erlang. For instance, MPI is not built into the language, whereas Erlang has built-in support for message passing. Another possible reason is the disconnect between merely sending/receiving messages and partitioning solutions into concurrent units of execution.
With Erlang you are forced to think in a functional programming frame, where data actually zips from function call to function call, and receiving is an active act that looks like a normal construct in the language. This gives you a closer connection between the computation you're actually performing and the act of sending/receiving messages.
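For example, in the Erlang shell (a toy echo process of my own; named funs need Erlang/OTP 17 or later), sending is the ! operator and receiving is an ordinary expression that pattern-matches the mailbox:

    %% `!` sends; `receive` blocks and pattern-matches incoming messages.
    Echo = spawn(fun Loop() ->
                     receive {From, Msg} -> From ! {self(), Msg}, Loop() end
                 end),
    Echo ! {self(), hello},
    receive
        {Echo, Reply} -> Reply        % binds Reply to hello
    end.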
With MPI on the other hand you are forced to think merely about the actual message passing but not really the decomposition of work. This frame of thinking requires somewhat of a context switch between writing the solution and the messaging infrastructure in your code.
The discussion can go on but the common view is that if the construct for message passing is actually built into the programming language and paradigm that you're using, usually that's a better means of expressing the solution compared to something else that is "tacked on" or exists as an add-on to a language (in the form of a library or extension).
Does anybody know why there is such a difference in the perception of message-passing vs. shared memory in the IT and HPC worlds? Is it due to some fundamental difference in how Erlang and MPI implement message passing that makes Erlang-style message-passing much easier than MPI? Or is there some other reason?
The reason is simply parallelism vs concurrency. Erlang is bred for concurrent programming. HPC is all about parallel programming. These are related but different objectives.
Concurrent programming is greatly complicated by heavily non-deterministic control flow and latency is often an important objective. Erlang's use of immutable data structures greatly simplifies concurrent programming.
Parallel programming has much simpler control flow and the objective is all about maximal total throughput and not latency. Efficient cache usage is much more important here, which renders both Erlang and immutable data structures largely unsuitable. Mutating shared memory is both tractable and substantially better in this context. In effect, cache coherence is providing hardware-accelerated message passing for you.
Finally, in addition to these technical differences there is also a political issue. The Erlang guys are trying to ride the multicore hype by pretending that Erlang is relevant to multicore when it isn't. In particular, they are touting great scalability so it is essential to consider absolute performance as well. Erlang scales effortlessly from poor absolute performance on one core to poor absolute performance on any number of cores. As you can imagine, that does not impress the HPC community (but it is adequate for a lot of heavily concurrent code).
Regarding MPI vs OpenMP/UPC: MPI forces you to slice the problem into small pieces and take responsibility for moving data around. With OpenMP/UPC, "all the data is there"; you just have to dereference a pointer. The MPI advantage is that 32-512 CPU clusters are much cheaper than 32-512 CPU single machines. Also, with MPI the expense is paid upfront, when you design the algorithm. OpenMP/UPC hides the latencies that you'll get at runtime if your system uses NUMA (and all big systems do): your program won't scale, and it will take a while to figure out why.
This article actually explains it well: Erlang is best when we are sending small pieces of data around, whereas MPI does much better on more complex things. Also, the Erlang model is easy to understand :-)
Erlang Versus MPI - Final Results and Source Code
