What is the difference between the Spread toolkit and DDS as middleware?

In distributed systems, what is the difference between DDS and the Spread toolkit as middleware, or between frameworks based on DDS and those based on the Spread toolkit?


Using scikit-learn on Databricks

Scikit-Learn algorithms are single-node implementations. Does this mean that they are not an appropriate choice for building machine learning models on a Databricks cluster, since they cannot take advantage of the cluster's computing resources?
They are not appropriate, in the sense that, as you say, they cannot take advantage of the cluster computing resources, which Databricks is arguably all about. The raison d'être of Databricks is Apache Spark, and specifically for ML tasks, its ML library Spark MLlib.
This does not mean that you cannot use scikit-learn in Databricks (you'll find that a Databricks cluster comes with scikit-learn installed by default), only that it is usable for problems that do not actually require a cluster. If you want to exploit the cluster resource capabilities for ML, you need to turn to Spark MLlib.
I think desertnaut hit the nail on the head here. Scikit-Learn algorithms are designed for single-node, non-parallel processing jobs, while all the MLlib stuff is designed to leverage cluster compute resources and parallel processing. Take a look at the link below for sample code for standard regression and classification tasks.
https://spark.apache.org/docs/latest/ml-classification-regression.html
In addition, here are some code samples for different clustering tasks.
https://spark.apache.org/docs/latest/ml-clustering.html
That should probably cover most of the things you will be doing.
I believe that it depends on the task at hand. I see two general scenarios:
Your data is big and does not fit into memory. Go with Spark MLlib and its distributed algorithms.
Your data is not that big and you want to utilize sheer computing power. The typical use case is hyperparameter search.
Databricks allows for distributing such workloads from the driver node to the executors with hyperopt and its SparkTrials (random + Bayesian search).
Some docs here:
http://hyperopt.github.io/hyperopt/scaleout/spark/
However, there are many more attempts to make scikit-learn work on Spark. You can supposedly distribute the workloads through UDFs, using joblib, or other approaches. I am investigating the issue myself and will update the answer later.
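The core idea behind SparkTrials is simply that hyperparameter trials are independent, so they can be fanned out to workers. As a rough, pure-Python illustration of that pattern (the objective function and search space here are made up for the example; on Databricks, hyperopt's SparkTrials does the analogous fan-out to Spark executors):

```python
import random
from concurrent.futures import ThreadPoolExecutor

def objective(params):
    # Stand-in for a real training run returning a validation loss;
    # this quadratic bowl is minimised at C=1.0, gamma=0.1.
    return (params["C"] - 1.0) ** 2 + (params["gamma"] - 0.1) ** 2

def sample_params(rng):
    # Made-up search space for the example.
    return {"C": rng.uniform(0.01, 10.0), "gamma": rng.uniform(0.001, 1.0)}

def random_search(n_trials=50, n_workers=4, seed=0):
    rng = random.Random(seed)
    trials = [sample_params(rng) for _ in range(n_trials)]
    # Trials are independent, so they can be evaluated in parallel;
    # a cluster-backed search ships each one to an executor instead
    # of a local thread.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        losses = list(pool.map(objective, trials))
    return min(zip(losses, trials), key=lambda t: t[0])

best_loss, best_params = random_search()
```

This is plain random search; hyperopt adds Bayesian (TPE) sampling on top, but the distribution story is the same.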

What tools do you know for storage, version control, and deployment of ML models as an API service?

I found https://dataversioncontrol.com and https://hydrosphere.io/ml-lambda/. What else is out there?
Convert your ML pipeline to a standardized text-based representation, and use regular version control tools (such as Git). For example, the PMML standard can represent the most popular R, Scikit-Learn and Apache Spark ML transformation and model types. Better yet, after conversion to the standardized representation, all these models become directly comparable with one another (e.g., measuring the "complexity" of random forest model objects across different ML frameworks).
You can build whatever APIs you like on top of this versioned base layer.
To get started with the PMML standard, please check out the Java PMML API backend project, and its Openscoring REST API frontend project.
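Producing actual PMML requires a converter (e.g., the JPMML projects mentioned above), but the versioning idea is independent of the format: serialize the model to deterministic text, then commit it. A minimal sketch of that idea, with a hypothetical model's parameters and plain JSON standing in for PMML:

```python
import hashlib
import json

def to_text_representation(model_params):
    # Deterministic serialization: identical models produce
    # byte-identical text, which makes diffs and merges meaningful.
    return json.dumps(model_params, indent=2, sort_keys=True)

def version_id(text):
    # Content-addressed version id, similar to how Git hashes blobs.
    return hashlib.sha1(text.encode("utf-8")).hexdigest()

# Hypothetical fitted-model parameters (in practice: the PMML export).
params_v1 = {"model": "RandomForest", "n_trees": 100, "max_depth": 8}
params_v2 = {"model": "RandomForest", "n_trees": 200, "max_depth": 8}

text_v1 = to_text_representation(params_v1)
text_v2 = to_text_representation(params_v2)
# Any change to the model yields a new version id.
assert version_id(text_v1) != version_id(text_v2)
```

Once the representation is a text file, `git diff` between two commits shows exactly which hyperparameters or structures changed between model versions.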

Why doesn't spark.ml implement any of the spark.mllib algorithms?

Following the Spark MLlib Guide we can read that Spark has two machine learning libraries:
spark.mllib, built on top of RDDs.
spark.ml, built on top of DataFrames.
According to this and this question on StackOverflow, DataFrames are better (and newer) than RDDs and should be used whenever possible.
The problem is that I want to use common machine learning algorithms (e.g., Frequent Pattern Mining, Naive Bayes, etc.), and spark.ml (for DataFrames) doesn't provide such methods; only spark.mllib (for RDDs) provides these algorithms.
If DataFrames are better than RDDs and the referred guide recommends the use of spark.ml, why aren't common machine learning methods implemented in that lib?
What's the missing point here?
Spark 2.0.0
Currently Spark is moving strongly towards the DataFrame API, with ongoing deprecation of the RDD API. While the number of native "ML" algorithms is growing, the main points highlighted below are still valid, and internally many stages are implemented directly using RDDs.
See also: Switch RDD-based MLlib APIs to maintenance mode in Spark 2.0
Spark < 2.0.0
I guess that the main missing point is that spark.ml algorithms in general don't actually operate on DataFrames internally. So in practice it is more a matter of having an ml wrapper than anything else. Even native ML implementations (like ml.recommendation.ALS) use RDDs internally.
Why not implement everything from scratch on top of DataFrames? Most likely because only a very small subset of machine learning algorithms can actually benefit from the optimizations currently implemented in Catalyst, not to mention be efficiently and naturally implemented using the DataFrame API / SQL.
The majority of ML algorithms require an efficient linear algebra library, not tabular processing. Using a cost-based optimizer for linear algebra could be an interesting addition (I think Flink already has one), but it looks like for now there is nothing to gain here.
The DataFrame API gives you very little control over the data. You cannot use a custom partitioner*, you cannot access multiple records at a time (a whole partition, for example), you're limited to a relatively small set of types and operations, and you cannot use mutable data structures, and so on.
Catalyst applies local optimizations. If you pass a SQL query / DSL expression it can analyze it, reorder operations, and apply early projections. All of that is great, but typical scalable algorithms require iterative processing. So what you really want to optimize is the whole workflow, and DataFrames alone are not faster than plain RDDs; depending on the operation, they can actually be slower.
Iterative processing in Spark, especially with joins, requires fine-grained control over the number of partitions, otherwise weird things happen. DataFrames give you no control over partitioning. Also, DataFrame / Dataset don't provide native checkpoint capabilities (fixed in Spark 2.1), which makes iterative processing almost impossible without ugly hacks.
Ignoring low-level implementation details, some groups of algorithms, like FPM, don't fit very well into the model defined by ML pipelines.
Many optimizations are limited to native types, not UDT extensions like VectorUDT.
There is one more problem with DataFrames, which is not really related to machine learning. When you decide to use a DataFrame in your code you give away almost all the benefits of static typing and type inference. It is highly subjective whether you consider that a problem, but one thing is for sure: it doesn't feel natural in the Scala world.
Regarding "better, newer and faster", I would take a look at Deep Dive into Spark SQL's Catalyst Optimizer, in particular the part related to quasiquotes:
The following figure shows that quasiquotes let us generate code with performance similar to hand-tuned programs.
* This has been changed in Spark 1.6 but it is still limited to default HashPartitioning

PageRank tool for large graphs

I need to compute PageRank scores for a large graph which cannot be loaded into memory. I need a simple toolkit that can be easily modified, since I need to change its code for my research. Are you aware of any useful and simple toolkit that computes PageRank for large graphs (the size of the graph is around 40 GB)?
Thanks
Two packages you might want to evaluate are
Apache TinkerPop
http://tinkerpop.incubator.apache.org/docs/3.0.1-incubating/#pagerankvertexprogram
Apache Spark - GraphX
http://spark.apache.org/docs/latest/graphx-programming-guide.html#pagerank
Both are open source with Apache license, so the source code is available for you to modify or extend.
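For orientation before diving into either codebase, both implement essentially the same power-iteration scheme. A minimal in-memory sketch of that algorithm over a toy edge list (the 40 GB case additionally needs out-of-core or distributed storage, which is exactly what the tools above provide):

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over a directed edge list."""
    nodes = {n for edge in edges for n in edge}
    out_degree = {n: 0 for n in nodes}
    for src, _ in edges:
        out_degree[src] += 1
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Each node sends its rank, split evenly, along its out-edges.
        contrib = {n: 0.0 for n in nodes}
        for src, dst in edges:
            contrib[dst] += rank[src] / out_degree[src]
        # Dangling nodes (no out-links) spread their rank uniformly.
        dangling = sum(rank[n] for n in nodes if out_degree[n] == 0)
        rank = {
            n: (1 - damping) / len(nodes)
               + damping * (contrib[n] + dangling / len(nodes))
            for n in nodes
        }
    return rank

edges = [("a", "b"), ("b", "c"), ("c", "a"), ("a", "c")]
ranks = pagerank(edges)
```

The per-iteration "send contributions along edges, then reduce per node" structure is also why the algorithm maps so naturally onto vertex-program frameworks like TinkerPop's PageRankVertexProgram and GraphX's Pregel-based implementation.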

Splitting OpenCV operations between frontend and backend processors

Is it possible to split an OpenCV application into frontend and backend modules, such that the frontend runs on thin clients with very limited processing power (Intel Atom dual-core processors, with 1-2 GB RAM), and the backend does most of the computational heavy lifting, such as on Google Compute Engine?
Is this possible with the additional constraint that the network communication between frontend and backend is not fast, say limited to 128-256 kbps?
Are there any precedents of this kind? Is there any such open-source project?
Are there some common architectural patterns that could help in such a design?
Additional clarification:
The frontend node need NOT be purely a frontend, as in only running the user interface. I would imagine that certain OpenCV algorithms could run on the frontend node, which is especially useful for reducing the amount of data that needs to be sent to the backend for processing (such as colour-space transformation, conversion to grayscale, histograms, etc.). I've successfully tested real-time face detection (Haar cascade) on this low-end machine, so the frontend node can pull some workload. In fact, I'd prefer to do most of the work on the frontend, and only push those computation-heavy aspects to the backend that are clearly and definitely well beyond the computational power of the frontend computer.
What I am looking for are suggestions/ideas on the kinds of algorithms that are best run on Google Compute Engine, and some tried-and-tested architectural patterns for use with OpenCV to achieve such a split.
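Whatever the algorithmic split, a quick link-budget check makes the 128-256 kbps constraint concrete. A back-of-envelope sketch (frame sizes and the compression ratio are illustrative assumptions, not measurements):

```python
def frames_per_second(width, height, channels, bits_per_pixel=8,
                      compression_ratio=1.0, link_kbps=128):
    # How many frames per second a link can carry, assuming a
    # (hypothetical) average compression ratio. Pure arithmetic.
    bits_per_frame = width * height * channels * bits_per_pixel / compression_ratio
    return link_kbps * 1000 / bits_per_frame

# Raw 640x480 RGB: well under 1 fps on a 128 kbps link.
raw = frames_per_second(640, 480, 3)

# Grayscale, downscaled on the frontend, and compressed at an
# assumed 20:1 before upload: a few fps becomes plausible.
reduced = frames_per_second(320, 240, 1, compression_ratio=20)
```

This is why the pattern you describe (frontend does grayscale conversion, downscaling, and detection to crop regions of interest; backend does the heavy recognition on only those crops) is the usual answer: the split is driven less by CPU placement than by shrinking what crosses the slow link.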
