Is it possible to split an OpenCV application into frontend and backend modules, such that the frontend runs on thin clients with very limited processing power (Intel Atom dual-core processors with 1-2 GB RAM), and the backend does most of the computational heavy lifting, for example on Google Compute Engine?
Is this possible with the additional constraint that the network link between frontend and backend is slow, say limited to 128-256 kbps?
Are there any precedents of this kind? Is there any such open-source project?
Are there some common architectural patterns that could help in such a design?
Additional clarification:
The front-end node need NOT be purely a front end, as in only running the user interface. I would imagine that certain OpenCV algorithms could be run on the front-end node, which is especially useful for reducing the amount of data that needs to be sent to the back end for processing (such as colour-space transformation, conversion to grayscale, histograms, etc.). I've successfully tested real-time face detection (Haar cascade) on this low-end machine, so the frontend node can take on some of the workload. In fact, I'd prefer to do most of the work in the frontend, and push to the backend only those computation-heavy aspects that are clearly and definitely beyond the computational power of the frontend computer.
What I am looking for are suggestions/ideas on the nature of algorithms that are best run on Google Compute Engine, and some architectural patterns that are tried and tested, for use with OpenCV to achieve such a split.
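To make the bandwidth constraint concrete, here is a rough back-of-the-envelope sketch of how long different payloads would take to cross the 128-256 kbps link. The 640x480 frame size, the 100x100 face crop, and the absence of compression or protocol overhead are assumptions chosen for illustration only.

```python
# Rough link budget for a thin client on a slow uplink.
# Assumptions (not from the post): 640x480 frames, 1 kbps = 1000 bit/s,
# no compression and no protocol overhead.

def transfer_seconds(payload_bytes: int, link_kbps: float) -> float:
    """Time to push payload_bytes over a link of link_kbps kilobits per second."""
    bytes_per_second = link_kbps * 1000 / 8
    return payload_bytes / bytes_per_second

WIDTH, HEIGHT = 640, 480
payloads = {
    "raw BGR frame": WIDTH * HEIGHT * 3,
    "grayscale frame": WIDTH * HEIGHT,
    "100x100 grayscale face crop": 100 * 100,
    "256-bin histogram (4-byte counts)": 256 * 4,
}

for name, size in payloads.items():
    slow = transfer_seconds(size, 128)
    fast = transfer_seconds(size, 256)
    print(f"{name:36s} {size:>8d} B  {slow:7.2f} s @128kbps  {fast:7.2f} s @256kbps")
```

Under these assumptions an uncompressed colour frame needs close to a minute at 128 kbps, while a face crop or a histogram fits in well under a second, which is why doing detection and feature extraction on the frontend and shipping only the reduced result looks like the natural split.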
Scikit-Learn algorithms are single-node implementations. Does this mean that they are not an appropriate choice for building machine learning models on a Databricks cluster, because they cannot take advantage of the cluster's computing resources?
They are not appropriate, in the sense that, as you say, they cannot take advantage of the cluster computing resources, which Databricks is arguably all about. The raison d'être of Databricks is Apache Spark, and specifically for ML tasks, its ML library Spark MLlib.
This does not mean that you cannot use scikit-learn in Databricks (you'll find that a Databricks cluster comes with scikit-learn installed by default), only that it is usable for problems that do not actually require a cluster. If you want to exploit the cluster's resources for ML, you need to turn to Spark MLlib.
I think desertnaut hit the nail on the head here. I believe Scikit Learn algos are designed only for non-parallel processing jobs, and all the MLlib stuff is designed to leverage cluster compute resources and parallel processing resources. Take a look at the link below for sample code for standard regression and classification tasks.
https://spark.apache.org/docs/latest/ml-classification-regression.html
In addition, here are some code samples for different clustering tasks.
https://spark.apache.org/docs/latest/ml-clustering.html
That should probably cover most of the things you will be doing.
I believe that it depends on the task at hand. I see two general scenarios:
Your data is big and does not fit into memory. Go with the Spark MLlib and their distributed algos.
Your data is not that big and you want to utilize sheer computing power. The typical use case is hyperparameter search.
Databricks allows for distributing such workloads from the driver node to the executors with hyperopt and its SparkTrials (random + Bayesian search).
Some docs here:
http://hyperopt.github.io/hyperopt/scaleout/spark/
However, there are many more attempts to make sklearn work on Spark. You can supposedly distribute the workloads through UDFs, joblib, or other means. I am investigating the issue myself, and will update the answer later.
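The second scenario (small data, many independent trials) has a simple shape that can be sketched without Spark at all. The snippet below is only a stand-in: a local thread pool plays the role of the executors, and the objective function and search space are made up for illustration. With hyperopt, the same structure is obtained by passing a SparkTrials object as the `trials` argument of `fmin`.

```python
# Stand-in for the "small data, many trials" pattern: each hyperparameter
# trial is independent, so trials can be handed out to workers.
# A local thread pool replaces the Spark executors here, and the objective
# is a hypothetical toy function, not a real model.
from concurrent.futures import ThreadPoolExecutor
import random

def objective(params):
    """Hypothetical validation loss with its minimum at lr=0.1, depth=5."""
    lr, depth = params["lr"], params["depth"]
    return (lr - 0.1) ** 2 + 0.01 * (depth - 5) ** 2

def sample_params(rng):
    """Draw one random configuration from the search space."""
    return {"lr": rng.uniform(0.001, 1.0), "depth": rng.randint(1, 10)}

def random_search(n_trials=64, n_workers=4, seed=0):
    rng = random.Random(seed)
    candidates = [sample_params(rng) for _ in range(n_trials)]
    # Trials are evaluated concurrently; map() keeps the candidate order.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        losses = list(pool.map(objective, candidates))
    best = min(range(n_trials), key=losses.__getitem__)
    return candidates[best], losses[best]

best_params, best_loss = random_search()
print(best_params, best_loss)
```

Threads give no real CPU parallelism for pure-Python work, so this only shows the structure: the data stays small enough to copy to every worker, and only the parameter sets and scores travel between driver and workers.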
Golang is much faster than Python.
However, in the case of Google Cloud Dataflow where Apache Beam is used as a programming model,
I want to understand whether the processing speeds of Python and Golang are nearly the same there, or whether Golang is still much faster than Python.
So I'm looking for Golang and Python benchmark material with big data in Dataflow.
It would be even better if the material explained the cause of the speed difference.
While Go as a language has advantages over Python, Python benefits from tight interoperation with C, which gives it some speed benefits, since many of the more popular libraries have C implementations. Python is highly favoured for Machine Learning as a result, including for Beam.
Do you have any particular use cases that you would find valuable as comparisons?
The current expectation is that Go workers will have improved startup time compared to Python workers, but beyond that, it's harder to say without a concrete scenario.
I work on the Beam Go SDK. Currently the SDK isn't supported on Dataflow, and there are no comparative benchmarks between the Go and Python SDKs at present.
The Go SDK is still considered experimental. See the roadmap for the blockers in resolving that, which are currently in progress.
I developed a series of microbenchmarks using some shared-memory libraries (e.g. OpenMP, TBB, etc.) to check how they scale when varying the number of threads.
Currently I'm running them on a 4-core processor. The results are pretty reasonable, but I only got 3 points on a speedup plot.
To get more data and a wider analysis, I'm planning to run them on a 32-core machine.
One of the possibilities is to buy a 32-core processor, like the AMD Epyc or Intel Xeon, they are kinda expensive, but I know what I'll get with them.
My second and less expensive alternative is to run them on a cloud, like the Amazon AWS or Microsoft Azure.
Then, before making my choice I need some clarification:
As far as I understand AWS can make a machine with as many cores as I want, but all of them are virtualized.
When I run an application there, how reliable are the time measurements of its execution?
Will I get the same scalability that I get when I run the application on the real 32-core processor?
From decades of experience with virtualization performance, this is an area in which to be cautious. A lot will depend on the level of contention between your virtual machine and others, which in many cloud environments is difficult to know without tooling.
Also, it isn't clear whether you are discussing elapsed time and/or processor time. Both can be influenced by virtualization, though my experience is that elapsed time is more variable.
I can't speak to the listed environments, but in IBM Z virtualization solutions, we provide metrics that cover processor time consumed by the virtual machine and that consumed by the hypervisor. For your purposes, you'd want just that consumed by the virtual machine. Sorry, I don't know if either of the platforms you mentioned provide that information.
In these types of experiments, we often find it useful to do more measurement iterations to see the run-time variability.
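As a minimal sketch of that advice, the snippet below repeats a measurement several times and records both elapsed (wall-clock) time and processor time, so the spread of each can be compared. It is written in Python for brevity; the `workload` function is a placeholder, and the repeat count of 10 is an arbitrary assumption.

```python
import statistics
import time

def measure(fn, repeats=10):
    """Run fn repeatedly; return (elapsed, cpu) sample lists in seconds."""
    elapsed, cpu = [], []
    for _ in range(repeats):
        wall_start = time.perf_counter()   # elapsed (wall-clock) time
        cpu_start = time.process_time()    # CPU time used by this process
        fn()
        cpu.append(time.process_time() - cpu_start)
        elapsed.append(time.perf_counter() - wall_start)
    return elapsed, cpu

def summarize(samples):
    """Minimum, median and sample standard deviation of the measurements."""
    return {
        "min": min(samples),
        "median": statistics.median(samples),
        "stdev": statistics.stdev(samples),
    }

def workload():
    # Placeholder kernel; replace with the microbenchmark under test.
    return sum(i * i for i in range(200_000))

elapsed, cpu = measure(workload)
print("elapsed:", summarize(elapsed))
print("cpu:    ", summarize(cpu))
```

A large gap between elapsed and processor time, or a standard deviation that is large relative to the median, is a hint that something other than the benchmark itself (for example contention from other tenants) is influencing the numbers.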
After playing with the current distributed training implementation for a while, I think it views each GPU as a separate worker. However, it is common now to have 2~4 GPUs in one box. Isn't it better to adopt the single-box multi-GPU methodology to compute average gradients within a single box first and then sync up across multiple nodes? This way it eases the I/O traffic a lot, which is always the bottleneck in data parallelism.
I was told it's possible with the current implementation by having all GPUs in single box as a worker, but I am not able to figure out how to tie the average gradients with SyncReplicasOptimizer, since SyncReplicasOptimizer directly takes the optimizer as input.
Any ideas from anyone?
Distributed TensorFlow supports multiple GPUs in the same worker task. One common way to perform distributed training for image models is to perform synchronous training across multiple GPUs in the same worker, and asynchronous training across workers (though other configurations are possible). This way you only pull the model parameters to the worker once, and they are distributed among the local GPUs, easing the network bandwidth utilization.
To do this kind of training, many users perform "in-graph replication" across the GPUs in a single worker. This can use an explicit loop across the local GPU devices, like in the CIFAR-10 example model; or higher-level library support, like in the model_deploy() utility from TF-Slim.
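The arithmetic behind averaging inside a box before anything crosses the network can be illustrated in plain Python. This is not TensorFlow code, and the gradient values and the 2 boxes x 2 GPUs layout are made up for illustration.

```python
# Two-level gradient averaging in plain Python (no TensorFlow involved).

def mean_vectors(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

# Hypothetical per-GPU gradients: 2 boxes x 2 GPUs, 3 parameters each.
boxes = [
    [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]],
    [[5.0, 6.0, 7.0], [7.0, 8.0, 9.0]],
]

# Step 1: average inside each box; this traffic never leaves the machine.
local_means = [mean_vectors(gpus) for gpus in boxes]

# Step 2: only one vector per box is exchanged between machines.
global_mean = mean_vectors(local_means)

# Reference: averaging all GPU gradients in one flat step.
flat_mean = mean_vectors([g for gpus in boxes for g in gpus])

print(local_means, global_mean, flat_mean)
```

The two-level result matches the flat average here because every box holds the same number of GPUs; with uneven boxes the second step would need to weight each local mean by its GPU count. Either way, each box sends one gradient vector over the network instead of one per GPU.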
In the beginning, I would like to describe my current position and the goal that I would like to achieve.
I am a researcher dealing with machine learning. So far I have gone through several theoretical courses covering machine learning algorithms and social network analysis, and have therefore gained some theoretical concepts useful for implementing machine learning algorithms and feeding in real data.
On simple examples the algorithms work well and the running time is acceptable, whereas big data becomes a problem when I try to run the algorithms on my PC. Regarding the software, I have enough experience to implement whatever algorithm from articles or design my own using whatever language or IDE (so far I have used Matlab, Java with Eclipse, .NET...), but so far I haven't got much experience with setting up infrastructure. I have started to learn about Hadoop, NoSQL databases, etc., but I am not sure what strategy would be best, taking into consideration the learning time constraints.
The final goal is to be able to set up a working platform for analyzing big data, with a focus on implementing my own machine learning algorithms, and to put it all together into production, ready for solving useful questions by processing big data.
As the main focus is on implementing machine learning algorithms, I would like to ask whether there is any existing running platform offering enough CPU resources to feed in large data, upload my own algorithms and simply process the data without thinking about distributed processing.
Whether or not such a platform exists, I would like to gain a picture big enough to be able to work in a team that could put into production a whole system tailored to specific customer demands. For example, a retailer would like to analyze daily purchases, so all the daily records have to be uploaded to some infrastructure capable of processing the data using custom machine learning algorithms.
To put all of the above into one simple question: how do I design a custom data mining solution for real-life problems, with the main focus on machine learning algorithms, and put it into production, if possible by using existing infrastructure and, if not, by designing a distributed system (using Hadoop or whatever framework)?
I would be very thankful for any advice or suggestions about books or other helpful resources.
First of all, your question needs to define more clearly what you intend by Big Data.
Indeed, Big Data is a buzzword that may refer to problems of various sizes. I tend to define Big Data as the category of problems where the data size or the computation time is big enough for "the hardware abstractions to become broken", which means that a single commodity machine cannot perform the computations without careful handling of computation and memory.
The scale threshold beyond which data becomes Big Data is therefore unclear and is sensitive to your implementation. Is your algorithm bounded by hard-drive bandwidth? Does it have to fit into memory? Did you try to avoid unnecessary quadratic costs? Did you make any effort to improve cache efficiency, etc.?
From several years of experience in running medium large-scale machine learning challenges (on up to 250 hundreds commodity machines), I strongly believe that many problems that seem to require distributed infrastructure can actually be run on a single commodity machine if the problem is expressed correctly. For example, you mention large-scale data for retailers. I have been working on this exact subject for several years, and I have often managed to make all the computations run on a single machine, given a bit of optimisation. My company has been working on a simple custom data format that allows one year of all the data from a very large retailer to be stored within 50GB, which means a single commodity hard drive could hold 20 years of history. You can have a look, for example, at: https://github.com/Lokad/lokad-receiptstream
From my experience, it is worth spending time trying to optimize the algorithm and its memory use so that you can avoid resorting to a distributed architecture. Indeed, distributed architectures come with a triple cost. First, they require strong specialist knowledge. Second, they add a large complexity overhead to the code. Finally, they come with a significant latency overhead (with the exception of local multi-threaded distribution).
From a practitioner's point of view, being able to run a given data mining or machine learning algorithm in 30 seconds is one of the key factors in efficiency. I have noticed that when some computations, whether sequential or distributed, take 10 minutes, my focus and efficiency tend to drop quickly, as it becomes much harder to iterate and test new ideas quickly. The latency overhead introduced by many of the distributed frameworks is such that you will inevitably be in this low-efficiency scenario.
If the scale of the problem is such that even with strong effort you cannot run it on a single machine, then I strongly suggest resorting to off-the-shelf distributed frameworks instead of building your own. One of the best-known frameworks is the MapReduce abstraction, available through Apache Hadoop. Hadoop can be run on clusters of 10 thousand nodes, probably much more than you will ever need. If you do not own the hardware, you can "rent" the use of a Hadoop cluster, for example through Amazon Elastic MapReduce (EMR).
Unfortunately, the MapReduce abstraction is not suited to all Machine Learning computations.
As far as Machine Learning is concerned, MapReduce is a rigid framework and numerous cases have proved to be difficult or inefficient to adapt to this framework:
– The MapReduce framework is in itself related to functional programming. The Map procedure is applied to each data chunk independently. Therefore, the MapReduce framework is not suited to algorithms where the application of the Map procedure to some data chunks needs the results of the same procedure on other data chunks as a prerequisite. In other words, the MapReduce framework is not suited when the computations on the different pieces of data are not independent and impose a specific chronology.
– MapReduce is designed to provide a single execution of the map and reduce steps and does not directly provide iterative calls. It is therefore not directly suited to the numerous machine-learning problems that involve iterative processing (Expectation-Maximisation (EM), Belief Propagation, etc.). Implementing these algorithms in a MapReduce framework means the user has to engineer a solution that organizes results retrieval and the scheduling of the multiple iterations, so that each map iteration is launched after the reduce phase of the previous iteration has completed and is fed with the results provided by that reduce phase.
– Most MapReduce implementations have been designed to address production needs and robustness. As a result, the primary concern of the framework is to handle hardware failures and to guarantee the computation results. MapReduce efficiency is therefore partly lowered by these reliability constraints. For example, serializing computation results to hard disks turns out to be rather costly in some cases.
– MapReduce is not suited to asynchronous algorithms.
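The second point above (iterations have to be engineered by the user) can be sketched in a few lines of plain Python. The example is a toy 1-D k-means written in a map/reduce style; it runs in a single process and is not Hadoop code. The function names, the data and the starting centroids are all illustrative assumptions.

```python
from collections import defaultdict

def map_step(chunk, centroids):
    """For one data chunk, accumulate (sum, count) per nearest centroid."""
    partial = defaultdict(lambda: [0.0, 0])
    for x in chunk:
        k = min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))
        partial[k][0] += x
        partial[k][1] += 1
    return partial

def reduce_step(partials, centroids):
    """Merge the per-chunk partial sums into new centroids."""
    totals = defaultdict(lambda: [0.0, 0])
    for partial in partials:
        for k, (s, n) in partial.items():
            totals[k][0] += s
            totals[k][1] += n
    # A centroid with no assigned points keeps its previous value.
    return [totals[k][0] / totals[k][1] if totals[k][1] else c
            for k, c in enumerate(centroids)]

def kmeans_driver(chunks, centroids, max_iters=20):
    """User-written driver: each map round waits for the previous reduce."""
    for _ in range(max_iters):
        partials = [map_step(chunk, centroids) for chunk in chunks]
        new_centroids = reduce_step(partials, centroids)
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids

chunks = [[1.0, 2.0, 9.0], [10.0, 3.0, 11.0]]
print(kmeans_driver(chunks, [0.0, 5.0]))
```

Within one iteration the map calls are independent of each other, which is what MapReduce handles well. The loop in `kmeans_driver`, which feeds the reduced centroids back into the next round of maps, is the part that plain MapReduce leaves to the user, and on a real cluster each pass through it would be a separate job.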
The questioning of the MapReduce framework has led to richer distributed frameworks where more control and freedom are left to the framework user, at the price of more complexity for this user. Among these frameworks, GraphLab and Dryad (both based on Directed Acyclic Graphs of computations) are well known.
As a consequence, there is no "one size fits all" framework, just as there is no "one size fits all" data storage solution.
To start with Hadoop, you can have a look at the book Hadoop: The Definitive Guide by Tom White
If you are interested in how large-scale frameworks fit Machine Learning requirements, you may be interested in the second chapter (in English) of my PhD, available here: http://tel.archives-ouvertes.fr/docs/00/74/47/68/ANNEX/texfiles/PhD%20Main/PhD.pdf
If you provide more insight about the specific challenge you want to deal with (type of algorithm, size of the data, time and money constraints, etc.), we probably could provide you a more specific answer.
edit : another reference that could prove to be of interest : Scaling-up Machine Learning
I had to implement a couple of Data Mining algorithms to work with BigData too, and I ended up using Hadoop.
I don't know if you are familiar with Mahout (http://mahout.apache.org/), which already has several algorithms ready to use with Hadoop.
Nevertheless, if you want to implement your own Algorithm, you can still adapt it to Hadoop's MapReduce paradigm and get good results. This is an excellent book on how to adapt Artificial Intelligence algorithms to MapReduce:
Mining of Massive Datasets - http://infolab.stanford.edu/~ullman/mmds.html
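As a small illustration of what adapting an algorithm to the map/reduce shape can look like, here is a sketch of a least-squares line fit written as a sum of per-chunk statistics. It is plain single-process Python, not Hadoop code and not taken from the book; the function names and the data are made up.

```python
from functools import reduce

def map_chunk(chunk):
    """Per-chunk sufficient statistics for y = a*x + b: (n, Sx, Sy, Sxx, Sxy)."""
    n = len(chunk)
    sx = sum(x for x, _ in chunk)
    sy = sum(y for _, y in chunk)
    sxx = sum(x * x for x, _ in chunk)
    sxy = sum(x * y for x, y in chunk)
    return (n, sx, sy, sxx, sxy)

def combine(a, b):
    """Reduce step: statistics from two chunks simply add up."""
    return tuple(u + v for u, v in zip(a, b))

def fit_line(chunks):
    """Least-squares slope and intercept from the combined statistics."""
    n, sx, sy, sxx, sxy = reduce(combine, map(map_chunk, chunks))
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept

# Points on y = 2x + 1, split across two chunks.
chunks = [[(0.0, 1.0), (1.0, 3.0)], [(2.0, 5.0), (3.0, 7.0)]]
print(fit_line(chunks))
```

Because the fit only needs sums over the records, every chunk can be processed independently and the partial results merged in any order, which is the property that makes an algorithm a comfortable match for MapReduce.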
This seems to be an old question. However, given your use case, the main frameworks focusing on Machine Learning in the Big Data domain are Mahout, Spark (MLlib), H2O, etc. To run Machine Learning algorithms on Big Data, you have to convert them to parallel programs based on the MapReduce paradigm. This is a nice article giving a brief introduction to the major (not all) Big Data frameworks:
http://www.codophile.com/big-data-frameworks-every-programmer-should-know/
I hope this will help.