How can a machine learning algorithm (one available in Spark MLlib) be applied to data collected from sensors in KAA? I haven't found any such use case built on KAA. My requirement is to collect live streams of data, process and clean them, and apply a machine learning algorithm in KAA.
I have already done this by collecting the data with Apache NiFi and passing it through Kafka to a Spark Streaming application, where I apply the machine learning algorithm.
I want to do the same using KAA as the IoT platform.
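The collect, clean, and apply-a-model steps described above are independent of the platform that delivers the data. A minimal sketch in plain Python, assuming each sensor reading arrives as a dictionary: the field names `sensor_id` and `temperature`, the valid range, and the `predict` callback are all hypothetical, and no KAA, Kafka, or Spark API is used here.

```python
def clean(records, low=-40.0, high=125.0):
    """Drop readings that are missing, non-numeric, or outside [low, high]."""
    cleaned = []
    for record in records:
        raw = record.get("temperature")  # hypothetical field name
        if raw is None:
            continue
        try:
            value = float(raw)
        except (TypeError, ValueError):
            continue
        if low <= value <= high:
            cleaned.append({"sensor_id": record.get("sensor_id"),
                            "temperature": value})
    return cleaned


def process_batch(records, predict):
    """Clean one micro-batch and apply a model function to each reading."""
    return [(r["sensor_id"], predict(r["temperature"]))
            for r in clean(records)]


def is_hot(temperature):
    """Stand-in for a trained model: flags readings above 30 degrees."""
    return temperature > 30.0
```

Whatever delivers the micro-batches (a KAA log appender, a Kafka consumer, a Spark DStream) would call `process_batch` once per batch.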
I'm very new to AI and ML, but very much not new to web dev.
I have trained a TensorFlow implementation of pix2pix on my M1 GPU. I've wrapped it up in a Flask server and I want to deploy it. I've got it running locally in a Docker container, but when I deploy it to Google Cloud Run there seem to be issues related to my having trained it on ARM and then deploying it on something different (I assume x86, but I can't find docs to confirm).
I notice that the image I get back from the local Docker instance is very different from the one I get when running directly on localhost: the output from Docker is much lower quality (in terms of the model's result, not image resolution).
I'm wondering if there are specific things that need to be done when training on Apple Silicon and then deploying on more traditional cloud hardware?
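One way to narrow this down is to run the same input through the model in each environment and compare the raw outputs numerically, alongside a record of what each environment actually is. A minimal sketch, assuming the model output can be flattened to a list of floats; the function names and the tolerance are illustrative choices, not part of TensorFlow or Cloud Run.

```python
import platform
import sys


def environment_fingerprint():
    """Record the CPU architecture, OS, and Python version of this process."""
    return {
        "machine": platform.machine(),   # e.g. "arm64" or "x86_64"
        "system": platform.system(),
        "python": sys.version.split()[0],
    }


def max_abs_diff(reference, candidate):
    """Largest element-wise absolute difference between two flat outputs."""
    if len(reference) != len(candidate):
        raise ValueError("outputs have different lengths")
    return max((abs(a - b) for a, b in zip(reference, candidate)), default=0.0)


def outputs_match(reference, candidate, tolerance=1e-4):
    """True if the two outputs agree within the given tolerance."""
    return max_abs_diff(reference, candidate) <= tolerance
```

Small differences are expected between architectures. A large difference for the same input and weights points at something other than the CPU, such as a different library version or a different preprocessing path inside the container.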
Should I just train and develop in the cloud? Seems like a waste of a great local GPU.
I appreciate this is vague; my understanding of this area is limited.
I currently have a simple Machine Learning infrastructure running locally and I want to migrate this all onto Google Cloud. I simply fetch the data I need from a database, build my model and then test the model on test data. This is all done in PyCharm locally.
I want to migrate this so that everything can run on Google Cloud, while keeping the flexibility to make changes locally that also apply when running in the cloud. There are many Google Cloud resources on this, so I am looking for the best practices people follow for such a setup.
Thanks and please let me know if there are any clarifications needed.
I highly suggest you take a look at this machine learning workflow in the cloud, which consists of:
Data ingestion and collection.
Storing the data.
Processing data.
ML training.
ML deployment.
Data Ingestion and Collection
There are multiple resources you can use if you would like to ingest data with Google Cloud Platform. The simplest options I can recommend are Google Compute Engine or an App Engine app (for example, a form that users fill in).
Nonetheless, if you would like to ingest data in real-time, you can also use Cloud Pub/Sub.
Storing the data
As you mentioned, you are retrieving all the information from a database. If you are used to working with SQL, I highly suggest Cloud SQL (it is a managed relational database service, so it does not cover NoSQL workloads). Not only does it provide a good interface for building your instance, it also lets you access it securely and quickly.
If that is not the case, you can also use Google Cloud Storage or BigQuery. Of those two I would pick BigQuery, since it can also work with streaming data.
Processing data
For processing data before feeding it to the model you can use either:
Cloud Dataflow: Cloud Dataflow is a fully-managed service for transforming and enriching data in stream (real time) and batch (historical) modes with equal reliability and expressiveness -- no more complex workarounds or compromises needed.
Cloud Dataproc: Dataproc is a fast, easy-to-use, fully managed cloud service for running Apache Spark and Apache Hadoop clusters in a simpler, more cost-efficient way.
Cloud Dataprep: Cloud Dataprep by Trifacta is an intelligent data service for visually exploring, cleaning, and preparing structured and unstructured data for analysis, reporting, and machine learning.
ML training & ML deployment
For training and deploying your ML model I would suggest using AI Platform.
AI Platform makes it easy for machine learning developers, data scientists, and data engineers to take their ML projects from ideation to production and deployment, quickly and cost-effectively.
If you have to work with huge datasets, the best practice is to run the model as a TensorFlow job on AI Platform so that you can use a training cluster.
Finally for deploying your models using AI Platform, you can take a look here.
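To keep local changes usable in the cloud, it helps to write each stage of the workflow above as a function that does not care where it runs. A minimal sketch under stated assumptions: `sqlite3` stands in for whichever database you use (Cloud SQL, for instance), the `samples` table and its columns are hypothetical, and the "model" is an ordinary least-squares line rather than a real ML library.

```python
import json
import sqlite3


def fetch_rows(conn):
    """Ingestion/storage stage: read (x, y) pairs from the database."""
    return conn.execute("SELECT x, y FROM samples ORDER BY id").fetchall()


def split(rows, test_fraction=0.25):
    """Processing stage: hold out the last rows as test data."""
    cut = int(len(rows) * (1 - test_fraction))
    return rows[:cut], rows[cut:]


def train(rows):
    """Training stage: fit y = a*x + b by ordinary least squares."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in rows)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    a = cov / var_x
    return {"a": a, "b": mean_y - a * mean_x}


def evaluate(model, rows):
    """Mean absolute error of the fitted line on held-out rows."""
    return sum(abs(model["a"] * x + model["b"] - y) for x, y in rows) / len(rows)


def run_pipeline(conn):
    """Run every stage; the serialized model is what a deployment step would upload."""
    train_rows, test_rows = split(fetch_rows(conn))
    model = train(train_rows)
    return model, evaluate(model, test_rows), json.dumps(model)


def demo_connection():
    """Local stand-in database holding points on the line y = 2x + 1."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE samples (id INTEGER PRIMARY KEY, x REAL, y REAL)")
    conn.executemany("INSERT INTO samples (x, y) VALUES (?, ?)",
                     [(float(i), 2.0 * i + 1.0) for i in range(8)])
    return conn
```

Because `run_pipeline` only takes a connection, the same code can be run from PyCharm against a local database and from a cloud job against the hosted one; only the code that creates the connection changes.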
I'm currently working on a machine learning problem and have created a model in a dev environment where the dataset is small, on the order of a few hundred thousand records. How do I move the model to a production environment where the dataset is very large, on the order of billions?
Is there any general recommended way to transport machine learning models?
It depends on which development platform you're using. I know that DL4J uses a Hadoop hyperparameter server. I write my ML programs in C++ and use my own generated data; TensorFlow and others use data that is compressed and unpacked using Python. For real-time data I would suggest one of the Boost libraries, which I have found useful for handling large amounts of real-time data, for example in image processing with OpenCV. I imagine there is an equivalent set of libraries suited to your data; CSV data, for instance, is easy to process using C++ or Python. In short: real-time (Boost), images (OpenCV), CSV (Python), or you can write a program that pipes the data into your program using Bash (tricky). You could have it buffer the data and then routinely serve it to your ML program, then retrieve the data and store it in a MySQL database. It sounds like you need a data server or a data management program so that the ML algorithm just works away on its chunk of data. Hope that helps.
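The "buffer the data and serve it in chunks" idea can be sketched in a few lines of Python. This is only an illustration: `chunked` and `RunningMean` are hypothetical names, and the running mean stands in for any model that can be updated one chunk at a time without holding the full dataset in memory.

```python
from itertools import islice


def chunked(stream, size):
    """Yield successive lists of up to `size` items from any iterable."""
    iterator = iter(stream)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


class RunningMean:
    """Toy incremental 'model': a mean updated chunk by chunk."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, chunk):
        for value in chunk:
            self.count += 1
            self.mean += (value - self.mean) / self.count


def consume(stream, size):
    """Feed a stream to the model in fixed-size chunks."""
    model = RunningMean()
    for chunk in chunked(stream, size):
        model.update(chunk)
    return model
```

Because `chunked` works on any iterable, the source can be a file, a database cursor, or a socket reader, and only one chunk is held in memory at a time.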
Is it possible to split an OpenCV application into frontend and backend modules, such that the frontend runs on thin clients with very limited processing power (Intel Atom dual-core processors with 1-2 GB RAM), while the backend does most of the computational heavy lifting, for example on Google Compute Engine?
Is this possible with the additional constraint that the network link between frontend and backend is slow, limited to, say, 128-256 kbps?
Are there any precedents of this kind? Is there any such open-source project?
Are there common architectural patterns that could help in such a design?
Additional clarification:
The frontend node need NOT be purely a frontend, as in only running the user interface. I would imagine that certain OpenCV algorithms could be run on the frontend node, which is especially useful for reducing the amount of data that needs to be sent to the backend for processing (for example colour-space transformation, conversion to grayscale, or histograms). I've successfully tested real-time face detection (Haar cascade) on this low-end machine, so the frontend node can carry some of the workload. In fact, I'd prefer to do most of the work in the frontend and push to the backend only those computation-heavy parts that are clearly beyond the computational power of the frontend computer.
What I am looking for are suggestions on the kinds of algorithms that are best run on Google Compute Engine, and some tried and tested architectural patterns for achieving such a split with OpenCV.
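A quick bandwidth estimate shows why the reduction on the frontend matters so much at 128-256 kbps. The sketch below is back-of-the-envelope arithmetic only: the 640x480 frame size and the 100x100 face crop are assumed example values, frames are taken as uncompressed 8-bit data, and 1 kbps is taken as 1000 bits per second.

```python
def frame_bits(width, height, channels):
    """Size of one uncompressed 8-bit frame in bits."""
    return width * height * channels * 8


def seconds_per_frame(width, height, channels, kbps):
    """Time to send one uncompressed frame over a link of `kbps` kilobits/s."""
    return frame_bits(width, height, channels) / (kbps * 1000)


# Assumed example sizes, sent over the faster 256 kbps link.
colour_full = seconds_per_frame(640, 480, 3, 256)   # full colour frame
gray_full = seconds_per_frame(640, 480, 1, 256)     # after grayscale conversion
gray_crop = seconds_per_frame(100, 100, 1, 256)     # detected face region only
```

With these assumptions a full colour frame takes about 28.8 seconds to send, a grayscale frame about 9.6 seconds, and a 100x100 grayscale crop about 0.31 seconds. Raw frames cannot be streamed over this link, so the frontend has to send small regions of interest, compressed images, or extracted features.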
I have a few questions about using Apache Spark for real-time analytics with Java. When the Spark application is submitted, the data stored in a Cassandra database are loaded and processed by a machine learning algorithm (Support Vector Machine). Through Spark's streaming extension, when new data arrive they are persisted in the database, the model is re-trained on the existing dataset, and the SVM algorithm is executed. The output of this process is also stored back in the database.
Apache Spark's MLlib provides an implementation of a linear support vector machine. If I would like a non-linear SVM implementation, should I implement my own algorithm, or can I use existing libraries such as libsvm or jkernelmachines? These implementations are not based on Spark's RDDs; is there a way to use them without implementing the algorithm from scratch on RDD collections? If not, it would be a huge effort to test several algorithms.
Does MLlib provide out-of-the-box utilities for scaling the data before executing the SVM algorithm, as described in section 2.2 of http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf?
When a new dataset is streamed, do I need to re-train on the whole dataset? Is there any way I could just add the new data to the already trained model?
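For reference, the scaling that section 2.2 of the linked guide describes is a linear rescaling of each attribute to a fixed range such as [-1, +1], using the same scaling for training and test data. A minimal plain-Python sketch of that idea follows; it is not MLlib code, and the function names are illustrative.

```python
def fit_min_max(rows):
    """Compute the (min, max) of every feature column from the training rows."""
    return [(min(column), max(column)) for column in zip(*rows)]


def scale(rows, ranges, low=-1.0, high=1.0):
    """Linearly map each feature into [low, high] using the fitted ranges."""
    scaled_rows = []
    for row in rows:
        scaled = []
        for value, (minimum, maximum) in zip(row, ranges):
            if maximum == minimum:
                scaled.append(0.0)  # constant feature carries no information
            else:
                scaled.append(low + (high - low) * (value - minimum)
                              / (maximum - minimum))
        scaled_rows.append(scaled)
    return scaled_rows
```

The ranges are fitted on the training data only and then reused for test or streamed data, so a new value outside the training range maps outside [-1, +1] rather than shifting the scale.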
To answer your questions piecewise,
Spark provides the MLUtils class that allows you to load data from the LIBSVM format into RDDs - so just the data load portion won't stop you from utilizing that library. You could also implement your own algorithms if you know what you're doing, although my recommendation would be to take an existing one and tweak the objective function and see how it runs. Spark basically provides you the functionality of a distributed Stochastic Gradient Descent process - you can do anything with it.
Not that I know of. Hopefully someone else knows the answer.
What do you mean by re-training when the whole data is streamed?
From the docs,
.. except fitting occurs on each batch of data, so that the model continually updates to reflect the data from the stream.
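The per-batch fitting that the quoted docs describe can be illustrated for a linear SVM with stochastic gradient descent on the hinge loss. This is a plain-Python sketch of the idea, not Spark's API: the function names, learning rate, and regularization strength are assumptions chosen for illustration, and labels are taken to be -1 or +1.

```python
def dot(weights, features):
    """Inner product of two equal-length lists."""
    return sum(w * x for w, x in zip(weights, features))


def update_on_batch(weights, bias, batch, learning_rate=0.1, reg=0.01):
    """One SGD pass over a batch for an L2-regularized hinge loss.

    The model state (weights, bias) is carried from batch to batch,
    so earlier data never has to be revisited.
    """
    for features, label in batch:
        margin = label * (dot(weights, features) + bias)
        # Shrink the weights (gradient of the L2 penalty).
        weights = [w - learning_rate * reg * w for w in weights]
        if margin < 1:
            # The point is misclassified or inside the margin: move toward it.
            weights = [w + learning_rate * label * x
                       for w, x in zip(weights, features)]
            bias += learning_rate * label
    return weights, bias


def predict(weights, bias, features):
    """Return +1 or -1 depending on the side of the separating hyperplane."""
    return 1 if dot(weights, features) + bias >= 0 else -1
```

Each arriving batch calls `update_on_batch` with the weights left by the previous batch, which is what "adding new data to the already trained model" means for a linear model. A kernel (non-linear) SVM does not update this simply.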