Storing training dataset in a platform like mlflow

Storing training dataset in a platform like mlflow - machine-learning

Are there experiment management platforms that also allow
storing and managing training datasets (images, in my case)? I am familiar with the ML-Flow, but AFAIK it doesn't support such an option, am I right? If there are no platforms like this, how would you suggest managing training datasets in combination with existing platforms?

You can use a cloud service's object storage to store and manage your training dataset. Any cloud provider has such a solution. Then, your code should allow ingesting data from storage buckets to train and do experiment tracking.

Related

Migrate from running ML training and testing locally to Google Cloud

I currently have a simple Machine Learning infrastructure running locally and I want to migrate this all onto Google Cloud. I simply fetch the data I need from a database, build my model and then test the model on test data. This is all done in PyCharm locally.
I want to simply migrate this and have the possibility for all this to be done on Google Cloud, while having the flexibility to make local changes that can apply when run on the cloud as well. There are many Google Cloud resources relating to this and so I am looking for best practices people follow on running such a procedure.
Thanks and please let me know if there are any clarifications needed.

I highly suggest you to take a look at this machine learning workflow in the cloud which consists of:
Data Ingestion and Collection
Storing the data.
Processing data.
ML training.
ML deployment.
Data Ingestion and Collection
There are multiple resources you can use if you would like to ingest data with Google Cloud Platform. The simplest solution I can recommend to you are both Google Compute Engine or an App Engine App (for example for a forum where a user fill some data up).
Nonetheless, if you would like to ingest data in real-time, you can also use Cloud Pub/Sub.
Storing the data
As you mentioned, you are retrieving all the information from a database. If you are used to work with SQL or NoSQL I highy suggest you to go after Cloud SQL. Not only provides a good interface when building your instance, but also lets you access it securely and very rapidly.
If it not the case, you can also use Google Cloud Storage or BigQuery, but over those two, I will pick BigQuery since it has also the possibility to work with stream data.
Processing data
For processing data before feeding it to the model you can use either:
Cloud DataFlow: Cloud Dataflow is a fully-managed service for transforming and enriching data in stream (real time) and batch (historical) modes with equal reliability and expressiveness -- no more complex workarounds or compromises needed.
Cloud Dataproc: Dataproc is a fast, easy-to-use, fully managed cloud service for running Apache Spark and Apache Hadoop clusters in a simpler, more cost-efficient way.
Cloud Dataprep: Cloud Dataprep by Trifacta is an intelligent data service for visually exploring, cleaning, and preparing structured and unstructured data for analysis, reporting, and machine learning.
ML training & ML deployment
For training/deploying your ML model I would suggest to use AI platform.
AI Platform makes it easy for machine learning developers, data scientists, and data engineers to take their ML projects from ideation to production and deployment, quickly and cost-effectively.
If you have to work with huge datasets, the best practices are run the model as a Tensorflow job with AI Platform so you can have a training cluster.
Finally for deploying your models using AI Platform, you can take a look here.

Data processing while using tensorflow serving (Docker/Kubernetes)

I am looking to host 5 deep learning models where data preprocessing/postprocessing is required.
It seems straightforward to host each model using TF serving (and Kubernetes to manage the containers), but if that is the case, where should the data pre and post-processing take place?

I'm not sure there's a single definitive answer to this question, but I've had good luck deploying models at scale bundling the data pre- and post-processing code into fairly vanilla Go or Python (e.g., Flask) applications that are connected to my persistent storage for other operations.
For instance, to take the movie recommendation example, on the predict route it's pretty performant to pull the 100 films a user has watched from the database, dump them into a NumPy array of the appropriate size and encoding, dispatch to the TensorFlow serving container, and then do the minimal post-processing (like pulling the movie name, description, cast from a different part of the persistent storage layer) before returning.

Additional options to josephkibe's answer, you can:
Implementing processing into model itself (see signatures for keras models and input receivers for estimators in SavedModel guide).
Install Seldon-core. It is a whole framework for serving that handles building images and networking. It builds service as a graph of pods with different API's, one of them are transformers that pre/post-process data.

Using Google ML Engine with BigQuery?

I'm currently designing a data warehouse in BigQuery. I'm planning to store user data like past purchases or abandoned carts.
This seems to be perfect to manually analyze trends and to get insights. But what if I want to leverage Machine Learning, e.g. to suggest products to a group of users?
I have looked into Google ML Engine and TensorFlow, and it seems like the TensorFlow model would need to query BigQuery first. In some scenarios, this could mean that TensorFlow would need to query all or most of the data that is stored in BigQuery.
This feels a bit off, so I'm wondering if this is really how things are supposed to happen. Otherwise, I assume that my ML model would have to work with stale data?

So I would agree with you, using BigQuery as a data warehouse for your ML is expensive. It would be cheaper and much more efficient to use Google Cloud Storage to store all the data you wish to process. Once everything is processed and generated, you may then wish to push that data to BigQuery push that data to another source like Spanner or even Cloud Storage.
That being said Google has now created a beta product BigQuery ML. This now allows users to create and execute machine learning models in BigQuery via the use of SQL queries. I believe it uses python and tensorflow under the hood, but I believe it would be the best solution given that you have a light weight ML load.
Since it is still in beta as of now, I don't know well it's performance compares to Google ML engine and tensorflow.

Depending on what kind of model you want to train and how you want to server the model you can do one the following options:
You can export your data to Google Cloud Storage as CSV and then read the files in Cloud ML Engine. This will let you use the power of Tensorflow and you can then use Cloud ML Engine's serving system to send traffic to your model.
On the downside, this means that you have to export all of your BigQuery data to GCS and every time you decide to make any change to the data you need to go back to BigQuery and export again. Also if the data you want to prediction on is in BigQuery you have to export that as well and send it to Cloud ML Engine using a separate system.
If you want to explore and interactively train Logistic or Linear regression models on your data, you can use BigQuery Machine learning. This will allow you to slice and dice your data in BigQuery and experiment with different parts of your data and various preprocessing options. You can also use all the power of SQL. BigQuery ML also allows you to use the model after training within BigQuery (you can use SQL to feed data in to the model).
For many cases using full power of Tensorflow (i.e. using DNNs) is not necessary. This is especially true for structured data. On the other hand, most of your time will be spent on preprocessing and cleaning the data which would be much easier in SQL in BigQuery.
So you have two options here. Choose based on your needs.
P.S.: You can also try using BigQuery Reader in Tensorflow. I don't recommend it as it is very slow. But if your data is not huge it may work for you.

What is the recommended method to transport machine learning models?

I'm currently working on a machine learning problem and created a model in Dev environment where the data set is low in the order of few hundred thousands. How do I transport the model to Production environment where data set is very large in the order of billions.
Is there any general recommended way to transport machine learning models?

Depends on which Development Platform your using. I know that DL4J uses Hadoop Hyper Parameter server. I write my ML progs in C++ and use my own generated data, TensorFlow and others use Data that is compressed and unpacked using Python. For Realtime data I would suggest using one of the Boost librarys as I have found it useful in dealing with large amounts of RT data for example Image Processing with OpenCV. But I imagine there must be an equivalent set of librarys suited to your data. CSV data is easy to process using C++ or Python. Realtime (Boost), Images (OpenCV), csv (Python) or you can just write a program that pipes the data into your program using Bash (Tricky). You could have it buffer the data somehow and then routinely serve the data to your ML program and then retrieve the data and store it in a Mysql Database. Sounds like you need a Data server or a Data management program so the ML algo just works away on its chunk of data. Hope that helps.

Can Google Cloud Vision API be trained using your image data?

IBM Watson has a capability where you can train the classifiers on Watson using your images but I am unable to find a similar capability on Google Cloud Vision API? What I want is that I upload 10-15 classes of images and on the bases of upload images classify any images loaded after that. IBM Bluemix (Watson) has this capability but their pricing is significantly higher than Google. I am open to other services as well, if prices ares below Google's

As far as I know Google Cloud Vision API cannot be trained with your custom data. However, there is a service called vize.ai, where you can define your custom classes and upload the images, the training is for free and the prices for API usage are below Google's and IBM's.
Disclaimer: I'm vize.it co-founder
Edit: Link changed

You can train your own models using Cloud AutoML Vision. There are 2 different ways to do this:
Cloud-hosted models.
Edge exportable models.

With some work you can train a model for free using TensorFlow - see the model training section.
However, they have released an already trained model, so if you're lucky and what you want to classify already overlaps with their model, then no training is needed.

Azure has started this now, google for "azure custom vision" this is still a preview service but with good accuracy at least for our workload which is preschool children images.

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart