Does tensorflow-federated support placing training data on client side? - tensorflow-federated

It's great to see that tensorflow-federated now supports distributed training. I referred to the example here.
However, it seems the training data is sent from the server to the clients at each epoch, and the client (remote_executor_service) doesn't hold any dataset. This differs from the typical federated learning scenario. So I was wondering: can I place the training data separately on each client?

Two things to note:
The example linked (High-performance Simulation with Kubernetes) is running a federated learning simulation. At this time TensorFlow Federated is primarily used for executing research simulations. There has yet to be a means for deploying to real smart phones. In the simulation, each client dataset is logically separate, but may be physically present on the same machine.
Creating a tf.data.Dataset (e.g. the definition of train_data in the tutorial) can be thought of as creating a "recipe to read data" rather than actually reading the data itself. For example, adding .batch() or .map() calls to a dataset returns a new recipe, but doesn't materialize any dataset examples. The dataset isn't actually read until a .reduce() call, or until the dataset is iterated over in a for loop. In the tutorial, the dataset "recipe" is sent to remote workers; the data is read/materialized remotely when the dataset is iterated over during local training (the data itself is not sent to the remote workers).
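The "recipe" idea can be illustrated with a small pure-Python analogy. This is only a sketch: LazyDataset, read_source and reads are hypothetical names invented for the illustration, and the class is not how tf.data is implemented. It shows that chaining map() and batch() reads nothing, and that the source is only read once the pipeline is iterated.

```python
class LazyDataset:
    """Toy stand-in for a dataset 'recipe' (an analogy, not tf.data itself)."""

    def __init__(self, make_iterator):
        # make_iterator: zero-argument function returning a fresh iterator.
        self._make_iterator = make_iterator

    def map(self, fn):
        # Returns a new recipe; nothing is read here.
        return LazyDataset(lambda: (fn(x) for x in self._make_iterator()))

    def batch(self, size):
        def batches():
            chunk = []
            for x in self._make_iterator():
                chunk.append(x)
                if len(chunk) == size:
                    yield chunk
                    chunk = []
            if chunk:  # final, possibly smaller batch
                yield chunk
        return LazyDataset(batches)

    def __iter__(self):
        return self._make_iterator()


reads = []  # records every example actually read from the "source"

def read_source():
    for i in range(5):
        reads.append(i)
        yield i

# Building the recipe does not touch the source.
recipe = LazyDataset(read_source).map(lambda x: x * 10).batch(2)
reads_before_iteration = list(reads)

# Iterating is what materializes the examples.
batches = list(recipe)
```

In this analogy, shipping the recipe object to a worker and iterating it there corresponds to the data being read remotely rather than being sent over the wire.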

Related

How to build federated learning model of unbalanced and small dataset

I am working to build a federated learning model using TFF and I have some questions:
I am preparing the dataset. I have separate data files with the same features but different samples, and I would treat each file as a single client. How can I represent this in TFF?
The data is not balanced, meaning the amount of data varies from file to file. Does this affect the modeling process?
The data is fairly small: one file (client) has 300 records and another has 1500 records. Is that suitable for building a federated learning model?
Thanks in advance
You can create a ClientData for your dataset, see Working with tff's ClientData.
The dataset doesn't have to be balanced to build a federated learning model. In https://arxiv.org/abs/1602.05629, the server takes a weighted average of the clients' model updates, where the weights are the number of samples each client has.
A few hundred records per client is no less than in the EMNIST dataset, so that would be fine. About the total number of clients: this tutorial shows FL with 10 clients; you can run the colab with a smaller NUM_CLIENTS to see how it works on the example dataset.
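As a rough illustration of the weighting described above, here is a minimal sketch of a sample-count-weighted average over two clients with 300 and 1500 records. The function name federated_average and the update vectors are made up for the example; this is plain Python, not the TFF API.

```python
def federated_average(client_updates):
    """Average update vectors, weighting each client by its number of examples.

    client_updates: list of (num_examples, update_vector) pairs.
    """
    total = sum(n for n, _ in client_updates)
    dim = len(client_updates[0][1])
    return [
        sum(n * update[i] for n, update in client_updates) / total
        for i in range(dim)
    ]

# Illustrative (made-up) model updates from two unbalanced clients.
updates = [
    (300, [1.0, 2.0]),    # small client
    (1500, [4.0, 8.0]),   # large client
]
averaged = federated_average(updates)
```

The larger client contributes five times the weight of the smaller one, so the result sits much closer to its update.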

Using Google ML Engine with BigQuery?

I'm currently designing a data warehouse in BigQuery. I'm planning to store user data like past purchases or abandoned carts.
This seems to be perfect to manually analyze trends and to get insights. But what if I want to leverage Machine Learning, e.g. to suggest products to a group of users?
I have looked into Google ML Engine and TensorFlow, and it seems like the TensorFlow model would need to query BigQuery first. In some scenarios, this could mean that TensorFlow would need to query all or most of the data that is stored in BigQuery.
This feels a bit off, so I'm wondering if this is really how things are supposed to happen. Otherwise, I assume that my ML model would have to work with stale data?
So I would agree with you: using BigQuery as a data warehouse for your ML is expensive. It would be cheaper and much more efficient to use Google Cloud Storage to store all the data you wish to process. Once everything is processed and generated, you may then wish to push that data to BigQuery, or to another store like Spanner or even Cloud Storage.
That being said, Google has now created a beta product, BigQuery ML. It allows users to create and execute machine learning models in BigQuery using SQL queries. I believe it uses Python and TensorFlow under the hood, and I believe it would be the best solution given that you have a lightweight ML load.
Since it is still in beta as of now, I don't know how well its performance compares to Google ML Engine and TensorFlow.
Depending on what kind of model you want to train and how you want to serve the model, you can choose one of the following options:
You can export your data to Google Cloud Storage as CSV and then read the files in Cloud ML Engine. This will let you use the power of Tensorflow and you can then use Cloud ML Engine's serving system to send traffic to your model.
On the downside, this means that you have to export all of your BigQuery data to GCS, and every time you change the data you need to go back to BigQuery and export again. Also, if the data you want to predict on is in BigQuery, you have to export that as well and send it to Cloud ML Engine using a separate system.
If you want to explore and interactively train Logistic or Linear regression models on your data, you can use BigQuery Machine learning. This will allow you to slice and dice your data in BigQuery and experiment with different parts of your data and various preprocessing options. You can also use all the power of SQL. BigQuery ML also allows you to use the model after training within BigQuery (you can use SQL to feed data in to the model).
For many cases, using the full power of Tensorflow (i.e. using DNNs) is not necessary. This is especially true for structured data. On the other hand, most of your time will be spent on preprocessing and cleaning the data, which would be much easier in SQL in BigQuery.
So you have two options here. Choose based on your needs.
P.S.: You can also try using BigQuery Reader in Tensorflow. I don't recommend it as it is very slow. But if your data is not huge it may work for you.

How to deploy machine learning algorithm in production environment?

I'm new to machine learning algorithms. I'm learning basic algorithms like regression, classification, clustering, sequence modelling and on-line algorithms. All the articles available on the internet show how to use these algorithms with specific data. There are no articles about deploying those algorithms in a production environment. So my questions are:
1) How to deploy machine learning algorithm in production environment?
2) The typical approach in machine learning tutorials is to build the model using some training data and then evaluate it on test data. But is it advisable to use that kind of model in a production environment? Incoming data may keep changing, so the model will become ineffective. How long should the model refresh cycle be to accommodate such changes?
I am not sure this is a good question (it is too general and not well formulated), but I suggest you read about the bias-variance tradeoff. Long story short, you could have a low bias/high variance machine-learning model and get 100% accurate results on your training data (the data you used to build the model), but only because the model overfits that data. As a result, when you use it on data it hasn't seen during training, it will perform poorly. On the other hand, you may have a high bias/low variance model, which fits your training data poorly and will perform just as badly on new production data. Keeping this in mind, the general guideline is:
1) Obtain some good amount of data which you could use to build a prototype of machine-learning system
2) Split your data into train set, cross-validation set and test set
3) Create a model with relatively low bias (good accuracy, or better, a good F1 score) on your training data. Then try this model on the cross-validation set to see the results. If the results are bad, you have a high variance problem: the model overfits the data and can't generalize well. Rewrite your model, play with the model parameters or use a different algorithm. Repeat until you get a good result on the CV set.
4) Since we tuned the model to get a good result on the CV set, you want to test your final model on the test set. If it is good, that's it: you have a final version of the model and can use it in the prod environment.
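Step 2 of the guideline above can be sketched in a few lines. This is a minimal illustration, assuming a 60/20/20 split and a fixed shuffle seed; the function name split_dataset and the fractions are choices made for the example, not something the answer prescribes.

```python
import random

def split_dataset(examples, train_frac=0.6, cv_frac=0.2, seed=0):
    """Shuffle once, then cut into train / cross-validation / test parts."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_train = round(len(shuffled) * train_frac)
    n_cv = round(len(shuffled) * cv_frac)
    train = shuffled[:n_train]
    cv = shuffled[n_train:n_train + n_cv]
    test = shuffled[n_train + n_cv:]  # whatever remains is held out for the end
    return train, cv, test

# Stand-in data: 100 example ids instead of real records.
train_set, cv_set, test_set = split_dataset(range(100))
```

The model is fitted on train_set, tuned against cv_set, and test_set is only used once at the end.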
The second question has no single answer; it depends on your data and your application. But two general approaches might be used:
1) Do everything I mentioned earlier to build a model with good performance on the test set. Re-train your model on new data periodically (try different periods, or re-train once you see that the model's performance has dropped).
2) Use an online-learning approach. This is not applicable to many algorithms, but in some cases it can be used. Generally, if you can use the stochastic gradient descent learning method, you can use online learning and keep your model up to date with the newest production data.
Keep in mind that even if you use #2 (the online-learning approach) you can't be sure that your model will be good forever. Sooner or later the data you get may change significantly and you may want to use a whole different model (for example, switch to an ANN instead of an SVM or logistic regression).
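To make the online-learning option concrete, here is a minimal sketch of stochastic gradient descent applied one example at a time. It assumes a deliberately tiny model (a single weight, prediction = weight * x, squared error) and a made-up data stream following y = 2x; sgd_update and the learning rate are illustrative choices.

```python
def sgd_update(weight, x, y, learning_rate=0.1):
    """One stochastic gradient step for the model y_hat = weight * x."""
    error = weight * x - y
    gradient = error * x  # derivative of 0.5 * (weight * x - y) ** 2
    return weight - learning_rate * gradient

weight = 0.0
# Made-up stream of (x, y) pairs that follow y = 2 * x.
stream = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)] * 10
for x, y in stream:
    # In production, each new labelled example would trigger one update.
    weight = sgd_update(weight, x, y)
```

Because each update needs only the current example, the model can keep tracking the stream without being re-trained from scratch.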
DISCLAIMER: I work for this company, Datmo building a better workflow for ML. We’re always looking to help fellow developers working on ML so feel free to reach out to me at anand#datmo.com if you have any questions.
1) In order to deploy, you should first split up your code into preprocessing, training and test. This way you can easily encapsulate the required components for deployment. Usually, you will then want to take your preprocessing, test, as well as your weights file (the output of your training process) and put them in one folder. Next, you will want to host this on a server and wrap an API server around this. I would suggest a Flask Restful API so that you can use query parameters as your inputs and output your response in standard JSON blobs.
To host it on a server, you can use this article which talks about how you can deploy a Flask API on EC2.
You can load a model and serve it as an API as shown in this code.
2) Hard for me to answer without more details. It's highly dependent on the type of data and the type of model. For example, for deep learning, there is no such thing as online learning.
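For the deployment step in 1) above, a minimal Flask sketch might look like the following. It assumes Flask is installed; the /predict route, the x query parameter and the WEIGHTS dictionary (standing in for a real weights file) are all hypothetical names chosen for the illustration.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Hypothetical stand-in for the weights file produced by training.
WEIGHTS = {"slope": 2.0, "intercept": 1.0}

def preprocess(raw_value):
    """Preprocessing step: parse the raw query parameter into a float."""
    return float(raw_value)

def predict(x):
    """Inference step: apply the (toy) trained model."""
    return WEIGHTS["slope"] * x + WEIGHTS["intercept"]

@app.route("/predict", methods=["GET"])
def predict_endpoint():
    raw = request.args.get("x")
    if raw is None:
        return jsonify({"error": "missing query parameter 'x'"}), 400
    try:
        x = preprocess(raw)
    except ValueError:
        return jsonify({"error": "x must be a number"}), 400
    # Respond with a standard JSON blob.
    return jsonify({"x": x, "prediction": predict(x)})
```

During development the app can be started with Flask's built-in server (for example by calling app.run()), and a request such as /predict?x=3 returns the prediction as JSON. A production deployment would normally sit behind a proper WSGI server.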
I am sorry that my comments do not include much detail, since I am also a newbie in "deployment" of ML. But since the author is also new to ML, I hope this basic guidance is helpful as well.
For "deployment", you should
Have ML algorithms: you may use free tools, develop your own tool using libraries in Python, R, Java, .NET, etc., or use a system in the cloud.
Train those ML models using training datasets
Save those trained models. (You should search this topic based on your development environment. There are file formats that Tensorflow/Keras provide, and formats like pickle, ONNX, etc. I would like to write a whole list here, with their supported languages and environments, advantages and disadvantages, and loadability, but I am still investigating this topic as a newbie.)
And THEN, you can deploy these saved models to production. In production you should either have your own application to run the saved model (for example, an application developed in Python that takes the trained and saved .pickle file and test data as input, and simply gives the prediction for the test data as output), or you should have an environment/framework that runs the saved models (search for ML environments/frameworks in the cloud). First, you should clarify your need: do you need a stand-alone program in production, or will you serve an internal web service, or go via the cloud, etc.?
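The save-then-load flow described above can be sketched with Python's pickle module. This is a minimal illustration: the "model" is just a dictionary of made-up parameters, and predict is a hypothetical helper, so the sketch only shows the mechanics of saving in one place and loading in another.

```python
import os
import pickle
import tempfile

def predict(model, features):
    """Toy linear classifier: returns 1 if the weighted sum is positive, else 0."""
    score = sum(w * x for w, x in zip(model["weights"], features)) + model["bias"]
    return 1 if score > 0 else 0

# "Training" side: made-up parameters standing in for a trained model.
trained_model = {"weights": [0.5, -1.0], "bias": 0.25}

model_path = os.path.join(tempfile.mkdtemp(), "model.pickle")
with open(model_path, "wb") as f:
    pickle.dump(trained_model, f)

# "Production" side: load the saved model and predict on new data.
with open(model_path, "rb") as f:
    loaded_model = pickle.load(f)

prediction = predict(loaded_model, [2.0, 0.5])
```

Note that unpickling can execute arbitrary code, so only load pickle files that come from a source you trust.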
For the second question: as the answers above indicate, the issue is the "online training ability" of the models. Please also note that for "online learning", your production environment has to feed your production tool/system with the true label of the test data as well. Will you have that capability?
Note: All of the above are just small "comments" rather than a clear answer, but technically I am not able to write comments yet. Thanks for not downvoting :)
Regarding the first question, my service mlrequest makes deploying models to production simple. You can get started with a free API key that provides 50k model transactions a month.
This code will train and deploy, or update your model across 5 global data centers.
from mlrequest import Classifier
classifier = Classifier('my-api-key')
features = {'feature1': 'val1', 'feature2': 45}
training_data = {'features': features, 'label': 2}
r = classifier.learn(training_data=training_data, model_name='my-model', class_count=2)
This is how you make predictions, latency-routed to the nearest data center to get the quickest response.
features = {'feature1': 'val1', 'feature2': 77}
r = classifier.predict(features=features, model_name='my-model', class_count=2)
r.predict_result
Regarding your second question, it completely depends on the problem you are solving. Some models need to be frequently updated, while others almost never need to be updated.

Apache Spark (MLLib) for real time analytics

I have a few questions related to the use of Apache Spark for real-time analytics using Java. When the Spark application is submitted, the data stored in a Cassandra database are loaded and processed via a machine learning algorithm (Support Vector Machine). Through Spark's streaming extension, when new data arrive they are persisted in the database, the model is re-trained on the existing dataset and the SVM algorithm is executed. The output of this process is also stored back in the database.
Apache Spark's MLLib provides an implementation of a linear support vector machine. If I would like a non-linear SVM implementation, should I implement my own algorithm, or may I use existing libraries such as libsvm or jkernelmachines? These implementations are not based on Spark's RDDs; is there a way to use them without implementing the algorithm from scratch using RDD collections? If not, it would be a huge effort to test several algorithms.
Does MLLib provide out-of-the-box utilities for data scaling before executing the SVM algorithm, as described in section 2.2 of http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf?
While a new dataset is streamed, do I need to re-train on the whole dataset? Is there any way that I could just add the new data to the already trained data?
To answer your questions piecewise,
Spark provides the MLUtils class that allows you to load data from the LIBSVM format into RDDs - so just the data load portion won't stop you from utilizing that library. You could also implement your own algorithms if you know what you're doing, although my recommendation would be to take an existing one and tweak the objective function and see how it runs. Spark basically provides you the functionality of a distributed Stochastic Gradient Descent process - you can do anything with it.
Not that I know of. Hopefully someone else knows the answer.
What do you mean by re-training when the whole data is streamed?
From the docs,
.. except fitting occurs on each batch of data, so that the model continually updates to reflect the data from the stream.
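On the data-scaling question, the kind of scaling the linked guide discusses (linearly mapping each feature to a fixed range such as [-1, 1], and reusing the same factors for new data) can be sketched independently of Spark. The following is a minimal plain-Python illustration with made-up rows; fit_scaler and scale are hypothetical helper names, not MLLib functions.

```python
def fit_scaler(rows):
    """Record the (min, max) of every feature column in the training rows."""
    columns = list(zip(*rows))
    return [(min(col), max(col)) for col in columns]

def scale(rows, ranges, lower=-1.0, upper=1.0):
    """Linearly map each feature to [lower, upper] using the fitted ranges."""
    scaled = []
    for row in rows:
        new_row = []
        for value, (lo, hi) in zip(row, ranges):
            if hi == lo:
                new_row.append((lower + upper) / 2)  # constant feature
            else:
                new_row.append(lower + (upper - lower) * (value - lo) / (hi - lo))
        scaled.append(new_row)
    return scaled

train_rows = [[1.0, 10.0], [3.0, 20.0], [5.0, 30.0]]
new_rows = [[7.0, 15.0]]  # e.g. data arriving later from the stream

ranges = fit_scaler(train_rows)          # fitted on the training data only
train_scaled = scale(train_rows, ranges)
new_scaled = scale(new_rows, ranges)     # same factors reused for new data
```

New values can land outside [-1, 1], as the first feature of new_rows does here; that is expected when the ranges are fixed from the training data.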

how to train a classifier using video datasets

If I have a video dataset of a specific action, how can I use it to train a classifier that can later be used to classify this action?
The question is very generic. In general, there is no foolproof way of training a classifier that will work for everything. It depends heavily on the data you are working with.
Here is the 'generic' pipeline:
extract features from the video
label your features (positive for the action you are looking for; negative otherwise)
split your data into 2 (or 3) sets. One for training, one for testing and the other optionally for validation
train a classifier on the labeled examples (e.g. SVM, Neural Network, Nearest Neighbor ...)
validate the results on the validation data, if that is appropriate for the algorithm
test on data you haven't used for training.
You can start with some machine learning tools here http://www.cs.waikato.ac.nz/ml/weka/
Make sure you never touch the test data for any other purposes than testing
Good luck
Almost 10 years later, here's an updated answer.
Set up a camera and collect raw video data
Save it somewhere in the form of single frames. Do this yourself locally, using a cloud bucket, or using a service like Sieve API. Helpful repo linked here.
Export from Sieve or cloud bucket to get data labeled. Do this yourself or using some service like Scale Rapid.
Split your dataset into train, test, and validation.
Train a classifier on the labeled samples. Use transfer learning over some existing model and fine-tune just the last few layers.
Run your model over the validation set after each training epoch and save the one with the best validation set performance.
Evaluate your model at the end using the test set.
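The "keep the best epoch" step in the list above can be sketched as a small loop. This is only an illustration: train_one_epoch and score_on_selection_split are hypothetical callables supplied by the caller, the "models" are just epoch numbers, and the scores are made up.

```python
def train_with_checkpointing(train_one_epoch, score_on_selection_split, num_epochs):
    """Train for num_epochs and keep the model that scores best on the held-out split."""
    best_score, best_model = float("-inf"), None
    for epoch in range(num_epochs):
        model = train_one_epoch(epoch)
        score = score_on_selection_split(model)
        if score > best_score:  # strictly better, so the earliest best epoch wins ties
            best_score, best_model = score, model
    return best_model, best_score

# Stand-ins so the sketch runs: each "model" is its epoch number,
# and the per-epoch scores are made-up values.
made_up_scores = [0.61, 0.72, 0.78, 0.75, 0.74]
best_model, best_score = train_with_checkpointing(
    train_one_epoch=lambda epoch: epoch,
    score_on_selection_split=lambda model: made_up_scores[model],
    num_epochs=len(made_up_scores),
)
```

The split used for this per-epoch selection should be different from the one used for the final evaluation, so that the final number is not biased by the selection.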
There are many repos that can help you get started: https://github.com/weiaicunzai/awesome-image-classification
The two things that can help you ensure best results include 1. high quality labeled data and 2. a diverse, curated dataset. That's what Sieve can help with!
