I am trying to study federated machine learning on time series data. The data is collected from multiple clients. How to convert this data into federated data ?
In TensorFlow Federated we generally think of federated data as a dataset pivoted on the clients. It sounds like here it would be useful to pivot on clients, but retain the time series ordering within each client's data.
jpgard gives a great answer in How to create federated dataset from a CSV file? that can be used as an example for other file formats.
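As a rough illustration (not taken from the linked answer), here is a minimal sketch of pivoting per-client CSV files into federated data while keeping each client's time ordering. The file names, column names, and the use of tff.simulation.datasets.TestClientData are assumptions for illustration; older TFF releases expose a similar FromTensorSlicesClientData.

```python
import collections
import pandas as pd
import tensorflow_federated as tff

# Hypothetical layout: one CSV per client, rows already in time order.
client_files = {"client_0": "client_0.csv", "client_1": "client_1.csv"}

def client_tensors(path):
    # Keep the original row order so the time series ordering is preserved.
    df = pd.read_csv(path)
    return collections.OrderedDict(
        x=df[["feature_1", "feature_2"]].values.astype("float32"),
        y=df["label"].values.astype("float32"),
    )

# Build a ClientData keyed by client id from the in-memory per-client tensors.
client_data = tff.simulation.datasets.TestClientData(
    {cid: client_tensors(path) for cid, path in client_files.items()}
)

# Each client's tf.data.Dataset keeps the row (time) order of its CSV.
example_ds = client_data.create_tf_dataset_for_client("client_0")
```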
For extracting a causal DAG from time series data, I have read some papers that use MLPs/LSTMs as well as other algorithms. For ease of use, however, I want to use the TETRAD software, but I do not understand how to input time series data, such as stock exchange or hospital emergency data, into the software. For example, here is a sample of the data I am using. I am having trouble making the software understand that the data has a temporal aspect and that it needs to model it as a time series in order to extract the causal DAG. I also cannot find proper instructions on which algorithm to use in TETRAD for time series causal relationship extraction, or on how to model the data. As causal inference is not my primary field of study, any guidance would be helpful.
I tried using the FGES algorithm to extract the relationships, but I am not sure whether that is the correct approach.
I have many CSV files and I want to create clients and give one CSV file to each. Is there a method to do so?
I need a tutorial on how to convert a dataset into a federated dataset, or any useful links or examples on the same.
I am working to build a federated learning model using TFF and I have some questions:
I am preparing the dataset. I have separate data files with the same features but different samples, and I would like to treat each file as a single client. How can I maintain this in TFF?
The data is not balanced, meaning the amount of data varies across files. Does this affect the modeling process?
The dataset is somewhat small: one file (client) has 300 records and another has 1500. Is it suitable for building a federated learning model?
Thanks in advance
You can create a ClientData for your dataset, see Working with tff's ClientData.
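As a rough sketch of the one-file-per-client setup (the directory, file names, column layout, and constants here are made up): newer TFF versions provide ClientData.from_clients_and_tf_fn, while older releases have a similar from_clients_and_fn constructor. The dataset function must be built from TF ops because the client id arrives as a string tensor.

```python
import tensorflow as tf
import tensorflow_federated as tff

# Hypothetical layout: data/clientA.csv, data/clientB.csv, each with two
# float feature columns followed by a float label column, plus a header row.
DATA_DIR = "data/"
CLIENT_IDS = ["clientA", "clientB"]

def serializable_dataset_fn(client_id):
    # client_id is a tf.string tensor, so the path is built with TF string ops.
    path = tf.strings.join([DATA_DIR, client_id, ".csv"])
    csv_rows = tf.data.experimental.CsvDataset(
        path, record_defaults=[tf.float32, tf.float32, tf.float32], header=True)
    # Pack each CSV row into a (features, label) pair.
    return csv_rows.map(lambda f1, f2, y: (tf.stack([f1, f2]), y))

client_data = tff.simulation.datasets.ClientData.from_clients_and_tf_fn(
    CLIENT_IDS, serializable_dataset_fn)

# Each client id now maps to its own file-backed dataset.
train_ds = client_data.create_tf_dataset_for_client("clientA")
```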
The dataset doesn't have to be balanced to build a federated learning model. In https://arxiv.org/abs/1602.05629, the server performs weighted federated averaging of the clients' model updates, where the weights are the number of samples each client has.
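To make the weighting concrete, here is a toy NumPy sketch of that weighted average; the update vectors and sample counts are made up.

```python
import numpy as np

# Toy model updates from three clients and their (unbalanced) sample counts.
client_updates = [np.array([0.10, 0.20]),
                  np.array([0.30, 0.05]),
                  np.array([0.00, 0.40])]
samples_per_client = np.array([300, 1500, 700])

# FedAvg weights each client's update by its share of the total samples,
# so a client with more data moves the global model more.
weights = samples_per_client / samples_per_client.sum()
global_update = sum(w * u for w, u in zip(weights, client_updates))
print(global_update)
```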
A few hundred records per client is no fewer than what the EMNIST dataset has, so that should be fine. As for the total number of clients: this tutorial shows FL with 10 clients; you can run the colab with a smaller NUM_CLIENTS to see how it works on the example dataset.
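For instance, a rough sketch of sampling a smaller number of EMNIST clients (the sampling scheme is just for illustration, and the tutorial's preprocessing is omitted):

```python
import numpy as np
import tensorflow_federated as tff

NUM_CLIENTS = 5  # smaller than the tutorial's 10, just to see how it behaves

emnist_train, emnist_test = tff.simulation.datasets.emnist.load_data()

# Sample a handful of client ids and build one tf.data.Dataset per client.
rng = np.random.default_rng(0)
sampled_ids = rng.choice(emnist_train.client_ids, size=NUM_CLIENTS, replace=False)
federated_train_data = [
    emnist_train.create_tf_dataset_for_client(cid) for cid in sampled_ids
]
```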
It's great to see that tensorflow-federated now supports distributed training. I referred to the example here.
However, it seems the training data are sent from the server to the clients at each epoch, and the client (remote_executor_service) doesn't hold any dataset. This is different from the typical federated learning scenario. So I was wondering: could I place the training data separately on each client?
Two things to note:
The example linked (High-performance Simulation with Kubernetes) is running a federated learning simulation. At this time, TensorFlow Federated is primarily used for executing research simulations; there is not yet a way to deploy to real smartphones. In the simulation, each client dataset is logically separate, but may be physically present on the same machine.
The creation of a tf.data.Dataset (e.g. the definition of train_data in the tutorial) can be thought of as creating a "recipe to read data" rather than actually reading the data itself. For example, adding .batch() or .map() calls to a dataset returns a new recipe, but doesn't materialize any dataset examples. The dataset isn't actually read until a .reduce() call, or until it is iterated over in a for loop. In the tutorial, the dataset "recipe" is sent to the remote workers; the data is read/materialized remotely when the dataset is iterated over during local training (the data itself is not sent to the remote workers).
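A small standalone example of this laziness (not from the tutorial, just plain tf.data):

```python
import tensorflow as tf

# Building a dataset is like writing a recipe: map/batch add steps lazily.
ds = tf.data.Dataset.range(10)
ds = ds.map(lambda x: x * 2)   # no data is read yet
ds = ds.batch(4)               # still just a recipe

# The data is only materialized when the dataset is iterated or reduced.
for batch in ds:
    print(batch.numpy())       # reading happens here

total = ds.unbatch().reduce(tf.constant(0, tf.int64), lambda s, x: s + x)
print(total.numpy())           # and here
```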
I'm currently designing a data warehouse in BigQuery. I'm planning to store user data like past purchases or abandoned carts.
This seems to be perfect to manually analyze trends and to get insights. But what if I want to leverage Machine Learning, e.g. to suggest products to a group of users?
I have looked into Google ML Engine and TensorFlow, and it seems like the TensorFlow model would need to query BigQuery first. In some scenarios, this could mean that TensorFlow would need to query all or most of the data that is stored in BigQuery.
This feels a bit off, so I'm wondering if this is really how things are supposed to happen. Otherwise, I assume that my ML model would have to work with stale data?
So I would agree with you: using BigQuery as a data warehouse for your ML is expensive. It would be cheaper and much more efficient to use Google Cloud Storage to store all the data you wish to process. Once everything is processed and generated, you may then wish to push that data to BigQuery, or to another source like Spanner or even Cloud Storage.
That being said, Google has now released a beta product, BigQuery ML, which allows users to create and execute machine learning models in BigQuery using SQL queries. I believe it uses Python and TensorFlow under the hood, and it would likely be the best solution if you have a lightweight ML load.
Since it is still in beta as of now, I don't know how well its performance compares to Google ML Engine and TensorFlow.
Depending on what kind of model you want to train and how you want to serve it, you can do one of the following:
You can export your data to Google Cloud Storage as CSV and then read the files in Cloud ML Engine. This will let you use the power of Tensorflow and you can then use Cloud ML Engine's serving system to send traffic to your model.
On the downside, this means you have to export all of your BigQuery data to GCS, and every time you decide to change the data you need to go back to BigQuery and export it again. Also, if the data you want to predict on is in BigQuery, you have to export that as well and send it to Cloud ML Engine using a separate system.
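To illustrate the export step in option 1, here is a minimal sketch using the google-cloud-bigquery Python client; the project, dataset, table, and bucket names are made up.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical names: adjust project, dataset, table, and bucket to your setup.
table_ref = "my-project.analytics.user_purchases"
destination_uri = "gs://my-ml-bucket/exports/user_purchases-*.csv"

# Export the BigQuery table to sharded CSV files in Cloud Storage,
# which Cloud ML Engine / TensorFlow can then read for training.
extract_job = client.extract_table(table_ref, destination_uri)
extract_job.result()  # wait for the export to finish
```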
If you want to explore and interactively train logistic or linear regression models on your data, you can use BigQuery ML. This allows you to slice and dice your data in BigQuery and experiment with different subsets and various preprocessing options, and you can use the full power of SQL. BigQuery ML also lets you use the model after training from within BigQuery (you can use SQL to feed data into the model).
For many cases, using the full power of TensorFlow (i.e. using DNNs) is not necessary, especially for structured data. On the other hand, most of your time will be spent on preprocessing and cleaning the data, which is much easier to do with SQL in BigQuery.
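As a rough sketch of what option 2 can look like when driven from the BigQuery Python client (the project, dataset, table, and column names here are made up):

```python
from google.cloud import bigquery

client = bigquery.Client()

# Train a logistic regression model inside BigQuery; `purchased` is the
# hypothetical label column.
create_model_sql = """
CREATE OR REPLACE MODEL `analytics.purchase_model`
OPTIONS (model_type = 'logistic_reg', input_label_cols = ['purchased']) AS
SELECT age, num_past_purchases, abandoned_carts, purchased
FROM `my-project.analytics.user_features`
"""
client.query(create_model_sql).result()  # training runs inside BigQuery

# Predictions are also plain SQL, so the data never leaves BigQuery.
predict_sql = """
SELECT user_id, predicted_purchased
FROM ML.PREDICT(MODEL `analytics.purchase_model`,
                (SELECT user_id, age, num_past_purchases, abandoned_carts
                 FROM `my-project.analytics.new_users`))
"""
rows = client.query(predict_sql).result()
```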
So you have two options here. Choose based on your needs.
P.S.: You can also try using the BigQuery Reader in TensorFlow. I don't recommend it, as it is very slow, but if your data is not huge it may work for you.