I am working to build a federated learning model using TFF and I have some questions:
I am preparing the dataset, I have separate files of data, with same features and different samples. I would consider each of these files as a single client. How can I maintain this in TFF?
The data is not balanced, meaning, the size of data varies in each file. Is this affecting the modeling process?
The size of the data is a bit small, one file (client) is having 300 records and another is 1500 records, is it suitable to build a federated learning model?
Thanks in advance
You can create a ClientData for your dataset, see Working with tff's ClientData.
The dataset doesn't have to balanced to build a federated learning model. In https://arxiv.org/abs/1602.05629, the server takes weighted federated averaging of client's model updates, where the weights are the number of samples each client has.
A few hundred records per client is no less than the EMNIST dataset, so that would be fine. About the total number of clients: this tutorial shows FL with 10 clients, you can run the colab with smaller NUM_CLIENTS to see how it works on the example dataset.
Related
It's very awesome to see that tensorflow-federated could support distributed training now. I referred to the example here.
However, it seems the training data are sent from server to client at each epoch, and the client(remote_executor_service) doesn't hold any dataset. It is different from typical federated learning scenario. So I was wondering could I place training data separately on each client?
Two things to note:
The example linked (High-performance Simulation with Kubernetes) is running a federated learning simulation. At this time TensorFlow Federated is primarily used for executing research simulations. There has yet to be a means for deploying to real smart phones. In the simulation, each client dataset is logically separate, but may be physically present on the same machine.
The creation of a tf.data.Dataset (e.g. the definition of train_data in the tutorial) could be thought of like creating a "recipe to read data", than actually reading the data itself. For example, adding .batch() or .map() calls to a dataset returns a new recipes, but doesn't actually materialize any dataset examples. The dataset isn't actually read until a .reduce() call, or the dataset is iterated over in a for loop. In the tutorial, the dataset "recipe" is sent to remote workers; the data is read/materialized remotely when the dataset is iterated over during local training (the data itself is not sent to the remote workers).
I have a lot of data, each data has 3 inputs, 6 outputs, how can I analyze the relationship between these data through big data analysis or machine learning? When the new 3 inputs are provided, the new 6 output data is automatically given.
input:1,2,3 output:4,5,6,7,8,9
input:4,5,6 output:???
Your question is about the capacity of a learning machine. In supervised learning, the machine learns from a lot of examples. In your case, if you have a lot of data that are labeled, you may want to build a multiple layer perceptron to study the samples. In this architecture, you would want 3 input neurons and 6 output neurons and multiple layers between them. On the other hand, if you believe there is a generating pattern in your training data, you may want to use statistical model. In either case you need a lot of data to train a machine. In your example, it would confuse a machine just like confusing a human since it has too few samples and has too much possibilities.
I'm trying to utilize a pre-trained model like Inception v3 (trained on the 2012 ImageNet data set) and expand it in several missing categories.
I have TensorFlow built from source with CUDA on Ubuntu 14.04, and the examples like transfer learning on flowers are working great. However, the flowers example strips away the final layer and removes all 1,000 existing categories, which means it can now identify 5 species of flowers, but can no longer identify pandas, for example. https://www.tensorflow.org/versions/r0.8/how_tos/image_retraining/index.html
How can I add the 5 flower categories to the existing 1,000 categories from ImageNet (and add training for those 5 new flower categories) so that I have 1,005 categories that a test image can be classified as? In other words, be able to identify both those pandas and sunflowers?
I understand one option would be to download the entire ImageNet training set and the flowers example set and to train from scratch, but given my current computing power, it would take a very long time, and wouldn't allow me to add, say, 100 more categories down the line.
One idea I had was to set the parameter fine_tune to false when retraining with the 5 flower categories so that the final layer is not stripped: https://github.com/tensorflow/models/blob/master/inception/README.md#how-to-retrain-a-trained-model-on-the-flowers-data , but I'm not sure how to proceed, and not sure if that would even result in a valid model with 1,005 categories. Thanks for your thoughts.
After much learning and working in deep learning professionally for a few years now, here is a more complete answer:
The best way to add categories to an existing models (e.g. Inception trained on the Imagenet LSVRC 1000-class dataset) would be to perform transfer learning on a pre-trained model.
If you are just trying to adapt the model to your own data set (e.g. 100 different kinds of automobiles), simply perform retraining/fine tuning by following the myriad online tutorials for transfer learning, including the official one for Tensorflow.
While the resulting model can potentially have good performance, please keep in mind that the tutorial classifier code is highly un-optimized (perhaps intentionally) and you can increase performance by several times by deploying to production or just improving their code.
However, if you're trying to build a general purpose classifier that includes the default LSVRC data set (1000 categories of everyday images) and expand that to include your own additional categories, you'll need to have access to the existing 1000 LSVRC images and append your own data set to that set. You can download the Imagenet dataset online, but access is getting spotier as time rolls on. In many cases, the images are also highly outdated (check out the images for computers or phones for a trip down memory lane).
Once you have that LSVRC dataset, perform transfer learning as above but including the 1000 default categories along with your own images. For your own images, a minimum of 100 appropriate images per category is generally recommended (the more the better), and you can get better results if you enable distortions (but this will dramatically increase retraining time, especially if you don't have a GPU enabled as the bottleneck files cannot be reused for each distortion; personally I think this is pretty lame and there's no reason why distortions couldn't also be cached as a bottleneck file, but that's a different discussion and can be added to your code manually).
Using these methods and incorporating error analysis, we've trained general purpose classifiers on 4000+ categories to state-of-the-art accuracy and deployed them on tens of millions of images. We've since moved on to proprietary model design to overcome existing model limitations, but transfer learning is a highly legitimate way to get good results and has even made its way to natural language processing via BERT and other designs.
Hopefully, this helps.
Unfortunately, you cannot add categories to an existing graph; you'll basically have to save a checkpoint and train that graph from that checkpoint onward.
I am trying to predict from my past data which has around 20 attribute columns and a label. Out of those 20, only 4 are significant for prediction. But i also want to know that if a row falls into one of the classified categories, what other important correlated columns apart from those 4 and what are their weight. I want to get that result from my deployed web service on Azure.
You can use permutation feature importance module but that will give importance of the features across the sample set. Retrieving the weights on per call basis is not available in Azure ML.
If I have a video dataset of a specific action , how could I use them to train a classifier that could be used later to classify this action.
The question is very generic. In general, there is no foul proof way of training a classifier that will work for everything. It highly depends on the data you are working with.
Here is the 'generic' pipeline:
extract features from the video
label your features (positive for the action you are looking for; negative otherwise)
split your data into 2 (or 3) sets. One for training, one for testing and the other optionally for validation
train a classifier on the labeled examples (e.g. SVM, Neural Network, Nearest Neighbor ...)
validate the results on the validation data, if that is appropriate for the algorithm
test on data you haven't used for training.
You can start with some machine learning tools here http://www.cs.waikato.ac.nz/ml/weka/
Make sure you never touch the test data for any other purposes than testing
Good luck
Almost 10 years later, here's an updated answer.
Set up a camera and collect raw video data
Save it somewhere in form of single frames. Do this yourself locally or using a cloud bucket or use a service like Sieve API. Helpful repo linked here.
Export from Sieve or cloud bucket to get data labeled. Do this yourself or using some service like Scale Rapid.
Split your dataset into train, test, and validation.
Train a classifier on the labeled samples. Use transfer learning over some existing model and fine-tune just the last few layers.
Run your model over the test set after each training epoch and save the one with the best test set performance.
Evaluate your model at the end using the validation set.
There are many repos that can help you get started: https://github.com/weiaicunzai/awesome-image-classification
The two things that can help you ensure best results include 1. high quality labeled data and 2. a diverse, curated dataset. That's what Sieve can help with!