Platform for benchmarking of classifiers - machine-learning

I need a platform (Java) for testing different text classifiers against a single training/benchmarking data set. Of course, different classifiers may come from different vendors and have different APIs, so I will have to write adapters. The purpose of the platform is to manage the training data and the invocation of training/classification/benchmarking. Are you aware of any open-source platform like this?
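To make the adapter idea concrete, here is a minimal sketch of the shape such a platform could take. It is written in Python for brevity; the same structure maps onto a Java interface with one implementation per vendor. All names (`ClassifierAdapter`, `MajorityClassAdapter`, `benchmark`) and the toy data are hypothetical and do not come from any existing library.

```python
from abc import ABC, abstractmethod
from collections import Counter


class ClassifierAdapter(ABC):
    """Common interface that every vendor-specific adapter implements."""

    @abstractmethod
    def train(self, texts, labels):
        """Fit the wrapped classifier on the shared training data."""

    @abstractmethod
    def classify(self, text):
        """Return the predicted label for one text."""


class MajorityClassAdapter(ClassifierAdapter):
    """Toy baseline standing in for a real vendor classifier."""

    def __init__(self):
        self._label = None

    def train(self, texts, labels):
        # Remember the most frequent training label.
        self._label = Counter(labels).most_common(1)[0][0]

    def classify(self, text):
        return self._label


def benchmark(adapters, train_set, test_set):
    """Train every adapter on the same data and report its accuracy."""
    train_texts, train_labels = zip(*train_set)
    results = {}
    for name, adapter in adapters.items():
        adapter.train(list(train_texts), list(train_labels))
        correct = sum(adapter.classify(text) == label for text, label in test_set)
        results[name] = correct / len(test_set)
    return results


train = [("cheap pills", "spam"), ("buy now", "spam"), ("meeting at noon", "ham")]
test = [("free offer", "spam"), ("lunch tomorrow", "ham")]
scores = benchmark({"majority": MajorityClassAdapter()}, train, test)
```

The platform only ever talks to `ClassifierAdapter`, so adding a new vendor means writing one more adapter class and registering it in the dictionary passed to `benchmark`.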

Related

In FL, can clients train different model architectures?

I am practicing with this tutorial, and I would like each client to train a different architecture and a different model. Is this possible?
TFF does support different clients having different model architectures.
However, the Federated Learning for Image Classification tutorial uses tff.learning.build_federated_averaging_process, which implements the Federated Averaging (McMahan et al., 2017) algorithm, defined as each client receiving the same architecture. This is accomplished in TFF by "mapping" (in the functional programming sense) the model to each client dataset to produce a new model, and then aggregating the result.
To achieve different clients having different architectures, a different federated learning algorithm would need to be implemented. There are a couple of (non-exhaustive) ways this could be expressed:
Implement an alternative to ClientFedAvg. This method applies a fixed model to the client's dataset. An alternate implementation could potentially create a different architecture per client.
Create a replacement for tff.learning.build_federated_averaging_process that uses a different function signature, splitting out groups of clients that would receive different architectures. For example, currently FedAvg looks like:
(<state@SERVER, data@CLIENTS> → <state@SERVER, metrics@SERVER>)
This could be replaced with a method with the signature:
(<state@SERVER, data1@CLIENTS, data2@CLIENTS, ...> → <state@SERVER, metrics@SERVER>)
This would allow the function to internally tff.federated_map() different model architectures to different client datasets. This would likely only be useful in FL simulations or experimentation and research.
However, in federated learning there will be difficult questions around how to aggregate the models back on the server into a single global model. This probably needs to be designed out first.
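The idea of splitting clients into groups can be illustrated without TFF. The following is a minimal pure-Python sketch, not TFF code: each client update is tagged with its architecture group, and the server runs a separate example-weighted average per group, as Federated Averaging does for a single group. The function names and the `"cnn"`/`"mlp"` labels are hypothetical. Keeping one server model per group sidesteps, rather than solves, the question of merging everything into a single global model.

```python
from collections import defaultdict


def weighted_average(updates):
    """Average flat weight vectors, weighting each by its number of examples."""
    total = sum(num_examples for _, num_examples in updates)
    dim = len(updates[0][0])
    return [sum(w[i] * n for w, n in updates) / total for i in range(dim)]


def aggregate_by_group(client_updates):
    """client_updates: iterable of (group_name, weights, num_examples)."""
    groups = defaultdict(list)
    for group, weights, num_examples in client_updates:
        groups[group].append((weights, num_examples))
    # Weights are only averaged with others of the same architecture.
    return {group: weighted_average(ups) for group, ups in groups.items()}


updates = [
    ("cnn", [1.0, 2.0, 3.0], 10),
    ("cnn", [3.0, 4.0, 5.0], 30),
    ("mlp", [0.5, 0.5], 20),
]
server_models = aggregate_by_group(updates)
```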

Other compression methods for Federated Learning

I noticed that the Gradient Quantization compression method is already implemented in the TFF framework. How about non-traditional compression methods where we select a sub-model by dropping some parts of the global model? I came across the "Federated Dropout" compression method in the paper "Expanding the Reach of Federated Learning by Reducing Client Resource Requirements" (https://arxiv.org/abs/1812.07210). Any idea if the Federated Dropout method is already supported in TensorFlow Federated? If not, any insights on how to implement it (the main idea of the method is dropping a fixed percentage of the activations and filters in the global model to exchange and train a smaller sub-model)?
Currently, there is no implementation of this idea available in the TFF code base.
But here is an outline of how you could do it. I recommend starting from examples/simple_fedavg:
1. Modify top-level build_federated_averaging_process to accept two model_fns -- one server_model_fn for the global model, one client_model_fn for the smaller sub-model structure actually trained on clients.
2. Modify build_server_broadcast_message to extract only the relevant sub-model from the server_state.model_weights. This would be the mapping from server model to client model.
3. The client_update may actually not need to be changed (I am not 100% sure), as long as only the client_model_fn is provided from client_update_fn.
4. Modify server_update - the weights_delta will be the update to the client sub-model, so you will need to map it back to the larger global model.
In general, steps 2 and 4 are tricky, as they depend not only on what layers are in a model, but also on how they are connected. So it will be hard to create an easy-to-use general solution, but it should be OK to write these for a specific model structure you know in advance.
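For a single dense layer, the two mappings (server model to sub-model, and sub-model update back to server model) could look roughly like the sketch below. This is plain Python on nested lists, not TFF code, and every name in it is hypothetical. It assumes a weight matrix indexed as `weights[input][output_unit]` and drops whole output units. In a multi-layer model, the same kept indices would also have to select the matching input rows of the next layer, which is where the dependence on connectivity comes in.

```python
import random


def choose_kept_units(num_units, drop_rate, rng):
    """Pick the indices of the units that stay in the sub-model."""
    num_kept = num_units - int(num_units * drop_rate)
    return sorted(rng.sample(range(num_units), num_kept))


def extract_submodel(weights, kept):
    """Server-to-client mapping: keep only the columns listed in `kept`."""
    return [[row[j] for j in kept] for row in weights]


def apply_sub_delta(weights, sub_delta, kept):
    """Client-to-server mapping: add the sub-model delta to the kept columns."""
    updated = [list(row) for row in weights]
    for i, delta_row in enumerate(sub_delta):
        for k, j in enumerate(kept):
            updated[i][j] += delta_row[k]
    return updated


rng = random.Random(0)
global_weights = [[0.0] * 4 for _ in range(2)]  # 2 inputs, 4 output units
kept = choose_kept_units(4, drop_rate=0.5, rng=rng)
sub_weights = extract_submodel(global_weights, kept)
# Stand-in for the delta that client training would produce.
sub_delta = [[1.0] * len(kept) for _ in sub_weights]
new_global = apply_sub_delta(global_weights, sub_delta, kept)
```

Columns that were dropped for this round are left untouched in `new_global`; only the kept units receive the client update.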
We have several compression schemas implemented in our simulator:
"FL_PyTorch: Optimization Research Simulator for Federated Learning."
https://burlachenkok.github.io/FL_PyTorch-Available-As-Open-Source/
https://github.com/burlachenkok/flpytorch
FL_PyTorch is a suite of open-source software written in Python that builds on top of PyTorch, one of the most popular research Deep Learning (DL) frameworks. We built FL_PyTorch as a research simulator for FL to enable fast development, prototyping, and experimenting with new and existing FL optimization algorithms. Our system supports abstractions that provide researchers with sufficient flexibility to experiment with existing and novel approaches to advance the state of the art. The work appears in the proceedings of the 2nd International Workshop on Distributed Machine Learning, DistributedML 2021. The paper, presentation, and appendix are available in the DistributedML’21 Proceedings (https://dl.acm.org/doi/abs/10.1145/3488659.3493775).

Is there a way to use external, compiled packages for data processing in Google's AI Platform?

I would like to set up a prediction task, but the data preprocessing step requires using tools outside of Python's data science ecosystem, though Python has APIs to work with those tools (e.g. a compiled Java NLP tool set). I first thought about creating a Docker container to have an environment with those tools available, but a commenter has said that this is not currently supported. Is there perhaps some other way to make such tools available to the Python prediction class needed for AI Platform? I don't really have a clear sense of what's happening on the backend with AI Platform, or how much ability a user has to modify or set that up.
Not possible today. Is there any specific use case you are targeting not satisfied today?
Cloud AI Platform offers multiple prediction frameworks (TensorFlow, scikit-learn, XGBoost, PyTorch, custom predictions) in multiple versions.
After looking into the requirements, you can use the new AI Platform feature, custom prediction routines: https://cloud.google.com/ml-engine/docs/tensorflow/custom-prediction-routine-keras
To deploy a custom prediction routine to serve predictions from your trained model, do the following:
Create a custom predictor to handle requests
Package your predictor and your preprocessing module. Here you can install your custom libraries.
Upload your model artifacts and your custom code to Cloud Storage
Deploy your custom prediction routine to AI Platform
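As a rough sketch of the first two steps, a custom predictor is a Python class that exposes a `predict` method and a `from_path` class method. Everything else below is an assumption made for illustration: the class name `TextPredictor`, the `tokenize` function (a pure-Python stand-in for the call into an external NLP tool), the toy scoring model, and the `model.pkl` artifact name. Whether a given compiled tool can actually be installed in the serving environment is not something this sketch settles.

```python
import os
import pickle


def tokenize(text):
    """Hypothetical stand-in for a call into an external NLP tool."""
    return text.lower().split()


class TextPredictor(object):
    """Follows the predict / from_path shape of a custom prediction routine."""

    def __init__(self, token_scores):
        # Toy "model": a dict mapping tokens to scores.
        self._token_scores = token_scores

    def predict(self, instances, **kwargs):
        # Preprocess each raw instance, then score it with the loaded model.
        results = []
        for text in instances:
            tokens = tokenize(text)
            results.append(sum(self._token_scores.get(t, 0) for t in tokens))
        return results

    @classmethod
    def from_path(cls, model_dir):
        # model.pkl is an assumed artifact name, not one the service requires.
        with open(os.path.join(model_dir, "model.pkl"), "rb") as f:
            return cls(pickle.load(f))
```

The preprocessing module is packaged together with the predictor, so the same code runs at serving time as at training time.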

A light and accurate classifier that can run on a device with limited resources

I have a project in which I need to classify data coming from several sensors (time-series data), such as a gyroscope, into several classes. I have used several classifiers, including SVM, decision trees, neural networks, KNN, etc., in a batch scenario. My ultimate goal is to find a real-time classifier that is accurate, light, and able to improve itself, so that I can implement it on my device, which has limited resources (CPU, RAM, etc.). I was thinking of a semi-supervised classifier, since I can save a few labeled data points on my device and use future data points to improve my classifier. Does anyone have any recommendation or experience in this regard?
Online learning is very challenging. I recommend you steer away from it for now and use batch learning. You can always update the model as you update the mobile app, or just make the app look for a new updated model on your server every x days.
Now, how do you run a machine learning algorithm efficiently on a phone with limited resources? First, you have to identify which platform you are using. I assume you want a platform-agnostic answer. Most ML algorithms (except lazy learning ones) can run efficiently on a smartphone; have a look at this benchmarking experiment.
You have several options here:
iOS: Here's a list of all machine learning libraries available publicly.
Android: Weka for Android, this lib has a huge number of ML algorithms.
Platform-agnostic deep learning: TensorFlow, where you can export your models to TensorFlow Lite (tutorial) and deploy them on any mobile OS, and Caffe2, which lets you train deep learning models and export them to any smartphone OS.

Why doesn't spark.ml implement any of the spark.mllib algorithms?

Following the Spark MLlib Guide we can read that Spark has two machine learning libraries:
spark.mllib, built on top of RDDs.
spark.ml, built on top of Dataframes.
According to this and this question on StackOverflow, Dataframes are better (and newer) than RDDs and should be used whenever possible.
The problem is that I want to use common machine learning algorithms (e.g. Frequent Pattern Mining, Naive Bayes, etc.) and spark.ml (for DataFrames) doesn't provide such methods; only spark.mllib (for RDDs) provides these algorithms.
If DataFrames are better than RDDs and the referred guide recommends the use of spark.ml, why aren't common machine learning methods implemented in that library?
What's the missing point here?
Spark 2.0.0
Currently Spark is moving strongly towards the DataFrame API, with ongoing deprecation of the RDD API. While the number of native "ML" algorithms is growing, the main points highlighted below are still valid, and internally many stages are implemented directly using RDDs.
See also: Switch RDD-based MLlib APIs to maintenance mode in Spark 2.0
Spark < 2.0.0
I guess that the main missing point is that spark.ml algorithms in general don't operate on DataFrames. So in practice it is more a matter of having an ml wrapper than anything else. Even native ML implementations (like ml.recommendation.ALS) use RDDs internally.
Why not implement everything from scratch on top of DataFrames? Most likely because only a very small subset of machine learning algorithms can actually benefit from the optimizations which are currently implemented in Catalyst, not to mention be efficiently and naturally implemented using the DataFrame API / SQL.
The majority of ML algorithms require an efficient linear algebra library, not tabular processing. Using a cost-based optimizer for linear algebra could be an interesting addition (I think that Flink already has one), but it looks like for now there is nothing to gain here.
The DataFrame API gives you very little control over the data. You cannot use a partitioner*, you cannot access multiple records at a time (I mean a whole partition, for example), you're limited to a relatively small set of types and operations, you cannot use mutable data structures, and so on.
Catalyst applies local optimizations. If you pass a SQL query / DSL expression, it can analyze it, reorder it, and apply early projections. All of that is great, but typical scalable algorithms require iterative processing. So what you really want to optimize is a whole workflow, and DataFrames alone are not faster than plain RDDs and, depending on the operation, can actually be slower.
Iterative processing in Spark, especially with joins, requires fine-grained control over the number of partitions, otherwise weird things happen. DataFrames give you no control over partitioning. Also, DataFrame / Dataset don't provide native checkpoint capabilities (fixed in Spark 2.1), which makes iterative processing almost impossible without ugly hacks.
Ignoring low-level implementation details, some groups of algorithms, like FPM, don't fit very well into the model defined by ML pipelines.
Many optimizations are limited to native types, not UDT extensions like VectorUDT.
There is one more problem with DataFrames, which is not really related to machine learning. When you decide to use a DataFrame in your code, you give away almost all the benefits of static typing and type inference. It is highly subjective whether you consider this a problem or not, but one thing is for sure: it doesn't feel natural in the Scala world.
Regarding better, newer and faster I would take a look at Deep Dive into Spark SQL’s Catalyst Optimizer, in particular the part related to quasiquotes:
The following figure shows that quasiquotes let us generate code with performance similar to hand-tuned programs.
* This was changed in Spark 1.6, but it is still limited to the default HashPartitioning.

Resources