I am trying to implement federated learning with TensorFlow Federated. I cannot get the TensorFlow model to train on the dataset that exists on the client machine. The process I followed is below.
I have one server machine which hosts the dataset to be used for federated learning. I have created the model and the TFF federated averaging process on the server.
The remote executor service is running on a client machine (a GCP VM). The server broadcast is working fine and the model training executes on the client machine.
But the data for model training is passed to the client machine as a parameter of the broadcast process. Is there a way to train the model with the data hosted on the client machine itself?
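For reference, this is roughly the driver loop implied above, using TFF's documented Federated Averaging API (names such as model_fn and federated_train_data stand in for the actual code). The client data is an argument to next(), which is why it travels from the driver to the clients:

    import tensorflow_federated as tff

    # Build the federated averaging process from the model constructor.
    iterative_process = tff.learning.build_federated_averaging_process(model_fn)
    state = iterative_process.initialize()

    for round_num in range(10):
        # federated_train_data is a list of per-client tf.data.Datasets; because
        # it is passed here as an argument, it is shipped to the clients along
        # with the broadcast -- the behaviour the question wants to avoid.
        state, metrics = iterative_process.next(state, federated_train_data)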
Recently my Raspberry Pi 3B+ stopped booting correctly. I have decided to buy a new Raspberry Pi 4 and want to transfer a Docker container (created with docker-compose) running Teslamate (a self-hosted data logger for Tesla cars).
I have copied all the files under /var/lib/docker from the old SD card, but I don't know how to recreate a container that uses all the previous Teslamate data.
Additional information:
Teslamate is written in Elixir
Data is stored in a Postgres database
Visualization and data analysis with Grafana
Vehicle data is published to a local MQTT Broker
Any clue?
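To help frame an answer, here is a rough sketch of what restoring the named volumes could look like, assuming the stock Teslamate docker-compose.yml (the volume name teslamate_teslamate-db and the mount path /mnt/old-sd are assumptions and depend on your compose project name):

    # Mount the old SD card on another machine, then copy the named volumes
    # into the new Pi's Docker directory (paths/names below are assumptions):
    sudo cp -a /mnt/old-sd/var/lib/docker/volumes/teslamate_teslamate-db \
               /var/lib/docker/volumes/

    # With the same docker-compose.yml on the new Pi, compose re-attaches the
    # recreated containers to the restored volume by name:
    docker-compose up -d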
When the ML model gets trained, it should be moved automatically from on-premises to Azure Storage.
How can I automate storing an on-premises-trained ML model in an Azure Storage account? The goal here is that when the model gets trained, it is automatically stored inside a storage account container.
There are several solutions that can help copy the trained model files from on-premises to Azure Storage.
Use the azcopy sync command to replicate the source location to the destination location. Even if your on-premises OS is Linux, you can run it via crontab at an interval (a sketch follows this list).
Use Azure/azure-storage-fuse to mount a container of Azure Blob Storage into the Linux filesystem, then save the trained model files directly to the mounted path, if the on-premises training machine is Linux.
Mount an Azure File Share on Windows, Linux, or macOS via SMB 3.0 as a directory in your on-premises filesystem, then save the trained model files into it.
At the end of the Python training script, add some code using the Azure Storage SDK for Python to upload the trained model files directly to Azure Storage (a sketch of this also follows).
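For option 1, a sketch of what the crontab entry could look like (the local path, storage account, container, and SAS token below are placeholders):

    # crontab -e: sync the model directory to a blob container every 15 minutes
    */15 * * * * azcopy sync "/home/user/models" "https://<account>.blob.core.windows.net/<container>?<SAS-token>" --recursive

For option 4, a minimal sketch using the azure-storage-blob v12 SDK (the connection string, container, and file names are assumptions):

    from azure.storage.blob import BlobServiceClient

    # Assumed names; replace with your own connection string, container, and paths.
    service = BlobServiceClient.from_connection_string("<connection-string>")
    blob = service.get_blob_client(container="models", blob="model.pkl")

    # Upload the serialized model right after training finishes.
    with open("model.pkl", "rb") as f:
        blob.upload_blob(f, overwrite=True)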
Hope it helps.
I have built a custom model in SageMaker and serialized it through pickle. I want to deploy my model through SageMaker hosting services and have read through this:
https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-hosting.html
But I am lost on how to build my own Docker container for a custom model whose algorithm is not currently implemented as part of the Amazon estimators.
How do I build my own Docker image, load it into ECR, and then build the container that allows me to create an endpoint?
Have a look at this guide: https://github.com/awslabs/amazon-sagemaker-examples/tree/master/advanced_functionality/scikit_bring_your_own/container
It shows how to create a Container for both Training Jobs and Endpoint deployment.
If you only need to deploy the Endpoint, you can skip the training part.
As mentioned in the documentation, for a SageMaker endpoint you need a Docker container with a web server that listens for HTTP requests on the "/ping" and "/invocations" routes.
In the guide, they have implemented a Flask web server running behind NGINX and Gunicorn.
For your use case:
https://github.com/awslabs/amazon-sagemaker-examples/tree/master/advanced_functionality/scikit_bring_your_own/container/decision_trees
In this directory, you can skip the "train" file and keep the rest of the files intact, except for the "predictor.py" file. That is the file you will modify to implement your own inference algorithm; a rough sketch of its shape follows.
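This is not the guide's exact code, just a sketch of what predictor.py might look like for a pickled model; the artifact name model.pkl and the JSON request format are assumptions, while /opt/ml/model is where SageMaker mounts your model artifacts:

    import os
    import pickle

    import flask

    app = flask.Flask(__name__)
    MODEL_PATH = os.path.join("/opt/ml/model", "model.pkl")  # assumed artifact name
    _model = None

    def _get_model():
        # Lazily unpickle the model once per worker process.
        global _model
        if _model is None:
            with open(MODEL_PATH, "rb") as f:
                _model = pickle.load(f)
        return _model

    @app.route("/ping", methods=["GET"])
    def ping():
        # Health check: 200 when the model loads, 404 otherwise.
        try:
            _get_model()
            return flask.Response(status=200)
        except Exception:
            return flask.Response(status=404)

    @app.route("/invocations", methods=["POST"])
    def invocations():
        # Assumed JSON contract: {"instances": [...]}; define your own format.
        payload = flask.request.get_json(force=True)
        # Assumes a scikit-learn-style model whose predict() returns an array.
        prediction = _get_model().predict(payload["instances"])
        return flask.jsonify(predictions=prediction.tolist())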
Hello, while I had set up Google Cloud Machine Learning to train a neural network, I suddenly became unable to submit jobs to Google Cloud.
There is no error, but the command hangs without doing anything. Also, my instance is running. Here is the command:
    gcloud ml-engine jobs submit training job9123 --runtime-version 1.0 --job-dir gs://dataset1_giorgaros2 --package-path trainmodule --module-name trainmodule.nncloud --region europe-west1 --config cloudml-gpu.yaml -- --train-file gs://dataset1_giorgaros2/nnn.p
Thank you!
The ML Engine job logs could help you obtain more details about the failed job execution; in most cases the log contains the cause of the failure.
Finding the job logs on ML engine
If you are retrying the same command for each training job execution, you might be getting an error about the job name, because the name must be unique for each job on ML Engine, as described in the naming convention rules for ML Engine jobs.
ML Engine name convention
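One common pattern (just a sketch; a timestamp suffix is one simple way to guarantee uniqueness) is to generate a fresh job name on every submission, reusing the rest of your command unchanged:

    # Generate a unique job name per submission, e.g. job9123_20180409_153045
    JOB_NAME="job9123_$(date +%Y%m%d_%H%M%S)"
    gcloud ml-engine jobs submit training "$JOB_NAME" \
        --runtime-version 1.0 \
        --job-dir gs://dataset1_giorgaros2 \
        --package-path trainmodule \
        --module-name trainmodule.nncloud \
        --region europe-west1 \
        --config cloudml-gpu.yaml \
        -- --train-file gs://dataset1_giorgaros2/nnn.p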
Try checking network connectivity to Google Compute Engine.
Check logs from the run - https://console.cloud.google.com/
And of course, read the docs:
https://cloud.google.com/sdk/gcloud/reference/ml-engine/jobs/submit/training
I am training a TensorFlow model with, unfortunately, many CPU operations. On my local machine, I compiled TensorFlow from source with support for SSE4.2/AVX/FMA to make training run faster. When I train on gcloud via their ML Engine service, I see a 10x slowdown compared to local. I suspect that TensorFlow on gcloud ML Engine wasn't compiled with CPU optimizations. I was wondering what the ways around this are.
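For context, the local build mentioned above was along these lines, i.e. the usual flags for enabling those instruction sets when building TensorFlow from source (run after ./configure; paths follow the standard TensorFlow build docs):

    # Build TensorFlow with SSE4.2/AVX/FMA enabled
    bazel build -c opt --copt=-msse4.2 --copt=-mavx --copt=-mfma \
        //tensorflow/tools/pip_package:build_pip_package

    # Package and install the resulting wheel
    bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
    pip install /tmp/tensorflow_pkg/tensorflow-*.whl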