TensorFlow session problems (multiple sessions on 1 GPU, async sess.run?) - Docker

Sorry for the vague title, but I'm having a hard time with our design and I need help!
We have a trained model that we want to use for car detection on images. We have a lot of images coming from multiple cameras into our Node.js backend. What we are looking to do is create multiple workers (child_process) and send an image path via stdin to each of them so they can process it and return the results (1 image per worker per run).
The workers are Python 3 scripts, so they all run the same code. This means we have multiple TensorFlow sessions, which created a problem: I can't find a way to run multiple sessions on the same GPU... Is there a way to do this?
If not, how can I process those images in parallel with only 1 GPU? Maybe I can create 1 session and attach to it from my workers? I'm very new to this, as you can see!
By the way, I'm running all of this in a Docker container with a GTX 960M (yes, I know... better than nothing, I guess).

By default, a tensorflow session will hog all GPU memory. You can override the defaults when creating the session. From this answer:
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.333)
sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))
That said, graph building and session creation are much more expensive than just running inference on an existing session, so you don't want to do that for each individual query image. You may be better off running a server that builds the graph, starts the session, loads variables, etc., and then responds to queries as they come in. If you want it more asynchronous than this, you can still have multiple servers, each with its own session on the same GPU, using the method above.
Check out TensorFlow Serving for a lot more on this.
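To make the "one long-lived session per worker" idea concrete, here is a minimal sketch of what each Python worker could look like, assuming a frozen graph saved as model.pb, input/output tensor names "input:0" and "detections:0", and a load_and_preprocess helper; all of these names are placeholders for whatever your own model uses:
import sys
import tensorflow as tf

# Cap this worker's share of GPU memory so several workers fit on one GPU.
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.2)
config = tf.ConfigProto(gpu_options=gpu_options)

# Build the graph and create the session once, at worker startup.
graph_def = tf.GraphDef()
with open("model.pb", "rb") as f:
    graph_def.ParseFromString(f.read())

graph = tf.Graph()
with graph.as_default():
    tf.import_graph_def(graph_def, name="")

sess = tf.Session(graph=graph, config=config)
inp = graph.get_tensor_by_name("input:0")
out = graph.get_tensor_by_name("detections:0")

# Handle one image path per line from stdin, reusing the same session each time.
for line in sys.stdin:
    path = line.strip()
    if not path:
        continue
    image = load_and_preprocess(path)  # placeholder helper returning an HxWx3 array
    detections = sess.run(out, feed_dict={inp: [image]})
    print(detections)
    sys.stdout.flush()
The expensive part (loading the graph and creating the session) happens once at startup; each image path written to the worker's stdin then only pays for a sess.run call.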

Related

Job-based cloud processing solution

I would like to do some cloud processing on a very small cluster of machines (<5).
This processing should be based on 'jobs', where jobs are parameterized scripts that run in a certain docker environment.
As an example for what a job could be:
Run in docker image "my_machine_learning_docker"
Download some machine learning dataset from an internal server
Train some neural network on the dataset
Produce a result and upload it to a server again.
My use cases are not limited to machine learning however.
A job could also be:
Run in docker image "my_image_processing_docker"
Download a certain amount of images from some folder on a machine.
Run some image optimization algorithm on each of the images.
Upload the processed images to another server.
Now what I am looking for is some framework/tool that keeps track of the compute servers, receives my jobs and dispatches them to an available server. Advanced prioritization, load management or the like is not really required.
It should be possible to query the status of jobs and of the servers via an API (I want to do this from NodeJS).
Potentially, I could imagine this framework/tool dynamically spinning up these compute servers in AWS, Azure or similar. That would not be a hard requirement though.
I would also like to host this solution myself. So I am not looking for a commercial solution for this.
Now I have done some research, and what I am trying to do has similarities with many, many existing projects, but I have not "quite" found what I am looking for.
Similar things I have found were (selection):
CI/CD solutions such as Jenkins/Gitlab CI. Very similar, but it seems to be tailored very much towards the CI/CD case, and I am not sure whether it is such a good idea to abuse a CI/CD solution for what I am trying to do.
Kubernetes: Appears to be able to do this somehow, but is said to be very complex. It also looks like overkill for what I am trying to do.
Nomad: Appears to be the best fit so far, but it has some proprietary vibes that I am not very much a fan of. Also it still feels a bit complex...
In general, there are many many different projects and frameworks, and it is difficult to find out what the simplest solution is for what I am trying to do.
Can anyone suggest anything or point me in a direction?
Thank you
I would use Jenkins for this use case even if it appears to you as a "simple" one. You can start with the simplest pipeline, which can also deal with the increasing complexity of your jobs. Jenkins has an API and lots of plugins, and it can be run as a container for spinning up in a cloud environment.
It's possible you're looking for something like AWS Batch flows: https://aws.amazon.com/batch/ or Google Dataflow: https://cloud.google.com/dataflow. Out of the box they handle scaling, distribution, monitoring, etc.
But if you want to roll your own ....
Option 1: Queues
For your job distribution you are really just looking for a simple message queue that all of the workers listen on. In most messaging platforms, a queue supports deliver-once semantics. For example:
Active MQ: https://activemq.apache.org/how-does-a-queue-compare-to-a-topic
NATS: https://docs.nats.io/using-nats/developer/receiving/queues
Using queues for load distribution is a common pattern.
A queue-based solution can be used with either manual or automated load balancing: the more workers you spin up, the more consumers you have pulling off the queue. The same messaging solution can be used to gather the results if you need to, using message reply semantics or a dedicated reply channel. You could use the result channel to post progress reports back, so your main application would know the status of each worker. Alternatively, workers could drop their status into a database. It probably depends on your preference for collecting results and on how large the result sets are; if they are large enough, you might even just drop results into an S3 bucket or some kind of filesystem.
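As a minimal sketch of the queue-group pattern, here is what a Python worker might look like with the nats-py client; the subject name "jobs", the job payload shape and the docker run invocation are all placeholders:
import asyncio
import json
import subprocess
import nats

async def main():
    nc = await nats.connect("nats://localhost:4222")

    async def handle_job(msg):
        # e.g. {"image": "my_image_processing_docker", "args": ["--input", "/data"]}
        job = json.loads(msg.data)
        # Blocking call; fine for a worker that handles one job at a time.
        subprocess.run(["docker", "run", "--rm", job["image"], *job.get("args", [])],
                       check=True)
        # Reply on the message's reply subject so the dispatcher can collect status.
        if msg.reply:
            await nc.publish(msg.reply, b"done")

    # Every worker subscribes with the same queue group, so each job is
    # delivered to exactly one of them.
    await nc.subscribe("jobs", queue="workers", cb=handle_job)
    await asyncio.Event().wait()  # keep the worker alive

asyncio.run(main())
Scaling out is then just a matter of starting more copies of this script (or more containers running it).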
You could use something quite simple to manage the workers. Jenkins was already suggested and is definitely a solution I have seen used for running multiple instances across many servers, as you just need to install the Jenkins agent on each of the workers. This can work quite easily if you own or manage the physical servers it is running on. You could use TeamCity as well.
If you want something cloud hosted, it may depend on the technology you use. Kubernetes is probably overkill here, but it could certainly be used to spin up N nodes and increase/decrease the number of workers. To auto-scale you could publish a single metric, the queue depth, and trigger an increase in the number of workers based on how deep the queue is, weighing the cost of spinning up new nodes against the rate at which jobs are processed.
You could also look at some of the lightweight managed container solutions like fly.io or Heroku, which are both much easier to set up than K8s and would let you scale up easily.
Option 2: Web workers
Can you design your solution so that it can be run as a cloud function/web worker?
If so you could set them up so that scaling is fully automated. You would hit the cloud function end point to request each job. The hosting engine would take care of the distribution and scaling of the workers. The results would be passed back in the body of the HTTP response ... a json blob.
Your workload may be too large for these solutions, but if it is actually fairly lightweight and quick, this could be a simple option.
I don't think these solutions would let you query the status of tasks easily.
If this option seems appealing there are quite a few choices:
https://workers.cloudflare.com/
https://cloud.google.com/functions
https://aws.amazon.com/lambda/
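To give a feel for the shape this takes, here is a hedged sketch of a job endpoint written with Google's functions-framework for Python; the payload fields and the processing step are placeholders:
import functions_framework

@functions_framework.http
def run_job(request):
    job = request.get_json(silent=True) or {}
    # ... download the inputs named in `job`, run the processing, upload outputs ...
    return {"status": "done", "job_id": job.get("id")}  # returned as a JSON blob
The same idea ports to AWS Lambda or Cloudflare Workers, with mostly the entry-point signature changing.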
Option 3: Google Cloud Tasks
This is a bit of a hybrid option. Essentially, GCP has a queue-based distribution workflow where the endpoint is a cloud function or some other supported worker, including Cloud Run, which uses Docker images. I've not actually used it myself, but maybe it fits the bill.
https://cloud.google.com/tasks
When I look at a problem like this, I think through the entirety of the data path: the mapping between source image and target image, plus any metadata or status information that needs to be collected. Additionally, failure conditions need to be handled, especially if a production service is going to be built.
I prefer running Python and Pyspark with Pandas UDFs to perform the orchestration and image processing.
S3FS lets me access S3. If using Azure or Google, Databricks' DBFS lets me seamlessly read and write to cloud storage without two extra file-copy steps.
Pyspark's binaryFile data source lets me list all of the input files to be processed. Spark lets me run this in batch or an incremental/streaming configuration. This design optimizes for end to end data flow and data reliability.
For a cluster manager I use Databricks, which lets me easily provision an auto-scaling cluster. The Databricks cluster manager lets users deploy docker containers or use cluster libraries or notebook scoped libraries.
The example below assumes each image is > 32 MB and processes it out of band. If the images are in the KB range, then dropping the content column is not necessary and in-line processing can be faster (and simpler).
Pseudo code:
df = (spark.read
      .format("binaryFile")
      .option("pathGlobFilter", "*.png")
      .load("/path/to/data")
      .drop("content")
     )

from typing import Iterator
import pandas as pd
from pyspark.sql.functions import col, pandas_udf

def do_image_xform(path: str) -> str:
    # Do the image transformation: read from the dbfs path, write to a dbfs path
    ...
    # return xform status
    return "success"

@pandas_udf("string")
def do_image_xform_udf(iterator: Iterator[pd.Series]) -> Iterator[pd.Series]:
    for paths in iterator:
        yield paths.apply(do_image_xform)

df_status = df.withColumn('status', do_image_xform_udf(col('path')))
df_status.write.saveAsTable("status_table")  # triggers execution, saves status

Is it possible to run multiples of the same model in parallel on the Coral dev board?

I'm running MobileNet SSD and getting around 14 ms per input image. Is it possible for me to run two of these models at the same time on the same dev board TPU? For example, I have a backlog of 100 images I want to get through, and the only thing that matters to me is how long it takes to get through all 100. So if I could run 2 or 4 at a time, that would be amazing. I tried to read through the docs and I looked at pipelining, but the Edge TPU compiler tells me "Warning: For the given model, you're creating more segments than is necessary". Everything else I've read about running in parallel is about using two physical Edge TPUs. If it's not possible, that's fine; I just want to know for sure :)
Thank you
You can run multiple models, but the TPU has limited memory and will swap your models in and out, so you may not see a performance improvement just by delegating your task to multiple models. However, you could co-compile your models. This process compiles each model with the same identifier (a caching token), which enables them both to run on the TPU without getting swapped in and out.
Compiling models is done with the edgetpu_compiler; the process works like this:
edgetpu_compiler someModel.tflite someOtherModel.tflite
Or with the same model:
edgetpu_compiler someModelA.tflite someModelA_duplicate.tflite
There are some nuances to the process: the order in which you feed the models to the edgetpu_compiler can impact performance, as can the case where your combined models are too big to fit into the TPU's memory. I suggest starting with this documentation about multiple models.
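For what the co-compiled case can look like at runtime, here is a hedged sketch using the PyCoral API; the model and image filenames are placeholders, and note that a single Edge TPU still executes one inference at a time, so the win from co-compilation is avoiding the swap between models rather than true parallelism:
from pycoral.utils.edgetpu import make_interpreter
from pycoral.adapters import common, detect
from PIL import Image

# Both .tflite files came from one edgetpu_compiler invocation, so they share a
# caching token and are not swapped in and out of the TPU's on-chip memory.
interpreters = [make_interpreter("someModelA_edgetpu.tflite"),
                make_interpreter("someModelA_duplicate_edgetpu.tflite")]
for it in interpreters:
    it.allocate_tensors()

image_paths = ["img_%03d.jpg" % i for i in range(100)]
for i, path in enumerate(image_paths):
    interpreter = interpreters[i % 2]  # alternate between the two co-compiled models
    image = Image.open(path).convert("RGB").resize(common.input_size(interpreter))
    common.set_input(interpreter, image)
    interpreter.invoke()
    objs = detect.get_objects(interpreter, score_threshold=0.5)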

Watching over SageMaker while it is training

I am using Amazon SageMaker to train a model with a lot of data.
This takes a lot of time - hours or even days. During this time, I would like to be able to query the trainer and see its current status, particularly:
How many iterations it already did, and how many iterations it still needs to do? (the training algorithm is deep learning - it is based on iterations).
How much time does it need to complete the training?
Ideally, I would like to classify a test-sample using the model of the current iteration, to see its current performance.
One way to do this is to explicitly tell the trainer to print debug messages after each iteration. However, these messages will be available only at the console from which I run the trainer. Since training takes so much time, I would like to be able to query the trainer's status remotely, from different computers.
Is there a way to remotely query the status of a running trainer?
All logs are available in Amazon Cloudwatch. You can query CloudWatch programmatically or via an API to parse the logs.
Are you using built-in algorithms or a Framework like MXNet or TensorFlow? For TensorFlow you can monitor your job with TensorBoard.
Additionally, you can see high level job status using the describe training job API call:
import sagemaker
sm_client = sagemaker.Session().sagemaker_client
print(sm_client.describe_training_job(TrainingJobName='Your job name here'))
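If you want the log lines themselves (for example your per-iteration debug prints) rather than the high-level status, here is a minimal sketch with boto3, assuming the default /aws/sagemaker/TrainingJobs log group and a placeholder job name:
import boto3

logs = boto3.client("logs")
job_name = "your-training-job-name"
log_group = "/aws/sagemaker/TrainingJobs"

# A training job writes one or more log streams prefixed with its name.
streams = logs.describe_log_streams(
    logGroupName=log_group, logStreamNamePrefix=job_name)["logStreams"]
for stream in streams:
    events = logs.get_log_events(
        logGroupName=log_group, logStreamName=stream["logStreamName"])["events"]
    for event in events:
        print(event["message"])  # your per-iteration debug messages show up here
This runs from any machine with AWS credentials, which addresses the "query remotely from different computers" part of the question.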

How to spin up 'n' instances of an app / container with pre-loaded memory

Background:
I have a language-processing Java app that requires about 16MB of memory and takes about 40 seconds to initialise resources into that memory before exposing a web service. I am new to containers and related technologies, so apologies if my question is obvious...
Objective:
I want to make available several hundred instances of my app on-demand and in a pre-loaded/pre-configured state (e.g. I could make a call to AWS to stand up 'n' instances of my app and they would be ready in <10 seconds).
Question:
I'm anticipating that I 'may' be able to create a Docker image of the app, initialise it and pause it, and then clone that on demand and 'un-pause' it? Could you advise whether what I am looking to do is possible and, if so, how you would approach it.
AWS is my platform of choice so any AWS flavoured specifics would be super helpful.
I'd split your question in two, if you don't mind:
1. Spinning up N containers (or, more likely, scale on demand)
2. Preloading memory.
#1 is Kubernetes's bread and butter and you can find a ton of resources about it online, so allow me to focus on #2.
The real problem is that you're too focused on a possible solution to see the bigger picture:
You want to "preload memory" in order to speed up launch time (well, what do you think Java is doing in those 40s that the magick memory preloader wouldn't?).
A different approach would be to launch the container, let Java eat up resources for 40s, but not make that container available to the world during that time.
Kubernetes provides tools to achieve exactly that, see here:
https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/
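As a rough sketch of the application side of that idea (Flask is used purely for illustration; the real service here is Java, where the same /ready pattern applies), the pod's readinessProbe would poll an endpoint like this, and Kubernetes would only route traffic to the container once it returns 200:
import threading
import time
from flask import Flask

app = Flask(__name__)
ready = False

def initialise():
    global ready
    time.sleep(40)  # stand-in for loading the language resources into memory
    ready = True

threading.Thread(target=initialise, daemon=True).start()

@app.route("/ready")
def readiness():
    return ("ok", 200) if ready else ("warming up", 503)

@app.route("/process")
def process():
    return "result"  # the actual web service endpoint

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)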
Hope this helps!

Dataflow job takes too long to start

I'm running a job which reads about ~70GB of compressed data.
In order to speed up processing, I tried to start the job with a large number of instances (500), but after 20 minutes of waiting it doesn't seem to start processing the data (I have a counter for the number of records read). The reason for having a large number of instances is that, as one of the steps, I need to produce an output similar to an inner join, which results in a much bigger intermediate dataset for later steps.
What should be the average delay between when the job is submitted and when it starts executing? Does it depend on the number of machines?
While I might have a bug that causes that behavior, I still wonder what that number/logic is.
Thanks,
G
The time necessary to start VMs on GCE grows with the number of VMs you start, and in general VM startup/shutdown performance can have high variance. 20 minutes would definitely be much higher than normal, but it is somewhere in the tail of the distribution we have been observing for similar sizes. This is a known pain point :(
To verify whether VM startup is actually at fault this time, you can look at Cloud Logs for your job ID, and see if there's any logging going on: if there is, then some VMs definitely started up. Additionally you can enable finer-grained logging by adding an argument to your main program:
--workerLogLevelOverrides=com.google.cloud.dataflow#DEBUG
This will cause workers to log detailed information, such as receiving and processing work items.
Meanwhile, I suggest enabling autoscaling instead of specifying a large number of instances manually - it should gradually scale to the appropriate number of VMs at the appropriate moment in the job's lifetime.
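If it helps, with the Dataflow SDK autoscaling is typically requested with pipeline options along these lines (the worker cap here is an arbitrary placeholder):
--autoscalingAlgorithm=THROUGHPUT_BASED --maxNumWorkers=100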
Another possible (and probably more likely) explanation is that you are reading a compressed file that needs to be decompressed before it is processed. It is impossible to seek in the compressed file (since gzip doesn't support it directly), so even though you specify a large number of instances, only one instance is being used to read from the file.
The best way to approach the solution of this problem would be to split a single compressed file into many files that are compressed separately.
The best way to debug this problem would be to try it with a smaller compressed input and take a look at the logs.
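As a rough sketch of the splitting suggestion, something like the following produces many independently compressed shards that Dataflow can then read in parallel; the shard size and filenames are arbitrary placeholders:
import gzip

SHARD_LINES = 1_000_000  # lines per output shard

shard_idx, line_count = 0, 0
out = gzip.open("shard-%05d.gz" % shard_idx, "wt")

with gzip.open("big_input.gz", "rt") as src:
    for line in src:
        if line_count == SHARD_LINES:
            out.close()
            shard_idx += 1
            line_count = 0
            out = gzip.open("shard-%05d.gz" % shard_idx, "wt")
        out.write(line)
        line_count += 1
out.close()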

Resources