Watching over SageMaker while it is training

I am using Amazon SageMaker to train a model with a lot of data.
This takes a lot of time - hours or even days. During this time, I would like to be able to query the trainer and see its current status, particularly:
How many iterations it already did, and how many iterations it still needs to do? (the training algorithm is deep learning - it is based on iterations).
How much time does it need to complete the training?
Ideally, I would like to classify a test-sample using the model of the current iteration, to see its current performance.
One way to do this is to explicitly tell the trainer to print debug messages after each iteration. However, these messages will be available only at the console from which I run the trainer. Since training takes so much time, I would like to be able to query the trainer status remotely, from different computers.
Is there a way to remotely query the status of a running trainer?

All logs are available in Amazon CloudWatch. You can query CloudWatch programmatically through its API and parse the logs.
Are you using built-in algorithms or a framework like MXNet or TensorFlow? For TensorFlow you can monitor your job with TensorBoard.
Additionally, you can see high level job status using the describe training job API call:
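import sagemaker
# Reuse the low-level SageMaker client held by the SDK session
sm_client = sagemaker.Session().sagemaker_client
# The response is a dict with fields such as TrainingJobStatus and SecondaryStatus
print(sm_client.describe_training_job(TrainingJobName='Your job name here'))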
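As a rough sketch of how that response could be condensed into a status line you can check from any machine: the helper below is hypothetical (summarize_job is not part of the SageMaker SDK) and assumes the response dict carries TrainingJobStatus, SecondaryStatus, TrainingStartTime and StoppingCondition.MaxRuntimeInSeconds. It only reports elapsed time against the runtime limit; iteration counts would still have to come from the training logs.

```python
from datetime import datetime, timezone


def summarize_job(desc, now=None):
    """Build a one-line summary from a describe_training_job response dict."""
    now = now or datetime.now(timezone.utc)
    status = desc.get("TrainingJobStatus", "Unknown")
    secondary = desc.get("SecondaryStatus", "Unknown")
    started = desc.get("TrainingStartTime")  # assumed absent until training starts
    if started is None:
        return f"{status}/{secondary}: not started yet"
    elapsed = int((now - started).total_seconds())
    limit = desc.get("StoppingCondition", {}).get("MaxRuntimeInSeconds")
    line = f"{status}/{secondary}: elapsed {elapsed // 60} min"
    if limit:
        line += f" (limit {limit // 60} min)"
    return line
```

It would be called as summarize_job(sm_client.describe_training_job(TrainingJobName='Your job name here')).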

Related

Question on updating machine learning models live

I have a model (loaded into memory) live in production that consumes messages/data from a message queue to make predictions. I have a separate process that retrains the model every few hours (necessary). What is the best way to trigger the model to reload the newly trained version into memory every time retraining occurs? Currently I just have the production model reload on an interval or every 1000 messages.
I figured this would be easier if instead of a message queue I have a webserver. So I can just have an endpoint that can trigger reload.
It's hard to find best practices on this topic.

I've found a similar question here. Google App Engine: Automatically re-deploy once a day to update machine learning model?
The answers seem to suggest the best way would be to redeploy when training is complete. But I will likely have more models in this pipeline. Redeploying on every retrain is not really feasible.
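For reference, the reload policy described in the question (reload on an interval or every 1000 messages) might look roughly like the sketch below. The class name ReloadingModel, the load_fn callback and the default limits are all hypothetical; this only illustrates the existing policy, not a recommended trigger mechanism.

```python
import time


class ReloadingModel:
    """Reload the model after N messages or T seconds, whichever comes first."""

    def __init__(self, load_fn, max_messages=1000, max_age_s=3600.0,
                 clock=time.monotonic):
        self._load_fn = load_fn          # hypothetical: returns a callable model
        self._max_messages = max_messages
        self._max_age_s = max_age_s
        self._clock = clock
        self._reload()

    def _reload(self):
        self.model = self._load_fn()
        self._seen = 0
        self._loaded_at = self._clock()

    def predict(self, message):
        # Check both triggers before handling the message.
        too_many = self._seen >= self._max_messages
        too_old = self._clock() - self._loaded_at >= self._max_age_s
        if too_many or too_old:
            self._reload()
        self._seen += 1
        return self.model(message)
```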

Tensorflow Session problems (multi-session 1 gpu, async sess.run ?)

Sorry for the title, I know it's a bit vague, but I'm having a hard time with our design and I need help!
So we have a trained model, which we want to use on images for car detection. We have a lot of images coming from multiple cameras into our nodejs backend. What we are looking to do is to create multiple workers (child_process) and then send an image path via stdin to every single one of them so they can process it and get the results (1 image per worker per run).
Workers are python3 scripts, so they all run the same code. This means we have multiple TensorFlow sessions. That created a problem: I can't find a way to run multiple sessions on the same GPU... Is there a way to do this?
If not, how can I achieve my goal of processing those images in parallel with only 1 GPU? Maybe I can create 1 session and attach to it in my workers? I'm very new to this, as you can see!
Btw, I'm running all of this in a docker container with a GTX 960M (yes I know... better than nothing, I guess).

By default, a tensorflow session will hog all GPU memory. You can override the defaults when creating the session. From this answer:
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.333)
sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))
That said, graph building/session creation is much more expensive than just running inference on a session, so you don't want to have to do that for each individual query image. You may be better off running a server that builds the graph, starts the session, loads variables etc. then responds to queries as they come in. If you want it more asynchronous than this, you can still have multiple servers with a session in each on the same GPU using the above method.
Check out TensorFlow Serving for a lot more on this.
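A minimal sketch of the "one long-lived server" idea, using only the standard library so it stays independent of TensorFlow. All names here (serve, start_server, query, load_model) are hypothetical; in a real setup load_model would build the graph, create the session with the GPU options above and return a function that wraps sess.run.

```python
import queue
import threading


def serve(load_model, requests):
    """Load the model once, then answer requests until a None sentinel arrives."""
    model = load_model()  # expensive step: done a single time, not per image
    while True:
        item = requests.get()
        if item is None:
            break
        image_path, reply = item
        reply.put(model(image_path))


def start_server(load_model):
    """Start the single worker thread that owns the model."""
    requests = queue.Queue()
    worker = threading.Thread(target=serve, args=(load_model, requests),
                              daemon=True)
    worker.start()
    return requests, worker


def query(requests, image_path):
    """Send one image path to the server thread and wait for its result."""
    reply = queue.Queue(maxsize=1)
    requests.put((image_path, reply))
    return reply.get()
```

Because only one thread ever touches the model, callers can submit images concurrently without each paying the session start-up cost.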

Debugging slow reads from BigQuery on Google Cloud Dataflow

Background:
We have a really simple pipeline which reads some data from BigQuery (usually ~300MB), filters/transforms it and puts it back into BigQuery. In 99% of cases this pipeline finishes in 7-10 minutes and is then restarted to process a new batch.
Problem:
Recently, the job has started to take >3h once in a while, maybe 2 times in a month out of 2000 runs. When I look at the logs, I can't see any errors and in fact it's only the first step (read from BigQuery) that is taking so long.
Does anyone have a suggestion on how to approach debugging of such cases? Especially since it's really the read from BQ and not any of our transformation code. We are using Apache Beam SDK for Python 0.6.0 (maybe that's the reason!?)
Is it maybe possible to define a timeout for the job?

This is an issue on either the Dataflow side or the BigQuery side, depending on how one looks at it. When splitting the data for parallel processing, Dataflow relies on an estimate of the data size. The long runtime happens when BigQuery sporadically gives a severe under-estimate of the query result size, and Dataflow, as a consequence, severely over-splits the data, so the runtime becomes bottlenecked by the overhead of reading lots and lots of tiny file chunks exported by BigQuery.
On one hand, this is the first time I've seen BigQuery produce such dramatically incorrect query result size estimates. However, as size estimates are inherently best-effort and can in general be arbitrarily off, Dataflow should control for that and prevent such oversplitting. We'll investigate and fix this.
The only workaround that comes to mind meanwhile is to use the Java SDK: it uses quite different code for reading from BigQuery that, as far as I recall, does not rely on query size estimates.

Should I store a global counter or an aggregated value in a TSDB

This question is really about the data schema. I have a program which has a bunch of discrete events, and I want to get beautiful graphs out.
From my knowledge, I understand that I should really keep a counter of the number of events that have occurred, and on a regular interval, transfer that cumulative counter to the TSDB (as part of a cron job or similar).
What I currently have though is a system where the monitor, on a regular interval, tells the TSDB how many events occurred during that interval (a fixed hard coded value!).
Which of these two design patterns is better? What are the factors that affect that decision? Do I have a counter value here or is it just a measurement?
I have various concerns, including but not limited to the efficiency of the monitoring tool.
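To make the difference between the two patterns concrete, here is a small sketch (the function name deltas_from_counter is hypothetical) that derives per-interval counts from cumulative counter readings. It assumes that a reading lower than the previous one means the counter was reset, for example by a process restart.

```python
def deltas_from_counter(samples):
    """Turn cumulative counter readings into per-interval event counts.

    A reading lower than the previous one is treated as a counter reset,
    so the new reading itself is taken as the delta.
    """
    out = []
    prev = None
    for value in samples:
        if prev is not None:
            out.append(value - prev if value >= prev else value)
        prev = value
    return out
```

With a cumulative counter, a lost report only merges two intervals (readings 0, 5, 20 still account for all 20 events), whereas a lost per-interval value drops those events from the graph.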

You tagged the question with InfluxDB but it seems like what you are really asking about is the collection agent. For that I would look at Telegraf.
StatsD is also a really great lightweight API that is available for most major languages now, from which you can efficiently emit different types of stats (counters, timings, etc); either for every event or at a sample rate you define.
I implemented a solution that gathers metrics emitted from my app using StatsD, metrics that were pulled (JMX queries), and basic host level stats you get for free with Telegraf. Every host (30+) runs a single Telegraf instance which delivers its stats to a centralized InfluxDB server on some interval (e.g. 30 seconds).
So with an approach like that you get a good balance of performance and data precision.
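For illustration, emitting a StatsD counter needs nothing more than a small UDP datagram. The sketch below assumes a StatsD-compatible listener (for example Telegraf's statsd input) on the conventional port 8125; the function names and the metric name are made up for the example.

```python
import socket


def statsd_counter(name, value=1, sample_rate=1.0):
    """Format a StatsD counter datagram, e.g. b'app.events:1|c'.

    This only formats the packet; a sampling client would additionally
    decide at random whether to send it at all.
    """
    payload = f"{name}:{value}|c"
    if sample_rate < 1.0:
        payload += f"|@{sample_rate}"
    return payload.encode("ascii")


def emit(payload, host="127.0.0.1", port=8125):
    """Fire-and-forget UDP send; host and port are assumptions."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))
```

Calling emit(statsd_counter("app.events")) once per event lets the agent do the aggregation per flush interval.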

predictionio not producing any predictions

I am trying to test out prediction-io for the first time. I followed the installation instructions for linux and developed several test engines. After repeatedly getting the following error on my own datasets I decided to follow the movie 100k tutorial (https://github.com/PredictionIO/PredictionIO-Docs/blob/cbca03b1c2bad949db951a3a798f0080c48b3674/source/tutorials/movie-recommendation.rst). The same error seems to persist even though it seems as if my Hadoop is running correctly (and not in safe mode) and the engine says that it is running and training is complete. The error that I am getting is:
predictionio.ItemRecNotFoundError: request: GET
/engines/itemrec/movie-rec/topn.json {'pio_n': 10, 'pio_uid': '28',
'pio_appkey':
'UsZmneFir39GXO9hID3wDhDQqYNje4S9Ea3jiQjrpHFzHwMEqCqwJKhtAziveC9D'}
/engines/itemrec/movie-rec/topn.json?pio_n=10&pio_uid=28&pio_appkey=UsZmneFir39GXO9hID3wDhDQqYNje4S9Ea3jiQjrpHFzHwMEqCqwJKhtAziveC9D
status: 404 body: {"message":"Cannot find recommendation for user."}
The rest of the tutorial runs as expected, just no predictions ever seem to appear. Can someone please point me in the right direction on how to solve this issue?
Thanks!

Several suggestions:
Check if there is data in PredictionIO's database. I have seen jobs fail because there were some items in the database but no users and no user-to-item actions. Look into the Mongo database appdata - there should be collections named users, items and u2iActions. These collections are only created when you add the first user, item or u2iAction via the API. Unfortunately, the web interface does not make it clear whether a job completed successfully or not.
Check logs - PredictionIO logs, and Hadoop logs if you use Hadoop jobs. See if model training jobs did complete (BTW, did you invoke "Train prediction model now" via web interface?)
Verify if there is some data in predictionio_modeldata for your algorithm.
Even if the model is trained OK, there may still not be enough data to produce recommendations for a particular user. Try "Random" to get the simplest recommendations available for all users, to check whether the system as a whole works.
