I'm using the optimizer SHERPA to train a machine learning network. When the training begins, it prints a URL that takes you to a built-in SHERPA dashboard to track the progress as it runs. The URL points to a localhost forwarding port specified within the code.
The training runs just fine inside the Docker container, but I can't figure out why my visualization dashboard isn't working. Attached is a picture of what it's supposed to look like (from the SHERPA GitHub page) versus what I'm getting, which is blank. I let the training finish completely, which took multiple hours, but the dashboard stayed blank the entire time. Any ideas?
[comparison image]
I have filed an issue in the official doccano repo (here). However, I am also posting here in the hope of getting some idea of what I am doing wrong.
I have two EC2 instances, both running Ubuntu 20.
In one of them I have set up doccano and uploaded some data.
I annotated a bit of that data and then trained a custom model using Hugging Face.
In the second EC2 instance I have uploaded the trained model and created a FastAPI-based API to serve the results.
I want to set up auto labeling (it is a Sequence Labeling project).
I follow the steps in the official document and also take help from here.
Everything goes well; at the second step, when I test the API connectivity, doccano successfully connects and fetches the data.
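For context, the FastAPI service is essentially a single prediction endpoint; here is a rough sketch of it (the route name, payload shape, and label format below are simplified placeholders, not the exact code, and must match the mapping template configured in doccano's auto-labeling settings):

# Minimal sketch of a sequence-labeling endpoint (payload/label shapes are assumptions)
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class PredictRequest(BaseModel):
    text: str

@app.post("/predict")
def predict(req: PredictRequest):
    # Placeholder: run the Hugging Face model here and convert its output
    # to the span format expected by the doccano mapping template.
    entities = [{"label": "PER", "start_offset": 0, "end_offset": 4}]
    return {"text": req.text, "labels": entities}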
Once all is done, I go to one of the documents and try the auto labeling. And surprise:
NOTHING HAPPENS.
There is no log entry on the model server, which means no request ever reached it!
Both doccano and the model server are running via Docker inside the EC2 instances.
What am I doing wrong?
Please help.
Warm regards
OK, I found the reason, thanks to a reply in the GitHub issue. Here I am going to quote the reply as it is:
The website is not behaving intuitively from the UX perspective. When you turn on Auto Labeling, just try going to the next example (arrow on the top right of the page) and your Auto Labeling API should be called. Then go back and it will be called on the first example. Also, it's called only on examples that are not marked as "labeled".
So if anyone is having the same problem, hopefully this will help.
I use a time series database to report some network metrics, such as the download time or DNS lookup time for some endpoints. However, sometimes the measurement fails, for example if the endpoint is down or if there is a network issue. In these cases, what should be done according to best practices? Should I report an impossible value, like -1, or just not write anything at all to the database?
The problem I see with not writing anything is that I cannot tell whether my test is no longer running or whether there is a problem with the endpoint/network.
The best practice is to capture the failures in their own time series for separate analysis.
Failures or bad readings will skew the series, so they should be filtered out or replaced with a projected value for 'normal' events. The beauty of a time series is that one measure (time) is globally common, so it is easy to project between two known points when one is missing.
The failure information is also important, as it is an early indicator of issues or outages on your target. You can record the network error and other diagnostic information to find trends and to confirm it is the client, and not your server, having the issue. Further, you can deploy several instances monitoring the same target so that they cancel out each other's noise.
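As a rough illustration, the probe can branch on the outcome and send the point to either the metrics series or a dedicated failure series. This is only a sketch: write_point() and the field names are placeholders for whatever client your time series database uses.

# Sketch: route clean readings and failures into separate series
import time

def report_probe(endpoint, result, write_point):
    now = int(time.time())
    if result.get("ok"):
        # normal series: only clean readings, so aggregates aren't skewed
        write_point(series="http_probe",
                    tags={"endpoint": endpoint},
                    fields={"download_ms": result["download_ms"],
                            "dns_ms": result["dns_ms"]},
                    ts=now)
    else:
        # failure series: keeps the error context without polluting the metrics
        write_point(series="http_probe_errors",
                    tags={"endpoint": endpoint, "error": result["error"]},
                    fields={"count": 1},
                    ts=now)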
You can also monitor a known endpoint, like Google's 204 page, to ensure network connectivity. If all the monitors report an error connecting to your site but not to the known endpoint, your server is indeed down.
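A quick way to do that check from the same probe (a sketch, using Google's widely used connectivity-check URL):

# Sketch: distinguish "target is down" from "my network is down"
import requests

def network_is_up(timeout=5):
    try:
        r = requests.get("https://www.google.com/generate_204", timeout=timeout)
        return r.status_code == 204
    except requests.RequestException:
        return False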
I am using Amazon SageMaker to train a model with a lot of data.
This takes a lot of time - hours or even days. During this time, I would like to be able to query the trainer and see its current status, particularly:
How many iterations has it already done, and how many does it still need to do? (The training algorithm is deep learning, so it is based on iterations.)
How much time does it need to complete the training?
Ideally, I would like to classify a test-sample using the model of the current iteration, to see its current performance.
One way to do this is to explicitly tell the trainer to print debug messages after each iteration. However, these messages will be available only at the console from which I run the trainer. Since training takes so much time, I would like to be able to query the trainer's status remotely, from different computers.
Is there a way to remotely query the status of a running trainer?
All logs are available in Amazon CloudWatch. You can query CloudWatch programmatically via its API to parse the logs.
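For example, with boto3 you can pull the log streams of a training job from the /aws/sagemaker/TrainingJobs log group (a sketch; substitute your own job name):

# Sketch: read a SageMaker training job's logs from CloudWatch with boto3
import boto3

logs = boto3.client("logs")
log_group = "/aws/sagemaker/TrainingJobs"  # log group SageMaker writes training logs to

# Each training job gets one or more log streams prefixed with the job name
streams = logs.describe_log_streams(logGroupName=log_group,
                                    logStreamNamePrefix="your-job-name-here")
for stream in streams["logStreams"]:
    events = logs.get_log_events(logGroupName=log_group,
                                 logStreamName=stream["logStreamName"])
    for event in events["events"]:
        print(event["message"])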
Are you using built-in algorithms or a framework like MXNet or TensorFlow? For TensorFlow, you can monitor your job with TensorBoard.
Additionally, you can see high-level job status using the describe training job API call:
import sagemaker

# Use the boto3 SageMaker client behind the SageMaker session
sm_client = sagemaker.Session().sagemaker_client

# Returns the job status, secondary status transitions, timing information, etc.
print(sm_client.describe_training_job(TrainingJobName='your-job-name-here'))
Sorry for the title, I know it's a bit vague, but I'm having a hard time with our design and I need help!
So we have a trained model, which we want to use on images for car detection. We have a lot of images coming from multiple cameras in our Node.js backend. What we are looking to do is to create multiple workers (child_process) and then send an image path via stdin to every single one of them so they can process it and return the results (1 image per worker per run).
The workers are Python 3 scripts, so they all run the same code. This means we have multiple TensorFlow sessions. That created a problem: it looks like I can't find a way to run multiple sessions on the same GPU... Is there a way to do this?
If not, how can I achieve my goal of running those images in parallel with only 1 GPU? Maybe I can create 1 session and attach to it from my workers? I'm very new to this, as you can see!
By the way, I'm running all of this in a Docker container with a GTX 960M (yes, I know... better than nothing I guess).
By default, a TensorFlow session will hog all GPU memory. You can override the defaults when creating the session. From this answer:
import tensorflow as tf

# Limit each session to a fraction of GPU memory instead of grabbing all of it
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.333)
sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))
That said, graph building/session creation is much more expensive than just running inference on a session, so you don't want to do that for each individual query image. You may be better off running a server that builds the graph, starts the session, loads the variables, etc., then responds to queries as they come in. If you want it more asynchronous than this, you can still have multiple servers, each with its own session on the same GPU, using the above method.
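For instance, each worker can build the graph and session once at startup and then loop over image paths arriving on stdin, so the expensive setup happens only once per process. This is just a sketch under that assumption; the model loading, preprocessing, and inference details are placeholders:

# Sketch: long-lived worker with one TF session, serving image paths from stdin
import sys
import json
import tensorflow as tf

gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.333)
sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))

# Placeholder: load your detection graph/variables here, once
# saver = tf.train.import_meta_graph("model.meta"); saver.restore(sess, "model")

def detect_cars(image_path):
    # Placeholder for preprocessing + sess.run(...) on your detection outputs
    return {"image": image_path, "detections": []}

for line in sys.stdin:
    path = line.strip()
    if not path:
        continue
    # One JSON result per line, read back by the Node.js parent process
    print(json.dumps(detect_cars(path)), flush=True)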
Check out TensorFlow Serving for a lot more on this.
I am trying out PredictionIO for the first time. I followed the installation instructions for Linux and developed several test engines. After repeatedly getting the following error on my own datasets, I decided to follow the movie 100k tutorial (https://github.com/PredictionIO/PredictionIO-Docs/blob/cbca03b1c2bad949db951a3a798f0080c48b3674/source/tutorials/movie-recommendation.rst). The same error persists even though Hadoop seems to be running correctly (and not in safe mode) and the engine says that it is running and training is complete. The error that I am getting is:
predictionio.ItemRecNotFoundError: request: GET
/engines/itemrec/movie-rec/topn.json {'pio_n': 10, 'pio_uid': '28',
'pio_appkey':
'UsZmneFir39GXO9hID3wDhDQqYNje4S9Ea3jiQjrpHFzHwMEqCqwJKhtAziveC9D'}
/engines/itemrec/movie-rec/topn.json?pio_n=10&pio_uid=28&pio_appkey=UsZmneFir39GXO9hID3wDhDQqYNje4S9Ea3jiQjrpHFzHwMEqCqwJKhtAziveC9D
status: 404 body: {"message":"Cannot find recommendation for user."}
The rest of the tutorial runs as expected, just no predictions ever seem to appear. Can someone please point me in the right direction on how to solve this issue?
Thanks!
Several suggestions:
Check if there is data in PredictionIO's database. I have seen jobs fail because there were some items in the database but no users and no user-to-item actions. Look into the Mongo database appdata - there should be collections named users, items and u2iActions (see the sketch after this list). These collections are only created when you add the first user, item, or u2i action via the API. It's unfortunate that the web interface does not make it clear whether the job completed successfully or not.
Check the logs - PredictionIO logs, and Hadoop logs if you use Hadoop jobs. See if the model training jobs completed. (BTW, did you invoke "Train prediction model now" via the web interface?)
Verify that there is some data in predictionio_modeldata for your algorithm.
Even if the model trains OK, there can still be too little data to produce recommendations for a given user. Try "Random" to get the simplest recommendations, available for everyone, to check whether the system as a whole works.
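A quick way to do the database check from the first suggestion (a sketch with pymongo; adjust the connection string to your MongoDB instance):

# Sketch: verify that appdata actually contains users, items and u2iActions
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # adjust to your setup
appdata = client["appdata"]
existing = set(appdata.list_collection_names())

for name in ("users", "items", "u2iActions"):
    if name in existing:
        print(name, appdata[name].count_documents({}), "documents")
    else:
        print(name, "collection is missing")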