I am attempting to run a training job on Google's Cloud ML. The signs I have that my job is running are:
Messages such as these indicating the package was built and installed:
INFO 2017-06-07 15:14:01 -0700 master-replica-0 Successfully built training-job-foo
INFO 2017-06-07 15:14:01 -0700 master-replica-0 Installing collected packages: training-job-foo
INFO 2017-06-07 15:14:01 -0700 master-replica-0 Successfully installed training-job-foo-0.1.dev0
INFO 2017-06-07 15:14:01 -0700 master-replica-0 Running command: pip install --user training-job-foo-0.1.dev0.tar.gz
INFO 2017-06-07 15:14:02 -0700 master-replica-0 Processing ./training-job-foo-0.1.dev0.tar.gz
Messages like this indicating that my job is starting:
INFO 2017-06-07 15:14:03 -0700 master-replica-0 Running command: python -m training-job-foo.training_routine_bar --job-dir gs://regional-bucket-similar-to-training-job/output/
A message like this indicating that my scalar summaries are being processed:
INFO 2017-06-07 15:14:21 -0700 master-replica-0 Summary name Total Accuracy is illegal; using Total_Accuracy instead.
Finally, I also see CPU and memory usage increase, and my consumedMLUnits increase.
I should add that I also see the summary FileWriters create the summary files before the jobs are created, but I don't see those files increase in size. I also see an initial checkpoint file written to gs://regional-bucket-similar-to-training-job/output/
Other than that I see no further logs or output. I should be seeing logs, since I print accuracy and loss every so often; I also write summaries and checkpoint files.
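For reference, a simplified sketch of what my training loop is doing (variable names here are illustrative, not my actual code):

import tensorflow as tf

# ... graph construction (loss, accuracy, train_op, summary ops) omitted ...
summary_op = tf.summary.merge_all()
saver = tf.train.Saver()

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    writer = tf.summary.FileWriter(job_dir, sess.graph)
    for step in range(num_steps):
        _, loss_val, acc_val, summary = sess.run(
            [train_op, loss, accuracy, summary_op])
        if step % 100 == 0:
            # these prints are what I expect to show up in the job logs
            print('step %d: loss=%f accuracy=%f' % (step, loss_val, acc_val))
            writer.add_summary(summary, step)
            writer.flush()  # flush so the events file in GCS actually grows
            saver.save(sess, job_dir + '/model.ckpt', global_step=step)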
What am I missing?
Also, what other debugging tools are available in such scenarios? All I am doing currently is streaming the logs, watching the job status, CPU usage, and memory usage in the Cloud ML console, and watching my Cloud Storage bucket for any changes.
Sorry that you are experiencing issues. Currently, the available debugging tools are job logs, metrics, and TensorBoard, but it seems none of these can be used in your case.
If possible, please send us your project number and job id at cloudml-feedback@google.com so that we can take a closer look.
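In case it helps in the meantime: assuming a standard setup, you can stream the job logs locally with the Cloud SDK and point TensorBoard at the output directory (the job id below is a placeholder):

gcloud ml-engine jobs stream-logs my_training_job
tensorboard --logdir=gs://regional-bucket-similar-to-training-job/output/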
I started the launch process using the docker-compose up command, and it has been processing for the last 3 hours. Does it usually take this long?
This is running in the Visual Studio terminal.
I am having some performance problems when starting Jenkins inside a Kubernetes cluster.
One of the steps that sometimes occurs and takes a long time is the following operation:
INFO: Finished Download metadata. 1,397 ms
In this case it is just over 1 second, but sometimes it takes around 40 seconds. I have tried to find this log message in Jenkins core but have not found it, so I suspect it comes from a plugin. My question is: where is this happening, what is it doing, and why is it required?
Thanks.
Feb 10, 2018 2:04:22 PM hudson.model.AsyncPeriodicWork$1 run
INFO: Started Download metadata
Feb 10, 2018 2:04:22 PM hudson.model.AsyncPeriodicWork$1 run
INFO: Finished Download metadata. 4 ms
I believe you are referring to logs like the ones above. If so, these are the log rotation strategy logs that get executed through the AsyncPeriodicWork class; this is configured in Jenkins specifically for discarding old builds.
(The screenshot originally attached here showed the corresponding 'Discard old builds' section of the job configuration in the Jenkins UI.)
You can configure this appropriately based on your project requirements if you feel it is impacting your startup time.
I've been running batch jobs with DataflowRunner for over a week without a problem, but all of a sudden, starting today, the jobs began failing with the error message below. The workers don't seem to start, and there are no logs in Stackdriver at all.
Anything I'm missing here?
Dataflow SDK version: 2.0.0
Submitted job: 2017-08-29_09_43_20-9537473353894635176
2017-08-29 16:44:24 ERROR MonitoringUtil$LoggingHandler:101 - 2017-08-29T16:44:22.277Z: (54a5da9d57fd266d): Workflow failed.
EDIT:
If I remove --zone=europe-west2-b from the batch run, it works, which indicates that there might be something wrong with this zone.
I took a look at your job. It failed because it couldn't get quota to bring up the workers; you likely do not have quota in that zone. This error is not reported back correctly, but that should be fixed in the next release.
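If you want to check the quota yourself (assuming the Cloud SDK is installed), the per-region usage and limits can be listed with something like:

gcloud compute regions describe europe-west2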
I followed the instructions given here to set up my machine to run SyntaxNet. I have installed all the required software and ensured the versions match the instructions. But when I run the bazel tests on macOS with the command bazel test --linkopt=-headerpad_max_install_names syntaxnet/... util/utf8/..., they fail every time. I'm getting the following error message:
Sending SIGTERM to previous Bazel server (pid=42104)... Sending SIGKILL to previous Bazel server process group (pid=42104)... Error: SIGKILL unsuccessful after 10s: Operation not permitted
Not sure what's going wrong. Kindly advise.
I have a web application on RoR 2.1 with a MySQL backend, up and running with around 8k users, and now I want to do a load test on my web app and server to figure out the load on the server and the average and peak number of concurrent users.
What are the ways of implementing this load test to analyse the load on the server and the performance of the web application, with a way to figure out the average and peak number of concurrent users?
I'm using ab (Apache Bench, http://httpd.apache.org/docs/2.0/programs/ab.html) for load tests. Example of testing google.com:
ab -n 10000 -c 100 http://google.com/
It allows me to investigate how many requests per second my setup (application) can handle, as well as the concurrency level.
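The interesting part of ab's output looks roughly like this (numbers here are illustrative):

Concurrency Level:      100
Time taken for tests:   12.345 seconds
Complete requests:      10000
Failed requests:        0
Requests per second:    810.04 [#/sec] (mean)
Time per request:       123.456 [ms] (mean)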
The ab tool is part of the Apache httpd package on CentOS and Red Hat distributions, so it is probably already installed there. On Ubuntu/Debian, install the apache2-utils package.
Run ab --help for the full list of options.
The most important are:
-n requests Number of requests to perform
-c concurrency Number of multiple requests to make
Also, I'm monitoring peaks of activity with munin (http://munin-monitoring.org/) and plugins for nginx/passenger/unicorn/CPU/memory, depending on the configuration, as well as a plugin for MySQL which shows the total number of queries per second and much more.
You can install munin using the appropriate tutorial for your Linux distribution from this page: http://munin-monitoring.org/wiki/LinuxInstallation.
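On Ubuntu/Debian, for example, the install is roughly (package names may differ by release):

sudo apt-get install munin munin-node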
There is also quite a nice article about munin and mongrel monitoring here:
http://onrails.org/2007/08/31/monitoring-rails-performance-with-munin-and-a-mongrel
You can pick up plugins for Apache (and more) monitoring from http://exchange.munin-monitoring.org.
The good thing about all this is that it doesn't require changing the application, so you can just install it and use it without any changes to your production setup.