Running Google Cloud ML training job but getting no stdout output in logs

I've built a trainer, and when I submit the job, the job starts and the logs get populated. But none of my output to stdout ever appears in the log. I do get messages like "The TensorFlow library wasn't compiled to use AVX2 instructions..."
The entire job takes about 5 to 10 minutes on my laptop; I let it run for over an hour on the cloud server and still never saw any output (the first line of output appears almost immediately when I run it locally).
I can run my job locally by invoking it directly, but I haven't been able to get it to run using the "gcloud local" command: when I do, I get the error "No module named tensorflow".

The log message "The TensorFlow library wasn't compiled to use AVX2 instructions" indicates that log messages are flowing from TensorFlow to Cloud Logging. So most likely there is a problem with the way you have configured logging, and as a result log messages aren't being correctly written to stderr/stdout.
The easiest way to debug this would be to create a simple example that reproduces the error.
I'd suggest creating a simple Python program that does nothing but log a message, and then submitting that to the service to see if the message is printed.
Something like the following:
import logging
import time

if __name__ == "__main__":
    logging.getLogger().setLevel(logging.INFO)

    # Output logs for 5 minutes to ensure the job doesn't
    # terminate before the logs can be flushed.
    for i in range(30):
        logging.info("This is an info message.")
        logging.error("This is an error message.")
        time.sleep(10)
For the issue importing TensorFlow when running locally, please take a look at this SO question, which has some suggestions on how to check the Python path used by gcloud and verify that it includes TensorFlow.
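As a quick sanity check, here is a minimal snippet (run it with the same interpreter that gcloud uses; nothing in it is specific to the question's code) to confirm where Python looks for modules and whether TensorFlow is importable:

import sys

# Show which interpreter is running and where it searches for modules.
print(sys.executable)
print("\n".join(sys.path))

# If this raises ImportError, TensorFlow isn't installed for this interpreter.
import tensorflow as tf
print(tf.__version__)

If the import fails here too, installing TensorFlow into that environment (or pointing gcloud at the right Python) should clear the "No module named tensorflow" error.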

Related

Metaplex Candy machine mainnet NFT depoly issue

I have made a Solana NFT using the Metaplex Candy Machine.
I uploaded 1000 NFTs, but the Candy Machine UI shows an available count of 985.
I lost 15 NFTs.
Also, when I clicked the Mint button, the count dropped to 3 at once, and I can't see the NFT in my Phantom wallet.
It worked perfectly on devnet, but after deploying to mainnet, the above error occurred.
Please help me with this issue. How can I fix it?
Can I retrieve the lost NFTs?
I have not seen this exact error, but it could be because you did not run the verify_upload command after you uploaded.
Source:
https://docs.metaplex.com/candy-machine-v2/verify-upload
This is always recommended, as network issues can cause some transactions to fail in large uploads, and the CLI won't retry them when they fail. The only way to confirm that everything was uploaded is:
ts-node ~/metaplex/js/packages/cli/src/candy-machine-v2-cli.ts verify_upload -e devnet -k ~/.config/solana/devnet.json -c example
If this fails with:
Error: not all NFTs checked out. check out logs above for details
then just rerun the upload command and verify again until it outputs
Ready to deploy!
You can rerun the upload as many times as needed until verification passes.

pytest: capture stdout/stderr at setup/teardown

In my tests, I use a fixture to run a web server Docker container via docker-py with detach=True.
When a test fails, I want to output the container logs. In principle this is achieved with
print(container.logs().decode(), file=sys.stderr)
on fixture teardown; but that way I get logs even for successful tests, not only for failed ones, unlike when I print the logs in the test body.
What's the best way to output the logs so the behavior is similar to printing them in the test bodies as they come?
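For context, a minimal sketch of the fixture described above, assuming docker-py (the image name and port mapping are illustrative):

import sys

import docker
import pytest

@pytest.fixture
def web_server():
    client = docker.from_env()
    # Start the container detached so the test can run against it.
    container = client.containers.run("nginx:alpine", detach=True,
                                      ports={"80/tcp": 8080})
    yield container
    # Teardown: this print runs unconditionally, so the logs appear even
    # for passing tests, which is the behavior in question.
    print(container.logs().decode(), file=sys.stderr)
    container.remove(force=True)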

Query on custom metrics exposed via prometheus node exporter textfile collector fails

I am new to Prometheus/Alertmanager.
I have created a cron job which executes a shell script every minute. This shell script generates a "test.prom" file (with a gauge metric in it) in the same directory that is assigned to the --textfile.collector.directory argument of node-exporter. I verified (using curl http://localhost:9100/metrics) that node-exporter exposes that custom metric correctly.
When I run a query against that custom metric in the Prometheus dashboard, it does not show any results (it says "no data found").
I could not figure out why the query against the metric exposed via the node-exporter textfile collector fails. Any clues as to what I missed? Also, please let me know how to check and ensure that Prometheus scraped my custom metric `test_metric`.
My query in the Prometheus dashboard is test_metric != 0, which did not give any results, even though I exposed test_metric via the node-exporter textfile collector.
Any help is appreciated!
BTW, node-exporter is running as a Docker container in a Kubernetes environment.
I had a similar situation, but it was not a configuration problem.
Instead, my data included timestamps:
# HELP network_connectivity_rtt Round Trip Time to each node
# TYPE network_connectivity_rtt gauge
network_connectivity_rtt{host="home"} 53.87 1541426242
network_connectivity_rtt{host="hop_1"} 58.8 1541426242
network_connectivity_rtt{host="hop_2"} 21.93 1541426242
network_connectivity_rtt{host="hop_3"} 71.69 1541426242
The node exporter was picking them up without any problem once I reloaded it. As Prometheus was running under systemd, I had to check its logs like this:
journalctl --system -u prometheus.service --follow
There I read this line:
msg="Error on ingesting samples that are too old or are too far into the future"
Once I removed the timestamps, the values started appearing. This led me to read about the timestamps in more detail, and I found out they have to be in milliseconds. So this format is OK:
# HELP network_connectivity_rtt Round Trip Time to each node
# TYPE network_connectivity_rtt gauge
network_connectivity_rtt{host="home"} 50.47 1541429581376
network_connectivity_rtt{host="hop_1"} 3.38 1541429581376
network_connectivity_rtt{host="hop_2"} 11.2 1541429581376
network_connectivity_rtt{host="hop_3"} 20.72 1541429581376
I hope it helps someone else.
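If you do want timestamps, here is a hedged sketch of a generator script (the directory, metric name, and value are illustrative, not from the question) that emits a millisecond timestamp and replaces the .prom file atomically, so the collector never reads a half-written file:

import os
import time

TEXTFILE_DIR = "/var/lib/node_exporter/textfile"  # assumed collector directory
timestamp_ms = int(time.time() * 1000)  # exposition-format timestamps are in ms

lines = [
    "# HELP test_metric An example gauge written by a cron job",
    "# TYPE test_metric gauge",
    f"test_metric 42 {timestamp_ms}",
]

# Write to a temp file, then rename: the rename is atomic on the same filesystem.
tmp_path = os.path.join(TEXTFILE_DIR, "test.prom.tmp")
with open(tmp_path, "w") as f:
    f.write("\n".join(lines) + "\n")
os.rename(tmp_path, os.path.join(TEXTFILE_DIR, "test.prom"))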
It was my bad: I had not included scrape instructions for node-exporter in the prometheus.yaml file. It worked after including them.
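For reference, the missing piece is a scrape_configs entry along these lines (a hedged example; the job name and target address are placeholders):

scrape_configs:
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['localhost:9100']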
This issue can also happen because of stale metrics.
Let's say you wrote your metric to the file at 13:00. By default, after 5 minutes Prometheus will consider your metric stale, and it may have disappeared by the time you run your query.
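As a quick check (assuming the metric name from the question), run a range query in the Prometheus console view, which shows raw samples rather than applying the instant-query lookback:

test_metric[15m]

If the newest sample shown is more than about 5 minutes old, staleness explains why the instant query test_metric != 0 comes back empty.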

Error while running model training in google cloud ml

I want to run model training in the cloud. I am following this link, which runs sample code to train a model on the flower dataset. The tutorial consists of 4 stages:
Set up your Cloud Storage bucket
Preprocessing training and evaluation data in the cloud
Run model training in the cloud
Deploying and using the model for prediction
I was able to complete steps 1 and 2; however, in step 3 the job is submitted successfully, but then an error occurs and the task exits with a non-zero status of 1. Here is the log of the task:
Screenshot of the expanded log:
I used following command:
gcloud ml-engine jobs submit training test${JOB_ID} \
  --stream-logs \
  --module-name trainer.task \
  --package-path trainer \
  --staging-bucket ${BUCKET_NAME} \
  --region us-central1 \
  --runtime-version=1.2 \
  -- \
  --output_path "${GCS_PATH}/training" \
  --eval_data_paths "${GCS_PATH}/preproc/eval*" \
  --train_data_paths "${GCS_PATH}/preproc/train*"
Thanks in advance!
Can you please confirm that the input files (eval_data_paths and train_data_paths) are not empty? Additionally, if you are still having issues, please file an issue at https://github.com/GoogleCloudPlatform/cloudml-samples, since it's easier to handle the issue on GitHub.
I ran into the same issue and couldn't figure it out, so I followed this tutorial again from the git clone step, and there was no error after running on GCS.
It is clear from your error message
The replica worker 1 exited with a non-zero status of 1. Termination reason: Error
that you have some programming error (a syntax error, an undefined name, etc.).
For more information, check the return codes and their meanings:
Return code | Meaning               | Cloud ML Engine response
----------- | --------------------- | ---------------------------------------
0           | Successful completion | Shuts down and releases job resources.
1-128       | Unrecoverable error   | Ends the job and logs the error.
You need to find the bug first and fix it, then try again.
I recommend running your task locally (if your configuration supports it) before you submit it to the cloud. If there is a bug, you can fix it easily on your local machine.
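For reference, a local run corresponding to the job above might look like this (a hedged sketch: the flags mirror the submit command, and the output path may need adjusting for a local run):

gcloud ml-engine local train \
  --module-name trainer.task \
  --package-path trainer \
  -- \
  --output_path "${GCS_PATH}/training" \
  --eval_data_paths "${GCS_PATH}/preproc/eval*" \
  --train_data_paths "${GCS_PATH}/preproc/train*"

Any syntax error or missing import then surfaces immediately in your terminal instead of in the cloud job logs.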

VBS printer script executing error

I am having trouble executing/using the VBS scripts linked to printers. They are located in %windir%\System32\Printing_Admin_Scripts.
The objective is to schedule a weekly print task to preserve the ink cartridges.
Looking at the scripts, everything I needed to create this task was available.
The main script to use is prnqctl.vbs.
Before creating my task, I tried to test the script, and this is what I got (sorry for the French version; I will try to update the screenshot in English later):
There is obviously something wrong.
I have tried to google the error code, with nothing conclusive.
I have tried to run the script in admin mode and also under an admin session: same problem.
I have done some research on CIMWin32; it seems to be a DLL, and I can find it in several locations on my filesystem.
My OS is Windows 8.1.
If anybody has a suggestion or solution, I'm interested.
==>cscript C:\Windows\System32\Printing_Admin_Scripts\en-US\prnqctl.vbs -e
Unable to get printer instance. Error 0x80041002 Not found
Operation GetObject
Provider CIMWin32
Description
Win32 error code
The culprit of the error is clear: you need to provide a valid -p argument. It's a mandatory parameter for the -e operation:
==>cscript C:\Windows\System32\Printing_Admin_Scripts\en-US\prnqctl.vbs -e -p "Fax"
Success Print Test Page Printer Fax
==>
