Google ML Engine - Internal Server Error before run of second trial - google-cloud-ml-engine

I am attempting to run a hyper-parameter tuning job on Google ML Engine, but I get an error whenever I run more than one trial within the same job. The error message is: "Internal error occurred. Please retry in a few minutes. If you still experience errors, contact Cloud ML." The job log shows the following:
Job log
Internal Error JSON log
I've been trying to run the same job since Friday but to no avail.

All of your hyperparameters have exactly one possible value, so the first hyperparameter trial exhausted the parameter space and there was nothing new to try for a second trial.
Of course, this should not be communicated as an Internal Error, so I'll make sure that gets fixed.
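For illustration only (this is not the poster's actual config, and parameter names like learning_rate and hidden_units are invented for the example), the trainingInput.hyperparameters block submitted with the job is what defines the search space. A parameter pinned to a single value contributes nothing for later trials to explore, whereas a ranged parameter does:
# Sketch only: the shape of the hyperparameter spec sent with jobs.create.
hyperparameters_spec = {
    "goal": "MAXIMIZE",
    "maxTrials": 10,
    "maxParallelTrials": 2,
    "params": [
        # Exactly one possible value: the space is exhausted after trial 1.
        {
            "parameterName": "hidden_units",
            "type": "DISCRETE",
            "discreteValues": [128],
        },
        # A real range leaves something new to try on trial 2 and beyond.
        {
            "parameterName": "learning_rate",
            "type": "DOUBLE",
            "minValue": 0.0001,
            "maxValue": 0.1,
            "scaleType": "UNIT_LOG_SCALE",
        },
    ],
}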

Related

Dataflow jobs failing and showing no logs

I created pipelines in Dataflow using the standard template JDBC to BigQuery and there are a few jobs that are unexpectedly failing and not showing any logs.
The thing is, when a job fails because of resources (for example, the job needed more vCPUs than were available in the region, or there was not enough memory), those kinds of errors are displayed in the logs, as you can see below.
But some jobs just fail with no logs and the resources are sufficient.
Does anyone know how to find the logs in this case?
Change the severity of the logs. If you choose Default, you should see more logs. Judging by how the job page looks for that failed job, you will probably also need to look at the worker logs.
Depending on the error, the Diagnostics tab may have some summarized info about what kind of error made the job fail.
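If you would rather pull the worker logs from code than from the UI, something along these lines should work (a sketch using the google-cloud-logging Python client; the project and job IDs are placeholders you would replace with your own):
# Sketch: read Dataflow log entries for a job from Cloud Logging.
from google.cloud import logging

PROJECT_ID = "my-project"  # placeholder
JOB_ID = "2017-01-01_00_00_00-1234567890123456789"  # placeholder job ID

client = logging.Client(project=PROJECT_ID)

# dataflow_step entries cover both service and worker logs; drop the
# severity clause to see everything rather than only warnings and errors.
log_filter = (
    'resource.type="dataflow_step" '
    f'AND resource.labels.job_id="{JOB_ID}" '
    'AND severity>=WARNING'
)

for entry in client.list_entries(filter_=log_filter, order_by=logging.DESCENDING):
    print(entry.timestamp, entry.severity, entry.payload)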

Dataflow Job fails with "Unable to bring up enough workers"

My Dataflow job is failing with the following message, how should I debug?
Workflow failed. Causes: (65a939e801f185b6): Unable to bring up enough
workers: minimum 1, actual 0.
The service will output this message when it is unable to allocate a virtual machine from Compute Engine to execute the job. Please check your quota in the console.
I had problems with the same thing. Switching zones solved the problem for me. I believe the service sometimes gives the same error message when there are no free resources in the zone.
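If you want to check the relevant quotas from code rather than in the console, a quick sketch along these lines (Compute Engine API via google-api-python-client; the project and region are placeholders) prints usage against limits, including CPUs and in-use IP addresses:
# Sketch: print Compute Engine quota usage for a region.
from googleapiclient import discovery

compute = discovery.build("compute", "v1")
region = compute.regions().get(project="my-project", region="us-central1").execute()

for quota in region.get("quotas", []):
    print(quota["metric"], quota["usage"], "/", quota["limit"])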

Error creating the GCE VMs or starting Dataflow

I'm getting the following error in the recent jobs I'm trying to submit:
2015-01-07T15:51:56.404Z: (893c24e7fd2fd6de): Workflow failed.
Causes: (893c24e7fd2fd601):
There was a problem creating the GCE VMs or starting Dataflow on the VMs so no data was processed. Possible causes:
1. A failure in user code on in the worker.
2. A failure in the Dataflow code.
Next Steps:
1. Check the GCE serial console for possible errors in the logs.
2. Look for similar issues on http://stackoverflow.com/questions/tagged/google-cloud-dataflow.
There are no other errors.
What does this error mean?
Sorry for the trouble.
The Dataflow service starts up VM instances and then launches an agent on those VMs. Those agents then do the heavy lifting of executing your code (e.g. ParDos, reading and writing your data).
The error indicates the job failed because no agents were requesting work. As a result, the service marked the job as a failure because it wasn't making any progress and never would since there weren't any agents to process your data.
So we need to figure out where in the agent startup process things failed.
The first thing to check is whether the VMs actually started. When you run your job, do you see any VMs created in your project? It might take a minute or two for the VMs to start up, but they should appear shortly after the runner prints out the message "Starting worker pool setup". The VMs should be named something like
<PREFIX-OF-JOB-NAME>-<TIMESTAMP>-<random hexadecimal number>-<instance number>
Only a prefix of the job name is used to ensure we don't exceed GCE name limits.
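If it helps, here is a small sketch (project, zone and job-name prefix are placeholders) that lists Compute Engine instances and filters them by that naming prefix, so you can confirm whether any worker VMs came up at all:
# Sketch: look for Dataflow worker VMs by name prefix.
from googleapiclient import discovery

PROJECT = "my-project"         # placeholder
ZONE = "us-central1-a"         # placeholder
JOB_NAME_PREFIX = "myjobname"  # remember only a prefix of the job name is used

compute = discovery.build("compute", "v1")
result = compute.instances().list(project=PROJECT, zone=ZONE).execute()

workers = [inst["name"] for inst in result.get("items", [])
           if inst["name"].startswith(JOB_NAME_PREFIX)]
print(workers or "no worker VMs found yet")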
If the VMs start up, the next thing to do is to inspect the worker logs for errors indicating problems launching the agent.
The easiest way to access the logs is using the UI. Go to the Google Cloud Console and then select the Dataflow option in the left hand frame. You should see a list of your jobs. You can click on the job in question. This should show you a graph of your job. On the right side you should see a button "view logs". Please click that. You should then see a UI for navigating the logs and you can look for errors.
The second option is to look for the logs on GCS. The location to look for is:
gs://PATH TO YOUR STAGING DIRECTORY/logs/JOB-ID/VM-ID/LOG-FILE
You might see multiple log files. The one we are most interested in is the one that starts with "start_java_worker". If that log file doesn't exist then the worker didn't make enough progress to actually upload the file; or else there might have been a permission problem uploading the log file.
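To list those files from code instead of browsing GCS by hand, a sketch like this (google-cloud-storage client; the bucket name and path are placeholders) will show what the workers managed to upload:
# Sketch: list the worker log files under the staging directory.
from google.cloud import storage

BUCKET = "my-staging-bucket"     # placeholder
PREFIX = "staging/logs/JOB-ID/"  # replace JOB-ID with your actual job ID

client = storage.Client()
for blob in client.list_blobs(BUCKET, prefix=PREFIX):
    # The file(s) whose names start with start_java_worker are the ones to read first.
    print(blob.name)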
In that case the best thing to do is to try to ssh into one of the VMs before it gets torn down. You should have about 15 minutes before the job fails and the VMs are deleted.
Once you login to the VM you can find all the logs in
/var/log/dataflow/...
The log we care most about at this point is:
/var/log/dataflow/taskrunner/harness/start_java_worker-SOME ID.log
If there is a problem starting the code that runs on the VM that log should tell us. That log and the other logs should also tell us if there is a permission problem that prevents the code running on the worker from being able to access Dataflow.
Please take a look and let us know if you find anything.
Apart from Jeremy Lewi's great answer, I would like to add that I've seen this error appear when you don't enable the proper Google APIs in the Developers Console, as mentioned here, which leads to a permission issue, like Jeremy said.

predictionio not producing any predictions

I am trying to test out PredictionIO for the first time. I followed the installation instructions for Linux and developed several test engines. After repeatedly getting the following error on my own datasets, I decided to follow the movie 100k tutorial (https://github.com/PredictionIO/PredictionIO-Docs/blob/cbca03b1c2bad949db951a3a798f0080c48b3674/source/tutorials/movie-recommendation.rst). The same error persists even though Hadoop appears to be running correctly (and not in safe mode) and the engine reports that it is running and training is complete. The error that I am getting is:
predictionio.ItemRecNotFoundError: request: GET
/engines/itemrec/movie-rec/topn.json {'pio_n': 10, 'pio_uid': '28',
'pio_appkey':
'UsZmneFir39GXO9hID3wDhDQqYNje4S9Ea3jiQjrpHFzHwMEqCqwJKhtAziveC9D'}
/engines/itemrec/movie-rec/topn.json?pio_n=10&pio_uid=28&pio_appkey=UsZmneFir39GXO9hID3wDhDQqYNje4S9Ea3jiQjrpHFzHwMEqCqwJKhtAziveC9D
status: 404 body: {"message":"Cannot find recommendation for user."}
The rest of the tutorial runs as expected, just no predictions ever seem to appear. Can someone please point me in the right direction on how to solve this issue?
Thanks!
Several suggestions:
Check if there is data in PredictionIO's database. I have seen jobs fail because there were some items in the database but no users and no user-to-item actions. Look in the Mongo database appdata: there should be collections named users, items and u2iActions. These collections are only created when you add the first user, item or u2i action via the API. Unfortunately, the web interface does not make it clear whether the job completed successfully or not. (A pymongo sketch of this check follows after these suggestions.)
Check the logs: PredictionIO logs, and Hadoop logs if you use Hadoop jobs. See if the model training jobs completed (by the way, did you invoke "Train prediction model now" via the web interface?).
Verify that there is some data in predictionio_modeldata for your algorithm.
Even if the model trained OK, there can still be too little data to produce recommendations for a given user. Try the "Random" algorithm to get the simplest recommendations, available to everyone, to check whether the system as a whole works.
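In case it helps, here is a pymongo sketch of the checks in the first and third suggestions. It assumes a default local MongoDB and the default database names (predictionio_appdata and predictionio_modeldata); adjust those to your own setup:
# Sketch: sanity-check PredictionIO's MongoDB contents.
from pymongo import MongoClient

client = MongoClient("localhost", 27017)  # assumption: default host/port

appdata = client["predictionio_appdata"]  # the app-data database mentioned above
existing = set(appdata.list_collection_names())
for coll in ("users", "items", "u2iActions"):
    if coll in existing:
        print(coll, appdata[coll].count_documents({}))
    else:
        print(coll, "collection missing")

# Trained models land in predictionio_modeldata; it should not be empty
# once "Train prediction model now" has finished.
modeldata = client["predictionio_modeldata"]
print("modeldata collections:", modeldata.list_collection_names())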

SSRS2005 timeout error

I've been running around in circles for the last two days, trying to figure out a problem in our customer's live environment. I figured I might as well post it here, since Google gave me very limited information on the error message (5 results, to be exact).
The error boils down to a timeout when requesting a certain report in SSRS2005, when a certain parameter is used.
The deployment scenario is:
Machine #1: running Reporting Services (SQL2005, W2K3, IIS6)
Machine #2: running the data warehouse database (SQL2005, W2K3), which is the data source for #1
Both machines are running on the same VM cluster and LAN.
The report calls a fairly simple stored procedure - let's call it sp(param $a, param $b).
When requested with param $a filled, it executes correctly. When using param $b, it times out after the global timeout period has passed.
If I run the stored procedure with param $b directly from sql management studio on #2, it returns the results perfectly fine (within 3-4s).
I've profiled the data warehouse database on #2, and when param $b is used, the query from the reporting service never reaches #2.
The error message that I get upon timeout when using param $b, invoking the report directly from the SSRS web interface, is:
"An error has occurred during report processing.
Cannot read the next data row for the data set DataSet.
A severe error occurred on the current command. The results, if any, should be discarded. Operation cancelled by user."
The ExecutionLog for SSRS does not give me much information besides the error message rsProcessingAborted.
I'm running out of ideas of how to nail this problem. So I would greatly appreciate any comments, suggestions or ideas.
Thanks in advance!
The first thing you need to do is to ensure your statistics are up to date.
(It sounds like a case of an incorrect query plan being used due to parameter sniffing, as described in this SO answer: Parameter Sniffing (or Spoofing) in SQL Server).
One way to fix this in SQL Server 2005 is to use the OPTIMIZE FOR query hint. See also OPTIMIZE FOR query hint in SQL Server 2005.
Also, do you have a regular scheduled index rebuild job for some or all of your indexes?
