Dataflow Job failing with ZONE_RESOURCE_POOL_EXHAUSTED error in us-central1 and northamerica-northeast1 - google-cloud-dataflow

I'm trying to follow this GCP guide for importing CSV files from a Google Cloud Storage bucket into Cloud Spanner with GCP Dataflow.
The first time I ran the job it failed because of some problems with the format of my CSV files and the manifest JSON. However, after fixing those issues I keep running into this error:
Startup of the worker pool in zone us-central1-b failed to bring up any of the desired 2 workers. ZONE_RESOURCE_POOL_EXHAUSTED: Instance 'import-XXXXX' creation failed: The zone 'projects/mycoolproject/zones/us-central1-b' does not have enough resources available to fulfill the request. Try a different zone, or try again later.
After looking through the GCP docs, the only reference I found to this error was this page here, which suggests simply waiting (it doesn't say how long one should expect to wait) or moving the job to a different location. So I tried running the job in northamerica-northeast1 and I'm getting the exact same error.
I'm following the GCP Dataflow/Spanner CSV import guide step by step and I can't figure out what I'm doing wrong. I've never used Dataflow before so maybe there's something obvious I'm missing?
I should also note that my team doesn't use any Compute Engine resources, but the docs don't say anything about having to manually enable such resources, only the Dataflow API.
What am I doing wrong?
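For reference, the region and worker zone can be set explicitly when launching the import template from the CLI; this is a sketch of what I understand that to look like (the template path, flags, and parameter names are assumptions based on the import guide, and the values are placeholders):

gcloud dataflow jobs run spanner-csv-import \
    --gcs-location=gs://dataflow-templates/latest/GCS_Text_to_Cloud_Spanner \
    --region=northamerica-northeast1 \
    --zone=northamerica-northeast1-a \
    --parameters=instanceId=my-instance,databaseId=my-database,importManifest=gs://my-bucket/manifest.json

(On newer gcloud releases the zone flag may be --worker-zone instead of --zone.)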

Related

Fargate logging issue using log4j2

We have a Fargate service running. On CloudWatch we can see the ECS/ContainerInsights -> StorageWriteBytes metric growing every hour, and at some point it stops increasing, probably because the task is out of disk space. We then start to see log errors unless we force a new deployment of the ECS service. The error looks like:
error: org.apache.logging.log4j.core.appender.AppenderLoggingException: Error
writing to RandomAccessFile /apollo/env/ReaverFeatureGating/var/output/logs/application.log.%d{yyyy-MM-dd-HH}
Questions:
1. Is this normal for all Fargate services? Did we set something up wrong?
2. Can we remove all the AmazonRollingRandomAccessFile appenders and just use STDOUT in log4j2-container.xml? Will that still send our events to CloudWatch, just without writing to disk?
After some research, this is what I found:
Because the default template includes AmazonRollingRandomAccessFile, the logs are written locally but never cleaned up. There are suggestions to add a cron job that deletes the logs, but in our case we don't need the local logs at all.
Yes, CloudWatch just needs STDOUT.
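A minimal console-only log4j2-container.xml would look roughly like this (a sketch, not the exact AWS default template; the pattern and logger levels are placeholders):

<?xml version="1.0" encoding="UTF-8"?>
<Configuration status="WARN">
  <Appenders>
    <!-- Write everything to STDOUT; the ECS log driver ships container stdout/stderr to CloudWatch -->
    <Console name="STDOUT" target="SYSTEM_OUT">
      <PatternLayout pattern="%d{ISO8601} [%t] %-5level %logger{36} - %msg%n"/>
    </Console>
  </Appenders>
  <Loggers>
    <Root level="info">
      <AppenderRef ref="STDOUT"/>
    </Root>
  </Loggers>
</Configuration>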
Also, StorageWriteBytes only represents how many bytes are read from / written to storage; it is not the same as the used disk space. To monitor disk space, we can build the CloudWatch Agent into the container image and then use the disk_used metric.
https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/metrics-collected-by-CloudWatch-agent.html
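If you do build the CloudWatch Agent into the image, the disk section of the agent configuration is what publishes disk-usage metrics. A rough sketch of that fragment (field names per the linked docs; the mount path and interval are placeholders):

{
  "metrics": {
    "metrics_collected": {
      "disk": {
        "measurement": ["used_percent", "used"],
        "resources": ["/"],
        "metrics_collection_interval": 60
      }
    }
  }
}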

Tasks will not run in Spring Cloud Data Flow (Docker/K8S)

Last week I installed the Docker/Kubernetes-based version of Spring Cloud Data Flow.
Although there were no overt errors, things are not working correctly.
I am able to create streams and tasks in the web UI and Spring Cloud Data Flow Shell but nothing runs.
I am most interested in Tasks.
When I create them, they all show with a Task Status of UNKNOWN.
Unfortunately, no matter how many times I launch them, the status always remains UNKNOWN.
I'm able to delete them but what magic must I use to make them run?
There's nothing apparent from the description as to what has failed. Perhaps if you can update it with more details, it'd be useful.
From a troubleshooting standpoint, when deploying streams or launching tasks fails for any reason, the failures will be logged in the SCDF-server/Skipper-server logs. You'd have to tail the logs of the respective pod to learn more about them.
Also, it'd be useful to check the output of kubectl describe pod/<POD_NAME> to see what's causing the stream/task pods not to start successfully. The reasons are usually listed towards the end of that command's output.
The usual suspects are pod health-check failures and/or stream/task application Docker images that aren't resolvable at runtime. You'll see the reasons in the logs, of course.
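For example (pod names are placeholders; use whatever kubectl get pods shows for your install):

kubectl get pods                            # find the scdf-server / skipper-server / task pods
kubectl describe pod <TASK_POD_NAME>        # check Events near the end for image-pull or probe failures
kubectl logs -f <SCDF_SERVER_POD_NAME>      # tail the SCDF server log while launching the task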
This was a misconfiguration on my end.
I'm able to run as expected now.

"This DAG seems to be existing only locally. The master scheduler doesn't seem to be aware of its existence."

I started experimenting with Google Cloud Composer, where I deployed a few DAGs:
One of my DAGs shows an info statement indicating "This DAG seems to be existing only locally. The master scheduler doesn't seem to be aware of its existence." and cannot run, even manually. When I start it manually it stays in the "running" state forever and never starts the first task.
As explained in detail below, the only difference between the two DAGs is that the broken one uses a custom operator.
Do you have any idea what's wrong here and how I can fix it?
Thanks
hello2_gcp_plugins_v2 only calls the bash and email operators and works as expected (I received the email). If I configure a schedule_interval it starts as expected. Even if I set the schedule_interval to None, it works well when I start it manually.
hello2_gcp_plugins_v5 calls a custom operator that I already deployed to the expected bucket. The custom operator just calls an API via the HttpHook to get data and uploads it to a GCS bucket via the GoogleCloudStorageHook. Whether the schedule_interval is set or kept to None, I always see the info statement in the UI and the DAG never starts automatically. When started manually it stays in the running state forever and the first task is never triggered.
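For context, the custom operator is shaped roughly like this (a simplified sketch with made-up class, connection, and endpoint names; it is not my exact code):

from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults
from airflow.hooks.http_hook import HttpHook
from airflow.contrib.hooks.gcs_hook import GoogleCloudStorageHook


class ApiToGcsOperator(BaseOperator):
    """Fetch data from an HTTP API and upload the response to a GCS bucket."""

    @apply_defaults
    def __init__(self, endpoint, bucket, object_name,
                 http_conn_id='http_default', gcp_conn_id='google_cloud_default',
                 *args, **kwargs):
        super(ApiToGcsOperator, self).__init__(*args, **kwargs)
        self.endpoint = endpoint
        self.bucket = bucket
        self.object_name = object_name
        self.http_conn_id = http_conn_id
        self.gcp_conn_id = gcp_conn_id

    def execute(self, context):
        # Call the API via the HttpHook; run() returns a requests Response
        response = HttpHook(method='GET', http_conn_id=self.http_conn_id).run(self.endpoint)

        # Write the payload to a local temp file, then upload it with the GCS hook
        local_path = '/tmp/{}'.format(self.object_name)
        with open(local_path, 'w') as f:
            f.write(response.text)

        GoogleCloudStorageHook(google_cloud_storage_conn_id=self.gcp_conn_id).upload(
            bucket=self.bucket, object=self.object_name, filename=local_path)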
I'm answering my own question since I fixed it; this may be useful if someone else runs into the same trouble.
Even though it's not obvious, the message This DAG seems to be existing only locally. The master scheduler doesn't seem to be aware of its existence. was due to a buggy operator used in my DAG, in my case one of my custom operators.
To debug it, I clicked on the DAG -> Graph View -> my custom operator -> Task Instance Details, and the stack trace of the error in my operator was displayed.
I fixed my operator, uploaded the new version to the GCS bucket, and after a few refreshes the Web UI no longer showed the info message and my DAG was running.
This can also happen if you add a new DAG without stopping the scheduler and it hasn't yet refreshed the DAGs folder to pick up the new DAGs. You can change the scheduler refresh interval in airflow.cfg to make it refresh more quickly.
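For reference, the relevant setting is in the [scheduler] section of airflow.cfg; to my knowledge it's dag_dir_list_interval (seconds between scans of the DAGs folder):

[scheduler]
# how often (in seconds) to scan the DAGs directory for new files
dag_dir_list_interval = 60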

Dataflow Workers unable to connect to Dataflow Service

I am using Google Dataprep to start Dataflow jobs and am facing some difficulties.
For background, we had used Dataprep for some weeks and it worked without problems, until we started to have authorization issues with the service account. When we finally solved those, we relaunched the jobs we used to run, but they failed with "The Dataflow appears to be stuck.".
We tried another very simple job but hit the same error. Here are the full error messages; the job fails after being stuck for one hour:
Dataflow -
(1ff58651b9d6bab2): Workflow failed. Causes: (1ff58651b9d6b915): The Dataflow appears to be stuck.
Dataprep -
The Dataflow job (ID: 2017-11-15_00_23_23-9997011066491247322) failed. Please
contact Support and provide the Dataprep Job ID 20825 and the Dataflow Job ID.
It seems this kind of error has various origins, and I have no clue where to start.
Thanks in advance
Please check whether there have been any changes to your project's default network. This is the most common reason for workers not being able to contact the service, causing 1-hour timeouts.
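A quick way to verify is to confirm the default network (or whichever network the job uses) still exists and looks as expected, for example:

gcloud compute networks list
gcloud compute networks describe default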
Update:
After looking into it further, the <project-number>-compute@developer.gserviceaccount.com service account for Compute Engine is missing from the 'Editor' role. This account is usually created automatically; it was probably removed later by mistake. See the 'Compute Engine Service Account' section in https://cloud.google.com/dataflow/security-and-permissions.
We are working on fixes to improve early detection of such missing permissions so that the failure points to the root cause more clearly.
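If the account still exists but has merely lost the role, it can be granted again with something like the following (the broad Editor role is the one the Dataflow documentation above refers to; substitute your own project ID and number):

gcloud projects add-iam-policy-binding <PROJECT_ID> \
    --member="serviceAccount:<PROJECT_NUMBER>-compute@developer.gserviceaccount.com" \
    --role="roles/editor"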
This implies your other Dataflow jobs fail in the same way as well.
The best route would be to contact Google Support: the issue is on the Dataflow side and would require some more research on the Dataflow backend by Google.

Error creating the GCE VMs or starting Dataflow

I'm getting the following error in the recent jobs I'm trying to submit:
2015-01-07T15:51:56.404Z: (893c24e7fd2fd6de): Workflow failed.
Causes: (893c24e7fd2fd601):
There was a problem creating the GCE VMs or starting Dataflow on the VMs so no data was processed. Possible causes:
1. A failure in user code on the worker.
2. A failure in the Dataflow code.
Next Steps:
1. Check the GCE serial console for possible errors in the logs.
2. Look for similar issues on http://stackoverflow.com/questions/tagged/google-cloud-dataflow.
There are no other errors.
What does this error mean?
Sorry for the trouble.
The Dataflow service starts up VM instances and then launches an agent on those VMs. Those agents then do the heavy lifting of executing your code (e.g. ParDos, reading and writing your data).
The error indicates the job failed because no agents were requesting work. As a result, the service marked the job as a failure because it wasn't making any progress and never would since there weren't any agents to process your data.
So we need to figure out where in the agent startup process things failed.
The first thing to check is whether the VMs actually started. When you run your job, do you see any VMs created in your project? It might take a minute or two for the VMs to start up, but they should appear shortly after the runner prints out the message "Starting worker pool setup". The VMs should be named something like
<PREFIX-OF-JOB-NAME>-<TIMESTAMP>-<random hexadecimal number>-<instance number>
Only a prefix of the job name is used to ensure we don't exceed GCE name limits.
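One quick way to check this (besides the Cloud Console) is to list the instances and filter by the job-name prefix, for example:

gcloud compute instances list | grep <PREFIX-OF-JOB-NAME>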
If the VMs start up, the next thing to do is inspect the worker logs and look for errors indicating problems launching the agent.
The easiest way to access the logs is using the UI. Go to the Google Cloud Console and then select the Dataflow option in the left hand frame. You should see a list of your jobs. You can click on the job in question. This should show you a graph of your job. On the right side you should see a button "view logs". Please click that. You should then see a UI for navigating the logs and you can look for errors.
The second option is to look for the logs on GCS. The location to look for is:
gs://PATH TO YOUR STAGING DIRECTORY/logs/JOB-ID/VM-ID/LOG-FILE
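For example, you can browse that location with gsutil (the staging path, job ID, and VM ID are placeholders):

gsutil ls gs://<PATH-TO-YOUR-STAGING-DIRECTORY>/logs/<JOB-ID>/
gsutil ls gs://<PATH-TO-YOUR-STAGING-DIRECTORY>/logs/<JOB-ID>/<VM-ID>/
gsutil cat gs://<PATH-TO-YOUR-STAGING-DIRECTORY>/logs/<JOB-ID>/<VM-ID>/<LOG-FILE>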
You might see multiple log files. The one we are most interested in is the one that starts with "start_java_worker". If that log file doesn't exist then the worker didn't make enough progress to actually upload the file; or else there might have been a permission problem uploading the log file.
In that case the best thing to do is to try to ssh into one of the VMs before it gets torn down. You should have about 15 minutes before the job fails and the VMs are deleted.
Once you login to the VM you can find all the logs in
/var/log/dataflow/...
The log we care most about at this point is:
/var/log/dataflow/taskrunner/harness/start_java_worker-SOME ID.log
If there is a problem starting the code that runs on the VM that log should tell us. That log and the other logs should also tell us if there is a permission problem that prevents the code running on the worker from being able to access Dataflow.
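Concretely, the SSH-and-tail step usually looks something like this (the instance name and zone are placeholders; take them from the VM list in your project):

gcloud compute ssh <VM-NAME> --zone=<ZONE>
ls /var/log/dataflow/
tail -n 200 /var/log/dataflow/taskrunner/harness/start_java_worker-*.log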
Please take a look and let us know if you find anything.
Apart from Jeremy Lewi's great answer, I would like to add that I've seen this error appear when you don't enable the proper Google APIs in the Developers Console, as mentioned here, which leads to a permission issue, like Jeremy said.
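For anyone hitting this today, the APIs can also be enabled from the command line; the exact set a job needs can vary, but something like the following covers the usual minimum (service names are my assumption of the typical Dataflow requirements):

gcloud services enable dataflow.googleapis.com compute.googleapis.com logging.googleapis.com storage-component.googleapis.com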
