Tasks will not run in Spring Cloud Data Flow (Docker/K8S)

Last week I installed the Docker/Kubernetes-based version of Spring Cloud Data Flow.
Although there were no overt errors, things are not working correctly.
I am able to create streams and tasks in the web UI and Spring Cloud Data Flow Shell but nothing runs.
I am most interested in Tasks.
When I create them, they all show with a Task Status of UNKNOWN.
Unfortunately, no matter how many times I launch them, the status always remains UNKNOWN.
I'm able to delete them but what magic must I use to make them run?

There's nothing apparent from the description as to what has failed. It would be useful if you could update it with more details.
From a troubleshooting standpoint, when a stream deployment or a task launch fails for any reason, the failure will be logged in the SCDF-server/Skipper-server logs. You'd have to tail the logs of the respective pod to learn more about the failures.
It would also be useful to check the output of kubectl describe pod/<POD_NAME> to see what's causing the stream/task pods not to start successfully; the reasons are usually listed towards the end of that command's output.
The usual suspects are pod health-check failures and/or stream/task application Docker images that aren't resolvable at runtime. You'll see the reasons in the logs, of course.
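For example, a minimal set of commands for that kind of troubleshooting could look like this (the deployment names below are assumptions; use whatever kubectl get pods reports in your namespace):
kubectl get pods                           # find the Data Flow, Skipper, and stream/task pods
kubectl logs -f deployment/scdf-server     # tail the Data Flow server log for task-launch errors
kubectl logs -f deployment/skipper-server  # tail the Skipper server log for stream-deployment errors
kubectl describe pod/<POD_NAME>            # the Events section at the end shows image-pull and probe failures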

This was a misconfiguration on my end.
I'm able to run as expected now.

Related

How to get Cloud Run to handle multiple simultaneous deployments?

I've got a project with 4 components, and every component has hosting set up on Google Cloud Run, with separate deployments for testing and for production. I'm also using Google Cloud Build to handle the build and deployment of the components.
Due to the lack of good webhook events from the source system, I'm currently forced to trigger a rebuild of all components in the project every time there is a new change. For this project that means 8 different images to build and deploy, as testing and production use different build-time settings as well.
I've managed to optimize Cloud Build to handle the 8 concurrent builds pretty nicely, but they all finish around the same time, and then all 8 are pushed to Cloud Run. Cloud Run often does not seem to like this at all and starts throwing errors at me that I've been unable to resolve.
The first and more serious problem is that often only about 4-6 of the 8 deployments go through as expected, and the rest are either significantly delayed or just fail: the first few go through fine, then a few with significant delays, and the final 1-2 simply fail. This seems to be caused by some "reconciliation request quota" being exhausted in the region (in this case europe-north1), as that is the error shown at the top of the Cloud Run service view.
Additionally, and mostly just annoyingly, the Cloud Run dashboard itself does not seem to handle having 8 services deployed: just sitting on the dashboard view listing the services regularly throws another error related to some read quotas.
I've tried contacting Google via their recommended "Send feedback" button but have received no reply in over a week (I can't say exactly when I sent it, because they don't seem to confirm receipt).
One option I could try to improve the situation is to deploy the "testing" and "production" variants in different regions, but that would be less than optimal, and this feels like a simple limit configuration somewhere. Are there other options I should consider? Or should I just set up some synchronization so that not all deployments are fired at once?
Optimizing away the need to build and deploy all components at once is not really an option here, since they share some code, and when that changes I would still need to rebuild everything.
This is an issue with Cloud Run. Developers are expected to be able to deploy many services in parallel.
The bug should be fixed within a few days or a couple of weeks.
[update] Bug should now be fixed.
Make sure to use the --async flag if you want to deploy in parallel: gcloud run deploy $SERVICE --image $IMAGE --async
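As a rough sketch (the service and image names here are made up; adjust the region, project, and flags to your setup), the deployments could be fired off in parallel like this:
for SERVICE in comp1-test comp1-prod comp2-test comp2-prod; do
  # --async returns as soon as the new revision is submitted instead of waiting for the rollout
  gcloud run deploy "$SERVICE" --image "gcr.io/$PROJECT_ID/$SERVICE:latest" --region europe-north1 --async &
done
wait  # waits only for the gcloud invocations, not for the rollouts themselves
You can then check each rollout afterwards with gcloud run services describe <SERVICE>.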

How to use a scheduler(cron) container to execute commands in other containers

I've spent a fair amount of time researching and I've not found a solution to my problem that I'm comfortable with. My app is working in a dockerized environment:
one container for the database;
one or more containers for the APP itself. Each container holds a specific version of the APP.
It's a multi-tenant application, so each client (or tenant) may be related to only one version at a time (migration should be handled per client, but that's not relevant here).
The problem is that I would like another container to handle scheduled jobs, like sending e-mails, processing some data, etc. The scheduler would then execute commands in the app's containers. Projects like Ofelia look promising, but I would have to know the target container ahead of time. That's not possible, because I need to query the database container to discover which version the client is on in order to figure out which container the command should be executed in.
Is there a tool to help me here? Should I change the structure somehow? Any tips would be welcome.
Thanks.
So your question is that you want to get the app's version info from the database container before scheduling jobs, right?
I think this relates to the business logic rather than the dockerized environment; you have a few ways to approach the problem:
Check the network and make sure the containers can reach each other.
The database should support being queried remotely; you can use that to get the version data.
You can use remote-execution tools, such as SSH, to run the command in the right container, as sketched below.
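A minimal sketch of that idea, assuming a scheduler container with the Docker socket mounted, a Postgres database reachable as db, a tenants table with an app_version column, and app containers named app-v<version> (all of these names and the schema are assumptions):
#!/bin/sh
# run_job.sh -- invoked by cron inside the scheduler container, e.g. ./run_job.sh acme send_emails
TENANT="$1"
JOB="$2"
# 1. Ask the database which app version this tenant is on (hypothetical schema).
VERSION=$(psql "postgresql://app:secret@db:5432/app" -At \
  -c "SELECT app_version FROM tenants WHERE name = '${TENANT}'")
# 2. Run the command inside the container for that version (hypothetical naming convention app-v1, app-v2, ...).
docker exec "app-v${VERSION}" bin/run_job "${JOB}" --tenant "${TENANT}"
Note that mounting /var/run/docker.sock into the scheduler container effectively gives it root-equivalent access to the host, so that container has to be treated as fully trusted.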

"This DAG seems to be existing only locally. The master scheduler doesn't seem to be aware of its existence."

I started experimenting with Google Cloud Composer, where I deployed a few DAGs:
One of my DAGs shows an info statement indicating This DAG seems to be existing only locally. The master scheduler doesn't seem to be aware of its existence. and cannot run, even manually. When I start it manually it stays in the "running" state forever and never starts the first task.
As explained in detail below, the only difference between the two DAGs is that the broken one uses a custom operator.
Do you have any idea what's wrong here and how I can fix it?
Thanks
hello2_gcp_plugins_v2 calls only the bash and email operators and works as expected (I received the email). If I configure a schedule_interval it starts as expected. Even when I set the schedule interval to None, it works well when I start it manually.
hello2_gcp_plugins_v5 calls a custom operator that I already deployed to the expected bucket. The custom operator just calls an API via the HttpHook to get data and uploads it to a GCS bucket via the GoogleCloudStorageHook. Whether the schedule interval is set or kept at None, I always see the info statement in the UI and the DAG never starts automatically. When started manually, it stays in the running state forever and the first task is never triggered.
I'm answering my own question since I fixed it; this may be useful if someone else runs into the same trouble.
Even though it's not obvious, the message This DAG seems to be existing only locally. The master scheduler doesn't seem to be aware of its existence. was due to a buggy operator used in my DAG; in my case, one of my custom operators.
To debug it, I clicked on the DAG -> Graph View -> my custom operator -> Task Instance Details, and the stack trace of the error in my operator was displayed.
I fixed my operator, uploaded the new version to the GCS bucket, and after a few refreshes the Web UI no longer showed the information message and my DAG was running.
This can also happen if you add a new DAG without stopping the scheduler and it hasn't yet rescanned the DAGs folder to find the new DAGs. You can change the scheduler's refresh interval in airflow.cfg to make it pick up new DAGs quicker.
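For reference, the setting that controls how often the scheduler rescans the DAGs folder is dag_dir_list_interval in the [scheduler] section of airflow.cfg (on Cloud Composer you would set it as an Airflow configuration override rather than editing the file directly), for example:
[scheduler]
# seconds between scans of the DAGs folder for new files (the default is 300)
dag_dir_list_interval = 60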

Scheduled job does not appear to run and no kernel files are created

I have a scheduled notebook job that has been running without issue for a number of days, however, last night it stopped running. Note that I am able to run the job manually without issue.
I raised a previous question on this topic: How to troubleshoot a DSX scheduled notebook?
Following the above instructions, I noticed that there were no log files created at the times when the job should have run. Because I'm able to run the job manually and there are no kernel logs created at the times the scheduled job should have run, I'm presuming there is an issue with the scheduler service.
Are there any other steps I can perform to investigate this issue?
This sounds like a problem with the Scheduling service; I recommend taking it up with DSX support. Currently there is no management UX that tells you why a specific job failed or lets you restart a particular execution (that would be a good candidate for an enhancement request via https://datascix.uservoice.com/).

Error creating the GCE VMs or starting Dataflow

I'm getting the following error in the recent jobs I'm trying to submit:
2015-01-07T15:51:56.404Z: (893c24e7fd2fd6de): Workflow failed.
Causes: (893c24e7fd2fd601):
There was a problem creating the GCE VMs or starting Dataflow on the VMs so no data was processed. Possible causes:
1. A failure in user code on in the worker.
2. A failure in the Dataflow code.
Next Steps:
1. Check the GCE serial console for possible errors in the logs.
2. Look for similar issues on http://stackoverflow.com/questions/tagged/google-cloud-dataflow.
There are no other errors.
What does this error mean?
Sorry for the trouble.
The Dataflow service starts up VM instances and then launches an agent on those VMs. The agents then do the heavy lifting of executing your code (e.g. ParDos, reading and writing your data).
The error indicates the job failed because no agents were requesting work. As a result, the service marked the job as a failure because it wasn't making any progress and never would, since there weren't any agents to process your data.
So we need to figure out where in the agent startup process things failed.
The first thing to check is whether the VMs actually started. When you run your job, do you see any VMs created in your project? It might take a minute or two for the VMs to start up, but they should appear shortly after the runner prints the message "Starting worker pool setup". The VMs should be named something like
<PREFIX-OF-JOB-NAME>-<TIMESTAMP>-<random hexadecimal number>-<instance number>
Only a prefix of the job name is used to ensure we don't exceed GCE name limits.
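For example, you can confirm the workers exist from the command line (the job-name prefix here is hypothetical):
# list worker VMs whose names start with your job-name prefix
gcloud compute instances list --filter="name~^myjobname"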
If the VMs start up, the next thing to do is to inspect the worker logs and look for errors indicating problems launching the agent.
The easiest way to access the logs is through the UI. Go to the Google Cloud Console and select the Dataflow option in the left-hand panel. You should see a list of your jobs; click on the job in question to see a graph of it. On the right side there is a "view logs" button. Click that and you'll get a UI for navigating the logs, where you can look for errors.
The second option is to look for the logs on GCS. The location to look for is:
gs://PATH TO YOUR STAGING DIRECTORY/logs/JOB-ID/VM-ID/LOG-FILE
You might see multiple log files. The one we are most interested in is the one that starts with "start_java_worker". If that log file doesn't exist, then the worker didn't make enough progress to actually upload the file, or there might have been a permission problem uploading it.
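For example (substitute your actual staging bucket and job ID; these are just the placeholders from the path above):
gsutil ls gs://YOUR_STAGING_BUCKET/logs/JOB-ID/                                # one directory per worker VM
gsutil cat gs://YOUR_STAGING_BUCKET/logs/JOB-ID/VM-ID/start_java_worker*.log  # the startup log, if it was uploaded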
In that case the best thing to do is to try to ssh into one of the VMs before it gets torn down. You should have about 15 minutes before the job fails and the VMs are deleted.
Once you log in to the VM, you can find all the logs in
/var/log/dataflow/...
The log we care most about at this point is:
/var/log/dataflow/taskrunner/harness/start_java_worker-SOME ID.log
If there is a problem starting the code that runs on the VM that log should tell us. That log and the other logs should also tell us if there is a permission problem that prevents the code running on the worker from being able to access Dataflow.
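As an illustration (the VM name and zone are hypothetical; use the values shown by gcloud compute instances list):
gcloud compute ssh myjobname-01071551-abcdef12-00001 --zone us-central1-a
# once on the VM:
ls /var/log/dataflow/
tail -n 100 /var/log/dataflow/taskrunner/harness/start_java_worker-*.log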
Please take a look and let us know if you find anything.
In addition to Jeremy Lewi's great answer, I'd like to add that I've seen this error appear when the proper Google APIs aren't enabled in the Developers Console, as mentioned here, which leads to a permission issue, like Jeremy said.
