I have a problem with a Spark Streaming job: one task is always stuck in a running status while all the other tasks have completed. I found some exceptions occurring in the task that never stops.
Why does an exception in a task leave the job hanging in a "running" status instead of stopping it?
Has anyone met and solved this problem? Please help! Thanks in advance.
From the ECS console I started to see this issue.
I think I understand the cause pretty well: ECS pulls these images as an anonymous user, since I haven't provided any credentials, and because the task was set to run every 5 minutes it was hitting the anonymous pull quota. No big deal; I set the task to run every 10 minutes and for now the problem is solved.
What drives me nuts is:
From the CloudWatch console you can see that the task was executed. If I graph the number of executions, I see data points every 5 minutes. This seems wrong to me: in order to execute the task, ECS first needs to pull the image, and it can't, so there should be no execution to record.
I can't find the error (CannotPullContainerError: ref pull has been retried 1 time(s): failed to copy: httpReadSeeker: failed open: unexpected status code https://registry-1.docker.io/v2/amazon/aws-for-fluent-bit/manifests/sha256:f134723f733ba4f3489c742dd8cbd0ade7eba46c75d574...) in any CloudWatch log stream.
I need to be able to monitor for this error. I only spotted it by chance because I happened to be looking at the console. How can I monitor for this?
FYI, I also use Datadog for log ingestion. Maybe that can help?
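For reference, here is a minimal sketch of the kind of check I have in mind, assuming the aws-sdk-ecs gem and a hypothetical cluster name; it inspects recently stopped tasks and flags pull failures via their stopped_reason, which, as far as I can tell, is where ECS reports this error:

require "aws-sdk-ecs"

ecs = Aws::ECS::Client.new(region: "us-east-1") # region is an assumption
cluster = "my-cluster"                          # hypothetical cluster name

# Look at recently stopped tasks and report why they stopped
task_arns = ecs.list_tasks(cluster: cluster, desired_status: "STOPPED").task_arns
unless task_arns.empty?
  ecs.describe_tasks(cluster: cluster, tasks: task_arns).tasks.each do |task|
    if task.stopped_reason.to_s.include?("CannotPullContainerError")
      puts "Pull failure on #{task.task_arn}: #{task.stopped_reason}"
    end
  end
end

A script like this could run on a schedule and push a custom metric to Datadog or CloudWatch for alerting.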
Oftentimes my Sidekiq jobs run for longer than 1 minute. I've tried to debug by sending the Sidekiq process a TTIN signal, but I don't see anything being logged. My intuition is that a network request is making it hang, but I already use timeouts on all network requests to address this.
Any suggestions? Thanks!
You are essentially asking how to profile and tune slow Ruby code. One option is ruby-prof:
require "ruby-prof"
# Profile a single run of the job and print where the time goes
result = RubyProf.profile do
  MyJob.new.perform(...)
end
RubyProf::FlatPrinter.new(result).print($stdout)
Output the report and review it to find where the code is slow.
https://github.com/ruby-prof/ruby-prof#usage
You can also use the benchmark module from the standard library (https://ruby-doc.org/stdlib-1.9.3/libdoc/benchmark/rdoc/Benchmark.html) and write the timings to a log file.
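For example, a rough sketch with Benchmark.measure, reusing the MyJob class from the ruby-prof example above (the log path and job arguments are placeholders):

require "benchmark"
require "logger"

logger = Logger.new("log/job_timing.log") # hypothetical log file path

# Time one run of the job; `job_args` stands in for whatever the job actually takes
timing = Benchmark.measure { MyJob.new.perform(job_args) }
logger.info("MyJob took #{timing.real.round(2)}s")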
My Dataflow job is failing with the following message; how should I debug it?
Workflow failed. Causes: (65a939e801f185b6): Unable to bring up enough
workers: minimum 1, actual 0.
The service will output this message when it is unable to allocate a virtual machine from Compute Engine to execute the job. Please check your quota in the console.
I had problems with the same thing. Switching zones solved the problem for me; I believe the same error message is sometimes produced when there are no free resources in the zone.
If I forget to start the delayed_job workers on the server, the delayed jobs stay pending forever and I can't seem to get any errors from the Delayed::Job API. Is there an easy way to catch this mistake? I have a dashboard that lists failed background jobs for admins; it would be great to have an alert when no worker is running. Thanks!
Hmm, in this case I can find the unprocessed jobs with Delayed::Job, and they will continue once the worker is running again, so I think it should be fine.
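If you still want an alert for your dashboard, here is a minimal sketch of a check, assuming the standard delayed_job_active_record schema and ActiveSupport time helpers; it flags jobs that are overdue but that no worker has locked, which is what an idle queue with no workers looks like:

# Jobs that should have run a while ago but that no worker has picked up
stalled = Delayed::Job
  .where(failed_at: nil, locked_by: nil)
  .where("run_at < ?", 10.minutes.ago)
  .count

if stalled > 0
  # Replace with whatever alerting hook your dashboard uses
  Rails.logger.warn("delayed_job: #{stalled} overdue jobs and no worker has locked them")
end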
At least once a day my Delayed::Job workers will randomly stop working jobs off the queue, yet the processes are still alive.
Pictured: "Zombies"
When I inspect the remaining jobs in the queue, none show that they are locked or being worked by the zombified workers in question. Even when looking at failed jobs, it's hard to make a definite connection between a failure and the workers going into zombie mode.
I have a theory that a job hits an error that causes the workers to segfault but not completely die. Is there any way to inspect a worker process and see what it's doing? How would one go about debugging this issue when there's not even a stacktrace or failed job to inspect?
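One low-tech option is to install a signal handler in the worker process that dumps every thread's backtrace, similar to what Sidekiq does for TTIN; delayed_job does not provide this out of the box, so this is just a sketch, and the initializer path and signal choice are assumptions:

# config/initializers/delayed_job_ttin.rb (hypothetical location)
# Send `kill -TTIN <pid>` to a stuck worker to see what each thread is doing.
Signal.trap("TTIN") do
  Thread.list.each do |thread|
    puts "== Thread #{thread.object_id} (#{thread.status}) =="
    puts((thread.backtrace || ["<no backtrace>"]).join("\n"))
  end
end

If the dump shows every thread blocked in the same call (for example a socket read), that points at the hang; if the process does not respond to the signal at all, it is probably wedged in native code, which would fit the segfault theory.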