This is the only error in the logs. All Dataflow workers shut down after 3.5 days of processing, by which point the job had gotten through more than half of the load. What does this error mean? I am not sure whether it is a memory issue that might be solved by increasing the resources. No exception should be coming from user code, because everything is inside a blanket try...except block.
Workflow failed. Causes: S04:Reshuffle/ReshufflePerKey/GroupByKey/Read+Reshuffle/ReshufflePerKey/GroupByKey/GroupByWindow+Reshuffle/ReshufflePerKey/FlatMap(restore_timestamps)+Reshuffle/RemoveRandomKeys+ParDo(EnrichCompanies)+ParDo(LogCompanyPipelineRun) failed., The job failed because a work item has failed 4 times. Look in previous log entries for the cause of each one of the 4 failures. For more information, see https://cloud.google.com/dataflow/docs/guides/common-errors. The work item was attempted on these workers:
company-batch-enrichment--02161750-u9wk-harness-3nj5
Root cause: The worker lost contact with the service.,
company-batch-enrichment--02161750-u9wk-harness-3nj5
Root cause: The worker lost contact with the service.,
company-batch-enrichment--02161750-u9wk-harness-3nj5
Root cause: The worker lost contact with the service.,
company-batch-enrichment--02161750-u9wk-harness-3nj5
Root cause: The worker lost contact with the service.
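For reference, a blanket handler of the kind the question describes can be sketched as below. The names process_safely and enrich are hypothetical and are not taken from the pipeline. The sketch illustrates one limitation: such a handler only catches Python exceptions raised inside the block. It cannot observe the worker process itself being terminated from outside, so on its own it does not rule out resource exhaustion.

```python
import logging
import traceback

logger = logging.getLogger("enrich")  # hypothetical logger name


def process_safely(element, enrich):
    """Run enrich(element) and return (result, None) or (None, error_text).

    `enrich` stands in for the per-element user code; it is an assumption.
    """
    try:
        return enrich(element), None
    except Exception:  # blanket handler, as described in the question
        error_text = traceback.format_exc()
        # Log the full traceback so the failure is visible instead of swallowed.
        logger.error("enrichment failed for %r:\n%s", element, error_text)
        return None, error_text
```

Returning the traceback text alongside the result makes it possible to route failed elements to a separate output rather than silently dropping them.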
Below are the resource metrics for the job.
We do a lot of testing on Pentaho jobs, but each time we have to wait for 1 minute 40 seconds before anything happens. The cause must be Karaf because of this output message of Kitchen:
11:55:58,432 ERROR [KarafLifecycleListener] The Kettle Karaf Lifecycle Listener
failed to execute properly after waiting for 100 seconds.
Releasing lifecycle hold, but some services may be unavailable.
Some people suggested adding a line like the one below to kettle.properties in your home directory, or in the .pentaho or .kettle folder, but this had no effect:
KITCHEN_KARAF_TIMEOUT_SECONDS=20
The other approach would be to give Karaf something to listen to so that it will stop waiting any further, but I could not find info on this.
How do we avoid that 100 seconds of waiting?
If you don't need Karaf, you can remove these files and folder:
\classes\kettle-lifecycle-listeners.xml
\classes\kettle-registry-extensions.xml
\lib\org.apache.karaf.*.jar
\system\karaf
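Before deleting anything, it may help to list which of these files and folders actually exist in a given installation. The following is a minimal sketch, assuming the paths above are relative to the Pentaho Data Integration install directory. The function name find_karaf_artifacts is hypothetical, and the sketch only reports matches, it removes nothing.

```python
from pathlib import Path

# Patterns taken from the list above, written relative to the install directory.
KARAF_PATTERNS = [
    "classes/kettle-lifecycle-listeners.xml",
    "classes/kettle-registry-extensions.xml",
    "lib/org.apache.karaf.*.jar",
    "system/karaf",
]


def find_karaf_artifacts(install_dir):
    """Return the existing paths under install_dir that match the patterns."""
    root = Path(install_dir)
    found = []
    for pattern in KARAF_PATTERNS:
        # glob() only yields paths that exist, for literal patterns too.
        found.extend(sorted(root.glob(pattern)))
    return found
```

Reviewing the returned list first (and keeping a backup) makes the removal easy to undo if Karaf turns out to be needed after all.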
Sometimes, when one of our longer builds is running (around 2 hours), Jenkins will start displaying the "Jenkins is going to shutdown ..." message. And no, this has not been done by an admin (me).
When I last saw this, I checked the console output of the running job, and it was still churning through its tests and running normally. It was not hung.
Then later, I checked again, and the console had the "BUILD SUCCESSFUL" message, followed by "Pausing (Preparing for shutdown)" - and it just sat there.
So I clicked on the kill job button and killed it, and got the "Aborted by ..." message.
Then 15 seconds later it displayed "Click here to forcibly terminate running steps". I did that. It then displayed "Terminating withAnt".
Then 15 seconds later, it displayed "Click here to forcibly kill entire build". Which I did - and Jenkins returned to normal operation and cleared the "going to shutdown" message.
WHAT IS GOING ON!
One related note: because we were getting too much "state" bleedthrough in our JUnit tests, we recently added the forkmode="perTest" setting to the Ant JUnit task. This has resulted in random tests failing with a "vm exited unexpectedly" message. It happens randomly for different tests (which is a PITA, since we can no longer count on the Test Failed status in Jenkins meaning anything). And no, I'm not sure whether that has always happened when the Jenkins job has the termination problem.
Well, I think I figured this out.
The system was running low on disk space. So SOME jobs that used more were triggering this problem - and others would run without a problem.
When I finally received a low-disk-space error in one of the logs, I did some cleaning (and found a bunch of files that were supposed to have been deleted). Since then, this error has stopped occurring.
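A free-space check of this kind can be sketched in a few lines, for example as a step run before a long build. The function name free_space_ok and the threshold are assumptions for illustration, not anything Jenkins provides.

```python
import shutil


def free_space_ok(path, min_free_bytes):
    """Return (ok, free_bytes) for the filesystem that contains `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free
    return usage.free >= min_free_bytes, usage.free
```

For example, free_space_ok("/var/lib/jenkins", 5 * 1024**3) would check for at least 5 GiB free. That path is a hypothetical Jenkins home, so adjust it to the actual workspace location.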
NOPE! Still happening
I'm getting a System event log entry:
An ISAPI reported an unhealthy condition to its worker process. Therefore, the worker process with process id of '<processID>' serving application pool '<myApp>' has requested a recycle.
but I am not getting any more information than that. What could I do to get a more detailed error message as to what "unhealthy" means?
Resolve
Check an unhealthy ISAPI component:
Possible resolutions for an unhealthy ISAPI component include the following:
Contact the ISAPI vendor regarding the error. The vendor should have more specific knowledge about the features and behavior of the component.
Check the event log message for more detailed information about the error.
For more information about "Troubleshooting Unexpected Issues", you can refer to this link.
For the verification project of this error, you can refer to this link.
For one of our gerrit projects, while navigating the file differences we get this error:
Application Error
Intraline difference not available due to server error
[Continue]
It doesn't happen for all projects, currently we've detected the error on only one project.
I looked on Google and on the gerrit documentation. Found a reference on their source code, but don't know what causes it and how it can be resolved.
The web page with the error contains a "Continue" button. Once clicked it will take you to the file you selected, but the error is annoying.
Do you know how to fix this?
The error occurs while Gerrit computes and caches the intraline difference of one file between two commits. The default timeout is 5 seconds. If the file is huge and the computation takes longer than the timeout, the worker thread is terminated, an error message is shown, and no intraline difference is displayed for the file pair.
Raising the timeout should fix this.
Add the following to gerrit.config:
[cache "diff_intraline"]
timeout = 15000 ms # Or other time length as you want.
Restart the Gerrit service.
Run the SSH command "gerrit flush-caches" as a user with the ViewCaches global capability:
ssh -p port userxxx@host gerrit flush-caches
Then it would work.
Cause of the error:
It is a result of Gerrit taking too long to diff the file, and marking the diff in one of its caches as non-available.
The relevant error log is here:
[2012-06-08 11:14:08,547] WARN com.google.gerrit.server.patch.IntraLineLoader : 5000 ms timeout reached for IntraLineDiff in project xxxxxxx on commit 354dd67ad54578cf801d8cda64a4ae8484ebb0b7 for path xxxxxxx.java comparing bf9fbc21520af7bfd0841c8b9f955ca6e215b059..f6b9c7992c12cfdca253acd033966f98f70f3543. Killing IntraLineDiff-6
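To find out which files hit the timeout, the WARN lines can be extracted from the error log. The following is a minimal sketch, assuming the message format matches the sample line above. The regular expression and the function name parse_timeout are illustrative and are not part of Gerrit.

```python
import re

# Pattern modelled on the sample IntraLineLoader warning shown above.
TIMEOUT_RE = re.compile(
    r"(?P<timeout_ms>\d+) ms timeout reached for IntraLineDiff"
    r" in project (?P<project>\S+)"
    r" on commit (?P<commit>[0-9a-f]+)"
    r" for path (?P<path>.+?)"
    r" comparing (?P<old>[0-9a-f]+)\.\.(?P<new>[0-9a-f]+)"
)


def parse_timeout(line):
    """Return a dict of the fields in a timeout warning, or None if no match."""
    match = TIMEOUT_RE.search(line)
    return match.groupdict() if match else None
```

Running this over every line of the log and collecting the "path" values gives a list of the files whose diffs are too slow for the current timeout.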
We have a situation where our builds have stopped executing in a stable manner.
At a rate of about one in every three builds, we receive either TF215096 or TF215097 errors and the build fails.
If we then restart the Build controller, it works again - until next time.
The errors we get are:
TF215096: An error occurred while connecting to controller vstfs:///Build/Controller/1: There was no endpoint listening at ht*p://XXXX that could accept the message. This is often caused by an incorrect address or SOAP action. See InnerException, if present, for more details.
TF215096: An error occurred while connecting to controller XXX - Controller: Could not connect to ht*p://XXX. TCP error code 10061: No connection could be made because the target machine actively refused it 192.168.XXX.XXX:XXX.
TF215097: An error occurred while initializing a build for build definition \XXX: Team Foundation services are not available from server ht*p://XXX. Technical information (for administrator): The underlying connection was closed: A connection that was expected to be kept alive was closed by the server.
TF215097: An error occurred while initializing a build for build definition \YYY: An error occurred while receiving the HTTP response to ht*p://XXX. This could be due to the service endpoint binding not using the HTTP protocol. This could also be due to an HTTP request context being aborted by the server (possibly due to the service shutting down). See server logs for more details.
The server logs provide little info; at least, we've found nothing that helps us resolve the situation. Various searches on the Net were also unproductive.
Has anybody had these or similar issues? Any ideas on how/where to look for a resolution?
Thank you very much in advance for any input!
Yeah it does sound like you have some connectivity issues. You can try enabling SOAP tracing on both the build machine and the server (if possible) to see if there is any error. If it still does not give you any new information, contact Microsoft by filing a Connect Bug to get help.
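As a quick first check before turning on tracing, a plain TCP probe from the build machine can show whether the controller's port accepts connections at all. TCP error code 10061 in the messages above is the Windows "connection refused" error. This is a minimal sketch, and the function name can_connect is hypothetical.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False
```

Calling this periodically against the controller's host and port, and logging the result with a timestamp, would show whether the refusals line up with the failed builds.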
I am not sure if it will help you, but I have run into similar issues with build agents and ended up just deleting and re-creating the agent. You may try deleting your controller/agent and adding it back in. It is a brute-force solution but a good starting point. If that doesn't resolve the issue, at least you can eliminate the controller/agent as the cause and take a look at network/server related issues.
Today is a happy day, since we managed to get to the bottom of the matter. Sorry @Duat that I'm taking away the 'answer' checkmark - but it turned out that the problem was quite different from what you (and everybody else) had predicted.
In my last update I was about to forward the matter to MS, when we realized that our firewall was misbehaving in name resolution. So we assumed this was the culprit and waited for it to be fixed. After it was fixed, we STILL had the same issues, and we went back to re-examining the situation.
We isolated the problem within our build process, more specifically to a custom code activity included in our build solution.
I had implemented a code activity that would kick in at the final steps of every build. This activity gathered BuildDetails about the running build and added them as a new line in a 'BuildLog.xls'. The implementation made use of Microsoft.Office.Interop.Excel. This Excel sheet resides on another server (NOT on the servers where the controller/agents reside).
During development of this activity I was faced with issues like this, but after I was done no instances of EXCEL were left hanging. So I thought this was done and dealt with.
By trial and error, we observed that when this activity didn't run, no problems would occur.
With this activity running, the very first build after a build-controller reset would succeed, any next build had a certain chance to fail. Once any build failed, no other would succeed until another build-controller reset.
I have only a general understanding of what the problem was (Excel-call is DCOM, TFS services are WCF : How on earth would they interfere?! Why would this sometimes succeed and sometimes fail?! ).
The provided diagnostics were no help either; in fact, they misled us into a loop that continued for months.
If I ever find the time, I'd like to cleanly reproduce the error and make a Server Fault question out of it...
After removal of this activity it works! I have now searched SO and found this, where J.Saunders comments: "In general, you should never use Office Interop from a server environment". It's ironic that once you get to the bottom of any difficult issue, the whole universe seems to have known about it except you...