The Dataflow appears to be stuck - google-cloud-dataflow

Got the following message:
The Dataflow appears to be stuck. Please reach out to the Dataflow team at http://stackoverflow.com/questions/tagged/google-cloud-dataflow.
I realized there were other questions regarding the same error message, but the context seemed different for each and the message rather generic, so I'm posting again.
Job ID: 2017-09-25_09_27_25-5047889078463721675
Please assist. Thanks.
EDIT: Problem seems to have disappeared (at least for now) after updating to Apache Beam SDK for Python 2.1.1 from 2.0.0.

A common cause of stuckness in Dataflow pipelines is an inability to start the workers. If you look at the Stackdriver Logs (view Logs in the UI, and click the link to go to Stackdriver) you should be able to view the worker_startup logs. Any problems here can indicate failures to start workers, which would cause the job to be stuck.
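As a rough sketch of how to narrow the Stackdriver view to those startup logs, the helper below builds a log filter for a single job. It assumes the startup log ID is `dataflow.googleapis.com/worker-startup` and that Dataflow entries use the `dataflow_step` resource type; `worker_startup_filter` and `my-project` are hypothetical names used only for illustration.

```python
from urllib.parse import quote


def worker_startup_filter(project_id: str, job_id: str) -> str:
    """Build a Stackdriver Logging filter for one job's worker startup logs."""
    # Assumption: the startup log ID is dataflow.googleapis.com/worker-startup.
    # Log IDs containing "/" must be URL-encoded inside logName.
    log_id = quote("dataflow.googleapis.com/worker-startup", safe="")
    clauses = [
        'resource.type="dataflow_step"',
        f'resource.labels.job_id="{job_id}"',
        f'logName="projects/{project_id}/logs/{log_id}"',
    ]
    return " AND ".join(clauses)


if __name__ == "__main__":
    # "my-project" is a placeholder; the job ID is the one from the question.
    print(worker_startup_filter("my-project", "2017-09-25_09_27_25-5047889078463721675"))
```

The printed filter can be pasted into the Stackdriver log viewer, or passed to `gcloud logging read`.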

Related

Dataflow with Go SDK failing with 'InvalidProtocolBufferException'

We were previously running Beam 2.11.0 which started failing due to an apparent change in URN format. When I attempted to update and use the latest release (2.13.0) the pipeline started timing out, and the only seemingly relevant error that I could identify from the logs during testing was:
org.apache.beam.vendor.grpc.v1p13p1.com.google.protobuf.InvalidProtocolBufferException: While parsing a protocol message, the input ended unexpectedly in the middle of a field. This could mean either that the input has been truncated or that an embedded message misreported its own length.
To further test this, I attempted to use the wordcount example as provided in the Beam docs/repo here - https://beam.apache.org/get-started/wordcount-example/ - But I got the same result. I'm not sure if this is a generic error or something to do with Dataflow. The pipelines seem to work when running via direct runner.
Note: I have rebuilt our worker_harness_container_image with the latest version.
I understand the Go SDK is not officially supported by Dataflow, but could anybody tell me if the error is something related to Dataflow or some other issue?
PS: I've asked the question in Dataflow and Beam Slack channels but haven't had any response.

Error creating Queues in RabbitMQ

I have an environment in our company which hosts RabbitMQ 3.6.1 and Erlang 19.3. When I try to create a queue using the RabbitMQ Management UI, I get the error below. I can create exchanges and vhosts fine; only creating queues fails. I also tried to write a utility to create queues using the HTTP API, but that fails as well.
Upon some more researching I stumbled upon this article https://groups.google.com/d/msg/rabbitmq-users/pa1UtLbbvOE/3OlgKgMBAgAJ which says Erlang 19 is not compatible with RabbitMQ 3.6.3 and lower. Can someone confirm my findings please?
The error I am getting is
Got response code 500 with body {"error":"Internal Server Error","reason":"{error,\n {exit,\n {{function_clause,\n [{rabbit_queue_location_validator,module,\n [\"random\"],\n [{file,\"src/rabbit_queue_location_validator.erl\"},\n {line,50}]},\n {rabbit_queue_location_validator,validate_strategy,1,\n [{file,\"src/rabbit_queue_location_validator.erl\"},\n {line,38}]},\n {rabbit_queue_master_location_misc,get_location_mod_by_config,\n 1,\n [{file,\"src/rabbit_queue_master_location_misc.erl\"},\n {line,88}]},\n {rabbit_queue_master_location_misc,get_location,1,\n [{file,\"src/rabbit_queue_master_location_misc.erl\"},\n {line,51}]},\n {rabbit_amqqueue,declare,6,\n [{file,\"src/rabbit_amqqueue.erl\"},{line,300}]},\n {rabbit_channel,handle_method,3,\n [{file,\"src/rabbit_channel.erl\"},{line,1331}]},\n {rabbit_channel,handle_cast,2,\n [{file,\"src/rabbit_channel.erl\"},{line,455}]},\n {gen_server2,handle_msg,2,\n [{file,\"src/gen_server2.erl\"},{line,1049}]}]},\n {gen_server,call,\n [<0.27627.105>,\n {call,\n {'queue.declare',0,<<\"Test\">>,false,true,false,false,false,\n []},\n none,<0.15368.105>},\n infinity]}},\n [{gen_server,call,3,[{file,\"gen_server.erl\"},{line,212}]},\n {rabbit_mgmt_util,'-amqp_request/5-fun-0-',4,\n [{file,\"src/rabbit_mgmt_util.erl\"},{line,579}]},\n {rabbit_mgmt_util,with_channel,5,\n [{file,\"src/rabbit_mgmt_util.erl\"},{line,598}]},\n {rabbit_mgmt_util,http_to_amqp,5,\n [{file,\"src/rabbit_mgmt_util.erl\"},{line,526}]},\n {webmachine_resource,resource_call,3,\n [{file,\"src/webmachine_resource.erl\"},{line,186}]},\n {webmachine_resource,do,3,\n [{file,\"src/webmachine_resource.erl\"},{line,142}]},\n {webmachine_decision_core,resource_call,1,\n [{file,\"src/webmachine_decision_core.erl\"},{line,48}]},\n {webmachine_decision_core,accept_helper,1,\n [{file,\"src/webmachine_decision_core.erl\"},{line,612}]}]}}\n"}
The RabbitMQ team monitors this mailing list and only sometimes answers questions on stackoverflow.
In your case, the error is happening here. Did you create a queue-master-locator policy with the value of random? If so, I recommend clearing the policy to see if that resolves the issue.
I also recommend upgrading to the latest version (3.6.12). The version you are using is very old.
Thanks to @Luke Bakken for pointing me to the RabbitMQ mailing list.
I managed to fix the problem by changing the configuration of queue master location strategy to <<"random">>
Please see this link for more info
https://groups.google.com/d/msg/rabbitmq-users/XUbtu4UxbHQ/3y-PvO0oBAAJ
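To check whether queue creation works again after such a change, a declaration through the management HTTP API can be sketched roughly as below. This is a minimal sketch, not the utility from the question: it assumes the management plugin listens on its default port 15672 and that the `guest`/`guest` account is usable from localhost. `build_declare_queue_request` is a hypothetical helper name.

```python
import base64
import json
from urllib.parse import quote
from urllib.request import Request


def build_declare_queue_request(base_url, vhost, queue, user, password):
    """Build (but do not send) a PUT /api/queues/{vhost}/{name} request."""
    # The default vhost "/" has to be URL-encoded as %2F in the path.
    url = f"{base_url}/api/queues/{quote(vhost, safe='')}/{quote(queue, safe='')}"
    body = json.dumps(
        {"durable": True, "auto_delete": False, "arguments": {}}
    ).encode("utf-8")
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Basic {token}",
    }
    return Request(url, data=body, method="PUT", headers=headers)


if __name__ == "__main__":
    # Host and credentials are assumptions; adjust them for your broker.
    req = build_declare_queue_request(
        "http://localhost:15672", "/", "Test", "guest", "guest"
    )
    print(req.get_method(), req.full_url)
```

Passing the request to `urllib.request.urlopen` sends it. If the broker still has the problem described above, the call should fail with the same HTTP 500 as the UI, since the error is raised on the server side.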

Jenkins UI is not working but backend is fine

We are running into issues where the Jenkins UI stops working, but the backend is fine and is able to process all traffic.
Is there a way we can restart just the front end of Jenkins?
Thanks
I did some research on Jenkins UI freezes and found that most of the time they are caused by the Java garbage collector. Luckily, it is something that you can configure, so this is what I suggest:
1. Read about the Java garbage collector if you are not familiar with it.
2. Enable GC logging for your Jenkins instance.
3. Analyze the GC logs with tools such as http://gceasy.io/
4. Add the relevant GC flags to your Jenkins Java opts.
5. Repeat 2-4 until satisfied.
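For a quick first look at a GC log before uploading it to an analysis tool, the sketch below scans for long pauses. It assumes the log was written with `-XX:+PrintGCDetails` on a Java 8 style JVM, so that each collection ends with a `[Times: user=... sys=..., real=... secs]` entry. The lines in `SAMPLE` are made up for illustration and `longest_pauses` is a hypothetical helper name.

```python
import re

# Matches the wall-clock part of "[Times: user=0.03 sys=0.01, real=0.02 secs]".
PAUSE_RE = re.compile(r"real=(\d+\.\d+) secs")


def longest_pauses(log_lines, threshold_secs=1.0):
    """Return (line_number, pause_secs) for pauses at or above the threshold, longest first."""
    hits = []
    for number, line in enumerate(log_lines, start=1):
        match = PAUSE_RE.search(line)
        if match and float(match.group(1)) >= threshold_secs:
            hits.append((number, float(match.group(1))))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Illustrative, made-up log lines in the assumed format.
SAMPLE = [
    "12.345: [GC (Allocation Failure) 65536K->10728K(251392K), 0.0151 secs] "
    "[Times: user=0.03 sys=0.01, real=0.02 secs]",
    "98.765: [Full GC (Ergonomics) 240000K->180000K(251392K), 4.3050 secs] "
    "[Times: user=8.10 sys=0.05, real=4.31 secs]",
]

if __name__ == "__main__":
    for number, secs in longest_pauses(SAMPLE):
        print(f"line {number}: {secs:.2f}s pause")
```

Multi-second full GC pauses that coincide with the UI freezes suggest that heap size or collector flags are worth tuning.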
For further reading see:
my slides from a talk I did about this subject:
https://www.slideshare.net/TidharKleinOrbach/why-does-my-jenkins-freeze-sometimes-and-what-can-i-do-about-it
a post I made about the subject:
http://engineering.taboola.com/5-simple-tips-boosting-jenkins-performance/
more resources:
https://www.cloudbees.com/blog/joining-big-leagues-tuning-jenkins-gc-responsiveness-and-stability
https://jenkins.io/blog/2016/11/21/gc-tuning/
I ran into the same problem this week. These are the things I did:
Looked at the debugging steps given on the Jenkins wiki, but they didn't work for me: https://wiki.jenkins-ci.org/display/JENKINS/Jenkins+is+hanging
Later, I realised I might have memory issues on the server where I was running Jenkins, and that the UI had crashed. However, the UI wouldn't load even after I freed up memory.
Finally, I restarted my Jenkins server - something I was avoiding doing from the beginning - and it solved the issue and Jenkins UI came back up.
Thanks

Why is my Dataflow pipeline not showing steps?

When I run the examples I get a pretty picture showing the flow and I can monitor as it executes. With my application it doesn't show the diagram and if I click on "Step" it displays nothing.
Adding screenshot of Job log. No warnings or errors. BTW, I assumed the icon on the log entry with an "i" stands for Info level, but when I change the level from BASIC to ALL many more entries are added and they all have the same icon. That is confusing. Icons should be more clear and should have hover tips, IMO.
I'm on the Dataflow team. I'm sorry that you are encountering this issue.
I believe this is occurring because of the custom step names your code is using.
From your screenshot of the job logs, it appears that some of these steps have been given names that represent a GCS storage path location.
I noticed this from this message in the logs:
Executing operation "gs://datalake/landing/...."
This fails to render in the monitoring UI and likely hits an assertion because slashes are disallowed characters.
To work around this issue, please try removing the custom step names used in your code, which seem to be set to gs:// style paths. You could also try specifying a name for each step without using special characters.
Please try running the job again after that change and see if the graph renders properly in the dataflow UI.
I have created a GitHub issue to track this bug and to prevent these slash characters from being sent by the Dataflow SDK code in the future.
Please let me know if you encounter any more issues.
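If the step names have to be derived from paths, one option is to sanitize them before using them as names. The sketch below is an assumption about what a safe name looks like (letters, digits and underscores only), not a documented Dataflow rule. `safe_step_name` and the example path are hypothetical.

```python
import re


def safe_step_name(raw: str) -> str:
    """Collapse every run of characters other than letters, digits or '_' into one '_'."""
    cleaned = re.sub(r"[^A-Za-z0-9_]+", "_", raw).strip("_")
    # Fall back to a generic name if nothing usable is left.
    return cleaned or "step"


if __name__ == "__main__":
    # Hypothetical path, used only to show the transformation.
    print(safe_step_name("gs://example-bucket/input/file.csv"))
```

The result can then be passed wherever the pipeline currently sets the step name to the raw gs:// path.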

TFS 2010 Build: Sporadic failure in the process

We have a situation where our builds have stopped executing in a stable manner.
About one build in three fails with either a TF215096 or a TF215097 error.
If we then restart the Build controller, it works again - until next time.
The errors we get are:
TF215096: An error occurred while connecting to controller vstfs:///Build/Controller/1: There was no endpoint listening at ht*p://XXXX that could accept the message. This is often caused by an incorrect address or SOAP action. See InnerException, if present, for more details.
TF215096: An error occurred while connecting to controller XXX - Controller: Could not connect to ht*p://XXX. TCP error code 10061: No connection could be made because the target machine actively refused it 192.168.XXX.XXX:XXX.
TF215097: An error occurred while initializing a build for build definition \XXX: Team Foundation services are not available from server ht*p://XXX. Technical information (for administrator): The underlying connection was closed: A connection that was expected to be kept alive was closed by the server.
TF215097: An error occurred while initializing a build for build definition \YYY: An error occurred while receiving the HTTP response to ht*p://XXX. This could be due to the service endpoint binding not using the HTTP protocol. This could also be due to an HTTP request context being aborted by the server (possibly due to the service shutting down). See server logs for more details.
The server logs provide little information; at least, we have found nothing that helps us resolve the situation. Various searches on the net were also unproductive.
Has anybody had these or similar issues? Any ideas on how or where to look for a resolution?
Thank you very much in advance for any input!
Yeah it does sound like you have some connectivity issues. You can try enabling SOAP tracing on both the build machine and the server (if possible) to see if there is any error. If it still does not give you any new information, contact Microsoft by filing a Connect Bug to get help.
I am not sure if it will help you, but I have run into similar issues with build agents and ended up just deleting and re-creating the agent. You may try deleting your controller/agent and adding it back in. It is a brute-force solution but a good starting point. If that doesn't resolve the issue, at least you can eliminate the controller/agent as the cause and take a look at network/server related issues.
Today is a happy day, since we managed to get to the bottom of the matter. Sorry @Duat that I'm taking away the 'answer' checkmark, but it turned out that the problem was quite different from what you (and anybody else) had predicted.
In my last update I was about to forward the matter to MS, when we realized that our firewall was misbehaving in name resolution. So we assumed this was the culprit and waited for it to be fixed. After it was fixed, we STILL had the same issues, so we went back to re-examining the situation.
We isolated the problem within our build process, more specifically to a custom code activity included in our build solution.
I had implemented a code activity that would kick in at the final steps of every build. This activity gathered BuildDetails about the running build and added them as a new line in a 'BuildLog.xls'. The implementation made use of Microsoft.Office.Interop.Excel. This Excel sheet resides on another server (NOT on the servers where the controller/agents reside).
During development of this activity I was faced with issues like this, but after I was done no instances of EXCEL were left hanging. So I thought this was done & dealt with.
By trial and error, we observed that when this activity didn't run, no problems occurred.
With this activity running, the very first build after a build-controller reset would succeed, any next build had a certain chance to fail. Once any build failed, no other would succeed until another build-controller reset.
I have only a general understanding of what the problem was (Excel-call is DCOM, TFS services are WCF : How on earth would they interfere?! Why would this sometimes succeed and sometimes fail?! ).
The provided diagnostics were no help either; in fact, they misled us into a loop that continued for months.
If I ever find the time, I'd like to cleanly reproduce the error and make a Server Fault question out of it...
After removal of this activity it works! I now searched in SO & found this, where J.Saunders comments: "In general, you should never use Office Interop from a server environment". It's ironic that once you get to the bottom of any difficult issue, the whole universe seems to have known about it except you...
