Dataflow job neither completes nor fails after workers are started - google-cloud-dataflow

I have created a dataflow pipeline which reads a file from a Storage bucket and applies a simple transform to the data (e.g. trimming spaces).
When I execute the dataflow job, the job starts and the log shows that the workers are started in a zone, but after that nothing happens. The job never completes or fails; I had to stop it manually.
The dataflow job is executed by a service account with the dataflow.worker, dataflow.developer and dataflow.objectAdmin roles.
Can someone please suggest why the dataflow job never completes, or why nothing executes after the workers start?
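For reference, a minimal sketch of what such a pipeline might look like with the Beam Python SDK (the bucket paths are placeholders; the step names mirror the ones visible in the log below):

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Project, region, temp location etc. are supplied on the command line.
options = PipelineOptions()

with beam.Pipeline(options=options) as p:
    (p
     | "Read files" >> beam.io.ReadFromText("gs://my-bucket/input/*.txt")  # placeholder path
     | "ManageData" >> beam.Map(lambda line: line.strip())                 # trim the spaces
     | "Write" >> beam.io.WriteToText("gs://my-bucket/output/part"))       # placeholder path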
2021-02-09 11:01:29.753 GMT  Worker configuration: n1-standard-1 in europe-west2-b.
Warning
2021-02-09 11:01:30.015 GMT  The network sdas-global-dev doesn't have rules that open TCP ports 12345-12346 for internal connection with other VMs. Only rules with a target tag 'dataflow' or empty target tags set apply. If you don't specify such a rule, any pipeline with more than one worker that shuffles data will hang. Causes: No firewall rules associated with your network.
Info
2021-02-09 11:01:31.067 GMT  Executing operation Read files/Read+ManageData/ParDo(ManageData)
Info
2021-02-09 11:01:31.115 GMT  Starting 1 workers in europe-west2-b...
Warning
2021-02-09 11:07:33.341 GMT  The network sdas-global-dev doesn't have rules that open TCP ports 12345-12346 for internal connection with other VMs. Only rules with a target tag 'dataflow' or empty target tags set apply. If you don't specify such a rule, any pipeline with more than one worker that shuffles data will hang. Causes: No firewall rules associated with your network.
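Side note on the repeated warning above: with more than one worker, shuffle traffic between workers needs TCP ports 12345-12346 open inside the network. A hedged sketch of the rule the warning asks for, run via gcloud from Python (the network name comes from the log; the project and rule name are placeholders):

import subprocess

subprocess.run([
    "gcloud", "compute", "firewall-rules", "create", "allow-dataflow-internal",
    "--project", "my-project",       # placeholder
    "--network", "sdas-global-dev",  # network named in the warning
    "--allow", "tcp:12345-12346",    # ports Dataflow workers shuffle on
    "--source-tags", "dataflow",     # Dataflow workers carry this tag
    "--target-tags", "dataflow",
], check=True)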

I found the problem. I was running the job in one region while the VPC was in a different region, so the worker was not able to spin up. Once I made the job's region match the VPC's region, everything went well.
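In code, the fix might look like this hedged sketch using the Beam Python SDK's pipeline options (project, subnetwork and bucket are placeholders); the key point is that region matches the region of the VPC subnetwork the workers attach to:

from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",                                     # placeholder
    region="europe-west2",                                    # must match the VPC's region
    subnetwork="regions/europe-west2/subnetworks/my-subnet",  # placeholder
    temp_location="gs://my-bucket/tmp",                       # placeholder
)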

Related

pyspark client gets no result from spark server in docker even though it connects

I have a spark cluster running in a docker container. I have a simple pyspark example program, running on my desktop outside the docker container, to test my configuration. The spark console receives and completes the job. However, the pyspark client never gets the results.
[image of spark console]
The pyspark program's console shows:
" Using Spark's default log4j profile:
org/apache/spark/log4j-defaults.properties Setting default log level
to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For
SparkR, use setLogLevel(newLevel). 22/03/05 11:42:23 WARN
ProcfsMetricsGetter: Exception when trying to compute pagesize, as a
result reporting of ProcessTree metrics is stopped 22/03/05 11:42:28
WARN TaskSchedulerImpl: Initial job has not accepted any resources;
check your cluster UI to ensure that workers are registered and have
sufficient resources 22/03/05 11:42:43 WARN TaskSchedulerImpl: Initial
job has not accepted any resources; check your cluster UI to ensure
that workers are registered and have sufficient resources 22/03/05
11:42:58 WARN TaskSchedulerImpl: Initial job has not accepted any
resources; check your cluster UI to ensure that workers are registered
and have sufficient resources 22/03/05 11:43:13 WARN
TaskSchedulerImpl: Initial job has not accepted any resources; check
your cluster UI to ensure that workers are registered and have
sufficient resources 22/03/05 11:43:28 WARN TaskSchedulerImpl: Initial
job has not accepted any resources; check your cluster UI to ensure
that workers are registered and have sufficient resources 22/03/05
11:43:43 WARN TaskSchedulerImpl: Initial job has not accepted any
resources; check your cluster UI to ensure that workers are registered
and have sufficient resources "
I know this is wrong, since the job executed on the server.
If I click the kill link on the server, the pyspark program immediately gets:
22/03/05 11:46:22 ERROR Utils: Uncaught exception in thread stop-spark-context
org.apache.spark.SparkException: Exception thrown in awaitResult:
    at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:301)
    at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
    at org.apache.spark.deploy.client.StandaloneAppClient.stop(StandaloneAppClient.scala:287)
    at org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend.org$apache$spark$scheduler$cluster$StandaloneSchedulerBackend$$stop(StandaloneSchedulerBackend.scala:259)
    at org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend.stop(StandaloneSchedulerBackend.scala:131)
    at org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:927)
    at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:2567)
    at org.apache.spark.SparkContext.$anonfun$stop$12(SparkContext.scala:2086)
    at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1442)
    at org.apache.spark.SparkContext.stop(SparkContext.scala:2086)
    at org.apache.spark.SparkContext$$anon$3.run(SparkContext.scala:2035)
Caused by: org.apache.spark.SparkException: Could not find AppClient.
Thoughts on how to fix this?
There can be multiple reasons for this. Since your spark driver and the cluster sit on opposite sides of a docker boundary, there is a possibility that your driver is not reachable from the spark nodes even though the reverse direction works; that's why your spark session gets created but gets killed a few seconds later.
You should make your driver machine accessible from the spark nodes to complete the network connection in both directions. If you see a DNS name in the error message, which in most cases is the machine or container name, map it to the docker container's host IP in the /etc/hosts file on all nodes of the spark cluster.
Hope it helps.
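As a concrete illustration of the above, a minimal sketch of the client side, assuming a standalone master published on localhost:7077 and 192.168.1.10 as the desktop's address as seen from the Docker network (both placeholders):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("spark://localhost:7077")  # master as published by Docker (placeholder)
    .appName("connectivity-test")
    .config("spark.driver.host", "192.168.1.10")  # address executors can reach back on (placeholder)
    .config("spark.driver.port", "7078")          # pin the callback ports so they can be
    .config("spark.blockManager.port", "7079")    # allowed/published explicitly
    .getOrCreate()
)

# If the driver is reachable, this returns instead of looping on
# "Initial job has not accepted any resources".
print(spark.sparkContext.parallelize(range(100)).sum())
spark.stop()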

Understanding Docker Container Internals in Hyperledger Fabric

I think I understand how fabric mainly works and how consensus is reached. What I am still missing in the documentation is what happens inside a fabric docker container so that it can take part in the communication process.
So, communication starting from a client (e.g. an app) takes place using gRPC messages between peers and the orderer.
But what happens inside of the containers?
I imagine it as a process that only receives gRPC messages and answers them using functions in the background of a peer/orderer, handing its response on for further processing in another unit, such as the client collecting the responses of multiple peers for a smart contract.
But what really happens inside a container? I mean, a container spawns when the docker image is loaded and launched via the yaml config file. But what is started inside it? Is only a single peer binary started (e.g. via the command "peer node start"), i.e. only the compiled go binary "peer"? What is listening? What is responding? I discovered only one port exposed per container. This seems to be the gate for gRPC (since it is often a port like **51).
The same questions go for the orderer, the chaincode and the cli. How do they talk to each other, or is gRPC the only way of communication and processing (leaving aside the discovery service and gossip)? And how is all of this started inside the containers: only via the yaml files used for launching, or is there further internal configuration or a startup script in the image files? (I cannot look inside the images, only log in to running containers at runtime.)
When your client sends a request to one of the peers, the peer instance checks whether the requested chaincode (CC) is installed on it. If the CC is not installed, you will obviously get an error.
If the CC is installed, the peer checks whether a dedicated container has already been started for the given CC and the corresponding version. If the container is started, the peer sends the transaction request to that CC instance and returns the response to your client after signing the transaction. Signing guarantees that the response was really sent by that peer.
If the container is not started:
The peer builds a docker image and starts an instance of it (a docker container). The new image is based on one of the hyperledger images; e.g. if your CC is written in GO, then hyperledger/baseos, which is a very basic linux os, is used. This new image contains the CC binary and its META-DATA as well.
The peer instance uses the underlying (your) machine's docker server to do all of this. That's the reason why we need to pass /var/run:/host/var/run in the volume mapping and CORE_VM_ENDPOINT=unix:///host/var/run/docker.sock in the environment variables.
Once the CC container starts, it connects to its parent peer node, whose address is defined by the CORE_PEER_CHAINCODEADDRESS attribute. The peer dictates to the child (probably during image creation) to use this address, so it obeys. The peer node defines its own listen URL with the CORE_PEER_CHAINCODELISTENADDRESS attribute. A sketch pulling these settings together follows below.
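A hedged sketch of starting such a peer container with the Docker Python SDK (the image tag, peer address and ports are placeholders; only the env-var names, the peer command and the volume mapping come from the discussion above):

import docker

client = docker.from_env()

client.containers.run(
    "hyperledger/fabric-peer:2.2",  # placeholder image tag
    "peer node start",              # the single peer binary the question asks about
    environment={
        # lets the peer drive the host's docker daemon to build/start CC containers
        "CORE_VM_ENDPOINT": "unix:///host/var/run/docker.sock",
        # address the CC container dials back to (placeholder host/port)
        "CORE_PEER_CHAINCODEADDRESS": "peer0.org1.example.com:7052",
        # where the peer itself listens for chaincode connections
        "CORE_PEER_CHAINCODELISTENADDRESS": "0.0.0.0:7052",
    },
    # expose the host's docker socket inside the peer container
    volumes={"/var/run": {"bind": "/host/var/run", "mode": "rw"}},
    detach=True,
)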
About your last question: communication between nodes, and also with clients, is via gRPC. If TLS is enabled, the communication is secure. The entry point for orderers to learn about peers, and for peers to learn about other organizations' peers, is the set of anchor peers defined during channel creation. The discovery service runs in the peer nodes, so they can hold a close-to-real-time network layout. The discovery service also provides the peers' identities; that's how clients can detect other organizations' peers when the endorsement policy requires endorsements from multiple organizations (e.g. a policy like AND(Org1MSP.member, Org2MSP.member)).

Jenkins build log shows aborted by user

A Jenkins job (in Network A) runs on a slave machine (say, server A in Network A). As part of the build, the job SSHes to a server (say, server B in Network B) and executes further steps there.
The job runs for about 2.5 hours. Very randomly, it fails with an error message stating:
18:24:14 Aborted by <USERNAME>
18:24:14 Finished: ABORTED
On server B, where the build is executed, TCP keep-alive is set to yes, with a probe every 80 seconds. At the kernel level, the tcp keepalive parameter is set to 2.5 hours.
I'm sure the problem is not a timeout on this machine, as I have seen a run with a duration of 157 minutes pass successfully.
The build log neither contains any further lines nor is it descriptive.
How can I effectively debug this problem? We are unable to track anything in the network traffic, as there is only one session, established when the slave connects over SSH.
If this is due to an error within the build, how can I make Jenkins print a descriptive message so that we can narrow down the root cause?
What specifically can be tracked on the network to check whether this is due to a network glitch?
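One hedged way to get a more descriptive failure out of the remote step: run the SSH command with client-side keepalives and verbose logging, so a dropped connection fails loudly instead of ending in a silent abort. A sketch (host and script path are placeholders):

import subprocess

result = subprocess.run(
    [
        "ssh",
        "-o", "ServerAliveInterval=60",  # probe the server every 60 seconds
        "-o", "ServerAliveCountMax=3",   # fail after 3 missed probes instead of hanging
        "-vv",                           # verbose client-side log for the post-mortem
        "user@serverB",                  # placeholder
        "bash -x /path/to/build_step.sh",  # placeholder; -x echoes each remote command
    ],
    capture_output=True,
    text=True,
)
print("exit code:", result.returncode)
print(result.stderr[-2000:])  # keep the tail of the verbose log in the build output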

Cloud dataflow job using Internal IP?

How do I configure my Cloud dataflow job to run using internal IPs only?
Our policy doesn't allow using external IPs to spawn the workers, so I am looking for an option that disallows external IPs. I ran the job and got the error below.
Startup of the worker pool in zone XXX failed to bring up any of the desired 1 workers. Please check for errors in your job parameters, check quota, and retry later, or please try in a different zone/region.
Add instance projects to use external IP with it.
You can use the --usePublicIps=false pipeline option.
Looks like they updated the flags. For Python it's now:
--no_use_public_ips or --use_public_ips
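For the Python SDK, the same option can also be set in code; a hedged sketch (project, region, subnetwork and bucket are placeholders, and the subnetwork must let workers reach Google APIs without external IPs, e.g. via Private Google Access):

from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",                                    # placeholder
    region="us-central1",                                    # placeholder
    use_public_ips=False,                                    # equivalent of --no_use_public_ips
    subnetwork="regions/us-central1/subnetworks/my-subnet",  # placeholder
    temp_location="gs://my-bucket/tmp",                      # placeholder
)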

Jenkins service won't start unless it has access to 178.255.83.1

We recently went through some network policy updates, and I've discovered that my Jenkins server's jenkins service will no longer restart as expected (this worked fine prior to the policy updates).
There doesn't seem to be any logging output written during service startup (no log files get updated).
Is there a list of external IPs that Jenkins needs to access in order to start up properly?
Looking at the logs, it seems that part of the service start-up process is to contact one of the OCSP servers. This appears to be related to certificate verification, so it is probably legitimate traffic.
Once an exception was added for the target address (http://178.255.83.1:80), the Jenkins service started up without issues.
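A quick, hedged way to confirm from the Jenkins host whether that endpoint is reachable (the address is the one from the answer above):

import socket

try:
    socket.create_connection(("178.255.83.1", 80), timeout=5).close()
    print("OCSP endpoint reachable")
except OSError as exc:
    print("blocked:", exc)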
