pyspark client no result from spark server in docker but is connecting - docker

I have a Spark cluster running in a Docker container. I have a simple pyspark example program, running on my desktop outside the Docker container, to test my configuration. The Spark console shows that it receives, executes, and completes the job. However, the pyspark client never gets the results.
[image of Spark console]
The pyspark program's console shows:
" Using Spark's default log4j profile:
org/apache/spark/log4j-defaults.properties Setting default log level
to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For
SparkR, use setLogLevel(newLevel). 22/03/05 11:42:23 WARN
ProcfsMetricsGetter: Exception when trying to compute pagesize, as a
result reporting of ProcessTree metrics is stopped 22/03/05 11:42:28
WARN TaskSchedulerImpl: Initial job has not accepted any resources;
check your cluster UI to ensure that workers are registered and have
sufficient resources 22/03/05 11:42:43 WARN TaskSchedulerImpl: Initial
job has not accepted any resources; check your cluster UI to ensure
that workers are registered and have sufficient resources 22/03/05
11:42:58 WARN TaskSchedulerImpl: Initial job has not accepted any
resources; check your cluster UI to ensure that workers are registered
and have sufficient resources 22/03/05 11:43:13 WARN
TaskSchedulerImpl: Initial job has not accepted any resources; check
your cluster UI to ensure that workers are registered and have
sufficient resources 22/03/05 11:43:28 WARN TaskSchedulerImpl: Initial
job has not accepted any resources; check your cluster UI to ensure
that workers are registered and have sufficient resources 22/03/05
11:43:43 WARN TaskSchedulerImpl: Initial job has not accepted any
resources; check your cluster UI to ensure that workers are registered
and have sufficient resources "
I know this warning is misleading, since the job did execute on the server.
If I click the kill link on the server, the pyspark program immediately gets:
22/03/05 11:46:22 ERROR Utils: Uncaught exception in thread stop-spark-context
org.apache.spark.SparkException: Exception thrown in awaitResult:
    at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:301)
    at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
    at org.apache.spark.deploy.client.StandaloneAppClient.stop(StandaloneAppClient.scala:287)
    at org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend.org$apache$spark$scheduler$cluster$StandaloneSchedulerBackend$$stop(StandaloneSchedulerBackend.scala:259)
    at org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend.stop(StandaloneSchedulerBackend.scala:131)
    at org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:927)
    at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:2567)
    at org.apache.spark.SparkContext.$anonfun$stop$12(SparkContext.scala:2086)
    at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1442)
    at org.apache.spark.SparkContext.stop(SparkContext.scala:2086)
    at org.apache.spark.SparkContext$$anon$3.run(SparkContext.scala:2035)
Caused by: org.apache.spark.SparkException: Could not find AppClient.
Thoughts on how to fix this?

There can be multiple reasons for this. Because your Spark cluster is running in a Docker container, it is possible that the machine running the pyspark client (the driver) is not reachable from the Spark worker nodes, even though the workers are reachable from the driver. That is why your Spark session gets created but the results never make it back to the client.
You should make the driver reachable from the Spark nodes so that the network connection is complete in both directions. If the error messages show a DNS name, which in most cases is a container or host name, map it to the reachable host IP in the /etc/hosts file on all nodes of the Spark cluster.
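If the hosts-file mapping alone doesn't do it, a minimal sketch of pinning the driver's advertised address and ports on the pyspark side may also help; the master URL, IP address, and port numbers below are placeholders, not values taken from the question:
from pyspark.sql import SparkSession

# Placeholder values: replace the master URL and 192.168.1.10 with your
# actual Spark master address and the desktop IP that the containers can reach.
spark = (
    SparkSession.builder
    .appName("docker-connectivity-test")
    .master("spark://spark-master:7077")
    # Address the executors should use to connect back to the driver.
    .config("spark.driver.host", "192.168.1.10")
    # Fixed ports so they can be opened/published through Docker networking.
    .config("spark.driver.port", "7078")
    .config("spark.blockManager.port", "7079")
    .getOrCreate()
)

# A trivial job to verify that results actually flow back to the client.
print(spark.sparkContext.parallelize(range(100)).sum())
spark.stop()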
Hope it helps.

Related

Dataflow job not completed/failed after workers are being started

I have created a Dataflow pipeline which reads a file from a Storage bucket and applies a simple transform to the data (e.g. trimming spaces).
When I execute the Dataflow job, the job starts and the log shows that the workers are started in a zone, but after that nothing happens. The job never completes or fails; I had to stop it manually.
The Dataflow job is executed by a service account that has the dataflow.worker, dataflow.developer, and dataflow.objectAdmin roles.
Can someone suggest why the Dataflow job is not completing, or why nothing executes after the workers have started?
2021-02-09 11:01:29.753 GMT  Worker configuration: n1-standard-1 in europe-west2-b.
Warning  2021-02-09 11:01:30.015 GMT  The network sdas-global-dev doesn't have rules that open TCP ports 12345-12346 for internal connection with other VMs. Only rules with a target tag 'dataflow' or empty target tags set apply. If you don't specify such a rule, any pipeline with more than one worker that shuffles data will hang. Causes: No firewall rules associated with your network.
Info  2021-02-09 11:01:31.067 GMT  Executing operation Read files/Read+ManageData/ParDo(ManageData)
Info  2021-02-09 11:01:31.115 GMT  Starting 1 workers in europe-west2-b...
Warning  2021-02-09 11:07:33.341 GMT  The network sdas-global-dev doesn't have rules that open TCP ports 12345-12346 for internal connection with other VMs. Only rules with a target tag 'dataflow' or empty target tags set apply. If you don't specify such a rule, any pipeline with more than one worker that shuffles data will hang. Causes: No firewall rules associated with your network.
I found the problem. I was running the job in one region while the VPC was in a different region, so the workers were not able to spin up. Once I made the job's region the same as the VPC's region, everything went well.
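For reference, a hedged sketch of how a pipeline like the one described above could be launched from Python with the region (and, if needed, the subnetwork) pinned to match the VPC; the project, bucket, and subnetwork names are placeholders:
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder project/bucket/subnetwork values; the key point is that
# "region" must match the region of the VPC/subnetwork the workers run in.
options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    region="europe-west2",
    temp_location="gs://my-bucket/temp",
    subnetwork="regions/europe-west2/subnetworks/my-subnet",
)

with beam.Pipeline(options=options) as p:
    (
        p
        | "Read files" >> beam.io.ReadFromText("gs://my-bucket/input.txt")
        | "ManageData" >> beam.Map(lambda line: line.strip())
        | "Write" >> beam.io.WriteToText("gs://my-bucket/output")
    )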

YarnCluster constructor hangs in dask-yarn

I'm using dask-yarn version 0.3.1, following the basic example on https://dask-yarn.readthedocs.io/en/latest/.
from dask_yarn import YarnCluster
from dask.distributed import Client

# Create a cluster where each worker has two cores and eight GB of memory
cluster = YarnCluster(environment='environment.tar.gz',
                      worker_vcores=2,
                      worker_memory="8GB")
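(For completeness, the example in the linked docs would normally continue by connecting a distributed client to the cluster; that point is never reached here because the constructor does not return.)
# Connect a Dask distributed client to the cluster (never reached in my case).
client = Client(cluster)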
The application is successfully submitted to the cluster, but control never returns to the console after the YarnCluster constructor. The following is the final output during startup.
18/09/19 16:14:24 INFO skein.Daemon: Submitting application...
18/09/19 16:14:24 INFO impl.YarnClientImpl: Submitted application application_1534573350864_34823
18/09/19 16:14:27 INFO skein.Daemon: Notifying that application_1534573350864_34823 has started. 1 callbacks registered.
18/09/19 16:14:27 INFO skein.Daemon: Removing callbacks for application_1534573350864_34823
One thing I noticed when initially testing from within a Docker container was an exception related to grpc failing to parse the http_proxy environment variable. When running from a dedicated cluster edge node, I don't see this exception, but control still does not return after the constructor.

HDFS write from kafka : createBlockOutputStream Exception

I'm running Hadoop in a Docker swarm with 1 namenode and 3 datanodes (on 3 physical machines).
I'm also using Kafka and Kafka Connect with the HDFS connector to write messages into HDFS in Parquet format.
I'm able to write data to HDFS using HDFS clients (hdfs put).
But when Kafka writes messages, it works at the very beginning and then fails with this error:
org.apache.hadoop.net.ConnectTimeoutException: 60000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=/10.0.0.8:50010]
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:534)
at org.apache.hadoop.hdfs.DFSOutputStream.createSocketForPipeline(DFSOutputStream.java:1533)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1309)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1262)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:448)
[2018-05-23 10:30:10,125] INFO Abandoning BP-468254989-172.17.0.2-1527063205150:blk_1073741825_1001 (org.apache.hadoop.hdfs.DFSClient:1265)
[2018-05-23 10:30:10,148] INFO Excluding datanode DatanodeInfoWithStorage[10.0.0.8:50010,DS-cd1c0b17-bebb-4379-a5e8-5de7ff7a7064,DISK] (org.apache.hadoop.hdfs.DFSClient:1269)
[2018-05-23 10:31:10,203] INFO Exception in createBlockOutputStream (org.apache.hadoop.hdfs.DFSClient:1368)
org.apache.hadoop.net.ConnectTimeoutException: 60000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=/10.0.0.9:50010]
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:534)
at org.apache.hadoop.hdfs.DFSOutputStream.createSocketForPipeline(DFSOutputStream.java:1533)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1309)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1262)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:448)
And then the datanodes are no longer reachable for the process:
[2018-05-23 10:32:10,316] WARN DataStreamer Exception (org.apache.hadoop.hdfs.DFSClient:557)
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /topics/+tmp/test_hdfs/year=2018/month=05/day=23/hour=08/60e75c4c-9129-454f-aa87-6c3461b54445_tmp.parquet could only be replicated to 0 nodes instead of minReplication (=1). There are 3 datanode(s) running and 3 node(s) are excluded in this operation.
at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1733)
at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.chooseTargetForNewBlock(FSDirWriteFileOp.java:265)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2496)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:828)
But if I look into the hadoop web admin console, all the nodes seem to be up and OK.
I've checked hdfs-site.xml, and the "dfs.client.use.datanode.hostname" setting is set to true on both the namenode and the datanodes. All IPs in the Hadoop configuration files are defined as 0.0.0.0 addresses.
I've tried to format the namenode too, but the error happened again.
Could the problem be that Kafka is writing to HDFS too fast and overwhelming it? That would be odd, as I've tried the same configuration on a smaller cluster and it worked well even with a high throughput of Kafka messages.
Do you have any other ideas about the origin of this problem?
Thanks
dfs.client.use.datanode.hostname=true also has to be configured on the client side. Following your stack trace:
java.nio.channels.SocketChannel[connection-pending remote=/10.0.0.9:50010]
I guess 10.0.0.9 refers to a private net IP; thus, it seems that the property is not set in your client within hdfs-client.xml.
You can find more detail here.

Flink Could not upload the jar files on Kubernetes with Calico. PUT operation failed

We run Flink in Kubernetes 1.8 in AWS. It's been fine for months.
I've set up a new k8s cluster. Everything is the same EXCEPT we enabled Calico (instead of using only Flannel).
Just like Flannel, Calico gives us networking between containers.
Since enabling Calico, the Flink client receives this error when trying to send a jar file to the job manager:
org.apache.flink.client.program.ProgramInvocationException: The program execution failed: Could not upload the jar files to the job manager.
Caused by: java.io.IOException: Could not retrieve the JobManager's blob port.
Caused by: java.io.IOException: PUT operation failed: Connection reset
Caused by: java.net.SocketException: Connection reset
and the job manager says:
java.lang.IllegalArgumentException: Invalid BLOB addressing for permanent BLOBs
2018-03-27 06:28:16,069 INFO org.apache.flink.runtime.jobmanager.JobManager - Submitting job 11433fc332c7d76100fd08e6d1b623b4 (flink-job-connectivity-test).
2018-03-27 06:28:16,085 INFO org.apache.flink.runtime.jobmanager.JobManager - Using restart strategy NoRestartStrategy for 11433fc332c7d76100fd08e6d1b623b4.
2018-03-27 06:28:16,096 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job recovers via failover strategy: full graph restart
2018-03-27 06:28:16,105 INFO org.apache.flink.runtime.jobmanager.JobManager - Running initialization on master for job flink-job-connectivity-test (11433fc332c7d76100fd08e6d1b623b4).
2018-03-27 06:28:16,105 INFO org.apache.flink.runtime.jobmanager.JobManager - Successfully ran initialization on master in 0 ms.
2018-03-27 06:28:16,117 ERROR org.apache.flink.runtime.jobmanager.JobManager - Failed to submit job 11433fc332c7d76100fd08e6d1b623b4 (ignite-flink-job-connectivity-test)
java.lang.NullPointerException
    at org.apache.flink.util.Preconditions.checkNotNull(Preconditions.java:58)
    at org.apache.flink.runtime.checkpoint.CheckpointStatsTracker.<init>(CheckpointStatsTracker.java:121)
at org.apache.flink.runtime.executiongraph.ExecutionGraphBuilder.buildGraph(ExecutionGraphBuilder.java:228)
at org.apache.flink.runtime.jobmanager.JobManager.org$apache$flink$runtime$jobmanager$JobManager$$submitJob(JobManager.scala:1277)
at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:447)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
It looks like the file cannot be transferred from the client to the job manager. I believe Invalid BLOB addressing is because the job manager did not receive any file.
Everything is the same. Works on one cluster. Does not work on another. Ports are configured the same. Every artefact is the same.
We don't have any NetworkPolicy. But could enabling Calico have some effect on networking?
Problem solved. I added this to my Flink task manager manifest file
- name: data
  port: 6121
- name: rpc
  port: 6122
- name: query
  port: 6125
And this in the Flink conf file:
taskmanager.data.port: 6121
So basically I pinned a data port for the task manager; I had already done that for the job manager (the blob server port) and it was fine. But it looks like Calico works differently from Flannel and could not use a random data port for the task manager.

Docker Swarm Late Server Startup

I've been using docker swarm for a while and I'm really pleased with how simple it is to set up a swarm cluster and to run replicated services. However I've faced a problem that seems like a blocker in my use case.
I'm using docker 1.12 and swarm mode.
My problem is that the internal IPVS load balancer sends requests to tasks whose health status is "starting", even though my application has not properly started yet.
My application takes some time to start, but the Docker swarm load balancer starts sending requests as soon as the container is in the "running" state.
After running some tests, I realized that if I scale up by one instance, the new instance is available to the load balancer immediately, and the client may get a connection-refused response if the load balancer sends the request to the server that is still starting.
I've implemented the health check and I was expecting a particular instance to only become available to the load balancer after the first successful health check.
Is there any way to configure the load balancer or the scheduler to only send requests to instances that are properly started?
Best Regards,
Bruno Vale
