YarnCluster constructor hangs in dask-yarn - dask

I'm using dask-yarn version 0.3.1, following the basic example on https://dask-yarn.readthedocs.io/en/latest/.
from dask_yarn import YarnCluster
from dask.distributed import Client
# Create a cluster where each worker has two cores and eight GB of memory
cluster = YarnCluster(environment='environment.tar.gz',
                      worker_vcores=2,
                      worker_memory="8GB")
The application is successfully submitted to the cluster, but control does not return to the console after the YarnCluster constructor. The following is the final output after starting:
18/09/19 16:14:24 INFO skein.Daemon: Submitting application...
18/09/19 16:14:24 INFO impl.YarnClientImpl: Submitted application application_1534573350864_34823
18/09/19 16:14:27 INFO skein.Daemon: Notifying that application_1534573350864_34823 has started. 1 callbacks registered.
18/09/19 16:14:27 INFO skein.Daemon: Removing callbacks for application_1534573350864_34823
One thing I noticed when I was initially testing from within a Docker container was an exception related to grpc not parsing the http_proxy environment variable. When running from a dedicated cluster edge node, I don't see this exception, but I also don't see control returned after the constructor.
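A minimal sketch of one possible workaround, assuming the grpc http_proxy parsing issue mentioned above is actually the culprit (this is a guess, not a confirmed fix), is to clear the proxy variables before constructing the cluster:

import os

# Assumption: the hang is related to grpc mis-handling proxy settings, so drop
# them for this process before dask-yarn/skein starts its grpc-based daemon.
for var in ("http_proxy", "https_proxy", "HTTP_PROXY", "HTTPS_PROXY"):
    os.environ.pop(var, None)

from dask_yarn import YarnCluster

cluster = YarnCluster(environment='environment.tar.gz',
                      worker_vcores=2,
                      worker_memory="8GB")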

Related

pyspark client no result from spark server in docker but is connecting

I have a Spark cluster running in a Docker container, and a simple PySpark example program, running on my desktop outside the Docker container, to test my configuration. The Spark console receives and executes the job, and the job completes. However, the PySpark client never gets the results.
[image of the Spark console]
The pyspark program's console shows:
" Using Spark's default log4j profile:
org/apache/spark/log4j-defaults.properties Setting default log level
to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For
SparkR, use setLogLevel(newLevel). 22/03/05 11:42:23 WARN
ProcfsMetricsGetter: Exception when trying to compute pagesize, as a
result reporting of ProcessTree metrics is stopped 22/03/05 11:42:28
WARN TaskSchedulerImpl: Initial job has not accepted any resources;
check your cluster UI to ensure that workers are registered and have
sufficient resources 22/03/05 11:42:43 WARN TaskSchedulerImpl: Initial
job has not accepted any resources; check your cluster UI to ensure
that workers are registered and have sufficient resources 22/03/05
11:42:58 WARN TaskSchedulerImpl: Initial job has not accepted any
resources; check your cluster UI to ensure that workers are registered
and have sufficient resources 22/03/05 11:43:13 WARN
TaskSchedulerImpl: Initial job has not accepted any resources; check
your cluster UI to ensure that workers are registered and have
sufficient resources 22/03/05 11:43:28 WARN TaskSchedulerImpl: Initial
job has not accepted any resources; check your cluster UI to ensure
that workers are registered and have sufficient resources 22/03/05
11:43:43 WARN TaskSchedulerImpl: Initial job has not accepted any
resources; check your cluster UI to ensure that workers are registered
and have sufficient resources "
I know this is false since the job executed on the server.
If I click the kill link on the server, the PySpark program immediately gets:
22/03/05 11:46:22 ERROR Utils: Uncaught exception in thread stop-spark-context
org.apache.spark.SparkException: Exception thrown in awaitResult:
    at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:301)
    at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
    at org.apache.spark.deploy.client.StandaloneAppClient.stop(StandaloneAppClient.scala:287)
    at org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend.org$apache$spark$scheduler$cluster$StandaloneSchedulerBackend$$stop(StandaloneSchedulerBackend.scala:259)
    at org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend.stop(StandaloneSchedulerBackend.scala:131)
    at org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:927)
    at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:2567)
    at org.apache.spark.SparkContext.$anonfun$stop$12(SparkContext.scala:2086)
    at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1442)
    at org.apache.spark.SparkContext.stop(SparkContext.scala:2086)
    at org.apache.spark.SparkContext$$anon$3.run(SparkContext.scala:2035)
Caused by: org.apache.spark.SparkException: Could not find AppClient.
Thoughts on how to fix this?
There can be multiple reasons for this. Since you are running the Spark client in a Docker container, there is a possibility that your container is not reachable from the Spark nodes even though the reverse direction works; that's why your Spark session gets created but gets killed a few seconds after that.
You should make your container accessible from the Spark nodes so that the network connection is complete in both directions. If the error message shows a DNS name, which in most cases is the container name, map it to the Docker container's host IP in the /etc/hosts file on all nodes of the Spark cluster.
Hope it helps.
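If adjusting /etc/hosts alone is not enough, a minimal sketch of the kind of driver-side settings that usually matter when the client runs inside Docker (the master URL, IP, and ports below are placeholders, not values taken from the question) looks like this:

from pyspark.sql import SparkSession

# Sketch only: make the driver advertise an address the Spark nodes can reach
# and bind to fixed ports that are published from the container.
spark = (SparkSession.builder
         .master("spark://spark-master:7077")            # placeholder master URL
         .config("spark.driver.host", "10.0.0.5")        # host IP reachable from the workers
         .config("spark.driver.bindAddress", "0.0.0.0")  # bind inside the container
         .config("spark.driver.port", "40000")           # publish with -p 40000:40000
         .config("spark.blockManager.port", "40001")     # publish with -p 40001:40001
         .getOrCreate())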

Unable to see any web transactions of my Vertx application in glowroot

I am working on a Vert.x application serving some APIs.
After following the instructions under https://github.com/glowroot/glowroot/wiki/Central-Collector-Installation, I was not able to see any web transaction data in the Glowroot central collector. Here's what I tried.
I downloaded glowroot-central.jar and, after building the netty plugin from https://github.com/glowroot/glowroot/tree/master/agent/plugins/netty-plugin, placed it in the plugins folder alongside the glowroot-central.jar file.
Next, I started Glowroot by running "java -jar glowroot-central.jar".
Then I passed -javaagent:path/to/glowroot.jar in the JVM args of my Vert.x application.
I was also able to confirm from the console output that the agent connects to the central collector. Here's the output when I start my Vert.x application:
org.glowroot - Java version: 1.8.0_201 (Oracle Corporation / Mac OS X)
2019-04-05 12:31:43.988 INFO org.glowroot - Java args: -javaagent:/Users/somefolder/glowroot/glowroot.jar
org.glowroot - agent id: "testserver"
org.glowroot - connected to the central collector http://0.0.0.0:8181, version 0.13.2, built 2019-03-27 17:05:44 +0000
I am also able to see my agent's name, "testserver", in the Glowroot web UI. However, I cannot see any web transaction data. I called my API a few hundred times using an automated tool and waited a while (~30 min), but I don't see anything :(

Configure Spring Cloud Task to use the Kafka of the Spring Cloud Data Flow server

I have a Spring Cloud Data Flow (SCDF) server running on a Kubernetes cluster with Kafka as the message broker. Now I am trying to launch a Spring Cloud Task (SCT) that writes to a topic in Kafka. I would like the SCT to use the same Kafka that SCDF is using. This brings up two questions that I hope can be answered:
How to configure the SCT to use the same Kafka as SCDF?
Is it possible to configure the SCT so that the Kafka server URI is passed to the SCT automatically when it launches, similar to the data source properties that get passed to the SCT at launch?
As I could not find any examples on how to achieve this, help is very appreciated.
Edit: My own answer
This is how I get it working for my case. My SCT requires spring.kafka.bootstrap-servers to be supplied. From SCDF's shell, I provide it as an argument --spring.kafka.bootstrap-servers=${KAFKA_SERVICE_HOST}:${KAFKA_SERVICE_PORT}, where KAFKA_SERVICE_HOST and KAFKA_SERVICE_PORT are environment variables created by SCDF's k8s setup script.
This is how to launch the task from SCDF's shell:
dataflow:>task launch --name sample-task --arguments "--spring.kafka.bootstrap-servers=${KAFKA_SERVICE_HOST}:${KAFKA_SERVICE_PORT}"
You may want to review the Spring Cloud Task Events section in the reference guide.
The expectation is that you'd choose the binder of your choice and package that library in the Task application's classpath. With that dependency in place, you'd then configure the application with Spring Cloud Stream's Kafka binder properties, such as spring.cloud.stream.kafka.binder.brokers and others that are relevant to connecting to the existing Kafka cluster.
Upon launching the Task application (from SCDF) with these configurations, you'd be able to publish or receive events in your Task app.
Alternatively, with the Kafka binder in the classpath of the Task application, you can apply the Kafka binder properties to all the Tasks launched by SCDF via global configuration. See Common Application Properties in the ref. guide for more information. In this model, you don't have to configure each Task application with Kafka properties explicitly; instead, SCDF propagates them automatically when it launches the Tasks. Keep in mind that these properties would be supplied to all Task launches.
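As a rough illustration of the global-configuration route, the property would be set once on the SCDF server rather than per launch. The exact key below is an assumption on my part and should be checked against the Common Application Properties section of the reference guide:

spring.cloud.dataflow.applicationProperties.task.spring.cloud.stream.kafka.binder.brokers=${KAFKA_SERVICE_HOST}:${KAFKA_SERVICE_PORT}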

Stackdriver Log Agent - Log Level Irrelevant with Google Cloud Logging Driver for Docker

TL;DR: Log levels are ignored when making a Stackdriver logging API call using a CloudLoggingHandler from a Docker container that uses the Google Cloud Logging driver.
Detail:
The recommended way to get logs from a Docker container running on Google's Compute Engine is to use the Stackdriver Logging Agent:
It is a best practice to run the Stackdriver Logging agent on all your VM instances. The agent runs under both Linux and Windows. To install the Stackdriver Logging agent, see Installing the Logging Agent.
The following steps were completed successfully:
Ensure Compute Engine default service account has Editor and Logs Writer roles.
Ensure the VM instance has Cloud API access scope for Stackdriver Logging API (Full)
Install and start Stackdriver Logging Agent.
I then copied the CloudLoggingHandler example from Google's Cloud Platform Python docs.
import logging
import google.cloud.logging
from google.cloud.logging.handlers import CloudLoggingHandler
client = google.cloud.logging.Client()
handler = CloudLoggingHandler(client)
cloud_logger = logging.getLogger('cloudLogger')
cloud_logger.setLevel(logging.INFO)
cloud_logger.addHandler(handler)
cloud_logger.error('bad news error')
cloud_logger.warning('bad news warning')
cloud_logger.info('bad news info')
The Docker container is started with the Google Cloud Logging Driver flag (--log-driver=gcplogs):
sudo docker run --log-driver=gcplogs --name=server gcr.io/my-project/server:latest
This works; however, all logs, irrespective of level, are only visible in Stackdriver when viewing 'Any log level'. Strangely, the message itself contains the level:
2018-08-22 22:34:42.176 BST
ERROR:bad news error
2018-08-22 22:34:42.176 BST
WARNING:bad news warning
2018-08-22 22:34:42.176 BST
WARNING:bad news info
This makes it impossible to filter by level in the Stackdriver UI:
In the screenshot, all icons on the left-hand side of every log entry show the level as Any.
From what I can tell, the CloudLoggingHandler is a standalone handler that sends logs to the global log. To integrate with the gcplogs driver properly, try using the ContainerEngineHandler instead.
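For reference, a minimal sketch of the suggested switch (assuming google-cloud-logging 1.x, where ContainerEngineHandler is available) would replace the CloudLoggingHandler above with:

import logging
from google.cloud.logging.handlers import ContainerEngineHandler

# ContainerEngineHandler writes structured JSON to stdout, so the gcplogs
# driver / logging agent can pick up the severity field for each entry.
cloud_logger = logging.getLogger('cloudLogger')
cloud_logger.setLevel(logging.INFO)
cloud_logger.addHandler(ContainerEngineHandler())
cloud_logger.error('bad news error')    # should now show as ERROR, not Any
cloud_logger.warning('bad news warning')
cloud_logger.info('bad news info')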

In a dask distributed setup the worker sits idle

I'm trying to set up a dask distributed cluster. I've installed dask on three machines to get started:
laptop (where searchCV gets called)
scheduler (small box where the dask scheduler process lives)
HPC (Large box expected to do the work)
I have dask[complete] installed on the laptop and dask on the other machines.
The worker and scheduler start fine and I can see the dashboards, but I can't send them anything. Running GridSearchCV on the laptop gets a result, but it comes from the laptop alone; the worker sits idle.
All machines are Windows 7 (the HPC is Windows 10). I've checked the ports with netstat, and it appears everything is really listening where it is supposed to.
When running a small example, I get the following error:
from dask.distributed import Client

scheduler_address = 'tcp://10.X.XX.XX:8786'
client = Client(scheduler_address)

def square(x):
    return x ** 2

def neg(x):
    return -x

A = client.map(square, range(10))
B = client.map(neg, A)
total = client.submit(sum, B)
print(total.result())
INFO - Batched Comm Closed: in <closed TCP>: ConnectionAbortedError: [WinError 10053] An established connection was aborted by the software in your host machine
distributed.comm.core.CommClosedError
tornado.application - ERROR - Exception in callback <bound method Client._heartbeat of <Client: scheduler='tcp://10.X.XX.XX:8786' processes=1 cores=10
I've also filed a bug report, as I don't know if this is a bug or ineptitude on my part (I'm guessing the latter).
Running client.get_versions(check=True) revealed all sorts of version mismatches despite a clean install with -U. Making the environments the same fixed the problem. The laptop can apparently have different versions of some packages installed; at least it worked for the differences I had, YMMV.
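For anyone hitting the same thing, a minimal sketch of that version check (the scheduler address is the placeholder one from the question):

from dask.distributed import Client

client = Client('tcp://10.X.XX.XX:8786')
# With check=True, mismatched package versions between the client, scheduler,
# and workers are reported instead of being silently ignored.
versions = client.get_versions(check=True)
print(versions)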
