How to retrieve worker logs for a Dask-YARN job? - dask

I have a simple Dask-YARN script that does only one task: load a file from HDFS, as shown below. However, I'm running into a bug in the code, so I added a print statement in the function, but I don't see that statement being executed in the worker logs, which I obtain using yarn logs -applicationId {application_id}. I even tried the method Client.get_worker_logs(), however that doesn't display the stdout either; it just shows some INFO messages about the worker(s). How does one obtain worker logs after the execution of the code has completed?
import sys
import numpy as np
import scipy.signal
import json
import dask
from dask.distributed import Client
from dask_yarn import YarnCluster

@dask.delayed
def load(input_file):
    print("In call of Load...")
    with open(input_file, "r") as fo:
        data = json.load(fo)
    return data

# Process input args
(_, filename) = sys.argv

dag_1 = {
    'load-1': (load, filename)
}

print("Building tasks...")
tasks = dask.get(dag_1, 'load-1')
print("Creating YARN cluster now...")
cluster = YarnCluster()
print("Scaling YARN cluster now...")
cluster.scale(1)
print("Creating Client now...")
client = Client(cluster)
print("Getting logs..1")
print(client.get_worker_logs())
print("Doing Dask computations now...")
dask.compute(tasks)
print("Getting logs..2")
print(client.get_worker_logs())
print("Shutting down cluster now...")
cluster.shutdown()

I'm not sure what's going on here; print statements should (and usually do) end up in the log files stored by YARN.
If you want your debug statements to appear in the worker logs from get_worker_logs, you can use the worker logger directly:
from distributed.worker import logger
logger.info("This will show up in the worker logs")

Related

Pipeline fails with name error on Dataflow runner but not on Direct runner

I have created a Python based pipeline that contains a ParDo that leverages the Python base64 package. When I run the pipeline locally with DirectRunner, all is well. When I run the same pipeline with Dataflow on Google Cloud, it fails with an error of:
NameError: name 'base64' is not defined [while running 'ParDo(WriteToSeparateFiles)-ptransform-47']
It seems to be missing the base64 package, but I believe that to be part of the Python standard library and always present.
Here is my complete pipeline code:
import base64 as base64
import argparse
import apache_beam as beam
import apache_beam.io.fileio as fileio
import apache_beam.io.filesystems as filesystems
from apache_beam.options.pipeline_options import PipelineOptions

class WriteToSeparateFiles(beam.DoFn):
    def __init__(self, outdir):
        self.outdir = outdir

    def process(self, element):
        writer = filesystems.FileSystems.create(self.outdir + str(element) + '.txt')
        message = "This is the content of my file"
        message_bytes = message.encode('ascii')
        base64_bytes = base64.b64encode(message_bytes)  ### Error here
        writer.write(base64_bytes)
        writer.close()

argv = None
parser = argparse.ArgumentParser()
known_args, pipeline_args = parser.parse_known_args(argv)
pipeline_options = PipelineOptions(pipeline_args)

with beam.Pipeline(options=pipeline_options) as pipeline:
    outputs = (
        pipeline
        | beam.Create(range(10))  # Change range here to be millions if needed
        | beam.ParDo(WriteToSeparateFiles('gs://kolban-edi/'))
    )
    outputs | beam.Map(print)
    #print(outputs)
Solution
The solution has been found and is documented here
https://cloud.google.com/dataflow/docs/resources/faq#how_do_i_handle_nameerrors
After much study, I found a section in the Google Dataflow Frequently Asked Questions titled "How do I handle NameErrors?".
Reading that documentation, the suggestion was to add an extra pipeline option called --save_main_session. As soon as I added that to the execution, the problem was resolved.
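For reference, a minimal sketch of setting this in code rather than on the command line (SetupOptions is where Beam exposes save_main_session; pipeline_args stands in for the flags parsed in the question):

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions

pipeline_args = []  # or the flags parsed with argparse, as in the question
pipeline_options = PipelineOptions(pipeline_args)
# Pickle the main session so module-level imports (e.g. base64) are
# available inside DoFns running on Dataflow workers.
pipeline_options.view_as(SetupOptions).save_main_session = True

with beam.Pipeline(options=pipeline_options) as pipeline:
    pass  # build the pipeline as before

Alternatively, passing --save_main_session on the command line when launching the pipeline has the same effect.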

Python KafkaConsumer not connecting

Setup:
I have 3 docker containers
1) For Kafka
2) For Zookeeper
3) For JupyterLab
I set up networking between these containers, and I can see that the Kafka producer is able to run and produce data.
KafkaProducer.ipynb
KAFKA_BROKER = ['172.20.0.2:9093']

from kafka import KafkaProducer
from kafka.errors import KafkaError

producer = KafkaProducer(bootstrap_servers=KAFKA_BROKER)
for _ in range(100):
    print("sending")
    producer.send('my-topic', key=b'foo', value=b'bar')
    print("success")
Here send() sends the message 100 times.
KafkaConsumer.ipynb
KAFKA_BROKER = ['172.20.0.2:9093']

from kafka import KafkaConsumer

consumer = KafkaConsumer('my-topic', group_id='my-group', bootstrap_servers=KAFKA_BROKER)
print("Comm success")
for message in consumer:
    # message value and key are raw bytes -- decode if necessary!
    # e.g., for unicode: `message.value.decode('utf-8')`
    print("%s:%d:%d: key=%s value=%s" % (message.topic, message.partition,
                                         message.offset, message.key,
                                         message.value))
In the consumer code above, the line print("Comm success") never gets executed. Based on the producer's execution, the network is open and Jupyter is able to talk to the Kafka broker, but the client does not appear to connect to the same broker for data consumption. How can I start debugging this?
By default the auto.offset.reset value is latest, so set it to earliest and use a new group.id:
consumer = KafkaConsumer('my-topic', group_id='new-group', auto_offset_reset='earliest', bootstrap_servers=KAFKA_BROKER)
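If you want to rule out connectivity problems before worrying about offsets, a rough debugging sketch with kafka-python might look like the following (consumer_timeout_ms makes the loop exit instead of blocking forever when no messages arrive):

from kafka import KafkaConsumer

KAFKA_BROKER = ['172.20.0.2:9093']

# Ask the broker for metadata without subscribing; if this hangs or raises,
# the problem is connectivity/advertised listeners rather than offsets.
probe = KafkaConsumer(bootstrap_servers=KAFKA_BROKER, consumer_timeout_ms=5000)
print("Topics visible to this client:", probe.topics())

# Consume from the beginning with a fresh group id.
consumer = KafkaConsumer('my-topic',
                         group_id='new-group',
                         auto_offset_reset='earliest',
                         bootstrap_servers=KAFKA_BROKER,
                         consumer_timeout_ms=10000)
for message in consumer:
    print(message.topic, message.partition, message.offset, message.key, message.value)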

How to capture logs from workers from a Dask-Yarn job?

I have tried using the following in ~/.config/dask/distributed.yaml and ~/.config/dask/yarn.yaml,
logging-file-config: "/path/to/config.ini"
or
logging:
  version: 1
  disable_existing_loggers: false
  root:
    level: INFO
    handlers: [consoleHandler]
  handlers:
    consoleHandler:
      class: logging.StreamHandler
      level: INFO
      formatter: sample_formatter
      stream: ext://sys.stderr
  formatters:
    sample_formatter:
      format: '%(asctime)s - %(name)s - %(levelname)s - %(message)s'
and then in my function that gets evaluated at the worker:
import logging
from distributed.worker import logger
import dask
from dask.distributed import Client
from dask_yarn import YarnCluster

log = logging.getLogger(__name__)

@dask.delayed
def worker_func(args):
    logger.info("This will show up in the worker logs")
    log.info("This does not show up in worker logs")
    return

if __name__ == "__main__":
    dag_1 = {'worker_func': (worker_func, arg_1)}
    tasks = dask.get(dag_1, 'load-1')
    log.info("This also shows up in logs, and custom formatted")
    cluster = YarnCluster()
    client = Client(cluster)
    dask.compute(tasks)
When I try to view the yarn logs using:
yarn logs -applicationId {application_id}
I do not see the log from log.info inside worker_func, but I do see the logs from distributed.worker.logger and from outside that function on the console. I also tried using client.get_worker_logs(), but that returned an empty dictionary. Is there a way to see customized logs from inside the function that gets evaluated at a worker?
There's a lot going on in this question, so I'm going to answer "How do I configure logging for dask-yarn workers" and hope everything else becomes clear by answering that.
Dask's configuration system is loaded locally on the machine you start a dask cluster from (usually the edge node). This configuration is not distributed to the workers automatically; you're responsible for doing that yourself. You have a few options here:
Have admin/IT put configuration in /etc/dask/ on every node, which will affect all users.
Bundle configuration with your packaged environment. Dask will load configuration from {prefix}/etc/dask/, where prefix is sys.prefix.
For example, if you have a conda environment at /path/to/environment, you'd do the following to bundle the configuration
# Create the configuration directory in the environment
mkdir -p /path/to/environment/etc/dask/
# Add your configuration to this directory
mv config.yaml /path/to/environment/etc/dask/config.yaml
# Package the environment
conda pack -p /path/to/environment -o environment.tar.gz
Any configuration values set in config.yaml will now be available on all the worker nodes. An example configuration file setting some logging configuration would be:
logging:
  version: 1
  root:
    level: INFO
    handlers: [consoleHandler]
  handlers:
    consoleHandler:
      class: logging.StreamHandler
      level: INFO
      formatter: sample_formatter
      stream: ext://sys.stderr
  formatters:
    sample_formatter:
      format: '%(asctime)s - %(name)s - %(levelname)s - %(message)s'
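To have the workers actually use that packed environment (and therefore pick up the bundled configuration), a minimal sketch would be the following, where environment.tar.gz is the archive produced by conda-pack above:

from dask.distributed import Client
from dask_yarn import YarnCluster

# dask-yarn ships this archive to every worker container, so the
# config.yaml bundled under etc/dask/ inside it is loaded on each worker.
cluster = YarnCluster(environment="environment.tar.gz")
cluster.scale(2)
client = Client(cluster)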
Logs from completed dask-yarn applications can be retrieved using the YARN CLI:
yarn logs -applicationId <application-id>
Logs for running dask-yarn applications can be retrieved using client.get_worker_logs(). Note that these logs will only contain messages written to the distributed.worker logger. You cannot write to your own logger and have those messages appear in the output of client.get_worker_logs(). To write to this logger, get it via:
import logging
logger = logging.getLogger("distributed.worker")
logger.info("Writing with the worker logger")
Any logger appropriately configured to log to stdout or stderr will appear in the logs accessed via the yarn CLI, but only the distributed.worker logger output will also be available to get_worker_logs().
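To illustrate that last point, here is a rough sketch of a task that attaches its own stderr handler, so its messages land in the YARN-aggregated container logs (but still not in the output of get_worker_logs()):

import logging
import sys

def worker_func(x):
    # A custom logger writing to stderr: visible via `yarn logs`, but not via
    # Client.get_worker_logs(), which only captures the distributed.worker logger.
    log = logging.getLogger("my_app")
    if not log.handlers:
        handler = logging.StreamHandler(sys.stderr)
        handler.setFormatter(logging.Formatter(
            "%(asctime)s - %(name)s - %(levelname)s - %(message)s"))
        log.addHandler(handler)
        log.setLevel(logging.INFO)
    log.info("Custom logger message for %s", x)
    return x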
Side note
I have tried using the following in ~/.config/dask/distributed.yaml and ~/.config/dask/yarn.yaml
The name of the config files doesn't matter; dask loads all YAML files in all config directories and merges their contents. For more information, please read the configuration docs.

Google Cloud Dataflow cryptic message when downloading file from gcp to local system

I am writing a Dataflow pipeline that processes videos from a Google Cloud bucket. My pipeline downloads each work item to the local system and then reuploads results back to a GCP bucket, following a previous question.
The pipeline works with the local DirectRunner; I'm having trouble debugging it on the DataflowRunner.
The error reads
File "run_clouddataflow.py", line 41, in process
File "/usr/local/lib/python2.7/dist-packages/google/cloud/storage/blob.py", line 464, in download_to_file self._do_download(transport, file_obj, download_url, headers)
File "/usr/local/lib/python2.7/dist-packages/google/cloud/storage/blob.py", line 418, in _do_download download.consume(transport) File "/usr/local/lib/python2.7/dist-packages/google/resumable_media/requests/download.py", line 101, in consume self._write_to_stream(result)
File "/usr/local/lib/python2.7/dist-packages/google/resumable_media/requests/download.py", line 62, in _write_to_stream with response: AttributeError: __exit__ [while running 'Run DeepMeerkat']
When trying to execute blob.download_to_file(file_obj) within:
storage_client = storage.Client()
bucket = storage_client.get_bucket(parsed.hostname)
blob = storage.Blob(parsed.path[1:], bucket)

#store local path
local_path = "/tmp/" + parsed.path.split("/")[-1]
print('local path: ' + local_path)

with open(local_path, 'wb') as file_obj:
    blob.download_to_file(file_obj)
print("Downloaded" + local_path)
I'm guessing that the workers don't have permission to write locally? Or perhaps there is no /tmp folder in the Dataflow container. Where should I write objects? It's hard to debug without access to the environment. Is it possible to access stdout from the workers for debugging purposes (serial console)?
EDIT #1
I've tried explicitly passing credentials:
try:
    credentials, project = google.auth.default()
except:
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = known_args.authtoken
    credentials, project = google.auth.default()
as well as writing to the current working directory instead of /tmp/:
local_path = parsed.path.split("/")[-1]
print('local path: ' + local_path)
with open(local_path, 'wb') as file_obj:
    blob.download_to_file(file_obj)
I'm still getting the cryptic error on blob downloads from GCP.
The full pipeline script is below; setup.py is here.
import logging
import argparse
import json
import logging
import os
import csv
import apache_beam as beam
from urlparse import urlparse
from google.cloud import storage

##The namespaces inside of clouddataflow workers is not inherited,
##Please see https://cloud.google.com/dataflow/faq#how-do-i-handle-nameerrors,
##better to write ugly import statements then to miss a namespace

class PredictDoFn(beam.DoFn):
    def process(self, element):
        import csv
        from google.cloud import storage
        from DeepMeerkat import DeepMeerkat
        from urlparse import urlparse
        import os
        import google.auth

        DM = DeepMeerkat.DeepMeerkat()

        print(os.getcwd())
        print(element)

        #try adding credentials?
        #set credentials, inherent from worker
        credentials, project = google.auth.default()

        #download element locally
        parsed = urlparse(element[0])

        #parse gcp path
        storage_client = storage.Client(credentials=credentials)
        bucket = storage_client.get_bucket(parsed.hostname)
        blob = storage.Blob(parsed.path[1:], bucket)

        #store local path
        local_path = parsed.path.split("/")[-1]
        print('local path: ' + local_path)
        with open(local_path, 'wb') as file_obj:
            blob.download_to_file(file_obj)
        print("Downloaded" + local_path)

        #Assign input from DataFlow/manifest
        DM.process_args(video=local_path)
        DM.args.output = "Frames"

        #Run DeepMeerkat
        DM.run()

        #upload back to GCS
        found_frames = []
        for (root, dirs, files) in os.walk("Frames/"):
            for files in files:
                fileupper = files.upper()
                if fileupper.endswith((".JPG")):
                    found_frames.append(os.path.join(root, files))
        for frame in found_frames:
            #create GCS path
            path = "DeepMeerkat/" + parsed.path.split("/")[-1] + "/" + frame.split("/")[-1]
            blob = storage.Blob(path, bucket)
            blob.upload_from_filename(frame)

def run():
    import argparse
    import os
    import apache_beam as beam
    import csv
    import logging
    import google.auth

    parser = argparse.ArgumentParser()
    parser.add_argument('--input', dest='input', default="gs://api-project-773889352370-testing/DataFlow/manifest.csv",
                        help='Input file to process.')
    parser.add_argument('--authtoken', default="/Users/Ben/Dropbox/Google/MeerkatReader-9fbf10d1e30c.json",
                        help='Input file to process.')
    known_args, pipeline_args = parser.parse_known_args()

    #set credentials, inherent from worker
    try:
        credentials, project = google.auth.default()
    except:
        os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = known_args.authtoken
        credentials, project = google.auth.default()

    p = beam.Pipeline(argv=pipeline_args)
    vids = (p | 'Read input' >> beam.io.ReadFromText(known_args.input)
              | 'Parse input' >> beam.Map(lambda line: csv.reader([line]).next())
              | 'Run DeepMeerkat' >> beam.ParDo(PredictDoFn()))

    logging.getLogger().setLevel(logging.INFO)
    p.run()

if __name__ == '__main__':
    logging.getLogger().setLevel(logging.INFO)
    run()
I spoke to the google-cloud-storage package maintainer; this was a known issue. Updating to specific versions in my setup.py,
REQUIRED_PACKAGES = ["google-cloud-storage==1.3.2","google-auth","requests>=2.18.0"]
fixed the issue.
https://github.com/GoogleCloudPlatform/google-cloud-python/issues/3836
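For context, a minimal sketch of the kind of setup.py this refers to (the package name here is illustrative; the actual setup.py was linked in the question):

import setuptools

# Pinned dependencies installed on every Dataflow worker.
REQUIRED_PACKAGES = ["google-cloud-storage==1.3.2", "google-auth", "requests>=2.18.0"]

setuptools.setup(
    name="deepmeerkat-pipeline",  # illustrative name
    version="0.0.1",
    install_requires=REQUIRED_PACKAGES,
    packages=setuptools.find_packages(),
)

Passing --setup_file ./setup.py when launching the pipeline tells Dataflow to install these packages on each worker.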

Kubernetes/Spring Cloud Dataflow stream > spring.cloud.stream.bindings.output.destination is ignored by producer

I'm trying to run a "Hello, world" Spring Cloud Data Flow stream based on the very simple example explained at http://cloud.spring.io/spring-cloud-dataflow/. I'm able to create a simple source and sink and run it on my local SCDF server using Kafka, so until here everything is correct and messages are produced and consumed in the topic specified by SCDF.
Now, I'm trying to deploy it in my private cloud based on the instructions listed at http://docs.spring.io/spring-cloud-dataflow-server-kubernetes/docs/current-SNAPSHOT/reference/htmlsingle/#_getting_started. Using this deployment I'm able to deploy a simple "time | log" out-of-the-box stream with no problems, but my example fails since the producer is not writing in the topic specified when the pod is created (for instance, spring.cloud.stream.bindings.output.destination=ntest33.nites-source9) but in the topic "output". I have a similar problem with the sink component, which creates and expect messages in the topic "input".
I created the stream definition using the dashboard:
nsource1 | log
And container args for the source are:
--spring.cloud.stream.bindings.output.producer.requiredGroups=ntest34
--spring.cloud.stream.bindings.output.destination=ntest34.nsource1
Code snippet for source component is
package xxxx;

import java.text.SimpleDateFormat;
import java.util.Date;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.cloud.stream.annotation.EnableBinding;
import org.springframework.cloud.stream.messaging.Source;
import org.springframework.context.annotation.Bean;
import org.springframework.integration.annotation.InboundChannelAdapter;
import org.springframework.integration.core.MessageSource;
import org.springframework.messaging.support.GenericMessage;

@SpringBootApplication
@EnableBinding(Source.class)
public class HelloNitesApplication
{
    public static void main(String[] args)
    {
        SpringApplication.run(HelloNitesApplication.class, args);
    }

    @Bean
    @InboundChannelAdapter(value = Source.OUTPUT)
    public MessageSource<String> timerMessageSource()
    {
        return () -> new GenericMessage<>("Hello " + new SimpleDateFormat().format(new Date()));
    }
}
And in the logs I can see clearly
2017-04-07T09:44:34.596842965Z 2017-04-07 09:44:34,593 INFO main o.s.i.c.DirectChannel:81 - Channel 'application.output' has 1 subscriber(s).
The question is: how do I properly override the topic where messages are produced/consumed, or what attribute and values should I use to make this work on k8s?
UPDATE: I have a similar problem using RabbitMQ
2017-04-07T12:56:40.435405177Z 2017-04-07 12:56:40.435 INFO 7 --- [ main] o.s.integration.channel.DirectChannel : Channel 'application.output' has 1 subscriber(s).
The problem was with my Docker image. I still don't know the details, but using the Dockerfile indicated at https://spring.io/guides/gs/spring-boot-docker/ instantiated 2 processes in the Docker container, one with the parameters and the other without; the latter was the one that stayed up and was therefore being used.
The solution was to replace
ENTRYPOINT [ "sh", "-c", "java $JAVA_OPTS -Djava.security.egd=file:/dev/./urandom -jar /app.jar" ]
With
ENTRYPOINT [ "java", "-jar", "/app.jar" ]
And it started working. There must be a good reason why the example indicated the first entrypoint and why 2 processes were created, but the reason is still beyond my understanding.
Can you provide more details on how you set that configuration property? That feature is pretty basic, so this should work. If you are using a stream definition to set it, please update your question with the stream definition.
The channel name remains 'output' because that's what the application uses internally.
