Pipeline fails with name error on Dataflow runner but not on Direct runner - google-cloud-dataflow

I have created a Python based pipeline that contains a ParDo that leverages the Python base64 package. When I run the pipeline locally with DirectRunner, all is well. When I run the same pipeline with Dataflow on Google Cloud, it fails weith an error of:
NameError: name 'base64' is not defined [while running 'ParDo(WriteToSeparateFiles)-ptransform-47']
It seems to be missing the base64 package but I believe that to be part of standard python and always present.
Here is my complete pipeline code:
import base64 as base64
import argparse
import apache_beam as beam
import apache_beam.io.fileio as fileio
import apache_beam.io.filesystems as filesystems
from apache_beam.options.pipeline_options import PipelineOptions
class WriteToSeparateFiles(beam.DoFn):
def __init__(self, outdir):
self.outdir = outdir
def process(self, element):
writer = filesystems.FileSystems.create(self.outdir + str(element) + '.txt')
message = "This is the content of my file"
message_bytes = message.encode('ascii')
base64_bytes = base64.b64encode(message_bytes) ### Error here
writer.write(base64_bytes)
writer.close()
argv=None
parser = argparse.ArgumentParser()
known_args, pipeline_args = parser.parse_known_args(argv)
pipeline_options = PipelineOptions(pipeline_args)
with beam.Pipeline(options=pipeline_options) as pipeline:
outputs = (
pipeline
| beam.Create(range(10)) # Change range here to be millions if needed
| beam.ParDo(WriteToSeparateFiles('gs://kolban-edi/'))
)
outputs | beam.Map(print)
#print(outputs)
Solution
The solution has been found and is documented here
https://cloud.google.com/dataflow/docs/resources/faq#how_do_i_handle_nameerrors

After much study, a section in the Google Dataflow Frequently Asked Questions was found titled: "How do I handle NameErrors?"
Reading that documentation, the suggestion was to add an extra runner parameter called --save_main_session. As soon as I added that to the execution, the problem was resolved.

Related

Install Custom Dependency for KFP Op

I'm trying to setup a simple KubeFlow pipeline, and I'm having trouble packaging up dependencies in a way that works for KubeFlow.
The code simply downloads a config file and parses it, then passes back the parsed configuration.
However, in order to parse the config file, it needs to have access to another internal python package.
I have a .tar.gz archive of the package hosted on a bucket in the same project, and added the URL of the package as a dependency, but I get an error message saying tarfile.ReadError: not a gzip file.
I know the file is good, so it's some intermediate issue with hosting on a bucket or the way kubeflow installs dependencies.
Here is a minimal example:
from kfp import compiler
from kfp import dsl
from kfp.components import func_to_container_op
from google.protobuf import text_format
from google.cloud import storage
import training_reader
def get_training_config(working_bucket: str,
working_directoy: str,
config_file: str) -> training_reader.TrainEvalPipelineConfig:
download_file(working_bucket, os.path.join(working_directoy, config_file), "ssd.config")
pipeline_config = training_reader.TrainEvalPipelineConfig()
with open("ssd.config", 'r') as f:
text_format.Merge(f.read(), pipeline_config)
return pipeline_config
config_op_packages = ["https://storage.cloud.google.com/my_bucket/packages/training-reader-0.1.tar.gz",
"google-cloud-storage",
"protobuf"
]
training_config_op = func_to_container_op(get_training_config,
base_image="tensorflow/tensorflow:1.15.2-py3",
packages_to_install=config_op_packages)
def output_config(config: training_reader.TrainEvalPipelineConfig) -> None:
print(config)
output_config_op = func_to_container_op(output_config)
#dsl.pipeline(
name='Post Training Processing',
description='Building the post-processing pipeline'
)
def ssd_postprocessing_pipeline(
working_bucket: str,
working_directory: str,
config_file:str):
config = training_config_op(working_bucket, working_directory, config_file)
output_config_op(config.output)
pipeline_name = ssd_postprocessing_pipeline.__name__ + '.zip'
compiler.Compiler().compile(ssd_postprocessing_pipeline, pipeline_name)
The https://storage.cloud.google.com/my_bucket/packages/training-reader-0.1.tar.gz IRL requires authentication. Try to download it in Incognito mode and you'll see the login page instead of file.
Changing the URL to https://storage.googleapis.com/my_bucket/packages/training-reader-0.1.tar.gz works for public objects, but your object is not public.
The only thing you can do (if you cannot make the package public) is to use google.cloud.storage library or gsutil program to download the file from the bucket and then manually install it suing subprocess.run([sys.executable, '-m', 'pip', 'install', ...])
Where are you downloading the data from?
What's the purpose of
pipeline_config = training_reader.TrainEvalPipelineConfig()
with open("ssd.config", 'r') as f:
text_format.Merge(f.read(), pipeline_config)
return pipeline_config
Why not just do the following:
def get_training_config(
working_bucket: str,
working_directory: str,
config_file: str,
output_config_path: OutputFile('TrainEvalPipelineConfig'),
):
download_file(working_bucket, os.path.join(working_directoy, config_file), output_config_path)
the way kubeflow installs dependencies.
Export your component to loadable component.yaml and you'll see how KFP Lighweight components install dependencies:
training_config_op = func_to_container_op(
get_training_config,
base_image="tensorflow/tensorflow:1.15.2-py3",
packages_to_install=config_op_packages,
output_component_file='component.yaml',
)
P.S. Some small pieces of info:
#dsl.pipeline(
Not required unless you want to use the dsl-compile command-line program
pipeline_name = ssd_postprocessing_pipeline.name + '.zip'
compiler.Compiler().compile(ssd_postprocessing_pipeline, pipeline_name)
Did you know that you can just kfp.Client(host=...).create_run_from_pipeline_func(ssd_postprocessing_pipeline, arguments={}) to run the pipeline right away?

How to retrieve worker logs for a Dask-YARN job?

I have a simple Dask-YARN script that does only one task: load a file from HDFS, as shown below. However, I'm running into a bug in the code, so I added a print statement in the function, but I don't see that statement being executed in the worker logs which I obtain using yarn logs -applicationId {application_id}. I even tried the method Client.get_worker_logs(), however that doesn't display the stdout as well, just shows some INFO about the worker(s). How does one obtain worker logs after the execution of the code has completed?
import sys
import numpy as np
import scipy.signal
import json
import dask
from dask.distributed import Client
from dask_yarn import YarnCluster
#dask.delayed
def load(input_file):
print("In call of Load...")
with open(input_file, "r") as fo:
data = json.load(fo)
return data
# Process input args
(_, filename) = sys.argv
dag_1 = {
'load-1': (load, filename)
}
print("Building tasks...")
tasks = dask.get(dag_1, 'load-1')
print("Creating YARN cluster now...")
cluster = YarnCluster()
print("Scaling YARN cluster now...")
cluster.scale(1)
print("Creating Client now...")
client = Client(cluster)
print("Getting logs..1")
print(client.get_worker_logs())
print("Doing Dask computations now...")
dask.compute(tasks)
print("Getting logs..2")
print(client.get_worker_logs())
print("Shutting down cluster now...")
cluster.shutdown()
I'm not sure what's going on here, print statements should (and usually do) end up in the log files stored by yarn.
If you want your debug statements to appear in the worker logs from get_worker_logs, you can use the worker logger directly:
from distributed.worker import logger
logger.info("This will show up in the worker logs")

Dependency parse large text file with python

I am trying to parse a large txt file (about 2000 sentence). when I want to set the model_path, I get this massage:
NLTK was unable to find stanford-parser.jar! Set the CLASSPATH
environment variable.
And also when I set the CLASSPATH to this file, another message comes out:
NLTK was unable to find stanford-parser-(\d+)(.(\d+))+-models.jar!
Set the CLASSPATH environment variable.
Would you help me to solve it?
This is my code:
import nltk
from nltk.parse.stanford import StanfordDependencyParser
dependency_parser = StanfordDependencyParser( model_path="edu\stanford\lp\models\lexparser\englishPCFG.ser.gz")
===========================================================================
NLTK was unable to find stanford-parser.jar! Set the CLASSPATH
environment variable.
For more information, on stanford-parser.jar, see:
https://nlp.stanford.edu/software/lex-parser.shtml
import os
os.environ['CLASSPATH'] = "stanford-corenlp-full-2018-10-05/*"
dependency_parser = StanfordDependencyParser( model_path="edu\stanford\lp\models\lexparser\englishPCFG.ser.gz")
===========================================================================
NLTK was unable to find stanford-parser.jar! Set the CLASSPATH
environment variable.
For more information, on stanford-parser.jar, see:
https://nlp.stanford.edu/software/lex-parser.shtml
os.environ['CLASSPATH'] = "stanford-corenlp-full-2018-10-05/stanford-parser-full-2018-10-17/stanford-parser.jar"
>>> dependency_parser = StanfordDependencyParser( model_path="stanford-corenlp-full-2018-10-05/stanford-parser-full-2018-10-17/edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz")
NLTK was unable to find stanford-parser-(\d+)(.(\d+))+-models.jar!
Set the CLASSPATH environment variable.
For more information, on stanford-parser-(\d+)(.(\d+))+-models.jar, see:
https://nlp.stanford.edu/software/lex-parser.shtml
You should get the new stanfordnlp dependency parser that is native to Python!
It will run slower on the CPU than GPU, but it still should run reasonably fast.
Just run pip install stanfordnlp to install.
import stanfordnlp
stanfordnlp.download('en') # This downloads the English models for the neural pipeline
nlp = stanfordnlp.Pipeline() # This sets up a default neural pipeline in English
doc = nlp("Barack Obama was born in Hawaii. He was elected president in 2008.")
doc.sentences[0].print_dependencies()
There is also a helpful command line tool:
python -m stanfordnlp.run_pipeline -l en example.txt
Full details here: https://stanfordnlp.github.io/stanfordnlp/
GitHub: https://github.com/stanfordnlp/stanfordnlp

Google Cloud Dataflow cryptic message when downloading file from gcp to local system

I am writing a dataflow pipeline that processes videos from a google cloud bucket. My pipeline downloads each work item to the local system and then reuploads results back to GCP bucket. Following previous question.
The pipeline works on local DirectRunner, i'm having trouble debugging on DataFlowRunnner.
The error reads
File "run_clouddataflow.py", line 41, in process
File "/usr/local/lib/python2.7/dist-packages/google/cloud/storage/blob.py", line 464, in download_to_file self._do_download(transport, file_obj, download_url, headers)
File "/usr/local/lib/python2.7/dist-packages/google/cloud/storage/blob.py", line 418, in _do_download download.consume(transport) File "/usr/local/lib/python2.7/dist-packages/google/resumable_media/requests/download.py", line 101, in consume self._write_to_stream(result)
File "/usr/local/lib/python2.7/dist-packages/google/resumable_media/requests/download.py", line 62, in _write_to_stream with response: AttributeError: __exit__ [while running 'Run DeepMeerkat']
When trying to execute blob.download_to_file(file_obj) within:
storage_client=storage.Client()
bucket = storage_client.get_bucket(parsed.hostname)
blob=storage.Blob(parsed.path[1:],bucket)
#store local path
local_path="/tmp/" + parsed.path.split("/")[-1]
print('local path: ' + local_path)
with open(local_path, 'wb') as file_obj:
blob.download_to_file(file_obj)
print("Downloaded" + local_path)
I'm guessing that the workers are not in permission to write locally? Or perhaps there is not a /tmp folder in the dataflow container. Where should I write objects? Its hard to debug without access to the environment. Is it possible to access stdout from workers for debugging purposes (serial console?)
EDIT #1
I've tried explicitly passing credentials:
try:
credentials, project = google.auth.default()
except:
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = known_args.authtoken
credentials, project = google.auth.default()
as well as writing to cwd(), instead of /tmp/
local_path=parsed.path.split("/")[-1]
print('local path: ' + local_path)
with open(local_path, 'wb') as file_obj:
blob.download_to_file(file_obj)
Still getting the cryptic error on blob downloads from gcp.
Full Pipeline script is below, setup.py is here.
import logging
import argparse
import json
import logging
import os
import csv
import apache_beam as beam
from urlparse import urlparse
from google.cloud import storage
##The namespaces inside of clouddataflow workers is not inherited ,
##Please see https://cloud.google.com/dataflow/faq#how-do-i-handle-nameerrors, better to write ugly import statements then to miss a namespace
class PredictDoFn(beam.DoFn):
def process(self,element):
import csv
from google.cloud import storage
from DeepMeerkat import DeepMeerkat
from urlparse import urlparse
import os
import google.auth
DM=DeepMeerkat.DeepMeerkat()
print(os.getcwd())
print(element)
#try adding credentials?
#set credentials, inherent from worker
credentials, project = google.auth.default()
#download element locally
parsed = urlparse(element[0])
#parse gcp path
storage_client=storage.Client(credentials=credentials)
bucket = storage_client.get_bucket(parsed.hostname)
blob=storage.Blob(parsed.path[1:],bucket)
#store local path
local_path=parsed.path.split("/")[-1]
print('local path: ' + local_path)
with open(local_path, 'wb') as file_obj:
blob.download_to_file(file_obj)
print("Downloaded" + local_path)
#Assign input from DataFlow/manifest
DM.process_args(video=local_path)
DM.args.output="Frames"
#Run DeepMeerkat
DM.run()
#upload back to GCS
found_frames=[]
for (root, dirs, files) in os.walk("Frames/"):
for files in files:
fileupper=files.upper()
if fileupper.endswith((".JPG")):
found_frames.append(os.path.join(root, files))
for frame in found_frames:
#create GCS path
path="DeepMeerkat/" + parsed.path.split("/")[-1] + "/" + frame.split("/")[-1]
blob=storage.Blob(path,bucket)
blob.upload_from_filename(frame)
def run():
import argparse
import os
import apache_beam as beam
import csv
import logging
import google.auth
parser = argparse.ArgumentParser()
parser.add_argument('--input', dest='input', default="gs://api-project-773889352370-testing/DataFlow/manifest.csv",
help='Input file to process.')
parser.add_argument('--authtoken', default="/Users/Ben/Dropbox/Google/MeerkatReader-9fbf10d1e30c.json",
help='Input file to process.')
known_args, pipeline_args = parser.parse_known_args()
#set credentials, inherent from worker
try:
credentials, project = google.auth.default()
except:
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = known_args.authtoken
credentials, project = google.auth.default()
p = beam.Pipeline(argv=pipeline_args)
vids = (p|'Read input' >> beam.io.ReadFromText(known_args.input)
| 'Parse input' >> beam.Map(lambda line: csv.reader([line]).next())
| 'Run DeepMeerkat' >> beam.ParDo(PredictDoFn()))
logging.getLogger().setLevel(logging.INFO)
p.run()
if __name__ == '__main__':
logging.getLogger().setLevel(logging.INFO)
run()
I spoke to the google-cloud-storage package mantainer, this was a known issue. Updating specific versiosn in my setup.py to
REQUIRED_PACKAGES = ["google-cloud-storage==1.3.2","google-auth","requests>=2.18.0"]
fixed the issue.
https://github.com/GoogleCloudPlatform/google-cloud-python/issues/3836

How to configure Jenkins Cobertura plugin to monitor specific packages?

My project has a number of packages ("models", "controllers", etc.). I've set up Jenkins with the Cobertura plugin to generate coverage reports, which is great. I'd like to mark a build as unstable if coverage drops below a certain threshold, but only on certain packages (e.g., "controllers", but not "models"). I don't see an obvious way to do this in the configuration UI, however -- it looks like the thresholds are global.
Is there a way to do this?
(Answering my own question here)
As far as I can tell, this isn't possible -- I haven't seen anything after a couple days of looking. I wrote a simple script that would do what I want -- take the coverage output, parse it, and fail the build if coverage of specific packages didn't meet certain thresholds. It's dirty and can be cleaned up/expanded, but the basic idea is here. Comments are welcome.
#!/usr/bin/env python
'''
Jenkins' Cobertura plugin doesn't allow marking a build as successful or
failed based on coverage of individual packages -- only the project as a
whole. This script will parse the coverage.xml file and fail if the coverage of
specified packages doesn't meet the thresholds given
'''
import sys
from lxml import etree
PACKAGES_XPATH = etree.XPath('/coverage/packages/package')
def main(argv):
filename = argv[0]
package_args = argv[1:] if len(argv) > 1 else []
# format is package_name:coverage_threshold
package_coverage = {package: int(coverage) for
package, coverage in [x.split(':') for x in package_args]}
xml = open(filename, 'r').read()
root = etree.fromstring(xml)
packages = PACKAGES_XPATH(root)
failed = False
for package in packages:
name = package.get('name')
if name in package_coverage:
# We care about this one
print 'Checking package {} -- need {}% coverage'.format(
name, package_coverage[name])
coverage = float(package.get('line-rate', '0.0')) * 100
if coverage < package_coverage[name]:
print ('FAILED - Coverage for package {} is {}% -- '
'minimum is {}%'.format(
name, coverage, package_coverage[name]))
failed = True
else:
print "PASS"
if failed:
sys.exit(1)
if __name__ == '__main__':
main(sys.argv[1:])

Resources