Google Cloud Dataflow cryptic message when downloading file from gcp to local system

Google Cloud Dataflow cryptic message when downloading file from gcp to local system - google-cloud-dataflow

I am writing a dataflow pipeline that processes videos from a google cloud bucket. My pipeline downloads each work item to the local system and then reuploads results back to GCP bucket. Following previous question.
The pipeline works on local DirectRunner, i'm having trouble debugging on DataFlowRunnner.
The error reads
File "run_clouddataflow.py", line 41, in process
File "/usr/local/lib/python2.7/dist-packages/google/cloud/storage/blob.py", line 464, in download_to_file self._do_download(transport, file_obj, download_url, headers)
File "/usr/local/lib/python2.7/dist-packages/google/cloud/storage/blob.py", line 418, in _do_download download.consume(transport) File "/usr/local/lib/python2.7/dist-packages/google/resumable_media/requests/download.py", line 101, in consume self._write_to_stream(result)
File "/usr/local/lib/python2.7/dist-packages/google/resumable_media/requests/download.py", line 62, in _write_to_stream with response: AttributeError: __exit__ [while running 'Run DeepMeerkat']
When trying to execute blob.download_to_file(file_obj) within:
storage_client=storage.Client()
bucket = storage_client.get_bucket(parsed.hostname)
blob=storage.Blob(parsed.path[1:],bucket)
#store local path
local_path="/tmp/" + parsed.path.split("/")[-1]
print('local path: ' + local_path)
with open(local_path, 'wb') as file_obj:
blob.download_to_file(file_obj)
print("Downloaded" + local_path)
I'm guessing that the workers are not in permission to write locally? Or perhaps there is not a /tmp folder in the dataflow container. Where should I write objects? Its hard to debug without access to the environment. Is it possible to access stdout from workers for debugging purposes (serial console?)
EDIT #1
I've tried explicitly passing credentials:
try:
credentials, project = google.auth.default()
except:
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = known_args.authtoken
credentials, project = google.auth.default()
as well as writing to cwd(), instead of /tmp/
local_path=parsed.path.split("/")[-1]
print('local path: ' + local_path)
with open(local_path, 'wb') as file_obj:
blob.download_to_file(file_obj)
Still getting the cryptic error on blob downloads from gcp.
Full Pipeline script is below, setup.py is here.
import logging
import argparse
import json
import logging
import os
import csv
import apache_beam as beam
from urlparse import urlparse
from google.cloud import storage
##The namespaces inside of clouddataflow workers is not inherited ,
##Please see https://cloud.google.com/dataflow/faq#how-do-i-handle-nameerrors, better to write ugly import statements then to miss a namespace
class PredictDoFn(beam.DoFn):
def process(self,element):
import csv
from google.cloud import storage
from DeepMeerkat import DeepMeerkat
from urlparse import urlparse
import os
import google.auth
DM=DeepMeerkat.DeepMeerkat()
print(os.getcwd())
print(element)
#try adding credentials?
#set credentials, inherent from worker
credentials, project = google.auth.default()
#download element locally
parsed = urlparse(element[0])
#parse gcp path
storage_client=storage.Client(credentials=credentials)
bucket = storage_client.get_bucket(parsed.hostname)
blob=storage.Blob(parsed.path[1:],bucket)
#store local path
local_path=parsed.path.split("/")[-1]
print('local path: ' + local_path)
with open(local_path, 'wb') as file_obj:
blob.download_to_file(file_obj)
print("Downloaded" + local_path)
#Assign input from DataFlow/manifest
DM.process_args(video=local_path)
DM.args.output="Frames"
#Run DeepMeerkat
DM.run()
#upload back to GCS
found_frames=[]
for (root, dirs, files) in os.walk("Frames/"):
for files in files:
fileupper=files.upper()
if fileupper.endswith((".JPG")):
found_frames.append(os.path.join(root, files))
for frame in found_frames:
#create GCS path
path="DeepMeerkat/" + parsed.path.split("/")[-1] + "/" + frame.split("/")[-1]
blob=storage.Blob(path,bucket)
blob.upload_from_filename(frame)
def run():
import argparse
import os
import apache_beam as beam
import csv
import logging
import google.auth
parser = argparse.ArgumentParser()
parser.add_argument('--input', dest='input', default="gs://api-project-773889352370-testing/DataFlow/manifest.csv",
help='Input file to process.')
parser.add_argument('--authtoken', default="/Users/Ben/Dropbox/Google/MeerkatReader-9fbf10d1e30c.json",
help='Input file to process.')
known_args, pipeline_args = parser.parse_known_args()
#set credentials, inherent from worker
try:
credentials, project = google.auth.default()
except:
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = known_args.authtoken
credentials, project = google.auth.default()
p = beam.Pipeline(argv=pipeline_args)
vids = (p|'Read input' >> beam.io.ReadFromText(known_args.input)
| 'Parse input' >> beam.Map(lambda line: csv.reader([line]).next())
| 'Run DeepMeerkat' >> beam.ParDo(PredictDoFn()))
logging.getLogger().setLevel(logging.INFO)
p.run()
if __name__ == '__main__':
logging.getLogger().setLevel(logging.INFO)
run()

I spoke to the google-cloud-storage package mantainer, this was a known issue. Updating specific versiosn in my setup.py to
REQUIRED_PACKAGES = ["google-cloud-storage==1.3.2","google-auth","requests>=2.18.0"]
fixed the issue.
https://github.com/GoogleCloudPlatform/google-cloud-python/issues/3836

Related

Pipeline fails with name error on Dataflow runner but not on Direct runner

I have created a Python based pipeline that contains a ParDo that leverages the Python base64 package. When I run the pipeline locally with DirectRunner, all is well. When I run the same pipeline with Dataflow on Google Cloud, it fails weith an error of:
NameError: name 'base64' is not defined [while running 'ParDo(WriteToSeparateFiles)-ptransform-47']
It seems to be missing the base64 package but I believe that to be part of standard python and always present.
Here is my complete pipeline code:
import base64 as base64
import argparse
import apache_beam as beam
import apache_beam.io.fileio as fileio
import apache_beam.io.filesystems as filesystems
from apache_beam.options.pipeline_options import PipelineOptions
class WriteToSeparateFiles(beam.DoFn):
def __init__(self, outdir):
self.outdir = outdir
def process(self, element):
writer = filesystems.FileSystems.create(self.outdir + str(element) + '.txt')
message = "This is the content of my file"
message_bytes = message.encode('ascii')
base64_bytes = base64.b64encode(message_bytes) ### Error here
writer.write(base64_bytes)
writer.close()
argv=None
parser = argparse.ArgumentParser()
known_args, pipeline_args = parser.parse_known_args(argv)
pipeline_options = PipelineOptions(pipeline_args)
with beam.Pipeline(options=pipeline_options) as pipeline:
outputs = (
pipeline
| beam.Create(range(10)) # Change range here to be millions if needed
| beam.ParDo(WriteToSeparateFiles('gs://kolban-edi/'))
)
outputs | beam.Map(print)
#print(outputs)
Solution
The solution has been found and is documented here
https://cloud.google.com/dataflow/docs/resources/faq#how_do_i_handle_nameerrors

After much study, a section in the Google Dataflow Frequently Asked Questions was found titled: "How do I handle NameErrors?"
Reading that documentation, the suggestion was to add an extra runner parameter called --save_main_session. As soon as I added that to the execution, the problem was resolved.

Install Custom Dependency for KFP Op

I'm trying to setup a simple KubeFlow pipeline, and I'm having trouble packaging up dependencies in a way that works for KubeFlow.
The code simply downloads a config file and parses it, then passes back the parsed configuration.
However, in order to parse the config file, it needs to have access to another internal python package.
I have a .tar.gz archive of the package hosted on a bucket in the same project, and added the URL of the package as a dependency, but I get an error message saying tarfile.ReadError: not a gzip file.
I know the file is good, so it's some intermediate issue with hosting on a bucket or the way kubeflow installs dependencies.
Here is a minimal example:
from kfp import compiler
from kfp import dsl
from kfp.components import func_to_container_op
from google.protobuf import text_format
from google.cloud import storage
import training_reader
def get_training_config(working_bucket: str,
working_directoy: str,
config_file: str) -> training_reader.TrainEvalPipelineConfig:
download_file(working_bucket, os.path.join(working_directoy, config_file), "ssd.config")
pipeline_config = training_reader.TrainEvalPipelineConfig()
with open("ssd.config", 'r') as f:
text_format.Merge(f.read(), pipeline_config)
return pipeline_config
config_op_packages = ["https://storage.cloud.google.com/my_bucket/packages/training-reader-0.1.tar.gz",
"google-cloud-storage",
"protobuf"
]
training_config_op = func_to_container_op(get_training_config,
base_image="tensorflow/tensorflow:1.15.2-py3",
packages_to_install=config_op_packages)
def output_config(config: training_reader.TrainEvalPipelineConfig) -> None:
print(config)
output_config_op = func_to_container_op(output_config)
#dsl.pipeline(
name='Post Training Processing',
description='Building the post-processing pipeline'
)
def ssd_postprocessing_pipeline(
working_bucket: str,
working_directory: str,
config_file:str):
config = training_config_op(working_bucket, working_directory, config_file)
output_config_op(config.output)
pipeline_name = ssd_postprocessing_pipeline.__name__ + '.zip'
compiler.Compiler().compile(ssd_postprocessing_pipeline, pipeline_name)

The https://storage.cloud.google.com/my_bucket/packages/training-reader-0.1.tar.gz IRL requires authentication. Try to download it in Incognito mode and you'll see the login page instead of file.
Changing the URL to https://storage.googleapis.com/my_bucket/packages/training-reader-0.1.tar.gz works for public objects, but your object is not public.
The only thing you can do (if you cannot make the package public) is to use google.cloud.storage library or gsutil program to download the file from the bucket and then manually install it suing subprocess.run([sys.executable, '-m', 'pip', 'install', ...])
Where are you downloading the data from?
What's the purpose of
pipeline_config = training_reader.TrainEvalPipelineConfig()
with open("ssd.config", 'r') as f:
text_format.Merge(f.read(), pipeline_config)
return pipeline_config
Why not just do the following:
def get_training_config(
working_bucket: str,
working_directory: str,
config_file: str,
output_config_path: OutputFile('TrainEvalPipelineConfig'),
):
download_file(working_bucket, os.path.join(working_directoy, config_file), output_config_path)
the way kubeflow installs dependencies.
Export your component to loadable component.yaml and you'll see how KFP Lighweight components install dependencies:
training_config_op = func_to_container_op(
get_training_config,
base_image="tensorflow/tensorflow:1.15.2-py3",
packages_to_install=config_op_packages,
output_component_file='component.yaml',
)
P.S. Some small pieces of info:
#dsl.pipeline(
Not required unless you want to use the dsl-compile command-line program
pipeline_name = ssd_postprocessing_pipeline.name + '.zip'
compiler.Compiler().compile(ssd_postprocessing_pipeline, pipeline_name)
Did you know that you can just kfp.Client(host=...).create_run_from_pipeline_func(ssd_postprocessing_pipeline, arguments={}) to run the pipeline right away?

How to retrieve worker logs for a Dask-YARN job?

I have a simple Dask-YARN script that does only one task: load a file from HDFS, as shown below. However, I'm running into a bug in the code, so I added a print statement in the function, but I don't see that statement being executed in the worker logs which I obtain using yarn logs -applicationId {application_id}. I even tried the method Client.get_worker_logs(), however that doesn't display the stdout as well, just shows some INFO about the worker(s). How does one obtain worker logs after the execution of the code has completed?
import sys
import numpy as np
import scipy.signal
import json
import dask
from dask.distributed import Client
from dask_yarn import YarnCluster
#dask.delayed
def load(input_file):
print("In call of Load...")
with open(input_file, "r") as fo:
data = json.load(fo)
return data
# Process input args
(_, filename) = sys.argv
dag_1 = {
'load-1': (load, filename)
}
print("Building tasks...")
tasks = dask.get(dag_1, 'load-1')
print("Creating YARN cluster now...")
cluster = YarnCluster()
print("Scaling YARN cluster now...")
cluster.scale(1)
print("Creating Client now...")
client = Client(cluster)
print("Getting logs..1")
print(client.get_worker_logs())
print("Doing Dask computations now...")
dask.compute(tasks)
print("Getting logs..2")
print(client.get_worker_logs())
print("Shutting down cluster now...")
cluster.shutdown()

I'm not sure what's going on here, print statements should (and usually do) end up in the log files stored by yarn.
If you want your debug statements to appear in the worker logs from get_worker_logs, you can use the worker logger directly:
from distributed.worker import logger
logger.info("This will show up in the worker logs")

VnPy - ImportError: No module named vnctpmd

I just cloned VnPy, and I am trying to run VnTrader on a Ubuntu 16.04 Machine, as mentioned in the VnPy Starter Guide. I followed step by step, but when I run
python vnpy/examples/VnTrader/run.py
I get the following Import Error. What is the problem?
Traceback (most recent call last):
File "run.py", line 28, in <module>
from vnpy.trader.gateway import (ctpGateway, ibGateway)
File "/home/alessandro/anaconda2/lib/python2.7/site-packages/vnpy-1.9.0-py2.7.egg/vnpy/trader/gateway/ctpGateway/__init__.py", line 5, in <module>
from .ctpGateway import CtpGateway
File "/home/alessandro/anaconda2/lib/python2.7/site-packages/vnpy-1.9.0-py2.7.egg/vnpy/trader/gateway/ctpGateway/ctpGateway.py", line 16, in <module>
from vnpy.api.ctp import MdApi, TdApi, defineDict
File "/home/alessandro/anaconda2/lib/python2.7/site-packages/vnpy-1.9.0-py2.7.egg/vnpy/api/ctp/__init__.py", line 4, in <module>
from .vnctpmd import MdApi
ImportError: No module named vnctpmd

ImportError: No module named vnctpmd
The vnctpmd module, is the API interface for CTP broker of the VnPy package. As for every other API interface, you need to first build it, and then import it.
In your case, you probably didn't build the CTP interface when prompted during the installation of VnPy, so now run.py can't import the module.
Do you need 'CTP' interface? No
Solution A: I don't need CTP interface
If you don't need CTP interface you can open run.py commenting the parts relative to CTP (and also relative to all the other interfaces you didn't build)
# /examples/VnTrader/run.py script
# this version runs with only the IB interface built
# encoding: UTF-8
# 重载sys模块，设置默认字符串编码方式为utf8
try:
reload # Python 2
except NameError: # Python 3
from importlib import reload
import sys
reload(sys)
try:
sys.setdefaultencoding('utf8')
except AttributeError:
pass
# 判断操作系统
import platform
system = platform.system()
# vn.trader模块
from vnpy.event import EventEngine
from vnpy.trader.vtEngine import MainEngine
from vnpy.trader.uiQt import createQApp
from vnpy.trader.uiMainWindow import MainWindow
# 加载底层接口
from vnpy.trader.gateway import ibGateway
# ### here comment the interfaces you don't need
# from vnpy.trader.gateway import (ctpGateway, ibGateway)
if system == 'Linux':
# from vnpy.trader.gateway import xtpGateway
pass
elif system == 'Windows':
from vnpy.trader.gateway import (femasGateway, xspeedGateway,
secGateway)
# 加载上层应用
from vnpy.trader.app import (riskManager, ctaStrategy,
spreadTrading, algoTrading)
#----------------------------------------------------------------------
def main():
"""主程序入口"""
# 创建Qt应用对象
qApp = createQApp()
# 创建事件引擎
ee = EventEngine()
# 创建主引擎
me = MainEngine(ee)
# 添加交易接口
# me.addGateway(ctpGateway)
me.addGateway(ibGateway)
if system == 'Windows':
me.addGateway(femasGateway)
me.addGateway(xspeedGateway)
me.addGateway(secGateway)
if system == 'Linux':
# me.addGateway(xtpGateway)
pass
# 添加上层应用
me.addApp(riskManager)
me.addApp(ctaStrategy)
me.addApp(spreadTrading)
me.addApp(algoTrading)
# 创建主窗口
mw = MainWindow(me, ee)
mw.showMaximized()
# 在主线程中启动Qt事件循环
sys.exit(qApp.exec_())
if __name__ == '__main__':
main()
Solution B: I need CTP interface
In case you need CTP, you can just reinstall vnpy with the command
bash install.sh
and when prompted 'do you need CTP' answer Yes

ImportError: no module named 'umqtt.MQTTClient' but file with class exists

I installed MicroPython v1.9.3-8 on my ESP8266 board. Here is the beginning of my main.py file:
from machine import Pin
led = Pin(2, Pin.OUT, value=1)
#---MQTT Sending---
from time import sleep_ms
from ubinascii import hexlify
from machine import unique_id
#import socket
from umqtt import MQTTClient
SERVER = "10.6.6.192"
CLIENT_ID = hexlify(unique_id())
TOPIC1 = b"/server/tem"
TOPIC2 = b"/server/hum"
TOPIC3 = b"/server/led"
The line from umqtt import MQTTClient throws an error when I reset the module:
File "main.py", line 11, in < module >
ImportError: no module named 'umqtt.MQTTClient'
Here is my umqtt.py file.
I have the umqtt.py file uploaded to my esp8266 with webrepl. When I run:
import os
os.listdir()
I get this output:
>>> os.listdir()
['boot.py', 'webrepl_cfg.py', 'umqtt.py', 'main.py']
Since in the umqtt.py file in line 8 the class MQTTClient is defined, I do not know what am I doing wrong to get this code to work.

I think you need to either specify the simple or the robust versions:
from umqtt.simple import MQTTClient

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart

Google Cloud Dataflow cryptic message when downloading file from gcp to local system - google-cloud-dataflow

I spoke to the google-cloud-storage package mantainer, this was a known issue. Updating specific versiosn in my setup.py to REQUIRED_PACKAGES = ["google-cloud-storage==1.3.2","google-auth","requests>=2.18.0"] fixed the issue. https://github.com/GoogleCloudPlatform/google-cloud-python/issues/3836

Related

Pipeline fails with name error on Dataflow runner but not on Direct runner

Install Custom Dependency for KFP Op

How to retrieve worker logs for a Dask-YARN job?

VnPy - ImportError: No module named vnctpmd

ImportError: no module named 'umqtt.MQTTClient' but file with class exists

Categories

Resources