I set up a Google Cloud project in Cloud Shell and tried to run this tutorial script: https://github.com/GoogleCloudPlatform/cloudml-samples/blob/master/flowers/sample.sh
I ran into this error:
***#***:~/git/cloudml-samples/flowers$ ./sample.sh
Your active configuration is: [cloudshell-4691]
Using job id: flowers_***_20170113_162148
python trainer/preprocess.py \
--input_dict "$DICT_FILE" \
--input_path "gs://cloud-ml-data/img/flower_photos/eval_set.csv" \
--output_path "${GCS_PATH}/preproc/eval" \
--cloud
WARNING:root:Using fallback coder for typehint: Any.
WARNING:root:Using fallback coder for typehint: Any.
WARNING:root:Using fallback coder for typehint: Any.
DEPRECATION: pip install --download has been deprecated and will be removed in the future. Pip now has a download command that should be used instead.
Collecting google-cloud-dataflow==0.4.4
Using cached google-cloud-dataflow-0.4.4.zip
Saved /tmp/tmpSoHiTi/google-cloud-dataflow-0.4.4.zip
Successfully downloaded google-cloud-dataflow
# Takes about 30 mins to preprocess everything. We serialize the two
Traceback (most recent call last):
File "trainer/preprocess.py", line 436, in <module>
main(sys.argv[1:])
File "trainer/preprocess.py", line 432, in main
run(arg_dict)
File "trainer/preprocess.py", line 353, in run
p.run()
File "/home/slalomconsultingsf/.local/lib/python2.7/site-packages/apache_beam/pipeline.py", line 159, in run
return self.runner.run(self)
File "/home/slalomconsultingsf/.local/lib/python2.7/site-packages/apache_beam/runners/dataflow_runner.py", line 195, in run
% getattr(self, 'last_error_msg', None), self.result)
apache_beam.runners.dataflow_runner.DataflowRuntimeException: Dataflow pipeline failed:
(b85b0a598a3565cb): Workflow failed.
I was not able to get any clue about what I was doing wrong from the error log of Google Cloud Dataflow.
I'd appreciate any answers and help with troubleshooting.
Enable the Dataflow API. In the Google Cloud Console's top search box, typing "dataflow api" will take you to a page where you can click "Enable API".
I think this will fix it for you. I disabled my Dataflow API and got the same error as you; when it was re-enabled, the problem went away.
I'm trying to get a function that uses Selenium and Firefox/geckodriver to run in AWS Lambda. I've decided to go the route of creating a container image, then uploading and running that instead of using a pre-configured runtime. I was able to create a Dockerfile that correctly installs Firefox and Python, downloads geckodriver, and installs my test code:
FROM alpine:latest
RUN apk add firefox python3 py3-pip
RUN pip install requests selenium
RUN mkdir /app
WORKDIR /app
RUN wget -qO gecko.tar.gz https://github.com/mozilla/geckodriver/releases/download/v0.28.0/geckodriver-v0.28.0-linux64.tar.gz
RUN tar xf gecko.tar.gz
RUN mv geckodriver /usr/bin
COPY *.py ./
ENTRYPOINT ["/usr/bin/python3","/app/lambda_function.py"]
The Selenium test code:
#!/usr/bin/env python3
import util
import os
import sys
import requests


def lambda_wrapper():
    api_base = f'http://{os.environ["AWS_LAMBDA_RUNTIME_API"]}/2018-06-01'
    response = requests.get(api_base + '/runtime/invocation/next')
    request_id = response.headers['Lambda-Runtime-Aws-Request-Id']

    try:
        result = selenium_test()

        # Send result back
        requests.post(api_base + f'/runtime/invocation/{request_id}/response', json={'url': result})
    except Exception as e:
        # Error reporting
        import traceback
        requests.post(api_base + f'/runtime/invocation/{request_id}/error', json={'errorMessage': str(e), 'traceback': traceback.format_exc(), 'logs': open('/tmp/gecko.log', 'r').read()})
        raise


def selenium_test():
    from selenium.webdriver import Firefox
    from selenium.webdriver.firefox.options import Options

    options = Options()
    options.add_argument('-headless')
    options.add_argument('--window-size 1920,1080')
    ffx = Firefox(options=options, log_path='/tmp/gecko.log')
    ffx.get("https://google.com")
    url = ffx.current_url
    ffx.close()
    print(url)
    return url


def main():
    # For testing purposes, currently not using the Lambda API even in AWS so that
    # the same container can run on my local machine.
    # Call lambda_wrapper() instead to get geckodriver logs as well (not informative).
    selenium_test()


if __name__ == '__main__':
    main()
I'm able to successfully build this container on my local machine with docker build -t lambda-test . and then run it with docker run -m 512M lambda-test.
However, the exact same container crashes with an error when I upload it to Lambda and run it there. I set the memory limit to 1024M and the timeout to 30 seconds. The traceback says that Firefox was unexpectedly killed by a signal:
START RequestId: 52adeab9-8ee7-4a10-a728-82087ec9de30 Version: $LATEST
/app/lambda_function.py:29: DeprecationWarning: use service_log_path instead of log_path
ffx = Firefox(options=options, log_path='/tmp/gecko.log')
Traceback (most recent call last):
File "/app/lambda_function.py", line 45, in <module>
main()
File "/app/lambda_function.py", line 41, in main
lambda_wrapper()
File "/app/lambda_function.py", line 12, in lambda_wrapper
result = selenium_test()
File "/app/lambda_function.py", line 29, in selenium_test
ffx = Firefox(options=options, log_path='/tmp/gecko.log')
File "/usr/lib/python3.8/site-packages/selenium/webdriver/firefox/webdriver.py", line 170, in __init__
RemoteWebDriver.__init__(
File "/usr/lib/python3.8/site-packages/selenium/webdriver/remote/webdriver.py", line 157, in __init__
self.start_session(capabilities, browser_profile)
File "/usr/lib/python3.8/site-packages/selenium/webdriver/remote/webdriver.py", line 252, in start_session
response = self.execute(Command.NEW_SESSION, parameters)
File "/usr/lib/python3.8/site-packages/selenium/webdriver/remote/webdriver.py", line 321, in execute
self.error_handler.check_response(response)
File "/usr/lib/python3.8/site-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.WebDriverException: Message: Process unexpectedly closed with status signal
END RequestId: 52adeab9-8ee7-4a10-a728-82087ec9de30
REPORT RequestId: 52adeab9-8ee7-4a10-a728-82087ec9de30 Duration: 20507.74 ms Billed Duration: 21350 ms Memory Size: 1024 MB Max Memory Used: 131 MB Init Duration: 842.11 ms
Unknown application error occurred
I had it upload the geckodriver logs as well, but there wasn't much useful information in there:
1608506540595 geckodriver INFO Listening on 127.0.0.1:41597
1608506541569 mozrunner::runner INFO Running command: "/usr/bin/firefox" "--marionette" "-headless" "--window-size 1920,1080" "-foreground" "-no-remote" "-profile" "/tmp/rust_mozprofileQCapHy"
*** You are running in headless mode.
How can I even begin to debug this? The fact that the exact same container behaves differently depending upon where it's run seems fishy to me, but I'm not knowledgeable enough about Selenium, Docker, or Lambda to pinpoint exactly where the problem is.
Is my docker run command not accurately recreating the environment in Lambda? If so, then what command would I run to better simulate the Lambda environment? I'm not really sure where else to go from here, seeing as I can't actually reproduce the error locally to test with.
If anyone wants to take a look at the full code and try building it themselves, the repository is here - the lambda code is in lambda_function.py.
As for prior research, this question a) is about ChromeDriver and b) has had no answers in over a year. The link from that one only has information about how to run a container in Lambda, which I'm already doing. This answer is almost my problem, but I know there's not a version mismatch, because the container works on my laptop just fine.
I have exactly the same problem and a possible explanation.
I think what you want is not possible for the time being.
According to the AWS DevOps Blog, Firefox relies on the fallocate system call and /dev/shm.
However, AWS Lambda does not mount /dev/shm, so Firefox crashes when it tries to allocate memory there. Unfortunately, this behavior cannot be disabled for Firefox.
If you can live with Chromium, however, chromedriver has a --disable-dev-shm-usage option that disables the use of /dev/shm and writes shared memory files to /tmp instead.
chromedriver works fine for me on AWS Lambda, if that is an option for you.
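If you do go that route, a minimal sketch of the headless-Chromium setup with that flag could look like the following. The binary paths (/opt/chrome, /opt/chromedriver) are assumptions that depend on how you package the browser and driver into your image, and the API shown is Selenium 3.x to match the snippet in the question:

# Sketch only: headless Chromium with the /dev/shm workaround for Lambda.
# /opt/chrome and /opt/chromedriver are assumed paths, not a Lambda convention.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.binary_location = '/opt/chrome'          # assumed location of the Chromium binary
options.add_argument('--headless')
options.add_argument('--no-sandbox')             # Chrome's sandbox typically can't run in Lambda's restricted environment
options.add_argument('--disable-dev-shm-usage')  # write shared memory files to /tmp instead of /dev/shm
options.add_argument('--window-size=1920,1080')

driver = webdriver.Chrome(executable_path='/opt/chromedriver', options=options)
driver.get('https://google.com')
print(driver.current_url)
driver.quit()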
According to the AWS DevOps Blog, you can also use AWS Fargate to run Firefox/geckodriver.
There is an entry in the AWS forum from 2015 requesting that /dev/shm be mounted in Lambdas, but nothing has happened since then.
I'm working on a project that I found online (YOLO object detection with OpenCV, one of the PyImageSearch projects). I downloaded the whole code and saved it in the Downloads folder, as recommended, and ran the command-line script:
python /home/ubuntu/Downloads/yolo-object-detection/yolo_video.py \
> --input /home/ubuntu/Downloads/yolo-object-detection/videos/WS-1sec.mp4 \
> --output /home/ubuntu/Downloads/yolo-object-detection/output/WS-1sec.avi \
> --yolo /home/ubuntu/Downloads/yolo-object-detection/yolo-coco
but the output was:
[INFO] loading YOLO from disk...
OpenCV(3.4.1-dev) Error: Parsing error (Unknown layer type: shortcut) in ReadDarknetFromCfgFile, file /home/ubuntu/src/opencv/modules/dnn/src/darknet/darknet_io.cpp, line 503
Traceback (most recent call last):
File "/home/ubuntu/Downloads/yolo-object-detection/yolo_video.py", line 42, in <module>
net = cv2.dnn.readNetFromDarknet(configPath, weightsPath)
cv2.error: OpenCV(3.4.1-dev) /home/ubuntu/src/opencv/modules/dnn/src/darknet/darknet_io.cpp:503: error: (-212) Unknown layer type: shortcut in function ReadDarknetFromCfgFile
I'm running the exact same version of OpenCV 3.4.1 on another machine and it worked there! This time I'm working on the Jetson TX2 and it didn't run!
The link to the original project is here.
Any idea why this error occurs?
I think you might have the wrong OpenCV version. Check this answer:
OpenCV unknown layer type running darknet detect
"Support for running YOLOv3 has been added to OpenCV master branch (3.4.3)."
I have a small working example of the TensorFlow Object Detection API running locally. Everything looks great. My goal is to use their scripts to run in Google Cloud Machine Learning Engine, which I've used extensively in the past. I am following these docs.
Declare some relevant variables
declare PROJECT=$(gcloud config list project --format "value(core.project)")
declare BUCKET="gs://${PROJECT}-ml"
declare MODEL_NAME="DeepMeerkatDetection"
declare FOLDER="${BUCKET}/${MODEL_NAME}"
declare JOB_ID="${MODEL_NAME}_$(date +%Y%m%d_%H%M%S)"
declare TRAIN_DIR="${FOLDER}/${JOB_ID}"
declare EVAL_DIR="${BUCKET}/${MODEL_NAME}/${JOB_ID}_eval"
declare PIPELINE_CONFIG_PATH="${FOLDER}/faster_rcnn_inception_resnet_v2_atrous_coco_cloud.config"
declare PIPELINE_YAML="/Users/Ben/Documents/DeepMeerkat/training/Detection/cloud.yml"
My yaml looks like
trainingInput:
  runtimeVersion: "1.0"
  scaleTier: CUSTOM
  masterType: standard_gpu
  workerCount: 5
  workerType: standard_gpu
  parameterServerCount: 3
  parameterServerType: standard
The relevant paths are set in the config, e.g.
fine_tune_checkpoint: "gs://api-project-773889352370-ml/DeepMeerkatDetection/checkpoint/faster_rcnn_inception_resnet_v2_atrous_coco_11_06_2017/model.ckpt"
I've packaged object detection and slim using setup.py
Running
gcloud ml-engine jobs submit training "${JOB_ID}_train" \
--job-dir=${TRAIN_DIR} \
--packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz \
--module-name object_detection.train \
--region us-central1 \
--config ${PIPELINE_YAML} \
-- \
--train_dir=${TRAIN_DIR} \
--pipeline_config_path= ${PIPELINE_CONFIG_PATH}
yields a TensorFlow (import?) error. It's a bit cryptic:
insertId: "1inuq6gg27fxnkc"
logName: "projects/api-project-773889352370/logs/ml.googleapis.com%2FDeepMeerkatDetection_20171017_141321_train"
receiveTimestamp: "2017-10-17T21:38:34.435293164Z"
resource: {…}
severity: "ERROR"
textPayload: "The replica ps 0 exited with a non-zero status of 1. Termination reason: Error.
Traceback (most recent call last):
File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
"__main__", fname, loader, pkg_name)
File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
exec code in run_globals
File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 198, in <module>
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 44, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 145, in main
model_config, train_config, input_config = get_configs_from_multiple_files()
File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 127, in get_configs_from_multiple_files
text_format.Merge(f.read(), train_config)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/lib/io/file_io.py", line 112, in read
return pywrap_tensorflow.ReadFromStream(self._read_buf, length, status)
File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
self.gen.next()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
pywrap_tensorflow.TF_GetCode(status))
FailedPreconditionError: .
I've seen this error in other questions related to prediction on Machine Learning Engine, which suggests the error is probably(?) not directly related to the object detection code, but it feels like something is not being packaged correctly or a dependency is missing. I've updated my gcloud to the latest version.
Bens-MacBook-Pro:research ben$ gcloud --version
Google Cloud SDK 175.0.0
bq 2.0.27
core 2017.10.09
gcloud
gsutil 4.27
It's hard to see how it's related to this problem:
FailedPreconditionError when running TF Object Detection API with own model
Why would the code need to be initialized differently in the cloud?
Update #1.
The curious thing is that eval.py works fine, so it can't be the path to the config file, or anything else that train.py and eval.py share. eval.py patiently sits and waits for model checkpoints to be created.
Another idea might be that the checkpoint was somehow corrupted during upload. We can test this by bypassing it and training from scratch.
In .config
from_detection_checkpoint: false
that yields the same precondition error, so it can't be the model.
The root cause is a slight typo:
--pipeline_config_path= ${PIPELINE_CONFIG_PATH}
has an extra space. Try this:
gcloud ml-engine jobs submit training "${JOB_ID}_train" \
--job-dir=${TRAIN_DIR} \
--packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz \
--module-name object_detection.train \
--region us-central1 \
--config ${PIPELINE_YAML} \
-- \
--train_dir=${TRAIN_DIR} \
--pipeline_config_path=${PIPELINE_CONFIG_PATH}
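If it helps to see why that stray space matters: the shell splits --pipeline_config_path= ${PIPELINE_CONFIG_PATH} into two separate arguments, so the flag reaches the job with an empty value and the real path is left dangling. A small hypothetical sketch using argparse (train.py itself uses TensorFlow's flag parsing, but the splitting effect is the same, and the gs:// path below is made up) illustrates it:

# Hypothetical illustration of the argv produced by the extra space.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--pipeline_config_path', default='')
parser.add_argument('rest', nargs='*')

# The job effectively receives the flag with an empty value, followed by the
# real path as a stray positional argument.
args = parser.parse_args(['--pipeline_config_path=', 'gs://bucket/pipeline.config'])
print(repr(args.pipeline_config_path))  # '' -> reading an empty config path fails
print(args.rest)                        # ['gs://bucket/pipeline.config']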
I tested my pipeline on the DirectRunner and everything works fine.
Now I want to run it on the DataflowRunner. It doesn't work. It fails before it even enters my pipeline code, and I'm totally overwhelmed by the logs in Stackdriver: I just don't understand what they mean and really have no clue about what's wrong.
The execution graph looks like it loaded fine. The worker pool starts, and one worker tries to run through the setup process, but it never seems to succeed. Here are some logs that I guess might provide useful information for debugging:
AttributeError:'module' object has no attribute 'NativeSource'
/usr/bin/python failed with exit status 1
Back-off 20s restarting failed container=python pod=dataflow-fiona-backlog-clean-test2-06140817-1629-harness-3nxh_default(50a3915d6501a3ec74d6d385f70c8353)
checking backoff for container "python" in pod "dataflow-fiona-backlog-clean-test2-06140817-1629-harness-3nxh"
INFO SSH key is not a complete entry: .....
How should I tackle this problem?
Edit:
Here is my setup.py, if it helps (copied from [here]; I only modified REQUIRED_PACKAGES and the setuptools.setup section):
from distutils.command.build import build as _build
import subprocess

import setuptools


# This class handles the pip install mechanism.
class build(_build):  # pylint: disable=invalid-name
    """A build command class that will be invoked during package install.

    The package built using the current setup.py will be staged and later
    installed in the worker using `pip install package'. This class will be
    instantiated during install for this specific scenario and will trigger
    running the custom commands specified.
    """
    sub_commands = _build.sub_commands + [('CustomCommands', None)]


# Some custom command to run during setup. The command is not essential for this
# workflow. It is used here as an example. Each command will spawn a child
# process. Typically, these commands will include steps to install non-Python
# packages. For instance, to install a C++-based library libjpeg62 the following
# two commands will have to be added:
#
#     ['apt-get', 'update'],
#     ['apt-get', '--assume-yes', 'install', 'libjpeg62'],
#
# First, note that there is no need to use the sudo command because the setup
# script runs with appropriate access.
# Second, if apt-get tool is used then the first command needs to be 'apt-get
# update' so the tool refreshes itself and initializes links to download
# repositories. Without this initial step the other apt-get install commands
# will fail with package not found errors. Note also --assume-yes option which
# shortcuts the interactive confirmation.
#
# The output of custom commands (including failures) will be logged in the
# worker-startup log.
CUSTOM_COMMANDS = [
    ['echo', 'Custom command worked!']]


class CustomCommands(setuptools.Command):
    """A setuptools Command class able to run arbitrary commands."""

    def initialize_options(self):
        pass

    def finalize_options(self):
        pass

    def RunCustomCommand(self, command_list):
        print 'Running command: %s' % command_list
        p = subprocess.Popen(
            command_list,
            stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
        # Can use communicate(input='y\n'.encode()) if the command run requires
        # some confirmation.
        stdout_data, _ = p.communicate()
        print 'Command output: %s' % stdout_data
        if p.returncode != 0:
            raise RuntimeError(
                'Command %s failed: exit code: %s' % (command_list, p.returncode))

    def run(self):
        for command in CUSTOM_COMMANDS:
            self.RunCustomCommand(command)


# Configure the required packages and scripts to install.
# Note that the Python Dataflow containers come with numpy already installed
# so this dependency will not trigger anything to be installed unless a version
# restriction is specified.
REQUIRED_PACKAGES = ['apache-beam==2.0.0',
                     'datalab==1.0.1',
                     'google-cloud==0.19.0',
                     'google-cloud-bigquery==0.22.1',
                     'google-cloud-core==0.22.1',
                     'google-cloud-dataflow==0.6.0',
                     'pandas==0.20.2']

setuptools.setup(
    name='geotab-backlog-dataflow',
    version='0.0.1',
    install_requires=REQUIRED_PACKAGES,
    packages=setuptools.find_packages(),
)
The worker-startup log ended with the following exception:
I /usr/bin/python failed with exit status 1
I /usr/bin/python failed with exit status 1
I AttributeError: 'module' object has no attribute 'NativeSource'
I class ConcatSource(iobase.NativeSource):
I File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/concat_reader.py", line 26, in <module>
I from dataflow_worker import concat_reader
I File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/maptask.py", line 31, in <module>
I from dataflow_worker import maptask
I File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/executor.py", line 26, in <module>
I from dataflow_worker import executor
I File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 63, in <module>
I from dataflow_worker import batchworker
I File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/start.py", line 26, in <module>
I exec code in run_globals
I File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
I "__main__", fname, loader, pkg_name)
I File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
I AttributeError: 'module' object has no attribute 'NativeSource'
I class ConcatSource(iobase.NativeSource):
You seem to be using incompatible requirements in your REQUIRED_PACKAGES directive, i.e. you specify "apache-beam==2.0.0" and "google-cloud-dataflow==0.6.0", which conflict with each other. Can you try removing / uninstalling the "apache-beam" package and install / include the "google-cloud-dataflow==2.0.0" package instead?
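For reference, a sketch of the adjusted REQUIRED_PACKAGES under that suggestion, with the other pins carried over unchanged (verify the exact versions against your environment):

# Sketch: drop the apache-beam pin and use google-cloud-dataflow==2.0.0 instead.
REQUIRED_PACKAGES = [
    'google-cloud-dataflow==2.0.0',
    'datalab==1.0.1',
    'google-cloud==0.19.0',
    'google-cloud-bigquery==0.22.1',
    'google-cloud-core==0.22.1',
    'pandas==0.20.2',
]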
I got some error messages after running this command:
curl -sk https://cloud.github.com/downloads/square/PonyDebugger/bootstrap-ponyd.py | \
python - --ponyd-symlink=/usr/local/bin/ponyd ~/Library/PonyDebugger
This is what my terminal tells me:
Overwriting /Users/hokila/Library/PonyDebugger/lib/python2.7/orig-prefix.txt with new content
New python executable in /Users/hokila/Library/PonyDebugger/bin/python
Traceback (most recent call last):
File "<stdin>", line 2462, in <module>
File "<stdin>", line 944, in main
File "<stdin>", line 1045, in create_environment
File "<stdin>", line 1361, in install_python
File "<stdin>", line 435, in copyfile
File "<stdin>", line 412, in copyfileordir
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/shutil.py", line 128, in copy2
copyfile(src, dst)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/shutil.py", line 83, in copyfile
with open(dst, 'wb') as fdst:
IOError: [Errno 2] No such file or directory: '/Users/hokila/Library/PonyDebugger/.Python'
It seemed like a Python version error, so I updated my Python to 2.7.3.
I still got the same error. How can I solve this?
This problem also occurred to me and I was almost desperate. I reinstalled Python 2.7.2 and it worked fine.
Before that, I had tried to solve this problem manually using the instructions in README_ponyd.rst, and I was also partially successful:
Development installation
````````````````````````
If you already have the PonyDebugger git repo checked out you can set
up a virtualenv manually and have your ponyd installation point to
your existing checkout. For demonstration we assume $VENV is set
to your intended install path and $PONYDEBUGGER_PATH is set to
your PonyDebugger git checkout::
# if you don't already have virtualenv installed
sudo easy_install virtualenv

virtualenv "$VENV"
source "$VENV/bin/activate"
pip install -e "$PONYDEBUGGER_PATH"

# to ensure your shell knows ponyd exists
hash -r
To run this ponyd you can either activate your environment with source
"$VENV/bin/activate", which adds ponyd to your path, or just call it
directly via $VENV/bin/ponyd without activating first.
You should install virtualenv first. You can try a combination of both steps. Hope it helps...