Composer - Dataflow hook crash - google-cloud-dataflow

I'm creating an hourly task in Airflow that schedules a Dataflow job; however, the hook provided by the Airflow library crashes most of the time, even though the Dataflow job itself actually succeeds.
[2018-05-25 07:05:03,523] {base_task_runner.py:98} INFO - Subtask: [2018-05-25 07:05:03,439] {gcp_dataflow_hook.py:109} WARNING - super(GcsIO, cls).__new__(cls, storage_client))
[2018-05-25 07:05:03,721] {base_task_runner.py:98} INFO - Subtask: Traceback (most recent call last):
[2018-05-25 07:05:03,725] {base_task_runner.py:98} INFO - Subtask: File "/usr/local/bin/airflow", line 27, in <module>
[2018-05-25 07:05:03,726] {base_task_runner.py:98} INFO - Subtask: args.func(args)
[2018-05-25 07:05:03,729] {base_task_runner.py:98} INFO - Subtask: File "/usr/local/lib/python2.7/site-packages/airflow/bin/cli.py", line 392, in run
[2018-05-25 07:05:03,729] {base_task_runner.py:98} INFO - Subtask: pool=args.pool,
[2018-05-25 07:05:03,731] {base_task_runner.py:98} INFO - Subtask: File "/usr/local/lib/python2.7/site-packages/airflow/utils/db.py", line 50, in wrapper
[2018-05-25 07:05:03,732] {base_task_runner.py:98} INFO - Subtask: result = func(*args, **kwargs)
[2018-05-25 07:05:03,734] {base_task_runner.py:98} INFO - Subtask: File "/usr/local/lib/python2.7/site-packages/airflow/models.py", line 1492, in _run_raw_task
[2018-05-25 07:05:03,738] {base_task_runner.py:98} INFO - Subtask: result = task_copy.execute(context=context)
[2018-05-25 07:05:03,740] {base_task_runner.py:98} INFO - Subtask: File "/usr/local/lib/python2.7/site-packages/airflow/contrib/operators/dataflow_operator.py", line 313, in execute
[2018-05-25 07:05:03,746] {base_task_runner.py:98} INFO - Subtask: self.py_file, self.py_options)
[2018-05-25 07:05:03,748] {base_task_runner.py:98} INFO - Subtask: File "/usr/local/lib/python2.7/site-packages/airflow/contrib/hooks/gcp_dataflow_hook.py", line 188, in start_python_dataflow
[2018-05-25 07:05:03,751] {base_task_runner.py:98} INFO - Subtask: label_formatter)
[2018-05-25 07:05:03,753] {base_task_runner.py:98} INFO - Subtask: File "/usr/local/lib/python2.7/site-packages/airflow/contrib/hooks/gcp_dataflow_hook.py", line 158, in _start_dataflow
[2018-05-25 07:05:03,756] {base_task_runner.py:98} INFO - Subtask: _Dataflow(cmd).wait_for_done()
[2018-05-25 07:05:03,757] {base_task_runner.py:98} INFO - Subtask: File "/usr/local/lib/python2.7/site-packages/airflow/contrib/hooks/gcp_dataflow_hook.py", line 129, in wait_for_done
[2018-05-25 07:05:03,759] {base_task_runner.py:98} INFO - Subtask: line = self._line(fd)
[2018-05-25 07:05:03,761] {base_task_runner.py:98} INFO - Subtask: File "/usr/local/lib/python2.7/site-packages/airflow/contrib/hooks/gcp_dataflow_hook.py", line 110, in _line
[2018-05-25 07:05:03,763] {base_task_runner.py:98} INFO - Subtask: line = lines[-1][:-1]
[2018-05-25 07:05:03,766] {base_task_runner.py:98} INFO - Subtask: IndexError: list index out of range
I looked that file up in the Airflow GitHub repo and the line numbers in the error don't match, which makes me think the Airflow version shipped with Cloud Composer is outdated. Is there any way to update it?

This should be resolved in Airflow 1.10 or 2.0.
Have a look at this PR:
https://github.com/apache/incubator-airflow/pull/3165
It has already been merged to master. In the meantime, you can take the code from this PR and create your own plugin.
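For reference, here is a minimal sketch of how such a workaround could be wired up as a plugin. Everything under patched_dataflow is a placeholder for a local copy of the fixed hook/operator code taken from that PR, not anything that ships with Airflow or Composer:
# plugins/patched_dataflow_plugin.py  (hypothetical layout inside your plugins folder)
from airflow.plugins_manager import AirflowPlugin

# Placeholder imports: copy the fixed gcp_dataflow_hook.py / dataflow_operator.py
# from the PR into your plugins folder and import the patched classes from there.
from patched_dataflow.gcp_dataflow_hook import DataFlowHook
from patched_dataflow.dataflow_operator import DataFlowPythonOperator


class PatchedDataflowPlugin(AirflowPlugin):
    # Registering the classes here makes them importable from your DAGs
    # without touching the Airflow installation bundled with Composer.
    name = "patched_dataflow_plugin"
    hooks = [DataFlowHook]
    operators = [DataFlowPythonOperator]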

Related

Appium starts application but test gives error “FAIL : No application is open”

The problem:
I use Appium + Robot Framework to test my app. When I use the keyword Open Application, it always fails with No application is open, even though the app is actually already open. I start the Appium server with: appium -p 4723 --session-override --no-reset
Environment:
info AppiumDoctor ### Diagnostic for necessary dependencies starting ###
info AppiumDoctor ✔ The Node.js binary was found at: C:\Program Files\nodejs\node.EXE
info AppiumDoctor ✔ Node version is 16.15.1
info AppiumDoctor ✔ ANDROID_HOME is set to: D:\Android_Sdk
info AppiumDoctor ✔ JAVA_HOME is set to: C:\Program Files\Java\jdk1.8.0_60
info AppiumDoctor Checking adb, android, emulator
info AppiumDoctor 'adb' is in D:\Android_Sdk\platform-tools\adb.exe
info AppiumDoctor 'android' is in D:\Android_Sdk\tools\android.bat
info AppiumDoctor 'emulator' is in D:\Android_Sdk\emulator\emulator.exe
info AppiumDoctor ✔ adb, android, emulator exist: D:\Android_Sdk
info AppiumDoctor ✔ 'bin' subfolder exists under 'C:\Program Files\Java\jdk1.8.0_60'
info AppiumDoctor ### Diagnostic for necessary dependencies completed, no fix needed. ###
Log:
I ran the test in debug mode in Robot Framework and got the following output:
20220802 18:05:05.399 : DEBUG : Starting new HTTP connection (1): 127.0.0.1:4723
20220802 18:05:14.770 : DEBUG : http://127.0.0.1:4723 "POST /wd/hub/session HTTP/1.1" 200 884
20220802 18:05:14.771 : DEBUG : Remote response: status=200 | data={"value":{"capabilities":{"platform":"LINUX","webStorageEnabled":false,"takesScreenshot":true,"javascriptEnabled":true,"databaseEnabled":false,"networkConnectionEnabled":true,"locationContextEnabled":false,"warnings":{},"desired":{"platformName":"Android","appPackage":"com.cmcc.myhouse.demo","appActivity":"com.cmcc.myhouse.MainActivity","appWaitDuration":60000,"noSign":true},"platformName":"Android","appPackage":"com.cmcc.myhouse.demo","appActivity":"com.cmcc.myhouse.MainActivity","appWaitDuration":60000,"noSign":true,"deviceName":"ed192f0","deviceUDID":"ed192f0","deviceApiLevel":29,"platformVersion":"10","deviceScreenSize":"1080x2160","deviceScreenDensity":380,"deviceModel":"ONEPLUS A5010","deviceManufacturer":"OnePlus","pixelRatio":2.375,"statBarHeight":57,"viewportRect":{"left":0,"top":57,"width":1080,"height":2103}},"sessionId":"312366fe-1008-47f4-9063-1cf0e4a27e0c"}} | headers=HTTPHeaderDict({'X-Powered-By': 'Express', 'Vary': 'X-HTTP-Method-Override', 'Content-Type': 'application/json; charset=utf-8', 'Content-Length': '884', 'ETag': 'W/"374-cX9IxtSKVtVV/oMPHrqcO0PP2Yg"', 'Date': 'Tue, 02 Aug 2022 10:05:14 GMT', 'Connection': 'keep-alive', 'Keep-Alive': 'timeout=600'})
20220802 18:05:14.771 : DEBUG : Finished Request
20220802 18:05:14.774 : FAIL : No application is open
20220802 18:05:14.776 : DEBUG :
Traceback (most recent call last):
File "c:\users\xiangfang\appdata\local\programs\python\python37\lib\site-packages\AppiumLibrary\keywords\keywordgroup.py", line 16, in _run_on_failure_decorator
return method(*args, **kwargs)
File "c:\users\xiangfang\appdata\local\programs\python\python37\lib\site-packages\AppiumLibrary\keywords\_applicationmanagement.py", line 52, in open_application
application = webdriver.Remote(str(remote_url), desired_caps)
File "c:\users\xiangfang\appdata\local\programs\python\python37\lib\site-packages\appium\webdriver\webdriver.py", line 268, in __init__
AppiumConnection(command_executor, keep_alive=keep_alive), desired_capabilities, browser_profile, proxy
File "c:\users\xiangfang\appdata\local\programs\python\python37\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 275, in __init__
self.start_session(capabilities, browser_profile)
File "c:\users\xiangfang\appdata\local\programs\python\python37\lib\site-packages\appium\webdriver\webdriver.py", line 361, in start_session
self.capabilities = response.get('value')
AttributeError: can't set attribute
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "c:\users\xiangfang\appdata\local\programs\python\python37\lib\site-packages\decorator.py", line 232, in fun
return caller(func, *(extras + args), **kw)
File "c:\users\xiangfang\appdata\local\programs\python\python37\lib\site-packages\AppiumLibrary\keywords\keywordgroup.py", line 21, in _run_on_failure_decorator
raise err
File "c:\users\xiangfang\appdata\local\programs\python\python37\lib\site-packages\AppiumLibrary\keywords\keywordgroup.py", line 16, in _run_on_failure_decorator
return method(*args, **kwargs)
File "c:\users\xiangfang\appdata\local\programs\python\python37\lib\site-packages\AppiumLibrary\keywords\_screenshot.py", line 31, in capture_page_screenshot
if hasattr(self._current_application(), 'get_screenshot_as_file'):
File "c:\users\xiangfang\appdata\local\programs\python\python37\lib\site-packages\AppiumLibrary\keywords\_applicationmanagement.py", line 367, in _current_application
raise RuntimeError('No application is open')
RuntimeError: No application is open
20220802 18:05:14.779 : WARN : Keyword 'Capture Page Screenshot' could not be run on failure: No application is open
20220802 18:05:14.780 : FAIL : AttributeError: can't set attribute
20220802 18:05:14.780 : DEBUG :
Traceback (most recent call last):
File "c:\users\xiangfang\appdata\local\programs\python\python37\lib\site-packages\decorator.py", line 232, in fun
return caller(func, *(extras + args), **kw)
File "c:\users\xiangfang\appdata\local\programs\python\python37\lib\site-packages\AppiumLibrary\keywords\keywordgroup.py", line 21, in _run_on_failure_decorator
raise err
File "c:\users\xiangfang\appdata\local\programs\python\python37\lib\site-packages\AppiumLibrary\keywords\keywordgroup.py", line 16, in _run_on_failure_decorator
return method(*args, **kwargs)
File "c:\users\xiangfang\appdata\local\programs\python\python37\lib\site-packages\AppiumLibrary\keywords\_applicationmanagement.py", line 52, in open_application
application = webdriver.Remote(str(remote_url), desired_caps)
File "c:\users\xiangfang\appdata\local\programs\python\python37\lib\site-packages\appium\webdriver\webdriver.py", line 268, in __init__
AppiumConnection(command_executor, keep_alive=keep_alive), desired_capabilities, browser_profile, proxy
File "c:\users\xiangfang\appdata\local\programs\python\python37\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 275, in __init__
self.start_session(capabilities, browser_profile)
File "c:\users\xiangfang\appdata\local\programs\python\python37\lib\site-packages\appium\webdriver\webdriver.py", line 361, in start_session
self.capabilities = response.get('value')
AttributeError: can't set attribute
Ending test: XiriTest.XiriBusinessTest.26MainBusinessTest.2.6CommonCMD
Appium usually opens the app automatically and then makes the necessary connections. Tests can fail if the app is already open.
Are you sure the appPackage and appActivity are correct? It would be worth double-checking these.
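If you want to verify them against the device, a small helper along these lines can print the currently focused package/activity while the app is open. It assumes adb is on your PATH and a single device is connected; the helper name print_current_focus is just for illustration:
import subprocess


def print_current_focus():
    # Dump the window manager state and keep only the lines naming the
    # currently focused window / app, e.g. ".../com.example.app/.MainActivity".
    out = subprocess.run(
        ["adb", "shell", "dumpsys", "window"],
        capture_output=True, text=True, check=True,
    ).stdout
    for line in out.splitlines():
        if "mCurrentFocus" in line or "mFocusedApp" in line:
            print(line.strip())


print_current_focus()
The package and activity printed here should match the appPackage and appActivity capabilities you pass to Open Application.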

Ruby on Rails 6 AWS Elastic Beanstalk deploy Command reload_initctl_for_sidekiq failed

I'm trying to update the Ruby and Rails versions on AWS. After creating a new environment on Elastic Beanstalk, I tried to deploy the updated version of the Rails project, but it fails with a Command reload_initctl_for_sidekiq failed error. I can't figure out what is causing it.
Here is the cfn-init.log file:
2021-02-02 15:57:43,239 [INFO] -----------------------Starting build-----------------------
2021-02-02 15:57:43,248 [INFO] Running configSets: _OnInstanceBoot
2021-02-02 15:57:43,252 [INFO] Running configSet _OnInstanceBoot
2021-02-02 15:57:43,257 [INFO] Running config AWSEBBaseConfig
2021-02-02 15:57:43,499 [INFO] Command clearbackupfiles succeeded
2021-02-02 15:57:43,503 [INFO] Running config AWSEBCfnHupEndpointOverride
2021-02-02 15:57:43,510 [INFO] Command clearbackupfiles succeeded
2021-02-02 15:57:43,511 [INFO] ConfigSets completed
2021-02-02 15:57:43,511 [INFO] -----------------------Build complete-----------------------
2021-02-02 15:57:46,805 [INFO] -----------------------Starting build-----------------------
2021-02-02 15:57:46,840 [INFO] Running configSets: Infra-EmbeddedPreBuild
2021-02-02 15:57:46,848 [INFO] Running configSet Infra-EmbeddedPreBuild
2021-02-02 15:57:46,853 [INFO] Running config prebuild_0_myapp_com
2021-02-02 15:57:46,855 [INFO] ConfigSets completed
2021-02-02 15:57:46,855 [INFO] -----------------------Build complete-----------------------
2021-02-02 15:57:57,735 [INFO] -----------------------Starting build-----------------------
2021-02-02 15:57:57,743 [INFO] Running configSets: Infra-EmbeddedPostBuild
2021-02-02 15:57:57,747 [INFO] Running configSet Infra-EmbeddedPostBuild
2021-02-02 15:57:57,748 [INFO] ConfigSets completed
2021-02-02 15:57:57,748 [INFO] -----------------------Build complete-----------------------
2021-02-02 15:59:12,848 [INFO] -----------------------Starting build-----------------------
2021-02-02 15:59:12,857 [INFO] Running configSets: Infra-EmbeddedPreBuild
2021-02-02 15:59:12,861 [INFO] Running configSet Infra-EmbeddedPreBuild
2021-02-02 15:59:12,866 [INFO] Running config prebuild_0_myapp_com
2021-02-02 15:59:12,873 [INFO] Running config prebuild_1_myapp_com
2021-02-02 15:59:12,880 [INFO] Running config prebuild_2_myapp_com
2021-02-02 15:59:12,884 [ERROR] Command reload_initctl_for_sidekiq (initctl reload-configuration) failed
2021-02-02 15:59:12,884 [ERROR] Error encountered during build of prebuild_2_myapp_com: Command reload_initctl_for_sidekiq failed
Traceback (most recent call last):
File "/usr/lib/python2.7/site-packages/cfnbootstrap/construction.py", line 542, in run_config
CloudFormationCarpenter(config, self._auth_config).build(worklog)
File "/usr/lib/python2.7/site-packages/cfnbootstrap/construction.py", line 260, in build
changes['commands'] = CommandTool().apply(self._config.commands)
File "/usr/lib/python2.7/site-packages/cfnbootstrap/command_tool.py", line 117, in apply
raise ToolError(u"Command %s failed" % name)
ToolError: Command reload_initctl_for_sidekiq failed
2021-02-02 15:59:12,886 [ERROR] -----------------------BUILD FAILED!------------------------
2021-02-02 15:59:12,886 [ERROR] Unhandled exception during build: Command reload_initctl_for_sidekiq failed
Traceback (most recent call last):
File "/opt/aws/bin/cfn-init", line 171, in <module>
worklog.build(metadata, configSets)
File "/usr/lib/python2.7/site-packages/cfnbootstrap/construction.py", line 129, in build
Contractor(metadata).build(configSets, self)
File "/usr/lib/python2.7/site-packages/cfnbootstrap/construction.py", line 530, in build
self.run_config(config, worklog)
File "/usr/lib/python2.7/site-packages/cfnbootstrap/construction.py", line 542, in run_config
CloudFormationCarpenter(config, self._auth_config).build(worklog)
File "/usr/lib/python2.7/site-packages/cfnbootstrap/construction.py", line 260, in build
changes['commands'] = CommandTool().apply(self._config.commands)
File "/usr/lib/python2.7/site-packages/cfnbootstrap/command_tool.py", line 117, in apply
raise ToolError(u"Command %s failed" % name)
ToolError: Command reload_initctl_for_sidekiq failed
2021-02-03 11:21:36,581 [INFO] -----------------------Starting build-----------------------
2021-02-03 11:21:36,590 [INFO] Running configSets: Infra-EmbeddedPreBuild
2021-02-03 11:21:36,594 [INFO] Running configSet Infra-EmbeddedPreBuild
2021-02-03 11:21:36,600 [INFO] Running config prebuild_0_myapp_com
2021-02-03 11:21:36,606 [INFO] Running config prebuild_1_myapp_com
2021-02-03 11:21:36,613 [INFO] Running config prebuild_2_myapp_com
2021-02-03 11:21:36,614 [INFO] Symbolic link /etc/init/sidekiq.conf already exists
2021-02-03 11:21:36,617 [ERROR] Command reload_initctl_for_sidekiq (initctl reload-configuration) failed
2021-02-03 11:21:36,618 [ERROR] Error encountered during build of prebuild_2_myapp_com: Command reload_initctl_for_sidekiq failed
Traceback (most recent call last):
File "/usr/lib/python2.7/site-packages/cfnbootstrap/construction.py", line 542, in run_config
CloudFormationCarpenter(config, self._auth_config).build(worklog)
File "/usr/lib/python2.7/site-packages/cfnbootstrap/construction.py", line 260, in build
changes['commands'] = CommandTool().apply(self._config.commands)
File "/usr/lib/python2.7/site-packages/cfnbootstrap/command_tool.py", line 117, in apply
raise ToolError(u"Command %s failed" % name)
ToolError: Command reload_initctl_for_sidekiq failed
2021-02-03 11:21:36,618 [ERROR] -----------------------BUILD FAILED!------------------------
2021-02-03 11:21:36,618 [ERROR] Unhandled exception during build: Command reload_initctl_for_sidekiq failed
Traceback (most recent call last):
File "/opt/aws/bin/cfn-init", line 171, in <module>
worklog.build(metadata, configSets)
File "/usr/lib/python2.7/site-packages/cfnbootstrap/construction.py", line 129, in build
Contractor(metadata).build(configSets, self)
File "/usr/lib/python2.7/site-packages/cfnbootstrap/construction.py", line 530, in build
self.run_config(config, worklog)
File "/usr/lib/python2.7/site-packages/cfnbootstrap/construction.py", line 542, in run_config
CloudFormationCarpenter(config, self._auth_config).build(worklog)
File "/usr/lib/python2.7/site-packages/cfnbootstrap/construction.py", line 260, in build
changes['commands'] = CommandTool().apply(self._config.commands)
File "/usr/lib/python2.7/site-packages/cfnbootstrap/command_tool.py", line 117, in apply
raise ToolError(u"Command %s failed" % name)
ToolError: Command reload_initctl_for_sidekiq failed
2021-02-03 11:28:00,490 [INFO] -----------------------Starting build-----------------------
2021-02-03 11:28:00,499 [INFO] Running configSets: Infra-EmbeddedPreBuild
2021-02-03 11:28:00,503 [INFO] Running configSet Infra-EmbeddedPreBuild
2021-02-03 11:28:00,508 [INFO] Running config prebuild_0_myapp_com
2021-02-03 11:28:00,515 [INFO] Running config prebuild_1_myapp_com
2021-02-03 11:28:00,522 [INFO] Running config prebuild_2_myapp_com
2021-02-03 11:28:00,523 [INFO] Symbolic link /etc/init/sidekiq.conf already exists
2021-02-03 11:28:00,526 [ERROR] Command reload_initctl_for_sidekiq (initctl reload-configuration) failed
2021-02-03 11:28:00,527 [ERROR] Error encountered during build of prebuild_2_myapp_com: Command reload_initctl_for_sidekiq failed
Traceback (most recent call last):
File "/usr/lib/python2.7/site-packages/cfnbootstrap/construction.py", line 542, in run_config
CloudFormationCarpenter(config, self._auth_config).build(worklog)
File "/usr/lib/python2.7/site-packages/cfnbootstrap/construction.py", line 260, in build
changes['commands'] = CommandTool().apply(self._config.commands)
File "/usr/lib/python2.7/site-packages/cfnbootstrap/command_tool.py", line 117, in apply
raise ToolError(u"Command %s failed" % name)
ToolError: Command reload_initctl_for_sidekiq failed
2021-02-03 11:28:00,527 [ERROR] -----------------------BUILD FAILED!------------------------
2021-02-03 11:28:00,527 [ERROR] Unhandled exception during build: Command reload_initctl_for_sidekiq failed
Traceback (most recent call last):
File "/opt/aws/bin/cfn-init", line 171, in <module>
worklog.build(metadata, configSets)
File "/usr/lib/python2.7/site-packages/cfnbootstrap/construction.py", line 129, in build
Contractor(metadata).build(configSets, self)
File "/usr/lib/python2.7/site-packages/cfnbootstrap/construction.py", line 530, in build
self.run_config(config, worklog)
File "/usr/lib/python2.7/site-packages/cfnbootstrap/construction.py", line 542, in run_config
CloudFormationCarpenter(config, self._auth_config).build(worklog)
File "/usr/lib/python2.7/site-packages/cfnbootstrap/construction.py", line 260, in build
changes['commands'] = CommandTool().apply(self._config.commands)
File "/usr/lib/python2.7/site-packages/cfnbootstrap/command_tool.py", line 117, in apply
raise ToolError(u"Command %s failed" % name)
ToolError: Command reload_initctl_for_sidekiq failed
2021-02-03 12:59:49,252 [INFO] -----------------------Starting build-----------------------
2021-02-03 12:59:49,262 [INFO] Running configSets: Infra-EmbeddedPreBuild
2021-02-03 12:59:49,267 [INFO] Running configSet Infra-EmbeddedPreBuild
2021-02-03 12:59:49,272 [INFO] Running config prebuild_0_myapp_com
2021-02-03 12:59:49,279 [INFO] Running config prebuild_1_myapp_com
2021-02-03 12:59:49,286 [INFO] Running config prebuild_2_myapp_com
2021-02-03 12:59:49,287 [INFO] Symbolic link /etc/init/sidekiq.conf already exists
2021-02-03 12:59:49,290 [ERROR] Command reload_initctl_for_sidekiq (initctl reload-configuration) failed
2021-02-03 12:59:49,290 [ERROR] Error encountered during build of prebuild_2_myapp_com: Command reload_initctl_for_sidekiq failed
Traceback (most recent call last):
File "/usr/lib/python2.7/site-packages/cfnbootstrap/construction.py", line 542, in run_config
CloudFormationCarpenter(config, self._auth_config).build(worklog)
File "/usr/lib/python2.7/site-packages/cfnbootstrap/construction.py", line 260, in build
changes['commands'] = CommandTool().apply(self._config.commands)
File "/usr/lib/python2.7/site-packages/cfnbootstrap/command_tool.py", line 117, in apply
raise ToolError(u"Command %s failed" % name)
ToolError: Command reload_initctl_for_sidekiq failed
2021-02-03 12:59:49,291 [ERROR] -----------------------BUILD FAILED!------------------------
2021-02-03 12:59:49,291 [ERROR] Unhandled exception during build: Command reload_initctl_for_sidekiq failed
Traceback (most recent call last):
File "/opt/aws/bin/cfn-init", line 171, in <module>
worklog.build(metadata, configSets)
File "/usr/lib/python2.7/site-packages/cfnbootstrap/construction.py", line 129, in build
Contractor(metadata).build(configSets, self)
File "/usr/lib/python2.7/site-packages/cfnbootstrap/construction.py", line 530, in build
self.run_config(config, worklog)
File "/usr/lib/python2.7/site-packages/cfnbootstrap/construction.py", line 542, in run_config
CloudFormationCarpenter(config, self._auth_config).build(worklog)
File "/usr/lib/python2.7/site-packages/cfnbootstrap/construction.py", line 260, in build
changes['commands'] = CommandTool().apply(self._config.commands)
File "/usr/lib/python2.7/site-packages/cfnbootstrap/command_tool.py", line 117, in apply
raise ToolError(u"Command %s failed" % name)
ToolError: Command reload_initctl_for_sidekiq failed

dask.distributed SLURM cluster Nanny Timeout

I am trying to use dask.distributed's SLURMCluster to submit batch jobs to a SLURM job scheduler on a supercomputing cluster. The jobs all submit as expected, but after one minute of running they fail with asyncio.exceptions.TimeoutError: Nanny failed to start in 60 seconds. How do I get the nanny to connect?
Full Trace:
distributed.nanny - INFO - Start Nanny at: 'tcp://206.76.203.125:38324'
distributed.dashboard.proxy - INFO - To route to workers diagnostics web server please install jupyter-server-proxy: python -m pip install jupyter-server-proxy
distributed.worker - INFO - Start worker at: tcp://206.76.203.125:37609
distributed.worker - INFO - Listening to: tcp://206.76.203.125:37609
distributed.worker - INFO - dashboard at: 206.76.203.125:35505
distributed.worker - INFO - Waiting to connect to: tcp://129.114.63.43:35489
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Threads: 8
distributed.worker - INFO - Memory: 2.00 GB
distributed.worker - INFO - Local Directory: /home1/06729/tg860286/tests/dask-rsmas-presentation/dask-worker-space/worker-pu937jui
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Waiting to connect to: tcp://129.114.63.43:35489
distributed.worker - INFO - Waiting to connect to: tcp://129.114.63.43:35489
distributed.worker - INFO - Waiting to connect to: tcp://129.114.63.43:35489
distributed.worker - INFO - Waiting to connect to: tcp://129.114.63.43:35489
distributed.nanny - INFO - Closing Nanny at 'tcp://206.76.203.125:38324'
distributed.worker - INFO - Stopping worker at tcp://206.76.203.125:37609
distributed.worker - INFO - Closed worker has not yet started: None
distributed.dask_worker - INFO - End worker
Traceback (most recent call last):
File "/home1/06729/tg860286/miniconda3/envs/daskbase/lib/python3.8/site-packages/distributed/node.py", line 173, in wait_for
await asyncio.wait_for(future, timeout=timeout)
File "/home1/06729/tg860286/miniconda3/envs/daskbase/lib/python3.8/asyncio/tasks.py", line 490, in wait_for
raise exceptions.TimeoutError()
asyncio.exceptions.TimeoutError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home1/06729/tg860286/miniconda3/envs/daskbase/lib/python3.8/runpy.py", line 193, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home1/06729/tg860286/miniconda3/envs/daskbase/lib/python3.8/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/home1/06729/tg860286/miniconda3/envs/daskbase/lib/python3.8/site-packages/distributed/cli/dask_worker.py", line 440, in <module>
go()
File "/home1/06729/tg860286/miniconda3/envs/daskbase/lib/python3.8/site-packages/distributed/cli/dask_worker.py", line 436, in go
main()
File "/home1/06729/tg860286/miniconda3/envs/daskbase/lib/python3.8/site-packages/click/core.py", line 764, in __call__
return self.main(*args, **kwargs)
File "/home1/06729/tg860286/miniconda3/envs/daskbase/lib/python3.8/site-packages/click/core.py", line 717, in main
rv = self.invoke(ctx)
File "/home1/06729/tg860286/miniconda3/envs/daskbase/lib/python3.8/site-packages/click/core.py", line 956, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home1/06729/tg860286/miniconda3/envs/daskbase/lib/python3.8/site-packages/click/core.py", line 555, in invoke
return callback(*args, **kwargs)
File "/home1/06729/tg860286/miniconda3/envs/daskbase/lib/python3.8/site-packages/distributed/cli/dask_worker.py", line 422, in main
loop.run_sync(run)
File "/home1/06729/tg860286/miniconda3/envs/daskbase/lib/python3.8/site-packages/tornado/ioloop.py", line 532, in run_sync
return future_cell[0].result()
File "/home1/06729/tg860286/miniconda3/envs/daskbase/lib/python3.8/site-packages/distributed/cli/dask_worker.py", line 416, in run
await asyncio.gather(*nannies)
File "/home1/06729/tg860286/miniconda3/envs/daskbase/lib/python3.8/asyncio/tasks.py", line 684, in _wrap_awaitable
return (yield from awaitable.__await__())
File "/home1/06729/tg860286/miniconda3/envs/daskbase/lib/python3.8/site-packages/distributed/node.py", line 176, in wait_for
raise TimeoutError(
asyncio.exceptions.TimeoutError: Nanny failed to start in 60 seconds
It looks like your workers weren't able to connect to the scheduler. My guess is that you need to specify a network interface. Ask your system administrator which network interface you should use, then specify it with the interface= keyword.
You might also want to read through https://blog.dask.org/2019/08/28/dask-on-summit, which gives a case study of common problems that arise.
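A rough sketch of what that could look like, assuming dask_jobqueue is installed and with "ib0" standing in for whatever interface your administrator recommends:
from dask.distributed import Client
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(
    cores=8,
    memory="2GB",
    interface="ib0",   # placeholder: the high-speed network interface on your cluster
)
cluster.scale(jobs=2)  # ask SLURM for two worker jobs
client = Client(cluster)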

Is it possible to read parquet metadata from Dask?

I have thousands of Parquet files that I need to process. Before processing them, I'm trying to gather various information from the Parquet metadata, such as the number of rows in each partition, column mins and maxes, etc.
I tried reading the metadata using dask.delayed, hoping to distribute the metadata-gathering tasks across my cluster, but this seems to make Dask unstable. See an example code snippet and a node timeout error below.
Is there a way to read the Parquet metadata from Dask? I know Dask's read_parquet function has a gather_statistics option, which you can set to False to speed up the file reads, but I don't see a way to access all of the Parquet metadata/statistics when it is set to True.
Example code:
import dask
import fastparquet


@dask.delayed
def get_pf(item_to_read):
    # Read only the metadata of one parquet file and return the pieces we need.
    pf = fastparquet.ParquetFile(item_to_read)
    row_groups = pf.row_groups.copy()
    all_stats = pf.statistics.copy()
    col = pf.info['columns'].copy()
    return [row_groups, all_stats, col]


stats_arr = get_pf(item_to_read)
Example error:
2019-10-03 01:43:51,202 - INFO - 192.168.0.167 - distributed.worker - ERROR - Worker stream died during communication: tcp://192.168.0.223:34623
2019-10-03 01:43:51,203 - INFO - 192.168.0.167 - Traceback (most recent call last):
2019-10-03 01:43:51,204 - INFO - 192.168.0.167 - File "/usr/local/lib/python3.7/dist-packages/distributed/comm/core.py", line 218, in connect
2019-10-03 01:43:51,206 - INFO - 192.168.0.167 - quiet_exceptions=EnvironmentError,
2019-10-03 01:43:51,207 - INFO - 192.168.0.167 - File "/usr/local/lib/python3.7/dist-packages/tornado/gen.py", line 729, in run
2019-10-03 01:43:51,210 - INFO - 192.168.0.167 - value = future.result()
2019-10-03 01:43:51,211 - INFO - 192.168.0.167 - tornado.util.TimeoutError: Timeout
2019-10-03 01:43:51,212 - INFO - 192.168.0.167 -
2019-10-03 01:43:51,213 - INFO - 192.168.0.167 - During handling of the above exception, another exception occurred:
2019-10-03 01:43:51,214 - INFO - 192.168.0.167 -
2019-10-03 01:43:51,215 - INFO - 192.168.0.167 - Traceback (most recent call last):
2019-10-03 01:43:51,217 - INFO - 192.168.0.167 - File "/usr/local/lib/python3.7/dist-packages/distributed/worker.py", line 1841, in gather_dep
2019-10-03 01:43:51,218 - INFO - 192.168.0.167 - self.rpc, deps, worker, who=self.address
2019-10-03 01:43:51,219 - INFO - 192.168.0.167 - File "/usr/local/lib/python3.7/dist-packages/tornado/gen.py", line 729, in run
2019-10-03 01:43:51,220 - INFO - 192.168.0.167 - value = future.result()
2019-10-03 01:43:51,222 - INFO - 192.168.0.167 - File "/usr/local/lib/python3.7/dist-packages/tornado/gen.py", line 736, in run
2019-10-03 01:43:51,223 - INFO - 192.168.0.167 - yielded = self.gen.throw(*exc_info) # type: ignore
2019-10-03 01:43:51,224 - INFO - 192.168.0.167 - File "/usr/local/lib/python3.7/dist-packages/distributed/worker.py", line 3029, in get_data_from_worker
2019-10-03 01:43:51,225 - INFO - 192.168.0.167 - comm = yield rpc.connect(worker)
2019-10-03 01:43:51,640 - INFO - 192.168.0.167 - File "/usr/local/lib/python3.7/dist-packages/tornado/gen.py", line 729, in run
2019-10-03 01:43:51,641 - INFO - 192.168.0.167 - value = future.result()
2019-10-03 01:43:51,643 - INFO - 192.168.0.167 - File "/usr/local/lib/python3.7/dist-packages/tornado/gen.py", line 736, in run
2019-10-03 01:43:51,644 - INFO - 192.168.0.167 - yielded = self.gen.throw(*exc_info) # type: ignore
2019-10-03 01:43:51,645 - INFO - 192.168.0.167 - File "/usr/local/lib/python3.7/dist-packages/distributed/core.py", line 866, in connect
2019-10-03 01:43:51,646 - INFO - 192.168.0.167 - connection_args=self.connection_args,
2019-10-03 01:43:51,647 - INFO - 192.168.0.167 - File "/usr/local/lib/python3.7/dist-packages/tornado/gen.py", line 729, in run
2019-10-03 01:43:51,649 - INFO - 192.168.0.167 - value = future.result()
2019-10-03 01:43:51,650 - INFO - 192.168.0.167 - File "/usr/local/lib/python3.7/dist-packages/tornado/gen.py", line 736, in run
2019-10-03 01:43:51,651 - INFO - 192.168.0.167 - yielded = self.gen.throw(*exc_info) # type: ignore
2019-10-03 01:43:51,652 - INFO - 192.168.0.167 - File "/usr/local/lib/python3.7/dist-packages/distributed/comm/core.py", line 230, in connect
2019-10-03 01:43:51,653 - INFO - 192.168.0.167 - _raise(error)
2019-10-03 01:43:51,654 - INFO - 192.168.0.167 - File "/usr/local/lib/python3.7/dist-packages/distributed/comm/core.py", line 207, in _raise
2019-10-03 01:43:51,656 - INFO - 192.168.0.167 - raise IOError(msg)
2019-10-03 01:43:51,657 - INFO - 192.168.0.167 - OSError: Timed out trying to connect to 'tcp://192.168.0.223:34623' after 10 s: connect() didn't finish in time
Does dd.read_parquet take a long time? If not, you can follow whatever strategy it uses internally and do the reading in the client.
If the data has a single _metadata file in the root directory, you can simply open it with fastparquet, which is exactly what Dask would do. It contains the details of all of the data pieces.
There is no particular reason distributing the metadata reads should be a problem, but be aware that in some cases the metadata items can add up to a substantial total size.
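For the _metadata case, something along these lines should work (the path below is a placeholder for your dataset root, and the attributes are the same ones used in your snippet):
import fastparquet

# Open the consolidated footer rather than any individual data file.
pf = fastparquet.ParquetFile("path/to/dataset/_metadata")

row_groups = pf.row_groups        # one entry per data piece, including row counts
all_stats = pf.statistics         # per-column min/max/null-count statistics
columns = pf.info['columns']      # column names
print(len(row_groups), columns)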

Specify Beam Version for Dataflow Operator on Cloud Composer

We have written a Beam pipeline for version 2.11, but when we try to run it on Cloud Composer using the DataflowOperator, it uses SDK version 2.5.
Is there any way to specify that 2.11 should be used?
Pipeline:
import argparse
import apache_beam as beam
from apache_beam.io.gcp import gcsio
from apache_beam.options.pipeline_options import PipelineOptions
import logging
from google.cloud import storage
import numpy as np
import pandas as pd
GCS_PREFIX = 'gs://'
def run(argv=None):
    """
    Create and run Dataflow pipeline.
    :return: none
    """
    parser = argparse.ArgumentParser()
    # Add the arguments needed for this specific Dataflow job.
    parser.add_argument('--gvcf_bucket', dest='gvcf_bucket', required=True,
                        help='Bucket on Google Cloud Storage to read gvcf files from.')
    parser.add_argument('--parquet_bucket', dest='parquet_bucket', required=True,
                        help='Bucket on Google Cloud Storage to write parquet files to.')
    parser.add_argument('--destination_table', dest='destination_table', required=True,
                        help='BigQuery table where transformed gvcfs should land')
    parser.add_argument('--bq_dataset', dest='bq_dataset', required=True,
                        help='BigQuery dataset where destination table lives')
    known_args, pipeline_args = parser.parse_known_args(argv)

    # Add argument so that declared constants (ie, GCS_PREFIX)
    # are available to Dataflow workers
    pipeline_args.append('--save_main_session')

    # Set options necessary for pipeline such as runner, project, region
    p_opts = PipelineOptions(pipeline_args)

    # Create and run beam pipeline object
    with beam.Pipeline(options=p_opts) as p:
        # Sink info
        gvcf_bucket = known_args.gvcf_bucket
        parquet_sink = known_args.parquet_bucket

        # Set BigQuery Table spec for beam.io
        # format is: dataset.table
        table_spec = '{}.{}'.format(known_args.bq_dataset, known_args.destination_table)

        # Get files to transform
        files = get_files_to_transform(gvcf_bucket)

        if files:
            logging.info("Found {} files to transform".format(len(files)))

            # Create pcollection of list of files to transform
            gvcfs_to_transform = p | 'GetFiles' >> beam.Create(files)

            # Read gvcfs from gcs into pcollection
            parquets_to_load = gvcfs_to_transform | 'GvcfToParquet' >> beam.ParDo(GvcfToParquet(),
                                                                                  gvcf_bucket,
                                                                                  parquet_sink)

            # Read Parquet files into pcollection
            records = parquets_to_load | 'ReadParquet' >> beam.io.ReadAllFromParquet()

            # Load all Parquet files into BigQuery
            records | 'WriteParquetToBigQuery' >> beam.io.WriteToBigQuery(
                table_spec,
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
        else:
            logging.info("No new files found")


if __name__ == '__main__':
    run()
Composer DAG:
import datetime
import os
from airflow import models, configuration
from airflow.operators import subdag_operator, dummy_operator, bash_operator
from airflow.contrib.operators import dataflow_operator
import googleapiclient.discovery
import json
from computation_query_dag import computation_dag
yesterday = datetime.datetime.combine(
    datetime.datetime.today() - datetime.timedelta(1),
    datetime.datetime.min.time())

DEFAULT_DAG_ARGS = {
    'start_date': yesterday,
    'retries': 0,
    'project_id': models.Variable.get('gcp_project'),
    'dataflow_default_options': {
        'project': models.Variable.get('gcp_project'),
        'temp_location': models.Variable.get('gcp_temp_location'),
        'staging_location': models.Variable.get('gcp_staging_location'),
        'runner': 'DataflowRunner',
        # 'region': 'us-central1',
    },
}

with models.DAG(dag_id='TestEngine',
                description='',
                schedule_interval=None, default_args=DEFAULT_DAG_ARGS, start_date=yesterday) as dag:

    dataflow_scripts = os.path.join(configuration.get('core', 'dags_folder'), 'pipeline')

    # Args required for the ETL Dataflow job.
    gvcf_dataflow_job_args = {
        'gvcf_bucket': os.getenv('gvcf_bucket'),
        'parquet_bucket': os.getenv('parquet_bucket'),
        # 'job_name': os.getenv('gvcf_job_name'),
        # 'setup_file': os.path.join(dataflow_scripts, 'setup.py'),
        'requirements_file': os.path.join(dataflow_scripts, 'requirements.txt'),
        'destination_table': os.getenv('call_table'),
        'bq_dataset': os.getenv('bq_dataset'),
        # 'py_file': os.path.join(dataflow_scripts, 'gvcf_pipeline.py')
    }

    # Dataflow task that will process and load.
    dataflow_gvcf = dataflow_operator.DataFlowPythonOperator(
        task_id="gvcf-etl-bigquery",
        py_file=os.path.join(dataflow_scripts, 'gvcf_pipeline.py'),
        # dataflow_default_options=DEFAULT_DAG_ARGS['dataflow_default_options'],
        options=gvcf_dataflow_job_args,
        # gcp_conn_id='google_cloud_default'
    )
Since I am able to run the pipeline locally, I would think that if we could specify the Beam version it would also work when run from Cloud Composer.
We have installed 2.11 into our Composer environment, but we get the following error:
*** Reading remote log from gs://us-central1-test-env-96162c22-bucket/logs/AlleleAnalyticsEngine/gvcf-etl-bigquery/2019-04-17T22:33:07.326577+00:00/1.log.
[2019-04-17 22:33:18,604] {models.py:1361} INFO - Dependencies all met for <TaskInstance: Engine.gvcf-etl-bigquery 2019-04-17T22:33:07.326577+00:00 [queued]>
[2019-04-17 22:33:18,611] {models.py:1361} INFO - Dependencies all met for <TaskInstance: Engine.gvcf-etl-bigquery 2019-04-17T22:33:07.326577+00:00 [queued]>
[2019-04-17 22:33:18,613] {models.py:1573} INFO -
-------------------------------------------------------------------------------
Starting attempt 1 of
-------------------------------------------------------------------------------
[2019-04-17 22:33:18,659] {models.py:1595} INFO - Executing <Task(DataFlowPythonOperator): gvcf-etl-bigquery> on 2019-04-17T22:33:07.326577+00:00
[2019-04-17 22:33:18,660] {base_task_runner.py:118} INFO - Running: ['bash', '-c', u'airflow run Engine gvcf-etl-bigquery 2019-04-17T22:33:07.326577+00:00 --job_id 209 --raw -sd DAGS_FOLDER/main_dag.py --cfg_path /tmp/tmpGhGCxD']
[2019-04-17 22:33:20,148] {base_task_runner.py:101} INFO - Job 209: Subtask gvcf-etl-bigquery [2019-04-17 22:33:20,147] {settings.py:176} INFO - setting.configure_orm(): Using pool settings. pool_size=5, pool_recycle=1800
[2019-04-17 22:33:21,073] {base_task_runner.py:101} INFO - Job 209: Subtask gvcf-etl-bigquery [2019-04-17 22:33:21,072] {default_celery.py:80} WARNING - You have configured a result_backend of redis://airflow-redis-service.default.svc.cluster.local:6379/0, it is highly recommended to use an alternative result_backend (i.e. a database).
[2019-04-17 22:33:21,076] {base_task_runner.py:101} INFO - Job 209: Subtask gvcf-etl-bigquery [2019-04-17 22:33:21,075] {__init__.py:51} INFO - Using executor CeleryExecutor
[2019-04-17 22:33:21,155] {base_task_runner.py:101} INFO - Job 209: Subtask gvcf-etl-bigquery [2019-04-17 22:33:21,155] {app.py:51} WARNING - Using default Composer Environment Variables. Overrides have not been applied.
[2019-04-17 22:33:21,162] {base_task_runner.py:101} INFO - Job 209: Subtask gvcf-etl-bigquery [2019-04-17 22:33:21,162] {configuration.py:516} INFO - Reading the config from /etc/airflow/airflow.cfg
[2019-04-17 22:33:21,174] {base_task_runner.py:101} INFO - Job 209: Subtask gvcf-etl-bigquery [2019-04-17 22:33:21,174] {configuration.py:516} INFO - Reading the config from /etc/airflow/airflow.cfg
[2019-04-17 22:33:21,363] {base_task_runner.py:101} INFO - Job 209: Subtask gvcf-etl-bigquery [2019-04-17 22:33:21,362] {models.py:271} INFO - Filling up the DagBag from /home/airflow/gcs/dags/main_dag.py
[2019-04-17 22:33:23,991] {base_task_runner.py:101} INFO - Job 209: Subtask gvcf-etl-bigquery [2019-04-17 22:33:23,985] {cli.py:484} INFO - Running <TaskInstance: AlleleAnalyticsEngine.gvcf-etl-bigquery 2019-04-17T22:33:07.326577+00:00 [running]> on host airflow-worker-796dcd49fc-x7fx6
[2019-04-17 22:33:24,237] {base_task_runner.py:101} INFO - Job 209: Subtask gvcf-etl-bigquery [2019-04-17 22:33:24,236] {gcp_dataflow_hook.py:120} INFO - Running command: python2 /home/airflow/gcs/dags/pipeline/gvcf_pipeline.py --runner=DataflowRunner --parquet_bucket=parquet_sink_test --runner=DataflowRunner --region=us-central1 --labels=airflow-version=v1-10-1-composer --destination_table=calls_table_test --project=genomics-207320 --bq_dataset=allele_analytics --gvcf_bucket=gvcf_sink_test --temp_location=gs://aa_dataflow_staging/temp --job_name=gvcf-etl-bigquery-cfc96be4
[2019-04-17 22:33:25,214] {base_task_runner.py:101} INFO - Job 209: Subtask gvcf-etl-bigquery [2019-04-17 22:33:25,213] {gcp_dataflow_hook.py:151} INFO - Start waiting for DataFlow process to complete.
[2019-04-17 22:33:43,821] {base_task_runner.py:101} INFO - Job 209: Subtask gvcf-etl-bigquery [2019-04-17 22:33:43,820] {gcp_dataflow_hook.py:132} WARNING - Traceback (most recent call last):
[2019-04-17 22:33:43,822] {base_task_runner.py:101} INFO - Job 209: Subtask gvcf-etl-bigquery File "/home/airflow/gcs/dags/pipeline/gvcf_pipeline.py", line 339, in <module>
[2019-04-17 22:33:43,822] {base_task_runner.py:101} INFO - Job 209: Subtask gvcf-etl-bigquery run()
[2019-04-17 22:33:43,823] {base_task_runner.py:101} INFO - Job 209: Subtask gvcf-etl-bigquery File "/home/airflow/gcs/dags/pipeline/gvcf_pipeline.py", line 335, in run
[2019-04-17 22:33:43,823] {base_task_runner.py:101} INFO - Job 209: Subtask gvcf-etl-bigquery logging.info("No new files found")
[2019-04-17 22:33:43,824] {base_task_runner.py:101} INFO - Job 209: Subtask gvcf-etl-bigquery File "/usr/local/lib/python2.7/dist-packages/apache_beam/pipeline.py", line 426, in __exit__
[2019-04-17 22:33:43,825] {base_task_runner.py:101} INFO - Job 209: Subtask gvcf-etl-bigquery self.run().wait_until_finish()
[2019-04-17 22:33:43,825] {base_task_runner.py:101} INFO - Job 209: Subtask gvcf-etl-bigquery File "/usr/local/lib/python2.7/dist-packages/apache_beam/pipeline.py", line 406, in run
[2019-04-17 22:33:43,825] {base_task_runner.py:101} INFO - Job 209: Subtask gvcf-etl-bigquery self._options).run(False)
[2019-04-17 22:33:43,827] {base_task_runner.py:101} INFO - Job 209: Subtask gvcf-etl-bigquery File "/usr/local/lib/python2.7/dist-packages/apache_beam/pipeline.py", line 419, in run
[2019-04-17 22:33:43,831] {base_task_runner.py:101} INFO - Job 209: Subtask gvcf-etl-bigquery return self.runner.run_pipeline(self, self._options)
[2019-04-17 22:33:43,831] {base_task_runner.py:101} INFO - Job 209: Subtask gvcf-etl-bigquery File "/usr/local/lib/python2.7/dist-packages/apache_beam/runners/dataflow/dataflow_runner.py", line 408, in run_pipeline
[2019-04-17 22:33:43,831] {base_task_runner.py:101} INFO - Job 209: Subtask gvcf-etl-bigquery self.dataflow_client = apiclient.DataflowApplicationClient(options)
[2019-04-17 22:33:43,832] {base_task_runner.py:101} INFO - Job 209: Subtask gvcf-etl-bigquery File "/usr/local/lib/python2.7/dist-packages/apache_beam/runners/dataflow/internal/apiclient.py", line 445, in __init__
[2019-04-17 22:33:43,835] {base_task_runner.py:101} INFO - Job 209: Subtask gvcf-etl-bigquery response_encoding=get_response_encoding())
[2019-04-17 22:33:43,835] {base_task_runner.py:101} INFO - Job 209: Subtask gvcf-etl-bigquery File "/usr/local/lib/python2.7/dist-packages/apache_beam/runners/dataflow/internal/clients/dataflow/dataflow_v1b3_client.py", line 58, in __init__
[2019-04-17 22:33:43,835] {base_task_runner.py:101} INFO - Job 209: Subtask gvcf-etl-bigquery response_encoding=response_encoding)
[2019-04-17 22:33:43,835] {base_task_runner.py:101} INFO - Job 209: Subtask gvcf-etl-bigquery TypeError: __init__() got an unexpected keyword argument 'response_encoding'
[2019-04-17 22:33:43,832] {models.py:1760} ERROR - DataFlow failed with return code 1
Traceback (most recent call last):
File "/usr/local/lib/airflow/airflow/models.py", line 1659, in _run_raw_task
result = task_copy.execute(context=context)
File "/usr/local/lib/airflow/airflow/contrib/operators/dataflow_operator.py", line 332, in execute
self.py_file, self.py_options)
File "/usr/local/lib/airflow/airflow/contrib/hooks/gcp_dataflow_hook.py", line 241, in start_python_dataflow
label_formatter)
File "/usr/local/lib/airflow/airflow/contrib/hooks/gcp_api_base_hook.py", line 213, in wrapper
return func(self, *args, **kwargs)
File "/usr/local/lib/airflow/airflow/contrib/hooks/gcp_dataflow_hook.py", line 199, in _start_dataflow
job_id = _Dataflow(cmd).wait_for_done()
File "/usr/local/lib/airflow/airflow/contrib/hooks/gcp_dataflow_hook.py", line 172, in wait_for_done
self._proc.returncode))
Exception: DataFlow failed with return code 1
[2019-04-17 22:33:43,840] {base_task_runner.py:101} INFO - Job 209: Subtask gvcf-etl-bigquery [2019-04-17 22:33:43,832] {models.py:1760} ERROR - DataFlow failed with return code 1
[2019-04-17 22:33:43,841] {base_task_runner.py:101} INFO - Job 209: Subtask gvcf-etl-bigquery Traceback (most recent call last):
[2019-04-17 22:33:43,841] {base_task_runner.py:101} INFO - Job 209: Subtask gvcf-etl-bigquery File "/usr/local/lib/airflow/airflow/models.py", line 1659, in _run_raw_task
[2019-04-17 22:33:43,841] {base_task_runner.py:101} INFO - Job 209: Subtask gvcf-etl-bigquery result = task_copy.execute(context=context)
[2019-04-17 22:33:43,841] {base_task_runner.py:101} INFO - Job 209: Subtask gvcf-etl-bigquery File "/usr/local/lib/airflow/airflow/contrib/operators/dataflow_operator.py", line 332, in execute
[2019-04-17 22:33:43,841] {base_task_runner.py:101} INFO - Job 209: Subtask gvcf-etl-bigquery self.py_file, self.py_options)
[2019-04-17 22:33:43,843] {base_task_runner.py:101} INFO - Job 209: Subtask gvcf-etl-bigquery File "/usr/local/lib/airflow/airflow/contrib/hooks/gcp_dataflow_hook.py", line 241, in start_python_dataflow
[2019-04-17 22:33:43,843] {base_task_runner.py:101} INFO - Job 209: Subtask gvcf-etl-bigquery label_formatter)
[2019-04-17 22:33:43,843] {base_task_runner.py:101} INFO - Job 209: Subtask gvcf-etl-bigquery File "/usr/local/lib/airflow/airflow/contrib/hooks/gcp_api_base_hook.py", line 213, in wrapper
[2019-04-17 22:33:43,844] {models.py:1791} INFO - Marking task as FAILED.
[2019-04-17 22:33:43,844] {base_task_runner.py:101} INFO - Job 209: Subtask gvcf-etl-bigquery return func(self, *args, **kwargs)
[2019-04-17 22:33:43,845] {base_task_runner.py:101} INFO - Job 209: Subtask gvcf-etl-bigquery File "/usr/local/lib/airflow/airflow/contrib/hooks/gcp_dataflow_hook.py", line 199, in _start_dataflow
[2019-04-17 22:33:43,845] {base_task_runner.py:101} INFO - Job 209: Subtask gvcf-etl-bigquery job_id = _Dataflow(cmd).wait_for_done()
[2019-04-17 22:33:43,847] {base_task_runner.py:101} INFO - Job 209: Subtask gvcf-etl-bigquery File "/usr/local/lib/airflow/airflow/contrib/hooks/gcp_dataflow_hook.py", line 172, in wait_for_done
[2019-04-17 22:33:43,847] {base_task_runner.py:101} INFO - Job 209: Subtask gvcf-etl-bigquery self._proc.returncode))
[2019-04-17 22:33:43,847] {base_task_runner.py:101} INFO - Job 209: Subtask gvcf-etl-bigquery Exception: DataFlow failed with return code 1
[2019-04-17 22:33:43,848] {base_task_runner.py:101} INFO - Job 209: Subtask gvcf-etl-bigquery [2019-04-17 22:33:43,844] {models.py:1791} INFO - Marking task as FAILED.
[2019-04-17 22:33:43,890] {base_task_runner.py:101} INFO - Job 209: Subtask gvcf-etl-bigquery Traceback (most recent call last):
[2019-04-17 22:33:43,891] {base_task_runner.py:101} INFO - Job 209: Subtask gvcf-etl-bigquery File "/usr/local/bin/airflow", line 7, in <module>
[2019-04-17 22:33:43,891] {base_task_runner.py:101} INFO - Job 209: Subtask gvcf-etl-bigquery exec(compile(f.read(), __file__, 'exec'))
[2019-04-17 22:33:43,892] {base_task_runner.py:101} INFO - Job 209: Subtask gvcf-etl-bigquery File "/usr/local/lib/airflow/airflow/bin/airflow", line 32, in <module>
[2019-04-17 22:33:43,892] {base_task_runner.py:101} INFO - Job 209: Subtask gvcf-etl-bigquery args.func(args)
[2019-04-17 22:33:43,893] {base_task_runner.py:101} INFO - Job 209: Subtask gvcf-etl-bigquery File "/usr/local/lib/airflow/airflow/utils/cli.py", line 74, in wrapper
[2019-04-17 22:33:43,893] {base_task_runner.py:101} INFO - Job 209: Subtask gvcf-etl-bigquery return f(*args, **kwargs)
[2019-04-17 22:33:43,893] {base_task_runner.py:101} INFO - Job 209: Subtask gvcf-etl-bigquery File "/usr/local/lib/airflow/airflow/bin/cli.py", line 490, in run
[2019-04-17 22:33:43,894] {base_task_runner.py:101} INFO - Job 209: Subtask gvcf-etl-bigquery _run(args, dag, ti)
[2019-04-17 22:33:43,894] {base_task_runner.py:101} INFO - Job 209: Subtask gvcf-etl-bigquery File "/usr/local/lib/airflow/airflow/bin/cli.py", line 406, in _run
[2019-04-17 22:33:43,895] {base_task_runner.py:101} INFO - Job 209: Subtask gvcf-etl-bigquery pool=args.pool,
[2019-04-17 22:33:43,895] {base_task_runner.py:101} INFO - Job 209: Subtask gvcf-etl-bigquery File "/usr/local/lib/airflow/airflow/utils/db.py", line 74, in wrapper
[2019-04-17 22:33:43,895] {base_task_runner.py:101} INFO - Job 209: Subtask gvcf-etl-bigquery return func(*args, **kwargs)
[2019-04-17 22:33:43,897] {base_task_runner.py:101} INFO - Job 209: Subtask gvcf-etl-bigquery File "/usr/local/lib/airflow/airflow/models.py", line 1659, in _run_raw_task
[2019-04-17 22:33:43,897] {base_task_runner.py:101} INFO - Job 209: Subtask gvcf-etl-bigquery result = task_copy.execute(context=context)
[2019-04-17 22:33:43,897] {base_task_runner.py:101} INFO - Job 209: Subtask gvcf-etl-bigquery File "/usr/local/lib/airflow/airflow/contrib/operators/dataflow_operator.py", line 332, in execute
[2019-04-17 22:33:43,899] {base_task_runner.py:101} INFO - Job 209: Subtask gvcf-etl-bigquery self.py_file, self.py_options)
[2019-04-17 22:33:44,083] {base_task_runner.py:101} INFO - Job 209: Subtask gvcf-etl-bigquery File "/usr/local/lib/airflow/airflow/contrib/hooks/gcp_dataflow_hook.py", line 241, in start_python_dataflow
[2019-04-17 22:33:44,083] {base_task_runner.py:101} INFO - Job 209: Subtask gvcf-etl-bigquery label_formatter)
[2019-04-17 22:33:44,083] {base_task_runner.py:101} INFO - Job 209: Subtask gvcf-etl-bigquery File "/usr/local/lib/airflow/airflow/contrib/hooks/gcp_api_base_hook.py", line 213, in wrapper
[2019-04-17 22:33:44,084] {base_task_runner.py:101} INFO - Job 209: Subtask gvcf-etl-bigquery return func(self, *args, **kwargs)
[2019-04-17 22:33:44,084] {base_task_runner.py:101} INFO - Job 209: Subtask gvcf-etl-bigquery File "/usr/local/lib/airflow/airflow/contrib/hooks/gcp_dataflow_hook.py", line 199, in _start_dataflow
[2019-04-17 22:33:44,084] {base_task_runner.py:101} INFO - Job 209: Subtask gvcf-etl-bigquery job_id = _Dataflow(cmd).wait_for_done()
[2019-04-17 22:33:44,085] {base_task_runner.py:101} INFO - Job 209: Subtask gvcf-etl-bigquery File "/usr/local/lib/airflow/airflow/contrib/hooks/gcp_dataflow_hook.py", line 172, in wait_for_done
[2019-04-17 22:33:44,085] {base_task_runner.py:101} INFO - Job 209: Subtask gvcf-etl-bigquery self._proc.returncode))
[2019-04-17 22:33:44,085] {base_task_runner.py:101} INFO - Job 209: Subtask gvcf-etl-bigquery Exception: DataFlow failed with return code 1
The solution was to add google-apitools==0.5.26 to the Composer environment using the PyPI packages option.
Check your Composer image version to see whether it includes the PyPI package for the apache-beam version you want:
https://cloud.google.com/composer/docs/concepts/versioning/composer-versions
