Apache Beam stateful ParDo Work token invalid - google-cloud-dataflow

I have a stateful DoFn that batches incoming elements; when the buffer reaches a certain size, the buffer is cleared and the elements are inserted into BigQuery. From time to time the pipeline raises an exception, although the exception does not stop the job from running. Below is the stack trace:
Error message from worker: generic::unknown: Traceback (most recent call last):
File "apache_beam/runners/common.py", line 1213, in apache_beam.runners.common.DoFnRunner.process
File "apache_beam/runners/common.py", line 742, in apache_beam.runners.common.PerWindowInvoker.invoke_process
File "apache_beam/runners/common.py", line 867, in apache_beam.runners.common.PerWindowInvoker._invoke_process_per_window
File "/usr/local/lib/python3.7/site-packages/gp/pipelines/common/writer_transforms.py", line 140, in process
self._flush_buffer(buffer_state, count_state, buffer_size_state)
File "/usr/local/lib/python3.7/site-packages/gp/pipelines/common/writer_transforms.py", line 162, in _flush_buffer
rows = self._extract_rows(buffer_state)
File "/usr/local/lib/python3.7/site-packages/gp/pipelines/common/writer_transforms.py", line 197, in _extract_rows
for row in buffer.read():
File "/usr/local/lib/python3.7/site-packages/apache_beam/runners/worker/bundle_processor.py", line 510, in __iter__
for elem in self.first:
File "/usr/local/lib/python3.7/site-packages/apache_beam/runners/worker/sdk_worker.py", line 1039, in _lazy_iterator
self._underlying.get_raw(state_key, continuation_token))
File "/usr/local/lib/python3.7/site-packages/apache_beam/runners/worker/sdk_worker.py", line 846, in get_raw
continuation_token=continuation_token)))
File "/usr/local/lib/python3.7/site-packages/apache_beam/runners/worker/sdk_worker.py", line 886, in _blocking_request
raise RuntimeError(response.error)
RuntimeError: INTERNAL: Work token invalid
This is raised when the process method is called and tries to read the elements from the buffer; see rows = self._extract_rows(buffer_state) in the trace above.
The DoFn is implemented exactly as in the example at https://beam.apache.org/blog/timely-processing/#example-batched-rpc
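For readers unfamiliar with the pattern, the batching described in the question can be sketched in plain Python. This is only an analogue and does not use Beam's state API; `SizeBatcher` and `sink` are hypothetical names, and the list `batches` stands in for the BigQuery insert.

```python
class SizeBatcher:
    """Buffers elements and hands them to `sink` once `max_size` is reached."""

    def __init__(self, max_size, sink):
        self.max_size = max_size
        self.sink = sink      # hypothetical callable, e.g. a BigQuery writer
        self.buffer = []

    def add(self, element):
        self.buffer.append(element)
        if len(self.buffer) >= self.max_size:
            self.flush()

    def flush(self):
        # Swap the buffer out before calling the sink so it is cleared
        # exactly once per flush.
        if self.buffer:
            rows, self.buffer = self.buffer, []
            self.sink(rows)


batches = []
batcher = SizeBatcher(max_size=3, sink=batches.append)
for n in range(7):
    batcher.add(n)
batcher.flush()  # flush the remainder, as a final timer would in Beam
print(batches)   # [[0, 1, 2], [3, 4, 5], [6]]
```

In the real DoFn the buffer lives in runner-managed state rather than in a Python list, which is why reading it can fail when the work item has been reassigned.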

I've confirmed this error is expected during work reassignments, e.g. when autoscaling. The work item will be retried on the new machine and the pipeline will continue processing correctly. (I agree the error message could be improved.)

Related

jira - 401 Client Error: Unauthorized for url

I'm creating a Slack bot to connect to Jira. The whole thing works on my local machine, however, when I deploy the server on a VM (Azure), it gives me this error. Any thoughts?
Error:
401 Client Error: Unauthorized for url: https://***.atlassian.net/rest/api/2/search?startAt=0&fields=%2Aall&jql=project%3Dtest
2022-11-09 17:31:40,851 - ERROR - slack_bolt.App - Failed to run listener function (error: Object of type HTTPError is not JSON serializable)
Traceback (most recent call last):
File "/home/ubuntu/.local/lib/python3.8/site-packages/slack_bolt/listener/thread_runner.py", line 120, in run_ack_function_asynchronously
listener.run_ack_function(request=request, response=response)
File "/home/ubuntu/.local/lib/python3.8/site-packages/slack_bolt/listener/custom_listener.py", line 50, in run_ack_function
return self.ack_function(
File "/home/ubuntu/flask/src/helpers/slack_controller.py", line 87, in handle_jira_action
sj_view_one(slack_app, body)
File "/home/ubuntu/flask/src/helpers/slack_jira_services.py", line 41, in sj_view_one
slack_app.client.views_open(trigger_id=body['trigger_id'], view={
File "/home/ubuntu/.local/lib/python3.8/site-packages/slack_sdk/web/client.py", line 4452, in views_open
return self.api_call("views.open", json=kwargs)
File "/home/ubuntu/.local/lib/python3.8/site-packages/slack_sdk/web/base_client.py", line 156, in api_call
return self._sync_send(api_url=api_url, req_args=req_args)
File "/home/ubuntu/.local/lib/python3.8/site-packages/slack_sdk/web/base_client.py", line 187, in _sync_send
return self._urllib_api_call(
File "/home/ubuntu/.local/lib/python3.8/site-packages/slack_sdk/web/base_client.py", line 294, in _urllib_api_call
response = self._perform_urllib_http_request(url=url, args=request_args)
File "/home/ubuntu/.local/lib/python3.8/site-packages/slack_sdk/web/base_client.py", line 339, in _perform_urllib_http_request
body = json.dumps(args["json"])
File "/usr/lib/python3.8/json/__init__.py", line 231, in dumps
return _default_encoder.encode(obj)
File "/usr/lib/python3.8/json/encoder.py", line 199, in encode
chunks = self.iterencode(o, _one_shot=True)
File "/usr/lib/python3.8/json/encoder.py", line 257, in iterencode
return _iterencode(o, 0)
File "/usr/lib/python3.8/json/encoder.py", line 179, in default
raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type HTTPError is not JSON serializable
After hours of research, I found the answer. There are two problems here:
First, the 401 Unauthorized.
This is a problem I have with the Atlassian API. (I have a ticket open for it, but they haven't been able to help me with it yet.)
Second, the TypeError.
One way to fix this is to convert the exception to a string before passing it on. For example:
try:
    something()
except Exception as error:
    logger.error(str(error))
    return str(error)
Instead of returning the error object, I returned str(error).
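To illustrate why the string conversion helps, here is a minimal sketch that reproduces the TypeError without any network access. `FakeHTTPError` and `build_view` are hypothetical stand-ins for the real HTTPError and for the view passed to views_open; only the standard json module is used.

```python
import json


class FakeHTTPError(Exception):
    """Hypothetical stand-in for the HTTPError raised by the Jira request."""


def build_view(text):
    # Hypothetical payload, similar in spirit to a Slack modal view.
    return {"type": "modal", "text": text}


error = FakeHTTPError("401 Client Error: Unauthorized for url")

try:
    json.dumps(build_view(error))   # exception object inside the payload
    raw_ok = True
except TypeError as exc:
    raw_ok = False
    print(exc)  # Object of type FakeHTTPError is not JSON serializable

payload = json.dumps(build_view(str(error)))  # plain string inside the payload
print(payload)
```

The second call succeeds because json.dumps only knows how to encode basic types such as str, not arbitrary exception objects.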

ValueError: file descriptor out of range in select() in py2neo

After one of the recent py2neo updates, we are seeing a lot of randomly appearing errors saying
ValueError: file descriptor out of range in select(). We are using py2neo to connect to a remote neo4j instance.
client_identifier = request.args.get('tribes_client_id')
graph_obj = generic_helpers.get_graph_object(client_identifier) # <- returns py2neo instance
graph_transaction = graph_obj.begin() # <- this is the line causing the exception
Below is stack trace of the exception being raised
Traceback (most recent call last):
File "/env/lib/python3.7/site-packages/flask/app.py", line 2447, in wsgi_app
response = self.full_dispatch_request()
File "/env/lib/python3.7/site-packages/flask/app.py", line 1952, in full_dispatch_request
rv = self.handle_user_exception(e)
File "/env/lib/python3.7/site-packages/flask/app.py", line 1821, in handle_user_exception
reraise(exc_type, exc_value, tb)
File "/env/lib/python3.7/site-packages/flask/_compat.py", line 39, in reraise
raise value
File "/env/lib/python3.7/site-packages/flask/app.py", line 1950, in full_dispatch_request
rv = self.dispatch_request()
File "/env/lib/python3.7/site-packages/flask/app.py", line 1936, in dispatch_request
return self.view_functions[rule.endpoint](**req.view_args)
File "/srv/neo4j_maintenance_routes.py", line 116, in get_words_list_for_lookup
graph_transaction = graph_obj.begin()
File "/env/lib/python3.7/site-packages/py2neo/database.py", line 353, in begin
return Transaction(self, autocommit)
File "/env/lib/python3.7/site-packages/py2neo/database.py", line 781, in __init__
self.transaction = self.connector.begin()
File "/env/lib/python3.7/site-packages/py2neo/internal/connectors.py", line 297, in begin
tx = self.pool.acquire()
File "/env/lib/python3.7/site-packages/neobolt/direct.py", line 715, in acquire
return self.acquire_direct(self.address)
File "/env/lib/python3.7/site-packages/neobolt/direct.py", line 608, in acquire_direct
connection = self.connector(address, error_handler=self.connection_error_handler)
File "/env/lib/python3.7/site-packages/py2neo/internal/connectors.py", line 227, in connector
encrypted=cx_data["secure"], **kwargs)
File "/env/lib/python3.7/site-packages/neobolt/direct.py", line 972, in connect
raise last_error
File "/env/lib/python3.7/site-packages/neobolt/direct.py", line 964, in connect
connection = _handshake(s, address, der_encoded_server_certificate, **config)
File "/env/lib/python3.7/site-packages/neobolt/direct.py", line 898, in _handshake
ready_to_read, _, _ = select((s,), (), (), 1)
ValueError: filedescriptor out of range in select()
py2neo -> version 4.3.0
python -> version 3.7.3
neo4j -> version (Enterprise 3.5.3)
Note that we don't get these errors every time we create a new transaction; they occur at random times and go away on their own after a while.
I also come across this issue occasionally. The difference is that I am using neo4j-python-driver (1.7.6).
File "/home/my/anaconda3/envs/python37/lib/python3.7/site-packages/neo4j/__init__.py", line 444, in run
self._connect()
File "/home/my/anaconda3/envs/python37/lib/python3.7/site-packages/neo4j/__init__.py", line 383, in _connect
self._connection = self._acquirer(access_mode)
File "/home/my/anaconda3/envs/python37/lib/python3.7/site-packages/neobolt/direct.py", line 715, in acquire
return self.acquire_direct(self.address)
File "/home/my/anaconda3/envs/python37/lib/python3.7/site-packages/neobolt/direct.py", line 608, in acquire_direct
connection = self.connector(address, error_handler=self.connection_error_handler)
File "/home/my/anaconda3/envs/python37/lib/python3.7/site-packages/neo4j/__init__.py", line 232, in connector
return connect(address, **dict(config, **kwargs))
File "/home/my/anaconda3/envs/python37/lib/python3.7/site-packages/neobolt/direct.py", line 972, in connect
raise last_error
File "/home/my/anaconda3/envs/python37/lib/python3.7/site-packages/neobolt/direct.py", line 964, in connect
connection = _handshake(s, address, der_encoded_server_certificate, **config)
File "/home/my/anaconda3/envs/python37/lib/python3.7/site-packages/neobolt/direct.py", line 898, in _handshake
ready_to_read, _, _ = select((s,), (), (), 1)
ValueError: filedescriptor out of range in select()
After I reran the same code, the error did not recur and the program worked well. I think this issue is probably caused by multiple logins by the same user: when I am still accessing the neo4j database in the browser and use the same account to fetch data with neo4j-python-driver at the same time, this error can happen. However, sometimes it seems to happen even when I do nothing but run the code.
Extra Info:
Neo4j Graph Database: 3.5 community version

Dask not starting workers

I am trying to use Dask to perform a groupby operation on a Dataframe.
The code below does not work. It does seem to work if I initialize the Client from another console, although I can't see anything on the dashboard ( http://localhost:8787/status ): the dashboard loads, but all the figures look empty. I am on macOS.
Code:
from datetime import datetime
import numpy as np
import os
from dask import dataframe as dd
from dask.distributed import Client
import pandas as pd
client = Client()
# open http://localhost:8787/status
csv_path = 'chicago-complete.monthly.2018-07-01-to-2018-07-31/data.csv'
dir_destination = 'data'
df = dd.read_csv(csv_path,
    dtype = {
        'timestamp': str,
        'node_id': str,
        'subsystem': str,
        'sensor': str,
        'parameter': str,
        'value_raw': str,
        'value_hrf': str,
    },
    parse_dates=['timestamp'],
    date_parser=lambda x: pd.datetime.strptime(x, '%Y/%m/%d %H:%M:%S')
)
#%%
if not os.path.exists(dir_destination):
    os.makedirs(dir_destination)
def create_node_csv(df_node):
    # test function
    return len(df_node)
res = df.groupby('node_id').apply(create_node_csv, meta=int)
The CSV file consists simply of string columns. My goal is to group all the rows that contain a certain value in a column and then save them as separate files using create_node_csv(df_node) (even though right now it is a dummy function). Any other way to do it is appreciated, but I would like to understand what's going on here.
When I run it, the console prints the following errors multiple times:
tornado.application - ERROR - Multiple exceptions in yield list
Traceback (most recent call last):
File "/anaconda3/lib/python3.7/site-packages/tornado/gen.py", line 883, in callback
result_list.append(f.result())
File "/anaconda3/lib/python3.7/site-packages/tornado/gen.py", line 1141, in run
yielded = self.gen.throw(*exc_info)
File "/anaconda3/lib/python3.7/site-packages/distributed/deploy/local.py", line 208, in _start_worker
yield w._start()
File "/anaconda3/lib/python3.7/site-packages/tornado/gen.py", line 1133, in run
value = future.result()
File "/anaconda3/lib/python3.7/site-packages/tornado/gen.py", line 1141, in run
yielded = self.gen.throw(*exc_info)
File "/anaconda3/lib/python3.7/site-packages/distributed/nanny.py", line 157, in _start
response = yield self.instantiate()
File "/anaconda3/lib/python3.7/site-packages/tornado/gen.py", line 1133, in run
value = future.result()
File "/anaconda3/lib/python3.7/site-packages/tornado/gen.py", line 1141, in run
yielded = self.gen.throw(*exc_info)
File "/anaconda3/lib/python3.7/site-packages/distributed/nanny.py", line 226, in instantiate
self.process.start()
File "/anaconda3/lib/python3.7/site-packages/tornado/gen.py", line 1133, in run
value = future.result()
File "/anaconda3/lib/python3.7/site-packages/tornado/gen.py", line 1141, in run
yielded = self.gen.throw(*exc_info)
File "/anaconda3/lib/python3.7/site-packages/distributed/nanny.py", line 370, in start
yield self.process.start()
File "/anaconda3/lib/python3.7/site-packages/tornado/gen.py", line 1133, in run
value = future.result()
File "/anaconda3/lib/python3.7/site-packages/distributed/process.py", line 35, in _call_and_set_future
res = func(*args, **kwargs)
File "/anaconda3/lib/python3.7/site-packages/distributed/process.py", line 184, in _start
process.start()
File "/anaconda3/lib/python3.7/multiprocessing/process.py", line 112, in start
self._popen = self._Popen(self)
File "/anaconda3/lib/python3.7/multiprocessing/context.py", line 291, in _Popen
return Popen(process_obj)
File "/anaconda3/lib/python3.7/multiprocessing/popen_forkserver.py", line 35, in __init__
super().__init__(process_obj)
File "/anaconda3/lib/python3.7/multiprocessing/popen_fork.py", line 20, in __init__
self._launch(process_obj)
File "/anaconda3/lib/python3.7/multiprocessing/popen_forkserver.py", line 42, in _launch
prep_data = spawn.get_preparation_data(process_obj._name)
File "/anaconda3/lib/python3.7/multiprocessing/spawn.py", line 143, in get_preparation_data
_check_not_importing_main()
File "/anaconda3/lib/python3.7/multiprocessing/spawn.py", line 136, in _check_not_importing_main
is not going to be frozen to produce an executable.''')
RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.
This probably means that you are not using fork to start your
child processes and you have forgotten to use the proper idiom
in the main module:
if __name__ == '__main__':
freeze_support()
...
The "freeze_support()" line can be omitted if the program
is not going to be frozen to produce an executable.
And:
distributed.nanny - WARNING - Worker process 1844 exited with status 1
distributed.nanny - WARNING - Restarting worker
And:
Traceback (most recent call last):
File "/anaconda3/lib/python3.7/multiprocessing/queues.py", line 242, in _feed
send_bytes(obj)
File "/anaconda3/lib/python3.7/multiprocessing/connection.py", line 200, in send_bytes
self._send_bytes(m[offset:offset + size])
File "/anaconda3/lib/python3.7/multiprocessing/connection.py", line 404, in _send_bytes
self._send(header + buf)
File "/anaconda3/lib/python3.7/multiprocessing/connection.py", line 368, in _send
n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe
tornado.application - ERROR - Multiple exceptions in yield list
Traceback (most recent call last):
File "/anaconda3/lib/python3.7/site-packages/tornado/gen.py", line 883, in callback
result_list.append(f.result())
File "/anaconda3/lib/python3.7/site-packages/tornado/gen.py", line 1147, in run
yielded = self.gen.send(value)
File "/anaconda3/lib/python3.7/site-packages/distributed/deploy/local.py", line 217, in _start_worker
raise gen.TimeoutError("Worker failed to start")
tornado.util.TimeoutError: Worker failed to start
tornado.application - ERROR - Multiple exceptions in yield list
Traceback (most recent call last):
File "/anaconda3/lib/python3.7/site-packages/tornado/gen.py", line 883, in callback
result_list.append(f.result())
File "/anaconda3/lib/python3.7/site-packages/tornado/gen.py", line 1147, in run
yielded = self.gen.send(value)
File "/anaconda3/lib/python3.7/site-packages/distributed/deploy/local.py", line 217, in _start_worker
EDIT:
Based on the answer:
- How do I prevent the creation of a new Client if I run the program again?
- How can I do the following?
def create_node_csv(df_node):
    return len(df_node)
It gives me the following error. Is it related to the meta parameter?
ValueError: cannot reindex from a duplicate axis
When you run the script, Client() causes new Dask workers to be spawned, which also get copies of variables from the original main process. In some cases, this involves re-importing the script in each worker, each of which then, of course, tries to create a Client and a new set of processes.
The best answer, as in general with anything running in processes, is to use functions and protect the main execution. The following is one way to do this without changing your one-script structure:
from datetime import datetime
import numpy as np
import os
from dask import dataframe as dd
from dask.distributed import Client
import pandas as pd
csv_path = 'chicago-complete.monthly.2018-07-01-to-2018-07-31/data.csv'
dir_destination = 'data'
def run():
    client = Client()
    df = dd.read_csv(csv_path, ...)
    if not os.path.exists(dir_destination):
        os.makedirs(dir_destination)
    def create_node_csv(df_node):
        # test function
        return len(df_node)
    res = df.groupby('node_id').apply(create_node_csv, meta=int)
    print(res.compute())
if __name__ == "__main__":
    run()
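To see why the guard matters, the re-import can be simulated with the standard library alone. This is a sketch under assumptions, not Dask's actual startup code: as far as I understand, a child started with the spawn or forkserver method re-runs the main script under the module name `__mp_main__`, which is what `runpy.run_path` imitates here. The two script sources and the `started` list are hypothetical stand-ins for a script that calls Client().

```python
import os
import runpy
import tempfile

# Top-level code runs on every import, so every worker would start a cluster.
UNGUARDED = "started = ['cluster']  # stands in for Client()\n"

# The same work moved into a function behind the main guard.
GUARDED = (
    "started = []\n"
    "def run():\n"
    "    started.append('cluster')  # stands in for Client()\n"
    "if __name__ == '__main__':\n"
    "    run()\n"
)


def simulate_worker_import(source):
    """Run `source` roughly the way a spawned child re-runs the main script."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "script.py")
        with open(path, "w") as handle:
            handle.write(source)
        namespace = runpy.run_path(path, run_name="__mp_main__")
    return namespace["started"]


print(simulate_worker_import(UNGUARDED))  # ['cluster']
print(simulate_worker_import(GUARDED))    # []
```

In the guarded version the child sees a `__name__` other than `"__main__"`, so run() is never called and no second cluster is started.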
How do I prevent the creation of a new Client if I run the program again?
In the call to Client() you can include the address of an existing cluster, if you know what that would be. Also, some specific types of deployments (and there are a few) may have a concept of the "current cluster".

How do you timeout a twisted test that uses pytest?

I'm working on converting some tests from using Nose and twisted, to using Pytest and twisted, as Nose is no longer in development. The easiest way to convert the tests is by editing the custom decorator that each test has. This decorator is on every test, and defines a timeout for the individual test.
I've tried using @pytest.mark.timeout, but the only method that has worked is 'thread', and that stops the entire test run instead of continuing on to the next test. The 'signal' method fails to stop the test, although I can see an error in the junitxml file.
def inlineCallbacksTest ( timeout = None ):
    def decorator ( method ):
        @wraps ( method )
        @pytest.mark.timeout(timeout = timeout, method = 'signal' )
        @pytest.inlineCallbacks
        def testMethod ( *args, **kwargs ):
            return method(*args, **kwargs)
        return testMethod
    return decorator
The tests themselves use twisted to start up and send messages to the software. I don't need the tests to cancel any twisted processes or locks. I would just like pytest to mark the test as a failure after the timeout, and then move onto the next test.
Below is the error I see in the xml file when using signal method of timeout.
</system-out><system-err>
+++++++++++++++++++++++++++++++++++ Timeout ++++++++++++++++++++++++++++++++++++
~~~~~~~ Stack of PoolThread-twisted.internet.reactor-2 (139997693642496) ~~~~~~~
File "/usr/lib64/python2.7/threading.py", line 784, in __bootstrap
self.__bootstrap_inner()
File "/usr/lib64/python2.7/threading.py", line 811, in __bootstrap_inner
self.run()
File "/usr/lib64/python2.7/threading.py", line 764, in run
self.__target(*self.__args, **self.__kwargs)
File "/usr/lib64/python2.7/site-packages/twisted/python/threadpool.py", line 190, in _worker
o = self.q.get()
File "/usr/lib64/python2.7/Queue.py", line 168, in get
self.not_empty.wait()
File "/usr/lib64/python2.7/threading.py", line 339, in wait
waiter.acquire()
~~~~~~~ Stack of PoolThread-twisted.internet.reactor-1 (139997702035200) ~~~~~~~
File "/usr/lib64/python2.7/threading.py", line 784, in __bootstrap
self.__bootstrap_inner()
File "/usr/lib64/python2.7/threading.py", line 811, in __bootstrap_inner
self.run()
File "/usr/lib64/python2.7/threading.py", line 764, in run
self.__target(*self.__args, **self.__kwargs)
File "/usr/lib64/python2.7/site-packages/twisted/python/threadpool.py", line 190, in _worker
o = self.q.get()
File "/usr/lib64/python2.7/Queue.py", line 168, in get
self.not_empty.wait()
File "/usr/lib64/python2.7/threading.py", line 339, in wait
waiter.acquire()
+++++++++++++++++++++++++++++++++++ Timeout ++++++++++++++++++++++++++++++++++++
Unhandled Error
Traceback (most recent call last):
File "/usr/lib64/python2.7/site-packages/twisted/internet/base.py", line 1169, in run
self.mainLoop()
--- <exception caught here> ---
File "/usr/lib64/python2.7/site-packages/twisted/internet/base.py", line 1181, in mainLoop
self.doIteration(t)
File "/usr/lib64/python2.7/site-packages/twisted/internet/epollreactor.py", line 362, in doPoll
l = self._poller.poll(timeout, len(self._selectables))
File "/usr/lib/python2.7/site-packages/pytest_timeout.py", line 110, in handler
timeout_sigalrm(item, timeout)
File "/usr/lib/python2.7/site-packages/pytest_timeout.py", line 243, in timeout_sigalrm
pytest.fail('Timeout >%ss' % timeout)
File "/usr/lib/python2.7/site-packages/_pytest/outcomes.py", line 85, in fail
raise Failed(msg=msg, pytrace=pytrace)
builtins.Failed: Timeout >5.0s
</system-err>
I have looked around for a similar solution, and the closest I could find was this question. Any help or suggestions would be appreciated.
Coming back to this after 4+ years with an answer. The problem seems to be that the exception from the test gets caught by the twisted reactor. I was able to resolve this by updating twisted: versions since 16.5 have a Deferred method called addTimeout. Using that, I was able to modify the original decorator as follows. Now whenever a test times out, it simply raises an exception and moves on to the next one. It may not be the most elegant solution, but I hope this helps someone else out!
import twisted.internet.defer as defer
import pytest_twisted as pt
from functools import wraps
# reactor is used by addTimeout below; the original snippet did not import it
from twisted.internet import reactor
def inlineCallbacksTest ( timeout = None ):
    def testDecorator ( testFunc ):
        # Called by addTimeout when the Deferred is cancelled after `timeout` seconds
        def timeoutError ( value, timeout ):
            raise Exception ( "Test Timeout: {} secs have expired".format ( timeout ) )
        @wraps ( testFunc )
        def wrapper ( *args, **kwargs ):
            testDefer = pt.inlineCallbacks ( testFunc )( *args, **kwargs )
            testDefer.addTimeout ( timeout, reactor, timeoutError )
            return testDefer
        return wrapper
    return testDecorator

Running queue in background in Tensorflow causes strange exceptions

I am implementing the following graph in TensorFlow: there is a queue Q, to which a background thread enqueues tensors. In the main thread, I sequentially dequeue elements from Q.
My code can be simplified as follows:
import time
import threading
import tensorflow as tf
sess = tf.InteractiveSession()
coord = tf.train.Coordinator()
q = tf.FIFOQueue(32, dtypes=tf.int32)
def loop(g):
    with g.as_default():
        enqueue_op = q.enqueue(1, name="example_enqueue")
        for i in range(20):
            if coord.should_stop():
                return
            try:
                sess.run(enqueue_op)
            except tf.errors.CancelledError:
                print("enqueue cancelled")
threads = [
    threading.Thread(target=loop, args=(tf.get_default_graph(),))
]
sess.run(tf.initialize_all_variables())
for t in threads: t.start()
# If I sleep 1 seconds, it will be fine!
# time.sleep(1)
print(sess.run(q.dequeue()))
coord.request_stop()
coord.join(threads)
sess.close()
As noted in the comment, if I sleep for 1 second before running the dequeue operation, everything works. However, if it runs immediately, the following exception is raised:
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/tensorflow/python/client/session.py", line 715, in _do_call
return fn(*args)
File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/tensorflow/python/client/session.py", line 697, in _run_fn
status, run_metadata)
File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/contextlib.py", line 66, in __exit__
next(self.gen)
File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/tensorflow/python/framework/errors.py", line 450, in raise_exception_on_not_ok_status
pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors.NotFoundError: FetchOutputs node fifo_queue_Dequeue:0: not found
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/runpy.py", line 170, in _run_module_as_main
"__main__", mod_spec)
File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/Users/hanxu/Downloads/BrainSeg/playgrounds/7.py", line 32, in <module>
print(sess.run(q.dequeue()))
File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/tensorflow/python/client/session.py", line 372, in run
run_metadata_ptr)
File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/tensorflow/python/client/session.py", line 636, in _run
feed_dict_string, options, run_metadata)
File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/tensorflow/python/client/session.py", line 708, in _do_run
target_list, options, run_metadata)
File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/tensorflow/python/client/session.py", line 728, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors.NotFoundError: FetchOutputs node fifo_queue_Dequeue:0: not found
Could anyone help? Thanks very much!!
Update
I am using TensorFlow 0.9.0rc0.
My real situation is a little more complicated. The enqueued tensor is in fact different each time, for example:
def loop(g):
    with g.as_default():
        for i in range(20):
            if coord.should_stop():
                return
            # Look here!
            enqueue_op = q.enqueue(i, name="example_enqueue")
            try:
                sess.run(enqueue_op)
            except tf.errors.CancelledError:
                print("enqueue cancelled")
So it is not trivial to move the enqueue operation to the main thread, and I don't know how. Please help.
This was an issue with old (pre-0.9) versions of TensorFlow, and it was fixed in version 0.9. The issue is that adding nodes to the graph (i.e. in your calls to q.dequeue() and q.enqueue()) was not thread-safe when other threads (i.e. your loop() thread) were using the graph.
There are two things you'd need to fix to avoid the race condition (in pre-0.9 versions):
First, don't call q.enqueue() in the loop() thread. Instead, create the op in the main thread. For example:
q = tf.FIFOQueue(32, dtypes=tf.int32)
enqueue_op = q.enqueue(1, name="example_enqueue")
def loop(g):
    for i in range(20):
        if coord.should_stop():
            return
        try:
            sess.run(enqueue_op)
        except tf.errors.CancelledError:
            print("enqueue cancelled")
Second, move the call to q.dequeue() (which adds a node to the graph) before the point where you start the loop() thread:
dequeued_t = q.dequeue()
for t in threads: t.start()
print(sess.run(dequeued_t))
