I am loading a large number of Neo4j nodes into the system from a JSON file. It is failing with the error message "Failed to invoke procedure apoc.merge.node: Caused by: java.lang.NullPointerException". I am not seeing enough information to figure out what I am doing wrong, and as this is the first time I have used this, I just don't see it. Below are the last 7 or so frames of the error stack; it looks like the error originates when merge_node is called.
File "F:\ClientSide\current\testload1.py", line 104, in <lambda>
nodes.apply(lambda h: merge_node(h), axis=1)
File "F:\ClientSide\current\testload1.py", line 61, in merge_node
ses.run("UNWIND $batch AS row CALL apoc.merge.node(['ProgNode', row.nodetype], {node:row.node}, apoc.map.removeKeys(properties(row), ['nodetype', 'node'])) YIELD node RETURN 1", batch=BATCH["batch"])
File "C:\Users\Bill Dickenson\AppData\Local\Programs\Python\Python37\lib\site-packages\neo4j\work\simple.py", line 217, in run
self._autoResult._run(query, parameters, self._config.database, self._config.default_access_mode, self._bookmarks, **kwparameters)
File "C:\Users\Bill Dickenson\AppData\Local\Programs\Python\Python37\lib\site-packages\neo4j\work\result.py", line 101, in _run
self._attach()
File "C:\Users\Bill Dickenson\AppData\Local\Programs\Python\Python37\lib\site-packages\neo4j\work\result.py", line 202, in _attach
self._connection.fetch_message()
File "C:\Users\Bill Dickenson\AppData\Local\Programs\Python\Python37\lib\site-packages\neo4j\io\_bolt3.py", line 326, in fetch_message
response.on_failure(summary_metadata or {})
File "C:\Users\Bill Dickenson\AppData\Local\Programs\Python\Python37\lib\site-packages\neo4j\io\_bolt3.py", line 512, in on_failure
raise Neo4jError.hydrate(**metadata)
neo4j.exceptions.ClientError: Failed to invoke procedure `apoc.merge.node`: Caused by: java.lang.NullPointerException
The "batch" data structure contains lists of variables like this one
{'EIEO': True, 'FILECOUNT': 1, 'KDM': 'data:Writes', 'changed': False, 'ctx': '113540257', 'level': 'code', 'location': [55, 8, 55, 94], 'node': 100, 'quvioDensity': 1.0, 'quviolations': 2, 'szAFP': '', 'szaep': 17, 'szlocs': 2, 'text': 'FilemavenWrapperPropertyFile=newFile(baseDirectory,MAVEN_WRAPPER_PROPERTIES_PATH);', 'type': 'localVariableDeclarationStatement'}
and the code that is processing it looks like this, including the print statement that generated the data above.
def merge_node(args):
    global INNODE, NODECOUNT
    """
    Function to create nodes from a batch.
    """
    INNODE += 1
    if (INNODE % 10000) == 0:
        print("...Sent %s of %s for processing" % (INNODE, NODECOUNT))
    if len(BATCH['batch']) == 4:
        print(BATCH['batch'][3])
    if (len(BATCH['batch']) > 1000) or (INNODE == NODECOUNT):
        if INNODE == NODECOUNT:
            print("...Final Record (%s) added and transmitted" % INNODE)
        BATCH['batch'].append(args.to_dict())
        with graphDB_Driver.session() as ses:
            ses.run("UNWIND $batch AS row CALL apoc.merge.node(['ProgNode', row.nodetype], {node:row.node}, apoc.map.removeKeys(properties(row), ['nodetype', 'node'])) YIELD node RETURN 1", batch=BATCH["batch"])
        reset_batch()
    BATCH['batch'].append(args.to_dict())
Oddly enough, running this locally produces the error above, but when it runs against my remote Neo4j db it processes fine (no errors) and yet does NOT create anything on the server. So I assume it is failing up there as well, with APOC redirecting the error to the console and just moving on.
Anyone see what I am doing incorrectly?
The "batch" data structure in your question contains neither of the properties required by your Cypher code:
nodetype
node
You have to make sure that all the elements in the $batch list have at least those 2 properties if you want to use that Cypher code.
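A quick way to catch this at load time is to filter or flag incomplete rows before handing the batch to the query. This is only a minimal sketch built around the query from the question (BATCH and graphDB_Driver are the objects from your script, and the filtering logic is illustrative, not a fixed recipe):

valid_rows = []
for row in BATCH['batch']:
    # apoc.merge.node needs a non-null second label and a non-null identity property
    if row.get('nodetype') and row.get('node') is not None:
        valid_rows.append(row)
    else:
        print("Skipping row missing nodetype/node:", row)

if valid_rows:
    with graphDB_Driver.session() as ses:
        ses.run("UNWIND $batch AS row CALL apoc.merge.node(['ProgNode', row.nodetype], {node:row.node}, apoc.map.removeKeys(properties(row), ['nodetype', 'node'])) YIELD node RETURN 1", batch=valid_rows)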
I'm struggling to get the deap.tools.mutUniformInt mutation function to work. To isolate the issue for this SO question, I changed line 62 of examples/ga/onemax.py from
toolbox.register("mutate", tools.mutFlipBit, indpb=0.05)
to
toolbox.register("mutate", tools.mutUniformInt, 0, 1, indpb=0.05)
The onemax.py example now fails:
C:\Users\mshiv\DEAP>python onemax.py
Start of evolution
Evaluated 300 individuals
-- Generation 1 --
Traceback (most recent call last):
File "C:\Users\mshiv\DEAP\onemax.py", line 161, in <module>
main()
File "C:\Users\mshiv\DEAP\onemax.py", line 128, in main
toolbox.mutate(mutant)
File "C:\Users\mshiv\AppData\Local\Programs\Python\Python39\lib\site-packages\deap\tools\mutation.py", line 159, in mutUniformInt
size = len(individual)
TypeError: object of type 'int' has no len()
Both mutators should operate on an Individual, which is defined in onemax.py to be a list. So why does mutFlipBit work, while mutUniformInt appears to receive the Individual parameter as an int rather than a list?
Poking around in the DEAP code, I found that mutUniformInt receives the parameters out of order, i.e. they are passed in as (low, up, individual, indpb) whereas the function itself is defined as
def mutUniformInt(individual, low, up, indpb):
Am I registering this mutation function incorrectly?
Source of the onemax example I altered:
https://github.com/DEAP/deap/blob/master/examples/ga/onemax.py
NVM - found the answer here:
https://groups.google.com/g/deap-users/c/4sw2_Al4YFI/m/EvUiq70IBAAJ
The correct syntax should have been:
toolbox.register("mutate", tools.mutUniformInt, low=0, up=1, indpb=0.05)
I have a stateful DoFn that batches the incoming elements; when the buffer reaches a certain size, the elements are inserted into BigQuery and the buffer is cleared. What I've noticed is that from time to time the pipeline raises an exception, but the exception does not stop the job from running. Below is the stacktrace:
Error message from worker: generic::unknown: Traceback (most recent call last):
File "apache_beam/runners/common.py", line 1213, in apache_beam.runners.common.DoFnRunner.process
File "apache_beam/runners/common.py", line 742, in apache_beam.runners.common.PerWindowInvoker.invoke_process
File "apache_beam/runners/common.py", line 867, in apache_beam.runners.common.PerWindowInvoker._invoke_process_per_window
File "/usr/local/lib/python3.7/site-packages/gp/pipelines/common/writer_transforms.py", line 140, in process
self._flush_buffer(buffer_state, count_state, buffer_size_state)
File "/usr/local/lib/python3.7/site-packages/gp/pipelines/common/writer_transforms.py", line 162, in _flush_buffer
rows = self._extract_rows(buffer_state)
File "/usr/local/lib/python3.7/site-packages/gp/pipelines/common/writer_transforms.py", line 197, in _extract_rows
for row in buffer.read():
File "/usr/local/lib/python3.7/site-packages/apache_beam/runners/worker/bundle_processor.py", line 510, in __iter__
for elem in self.first:
File "/usr/local/lib/python3.7/site-packages/apache_beam/runners/worker/sdk_worker.py", line 1039, in _lazy_iterator
self._underlying.get_raw(state_key, continuation_token))
File "/usr/local/lib/python3.7/site-packages/apache_beam/runners/worker/sdk_worker.py", line 846, in get_raw
continuation_token=continuation_token)))
File "/usr/local/lib/python3.7/site-packages/apache_beam/runners/worker/sdk_worker.py", line 886, in _blocking_request
raise RuntimeError(response.error)
RuntimeError: INTERNAL: Work token invalid
This is raised when the process method is called and it tries to extract the elements from the buffer, see rows = self._extract_rows(buffer_state)
The DoFn is implemented exactly like in the example https://beam.apache.org/blog/timely-processing/#example-batched-rpc
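For reference, the buffering logic follows the pattern from that post; roughly like this (a simplified sketch, not the exact production code, with the threshold, coder and BigQuery write left as placeholders):

import apache_beam as beam
from apache_beam.coders import FastPrimitivesCoder
from apache_beam.transforms.userstate import BagStateSpec, CombiningValueStateSpec

MAX_BUFFER_SIZE = 500  # illustrative threshold

class BufferingDoFn(beam.DoFn):
    # per-key, per-window state cells
    BUFFER_STATE = BagStateSpec('buffer', FastPrimitivesCoder())
    COUNT_STATE = CombiningValueStateSpec('count', sum)

    def process(self,
                element,
                buffer_state=beam.DoFn.StateParam(BUFFER_STATE),
                count_state=beam.DoFn.StateParam(COUNT_STATE)):
        _, value = element  # stateful DoFns require a keyed PCollection
        buffer_state.add(value)
        count_state.add(1)
        if count_state.read() >= MAX_BUFFER_SIZE:
            # reading the bag state here is where the error above surfaces
            rows = list(buffer_state.read())
            # ... insert rows into BigQuery ...
            buffer_state.clear()
            count_state.clear()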
I've confirmed this error is expected during work reassignments, e.g. when autoscaling. The work item will be retried on the new machine and the pipeline will continue processing correctly. (I agree the error message could be improved.)
I am trying to use Dask to perform a groupby operation on a Dataframe.
The code below does not work, but it seems that if I initialize the Client from another console the code works, even though I can't see anything on the dashboard (http://localhost:8787/status): the dashboard is there, but all the figures look empty. I am on macOS.
Code:
from datetime import datetime
import numpy as np
import os
from dask import dataframe as dd
from dask.distributed import Client
import pandas as pd
client = Client()
# open http://localhost:8787/status
csv_path = 'chicago-complete.monthly.2018-07-01-to-2018-07-31/data.csv'
dir_destination = 'data'
df = dd.read_csv(csv_path,
                 dtype = {
                     'timestamp': str,
                     'node_id': str,
                     'subsystem': str,
                     'sensor': str,
                     'parameter': str,
                     'value_raw': str,
                     'value_hrf': str,
                 },
                 parse_dates=['timestamp'],
                 date_parser=lambda x: pd.datetime.strptime(x, '%Y/%m/%d %H:%M:%S')
                 )
#%%
if not os.path.exists(dir_destination):
    os.makedirs(dir_destination)

def create_node_csv(df_node):
    # test function
    return len(df_node)

res = df.groupby('node_id').apply(create_node_csv, meta=int)
The CSV file is simply composed of string columns. My goal is to group all the rows that contain a certain value in a column and then save each group as a separate file using create_node_csv(df_node) (even though right now it is a dummy function), roughly as sketched below. Any other way to do it is appreciated, but I would like to understand what's going on here.
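Eventually create_node_csv is meant to write each group to its own file; the sketch below is just the intent (untested, relying on the dir_destination defined above):

def create_node_csv(df_node):
    # df_node is the pandas DataFrame holding one node_id group
    node_id = df_node['node_id'].iloc[0]
    df_node.to_csv(os.path.join(dir_destination, '%s.csv' % node_id), index=False)
    return len(df_node)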
When I run it, the console prints the following errors multiple times:
tornado.application - ERROR - Multiple exceptions in yield list
Traceback (most recent call last):
File "/anaconda3/lib/python3.7/site-packages/tornado/gen.py", line 883, in callback
result_list.append(f.result())
File "/anaconda3/lib/python3.7/site-packages/tornado/gen.py", line 1141, in run
yielded = self.gen.throw(*exc_info)
File "/anaconda3/lib/python3.7/site-packages/distributed/deploy/local.py", line 208, in _start_worker
yield w._start()
File "/anaconda3/lib/python3.7/site-packages/tornado/gen.py", line 1133, in run
value = future.result()
File "/anaconda3/lib/python3.7/site-packages/tornado/gen.py", line 1141, in run
yielded = self.gen.throw(*exc_info)
File "/anaconda3/lib/python3.7/site-packages/distributed/nanny.py", line 157, in _start
response = yield self.instantiate()
File "/anaconda3/lib/python3.7/site-packages/tornado/gen.py", line 1133, in run
value = future.result()
File "/anaconda3/lib/python3.7/site-packages/tornado/gen.py", line 1141, in run
yielded = self.gen.throw(*exc_info)
File "/anaconda3/lib/python3.7/site-packages/distributed/nanny.py", line 226, in instantiate
self.process.start()
File "/anaconda3/lib/python3.7/site-packages/tornado/gen.py", line 1133, in run
value = future.result()
File "/anaconda3/lib/python3.7/site-packages/tornado/gen.py", line 1141, in run
yielded = self.gen.throw(*exc_info)
File "/anaconda3/lib/python3.7/site-packages/distributed/nanny.py", line 370, in start
yield self.process.start()
File "/anaconda3/lib/python3.7/site-packages/tornado/gen.py", line 1133, in run
value = future.result()
File "/anaconda3/lib/python3.7/site-packages/distributed/process.py", line 35, in _call_and_set_future
res = func(*args, **kwargs)
File "/anaconda3/lib/python3.7/site-packages/distributed/process.py", line 184, in _start
process.start()
File "/anaconda3/lib/python3.7/multiprocessing/process.py", line 112, in start
self._popen = self._Popen(self)
File "/anaconda3/lib/python3.7/multiprocessing/context.py", line 291, in _Popen
return Popen(process_obj)
File "/anaconda3/lib/python3.7/multiprocessing/popen_forkserver.py", line 35, in __init__
super().__init__(process_obj)
File "/anaconda3/lib/python3.7/multiprocessing/popen_fork.py", line 20, in __init__
self._launch(process_obj)
File "/anaconda3/lib/python3.7/multiprocessing/popen_forkserver.py", line 42, in _launch
prep_data = spawn.get_preparation_data(process_obj._name)
File "/anaconda3/lib/python3.7/multiprocessing/spawn.py", line 143, in get_preparation_data
_check_not_importing_main()
File "/anaconda3/lib/python3.7/multiprocessing/spawn.py", line 136, in _check_not_importing_main
is not going to be frozen to produce an executable.''')
RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.
This probably means that you are not using fork to start your
child processes and you have forgotten to use the proper idiom
in the main module:
if __name__ == '__main__':
freeze_support()
...
The "freeze_support()" line can be omitted if the program
is not going to be frozen to produce an executable.
And:
distributed.nanny - WARNING - Worker process 1844 exited with status 1
distributed.nanny - WARNING - Restarting worker
And:
Traceback (most recent call last):
File "/anaconda3/lib/python3.7/multiprocessing/queues.py", line 242, in _feed
send_bytes(obj)
File "/anaconda3/lib/python3.7/multiprocessing/connection.py", line 200, in send_bytes
self._send_bytes(m[offset:offset + size])
File "/anaconda3/lib/python3.7/multiprocessing/connection.py", line 404, in _send_bytes
self._send(header + buf)
File "/anaconda3/lib/python3.7/multiprocessing/connection.py", line 368, in _send
n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe
tornado.application - ERROR - Multiple exceptions in yield list
Traceback (most recent call last):
File "/anaconda3/lib/python3.7/site-packages/tornado/gen.py", line 883, in callback
result_list.append(f.result())
File "/anaconda3/lib/python3.7/site-packages/tornado/gen.py", line 1147, in run
yielded = self.gen.send(value)
File "/anaconda3/lib/python3.7/site-packages/distributed/deploy/local.py", line 217, in _start_worker
raise gen.TimeoutError("Worker failed to start")
tornado.util.TimeoutError: Worker failed to start
EDIT:
Based on the answer:
- How do I prevent the creation of a new Client if I run the program again?
- How can I do the following?
def create_node_csv(df_node):
    return len(df_node)
It returns the following error; is it related to the meta parameter?
ValueError: cannot reindex from a duplicate axis
When you run the script, Client() causes new Dask workers to be spawned, which also get copies of variables from the original main process. In some cases, this involves re-importing the script in each worker, and each of those, of course, then tries to create a Client and a new set of processes.
The best answer, as with anything that runs in subprocesses, is to use functions and protect the main execution. The following is one way to do this without changing your one-script structure:
from datetime import datetime
import numpy as np
import os
from dask import dataframe as dd
from dask.distributed import Client
import pandas as pd
csv_path = 'chicago-complete.monthly.2018-07-01-to-2018-07-31/data.csv'
dir_destination = 'data'
def run():
    client = Client()
    df = dd.read_csv(csv_path, ...)
    if not os.path.exists(dir_destination):
        os.makedirs(dir_destination)

    def create_node_csv(df_node):
        # test function
        return len(df_node)

    res = df.groupby('node_id').apply(create_node_csv, meta=int)
    print(res.compute())

if __name__ == "__main__":
    run()
How do I prevent the creation of a new Client if I run the program again?
In the call to Client() you can include the address of an existing cluster, if you know what that would be. Also, some specific types of deployments (and there are a few) may have a concept of the "current cluster".
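For example, if a scheduler is already running (such as one started from another console or with the dask-scheduler command), you can attach to it instead of spawning a new local cluster; the address below is only an example, use the one your scheduler actually reports:

from dask.distributed import Client

# connect to an existing scheduler rather than creating a new LocalCluster
client = Client("tcp://127.0.0.1:8786")  # example address; check the scheduler's startup output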
I'm working on converting some tests from Nose and twisted to pytest and twisted, as Nose is no longer in development. The easiest way to convert the tests is to edit the custom decorator that every test carries, which defines a timeout for the individual test.
I've tried using @pytest.mark.timeout, but the only method that worked is 'thread', and that stops the entire test run rather than continuing on to the next test. The 'signal' method fails to stop the test, but I can see an error recorded in the junitxml file.
def inlineCallbacksTest ( timeout = None ):
    def decorator ( method ):
        @wraps ( method )
        @pytest.mark.timeout(timeout = timeout, method = 'signal' )
        @pytest.inlineCallbacks
        def testMethod ( *args, **kwargs ):
            return method(*args, **kwargs)
        return testMethod
    return decorator
The tests themselves use twisted to start up and send messages to the software. I don't need the tests to cancel any twisted processes or locks. I would just like pytest to mark the test as a failure after the timeout and then move on to the next test.
Below is the error I see in the xml file when using signal method of timeout.
</system-out><system-err>
+++++++++++++++++++++++++++++++++++ Timeout ++++++++++++++++++++++++++++++++++++
~~~~~~~ Stack of PoolThread-twisted.internet.reactor-2 (139997693642496) ~~~~~~~
File "/usr/lib64/python2.7/threading.py", line 784, in __bootstrap
self.__bootstrap_inner()
File "/usr/lib64/python2.7/threading.py", line 811, in __bootstrap_inner
self.run()
File "/usr/lib64/python2.7/threading.py", line 764, in run
self.__target(*self.__args, **self.__kwargs)
File "/usr/lib64/python2.7/site-packages/twisted/python/threadpool.py", line 190, in _worker
o = self.q.get()
File "/usr/lib64/python2.7/Queue.py", line 168, in get
self.not_empty.wait()
File "/usr/lib64/python2.7/threading.py", line 339, in wait
waiter.acquire()
~~~~~~~ Stack of PoolThread-twisted.internet.reactor-1 (139997702035200) ~~~~~~~
File "/usr/lib64/python2.7/threading.py", line 784, in __bootstrap
self.__bootstrap_inner()
File "/usr/lib64/python2.7/threading.py", line 811, in __bootstrap_inner
self.run()
File "/usr/lib64/python2.7/threading.py", line 764, in run
self.__target(*self.__args, **self.__kwargs)
File "/usr/lib64/python2.7/site-packages/twisted/python/threadpool.py", line 190, in _worker
o = self.q.get()
File "/usr/lib64/python2.7/Queue.py", line 168, in get
self.not_empty.wait()
File "/usr/lib64/python2.7/threading.py", line 339, in wait
waiter.acquire()
+++++++++++++++++++++++++++++++++++ Timeout ++++++++++++++++++++++++++++++++++++
Unhandled Error
Traceback (most recent call last):
File "/usr/lib64/python2.7/site-
packages/twisted/internet/base.py", line 1169, in run
self.mainLoop()
--- <exception caught here> ---
File "/usr/lib64/python2.7/site-
packages/twisted/internet/base.py", line 1181, in mainLoop
self.doIteration(t)
File "/usr/lib64/python2.7/site-
packages/twisted/internet/epollreactor.py", line 362, in doPoll
l = self._poller.poll(timeout, len(self._selectables))
File "/usr/lib/python2.7/site-packages/pytest_timeout.py", line 110, in handler
timeout_sigalrm(item, timeout)
File "/usr/lib/python2.7/site-packages/pytest_timeout.py", line 243, in timeout_sigalrm
pytest.fail('Timeout >%ss' % timeout)
File "/usr/lib/python2.7/site-packages/_pytest/outcomes.py", line 85, in fail
raise Failed(msg=msg, pytrace=pytrace)
builtins.Failed: Timeout >5.0s
</system-err>
I have looked around for a similar solution, and the closest I could find was this question. Any help or suggestions would be appreciated.
Coming back to this after 4+ years with an answer. The problem seems to be that the exception from the test gets caught by the twisted reactor. I was able to resolve this by updating the version of twisted: versions since 16.5 have a new Deferred method, addTimeout (Docs). Using that, I modified the original decorator to the following. Now whenever a test times out, it simply raises an exception and moves on to the next one. It may not be the most elegant, but I hope this helps someone else out!
import twisted.internet.defer as defer
import pytest_twisted as pt
from functools import wraps
from twisted.internet import reactor  # addTimeout needs a clock/reactor to schedule the timeout

def inlineCallbacksTest ( timeout = None ):
    def testDecorator ( testFunc ):
        def timeoutError ( value, timeout ):
            raise Exception ( "Test Timeout: {} secs have expired".format ( timeout ) )
        @wraps ( testFunc )
        def wrapper ( *args, **kwargs ):
            testDefer = pt.inlineCallbacks ( testFunc )( *args, **kwargs )
            testDefer.addTimeout ( timeout, reactor, timeoutError )
            return testDefer
        return wrapper
    return testDecorator
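With that decorator in place, a test ends up looking roughly like this (the test body and the deferred-returning helper are made up for illustration):

@inlineCallbacksTest(timeout=5)
def test_send_message():
    # some_deferred_call() stands in for whatever returns a Deferred in your suite
    result = yield some_deferred_call()
    assert result is not None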
I think this might be obvious to seasoned py2neo users, but I could not get past it since I'm new. I'm trying to follow the py2neo online doc, http://book.py2neo.org/en/latest/graphs_nodes_relationships/. I was able to use the Node methods on the instance returned from GraphDatabaseService.create, but when I use GraphDatabaseService.node to retrieve the node, all the expected Node methods stop working. I've narrowed it down to the example below, which uses len() on the node (Node.__len__).
Thanks in advance for any helpful insights.
Bruce
My env:
windows 7 professional
pycharm 3.4
py2neo 1.6.4
python2.7.5
Here are the codes:
from py2neo import node, neo4j
db = neo4j.GraphDatabaseService()
db.clear()
a, = db.create(node({'name': ['a']}))
a.add_labels('Label')
b = db.node(a._id)
print db.neo4j_version
print b, type(b)
print "There is %s node in db" % db.order
print len(b)
Here are the outputs:
C:\Python27\python.exe C:/Users/you_zhang/PycharmProjects/py2neo/ex11.py
(2, 0, 3, u'')
(10) <class 'py2neo.neo4j.Node'>
There is 1 node in db
Traceback (most recent call last):
File "C:/Users/you_zhang/PycharmProjects/py2neo/ex11.py", line 11, in <module>
print len(b)
File "C:\Users\you_zhang\AppData\Roaming\Python\Python27\site-packages\py2neo\neo4j.py", line 1339, in __len__
return len(self.get_properties())
File "C:\Users\you_zhang\AppData\Roaming\Python\Python27\site-packages\py2neo\neo4j.py", line 1398, in get_properties
self._properties = assembled(self._properties_resource._get()) or {}
File "C:\Users\you_zhang\AppData\Roaming\Python\Python27\site-packages\py2neo\neo4j.py", line 1349, in _properties_resource
return self._subresource("properties")
File "C:\Users\you_zhang\AppData\Roaming\Python\Python27\site-packages\py2neo\neo4j.py", line 403, in _subresource
uri = URI(self.__metadata__[key])
File "C:\Users\you_zhang\AppData\Roaming\Python\Python27\site-packages\py2neo\neo4j.py", line 338, in __metadata__
self.refresh()
File "C:\Users\you_zhang\AppData\Roaming\Python\Python27\site-packages\py2neo\neo4j.py", line 360, in refresh
self._metadata = ResourceMetadata(self._get().content)
File "C:\Users\you_zhang\AppData\Roaming\Python\Python27\site-packages\py2neo\neo4j.py", line 367, in _get
raise ClientError(e)
py2neo.exceptions.ClientError: Not Found
Your exact code snippet works for me (OS X, neo4j 2.1.2). There shouldn't be any problem. Have you tried to install the latest version of neo4j and run your code on a fresh and untouched database? I have encountered inconsistencies in corrupted databases.
Have you tried to load the node with .find()?
result = db.find('Label')
for n in result:
    print(n)