I am having trouble parallelizing code that reads some files and writes to neo4j.
I am using dask to parallelize the process_language_files function (3rd cell from the bottom).
I explain the code below, listing the functions (first 3 cells).
The errors are printed at the end (last 2 cells).
I am also listing the environment and package versions at the end.
If I remove dask.delayed and run this code sequentially, it works perfectly well.
Thank you for your help. :)
==========================================================================
Some functions to work with neo4j.
from neo4j import GraphDatabase
from tqdm import tqdm

def get_driver(uri_scheme='bolt', host='localhost', port='7687', username='neo4j', password=''):
    """Get a neo4j driver."""
    connection_uri = "{uri_scheme}://{host}:{port}".format(uri_scheme=uri_scheme, host=host, port=port)
    auth = (username, password)
    driver = GraphDatabase.driver(connection_uri, auth=auth)
    return driver

def format_raw_res(raw_res):
    """Parse neo4j results."""
    res = []
    for r in raw_res:
        res.append(r)
    return res

def run_bulk_query(query_list, driver):
    """Run a list of neo4j queries in a session."""
    results = []
    with driver.session() as session:
        for query in tqdm(query_list):
            raw_res = session.run(query)
            res = format_raw_res(raw_res)
            results.append({'query': query, 'result': res})
    return results
global_driver = get_driver(uri_scheme='bolt', host='localhost', port='8687', username='neo4j', password='abc123')  # neo4j driver object.
This is how we create a dask client to parallelize.
from dask.distributed import Client
client = Client(threads_per_worker=4, n_workers=1)
The functions that the main code is calling.
import sys
import time
import json
import pandas as pd
import dask

def add_nodes(nodes_list, language_code):
    """Returns a list of strings. Each string is a cypher query to add a node to neo4j."""
    list_of_create_strings = []
    create_string_template = """CREATE (:LABEL {{node_id:{node_id}}})"""
    for index, node in nodes_list.iterrows():
        create_string = create_string_template.format(node_id=node['new_id'])
        list_of_create_strings.append(create_string)
    return list_of_create_strings

def add_relations(relations_list, language_code):
    """Returns a list of strings. Each string is a cypher query to add a relationship to neo4j."""
    list_of_create_strings = []
    create_string_template = """
        MATCH (a),(b) WHERE a.node_id = {source} AND b.node_id = {target}
        MERGE (a)-[r:KNOWS {{ relationship_id:{edge_id} }}]-(b)"""
    for index, relations in relations_list.iterrows():
        create_string = create_string_template.format(
            source=relations['from'], target=relations['to'],
            edge_id=''+str(relations['from'])+'-'+str(relations['to']))
        list_of_create_strings.append(create_string)
    return list_of_create_strings

def add_data(language_code, edges, features, targets, driver):
    """Add nodes and relationships to neo4j"""
    add_nodes_cypher = add_nodes(targets, language_code)  # List of cypher queries, one per node.
    node_results = run_bulk_query(add_nodes_cypher, driver)  # Runs each query in a neo4j session.
    add_relations_cypher = add_relations(edges, language_code)  # List of cypher queries, one per relationship.
    relations_results = run_bulk_query(add_relations_cypher, driver)  # Runs each query in a neo4j session.
    # Saving some metadata
    results = {
        "nodes": {"results": node_results, "length": len(add_nodes_cypher)},
        "relations": {"results": relations_results, "length": len(add_relations_cypher)},
    }
    return results

def load_data(language_code):
    """Load data from files"""
    # Saving file names to variables
    edges_filename = './edges.csv'
    features_filename = './features.json'
    target_filename = './target.csv'
    # Loading data from the file names
    edges = helper.read_csv(edges_filename)
    features = helper.read_json(features_filename)
    targets = helper.read_csv(target_filename)
    # Saving some metadata
    results = {
        "edges": {"length": len(edges)},
        "features": {"length": len(features)},
        "targets": {"length": len(targets)},
    }
    return edges, features, targets, results
The main code.
def process_language_files(language_code, driver):
    """Reads files, creates cypher queries to add nodes and relationships, runs the cypher queries in a neo4j session."""
    edges, features, targets, reading_results = load_data(language_code)  # Read files.
    writing_results = add_data(language_code, edges, features, targets, driver)  # Convert files to nodes and relationships and add them to neo4j in a session.
    return {"reading_results": reading_results, "writing_results": writing_results}  # Return some metadata

# Execution starts here
res = []
for index, language_code in enumerate(['ENGLISH', 'FRENCH']):
    lazy_result = dask.delayed(process_language_files)(language_code, global_driver)
    res.append(lazy_result)
Result from res. These are dask delayed objects.
print(*res)
Delayed('process_language_files-a73f4a9d-6ffa-4295-8803-7fe09849c068') Delayed('process_language_files-c88fbd4f-e8c1-40c0-b143-eda41a209862')
The errors. Even if I use dask.compute(), I get similar errors.
futures = dask.persist(*res)
AttributeError Traceback (most recent call last)
~/Code/miniconda3/envs/MVDS/lib/python3.6/site-packages/distributed/protocol/pickle.py in dumps(x, buffer_callback, protocol)
48 buffers.clear()
---> 49 result = pickle.dumps(x, **dump_kwargs)
50 if len(result) < 1000:
AttributeError: Can't pickle local object 'BoltPool.open.<locals>.opener
==========================================================================
# Name                  Version     Build           Channel
dask                    2020.12.0   pyhd8ed1ab_0    conda-forge
jupyterlab              3.0.3       pyhd8ed1ab_0    conda-forge
neo4j-python-driver     4.2.1       pyh7fcb38b_0    conda-forge
python                  3.9.1       hdb3f193_2
You are getting this error because you are trying to share the driver object amongst your workers.
The driver object contains private data about the connection, data that does not make sense outside the process (and is not serializable).
It is like opening a file somewhere and sharing the file descriptor somewhere else: it won't work, because the file number only makes sense within the process that generated it.
If you want your workers to access the database or any other network resource, you should give them the directions to connect to the resource.
In your case, you should not pass global_driver as a parameter, but rather the connection parameters, and let each worker call get_driver to get its own driver.
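For example, a minimal sketch of that approach, reusing the get_driver, load_data and add_data functions defined above (the connection values below are just the placeholders from the question):

import dask

def process_language_files(language_code, connection_params):
    """Each task builds its own driver from plain connection parameters."""
    driver = get_driver(**connection_params)   # created inside the worker, never pickled
    try:
        edges, features, targets, reading_results = load_data(language_code)
        writing_results = add_data(language_code, edges, features, targets, driver)
    finally:
        driver.close()                         # release the worker-local connection pool
    return {"reading_results": reading_results, "writing_results": writing_results}

connection_params = dict(uri_scheme='bolt', host='localhost', port='8687',
                         username='neo4j', password='abc123')

res = [dask.delayed(process_language_files)(code, connection_params)
       for code in ['ENGLISH', 'FRENCH']]
futures = dask.persist(*res)   # only strings and a dict cross the process boundary

Because only plain strings are shipped to the workers, the pickling step never touches a BoltPool object.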
How does one run SQL queries with different column dimensions in parallel using dask? Below was my attempt:
import cx_Oracle
import pandas as pd
import dask
from dask.delayed import delayed
from dask.diagnostics import ProgressBar

ProgressBar().register()
con = cx_Oracle.connect(user="BLAH", password="BLAH", dsn="BLAH")

@delayed
def loadsql(sql):
    return pd.read_sql_query(sql, con)

results = [loadsql(x) for x in sql_to_run]
dask.compute(results)
df1=results[0]
df2=results[1]
df3=results[2]
df4=results[3]
df5=results[4]
df6=results[5]
However this results in the following error being thrown:
DatabaseError: Execution failed on sql: "SQL QUERY"
ORA-01013: user requested cancel of current operation
unable to rollback
and then shortly thereafter another error comes up:
MultipleInstanceError: Multiple incompatible subclass instances of TerminalInteractiveShell are being created.
sql_to_run is a list of different SQL queries.
Any suggestions or pointers? Thanks!
Update 9.7.18
I think this is more a case of me not reading the documentation closely enough. Indeed, the con being outside the loadsql function was causing the problem. Below is the code change that seems to be working as intended now.
def loadsql(sql):
    con = cx_Oracle.connect(user="BLAH", password="BLAH", dsn="BLAH")
    result = pd.read_sql_query(sql, con)
    con.close()
    return result

values = [delayed(loadsql)(x) for x in sql_to_run]

# Multiprocessing version
import dask.multiprocessing
results = dask.compute(*values, scheduler='processes')
# My sample queries took 56.2 seconds

# Multithreaded version
import dask.threaded
results = dask.compute(*values, scheduler='threads')
# My sample queries took 51.5 seconds
My guess is that the Oracle client is not thread-safe. You could try running with processes instead (by using the multiprocessing scheduler, or the distributed one), if the con object serialises - this may be unlikely. More likely to work would be to create the connection within loadsql, so it gets remade for each call, and the different connections hopefully don't interfere with one another.
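As a hypothetical sketch of the distributed option mentioned above (assuming the per-call-connection version of loadsql from the update, and an arbitrary worker count):

from dask.distributed import Client

# Process-based workers avoid sharing one Oracle client across threads.
client = Client(n_workers=4, threads_per_worker=1)   # n_workers=4 is an arbitrary choice
values = [delayed(loadsql)(x) for x in sql_to_run]
results = dask.compute(*values)   # uses the distributed client once it exists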
I have used TinkerPop and the OpenRDF Sail to connect to Neo4j locally without problems:
String dB_DIR = "neo4j//data";
Sail sail = new GraphSail(new Neo4jGraph(dB_DIR));
sail.initialize();
so that I can import a TTL or RDF file and then query it.
But now I want to connect to a remote Neo4j.
How can I use the Neo4j JDBC driver in this case?
Or does the TinkerPop Blueprints API have a way to do it?
(I did some searching but found no good answer.)
You should use the Sesame Repository API for accessing a Sail object.
Specifically, what you need to do is wrap your Sail object in a Repository object:
Sail sail = new GraphSail(new Neo4jGraph(dB_DIR));
Repository rep = new SailRepository(sail);
rep.initialize();
After this, use the Repository object to connect to your store and perform actions, e.g. to load a Turtle file and then do a query:
RepositoryConnection conn = rep.getConnection();
try {
    // load data
    File file = new File("/path/to/file.ttl");
    conn.add(file, file.getAbsolutePath(), RDFFormat.TURTLE);

    // do query and print result to STDOUT
    String query = "SELECT * WHERE {?s ?p ?o} LIMIT 10";
    TupleQueryResult result =
        conn.prepareTupleQuery(QueryLanguage.SPARQL, query).evaluate();
    while (result.hasNext()) {
        System.out.println(result.next().toString());
    }
}
finally {
    conn.close();
}
See the Sesame documentation or Javadoc for more info and examples of how to use the Repository API.
(disclosure: I am on the Sesame development team)
I am using Neo4j BatchInserters to insert nodes in the db, and LuceneBatchInserterIndexProvider for indexes. I am importing data from multiple files. If my process breaks, I want to be able to restart it from the next file. But whenever I restart the process, it creates a new db in the graph folder and new indexes. My initialization code looks like this:
Map<String, String> config = new HashMap<String, String>();
config.put("neostore.nodestore.db.mapped_memory", "2G");
config.put("batch_import.keep_db", "true");
BatchInserter db = BatchInserters.inserter("ttl.db", config);
BatchInserterIndexProvider indexProvider = new LuceneBatchInserterIndexProvider(db);
index = indexProvider.nodeIndex("ttlIndex", MapUtil.stringMap("type", "exact"));
index.setCacheCapacity(URI_PROPERTY, indexCache + 1);
Can somebody please help here?
To provide more details: I have multiple files (around 400) that I want to import into Neo4j.
I want to divide my process into batches, and after every batch restart the process.
I used the batch inserter config batch_import.keep_db = "true". This does not clear the graph, but after a restart the indexer has lost its information. I have this method to check for node existence; I am sure I created the node before the restart.
private Long getNode(String nodeUrl)
{
    IndexHits<Long> hits = index.get(URI_PROPERTY, nodeUrl);
    if (hits.hasNext()) { // node exists
        return hits.next();
    }
    return null;
}
I am able to create nodes and relationships through Java on a Neo4j database. When I try to access the created nodes in the next run I get this error:
Exception in thread "main" org.neo4j.graphdb.NotFoundException: Node 27 not found
In the webadmin interface the dashboard shows the number of nodes/relationships created through Java, but when I issue this query: START n=node(*) RETURN n; I get only 1 node in the output.
(FYI: I have installed Neo4j on my Windows machine (local) and am using embedded-database Java code to create nodes.)
Java code I used to connect to db:
final String dbpath = "C:\\neo4j-community-1.9.4\\data\\graph.db";
GraphDatabaseService graphdb = new GraphDatabaseFactory().newEmbeddedDatabase(dbpath);
The settings I have used in neo4j-server.properties are:
org.neo4j.server.database.location=/C:/neo4j-community-1.9.4/data/graph.db/
org.neo4j.server.webserver.https.keystore.location=data/keystore
org.neo4j.server.webadmin.rrdb.location=data/rrd
org.neo4j.server.webadmin.data.uri=/C:/neo4j-community-1.9.4/data/graph.db/
org.neo4j.server.webadmin.management.uri=/db/manage/
When I create a node through Java, the data/keystore file does not get populated; it only gets populated when creating a node through the webadmin interface. Changing the keystore file path to an absolute path also did not work.
Can anybody point out the mistake in this scenario? Thanks.
The problem was that the created nodes were not committed. To commit the nodes we have to call finish():
final String dbpath = "/C:/neo4j-community-1.9.4/data/graph.db/";
GraphDatabaseService graphdb = new GraphDatabaseFactory().newEmbeddedDatabase(dbpath);
Transaction tx = graphdb.beginTx();
try {
    Node n1 = graphdb.createNode();
    n1.setProperty("type", "company");
    n1.setProperty("location", "india");
    // ....
    // ...
    tx.success();
} catch (Exception e) {
    tx.failure();
} finally {
    tx.finish();
}
Ranjith's answer was correct until recently, but tx.finish() has now been deprecated.
tx.close(); is now the correct way to commit or rollback the transaction - it will do one or the other depending on whether you've previously called tx.success().
They changed this so that the transaction is AutoCloseable and can be used in a try-with-resources block.
Have you tried:
String dbpath = "C:/neo4j-community-1.9.4/data/graph.db";