Need help to optimize neo4j cypher CREATE and MERGE queries - neo4j

I am parsing the Bitcoin blockchain. The idea is to build a graph that looks like (address)-[redeemed]->(tx)-[sent]->(address), so I can see how Bitcoin addresses are related to each other. The problem is execution time: sometimes it takes a few minutes to import just one transaction. Besides, some of these queries are too long, a few thousand lines, and won't execute at all. I have read a few articles on how to optimize MATCH queries, but found almost nothing about CREATE and MERGE. I saw a few people here recommending UNWIND and sending as much data as possible as parameters to make queries shorter, but I have no idea how to implement this in my query.
Here is an example of my query: http://pastebin.com/9S0kLNey

You can try using the following simple query, passing the string parameters "hash", "time", "block", and "confs"; and a collection parameter named "data":
CREATE (tx:Tx {hash: {hash}, time: {time}, block: {block}, confirmations: {confs}})
FOREACH(x IN {data} |
MERGE (addr:Address {address: x.a})
CREATE (addr)-[:REDEEMED {value: x.v}]->(tx)
);
The values to use for the string parameters should be obvious.
Each "data" element would be a map containing an address ("a") and a value ("v"). For example, here is a snippet of the "data" collection that would correspond to the data in your sample query:
[
{a: "18oBAMgFaeFcZ5bziaYpUpsNCJ7G8EgH8g", v: "240"},
{a: "192W3HUVDyrp6ewvisHSijcx9f5ZoarrwX", v: "410"},
{a: "18tnEFy4usZvpMZLnjBFPjbmLKEzqPz958", v: "16.88"},
...
]
This query should run faster than your original sample, but I don't know how much faster.
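The question also asks how UNWIND and parameters fit together. Below is a minimal sketch in Python, assuming the transaction has already been parsed into a dict. The helper name `build_tx_params`, the input layout, and the placeholder values are hypothetical. The query text and the parameter map would be passed to whichever driver you use. The query is written with the newer `$param` form; older servers need the `{param}` form shown above.

```python
# Cypher text for one transaction. UNWIND expands the "data" list on the
# server, so the query stays the same length however many inputs there are.
TX_QUERY = """
CREATE (tx:Tx {hash: $hash, time: $time, block: $block, confirmations: $confs})
WITH tx
UNWIND $data AS x
MERGE (addr:Address {address: x.a})
CREATE (addr)-[:REDEEMED {value: x.v}]->(tx)
"""


def build_tx_params(tx):
    """Turn a parsed transaction (hypothetical layout) into query parameters."""
    return {
        "hash": tx["hash"],
        "time": tx["time"],
        "block": tx["block"],
        "confs": tx["confirmations"],
        # One small map per input: address ("a") and value ("v").
        "data": [{"a": address, "v": value} for address, value in tx["inputs"]],
    }


# Placeholder values; only the two addresses and values come from the sample above.
example_tx = {
    "hash": "example-hash",
    "time": "example-time",
    "block": "example-block",
    "confirmations": "0",
    "inputs": [
        ("18oBAMgFaeFcZ5bziaYpUpsNCJ7G8EgH8g", "240"),
        ("192W3HUVDyrp6ewvisHSijcx9f5ZoarrwX", "410"),
    ],
}
params = build_tx_params(example_tx)
print(len(params["data"]), "inputs prepared")
```

MERGE on :Address(address) tends to be slow without an index or uniqueness constraint on that property, so that is worth checking as well.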

Related

Correct order of operations in neo4j - LOAD, MERGE, MATCH, WITH, SET

I am loading simple CSV data into Neo4j. The data looks like this:
uniqueId compound value category
ACT12_M_609 mesulfen 21 carbon
ACT12_M_609 MNAF 23 carbon
ACT12_M_609 nifluridide 20 suphate
ACT12_M_609 sulfur 23 carbon
I load the data from the URL using the following query:
LOAD CSV WITH HEADERS
FROM "url"
AS row
MERGE( t: Transaction { transactionId: row.uniqueId })
MERGE(c:Compound {name: row.compound})
MERGE (t)-[r:CONTAINS]->(c)
ON CREATE SET c.category= row.category
ON CREATE SET r.price =row.value
Next I do the aggregation to count the total orders for a compound and store the result as a property on the node:
MATCH (c:Compound)<-[:CONTAINS]-(t:Transaction)
WITH c, count(DISTINCT t.transactionId) AS ord
SET c.orders = ord
So far so good. I can accomplish what I want, but I have the following two questions:
1. How can I create the orders property for the compound node in the first step itself? I.e. when I am loading the data, I would like to perform the aggregation straight away.
2. For a compound node I am also setting the category property. Theoretically, it could also be modelled as category -contains-> compound by creating a Category node. What advantage would I gain from that? I can execute the queries and get the expected output without creating this additional node.
Thank you for your answer.
1. I don't think that's possible. LOAD CSV goes over one row at a time, so at row 1 it doesn't know how many more rows will follow.
I guess you could create virtual nodes and relationships, aggregate those, and then use them to create the real nodes, but that would be far more complicated (see the APOC documentation on Virtual Nodes/Rels).
2. That depends on the questions/queries you want to ask.
A graph database is optimised for following relationships, so if you often run a query where the category is a criterion (e.g. MATCH (c:Category {category_id: 12})-[r]-(:Compound)), it might be more performant to model the category as its own node with its own label.
If you just want to get the category in the results (e.g. RETURN compound.category), then it's fine as a property.
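On the first point, one option outside Cypher is to pre-aggregate the file before import, since a client-side pass can see all rows at once. This is a minimal sketch rather than something the answer proposes. It assumes the file is tab-separated, and the helper name `count_orders` is hypothetical.

```python
import csv
import io
from collections import defaultdict

# Sample rows copied from the question, written as tab-separated text.
SAMPLE = (
    "uniqueId\tcompound\tvalue\tcategory\n"
    "ACT12_M_609\tmesulfen\t21\tcarbon\n"
    "ACT12_M_609\tMNAF\t23\tcarbon\n"
    "ACT12_M_609\tnifluridide\t20\tsuphate\n"
    "ACT12_M_609\tsulfur\t23\tcarbon\n"
)


def count_orders(csv_text):
    """Count distinct transaction ids per compound."""
    seen = defaultdict(set)
    for row in csv.DictReader(io.StringIO(csv_text), delimiter="\t"):
        seen[row["compound"]].add(row["uniqueId"])
    return {compound: len(ids) for compound, ids in seen.items()}


orders = count_orders(SAMPLE)
print(orders)
```

The resulting counts could then be written as an extra column or passed as parameters, so the orders property is set during the load itself.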

Is there a way to dynamically generate nodes from JSON with apoc.load.json procedure?

I would like to create a set of nodes and relationships from a JSON document. Here is sample JSON:
{"records": [{
"type": "bundle",
"id": "bundle--1",
"objects": [
{
"type": "evaluation",
"id": "evaluation--12345",
"name": "Eval 1",
"goals": [
"test"
]
},
{
"type": "subject",
"id": "subject--67890",
"name": "Eval 2",
"goals": [
"execute"
]
},
{
"type": "relationship",
"id": "relationship--45678",
"relationship_type": "participated-in",
"source_ref": "subject--67890",
"target_ref": "evaluation--12345"
}
]
}]
}
And I would like that JSON to be represented in Neo similar to the following:
(:evaluation {properties})<-[:RELATIONSHIP]-(:subject {properties})
Ultimately I would like to have a model that represents the evaluation, subject, and relationship generated via a few cypher calls with as little outside manipulation as possible. Is it possible to use the apoc.create.* set of calls to create the necessary nodes and relationships from this JSON? I have tried something similar to the following to get this JSON to load and I can get it to create nodes of an arbitrary, in this case "object", type.
WITH "file:///C:/path/to/my/sample.json" AS json
CALL apoc.load.json(json, "$.records") YIELD value
UNWIND value.objects as object
MERGE (o:object {id: object.id, type: object.type, name: object.name})
RETURN count(*)
I have tried changing the JSONPath expression to filter different record types, but it is difficult to run a Goessner path like $.records..objects[?(@.type = 'subject')] because of the embedded quotes. This would also lead to multiple runs (I have 15 or so different types) against the real JSON, which could be very time-consuming. The LoadJSON docs have a simple filter expression, and there is a blog post that shows how to parse Stack Overflow, but the JSON objects there are keyed in a way that is easy to map in Cypher. Is there a Cypher trick or APOC procedure I should be aware of that can help me solve this problem?
I would approach this as a two-pass method:
First pass: create the nodes for evaluation and subject. You could use apoc.do.case/when if helpful
Second pass: only scan for relationship and then do a MATCH to find the evaluation and subject nodes based on the source_ref and target_ref, and then MERGE or CREATE the relationship to connect them.
This way you are not affected by situations such as a relationship appearing before the nodes it connects, or by how many items there are within objects.
As Lju pointed out, the apoc.do.case procedure can be used to check a set of conditions, each followed by a Cypher statement. Combining that with another APOC call requires the values yielded by each call to be handled properly. My answer ended up looking like the following:
WITH "file:///C:/path/to/my/sample.json" AS json
CALL apoc.load.json(json, "$.records") YIELD value as records
UNWIND records.objects as object
CALL apoc.do.case(
[object.type="evaluation", "MERGE (e:Evaluation {id: object.id}) ON CREATE SET e.prop1 = object.prop1",
object.type="subject", "MERGE (s:Subject {id: object.id}) ON CREATE SET s.prop1 = object.prop1",
....],
"<default case cypher query goes here>", {object:object}
)
YIELD value RETURN count(*)
Notice there are two apoc calls that YIELD. Use aliases to help the parser differentiate between objects. The documentation for the apoc.do.case is a little sparse but describes the syntax for the statement. It looks like there are other ways to accomplish this task but with smaller JSON files, and a handful of cases, this works well enough.
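The two-pass idea from the first answer can also be sketched on the client side. The Python below is an illustration under assumptions, not the method used in either answer. The helper names `split_objects` and `node_query` are hypothetical, and the :RELATIONSHIP type is taken from the target model in the question. The label is interpolated into the query text because Cypher has traditionally not accepted labels as parameters.

```python
from collections import defaultdict


def split_objects(objects):
    """Separate node-like objects (grouped by type) from relationship objects."""
    nodes_by_type = defaultdict(list)
    relationships = []
    for obj in objects:
        if obj["type"] == "relationship":
            relationships.append(obj)
        else:
            nodes_by_type[obj["type"]].append(obj)
    return dict(nodes_by_type), relationships


def node_query(label):
    """Pass 1: build one statement per node type."""
    # The label is interpolated into the query text, so check it first.
    if not label.isidentifier():
        raise ValueError(f"unexpected label: {label!r}")
    return (
        "UNWIND $rows AS row "
        f"MERGE (n:`{label}` {{id: row.id}}) "
        "SET n += row"
    )


# Pass 2: connect nodes that already exist. Matching without a label scans
# every node, which is acceptable only for small graphs.
REL_QUERY = (
    "UNWIND $rows AS row "
    "MATCH (s {id: row.source_ref}) "
    "MATCH (t {id: row.target_ref}) "
    "MERGE (s)-[:RELATIONSHIP {relationship_type: row.relationship_type}]->(t)"
)

# Objects copied from the sample JSON in the question.
objects = [
    {"type": "evaluation", "id": "evaluation--12345", "name": "Eval 1", "goals": ["test"]},
    {"type": "subject", "id": "subject--67890", "name": "Eval 2", "goals": ["execute"]},
    {"type": "relationship", "id": "relationship--45678",
     "relationship_type": "participated-in",
     "source_ref": "subject--67890", "target_ref": "evaluation--12345"},
]
nodes_by_type, relationships = split_objects(objects)
for label, rows in nodes_by_type.items():
    print(node_query(label), len(rows))
```

Each node statement would be run with its rows as the `rows` parameter, followed by REL_QUERY with the relationship rows. This keeps the number of statements equal to the number of types plus one, regardless of file size.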

Neo4j variable-length pattern matching tuning

Query:
PROFILE
MATCH(node:Symptom) WHERE node.symptom =~ '.*adult male.*|.*151.*'
WITH node
MATCH (node)-[*1..2]-(result:Disease)
RETURN result
Profile: [image of the query plan]
Problems:
There are over 40 thousand "Symptom" nodes in the database, and the query is very slow because of the "[*1..2]" part.
It takes only 4 seconds when the length is 1, i.e. "[*1]", but about 30 seconds when the length is up to 2, i.e. "[*1..2]".
Is there any way to tune this query?
First, your query uses the regex operator (=~), which can't use indexes. You should use the CONTAINS operator instead:
MATCH (node:Symptom)
WHERE node.symptom CONTAINS 'adult male' OR node.symptom CONTAINS '151'
RETURN node
You can also create an index: CREATE INDEX ON :Symptom(symptom)
For the second part of your query, as it stands, there is not much to be done: the cost comes from the complexity of what you are asking for.
So to get better performance, you should consider the following:
put the relationship type in the pattern to reduce the number of paths returned: (node)-[:MY_REL_TYPE*1..2]-(result:Disease)
put the relationship direction in the pattern to reduce the number of paths returned: (node)-[:MY_REL_TYPE*1..2]->(result:Disease)
find another way to reduce this complexity (filter on a relationship property, review your model, etc.)
For your information, you can write your query in one step (i.e. without the WITH, though in your case performance should be the same):
MATCH (node:Symptom)-[*1..2]-(result:Disease)
WHERE node.symptom CONTAINS 'adult male' OR node.symptom CONTAINS '151'
RETURN result
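The CONTAINS rewrite returns the same rows because =~ has to match the whole string, so '.*adult male.*|.*151.*' amounts to a substring test for either term. Here is a quick check in Python, using `re.fullmatch` as a stand-in for Cypher's whole-string match. This is an assumption that holds for single-line strings, and the sample strings are made up.

```python
import re

# Same pattern as in the question.
PATTERN = re.compile(r".*adult male.*|.*151.*")


def regex_hit(text):
    """Whole-string regex match, like the =~ operator."""
    return PATTERN.fullmatch(text) is not None


def contains_hit(text):
    """Substring test, like CONTAINS ... OR CONTAINS ..."""
    return "adult male" in text or "151" in text


samples = ["fever in adult male patient", "code 1512", "adult female", "male adult", ""]
for s in samples:
    print(repr(s), regex_hit(s), contains_hit(s))
```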

Cypher query with literal map syntax & dynamic keys

I'd like to make a cypher query that generates a specific json output. Part of this output includes an object with a dynamic amount of keys relative to the children of a parent node:
{
...
"parent_keystring" : {
child_node_one.name : child_node_one.foo
child_node_two.name : child_node_two.foo
child_node_three.name : child_node_three.foo
child_node_four.name : child_node_four.foo
child_node_five.name : child_node_five.foo
}
}
I've tried to create a cypher query but I do not believe I am close to achieving the desired output mentioned above:
MATCH (n)-[relone:SPECIFIC_RELATIONSHIP]->(child_node)
WHERE n.id='839930493049039430'
RETURN n.id AS id,
n.name AS name,
labels(n)[0] AS type,
{
COLLECT({
child.name : children.foo
}) AS rel_two_representation
} AS parent_keystring
I had planned for children.foo to be a count of how many times each particular relationship/child of the parent occurs. Is there a way to make use of the reduce function, so that a report would be generated by analyzing the array proposed below? I.e. the report would be a JSON object where each key is a distinct RELATIONSHIP and the value is the number of times that relationship stems from the parent node.
Thank you greatly in advance for guidance you can offer.
I'm not sure that Cypher will let you use a variable to determine an object's key. Would using an Array work for you?
COLLECT([child.name, children.foo]) AS rel_two_representation
I think the Neo4j Server API output should be treated like any other database output (MySQL, for example). Even if the desired output can be achieved with default functionality, shaping it this way is not a natural job for the database.
You should probably look into creating your own server plugin. This allows you to implement any custom logic with the desired output.
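If the array form suggested in the first answer is acceptable, the dynamic-key object can be assembled on the client after the query returns. This is a minimal sketch in Python. The helper names and sample rows are hypothetical, and the rows stand in for the [child.name, children.foo] pairs.

```python
import json
from collections import Counter


def pairs_to_object(pairs):
    """Convert [[key, value], ...] rows into a single JSON-ready dict."""
    return {key: value for key, value in pairs}


def count_by_key(keys):
    """Count how many times each key (e.g. a relationship type) occurs."""
    return dict(Counter(keys))


# Made-up rows in the shape COLLECT([child.name, children.foo]) would return.
rows = [["child_one", 3], ["child_two", 1]]
report = {"parent_keystring": pairs_to_object(rows)}
print(json.dumps(report))

# Made-up relationship types, counted for the report described in the question.
rel_counts = count_by_key(["KNOWS", "OWNS", "KNOWS"])
print(json.dumps(rel_counts))
```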

Parse a big file and populate a Neo4j database

I am working on a Ruby on Rails project that will read and parse a somewhat big text file (around 100k lines) and build Neo4j nodes (I am using Neography) with that data.
This is the Neo4j related fraction of the code I wrote:
d= Neography::Rest.new.execute_query("MATCH (n:`Label`) WHERE (n.`name`='#{id}') RETURN n")
d= Neography::Node.load(d, @neo)
p= Neography::Rest.new.create_node("name" => "#{id}")
Neography::Rest.new.add_label(p, "LabelSample")
d=Neography::Rest.new.get_node(d)
Neography::Rest.new.create_relationship("belongs_to", p, d)
So, what I want to do is: search the already populated DB for the node with the same "name" field as the parsed data, create a new node for this data, and finally create a relationship between the two.
Obviously this code takes far too much time to execute.
So I tried with Neography's batch, but I ran into some issues.
p = Neography::Rest.new.batch [:create_node, {"name" => "#{id}"}]
gave me a "undefined method `split' for nil:NilClass" in
id["self"].split('/').last
d=Neography::Rest.new.batch [:get_node, d]
gives me a Neography::UnknownBatchOptionException for get_node
I am not even sure this will save me enough time either.
I also tried different ways to do this, using Batch Import for example, but I couldn't find a way to get the already created node I need from the db.
As you can see I'm kinda new to this so any help will be appreciated.
Thanks in advance.
You might be able to do this with pure cypher (or neography generated cypher). Something like this perhaps:
MATCH (n:Label) WHERE n.name={id}
WITH n
CREATE (p:LabelSample {name: n.name})-[:belongs_to]->(n)
Note that I'm using CREATE, but if you don't want to create duplicate LabelSample nodes you could do:
MATCH (n:Label) WHERE n.name={id}
WITH n
MERGE (p:LabelSample {name: n.name})
CREATE (p)-[:belongs_to]->(n)
Note that I'm using params, which are generally recommended for performance (though this is just one query, so it's not as big of a deal).
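For a file of around 100k lines, the same query can be run over batches of names instead of once per line. The sketch below is in Python rather than Ruby and only prepares the batches. The `chunked` helper, the batch size of 1000, and the generated names are assumptions. The query uses the newer `$names` parameter form, which older servers would need written as `{names}`.

```python
# One statement handles a whole batch of parsed names via UNWIND.
BATCH_QUERY = """
UNWIND $names AS name
MATCH (n:Label {name: name})
MERGE (p:LabelSample {name: name})
CREATE (p)-[:belongs_to]->(n)
"""


def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Made-up names standing in for the values parsed from the text file.
names = [f"id{i}" for i in range(2500)]
batches = [{"names": chunk} for chunk in chunked(names, 1000)]
print(len(batches), "batches")
```

Each element of `batches` is a parameter map for one execution of BATCH_QUERY, so 2500 lines become three round trips instead of several thousand REST calls.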

Resources