Create Neo4j Nodes by passing a data variable in Python - neo4j

I work with a large data set in Python and would like to create Neo4j nodes out of a data array within Python. So, my naive attempt to do so would be something like the following.
(In Python script)
Product_IDs = data_array[1:1000] # This contains a list of product IDs
tot_node_num = len(Product_IDs) # It states the total number of product IDs
graph = Graph()
tx = graph.cypher.begin()
tx.append("FOREACH (r IN range(1,tot_node_num) | CREATE (:Product {ID:Product_IDs[r]}))")
tx.commit()
With the above statement, the variables tot_node_num and Product_IDs are not recognized. How can I pass an array that I created in my Python script to Cypher in order to create nodes in the Neo4j graph database?
Thank you!

You're absolutely right - the best way to pass in variables is via parameters. Bear in mind however that while this works for expressions and property values, parameters cannot be used for labels and relationship types. To help with this, py2neo provides the cypher_escape function (http://py2neo.org/2.0/cypher.html#py2neo.cypher.cypher_escape):
>>> from py2neo.cypher import cypher_escape
>>> rel_type = "KNOWS WELL"
>>> "MATCH (a)-[:%s]->(b) RETURN a, b" % cypher_escape(rel_type)
'MATCH (a)-[:`KNOWS WELL`]->(b) RETURN a, b'
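As a minimal sketch of the parameter approach applied to the question's use case (this is not taken from the answer itself): the whole list is sent as one parameter and unrolled on the server with UNWIND. The helper name build_create_products is hypothetical. The `$ids` placeholder assumes Neo4j 3.0 or later; Neo4j 2.x, which py2neo 2.0 targets, writes parameters as `{ids}` instead.

```python
def build_create_products(product_ids, legacy_syntax=False):
    """Return a (statement, parameters) pair that creates one :Product node per ID.

    Hypothetical helper for illustration; it only builds the statement and
    never talks to a database.
    """
    # Neo4j 2.x uses {name} placeholders, later versions use $name.
    placeholder = "{ids}" if legacy_syntax else "$ids"
    statement = "UNWIND %s AS id CREATE (:Product {ID: id})" % placeholder
    return statement, {"ids": list(product_ids)}


data_array = ["P0", "P1", "P2", "P3"]  # stand-in for the real data array
statement, params = build_create_products(data_array[1:3])
print(statement)
print(params)
```

With py2neo 2.0 the pair would then be handed to the transaction, roughly as tx.append(statement, params) with legacy_syntax=True, assuming the append(statement, parameters) form of that release.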

Related

How optimised is this Cypher query?

[Edit] I'm using Neo4j 4.2.1
I have this need for a Cypher query that brings back a complete tree given its root node. All nodes and relationships must be fetched and present only once in the returned sets. Here's what I have come to:
MATCH p = (n)-[*..]->(m)
WHERE id(n) = 0
WITH relationships(p) AS r
WITH distinct last(r) as rel
WITH [node IN [startNode(rel), endNode(rel)] | node] AS tmp, rel
UNWIND tmp AS node
RETURN collect(DISTINCT node) AS nodes, collect(distinct rel) AS relationships;
Running the query on our database to get about 820 nodes makes it crash for lack of memory (5 GB allowed). Hard to believe. So I'm wondering: is this query ill-conceived? Is there a technique I'm using that shouldn't be used for my purpose?
I strongly recommend that you come up with a node property that is guaranteed to be the same on all the nodes in a contiguous tree, if you don't have one already. I'll call that property same_prop. Here's what I do to run queries like the one you're running:
1. Index same_prop. If you have different node labels, you need to create this index for each node label you expect to have in the tree.
CREATE INDEX samepropnode FOR (n:your_label) ON (n.same_prop)
is the kind of thing you need in Neo4j 4+. In Neo4j, indices are cheap, and can sometimes speed up queries quite a bit.
2. Collect all possible values of same_prop and store them in a text file (I use tab-separated values as safer than comma-separated values).
3. Use the Python driver, or a driver for your language of choice (I strongly recommend Neo4j-provided drivers, not third-party ones), to write wrapper code that executes a Cypher query something like this:
MATCH (p)-->(c)
USING INDEX p:your_label(same_prop)
WHERE p.same_prop IN [ same_prop_list ]
RETURN DISTINCT
p.datapiece1 AS `first_parent_datapiece`,
p.datapiecen AS `nth_parent_datapiece`,
c.datapiece1 AS `first_child_datapiece`,
c.datapiecen AS `nth_child_datapiece`
It's not a good idea, in general, to return nodes and relationships unless you're debugging.
4. Then in your Python (for example) code, you simply read in all your same_prop values from the file you got in Step 2, split them into reasonably sized chunks, maybe 1,000 or 10,000 values each, and substitute each chunk for the [ same_prop_list ] in the Cypher query on the fly.
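The chunking step can be sketched as follows. Note one swap from the description above: each chunk is passed as a query parameter ($same_prop_list) instead of being substituted into the query text. The helper name chunked, the generated sample values, and the shortened RETURN clause are all assumptions made for illustration.

```python
def chunked(values, size):
    """Yield consecutive slices of `values` holding at most `size` items."""
    for start in range(0, len(values), size):
        yield values[start:start + size]


# Shortened version of the query above, with the list passed as a parameter.
QUERY = (
    "MATCH (p:your_label)-->(c) "
    "WHERE p.same_prop IN $same_prop_list "
    "RETURN DISTINCT p.datapiece1 AS first_parent_datapiece, "
    "c.datapiece1 AS first_child_datapiece"
)

# Stand-in for the values read from the file created in Step 2.
same_prop_values = ["v%d" % i for i in range(25)]

# One parameter dictionary per chunk; each would be sent with QUERY.
batches = [{"same_prop_list": chunk} for chunk in chunked(same_prop_values, 10)]
print([len(batch["same_prop_list"]) for batch in batches])
```

With the official Python driver, each dictionary would be passed along with the query text, for example session.run(QUERY, batch).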

Neo4j / Cypher: Returning sum of value in relationship between nodes within the node itself

There are two node types, Account and Transfer. A Transfer signifies movement of funds between Account nodes. Transfer nodes may have any number of input and output nodes. For example, three Accounts could each send $40 ($120 combined) to sixteen other Accounts in any way they please and it would work.
The Transfer object, as is, does not have the sum of the funds sent or received - those are only stored in the relationships themselves. I'd like to calculate this in the Cypher query and return it as part of the returned Transfer object, not separately. (Similar to a SQL JOIN)
I'm rather new to Neo4j + Cypher; So far, the query I've got is this:
MATCH (tf:Transfer {id:'some_id'})
MATCH (tf)<-[in:IN_TO]-(in_account:Account)
MATCH (tf)-[out:OUT_TO]->(out_account:Account)
RETURN tf,in_account,in,out_account,out, sum(in.value) as sum_in, sum(out.value) as sum_out
If I managed this database, I'd just precalculate the sums and store it in the Transfer properties - but that's not an option at this time.
tl;dr: I'd like to store sum_in and sum_out in the returned tf object.
Tore Eschliman's answer is very insightful, especially on the properties of aggregated aliases.
I came up with a more hackish solution, that could work in this case.
Example data set:
CREATE
(a1:Account),
(a2:Account),
(a3:Account),
(tf:Transfer),
(a1)-[:IN_TO {value: 110}]->(tf),
(a2)-[:IN_TO {value: 230}]->(tf),
(tf)-[:OUT_TO {value: 450}]->(a3)
Query:
MATCH (in_account:Account)-[in:IN_TO]->(tf:Transfer)
WITH tf, SUM(in.value) AS sum_in
SET tf.sum_in = sum_in
RETURN tf
UNION
MATCH (tf:Transfer)-[out:OUT_TO]->(out_account:Account)
WITH tf, SUM(out.value) AS sum_out
SET tf.sum_out = sum_out
RETURN tf
Results:
╒═══════════════════════════╕
│tf │
╞═══════════════════════════╡
│{sum_in: 340, sum_out: 450}│
└───────────────────────────┘
Note that UNION performs a set union (as opposed to UNION ALL, which performs a multiset/bag union), hence we will not have duplicates in the results.
Update: as Tore Eschliman pointed out in the comments, this solution will modify the database. As a workaround, you can collect the results and abort the transaction afterwards.
When you use an aggregation like SUM, you have to leave the aggregated aliases out of your result row, or you'll end up with single-row sums. This should help you get something closer to what you want, including a workaround for your dynamic property assignment:
CREATE (temp)
WITH temp
MATCH (tf:Transfer {id:'some_id'})
MATCH (tf)<-[in:IN_TO]-(in_account:Account)
MATCH (tf)-[out:OUT_TO]->(out_account:Account)
SET temp += PROPERTIES(tf)
WITH temp, SUM(in.value) AS sum_in, SUM(out.value) AS sum_out, COLLECT(in_account) AS in_accounts, COLLECT(out_account) AS out_accounts
SET temp.sum_in = sum_in
SET temp.sum_out = sum_out
WITH temp, PROPERTIES(temp) AS props, in_accounts, out_accounts
DELETE temp
RETURN props, in_accounts, out_accounts
You are creating a dummy node to hold properties, because that's the only way to assign dynamic properties to existing Maps or Map-alikes, but the node won't ever be committed to the graph. This query should return a Map of the :Transfer's properties, with the in- and out-sums included, plus lists of the in- and out-accounts in case you need to do any additional work on them.
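One thing that may be worth checking before relying on these sums: because both MATCH clauses feed the same aggregation, each incoming relationship is paired with each outgoing one, so a value can be counted more than once when a Transfer has several inputs or outputs. The sketch below imitates that row pairing in plain Python with the values from the example data set above. It illustrates the row arithmetic only and is not output from Neo4j.

```python
from itertools import product

in_values = [110, 230]  # IN_TO values from the example data set
out_values = [450]      # OUT_TO values from the example data set

# Two consecutive MATCH clauses yield one row per (in, out) combination.
rows = list(product(in_values, out_values))

# Summing over the paired rows repeats each value once per partner row.
paired_sum_in = sum(i for i, _ in rows)
paired_sum_out = sum(o for _, o in rows)

# Aggregating each side on its own avoids the repetition.
separate_sum_in = sum(in_values)
separate_sum_out = sum(out_values)

print(paired_sum_in, paired_sum_out)
print(separate_sum_in, separate_sum_out)
```

In Cypher, aggregating one side in a WITH clause before matching the other side is one way to obtain the separate sums.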

How to optimize the calculation of the shortest time in a certain path in Neo4j?

I have a database in Neo4j of modules that I imported through CSV. The data looks something like this. Each module has its name, the module that is its successor, an average time duration and another duration called medtime.
I have been able to import the data and to set the relationships through a Cypher Query script that looks like this:
LOAD CSV WITH HEADERS FROM "file:c:/users/Skelo/Desktop/Neo4J related/Statistic Dependencies/Simple.csv" AS row FIELDTERMINATOR ';'
CREATE (n:Module)
SET n = row, n.name = row.name, n.mafter = row.mafter, n.avgtime = row.avgtime, n.medtime = row.medtime
WITH n
RETURN n
Then I have set the relationships like this:
Match (p:Module),(q:Module)
Where p.mafter = q.name
Merge (p)-[:PRECEEDS]->(q)
Return p,q
Now to the point. I want to calculate the shortest path from a certain module to another, more specifically the time that it takes to get from a module to another and for this, I use the more or less copied part of the script from
http://www.neo4j.org/graphgist?8412907 and that is
MATCH p = (trop:Module {name:'BLSACXAMT0A_00'})-[prec:PRECEEDS*]->(hop:Module {name:'BL_LOAD_CLOSE'})
WITH p, REDUCE(x = 0, a IN NODES(p) | x + a.avgtime) AS cum_duration
ORDER BY cum_duration DESC
LIMIT 1
RETURN cum_duration AS `Total Average Time`
This, however, takes about 50 seconds to execute, and that is outrageous. You can see it on the screenshot right below. The number of modules imported into the database is only about 2000, and what I want to achieve is to work with more than 50 000 nodes and perform such tasks much faster.
The other issue is that the results are somehow suspicious. The format looks wrong: every number I have in the database has at most 4 digits after the decimal point and I am only adding these values to zero, so if the result looks like this: 00103,68330,51670, I have serious doubts. Please help me: if it is wrong, why is it so, and what can I do to correct it?
Neo4j claims that it is efficient and fast, therefore I presume that the fault is in my code (the performance of my computer is more than enough). Please, if you can, help me to shorten this time and explain the patterns needed to do this.
A few observations that should help:
You have several errors in how you are importing. These errors will create many more nodes than you think, and create the "suspicious" issue you raised:
Your file has multiple rows with the same name, but your import is creating a new Module node every time. Therefore, you are ending up with multiple nodes for some of your modules. You should be using MERGE instead of CREATE.
Your mafter property needs to contain a collection of strings, not a single string.
You are importing the numeric values as strings, so code such as x + a.avgtime is just doing string concatenation, not numeric addition. Furthermore, even if you did attempt to convert your strings to numbers, that would fail because your numbers use a comma instead of a period to indicate the decimal place.
Try this for importing (into an empty DB):
LOAD CSV WITH HEADERS FROM "file:c:/users/Skelo/Desktop/Neo4J related/Statistic Dependencies/Simple.csv" AS row FIELDTERMINATOR ';'
MERGE (n:Module {name: row.name})
ON CREATE SET
n.mafter = [row.mafter],
n.avgtime = TOFLOAT(REPLACE(row.avgtime, ',', '.')),
n.medtime = TOFLOAT(REPLACE(row.medtime, ',', '.'))
ON MATCH SET
n.mafter = n.mafter + row.mafter;
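To make the string-versus-number point concrete, here is a small Python sketch of the same conversion that TOFLOAT(REPLACE(...)) performs in the import above. The two sample values are made up in the style of the CSV, and to_float is a hypothetical helper name.

```python
def to_float(text):
    """Parse a number written with a decimal comma, e.g. '103,6833'."""
    return float(text.replace(",", "."))


raw = ["103,6833", "0,5167"]  # made-up sample values with decimal commas

# "Adding" the raw strings only concatenates them.
as_text = "0"
for value in raw:
    as_text += value

# Converting first gives an actual numeric sum.
as_number = 0.0
for value in raw:
    as_number += to_float(value)

print(as_text)
print(round(as_number, 4))
```

The concatenated text has the same shape as the suspicious result in the question, while the converted values add up as ordinary numbers.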
You also need to change your current merge query so that you can handle an mafter that is a collection. Note that the following query is designed to NOT create any new nodes (even if a name in mafter does not yet have a module node).
MATCH (p:Module)
OPTIONAL MATCH (p)-[:PRECEEDS]->(z:Module)
WITH p, COLLECT(z.name) AS existing
WITH p, filter(x IN p.mafter
WHERE NOT x IN existing) AS todo
MATCH (q:Module)
WHERE q.name IN todo
MERGE (p)-[:PRECEEDS]->(q)
RETURN p, q;
You should create an index to speed up the matching of modules by name:
CREATE INDEX ON :Module(name)
Cypher does have a shortestPath function, see http://neo4j.com/docs/stable/query-match.html#_shortest_path. However this calculates the shortest path based on the number of hops and does not take a weight into account.
Neo4j has a couple of graph algorithms on board, e.g. Dijkstra or AStar. Unfortunately these are not yet available via Cypher. Instead you have two alternatives for using them:
1) write an unmanaged extension to Neo4j and use GraphAlgoFactory in the implementation. This requires writing some Java code and deploying it to the Neo4j server. Using a custom CostEvaluator you can use the avgTime property on your nodes as the cost parameter.
2) use the REST API as documented on http://neo4j.com/docs/stable/rest-api-graph-algos.html#rest-api-execute-a-dijkstra-algorithm-and-get-a-single-path. This approach requires the weight to be a property on the relationship and not on a node (as in your data model).
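For readers who want to see what a weighted search does differently from a hop count, here is a minimal Dijkstra sketch in plain Python in which the cost sits on the nodes, as in the question's data model. It is not Neo4j's GraphAlgoFactory implementation; the function name cheapest_path, the module names, and the costs are made up for illustration, and costs are assumed to be non-negative.

```python
import heapq


def cheapest_path(successors, node_cost, start, goal):
    """Return (total_cost, path) of the cheapest route, or None if unreachable.

    The total includes the cost of both the start and the goal node.
    """
    best = {start: node_cost[start]}
    heap = [(node_cost[start], start, [start])]
    while heap:
        cost, node, path = heapq.heappop(heap)
        if node == goal:
            return cost, path
        if cost > best.get(node, float("inf")):
            continue  # stale entry, a cheaper route to this node was found
        for nxt in successors.get(node, []):
            new_cost = cost + node_cost[nxt]
            if new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                heapq.heappush(heap, (new_cost, nxt, path + [nxt]))
    return None


# Hypothetical modules: two routes from A to D with the same number of hops.
successors = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
node_cost = {"A": 1, "B": 5, "C": 2, "D": 1}
print(cheapest_path(successors, node_cost, "A", "D"))
```

Both routes have two hops, so a hop-based shortest path cannot tell them apart, while the weighted search picks the route through C.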

Neo4j / Cypher: aggregate using reduce

I'm trying to do some basic similarity search in a neo4j database. It looks something like this:
begin
create (_1:`Article` {`name`:"Bow", `weight`:"20"})
create (_2:`Article` {`name`:"Shield", `weight`:"30"})
create (_3:`Article` {`name`:"Knife", `weight`:"40"})
create (_4:`Article` {`name`:"Sword", `weight`:"50"})
create (_5:`Article` {`name`:"Helmet", `weight`:"15"})
create (_6:`Order` {`customer`:"Peter"})
create (_7:`Order` {`customer`:"Paul"})
create (_8:`Order` {`customer`:"Mary"})
create (_9:`Accessory` {`name`:"Arrow",`type`:"optional", `weight`:"2"})
create _6-[:`CONTAINS` {`timestamp`:"1413204480"}]->_1
create _6-[:`CONTAINS` {`timestamp`:"1413204480"}]->_2
create _6-[:`CONTAINS` {`timestamp`:"1413204480"}]->_3
create _7-[:`CONTAINS` {`timestamp`:"1413204480"}]->_1
create _7-[:`CONTAINS` {`timestamp`:"1413204480"}]->_4
create _8-[:`CONTAINS` {`timestamp`:"1413204480"}]->_5
create _9-[:`BELONGS_TO` {`timestamp`:"1413204480"}]->_1
;
commit
Pretty pointless database, I know. Sole reason for it is this post.
When a new order comes in, I need to find out, whether or not similar orders have already been placed. Similar means: existing or new customer and same products. The hard part is: I need the sum of the weight of all (directly or indirectly) contained nodes.
Here's what I have:
START n=node(*)
MATCH p1 = (a:Order)-[*]->n, p2 = (b:Order)-[*]->m
WHERE a<>b AND n.name = m.name
RETURN reduce (sum="", x in p2 | sum+x.weight) limit 25;
However, it seems that p2 is not the right thing to aggregate across. Cypher expects a collection and not a path.
Truly sorry for this newbie post, but rest assured: I am a very grateful newbie. Thanks!
rene
In your query, it looks like you are pretending that p2 is the path of your new order. I presume that in your actual query you will be binding b to a specific node. Also, your timestamps and weights should have numeric values (without the quotes).
This query will return the total weight of the "new" path. Is this what you wanted?
MATCH p1 =(a:Order)-[*]->n, p2 =(b:Order)-[*]->m
WHERE a<>b AND n.name = m.name
WITH p2, collect(m) AS ms
RETURN reduce(sum=0, x IN ms | sum+x.weight)
LIMIT 25;
By the way, START n=node(*) is superfluous. It is the same as using an unbound n.
See this console.
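For comparison, the same kind of fold can be written with Python's functools.reduce. The sketch uses the two articles in Paul's order from the sample data (Bow and Sword) and converts the string weights to integers first, in line with the advice above to store numeric values.

```python
from functools import reduce

# Articles in Paul's order, with weights as stored in the sample data (strings).
articles = [
    {"name": "Bow", "weight": "20"},
    {"name": "Sword", "weight": "50"},
]

# Start from 0 and add each article's numeric weight to the running total.
total = reduce(lambda acc, article: acc + int(article["weight"]), articles, 0)
print(total)
```

As in Cypher's reduce, the accumulator starts from a numeric 0 and the fold runs over a collection of items, not over a path.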

need help on cypher query

I need some help to do a cypher query.
In my Neo4j database I have element nodes which are linked by relation nodes (not relationships), and I would like to find all nodes that inherit from a given node. For example, if I have B-->A, C-->B and D-->A, where "-->" means "inherits from", I would like to retrieve B, C and D when I ask which elements inherit from A.
I have already written a Cypher query that works well on a single level (where I replace "A" with the node id):
Start
node=node(A)
match
(node)-[:IS_SOURCE_OF]->relation<-[:IS_TARGET_OF]-target
where
relation.relationType="INHERIT"
return target.uuid
This query returns B and D but I don't know how to return C as well.
Can someone help me, please?
Thanks a lot
Cypher allows variable-length matches on single relationships, but not the way you have designed your graph. To find the node C in your example you need to do:
Start node=node(A)
match (node)-[:IS_SOURCE_OF]->(r1)<-[:IS_TARGET_OF]-()-[:IS_SOURCE_OF]->(r2)<-[:IS_TARGET_OF]-(target)
where
r1.relationType="INHERIT" AND r2.relationType="INHERIT"
return target.uuid
However you should take a step back and rethink if you cannot model the inheritance relationship explicitly - in this case a single query catches all inherited nodes from a
start node=node(a)
match node-[:INHERITS*]->target
return target.uuid
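What a variable-length match computes here is a transitive closure over the inheritance links. A minimal Python sketch of that idea, using the B-->A, C-->B, D-->A example from the question (the function name all_inheritors is hypothetical):

```python
from collections import deque


def all_inheritors(inherits_from, root):
    """Return every element that inherits from `root`, directly or indirectly."""
    # Index the (child, parent) pairs by parent for quick lookup.
    children = {}
    for child, parent in inherits_from:
        children.setdefault(parent, []).append(child)

    found, queue, seen = [], deque([root]), {root}
    while queue:
        for child in children.get(queue.popleft(), []):
            if child not in seen:  # guards against cycles and duplicates
                seen.add(child)
                found.append(child)
                queue.append(child)
    return found


edges = [("B", "A"), ("C", "B"), ("D", "A")]  # (child, parent) pairs
print(sorted(all_inheritors(edges, "A")))
```

The single-level query corresponds to one pass of the inner loop; following the links until no new elements appear is what picks up C as well.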
