Neo4j / Cypher: aggregate using reduce - neo4j

I'm trying to do some basic similarity search in a neo4j database. It looks something like this:
begin
create (_1:`Article` {`name`:"Bow", `weight`:"20"})
create (_2:`Article` {`name`:"Shield", `weight`:"30"})
create (_3:`Article` {`name`:"Knife", `weight`:"40"})
create (_4:`Article` {`name`:"Sword", `weight`:"50"})
create (_5:`Article` {`name`:"Helmet", `weight`:"15"})
create (_6:`Order` {`customer`:"Peter"})
create (_7:`Order` {`customer`:"Paul"})
create (_8:`Order` {`customer`:"Mary"})
create (_9:`Accessory` {`name`:"Arrow",`type`:"optional", `weight`:"2"})
create _6-[:`CONTAINS` {`timestamp`:"1413204480"}]->_1
create _6-[:`CONTAINS` {`timestamp`:"1413204480"}]->_2
create _6-[:`CONTAINS` {`timestamp`:"1413204480"}]->_3
create _7-[:`CONTAINS` {`timestamp`:"1413204480"}]->_1
create _7-[:`CONTAINS` {`timestamp`:"1413204480"}]->_4
create _8-[:`CONTAINS` {`timestamp`:"1413204480"}]->_5
create _9-[:`BELONGS_TO` {`timestamp`:"1413204480"}]->_1
;
commit
Pretty pointless database, I know. Sole reason for it is this post.
When a new order comes in, I need to find out, whether or not similar orders have already been placed. Similar means: existing or new customer and same products. The hard part is: I need the sum of the weight of all (directly or indirectly) contained nodes.
Here's what I have:
START n=node(*)
MATCH p1 = (a:Order)-[*]->n, p2 = (b:Order)-[*]->m
WHERE a<>b AND n.name = m.name
RETURN reduce (sum="", x in p2 | sum+x.weight) limit 25;
However, it seems that p2 is not the right thing to aggregate across. Cypher expects a collection and not a path.
Truly sorry for this newbie post, but rest assured: I am a very grateful newbie. Thanks!
rene

In your query, it looks like you are pretending that p2 is the path of you new order. I presume that in your actual query you will be binding b to a specific node. Also, your timestamps and weights should have numeric values (without the quotes).
This query will return the total weight of the "new" path. Is this what you wanted?
MATCH p1 =(a:Order)-[*]->n, p2 =(b:Order)-[*]->m
WHERE a<>b AND n.name = m.name
WITH p2, collect(m) AS ms
RETURN reduce(sum=0, x IN ms | sum+x.weight)
LIMIT 25;
By the way, START n=node(*) is superfluous. It is the same as using an unbound n.
See this console.

Related

Find all relations starting with a given node

In a graph where the following nodes
A,B,C,D
have a relationship with each nodes successor
(A->B)
and
(B->C)
etc.
How do i make a query that starts with A and gives me all nodes (and relationships) from that and outwards.
I do not know the end node (C).
All i know is to start from A, and traverse the whole connected graph (with conditions on relationship and node type)
I think, you need to use this pattern:
(n)-[*]->(m) - variable length path of any number of relationships from n to m. (see Refcard)
A sample query would be:
MATCH path = (a:A)-[*]->()
RETURN path
Have also a look at the path functions in the refcard to expand your cypher query (I don't know what exact conditions you'll need to apply).
To get all the nodes / relationships starting at a node:
MATCH (a:A {id: "id"})-[r*]-(b)
RETURN a, r, b
This will return all the graphs originating with node A / Label A where id = "id".
One caveat - if this graph is large the query will take a long time to run.

Create relationships in Neo4j

I have a graph with about 800k nodes and I want to create random relationships among them, using Cypher.
Examples like the following didn't work because the cartesian product is too big:
match (u),(p)
with u,p
create (u)-[:LINKS]->(p);
For example I want 1 relationship for each node (800k), or 10 relationships for each node (8M).
In short, I need a query Cypher in order to UNIFORMLY create relationships between nodes.
Does someone know the query to create relationships in this way?
So you want every node to have exactly x relationships? Try this in batches until no more relationships are updated:
MATCH (u),(p) WHERE size((u)-[:LINKS]->(p)) < {x}
WITH u,p LIMIT 10000 WHERE rand() < 0.2 // LIMIT to 10000 then sample
CREATE (u)-[:LINKS]->(p)
This should work (assuming your neo4j server has enough memory):
MATCH (n)
WITH COLLECT(n) AS ns, COUNT(n) AS len
FOREACH (i IN RANGE(1, {numLinks}) |
FOREACH (x IN ns |
FOREACH(y IN [ns[TOINT(RAND()*len)]] |
CREATE (x)-[:LINK]->(y) )));
This query collects all nodes, and uses nested loops to do the following {numLinks} times: create a LINK relationship between every node and a randomly chosen node.
The innermost FOREACH is used as a workaround for the current Cypher limitation that you cannot put an operation that returns a node inside a node pattern. To be specific, this is illegal: CREATE (x)-[:LINK]->(ns[TOINT(RAND()*len)]).

Optimizing Cypher Query Neo4j

I want to write a query in Cypher and run it on Neo4j.
The query is:
Given some start vertexes, walk edges and find all vertexes that is connected to any of start vertex.
(start)-[*]->(v)
for every edge E walked
if startVertex(E).someproperty != endVertex(E).someproperty, output E.
The graph may contain cycles.
For example, in the graph above, vertexes are grouped by "group" property. The query should return 7 rows representing the 7 orange colored edges in the graph.
If I write the algorithm by myself it would be a simple depth / breadth first search, and for every edge visited if the filter condition is true, output this edge. The complexity is O(V+E)
But I can't express this algorithm in Cypher since it's very different language.
Then i wrote this query:
find all reachable vertexes
(start)-[*]->(v), reachable = start + v.
find all edges starting from any of reachable. if an edge ends with any reachable vertex and passes the filter, output it.
match (reachable)-[]->(n) where n in reachable and reachable.someprop != n.someprop
so the Cypher code looks like this:
MATCH (n:Col {schema:"${DWMDATA}",table:"CHK_P_T80_ASSET_ACCT_AMT_DD"})
WITH n MATCH (n:Col)-[*]->(m:Col)
WITH collect(distinct n) + collect(distinct m) AS c1
UNWIND c1 AS rn
MATCH (rn:Col)-[]->(xn:Col) WHERE rn.schema<>xn.schema and xn in c1
RETURN rn,xn
The performance of this query is not good as I thought. There are index on :Col(schema)
I am running neo4j 2.3.0 docker image from dockerhub on my windows laptop. Actually it runs on a linux virtual machine on my laptop.
My sample data is a small dataset that contains 0.1M vertexes and 0.5M edges. For some starting nodes it takes 60 or more seconds to complete this query. Any advice for optimizing or rewriting the query? Thanks.
The following code block is the logic I want:
VertexQueue1 = (starting vertexes);
VisitedVertexSet = (empty);
EdgeSet1 = (empty);
While (VertexSet1 is not empty)
{
Vertex0 = VertexQueue1.pop();
VisitedVertexSet.add(Vertex0);
foreach (Edge0 starting from Vertex0)
{
Vertex1 = endingVertex(Edge0);
if (Vertex1.schema <> Vertex0.schema)
{
EdgeSet1.put(Edge0);
}
if (VisitedVertexSet.notContains(Vertex1)
and VertexQueue1.notContains(Vertex1))
{
VertexQueue1.push(Vertex1);
}
}
}
return EdgeSet1;
EDIT:
The profile result shows that expanding all paths has a high cost. Looking at the row number, it seems that Cypher exec engine returns all paths but I want distint edge list only.
LEFT one:
match (start:Col {table:"F_XXY_DSMK_ITRPNL_IDX_STAT_W"})
,(start)-[*0..]->(prev:Col)-->(node:Col)
where prev.schema<>node.schema
return distinct prev,node
RIGHT one:
MATCH (n:Col {schema:"${DWMDATA}",table:"CHK_P_T80_ASSET_ACCT_AMT_DD"})
WITH n MATCH (n:Col)-[*]->(m:Col)
WITH collect(distinct n) + collect(distinct m) AS c1
UNWIND c1 AS rn
MATCH (rn:Col)-[]->(xn:Col) WHERE rn.schema<>xn.schema and xn in c1
RETURN rn,xn
I think Cypher lets this be much easier than you're expecting it to be, if I'm understanding the query. Try this:
MATCH (start:Col {schema:"${DWMDATA}",table:"CHK_P_T80_ASSET_ACCT_AMT_DD"})-->(node:Col)
WHERE start.schema <> node.schema
RETURN start, node
Though I'm not sure why you're comparing the schema property on the nodes. Isn't the schema for the start node fixed by the value that you pass in?
I might not be understanding the query though. If you're looking for more than just the nodes connected to the start node, you could do:
MATCH
(start:Col {schema:"${DWMDATA}",table:"CHK_P_T80_ASSET_ACCT_AMT_DD"})
(start)-[*0..]->(prev:Col)-->(node:Col)
WHERE prev.schema <> node.schema
RETURN prev, node
That open-ended variable length relationship specification might be slow, though.
Also note that when Cypher is browsing a particular path it stops which it finds that it's looped back onto some node (EDIT relationship, not node) in the path matched so far, so cycles aren't really a problem.
Also, is the DWMDATA value that you're passing in interpolated? If so, you should think about using parameters for security / performance:
http://neo4j.com/docs/stable/cypher-parameters.html
EDIT:
Based on your comment I have a couple of thoughts. First limiting to DISTINCT path isn't going to help because every path that it finds is distinct. What you want is the distinct set of pairs, I think, which I think could be achieved by just adding DISTINCT to the query:
MATCH
(start:Col {schema:"${DWMDATA}",table:"CHK_P_T80_ASSET_ACCT_AMT_DD"})
(start)-[*0..]->(prev:Col)-->(node:Col)
WHERE prev.schema <> node.schema
RETURN DISTINT prev, node
Here is another way to go about it which may or may not be more efficient, but might at least give you an idea for how to go about things differently:
MATCH
path=(start:Col {schema:"${DWMDATA}",table:"CHK_P_T80_ASSET_ACCT_AMT_DD"})-->(node:Col)
WITH rels(path) AS rels
UNWIND rels AS rel
WITH DISTINCT rel
WITH startNode(rel) AS start_node, endNode(rel) AS end_node
WHERE start_node.schema <> end_node.schema
RETURN start_node, end_node
I can't say that this would be faster, but here's another way to try:
MATCH (start:Col)-[*]->(node:Col)
WHERE start.property IN {property_values}
WITH collect(ID(node)) AS node_ids
MATCH (:Col)-[r]->(node:Col)
WHERE ID(node) IN node_ids
WITH DISTINCT r
RETURN startNode(r) AS start_node, endNode(r) AS end_node
I suspect that the problem in all cases is with the open-ended variable length path. I've actually asked on the Slack group to try to get a better understanding of how it works. In the meantime, for all the queries that you try I would suggest prefixing them with the PROFILE keyword to get a report from Neo4j on what parts of the query are slow.
// this is very inefficient!
MATCH (start:Col)-[*]->(node:Col)
WHERE start.property IN {property_values}
WITH distinct node
MATCH (prev)-[r]->(node)
RETURN distinct prev, node;
you might be better off with this:
MATCH (start:Col)
WHERE start.property IN {property_values}
MATCH (node:Col)
WHERE shortestPath((start)-[*]->(node)) IS NOT NULL
MATCH (prev)-[r]->(node)
RETURN distinct prev, node;

Limiting the number of match collection elements in neo4j using cypher

I have a big amounts of nodes that have outgoing relations to even bigger amount of nodes. I want to be able to
query for a limited amount of starting nodes, returning with it the related nodes, but the related nodes should also be limited in numbers.
Is this possible in neo4j 1.9?
For example create these nodes and have an auto index on name:
CREATE p = (bar{company:'Bar1'})<-[:FREQUENTS]-(andres {name:'Andres'})-[:WORKS_AT]->(neo{company:'Neo1'})
WITH andres
CREATE (restaurant{company:'Restaurant1'})<-[:FREQUENTS]-(andres)-[:WORKS_AT]-(lib{company:'Library'}) ;
CREATE p = (bar{company:'Bar2'})<-[:FREQUENTS]-(todd {name:'Todd'})-[:WORKS_AT]->(neo{company:'Neo2'})
WITH todd
CREATE (restaurant{company:'Restaurant2'})<-[:FREQUENTS]-(todd)-[:WORKS_AT]-(lib{company:'Library2'}) ;
CREATE p = (bar{company:'Bar3'})<-[:FREQUENTS]-(hank {name:'Hank'})-[:WORKS_AT]->(neo{company:'Neo3'})
WITH hank
CREATE (restaurant{company:'Restaurant3'})<-[:FREQUENTS]-(hank)-[:WORKS_AT]-(lib{company:'Library3'}) ;
What I would like is something like:
START p=node:node_auto_index('*:*')
MATCH p-[:WORKS_AT]-> c, p-[:FREQUENTS]-> f
RETURN p, collect(distinct c.company), collect(distinct f.company) LIMIT 2;
To return 2 rows and have the collections limited to one, but without using the function on the collections, tried that on a large
data set and it becomes extremely slow. So some way to LIMIT the matches..
If this is not possible in neo4j 1.9, would there be a solution in neo4j 2.0?
Can you try something like this:
START p=node:node_auto_index('*:*')
RETURN p,
head(extract(path in p-[:WORKS_AT]->() : head(tail(nodes(path))))) as work_company,
head(extract(path in p-[:FREQUENTS]->() : head(tail(nodes(path))))) as visit_company
The head function on the extracted path node should be lazy so it pulls only the first one from the pattern match
If you look at the profiling output you should see that it touches only the first node each.
It could be that the : query triggers some very large operations in the indexing layer, rather than being lazy.. I would try something like this:
START p=node:node_auto_index('*:*')
WITH p LIMIT 2
MATCH p-[:WORKS_AT]-> c, p-[:FREQUENTS]-> f return p, collect(distinct c.company), collect(distinct f.company)

Getting multiple relationship property returns from a single cypher call

I'm new to cypher, neo4j and graph databases in general. The data model I was given to work with is a tad confusing, but it looks like the nodes are just GUID placeholders with all the real "data" as properties on relationships to the nodes (which relates each node back to node zero).
Each node (which basically only has a guid) has a dozen relations with key/value pairs that are the actual data I need. (I'm guessing this was done for versioning?.. )
I need to be able to make a single cypher call to get properties off two (or more) relationships connected to the same node -- here is two calls that I'd like to make into one call;
start n = Node(*)
match n-[ar]->m
where has(ar.value) and has(ar.proptype) and ar.proptype = 'ccid'
return ar.value
and
start n = Node(*)
match n-[br]->m
where has(br.value) and has(br.proptype) and br.proptype = 'description'
return br.value
How do I go about doing this in a single cypher call?
EDIT - for clarification;
I want to get the results as a list of rows with a column for each value requested.
Something returned like;
n.id as Node, ar.value as CCID, br.value as Description
Correct me if I'm wrong, but I believe you can just do this:
start n = Node(*)
match n-[r]->m
where has(r.value) and has(r.proptype) and (r.proptype = 'ccid' or r.proptype = 'description')
return r.value
See here for more documentation on Cypher operations.
Based on your edits I'm guessing that you're actually looking to find cases where a node has both a ccid and a description? The question is vague, but I think this is what you're looking for.
start n = Node(*)
match n-[ar]->m, n-[br]->m
where (has(ar.value) and has(ar.proptype) and ar.proptype = 'ccid') and
(has(br.value) and has(br.prototype) and br.proptype = 'description')
return n.id as Node, ar.value as CCID, br.value as Description
You can match relationships from two sides:
start n = Node(*)
match m<-[br]-n-[ar]->m
where has(ar.value) and has(ar.proptype) and ar.proptype = 'ccid' and
has(br.value) and has(br.proptype) and br.proptype = 'description'
return ar.value, br.value

Resources