Limiting the number of match collection elements in neo4j using cypher

I have a large number of nodes with outgoing relationships to an even larger number of nodes. I want to
query for a limited number of starting nodes and return the related nodes with them, but the number of related nodes should also be limited.
Is this possible in neo4j 1.9?
For example, create these nodes with an auto index on name:
CREATE p = (bar{company:'Bar1'})<-[:FREQUENTS]-(andres {name:'Andres'})-[:WORKS_AT]->(neo{company:'Neo1'})
WITH andres
CREATE (restaurant{company:'Restaurant1'})<-[:FREQUENTS]-(andres)-[:WORKS_AT]->(lib{company:'Library'}) ;
CREATE p = (bar{company:'Bar2'})<-[:FREQUENTS]-(todd {name:'Todd'})-[:WORKS_AT]->(neo{company:'Neo2'})
WITH todd
CREATE (restaurant{company:'Restaurant2'})<-[:FREQUENTS]-(todd)-[:WORKS_AT]->(lib{company:'Library2'}) ;
CREATE p = (bar{company:'Bar3'})<-[:FREQUENTS]-(hank {name:'Hank'})-[:WORKS_AT]->(neo{company:'Neo3'})
WITH hank
CREATE (restaurant{company:'Restaurant3'})<-[:FREQUENTS]-(hank)-[:WORKS_AT]->(lib{company:'Library3'}) ;
What I would like is something like:
START p=node:node_auto_index('*:*')
MATCH p-[:WORKS_AT]-> c, p-[:FREQUENTS]-> f
RETURN p, collect(distinct c.company), collect(distinct f.company) LIMIT 2;
This should return 2 rows with each collection limited to one element, but without applying a function to the collections: I tried that on a large
data set and it became extremely slow. So I need some way to LIMIT the matches.
If this is not possible in neo4j 1.9, would there be a solution in neo4j 2.0?

Can you try something like this:
START p=node:node_auto_index('*:*')
RETURN p,
head(extract(path in p-[:WORKS_AT]->() : head(tail(nodes(path))))) as work_company,
head(extract(path in p-[:FREQUENTS]->() : head(tail(nodes(path))))) as visit_company
The head function on the extracted path nodes should be lazy, so it pulls only the first match from the pattern.
If you look at the profiling output, you should see that it touches only the first node for each pattern.

It could be that the *:* query triggers some very large operations in the indexing layer rather than being lazy. I would try something like this:
START p=node:node_auto_index('*:*')
WITH p LIMIT 2
MATCH p-[:WORKS_AT]-> c, p-[:FREQUENTS]-> f return p, collect(distinct c.company), collect(distinct f.company)
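Both answers rely on the same idea: cut down the set of start nodes before expanding, and stop reading related nodes as soon as enough have been seen. A minimal Python sketch of that order of operations follows. The dictionaries are a hypothetical in-memory stand-in for the example graph, and `limited_rows` is an illustrative name, not a Neo4j API.

```python
from itertools import islice

# Hypothetical in-memory stand-in for the example graph (not a Neo4j API).
WORKS_AT = {
    "Andres": ["Neo1", "Library"],
    "Todd": ["Neo2", "Library2"],
    "Hank": ["Neo3", "Library3"],
}
FREQUENTS = {
    "Andres": ["Bar1", "Restaurant1"],
    "Todd": ["Bar2", "Restaurant2"],
    "Hank": ["Bar3", "Restaurant3"],
}


def limited_rows(people, row_limit, per_collection_limit):
    """Yield (person, works, visits), limiting rows before expanding."""
    # Limit the start nodes first, similar in spirit to `WITH p LIMIT 2`.
    for person in islice(people, row_limit):
        # islice stops after the first few items instead of reading them all.
        works = list(islice(WORKS_AT.get(person, []), per_collection_limit))
        visits = list(islice(FREQUENTS.get(person, []), per_collection_limit))
        yield person, works, visits


rows = list(limited_rows(["Andres", "Todd", "Hank"], row_limit=2, per_collection_limit=1))
print(rows)
```

Whether Cypher evaluates the equivalent query this lazily depends on the version and the plan it picks, so checking the profile output is still worthwhile.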

Related

Remove automorphisms of a cypher query output

When doing a Cypher query to retrieve a specific subgraph with automorphisms, let's say
MATCH (a)-[:X]-(b)-[:X]-(c)
RETURN a, b, c
It seems that the default behaviour is to return every retrieved subgraph and all of its automorphisms.
In that example, if (u)-[:X]-(v)-[:X]-(w) is a graph matching the pattern, the output will be u,v,w but also w,v,u, which is the same graph.
Is there a way to retrieve each subgraph only once?
EDIT: It would be great if Cypher had a feature to do that during the search, using some kind of symmetry-breaking condition, as it would reduce the computing time. If that is not the case, how would you post-process the results to get the desired output?
In your query, (a)-[r:X]-(b) and (a)-[t:X]-(c) describe the same pattern, since (b) and (c) can be interchanged. Why match it twice? MATCH (a)-[r:X]-(b) RETURN a, r, b returns all the subgraphs you are looking for.
EDIT
You can do something like the following to find the nodes that have exactly two relationships of type X.
MATCH (a)-[r:X]-(b) WHERE size((a)-[:X]-()) = 2 RETURN a, r, b
For this kind of mirrored pattern, we can add a restriction on the internal graph ids so that only one of the two paths is kept:
MATCH (a)-[:X]-(b)-[:X]-(c)
WHERE id(a) < id(c)
RETURN a, b, c
This will also prevent the case where a = c.
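To see why the id comparison removes exactly the mirrored duplicates, here is a small Python sketch of the same symmetry-breaking idea. The edge list is hypothetical, and the sketch assumes a simple graph with no parallel edges or self-loops.

```python
# Hypothetical undirected graph: each pair is one X relationship between two node ids.
EDGES = [(1, 2), (2, 3), (2, 4)]


def build_adjacency(edges):
    """Map each node id to the set of its neighbours."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


def two_hop_matches(edges, break_symmetry):
    """Enumerate (a, b, c) with a-b and b-c connected, like (a)-[:X]-(b)-[:X]-(c)."""
    adjacency = build_adjacency(edges)
    matches = []
    for b in sorted(adjacency):
        for a in sorted(adjacency[b]):
            for c in sorted(adjacency[b]):
                if a == c:
                    continue  # would reuse the same relationship twice
                if break_symmetry and not a < c:
                    continue  # keep only one of (a, b, c) and (c, b, a)
                matches.append((a, b, c))
    return matches


all_matches = two_hop_matches(EDGES, break_symmetry=False)
unique_matches = two_hop_matches(EDGES, break_symmetry=True)
print(len(all_matches), unique_matches)
```

Every match appears twice without the condition and once with it, because exactly one of a < c and c < a holds when a and c differ.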

Neo4J match and create relationship is very very slow with few millions records

I have about 3.5M nodes with label A and about 400 nodes with label B.
Nodes with label B already have directed relationships like (b1:B)-[c:CONNECTS]->(b2:B). Now I need to add 3.5M relationships of another type by comparing the properties of the A nodes with the properties of the :CONNECTS relationships.
My statement looks like this:
MATCH (a:A)
MATCH (c:C)
MATCH (b1:B {id: a.a1_id})-[rl:CONNECTS*1..21]->(b2:B {id: a.b2_id}) WHERE ALL(x in rl WHERE x.connect_id = c.connect_id)
MATCH (new_a:B)-[r:TO]->(new_b:B) WHERE r in rl
CREATE (new_a)-[:TICKET {ticket_id: ID(a)}]->(new_b)
This statement is extremely slow and just hangs. I even tried some of the performance tuning mentioned here; in particular, I set the heap size to 16GB.
I find it quite strange that it can't handle this amount of data. What am I missing? I tried to model the data differently, reduce relationship queries, and use more schema indexes, but I could not change much because of the type of data I have and the type of query I want to run once all the data is loaded.
I also tried to use periodic commit while creating the A nodes with CSV import. It has the same issue.
I hope I am clear enough. I would really appreciate some inputs. Thanks.
What are the labels A, B, and C? A CONNECTS relationship is also free of meaning.
Queries like this are meant to be comprehensible, not the opposite!
// generates 3.5M rows
MATCH (a:A)
// generates x-times 3.5M rows
// you never use that C except for checking a connect id?
MATCH (c:C)
// many million times execute this variable length expand
MATCH (b1:B {id: a.a1_id})-[rl:CONNECTS*1..21]->(b2:B {id: b2_id})
WHERE ALL(x in rl WHERE x.connect_id = c.connect_id)
// lookup by relationship is very bad, esp. as you are looking over a cross product of all 400x400 B's
MATCH (new_a:B)-[r:TO]->(new_b:B) WHERE r in rl
// why do you store the id of a on this self!!-relationship?
CREATE (new_b)-[:TICKET {ticket_id: ID(a)}]->(new_b);
Where does b2_id come from?
Perhaps something like this:
MATCH (a:A)
MATCH (b1:B {id: a.a1_id})
MATCH (b2:B {id: {b2_id}})
MATCH (b1)-[rels:CONNECTS*..21]->(b2)
WHERE ALL(x in tail(rels) WHERE x.connect_id = head(rels).connect_id)
UNWIND rels AS r
WITH a,startNode(r) as new_a, endNode(r) as new_b
CREATE (new_a)-[:TICKET {ticket_id: ID(a)}]->(new_b);

Optimizing Cypher Query Neo4j

I want to write a query in Cypher and run it on Neo4j.
The query is:
Given some start vertexes, walk the edges and find all vertexes that are connected to any of the start vertexes.
(start)-[*]->(v)
for every edge E walked
if startVertex(E).someproperty != endVertex(E).someproperty, output E.
The graph may contain cycles.
For example, in the graph above, vertexes are grouped by "group" property. The query should return 7 rows representing the 7 orange colored edges in the graph.
If I wrote the algorithm myself, it would be a simple depth-first or breadth-first search: for every edge visited, if the filter condition is true, output the edge. The complexity is O(V+E).
But I can't express this algorithm in Cypher, since it's a very different language.
Then I wrote this query:
Find all reachable vertexes:
(start)-[*]->(v), reachable = start + v.
Find all edges starting from any reachable vertex. If an edge ends at a reachable vertex and passes the filter, output it:
match (reachable)-[]->(n) where n in reachable and reachable.someprop != n.someprop
So the Cypher code looks like this:
MATCH (n:Col {schema:"${DWMDATA}",table:"CHK_P_T80_ASSET_ACCT_AMT_DD"})
WITH n MATCH (n:Col)-[*]->(m:Col)
WITH collect(distinct n) + collect(distinct m) AS c1
UNWIND c1 AS rn
MATCH (rn:Col)-[]->(xn:Col) WHERE rn.schema<>xn.schema and xn in c1
RETURN rn,xn
The performance of this query is not as good as I expected. There is an index on :Col(schema).
I am running neo4j 2.3.0 docker image from dockerhub on my windows laptop. Actually it runs on a linux virtual machine on my laptop.
My sample data is a small dataset that contains 0.1M vertexes and 0.5M edges. For some starting nodes it takes 60 or more seconds to complete this query. Any advice for optimizing or rewriting the query? Thanks.
The following code block is the logic I want:
VertexQueue1 = (starting vertexes);
VisitedVertexSet = (empty);
EdgeSet1 = (empty);
while (VertexQueue1 is not empty)
{
    Vertex0 = VertexQueue1.pop();
    VisitedVertexSet.add(Vertex0);
    foreach (Edge0 starting from Vertex0)
    {
        Vertex1 = endingVertex(Edge0);
        if (Vertex1.schema <> Vertex0.schema)
        {
            EdgeSet1.put(Edge0);
        }
        if (VisitedVertexSet.notContains(Vertex1)
            and VertexQueue1.notContains(Vertex1))
        {
            VertexQueue1.push(Vertex1);
        }
    }
}
return EdgeSet1;
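For reference, the same traversal can be written as a short runnable Python sketch. The adjacency dictionary, the `schema` mapping, and the function name `cross_schema_edges` are illustrative assumptions, not part of the original data set.

```python
from collections import deque


def cross_schema_edges(start_nodes, out_edges, schema):
    """Breadth-first search that outputs every visited edge whose two
    endpoints have different schema values. Each vertex is dequeued once,
    so each outgoing edge is examined once: O(V + E)."""
    queue = deque(start_nodes)
    seen = set(start_nodes)  # vertexes already visited or already queued
    result = []
    while queue:
        v0 = queue.popleft()
        for v1 in out_edges.get(v0, []):
            if schema[v0] != schema[v1]:
                result.append((v0, v1))
            if v1 not in seen:
                seen.add(v1)
                queue.append(v1)
    return result


# Hypothetical graph with a cycle a -> b -> c -> a and a tail c -> d.
out_edges = {"a": ["b"], "b": ["c"], "c": ["a", "d"]}
schema = {"a": "S1", "b": "S1", "c": "S2", "d": "S2"}
edges = cross_schema_edges(["a"], out_edges, schema)
print(edges)
```

The `seen` set is what keeps the cycle from being walked more than once, which is the part an unbounded path expansion does not give you for free.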
EDIT:
The profile result shows that expanding all paths has a high cost. Judging by the row counts, the Cypher execution engine returns all paths, but I only want the distinct edge list.
LEFT one:
match (start:Col {table:"F_XXY_DSMK_ITRPNL_IDX_STAT_W"})
,(start)-[*0..]->(prev:Col)-->(node:Col)
where prev.schema<>node.schema
return distinct prev,node
RIGHT one:
MATCH (n:Col {schema:"${DWMDATA}",table:"CHK_P_T80_ASSET_ACCT_AMT_DD"})
WITH n MATCH (n:Col)-[*]->(m:Col)
WITH collect(distinct n) + collect(distinct m) AS c1
UNWIND c1 AS rn
MATCH (rn:Col)-[]->(xn:Col) WHERE rn.schema<>xn.schema and xn in c1
RETURN rn,xn
I think Cypher lets this be much easier than you're expecting it to be, if I'm understanding the query. Try this:
MATCH (start:Col {schema:"${DWMDATA}",table:"CHK_P_T80_ASSET_ACCT_AMT_DD"})-->(node:Col)
WHERE start.schema <> node.schema
RETURN start, node
Though I'm not sure why you're comparing the schema property on the nodes. Isn't the schema for the start node fixed by the value that you pass in?
I might not be understanding the query though. If you're looking for more than just the nodes connected to the start node, you could do:
MATCH
(start:Col {schema:"${DWMDATA}",table:"CHK_P_T80_ASSET_ACCT_AMT_DD"}),
(start)-[*0..]->(prev:Col)-->(node:Col)
WHERE prev.schema <> node.schema
RETURN prev, node
That open-ended variable length relationship specification might be slow, though.
Also note that when Cypher is traversing a particular path, it stops when it finds that it has looped back onto some node (EDIT: relationship, not node) in the path matched so far, so cycles aren't really a problem.
Also, is the DWMDATA value that you're passing in interpolated? If so, you should think about using parameters for security / performance:
http://neo4j.com/docs/stable/cypher-parameters.html
EDIT:
Based on your comment, I have a couple of thoughts. First, limiting to DISTINCT paths isn't going to help, because every path that it finds is distinct. What you want, I think, is the distinct set of pairs, which could be achieved by just adding DISTINCT to the query:
MATCH
(start:Col {schema:"${DWMDATA}",table:"CHK_P_T80_ASSET_ACCT_AMT_DD"}),
(start)-[*0..]->(prev:Col)-->(node:Col)
WHERE prev.schema <> node.schema
RETURN DISTINCT prev, node
Here is another way to go about it which may or may not be more efficient, but might at least give you an idea for how to go about things differently:
MATCH
path=(start:Col {schema:"${DWMDATA}",table:"CHK_P_T80_ASSET_ACCT_AMT_DD"})-->(node:Col)
WITH rels(path) AS rels
UNWIND rels AS rel
WITH DISTINCT rel
WITH startNode(rel) AS start_node, endNode(rel) AS end_node
WHERE start_node.schema <> end_node.schema
RETURN start_node, end_node
I can't say that this would be faster, but here's another way to try:
MATCH (start:Col)-[*]->(node:Col)
WHERE start.property IN {property_values}
WITH collect(ID(node)) AS node_ids
MATCH (:Col)-[r]->(node:Col)
WHERE ID(node) IN node_ids
WITH DISTINCT r
RETURN startNode(r) AS start_node, endNode(r) AS end_node
I suspect that the problem in all cases is with the open-ended variable length path. I've actually asked on the Slack group to try to get a better understanding of how it works. In the meantime, for all the queries that you try I would suggest prefixing them with the PROFILE keyword to get a report from Neo4j on what parts of the query are slow.
// this is very inefficient!
MATCH (start:Col)-[*]->(node:Col)
WHERE start.property IN {property_values}
WITH distinct node
MATCH (prev)-[r]->(node)
RETURN distinct prev, node;
You might be better off with this:
MATCH (start:Col)
WHERE start.property IN {property_values}
MATCH (node:Col)
WHERE shortestPath((start)-[*]->(node)) IS NOT NULL
MATCH (prev)-[r]->(node)
RETURN distinct prev, node;

Neo4j: Transitive query and node ordering

I am using Neo4j to track relationships in OOP architecture. Let us assume that nodes represent classes and (u) -[:EXTENDS]-> (v) if class u extends class v (i.e. for each node there is at most one outgoing edge of type EXTENDS). I am trying to find out a chain of predecessor classes for a given class (n). I have used the following Cypher query:
start n=node(...)
match (n) -[:EXTENDS*]-> (m)
return m.className
I need to process nodes in such an order that the direct predecessor of class n comes first, its predecessor comes as second etc. It seems that the Neo4j engine returns the nodes in exactly this order (given the above query) - is this something I should rely on or could this behavior suddenly change in some of the future releases?
If I should not rely on this behavior, what Cypher query would allow me to get all predecessor nodes in the given order? I was thinking about the following query:
start n=node(...)
match p = (n) -[:EXTENDS*]-> (m {className: 'Object'})
return p
Which would work quite fine, but I would like to avoid specifying the root class (Object in this case).
It's unlikely to change anytime soon as this is really the nature of graph databases at work.
The query you've written will return ALL possible "paths" of nodes that match that pattern. But given that you've specified that there is at most one :EXTENDS edge from each such node, the order is implied with the direction you've included in the query.
In other words, what's returned won't start "skipping" nodes in a chain.
What it will do, though, is give you all "sub-paths" of a path. That is, assuming you specified you wanted the predecessors for node "a", for the following path...
(a)-[:EXTENDS]->(b)-[:EXTENDS]->(c)
...your query (omitting the property name) will return "a, b, c" and "a, b". If you only want ALL of its predecessors, and you can use Cypher 2.x, consider using the "path" way, something like:
MATCH p = (a)-[:EXTENDS*]->(b)
WITH p
ORDER BY length(p) DESC
LIMIT 1
RETURN extract(x in nodes(p) | x.className)
Also, as a best practice, given that you're looking at paths of indefinite length, you should likely limit the number of hops your query makes to something reasonable, e.g.
MATCH (n) -[:EXTENDS*0..10]-> (m)
Or some such.
HTH
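The sub-path behaviour described in this answer can be illustrated with a small Python sketch. It assumes, as in the question, that each class has at most one outgoing EXTENDS edge. The class names and the helper `prefix_paths` are hypothetical.

```python
def prefix_paths(start, extends, max_hops=10):
    """Return every path that starts at `start` and follows EXTENDS edges,
    shortest first, with at most max_hops hops."""
    paths = []
    path = [start]
    while len(path) - 1 < max_hops and path[-1] in extends:
        path = path + [extends[path[-1]]]
        paths.append(path)
    return paths


# Hypothetical hierarchy: each class maps to the single class it extends.
extends = {"C": "B", "B": "A", "A": "Object"}

paths = prefix_paths("C", extends)
# Equivalent in spirit to ORDER BY length(p) DESC LIMIT 1.
longest = max(paths, key=len)
print(paths)
print(longest)
```

The longest path already lists the predecessors in order, so taking it once avoids relying on the order in which the individual rows happen to come back.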

Perform MATCH on collection / Break apart collection

I am trying to mimic the functionality of the neo4j browser to display my graph in my front end. The neo4j browser issues two calls for every query: the first call performs the query that the user types into the query box, and the second call finds the relationships between all the nodes returned by the first, user-entered query.
{
  "statements": [{
    "statement": "START a = node(1,2,3,4), b = node(1,2,3,4) MATCH a -[r]-> b RETURN r;",
    "resultDataContents": ["row", "graph"],
    "includeStats": true
  }]
}
In my application I would like to be more efficient so I would like to be able to get all of my nodes and relationships in a single query. The query that I have at present is:
START person = node({personId})
MATCH person-[:RELATIONSHIP*]-(p:Person)
WITH distinct p
MATCH p-[r]-(d:Data), p-[:DETAILS]->(details), d-[:FACT]->(facts)
RETURN p, r, d, details, facts
This query runs well but it doesn't give me the "d" and "details" nodes which were linked to the original "person".
I have tried to join the "p" and "person" results in a collection:
collect(p) + collect(person) AS people
But this does not allow me to perform a MATCH on the resulting collection. As far as I can figure out there is no way of breaking apart a collection.
The only option I see at the moment is to split the query into two; return the "collect(p) + collect(person) AS people" collection and then use the node values in a second query. Is there a more efficient way of performing this query?
If you use the quantifier *0.., RELATIONSHIP is also matched at a depth of 0, which makes person the same as p in that case. A * without specified limits defaults to 1..infinity.
START person = node({personId})
MATCH person-[:RELATIONSHIP*0..]-(p:Person)
WITH distinct p
MATCH p-[r]-(d:Data), p-[:DETAILS]->(details), d-[:FACT]->(facts)
RETURN p, r, d, details, facts
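The effect of the lower bound can be sketched in Python. This is a simplified model that assumes an acyclic, undirected set of relationships; the adjacency data and the function name `related_people` are hypothetical.

```python
def related_people(start, knows, zero_hops):
    """Collect people reachable from `start`. With zero_hops=True the start
    itself is included, mirroring a *0.. lower bound; otherwise only people
    at least one hop away are returned, mirroring a plain *."""
    found = {start} if zero_hops else set()
    visited = {start}
    frontier = [start]
    while frontier:
        next_frontier = []
        for node in frontier:
            for neighbour in knows.get(node, ()):
                if neighbour not in visited:
                    visited.add(neighbour)
                    found.add(neighbour)
                    next_frontier.append(neighbour)
        frontier = next_frontier
    return found


# Hypothetical undirected chain p1 - p2 - p3, stored in both directions.
knows = {"p1": ["p2"], "p2": ["p1", "p3"], "p3": ["p2"]}

one_or_more = related_people("p1", knows, zero_hops=False)
zero_or_more = related_people("p1", knows, zero_hops=True)
print(sorted(one_or_more), sorted(zero_or_more))
```

With the start included in the first step, the second MATCH sees the original person too, so no collection has to be merged or split apart.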
