Query optimization for matching nodes with equal values - neo4j

I want to collect all nodes that have the same property value
MATCH (rhs:DataValue),(lhs:DataValue) WHERE rhs.value = lhs.value RETURN rhs,lhs
I have created an index on the property
CREATE INDEX ON :DataValue(value)
the index is created:
Indexes
ON :DataValue(value) ONLINE
I have only 2570 DataValues.
match (n:DataValue) return count(n)
> 2570
However, the query takes ages/does not terminate within the timeout of my browser.
This surprises me as I have an index and expected the query to run within O(n) with n being the amount of nodes.
My train of thought is: If I'd implement it myself I could just match all nodes O(n) sort them by value O(n log n) and then go through the sorted list and return all sublists that are longer than 1 O(n). Thus, the time I could archive is O(n log n). However, I expect the sorting already being covered by the indexing.
How am I mistaken and how can I optimize this query?

Your complexity is actually O(n^2), since your match creates a cartesian product for rhs and lhs, and then does filtering for every single pairing to see if they are equal. The index doesn't apply in your query at all. You should be able to confirm that by running EXPLAIN or PROFILE on the query.
You'll want to tweak your query a little to get it to O(n). Hopefully in a future neo4j version query planning will be smarter so we don't have to be so explicit.
MATCH (lhs:DataValue)
WITH lhs
MATCH (rhs:DataValue)
WHERE rhs.value = lhs.value
RETURN rhs,lhs
Note that your returned values will include opposite pairs ((A, B), (B, A)).

Related

Count the number of relationship types to add them as a frequency-property to the edges

I'm trying to count the different types of relationships in my neo4j graph to add them as a "frequency" property to the corresponding edges (ie I have 4 e:EX relationship types, so I would like my edges of type EX to have an e.frequency=4).
So far I have played around with this code:
MATCH ()-[e:EX]-()
WITH e, count(e) as amount
SET e.frequency = amount
RETURN e
For this piece of code my returned e.frequency is 2 for all EX edges. Maybe anyone here knows how to correct this?
It sounds like you want this information for quick access later for any node with that type. If you're planning on deleting or adding edges in your graph, you should realize that your data will get stale quickly, and a graph-wide query to update the property on every edge in the graph just doesn't make sense.
Thankfully Neo4j keeps a transactional count store of various statistics, including the number of relationships per relationship type.
It's easiest to get these via procedure calls, either in Neo4j itself or APOC Procedures.
If you have APOC installed, you can see the map of relationship type counts like this:
CALL apoc.meta.stats() YIELD relTypesCount
RETURN relTypesCount
If you know which type you want the count of, you can use dot notation into the relTypesCount map to get the value in question.
If it's dynamic (either passed in as a parameter, or obtained after matching to a relationship in the query) you can use the map index notation to get the count in question like this:
CALL apoc.meta.stats() YIELD relTypesCount
MATCH ()-[r]->()
WITH relTypesCount, r
LIMIT 5
RETURN type(r) as type, relTypesCount[type(r)] as count
If you don't have APOC, you can make use of db.stats.retrieve('GRAPH COUNTS')
YIELD data, but you'll have to do some additional filtering to make sure you get the counts for ALL of the relationships of the given type, and exclude the counts that include the labels of the start or end nodes:
CALL db.stats.retrieve('GRAPH COUNTS') YIELD data
WITH [entry IN data.relationships WHERE NOT exists(entry.startLabel) AND NOT exists(entry.endLabel)] as relCounts
MATCH ()-[r]->()
WITH relCounts, r
LIMIT 5
RETURN type(r) as type, [rel in relCounts WHERE rel.relationshipType = type(r) | rel.count][0] as count
First, here is what your query is doing
// Match all EX edges (relationships), ignore direction
MATCH ()-[e:EX]-()
// Logical partition; With the edges, count how many times that instance occurred (will always be 2. (a)-[e:EX]->(b) and the reverse order of (b)<-[e:EX]-(a)
WITH e, count(e) as amount
// Set the property frequency on the instance of e to amount (2)
SET e.frequency = amount
// return the edges
RETURN e
So to filter the duplicates (reverse direction match), you need to specify the direction on the MATCH. So MATCH ()-[e:EX]->(). for the frequency part, you don't even need a match; you can just count the occurences of the pattern WITH SIZE(()-[:EX]->()) as c (SIZE because pattern matching returns a list, not a row set)
So
WITH SIZE(()-[:EX]->()) as c
MATCH ()-[e:EX]->()
SET e.frequency = c
return e
Although, frequency will be invalidated as soon as an EX edge is created or deleted, so I would just open your Cypher with asking for the edge count.
Also, in this trivial case, the best way to get the relation count is with a MATCH - COUNT because this form helps the Cypher planer to recognize that it can just fetch the edge count from it's internal metadata store.
MATCH ()-[e:EX]->()
WITH COUNT(e) as c
MATCH ()-[e:EX]->()
SET e.frequency = c
return e

simple match query taking ages

I have a simple query
MATCH (n:TYPE {id:123})<-[:CONNECTION*]<-(m:TYPE) RETURN m
and when executing the query "manually" (i.e. using the browser interface to follow edges) I only get a single node as a result as there are no further connections. Checking this with the query
MATCH (n:TYPE {id:123})<-[:CONNECTION]<-(m:TYPE)<-[n:CONNECTION]-(o:TYPE) RETURN m,o
shows no results and
MATCH (n:TYPE {id:123})<-[:CONNECTION]<-(m:TYPE) RETURN m
shows a single node so I have made no mistake doing the query manually.
However, the issue is that the first question takes ages to finish and I do not understand why.
Consequently: What is the reason such trivial query takes so long even though the maximum result would be one?
Bonus: How to fix this issue?
As Tezra mentioned, the variable-length pattern match isn't in the same category as the other two queries you listed because there's no restrictions given on any of the nodes in between n and m, they can be of any type. Given that your query is taking a long time, you likely have a fairly dense graph of :CONNECTION relationships between nodes of different types.
If you want to make sure all nodes in your path are of the same label, you need to add that yourself:
MATCH path = (n:TYPE {id:123})<-[:CONNECTION*]-(m:TYPE)
WHERE all(node in nodes(path) WHERE node:TYPE)
RETURN m
Alternately you can use APOC Procedures, which has a fairly efficient means of finding connected nodes (and restricting nodes in the path by label):
MATCH (n:TYPE {id:123})
CALL apoc.path.subgraphNodes(n, {labelFilter:'TYPE', relationshipFilter:'<CONNECTION'}) YIELD node
RETURN node
SKIP 1 // to avoid returning `n`
MATCH (n:TYPE {id:123})<-[:CONNECTION]<-(m:TYPE)<-[n:CONNECTION]-(o:TYPE) RETURN m,o Is not a fair test of MATCH (n:TYPE {id:123})<-[:CONNECTION*]<-(m:TYPE) RETURN m because it excludes the possibility of MATCH (n:TYPE {id:123})<-[:CONNECTION]<-(m:ANYTHING_ELSE)<-[n:CONNECTION]-(o:TYPE) RETURN m,o.
For your main query, you should be returning DISTINCT results MATCH (n:TYPE {id:123})<-[:CONNECTION*]<-(m:TYPE) RETURN DISTINCT m.
This is for 2 main reasons.
Without distinct, each node needs to be returned the number of times for each possible path to it.
Because of the previous point, that is a lot of extra work for no additional meaningful information.
If you use RETURN DISTINCT, it gives the cypher planner the choice to do a pruning search instead of an exhaustive search.
You can also limit the depth of the exhaustive search using ..# so that it doesn't kill your query if you run against a much older version of Neo4j where the Cypher Planner hasn't learned pruning search yet. Example use MATCH (n:TYPE {id:123})<-[:CONNECTION*..10]<-(m:TYPE) RETURN m

Neo4j 3.2 Cypher low performance

Without getting unnecessarily too specific, I'm facing the following issue with Cyper in Neo4j 3.2. Let's say we have a database with 3 entities: User, Comment, Like.
For whatever reason, I'm trying to run the following query:
MATCH (n:USER) WHERE n.name = "name"
WITH n
MATCH (o:USER)
WITH n, o, "2000" as number
MATCH (n)<-[:CREATED_BY]-(:COMMENT)-[:HAS]->(l:LIKE)-[:CREATED_BY]->(o)
RETURN n, o, number, count(l)
The query takes minutes to complete. However, if I simply remove the "2000" as number part, it completes within tens of milliseconds.
Does anybody have an explanation why?
EDIT:
Top image, with "2000" as number part; bottom, without it.
You're going to have to clean up your query, right now you're not using indexes (so the initial match with the specific name is slow), and then you perform a cartesian product against all :User nodes, then create strings for each row.
So first, create an index on :USER(name) so you can find your start node fast.
Then we'll have to clean up the rest of the match.
Try something like this instead:
MATCH (n:USER) WHERE n.name = "name"
WITH n, "2000" as number
MATCH (n)<-[:CREATED_BY]-(:COMMENT)-[:HAS]->(l:LIKE)-[:CREATED_BY]->(o:User)
RETURN n, o, number, count(l)
You should see a similar plan with this query as in the query without the "2000".
The reason for this is that although your plan has a cartesian product with your match to o, the planner was intelligent enough to realize there was an additional restriction for o in that it had to occur in the pattern in your last match, and its optimization for that situation let you avoid performing a cartesian product.
Introduction of a new variable number, however, prevented the planner from recognizing that this was basically the same situation, so the planner did not optimize out the cartesian product.
For now, try to be explicit about how you want the query to be performed, and try to avoid cartesian products in your queries.
In this particular case, it's important to realize that when you have MATCH (o:User) on the third line, that's not declaring that the type of o is a :User in the later match, it's instead saying that for every row in your results so far, perform a cartesian product against all :User nodes, and then for each of those user nodes, see which ones exist in the pattern provided. That is a lot of unnecessary work, compared to simply expanding the pattern provided and getting whatever :User nodes you find at the other end of the pattern.
EDIT
As far as getting both :LIKE and :DISLIKE node counts, maybe try something like this:
MATCH (n:USER) WHERE n.name = "name"
WITH n, "2000" as number
MATCH (n)<-[:CREATED_BY]-(:COMMENT)-[:HAS]->(likeDislike)-[:CREATED_BY]->(o:User)
WITH n, o, number, head(labels(likeDislike)) as type, count(likeDislike) as cnt
WITH n, o, number, CASE WHEN type = "LIKE" THEN cnt END as likeCount, CASE WHEN type = "DISLIKE" THEN cnt END as dislikeCount
RETURN n, o, number, sum(likeCount) as likeCount, sum(dislikeCount) as dislikeCount
Assuming you still need that number variable in there.

Return multiple relationship counts for one MATCH statement

I want to do something like this:
MATCH (p:person)-[a:UPVOTED]->(t:topic),(p:person)-[b:DOWNVOTED]->(t:topic),(p:person)-[c:FLAGGED]->(t:topic) WHERE ID(t)=4 RETURN COUNT(a),COUNT(b),COUNT(c)
..but I get all 0 counts when I should get 2, 1, 1
A better solution is to use size which improve drastically the performance of the query :
MATCH (t:Topic)
WHERE id(t) = 4
RETURN size((t)<-[:DOWNVOTED]-(:Person)) as downvoted,
size((t)<-[:UPVOTED]-(:Person)) as upvoted,
size((t)<-[:FLAGGED]-(:Person)) as flagged
If you are sure that the other nodes on the relationships are always labelled with Person, you can remove them from the query and it will be a bit faster again
Let's start with refactoring the query a bit (hopefully the meaning of it isn't lost):
MATCH
(t:topic)
(p:person)-[upvote:UPVOTED]-(t),
(p:person)-[downvote:DOWNVOTED]->(t),
(p:person)-[flag:FLAGGED]->(t)
WHERE ID(t)=4
RETURN COUNT(upvote), COUNT(downvote), COUNT(flag)
Since t is your primary variable (since you are filtering on it), I've matched once with the label and then used just the variable throughout the rest of the matches. Seeing the query cleaned up like this, it seems to me that you're trying to count all upvotes/downvotes/flags for a topic, but you don't care who did those things. Currently, since you're using the same variable p Cypher is going to try to match the same person for all three lines. So you could have different variables:
(p1:person)-[upvote:UPVOTED]-(t),
(p2:person)-[downvote:DOWNVOTED]->(t),
(p3:person)-[flag:FLAGGED]->(t)
Or better, since you're not referencing the people anywhere else, you can just leave the variables out:
(:person)-[upvote:UPVOTED]-(t),
(:person)-[downvote:DOWNVOTED]->(t),
(:person)-[flag:FLAGGED]->(t)
And stylistically, I would also suggest starting your matches with the item that you're filtering on:
(t)<-[upvote:UPVOTED]-(:person)
(t)<-[downvote:DOWNVOTED]-(:person)
(t)<-[flag:FLAGGED]-(:person)
The next problem comes in because by making these a MATCH, you're saying that there NEEDS to be a match. Which means you'll never get cases with zeros. So you'll want OPTIONAL MATCH:
MATCH (t:topic)
WHERE ID(t)=4
OPTIONAL MATCH (t)<-[upvote:UPVOTED]-(:person)
OPTIONAL MATCH (t)<-[downvote:DOWNVOTED]-(:person)
OPTIONAL MATCH (t)<-[flag:FLAGGED]-(:person)
RETURN COUNT(upvote), COUNT(downvote), COUNT(flag)
Even then, though what you're saying is: "Find a topic and find all cases where there is 1 upvote, no downvote, no flag, 1 upvote, 1 downvote, no flag, etc... to all permutations). That means you'll want to COUNT one at a time:
MATCH (t:topic)
WHERE ID(t)=4
OPTIONAL MATCH (t)<-[r:UPVOTED]-(:person)
WITH t, COUNT(r) AS upvotes
OPTIONAL MATCH (t)<-[r:DOWNVOTED]-(:person)
WITH t, upvotes, COUNT(r) AS downvotes
OPTIONAL MATCH (t)<-[r:FLAGGED]-(:person)
RETURN upvotes, downvotes, COUNT(r) AS flags
A couple of miscellaneous items:
Be careful about using Neo IDs as a long-term reference because they can be recycled.
Use parameters whenever possible for performance / security (WHERE ID(t)={topic_id})
Also, labels are generally TitleCase. See The Zen of Cypher guide.
Check this query, i think it will help you.
MATCH (p:person)-[a:UPVOTED]->(t:topic),
(p)-[b:DOWNVOTED]->(t),(p)-[c:FLAGGED]->(t)
WHERE ID(t)=4
RETURN COUNT(a) as a_count,COUNT(b) as b_count,COUNT(c) as c_count;
Your current MATCH requires that the same person node (identified by p) have relationships of all 3 types with t. This is because an identifier is bound to a specific node (or relationship, or value), and (unless hidden by a WITH clause, which you do not have in your query) will reference that same node (or relationship, or value) throughout a query.
Based on your expected results, I am assuming that you are just trying to count the number of relationships of those 3 types between any person and t. If so, this is a performant way to do that:
MATCH (t:topic)
WHERE ID(t) = 4
MATCH (:person)-[r:UPVOTED|DOWNVOTED|FLAGGED]->(t)
RETURN REDUCE(s=[0,0,0], x IN COLLECT(r) |
CASE TYPE(x)
WHEN 'UPVOTED' THEN [s[0]+1, s[1], s[2]]
WHEN 'DOWNVOTED' THEN [s[0], s[1]+1, s[2]]
ELSE [s[0], s[1], s[2]+1]
END
) As res;
res is an array with the number of UPVOTED, DOWNVOTED, and FLAGGED relationships, respectively, between any person and t.
Another approach would be to use separate OPTIONAL MATCH statements for each relationship type, returning three COUNT(DISTINCT x) values. But the above query uses a single MATCH statement, greatly reducing the number of DB hits, which are generally expensive.

Querying multiple indexes not working if one condition fails in Neo4j

I am trying to search for a key word on all the indexes. I have in my graph database.
Below is the query:
start n=node:Users(Name="Hello"),
m=node:Location(LocationName="Hello")
return n,m
I am getting the nodes and if keyword "Hello" is present in both the indexes (Users and Location), and I do not get any results if keyword Hello is not present in any one of index.
Could you please let me know how to modify this cypher query so that I get results if "Hello" is present in any of the index keys (Name or LocationName).
In 2.0 you can use UNION and have two separate queries like so:
start n=node:Users(Name="Hello")
return n
UNION
start n=node:Location(LocationName="Hello")
return n;
The problem with the way you have the query written is the way it calculates a cartesian product of pairs between n and m, so if n or m aren't found, no results are found. If one n is found, and two ms are found, then you get 2 results (with a repeating n). Similar to how the FROM clause works in SQL. If you have an empty table called empty, and you do select * from x, empty; then you'll get 0 results, unless you do an outer join of some sort.
Unfortunately, it's somewhat difficult to do this in 1.9. I've tried many iterations of things like WITH collect(n) as n, etc., but it boils down to the cartesian product thing at some point, no matter what.

Resources