I'm trying to understand efficient usage patterns with Neo4j, specifically with reference to high-degree nodes. To give an idea of what I'm talking about: I have User nodes whose attributes I have modeled as nodes, so my graph contains relationships such as
(:User)-[:HAS_ATTRIB]->(:AgeCategory)
and so on and so forth. The problem is that some of these AgeCategory nodes are very high degree, on the order of 100k, and queries such as
MATCH (u:User)-->(:AgeCategory)<--(v:User), (u)-->(:FavoriteLanguage)<--(v)
WHERE u.uid = "AAA111" AND v.uid <> u.uid
RETURN v.uid
(matching all users that share the same age category and favorite language as AAA111) are very, very slow, since you have to run over the FavoriteLanguage linked list once for every element in the AgeCategory linked list (or at least that's how I understand it).
I think it's pretty clear from the fact that this query takes minutes to resolve that I'm doing something wrong, but I'm curious what the right way to handle queries like this is. Should I pull down the matching users from each query individually and compare them with an in-memory hash? Is there a way to put an index on the relationships of a node? Is this even a good schema to begin with?
My intuition is that it would be more efficient to first retrieve the two end points (AgeCategory and FavoriteLanguage) for the given node u, and then query the middle node v for a path with these two fixed end points.
To test that, I created a test graph with the following components:
A node u:User with u.uid = 'AAA111'
A node c:AgeCategory
A node l:FavoriteLanguage
A relationship between u and c, u-[:HAS_AGE]->c
A relationship between u and l, u-[:LIKE_LANGUAGE]->l
100,000 nodes v, each of which shares the same c:AgeCategory and l:FavoriteLanguage with the node u; that is, each v connects to l and c: v-[:HAS_AGE]->c, v-[:LIKE_LANGUAGE]->l
I ran the following query 10 times and got an average running time of 10,500 ms.
MATCH (l:FavoriteLanguage)<-[:LIKE_LANGUAGE]-(u:User)-[:HAS_AGE]->(c:AgeCategory)
WHERE u.uid = 'AAA111'
WITH l, c
MATCH (l)<-[:LIKE_LANGUAGE]-(v:User)-[:HAS_AGE]->(c)
WHERE v.uid <> 'AAA111'
RETURN v.uid
With 10,000 v nodes, this query takes around 2,000 ms, while your original query takes about 27,000 ms.
With 100,000 v nodes, this query takes around 10,500 ms, while your original query seems to take forever.
So you might give this query a try and see if it improves the performance on your graph.
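As a side note, you can see where the time goes in either version by prefixing the statement with PROFILE, which prints the execution plan with per-operator db hits; a sketch using the original query's shape:

```cypher
PROFILE
MATCH (u:User)-->(:AgeCategory)<--(v:User), (u)-->(:FavoriteLanguage)<--(v)
WHERE u.uid = "AAA111" AND v.uid <> u.uid
RETURN v.uid
```

The operator with the largest db-hit count is usually the expansion over the high-degree node, which is what the two-step rewrite avoids.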
Let's assume this use case:
We have a few nodes (labeled Big), each having a simple integer ID property.
Each Big node has relationships with millions of (labeled Small) nodes, such as:
(Small)-[:BELONGS_TO]->(Big)
How can I phrase a Cypher query to express the following natural-language request:
For each Big node with an id in the range 4-7, get 10 of the Small nodes that belong to it.
The expected result would be a graph with 2 Big nodes, 20 Small nodes, and 20 relationships: each Big node with a subset of 10 of the Small nodes that belong to it.
What I've tried, which failed (it only shows 1 Big node (id=5) along with 10 of its related Small nodes, but doesn't show the second node (id=6)):
MATCH (s:Small)-[:BELONGS_TO]->(b:Big)
Where 4<b.bigID<7
return b,s limit 10
I guess I need a more complex compound query.
Hope I could phrase my question in an understandable way!
As stdob-- says, you can't use LIMIT here, at least not in this way, as it limits the entire result set.
While the aggregation solution will return you the right answer, you'll still pay the cost for the expansion to those millions of nodes. You need a solution that will lazily get the first ten for each.
Using APOC Procedures, you can use apoc.cypher.run() to effectively perform a subquery. The query will be run per-row, so if you limit the rows first, you can call this and use LIMIT within the subquery, and it will properly limit to 10 results per row, lazily expanding so you don't pay for an expansion to millions of nodes.
MATCH (b:Big)
WHERE 4 < b.bigID < 7
CALL apoc.cypher.run('
MATCH (s:Small)-[:BELONGS_TO]->(b)
RETURN s LIMIT 10',
{b:b}) YIELD value
RETURN b, value.s
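If you are on a more recent Neo4j (4.1+), a correlated CALL subquery gives you the same lazy per-row LIMIT without APOC; a sketch, assuming the same bigID property:

```cypher
MATCH (b:Big)
WHERE 4 < b.bigID < 7
CALL {
  WITH b
  MATCH (s:Small)-[:BELONGS_TO]->(b)
  RETURN s LIMIT 10
}
RETURN b, s
```

The subquery executes once per Big row, so the LIMIT applies per Big node rather than to the whole result set.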
Your query does not work because the LIMIT applies to the entire preceding stream.
You need to use the aggregation function collect:
MATCH (s:Small)-[:BELONGS_TO]->(b:Big)
WHERE 4 < b.bigID < 7
WITH b, collect(distinct s)[..10] as smalls
RETURN b, smalls
To keep things simple, as part of the ETL on my time-series data, I added a sequence-number property to each row, corresponding to 0..370365 (370,366 nodes, 5,555,490 properties; not that big). I later added a second property, naming the original "outeseq" and the second "ineseq", to see if an outright equivalence to base the relationship on might speed things up a bit.
I can get both of the following queries to run properly on up to ~30k nodes (LIMIT 30000), but past that it's just an endless wait. My JVM has 16 GB max (if it can even use that on a Windows box):
MATCH (a:BOOK),(b:BOOK)
WHERE a.outeseq=b.outeseq-1
MERGE (a)-[s:FORWARD_SEQ]->(b)
RETURN s;
or
MATCH (a:BOOK),(b:BOOK)
WHERE a.outeseq=b.ineseq
MERGE (a)-[s:FORWARD_SEQ]->(b)
RETURN s;
I also added these in hopes of speeding things up:
CREATE CONSTRAINT ON (a:BOOK)
ASSERT a.outeseq IS UNIQUE
CREATE CONSTRAINT ON (b:BOOK)
ASSERT b.ineseq IS UNIQUE
I can't get the relationships created for the entire data set! Help!
Alternatively, I can get bits of the relationships built with parameters, but I haven't figured out how to parameterize the sequence over all of the node-to-node sequential relationships, at least not in a general enough way.
I profiled the query, but didn't see any reason for it to blow up.
Another question: I would like each relationship to have a property representing the difference between the timestamps of the two nodes, or delta-t. Is there a way to take the difference between the values in two sequential nodes and assign it to the relationship... for all of the relationships at the same time?
The last question, if you have the time: I'd really like to use the raw data and just chain the directed relationships from one node's timestamp to the next nearest node with the minimum delta, but I didn't run right at this for fear that it would cause a scan of all the nodes in order to build each relationship.
Before anyone suggests that I look to KDB or other DBs for time series, let me say I have a very specific reason to want to use a DAG representation.
It seems like this should be so easy...it probably is and I'm blind. Thanks!
Creating Relationships
Since your queries work on 30k nodes, I'd suggest running them page by page over all the nodes. This seems feasible because outeseq and ineseq are unique and numeric, so you can sort the nodes by those properties and run the query against one slice at a time.
MATCH (a:BOOK),(b:BOOK)
WHERE a.outeseq = b.outeseq-1
WITH a, b ORDER BY a.outeseq SKIP {offset} LIMIT 30000
MERGE (a)-[s:FORWARD_SEQ]->(b)
RETURN s;
You will need to run the query about 13 times, changing {offset} each time, to cover all the data. It would be convenient to write a small script in any language that has a Neo4j client.
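If APOC is available, apoc.periodic.iterate can handle the batching in a single call instead of a hand-rolled pagination script; a sketch, assuming APOC is installed and relying on the unique outeseq property from the question:

```cypher
CALL apoc.periodic.iterate(
  'MATCH (a:BOOK) RETURN a',
  'MATCH (b:BOOK) WHERE b.outeseq = a.outeseq + 1
   MERGE (a)-[:FORWARD_SEQ]->(b)',
  {batchSize: 10000, parallel: false})
```

Each batch is committed separately, which also keeps the transaction state small enough to avoid the memory blow-up seen with the all-at-once MERGE.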
Updating Relationship's Properties
You can assign the timestamp delta to relationships using a SET clause following the MATCH. Assuming that a timestamp is a long:
MATCH (a:BOOK)-[s:FORWARD_SEQ]->(b:BOOK)
SET s.delta = abs(b.timestamp - a.timestamp);
Chaining Nodes With Minimal Delta
Once the relationships carry the delta property, the graph becomes a weighted graph, so we can apply the usual shortest-path approach using the deltas as weights. Then we just save the length of the shortest path (the sum of the deltas) into a relationship between the first and the last node.
MATCH p=(a:BOOK)-[:FORWARD_SEQ*1..]->(b:BOOK)
WITH p AS shortestPath, a, b,
reduce(weight = 0, r IN relationships(p) | weight + r.delta) AS totalDelta
ORDER BY totalDelta ASC
LIMIT 1
MERGE (a)-[nearest:NEAREST {delta: totalDelta}]->(b)
RETURN nearest;
Disclaimer: queries above are not supposed to be totally working, they just hint possible approaches to the problem.
Say I have a million users and I search them using the IN operator with more than 1,000 custom ids, which have a unique index.
For example, in the movie database provided by Neo4j:
Let's say I need to get all movies that my list of actors (> 1,000 of them) acted in, ordered by movie release date, with distinct movie results.
Is it reasonable to run that operation on the database, and what is the time complexity if I execute it on a single-node instance and on an HA cluster?
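For concreteness, one straightforward reading of that query (movies featuring any of the listed actors) might look like the following sketch; the Actor/Movie labels, the id property, and the {ids} parameter are assumptions based on the movie dataset:

```cypher
MATCH (a:Actor)-[:ACTED_IN]->(m:Movie)
WHERE a.id IN {ids}
RETURN DISTINCT m
ORDER BY m.released
```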
This will give you a rough guide on the computational complexity involved in your calculation.
For each of your Actors, Neo will look for all the Acted_In relationships going from that node. Let's assume that the average number of Acted_In relationships is 4 per Actor.
Therefore Neo will require 4 traversals per Actor.
Therefore for 1000 Actors that will be 4000 traversals.
Which for Neo is not a lot (they claim to do about 1 million traversals a second, but of course this depends upon hardware).
Then, the DISTINCT aspect of the query is trivial for Neo, as it knows which nodes it has visited, so Neo would automatically have the unique list of Movie nodes; this would be very quick.
If the release date of the movie is indexed in Neo, the ordering of the results would also be very quick.
So theoretically this query should run quickly (well under a second) and have minimal impact on the database.
Here is what I'd do: I would start traversing from the actor with the lowest degree, i.e. the highest selectivity in your dataset, then find the movies he acted in and check those movies against the rest of the actors.
The second option might be the more efficient implementation. (There is also another trick that can speed it up even more; let me know via email when you have the dataset to test it on.)
MATCH (n:Actor) WHERE n.id IN {ids}
WITH n, SIZE( (n)-[:ACTED_IN]->() ) as degree
ORDER BY degree ASC
WITH collect(n) as actors
WITH head(actors) as first, tail(actors) as rest, size(actors)-1 as number
// either
MATCH (first)-[:ACTED_IN]->(m)
WHERE size( (m)<-[:ACTED_IN]-() ) > number AND ALL(a IN rest WHERE (a)-[:ACTED_IN]->(m))
RETURN m;
// or
MATCH (first)-[:ACTED_IN]->(m)
WHERE size( (m)<-[:ACTED_IN]-() ) > number
MATCH (m)<-[:ACTED_IN]-(a)
WHERE a IN rest
WITH m,count(*) as c, number
WHERE c = number
RETURN m;
Let me simplify my question. If all nodes in a Neo4j DB have the same label Science, what's the difference between MATCH n WHERE n.ID="UUID-0001" RETURN n and MATCH (n:Science) WHERE n.ID="UUID-0001" RETURN n? Why is the performance not the same?
My Neo4j database contains about 70,000 nodes and 100 relationships.
The nodes have two types: Paper and Author, and they both have an ID field.
I created each node with the corresponding label, and I also use an index on ID.
However, one of my functions needs to query nodes by ID without considering the label, like this: MATCH n WHERE n.ID="UUID-0001" RETURN n. The query takes about 4000~5000 ms!
But after adding the label Science to each node and using MATCH (n:Science) WHERE n.ID="UUID-0001" RETURN n, the query time dropped to about 1000~1100 ms. Does anyone know the difference between these two cases?
PS. count(n:Science) = count(n:Paper) + count(n:Author), which means each node has two labels.
Because for every label Neo4j automatically maintains a label index (the label scan store). The Cypher language can be broadly thought of as piping + filtering, so MATCH n WHERE ... will first get every node and then filter on the WHERE part, whereas MATCH (n:Science) WHERE ... will get every node with the label Science (using that index) and then apply the WHERE. From your query performance we can see that about 1/5th of your nodes were marked Science, so the query runs in a fifth of the time, because it did a fifth as many comparisons.
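Beyond the label scan, a schema index (using the 2.x-era syntax) on the ID property would let the labeled query do an index seek instead of scanning every Science node; a sketch, assuming the property is literally named ID:

```cypher
// One index per label whose ID you look up by
CREATE INDEX ON :Science(ID);
```

With this in place, MATCH (n:Science) WHERE n.ID="UUID-0001" RETURN n can be answered from the index; the label-less MATCH n ... still cannot, since schema indexes are always tied to a label.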
Even though I got advice from @phil_20686 and @Michael Hunger, I think these answers do not solve my question.
I think there are some tricks to using labels. If there are 10 thousand nodes of the same type in a Neo4j DB, a query will perform better once a label is added to those nodes.
I hope this post can help some people; please give me some feedback if you find the reason. Thanks.
I am seeing some extremely high query times and I'm unable to pinpoint the issue.
I have a graph database with 6,685 nodes, 26,407 properties and 22,921 relationships, running on an Amazon EC2 instance with 1.7 GB RAM.
My use case is to map people to their various interest points and, for a given user, find the people who share interests with him.
I have data on about 500 people in my db, and each person has an average of a little more than 100 different interest points.
1) When I run this Cypher query:
START u=node(5) MATCH (u)-[:interests]->(i)<-[:interests]-(o) RETURN o;
Here node(5) is a user node, so I am trying to find all users who have the same :interests relationships as user (u).
This query returns 2557 rows and takes about 350 ms.
2) When I sprinkle in a few extra MATCH conditions, the query time degrades dramatically.
For example, if I want to find all users who have common interests with user (u) = node(5) and also share the same hometown, I wrote:
START u=node(5)
MATCH (u)-[:interests]->(i)<-[:interests]-(o)
WITH u,o,i
MATCH (u)-[:hometown]->(h)<-[:hometown]-(o)
RETURN u, o, i, h;
This query returns 755 rows and takes about 2500 ms!
3) If I add more constraints to the MATCH, like same gender, same alma mater, etc., query times progressively worsen to >10,000 ms.
What am I doing wrong here?
Could you try stating the pattern as a whole in your first MATCH clause, i.e. MATCH (u)-[:interests]->(i)<-[:interests]-(o)-[:hometown]->(h)<-[:hometown]-(u)? (Note the final node is u, not o, so that u and o are required to share the hometown.)
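Written out in full, that suggestion might look like the following sketch, with the final hometown leg anchored back to u (so that u and o must share the hometown) and an explicit guard added against matching u itself:

```cypher
START u=node(5)
MATCH (u)-[:interests]->(i)<-[:interests]-(o)-[:hometown]->(h)<-[:hometown]-(u)
WHERE o <> u
RETURN u, o, i, h;
```

Keeping everything in a single MATCH lets Cypher prune non-matching o nodes during the traversal, instead of first materializing the full interest overlap and then re-matching each pair on hometown.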