Let's assume this use case;
We have few nodes (labeled Big) and each having a simple integer ID property.
Each Big node has a relation with millions of (labeled Small) nodes.
such as :
(Small)-[:BELONGS_TO]->(Big)
How can I phrase a Cypher query to represent the following in natural language:
For each Big node in the range of ids between 4-7, get me 10 of Small nodes that belongs to it.
The supposed result would give 2 Big nodes, 20 Small nodes, and 20 Relations
The needed result would be represented by this graph:
2 Big nodes, each with a subset of 10 of Small nodes that belongs to them
What I've tried but failed (it only shows 1 big node (id=5) along with 10 of its related Small nodes, but doesn't show the second node (id=6):
MATCH (s:Small)-[:BELONGS_TO]->(b:Big)
Where 4<b.bigID<7
return b,s limit 10
I guess I need a more complex compound query.
Hope I could phrase my question in an understandable way!
As stdob-- says, you can't use limit here, at least not in this way, as it limits the entire result set.
While the aggregation solution will return you the right answer, you'll still pay the cost for the expansion to those millions of nodes. You need a solution that will lazily get the first ten for each.
Using APOC Procedures, you can use apoc.cypher.run() to effectively perform a subquery. The query will be run per-row, so if you limit the rows first, you can call this and use LIMIT within the subquery, and it will properly limit to 10 results per row, lazily expanding so you don't pay for an expansion to millions of nodes.
MATCH (b:Big)
WHERE 4 < b.bigID < 7
CALL apoc.cypher.run('
MATCH (s:Small)-[:BELONGS_TO]->(b)
RETURN s LIMIT 10',
{b:b}) YIELD value
RETURN b, value.s
Your query does not work because the limit applies to the entire previous flow.
You need to use aggregation function collect:
MATCH (s:Small)-[:BELONGS_TO]->(b:Big) Where 4<b.bigID<7
With b,
collect(distinct s)[..10] as smalls
return b,
smalls
Related
I need to check if any node with the given label exists in my application. What's the most efficient approach to do so (in Java)? I was expecting
Transaction'getAllLabelsInUse()
to do the job, but it seems to also return truewhen any index or constraint exists for the given label.
My current workaround is running a query like this:
match (n:`label`) return n._id limit 1
assuming it would be a bit faster than
match (n:Crew) with n limit 1 return count(*)
The counts store can quickly service simple queries, such as getting the counts of all nodes of a label, so match (n:Crew) return count(n) will be very fast.
Take a look at our knowledge base article on getting fast counts from the counts store for other alternatives that leverage the counts store.
I am using Neo4j to store data regarding movie ratings. I would like to count the number of movies that two users both rated. When running the query
match (a:User)-[:RATED]->(b:Movie)<-[:RATED]-(c:User) return a,b,c limit 1000
it completes in less than a second, however running
match (a:User)-[:RATED]->(b:Movie)<-[:RATED]-(c:User) return a,count(b),c limit 1000
the database can't finish the query as the heap runs out of memory, which I have set as 4gb. Am I using the count function properly? I don't understand how the performance between these two queries can differ so significantly.
MissingNumber has a good explanation of what's going on. When you do aggregations, the whole set has to be considered to do the aggregations correctly, and that must happen before the LIMIT, and this is taking a huge toll on your heap space.
As an alternate in your case, you can try the following:
match (a:User)-[:RATED]->()<-[:RATED]-(c:User)
with DISTINCT a, c
where id(a) < id(c)
limit 1000
match (a)-[:RATED]->(m:Movie)<-[:RATED]-(c)
with a, c, count(m) as moviesRated
return a, moviesRated, c
By moving the LIMIT up before the aggregation, but using DISTINCT instead to ensure we only deal with a pair of nodes in this pattern once (and apply a predicate based on graph ids to ensure we never deal with mirrored results), we should get a more efficient query. Then for each of those 1000 pairs of a and c, we expand out the pattern again and get the actual counts.
I have ran into a similar similar situation and solved this using the following approach, this will be applicable to you.
I used a data set having:
(TYPE_S) - 380 nodes
(TYPE_N) - 800000 nodes
[:S_realation_N] - 5600000 relations
Query one :
match (s:TYPE_S)-[]-(n:TYPE_N) return s, n limit 10
This took 2 milli-seconds.
As soon as 10 patterns(relations) are found in db, neo4j just returns result.
Query two :
match (s:TYPE_S)-[]-(n:TYPE_N) return s, sum(n.value) limit 10
This took ~4000 milli-seconds.
This might look like a query as fast as last one. But surely it won’t be as fast as the previous one because of aggregation involved.
Reason:
For the query to aggregate over pattern, Neo4j has to load all the paths that matches given pattern (these are way more than 10 or given limit here and will be 5600000 as per my dataset) into ram before performing aggregation. Later this aggregation is performed over 10 full records S_TYPE nodes, so this falls into specified return format with given limit now. Rest of the relations in the ram are then flushed. Which means for a moment ran is loaded with lot data which will later be ignored due to limit.
So to optimize runtimes and memory usage here you have to avoid part of query which leads to loading data which will later be ignored.
This is how I optimized it:
match (s:TYPE_S) where ((s)-[]-(:TYPE_N))
with collect(s)[0..10] as s_list
unwind s_list as s
match (s)-[]-(n:TYPE_N) return s, sum(n.value)
This took 64 milli-seconds.
Now neo4j first shortlists 10 nodes of type TYPE_S which have relations with TYPE_S, and then matches the pattern with these nodes and get their data.
This should work and run better than query2 since you are loading a limited set of records in to ram.
You could use this similar way to build your query, by shorting 1000 (a,b) distinct user pairs and then perform aggregations on them.
But this approach will fail in case where need to order by aggregation.
Reason for your query to run out of memory is because you are using 4 gb ram and running a query that may load a lot of combinational data into ram(this may sometimes be more than size of you db due multiplicity of data combinations defined in you patterns, in your case even if you have 50 unique users, you have 50*49 possible unique combinations of patterns that can loaded in to ram). Also other transactions and queries running in parallel could also impact.
To keep things simple, as part of the ETL on my time-series data, I added a sequence number property to each row corresponding to 0..370365 (370,366 nodes, 5,555,490 properties - not that big). I later added a second property and named it "outeseq" (original) and "ineseq" (second) to see if an outright equivalence to base the relationship on might speed things up a bit.
I can get both of the following queries to run properly on up to ~30k nodes (LIMIT 30000) but past that, its just an endless wait. My JVM has 16g max (if it can even use it on a windows box):
MATCH (a:BOOK),(b:BOOK)
WHERE a.outeseq=b.outeseq-1
MERGE (a)-[s:FORWARD_SEQ]->(b)
RETURN s;
or
MATCH (a:BOOK),(b:BOOK)
WHERE a.outeseq=b.ineseq
MERGE (a)-[s:FORWARD_SEQ]->(b)
RETURN s;
I also added these in hopes of speeding things up:
CREATE CONSTRAINT ON (a:BOOK)
ASSERT a.outeseq IS UNIQUE
CREATE CONSTRAINT ON (b:BOOK)
ASSERT b.ineseq IS UNIQUE
I can't get the relationships created for the entire data set! Help!
Alternatively, I can also get bits of the relationships built with parameters, but haven't figured out how to parameterize the sequence over all of the node-to-node sequential relationships, at least not in a semantically general enough way to do this.
I profiled the query, but did't see any reason for it to "blow-up".
Another question: I would like each relationship to have a property to represent the difference in the time-stamps of each node or delta-t. Is there a way to take the difference between the two values in two sequential nodes, and assign it to the relationship?....for all of the relationships at the same time?
The last Q, if you have the time - I'd really like to use the raw data and just chain the directed relationships from one nodes'stamp to the next nearest node with the minimum delta, but didn't run right at this for fear that it cause scanning of all the nodes in order to build each relationship.
Before anyone suggests that I look to KDB or other db's for time series, let me say I have a very specific reason to want to use a DAG representation.
It seems like this should be so easy...it probably is and I'm blind. Thanks!
Creating Relationships
Since your queries work on 30k nodes, I'd suggest to run them page by page over all the nodes. It seems feasible because outeseq and ineseq are unique and numeric so you can sort nodes by that properties and run query against one slice at time.
MATCH (a:BOOK),(b:BOOK)
WHERE a.outeseq = b.outeseq-1
WITH a, b ORDER BY a.outeseq SKIP {offset} LIMIT 30000
MERGE (a)-[s:FORWARD_SEQ]->(b)
RETURN s;
It will take about 13 times to run the query changing {offset} to cover all the data. It would be nice to write a script on any language which has a neo4j client.
Updating Relationship's Properties
You can assign timestamp delta to relationships using SET clause following the MATCH. Assuming that a timestamp is a long:
MATCH (a:BOOK)-[s:FORWARD_SEQ]->(b:BOOK)
SET s.delta = abs(b.timestamp - a.timestamp);
Chaining Nodes With Minimal Delta
When relationships have the delta property inside, the graph becomes a weighted graph. So we can apply this approach to calculate the shortest path using deltas. Then we just save the length of the shortest path (summ of deltas) into the relation between the first and the last node.
MATCH p=(a:BOOK)-[:FORWARD_SEQ*1..]->(b:BOOK)
WITH p AS shortestPath, a, b,
reduce(weight=0, r in relationships(p) : weight+r.delta) AS totalDelta
ORDER BY totalDelta ASC
LIMIT 1
MERGE (a)-[nearest:NEAREST {delta: totalDelta}]->(b)
RETURN nearest;
Disclaimer: queries above are not supposed to be totally working, they just hint possible approaches to the problem.
If I had a million users and if I search them using IN Operator with more than 1000 custom ids which are unique indexed.
For example,in movie database given by neo4j
Let's say I need to get all movies where my list of actors ( > 1000) should acted in that movie and ordered by movie released date and distinct movie results.
Is that really good to have that operation on database and what are the time complexities if I execute that in single node instance and ha cluster.
This will give you a rough guide on the computational complexity involved in your calculation.
For each of your Actors Neo will look for all the Acted_In relationships going from that node. Lets assume that the average number of Acted_In relationships is 4 per Actor.
Therefore Neo will require 4 traversals per Actor.
Therefore for 1000 Actors that will be 4000 traversals.
Which for Neo is not a lot (they claim to do about 1 million a second, but of course this depends upon hardware)
Then, the Distinct aspect of the query is trivial for Neo as it knows which Nodes it has visited, so Neo would automatically have the unique list of Movie nodes, so this would be very quick.
If the Release date of the movie is indexed in Neo the ordering of the results would also be very quick.
So theoretically this query should run quickly (well under a second) and have minimal impact on the database
Here is what I'd do, I would start traversing from the actor with the lowest degree, i.e. the highest selectivity of your dataset. Then find the movies he acted in and check those movies against the rest of the actors.
The second option might be more efficient implementation wise. (There is also another trick that can speed up that one even more, let me know via email when you have the dataset to test it on).
MATCH (n:Actor) WHERE n.id IN {ids}
WITH n, SIZE( (n)-[:ACTED_IN]->() ) as degree
ORDER BY degree ASC
WITH collect(n) as actors WITH head(actors) as first, tail(actors) as rest, size(actors)-1 as number
// either
MATCH (n)-[:ACTED_IN]->(m)
WHERE size( (m)<-[:ACTED_IN]->() ) > number AND ALL(a in rest WHERE (a)-[:ACTED_IN]->(m))
RETURN m;
// or
MATCH (n)-[:ACTED_IN]->(m)
WHERE size( (m)<-[:ACTED_IN]->() ) > number
MATCH (m)<-[:ACTED_IN]-(a)
WHERE a IN rest
WITH m,count(*) as c, number
WHERE c = number
RETURN m;
I have a Neo4j database (version 2.0.0) containing words and their etymological relationships with other words. I am currently able to create "word networks" by traversing these word origins, using a variable depth Cypher query.
For client-side performance reasons (these networks are visualized in JavaScript), and because the number of relationships varies significantly from one word to the next, I would like to be able to make the depth traversal conditional on the number of nodes. My query currently looks something like this:
start a=node(id)
match p=(a)-[r:ORIGIN_OF*1..5]-(b)
where not b-->()
return nodes(p)
Going to a depth of 5 usually yields very interesting results, but at times delivers far too many nodes for my client-side visualization to handle. I'd like to check against, for example, sum(length(nodes(p))) and decrement the depth if that result exceeds a particular maximum value. Or, of course, any other way of achieving this goal.
I have experimented with adding a WHERE clause to the path traversal, but this is specific to individual paths and does not allow me to sum() the total number of nodes.
Thanks in advance!
What you're looking to do isn't fairly straight forward in a single query. Assuming you are using labels and indexing on the word property, the following query should do what you want.
MATCH p=(a:Word { word: "Feet" })-[r:ORIGIN_OF*1..5]-(b)
WHERE NOT (b)-->()
WITH reduce(pathArr =[], word IN nodes(p)| pathArr + word.word) AS wordArr
MATCH (words:Word)
WHERE words.word IN wordArr
WITH DISTINCT words
MATCH (origin:Word { word: "Feet" })
MATCH p=shortestPath((words)-[*]-(origin))
WITH words, length(nodes(p)) AS distance
RETURN words
ORDER BY distance
LIMIT 100
I should mention that this most likely won't scale to huge datasets. It will most likely take a few seconds to complete if there are 1000+ paths extending from your origin word.
The query basically does a radial distance operation by collecting all distinct nodes from your paths into a word array. Then it measures the shortest path distance from each distinct word to the origin word and orders by the closest distance and imposes a maximum limit of results, for example 100.
Give it a try and see how it performs in your dataset. Make sure to index on the word property and to apply the Word label to your applicable word nodes.
what comes to my mind is an stupid optimalization of graph:
what you need to do is to ad an information into each node, which will show up how many connections it has for each depth from 1 to 5, ie:
start a=node(id)
match (a)-[r:ORIGIN_OF*1..1]-(b)
with count(*) as cnt
set a.reach1 = cnt
...
start a=node(id)
match (a)-[r:ORIGIN_OF*5..5]-(b)
where not b-->()
with count(*) as cnt
set a.reach5 = cnt
then, before each run of your question query above, check if the number of reachX < you_wished_results and run the query with [r:ORIGIN_OF*X..X]
this would have some consequences - either you would have to run this optimalisation each time after new items or updates happens to your db, or after each new node /updated node you must add the reachX param to the update