Improve performance for querying count - neo4j

I have a table on my website showing a list of links and the number of times each has been visited. Here's the Cypher query I use to get that data:
MATCH (u:USER {email: $email})-[:CREATED]->(l:URL)
OPTIONAL MATCH (l)<-[v:VISITED]-(:VISIT)
RETURN l, COUNT(v) AS count
LIMIT 10
I create a VISIT node for each visit to a URL in order to store analytics data per visit. So in the above code, I grab the links that a user has created and count the visits for each one.
The problem is that the above query is not performant. Now that the data has grown large, it takes at least 8 seconds to resolve.
Any ways to improve this query?
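For context, the write that records a visit looks roughly like this (the parameter and property names here are just placeholders, not the exact ones I use):

MATCH (l:URL {id: $urlId})
CREATE (:VISIT {at: timestamp(), referrer: $referrer})-[:VISITED]->(l)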

If :VISITED relationships only ever connect :VISIT nodes to :URL nodes, you can call size() on the pattern with the node label omitted. This reads the degree information stored on the :URL node itself, without expanding any relationships (you can confirm this by running PROFILE or EXPLAIN on the query, expanding all elements of the plan, and looking for GetDegreePrimitive in the Projection operation).
Also, since you're using LIMIT 10 without any kind of ordering, it's better to apply the LIMIT earlier, so subsequent operations run only on the limited set of nodes instead of doing all the work for every node and then keeping only 10.
MATCH (u:USER {email: $email})-[:CREATED]->(l:URL)
WITH l
LIMIT 10
RETURN l, size((l)<-[:VISITED]-()) as count
Also, as noted by cybersam, you'll absolutely want an index on :USER(email) so the lookup of your specific :USER node is fast.

In addition to @InverseFalcon's suggestions, you should create either an index or a uniqueness constraint on :USER(email), to avoid having to scan through all USER nodes to find the one of interest.
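For reference, the two options can be sketched like this (Neo4j 3.x syntax; a uniqueness constraint also creates an index behind the scenes, so only one of the two is needed):

CREATE INDEX ON :USER(email);

CREATE CONSTRAINT ON (u:USER) ASSERT u.email IS UNIQUE;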

Related

Adding a property filter to cypher query explodes memory, why?

I'm trying to write a query that explores a DAG-type graph (a bill of materials) for all construction paths leading down to a specific part number (second MATCH), among all the parts associated with a given product (first MATCH). There is a strange behavior I don't understand:
This query runs in a reasonable time using Neo4j community edition (~2 s):
WITH '12345' as snid, 'ABCDE' as pid
MATCH (m:Product {full_sn:snid})-[:uses]->(p:Part)
WITH snid, pid, collect(p) AS mparts
MATCH path=(anc:Part)-[:has*]->(child:Part)
WHERE ALL(node IN nodes(path) WHERE node IN mparts)
WITH snid, path, relationships(path)[-1] AS rel,
nodes(path)[-2] AS parent, nodes(path)[-1] AS child
RETURN stuff I want
However, to get the query I want, I must add a filter on the child using the part number pid in the second MATCH statement:
MATCH path=(anc:Part)-[:has*]->(child:Part {pn:pid})
And when I try to run the new query, the Neo4j browser complains that there is not enough memory (Neo.TransientError.General.OutOfMemoryError). When I run it with EXPLAIN, the db hits explode into the tens of billions, as if I'm asking it for a massive cartesian product: but all I have done is add a restriction on the child, so this should reduce the search space, shouldn't it?
I also tried adding an index on :Part(pn). Now the plan shown by EXPLAIN looks very efficient, but I still get the same memory error.
If anyone can help me understand why this change between the two queries is causing problems, I'd greatly appreciate it!
Best wishes,
Ben
MATCH path=(anc:Part)-[:has*]->(child:Part)
The * is exploding to every downstream child node.
That's appropriate if that is what's desired. If you make this an OPTIONAL MATCH and limit it to the collected items, that should restrict the returned results.
OPTIONAL MATCH path=(anc:Part)-[:has*]->(child:Part)
This is conceptually (and crudely) similar to an inner join in SQL.
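Applied to the original query, the suggestion might look like this (a sketch; it keeps the ALL() membership filter so the variable-length expansion stays inside the collected parts):

WITH '12345' AS snid, 'ABCDE' AS pid
MATCH (m:Product {full_sn: snid})-[:uses]->(p:Part)
WITH snid, pid, collect(p) AS mparts
OPTIONAL MATCH path = (anc:Part)-[:has*]->(child:Part {pn: pid})
WHERE ALL(node IN nodes(path) WHERE node IN mparts)
RETURN path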

How to limit recursion depending on node and relationship properties for each connected node pair

Let's start from simple query that finds all coworkers recursively.
match (user:User {username: 'John'})
match (user)-[:WORKED_WITH *1..]-(coworkers:User)
return user, coworkers
Now, I have to modify it to receive only those users that are connected by the first N relationships.
Every User node has a value N in its properties, and every relationship has its creation date in its properties.
I suppose, that it can be reasonable to create and maintain separate set of relationships that will satisfy this condition.
UPD: The limitation applies only to users who know each other directly.
The limitation has to be applied to each node in the path. E.g., if the first user has 3 :WORKED_WITH relationships (on the first level) and a limit of 5, everything is OK and we can continue checking connected users; if a user has 6 relationships and a limit of 5, only 5 of those relationships may be used to move on.
I understand that this can be a slow query, but how can it be done without hand-written tools? One improvement would be to move these limitations out of query execution into a preprocessing step, creating an additional relationship type that encodes them; that would require validation, because those relationships are not part of the state but a projection of it.
The following query should work (as long as you do not have a lot of data). It uses DISTINCT to remove duplicates.
MATCH (user:User {username: 'John'})-[:WORKED_WITH*]-(coworker:User)
WITH DISTINCT user, coworker
ORDER BY coworker.createDate
RETURN COLLECT(coworker)[0..user.N] AS coworkers;
Note: since variable-length paths have exponential complexity, you would usually want to specify a reasonable upper bound (e.g., [:WORKED_WITH*..5]) to avoid the query running too long or causing an out-of-memory error. Also, since the LIMIT operator does not accept a variable as its argument, this query uses COLLECT(coworker)[0..user.N] to get the N coworkers with the earliest createDate -- which is also a bit expensive.
Now, if (as you suggested) you had created a specific type of relationship (e.g., FIRST_WORKED_WITH) between each User and its N earliest "coworkers", that would allow you to use the following trivial and fast query:
MATCH (user:User {username: 'John'})-[:FIRST_WORKED_WITH]->(coworker:User)
RETURN coworker;
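The precomputation step for those FIRST_WORKED_WITH relationships could be sketched like this (assuming the same createDate and N properties as above, and, per the update, considering direct coworkers only):

MATCH (u:User)-[:WORKED_WITH]-(c:User)
WITH u, c
ORDER BY c.createDate
WITH u, collect(DISTINCT c)[0..u.N] AS firstCoworkers
FOREACH (c IN firstCoworkers | MERGE (u)-[:FIRST_WORKED_WITH]->(c))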

Cypher query to get subsets of different node labels, with relations

Let's assume this use case:
We have a few nodes (labeled Big), each with a simple integer ID property.
Each Big node has a relation with millions of nodes (labeled Small),
such as:
(Small)-[:BELONGS_TO]->(Big)
How can I phrase a Cypher query to represent the following in natural language:
For each Big node with an ID in the range 4-7, get me 10 of the Small nodes that belong to it.
The supposed result would give 2 Big nodes, 20 Small nodes, and 20 Relations
The needed result would be represented by this graph:
2 Big nodes, each with a subset of 10 of the Small nodes that belong to it
Here is what I tried, which fails (it only shows one Big node (id=5) along with 10 of its related Small nodes, but doesn't show the second node (id=6)):
MATCH (s:Small)-[:BELONGS_TO]->(b:Big)
Where 4<b.bigID<7
return b,s limit 10
I guess I need a more complex compound query.
Hope I could phrase my question in an understandable way!
As stdob-- says, you can't use limit here, at least not in this way, as it limits the entire result set.
While the aggregation solution will return you the right answer, you'll still pay the cost for the expansion to those millions of nodes. You need a solution that will lazily get the first ten for each.
Using APOC Procedures, you can use apoc.cypher.run() to effectively perform a subquery. The query will be run per-row, so if you limit the rows first, you can call this and use LIMIT within the subquery, and it will properly limit to 10 results per row, lazily expanding so you don't pay for an expansion to millions of nodes.
MATCH (b:Big)
WHERE 4 < b.bigID < 7
CALL apoc.cypher.run('
MATCH (s:Small)-[:BELONGS_TO]->(b)
RETURN s LIMIT 10',
{b:b}) YIELD value
RETURN b, value.s
Your query does not work because the LIMIT applies to the entire preceding result stream.
You need to use the aggregation function collect():
MATCH (s:Small)-[:BELONGS_TO]->(b:Big)
WHERE 4 < b.bigID < 7
WITH b, collect(distinct s)[..10] AS smalls
RETURN b, smalls

Is there anything like a "do while" match pattern that satisfy an aggregated value ? (propeties etc)

I don't know if this makes sense with Cypher or graph traversal, but I was trying to do a sort of "shortest path" query based not on relationship weights but on aggregated properties.
Assume I have nodes labeled People, and they all visit different homepages via a VISITS relationship to the homepage node. Each homepage node has hit stats reflecting its popularity. Now I would like to match people that have a VISITS relationship to a homepage until I reach a maximum of X exposures (hits).
Why? Because then I'd know an "expected" exposure strategy for a certain group of people.
Something like
Do
MATCH (n:People)-[:VISITS]-(sites)
while (reduce (x)<100000)
Of course this "do while" is nothing I have seen in the Cypher syntax, but wouldn't it be useful? Or should this be handled at the app level, by returning a DESC-sorted list and doing the math in the application? Maybe it should also be matched with some case for when the loop can't be satisfied.
MATCH (n:People)-[:VISITS]-(sites)
WITH reduce(hits=0, x IN collect(sites.dailyhits)| hits + x) AS totalhits
RETURN totalhits;
This can return the correct aggregated hits value (over everything), but I would like it to run for each matched pattern until it satisfies a value, and then return the match. (Of course I'd miss other possible mixes between pages because the match never traverses the entire graph, but at least I'd get a list of pages that meet the requirement, if that makes sense.)
Thanks!
Not sure how you'd aggregate, but there are several aggregation functions (avg, sum, etc). And... you can pass these to a 2nd part of the cypher query, with a WITH clause.
That said: Cypher also supports the ability to sort a result (ORDER BY), and the ability to limit the number of results given (LIMIT). I don't know what you'd sort by, but... just for fun, let's sort it arbitrarily on something:
MATCH (n:People)-[v:VISITS]->(site:Site)
WHERE site.url= "http://somename.com"
RETURN n
ORDER BY v.VisitCount DESC
LIMIT 1000
This would cap your return set at 1,000 people, for people who visit a given site.
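That said, a pure-Cypher approximation of the "accumulate until a threshold" idea is possible by sorting the sites and then running reduce() over list prefixes (a sketch, assuming the dailyhits property from the question; 100000 stands in for the X threshold):

MATCH (:People)-[:VISITS]->(site)
WITH DISTINCT site
ORDER BY site.dailyhits DESC
WITH collect(site) AS sites
UNWIND range(0, size(sites) - 1) AS i
WITH sites[i] AS site,
     reduce(h = 0, s IN sites[0..i + 1] | h + s.dailyhits) AS runningHits
WHERE runningHits <= 100000
RETURN site, runningHits

This returns the most popular sites in order, together with the running total, cutting off once the cumulative hits would exceed the threshold.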

Incredibly high query times

I am seeing some extremely high query times and I'm unable to pinpoint the issue.
I have a graph database with 6685 nodes, 26407 properties and 22921 relationships, running on an Amazon EC2 instance with 1.7GB RAM.
My use case is to map people to their various interest points and find for a given user, who are the people who have common interests with him.
I have data about 500 people in my db, and each person has an average of a little more than 100 different interest points related to him.
1) When I run this cypher query:
START u=node(5) MATCH (u)-[:interests]->(i)<-[:interests]-(o) RETURN o;
Here node(5) is a user node. So, I am trying to find all users who have the same ":interests" relation with user (u).
This query returns 2557 rows and takes about 350ms.
2) When I sprinkle in a few extra MATCH conditions, the query time exponentially degrades.
For eg., if I want to find all users who have common interests with user (u) = node(5), and also share the same hometown, I wrote:
START u=node(5)
MATCH (u)-[:interests]->(i)<-[:interests]-(o)
WITH u,o,i
MATCH (u)-[:hometown]->(h)<-[:hometown]-(o)
RETURN u, o, i, h;
This query returns 755 rows and takes about 2500ms!
3) If I add more constraints to the MATCH, like same gender, same alma mater etc., query times progressively worsen to >10,000 ms.
What am I doing wrong here?
Could you try stating the pattern as a whole in your first MATCH clause, i.e. MATCH (u)-[:interests]->(i)<-[:interests]-(o)-[:hometown]->(h)<-[:hometown]-(u) ?
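Spelled out as a full query, that suggestion would look something like this (a sketch, using the same legacy START syntax as the question):

START u=node(5)
MATCH (u)-[:interests]->(i)<-[:interests]-(o),
      (o)-[:hometown]->(h)<-[:hometown]-(u)
RETURN u, o, i, h;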