Is this Cypher Neo4j query working as I expect?

So I have a graph with Users and Places. Users are connected to Places by MEMBER_OF relationships. I want to find suggestions of Places that a User might like to be a MEMBER_OF, based on which Users are in which Places. So if a User is already in one Place, and many other Users in that Place are also in another Place, then that other Place should be suggested, as long as the original User is not already in it.
So here's what I've come up with, and it does yield results, but I want to make sure that the suggested Places are not just random. Is this query properly ranking Places that should be suggested? Or is it just a random collection of Places that fit the criterion?
MATCH (a:User {username:'johndoe123'})-[:MEMBER_OF]->()<-[:MEMBER_OF]-(b:User)
MATCH (b)-[r:MEMBER_OF]->(suggestion)
WHERE NOT (a)-[:MEMBER_OF]->(suggestion)
RETURN suggestion limit 5

You are close, but you are returning a suggestion every time another user is associated with that Place. You probably only want to return each Place suggestion once, and you probably want to rank the suggestions by their frequency. Try this.
MATCH (a:User {username:'johndoe123'})-[:MEMBER_OF]->()<-[:MEMBER_OF]-(b:User)
MATCH (b)-[r:MEMBER_OF]->(suggestion)
WHERE NOT (a)-[:MEMBER_OF]->(suggestion)
RETURN suggestion, count(*) AS otherUserCount
ORDER BY otherUserCount DESC limit 5
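To see why count(*) turns the result into a ranking, here is a minimal Python sketch of the same logic on made-up data. The membership dictionary, the place names, and the function suggest_places are illustrative assumptions, not part of the original graph. Each matched path (user, shared place, other user, suggestion) adds one to the suggestion's count, which mirrors what count(*) aggregates in the query above.

```python
from collections import Counter

# Hypothetical membership data (user -> set of places); not from the original post.
memberships = {
    "johndoe123": {"cafe"},
    "alice": {"cafe", "gym", "library"},
    "bob": {"cafe", "gym"},
    "carol": {"library"},
}

def suggest_places(user, memberships, limit=5):
    """Rank places the user is not in by the number of co-member paths reaching them."""
    mine = memberships[user]
    counts = Counter()
    for other, places in memberships.items():
        if other == user:
            continue
        shared = len(mine & places)  # one matched path per shared place
        if shared == 0:
            continue  # this user shares no place with the original user
        for place in places - mine:  # WHERE NOT (a)-[:MEMBER_OF]->(suggestion)
            counts[place] += shared
    # ORDER BY otherUserCount DESC LIMIT limit
    return counts.most_common(limit)

print(suggest_places("johndoe123", memberships))  # [('gym', 2), ('library', 1)]
```

Note that another user who shares two Places with the original user contributes two paths, so both in this sketch and in the Cypher query they are counted twice; count(DISTINCT b) would count each user once.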

Related

Improve performance for querying count

I have a table on my website that shows a list of links and the number of times they have been visited. Here's the Cypher query I use to get this data:
MATCH (u:USER {email: $email})-[:CREATED]->(l:URL)
OPTIONAL MATCH (l)<-[v:VISITED]-(:VISIT)
RETURN l, COUNT(v) AS count
LIMIT 10
I create a VISIT node for each visit to a URL in order to store analytics data for that visit. So in the above code, I grab the links that a user has created and count the visits for each one.
The problem is that the above query is not performant. Now that the data has grown huge, it takes at least 8 seconds to resolve.
Any ways to improve this query?
For the :VISITED relationships, if those only connect :VISIT nodes to :URL nodes, then you can use the size() function on the pattern, excluding the node label, which will get the degree information from the :URL node itself without having to expand out (you can confirm this by doing a PROFILE or EXPLAIN of the plan and expand all elements, look for GetDegreePrimitive in the Projection operation).
Also, since you're using LIMIT 10 without any kind of ordering, it's better to do the LIMIT earlier so you only perform subsequent operations with the limited set of nodes rather than doing all the work for all the nodes then only keeping 10.
MATCH (u:USER {email: $email})-[:CREATED]->(l:URL)
WITH l
LIMIT 10
RETURN l, size((l)<-[:VISITED]-()) as count
Also, as noted by cybersam, you'll absolutely want an index on :USER(email) so lookup to your specific :USER node is fast.
In addition to @InverseFalcon's suggestions, you should create either an index or a uniqueness constraint on :USER(email), to avoid having to scan through all USER nodes to find the one of interest.
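The effect of moving LIMIT ahead of the count can be sketched in plain Python. The names below (top_links_with_counts, visits_by_link) are hypothetical stand-ins for the graph data, and the list length stands in for the stored degree of a :URL node. This only illustrates the ordering of the steps, not how Neo4j actually executes the plan.

```python
def top_links_with_counts(links, visits_by_link, limit=10):
    """Return (link, visit_count) rows for at most `limit` links, plus how many counts ran."""
    rows = []
    counted = 0
    for link in links[:limit]:  # WITH l LIMIT 10: cut the set down first
        visit_count = len(visits_by_link.get(link, []))  # size((l)<-[:VISITED]-())
        rows.append((link, visit_count))
        counted += 1
    return rows, counted

# Hypothetical data: 1000 links, only the first one has visits.
links = [f"url{i}" for i in range(1000)]
visits_by_link = {"url0": ["v1", "v2", "v3"]}
rows, counted = top_links_with_counts(links, visits_by_link)
print(counted)  # 10
```

Only 10 counts are performed, no matter how many links the user has created.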

Neo4j Query Optimization for Cartesian Product

I am trying to implement a user-journey analytics solution: simply put, I want to analyze on which screens users leave the application.
For this, I have modeled the data like this:
I modeled single activity since I want to index some attributes. Relation attributes can not be indexed in Neo4j.
With this model, I am trying to write a query that follows three successive event types with below query:
MATCH (eventType1:EventType {eventName:'viewStart-home'})<--(event:EventNode)
<--(eventType2:EventType{eventName:'viewStart-payment'})
WITH distinct event.deviceId as eUsers, event.clientCreationDate as eDate
MATCH((eventType2)<--(event2:EventNode)
<--(eventType3:EventType{eventName:'viewStart-screen1'}))
WITH distinct event2.deviceId as e2Users, event2.clientCreationDate as e2Date
RETURN e2Users limit 200000
And the execution plan is below:
I could not figure out why the query is processed this way. Can you help me?
Your query is doing a lot more work than it needs to.
The first WITH clause is not needed at all, since its generated eUsers and eDate variables are never used. And the second WITH clause does not need to generate the unused e2Date variable.
In addition, you could first add an index for :EventType(eventName) to speed up the processing:
CREATE INDEX ON :EventType(eventName);
With these changes, your query's profile could be simpler and the processing would be faster.
Here is an updated query (that should use the index to quickly find the EventType node at one end of the path, to kick off the query):
MATCH (:EventType {eventName:'viewStart-home'})<--(:EventNode)
<--(:EventType{eventName:'viewStart-payment'})<--(event2:EventNode)
<--(:EventType{eventName:'viewStart-screen1'})
RETURN distinct event2.deviceId as e2Users
LIMIT 200000;
Here is an alternate query that uses 2 USING INDEX hints to tell the planner to quickly find the :EventType nodes at both ends of the path to kick off the query. This might be even faster than the first query:
MATCH (a:EventType {eventName:'viewStart-home'})<--(:EventNode)
<--(:EventType{eventName:'viewStart-payment'})<--(event2:EventNode)
<--(b:EventType{eventName:'viewStart-screen1'})
USING INDEX a:EventType(eventName)
USING INDEX b:EventType(eventName)
RETURN distinct event2.deviceId as e2Users
LIMIT 200000;
Try profiling them both on your DB, and pick the best one or keep tweaking further.

Neo4j - complete a query with an alternative match if it finds few results

I am trying to write a query which looks for potential friends in a Neo4j db based on common friends and interests.
I don't want to post the whole query (part of school assignment), but this is the important part
MATCH (me:User {firstname: "Name"}), (me)-[:FRIEND]->(friend:User)<-[:FRIEND]-(potential:User), (me)-[:MEMBER]->(i:Interest)
WHERE NOT (potential)-[:FRIEND]->(me)
WITH COLLECT(DISTINCT potential) AS potentialFriends,
COLLECT(DISTINCT friend) AS friends,
COLLECT(i) as interests
UNWIND potentialFriends AS potential
/*
#HANDLING_FINDINGS
Here I count common friends, interests and try to find relationships between
potential friends too -- hence the collect/unwind
*/
RETURN potential,
commonFriends,
commonInterests,
(commonFriends+commonInterests) as totalPotential
ORDER BY totalPotential DESC
LIMIT 10
In the section #HANDLING_FINDINGS I use the found potential friends to find relationships between each other and calculate their potential (i.e. sum of shared friends and common interests) and then order them by potential.
The problem is that there might be users with no friends, to whom I would also like to recommend some friends.
My question - can I somehow insert a few random users into the "potential" findings if their count is below 10 so that everyone gets a recommendation?
I have tried something like this
...
UNWIND potentialFriends AS potential
CASE
WHEN (count(potential) < 10 )
...
But that produced an error as soon as it hit start of the CASE. I think that case can be used only as part of a command like return? (maybe just return)
Edit with 2nd related question:
I was already thinking of matching all users and then ranking them based on common friends/interests, but wouldn't searching through the whole DB be intensive?
A CASE expression can be used wherever a value is needed, but it cannot be used as a complete clause.
With respect to your main question, you can put a WITH clause like the following between your existing WITH and UNWIND clauses:
WITH friends, interests,
CASE WHEN SIZE(potentialFriends) < 10 THEN {randomFriends} ELSE potentialFriends END AS potentialFriends
If the size of the potentialFriends collection is less than 10, the CASE expression assigns the value of the {randomFriends} parameter to potentialFriends.
As for your second question, yes it would be expensive.
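As a rough illustration of what that CASE expression does, here is the same decision in Python. The function name and the sample lists are assumptions made for the example; random_friends corresponds to the randomFriends query parameter in the answer.

```python
def choose_potential_friends(potential_friends, random_friends, minimum=10):
    """Mirror CASE WHEN SIZE(potentialFriends) < 10 THEN {randomFriends} ELSE potentialFriends END."""
    if len(potential_friends) < minimum:
        return random_friends
    return potential_friends

found = ["ann", "ben"]                      # fewer than 10 matches (made-up data)
fallback = [f"user{i}" for i in range(10)]  # hypothetical value of the randomFriends parameter
chosen = choose_potential_friends(found, fallback)
```

Note that, as in the CASE expression, the fallback list replaces the found candidates entirely; it does not top them up to 10.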

Is there anything like a "do while" match pattern that satisfies an aggregated value? (properties etc.)

I don't know if this makes sense using Cypher or graph traversal, but I was trying to do a sort of "shortest path" query based not on weighted relationships but on aggregated properties.
Assume I have nodes labeled People, and they all visit different homepages, with a VISITS relationship to the homepage node. Each homepage node has hit stats depending on its popularity. Now I would like to match people that have a VISITS relationship to a homepage until I reach a maximum of X exposures (hits).
Why? Because then I know an "expected" exposure strategy for a certain group of people.
Something like
Do
MATCH (n:People)-[:VISITS]-(sites)
while (reduce (x)<100000)
Of course, this "do while" is nothing I have seen in the Cypher syntax, but wouldn't it be useful? Or should this be done at the app level, by just returning a DESC-sorted list and doing the math in the application? Maybe it should also be combined with some fallback case if the loop can't be satisfied.
MATCH (n:People)-[:VISITS]-(sites)
WITH reduce(hits=0, x IN collect(sites.dailyhits)| hits + x) AS totalhits
RETURN totalhits;
This returns the correct aggregated hits value (for all sites), but I would like it to run over each matched pattern until it satisfies a value and then return the match. (Of course I would miss other possible mixes of pages because the match never traverses the entire graph, but at least I would get a list of pages that meets the requirement, if that makes sense.)
Thanks!
Not sure how you'd aggregate, but there are several aggregation functions (avg, sum, etc). And... you can pass these to a 2nd part of the cypher query, with a WITH clause.
That said: Cypher also supports the ability to sort a result (ORDER BY), and the ability to limit the number of results given (LIMIT). I don't know what you'd sort by, but... just for fun, let's sort it arbitrarily on something:
MATCH (n:People)-[v:VISITS]->(site:Site)
WHERE site.url= "http://somename.com"
RETURN n
ORDER BY v.VisitCount DESC
LIMIT 1000
This would cap your return set at 1,000 people, for people who visit a given site.
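If the accumulation is done at the application level, as the question suggests, the "do while" part is a short loop over the sorted rows. The sketch below assumes the query returned (site, dailyhits) pairs, and it assumes that "max X" means the running total must not exceed the cap; the data and the function name are hypothetical.

```python
def sites_until_exposure(sites, max_hits):
    """Take sites in descending dailyhits order until the next one would exceed max_hits."""
    chosen = []
    total = 0
    for name, hits in sorted(sites, key=lambda s: s[1], reverse=True):
        if total + hits > max_hits:
            break  # adding this site would go past the cap, so stop here
        chosen.append(name)
        total += hits
    return chosen, total

# Hypothetical rows as they might come back from the database.
rows = [("b.example", 30000), ("a.example", 60000), ("c.example", 20000)]
print(sites_until_exposure(rows, 100000))  # (['a.example', 'b.example'], 90000)
```

Stopping at the first site that does not fit is a greedy choice; it will not necessarily find the combination of pages that comes closest to the cap.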

efficiency of where clause in cypher vs match

I'm trying to find 10 posts that were not LIKED by user "mike" using Cypher. Will putting a WHERE clause with a NOT relationship be more efficient than matching with an optional relationship and then checking whether that relationship is null in the WHERE clause? Specifically, I want to make sure it won't do the equivalent of a full table scan, and that this is a scalable query.
Here's what I'm using
START user=node:node_auto_index(uname="mike"),
posts=node:node_auto_index("postId:*")
WHERE not (user-[:LIKES]->posts)
RETURN posts SKIP 20 LIMIT 10;
Or can I do something where I filter on a MATCH optional relationship
START user=node:node_auto_index(uname="mike"),
posts=node:node_auto_index("postId:*")
MATCH user-[r?:LIKES]->posts
WHERE r IS NULL
RETURN posts SKIP 100 LIMIT 10;
Some quick tests on the console seem to show faster performance in the 2nd approach. Am I right to assume the 2nd query is faster? And, if so why?
I think that in the first query the engine runs through all postId nodes and checks the condition not (user-[:LIKES]->posts) for each one,
whereas in the second query (assuming you use at least v1.9.02) the engine picks up only the post nodes that are not connected to the user. This is an optimization in which the engine does not go through all postId nodes.
If possible, always use the MATCH clause in your queries instead of WHERE, and try to omit the asterisk in the declaration START n=node:index('name:*')

Resources