Neo4j: traversal to find friends of friends who are not yet friends with the user

Using Neo4j 2.0.1, I'm trying to find friends of friends who are not yet friends with the user, at any depth from 2 to 5.
At first I used Cypher to get all friends of friends, but performance was very poor at depths 4 and 5. So I moved from Cypher to the traversal REST API, and later I will use the Neo4jPHP traversal. This is what I have so far:
Note:
- there are 10 users, each with 5 friends
- the user I want to traverse from is user 1
- the traversal depth is 3
Friends List:
User | Friends
1 | 9,2,8,7,5
2 | 1,6,3,8,10
3 | 5,7,1,10,2
4 | 3,10,6,9,5
5 | 4,8,1,9,3
6 | 7,9,3,2,10
7 | 9,5,10,6,8
8 | 6,9,1,10,5
9 | 6,5,10,1,8
10 | 8,6,4,5,9
Cypher:
MATCH (U:User)-[F:Friend]->(FU:User)-[FF:Friend]->(FFU:User)
WHERE U.user_id=1
WITH DISTINCT U, FFU
WHERE FFU<>U
WITH DISTINCT U, FFU
MATCH (FFU:User)-[FFF:Friend]->(FFFU:User)
WHERE FFFU<>U AND NOT (U)-[:Friend]->(FFFU)
RETURN DISTINCT FFFU.username;
Traversal REST API [UPDATED]:
POST http://localhost:7474/db/data/node/1/traverse/node
{
  "order" : "breadth_first",
  "uniqueness" : "node_global",
  "prune_evaluator" : {
    "name" : "none",
    "language" : "builtin"
  },
  "return_filter" : {
    "body" : "position.endNode().getProperty('user_id')!=1 && position.endNode().getProperty('user_id')!=9 && position.endNode().getProperty('user_id')!=2 && position.endNode().getProperty('user_id')!=8 && position.endNode().getProperty('user_id')!=7 && position.endNode().getProperty('user_id')!=5;",
    "language" : "javascript"
  },
  "relationships" : {
    "direction" : "out",
    "type" : "Friend"
  },
  "max_depth" : 3
}
Neo4jPHP Traversal [UPDATED]:
$traversal->addRelationship('Friend', Relationship::DirectionOut)
->setPruneEvaluator(Traversal::PruneNone)
->setReturnFilter('javascript', "position.endNode().getProperty('user_id')!=1 && position.endNode().getProperty('user_id')!=9 && position.endNode().getProperty('user_id')!=2 && position.endNode().getProperty('user_id')!=8 && position.endNode().getProperty('user_id')!=7 && position.endNode().getProperty('user_id')!=5;")
->setMaxDepth(3)
->setUniqueness(Traversal::UniquenessNodeGlobal)
->setOrder(Traversal::OrderBreadthFirst);
Using the traversal REST API and the Neo4jPHP traversal above, I got the result: 9,6,7,3,2,10,5,4,8
The result I want is: 6,3,10,4
because 9,7,2,5,8 are already friends with user 1.
NOTE:
I just updated the way I traverse my graph to find friends of friends at depth of 3, so I updated my question too.
As you can see, I wrote the conditions in return_filter by hand:
"body" : "position.endNode().getProperty('user_id')!=1 && position.endNode().getProperty('user_id')!=9 && position.endNode().getProperty('user_id')!=2 && position.endNode().getProperty('user_id')!=8 && position.endNode().getProperty('user_id')!=7 && position.endNode().getProperty('user_id')!=5;"
In Cypher, by contrast, it is easy to exclude friends of friends who are already friends with user 1:
WHERE NOT (U)-[:Friend]->(FFFU)
Now, how can I express a condition like that in the traversal REST API?
I ask because the documentation says little about this.
Please anyone help me. I really need your help.
Thank you.

Shouldn't this be possible by simply specifying:
MATCH (user:User)-[:Friend*2..4]->(fof)
WHERE NOT (user)-[:Friend]->(fof) AND fof <> user
RETURN DISTINCT fof
Or perhaps I'm missing something: are you using the DISTINCT statements as a way to improve performance? I'm surprised Cypher is not performing well for you here. Would you be able to try your query with the PROFILE command in the Neo4j shell and send me the result? You can email me at jakewins AT gmail.com
As for the traversal, conceptually, I would do this:
1. Start at the user.
2. Find all of the user's friends and put them in a set, 'friends'.
3. Start at each friend and traverse out as many hops as you like.
4. Return each user found that is not in the set of friends.
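The steps above can be sketched in plain Python against an in-memory copy of the friend lists from the question. This is only an illustration of the set-based idea, not code that talks to Neo4j; the function name `suggest_friends` and the dictionary layout are assumptions made for this example.

```python
from collections import deque

# Friend lists copied from the question (directed: user -> friends).
FRIENDS = {
    1: [9, 2, 8, 7, 5],
    2: [1, 6, 3, 8, 10],
    3: [5, 7, 1, 10, 2],
    4: [3, 10, 6, 9, 5],
    5: [4, 8, 1, 9, 3],
    6: [7, 9, 3, 2, 10],
    7: [9, 5, 10, 6, 8],
    8: [6, 9, 1, 10, 5],
    9: [6, 5, 10, 1, 8],
    10: [8, 6, 4, 5, 9],
}


def suggest_friends(graph, start, max_depth):
    """Breadth-first walk from `start`, returning users that are not direct friends."""
    direct = set(graph[start])        # step 2: the 'friends' set
    seen = {start}                    # node-global uniqueness, as in the question
    queue = deque([(start, 0)])
    found = []
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue                  # do not expand past the depth limit
        for nxt in graph[node]:
            if nxt in seen:
                continue
            seen.add(nxt)
            queue.append((nxt, depth + 1))
            if nxt not in direct:     # step 4: keep only non-friends
                found.append(nxt)
    return found


if __name__ == "__main__":
    print(sorted(suggest_friends(FRIENDS, 1, 3)))
```

On the sample data this yields exactly the four users the question asks for (3, 4, 6 and 10), because the direct friends are still expanded but never reported.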
I don't believe you can do the set part in the REST traversal API, which leaves two options. The first is to write a server extension, which lets you write this in Java and use the more powerful Java traversal API. You can read about extensions here: http://docs.neo4j.org/chunked/stable/server-unmanaged-extensions.html and about the Java traversal API here: http://docs.neo4j.org/chunked/stable/tutorial-traversal-java-api.html
Alternatively, you can make two calls: one to fetch all of the user's friends, and one to run the REST traversal with those friends embedded in the script you send over, as you do in your question but with your app generating the filter code.
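A minimal sketch of the "app generates the filter code" idea, in Python. It only builds the request body shown in the question; it does not perform the HTTP calls. The helper names `build_return_filter` and `build_traversal_payload` are hypothetical, and the friend ids are assumed to have been fetched by the first call.

```python
import json


def build_return_filter(user_id, friend_ids):
    """Build the JavaScript return_filter body excluding the user and their friends."""
    excluded = [user_id] + list(friend_ids)
    # int() keeps arbitrary text out of the generated script.
    clauses = [
        "position.endNode().getProperty('user_id')!=%d" % int(uid)
        for uid in excluded
    ]
    return " && ".join(clauses) + ";"


def build_traversal_payload(user_id, friend_ids, max_depth=3):
    """Assemble the same JSON body that the question posts to the traverse endpoint."""
    return {
        "order": "breadth_first",
        "uniqueness": "node_global",
        "prune_evaluator": {"name": "none", "language": "builtin"},
        "return_filter": {
            "body": build_return_filter(user_id, friend_ids),
            "language": "javascript",
        },
        "relationships": {"direction": "out", "type": "Friend"},
        "max_depth": max_depth,
    }


if __name__ == "__main__":
    # Friend ids of user 1 as listed in the question.
    payload = build_traversal_payload(1, [9, 2, 8, 7, 5])
    print(json.dumps(payload, indent=2))
```

For user 1 the generated filter body is character-for-character the hand-written one from the question, so the traversal call itself does not need to change.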

Related

Low performance of neo4j

I am a server engineer at a company that provides a dating service.
Currently I am building a PoC for our new recommendation engine.
I am trying to use Neo4j, but the performance of this database does not meet our needs.
I have a strong feeling that I am doing something wrong and that Neo4j can do much better.
So can someone give me advice on how to improve the performance of my Cypher query, or on how to tune Neo4j properly?
I am using neo4j-enterprise-2.3.1, running on a c4.4xlarge instance with Amazon Linux.
In our dataset each user can have 4 types of relationships with other users: LIKE, DISLIKE, BLOCK and MATCH.
Each user also has properties such as countryCode, birthday and gender.
I imported all our users and relationships from an RDBMS into Neo4j using the neo4j-import tool.
So each user is a node with properties, and each reference is a relationship.
The report from the neo4j-import tool said that:
2 558 667 nodes,
1 674 714 539 properties and
1 664 532 288 relationships
were imported.
So it's a huge DB :-) In our case some nodes can have up to 30 000 outgoing relationships.
I created 3 indexes in Neo4j:
Indexes
ON :User(userId) ONLINE
ON :User(countryCode) ONLINE
ON :User(birthday) ONLINE
Then I tried to build an online recommendation engine using this query:
MATCH (me:User {userId: {source_user_id} })-[:LIKE | :MATCH]->()<-[:LIKE | :MATCH]-(similar:User)
USING INDEX me:User(userId)
USING INDEX similar:User(birthday)
WHERE similar.birthday >= {target_age_gte} AND
similar.birthday <= {target_age_lte} AND
similar.countryCode = {target_country_code} AND
similar.gender = {source_gender}
WITH similar, count(*) as weight ORDER BY weight DESC
SKIP {skip_similar_person} LIMIT {limit_similar_person}
MATCH (similar)-[:LIKE | :MATCH]-(recommendation:User)
WITH recommendation, count(*) as sheWeight
WHERE recommendation.birthday >= {recommendation_age_gte} AND
recommendation.birthday <= {recommendation_age_lte} AND
recommendation.gender= {target_gender}
WITH recommendation, sheWeight ORDER BY sheWeight DESC
SKIP {skip_person} LIMIT {limit_person}
MATCH (me:User {userId: {source_user_id} })
WHERE NOT ((me)--(recommendation))
RETURN recommendation
Here is the execution plan for one of the users:
[image: execution plan]
When I executed this query for a list of users, I got this result:
count=2391, min=4565.128849, max=36257.170065, mean=13556.750555555178, stddev=2250.149335254768, median=13405.409811, p75=15361.353029999998, p95=17385.136478, p98=18040.900481, p99=18426.811424, p999=19506.149138, mean_rate=0.9957385490980866, m1=1.2148195797996817, m5=1.1418078036067119, m15=0.9928564378521962, rate_unit=events/second, duration_unit=milliseconds
So even the fastest run is too slow for real-time recommendations.
Can you tell me what I am doing wrong?
Thanks.
EDIT 1: plan with the expanded boxes:

I built an unmanaged extension to see if I could do better than Cypher. You can grab it here: https://github.com/maxdemarzi/social_dna
This is a first shot; there are a couple of things we can do to speed things up. We can pre-calculate and save similar users, cache things here and there, and use various other tricks. Give it a shot and let us know how it goes.
Regards,
Max

If I'm reading this right, it's finding all matches for users by userId and separately finding all matches for users by your various criteria. It's then finding all of the places where they come together.
Since you have a case where you're starting on the left with a single node, my guess is that we'd be better served by following the paths first and then filtering what the relationship traversal returns.
Let's see how starting like this works for you:
MATCH
(me:User {userId: {source_user_id} })-[:LIKE | :MATCH]->()
<-[:LIKE | :MATCH]-(similar:User)
WITH similar
WHERE similar.birthday >= {target_age_gte} AND
similar.birthday <= {target_age_lte} AND
similar.countryCode = {target_country_code} AND
similar.gender = {source_gender}
WITH similar, count(*) as weight ORDER BY weight DESC
SKIP {skip_similar_person} LIMIT {limit_similar_person}
MATCH (similar)-[:LIKE | :MATCH]-(recommendation:User)
WITH recommendation, count(*) as sheWeight
WHERE recommendation.birthday >= {recommendation_age_gte} AND
recommendation.birthday <= {recommendation_age_lte} AND
recommendation.gender= {target_gender}
WITH recommendation, sheWeight ORDER BY sheWeight DESC
SKIP {skip_person} LIMIT {limit_person}
MATCH (me:User {userId: {source_user_id} })
WHERE NOT ((me)--(recommendation))
RETURN recommendation

[UPDATED]
One possible (and nonintuitive) cause of inefficiency in your query is that when you specify the similar:User(birthday) filter, Cypher uses an index seek with the :User(birthday) index (and additional tests for countryCode and gender) to find all possible DB matches for similar. Let's call that large set of similar nodes A.
Only after finding A does the query filter to see which of those nodes are actually connected to me, as specified by your MATCH pattern.
Now, if there are relatively few me to similar paths (as specified by the MATCH pattern, but without considering its WHERE clause) as compared to the size of A -- say, 2 or more orders of magnitude smaller -- then it might be faster to remove the :User label from similar (since I presume they are probably all going to be users anyway, in your data model), and also remove the USING INDEX similar:User(birthday) clause. In this case, not using the index for similar may actually be faster for you, since you will only be using the WHERE clause on a relatively small set of nodes.
The same considerations also apply to the recommendation node.
Of course, this all has to be verified by testing on your actual data.

Trying to find the most connected node in cypher

I'm using Neo4j 2.0 M6 and I'm trying to find the most connected nodes and list them ordered by their number of connections, descending. I have tried many snippets from other posts, but without success.
The data I have is simple:
create (Account1 { id:123}),
(Account2 { id:456}),
(Account3 { id:789}),
(Account4 { id:101}),
(PERMISSION1 { name: 'ChangeOneThing'}),
(PERMISSION2 { name: 'ChangeAnotherThing'}),
(PERMISSION3 { name: 'ConsumeThePlanet'}),
(Account1)-[:ADDED]->(PERMISSION1),
(Account1)-[:ADDED]->(PERMISSION2),
(Account2)-[:ADDED]->(PERMISSION2),
(Account4)-[:ADDED]->(PERMISSION2),
(Account3)-[:REMOVED]->(PERMISSION3)
What I need as the result is something like the following, as I am trying to determine which permissions are added most often in order to create groupings in a new system.
PermissionName        Count
===========================
ChangeAnotherThing    3
ChangeOneThing        1
This will allow me to determine the most popular groups of permissions that have been assigned to accounts which will help me to simplify the current infinitely custom allocations into small groups.
I'm very new to Cypher, and here is my attempt at getting it to work:
match (account)-[:ADDED]->(permission)<-[:ADDED]-(other_account)
return count(permission) as score, collect(permission.name) as permissions
order by score desc
But that just gives me:
6 ChangeAnotherThing, ChangeAnotherThing, ChangeAnotherThing, ChangeAnotherThing, ChangeAnotherThing, ChangeAnotherThing

If I understand you right, you want something like this: take each permission, find all accounts that have added it, and count them to see how many times the permission is used. The easiest way to do this is probably to label your accounts and permissions when you create the graph, for instance:
CREATE (acct1:Account {id:123}), (acct2:Account {id:456}),
(acct3:Account {id:789}), (acct4:Account {id:101}),
(perm1:Permission {name:'ChangeOneThing'}),
(perm2:Permission {name:'ChangeAnotherThing'}),
(perm3:Permission {name:'ConsumeThePlanet'}),
(acct1)-[:ADDED]->(perm1), (acct1)-[:ADDED]->(perm2),
(acct2)-[:ADDED]->(perm2), (acct4)-[:ADDED]->(perm2),
(acct3)-[:REMOVED]->(perm3)
and then query it like this:
MATCH (permission:Permission)<-[:ADDED]-(account:Account)
RETURN permission.name, COUNT(account) AS score
ORDER BY score DESC
You don't have to count or group the permissions yourself: when you return a, count(b), a becomes a grouping key, so you get one row for each a together with the aggregate value of b.
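To make the grouping-key behaviour concrete, here is a small Python sketch of the same aggregation over the ADDED relationships from the question. It is only an analogy for what RETURN permission.name, COUNT(account) does, not how Neo4j implements it; the list name `ADDED` and the function `permission_counts` are made up for this example.

```python
from collections import Counter

# (account id, permission name) pairs for the ADDED relationships in the question.
# The single REMOVED relationship is left out, as the MATCH pattern would skip it.
ADDED = [
    (123, "ChangeOneThing"),
    (123, "ChangeAnotherThing"),
    (456, "ChangeAnotherThing"),
    (101, "ChangeAnotherThing"),
]


def permission_counts(added):
    """One row per permission name (the grouping key) with the number of accounts."""
    counts = Counter(permission for _account, permission in added)
    # Equivalent of ORDER BY score DESC, with the name as a tie-breaker.
    return sorted(counts.items(), key=lambda row: (-row[1], row[0]))


if __name__ == "__main__":
    for name, score in permission_counts(ADDED):
        print(name, score)
```

The non-aggregated value plays the role of the dictionary key, which is why no explicit GROUP BY is needed in Cypher.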

Neo4j / Cypher : order by and where, know the position of the result in the sort

Is it possible to have an ORDER BY on a property together with a WHERE clause, and know the index/position of the result?
I mean, when using ORDER BY for sorting, we need to be able to know the position of the result in the sort.
Imagine a scoreboard with 1 million user nodes. I do an ORDER BY on user.score with a WHERE "name = user_name", and I want to know the current rank of the user. I can't find how to do this using ORDER BY ...
start game=node(1)
match game-[:has_child_user]->user
with user
order by user.score
with user
where user.name = "my_user"
return user , "the position in the sort";
The expected result would be:
node_user | rank
(I don't want to fetch one million entries on the client side just to know the current rank/position of a node in the ORDER BY!)

This functionality does not exist today in Cypher. Do you have an example of what this would look like in SQL? Would the below be something that fits the bill? (just a sketch, not working!)
(your code)
start game=node(1)
match game-[:has_child_user]->user
with user
order by user.score
(+ this code)
with user, index() as rank
return user.name, rank;
If you have more thoughts or want to start hacking on this please open an issue at https://github.com/neo4j/neo4j/issues
For the time being, there is a workaround you can use:
start n=node(0),rank_node=node(1)
match n-[r:rank]->rn
where rn.score <= rank_node.score
return rank_node,count(*) as pos;
For live example see: http://console.neo4j.org/?id=bela20
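The idea behind the workaround is that a node's position equals the number of nodes whose score does not exceed its own, so no sorting is needed. A rough Python sketch of that counting rule follows; the scores are invented for illustration, and `rank_of` is a hypothetical helper.

```python
# Hypothetical scores, only for illustration.
SCORES = {"ann": 10, "bob": 25, "cat": 40, "my_user": 25}


def rank_of(scores, name):
    """Position in ascending score order: count entries with score <= the target's."""
    target = scores[name]
    return sum(1 for score in scores.values() if score <= target)


if __name__ == "__main__":
    print(rank_of(SCORES, "my_user"))
```

As in the Cypher version, ties are counted, so two users with the same score get the same position. For a scoreboard where the highest score is rank 1, the comparison would be flipped to `>=`.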

Cypher: Is it possible to find creepy people following my friends?

Let's say I've pulled down the Twitter graph local to myself into Neo4j. I want to find people who follow my friends in numbers larger than should be expected. More specifically, I want to find people who follow the people I follow, with the results sorted so that the person following the highest number of my friends comes first. Is this possible in Cypher?
Here's a console example:
http://console.neo4j.org/r/p36cgj
create (me {n:"a"}), (fo1 {n:"fo1"}), (fo2 {n:"fo2"}), (fo3 {n:"fo3"}), (fr1 {n:"fr1"}),
(fr2 {n:"fr2"}), (fr3 {n:"fr3"}),
fo1-[:follows]->me, fo2-[:follows]->me, fo3-[:follows]->me, me-[:follows]->fr1,
me-[:follows]->fr2, me-[:follows]->fr3, fo1-[:follows]->fr1, fo2-[:follows]->fr2,
fo1-[:follows]->fr2, fo1-[:follows]->fr3;
start me=node:node_auto_index(n="me")
match me-[:follows]->friends<-[:follows]-follower-[:follows]->me
return follower, count(friends) as creepinessFactor, length(me-[:follows]->()) as countIFollow
order by creepinessFactor desc;
I'm curious to hear the results, btw. :P
You could also add a WHERE clause like:
where not(me-[:follows]->follower)
to avoid getting friends who are already within your circle.
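For readers who want to check the counting logic outside the console, here is a hedged Python sketch of what the query computes on the sample graph. The edge list mirrors the CREATE statement, using the variable names (me, fo1, fr1, ...) as node names; the function `creepiness` and the `exclude_friends` flag are assumptions for this example.

```python
from collections import Counter

# (follower, followed) pairs taken from the CREATE statement in the question.
FOLLOWS = [
    ("fo1", "me"), ("fo2", "me"), ("fo3", "me"),
    ("me", "fr1"), ("me", "fr2"), ("me", "fr3"),
    ("fo1", "fr1"), ("fo2", "fr2"), ("fo1", "fr2"), ("fo1", "fr3"),
]


def creepiness(follows, me, exclude_friends=False):
    """Count, per follower of `me`, how many of the people `me` follows they also follow."""
    friends = {b for a, b in follows if a == me}
    my_followers = {a for a, b in follows if b == me}
    counts = Counter(
        a for a, b in follows
        if b in friends and a in my_followers and a != me
    )
    if exclude_friends:
        # Mirrors the optional: where not(me-[:follows]->follower)
        counts = Counter({a: n for a, n in counts.items() if a not in friends})
    # Highest creepiness factor first, name as tie-breaker.
    return sorted(counts.items(), key=lambda row: (-row[1], row[0]))


if __name__ == "__main__":
    print(creepiness(FOLLOWS, "me"))
```

On this data fo3 does not appear at all, since fo3 follows none of the people that me follows.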

How to get friends of friends that have the same interest?

Getting friends of friends is pretty easy; I have this, which seems to work great.
g.v(1).in('FRIEND').in('FRIEND').filter{it != g.v(1)}
But what I want is to get only friends of friends who have the same interests. In the example below, I want Moe to be suggested to Joe, but not Noe, because they do not share an interest.

You simply need to extend your Gremlin traversal to go over the LIKES edges too:
g.v(1).in('FRIEND').in('FRIEND').filter{it != g.v(1)}.dedup(). \
as('friend').in('LIKES').out('LIKES').filter{it == g.v(1)}. \
back('friend').dedup()
Basically this goes out to friends of friends, as you had before, and saves the position in the pipe under the name friend. It then goes out to mutual likes and searches for the original source node. If it finds one, it jumps back to friend. The dedup() just removes duplicates and may speed up traversals.
The directionality of this may not be 100% correct as you haven't indicated direction of edges in your diagram.
Does this have to be in Gremlin? If Cypher is acceptable, you can do:
START s=node(Joe)
MATCH s-[:FRIEND]-()-[:FRIEND]-fof, s-[:LIKES]-()-[:LIKES]-fof
WHERE s != fof
RETURN fof
The query below suggests people through mutual friends without requiring common likes,
but candidates who do have common likes will come out on top.
Take a look at the ORDER BY.
MATCH (me:User{userid:'34219'})
MATCH (me)-[:FRIEND]-()-[:FRIEND]-(potentialFriend)
WITH me, potentialFriend, COUNT(*) AS friendsInCommon
WITH me,
potentialFriend,
SIZE((potentialFriend)-[:LIKES]->()<-[:LIKES]-(me)) AS sameInterest,
friendsInCommon
WHERE NOT (me)-[:FRIEND]-(potentialFriend)
RETURN potentialFriend, sameInterest, friendsInCommon,
friendsInCommon + sameInterest AS score
ORDER BY score DESC;
If you want only candidates with common likes, add the following condition:
Where sameInterest>0
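The scoring in that query can be sketched in Python on a tiny made-up graph. All data here is hypothetical (the names follow the Joe/Moe/Noe example from the question, while "Al", "Chess" and "Golf" are invented), and `suggest` is an illustrative helper rather than part of any library.

```python
from collections import Counter, defaultdict

# Hypothetical data: undirected FRIEND pairs and per-person LIKES.
FRIEND_PAIRS = [("Joe", "Al"), ("Al", "Moe"), ("Al", "Noe")]
LIKES = {"Joe": {"Chess"}, "Moe": {"Chess"}, "Noe": {"Golf"}, "Al": set()}


def suggest(me, friend_pairs, likes, require_shared_interest=False):
    """Rank friends of friends by friendsInCommon + sameInterest, as in the Cypher query."""
    neighbours = defaultdict(set)
    for a, b in friend_pairs:
        neighbours[a].add(b)
        neighbours[b].add(a)

    friends_in_common = Counter()
    for friend in neighbours[me]:
        for candidate in neighbours[friend]:
            # Skip myself and people I am already friends with.
            if candidate != me and candidate not in neighbours[me]:
                friends_in_common[candidate] += 1

    rows = []
    for candidate, common in friends_in_common.items():
        same_interest = len(likes.get(me, set()) & likes.get(candidate, set()))
        if require_shared_interest and same_interest == 0:
            continue  # the extra condition: sameInterest > 0
        rows.append((candidate, same_interest, common, common + same_interest))
    # ORDER BY score DESC, with the name as a deterministic tie-breaker.
    return sorted(rows, key=lambda row: (-row[3], row[0]))


if __name__ == "__main__":
    print(suggest("Joe", FRIEND_PAIRS, LIKES))
    print(suggest("Joe", FRIEND_PAIRS, LIKES, require_shared_interest=True))
```

Without the extra condition both Moe and Noe are returned, with Moe ranked first; with it, only Moe remains.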

Resources