Efficient duplicate node finding in neo4j

A feature request for the next Neo4j version: Neo4j already supports indices that keep properties in sorted order, allowing fast lookups. E.g., for a person's first name, one might have an index that looks like:
Alice
Bob
Carol
Dave
Emily
(....)
so one can look up "Dave" with binary search (O(log n)) instead of linear scanning (O(n)).
However, one can also use an index to efficiently find duplicates (nodes which have the same value for some property). E.g., if one wants a list of every group of "person" nodes sharing the same first name, what Neo4j 2.3 seems to do (according to EXPLAIN in Cypher) is compare each node's first name against every other first name, which is O(n^2). E.g., this query:
EXPLAIN MATCH (a:person) WITH a MATCH (b:person) WHERE a.name = b.name RETURN a, b LIMIT 5
shows a CartesianProduct step followed by a Filter step. But with an index on first names, one can do a linear scan over a list like:
Alice
Alice
Alice
Bob
Carol
Carol
Dave
Emily
Frank
Frank
Frank
(....)
comparing item #1 to #2, #2 to #3, and so on, to build an ordered list of all the duplicates in O(n) time per scan. Neo4j doesn't seem to support that, but it would be very useful for my application, so I'd like to put in a request.
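The adjacent-comparison scan described above can be sketched outside Neo4j in a few lines of Python. This is only an illustration of the idea, not something Neo4j exposes: the function name `duplicate_groups` is hypothetical, and the input is assumed to be already sorted, as an index scan would deliver it.

```python
from itertools import groupby


def duplicate_groups(sorted_values):
    """Return (value, count) for every value that occurs more than once.

    Assumes `sorted_values` is already sorted, so equal values are adjacent
    and a single O(n) pass is enough.
    """
    groups = []
    # groupby compares each item with its neighbour and yields one run per value.
    for value, run in groupby(sorted_values):
        count = sum(1 for _ in run)
        if count > 1:
            groups.append((value, count))
    return groups


names = ["Alice", "Alice", "Alice", "Bob", "Carol", "Carol",
         "Dave", "Emily", "Frank", "Frank", "Frank"]
print(duplicate_groups(names))
# prints [('Alice', 3), ('Carol', 2), ('Frank', 3)]
```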

I have a couple of suggestions for what you might try, but if you find them insufficient (and nobody else has any better ideas), I would suggest submitting new feature ideas to the Neo4j GitHub issues list.
So I was wondering if maybe Neo4j considers properties special. If you have an index on a label/property (which you can create with CREATE INDEX ON :person(name)), then comparing a property with a string should be pretty efficient. I tried passing the name through as just a variable and it seems to have fewer DB hits in my small test DB:
MATCH (a:person)
WITH a, a.name AS name
MATCH (b:person)
WHERE name = b.name
RETURN a, b LIMIT 5
That seems to give me fewer DB hits when I PROFILE it.
Another way to go about it, since you're talking about the same set of objects, is to group the nodes by name and then pull out the pairs for each group. Like so:
MATCH (a:person)
WITH a.name AS name, collect(a) AS people
UNWIND people AS a
UNWIND people AS b
WITH name, a, b
WHERE a <> b
RETURN a, b LIMIT 50
Here we collect up an array for each unique name (we could also lower/upper if we wanted to be case-insensitive) and then UNWIND twice to get a cartesian product of the array. Since we're working on a group-by-group basis, this should be much faster than comparing every node to every other node.
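To make the group-then-pair idea concrete, here is a rough Python equivalent of the collect / double UNWIND query. It is a sketch under assumptions: nodes are represented as hypothetical `(node_id, name)` tuples, and `duplicate_pairs` is a made-up helper name.

```python
from collections import defaultdict
from itertools import permutations


def duplicate_pairs(people):
    """people: iterable of (node_id, name) tuples standing in for nodes."""
    # Equivalent of: WITH a.name AS name, collect(a) AS people
    by_name = defaultdict(list)
    for node_id, name in people:
        by_name[name].append(node_id)

    pairs = []
    for ids in by_name.values():
        # Equivalent of the two UNWINDs plus WHERE a <> b:
        # every ordered pair of distinct members within one group.
        pairs.extend(permutations(ids, 2))
    return pairs


people = [(1, "Alice"), (2, "Bob"), (3, "Alice"), (4, "Carol"), (5, "Carol")]
print(duplicate_pairs(people))
# prints [(1, 3), (3, 1), (4, 5), (5, 4)]
```

The work is proportional to the sum of the squared group sizes rather than the square of the total node count, which is why the grouped version tends to be cheaper when most names are rare.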

Related

How to delete nodes and relationship by using aggregate function on a value

I am using Neo4j for the first time, and it's fun using such an interactive database, but I am currently stuck on a problem. I have data about people (uid, first name, last name, skills), and I also have a relationship [:has_skill].
My result frame looks like this: p1 has a skill s (Robert has skill java).
I need to find out how many people have common skills, so I tried the following Cypher query:
match (p1:People)-[:has_skill]->(s:Skill)<-[:has_skill]-(p2:People)
where p1.people_uid="49981" and p2.people_uid="34564"
return p1.first_name+' '+p1.last_name as Person1, p2.first_name+' '+p2.last_name as Person2,s.skill_name,s.skillid,count(s)
I am getting different persons for p1, but because of the large skill sets, the p2 person is repeated and the skill does not change. To get better results, I tried to delete every node and relationship where a person's skill count is greater than 6, but the delete fails with "invalid use of aggregating function".
This is my attempt to delete:
match (p1:People)-[:has_skill]->(s:Skill)
where count(s)>6
detach delete p1,s
Could anyone guide me or point out where I am going wrong? Your help would be highly appreciated. Thanks in advance.
Make sure that count and other aggregating functions appear within a WITH clause or a RETURN clause. This seems to be a design decision that Neo Technology made when creating Neo4j. See the following links for cases similar to yours:
How to count the number of relationships in Neo4j
Neo4j aggregate function
I need to count the number of connection between two nodes with a certain property
Also - see the WITH clause documentation here and the RETURN clause documentation here, in particular, this part of the WITH documentation:
Another use is to filter on aggregated values. WITH is used to introduce aggregates which can then be used in predicates in WHERE. These aggregate expressions create new bindings in the results. WITH can also, like RETURN, alias expressions that are introduced into the results using the aliases as the binding name.
In your case, you are going to want your aggregate function to be used within a WITH clause because you need to use WHERE afterwards to filter only those persons with more than 6 skills. You can use the following query to see which persons have more than 6 skills:
match (p1:People)-[r:has_skill]->(s:Skill)
with p1,count(s) as rels, collect (s) as skills
where rels > 6
return p1,rels,skills
After confirming that the result set is correct, you can use the following query to delete the persons who have more than 6 skills along with all the skill nodes that these persons are related to:
MATCH(p1:People)-[r:has_skill]->(s:Skill)
WITH p1,count(s) as rels, collect (s) as skills
WHERE rels > 6
FOREACH(s in skills | DETACH DELETE s)
DETACH DELETE p1
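For readers who find it easier to think about the WITH / WHERE step procedurally, the following Python sketch mirrors it: aggregate first, then filter on the aggregate. The relationship data and the name `people_with_many_skills` are invented for illustration; nothing here touches a database.

```python
from collections import defaultdict


def people_with_many_skills(has_skill, threshold=6):
    """has_skill: iterable of (person, skill) pairs, one per relationship."""
    # Aggregation step, like: WITH p1, count(s) AS rels, collect(s) AS skills
    skills_by_person = defaultdict(list)
    for person, skill in has_skill:
        skills_by_person[person].append(skill)

    # Filter step, like: WHERE rels > 6. It can only run after the aggregation.
    return {person: skills
            for person, skills in skills_by_person.items()
            if len(skills) > threshold}


# Hypothetical data: Robert has 7 skills, Ann has 2.
rels = [("Robert", f"skill{i}") for i in range(7)]
rels += [("Ann", "java"), ("Ann", "sql")]

print(sorted(people_with_many_skills(rels)))
# prints ['Robert']
```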

Get full graph that node N is a part of in neo4j

I'm trying use Cypher to get the entire graph that exists if I start at a given node in neo4j. When I say entire graph I mean all nodes and relationships that are connected to at least one other node in the graph.
I've seen examples where people can get all nodes that might be connected to a given start node with a known relationship. Examples of this include this and this, but how could I do this if I do not know the relationships?
Ultimately I'd like every node and relationship where I start at one given node and sprawl out, listing the nodes that are linked by every relationship.
I've tried this:
START n=node(441007) MATCH (n)-[:*]->(d) RETURN d
but the syntax is incorrect. I'm unsure if you can submit a wildcard relationship. Additionally, I do not think this will give me what I am looking for.
Try this:
MATCH (n)-[r*]->(d)
WHERE ID(n) = 441007
RETURN r, d
This will fan out from n (if you are using an older version of Neo4j, you should revert to your START syntax) and return the paths to each node d that can be reached. It is relationship-type agnostic because no relationship type is specified. If you don't care about the path, you can omit it with:
MATCH (n)-[*]->(d)
WHERE ID(n) = 441007
RETURN d
Obviously on a large graph this will get expensive!
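As a rough mental model of what the variable-length pattern reaches, here is a breadth-first traversal in Python over a hypothetical adjacency dictionary. Note one assumed difference: Cypher enumerates paths, so a node reachable by several paths can appear several times, whereas this sketch visits each node once.

```python
from collections import deque


def reachable(adjacency, start):
    """Return all nodes reachable from `start` via outgoing edges, in BFS order.

    adjacency: dict mapping node -> list of nodes it points to (any edge type).
    """
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        node = queue.popleft()
        for neighbour in adjacency.get(node, []):
            if neighbour not in seen:  # visit each node only once
                seen.add(neighbour)
                order.append(neighbour)
                queue.append(neighbour)
    return order


# Hypothetical graph: 5 points at 1, but nothing points from 1 back to 5.
graph = {1: [2, 3], 2: [4], 3: [4], 4: [], 5: [1]}
print(reachable(graph, 1))
# prints [2, 3, 4]
```

Node 5 is not returned because only outgoing edges are followed; dropping the arrow direction in the Cypher pattern (or adding reverse edges to the dictionary here) would include it.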
Edit
Meant to add the link to the cheat sheet, check out the section called Patterns.
Hi WildBill,
I have created a company graph for learning Neo4j. I ran the following pattern against the graph and got this result:
START a=node(9)
MATCH (a)<-[rel]-(d)
MATCH (d)-[sk]->(skill)
RETURN a, d, skill
Node 9 is my Company, which is part of the Graph.

Neo4j - Cypher return 1 to 1 relationships

Using neo4j 1.9.2, I'm trying to find all nodes in my graph that have a one to one relationship to another node. Let's say I have persons in my graph and I would like to find all persons, that have exactly one friend (since 2013), and this one friend only has the other person as friend and no one else. As a return, I would like to have all these pairs of "isolated" friends.
I tried the following:
START n=node(*) MATCH n-[r:is_friend]-m-[s:is_friend]-n
WHERE r.since >= 2013 and s.since >= 2013
WITH n, m, count(r), count(s)
WHERE count(r) = 1 AND count(s) = 1
RETURN n, m
But this query does not do what it is supposed to do; it simply returns nothing.
Note: There exists just one relation between the two persons, so one friend has an incoming relationship and the other one an outgoing one. Also, these two persons might have some other relations, like "works_in" or so, but I just want to check if there is a 1:1 relation of type is_friend between the persons.
EDIT: The suggestion from Stefan works perfectly when using node(*) as the starting point. But when trying this query with one specific node as the start point (e.g. start n=node(42)), it doesn't work. What would the solution look like in this case?
Update: I'm still wondering about a solution for this scenario: how to check if a given start node has a 1-to-1 relation of a specific relationship type to another node. Any ideas?
Here it's crucial to understand the concept of paths in the MATCH clause. A path is an alternating sequence of node, relationship, node, relationship, ..., node. There is the constraint that the same relationship will never occur twice in the same path; otherwise there would be a danger of endless loops.
That said, you need to decide if is_friend in your domain is directed. If it is directed you'd distinguish a being friend to b and b being friend to a. From the description I assume is_friend is undirected and the statement should look like:
START n=node(*) MATCH n-[r:is_friend]-()
WHERE r.since >= 2013
WITH n, count(r) as numberOfFriends
WHERE numberOfFriends=1
RETURN n
You don't have to care about the other end; it's traversed nonetheless since you do a node(*). Be aware that node(*) obviously gets more expensive as your graph grows.
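The query above returns each person with exactly one friend; pairing them up into "isolated" couples needs the extra check that the single friend also has exactly one friend. A minimal Python sketch of that check follows, assuming friendships are given as hypothetical `(a, b, since)` tuples and treating is_friend as undirected; `isolated_pairs` is an invented name.

```python
from collections import defaultdict


def isolated_pairs(friendships, since=2013):
    """friendships: iterable of (a, b, year) tuples, treated as undirected."""
    friends = defaultdict(set)
    for a, b, year in friendships:
        if year >= since:  # like: WHERE r.since >= 2013
            friends[a].add(b)
            friends[b].add(a)

    pairs = set()
    for person, others in friends.items():
        if len(others) == 1:  # exactly one friend
            (other,) = others
            if friends[other] == {person}:  # and that friend has only us
                pairs.add(frozenset((person, other)))
    return sorted(tuple(sorted(pair)) for pair in pairs)


data = [("ann", "bob", 2014), ("cat", "dan", 2013),
        ("dan", "eve", 2015), ("fay", "gus", 2010)]
print(isolated_pairs(data))
# prints [('ann', 'bob')]
```

The same check works for a single start node: look up its friend set, and accept it only if it has one member whose own friend set is just the start node.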

High degree nodes in neo4j

I'm trying to understand efficient usage patterns with neo4j, specifically in reference to high degree nodes. To give an idea of what I'm talking about, I have User nodes that have attributes which I have modeled as nodes. So there are relationships in my table such as
(:User)-[:HAS_ATTRIB]->(:AgeCategory)
and so on and so forth. The problem is that some of these AgeCategory nodes are very high degree, on the order of 100k, and queries such as
MATCH (u:User)-->(:AgeCategory)<--(v:User), (u)-->(:FavoriteLanguage)<--(v)
WHERE u.uid = "AAA111" AND v.uid <> u.uid
RETURN v.uid
(matching all users that share the same age category and favorite language as AAA111) are very, very slow, since you have to run over the FavoriteLanguage linked list once for every element in the AgeCategory linked list (or at least that's how I understand it).
I think it's pretty clear from the fact that this query takes minutes to resolve that I'm doing something wrong, but I am curious what the right procedure for dealing with queries like this is. Should I pull down the matching users from each query individually and compare them with an in-memory hash? Is there a way to put an index on the relationships on a node? Is this even a good idea for a schema to begin with?
My intuition is it would be more efficient to first retrieve the two end points (AgeCategory and FavoriteLanguage) for the given node u, and then query the middle node v for a path with these two fixed end points.
To prove that, I created a test graph with the following components,
A node u:User with u.uid = 'AAA111'
A node c:AgeCategory
A node l:FavoriteLanguage
A relationship between u and c, u-[:HAS_AGE]->c
A relationship between u and l, u-[:LIKE_LANGUAGE]->l
100,000 nodes v, each of which shares the same c:AgeCategory and l:FavoriteLanguage with the node u; that is, each v connects to l and c: v-[:HAS_AGE]->c, v-[:LIKE_LANGUAGE]->l
I ran the following query 10 times and got an average running time of 10500 ms.
MATCH (l:FavoriteLanguage)<-[:LIKE_LANGUAGE]-(u:User)-[:HAS_AGE]->(c:AgeCategory)
WHERE u.uid = 'AAA111'
WITH l, c
MATCH (l)<-[:LIKE_LANGUAGE]-(v:User)-[:HAS_AGE]->(c)
WHERE v.uid <> 'AAA111'
RETURN v.uid
With 10,000 v nodes, this query takes around 2000 ms, while your query takes about 27000 ms.
With 100,000 v nodes, this query takes around 10500 ms, while your original query seems to take forever.
So you might give this query a try and see if it can improve the performance with your graph.
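The question also asks whether pulling the matching users for each attribute separately and comparing them with an in-memory hash is reasonable. A sketch of that alternative in Python is below; the two input lists stand in for the results of two separate queries (one per attribute node), and all names and data are hypothetical.

```python
def users_sharing_both(age_members, language_members, uid):
    """Intersect two uid collections and drop the querying user.

    age_members / language_members: uids of users attached to the same
    AgeCategory / FavoriteLanguage node as `uid` (assumed fetched separately).
    """
    # Building the sets is linear in the input sizes; so is the intersection.
    shared = set(age_members) & set(language_members)
    shared.discard(uid)
    return sorted(shared)


same_age = ["AAA111", "BBB222", "CCC333", "DDD444"]
same_language = ["CCC333", "AAA111", "EEE555", "BBB222"]
print(users_sharing_both(same_age, same_language, "AAA111"))
# prints ['BBB222', 'CCC333']
```

This avoids walking one attribute's relationships once per member of the other, at the cost of transferring both uid lists to the client.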

neo4j complex pattern searching

I'm new to NEO4J and I need help on a specific problem. Or an answer if it's even possible.
SETUP:
We have 2 distinct types of nodes: users (A,B,C,D) and Products (1,2,3,4,5,6,7,8)
Next we have 2 distinct types of relationships between users and products, where a user WANTS a Product and where a product is OWNED BY a user.
1,2 is owned by A
3,4 is owned by B
5,6 is owned by C
7,8 is owned by D
Now
B wants 1
C wants 3
D wants 5
So far, I have had no problems, and I created the graph data with no difficulty. My question starts here. We have a circle when A wants product 8.
A-[:WANTS]->8-[:OWNEDBY]->D-[:WANTS]->5-[:OWNEDBY]->C-[:WANTS]->3-[:OWNEDBY]->B-[:WANTS]->1-[:OWNEDBY]->A
So we have a distinct pattern, U-[:WANTS]->P-[:OWNEDBY]->U
Now what I want to do is to find the paths toward the start node (initiating user that wants a product) following that pattern.
How do I define this using Cypher? Or do I need another way?
Thanks in advance.
I have a feeling this can be hacked with reduce and counting every even (every second) relationship:
MATCH p=A-[:OWNEDBY|WANTS*..20]->X
WITH r in relationships(p)
RETURN type(r),count(r) as cnt,
WHERE cnt=10;
or maybe counting all paths where the number of rels is even:
MATCH p=A-[:OWNEDBY|WANTS*..]->X
RETURN p,reduce(total, r in relationships(p): total + 1) as tt
WHERE tt%2=0;
But your graph must follow the strict pattern, where every incoming relationship from the set of OWNEDBY and WANTS must be of a different type from every outgoing relationship from the same set. In other words, this pattern cannot exist: A-[:WANTS]->B-[:WANTS]->C or A-[:OWNEDBY]->B-[:OWNEDBY]->C
The query syntax is probably wrong, but the logic can be implemented in Cypher once you have played with it more.
Or use Gremlin; I think I saw a Gremlin query somewhere in which you could define a pattern and then loop n times via that pattern until the end node.
I've played around with this and created http://console.neo4j.org/?id=qq9v1 showing the sample graph. I've found the following cypher statement solving the issue:
start a=node(1)
match p=( a-[:WANTS]->()-[:WANTS|OWNEDBY*]-()-[:OWNEDBY]->a )
return p
order by length(p) desc
limit 1
There is just one glitch: the intermediate part [:WANTS|OWNEDBY*] does not mandate alternating WANTS and OWNEDBY chains. As @ulkas stated, this should not be an issue if you take care during data modelling.
You might also look into http://api.neo4j.org/current/org/neo4j/graphalgo/GraphAlgoFactory.html to apply graph algorithms from Java code. You might use unmanaged extensions to provide REST access to that.
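If the alternation has to be enforced rather than assumed from the data model, one option is to walk the graph in application code. The following Python sketch does that for the sample data from the question; it is an illustration only, not the graphalgo API, and `find_cycle` plus the two dictionaries are hypothetical stand-ins for the WANTS and OWNEDBY relationships.

```python
def find_cycle(start, wants, owned_by):
    """Follow user -WANTS-> product -OWNEDBY-> user steps back to `start`.

    wants: dict user -> list of wanted products
    owned_by: dict product -> owning user
    Returns the path as a list, or None if no cycle exists.
    """
    def walk(user, path, visited):
        for product in wants.get(user, []):
            owner = owned_by.get(product)
            if owner is None:
                continue  # product without an owner cannot continue the chain
            step = path + [product, owner]
            if owner == start:
                return step  # closed the circle
            if owner not in visited:
                found = walk(owner, step, visited | {owner})
                if found:
                    return found
        return None

    return walk(start, [start], {start})


# Sample data from the question.
wants = {"A": [8], "B": [1], "C": [3], "D": [5]}
owned_by = {1: "A", 2: "A", 3: "B", 4: "B", 5: "C", 6: "C", 7: "D", 8: "D"}

print(find_cycle("A", wants, owned_by))
# prints ['A', 8, 'D', 5, 'C', 3, 'B', 1, 'A']
```

Because each step is explicitly one WANTS followed by one OWNEDBY, the alternation holds by construction, unlike with the [:WANTS|OWNEDBY*] pattern.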
