Find a node with most connections to other unique nodes - neo4j

Given:
Two node labels:
1000 (:A) nodes
1000 (:B) nodes
Constraints:
CREATE CONSTRAINT ON (a:A) ASSERT a.id IS UNIQUE;
CREATE CONSTRAINT ON (b:B) ASSERT b.id IS UNIQUE;
One unidirectional relationship type:
4000 [:RELATED_TO] relationships
Multiple (a:A)-[:RELATED_TO]->(b:B) paths
(Meaning, the same node (a:A) could be related to the same node (b:B) multiple times)
I'm trying to run a query that would show the paths of the node that is connected to the biggest number of other unique nodes in the graph. For example, if nodes (a1:A), (a2:A), (a3:A), and (a4:A) are all connected to (b:B) at least once, and it so happens that no other (:B) is connected to any more than three unique (:A) nodes elsewhere in the graph, I would like for the Neo4j Browser to show (b:B) in the center and (a1:A) through (a4:A) around it. I feel like my biggest challenge is that I haven't been able to figure out how to avoid counting up multiple (a1:A)-[:RELATED_TO]->(b:B) paths.
I'll be happy to provide more context if necessary. Thanks in advance!

This query uses the aggregating function COLLECT (with the DISTINCT operator to qualify its argument) to return the B node that has relationships with the most distinct A nodes, along with those A nodes:
MATCH (a:A)-[:RELATED_TO]->(b:B)
RETURN b, COLLECT(DISTINCT a) AS aNodes
ORDER BY SIZE(aNodes) DESC
LIMIT 1;
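To make the effect of DISTINCT concrete, here is a minimal plain-Python sketch of what the query above computes. It is not Neo4j code: the `edges` list and the `most_connected` helper are hypothetical, with one `(a, b)` pair per RELATED_TO relationship and repeated pairs standing in for parallel relationships.

```python
from collections import defaultdict

# Hypothetical data: one (a_id, b_id) pair per RELATED_TO relationship.
# Repeated pairs model the same (a)-[:RELATED_TO]->(b) occurring several times.
edges = [
    ("a1", "b1"), ("a1", "b1"), ("a2", "b1"), ("a3", "b1"), ("a4", "b1"),
    ("a1", "b2"), ("a1", "b2"), ("a1", "b2"), ("a2", "b2"), ("a3", "b2"),
]

def most_connected(edges):
    """Return (b, sorted distinct a's) for the b with the most distinct a neighbours."""
    neighbours = defaultdict(set)  # a set plays the role of COLLECT(DISTINCT a)
    for a, b in edges:
        neighbours[b].add(a)
    # Equivalent of ORDER BY SIZE(aNodes) DESC LIMIT 1
    best = max(neighbours, key=lambda b: len(neighbours[b]))
    return best, sorted(neighbours[best])

print(most_connected(edges))
```

Both b1 and b2 have five relationships here, but b1 wins because it reaches four distinct A nodes while b2 reaches only three.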

Related

What does it mean to connect two nodes without a relation?

I have written the three queries below and am trying to understand the difference between them.
Query1:
MATCH (person)-[r]->(otherPerson)
Query2:
MATCH (person)-->(otherPerson)
Query3:
MATCH (person)--(otherPerson)
Please let me know if there is any difference between the three queries.
Query 1 and 2 are basically the same: you are asking for all nodes connected by relationships that start at the person node and end at the otherPerson node. In Query 1 you are also binding the relationship to the variable r, which allows you to return it. In Query 1 you could do
MATCH (person)-[r]->(otherPerson) RETURN r
In Query 2, you could not return the relationship, because it is not bound to a variable.
Query 3 is similar to Query 2 except that it ignores direction: it matches relationships that either start or end at the person node, with the otherPerson node at the opposite end.
Query 1 and 2 will find all nodes and bind each to the variable person. They will then follow all outgoing relationships and bind the connected node to the variable otherPerson. In Query 1 the relationship is also bound to the variable r.
Query 3 will match the same pattern except it will traverse both incoming and outgoing edges to find the otherPerson node.
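The difference between the directed and undirected patterns can be sketched in plain Python. This is only a model of the matching semantics, not Neo4j code: the `rels` list and both helper functions are hypothetical, and it assumes no relationship connects a node to itself.

```python
# Hypothetical directed relationships, stored as (start, end) pairs.
rels = [("ann", "bob"), ("bob", "cat")]

def directed_matches(rels):
    """Model of (person)-->(otherPerson): follow each relationship start to end."""
    return [(start, end) for start, end in rels]

def undirected_matches(rels):
    """Model of (person)--(otherPerson): each relationship matches in both orientations."""
    rows = []
    for start, end in rels:
        rows.append((start, end))
        rows.append((end, start))
    return rows

print(directed_matches(rels))
print(undirected_matches(rels))
```

The undirected pattern yields twice as many rows, because each relationship is matched once with person at its start and once with person at its end.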

To get nodes and relationships between two specified nodes for review

I have a database containing millions of nodes and edge data and I want to get all the nodes and relationships data between two specified nodes.
Below is the sample data for the graph which has 7 nodes and 7 relationships.
To traverse from 1st node to 7th node I can use the variable length relationship approach and can get the nodes and relationships in between the first and 7th nodes (but in this approach we need to know the number of relationships and nodes between 1st and 7th node).
For using variable length relationship approach we have to specify the number where we will get the end node and it traverses in one direction.
But in my case I know the start and end node and don't know how many relationships and nodes are in between them. Please suggest how I can write a Cypher query for this case.
I have used the APOC spanning tree procedure where it returns ‘path’ from the 1st and 7th element, but it does not return the nodes and relationships. Can I get nodes and relationships data in return using the spanning tree procedure and how?
Is there any other way to get all nodes and relations between two nodes without using the APOC procedure?
Here is query with apoc procedure:
MATCH (start:temp {Name:"Joel"}), (end:temp {Name:"Jack"}) CALL apoc.path.spanningTree(start,{terminatorNodes:[end]}) YIELD path return path
Note: In our graph database nodes can have multi direction relations.
Sample nodes and relationships snapshot: https://i.stack.imgur.com/nN9hk.png
I assume you do not want to have duplicates in your result, so my approach would be this
MATCH (start:temp {Name:"Joel"}), (end:temp {Name:"Jack"})
MATCH p=shortestPath((start)-[*]->(end))
UNWIND nodes(p) AS node
UNWIND relationships(p) AS rel
RETURN COLLECT(DISTINCT node) as nodes, COLLECT(DISTINCT rel) as rels
It might be better to use the shortestPath function to find the shortest path between two nodes.
MATCH (start:temp {Name:"Joel"}), (end:temp {Name:"Jack"})
MATCH p=shortestPath((start)-[*]->(end))
RETURN nodes(p) as nodes, relationships(p) as rels
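For readers who want to see what "the nodes and relationships of a shortest path" means without knowing the path length in advance, here is a minimal breadth-first search sketch in plain Python. It is not how Neo4j implements shortestPath; the `rels` data, the relationship ids, and the `shortest_path` helper are hypothetical.

```python
from collections import deque

# Hypothetical directed relationships: (rel_id, start_node, end_node).
rels = [
    ("r1", "Joel", "Ann"), ("r2", "Ann", "Jack"),
    ("r3", "Joel", "Bob"), ("r4", "Bob", "Cid"), ("r5", "Cid", "Jack"),
]

def shortest_path(rels, start, end):
    """Return (nodes, rel_ids) of one shortest directed path, or None if unreachable."""
    outgoing = {}
    for rel_id, s, e in rels:
        outgoing.setdefault(s, []).append((rel_id, e))
    parent = {start: None}  # node -> (previous node, rel id used to reach it)
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == end:
            break
        for rel_id, nxt in outgoing.get(node, []):
            if nxt not in parent:
                parent[nxt] = (node, rel_id)
                queue.append(nxt)
    if end not in parent:
        return None
    # Walk back from the end node to rebuild the path.
    nodes, path_rels = [end], []
    while parent[nodes[-1]] is not None:
        prev, rel_id = parent[nodes[-1]]
        path_rels.append(rel_id)
        nodes.append(prev)
    return nodes[::-1], path_rels[::-1]

print(shortest_path(rels, "Joel", "Jack"))
```

The search expands one hop at a time, so no maximum length has to be supplied; the two returned lists correspond to nodes(p) and relationships(p) in the query above.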

Create relationship between each two nodes for a set of nodes

I have created many nodes in neo4j, the attributes of these nodes are the same, they all have user_id and item_id, the code used is as follows:
LOAD CSV WITH HEADERS FROM 'file:///data.csv' AS row
CREATE (main:Main_table {USER_ID: row.user_id,
ITEM_ID: row.item_id}
);
CREATE INDEX ON :Main_table(USER_ID);
CREATE INDEX ON :Main_table(ITEM_ID);
Now I want to create relationship between the nodes with the same user_id or item_id. For example, if node A, B and C have the same USER_ID, I want to create (A)-[:EDGE]->(B), (A)-[:EDGE]->(C) and (B)-[:EDGE]->(C). In order to achieve this goal, I tried the following code:
MATCH (a:Main_table),(b:Main_table)
WHERE a.USER_ID = b.USER_ID
CREATE (a)-[:USER_EDGE]->(b);
MATCH (a:Main_table),(b:Main_table)
WHERE a.ITEM_ID = b.ITEM_ID
CREATE (a)-[:ITEM_EDGE]->(b);
But due to the large amount of data (3,000,000 nodes, 100,000 users), this process is very slow. How can I complete it more quickly? Any help would be greatly appreciated!
Your query is causing a cartesian product, and the Cypher planner does not use indexes to optimize node lookups involving node property comparisons.
A query like this (instead of your USER_EDGE query) may be faster, as it does not cause a cartesian product:
MATCH (a:Main_table)
WITH a.USER_ID AS id, COLLECT(a) AS mains
UNWIND mains AS a
UNWIND mains AS b
WITH a, b
WHERE ID(a) < ID(b)
MERGE (a)-[:USER_EDGE]->(b)
This query uses the aggregating function COLLECT to collect the nodes that have the same USER_ID value, and uses the ID(a) < ID(b) test to ensure that a and b are not the same nodes and to also prevent duplicate relationships (in opposite directions).
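The group-then-pair idea can be sketched in plain Python. This is only a model of the logic, not Neo4j code: the `rows` data and the `user_edges` helper are hypothetical, with each row holding an internal node id and a USER_ID value.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical rows: (internal node id, USER_ID).
rows = [(0, "u1"), (1, "u1"), (2, "u1"), (3, "u2"), (4, "u2"), (5, "u3")]

def user_edges(rows):
    """Return one (a, b) pair with a < b for every two nodes sharing a USER_ID."""
    groups = defaultdict(list)  # equivalent of COLLECT(a) grouped by a.USER_ID
    for node_id, user_id in rows:
        groups[user_id].append(node_id)
    edges = []
    for ids in groups.values():
        # combinations over sorted ids yields only pairs with a < b,
        # which is what the ID(a) < ID(b) test achieves in the query.
        edges.extend(combinations(sorted(ids), 2))
    return edges

print(user_edges(rows))
```

Only nodes inside the same group are ever compared, so the work grows with the group sizes rather than with the square of the total node count. A group of k nodes still produces k(k-1)/2 relationships: if the 3,000,000 nodes were spread evenly over 100,000 users (an assumption, 30 nodes per user), that would be 435 relationships per user, about 43.5 million in total.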

Create relationships in Neo4j

I have a graph with about 800k nodes and I want to create random relationships among them, using Cypher.
Examples like the following didn't work because the cartesian product is too big:
match (u),(p)
with u,p
create (u)-[:LINKS]->(p);
For example I want 1 relationship for each node (800k), or 10 relationships for each node (8M).
In short, I need a Cypher query that UNIFORMLY creates relationships between nodes.
Does someone know the query to create relationships in this way?
So you want every node to have exactly x relationships? Try this in batches until no more relationships are updated:
MATCH (u),(p) WHERE size((u)-[:LINKS]->(p)) < {x}
WITH u,p LIMIT 10000 WHERE rand() < 0.2 // LIMIT to 10000 then sample
CREATE (u)-[:LINKS]->(p)
This should work (assuming your neo4j server has enough memory):
MATCH (n)
WITH COLLECT(n) AS ns, COUNT(n) AS len
FOREACH (i IN RANGE(1, {numLinks}) |
FOREACH (x IN ns |
FOREACH(y IN [ns[TOINT(RAND()*len)]] |
CREATE (x)-[:LINK]->(y) )));
This query collects all nodes, and uses nested loops to do the following {numLinks} times: create a LINK relationship between every node and a randomly chosen node.
The innermost FOREACH is used as a workaround for the current Cypher limitation that you cannot put an operation that returns a node inside a node pattern. To be specific, this is illegal: CREATE (x)-[:LINK]->(ns[TOINT(RAND()*len)]).
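The nested loops in that query can be modelled in plain Python as follows. This is a sketch of the logic only, not Neo4j code: the `random_links` helper and its `rng` parameter are hypothetical, and the random generator is passed in so that runs can be made reproducible.

```python
import random

def random_links(nodes, num_links, rng):
    """For each of num_links rounds, link every node to one randomly chosen node."""
    links = []
    for _ in range(num_links):                  # outer FOREACH over RANGE(1, numLinks)
        for x in nodes:                         # middle FOREACH over all nodes
            y = nodes[int(rng.random() * len(nodes))]  # ns[TOINT(RAND()*len)]
            links.append((x, y))                # CREATE (x)-[:LINK]->(y)
    return links

links = random_links(list(range(100)), 3, random.Random(0))
print(len(links))
```

Every node ends up with exactly `num_links` outgoing links, while incoming links are spread at random. As in the query, nothing stops a node from being linked to itself or to the same target twice.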

Neo4j: Cypher Query With Variable Length and Condition on Node Labels

What I'm looking for
With variable length relationships (see here in the neo4j manual), it is possible to have a variable number of relationships of a certain type between two nodes.
# Cypher
match (g1:Group)-[:sub_group*]->(g2:Group) return g1, g2
I'm looking for the same thing with nodes, i.e. a way to query for two nodes with a variable number of nodes in between, but with a label condition on the nodes rather than the relationships:
# Looking for something like this in Cypher:
match (g1:Group)-->(:Group*)-->(g2:Group) return g1, g2
Example
I would use this mechanism, for example, to find all (direct or indirect) members of a group within a group structure.
# Looking for something like this in Cypher:
match (group:Group)-->(:Group*)-->(member:User) return member
Take, for example, this structure:
group1:Group
|-------> group2:Group -------> user1:User
|-------> group3:Group
|--------> page1:Page -----> group4:Group -----> user2:User
In this example, user1 is a member of group1 and group2, but user2 is only a member of group4, not of the other groups, because a node without the Group label is in between.
Abstraction
A more abstract pattern would be a kind of repeat operator |...|* in Cypher:
# Looking for repeat operator in Cypher:
match (g1:Group)|-[:is_subgroup_of]->(:Group)|*-[:is_member_of]->(member:User)
return member
Does anyone know of such a repeat operator? Thanks!
Possible Solution
One solution I've found is to use a condition on the nodes using where, but I hope there is a better (and shorter) solution out there!
# Cypher
match path = (member:User)<-[*]-(g:Group{id:1})
where all(node in tail(nodes(path)) where ('Group' in labels(node)))
return member
Explanation
In the above query, all(node in tail(nodes(path)) where ('Group' in labels(node))) is one single where condition, which consists of the following key parts:
all: ALL(x in coll where pred): TRUE if pred is TRUE for all values in coll
nodes(path): NODES(path): Returns the nodes in path
tail(): TAIL(coll): coll except first element. I'm using this because the first node is a User, not a Group.
Reference
See Cypher Cheat Sheet.
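The label condition can be modelled in plain Python as a traversal that only continues through Group nodes. This is a sketch, not Neo4j code: the `labels` and `children` dictionaries and the `members` helper are hypothetical, and attaching page1 below group3 is an assumption about the example structure.

```python
# Hypothetical model of the example structure (page1 under group3 is an assumption).
labels = {
    "group1": "Group", "group2": "Group", "group3": "Group", "group4": "Group",
    "page1": "Page", "user1": "User", "user2": "User",
}
children = {
    "group1": ["group2", "group3"],
    "group2": ["user1"],
    "group3": ["page1"],
    "page1": ["group4"],
    "group4": ["user2"],
}

def members(group):
    """Return users reachable from group along paths whose inner nodes are all Groups."""
    found, seen, stack = set(), {group}, [group]
    while stack:
        node = stack.pop()
        for child in children.get(node, []):
            if child in seen:
                continue
            seen.add(child)
            if labels[child] == "User":
                found.add(child)
            elif labels[child] == "Group":
                stack.append(child)  # keep walking only through Group nodes
            # any other label (for example Page) ends this branch
    return found

print(members("group1"))
```

Starting from group1 the walk stops at page1, so user2 is not reported; starting from group4 it is.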
How about this:
MATCH (:Group {id:1})<-[:IS_SUBGROUP_OF|:IS_MEMBER_OF*]-(u:User)
RETURN DISTINCT u
This will:
find all subtrees of the group with ID 1
only traverse the relationships IS_SUBGROUP_OF and IS_MEMBER_OF in incoming direction (meaning sub-groups or users that belong to the group with ID 1 or one of its sub-groups)
only return nodes which have an IS_MEMBER_OF relationship to a group in the subtree
and discard duplicate results (users who belong to more than one of the groups in the tree would otherwise appear multiple times)
I know this relies on relationships types rather than node labels, but IMHO this is a more graphy approach.
Let me know if this would work or not.

Resources