We have a large graph (over 1 billion edges) that has multiple relationship types between nodes.
To count the nodes that are connected by a single unique relationship (i.e. a single relationship of the given type between two nodes that would otherwise not be connected), we are running the following query:
MATCH (n)-[:REL_TYPE]-(m)
WHERE size((n)-[]-(m))=1 AND id(n)>id(m)
RETURN COUNT(DISTINCT n) + COUNT(DISTINCT m)
To demonstrate a similar result, the sample query below can be run on the movie graph (after running :play movies in an empty graph), which results in 4 nodes. In this case we are asking for pairs of nodes with 3 relationships between them:
MATCH (n)-[]-(m)
WHERE size((n)-[]-(m))=3 AND id(n)>id(m)
RETURN COUNT(DISTINCT n) + COUNT(DISTINCT m)
Is there a better/more efficient way to query the graph?
The following query is more performant, since it only scans each relationship once [whereas size((n)--(m)) will cause relationships to be scanned multiple times]. It also specifies a relationship direction to filter out half of the relationship scans, and to avoid the need for comparing native IDs.
MATCH (n)-->(m)
WITH n, m, COUNT(*) AS cnt
WHERE cnt = 3
RETURN COUNT(DISTINCT n) + COUNT(DISTINCT m)
NOTE: It is not clear what you are using the COUNT(DISTINCT n) + COUNT(DISTINCT m) result for, but be aware that it is possible for some nodes to be counted twice after the addition.
[UPDATE]
If you want to get the actual number of distinct nodes that pass your filter, here is one way to do that:
MATCH (n)-->(m)
WITH n, m, COUNT(*) AS cnt
WHERE cnt = 3
WITH COLLECT(n) + COLLECT(m) AS nodes
UNWIND nodes AS node
RETURN COUNT(DISTINCT node)
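To make the double-counting concrete, here is a small plain-Python model of the two RETURN forms. It is only a sketch of the grouping logic, not Neo4j code; the edge list and the name `threshold` are made up for illustration.

```python
from collections import Counter

# Hypothetical directed edge list: one (start, end) tuple per relationship.
edges = [
    ("a", "b"), ("a", "b"), ("a", "b"),  # a -> b three times
    ("b", "c"), ("b", "c"), ("b", "c"),  # b -> c three times
    ("c", "d"),                          # c -> d once
]
threshold = 3

# MATCH (n)-->(m) WITH n, m, COUNT(*) AS cnt WHERE cnt = 3
pair_counts = Counter(edges)
pairs = [pair for pair, cnt in pair_counts.items() if cnt == threshold]

# COUNT(DISTINCT n) + COUNT(DISTINCT m): b appears once as n and once as m.
summed = len({n for n, _ in pairs}) + len({m for _, m in pairs})

# COLLECT(n) + COLLECT(m), UNWIND, COUNT(DISTINCT node)
distinct_nodes = len({node for pair in pairs for node in pair})

print(summed, distinct_nodes)
```

In this toy graph node b is both a start node and an end node of a qualifying pair, so the summed form reports one more node than the distinct form.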
Related
I have a graph where a pair of nodes can have several relationships between them.
I would like to count these relationships between each pair of nodes, and set the count as a property of each relationship.
I tried something like:
MATCH (s:LabeledExperience)-[r:NextExp]->(e:LabeledExperience)
with s, e, r, length(r) as cnt
MATCH (s2:LabeledExperience{name:s.name})-[r2:NextExp{name:r.name}]->(e2:LabeledExperience{name: e.name})
SET r2.weight = cnt
But this always sets the weight to 1.
I also tried:
MATCH ()-[r:NextExp]->()
with r, length(r) as cnt
MATCH ()-[r2:NextExp{name:r.name}]->()
SET r2.weight = cnt
But this takes too much time since there are more than 90k relationships and there is no index (since it is not possible to have them on edges).
They are always set to 1 because of the way you are counting.
When you group by s, e, r that is always going to result in a single row. But if you collect(r) for every s, e then you will get a collection of all of the :NextExp relationships between those two nodes.
Also, length() measures the length of a matched path (the number of relationships in it), so it should not be applied directly to a relationship.
Match the relationship and put them in a collection for each pair of nodes. Iterate over each rel in the collection and set the size of the collection of rels.
MATCH (s:LabeledExperience)-[r:NextExp]->(e:LabeledExperience)
WITH s, e, collect(r) AS rels
UNWIND rels AS rel
SET rel.weight = size(rels)
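The collect-then-unwind idea can be sketched in plain Python as follows. This is only a model of the grouping, not Neo4j code; the dictionaries stand in for :NextExp relationships and their keys are hypothetical.

```python
from collections import defaultdict

# Hypothetical data: each dict stands in for one :NextExp relationship.
rels = [
    {"start": "s1", "end": "e1"},
    {"start": "s1", "end": "e1"},
    {"start": "s1", "end": "e2"},
]

# WITH s, e, collect(r) AS rels  (group relationships by node pair)
groups = defaultdict(list)
for rel in rels:
    groups[(rel["start"], rel["end"])].append(rel)

# UNWIND rels AS rel SET rel.weight = size(rels)
for group in groups.values():
    for rel in group:
        rel["weight"] = len(group)

weights = [rel["weight"] for rel in rels]
print(weights)
```

Grouping by the pair only (and not by the relationship itself) is what lets the group size exceed 1.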
I was practicing with the Movie Database from Neo4j and wrote the following query:
MATCH (a:Person)-[:ACTED_IN]->(m:Movie)<-[:DIRECTED]-(a)
RETURN a
This query returns 3 rows. However, if I go to the graph view in the web editor and expand the "Tom Hanks" node, I see one movie that Tom Hanks both directed and acted in, while the rest of the connected movies only have the ACTED_IN relationship. What I want is to filter Tom Hanks out of the result in this case, since he has at least one connection with only one of the two relationships (either ACTED_IN or DIRECTED).
P.S.: My expected result would be only the row representing the node "Clint Eastwood".
So you only want results where the person acted in and directed the same movies, but never simply acted in, without directing, or directed, without acting.
You could use this approach:
MATCH (a:Person)-[:ACTED_IN]->(m:Movie)<-[:DIRECTED]-(a)
WITH a, count(m) as actedDirectedCount
WHERE size((a)-[:ACTED_IN]->()) = actedDirectedCount AND size((a)-[:DIRECTED]->()) = actedDirectedCount
RETURN a
Though you can simplify this a bit by combining the relationship types in the pattern used in your WHERE clause like so:
MATCH (a:Person)-[:ACTED_IN]->(m:Movie)<-[:DIRECTED]-(a)
WITH a, count(m) as actedDirectedCount
WHERE size((a)-[:ACTED_IN|DIRECTED]->()) = actedDirectedCount * 2
RETURN a
If the actedDirectedCount = 3 movies, then there must be at a minimum 3 :ACTED_IN relationships and 3 :DIRECTED relationships, so a minimum of 6 relationships using either relationship. If there are any more than this, then there are additional movies that they either acted in or directed, so we'd filter that out.
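The counting argument can be checked with a small plain-Python model. This is a sketch, not Neo4j code: the people and movie ids are invented, and it assumes at most one relationship of each type between a person and a movie.

```python
# Hypothetical data: person -> set of movie ids per relationship type.
acted_in = {"p1": {"m1"}, "p2": {"m2", "m3"}}
directed = {"p1": {"m1"}, "p2": {"m2"}}


def acted_and_directed_only(person):
    acted = acted_in.get(person, set())
    dirs = directed.get(person, set())
    # count(m) AS actedDirectedCount: movies matched by both relationships
    both = len(acted & dirs)
    # size((a)-[:ACTED_IN|DIRECTED]->()) = actedDirectedCount * 2
    # The MATCH itself requires at least one such movie, hence both > 0.
    return both > 0 and len(acted) + len(dirs) == both * 2


result = [p for p in acted_in if acted_and_directed_only(p)]
print(result)
```

Here p2 has one extra ACTED_IN relationship, so the total degree (3) exceeds twice the shared count (2) and p2 is filtered out.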
Three options come to mind:
1.
MATCH (m:Movie)<-[:DIRECTED]-(a:Person)
with a, collect(distinct m) as directedMovies
match (a)-[:ACTED_IN]->(m:Movie)
with a, directedMovies, collect(distinct m) as actedMovies
with a where all(x in directedMovies where x in actedMovies) and all(x in actedMovies where x in directedMovies)
return a
2.
MATCH (m:Movie)<-[:DIRECTED]-(a:Person)
with * order by id(m)
with a, collect(distinct m) as directedMovies
match (a)-[:ACTED_IN]->(m:Movie)
with a, directedMovies, m order by id (m)
with a, directedMovies, collect(distinct m) as actedMovies
with a where actedMovies=directedMovies
return a
3.
MATCH (m:Movie)<-[:DIRECTED]-(a:Person)
with a, collect(distinct m) as directedMovies
with * where all(x in directedMovies where (a)-[:ACTED_IN]->(x))
MATCH (m:Movie)<-[:ACTED_IN]-(a)
with a, collect(distinct m) as actedMovies
with * where all(x in actedMovies where (a)-[:DIRECTED]->(x))
return a
The first two are equally expensive and the last one is a bit more expensive.
I have a graph with one node type 'nodeName' and one relationship type 'relName'. Each node pair has 0-1 'relName' relationships with each other but each node can be connected to many nodes.
Given an initial list of nodes (I'll refer to this list as the query subset) I want to:
1. Find all the nodes that connect to the query subset
I'm currently doing this (which may be overly convoluted):
MATCH (a: nodeName)-[r:relName]-()
WHERE (a.name IN ['query list'])
WITH a
MATCH (b: nodeName)-[r2:relName]-()
WHERE NOT (b.name IN ['query list'])
WITH a, b
MATCH (a)--(b)
RETURN DISTINCT b
2. Then, for each connected node (b), I want to return the SUM of the weights that connect to the query subset
For example, if node b1 has 4 edges that connect to nodes in the query subset, I would like to RETURN SUM(r2.weight) AS totalWeight for b1. I actually need a list of all the b nodes ordered by totalWeight.
No. 2 is where I'm stuck. I've been reading the docs about FOREACH and reduce() but I'm not sure how to apply them here.
Speed is important, as I have 30,000 nodes and 1.5M edges. If you have any suggestions regarding this, please throw them into the mix.
Many thanks
Matt
Why do you need so many MATCH clauses? You can specify the a nodes and the b nodes in a single MATCH and select only those that have a relationship between them.
After that, just return the b nodes and the sum of the weights. Because b is returned alongside an aggregation function such as sum(), it automatically acts as the grouping key.
MATCH (a:nodeName)-[r:relName]-(b:nodeName)
WHERE (a.name IN ['query list']) AND NOT((b.name IN ['query list']))
RETURN b.name, sum(r.weight) as weightSum order by weightSum
I think we can simplify that query a bit.
MATCH (a: nodeName)
WHERE (a.name IN ['query list'])
WITH collect(a) as subset
UNWIND subset as a
MATCH (a)-[r:relName]-(b)
WHERE NOT b in subset
RETURN b, sum(r.weight) as totalWeight
ORDER BY totalWeight ASC
Since sum() is an aggregating function, it will make the non-aggregation variables the grouping key (in this case b), so the sum is per b node, then we order them (switch to DESC if needed).
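The implicit grouping can be illustrated with a plain-Python model. This is a sketch only, not Neo4j code; the edge list, node names, and weights are hypothetical.

```python
from collections import defaultdict

# Hypothetical undirected weighted edges: (node, node, weight).
edges = [
    ("q1", "x", 2.0),
    ("q2", "x", 3.0),
    ("y", "q1", 1.0),
    ("q1", "q2", 9.0),  # both ends inside the subset: ignored
    ("x", "y", 4.0),    # both ends outside the subset: ignored
]
subset = {"q1", "q2"}

# MATCH (a)-[r:relName]-(b) WHERE NOT b IN subset, grouped by b with sum(r.weight)
total_weight = defaultdict(float)
for u, v, w in edges:
    # Keep only edges with exactly one endpoint inside the query subset.
    if (u in subset) != (v in subset):
        outside = v if u in subset else u
        total_weight[outside] += w

# ORDER BY totalWeight ASC
ranked = sorted(total_weight.items(), key=lambda item: item[1])
print(ranked)
```

Each outside node ends up with one row holding the sum over all of its edges into the subset, which is what grouping by b achieves in the query.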
I hope you can help me. For every node, I want to count all its neighbours, separated by the type of relationship.
For example, if I have this graph:
I want to get for Node 165 following:
id AnzTaxi AnzBus AnzSchiff
165 2 2 0
I wrote this query, but it seems that Neo4j combines my MATCH clauses as an AND, so it only lists nodes that have at least one relationship of every type.
MATCH (Station)-[:TAXI]-(b)
MATCH (Station)-[:BUS]-(c)
MATCH (Station)-[:SCHIFF]-(d)
RETURN Station.id, COUNT(DISTINCT b) AS AnzTaxi,
       COUNT(DISTINCT c) AS AnzBus, COUNT(DISTINCT d) AS AnzSchiff
ORDER BY COUNT(DISTINCT b) DESC;
Many thanks in advance!
You should use OPTIONAL MATCH instead of a plain MATCH. The docs say:
The OPTIONAL MATCH clause is used to search for the pattern described
in it, while using nulls for missing parts of the pattern.
Try it:
// Bind every node first, so that nodes without a TAXI relationship are kept too
MATCH (Station)
OPTIONAL MATCH (Station)-[:TAXI]-(b)
OPTIONAL MATCH (Station)-[:BUS]-(c)
OPTIONAL MATCH (Station)-[:SCHIFF]-(d)
RETURN Station.id, COUNT(DISTINCT b) AS AnzTaxi,
       COUNT(DISTINCT c) AS AnzBus, COUNT(DISTINCT d) AS AnzSchiff
ORDER BY COUNT(DISTINCT b) DESC;
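The difference between MATCH and OPTIONAL MATCH is roughly the difference between an inner join and a left outer join. The plain-Python model below sketches that, assuming the station is already bound before the patterns are expanded; it is not Neo4j code, and the adjacency data is made up.

```python
# Hypothetical adjacency per relationship type: station id -> neighbour ids.
taxi = {165: {1, 2}, 7: {3}}
bus = {165: {4, 5}}
stations = [165, 7]


def rows(station, optional):
    """Return the (b, c) rows produced for one station."""
    bs = sorted(taxi.get(station, set()))
    cs = sorted(bus.get(station, set()))
    if optional:
        # OPTIONAL MATCH keeps the row and uses null (None) for the missing part.
        bs = bs or [None]
        cs = cs or [None]
    return [(b, c) for b in bs for c in cs]


def counts(optional):
    result = {}
    for station in stations:
        matched = rows(station, optional)
        if matched:  # a station with no rows disappears from the result
            # COUNT(DISTINCT x) ignores nulls.
            result[station] = (
                len({b for b, _ in matched if b is not None}),
                len({c for _, c in matched if c is not None}),
            )
    return result


plain = counts(optional=False)
with_optional = counts(optional=True)
print(plain, with_optional)
```

Station 7 has no BUS neighbours, so it drops out under plain MATCH but is reported with a BUS count of 0 under OPTIONAL MATCH.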
One additional approach is not to expand at all, and use relationship degrees (which are stored on the node itself) to get the counts you need.
MATCH (Station)
RETURN Station.id,
size((Station)<-[:BUS]-()) AS AnzBus,
size((Station)<-[:TAXI]-()) AS AnzTaxi,
size((Station)<-[:SCHIFF]-()) AS AnzSchiff
ORDER BY AnzBus DESC;
Note that this counts the relationships rather than nodes, and this assumes (from your example) that every :BUS, :TAXI, and :SCHIFF relationship in the graph has both incoming and outgoing relationships between each connected node.
Though if that is the case, it's better to only model this with one relationship between nodes and treat it as bidirectional rather than double your relationships unnecessarily (and if you do make that change you'll need to remove direction from the relationships in my query).
If your model doesn't work like this, and a relationship can go one way, but not be reciprocated (so there can be an outgoing :BUS relationship to a node, but no incoming :BUS relationships from that same node), then this answer won't work, and you should choose one of the others.
An alternate approach would be to match all of the neighbour nodes in one go rather than three separate optional statements. This way if there was no result then you would know there were no neighbours connected by TAXI, BUS, or SCHIFF. Then you could just use CASE statements to separate them after the fact and aggregate them back up using SUM.
MATCH (s1:Station)-[mode:TAXI|BUS|SCHIFF]-(neighbour)
WITH s1,
TYPE(mode) as mode,
COLLECT(DISTINCT neighbour) as neighbours
WITH s1,
CASE WHEN mode = 'TAXI' THEN size(neighbours) END AS AnzTaxi,
CASE WHEN mode = 'BUS' THEN size(neighbours) END AS AnzBus,
CASE WHEN mode = 'SCHIFF' THEN size(neighbours) END AS AnzSchiff
RETURN s1.id,
SUM(AnzTaxi) as AnzTaxi,
SUM(AnzBus) AS AnzBus,
SUM(AnzSchiff) AS AnzSchiff
ORDER BY AnzTaxi DESC, s1.id
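The CASE-and-SUM pivot can be modelled in plain Python as shown here. This is a sketch of the reshaping step only, not Neo4j code; each tuple stands for one matched row (station, relationship type, neighbour), and the values are hypothetical.

```python
from collections import defaultdict

# Hypothetical matched rows: (station id, relationship type, neighbour id).
matched = [
    (165, "TAXI", 1),
    (165, "TAXI", 2),
    (165, "BUS", 1),
    (165, "BUS", 1),  # second BUS relationship to the same neighbour
    (9, "SCHIFF", 3),
]
modes = ("TAXI", "BUS", "SCHIFF")

# WITH s1, TYPE(mode) AS mode, COLLECT(DISTINCT neighbour) AS neighbours
neighbours = defaultdict(set)
for station, mode, other in matched:
    neighbours[(station, mode)].add(other)

# CASE ... END plus SUM(...) folds the per-mode rows into one row per station.
table = defaultdict(lambda: dict.fromkeys(modes, 0))
for (station, mode), found in neighbours.items():
    table[station][mode] += len(found)

print(dict(table))
```

Modes with no matches stay at 0, mirroring how SUM treats the null produced by a CASE without a matching branch.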
I'm starting with Neo4j and using graphs, and I'm trying to get the following:
I have to find the difference between the number of users (each user is a node) and the number of distinct names they have. I have 16 nodes, and each one has its own name (Name is one of its properties), but some of them share the same name. For example, node A has (Name:Amanda, City:Roma) and node B has (Name:Amanda, City:Paris), so the count of names will be lower because some of them are repeated.
I have tried this:
match (n) with n, count(n) as c return sum(c)
That gives me the number of nodes. And then I tried this
match (n) with n, count(n) as nodeC with n, count(distinct n.Name) as nameC return sum(nodeC) as sumN, sum(nameC) as sumC, sumN-sumC
But it doesn't work (I'm not even sure I'm getting the names correctly, because when I run that part separately, it doesn't work either).
I think this is what you are looking for:
MATCH (n)
RETURN COUNT(n) - COUNT(DISTINCT n.Name) AS diff;
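As a quick sanity check of the arithmetic, here is the same calculation in plain Python. It is only a sketch, not Neo4j code; the third node is invented, and the property key is assumed to be spelled `Name` as in the question.

```python
# Hypothetical nodes, modelled as property dictionaries.
nodes = [
    {"Name": "Amanda", "City": "Roma"},
    {"Name": "Amanda", "City": "Paris"},
    {"Name": "Luca", "City": "Milano"},
]

# COUNT(n): every node counts.
node_count = len(nodes)

# COUNT(DISTINCT n.Name): nulls (missing names) are ignored.
distinct_names = len({n["Name"] for n in nodes if n.get("Name") is not None})

diff = node_count - distinct_names
print(diff)
```

Property keys are case-sensitive in Cypher, so the key in the query has to match the stored spelling exactly.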