I have the following query in Neo4j, which uses a UNION:
MATCH (u:User {userId:'1'})-[dw:DIRECTOR_WEIGHT]->(d:Person)-[:DIRECTED]->(m:Movie)
WITH m, avg(dw.weight) AS mean_dw, 0 AS mean_aw, 0 AS mean_gw
WHERE m.title = 'Bambi'
RETURN m.title, mean_dw, mean_aw, mean_gw, mean_dw + mean_aw + mean_gw AS total
UNION
MATCH (u:User {userId:'1'})-[aw:ACTOR_WEIGHT]->(a:Person)-[:ACTED_IN]->(m:Movie)
WITH m, 0 AS mean_dw, avg(aw.weight) AS mean_aw, 0 AS mean_gw
WHERE m.title = 'Bambi'
RETURN m.title, mean_dw, mean_aw, mean_gw, mean_dw + mean_aw + mean_gw AS total
UNION
MATCH (u:User {userId:'1'})-[gw:GENRE_WEIGHT]->(g:Genre)<-[:GENRE]-(m:Movie)
WITH m, 0 AS mean_dw, 0 AS mean_aw, avg(gw.weight) AS mean_gw
WHERE m.title = 'Bambi'
RETURN m.title, mean_dw, mean_aw, mean_gw, mean_dw + mean_aw + mean_gw AS total
yielding the following result:
╒═════════╤═══════════════╤════════════════╤═════════════════╤═════════════════╕
│"m.title"│"mean_dw" │"mean_aw" │"mean_gw" │"total" │
╞═════════╪═══════════════╪════════════════╪═════════════════╪═════════════════╡
│"Bambi" │7.2916666666667│"0" │"0" │7.2916666666667 │
├─────────┼───────────────┼────────────────┼─────────────────┼─────────────────┤
│"Bambi" │"0" │0.45322110715442│"0" │0.45322110715442 │
├─────────┼───────────────┼────────────────┼─────────────────┼─────────────────┤
│"Bambi" │"0" │"0" │9.289617486338933│9.289617486338933│
└─────────┴───────────────┴────────────────┴─────────────────┴─────────────────┘
My problem is that "total" doesn't do what I intend, since I only want a single total per movie (i.e. the sum of the three non-zero weights: 7.29 + 0.45 + 9.29),
but I cannot find a way to process this returned result further. I.e., I would like to be able to say something like
RETURN m.title, sum(total)
or
RETURN m.title, mean_dw + mean_aw + mean_gw
after getting the union of mean_dw, mean_aw, and mean_gw respectively.
While post-union processing isn't currently supported by Cypher, you can get around this with apoc.cypher.run() from the APOC procedures library. This lets you perform the union inside the run and yield the unioned result, so you can finish whatever processing remains.
That said, your three queries perform identical operations; the only difference is the relationships followed in the matches. The three separate mean columns are also unnecessary work, since all you need is the average weight for each relationship type and then the sum of those averages.
That allows us to cut out some redundant operations and work with a narrower set of variables.
Something like this:
MATCH (u:User {userId:'1'}), (m:Movie{title:'Bambi'})
CALL apoc.cypher.run("
MATCH (u)-[dw:DIRECTOR_WEIGHT]->()-[:DIRECTED]->(m)
RETURN avg(dw.weight) as mean
UNION ALL
MATCH (u)-[aw:ACTOR_WEIGHT]->()-[:ACTED_IN]->(m)
RETURN avg(aw.weight) AS mean
UNION ALL
MATCH (u)-[gw:GENRE_WEIGHT]->()<-[:GENRE]-(m)
RETURN avg(gw.weight) AS mean
", {u:u, m:m}) YIELD value
RETURN m.title, SUM(value.mean) as total
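On Neo4j 4.1 or later, the same post-union aggregation can be sketched without APOC by wrapping the union in a CALL { ... } subquery. This is only a sketch under that version assumption: the importing WITH u, m lines are how variables are passed into each union branch in those releases, and more recent releases may prefer a CALL (u, m) { ... } form instead.

```cypher
MATCH (u:User {userId: '1'}), (m:Movie {title: 'Bambi'})
CALL {
  // Each UNION branch must import the outer variables it uses.
  WITH u, m
  MATCH (u)-[dw:DIRECTOR_WEIGHT]->()-[:DIRECTED]->(m)
  RETURN avg(dw.weight) AS mean
  UNION ALL
  WITH u, m
  MATCH (u)-[aw:ACTOR_WEIGHT]->()-[:ACTED_IN]->(m)
  RETURN avg(aw.weight) AS mean
  UNION ALL
  WITH u, m
  MATCH (u)-[gw:GENRE_WEIGHT]->()<-[:GENRE]-(m)
  RETURN avg(gw.weight) AS mean
}
// The unioned rows are available here, so they can be aggregated further.
RETURN m.title, sum(mean) AS total
```

UNION ALL is used rather than UNION so that two identical means are not collapsed into one row before summing.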
Now, all that said, you don't necessarily need to use unions at all. You can just chain the matches together with WITH:
MATCH (u:User {userId:'1'}), (m:Movie{title:'Bambi'})
MATCH (u)-[dw:DIRECTOR_WEIGHT]->()-[:DIRECTED]->(m)
WITH u, m, avg(dw.weight) as total
MATCH (u)-[aw:ACTOR_WEIGHT]->()-[:ACTED_IN]->(m)
WITH u, m, total + avg(aw.weight) AS total
MATCH (u)-[gw:GENRE_WEIGHT]->()<-[:GENRE]-(m)
WITH u, m, total + avg(gw.weight) AS total
RETURN m.title, total
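One caveat with chaining plain MATCH clauses: if the user has no weights of one kind for the movie (say, no ACTOR_WEIGHT path), that MATCH produces no rows and the whole result disappears. A hedged variant, assuming a missing mean should count as 0, uses OPTIONAL MATCH and keeps each average in its own column before adding them. Keeping the aggregates in separate columns also avoids mixing a plain variable with an aggregate in one expression, which some newer Neo4j versions may reject.

```cypher
MATCH (u:User {userId: '1'}), (m:Movie {title: 'Bambi'})
OPTIONAL MATCH (u)-[dw:DIRECTOR_WEIGHT]->()-[:DIRECTED]->(m)
// avg() over no values is null, so coalesce turns a missing mean into 0
WITH u, m, coalesce(avg(dw.weight), 0) AS mean_dw
OPTIONAL MATCH (u)-[aw:ACTOR_WEIGHT]->()-[:ACTED_IN]->(m)
WITH u, m, mean_dw, coalesce(avg(aw.weight), 0) AS mean_aw
OPTIONAL MATCH (u)-[gw:GENRE_WEIGHT]->()<-[:GENRE]-(m)
WITH m, mean_dw, mean_aw, coalesce(avg(gw.weight), 0) AS mean_gw
RETURN m.title, mean_dw + mean_aw + mean_gw AS total
```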
We have a large graph (over 1 billion edges) that has multiple relationship types between nodes.
In order to check the number of nodes that have a single unique relationship between nodes (i.e. a single relationship between two nodes per type, which otherwise would not be connected) we are running the following query:
MATCH (n)-[:REL_TYPE]-(m)
WHERE size((n)-[]-(m))=1 AND id(n)>id(m)
RETURN COUNT(DISTINCT n) + COUNT(DISTINCT m)
To demonstrate a similar result, the sample code below can be run on the movie graph after running :play movies in an empty graph, resulting in 4 nodes (in this case we are asking for nodes with 3 types of relationships):
MATCH (n)-[]-(m)
WHERE size((n)-[]-(m))=3 AND id(n)>id(m)
RETURN COUNT(DISTINCT n) + COUNT(DISTINCT m)
Is there a better/more efficient way to query the graph?
The following query is more performant, since it only scans each relationship once [whereas size((n)--(m)) will cause relationships to be scanned multiple times]. It also specifies a relationship direction to filter out half of the relationship scans, and to avoid the need for comparing native IDs.
MATCH (n)-->(m)
WITH n, m, COUNT(*) AS cnt
WHERE cnt = 3
RETURN COUNT(DISTINCT n) + COUNT(DISTINCT m)
NOTE: It is not clear what you are using the COUNT(DISTINCT n) + COUNT(DISTINCT m) result for, but be aware that it is possible for some nodes to be counted twice after the addition.
[UPDATE]
If you want to get the actual number of distinct nodes that pass your filter, here is one way to do that:
MATCH (n)-->(m)
WITH n, m, COUNT(*) AS cnt
WHERE cnt = 3
WITH COLLECT(n) + COLLECT(m) AS nodes
UNWIND nodes AS node
RETURN COUNT(DISTINCT node)
I found the following information on how to calculate the size of neo4j database: https://neo4j.com/developer/guide-sizing-and-hardware-calculator/#_disk_storage
An example disk space calculation is:
10,000 Nodes x 14B = 140kB
1,000,000 Rels x 33B = 31.5MB
2,010,000 Props x 41B = 78.6MB
Total is 110.2MB
Is there a query that could simply fetch this information for me?
For the node count the query is simple:
match (n) return count(n);
For the relationship count the query would be as follows:
match (n)-[r]-() return count(r);
How do I get the count of all properties of all nodes and relations combined though?
To answer your main question: use the keys function to get the list of property names for each node and relationship, then sum the sizes of those lists:
MATCH (n)
WITH SUM(SIZE(KEYS(n))) AS countOfNodeProps,
COUNT(n) AS countOfNodes
MATCH ()-[r]->()
WITH countOfNodeProps,
countOfNodes,
SUM(SIZE(KEYS(r))) AS countOfRelProps,
COUNT(r) AS countOfRels
RETURN countOfNodeProps,
countOfRelProps,
(countOfNodeProps + countOfRelProps) as countOfProps,
countOfNodes,
countOfRels
But it's easier to use the apoc.monitor.store procedure to get the exact information about the storage:
CALL apoc.monitor.store() YIELD
logSize,
stringStoreSize,
arrayStoreSize,
relStoreSize,
propStoreSize,
totalStoreSize,
nodeStoreSize
RETURN *
In this Cypher query, I want to sum all the weights over paths in a graph:
MATCH p=(n:person)-[r*2..3]->(m:person)
WHERE n.name = 'alice' and m.name = 'bob'
WITH REDUCE(weights=0, rel IN r : weights + rel.weight) AS weight_sum, p
return n.name, m.name, weight_sum
LIMIT 10
In this query, I expect to receive a table with 3 columns: n.name, m.name (identical in all the rows), and weight_sum -- according to the weight sum in the specific path.
However, I get this error:
reduce(...) requires '| expression' (an accumulation expression) (line 3, column 6 (offset: 89))
"WITH REDUCE(weights=0, rel IN r : weights + rel.weight) AS weight_sum, p"
I obviously miss something trivial. But what?
Shouldn't that be
REDUCE(weights=0, rel IN r | weights + rel.weight) AS weight_sum
(with a pipe instead of a colon) as per the documentation in http://neo4j.com/docs/developer-manual/current/cypher/functions/list/ ?
reduce(totalAge = 0, n IN nodes(p)| totalAge + n.age) AS reduction
Hope this helps.
Regards,
Tom
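For reference, here is a sketch of the whole query with the pipe in place. It also carries n and m through the WITH clause, because variables that are not listed in a WITH are no longer in scope for the RETURN that follows; the labels and property names are taken unchanged from the question.

```cypher
MATCH p = (n:person)-[r*2..3]->(m:person)
WHERE n.name = 'alice' AND m.name = 'bob'
// n and m must be listed here so that RETURN can still see them
WITH n, m, reduce(weights = 0, rel IN r | weights + rel.weight) AS weight_sum
RETURN n.name, m.name, weight_sum
LIMIT 10
```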
I am writing a Cypher query in Neo4j 2.0.4 that attempts to get the total number of inbound and outbound relationships for a selected node. I can do this easily when I only use this query one-node-at-a-time, like so:
MATCH (g1:someIndex{name:"name1"})
MATCH g1-[r1]-()
RETURN count(r1);
//Returns 305
MATCH (g2:someIndex{name:"name2"})
MATCH g2-[r2]-()
RETURN count(r2);
//Returns 2334
But when I try to run the query with 2 nodes together (i.e. get the total number of relationships for both g1 and g2), I seem to get a bizarre result.
MATCH (g1:someIndex{name:"name1"}), (g2:someIndex{name:"name2"})
MATCH g1-[r1]-(), g2-[r2]-()
RETURN count(r1)+count(r2);
//Returns 1423740
For some reason, the number is much much greater than the total of 305+2334.
It seems like other Neo4j users have run into strange issues when using multiple MATCH clauses, so I read through Michael Hunger's explanation at https://groups.google.com/d/msg/neo4j/7ePLU8y93h8/8jpuopsFEFsJ, which advised Neo4j users to pipe the results of one match using WITH to avoid "identifier uniqueness". However, when I run the following query, it simply times out:
MATCH (g1:gene{name:"SV422_HUMAN"}),(g2:gene{name:"BRCA1_HUMAN"})
MATCH g1-[r1]-()
WITH r1
MATCH g2-[r2]-()
RETURN count(r1)+count(r2);
I suspect this query doesn't return because there are a lot of records returned for r1. In this case, how would I run my "get-number-of-relationships" query on 2 nodes? Am I just using some incorrect syntax, or is there some fundamental issue with the logic of my "2 nodes at a time" query?
Your first problem is that you are returning a Cartesian product when you do this:
MATCH (g1:someIndex{name:"name1"}), (g2:someIndex{name:"name2"})
MATCH g1-[r1]-(), g2-[r2]-()
RETURN count(r1)+count(r2);
If there are 305 instances of r1 and 2334 instances of r2, you're returning (305 * 2334) == 711870 rows, and because you are summing this (count(r1)+count(r2)) you're getting a total of 711870 + 711870 == 1423740.
Your second problem is that you are not carrying over g2 in the WITH clause of this query:
MATCH (g1:gene{name:"SV422_HUMAN"}),(g2:gene{name:"BRCA1_HUMAN"})
MATCH g1-[r1]-()
WITH r1
MATCH g2-[r2]-()
RETURN count(r1)+count(r2);
You match on g2 in the first MATCH clause, but then you leave it behind when you only carry over r1 in the WITH clause at line 3. Then, in line 4, when you match on g2-[r2]-() you are matching literally everything in your graph, because g2 has been unbound.
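A minimal sketch of that query with g2 carried through, assuming the same labels and names as in the question: the first count is collapsed to a single number before the second MATCH, so there is nothing to multiply against.

```cypher
MATCH (g1:gene {name: "SV422_HUMAN"}), (g2:gene {name: "BRCA1_HUMAN"})
MATCH (g1)-[r1]-()
// keep g2 bound, and reduce the r1 rows to one count
WITH g2, count(r1) AS c1
MATCH (g2)-[r2]-()
WITH c1, count(r2) AS c2
RETURN c1 + c2 AS total
```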
Let me walk through a solution with the movie dataset that ships with the Neo4j browser, as you have not provided sample data. Let's say I want to get the total count of relationships attached to Tom Hanks and Hugo Weaving.
As separate queries:
MATCH (:Person {name:'Tom Hanks'})-[r]-()
RETURN COUNT(r)
=> 13
MATCH (:Person {name:'Hugo Weaving'})-[r]-()
RETURN COUNT(r)
=> 5
If I try to do it your way, I'll get (13 * 5) * 2 == 130, which is incorrect:
MATCH (:Person {name:'Tom Hanks'})-[r1]-(),
(:Person {name:'Hugo Weaving'})-[r2]-()
RETURN COUNT(r1) + COUNT(r2)
=> 130
Again, this is because I've matched on all combinations of r1 and r2, of which there are 65 (13 * 5 == 65), and then summed the two counts to arrive at a total of 130 (65 + 65 == 130).
The solution is to use DISTINCT:
MATCH (:Person {name:'Tom Hanks'})-[r1]-(),
(:Person {name:'Hugo Weaving'})-[r2]-()
RETURN COUNT(DISTINCT r1) + COUNT(DISTINCT r2)
=> 18
Clearly, the DISTINCT modifier only counts the distinct instances of each entity.
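The row arithmetic can be reproduced without any graph data. In this sketch the two UNWIND lists are stand-ins (not real relationships) for 13 and 5 matched relationships; unwinding both yields every combination, just as the two-pattern MATCH does.

```cypher
UNWIND range(1, 13) AS r1      // stand-in for 13 relationships
UNWIND range(101, 105) AS r2   // stand-in for 5 relationships
RETURN count(*) AS row_count,                                   // 65
       count(r1) + count(r2) AS inflated,                       // 130
       count(DISTINCT r1) + count(DISTINCT r2) AS deduplicated  // 18
```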
You can also accomplish this with WITH if you want:
MATCH (:Person {name:'Tom Hanks'})-[r]-()
WITH COUNT(r) AS r1
MATCH (:Person {name:'Hugo Weaving'})-[r]-()
RETURN r1 + COUNT(r)
=> 18
TL;DR - Beware of Cartesian products. DISTINCT is your friend:
MATCH (:someIndex{name:"name1"})-[r1]-(),
(:someIndex{name:"name2"})-[r2]-()
RETURN COUNT(DISTINCT r1) + COUNT(DISTINCT r2);
The explosion of results you're seeing can be easily explained:
MATCH (g1:someIndex{name:"name1"}), (g2:someIndex{name:"name2"})
MATCH g1-[r1]-(), g2-[r2]-()
RETURN count(r1)+count(r2);
//Returns 1423740
In the 2nd line, every relationship of g1 is combined with every relationship of g2. This explains the number, since 1423740 = 305 * 2334 * 2. So you're basically evaluating a cross product here.
The right way to calculate the sum of all relationships for name1 and name2 is:
MATCH (g:someIndex)-[r]-()
WHERE g.name in ["name1", "name2"]
RETURN count(r)
My query is:
MATCH (n)-[:NT]->(p)
WHERE ...some properties filters...
RETURN n,p
The result is on the screenshot below.
How do I count the total number of nodes?
I need 14 as a text result. Something like RETURN COUNT(n)+COUNT(p) but it shows 24.
The following request doesn't work correctly:
MATCH (n)-[:NT]->(p)
WHERE ...some properties filters...
RETURN count(n)
This returns 12, which is the number of relationship pairs in the picture, not the number of nodes.
MATCH (n)-[:NT]-(p)
WHERE ...some properties filters...
RETURN count(n)
Returns 24.
How do I also count the two nodes (in this example) that have ONLY outgoing arrows? The result should be 14 in a single query.
UPD:
MATCH (n)-[:NT]->(p)
WHERE ...
RETURN DISTINCT FILTER(x in n.myID WHERE NOT x in p.myID)
MATCH (n)-[:NT]->(p)
WHERE ...
RETURN DISTINCT FILTER(x in p.myID WHERE NOT x in n.myID)
The COUNT of DISTINCT UNION of myID gives me the result.
I don't know how to make it with cypher.
Or the DISTINCT UNION of collections:
MATCH (n)-[:NT]->(p)
WHERE ...
RETURN collect(DISTINCT p.myID), collect(DISTINCT n.myID)
The result is:
collect(DISTINCT p.myID)
26375, 26400, 21636, 29939, 20454, 26543, 19089, 4483, 26607, 30375, 26608, 26605
collect(DISTINCT n.myID)
11977, 19478, 20454
That is 15 items, one of which is common to both lists. If you UNION or DISTINCT the 20454, the total COUNT would be 14, the actual number of nodes in the picture.
I cannot work out how to express this simple pattern.
Your original queries are working correctly.
If you want to get a count of distinct n nodes, your queries should RETURN COUNT(DISTINCT n).
To count the number of nodes that only have outgoing relationships:
MATCH (n)-->()
WHERE NOT ()-->(n)
RETURN COUNT(DISTINCT n);
To count the number of distinct nodes that are directly involved in an :NT relationship:
MATCH (n)-[:NT]-()
RETURN COUNT(DISTINCT n);
MATCH (n)-[:NT]->(p)
WHERE ...some properties filters...
WITH collect(DISTINCT p.myID) AS set1
MATCH (n)-[:NT]->(p)
WHERE ...some properties filters...
WITH collect(DISTINCT n.myID) AS set2, set1
WITH set1 + set2 AS BOTH
UNWIND BOTH AS res
RETURN COUNT(DISTINCT res);
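The same count can also be sketched in a single pass by unwinding both ends of each matched relationship into one column. This counts distinct nodes rather than distinct myID values, so it assumes myID is unique per node; the elided property filters from the question are left as a comment placeholder.

```cypher
MATCH (n)-[:NT]->(p)
// WHERE ...some properties filters... (your own filters go here, unchanged)
UNWIND [n, p] AS node
RETURN count(DISTINCT node) AS total_nodes
```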