Totalize and Relationships in depth - neo4j

I'm using neo4j 3.5 community edition and trying totalize and get hops from a specific node. I'm running to totalize(its working):
MATCH (p:Person{id:123})-[r]->(m)
RETURN p,count(distinct(r.idSale)) as qtd,sum(r.value) as value,type(r) as type,r.seg as segment,m
How I get hops when its totalize?? I'm trying something like these (not working):
MATCH (p:Person{id:123})-[r]->(m)-[r2*2]->(m2)
RETURN p,count(distinct(r.idSale)) as qtd,sum(r.value) as value,type(r),m,count(distinct(r2.idSale)) as qtd2,sum(r2.value) as t
error : Type mismatch: expected Any, Map, Node, Relationship, Point,
Duration, Date, Time, LocalTime, LocalDateTime or DateTime but was
List (line 1, column 149 (offset: 148))
My graph is like this:
I'm trying:

Since [r2*2] specifies a variable length relationship pattern, r2 is a list of relationships.
Here is one way to get what it seems you want:
MATCH (p:Person{id:123})-[r]->(m), path=(m)-[*2]->(m2)
RETURN p,count(distinct(r.idSale)) as qtd,sum(r.value) as value,type(r),m,
SIZE(apoc.coll.toSet([r2 IN RELATIONSHIPS(path) | r2.idSale])) AS qtd2,
REDUCE(s = 0, r2 IN RELATIONSHIPS(path) | s + r2.value) AS total2,
[r2 IN RELATIONSHIPS(path) | TYPE(r2)] AS types2,
m2
This query uses the APOC function apoc.coll.toSet to get a distinct list. Also, it returns a list of relationship types in types2.

Related

Efficiently getting relationship histogram for a set of nodes

Background
I want to create a histogram of the relationships starting from a set of nodes.
Input is a set of node ids, for example set = [ id_0, id_1, id_2, id_3, ... id_n ].
The output is a the relationship type histogram for each node (e.g. Map<Long, Map<String, Long>>):
id_0:
- ACTED_IN: 14
- DIRECTED: 1
id_1:
- DIRECTED: 12
- WROTE: 5
- ACTED_IN: 2
id_2:
...
The current cypher query I've written is:
MATCH (n)-[r]-()
WHERE id(n) IN [ id_0, id_1, id_2, id_3, ... id_n ] # set
RETURN id(n) as id, type(r) as type, count(r) as count
It returns the pair of [ id, type ] count like:
id | rel type | count
id0 | ACTED_IN | 14
id0 | DIRECTED | 1
id1 | DIRECTED | 12
id1 | WROTE | 5
id1 | ACTED_IN | 2
...
The result is collected using java and merged to the first structure (e.g. Map<Long, Map<String, Long>>).
Problem
Getting the relationship histogram on smaller graphs is fast but can be very slow on bigger datasets. For example if I want to create the histogram where the set-size is about 100 ids/nodes and each of those nodes have around 1000 relationships the cypher query took about 5 minutes to execute.
Is there more efficient way to collect the histogram for a set of nodes?
Could this query be parallelized? (With java code or using UNION?)
Is something wrong with how I set up my neo4j database, should these queries be this slow?
There is no need for parallel queries, just the need to understand Cypher efficiency and how to use statistics.
Bit of background :
Using count, will execute an expandAll, which is as expensive as the number of relationships a node has
PROFILE
MATCH (n) WHERE id(n) = 21
MATCH (n)-[r]-(x)
RETURN n, type(r), count(*)
Using size and a relationship type, uses internally getDegree which is a statistic a node has locally, and thus is very efficient
PROFILE
MATCH (n) WHERE id(n) = 0
RETURN n, size((n)-[:SEARCH_RESULT]-())
Morale of the story, for using size you need to know the relationship types a labeled node can have. So, you need to know the schema of the database ( in general you will want that, it makes things easily predictable and building dynamically efficient queries becomes a joy).
But let's assume you don't know the schema, you can use APOC cypher procedures, allowing you to build dynamic queries.
The flow is :
Get all the relationship types from the database ( fast )
Get the nodes from id list ( fast )
Build dynamic queries using size ( fast )
CALL db.relationshipTypes() YIELD relationshipType
WITH collect(relationshipType) AS types
MATCH (n) WHERE id(n) IN [21, 0]
UNWIND types AS type
CALL apoc.cypher.run("RETURN size((n)-[:`" + type + "`]-()) AS count", {n: n})
YIELD value
RETURN id(n), type, value.count

Cypher - Only show node name, not full node in path variable

In Cypher I have the following query:
MATCH p=(n1 {name: "Node1"})-[r*..6]-(n2 {name: "Node2"})
RETURN p, reduce(cost = 0, x in r | cost + x.cost) AS cost
It is working as expected. However, it prints the full n1 node, then the full r relationship (with all its attributes), and then full n2.
What I want instead is to just show the value of the name attribute of n1, the type attribute of r and again the name attribute of n2.
How could this be possible?
Thank you.
The tricky part of your request is the type attribute of r, as r is a collection of relationships of the path, not a single relationship. We can use EXTRACT to produce a list of relationship types for all relationships in your path. See if this will work for you:
MATCH (n1 {name: "Node1"})-[r*..6]-(n2 {name: "Node2"})
RETURN n1.name, EXTRACT(rel in r | TYPE(rel)) as types, n2.name, reduce(cost = 0, x in r | cost + x.cost) AS cost
You also seem to be calculating a cost for the path. Have you looked at the shortestPath() function?

Using Match with Multiple Clauses Causes Odd Results

I am writing a Cypher query in Neo4j 2.0.4 that attempts to get the total number of inbound and outbound relationships for a selected node. I can do this easily when I only use this query one-node-at-a-time, like so:
MATCH (g1:someIndex{name:"name1"})
MATCH g1-[r1]-()
RETURN count(r1);
//Returns 305
MATCH (g2:someIndex{name:"name2"})
MATCH g2-[r2]-()
RETURN count(r2);
//Returns 2334
But when I try to run the query with 2 nodes together (i.e. get the total number of relationships for both g1 and g2), I seem to get a bizarre result.
MATCH (g1:someIndex{name:"name1"}), (g2:someIndex{name:"name2"})
MATCH g1-[r1]-(), g2-[r2]-()
RETURN count(r1)+count(r2);
//Returns 1423740
For some reason, the number is much much greater than the total of 305+2334.
It seems like other Neo4j users have run into strange issues when using multiple MATCH clauses, so I read through Michael Hunger's explanation at https://groups.google.com/d/msg/neo4j/7ePLU8y93h8/8jpuopsFEFsJ, which advised Neo4j users to pipe the results of one match using WITH to avoid "identifier uniqueness". However, when I run the following query, it simply times out:
MATCH (g1:gene{name:"SV422_HUMAN"}),(g2:gene{name:"BRCA1_HUMAN"})
MATCH g1-[r1]-()
WITH r1
MATCH g2-[r2]-()
RETURN count(r1)+count(r2);
I suspect this query doesn't return because there's a lot of records returned by r1. In this case, how would I operate my "get-number-of-relationships" query on 2 nodes? Am I just using some incorrect syntax, or is there some fundamental issue with the logic of my "2 node at a time" query?
Your first problem is that you are returning a Cartesian product when you do this:
MATCH (g1:someIndex{name:"name1"}), (g2:someIndex{name:"name2"})
MATCH g1-[r1]-(), g2-[r2]-()
RETURN count(r1)+count(r2);
If there are 305 instances of r1 and 2334 instances of r2, you're returning (305 * 2334) == 711870 rows, and because you are summing this (count(r1)+count(r2)) you're getting a total of 711870 + 711870 == 1423740.
Your second problem is that you are not carrying over g2 in the WITH clause of this query:
MATCH (g1:gene{name:"SV422_HUMAN"}),(g2:gene{name:"BRCA1_HUMAN"})
MATCH g1-[r1]-()
WITH r1
MATCH g2-[r2]-()
RETURN count(r1)+count(r2);
You match on g2 in the first MATCH clause, but then you leave it behind when you only carry over r1 in the WITH clause at line 3. Then, in line 4, when you match on g2-[r2]-() you are matching literally everything in your graph, because g2 has been unbound.
Let me walk through a solution with the movie dataset that ships with the Neo4j browser, as you have not provided sample data. Let's say I want to get the total count of relationships attached to Tom Hanks and Hugo Weaving.
As separate queries:
MATCH (:Person {name:'Tom Hanks'})-[r]-()
RETURN COUNT(r)
=> 13
MATCH (:Person {name:'Hugo Weaving'})-[r]-()
RETURN COUNT(r)
=> 5
If I try to do it your way, I'll get (13 * 5) * 2 == 90, which is incorrect:
MATCH (:Person {name:'Tom Hanks'})-[r1]-(),
(:Person {name:'Hugo Weaving'})-[r2]-()
RETURN COUNT(r1) + COUNT(r2)
=> 90
Again, this is because I've matched on all combinations of r1 and r2, of which there are 65 (13 * 5 == 65) and then summed this to arrive at a total of 90 (65 + 65 == 90).
The solution is to use DISTINCT:
MATCH (:Person {name:'Tom Hanks'})-[r1]-(),
(:Person {name:'Hugo Weaving'})-[r2]-()
RETURN COUNT(DISTINCT r1) + COUNT(DISTINCT r2)
=> 18
Clearly, the DISTINCT modifier only counts the distinct instances of each entity.
You can also accomplish this with WITH if you wanted:
MATCH (:Person {name:'Tom Hanks'})-[r]-()
WITH COUNT(r) AS r1
MATCH (:Person {name:'Hugo Weaving'})-[r]-()
RETURN r1 + COUNT(r)
=> 18
TL;DR - Beware of Cartesian products. DISTINCT is your friend:
MATCH (:someIndex{name:"name1"})-[r1]-(),
(:someIndex{name:"name2"})-[r2]-()
RETURN COUNT(DISTINCT r1) + COUNT(DISTINCT r2);
The explosion of results you're seeing can be easily explained:
MATCH (g1:someIndex{name:"name1"}), (g2:someIndex{name:"name2"})
MATCH g1-[r1]-(), g2-[r2]-()
RETURN count(r1)+count(r2);
//Returns 1423740
In the 2nd line every combination of any relationship from g1 is combined with any relationship of g2, this explains the number since 1423740 = 305 * 2334 * 2. So you're evaluating basically a cross product here.
The right way to calculate the sum of all relationships for name1 and name2 is:
MATCH (g:someIndex)-[r]-()
WHERE g.name in ["name1", "name2"]
RETURN count(r)

neo4j collecting nodes and relations type b-->a<--c,a<--d

I am extending maxdemarzi's excellent graph visualisation example (http://maxdemarzi.com/2013/07/03/the-last-mile/) using VivaGraph backed by neo4j.
I want to display relationships of the type
a-->b<--c,b<--d
I tried the query
MATCH p = (a)--(b:X)--(c),(b:X)--(d)
RETURN EXTRACT(n in nodes(p) | {id:ID(n), name:COALESCE(n.name, n.title, ID(n)), type:LABELS(n)}) AS nodes,
EXTRACT(r in relationships(p)| {source:ID(startNode(r)) , target:ID(endNode(r))}) AS rels
It looks like the named query picks up only a-->b<--c pattern and omits the b<--d patterns.
Am i missing something... can i not add multiple patterns in a named query?
The most immediate problem is that the comma in the MATCH clause separates the first pattern from the second. The variable 'p' only stores the first pattern. This is why you aren't getting the results you desire. Independent of that, you are at risk of having a 'loose binding' by putting a label on both of your nodes named 'b' in the two patterns. The second 'b' node should not have a label.
So here is a version of your query that should work.
MATCH p1=(a)-->(b:X)<--(c), p2=(b)<--(d)
WITH nodes(p1) + d AS ns, relationships(p1) + relationships(p2) AS rs
RETURN EXTRACT(n IN ns | {id:ID(n), name:COALESCE(n.name, n.title, ID(n)), type:LABELS(n)}) AS nodes,
EXTRACT(r in rs| {source:ID(startNode(r)) , target:ID(endNode(r))}) AS rels
Capture both paths, then build collections from the nodes and relationships of both paths. The collection of nodes actually only extracts the nodes from p1 and adds the 'd' node. You could write that part as
nodes(p1) + nodes(p2) as ns
but then the 'b' node will appear in the list twice.

Multiple relationships in Match Cypher

Trying to find similar movies on the basis of tags. But I also need all the tags for the given movie and its each similar movie (to do some calculations). But surprisingly collect(h.w) gives repeated values of h.w (where w is a property of h)
Here is the cypher query. Please help.
MATCH (m:Movie{id:1})-[h1:Has]->(t:Tag)<-[h2:Has]-(sm:Movie),
(m)-[h:Has]->(t0:Tag),
(sm)-[H:Has]->(t1:Tag)
WHERE m <> sm
RETURN distinct(sm), collect(h.w)
Basically a query like
MATCH (x)-[h]->(y), (a)-[H]->(b)
RETURN h
is returning each result for h n times where n is the number of results for H. Any way around this?
I replicated the data model for this question to help answer it.
I then setup a sample dataset using Neo4j's online console: http://console.neo4j.org/?id=dakmi3
Running the following query from your question:
MATCH (m:Movie { title: "The Matrix" })-[h1:HAS_TAG]->(t:Tag),
(t)<-[h2:HAS_TAG]-(sm:Movie),
(m)-[h:HAS_TAG]->(t0:Tag),
(sm)-[H:HAS_TAG]->(t1:Tag)
WHERE m <> sm
RETURN DISTINCT sm, collect(h.weight)
Which results in:
(1:Movie {title:"The Matrix: Reloaded"}) [0.31, 0.12, 0.31, 0.12, 0.31, 0.01, 0.31, 0.01]
The issue is that there are duplicate relationships being returned, which results in duplicated weight in the collection. The solution is to use WITH to limit relationships to distinct records and then return the collection of weights of those relationships.
MATCH (m:Movie { title: "The Matrix" })-[h1:HAS_TAG]->(t:Tag),
(t)<-[h2:HAS_TAG]-(sm:Movie),
(m)-[h:HAS_TAG]->(t0:Tag),
(sm)-[H:HAS_TAG]->(t1:Tag)
WHERE m <> sm
WITH DISTINCT sm, h
RETURN sm, collect(h.weight)
(1:Movie {title:"The Matrix: Reloaded"}) [0.31, 0.12, 0.01]
I'm afraid I still don't quite get your intention, but about the general question of duplicate results, that is just the way a disconnected pattern works. Cypher must consider something like
(:A), (:B)
as one pattern, not two. That means that any satisfying graph structure is considered a distinct match. Suppose you have the graph resulting from
CREATE (:A), (:B), (:B)
and query it for the pattern above, you get two results, namely
neo4j-sh (?)$ MATCH (a:A),(b:B) RETURN *;
==> +-------------------------------+
==> | a | b |
==> +-------------------------------+
==> | Node[15204]{} | Node[15207]{} |
==> | Node[15204]{} | Node[15208]{} |
==> +-------------------------------+
==> 2 rows
==> 53 ms
Similarly when matching your pattern (x)-[h]->(y), (a)-[H]->(b) cypher considers each combination of the two pattern parts to make up a unique match for the one whole pattern–so the results for h are compounded by the results for H.
This the way the pattern matching works. To achieve what you want you could first consider if you really need to query for a disconnected pattern. If you do, or if a connected pattern also generates redundant matches, then aggregate one or more of the pattern parts. A simple case might be
CREATE (a:A), (b1:B), (b2:B)
, (c1:C), (c2:C), (c3:C)
, a-[:X]->b1, a-[:X]->b2
, a-[:Y]->c1, a-[:Y]->c2, a-[:Y]->c3
queried with
MATCH (b:B)<-[:X]-(a:A)-[:Y]->(c:C) // with 1 (a), 2 (b) and 3 (c) you get 6 matched paths
RETURN a, collect (b) as bb, collect (c) as cc // after aggregation by (a) there is one path
Sometimes it makes sense to do the aggregation as an intermediate step
MATCH (b)<-[:X]-(a:A) // 2 paths
WITH a, collect(b) as bb // 1 path
MATCH a-[:Y]->(c) // 3 paths
RETURN a, bb, collect(c) as cc // 1 path

Resources