Cypher query: Get a count grouped by relationship property - neo4j

I am new to Neo4j/Cypher and I am having some trouble grouping by relationship properties.
Firstly a toy example:
CREATE (A1:Worker {ref:"A1"})
CREATE (A2:Worker {ref:"A2"})
CREATE (B1:Worker {ref:"B1"})
CREATE (B2:Worker {ref:"B2"})
CREATE (A1)-[:StreamsTo {type:"stream1"}]->(B1)
CREATE (A1)-[:StreamsTo {type:"stream2"}]->(B1)
CREATE (A1)-[:StreamsTo {type:"stream1"}]->(B2)
CREATE (A1)-[:StreamsTo {type:"stream2"}]->(B2)
CREATE (A2)-[:StreamsTo {type:"stream1"}]->(B1)
CREATE (A2)-[:StreamsTo {type:"stream1"}]->(B2)
This creates a graph with 4 worker nodes, where the A nodes are connected to the B nodes by relationships that can have different values for the "type" property. In this case A1 is connected to the B nodes by 2 different types of streams and A2 by only 1 type.
What I want to be able to do is to count the number of outgoing relationships from each source node but have them grouped by the various values of the "type" property in the relationship to get something like this:
+--------+-------------+---------------+
| Worker | StreamType | OutgoingCount |
+--------+-------------+---------------+
| A1 | stream1 | 2 |
+--------+-------------+---------------+
| A1 | stream2 | 2 |
+--------+-------------+---------------+
| A2 | stream1 | 2 |
+--------+-------------+---------------+
So far I can get the total outgoing and number of distinct outgoing types:
MATCH (source:Worker)-[st:StreamsTo]->(:Worker)
RETURN source.ref as Source,
COUNT(st) as TotalOutgoing,
COUNT(distinct st.type) as NumberOfTypes;
Any hints would be helpful.

So it turns out to be trivial! I had not understood that the non-aggregated values you return alongside the COUNT() function act as the grouping keys:
MATCH (source:Worker)-[st:StreamsTo]->(:Worker)
RETURN source.ref as Worker,
st.type as StreamType,
COUNT(st) as OutgoingCount;
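To make the implicit grouping concrete, here is a minimal pure-Python sketch of the same counting over the toy data. It does not talk to Neo4j; the `edges` list and the `grouped_counts` helper are illustrative names, not part of any Neo4j API.

```python
from collections import Counter

# (source ref, stream type, target ref) for each StreamsTo relationship in the toy graph
edges = [
    ("A1", "stream1", "B1"),
    ("A1", "stream2", "B1"),
    ("A1", "stream1", "B2"),
    ("A1", "stream2", "B2"),
    ("A2", "stream1", "B1"),
    ("A2", "stream1", "B2"),
]


def grouped_counts(rows):
    """Count rows per (source, type) pair, mirroring
    RETURN source.ref, st.type, COUNT(st)."""
    return Counter((source, stream_type) for source, stream_type, _target in rows)


counts = grouped_counts(edges)
for (worker, stream_type), n in sorted(counts.items()):
    print(worker, stream_type, n)
```

Every distinct combination of the non-aggregated columns becomes one output row, which is the role `source.ref` and `st.type` play in the Cypher query.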

Related

In neo4j, a query to count the number of distinct structures

In Neo4j my database consists of chains of nodes. For each distinct structure/layout (does graph theory have a better word?), I want to count the number of chains. For example, the database consists of 9 nodes and 5 relationships like this:
(:a)->(:b)
(:b)->(:a)
(:a)->(:b)
(:a)->(:b)->(:b)
where (:a) is a node with label a. Properties on nodes and relationships are irrelevant.
The result of the counting should be:
------------------------
| Structure | n |
------------------------
| (:a)->(:b) | 2 |
| (:b)->(:a) | 1 |
| (:a)->(:b)->(:b) | 1 |
------------------------
Is there a query that can achieve this?
Appendix
Query to create test data:
create (:a)-[:r]->(:b), (:b)-[:r]->(:a), (:a)-[:r]->(:b), (:a)-[:r]->(:b)-[:r]->(:b)
EDIT:
Thanks for the clarification.
We can get the equivalent of what you want, a capture of the path pattern using the labels present:
MATCH path = (start)-[*]->(end)
WHERE NOT ()-->(start) and NOT (end)-->()
RETURN [node in nodes(path) | labels(node)[0]] as structure, count(path) as n
This will give you a list of the labels of the nodes (the first label present for each...remember that nodes can be multi-labeled, which may throw off your results).
As for getting it into that exact format in your example, that's a different thing. We could do this with some text functions in APOC Procedures, specifically apoc.text.join().
We would first need to add formatting around the extraction of the first label, to add the ':' prefix as well as the parentheses. Then we could use apoc.text.join() to get a string where the nodes are joined by your desired '->' symbol:
MATCH path = (start)-[*]->(end)
WHERE NOT ()-->(start) and NOT (end)-->()
WITH [node in nodes(path) | labels(node)[0]] as structure, count(path) as n
RETURN apoc.text.join([label in structure | '(:' + label + ')'], '->') as structure, n
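The formatting and counting step can be sketched in plain Python, assuming the label lists have already been extracted from each path. The `chains` data and the `format_structure` helper are hypothetical stand-ins for the query's `structure` column.

```python
from collections import Counter

# First label of each node along one root-to-leaf chain, for the test data in the question
chains = [
    ["a", "b"],
    ["b", "a"],
    ["a", "b"],
    ["a", "b", "b"],
]


def format_structure(labels):
    """Wrap each label as '(:label)' and join the pieces with '->'."""
    return "->".join("(:" + label + ")" for label in labels)


# Identical formatted strings fall into the same group, like count(path) per structure
structure_counts = Counter(format_structure(chain) for chain in chains)
for structure, n in structure_counts.items():
    print(structure, n)
```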

Neo4j Cypher - How to Count Multiple Property Values With Cypher Efficiently And Paginate Properly

I am struggling to get the proper cypher that is both efficient and allows pagination through skip and limit.
Here is the simple scenario: I have the related nodes (company)<-[A]-(set)<-[B]-(job) where there are multiple instances of (set) with distinct (job) instances related to them. The (job) nodes have a specific status property that can hold one of several states. We need to count the number of (job) nodes in a particular state per (set) and use skip and limit to paginate on the distinct (set) nodes.
So we can get a very efficient query for job.status counts using this.
match (c:Company {id: 'MY.co'})<-[:type_of]-(s:Set)<-[:job_for]-(j:Job)
return s.Description, j.Status, count(*) as StatusCount;
This gives us rows of Set.Description, Job.Status, and the status count. But we get multiple rows per Set, one for each Job.Status, which is not conducive to paging over distinct sets. Something like:
s.Description j.Status StatusCount
-------------------+--------------+----------------
Set 1 | Unassigned | 10
Set 1 | Completed | 2
Set 2 | Unassigned | 3
Set 1 | Reviewed | 10
Set 3 | Completed | 4
Set 2 | Reviewed | 7
What we are trying to achieve with the proper cypher is result rows based on distinct Sets. Something like this:
s.Description Unassigned Completed Reviewed
-------------------+--------------+-------------+----------
Set 1 | 10 | 2 | 10
Set 2 | 3 | 0 | 7
Set 3 | 0 | 4 | 0
This would then allow us to paginate over Sets using skip and limit properly.
I have tried many different approaches and cannot seem to find the right combination for this type of result. Anyone have any ideas? Thanks!
** EDIT - Using the answer provided by Michael, here's how to get the status count values in Java **
match (c:Company {id: 'MY.co'})<-[:type_of]-(s:Set)<-[:job_for]-(j:Job)
with s, j.Status as Status,count(*) as StatusCount
return s.Description, collect({Status:Status,StatusCount:StatusCount}) as StatusCounts;
// row is one result row, keyed by column name
List<Object> statusMaps = (List<Object>) row.get("StatusCounts");
for (Object statusEntry : statusMaps) {
    Map<String, Object> statusMap = (Map<String, Object>) statusEntry;
    String status = (String) statusMap.get("Status");
    Number count = (Number) statusMap.get("StatusCount");
}
You can use WITH and aggregation, and optionally collect the result as maps:
match (c:Company {id: 'MY.co'})<-[:type_of]-(s:Set)<-[:job_for]-(j:Job)
with s, j.Status as Status,count(*) as StatusCount
return s.Description, collect([Status,StatusCount]);
or
match (c:Company {id: 'MY.co'})<-[:type_of]-(s:Set)<-[:job_for]-(j:Job)
with s, j.Status as Status,count(*) as StatusCount
return s.Description, collect({Status:Status,StatusCount:StatusCount});
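Once the counts are collected per Set, turning them into one row per Set with a zero for every missing status is a small client-side step. The following is a sketch in plain Python using the sample rows from the question; `pivot_status_counts` and `paginate` are hypothetical helper names, and the skip/limit is applied here in memory rather than in Cypher.

```python
def pivot_status_counts(rows, statuses):
    """Turn (set, status, count) rows into one dict per set,
    filling in 0 for statuses that have no jobs."""
    pivoted = {}
    for set_name, status, count in rows:
        row = pivoted.setdefault(set_name, {s: 0 for s in statuses})
        row[status] = count
    return pivoted


def paginate(pivoted, skip, limit):
    """Page over distinct sets, ordered by name, like SKIP/LIMIT on sets."""
    names = sorted(pivoted)[skip:skip + limit]
    return [(name, pivoted[name]) for name in names]


rows = [
    ("Set 1", "Unassigned", 10),
    ("Set 1", "Completed", 2),
    ("Set 2", "Unassigned", 3),
    ("Set 1", "Reviewed", 10),
    ("Set 3", "Completed", 4),
    ("Set 2", "Reviewed", 7),
]
statuses = ["Unassigned", "Completed", "Reviewed"]

table = pivot_status_counts(rows, statuses)
page = paginate(table, skip=1, limit=2)
for name, status_counts in page:
    print(name, status_counts)
```

Because each Set now occupies exactly one row, skipping and limiting operate on distinct Sets rather than on (Set, Status) pairs.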

Does label order affect search time?

I'm using Neo4j 2.1.7. Recently I was experimenting with MATCH queries that search for nodes with several labels, and I found that in general the query
Match (p:A:B) return count(p) as number
and
Match (p:B:A) return count(p) as number
take different amounts of time to run, especially in extreme cases such as 2 million A nodes and 0 B nodes.
So does label order affect search time? Is this feature documented anywhere?
Neo4j internally maintains a label scan store, which is basically a lookup to quickly get all nodes carrying a given label A.
When doing a query like
MATCH (n:A:B) return count(n)
the label scan store is used to find all A nodes, which are then filtered for those that carry label B as well. If n(A) >> n(B) it's far more efficient to do MATCH (n:B:A) instead, since you look up only a few B nodes and filter those for A.
You can use PROFILE MATCH (n:A:B) return count(n) to see the query plan. For Neo4j <= 2.1.x you'll see a different query plan depending on the order of the labels you've specified.
Starting with Neo4j 2.2 (milestone M03 available as of writing this reply) there's a cost based Cypher optimizer. Now Cypher is aware of node statistics and they are used to optimize the query.
As an example I've used the following statements to create some test data:
create (:A:B);
with 1 as a foreach (x in range(0,1000000) | create (:A));
with 1 as a foreach (x in range(0,100) | create (:B));
We have now 100 B nodes, 1M A nodes and 1 AB node. In 2.2 the two statements:
MATCH (n:B:A) return count(n)
MATCH (n:A:B) return count(n)
result in the exact same query plan (and therefore in the same execution speed):
+------------------+---------------+------+--------+-------------+---------------+
| Operator | EstimatedRows | Rows | DbHits | Identifiers | Other |
+------------------+---------------+------+--------+-------------+---------------+
| EagerAggregation | 3 | 1 | 0 | count(n) | |
| Filter | 12 | 1 | 12 | n | hasLabel(n:A) |
| NodeByLabelScan | 12 | 12 | 13 | n | :B |
+------------------+---------------+------+--------+-------------+---------------+
Since there are only a few B nodes, it's cheaper to scan for B's and filter for A. Smart Cypher, isn't it ;-)
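The idea of scanning the rarest label first can be illustrated with a simplified pure-Python model. This is not how Neo4j's planner is implemented; `label_index` is a hypothetical stand-in for the label scan store, and the node counts are scaled down from the example above.

```python
def count_with_labels(label_index, labels):
    """Count nodes carrying all labels, scanning the smallest label set first.

    Returns (matching node count, number of nodes scanned)."""
    # Order labels by how many nodes carry them, so the scan starts with the rarest
    ordered = sorted(labels, key=lambda label: len(label_index.get(label, ())))
    candidates = set(label_index.get(ordered[0], ()))
    scanned = len(candidates)
    # Filter the scanned candidates by each remaining label
    for label in ordered[1:]:
        members = label_index.get(label, set())
        candidates = {node for node in candidates if node in members}
    return len(candidates), scanned


# Node 0 carries both labels; 1..1000 carry only A; 2000..2009 carry only B
label_index = {
    "A": set(range(0, 1001)),
    "B": {0} | set(range(2000, 2010)),
}

result_ab = count_with_labels(label_index, ["A", "B"])
result_ba = count_with_labels(label_index, ["B", "A"])
print(result_ab, result_ba)
```

Both label orders scan the same small B set in this model, which is the behaviour the identical 2.2 query plans suggest.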

Neo4j: Create self relation cypher query

I have a large number of EmpBase nodes, imported from a CSV file exported from PostgreSQL, like:
neo4j-sh (?)$ match (e:EmpBase) return e limit 10;
+-------------------------------------------------------------------------+
| e |
+-------------------------------------------------------------------------+
| Node[8992]{neo_eb_id:8993,neo_eb_name:"emp_no_8993",neo_eb_bossID:5503} |
| Node[8993]{neo_eb_id:8994,neo_eb_name:"emp_no_8994",neo_eb_bossID:8131} |
| Node[8994]{neo_eb_id:8995,neo_eb_name:"emp_no_8995",neo_eb_bossID:8624} |
What Cypher query can create these self relations, so that every node with a neo_eb_bossID has a relationship to the corresponding node?
In PostgreSQL the data is a table of about 1020 MB. In Neo4j, after import, it is 6.42 GiB according to the console.
To create the relationships based on neo_eb_bossID, you can match the nodes and run a FOREACH loop that creates a relationship from each node to its related node:
MATCH (e:EmpBase) WITH collect(e) AS empbases
FOREACH (emp IN empbases |
  MERGE (target:EmpBase {neo_eb_id: emp.neo_eb_bossID})
  MERGE (emp)-[:YOUR_RELATIONSHIP]->(target)
)
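The effect of the FOREACH/MERGE loop above can be sketched in plain Python, assuming each employee is a dict with the same property names. The sample ids and the `link_to_bosses` helper are made up for illustration. Like MERGE, the sketch creates a placeholder boss node when no node with the requested id exists yet.

```python
def link_to_bosses(employees):
    """Return (links, nodes): one (employee id, boss id) link per employee,
    plus the node lookup including any placeholder bosses that were created."""
    nodes = {emp["neo_eb_id"]: emp for emp in employees}
    links = []
    for emp in employees:
        boss_id = emp["neo_eb_bossID"]
        # Mirror MERGE: reuse the boss node if present, otherwise create it
        boss = nodes.setdefault(boss_id, {"neo_eb_id": boss_id})
        links.append((emp["neo_eb_id"], boss["neo_eb_id"]))
    return links, nodes


# Hypothetical sample data: employees 1 and 2 report to 3; 3 reports to 99 (not imported)
employees = [
    {"neo_eb_id": 1, "neo_eb_name": "emp_no_1", "neo_eb_bossID": 3},
    {"neo_eb_id": 2, "neo_eb_name": "emp_no_2", "neo_eb_bossID": 3},
    {"neo_eb_id": 3, "neo_eb_name": "emp_no_3", "neo_eb_bossID": 99},
]

links, nodes = link_to_bosses(employees)
print(links)
```

The placeholder for id 99 shows why it is worth checking for dangling boss ids before running the MERGE on real data.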
Concerning the self relationship, I have a hard time understanding exactly what you want.
Chris

Get Node ID's in Neo4j using Python

I have recently begun using Neo4j and am struggling to understand how things work. I am trying to create relationships between nodes that I created earlier in my script. The Cypher query that I found looks like it should work, but I don't know how to get the IDs to replace the #'s
START a= node(#), b= node(#)
CREATE UNIQUE a-[r:POSTED]->b
RETURN r
If you want to use plain cypher, the documentation has a lot of usage examples.
When you create nodes you can return them (or just their ids by returning id(a)), like this:
CREATE (a {name:'john doe'}) RETURN a
This way you can keep the id around to add relationships.
If you want to attach relationships later, you should not use the internal IDs of the nodes to reference them from an external system. They can, for example, be reused if you delete and create nodes.
You can either search for a node by scanning over all and filtering using WHERE or add an index to your database, e.g. if you add an auto_index on name:
START n = node:node_auto_index(name='john doe')
and continue from there. Neo4j 2.0 will support index lookup transparently so that MATCH and WHERE should be as efficient.
If you are using python, you can also take a look at py2neo which provides you with a more pythonic interface while using cypher and the REST interface to communicate with the server.
This could be what you are looking for:
START n = node(*) , x = node(*)
Where x<>n
CREATE UNIQUE n-[r:POSTED]->x
RETURN r
It will create POSTED relationship between all the nodes like this
+-----------------------+
| r |
+-----------------------+
| (0)-[10:POSTED]->(1) |
| (0)-[10:POSTED]->(2) |
| (0)-[10:POSTED]->(3) |
| (1)-[10:POSTED]->(0) |
| (1)-[10:POSTED]->(2) |
| (1)-[10:POSTED]->(3) |
| (2)-[10:POSTED]->(0) |
| (2)-[10:POSTED]->(1) |
| (2)-[10:POSTED]->(3) |
| (3)-[10:POSTED]->(0) |
| (3)-[10:POSTED]->(1) |
| (3)-[10:POSTED]->(2) |
And if you don't want a relation between the reference node(0) and the other nodes, you can make the query like this
START n = node(*), x = node(*)
WHERE x<>n AND id(n)<>0 AND id(x)<>0
CREATE UNIQUE n-[r:POSTED]->x
RETURN r
and the result will be like that:
+-----------------------+
| r |
+-----------------------+
| (1)-[10:POSTED]->(2) |
| (1)-[10:POSTED]->(3) |
| (2)-[10:POSTED]->(1) |
| (2)-[10:POSTED]->(3) |
| (3)-[10:POSTED]->(1) |
| (3)-[10:POSTED]->(2) |
On the client side using Javascript I post the cypher query:
start n = node(*) WHERE n.name = '" + a.name + "' return n
and then parse the id number from response "self" in the form of:
server_url:7474/db/data/node/node_id
After hours of trying to figure this out, I finally found what I was looking for. I was struggling with how nodes were getting returned and found that
userId=person[0][0][0].id
would return what I wanted. Thanks for all your help though!
Using py2neo, the approach I've found really useful is the remote function.
from py2neo import Graph, remote

graph = Graph()
# CREATE requires a directed relationship
graph.run('CREATE (a)-[r:POSTED]->(b)')
a = graph.run('MATCH (a)-[r]-(b) RETURN a').evaluate()
a_id = remote(a)._id
b = graph.run('MATCH (a)-[r]-(b) WHERE ID(a) = {num} RETURN b', num=a_id).evaluate()
b_id = remote(b)._id
graph.run('MATCH (a)-[r]-(b) WHERE ID(a)={num1} AND ID(b)={num2} CREATE (a)-[x:UPDATED]->(b)', num1=a_id, num2=b_id)
The remote function takes in a py2neo Node object and has an _id attribute that you can use to return the current ID number from the graph database.
