Neo4j Cypher Aggregating Value Counts - neo4j

I am returning data that looks like this:
"Jonathan" | "Chicago" | 6 | ["Hot","Warm","Cold","Cold","Cold","Warm"]
Where the third column is a count of the values in column 4.
I want to extract values out of the collection in column 4 and create new columns based on the values. My expected output would be:
Hot | Cold | Warm with the values 1 | 3 | 2 representing the counts of each value.
My current query is:
MATCH (p)-[]->(c)-[]->(w)
RETURN DISTINCT p.name, c.name, count(w), collect(w.weather)
I'd imagine this is simple, but I can't figure it out for the life of me.

Cypher does not have a way to "pivot" data (as discussed here). That is partly because it does not support dynamically generating the names of return values (e.g., "Cold"), and it is these names that appear as "column" headers in the Text and Table visualizations provided by the Neo4j Browser.
However, if you know that you only have, say, 3 possible "weather" names, you can use a query like this, which hardcodes those names in the RETURN clause:
MATCH (c:City)-[:HAS_WEATHER]->(w:Weather)
WITH c, {weather: w.weather, count: COUNT(*)} AS weatherCount
WITH c, REDUCE(s = {Cold: 0, Warm: 0, Hot: 0}, x IN COLLECT(weatherCount) | apoc.map.setKey(s, x.weather, x.count)) AS counts
MATCH (p:Person)-[:LIVES_IN]->(c)
RETURN p.name AS pName, c.name AS cName, counts.Cold AS Cold, counts.Warm AS Warm, counts.Hot AS Hot
The above query efficiently gets the weather data for a city once (for all people in that city), instead of once per person.
The APOC function apoc.map.setKey is a convenient way to get a map with an updated key value.
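The fixed-key map built by the REDUCE/apoc.map.setKey combination can be sketched in plain Python, using the collected list from the example row (illustrative only, not the author's code):

```python
# Pivot a list of weather readings into hardcoded-key counts,
# mirroring the REDUCE(...) + apoc.map.setKey pattern above.
def pivot_counts(values, keys=("Cold", "Warm", "Hot")):
    counts = {k: 0 for k in keys}  # the hardcoded "columns", like the RETURN clause
    for v in values:
        if v in counts:
            counts[v] += 1
    return counts

row = ["Hot", "Warm", "Cold", "Cold", "Cold", "Warm"]
print(pivot_counts(row))  # {'Cold': 3, 'Warm': 2, 'Hot': 1}
```

As in the Cypher version, any value not in the hardcoded key set is silently dropped.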

Related

A method to sum all values in a returned column using Cypher in Neo4j

I have written the following Cypher query to get the frequency of a certain item from a set of orders.
MATCH (t:Trans)-[r:CONTAINS]->(i:Item)
WITH i,COUNT(*) AS CNT,size(collect(t)) as NumTransactions
RETURN i.ITEM_ID as item, NumTransactions, NumTransactions/CNT as support
I get a table like this as my output
Item | NumTransactions | Support
A    | 2               | 1
B    | 1132            | 1
C    | 2049            | 1
And so on. What I mean to do is divide each NumTransactions value by the total number of transactions (i.e., the sum of the entire NumTransactions column) to get the support, but instead it divides NumTransactions by itself. Can someone point me to the correct function, if it exists, or an approach for doing so?
This should work:
MATCH (:Trans)-[:CONTAINS]->(i:Item)
WITH i, COUNT(*) as c
WITH COLLECT({i: i, c: c}) AS data
WITH data, REDUCE(s = 0.0, n IN data | s + n.c) AS total
UNWIND data AS d
RETURN d.i.ITEM_ID as item, d.c AS NumTransactions, d.c/total as support
By the way, SIZE(COLLECT(t)) is inefficient, as it first builds a new collection of t nodes, gets its size, and then discards the collection. COUNT(t) would have been more efficient.
Also, given your MATCH clause, as long as every t has at most a single CONTAINS relationship to a given i, COUNT(*) (which counts the number of result rows) would give you the same result as COUNT(t).
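The collect-then-total shape of the answer (COLLECT the per-item counts, REDUCE them into a total, then UNWIND and divide) can be sketched in Python with the sample counts from the question:

```python
# Compute support = per-item count / total count, mirroring the
# COLLECT + REDUCE(s = 0.0, ...) + UNWIND pattern above.
def with_support(item_counts):
    total = float(sum(item_counts.values()))  # REDUCE(s = 0.0, n IN data | s + n.c)
    return {item: c / total for item, c in item_counts.items()}

counts = {"A": 2, "B": 1132, "C": 2049}
support = with_support(counts)
```

Starting the accumulator at 0.0 matters in the Cypher version too: it forces floating-point division, where an integer total would truncate d.c/total to 0.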

Efficiently getting relationship histogram for a set of nodes

Background
I want to create a histogram of the relationships starting from a set of nodes.
Input is a set of node ids, for example set = [ id_0, id_1, id_2, id_3, ... id_n ].
The output is the relationship-type histogram for each node (e.g., Map<Long, Map<String, Long>>):
id_0:
- ACTED_IN: 14
- DIRECTED: 1
id_1:
- DIRECTED: 12
- WROTE: 5
- ACTED_IN: 2
id_2:
...
The current cypher query I've written is:
MATCH (n)-[r]-()
WHERE id(n) IN [ id_0, id_1, id_2, id_3, ... id_n ] // the input set
RETURN id(n) as id, type(r) as type, count(r) as count
It returns the pair of [ id, type ] count like:
id | rel type | count
id0 | ACTED_IN | 14
id0 | DIRECTED | 1
id1 | DIRECTED | 12
id1 | WROTE | 5
id1 | ACTED_IN | 2
...
The result is collected using java and merged to the first structure (e.g. Map<Long, Map<String, Long>>).
Problem
Getting the relationship histogram is fast on smaller graphs but can be very slow on bigger datasets. For example, for a set of about 100 ids/nodes where each node has around 1000 relationships, the Cypher query took about 5 minutes to execute.
Is there more efficient way to collect the histogram for a set of nodes?
Could this query be parallelized? (With java code or using UNION?)
Is something wrong with how I set up my neo4j database, should these queries be this slow?
There is no need for parallel queries, just the need to understand Cypher efficiency and how to use statistics.
Bit of background :
Using count executes an ExpandAll, which is as expensive as the number of relationships the node has:
PROFILE
MATCH (n) WHERE id(n) = 21
MATCH (n)-[r]-(x)
RETURN n, type(r), count(*)
Using size with a relationship type internally uses getDegree, a statistic stored locally on each node, and is therefore very efficient:
PROFILE
MATCH (n) WHERE id(n) = 0
RETURN n, size((n)-[:SEARCH_RESULT]-())
Moral of the story: to use size you need to know the relationship types a labeled node can have. So you need to know the schema of the database (in general you will want that; it makes things easily predictable, and building dynamically efficient queries becomes a joy).
But let's assume you don't know the schema, you can use APOC cypher procedures, allowing you to build dynamic queries.
The flow is :
Get all the relationship types from the database ( fast )
Get the nodes from id list ( fast )
Build dynamic queries using size ( fast )
CALL db.relationshipTypes() YIELD relationshipType
WITH collect(relationshipType) AS types
MATCH (n) WHERE id(n) IN [21, 0]
UNWIND types AS type
CALL apoc.cypher.run("RETURN size((n)-[:`" + type + "`]-()) AS count", {n: n})
YIELD value
RETURN id(n), type, value.count
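Either query returns flat (id, type, count) rows that the client merges into the nested Map<Long, Map<String, Long>>; a minimal Python sketch of that merge step, using sample rows from the example output:

```python
from collections import defaultdict

# Merge flat (node_id, rel_type, count) result rows into the nested
# histogram structure described in the question.
def merge_histogram(rows):
    hist = defaultdict(dict)
    for node_id, rel_type, count in rows:
        hist[node_id][rel_type] = count
    return dict(hist)

rows = [(0, "ACTED_IN", 14), (0, "DIRECTED", 1), (1, "DIRECTED", 12)]
print(merge_histogram(rows))  # {0: {'ACTED_IN': 14, 'DIRECTED': 1}, 1: {'DIRECTED': 12}}
```

The question does this in Java; the structure and cost of the merge are the same either way, so the expensive part remains the query itself.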

Update nodes by a list of ids and values in one cypher query

I've got a list of ids and a list of values. I want to match each node by its id and set a property from the corresponding value.
With just one Node that is super basic:
MATCH (n) WHERE n.id='node1' SET n.name='value1'
But I have a list of ids ['node1', 'node2', 'node3'] and the same number of values ['value1', 'value2', 'value3'] (for simplicity I used a pattern here, but the values and ids vary a lot). My first approach was to use the query above and call the database once per node. But that isn't appropriate, since I have thousands of ids, which would result in thousands of requests.
I came up with this approach that I iterate over each entry in both lists and set the values. The first node from the node list has to get the first value from the value list and so on.
MATCH (n) WHERE n.id IN["node1", "node2"]
WITH n, COLLECT(n) as nodeList, COLLECT(["value1","value2"]) as valueList
UNWIND nodeList as nodes
UNWIND valueList as values
FOREACH (index IN RANGE(0, size(nodeList)) | SET nodes.name=values[index])
RETURN nodes, values
The problem with this query is that every node gets the same value (the last in the value list). The reason is the last part, SET nodes.name=values[index]: I can't use the index on the left side (nodes[index].name doesn't work, and the database throws an error if I try). I tried it with nodeList, node, and n; nothing worked out. I'm not sure this is the right way to achieve the goal; maybe there is a more elegant way.
Create pairs from the ids and values first, then use UNWIND and simple MATCH .. SET query:
// The first line will likely come from parameters instead
WITH ['node1', 'node2', 'node3'] AS ids, ['value1', 'value2', 'value3'] AS values
WITH [i IN range(0, size(ids) - 1) | {id: ids[i], value: values[i]}] AS pairs
UNWIND pairs AS pair
MATCH (n:Node) WHERE n.id = pair.id
SET n.value = pair.value
The line
WITH [i IN range(0, size(ids) - 1) | {id: ids[i], value: values[i]}] AS pairs
combines two concepts: list comprehensions and maps. Using the list comprehension (with the WHERE clause omitted), it converts a list of indexes into a list of maps with id and value keys. Note that Cypher's range is inclusive of its upper bound, so size(ids) - 1 is needed to avoid a trailing pair of nulls.
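The pairing step works the same way in any language; here is the equivalent in Python (note that Python's range excludes its upper bound, unlike Cypher's):

```python
# Turn two parallel lists into a list of {id, value} maps,
# mirroring the Cypher list comprehension above.
ids = ["node1", "node2", "node3"]
values = ["value1", "value2", "value3"]
pairs = [{"id": ids[i], "value": values[i]} for i in range(len(ids))]
print(pairs)
```

Each pair then carries everything one UNWIND row needs to MATCH the node and SET its property.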

Neo4j Excluding nodes from result

I am quite new to Neo4j and Graph.
I have a simple Graph database with just one relationship: table -[contains]-> column
I'm trying to get a list of tables and columns that contain a specific term in their name.
I want to show the result as a list of tables with a count of columns for each table. If the table name does not contain the term but one of its columns does, the table should still be in the list. Also, the count should only include columns that contain the term.
Here is an example:
Table: "Chicago", Columns: "Chi_Address", "Chi_Weather", "Latitude"
Table: "Miami" , Columns: "Mia_to_Chi", "Mia_Weather"
Table: "Dallas" , Columns: "Dal_to_Mia", "Dal_Weather"
If I search for the term "chi", the desired result would be:
Table -- Col Count
Chicago -- 2
Miami -- 1
This is my current query:
MATCH (t:table)-[r:contains]->(c:column)
where toLower(t.name) contains toLower('CHI') or toLower(c.name) contains toLower('CHI')
return t.name as Table_Name,count(c.name) as Column_Count
My problem is that if a table name contains the term, then I get a count of all its columns, not just the ones with the term. So in my example I would get:
Chicago -- 3 //Instead of 2
Miami -- 1
I was thinking of doing something like:
count(c.name WHERE c.name CONTAINS 'CHI')
But that doesn't seem to be a valid syntax.
Any help would be appreciated.
PS: Happy to take any advice on how to improve my current query. For example, I'm sure that having the search term twice is something that I should improve.
Since your approach isn't going to use an index lookup anyway, we might as well change the approach here.
We can start off by matching to all :table nodes, then OPTIONAL MATCH to all :column nodes with your CONTAINS predicate. When we count() the matches this will only include the column count where the CONTAINS check is true (in some cases it will be 0). So that gets our Column_Count correct.
Next we'll filter the results, only keeping rows where we found a positive Column_Count or where the CONTAINS check is true for the :table node.
MATCH (t:table)
OPTIONAL MATCH (t)-[:contains]->(c:column)
WHERE toLower(c.name) contains toLower('CHI')
WITH t, count(c) as Column_Count
WHERE Column_Count <> 0 OR toLower(t.name) contains toLower('CHI')
RETURN t.name as Table_Name, Column_Count
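The same keep-the-table-but-count-only-matching-columns logic, sketched on in-memory data from the example (a Python analogue, not a substitute for the query):

```python
# For each table, count only the columns whose name contains the term,
# then keep the table if that count is positive OR the table name matches,
# mirroring the OPTIONAL MATCH + count() + WHERE filter above.
def matching_tables(tables, term):
    term = term.lower()
    result = {}
    for table, columns in tables.items():
        col_count = sum(1 for c in columns if term in c.lower())
        if col_count > 0 or term in table.lower():
            result[table] = col_count
    return result

tables = {
    "Chicago": ["Chi_Address", "Chi_Weather", "Latitude"],
    "Miami": ["Mia_to_Chi", "Mia_Weather"],
    "Dallas": ["Dal_to_Mia", "Dal_Weather"],
}
print(matching_tables(tables, "chi"))  # {'Chicago': 2, 'Miami': 1}
```

Dallas drops out because neither its name nor any column matches, while Chicago stays in with a count of 2 rather than 3.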

Reduce function appears to be off by 1 when counting properties in Cypher

I have a csv file here, which once read into Neo4J using the commands at the bottom of this post, creates a set of 4 family trees.
Over all families, I would like to return the family_id for a family having 2 or more people where is_ill = '1'.
When I look a priori, I can easily see what my expected results will be, just with some quick filtering in excel to show only nodes where is_ill = '1'.
You can see that there are 8, 3, and 3 appearances of is_ill = '1' for fam1, fam2, and fam4. So, if I write my query correctly, I'd expect to get 3 family_id's back.
So, I'm all ready to practice some Cypher. Here's what I've got:
MATCH p=(:Person)-[]->(:Person)
WHERE 2 <= REDUCE(s = 0, x IN NODES(p) | CASE WHEN x.is_ill = '1' THEN s + 1 ELSE s END)
WITH p
RETURN distinct([x IN nodes(p) | x.family_id][0]) as Family_ID
;
This looks great, except that I realize it is returning a ton of subgraphs, and I have to use this hacky distinct() and index [0]. That makes me nervous, but I can never tell when a full graph traversal is or isn't a good idea, and when Cypher is going to return a slew of subgraphs.
But I digress. The crux of my problem is that for some reason this query returns only
| Family_ID |
|-----------|
| fam2 |
How? I clearly asked it to count up how many times the full path has nodes with that property. The key is 2 <= .... That's exactly what I want!! And, if Cypher is returning fam2...why not also fam3? They have the same number of occurrences of my property!
I can monkey around and get the results I want, if I change it to 1 <= .... Why does this work? The counting is off by one??
Should s = 0 be indexed differently for some reason? That doesn't seem likely. Perhaps my initial MATCH p=(:Person)-[]->(:Person) is undershooting the target, and I'm missing nodes? I don't think so either. There must be some problem with the way REDUCE is being applied.
//LOAD UP THE CSV
LOAD CSV WITH HEADERS FROM
'file:///...neo_test2_00.csv' AS line
CREATE (
person:Person {
date_of_birth: line.`date_of_birth`
,does_forget: line.`does_forget`
,family_id: line.`family_id`
,father_person_id: line.`father_person_id`
,first_name: line.`first_name`
,gender: line.`gender`
,is_ill: line.`is_ill`
,is_proband: line.`is_proband`
,last_name: line.`last_name`
,mother_person_id: line.`mother_person_id`
,pidn: line.`pidn`
,subject_person_id: line.`subject_person_id`
}
)
// create parent relationship for mother
MATCH (m:Person),(s:Person)
WHERE
(
m.family_id = s.family_id
AND
m.subject_person_id = s.mother_person_id
)
CREATE (m)-[:PARENT_OF]->(s)
;
// create parent relationship for father
MATCH (f:Person),(s:Person)
WHERE
(
f.family_id = s.family_id
AND
f.subject_person_id = s.father_person_id
)
CREATE (f)-[:PARENT_OF]->(s)
;
Why not use aggregation?
MATCH (p:Person {is_ill: '1'})
WITH p.family_id AS familyId, COUNT(p) AS illCount
WHERE illCount >= 2
RETURN familyId, illCount
P.S. Your MATCH pattern contains only two nodes, so each path holds at most two ill people; requiring 2 <= therefore demands that both endpoints be ill, which is why the count looks off by one. Look at this query:
MATCH p=(:Person {is_ill:'1'})-[]->(:Person {is_ill:'1'})
RETURN p
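The group-and-count shape of the aggregation answer can be sketched in Python with fabricated rows matching the counts the asker reports (8, 3, and 3 ill people for fam1, fam2, and fam4):

```python
from collections import Counter

# Count ill people per family, then keep families with >= threshold,
# mirroring MATCH ... WITH family_id, COUNT(p) WHERE illCount >= 2.
def ill_families(people, threshold=2):
    counts = Counter(p["family_id"] for p in people if p["is_ill"] == "1")
    return {fam: c for fam, c in counts.items() if c >= threshold}

people = (
    [{"family_id": "fam1", "is_ill": "1"}] * 8
    + [{"family_id": "fam2", "is_ill": "1"}] * 3
    + [{"family_id": "fam3", "is_ill": "0"}] * 2
    + [{"family_id": "fam4", "is_ill": "1"}] * 3
)
print(ill_families(people))  # {'fam1': 8, 'fam2': 3, 'fam4': 3}
```

Counting nodes directly, rather than paths through them, is what avoids the two-node-pattern trap described above.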
