Return node after aggregation in Cypher - neo4j

I am having a hard time understanding how to properly use aggregate functions in Cypher.
Let's say I have nodes labelled Animal, with properties size and species.
For each species, I want to get the largest.
So far, I understand I can do it with the following :
MATCH (n:Animal)
WITH n.species as species, max(n.size) as size
RETURN species, size
And I will effectively get the largest sizes with corresponding species.
But how can I get the nodes instead of species ?
I can't return n because of the WITH statement, and I can't inject it into the WITH because it will break species aggregation.
I know this question has already been asked a few times, but the different solutions I came across were case-specific and used relationships.
Any advice is very welcome.
EDIT: I finally made it work with :
MATCH (n:Animal)
WITH n.species as species, max(n.size) as size, collect(n) as ns
UNWIND ns as n
WITH n
WHERE n.size = size
RETURN n
Is this the Cypher-way to settle things ? Seems a bit verbose and not efficient (all nodes are fetched here) to me, isn't there a more straightforward option ?

Since the MAX aggregation function does not return the node with the max value, you should not use it. Otherwise, you'd have to test the size of every animal twice to get both the max value and the node of interest (as you discovered).
You can instead use the REDUCE function to test the size of every animal just once:
MATCH (n:Animal)
WITH n.species AS species, COLLECT(n) as ns
RETURN species, REDUCE(s = {size: -1}, a IN ns |
CASE WHEN a.size > s.size THEN {size: a.size, a: a} ELSE s END
) AS result;
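To make the single-pass idea concrete, here is a minimal plain-Python model of what the REDUCE does per species. It is not Cypher and does not touch Neo4j; the animal records and the function name largest_per_species are made up for illustration.

```python
# Plain-Python model of the single-pass REDUCE above (not Cypher).
# The animal records below are hypothetical sample data.
animals = [
    {"name": "a1", "species": "cat", "size": 4},
    {"name": "a2", "species": "cat", "size": 6},
    {"name": "a3", "species": "dog", "size": 30},
    {"name": "a4", "species": "dog", "size": 12},
]


def largest_per_species(rows):
    """Return a dict mapping each species to its largest animal."""
    best = {}  # species -> largest animal seen so far
    for a in rows:  # each animal's size is compared exactly once
        current = best.get(a["species"])
        # Strict '>' keeps the first animal on ties, like the CASE in the REDUCE.
        if current is None or a["size"] > current["size"]:
            best[a["species"]] = a
    return best


result = largest_per_species(animals)
print({species: a["name"] for species, a in result.items()})
```

As with the REDUCE, only one animal per species survives, so ties for the largest size are not reported.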

This is a frequently encountered limitation with our max() and min() aggregation functions, so we added an APOC function that can help: apoc.agg.maxItems():
apoc.agg.maxItems(item, value, groupLimit: -1) - returns a map {items:[], value:n} where value is the maximum value present, and items are all items with the same value. The number of items can be optionally limited.
MATCH (n:Animal)
WITH n.species as species, apoc.agg.maxItems(n, n.size) as sizeData
RETURN species, sizeData.value as size, sizeData.items as animals
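For readers without APOC at hand, the behaviour described above can be sketched in plain Python. This is only a model built from the one-line description (a map with items and value, ties kept, optional limit); the function max_items and its group_limit handling are assumptions, not APOC's actual implementation.

```python
def max_items(pairs, group_limit=-1):
    """Model of the described behaviour: pairs is an iterable of (item, value).

    Returns {"items": [...], "value": v}, where v is the maximum value and
    items are all items that share it. A negative group_limit means no limit
    (assumed from the 'groupLimit: -1' default in the description).
    """
    best_value = None
    items = []
    for item, value in pairs:
        if best_value is None or value > best_value:
            # New maximum: start a fresh list of items.
            best_value, items = value, [item]
        elif value == best_value:
            # Tie with the current maximum: keep it if the limit allows.
            if group_limit < 0 or len(items) < group_limit:
                items.append(item)
    return {"items": items, "value": best_value}


size_data = max_items([("a", 3), ("b", 5), ("c", 5)])
limited = max_items([("a", 3), ("b", 5), ("c", 5)], group_limit=1)
print(size_data, limited)
```

The point of the design is that ties are preserved, which a plain max() cannot express.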

Related

Cypher Query to Collect Arbitrary Depth Nodes and Edge Properties

I have a graph that looks like the image below. However, the depth and the number of rollups from the Person to the topmost Rollup vary depending on how the user has structured the rollups. The edges from the Person to the Metric (HAS_METRIC) hold the score values, and the relationships from the metrics to the Rollup (HAS_PARENT) hold the weighting that should be applied to the value as it is rolled up to a top score.
Ideally, I would like to have a query that produces a table with the rollup and the summed/weighted scores. Like this:
node | value
-------------------
Metric A 23
Metric B 55
Metric C 29
Metric D 78
Rollup A 45.4
Rollup B 58.4
Rollup Tot 51.9
However, I do not understand how to collect the edge properties for the HAS_PARENT relationships.
MATCH (p:Person)-[score:HAS_METRIC]->(m:Metric)-[weight:HAS_PARENT]->(ru:Rollup)
-[par_rel:HAS_PARENT*..8]->(ru_par:Rollup)
WITH p, score, m, weight, par_rel, ru, ru_par
RETURN p.uid, score.score, m.uid, weight.weight, ru.uid, par_rel.weight, ru_par.uid
This query is giving me a type mismatch because it does not know what to do with the par_rel.weight. Any pointers are appreciated.
I believe what you are searching for is the relationships(path) function. It is one of the default path functions in Cypher. It returns all relationships in a given path, and you can combine it with one or more Cypher list expressions to get the values you need from the relationships.
Generally speaking, you could do something like:
MATCH p = (n)-[:HAS_PARENT*..8]->()
RETURN [x IN relationships(p) | x.weight] AS weights
You might also find useful the reduce function. E.g.:
...
RETURN reduce(s = 0, x IN relationships(p) | s + x.weight) AS sumWeight
But you need to be careful with your variable length path queries and probably constrain them in order to get only the paths you are interested in.
A good piece of advice would probably be to mark your leaf and root nodes so that you match only paths from a leaf to a/the root, not intermediate ones. E.g.:
MATCH p = (n)-[:HAS_PARENT*..8]->(root)
WHERE NOT (root)-[:HAS_PARENT]->() AND NOT (n)<-[:HAS_PARENT]-()
...
And of course you can combine these Cypher fragments with others to return everything you need in a single query.
I hope this helps. Let us know when you succeed.
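For intuition, the list comprehension and the reduce over relationships(p) behave like the following plain-Python sketch. The path and its weights are hypothetical, and each dict merely stands in for one HAS_PARENT relationship.

```python
from functools import reduce

# Hypothetical path: each dict stands in for one HAS_PARENT relationship.
path_relationships = [{"weight": 0.5}, {"weight": 0.25}, {"weight": 1.0}]

# Counterpart of: [x IN relationships(p) | x.weight]
weights = [rel["weight"] for rel in path_relationships]

# Counterpart of: reduce(s = 0, x IN relationships(p) | s + x.weight)
sum_weight = reduce(lambda s, rel: s + rel["weight"], path_relationships, 0)

print(weights, sum_weight)
```

Because a variable-length pattern binds a list of relationships rather than a single one, the property has to be read per element, which is what both forms do.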

Count the number of relationship types to add them as a frequency-property to the edges

I'm trying to count the different types of relationships in my neo4j graph to add them as a "frequency" property to the corresponding edges (i.e. I have 4 relationships of type EX, so I would like my edges of type EX to have e.frequency=4).
So far I have played around with this code:
MATCH ()-[e:EX]-()
WITH e, count(e) as amount
SET e.frequency = amount
RETURN e
With this piece of code, the returned e.frequency is 2 for all EX edges. Does anyone here know how to correct this?
It sounds like you want this information for quick access later for any node with that type. If you're planning on deleting or adding edges in your graph, you should realize that your data will get stale quickly, and a graph-wide query to update the property on every edge in the graph just doesn't make sense.
Thankfully Neo4j keeps a transactional count store of various statistics, including the number of relationships per relationship type.
It's easiest to get these via procedure calls, either in Neo4j itself or APOC Procedures.
If you have APOC installed, you can see the map of relationship type counts like this:
CALL apoc.meta.stats() YIELD relTypesCount
RETURN relTypesCount
If you know which type you want the count of, you can use dot notation into the relTypesCount map to get the value in question.
If it's dynamic (either passed in as a parameter, or obtained after matching to a relationship in the query) you can use the map index notation to get the count in question like this:
CALL apoc.meta.stats() YIELD relTypesCount
MATCH ()-[r]->()
WITH relTypesCount, r
LIMIT 5
RETURN type(r) as type, relTypesCount[type(r)] as count
If you don't have APOC, you can make use of db.stats.retrieve('GRAPH COUNTS') YIELD data, but you'll have to do some additional filtering to make sure you get the counts for ALL of the relationships of the given type, and exclude the counts that include the labels of the start or end nodes:
CALL db.stats.retrieve('GRAPH COUNTS') YIELD data
WITH [entry IN data.relationships WHERE NOT exists(entry.startLabel) AND NOT exists(entry.endLabel)] as relCounts
MATCH ()-[r]->()
WITH relCounts, r
LIMIT 5
RETURN type(r) as type, [rel in relCounts WHERE rel.relationshipType = type(r) | rel.count][0] as count
First, here is what your query is doing:
// Match all EX edges (relationships), ignore direction
MATCH ()-[e:EX]-()
// Logical partition: for each edge, count how many times that instance occurred (always 2: once as (a)-[e:EX]->(b) and once in the reverse order, (b)<-[e:EX]-(a))
WITH e, count(e) as amount
// Set the property frequency on the instance of e to amount (2)
SET e.frequency = amount
// return the edges
RETURN e
So to filter out the duplicates (the reverse-direction matches), you need to specify the direction in the MATCH: MATCH ()-[e:EX]->(). For the frequency part, you don't even need a MATCH; you can just count the occurrences of the pattern with WITH SIZE(()-[:EX]->()) as c (SIZE because a pattern expression returns a list, not a row set).
So
WITH SIZE(()-[:EX]->()) as c
MATCH ()-[e:EX]->()
SET e.frequency = c
return e
That said, frequency will be invalidated as soon as an EX edge is created or deleted, so I would just start your Cypher query by asking for the edge count.
Also, in this trivial case, the best way to get the relationship count is with a MATCH followed by COUNT, because this form helps the Cypher planner recognize that it can just fetch the edge count from its internal metadata store.
MATCH ()-[e:EX]->()
WITH COUNT(e) as c
MATCH ()-[e:EX]->()
SET e.frequency = c
return e
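The difference between the undirected and directed versions can be modelled in a few lines of plain Python. This is a sketch under simplifying assumptions (a hypothetical graph of four EX relationships, no self-loops), not a description of how Neo4j executes the query.

```python
from collections import Counter

# Hypothetical graph: four EX relationships as (id, start, end), no self-loops.
edges = [(1, "a", "b"), (2, "b", "c"), (3, "c", "d"), (4, "d", "a")]

# Undirected MATCH ()-[e:EX]-(): each relationship yields one row per orientation.
undirected_rows = [eid for eid, s, t in edges for _ in ((s, t), (t, s))]
# WITH e, count(e): grouping by e itself counts the rows per relationship.
per_edge = Counter(undirected_rows)

# Directed MATCH ()-[e:EX]->() with WITH count(e): one row per relationship,
# and no grouping key, so all rows fall into a single group.
directed_rows = [eid for eid, s, t in edges]
total = len(directed_rows)
frequency = {eid: total for eid in directed_rows}

print(dict(per_edge))  # every relationship counted twice
print(frequency)       # every relationship gets the overall count
```

The key point is the grouping key: keeping e in the WITH alongside count(e) makes each relationship its own group, so the count can never reflect the total.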

Cypher query (or APOC procedure) that samples all labels and returns graph representing x% of nodes

I am working with a graph that has many object types (e.g. LABELS).
I would like to be able to run a query that samples every label, and returns a small but representative set of data containing nodes (and relationships) for each label. Has anyone seen or achieved this?
Thanks, John
This returns for each label, five nodes associated with this label :
call db.labels() yield label
call apoc.cypher.run("match (x:`"+label+"`) RETURN x LIMIT 5", null) yield value
return label, collect(value.x) AS nodes
Without knowing your model, you can display your complete label structure as a graph with the Cypher statement CALL apoc.meta.graph();.
For a representative set of data for each label, we would need to know your underlying model, or at least its labels. I could imagine a solution based on the LIMIT clause:
MATCH (n)
OPTIONAL MATCH (n)-[r]-()
WITH n, r
LIMIT 5000
RETURN n, r;

Find a set of (n) nodes where relationship weight between each pair of node is greater than a value(w)

I have a database where each node is connected to all other nodes by a relationship, and each relationship has a weight. Given a weight w and a number of nodes n, I need a query that returns every set of n nodes in which the relationship between each pair of nodes has a weight greater than w.
Any help on this would be great
It depends on what you would like your result set to look like. Something as simple as this query would return all paths that fall under your criteria:
MATCH p=()-[r:my_rel]->() WHERE r.weight > w RETURN p;
This would return all such paths.
If you would like the two nodes only (and not the entire pattern's results), you can return only those two nodes:
MATCH (n1)-[r:my_rel]->(n2) WHERE r.weight > w RETURN n1,n2;
Do note that due to Neo4j's storage internals, searches based on the properties of a relationship tend not to perform as well as those based on the properties of a node.

Query optimization for matching nodes with equal values

I want to collect all nodes that have the same property value
MATCH (rhs:DataValue),(lhs:DataValue) WHERE rhs.value = lhs.value RETURN rhs,lhs
I have created an index on the property
CREATE INDEX ON :DataValue(value)
the index is created:
Indexes
ON :DataValue(value) ONLINE
I have only 2570 DataValues.
match (n:DataValue) return count(n)
> 2570
However, the query takes ages/does not terminate within the timeout of my browser.
This surprises me as I have an index and expected the query to run within O(n) with n being the amount of nodes.
My train of thought is: if I implemented it myself, I could just match all nodes in O(n), sort them by value in O(n log n), and then go through the sorted list and return all sublists longer than 1 in O(n). Thus, the time I could achieve is O(n log n). However, I expected the sorting to be covered by the index already.
How am I mistaken and how can I optimize this query?
Your complexity is actually O(n^2), since your match creates a cartesian product for rhs and lhs, and then does filtering for every single pairing to see if they are equal. The index doesn't apply in your query at all. You should be able to confirm that by running EXPLAIN or PROFILE on the query.
You'll want to tweak your query a little to get it to O(n). Hopefully in a future neo4j version query planning will be smarter so we don't have to be so explicit.
MATCH (lhs:DataValue)
WITH lhs
MATCH (rhs:DataValue)
WHERE rhs.value = lhs.value
RETURN rhs,lhs
Note that your returned values will include opposite pairs ((A, B), (B, A)), as well as every node paired with itself, since rhs and lhs can match the same node.
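The complexity argument can be illustrated with a small plain-Python model. The node names and values are hypothetical, and the dictionary lookup only stands in for an index seek; how the Neo4j planner actually executes either query is not modelled here.

```python
from collections import defaultdict

# Hypothetical DataValue nodes: name -> value property.
values = {"n1": 7, "n2": 3, "n3": 7, "n4": 5}


def pairs_cartesian(nodes):
    """Compare every node with every other node: n * n comparisons."""
    return [(l, r) for l in nodes for r in nodes if nodes[l] == nodes[r]]


def pairs_grouped(nodes):
    """Look up matching nodes by value, as an index seek would."""
    by_value = defaultdict(list)  # stand-in for the index on value
    for name, v in nodes.items():
        by_value[v].append(name)
    # One lookup per node instead of one comparison per pair.
    return [(l, r) for l in nodes for r in by_value[nodes[l]]]


cartesian = pairs_cartesian(values)
grouped = pairs_grouped(values)
print(cartesian == grouped, len(grouped))
```

Both functions return the same pairs; only the amount of work differs. The output itself can still grow quadratically when many nodes share one value, which no query plan can avoid.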
