number of connected nodes to specific nodes in a path

number of connected nodes to specific nodes in a path - neo4j

I have a cypher query (below).
It works but I was wondering if there's a more elegant way to write this.
Based on a given starting node, the query tries to:
Find the following pattern/motif: (inputko)-->(:cpd)-->(ko2:ko)-->(:cpd)-->(ko3:ko).
Foreach the motifs/patterns found, find connected nodes with labels contigs, for the following nodes in the pattern: [inputko, ko2, ko3].
A summary of the 3 nodes and their connected contigs, ie. the name property .ko of the 3 nodes and the number of connected :contig nodes in each of the (inputko)-->(:cpd)-->(ko2:ko)-->(:cpd)-->(ko3:ko) motifs that were found.
+--------------------------------------------------------------------------+
| KO1 | KO1count | KO2 | KO2count | KO3 | KO3count |
+--------------------------------------------------------------------------+
| "ko:K00001" | 102 | "ko:K14029" | 512 | "ko:K03736" | 15 |
| "ko:K00001" | 102 | "ko:K00128" | 792 | "ko:K12972" | 7 |
| "ko:K00001" | 102 | "ko:K00128" | 396 | "ko:K01624" | 265 |
| "ko:K00001" | 102 | "ko:K03735" | 448 | "ko:K00138" | 33 |
| "ko:K00001" | 102 | "ko:K14029" | 512 | "ko:K15228" | 24 |
+--------------------------------------------------------------------------+
I'm puzzled for the syntax to operate on each match.
From the documentation the foreach clause doesn't seem to be what I need.
Any ideas guys?
The FOREACH clause is used to update data within a collection, whether
components of a path, or result of aggregation.
Collections and paths are key concepts in Cypher. To use them for
updating data, you can use the FOREACH construct. It allows you to do
updating commands on elements in a collection — a path, or a
collection created by aggregation.
START
inputko=node:koid('ko:\"ko:K00001\"')
MATCH
(inputko)--(c1:contigs)
WITH
count(c1) as KO1count, inputko
MATCH
(inputko)-->(:cpd)-->(ko2:ko)-->(:cpd)-->(ko3:ko)
WITH
inputko.ko as KO1,
KO1count,
ko2,
ko3
MATCH
(ko2)--(c2:contigs)
WITH
KO1,
KO1count,
ko2.ko as KO2,
count(c2) as KO2count,
ko3
MATCH
(ko3)--(c3:contigs)
RETURN
KO1,
KO1count,
KO2,
KO2count,
ko3.ko AS KO3,
count(c3) AS KO3count
LIMIT
5;
realised that i have to place distinct for in count(distinct cX) to get a accurate count. Do not know why.

I am not sure how elegant this is but I think it does give you some notion about how you could extend your query for n ko nodes in a path and still return the data as you have laid it out below. It should also demonstrate the power of combining the with directive and collections.
// match the ko/cpd node paths starting with K00001
match p=(ko1:ko {name:'K00001' } )-->(:cpd)-->(ko2:ko)-->(:cpd)-->(ko3:ko)
// remove the cpd nodes from each path and name the collection row
with collect([n in nodes(p) where labels(n)[0] = 'ko' | n]) as row
// create a range for the number of rows and number of ko nodes per row
with row
, range(0, length(row)-1, 1) as idx
, range(0, 2, 1) as idx2
// iterate over each row and node in the order it was collected
unwind idx as i
unwind idx2 as j
with i, j, row[i][j] as ko_n
// find all of the contigs nodes atttached to each ko node
match ko_n--(:contigs)
// group the ko node data together in a collection preserving the order and the count
with i, [j, ko_n.name, count(*)] as ko_set
order by i, ko_set[0]
// re-collect the ko node sets as ko rows
with i, collect(ko_set) as ko_row
order by i
//return the original paths in the ko node order with the counts
return reduce( ko_str = "", ko in ko_row |
case
when ko_str = "" then ko_str + ko[1] + ", " + ko[2]
else ko_str + ", " + ko[1] + ", " + ko[2]
end) as `KO-Contigs Counts`
The foreach directive in cypher is strictly for mutating data. For instance , you could use one query to collect the contigs counts per ko node.
This is a bit convoluted and you would never update the number of contigs on a ko node like this but it illustrates the use of foreach in cypher.
match (ko:ko)-->(:contigs)
with ko,count(*) as ct
with collect(ko) as ko_nodes, collect(ct) as ko_counts
with ko_nodes, ko_counts, range(0,length(ko_nodes)-1, 1) as idx
foreach ( i in idx |
set (ko_nodes[i]).num_contigs = ko_counts[i] )
A simpler way to perform the above update task on each ko node would be to do something like this...
match (ko:ko)-->(:contigs)
with ko, count(*) as ct
set ko.num_contigs = ct
If you were to carry teh number of contigs on each ko node then you could perform a query like this to return the number of
// match all the paths starting with K00001
match p=(ko1:ko {name:'K00001' } )-->(:cpd)-->(ko2:ko)-->(:cpd)-->(ko3:ko)
// build a csv line per path
return reduce( ko_str = "", ko in nodes(p) | ko_str +
// using just the ko nodes in the path
// exclude the cpd nodes
case
when labels(ko)[0] = "ko" then ko.name + ", " + toString(ko.num_contigs) + ", "
else ""
end
) as `KO-Contigs Counts`

Related

Neo4j: Cypher query to parallelize a row of the result from a previous query

I have a database where sentences are related to each other. I have to perform a big update on the whole database, thus I'm trying to parallelize the update.
The relevant cypher query looks like this:
match (s:Sentence)-[r:RELATED]-(t:Sentence)
return s as sentence, collect(t.embedding) as neighbours_embeddings
embedding is a list of numbers.
This returns a result like this:
---------------------------------------
| sentence | neighbours_embeddings |
---------------------------------------
| sentence1 | [[1, 2, 3], [4, 5, 6]] |
---------------------------------------
| sentence2 | [[2, 3, 5]] |
---------------------------------------
Now I wanna perform some operations on the neighbours_embeddings and set a property in the corresponding Sentence node.
I've looked at different parallelization techniques in Neo4j and as far as I understood, all of them need a list as input. But my input would be a tuple like (sentence, neighbours_embeddings). How do I achieve this?
Full query for interested folks:
match (s:Sentence)-[r:RELATED]-(t:Sentence)
with s as sentence, collect(t.embedding) as neighbours
with sentence, [
w in reduce(s=[], neighbour IN neighbours |
case when size(s) = 0 then
neighbour else [
i in range(0, size(s)-1) |
s[i] + neighbour[i]] end) |
w / tofloat(size(neighbours))
] as average
with sentence, [
i in range(0, size(sentence.embedding)-1) |
(0.8 * sentence.embedding[i]) + (0.2 *average[i])
] as unnormalized
with sentence, unnormalized, sqrt(reduce(sum = 0.0, element in unnormalized | sum + element^2)) as divideby
set sentence.normalized = [
i in range(0, size(unnormalized)-1) | (unnormalized[i] / divideby)
]

For paralelizing, apoc is your friend, specifically the apoc.periodic.iterate procedure. In your usecase you can parallelize as you are only updating a property of a single node in each row.
The resulted query would look something like:
CALL apoc.periodic.iterate("
match (s:Sentence) RETURN s",
"
match (s)-[r:RELATED]-(t:Sentence)
with s as sentence, collect(t.embedding) as neighbours
with sentence, [
w in reduce(s=[], neighbour IN neighbours |
case when size(s) = 0 then
neighbour else [
i in range(0, size(s)-1) |
s[i] + neighbour[i]] end) |
w / tofloat(size(neighbours))
] as average
with sentence, [
i in range(0, size(sentence.embedding)-1) |
(0.8 * sentence.embedding[i]) + (0.2 *average[i])
] as unnormalized
with sentence, unnormalized, sqrt(reduce(sum = 0.0, element in unnormalized | sum + element^2)) as divideby
set sentence.normalized = [
i in range(0, size(unnormalized)-1) | (unnormalized[i] / divideby)
]", {batchSize:1000, parallel:true})
You can play around with the batchSize parameter. For more information look at the docs.

Aggregating over Distinct Nodes

The data model I have is: (Item)-[:HAS_AN]->(ItemType) and both node types have a field called ID. Items can be related to multiple ItemTypes and in some cases, these ItemTypes can have the same IDs. I'm trying to populate a structure {ID:..., SameType:...}, where SameType = 1 if a node has the same item type as some node (with ID = 1234), and 0 otherwise.
First, I get the candidate list of nodes qList and the source node's ItemType:
MATCH (p:Item{ID:1234})-[:HAS_AN]->(i)
WITH i as pItemType, qList
Then, I go through qList, comparing each node's ItemType to pItemType (which is the ItemType of the source node):
UNWIND qList as q
MATCH (q)-[:HAS_AN]->(i)
WITH q.ID as qID, pItemType, i,
CASE i
WHEN pItemType THEN 1
ELSE 0
END as SameType
RETURN DISTINCT i, qID, pItemType, SameType
The problem I have is when some nodes have two ItemTypes with the same ID. This gives results where some of the entries are duplicates:
{ | | { |
"ID": 18 | 35258417 | "ID": 71 | 0
} | | } |
{ | | { |
"ID": 18 | 35258417 | "ID": 71 | 0
} | | } |
while I'd like to only take one such row, if more than one exists.
Placing DISTINCT where I have in the last part of the query doesn't seem to work. What's the best way to filter out such duplicates?
Update:
Here is a sample data subset: http://console.neo4j.org/r/f74pdq
Here are the queries that I'm running
MATCH (q:Item) WHERE q.ID <> 1234 WITH COLLECT(DISTINCT(q)) as qList
MATCH (p:Item{ID:1234})-[:HAS_AN]->(i:ItemType) WITH i as pItemType, qList
UNWIND qList as q
MATCH (q)-[:HAS_AN]->(i:ItemType) WITH q.ID as qID, pItemType, i,
CASE i
WHEN pItemType THEN 1
ELSE 0
END as SameType
RETURN DISTINCT i, qID, pItemType, SameType
In this example, Item with ID = 2 has two HAS_AN relations with 2 ItemType nodes with the same ID. I would like only one of them to be returned.

I've tried to simplify your query. Take a look:
MATCH (:Item {ID : 1234})-[:HAS_AN]->(target:ItemType)
MATCH (item:Item)-[:HAS_AN]->(itemType:ItemType)
WHERE item.ID <> 1234
WITH
itemType.ID AS i,
item.ID AS qID,
collect({
pItemType : target,
SameType : CASE exists((item)-[:HAS_AN]-(target))
WHEN true THEN 1 ELSE 0 END
})[0] as Item
RETURN i, qID, Item.pItemType AS pItemType, Item.SameType AS SameType
The trick is in the two lines after WITH. At this point I'm grouping by itemType.ID and item.ID and not ( and not itemType and item). In your original query you are using pItemType to group. This does not work because the two ItemTypes with ID = 34 are different nodes although they have the same ID.
The output from your console:
+-------------------------------------+
| i | qID | pItemType | SameType |
+-------------------------------------+
| 31 | 4 | Node[2]{ID:5} | 0 |
| 5 | 3 | Node[2]{ID:5} | 1 |
| 31 | 5 | Node[2]{ID:5} | 0 |
| 45 | 5 | Node[2]{ID:5} | 0 |
| 5 | 1 | Node[2]{ID:5} | 1 |
| 34 | 2 | Node[2]{ID:5} | 0 |
+-------------------------------------+
6 rows
33 ms

Thanks to #Bruno's solution, I was able to get the right answers. However, the original solution did not work right off the bat for me for two reasons - I needed the qList since I was referring to it later, and it had approximately 4 times the DB hits as the query in my question. So, I tried a few optimizations that brought the number of DB hits down to half, and am sharing it here for posterity.
MATCH (q:Item) WHERE q.ID <> 1234 WITH COLLECT(DISTINCT(q)) as qList
MATCH (p:Item{ID:1234})-[:HAS_AN]->(i:ItemType) WITH i as pItemType, qList
UNWIND qList as item
MATCH (item)-[:HAS_AN]->(i)
WITH
i, pItemType,
item.ID AS qID,
collect({
pItemType : pItemType,
SameType : CASE i.ID
WHEN pItemType.ID THEN 1 ELSE 0 END
})[0] as Item
RETURN i, qID, Item.pItemType AS pItemType, Item.SameType AS SameType
Turns out running MATCH (item:Item)-[:HAS_AN]->(itemType:ItemType) was adding a Filter operation that took almost as many DB hits as it had matches.

Make a path where next node is not the previous node?

I have ~1.5 M nodes in a graph, that are structured like this (picture)
I run a Cypher query that performs calculations on each relationship traversed:
WITH 1 AS startVal
MATCH x = (c:Currency)-[r:Arb*2]->(m)
WITH x, REDUCE(s = startVal, e IN r | s * e.rate) AS endVal, startVal
RETURN EXTRACT(n IN NODES(x) | n) as Exchanges,
extract ( e IN relationships(x) | startVal * e.rate) AS Rel,
endVal, endVal - startVal AS Profit
ORDER BY Profit DESC LIMIT 5
The problem is it returns the path ("One")->("hop")->("One"), which is useless for me.
How can I make it not choose the previously walked node as the next node (i.e. "One"->"hop"->"any_other_node_but_not_"one")?
I have read that NODE_RECENT should address my issue. However, there was no example on how to specify the length of recent nodes in RestAPI or APOC procedures.
Is there a Cypher query for my case?
Thank you.
P.S. I am extremely new (less than 2 month) to Neo4j and coding. So my apologies if there is an obvious simple solution.

I don't know if I understood your question completely, but I believe that you problem can be solved putting a WHERE clause on the MATCH to prevent the not desired relationship be matched, like this:
WITH 1 AS startVal
MATCH x = (c:Currency)-[r:Arb*2]->(m)
WHERE NOT (m)-[:Arb]->(c)
WITH x, REDUCE(s = startVal, e IN r | s * e.rate) AS endVal, startVal
RETURN EXTRACT(n IN NODES(x) | n) as Exchanges,
extract ( e IN relationships(x) | startVal * e.rate) AS Rel,
endVal, endVal - startVal AS Profit
ORDER BY Profit DESC LIMIT 5

Try inserting this clause after your MATCH clause, to filter out cases where c and m are the same:
WHERE c <> m
[EDITED]
That is:
WITH 1 AS startVal
MATCH x = (c:Currency)-[r:Arb*2]->(m)
WHERE c <> m
WITH x, REDUCE(s = startVal, e IN r | s * e.rate) AS endVal, startVal
RETURN EXTRACT(n IN NODES(x) | n) as Exchanges,
extract ( e IN relationships(x) | startVal * e.rate) AS Rel,
endVal, endVal - startVal AS Profit
ORDER BY Profit DESC LIMIT 5;
After using this query to create test data:
CREATE
(c:Currency {name: 'One'})-[:Arb {rate:1}]->(h:Account {name: 'hop'})-[:Arb {rate:2}]->(t:Currency {name: 'Two'}),
(t)-[:Arb {rate:3}]->(h)-[:Arb {rate:4}]->(c)
the above query produces these results:
+-----------------------------------------------------------------------------------------+
| Exchanges | Rel | endVal | Profit |
+-----------------------------------------------------------------------------------------+
| [Node[8]{name:"Two"},Node[7]{name:"hop"},Node[6]{name:"One"}] | [3,4] | 12 | 11 |
| [Node[6]{name:"One"},Node[7]{name:"hop"},Node[8]{name:"Two"}] | [1,2] | 2 | 1 |
+-----------------------------------------------------------------------------------------+

How to get the index of FOREACH iterations

Within a FOREACH statement [e.g. day in range(dayX, dayY)] is there an easy way to find out the index of the iteration ?

Yes, you can.
Here is an example query that creates 8 Day nodes that contain an index and day:
WITH 5 AS day1, 12 AS day2
FOREACH (i IN RANGE(0, day2-day1) |
CREATE (:Day { index: i, day: day1+i }));
This query prints out the resulting nodes:
MATCH (d:Day)
RETURN d
ORDER BY d.index;
and here is an example result:
+--------------------------+
| d |
+--------------------------+
| Node[54]{day:5,index:0} |
| Node[55]{day:6,index:1} |
| Node[56]{day:7,index:2} |
| Node[57]{day:8,index:3} |
| Node[58]{day:9,index:4} |
| Node[59]{day:10,index:5} |
| Node[60]{day:11,index:6} |
| Node[61]{day:12,index:7} |
+--------------------------+

FOREACH does not yield the index during iteration. If you want the index you can use a combination of range and UNWIND like this:
WITH ["some", "array", "of", "things"] AS things
UNWIND range(0,size(things)-2) AS i
// Do something for each element in the array. In this case connect two Things
MERGE (t1:Thing {name:things[i]})-[:RELATED_TO]->(t2:Thing {name:things[i+1]})
This example iterates a counter i over which you can use to access the item at index i in the array.

A complex match

I got the following cypher query:
neo4j-sh$ start n=node(1344) match (n)-[t:_HAS_TRANSLATION]-(p) return t,p;
+-----------------------------------------------------------------------------------+
| t | p |
+-----------------------------------------------------------------------------------+
| :_HAS_TRANSLATION[2224]{of:"value"} | Node[1349]{language:"hi-hi",text:"(>0#"} |
| :_HAS_TRANSLATION[2223]{of:"value"} | Node[1348]{language:"es-es",text:"hembra"} |
| :_HAS_TRANSLATION[2222]{of:"value"} | Node[1347]{language:"ru-ru",text:"65=A:89"} |
| :_HAS_TRANSLATION[2221]{of:"value"} | Node[1346]{language:"en-us",text:"female"} |
| :_HAS_TRANSLATION[2220]{of:"value"} | Node[1345]{language:"it-it",text:"femmina"} |
+-----------------------------------------------------------------------------------+
and the following array:
["it-it", "en-us", "fr-fr", "de-de", "ru-ru", "hi-hi"]
How can I change the query to return just one result, where 'language' is the same as the first occurrence of the array?
If the array should be
["fr-fr","jp-jp","en-us", "it-it", "de-de", "ru-ru", "hi-hi"]
I'd need to return the Node[1346], because it is the first with a match in the language array [en-us], being no entry for [fr-fr] and [jp-jp]
Thank you
Paolo

Cypher can express arrays, and indexes into them. So on one level, you could do this:
start n=node(1344)
match (n)-[t:_HAS_TRANSLATION]-(p)
where p.language = ["it-it", "en-us", "fr-fr", "de-de", "ru-ru", "hi-hi"][0] return t,p;
This is really just the same as asking for those nodes p where p.language="it-it" (the first element in your array).
Now, if what you mean is that the language attribute itself can be an array, then just treat it like one:
$ create (p { language: ['it-it', 'en-us']});
+-------------------+
| No data returned. |
+-------------------+
Nodes created: 1
Properties set: 1
$ match (p) where p.language[0] = 'it-it' return p;
+-------------------------------------+
| p |
+-------------------------------------+
| Node[1]{language:["it-it","en-us"]} |
+-------------------------------------+
1 row
38 ms
Note the array brackets on p.language[0].
Finally, if what you're talking about is splitting your language parts into pieces (i.e. "en-us" = ["en", "us"]) then cypher's string processing functions are a little on the weak side, and I wouldn't try to do this. Instead, I'd pre-process that before inserting it into the graph in the first place, and query them as separate pieces.

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart

number of connected nodes to specific nodes in a path - neo4j

Related

Neo4j: Cypher query to parallelize a row of the result from a previous query

Aggregating over Distinct Nodes

Make a path where next node is not the previous node?

How to get the index of FOREACH iterations

A complex match

Categories

Resources