Cypher - How to query multiple Neo4j Node property fragments with "STARTS WITH" - neo4j

I'm looking for a way to combine the Cypher "IN" and "STARTS WITH" queries. In other words, I'm looking for a way to look up nodes that start with specific string prefixes that are provided as an array, as with IN.
The goal is to have the query run with as few calls against the DB as possible.
I browsed the documentation and played around with Neo4j a bit but wasn't able to combine the following two queries into one:
MATCH (a:Node_type_A)-[]->(b:Node_type_B)
WHERE a.prop_A IN [...Array of Strings]
RETURN a.prop_A, COLLECT ({result_b: b.prop_B})
and
MATCH (a:Node_type_A)-[]->(b:Node_type_B)
WHERE a.prop_A STARTS WITH 'String'
RETURN a.prop_A, b.prop_B
Is there a way to combine these two approaches?
Any help is greatly appreciated.
Krid

You'll want to make sure there is an index or unique constraint (whichever is appropriate) on your :Node_type_A(prop_A) to speed up your lookups.
If I'm reading your requirements right, this query may work for you, adding your input strings as appropriate (parameterize them if you can).
WITH [...] as inputs
UNWIND inputs as input
// each string in inputs is now on its own row
MATCH (a:Node_type_A)
WHERE a.prop_A STARTS WITH input
// should be a fast index lookup for each input string
WITH a
MATCH (a)-[]->(b:Node_type_B)
RETURN a.prop_A, COLLECT ({result_b: b.prop_B})
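For completeness, the index mentioned above and a parameterized form of the query could look something like this. This is only a sketch, assuming Neo4j 3.x index syntax and a parameter named inputs (the parameter name is just an illustration):
CREATE INDEX ON :Node_type_A(prop_A);

UNWIND $inputs AS input
// each string in $inputs is now on its own row
MATCH (a:Node_type_A)
WHERE a.prop_A STARTS WITH input
MATCH (a)-[]->(b:Node_type_B)
RETURN a.prop_A, COLLECT({result_b: b.prop_B})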

Something like this should work:
MATCH (a:Node_type_A)-[]->(b:Node_type_B)
WITH a.prop_A AS pa, b.prop_B AS pb
WITH pa, pb,
REDUCE(s = [], x IN ['a','b','c'] |
CASE WHEN pa STARTS WITH x THEN s + pb ELSE s END) AS pbs
RETURN pa, pbs;

Related

Nested unions in Cypher/Neo4j

I have this metagraph in Neo4j:
:Protein ---> :is_a ---------------> :Enzyme ---> :activated_by|inhibited_by ---> :Compound
         <------ :activated_by <----/
:Compound --> :consumed_by|:produced_by ---> :Transport
         <------ :catalyzed_by <------------/
:Transport --> :part_of ---> :Pathway
(yes, it's biology, yes, it's imported from BioPAX).
I want to use Cypher to return all pairs of (:Protein, :Compounds) that are linked by some graph path. While one simple way would be to follow each possible path between the two node types and issue one query per each of them, clearly a query using some pattern union would be more compact (maybe slower, I'm assessing the performance of different approaches).
But how to write such a query? I've read that Cypher has a UNION construct, but I haven't clear how to combine and nest them, e.g., regarding the subgraphs from enzymes to transports, I would like to be able to write something equivalent to this informal expression:
enz:Enzyme :activated|inhibited_by comp:Compound
join with: (
(comp :consumed_by|produced_by :Transport)
UNION (:Transport :catalyzed_by comp )
)
I've read there should be some way, but I didn't get much of it and I'm trying to understand if there is a relatively easy way (in SPARQL or SQL the above is rather simple).
Using Collect
In Cypher, you can break a query into steps with WITH, and you can join two lists by concatenating them together.
MATCH (e:Enzyme)-[:activated]->(compA:Compound), (e)-[:inhibited_by]->(compB:Compound)
WITH e, COLLECT(compA)+COLLECT(compB) as compList
UNWIND compList as comp
WITH DISTINCT e, comp // if a comp can appear in both lists
MATCH ... // Repeat above at each path step
Using Union
When using UNION to combine different queries, think of it like a comma-separated list of queries, but with the word UNION instead of commas. (Also, every query in this list has to have the same return columns.)
MATCH (e:Enzyme)-[r1:activated]->(comp:Compound)-[r2:consumed_by]->(trans:Transport)
RETURN e as protein, comp as compound, trans as transport
UNION
MATCH (e:Enzyme)-[r1:inhibited_by]->(comp:Compound)-[r2:produced_by]->(trans:Transport)
RETURN e as protein, comp as compound, trans as transport
// Just to show that only the return column names have to match
UNION
WITH "rawr" as a
RETURN a as protein, 51 as compound, NULL as transport
This is good for combining the results of completely different queries, but since the queries you are combining are usually related, most of the time COLLECT will be more efficient, and gives you better control over the results.
Using OR
You can get the name of a relationship with the TYPE function and filter on that.
MATCH (e:Enzyme)-[r1]->(comp:Compound)-[r2]->(trans:Transport)
WHERE (TYPE(r1) = "activated" OR TYPE(r1) = "inhibited_by") AND (TYPE(r2) = "consumed_by" OR TYPE(r2) = "produced_by")
RETURN *
Note: For OR on just the relationship type, you can also use -[:A|:B]->.
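For example, the WHERE filter above could be rewritten with that syntax (same labels and relationship types as the query above):
MATCH (e:Enzyme)-[:activated|:inhibited_by]->(comp:Compound)-[:consumed_by|:produced_by]->(trans:Transport)
RETURN *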
Using Path Filtering
As of Neo4j 3.1.x, Cypher is fairly good at free path finding. (By that, I mean finding a valid path from the relationship pattern without searching all possible paths.) The upper bound is not strictly necessary, but it is good for preventing/controlling runaway queries.
MATCH p=(e:Protein)-[r1*..10]->(c:Compound)
WHERE ALL(r in RELATIONSHIPS(p) WHERE TYPE(r) in ["activated","inhibited_by","produced_by","consumed_by"])
RETURN e as protein, c as compound
Other
If the relationship type doesn't actually matter, you can just use (e:Enzyme)-->(c:Compound), or, to ignore direction, (e:Enzyme)--(c:Compound).
If it is an option, I would recommend refactoring your schema to be more consistent (or adding a relationship type relevant to this matching criteria) so that you don't need to union results. This will give you the best performance, since the Cypher planner will better know how to quickly find your results.
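For instance, a single summary relationship could be materialized up front. This is only a sketch; the INTERACTS_WITH type below is a hypothetical name, not part of the original schema:
// Hypothetical: add one relationship type covering both activation and inhibition
MATCH (e:Enzyme)-[:activated|:inhibited_by]->(c:Compound)
MERGE (e)-[:INTERACTS_WITH]->(c);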
The query below will return all paths in which an Enzyme is activated or inhibited by a Compound that is consumed, produced, or catalyzed by a Transport.
The query's r2 relationship pattern is non-directional, since the directionality of catalyzed_by is opposite that of consumed_by and produced_by.
MATCH p=
(e:Enzyme)-[r1:activated_by|:inhibited_by]->
(comp:Compound)-[r2:consumed_by|:produced_by|:catalyzed_by]-
(trans:Transport)
RETURN p;
So, I still haven't found a satisfactory way to do what I initially asked, but I got quite close to it and I wish to report on it.
First, my initial graph was a bit different than the one I actually have, so I'll base my next examples on the real one (sorry for the confusion):
I can work branches quite comfortably with paths in the WHERE clause (plus, in simpler cases, UNIONS):
// Branch 2
MATCH (prot:Protein), (enz:Enzyme), (tns:Transport) - [:part_of] -> (path:Path)
WHERE ( (enz) - [:ac_by|:in_by] -> (:Comp) - [:pd_by|:cs_by] -> (tns)   // Branch 2.1
        OR (tns) - [:ca_by] -> (enz) )                                  // Branch 2.2 (pt1)
  AND ( (prot) - [:is_a] -> (enz) OR (prot) <- [:ac_by] - (enz) )       // Branch 2.2 (pt2)
RETURN prot, path LIMIT 30
UNION
// Branch 1
MATCH (prot:Protein) - [:pd_by|:cs_by] -> (:Reaction) - [:part_of] -> (path:Path)
RETURN prot, path LIMIT 30
(I'm also sorry for all those abbreviations, e.g., pd_by is produced_by, ac_by is activated_by and so on).
This query yields results in about 1 minute. That's too long, and it's clearly due to the way the query is interpreted, as one can see from its plan.
I really can't understand why there are those huge cartesian products. I've tried the WITH/UNWIND approach, but I've been unable to get the correct results (see my comments above, thanks @Tezra and @cybersam) and, even if I had been, it's a very difficult syntax.

Neo4J stacked query for decision tree and expert system

I am new to Neo4J and struggling to get the output I need without extra external processing. I am pretty sure that Neo4J is easily capable of doing that. Please advise on the best approach.
I have 2 types of nodes, Ingredient and Function, that are connected by a relation that has a weight property (e.g. {weight: 1}). I am querying a list of ingredients and I want to get a sum of the connections, multiplied by the weight of each connection, leading to each distinct function.
Here's what I came up with.
MATCH (q:Ingredient)-[r]->(p) WHERE q.RefNo IN [1,2,3] RETURN r,p
This produces the following output:
{"weight":1}│{"Function":"X1"}
{"weight":0.5}│{"Function":"X1"}
{"weight":0.7}│{"Function":"X2"}
{"weight":0.4}│{"Function":"X3"}
{"weight":0.5}│{"Function":"X4"}
What I want to get in a single/stacked query is
X1:1.5, X2:0.7, X3:0.4, X4:0.5
Please advise on a possible solution to this problem.
Almost there, you just need to use an aggregation function (in this case, sum()) on the weight, and the Function property as your non-aggregation column:
MATCH (q:Ingredient)-[r]->(p)
WHERE q.RefNo IN [1,2,3]
RETURN p.Function, sum(r.weight) as weight
Or if you only want a single row result:
MATCH (q:Ingredient)-[r]->(p)
WHERE q.RefNo IN [1,2,3]
WITH p, sum(r.weight) as weight
RETURN collect(p {.Function, weight}) as functionWeights
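With the sample rows from the question, the single-row form would return something like this (a sketch of the result shape, assuming those are the only matching rows):
functionWeights
[{Function: "X1", weight: 1.5}, {Function: "X2", weight: 0.7}, {Function: "X3", weight: 0.4}, {Function: "X4", weight: 0.5}]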

Neo4j query for complete paths

I have the following structure.
CREATE
(`0`:Sentence {text:'This is a sentence'}),
(`1`:Word {text:'This'}),
(`2`:Word {text:'is'}),
(`3`:Sentence {text:'Sam is a dog'}),
(`0`)-[:RELATED_TO]->(`1`),
(`0`)-[:RELATED_TO]->(`2`),
(`3`)-[:RELATED_TO]->(`2`)
(schema example image)
So my question is this. I have a bunch of sentences that I have decomposed into word objects. These word objects are all unique and therefore will point to different sentences. If I perform a search for one word, it's very easy to figure out all of the sentences that word is related to. How can I structure a query to figure out the same information for two words instead of one?
I would like to submit two or more words and find a path that includes all words submitted, picking up all sentences of interest.
I just remembered an alternate approach that may work better. Compare the PROFILE on this query with the profiles for the others, see if it works better for you.
WITH {myListOfWords} as wordList
WITH wordList, size(wordList) as wordCnt
MATCH (s)-[:RELATED_TO]->(w:Word)
WHERE w.text in wordList
WITH s, wordCnt, count(DISTINCT w) as cnt
WHERE wordCnt = cnt
RETURN s
Unfortunately it's not a very pretty approach; it basically comes down to collecting :Word nodes and using the ALL() predicate to ensure that the pattern you want holds true for all elements of the collection.
MATCH (w:Word)
WHERE w.text in {myListOfWords}
WITH collect(w) as words
MATCH (s:Sentence)
WHERE ALL(word in words WHERE (s)-[:RELATED_TO]->(word))
RETURN s
What makes this ugly is that the planner isn't intelligent enough right now to infer, when you say MATCH (s:Sentence) WHERE ALL(word in words ...), that the initial matches for s ought to come from the first w in your words collection, so it starts out from all :Sentence nodes first, which is a major performance hit.
So to get around this, we have to explicitly match from the first of the words collection, and then use WHERE ALL() for the remaining.
MATCH (w:Word)
WHERE w.text in {myListOfWords}
WITH w, size(()-[:RELATED_TO]->(w)) as rels
WITH w ORDER BY rels ASC
WITH collect(w) as words
WITH head(words) as head, tail(words) as words
MATCH (s)-[:RELATED_TO]->(head)
WHERE ALL(word in words WHERE (s)-[:RELATED_TO]->(word))
RETURN s
EDIT:
Added an optimization to order your w nodes by the degree of their incoming :RELATED_TO relationships (this is a degree lookup on very few nodes), as this will mean the initial match to your :Sentence nodes is the smallest possible starting set before you filter for relationships from the rest of the words.
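All of the queries above read the word list from the myListOfWords parameter. If you are testing in the Neo4j Browser, it could be set beforehand like this (a sketch; the word list itself is just an illustration):
:param myListOfWords => ['this', 'is']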
As an alternative, you could consider using manual indexing (also called "legacy indexing") instead of using Word nodes and RELATED_TO relationships. Manual indexes support "fulltext" searches using lucene.
There are many apoc procedures that help you with this.
Here is an example that might work for you. In this example, I assume case-insensitive comparisons are OK, you retain the Sentence nodes (and their text properties), and you want to automatically add the text properties of all Sentence nodes to a manual index.
If you are using neo4j 3.2+, you have to add this setting to the neo4j.conf file to make some expensive apoc.index procedures (like apoc.index.addAllNodes) available:
dbms.security.procedures.unrestricted=apoc.*
Execute this Cypher code to initialize a manual index named "WordIndex" with the text property of all existing Sentence nodes, and to enable automatic indexing from that point onwards:
CALL apoc.index.addAllNodes('WordIndex', {Sentence: ['text']}, {autoUpdate: true})
YIELD label, property, nodeCount
RETURN *;
To find (case insensitively) the Sentence nodes containing all the words in a collection (passed via a $words parameter), you'd execute a Cypher statement like the one below. The WITH clause builds the lucene query string (e.g., "foo AND bar") for you. Caveat: since lucene's special boolean terms (like "AND" and "OR") are always in uppercase, you should make sure the words you pass in are lowercased (or modify the WITH clause below to use the TOLOWER() function as needed).
WITH REDUCE(s = $words[0], x IN $words[1..] | s + ' AND ' + x) AS q
CALL apoc.index.search('WordIndex', q) YIELD node
RETURN node;
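If the words might arrive mixed-case, one possible variant lowercases them before building the lucene string (a sketch; only the first WITH line differs from the query above):
WITH [w IN $words | TOLOWER(w)] AS words
WITH REDUCE(s = words[0], x IN words[1..] | s + ' AND ' + x) AS q
CALL apoc.index.search('WordIndex', q) YIELD node
RETURN node;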

Cypher query: CONCAT and TOKENIZE in the SET clause

I need to generate a Cypher query where I SET the value of one property based on the value of another property, like this:
MATCH (n) SET n.XXX = n.YYY return n;
So XXX will be set to YYY. But YYY has values like "ABCD.net/ABC-MNO-XYZ-1234", and I should eliminate all the special characters (/, -, etc.) and then concatenate the split substrings. So the logical statement should be like:
MATCH (n) SET n.XXX = CONCAT(SPLIT(n.YYY, "/")) return n;
Neo4j doesn't have any CONCAT function. So how is it possible to accomplish this stuff in a cypher query?
Any help will be highly appreciated.
Thanks
You could do this:
MATCH (n:SomeNode) set n.uuid = reduce(s="",x in split(n.uuid,'/')| s+x)
and run this query for every special character.
If there are a lot of special characters, write this query:
UNWIND ['/','#'] as delim match (n:SomeNode) set n.uuid = reduce(s="",x in split(n.uuid,delim)| s+x)
Replace '/','#' with a list of your special characters.
Your use case really looks like you want removal of characters in the string, and neo4j does offer a replace() function that you can leverage for this.
But in the event that you do want a join(), the APOC Procedures library has this among the text functions it offers.
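A sketch of both options, assuming only '/' and '-' need to be stripped and that your APOC version exposes apoc.text.join as a function:
// Option 1: nested replace() calls, plain Cypher
MATCH (n)
SET n.XXX = replace(replace(n.YYY, '/', ''), '-', '')
RETURN n;

// Option 2: split on one character and re-join with APOC's text functions
MATCH (n)
SET n.XXX = apoc.text.join(split(n.YYY, '/'), '')
RETURN n;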

How to optimize the calculation of the shortest time in a certain path in Neo4j?

I have a database in Neo4j of modules that I imported through CSV. The data looks something like this. Each module has its name, the module that is its successor, an average time duration, and another duration called medtime.
I have been able to import the data and to set the relationships through a Cypher Query script that looks like this:
LOAD CSV WITH HEADERS FROM "file:c:/users/Skelo/Desktop/Neo4J related/Statistic Dependencies/Simple.csv" AS row FIELDTERMINATOR ';'
CREATE (n:Module)
SET n = row, n.name = row.name, n.mafter = row.mafter, n.avgtime = row.avgtime, n.medtime = row.medtime
WITH n
RETURN n
Then I have set the relationships like this:
Match (p:Module),(q:Module)
Where p.mafter = q.name
Merge (p)-[:PRECEEDS]->(q)
Return p,q
Now to the point. I want to calculate the shortest path from a certain module to another, more specifically the time that it takes to get from one module to another, and for this I use a more or less copied part of the script from
http://www.neo4j.org/graphgist?8412907 and that is
MATCH p = (trop:Module {name:'BLSACXAMT0A_00'})-[prec:PRECEEDS*]->(hop:Module {name:'BL_LOAD_CLOSE'})
WITH p, REDUCE(x = 0, a IN NODES(p) | x + a.avgtime) AS cum_duration
ORDER BY cum_duration DESC
LIMIT 1
RETURN cum_duration AS `Total Average Time`
This, however, takes about 50 seconds to execute, and that is outrageous. You can see it on the screenshot right below. The amount of modules imported into the database is only about 2000, and what I want to achieve is to successfully work with more than 50 000 nodes and perform such tasks much faster.
Another issue is that the results are somewhat suspicious. The format looks wrong: every number I have in the database has at most 4 digits after the decimal point and I am only adding these values to zero, so if the result looks like 00103,68330,51670, I have serious doubts. Please help me understand why this is wrong and what I can do to correct it.
Neo4j claims that it is efficient and fast, therefore I presume that the fault is in my code (the performance of my computer is more than enough). Please, If you can, help me to shorten this time and explain the patterns needed to perform this.
A few observations that should help:
You have several errors in how you are importing. These errors will create many more nodes than you think, and create the "suspicious" issue you raised:
Your file has multiple rows with the same name, but your import is creating a new Module node every time. Therefore, you are ending up with multiple nodes for some of your modules. You should be using MERGE instead of CREATE.
Your mafter property needs to contain a collection of strings, not a single string.
You are importing the numeric values as strings, so code such as x + a.avgtime is just doing string concatenation, not numeric addition. Furthermore, even if you did attempt to convert your strings to numbers, that would fail because your numbers use a comma instead of a period to indicate the decimal place.
Try this for importing (into an empty DB):
LOAD CSV WITH HEADERS FROM "file:c:/users/Skelo/Desktop/Neo4J related/Statistic Dependencies/Simple.csv" AS row FIELDTERMINATOR ';'
MERGE (n:Module {name: row.name})
ON CREATE SET
n.mafter = [row.mafter],
n.avgtime = TOFLOAT(REPLACE(row.avgtime, ',', '.')),
n.medtime = TOFLOAT(REPLACE(row.medtime, ',', '.'))
ON MATCH SET
n.mafter = n.mafter + row.mafter;
You also need to change your current merge query so that you can handle an mafter that is a collection. Note that the following query is designed to NOT create any new nodes (even if a name in mafter does not yet have a module node).
MATCH (p:Module)
OPTIONAL MATCH (p)-[:PRECEEDS]->(z:Module)
WITH p, COLLECT(z.name) AS existing
WITH p, filter(x IN p.mafter
WHERE NOT x IN existing) AS todo
MATCH (q:Module)
WHERE q.name IN todo
MERGE (p)-[:PRECEEDS]->(q)
RETURN p, q;
You should create an index to speed up the matching of modules by name:
CREATE INDEX ON :Module(name)
Cypher does have a shortestPath function, see http://neo4j.com/docs/stable/query-match.html#_shortest_path. However this calculates the shortest path based on the number of hops and does not take a weight into account.
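For reference, a hop-count-only shortest path over this data would look something like the following sketch (it uses the module names from the question and ignores avgtime entirely):
MATCH p = shortestPath(
  (m1:Module {name:'BLSACXAMT0A_00'})-[:PRECEEDS*]->(m2:Module {name:'BL_LOAD_CLOSE'})
)
RETURN p;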
Neo4j has a couple of graph algorithms on board, e.g. Dijkstra or AStar. Unfortunately these are not yet available via Cypher. Instead you have two alternatives to use them:
1) Write an unmanaged extension for Neo4j and use GraphAlgoFactory in the implementation. This requires writing some Java code and deploying it to the Neo4j server. Using a custom CostEvaluator you can use the avgtime property on your nodes as the cost parameter.
2) Use the REST API as documented at http://neo4j.com/docs/stable/rest-api-graph-algos.html#rest-api-execute-a-dijkstra-algorithm-and-get-a-single-path. This approach requires the weight to be a property on the relationship and not on a node (as it is in your data model).
