Function to get position of substring, using backreferences in Neo4j Cypher regex - neo4j

I'm using the web interface and Neo4j Community 2.1.3.
Is there a Cypher function or combination of functions that gives the position of a substring in a string or uses its location as a starting point for a SET clause, similar to FIND or SEARCH in Google Spreadsheets or substring-before() and substring-after() in XPATH? Something like:
MATCH (p)
WHERE p.name=~"^.*?\\, .*?$"
SET p.lastName=LEFT(p.name,FIND(p.name,", ")), p.firstName=RIGHT(p.name,FIND(p.name,",")+1)
RETURN p ;
The FIND() function would return the position of the substring (in this case: a comma plus single space), so the LEFT and RIGHT functions can be used to extract a partial string. Something like the XPATH substring-before() and substring-after() accomplish the same thing in one function. The Cypher string functions SUBSTRING, LEFT, RIGHT are only of limited use without this additional functionality — unless I'm just missing something.
Along these lines (string manipulation), is there a way to use backreferences to Cypher regex WHERE matches? We can use groups for matching, but I can't figure out how to reuse those groups using \1 or $1 in a SET clause. The above code could be made simpler using regex groups and some kind of backreferences:
MATCH (p)
WHERE p.name=~"^(.*?)\\, (.*?)$"
SET p.lastName=\1, p.firstName=\2
RETURN p ;
Are these things possible yet? I can't find documentation or examples. I've seen the Regx4Neo plugin, but the command shell is beyond my abilities at this point.

If your example is so contrived that this is irrelevant, my apologies, but you could just split on ", " and then SET properties equal to the different elements resulting from the split. So, with the following example data:
CREATE (:Person {name:'White, Nicole'}),
(:Person {name:'Bastani, Kenny'}),
(:Person {name:'Hunger, Michael'})
We can get first and last names and set them as properties:
MATCH (p:Person)
WITH p, SPLIT(p.name, ", ") AS names
SET p.firstName = names[1],
p.lastName = names[0]
Result:
MATCH (p:Person)
RETURN p.firstName, p.lastName
p.firstName p.lastName
Nicole White
Kenny Bastani
Michael Hunger

Related

Neo4j query for complete paths

I have the following structure.
CREATE
(`0` :Sentence {`{text`:'This is a sentence'}}) ,
(`1` :Word {`{ text`:'This' }}) ,
(`2` :Word {`{text`:'is'}}) ,
(`3` :Sentence {`{'text'`:'Sam is a dog'}}) ,
(`0`)-[:`RELATED_TO`]->(`1`),
(`0`)-[:`RELATED_TO`]->(`2`),
(`3`)-[:`RELATED_TO`]->(`2`)
schema example
So my question is this. I have a bunch of sentences that I have decomposed into word objects. These word objects are all unique and therefore will point to different sentences. If I perform a search for one word it's very easy to figure out all of the sentences that word is related to. How can I structure a query to figure out the same information for two words instead of one.
I would like to submit two or more words and find a path that includes all words submitted picking up all sentences of interest.
I just remembered an alternate approach that may work better. Compare the PROFILE on this query with the profiles for the others, see if it works better for you.
WITH {myListOfWords} as wordList
WITH wordList, size(wordList) as wordCnt
MATCH (s)-[:RELATED_TO]->(w:Word)
WHERE w.text in wordList
WITH s, wordCnt, count(DISTINCT w) as cnt
WHERE wordCnt = cnt
RETURN s
Unfortunately it's not a very pretty approach, it basically comes down to collecting :Word nodes and using the ALL() predicate to ensure that the pattern you want holds true for all elements of the collection.
MATCH (w:Word)
WHERE w.text in {myListOfWords}
WITH collect(w) as words
MATCH (s:Sentence)
WHERE ALL(word in words WHERE (s)-[:RELATED_TO]->(word))
RETURN s
What makes this ugly is that the planner isn't intelligent enough right now to infer that when you say MATCH (s:Sentence) WHERE ALL(word in words ... that the initial matches for s ought to come from the match from the first w in your words collection, so it starts out from all :Sentence nodes first, which is a major performance hit.
So to get around this, we have to explicitly match from the first of the words collection, and then use WHERE ALL() for the remaining.
MATCH (w:Word)
WHERE w.text in {myListOfWords}
WITH w, size(()-[:RELATED_TO]->(w)) as rels
WITH w ORDER BY rels ASC
WITH collect(w) as words
WITH head(words) as head, tail(words) as words
MATCH (s)-[:RELATED_TO]->(head)
WHERE ALL(word in words WHERE (s)-[:RELATED_TO]->(word))
RETURN s
EDIT:
Added an optimization to order your w nodes by the degree of their incoming :RELATED_TO relationships (this is a degree lookup on very few nodes), as this will mean the initial match to your :Sentence nodes is the smallest possible starting set before you filter for relationships from the rest of the words.
As an alternative, you could consider using manual indexing (also called "legacy indexing") instead of using Word nodes and RELATED_TO relationships. Manual indexes support "fulltext" searches using lucene.
There are many apoc procedures that help you with this.
Here is an example that might work for you. In this example, I assume case-insensitive comparisons are OK, you retain the Sentence nodes (and their text properties), and you want to automatically add the text properties of all Sentence nodes to a manual index.
If you are using neo4j 3.2+, you have to add this setting to the neo4j.conf file to make some expensive apoc.index procedures (like apoc.index.addAllNodes) available:
dbms.security.procedures.unrestricted=apoc.*
Execute this Cypher code to initialize a manual index named "WordIndex" with the text text from all existing Sentence nodes, and to enable automatic indexing from that point onwards:
CALL apoc.index.addAllNodes('WordIndex', {Sentence: ['text']}, {autoUpdate: true})
YIELD label, property, nodeCount
RETURN *;
To find (case insensitively) the Sentence nodes containing all the words in a collection (passed via a $words parameter), you'd execute a Cypher statement like the one below. The WITH clause builds the lucene query string (e.g., "foo AND bar") for you. Caveat: since lucene's special boolean terms (like "AND" and "OR") are always in uppercase, you should make sure the words you pass in are lowercased (or modify the WITH clause below to use the TOLOWER()` function as needed).
WITH REDUCE(s = $words[0], x IN $words[1..] | s + ' AND ' + x) AS q
CALL apoc.index.search('WordIndex', q) YIELD node
RETURN node;

Cypher - How to query multiple Neo4j Node property fragments with "STARTS WITH"

I'm looking for a way to combine the Cypher "IN" and "STARTS WITH" query. In other words I'm looking for a way to look up nodes that start with specific string sequences that are provided as Array using IN.
The goal is to have the query run in as less as possible calls against the DB.
I browsed the documentation and played around with Neo4j a bit but wasn't able to combine the following two queries into one:
MATCH (a:Node_type_A)-[]->(b:Node_type_B)
WHERE a.prop_A IN [...Array of Strings]
RETURN a.prop_A, COLLECT ({result_b: b.prop_B})
and
MATCH (a:Node_type_A)-[]->(b:Node_type_B)
WHERE a.prop_A STARTS WITH 'String'
RETURN a.prop_A, b.prop_B
Is there a way to combine these two approaches?
Any help is greatly appreciated.
Krid
You'll want to make sure there is an index or unique constraint (whichever is appropriate) on your :Node_type_A(prop_A) to speed up your lookups.
If I'm reading your requirements right, this query may work for you, adding your input strings as appropriate (parameterize them if you can).
WITH [...] as inputs
UNWIND inputs as input
// each string in inputs is now on its own row
MATCH (a:Node_type_A)
WHERE a.prop_A STARTS WITH input
// should be an fast index lookup for each input string
WITH a
MATCH (a)-[]->(b:Node_type_B)
RETURN a.prop_A, COLLECT ({result_b: b.prop_B})
Something like this should work:
MATCH (a:Node_type_A)-[]->(b:Node_type_B)
WITH a.prop_A AS pa, b.prop_B AS pb
WITH pa, pb,
REDUCE(s = [], x IN ['a','b','c'] |
CASE WHEN pa STARTS WITH x THEN s + pb ELSE s END) AS pbs
RETURN pa, pbs;

Cypher query: CONCAT and TOKENIZE in the SET clause

I need to generate a cypher query where I should SET the value of one property based on the value of the other property as like this -
MATCH (n) SET n.XXX = n.YYY return n;
So XXX will be set to YYY. But YYY has values like this - "ABCD.net/ABC-MNO-XYZ-1234" and I should eliminate all the special characters (/,- etc) and then concat the splitted substrings. So the logical statement should be like -
MATCH (n) SET n.XXX = CONCAT(SPLIT(n.YYY, "/")) return n;
Neo4j doesn't have any CONCAT function. So how is it possible to accomplish this stuff in a cypher query?
Any help will be highly appreciated.
Thanks
You could do this:
MATCH (n:SomeNode) set n.uuid = reduce(s="",x in split(n.uuid,'/')| s+x)
and run this query for every special character.
If there are a lot of special characters, write this query:
UNWIND ['/','#'] as delim match (n:SomeNode) set n.uuid = reduce(s="",x in split(n.uuid,delim)| s+x)
Replace '/','#' with a list of your special characters.
Your use case really looks like you want removal of characters in the string, and neo4j does offer a replace() function that you can leverage for this.
But in the event that you do want a join(), the APOC Procedures library has this among the text functions it offers.

Neo4j: Cypher query on property array

I have a domain class which has a property by name "alias" which is an arraylist of strings like below:
private List<String> alias;
alias contains the following values: {"John Doe","Alex Smith","Greg Walsh"}
I'd like to be able to make a query like: "I saw Smith today" using my repository query shown below and get the array value Output "Alex Smith":
#Query("MATCH (p:Person) WHERE {0} IN p.alias RETURN p")
Iterable<Person> findByAlias(String query);
I tried a bunch of different queries like the one shown above but this would only match if the input query matches the array value exactly.
I want do a input query sub-string match with the array values.
Eg:
Input Query: "I saw Smith today"
Output: "Alex Smith"
Summary
It is sort of possible to do what you want, but the query will be horribly ugly and slow. You'd be better off using nodes and relationships instead of collection properties: it will make your queries more sensible and allows you to use a full-text index. You should also figure out what part of the 'input string' you are looking for before you send your query to the database. As it stands, you are confusing the regex pattern with the data it is supposed to match and even if it were possible to express your intention as a regex it would be much better to do that processing your application, before sending the query.
1) WHERE ... IN ... doesn't do regex
WHERE x IN y will not treat x as a regular expression, it will take x's value for what it is and look for an exact match. WHERE ... IN ... is analogous to WHERE ... = ... in this sense, and you would need the analogue of =~ for collections, something like IN~, for this. There is no such construct in Cypher.
2) You can do regex over collections with predicates, but it is inefficient
You can use a string as a regular expression to test for matches over a collection by using a predicate like ANY or FILTER.
CREATE (p:Person {collectionProperty:["Paulo","Jean-Paul"]})
and
WITH "(?i).*Paul" as param
MATCH (p:Person)
WHERE ANY(item IN p.collectionProperty WHERE item =~ param)
RETURN p
will return the node because it makes a successful regular expression match on "Jean-Paul".
This, however, will have terrible performance since you will run your regular expression on every item in every collectionProperty for every :Person in your database. The solution is to use a full-text index, but his query can't use indices for two reasons:
the values you are querying are in an array
you are using regular expressions to filter results instead of doing an index query
3) You can't do regex over collections at all with your kind of input
The biggest problem with your query is that you are trying to turn "I saw Smith today" into "Smith" by adding some regular expression sugar. How do you intend to do that? If you use the string as a regular expression, each of those characters is a literal character expected to be in the data that you match it on. You are confused about .*, which when used in 'Smith.*' would match 'Smith' plus zero or more additional characters in the data. But you try to use it to say that zero or more characters may follow something in the pattern.
Take the query in your comment:
MATCH (p:Person)
WHERE '(?i).*I saw Smith today.*' IN p.alias
RETURN p
The regular expression '(?i).*I saw Smith today.*' will match
ignoring the case of the literal string–'i SAW smith TOday', etc.
with zero or more characters before and after the literal string–'Yes, I saw Smith today and he looked happy.'
But adding .* won't somehow magically make the pattern mean '.*Smith.*'. What's more, it's almost impossible to express 'I saw Smith today' as a subset of 'Alex Smith' by any amount of added regular expression sugar. Instead, you should process that string and figure out what parts of it you want to use in a regular expression before you send your query. How is the database to know that 'Smith' is the part of the input string that you want to use? However it is that you know it, you should know it before you send the query, and only include that relevant part.
Aside: examples of added regular expression sugar that won't work and why
You could insert a ? after each character in the pattern to make each character optional
RETURN "Smith" =~ "I? ?s?a?w? ?S?m?i?t?h? ?t?o?d?a?y?"
But now your pattern is way too loosie goosie, and it will match strings like 'I sat today' and 'sam toy'. Moreover, it won't match 'Alex Smith' unless you prepend .*, but then it is even more loosie goosie and will match any string whatever.
You could divide characters that belong together into groups and make the groups and the spaces between them optional.
RETURN "Smith" =~ "(I)? ?(saw)? ?(Smith)? ?(today)?"
But this also is too broad, fails to match 'Alex Smith' and will match any string whatever if you prepend .*.
4) Bad solution
The only 'solution' I can think of is a hideous query that splits your string on whitespace, concatenates some regex sugar into each word and compares it as a regular expression in a predicate clause. It really is hideous, and it assumes that you already know that you want match the words in the string and not the whole string, in which case you should be doing that processing before you send your query and not in Cypher. Look upon this hideousness and weep
WITH "I saw Paul today" AS paramString
MATCH (p:Person)
WHERE ANY (param IN split(paramString, ' ')
WHERE ANY (item IN p.collectionProperty
WHERE item =~('(?i).*' + param)))
RETURN p
5) Conclusion
The conclusion is as follows:
1) Change your model.
Keep a node for each alias like so
CREATE (a:Alias)
SET a.aliasId = "Alex Smith"
Create a full-text index for these nodes. See blog and docs for the generic case and docs for SDN.
Connect the nodes that now have the alias in a collection property to the new alias node with a relationship.
Look up the alias node that you want and follow the relationship to the node that 'has' the alias. A node can still have many aliases, but you no longer store them in a collection property–your query logic will be more straightforward and you will benefit from the full-text lucene index. Query with START n=node:indexName("query") when using cypher and use findAllByQuery() in SDN. This is necessary for the query to use the full-text index.
Your query might then finally look something like
START n=node:myIndex("aliasId:*smith")
MATCH n<-[:HAS_ALIAS]-smith
RETURN smith
2) Don't do all your work in the database.
If your program is supposed to receive a string like 'I saw Smith today' and give back a node based on a pattern match on 'Smith', then don't send 'I saw' and 'today' to the database. You're better off identifying 'Smith' as the relevant part of the string in your application–when you send the query you should already know what it is you want.
You should use something like this:
MATCH (n:Test)
WHERE single(x IN n.prop WHERE x = "elem1")
RETURN n
It checks that collection has exactly one "elem1".
More info.

Neo4j rename property using regex of current property value

From my research (lots of Googling), I can not see that this is possible, but it is still worth asking I think. I have a large collection of nodes like:
(org:Organization {name: "Organization 1234"})
where 1234 can be any non-negative integer.
In order to update the DB to work with a new API, I am wanting to rename each of these in the collection so that the result would look something like:
(org:Organization {name: "Org_1234"})
So, I need to mashup Org_ with the [0-9]+ regex match on the current property.
Truly, I am at a loss on where to even start. All I see in the documentation is that regex can be used as a predicate in the WHERE clause (WHERE n.property =~ {regex}). Is there a way using just Cypher as I am not using a Java library?
Assuming there is always a single space between "Organization" and the integer, you can brute force this pretty easily with just string functions.
CREATE (:Organization {name:'Organization 1234'}),
(:Organization {name:'Organization 5678'})
MATCH (o:Organization)
WITH o, SPLIT(o.name, " ")[1] AS id
SET o.name = "Org_" + id
RETURN o.name
Which returns
o.name
Org_1234
Org_5678

Resources