Cypher query to find a node based on a regexp on a property - neo4j

I have a Neo4J database mapped with COMPANY as nodes and RELATED as edges. COMPANY has a CODE property. I want to query the database and get the first node that matches the regexp COMPANY.CODE =~ '12345678.*', i.e., a COMPANY whose first 8 letters of CODE is equal a given string literal.
After several attempts, the best I could come up with was the following query:
START p=node(*) where p.CODE =~ '12345678.*' RETURN p;
The result is the following exception:
org.neo4j.cypher.EntityNotFoundException:
The property 'CODE' does not exist on Node[0]
It looks like Node[0] is a special kind of node in the database, that obviously doesn't have my CODE property. So, my query is failing because I'm not choosing the appropriate type of node to query upon. But I couldn't figure out how to specify the type of node to query on.
What's the query that returns what I want?
I think I need an index on CODE to run this query, but I'd like to know whether there's a query that can do the job without using such an index.
Note: I'm using Neo4J version 1.9.2. Should I upgrade to 2.0?

You can avoid the exception by checking for the existence of the property,
START p=node(*)
WHERE HAS(p.CODE) AND p.CODE =~ '12345678.*'
RETURN p;
You don't need an index for the query to work, but an index may increase performance. If you don't want indices there are several other options. If you keep working with Neo4j 1.9.x you may group all nodes representing companies under one or more sorting nodes. When you query for company nodes you can then retrieve them from their sorting node and filter on their properties. You can partition your graph by grouping companies with a certain range of values for .code under one sorting node, and a different range under another; and you can extend this partitioning as needed if your graph grows.
If you upgrade to 2.0 (note: not released yet, so not suitable for production) you can benefit from labels. You can then assign a Company label to all those nodes and this label can maintain it's own index and uniqueness constraints.
The zero node, called 'reference node', will not remain in future versions of Neo4j.

Related

How to query Neo4j N levels deep with variable length relationship and filters on each level

I'm new(ish) to Neo4j and I'm attempting to build a tool that allows users on a UI to essentially specify a path of nodes they would like to query neo4j for. For each node in the path they can specify specific properties of the node and generally they don't care about the relationship types/properties. The relationships need to be variable in length because the typical use case for them is they have a start node and they want to know if it reaches some end node without caring about (all of) the intermediate nodes between the start and end.
Some restrictions the user has when building the path from the UI is that it can't have cycles, it can't have nodes who has more than one child with children and nodes can't have more than one incoming edge. This is only enforced from their perspective, not in the query itself.
The issue I'm having is being able to specify filtering on each level of the path without getting strange behavior.
I've tried a lot of variations of my Cypher query such as breaking up the path into multiple MATCH statements, tinkering with the relationships and anything else I could think of.
Here is a Gist of a sample Cypher dump
cypher-dump
This query gives me the path that I'm trying to get however it doesn't specify name or type on n_four.
MATCH path = (n_one)-[*0..]->(n_two)-[*0..]->(n_three)-[*0..]->(n_four)
WHERE n_one.type IN ["JCL_JOB"]
AND n_two.type IN ["JCL_PROC"]
AND n_three.name IN ["INPA", "OUTA", "PRGA"]
AND n_three.type IN ["RESOURCE_FILE", "COBOL_PROGRAM"]
RETURN path
This query is what I'd like to work however it leaves out the leafs at the third level which I am having trouble understanding.
MATCH path = (n_one)-[*0..]->(n_two)-[*0..]->(n_three)-[*0..]->(n_four)
WHERE n_one.type IN ["JCL_JOB"]
AND n_two.type IN ["JCL_PROC"]
AND n_three.name IN ["INPA", "OUTA", "PRGA"]
AND n_three.type IN ["RESOURCE_FILE", "COBOL_PROGRAM"]
AND n_four.name IN ["TAB1", "TAB2", "COPYA"]
AND n_four.type IN ["RESOURCE_TABLE", "COBOL_COPYBOOK"]
RETURN path
What I've noticed is that when I "... RETURN n_four" in my query it is including nodes that are at the third level as well.
This behavior is caused by your (probably inappropriate) use of [*0..] in your MATCH pattern.
FYI:
[*0..] matches 0 or more relationships. For instance, (a)-[*0..]->(b) would succeed even if a and b are the same node (and there is no relationship from that node back to itself).
The default lower bound is 1. So [*] is equivalent to [*..] and [*1..].
Your 2 queries use the same MATCH pattern, ending in ...->(n_three)-[*0..]->(n_four).
Your first query does not specify any WHERE tests for n_four, so the query is free to return paths in which n_three and n_four are the same node. This lack of specificity is why the query is able to return 2 extra nodes.
Your second query specifies WHERE tests for n_four that make it impossible for n_three and n_four to be the same node. The query is now more picky, and so those 2 extra nodes are no longer returned.
You should not use [*0..] unless you are sure you want to optionally match 0 relationships. It can also add unnecessary overhead. And, as you now know, it also makes the query a bit trickier to understand.

Adding relationship to existing nodes with Cypher doesn't work

I am working on Panama dataset using Neo4J graph database 1.1.5 web version. I identified Ion Sturza, former Prime Minister of Moldova on the database and want to make a map of his related network. I used following code to query using Cypher (creating a variable 'IonSturza'):
MATCH (IonSturza {name: "Ion Sturza"}) RETURN IonSturza
I identified that the entity 'CONSTANTIN LUTSENKO' linked differently to entities like 'Quade..' and 'Kinbo...' with a name in small letters as in this picture. I hence want to map a relationship 'SAME_COMPANY_AS' between the capslock and the uncapped version. I tried the following code based on this answer by #StefanArmbruster:
MATCH (a:Officer {name :"Constantin Lutsenko"}),(b:Officer{name :
"CONSTANTIN LUTSENKO"})
where (a:Officer{name :"Constantin Lutsenko"})-[:SHAREHOLDER_OF]->
(b:Entity{id:'284429'})
CREATE (a)-[:SAME_COMPANY_AS]->(b)
Instead of indexing, I used the 'where' statement to specify the uncapped version which is linked only to the entity bearing id '284429'.
My code however shows the cartesian product error message:
This query builds a cartesian product between disconnected patterns.If a part of a query contains multiple disconnected patterns, this will build a cartesian product between all those parts. This may produce a large amount of data and slow down query processing. While occasionally intended, it may often be possible to reformulate the query that avoids the use of this cross product, perhaps by adding a relationship between the different parts or by using OPTIONAL MATCH (identifier is: (b))<<
Also when I execute, there are no changes, no rows!! What am I missing here? Can someone please help me with inserting this relationship between the nodes. Thanks in advance!
The cartesian product warning will appear whenever you're matching on two or more disconnected patterns. In this case, however, it's fine, because you're looking up both of them by what is likely a unique name, s your result should be one node each.
If each separate part of that pattern returned multiple nodes, then you would have (rows of a) x (rows of b), a cartesian product between the two result sets.
So in this particular case, don't mind the warning.
As for why you're not seeing changes, note that you're reusing variables for different parts of the graph: you're using variable b for both the uppercase version of the officer, and for the :Entity in your WHERE. There is no node that matches to both.
Instead, use different variables for each, and include the :Entity in your match. Also, once you match to nodes and bind them to variables, you can reuse the variable names later in your query without having to repeat its labels or properties.
Try this:
MATCH (a:Officer {name :"Constantin Lutsenko"})-[:SHAREHOLDER_OF]->
(:Entity{id:'284429'}),(b:Officer{name : "CONSTANTIN LUTSENKO"})
CREATE (a)-[:SAME_COMPANY_AS]->(b)
Though I'm not quite sure of what you're trying to do...is an :Officer a company? That relationship type doesn't quite seem right.
I tried the answer by #InverseFalcon and thanks to it, by modifying the property identifier from 'id' to 'name' and using the property for both 'a' and 'b', 4 relationships were created by the following code:
MATCH (a:Officer {name :"Constantin Lutsenko"})-[:SHAREHOLDER_OF]->
(:Entity{name:'KINBOROUGH PORTFOLIO LTD.'}),(b:Officer{name : "CONSTANTIN
LUTSENKO"})-[:SHAREHOLDER_OF]->(:Entity{name:'Chandler Group Holdings Ltd'})
CREATE (a)-[:SAME_NAME_AS]->(b)
Thank you so much #InverseFalcon!

Use of index in Neo4j

Suppose I have 2 types of nodes :Server and :Client.
(Client)-[:CONNECTED_TO]->(Server)
I want to find the Female clients connected to some Server ordered by age.
I did this
Match (s:Server{id:"S1"})<-[:CONNNECTED_TO]-(c{gender:"F"}) return c order by c.age DESC
Doing this, all the Client nodes linked to my Server node are traversed to find the highest age.
Is there a way to index the Client nodes on gender and age properties to avoid the full scan?
You can create an index on :Client(gender), as follows:
CREATE INDEX ON :Client(gender);
However, your particular query will probably benefit more from creating an index on :Server(id), since there are probably a lot of female clients but probably only a single Server with that id. So, you probably want to do this instead:
CREATE INDEX ON :Server(id);
But, even better, if every Server has a unique id property, you should create a uniqueness constraint (which also automatically creates an index for you):
CREATE CONSTRAINT ON (s:Server) ASSERT s.id IS UNIQUE;
Currently, neo4j does not use indexes to perform ordering, but there are some APOC procedures that do support that. However, the procedures do not support returning results in descending order, which is what you want. So, if you really need to use indexing for this purpose, a workaround would be to add an extra minusAge property to your Client nodes that contains the negative value of the age property. If you do this, then first create an index:
CREATE INDEX ON :Client(minusAge);
and then use this query:
MATCH (s:Server{id:"S1"})<-[:CONNNECTED_TO]-(cl:Client {gender:"F"})
CALL apoc.index.orderedRange('Client', 'minusAge', -200, 0, false, -1) YIELD node AS c
RETURN c;
The 3rd and 4th parameters of that procedure are for the minimum and maximum values you want to use for matching (against minusAge). The 5th parameter should be false for your purposes (and is actually currently ignored by the implementation). The last parameter is for the LIMIT value, and -1 means you do not want a limit.
If that is a request you're doing quite frequently, then you might want to write that data out. As you're experiencing, that query can be quite expensive and it won't get better the more clients you get, as in fact, all of the connected nodes have to be retrieved and get a property check/comparison run on them.
As a workaround, you can add another connection to your clients when you modify their age data.
As a suggestion, you can create an Age node and create a MATURE relationship to your oldest clients.
(:Server)<-[:CONNNECTED_TO]-(:Client)-[:MATURE]->(:Age)
You can do this for all the ages, and run queries off the Age nodes (with an indexed/unique age property on the) as needed. If you have 100,000 clients, but only are interested in the top 100 ordered by age, there's no need to get all the clients and order them... It really depends on your use case and client apps.
While this is certainly not a nice pattern, I've seen it work in production and is the only workaround that's been doing well in different production environments that I've seen.
If this answer didn't solve your problem (I'd rather use an age property, too), I hope it gave you at least an idea on what to do next.

Cypher query optimisation - Utilising known properties of nodes

Setup:
Neo4j and Cypher version 2.2.0.
I'm querying Neo4j as an in-memory instance in Eclipse created TestGraphDatabaseFactory().newImpermanentDatabase();.
I'm using this approach as it seems faster than the embedded version and I assume it has the same functionality.
My graph database is randomly generated programmatically with varying numbers of nodes.
Background:
I generate cypher queries automatically. These queries are used to try and identify a single 'target' node. I can limit the possible matches of the queries by using known 'node' properties. I only use a 'name' property in this case. If there is a known name for a node, I can use it to find the node id and use this in the start clause. As well as known names, I also know (for some nodes) if there are names known not to belong to a node. I specify this in the where clause.
The sorts of queries that I am running look like this...
START
nvari = node(5)
MATCH
(target:C5)-[:IN_LOCATION]->(nvara:LOCATION),
(nvara:LOCATION)-[:CONNECTED]->(nvarb:LOCATION),
(nvara:LOCATION)-[:CONNECTED]->(nvarc:LOCATION),
(nvard:LOCATION)-[:CONNECTED]->(nvarc:LOCATION),
(nvard:LOCATION)-[:CONNECTED]->(nvare:LOCATION),
(nvare:LOCATION)-[:CONNECTED]->(nvarf:LOCATION),
(nvarg:LOCATION)-[:CONNECTED]->(nvarf:LOCATION),
(nvarg:LOCATION)-[:CONNECTED]->(nvarh:LOCATION),
(nvari:C4)-[:IN_LOCATION]->(nvarg:LOCATION),
(nvarj:C2)-[:IN_LOCATION]->(nvarg:LOCATION),
(nvare:LOCATION)-[:CONNECTED]->(nvark:LOCATION),
(nvarm:C3)-[:IN_LOCATION]->(nvarg:LOCATION),
WHERE
NOT(nvarj.Name IN ['nf']) AND NOT(nvarm.Name IN ['nb','nj'])
RETURN DISTINCT target
Another way to think about this (if it helps), is that this is an isomorphism testing problem where we have some information about how nodes in a query and target graph correspond to each other based on restrictions on labels.
Question:
With regards to optimisation:
Would it help to include relation variables in the match clause? I took them out because the node variables are sufficient to distinguish between relationships but this might slow it down?
Should I restructure the match clause to have match/where couples including the where clauses from my previous example first? My expectation is that they can limit possible bindings early on. For example...
START
nvari = node(5)
MATCH
(nvarj:C2)-[:IN_LOCATION]->(nvarg:LOCATION)
WHERE NOT(nvarj.Name IN ['nf'])
MATCH
(nvarm:C3)-[:IN_LOCATION]->(nvarg:LOCATION)
WHERE NOT(nvarm.Name IN ['nb','nj'])
MATCH
(target:C5)-[:IN_LOCATION]->(nvara:LOCATION),
(nvara:LOCATION)-[:CONNECTED]->(nvarb:LOCATION),
(nvara:LOCATION)-[:CONNECTED]->(nvarc:LOCATION),
(nvard:LOCATION)-[:CONNECTED]->(nvarc:LOCATION),
(nvard:LOCATION)-[:CONNECTED]->(nvare:LOCATION),
(nvare:LOCATION)-[:CONNECTED]->(nvarf:LOCATION),
(nvarg:LOCATION)-[:CONNECTED]->(nvarf:LOCATION),
(nvarg:LOCATION)-[:CONNECTED]->(nvarh:LOCATION),
(nvare:LOCATION)-[:CONNECTED]->(nvark:LOCATION)
RETURN DISTINCT target
On the side:
(Less important but still an interest) If I make each relationship in a match clause an optional match except for relationships containing the target node, would cypher essentially be finding a maximum common sub-graph between the query and the graph data base with the constraint that the MCS contains the target node?
Thanks a lot in advance! I hope I have made my requirements clear but I appreciate that this is not a typical use-case for Neo4j.
I think querying with node properties is almost always preferable to using relationship properties (if you had a choice), as that opens up the possibility that indexing can help speed up the query.
As an aside, I would avoid using the IN operator if the collection of possible values only has a single element. For example, this snippet: NOT(nvarj.Name IN ['nf']), should be (nvarj.Name <> 'nf'). The current versions of Cypher might not use an index for the IN operator.
Restructuring a query to eliminate undesirable bindings earlier is exactly what you should be doing.
First of all, you would need to keep using MATCH for at least the first relationship in your query (which binds target), or else your result would contain a lot of null rows -- not very useful.
But, thinking clearly about this, if all the other relationships were placed in separate OPTIONAl MATCH clauses, you'd be essentially saying that you want a match even if none of the optional matches succeeded. Therefore, the logical equivalent would be:
MATCH (target:C5)-[:IN_LOCATION]->(nvara:LOCATION)
RETURN DISTINCT target
I don't think this is a useful result.

Assumptions regarding Node ID strings in Neo4j - cypher

In my recent question, Modeling conditional relationships in neo4j v.2 (cypher), the answer has led me to another question regarding my data model and the cypher syntax to represent it. Lets say in my model, there is a node CLT1 that is what I'll call the Source node. CLT1 has relationships to other 286 Target nodes. This is a model of a target node:
CREATE
(Abnormally_high:Label1:Label2:Label3:Label4:Label5:Label6:Label7:Label8:Label9:Label10
{Pro1:'x',Prop2:'y',Prop3:'z'})
Key point: I am assuming the string after the CREATE clause is
The ID of this target node
The ID is significant because its content has domain-specific meaning
and is query-able.
in this case its the phrase ...."Abnormally_high".
I made this assumption based on the movie database example.
CREATE (Keanu:Person {name:'Keanu Reeves', born:1964})
CREATE (Carrie:Person {name:'Carrie-Anne Moss', born:1967})
The first strings after CREATE definitely have domain-specific meaning!
In my earlier post I discuss Problem 2. I find that problem 2 arises because among the 286 target nodes, there are many instances where there was at least one more Target node who shares the identical ID. In this instance, the ID is "Abnormally_high". The other Target nodes may differ in the value of any of Label1 - Label10 or the associated properties.
Apparently, Cypher doesn't like that. In Problem 2, I was discussing the ways to deal with the fact that cypher doesn't like using the same node ID multiple times even though the labels or properties were different.
My problem are my assumptions about the Target node ID.
AM I RIGHT?
I am now thinking that I could instead use this....
CREATE (CLT1_target_1:Label1:Label2:Label3:Label4:Label5:Label6:Label7:Label8:Label9:Label10
{name:'Abnormally_high',Prop2:'y',Prop3:'z'})
If indeed the first string after the CREATE clause is an ID, then all I have to do is put a unique target node identifier.... like CLT1_target_1 and increment up to CLT1_target_286. If I do this, then I can have the name as a property and change whatever label or property I want.
Do I have this right?
You are wrong. In Cypher, a node name (like "Abnormally_high") is just a variable name that exists for the lifetime of the query (and sometimes not even that long). The node name used in a Cypher query is never persisted in any way, and can be any arbitrary string.
Also, in neo4j, the term "ID" has a specific meaning. The neo4j DB will automatically assign a (currently) unique integer ID to each new node. You have no control over the ID value assigned to a node. And when a node is deleted, neo4j can reassign its ID to a new node.
You should read the neo4j manual (available at docs.neo4j.org), especially the section on Cypher, to get a better understanding.

Resources