Cypher query on variable length path with specified end point - neo4j

My graph model holds information on data lineage and how data moves from one column to another through column mappings in our ETL tool. A basic one hop pattern would look like this...
(source:Column)-[:SOURCE_OF_MAPPING]->(map:ColumnMapping)-[:TARGET_OF_MAPPING]->(target:Column)
so
source might be a column called "STAGING_TABLE_1.FULL_NAME",
target might be a column called "STAGING_TABLE_2.FULL_NAME" and
map would be whatever was specified in the select query within the ETL tool's dataflow. Perhaps something like "UPPER(STAGING_TABLE_1.FULL_NAME || STAGING_TABLE_1.TITLE)"
What I need to be able to do is ask: if I look at a specific target column, let's say "DATA_MART_FACT_1.FULL_NAME", from which column does this data originate?
The following is the cypher query I am trying to use but this only pulls back a single hop, i.e. the source and column mapping where the target is "DATA_MART_FACT_1.FULL_NAME".
MATCH (source:Column)-[:SOURCE_OF_MAPPING*]->(c:ColumnMapping)-[:TARGET_OF_MAPPING*]->(target:Column)
WHERE target.name = 'DATA_MART_FACT_1.FULL_NAME'
RETURN source, target, c
I have tried removing the relationship names, and just having an asterisk in the square brackets, but this just kills my neo4j installation (Currently sitting at 5GB memory and 50% CPU usage and hanging for around 10 minutes). There are constraints on all of the unique properties.
I know the data contains what I need because in the neo4j browser I can expand the nodes and follow the path through as I would expect to be able. Can anyone provide me with a cypher query that will allow me to do this? Perhaps my graph model needs a slight refactor in terms of relationship names and directions to allow this to work, which I'm perfectly happy to explore.
Here is some cypher to generate a basic example.
CREATE
(_0:`Column` {`name`:"STAGING_TABLE_1.FULL_NAME"}),
(_1:`Column` {`name`:"STAGING_TABLE_2.FULL_NAME"}),
(_2:`Column` {`name`:"DATA_MART_FACT_1.FULL_NAME"}),
(_3:`ColumnMapping` {`mappingText`:"UPPER(STAGING_TABLE_1.FULL_NAME)"}),
(_4:`ColumnMapping` {`mappingText`:"LOWER(STAGING_TABLE_2.FULL_NAME)"}),
(_0)-[:`SOURCE_OF_MAPPING`]->(_3),
(_3)-[:`MAPS_TO`]->(_1),
(_1)-[:`SOURCE_OF_MAPPING`]->(_4),
(_4)-[:`MAPS_TO`]->(_2)
The query I used then only returns a single hop:
MATCH (source:Column)-[:SOURCE_OF_MAPPING*..10]->(c:ColumnMapping)-[:MAPS_TO*..10]->(target:Column) WHERE target.name = 'DATA_MART_FACT_1.FULL_NAME' RETURN source, target, c
The next query more or less returns what I'm after, but it is missing the relationship between the first two nodes.
MATCH (source:Column)-[:SOURCE_OF_MAPPING|MAPS_TO*..10]->(n)-[:MAPS_TO]->(target:Column)
WHERE target.name = 'DATA_MART_FACT_1.FULL_NAME'
AND (n:Column or n:ColumnMapping)
RETURN *;
The end result that I would like from this is as follows (note, aliases are just included here to illustrate the dataflow, and for my requirements the actual results don't need to be aliased)...
(c1:Column)-[:SOURCE_OF_MAPPING]->(cm1:ColumnMapping)-[:MAPS_TO]->(c2:Column)-[:SOURCE_OF_MAPPING]->(cm2:ColumnMapping)-[:MAPS_TO]->(target:Column)
and in tabular format
source | mapping | target
STAGING_TABLE_1.FULL_NAME | UPPER(STAGING_TABLE_1.FULL_NAME) | STAGING_TABLE_2.FULL_NAME
STAGING_TABLE_2.FULL_NAME | LOWER(STAGING_TABLE_2.FULL_NAME) | DATA_MART_FACT_1.FULL_NAME
Oddly, when I created an interactive example (the site can be flaky and can sometimes take a few refreshes before it works), the table returned one row, as on my local installation, but the visual graph representation showed all of the expected nodes and relationships.
Any and all advice is appreciated. Thanks in advance.
EDIT: I've refactored the direction of the relationship from column mapping to target column to make the flow more natural, as if the data were flowing from source, to column mapping, to target. There was no change in behaviour, though.

You could try breaking your query using WITH, as:
MATCH (t:Column {name:'DATA_MART_FACT_1.FULL_NAME'})-[:TARGET_OF_MAPPING*]->(c)
WITH t, c
MATCH (c)-[:SOURCE_OF_MAPPING*]->(s)
RETURN s,t,c
That can cut down on the Cartesian product situations. I haven't tried in your case, but it's generally a good way to look at queries. Also, trim the fat on criteria--if :TARGET_OF_MAPPING only connects to :ColumnMapping, then you may not need to specify and test for that.

For the example, the following gets from the start to the end:
MATCH (n)-[:SOURCE_OF_MAPPING|MAPS_TO*..4]->(target:Column)
WHERE target.name = 'DATA_MART_FACT_1.FULL_NAME'
AND (n:Column or n:ColumnMapping)
RETURN *;
I suspect that for a real-life example, where there may be many consecutive column mappings, the results may have to be aggregated in some way to avoid killing performance with open-ended variable-length paths.
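The original query returns only one hop because -[:SOURCE_OF_MAPPING*]-> repeats that single relationship type, while the real path alternates between SOURCE_OF_MAPPING and MAPS_TO. A minimal sketch of one way to get one row per hop, using the labels and properties from the sample data above: first collect every column upstream of the target, then match the single mapping hop that feeds each of them. The upper bound of 10 and the variable names are assumptions for illustration, not something from the original model.

```cypher
// Anchor on the column whose lineage we want.
MATCH (final:Column {name: 'DATA_MART_FACT_1.FULL_NAME'})
// Collect every column that feeds it, including the target itself (0 hops).
MATCH (tgt:Column)-[:SOURCE_OF_MAPPING|MAPS_TO*0..10]->(final)
WITH DISTINCT tgt
// For each of those columns, return the one-hop mapping that produces it.
MATCH (src:Column)-[:SOURCE_OF_MAPPING]->(m:ColumnMapping)-[:MAPS_TO]->(tgt)
RETURN src.name AS source, m.mappingText AS mapping, tgt.name AS target
```

On the sample data this should give the two rows from the tabular format in the question, in no guaranteed order. Anchoring on the target first and keeping an upper bound on the path length are the two things that keep the expansion from running away on a larger graph.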

Related

How do you perform a recursive query in cypher where there is a conditional within the path relationship?

I am attempting to set up a new graph database to contain records of products and the dependencies between their versioned components. Each product can have many components, and each component is made up of multiple versions. Each version can depend on zero or more versions of any other component. I want to be able to query this database to pick any version of a component and determine which other versioned components it depends on, or which depend on it.
The data structure I have attempted in my examples is not fixed yet, so if a completely different structure is more suitable I'm open to changing it. I originally considered setting the DEPENDS_ON relationship directly between members. However, new members will be added over time, and if a new member falls within the version_min and version_max range of an existing record's dependency range, I would then need to go back, identify all affected records and update all of them, which doesn't feel like it would scale over time. This is what led to the idea of having a member depend on a component, with the version limits defined in the relationship parameters.
I have put together a very simple example of 3 products (sample data at the end), with a single type of component and 1 version of each in all cases except one. I've then added only two dependencies into this example, 'a' depends on a range of 'b' versions, and one of the 'b' versions depends on a version of 'c'.
I would like to be able to perform a query to say "give me all downstream members which member prod_a_comp_1_v_1 depends on". Similarly I would like to do this in reverse too, which I imagine is achieved by just reversing some of the relationship parameters.
So far I've achieved this for a single hop (list b versions which a depends on), shown here:
MATCH
p=(a:member{name:'prod_a_comp_1_v_1'})-[d:DEPENDS_ON]->(c:component)<-[v:VERSION_OF]-(b:member) WHERE b.version >= d.version_min AND b.version <= d.version_max
RETURN p
But I don't know how to get it to recursively perform this query on the results of this first match. I investigated variable length/depths, but because there is a conditional parameter in the relationship in the variable depth (DEPENDS_ON), I could not get this to work.
From the example data if querying all downstream dependencies of prod_a_comp_1_v_1 it should return: [prod_b_comp_1_v_2, prod_b_comp_1_v_3, prod_c_comp_1_v_1].
e.g. this figure:
Currently my thought is to use the above query and perform the repeated call on the database based on the results from the client end (capturing circular loops etc.), but that seems very undesirable.
Sample data:
CREATE
(prod_a:product {name:'prod_a'}),
(prod_a_comp_1:component {name: 'prod_a_comp_1', type:'comp_1'}),
(prod_a_comp_1)-[:COMPONENT_OF {type:'comp_1'}]->(prod_a),
(prod_a_comp_1_v_1:member {name:'prod_a_comp_1_v_1', type:'comp_1', version:1}),
(prod_a_comp_1_v_1)-[:VERSION_OF {version:1}]->(prod_a_comp_1)
CREATE
(prod_b:product {name:'prod_b'}),
(prod_b_comp_1:component {name: 'prod_b_comp_1', type:'comp_1'}),
(prod_b_comp_1)-[:COMPONENT_OF {type:'comp_1'}]->(prod_b),
(prod_b_comp_1_v_1:member {name:'prod_b_comp_1_v_1', type:'comp_1', version:1}),
(prod_b_comp_1_v_2:member {name:'prod_b_comp_1_v_2', type:'comp_1', version:2}),
(prod_b_comp_1_v_3:member {name:'prod_b_comp_1_v_3', type:'comp_1', version:3}),
(prod_b_comp_1_v_1)-[:VERSION_OF {version:1}]->(prod_b_comp_1),
(prod_b_comp_1_v_2)-[:VERSION_OF {version:2}]->(prod_b_comp_1),
(prod_b_comp_1_v_3)-[:VERSION_OF {version:3}]->(prod_b_comp_1)
CREATE
(prod_c:product {name:'prod_c'}),
(prod_c_comp_1:component {name: 'prod_c_comp_1', type:'comp_1'}),
(prod_c_comp_1)-[:COMPONENT_OF {type:'comp_1'}]->(prod_c),
(prod_c_comp_1_v_1:member {name:'prod_c_comp_1_v_1', type:'comp_1', version:1}),
(prod_c_comp_1_v_1)-[:VERSION_OF {version:1}]->(prod_c_comp_1)
CREATE
(prod_a_comp_1_v_1)-[:DEPENDS_ON {version_min:2, version_max:3}]->(prod_b_comp_1),
(prod_b_comp_1_v_3)-[:DEPENDS_ON {version_min:1, version_max:100}]->(prod_c_comp_1)
Figure showing full sample data set:
Apologies if I have misunderstood your question, but I believe this may be possible with the APOC Expand Paths function: https://neo4j.com/docs/apoc/current/graph-querying/expand-paths/
Example Cypher for your graph:
MATCH (a:member{name:'prod_a_comp_1_v_1'})
CALL apoc.path.expand(a, "DEPENDS_ON>|<VERSION_OF", null, 1, -1)
YIELD path
RETURN path, length(path) AS hops
ORDER BY hops;
Example Results:
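One thing to check with the expansion above: the relationship filter follows every VERSION_OF relationship, so it does not apply the version_min and version_max range stored on DEPENDS_ON. If you are on a Neo4j version that supports quantified path patterns (5.9 or later, as far as I know), a sketch like the following keeps the range check inside each repeated hop. It reuses the labels and properties from the sample data; the variable names are arbitrary.

```cypher
// Start from the member whose downstream dependencies we want.
MATCH (a:member {name: 'prod_a_comp_1_v_1'})
      // One "hop" = member -> component it depends on -> a version of that
      // component inside the allowed range. The + repeats the hop 1..n times.
      ((x:member)-[d:DEPENDS_ON]->(:component)<-[:VERSION_OF]-(y:member)
        WHERE d.version_min <= y.version <= d.version_max)+
      (dep:member)
RETURN DISTINCT dep.name AS dependency
```

Because the condition is evaluated per hop, a member outside the range is never used as the starting point for the next hop. For the reverse direction (what depends on a given member), the same idea should work with the fixed node placed at the right-hand end of the pattern instead.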

Cypher return multiple hops through pattern of relationships and nodes

I'm making a proof of concept access control system with neo4j at work, and I need some help with Cypher.
The data model is as follows:
(:User|Business)-[:can]->(:Permission)<-[:allows]-(:Business)
Now I want to get a path from a User or a Business to all the Business nodes that you can reach through the
-[:can]->(:Permission)<-[:allows]-
pattern. I have managed to write a MATCH that gets me halfway there:
MATCH
path =
(:User {userId: 'e96cca53-475c-4534-9fe1-06671909fa93'})-[:can|allows*]-(b:Business)
but this doesn't have any directions, and I can't figure out how to include the directions without reducing the returned matches to only the direct matches (i.e. it doesn't continue after the first hit on a :Business node).
So what I'm wondering is:
Is there a way to match multiple of these hops in one query?
Should I model this entirely differently?
Am I on the wrong path completely, and should the query be completely rewritten?
Currently the syntax of variable-length expansions doesn't allow fine control for separate directions for certain types. There are improvements in the pipeline around this, but for the moment Cypher alone won't get you what you want.
We can use APOC Procedures for this, as fine control of the direction of types in the expansion, and sequences of relationships, are supported in the path expander procs.
First, though, you'll need to figure out how to address your user-or-business match. You can either add a common label to these nodes, so that you can MATCH either type by property, or use a subquery with two UNIONed queries, one for :Business nodes and the other for :User nodes. That way you can still take advantage of an index on either and get the possible results in a single variable.
Once you've got that, you can use apoc.path.expandConfig(), passing some options to get what you want:
// assume you've matched to your `start` node already
CALL apoc.path.expandConfig(start, {relationshipFilter:'can>|<allows', labelFilter:'>Business'}) YIELD path
RETURN path
This one doesn't use sequences, but it does restrict the direction of expansion per relationship type. We are also setting the labelFilter such that :Business nodes are the end node of the path and not nodes of any other label.
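As a rough sketch of how the two pieces fit together, the UNIONed subquery can bind the start node and feed it straight into the path expander. This assumes a Neo4j version with CALL { ... } subqueries (4.0 or later) and APOC installed. The businessId property is hypothetical, since the question only shows userId; substitute whatever key your :Business nodes actually carry.

```cypher
// Look up the start node under either label, each via its own index.
CALL {
  MATCH (u:User {userId: $id})
  RETURN u AS start
  UNION
  MATCH (b:Business {businessId: $id})  // businessId is an assumed property
  RETURN b AS start
}
// Expand along can (outgoing) and allows (incoming), ending only on :Business.
CALL apoc.path.expandConfig(start, {
  relationshipFilter: 'can>|<allows',
  labelFilter: '>Business'
}) YIELD path
RETURN path
```

Both branches of the UNION return the same column name, which is what lets the rest of the query treat the result as a single start variable.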
You can specify the path as follows:
MATCH path = (:User {userId: $id})-[:can]->(:Permission)
<-[:allows]-(:Business)
RETURN path
This should return the results you're after.
I see a good solution has been provided via path expanding APOC procedures.
But I'll focus on your point #2: "Should I model this entirely differently?"
Well, not entirely but I think yes.
The really liberating part of working with Neo4j is that you can change the road you are driving over as easily as you can change your driving strategy: model vs query. And since you are at an early stage in your project, you can experiment with different models. There's a good opportunity to make just a semantic change to make an 'end run' around the problem.
The semantics of a relationship in Neo4j are expressed through the mandatory TYPE you assign to the relationship, combined with the direction you choose to point the mandatory arrow.
The trick you solved with APOC was how to traverse a path of relationships that alternate between pointing forward and backward along the query's path. But before reaching for a power tool, why not just reverse the direction of one of your relationship types? You can change the model for allows from
<-[:allows]-
to
-[:is_allowed_by]->
and that buys you a lot. Now the directions of both relationships are the same and you can combine both relationships into a single relationship in the match pattern. And the path traversal can be expressed like this, short & sweet:
(u:User)-[:can|is_allowed_by*]->(b:Business)
That will literally go to all lengths to find every user-to-business path, branching included.
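Cypher has no statement that flips a relationship in place, so the remodel amounts to creating the reversed relationship and deleting the old one. A minimal sketch, assuming the :Business and :Permission labels from the question's model and that allows carries no properties you need to copy across:

```cypher
// Replace each (:Business)-[:allows]->(:Permission)
// with (:Permission)-[:is_allowed_by]->(:Business).
MATCH (b:Business)-[old:allows]->(p:Permission)
MERGE (p)-[:is_allowed_by]->(b)
DELETE old
```

After that, a bounded form of the traversal would look something like MATCH path = (:User {userId: $id})-[:can|is_allowed_by*1..10]->(:Business) RETURN path, where the upper bound of 10 is an arbitrary guard rather than anything the model requires.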

Adding relationship to existing nodes with Cypher doesn't work

I am working on Panama dataset using Neo4J graph database 1.1.5 web version. I identified Ion Sturza, former Prime Minister of Moldova on the database and want to make a map of his related network. I used following code to query using Cypher (creating a variable 'IonSturza'):
MATCH (IonSturza {name: "Ion Sturza"}) RETURN IonSturza
I found that the entity 'CONSTANTIN LUTSENKO' is linked separately to entities like 'Quade..' and 'Kinbo...' under a lowercase version of the name, as in this picture. I therefore want to create a relationship 'SAME_COMPANY_AS' between the all-caps version and the mixed-case version. I tried the following code, based on this answer by @StefanArmbruster:
MATCH (a:Officer {name :"Constantin Lutsenko"}),(b:Officer{name :
"CONSTANTIN LUTSENKO"})
where (a:Officer{name :"Constantin Lutsenko"})-[:SHAREHOLDER_OF]->
(b:Entity{id:'284429'})
CREATE (a)-[:SAME_COMPANY_AS]->(b)
Instead of indexing, I used the 'where' statement to specify the uncapped version which is linked only to the entity bearing id '284429'.
My code however shows the cartesian product error message:
This query builds a cartesian product between disconnected patterns. If a part of a query contains multiple disconnected patterns, this will build a cartesian product between all those parts. This may produce a large amount of data and slow down query processing. While occasionally intended, it may often be possible to reformulate the query that avoids the use of this cross product, perhaps by adding a relationship between the different parts or by using OPTIONAL MATCH (identifier is: (b))
Also when I execute, there are no changes, no rows!! What am I missing here? Can someone please help me with inserting this relationship between the nodes. Thanks in advance!
The cartesian product warning will appear whenever you're matching on two or more disconnected patterns. In this case, however, it's fine, because you're looking up both of them by what is likely a unique name, so your result should be one node each.
If each separate part of that pattern returned multiple nodes, then you would have (rows of a) x (rows of b), a cartesian product between the two result sets.
So in this particular case, don't mind the warning.
As for why you're not seeing changes, note that you're reusing variables for different parts of the graph: you're using variable b for both the uppercase version of the officer, and for the :Entity in your WHERE. There is no node that matches to both.
Instead, use different variables for each, and include the :Entity in your match. Also, once you match to nodes and bind them to variables, you can reuse the variable names later in your query without having to repeat its labels or properties.
Try this:
MATCH (a:Officer {name :"Constantin Lutsenko"})-[:SHAREHOLDER_OF]->
(:Entity{id:'284429'}),(b:Officer{name : "CONSTANTIN LUTSENKO"})
CREATE (a)-[:SAME_COMPANY_AS]->(b)
Though I'm not quite sure of what you're trying to do...is an :Officer a company? That relationship type doesn't quite seem right.
I tried the answer by @InverseFalcon. Thanks to it, by changing the property identifier from 'id' to 'name' and using that property for both 'a' and 'b', 4 relationships were created by the following code:
MATCH (a:Officer {name :"Constantin Lutsenko"})-[:SHAREHOLDER_OF]->
(:Entity{name:'KINBOROUGH PORTFOLIO LTD.'}),(b:Officer{name : "CONSTANTIN LUTSENKO"})
-[:SHAREHOLDER_OF]->(:Entity{name:'Chandler Group Holdings Ltd'})
CREATE (a)-[:SAME_NAME_AS]->(b)
Thank you so much @InverseFalcon!

Cypher query optimisation - Utilising known properties of nodes

Setup:
Neo4j and Cypher version 2.2.0.
I'm querying Neo4j as an in-memory instance in Eclipse, created with new TestGraphDatabaseFactory().newImpermanentDatabase().
I'm using this approach as it seems faster than the embedded version and I assume it has the same functionality.
My graph database is randomly generated programmatically with varying numbers of nodes.
Background:
I generate cypher queries automatically. These queries are used to try and identify a single 'target' node. I can limit the possible matches of the queries by using known 'node' properties. I only use a 'name' property in this case. If there is a known name for a node, I can use it to find the node id and use this in the start clause. As well as known names, I also know (for some nodes) if there are names known not to belong to a node. I specify this in the where clause.
The sorts of queries that I am running look like this...
START
nvari = node(5)
MATCH
(target:C5)-[:IN_LOCATION]->(nvara:LOCATION),
(nvara:LOCATION)-[:CONNECTED]->(nvarb:LOCATION),
(nvara:LOCATION)-[:CONNECTED]->(nvarc:LOCATION),
(nvard:LOCATION)-[:CONNECTED]->(nvarc:LOCATION),
(nvard:LOCATION)-[:CONNECTED]->(nvare:LOCATION),
(nvare:LOCATION)-[:CONNECTED]->(nvarf:LOCATION),
(nvarg:LOCATION)-[:CONNECTED]->(nvarf:LOCATION),
(nvarg:LOCATION)-[:CONNECTED]->(nvarh:LOCATION),
(nvari:C4)-[:IN_LOCATION]->(nvarg:LOCATION),
(nvarj:C2)-[:IN_LOCATION]->(nvarg:LOCATION),
(nvare:LOCATION)-[:CONNECTED]->(nvark:LOCATION),
(nvarm:C3)-[:IN_LOCATION]->(nvarg:LOCATION)
WHERE
NOT(nvarj.Name IN ['nf']) AND NOT(nvarm.Name IN ['nb','nj'])
RETURN DISTINCT target
Another way to think about this (if it helps), is that this is an isomorphism testing problem where we have some information about how nodes in a query and target graph correspond to each other based on restrictions on labels.
Question:
With regards to optimisation:
Would it help to include relation variables in the match clause? I took them out because the node variables are sufficient to distinguish between relationships but this might slow it down?
Should I restructure the match clause to have match/where couples including the where clauses from my previous example first? My expectation is that they can limit possible bindings early on. For example...
START
nvari = node(5)
MATCH
(nvarj:C2)-[:IN_LOCATION]->(nvarg:LOCATION)
WHERE NOT(nvarj.Name IN ['nf'])
MATCH
(nvarm:C3)-[:IN_LOCATION]->(nvarg:LOCATION)
WHERE NOT(nvarm.Name IN ['nb','nj'])
MATCH
(target:C5)-[:IN_LOCATION]->(nvara:LOCATION),
(nvara:LOCATION)-[:CONNECTED]->(nvarb:LOCATION),
(nvara:LOCATION)-[:CONNECTED]->(nvarc:LOCATION),
(nvard:LOCATION)-[:CONNECTED]->(nvarc:LOCATION),
(nvard:LOCATION)-[:CONNECTED]->(nvare:LOCATION),
(nvare:LOCATION)-[:CONNECTED]->(nvarf:LOCATION),
(nvarg:LOCATION)-[:CONNECTED]->(nvarf:LOCATION),
(nvarg:LOCATION)-[:CONNECTED]->(nvarh:LOCATION),
(nvare:LOCATION)-[:CONNECTED]->(nvark:LOCATION)
RETURN DISTINCT target
On the side:
(Less important but still an interest) If I make each relationship in a match clause an optional match except for relationships containing the target node, would cypher essentially be finding a maximum common sub-graph between the query and the graph data base with the constraint that the MCS contains the target node?
Thanks a lot in advance! I hope I have made my requirements clear but I appreciate that this is not a typical use-case for Neo4j.
I think querying with node properties is almost always preferable to using relationship properties (if you had a choice), as that opens up the possibility that indexing can help speed up the query.
As an aside, I would avoid using the IN operator if the collection of possible values only has a single element. For example, this snippet: NOT(nvarj.Name IN ['nf']), should be (nvarj.Name <> 'nf'). The current versions of Cypher might not use an index for the IN operator.
Restructuring a query to eliminate undesirable bindings earlier is exactly what you should be doing.
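As a small sketch of both points, here is just the constrained part of the question's query, with the single-element IN replaced by <> and the filtered match placed first. PROFILE is prefixed so you can compare the plan and row counts against the original form. The labels and the Name property come from the question; whether the planner actually benefits depends on your data and on which indexes exist.

```cypher
PROFILE
// Filter the constrained nodes first so fewer bindings reach later matches.
MATCH (nvarj:C2)-[:IN_LOCATION]->(nvarg:LOCATION)
WHERE nvarj.Name <> 'nf'
WITH DISTINCT nvarg
MATCH (nvarm:C3)-[:IN_LOCATION]->(nvarg)
WHERE NOT nvarm.Name IN ['nb', 'nj']
RETURN DISTINCT nvarg
```

The remaining LOCATION-to-LOCATION patterns from the question would then follow in a later MATCH that reuses nvarg.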
First of all, you would need to keep using MATCH for at least the first relationship in your query (which binds target), or else your result would contain a lot of null rows -- not very useful.
But, thinking clearly about this, if all the other relationships were placed in separate OPTIONAL MATCH clauses, you'd essentially be saying that you want a match even if none of the optional matches succeeded. Therefore, the logical equivalent would be:
MATCH (target:C5)-[:IN_LOCATION]->(nvara:LOCATION)
RETURN DISTINCT target
I don't think this is a useful result.

Create Unique Relationship is taking much amount of time

START names = node(*),
target=node:node_auto_index(target_name="TARGET_1")
MATCH names
WHERE NOT names-[:contains]->()
AND HAS (names.age)
AND (names.qualification =~ ".*(?i)B.TECH.*$"
OR names.qualification =~ ".*(?i)B.E.*$")
CREATE UNIQUE (names)-[r:contains{type:"declared"}]->(target)
RETURN names.name,names,names.qualification
I have nearly 180,000 name nodes. I have run the above query more than 100 times, changing the target each time, to create the unique relationships, and it is taking far too long. How can I resolve this?
I build the query in Java and run it in a loop. I am using Neo4j 2.0.0.5 and Java 1.7.
I edited your cypher query because I think I understand it, but I can barely read the rest of your question. If you edit it with white spaces and punctuation it might be easier to understand what you are trying to do. Until then, here are some thoughts about your query being slow.
You bind all the nodes in the graph, that's typically pretty slow.
You bind all the nodes in the graph twice. First you bind universally in your start clause: names=node(*), and then you bind universally in your match clause: MATCH names, and only then you limit your pattern. I don't quite know what the Cypher engine makes of this (possibly it gets a migraine and goes off to make a pot of coffee). It's unnecessary, you can at least drop the names=node(*) from your start clause. Or drop the match clause, I suppose that could work too, since you don't really do anything there, and you will still need a start clause for as long as you use legacy indexing.
You are using Neo4j 2.x, but you use legacy indexing instead of labels, at least in this query. Without knowing your data and model it's hard to know what the difference would be for performance, but it would certainly make it much easier to write (and read) your queries. So, that's a different kind of slow. It's likely that if you had labels and label indices, the query performance would improve.
So, first try removing one of the universal bindings of nodes, then use the 2.x schema tools to structure your data. You should be able to write queries like
MATCH (target:Target)
WHERE target.target_name="TARGET_1"
WITH target
MATCH (names:Name)
WHERE NOT (names)-[:contains]->()
AND HAS (names.age)
AND (names.qualification =~ ".*(?i)B.TECH.*$"
OR names.qualification =~ ".*(?i)B.E.*$")
CREATE UNIQUE (names)-[r:contains{type:"declared"}]->(target)
RETURN names.name,names,names.qualification
I have no idea if such a query would be fast on your data, however. If you put the "Name" label on all your nodes, then MATCH names:Name will still bind all nodes in the database, so it'll probably still be slow.
P.S. The relationships you create have a TYPE called contains, and you give them a property called type with value declared. Maybe you have a good reason, but that's potentially very confusing.
Edit:
Reading through your question and my answer again I no longer think that I understand even your cypher query. (Why are you returning both the bound nodes and properties of those nodes?) Please consider posting sample data on console.neo4j.org and explain in more detail what your model looks like and what you are trying to do. Let me know if my answer meets your question at all or I'll consider removing it.
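For anyone reading this on a current Neo4j release, where START, HAS and CREATE UNIQUE are no longer available, a rough modern equivalent of the labeled query might look like the sketch below. The :Target and :Name labels are the hypothetical ones used above, not something known from the asker's data. The regex is also an assumption: it escapes the dots, so it is slightly stricter than the original patterns.

```cypher
// Look up the target once; an index on :Target(target_name) helps here.
MATCH (target:Target {target_name: 'TARGET_1'})
MATCH (n:Name)
WHERE n.age IS NOT NULL
  AND NOT (n)-[:contains]->()
  // Case-insensitive match on "B.TECH" or "B.E" anywhere in the string.
  AND n.qualification =~ '(?i).*B\\.(TECH|E).*'
// MERGE creates the relationship only if it does not already exist.
MERGE (n)-[:contains {type: 'declared'}]->(target)
RETURN n.name, n.qualification
```

The regex filter still has to touch every :Name node, so the main saving is in the indexed target lookup and in not binding all nodes twice.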

Resources