Create on NOT MATCH command for Neo4j's CQL? - neo4j

I have a non-unique node (:Neighborhood) that uniquely appears [:IN] a (:City) node. I would like to create a new neighborhood node and establish its relationship ONLY if that neighborhood node does not exist in that city. There can be multiple neighborhoods that have the same name, but each neighborhood must appear only once in its city.
Following the advice from Gil's answer here: Return node if relationship is not present, how can I do something like:
MATCH a WHERE NOT (a:Neighborhood {name : line.Neighborhood})-[r:IN]->(c:City {name : line.City})
ON MATCH SET (a)-[r]-(c)
So then it would only create a new neighborhood node if it doesn't already exist in the city.
**UPDATE:** I upgraded and profiled it, and I still can't take advantage of any optimizations...
PROFILE LOAD CSV WITH HEADERS FROM "file://THEFILE" as line
WITH line LIMIT 0
MATCH (c:City { name : line.City})
MERGE (n:Neighborhood {name : toInt(line.Neighborhood)})-[:IN]->(c)
;
+--------------+------+--------+---------------------------+------------------------------+
| Operator | Rows | DbHits | Identifiers | Other |
+--------------+------+--------+---------------------------+------------------------------+
| EmptyResult | 0 | 0 | | |
| UpdateGraph | 5 | 16 | anon[340], b, neighborhood, line | MergePattern |
| SchemaIndex | 5 | 10 | b, line | line.City; :City(name) |
| ColumnFilter | 5 | 0 | line | keep columns line |
| Filter | 5 | 0 | anon[216], line | anon[216] |
| Extract | 5 | 0 | anon[216], line | anon[216] |
| Slice | 5 | 0 | line | { AUTOINT0} |
| LoadCSV | 5 | 0 | line | |
+--------------+------+--------+---------------------------+------------------------------+

I think you could simply use MERGE for this:
MATCH (c:City {name: line.City})
MERGE (c)<-[:IN]-(a:Neighborhood {name : line.Neighborhood})
If you haven't already imported all of the cities, you can create those with MERGE:
MERGE (c:City {name: line.City})
MERGE (c)<-[:IN]-(a:Neighborhood {name : line.Neighborhood})
But beware of the Eager operator:
http://www.markhneedham.com/blog/2014/10/23/neo4j-cypher-avoiding-the-eager/
In short: You should run your LOAD CSV (I assume that's what you're doing here) twice, once to load the cities and once to load the neighborhoods.
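As a rough sketch of the two-pass approach (reusing the file://THEFILE placeholder and the City and Neighborhood headers from the question; adjust both to your own file), the import might look like this:

```cypher
// Pass 1: make sure every city exists exactly once.
LOAD CSV WITH HEADERS FROM "file://THEFILE" AS line
MERGE (:City {name: line.City});

// Pass 2: attach each neighborhood to its city.
// MERGE only creates the neighborhood and the :IN relationship when
// no neighborhood with that name is already :IN this city.
LOAD CSV WITH HEADERS FROM "file://THEFILE" AS line
MATCH (c:City {name: line.City})
MERGE (c)<-[:IN]-(:Neighborhood {name: line.Neighborhood});
```

Keeping each pass down to a single kind of write is what should let the planner avoid the Eager operator. PROFILE each statement to confirm this on your version.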

Related

how to create relationship using cypher

I have been learning neo4j/cypher for the last week. I have finally been able to upload two csv files and create a relationship, "captured". However, I am not fully confident in my understanding of the code, as I was following the tutorial on the neo4j site. Could you please help me confirm that what I did is correct?
I have two csv files, a "cap.csv" and a "survey.csv". The survey table contains data on each unique survey conducted at the survey sites. The cap table contains data on each unique organism captured. In the cap table I have a foreign key, "survey_id", which in the Postgres db you would join to the primary key in the survey table.
I want to create a relationship, "captured", showing each unique organism that was captured based on the "date" column in the survey table.
Survey table
| lake_id | date     | survey_id | duration |
| ------- | -------- | --------- | -------- |
| 1       | 05/27/14 | 1         | 7        |
| 2       | 03/28/13 | 2         | 10       |
| 2       | 06/29/19 | 3         | 23       |
| 3       | 08/21/21 | 4         | 54       |
| 1       | 07/23/18 | 5         | 23       |
| 2       | 07/22/23 | 6         | 12       |
Capture table
| cap_id | species | capture_life_stage | weight | survey_id |
| ------ | ------- | ------------------ | ------ | --------- |
| 1      | a       | adult              | 10     | 1         |
| 2      | a       | adult              | 10     | 2         |
| 3      | b       | juv                | 23     | 3         |
| 4      | a       | adult              | 54     | 4         |
| 5      | b       | juv                | 23     | 5         |
| 6      | c       | juv                | 12     | 6         |
LOAD CSV WITH HEADERS FROM 'file:///cap.csv' AS row
WITH
row.id as id,
row.species as species,
row.capture_life_stage as capture_life_stage,
toInteger(row.weight) as weight,
row.survey_id as survey_id
MATCH (c:cap {id: id})
MERGE (s) - [rel:captured {survey_id: survey_id}] ->(c)
return count(rel)
I am struggling to understand the code I wrote above. I followed the neo4j tutorial exactly but used my data (https://neo4j.com/developer/desktop-csv-import/).
I am fairly confident from data checks, but did the above code create the "captured" relationship showing each unique organism captured on that unique survey date? Based on the visual I can see I believe it did but I don't fully understand each step in the code.
What is the purpose of the MATCH (c:cap {id: id}) in the code?
The code below
MATCH (c:cap {id: id})
is the same as
MATCH (c:cap)
WHERE c.id = id
It is a shorter way of finding the cap node by its id. The MERGE that follows then creates the relationship between the survey node and that cap node.
Question: s is not defined in your query. Where is it?
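To illustrate why the missing s matters: in a MERGE pattern, a variable that was not bound earlier can match any node, and if nothing matches, MERGE creates a new node with no label or properties. A hedged sketch that binds both ends first might look like the following. The survey label and its survey_id property are assumptions about how the survey.csv rows were loaded; they are not shown in the question.

```cypher
LOAD CSV WITH HEADERS FROM 'file:///cap.csv' AS row
// Bind the capture node, as in the original query.
MATCH (c:cap {id: row.id})
// Bind the survey node too (label and property name are assumptions).
MATCH (s:survey {survey_id: row.survey_id})
// Both ends are now bound, so MERGE only adds the missing relationship.
MERGE (s)-[rel:captured]->(c)
RETURN count(rel)
```

LOAD CSV reads every field as a string, so the values compared here need the same type that was used when the nodes were created (for example, wrap them in toInteger() if the ids were stored as integers).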

Neo4j WHERE causes duplicates?

I'm running Neo4j Desktop v1.4.1; the db is 4.2.1 Enterprise.
I have a simple graph of placements, campaigns and a placement to campaign "contains" relationship. This is a fresh dataset, every node is unique. Some placements "contain" thousands of campaigns, so I want to filter the returned campaigns by an inclusion list of campaign ids.
When I return all the matched nodes it works:
neo4j@neo4j> MATCH (:Placement {id: 5})-[:CONTAINS]->(c:Campaign)
WHERE c.id IN [400,263,150470,25810,37578]
RETURN *;
+--------------------------+
| c |
+--------------------------+
| (:Campaign {id: 37578}) |
| (:Campaign {id: 263}) |
| (:Campaign {id: 25810}) |
| (:Campaign {id: 150470}) |
+--------------------------+
When I request just the campaign id, I get duplicates:
neo4j@neo4j> MATCH (:Placement {id: 5})-[:CONTAINS]->(c:Campaign)
WHERE c.id IN [400,263,150470,25810,37578]
RETURN c.id;
+--------+
| c.id |
+--------+
| 150470 |
| 150470 |
| 150470 |
| 150470 |
+--------+
There is only one CONTAINS relationship between placement 5 and campaign 150470:
neo4j@neo4j> MATCH (:Placement {id: 5})-[rel:CONTAINS]->(:Campaign {id:150470})
RETURN count(rel);
+------------+
| count(rel) |
+------------+
| 1 |
+------------+
EXPLAIN returns the following query plan, the cache[c.id] seems like it might be the culprit?
+---------------------------+------------------------------------------------------------------------------------------------------+----------------+---------------------+
| Operator | Details | Estimated Rows | Other |
+---------------------------+------------------------------------------------------------------------------------------------------+----------------+---------------------+
| +ProduceResults@neo4j | `c.id` | 4 | Fused in Pipeline 1 |
| | +------------------------------------------------------------------------------------------------------+----------------+---------------------+
| +Projection@neo4j | cache[c.id] AS `c.id` | 4 | Fused in Pipeline 1 |
| | +------------------------------------------------------------------------------------------------------+----------------+---------------------+
| +Expand(Into)@neo4j | (anon_7)-[anon_27:CONTAINS]->(c) | 4 | Fused in Pipeline 1 |
| | +------------------------------------------------------------------------------------------------------+----------------+---------------------+
| +MultiNodeIndexSeek@neo4j | UNIQUE anon_7:Placement(id) WHERE id = $autoint_0, cache[c.id], UNIQUE c:Campaign(id) WHERE id IN $a | 25 | In Pipeline 0 |
| | utolist_1, cache[c.id] | | |
+---------------------------+------------------------------------------------------------------------------------------------------+----------------+---------------------+
Edit: if I prepend the query with CYPHER runtime=SLOTTED I get the expected output:
+--------+
| c.id |
+--------+
| 37578 |
| 263 |
| 25810 |
| 150470 |
+--------+
If I omit the WHERE clause I get unique campaign ids (but too many). I feel like I'm missing something obvious, but I've read the neo4j docs and I'm not getting it. Thanks!
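For reference, the workaround from the edit above could be written as follows. This only selects a different runtime for this one query; it is a sketch of the workaround, not an explanation of the duplicates.

```cypher
// Run this single query on the slotted runtime instead of the default.
CYPHER runtime=SLOTTED
MATCH (:Placement {id: 5})-[:CONTAINS]->(c:Campaign)
WHERE c.id IN [400,263,150470,25810,37578]
RETURN c.id;
```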

Why does my cypher query take 10 times longer when I run it with count()?

I start with the following query:
PROFILE
MATCH Base = (SBase:Snapshot {timestamp:1454983481.304583})-[:contains]->()
MATCH Prime = (:Snapshot {timestamp:1454983521.642284})-[PContains:contains]->(SPrimePackage)
WHERE NOT (SBase)-[:contains]->(SPrimePackage)
RETURN PContains
LIMIT 10
I get "5834 total db hits in 119 ms". The graph correctly shows 9 nodes, and 8 edges connecting them. Then I run an almost-identical query, except that I instead return count(distinct()):
PROFILE
MATCH Base = (SBase:Snapshot {timestamp:1454983481.304583})-[:contains]->()
MATCH Prime = (:Snapshot {timestamp:1454983521.642284})-[PContains:contains]->(SPrimePackage)
WHERE NOT (SBase)-[:contains]->(SPrimePackage)
RETURN count(distinct(SPrimePackage))
LIMIT 10
This gives "1382270 total db hits in 1771 ms". The result is correct: 8. However, why is count(distinct()) so much slower and more expensive? Should I be doing this some other way?
I'm running Neo4j 2.3.1
EDIT 1
To ensure I'm comparing apples to apples, and to highlight the question, here is a similar pair of queries and results:
MATCH Base = (SBase:Snapshot {timestamp:1454983481.304583})-[:contains]->()
MATCH Prime = (:Snapshot {timestamp:1454983521.642284})-[PContains:contains]->(SPrimePackage)
WHERE NOT (SBase)-[:contains]->(SPrimePackage)
RETURN SPrimePackage
LIMIT 10
Note it's returning "SPrimePackage" instead of "PContains" in the original. The result is "5834 total db hits in 740 ms".
Here is that exact same query with "count()":
MATCH Base = (SBase:Snapshot {timestamp:1454983481.304583})-[:contains]->()
MATCH Prime = (:Snapshot {timestamp:1454983521.642284})-[PContains:contains]->(SPrimePackage)
WHERE NOT (SBase)-[:contains]->(SPrimePackage)
RETURN count(SPrimePackage)
LIMIT 10
The result: "1382270 total db hits in 2731 ms". Note the only difference is the "count()". Intuitively, I would expect "count()" to add a single tallying step, but clearly it's doing much more than that. Why is "count()" triggering all of this extra work?
[UPDATED]
If you compared the PROFILE output of your 2 (edited) queries, you'd probably see that the only significant difference was the existence of an EagerAggregation operation in the COUNT() version of the query. Aggregation functions use EagerAggregation to collect in memory all the data being aggregated before actually performing the aggregation function (in this case, COUNT()). That requires additional work that is not needed when you do not use the aggregation function.
The following query still uses COUNT() in order to get the count, but greatly reduces the data that has to be aggregated, thus reducing the amount of work that needs to be done in the EagerAggregation step:
PROFILE
MATCH (SBase:Snapshot { timestamp:1454983481.304583 })
USING INDEX SBase:Snapshot(timestamp)
WHERE (SBase)-[:contains]->()
MATCH (s:Snapshot { timestamp:1454983521.642284 })-[:contains]->(SPrimePackage)
USING INDEX s:Snapshot(timestamp)
WHERE NOT (SBase)-[:contains]->(SPrimePackage)
RETURN COUNT(DISTINCT SPrimePackage)
LIMIT 10;
The above query assumes you have already created an index on :Snapshot(timestamp), to greatly speed up the search for the 2 :Snapshot nodes:
CREATE INDEX ON :Snapshot(timestamp);
Using some simple data, the profile I get is:
+-------------------+----------------+------+---------+--------------------------------------+--------------------------------------+
| Operator | Estimated Rows | Rows | DB Hits | Variables | Other |
+-------------------+----------------+------+---------+--------------------------------------+--------------------------------------+
| +ProduceResults | 1 | 1 | 0 | COUNT(DISTINCT SPrimePackage) | COUNT(DISTINCT SPrimePackage) |
| | +----------------+------+---------+--------------------------------------+--------------------------------------+
| +Limit | 1 | 1 | 0 | COUNT(DISTINCT SPrimePackage) | Literal(10) |
| | +----------------+------+---------+--------------------------------------+--------------------------------------+
| +EagerAggregation | 1 | 1 | 0 | COUNT(DISTINCT SPrimePackage) | |
| | +----------------+------+---------+--------------------------------------+--------------------------------------+
| +AntiSemiApply | 1 | 7 | 0 | anon[180], s -- SBase, SPrimePackage | |
| |\ +----------------+------+---------+--------------------------------------+--------------------------------------+
| | +Expand(Into) | 1 | 0 | 34 | anon[266] -- SBase, SPrimePackage | (SBase)-[:contains]->(SPrimePackage) |
| | | +----------------+------+---------+--------------------------------------+--------------------------------------+
| | +Argument | 4 | 8 | 0 | SBase, SPrimePackage | |
| | +----------------+------+---------+--------------------------------------+--------------------------------------+
| +CartesianProduct | 4 | 8 | 0 | SBase -- anon[180], SPrimePackage, s | |
| |\ +----------------+------+---------+--------------------------------------+--------------------------------------+
| | +Expand(All) | 4 | 8 | 10 | anon[180], SPrimePackage -- s | (s)-[:contains]->(SPrimePackage) |
| | | +----------------+------+---------+--------------------------------------+--------------------------------------+
| | +NodeIndexSeek | 2 | 2 | 4 | s | :Snapshot(timestamp) |
| | +----------------+------+---------+--------------------------------------+--------------------------------------+
| +SemiApply | 1 | 2 | 0 | SBase | |
| |\ +----------------+------+---------+--------------------------------------+--------------------------------------+
| | +Expand(All) | 4 | 0 | 2 | anon[112], anon[126] -- SBase | (SBase)-[:contains]->() |
| | | +----------------+------+---------+--------------------------------------+--------------------------------------+
| | +Argument | 2 | 2 | 0 | SBase | |
| | +----------------+------+---------+--------------------------------------+--------------------------------------+
| +NodeIndexSeek | 2 | 2 | 3 | SBase | :Snapshot(timestamp) |
+-------------------+----------------+------+---------+--------------------------------------+--------------------------------------+
In addition to using indexing, the above query:
Does not bother to find all nodes contained by SBase, since we need to find just one contained node in order to identify a matching SBase node. The SemiApply operation will complete as soon as a single (SBase)-[:contains]->() match is found, and so the first MATCH clause will result in a single row per SBase instead of N rows. Based on the info in your question, I suspect N would have been about 8.
Has a Cartesian Product that should be pretty fast, since both "legs" of the product should have low cardinality.
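One more point that may account for part of the gap, stated as an assumption because the full profiles are not shown: LIMIT applies to the rows that RETURN produces. Without count(), the query can stop as soon as 10 rows have been produced. With count(), RETURN produces a single row, so every match has to be found before LIMIT 10 is applied to that one row. A minimal sketch that limits before counting, which is only useful if a count capped at 10 is acceptable:

```cypher
MATCH (SBase:Snapshot {timestamp: 1454983481.304583})
MATCH (:Snapshot {timestamp: 1454983521.642284})-[:contains]->(SPrimePackage)
WHERE NOT (SBase)-[:contains]->(SPrimePackage)
// Stop after 10 distinct packages, then count what was kept.
WITH DISTINCT SPrimePackage
LIMIT 10
RETURN count(SPrimePackage)
```

This returns at most 10, so it is not a replacement for the full count; it only shows where LIMIT takes effect.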

Cypher FOREACH MERGE not hitting the index

I have the following parameterized Cypher query:
MERGE (p:Person {pid: {personId}}) ON CREATE SET p.value=rand()
MERGE (c:Page {url: {pageUrl}}) ON CREATE SET c.value=rand()
MERGE p-[:REL]->c
FOREACH (tagValue IN {tags} |
MERGE (t:Tag {value:tagValue})
MERGE c-[:hasTag]->t)
This is very slow. The profile shows:
EmptyResult
|
+UpdateGraph(0)
|
+Eager(0)
|
+UpdateGraph(1)
|
+Eager(1)
|
+UpdateGraph(2)
+----------------+------+--------+------------------------------+------------------------------------------------------------------------------+
| Operator | Rows | DbHits | Identifiers | Other |
+----------------+------+--------+------------------------------+------------------------------------------------------------------------------+
| EmptyResult | 0 | 0 | | |
| UpdateGraph(0) | 1 | 79222 | | Foreach |
| Eager(0) | 1 | 0 | | |
| UpdateGraph(1) | 1 | 5 | p, c, UNNAMED163 | MergePattern |
| Eager(1) | 1 | 0 | | |
| UpdateGraph(2) | 1 | 14 | p, p, c, c | MergeNode; {personId}; :Person(pid); MergeNode; {pageUrl}; :Page(url) |
+----------------+------+--------+------------------------------+------------------------------------------------------------------------------+
Total database accesses: 79241
As you can see, it's apparently not using the index I've defined on :Tag(value)
Any ideas how to fix this? I'm running out of ideas and I'm starting to think this might be connected to https://github.com/neo4j/neo4j/issues/861
FYI, the MERGEs are really convenient for me and this query perfectly matches (or would if it worked:) the usage I need for data ingestion.
Hmmm, does it use an index if you use UNWIND instead of FOREACH?
MERGE (p:Person {pid: {personId}}) ON CREATE SET p.value=rand()
MERGE (c:Page {url: {pageUrl}}) ON CREATE SET c.value=rand()
MERGE p-[:REL]->c
WITH c
UNWIND {tags} AS tagValue
MERGE (t:Tag {value:tagValue})
MERGE c-[:hasTag]->t

Cypher / Should I use the WITH clause to pass values to next MATCH?

Using Neo4j 2.1.X, consider this query, which returns user 123's friends who bought a Car:
MATCH (u1:User("123"))-[:KNOWS]-(friend)
MATCH (friend)-[:BUYS]->(c:Car)
RETURN friend
In this article, it is written regarding the WITH clause:
So, how does it work? Well, with is basically just a stream, as lazy
as it can be (as lazy as return can be), passing results on to the
next query.
So it seems I should transform the query like this:
MATCH (u1:User("123"))-[:KNOWS]-(friend)
WITH friend
MATCH (friend)-[:BUYS]->(c:Car)
RETURN friend
Should I? Or does the current version of Cypher already handle MATCH chaining while passing values through them?
The more precise the starting point you give at the beginning of your query, the more efficient the query will be.
Your first MATCH is not very precise: it will use the traversal matcher to match all possible relationships.
Take the following neo4j console example: http://console.neo4j.org/r/jsx71g
In that example, your first query would look like this:
MATCH (n:User { login: 'nash99' })-[:KNOWS]->(friend)
RETURN count(*)
You can see the number of db hits in the execution plan:
Execution Plan
ColumnFilter
|
+EagerAggregation
|
+Filter
|
+TraversalMatcher
+------------------+------+--------+-------------+-----------------------------------------+
| Operator | Rows | DbHits | Identifiers | Other |
+------------------+------+--------+-------------+-----------------------------------------+
| ColumnFilter | 1 | 0 | | keep columns count(*) |
| EagerAggregation | 1 | 0 | | |
| Filter | 8 | 320 | | Property(n,login(2)) == { AUTOSTRING0} |
| TraversalMatcher | 160 | 201 | | friend, UNNAMED32, friend |
+------------------+------+--------+-------------+-----------------------------------------+
Total database accesses: 521
If you start from a more precise starting point, the rest of the match becomes much cheaper. Look at the plan for such a query and see the difference in db hits:
Execution Plan
ColumnFilter
|
+EagerAggregation
|
+SimplePatternMatcher
|
+Filter
|
+NodeByLabel
+----------------------+------+--------+------------------------+-----------------------------------------+
| Operator | Rows | DbHits | Identifiers | Other |
+----------------------+------+--------+------------------------+-----------------------------------------+
| ColumnFilter | 1 | 0 | | keep columns count(*) |
| EagerAggregation | 1 | 0 | | |
| SimplePatternMatcher | 8 | 0 | n, friend, UNNAMED51 | |
| Filter | 1 | 40 | | Property(n,login(2)) == { AUTOSTRING0} |
| NodeByLabel | 20 | 21 | n, n | :User |
+----------------------+------+--------+------------------------+-----------------------------------------+
Total database accesses: 61
So, to finish your query, I would do something like this:
MATCH (n:User { login: 'nash99' })
WITH n
MATCH (n)-[:KNOWS]->(friend)-[:BUYS]->(c:Car)
RETURN friend
You can also specify that the friend cannot be the same as the user:
MATCH (n:User { login: 'nash99' })
WITH n
MATCH (n)-[:KNOWS]->(friend)-[:BUYS]->(c:Car)
WHERE NOT friend.id = n.id
RETURN friend
Note that, in terms of db hits, there is no difference between the above query and the following:
MATCH (n:User { login: 'nash99' })
WITH n
MATCH (n)-[:KNOWS]->(friend)
WITH friend
MATCH (friend)-[:BUYS]->(c:Car)
RETURN friend
I recommend that you use the neo4j console to look at the result details, which show the information above.
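A sketch of how the two forms could be compared outside the console, assuming your shell or browser accepts the PROFILE prefix and reusing the login property from the console example:

```cypher
// Single pattern anchored on the user.
PROFILE
MATCH (n:User {login: 'nash99'})-[:KNOWS]->(friend)-[:BUYS]->(c:Car)
RETURN friend;

// Same traversal, split with WITH.
PROFILE
MATCH (n:User {login: 'nash99'})
WITH n
MATCH (n)-[:KNOWS]->(friend)-[:BUYS]->(c:Car)
RETURN friend;
```

Compare the total db hits reported for each statement.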
If you need to quickly prototype a graph for testing, you can use Graphgen, export the graph as Cypher statements, and load these statements into the neo4j console.
Here is the link to the Graphgen generation I used for the console: http://graphgen.neoxygen.io/?graph=29l9XJ0HxJ2pyQ
Chris

Resources