I've got the following parameterized Cypher query:
MERGE (p:Person {pid: {personId}}) ON CREATE SET p.value=rand()
MERGE (c:Page {url: {pageUrl}}) ON CREATE SET c.value=rand()
MERGE p-[:REL]->c
FOREACH (tagValue IN {tags} |
MERGE (t:Tag {value:tagValue})
MERGE c-[:hasTag]->t)
This is very slow; profiling shows:
EmptyResult
|
+UpdateGraph(0)
|
+Eager(0)
|
+UpdateGraph(1)
|
+Eager(1)
|
+UpdateGraph(2)
+----------------+------+--------+------------------------------+------------------------------------------------------------------------------+
| Operator | Rows | DbHits | Identifiers | Other |
+----------------+------+--------+------------------------------+------------------------------------------------------------------------------+
| EmptyResult | 0 | 0 | | |
| UpdateGraph(0) | 1 | 79222 | | Foreach |
| Eager(0) | 1 | 0 | | |
| UpdateGraph(1) | 1 | 5 | p, c, UNNAMED163 | MergePattern |
| Eager(1) | 1 | 0 | | |
| UpdateGraph(2) | 1    | 14     | p, p, c, c                   | MergeNode; {personId}; :Person(pid); MergeNode; {pageUrl}; :Page(url)       |
+----------------+------+--------+------------------------------+------------------------------------------------------------------------------+
Total database accesses: 79241
As you can see, it's apparently not using the index I've defined on :Tag(value)
Any ideas how to fix this? I'm running out of ideas and I'm starting to think this might be connected to https://github.com/neo4j/neo4j/issues/861
FYI, the MERGEs are really convenient for me, and this query perfectly matches (or would, if it worked) the usage I need for data ingestion.
Hmmm, does it use an index if you use UNWIND instead of FOREACH?
MERGE (p:Person {pid: {personId}}) ON CREATE SET p.value=rand()
MERGE (c:Page {url: {pageUrl}}) ON CREATE SET c.value=rand()
MERGE p-[:REL]->c
WITH c
UNWIND {tags} AS tagValue
MERGE (t:Tag {value:tagValue})
MERGE c-[:hasTag]->t
I'm running Neo4j Desktop v1.4.1; the database is 4.2.1 Enterprise.
I have a simple graph of placements, campaigns and a placement to campaign "contains" relationship. This is a fresh dataset, every node is unique. Some placements "contain" thousands of campaigns, so I want to filter the returned campaigns by an inclusion list of campaign ids.
When I return all the matched nodes it works:
neo4j@neo4j> MATCH (:Placement {id: 5})-[:CONTAINS]->(c:Campaign)
WHERE c.id IN [400,263,150470,25810,37578]
RETURN *;
+--------------------------+
| c |
+--------------------------+
| (:Campaign {id: 37578}) |
| (:Campaign {id: 263}) |
| (:Campaign {id: 25810}) |
| (:Campaign {id: 150470}) |
+--------------------------+
When I request just the campaign id, I get duplicates:
neo4j@neo4j> MATCH (:Placement {id: 5})-[:CONTAINS]->(c:Campaign)
WHERE c.id IN [400,263,150470,25810,37578]
RETURN c.id;
+--------+
| c.id |
+--------+
| 150470 |
| 150470 |
| 150470 |
| 150470 |
+--------+
There is only one CONTAINS relationship between placement 5 and campaign 150470:
neo4j@neo4j> MATCH (:Placement {id: 5})-[rel:CONTAINS]->(:Campaign {id:150470})
RETURN count(rel);
+------------+
| count(rel) |
+------------+
| 1 |
+------------+
EXPLAIN returns the following query plan; the cache[c.id] seems like it might be the culprit:
+---------------------------+------------------------------------------------------------------------------------------------------+----------------+---------------------+
| Operator | Details | Estimated Rows | Other |
+---------------------------+------------------------------------------------------------------------------------------------------+----------------+---------------------+
| +ProduceResults#neo4j | `c.id` | 4 | Fused in Pipeline 1 |
| | +------------------------------------------------------------------------------------------------------+----------------+---------------------+
| +Projection#neo4j | cache[c.id] AS `c.id` | 4 | Fused in Pipeline 1 |
| | +------------------------------------------------------------------------------------------------------+----------------+---------------------+
| +Expand(Into)#neo4j | (anon_7)-[anon_27:CONTAINS]->(c) | 4 | Fused in Pipeline 1 |
| | +------------------------------------------------------------------------------------------------------+----------------+---------------------+
| +MultiNodeIndexSeek#neo4j | UNIQUE anon_7:Placement(id) WHERE id = $autoint_0, cache[c.id], UNIQUE c:Campaign(id) WHERE id IN $a | 25 | In Pipeline 0 |
| | utolist_1, cache[c.id] | | |
+---------------------------+------------------------------------------------------------------------------------------------------+----------------+---------------------+
Edit: if I prepend the query with CYPHER runtime=SLOTTED I get the expected output:
+--------+
| c.id |
+--------+
| 37578 |
| 263 |
| 25810 |
| 150470 |
+--------+
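For reference, that is exactly the same query as above with only the runtime prefix prepended:
CYPHER runtime=SLOTTED
MATCH (:Placement {id: 5})-[:CONTAINS]->(c:Campaign)
WHERE c.id IN [400,263,150470,25810,37578]
RETURN c.id;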
If I omit the WHERE clause I get unique campaign ids (but too many). I feel like I'm missing something obvious, but I've read the Neo4j docs and I'm not getting it. Thanks!
Sample Data:
CREATE (a1:A {title: "a1"})
CREATE (a2:A {title: "a2"})
CREATE (a3:A {title: "a3"})
CREATE (b1:B {title: "b1"})
CREATE (b2:B {title: "b2"})
MATCH (a:A {title: "a1"}), (b:B {title: "b1"})
CREATE (a)-[r:LINKS]->(b)
MATCH (a:A {title: "a2"}), (a1:A {title: "a1"})
CREATE (a)-[:CONNECTED]->(a1)
Sample Query:
MATCH (a:A), (b:B) RETURN a, b
Objective: finding some connections in the WHERE clause.
Now let's write some variations to find the A's not directly connected to B (a2 and a3).
// Q1. Both work fine
MATCH (a:A) WHERE (a)--(:B) RETURN a
MATCH (a:A) WHERE (:B)--(a) RETURN a
// Q2. Works
MATCH (a:A)-[r]-(b:B) WHERE (a)-[r]-(b) RETURN a
// Q3. Fails
MATCH (a:A)-[r]-(b:B) WHERE (b)-[r]-(a) RETURN a
Any idea why Q2 and Q3 are not behaving the same way, even though the direction is specified as bi-directional? Is this a Neo4j bug?
All credit to stdob and this answer for narrowing down the anomaly that was happening in my other query.
Update: Posted the same to the Neo4j GitHub issues.
Update: Neo4j has accepted this as a bug and will be fixing it in 3.1.
While this might not be a complete answer, it is too much info for a comment. This should hopefully provide some helpful insight though.
I would consider this a bug. Below are some variations of what should give the same results from the sample data. They should all pass with the given data (pass meaning they return something).
MATCH (a:A)-[r]-(b:B) WHERE (b)-[r]-(a) RETURN * -> fails
remove r
MATCH (a:A)--(b:B) WHERE (b)--(a) RETURN * -> pass
MATCH (a:A)-[r]-(b:B) WHERE (b)--(a) RETURN * -> pass
add direction
MATCH (a:A)-[r]-(b:B) WHERE (b)<-[r]-(a) RETURN * -> pass
reverse order
MATCH (a:A)-[r]-(b:B) WHERE (a)-[r]-(b) RETURN * -> pass
And here is the profile of the failing test:
+---------------------+----------------+------+---------+-----------+--------------+
| Operator | Estimated Rows | Rows | DB Hits | Variables | Other |
+---------------------+----------------+------+---------+-----------+--------------+
| +ProduceResults | 1 | 0 | 0 | a | a |
| | +----------------+------+---------+-----------+--------------+
| +SemiApply | 1 | 0 | 0 | a, b, r | |
| |\ +----------------+------+---------+-----------+--------------+
| | +ProjectEndpoints | 1 | 0 | 0 | a, b, r | r, b, a |
| | | +----------------+------+---------+-----------+--------------+
| | +Argument | 2 | 1 | 0 | a, b, r | |
| | +----------------+------+---------+-----------+--------------+
| +Filter | 2 | 1 | 1 | a, b, r | a:A |
| | +----------------+------+---------+-----------+--------------+
| +Expand(All) | 2 | 1 | 3 | a, r -- b | (b)-[r:]-(a) |
| | +----------------+------+---------+-----------+--------------+
| +NodeByLabelScan | 2 | 2 | 3 | b | :B |
+---------------------+----------------+------+---------+-----------+--------------+
and here is the profile of the equivalent passing test (reverse order):
+---------------------+----------------+------+---------+-----------+--------------+
| Operator | Estimated Rows | Rows | DB Hits | Variables | Other |
+---------------------+----------------+------+---------+-----------+--------------+
| +ProduceResults | 1 | 1 | 0 | a | a |
| | +----------------+------+---------+-----------+--------------+
| +SemiApply | 1 | 1 | 0 | a, b, r | |
| |\ +----------------+------+---------+-----------+--------------+
| | +ProjectEndpoints | 1 | 0 | 0 | a, b, r | r, a, b |
| | | +----------------+------+---------+-----------+--------------+
| | +Argument | 2 | 1 | 0 | a, b, r | |
| | +----------------+------+---------+-----------+--------------+
| +Filter | 2 | 1 | 1 | a, b, r | a:A |
| | +----------------+------+---------+-----------+--------------+
| +Expand(All) | 2 | 1 | 3 | a, r -- b | (b)-[r:]-(a) |
| | +----------------+------+---------+-----------+--------------+
| +NodeByLabelScan | 2 | 2 | 3 | b | :B |
+---------------------+----------------+------+---------+-----------+--------------+
Notice the row count after step 1 in each. The same plan should not produce different results. My speculation is that this is a bug related to Neo4j's graph-pruning shortcuts: once Neo4j traverses an edge in one direction, it will not traverse back over the same edge within the same match (an anti-cycle fail-safe/performance feature).

So, in theory, after reversing the order in the WHERE part relative to the MATCH part, Neo4j has to traverse a pruned edge to validate the relationship. If the check runs in the same direction, it auto-passes; if Neo4j tries to do the same check in reverse, it fails because that edge has been pruned. (This is just a theory, though; the validation that fails is technically the r validation in reverse.)
I start with the following query:
PROFILE
MATCH Base = (SBase:Snapshot {timestamp:1454983481.304583})-[:contains]->()
MATCH Prime = (:Snapshot {timestamp:1454983521.642284})-[PContains:contains]->(SPrimePackage)
WHERE NOT (SBase)-[:contains]->(SPrimePackage)
RETURN PContains
LIMIT 10
I get "5834 total db hits in 119 ms". The graph correctly shows 9 nodes, and 8 edges connecting them. Then I run an almost-identical query, except that I instead return count(distinct()):
PROFILE
MATCH Base = (SBase:Snapshot {timestamp:1454983481.304583})-[:contains]->()
MATCH Prime = (:Snapshot {timestamp:1454983521.642284})-[PContains:contains]->(SPrimePackage)
WHERE NOT (SBase)-[:contains]->(SPrimePackage)
RETURN count(distinct(SPrimePackage))
LIMIT 10
This gives "1382270 total db hits in 1771 ms". The result is correct: 8. However, why is count(distinct()) so much slower and more expensive? Should I be doing this some other way?
I'm running Neo4j 2.3.1
EDIT 1
To ensure I'm comparing apples to apples, and to highlight the question, here is a similar pair of queries and results:
MATCH Base = (SBase:Snapshot {timestamp:1454983481.304583})-[:contains]->()
MATCH Prime = (:Snapshot {timestamp:1454983521.642284})-[PContains:contains]->(SPrimePackage)
WHERE NOT (SBase)-[:contains]->(SPrimePackage)
RETURN SPrimePackage
LIMIT 10
Note that it returns "SPrimePackage" instead of the "PContains" in the original. The result is "5834 total db hits in 740 ms".
Here is that exact same query with "count()":
MATCH Base = (SBase:Snapshot {timestamp:1454983481.304583})-[:contains]->()
MATCH Prime = (:Snapshot {timestamp:1454983521.642284})-[PContains:contains]->(SPrimePackage)
WHERE NOT (SBase)-[:contains]->(SPrimePackage)
RETURN count(SPrimePackage)
LIMIT 10
The result: "1382270 total db hits in 2731 ms". Note the only difference is the "count()". Intuitively, I would expect "count()" to add a single tallying step, but clearly it's doing much more than that. Why is "count()" triggering all of this extra work?
[UPDATED]
If you compared the PROFILE output of your 2 (edited) queries, you'd probably see that the only significant difference was the existence of an EagerAggregation operation in the COUNT() version of the query. Aggregation functions use EagerAggregation to collect in memory all the data being aggregated before actually performing the aggregation function (in this case, COUNT()). That requires additional work that is not needed when you do not use the aggregation function.
The following query still uses COUNT() in order to get the count, but greatly reduces the data that has to be aggregated, thus reducing the amount of work that needs to be done in the EagerAggregation step:
PROFILE
MATCH (SBase:Snapshot { timestamp:1454983481.304583 })
USING INDEX SBase:Snapshot(timestamp)
WHERE (SBase)-[:contains]->()
MATCH (s:Snapshot { timestamp:1454983521.642284 })-[:contains]->(SPrimePackage)
USING INDEX s:Snapshot(timestamp)
WHERE NOT (SBase)-[:contains]->(SPrimePackage)
RETURN COUNT(DISTINCT SPrimePackage)
LIMIT 10;
The above query assumes you have already created an index on :Snapshot(timestamp), to greatly speed up the search for the 2 :Snapshot nodes:
CREATE INDEX ON :Snapshot(timestamp);
Using some simple data, the profile I get is:
+-------------------+----------------+------+---------+--------------------------------------+--------------------------------------+
| Operator | Estimated Rows | Rows | DB Hits | Variables | Other |
+-------------------+----------------+------+---------+--------------------------------------+--------------------------------------+
| +ProduceResults | 1 | 1 | 0 | COUNT(DISTINCT SPrimePackage) | COUNT(DISTINCT SPrimePackage) |
| | +----------------+------+---------+--------------------------------------+--------------------------------------+
| +Limit | 1 | 1 | 0 | COUNT(DISTINCT SPrimePackage) | Literal(10) |
| | +----------------+------+---------+--------------------------------------+--------------------------------------+
| +EagerAggregation | 1 | 1 | 0 | COUNT(DISTINCT SPrimePackage) | |
| | +----------------+------+---------+--------------------------------------+--------------------------------------+
| +AntiSemiApply | 1 | 7 | 0 | anon[180], s -- SBase, SPrimePackage | |
| |\ +----------------+------+---------+--------------------------------------+--------------------------------------+
| | +Expand(Into) | 1 | 0 | 34 | anon[266] -- SBase, SPrimePackage | (SBase)-[:contains]->(SPrimePackage) |
| | | +----------------+------+---------+--------------------------------------+--------------------------------------+
| | +Argument | 4 | 8 | 0 | SBase, SPrimePackage | |
| | +----------------+------+---------+--------------------------------------+--------------------------------------+
| +CartesianProduct | 4 | 8 | 0 | SBase -- anon[180], SPrimePackage, s | |
| |\ +----------------+------+---------+--------------------------------------+--------------------------------------+
| | +Expand(All) | 4 | 8 | 10 | anon[180], SPrimePackage -- s | (s)-[:contains]->(SPrimePackage) |
| | | +----------------+------+---------+--------------------------------------+--------------------------------------+
| | +NodeIndexSeek | 2 | 2 | 4 | s | :Snapshot(timestamp) |
| | +----------------+------+---------+--------------------------------------+--------------------------------------+
| +SemiApply | 1 | 2 | 0 | SBase | |
| |\ +----------------+------+---------+--------------------------------------+--------------------------------------+
| | +Expand(All) | 4 | 0 | 2 | anon[112], anon[126] -- SBase | (SBase)-[:contains]->() |
| | | +----------------+------+---------+--------------------------------------+--------------------------------------+
| | +Argument | 2 | 2 | 0 | SBase | |
| | +----------------+------+---------+--------------------------------------+--------------------------------------+
| +NodeIndexSeek | 2 | 2 | 3 | SBase | :Snapshot(timestamp) |
+-------------------+----------------+------+---------+--------------------------------------+--------------------------------------+
In addition to using indexing, the above query:
Does not bother to find all nodes contained by SBase, since we need to find just one contained node in order to identify a matching SBase node. The SemiApply operation will complete as soon as a single (SBase)-[:contains]->() match is found, and so the first MATCH clause will result in a single row per SBase instead of N rows. Based on the info in your question, I suspect N would have been about 8.
Has a Cartesian Product that should be pretty fast, since both "legs" of the product should have low cardinality.
I am running this query (bisac_code is uniquely indexed).
Execution time is more than 2.5 minutes.
52 main codes are selected from almost 4000 in total.
The total number of Wokas is very large: 19 million nodes.
Is there any way to make it run faster?
neo4j-sh (?)$ MATCH (b:Bisac)-[r:INCLUDED_IN]-(w:Woka)
> WHERE (b.bisac_code =~ '.*000000')
> RETURN b.bisac_code as bisac_code, count(w) as wokas_count
> ORDER BY b.bisac_code
> ;
+---------------------------+
| bisac_code | wokas_count |
+---------------------------+
| "ANT000000" | 13865 |
| "ARC000000" | 32905 |
| "ART000000" | 79600 |
| "BIB000000" | 2043 |
| "BIO000000" | 256082 |
| "BUS000000" | 226173 |
| "CGN000000" | 16424 |
| "CKB000000" | 26410 |
| "COM000000" | 44922 |
| "CRA000000" | 18720 |
| "DES000000" | 2713 |
| "DRA000000" | 62610 |
| "EDU000000" | 228182 |
| "FAM000000" | 42951 |
| "FIC000000" | 474004 |
| "FOR000000" | 41999 |
| "GAM000000" | 8803 |
| "GAR000000" | 37844 |
| "HEA000000" | 36939 |
| "HIS000000" | 3908869 |
| "HOM000000" | 5123 |
| "HUM000000" | 29270 |
| "JNF000000" | 40396 |
| "JUV000000" | 200144 |
| "LAN000000" | 89059 |
| "LAW000000" | 153138 |
| "LCO000000" | 1528237 |
| "LIT000000" | 89611 |
| "MAT000000" | 58134 |
| "MED000000" | 80268 |
| "MUS000000" | 75997 |
| "NAT000000" | 35991 |
| "NON000000" | 107513 |
| "OCC000000" | 42134 |
| "PER000000" | 26989 |
| "PET000000" | 4980 |
| "PHI000000" | 72069 |
| "PHO000000" | 8546 |
| "POE000000" | 104609 |
| "POL000000" | 309153 |
| "PSY000000" | 55710 |
| "REF000000" | 96477 |
| "REL000000" | 133619 |
| "SCI000000" | 86017 |
| "SEL000000" | 40901 |
| "SOC000000" | 292713 |
| "SPO000000" | 172284 |
| "STU000000" | 10508 |
| "TEC000000" | 77459 |
| "TRA000000" | 9093 |
| "TRU000000" | 12041 |
| "TRV000000" | 27706 |
+---------------------------+
52 rows
198310 ms
And the response time is not consistent; after a while it drops to less than half a minute.
52 rows
31207 ms
In Neo4j 2.3 there will be index support for prefix LIKE searches but probably not for postfix ones.
There are two ways of making @user2194039's solution faster:
Use a path expression to count the Wokas per Bisac:
MATCH (b:Bisac) WHERE (b.bisac_code =~ '.*000000')
WITH b, size((b)-[:INCLUDED_IN]->()) as wokas_count
RETURN b.bisac_code as bisac_code, wokas_count
ORDER BY b.bisac_code
Mark the Bisacs matching that pattern with a label:
MATCH (b:Bisac) WHERE (b.bisac_code =~ '.*000000') SET b:Main;
MATCH (b:Main:Bisac)
WITH b, size((b)-[:INCLUDED_IN]->()) as wokas_count
RETURN b.bisac_code as bisac_code, wokas_count
ORDER BY b.bisac_code;
The slow speed is caused by your regular expression pattern matching (=~ ). Although your bisac_code is indexed, the regex match causes the index to be ineffective. The index only works when you are matching full bisac_code values.
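For example, an exact lookup like the following (using one of the codes from your result set) can use the index, whereas the regex above cannot:
MATCH (b:Bisac {bisac_code: 'HIS000000'})
RETURN b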
Cypher does include some string manipulation facilities that might let you get by without using a regex =~, but I doubt it would make any difference, because the index will still be useless.
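For instance, one such rewrite would compare only the tail of the code with the right() string function (just a sketch; it still has to inspect every bisac_code, so the index stays unused):
MATCH (b:Bisac)
WHERE right(b.bisac_code, 6) = '000000'
RETURN b.bisac_code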
I might suggest considering if you can further categorize your bisac_codes so that you do not need to do a pattern match. Maybe an extra indexed property that somehow denotes those codes that end in 000000?
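A rough sketch of that idea (the property name is_main is purely illustrative, not something from your model):
// one-time pass: flag the ~52 'main' codes using the same regex
MATCH (b:Bisac) WHERE b.bisac_code =~ '.*000000'
SET b.is_main = true;

// index the flag so later queries can seek the flagged nodes directly
CREATE INDEX ON :Bisac(is_main);

// then the aggregation never needs the regex at all
MATCH (b:Bisac {is_main: true})-[:INCLUDED_IN]-(w:Woka)
RETURN b.bisac_code AS bisac_code, count(w) AS wokas_count
ORDER BY b.bisac_code;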
If you do not want to add properties, you may try matching only the Bisacs first, and then including the Wokas. Something like this:
MATCH (b:Bisac) WHERE (b.bisac_code =~ '.*000000')
WITH b
MATCH (b)-[r:INCLUDED_IN]-(w:Woka)
RETURN b.bisac_code as bisac_code, count(w) as wokas_count
ORDER BY b.bisac_code
This may help Cypher stick to the 4000 Bisac nodes while doing the pattern match, before getting involved with all 19 million Woka nodes, but I am not sure if this will make a material difference. Even slogging through 4000 nodes (effectively without an index) is a slow process.
Hash Tables in Database Indexing
The reason that your index is ineffective for regex pattern matching is that Neo4j likely uses a hash table for indexing properties. This is common in many databases. Wikipedia has an article here.
The basics though are that the index is not storing all of the properties that you want to search through. It is storing values that represent the properties you want to search through, and the representation is only valid for the whole property. If you are searching for only a part of the property value, the hashes stored in the index are useless, and the database must search through the properties the old-fashioned way -- one by one.
Edit re: your edit
The improvement in response time after running this query multiple times is certainly due to caching. Neo4j is remembering that you access the Bisac nodes and bisac_code properties frequently, and is keeping them in memory. This makes future queries faster because the values do not need to be read off disk.
However, eventually, those nodes and properties will likely be dropped from the cache, as Neo4j finds you manipulating different nodes, which it will cache instead. There are only so many nodes Neo4j can cache before running out of memory, so it picks the most recent and/or frequently used data.
I have a non-unique node (:Neighborhood) that uniquely appears [:IN] a (:City) node. I would like to create a new neighborhood node and establish its relationship ONLY if that neighborhood node does not already exist in that city. There can be multiple neighborhoods that have the same name, but each neighborhood name must appear only once within a given city.
Following the advice from Gil's answer here: Return node if relationship is not present, how can I do something like:
MATCH a WHERE NOT (a:Neighborhood {name : line.Neighborhood})-[r:IN]->(c:City {name : line.City})
ON MATCH SET (a)-[r]-(c)
So then it would only create a new neighborhood node if it doesn't already exist in the city.
UPDATE: I upgraded and profiled it and still can't take advantage of any optimizations...
PROFILE LOAD CSV WITH HEADERS FROM "file://THEFILE" as line
WITH line LIMIT 0
MATCH (c:City { name : line.City})
MERGE (n:Neighborhood {name : toInt(line.Neighborhood)})-[:IN]->(c)
;
+--------------+------+--------+---------------------------+------------------------------+
| Operator | Rows | DbHits | Identifiers | Other |
+--------------+------+--------+---------------------------+------------------------------+
| EmptyResult | 0 | 0 | | |
| UpdateGraph | 5 | 16 | anon[340], b, neighborhood, line | MergePattern |
| SchemaIndex | 5 | 10 | b, line | line.City; :City(name) |
| ColumnFilter | 5 | 0 | line | keep columns line |
| Filter | 5 | 0 | anon[216], line | anon[216] |
| Extract | 5 | 0 | anon[216], line | anon[216] |
| Slice | 5 | 0 | line | { AUTOINT0} |
| LoadCSV | 5 | 0 | line | |
+--------------+------+--------+---------------------------+------------------------------+
I think you could simply use MERGE for this:
MATCH (c:City {name: line.City})
MERGE c<-[:IN]-(a:Neighborhood {name : line.Neighborhood})
If you haven't already imported all of the cities, you can create those with MERGE:
MERGE (c:City {name: line.City})
MERGE c<-[:IN]-(a:Neighborhood {name : line.Neighborhood})
But beware of the Eager operator:
http://www.markhneedham.com/blog/2014/10/23/neo4j-cypher-avoiding-the-eager/
In short: You should run your LOAD CSV (I assume that's what you're doing here) twice, once to load the cities and once to load the neighborhoods.
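A rough sketch of that two-pass approach, reusing the placeholder file and column names from the PROFILE above (adjust them to your real CSV):
// pass 1: create each city once
LOAD CSV WITH HEADERS FROM "file://THEFILE" AS line
MERGE (:City {name: line.City});

// pass 2: create each neighborhood uniquely within its city
LOAD CSV WITH HEADERS FROM "file://THEFILE" AS line
MATCH (c:City {name: line.City})
MERGE (n:Neighborhood {name: line.Neighborhood})-[:IN]->(c);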