I am trying to evaluate Neo4j (using the community version).
I am importing some data (1 million rows) using the LOAD CSV process. It needs to match previously imported nodes to create a relationship between them.
Here is my query:
//Query #3
//create edges between Tr and Ad nodes
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:///1M.txt'
AS line
FIELDTERMINATOR '\t'
//find appropriate tx and ad
MATCH (tx:Tr { txid: TOINT(line.txid) }), (ad:Ad {p58: line.p58})
//create the edge (relationship)
CREATE (tx)-[out:OUT_TO]->(ad)
//set properties on the edge
SET out.id= TOINT(line.id)
SET out.n = TOINT(line.n)
SET out.v = TOINT(line.v)
I have indexes on:
Indexes
ON :Ad(p58) ONLINE (for uniqueness constraint)
ON :Tr(txid) ONLINE
ON :Tr(h) ONLINE (for uniqueness constraint)
This query has been running for 5 days now and it has so far created 270K relationships (out of 1M).
The Java heap is 4 GB.
The machine has 32 GB of RAM and an SSD, and runs only Linux and Neo4j.
Any hints to speed this process up would be highly appreciated.
Should I try the enterprise edition?
Query Plan:
+--------------------------------------------+
| No data returned, and nothing was changed. |
+--------------------------------------------+
If a part of a query contains multiple disconnected patterns,
this will build a cartesian product between all those parts.
This may produce a large amount of data and slow down query processing.
While occasionally intended,
it may often be possible to reformulate the query that avoids the use of this cross product,
perhaps by adding a relationship between the different parts or by using OPTIONAL MATCH (identifier is: (ad))
20 ms
Compiler CYPHER 3.0
Planner COST
Runtime INTERPRETED
+---------------------------------+----------------+---------------------+----------------------------+
| Operator | Estimated Rows | Variables | Other |
+---------------------------------+----------------+---------------------+----------------------------+
| +ProduceResults | 1 | | |
| | +----------------+---------------------+----------------------------+
| +EmptyResult | | | |
| | +----------------+---------------------+----------------------------+
| +Apply | 1 | line -- ad, out, tx | |
| |\ +----------------+---------------------+----------------------------+
| | +SetRelationshipProperty(4) | 1 | ad, out, tx | |
| | | +----------------+---------------------+----------------------------+
| | +CreateRelationship | 1 | out -- ad, tx | |
| | | +----------------+---------------------+----------------------------+
| | +ValueHashJoin | 1 | ad -- tx | ad.p58; line.p58 |
| | |\ +----------------+---------------------+----------------------------+
| | | +NodeIndexSeek | 1 | tx | :Tr(txid) |
| | | +----------------+---------------------+----------------------------+
| | +NodeUniqueIndexSeek(Locking) | 1 | ad | :Ad(p58) |
| | +----------------+---------------------+----------------------------+
| +LoadCSV | 1 | line | |
+---------------------------------+----------------+---------------------+----------------------------+
OK, so splitting the MATCH statement in two sped up the query immensely. Thanks @William Lyon for pointing me to the plan; I noticed the warning.
Old MATCH statement:
MATCH (tx:Tr { txid: TOINT(line.txid) }), (ad:Ad {p58: line.p58})
split into two:
MATCH (tx:Tr { txid: TOINT(line.txid) })
MATCH (ad:Ad {p58: line.p58})
On 750K relationships the query took 83 seconds.
Next up: the 9 million row CSV LOAD.
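For reference, here is the full revised import query, a sketch that simply combines the split MATCH clauses with the original LOAD CSV setup:
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:///1M.txt'
AS line
FIELDTERMINATOR '\t'
//find tx and ad separately so the planner can use an index seek for each
MATCH (tx:Tr { txid: TOINT(line.txid) })
MATCH (ad:Ad { p58: line.p58 })
//create the edge (relationship) and set its properties
CREATE (tx)-[out:OUT_TO]->(ad)
SET out.id = TOINT(line.id)
SET out.n = TOINT(line.n)
SET out.v = TOINT(line.v)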
I'm running a Cypher query that gets me all friends up to 3 hops away from a source:
MATCH (:Person {id:"768"})-[:KNOWS*1..3]-(friend:Person)
WITH DISTINCT friend
RETURN count(friend)
The query returns the right results but is taking longer than I would have expected, since the graph has only 10k people in it. When I profile the query, I find that the VarLengthExpand(All) operator is returning a surprisingly large number of rows (~282k):
+-----------------------+----------------+--------+---------+
| Operator | Estimated Rows | Rows | DB Hits |
+-----------------------+----------------+--------+---------+
| +ProduceResults | 6 | 1 | 0 |
| | +----------------+--------+---------+
| +EagerAggregation | 6 | 1 | 0 |
| | +----------------+--------+---------+
| +Distinct | 35 | 9082 | 0 |
| | +----------------+--------+---------+
| +Filter | 37 | 282635 | 282635 |
| | +----------------+--------+---------+
| +VarLengthExpand(All) | 37 | 282635 | 290014 |
| | +----------------+--------+---------+
| +NodeIndexSeek | 1 | 1 | 2 |
+-----------------------+----------------+--------+---------+
It appears that the VarLengthExpand(All) step is doing more work than is necessary. To check that, I looked at the number of nodes at each minimum distance away, and the number of incident KNOWS relationships to those nodes:
----------------------------------------------------
| min. dist. from src: | 0 | 1 | 2 | 3 |
|----------------------|----|------|--------|------|
| nodes | 1 | 12 | 2815 | 6270 |
| incident knows rels | 12 | 3697 | 164828 | |
----------------------------------------------------
So we see that node 768 has 12 direct friends, 2,815 friends-of-friends, and 6,270 friends-of-friends-of-friends. The friends-of-friends group has 164,828 incident KNOWS relationships, and of all the nodes touched by those relationships, 6,270 are nodes that have a minimum distance of 3 to the source node.
So, if a traversal were to proceed in a breadth-first-search fashion from the source... in theory it would only need to do ~ 12 + 3,697 + 164,828 = 168,537 reads, which is a lot less than 282k.
Does anybody know what's going on here, and whether or not there is a better way to formulate this query?
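For reference, the per-distance node counts in the table above can be reproduced with a query along these lines (a sketch assuming the same :Person/:KNOWS schema):
//one shortest path per reachable node gives its minimum distance from the source
MATCH p = shortestPath((src:Person {id:"768"})-[:KNOWS*..3]-(n:Person))
WHERE n <> src
RETURN length(p) AS dist, count(DISTINCT n) AS nodes
ORDER BY dist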
I start with the following query:
PROFILE
MATCH Base = (SBase:Snapshot {timestamp:1454983481.304583})-[:contains]->()
MATCH Prime = (:Snapshot {timestamp:1454983521.642284})-[PContains:contains]->(SPrimePackage)
WHERE NOT (SBase)-[:contains]->(SPrimePackage)
RETURN PContains
LIMIT 10
I get "5834 total db hits in 119 ms". The graph correctly shows 9 nodes, and 8 edges connecting them. Then I run an almost-identical query, except that I instead return count(distinct()):
PROFILE
MATCH Base = (SBase:Snapshot {timestamp:1454983481.304583})-[:contains]->()
MATCH Prime = (:Snapshot {timestamp:1454983521.642284})-[PContains:contains]->(SPrimePackage)
WHERE NOT (SBase)-[:contains]->(SPrimePackage)
RETURN count(distinct(SPrimePackage))
LIMIT 10
This gives "1382270 total db hits in 1771 ms". The result is correct: 8. However, why is count(distinct()) so much slower and more expensive? Should I be doing this some other way?
I'm running Neo4j 2.3.1
EDIT 1
To ensure I'm comparing apples to apples, and to highlight the question, here is a similar pair of queries and results:
MATCH Base = (SBase:Snapshot {timestamp:1454983481.304583})-[:contains]->()
MATCH Prime = (:Snapshot {timestamp:1454983521.642284})-[PContains:contains]->(SPrimePackage)
WHERE NOT (SBase)-[:contains]->(SPrimePackage)
RETURN SPrimePackage
LIMIT 10
Note that it returns SPrimePackage instead of PContains as in the original. The result is "5834 total db hits in 740 ms".
Here is that exact same query with "count()":
MATCH Base = (SBase:Snapshot {timestamp:1454983481.304583})-[:contains]->()
MATCH Prime = (:Snapshot {timestamp:1454983521.642284})-[PContains:contains]->(SPrimePackage)
WHERE NOT (SBase)-[:contains]->(SPrimePackage)
RETURN count(SPrimePackage)
LIMIT 10
The result: "1382270 total db hits in 2731 ms". Note the only difference is the "count()". Intuitively, I would expect "count()" to add a single tallying step, but clearly it's doing much more than that. Why is "count()" triggering all of this extra work?
[UPDATED]
If you compared the PROFILE output of your 2 (edited) queries, you'd probably see that the only significant difference was the existence of an EagerAggregation operation in the COUNT() version of the query. Aggregation functions use EagerAggregation to collect in memory all the data being aggregated before actually performing the aggregation function (in this case, COUNT()). That requires additional work that is not needed when you do not use the aggregation function.
The following query still uses COUNT() in order to get the count, but greatly reduces the data that has to be aggregated, thus reducing the amount of work that needs to be done in the EagerAggregation step:
PROFILE
MATCH (SBase:Snapshot { timestamp:1454983481.304583 })
USING INDEX SBase:Snapshot(timestamp)
WHERE (SBase)-[:contains]->()
MATCH (s:Snapshot { timestamp:1454983521.642284 })-[:contains]->(SPrimePackage)
USING INDEX s:Snapshot(timestamp)
WHERE NOT (SBase)-[:contains]->(SPrimePackage)
RETURN COUNT(DISTINCT SPrimePackage)
LIMIT 10;
The above query assumes you have already created an index on :Snapshot(timestamp), to greatly speed up the search for the 2 :Snapshot nodes:
CREATE INDEX ON :Snapshot(timestamp);
Using some simple data, the profile I get is:
+-------------------+----------------+------+---------+--------------------------------------+--------------------------------------+
| Operator | Estimated Rows | Rows | DB Hits | Variables | Other |
+-------------------+----------------+------+---------+--------------------------------------+--------------------------------------+
| +ProduceResults | 1 | 1 | 0 | COUNT(DISTINCT SPrimePackage) | COUNT(DISTINCT SPrimePackage) |
| | +----------------+------+---------+--------------------------------------+--------------------------------------+
| +Limit | 1 | 1 | 0 | COUNT(DISTINCT SPrimePackage) | Literal(10) |
| | +----------------+------+---------+--------------------------------------+--------------------------------------+
| +EagerAggregation | 1 | 1 | 0 | COUNT(DISTINCT SPrimePackage) | |
| | +----------------+------+---------+--------------------------------------+--------------------------------------+
| +AntiSemiApply | 1 | 7 | 0 | anon[180], s -- SBase, SPrimePackage | |
| |\ +----------------+------+---------+--------------------------------------+--------------------------------------+
| | +Expand(Into) | 1 | 0 | 34 | anon[266] -- SBase, SPrimePackage | (SBase)-[:contains]->(SPrimePackage) |
| | | +----------------+------+---------+--------------------------------------+--------------------------------------+
| | +Argument | 4 | 8 | 0 | SBase, SPrimePackage | |
| | +----------------+------+---------+--------------------------------------+--------------------------------------+
| +CartesianProduct | 4 | 8 | 0 | SBase -- anon[180], SPrimePackage, s | |
| |\ +----------------+------+---------+--------------------------------------+--------------------------------------+
| | +Expand(All) | 4 | 8 | 10 | anon[180], SPrimePackage -- s | (s)-[:contains]->(SPrimePackage) |
| | | +----------------+------+---------+--------------------------------------+--------------------------------------+
| | +NodeIndexSeek | 2 | 2 | 4 | s | :Snapshot(timestamp) |
| | +----------------+------+---------+--------------------------------------+--------------------------------------+
| +SemiApply | 1 | 2 | 0 | SBase | |
| |\ +----------------+------+---------+--------------------------------------+--------------------------------------+
| | +Expand(All) | 4 | 0 | 2 | anon[112], anon[126] -- SBase | (SBase)-[:contains]->() |
| | | +----------------+------+---------+--------------------------------------+--------------------------------------+
| | +Argument | 2 | 2 | 0 | SBase | |
| | +----------------+------+---------+--------------------------------------+--------------------------------------+
| +NodeIndexSeek | 2 | 2 | 3 | SBase | :Snapshot(timestamp) |
+-------------------+----------------+------+---------+--------------------------------------+--------------------------------------+
In addition to using indexing, the above query:
Does not bother to find all nodes contained by SBase, since we need to find just one contained node in order to identify a matching SBase node. The SemiApply operation will complete as soon as a single (SBase)-[:contains]->() match is found, and so the first MATCH clause will result in a single row per SBase instead of N rows. Based on the info in your question, I suspect N would have been about 8.
Has a Cartesian Product that should be pretty fast, since both "legs" of the product should have low cardinality.
I'm trying to find all possible paths between two nodes. I've used a few Cypher queries that do the required job, but they take a lot of time as the number of hops increases. This is the query:
match p = (n{name:"Node1"})-[:Route*1..5]-(b{name:"Node2"}) return p
Also, if I use shortestPath, it limits the result once a path with the minimum number of hops is found, so I don't get results with two or more hops if a direct connection (1 hop) exists between the nodes.
match p = shortestpath((n{name:"Node1"})-[:Route*1..5]-(b{name:"Node2"})) return p
And if I increase the minimum hop count to 2 or more, it throws an exception:
shortestPath(...) does not support a minimal length different from 0 or 1
Is there any alternative framework or algorithm to get all paths in minimal time?
P.S. I'm looking for something on the order of milliseconds. Currently, all queries with more than 3 hops take a few seconds to complete.
I gather you are trying to speed up your original query involving variable-length paths. The shortestPath function is not appropriate for your query, as it literally tries to find a shortest path -- not all paths up to a certain length.
The execution plan for your original query (using sample data) looks like this:
+-----------------------+----------------+------+---------+-------------------+---------------------------------------------+
| Operator | Estimated Rows | Rows | DB Hits | Identifiers | Other |
+-----------------------+----------------+------+---------+-------------------+---------------------------------------------+
| +ProduceResults | 0 | 1 | 0 | p | p |
| | +----------------+------+---------+-------------------+---------------------------------------------+
| +Projection | 0 | 1 | 0 | anon[30], b, n, p | ProjectedPath(Set(anon[30], n),) |
| | +----------------+------+---------+-------------------+---------------------------------------------+
| +Filter | 0 | 1 | 2 | anon[30], b, n | n.name == { AUTOSTRING0} |
| | +----------------+------+---------+-------------------+---------------------------------------------+
| +VarLengthExpand(All) | 0 | 2 | 7 | anon[30], b, n | (b)<-[:Route*]-(n) |
| | +----------------+------+---------+-------------------+---------------------------------------------+
| +Filter | 0 | 1 | 3 | b | b.name == { AUTOSTRING1} |
| | +----------------+------+---------+-------------------+---------------------------------------------+
| +AllNodesScan | 3 | 3 | 4 | b | |
+-----------------------+----------------+------+---------+-------------------+---------------------------------------------+
So, your original query is scanning through every node to find the node(s) that match the b pattern. Then, it expands all variable-length paths starting at b. And then it filters those paths to find the one(s) that end with a node that matches the pattern for n.
Here are a few suggestions that should speed up your query, although you'll have to test it on your data to see by how much:
Give each node a label. For example, Foo.
Create an index that can speed up the search for your end nodes. For example:
CREATE INDEX ON :Foo(name);
Modify your query to force the use of the index on both end nodes. For example:
MATCH p =(n:Foo { name:"Node1" })-[:Route*1..5]-(b:Foo { name:"Node2" })
USING INDEX n:Foo(name)
USING INDEX b:Foo(name)
RETURN p;
After the above changes, the execution plan is:
+-----------------+------+---------+-----------------------------+-----------------------------+
| Operator | Rows | DB Hits | Identifiers | Other |
+-----------------+------+---------+-----------------------------+-----------------------------+
| +ColumnFilter | 1 | 0 | p | keep columns p |
| | +------+---------+-----------------------------+-----------------------------+
| +ExtractPath | 1 | 0 | anon[33], anon[34], b, n, p | |
| | +------+---------+-----------------------------+-----------------------------+
| +PatternMatcher | 1 | 3 | anon[33], anon[34], b, n | |
| | +------+---------+-----------------------------+-----------------------------+
| +SchemaIndex | 1 | 2 | b, n | { AUTOSTRING1}; :Foo(name) |
| | +------+---------+-----------------------------+-----------------------------+
| +SchemaIndex | 1 | 2 | n | { AUTOSTRING0}; :Foo(name) |
+-----------------+------+---------+-----------------------------+-----------------------------+
This query plan uses the index to directly get the b and n nodes -- without scanning. This, by itself, should provide a speed improvement. And then this plan uses the "PatternMatcher" to find the variable-length paths between those end nodes. You will have to try this query out to see how efficient the "PatternMatcher" is in doing that.
From your description I assume that you want to get a shortest path based on some weight like a duration property on the :Route relationships.
If that is true, using shortestPath in Cypher is not helpful, since it only takes the number of hops into account. Weighted shortest paths are not yet available in Cypher in an efficient way.
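For illustration, the closest pure-Cypher workaround is a brute-force one along these lines (a sketch assuming a numeric duration property on the :Route relationships, as mentioned above; it enumerates every candidate path, so it will not be fast for larger hop limits):
MATCH p = (n {name:"Node1"})-[:Route*1..5]-(b {name:"Node2"})
//sum the weights along each path, then keep the cheapest one
RETURN p, reduce(total = 0, r IN relationships(p) | total + r.duration) AS totalDuration
ORDER BY totalDuration ASC
LIMIT 1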
The Java API has support for weighted shortest paths via Dijkstra or A* through the GraphAlgoFactory class. For the simple case where your cost function is just the value of a relationship property (as mentioned above), you can also use an existing REST endpoint.
I am running this query (bisac_code is uniquely indexed).
Execution time is more than 2.5 minutes.
52 main codes are selected from almost 4000 in total.
The total number of wokas is very large, 19 million nodes.
Are there any possibilities to make it run faster?
neo4j-sh (?)$ MATCH (b:Bisac)-[r:INCLUDED_IN]-(w:Woka)
> WHERE (b.bisac_code =~ '.*000000')
> RETURN b.bisac_code as bisac_code, count(w) as wokas_count
> ORDER BY b.bisac_code
> ;
+---------------------------+
| bisac_code | wokas_count |
+---------------------------+
| "ANT000000" | 13865 |
| "ARC000000" | 32905 |
| "ART000000" | 79600 |
| "BIB000000" | 2043 |
| "BIO000000" | 256082 |
| "BUS000000" | 226173 |
| "CGN000000" | 16424 |
| "CKB000000" | 26410 |
| "COM000000" | 44922 |
| "CRA000000" | 18720 |
| "DES000000" | 2713 |
| "DRA000000" | 62610 |
| "EDU000000" | 228182 |
| "FAM000000" | 42951 |
| "FIC000000" | 474004 |
| "FOR000000" | 41999 |
| "GAM000000" | 8803 |
| "GAR000000" | 37844 |
| "HEA000000" | 36939 |
| "HIS000000" | 3908869 |
| "HOM000000" | 5123 |
| "HUM000000" | 29270 |
| "JNF000000" | 40396 |
| "JUV000000" | 200144 |
| "LAN000000" | 89059 |
| "LAW000000" | 153138 |
| "LCO000000" | 1528237 |
| "LIT000000" | 89611 |
| "MAT000000" | 58134 |
| "MED000000" | 80268 |
| "MUS000000" | 75997 |
| "NAT000000" | 35991 |
| "NON000000" | 107513 |
| "OCC000000" | 42134 |
| "PER000000" | 26989 |
| "PET000000" | 4980 |
| "PHI000000" | 72069 |
| "PHO000000" | 8546 |
| "POE000000" | 104609 |
| "POL000000" | 309153 |
| "PSY000000" | 55710 |
| "REF000000" | 96477 |
| "REL000000" | 133619 |
| "SCI000000" | 86017 |
| "SEL000000" | 40901 |
| "SOC000000" | 292713 |
| "SPO000000" | 172284 |
| "STU000000" | 10508 |
| "TEC000000" | 77459 |
| "TRA000000" | 9093 |
| "TRU000000" | 12041 |
| "TRV000000" | 27706 |
+---------------------------+
52 rows
198310 ms
And the response time is not consistent.
After a while it drops to less than half a minute.
52 rows
31207 ms
In Neo4j 2.3 there will be index support for prefix LIKE searches but probably not for postfix ones.
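For illustration, a prefix search of the following form can use the index (a sketch; in the released 2.3 syntax this is STARTS WITH), whereas the trailing-'000000' pattern in the question cannot:
MATCH (b:Bisac)
WHERE b.bisac_code STARTS WITH 'ANT'
RETURN b.bisac_code;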
There are two ways of making @user2194039's solution faster:
Use a path expression to count the Wokas per Bisac:
MATCH (b:Bisac) WHERE (b.bisac_code =~ '.*000000')
WITH b, size((b)-[:INCLUDED_IN]->()) as wokas_count
RETURN b.bisac_code as bisac_code, wokas_count
ORDER BY b.bisac_code
Mark the Bisac nodes matching that pattern with a label:
MATCH (b:Bisac) WHERE (b.bisac_code =~ '.*000000') SET b:Main;
MATCH (b:Main:Bisac)
WITH b, size((b)-[:INCLUDED_IN]->()) as wokas_count
RETURN b.bisac_code as bisac_code, wokas_count
ORDER BY b.bisac_code;
The slow speed is caused by your regular expression pattern matching (=~ ). Although your bisac_code is indexed, the regex match causes the index to be ineffective. The index only works when you are matching full bisac_code values.
Cypher does include some string manipulation facilities that might let you get by without using a regex =~, but I doubt it would make any difference, because the index will still be useless.
I might suggest considering if you can further categorize your bisac_codes so that you do not need to do a pattern match. Maybe an extra indexed property that somehow denotes those codes that end in 000000?
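For example, something along these lines (a sketch; is_main is a hypothetical property name) would tag the main codes once and then let an index do the work:
MATCH (b:Bisac) WHERE (b.bisac_code =~ '.*000000')
SET b.is_main = true;
CREATE INDEX ON :Bisac(is_main);
MATCH (b:Bisac { is_main: true })-[:INCLUDED_IN]-(w:Woka)
RETURN b.bisac_code as bisac_code, count(w) as wokas_count
ORDER BY b.bisac_code;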
If you do not want to add properties, you may try matching only the Bisacs first, and then including the Wokas. Something like this:
MATCH (b:Bisac) WHERE (b.bisac_code =~ '.*000000')
WITH b
MATCH (b)-[r:INCLUDED_IN]-(w:Woka)
RETURN b.bisac_code as bisac_code, count(w) as wokas_count
ORDER BY b.bisac_code
This may help Cypher stick to the 4000 Bisac nodes while doing the pattern match, before getting involved with all 19 million Woka nodes, but I am not sure if this will make a material difference. Even slogging through 4000 nodes (effectively without an index) is a slow process.
Hash Tables in Database Indexing
The reason that your index is ineffective for regex pattern matching is that Neo4j likely uses a hash table for indexing properties. This is common to many databases. Wikipedia has an article on hash tables.
The basics though are that the index is not storing all of the properties that you want to search through. It is storing values that represent the properties you want to search through, and the representation is only valid for the whole property. If you are searching for only a part of the property value, the hashes stored in the index are useless, and the database must search through the properties the old-fashioned way -- one by one.
Edit re: your edit
The improvement in response time after running this query multiple times is certainly due to caching. Neo4j is remembering that you access the Bisac nodes and bisac_code properties frequently, and is keeping them in memory. This makes future queries faster because the values do not need to be read off disk.
However, eventually, those nodes and properties will likely be dropped from the cache, as Neo4j finds you manipulating different nodes, which it will cache instead. There are only so many nodes Neo4j can cache before running out of memory, so it picks the most recent and/or frequently used data.
Is there any way that I can benchmark multiple queries in Neo4j?
Assuming that I have loaded my graph, I want to initiate 10000 distinct shortest path queries in the database, without loading the data to a client. Is there a way that I can do this in a batch and get the execution times?
Try using the profile keyword inside the neo4j-shell. This will give you some basic facts about how quickly, and how, a query executes.
Here's a simple example:
neo4j-sh (?)$ CREATE (a {label:"foo"})-[:bar]->(b {label: "bar"})-[:bar]->(c {label: "baz"});
+-------------------+
| No data returned. |
+-------------------+
Nodes created: 3
Relationships created: 2
Properties set: 3
1180 ms
neo4j-sh (?)$ profile match (a {label: "foo"}), (c {label: "baz"}), p=shortestPath(a-[*]-c) return p;
+--------------------------------------------------------------------------------------+
| p |
+--------------------------------------------------------------------------------------+
| [Node[0]{label:"foo"},:bar[0]{},Node[1]{label:"bar"},:bar[1]{},Node[2]{label:"baz"}] |
+--------------------------------------------------------------------------------------+
1 row
ColumnFilter
|
+ShortestPath
|
+Filter(0)
|
+AllNodes(0)
|
+Filter(1)
|
+AllNodes(1)
+--------------+------+--------+-------------+-----------------------------------------+
| Operator | Rows | DbHits | Identifiers | Other |
+--------------+------+--------+-------------+-----------------------------------------+
| ColumnFilter | 1 | 0 | | keep columns p |
| ShortestPath | 1 | 0 | p | |
| Filter(0) | 1 | 6 | | Property(c,label(0)) == { AUTOSTRING1} |
| AllNodes(0) | 3 | 4 | c, c | |
| Filter(1) | 1 | 6 | | Property(a,label(0)) == { AUTOSTRING0} |
| AllNodes(1) | 3 | 4 | a, a | |
+--------------+------+--------+-------------+-----------------------------------------+
This other answer indicates that lower DbHits values are usually indicative of better performance, since DB hits are expensive.
The WebAdmin tools (usually at http://localhost:7474/webadmin/ for a local neo4j installation), has Data browser and Console tabs that allow you to enter your query, see the results, and also see the actual time it took to perform the query.
Interestingly, from my limited testing of the Data browser and Console tabs, the latter seems to report faster query times for the same queries. So, the Console probably has less overhead, possibly making its timing results a bit more accurate.