I am new to Neo4j and have set up a test graph database for organizing some clickstream data, using a very small subset of what we actually use on a day-to-day basis. The graph has about 23 million nodes and 34 million relationships. Queries seem to take forever to run, i.e. I haven't seen a response come back even after waiting for more than 30 minutes.
The data is organized as Year->Month->Day->Session{1..n}->Event{1..n}
I am running the db on a Windows 7 machine with 1.5 GB of heap allocated to the Neo4j server.
These are the configurations in neo4j-wrapper.conf:
wrapper.java.additional.1=-Dorg.neo4j.server.properties=conf/neo4j-server.properties
wrapper.java.additional.2=-Djava.util.logging.config.file=conf/logging.properties
wrapper.java.additional.3=-Dlog4j.configuration=file:conf/log4j.properties
wrapper.java.additional.6=-XX:+UseParNewGC
wrapper.java.additional.7=-XX:+UseConcMarkSweepGC
wrapper.java.additional.8=-Xloggc:data/log/neo4j-gc.log
wrapper.java.initmemory=1500
wrapper.java.maxmemory=1500
This is what my query looks like:
START n=node(3)
MATCH (n)-[:HAS]->(s)
WITH distinct s
MATCH (s)-[:HAS]->(e) WHERE e.page_name = 'Login'
WITH s.session_id as session, e
MATCH (e)-[:FOLLOWEDBY*0..1]->(e1)
WITH count(session) as session_cnt, e.page_name as startPage, e1.page_name as nextPage
RETURN startPage, nextPage, session_cnt
Also, I have these properties set:
node_auto_indexing=true
node_keys_indexable=name,page_name,geo_country
relationship_auto_indexing=true
Can anyone help me figure out what might be wrong?
Even when I run portions of the query, it takes 10-15 minutes before I see a response.
Note: I have no other applications running on the Windows machine.
Why would you want to return all the nodes in the first place?
If you really want to do that, use the transactional http endpoint and curl to stream the response:
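A minimal sketch, assuming a Neo4j 2.x/3.x server with the transactional endpoint at its default location on localhost:7474 (adjust the URL and add auth headers for your setup):

curl -H "Content-Type: application/json" \
     -H "Accept: application/json; charset=UTF-8" \
     -d '{"statements": [{"statement": "MATCH (n) RETURN n"}]}' \
     http://localhost:7474/db/data/transaction/commit

curl streams the JSON response as it arrives, so you don't have to hold the whole result set in memory at once.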
I tested it with a database of 100k nodes. It takes 0.9 seconds to transfer them (1.5MB) over the wire.
If you transfer all their properties by using "return n", it takes 1.4 seconds and results in 4.1MB transferred.
If you just want to know how many nodes are in your db, use something like this instead:
match (n) return count(*);
I am using Neo4j community 4.2.1, playing with graph databases. I plan to operate on lots of data and want to get familiar with indexes and stuff.
However, I'm stuck at a very basic level because in the browser Neo4j reports query runtimes which have nothing to do with reality.
I'm executing the following query in the browser at http://localhost:7687/:
match (m:Method),(o:Method) where m.name=o.name and m.name <> '<init>' and
m.signature=o.signature and toInteger(o.access)%8 in [1,4]
return m,o
The DB has ~5000 nodes with the :Method label.
The browser returns data after about 30 seconds. However, Neo4j reports:
Started streaming 93636 records after 1 ms and completed after 42 ms, displaying first 1000 rows.
Well, 42 ms and 30 s are really far apart! What am I supposed to do with this message? Did the query take only milliseconds, and were the remaining 30 seconds spent rendering the results in the browser? What is going on here? How can I improve my query if I cannot even tell how long it really ran?
I modified the query, returning count(m) + count(o) instead of m,o, which changed things: now the runtime is about 2 seconds, and Neo4j reports about the same figure.
Can somebody tell me how to get realistic runtime figures for my queries without using the stopwatch on my cell phone?
I'm running Cypher queries on a very large social network (over 1B records). I'm trying to get all paths between two people with variable relationship lengths. I get a reasonable response time when running a query for a single relationship length (between 0.5 and 2 seconds); the person ids are indexed.
MATCH paths=( (pr1:person)-[*0..1]-(pr2:person) )
WHERE pr1.id='123456'
RETURN paths
However, when I run the query with multiple lengths (i.e. 2 or more), the response time goes up to several minutes. Assuming each person has on average the same number of connections, I expected my queries to run for 2-3 minutes max, but I get up to 5+ minutes.
MATCH paths=( (pr1:person)-[*0..2]-(pr2:person) )
WHERE pr1.id='123456'
RETURN paths
I tried EXPLAIN, but it did not show extreme values for the VarLengthExpand(All) operator.
Maybe the traversal is not using the index for pr2.
Is there anyway to improve the performance of my query?
Since variable-length relationship searches have exponential complexity, your *0..2 query might be generating a very large number of paths, which can cause the neo4j server (or your client code, like the neo4j browser) to run a long time or even run out of memory.
This query might be able to finish and show you how many matching paths there are:
MATCH (pr1:person)-[*0..2]-(:person)
WHERE pr1.id='123456'
RETURN COUNT(*);
If the returned number is very large, then you should modify your query to reduce the size of the result. For example, you can add a LIMIT clause after your original RETURN clause to cap the number of returned paths.
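For instance, building on the original query (the LIMIT value of 100 is arbitrary; pick whatever suits your use case):

MATCH paths=( (pr1:person)-[*0..2]-(pr2:person) )
WHERE pr1.id='123456'
RETURN paths
LIMIT 100;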
By the way, the EXPLAIN clause just estimates the query cost, and can be way off. The PROFILE clause performs the actual query, and gives you an accurate accounting of the DB hits (however, if your query never finishes running, then a PROFILE of it will also never finish).
Rather than using EXPLAIN, try PROFILE instead.
I have installed the APOC procedures and used "CALL apoc.warmup.run".
The result is as follows:

pageSize      8192
nodesPerPage  nodesTotal   nodesLoaded  nodesTime
546           156255221    286182       21
relsPerPage   relsTotal    relsLoaded   relsTime
240           167012639    695886       8
totalTime     30
It looks like the Neo4j server only caches a fraction of the nodes and relationships.
But I want it to cache all the nodes and relationships in order to improve query performance.
First of all, for all the data to be cached, you need a page cache large enough to hold it.
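For example, in Neo4j 3.x the page cache is sized in neo4j.conf; the 4g value below is only an illustration, and you should size it to fit your store files:

dbms.memory.pagecache.size=4g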
Then, the problem is not that Neo4j does not cache all it can; it's more of a bug in the apoc.warmup.run procedure: it retrieves the number of nodes (resp. relationships) in the database and expects them all to have ids between 1 and that number of nodes (resp. relationships). However, that's not true if you've had some churn in the DB, like creating nodes and then deleting some of them.
I believe that could be fixed by using another query instead:
MATCH (n) RETURN count(n) AS count, max(id(n)) AS maxId
since profiling it shows about the same number of DB hits as the number of nodes, and it takes about 650 ms on my machine for 1.4 million nodes.
Update: I've opened an issue on the subject.
Update 2
While the issue with the ids is real, I missed the real reason why the procedure reports reading far less nodes: it only reads one node per page (assuming they're stored sequentially), since it's the pages that are cached. With the current values, that means trying to read one node every 546 nodes. It happens that 156255221 ÷ 546 = 286181, and with node 0 that makes it 286182 nodes loaded.
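The arithmetic above can be checked directly (a sketch; the constants are the figures reported by apoc.warmup.run in this case):

```python
# apoc.warmup.run touches one node per page, so with 546 nodes per
# 8 KiB page it reads roughly nodesTotal / nodesPerPage records.
nodes_total = 156255221   # nodesTotal reported by the procedure
nodes_per_page = 546      # nodesPerPage reported by the procedure

pages_touched = nodes_total // nodes_per_page  # one probe per page
nodes_loaded = pages_touched + 1               # node 0 is read as well
print(nodes_loaded)  # 286182, matching the nodesLoaded figure
```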
I have an API in Django whose structure is something like this:
FetchData():
    run cypher query1
    run cypher query2
    run cypher query3
    return
When I run these queries in the Neo4j query window, each takes around 100 ms. But when I call this API, query1 takes 1 s while the other two take the expected 100 ms. This pattern repeats every time I call the API.
Can anyone explain what should be done here to make the first query run in the expected time?
Neo4j tries to cache the graph in RAM. Upon first invocation the caches are not warmed up yet, so the IO operations take longer. Subsequent invocations don't hit IO and read directly from RAM.
That sounds weird. The cache should only need to be warmed if the server or db is shut down, not after each of your API calls. Are you using parameterized queries? The only thing I can think of is that each set of queries is different, causing them to be re-parsed and re-planned.
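A small illustration of why parameters matter here (hypothetical query strings; Neo4j keys its plan cache on the query text, so interpolated literals defeat it):

```python
# Interpolating values produces a different query string per call,
# forcing a re-parse and re-plan; a $-parameter keeps the text identical.
def interpolated(uid):
    return f"MATCH (u:User {{uid: '{uid}'}}) RETURN u"

PARAMETERIZED = "MATCH (u:User {uid: $uid}) RETURN u"

texts = {interpolated(uid) for uid in ("a1", "b2", "c3")}
print(len(texts))     # 3 distinct strings, so 3 separate plans
print(PARAMETERIZED)  # one string, so one cached plan
```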
The same Cypher query takes a different amount of time when executed through different consoles:
Executed via spring-data-neo4j: (took 8 seconds)
@Query(
    "MATCH (user:User {uid:{0}})-[:FRIEND]-(friend:User) " +
    "RETURN friend"
)
public List<User> getFriends(String userId);
Executed via http://localhost:7474/browser/ : (took 250 ms)
Executed via http://localhost:7474/webadmin/#/console/ : (took 18 ms)
Even though the queries executed via the consoles are very fast, with times well within an acceptable range, in production I have to execute those queries from my Java app, and in that case the time they take is totally unacceptable.
Edit:
@NodeEntity
public class User extends AbstractEntity {
    @RelatedToVia(elementClass = Friendship.class, type = FRIEND, direction = BOTH)
    private Set<Friendship> friendships;
    ...
}
Make sure when benchmarking that you run tests at least 3 times and plan to drop the first one. I find that it takes one or two runs to warm the cache for Neo4j for any given query. This is different from most RDBMSs, which don't cache the same way.
My practice is to run test queries 5 times and drop the first 2. This works pretty consistently for me regardless of the size of the data set (I test with tens of thousands and hundreds of thousands of nodes) and the complexity of the script (I have some Cypher statements that run more than 50 or so lines, with multiple WITHs, etc.). What I've found is that after the first two runs, the performance of the same query from one run to the next tends to stay right around the same value.
So make sure that in production you've warmed your cache for your common queries. And make sure that you've got sufficient memory available to the JVM and Neo4j. Most data sets are smaller than a few GB and so can fit in memory, if Neo4j has access to enough memory.
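One blunt way to warm the cache (a sketch; it simply touches every node and relationship once, and can take a while on large stores) is:

MATCH (n)
OPTIONAL MATCH (n)-[r]->()
RETURN count(n) + count(r);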
Finally, be sure you have your indexes in place (e.g. CREATE INDEX ON :User(uid)). With most of the graph and its properties loaded in memory, and the right indexes in place, Neo4j should really perform.