I started testing Neo4j for a program and I am facing some performance issues. As mentioned in the title, Neo4j is directly embedded in the Java code.
My graph contains about 4 million nodes and several hundred million relationships. My test simply sends a query counting the number of inbound relationships for a node.
This program uses the ExecutionEngine's execute method to send the following query:
start n=node:node_auto_index(id="United States") match s-[:QUOTES]->n return count(s)
By simply adding some prints I can see how much time this query takes, which is usually about 900 ms - a lot.
What surprises me the most is that I receive a "query execution time" in the response which is really different.
For instance a query returned:
+----------+
| count(s) |
+----------+
| 427738 |
+----------+
1 row
1 ms
According to this response, I understand that Neo4j took 1 ms for the query, but when I print some log messages I can see that it actually took 917 ms.
I guess that 1 ms corresponds to the time required to find the indexed node "United States", which would mean that Neo4j needed about 916 ms for the rest, i.e. counting the relationships. In that case, how can I get better performance for this query?
Thanks in advance!
Query timers were broken in 1.8.1 and 1.9.M04, when the Cypher lazy-evaluation behavior was fixed (definitely a worthwhile trade for most use cases). But yeah, I think it will be fixed soon.
For now you'll have to time things externally.
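External timing can be as simple as wrapping the call with a monotonic clock. A minimal sketch in Python (the same idea applies to System.nanoTime() in embedded Java); `execute_query` here is a stand-in for whatever actually runs the Cypher, not a real API:

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds) using a monotonic clock."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Example with a stand-in for the real query call:
def execute_query():
    time.sleep(0.05)  # simulate ~50 ms of query work
    return 427738

count, seconds = timed(execute_query)
print(f"count={count}, took {seconds * 1000:.0f} ms")
```

Measured this way, the wall-clock number includes result materialization, which is exactly what the broken internal timer misses.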
Update:
As for your question about whether that time is reasonable... It basically needs to scan all ~428k incoming relationships to count them, so this is probably reasonable, even if the cache is warmed up and all of that data fits into RAM. Having "super nodes" like this is usually not best practice if it can be avoided, although a lot of improvements for this case are planned for future versions (at least, that's what I hear).
Make sure not to measure the first query, because that one mostly measures how long it takes to load the data from disk into memory.
Make sure to give Neo4j enough memory to cache your data.
And try this query to see if it is faster:
start n=node:node_auto_index(id="United States")
return length(()-[:QUOTES]->n) as cnt
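On the memory point above: in the Neo4j 1.8/1.9 era the store files were memory-mapped via settings in neo4j.properties, separately from the JVM heap. A hedged sketch (the keys existed in that era, but the exact values below are placeholders; size them from the actual store file sizes in the data directory):

```
# neo4j.properties - memory-mapped IO for the store files (values are examples only)
neostore.nodestore.db.mapped_memory=500M
neostore.relationshipstore.db.mapped_memory=4G
neostore.propertystore.db.mapped_memory=1G
neostore.propertystore.db.strings.mapped_memory=500M
```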
Related
I make experiments with querying Neo4j data whose size gradually increases. Some of my query expressions behave as I would expect - a larger graph means a longer execution time. But, e.g., the query
MATCH (`g`:Graph)
MATCH (`g`)<-[:`Graph`]-(person)-[:birthPlace {Value: "http://persons/data/birthPlace"}]->(place)-[:`Graph`]->(`g`)
WITH count(person.Value) AS persons, place
WHERE persons > 5
RETURN place.Value AS place, persons
ORDER BY persons
has these execution times (in ms):
80.2, 208.1, 301.7, 399.23, 0.1, 2.07, 2.61, 2.81, 7.3, 1.5
How to explain the rapid acceleration from the fifth step? The data are the same, just extended; no indexes were created.
The data on 4th experiment:
201673 nodes,
601189 relationships,
859225 properties.
The data size on the 5th experiment:
242113 nodes,
741500 relationships,
1047060 properties.
All I can think about is that maybe Cypher will start using some indexes from a certain data size, but I can't find anywhere if that's the case.
Thank you for any comments.
Neo4j cache management may explain your observations, but you should explain what you are doing more precisely. What version of Neo4j are you using? What is the Graph node? Are you repeatedly running the same query on a graph and then trying again with a larger or smaller graph?
If you are running the same query multiple times on the same data set and execution gets faster, then the cache may be the reason: in v3.5 and earlier it would "warm up" with repeated execution. Putatively this does not occur in v4.x.
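The warm-up effect is easy to demonstrate: run the same query several times and compare the first (cold) run against the median of the rest. A sketch in Python; `simulated_query` is a stand-in that fakes a slow first (cold-cache) execution rather than a real Neo4j call:

```python
import time
from statistics import median

_cold = True

def simulated_query():
    """Stand-in for a real Cypher call; the first run pays a fake disk-load cost."""
    global _cold
    time.sleep(0.05 if _cold else 0.005)
    _cold = False

def profile(fn, runs=5):
    """Return a list of per-run timings in seconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return timings

timings = profile(simulated_query)
print(f"cold: {timings[0]*1000:.1f} ms, warm median: {median(timings[1:])*1000:.1f} ms")
```

If your real measurements show the same cold/warm split, cache behavior (not a planner change) is the likely explanation for the sudden drop.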
You might look at
cold start
or these tips. You might also look at your transaction log: is it accumulating large files?
Why the backticks around your node identifiers (`g`)? Just use (g:Graph) and [r:Graph]; no quoting needed.
I am using Neo4j community 4.2.1, playing with graph databases. I plan to operate on lots of data and want to get familiar with indexes and stuff.
However, I'm stuck at a very basic level because the Neo4j browser reports query runtimes which have nothing to do with reality.
I'm executing the following query in the browser at http://localhost:7687/:
match (m:Method),(o:Method) where m.name=o.name and m.name <> '<init>' and
m.signature=o.signature and toInteger(o.access)%8 in [1,4]
return m,o
The DB has ~5000 Method labels.
The browser returns data after about 30sec. However, Neo4j reports
Started streaming 93636 records after 1 ms and completed after 42 ms, displaying first 1000 rows.
Well, 42 ms and 30 sec are really far from each other! What am I supposed to do with this message? Did the query take only milliseconds, and the remaining 30 sec were spent rendering the stuff in the browser? What is going on here? How can I improve my query if I cannot even tell how long it really ran?
I modified the query, returning count(m) + count(o) instead of m,o, which changed things: now the runtime is about 2 sec and Neo4j reports about the same amount.
Can somebody tell me how I can get realistic runtime figures for my queries without using the stopwatch on my phone?
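The "started streaming 93636 records after 1 ms" message is consistent with a lazily streamed result: the server can hand over the first records almost immediately, while fetching (and the browser rendering) all the rows takes the remaining time. A small Python sketch with a fake lazy result stream (not the real driver) makes the distinction visible:

```python
import time

def lazy_results(n, per_row=0.0001):
    """Fake result cursor: rows are produced on demand, not all up front."""
    for i in range(n):
        time.sleep(per_row)  # cost paid per row as the client pulls it
        yield i

start = time.perf_counter()
cursor = lazy_results(1000)
first = next(cursor)           # "started streaming": almost instant
t_first = time.perf_counter() - start
rows = [first] + list(cursor)  # "completed": pays the full per-row cost
t_all = time.perf_counter() - start

print(f"first row after {t_first*1000:.1f} ms, all {len(rows)} rows after {t_all*1000:.1f} ms")
```

So a realistic client-side figure is the time to *consume the whole result*, not the time to first record that the browser banner reports.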
I have a decently large graph (1.8 billion nodes and roughly the same number of relationships) where I am performing the following query:
MATCH (n:Article)
WHERE n.id IN $pmids
MATCH (n)-[:HAS_MENTION]->(m:Mention)
WITH n, collect(m) as mentions
RETURN n.id as pmid, mentions
ORDER BY pmid
where $pmids is a list of strings, e.g. ["1234", "4567"], whose length varies from 100 to 500.
I am currently holding the data in a Neo4j Docker community instance with the following conf modifications: NEO4J_dbms_memory_pagecache_size=32G, NEO4J_dbms_memory_heap_max__size=32G. An index has been created for Article.id.
This query has been quite slow to run (roughly 5 seconds) and I would like to optimize it for a faster runtime. Through work I have access to Neo4j enterprise, so one approach would be to ingest this data into a Neo4j enterprise instance where I can tweak advanced configuration settings.
In general, does anyone have any tips on how I may improve performance, whether by optimizing the Cypher query itself, increasing workers, or other settings in neo4j.conf?
Thanks in advance.
For anyone interested - I posed this question in the Neo4j forums as well and there have already been some interesting optimization suggestions (especially around the "type hint" to trigger backward indexing, and using pattern comprehension instead of collect()).
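One low-risk client-side tweak worth trying (an assumption on my part, not from the forum thread): send the PMID list in fixed-size chunks, so each query keeps a stable parameter size and the results can be merged on the client. A Python sketch; QUERY is the original Cypher from the question, and `run_query` is a stand-in for the real driver call (e.g. session.run):

```python
QUERY = """
MATCH (n:Article)
WHERE n.id IN $pmids
MATCH (n)-[:HAS_MENTION]->(m:Mention)
WITH n, collect(m) AS mentions
RETURN n.id AS pmid, mentions
ORDER BY pmid
"""

def chunks(items, size):
    """Split a list into consecutive chunks of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def fetch_all(run_query, pmids, batch_size=100):
    """Run the query once per chunk and merge the results."""
    results = []
    for batch in chunks(pmids, batch_size):
        results.extend(run_query(QUERY, pmids=batch))
    return results

# Usage with a stand-in for the real driver call:
fake_run = lambda q, pmids: [(p, []) for p in pmids]
out = fetch_all(fake_run, [str(i) for i in range(250)])
```

Whether batching helps depends on where the time goes; it is worth profiling one chunk with PROFILE first.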
Initial thoughts
you are using a string field to store PMID, but PMIDs are numeric; it might reduce the database size, and possibly perform better, if stored as int (and indexed as int, and searched as int)
if the PMID list is usually large, and the server has over half a dozen cores, it might be worth looking into the APOC parallel Cypher functions
do you really need every property from the Mention nodes? If not, try gathering just what you need
what is the size of the database in GBs? (some context is required in terms of memory settings), and what did neo4j-admin memrec recommend?
If this is how the DB is always used, all the time, a SQL database might be better; when building that SQL DB, collect the mentions into one field (once and done)
Note: Go PubMed!
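The first suggestion above (store PMIDs as integers) also implies converting on the client side before sending parameters, so the query parameter matches the stored type. A trivial but easy-to-miss sketch:

```python
pmids = ["1234", "4567", "89"]

# Convert once at the boundary; a mismatch between a string parameter and an
# int-typed property would bypass the index entirely.
int_pmids = [int(p) for p in pmids]
print(int_pmids)
```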
I've been running a production website for 4 years on Azure SQL.
With the help of the 'Top Slow Request' query from alexsorokoletov on GitHub, I have 1 super slow query according to the Azure query stats.
The one on top is the one that uses a lot of CPU.
When looking at the linq query and the execution plans / live stats, I can't find the bottleneck yet.
And the live stats
The join from results to project is not direct; there is a projectsession table in between, not visible in the query, but maybe under the hood of Entity Framework.
Might I be affected by parameter sniffing? Can I reset a hash? Maybe the optimized query plan was created in 2014, and now the result table has about 4 million rows and the plan is far from optimal?
If I run this query in Management Studio, it is very fast!
Is it just the stats that are wrong?
Regards
Vincent - The Netherlands.
I would suggest you try adding OPTION (HASH JOIN) at the end of the query, if possible. Once you get into large arity, a loops join is not particularly efficient. That would prove out whether there is a more efficient plan (likely yes).
Without seeing more of the details (your screenshots are helpful but cut off whether auto-param or forced parameterization has kicked in and auto-parameterized your query), it is hard to confirm/deny this explicitly. You can read more about parameter sniffing in a blog post I wrote a bit longer ago than I care to admit ;) :
https://blogs.msdn.microsoft.com/queryoptteam/2006/03/31/i-smell-a-parameter/
Ultimately, if you update stats, run DBCC FREEPROCCACHE, or otherwise cause this plan to recompile, your odds of getting a faster plan in the cache are higher if this particular query + parameter values is executed often enough to be sniffed during plan compilation. Your other option is to add an OPTIMIZE FOR UNKNOWN hint, which disables sniffing and directs the optimizer to use an average value for the selectivity of any filters over parameter values. This will likely encourage more hash or merge joins instead of loops joins, since the cardinality estimates of the operators in the tree will likely increase.
I'm using a Neo4j database to track connections between people. I need to track 3rd-order connections (something similar to how LinkedIn does this), but I've faced some issues with performance. In my test database I have approximately 3 thousand users with 3 to 8 first-order connections (contacts) each. When fetching second-order connections, performance seems fine, but fetching 3rd-order connections takes a long time. I use Cypher queries to fetch the data. Only profile ids and the connections between them are stored in the database.
here is the query itself:
THIRD_ORDER_CONNECTIONS = <<-CYPHER
START n=node:profile(id='%{id}')
MATCH n-[:contacts]-common_contact_1-[:contacts]-common_contact_2-[:contacts]-profile
WHERE common_contact.id <> %{exclude_id} AND common_contact_1.id <> common_contact_2.id
RETURN COLLECT(DISTINCT profile.id)
CYPHER
It takes 48 seconds on my local machine. So the question is - how can I improve the performance, or change the query, to get 3rd-order connections in a reasonable time?
Your query is not valid: common_contact.id does not resolve to an identifier
How many results do you get back?
How does the query time change if you add a direction --> to your query?
Please use parameters instead of ruby substitution.
Try RETURN profile.id (DISTINCT needs to keep everything in memory for the uniqueness filtering)
Normally cypher takes care of uniqueness, so common_contact_1.id <> common_contact_2.id might be unnecessary
Have you tried Neo4j version 1.9.M01? There are Cypher performance improvements for straightforward patterns like this, where it off-loads more work to the traversal framework; that could make a huge difference.