I am using Neo4j Community 4.2.1, playing with graph databases. I plan to operate on lots of data and want to get familiar with indexes and related tuning.
However, I'm stuck at a very basic level because in the browser Neo4j reports query runtimes which have nothing to do with reality.
I'm executing the following query in the browser at http://localhost:7687/:
match (m:Method), (o:Method)
where m.name = o.name and m.name <> '<init>'
  and m.signature = o.signature and toInteger(o.access) % 8 in [1, 4]
return m, o
The DB has ~5000 nodes with the :Method label.
The browser returns data after about 30sec. However, Neo4j reports
Started streaming 93636 records after 1 ms and completed after 42 ms, displaying first 1000 rows.
Well, 42 ms and 30 sec are really far apart! What am I supposed to make of this message? Did the query take only milliseconds, with the remaining ~30 sec spent rendering the results in the browser? What is going on here? How can I improve my query if I cannot even tell how long it really ran?
I modified the query to return count(m) + count(o) instead of m, o, which changed things: the runtime is now about 2 sec, and Neo4j reports roughly the same figure.
Can somebody tell me how I can get realistic runtime figures for my queries without using the stopwatch on my phone?
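One way to get a server-side picture that is independent of browser rendering is to prefix a fully-aggregating version of the query with PROFILE (or EXPLAIN for the plan alone); the browser then shows the executed plan with db hits and rows per operator. For example, reconstructing the count variant described above:
PROFILE
match (m:Method), (o:Method)
where m.name = o.name and m.name <> '<init>'
  and m.signature = o.signature and toInteger(o.access) % 8 in [1, 4]
return count(m) + count(o)
Because the aggregate forces the server to consume every row, the reported time reflects the full query work rather than the lazy streaming of the first rows.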
Related
Working on a pretty small graph of 5000 nodes with low density (mean connectivity < 5), I get the following error, which I never got before upgrading to Neo4j 3.3.0. The graph contains 900 molecules and their scaffold hierarchy, down to 5 levels:
(:Molecule)<-[:substructureOf*1..5]-(:Scaffold)
Neo.TransientError.General.StackOverFlowError
There is not enough stack size to perform the current task. This is generally considered to be a database error, so please contact Neo4j support. You could try increasing the stack size: for example to set the stack size to 2M, add `dbms.jvm.additional=-Xss2M' to in the neo4j configuration (normally in 'conf/neo4j.conf' or, if you are using Neo4j Desktop, found through the user interface) or if you are running an embedded installation just add -Xss2M as command line flag.
The query is actually very simple; I use DISTINCT because several paths may lead to a single scaffold.
match (m:Molecule) <-[:substructureOf*3]- (s:Scaffold) return distinct s limit 20
This query returns the above error message, whereas the next query works:
match (m:Molecule) <-[:substructureOf*3]- (s:Scaffold) return s limit 20
Interestingly, the query works on a much larger database; in this small one, the deepest hierarchy happens to be 2, so the result of the last query is "No changes, no records".
How come adding DISTINCT to the query fails with that memory error? Is there a way to avoid it? I cannot guess the depth of the hierarchy, which can be different for each molecule.
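For reference, the depth-agnostic form I would eventually need uses an open-ended range (potentially expensive on deep hierarchies):
match (m:Molecule) <-[:substructureOf*]- (s:Scaffold) return distinct s limit 20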
I tried the following heap settings, as suggested in other posts:
#dbms.memory.heap.initial_size=512m
#dbms.memory.heap.max_size=512m
dbms.memory.heap.initial_size=512m
dbms.memory.heap.max_size=4096m
dbms.memory.heap.initial_size=4096m
dbms.memory.heap.max_size=4096m
None of these addressed the issue.
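Note that the error message itself suggests raising the stack size rather than the heap; per the message, that would be the following line in conf/neo4j.conf:
dbms.jvm.additional=-Xss2M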
Thanks in advance for any help or clues.
Thanks for the additional info. I was able to replicate this on Neo4j 3.3.0 and 3.3.1; it likely has to do with the behavior of the pruning-var-expand operation (which is meant to help when using variable-length expansions with distinct results), introduced in 3.2.x, and it occurs only when using an exact number of expansions (not a range). Neo4j engineering will be looking into this.
In the meantime, your requirement is such that we can use a different kind of query to get the results you want while avoiding this operation. Give this one a try:
match (s:Scaffold)
where (s)-[:substructureOf*3]->(:Molecule)
return distinct s limit 20
And if you do need to perform queries that may produce this error, you may be able to circumvent them by prepending your query with CYPHER 3.1, which will execute this with a plan produced by an older version of Cypher which doesn't use the pruning var expand operation.
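For example, the failing query with the planner prefix would look like this:
CYPHER 3.1
match (m:Molecule) <-[:substructureOf*3]- (s:Scaffold) return distinct s limit 20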
Just upgraded from 2.2.5 to 2.3.2, and a previous query that executed immediately now takes a considerable time. It seems linked to the depth, as reducing it from 5 to 3 makes it quicker.
More details are below.
The following is the Neo4j query used to search for recommended restaurants for the passed user_id, where the depth/degree of the search is kept at 5:
MATCH (u:User {user_id:"bf9203fba4484f96b4983152e9ee859a"})-[r*1..5]-(place:Place)
WHERE ALL (rel in r WHERE rel.rating >= 0)
RETURN DISTINCT place.place_id, place.title, length(r) as LoR ORDER BY LoR, place.title
The old server instance runs Neo4j 2.2.5, where the result is displayed instantly, but on the new VM with Neo4j 2.3.2 it takes quite a long time to return the result.
If we decrease the search depth to 2 or 3, queries run faster.
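For example, the depth-3 variant differs only in the range bound:
MATCH (u:User {user_id:"bf9203fba4484f96b4983152e9ee859a"})-[r*1..3]-(place:Place)
WHERE ALL (rel in r WHERE rel.rating >= 0)
RETURN DISTINCT place.place_id, place.title, length(r) as LoR ORDER BY LoR, place.title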
Anyone else experiencing this?
How do the query times compare when running completely non-warmed-up, e.g. just starting the server and running this query? I suspect property loading may be the biggest problem, due to the removal of the object cache in the 2.3 series.
Do the Place nodes and the relationships with the rating property have many properties each?
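If property loading is the culprit, one way to test it (a sketch, reusing the property names from the query above) is to touch the relevant properties once so they are pulled into cache, then re-run the original query:
MATCH (place:Place) RETURN count(place.place_id) + count(place.title)
MATCH ()-[rel]->() RETURN count(rel.rating)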
I'm new to the Neo4j Cypher query language and am discovering it while analyzing a graph of person-to-person relationships coming from a CRM system. I'm using Neo4j 2.1.2 Community Edition with Oracle Java JDK 1.7.0_45 on Windows 7 Enterprise, and interacting with Neo4j through the web interface.
One thing puzzles me: I noticed that the result sets of some of my queries grow over time; that is, if I run the same query after having used the database for quite a while (1 or 2 hours later), I get a few more results the second time -- having not updated, deleted or added anything in the database.
Is that possible? Are there special cases where it could happen? I would expect the database results to be consistent over time, as long as there is no change to the database.
It feels as if the database were growing its indexes in the background over time, and as if the query results depended on the engine's ability to reach more nodes and relationships through the grown indexes. Could it be a memory or index configuration issue? Or did I possibly get too much coffee? Alas, it is not easily reproducible.
Sample query:
// Portfolios (empty dateBoucl, catClient 'NO') whose partner p1 holds a
// joint account with a partner p2 that has no outgoing relTo
MATCH (pf:Portfolio)<-[:withRelation]-(p1:Partner)-[:JOINTACC]->(p2:Partner)
WHERE (pf.dateBoucl = '') AND (pf.catClient = 'NO')
AND NOT (p2)-[:relTo]->(:Partner)
// p1's other relationships to partners, excluding a few types
MATCH (p1)-[r]->(p3:Partner)
WHERE NOT (p3)-[:relTo]->(:Partner)
AND NOT TYPE(r) IN ['relTo', 'ADRESSAT', 'MEMBER']
// keep only rows where every remaining relationship type is JOINTACC
WITH pf, p1, p2, COLLECT(TYPE(r)) AS types
WHERE ALL(t IN types WHERE t = 'JOINTACC')
RETURN pf.catClient, pf.natureTitulaire, COUNT(DISTINCT pf);
At first I got 98 results. Running it 2 hours later, I got 103 results, and it then seemed stable for subsequent runs. And I'm pretty sure I did not change the database contents.
Any hints are very much appreciated! Kind regards
Schema looks like this:
:schema
Indexes
ON :Country(ID) ONLINE (for uniqueness constraint)
ON :Partner(partnerID) ONLINE (for uniqueness constraint)
ON :Portfolio(partnerID) ONLINE
ON :Portfolio(noCli) ONLINE
ON :Portfolio(noDos) ONLINE
Constraints
ON (partner:Partner) ASSERT partner.partnerID IS UNIQUE
ON (country:Country) ASSERT country.ID IS UNIQUE
Dump / download your query results from both runs and diff them. Then you can see what differs and investigate where it came from.
Perhaps you should also update to 2.1.3, which resolves a caching problem that could be related to this.
I am new to Neo4j and have set up a test graph DB for organizing some clickstream data, using a very small subset of what we actually use on a day-to-day basis. This graph has about 23 million nodes and 34 million relationships. The queries seem to take forever to run, i.e. I haven't seen a response come back even after waiting more than 30 minutes.
The data is organized as Year->Month->Day->Session{1..n}->Event{1..n}
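In pattern form (assuming the :HAS and :FOLLOWEDBY relationship types used in the query below connect the levels), the layout presumably looks like this:
(year)-[:HAS]->(month)-[:HAS]->(day)-[:HAS]->(session)-[:HAS]->(event)-[:FOLLOWEDBY]->(nextEvent)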
I am running the DB on a Windows 7 machine with 1.5 GB of heap allocated to the Neo4j server.
These are the settings in neo4j-wrapper.conf:
wrapper.java.additional.1=-Dorg.neo4j.server.properties=conf/neo4j-server.properties
wrapper.java.additional.2=-Djava.util.logging.config.file=conf/logging.properties
wrapper.java.additional.3=-Dlog4j.configuration=file:conf/log4j.properties
wrapper.java.additional.6=-XX:+UseParNewGC
wrapper.java.additional.7=-XX:+UseConcMarkSweepGC
wrapper.java.additional.8=-Xloggc:data/log/neo4j-gc.log
wrapper.java.initmemory=1500
wrapper.java.maxmemory=1500
This is what my query looks like
START n=node(3)
MATCH (n)-[:HAS]->(s)
WITH distinct s
MATCH (s)-[:HAS]->(e) WHERE e.page_name = 'Login'
WITH s.session_id as session, e
MATCH (e)-[:FOLLOWEDBY*0..1]->(e1)
WITH count(session) as session_cnt, e.page_name as startPage, e1.page_name as nextPage
RETURN startPage, nextPage, session_cnt
Also, I have these properties set:
node_auto_indexing=true
node_keys_indexable=name,page_name,geo_country
relationship_auto_indexing=true
Can anyone help me figure out what might be wrong?
Even when I run portions of the query it takes 10-15 minutes before I can see a response.
Note: I have no other applications running on the Windows Machine
Why would you want to return all the nodes in the first place?
If you really want to do that, use the transactional HTTP endpoint and curl to stream the response.
I tested it with a database of 100k nodes. It takes 0.9 seconds to transfer them (1.5MB) over the wire.
If you transfer all their properties by using "return n", it takes 1.4 seconds and results in 4.1MB transferred.
If you just want to know how many nodes are in your DB, use something like this instead:
match (n) return count(*);
I started testing Neo4j for a program and I am facing some performance issues. As mentioned in the title, Neo4j is directly embedded in the Java code.
My graph contains about 4 million nodes and several hundred million relationships. My test is simply to send a query counting the number of inbound relationships for a node.
This program uses the ExecutionEngine's execute method to send the following query:
start n=node:node_auto_index(id="United States") match s-[:QUOTES]->n return count(s)
By simply adding some print statements, I can see how much time this query takes, which is usually about 900 ms; that is a lot.
What surprises me the most is that I receive a "query execution time" in the response, which is really different.
For instance a query returned:
+----------+
| count(n) |
+----------+
| 427738 |
+----------+
1 row
1 ms
According to this response, I understand that Neo4j took 1 ms for the query, but when I print some log messages I can see that it actually took 917 ms.
I guess that 1 ms corresponds to the time required to find the indexed node "United States", which would mean that Neo4j needed about 916 ms for the rest, such as counting the number of relationships. In this case, how can I get better performance for this query?
Thanks in advance!
Query timers were broken in 1.8.1 and 1.9.M04, when the Cypher laziness issue was fixed (definitely a worthwhile trade for most use cases). But yeah, I think it will be fixed soon.
For now you'll have to time things externally.
Update:
As for your question about whether that time is reasonable: it basically needs to traverse all ~427k :QUOTES relationships to count them. This is probably reasonable, even if the cache is warmed up and all of those fit into RAM. Having "super nodes" like this is usually not best practice if it can be avoided, although they are going to be making a lot of improvements for this case in future versions (at least, that's what I hear).
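One common workaround for dense nodes like this is to precompute the degree and cache it as a property on the node, so reads become a property lookup; a sketch in the index-based Cypher of that era (the quote_count property name is hypothetical):
start n=node:node_auto_index(id="United States")
match s-[:QUOTES]->n
with n, count(s) as cnt
set n.quote_count = cnt
return n.quote_count
The cached value then has to be kept in sync whenever QUOTES relationships are added or removed.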
Make sure not to measure the first run of the query, because that one mostly measures how long it takes to load the data from disk into memory.
Make sure to give Neo4j enough memory to cache your data.
And try this query to see if it is faster:
start n=node:node_auto_index(id="United States")
return length(()-[:QUOTES]->n) as cnt