I'm new to the Neo4j Cypher query language. I'm discovering it while analyzing a graph of person-to-person relationships coming from a CRM system. I'm using Neo4j 2.1.2 Community Edition with Oracle Java JDK 1.7.0_45 on Windows 7 Enterprise, and interacting with Neo4j through the web interface.
One thing puzzles me: I noticed that the result sets of some of my queries grow over time. That is, if I run the same query after having used the database for quite a long time (1 or 2 hours later), I get a few more results the second time, without having updated, deleted or added anything in the database.
Is that possible? Are there special cases where it could happen? I would expect the database results to be consistent over time, as long as there is no change to the database.
It feels as if the database were growing its indexes over time in the background, and as if the query results depended on the database engine's ability to reach more nodes and relationships through the grown indexes. Could it be a memory or index configuration issue? Or did I possibly have too much coffee? Alas, it is not easily reproducible.
Sample query:
MATCH (pf:Portfolio)<-[:withRelation]-(p1:Partner)-[:JOINTACC]->(p2:Partner)
WHERE (pf.dateBoucl = '') AND (pf.catClient = 'NO')
AND NOT (p2)-[:relTo]->(:Partner)
MATCH (p1)-[r]->(p3:Partner)
WHERE NOT (p3)-[:relTo]->(:Partner)
AND NOT TYPE( r) IN [ 'relTo', 'ADRESSAT', 'MEMBER']
WITH pf, p1, p2, COLLECT( TYPE( r)) AS types
WHERE ALL( t IN types WHERE t = 'JOINTACC')
RETURN pf.catClient, pf.natureTitulaire, COUNT( DISTINCT pf);
At first I got 98 results. When running it 2 hours later, I get 103 results, and then it seems stable for subsequent runs. And I'm pretty sure I did not change the database contents.
Any hints are much appreciated! Kind regards
Schema looks like this:
:schema
Indexes
ON :Country(ID) ONLINE (for uniqueness constraint)
ON :Partner(partnerID) ONLINE (for uniqueness constraint)
ON :Portfolio(partnerID) ONLINE
ON :Portfolio(noCli) ONLINE
ON :Portfolio(noDos) ONLINE
Constraints
ON (partner:Partner) ASSERT partner.partnerID IS UNIQUE
ON (country:Country) ASSERT country.ID IS UNIQUE
Dump / download your query results from both runs and do a diff on them. Then you see what differs and you can investigate where it came from.
Perhaps you also should update to 2.1.3 which has one caching problem resolved that could be related to this.
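One way to make the two runs diffable is to return one identifying value per matched portfolio instead of an aggregate count. A minimal sketch, assuming `noDos` (one of the indexed Portfolio properties in the schema above) identifies a portfolio well enough for comparison; everything else is the query from the question:

```cypher
// Same pattern and filters as the original query, but returning one row per
// portfolio so that two exports can be compared line by line.
MATCH (pf:Portfolio)<-[:withRelation]-(p1:Partner)-[:JOINTACC]->(p2:Partner)
WHERE pf.dateBoucl = '' AND pf.catClient = 'NO'
  AND NOT (p2)-[:relTo]->(:Partner)
MATCH (p1)-[r]->(p3:Partner)
WHERE NOT (p3)-[:relTo]->(:Partner)
  AND NOT TYPE(r) IN ['relTo', 'ADRESSAT', 'MEMBER']
WITH pf, p1, p2, COLLECT(TYPE(r)) AS types
WHERE ALL(t IN types WHERE t = 'JOINTACC')
// Assumption: noDos is a stable identifier for a Portfolio node.
RETURN DISTINCT pf.noDos AS noDos, pf.catClient, pf.natureTitulaire
ORDER BY noDos;
```

Exporting this output from both runs and diffing the two files should show which portfolios account for the extra rows.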
Related
I have a decently large graph (1.8 billion nodes and roughly the same number of relationships) where I am performing the following query:
MATCH (n:Article)
WHERE n.id IN $pmids
MATCH (n)-[:HAS_MENTION]->(m:Mention)
WITH n, collect(m) as mentions
RETURN n.id as pmid, mentions
ORDER BY pmid
where $pmids is a list of strings, e.g. ["1234", "4567"], whose length varies from 100 to 500.
I am currently holding the data in a Neo4j Docker community instance with the following conf modifications: NEO4J_dbms_memory_pagecache_size=32G, NEO4J_dbms_memory_heap_max__size=32G. An index has been created for Article.id.
This query has been quite slow to run (roughly 5 seconds) and I would like to optimize it for a faster runtime. Through work, I have access to Neo4j Enterprise, so one approach would be to ingest this data as part of a Neo4j Enterprise account where I can tweak advanced configuration settings.
In general, does anyone have any tips on how I may improve performance, whether by optimizing the Cypher query itself, increasing workers, or changing other settings in neo4j.conf?
Thanks in advance.
For anyone interested - I posed this question in the Neo4j forums as well, and there have already been some interesting optimization suggestions (especially around the "type hint" to trigger backward-indexing, and using pattern comprehension instead of collect()).
Initial thoughts
You are using a string field to store the PMID, but PMIDs are numeric. Storing them as int (and indexing and searching them as int) might reduce the database size and possibly perform better.
If the PMID list is usually large and the server has more than half a dozen cores, it might be worth looking into the APOC parallel Cypher functions.
Do you really need every property from the Mention nodes? If not, try gathering just what you need.
What is the size of the database in GB? (Some context is required in terms of memory settings.) And what did neo4j-admin memrec recommend?
If this is how the db is always used, all the time, a SQL database might be better. When building that SQL db, collect the mentions into one field (once and done).
Note: Go PubMed!
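The pattern-comprehension and "gather only what you need" suggestions could look roughly like this. It is a sketch, not a measured improvement: `m.text` is a hypothetical property standing in for whichever Mention fields are actually needed, and it assumes a Neo4j version that supports pattern comprehensions.

```cypher
MATCH (n:Article)
WHERE n.id IN $pmids
RETURN n.id AS pmid,
       // m.text is a placeholder; project only the Mention fields you need.
       [(n)-[:HAS_MENTION]->(m:Mention) | m.text] AS mentions
ORDER BY pmid
```

One behavioural difference to keep in mind: the original query drops articles that have no mentions, whereas this form returns them with an empty list.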
Why is ORDER BY in Neo4j very slow for a large database? :(
Here is the example query:
PROFILE MATCH (n:Item) RETURN n ORDER BY n.name Desc LIMIT 25
As a result, it reads all records, even though I already created an index on the name property.
Here is the result (linked PROFILE output, not reproduced here).
It reads all nodes, which is a real mess for a large number of records.
Is there any solution for this?
Or is Neo4j not a good choice for us either? :(
Also, is there any way to get the last record from the nodes?
Your question and problem are not very clear.
1) Are you sure that you added the index correctly?
CREATE INDEX ON :Item(name)
In the Neo4j browser execute :schema to see all your indexes.
2) How many Items does your database hold and what running time are you expecting and achieving?
3) What do you mean by 'last record from nodes'?
Indexes are currently only used to find entry points into the graph, but not for other uses including ordering of results.
Index-backed ORDER BY operations have been a highly requested feature for a while, and while we've been tracking it and weighing its priority, several other features took priority over this work.
I believe index-backed ORDER BY operations are currently scheduled very soon, for our 3.5 release coming in the last few months of 2018.
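Once index-backed ORDER BY is available, the planner generally still needs a predicate on the indexed property before it considers the index at all. A hedged sketch, assuming a release with that feature (3.5 or later, per the answer above) and an index on :Item(name); whether the planner actually takes the order from the index, and for which sort direction, depends on the version and should be confirmed with PROFILE:

```cypher
PROFILE
MATCH (n:Item)
// Assumption: a trivially true string predicate lets the planner pick the
// index on :Item(name) instead of scanning every :Item node.
WHERE n.name STARTS WITH ''
RETURN n
ORDER BY n.name DESC
LIMIT 25
```

If the plan still contains a Sort over all Item rows, the index is not providing the order in that version.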
I am using Neo4j graph database version 2.1.7. Brief details about the data:
2 million nodes of 6 different types, 5 million relationships of only 5 different types, and a mostly connected graph that contains a few isolated subgraphs.
While resolving paths, I get cycles in the path. To prevent that, I used the solution shared in the question below:
Returning only simple paths in Neo4j Cypher query
Here is the query I am using:
MATCH (n:nodeA{key:905728})
MATCH path = n-[:rel1|rel2|rel3|rel4*0..]->(c:nodeA)-[:rel5*0..1]->(b:nodeA)
WHERE ALL(a in nodes(path) where 1=length (filter (m in nodes(path) where m=a)))
and (length(EXTRACT (p in NODES(path)| p.key)) > 1)
and ((exists ((c)-[:rel5]->(b)) and (not exists((b)-[:rel1|rel2|rel3|rel4]->(:nodeA)) OR ANY (x in nodes(path) where (b)-[]->(x))))
OR (not exists ((c)-[:rel5]->()) and (not exists ((c)-[:rel1|rel2|rel3|rel4]->(:nodeA)) OR ANY (x in nodes(path) where (c)-[]->(x)))))
RETURN distinct EXTRACT (rp in Rels(path)| type(rp)), EXTRACT (p in NODES(path)| p.key);
The above query meets my requirement, but it is not cost-effective and keeps running when it is run against a huge subgraph. I have used the PROFILE command to improve query performance from what I started with, but now I am stuck at this point. The performance has improved, but it is not what I expected from Neo4j :(
I don't know that I have a solution, but I have a number of suggestions. Some might speed things up, some might just make the query easier to read.
Firstly, rather than putting exists ((c)-[:rel5]->(b)) in your WHERE, I believe you can put it in your MATCH like this:
MATCH path = n-[:rel1|rel2|rel3|rel4*0..]->(c:nodeA)-[:rel5*0..1]->(b:nodeA), (c)-[:rel5]->(b)
I don't think you need the exists keyword. I think you can just say, for example, (NOT (b)-[:rel1|rel2|rel3|rel4]->(:nodeA))
I'd also suggest thinking about the WITH clause for potential performance improvements.
A couple of notes about your variable paths: In *0.. the 0 means that you're potentially looking for a self-reference. That may or may not be what you want. Also, leaving the variable path open-ended can often cause performance problems (as I think you're seeing). If you can possibly cap it, that may help.
Also, if you upgrade to 2.2.1, there are a number of built-in performance improvements in the 2.2.x line, and you also get visual PROFILE output in the console and a new EXPLAIN command, which shows the execution plan without running the query (PROFILE, by contrast, runs the query and reports the actual db hits).
One thing to consider too is that I don't think you're hitting the performance boundaries of Neo4j but rather, perhaps, some boundaries of Cypher. If so, I might suggest you do your querying with the Java APIs that Neo4j provides, for better performance and more control. This can either be via embedding your database, if you're using a JVM-compatible language, or by writing an unmanaged extension, which lets you do your own querying in Java but provide a custom REST API from the server.
I made a couple more tweaks to my query, as suggested above by Brian, and found an improvement in query response time. It now takes about 20% of the execution time of my original query, and it makes almost 60% fewer db hits than the query I shared earlier. Please find the updated query below:
MATCH (n:nodeA{key:905728})
MATCH path = n-[:rel1|rel2|rel3|rel4*1..]->(c:nodeA)-[:rel5*0..1]->(b:nodeA)
WHERE ALL(a in nodes(path) where 1=length (filter (m in nodes(path) where m=a)))
and (length(path) > 0)
and ((exists ((c)-[:rel5]->(b)) and (not ((c)-[:rel1|rel2|rel3|rel4]->()) OR ANY (x in nodes(path) where (c)-[]->(x))))
OR (not exists ((c)-[:rel5]->()) and (not ((c)-[:rel1|rel2|rel3|rel4]->()) OR ANY (x in nodes(path) where (c)-[]->(x)))))
RETURN distinct EXTRACT (rp in Rels(path)| type(rp)), EXTRACT (p in NODES(path)| p.key);
I also observed a dramatic improvement when I capped the path from *1.. to *1..15. In addition, I removed one filter from the query, which was also taking a long time.
However, the query response time still increased when querying nodes whose relationships go more than 18-20 levels deep.
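For reference, the cap described above only changes the upper bound of the variable-length part of the pattern. A sketch using the bound of 15 from the post and the 2.x syntax of the original query; the relationship-existence conditions are omitted for brevity, and the filter that was additionally removed is not shown because the post does not say which one it was:

```cypher
MATCH (n:nodeA {key: 905728})
// *1..15 bounds the expansion; an open-ended *1.. explores every depth.
MATCH path = n-[:rel1|rel2|rel3|rel4*1..15]->(c:nodeA)-[:rel5*0..1]->(b:nodeA)
// Keep only simple paths: every node may appear at most once.
WHERE ALL(a IN nodes(path) WHERE 1 = length(filter(m IN nodes(path) WHERE m = a)))
RETURN DISTINCT EXTRACT(rp IN rels(path) | type(rp)), EXTRACT(p IN nodes(path) | p.key);
```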
I would advise using the PROFILE command often to find the pain points in your query. That will help you resolve issues faster.
Thanks Brian.
I have identified that some queries return fewer results than expected. I took one of the missing results and tried to force Neo4j to return it, and I succeeded with the following query:
match (q0),(q1),(q2),(q3),(q4),(q5)
where
q0.name='v4' and q1.name='v3' and q2.name='v5' and
q3.name='v1' and q4.name='v3' and q5.name='v0' and
(q1)-->(q0) and (q0)-->(q3) and (q2)-->(q0) and (q4)-->(q0) and
(q5)-->(q4)
return *
I have supposed that the following query is semantically equivalent to the previous one. However in this case, Neo4j returns no result at all.
match (q1)-->(q0), (q0)-->(q3), (q2)-->(q0), (q4)-->(q0), (q5)-->(q4)
where
q0.name='v4' and q1.name='v3' and q2.name='v5' and
q3.name='v1' and q4.name='v3' and q5.name='v0'
return *
I have also manually verified that the required edges among vertices v0, v1, v3, v4 and v5 are present in the database with right directions.
Am I missing some important difference between these queries or is it just a bug of Neo4j? (I have tested these queries on Neo4j 2.1.6 Community Edition.)
Thank you for any advice
/EDIT: Updating to newest version 2.2.1 was of no help.
This might not be a complete answer, but here's what I found out.
These queries aren't synonymous, if I understand correctly.
First of all, use EXPLAIN (or even PROFILE) to look under the hood. The first query will be executed as follows:
The second query:
As you can see (even without going deep down), those are different queries in terms of both efficiency and semantics.
Next, what's actually going on here:
The 1st query will look through all (single) nodes, filter them by name, then try to group them according to your pattern, which involves computing a Cartesian product (hence the enormous space complexity), then collect those groups into larger ones, and then evaluate your other conditions.
The 2nd query will first pick a pair of nodes connected by some relationship (which satisfy the condition on the name property), then throw in the third node and filter again, and so on till the end. The number of nodes is expected to decrease after every filter cycle.
By the way, is it possible that you accidentally set the same name twice? Both q1 and q4 are required to have the name 'v3'.
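If the repeated name is intentional, one thing worth checking is Cypher's relationship uniqueness rule: within a single MATCH clause, one relationship in the graph cannot be bound to two relationship slots of the pattern. Here `(q1)-->(q0)` and `(q4)-->(q0)` both need a relationship from v3 to v4, so if only one such relationship exists, the single-MATCH form cannot match, whereas pattern predicates in WHERE are not subject to the rule. A sketch that splits the pattern across two MATCH clauses (uniqueness is enforced per clause), assuming there is exactly one relationship from v3 to v4:

```cypher
MATCH (q1)-->(q0), (q0)-->(q3), (q2)-->(q0)
WHERE q0.name = 'v4' AND q1.name = 'v3' AND q2.name = 'v5' AND q3.name = 'v1'
// A second MATCH may reuse the v3 -> v4 relationship already bound above.
MATCH (q5)-->(q4), (q4)-->(q0)
WHERE q4.name = 'v3' AND q5.name = 'v0'
RETURN *
```

If this form returns the missing row, the difference between the two original queries comes from the uniqueness rule rather than from a bug.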
I'm toying around with Neo4j. My data consists of users who own objects, which are tagged with tags. Therefore my schema looks like:
(:User)-[:OWNS]->(:Object)-[:TAGGED_AS]->(:Tag)
I have written a script that generates a sample graph for me. Currently I have 100 User, ~2500 Tag and ~10k Object nodes in the database. Between those I have ~700k relationships. I now want to find every Object that is not owned by a certain User but is related via a Tag that the User has used himself. The query looks like:
MATCH (user:User {username: 'Cristal'})
WITH user
MATCH (user)-[:OWNS]->(obj:Object)-[:TAGGED_AS]->(tag:Tag)<-[:TAGGED_AS]-(other:Object)
WHERE NOT (user)-[:OWNS]->(other)
RETURN other
LIMIT 20
However, this query runs for ~1-5 minutes (depending on the user and how many objects he owns), which is more than a bit too slow. What am I doing wrong? I consider this a rather "trivial" query against a graph of modest size. I'm using Neo4j 2.1.6 Community and have already set the Java heap to 2000 MB (and I can see that there is a Java process using this much). Am I missing an index or something like that (I'm new to Neo4j)?
I honestly expected the result to be pretty much instant, especially considering that the Neo4j docs mention that I should use a heap between 1 and 4 GB for 100 million objects... and I'm only at close to 1/100 of that number.
If it is my Query (which I hope and expect) how can I improve it? What is something you have to be aware when writing queries?
Do you have an index on the username property?
CREATE INDEX ON :User(username)
Also you don't really need that WITH there, so maybe drop it to see if it helps:
MATCH (user:User {username: 'Cristal'})-[:OWNS]->(obj:Object)-[:TAGGED_AS]->(tag:Tag)<-[:TAGGED_AS]-(other:Object)
WHERE NOT (user)-[:OWNS]->(other)
RETURN other
LIMIT 20
Also, I don't think it will make a difference, but you can drop the obj and tag variables since you're not using them elsewhere in the query.
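Put together, the query without the unused variables might look like this. DISTINCT is an addition that is not part of the answer above: an object that shares several tags with the user's objects is otherwise returned once per connecting path, so LIMIT 20 could be filled with duplicates.

```cypher
MATCH (user:User {username: 'Cristal'})-[:OWNS]->(:Object)-[:TAGGED_AS]->(:Tag)<-[:TAGGED_AS]-(other:Object)
WHERE NOT (user)-[:OWNS]->(other)
// DISTINCT is an assumption about the desired result: each object at most once.
RETURN DISTINCT other
LIMIT 20
```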
Also, if you're generating sample graphs you may want to check out GraphGen:
http://graphgen.neoxygen.io/