Why does a traversal that filters on the partition key need a full-table scan? - datastax-enterprise

I'm investigating a possible bug with partition scans using custom vertex IDs in DSE Graph.
For some reason, selecting a vertex by its full ID works as expected, but retrieving the whole partition results in a full table scan (i.e. a graph scan warning).
Given this vertex label definition:
schema.vertexLabel('word_hoard').partitionKey('_partition').clusteringKey('wordhoard_id').create()
g.V().hasLabel('word_hoard').has('_partition', 'localhost').has('wordhoard_id', '60bcaeff-f6e5-11e5-9ce9-00aaaaaaaaaa')
produces efficient CQL, as expected:
SELECT * FROM topics_dev.word_hoard_p
WHERE "_partition" = 'localhost' AND wordhoard_id = 60bcaeff-f6e5-11e5-9ce9-00aaaaaaaaaa;
g.V().hasLabel('word_hoard').has('_partition', 'localhost')
however, generates CQL that appears to ignore the partition key:
SELECT "_partition", "wordhoard_id" FROM "topics_dev"."word_hoard_p"
WHERE "~~vertex_exists" = true
To avoid this unnecessary table scan, I would expect something like:
SELECT * FROM topics_dev.word_hoard_p
WHERE "_partition" = 'localhost';
This CQL query performs well, but I cannot seem to generate it with a Gremlin traversal.
Does anyone have experience with this issue?
Should I approach it differently, or is this a genuine bug in DSE or TinkerPop?
UPDATE 2018-10-30: this issue still exists as of DSE 6.0.4
UPDATE 2019-10-19: a solution is available for testing in the DataStax Labs graph engine (experimental; non-production): https://community.datastax.com/answers/1150/view.html

Related

Query optimization that collects and orders nodes on a very large graph

I have a decently large graph (1.8 billion nodes and roughly the same number of relationships) on which I am performing the following query:
MATCH (n:Article)
WHERE n.id IN $pmids
MATCH (n)-[:HAS_MENTION]->(m:Mention)
WITH n, collect(m) as mentions
RETURN n.id as pmid, mentions
ORDER BY pmid
where $pmids is a list of strings, e.g. ["1234", "4567"], whose length varies from 100 to 500.
I am currently holding the data in a Neo4j Docker Community instance with the following conf modifications: NEO4J_dbms_memory_pagecache_size=32G, NEO4J_dbms_memory_heap_max__size=32G. An index has been created for Article.id.
This query has been quite slow to run (roughly 5 seconds) and I would like to optimize it for a faster runtime. Through work I have access to Neo4j Enterprise, so one approach would be to ingest this data into a Neo4j Enterprise instance where I can tweak advanced configuration settings.
In general, does anyone have tips on how I might improve performance, whether by optimizing the Cypher query itself, increasing workers, or changing other settings in neo4j.conf?
Thanks in advance.
For anyone interested - I posed this question in the neo4j forums as well, and there have already been some interesting optimization suggestions (especially around the "type hint" to trigger backward-indexing, and using pattern comprehension instead of collect()).
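For reference, a minimal sketch of what the pattern-comprehension rewrite might look like (my reading of the forum suggestion, not a benchmarked version; it drops the second MATCH and the collect()):
MATCH (n:Article)
WHERE n.id IN $pmids
RETURN n.id AS pmid,
       [ (n)-[:HAS_MENTION]->(m:Mention) | m ] AS mentions  // pattern comprehension replaces the second MATCH + collect()
ORDER BY pmid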
Initial thoughts
You are using a string field to store the PMID, but PMIDs are numeric; storing them as ints (and indexing and searching them as ints) might reduce the database size and possibly perform better (a one-off conversion sketch follows this list)
If the PMID list is usually large and the server has more than half a dozen cores, it might be worth looking into the APOC parallel Cypher functions
Do you really need every property from the Mention nodes? If not, try gathering just what you need
What is the size of the database in GB? (Some context is needed for the memory settings.) And what did neo4j-admin memrec recommend?
If this is how the DB is always used, all the time, a SQL database might be better; when building that SQL DB, collect the mentions into one field (once and done)
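For the first point, a rough one-off migration sketch (assuming every Article.id really is numeric; the new pmid property name is just illustrative):
// Store PMIDs as integers alongside the string id, then index the new property.
// On 1.8 billion nodes this should be batched (e.g. with apoc.periodic.iterate) rather than run as one transaction.
MATCH (n:Article)
SET n.pmid = toInteger(n.id);
CREATE INDEX ON :Article(pmid);
// The lookup then becomes: MATCH (n:Article) WHERE n.pmid IN $pmids ...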
Note: Go PubMed!

Can "DISTINCT" in a CYPHER query be responsible of a memory error when the query returns no result?

Working on a pretty small graph of 5000 nodes with low density (mean connectivity < 5), I get the following error, which I never got before upgrading to Neo4j 3.3.0. The graph contains 900 molecules and their scaffold hierarchy, down to 5 levels.
(:Molecule)<-[:substructureOf*1..5]-(:Scaffold)
Neo.TransientError.General.StackOverFlowError
There is not enough stack size to perform the current task. This is generally considered to be a database error, so please contact Neo4j support. You could try increasing the stack size: for example to set the stack size to 2M, add `dbms.jvm.additional=-Xss2M' to in the neo4j configuration (normally in 'conf/neo4j.conf' or, if you are using Neo4j Desktop, found through the user interface) or if you are running an embedded installation just add -Xss2M as command line flag.
The query is actually very simple; I use DISTINCT because several paths may lead to a single scaffold.
match (m:Molecule) <-[:substructureOf*3]- (s:Scaffold) return distinct s limit 20
This query returns the above error message, whereas the next query does work.
match (m:Molecule) <-[:substructureOf*3]- (s:Scaffold) return s limit 20
Interestingly, the query works on a much larger database; in this small one the deepest hierarchy happened to be 2, so the result of the last query is "(no changes, no records)".
How come adding DISTINCT to the query fails with that memory error? Is there a way to avoid it? I cannot guess the depth of the hierarchy, which can be different for each molecule.
I tried the following heap settings, as suggested in other posts.
#dbms.memory.heap.initial_size=512m
#dbms.memory.heap.max_size=512m
dbms.memory.heap.initial_size=512m
dbms.memory.heap.max_size=4096m
dbms.memory.heap.initial_size=4096m
dbms.memory.heap.max_size=4096m
None of these addressed the issue.
Thanks in advance for any help or clues.
Thanks for the additional info. I was able to replicate this on Neo4j 3.3.0 and 3.3.1, and it likely has to do with the behavior of the pruning-var-expand operation (meant to help when using variable-length expansions with distinct results) that was introduced in 3.2.x; it only occurs when using an exact number of expansions (not a range). Neo4j engineering will be looking into this.
In the meantime, your requirement is such that a different kind of query can get the results you want while avoiding this operation. Give this one a try:
match (s:Scaffold)
where (s)-[:substructureOf*3]->(:Molecule)
return distinct s limit 20
And if you do need to run queries that may produce this error, you may be able to circumvent it by prepending your query with CYPHER 3.1, which executes it with a plan produced by an older version of Cypher that doesn't use the pruning var expand operation.
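Applied to the failing query from above, that would look something like:
CYPHER 3.1 match (m:Molecule) <-[:substructureOf*3]- (s:Scaffold) return distinct s limit 20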

Will Gremlin graph queries always perform operations in their own address space?

Admittedly, most of my database experience is relational. One of the tenets in that space is to avoid moving data over the network. This manifests in using something like:
select * from person order by last_name limit 10
which will presumably order and limit within the database engine, versus using something like:
select * from person
and subsequently ordering and taking the top 10 at the client, which could have disastrous effects if there are a million person records.
So, with Gremlin (from Groovy), if I do something like:
g.V().has('#class', 'Person').order{println('!'); it.a.last_name <=> it.b.last_name}[0..9]
I am seeing the ! printed, so I am assuming that this brings all Person records into the address space of my client prior to the order and limit steps, which is not the desired effect.
Do my options for processing queries entirely in the database engine become product-specific (e.g. for OrientDB, perhaps submitting the query in their flavor of SQL), or is there something about Gremlin that I am missing?
If you want the implementer's query optimizer to kick in, you need to use as many Gremlin steps as possible and avoid pure Groovy/in-memory processing of your graph traversals.
You're most likely looking for something like this (as of TinkerPop v3.2.0):
g.V().has('#class', 'Person').order().by('last_name', incr).limit(10)
If you find yourself using lambdas, chances are often high that this could be done with pure Gremlin steps. Favor Gremlin steps over lambdas.
See TinkerPop v3.2.0 documentation:
Order By step
Limit step

Cypher query resultset growing over subsequent runs?

I'm new to Neo4j's Cypher query language. I'm discovering it while analyzing a graph of person-to-person relationships coming from a CRM system. I'm using Neo4j 2.1.2 Community Edition with Oracle Java JDK 1.7.0_45 on Windows 7 Enterprise, and interacting with Neo4j through the web interface.
One thing puzzles me: I noticed that the result sets of some of my queries grow over time. That is, if I run the same query after having used the database for quite a long time (1 or 2 hours later), I get a few more results the second time -- having not updated, deleted or added anything to the database.
Is that possible? Are there special cases where it could happen? I would expect the database results to be consistent over time, as long as there is no change to the database.
It feels as if the database were growing its indexes over time in the background, and as if the query results depended on the database engine's ability to reach more nodes and relationships through the grown indexes. Could it be a memory or index configuration issue? Or did I possibly get too much coffee? Alas, it is not easily reproducible.
Sample query:
MATCH (pf:Portfolio)<-[:withRelation]-(p1:Partner)-[:JOINTACC]->(p2:Partner)
WHERE (pf.dateBoucl = '') AND (pf.catClient = 'NO')
AND NOT (p2)-[:relTo]->(:Partner)
MATCH (p1)-[r]->(p3:Partner)
WHERE NOT (p3)-[:relTo]->(:Partner)
AND NOT TYPE( r) IN [ 'relTo', 'ADRESSAT', 'MEMBER']
WITH pf, p1, p2, COLLECT( TYPE( r)) AS types
WHERE ALL( t IN types WHERE t = 'JOINTACC')
RETURN pf.catClient, pf.natureTitulaire, COUNT( DISTINCT pf);
At first I got 98 results. Running it 2 hours later, I got 103 results, and it has seemed stable for subsequent runs. And I'm pretty sure I did not change the database contents.
Any hints very much appreciated! Kind regards
Schema looks like this:
:schema
Indexes
ON :Country(ID) ONLINE (for uniqueness constraint)
ON :Partner(partnerID) ONLINE (for uniqueness constraint)
ON :Portfolio(partnerID) ONLINE
ON :Portfolio(noCli) ONLINE
ON :Portfolio(noDos) ONLINE
Constraints
ON (partner:Partner) ASSERT partner.partnerID IS UNIQUE
ON (country:Country) ASSERT country.ID IS UNIQUE
Dump / download your query results from both runs and do a diff on them. Then you can see what differs and investigate where it came from.
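For example, one way to make the two runs comparable is to return identifiers instead of the aggregated counts. A sketch, assuming pf.noDos identifies a Portfolio (your schema indexes it, but the question doesn't say it is unique):
MATCH (pf:Portfolio)<-[:withRelation]-(p1:Partner)-[:JOINTACC]->(p2:Partner)
WHERE (pf.dateBoucl = '') AND (pf.catClient = 'NO')
AND NOT (p2)-[:relTo]->(:Partner)
MATCH (p1)-[r]->(p3:Partner)
WHERE NOT (p3)-[:relTo]->(:Partner)
AND NOT TYPE(r) IN ['relTo', 'ADRESSAT', 'MEMBER']
WITH pf, p1, p2, COLLECT(TYPE(r)) AS types
WHERE ALL(t IN types WHERE t = 'JOINTACC')
// identifiers instead of COUNT(...) so the result sets from both runs can be diffed line by line
RETURN DISTINCT pf.noDos AS portfolio
ORDER BY portfolio;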
Perhaps you also should update to 2.1.3 which has one caching problem resolved that could be related to this.

Fastest way to load neo4j DB w/Cypher - how to integrate new subgraphs?

I'm loading a Neo4j database using Cypher commands piped directly into the neo4j-shell. Some experiments suggest that subgraph batches of about 1000 lines give the optimal throughput (about 3.2ms/line, 300 lines/sec (slow!), Neo4j 2.0.1). I use MATCH statements to bind existing nodes to the loading subgraph. Here's a chopped example:
begin
...
MATCH (domain75ea8a4da9d65189999d895f536acfa5:SubDomain { shorturl: "threeboysandanoldlady.blogspot.com" })
MATCH (domainf47c8afacb0346a5d7c4b8b0e968bb74:SubDomain { shorturl: "myweeview.com" })
MATCH (domainf431704fab917205a54b2477d00a3511:SubDomain { shorturl: "www.computershopper.com" })
CREATE
(article1641203:Article { id: "1641203", url: "http://www.coolsocial.net/sites/www/blackhawknetwork.com.html", type: 4, timestamp: 1342549270, datetime: "2012-07-17 18:21:10"}),
(article1641203)-[:PUBLISHED_IN]->(domaina9b3ed6f4bc801731351b913dfc3f35a),(author104675)-[:WROTE]->(article1641203),
....
commit
Using this (ridiculously slow) method, it takes several hours to load 200K nodes (~370K relationships) and, at that point, the loading slows down even more. I presume the asymptotic slowdown is due to the overhead of the MATCH statements; they make up half of the subgraph load statements by the time the graph hits 200K nodes. There's got to be a better way of doing this; it just doesn't scale.
I'm going to try rewriting the statements with parameters (refs: What is the most efficient way to insert nodes into a neo4j database using cypher AND http://jexp.de/blog/2013/05/on-importing-data-in-neo4j-blog-series/). I expect that to help, but it seems that I will still have problems making the subgraph connections. Would using MERGE or CREATE UNIQUE instead of the MATCH statements be the way to go? There must be best practice ways to do this that I'm missing. Any other speed-up ideas?
many thanks
Use MERGE, and do smaller transactions--I've found best results with batches of 50-100 (while doing index lookups). Bigger batches are better when doing CREATE only without MATCH. Also, I recommend using a driver to send your commands over the transactional API (with parameters) instead of via neo4j-shell--it tends to be a fair bit faster.
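A minimal sketch of what one such statement might look like with MERGE and parameters (property names taken from the chopped example above; the parameter names are illustrative, not from the original post):
// Sent once per article/domain pair over the transactional HTTP API, with a parameter map per row.
// Neo4j 2.x uses the {name} parameter syntax.
MERGE (d:SubDomain { shorturl: {shorturl} })
MERGE (a:Article { id: {articleId} })
ON CREATE SET a.url = {url}, a.type = {type}, a.timestamp = {timestamp}, a.datetime = {datetime}
MERGE (a)-[:PUBLISHED_IN]->(d)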
Alternatively (it might not be applicable to all use cases), keep a local "index" of the node IDs you've created. For only 200k items, this should easily fit in a normal map/dict of string->long. This avoids taxing the index on the DB: you can do only node-ID-based lookups and CREATE statements, and create the indexes later.
The load2neo plugin worked well for me. Installation was fast and painless, and it has a very Cypher-like command structure that easily supports uniqueness requirements. It works with Neo4j 2.0 labels.
load2neo install + curl usage example:
http://nigelsmall.com/load2neo
load2neo Geoff syntax:
http://nigelsmall.com/geoff
It is much faster (>>10x) than using Cypher via neo4j-shell.
I wasn't able to get parameters in Cypher working through neo4j-shell, despite trying everything I could find via internet search.
