I have a very long Cypher request in my app (running on Node.Js and Neo4j 2.0.1), which creates at once about 16 nodes and 307 relationships between them. It is about 50K long.
The high number of relationships is determined by the data model, which I probably want to change later, but nevertheless, if I decide to keep everything as it is, two questions:
1) What would be the maximum size of each single Cypher request I send to Neo4J?
2) What would be the best strategy to deal with a request that is too long? Split it into the smaller ones and then batch them in a transaction? I wouldn't like to do that because in this case I lose the consistency that I had resulting from a combination of MERGE and CREATE commands (the request automatically recognized some nodes that did not exist yet, create them, and then I could make relations between them using their indices that I already got through the MERGE).
Thank you!
I usually recommend to
Use smaller statements, so that the query plan cache can kick in and execute your query immediately without compiling, for this you also need
parameters, e.g. {context} or {user}
I think a statement size of up to 10-15 elements is easy to handle.
You can still execute all of them in a single tx with the transactional cypher endpoint, which allows batching of statements and their parameters.
Related
I have a rather long and complex paginated query. I'm trying to optimize it. In the worst case - first, I have to execute the data query in a one call to Neo4j, and then I have to execute pretty much the same query for the count. Of course, I do everything in one transaction. Anyway, I don't like the overall execution time, so I extracted the most common part for both - data and count queries and execute it on the first call. This common query returns the IDs of nodes, which I then pass as parameters to the rest of data and count queries. Now, everything works much faster. One thing I don't like is that a common query can sometimes return quite a large set of IDs.. it can be 20k..50k Long IDs.
So my question is - because I'm doing this in a one transaction - is there a way to preserve such Set of IDs somewhere in Neo4j between common query and data/count query calls and just refer them somehow in the subsequent data/count queries without moving between app JVM and Neo4j?
Also, am I crazy for doing this, or is this a good approach to optimize a complex paginated query?
Only with a custom procedure.
Otherwise you'd need to return them.
But usually it's uncommon to both provide counts (even google doesn't provide "real" counts) and data.
One way is to just stream the results with the reactive driver as long as the user scrolls.
Otherwise I would just query for pageSize+1 and return "more than pageSize results".
If you just stream the id's back (and don't collect them as an aggregation) you can start using the id's received already to issue your new queries (even in parallel).
I am using an apoc.periodic.iterate query to store millions of data . Since the data may contain duplicates I am using MERGE action to create nodes but unfortunately whenever the data is duplicated the whole batch is getting with error like this
"LockClient[200] can't wait on resource RWLock[NODE(14), hash=1645803399] since => LockClient[200] <-[:HELD_BY]- RWLock[NODE(101)"
Changing parallel as false works fine
Also by removing duplicates the query is passed successfully
But both of the above solution takes more time since dealing with millions of data . Is there any alternate solution like making a it to wait for the lock
You cannot use parallel:true, because you are creating relationships in your query. Every time you want to add a relationship to a node, the cypher engine adds a write lock to a node, and other processes can't add to that particular node. That is why you have the write lock exception. Not much you can do except to run it with parallel:false setting.
To avoid deadlocks, concurrent requests that update the DB should avoid touching the same nodes or relationships (including the nodes on both ends of those relationships). One way to achieve this is to figure out a way to have the concurrent requests work on disjoint subgraphs.
Or, you can retry queries that throw a DeadlockDetectedException. The docs show an example of how to do that.
I have a neo4j database that has nearly 5 million nodes and 12 million edges. I want to delete a type (UsedAt) of edge (relationship). There are nearly 3 million edges of the type "UsedAt".
I'm writing a query like
match ()-[e:UsedAt]->() delete e
This takes too much time. Never stops.
I also tried
CALL apoc.periodic.iterate("match ()-[e:UsedAt]->() return e", "delete e", {batchSize:1000, parallel:true})
That also never stops. How can I delete all edges (relationships) of a certain type efficiently on a relatively large database?
It is not a matter of how you write the query, but a matter of neo4j's structure. Simply put, no matter how you write the query, the edges cannot be deleted as efficient as you expect. This is because: 1)It is a huge transaction in neo4j with your data size. In its essential, neo4j ensures transaction for each operation. 2) There are a lot of random read and write from disk or memory and none of them are fast. So if you stick to neo4j, you'd better avoid such operations.
Your query should run faster if you can specify a starting node label. E.g:
match (:SomeLabel)-[e:usedAt]->() delete e
Your original query is doing a full db scan, using labels will constrain the query to look only at the selected node types, which will be faster.
Use the periodic iterate, but try it with parallel set to false. Depending on your overall graph structure you might be suffering some deadlock in the parallel processing.
I am yet trying to make use of neo4j to perform a complex query (similar to shortest path search except I have very strange conditions applied to this search like minimum path length in terms of nodes traversed count).
My dataset contains around 2.5M nodes of one single type and around 1.5 billion edges (One single type as well). Each given node has on average 1000 directional relation to a "next" node.
Yet, I have a query that allows me to retrieve this shortest path given all of my conditions but the only way I found to have decent response time (under one second) is to actually limit the number of results after each new node added to the path, filter it, order it and then pursue to the next node (This is kind of a greedy algorithm I suppose).
I'd like to limit them a lot less than I do in order to yield more path as a result, but the problem is the exponential complexity of this search that makes going from LIMIT 40 to LIMIT 60 usually a matter of x10 ~ x100 processing time.
This being said, I am yet evaluating several solutions to increase the speed of the request but I'm quite unsure of the result they will yield as I'm not sure about how neo4j really stores my data internally.
The solution I think about yet is to actually add a property to my relationships which would be an integer in between 1 and 15 because I usually will only query the relationships that have one or two max different values for this property. (like only relationships that have this property to 8 or 9 for example).
As I can guess yet, for each relationship, neo4j then have to gather the original node properties and use it to apply my further filters which takes a very long time when crossing 4 nodes long path with 1000 relationships each (I guess O(1000^4)). Am I right ?
With relationship properties, will it have direct access to it without further data fetching ? Is there any chance it will make my queries faster? How are neo4j edges properties stored ?
UPDATE
Following #logisima 's advice I've written a procedure directly with the Java traversal API of neo4j. I then switched to the raw Java procedure API of Neo4J to leverage even more power and flexibility as my use case required it.
The results are really good : the lower bound complexity is overall a little less thant it was before but the higher bound is like ten time faster and when at least some of the nodes that will be used for the traversal are in the cache of Neo4j, the performances just becomes astonishing (depth 20 in less than a second for one of my tests when I only need depth 4 usually).
But that's not all. The procedures makes it very very easily customisable while keeping the performances at their best and optimizing every single operation at its best. The results is that I can use far more powerful filters in far less computing time and can easily update my procedure to add new features. Last but not least Procedures are very easily pluggable with spring-data for neo4j (which I use to connect neo4j to my HTTP API). Where as with cypher, I would have to auto generate the queries (as being very complex, there was like 30 java classes to do the trick properly) and I should have used jdbc for neo4j while handling a separate connection pool only for this request. Cannot recommend more to use the awesome neo4j java API.
Thanks again #logisima
If you're trying to do a custom shortespath algo, then you should write a cypher procedure with the traversal API.
The principe of Cypher is to make pattern matching, and you want to traverse the graph in a specific way to find your good solution.
The response time should be really faster for your use-case !
I'm loading a Neo4j database using Cypher commands piped directly into the neo4j-shell. Some experiments suggest that subgraph batches of about 1000 lines give the optimal throughput (about 3.2ms/line, 300 lines/sec (slow!), Neo4j 2.0.1). I use MATCH statements to bind existing nodes to the loading subgraph. Here's a chopped example:
begin
...
MATCH (domain75ea8a4da9d65189999d895f536acfa5:SubDomain { shorturl: "threeboysandanoldlady.blogspot.com" })
MATCH (domainf47c8afacb0346a5d7c4b8b0e968bb74:SubDomain { shorturl: "myweeview.com" })
MATCH (domainf431704fab917205a54b2477d00a3511:SubDomain { shorturl: "www.computershopper.com" })
CREATE
(article1641203:Article { id: "1641203", url: "http://www.coolsocial.net/sites/www/blackhawknetwork.com.html", type: 4, timestamp: 1342549270, datetime: "2012-07-17 18:21:10"}),
(article1641203)-[:PUBLISHED_IN]->(domaina9b3ed6f4bc801731351b913dfc3f35a),(author104675)-[:WROTE]->(article1641203),
....
commit
Using this (ridiculously slow) method, it takes several hours to load 200K nodes (~370K relationships) and, at that point, the loading slows down even more. I presume the asymptotic slowdown is due to the overhead of the MATCH statements. They make up 1/2 of the subgraph load statements by the time the graph hits 200K nodes. There's got to be a better way of doing this, it just doesn't scale.
I'm going to try rewriting the statements with parameters (refs: What is the most efficient way to insert nodes into a neo4j database using cypher AND http://jexp.de/blog/2013/05/on-importing-data-in-neo4j-blog-series/). I expect that to help, but it seems that I will still have problems making the subgraph connections. Would using MERGE or CREATE UNIQUE instead of the MATCH statements be the way to go? There must be best practice ways to do this that I'm missing. Any other speed-up ideas?
many thanks
Use MERGE, and do smaller transactions--I've found best results with batches of 50-100 (while doing index lookups). Bigger batches are better when doing CREATE only without MATCH. Also, I recommend using a driver to send your commands over the transactional API (with parameters) instead of via neo4j-shell--it tends to be a fair bit faster.
Alternatively (might not be applicable to all use cases), keep a local "index" of the node ids you've created. For only 200k items, this should be easy to fit in a normal map/dict of string->long. This will prevent you needing to tax the index on the db, and you can do only node-ID-based lookups and CREATE statements, and create the indexes later.
The load2neo plugin worked well for me. Installation was fast+painless and it has a very cypher-like command structure that easily supports uniqueness requirements. Works with neo4j 2.0 labels.
load2neo install + curl usage example:
http://nigelsmall.com/load2neo
load2neo Geoff syntax:
http://nigelsmall.com/geoff
It is much faster (>>10x) than using Cypher via neo4j-shell.
I wasn't able to get the parameters in Cypher through neo4j-shell working despite trying everything I could find via internet search.