Neo4j conditional batch import - neo4j

The data comes to the system continuosly with rate 300-500 TPS. I need to import it to neo4j with the following scheme:
If N node does not exist, create it
If the relation N-[rel:rel_type]->X does not exist, create it
Increment rel.weight
It seems to be impossible to solve the problem using REST batch.
Different cypher queries are too long because they generate many small transactions.
Gremlin works much faster. I collect parameters for gremlin script in array and execute it as a batch. But even though I could hardly reach the speed of 300 TPS.
I should mention that besides there will be a flow of queries ~500 TPS:
START N=node(...) MATCH N-[rel:rel_type]->X return rel.weight,X.name;
The heap size is set to 5 Gb. Additional options:
-XX:MaxPermSize=1G -XX:+CMSClassUnloadingEnabled -XX:+UseParallelGC -XX:+UseNUMA
What is optimal way and configuration for importing such kind of data?

to check whether or not the incoming node exists and has the rels to other node, you can use create unique syntax.
START n=node:node_index(newNode={N})
CREATE UNIQUE n-[:REL_TYPE]->x ;
to automatically increment the weight of the relationship, i would assume something like this (but no warranty on this, there is probably a faster way of doing it):
START n=node:node_index(newNode={N})
CREATE UNIQUE n-[rel:REL_TYPE]->x
SET rel.weight = coalesce(rel.weight?,0) +1

Related

Neo4J using properties on relationships for quicker lookup?

I am yet trying to make use of neo4j to perform a complex query (similar to shortest path search except I have very strange conditions applied to this search like minimum path length in terms of nodes traversed count).
My dataset contains around 2.5M nodes of one single type and around 1.5 billion edges (One single type as well). Each given node has on average 1000 directional relation to a "next" node.
Yet, I have a query that allows me to retrieve this shortest path given all of my conditions but the only way I found to have decent response time (under one second) is to actually limit the number of results after each new node added to the path, filter it, order it and then pursue to the next node (This is kind of a greedy algorithm I suppose).
I'd like to limit them a lot less than I do in order to yield more path as a result, but the problem is the exponential complexity of this search that makes going from LIMIT 40 to LIMIT 60 usually a matter of x10 ~ x100 processing time.
This being said, I am yet evaluating several solutions to increase the speed of the request but I'm quite unsure of the result they will yield as I'm not sure about how neo4j really stores my data internally.
The solution I think about yet is to actually add a property to my relationships which would be an integer in between 1 and 15 because I usually will only query the relationships that have one or two max different values for this property. (like only relationships that have this property to 8 or 9 for example).
As I can guess yet, for each relationship, neo4j then have to gather the original node properties and use it to apply my further filters which takes a very long time when crossing 4 nodes long path with 1000 relationships each (I guess O(1000^4)). Am I right ?
With relationship properties, will it have direct access to it without further data fetching ? Is there any chance it will make my queries faster? How are neo4j edges properties stored ?
UPDATE
Following #logisima 's advice I've written a procedure directly with the Java traversal API of neo4j. I then switched to the raw Java procedure API of Neo4J to leverage even more power and flexibility as my use case required it.
The results are really good : the lower bound complexity is overall a little less thant it was before but the higher bound is like ten time faster and when at least some of the nodes that will be used for the traversal are in the cache of Neo4j, the performances just becomes astonishing (depth 20 in less than a second for one of my tests when I only need depth 4 usually).
But that's not all. The procedures makes it very very easily customisable while keeping the performances at their best and optimizing every single operation at its best. The results is that I can use far more powerful filters in far less computing time and can easily update my procedure to add new features. Last but not least Procedures are very easily pluggable with spring-data for neo4j (which I use to connect neo4j to my HTTP API). Where as with cypher, I would have to auto generate the queries (as being very complex, there was like 30 java classes to do the trick properly) and I should have used jdbc for neo4j while handling a separate connection pool only for this request. Cannot recommend more to use the awesome neo4j java API.
Thanks again #logisima
If you're trying to do a custom shortespath algo, then you should write a cypher procedure with the traversal API.
The principe of Cypher is to make pattern matching, and you want to traverse the graph in a specific way to find your good solution.
The response time should be really faster for your use-case !

Neo4j long lasting query to be split/executed in smaller chunks?

My import.csv creates many nodes and merging creates a huge cartesian product and runs in a transaction timeout since the data has grown so much. I've currently set the transaction timeout to 1 second because every other query is very quick and is not supposed to take any longer than one second to finish.
Is there a way to split or execute this specific query in smaller chunks to prevent a timeout?
Upping or disabling the transaction timeout in the neo4j.conf is not an option because the neo4j service needs a restart for every change made in the config.
The query hitting the timeout from my import script:
MATCH (l:NameLabel)
MATCH (m:Movie {id: l.id,somevalue: l.somevalue})
MERGE (m)-[:LABEL {path: l.path}]->(l);
Nodecounts: 1000 Movie, 2500 Namelabel
You can try installing APOC Procedures and using the procedure apoc.periodic.commit.
call apoc.periodic.commit("
MATCH (l:Namelabel)
WHERE NOT (l)-[:LABEL]->(:Movie)
WITH l LIMIT {limit}
MATCH (m:Movie {id: l.id,somevalue: l.somevalue})
MERGE (m)-[:LABEL {path: l.path}]->(l)
RETURN count(*)
",{limit:1000})
The below query will be executed repeatedly in separate transactions until it returns 0.
You can change the value of {limit : 1000}.
Note: remember to install APOC Procedures according the version of Neo4j you are using. Take a look in the Version Compatibility Matrix.
The number of nodes and labels in your database suggest this is an indexing problem. Do you have constraints on both the Movie and Namelabel (which should be NameLabel since it is a node) nodes? The appropriate constraints should be in place and active.
Indexing and Performance
Make sure to have indexes and constraints declared and ONLINE for
entities you want to MATCH or MERGE on
Always MATCH and MERGE on a
single label and the indexed primary-key property
Prefix your load
statements with USING PERIODIC COMMIT 10000 If possible, separate node
creation from relationship creation into different statements
If your
import is slow or runs into memory issues, see Mark’s blog post on
Eager loading.
If your Movie nodes have unique names then use the CREATE UNIQUE statement. - docs
If one of the nodes is not unique but will be used in a relationship definition then the CREATE INDEX ON statement. With such a small dataset it may not be readily apparent how inefficient your queries are. Try the PROFILE command and see how many nodes are being searched. Your MERGE statement should only check a couple nodes at each step.

neo4j for fraud detection - efficient data structure

I'm trying to improve a fraud detection system for a commerce website. We deal with direct bank transactions, so fraud is a risk we need to manage. I recently learned of graphing databases and can see how it applies to these problems. So, over the past couple of days I set up neo4j and parsed our data into it: example
My intuition was to create a node for each order, and a node for each piece of data associated with it, and then connect them all together. Like this:
MATCH (w:Wallet),(i:Ip),(e:Email),(o:Order)
WHERE w.wallet="ex" AND i.ip="ex" AND e.email="ex" AND o.refcode="ex"
CREATE (w)-[:USED]->(o),(i)-[:USED]->(o),(e)-[:USED]->(o)
But this query runs very slowly as the database size increases (I assume because it needs to search the whole data set for the nodes I'm asking for). It also takes a long time to run a query like this:
START a=node(179)
MATCH (a)-[:USED*]-(d)
WHERE EXISTS(d.refcode)
RETURN distinct d
This is intended to extract all orders that are connected to a starting point. I'm very new to Cypher (<24 hours), and I'm finding it particularly difficult to search for solutions.
Are there any specific issues with the data structure or queries that I can address to improve performance? It ideally needs to complete this kind of thing within a few seconds, as I'd expect from a SQL database. At this time we have about 17,000 nodes.
Always a good idea to completely read through the developers manual.
For speeding up lookups of nodes by a property, you definitely need to create indexes or unique constraints (depending on if the property should be unique to a label/value).
Once you've created the indexes and constraints you need, they'll be used under the hood by your query to speed up your matches.
START is only used for legacy indexes, and for the latest Neo4j versions you should use MATCH instead. If you're matching based upon an internal id, you can use MATCH (n) WHERE id(n) = xxx.
Keep in mind that you should not persist node ids outside of Neo4j for lookup in future queries, as internal node ids can be reused as nodes are deleted and created, so an id that once referred to a node that was deleted may later end up pointing to a completely different node.
Using labels in your queries should help your performance. In the query you gave to find orders, Neo4j must inspect every end node in your path to see if the property exists. Property access tends to be expensive, especially when you're using a variable-length match, so it's better to restrict the nodes you want by label.
MATCH (a)-[:USED*]-(d:Order)
WHERE id(a) = 179
RETURN distinct d
On larger graphs, the variable-length match might start slowing down, so you may get more performance by installing APOC Procedures and using the Path Expander procedure to gather all subgraph nodes and filter down to just Order nodes.
MATCH (a)
WHERE id(a) = 179
CALL apoc.path.expandConfig(a, {bfs:true, uniqueness:"NODE_GLOBAL"}) YIELD path
RETURN LAST(NODES(path)) as d
WHERE d:Order

What's the longest request in Neo4J Cypher?

I have a very long Cypher request in my app (running on Node.Js and Neo4j 2.0.1), which creates at once about 16 nodes and 307 relationships between them. It is about 50K long.
The high number of relationships is determined by the data model, which I probably want to change later, but nevertheless, if I decide to keep everything as it is, two questions:
1) What would be the maximum size of each single Cypher request I send to Neo4J?
2) What would be the best strategy to deal with a request that is too long? Split it into the smaller ones and then batch them in a transaction? I wouldn't like to do that because in this case I lose the consistency that I had resulting from a combination of MERGE and CREATE commands (the request automatically recognized some nodes that did not exist yet, create them, and then I could make relations between them using their indices that I already got through the MERGE).
Thank you!
I usually recommend to
Use smaller statements, so that the query plan cache can kick in and execute your query immediately without compiling, for this you also need
parameters, e.g. {context} or {user}
I think a statement size of up to 10-15 elements is easy to handle.
You can still execute all of them in a single tx with the transactional cypher endpoint, which allows batching of statements and their parameters.

Fastest way to load neo4j DB w/Cypher - how to integrate new subgraphs?

I'm loading a Neo4j database using Cypher commands piped directly into the neo4j-shell. Some experiments suggest that subgraph batches of about 1000 lines give the optimal throughput (about 3.2ms/line, 300 lines/sec (slow!), Neo4j 2.0.1). I use MATCH statements to bind existing nodes to the loading subgraph. Here's a chopped example:
begin
...
MATCH (domain75ea8a4da9d65189999d895f536acfa5:SubDomain { shorturl: "threeboysandanoldlady.blogspot.com" })
MATCH (domainf47c8afacb0346a5d7c4b8b0e968bb74:SubDomain { shorturl: "myweeview.com" })
MATCH (domainf431704fab917205a54b2477d00a3511:SubDomain { shorturl: "www.computershopper.com" })
CREATE
(article1641203:Article { id: "1641203", url: "http://www.coolsocial.net/sites/www/blackhawknetwork.com.html", type: 4, timestamp: 1342549270, datetime: "2012-07-17 18:21:10"}),
(article1641203)-[:PUBLISHED_IN]->(domaina9b3ed6f4bc801731351b913dfc3f35a),(author104675)-[:WROTE]->(article1641203),
....
commit
Using this (ridiculously slow) method, it takes several hours to load 200K nodes (~370K relationships) and, at that point, the loading slows down even more. I presume the asymptotic slowdown is due to the overhead of the MATCH statements. They make up 1/2 of the subgraph load statements by the time the graph hits 200K nodes. There's got to be a better way of doing this, it just doesn't scale.
I'm going to try rewriting the statements with parameters (refs: What is the most efficient way to insert nodes into a neo4j database using cypher AND http://jexp.de/blog/2013/05/on-importing-data-in-neo4j-blog-series/). I expect that to help, but it seems that I will still have problems making the subgraph connections. Would using MERGE or CREATE UNIQUE instead of the MATCH statements be the way to go? There must be best practice ways to do this that I'm missing. Any other speed-up ideas?
many thanks
Use MERGE, and do smaller transactions--I've found best results with batches of 50-100 (while doing index lookups). Bigger batches are better when doing CREATE only without MATCH. Also, I recommend using a driver to send your commands over the transactional API (with parameters) instead of via neo4j-shell--it tends to be a fair bit faster.
Alternatively (might not be applicable to all use cases), keep a local "index" of the node ids you've created. For only 200k items, this should be easy to fit in a normal map/dict of string->long. This will prevent you needing to tax the index on the db, and you can do only node-ID-based lookups and CREATE statements, and create the indexes later.
The load2neo plugin worked well for me. Installation was fast+painless and it has a very cypher-like command structure that easily supports uniqueness requirements. Works with neo4j 2.0 labels.
load2neo install + curl usage example:
http://nigelsmall.com/load2neo
load2neo Geoff syntax:
http://nigelsmall.com/geoff
It is much faster (>>10x) than using Cypher via neo4j-shell.
I wasn't able to get the parameters in Cypher through neo4j-shell working despite trying everything I could find via internet search.

Resources