Neo4j and Cypher - How can I create/merge chained sequential node relationships (and even better time-series)?

To keep things simple, as part of the ETL on my time-series data, I added a sequence number property to each row, corresponding to 0..370365 (370,366 nodes, 5,555,490 properties - not that big). I later added a second property, naming the original "outeseq" and the second "ineseq", to see if an outright equivalence to base the relationship on might speed things up a bit.
I can get both of the following queries to run properly on up to ~30k nodes (LIMIT 30000), but past that, it's just an endless wait. My JVM has 16 GB max (if it can even use it on a Windows box):
MATCH (a:BOOK),(b:BOOK)
WHERE a.outeseq=b.outeseq-1
MERGE (a)-[s:FORWARD_SEQ]->(b)
RETURN s;
or
MATCH (a:BOOK),(b:BOOK)
WHERE a.outeseq=b.ineseq
MERGE (a)-[s:FORWARD_SEQ]->(b)
RETURN s;
I also added these in hopes of speeding things up:
CREATE CONSTRAINT ON (a:BOOK)
ASSERT a.outeseq IS UNIQUE
CREATE CONSTRAINT ON (b:BOOK)
ASSERT b.ineseq IS UNIQUE
I can't get the relationships created for the entire data set! Help!
Alternatively, I can also get bits of the relationships built with parameters, but I haven't figured out how to parameterize the sequence over all of the node-to-node sequential relationships, at least not in a semantically general enough way.
I profiled the query, but didn't see any reason for it to "blow up".
Another question: I would like each relationship to have a property representing the difference between the time-stamps of the two nodes, i.e. delta-t. Is there a way to take the difference between the values in two sequential nodes and assign it to the relationship... for all of the relationships at the same time?
The last Q, if you have the time - I'd really like to use the raw data and just chain the directed relationships from one node's timestamp to the next nearest node with the minimum delta, but I didn't run right at this for fear that it would cause a scan of all the nodes in order to build each relationship.
Before anyone suggests that I look to KDB or other DBs for time series, let me say I have a very specific reason to want to use a DAG representation.
It seems like this should be so easy...it probably is and I'm blind. Thanks!

Creating Relationships
Since your queries work on 30k nodes, I'd suggest running them page by page over all the nodes. This seems feasible because outeseq and ineseq are unique and numeric, so you can sort nodes by those properties and run the query against one slice at a time.
MATCH (a:BOOK),(b:BOOK)
WHERE a.outeseq = b.outeseq-1
WITH a, b ORDER BY a.outeseq SKIP {offset} LIMIT 30000
MERGE (a)-[s:FORWARD_SEQ]->(b)
RETURN s;
You will need about 13 runs of the query, changing {offset} each time, to cover all the data. It would be convenient to write a script in any language that has a Neo4j client.
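If APOC is available (the question doesn't say it is, so this is an assumption), the same paging can be done server-side in a single call, batching the MERGEs into separate transactions:
CALL apoc.periodic.iterate(
  'MATCH (a:BOOK), (b:BOOK) WHERE a.outeseq = b.outeseq - 1 RETURN a, b',
  'MERGE (a)-[:FORWARD_SEQ]->(b)',  // pairs streamed from the outer statement, merged in 30k-row transactions
  {batchSize: 30000})
YIELD batches, total
RETURN batches, total;
This avoids the client-side loop entirely.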
Updating Relationship's Properties
You can assign the timestamp delta to relationships using a SET clause following the MATCH. Assuming that a timestamp is a long:
MATCH (a:BOOK)-[s:FORWARD_SEQ]->(b:BOOK)
SET s.delta = abs(b.timestamp - a.timestamp);
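As a quick sanity check afterwards (my own addition, not part of the original answer), you can aggregate the deltas:
MATCH (:BOOK)-[s:FORWARD_SEQ]->(:BOOK)
RETURN min(s.delta) AS minDelta, max(s.delta) AS maxDelta, avg(s.delta) AS avgDelta;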
Chaining Nodes With Minimal Delta
Once relationships carry the delta property, the graph becomes a weighted graph, so we can apply this approach to calculate the shortest path using deltas. Then we just save the length of the shortest path (the sum of deltas) into the relationship between the first and the last node.
MATCH p=(a:BOOK)-[:FORWARD_SEQ*1..]->(b:BOOK)
WITH p AS shortestPath, a, b,
     reduce(weight = 0, r IN relationships(p) | weight + r.delta) AS totalDelta
ORDER BY totalDelta ASC
LIMIT 1
MERGE (a)-[nearest:NEAREST {delta: totalDelta}]->(b)
RETURN nearest;
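If APOC is installed (again, an assumption), apoc.algo.dijkstra can compute the weighted shortest path between two concrete endpoints directly; the outeseq values below are only illustrative:
MATCH (a:BOOK {outeseq: 0}), (b:BOOK {outeseq: 100})
CALL apoc.algo.dijkstra(a, b, 'FORWARD_SEQ>', 'delta') YIELD path, weight  // '>' = follow outgoing only
RETURN path, weight;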
Disclaimer: the queries above are not guaranteed to work as-is; they just hint at possible approaches to the problem.

Related

neo4j for fraud detection - efficient data structure

I'm trying to improve a fraud detection system for a commerce website. We deal with direct bank transactions, so fraud is a risk we need to manage. I recently learned of graph databases and can see how they apply to these problems. So, over the past couple of days I set up Neo4j and parsed our data into it.
My intuition was to create a node for each order, and a node for each piece of data associated with it, and then connect them all together. Like this:
MATCH (w:Wallet),(i:Ip),(e:Email),(o:Order)
WHERE w.wallet="ex" AND i.ip="ex" AND e.email="ex" AND o.refcode="ex"
CREATE (w)-[:USED]->(o),(i)-[:USED]->(o),(e)-[:USED]->(o)
But this query runs very slowly as the database size increases (I assume because it needs to search the whole data set for the nodes I'm asking for). It also takes a long time to run a query like this:
START a=node(179)
MATCH (a)-[:USED*]-(d)
WHERE EXISTS(d.refcode)
RETURN distinct d
This is intended to extract all orders that are connected to a starting point. I'm very new to Cypher (<24 hours), and I'm finding it particularly difficult to search for solutions.
Are there any specific issues with the data structure or queries that I can address to improve performance? It ideally needs to complete this kind of thing within a few seconds, as I'd expect from a SQL database. At this time we have about 17,000 nodes.
It's always a good idea to completely read through the developer manual.
For speeding up lookups of nodes by a property, you definitely need to create indexes or unique constraints (depending on whether the property should be unique for a label/value).
Once you've created the indexes and constraints you need, they'll be used under the hood by your query to speed up your matches.
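For example (which properties deserve plain indexes versus unique constraints is my assumption about your data; adjust to taste):
CREATE INDEX ON :Order(refcode);
CREATE CONSTRAINT ON (w:Wallet) ASSERT w.wallet IS UNIQUE;
CREATE CONSTRAINT ON (i:Ip) ASSERT i.ip IS UNIQUE;
CREATE CONSTRAINT ON (e:Email) ASSERT e.email IS UNIQUE;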
START is only used for legacy indexes, and for the latest Neo4j versions you should use MATCH instead. If you're matching based upon an internal id, you can use MATCH (n) WHERE id(n) = xxx.
Keep in mind that you should not persist node ids outside of Neo4j for lookup in future queries: internal node ids can be reused as nodes are deleted and created, so an id that once referred to a now-deleted node may later point to a completely different node.
Using labels in your queries should help your performance. In the query you gave to find orders, Neo4j must inspect every end node in your path to see if the property exists. Property access tends to be expensive, especially when you're using a variable-length match, so it's better to restrict the nodes you want by label.
MATCH (a)-[:USED*]-(d:Order)
WHERE id(a) = 179
RETURN distinct d
On larger graphs, the variable-length match might start slowing down, so you may get more performance by installing APOC Procedures and using the Path Expander procedure to gather all subgraph nodes and filter down to just Order nodes.
MATCH (a)
WHERE id(a) = 179
CALL apoc.path.expandConfig(a, {bfs:true, uniqueness:"NODE_GLOBAL"}) YIELD path
WITH LAST(NODES(path)) AS d
WHERE d:Order
RETURN d
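If your APOC version supports it, the filtering can also be pushed into the expansion itself via relationshipFilter and labelFilter (">Order" keeps only paths that end at an Order node); this variant is a sketch, not tested against your data:
MATCH (a)
WHERE id(a) = 179
CALL apoc.path.expandConfig(a, {relationshipFilter: "USED", labelFilter: ">Order",
    bfs: true, uniqueness: "NODE_GLOBAL"}) YIELD path
RETURN LAST(NODES(path)) AS d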

Neo4j Cypher Optimize Create relationship between each node in cartesian product

I have a set of nodes, all labeled Word. I want to connect each Word to all the other words with a relationship called Distance. I do the following query:
match (word1:Word)
with word1
match (word2:Word)
where word1 <> word2
merge (word1)-[distance:DISTANCE ]->(word2)
return word1, distance, word2
It runs forever. There are only ~600 nodes and although I expect 600*600 relationships, the query shouldn't run for two hours! It is quicker in Java than in Neo4j. What advice do you have to make it quicker? I have already added an index on one of the properties and it's not improving.
Some observations:
Your query will try to perform 2*600*599 (or 718,800) MERGE operations in a single transaction. The reason for the factor of 2 is because every pair of words (say, x and y) will be seen twice (as x/y and y/x). You (presumably) only want to perform half that number of operations.
The x/y and y/x behavior also causes an attempt to ensure there are 2 DISTANCE relationships for each word pair -- one in either direction. That is (presumably) twice the number of relationships than you want (or need).
Trying to perform 720K (or even just 360K) operations in a single transaction may cause the DB server to run out of memory.
Here is a modified query that might fix the above issues. The ID(w1) < ID(w2) test makes sure the 2 words in a pair are not the same AND that the same pair is only processed once. It also uses the APOC procedure apoc.periodic.iterate to create 10K relationships at a time in separate transactions, in parallel.
CALL apoc.periodic.iterate(
'MATCH (w1:Word), (w2:Word) WHERE ID(w1) < ID(w2) RETURN w1, w2',
'CREATE (w1)-[:DISTANCE]->(w2)',
{batchSize:10000, parallel:true}) YIELD batches, total
RETURN *
NOTE 1: This query assumes that you start out without any DISTANCE relationships in the DB, so it uses the cheaper CREATE clause instead of MERGE. If DISTANCE relationships exist already, then use MERGE instead (but this could create a second relationship between the same pair if the first relationship was in the opposite direction).
NOTE 2: Performing the batches in parallel should be safe because issue #2 is not possible with the new Cypher code. If 2 transactions were to attempt to create relationships in opposite directions between the same 2 nodes at the same time, that could result in a deadlock, which would cause at least one of the transactions to fail.
NOTE 3: This query assumes that the first statement (with the MATCH clause) does not itself run out of memory or take too long to process. If that assumption is wrong, then using a suitably modified query with apoc.periodic.commit should work.
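A sketch of what that apoc.periodic.commit fallback could look like (untested; the NOT (w1)-[:DISTANCE]->(w2) predicate is what makes the loop terminate once all pairs are connected):
CALL apoc.periodic.commit(
  'MATCH (w1:Word), (w2:Word)
   WHERE ID(w1) < ID(w2) AND NOT (w1)-[:DISTANCE]->(w2)
   WITH w1, w2 LIMIT $limit
   CREATE (w1)-[:DISTANCE]->(w2)
   RETURN count(*)',  // commit loops until this count reaches 0
  {limit: 10000});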

Seeking Neo4J Cypher query for long but (nearly) unique paths

We have a Neo4J database representing an evolutionary process with about 100K nodes and 200K relations. Nodes are individuals in generations, and edges represent parent-child relationships. The primary goal is to be able to take one or more nodes of interest in the final generation and explore their evolutionary history (roughly, "how did we get here?").
The "obvious" first query to find all their ancestors doesn't work because there are just too many possible ancestors and paths through that space:
match (a)-[:PARENT_OF*]->(c {is_interesting: true})
return distinct a;
So we've pre-processed the data so that some edges are marked as "special" such that almost every node has at most one "special" parent edge, although occasionally both parent edges are marked as "special". My hope, then, was that this query would (efficiently) generate the (nearly) unique path along "special" edges:
match (a)-[r:PARENT_OF* {special: true}]->(c {is_interesting: true})
return distinct a;
This, however, is still unworkably slow.
This is frustrating because "as a human", the logic is simple: Start from the small number of "interesting" nodes (often 1, never more than a few dozen), and chase back along the almost always unique "special" edges. Assuming a very low number of nodes with two "special" parents, this should be something like O(N) where N is the number of generations back in time.
In Neo4J, however, going back 25 steps from a unique "interesting" node where every step is unique takes 30 seconds, and once there's a single bifurcation (where both parents are "special") it gets worse much faster as a function of steps. 28 steps (which gets us to the first bifurcation) takes 2 minutes, 30 (where there's still only the one bifurcation) takes 6 minutes, and I haven't even thought to try the full 100 steps to the beginning of the simulation.
Some similar work last year seemed to perform better, but we used a variety of edge labels (e.g., (a)-[:SPECIAL_PARENT_OF*]->(c) as well as (a)-[:PARENT_OF*]->(c)) instead of using data fields on the edges. Is querying on relationship field values just not a good idea? We have quite a few different values attached to a relationship in this model (some boolean, some numeric) and we were hoping/assuming we could use those to efficiently limit searches, but maybe that wasn't really the case.
Suggestions for how to tune our model or queries would be greatly appreciated.
Update I should have mentioned, this is all with Neo4J 2.1.7. I'm going to give 2.2 a try as per Brian Underwood's suggestion and will report back.
I've had some luck with specifying a limit on the path length. So if you know that it's never more than 30 hops you might try:
MATCH (c {is_interesting: true})
WITH c
MATCH (a)-[:PARENT_OF*1..30]->(c)
RETURN DISTINCT a
Also, is there an index on the is_interesting property? That could also cause slowness, for sure.
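Note that schema indexes in Neo4j are tied to a label, so this only helps if the individuals carry one; the :Individual label below is hypothetical:
CREATE INDEX ON :Individual(is_interesting);
MATCH (c:Individual {is_interesting: true}) RETURN count(c);  // the MATCH must mention the label for the index to be used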
What version of Neo4j are you using? If you are using or if you upgrade to 2.2.0, you get to use the new query profiling tools:
http://neo4j.com/docs/2.2.0/how-do-i-profile-a-query.html
Also if you use them in the web console you get a nice graph-ish tree thing (technical term) showing each step.
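Using them is just a matter of prefixing a query, e.g. with the query from above:
PROFILE
MATCH (c {is_interesting: true})
WITH c
MATCH (a)-[:PARENT_OF*1..30]->(c)
RETURN DISTINCT a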
After exploring things with the profiling tools in Neo4J 2.2 (thanks to Brian Underwood for the tip) it's pretty clear that (at the moment) Neo4J doesn't do any pre-filtering on edge properties, which leads to nasty combinatorial explosions with long paths.
For example the original query:
match (a)-[r:PARENT_OF* {special: true}]->(c {is_interesting: true})
return distinct a;
finds all the paths from a to c and then eliminates the ones that have edges that aren't special. Since there are many millions of paths from a to c, this is totally infeasible.
If I instead add a IS_SPECIAL edge wherever there was a PARENT_OF edge that had {special: true}, then the queries become really fast, allowing me to push back around 100 generations in under a second.
This query creates all the new edges:
match (a)-[r:PARENT_OF {special: true}]->(b)
create (a)-[:IS_SPECIAL]->(b);
and takes under a second to add 91K relationships in our graph.
Then
match (c {is_interesting: true})
with c
match (a)-[:IS_SPECIAL*]->(c)
return distinct a;
takes under a second to find the 112 nodes along the "special" path back from a unique target node c. Matching c first and limiting the set of nodes using with c seems to also be important, as Neo4J doesn't appear to pre-filter on node properties either, and if there are several "interesting" target nodes things get a lot slower.

Why does the node's label affect query performance significantly in Neo4j?

Let me try to simplify my question. If all nodes in the Neo4j DB have the same label Science, what's the difference between MATCH n WHERE n.ID="UUID-0001" RETURN n and MATCH (n:Science) WHERE n.ID="UUID-0001" RETURN n? Why isn't the performance the same?
My Neo4j database contains about 70000 nodes and 100 relationships.
The nodes have two types: Paper and Author, and they both have an ID field.
I created each node with the corresponding label, and I also use ID as the index.
However, one of my functions needs to query nodes by ID without considering the label. The query looks like: MATCH n WHERE n.ID="UUID-0001" RETURN n. The query takes about 4000~5000 ms!
But after adding Science to each node and using MATCH (n:Science) WHERE n.ID="UUID-0001" RETURN n, the query time dropped to about 1000~1100 ms. Does anyone know the difference between these two cases?
PS. Count(n:Science) = Count(n:Paper) + Count(n:Author), which means each node has two labels.
Because for every label Neo4j automatically maintains an extra index. The Cypher language can be broadly thought of as piping + filtering, so MATCH n WHERE ... will first get every node and then filter on the WHERE part, whereas MATCH (n:Science) WHERE ... will get every node with the label Science (using an index) and then try to match the WHERE. From your query performance we can see that about 1/5th of your nodes were marked Science, so the query runs in a fifth of the time, because it did a fifth as many comparisons.
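Note also that schema indexes are defined per label + property pair, which is why the label-less query can never use the ID index. If you haven't already, creating the index on the shared label turns the labeled lookup into an index seek rather than a label scan:
CREATE INDEX ON :Science(ID);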
Even though I got advice from @phil_20686 and @Michael Hunger, I think these answers do not solve my question.
I think there are some tricks to using labels. If there are 10 thousand nodes in the Neo4j DB and these nodes are all of the same type, the query will still perform better when a label is added to them.
I hope this post can help some people; please give me some feedback if you find the reasons. Thanks.

Is it the optimal way of expressing "go through all nodes" queries in Cypher?

I have a quite large social graph in which I execute global queries like this one:
match (n:User)-[r:LIKES]->(k:User)
where not (k:User)-[]->(n:User)
return count(r);
They take a lot of time and memory, so I am curious whether they are expressed in an optimal way. I have a feeling that when I execute such a query, Cypher first matches everything that fits the expression (which takes a lot of memory) and only then starts to count things. I would rather go through every node, check the pattern, and update the counter if necessary; that way such queries would not require a lot of memory. So how is such a query in fact executed? If it is not optimal, is there a way to make it better (in Cypher)?
If you used the query just as you wrote it, you may not be getting what you think you are. Putting labels on node "variables" can cause them to be treated as fresh (partial) patterns instead of bound nodes. Is your query any faster if you use
MATCH (n:User)-[r:LIKES]->(k:User)
WHERE NOT (n)<--(k)
RETURN count(r)
Here's how this works (not considering internal optimizations, which I don't begin to understand).
For each User node, every outgoing LIKES relationship is followed. If the other end of the LIKES relationship is a User node, the two nodes and the relationship are bound to the names n, k, and r and passed to the WHERE clause. Every outgoing relationship on the bound k node is then tested to see if it connects to the bound n node. If no such relationship is found, the match is considered successful. The count() function in the RETURN clause counts the resulting collection of relationships that were passed from the match.
If you have a densely connected graph, and particularly if there are many other relationships between nodes other than LIKES relationship, this can be quite an extensive search.
As a further experiment, you might try changing the WHERE clause to read
WHERE NOT (k)-->(n)
and see if it makes any difference. I don't think it will, but I could be wrong.
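Spelled out, that variant reads:
MATCH (n:User)-[r:LIKES]->(k:User)
WHERE NOT (k)-->(n)
RETURN count(r)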
