Let's say that I have 3 different types of nodes, Plans, Big-Ideas and Ideas.
Now each Plan can consist of 0+ Big-Ideas and 0+ Ideas, along with weights (votes) for these relationships. Big-Ideas can also be created by Ideas.
For example, given plan A.
A -[:HAS_BIG_IDEA {Weight: 30}] -> B (Big-Idea)
A -[:HAS_IDEA {Weight: 10}] -> C (Idea)
A -[:HAS_BIG_IDEA {Weight: 1}] -> D (Big-Idea) -[:IS_MADE_OF {PlanID, Date, Weight}] -> E (Idea)
Plans are changed daily, with votes being given daily as well. Initially I decided to have 1 node for each type of Plan and have new relationships added daily with the properties for Date and Weight. The problem is that as the number of ideas and big-ideas number in the millions, the number of relationships starts ballooning and filtering Plans based on Dates and Weights becomes much slower.
As an example, the query to list all ideas used on 7th July 2017 which had votes > 100 takes much longer to execute. In order to reduce the number of starting points, I shifted the Date property to the Plan nodes and create new Plan nodes for each type and day. As this reduces the starting point of the query, it speeds it up. The downside is creating more nodes.
So I guess my question is more design related as to whether my second approach is something which is good practice. I've read that using legacy indexing on relationships hints towards redesigning the graph so I wasn't too keen on adding indexes on dates / weights for relationships.
Related
There is DataSet at my Notebook’s Virtual Machine:
2 million unique Customers [:VISITED] 40000 unique Merchants.
Every [:VISIT] has properties: amount (double) and dt (date).
Every Customer has property “pty_id” (Integer).
And every Merchant has mcht_id (String) property.
One Customer may visit one Merchant for more than one time. And of course, one Customer may visit many Merchants. So there are 43 978 539 relationships in my graph between Customers and Merchants.
I have created Indexes:
CREATE INDEX on :Customer(pty_id)
CREATE INDEX on :Merchant(mcht_id)
Parameters of my VM are:
Oracle (RedHat) Linux 7 with 2 core i7, 2 GB RAM
Parameters of my Neo4j 3.5.7 config:
- dbms.memory.heap.max_size=1024m
- dbms.memory.pagecache.size=512m
My task is:
Get top 10 Customers ordered by total_amount who spent their money at NOT specified Merchant(M) but visit that Merchants which have been visited by Customers who visit this specified Merchant(M)
My Solution is:
Let’s M will have mcht_id = "0000000DA5"
Then the CYPHER query will be:
MATCH
(c:Customer)-[r:VISITED]->(mm:Merchant)<-[:VISITED]-(cc:Customer)-[:VISITED]->(m:Merchant {mcht_id: "0000000DA5"})
WHERE
NOT (c)-[:VISITED]->(m)
WITH
DISTINCT c as uc
MATCH
(uc:Customer)-[rr:VISITED]->()
RETURN
uc.pty_id
,round(100*sum(rr.amount))/100 as v_amt
ORDER BY v_amt DESC
LIMIT 10;
Result is OK. I receive my answer:
uc.pty_id - v_amt: 1433798 - 348925.94; 739510 - 339169.83; 374933 -
327962.95 and so on.
The problem is that this result I have received after 437613 ms! It’s about 7 minutes!!! My estimated time for this query was about 10-20 seconds….
My Question is: What am I doing wrong???
There's a few things to improve here.
First, for graph-wide queries in a graph with millions of nodes and 50 million relationships, 1G of heap and 512M of pagecache is far too low. We usually recommend around 8-10G of heap minimum for medium to large graphs (this is your "scratch space" memory as a query executes), and to try to get as much of the graph size as possible in pagecache if you can to minimize cache misses as you traverse the graph. Neo4j likes memory. Memory is relatively cheap. You can use neo4j-admin memrec to get a recommendation of how to configure your memory settings, but in general you need to run this on a machine with more memory.
And if we're talking about hardware recommendations, usage of SSDs is highly recommended, for when you do need to hit the disk.
As for the query itself, notice in the query plan you posted that your DISTINCT operation drops the number of rows from the neighborhood of 26-35 million to only 153k rows, that's significant. Your most expensive step here (WHERE
NOT (c)-[:VISITED]->(m)) is the Expand(Into) operation on the right side of the plan, with nearly 1 billion db hits. This is happening too early in the query - you should be doing this AFTER your DISTINCT operation, so it operates on only 153k rows instead of 35 million.
You can also improve upon this so you don't even have to hit the graph to do that step of the filtering. Instead of using that WHERE NOT <pattern> approach, you can pre-match to the customers who visited the first merchant, gather them into a list, and keep them around, and instead of using negation of the pattern (where it has to actually expand out all :VISITED relationships of those customers and see if any was the original merchant), we instead do a list membership check, and ensure they aren't one of the 1k or so customers who visited the original merchant. That will happen in memory, since we already collected that list, so it shouldn't hit the graph. In any case you should do DISTINCT before this check.
In your RETURN you're performing an aggregation with respect to a node's unique property, so you're paying the cost of projecting that property across 4 million rows BEFORE the cardinality drops from the aggregation to 153k rows, meaning you're projecting out that property redundantly across a great many duplicate :Customer nodes before they become distinct from the aggregation. That's redundant and expensive property access you can avoid by aggregating with respect to the node instead, and then do your property access after the aggregation, and also after your sort and limit, so you only have to project out 10 properties.
So putting that all together, try this out:
MATCH
(cc:Customer)-[:VISITED]->(m:Merchant {mcht_id: "0000000DA5"})
WITH m, collect(DISTINCT cc) as visitors
UNWIND visitors as cc
MATCH (uc:Customer)-[:VISITED]->(mm:Merchant)<-[:VISITED]-(cc)
WHERE
mm <> m
WITH
DISTINCT visitors, uc
WHERE NOT uc IN visitors
MATCH
(uc:Customer)-[rr:VISITED]->()
WITH
uc, round(100*sum(rr.amount))/100 as v_amt
ORDER BY v_amt DESC
LIMIT 10
RETURN uc.pty_id, v_amt;
EDIT
Okay, let's try something else. I suspect that what we're encountering here is a great deal of duplicates during expansion (many visitors may have visited the same merchants). Cypher won't eliminate duplicates during traversal unless you explicitly ask for it (as it may need this info for doing aggregations such as counting of occurrences), and this query is highly dependent on getting distinct nodes during expansion.
If you can install APOC Procedures, we can make use of some expansion procs which let us change how Cypher expands, only visiting each distinct node once across all paths. That may improve the timing here. At the least it will show us if the slowdown we're seeing is related to deduplication of nodes during expansion, or if it's something else.
MATCH (m:Merchant {mcht_id: "0000000DA5"})
CALL apoc.path.expandConfig(m, {uniqueness:'NODE_GLOBAL', relationshipFilter:'VISITED', minLevel:3, maxLevel:3}) YIELD path
WITH last(nodes(path)) as uc
MATCH
(uc:Customer)-[rr:VISITED]->()
WITH
uc
,round(100*sum(rr.amount))/100 as v_amt
ORDER BY v_amt DESC
LIMIT 10
RETURN uc.pty_id, v_amt;
While this is a more complicated approach, one neat thing is that with NODE_GLOBAL uniqueness (ensuring we only visit each node once across all expanded paths) and bfs expansion, we don't need to include WHERE NOT (c)-[:VISITED]->(m) since this will naturally be ruled out; we would have already visited every visitor of m, and since they've already been visited, we cannot visit them again, so none of them will appear in the final result set at 3 hops.
Give this a try and run it a couple times to get that into pagecache (or as much as possible...with 512MB pagecache you may not be able to get all of the traversed structure into memory).
I have tested all optimised query on Neo4j and on Oracle. Results are:
Oracle - 2.197 sec
Neo4j - 5.326 sec
You can see details here: http://homme.io/41163#run
And there is more complimentared for Neo4j case at http://homme.io/41721.
I am using Neo4j community edition embedded in java application for recommendation purpose. I made a custom function which contains a complex logic of comparing two entities, namely product and users. Both entities are present as nodes in graph and has more than 20 properties each for comparison purpose. For eg. I am calling this function in following format:
match (e:User {user_id:"some-id"}) with e
match (f:Product {product_id:"some-id"}) with e,f
return e,f,findComparisonValue(e,f) as pref_value;
This function call on an average takes about 4-5 ms to run. Now, to recommend best product to a particular user, I wrote a cypher query which iterates over all products, calculate the pref_value and rank them. My cypher query looks like this:
MATCH (source:User) WHERE id(source)={id} with source
MATCH (reco:Product) WHERE reco.is_active='t'
with reco, source, findComparisonValue(source, reco) as score_result
RETURN distinct reco, score_result.score as score, score_result.params as params, score_result.matched_keywords as matched_keywords
order by score desc
Some insights on graph structure:
Total Number of nodes: 2 million
Total Number of relationships: 20 million
Total Number of Users: 0.2 million
Total Number of Products: 1.8 million
The above cypher query is taking more than 10 seconds as it is iterating over all the products. On top of this cypher query, I am using graphaware-reco module for my recommendation needs (Using precompute, filteing, post processing etc). I thought of parallelising this but community edition does not support clustering. Now, as number of users in system is increasing day by day, I need to think of a scalable solution.
Can anyone help me out here, on how to optimize the query.
As others have commented, doing a significant calculation potentially millions of times in a single query is going to be slow, and does not take advantage of neo4j's strengths. You should investigate modifying your data model and calculation so that you can leverage relationships and/or indexes.
In the meantime, there are a number of things to suggest with your second query:
Make sure you have created an index for :Product(is_active), so that it is not necessary to scan all products. (By the way, if that property is actually supposed to be a boolean, then consider making it a boolean rather than a string.)
The RETURN clause should not need the DISTINCT operator, since all the result rows should be distinct anyway. This is because every reco value is already distinct. Removing that keyword should improve performance.
To keep things simple, as part of the ETL on my time-series data, I added a sequence number property to each row corresponding to 0..370365 (370,366 nodes, 5,555,490 properties - not that big). I later added a second property and named it "outeseq" (original) and "ineseq" (second) to see if an outright equivalence to base the relationship on might speed things up a bit.
I can get both of the following queries to run properly on up to ~30k nodes (LIMIT 30000) but past that, its just an endless wait. My JVM has 16g max (if it can even use it on a windows box):
MATCH (a:BOOK),(b:BOOK)
WHERE a.outeseq=b.outeseq-1
MERGE (a)-[s:FORWARD_SEQ]->(b)
RETURN s;
or
MATCH (a:BOOK),(b:BOOK)
WHERE a.outeseq=b.ineseq
MERGE (a)-[s:FORWARD_SEQ]->(b)
RETURN s;
I also added these in hopes of speeding things up:
CREATE CONSTRAINT ON (a:BOOK)
ASSERT a.outeseq IS UNIQUE
CREATE CONSTRAINT ON (b:BOOK)
ASSERT b.ineseq IS UNIQUE
I can't get the relationships created for the entire data set! Help!
Alternatively, I can also get bits of the relationships built with parameters, but haven't figured out how to parameterize the sequence over all of the node-to-node sequential relationships, at least not in a semantically general enough way to do this.
I profiled the query, but did't see any reason for it to "blow-up".
Another question: I would like each relationship to have a property to represent the difference in the time-stamps of each node or delta-t. Is there a way to take the difference between the two values in two sequential nodes, and assign it to the relationship?....for all of the relationships at the same time?
The last Q, if you have the time - I'd really like to use the raw data and just chain the directed relationships from one nodes'stamp to the next nearest node with the minimum delta, but didn't run right at this for fear that it cause scanning of all the nodes in order to build each relationship.
Before anyone suggests that I look to KDB or other db's for time series, let me say I have a very specific reason to want to use a DAG representation.
It seems like this should be so easy...it probably is and I'm blind. Thanks!
Creating Relationships
Since your queries work on 30k nodes, I'd suggest to run them page by page over all the nodes. It seems feasible because outeseq and ineseq are unique and numeric so you can sort nodes by that properties and run query against one slice at time.
MATCH (a:BOOK),(b:BOOK)
WHERE a.outeseq = b.outeseq-1
WITH a, b ORDER BY a.outeseq SKIP {offset} LIMIT 30000
MERGE (a)-[s:FORWARD_SEQ]->(b)
RETURN s;
It will take about 13 times to run the query changing {offset} to cover all the data. It would be nice to write a script on any language which has a neo4j client.
Updating Relationship's Properties
You can assign timestamp delta to relationships using SET clause following the MATCH. Assuming that a timestamp is a long:
MATCH (a:BOOK)-[s:FORWARD_SEQ]->(b:BOOK)
SET s.delta = abs(b.timestamp - a.timestamp);
Chaining Nodes With Minimal Delta
When relationships have the delta property inside, the graph becomes a weighted graph. So we can apply this approach to calculate the shortest path using deltas. Then we just save the length of the shortest path (summ of deltas) into the relation between the first and the last node.
MATCH p=(a:BOOK)-[:FORWARD_SEQ*1..]->(b:BOOK)
WITH p AS shortestPath, a, b,
reduce(weight=0, r in relationships(p) : weight+r.delta) AS totalDelta
ORDER BY totalDelta ASC
LIMIT 1
MERGE (a)-[nearest:NEAREST {delta: totalDelta}]->(b)
RETURN nearest;
Disclaimer: queries above are not supposed to be totally working, they just hint possible approaches to the problem.
We have a Neo4J database representing an evolutionary process with about 100K nodes and 200K relations. Nodes are individuals in generations, and edges represent parent-child relationships. The primary goal is to be able to take one or nodes of interest in the final generation, and explore their evolutionary history (roughly, "how did we get here?").
The "obvious" first query to find all their ancestors doesn't work because there are just too many possible ancestors and paths through that space:
match (a)-[:PARENT_OF*]->(c {is_interesting: true})
return distinct a;
So we've pre-processed the data so that some edges are marked as "special" such that almost every node has at most one "special" parent edge, although occasionally both parent edges are marked as "special". My hope, then, was that this query would (efficiently) generate the (nearly) unique path along "special" edges:
match (a)-[r:PARENT_OF* {special: true}]->(c {is_interesting: true})
return distinct a;
This, however, is still unworkably slow.
This is frustrating because "as a human", the logic is simple: Start from the small number of "interesting" nodes (often 1, never more than a few dozen), and chase back along the almost always unique "special" edges. Assuming a very low number of nodes with two "special" parents, this should be something like O(N) where N is the number of generations back in time.
In Neo4J, however, going back 25 steps from a unique "interesting" node where every step is unique, however, takes 30 seconds, and once there's a single bifurcation (where both parents are "special") it gets worse much faster as a function of steps. 28 steps (which gets us to the first bifurcation) takes 2 minutes, 30 (where there's still only the one bifurcation) takes 6 minutes, and I haven't even thought to try the full 100 steps to the beginning of the simulation.
Some similar work last year seemed to perform better, but we used a variety of edge labels (e.g., (a)-[:SPECIAL_PARENT_OF*]->(c) as well as (a)-[:PARENT_OF*]->(c)) instead of using data fields on the edges. Is querying on relationship field values just not a good idea? We have quite a few different values attached to a relationship in this model (some boolean, some numeric) and we were hoping/assuming we could use those to efficiently limit searches, but maybe that wasn't really the case.
Suggestions for how to tune our model or queries would be greatly appreciated.
Update I should have mentioned, this is all with Neo4J 2.1.7. I'm going to give 2.2 a try as per Brian Underwood's suggestion and will report back.
I've had some luck with specifying a limit on the path length. So if you know that it's never more than 30 hops you might try:
MATCH (c {is_interesting: true})
WITH c
MATCH (a)-[:PARENT_OF*1..30]->c
RETURN DISTINCT a
Also, is there an index on the is_interesting property? That could also cause slowness, for sure.
What version of Neo4j are you using? If you are using or if you upgrade to 2.2.0, you get to use the new query profiling tools:
http://neo4j.com/docs/2.2.0/how-do-i-profile-a-query.html
Also if you use them in the web console you get a nice graph-ish tree thing (technical term) showing each step.
After exploring things with the profiling tools in Neo4J 2.2 (thanks to Brian Underwood for the tip) it's pretty clear that (at the moment) Neo4J doesn't do any pre-filtering on edge properties, which leads to nasty combinatorial explosions with long paths.
For example the original query:
match (a)-[r:PARENT_OF* {special: true}]->(c {is_interesting: true})
return distinct a;
finds all the paths from a to c and then eliminates the ones that have edges that aren't special. Since there are many millions of paths from a to c, this is totally infeasible.
If I instead add a IS_SPECIAL edge wherever there was a PARENT_OF edge that had {special: true}, then the queries become really fast, allowing me to push back around 100 generations in under a second.
This query creates all the new edges:
match (a)-[r:PARENT_OF {special: true}]->(b)
create (a)-[:IS_SPECIAL]->(b);
and takes under a second to add 91K relationships in our graph.
Then
match (c {is_interesting: true})
with c
match (a)-[:IS_SPECIAL*]->(c)
return distinct a;
takes under a second to find the 112 nodes along the "special" path back from a unique target node c. Matching c first and limiting the set of nodes using with c seems to also be important, as Neo4J doesn't appear to pre-filter on node properties either, and if there are several "interesting" target nodes things get a lot slower.
I try to simplify my question. If all nodes in Neo4jDB have same label Science, what's the difference between MATCH n WHERE n.ID="UUID-0001" RETURN n and MATCH (n:Science) WHERE n.ID="UUID-0001" RETURN n. Why the performance is not the same?
My Neo4j database contains about 70000 nodes and 100 relations.
The nodes have two types: Paper and Author, and they both have an ID field.
I created each node with corresponding label, and I also use ID as the index.
However, since one of my functions need to query nodes by ID without considering the label. The query just like: MATCH n WHERE n.ID="UUID-0001" RETURN n. The query time cost about 4000~5000 ms!
But after adding Science for each node and using MATCH (n:Science) WHERE n.ID="UUID-0001" RETURN n. The query time became about 1000~1100 ms. Does anyone know the difference between these two cases?
PS. Count(n:Science) = Count(n:Paper) + Count(n:Author), which mean each node has two labels.
Because for every label Neo4j automatically creates an extra index. The Cypher language can be broadly thought of as piping + filtering, so Match n WHere ... will first get every node and then filter on the where part. Whereas Match (n:Science) Where... will get every node with label science (using an index) and then try to match the where. From your query performance we can see that about 1/5th of your nodes were marked science so the query runs in a fifth he time, because it did a fifth as many comparisons.
Even though I got the advisement from #phil_20686 and #Michael Hunger, but I think these answers do not solve my question.
I think there are some tricks when using label. If their are 10 thousand nodes in Neo4j DB, and the type of these nodes are the same. The query will perform better when adding label to these nodes.
I hope this post can help some people and give me some feedback if you find the reasons. Thanks.