How to optimize the following neo4j Cypher query

I am new to Cypher and have the query below to find mismatches between 2 source types (for example). I believe the query is syntactically fine, but it takes 1 minute to run on a data set of just 100,000 nodes. I am not using relationships yet. Can someone please help me optimize the query? Thanks.
MATCH (VW_OXSS41:VW_OrderXStatusSummary4{SourceTypeID: "1"})
WHERE apoc.date.parse(VW_OXSS41.TimeStamp,'s',('yyyy-MM-dd HH:mm:ss'))>=apoc.date.parse("2020-02-10",'s',('yyyy-MM-dd')) AND apoc.date.parse(VW_OXSS41.TimeStamp,'s',('yyyy-MM-dd HH:mm:ss'))<=apoc.date.parse("2020-02-16",'s',('yyyy-MM-dd'))
WITH VW_OXSS41.IdentifierValue as X
MATCH (VW_OXSS42:VW_OrderXStatusSummary4{SourceTypeID: "2"})
WHERE apoc.date.parse(VW_OXSS42.TimeStamp,'s',('yyyy-MM-dd HH:mm:ss'))>=apoc.date.parse("2020-02-10",'s',('yyyy-MM-dd')) AND apoc.date.parse(VW_OXSS42.TimeStamp,'s',('yyyy-MM-dd HH:mm:ss'))<=apoc.date.parse("2020-02-16",'s',('yyyy-MM-dd'))
WITH apoc.coll.disjunction(COLLECT(X), COLLECT(VW_OXSS42.IdentifierValue)) as XX
UNWIND (XX) as YY
The updated query and the error:
WITH apoc.date.parse("2020-02-20",'s',('yyyy-MM-dd')) AS a, apoc.date.parse("2020-02-25",'s',('yyyy-MM-dd')) AS b
MATCH (x:VW_OrderXStatusSummary4 {SourceTypeID: "2"})
WHERE a <= apoc.date.parse(x.TimeStamp,'s',('yyyy-MM-dd HH:mm:ss')) <= b
WITH a, b, COLLECT(x.IdentifierValue) AS X
MATCH (y:VW_OrderXStatusSummary4 {SourceTypeID: "1"})
WHERE a <= apoc.date.parse(y.TimeStamp,'s',('yyyy-MM-dd HH:mm:ss')) <= b
WITH X, COLLECT(y.IdentifierValue) AS Y
UNWIND apoc.coll.subtract(X,Y) AS XX
MATCH (z:VW_OrderXStatusSummary4 {SourceTypeID: "2"})
WHERE a <= apoc.date.parse(z.TimeStamp,'s',('yyyy-MM-dd HH:mm:ss')) <= b
RETURN XX AS MISMATCHES,MAX(z.TimeStamp);
Variable `a` not defined (line 10, column 7 (offset: 551))
"WHERE a <= apoc.date.parse(z.TimeStamp,'s',('yyyy-MM-dd HH:mm:ss')) <= b"
I solved the above error like this:
WITH apoc.date.parse("2020-02-21",'s',('yyyy-MM-dd')) AS a, apoc.date.parse("2020-02-25",'s',('yyyy-MM-dd')) AS b
MATCH (x:VW_OrderXStatusSummary4 {SourceTypeID: "2"})
WHERE a <= apoc.date.parse(x.TimeStamp,'s',('yyyy-MM-dd HH:mm:ss')) <= b
WITH a, b, COLLECT(x.IdentifierValue) AS X
MATCH (y:VW_OrderXStatusSummary4 {SourceTypeID: "1"})
WHERE a <= apoc.date.parse(y.TimeStamp,'s',('yyyy-MM-dd HH:mm:ss')) <= b
WITH X, COLLECT(y.IdentifierValue) AS Y
UNWIND apoc.coll.subtract(X,Y) AS XX
WITH XX, apoc.date.parse("2020-02-20",'s',('yyyy-MM-dd')) AS a, apoc.date.parse("2020-02-25",'s',('yyyy-MM-dd')) AS b
MATCH (z:VW_OrderXStatusSummary4 {SourceTypeID: "2"})
WHERE a <= apoc.date.parse(z.TimeStamp,'s',('yyyy-MM-dd HH:mm:ss')) <= b
AND XX = z.IdentifierValue
RETURN XX AS MISMATCHES,MAX(z.TimeStamp);
This gives the correct expected output:
+---------------------+-----------------------+
| MISMATCHES          | TIMESTAMP             |
+---------------------+-----------------------+
| "W2002201453550218" | "2020-02-21 12:00:16" |
| "W2002201453550222" | "2020-02-21 12:00:16" |
| "W2002201453550223" | "2020-02-21 09:30:36" |
| "W2002201453550224" | "2020-02-21 12:00:16" |
| "W2002201453550226" | "2020-02-21 12:00:16" |
| "W2002201453550227" | "2020-02-21 12:00:16" |
| "W2002201453550237" | "2020-02-21 12:00:16" |
| "3011WOS002978598"  | "2020-02-21 10:00:54" |
| "3011WOS002978595"  | "2020-02-21 13:00:57" |
| "0010000000006183"  | "2020-02-21 16:00:41" |
| "W2002181111547439" | "2020-02-21 04:00:34" |
| "11"                | "2020-02-21 16:00:41" |
| "10112787861P1458"  | "2020-02-21 10:00:54" |
+---------------------+-----------------------+
Wondering if there's a better approach?

You need to avoid making a cartesian product between the results of your two MATCH clauses. Let's say the two MATCH clauses would normally return N and M nodes, respectively, when executed in their own queries. Because your query combines those two MATCH clauses in the way that it does, your second MATCH clause is actually performing N*M matches (and producing N*M result rows).
You need to make sure you have created an index on :VW_OrderXStatusSummary4(SourceTypeID). That will optimize the lookups performed by the MATCH clauses.
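For example, on Neo4j 3.x the index can be created like this (a minimal sketch; on Neo4j 4.x and later the equivalent form is CREATE INDEX FOR (n:VW_OrderXStatusSummary4) ON (n.SourceTypeID)):
CREATE INDEX ON :VW_OrderXStatusSummary4(SourceTypeID);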
You can simplify your Cypher code to avoid duplicated function calls.
After creating the index indicated above, try this:
WITH apoc.date.parse("2020-02-10",'s',('yyyy-MM-dd')) AS a, apoc.date.parse("2020-02-16",'s',('yyyy-MM-dd')) AS b
MATCH (x:VW_OrderXStatusSummary4 {SourceTypeID: "1"})
WHERE a <= apoc.date.parse(x.TimeStamp,'s',('yyyy-MM-dd HH:mm:ss')) <= b
WITH a, b, COLLECT(x.IdentifierValue) AS X
MATCH (y:VW_OrderXStatusSummary4 {SourceTypeID: "2"})
WHERE a <= apoc.date.parse(y.TimeStamp,'s',('yyyy-MM-dd HH:mm:ss')) <= b
WITH X, COLLECT(y.IdentifierValue) AS Y
UNWIND apoc.coll.disjunction(X, Y) AS YY
...
Performing the COLLECT(x.IdentifierValue) operation in the first WITH clause causes it to return all the x nodes in a single result row (instead of N result rows). This allows the second MATCH to avoid a cartesian product issue.
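To illustrate the row counts involved, here is a minimal sketch (hypothetical counts, reusing the labels above): because COLLECT collapses the first MATCH to one row, the second MATCH executes against a single row instead of once per x node.
MATCH (x:VW_OrderXStatusSummary4 {SourceTypeID: "1"})
WITH COLLECT(x.IdentifierValue) AS X          // N rows collapse into 1 row carrying a list
MATCH (y:VW_OrderXStatusSummary4 {SourceTypeID: "2"})
RETURN SIZE(X) AS collected, COUNT(y) AS m    // the second MATCH yields M rows here, not N*M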

Related

Forcing cost planner to start from a specific index seek

My cypher query
EXPLAIN MATCH (b:Block)<-[:INCLUDED_IN]-(tx:Transaction {pstype: 0})
WHERE 1540512000 <= b.time < 1540598400
RETURN count(tx);
produces the following execution plan
+-------------------+----------------+-----------------+-------------------------------------------------------------------------+
| Operator          | Estimated Rows | Identifiers     | Other                                                                   |
+-------------------+----------------+-----------------+-------------------------------------------------------------------------+
| +ProduceResults   | 12             | count(tx)       |                                                                         |
| +EagerAggregation | 12             | count(tx)       |                                                                         |
| +Filter           | 136            | anon[16], b, tx | AndedPropertyInequalities(b.time >= { AUTOINT2}, b.time < { AUTOINT1}) |
| +Expand(All)      | 9052           | anon[16], b, tx | (tx)-[anon[16]:INCLUDED_IN]->(b)                                        |
| +NodeIndexSeek    | 9052           | tx              | :Transaction(pstype)                                                    |
+-------------------+----------------+-----------------+-------------------------------------------------------------------------+
which executes far too slowly, because the initial NodeIndexSeek on :Transaction(pstype) returns tens of millions of nodes instead of the estimated 9052. A NodeIndexSeekByRange on b:Block(time) would return only around 600 nodes.
I have tried forcing the execution plan to start from b:Block(time), but it still keeps using the NodeIndexSeek on tx:Transaction(pstype):
EXPLAIN MATCH (b:Block)<-[:INCLUDED_IN]-(tx:Transaction {pstype: 0})
USING INDEX b:Block(time)
WHERE 1540512000 <= b.time < 1540598400
RETURN count(tx);
produces
+-------------------------+----------------+-----------------+--------------------------------------------------------------+
| Operator | Estimated Rows | Identifiers | Other |
+-------------------------+----------------+-----------------+--------------------------------------------------------------+
| +ProduceResults | 12 | count(tx) | |
| | +----------------+-----------------+--------------------------------------------------------------+
| +EagerAggregation | 12 | count(tx) | |
| | +----------------+-----------------+--------------------------------------------------------------+
| +NodeHashJoin | 136 | anon[16], b, tx | b |
| |\ +----------------+-----------------+--------------------------------------------------------------+
| | +NodeIndexSeekByRange | 14703 | b | :Block(time) >= { AUTOINT2} AND :Block(time) < { AUTOINT1} |
| | +----------------+-----------------+--------------------------------------------------------------+
| +Expand(All) | 9052 | anon[16], b, tx | (tx)-[anon[16]:INCLUDED_IN]->(b) |
| | +----------------+-----------------+--------------------------------------------------------------+
| +NodeIndexSeek | 9052 | tx | :Transaction(pstype) |
+-------------------------+----------------+-----------------+--------------------------------------------------------------+
The only way I have gotten it to run fast (multiple orders of magnitude faster) is by using the rule planner:
CYPHER planner=rule MATCH (b:Block)
WHERE 1540512000 <= b.time < 1540598400
WITH b
MATCH (b)<-[:INCLUDED_IN]-(tx:Transaction {pstype: 0})
RETURN count(tx);
Is there a way to make it work when using the cost planner?
Both :Block(time) and :Transaction(pstype) are indexed.
You could try using a join hint on tx along with your index hint, which should ensure you only expand from one direction:
EXPLAIN
MATCH (b:Block)<-[:INCLUDED_IN]-(tx:Transaction {pstype: 0})
USING INDEX b:Block(time)
USING JOIN ON tx
WHERE 1540512000 <= b.time < 1540598400
RETURN count(tx);
Alternatively, you could restructure your query a bit so the tx node isn't initially part of the pattern, but is instead enforced in the WHERE clause. You'll need to split the MATCH in two, but I don't think you'll need any planner hints:
EXPLAIN
MATCH (tx:Transaction {pstype: 0})
MATCH (b:Block)<-[:INCLUDED_IN]-(x)
WHERE 1540512000 <= b.time < 1540598400
AND x = tx
RETURN count(tx);
EDIT
Okay, let's try another approach then:
EXPLAIN
MATCH (b:Block)<-[:INCLUDED_IN]-(x)
WHERE 1540512000 <= b.time < 1540598400
AND x.pstype = 0 // AND 'Transaction' in labels(x)
RETURN count(x);
If we leave off the label then it can't use an indexed lookup. If there are other nodes besides :Transaction nodes that have a pstype property, you could try uncommenting the line where we use an alternate way to see if the node has that label (I don't think this will use an index lookup, but not completely sure).
Another alternative (unsure if this will work) is to use a pattern comprehension to get a list of results from a pattern (after the initial match to b) and sum the sizes of those lists:
EXPLAIN
MATCH (b:Block)
WHERE 1540512000 <= b.time < 1540598400
RETURN sum(size([(b)<-[:INCLUDED_IN]-(x:Transaction) WHERE x.pstype = 0 | x])) as count

How to match node labels using OR?

match (p:Product {id:'5116003'})-[r]->(o:Attributes|ExtraAttribute) return p, o
How to match two possible node labels in such a query?
Per cybersam's suggestion, I changed to the following:
MATCH (p:Product {id:'5116003'})-[r]->(o)
WHERE o:Attributes OR o:ExtraAttributes
**WHERE any(key in keys(o) WHERE toLower(key) contains 'weight')**
return o
Now I need to add the 2nd 'where' clause. How to modify that?
You can try using the any() function:
match (p:Product {id:'5116003'})-[r]->(o)
where any (label in labels(o) where label in ['Attributes', 'ExtraAttribute'])
return p, o
Also, if you have APOC procedures, you can use the apoc.path.expand procedure, which expands from the start node following the given relationships from min-level to max-level while adhering to the label filters.
match (p:Product {id:'5116003'})
call apoc.path.expand(p, null,"+Attributes|ExtraAttribute",0,1) yield path
with nodes(path) as nodes
// return p and o nodes
return nodes[0], nodes[1]
See the APOC documentation for more details.
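Since the expansion here is exactly one hop, an equivalent sketch (under the same assumptions) is to expand from level 1 to level 1 and take the last node of each path:
match (p:Product {id:'5116003'})
call apoc.path.expand(p, null, "+Attributes|ExtraAttribute", 1, 1) yield path
// each path has length 1, so its last node is the expanded neighbor
return p, last(nodes(path)) as o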
These two single-label forms of your query:
MATCH (p:Product {id:'5116003'})-->(o:Attributes) RETURN p, o;
MATCH (p:Product {id:'5116003'})-->(o) WHERE o:Attributes RETURN p, o;
produce the same execution plan, as follows (I assume that there is an index on :Product(id)):
+-----------------+----------------+------+---------+------------------+--------------+
| Operator | Estimated Rows | Rows | DB Hits | Variables | Other |
+-----------------+----------------+------+---------+------------------+--------------+
| +ProduceResults | 0 | 0 | 0 | o, p | p, o |
| | +----------------+------+---------+------------------+--------------+
| +Filter | 0 | 0 | 0 | anon[33], o, p | o:Attributes |
| | +----------------+------+---------+------------------+--------------+
| +Expand(All) | 0 | 0 | 0 | anon[33], o -- p | (p)-->(o) |
| | +----------------+------+---------+------------------+--------------+
| +NodeIndexSeek | 0 | 0 | 1 | p | :Product(id) |
+-----------------+----------------+------+---------+------------------+--------------+
This two-label form of the second query above:
MATCH (p:Product {id:'5116003'})-->(o) WHERE o:Attributes OR o:ExtraAttribute RETURN p, o;
produces an execution plan that is very similar (and therefore probably not much more expensive):
+-----------------+----------------+------+---------+------------------+-------------------------------------+
| Operator | Estimated Rows | Rows | DB Hits | Variables | Other |
+-----------------+----------------+------+---------+------------------+-------------------------------------+
| +ProduceResults | 0 | 0 | 0 | o, p | p, o |
| | +----------------+------+---------+------------------+-------------------------------------+
| +Filter | 0 | 0 | 0 | anon[33], o, p | Ors(o:Attributes, o:ExtraAttribute) |
| | +----------------+------+---------+------------------+-------------------------------------+
| +Expand(All) | 0 | 0 | 0 | anon[33], o -- p | (p)-->(o) |
| | +----------------+------+---------+------------------+-------------------------------------+
| +NodeIndexSeek | 0 | 0 | 1 | p | :Product(id) |
+-----------------+----------------+------+---------+------------------+-------------------------------------+
By the way, the first query in the answer by @BrunoPeres also has a similar execution plan, but its Filter operation is very different. It is not clear which would be faster.
[UPDATE]
To answer your updated question: since you cannot have 2 back-to-back WHERE clauses, you can just add more terms to the already existing WHERE clause, like so:
MATCH (p:Product {id:'5116003'})-[r]->(o)
WHERE
(o:Attributes OR o:ExtraAttributes) AND
ANY(key in KEYS(o) WHERE TOLOWER(key) CONTAINS 'weight')
RETURN o;

Make a path where next node is not the previous node?

I have ~1.5M nodes in a graph that are structured like this (picture).
I run a Cypher query that performs calculations on each relationship traversed:
WITH 1 AS startVal
MATCH x = (c:Currency)-[r:Arb*2]->(m)
WITH x, REDUCE(s = startVal, e IN r | s * e.rate) AS endVal, startVal
RETURN EXTRACT(n IN NODES(x) | n) as Exchanges,
extract ( e IN relationships(x) | startVal * e.rate) AS Rel,
endVal, endVal - startVal AS Profit
ORDER BY Profit DESC LIMIT 5
The problem is it returns the path ("One")->("hop")->("One"), which is useless for me.
How can I make it not choose the previously walked node as the next node (i.e. "One"->"hop"->"any_other_node_but_not_"one")?
I have read that NODE_RECENT should address my issue. However, there was no example on how to specify the length of recent nodes in RestAPI or APOC procedures.
Is there a Cypher query for my case?
Thank you.
P.S. I am extremely new (less than 2 months) to Neo4j and coding, so my apologies if there is an obvious simple solution.
I don't know if I understood your question completely, but I believe your problem can be solved by putting a WHERE clause on the MATCH to prevent the undesired relationship from being matched, like this:
WITH 1 AS startVal
MATCH x = (c:Currency)-[r:Arb*2]->(m)
WHERE NOT (m)-[:Arb]->(c)
WITH x, REDUCE(s = startVal, e IN r | s * e.rate) AS endVal, startVal
RETURN EXTRACT(n IN NODES(x) | n) as Exchanges,
extract ( e IN relationships(x) | startVal * e.rate) AS Rel,
endVal, endVal - startVal AS Profit
ORDER BY Profit DESC LIMIT 5
Try inserting this clause after your MATCH clause, to filter out cases where c and m are the same:
WHERE c <> m
[EDITED]
That is:
WITH 1 AS startVal
MATCH x = (c:Currency)-[r:Arb*2]->(m)
WHERE c <> m
WITH x, REDUCE(s = startVal, e IN r | s * e.rate) AS endVal, startVal
RETURN EXTRACT(n IN NODES(x) | n) as Exchanges,
extract ( e IN relationships(x) | startVal * e.rate) AS Rel,
endVal, endVal - startVal AS Profit
ORDER BY Profit DESC LIMIT 5;
After using this query to create test data:
CREATE
(c:Currency {name: 'One'})-[:Arb {rate:1}]->(h:Account {name: 'hop'})-[:Arb {rate:2}]->(t:Currency {name: 'Two'}),
(t)-[:Arb {rate:3}]->(h)-[:Arb {rate:4}]->(c)
the above query produces these results:
+-----------------------------------------------------------------------------------------+
| Exchanges | Rel | endVal | Profit |
+-----------------------------------------------------------------------------------------+
| [Node[8]{name:"Two"},Node[7]{name:"hop"},Node[6]{name:"One"}] | [3,4] | 12 | 11 |
| [Node[6]{name:"One"},Node[7]{name:"hop"},Node[8]{name:"Two"}] | [1,2] | 2 | 1 |
+-----------------------------------------------------------------------------------------+

How to use the average function in neo4j with collections

I want to calculate the covariance of two vectors given as collections:
A=[1, 2, 3, 4]
B=[5, 6, 7, 8]
Cov(A,B) = SUM[(a_i - avg(A)) * (b_i - avg(B))] / (n - 1)
My problems with the covariance computation are:
1) I cannot have a nested aggregate function; this fails when I write:
SUM((ai-avg(a)) * (bi-avg(b)))
2) Alternatively, how can I process two collections with one REDUCE, such as:
REDUCE(x= 0.0, ai IN COLLECT(a) | bi IN COLLECT(b) | x + (ai-avg(a))*(bi-avg(b)))
3) If it is not possible to process two collections in one REDUCE, how can their values be related to calculate the covariance when they are separated:
REDUCE(x= 0.0, ai IN COLLECT(a) | x + (ai-avg(a)))
REDUCE(y= 0.0, bi IN COLLECT(b) | y + (bi-avg(b)))
I mean, can I write a nested REDUCE?
4) Is there any way to do this with UNWIND or EXTRACT?
Thank you in advance for any help.
cybersam's answer is totally fine but if you want to avoid the n^2 Cartesian product that results from the double UNWIND you can do this instead:
WITH [1,2,3,4] AS a, [5,6,7,8] AS b
WITH REDUCE(s = 0.0, x IN a | s + x) / SIZE(a) AS e_a,
REDUCE(s = 0.0, x IN b | s + x) / SIZE(b) AS e_b,
SIZE(a) AS n, a, b
RETURN REDUCE(s = 0.0, i IN RANGE(0, n - 1) | s + ((a[i] - e_a) * (b[i] - e_b))) / (n - 1) AS cov;
Edit:
Not calling anyone out, but let me elaborate more on why you would want to avoid the double UNWIND in https://stackoverflow.com/a/34423783/2848578. Like I said below, UNWINDing k length-n collections in Cypher results in n^k rows. So let's take two length-3 collections over which you want to calculate the covariance.
> WITH [1,2,3] AS a, [4,5,6] AS b
UNWIND a AS aa
UNWIND b AS bb
RETURN aa, bb;
| aa | bb
---+----+----
1 | 1 | 4
2 | 1 | 5
3 | 1 | 6
4 | 2 | 4
5 | 2 | 5
6 | 2 | 6
7 | 3 | 4
8 | 3 | 5
9 | 3 | 6
Now we have n^k = 3^2 = 9 rows. At this point, taking the average of these identifiers means we're taking the average of 9 values.
> WITH [1,2,3] AS a, [4,5,6] AS b
UNWIND a AS aa
UNWIND b AS bb
RETURN AVG(aa), AVG(bb);
| AVG(aa) | AVG(bb)
---+---------+---------
1 | 2.0 | 5.0
Also as I said below, this doesn't affect the answer because the average of a repeating vector of numbers will always be the same. For example, the average of {1,2,3} is equal to the average of {1,2,3,1,2,3}. It is likely inconsequential for small values of n, but when you start getting larger values of n you'll start seeing a performance decrease.
Let's say you have two length-1000 vectors. Calculating the average of each with a double UNWIND:
> WITH RANGE(0, 1000) AS a, RANGE(1000, 2000) AS b
UNWIND a AS aa
UNWIND b AS bb
RETURN AVG(aa), AVG(bb);
| AVG(aa) | AVG(bb)
---+---------+---------
1 | 500.0 | 1500.0
714 ms
Is significantly slower than using REDUCE:
> WITH RANGE(0, 1000) AS a, RANGE(1000, 2000) AS b
RETURN REDUCE(s = 0.0, x IN a | s + x) / SIZE(a) AS e_a,
REDUCE(s = 0.0, x IN b | s + x) / SIZE(b) AS e_b;
| e_a | e_b
---+-------+--------
1 | 500.0 | 1500.0
4 ms
To bring it all together, I'll compare the two queries in full on length-1000 vectors:
> WITH RANGE(0, 1000) AS aa, RANGE(1000, 2000) AS bb
UNWIND aa AS a
UNWIND bb AS b
WITH aa, bb, SIZE(aa) AS n, AVG(a) AS avgA, AVG(b) AS avgB
RETURN REDUCE(s = 0, i IN RANGE(0,n-1)| s +((aa[i]-avgA)*(bb[i]-avgB)))/(n-1) AS
covariance;
| covariance
---+------------
1 | 83583.5
9105 ms
> WITH RANGE(0, 1000) AS a, RANGE(1000, 2000) AS b
WITH REDUCE(s = 0.0, x IN a | s + x) / SIZE(a) AS e_a,
REDUCE(s = 0.0, x IN b | s + x) / SIZE(b) AS e_b,
SIZE(a) AS n, a, b
RETURN REDUCE(s = 0.0, i IN RANGE(0, n - 1) | s + ((a[i] - e_a) * (b[i] - e_b))) / (n - 1) AS cov;
| cov
---+---------
1 | 83583.5
33 ms
[EDITED]
This should calculate the covariance (according to your formula), given your sample inputs:
WITH [1,2,3,4] AS aa, [5,6,7,8] AS bb
UNWIND aa AS a
UNWIND bb AS b
WITH aa, bb, SIZE(aa) AS n, AVG(a) AS avgA, AVG(b) AS avgB
RETURN REDUCE(s = 0, i IN RANGE(0,n-1)| s +((aa[i]-avgA)*(bb[i]-avgB)))/(n-1) AS covariance;
This approach is OK when n is small, as is the case with the original sample data.
However, as @NicoleWhite and @jjaderberg point out, when n is not small, this approach will be inefficient. The answer by @NicoleWhite is an elegant general solution.
How do you arrive at collections A and B? The avg function is an aggregating function and cannot be used in the REDUCE context, nor can it be applied to collections. You should calculate your average before you get to that point, but exactly how to do that best depends on how you arrive at the two collections of values. If you are at a point where you have individual result items that you then collect to get A and B, that's the point when you could use avg. For example:
WITH [1, 2, 3, 4] AS aa UNWIND aa AS a
WITH collect(a) AS aa, avg(a) AS aAvg
RETURN aa, aAvg
and for both collections
WITH [1, 2, 3, 4] AS aColl UNWIND aColl AS a
WITH collect(a) AS aColl, avg(a) AS aAvg
WITH aColl, aAvg,[5, 6, 7, 8] AS bColl UNWIND bColl AS b
WITH aColl, aAvg, collect(b) AS bColl, avg(b) AS bAvg
RETURN aColl, aAvg, bColl, bAvg
Once you have the two averages, let's call them aAvg and bAvg, and the two collections, aColl and bColl, you can do
RETURN REDUCE(x = 0.0, i IN range(0, size(aColl) - 1) | x + ((aColl[i] - aAvg) * (bColl[i] - bAvg))) / (size(aColl) - 1) AS covariance
Thank you so much, all. However, I wonder which one is most efficient:
1) Nested UNWIND and RANGE inside REDUCE -> @cybersam
2) Nested REDUCE -> @NicoleWhite
3) Nested WITH (resetting the query with WITH) -> @jjaderberg
BUT there is an important issue:
Why is there a difference between your computations and the real, actual computation?
I mean, your covariance equals 1.6666666666666667,
but the real-world covariance equals 1.25.
Please check: https://www.easycalculation.com/statistics/covariance.php
Vector X: [1, 2, 3, 4]
Vector Y: [5, 6, 7, 8]
I think this difference is because some computations do not use (n-1) as the divisor; they just use n instead. Growing the divisor from n-1 to n diminishes the result from 1.67 to 1.25.
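Both conventions can be checked directly in Cypher. Here is a minimal sketch reusing the REDUCE form from above, returning the sample covariance (divisor n-1) and the population covariance (divisor n) side by side:
WITH [1,2,3,4] AS a, [5,6,7,8] AS b
WITH a, b, SIZE(a) AS n,
     REDUCE(s = 0.0, x IN a | s + x) / SIZE(a) AS e_a,
     REDUCE(s = 0.0, x IN b | s + x) / SIZE(b) AS e_b
WITH n, REDUCE(s = 0.0, i IN RANGE(0, n - 1) | s + (a[i] - e_a) * (b[i] - e_b)) AS total
RETURN total / (n - 1) AS sampleCov,   // 1.6666666666666667
       total / n AS populationCov;     // 1.25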

Neo4j browser interface stops working or reconnecting

The problem is the same no matter if I am using Safari or Chrome.
After running several times the same query shown below, I am getting the error: Disconnected from Neo4j. Please check if the cord is unplugged.
I am able to SSH to the server and run the query from the shell.
This query was the subject of another issue opened earlier, and someone optimized it to the form presented below. So it is not a matter of an unoptimized query; it seems to be something about the browser interface.
What is wrong here?
MATCH (p:Publisher)-[r:PUBLISHED]->(w:Woka)<-[s:AUTHORED]-(a:Author)
MATCH (l:Language)-[t:USED]->(w)-[u:INCLUDED]->(b:Bisac)
WHERE (a.author_name = 'Camus, Albert')
WITH p,r,w,s,a,l,t,u,b
OPTIONAL MATCH (d:Description)-[v:HAS_DESCRIPTION]-(w)
RETURN w, p, a, l, b, d, r, s, t, u, v;
More details: when the browser dies on one computer, it also dies on a second computer trying to connect to the same database.
Also, other commands, e.g.
$ rails console
or
$ rails s -d
to start the Rails server, no longer work.
If I restart the Neo4j server, everything works for a little while and then freezes again.
Below is the execution plan of the query:
neo4j-sh (?)$ EXPLAIN MATCH (p:Publisher)-[r:PUBLISHED]->(w:Woka)<-[s:AUTHORED]-(a:Author{author_name: 'Camus, Albert'}), (l:Language)-[t:USED]->(w)-[u:INCLUDED]->(b:Bisac)
WITH p,r,w,s,a,l,t,u,b
OPTIONAL MATCH (d:Description)-[v:HAS_DESCRIPTION]-(w)
RETURN w, p, a, l, b, d, r, s, t, u, v;
+--------------------------------------------+
| No data returned, and nothing was changed. |
+--------------------------------------------+
73 ms
Compiler CYPHER 2.2
Planner COST
OptionalExpand(All)
|
+Filter(0)
|
+Expand(All)(0)
|
+Filter(1)
|
+Expand(All)(1)
|
+Filter(2)
|
+Expand(All)(2)
|
+Filter(3)
|
+Expand(All)(3)
|
+NodeUniqueIndexSeek
+---------------------+---------------+---------------------------------+-----------------------------+
| Operator | EstimatedRows | Identifiers | Other |
+---------------------+---------------+---------------------------------+-----------------------------+
| OptionalExpand(All) | 5 | a, b, d, l, p, r, s, t, u, v, w | (w)-[v:HAS_DESCRIPTION]-(d) |
| Filter(0) | 5 | a, b, l, p, r, s, t, u, w | b:Bisac |
| Expand(All)(0) | 5 | a, b, l, p, r, s, t, u, w | (w)-[u:INCLUDED]->(b) |
| Filter(1) | 4 | a, l, p, r, s, t, w | l:Language |
| Expand(All)(1) | 4 | a, l, p, r, s, t, w | (w)<-[t:USED]-(l) |
| Filter(2) | 4 | a, p, r, s, w | p:Publisher |
| Expand(All)(2) | 4 | a, p, r, s, w | (w)<-[r:PUBLISHED]-(p) |
| Filter(3) | 4 | a, s, w | w:Woka |
| Expand(All)(3) | 4 | a, s, w | (a)-[s:AUTHORED]->(w) |
| NodeUniqueIndexSeek | 1 | a | :Author(author_name) |
+---------------------+---------------+---------------------------------+-----------------------------+
Total database accesses: ?
neo4j-sh (?)$
Here is a snapshot from top (before having the browser frozen):
top - 14:59:36 up 46 days, 17:03, 2 users, load average: 2.66, 4.58, 3.75
Tasks: 116 total, 2 running, 114 sleeping, 0 stopped, 0 zombie
%Cpu(s): 97.5 us, 0.8 sy, 0.0 ni, 1.5 id, 0.0 wa, 0.0 hi, 0.0 si, 0.2 st
KiB Mem: 15666128 total, 3858028 used, 11808100 free, 169612 buffers
KiB Swap: 0 total, 0 used, 0 free. 2144784 cached Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
10260 neo4j 20 0 14.348g 1.388g 195316 S 196.9 9.3 1:57.55 java
9879 ubuntu 20 0 23680 1656 1116 R 0.3 0.0 0:00.88 top
1 root 20 0 33508 2236 860 S 0.0 0.0 0:12.25 init
2 root 20 0 0 0 0 S 0.0 0.0 0:00.01 kthreadd
3 root 20 0 0 0 0 S 0.0 0.0 0:00.55 ksoftirqd/0
4 root 20 0 0 0 0 S 0.0 0.0 0:30.10 kworker/0:0
5 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kworker/0:0H
7 root 20 0 0 0 0 S 0.0 0.0 0:39.08 rcu_sched
8 root 20 0 0 0 0 R 0.0 0.0 0:47.50 rcuos/0
9 root 20 0 0 0 0 S 0.0 0.0 1:00.72 rcuos/1
What is the spec of your computer?
How much data does your query return?
MATCH (p:Publisher)-[r:PUBLISHED]->(w:Woka)<-[s:AUTHORED]-(a:Author)
WHERE (a.author_name = 'Camus, Albert')
MATCH (l:Language)-[t:USED]->(w)-[u:INCLUDED]->(b:Bisac)
WITH p,r,w,s,a,l,t,u,b
OPTIONAL MATCH (d:Description)-[v:HAS_DESCRIPTION]-(w)
RETURN w, p, a, l, b, d, r, s, t, u, v;
Also, what does your visual query plan look like? Please prefix your query with PROFILE, save the plan as a PNG, and share it.
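That is, run the same query with the PROFILE keyword prepended:
PROFILE MATCH (p:Publisher)-[r:PUBLISHED]->(w:Woka)<-[s:AUTHORED]-(a:Author)
WHERE (a.author_name = 'Camus, Albert')
MATCH (l:Language)-[t:USED]->(w)-[u:INCLUDED]->(b:Bisac)
WITH p,r,w,s,a,l,t,u,b
OPTIONAL MATCH (d:Description)-[v:HAS_DESCRIPTION]-(w)
RETURN w, p, a, l, b, d, r, s, t, u, v;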
The browser interface of Neo4j needs a major overhaul. There is no way to use this interface for serious design, modelling, and development.
I executed the following stress tests from a Ruby on Rails console. There were no errors about disconnects, networking, etc.; all runs completed successfully, whereas any of these queries froze the browser after 5-7 executions, even with the result set limited to 25 records. Moreover, I executed all of them while the browser interface was still frozen and showing that network disconnect error.
(1..1000).each do |n|
  q = "MATCH (p:Publisher)-[r:PUBLISHED]->(w:Woka)<-[s:AUTHORED]-(a:Author)
       WHERE (a.author_name = 'Freud, Sigmund')
       MATCH (l:Language)-[t:USED]->(w)-[u:INCLUDED]->(b:Bisac)
       WITH p,r,w,s,a,l,t,u,b
       OPTIONAL MATCH (d:Description)-[v:HAS_DESCRIPTION]-(w)
       RETURN w, p, a, l, b, d, r, s, t, u, v;"
  r = Neo4j::Session.current.query(q)
  print n, "\t", r.count, "\t", Time.now, "\n"
end
(1..1000).each do |n|
  q = "MATCH (p:Publisher)-[r:PUBLISHED]->(w:Woka)<-[s:AUTHORED]-(a:Author)
       WHERE (a.author_name = 'Einstein, Albert')
       MATCH (l:Language)-[t:USED]->(w)-[u:INCLUDED]->(b:Bisac)
       WITH p,r,w,s,a,l,t,u,b
       OPTIONAL MATCH (d:Description)-[v:HAS_DESCRIPTION]-(w)
       RETURN w, p, a, l, b, d, r, s, t, u, v;"
  r = Neo4j::Session.current.query(q)
  print n, "\t", r.count, "\t", Time.now, "\n"
end
