How to use average function in neo4j with collection - neo4j

I want to calculate the covariance of two vectors stored as collections:
A=[1, 2, 3, 4]
B=[5, 6, 7, 8]
Cov(A,B)= Sigma[(ai-AVGa)*(bi-AVGb)] / (n-1)
My problems with computing the covariance are:
1) I cannot nest aggregate functions, as when I write
SUM((ai-avg(a)) * (bi-avg(b)))
2) Alternatively, how can I iterate over two collections in one REDUCE, such as:
REDUCE(x= 0.0, ai IN COLLECT(a) | bi IN COLLECT(b) | x + (ai-avg(a))*(bi-avg(b)))
3) If it is not possible to iterate over two collections in one REDUCE, how can I relate their values to calculate the covariance when they are processed separately?
REDUCE(x= 0.0, ai IN COLLECT(a) | x + (ai-avg(a)))
REDUCE(y= 0.0, bi IN COLLECT(b) | y + (bi-avg(b)))
That is, can I write a nested REDUCE?
4) Is there a way to do this with UNWIND or EXTRACT?
Thank you in advance for any help.

cybersam's answer is totally fine but if you want to avoid the n^2 Cartesian product that results from the double UNWIND you can do this instead:
WITH [1,2,3,4] AS a, [5,6,7,8] AS b
WITH REDUCE(s = 0.0, x IN a | s + x) / SIZE(a) AS e_a,
REDUCE(s = 0.0, x IN b | s + x) / SIZE(b) AS e_b,
SIZE(a) AS n, a, b
RETURN REDUCE(s = 0.0, i IN RANGE(0, n - 1) | s + ((a[i] - e_a) * (b[i] - e_b))) / (n - 1) AS cov;
Edit:
Not calling anyone out, but let me elaborate more on why you would want to avoid the double UNWIND in https://stackoverflow.com/a/34423783/2848578. Like I said below, UNWINDing k length-n collections in Cypher results in n^k rows. So let's take two length-3 collections over which you want to calculate the covariance.
> WITH [1,2,3] AS a, [4,5,6] AS b
UNWIND a AS aa
UNWIND b AS bb
RETURN aa, bb;
| aa | bb
---+----+----
1 | 1 | 4
2 | 1 | 5
3 | 1 | 6
4 | 2 | 4
5 | 2 | 5
6 | 2 | 6
7 | 3 | 4
8 | 3 | 5
9 | 3 | 6
Now we have n^k = 3^2 = 9 rows. At this point, taking the average of these identifiers means we're taking the average of 9 values.
> WITH [1,2,3] AS a, [4,5,6] AS b
UNWIND a AS aa
UNWIND b AS bb
RETURN AVG(aa), AVG(bb);
| AVG(aa) | AVG(bb)
---+---------+---------
1 | 2.0 | 5.0
Also as I said below, this doesn't affect the answer because the average of a repeating vector of numbers will always be the same. For example, the average of {1,2,3} is equal to the average of {1,2,3,1,2,3}. It is likely inconsequential for small values of n, but when you start getting larger values of n you'll start seeing a performance decrease.
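The row blow-up is easy to reproduce outside Neo4j. The following is a minimal Python sketch, not Cypher: `itertools.product` stands in for the double UNWIND, and the variable names are only illustrative.

```python
from itertools import product
from statistics import mean

a = [1, 2, 3]
b = [4, 5, 6]

# Two consecutive UNWINDs pair every element of a with every element of b.
rows = list(product(a, b))

# Each value of a is repeated len(b) times (and vice versa), so the mean of a
# column is unchanged even though n^2 values are being averaged.
avg_aa = mean(aa for aa, _ in rows)
avg_bb = mean(bb for _, bb in rows)

print(len(rows), avg_aa, avg_bb)
```

The averages match those of the original lists, but the amount of work grows with the number of rows rather than with n.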
Let's say you have two vectors of about 1000 elements each (RANGE(0, 1000) is inclusive, so each list below has 1001 values). Calculating the average of each with a double UNWIND:
> WITH RANGE(0, 1000) AS a, RANGE(1000, 2000) AS b
UNWIND a AS aa
UNWIND b AS bb
RETURN AVG(aa), AVG(bb);
| AVG(aa) | AVG(bb)
---+---------+---------
1 | 500.0 | 1500.0
714 ms
Is significantly slower than using REDUCE:
> WITH RANGE(0, 1000) AS a, RANGE(1000, 2000) AS b
RETURN REDUCE(s = 0.0, x IN a | s + x) / SIZE(a) AS e_a,
REDUCE(s = 0.0, x IN b | s + x) / SIZE(b) AS e_b;
| e_a | e_b
---+-------+--------
1 | 500.0 | 1500.0
4 ms
To bring it all together, I'll compare the two queries in full on the same vectors:
> WITH RANGE(0, 1000) AS aa, RANGE(1000, 2000) AS bb
UNWIND aa AS a
UNWIND bb AS b
WITH aa, bb, SIZE(aa) AS n, AVG(a) AS avgA, AVG(b) AS avgB
RETURN REDUCE(s = 0, i IN RANGE(0,n-1)| s +((aa[i]-avgA)*(bb[i]-avgB)))/(n-1) AS covariance;
| covariance
---+------------
1 | 83583.5
9105 ms
> WITH RANGE(0, 1000) AS a, RANGE(1000, 2000) AS b
WITH REDUCE(s = 0.0, x IN a | s + x) / SIZE(a) AS e_a,
REDUCE(s = 0.0, x IN b | s + x) / SIZE(b) AS e_b,
SIZE(a) AS n, a, b
RETURN REDUCE(s = 0.0, i IN RANGE(0, n - 1) | s + ((a[i] - e_a) * (b[i] - e_b))) / (n - 1) AS cov;
| cov
---+---------
1 | 83583.5
33 ms
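As a cross-check of the 83583.5 figure, here is a small Python sketch that mirrors the index-based REDUCE (compute both means first, then a single pass over the indices). The function name `covariance_by_index` is made up for this illustration, and the timings above are not reproduced here.

```python
def covariance_by_index(a, b):
    """Sample covariance of two equal-length sequences, dividing by n - 1."""
    n = len(a)
    mean_a = sum(a) / n  # plays the role of e_a in the Cypher query
    mean_b = sum(b) / n  # plays the role of e_b
    return sum((a[i] - mean_a) * (b[i] - mean_b) for i in range(n)) / (n - 1)

a = list(range(0, 1001))      # Cypher RANGE(0, 1000) is inclusive: 1001 values
b = list(range(1000, 2001))   # Cypher RANGE(1000, 2000), also 1001 values
cov = covariance_by_index(a, b)

print(len(a), cov)
```

The loop visits each index once, so the work is linear in n instead of quadratic.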

[EDITED]
This should calculate the covariance (according to your formula), given your sample inputs:
WITH [1,2,3,4] AS aa, [5,6,7,8] AS bb
UNWIND aa AS a
UNWIND bb AS b
WITH aa, bb, SIZE(aa) AS n, AVG(a) AS avgA, AVG(b) AS avgB
RETURN REDUCE(s = 0, i IN RANGE(0,n-1)| s +((aa[i]-avgA)*(bb[i]-avgB)))/(n-1) AS covariance;
This approach is OK when n is small, as is the case with the original sample data.
However, as @NicoleWhite and @jjaderberg point out, when n is not small, this approach will be inefficient. The answer by @NicoleWhite is an elegant general solution.

How do you arrive at collections A and B? The avg function is an aggregating function and cannot be used in the REDUCE context, nor can it be applied to collections. You should calculate your average before you get to that point, but exactly how to do that best depends on how you arrive at the two collections of values. If you are at a point where you have individual result items that you then collect to get A and B, that's the point when you could use avg. For example:
WITH [1, 2, 3, 4] AS aa UNWIND aa AS a
WITH collect(a) AS aa, avg(a) AS aAvg
RETURN aa, aAvg
and for both collections
WITH [1, 2, 3, 4] AS aColl UNWIND aColl AS a
WITH collect(a) AS aColl, avg(a) AS aAvg
WITH aColl, aAvg,[5, 6, 7, 8] AS bColl UNWIND bColl AS b
WITH aColl, aAvg, collect(b) AS bColl, avg(b) AS bAvg
RETURN aColl, aAvg, bColl, bAvg
Once you have the two averages, let's call them aAvg and bAvg, and the two collections, aColl and bColl, you can do
RETURN REDUCE(x = 0.0, i IN range(0, size(aColl) - 1) | x + ((aColl[i] - aAvg) * (bColl[i] - bAvg))) / (size(aColl) - 1) AS covariance

Thank you all very much. However, I wonder which approach is the most efficient:
1) Nested UNWIND with RANGE inside REDUCE -> @cybersam
2) Nested REDUCE -> @Nicole White
3) Nested WITH (restarting the query with WITH) -> @jjaderberg
But the important issue is:
Why is there a difference between your results and the covariance computed elsewhere?
Your covariance equals 1.6666666666666667,
but the calculator below gives 1.25.
please check: https://www.easycalculation.com/statistics/covariance.php
Vector X: [1, 2, 3, 4]
Vector Y: [5, 6, 7, 8]
I think this difference arises because some calculators use n as the divisor instead of (n-1). When the divisor grows from n-1 to n, the result decreases from about 1.67 to 1.25.
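The two numbers can be reproduced with a short Python sketch. The `ddof` parameter name is borrowed from common statistics libraries and is only an illustrative choice: `ddof=1` divides by n - 1 (sample covariance) and `ddof=0` divides by n (population covariance).

```python
def covariance(xs, ys, ddof=1):
    """Covariance with divisor n - ddof."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    total = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return total / (n - ddof)

x = [1, 2, 3, 4]
y = [5, 6, 7, 8]
sample_cov = covariance(x, y, ddof=1)       # sum of products is 5, divided by 3
population_cov = covariance(x, y, ddof=0)   # same sum, divided by 4

print(sample_cov, population_cov)
```

Both values come from the same sum of products (5.0); only the divisor differs.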

Related

Neo4j: Cypher query to parallelize a row of the result from a previous query

I have a database where sentences are related to each other. I have to perform a big update on the whole database, thus I'm trying to parallelize the update.
The relevant cypher query looks like this:
match (s:Sentence)-[r:RELATED]-(t:Sentence)
return s as sentence, collect(t.embedding) as neighbours_embeddings
embedding is a list of numbers.
This returns a result like this:
---------------------------------------
| sentence | neighbours_embeddings |
---------------------------------------
| sentence1 | [[1, 2, 3], [4, 5, 6]] |
---------------------------------------
| sentence2 | [[2, 3, 5]] |
---------------------------------------
Now I wanna perform some operations on the neighbours_embeddings and set a property in the corresponding Sentence node.
I've looked at different parallelization techniques in Neo4j and as far as I understood, all of them need a list as input. But my input would be a tuple like (sentence, neighbours_embeddings). How do I achieve this?
Full query for interested folks:
match (s:Sentence)-[r:RELATED]-(t:Sentence)
with s as sentence, collect(t.embedding) as neighbours
with sentence, [
w in reduce(s=[], neighbour IN neighbours |
case when size(s) = 0 then
neighbour else [
i in range(0, size(s)-1) |
s[i] + neighbour[i]] end) |
w / tofloat(size(neighbours))
] as average
with sentence, [
i in range(0, size(sentence.embedding)-1) |
(0.8 * sentence.embedding[i]) + (0.2 *average[i])
] as unnormalized
with sentence, unnormalized, sqrt(reduce(sum = 0.0, element in unnormalized | sum + element^2)) as divideby
set sentence.normalized = [
i in range(0, size(unnormalized)-1) | (unnormalized[i] / divideby)
]
For parallelizing, APOC is your friend, specifically the apoc.periodic.iterate procedure. In your use case you can parallelize because each row only updates a property of a single node.
The resulting query would look something like:
CALL apoc.periodic.iterate("
match (s:Sentence) RETURN s",
"
match (s)-[r:RELATED]-(t:Sentence)
with s as sentence, collect(t.embedding) as neighbours
with sentence, [
w in reduce(s=[], neighbour IN neighbours |
case when size(s) = 0 then
neighbour else [
i in range(0, size(s)-1) |
s[i] + neighbour[i]] end) |
w / tofloat(size(neighbours))
] as average
with sentence, [
i in range(0, size(sentence.embedding)-1) |
(0.8 * sentence.embedding[i]) + (0.2 *average[i])
] as unnormalized
with sentence, unnormalized, sqrt(reduce(sum = 0.0, element in unnormalized | sum + element^2)) as divideby
set sentence.normalized = [
i in range(0, size(unnormalized)-1) | (unnormalized[i] / divideby)
]", {batchSize:1000, parallel:true})
You can play around with the batchSize parameter. For more information look at the docs.
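To make the per-node computation easier to follow, here is a Python sketch of what the inner statement appears to calculate for one sentence: the element-wise mean of the neighbour embeddings, a 0.8/0.2 blend with the node's own embedding, and L2 normalisation. The function name and the sample `own` vector are made up; the neighbour vectors are taken from the example table in the question.

```python
import math

def updated_embedding(own, neighbours, own_weight=0.8, neighbour_weight=0.2):
    """Blend a node's embedding with the mean of its neighbours, then L2-normalise."""
    dims = len(own)
    # Element-wise mean of the neighbour vectors (the nested REDUCE in the query).
    average = [sum(vec[i] for vec in neighbours) / len(neighbours) for i in range(dims)]
    # Weighted blend, as in the "unnormalized" list comprehension.
    blended = [own_weight * own[i] + neighbour_weight * average[i] for i in range(dims)]
    norm = math.sqrt(sum(v * v for v in blended))
    return [v / norm for v in blended]

result = updated_embedding([1.0, 0.0, 0.0], [[1, 2, 3], [4, 5, 6]])
print(result)
```

Because each result depends only on one node and its neighbours' existing `embedding` values, and is written to a separate `normalized` property, the rows can be processed independently.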

How to optimize the following neo4j Cypher query

I am new to Cypher and have the query below to find mismatches between two source types (for example). I believe the query is syntactically fine, but it takes 1 minute to run on a data set of just 100,000 nodes. I am not using relationships yet. Can someone please help optimize the query? Thanks.
MATCH (VW_OXSS41:VW_OrderXStatusSummary4{SourceTypeID: "1"})
WHERE apoc.date.parse(VW_OXSS41.TimeStamp,'s',('yyyy-MM-dd HH:mm:ss'))>=apoc.date.parse("2020-02-10",'s',('yyyy-MM-dd')) AND apoc.date.parse(VW_OXSS41.TimeStamp,'s',('yyyy-MM-dd HH:mm:ss'))<=apoc.date.parse("2020-02-16",'s',('yyyy-MM-dd'))
WITH VW_OXSS41.IdentifierValue as X
MATCH (VW_OXSS42:VW_OrderXStatusSummary4{SourceTypeID: "2"})
WHERE apoc.date.parse(VW_OXSS42.TimeStamp,'s',('yyyy-MM-dd HH:mm:ss'))>=apoc.date.parse("2020-02-10",'s',('yyyy-MM-dd')) AND apoc.date.parse(VW_OXSS42.TimeStamp,'s',('yyyy-MM-dd HH:mm:ss'))<=apoc.date.parse("2020-02-16",'s',('yyyy-MM-dd'))
WITH apoc.coll.disjunction(COLLECT(X), COLLECT(VW_OXSS42.IdentifierValue)) as XX
UNWIND (XX) as YY
The updated query and the error:-
WITH apoc.date.parse("2020-02-20",'s',('yyyy-MM-dd')) AS a, apoc.date.parse("2020-02-25",'s',('yyyy-MM-dd')) AS b
MATCH (x:VW_OrderXStatusSummary4 {SourceTypeID: "2"})
WHERE a <= apoc.date.parse(x.TimeStamp,'s',('yyyy-MM-dd HH:mm:ss')) <= b
WITH a, b, COLLECT(x.IdentifierValue) AS X
MATCH (y:VW_OrderXStatusSummary4 {SourceTypeID: "1"})
WHERE a <= apoc.date.parse(y.TimeStamp,'s',('yyyy-MM-dd HH:mm:ss')) <= b
WITH X, COLLECT(y.IdentifierValue) AS Y
UNWIND apoc.coll.subtract(X,Y) AS XX
MATCH (z:VW_OrderXStatusSummary4 {SourceTypeID: "2"})
WHERE a <= apoc.date.parse(z.TimeStamp,'s',('yyyy-MM-dd HH:mm:ss')) <= b
RETURN XX AS MISMATCHES,MAX(z.TimeStamp);
Variable `a` not defined (line 10, column 7 (offset: 551))
"WHERE a <= apoc.date.parse(z.TimeStamp,'s',('yyyy-MM-dd HH:mm:ss')) <= b"
Solved the above error like this:-
WITH apoc.date.parse("2020-02-21",'s',('yyyy-MM-dd')) AS a, apoc.date.parse("2020-02-25",'s',('yyyy-MM-dd')) AS b
MATCH (x:VW_OrderXStatusSummary4 {SourceTypeID: "2"})
WHERE a <= apoc.date.parse(x.TimeStamp,'s',('yyyy-MM-dd HH:mm:ss')) <= b
WITH a, b, COLLECT(x.IdentifierValue) AS X
MATCH (y:VW_OrderXStatusSummary4 {SourceTypeID: "1"})
WHERE a <= apoc.date.parse(y.TimeStamp,'s',('yyyy-MM-dd HH:mm:ss')) <= b
WITH X, COLLECT(y.IdentifierValue) AS Y
UNWIND apoc.coll.subtract(X,Y) AS XX
WITH XX, apoc.date.parse("2020-02-20",'s',('yyyy-MM-dd')) AS a, apoc.date.parse("2020-02-25",'s',('yyyy-MM-dd')) AS b
MATCH (z:VW_OrderXStatusSummary4 {SourceTypeID: "2"})
WHERE a <= apoc.date.parse(z.TimeStamp,'s',('yyyy-MM-dd HH:mm:ss')) <= b
AND XX = z.IdentifierValue
RETURN XX AS MISMATCHES,MAX(z.TimeStamp);
With the correct expected output as:-
+---------------------------------------------+
| MISMATCHES | TIMESTAMP |
+---------------------------------------------+
| "W2002201453550218" | "2020-02-21 12:00:16" |
| "W2002201453550222" | "2020-02-21 12:00:16" |
| "W2002201453550223" | "2020-02-21 09:30:36" |
| "W2002201453550224" | "2020-02-21 12:00:16" |
| "W2002201453550226" | "2020-02-21 12:00:16" |
| "W2002201453550227" | "2020-02-21 12:00:16" |
| "W2002201453550237" | "2020-02-21 12:00:16" |
| "3011WOS002978598" | "2020-02-21 10:00:54" |
| "3011WOS002978595" | "2020-02-21 13:00:57" |
| "0010000000006183" | "2020-02-21 16:00:41" |
| "W2002181111547439" | "2020-02-21 04:00:34" |
| "11" | "2020-02-21 16:00:41" |
| "10112787861P1458" | "2020-02-21 10:00:54" |
+---------------------------------------------+
Wondering if there's a better approach?
You need to avoid making a cartesian product between the results of your two MATCH clauses. Let's say the two MATCH clauses would normally return N and M nodes, respectively, when executed in their own queries. Because your query combines those two MATCH clauses in the way that it does, your second MATCH clause is actually performing N*M matches (and producing N*M result rows).
You need to make sure you have created an index on :VW_OrderXStatusSummary4(SourceTypeID). That will optimize the lookups performed by the MATCH clauses.
You can simplify your Cypher code to avoid duplicated function calls.
After creating the index indicated above, try this:
WITH apoc.date.parse("2020-02-10",'s',('yyyy-MM-dd')) AS a, apoc.date.parse("2020-02-16",'s',('yyyy-MM-dd')) AS b
MATCH (x:VW_OrderXStatusSummary4 {SourceTypeID: "1"})
WHERE a <= apoc.date.parse(x.TimeStamp,'s',('yyyy-MM-dd HH:mm:ss')) <= b
WITH a, b, COLLECT(x.IdentifierValue) AS X
MATCH (y:VW_OrderXStatusSummary4 {SourceTypeID: "2"})
WHERE a <= apoc.date.parse(y.TimeStamp,'s',('yyyy-MM-dd HH:mm:ss')) <= b
WITH X, COLLECT(y.IdentifierValue) AS Y
UNWIND apoc.coll.disjunction(X, Y) AS YY
...
Performing the COLLECT(x.IdentifierValue) operation in the WITH clause that follows the first MATCH collapses all the x identifier values into a single result row (instead of N result rows). This allows the second MATCH to avoid the cartesian product issue.
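The effect of collecting before comparing can be sketched in Python with sets. This assumes that apoc.coll.disjunction behaves like a symmetric difference and apoc.coll.subtract like a one-sided difference; the identifier values are hypothetical.

```python
source1_ids = ["A1", "A2", "A3"]         # hypothetical IdentifierValue values, SourceTypeID "1"
source2_ids = ["A2", "A3", "A4", "A5"]   # hypothetical values, SourceTypeID "2"

# Collect-then-compare only touches len(source1_ids) + len(source2_ids) values.
in_exactly_one = sorted(set(source1_ids) ^ set(source2_ids))    # symmetric difference
only_in_source2 = sorted(set(source2_ids) - set(source1_ids))   # one-sided difference

# Without the intermediate COLLECT, every source-1 row is paired with every source-2 row.
cartesian_rows = len(source1_ids) * len(source2_ids)

print(in_exactly_one, only_in_source2, cartesian_rows)
```

Which of the two differences is wanted depends on whether mismatches in both directions matter or only identifiers missing from one source.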

Make a path where next node is not the previous node?

I have ~1.5 M nodes in a graph, that are structured like this (picture)
I run a Cypher query that performs calculations on each relationship traversed:
WITH 1 AS startVal
MATCH x = (c:Currency)-[r:Arb*2]->(m)
WITH x, REDUCE(s = startVal, e IN r | s * e.rate) AS endVal, startVal
RETURN EXTRACT(n IN NODES(x) | n) as Exchanges,
extract ( e IN relationships(x) | startVal * e.rate) AS Rel,
endVal, endVal - startVal AS Profit
ORDER BY Profit DESC LIMIT 5
The problem is it returns the path ("One")->("hop")->("One"), which is useless for me.
How can I make it not choose the previously visited node as the next node (i.e. "One"->"hop"->any node other than "One")?
I have read that NODE_RECENT should address my issue. However, there was no example on how to specify the length of recent nodes in RestAPI or APOC procedures.
Is there a Cypher query for my case?
Thank you.
P.S. I am extremely new (less than 2 months) to Neo4j and coding, so my apologies if there is an obvious, simple solution.
I don't know if I understood your question completely, but I believe your problem can be solved by putting a WHERE clause on the MATCH to prevent the undesired paths from being matched, like this:
WITH 1 AS startVal
MATCH x = (c:Currency)-[r:Arb*2]->(m)
WHERE NOT (m)-[:Arb]->(c)
WITH x, REDUCE(s = startVal, e IN r | s * e.rate) AS endVal, startVal
RETURN EXTRACT(n IN NODES(x) | n) as Exchanges,
extract ( e IN relationships(x) | startVal * e.rate) AS Rel,
endVal, endVal - startVal AS Profit
ORDER BY Profit DESC LIMIT 5
Try inserting this clause after your MATCH clause, to filter out cases where c and m are the same:
WHERE c <> m
[EDITED]
That is:
WITH 1 AS startVal
MATCH x = (c:Currency)-[r:Arb*2]->(m)
WHERE c <> m
WITH x, REDUCE(s = startVal, e IN r | s * e.rate) AS endVal, startVal
RETURN EXTRACT(n IN NODES(x) | n) as Exchanges,
extract ( e IN relationships(x) | startVal * e.rate) AS Rel,
endVal, endVal - startVal AS Profit
ORDER BY Profit DESC LIMIT 5;
After using this query to create test data:
CREATE
(c:Currency {name: 'One'})-[:Arb {rate:1}]->(h:Account {name: 'hop'})-[:Arb {rate:2}]->(t:Currency {name: 'Two'}),
(t)-[:Arb {rate:3}]->(h)-[:Arb {rate:4}]->(c)
the above query produces these results:
+-----------------------------------------------------------------------------------------+
| Exchanges | Rel | endVal | Profit |
+-----------------------------------------------------------------------------------------+
| [Node[8]{name:"Two"},Node[7]{name:"hop"},Node[6]{name:"One"}] | [3,4] | 12 | 11 |
| [Node[6]{name:"One"},Node[7]{name:"hop"},Node[8]{name:"Two"}] | [1,2] | 2 | 1 |
+-----------------------------------------------------------------------------------------+
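The same filtering can be traced by hand with a small Python sketch over the test data. The edge list mirrors the CREATE statement above; the function name is made up, and the sketch only handles paths of exactly two hops.

```python
# Directed edges (start, end, rate), mirroring the CREATE statement above.
EDGES = [("One", "hop", 1), ("hop", "Two", 2), ("Two", "hop", 3), ("hop", "One", 4)]
CURRENCIES = {"One", "Two"}  # "hop" has the Account label, so it is never a start node

def two_hop_paths(start_val=1):
    rows = []
    for c, mid, rate1 in EDGES:
        if c not in CURRENCIES:
            continue
        for mid2, m, rate2 in EDGES:
            # The second hop must continue from the middle node, and the
            # end node must differ from the start node (WHERE c <> m).
            if mid2 != mid or m == c:
                continue
            end_val = start_val * rate1 * rate2
            rows.append(([c, mid, m], end_val, end_val - start_val))
    # ORDER BY Profit DESC
    return sorted(rows, key=lambda row: row[2], reverse=True)

paths = two_hop_paths()
for path, end_val, profit in paths:
    print(path, end_val, profit)
```

The round trips One->hop->One and Two->hop->Two are dropped, leaving the two rows shown in the result table.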

Neo4j browser interface stops working or reconnecting

The problem is the same no matter if I am using Safari or Chrome.
After running several times the same query shown below, I am getting the error: Disconnected from Neo4j. Please check if the cord is unplugged.
I am able to SSH to the server and run the query from the shell.
This query was the subject of another question opened earlier, where someone optimized it into the form presented below. So it is not a matter of an unoptimized query; it seems to be something about the browser interface.
What is wrong here?
MATCH (p:Publisher)-[r:PUBLISHED]->(w:Woka)<-[s:AUTHORED]-(a:Author)
MATCH (l:Language)-[t:USED]->(w)-[u:INCLUDED]->(b:Bisac)
WHERE (a.author_name = 'Camus, Albert')
WITH p,r,w,s,a,l,t,u,b
OPTIONAL MATCH (d:Description)-[v:HAS_DESCRIPTION]-(w)
RETURN w, p, a, l, b, d, r, s, t, u, v;
More details: when the browser dies on one computer, it also dies on a second computer trying to connect to the same database.
Other commands are affected too, e.g.
$ rails console
or
$ rails s -d
to start the Rails server no longer work.
If I restart the Neo4j database server, everything works for a little while and then freezes again.
Below is the execution plan of the query:
neo4j-sh (?)$ EXPLAIN MATCH (p:Publisher)-[r:PUBLISHED]->(w:Woka)<-[s:AUTHORED]-(a:Author{author_name: 'Camus, Albert'}), (l:Language)-[t:USED]->(w)-[u:INCLUDED]->(b:Bisac)
WITH p,r,w,s,a,l,t,u,b
OPTIONAL MATCH (d:Description)-[v:HAS_DESCRIPTION]-(w)
RETURN w, p, a, l, b, d, r, s, t, u, v;
+--------------------------------------------+
| No data returned, and nothing was changed. |
+--------------------------------------------+
73 ms
Compiler CYPHER 2.2
Planner COST
OptionalExpand(All)
|
+Filter(0)
|
+Expand(All)(0)
|
+Filter(1)
|
+Expand(All)(1)
|
+Filter(2)
|
+Expand(All)(2)
|
+Filter(3)
|
+Expand(All)(3)
|
+NodeUniqueIndexSeek
+---------------------+---------------+---------------------------------+-----------------------------+
| Operator | EstimatedRows | Identifiers | Other |
+---------------------+---------------+---------------------------------+-----------------------------+
| OptionalExpand(All) | 5 | a, b, d, l, p, r, s, t, u, v, w | (w)-[v:HAS_DESCRIPTION]-(d) |
| Filter(0) | 5 | a, b, l, p, r, s, t, u, w | b:Bisac |
| Expand(All)(0) | 5 | a, b, l, p, r, s, t, u, w | (w)-[u:INCLUDED]->(b) |
| Filter(1) | 4 | a, l, p, r, s, t, w | l:Language |
| Expand(All)(1) | 4 | a, l, p, r, s, t, w | (w)<-[t:USED]-(l) |
| Filter(2) | 4 | a, p, r, s, w | p:Publisher |
| Expand(All)(2) | 4 | a, p, r, s, w | (w)<-[r:PUBLISHED]-(p) |
| Filter(3) | 4 | a, s, w | w:Woka |
| Expand(All)(3) | 4 | a, s, w | (a)-[s:AUTHORED]->(w) |
| NodeUniqueIndexSeek | 1 | a | :Author(author_name) |
+---------------------+---------------+---------------------------------+-----------------------------+
Total database accesses: ?
neo4j-sh (?)$
Here is a snapshot from top (before having the browser frozen):
top - 14:59:36 up 46 days, 17:03, 2 users, load average: 2.66, 4.58, 3.75
Tasks: 116 total, 2 running, 114 sleeping, 0 stopped, 0 zombie
%Cpu(s): 97.5 us, 0.8 sy, 0.0 ni, 1.5 id, 0.0 wa, 0.0 hi, 0.0 si, 0.2 st
KiB Mem: 15666128 total, 3858028 used, 11808100 free, 169612 buffers
KiB Swap: 0 total, 0 used, 0 free. 2144784 cached Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
10260 neo4j 20 0 14.348g 1.388g 195316 S 196.9 9.3 1:57.55 java
9879 ubuntu 20 0 23680 1656 1116 R 0.3 0.0 0:00.88 top
1 root 20 0 33508 2236 860 S 0.0 0.0 0:12.25 init
2 root 20 0 0 0 0 S 0.0 0.0 0:00.01 kthreadd
3 root 20 0 0 0 0 S 0.0 0.0 0:00.55 ksoftirqd/0
4 root 20 0 0 0 0 S 0.0 0.0 0:30.10 kworker/0:0
5 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kworker/0:0H
7 root 20 0 0 0 0 S 0.0 0.0 0:39.08 rcu_sched
8 root 20 0 0 0 0 R 0.0 0.0 0:47.50 rcuos/0
9 root 20 0 0 0 0 S 0.0 0.0 1:00.72 rcuos/1
What is the spec of your computer?
How much data does your query return?
MATCH (p:Publisher)-[r:PUBLISHED]->(w:Woka)<-[s:AUTHORED]-(a:Author)
WHERE (a.author_name = 'Camus, Albert')
MATCH (l:Language)-[t:USED]->(w)-[u:INCLUDED]->(b:Bisac)
WITH p,r,w,s,a,l,t,u,b
OPTIONAL MATCH (d:Description)-[v:HAS_DESCRIPTION]-(w)
RETURN w, p, a, l, b, d, r, s, t, u, v;
Also, what does your visual query plan look like? Please prefix your query with PROFILE, save the plan as a PNG and share it.
The browser interface of this product Neo4j needs a major overhaul. There is no way to use this interface for serious design, modelling and development.
I ran the following stress tests from the Ruby on Rails console. There were no errors about disconnects, the network, etc. They all ran successfully, whereas any of these queries freezes the browser after 5, 6 or 7 executions, even if the result set is limited to 25 records. What is more, I ran all of them while the browser interface was still frozen and showing the network disconnect error.
(1..1000).each do |n|
q = "MATCH (p:Publisher)-[r:PUBLISHED]->(w:Woka)<-[s:AUTHORED]-(a:Author)
WHERE (a.author_name = 'Freud, Sigmund')
MATCH (l:Language)-[t:USED]->(w)-[u:INCLUDED]->(b:Bisac)
WITH p,r,w,s,a,l,t,u,b
OPTIONAL MATCH (d:Description)-[v:HAS_DESCRIPTION]-(w)
RETURN w, p, a, l, b, d, r, s, t, u, v;"
r = Neo4j::Session.current.query(q)
print n, "\t", r.count, "\t", Time.now, "\n"
end
(1..1000).each do |n|
q = "MATCH (p:Publisher)-[r:PUBLISHED]->(w:Woka)<-[s:AUTHORED]-(a:Author)
WHERE (a.author_name = 'Einstein, Albert')
MATCH (l:Language)-[t:USED]->(w)-[u:INCLUDED]->(b:Bisac)
WITH p,r,w,s,a,l,t,u,b
OPTIONAL MATCH (d:Description)-[v:HAS_DESCRIPTION]-(w)
RETURN w, p, a, l, b, d, r, s, t, u, v;"
r = Neo4j::Session.current.query(q)
print n, "\t", r.count, "\t", Time.now, "\n"
end
(1..1000).each do |n|
q = "MATCH (p:Publisher)-[r:PUBLISHED]->(w:Woka)<-[s:AUTHORED]-(a:Author)
WHERE (a.author_name = 'Freud, Sigmund')
MATCH (l:Language)-[t:USED]->(w)-[u:INCLUDED]->(b:Bisac)
WITH p,r,w,s,a,l,t,u,b
OPTIONAL MATCH (d:Description)-[v:HAS_DESCRIPTION]-(w)
RETURN w, p, a, l, b, d, r, s, t, u, v;"
r = Neo4j::Session.current.query(q)
print n, "\t", r.count, "\t", Time.now, "\n"
end

Follow sets Top-Down parsing

I have a question about the Follow sets of the following rules:
L -> CL'
L' -> epsilon
| ; L
C -> id:=G
|if GC
|begin L end
I have computed that Follow(L) is a subset of Follow(L'), and that Follow(L') is a subset of Follow(L), so both contain {end, $}. However, since L' is nullable, will Follow(L) also contain Follow(C)?
I have also computed that Follow(C) = First(L') and that Follow(C) is a subset of Follow(L), which gives { ; $ end}.
In the answer, Follow(L) and Follow(L') contain only {end, $}, but shouldn't they also contain ; from Follow(C), since L' can be null?
Thanks
However, as L' is Nullable will the Follow(L) contain also the Follow(C)?
The opposite. Follow(C) will contain Follow(L). Think of the following sentence:
...Lx...
where x is some terminal and thus is in Follow(L). This could be expanded to:
...CL'x...
and further to:
...Cx...
So what follows L, can also follow C. The opposite is not necessarily true.
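The direction of the inclusion can be checked mechanically with the usual fixed-point computation of FIRST and FOLLOW. The sketch below is in Python and makes one assumption that is not in the question: G has no productions listed, so it is treated as an opaque terminal-like token. "$" marks end of input.

```python
EPS = "epsilon"
GRAMMAR = {
    "L":  [["C", "L'"]],
    "L'": [[EPS], [";", "L"]],
    "C":  [["id", ":=", "G"], ["if", "G", "C"], ["begin", "L", "end"]],
}
START = "L"

def first_of(symbols, first):
    """FIRST of a symbol sequence; contains EPS only if every symbol is nullable."""
    out = set()
    for sym in symbols:
        sym_first = first[sym] if sym in GRAMMAR else {sym}
        out |= sym_first - {EPS}
        if EPS not in sym_first:
            return out
    out.add(EPS)
    return out

def compute_first():
    first = {nt: set() for nt in GRAMMAR}
    changed = True
    while changed:
        changed = False
        for nt, rules in GRAMMAR.items():
            for rhs in rules:
                new = first_of(rhs, first)
                if not new <= first[nt]:
                    first[nt] |= new
                    changed = True
    return first

def compute_follow(first):
    follow = {nt: set() for nt in GRAMMAR}
    follow[START].add("$")
    changed = True
    while changed:
        changed = False
        for lhs, rules in GRAMMAR.items():
            for rhs in rules:
                for i, sym in enumerate(rhs):
                    if sym not in GRAMMAR:
                        continue
                    rest = first_of(rhs[i + 1:], first)
                    new = rest - {EPS}
                    if EPS in rest:          # everything after sym can vanish,
                        new |= follow[lhs]   # so Follow(lhs) flows into Follow(sym)
                    if not new <= follow[sym]:
                        follow[sym] |= new
                        changed = True
    return follow

follow = compute_follow(compute_first())
for nt in GRAMMAR:
    print(nt, sorted(follow[nt]))
```

In the rule L -> C L', the nullable L' makes Follow(L) flow into Follow(C), not the other way round, so ; ends up only in Follow(C).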
To calculate follows, think of a graph whose nodes are pairs (NT, n), meaning the non-terminal NT together with the length n, in tokens, of its follow strings (in LL(1), n is either 1 or 0). The graph for your grammar would look like this:
_______
|/_ \
(L, 1)----->(L', 1) _(C, 1)
| \__________|____________/| |
| | |
| | |
| _______ | |
V |/_ \ V V
(L, 0)----->(L', 0) _(C, 0)
\_______________________/|
Here (X, n)--->(Y, m) means that the follows of length n of X depend on the follows of length m of Y (of course, m <= n). That is, to calculate (X, n) you should first calculate (Y, m), and then look at every rule that has X on the right-hand side and Y on the left-hand side, e.g.:
Y -> ... X REST
Take what REST completely expands to with length n - m for every m in (0, n], and concatenate every result with every follow from the (Y, m) set. You can calculate what REST expands to while calculating the firsts of REST, simply by keeping a flag that says whether REST expands completely to that first or only partially. Furthermore, add the firsts of REST with length n (complete or not) as follows of X too. For example:
S -> A a b c
A -> B C d
C -> epsilon | e | f g h i
Then to find follows of B with length 3 (which are e d a, d a b and f g h), we look at the rule:
A -> B C d
and we take the sentence C d, and look at what it can produce:
"C d" with length 0 (complete):
"C d" with length 1 (complete):
d
"C d" with length 2 (complete):
e d
"C d" with length 3 (complete or not):
f g h
Now we take these and merge with follow(A, m):
follow(A, 0):
epsilon
follow(A, 1):
a
follow(A, 2):
a b
follow(A, 3):
a b c
"C d" with length 0 (complete) concat follow(A, 3):
"C d" with length 1 (complete) concat follow(A, 2):
d a b
"C d" with length 2 (complete) concat follow(A, 1):
e d a
"C d" with length 3 (complete or not) concat follow(A, 0) (Note: follow(X, 0) is always epsilon):
f g h
Which is the set we were looking for. So in short, the algorithm becomes:
Create the graph of follow dependencies
Find the strongly connected components and create a DAG out of them.
Traverse the DAG from the end (from the nodes that don't have any dependency) and calculate the follows with the algorithm above, having calculated firsts beforehand.
It's worth noting that the above algorithm is for any LL(K). For LL(1), the situation is much simpler.
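For the small example grammar above, the length-3 follows of B can be reproduced with a brute-force Python sketch. It is not the graph-based algorithm itself: it simply expands everything that can come after B and truncates to k tokens, which only works because this grammar is non-recursive. The function names are made up for the illustration.

```python
GRAMMAR = {
    "S": [["A", "a", "b", "c"]],
    "A": [["B", "C", "d"]],
    "C": [[], ["e"], ["f", "g", "h", "i"]],
}

def expansions(symbols):
    """All terminal strings derivable from a symbol sequence (non-recursive grammars only)."""
    results = [()]
    for sym in symbols:
        if sym in GRAMMAR:
            options = [s for rhs in GRAMMAR[sym] for s in expansions(rhs)]
        else:
            options = [(sym,)]
        results = [r + o for r in results for o in options]
    return results

def follow_k_of_B(k):
    # B occurs in A -> B C d, so B is followed by "C d" and then by whatever follows A.
    # A occurs in S -> A a b c, and nothing follows the start symbol S.
    after_A = expansions(["a", "b", "c"])
    after_B = [rest + tail for rest in expansions(["C", "d"]) for tail in after_A]
    return {s[:k] for s in after_B}

follows = follow_k_of_B(3)
for tokens in sorted(follows):
    print(" ".join(tokens))
```

Truncating to k = 1 gives the ordinary LL(1) follow set of B for this grammar.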
