Cypher syntax clarification. Multiple MATCH clauses vs using a comma [duplicate] - neo4j

Are these two Cypher statements identical?
//first
match (a)-[r]->(b),b-[r2]->c
//second
match (a)-[r]->(b)
match b-[r2]->c

The two Cypher statements are NOT identical. We can show this with the PROFILE command, which runs a query and shows the execution plan the Cypher engine used.
In the following examples, the queries all end with RETURN a, c, since you cannot have a bare MATCH clause.
The first query's plan has a NOT(r == r2) filter that the second query's plan does not. This is because Cypher ensures that, within a single MATCH clause, the same relationship is never bound more than once, so r and r2 must be different relationships. Across separate MATCH clauses there is no such restriction.
First query
profile match (a)-[r]->(b),b-[r2]->c return a,c;
==> +-----------------------------------------------+
==> | a | c |
==> +-----------------------------------------------+
==> | Node[1]{name:"World"} | Node[0]{name:"World"} |
==> +-----------------------------------------------+
==> 1 row
==> 2 ms
==>
==> Compiler CYPHER 2.3
==>
==> Planner COST
==>
==> Runtime INTERPRETED
==>
==> Projection
==> |
==> +Filter
==> |
==> +Expand(All)(0)
==> |
==> +Expand(All)(1)
==> |
==> +AllNodesScan
==>
==> +----------------+---------------+------+--------+----------------+----------------+
==> | Operator | EstimatedRows | Rows | DbHits | Identifiers | Other |
==> +----------------+---------------+------+--------+----------------+----------------+
==> | Projection | 1 | 1 | 0 | a, b, c, r, r2 | a; c |
==> | Filter | 1 | 1 | 0 | a, b, c, r, r2 | NOT(r == r2) |
==> | Expand(All)(0) | 1 | 2 | 4 | a, b, c, r, r2 | (b)-[r2:]->(c) |
==> | Expand(All)(1) | 2 | 2 | 8 | a, b, r | (b)<-[r:]-(a) |
==> | AllNodesScan | 6 | 6 | 7 | b | |
==> +----------------+---------------+------+--------+----------------+----------------+
==>
Second query
profile match (a)-[r]->(b) match b-[r2]->c return a,c;
==> +-----------------------------------------------+
==> | a | c |
==> +-----------------------------------------------+
==> | Node[1]{name:"World"} | Node[1]{name:"World"} |
==> | Node[1]{name:"World"} | Node[0]{name:"World"} |
==> +-----------------------------------------------+
==> 2 rows
==> 2 ms
==>
==> Compiler CYPHER 2.3
==>
==> Planner COST
==>
==> Runtime INTERPRETED
==>
==> Projection
==> |
==> +Expand(All)(0)
==> |
==> +Expand(All)(1)
==> |
==> +AllNodesScan
==>
==> +----------------+---------------+------+--------+----------------+----------------+
==> | Operator | EstimatedRows | Rows | DbHits | Identifiers | Other |
==> +----------------+---------------+------+--------+----------------+----------------+
==> | Projection | 1 | 2 | 0 | a, b, c, r, r2 | a; c |
==> | Expand(All)(0) | 1 | 2 | 4 | a, b, c, r, r2 | (b)-[r2:]->(c) |
==> | Expand(All)(1) | 2 | 2 | 8 | a, b, r | (b)<-[r:]-(a) |
==> | AllNodesScan | 6 | 6 | 7 | b | |
==> +----------------+---------------+------+--------+----------------+----------------+
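To make the uniqueness rule concrete, here is a small Python sketch (not Cypher) that simulates both queries. The toy graph is an assumption reconstructed from the two result sets above: node 1 has a relationship to itself and a relationship to node 0. The names RELS and two_hop are made up for illustration.

```python
# Hypothetical toy graph: relationship id -> (start node id, end node id).
# Assumed from the results above: a self-loop on node 1 and a relationship 1 -> 0.
RELS = {0: (1, 1), 1: (1, 0)}


def two_hop(rels, enforce_uniqueness):
    """Enumerate (a, c) for the pattern (a)-[r]->(b), (b)-[r2]->(c)."""
    rows = []
    for r, (a, b) in rels.items():
        for r2, (b2, c) in rels.items():
            if b2 != b:
                continue  # r2 must start where r ends
            if enforce_uniqueness and r == r2:
                continue  # single MATCH: r and r2 must differ
            rows.append((a, c))
    return rows


# One MATCH with a comma: relationship uniqueness applies.
single_match = two_hop(RELS, enforce_uniqueness=True)
# Two MATCH clauses: r2 may reuse the relationship bound to r.
separate_matches = two_hop(RELS, enforce_uniqueness=False)

print(single_match)      # one row
print(separate_matches)  # two rows
```

The extra row in the second case comes from binding the self-loop to both r and r2, which is exactly what the NOT(r == r2) filter removes.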

Related

Read non-delimited ASCII file in Apache Pig Latin

I'm trying to read a text file in Apache Pig Latin in which each row is non-delimited ASCII. That is, each column begins and ends at a fixed position in the row.
Sample definition:
+--------+----------------+--------------+
| Column | Start Position | End Position |
+--------+----------------+--------------+
| A | 1 | 6 |
+--------+----------------+--------------+
| B | 8 | 11 |
+--------+----------------+--------------+
| C | 13 | 15 |
+--------+----------------+--------------+
Sample Data:
+---+---+---+---+---+---+---+----+---+----+----+----+----+----+----+
| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
+---+---+---+---+---+---+---+----+---+----+----+----+----+----+----+
| s | a | m | p | l | e | | d | a | t | a | | | h | i |
+---+---+---+---+---+---+---+----+---+----+----+----+----+----+----+
| d | u | d | e | | | | h | i | | | | b | r | o |
+---+---+---+---+---+---+---+----+---+----+----+----+----+----+----+
Expected Output:
sample, data, hi
dude, hi, bro
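How do I read this in Pig? PigStorage doesn't seem flexible enough to allow positional delimiting, only string delimiting (comma, tab, etc.).
Piggybank provides a loader for this specific use case, FixedWidthLoader, which takes the column ranges as its first argument (the piggybank jar has to be REGISTERed first). Note that 'SKIP_HEADER' tells the loader to skip a header row, so drop that argument if the file has none:
rows = LOAD 'data.txt' USING org.apache.pig.piggybank.storage.FixedWidthLoader('1-6, 8-11, 13-15', 'SKIP_HEADER') AS (a, b, c);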
https://pig.apache.org/docs/r0.16.0/api/
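For anyone who wants to check the column ranges before running the Pig job, here is a rough Python sketch of the same positional slicing. It is not how FixedWidthLoader is implemented, just an illustration of 1-based, inclusive ranges applied to the sample rows from the question. The names COLUMNS and parse_fixed_width are made up.

```python
# Column ranges from the question: 1-based, inclusive start and end positions.
COLUMNS = [(1, 6), (8, 11), (13, 15)]


def parse_fixed_width(line, columns=COLUMNS):
    """Slice one row by position and trim the padding around each field."""
    # A 1-based inclusive range (start, end) maps to the Python slice [start-1:end].
    return [line[start - 1:end].strip() for start, end in columns]


# The two sample rows, rebuilt with explicit padding so each is 15 characters wide.
sample_rows = [
    "sample".ljust(7) + "data".ljust(5) + " hi",
    "dude".ljust(7) + "hi".ljust(5) + "bro",
]

for row in sample_rows:
    print(", ".join(parse_fixed_width(row)))
```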

Neo4j feature or bug with search query?

I recently upgraded Neo4j from 2.1.7 to 2.2.5. I found that the query
Match (c:C) where id(c) = 111 with c Match (p:I{id: c.id}) return count(p)
worked fine in 2.1.7, but performs very poorly in 2.2.5 (100 times slower). I have all the indexes that are needed.
I modified the query to
Match (c:C) where id(c) = 111 with c.id as c_id Match (p:I{id: c_id}) return count(p)
and after this it works fine in 2.2.5.
These two queries have different profiles, but I'm not very experienced with profiling.
UPDATED
One more strange thing is that if I use EXPLAIN instead of PROFILE, it returns quickly.
neo4j-sh (?)$ PROFILE Match (c:C) where id(c) = 10563822 with c Match (i:I{id: c.id}) return count(i);
==> +----------+
==> | count(i) |
==> +----------+
==> | 4551 |
==> +----------+
==> 1 row
==> 18257 ms
==>
==> Compiler CYPHER 2.2
==>
==> Planner COST
==>
==> EagerAggregation
==> |
==> +Filter(0)
==> |
==> +CartesianProduct
==> |
==> +Filter(1)
==> | |
==> | +NodeByIdSeek
==> |
==> +NodeByLabelScan
==>
==> +------------------+---------------+---------+---------+-------------+-------------------------+
==> | Operator | EstimatedRows | Rows | DbHits | Identifiers | Other |
==> +------------------+---------------+---------+---------+-------------+-------------------------+
==> | EagerAggregation | 26 | 1 | 0 | count(i) | |
==> | Filter(0) | 652 | 4551 | 2522988 | c, i | i.id == c.id |
==> | CartesianProduct | 6521 | 1261494 | 0 | c, i | |
==> | Filter(1) | 0 | 1 | 1 | c | c:C |
==> | NodeByIdSeek | 1 | 1 | 1 | c | |
==> | NodeByLabelScan | 1261494 | 1261494 | 1261495 | i | :I |
==> +------------------+---------------+---------+---------+-------------+-------------------------+
==>
==> Total database accesses: 3784485
neo4j-sh (?)$ PROFILE Match (c:C) where id(c) = 10563822 with c.id as c_id Match (i:I{id: c_id}) return count(i);
==> +----------+
==> | count(i) |
==> +----------+
==> | 4551 |
==> +----------+
==> 1 row
==> 64 ms
==>
==> Compiler CYPHER 2.2
==>
==> Planner COST
==>
==> EagerAggregation
==> |
==> +Apply
==> |
==> +Projection
==> | |
==> | +Filter
==> | |
==> | +NodeByIdSeek
==> |
==> +NodeIndexSeek
==>
==> +------------------+---------------+------+--------+-------------+---------------------+
==> | Operator | EstimatedRows | Rows | DbHits | Identifiers | Other |
==> +------------------+---------------+------+--------+-------------+---------------------+
==> | EagerAggregation | 1 | 1 | 0 | count(i) | |
==> | Apply | 1 | 4551 | 0 | c, c_id, i | |
==> | Projection | 0 | 1 | 1 | c, c_id | c.id |
==> | Filter | 0 | 1 | 1 | c | c:C |
==> | NodeByIdSeek | 1 | 1 | 1 | c | |
==> | NodeIndexSeek | 1 | 4551 | 4552 | i | :I(id) |
==> +------------------+---------------+------+--------+-------------+---------------------+
==>
==> Total database accesses: 4555
I don't know enough about Neo4j internals to say why your query is slower in more recent versions (the CartesianProduct step looks like a red flag), but here is a logically equivalent query that seems like it should be much faster:
START c = node(111)
MATCH (p:I { id: c.id })
RETURN count(p)
Here is the profile:
+------------------+------+--------+----------------------------------------------------------+-----------------------+
| Operator | Rows | DbHits | Identifiers | Other |
+------------------+------+--------+----------------------------------------------------------+-----------------------+
| ColumnFilter | 1 | 0 | count(p) | keep columns count(p) |
| EagerAggregation | 1 | 0 | INTERNAL_AGGREGATE51b25e53-027d-439b-9046-c1a2a6b0fe70 | |
| Filter | 0 | 0 | c, p | p.id == c.id |
| NodeById | 0 | 0 | c, p | Literal(List(111)) |
| NodeByLabel | 0 | 1 | p | :I |
+------------------+------+--------+----------------------------------------------------------+-----------------------+
NOTE: This should be considered a temporary workaround, as START has been deprecated, and I do not know how long this kind of usage will continue to be supported.
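As a rough illustration of why the two profiles in the question differ so much in DbHits, here is a Python sketch (not Neo4j code) that contrasts scanning every candidate node and filtering with looking the value up in an index. The data set, sizes and function names are all made up. Only the shape of the comparison mirrors the NodeByLabelScan + Filter plan versus the NodeIndexSeek plan.

```python
from collections import defaultdict

# Hypothetical stand-in for the :I nodes: (node id, value of the "id" property).
i_nodes = [(n, n % 100) for n in range(10_000)]


def count_by_scan(nodes, wanted):
    """Scan + filter: every node is inspected, like NodeByLabelScan then Filter."""
    inspected = 0
    hits = 0
    for _, prop in nodes:
        inspected += 1
        if prop == wanted:
            hits += 1
    return hits, inspected


def build_index(nodes):
    """Property value -> node ids, a toy version of an index on :I(id)."""
    index = defaultdict(list)
    for node_id, prop in nodes:
        index[prop].append(node_id)
    return index


def count_by_index(index, wanted):
    """Index seek: only the matching nodes are touched."""
    matches = index.get(wanted, [])
    return len(matches), len(matches)


scan_hits, scan_inspected = count_by_scan(i_nodes, 42)
seek_hits, seek_inspected = count_by_index(build_index(i_nodes), 42)
print(scan_hits, scan_inspected)
print(seek_hits, seek_inspected)
```

Both approaches return the same count, but the scan inspects every node while the seek touches only the matches, which is the same pattern as 1261494 rows flowing through the CartesianProduct versus 4551 rows from the index seek.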

Cypher query for relationships between the same set of nodes misses some relationships

data set:
neo4j-sh (?)$ START n = node(*) MATCH n-[r]-m RETURN n,r,m;
==> +---------------------------------------------+
==> | n | r | m |
==> +---------------------------------------------+
==> | Node[1]{} | (2)-[1:KNOWS]->(1) | Node[2]{} |
==> | Node[1]{} | (3)-[2:KNOWS]->(1) | Node[3]{} |
==> | Node[2]{} | (2)-[1:KNOWS]->(1) | Node[1]{} |
==> | Node[2]{} | (3)-[0:KNOWS]->(2) | Node[3]{} |
==> | Node[3]{} | (3)-[0:KNOWS]->(2) | Node[2]{} |
==> | Node[3]{} | (3)-[2:KNOWS]->(1) | Node[1]{} |
==> +---------------------------------------------+
==> 6 rows
==>
==> 0 ms
cypher query:
neo4j-sh (0)$ start x=node(1,2,3),y=node(1,2,3) match x-[r]-y return id(x),id(y) order by id(x) desc;
==> +---------------+
==> | id(x) | id(y) |
==> +---------------+
==> | 1 | 2 |
==> | 1 | 3 |
==> | 2 | 1 |
==> | 3 | 1 |
==> +---------------+
==> 4 rows
In fact, nodes 2 and 3 are linked, so why are the rows for that pair missing?
How can I get them returned?
Thanks.
URL: http://console.neo4j.org/?id=qwdh4p

Query unique pairs of nodes when pair order is not important in Cypher

I am trying to compare users according to their common interests in this graph.
I know why the following query produces duplicate pairs but can't think of a good way in cypher to avoid it. Is there any way to do it without looping in cypher?
neo4j-sh (?)$ start n=node(*) match p=n-[:LIKES]->item<-[:LIKES]-other where n <> other return n.name,other.name,collect(item.name) as common, count(*) as freq order by freq desc;
==> +-----------------------------------------------+
==> | n.name | other.name | common | freq |
==> +-----------------------------------------------+
==> | "u1" | "u2" | ["f1","f2","f3"] | 3 |
==> | "u2" | "u1" | ["f1","f2","f3"] | 3 |
==> | "u1" | "u3" | ["f1","f2"] | 2 |
==> | "u3" | "u2" | ["f1","f2"] | 2 |
==> | "u2" | "u3" | ["f1","f2"] | 2 |
==> | "u3" | "u1" | ["f1","f2"] | 2 |
==> | "u4" | "u3" | ["f1"] | 1 |
==> | "u4" | "u2" | ["f1"] | 1 |
==> | "u4" | "u1" | ["f1"] | 1 |
==> | "u2" | "u4" | ["f1"] | 1 |
==> | "u1" | "u4" | ["f1"] | 1 |
==> | "u3" | "u4" | ["f1"] | 1 |
==> +-----------------------------------------------+
In order to avoid having duplicates in the form of a--b and b--a, you can exclude one of the combinations in your WHERE clause with
WHERE ID(a) < ID(b)
making your above query
start n=node(*) match p=n-[:LIKES]->item<-[:LIKES]-other where ID(n) < ID(other) return n.name,other.name,collect(item.name) as common, count(*) as freq order by freq desc;
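The effect of the ordering condition can be sketched in Python (not Cypher). The likes data below is an assumption reconstructed from the output in the question, and user names stand in for node ids when comparing. The function name common_interests is made up.

```python
# Hypothetical data reconstructed from the question's output: user -> liked items.
likes = {
    "u1": {"f1", "f2", "f3"},
    "u2": {"f1", "f2", "f3"},
    "u3": {"f1", "f2"},
    "u4": {"f1"},
}


def common_interests(likes, unique_pairs):
    """Return (n, other, common items, freq) rows sorted by freq descending."""
    rows = []
    for n in likes:
        for other in likes:
            if n == other:
                continue
            # Keeping only n < other retains exactly one of (a, b) and (b, a),
            # which is what WHERE ID(n) < ID(other) does in the query.
            if unique_pairs and not n < other:
                continue
            common = sorted(likes[n] & likes[other])
            if common:
                rows.append((n, other, common, len(common)))
    rows.sort(key=lambda row: -row[3])
    return rows


all_rows = common_interests(likes, unique_pairs=False)
unique_rows = common_interests(likes, unique_pairs=True)
print(len(all_rows), len(unique_rows))
for row in unique_rows:
    print(row)
```

Any strict ordering on a unique key works for this. The node id is simply the one that is always available.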
OK, I see that you use node(*) as the start point, which means looping through the whole graph and using each node in turn as a start point. So the rows are different, not duplicates as you say:
+-----------------------------------------------+
| n.name | other.name | common | freq |
+-----------------------------------------------+
| "u2" | "u1" | ["f1","f2","f3"] | 3 |
not equal to:
+-----------------------------------------------+
| n.name | other.name | common | freq |
+-----------------------------------------------+
| "u1" | "u2" | ["f1","f2","f3"] | 3 |
So if you use an index to set a single start point, there won't be any duplicates.
start n=node:someIndex(name='C') match p=n-[:LIKES]->item<-[:LIKES]-other where n <> other return n.name,other.name,collect(item.name) as common, count(*) as freq order by freq desc;

calculating total path cost in cypher, taking relation directionality into account

Using a cypher query on neo4j, in a directed, cyclic graph I need a BFS query and a target node sorting per depth level.
For the within-depth sorting, a custom "total path cost function" should be used, calculated based on
- all relation attributes r.followrank between start and end node
- relation directionality (followrank if it points towards the end node, or 0 if not)
At any search depth level n, a node connected to a high ranked node at level n-m, m>0 should be ranked higher than a node connected to a low ranked node at level n-m. Reverse directionality should result in a 0 rank (which means, the node and its subtree are still part of the ranking).
I'm using neo4j community-1.9.M01. The approach I've taken so far was to extract an array of followranks for the shortest path to each end node.
I thought I had come up with a great first idea for this query, but it seems to break down at multiple points.
My query is:
START strt=node(7)
MATCH p=strt-[*1..]-tgt
WHERE not(tgt=strt)
RETURN ID(tgt), extract(r in rels(p): r.followrank*length(strt-[*0..]-()-[r]->() )) as rank, extract(n in nodes(p): ID(n));
which outputs
==> +-----------------------------------------------------------------+
==> | ID(tgt) | rank | extract(n in nodes(p): ID(n)) |
==> +-----------------------------------------------------------------+
==> | 14 | [1.0] | [7,14] |
==> | 15 | [1.0,1.0] | [7,14,15] |
==> | 11 | [1.0,1.0,1.0] | [7,14,15,11] |
==> | 8 | [1.0,1.0,1.0,1.0,0.0] | [7,14,15,11,7,8] |
==> | 9 | [1.0,1.0,1.0,1.0,0.0] | [7,14,15,11,7,9] |
==> | 10 | [1.0,1.0,1.0,1.0,0.0] | [7,14,15,11,7,10] |
==> | 12 | [1.0,1.0,1.0,0.0] | [7,14,15,11,12] |
==> | 8 | [0.0] | [7,8] |
==> | 9 | [0.0] | [7,9] |
==> | 10 | [0.0] | [7,10] |
==> | 11 | [1.0] | [7,11] |
==> | 15 | [1.0,1.0] | [7,11,15] |
==> | 14 | [1.0,1.0,1.0] | [7,11,15,14] |
==> | 8 | [1.0,1.0,1.0,1.0,0.0] | [7,11,15,14,7,8] |
==> | 9 | [1.0,1.0,1.0,1.0,0.0] | [7,11,15,14,7,9] |
==> | 10 | [1.0,1.0,1.0,1.0,0.0] | [7,11,15,14,7,10] |
==> | 12 | [1.0,0.0] | [7,11,12] |
==> +-----------------------------------------------------------------+
==> 17 rows
==> 38 ms
It looks similar to what I need, but the issues are:
- Nodes 8, 9, 10 and 11 all have the same relation direction to 7, yet 11 is ranked differently from 8, 9 and 10! The inverse query result ...*length(strt-[*0..]-()<-[rr]-() )... looks even stranger; see the queries right below.
- I don't know how to normalize the results of the length() expression to 1.
Directionality:
START strt=node(7)
MATCH strt<-[r]-m
RETURN ID(m), r.followrank;
==> +----------------------+
==> | ID(m) | r.followrank |
==> +----------------------+
==> | 8 | 1 |
==> | 9 | 1 |
==> | 10 | 1 |
==> | 11 | 1 |
==> +----------------------+
==> 4 rows
==> 0 ms
START strt=node(7)
MATCH strt-[r]->m
RETURN ID(m), r.followrank;
==> +----------------------+
==> | ID(m) | r.followrank |
==> +----------------------+
==> | 14 | 1 |
==> +----------------------+
==> 1 row
==> 0 ms
Inverse query:
START strt=node(7)
MATCH p=strt-[*1..]-tgt
WHERE not(tgt=strt)
RETURN ID(tgt), extract(rr in rels(p): rr.followrank*length(strt-[*0..]-()<-[rr]-() )) as rank, extract(n in nodes(p): ID(n));
==> +-----------------------------------------------------------------+
==> | ID(tgt) | rank | extract(n in nodes(p): ID(n)) |
==> +-----------------------------------------------------------------+
==> | 14 | [1.0] | [7,14] |
==> | 15 | [1.0,1.0] | [7,14,15] |
==> | 11 | [1.0,1.0,1.0] | [7,14,15,11] |
==> | 8 | [1.0,1.0,1.0,1.0,3.0] | [7,14,15,11,7,8] |
==> | 9 | [1.0,1.0,1.0,1.0,3.0] | [7,14,15,11,7,9] |
==> | 10 | [1.0,1.0,1.0,1.0,3.0] | [7,14,15,11,7,10] |
==> | 12 | [1.0,1.0,1.0,2.0] | [7,14,15,11,12] |
==> | 8 | [3.0] | [7,8] |
==> | 9 | [3.0] | [7,9] |
==> | 10 | [3.0] | [7,10] |
==> | 11 | [1.0] | [7,11] |
==> | 15 | [1.0,1.0] | [7,11,15] |
==> | 14 | [1.0,1.0,1.0] | [7,11,15,14] |
==> | 8 | [1.0,1.0,1.0,1.0,3.0] | [7,11,15,14,7,8] |
==> | 9 | [1.0,1.0,1.0,1.0,3.0] | [7,11,15,14,7,9] |
==> | 10 | [1.0,1.0,1.0,1.0,3.0] | [7,11,15,14,7,10] |
==> | 12 | [1.0,2.0] | [7,11,12] |
==> +-----------------------------------------------------------------+
==> 17 rows
==> 30 ms
So my questions are:
- What's going on with this query?
- Is there a working approach?
As an additional detail, I know the min(length(path)) aggregator, but it doesn't work in this case where I'm trying to extract information about the best hit: the additional information I return about the best hit will disaggregate the result again. I think that's a Cypher limitation.
Basically, you want to compute a rank that only considers relationships that go "with the path flow". Unfortunately, to test "with path flow", you need to compare the path indexes of each relationship's start and end nodes, and that can only be done with APOC right now.
// allShortestPaths to get all non-cyclic paths
MATCH path=allShortestPaths((a{id:"1"})-[*]-(b{id:"2"}))
// Find rank-worthy relationships: those whose start node comes before their end node in the path
WITH path, filter(rl in relationships(path) WHERE apoc.coll.indexOf(nodes(path), startNode(rl)) < apoc.coll.indexOf(nodes(path), endNode(rl))) as comply
// Sum followrank over the complying relationships
RETURN path, REDUCE(rk = 0, rl in comply | rk+rl.followrank) as rank
ORDER BY rank DESC
(I can't test the APOC part. apoc.coll.indexOf expects a list, so NODES(path) is passed rather than the path itself.)
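The "with the path flow" idea can also be sketched outside the database. The Python below is a minimal illustration, not the query above: instead of looking up node indexes, it walks the path hop by hop and checks whether each relationship points from the current node to the next one. The function name path_rank and the tuple layout are made up. The example relationships use the directions shown in the question (7 -> 14, and 8 -> 7) with followrank 1.

```python
def path_rank(nodes, rels):
    """Sum followrank over relationships that point along the path direction.

    nodes: node ids in path order.
    rels:  one (start, end, followrank) tuple per hop, in path order.
    """
    rank = 0
    for i, (start, end, followrank) in enumerate(rels):
        # A hop counts only if the relationship goes from nodes[i] to nodes[i + 1].
        with_flow = start == nodes[i] and end == nodes[i + 1]
        if with_flow:
            rank += followrank
    return rank


# 7 -> 14 points away from the start node, so it counts.
forward = path_rank([7, 14], [(7, 14, 1)])
# 8 -> 7 points back at the start node, so the path 7..8 gets rank 0.
backward = path_rank([7, 8], [(8, 7, 1)])
# Starting from 8 instead, both hops go with the flow.
both = path_rank([8, 7, 14], [(8, 7, 1), (7, 14, 1)])
print(forward, backward, both)
```

Walking positionally also copes with a node appearing twice in a path, where an index lookup would only find the first occurrence.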
