A complex match

I have the following Cypher query:
neo4j-sh$ start n=node(1344) match (n)-[t:_HAS_TRANSLATION]-(p) return t,p;
+-----------------------------------------------------------------------------------+
| t | p |
+-----------------------------------------------------------------------------------+
| :_HAS_TRANSLATION[2224]{of:"value"} | Node[1349]{language:"hi-hi",text:"(>0#"} |
| :_HAS_TRANSLATION[2223]{of:"value"} | Node[1348]{language:"es-es",text:"hembra"} |
| :_HAS_TRANSLATION[2222]{of:"value"} | Node[1347]{language:"ru-ru",text:"65=A:89"} |
| :_HAS_TRANSLATION[2221]{of:"value"} | Node[1346]{language:"en-us",text:"female"} |
| :_HAS_TRANSLATION[2220]{of:"value"} | Node[1345]{language:"it-it",text:"femmina"} |
+-----------------------------------------------------------------------------------+
and the following array:
["it-it", "en-us", "fr-fr", "de-de", "ru-ru", "hi-hi"]
How can I change the query to return just one result: the node whose 'language' matches the earliest entry in the array that has a translation?
If the array were
["fr-fr","jp-jp","en-us", "it-it", "de-de", "ru-ru", "hi-hi"]
I'd need to return Node[1346], because [en-us] is the first entry in the language array with a match; there is no node for [fr-fr] or [jp-jp].
Thank you
Paolo

Cypher can express arrays, and indexes into them. So on one level, you could do this:
start n=node(1344)
match (n)-[t:_HAS_TRANSLATION]-(p)
where p.language = ["it-it", "en-us", "fr-fr", "de-de", "ru-ru", "hi-hi"][0] return t,p;
This is really just the same as asking for those nodes p where p.language="it-it" (the first element in your array).
Now, if what you mean is that the language attribute itself can be an array, then just treat it like one:
$ create (p { language: ['it-it', 'en-us']});
+-------------------+
| No data returned. |
+-------------------+
Nodes created: 1
Properties set: 1
$ match (p) where p.language[0] = 'it-it' return p;
+-------------------------------------+
| p |
+-------------------------------------+
| Node[1]{language:["it-it","en-us"]} |
+-------------------------------------+
1 row
38 ms
Note the array brackets on p.language[0].
Finally, if what you're talking about is splitting your language codes into pieces (i.e. "en-us" = ["en", "us"]), then Cypher's string processing functions are a little on the weak side, and I wouldn't try to do this. Instead, I'd pre-process the values before inserting them into the graph in the first place, and query them as separate pieces.
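For the priority-list reading of the question (return the translation whose language appears earliest in the array), one possible sketch is below. It assumes a Cypher version that supports size() and list comprehensions, and the names langs and rank are made up for this example; it has not been run against the asker's data.

```cypher
// Sketch: pick the translation whose language comes first in the priority list.
// `langs` and `rank` are hypothetical names introduced for this example.
MATCH (n)-[t:_HAS_TRANSLATION]-(p)
WHERE id(n) = 1344
WITH t, p, ["fr-fr", "jp-jp", "en-us", "it-it", "de-de", "ru-ru", "hi-hi"] AS langs
// Drop translations whose language is not in the list at all (e.g. "es-es").
WHERE p.language IN langs
// rank = position of this node's language inside the priority list.
WITH t, p, [i IN range(0, size(langs) - 1) WHERE langs[i] = p.language][0] AS rank
ORDER BY rank
LIMIT 1
RETURN t, p;
```

With the second array from the question this should select the en-us node, since neither fr-fr nor jp-jp has a translation and en-us is the next entry in the list.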

Related

Forcing cost planner to start from a specific index seek

My cypher query
EXPLAIN MATCH (b:Block)<-[:INCLUDED_IN]-(tx:Transaction {pstype: 0})
WHERE 1540512000 <= b.time < 1540598400
RETURN count(tx);
produces the following execution plan
--------------------------------------------+
| Operator | Estimated Rows | Identifiers | Other |
+-------------------+----------------+-----------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| +ProduceResults | 12 | count(tx) | |
| | +----------------+-----------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| +EagerAggregation | 12 | count(tx) | |
| | +----------------+-----------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| +Filter | 136 | anon[16], b, tx | AndedPropertyInequalities(Variable(b),Property(Variable(b),PropertyKeyName(time)),GreaterThanOrEqual(Property(Variable(b),PropertyKeyName(time)),Parameter( AUTOINT2,Integer)), LessThan(Property(Variable(b),PropertyKeyName(time)),Parameter( AUTOINT1,Integer))) |
| | +----------------+-----------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| +Expand(All) | 9052 | anon[16], b, tx | (tx)-[anon[16]:INCLUDED_IN]->(b) |
| | +----------------+-----------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| +NodeIndexSeek | 9052 | tx | :Transaction(pstype) |
+-------------------+----------------+-----------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
which executes far too slowly, because the initial NodeIndexSeek on :Transaction(pstype) actually returns tens of millions of nodes instead of the estimated 9052. Using NodeIndexSeekByRange on b:Block(time) would produce around 600 nodes.
I have tried forcing the execution plan to start from b:Block(time), but instead it still keeps using NodeIndexSeek on tx:Transaction(pstype):
EXPLAIN MATCH (b:Block)<-[:INCLUDED_IN]-(tx:Transaction {pstype: 0})
USING INDEX b:Block(time)
WHERE 1540512000 <= b.time < 1540598400
RETURN count(tx);
produces
+-------------------------+----------------+-----------------+--------------------------------------------------------------+
| Operator | Estimated Rows | Identifiers | Other |
+-------------------------+----------------+-----------------+--------------------------------------------------------------+
| +ProduceResults | 12 | count(tx) | |
| | +----------------+-----------------+--------------------------------------------------------------+
| +EagerAggregation | 12 | count(tx) | |
| | +----------------+-----------------+--------------------------------------------------------------+
| +NodeHashJoin | 136 | anon[16], b, tx | b |
| |\ +----------------+-----------------+--------------------------------------------------------------+
| | +NodeIndexSeekByRange | 14703 | b | :Block(time) >= { AUTOINT2} AND :Block(time) < { AUTOINT1} |
| | +----------------+-----------------+--------------------------------------------------------------+
| +Expand(All) | 9052 | anon[16], b, tx | (tx)-[anon[16]:INCLUDED_IN]->(b) |
| | +----------------+-----------------+--------------------------------------------------------------+
| +NodeIndexSeek | 9052 | tx | :Transaction(pstype) |
+-------------------------+----------------+-----------------+--------------------------------------------------------------+
The only way I have gotten it to run fast (multiple orders of magnitude faster) is by using the rule planner:
CYPHER planner=rule MATCH (b:Block)
WHERE 1540512000 <= b.time < 1540598400
WITH b
MATCH (b)<-[:INCLUDED_IN]-(tx:Transaction {pstype: 0})
RETURN count(tx);
Is there a way to make it work when using the cost planner?
Both :Block(time) and :Transaction(pstype) are indexed.

You could try using a join hint on tx along with your index hint, which should ensure you only expand from one direction:
EXPLAIN
MATCH (b:Block)<-[:INCLUDED_IN]-(tx:Transaction {pstype: 0})
USING INDEX b:Block(time)
USING JOIN ON tx
WHERE 1540512000 <= b.time < 1540598400
RETURN count(tx);
Alternately you could restructure your query a bit so the tx node isn't initially part of the pattern, but enforced in the WHERE clause. You'll need to split the MATCH in 2, but I don't think you'll need any planner hints:
EXPLAIN
MATCH (tx:Transaction {pstype: 0})
MATCH (b:Block)<-[:INCLUDED_IN]-(x)
WHERE 1540512000 <= b.time < 1540598400
AND x = tx
RETURN count(tx);
EDIT
Okay, let's try another approach then:
EXPLAIN
MATCH (b:Block)<-[:INCLUDED_IN]-(x)
WHERE 1540512000 <= b.time < 1540598400
AND x.pstype = 0 // AND 'Transaction' in labels(x)
RETURN count(x);
If we leave off the label then it can't use an indexed lookup. If there are other nodes besides :Transaction nodes that have a pstype property, you could try uncommenting the line where we use an alternate way to see if the node has that label (I don't think this will use an index lookup, but not completely sure).
Another alternative (unsure if this will work) is to use pattern comprehension to get a list of results from a pattern (after the initial match is found to b) and summing the sizes of the results:
EXPLAIN
MATCH (b:Block)
WHERE 1540512000 <= b.time < 1540598400
RETURN sum(size([(b)<-[:INCLUDED_IN]-(x:Transaction) WHERE x.pstype = 0 | x])) as count
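To check whether any of these variants actually starts from the :Block(time) index, it may help to swap EXPLAIN for PROFILE, which runs the query and reports actual rows and DB hits per operator next to the estimates. A sketch using the hinted query from above (not run against the asker's data):

```cypher
// PROFILE executes the query for real, so it takes as long as a normal run.
PROFILE
MATCH (b:Block)<-[:INCLUDED_IN]-(tx:Transaction {pstype: 0})
USING INDEX b:Block(time)
USING JOIN ON tx
WHERE 1540512000 <= b.time < 1540598400
RETURN count(tx);
```

Comparing the Rows column with Estimated Rows on the leaf operators shows where the planner's estimate (9052 for :Transaction(pstype)) diverges from what is actually read.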

How to match node labels using OR?

match (p:Product {id:'5116003'})-[r]->(o:Attributes|ExtraAttribute) return p, o
How to match two possible node labels in such a query?
Per cybersam's suggestion, I changed it to the following:
MATCH (p:Product {id:'5116003'})-[r]->(o)
WHERE o:Attributes OR o:ExtraAttributes
**WHERE any(key in keys(o) WHERE toLower(key) contains 'weight')**
return o
Now I need to add the 2nd 'where' clause. How to modify that?

You can try using the any() function:
match (p:Product {id:'5116003'})-[r]->(o)
where any (label in labels(o) where label in ['Attributes', 'ExtraAttribute'])
return p, o
Also, if you have APOC procedures installed, you can use the apoc.path.expand path expander procedure, which expands from a start node, following the given relationships from the min to the max level while adhering to the label filters:
match (p:Product {id:'5116003'})
call apoc.path.expand(p, null,"+Attributes|ExtraAttribute",0,1) yield path
with nodes(path) as nodes
// return p and o nodes
return nodes[0], nodes[1]
See the APOC documentation for apoc.path.expand for more details.

These two single-label forms of your query:
MATCH (p:Product {id:'5116003'})-->(o:Attributes) RETURN p, o;
MATCH (p:Product {id:'5116003'})-->(o) WHERE o:Attributes RETURN p, o;
produce the same execution plan, as follows (I assume that there is an index on :Product(id)):
+-----------------+----------------+------+---------+------------------+--------------+
| Operator | Estimated Rows | Rows | DB Hits | Variables | Other |
+-----------------+----------------+------+---------+------------------+--------------+
| +ProduceResults | 0 | 0 | 0 | o, p | p, o |
| | +----------------+------+---------+------------------+--------------+
| +Filter | 0 | 0 | 0 | anon[33], o, p | o:Attributes |
| | +----------------+------+---------+------------------+--------------+
| +Expand(All) | 0 | 0 | 0 | anon[33], o -- p | (p)-->(o) |
| | +----------------+------+---------+------------------+--------------+
| +NodeIndexSeek | 0 | 0 | 1 | p | :Product(id) |
+-----------------+----------------+------+---------+------------------+--------------+
This two-label form of the second query above:
MATCH (p:Product {id:'5116003'})-->(o) WHERE o:Attributes OR o:ExtraAttribute RETURN p, o;
produces an execution plan that is very similar (and therefore probably not much more expensive):
+-----------------+----------------+------+---------+------------------+-------------------------------------+
| Operator | Estimated Rows | Rows | DB Hits | Variables | Other |
+-----------------+----------------+------+---------+------------------+-------------------------------------+
| +ProduceResults | 0 | 0 | 0 | o, p | p, o |
| | +----------------+------+---------+------------------+-------------------------------------+
| +Filter | 0 | 0 | 0 | anon[33], o, p | Ors(o:Attributes, o:ExtraAttribute) |
| | +----------------+------+---------+------------------+-------------------------------------+
| +Expand(All) | 0 | 0 | 0 | anon[33], o -- p | (p)-->(o) |
| | +----------------+------+---------+------------------+-------------------------------------+
| +NodeIndexSeek | 0 | 0 | 1 | p | :Product(id) |
+-----------------+----------------+------+---------+------------------+-------------------------------------+
By the way, the first query in the answer by @BrunoPeres has a similar execution plan as well, but its Filter operation is very different. It is not clear which would be faster.
[UPDATE]
To answer your updated question: since you cannot have 2 back-to-back WHERE clauses, you can just add more terms to the already existing WHERE clause, like so:
MATCH (p:Product {id:'5116003'})-[r]->(o)
WHERE
(o:Attributes OR o:ExtraAttributes) AND
ANY(key in KEYS(o) WHERE TOLOWER(key) CONTAINS 'weight')
RETURN o;
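As a side note, newer releases accept something close to the syntax from the original question. The sketch below assumes Neo4j 5 or later, where label expressions allow | between node labels inside the pattern; on older versions the WHERE-based form above is the one to use.

```cypher
// Assumes Neo4j 5+ label expressions; on older versions keep the OR in WHERE.
MATCH (p:Product {id: '5116003'})-[r]->(o:Attributes|ExtraAttributes)
WHERE any(key IN keys(o) WHERE toLower(key) CONTAINS 'weight')
RETURN o;
```

This keeps the label test in the pattern and leaves a single WHERE clause for the property-key filter.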

Aggregating over Distinct Nodes

The data model I have is: (Item)-[:HAS_AN]->(ItemType) and both node types have a field called ID. Items can be related to multiple ItemTypes and in some cases, these ItemTypes can have the same IDs. I'm trying to populate a structure {ID:..., SameType:...}, where SameType = 1 if a node has the same item type as some node (with ID = 1234), and 0 otherwise.
First, I get the candidate list of nodes qList and the source node's ItemType:
MATCH (p:Item{ID:1234})-[:HAS_AN]->(i)
WITH i as pItemType, qList
Then, I go through qList, comparing each node's ItemType to pItemType (which is the ItemType of the source node):
UNWIND qList as q
MATCH (q)-[:HAS_AN]->(i)
WITH q.ID as qID, pItemType, i,
CASE i
WHEN pItemType THEN 1
ELSE 0
END as SameType
RETURN DISTINCT i, qID, pItemType, SameType
The problem I have is when some nodes have two ItemTypes with the same ID. This gives results where some of the entries are duplicates:
i | qID | pItemType | SameType
{"ID": 18} | 35258417 | {"ID": 71} | 0
{"ID": 18} | 35258417 | {"ID": 71} | 0
while I'd like to only take one such row, if more than one exists.
Placing DISTINCT where I have in the last part of the query doesn't seem to work. What's the best way to filter out such duplicates?
Update:
Here is a sample data subset: http://console.neo4j.org/r/f74pdq
Here are the queries that I'm running
MATCH (q:Item) WHERE q.ID <> 1234 WITH COLLECT(DISTINCT(q)) as qList
MATCH (p:Item{ID:1234})-[:HAS_AN]->(i:ItemType) WITH i as pItemType, qList
UNWIND qList as q
MATCH (q)-[:HAS_AN]->(i:ItemType) WITH q.ID as qID, pItemType, i,
CASE i
WHEN pItemType THEN 1
ELSE 0
END as SameType
RETURN DISTINCT i, qID, pItemType, SameType
In this example, Item with ID = 2 has two HAS_AN relations with 2 ItemType nodes with the same ID. I would like only one of them to be returned.

I've tried to simplify your query. Take a look:
MATCH (:Item {ID : 1234})-[:HAS_AN]->(target:ItemType)
MATCH (item:Item)-[:HAS_AN]->(itemType:ItemType)
WHERE item.ID <> 1234
WITH
itemType.ID AS i,
item.ID AS qID,
collect({
pItemType : target,
SameType : CASE exists((item)-[:HAS_AN]-(target))
WHEN true THEN 1 ELSE 0 END
})[0] as Item
RETURN i, qID, Item.pItemType AS pItemType, Item.SameType AS SameType
The trick is in the two lines after WITH. At this point I'm grouping by itemType.ID and item.ID (and not by itemType and item). In your original query you are using pItemType to group. This does not work because the two ItemTypes with ID = 34 are different nodes although they have the same ID.
The output from your console:
+-------------------------------------+
| i | qID | pItemType | SameType |
+-------------------------------------+
| 31 | 4 | Node[2]{ID:5} | 0 |
| 5 | 3 | Node[2]{ID:5} | 1 |
| 31 | 5 | Node[2]{ID:5} | 0 |
| 45 | 5 | Node[2]{ID:5} | 0 |
| 5 | 1 | Node[2]{ID:5} | 1 |
| 34 | 2 | Node[2]{ID:5} | 0 |
+-------------------------------------+
6 rows
33 ms
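The same idea (de-duplicate on ID values rather than on nodes) can also be sketched with a plain RETURN DISTINCT over scalar columns. This is only an illustration under two assumptions: SameType is decided by comparing IDs instead of checking for a relationship to the target node, and pItemTypeID is a made-up column name that returns the target's ID rather than the node itself.

```cypher
// Sketch: de-duplicate on property values instead of on nodes.
MATCH (:Item {ID: 1234})-[:HAS_AN]->(target:ItemType)
MATCH (item:Item)-[:HAS_AN]->(itemType:ItemType)
WHERE item.ID <> 1234
RETURN DISTINCT
  itemType.ID AS i,
  item.ID AS qID,
  target.ID AS pItemTypeID,  // hypothetical column: the ID, not the node
  CASE WHEN itemType.ID = target.ID THEN 1 ELSE 0 END AS SameType;
```

Because every returned column is a plain value, two ItemType nodes that share an ID produce identical rows, and DISTINCT collapses them.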

Thanks to @Bruno's solution, I was able to get the right answers. However, the original solution did not work right off the bat for me for two reasons: I needed the qList since I was referring to it later, and it had approximately 4 times the DB hits of the query in my question. So I tried a few optimizations that brought the number of DB hits down to half, and am sharing the result here for posterity.
MATCH (q:Item) WHERE q.ID <> 1234 WITH COLLECT(DISTINCT(q)) as qList
MATCH (p:Item{ID:1234})-[:HAS_AN]->(i:ItemType) WITH i as pItemType, qList
UNWIND qList as item
MATCH (item)-[:HAS_AN]->(i)
WITH
i, pItemType,
item.ID AS qID,
collect({
pItemType : pItemType,
SameType : CASE i.ID
WHEN pItemType.ID THEN 1 ELSE 0 END
})[0] as Item
RETURN i, qID, Item.pItemType AS pItemType, Item.SameType AS SameType
Turns out running MATCH (item:Item)-[:HAS_AN]->(itemType:ItemType) was adding a Filter operation that took almost as many DB hits as it had matches.

How to get the index of FOREACH iterations

Within a FOREACH statement [e.g. day in range(dayX, dayY)] is there an easy way to find out the index of the iteration ?

Yes, you can.
Here is an example query that creates 8 Day nodes that contain an index and day:
WITH 5 AS day1, 12 AS day2
FOREACH (i IN RANGE(0, day2-day1) |
CREATE (:Day { index: i, day: day1+i }));
This query prints out the resulting nodes:
MATCH (d:Day)
RETURN d
ORDER BY d.index;
and here is an example result:
+--------------------------+
| d |
+--------------------------+
| Node[54]{day:5,index:0} |
| Node[55]{day:6,index:1} |
| Node[56]{day:7,index:2} |
| Node[57]{day:8,index:3} |
| Node[58]{day:9,index:4} |
| Node[59]{day:10,index:5} |
| Node[60]{day:11,index:6} |
| Node[61]{day:12,index:7} |
+--------------------------+

FOREACH does not yield the index during iteration. If you want the index you can use a combination of range and UNWIND like this:
WITH ["some", "array", "of", "things"] AS things
UNWIND range(0,size(things)-2) AS i
// Do something for each element in the array. In this case connect two Things
MERGE (t1:Thing {name:things[i]})-[:RELATED_TO]->(t2:Thing {name:things[i+1]})
This example iterates a counter i, which you can use to access the item at index i in the array.
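If you only need each element together with its position, without linking neighbours, a minimal sketch of the same range/UNWIND pattern looks like this (the column names index and thing are just illustrative):

```cypher
// Pair every element with its position; range() is inclusive at both ends,
// so the upper bound is size(things) - 1.
WITH ["some", "array", "of", "things"] AS things
UNWIND range(0, size(things) - 1) AS i
RETURN i AS index, things[i] AS thing;
```

This returns one row per element, with index running from 0 to 3 for the four-element list.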

Grouping in MDX Query

I am very new to the MDX world.
I want to group the columns based on only 3 rows, but I also need the join for the 4th row.
My query is :
SELECT
( {
[Measures].[Live Item Count]
}
) DIMENSION PROPERTIES parent_unique_name ON COLUMNS,
Crossjoin(
crossjoin(
[Item].[Class].&[Light],
[Item].[Style].&[Fav],
[Item].[Season Year].members),
[Item].[Count].children ) on rows
FROM Cube
Output comes as :
Light(Row) | FAV(Row) | ALL(Row) | 16(Row) | 2(col)
Light(Row) | FAV(Row) | ALL(Row) | 7(Row) | 1(col)
Light(Row) | FAV(Row) | 2012(Row) | 16(Row)| 2(col)
Light(Row) | FAV(Row) | 2011(Row) | 7(Row) | 1(col)
But, I want my output to be displayed as:
Light(Row) | FAV(Row) | ALL(Row) | | 3(col)
Light(Row) | FAV(Row) | 2012(Row) | 16(Row)| 2(col)
Light(Row) | FAV(Row) | 2011(Row) | 7(Row) | 1(col)
i.e., I want to group my first two rows so that there is no duplicate 'ALL' in the 3rd column.
Thanks in advance

Try this: using the level name Season Year together with the attribute name Season Year will pick up every member without the ALL member:
SELECT
( {
[Measures].[Live Item Count]
}
) DIMENSION PROPERTIES parent_unique_name ON COLUMNS,
Crossjoin(
crossjoin(
[Item].[Class].&[Light],
[Item].[Style].&[Fav],
[Item].[Season Year].[Season Year].members),
[Item].[Count].children ) on rows
FROM Cube

You can use this query if there is an All member on the [Item].[Count] hierarchy:
SELECT {[Measures].[Live Item Count]} DIMENSION PROPERTIES parent_unique_name ON COLUMNS,
Crossjoin(
Crossjoin([Item].[Class].&[Light], [Item].[Style].&[Fav]),
Union(
Crossjoin({"All member of [Item].[Season Year]"}, {"All member of [Item].[Count]"}),
Crossjoin(Except([Item].[Season Year].members, {"All member of [Item].[Season Year]"}), [Item].[Count].children)
)
) ON ROWS
FROM Cube
