CREATE combined with FOREACH in Cypher gives unexpected results

I have a graph in which versioning information is stored as [:ADD] or [:REMOVE] relations between nodes. I want to replace those rels with a different model based on [:UPDATE] rels carrying type and timestamp properties.
Currently
MATCH (n:tocversion)-[r:ADD]->(m)
RETURN n.version,id(m)
returns this (as expected)
n.version,id(m)
1,13
1,14
2,15
2,16
3,17
3,18
3,19
3,20
4,21
4,22
Now I thought I could collect the versions and the m's and use them as a basis for creating the rels in the new model, like this:
MATCH (n:tocversion)-[r:ADD]->(m),(t:toc)
WITH t,COLLECT(n.version) AS versions, COLLECT(m) AS ms
FOREACH(i IN versions |
FOREACH(m1 IN [ms[i]]|
CREATE (t)-[r1:UPDATE {type:"ADD", version:versions[i]}]->(m1)))
However, the rels are created in a way I don't understand, because
MATCH (t:toc)-[r:`UPDATE`]->(b) RETURN r.version,r.type,id(b)
returns
r.version,r.type,id(b)
1, ADD, 14
1, ADD, 14
2, ADD, 15
2, ADD, 15
2, ADD, 16
2, ADD, 16
2, ADD, 16
2, ADD, 16
3, ADD, 17
3, ADD, 17
instead of the expected
r.version,r.type,id(b)
1, ADD, 13
1, ADD, 14
2, ADD, 15
2, ADD, 16
3, ADD, 17
3, ADD, 18
3, ADD, 19
3, ADD, 20
4, ADD, 21
4, ADD, 22

Found it. I had to use RANGE:
MATCH (n:tocversion)-[r:ADD]->(m),(t:toc)
WITH t, COLLECT(n.version) AS versions, COLLECT(m) AS ms
FOREACH(i IN RANGE(0, LENGTH(versions)-1) |
FOREACH(m1 IN [ms[i]] |
CREATE (t)-[r1:UPDATE {type:"ADD", version:versions[i]}]->(m1)))

Likely because of this:
FOREACH(i IN versions |
FOREACH(m1 IN [ms[i]] |
Your "i" is going to be: 1, 1, 2, 2, 3, 3, 3, 3, 4, 4, as expected.
But if you're using those values as indices into the ms collection (which is 0-based), you're looking at ms = [nodes with ids 13, 14, 15, 16, 17, ..., 22], so ms[1] will always be the node with id 14, ms[2] always the node with id 15, ms[3] always the node with id 16, and ms[4] always the node with id 17.
Your FOREACH loops need to be rethought, as "i" shouldn't be used as a lookup into ms.
In fact, I'm also not certain "i" should be used as an index into versions, which you do in your CREATE statement; you'll likely have a similar issue there (e.g. versions[3] will always be 2).
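Note that the collections and the index arithmetic aren't needed at all here: the MATCH already yields one row per (version, node) pair, so the relationships can be created directly. A minimal sketch, assuming there is exactly one :toc node:
MATCH (n:tocversion)-[:ADD]->(m), (t:toc)
// one row per ADD pair, so one UPDATE rel per row
CREATE (t)-[:UPDATE {type: "ADD", version: n.version}]->(m)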

Neo4j - Get certain nodes and relations

I have an application where nodes and relations are shown. After a result is shown, nodes and relations can be added through the GUI. When the user is done, I would like to fetch all the data from the database again (because at this point I don't have all the data in the front-end), based on the Neo4j IDs of all nodes and links. The difficult part for me is that there are "floating" nodes that have no relation in the GUI result (they will have relations in the database, but I don't want those). Worth mentioning is that on my relations, I have the start and end node IDs. I was thinking of starting from there, but then I wouldn't get these floating nodes.
Let's take a look at this poorly drawn example image:
As you can see:
node 1 is linked (no direction) to node 2.
node 2 is linked to node 3 (from 2 to 3)
node 3 is linked to node 4 (from 3 to 4)
node 3 is also linked to node 5 (no direction)
node 6 is a floating node, without relations
Let's assume that:
id(relation between 1 and 2) = 11
id(relation between 2 and 3) = 12
id(relation between 3 and 4) = 13
id(relation between 3 and 5) = 14
Keeping in mind that in the real data there are many more relations between all these nodes, how can I recreate exactly this image via Neo4j? I have tried something like:
MATCH path=(n)-[rels*]-(m)
WHERE id(n) IN [1, 2, 3, 4, 5]
AND all(rel IN rels WHERE id(rel) IN [11, 12, 13, 14])
AND id(m) IN [1, 2, 3, 4, 5]
RETURN path
However, this doesn't work properly, for multiple reasons. Also, just matching on all the nodes doesn't get me the relations. Do I need to union multiple queries? Can this be done in one query? Do I need to write my own plugin?
I'm using Neo4j 3.3.5.
You don't need to keep a list of node IDs. Every relationship points to its 2 end nodes. Since you always want both end nodes, you get them for free using just the relationship ID list.
This query will return every single-relationship path from a relationship ID list. If you are using the neo4j Browser, its visualization should knit together these short paths and display your original full paths.
MATCH p=()-[r]-()
WHERE ID(r) IN [11, 12, 13, 14]
RETURN p
By the way, all neo4j relationships have a direction. You may choose not to specify the direction when you create one (using MERGE) and/or query for one, but it still has a direction. And the neo4j Browser visualization will always show the direction.
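For instance, in this small sketch (the :Item label and LINKED type are made up for illustration), the MERGE pattern is undirected, but the stored relationship still gets a start and an end node:
CREATE (a:Item {id: 1}), (b:Item {id: 2})
WITH a, b
MERGE (a)-[r:LINKED]-(b)
// the pattern above is undirected, yet the relationship is stored with a direction:
RETURN id(startNode(r)), id(endNode(r))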
[UPDATED]
If you also want to include "floating" nodes that are not attached to a relationship in your relationship list, then you could just use a separate floating node ID list. For example:
MATCH p=()-[r]-()
WHERE ID(r) IN [11, 12, 13, 14]
RETURN p
UNION
MATCH p=(n)
WHERE ID(n) IN [6]
RETURN p
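Note that both halves of a UNION must return the same column names (here, both return p), otherwise the query will fail.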

Does Flatten have any effects other than flattening collections element-wise?

Specifically, does the Flatten PTransform in Beam perform any sort of:
Deduplication
Filtering
Purging of existing elements
Or does it just "merge" two different PCollections?
The Flatten transform does not do any sort of deduplication or filtering of any kind. As mentioned, it simply merges the multiple PCollections into one that contains the elements of each of the inputs.
This means that:
import apache_beam as beam

with beam.Pipeline() as p:
    c1 = p | "Branch1" >> beam.Create([1, 2, 3, 4])
    c2 = p | "Branch2" >> beam.Create([4, 4, 5, 6])
    result = (c1, c2) | beam.Flatten()
In this case, the result PCollection contains the following elements: [1, 2, 3, 4, 4, 4, 5, 6].
Note how the element 4 appears once in c1, and twice in c2. This is not deduplicated, filtered or removed in any way.
As a curious fact about Flatten: some runners optimize it away and simply attach the downstream transform to both branches. So, in short, no special filtering or dedups; simply a merge of PCollections.

Query to return nodes that have no specific relationship within an already matched set of nodes

The following statement creates the data I am trying to work with:
CREATE (p:P2 {id: '1', name: 'Arthur'})<-[:EXPANDS {recorded: 1, date:1}]-(:P2Data {wage: 1000})
CREATE (d2:P2Data {wage: 1100})-[:EXPANDS {recorded: 2, date:4}]->(p)
CREATE (d3:P2Data {wage: 1150})-[:EXPANDS {recorded: 3, date:3}]->(p)
CREATE (d3)-[:CANCELS]->(d2)
So, Arthur is created and initially has a wage of 1000. On day 2 we add the information that the wage will be 1100 from day 4 onwards. On day 3 we state that the wage will instead be increased to 1150, which cancels the entry from day 2.
Now, if I look at the history as it was valid for a given point in time, when the point in time is 2, the following history is correct:
day 1 - wage 1000
day 4 - wage 1100
when the point in time is 3, the following history is correct:
day 1 - wage 1000
day 3 - wage 1150
Expressed in graph terms: when I match the P2Data nodes based on the :EXPANDS relationship, I need those that are not cancelled by any other P2Data node that has also been matched.
This is my attempt so far:
MATCH p=(:P2 {id: '1'})<-[x1:EXPANDS]-(d1:P2Data)
WHERE x1.recorded <= 3
WITH x1.date as date,
FILTER(n in nodes(p)
WHERE n:P2Data AND
SIZE(FILTER(n2 IN nodes(p) WHERE (n2:P2Data)-[:CANCELS]->(n))) = 0) AS result
RETURN date, result
The idea was to only get those n in nodes(p) that have no paths pointing to them via the :CANCELS relationship.
Since I am still new to this and Cypher hasn't quite clicked for me yet, feel free to discard that query completely.
If you modify your data model by removing the CANCELS relationship, and instead add an optional canceled date to the EXPANDS relationship type, you can greatly simplify the required query.
For example, create the test data:
CREATE (p:P2 {id: '1', name: 'Arthur'})<-[:EXPANDS {recorded: 1, date:1}]-(:P2Data {wage: 1000})
CREATE (d2:P2Data {wage: 1100})-[:EXPANDS {recorded: 2, date:4, canceled: 3}]->(p)
CREATE (d3:P2Data {wage: 1150})-[:EXPANDS {recorded: 3, date:3}]->(p)
Perform simple query:
MATCH p=(:P2 {id: '1'})<-[x1:EXPANDS]-(d1:P2Data)
WHERE x1.recorded <= 3 AND (x1.canceled IS NULL OR x1.canceled > 3)
RETURN x1.date AS date, d1
ORDER BY date;
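Here is an alternative that works against the original model (without the canceled property): collect the entries ordered by valid date, then use REDUCE to keep only entries whose recorded date is later than that of the last kept entry (replacing the last kept entry when the valid dates match). With the original test data and a cutoff of 3, this returns day 1 / wage 1000 and day 3 / wage 1150, as expected: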
MATCH (:P2 {id: '1'})<-[x1:EXPANDS]-(d1:P2Data)
WHERE x1.recorded <= 3
WITH x1.date AS valid_date, x1.recorded AS transaction_date, d1.wage AS wage
ORDER BY valid_date
WITH COLLECT({v: valid_date, t: transaction_date, w:wage}) AS dates
WITH REDUCE(x = [HEAD(dates)], date IN TAIL(dates)|
CASE
WHEN date.v = LAST(x).v AND date.t > LAST(x).t THEN x[..-1] + [date]
WHEN date.t > LAST(x).t THEN x + [date]
ELSE x
END) AS results
UNWIND results AS result
RETURN result.v, result.w
I'm trying to think of a way to model this better, but I'm honestly pretty stumped.

How to use load csv for large dataset in neo4j?

I have a user.csv file with students:
id, first_name, last_name, locale, gender
1, Hasso, Plattner, en, male
2, Tina, Turner, de, female
and a memberships.csv file with course memberships of the students:
id, user_id, course_id
1, 1, 3
2, 1, 4
3, 2, 4
4, 2, 5
To transform students and courses into vertices and course memberships into edges, I joined the user information into memberships.csv:
id, user_id, first_name, last_name, course_id, locale, gender
1, 1, Hasso, Plattner, 3, en, male
2, 1, Hasso, Plattner, 4, en, male
3, 2, Tina, Turner, 4, de, female
4, 2, Tina, Turner, 5, de, female
and used LOAD CSV, some constraints, and MERGE:
CREATE CONSTRAINT ON (g:Gender) ASSERT g.gender IS UNIQUE
CREATE CONSTRAINT ON (l:locale) ASSERT l.locale IS UNIQUE
CREATE CONSTRAINT ON (c:Course) ASSERT c.id IS UNIQUE
CREATE CONSTRAINT ON (s:Student) ASSERT s.id IS UNIQUE
USING PERIODIC COMMIT 20000
LOAD CSV WITH HEADERS FROM
'file:///memberships.csv'
AS line
MERGE (s:Student {id: line.user_id, name: line.first_name + " " + line.last_name})
MERGE (c:Course {id: line.course_id})
MERGE (g:Gender {gender:line.gender})
MERGE (l:locale {locale:line.locale})
MERGE (s)-[:HAS_GENDER]->(g)
MERGE (s)-[:HAS_LANGUAGE]->(l)
MERGE (s)-[:ENROLLED_IN]->(c)
For 1,000 memberships Neo4j needs 2 seconds to load, for 10,000 memberships 3 minutes, and for 100,000 it fails with 'Unknown error'.
i) How do I get rid of the error?
ii) Is there a more elegant way to load such a structure from CSV, with about 600,000 memberships?
I am using a local machine with 2.4 GHz and 16 GB RAM.
The Neo4j browser has a 60-second timeout period on Cypher queries (due to the HTTP transport). This does not mean that your query is not running to completion; in fact, there has been no error at the database level. Your query will continue to run, but the browser will not be able to show its result. To see long-running queries run to completion, please use the Neo4j shell.
http://docs.neo4j.org/chunked/stable/shell.html
Try importing the nodes from their CSV files first and the rels afterwards.
Also try an import run without the Gender and Locale nodes, storing those as properties instead.
If you really need those (dense) nodes later on, try to create them like this:
CREATE (g:Gender {gender:"male"})
WITH g
MATCH (s:Student {gender:"male"})
CREATE (s)-[:HAS_GENDER]->(g)
Those relationships will be unique anyway, and CREATE is cheaper than MERGE. I assume that checking 2*(n-1) rels per inserted student adds up, as it is then O(n^2).
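A two-pass import along those lines might look like this (a sketch, assuming user.csv and memberships.csv with the headers shown above, whitespace in the headers trimmed, and the uniqueness constraints on Student.id and Course.id in place):
// Pass 1: student nodes from user.csv, with gender/locale as properties
USING PERIODIC COMMIT 20000
LOAD CSV WITH HEADERS FROM 'file:///user.csv' AS line
MERGE (s:Student {id: line.id})
SET s.name = line.first_name + " " + line.last_name, s.gender = line.gender, s.locale = line.locale;
// Pass 2: course nodes from memberships.csv
USING PERIODIC COMMIT 20000
LOAD CSV WITH HEADERS FROM 'file:///memberships.csv' AS line
MERGE (c:Course {id: line.course_id});
// Pass 3: relationships only, matching on the already-indexed ids
USING PERIODIC COMMIT 20000
LOAD CSV WITH HEADERS FROM 'file:///memberships.csv' AS line
MATCH (s:Student {id: line.user_id})
MATCH (c:Course {id: line.course_id})
MERGE (s)-[:ENROLLED_IN]->(c);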

Natural sorting in neo4j

We have a bunch of nodes with properties that are converted from BigDecimal to string during insert and vice versa during load.
This leads to typical problems during sorting: the values 1, 2, 3, 10 get sorted as 1, 10, 2, 3.
Does Cypher have any means of doing natural sorting on strings? Or do we have to convert these properties to doubles or something like that?
Guess the best way is to store them as integers in your db. Also, in the current milestone release there's a toInt() function which you could use for sorting:
START n=node(*)
WITH n, toInt(n.stringValue) AS nbr
RETURN n
ORDER BY nbr
Can you add a primary sort on string length?
CREATE ({val:"3"}),({val:"6"}),({val:"9"}),({val:"12"}),({val:"15"}),({val:"18"}),({val:"21"})
MATCH (n) RETURN n.val ORDER BY n.val
// 12, 15, 18, 21, 3, 6, 9
MATCH (n) RETURN n.val ORDER BY length(n.val), n.val
// 3, 6, 9, 12, 15, 18, 21
http://console.neo4j.org/r/kb0obm
If you keep converting them back and forth it sounds like it would be better to store them as their proper types in the database.
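A one-off migration along those lines might look like this (a sketch, assuming the property is called stringValue as in the earlier query, and that every value parses cleanly as an integer):
MATCH (n)
WHERE n.stringValue IS NOT NULL
SET n.stringValue = toInt(n.stringValue)
After this, ORDER BY n.stringValue sorts numerically without any conversion at query time.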
