Merging a set of nodes onto one (only in query) - neo4j

I am currently investigating how to model a bitemporal graph in neo4j. Unfortunately noone seems to have publicly undertaken this before.
One particular thing I am looking at is whether I can store in a new node only those values that have changed and then express a query that would merge all those values ordered by a given timestamp:
This creates the data I am playing with:
CREATE (:P1 {id: '1'})<-[:EXPANDS {date:5200, recorded:5100}]-(:P1Data {name:'Joe', wage: 3000})
// New data, recorded 2014-10-1 for 2015-1-1
MATCH (p:P1 {id: '1'}) CREATE (:P1Data { wage:3100 })-[:EXPANDS { date:5479, recorded: 5387}]->(p)
Now, I can get a history for a given point in time so far, e.g. like
MATCH (:P1 { id: '1' })<-[x:EXPANDS]-(d:P1Data)
WHERE x.recorded < 6000
WITH {date: x.date, data:d} as data
RETURN data
ORDER BY data.date DESC
What I would like to achieve is to merge the name and wage values such that I get a whole view of the data at a given point in time. The answer may also be that this is not really possible.
(PS: I say only in query, because I found a refactor function in apoc which does merge nodes, but that procedure actually merges and persists the node, while I would just want to query it).

As with most things, you can do it using REDUCE like so:
MATCH (:P1 { id: '1' })<-[x:EXPANDS]-(d:P1Data)
WITH x.date AS date, d AS data
ORDER BY date
WITH COLLECT(data) AS datas
WITH REDUCE(s = {}, y IN datas|
{name: COALESCE(y.name, s.name),
wage: COALESCE(y.wage, s.wage)})
AS most_recent_fields
RETURN most_recent_fields.name AS name, most_recent_fields.wage AS wage
You can do it in descending order instead (swap s and y inside the COALESCE statements if so), but there isn't really a way to shortcut processing the entire set of results from your queried time back to the start.
UPDATE: This will, of course, generate a Map and not a Node, but if you only want the properties and don't want to create a permanent record, a Map is actually better suited to your needs.
EXTENDED: If you don't want to specify which keys to use, you can do it without REDUCE like this instead:
MATCH (:P1 { id: '1' })<-[x:EXPANDS]-(d:P1Data)
WITH x.date AS date, d AS data
ORDER BY date
WITH COLLECT(data) AS datas
CREATE (t:Temp)
FOREACH(data IN datas|
SET t += data)
DELETE t
RETURN t
This approach does create a node, but if you DELETE it right before you RETURN it, it won't persist at all. += ensures that pre-existing properties aren't removed, only overwritten if the data node has existing values.

Related

Correct order of operations in neo4j - LOAD, MERGE, MATCH, WITH, SET

I am loading simple csv data into neo4j. The data is simple as follows :-
uniqueId compound value category
ACT12_M_609 mesulfen 21 carbon
ACT12_M_609 MNAF 23 carbon
ACT12_M_609 nifluridide 20 suphate
ACT12_M_609 sulfur 23 carbon
I am loading the data from the URL using the following query -
LOAD CSV WITH HEADERS
FROM "url"
AS row
MERGE( t: Transaction { transactionId: row.uniqueId })
MERGE(c:Compound {name: row.compound})
MERGE (t)-[r:CONTAINS]->(c)
ON CREATE SET c.category= row.category
ON CREATE SET r.price =row.value
Next I do the aggregation to count total orders for a compound and create property for a node in the following way -
MATCH (c:Compound) <-[:CONTAINS]- (t:Transaction)
with c.name as name, count( distinct t.transactionId) as ord
set c.orders = ord
So far so good. I can accomplish what I want but I have the following 2 questions -
How can I create the orders property for compound node in the first step itself? .i.e. when I am loading the data I would like to perform the aggregation straight away.
For a compound node I am also setting the property for category. Theoretically, it can also be modelled as category -contains-> compound by creating Categorynode. But what advantage will I have if I do it? Because I can execute the queries and get the expected output without creating this additional node.
Thank you for your answer.
I don't think that's possible, LOAD CSV goes over one row at a time, so at row 1, it doesn't know how many more rows will follow.
I guess you could create virtual nodes and relationships, aggregate those and then use those to create the real nodes, but that would be way more complicated. Virtual Nodes/Rels
That depends on the questions/queries you want to ask.
A graph database is optimised for following relationships, so if you often do a query where the category is a criteria (e.g. MATCH (c: Category {category_id: 12})-[r]-(:Compound) ), it might be more performant to create a label for it.
If you just want to get the category in the results (e.g. RETURN compound.category), then it's fine as a property.

Can I load in nodes and relationships from a csv file using 1 cypher command?

I have 2 csv files which I am trying to load into a Neo4j database using cypher: drivers.csv which holds every formula 1 driver and lap times.csv which stores every lap ever raced in F1.
I have managed to load in all of the nodes, although the lap times file is very large so it took quite a long time! I then tried to add relationships after, but there is so many that needs to be added that I gave up on it waiting (it was taking multiple days and still had not loaded in fully).
I’m pretty sure there is a way to load in the nodes and relationships at the same time, which would allow me to use periodic commit for the relationships which I cannot do right now. Essentially I just need to combine the 2 commands into one and after some attempts I can’t seem to work out how to do it?
// load in the lap_times.csv, changing the variable names - about half million nodes (takes 3-4 days)
PERIODIC COMMIT 25000
LOAD CSV WITH HEADERS from 'file:///lap_times.csv'
AS row
MERGE (lt: lapTimes {raceId: row.raceId, driverId: row.driverId, lap: row.lap, position: row.position, time: row.time, milliseconds: row.milliseconds})
RETURN lt;
// add a relationship between laptimes, drivers and races - takes 3-4 days
MATCH (lt:lapTimes),(d:Driver),(r:race)
WHERE lt.raceId = r.raceId AND lt.driverId = d.driverId
MERGE (d)-[rel8:LAPPING_AT]->(lt)
MERGE (r)-[rel9:TIMED_LAP]->(lt)
RETURN type(rel8), type(rel9)
Thanks in advance for any help!
You should review the documentation for indexes here:
https://neo4j.com/docs/cypher-manual/current/administration/indexes-for-search-performance/
Basically, indexes, once created, allow quick lookups of nodes of a given label, for the given property or properties. If you DON'T have an index and you do a MATCH or MERGE of a node, then for every row of that MATCH or MERGE, it has to do a label scan of all nodes of the given label and check all of their properties to find the nodes, and that becomes very expensive, especially when loading CSVs because those operations are likely happening for each row in the CSV.
For your :lapTimes nodes (though we would recommend you use singular labels in most cases), if there are none of them in your graph to start with, then a CREATE instead of a MERGE is fine. You may want a composite index on :lapTimes(raceId, driverId, lap), since that should uniquely identify the node, if you need to look it up later. Using CREATE instead of MERGE here should process much much faster.
Your second query should be MATCHing on :lapTimes nodes (label scan), and from each doing an index lookup on the :race and :driver nodes, so indexes are key here for performance.
You need indexes on: :race(raceId) and :Driver(driverId).
MATCH (lt:lapTimes)
WITH lt, lt.raceId as raceId, lt.driverId as driverId
MATCH (d:Driver), (r:race)
WHERE r.raceId = raceId AND d.driverId = driverId
MERGE (d)-[:LAPPING_AT]->(lt)
MERGE (r)-[:TIMED_LAP]->(lt)
You might consider CREATE instead of MERGE for the relationships, if you know there are no duplicate entries.
I removed your RETURN because returning the types isn't useful information.
Also, consider using consistent cases for your node labels, and that you are using the same case between the labels in your graph and the indexes you create.
Also, you would probably want to batch these changes instead of trying to process them all at once.
If you install APOC Procedures you can make use of apoc.periodic.iterate(), which can be used to batch changes, which will be faster and easier on your heap. You will still need indexes first.
CALL apoc.periodic.iterate("
MATCH (lt:lapTimes)
WITH lt, lt.raceId as raceId, lt.driverId as driverId
MATCH (d:Driver), (r:race)
WHERE r.raceId = raceId AND d.driverId = driverId
RETURN lt, d, ir",
"MERGE (d)-[:LAPPING_AT]->(lt)
MERGE (r)-[:TIMED_LAP]->(lt)", {}) YIELD batches, total, errorMessages
RETURN batches, total, errorMessages
Single CSV load
If you want to handle everything all at once in a single CSV load, you can do that, but again you will need indexes first. Here's what you'll need at a minimum:
CREATE INDEX ON :Driver(driverId);
CREATE INDEX ON :Race(raceId);
After those are created, you can use this, assuming you are starting from scratch (I fixed the case of your labels and made them singular:
USING PERIODIC COMMIT 25000
LOAD CSV WITH HEADERS from 'file:///lap_times.csv' AS row
MERGE (d:Driver {driverId:row.driverId})
MERGE (r:Race {raceId:row.raceId})
CREATE (lt:LapTime {raceId: row.raceId, driverId: row.driverId, lap: row.lap, position: row.position, time: row.time, milliseconds: row.milliseconds})
CREATE (d)-[:LAPPING_AT]->(lt)
CREATE (r)-[:TIMED_LAP]->(lt)

How to set all relationships of a certain type -- replacing the old ones if necessary -- in Neo4j?

I have a node id for an event, and list of node ids for users that are hosting the event. I want to update these (:USER)-[:HOSTS]->(:EVENT) relationships. I dont just want to add the new ones, I want to remove the old ones as well.
NOTE: this is coffeescript syntax where #{} is string interpolation, and str() will escape any characters for me.
Right now I'm querying all the hosts:
MATCH (u:USER)-[:HOSTS]->(:EVENT {id:#{str(eventId)}})
RETURN u.id
Then I'm determining which hosts are new and need to be added and which ones are old and need to be removed. For the old ones, I remove them
MATCH (:HOST {id:#{str(host.id)}})-[h:HOSTS]->(:EVENT {id:#{str(eventId)}})
DELETE h
And for the new ones, I add them:
MATCH (e:EVENT {id: #{str(eventId)}})
MERGE (u:USER {id:#{str(id)}})
SET u.name =#{str(name)}
MERGE (u)-[:HOSTS]->(e)
So my question is, can I do this more efficiently all in one query? I want want to set the new relationships, getting rid of any previous relationships that arent in the new set.
If I understand your question correctly, you can achieve your objective in a single query by introducing WITH and FOREACH. On a sample graph created by
CREATE (_1:User { name:"Peter" }),(_2:User { name:"Paul" }),(_3:User { name:"Mary" })
CREATE (_4:Event { name:"End of the world" })
CREATE _1-[:HOSTS]->_4, _2-[:HOSTS]->_4
you can remove the no longer relevant hosts, and add the new hosts, as such
WITH ["Peter", "Mary"] AS hosts, "End of the world" AS eventId
MATCH (event:Event { name:eventId })<-[r:HOSTS]-(u:User)
WHERE NOT u.name IN hosts
DELETE r
WITH COLLECT(u.name) AS oldHosts, hosts, event
WITH FILTER(h IN hosts
WHERE NOT h IN oldHosts) AS newHosts, event, oldHosts
FOREACH (n IN newHosts |
MERGE (nh:User { name:n })
MERGE nh-[:HOSTS]->event
)
I have made some assumptions, at least including
The new host (:User) of the event may already exists, therefore MERGE (nh:User { name:n }) and not CREATE.
The old [:HOSTS]s should be disconnected from the event, but not removed from the database.
Your coffee script stuff can be translated into parameters, and you can translate my pseudo-parameters into parameters. In my sample query I simulate parameters with the first line, but you may need to adapt the syntax according to how you actually pass the parameters to the query (I can't turn Coffee into Cypher).
Click here to test the query. Change the contents of the hosts array to ["Peter", "Paul"], or to ["Peter", "Dragon"], or whatever value makes sense to you, and rerun the query to see how it works. I've used name rather than id to catch the nodes, and again, I've simulated parameters, but you might be able to translate the query to the context from which you want to execute it.
Edit:
Re comment, if you want the query to also match events that don't have any hosts you need to make the -[:HOSTS]- part of the pattern optional. Do so by braking the MATCH clause in two:
MATCH (event:Event { name:eventId })
OPTIONAL MATCH event<-[r:HOSTS]-(u:User)
The rest of the query is the same.

Too much time importing data and creating nodes

i have recently started with neo4j and graph databases.
I am using this Api to make the persistence of my model. I have everything done and working but my problems comes related to efficiency.
So first of all i will talk about the scenary. I have a couple of xml documents which translates to some nodes and relations between the, as i already read that this API still not support a batch insertion, i am creating the nodes and relations once a time.
This is the code i am using for creating a node:
var newEntry = new EntryNode { hash = incremento++.ToString() };
var result = client.Cypher
.Merge("(entry:EntryNode {hash: {_hash} })")
.OnCreate()
.Set("entry = {newEntry}")
.WithParams(new
{
_hash = newEntry.hash,
newEntry
})
.Return(entry => new
{
EntryNode = entry.As<Node<EntryNode>>()
});
As i get it takes time to create all the nodes, i do not understand why the time it takes to create one increments so fats. I have made some tests and am stuck at the point where creating an EntryNode the setence takes 0,2 seconds to resolve, but once it has reached 500 it has incremented to ~2 seconds.
I have also created an index on EntryNode(hash) manually on the console before inserting any data, and made test with both versions, with and without index.
Am i doing something wrong? is this time normal?
EDITED:
#Tatham
Thanks for the answer, really helped. Now i am using the foreach statement in the neo4jclient to create 1000 nodes in just 2 seconds.
On a related topic, now that i create the nodes this way i wanted to also create relationships. This is the code i am trying right now, but got some errors.
client.Cypher
.Match("(e:EntryNode)")
.Match("(p:EntryPointerNode)")
.ForEach("(n in {set} | " +
"FOREACH (e in (CASE WHEN e.hash = n.EntryHash THEN [e] END) " +
"FOREACH (p in pointers (CASE WHEN p.hash = n.PointerHash THEN [p] END) "+
"MERGE ((p)-[r:PointerToEntry]->(ee)) )))")
.WithParam("set", nodesSet)
.ExecuteWithoutResults();
What i want it to do is, given a list of pairs of strings, get the nodes (which are uniques) with the string value as the property "hash" and create a relationship between them. I have tried a couple of variants to do this query but i dont seem to find the solution.
Is this possible?
This approach is going to be very slow because you do a separate HTTP call to Neo4j for every node you are inserting. Each call is then a transaction. Finally, you are also returning the node back, which is probably a waste.
There are two options for doing this in batches instead.
From https://stackoverflow.com/a/21865110/211747, you can do something like this, where you pass in a set of objects and then FOREACH through them in Cypher. This means one, larger, HTTP call to Neo4j and then executing in a single transaction on the DB:
FOREACH (n in {set} | MERGE (c:Label {Id : n.Id}) SET c = n)
http://docs.neo4j.org/chunked/stable/query-foreach.html
The other option, coming soon, is that you will be able to write something like this in Cypher:
LOAD CSV WITH HEADERS FROM 'file://c:/temp/input.csv' AS n
MERGE (c:Label { Id : n.Id })
SET c = n
https://github.com/davidegrohmann/neo4j/blob/2.1-fix-resource-failure-load-csv/community/cypher/cypher/src/test/scala/org/neo4j/cypher/LoadCsvAcceptanceTest.scala

How use Cypher with Merge to create a unique sub graph path

in Neo4j 2.0 M06 I understand that CREATE UNIQUE is depreciated and replaced with MERGE and MATCH instead, but I am finding it hard to see how this can be used to create a unique path.
as an example, I want to create a
MERGE root-[:HAS_CALENDER]->(cal:Calender{name:'Booking'})-[:HAS_YEAR]->(year:Year{value:2013})-[:HAS_MONTH]-(month:Month{value:'January'})-[:HAS_DAY]->(day:Day{value:1})
ON CREATE cal
SET cal.created = timestamp()
ON CREATE year
SET year.created = timestamp()
ON CREATE month
SET month.created = timestamp()
ON CREATE day
SET day.created = timestamp()
intention is that when I try to add a new days to my calender, it should only create the year, and month when it does not exist else just add to the existing path. Now when i run the query, i get an STATEMENT_EXECUTION_ERROR
MERGE only supports single node patterns
should I be executing multiple statements here to achieve this.
So the question is what's the best way in Neo4j to handle cases like this?
Edit
I did change my approach a bit and now even after making multiple calls, I think my merge is happening at a label level and not trying to restrict to the start node I provide as a result I am ending up with nodes that are shared across years and month which is not what I was expecting
I would really appreciate if some one can suggest me how to get a proper graph like below
my c# code is somewhat like this:
var qry = GraphClient.Cypher
.Merge("(cal:CalendarType{ Name: {calName}})")
.OnCreate("cal").Set("cal = {newCal}")
.With("cal")
.Start(new { root = GraphClient.RootNode})
.CreateUnique("(root)-[:HAS_CALENDAR]->(cal)")
.WithParams(new { calName = newCalender.Name, newCal = newCalender })
.Return(cal => cal.Node<CalenderType>());
var calNode = qry.Results.Single();
var newYear = new Year { Name = date.Year.ToString(), Value = date.Year }.RunEntityHousekeeping();
var qryYr = GraphClient.Cypher
.Merge("(year:Year{ Value: {yr}})")
.OnCreate("year").Set("year = {newYear}")
.With("year")
.Start(new { calNode })
.CreateUnique("(calNode)-[:HAS_YEAR]->(year)")
.WithParams(new { yr = newYear.Value, newYear = newYear })
.Return(year => year.Node<Year>());
var yearNode = qryYr.Results.Single();
var newMonth = new Month { Name = date.Month.ToString(), Value = date.Month }.RunEntityHousekeeping();
var qryMonth = GraphClient.Cypher
.Merge("(mon:Month{ Value: {mnVal}})")
.OnCreate("mon").Set("mon = {newMonth}")
.With("mon")
.Start(new { yearNode })
.CreateUnique("(yearNode)-[:HAS_MONTH]->(mon)")
.WithParams(new { mnVal = newMonth.Value, newMonth = newMonth })
.Return(mon => mon.Node<Month>());
var monthNode = qryMonth.Results.Single();
var newDay = new Day { Name = date.Day.ToString(), Value = date.Day, Date = date.Date }.RunEntityHousekeeping();
var qryDay = GraphClient.Cypher
.Merge("(day:Day{ Value: {mnVal}})")
.OnCreate("day").Set("day = {newDay}")
.With("day")
.Start(new { monthNode })
.CreateUnique("(monthNode)-[:HAS_DAY]->(day)")
.WithParams(new { mnVal = newDay.Value, newDay = newDay })
.Return(day => day.Node<Day>());
var dayNode = qryDay.Results.Single();
Regards
Kiran
Nowhere on the documentation page does it say that CREATE UNIQUE has been deprecated.
MERGE is just a new approach that's available to you. It enables some new scenarios (matching based on labels, and ON CREATE and ON MATCH triggers) but also does not cover more complex scenarios (more than a single node).
It sounds like you're already familiar with CREATE UNIQUE. For now, I think you should still be using that.
It seems to me the picture of what you want your graph to look like has the order imposed by relationships, but your code models the order with nodes. If you want that graph, you will need to use relationship types like [2010], [2011] instead of a pattern like [HAS_YEAR]->({value:2010}).
Another way to say the same thing: you are trying to constitute uniqueness for a node intrinsically, by a combination of label and property, e.g. (unique:Day {value:4}). Assuming you have the relevant constraints, this would be database wide uniqueness, so only one fourth-day-of-the-month for all the months to share. What you want is extrinsic local uniqueness, uniqueness established and extended transitively by a hierarchy of relationships. Uniqueness for a node is then not in its internal properties but in its external 'position' or 'order' in relation to its parent. The locally unique pattern (month)-[:locally_unique_rel]->(day) is made unique for a wider scope when the month is made unique, and the month is made unique, not by property and label, but extrinsically by its 'order' or 'position' under its year. Hence the transitivity. I think this is a strength of modeling with graphs, among other things it allows you to continue to partition your structure. If for instance you want to split some of your days into AM and PM or into hours, you can easily do so.
So, in your graph, [HAS_DAY] makes all days equally related to their month, and cannot therefore be used to differentiate between them. You have solved this locally under a month, since the property value differentiates, but since the fourth-day-of-the-month in
(november)-[:HAS_DAY]->(4th)` and `(december)-[:HAS_DAY]->(4th)
are not distinct by property value or label, they are the same node in your graph. Locally, under a month say, unique nodes can be achieved equally with
[11]->()-[4]->(unique1), [11]->()-[5]->(unique2)
and
[HAS_MONTH]->({value:11})-[HAS_DAY]->(unique1 {value:4}),
[HAS_MONTH]->({value:11})-[HAS_DAY]->(unique2 {value:5})
The difference is that with the former extrinsic local uniqueness, you have the benefit of transitivity. Since the months are unique in a year, as (november) in [11]->(november) is locally unique, therefore the days of November are also unique in that year - the (fourth) node is distinct between
[11]->(november)-[4]->(fourth)
and
[12]-(december)->[4]->(fourth)
What this amounts to is transferring more of your semantic model to your relationships, leaving the nodes for storing data. The node identifiers in the picture you posted are only pedagogical, replacing them with x,y,z or empty parentheses would perhaps better reveal the structure or scaffolding of the graph.
If you want to keep the relationship types intact, adding an ordering property to each relationship to create a pattern like (november)-[:HAS_DAY {order:4}]->(4th) will also work. This may be less performant for querying, but you may have other concerns that make it worth it.
This code allows you to create calendar graphs on demand upon creation of an event for a specific day. You'll want to modify it to allow events on multiple days, but it seems more like your issue is creating unique paths, right? And you'd probably want to modify this to use parameters in your language of choice.
First I create the root:
CREATE (r:Root {id:'root'})
Then use this reusable MERGE query to successively match or create subgraphs for the calendar. I pass along the root so I can display the graph at the end:
MATCH (r:Root)
MERGE r-[:HAS_CAL]->(cal:Calendar {id:'General'})
WITH r,cal MERGE (cal)-[:HAS_YEAR]->(y:Year {id:2011})
WITH r,y MERGE (y)-[:HAS_MONTH]->(m:Month {id:'Jan'})
WITH r,m MERGE (m)-[:HAS_DAY]->(d:Day {id:1})
CREATE d-[:SCHEDULED_EVENT]->(e:Event {id:'ev3', t:timestamp()})
RETURN r-[*1..5]-()
Creates a graph like this when called multiple times:
Does this help?

Resources