graphing dorm movements in neo4j - neo4j

I'm really struggling getting my head around neo4j and was hoping someone might be able to help point me in the right direction with the below.
Basically, I have a list of what can be referred to as events; the event can be said to describe a patient entering and leaving a room.
Each event has a unique identifier; it also has an identifier for the student in question along with start and end times (e.g. the student entered the room at 12:00 and left at 12:05) and an identifier for the room.
The event and data might look along the lines of the below, columns separated by a pipe delimiter
ID|SID|ROOM|ENTERS|LEAVES
1|1|BLUE|1/01/2015 11:00|4/01/2015 10:19
2|2|GREEN|1/01/2015 12:11|1/01/2015 12:11
3|2|YELLOW|1/01/2015 12:11|1/01/2015 12:20
4|2|BLUE|1/01/2015 12:20|5/01/2015 10:48
5|3|GREEN|1/01/2015 18:41|1/01/2015 18:41
6|3|YELLOW|1/01/2015 18:41|1/01/2015 21:00
7|3|BLUE|1/01/2015 21:00|9/01/2015 9:30
8|4|BLUE|1/01/2015 19:30|3/01/2015 11:00
9|5|GREEN|2/01/2015 19:08|2/01/2015 19:08
10|5|ORANGE|2/01/2015 19:08|3/01/2015 2:43
11|5|PURPLE|3/01/2015 2:43|4/01/2015 16:44
12|6|GREEN|3/01/2015 11:52|3/01/2015 11:52
13|6|YELLOW|3/01/2015 11:52|3/01/2015 17:45
14|6|RED|3/01/2015 17:45|7/01/2015 10:00
Questions that might be asked could be:
what rooms have student x visited and in what order
what does the movement of students between rooms look like - to which room does students go to when they leave room y
That sounds simple enough but I'm tying myself into knots.
I started off creating unique constraints for both student and room
create constraint on (student: Student) assert student.id is unique
I then did the same for room.
I then loaded student as
using periodic commit 1000 load csv with headers from 'file://c:/event.csv' as line merge (s:Student {id: line.SID});
I also did the same for room and visits.
I have absolutely no idea how to create the relationships though to be able to answer the above questions though. Each event lists the time the student enters and leaves the room but not the room the student went to. Starting with the extract, should the extract be changed so that it contains the room the student left for? If someone could help talk through how I need to think of the relationships that needs to be created, that would be very much appreciated.
Cheers

As the popular saying goes, there is more than one way to skin an Ouphe - or thwart a mage. One way you could do it (which makes for the simplest modeling imo) is as follows :
CREATE CONSTRAINT ON (s:Student) ASSERT s.studentID IS UNIQUE;
CREATE CONSTRAINT ON (r:Room) ASSERT r.roomID IS UNIQUE;
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///dorm.csv" as line fieldterminator '|'
MERGE (s:Student {studentID: line.SID})
MERGE (r:Room {roomID: line.ROOM})
CREATE (s)-[:VISIT {starttime: apoc.date.parse(line.ENTERS,'s',"dd/MM/yyyy HH:mm"), endtime: apoc.date.parse(line.LEAVES,'s',"dd/MM/yyyy HH:mm")}]->(r);
# What rooms has student x visited and in what order
MATCH (s:Student {studentID: "2"})-[v:VISIT]->(r:Room)
RETURN r.roomID,v.starttime ORDER BY v.starttime;
# What does the movement of students between rooms look like - to which room does students go to when they leave room y
MATCH (s:Student)-[v:VISIT]->(r:Room {roomID: "GREEN"})
WITH s, v
MATCH (s)-[v2:VISIT]->(r2:Room)
WHERE v2.starttime > v.endtime
RETURN s.studentID, r2.roomID, v2.starttime ORDER BY s.studentID, v2.starttime;
So actually you would only have Student and Room as nodes and a student VISITing a room would make up the relationship (with the enter/leave times as properties of that relationship). Now, as you can see that might not be ideal for your second query (although it does work). One alternative is to have a Visit node and chain it (as timeline events) to both Students and Rooms. There's plenty of examples around on how to do that.
Hope this helps,
Tom

Related

Correct order of operations in neo4j - LOAD, MERGE, MATCH, WITH, SET

I am loading simple csv data into neo4j. The data is simple as follows :-
uniqueId compound value category
ACT12_M_609 mesulfen 21 carbon
ACT12_M_609 MNAF 23 carbon
ACT12_M_609 nifluridide 20 suphate
ACT12_M_609 sulfur 23 carbon
I am loading the data from the URL using the following query -
LOAD CSV WITH HEADERS
FROM "url"
AS row
MERGE( t: Transaction { transactionId: row.uniqueId })
MERGE(c:Compound {name: row.compound})
MERGE (t)-[r:CONTAINS]->(c)
ON CREATE SET c.category= row.category
ON CREATE SET r.price =row.value
Next I do the aggregation to count total orders for a compound and create property for a node in the following way -
MATCH (c:Compound) <-[:CONTAINS]- (t:Transaction)
with c.name as name, count( distinct t.transactionId) as ord
set c.orders = ord
So far so good. I can accomplish what I want but I have the following 2 questions -
How can I create the orders property for compound node in the first step itself? .i.e. when I am loading the data I would like to perform the aggregation straight away.
For a compound node I am also setting the property for category. Theoretically, it can also be modelled as category -contains-> compound by creating Categorynode. But what advantage will I have if I do it? Because I can execute the queries and get the expected output without creating this additional node.
Thank you for your answer.
I don't think that's possible, LOAD CSV goes over one row at a time, so at row 1, it doesn't know how many more rows will follow.
I guess you could create virtual nodes and relationships, aggregate those and then use those to create the real nodes, but that would be way more complicated. Virtual Nodes/Rels
That depends on the questions/queries you want to ask.
A graph database is optimised for following relationships, so if you often do a query where the category is a criteria (e.g. MATCH (c: Category {category_id: 12})-[r]-(:Compound) ), it might be more performant to create a label for it.
If you just want to get the category in the results (e.g. RETURN compound.category), then it's fine as a property.

how can do an 'OR' search in NEO4J cypher

I have a Neo4j db with different labels on the nodes such as a:Banker , b:Customer. each has an email property I want to search for an email but but not search the entire db. So I want to do something like this Match(a:Banker {email: '123#mymail.com'}) OR Match (b:Customer {email:'123#mymail.com'}). There are constraints on email for both labels but I don't want each label to have the same email so before I add a node I need to determine if the email exist in either Banker or Customer nodes. I suspect this can be done in a very efficient scalable way that would not leave the user staring at a spinner when trying to add the one millionth record.....any help would be much appreciated
How I would do it is have an addition label 'Person' on all Bankers and Customers.
CREATE CONSTRAINT ON (b:Person) ASSERT p.Email IS UNIQUE
CREATE CONSTRAINT ON (b:Banker) ASSERT p.Email IS UNIQUE
CREATE CONSTRAINT ON (b:Customer) ASSERT p.Email IS UNIQUE
CREATE (b:Person:Banker {Email: "123#mymail.com"})
CREATE (b:Person:Customer {Email: "321#mymail.com"})
CREATE (c:Person:Customer {Email: "123#mymail.com"})
The last one will fail as a Person/Banker already has the same email. You can then also search MATCH (p:Person {Email: "123#mymail.com"}) or even b:Banker, c:Customer
You can also do (p:Person:Customer:Banker) if a person is all three.
It will also allow you to do MERGE which creates an entry if it doesn't already exist.
Since you already have a database you can do:
MATCH(b:Banker)
SET b:Person
MATCH(c:Customer)
SET c:Person
A somewhat "safer" approach than #Liam's would be to just have the Person label, without the Banker and Customer labels. That way, it would be harder to accidentally create/merge a node without the Person label, since that would be the only label for a person. Also, this approach would not require 2 (or 3) uniqueness checks every time you added a person.
With this approach, you could also add isCustomer and isBanker boolean properties, as needed, and create indexes on :Person(isCustomer) and :Person(isBanker) to quickly locate customers versus bankers.
Now, having said the above, I wonder if you really need the isCustomer and isBanker properties (or the Customer and Banker labels) at all. That is, the fact that a Person node is a banker and/or a customer may be derivable from that node's relationships. It seems reasonable for your data model to contain Bank nodes with relationships between them and people. For example, in the following data model, b is a banker at "XYZ Bank", c is a customer, and bc is both:
(b:Person)-[:WORKS_AT]->(xyz:Bank {id:123, name: 'XYZ Bank'}),
(c:Person)-[:BANKS_AT]->(xyz),
(bc:Person)-[:BANKS_AT]->(xyz)<-[:WORKS_AT]-(bc)
This query would find all bankers:
MATCH (banker:Person)-[:WORKS_AT]->(:Bank)
RETURN banker;
This would find all customers:
MATCH (banker:Person)-[:BANKS_AT]->(:Bank)
RETURN banker;
This would find all bankers who are also customers at the same bank:
MATCH (both:Person)-[:WORKS_AT]->(:Bank)<-[:BANKS_AT]-(both)
RETURN both;

Returning a count of and all complete paths in Neo4j [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 1 year ago.
Improve this question
Being an absolute noob in neo4j and having had very generous help with a previous question, I thought I'd try my luck once again as I'm still struggling.
The example scenario is that of students that enters a house and walks from one room to another. The journey doesn't have to start or end at a particular room but the order of sequence that a student enters a room is important.
What I want to find out is all the complete paths that students have taken along with a count of how many times the path in question was taken. Below is the sample data and what I've tried (thanks to the answer of a previous question along with a series of blog posts):
the file dorm.csv
ID|SID|EID|ROOM|ENTERS|LEAVES
1|1|12|BLUE|1/01/2015 11:00|4/01/2015 10:19
2|2|18|GREEN|1/01/2015 12:11|1/01/2015 12:11
3|2|18|YELLOW|1/01/2015 12:11|1/01/2015 12:20
4|2|18|BLUE|1/01/2015 12:20|5/01/2015 10:48
5|3|28|GREEN|1/01/2015 18:41|1/01/2015 18:41
6|3|28|YELLOW|1/01/2015 18:41|1/01/2015 21:00
7|3|28|BLUE|1/01/2015 21:00|9/01/2015 9:30
8|4|36|BLUE|1/01/2015 19:30|3/01/2015 11:00
9|5|40|GREEN|2/01/2015 19:08|2/01/2015 19:08
10|5|40|ORANGE|2/01/2015 19:08|3/01/2015 2:43
11|5|40|PURPLE|3/01/2015 2:43|4/01/2015 16:44
12|6|48|GREEN|3/01/2015 11:52|3/01/2015 11:52
13|6|48|YELLOW|3/01/2015 11:52|3/01/2015 17:45
14|6|48|RED|3/01/2015 17:45|7/01/2015 10:00
creating nodes for Student, Room and Visit where Visit is the event of a student entering a room uniquely identified by the ID property
CREATE CONSTRAINT ON (student:Student) ASSERT student.studentID IS UNIQUE;
CREATE CONSTRAINT ON (room:Room) ASSERT room.roomID IS UNIQUE;
CREATE CONSTRAINT ON (visit:Visit) ASSERT visit.visitID IS UNIQUE;
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///dorm.csv" as line fieldterminator '|'
MERGE (student:Student {studentID: line.SID})
MERGE (room:Room {roomID: line.ROOM})
MERGE (visit:Visit {visitID: line.ID, roomID: line.ROOM, studentID: line.SID, ticketID: line.EID})
create (student)-[:VERB]->(visit)-[:OBJECT]->(room)
Creating a PREV relationship allows the ordering or sequencing that the student travels in. This uses data in the file dormprev.csv. If a student has only visited a single room, this ID will not appear in the dormprev file as its purpose is to link/chain visits. Data as below
ID|PREV_ID|EID
3|2|18
4|3|18
6|5|28
7|6|28
10|9|40
11|10|40
13|12|48
14|13|48
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///dormprev.csv" as line fieldterminator '|'
MATCH (new:Visit {visitID: line.ID})
MATCH (old:Visit {visitID: line.PREV_ID})
MERGE (new)-[:PREV]->(old)
I can view all student journeys by the below query
MATCH (student:Student)-[:VERB]->(visit:Visit)-[:OBJECT]-(room:Room)
RETURN student, visit, room
However, I have no idea how to return all of the rooms in a complete path.
if I run this query
MATCH p = (:Visit)<-[:PREV]-(:Visit) return p
I can see that it, for example, for student ID 2 returns Green and Yellow and then Yellow and Blue as a separate pair - I want to view that as Green, Yellow, Blue
This also means that if I run the below query:
MATCH p = (:Visit)<-[:PREV]-(:Visit)
WITH p, EXTRACT(v IN NODES(p) | v.roomID) AS rooms
UNWIND rooms AS stays
WITH p, COUNT(DISTINCT stays) AS distinct_stays
WHERE distinct_stays = LENGTH(NODES(p))
RETURN EXTRACT(v in NODES(p) | v.roomID), count(p)
ORDER BY count(p) DESC
it will return a count of those pairings rather than count of "whole paths" if that makes sense.
For example, SID 2 and SID 3 both visit rooms GREEN, YELLOW, BLUE in that order. SID 5 visits GREEN, ORANGE, PURPLE in that order.
What I'm hoping to see is:
[GREEN, YELLOW, BLUE] 2
[GREEN, ORANGE, PURPLE] 1
etc. Is that possible with the above model and if so can anyone please help point me in the right direction? The number of rooms that are visited is not guaranteed and can be anything from one to *. However, if only one room is visited, that's not really of interest and so is the reason why I thought this model might make sense (again, stolen from a blog post series).
I don't know if the above makes sense but any help would be much appreciated - this makes for an excellent use case and would be really useful.
Thank you for your kind help.
What I think you are looking for is variable path length. And you can accomplish that by merely changing this in your query (note the asterisk) :
MATCH p = (:Visit)<-[:PREV*]-(:Visit)
Do allow me a couple of further remarks. Yes, I understand the convenience of having roomID and studentID in the Visit node (keeps this specific query quite a bit simpler), but you are ignoring the whole point of having relationships in the first place (in fact, if you do it this way there's currently actually no point in having the Student and Room nodes at all) and you are going to have trouble maintaining them. Secondly ... if we are going to be splitting the proverbial 3rd normal form hairs ;-), then the relations for a Visit should actually be created as follows (note the direction of the relationships) :
CREATE (student)-[:VERB]->(visit)<-[:OBJECT]-(room)
Other than that I must say you're moving very fast :-)
Hope this helps, Tom
Building a bit on Tom's suggestions, you might consider an alternate model doing away with :Visit nodes completely, and making your relationship types a bit more focused, like this:
(:Student)-[:VISITED]->(:Room)
You can set entered and left properties on the :VISITED relationship, which will allow you to order the relationships (and corresponding :Rooms) in visited order.
Here's an alternate import that will do this, using APOC Procedures (you'll have to install the correct version corresponding with your Neo4j version) to parse out timestamps from your date strings.
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///dorm.csv" as line fieldterminator '|'
MERGE (student:Student {studentID: line.SID})
MERGE (room:Room {roomID: line.ROOM})
WITH student, room, apoc.date.parse(line.ENTERS, 'ms', 'MM/dd/yyyy HH:mm') as entered, apoc.date.parse(line.LEAVES, 'ms', 'MM/dd/yyyy HH:mm') as left
CREATE (student)-[r:VISITED]->(room)
SET r.entered = entered, r.left = left
And now your query to get all paths and the number of students who have taken those paths becomes very easy:
MATCH (s:Student)-[v:VISITED]->(r:Room)
WHERE size((s)-[:VISITED]->()) > 1
WITH s, r
ORDER BY v.entered ASC
WITH s, collect(r.roomID) as rooms
RETURN rooms, count(s)

In- and excluding nodes in a cypher query

Good morning,
I want to build a structure in Neo4J where I can handle my users and groups (kind of ACL). The idea is to have for each user and for each group a node with all the details. The groups shall become a graph where a root group will have sub-groups that can have also sub-groups without limit. The relation will be -[:IS_SUBGROUP_OF]- - so far nothing exciting. Every user will be related to a group with -[:IS_MEMBER_OF]- to have a clear assignment. Of course a user can be a member of 1 or more groups. Some users will have a different relation like -[:IS_LEADER_OF]- to identify teamlead of the groups.
My tasks:
Assignment: I can query each member of a group with a simple query, I can even query members of the subgroups using the current logged in and asking user:
MATCH (d1:Group:Local) -- (c:User)
MATCH (d:User) -[:IS_MEMBER_OF|IS_LEADER_OF]- (g:Group:Local)-[:IS_SUBGROUP_OF*0..]->(d1)
WHERE c.login = userLogin
RETURN DISTINCT d.lastname, d.firstname
I get every related user to every group of the current user and below (subgroups). Maybe you have a hint how I cna improve the query or the model.
Approval
Here I am stucked as I want to have all users of the current group from the querying user and all members of all subgroups - except the leader of the current group. The reason behind is that a teamlead shall not be able to approve actions for himself but though for every other member of his group and all members of subgroups including their teamleads.
I tried to use the relations -[:IS_LEADER_OF]- to exclude them but than I loose also the teamleads of the subgroups. Does anyone has an idea how I would either change the model or how I can query the graph to get all users except the teamlead of the current group?
Thanks for your time,
Balael
* EDIT *
I think I am getting close, I just need to understand the results of those both queries:
MATCH (d:User) -- (g:Group) WHERE g.uuid = "xx"
RETURN d.lastname, d.firstname
Returns all user in this group no matter what relationship (leader / member)
MATCH (d:User) -- (g:Group), (g)--(c:User{uuid:"yy"})
RETURN d.lastname, d.firstname
Returns all user of that group except the user c. I would have expected to get c as well in the list with d-users as c is part of that group and should be found with (d:User).
I do not understand the difference between both queries, maybe someone has a hint for me?
You can simplify your query slightly (however this should not have an impact on performance):
MATCH (d:User) -[:IS_MEMBER_OF|IS_LEADER_OF]- (g:Group:Local)-[:IS_SUBGROUP_OF*0..]->(d1:Group:Local)--(c:User{login:"userlogin"})
RETURN DISTINCT d.lastname, d.firstname
Don't completely understand your question, but I assume you want to make sure that d1 and c are not connected by a IS_LEADER_OF relationship. If so, try:
MATCH (d:User) -[:IS_MEMBER_OF|IS_LEADER_OF]- (g:Group:Local)-[:IS_SUBGROUP_OF*0..]->(d1:Group:Local)-[r]-(c:User{login:"userlogin"})
WHERE type(r)<>'IS_LEADER_OF'
RETURN DISTINCT d.lastname, d.firstname
following up on * EDIT * in the question
In a MATCH you specify a path. By definition a path does not use the same relationship twice. Otherwise there is a danger to run into infinite recursion. Looking at the second query in the "EDIT" section above: the right part matches yy's relationship to the group whereas the left part matches all user related to this group. To prevent multiple usage of the same relationship the left part does not hit use yy

Neo4j cyper performance on simple match

I have a very simple cypher which give me a poor performance.
I have approx. 2 million user and 60 book category with relation from user to category around 28 million.
When I do this cypher:
MATCH (u:User)-[read:READ]->(bc:BookCategory)
WHERE read.timestamp >= timestamp() - (1000*60*60*24*30)
RETURN distinct(bc.id);
It returns me 8.5k rows within 2 - 2.5 (First time) minutes
And when I do this cypher:
MATCH (u:User)-[read:READ]->(bc:BookCategory)
WHERE read.timestamp >= timestamp() - (1000*60*60*24*30)
RETURN u.id, u.email, read.timestamp;
It return 55k rows within 3 to 6 (First time) minutes.
I already have index on User id and email, but still I don't think this performance is acceptable. Any idea how can I improve this?
First of all, you can profile your query, to find what happens under the hood.
Currently looks like that query scans all nodes in database to complete query.
Reasons:
Neo4j support indexes only for '=' operation (or 'IN')
To complete query, it traverses all nodes, one by one, checking each node if it has valid timestamp
There is no straightforward way to deal with this problem.
You should look into creating proper graph structure, to deal with Time-specific queries more efficiently. There are several ways how to represent time in graph databases.
You can take look on graphaware/neo4j-timetree library.
Can you explain your model a bit?
Where are the books and the "reading"-Event in it?
Afaik all you want to know, which book categories have been recently read (in the last month)?
You could create a second type of relationship thats RECENTLY_READ which expires (is deleted) by a batch job it is older than 30 days. (That can be two simple cypher statements which create and delete those relationships).
WITH (1000*60*60*24*30) as month
MATCH (a:User)-[read:READ]->(b:BookCategory)
WHERE read.timestamp >= timestamp() - month
MERGE (a)-[rr:RECENTLY_READ]->(b)
WHERE coalesce(rr.timestamp,0) < read.timestamp
SET rr.timestamp = read.timestamp;
WITH (1000*60*60*24*30) as month
MATCH (a:User)-[rr:RECENTLY_READ]->(b:BookCategory)
WHERE rr.timestamp < timestamp() - month
DELETE rr;
There is another way to achieve what you exactly want to do here, but it's unfortunately not possible in Cypher.
With a relationship-index on timestamp on your read relationship you can run a Lucene-NumericRangeQuery in Neo4j's Java API.
But I wouldn't really recommend to go down this route.

Resources