How to use LOAD CSV for a large dataset in Neo4j?

I have a user.csv file with students:
id, first_name, last_name, locale, gender
1, Hasso, Plattner, en, male
2, Tina, Turner, de, female
and a memberships.csv file with course memberships of the students:
id, user_id, course_id
1, 1, 3
2, 1, 4
3, 2, 4
4, 2, 5
To transform students and courses into vertices
and course memberships into edges, I joined
the user information into memberships.csv:
id, user_id, first_name, last_name, course_id, locale, gender
1, 1, Hasso, Plattner, 3, en, male
2, 1, Hasso, PLattner, 4, en, male
3, 2, Tina, Turner, 4, de, female
4, 2, Tina, Turner, 5, de, female
and used load csv, some constraints and MERGE:
create constraint on (g:Gender) assert g.gender is unique
create constraint on (l:locale) assert l.locale is unique
create constraint on (c:Course) assert c.course is unique
create constraint on (s:Student) assert s.student is unique
USING PERIODIC COMMIT 20000
LOAD CSV WITH HEADERS FROM
'file: memberships.csv'
AS line
MERGE (s:Student {id: line.id, name: line.first_name +" "+line.last_name })
MERGE (c:Course {id: line.course_id})
MERGE (g:Gender {gender:line.gender})
MERGE (l:locale {locale:line.locale})
MERGE (s)-[:HAS_GENDER]->(g)
MERGE (s)-[:HAS_LANGUAGE]->(l)
MERGE (s)-[:ENROLLED_IN]->(c)
For 1 000 memberships neo4j needs 2 seconds to load,
for 10 000 memberships 3 minutes,
for 100 000 it fails with 'Unknown error'.
i) How to get rid of the error?
ii) Is there a more elegant way to load such a structure from .csv
with about 600 000 memberships?
I am using a local machine with 2,4 GHz and 16GB RAM.

The Neo4j browser has a 60-second timeout on Cypher queries (due to the HTTP transport). This does not mean that your query failed to run to completion; in fact, there has been no error at the database level. Your query will continue to run, but the browser will not show you its result. To see long-running queries run to completion, please use the Neo4j shell.
http://docs.neo4j.org/chunked/stable/shell.html

First import the nodes from their CSV, then import the relationships in a second pass.
Also try an import run without Gender and Locale nodes, storing those values as properties on the student instead.
If you really need those (dense) nodes later on, try to create them like this:
CREATE (g:Gender {gender:"male"})
WITH g
MATCH (s:Student {gender:"male"})
CREATE (s)-[:HAS_GENDER]->(g)
Those relationships will be unique, and CREATE is cheaper than MERGE. I assume that checking 2*(n-1) relationships per inserted student adds up, which makes the import O(n^2).
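The "nodes first, relationships afterwards" advice can be prepared outside the database. Below is a minimal Python sketch, not part of the original answer, that splits the joined rows into distinct students, distinct courses and one enrollment pair per row, so that each could be loaded in its own pass. The function name `split_joined_rows` is hypothetical, and keying students on `user_id` is an assumption (the query in the question uses `line.id`). Gender and locale are kept as student properties, as suggested above.

```python
import csv
import io

# Sample rows in the shape of the joined memberships.csv from the question.
JOINED_CSV = """id,user_id,first_name,last_name,course_id,locale,gender
1,1,Hasso,Plattner,3,en,male
2,1,Hasso,Plattner,4,en,male
3,2,Tina,Turner,4,de,female
4,2,Tina,Turner,5,de,female
"""

def split_joined_rows(csv_text):
    """Return (students, courses, enrollments) with duplicate nodes removed."""
    students = {}      # user_id -> student properties (assumed key, see lead-in)
    courses = set()    # distinct course ids
    enrollments = []   # (user_id, course_id), one pair per membership row
    reader = csv.DictReader(io.StringIO(csv_text), skipinitialspace=True)
    for row in reader:
        students[row["user_id"]] = {
            "name": row["first_name"] + " " + row["last_name"],
            "gender": row["gender"],   # stored as a property, not a node
            "locale": row["locale"],   # stored as a property, not a node
        }
        courses.add(row["course_id"])
        enrollments.append((row["user_id"], row["course_id"]))
    return students, sorted(courses), enrollments

students, courses, enrollments = split_joined_rows(JOINED_CSV)
print(len(students), "students,", len(courses), "courses,", len(enrollments), "enrollments")
```

Each of the three results could then be written to its own file and imported separately, nodes before relationships.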

Related

How do I calculate totals for the root nodes in a tree in neo4j?

I am learning cypher and was presented with a problem that I have actually solved, but wondering if there is a better way of writing the cypher query.
I have a hierarchy (tree) of arbitrary depth consisting of companies along with their subsidiaries and the subsidiaries of the subsidiaries and so on.
Each company / subsidiary is a node and an attribute on each node is the revenue earned by that particular company / subsidiary.
I want to calculate the total revenue for just the root nodes. That is I need to calculate the total revenue for the top level company being the sum of its own revenue plus the revenue of all the subsidiaries underneath it.
The query I have come up with calculates all of the sub totals for each mini-tree (a parent and its direct subsidiaries). The query appears to start at the bottom of the trees and works its way up.
The output of the first part of the query is a list of all of the nodes (except the leaves) with the total of all of the nodes under it.
Next I calculate all of the root nodes and "join" this list of root nodes to the previous result.
This returns the answer that I need. However, it seems quite convoluted - hence my question, is there a way to do this more elegantly - perhaps with just a single match clause?
Following is some sample data and the query I have written so far.
create (a:Company {revenue: 10, cid: "a"})
create (b:Company {revenue: 10, cid: "b"})
create (c:Company {revenue: 20, cid: "c"})
create (d:Company {revenue: 15, cid: "d"})
create (e:Company {revenue: 20, cid: "e"})
create (f:Company {revenue: 25, cid: "f"})
create (g:Company {revenue: 30, cid: "g"})
create (h:Company {revenue: 10, cid: "h"})
create (i:Company {revenue: 20, cid: "i"})
create (j:Company {revenue: 20, cid: "j"})
create (k:Company {revenue: 40, cid: "k"})
create (l:Company {revenue: 10, cid: "l"})
create (m:Company {revenue: 5, cid: "m"})
create (b)-[:REPORTS_TO]->(a)
create (c)-[:REPORTS_TO]->(a)
create (d)-[:REPORTS_TO]->(b)
create (e)-[:REPORTS_TO]->(c)
create (f)-[:REPORTS_TO]->(c)
create (h)-[:REPORTS_TO]->(g)
create (i)-[:REPORTS_TO]->(g)
create (j)-[:REPORTS_TO]->(h)
create (k)-[:REPORTS_TO]->(h)
create (l)-[:REPORTS_TO]->(i)
create (m)-[:REPORTS_TO]->(i)
;
Here is the query that I have created:
// First Calculate total revenue for each company in the tree with subsidiaries.
// This will include top level and intermediate level companies.
match (c: Company)<-[:REPORTS_TO*]-(s:Company)
with c.cid as r_cid, sum (s.revenue) + c.revenue as tot_revenue
// Next, Determine the root nodes
// "join" the list of root nodes to the totals for each company.
// The result is the root node companies with their total revenues.
match (c)
where not ()<-[:REPORTS_TO]-(c) AND
c.cid = r_cid
// Return the root company id and the revenue for it.
return c.cid, tot_revenue;
The above returns the result I am expecting which is:
+---------------------+
| c.cid | tot_revenue |
+---------------------+
| "g" | 135 |
| "a" | 100 |
+---------------------+
Again, this question is about whether or not there is a better way to write the cypher query than the solution I have come up with?
Yes, there are some ways to make your Cypher query better.
A few things in your query are not required or can be improved:
Scanning all the nodes a second time and then filtering in WHERE on cid, only to get back the node you already have.
Calculating total revenue for all the companies. You can avoid the total revenue calculation for subsidiaries, as you are not using it anywhere.
To make queries run efficiently, you need to minimize the total number of database calls (a.k.a. db hits). You can check the db hits for your query by profiling it. This will show you a query plan and which operators are doing most of the work.
To profile a query, add PROFILE at the beginning.
I profiled your query. Its total db hits were 311.
Let's make changes to your query step by step:
Removing unnecessary comparisons: Total db hits reduced to 131
PROFILE
MATCH (c:Company)<-[:REPORTS_TO*]-(s:Company)
WITH c, sum(s.revenue) + c.revenue AS tot_revenue
MATCH (c)
WHERE NOT ()<-[:REPORTS_TO]-(c)
RETURN c.cid, tot_revenue;
Avoid calculating total revenue for subsidiaries by filtering root companies prior to calculation. Total db hits reduced to 108
PROFILE
MATCH (c:Company)<-[:REPORTS_TO*]-(s:Company)
WHERE NOT ()<-[:REPORTS_TO]-(c)
WITH c.cid AS r_cid, sum(s.revenue) + c.revenue AS tot_revenue
RETURN r_cid, tot_revenue;
Separating alias and addition on company revenue from aggregation. Total db hits reduced to 90
PROFILE
MATCH (c:Company)<-[:REPORTS_TO*]-(s:Company)
WHERE NOT ()<-[:REPORTS_TO]-(c)
WITH c, sum(s.revenue) AS sub_tot_revenue
RETURN c.cid AS cid, sub_tot_revenue + c.revenue AS tot_revenue;
These are some ways to improve your solution. You can read more about query tuning in Neo4j documentation.
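To check what the final query is expected to return, here is a small plain-Python sketch (an illustration only, not how Neo4j evaluates the query) that recomputes the root totals for the sample data. The helper names `root_of` and `root_totals` are hypothetical. Each company's revenue is added to the total of the root reached by following REPORTS_TO upward.

```python
# Sample data from the question: revenue per company and
# REPORTS_TO edges stored as child -> parent.
revenue = {"a": 10, "b": 10, "c": 20, "d": 15, "e": 20, "f": 25, "g": 30,
           "h": 10, "i": 20, "j": 20, "k": 40, "l": 10, "m": 5}
reports_to = {"b": "a", "c": "a", "d": "b", "e": "c", "f": "c",
              "h": "g", "i": "g", "j": "h", "k": "h", "l": "i", "m": "i"}

def root_of(cid):
    """Follow REPORTS_TO upward until a company with no parent is reached."""
    while cid in reports_to:
        cid = reports_to[cid]
    return cid

def root_totals():
    """Sum every company's revenue into the total of its root company."""
    totals = {}
    for cid, amount in revenue.items():
        root = root_of(cid)
        totals[root] = totals.get(root, 0) + amount
    return totals

print(root_totals())
```

For the sample data this gives 100 for "a" and 135 for "g", matching the table in the question. One difference from the Cypher queries: a company with no subsidiaries at all would still appear here as its own root, whereas the MATCH pattern requires at least one REPORTS_TO relationship.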

Neo4j - Get certain nodes and relations

I have an application where nodes and relations are shown. After a result is shown, nodes and relations can be added through the GUI. When the user is done, I would like to get all the data from the database again (because I don't have all the data at this point in the front end), based on the Neo4j IDs of all nodes and links. The difficult part for me is that there are "floating" nodes that don't have a relation in the GUI result (they do have relations in the database, but I don't want these). Worth mentioning is that my relations carry the start and end node IDs. I was thinking of starting from there, but then I don't have these floating nodes.
Let's take a look at this poorly drawn example image:
As you can see:
node 1 is linked (no direction) to node 2.
node 2 is linked to node 3 (from 2 to 3)
node 3 is linked to node 4 (from 3 to 4)
node 3 is also linked to node 5 (no direction)
node 6 is a floating node, without relations
Let's assume that:
id(relation between 1 and 2) = 11
id(relation between 2 and 3) = 12
id(relation between 3 and 4) = 13
id(relation between 3 and 5) = 14
Keeping in mind that behind the real data, there are way more relations between all these nodes, how can I recreate this very image again via Neo4j? I have tried doing something like:
match path=(n)-[rels*]-(m)
where id(n) in [1, 2, 3, 4, 5]
and all(rel in rels where id in [11, 12, 13, 14])
and id(m) in [1, 2, 3, 4, 5]
return path
However, this doesn't work properly because of multiple reasons. Also, just matching on all the nodes doesn't get me the relations. Do I need to union multiple queries? Can this be done in 1 query? Do I need to write my own plugin?
I'm using Neo4j 3.3.5.
You don't need to keep a list of node IDs. Every relationship points to its 2 end nodes. Since you always want both end nodes, you get them for free using just the relationship ID list.
This query will return every single-relationship path from a relationship ID list. If you are using the neo4j Browser, its visualization should knit together these short paths and display your original full paths.
MATCH p=()-[r]-()
WHERE ID(r) IN [11, 12, 13, 14]
RETURN p
By the way, all neo4j relationships have a direction. You may choose not to specify the direction when you create one (using MERGE) and/or query for one, but it still has a direction. And the neo4j Browser visualization will always show the direction.
[UPDATED]
If you also want to include "floating" nodes that are not attached to a relationship in your relationship list, then you could just use a separate floating node ID list. For example:
MATCH p=()-[r]-()
WHERE ID(r) IN [11, 12, 13, 14]
RETURN p
UNION
MATCH p=(n)
WHERE ID(n) IN [6]
RETURN p
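The idea behind the two queries above (relationships select their own end nodes, floating nodes are listed separately) can be sketched in plain Python. This is an illustration under assumptions: the `relationships` dictionary is a hypothetical in-memory copy of the data, the directions of relationships 11 and 14 are guessed because the question leaves them open, and relationships 21 and 22 are invented stand-ins for the extra database relationships that should not be returned.

```python
# Hypothetical copy of the stored relationships: id -> (start node id, end node id).
relationships = {
    11: (1, 2), 12: (2, 3), 13: (3, 4), 14: (3, 5),
    21: (1, 3), 22: (6, 2),  # exist in the database but are not wanted
}

def subgraph(rel_ids, floating_node_ids):
    """Return the nodes and relationships that reproduce the GUI picture."""
    selected = {rid: relationships[rid] for rid in rel_ids}
    nodes = set(floating_node_ids)        # nodes without a selected relationship
    for start, end in selected.values():  # end nodes come for free
        nodes.update((start, end))
    return sorted(nodes), selected

nodes, rels = subgraph([11, 12, 13, 14], [6])
print(nodes)         # node ids in the result
print(sorted(rels))  # relationship ids in the result
```

Node 6 only appears because it is in the floating list; its database relationship 22 is left out since its ID was not requested.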

Query to return nodes that have no specific relationship within an already matched set of nodes

The following statement creates the data I am trying to work with:
CREATE (p:P2 {id: '1', name: 'Arthur'})<-[:EXPANDS {recorded: 1, date:1}]-(:P2Data {wage: 1000})
CREATE (d2:P2Data {wage: 1100})-[:EXPANDS {recorded: 2, date:4}]->(p)
CREATE (d3:P2Data {wage: 1150})-[:EXPANDS {recorded: 3, date:3}]->(p)
CREATE (d3)-[:CANCELS]->(d2)
So, Arthur is created and initially has a wage of 1000. On day 2 we add the info that the Wage will be 1100 from day 4 onwards. On day 3 we state that the wage will be increased to 1150, which cancels the entry from day 2.
Now, if I look at the history as it was valid for a given point in time, when the point in time is 2, the following history is correct:
day 1 - wage 1000
day 4 - wage 1100
when the point in time is 3, the following history is correct:
day 1 - wage 1000
day 3 - wage 1150
Expressed in graph terms: when I match the P2Data based on the :EXPANDS relationship, I need those that are not cancelled by any other P2Data node that has also been matched.
This is my attempt so far:
MATCH p=(:P2 {id: '1'})<-[x1:EXPANDS]-(d1:P2Data)
WHERE x1.recorded <= 3
WITH x1.date as date,
FILTER(n in nodes(p)
WHERE n:P2Data AND
SIZE(FILTER(n2 IN nodes(p) WHERE (n2:P2Data)-[:CANCELS]->(n))) = 0) AS result
RETURN date, result
The idea was to only get those n in nodes(p) where there are no paths pointing to it via the :CANCELS relationship.
Since I am still new to this and somehow cypher hasn't clicked yet for me, feel free to discard that query completely.
If you modify your data model by removing the CANCELS relationship, and instead add an optional canceled date to the EXPANDS relationship type, you can greatly simplify the required query.
For example, create the test data:
CREATE (p:P2 {id: '1', name: 'Arthur'})<-[:EXPANDS {recorded: 1, date:1}]-(:P2Data {wage: 1000})
CREATE (d2:P2Data {wage: 1100})-[:EXPANDS {recorded: 2, date:4, canceled: 3}]->(p)
CREATE (d3:P2Data {wage: 1150})-[:EXPANDS {recorded: 3, date:3}]->(p)
Perform simple query:
MATCH p=(:P2 {id: '1'})<-[x1:EXPANDS]-(d1:P2Data)
WHERE x1.recorded <= 3 AND (x1.canceled IS NULL OR x1.canceled > 3)
RETURN x1.date AS date, d1
ORDER BY date;
A second approach, which does not use a canceled property:
MATCH (:P2 {id: '1'})<-[x1:EXPANDS]-(d1:P2Data)
WHERE x1.recorded <= 3
WITH x1.date AS valid_date, x1.recorded AS transaction_date, d1.wage AS wage
ORDER BY valid_date
WITH COLLECT({v: valid_date, t: transaction_date, w:wage}) AS dates
WITH REDUCE(x = [HEAD(dates)], date IN TAIL(dates)|
CASE
WHEN date.v = LAST(x).v AND date.t > LAST(x).t THEN x[..-1] + [date]
WHEN date.t > LAST(x).t THEN x + [date]
ELSE x
END) AS results
UNWIND results AS result
RETURN result.v, result.w
I'm trying to think of a way to model this better, but I'm honestly pretty stumped.
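The REDUCE expression in the second query is dense, so here is a plain-Python sketch of the same logic, offered as a reading aid rather than a replacement. The names `Entry` and `history_as_of` are hypothetical. Entries known at the chosen point in time are sorted by valid date, and an entry is kept only if it was recorded later than the last entry kept so far; an entry with the same valid date replaces that last entry.

```python
from collections import namedtuple

# valid = x1.date, recorded = x1.recorded, wage = d1.wage in the query above.
Entry = namedtuple("Entry", "valid recorded wage")

# Test data from the question.
ENTRIES = [Entry(1, 1, 1000), Entry(4, 2, 1100), Entry(3, 3, 1150)]

def history_as_of(entries, point_in_time):
    """Return [(valid_date, wage)] as known at point_in_time."""
    known = sorted((e for e in entries if e.recorded <= point_in_time),
                   key=lambda e: e.valid)
    result = []
    for entry in known:
        if not result:
            result.append(entry)          # HEAD(dates) seeds the list
        elif entry.valid == result[-1].valid and entry.recorded > result[-1].recorded:
            result[-1] = entry            # newer record for the same valid date
        elif entry.recorded > result[-1].recorded:
            result.append(entry)          # newer record, later valid date
        # otherwise: recorded earlier than the last kept entry, so it is dropped
    return [(e.valid, e.wage) for e in result]

print(history_as_of(ENTRIES, 2))
print(history_as_of(ENTRIES, 3))
```

On the question's data this yields day 1 and day 4 for point in time 2, and day 1 and day 3 for point in time 3, which are the two histories the question describes as correct.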

How to create an ordered chain linked to a node?

I have a set of HeadNodes, each with a field id, and a set of TailNodes, which are not related to each other or to any HeadNode and which have the fields id and date (in milliseconds).
I want to write a query which takes the TailNodes matched by:
Match (p: TailNodes) where not (p)-[:RELATED_TO]->()
that is, those not joined to a HeadNode directly or through other TailNodes, takes each one's id, and looks through the HeadNodes for this id. When the HeadNode is found (it's guaranteed to be there), the query looks for the place to put the TailNode (in order of datetime).
For example:
we have 1 HeadNode{id: 1} and 3 TailNodes: {id: 1, datetime: 111}, {id: 1, datetime: 115}, and {id: 1, datetime: 113} without any relationships.
In the first step it takes the first TailNode {id: 1, datetime: 111} and creates a relationship:
(head:HeadNode{id: 1})<-[:RELATED_TO]-(tail:TailNodes{id:1, datetime:111})
In the second step it takes the second TailNode and finds that 115 is greater than 111, so it deletes the previous relationship and creates two new relationships, giving a chain that looks like this:
(head:HeadNode{id: 1})<-[:RELATED_TO]-(tail1:TailNodes{id:1, datetime:115})<-[:RELATED_TO]-(tail2:TailNodes{id:1, datetime:111})
In the third step it finds that 113 is greater than 111 but less than 115, deletes the relationship between datetime:115 and datetime:111, and then creates two new relationships, finally giving the following:
(head:HeadNode{id: 1})<-[:RELATED_TO]-(tail1:TailNodes{id:1, datetime:115})<-[:RELATED_TO]-(tail2:TailNodes{id:1, datetime:113})<-[:RELATED_TO]-(tail3:TailNodes{id:1, datetime:111})
I hope this explanation was clear. Thanks in advance.
OK, first cut. I'm out of time to create a more robust example, but I will take another shot later.
I started with a case where there were already nodes in the list:
H<--(T {dt:112})<--(T {dt:114})
I realize I created these in ascending order rather than descending order, too.
// match the orphaned tail nodes floating around
match (p:Tail)
where not (p)-->()
with p
// match the strand with the same name and the tail nodes that are connected
// where one datetime (dt) is greater and one is less than my orphaned tail nodes
match (t1:Tail)<-[r:RELATED_TO]-(t2:Tail)
where t1.name = p.name
and t2.name = p.name
and t1.dt < p.dt
and t2.dt > p.dt
// break the relationship between the two nodes i want to insert between
delete r
// create new relationships from the orphan to the two previously connected tails
with t1, t2, p
create (p)-[:RELATED_TO]->(t1)
create (t2)-[:RELATED_TO]->(p)
return *
The case just needs to be extended for a tailless head and for an orphan with a datetime greater than the last tail (i.e., not in between two existing tails).
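All three cases (tailless head, new largest datetime, insertion between two tails) follow one rule, which can be sketched in plain Python using the question's descending order. This is an illustration only, not Cypher: the dictionary `incoming` and the helpers `insert_tail` and `chain_from_head` are hypothetical, and the sketch assumes datetimes are distinct within one chain.

```python
HEAD = "head"

def insert_tail(incoming, dt):
    """Insert a tail with datetime dt into the chain.

    incoming[x] is the tail whose RELATED_TO relationship points at x.
    Assumes dt is not already present in the chain.
    """
    current = HEAD
    # Walk away from the head while the next tail is newer than dt.
    while current in incoming and incoming[current] > dt:
        current = incoming[current]
    displaced = incoming.get(current)  # tail that used to point at current
    incoming[current] = dt             # create (new)-[:RELATED_TO]->(current)
    if displaced is not None:
        incoming[dt] = displaced       # re-attach the displaced tail behind new
    return incoming

def chain_from_head(incoming):
    """List the datetimes in chain order, starting next to the head."""
    chain, current = [], HEAD
    while current in incoming:
        current = incoming[current]
        chain.append(current)
    return chain

incoming = {}
for dt in (111, 115, 113):  # arrival order from the question
    insert_tail(incoming, dt)
print(chain_from_head(incoming))
```

Overwriting `incoming[current]` corresponds to deleting the old relationship and creating the new one; when nothing was displaced, the new tail simply becomes the end of the chain.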

How to retrieve table of groups of related nodes in neo4j?

I have a dataset in neo4j that looks something like this:
(a)-[:similar_to]->(b)
Each node has a property called 'id' that is unique. In the following example dataset, each 'a' node had a 'similar_to' relationship to each 'b' node:
a.id b.id
1 5
1 2
2 13
3 12
Here is what the topology looks like:
graph topology image
What I would like to do is to retrieve a table of the two groups of nodes that are connected such that the result would look like:
1, 2, 5, 13
3, 12
The best I've been able to do with Cypher so far is:
MATCH (a)-[r:similar_to*]-(b)
RETURN collect(distinct a.id)
However, the output of this is to print all of the nodes on one row:
5, 1, 2, 3, 12, 13
I have tried various permutations of this query, but keep failing. I've searched the forums for 'subgraph' and 'neo4j', but was unable to find a suitable solution. Any direction/ideas would be appreciated.
Thanks!
My understanding is that you want every root node "a" together with the group of all nodes that have a direct or indirect [:similar_to] relationship with it. If so, try this:
MATCH (a)-[r:similar_to*]->(b)
WHERE NOT (a)<-[:similar_to]-()
RETURN a, collect(distinct b.id) as group
The "WHERE" clause restricts the node "a" to be the root node of each group.
The "RETURN" clause groups all nodes on the matched paths by the root node "a".
If you want to include each root "a" in the group, just change the path to,
(a)-[r:similar_to*0..]->(b)
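For comparison, the grouping this query performs can be sketched in plain Python on the sample data: find the nodes with no incoming similar_to relationship, then collect everything reachable from each of them. This is only an illustration of the logic; the function name `groups_by_root` and its `include_root` flag (mirroring the `*0..` variant) are hypothetical.

```python
# Sample data from the question: (a.id, b.id) for each similar_to relationship.
edges = [(1, 5), (1, 2), (2, 13), (3, 12)]

def groups_by_root(edges, include_root=True):
    """Map each root id to the sorted ids reachable from it via similar_to."""
    children = {}
    has_incoming = set()
    for a, b in edges:
        children.setdefault(a, []).append(b)
        has_incoming.add(b)
    # Roots are start nodes that nothing points at (the WHERE NOT clause).
    roots = [a for a in children if a not in has_incoming]
    groups = {}
    for root in roots:
        seen, stack = set(), [root]
        while stack:                      # depth-first walk along outgoing edges
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            stack.extend(children.get(node, []))
        if not include_root:
            seen.discard(root)
        groups[root] = sorted(seen)
    return groups

print(groups_by_root(edges))
```

With the root included, the two groups are 1, 2, 5, 13 and 3, 12, which is the table asked for in the question.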
