Add relationships to existing data in Neo4j - neo4j

To get started with Neo4j (4.2.3), I loaded a year's worth of flight data (7m rows) and wanted to model each flight as a relationship between its origin and destination airports. However, the following query just eats up memory and has not finished after two days, so something is clearly amiss:
MATCH (f:Flight), (dest:Airport), (orig:Airport)
WHERE f.Dest = dest.IATA_Code AND f.Origin = orig.IATA_Code
CREATE (orig)-[r:FlightTo {DeptDateTime:f.DepDT, ArriveDateTime:f.ArrDT, Flight:f.Name}]->(dest)
I can do this instead:
LOAD CSV WITH HEADERS FROM 'file:///flights.csv' AS row
MERGE (o:Org_Airport {Org_IATA:row.Origin})
MERGE (d:Dest_Airport {Dest_IATA:row.Dest})
CREATE (o)-[r:FlightTo {DeptDateTime:row.DepDT, ArriveDateTime:row.ArrDT, Flight:row.Name}]->(d)
While this has the advantage of working (even in a reasonable time) it feels ugly to essentially duplicate the airports and also to go through the CSV file again when all the required data is already in the database.
I'm not quite there with my graph thinking probably so I'd appreciate some guidance on what the best way is to add a relationship like this, keeping in mind that original load files might get lost.

Do you have indexes set? Looking at your first query, you'd need:
CREATE INDEX ON :Flight(Dest);
CREATE INDEX ON :Airport(IATA_Code);
If you don't have indexes/constraints set on the label/property, the lookup/merge will be very slow.
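As a rough sketch of how this could look on 4.2.3: with an index on the airport code, driving the query from the flights and looking up each airport by its code makes the intended plan explicit, instead of leaving the planner with three unconnected patterns. The index name `airport_iata` and the batch size are assumptions, and the batching step assumes the APOC plugin is installed. Without APOC, the inner MATCH ... CREATE can be run as one plain query, at the cost of a single large transaction.

```cypher
// Run the schema statement on its own first and let the index come online.
CREATE INDEX airport_iata FOR (a:Airport) ON (a.IATA_Code);

// Assumes APOC is installed. Streams the flights and commits every 10,000 rows
// so the whole 7m-row write is not held in a single transaction.
CALL apoc.periodic.iterate(
  "MATCH (f:Flight) RETURN f",
  "MATCH (orig:Airport {IATA_Code: f.Origin})
   MATCH (dest:Airport {IATA_Code: f.Dest})
   CREATE (orig)-[:FlightTo {DeptDateTime: f.DepDT, ArriveDateTime: f.ArrDT, Flight: f.Name}]->(dest)",
  {batchSize: 10000, parallel: false}
)
YIELD batches, total, errorMessages
RETURN batches, total, errorMessages;
```

Running EXPLAIN on the inner MATCH ... CREATE beforehand is a cheap way to check that the airports are found with an index seek rather than a label scan or cartesian product.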

Related

Cypher Query endless loop

I am new to graph databases and especially Cypher. I am importing data from my CSV. Below is the sample I pulled for some country data, to which I added the cities and states. Now I am pushing the data for areas:
LOAD CSV WITH HEADERS FROM
"file:///X:/loc.csv" as csvRow
MATCH (ct:city {poc:csvRow.poc})
MERGE (loc:area {eoc: csvRow.eoc, name:csvRow.loc_nme, name_wr:replace(csvRow.loc_nme," ","")})
MERGE (loc)-[:exists_inside]->(ct)
I've already pushed city and country data using the same query and built a relation between them too.
But when I try to create the areas inside the city it just keeps going, there is no stopping it. (15 mins have passed).
There are 7000 cities in the data I've got from the internet and 90k areas inside those cities.
Is it just taking time, or have I messed up the query?
After the Update
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM
"file:///X:/loc.csv" as csvRow
MATCH (ct:city {poc:csvRow.poc})
MERGE (loc:area {eoc: csvRow.eoc, name:csvRow.loc_nme, name_wr:replace(csvRow.loc_nme," ","")})
MERGE (loc)-[:exists_inside]->(ct)
Okay, your query plan shows NodeByLabelScans and filters are being used to find your nodes, which means that every time you match or merge to a node, it has to scan all nodes with the given labels and perform property access on all of them to find the nodes you're looking for.
You need to add indexes (or unique constraints, depending on if the field is supposed to be unique) on the relevant label/property combinations so those lookups will be quick.
So you'll need one on :city(poc), and probably one on :area(eoc), assuming those properties hold unique values.
EDIT
One other big thing I initially missed: you need to add USING PERIODIC COMMIT before the LOAD CSV so the load batches its writes to the db. That should do the trick here.
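Put together, the advice above might look like the sketch below. It assumes `poc` and `eoc` identify a city and an area uniquely, which the question does not confirm, and the batch size of 1000 is arbitrary. The `CREATE INDEX ON` form is the older syntax, which matches what this thread uses elsewhere; newer releases prefer `CREATE INDEX FOR ... ON`. Merging the area on its key alone and setting the other properties in ON CREATE SET is a change from the original query, made so that the MERGE lookup can use the index on `eoc`.

```cypher
// Run the schema statements once, separately from the load.
CREATE INDEX ON :city(poc);
CREATE INDEX ON :area(eoc);

// Batch the writes; 1000 rows per commit is an arbitrary choice.
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:///X:/loc.csv" AS csvRow
MATCH (ct:city {poc: csvRow.poc})
// Assumption: eoc alone identifies an area, so merge on it and set the rest once.
MERGE (loc:area {eoc: csvRow.eoc})
  ON CREATE SET loc.name = csvRow.loc_nme,
                loc.name_wr = replace(csvRow.loc_nme, " ", "")
MERGE (loc)-[:exists_inside]->(ct);
```

If the values really are unique, unique constraints could replace the two indexes; they are also backed by an index.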

Creating/Managing millions of Vertex Tree in Neo4j 3.0.4

I'm doing some stuff with my University and I've been asked to create a system that builds Complete Trees with millions of nodes (1 or 2 million at least).
I was trying to create the Tree with a Load CSV Using a periodic commit and it worked well with the creation of just Nodes (70000 ms on a general purpose Notebook :P ). When I tried the same with the Edges, it didn't scale as well.
Using periodic commit LOAD CSV WITH HEADERS FROM 'file:///Archi.csv' AS line
Merge (:Vertex {name:line.from})<-[:EDGE {attr1: toFloat(line.attr1), attr2:toFloat(line.attr2), attr3: toFloat(line.attr3), attr4: toFloat(line.attr4), attr5: toFloat(line.attr5)}]-(:Vertex {name:line.to})
I need to guarantee that a Tree is generated in no more than 5 minutes.
Is there a faster method that can deliver that kind of performance?
P.S.: The task doesn't require Neo4j, just a database (either SQL or NoSQL), but I found this NoSQL graph DB and thought it would be nice to implement it with Neo4j, as the graph data structure is given for free.
P.P.S : I'm using Cypher
I think you should read up on MERGE in the developer documentation again, to make sure you understand exactly what it's doing.
A few things in particular to be aware of...
If the pattern you are merging does not exist, the whole pattern is created, including both :Vertex nodes, even if matching nodes already exist, which could result in duplicate :Vertex nodes. If your :Vertex nodes are supposed to be in the database already, there are no relationships yet, and you are sure that no relationship repeats itself in your CSV, I strongly urge you to MATCH on the start and end nodes and then CREATE the relationship between them instead of using MERGE. Remember that doing a MERGE on a relationship with many attributes means it will try to match on that first, so as the number of relationships between nodes grows, there will be an increasing number of comparisons, which will slow your query down further. CREATE is a better choice if you know that no relationship will be duplicated and you are sure those relationships don't exist yet.
I also urge you to create an index on :Vertex(name), as that will significantly help matching on end nodes.
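A minimal sketch of that suggestion, assuming every :Vertex node has already been loaded and that no edge appears twice in Archi.csv. The batch size is an arbitrary choice, and the edge direction is kept as in the original query.

```cypher
// Run once, separately, before loading the edges.
CREATE INDEX ON :Vertex(name);

USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM 'file:///Archi.csv' AS line
// Both end nodes are found through the index on :Vertex(name).
MATCH (f:Vertex {name: line.from})
MATCH (t:Vertex {name: line.to})
// CREATE does no existence check, so it stays cheap as relationships accumulate.
CREATE (f)<-[:EDGE {
  attr1: toFloat(line.attr1),
  attr2: toFloat(line.attr2),
  attr3: toFloat(line.attr3),
  attr4: toFloat(line.attr4),
  attr5: toFloat(line.attr5)
}]-(t);
```

Rows whose `from` or `to` name has no matching vertex are silently skipped by the MATCH, so comparing the created relationship count with the CSV row count is a useful sanity check.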

Activerecord-import records with relationships to each other

My app needs to import hundreds to thousands of records at a time. The records are each nodes in a tree structure. I'm using activerecord-import to significantly speed up the import, and haven't yet settled on which of ancestry, closure_tree, acts_as_list or a custom solution to use for setting out the hierarchy.
The problem I'm grappling with is how to import all the data and relationships in one or just a few passes. My draft naive solution is:
creating each object in memory, and manually giving each object an id;
using those ids to manually give each object the foreign_key that it needs (eg parent_id); and then
mass-importing the resulting array of objects using activerecord-import
This feels like a hack with obvious problems. For example, if the ids that I've chosen for my objects get used by the database while I'm still instantiating my objects, then the relationships I've manually set become useless/wrong.
Another major problem is that as I look into more advanced solutions for the tree data structure (eg closure_tree and ancestry), manually setting the fields required by those gems feels more and more like a hack.
So I guess my question is, is there a clean way to set up a tree structure of N nodes in a rails activerecord database while touching the database less than N times?
The master branch has a commit with this functionality. It will work only with Rails 4 and Postgres.
If you happen to have another configuration, you will need to:
1. Create a new array of Models/hashes
2. Model.import it
3. Retrieve the IDs of the rows you just inserted
4. Go to 1, setting in the new array the IDs (parent_id) you just retrieved

Neo4j data modeling for branching/merging graphs

We are working on a system where users can define their own nodes and connections, and can query them with arbitrary queries. A user can create a "branch" much like in SCM systems and later can merge back changes into the main graph.
Is it possible to create an efficient data model for that in Neo4j? What would be the best approach? Of course we don't want to duplicate all the graph data for every branch as we have several million nodes in the DB.
I have read Ian Robinson's excellent article on Time-Based Versioned Graphs and Tom Zeppenfeldt's alternative approach with Network versioning using relationnodes but unfortunately they are solving a different problem.
I would love to know what you think; any thoughts appreciated.
I'm not sure what your experience level is. Any insight into that would be helpful.
My guess is that this system would rely heavily on tags on the nodes. Maybe come up with 5-20 node types that are very broad, including the names and a few key properties. Then you could let users select from those base categories and create their own spin-offs by adding tags.
Say you had your basic categories of (:Thing{Name:"",Place:""}) and (:Object{Category:"",Count:4})
Your users would have a drop-down or something with "Thing" and "Object". They'd select "Thing", for instance, type a new label (say "Cool"), enter values for "Name" and "Place", and add any custom properties (IsAwesome:True).
So now you've got a new node (:Thing:Cool{Name:"Rock",Place:"Here",IsAwesome:True}), which allows you to query by broad categories or by a user's own categories. Hopefully this would keep each broad category to a proportional fraction of your overall node count.
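To illustrate, here is a small sketch of creating such a node and querying it by either label. The labels and property values are just the illustrative ones from this answer, not a recommended schema.

```cypher
// A node carrying both a broad base label and a user-defined label.
CREATE (:Thing:Cool {Name: "Rock", Place: "Here", IsAwesome: true});

// Query by the broad category...
MATCH (t:Thing) RETURN t.Name, labels(t);

// ...or by the user-defined one.
MATCH (c:Cool) WHERE c.IsAwesome = true RETURN c.Name;
```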
Not sure if this is exactly what you're asking for. Good luck!
Hmm. While this isn't insane, think about the type of system you're replacing first. SQL. In SQL databases you wouldn't use branches because it's data storage. If you're trying to get data from multiple sources into one DB, I'd suggest exporting them all to CSV files and using a MERGE statement in cypher to bring them all into your DB at once.
This could work similarly to branching: at merge time, each person runs a script on their own copy of the DB that takes all the nodes and edges in their copy and puts them into a CSV, i.e.:
MATCH (n)-[e]-(n2)
RETURN n,e,n2
Then compare these CSVs as you pull them into your final DB to see what's already there from the other copies.
LOAD CSV WITH HEADERS FROM "file:///YourFile.CSV" AS file
MERGE (N:Node{Property1:file.Property1, Property2:file.Property2})
// NOTE: N2 needs to be keyed on the other end node's columns; as written it matches the same node as N.
MERGE (N2:Node{Property1:file.Property1, Property2:file.Property2})
MERGE (N)-[E:Edge]-(N2)
This will work, as long as you're using node types that you already know about and each person isn't creating new data structures that you don't know about until the merge.

Creating nodes and relationships at the same time in neo4j

I am trying to build a database in Neo4j with seven different types of nodes, around 4,000-5,000 nodes in total, and around 40,000 relationships between them. I currently create the nodes first, with Cypher like this:
Create (node1:type {name:'example1', type:'example2'})
Around 4000 of that example with unique nodes.
Then I've got relationships stated as such:
Create
(node1)-[:r]-(node51),
(node2)-[:r]-(node5),
(node3)-[:r]-(node2);
Around 40000 of such unique relationships.
With smaller graphs this has not been a problem at all. With this one, though, the query never finishes executing.
Any suggestions on how I can make this type of query work? Or what I should do instead?
Edit: What I'm trying to build is a big graph of a product with its releases, release versions, features etc., in the same way as the Movie graph example is built.
The product has about 6 releases in total, and each release has around 20 release versions. In total there are 371 features, and across those 371 features there are 438 feature versions. Every release version (120 in total) then has around 200-300 feature versions. These feature versions are mapped to their feature, which has dependencies on a little bit of everything in the db. I have also included HW dependencies, such as the possible HW to run these features on, releases on, etc. So basically I'm using Cypher code such as:
Create (Product1:Product {name:'ABC', type:'Product'})
Create (Release1:Release {name:'12A', type:'Release'})
Create (Release2:Release {name:'13A', type:'release'})
Create (ReleaseVersion1:ReleaseVersion {name:'12.0.1', type:'ReleaseVersion'})
Create (ReleaseVersion2:ReleaseVersion {name:'12.0.2', type:'ReleaseVersion'})
and below those I've structured them using
Create (Product1)<-[:Is_Version_Of]-(Release1),
(Product1)<-[:Is_Version_Of]-(Release2),
(Release2)<-[:Is_Version_Of]-(ReleaseVersion21),
All the way down to features, and then I've also added dependencies between them such as:
(Feature1)-[:Requires]->(Feature239),
(Feature239)-[:Requires]->(Feature51);
Since I had to gather all this information from many different Excel sheets etc., I wrote the code this way thinking I could put it all together in one big Cypher query and run it in the browser on localhost. It worked really well as long as I did not use more than 4,000-5,000 queries at a time: it created the entire database in about 5-10 seconds at most. But now that I'm trying to run around 45,000 queries at once, it has been running for almost 24 hours and is still showing "executing query...". Is there any way I can improve the time it takes? Will the database eventually be created? Can I use smarter indexes or other things to improve performance? The way my Cypher is written now, I cannot divide it into pieces, since everything in the database has some sort of connection to the product. Do I need to rewrite the code, or is there a smooth way around this?
You can create multiple nodes and relationships interlinked with a single create statement, like this:
create (a { name: "foo" })-[:HELLO]->(b {name : "bar"}),
(c {name: "Baz"})-[:GOODBYE]->(d {name:"Quux"});
So that's one approach, rather than creating each node individually with a single statement, then each relationship with a single statement.
You can also create multiple relationships from objects by matching first, then creating:
match (a {name: "foo"}), (d {name:"Quux"}) create (a)-[:BLAH]->(d);
Of course you could have multiple match clauses, and multiple create clauses there.
You might try to match a given type of node, and then create all necessary relationships from that type of node. You have enough relationships that this is going to take many queries. Make sure you've indexed the property you're using to match the nodes. As your DB gets big, that's going to be important to permit fast lookup of things you're trying to create new relationships off of.
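As a sketch of what splitting the work up could look like: identifiers such as Feature1 only exist within a single query, so once the nodes are created, each later query has to find them again by an indexed property. The :Feature label and the name values below are hypothetical, borrowed from the identifiers in the question.

```cypher
// Run once so that lookups by name use the index.
CREATE INDEX ON :Feature(name);

// Each relationship (or small group of them) becomes its own short query.
MATCH (a:Feature {name: 'Feature1'}), (b:Feature {name: 'Feature239'})
CREATE (a)-[:Requires]->(b);

MATCH (a:Feature {name: 'Feature239'}), (b:Feature {name: 'Feature51'})
CREATE (a)-[:Requires]->(b);
```

Because every query is independent, the 40,000 relationships can be run in batches of any size rather than as one statement.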
You haven't specified which query you're running that isn't "stopping loading". Update your question with specifics, and let us know what you've tried, and maybe it's possible to help.
If you have one of the nodes already created then a simple approach would be:
MATCH (n: user {uid: "1"}) CREATE (n) -[r: posted]-> (p: post {pid: "42", title: "Good Night", msg: "Have a nice and peaceful sleep.", author: n.uid});
Here the user node already exists and you have created a new relation and a new post node.
Another interesting approach might be to generate your statements directly in Excel, see http://blog.bruggen.com/2013/05/reloading-my-beergraph-using-in-graph.html?view=sidebar for an example. You can run a lot of CREATE statements in one transaction, so this should not be overly complicated.
If you're able to use the Neo4j 2.1 prerelease milestones, then you should try using the new LOAD CSV and PERIODIC COMMIT features. They are designed for just this kind of use case.
LOAD CSV allows you to describe the structure of your data with one or more Cypher patterns, while providing the values in CSV to avoid duplication.
PERIODIC COMMIT can help make large imports more reliable and also improve performance by reducing the amount of memory that is needed.
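For the dependencies in this question, a LOAD CSV import might look roughly like the sketch below. The file name `requires.csv`, its `from` and `to` columns, and the :Feature label with a `name` property are all assumptions for illustration. It also assumes the feature nodes already exist and that `name` is indexed; the file URL needs adjusting to wherever the CSV actually lives.

```cypher
// requires.csv (hypothetical), with a header row:
//   from,to
//   Feature1,Feature239
//   Feature239,Feature51
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///requires.csv' AS row
// Look up both existing features by name, then connect them.
MATCH (a:Feature {name: row.from}), (b:Feature {name: row.to})
CREATE (a)-[:Requires]->(b);
```

One such file per relationship type keeps each import query short, and the Excel sheets can be exported to CSV directly instead of being turned into CREATE statements.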
It is possible to use a single Cypher query to create a new node and relate it to an existing one.
As an example, assume you're starting with:
an existing "One" node which has an "id" property "1"
And your goal is to:
create a second node, let's call that "Two", and it should have a property id:"2"
relate the two nodes together
You could achieve that goal using a single Cypher query like this:
MATCH (one:One {id:'1'})
CREATE (one) -[:RELATED_TO]-> (two:Two {id:'2'})
