I have a relational database (about 30 tables) and I would like to transpose it in a neo4j graph database, and I don't know where to start...
Is there a general way to transpose tables and/or tuples into a graph model ? (relations properties, one or more graphs ?) What are the best sources of documentation ?
Thanks for any help,
Best regards
First, if at all possible, I'd suggest NOT using your relational DB as your "reference" for transposing to a graph model. All too often, mistakes and pitfalls from relational modelling get transferred over to the graph model and introduce other oddities. In fact, if you have a source ER diagram, that might be an even better starting point as it's really already a graph. And maybe even consider a re-modelling exercise for your domain!
That said, from a basic point of view, you can think of most tables as representing a node type (e.g. "User" or "Movie") with join tables and keys representing relationship types.
A great starting point, from my perspective anyway, is to determine some questions your graph/data source should answer. Write those questions down, and try to come up with Cypher queries that represent the questions. Often times, a graph model naturally arises from such an effort, and it's really not that difficult.
If you haven't already, I'd strongly recommend picking up a (free) copy of the Graph Databases ebook from here: http://graphdatabases.com/
It's jam-packed with a lot of good info on where to start with modelling your domain and even things to consider when you're used to doing things in a relational manner. It also contains some material on Cypher, although the Neo4j site (neo4j.org) has a reference manual with plenty of up-to-date info on Cypher.
Hope this helps!
There's not going to be a one-stop-shop for this kind of conversion, as not all data models are appropriate for graph modeling, and every application is a unique special snowflake...but with that said.....
Generally, your 'base' tables (e.g. User, Role, Order, Product) would become nodes, and your 'join tables' (a.k.a. buster tables) would be candidates for your relationships (e.g. UserRole, OrderLineItem). The key thing to remember that in a graph, generally, you can only have one relationship of a given type between two specific nodes - so in the above example, if your system allows the same product to be in an order twice - it would cause issues.
Foreign keys are your second source of relationships, look to them to see if it makes sense to be a relationship or just a property.
Just keep in mind what you are trying to solve by your data model - if it's traversing your objects to find relationships and distance, etc... then graphs may be a good fit. If you are modeling an eCommerce app, where you are dealing with manipulating a single nested object (e.g. order -> line item -> product -> sku), then a relational model may be the right fit.
Hope my $0.02 helps...
As has been already said, there is no magical transformation from a relational database model to a graph database model.
You should look for the original entities and how they are related in order to find your nodes, properties and relations. And always keeping in mind what type of queries you are going to perform.
As BtySgtMajor said, "Graph Databases" is a good book to start, and it is free.
Related
For recursive breakdown structures, is it better to model as ...
a. Group HAS Subgroup... or
b. Subgroup PART_OF Group ?? ....
Some neo4j tutorials imply model both (the parent_of and child_of example) while the neo4j subtype tutorials imply that either will work fine (generally going with PART-OF).
Based on experience with neo4j, is there a practical reason for choosing one or the other or use both?
[UPDATED]
Representing the same logical relationship with a pair of relationships (having different types) in opposite directions is a very bad idea and a waste of time and resources. Neo4j can traverse a single relationship just as easily from either of its nodes.
With respect to which direction to pick (since we do not want both), see this answer to a related question.
Revisiting Neo4j after a long absence. I have read a lot of articles but still find I have a few questions to get me going again....
Bidirectional relationships
I have a “connected to”-type scenario where 2 nodes are connected to each other. In fact, the idea is to model a type of flow. However, the flow in both directions is not always the same. I’m uncertain of the best method to use: 1 relationship with 2 properties or 2 distinct relationships?
The former feels like the comfortable choice but then doesn’t feel natural in terms of modelling the actual facts – for example: what to call the properties because FlowIn and FlowOut wouldn’t make sense when looked at from each nodes’ perspective. I also wonder about the performance of properties versus relationships in this case – these values will need to be updated.
Representing Time
Now I want to take a step further and represent the flow between nodes at specific times or, more accurately, between specific times. So between 2pm and 3pm the flow between #1 and #2 will be x.
How should this be done in an optimal way? Relationship per time frame per connection seems….verbose. Could a timeframe being represented as a node be of value?!
Are there any Maximum Flow samples with Cypher out there?
Particularly interested in push-relabel max flow problem solving.
Thank you for any advice to might have to offer.
While you have definitely given some thought to your problem the question is a little unclear. This seems to be a question about Graph Data Models. You would like to know how best to organize a model to represent a complex relationship. If you are trying to track the "flow" between two nodes then assign a weight property to a unidirected edge.
Bidirectional relationships should be carefully considered. Neo4j can process them just as fast as unidirectional relationships. A quote from the graphaware about using bidirectional relationships:
Relationships in Neo4j can be traversed in both directions with the same speed. Moreover, direction can be completely ignored. Therefore, there is no need to create two different relationships between nodes, if one implies the other.
I believe your problems can be alleviated by gaining a better understanding of Graph data models. Looking at a few different models and understanding the why will help more than understanding cypher syntax at this point. May I suggest reading this survey by 2 professors at the University of Chile titled "Survey of Graph Database Models." The "Hypernode" model on page 21 may be of particular interest to you since it sounds like you are trying to model a complex cyclic object. From page twenty one;
Hypernodes can be used to represent simple (flat) and complex objects (hierarchical, composite, and cyclic) as well as mappings and records. A key feature is its inherent ability to encapsulate information.
Hopefully that information helps you in your efforts to model a complex relationship.
I know that there are similar questions around on Stackoverflow but I don't feel they answer the following.
Graph Databases to my understanding store data following mostly this schema:
Table/Collection 1: store nodes with UID
Table/Collection 2: store relations referencing nodes via UID
This allows storing arbitrary types of graphs. Now as I understand triple stores store nothing but triples:
Triple/Collection 1: store triples (2 nodes, 1 relation)
Now I would see the following distinction regarding use cases:
Graph Databases: when you have known, static connections
Triple Stores: when you have loosely connected nodes and are often looking for new connections
I am confused by the fact that people do not seem to be discussing which one to use according to these criteria. Most article I find are talking about arguments like speed or compatibility. But is this not the most relevant point?
Put the other way round:
Imagine having a clearly connected, user defined graph. Why on earth would you want to store that as triples only, loosing all the info about connections? Or having to implement some custom solution storing IDs in the triple subject.
Imagine having loosely collected nodes that you want to query for unknown relations using SPARQL. Graph databases do support that. But for this they have to build another index I assume and would be slower?
EDIT:
I see that "loosing info about connections" is the wrong way to put it. If you do as shown in the accepted answer and insert several triples for 2 nodes + 1 relation then you keep all the info and specifically the info what exact nodes are connected.
The main difference between graph databases and triple stores is how they model the graph. In a triple store (or quad store), the data tends to be very atomic. What I mean is that the "nodes" in the graph tend to be primitive data types like string, integer, date, etc. Relationships link primitives together, and so the "unit of discourse" in a triple store is a triple, and not a node or a relationship, typically.
By contrast, other graph databases are often called "property stores" because nodes are data containers that correspond to objects in a domain. A node stands in for an object, and has properties; they act as rich data types specified by the graph modelers, more than just primitive data types. In these graph databases, nodes and relationships are the "unit of discourse".
Let's say I have a person named "Bob" who knows "Susan". In RDF, it would be something like this:
<http://example.org/person/1> :hasName "Bob".
<http://example.org/person/1> foaf:knows <http://example.org/person/2>.
<http://example.org/person/2> :hasName "Susan".
In a graph database like neo4j, it would be this:
(a:Person {name: "Bob"})-[:KNOWS]->(b:Person {name: "Susan"})
Notice that in RDF, it's 3 relationships but only one of those relationships actually expresses semantics between two entities. The other two relationships are just tracking properties of a single higher-level entity (the person). In neo4j, it's 1 relationship amongst two nodes, with each node having a property. In RDF you'll tend to identify things by URI, in neo4j it's a database object that gets a database ID automatically. That's what I mean about the difference between a more atomic/primitive store (triple stores) and a richer property graph.
RDF and triple stores are mostly built for the kinds of architectural challenges you'd run into with the semantic web. For example, XML namespacing is built in, on the architectural assumption that you'll be mixing and matching the use of many different vocabularies and namespaces. (That right there is a very "semantic web" assumption). So in SPARQL and RDF you'll see typically at least the use of xsd, rdf, and rdfs namespaces concurrently, and probably also owl, skos, and many others. SPARQL and RDF/RDFS also have many hooks and features that are there explicitly to make things like ontology inference easier. You'll tend to identify things with URIs as a way of "namespacing your identifiers" but also because some people may want to de-reference the URI...again the assumption here is a wide data sharing arrangement between many parties.
Property stores by contrast are keyed towards different use cases, like flexible modeling of data within one model/namespace, mappings between objects and graphs for persistence of enterprise applications, rapid evolvability, and so on. You'll tend to identify things with your own scheme (or an internal database ID). An auto-incrementing integer may not be best form of ID for any random consumer on the web, (and they certainly can't be de-referenced like URLs) but they might not be your first thought for a company internal application.
So which is better? The more atomic triple store format, or a rich property graph? Do you need to mix and match many different vocabularies in one query or data model? Do you need to create an OWL ontology or do inference? Do you need to serialize a bunch of java objects in memory to a database? Do you need to do fast traversal of long paths? Those types of questions would guide your selection.
Graphs are graphs, both of them do graphs, and so I don't think there's much difference in terms of what they can represent, or how you go about thinking about a problem in "graph terms". The differences boil down to the architecture underneath of the hood, and what sorts of use cases you think you'll need. I won't tell you one is better than the other, but choose wisely.
(in reply to the comments on this answer: https://stackoverflow.com/a/30167732 )
When an owl:inverseOf production rule is defined, the inverse property triple is inferred by the reasoner either when adding or updating the store, or when selecting from the store. This is a "materialized relation"
Schema.org - an RDFS vocabulary - defines, for example, https://schema.org/isPartOf as the inverse property of hasPart. If both are specified, it's not necessary to run another graph pattern query to traverse a directed relation in the other direction.
(:book1 schema:hasPart ?o)
(?o schema:isPartOf :book1)
(?s schema:hasPart :chapter2)
It's certainly possible to use RDFS and OWL to describe schema for and within neo4j property graphs; but there's no reasoner to e.g. infer inverse properties or do schema validation.
Is there any RDF graph that neo4j cannot store? RDF has datatypes and languages for objects: you'd need to reify properties where datatypes and/or languages are specified (and you'd be re-implementing well-defined semantics)
Can every neo4j graph be represented with RDF? Yes.
RDF is a representation for graphs for which there are very many store implementations that are optimized for various use cases like insert and query performance.
Comparing neo4j to a particular triplestore (with reasoning support) might be a more useful comparison given that all neo4j graphs can be expressed as RDF.
I can't seem to find any discussion on this. I had been imagining a database that was schemaless and node based and heirarchical, and one day I decided it was too common sense to not exist, so I started searching around and neo4j is about 95% of what I imagined.
What I didn't imagine was the concept of relationships. I don't understand why they are necessary. They seem to add a ton of complexity to all topics centered around graph databases, but I don't quite understand what the benefit is. Relationships seem to be almost exactly like nodes, except more limited.
To explain what I'm thinking, I was imagining starting a company, so I create myself as my first nodes:
create (u:User { u.name:"mindreader"});
create (c:Company { c.name:"mindreader Corp"});
One day I get a customer, so I put his company into my db.
create (c:Company { c.name:"Customer Company"});
create (u:User { u.name:"Customer Employee1" });
create (u:User { u.name:"Customer Employee2"});
I decide to link users to their customers
match (u:User) where u.name =~ "Customer.*"
match (c:Company) where c.name =~ "Customer.*
create (u)-[:Employee]->(c);
match (u:User where name = "mindreader"
match (c:Company) where name =~ "mindreader.*"
create (u)-[:Employee]->(c);
Then I hire some people:
match (c:Company) where c.name =~ "mindreader.*"
create (u:User { name:"Employee1"})-[:Employee]->(c)
create (u:User { name:"Employee2"})-[:Employee]->(c);
One day hr says they need to know when I hired employees. Okay:
match (c:Company)<-[r:Employee]-(u:User)
where name =~ "mindreader.*" and u.name =~ "Employee.*"
set r.hiredate = '2013-01-01';
Then hr comes back and says hey, we need to know which person in the company recruited a new employee so that they can get a cash reward for it.
Well now what I need is for a relationship to point to a user but that isn't allowed (:Hired_By relationship between :Employee relationship and a User). We could have an extra relationship :Hired_By, but if the :Employee relationship is ever deleted, the hired_by will remain unless someone remembers to delete it.
What I could have done in neo4j was just have a
(u:User)-[:hiring_info]->(hire_info:HiringInfo)-[:hired_by]->(u:User)
In which case the relationships only confer minimal information, the name.
What I originally envisioned was that there would be nodes, and then each property of a node could be a datatype or it could be a pointer to another node. In my case, a user record would end up looking like:
User {
name: "Employee1"
hiring_info: {
hire_date: "2013-01-01"
hired_by: u:User # -> would point to a user
}
}
Essentially it is still a graph. Nodes point to each other. The name of the relationship is just a field in the origin node. To query it you would just go
match (u:User) where ... return u.name, u.hiring_info.hiring_date, u.hiring_info.hired_by.name
If you needed a one to many relationship of the same type, you would just have a collection of pointers to nodes. If you referenced a collection in return, you'd get essentially a join. If you delete hiring_info, it would delete the pointer. References to other nodes would not have to be a disorganized list at the toplevel of a node. Furthermore when I query each user I will know all of the info about a user without both querying for the user itself and also all of its relationships. I would know his name and the fact that he hired someone in the same query. From the database backend, I'm not sure much would change.
I see quite a few questions from people asking whether they should use nodes or relationships to model this or that, and occasionally people asking for a relationship between relationships. It feels like the XML problem where you are wondering if a pieces of information should be its own tag or just a property its parent tag.
The query engine goes to great pains to handle relationships, so there must be some huge advantage to having them, but I can't quite see it.
Different databases are for different things. You seem to be looking for a noSQL database.
This is an extremely wide topic area that you've reached into, so I'll give you the short of it. There's a spectrum of database schemas, each of which have different use cases.
NoSQL aka Non-relational Databases:
Every object is a single document. You can have references to other documents, but any additional traversal means you're making another query. Times when you don't have relationships between your data very often, and are usually just going to want to query once and have a large amount of flexibly-stored data as the document that is returnedNote: These are not "nodes". Node have a very specific definition and implies that there are edges.)
SQL aka Relational Databases:
This is table land, this is where foreign keys and one-to-many relationships come into play. Here you have strict schemas and very fast queries. This is honestly what you should use for your user example. Small amounts of data where the relationships between things are shallow (You don't have to follow a relationship more than 1-2 times to get to the relevant entry) are where these excel.
Graph Database:
Use this when relationships are key to what you're trying to do. The most common example of a graph is something like a social graph where you're connecting different users together and need to follow relationships for many steps. (Figure out if two people are connected within a depth for 4 for instance)
Relationships exist in graph databases because that is the entire concept of a graph database. It doesn't really fit your application, but to be fair you could just keep more in the node part of your database. In general the whole idea of a database is something that lets you query a LOT of data very quickly. Depending on the intrinsic structure of your data there are different ways that that makes sense. Hence the different kinds of databases.
In strongly connected graphs, Neo4j is 1000x faster on 1000x the data than a SQL database. NoSQL would probably never be able to perform in a strongly connected graph scenario.
Take a look at what we're building right now: http://vimeo.com/81206025
Update: In reaction to mindreader's comment, we added the related properties to the picture:
RDBM systems are tabular and put more information in the tables than the relationships. Graph databases put more information in relationships. In the end, you can accomplish much the same goals.
However, putting more information in relationships can make queries smaller and faster.
Here's an example:
Graph databases are also good at storing human-readable knowledge representations, being edge (relationship) centric. RDF takes it one step further were all information is stored as edges rather than nodes. This is ideal for working with predicate logic, propositional calculus, and triples.
Maybe the right answer is an object database.
Objectivity/DB, which now supports a full suite of graph database capabilities, allows you to design complex schema with one-to-one, one-to-many, many-to-one, and many-to-many reference attributes. It has the semantics to view objects as graph nodes and edges. An edge can be just the reference attribute from one node to another or an edge can exist as an edge object that sits between two nodes.
An edge object can have any number of attribute and can have references off to other objects, as shown in the diagram below.
Being able to "hang" complex objects off of an edge allows Objectivity/DB to support weighted queries where the edge-weight can be calculated using a user-defined weight calculator operator. The weight calculator operator can build the weight from a static attribute on the edge or build the weight by digging down through the objects connected to the edge. In the picture, above, we could create a edge-weight calculator that computes the sum of the CallDetail lengths connected to the Call edge.
I am playing around with Neo4j but trying to get my head around the graph concepts. As a learning process I want to port a small Postgres relational database schema to Neo4j. Is there any way I can port it and issues "equivalent" relational queries to Neo4j?
Yes, you can port your existing schema to a graph database. Keep in mind that this is not necessarily the best model for your data, but it is a starting point.
How easy it is depends a lot on the quality of your existing schema.
The tables corresponding to entities in an entity-relationship-diagram define your types of nodes. In the upcoming neo4j 2.0, you can labels them with the name of the entity to make a lookup easier. In older versions you can use an index or a manual label property.
Assuming a best case, where all your relationships between data is modelled using foreign keys, any 1:1 relationship between nodes can be identified and ported next.
For tables modelling n:m relationships, identify the corresponding nodes and add a direct relationship between them.
So as an example assume tables Author[id, name, publisher foreign key], Publisher[id, name] and Book[id, title] and written_by[author foreign key, book foreign key].
Every row in Author, Publisher and Book becomes a node.
Every Author node gets a relationship to the publisher identified by the foreign key relationship.
For every row in written_by you add a relationship between the Author node and Book node referenced
For queries in neo4j I recommend cypher due to its expressiveness. A (2.0) query looking for books by some author would look like:
MATCH (author:Author)-[:written_by]-(book:Book)
WHERE author.name='Hugh Laurie'
RETURN book.title
You actually have several options at hand:
use the Talend connector for Neo4J
export your schema+data in CSV files consumable by the batch importer
or you can do it programmatically
I'm afraid not. The relational data model and the graph data model are two different ways of modelling a real-world domain. It requires a human brain (at least as of 2013) to understand the domain in order to model it.
I suggest that you take a piece of paper and capture, using circles and arrows, what your entities are (nodes) and how they relate to each other (relationships). Then, have a look at that piece of paper. Voila, your new Neo4j data model.
Then, take a query that you want to be answered and try to figure out how you would do that without a computer, just by tracing your nodes and relationships with a finger on that piece of paper. Once you've figured that out, translate what you've done to a Cypher query.
Have a look at neo4j.org, there are plenty of examples.
Check this out:
The musicbrainz -> neo4j
https://github.com/redapple/sql2graph/tree/master/examples/musicbrainz
Neo4j Sql-importer
https://github.com/peterneubauer/sql-import
Good Luck!
This tool does exactly that.
Import any relational db into neo4j
https://github.com/jexp/neo4j-rdbms-import