Why do relationships as a concept exist in neo4j or graph databases in general? - neo4j

I can't seem to find any discussion on this. I had been imagining a database that was schemaless and node based and hierarchical, and one day I decided it was too common sense to not exist, so I started searching around and neo4j is about 95% of what I imagined.
What I didn't imagine was the concept of relationships. I don't understand why they are necessary. They seem to add a ton of complexity to all topics centered around graph databases, but I don't quite understand what the benefit is. Relationships seem to be almost exactly like nodes, except more limited.
To explain what I'm thinking, I was imagining starting a company, so I create myself as my first nodes:
create (u:User { name: "mindreader" });
create (c:Company { name: "mindreader Corp" });
One day I get a customer, so I put his company into my db.
create (c:Company { name: "Customer Company" });
create (u:User { name: "Customer Employee1" });
create (u:User { name: "Customer Employee2" });
I decide to link users to their companies:
match (u:User) where u.name =~ "Customer.*"
match (c:Company) where c.name =~ "Customer.*"
create (u)-[:Employee]->(c);
match (u:User) where u.name = "mindreader"
match (c:Company) where c.name =~ "mindreader.*"
create (u)-[:Employee]->(c);
Then I hire some people:
match (c:Company) where c.name =~ "mindreader.*"
create (:User { name: "Employee1" })-[:Employee]->(c)
create (:User { name: "Employee2" })-[:Employee]->(c);
One day hr says they need to know when I hired employees. Okay:
match (c:Company)<-[r:Employee]-(u:User)
where c.name =~ "mindreader.*" and u.name =~ "Employee.*"
set r.hiredate = '2013-01-01';
Then hr comes back and says hey, we need to know which person in the company recruited a new employee so that they can get a cash reward for it.
Well now what I need is for a relationship to point to a user but that isn't allowed (:Hired_By relationship between :Employee relationship and a User). We could have an extra relationship :Hired_By, but if the :Employee relationship is ever deleted, the hired_by will remain unless someone remembers to delete it.
What I could have done in neo4j was just have a
(u:User)-[:hiring_info]->(hire_info:HiringInfo)-[:hired_by]->(recruiter:User)
In which case the relationships only confer minimal information, the name.
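To make the intermediate-node idea concrete, here is a minimal in-memory sketch in Python rather than Cypher. The `Node`, `Rel`, and `Graph` classes and the `recruiter_names` helper are hypothetical names invented for illustration; they are not Neo4j's API. The hire becomes its own node, so it can carry the date and point at the recruiter.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    props: dict = field(default_factory=dict)


@dataclass
class Rel:
    start: Node
    rel_type: str
    end: Node
    props: dict = field(default_factory=dict)


class Graph:
    """Toy property graph: a flat list of typed, directed relationships."""

    def __init__(self):
        self.rels = []

    def relate(self, start, rel_type, end, **props):
        rel = Rel(start, rel_type, end, props)
        self.rels.append(rel)
        return rel

    def out(self, node, rel_type):
        # Follow outgoing relationships of one type from a node.
        return [r.end for r in self.rels
                if r.start is node and r.rel_type == rel_type]


g = Graph()
employee = Node("User", {"name": "Employee1"})
recruiter = Node("User", {"name": "mindreader"})
# The hire itself is a node, so it can hold properties AND point at a user.
hire = Node("HiringInfo", {"hire_date": "2013-01-01"})
g.relate(employee, "hiring_info", hire)
g.relate(hire, "hired_by", recruiter)


def recruiter_names(graph, user):
    # Two hops: user -> hiring info -> recruiter.
    return [r.props["name"]
            for h in graph.out(user, "hiring_info")
            for r in graph.out(h, "hired_by")]


print(recruiter_names(g, employee))
```

In this sketch the relationships carry only their type, which matches the observation above that they end up conferring little more than a name.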
What I originally envisioned was that there would be nodes, and then each property of a node could be a datatype or it could be a pointer to another node. In my case, a user record would end up looking like:
User {
name: "Employee1"
hiring_info: {
hire_date: "2013-01-01"
hired_by: u:User # -> would point to a user
}
}
Essentially it is still a graph. Nodes point to each other. The name of the relationship is just a field in the origin node. To query it you would just go
match (u:User) where ... return u.name, u.hiring_info.hire_date, u.hiring_info.hired_by.name
If you needed a one to many relationship of the same type, you would just have a collection of pointers to nodes. If you referenced a collection in return, you'd get essentially a join. If you delete hiring_info, it would delete the pointer. References to other nodes would not have to be a disorganized list at the toplevel of a node. Furthermore when I query each user I will know all of the info about a user without both querying for the user itself and also all of its relationships. I would know his name and the fact that he hired someone in the same query. From the database backend, I'm not sure much would change.
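The model described above can be sketched in Python, where a property is either a plain value or a pointer to another object. The class and function names here (`User`, `HiringInfo`, `hiring_row`, `recruits_of`) are assumptions made for illustration, not part of any database.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HiringInfo:
    hire_date: str
    hired_by: Optional["User"] = None  # pointer to another node


@dataclass
class User:
    name: str
    hiring_info: Optional[HiringInfo] = None


founder = User("mindreader")
employee1 = User("Employee1", HiringInfo("2013-01-01", hired_by=founder))


def hiring_row(user):
    # Mirrors: return u.name, u.hiring_info.hire_date, u.hiring_info.hired_by.name
    info = user.hiring_info
    if info is None:
        return (user.name, None, None)
    recruiter = info.hired_by.name if info.hired_by else None
    return (user.name, info.hire_date, recruiter)


def recruits_of(all_users, recruiter):
    # The pointer lives only on the origin node, so the reverse question
    # ("whom did this person recruit?") scans every user in this sketch.
    return [u.name for u in all_users
            if u.hiring_info and u.hiring_info.hired_by is recruiter]


print(hiring_row(employee1))
print(recruits_of([founder, employee1], founder))
```

Setting `employee1.hiring_info = None` drops the date and the pointer together, which is the cascading behavior asked for. Note that in this sketch the pointer is one-directional, so following it backwards takes a scan.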
I see quite a few questions from people asking whether they should use nodes or relationships to model this or that, and occasionally people asking for a relationship between relationships. It feels like the XML problem where you are wondering if a piece of information should be its own tag or just a property of its parent tag.
The query engine goes to great pains to handle relationships, so there must be some huge advantage to having them, but I can't quite see it.

Different databases are for different things. You seem to be looking for a noSQL database.
This is an extremely wide topic area that you've reached into, so I'll give you the short of it. There's a spectrum of database schemas, each of which has different use cases.
NoSQL aka Non-relational Databases:
Every object is a single document. You can have references to other documents, but any additional traversal means you're making another query. Use this when your data rarely has relationships between records, and you usually just want to query once and get back a large amount of flexibly stored data as the returned document. (Note: These are not "nodes". "Node" has a very specific definition and implies that there are edges.)
SQL aka Relational Databases:
This is table land, this is where foreign keys and one-to-many relationships come into play. Here you have strict schemas and very fast queries. This is honestly what you should use for your user example. Small amounts of data where the relationships between things are shallow (You don't have to follow a relationship more than 1-2 times to get to the relevant entry) are where these excel.
Graph Database:
Use this when relationships are key to what you're trying to do. The most common example of a graph is something like a social graph, where you're connecting different users together and need to follow relationships for many steps (figuring out whether two people are connected within a depth of 4, for instance).
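The "connected within a depth of 4" check can be sketched as a bounded breadth-first search. This is plain Python over a dictionary, not how any particular graph database implements it; the `friends` data and the `connected_within` name are made up for illustration.

```python
from collections import deque


def connected_within(friends, start, target, max_depth):
    """Return True if target is reachable from start in at most max_depth hops."""
    if start == target:
        return True
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        person, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand past the depth limit
        for other in friends.get(person, ()):
            if other == target:
                return True
            if other not in seen:
                seen.add(other)
                queue.append((other, depth + 1))
    return False


# A hypothetical chain of friendships: ann - bob - cid - dee - eve - fay
friends = {
    "ann": ["bob"],
    "bob": ["ann", "cid"],
    "cid": ["bob", "dee"],
    "dee": ["cid", "eve"],
    "eve": ["dee", "fay"],
    "fay": ["eve"],
}

print(connected_within(friends, "ann", "eve", 4))
print(connected_within(friends, "ann", "fay", 4))
```

Each hop here is a direct lookup of a person's neighbors, which is the kind of step a graph store is built to make cheap.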
Relationships exist in graph databases because that is the entire concept of a graph database. It doesn't really fit your application, but to be fair you could just keep more in the node part of your database. In general the whole idea of a database is something that lets you query a LOT of data very quickly. Depending on the intrinsic structure of your data there are different ways that that makes sense. Hence the different kinds of databases.
In strongly connected graphs, Neo4j is 1000x faster on 1000x the data than a SQL database. NoSQL would probably never be able to perform in a strongly connected graph scenario.

Take a look at what we're building right now: http://vimeo.com/81206025
Update: In reaction to mindreader's comment, we added the related properties to the picture:

RDBM systems are tabular and put more information in the tables than the relationships. Graph databases put more information in relationships. In the end, you can accomplish much the same goals.
However, putting more information in relationships can make queries smaller and faster.
Here's an example:
Graph databases are also good at storing human-readable knowledge representations, being edge (relationship) centric. RDF takes it one step further, where all information is stored as edges rather than nodes. This is ideal for working with predicate logic, propositional calculus, and triples.

Maybe the right answer is an object database.
Objectivity/DB, which now supports a full suite of graph database capabilities, allows you to design complex schema with one-to-one, one-to-many, many-to-one, and many-to-many reference attributes. It has the semantics to view objects as graph nodes and edges. An edge can be just the reference attribute from one node to another or an edge can exist as an edge object that sits between two nodes.
An edge object can have any number of attributes and can have references off to other objects, as shown in the diagram below.
Being able to "hang" complex objects off of an edge allows Objectivity/DB to support weighted queries where the edge-weight can be calculated using a user-defined weight calculator operator. The weight calculator operator can build the weight from a static attribute on the edge or build the weight by digging down through the objects connected to the edge. In the picture above, we could create an edge-weight calculator that computes the sum of the CallDetail lengths connected to the Call edge.

Related

Neo4j design choice: relationships vs nodes

I'm dealing with the following situation: many trips exist between many cities. Both have various properties. E.g the cities have a name and an amount of trips that passed them, whereas trips have a distance and time.
What is 'best practice' in Neo4j?
a) Add all cities and trips as nodes, and connect the trips to the start and end nodes by means of 'STARTED_AT' and 'ENDS_IN' relations.
or
b) Add only cities as a node, and represent each of the trips as a relation between 2 nodes. This means there are many of the same relations between nodes, where the only difference is that they have other properties.
Information that might be useful: we only need to do all kinds of queries. No insertion needed.
Thanks!
I would argue it's better to store trips as nodes, because relationship properties cannot be indexed (this was true of Neo4j at the time; newer releases add relationship property indexes), and it will be slow to do more complex queries (like finding the shortest route by time). So if you are searching for trips by ID or something, you will need to store them as nodes.
On the other hand, an argument can be made for using relationships, because then you can take full advantage of APOC's weighted graph search functions.
A good way to decide whether something should be a node or a relationship is to ask yourself "are there any other relations that would make sense here?" If you are talking about whether two cities are connected, a relationship makes more sense because they either are or are not. If you are talking about road trips, though, the trip can pass through multiple cities, can have participants (or groups thereof), and can have an owner. In that case, for future flexibility, nodes will be much easier to maintain.
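As a rough illustration of the weighted search mentioned above, here is Dijkstra's algorithm over trips modeled as weighted relationships between cities (option b). It is a plain Python sketch, not APOC; the city names, durations, and the `fastest_route` function are invented for the example.

```python
import heapq


def fastest_route(trips, start, goal):
    """Dijkstra over trips given as (from_city, to_city, minutes) tuples."""
    adjacency = {}
    for src, dst, minutes in trips:
        # Several trips between the same two cities are simply parallel edges.
        adjacency.setdefault(src, []).append((dst, minutes))

    best = {start: 0}
    heap = [(0, start, [start])]
    while heap:
        cost, city, path = heapq.heappop(heap)
        if city == goal:
            return cost, path
        if cost > best.get(city, float("inf")):
            continue  # stale queue entry
        for nxt, minutes in adjacency.get(city, []):
            new_cost = cost + minutes
            if new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                heapq.heappush(heap, (new_cost, nxt, path + [nxt]))
    return None  # goal not reachable


# Hypothetical data: two parallel trips Ghent -> Brussels with different times.
trips = [
    ("Ghent", "Brussels", 40),
    ("Ghent", "Brussels", 55),
    ("Brussels", "Liege", 60),
    ("Ghent", "Antwerp", 45),
    ("Antwerp", "Liege", 80),
]

print(fastest_route(trips, "Ghent", "Liege"))
```

With trips as relationships the weight sits directly on the edge being traversed. With trips as nodes, each hop would pass through a trip node first, which costs an extra step but leaves room to attach participants or owners later.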
I would say it really depends on how you model these trips. Let's assume we can generalize this as (city)-[trip]->(city). Notice that Neo4j's relationships always have a direction, so we can go on adding an unlimited number of trips between cities without having to redefine each city for each trip. This actually answers (a), by the way: we don't need to define where a trip starts and ends, because the relationship does all that work for you.
'This means there are many of the same relations between nodes' <<- on this note, if you need to differ each trip based on the time the trip was taken you can add the date/timestamp in the relationship property or you can go with a time tree (see Mark Needham's Article on that here and Graphgrid's take)
Hope this helps.

Collapse Relationships Neo4j?

Is it possible to "collapse" relationships in neo4j? I'm trying to graph relationships between people, and they can be related in multiple different ways - a shared course, jointly authored paper, RT or tweet mention. Right now I'm modeling people, courses, papers, and tweets all as nodes. But what I'm really interested in is modeling the person-person relationships that go through these intermediary nodes. Is it possible to graph the implicit relationship (person-course-person) explicit (person-person), while still keeping the course as a node? Something like this http://catalhoyuk.stanford.edu/network/teams/ - slide 2 and 3.
Any other data modeling suggestions welcome as well.
Yes, you can do it. The query
MATCH(a:Person)-->(:Course)<--(b:Person)
CREATE (a)-[:IMPLICIT_RELATIONSHIP]->(b)
will create a relationship of type :IMPLICIT_RELATIONSHIP between all people that are related to the same course. But you probably don't need it, since you can traverse from a to b and from b to a without this extra, unnecessary relationship. Also, if you want a virtual relationship at query time to use in a projection, you can use the APOC procedure apoc.create.vRelationship.
The APOC procedures docs say:
Virtual Nodes and Relationships don’t exist in the graph, they are
only returned to the UI/user for representing a graph projection. They
can be visualized or processed otherwise. Please note that they have
negative id’s.
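The idea of collapsing person-course-person into person-person can be sketched as a projection computed outside the stored graph. This Python example is an illustration only; the `collapse` function and the sample memberships are assumptions, and it does not use APOC or Cypher.

```python
from itertools import combinations


def collapse(memberships):
    """Project (person, shared_thing) pairs onto person-person links.

    Returns a dict mapping a sorted (person_a, person_b) pair to the set of
    intermediary things (courses, papers, ...) that connect the two people.
    """
    by_thing = {}
    for person, thing in memberships:
        by_thing.setdefault(thing, set()).add(person)

    implicit = {}
    for thing, people in by_thing.items():
        for a, b in combinations(sorted(people), 2):
            implicit.setdefault((a, b), set()).add(thing)
    return implicit


# Hypothetical sample data.
memberships = [
    ("alice", "course:CS101"),
    ("bob", "course:CS101"),
    ("bob", "paper:P1"),
    ("carol", "paper:P1"),
    ("alice", "paper:P1"),
]

for pair, via in sorted(collapse(memberships).items()):
    print(pair, sorted(via))
```

The intermediary nodes stay in the source data, and the person-person view is derived on demand, much as a virtual relationship exists only in the returned projection.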

Neo4j graph modelling performance and querability, property to a node or as separate node plus relationship

I am teaching myself graph modelling and use Neo4j 2.2.3 database with NodeJs and Express framework.
I have skimmed through the free neo4j graph database book and learned how to model a scenario, when to use relationship and when to create nodes, etc.
I have modelled a vehicle selling scenario, with following structure
NODES
(:VEHICLE{mileage:xxx, manufacture_year: xxxx, price: xxxx})
(:VFUEL_TYPE{type:xxxx}) x 2 (one for diesel and one for petrol)
(:VCOLOR{color:xxxx}) x 8 (red, green, blue, .... yellow)
(:VGEARBOX{type:xxx}) x 2 (AUTO, MANUAL)
RELATIONSHIPS
(vehicleNode)-[:VHAVE_COLOR]->(colorNode - either of the colors)
(vehicleNode)-[:VGEARBOX_IS]->(gearboxNode - either manual or auto)
(vehicleNode)-[:VCONSUMES_FUEL_TYPE]->(fuelNode - either diesel or petrol)
Assuming we have the above structure and so on for the rest of the features.
As shown in the above screenshot (136 & 137 are VEHICLE nodes), the majority of a vehicle's features are created as separate nodes and shared, via relationships, among vehicles that have the feature in common.
Could you please advise whether roles (labels) like color, body type, driving side (left drive or right drive), gearbox and others should be separate nodes or properties of the vehicle node? Which option is more performance friendly, and easy to query?
I want to write JS code that allows querying the graph with the above structure by one or many search criteria. If the majority of those features are properties of the VEHICLE node, then querying would not be difficult:
MATCH (v:VEHICLE) WHERE v.gearbox = "MANUAL" AND v.fuel_type = "PETROL" AND v.price > x AND v.price < y AND .... RETURN v;
However, with my existing graph model it is tricky to search, especially when there are multiple criteria that are not properties of the VEHICLE node but separate nodes linked via relationships.
Any ideas and advice regarding the existing structure of the graph, to make it more queryable as well as performance friendly, would be much appreciated. If we imagine a scenario with 1000 VEHICLE nodes, that would generate 15000 relationships, which sounds a bit scary, and if it hits a million VEHICLE nodes then up to 15 million relationships. Please comment if I am heading in the wrong direction.
Thank you for your time.
Modeling is full of tradeoffs, it looks like you have a decent start.
Don't be concerned at all with the number of relationships. That's what graph databases are good at, so I wouldn't be too concerned about over-using them.
Should something be a property, or a node? I can't answer for your scenario, but here are some things to consider:
If you look something up by a value all the time, and you have many objects, it's usually going to be faster to find one node and then everything connected to it, because graph DBs are good at exploiting relationships. It's less fast to scan all nodes of a label and find the items where a property=a value.
Relationships work well when you want to express a connection to something that isn't a simple primitive data type. For example, take "gearbox". There's manuals, and other types...if it's a property value, you won't later easily be able to decide to store 4 other sub-types/sub-aspects of "gearbox". If it were a node, that would later be easy because you could add more properties to the node, or relate other things.
If a piece of data really is a primitive (String, integer, etc) and you don't need extra detail about it, that usually makes a good property. Querying primitive values by connecting to other nodes will seem clunky later on. For example, I wouldn't model a person with a "date of birth" as a separate node, that would be irritating to query, and would give you flexibility you'd be very unlikely to need in the future.
Semantically, how is your data related? If two items are similar because they share an X, then that X probably should be a node. If two items happen to have the same Y value but that doesn't really mean much, then Y is probably better off as a node property.
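The trade-off in these considerations can be sketched in Python: one lookup scans every vehicle and tests its properties, the other starts from shared feature nodes and intersects the vehicles attached to them. The data is hypothetical (vehicle 138 and all values are invented), as are the function names, and this says nothing about Neo4j's actual query planner.

```python
# Hypothetical vehicles keyed by node id; features stored as plain properties.
vehicles = {
    136: {"gearbox": "MANUAL", "fuel_type": "PETROL", "price": 9000},
    137: {"gearbox": "AUTO", "fuel_type": "DIESEL", "price": 15000},
    138: {"gearbox": "MANUAL", "fuel_type": "PETROL", "price": 21000},
}


def by_property_scan(vehicles, gearbox, fuel_type, min_price, max_price):
    # Property model: look at every vehicle and test each property.
    return sorted(
        vid for vid, v in vehicles.items()
        if v["gearbox"] == gearbox
        and v["fuel_type"] == fuel_type
        and min_price < v["price"] < max_price
    )


def build_feature_nodes(vehicles, keys):
    # Node model: one shared node per feature value, linked to its vehicles.
    feature_nodes = {}
    for vid, v in vehicles.items():
        for key in keys:
            feature_nodes.setdefault((key, v[key]), set()).add(vid)
    return feature_nodes


def by_shared_nodes(vehicles, feature_nodes, gearbox, fuel_type,
                    min_price, max_price):
    # Start at the feature nodes and follow relationships back to vehicles;
    # price stays a primitive property and is checked last.
    candidates = (feature_nodes.get(("gearbox", gearbox), set())
                  & feature_nodes.get(("fuel_type", fuel_type), set()))
    return sorted(vid for vid in candidates
                  if min_price < vehicles[vid]["price"] < max_price)


feature_nodes = build_feature_nodes(vehicles, ["gearbox", "fuel_type"])
print(by_property_scan(vehicles, "MANUAL", "PETROL", 5000, 10000))
print(by_shared_nodes(vehicles, feature_nodes, "MANUAL", "PETROL", 5000, 10000))
```

Both approaches return the same vehicles. The difference is where the work starts: the whole label, or a handful of shared nodes. Price is left as a property in both, in line with the advice on primitives.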

How to partially isolate a subgraph without using labels in neo4j

I'm creating a graph that contains a large number of subgraphs of roughly treelike structure in that the 'root' of each subgraph only has outwardly directed relationships. The many leaves and branches of this subgraph all contain data related to the root. This is so that a single query like the following will return all data associated with a given root, and only the data associated with that root:
MATCH (root:ROOT {id: 'foo'})-[*]->(leaves) RETURN leaves
There are very strong reasons to optimize for this query. However, the subgraphs are not truly isolated, because some of the leaves are actually categories that can receive relationships from many roots, so structures like this exist:
(root)-[]->(category)<-[]-(root)
This seems like a great way to preserve the integrity of the subgraphs while also allowing for complex relationships between them, however, there's one catch. I can't have simple, one-to-one relationships directly between roots, or one root will contaminate the other's response to the first query. As I see it, there are only two real options.
Build a new dummy node for each 1-to-1 relationship between roots. Like so:
(root)-[]->(dummy)<-[]-(root)
I hate this option. It proliferates useless nodes and it dilutes the concept of relationships.
Give every child of each subgraph a label identifying it as a member of the subgraph. This is an even worse option as I see it. Since the subgraphs number in the many thousands it would dramatically pollute the label space.
I've also considered filtering on the label of a direct relationship, but that only excludes the foreign root, and not its children. See below:
Filter on the label of direct 1-to-1 relationships with a structure like this:
(root)-[:bar]->(foreign_root)-[]->(foreign_leaves)
And a primary query like this:
MATCH (root {id: 'foo'})-[*]->(leaves) WHERE NOT (root)-[:bar]->(leaves) RETURN leaves
Produces a result of (foreign_leaves) This is undesirable for multiple reasons, since it makes the most important query larger, and doesn't actually isolate the graph.
So, in one sense I am asking, is there a way to create a direct, 1-to-1 relationship between two of these roots without massive graph pollution or cross-contamination between subgraphs? In a larger sense, am I viewing the problem wrongly?
I think you are almost there. In your last Cypher query, you can tweak your WHERE clause so that it does not instantiate the :bar relationship's destination node. Like this:
MATCH (root {id: 'foo'})-[*]->(leaves)
WHERE NOT (root)-[:bar]->()
RETURN leaves
This way, you filter out all paths that start with a :bar relationship.
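The stated goal, dropping every path whose first relationship is of type :bar, can be sketched as a traversal in Python. This is an illustration of the intent over a made-up edge list, not a translation of the Cypher above; `reachable_excluding` and the sample edges are hypothetical.

```python
def reachable_excluding(edges, root, excluded_first_type):
    """Collect nodes reachable from root, ignoring paths whose FIRST hop
    uses the excluded relationship type.

    edges: list of (start, rel_type, end) tuples, all directed.
    """
    out = {}
    for start, rel_type, end in edges:
        out.setdefault(start, []).append((rel_type, end))

    # Only the first hop is filtered; deeper hops may use any type.
    frontier = [end for rel_type, end in out.get(root, [])
                if rel_type != excluded_first_type]
    seen = set()
    while frontier:
        node = frontier.pop()
        if node in seen:
            continue
        seen.add(node)
        frontier.extend(end for _, end in out.get(node, []))
    return seen


# Hypothetical graph: two roots sharing a category, plus a direct :bar link.
edges = [
    ("foo", "has", "leaf1"),
    ("leaf1", "has", "leaf2"),
    ("foo", "in", "category"),
    ("other", "in", "category"),
    ("foo", "bar", "other"),
    ("other", "has", "foreign_leaf"),
]

print(sorted(reachable_excluding(edges, "foo", "bar")))
```

In this sketch the shared category is still returned, because it is reached through a non-:bar first hop, while the foreign root and its leaves are not.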

Neo4j, Which is better: multiple relationships or one with a property?

I'm new to neo4j, and I'm building a social network. For the sake of this question, my graph consists of user and event nodes with relationship(s) between them.
A user may be invited, join, attend or host an event, and each is a subset of the one before it.
Is there any benefit to / should I create multiple relationships for each status/state, or one relationship with a property to store the current state?
Graph-type queries are more easily/efficiently done on relationship types than properties, from what I understand.
How about one relationship, but a different relationship type?
You can query on several types of relationships with pipes using Cypher (in case you have other relationships to the event that you don't want to pick up in queries).
Update--adding console example: http://console.neo4j.org/?id=woe684
Alternatively, you can just leave the old relationships there and not have to build the slightly more complicated queries, but that feels a bit wasteful for this use case.
When possible, choosing different relationship types over a single type qualified by properties can have a significant positive performance impact when querying the graph. The former approach is always at least 2x faster than the latter. When data is in high-level cache and the graph is queried using the native Java API, the first approach is more than 8x faster for single-hop traversals.
Source: http://graphaware.com/neo4j/2013/10/24/neo4j-qualifying-relationships.html
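One way to picture why relationship types can be cheaper to query than a property is the toy comparison below. It is only a mental model in Python, not a description of Neo4j's storage engine, and the class names and relationship type names are assumptions.

```python
from collections import defaultdict


class TypedGraph:
    """Relationships grouped by type: a lookup jumps straight to one group."""

    def __init__(self):
        self._out = defaultdict(lambda: defaultdict(list))

    def relate(self, user, rel_type, event):
        self._out[user][rel_type].append(event)

    def events(self, user, *rel_types):
        # Several types at once mirrors Cypher's pipe syntax, e.g. [:A|B].
        return [e for t in rel_types for e in self._out[user].get(t, [])]


class PropertyGraph:
    """One relationship type; the state is a property checked per relationship."""

    def __init__(self):
        self._out = defaultdict(list)

    def relate(self, user, event, status):
        self._out[user].append((event, {"status": status}))

    def events(self, user, *statuses):
        # Every relationship of the user is visited and its property tested.
        return [e for e, props in self._out[user]
                if props["status"] in statuses]


typed = TypedGraph()
prop = PropertyGraph()
sample = [("party", "INVITED"), ("meetup", "ATTENDED"), ("launch", "HOSTED")]
for event, state in sample:
    typed.relate("user1", state, event)
    prop.relate("user1", event, state)

print(sorted(typed.events("user1", "ATTENDED", "HOSTED")))
print(sorted(prop.events("user1", "ATTENDED", "HOSTED")))
```

Both versions give the same answer. In the typed version irrelevant relationships are never touched, whereas the property version inspects each one.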
