Better way to model the RATED relationship in a Neo4j movie graph database

I want to know which is the better approach to modeling the [:RATED] relationship in a movie database in Neo4j. I can think of the following two approaches:
Approach 1 feels more straightforward and somehow academically more correct as a design.
However, approach 1 requires n (:Movie) nodes. One might say that approach 2 looks more natural, as the graph contains only one (:Movie) node for a particular movie ("The Matrix" in this case), and that node can exist regardless of whether anyone rates it. However, I am less comfortable storing rating values on the [:RATED] relationship. Is that acceptable from a pure design perspective?
Also, what if we are dealing with a node which does not represent an entity? For example, picture a bunch of cars replacing the users in the above image and an accident replacing "The Matrix". In this case the (:Accident) node may not exist by default, and is only created when an accident occurs. Accidents faced by two different cars are also different instances of (:Accident) and have many attributes associated with them, like time, place, etc. Here it makes more design sense to create a separate (:Accident) node for each car whenever it encounters an accident, and to store the properties on that node, instead of having a single (:Accident) node with the properties stored on relationships pointing from (:Car) to (:Accident). But that will create a lot of (:Accident) nodes. What is the best approach for this scenario from a design perspective and a performance perspective?
Summarizing:
Is approach 2 perfectly fine from a design perspective? (Especially storing properties on relationships which might have been stored on nodes instead.)
What are the possible design and performance drawbacks of approach 2?

In general, whatever approach you choose to use should fit your use cases and queries.
Given your example, approach 2, using one Matrix :Movie node, is a perfectly fine design for the use case of tracking movie ratings. This is the same approach used in the Movie graph you can load up in Neo4j. Try that out, and note that the graph would be chaotic and difficult to query if there were a separate :Movie node for every single relationship to a :Movie.
You'll note that in approach 1, there is absolutely nothing different between each of the Matrix :Movie nodes. That's a strong indicator that you should be modeling the thing as a single node instead of multiple. It's also more difficult to query if you're using multiple nodes for the same thing, as the database can no longer use a single node as a starting point for the movie to get data based on relationships from it. Your queries about the movie itself also become slightly more complicated, in that you will need to add LIMIT 1 when matching to the movie by name, otherwise the query will match to all the multiple Matrix movies, which could be in the thousands or more depending on how many ratings there are.
Even though some of the other queries you might use for this model are going to use similar Cypher, or even the same Cypher queries, you will be needlessly impacting db operations through this data model. Consider an average rating query. With a single Matrix :Movie node, it's a matter of matching on the single :Movie node (by indexed or unique name), then taking the average of all its relationships. With multiple Matrix :Movie nodes, your match will match on thousands (or more) redundant nodes, and for all of those nodes it will need to pull those relationships and average them together. That's a ton of db hits you didn't need to do.
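As a rough sketch of the single-node approach, the Cypher below records a rating on the relationship and then computes the average. The :User label, the name properties, and the rating property are assumed names for illustration, not taken from the question's diagram.

```cypher
// Record one user's rating on the relationship itself.
// MERGE keeps a single :Movie node per name instead of creating a copy per rating.
MERGE (u:User {name: "Alice"})
MERGE (m:Movie {name: "The Matrix"})
MERGE (u)-[r:RATED]->(m)
SET r.rating = 5;

// Average rating: start from the single :Movie node and aggregate its
// incoming :RATED relationships.
MATCH (m:Movie {name: "The Matrix"})<-[r:RATED]-(:User)
RETURN m.name AS movie, avg(r.rating) AS averageRating, count(r) AS ratingCount;
```

With an index or uniqueness constraint on the movie name, the second query can begin at one node rather than scanning many duplicates.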
Also, keep in mind the difficulty of using this approach when combining this for other use cases. For example, consider if we had to change your data model to include actors and directors, similar to the movie db you can import in neo4j. If we had multiple nodes for every single rating for every single movie, which node would we use when creating relationships between actors and directors and the movie they worked in? With that kind of data model, there are no good choices for modeling this kind of data efficiently or clearly.
Considering your second case, it makes sense to make a new :Accident node for each accident, with the details of the accident in each node. If two or more cars in your db are involved in the same accident, then it makes sense to use the same :Accident node to represent the accident, and to make relationships from the multiple cars to the same accident they were involved in. That saves you from duplicating data about the same accident instance, and clearly models the participants in the accident, along with any other related data that is associated with the accident. You could always store accident data specific to the car in question on the relationship between the car and the accident, such as the damage sustained, and whether the driver of the car was found at fault.
It should be clear in this data model that there should be separate :Accident nodes (unless, as mentioned, it's the same accident for multiple cars), as the data between accidents will differ, and requires you to capture them in separate nodes. This is far different than your movie data model, where it does not make sense to use multiple :Movie nodes for the same movie, since the data is all the same.
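A minimal sketch of that accident model might look like the following. The INVOLVED_IN relationship type and the id, plate, time, place, damage, and atFault properties are all assumed names chosen for illustration.

```cypher
// One :Accident node per accident, holding the details shared by everyone involved.
MERGE (a:Accident {id: "acc-001"})
SET a.time = datetime("2020-01-15T08:30:00"), a.place = "Main St and 5th Ave"

// Two cars involved in the same accident point at the same node.
MERGE (c1:Car {plate: "ABC123"})
MERGE (c2:Car {plate: "XYZ789"})

// Car-specific details live on each car's own relationship to the accident.
MERGE (c1)-[r1:INVOLVED_IN]->(a)
SET r1.damage = "rear bumper", r1.atFault = false
MERGE (c2)-[r2:INVOLVED_IN]->(a)
SET r2.damage = "front hood", r2.atFault = true;
```

A single-car accident would use the same shape with only one :Car attached.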
As for storing data in relationships, again that depends upon your data model, and what makes the most sense. For ratings, storing the rating on the relationship to the movie looks fine to me.
There are cases where you may consider creating intermediary nodes to store data on a node instead of a relationship. Consider an employment graph, with :Person and :Company nodes. You could model this simply with :WORKS_AT relationships between nodes, but you would need to store data about the employment on the relationship, such as hireDate, salary, jobTitle, etc. That might be fine...but you could always extract that into its own node, an :Employment node between a :Person and a :Company to hold that data. That could let us index those properties, making it easier to query :Persons for a :Company in order of hireDate, for example. That wouldn't be as efficient if the data was on the relationships, as relationship properties could not be indexed when this was written (relationship property indexes were only added in Neo4j 4.3).
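To illustrate the intermediary-node idea, here is a hedged sketch. The HAS_EMPLOYMENT and AT relationship types, the name properties, and the sample values are assumptions made up for the example; only :Person, :Company, :Employment, hireDate, salary, and jobTitle come from the text above.

```cypher
// Move the employment details off a :WORKS_AT relationship and onto a node.
MERGE (p:Person {name: "Alice"})
MERGE (c:Company {name: "Acme"})
CREATE (p)-[:HAS_EMPLOYMENT]->(e:Employment {
  hireDate: date("2015-06-01"), salary: 85000, jobTitle: "Engineer"
})-[:AT]->(c);

// People at a company in order of hire date; the sort key is now a node
// property, so it can be backed by an ordinary node index.
MATCH (p:Person)-[:HAS_EMPLOYMENT]->(e:Employment)-[:AT]->(:Company {name: "Acme"})
RETURN p.name AS person, e.jobTitle AS jobTitle, e.hireDate AS hireDate
ORDER BY hireDate;
```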
EDIT
Concerning cardinality of nodes, when to use a single node instance vs multiple node instances, again, that's usually best answered as you answer the questions of "does this make logical sense for this data model" and "is this easy and efficient to query this data?"
The two cases you presented, for Matrix :Movie nodes and :Accident nodes, each demonstrate opposite cases for this.
A single Matrix :Movie node makes sense. I think it would be a stretch to find use cases which require multiple copies of the Matrix node.
However, if you had to model movie showings of The Matrix, then that might call for a :Showing node, of which there would be several (per time and per theater), but all of them referencing the same Matrix :Movie node. It's the same movie, but it has multiple showings.
For :Accidents, it makes sense to use multiple :Accident nodes, each one representing a particular instance of an accident. In many cases there will be only one :Car associated with a single :Accident node, a driver crashing into something without involving other drivers. In other cases, when it's a multi-car collision, then several cars are involved in the same :Accident, so you would have the :Accident node with the time and location and details, and relationships with the :Cars involved in that particular accident.
While it's possible to use a single :Accident node for ALL accidents, and have the details on the relationships, you'll quickly encounter problems with some of the likely queries you'll need to make. For example, how do you know which accidents were multi-car accidents, and which cars were involved? We would have to examine all relationships to the single :Accident node, and even then we'd have to do extra logic to figure out the associations. What if we wanted to order :Accidents by date? Without indexes on relationship properties (unavailable before Neo4j 4.3), we again have to touch all relationships, inspect their properties, and sort them all. What if we wanted to indicate location based on the closest city to the accident, for fast lookup of accidents in certain cities? Again, without relationship property indexes there is no fast lookup. And if we already have :City nodes, we can't create a relationship between the relevant :City node and the crash relationship; you need a node for that.
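With one :Accident node per accident, those queries become short. The sketch below assumes an INVOLVED_IN relationship from :Car to :Accident, a NEAR relationship from :Accident to :City, and id, time, plate, and name properties; none of these names come from the question.

```cypher
// Multi-car accidents and the cars involved, ordered by when they happened.
MATCH (c:Car)-[:INVOLVED_IN]->(a:Accident)
WITH a, collect(c.plate) AS cars
WHERE size(cars) > 1
RETURN a.id AS accident, a.time AS occurredAt, cars
ORDER BY occurredAt;

// Accidents near a given city, newest first. The accident is a node,
// so it can be related directly to an existing :City node.
MATCH (a:Accident)-[:NEAR]->(:City {name: "Springfield"})
RETURN a.id AS accident, a.time AS occurredAt
ORDER BY occurredAt DESC;
```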
I could list more cases, but it's fairly clear that multiple :Accident nodes are needed per accident (again, sharing the node for :Cars involved in the same :Accident).
This is one of those cases where even if you missed it when thinking about if the data model makes sense, consideration about the kind of queries you want to make, and their efficiency, should push you toward a better means of modeling your data...in this case, using multiple :Accident nodes.

Related

Developing graph database model for department/supplier/items

I'm currently ramping up on graph databases and to do that am working through a set of questions to learn Cypher. However, I'm not 100% happy with the design I've chosen since I have to match relationships to nodes to make some of the queries work.
I found Neo4j: Suggestions for ways to model a graph with shared nodes but has a unique path based on some property with some suggestions that are relevant, but they involve copying nodes (repeating them) when in fact they do represent the same thing. That seems like an update issue waiting to happen.
My design currently has
(:Dept {name,floor})-[:SOLD {quantity}]->(:Item {name,type})<-[:SUPPLIES {dept,volume}]-(:Company {name,address})
As you can see, to figure out which department a company supplied an item to, I have to check the :SUPPLIES dept property. This leads to somewhat awkward queries - it feels that way to me, anyway.
I've tried other relationships, like having (:Company)-[:SUPPLIES {item,vol}]->(:Dept) but then the problem just shifts to matching :SUPPLIES relationship properties to :Item nodes.
The types of queries I am building are of the nature: Find departments that sell all of the items they are supplied.
Is there some other way to model this that I am overlooking? Or is this sort of relationship, where a supplier is related to two things, an item and a department, just something that doesn't fit the graph model very well?
You want to store and query a triangular relationship between :Dept, :Item, and :Company. This can't be accomplished by a linear relationship pattern. Comparing IDs of entities is not the Neo4j way; you would be neglecting the strengths of a graph database.
(Assuming that I understood your use case scenario) I would introduce an additional node of type :SupplyEvent that has relationships to :Dept, :Item, and :Company. You could also split up :SOLD relationship in a similar way, if you want relations between department, item, and, e.g., a customer.
Now, you can query all companies that supplied which items to which departments (without comparing any IDs):
MATCH (company:Company)<-[:SUPPLIED_FROM]-(se:SupplyEvent)-[:SUPPLIED_TO]->(dept:Dept),
(se)-[:SUPPLIED]->(item:Item)
RETURN company, item, dept
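For completeness, here is one way the :SupplyEvent data could be created, followed by a possible take on the "departments that sell all of the items they are supplied" query from the question. The sample names and the volume property on :SupplyEvent are assumptions; the relationship types follow the query above and the :SOLD relationship from the question.

```cypher
// One :SupplyEvent ties together the company, the department, and the item.
MERGE (c:Company {name: "Acme"})
MERGE (d:Dept {name: "Toys"})
MERGE (i:Item {name: "Kite"})
CREATE (se:SupplyEvent {volume: 100})
CREATE (se)-[:SUPPLIED_FROM]->(c)
CREATE (se)-[:SUPPLIED_TO]->(d)
CREATE (se)-[:SUPPLIED]->(i);

// Departments that sell every item they have been supplied:
// compare the number of supplied items with how many of them are sold.
MATCH (se:SupplyEvent)-[:SUPPLIED_TO]->(dept:Dept),
      (se)-[:SUPPLIED]->(item:Item)
WITH dept, collect(DISTINCT item) AS supplied
MATCH (dept)-[:SOLD]->(sold:Item)
WHERE sold IN supplied
WITH dept, supplied, count(DISTINCT sold) AS soldCount
WHERE soldCount = size(supplied)
RETURN dept.name AS department;
```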

Node based properties on a relationship

I'm starting out with Neo4J to create a graph of users and their relationships. At the moment there is a single 'KNOWS' relationship between users i.e.
What I want to do now is specify properties on the relationship specifically for each of the users. For example, "interest" which indicates how much a user is interested in the other user. Can I specify this for each user on a single KNOWS relationship or would I need to create two relationships between the users and set the attribute on each of the relationships?
Any help would be appreciated.
Can I specify this (property: interest) for each user on a single KNOWS relationship or would I need to create two relationships between the users and set the attribute on each of the relationships?
You will need two relationships.
You could do it with one but then you have to keep two properties in the relationship and information about which property goes with which node. Much easier with two relationships.
From comment:
Can I keep them as bi-directional or would I need to use directional
in this case?
Relationships are always directional. It is only when you query that the concept of bi-directional appears, and even that is not really bi-directional; it is a match without direction, e.g. (a)-[r]-(b). So you would create (a)-[r1]->(b) and (b)-[r2]->(a), the second of which can also be written as (a)<-[r2]-(b). If you query with the direction, then you know which user the relationship property applies to.
I typically do more of my work with Java as an embedded application instead of Cypher and it pays to use directional queries as it makes for less code to do the associations.
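In Cypher, the two-relationship approach could be sketched like this. The :User label, the name property, and the interest values are assumptions for illustration.

```cypher
// One directed :KNOWS relationship per user, each carrying that user's own interest.
MERGE (a:User {name: "Alice"})
MERGE (b:User {name: "Bob"})
MERGE (a)-[ab:KNOWS]->(b)
SET ab.interest = 8
MERGE (b)-[ba:KNOWS]->(a)
SET ba.interest = 3;

// Querying with a direction tells you whose interest you are reading:
// this returns Alice's interest in Bob.
MATCH (:User {name: "Alice"})-[k:KNOWS]->(:User {name: "Bob"})
RETURN k.interest AS aliceInterestInBob;
```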
Note
Since your case is so simple, just try various methods and see what works. Remember to keep track of how long the queries take and, if necessary, add indexes. Also use the query profiling tool to make sure your queries are efficient.

Best Way to Store Contextual Attributes in Core Data?

I am using Core Data to store objects. What is the most efficient possibility for me (i.e. best execution efficiency, least code required, greatest simplicity and greatest compatibility with existing functions/libraries/frameworks) to store different attribute values for each object depending on the context, knowing that the contexts cannot be pre-defined, will be legion and constantly edited by the user?
Example:
An Object is a Person (Potentially =Employer / =Employee)
Each person works for several other persons and has different titles in relation to their work relationships, and their title may change from one year to another (in case this detail matters: each person may also concomitantly employ one or several other persons, which is why a person is an employee but potentially also an employer)
So one attribute of my object would be “Title vs Employer vs Year Ended”
The best I could do with my current knowledge is save all three elements together as a string which would be an attribute value assigned to each object, and constantly parse that string to be able to use it, but this has the following (HUGE) disadvantages:
(1) Unduly Slowed Execution & Increased Energy Use. Using this contextual attribute is at the very core of my prospective app's function (so it would literally be used 10-100 times every minute). Having to constantly parse this information to be able to use it adds undue processing that I’d very much like to avoid.
(2) Undue Coding Overhead. Saving this contextual attribute as a string will unduly make additional coding for me necessary each time I’ll use this central information (i.e. very often).
(3) Undue Complexity & Potential Incompatibility. It will also add undue complexity and by departing from the expected practice it will escape the advantages of Core Data.
What would be the most efficient way to achieve my intended purpose without the aforementioned disadvantages?
Taking your example, one option is to create an Employment entity, with attributes for the title and yearEnded and two (to-one) relationships to Person. One relationship represents the employer and the other represents the employee.
The inverse relationships are in both cases to-many. One represents the employments where the Person is the employee (so you might name it employmentsTaken) and the other relationship represents the employments where the Person is the Employer (so you might name it employmentsGiven).
Generalising, this is the solution recommended by Apple for many-many relationships which have attributes (see "Modelling a relationship based on its semantics" in their documentation).
Whether that will address all of the concerns listed in your question, I leave to your experimentation: if things are changing 10-100 times a minute, the overhead of fetch requests and creating/updating/deleting the intermediate (Employment) entity might be worse than your string representation.

Graph Database Data Model of One Type of Object

Say I'm a mechanic who's worked on many different cars and would like to keep a database of the cars I've worked on. These cars have different manufacturers, models, and some customers have modified versions of these cars with different parts so it's not guaranteed the same model gives you the same car. In addition, I would like to see all these different cars and their similarities/differences easily. Basically the database needs to both represent the logical similarities/differences between all cars that I encounter while still giving me the ability to push/pull each instance of a car I've encountered.
Is this more set up for a relational or graph database?
If a graph database, how would you go about designing it? Each of the relationship labels would just be a 'has_a' or 'is_a_type_of'. Would you have the logical structure amongst all the cars and for each individual car have them point to the leaf nodes? Or would you have each relationship represent each specific car and have those relationships span the logical tree structure of the cars?
Alright so a "graphy" way to go about this would be to create a node type for each kind of domain object. You have a Car identified by a VIN, it can be linked to a Make, Model, and Year. You also have Mechanic nodes that [:work_on] various Car nodes. Don't store make/model/year with the Car, but rather link via relationships, e.g.:
CREATE (c:Car { VIN: "ABC"})-[:make]->(m:Make {label:"Toyota"});
...and so on.
Each of the relationship labels would just be a 'has_a' or
'is_a_type_of'.
Probably not. I'd create different relationship types unique to pairings of node types. So Mechanic -> Car would be [:works_on], Car -> Model would be [:model], and so on. I don't recommend using the same relationship type like has_a everywhere, because from a modeling perspective it's harder to sort out the valid domains and ranges of those relationships (e.g. you'll end up in a situation where has_a can go from just about anything to just about anything, and picking out which has_a relationships you want will be hard).
Or would you have each relationship represent each specific car and
have those relationships span the logical tree structure of the cars?
Each car is its own node, identified by something like a VIN, not by a make/model/year. (Splitting out make/model/year later will allow you to very easily query for all Volvos, etc).
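As a small illustration, the queries below use the [:make] relationship and the VIN and label properties from the CREATE statement above. The :Mechanic name property and the sample values are assumptions.

```cypher
// Link a mechanic to a car they worked on; each car stays its own node, keyed by VIN.
MERGE (mech:Mechanic {name: "Sam"})
MERGE (c:Car {VIN: "ABC"})
MERGE (mech)-[:works_on]->(c);

// All cars of a given make, found by following the [:make] relationship
// rather than filtering on a property copied onto every car.
MATCH (c:Car)-[:make]->(:Make {label: "Toyota"})
RETURN c.VIN AS vin;
```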
Your last question (and the toughest one):
Is this more set up for a relational or graph database?
This is an opinionated question (it attracts opinionated answers), so let me put it to you this way: any data under the sun can be modeled both relationally and via graphs. So I could answer both yes relational, and yes graph. Your data and your domain don't select whether you should use an RDBMS or a graph. Your queries and access patterns select RDBMS vs. graph. If you know how you need to use your data, which queries you'll run, and what you're trying to do, then with that information in hand, you can do your own analysis and determine which one is better. Both have strengths and weaknesses and many points of tradeoff. Without knowing how you'll access the data, it's impossible to answer this question in a really fair way.

How to implement an EAV model in Neo4j?

The Entity-Attribute-Value (EAV) model is really powerful, but complex to implement using SQL, so people often look for alternatives to EAV. It seems like the perfect candidate for graph databases. I understand how to build a movie database where you have nodes with the Neo4j label "Movie" with the property "release_date" right on the node. How would you make this more generic, such that movies have the Neo4j label "Entity" following the general EAV model?
I've thought a lot about this, but I'm not confident I have a good solution. I'll take a stab at it anyway. Here's the most basic model:
<node> <relationship> <node>
Attribute --> :VALUE --> Entity
name="Label",type="string" --> value="Movie" --> name="The Matrix"
With this model, you can write code for how to display and edit Attribute.type. For example, maybe all labels have a text field with finite options on the front-end and all dates have a date-picker. You could break Attribute.type out into its own node, Type, if that was preferable (this would particularly make sense for handling composite types). In that case, you have the relationship TYPE between Attribute and Type nodes.
This becomes a problem if entities have multiple relationships, as is the case for reviews, or if you want to relate the value to something else, such as the user who assigned the value. Now, I think, the relationship "VALUE" has to be its own node of type "Value" (i.e. it has the Neo4j label "Value") with an incoming relationship from both Attribute and User nodes.
The full form has Type nodes, Attribute nodes, User nodes, Value nodes, and Entity nodes, where the relationships have basically no properties on them.
Why do you need it in the first place?
I always thought that EAV was just a workaround for relational databases not being schema free.
Neo4j, like other NoSQL databases, is schema-free, so you can just add the attributes that you want to both nodes and relationships.
If you need to you can also record the EAV model in a meta-schema within the graph but in most cases it is good enough if the meta-schema lives within the application that creates and uses your attributes.
Usually I treat labels as roles which, in a certain context, provide certain properties and relationships. A node can have many labels, each of which represents one of those roles.
E.g. for the same node
:Person(name)-[:LIVES_IN]->(:City)
:Employee(empNo)-[:WORKS_AT]->(:Company)
:Developer()-[:HAS_SKILL]->(:CompSkill)
...
So in your case :Entity would just be a label that implies the name property.
And :Movie is a label that implies a release_date property and e.g. ACTED_IN relationships.
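A short sketch of the labels-as-roles idea in Cypher, with sample property values that are assumptions for illustration:

```cypher
// A single node carrying both roles: :Entity implies name,
// :Movie implies release_date.
CREATE (m:Entity:Movie {name: "The Matrix", release_date: date("1999-03-31")});

// Generic code can match on :Entity and inspect which other roles a node has.
MATCH (e:Entity)
RETURN e.name AS name, labels(e) AS roles;
```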

Resources