I am currently looking at modelling tertiary courses and other such entities (MATH101, BIOL360, BSc etc.), and one of the options we're looking at is graph databases. I am not familiar with graph databases, other than in theory.
One of the things I am trying to model is requirements, for example "MATH201 requires the student previously completed MATH101". That one seems easy - I can create a vertex between the two. Others are more complicated: "Bachelor of Computer Science requires 40 points at 200 level or higher in science papers".
What I'd like to do here is name a bunch of sets, a la Neo4J Labels, then create a relationship from one of the nodes to the set of nodes described in the labels, but I can't see a way to do this. Is this something that's possible in graph databasing engines, or am I basically running down an XY path, and should be doing something else entirely?
I've tagged Neo4J because I'm leaning towards it as (from what I can tell) it's the most widely known/used graph dbms, but I'm open to solutions in other databases too (in fact, if it's possible in the very new SQL Server offering, that would probably be ideal as other infrastructure is on that).
Well, in the fist case, I think you can make a :REQUIREMENT relationship between "MATH201" and "MATH101".
In the second case you can make a :REQUIREMENT relationship between "Bachelor of Computer Science" node and an intermediate node to group all science papers, as in the image below. Also, you can put some extra properties in the relationship to what type of requirement the course has:
Related
For recursive breakdown structures, is it better to model as ...
a. Group HAS Subgroup... or
b. Subgroup PART_OF Group ?? ....
Some neo4j tutorials imply model both (the parent_of and child_of example) while the neo4j subtype tutorials imply that either will work fine (generally going with PART-OF).
Based on experience with neo4j, is there a practical reason for choosing one or the other or use both?
[UPDATED]
Representing the same logical relationship with a pair of relationships (having different types) in opposite directions is a very bad idea and a waste of time and resources. Neo4j can traverse a single relationship just as easily from either of its nodes.
With respect to which direction to pick (since we do not want both), see this answer to a related question.
(I am new to Neo4J and very excited about it)
Here is my conceptual question:
Suppose we want to represent life on earth (based on a biological taxonomy hierarchy).
However, suppose at the leaves of the taxonomy tree we want to actually identify individual organisms. For example, at the mammalia branch, the homo-sapient sub-branch we want to identify each and every one of 7 billion humans and do the same for some other branches (give an ID to every living known great Ape left in the wild and so on)
Is this type of organization done with dense nodes (in the billions) ? or is it done with extensive use of labels (do labels support nesting)?
From my point of view it's better to use multiple nodes instead of multiple labels.
But it depends on the use case and what you want to do with it.
Neo4j doesn't support nested labels or some labels hierarchy.
Here are some resources which could be interesting for you
Graph Databases in Life Sciences: Bringing Biology Back to Its Nature
Open Tree of Life and Neo4j
Say I am managing collectibles. I have thousands of baseball trading cards, thousands still of gaming cards (think Magic: the Gathering), and then thousands and thousands of doilies.
The part of me that's been steeped in relational databases for 20+ years is uncomfortable with the idea of thousands of Neo4J nodes floating out in space.
So I am inclined to gather them all with a node such as (:BASEBALL_CARDS), (:MTG_CARDS), and of course (:DOILIES). The idea is that these are singletons.
Now if I want all baseball cards that perhaps refer to a certain player, I could do something like:
(:BASEBALL_CARDS)-[GATHERS]->(:BASEBALL_CARD)-[:FEATURES]->(p:PLAYER {name: '...'})
It's very comforting to have the :BASEBALL_CARDS singleton, but does it do anything more than could be accomplished by indexing :BASEBALL_CARD?
(:BASEBALL_CARD)-[:FEATURES]->(p:PLAYER {name: '...'})
Is it best-practice to have thousands of free-ranging nodes?
One exceptional strong point of the graph database is the local query: the relationship lives in the instance, not in the type. A particular challenge (apart from modelling well) is determining the starting point of the local query (and keeping it local, i.e., avoiding path explosions). In Neo4j 1.x your One Node was a way to achieve a starting point for a certain kind of query. With 2.x and the introduction of labels, indexing :BaseballCard is the standard way to accomplish the same. If the purpose of that One Node is as a starting point for the kind of query in your example, then you are better off using a label index. A common problem in 1.x was that a node with an increasing number of relationships of the same type and direction eventually becomes a bottle neck for traversals. People started partitioning your One Node into A Paged Handful of Nodes, something like
(:BaseballCards)-[:GATHERS]->(:BaseballCards1to10000)-[:GATHERS]->(:BaseballCard)
The purpose of finding a starting point for the local query is often better served by labels, perhaps in combination with a basic, ordinary, local traversal, than by A Handful of Nodes. Then again, if it calms your nerves or satisfies your sense of the epic to have such a node, by all means have it. Because of the locality of queries, it will do you no harm.
In your example, however, neither the One Node nor an index on :BaseballCard would best serve as the starting point of the local query. The most particular pattern of interest is instead the name of the player. If you index (:Player) on name you will get the best starting point. The traversal across the one or handful* of [:FEATURES] relationships is very cheap and with a simple test on the other end for the :BaseballCard label, you are done. You could of course maintain the One Node for all players that share a name...
In my most humble opinion there is little need for discomfort. I do, however, want to affirm and commend your unease, in this one regard: that the graph is most powerful for connected data. The particular connection gathering the baseball cards doesn't seem to add new understanding or improve performance, but wherever there is disconnected data there is the potential for discovering exciting and meaningful patterns. Perhaps in the future the cards will be connected through patterns that signify their range of value, or the quality of their lamination, or a linked list of previous owners, or how well they work as conversations starters on a date. The absence of relationships is a call to find that One Missing Link that brings tremendous insight and value into your data.
* Handful, assuming that more than one baseball card features the same player, or some baseball players are also featured on cards of Magic: The Gathering. I'm illiterate in both domains, so I want to at least allow for the possibility.
It is ironic that you are concerned about nodes "floating out in space", when the whole idea behind graph DBs is making the connections between nodes a first class DB construct.
But I think your actual concern is that nodes do not "belong to a table" (in relational DB parlance). So, you would feel more comfortable in creating a special singleton node that in some sense takes the place of a table, from which you can access all the nodes that ought belong to that table.
A node label can be seen as the equivalent of a "table name". So, not only is there no need for you to also create a singleton "table node", doing so would be wasteful in DB resources, and complicate and slow down your queries. And neo4j can quickly access all the nodes with the same label.
Revisiting Neo4j after a long absence. I have read a lot of articles but still find I have a few questions to get me going again....
Bidirectional relationships
I have a “connected to”-type scenario where 2 nodes are connected to each other. In fact, the idea is to model a type of flow. However, the flow in both directions is not always the same. I’m uncertain of the best method to use: 1 relationship with 2 properties or 2 distinct relationships?
The former feels like the comfortable choice but then doesn’t feel natural in terms of modelling the actual facts – for example: what to call the properties because FlowIn and FlowOut wouldn’t make sense when looked at from each nodes’ perspective. I also wonder about the performance of properties versus relationships in this case – these values will need to be updated.
Representing Time
Now I want to take a step further and represent the flow between nodes at specific times or, more accurately, between specific times. So between 2pm and 3pm the flow between #1 and #2 will be x.
How should this be done in an optimal way? Relationship per time frame per connection seems….verbose. Could a timeframe being represented as a node be of value?!
Are there any Maximum Flow samples with Cypher out there?
Particularly interested in push-relabel max flow problem solving.
Thank you for any advice to might have to offer.
While you have definitely given some thought to your problem the question is a little unclear. This seems to be a question about Graph Data Models. You would like to know how best to organize a model to represent a complex relationship. If you are trying to track the "flow" between two nodes then assign a weight property to a unidirected edge.
Bidirectional relationships should be carefully considered. Neo4j can process them just as fast as unidirectional relationships. A quote from the graphaware about using bidirectional relationships:
Relationships in Neo4j can be traversed in both directions with the same speed. Moreover, direction can be completely ignored. Therefore, there is no need to create two different relationships between nodes, if one implies the other.
I believe your problems can be alleviated by gaining a better understanding of Graph data models. Looking at a few different models and understanding the why will help more than understanding cypher syntax at this point. May I suggest reading this survey by 2 professors at the University of Chile titled "Survey of Graph Database Models." The "Hypernode" model on page 21 may be of particular interest to you since it sounds like you are trying to model a complex cyclic object. From page twenty one;
Hypernodes can be used to represent simple (flat) and complex objects (hierarchical, composite, and cyclic) as well as mappings and records. A key feature is its inherent ability to encapsulate information.
Hopefully that information helps you in your efforts to model a complex relationship.
I am teaching myself graph modelling and use Neo4j 2.2.3 database with NodeJs and Express framework.
I have skimmed through the free neo4j graph database book and learned how to model a scenario, when to use relationship and when to create nodes, etc.
I have modelled a vehicle selling scenario, with following structure
NODES
(:VEHICLE{mileage:xxx, manufacture_year: xxxx, price: xxxx})
(:VFUEL_TYPE{type:xxxx}) x 2 (one for diesel and one for petrol)
(:VCOLOR{color:xxxx}) x 8 (red, green, blue, .... yellow)
(:VGEARBOX{type:xxx}) x 2 (AUTO, MANUAL)
RELATIONSHIPS
(vehicleNode)-[:VHAVE_COLOR]->(colorNode - either of the colors)
(vehicleNode)-[:VGEARBOX_IS]->(gearboxNode - either manual or auto)
(vehicleNode)-[:VCONSUMES_FUEL_TYPE]->(fuelNode - either diesel or petrol)
Assuming we have the above structure and so on for the rest of the features.
As shown in the above screenshot (136 & 137 are VEHICLE nodes), majority of the features of a vehicle is created as separate nodes and shared among vehicles with common feature with relationships.
Could you please advise whether roles (labels) like color, body type, driving side (left drive or right drive), gearbox and others should be seperate nodes or properties of vehicle node? Which option is more performance friendly, and easy to query?
I want to write a JS code that allows querying the graph with above structure with one or many search criteria. If majority of those features are properties of VEHICLE node then querying would not be difficult:
MATCH (v:VEHICLE) WHERE v.gearbox = "MANUAL" AND v.fuel_type = "PETROL" AND v.price > x AND v.price < y AND .... RETURN v;
However with existing graph model that I have it is tricky to search, specially when there are multiple criteria that are not necessarily a properties of VEHICLE node but separate nodes and linked via relationship.
Any ideas and advise in regards to existing structure of the graph to make it more query-able as well as performance friendly would be much appreciated. If we imagine a scenario with 1000 VEHICLE nodes that would generate 15000 relationship, sounds a bit scary and if it hits a million VEHICLE then at most 15 million relationships. Please comment if I am heading in the wrong direction.
Thank you for your time.
Modeling is full of tradeoffs, it looks like you have a decent start.
Don't be concerned at all with the number of relationships. That's what graph databases are good at, so I wouldn't be too concerned about over-using them.
Should something be a property, or a node? I can't answer for your scenario, but here are some things to consider:
If you look something up by a value all the time, and you have many objects, it's usually going to be faster to find one node and then everything connected to it, because graph DBs are good at exploiting relationships. It's less fast to scan all nodes of a label and find the items where a property=a value.
Relationships work well when you want to express a connection to something that isn't a simple primitive data type. For example, take "gearbox". There's manuals, and other types...if it's a property value, you won't later easily be able to decide to store 4 other sub-types/sub-aspects of "gearbox". If it were a node, that would later be easy because you could add more properties to the node, or relate other things.
If a piece of data really is a primitive (String, integer, etc) and you don't need extra detail about it, that usually makes a good property. Querying primitive values by connecting to other nodes will seem clunky later on. For example, I wouldn't model a person with a "date of birth" as a separate node, that would be irritating to query, and would give you flexibility you'd be very unlikely to need in the future.
Semantically, how is your data related? If two items are similar because they share an X, then that X probably should be a node. If two items happen to have the same Y value but that doesn't really mean much, then Y is probably better off as a node property.