I am learning graph databases (Neo4j to be specific) and I chose to model the game Ticket to Ride. The game consists of each player connecting cities to each other. Some cities have two paths, maybe with different colors, between them. For example, to go from New York to Boston, you can choose to spend two red or two yellow cards. From Montreal to Boston, there are two paths, but they accept any colors, and from Montreal to New York, you can only spend 3 blue cards.
(Image of the relevant part of the board omitted; source: daysofwonder.com)
The kinds of questions I need to answer are:
What is the longest path that goes through New York, Boston and Montreal?
What is the shortest path from Miami to Montreal, excluding the segment between Montreal and Boston (presumably because another player took the routes)?
What segments of 4 exist using color Red?
My question is: should the routes / segments between cities be nodes, or should they be relationships? I can see it both ways. Are there advantages to making them nodes, rather than relationships?
The only property I need to remember on a route is what player owns the route, or some sentinel value (distinct from NULL) to indicate a route is yet unowned.
One of the questions I often ask myself when making this decision is: will it help if I make it a node? With a node, I can connect other relationships to it. You mention that you need to relate a player (the owner) to a route; in that case the route is a good candidate for a node:
(Boston:Place)-[:route]->(x:Route)-[:route]->(Montreal:Place)
This uses labels, which are available in Neo4j 2.0+.
Also note that if you need to find the routes belonging to a player, it should be much faster to follow an ownership relationship from the player than to build ownership into an index. I would think that you might have many, many routes at some point, making for quite a huge index. I could be wrong on that point, but it bears consideration.
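As for the shortest path (there is no built-in counterpart for the longest), you can always use
shortestPath((n1:Place)-[r:route*]-(n2:Place))
To exclude a segment, have a look at the WHERE clause, as you can most likely filter out the routes you don't want there (something along the lines of WHERE r <> something, applied to each route on the path).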
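To make the "shortest path, excluding a segment" question concrete outside of Cypher, here is a minimal in-memory sketch in Python. It only uses the three cities and five routes named in the question, keeps each route as its own record so it can carry an owner (the "route as node" idea), and treats the `UNOWNED` sentinel, the record layout and the `shortest_path` helper as assumptions made for illustration. Path length is counted in routes, not in cards.

```python
from collections import deque

UNOWNED = "unowned"  # assumed sentinel, distinct from None/NULL

# Only the routes described in the question; each route is its own record.
ROUTES = [
    {"ends": ("New York", "Boston"), "color": "red", "owner": UNOWNED},
    {"ends": ("New York", "Boston"), "color": "yellow", "owner": UNOWNED},
    {"ends": ("Montreal", "Boston"), "color": "any", "owner": UNOWNED},
    {"ends": ("Montreal", "Boston"), "color": "any", "owner": UNOWNED},
    {"ends": ("Montreal", "New York"), "color": "blue", "owner": UNOWNED},
]


def shortest_path(start, goal, routes, excluded=()):
    """Breadth-first search by number of routes, skipping excluded city pairs."""
    blocked = {frozenset(pair) for pair in excluded}
    neighbours = {}
    for route in routes:
        a, b = route["ends"]
        if frozenset((a, b)) in blocked:
            continue  # every parallel route between these two cities is skipped
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for city in sorted(neighbours.get(path[-1], ())):
            if city not in seen:
                seen.add(city)
                queue.append(path + [city])
    return None  # no path left once the exclusions are applied


direct = shortest_path("Boston", "Montreal", ROUTES)
detour = shortest_path("Boston", "Montreal", ROUTES,
                       excluded=[("Montreal", "Boston")])
```

With the Montreal-Boston segment excluded, the search has to go around through New York. Filtering on `owner` instead of on city pairs would work the same way, which is one practical benefit of giving routes an identity of their own.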
I've just started using neo4j and, having done a few experiments, am ready to start organizing the database itself. I've started by designing a basic diagram (on paper) and came across the following doubt:
Most examples in the material I'm using (cypher and neo4j tutorials) present only a few properties per relationship/node. But I have to wonder what the cost of having a heavy string of properties is.
Q: Is it more efficient to favor a wide variety of relationship types (GOODFRIENDS_WITH, FRIENDS_WITH, ACQUAINTANCE, RIVAL, ENEMIES, etc) or fewer types with varying properties (SEES_AS type:good friend, friend, acquaintance, rival, enemy, etc)?
The same holds for nodes. The first draft of my diagram has a staggering amount of properties (title, first name, second name, first surname, second surname, suffix, nickname, and then there's physical characteristics, personality, age, jobs...) and I'm thinking it may lower the performance of the db. Of course some nodes won't need all of the properties, but the basic properties will still be quite a few.
Q: What is the actual, and the advisable, limit for the number of properties, in both nodes and relationships?
FYI, I am going to remake my draft in such a way as to diminish the properties by using nodes instead (create a node :family names, another for :job and so on), but I've only just started thinking it over as I'll need to carefully analyse which 'would-be properties' make sense to remain, even because the change will amplify the number of relationship types I'll be dealing with.
Background information:
1) I'm using neo4j to map out all relationships between the people living in a fictional small town. The queries I'll perform will mostly be as follows:
a. find all possible paths between 2 (or more) characters
b. find all locations which 2 (or more) characters frequent
c. find all characters which have certain types of relationship (friends, cousins, neighbors, etc) to character X
d. find all characters with the same age (or similar age) who studied in the same school
e. find all characters with the same age / first name / surname / hair color / height / hobby / job / temper (easy to anger) / ...
and variations of the above.
2) I'm not a programmer, but having taught myself HTML and advanced Excel, I feel confident I'll learn the intuitive Cypher quickly enough.
First off, for small data "sandbox" use, this is a moot point. Even with the most inefficient data layout, as long as you avoid Cartesian products and the like, the only thing you will notice is how intuitive your data is to you. So if this is a "toy" scale project, just focus on what makes the most organizational sense to you. If you change your mind later, reformatting via Cypher won't be too hard.
Now assuming this is a business project that needs to scale to some degree, remember that non-indexed properties are basically invisible to the Cypher planner. The more meaningful and diverse your relationships, the better the Cypher planner is going to be at finding your data quickly. Favor relationships for connections you want to be able to explore, and favor properties for data you just want to see. Index any properties or use labels that will be key for finding a particular (or set of) node(s) in your queries.
My database contains hotels, reviews of hotels, terms (i.e. words) in reviews and topics (e.g. there could be a topic called "Staff" containing terms describing the hotel staff) as nodes. Indices on all nodes are present. The relationships are as follows: Hotel<--Review-->Term-->Topic
I am currently trying to find an efficient way of querying for topics that have paths to two or more specified hotels. In other words, I am interested in the common topics of two hotels. If hotel A has paths to topics 1,2,3 and hotel B has paths to topics 2,3,4 then the result should be 2,3.
I tried the following, but it seems very inefficient, most likely because of the number of possible paths between hotels and topics. Basically, each word in a review could create a new path that has to be checked.
// show all topics that two hotels have in common
MATCH (h2:Hotel)<--(r2:Review)-->(t2:Term)-->(to:Topic)<--(t1:Term)<--(r1:Review)-->(h1:Hotel)
WHERE h1.id IN ["id1","id2"] AND h2.id IN ["id1","id2"] AND NOT h1.id=h2.id
RETURN h1.id,to.topic, count(to) AS topic_mentions
I am wondering if there's a faster way of dealing with this, if I were to implement this in java or similar language I'd probably try doing a BFS starting at each hotel and then taking the overlap of what I find. I am fairly certain that adding the transitive edges as direct edges Hotel-->Topic would speed this up, but my limited database design knowledge told me that this might be unnecessarily redundant and not a good practice?
I tried to do the id matching before the pattern matching with another MATCH and WITH clause, but this didn't speed anything up; I think the problem really lies in the pattern matching itself.
I created something similar for searching KBs, and a direct relationship between Hotels and Topics will make this search dead easy, and it'll be faster. For example, to search for all topics that more than one hotel has in common, you'd use:
MATCH (h1:Hotel)-[:TOPIC]->(t:Topic)
MATCH (h2:Hotel)-[:TOPIC]->(t:Topic)
WHERE h1 <> h2
RETURN h1.id, h2.id, t.topic, count(t) AS topic_mentions
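Note that because t.topic is one of the returned columns, count(t) is computed per pair of hotels and topic rather than across all the topics the two hotels share, and each pair comes back in both orders (h1, h2 and h2, h1), which may or may not be what you want.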
"I am fairly certain that adding the transitive edges as direct edges Hotel--Topic would speed this up, but my limited database design knowledge told me that this might be unnecessarily redundant and not a good practice?"
All that would be doing is making an implicit relationship explicit, which is one of the things that make graph DBs so powerful. There is the maintenance aspect to be concerned about (namely, if someone updates the words in a review, you have to make sure that the (hotel)-[:TOPIC]->(topic) relationships are still valid), but you'd have to do that in your original design anyway, so no loss there.
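The question's own idea (walk out from each hotel, then take the overlap) and the direct Hotel-to-Topic relationship suggested here amount to the same two steps: collapse every Review-to-Term-to-Topic path into one hotel-to-topic link, then intersect. A minimal Python sketch of that, using made-up toy data (the hotel ids, reviews, terms and topics below are all hypothetical):

```python
# Hypothetical toy data shaped like Hotel<--Review-->Term-->Topic.
REVIEW_HOTEL = {"r1": "A", "r2": "A", "r3": "B"}
REVIEW_TERMS = {
    "r1": ["friendly", "clean"],
    "r2": ["pool"],
    "r3": ["rude", "pool", "noisy"],
}
TERM_TOPIC = {
    "friendly": "Staff",
    "rude": "Staff",
    "clean": "Room",
    "pool": "Facilities",
    "noisy": "Location",
}


def topics_by_hotel(review_hotel, review_terms, term_topic):
    """Collapse every Review->Term->Topic path into a hotel -> set-of-topics map."""
    direct = {}
    for review, hotel in review_hotel.items():
        for term in review_terms.get(review, []):
            topic = term_topic.get(term)
            if topic is not None:
                direct.setdefault(hotel, set()).add(topic)
    return direct


def common_topics(direct, hotels):
    """Topics reachable from every one of the given hotels."""
    sets = [direct.get(hotel, set()) for hotel in hotels]
    return set.intersection(*sets) if sets else set()


DIRECT = topics_by_hotel(REVIEW_HOTEL, REVIEW_TERMS, TERM_TOPIC)
shared = common_topics(DIRECT, ["A", "B"])
```

The `DIRECT` map plays the role of the materialized (hotel)-[:TOPIC]->(topic) relationships: it is built once, no matter how many terms point at the same topic, and has to be rebuilt or patched whenever a review changes.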
Say I am managing collectibles. I have thousands of baseball trading cards, thousands still of gaming cards (think Magic: the Gathering), and then thousands and thousands of doilies.
The part of me that's been steeped in relational databases for 20+ years is uncomfortable with the idea of thousands of Neo4j nodes floating out in space.
So I am inclined to gather them all with a node such as (:BASEBALL_CARDS), (:MTG_CARDS), and of course (:DOILIES). The idea is that these are singletons.
Now if I want all baseball cards that perhaps refer to a certain player, I could do something like:
(:BASEBALL_CARDS)-[:GATHERS]->(:BASEBALL_CARD)-[:FEATURES]->(p:PLAYER {name: '...'})
It's very comforting to have the :BASEBALL_CARDS singleton, but does it do anything more than could be accomplished by indexing :BASEBALL_CARD?
(:BASEBALL_CARD)-[:FEATURES]->(p:PLAYER {name: '...'})
Is it best-practice to have thousands of free-ranging nodes?
One exceptionally strong point of the graph database is the local query: the relationship lives in the instance, not in the type. A particular challenge (apart from modelling well) is determining the starting point of the local query (and keeping it local, i.e., avoiding path explosions). In Neo4j 1.x your One Node was a way to achieve a starting point for a certain kind of query. With 2.x and the introduction of labels, indexing :BaseballCard is the standard way to accomplish the same. If the purpose of that One Node is as a starting point for the kind of query in your example, then you are better off using a label index. A common problem in 1.x was that a node with an increasing number of relationships of the same type and direction eventually becomes a bottleneck for traversals. People started partitioning their One Node into A Paged Handful of Nodes, something like
(:BaseballCards)-[:GATHERS]->(:BaseballCards1to10000)-[:GATHERS]->(:BaseballCard)
The purpose of finding a starting point for the local query is often better served by labels, perhaps in combination with a basic, ordinary, local traversal, than by A Handful of Nodes. Then again, if it calms your nerves or satisfies your sense of the epic to have such a node, by all means have it. Because of the locality of queries, it will do you no harm.
In your example, however, neither the One Node nor an index on :BaseballCard would best serve as the starting point of the local query. The most particular pattern of interest is instead the name of the player. If you index (:Player) on name you will get the best starting point. The traversal across the one or handful* of [:FEATURES] relationships is very cheap and with a simple test on the other end for the :BaseballCard label, you are done. You could of course maintain the One Node for all players that share a name...
In my most humble opinion there is little need for discomfort. I do, however, want to affirm and commend your unease, in this one regard: that the graph is most powerful for connected data. The particular connection gathering the baseball cards doesn't seem to add new understanding or improve performance, but wherever there is disconnected data there is the potential for discovering exciting and meaningful patterns. Perhaps in the future the cards will be connected through patterns that signify their range of value, or the quality of their lamination, or a linked list of previous owners, or how well they work as conversation starters on a date. The absence of relationships is a call to find that One Missing Link that brings tremendous insight and value into your data.
* Handful, assuming that more than one baseball card features the same player, or some baseball players are also featured on cards of Magic: The Gathering. I'm illiterate in both domains, so I want to at least allow for the possibility.
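The lookup described above (start from an index on the player's name, follow the few FEATURES relationships backwards, check the label at the other end) can be mimicked in a few lines of Python. This is only an illustration of the access pattern, not of how Neo4j stores anything; the node ids, names and titles are hypothetical toy data.

```python
# Hypothetical toy data: nodes keyed by id, each with a set of labels.
NODES = {
    1: {"labels": {"Player"}, "name": "Player A"},
    2: {"labels": {"Player"}, "name": "Player B"},
    10: {"labels": {"BaseballCard"}, "title": "Card 10"},
    11: {"labels": {"BaseballCard"}, "title": "Card 11"},
    12: {"labels": {"MtgCard"}, "title": "Card 12"},
    13: {"labels": {"BaseballCard"}, "title": "Card 13"},
}
# (card id, player id) pairs standing in for (card)-[:FEATURES]->(player).
FEATURES = [(10, 1), (11, 1), (12, 1), (13, 2)]


def build_name_index(nodes):
    """Stand-in for an index on :Player(name): name -> list of player ids."""
    index = {}
    for node_id, node in nodes.items():
        if "Player" in node["labels"]:
            index.setdefault(node["name"], []).append(node_id)
    return index


def build_incoming(features):
    """Player id -> ids of the cards that feature that player."""
    incoming = {}
    for card_id, player_id in features:
        incoming.setdefault(player_id, []).append(card_id)
    return incoming


def baseball_cards_featuring(name, nodes, name_index, incoming):
    """Index lookup, short local traversal, then a label test on each card."""
    cards = []
    for player_id in name_index.get(name, []):
        for card_id in incoming.get(player_id, []):
            if "BaseballCard" in nodes[card_id]["labels"]:
                cards.append(card_id)
    return sorted(cards)


NAME_INDEX = build_name_index(NODES)
INCOMING = build_incoming(FEATURES)
cards_for_a = baseball_cards_featuring("Player A", NODES, NAME_INDEX, INCOMING)
```

Nothing in the lookup ever touches a "gathering" node or scans all cards: the work done is proportional to the number of cards featuring that one player.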
It is ironic that you are concerned about nodes "floating out in space", when the whole idea behind graph DBs is making the connections between nodes a first class DB construct.
But I think your actual concern is that nodes do not "belong to a table" (in relational DB parlance). So, you would feel more comfortable creating a special singleton node that in some sense takes the place of a table, from which you can access all the nodes that ought to belong to that table.
A node label can be seen as the equivalent of a "table name". So, not only is there no need for you to also create a singleton "table node", doing so would be wasteful in DB resources, and complicate and slow down your queries. And neo4j can quickly access all the nodes with the same label.
I am teaching myself graph modelling and use a Neo4j 2.2.3 database with Node.js and the Express framework.
I have skimmed through the free neo4j graph database book and learned how to model a scenario, when to use relationship and when to create nodes, etc.
I have modelled a vehicle selling scenario, with following structure
NODES
(:VEHICLE{mileage:xxx, manufacture_year: xxxx, price: xxxx})
(:VFUEL_TYPE{type:xxxx}) x 2 (one for diesel and one for petrol)
(:VCOLOR{color:xxxx}) x 8 (red, green, blue, .... yellow)
(:VGEARBOX{type:xxx}) x 2 (AUTO, MANUAL)
RELATIONSHIPS
(vehicleNode)-[:VHAVE_COLOR]->(colorNode - either of the colors)
(vehicleNode)-[:VGEARBOX_IS]->(gearboxNode - either manual or auto)
(vehicleNode)-[:VCONSUMES_FUEL_TYPE]->(fuelNode - either diesel or petrol)
Assuming we have the above structure and so on for the rest of the features.
As shown in the above screenshot (136 & 137 are VEHICLE nodes), the majority of a vehicle's features are created as separate nodes and shared, via relationships, among the vehicles that have that feature in common.
Could you please advise whether roles (labels) like color, body type, driving side (left-hand or right-hand drive), gearbox and others should be separate nodes or properties of the vehicle node? Which option is more performance friendly and easier to query?
I want to write JS code that allows querying a graph with the above structure by one or many search criteria. If the majority of those features were properties of the VEHICLE node, then querying would not be difficult:
MATCH (v:VEHICLE) WHERE v.gearbox = "MANUAL" AND v.fuel_type = "PETROL" AND v.price > x AND v.price < y AND .... RETURN v;
However, with the graph model that I have it is tricky to search, especially when there are multiple criteria that are not properties of the VEHICLE node but separate nodes linked via relationships.
Any ideas and advice regarding the existing structure of the graph, to make it easier to query as well as performance friendly, would be much appreciated. If we imagine a scenario with 1000 VEHICLE nodes, that would generate 15000 relationships, which sounds a bit scary, and if it hits a million VEHICLE nodes, then up to 15 million relationships. Please comment if I am heading in the wrong direction.
Thank you for your time.
Modeling is full of tradeoffs, and it looks like you have a decent start.
Don't be concerned at all with the number of relationships. That's what graph databases are good at, so I wouldn't worry about over-using them.
Should something be a property, or a node? I can't answer for your scenario, but here are some things to consider:
If you look something up by a value all the time, and you have many objects, it's usually going to be faster to find one node and then everything connected to it, because graph DBs are good at exploiting relationships. It's less fast to scan all nodes of a label and find the items where a property=a value.
Relationships work well when you want to express a connection to something that isn't a simple primitive data type. For example, take "gearbox". There's manuals, and other types...if it's a property value, you won't later easily be able to decide to store 4 other sub-types/sub-aspects of "gearbox". If it were a node, that would later be easy because you could add more properties to the node, or relate other things.
If a piece of data really is a primitive (String, integer, etc) and you don't need extra detail about it, that usually makes a good property. Querying primitive values by connecting to other nodes will seem clunky later on. For example, I wouldn't model a person with a "date of birth" as a separate node, that would be irritating to query, and would give you flexibility you'd be very unlikely to need in the future.
Semantically, how is your data related? If two items are similar because they share an X, then that X probably should be a node. If two items happen to have the same Y value but that doesn't really mean much, then Y is probably better off as a node property.
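On the worry that node-based features are tricky to search with several criteria at once: each criterion that lives on a feature node simply becomes one more MATCH pattern hanging off the same vehicle, so the query can be assembled mechanically. The sketch below is in Python rather than the JS the question mentions, purely to show the shape. The labels, relationship types and property names come from the question; the `build_vehicle_query` helper and the `min_price`/`max_price` criterion names are assumptions. It emits the `$name` parameter syntax of newer Neo4j versions; if I remember right, the 2.x releases wrote parameters as `{name}` instead.

```python
# Feature name -> (relationship type, node label, property), per the question's model.
FEATURE_NODES = {
    "color": ("VHAVE_COLOR", "VCOLOR", "color"),
    "gearbox": ("VGEARBOX_IS", "VGEARBOX", "type"),
    "fuel_type": ("VCONSUMES_FUEL_TYPE", "VFUEL_TYPE", "type"),
}


def build_vehicle_query(criteria):
    """Return (cypher_text, params) for a dict of search criteria."""
    lines = ["MATCH (v:VEHICLE)"]
    conditions = []
    params = {}
    for name in sorted(criteria):  # sorted only to make the output deterministic
        if name in FEATURE_NODES:
            rel, label, prop = FEATURE_NODES[name]
            # One extra pattern per feature node, all anchored on the same v.
            lines.append(f"MATCH (v)-[:{rel}]->(:{label} {{{prop}: ${name}}})")
        elif name == "min_price":
            conditions.append("v.price > $min_price")
        elif name == "max_price":
            conditions.append("v.price < $max_price")
        else:
            raise ValueError(f"unsupported criterion: {name}")
        params[name] = criteria[name]  # values travel as parameters, never inlined
    if conditions:
        lines.append("WHERE " + " AND ".join(conditions))
    lines.append("RETURN v")
    return "\n".join(lines), params


query, params = build_vehicle_query(
    {"gearbox": "MANUAL", "fuel_type": "PETROL", "min_price": 1000}
)
```

Only whitelisted names ever reach the query text, and user-supplied values are passed separately as parameters, which keeps the generated Cypher safe to run and lets the database reuse its query plan.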
I am trying to realize a datamodel in Neo4j. The model has points of interest in a city and streets. The streets connect the points.
Initially I thought that points and streets should both be represented in the graph database as nodes.
Between these two different types of nodes there is a relationship ("point is connected with").
Now I am considering the possibility that, instead of representing the street as a node, it is perhaps more correct to represent the street as a relationship ("connects two points").
And this is actually my question: what is the more correct way to represent the network (the line part) in a model, with nodes or with relationships?
The only major difference between relationships and nodes is that relationships must exist between two nodes. This means that you wouldn't be able to store a specific street if you didn't store two points of interest that it connects. So, if you see this being an issue, you may want to store streets as nodes. If you are certain that you will only want to store streets if there are points of interest in your database that exist on the street, then it'd make more sense to represent the streets as relationships.
In general, you should try to avoid storing properties in nodes that you only intend to use to find relationships between them. In this case, you mention possibly storing a "point is connected with" property in each point of interest node. This would work, but is essentially just saying that a relationship exists between two points without actually using a relationship. Again, in the case where you want to be able to store streets that don't have points of interest on them, this may be necessary, and you could store such streets by leaving the "point is connected with" property as NULL, but I would advise against this.
Another thing to think about is what you would store in the relationship. If you go with the model where streets are nodes, it becomes very difficult to represent quantities like distances between points of interest without adding relationships into your graph specifically for those properties, which may as well be properties of a street relationship.
UPDATE: Thought I'd add an example query to show how making the streets relationships can simplify your logic and make using your database much simpler and more intuitive.
Imagine you wish to find the path with the fewest points of interest between points A and B.
This is what the query would look like with the relationships model:
MATCH (a:Point {name: "foo"}), (b:Point {name: "bar"}),
p = shortestPath((a)-[:Street*]-(b))
RETURN p
By using relationships where appropriate, you enable the capabilities of Neo4j, allowing you to get a lot of work done with relatively simple queries. It's hard to think of a way to write this query in the model where you represent streets as nodes, but it would in all likelihood be much more complex and less efficient.