When should inferred relationships and nodes be used over explicit ones? - neo4j

I was looking up how to utilise temporary relationships in Neo4j when I came across this question: Cypher temp relationship
and the comment underneath it made me wonder when they should be used and since no one argued against him, I thought I would bring it up here.
I come from a mainly SQL background and my main reason for using virtual relationships was to eliminate duplicated data and do traversals to get properties of something instead.
For a more specific example, let's say we have a robust cake recipe, which has sugar as an ingredient. The sugar is what makes the cake sweet.
Now imagine a use case where I don't like sweet cakes so I want to get all the ingredients of the recipe that make the cake sweet and possibly remove them or find alternatives.
Then there's another use case where I just want foods that are sweet. I could work backwards from the sweet ingredients to get to the food or just store that a cake is sweet in general, which saves time from traversal and makes a query easier. However, as I mentioned before, this duplicates known data that can be inferred.
Sorry if the example is too strange, I suck at making them. I hope the main question comes across, though.

My feeling is that the only valid scenario for creating redundant "shortcut" relationships is this:
Your use case has a stringent time constraint (e.g., average query time must be less than 200ms), but your neo4j query -- despite optimization -- exceeds that constraint, and you have verified that adding "shortcut" relationships will indeed make the response time acceptable.
You should be aware that adding redundant "shortcut" relationships comes with its own costs:
Queries that modify the DB would need to be more complex (to modify the redundant relationships) and also slower.
You'd always have to add the redundant relationships -- even if actually you never need some (most?) of them.
If you want to make concurrent updates to the DB, the chances that you may lose some updates and introduce inconsistencies into the DB would increase -- meaning that you'd have to work even harder to avoid inconsistencies.
NOTE: For visualization purposes, you can use virtual nodes and relationships, which are temporary and not actually stored in the DB.

Related

Node vs Relationship

So I've just worked through the tutorial and I'm unclear about a few things. The main one, however, is how do you decide when something is a relationship and when it should be a Node?
For example, in the Movies Database,there is a relationship showing who acted in which film. A property of that relationship is the Role. BUT, what if it's a series of films? The role may well be constant between films (say, Jack Ryan in The Hunt for Red October, Patriot Games etc.)
We may also want to have some kind of Character bio, which would obviously remain constant between movies. Worse, the actor may change from one movie to another (Alec Baldwin then Harrison Ford.) There are many others like this (James Bond, for example).
Even if the actor doesn't change (Main roles in Harry Potter) the character is constant. So, at what point would the Role become a node in its own right? When it does, can I have a 3-way relationship (Actor-Role-Movie)? Say I start of with it being a relationship and then, down the line, decide it should've been a node, is there a simple way to go through the database and convert it?
No there is no way to convert your datamodel. When you start your own Database first take time to find a fitting schema. There is no ideal way to create a schema and also there are many different models fitting to the same situation without being totally wrong.
My strategy is to put less information to the relationship itself. I only add properties that directly concern the relationship and store all the other data in the nodes. Also think of properties you could use for traversing the graph. For example you might need some flags or even different labels for relationships even they more or less are the same. The apoc.algo.aStar is only including relationshiptypes you want (you could exclude certain nodes by giving them a special relationshiptype). So keep that in mind that you take a look at procedures that you might use later.
Try to create the schema as simple as possible and find a way to stay consistent in terms of what things are nodes and what deserves a relationship. Dont mix it up.
Choose a design that makes sense for you! (device 1)-[cable]-(device 2) vs (device 1)-[has cable]-(cable)-[has cable]-(device 2) in this case I'd prefer the first because [has cable] wouldn't bring anymore information. Irrespective to what I wrote above I would have a lot of information in this [cable] relationship but it totally makes sense for me because I wouldnt want to search in a device node for cable information.
For your example giving the role a own node is also valid way. For example if you want to espacially query which actors had the same role in common I'll totally go for giving the role a extra node.
Summary:
Think of what you want to do with the data and choose the easiest model.

Storing text in Neo4j

We're using Neo4J and liking it. We do all sorts of graphy things in it. However, some of what we do is not graphy. For example, we keep a log of all changes to a certain type of node:
(n)-[:CHANGE]->(c1)-[:CHANGE]->(c2) etc etc
This list of changes can get to be 20 or 30 c1 nodes long. While it looks weird, I don't have a real problem with it. (Of course, I am smarter now, and since each :CHANGE relationship has a date in it, I could porcupine all the c1 nodes right from n. But whatever.)
But what if I wanted to store large amounts of text, or images. Is there a problem with storing large amounts of data in a node? I could use a different database for these things, but this just increases the skill set required to run the business. And of course, joining data in two disparate databases is always a PITA.
So I need to worry about storing large amounts of text in a single property? Do I need to avoid creating logs in the way I did above?
Creating linked lists of events, whatever they may be, is fine. It'd even say it is graphy! We use this approach a lot, in the case of ChangeFeed for something similar to what you're doing.
In Neo4j, such a linked list is actually better than a "porcupine", because you have to traverse all the relationships anyway, if you're looking for all changes. But in the linked list case, you don't have to look at properties to order them. In fact, the best approach is a hybrid approach, like the TimeTree.
As for storing large amounts of text or images, Neo4j is not the best place for it. Another database would be best, especially if the volume of these is large ( > hundreds of thousands).
But if you do want to store them in Neo4j, one thing to keep in mind is that Neo4j will load all the properties of a node/relationship once a single property is accessed. So in order to achieve good performance even with text/images in Neo4j, I would store them in a node of their own. That way, you can only load them if you really need them, but not during regular traversals.
For example:
CREATE (b:BlogPost {title:'Neo4j Rocks', author:'Tony Ennis', date:".."})-[:HAS_BODY]->(:BlogPostBody {content:'..'})

One Node to Gather Them All

Say I am managing collectibles. I have thousands of baseball trading cards, thousands still of gaming cards (think Magic: the Gathering), and then thousands and thousands of doilies.
The part of me that's been steeped in relational databases for 20+ years is uncomfortable with the idea of thousands of Neo4J nodes floating out in space.
So I am inclined to gather them all with a node such as (:BASEBALL_CARDS), (:MTG_CARDS), and of course (:DOILIES). The idea is that these are singletons.
Now if I want all baseball cards that perhaps refer to a certain player, I could do something like:
(:BASEBALL_CARDS)-[GATHERS]->(:BASEBALL_CARD)-[:FEATURES]->(p:PLAYER {name: '...'})
It's very comforting to have the :BASEBALL_CARDS singleton, but does it do anything more than could be accomplished by indexing :BASEBALL_CARD?
(:BASEBALL_CARD)-[:FEATURES]->(p:PLAYER {name: '...'})
Is it best-practice to have thousands of free-ranging nodes?
One exceptional strong point of the graph database is the local query: the relationship lives in the instance, not in the type. A particular challenge (apart from modelling well) is determining the starting point of the local query (and keeping it local, i.e., avoiding path explosions). In Neo4j 1.x your One Node was a way to achieve a starting point for a certain kind of query. With 2.x and the introduction of labels, indexing :BaseballCard is the standard way to accomplish the same. If the purpose of that One Node is as a starting point for the kind of query in your example, then you are better off using a label index. A common problem in 1.x was that a node with an increasing number of relationships of the same type and direction eventually becomes a bottle neck for traversals. People started partitioning your One Node into A Paged Handful of Nodes, something like
(:BaseballCards)-[:GATHERS]->(:BaseballCards1to10000)-[:GATHERS]->(:BaseballCard)
The purpose of finding a starting point for the local query is often better served by labels, perhaps in combination with a basic, ordinary, local traversal, than by A Handful of Nodes. Then again, if it calms your nerves or satisfies your sense of the epic to have such a node, by all means have it. Because of the locality of queries, it will do you no harm.
In your example, however, neither the One Node nor an index on :BaseballCard would best serve as the starting point of the local query. The most particular pattern of interest is instead the name of the player. If you index (:Player) on name you will get the best starting point. The traversal across the one or handful* of [:FEATURES] relationships is very cheap and with a simple test on the other end for the :BaseballCard label, you are done. You could of course maintain the One Node for all players that share a name...
In my most humble opinion there is little need for discomfort. I do, however, want to affirm and commend your unease, in this one regard: that the graph is most powerful for connected data. The particular connection gathering the baseball cards doesn't seem to add new understanding or improve performance, but wherever there is disconnected data there is the potential for discovering exciting and meaningful patterns. Perhaps in the future the cards will be connected through patterns that signify their range of value, or the quality of their lamination, or a linked list of previous owners, or how well they work as conversations starters on a date. The absence of relationships is a call to find that One Missing Link that brings tremendous insight and value into your data.
* Handful, assuming that more than one baseball card features the same player, or some baseball players are also featured on cards of Magic: The Gathering. I'm illiterate in both domains, so I want to at least allow for the possibility.
It is ironic that you are concerned about nodes "floating out in space", when the whole idea behind graph DBs is making the connections between nodes a first class DB construct.
But I think your actual concern is that nodes do not "belong to a table" (in relational DB parlance). So, you would feel more comfortable in creating a special singleton node that in some sense takes the place of a table, from which you can access all the nodes that ought belong to that table.
A node label can be seen as the equivalent of a "table name". So, not only is there no need for you to also create a singleton "table node", doing so would be wasteful in DB resources, and complicate and slow down your queries. And neo4j can quickly access all the nodes with the same label.

How much does the architecture of data affect the speed of a query

I have the following nodes and relationships in Neo4j database.
The grey and the pink node are furtherly connected with more nodes. Running the following query:
MATCH (n:RealNode {gid:'$obj_id'})-[:CONTAINS*..3]-(z)
RETURN DISTINCT ID(z), z.id,n.id as InternalID"
I get a result very fast (the node n:RealNode is not one of the nodes in the image).
If I increase the depth to 4 like:
MATCH (n:RealNode {gid:'$obj_id'})-[:CONTAINS*..4]-(z)
RETURN DISTINCT ID(z), z.id,n.id as InternalID"
The response gets extremely slow. I will never get a response with depth 5 etc.
The depth 4 is actually the relationship between the blue-pink node. So my question is: can the architecture of data (in this case) affect in such a great level the speed of the query? If yes what should I do?
I have tried to run the query also using parameters but the result was the same. Also the gid of n:RealNode is an indexed value.
The architecture of your data has a huge, no...massive impact on query performance. There's a lot you can do with improving performance by reformulating your query, but you can do even more than that by changing your data model.
The model needs to be chosen in a way that's an accurate depiction of the real-world domain, but it often also has to make certain concessions to usage patterns. If you know you're going to do certain queries over and over, it makes sense to choose a data model that makes it easy for the DBMS to answer that query. In the RDBMS world, that entire line of thinking gets summarized in the word "denormalization". In graph databases, the concept is the same but the way you go about it is different.
The thing to keep in mind when adjusting your data model is that neo4j is good at traversing relationships fast, and that with all queries, the less data you have to consider, the faster the query will go.
So in your case, I don't know how many nodes branch off of each other node by a :CONTAINS relationship, but I'm guessing that at each level of the hierarchy you have many items below it. So going from level 4 to level 5 probably doesn't just add a fixed number of additional nodes, but if say each level of the hierarchy has 3x the number of nodes as the level above, the deeper you go, the more you're multiplying how much data you have to consider. If it's 10x...then ouch.
You have many different options. One is to create short-cut relationships, and "pre-materialize" the query. Imagine creating :grandfather and :greatgrandfather relationships to "hop" levels of the tree. That would make it faster. Another way would be to filter intermediate nodes, or the return nodes, so that you're not considering everything, but some subset.
In the end, really huge queries will always take longer than really small ones. You must first begin with a careful understanding of what data you want, and how often you have to run this query. I would not attempt to optimize your data model for infrequently run queries, but if you do this all the time, you should look at your options. Your query to me looks like it's going to return a whole lot of data no matter what you do.

Query templating in cypher? How to avoid repeating myself

My group has many queries that tend to refer to a class of relationship types. So we tend to write a lot of repetitive queries that look like this:
match (n:Provenance)-[r:`input to`|triggered|contributed|generated]->(m:Provenance)
where (...etc...)
return n, r, m
The question has do to with the repetition of the set of different relationship types. Really we're looking for any relationship in a set of relationship types. Is there a way to enumerate a bunch of relationship types into a set ("foo relationships") and then use that as a variable to avoid repeating myself over and over in many queries? This repetitive querying of relationship types tends to create problems when we might add a new relationship type; now many queries distributed through the code base need to all be updated.
Enumerating all possible relationships isn't such a big deal in an individual query, but it starts to get difficult to manage and update when distributed across dozens (or hundreds) of queries. What's the recommended solution pattern here? Query templating?
This is not currently possible as a built-in feature, but it seems like an interesting feature. I would encourage you to post this to the ideas trello board here:
https://trello.com/b/2zFtvDnV/public-idea-board
Perhaps suggesting something like allowing parameters for relationship types:
MATCH (n)-[r:{types}]->(p)
Of course, that makes it much harder for the query engine to optimize queries ahead of time.. A relationship type hierarchy could work, but we are incredibly hesitant to introduce new abstractions to the model unless absolutely necessary. Still, suggestions for improvements are very welcome!
For now, yes, something like you suggest with templates would solve it. Ideally, you'd send the query to neo containing all the relationship types you are interested in, and with other items parameterized, to allow optimal planning. So to do that, you'd do some string replacement on your side to inject the long list of reltypes into the query before sending it off.

Resources