Why it is not recommended to index relationships in a graph database

Why it is not recommended to index relationships in a graph database - neo4j

In the book Neo4j in Action by Aleksa Vukotic and Nicki Watt, the authors say:
In our experience, it is less common for relationship indexes to be good solutions. We are not saying that relationship indexing is poor practice, but if you find yourself adding lots of relationship indexes, it is worth asking why.
It sounds that the authors do not recommend to index relationship in a graph database but no explanation is given thereafter. Does anyone know why?

I've voted for this question to be migrated to SO, and answering it while hoping it to be really migrated. I used Neo4j a couple of years. Although it has changed a lot since then, the principles of being a graph database won't alter much I believe. In my opinion, if you need a lot of indices to promptly query the relationships between the nodes, you could have designed your data model in some other way such that it focuses more on the graph nodes (just for example, relationships being your nodes, and nodes being your relationships as in line graph); because the querying mechanism (e.g. Cypher query) is generally optimised for the nodes.

First, it's important to understand the role of indexes in Neo4j, in that indexes are used to find starting points in the graph, after which relationship traversal and filtering are used to perform the remainder of the pattern matching and to complete the query.
The advice therefore is about the same as: "we do not recommend using relationships as starting points in the graph", and we find that true more often than not.
Usually when you need to do index lookups, you have certain "things" in mind as your starting places, and important things in graphs are typically represented by nodes. If we're asking "what employees are connected to this particular company" we're interested in starting quickly by finding that particular company and expanding out, not in finding all :EMPLOYED_BY relationships in the graph and filtering by the connected company, which would take far more time.
Often we find that those who encounter this restriction, and need this kind of fast lookup of relationships anyway, may need to rethink their model. Often when there is a need to lookup relationships as starting places in the graph, it is an indication that the thing represented by a relationship is important enough that it really should be a node in the graph (with its own relationships to the previously connected nodes), so this becomes a "modeling smell" that drives refactoring changes to the model. Often this kind of change feels more natural after, and affords more capability for the thing as a node that wasn't available when it was being modeled as a relationship (for example, the ability to apply multiple labels to it, or to connect it via relationships to more nodes than just the original two).
All that said, there will be cases where a relationship really does just need to be a relationship (either for business reasons, or because it truly is most practical modeling-wise for it to be kept as a relationship), and using those relationships as starting points in the graph make sense.
With the fulltext schema indexes introduced in Neo4j 3.5, we added the capability to add relationship indexes by relationship type(s) and property(or properties). So the capability is there, if needed, after you've ruled out refactoring of your model.

Related

What is the time complexity of search query in Graph database?

What is the time complexity of search query in Graph database (especially Neo4j) ?
I'm having relational data with me. I'm confused to use a Relational database or Graph database to store that data. So, I want to store the data based on the performance and time complexity of the queries for that particular database. But, I'm unable to find the performance and time complexity of the queries for Graph database.
Can anyone help me out ?

The answer isn't so simple because the time complexities typically depend upon what you're doing in the query (the results of the query planner), there isn't a one-size-fits-all time complexity for all queries.
I can speak some for Neo4j (disclaimer: I am a Neo4j employee).
I won't say much on Lucene index lookups in Neo4j, as these are typically performed only to find starting nodes by index, and represent a fraction of execution time of a query. Relationship traversals tend to be where the real differences show themselves.
Once starting nodes are found, Neo4j walks the graph through relationship traversal, which for Neo4j is basically pointer chasing through memory, which tends to be where native graph dbs outperform relational dbs: The cost of chasing pointers is constant per traversal.
For relational dbs (including graph layers built on top of relational dbs), relationship traversal is usually accomplished by various table join algorithms, which have their own time complexity, typically O(log n) when the join is backed by a B-tree index. This can be quite efficient, but we are in the age of big data and data lakes, so data is getting larger, and the efficiency of the join does grow based on the data in the tables being joined. And this is fine for smaller numbers of joins, but there are some queries that require many joins (sometimes we won't have an upper bound on when to stop joining), and we may want to traverse through many kinds of tables/nodes (and sometimes we may not care what kind they are), so we may need the capability to join to or through any table arbitrarily, which isn't easily expressed or performed in a relational db.
This makes native graph databases like Neo4j well-positioned for handling queries over connected data, especially with a non-trivial or growing number of relationship traversals (or if the traversals are unbounded, such as for reachabilty queries, shortest-path, and others). The cost for queries is associated with the part of the graph touched or walked by the query, not the total number of nodes in the database, so it helps when you can adequately constrain the query to touch the smallest possible subgraph in the db.
As far as your question of whether to use a relational or graph database, that depends upon the connectedness of your data and the queries you plan to run.
If you do decide upon a graph database, then you have choices here as well, and a different set of criteria, such as native vs non-native implementation (Neo4j is a native graph database and takes advantage of index-free adjacency for relationship traversals), whether you need ACID (Neo4 is an ACID database), and if a rich and expressive query language is desired (Cypher is Neo4j's query language, feel free to learn and compare against others).
For more in-depth info, here's a DZone article on Why Graph Databases Outperfrom RDBMS on Connected Data, and an article on Explaining Graph Databases to a Developer by Neo4j's Chief Scientist Jim Webber.

Actually, the most probable scenario is to use both Neo4j and some DBMS (relational or nosql like Mongo). Because it is too hard to store all dataset in Neo4j.
Speed-wise traversing nodes in DBMS is 10-100++ times slower than Neo4j. Also Neo4j has built-in shortestPath (and many other) methods.
Also, can mention hybrid solutions, like ArangoDB. It has graph engine + document-based engine. But under the hood it is two separate tables, so it is almost as inconvenient as Neo4j+DBMS.

It would actually depend upon the size of your data and the complexity.
In Graph databases like neo4j the time complexity is depends on the kind of query and on the planners(executors) behind the query. Graph database perticularly Neo4j performs easier JOINS which give us a clear view of data.
For more information please visit this reference blog by Neo4j.
And you can also refer to this question as it is similar to yours.
Hope this helps!

Time Based Graph Data Modeling

I have a data modeling question. The data that I have is basically nodes with relations to other nodes. Nodes have properties. Edges are directional and have properties. I am exploring if a Graph DB like Neo4j will be appropriate or not.
The doubt is because: The data that I have is time based. It changes on the basis of time, and I need to keep track of the historical data as well. For example, I should be able to query:
What was the graph like on a particular date?
Who all did a given node depend on at a particular time?
What were the properties of the edge between two given nodes at a particular time?
I searched but couldn't find a satisfactory resource where I could understand how time can be factored into a Graph DB. Do you think my requirement can be inherently met using a Graph DB? Is there an example/resource/article which describes this for Neo4j or any other graph db?
I want to make sure that the database is scalable to about 100K nodes, and millions of edges. I am optimizing for time over space.

Is there an example/resource/article which describes this for Neo4j or
any other graph db?
Here is an excellent article from Ian Robinson blog about time-based versioned graphs.
Basically the article describes a way to represent a time-based versioned graphs adding some extra nodes and timestamped relationships to represent the state of the graph in a given timestamp.
The following image from the referenced article shows:
The price of produc_id : 1 has changed from 1.00 to 2.00. This is a state change.
The product_id : 1 is now sold by shop_id : 2 (and not by shop_id : 1). This is a structural change.
Do you think my requirement can be inherently met using a Graph DB?
Yes, but not in an easy or "natural" way. Versioning a time based model with a database that don't offers this functionality natively can be hard and expensive. From the article:
Neo4j doesn’t provide intrinsic support either at the level of its
labelled property graph model or in its Cypher query language for
versioning. Therefore, to version a graph we need to make our
application graph data model and queries version aware.
and
versioning necessarily creates a lot more data – both more nodes and
more relationships. In addition, queries will tend to be more complex,
and slower, because every MATCH must take account of one or more
versioned elements. Given these overheads, apply versioning with care.
Perhaps not all of your graph needs to be versioned. If that’s the
case, version only those portions of the graph that require it.
EDIT:
A few words from the book Graph Databases (by Ian Robinson, Jim Webber and Emil Eifrem) about versioning in graph databases. This book is available for download at Neo4J page:
Versioning:
A versioned graph enables us to recover the state of the
graph at a particular point in time. Most graph databases don’t
support versioning as a first-class concept. It is possible, however,
to create a versioning scheme inside the graph model. With this scheme
nodes and relationships are timestamped and archived whenever they are
modified The downside of such versioning schemes is that they leak
into any queries written against the graph, adding a layer of
complexity to even the simplest query.
This paragraph links the article indicated in the beginning of this answer.

When to not use neo4j?

Neo4j is a great tool for mapping relational data, but I am curious what under what conditions it would not be a good tool to use.
In which use cases would using neo4j be a bad idea?

You might want to check out this slide deck and in particular slides 18-22.
Your question could have a lot of details to it, but let me try to focus on the big pieces. Graph databases are naturally indexed by relationships. So graph databases will be good when you need to traverse a lot of relationships. Graphs themselves are very flexible, so they'll be good when the inter-connections between your data need to change from time to time, or when the data about your core objects that's important to store needs to change. Graphs are a very natural method of modeling some (but not all) data sources, things like peer to peer networks, road maps, organizational structures, etc.
Graphs tend to not be good at managing huge lists of things. For example, if you were going to build a customer transaction database with analytics (where you need 1 million customers, 50 million transactions, and all you do is post transactions all day long) then it's probably not a good fit. RDBMS is great at that, notice how that use case doesn't exploit relationships really.
Make sure to read those two links I provided, they have much more discussion.

For maintenance reasons, any service aggregating data feeds has until now been well advised to keep their sources independent.
If I want to explore relationships between different feeds, this can be done at application level, using data tracking (for example) user preferences amongst the other feeds.
Graph databases are about managing relationship complexity, but this complexity is in many cases a design choice. Putting all your kids in one bathtub is fine until you drop the soap..

Simple social network design flaw with graph database

I was looking at graph databases and Neo4j. As suggested, I tried to draw a simple social networking graph on white paper and after a few sketches I stucked at some similar points.
At first I designed a social network where "user"s can "like" "post"s.
(u1:User)-[:LIKED]->(p:Post)<-[:POSTED]-(u2:User)
Now I want to notify user2 about the like action and draw this on the white paper.
(u1:User)-[:LIKED]->(p:Post)<-[:POSTED]-(u2:User)
| ^
|__________[:NOTIFY]_________|
I am not sure if it is clear but I just drew a relationship between a node and another relationship which is not possible for graph databases, at least for Neo4j. So I decided, a Like should be a node instead of a relationship. Then my graph turned into this.
(u1:User)-[:CREATAD]->(l:Like)-[:BELONGS_TO]->(p:Post)<-[:POSTED]-(u2:User)
| ^
|__________________[:NOTIFY]________________|
Now everything is OK. Then I added Comments feature to the system as a relationship but when notifications involved, again it turned into a node. And same happened when I added "Liking comments" feature, "Likes to Comments" first seemed they are relationships but once again they turned into nodes when notifications involved.
In general, at some point I find myself drawing a relationship between a node and another relationship. My solution to that feels like I am turning entities, which naturally look like relationships, into nodes. And this makes me think of I have some problems with deciding what should be a node and what should be a relationship.
So my question is, does anyone else other than me fall into this "relationship between a node and another relationship" issue and if so how do you solve that?

It all depends on your use-cases, in many cases a simple relationship is good enough but if you want to do more with that entity or fact you turn it into a node, oftentimes it turns out that it is an actually quite important concept in the domain.
In our data modeling class there is a specific section on this and also in the "Graph Databases" book it is discussed in detail (you can get the free PDF here).
Sometimes it makes sense to keep the original relationship around for a fast shortcut crossing over that intermediate node if you don't need that detail.

Can this be accomplished by a Graph Database?

I have a request to develop an application that keep track of the movements of a certain item (or items). To better demonstrate what the application must do, I drew a diagram (simplified abstraction).
As I never worked with any databases other than the relational ones, I really don't know if I can solve this problem with a graph database.
These questions must be answered by the system:
What was the path that a certain pen drive walked?
I passed some pen drivers. Where are they now?
What are the pens I received, from where did they come from and to where did they go?
Where are the pens I burned and passed? And with whom?
Any help and suggestions are much appreciated.
Thanks

In Neo4j everything is either a node or a relationship. So it's useful to think: what would be my nodes and relationships?
Here it might be, for example, that every "pen drive, "person" and "location" is a node. Verbs like "walk" or "give" would be your relationships.
In this model, you'd be able use "Cypher" to query for things like "give me all location nodes connected to pen nodes by the relationship walk." Or, say "start at all person nodes and return nodes who have a give relationship to a pen drive node that doesn't have a give relationship that connects back to the starting person node."
This rich graph query language gives you nice algorithms like shortest distance for free, so you beyond a transactional record you could determine whether, for example, a pen drive made it from A to B using the optimal path. But as you can see above, "relational joins" do not beget simple queries or descriptions thereof.
When it comes to database design, when the model becomes cumbersome to map mentally, it's going to be a pain to develop too. Design your database based on how you plan to query it. If you're unable to easily explain those queries in terms of Neo4j, it's possible that Neo4j isn't going to be the best fit.

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart