Neo4j single node maximum relationship capacity

I have searched and read the capacity documentation, but I can't find figures for the maximum number of relationships a single node can have.
If I have a user with a great many posts, comments, uploads, etc. related to him, is there a maximum number of relationships that I can attach to him?
Thanks!

There is not really a maximum / limit.
The relationships are stored in separate structures by type and direction.
For some use cases it might make sense to move some information out to a separate node; it depends on the use cases that you want to support with your graph model.
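As a rough illustration of why grouping relationships by type and direction helps with densely connected nodes, here is a toy in-memory sketch. It is not Neo4j's actual storage format; the `Node` class, the `POSTED` and `UPLOADED` types, and the counts are all made up for the example.

```python
from collections import defaultdict


class Node:
    """Toy node that keeps its relationships grouped by (type, direction)."""

    def __init__(self, name):
        self.name = name
        # (relationship type, direction) -> list of neighbouring nodes
        self.rels = defaultdict(list)

    def connect(self, rel_type, other):
        """Create a directed relationship self -[rel_type]-> other."""
        self.rels[(rel_type, "OUT")].append(other)
        other.rels[(rel_type, "IN")].append(self)

    def neighbours(self, rel_type, direction):
        """Return only the neighbours in the requested group."""
        return self.rels.get((rel_type, direction), [])


user = Node("alice")
for i in range(1000):
    user.connect("POSTED", Node(f"post-{i}"))
for i in range(5):
    user.connect("UPLOADED", Node(f"upload-{i}"))

# Reading the 5 uploads does not touch the 1000 POSTED relationships.
uploads = user.neighbours("UPLOADED", "OUT")
print(len(uploads))  # 5
```

In this sketch, a lookup for one type and direction only visits that group, so a user with many posts does not slow down a query about uploads.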

Related

Neo4J user profile data modeling

I need to design a data model that stores user profile information. A user node may contain name, address, and telephone as attributes. The number of users is expected to increase dramatically.
At the same time, I want to store each user's skills and hobbies, which are entered by the users themselves during profile creation.
One user can enter multiple skills and hobbies. Of course multiple users may share a certain hobby or a skill.
We also have a requirement to filter users by skill or hobby. That is, if the hobby (Badminton) is given, we need to find all the users who like Badminton.
Would it make sense to create hobbies and skills as nodes? My understanding is that this would improve query performance, but the distinct hobbies and skills that users enter would increase the number of nodes in the database.
Would it be good to store skills and hobbies as attributes of user nodes? My understanding is that searching by an attribute over the whole user base would decrease query performance.
Please advise.
Thank you.
The answer is in your phrasing of the question.
Skills and hobbies are shared amongst users
Find all users who like Badminton
This clearly indicates that skills and hobbies are entities, that is, nodes. There's no need to optimise prematurely: the number of nodes will increase but should level off at some point (the number of skills and hobbies is not infinite). Also, query performance is not tied to the total size of the graph, so it may not really matter that the graph has grown; performance depends on the size of the subgraph touched. Unless you're looking at tens or hundreds of billions of nodes, it is pretty safe to add skills and hobbies as nodes and not worry about performance at this stage.
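To make the difference between the two options concrete, here is a small in-memory sketch in Python. It only illustrates the access pattern and is not Neo4j code; the user names, the `LIKES` relationship, and the helper functions are hypothetical.

```python
from collections import defaultdict

# Option A: hobbies stored as a property on each user node.
users = {
    "alice": {"hobbies": ["Badminton", "Chess"]},
    "bob": {"hobbies": ["Chess"]},
    "carol": {"hobbies": ["Badminton"]},
}


def users_by_hobby_scan(hobby):
    """Check every user: the work grows with the size of the user base."""
    return sorted(name for name, props in users.items() if hobby in props["hobbies"])


# Option B: each hobby is its own node with incoming LIKES relationships.
likes_in = defaultdict(set)  # hobby node -> users that LIKE it
for name, props in users.items():
    for hobby in props["hobbies"]:
        likes_in[hobby].add(name)


def users_by_hobby_node(hobby):
    """Start at the hobby node and follow only its own relationships."""
    return sorted(likes_in.get(hobby, ()))


print(users_by_hobby_scan("Badminton"))  # ['alice', 'carol']
print(users_by_hobby_node("Badminton"))  # ['alice', 'carol']
```

Both functions return the same users, but the second one only touches the subgraph around the Badminton node, which is the point made above about performance depending on the subgraph touched.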

Why it is not recommended to index relationships in a graph database

In the book Neo4j in Action by Aleksa Vukotic and Nicki Watt, the authors say:
In our experience, it is less common for relationship indexes to be good solutions. We are not saying that relationship indexing is poor practice, but if you find yourself adding lots of relationship indexes, it is worth asking why.
It sounds as though the authors do not recommend indexing relationships in a graph database, but no explanation is given afterwards. Does anyone know why?
I've voted for this question to be migrated to SO, and I'm answering it in the hope that it really will be migrated. I used Neo4j for a couple of years. Although it has changed a lot since then, I believe the principles of a graph database won't have altered much. In my opinion, if you need a lot of indices to query the relationships between nodes quickly, you could probably have designed your data model differently so that it focuses more on the graph nodes (for example, by turning relationships into nodes and nodes into relationships, as in a line graph), because the querying mechanism (e.g. Cypher) is generally optimised for nodes.
First, it's important to understand the role of indexes in Neo4j, in that indexes are used to find starting points in the graph, after which relationship traversal and filtering are used to perform the remainder of the pattern matching and to complete the query.
The advice therefore is about the same as: "we do not recommend using relationships as starting points in the graph", and we find that true more often than not.
Usually when you need to do index lookups, you have certain "things" in mind as your starting places, and important things in graphs are typically represented by nodes. If we're asking "what employees are connected to this particular company" we're interested in starting quickly by finding that particular company and expanding out, not in finding all :EMPLOYED_BY relationships in the graph and filtering by the connected company, which would take far more time.
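The company example can be sketched in plain Python to show the two strategies side by side. This is an illustration only, not how Neo4j executes queries; the company names, node ids, and data structures are invented for the example.

```python
# Index on company name: the lookup that finds the starting point.
company_index = {"Acme": "c1", "Globex": "c2"}

# Every relationship in the toy graph as (start, type, end).
all_relationships = [
    ("ann", "EMPLOYED_BY", "c1"),
    ("ben", "EMPLOYED_BY", "c1"),
    ("cy", "EMPLOYED_BY", "c2"),
    ("ann", "KNOWS", "cy"),
]

# Incoming EMPLOYED_BY relationships grouped per company node.
employed_by_in = {}
for start, rel_type, end in all_relationships:
    if rel_type == "EMPLOYED_BY":
        employed_by_in.setdefault(end, []).append(start)


def employees_from_company(company_name):
    """Index lookup for the company, then expand its own relationships."""
    company = company_index[company_name]
    return employed_by_in.get(company, [])


def employees_by_relationship_scan(company_name):
    """Visit every relationship in the graph and filter by the company."""
    company = company_index[company_name]
    return [
        start
        for start, rel_type, end in all_relationships
        if rel_type == "EMPLOYED_BY" and end == company
    ]


print(employees_from_company("Acme"))  # ['ann', 'ben']
print(employees_by_relationship_scan("Acme"))  # ['ann', 'ben']
```

The results match, but the scan has to look at every relationship in the graph, while the first approach only looks at the relationships attached to the company it started from.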
Often we find that those who encounter this restriction, and need this kind of fast lookup of relationships anyway, may need to rethink their model. When there is a need to look up relationships as starting places in the graph, it is often an indication that the thing represented by a relationship is important enough that it really should be a node in the graph (with its own relationships to the previously connected nodes), so this becomes a "modeling smell" that drives refactoring changes to the model. This kind of change often feels more natural afterwards, and it gives the thing capabilities as a node that weren't available when it was modeled as a relationship (for example, the ability to apply multiple labels to it, or to connect it via relationships to more nodes than just the original two).
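A minimal sketch of that refactoring, using plain Python data structures rather than Neo4j itself, might look like the following. The `Employment` node, its labels, and the `HAS_EMPLOYMENT`, `AT_COMPANY`, and `BASED_AT` relationship types are hypothetical names chosen for the illustration.

```python
# Before: employment is a relationship with properties between two nodes.
relationship_model = [
    ("ann", "EMPLOYED_BY", "acme", {"role": "engineer"}),
]

# After: employment is promoted to a node of its own.
nodes = {
    "ann": {"labels": {"Person"}},
    "acme": {"labels": {"Company"}},
    "london": {"labels": {"Office"}},
    # A node can carry several labels, which a relationship cannot.
    "emp1": {"labels": {"Employment", "Contract"}, "role": "engineer"},
}
edges = [
    ("ann", "HAS_EMPLOYMENT", "emp1"),
    ("emp1", "AT_COMPANY", "acme"),
    # A node can also connect to more than the original two endpoints.
    ("emp1", "BASED_AT", "london"),
]


def neighbours(node_id):
    """Return the nodes reachable through outgoing edges of node_id."""
    return sorted(end for start, _, end in edges if start == node_id)


print(neighbours("emp1"))  # ['acme', 'london']
print(sorted(nodes["emp1"]["labels"]))  # ['Contract', 'Employment']
```

Once employment is a node, it can be found through an ordinary node index and used as a starting point like any other node.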
All that said, there will be cases where a relationship really does just need to be a relationship (either for business reasons, or because it truly is most practical modeling-wise for it to be kept as a relationship), and using those relationships as starting points in the graph makes sense.
With the fulltext schema indexes introduced in Neo4j 3.5, we added the capability to add relationship indexes by relationship type(s) and property (or properties). So the capability is there, if needed, after you've ruled out refactoring of your model.

Time Based Graph Data Modeling

I have a data modeling question. The data that I have is basically nodes with relations to other nodes. Nodes have properties. Edges are directional and have properties. I am exploring if a Graph DB like Neo4j will be appropriate or not.
The doubt is because: The data that I have is time based. It changes on the basis of time, and I need to keep track of the historical data as well. For example, I should be able to query:
What was the graph like on a particular date?
Who all did a given node depend on at a particular time?
What were the properties of the edge between two given nodes at a particular time?
I searched but couldn't find a satisfactory resource where I could understand how time can be factored into a Graph DB. Do you think my requirement can be inherently met using a Graph DB? Is there an example/resource/article which describes this for Neo4j or any other graph db?
I want to make sure that the database is scalable to about 100K nodes, and millions of edges. I am optimizing for time over space.
Is there an example/resource/article which describes this for Neo4j or
any other graph db?
Here is an excellent article from Ian Robinson's blog about time-based versioned graphs.
Basically, the article describes a way to represent a time-based versioned graph by adding some extra nodes and timestamped relationships that capture the state of the graph at a given timestamp.
The example in the referenced article shows two kinds of change:
The price of product_id : 1 has changed from 1.00 to 2.00. This is a state change.
The product_id : 1 is now sold by shop_id : 2 (and not by shop_id : 1). This is a structural change.
Do you think my requirement can be inherently met using a Graph DB?
Yes, but not in an easy or "natural" way. Versioning a time-based model in a database that doesn't offer this functionality natively can be hard and expensive. From the article:
Neo4j doesn’t provide intrinsic support either at the level of its
labelled property graph model or in its Cypher query language for
versioning. Therefore, to version a graph we need to make our
application graph data model and queries version aware.
and
versioning necessarily creates a lot more data – both more nodes and
more relationships. In addition, queries will tend to be more complex,
and slower, because every MATCH must take account of one or more
versioned elements. Given these overheads, apply versioning with care.
Perhaps not all of your graph needs to be versioned. If that’s the
case, version only those portions of the graph that require it.
EDIT:
A few words from the book Graph Databases (by Ian Robinson, Jim Webber and Emil Eifrem) about versioning in graph databases. This book is available for download from the Neo4j website:
Versioning:
A versioned graph enables us to recover the state of the
graph at a particular point in time. Most graph databases don’t
support versioning as a first-class concept. It is possible, however,
to create a versioning scheme inside the graph model. With this scheme
nodes and relationships are timestamped and archived whenever they are
modified. The downside of such versioning schemes is that they leak
into any queries written against the graph, adding a layer of
complexity to even the simplest query.
This paragraph links to the article mentioned at the beginning of this answer.
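To give a feel for what "timestamped relationships" means in practice, here is a small Python sketch of the structural change from the shop example. It is a simplification under stated assumptions, not the article's actual model: the `valid_from` / `valid_to` property names, the integer timestamps, and the `SELLS` type are invented for illustration.

```python
from dataclasses import dataclass

END_OF_TIME = float("inf")  # marks a relationship that is still current


@dataclass
class VersionedRel:
    start: str
    rel_type: str
    end: str
    valid_from: int
    valid_to: float = END_OF_TIME


# shop:1 sold the product until time 5, when shop:2 took over.
rels = [
    VersionedRel("shop:1", "SELLS", "product:1", valid_from=1, valid_to=5),
    VersionedRel("shop:2", "SELLS", "product:1", valid_from=5),
]


def active_at(timestamp):
    """Relationships whose validity interval contains the timestamp."""
    return [r for r in rels if r.valid_from <= timestamp < r.valid_to]


def sellers_of(product, timestamp):
    """Who sold the product at the given time?"""
    return [
        r.start
        for r in active_at(timestamp)
        if r.rel_type == "SELLS" and r.end == product
    ]


print(sellers_of("product:1", 3))  # ['shop:1']
print(sellers_of("product:1", 7))  # ['shop:2']
```

Note how every query has to filter on the validity interval. This is the "leak" into queries that the quoted passages warn about.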

Data Partitioning in Neo4j

I'm playing around with neo4j - seeing what I can and can't do with it before suggesting it for something serious. One of the things I'm looking at now is Data Partitioning. By this I mean having a single data store that contains data from many different customers, and knowing which customer the data belongs to.
In the SQL world, we've always done this by having a customer_id field on the tables that are customer specific, and then always including that in the queries and indices. This works perfectly well for us, but in the Graph DB world it feels like we can do better.
The options that I've come up with so far are:
1. The same as before: including a property on the nodes that is the Customer ID.
2. Storing a Label on each Node that identifies the Customer. However, as far as I can tell you can't bind parameters to labels, so this would mean that the queries are generated slightly awkwardly.
3. Storing a Customer Node, and linking all of the other nodes to it.
Option 3 seems to be the "correct" Graph DB way of managing this, but I'm concerned about its impact on performance. It's perfectly feasible that there will be hundreds of thousands of links from a single Customer Node to the other data nodes, and there will be hundreds of different Customer Nodes (based on the volume of data in the existing SQL database).
What's the recommended way of achieving this level of data partitioning whilst maintaining performance?

Permissions to be stored as a Node or a property

We have six different types of permissions for content nodes. If we want to query neo4j for the content by the permission type, is it better to store the permissions as an attribute for each content node, or as a separate node to which each piece of content has a relationship?
This is a good data modeling question, and the truth is it depends.
I'm personally in favor of storing them as separate nodes, so you don't have to traverse all nodes (or at least all user nodes) in order to find all the permissions you are looking for, especially if you start to get a lot of users and will be looking for all users of permission X.
This also adds a level of normalization, as well as the ability to perform counts easily.
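The point about counts can be illustrated with a short Python sketch. It is a toy model, not Neo4j code; the content ids, the permission names, and the `HAS_PERMISSION` relationship are assumptions made for the example.

```python
# HAS_PERMISSION relationships from content nodes to permission nodes.
has_permission = [
    ("doc1", "read"),
    ("doc2", "read"),
    ("doc3", "admin"),
]

# Incoming relationships grouped per permission node.
incoming = {}
for content, permission in has_permission:
    incoming.setdefault(permission, []).append(content)


def content_with(permission):
    """Start at the permission node and follow its incoming relationships."""
    return incoming.get(permission, [])


def count_with(permission):
    """The count is simply the number of relationships on that node."""
    return len(incoming.get(permission, []))


print(content_with("read"))  # ['doc1', 'doc2']
print(count_with("read"))  # 2
```

With only six permission types, each permission node becomes a natural starting point for both listing and counting the content attached to it.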
