Neo4j user profile data modeling

I need to design a data model that stores user profile information. A User node may contain name, address, and telephone as attributes. The number of users is expected to grow dramatically.
At the same time, I want to store each user's skills and hobbies, which are entered by the users themselves during profile creation.
One user can enter multiple skills and hobbies, and of course multiple users may share a given skill or hobby.
We also have a requirement to filter users by skill or hobby. That is, given a hobby (Badminton), we need to find all the users who like Badminton.
Would it make sense to create hobbies and skills as nodes? My understanding is that this would improve query performance, but the number of distinct hobbies and skills users happen to enter may increase the number of nodes in the database.
Or would it be better to store skills and hobbies as attributes of the user nodes? My understanding is that searching by an attribute across the whole user base would hurt query performance.
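For illustration, the attribute-based option would boil down to something like the query below (treating hobbies as a hypothetical list property on User), which has to inspect every User node:

// Hobbies stored as a list property on each User node (hypothetical model)
MATCH (u:User)
WHERE 'Badminton' IN u.hobbies
RETURN u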
Please advise.
Thank you.

The answer is in your phrasing of the question.
"Skills and hobbies are shared amongst users"
"Find all users who like Badminton"
This clearly indicates that skills and hobbies are entities, i.e. nodes. There's no need to optimise prematurely: the number of nodes will grow, but it will level off at some point (the number of distinct skills and hobbies is not infinite). Also, query performance depends on the size of the subgraph a query touches rather than on the total size of the graph, so it may not really matter that the graph has grown. Unless you're looking at tens or hundreds of billions of nodes, it is pretty safe to model skills and hobbies as nodes and not worry about performance at this stage.
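As a rough sketch of what the node-based model could look like in Cypher (the labels, relationship type and property names here are only illustrative):

// Create (or reuse) a shared Hobby node and connect a user to it
MERGE (h:Hobby {name: 'Badminton'})
MERGE (u:User {name: 'Alice'})
MERGE (u)-[:HAS_HOBBY]->(h);

// "Find all the users who like Badminton" then anchors on a single Hobby node
MATCH (h:Hobby {name: 'Badminton'})<-[:HAS_HOBBY]-(u:User)
RETURN u.name;

The same pattern works for skills; in practice you would also want an index or uniqueness constraint on the Hobby/Skill name property so the anchoring lookup stays fast as the graph grows.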

Related

Dimensional Modeling: app session or activity measures

I am trying to answer the questions below, posed by the business (the business generates revenue from multiple apps through a customer-pay model). The business is interested in:
new users (trend with respect to previous months)
daily active users
Day 1 retention
I came up with the following dimensional model:
Dimension: users, app, deviceid, useractions, plan, date
Fact: fact_activity(userid, appid, deviceid, actionid)
Actions could be: app installed, app launch, registered, completed purchase, postedcomments, playgame, etc.
The questions I have are:
Should the fact table contain action_type instead of actionid (to avoid a join with useractions)?
Definition of Day 1 retention: number of apps installed / app launches the next day. How do I avoid counting a single user multiple times when they use multiple devices?
Would it be advisable to have device details in the user dimension, or keep them separate?
If I need to measure average session duration, should I use another fact table at the session level or tweak the activity fact?
Your questions are really unanswerable without significantly more information about your business processes, data definitions, etc. In effect, you are asking someone to design a dimensional model for you before they can answer your questions, which is obviously not going to happen.
However, I can give you some very generic pointers that may help you:
Dimensions
A dimension describes an entity, so if attributes can't be described as belonging to the same entity then they shouldn't be in the same dimension. In your case, I assume a Device and a User are not the same thing, and therefore they need to be separate dimensions.
Facts
You need to define your measures, i.e. precisely what things you are going to want to aggregate (count, sum, avg, etc.) and how they are defined/calculated.
For each measure, you also need to define its grain, i.e. the minimum set of dimensions that uniquely identifies it. Once the grain is defined, measures that share the same grain can be held in the same fact table; measures with different grains can't.

Handling contract extensions and license/subscription additions and removals in a dimensional model

Background: I am trying to design a star schema for a data warehouse. We have the following business model: we have a few products that our customers can buy and then use. The customers are companies, and they have people in their organization who can be mapped to the licenses they have bought for products.
I have the following dimensions.
Account_dim: contains the list of companies that are our current or prospective customers. It can include companies that don't yet have a contract with us and are still in the discussion phase, so some rows might not have a contract.
User_dim: the list of users a company has nominated as points of contact. A user belongs to one particular account in Account_dim; one account can have many users.
Product_Dim: contains all the information about the products we sell, including the cost of a license and how many users are allowed on it. So if, for example, a customer bought product A, a maximum of two users can use it.
Now I have three tables that hold the contract data.
Contract: contains information about a contract, including its start date, end date, and the account the contract is assigned to.
products_bought: contains the products bought under a contract. A contract can hold multiple products. Each product row has the product start/end dates and the price the client paid for the asset.
allocated users: each product bought can have users mapped to it who are allowed to use the product (users from User_dim for that account). Basically, this attaches a license to a user.
I am trying to model the contract, products bought and allocated users so I can generate the following data:
The amount of money an account has spent on products.
The utilization of licenses by an account. For example, if an account has a product that allows 3 users but only one user mapped to it, the product is under-utilized.
I tried denormalizing all three tables into one fact table, but I am running into the problem that the contract end date can change if the contract is extended, and new assets can be mapped to it. Last but not least, the company can remove a user and map another user to the product, remove users because they left the company, or add more users.
How can this best be modeled? Because the contract and allocated users can change, should they be slowly changing dimensions (SCDs) rather than a fact table? Or how should I implement a fact table that handles these changes, which must also be captured to maintain a history of usage over time?
Your best bet is to read a book on how to go about designing a data warehouse, such as The Data Warehouse Lifecycle Toolkit, as this will give you all the information you need to be able to answer questions like this.
However, to specifically address your question, the best way to approach this is as follows:
Define your measures: what are the values that you wish to be able to aggregate in your reports?
Define the grain of each measure: what are the dimensions that uniquely identify it? For example, a transaction amount might be defined by Store, Customer and Date/Time; if you dropped any of these then the transaction amount would change, while adding another dimension, such as rainfall, would not change it (n.b. having defined the grain of a measure you should never add dimensions that would change the grain, e.g. a Product dimension in this example).
Once you have defined your measures and their grains you can add all the other dimensions to them (those that won't affect their grain) and then decide whether to hold them in separate fact tables or combine them into one fact table:
Rule: if two measures don't have the same grain you must not put them in the same fact table
Guidance: for measures that meet the above rule, if there is also significant overlap in the other dimensions you want to use for each measure then consider combining them into a single fact table. My rule of thumb is that if you have 2-3 dimensions that don't apply to all measures then that's OK; if you hit 5 or more then you probably need to be thinking of splitting the measures into separate facts

Duplicating relations vs executing more queries

I have the following architecture.
You will notice a duplication of the HAS relationship. The main one is between Badge and Skill, as I want to be able to aggregate/count the same Skill across different Badges of the same User.
The duplicate relationship is between User and Skill. That is because, for instance, if an Organization wants to know all the skills of one or more recipients, I would follow this path:
Org -OWNS-> Badges -IS_AWARDED_TO-> User -HAS-> Skill
// Skill nodes for one or more specific users represent each skill contained in every Badge the user was awarded.
However, if I do not add the duplicated HAS relationship between User and Skill, I would follow this path instead:
Org -OWNS-> Badges -IS_AWARDED_TO-> User -IS_AWARDED-> Badges -HAS-> Skill
// Now I have all the skills for one or more specific users, for every badge awarded.
The difference between the two paths is obvious. The first results in fewer queries, but the duplicated relationship is a concern. The second removes the duplication (is it even a problem?) but requires more queries. I am still a newbie to Neo4j, so feel free to tell me that both of my approaches are convoluted and there is a more optimized way to achieve what I am trying to do.
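In Cypher, the two reads would look roughly like this (relationship directions copied from the paths above; labels and parameters are illustrative):

// Model 1: using the duplicated User -HAS-> Skill shortcut
MATCH (o:Org {id: $orgId})-[:OWNS]->(:Badge)-[:IS_AWARDED_TO]->(u:User)-[:HAS]->(s:Skill)
RETURN u, collect(DISTINCT s)

// Model 2: going back out through the badges to reach the skills
MATCH (o:Org {id: $orgId})-[:OWNS]->(:Badge)-[:IS_AWARDED_TO]->(u:User)-[:IS_AWARDED]->(:Badge)-[:HAS]->(s:Skill)
RETURN u, collect(DISTINCT s)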
Your two models are valid, and you can use both of them.
But as you said, in the first one you duplicate some data. Generally we do that when we have performance issues. Is that your case right now?
As a starting point, I recommend model 2 (i.e. without duplication); if you run into issues with it, you can easily change to model 1 (Neo4j's flexibility is really great for graph refactoring!).
In IT, nothing is free: if you duplicate some data to get better read performance, there will be an impact on writes.
When you write a (badge)-[:HAS]->(skill) relationship, you also need to create a (user)-[:HAS]->(skill) rel (same for update or delete).
So you need to keep this data consistent when you update the graph. In effect, it's as if you were maintaining a SQL materialized view by hand.
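A sketch of what the model 1 write path might look like (labels from the diagram above, parameters as placeholders):

// Whenever a badge gains a skill, maintain the user shortcut in the same transaction
MATCH (u:User {id: $userId})-[:IS_AWARDED]->(b:Badge {id: $badgeId})
MERGE (s:Skill {name: $skillName})
MERGE (b)-[:HAS]->(s)
MERGE (u)-[:HAS]->(s)

Deletes are the harder side of that consistency work: the (user)-[:HAS]->(skill) shortcut can only be removed once no remaining badge of that user still carries the skill.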

neo4j single node maximum relationship capacity

I have searched and read the capacity documentation, but I can't find figures on the maximum number of relationships a single node can have.
If I have a user with many posts, comments, uploads, etc. related to him, is there a maximum number of relationships that I can attach to him?
thanks!
There is not really a maximum / limit.
The relationships are stored in separate structures by type and direction.
For some use cases it might make sense to split some information out into a separate node; it depends on the use cases you want to support with your graph model.
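For example (labels and relationship types here are just illustrative), a traversal that names its relationship type can take advantage of that grouping by type and direction, no matter how many other relationships the node carries:

// Only the outgoing :POSTED relationships of this user are expanded here,
// even if the same node also has millions of :COMMENTED or :UPLOADED relationships
MATCH (u:User {userId: 123})-[:POSTED]->(p:Post)
RETURN count(p)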

Neo4j graph model for a social network

I've created a graph model for a social network and need some concrete advice on the design with regard to scaling. Pardon the n00bness of these questions, but I'm not finding many clear examples out there...
NOTE: the status update and activity nodes/relationships form linked lists, with the newest entries constantly being placed at the head of the list.
Linked lists allow for news feed generation, but there could be hundreds of records per user. I presume a LIMIT clause isn't sufficient, even though the data is in descending order by date. Do I have to keep a separate linked list that holds only the most recent 10 status/activity updates (constantly replacing the head of that list) to get better activity feed generation, or will one properly sorted list do the job (with a LIMIT clause)?
These nodes all have properties (JSON data with content, IDs, etc.). How do "global" indexes come into play here so that I can find, for example, users who like Depeche Mode without waiting a lifetime for results? I know how to add a node to an index; I'm just wondering if I'm missing part of the picture here.
Security: logins and passwords. I presume a graph database could store them, but I'd also presume it's a security risk at this point. Would it be better to keep this in Postgres, etc.?
How would you improve this model to handle scalability? Imagine 20 million users banging away on this.
Imagine 40 million users - what's wrong with this model when it comes to scalability?
Part 1.
You can write Cypher or Gremlin queries that do what you want. Remember that you can traverse edges forwards and backwards. Given a user, it should always take roughly constant time to pull up the last ten things they did.
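As a sketch, with a hypothetical linked-list layout where the newest status hangs off the user via :LATEST and older entries chain through :NEXT, pulling the last ten is a bounded walk regardless of how long the full list is:

// Walk at most ten hops down the chain, newest first
MATCH (u:User {username: 'alice'})-[:LATEST]->(head:Status)
MATCH p = (head)-[:NEXT*0..9]->(s:Status)
RETURN s
ORDER BY length(p)
LIMIT 10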
Part 2.
If you are representing a band as an entity of a certain type, index on that attribute. Then you'll be able to pull out that node and traverse outwards to find all the users who like that band. If you don't have an independent entity, or it is somehow implicit, you'll want to enable full-text search in your graph database.
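A sketch, assuming bands are modelled as their own nodes with a name property and users connect to them via something like :LIKES (the index syntax shown is current Cypher; older Neo4j versions used legacy indexes):

// Index the property you anchor on, then fan out from that single node
CREATE INDEX band_name IF NOT EXISTS FOR (b:Band) ON (b.name);

MATCH (b:Band {name: 'Depeche Mode'})<-[:LIKES]-(u:User)
RETURN u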
Part 3.
Learn more about security. The only thing you would be storing is a properly hashed version of the user's password. At that point you would be fine using any graph DB, together with good security practices.
Part 4/5.
Once you have one user, worry about the next thousand.
When you have a thousand users, worry about the next hundred thousand.
When you have one hundred thousand, worry about the next million.
When you have a million users, you can start worrying about the questions you asked.
Until you have at least 0.1% of the users/volume you want to scale to, it's mental masturbation to try and ask questions about how to scale up to a certain size.
