Is that normal that nodes with same labels have different properties? - neo4j

In modeling, instances of the same label, i.e Student, have same set of properties. However, is it normal that instances of the same label have different sets of properties. For example, I have Product node:
(p:Product)-[:HAS_ATTRIBUTE]->(a:Attributes)
Different instances of Product result in different instances of Attributes. In this case, different Attributes nodes have very different properties.
Is this modeling normal? Different categories of products can have very different attributes.

It's very useful to have different properties. For instance, I have a Y-DNA project with single nucleotide polymorphisms (SNP) Nodes. Some are on the know haplotree and some are not. So, I set a property InHGTree to Y or blank to reflect this. Now I can more readily create queries using the haplotree branching.
BTW, relationships can also have different properties with the same value. DNA results from an individual are in a "kit." The kit is related to numerous SNPs. You want to be able to determine whether the kit is positive or negative for the SNP. It is most logically to put this fact in the relationship between the kit and the SNP.

It's certainly allowed, as there is no table schema as in relational dbs to enforce homogenous properties.
While this provides great flexibility, it may introduce complexity. It's up to the modelers and administrators of the database to provide any guidelines or implement restrictions, if needed.
While that would usually be in the form of convention, APOC triggers (or kernel extensions if you want to implement this yourself) could be used to enforce only certain properties for a node of a given label.

Related

Neo4j Link prediction ML Pipeline

I am working on a use case predict relation between nodes but of different type.
I have a graph something like this.
(:customer)-[:has]->(:session) (:session)-[:contains]->(:order) (:order)-[:has]->(:product) (:order)-[:to]->(:relation)
There are many customers who have placed orders. Some of the orders specify to whom the order was intended to (relation) i.e., mother/father etc. and some orders do not. For these orders my intention is to predict to whom the order was likely intended to.
I have prepared a Link Prediction ML pipeline on neo4j. The gds.beta.pipeline.linkPrediction.predict.mutate procedure has 2 ways of prediction: Exhaustive search, Approximate search. The first one predicts for all unconnected nodes and the second one applies KNN to predict. I do not want both; rather I want the model to predict the link only between 2 specific nodes 'order' node and 'relation' node. How do I specify this in the predict procedure?
You can also frame this problem as node classification and get what you are looking for. Treat Relation as the target variable and it will become a multi class classification problem. Let's say that Relation is a categorical variable with a few types (Mother/Father/Sibling/Friend etc.) and the hypothesis is that based on the properties on the Customer and the Order nodes, we can predict which relation a certain order is intended to.
Some of the examples of the properties of Customer nodes are age, location, billing address etc., and the properties of the Order nodes are category, description, shipped address, billing address, location etc. Properties of Session nodes are probably not useful in predicting anything about the order or the relation that order is intended to.
For running any algorithm in Neo4j, we have to project a graph into memory. Some of the properties on Customer and Order nodes are strings and graph projection procs do not support projecting strings into memory. Hence, the strings have to be converted into numerical values.
For example, Customer age can be used as is but the order description has to be converted into a word/phrase embedding using some NLP methodology etc. Some creative feature engineering also helps - instead of encoding billing/shipping addresses, a simple flag to identify if they are the same or different makes it easier to differentiate if the customer is shipping the order to his/her own address or to somewhere else.
Since we are using Relation as a target variable, let's label encode the relation type and add that as a class label property on Order nodes where relationship to Relation node exists (labelled examples). For all other orders, add a class label property as 0 (or any other number other than the label encoded relation type)
Now, project a graph with Customer, Session and Order nodes along with the properties of interest into memory. Since we are not using Session nodes in our prediction task, we can collapse the path between Customer and Order nodes. One customer can connect to multiple orders via multiple session nodes and orders are unique. Collapse path procedure will not result in multiple relationships between a customer and an order node and hence, aggregation is not needed.
You can now use Node classification ML pipeline in Neo4j GDS library to generate embeddings and use embedding property on Order node as a feature vector and class label property as target and train a multi class classification model to predict the class that particular order belongs to or the likelihood that particular order is intended to some relation type.
This use case is not supported by the latest stable release of GDS (2.1.11, at the time of writing). In GDS pipelines, we assume a homogeneous graph, where the training algorithm will consider each node as the same type as any other node, and similarly for relationships.
However, we are currently building features to support heterogeneous use cases. In 2.2 we will add so-called context configuration, where you can direct your training algorithm to attempt to learn only a specific relationship type between specific source and target node labels, while still allowing the feature-producing node property steps to use the richer graph.
This will be effective relative to the node features you are using -- if you are using an embedding, you must know that these are still homogeneous and will not likely be able to tell the various different relationship types apart (except for GraphSAGE). Even if you do use them, you will only get the predictions for the relevant label-type-label triple which you specified for training. But I would recommend to think about what features to use and how to tune your models effectively.
You can already try out the 2.2 features by using our alpha releases -- find our latest alpha through this download link. Preview documentation is available here. Note that this is preview software and that the API may change a lot until the final 2.2.0 released version.

Neo4j: Node versus Node property

I am developing a Neo4j database that will contain genomic and clinical data for cancer patients. A common design issue in developing graph databases is whether a data item should be represented by a Node or by a property within a Node. In my case, patients will have hundreds of clinical and demographic measurements (e.g. sex, medications, tumor size). Some of these data will be constant (e.g. sex) while others will be subject to variation with each patient visit. The responses I've seen to previous node vs property questions have recommended using the anticipated queries against the data to make the decision. I think I can identify some properties that will be common search criteria and should be nodes (e.g. smoking history, sex, cancer type) but that still leaves me with hundreds of other properties. Is there a practical limit in Neo4j for the number of properties that a Node should contain? Also, a hybrid approach, where some data are properties and others are Nodes would seem to make both loading data from source files and subsequent queries more complicated.
The main idea behind "look at your queries to decide", is that how data relates to each other effects whether a node or property is better. Really, the main point of a graph database is to make walking relationships easier to query. So the real question you should ask yourself is "Does (a)-->()<--(b) have any significant meaning?" In other words, do I need to be able to find other nodes that share this property?
Here are some quick rule-of-thumb guidlines
Node
Has it's own sub-values or relations
Multiple nodes sharing this value has meaning, and you need to be able to walk along this shared value between them
Changes infrequently
If more than 1 value can apply at the same time
Properties
Has a large range of possible values
Changes over time
If more than 1 value can apply, values are usually updated as a set (rather than individually)
Label
Has a small, finite range of mutually exclusive values
Almost never changes
So lets go through the thought process of a few of your properties...
Sex
Will either be "Male" or "Female", and everyone will be connected to one of the two, so they will both end up being super nodes (overloaded). Also, if you do end up needing to find two people that share the same sex, almost any other method would be more efficient than finding them through the super node. However these are mutually exclusive, immutable, genetic traits so making this a label is also perfectly acceptable (and sometimes preferred).
Address
This is a variable value with sub-properties, won't be shared by very many nodes, and the walk from one person to another at the same address (or, by extension, live in an area) has valuable meaning. So this should almost definitely be a node.
Height and Weight
These change constantly with time, have no sub values, and two people sharing this value has little to no meaning. The range of values is far too wide, so Labels make no since either, so this should be a property.
Blood type
While has more options then Sex does, all the same logic applies, except that the relation does matter now (because people must share a blood type to donate). The problem is that this value will be so overloaded, that you will need to filter on area first, and than just verifying blood type. Could be a property or label. The case for node is if you include an "Can_Donate_To" or "Can_Accept" relation between the blood types. While you likely won't walk these relations to find a potential donor (because they are too overloaded, and you will have to filter by area first), you can use them to verify someone can be a donor.
Social Security Number
Is highly sensitive, and a lawsuit waiting to happen. Keep out of the DB if at all possible. If you absolutely have to; this property is immutable, but will be unique to every person, so because of the lack of reuse, is a bad label and will be pointless as a node. Definitely a property. (But should be salted+hashed if only for verification purposes only)
Mother's maiden name
The possible values are endless, and two nodes sharing this value has no real meaning. Definitely a property.
First born child
Since the child is already their own node, with it's own sub properties, just create a relation between the two. While the value of this info is questionable, any time you need to reference another node, always use a relationship for it. Definitely a node.

Is there a benefit to implementing singletons in Neo?

My business requirement says I need to add an arbitrary number of well-defined (AKA not dynamic, not unknown) attributes to certain types of nodes. I am pretty sure that while there could be 30 or 40 different attributes, a node will probably have no more than 4 or 5 of them. Of course there will be corner cases...
In this context, I am generically using 'attribute' as a tag wanted by the business, and not in the Neo4J sense.
I'll be expected to report on which nodes have which attributes. For example, I might have to report on which nodes have the "detention", "suspension", or "double secret probation" attributes.
One way is to simply have an array of appropriate attributes on each entity. But each query would require a search of all nodes. Or, I could create explicit attributes on each node. Now they could be indexed. I'm not seriously considering either of these approaches.
Another way is to implement each attribute as a singleton Neo node, and allow many (tens of thousands?) of other nodes to relate to these nodes. This implementation would have 10,000 nodes but 40,000 relationships.
Finally, the attribute nodes could be created and used by specific entity nodes on an as-needed basis. In this case, if 10,000 entities had an average of 4 attributes, I'd have a total of 50,000 nodes.
As I type this, I realize that in the 2nd case, I still have 40,000 relationships; the 'truth' of the situation did not change.
Is there a reason to avoid the 'singleton' implementation? I could put timestamps on the relationships. But those wouldn't be indexed...
For your simple use case, I'd suggest an approach you didn't list -- which is to use a node label for each "attribute".
Nodes can have multiple labels, and neo4j can quickly iterate through all the nodes with the same label -- making it very quick and easy to find all the nodes with a specific label.
For example:
MATCH (n:Detention)
RETURN n;

In Neo4j, faster structure for clustering

Suppose I have a bunch of User node, which has a property named gender, which can be male or female. Now in order to cluster user based on gender, I have two choice of structure:
1) Add an index to the gender property, and use a WHERE to select users under a gender.
2) Create a Male node and a Female node, and edges linking them to relevant users. Then every time when querying upon gender, I use pattern ,say, (:Male)-[]->(:User).
My question is, which one is better?
Indices should never be a replacement for putting things in the graph.
Indexing is great for looking up unique values and, in some cases, groups of values; however, with the caching that Neo4j can do (and the extensibility of modeling your domain).
Only indexing a property with two (give or take) properties is not the best use of an index and likely won't net too much of a performance boost given the number of results per property value.
That said, going with option #2 can create supernodes, a bottle-necking issue which can become a major headache depending on your model.
Maybe consider using labels (:Male and :Female, for example) as they are essentially "schema indices". Also keep in mind you can use multiple labels per node, so you could have (user:User:Male), etc. It also helps to avoid supernodes while not creating a classic or "legacy" index.
HTH

Representing an item in an inventory

I am new to Neo4j and I need some advice from the more experienced Neo4j developers.
In which situation does it makes sense for an inventory system to represent individual items as a path through their properties instead of a node with the same properties?
In order to make my self clear:
Let's say we have a eyeglass lens. This item has properties like it's SPHERE power it's CYLINDER power and an AXIS, among others.
There is a finite set of SPHERE powers but also of CYLINDER power and AXIS. The combination of those makes an item (lens).
Does it make sense to represent a lens like this:
MATCH (lens:Lens)-[:-2.00]-(sph:Sphere:{power:'-2.00'})-[:-0.50]-(cyl:Cylinder{power:'-0.50'})-[:90]-(ax:Axis{degree:'90'})
RETURN lens.brand_name, lens.price
Please note that the above item(lens) can be available from different manufacturers and with different brand names and list prices so "lens" will represent all individual brands that can match with the above query and will have as properties the brand name and price, at least.
Let's say you have a piece of data ("SPHERE"). When should it be a property of the lens node, and when should it be its own node, via relation?
Do you need to relate multiple lenses to the same sphere? This argues it should be its own node, so that multiple lenses can link to the same sphere.
Do you need to assert extra properties about the sphere value? (Like who measured it, or when?) This argues you should make it a separate node.
Do you need to store properties about the relationship? If the relationship is any more complicated than simple "HAS A" you might want a relationship between two nodes, so you can store properties on the relationship.
Any of those cases would argue you should store that piece of data as a separate node, and then relate it by relationship.
ON THE OTHER HAND, if it's a simple primitive data type (float), with a simple "HAS-A" relationship to the parent (i.e. a lens HAS-A sphere measurement) and you have no need for extra metadata, then it should be a node property.
I'm not an optometrist but I think this latter situation is your case, I'm just trying to give you a more general answer. "Sphere" should probably be a node property, but the cases above are how to think about the issue more generally for future data items.
In your special domain, with finite ranges and discrete values for each of the parameters, it absolutely makes sense to model the properties of a lens as value nodes. The resulting index graph seems not to be too large, and quite balanced (no supernodes).

Resources