I want to store multiple independent trees (there is no relation between those trees). I am thinking of generating and assigning a unique label to every independent tree, so that every query can filter on those labels. So if there are 10,000 trees, I would have to generate 10,000 different labels. Is there a better solution, like a multi-graph or something else?
I recommend using one label, say :Root, for all of your trees and a root_id property that contains the tree's unique identifier.
You can create a unique constraint on root_id to ensure that no two trees have the same ID. The unique constraint has the side effect of creating an index on the property so accessing the :Root nodes by root_id will be very fast.
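A minimal sketch of this layout, assuming a hypothetical :CHILD relationship type and an illustrative root_id value (neither comes from the question). The constraint syntax shown is the Neo4j 3.x form; newer releases use CREATE CONSTRAINT ... FOR ... REQUIRE instead.

```cypher
// One-time setup: the uniqueness constraint also creates an index on root_id.
// Neo4j 3.x syntax; on recent versions the equivalent is
//   CREATE CONSTRAINT root_id_unique FOR (r:Root) REQUIRE r.root_id IS UNIQUE
CREATE CONSTRAINT ON (r:Root) ASSERT r.root_id IS UNIQUE;

// Create the root of one tree with two children (:CHILD is a hypothetical type).
CREATE (r:Root {root_id: 42})-[:CHILD]->(c1),
       (r)-[:CHILD]->(c2);

// Fetch every node of that tree by starting from its root.
MATCH (r:Root {root_id: 42})-[:CHILD*0..]->(n)
RETURN n;
```

Only the root needs the label and the ID here; the rest of the tree is reached by traversal, so the number of labels stays at one regardless of how many trees exist.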
In modeling, instances of the same label, e.g. Student, usually have the same set of properties. However, is it normal for instances of the same label to have different sets of properties? For example, I have a Product node:
(p:Product)-[:HAS_ATTRIBUTE]->(a:Attributes)
Different instances of Product result in different instances of Attributes. In this case, different Attributes nodes have very different properties.
Is this modeling normal? Different categories of products can have very different attributes.
It's very useful to have different properties. For instance, I have a Y-DNA project with single nucleotide polymorphism (SNP) nodes. Some are on the known haplotree and some are not. So, I set a property InHGTree to Y or blank to reflect this. Now I can more readily create queries using the haplotree branching.
BTW, relationships can also have different properties with the same value. DNA results from an individual are in a "kit." The kit is related to numerous SNPs. You want to be able to determine whether the kit is positive or negative for the SNP. It is most logical to put this fact in the relationship between the kit and the SNP.
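As a rough illustration of the kit/SNP idea, here is a sketch in which the labels :Kit and :SNP, the relationship type :TESTED, and the properties kit_id, name, and result are all made-up names; only InHGTree comes from the description above.

```cypher
// Hypothetical model: one kit tested against two SNPs.
// The positive/negative outcome lives on the relationship, not on either node.
CREATE (k:Kit {kit_id: "K1"}),
       (s1:SNP {name: "SNP-A", InHGTree: "Y"}),
       (s2:SNP {name: "SNP-B"}),            // InHGTree left unset ("blank")
       (k)-[:TESTED {result: "positive"}]->(s1),
       (k)-[:TESTED {result: "negative"}]->(s2);

// SNPs on the haplotree for which kit K1 is positive.
MATCH (k:Kit {kit_id: "K1"})-[t:TESTED]->(s:SNP)
WHERE t.result = "positive" AND s.InHGTree = "Y"
RETURN s.name;
```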
It's certainly allowed, as there is no table schema, as there is in relational databases, to enforce homogeneous properties.
While this provides great flexibility, it may introduce complexity. It's up to the modelers and administrators of the database to provide any guidelines or implement restrictions, if needed.
While that would usually be in the form of convention, APOC triggers (or kernel extensions if you want to implement this yourself) could be used to enforce only certain properties for a node of a given label.
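If the rule is only a convention, one way to check how well it is being followed is to audit which property keys are actually in use. The sketch below assumes the :Attributes label from the question and uses only the built-in keys() function; it reports on usage and does not enforce anything.

```cypher
// For every property key found on :Attributes nodes, count how many nodes carry it.
MATCH (a:Attributes)
UNWIND keys(a) AS propertyKey
RETURN propertyKey, count(*) AS nodes
ORDER BY nodes DESC;
```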
My business requirement says I need to add an arbitrary number of well-defined (AKA not dynamic, not unknown) attributes to certain types of nodes. I am pretty sure that while there could be 30 or 40 different attributes, a node will probably have no more than 4 or 5 of them. Of course there will be corner cases...
In this context, I am generically using 'attribute' as a tag wanted by the business, and not in the Neo4J sense.
I'll be expected to report on which nodes have which attributes. For example, I might have to report on which nodes have the "detention", "suspension", or "double secret probation" attributes.
One way is to simply have an array of appropriate attributes on each entity. But each query would require a search of all nodes. Or, I could create explicit attributes on each node. Now they could be indexed. I'm not seriously considering either of these approaches.
Another way is to implement each attribute as a singleton Neo node, and allow many (tens of thousands?) of other nodes to relate to these nodes. This implementation would have 10,000 nodes but 40,000 relationships.
Finally, the attribute nodes could be created and used by specific entity nodes on an as-needed basis. In this case, if 10,000 entities had an average of 4 attributes, I'd have a total of 50,000 nodes.
As I type this, I realize that in the 2nd case, I still have 40,000 relationships; the 'truth' of the situation did not change.
Is there a reason to avoid the 'singleton' implementation? I could put timestamps on the relationships. But those wouldn't be indexed...
For your simple use case, I'd suggest an approach you didn't list -- which is to use a node label for each "attribute".
Nodes can have multiple labels, and neo4j can quickly iterate through all the nodes with the same label -- making it very quick and easy to find all the nodes with a specific label.
For example:
MATCH (n:Detention)
RETURN n;
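A slightly fuller sketch of the label-per-attribute approach, assuming a hypothetical :Student label and uid property for the entity nodes; the attribute labels are taken from the question.

```cypher
// Tag one node with two "attributes" (labels).
MATCH (s:Student {uid: 1234})
SET s:Detention:Suspension;

// Report on nodes that carry any of several attribute labels.
MATCH (n)
WHERE n:Detention OR n:Suspension OR n:DoubleSecretProbation
RETURN n, labels(n) AS attributes;

// Remove an attribute when it no longer applies.
MATCH (s:Student {uid: 1234})
REMOVE s:Detention;
```

Labels cannot carry data of their own, so a requirement such as a timestamp per attribute would still need a property or a relationship.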
I know that nodes and relationships have an integral value that identifies them. Is the same true of labels and/or properties? Is it sufficient to identify a node labeling by giving a node id and a string? That is, is it possible to assign the same label to the same node/relationship more than once?
All nodes and relationships have IDs and can be looked up by their IDs. You can do it like this:
MATCH (n) WHERE id(n) = 5 RETURN n;
IDs should be thought of as an internal implementation detail; don't rely on them to have any particular value or to be consistent or ordered, just rely on them to uniquely identify each node.
In general, IMHO it's good practice to assign your own meaningful identifier to nodes, probably indexed, that you can use to find your nodes, in the same way you'd assign a primary key to a relational database record.
It is possible to assign a label to more than one node; labels should be thought of more as classes of nodes, sort of like entities in an ERD. Typically labels would be things like Person, Company, Job, etc. Labels don't have much to do with identifiers though.
Do labels and properties have an id value in Neo4j?
No
Is it sufficient to identify a node labeling by giving a node id and a string?
If by this you mean that you have the node id and a label value in a String variable then yes, it is. It is also sufficient to just use the ID.
MATCH (n) WHERE ID(n) = 1234 RETURN n
Better:
MATCH (n:YourLabel) WHERE ID(n) = 1234 RETURN n
As FrobberOfBits intimated, it is also preferable to ignore (within reason) the internal IDs that Neo attaches to nodes and relationships in your external interactions. This is in part because Neo recycles its internal identifiers (if you create a node that gets assigned ID 1 and then delete it, the next node created could also be assigned ID 1), and in part because using them exposes the internals of the system. Instead you should probably attach meaningful identifiers or UUIDs where required.
Using your own identifier:
MATCH (n:YourLabel{uid:1234}) RETURN n
To make lookups fast, index them:
CREATE INDEX ON :YourLabel(uid)
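The index statement above uses the Neo4j 3.x syntax. As a hedged sketch for newer servers, the named form below is the one accepted by Neo4j 4.x and later (the index name yourlabel_uid is arbitrary), and randomUUID() is a built-in Cypher function in recent releases that can supply the identifier; it returns a string, unlike the numeric uid used above.

```cypher
// Neo4j 4.x+ index syntax; yourlabel_uid is an arbitrary index name.
CREATE INDEX yourlabel_uid FOR (n:YourLabel) ON (n.uid);

// Assign a generated identifier at creation time instead of relying on id(n).
CREATE (n:YourLabel {uid: randomUUID()})
RETURN n.uid;
```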
Is it possible to assign the same label to the same node/relationship more than once?
Nodes can have as many labels as you want to assign them (including none), which is handy when you want to maintain a hierarchy of node "types". Having a label makes them faster to look up, as Neo has a hint of where to start.
Relationships can only have a single type, and that type is immutable, i.e. if you want to change from type HAS_A to HAD_A you cannot just change the type; you must delete and re-add the relationship.
A node can be related to another node as many times as you want, using the same relationship type or different relationship types. Performance-wise, it is better to have different relationship types than to use properties on relationships, as the lookup is faster.
CREATE (p:Person{name:"Dave"}), (m:Pet), (d:Pet),
(p)-[:HAS_PET{type:"Cat"}]->(m),
(p)-[:HAS_PET{type:"Dog"}]->(d)
Is fine, as is:
CREATE (p:Person{name:"Dave"}), (m:Pet), (d:Pet),
(p)-[:HAS_CAT]->(m),
(p)-[:HAS_DOG]->(d)
Which is now faster if you wanted to do a query for matching all dogs. If Dave's dog gets reassigned you cannot do:
MATCH (p:Person{name:"Dave"})-[rel:HAS_DOG]->()
SET rel:HAS_CAT
But you could do:
MATCH (p:Person{name:"Dave"})-[rel:HAS_PET{type:"Dog"}]->()
SET rel.type = "Cat"
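For the dedicated-type variant, the "delete and re-add" step mentioned earlier could look roughly like this. It is a sketch using only core Cypher; oldRel and newRel are just variable names.

```cypher
// Replace Dave's HAS_DOG relationship with a HAS_CAT one between the same nodes.
MATCH (p:Person {name: "Dave"})-[oldRel:HAS_DOG]->(pet)
CREATE (p)-[newRel:HAS_CAT]->(pet)
SET newRel = properties(oldRel)   // carry over any properties from the old relationship
DELETE oldRel;
```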
But I've probably missed the point of the multiple-assignment question.
Not sure what you're aiming for:
Yes, property names, relationship types, and labels have internal IDs; they are not stored as strings.
You can identify nodes by label + property + value: exactly one node if you have a uniqueness constraint, otherwise possibly several.
You can identify nodes by label alone (many nodes).
Say I have a massive graph of users and other types of nodes. Each type has a label, some may have multiple labels. Since I am defining users and their access to nodes, there is one relationship type between users and nodes: CAN_ACCESS. Between other objects, there are different relationship types, but for the purpose of access control, everything involves a CAN_ACCESS relationship when we start from a user.
I never perform a match without using labels, so my intention and hope is that any performance downsides to having one heavily-used relationship type from my User nodes should be negated by matching a label. Obviously, this match could get messy:
MATCH (n:`User`)-[r1:`CAN_ACCESS`]->(n2)
But I'd never do that. I'd do this:
MATCH (n:`User`)-[r1:`CAN_ACCESS`]->(n2:`LabelX`)
My question, then, is whether the use of labels on the destination side of the match is effectively equivalent to having a dedicated relationship type between a User and any given label. In other words, does this:
MATCH (n:`User`)-[r1:`CAN_ACCESS`]->(n2:`LabelX`)
Give me the same performance as this:
MATCH (n:`User`)-[r1:`CAN_ACCESS_LABEL_X`]->(n2)
If CAN_ACCESS_LABEL_X ALWAYS goes (n:`User`)-->(n2:`LabelX`)?
As pointed out by Michael Hunger's comment, Mark Needham's blog post here demonstrates that performance is best when you use a dedicated relationship type instead of relying on labels.
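To check this on your own data, one option is to prefix both variants with PROFILE and compare the reported db hits. A sketch, assuming both relationship types exist in the graph:

```cypher
// Variant 1: generic relationship type, label filter on the target node.
PROFILE
MATCH (n:`User`)-[r1:`CAN_ACCESS`]->(n2:`LabelX`)
RETURN count(n2);

// Variant 2: dedicated relationship type, no label check on the target.
PROFILE
MATCH (n:`User`)-[r1:`CAN_ACCESS_LABEL_X`]->(n2)
RETURN count(n2);
```

In the first variant every CAN_ACCESS relationship of a user has to be followed before the label of the target can be checked, which is where the extra work would be expected to show up.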
Suppose I have a bunch of User nodes, each with a property named gender that can be male or female. To cluster users by gender, I have two choices of structure:
1) Add an index on the gender property, and use a WHERE clause to select users of a given gender.
2) Create a Male node and a Female node, with edges linking them to the relevant users. Then every time I query by gender, I use a pattern such as (:Male)-[]->(:User).
My question is, which one is better?
Indices should never be a replacement for putting things in the graph.
Indexing is great for looking up unique values and, in some cases, groups of values; however, with the caching that Neo4j can do (and the extensibility of modeling your domain), an index is not always the best tool.
Indexing a property that has only two (give or take) distinct values is not the best use of an index and likely won't net much of a performance boost, given the number of results per property value.
That said, going with option #2 can create supernodes, a bottle-necking issue which can become a major headache depending on your model.
Maybe consider using labels (:Male and :Female, for example) as they are essentially "schema indices". Also keep in mind you can use multiple labels per node, so you could have (user:User:Male), etc. It also helps to avoid supernodes while not creating a classic or "legacy" index.
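A minimal sketch of the label-based option, assuming the existing gender property holds the strings "male" and "female" (the exact stored values are an assumption):

```cypher
// One-time migration: derive labels from the existing property.
MATCH (u:User) WHERE u.gender = "male" SET u:Male;
MATCH (u:User) WHERE u.gender = "female" SET u:Female;

// Select users by gender through the label alone.
MATCH (u:User:Male)
RETURN u;
```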
HTH