Neo4j node property type - neo4j

I'm playing around with neo4j, and I was wondering: is it common to have a type property on nodes that specifies what type of node it is? I've tried searching for this practice, and I've seen some people use name for a purpose like this, but I was wondering whether it is considered good practice or whether indexes would be the more practical method.
An example would be a "User" node, which would have type: user, this way if the index was bad, I would be able to do an all-node scan and look for types of user.

Labels have been added to neo4j 2.0. They fix this problem.
You can create nodes with labels:
CREATE (me:American {name: "Emil"}) RETURN me;
You can match on labels:
MATCH (n:American)
WHERE n.name = 'Emil'
RETURN n
You can set any number of labels on a node:
MATCH (n)
WHERE n.name='Emil'
SET n :Swedish:Bossman
RETURN n
You can remove any number of labels from a node:
MATCH (n { name: 'Emil' })
REMOVE n:Swedish
Etc...
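
To make the "all-node scan" concern from the question concrete, here is a plain-Python sketch (not Neo4j code, and not how Neo4j stores data) that contrasts scanning every node for a type property with looking nodes up through a label index. TinyGraph, add_node, scan_by_type, and by_label are hypothetical names invented for this illustration.

```python
from collections import defaultdict


class TinyGraph:
    """Toy in-memory model; it only illustrates the idea."""

    def __init__(self):
        self.nodes = {}                      # node id -> {"labels": set, "props": dict}
        self.label_index = defaultdict(set)  # label -> set of node ids
        self._next_id = 0

    def add_node(self, labels=(), **props):
        node_id = self._next_id
        self._next_id += 1
        self.nodes[node_id] = {"labels": set(labels), "props": props}
        for label in labels:
            self.label_index[label].add(node_id)
        return node_id

    def scan_by_type(self, type_value):
        # Property approach: every node has to be inspected.
        return sorted(i for i, n in self.nodes.items()
                      if n["props"].get("type") == type_value)

    def by_label(self, label):
        # Label approach: only nodes carrying the label are touched.
        return sorted(self.label_index.get(label, ()))


g = TinyGraph()
emil = g.add_node(labels=["American", "Swedish"], name="Emil", type="user")
acme = g.add_node(labels=["Company"], name="Acme", type="company")

print(g.scan_by_type("user"))  # inspects both nodes
print(g.by_label("Swedish"))   # reads a single index entry
```

Both calls return the same node here; the difference is only in how many nodes have to be looked at to get there.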

True, it does depend on your use case.
If you add a type property and then wish to find all users, you're in potential trouble, because you have to examine that property on every node to get to the users. In that case the index would probably do better, but not in cases where you need to query for all users with conditions and relations that are not available in the index (unless, of course, your index is the source of the "start").
If you have graphs like mine, where a relation type implies two different node types like A-(knows)-(B) and A or B can be a User or a Customer, then it doesn't work.
So your use case is really important: it's easy to model graphs generically, but important to "tune" the model to your usage pattern.

IMHO you shouldn't have to put a type property on the node. Instead, a common way to reference all nodes of a specific "type" is to connect all user nodes to a node called "Users" maybe. That way starting at the Users node, you can very easily find all user nodes. The "Users" node itself can be indexed so you can find it easily, or it can be connected to the reference node.

I think it's really up to you. Some people like indexed type attributes, but I find that they're mostly useful when you have other indexed attributes to narrow down the number of index hits (search for all users over age 21, for example).
That said, as @Luanne points out, most of us try to solve the problem in-graph first. Another way to do that (and the more natural way, in my opinion) is to use the relationship type to infer a practical node type, i.e. "A - (knows) -> B", so A must be a user or some other thing that can "know", and B must be another user, a topic, or some other object that can "be known".

For client APIs, modeling the element type as a property makes it easy to instantiate the right domain object in your client-side code so I always include a type property on each node/vertex.
The "type" name is commonly used for this, but in some languages like Python, "type" is a built-in name that is best not shadowed, so I use "element_type" in Bulbs ( http://bulbflow.com/quickstart/#models ).
This is not needed for edges/relationships because they already contain a type (the label) -- note that Neo4j also uses the keyword "type" instead of label for relationships.

I'd say it's common practice. As an example, this is exactly how Spring Data Neo4j knows which entity type a given node represents. Each node has a "type" property that contains the qualified class name of the entity. These properties are automatically indexed in the "types" index, so nodes can be looked up really fast. You could implement your use case exactly like this.

Labels have recently been added to Neo4j 2.0 ( http://docs.neo4j.org/chunked/milestone/graphdb-neo4j-labels.html ). They are still under development at the moment, but they address this exact problem.

Related

How to Model a relationship that adds a feature to a node?

This is a follow-up to this earlier question
How to model two nodes related through a third node in neo4j?
If the capabilities of a product are enhanced by a connects_to relationship with another product, how should that fact be captured?
Example: given
(shelf:Shelf {maxload:20}), if (node:L-bracket)-[connects-to]->(shelf), then shelf's maxload increases by 10. Now, if someone queries for a Shelf that supports maxload=30, I should be able to retrieve this combination of L-Bracket+Shelf as an option, in addition to the shelves that support maxload without L-bracket. This is one use-case.
The other is when the connects_to relationship adds an entirely new property to the Shelf node. The option I'm thinking of is adding a property to the relationship called 'provides feature' and then querying those as well when returning nodes, to see if a product has been enhanced by any of its connections.
Part 1 :
I should be able to retrieve this combination of L-Bracket+Shelf as
an option, in addition to the shelves that support maxload without
L-bracket.
This use case is handled with OPTIONAL MATCH :
MATCH (shelf:Shelf {maxload:30})
OPTIONAL MATCH (shelf)<-[:CONNECTS_TO]-(bracket:`L-Bracket`)
RETURN shelf, collect(bracket) as brackets
This returns a list of shelves and, for each of them, a collection of brackets (an empty collection if the shelf has no brackets). Note that the label is backtick-quoted because it contains a hyphen.
Part 2 :
The other is when the connects_to relationship adds an entirely new
property to the Shelf node. The option I'm thinking of is adding a
property to the relationship called 'provides feature' and then querying
those as well when returning nodes, to see if a product has been
enhanced by any of its connections.
You can simply use a PROVIDES_FEATURE relationship type; there is no need for a property on it. You can query for these relationships in the same way as in part 1.
To be a bit more general, suppose everything that can be connected to a shelf (not just an L-Bracket) was represented by an Accessory node that has type and extraLoad properties, like this:
(:Accessory {type: 'L-Bracket', extraLoad: 10})
This would allow accessories of different types and with differing extra load capacities.
With this model, you could find all Shelf/Accessory combinations that can hold a load of at least 30 this way:
MATCH (shelf:Shelf)
OPTIONAL MATCH (shelf)<-[:CONNECTS_TO]-(x:Accessory)
WITH shelf, COLLECT(x) AS accessories, SUM(x.extraLoad) AS extra
WHERE shelf.maxload + extra >= 30
RETURN shelf, accessories;
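
As a cross-check of what the aggregation above computes, the same logic can be sketched in plain Python. The shelf ids, loads, and the combos_supporting helper are made-up illustration data and names, not part of the original answer.

```python
# Hypothetical data: shelf id -> maxload.
shelves = {"s1": 20, "s2": 30, "s3": 10}

# Each tuple stands for (accessory)-[:CONNECTS_TO]->(shelf):
# (shelf id, accessory type, extraLoad).
accessories = [
    ("s1", "L-Bracket", 10),
    ("s3", "L-Bracket", 10),
]


def combos_supporting(min_load):
    """Return {shelf id: attached accessories} where maxload + extras >= min_load."""
    result = {}
    for shelf_id, max_load in shelves.items():
        attached = [(kind, extra) for s, kind, extra in accessories if s == shelf_id]
        extra_total = sum(extra for _, extra in attached)  # 0 when nothing is attached
        if max_load + extra_total >= min_load:
            result[shelf_id] = attached
    return result


print(combos_supporting(30))
```

With this data, s1 qualifies only because of its bracket (20 + 10), s2 qualifies on its own, and s3 falls short even with a bracket (10 + 10).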

How can I mitigate having bidirectional relationships in a family tree, in Neo4j?

I am running into this wall regarding bidirectional relationships.
Say I am attempting to create a graph that represents a family tree. The problem here is that:
* Timmy can be Suzie's brother, but
* Suzie can not be Timmy's brother.
Thus, it becomes necessary to model this in 2 directions:
(Sure, technically I could say SIBLING_TO and leave only one edge, but I'm not sure what the vocabulary would be when I try to connect a grandma to a grandson.)
When it's all said and done, I'm pretty sure there's no way around the fact that the direction matters in this example.
I was reading this blog post, regarding common Neo4j mistakes. The author states that this bidirectionality is not the most efficient way to model data in Neo4j and should be avoided.
And I am starting to agree. I set up a mock set of 2 families:
and I found that a lot of the queries I was attempting to run were very, very slow. This is because of the 'all connected to all' nature of the graph, at least within each respective family.
My question is this:
1) Am I correct to say that bidirectionality is not ideal?
2) If so, is my example of a family tree representable in any other way...and what is the 'best practice' in the many situations where my problem may occur?
3) If it is not possible to represent the family tree in another way, is it technically possible to still write queries in some manner that gets around the problem of 1) ?
Thanks for reading this and for your thoughts.
Storing redundant information (your bidirectional relationships) in a DB is never a good idea. Here is a better way to represent a family tree.
To indicate "siblingness", you only need a single relationship type, say SIBLING_OF, and you only need to have a single such relationship between 2 sibling nodes.
To indicate ancestry, you only need a single relationship type, say CHILD_OF, and you only need a single such relationship from a child to each of its parents.
You should also have a node label for each person, say Person. And each person should have a unique ID property (say, id), and some sort of property indicating gender (say, a boolean isMale).
With this very simple data model, here are some sample queries:
To find Person 123's sisters (note that the pattern does not specify a relationship direction):
MATCH (p:Person {id: 123})-[:SIBLING_OF]-(sister:Person {isMale: false})
RETURN sister;
To find Person 123's grandfathers (note that this pattern specifies that matching paths must have a depth of 2):
MATCH (p:Person {id: 123})-[:CHILD_OF*2..2]->(gf:Person {isMale: true})
RETURN gf;
To find Person 123's great-grandchildren:
MATCH (p:Person {id: 123})<-[:CHILD_OF*3..3]-(ggc:Person)
RETURN ggc;
To find Person 123's maternal uncles:
MATCH (p:Person {id: 123})-[:CHILD_OF]->(:Person {isMale: false})-[:SIBLING_OF]-(maternalUncle:Person {isMale: true})
RETURN maternalUncle;
I'm not sure if you are aware that it's possible to query bidirectionally (that is, to ignore the direction). So you can do:
MATCH (a)-[:SIBLING_OF]-(b)
and since I'm not matching a direction it will match both ways. This is how I would suggest modeling things.
Generally you only want to create multiple relationships if you actually want to store different state. For example, a KNOWS relationship might apply only one way, because person A might know person B while B does not know A. Similarly, you might have a LIKES relationship with a value property showing how much A likes B, and there might be different strengths of "liking" in the two directions.
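
The point of both answers is that one stored SIBLING_OF edge and one CHILD_OF edge per parent are enough to derive the other relations at query time. A plain-Python sketch of that idea follows; the people, ids, and helper names are invented for illustration and this is not Neo4j code.

```python
# Hypothetical family: id -> is_male flag.
people = {
    123: True,   # Timmy
    124: False,  # Suzie
    200: False,  # mother
    201: True,   # father
    202: True,   # mother's brother
    300: True,   # maternal grandfather
    301: False,  # maternal grandmother
}

# One directed CHILD_OF edge per (child, parent) pair.
child_of = {(123, 200), (123, 201), (124, 200), (124, 201),
            (200, 300), (200, 301), (202, 300), (202, 301)}

# One undirected SIBLING_OF edge per pair, stored once.
sibling_of = {frozenset((123, 124)), frozenset((200, 202))}


def siblings(pid):
    # Direction is ignored, like (a)-[:SIBLING_OF]-(b).
    return {other for pair in sibling_of if pid in pair
            for other in pair if other != pid}


def parents(pid):
    return {parent for child, parent in child_of if child == pid}


def ancestors_at(pid, depth):
    # Follows CHILD_OF exactly `depth` hops, like [:CHILD_OF*2..2].
    frontier = {pid}
    for _ in range(depth):
        frontier = {p for x in frontier for p in parents(x)}
    return frontier


sisters = {s for s in siblings(123) if not people[s]}
grandfathers = {g for g in ancestors_at(123, 2) if people[g]}
maternal_uncles = {u for m in parents(123) if not people[m]
                   for u in siblings(m) if people[u]}

print(sisters, grandfathers, maternal_uncles)
```

No "brother" or "grandson" edges are stored anywhere; they fall out of the two relationship types plus the gender flag.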

Neo4j labels, relationship types, and cypher matching performance

Say I have a massive graph of users and other types of nodes. Each type has a label, some may have multiple labels. Since I am defining users and their access to nodes, there is one relationship type between users and nodes: CAN_ACCESS. Between other objects, there are different relationship types, but for the purpose of access control, everything involves a CAN_ACCESS relationship when we start from a user.
I never perform a match without using labels, so my intention and hope is that any performance downsides to having one heavily-used relationship type from my User nodes should be negated by matching a label. Obviously, this match could get messy:
MATCH (n:`User`)-[r1:`CAN_ACCESS`]->(n2)
But I'd never do that. I'd do this:
MATCH (n:`User`)-[r1:`CAN_ACCESS`]->(n2:`LabelX`)
My question, then is whether the use of labels on the destination side of the match is effectively equivalent to having a dedicated relationship type between a User and any given label. In other words, does this:
MATCH (n:`User`)-[r1:`CAN_ACCESS`]->(n2:`LabelX`)
Give me the same performance as this:
MATCH (n:`User`)-[r1:`CAN_ACCESS_LABEL_X`]->(n2)
If CAN_ACCESS_LABEL_X ALWAYS goes (n:`User`)-->(n2:`LabelX`)?
As pointed out by Michael Hunger's comment, Mark Needham's blog post here demonstrates that performance is best when you use a dedicated relationship type instead of relying on labels.

Do we need to index on relationship properties to ensure that Neo4j will not search through all relationships

To clarify, let's assume that I have a relationship type: "connection." Connections have a property called "typeOfConnection," which can take on values in the domain:
{"GroupConnection", "FriendConnection", "BlahConnect"}.
When I query, I may want to qualify connection with one of these types. While there are not many types, there will be millions of connections with each property type.
Do I need to put an index on connection.typeOfConnection in order to ensure that all connections will not be traversed?
If so, I have been unable to find a simple cypher statement to do this. I've seen some stuff in the documentation describing how to do this in Java, but I'm interacting with Neo using Py2Neo, so it would be wonderful if there was a cypher way to do this.
This is a mixed-granularity property graph data model. That is totally fine, but you need to replace your relationship qualifiers with intermediate nodes. To do this, replace each relationship with one node that carries the type, plus two relationships, so that you can index on it.
Your model is coarse-grained: one generic relationship type qualified by a property. The opposite extreme is fine-grained granularity, which is the foundation of the RDF model. In a property graph, if you keep this coarse-grained design, you need to use labeled nodes in place of the relationships so that their type property can be indexed.
For instance, let's assume you have:
MATCH (thing1:Thing { id: 1 })-->(group:Connection { type: "group" }),
(group)-->(thing2:Thing)
RETURN thing2
Then you can index on the label Connection by property type.
CREATE INDEX ON :Connection(type)
This allows you the flexibility of not typing your relationships if your application requires dynamic types of connections that prevent you from using a fine-grained granularity.
Whatever you do, don't work around your issue by dynamically generating typed relationships in your Cypher queries. This will prevent your query templates from being cached and decrease performance. Either type all your relationships or go with the intermediate node I've recommended above.
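
The intermediate-node idea can be sketched in plain Python as follows. This is only an illustration of the data shape under assumed names (connections, type_index, targets are hypothetical); which starting point the Neo4j planner actually picks for the query above is not something this sketch shows.

```python
from collections import defaultdict

# Hypothetical data: each Connection is a node sitting between two Things.
connections = {
    "c1": {"type": "group", "source": 1, "target": 2},
    "c2": {"type": "friend", "source": 1, "target": 3},
    "c3": {"type": "group", "source": 4, "target": 5},
}

# Stand-in for an index on :Connection(type): type value -> connection ids.
type_index = defaultdict(set)
for conn_id, conn in connections.items():
    type_index[conn["type"]].add(conn_id)


def targets(source_id, conn_type):
    """Follow Thing -> Connection(type) -> Thing, starting from the index entry."""
    return sorted(connections[c]["target"]
                  for c in type_index.get(conn_type, ())
                  if connections[c]["source"] == source_id)


print(targets(1, "group"))
```

Only the connections of the requested type are visited; the "friend" connection from the same source is never inspected.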

Assumptions regarding Node ID strings in Neo4j - cypher

In my recent question, Modeling conditional relationships in neo4j v.2 (cypher), the answer has led me to another question regarding my data model and the cypher syntax to represent it. Let's say that in my model there is a node CLT1 that I'll call the Source node. CLT1 has relationships to 286 other Target nodes. This is a model of a target node:
CREATE
(Abnormally_high:Label1:Label2:Label3:Label4:Label5:Label6:Label7:Label8:Label9:Label10
{Prop1:'x',Prop2:'y',Prop3:'z'})
Key point: I am assuming the string after the CREATE clause is the ID of this target node.
The ID is significant because its content has domain-specific meaning and is queryable.
In this case it's the phrase "Abnormally_high".
I made this assumption based on the movie database example.
CREATE (Keanu:Person {name:'Keanu Reeves', born:1964})
CREATE (Carrie:Person {name:'Carrie-Anne Moss', born:1967})
The first strings after CREATE definitely have domain-specific meaning!
In my earlier post I discuss Problem 2. I find that Problem 2 arises because, among the 286 target nodes, there are many instances where at least one other Target node shares the identical ID. In this instance, the ID is "Abnormally_high". The other Target nodes may differ in the value of any of Label1 - Label10 or the associated properties.
Apparently, Cypher doesn't like that. In Problem 2, I was discussing ways to deal with the fact that cypher doesn't like using the same node ID multiple times even though the labels or properties are different.
My problem is my assumptions about the Target node ID.
AM I RIGHT?
I am now thinking that I could instead use this....
CREATE (CLT1_target_1:Label1:Label2:Label3:Label4:Label5:Label6:Label7:Label8:Label9:Label10
{name:'Abnormally_high',Prop2:'y',Prop3:'z'})
If indeed the first string after the CREATE clause is an ID, then all I have to do is put a unique target node identifier.... like CLT1_target_1 and increment up to CLT1_target_286. If I do this, then I can have the name as a property and change whatever label or property I want.
Do I have this right?
You are wrong. In Cypher, a node name (like "Abnormally_high") is just a variable name that exists for the lifetime of the query (and sometimes not even that long). The node name used in a Cypher query is never persisted in any way, and can be any arbitrary string.
Also, in neo4j, the term "ID" has a specific meaning. The neo4j DB will automatically assign a (currently) unique integer ID to each new node. You have no control over the ID value assigned to a node. And when a node is deleted, neo4j can reassign its ID to a new node.
You should read the neo4j manual (available at docs.neo4j.org), especially the section on Cypher, to get a better understanding.
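
As an analogy only, the distinction between a query variable, a stored property, and a system-assigned ID can be sketched in plain Python. TinyStore and its methods are hypothetical, and the ID reuse policy shown is a simplification chosen for the illustration, not a description of Neo4j's internals.

```python
class TinyStore:
    """Toy store: the caller never chooses IDs, and freed IDs may be reused."""

    def __init__(self):
        self.nodes = {}      # id -> {"labels": set, "props": dict}
        self._free_ids = []  # IDs released by deleted nodes
        self._next_id = 0

    def create(self, labels, **props):
        if self._free_ids:
            node_id = self._free_ids.pop()
        else:
            node_id = self._next_id
            self._next_id += 1
        self.nodes[node_id] = {"labels": set(labels), "props": props}
        return node_id

    def delete(self, node_id):
        del self.nodes[node_id]
        self._free_ids.append(node_id)


store = TinyStore()

# The Python variable names play the role of Cypher variables: they exist
# only in this script and are never written into the store.
Abnormally_high = store.create(["Label1"], name="Abnormally_high")
CLT1_target_2 = store.create(["Label2"], name="Abnormally_high")

# Two distinct nodes can share the same queryable name property.
same_name = [i for i, n in store.nodes.items()
             if n["props"]["name"] == "Abnormally_high"]
print(same_name)

# After a delete, the freed ID can be handed to an unrelated new node.
store.delete(Abnormally_high)
replacement = store.create(["Label3"], name="Something_else")
print(replacement)
```

This is why domain-specific meaning belongs in a property such as name, and why the internal ID should not be treated as a stable, meaningful key.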

Resources