I am going over this YouTube tutorial, "Using LOAD CSV in the Real World".
The tutorial shows how to take a CSV, where each row is a complaint made against some bank, and model it as a Neo4j dictionary.
When doing so, the narrator sets Properties on the Complaint node:
CREATE (complaint:Complaint {id: line.`Complaint ID`})
SET complaint.year= TOINT(date[2]),
complaint.month= TOINT(date[0]),
complaint.day = TOINT(date[1])
I'm confused about a small point -- what makes this date information more of a 'Property' than a Label?
Could this be modeled instead where the node has this information encapsulated as Labels instead of Properties? At what point do you need one of these and not the other?
Labels and properties are very different things.
A property belongs to a node or a relationship, and has a name and a value.
A node label is similar in concept to a "class name", and has no value.
So, it does not make any sense to talk about putting a date value in a "label". You can only put a value in a property.
Note, however, that people often use a label name (e.g., "Foo") as a shorthand for "node that has the Foo label". For example, they may say "store the date in Foo" when they actually mean "store the date in the appropriate property of a node with the label Foo". Perhaps this is what is causing the confusion.
As cybersam pointed out in his answer, labels cannot contain values. They are just... labels. Like a tag. Taking this in a slightly different direction:
A long, long time ago, in a version far, far away, Neo4j didn't have labels. So, if you wanted to identify a particular type of node (say... a Person)... you'd likely include a property+value such as nodeType = 'Person'. And then you'd include a filter in your queries, such as:
WHERE node.nodeType = 'Person'
Labels make such a property type obsolete, and are also indexable. Further, you may have multiple labels on a node (which would require your legacy nodeType property to be an array, and not as efficient to search).
So: Labels for tagging/indexing. Properties for holding values.
" Node labels serve as an anchor point for a query. By specifying a label, we are specifying a subset of one or more nodes with which to start a query. Using a label helps to reduce the amount of data that is retrieved." https://graphacademy.neo4j.com/
Related
I am developing a Neo4j database that will contain genomic and clinical data for cancer patients. A common design issue in developing graph databases is whether a data item should be represented by a Node or by a property within a Node. In my case, patients will have hundreds of clinical and demographic measurements (e.g. sex, medications, tumor size). Some of these data will be constant (e.g. sex) while others will be subject to variation with each patient visit. The responses I've seen to previous node vs property questions have recommended using the anticipated queries against the data to make the decision. I think I can identify some properties that will be common search criteria and should be nodes (e.g. smoking history, sex, cancer type) but that still leaves me with hundreds of other properties. Is there a practical limit in Neo4j for the number of properties that a Node should contain? Also, a hybrid approach, where some data are properties and others are Nodes would seem to make both loading data from source files and subsequent queries more complicated.
The main idea behind "look at your queries to decide", is that how data relates to each other effects whether a node or property is better. Really, the main point of a graph database is to make walking relationships easier to query. So the real question you should ask yourself is "Does (a)-->()<--(b) have any significant meaning?" In other words, do I need to be able to find other nodes that share this property?
Here are some quick rule-of-thumb guidlines
Node
Has it's own sub-values or relations
Multiple nodes sharing this value has meaning, and you need to be able to walk along this shared value between them
Changes infrequently
If more than 1 value can apply at the same time
Properties
Has a large range of possible values
Changes over time
If more than 1 value can apply, values are usually updated as a set (rather than individually)
Label
Has a small, finite range of mutually exclusive values
Almost never changes
So lets go through the thought process of a few of your properties...
Sex
Will either be "Male" or "Female", and everyone will be connected to one of the two, so they will both end up being super nodes (overloaded). Also, if you do end up needing to find two people that share the same sex, almost any other method would be more efficient than finding them through the super node. However these are mutually exclusive, immutable, genetic traits so making this a label is also perfectly acceptable (and sometimes preferred).
Address
This is a variable value with sub-properties, won't be shared by very many nodes, and the walk from one person to another at the same address (or, by extension, live in an area) has valuable meaning. So this should almost definitely be a node.
Height and Weight
These change constantly with time, have no sub values, and two people sharing this value has little to no meaning. The range of values is far too wide, so Labels make no since either, so this should be a property.
Blood type
While has more options then Sex does, all the same logic applies, except that the relation does matter now (because people must share a blood type to donate). The problem is that this value will be so overloaded, that you will need to filter on area first, and than just verifying blood type. Could be a property or label. The case for node is if you include an "Can_Donate_To" or "Can_Accept" relation between the blood types. While you likely won't walk these relations to find a potential donor (because they are too overloaded, and you will have to filter by area first), you can use them to verify someone can be a donor.
Social Security Number
Is highly sensitive, and a lawsuit waiting to happen. Keep out of the DB if at all possible. If you absolutely have to; this property is immutable, but will be unique to every person, so because of the lack of reuse, is a bad label and will be pointless as a node. Definitely a property. (But should be salted+hashed if only for verification purposes only)
Mother's maiden name
The possible values are endless, and two nodes sharing this value has no real meaning. Definitely a property.
First born child
Since the child is already their own node, with it's own sub properties, just create a relation between the two. While the value of this info is questionable, any time you need to reference another node, always use a relationship for it. Definitely a node.
I am trying to dump some date to Neo4J. Some of my node names (in the chosen format for dumping) has numbers, which have to be exported as node-names.
I encounter the following error when the node name or label starts with a number.
Neo.ClientError.Statement.InvalidSyntax
MERGE (1:User {name: "u1"})
Is this because, internally neo4j has a unique ID?. How do we circumvent this problem?
I believe these are just the syntax rules Neo4j uses. Also keep in mind that the thing you are referring to as the node name (1, in your example) is actually a variable name, and only persists for the duration of the query (or until it leaves scope if not carried over in a WITH clause to the next part of the query).
From the developer documentation:
Variable names are case sensitive, and can contain underscores and
alphanumeric characters (a-z, 0-9), but must always start with a
letter...The same rules apply to property names.
While I didn't see anything about label names, it looks like it follows the same syntax rules.
Property values, of course, can be anything you want.
You described the limitation as a "problem", so I'm guessing there's a perceived issue with this in your import, likely around the confusion between variables and what you called node names. If that's so, then please add some more details to your description, and I can add on to my answer accordingly.
Relationship/Arrows in Neo4j can not get more than one type/label (see here, and here). I have a data model that edges need to get labels and (probably) properties. If I decide to use Neo4j (instead of OriendDB which supports labeled arrow), I think I would have then two options to model an arrow, say f, between two nodes A and B:
1) encode an arrow f as a span, say A<--f-->B, such that f is also a node and --> and <-- are arrows.
or
2) encode an arrow f as A --> f -->B, such that f is a node again and two --> are arrows.
Though this seems to be adding unnecessary complexity on my data model, it does not seem to be any other option at the moment if I want to use Neo4j. Then, I am trying to see which of the above encoding might fit better in my queries (queries are the core of my system). For doing so, I need to resort to examples. So I have two question:
First Question:
part1) I have nodes labeled as Person and father, and there are arrows between them like Person<-[:sr]-father-[:tr]->Person in order to model who is father of who (tr is father of sr). For a given person p1 how can I get all of his ancestors.
part2) If I had Person-[:sr]->father-[:tr]->Person structure instead, for modeling father relationship, how the above same query would look like.
This is answered here when father is considered as a simple relationship (instead of being encoded as a node)
Second Question:
part1) I have nodes labeled as A nodes with the property p1 for each. I want to query A nodes, get those elements that p1<5, then create the following structure: for each a1 in the query result I create qa1<-[:sr]-isA-[:tr]->a1 such that isA and qa1 are nodes.
part2) What if I wanted to create qa1-[:sr]->isA-[:tr]->qa1 instead?
This question is answered here when isA is considered as a simple arrow (instead of being modeled as a node).
First, some terminology; relationships don't have labels, they only have types. And yes, one type per relationship.
Second, relative to modeling, I think the direction of the relationship isn't always super important, since with neo4j you can traverse it both ways easily. So the difference between A-->f-->B and A<--f-->B I think should be entirely driven what what makes sense semantically for your domain, nothing else. So your options (1) and (2) at the top seem the same to me in terms of overall complexity, which brings me to point #3:
Your main choice is between making a complex relationship into a node (which I think we're calling f here) or keeping it as a relationship. Making "a relationship into a node" is called reification and I think it's considered a fairly standard practice to accommodate a number of modeling issues. It does add complexity (over a simple relationship) but adds flexibility. That's a pretty standard engineering tradeoff everywhere.
So with all of that said, for your first question I wouldn't recommend an intermediate node at all. :father is a very simple relationship, and I don't see why you'd ever need more than one label on it. So for question one, I would pick "neither of the options you list" and would instead model it as (personA)-[:father]->(personB). More simple. You'd query that by saying
MATCH (personA { name: "Bob"})-[:father]->(bobsDad) RETURN bobsDad
Yes, you could model this as (personA)-[:sr]->(fatherhood)-[:tr]->(personB) but I don't see how this gains you much. As for the relationship direction, again it doesn't matter for performance or query, only for semantics of whatever :tr and :sr are supposed to mean.
I have nodes labeled as A nodes with the property p1 for each. I want
to query A nodes, get those elements that p1<5, then create the
following structure: for each a1 in the query result I create
qa1<-[:sr]-isA-[:tr]->a1 such that isA and qa1 are nodes.
That's this:
MATCH (aNode:A)
WHERE aNode.p1 < 5
WITH aNode
MATCH (qa1 { label: "some qa1 node" })
CREATE (qa1)<-[:sr]-(isA)-[:tr]->aNode;
Note that you'll need to adjust the criteria for qa1 and also specify something meaningful for isA.
What if I wanted to create qa1-[:sr]->isA-[:tr]->qa1 instead?
It should be trivial to modify that query above, just change the direction of the arrows, same query.
A portion of my neo4j graph represents objects, their values, and the attributes associated with those values. To use common programming syntax, I'm persisting something like the result of:
Object.Attribute = Value
where Object, Attribute and Value are all nodes, and they're linked by VALUE and ATTRIBUTE relationships like this:
Object-[:VALUE]->Value-[:ATTRIBUTE]->Attribute
To describe this with a specific example, the result of this code:
Object.Colour = 'Red'
would be persisted as:
Object-[:VALUE]->(Value { value:'Red' })-[:ATTRIBUTE]->(Attribute { name:'Colour' })
The problem occurs when I want to modify a persisted state like that above, and I wish to re-use existing Attribute (and ideally, Value) nodes - that is, I don't want to have multiple instances of the Attribute { name:'Colour' } node, I want to have a single instance that is related to each Value node instance.
The following Cypher query will go ahead and create new Value and Attribute nodes every time, regardless of whether identical nodes already exist:
start o=node(something)
create unique o-[:VALUE]->(v {value:'Green'})-[:ATTRIBUTE]->(a {name:'Colour'})
return v;
The following will apparently recycle both Value and Attribute nodes, but of course won't work when the required Attribute doesn't exist (i.e. the first time it's used):
start o=node(something), a=node(something)
create unique o-[:VALUE]->(v {value:'Green'})-[:ATTRIBUTE]->a
where a.name = 'Colour'
return v;
The statement in the documentation that "create unique will always make the least change possible to the graph — if it can use parts of the existing graph, it will", doesn't appear to be completely true, and I don't understand why my query doesn't exhibit this behaviour.
How can I get the "recycling" effect of the latter query, combined with the creation on demand of the Attribute (and Value) when required like the former?
The first problem is that you are building a property graph in a property graph. It's possible, but awkward and not always a good choice.
Your example:
http://console.neo4j.org/r/evl77k
If understood you correctly, you want to be able to change the value node and re-use the same attribute node, right?
The problem is that in your second query, you ask Cypher to find value node with a different property than what is already in your graph. Cypher can't find such a node, and so creates one for you. It then tries to find an outgoing ATTRIBUTE relationship from the newly created node, and of course it doesn't find any. So it creates a new relationship and attribute for you.
If you want to keep using the same attribute node, you simply omit the property value from the value node, like so:
START o=node(0)
CREATE UNIQUE o-[:VALUE]->(v)-[:ATTRIBUTE]->(a {name:'Colour'})
SET v.value = 'Green'
RETURN v
This will find you Colour attribute node, and then set the property for you, instead of creating new paths every time.
Makes sense?
Andrés
I'm playing around with neo4j, and I was wondering, is it common to have a type property on nodes that specify what type of Node it is? I've tried searching for this practice, and I've seen some people use name for a purpose like this, but I was wondering if it was considered a good practice or if indexes would be the more practical method?
An example would be a "User" node, which would have type: user, this way if the index was bad, I would be able to do an all-node scan and look for types of user.
Labels have been added to neo4j 2.0. They fix this problem.
You can create nodes with labels:
CREATE (me:American {name: "Emil"}) RETURN me;
You can match on labels:
MATCH (n:American)
WHERE n.name = 'Emil'
RETURN n
You can set any number of labels on a node:
MATCH (n)
WHERE n.name='Emil'
SET n :Swedish:Bossman
RETURN n
You can delete any number of labels on a node:
MATCH (n { name: 'Emil' })
REMOVE n:Swedish
Etc...
True, it does depend on your use case.
If you add a type property and then wish to find all users, then you're in potential trouble as you've got to examine that property on every node to get to the users. In that case, the index would probably do better- but not in cases where you need to query for all users with conditions and relations not available in the index (unless of course, your index is the source of the "start").
If you have graphs like mine, where a relation type implies two different node types like A-(knows)-(B) and A or B can be a User or a Customer, then it doesn't work.
So your use case is really important- it's easy to model graphs generically, but important to "tune" it as per your usage pattern.
IMHO you shouldn't have to put a type property on the node. Instead, a common way to reference all nodes of a specific "type" is to connect all user nodes to a node called "Users" maybe. That way starting at the Users node, you can very easily find all user nodes. The "Users" node itself can be indexed so you can find it easily, or it can be connected to the reference node.
I think it's really up to you. Some people like indexed type attributes, but I find that they're mostly useful when you have other indexed attributes to narrow down the number of index hits (search for all users over age 21, for example).
That said, as #Luanne points out, most of us try to solve the problem in-graph first. Another way to do that (and the more natural way, in my opinion) is to use the relationship type to infer a practical node type, i.e. "A - (knows) -> B", so A must be a user or some other thing that can "know", and B must be another user, a topic, or some other object that can "be known".
For client APIs, modeling the element type as a property makes it easy to instantiate the right domain object in your client-side code so I always include a type property on each node/vertex.
The "type" var name is commonly used for this, but in some languages like Python, "type" is a reserved word so I use "element_type" in Bulbs ( http://bulbflow.com/quickstart/#models ).
This is not needed for edges/relationships because they already contain a type (the label) -- note that Neo4j also uses the keyword "type" instead of label for relationships.
I'd say it's common practice. As an example, this is exactly how Spring Data Neo4j knows of which entity type a certain node is. Each node has "type" property that contains the qualified class name of the entity. These properties are automatically indexed in the "types" index, thus nodes can be looked up really fast. You could implement your use case exactly like this.
Labels have recently been added to Neo4j 2.0 ( http://docs.neo4j.org/chunked/milestone/graphdb-neo4j-labels.html ). They are still under development at the moment, but they address this exact problem.