Let's say there are two nodes A and B,
(A)-[r]-(B)
r has a property 'weight', which is, say, a measure of the dependency of A on B.
The value of weight frequently changes, and I wish to version the value of weight.
Is it feasible to create a new relationship between the same two nodes for each change, and add a property {valid: true} to the most recently created relationship?
I ask this question because I was told that if I need versioning on properties, they should definitely be nodes:
https://twitter.com/ikwattro/status/746997161645187072
But the weight property between the two nodes A and B naturally belongs to the relationship between them. How do I use a node to maintain the weight?
EDIT:
An example:
Let A be a node with label :FRUIT, and B be a node of label :PERSON
Further, let r be a relationship between the two, with a label :LIKING, and, the 'weight' property of r be a measure of how much person B likes fruit A.
The weight property of r keeps changing, and it is required to version this property over time.
I think this depends on two things: the frequency of weight updates and the queries you will run on the versioned weights.
1. If you expect a smallish number of updates and only keep them for reference, you could use a single relationship and store the old values in a property (e.g. a map or even a string).
2. If you expect a smallish number of updates and you want to query the data regularly, it would be reasonable to create a new relationship for each update (see the sketch below).
3. If the weight changes frequently and you actually need to access the data (i.e. collect millions of weight values for millions of fruits), I would not store it in Neo4j. Use a simple MySQL table with PersonID, FruitID, weight, timestamp, or some other data store, and keep only the latest value in Neo4j.
I use both 2. and 3. a lot, and even though 3. sounds like overkill, it's usually simple to implement as long as you only 'outsource' structured data with clear queries.
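A minimal Cypher sketch of option 2, using the :PERSON/:FRUIT/:LIKING names and the valid flag from the question above (the name properties, the 0.8 weight, the since/until timestamps and the relationship direction are my assumptions, purely for illustration):

// Retire the currently valid relationship, then create a new one with the new weight.
MATCH (p:PERSON {name: 'Bob'})-[old:LIKING {valid: true}]->(f:FRUIT {name: 'Apple'})
SET old.valid = false, old.until = timestamp()
CREATE (p)-[:LIKING {weight: 0.8, valid: true, since: timestamp()}]->(f);

// The current weight is then the one relationship still flagged as valid.
MATCH (:PERSON {name: 'Bob'})-[r:LIKING {valid: true}]->(:FRUIT {name: 'Apple'})
RETURN r.weight;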
I am developing a Neo4j database that will contain genomic and clinical data for cancer patients. A common design issue in developing graph databases is whether a data item should be represented by a Node or by a property within a Node.

In my case, patients will have hundreds of clinical and demographic measurements (e.g. sex, medications, tumor size). Some of these data will be constant (e.g. sex) while others will be subject to variation with each patient visit. The responses I've seen to previous node vs property questions have recommended using the anticipated queries against the data to make the decision. I think I can identify some properties that will be common search criteria and should be nodes (e.g. smoking history, sex, cancer type), but that still leaves me with hundreds of other properties.

Is there a practical limit in Neo4j for the number of properties that a Node should contain? Also, a hybrid approach, where some data are properties and others are Nodes, would seem to make both loading data from source files and subsequent queries more complicated.
The main idea behind "look at your queries to decide" is that how the data items relate to each other affects whether a node or a property is better. Really, the main point of a graph database is to make walking relationships easier to query. So the real question you should ask yourself is "Does (a)-->()<--(b) have any significant meaning?" In other words, do I need to be able to find other nodes that share this property?
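For instance, a sketch of that pattern with hypothetical :Patient and :Medication nodes (the labels, the :TAKES relationship type and the id property are all illustrative): if the shared value is a node, finding the other nodes that share it is a single walk.

// Other patients connected to the same medication node as patient 123.
MATCH (p:Patient {id: 123})-[:TAKES]->(m:Medication)<-[:TAKES]-(other:Patient)
RETURN DISTINCT other;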
Here are some quick rule-of-thumb guidelines:
Node
Has its own sub-values or relations
Multiple nodes sharing this value has meaning, and you need to be able to walk along this shared value between them
Changes infrequently
If more than 1 value can apply at the same time
Properties
Has a large range of possible values
Changes over time
If more than 1 value can apply, values are usually updated as a set (rather than individually)
Label
Has a small, finite range of mutually exclusive values
Almost never changes
So let's go through the thought process for a few of your properties...
Sex
Will either be "Male" or "Female", and if modelled as nodes everyone will be connected to one of the two, so they will both end up being super nodes (overloaded). Also, if you do end up needing to find two people who share the same sex, almost any other method would be more efficient than finding them through the super node. However, these are mutually exclusive, immutable, genetic traits, so making this a label is also perfectly acceptable (and sometimes preferred).
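If you do go the label route, the query stays trivial and no super node is involved (a sketch, assuming :Person nodes that additionally carry a :Female or :Male label):

MATCH (p:Person:Female)
RETURN count(p);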
Address
This is a variable value with sub-properties, won't be shared by very many nodes, and the walk from one person to another at the same address (or, by extension, live in an area) has valuable meaning. So this should almost definitely be a node.
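A sketch of why the node form pays off here (the :Address label, the :LIVES_AT relationship type and the name property are assumptions for illustration):

// People who share an address with a given person.
MATCH (p:Person {name: 'Alice'})-[:LIVES_AT]->(a:Address)<-[:LIVES_AT]-(neighbour:Person)
RETURN neighbour, a;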
Height and Weight
These change constantly with time, have no sub-values, and two people sharing this value has little to no meaning. The range of values is far too wide, so labels make no sense either; this should be a property.
Blood type
While it has more options than Sex does, all the same logic applies, except that the relation does matter now (because people must have compatible blood types to donate). The problem is that this value will be so overloaded that you will need to filter on area first, and then just verify blood type. Could be a property or label. The case for a node is if you include a "Can_Donate_To" or "Can_Accept" relation between the blood types. While you likely won't walk these relations to find a potential donor (because they are too overloaded, and you will have to filter by area first), you can use them to verify that someone can be a donor.
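A possible verification query under that node-based design, assuming :BloodType nodes linked by the Can_Donate_To relation mentioned above (the :HAS_BLOOD_TYPE type and the id property are illustrative):

// Can the donor's blood type be given to the recipient?
MATCH (donor:Person {id: 1})-[:HAS_BLOOD_TYPE]->(:BloodType)-[:Can_Donate_To]->(:BloodType)<-[:HAS_BLOOD_TYPE]-(recipient:Person {id: 2})
RETURN count(*) > 0 AS compatible;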
Social Security Number
Is highly sensitive, and a lawsuit waiting to happen. Keep it out of the DB if at all possible. If you absolutely have to store it: this property is immutable but unique to every person, so because of the lack of reuse it makes a bad label and would be pointless as a node. Definitely a property. (But it should be salted and hashed, and used for verification purposes only.)
Mother's maiden name
The possible values are endless, and two nodes sharing this value has no real meaning. Definitely a property.
First born child
Since the child is already their own node, with its own sub-properties, just create a relationship between the two. While the value of this info is questionable, any time you need to reference another node, always use a relationship for it. Definitely a node.
To keep things simple, as part of the ETL on my time-series data, I added a sequence number property to each row corresponding to 0..370365 (370,366 nodes, 5,555,490 properties - not that big). I later added a second property, naming the original "outeseq" and the second "ineseq", to see if an outright equivalence to base the relationship on might speed things up a bit.
I can get both of the following queries to run properly on up to ~30k nodes (LIMIT 30000), but past that it's just an endless wait. My JVM has a 16 GB max heap (if it can even use it on a Windows box):
MATCH (a:BOOK),(b:BOOK)
WHERE a.outeseq=b.outeseq-1
MERGE (a)-[s:FORWARD_SEQ]->(b)
RETURN s;
or
MATCH (a:BOOK),(b:BOOK)
WHERE a.outeseq=b.ineseq
MERGE (a)-[s:FORWARD_SEQ]->(b)
RETURN s;
I also added these in hopes of speeding things up:
CREATE CONSTRAINT ON (a:BOOK)
ASSERT a.outeseq IS UNIQUE
CREATE CONSTRAINT ON (b:BOOK)
ASSERT b.ineseq IS UNIQUE
I can't get the relationships created for the entire data set! Help!
Alternatively, I can also get bits of the relationships built with parameters, but haven't figured out how to parameterize the sequence over all of the node-to-node sequential relationships, at least not in a semantically general enough way to do this.
I profiled the query, but didn't see any reason for it to "blow up".
Another question: I would like each relationship to have a property representing the difference between the timestamps of the two nodes, i.e. delta-t. Is there a way to take the difference between the values in two sequential nodes and assign it to the relationship? And to do this for all of the relationships at the same time?
The last Q, if you have the time: I'd really like to use the raw data and just chain the directed relationships from one node's timestamp to the next nearest node with the minimum delta, but I didn't run right at this for fear that it would cause scanning of all the nodes in order to build each relationship.
Before anyone suggests that I look to KDB or other db's for time series, let me say I have a very specific reason to want to use a DAG representation.
It seems like this should be so easy...it probably is and I'm blind. Thanks!
Creating Relationships
Since your queries work on 30k nodes, I'd suggest running them page by page over all the nodes. This seems feasible because outeseq and ineseq are unique and numeric, so you can sort the nodes by those properties and run the query against one slice at a time.
MATCH (a:BOOK),(b:BOOK)
WHERE a.outeseq = b.outeseq-1
WITH a, b ORDER BY a.outeseq SKIP {offset} LIMIT 30000
MERGE (a)-[s:FORWARD_SEQ]->(b)
RETURN s;
You will need to run the query about 13 times, changing {offset} each time to cover all the data. It would be convenient to write a small script in any language that has a Neo4j client.
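If the APOC procedure library happens to be installed (an assumption on my part, it is not required for the approach above), the batching can also be delegated to apoc.periodic.iterate, which avoids the manual offset bookkeeping and the cartesian MATCH:

CALL apoc.periodic.iterate(
  "MATCH (a:BOOK) RETURN a",
  "MATCH (b:BOOK {ineseq: a.outeseq}) MERGE (a)-[:FORWARD_SEQ]->(b)",
  {batchSize: 10000, parallel: false});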
Updating Relationship's Properties
You can assign the timestamp delta to relationships using a SET clause following the MATCH. Assuming that the timestamp is stored as a long:
MATCH (a:BOOK)-[s:FORWARD_SEQ]->(b:BOOK)
SET s.delta = abs(b.timestamp - a.timestamp);
Chaining Nodes With Minimal Delta
When relationships carry the delta property, the graph becomes a weighted graph, so we can apply this approach to calculate the shortest path using the deltas. Then we just save the length of the shortest path (the sum of the deltas) into a relationship between the first and the last node.
MATCH p=(a:BOOK)-[:FORWARD_SEQ*1..]->(b:BOOK)
WITH p AS shortestPath, a, b,
reduce(weight = 0, r IN relationships(p) | weight + r.delta) AS totalDelta
ORDER BY totalDelta ASC
LIMIT 1
MERGE (a)-[nearest:NEAREST {delta: totalDelta}]->(b)
RETURN nearest;
Disclaimer: the queries above are not guaranteed to work as-is; they just hint at possible approaches to the problem.
My business requirement says I need to add an arbitrary number of well-defined (AKA not dynamic, not unknown) attributes to certain types of nodes. I am pretty sure that while there could be 30 or 40 different attributes, a node will probably have no more than 4 or 5 of them. Of course there will be corner cases...
In this context, I am generically using 'attribute' as a tag wanted by the business, and not in the Neo4J sense.
I'll be expected to report on which nodes have which attributes. For example, I might have to report on which nodes have the "detention", "suspension", or "double secret probation" attributes.
One way is to simply have an array of appropriate attributes on each entity. But each query would require a search of all nodes. Or, I could create explicit attributes on each node. Now they could be indexed. I'm not seriously considering either of these approaches.
Another way is to implement each attribute as a singleton Neo node, and allow many (tens of thousands?) of other nodes to relate to these nodes. This implementation would have 10,000 nodes but 40,000 relationships.
Finally, the attribute nodes could be created and used by specific entity nodes on an as-needed basis. In this case, if 10,000 entities had an average of 4 attributes, I'd have a total of 50,000 nodes.
As I type this, I realize that in the 2nd case, I still have 40,000 relationships; the 'truth' of the situation did not change.
Is there a reason to avoid the 'singleton' implementation? I could put timestamps on the relationships. But those wouldn't be indexed...
For your simple use case, I'd suggest an approach you didn't list -- which is to use a node label for each "attribute".
Nodes can have multiple labels, and neo4j can quickly iterate through all the nodes with the same label -- making it very quick and easy to find all the nodes with a specific label.
For example:
MATCH (n:Detention)
RETURN n;
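And because a node can carry several labels at once, combinations of attributes are just as direct (label names taken from the examples in the question):

// Nodes that have both the "detention" and "suspension" attributes.
MATCH (n:Detention:Suspension)
RETURN n;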
I am teaching myself graph modelling and am using a Neo4j 2.2.3 database with NodeJs and the Express framework.
I have skimmed through the free Neo4j graph database book and learned how to model a scenario, when to use relationships and when to create nodes, etc.
I have modelled a vehicle-selling scenario with the following structure:
NODES
(:VEHICLE{mileage:xxx, manufacture_year: xxxx, price: xxxx})
(:VFUEL_TYPE{type:xxxx}) x 2 (one for diesel and one for petrol)
(:VCOLOR{color:xxxx}) x 8 (red, green, blue, .... yellow)
(:VGEARBOX{type:xxx}) x 2 (AUTO, MANUAL)
RELATIONSHIPS
(vehicleNode)-[:VHAVE_COLOR]->(colorNode - either of the colors)
(vehicleNode)-[:VGEARBOX_IS]->(gearboxNode - either manual or auto)
(vehicleNode)-[:VCONSUMES_FUEL_TYPE]->(fuelNode - either diesel or petrol)
Assuming we have the above structure and so on for the rest of the features.
As shown in the above screenshot (136 & 137 are VEHICLE nodes), the majority of a vehicle's features are created as separate nodes and shared, via relationships, among the vehicles that have a feature in common.
Could you please advise whether features like color, body type, driving side (left drive or right drive), gearbox and others should be separate (labelled) nodes or properties of the VEHICLE node? Which option is more performance-friendly and easier to query?
I want to write JS code that allows querying the graph with the above structure using one or many search criteria. If the majority of those features were properties of the VEHICLE node, querying would not be difficult:
MATCH (v:VEHICLE) WHERE v.gearbox = "MANUAL" AND v.fuel_type = "PETROL" AND v.price > x AND v.price < y AND .... RETURN v;
However, with the existing graph model it is tricky to search, especially when there are multiple criteria that are not properties of the VEHICLE node but separate nodes linked via relationships.
Any ideas and advice regarding the existing structure of the graph, to make it more queryable as well as performance-friendly, would be much appreciated. If we imagine a scenario with 1,000 VEHICLE nodes, that would generate 15,000 relationships; that sounds a bit scary, and if it hits a million vehicles, then at most 15 million relationships. Please comment if I am heading in the wrong direction.
Thank you for your time.
Modeling is full of tradeoffs, it looks like you have a decent start.
Don't be concerned at all with the number of relationships. That's what graph databases are good at, so I wouldn't be too concerned about over-using them.
Should something be a property, or a node? I can't answer for your scenario, but here are some things to consider:
If you look something up by a value all the time, and you have many objects, it's usually going to be faster to find one node and then everything connected to it, because graph DBs are good at exploiting relationships. It's slower to scan all nodes of a label and find the items where a property equals a given value.
Relationships work well when you want to express a connection to something that isn't a simple primitive data type. For example, take "gearbox". There are manuals and other types... if it's a property value, you won't easily be able to later decide to store four other sub-types/sub-aspects of "gearbox". If it were a node, that would be easy, because you could add more properties to the node or relate other things to it.
If a piece of data really is a primitive (String, integer, etc) and you don't need extra detail about it, that usually makes a good property. Querying primitive values by connecting to other nodes will seem clunky later on. For example, I wouldn't model a person with a "date of birth" as a separate node, that would be irritating to query, and would give you flexibility you'd be very unlikely to need in the future.
Semantically, how is your data related? If two items are similar because they share an X, then that X probably should be a node. If two items happen to have the same Y value but that doesn't really mean much, then Y is probably better off as a node property.
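To make the tradeoff concrete for your model, here is a rough sketch of the same search in both styles (the VGEARBOX/VFUEL_TYPE/VEHICLE labels and relationship types come from your question; the gearbox/fuel_type property names and the {minPrice}/{maxPrice} parameters are assumptions):

// Feature-as-node: anchor on the small feature nodes and walk to the matching vehicles.
MATCH (g:VGEARBOX {type: 'MANUAL'})<-[:VGEARBOX_IS]-(v:VEHICLE)-[:VCONSUMES_FUEL_TYPE]->(f:VFUEL_TYPE {type: 'PETROL'})
WHERE v.price > {minPrice} AND v.price < {maxPrice}
RETURN v;

// Feature-as-property: scan the VEHICLE label and filter on its own properties.
MATCH (v:VEHICLE)
WHERE v.gearbox = 'MANUAL' AND v.fuel_type = 'PETROL' AND v.price > {minPrice} AND v.price < {maxPrice}
RETURN v;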
I've been using Neo4j for a little while now and have an app up and running with it. It's all working really well and Neo4j has been really good at solving this problem, but I now need to extend the app, and I have been trying to implement a key-value list of data in Neo4j and I'm not sure of the best way to go about it.
I have a list that is around 7 million elements long, which is a bit much to just store in memory and manage myself. I tested this and it would consume 3 GB.
My choices are either:
(a) Neo4j is just the wrong tool for the job and I should use an actual key-value data store. I'm a little averse to doing this as I'd have to introduce another data store just for this list of data.
(b) Use Neo4j, creating a node per key-value pair and setting the key and value as properties on the node; there is no relationship involved, just an index grouping these nodes together, with the key of the key-value pair exposed as the key on the index.
(c) Create a single node and hold all key-values as properties. This feels wrong, because getting the node will load the whole thing into memory, and then I'd have to search the properties for the one I'm interested in; I might as well manage the list myself.
(d) The key is a two-part key that actually points to two nodes, so create a relationship and set the value as a property on the relationship. I started down this path, but when it came to doing a lookup for a specific key/value it wasn't simple and fast, so I backed away from this.
Either option 'a' or 'b' feels like the way to go.
Any advice would be appreciated.
Example scenario
We have Node A and Node B which has a relationship between the two Nodes.
The nodes all have a property of 'foo', with foo having some value.
In this example node A has foo=X and Node B has foo=Y
We then have this list of K/Vs. One of those K/V is Key:X+Y=Value:Z
So the original idea was to create another relationship between node A and node B and store a property on the relationship holding Z, then create an index on 'foo' and a relationship index for the new relationship.
When given key X+Y, get the value.
The lookup logic would be: get node A (from X) and node B (from Y), then walk through node A's relationships to node B looking for this new relationship type.
While this will work, I do not like the fact that I have to look through all relationships to/from the nodes for a specific type; this is inefficient, especially if there are many relationships of different types.
So the conclusion is to go with option 'a' or 'b', or else I'm trying to do something impractical with Neo.
Don't try to store 7 million items in a Neo4j property -- you're right, that's wrong.
Redis and Neo4j often make a good pairing, but I don't quite understand what you're trying to do or what you mean in "d" -- what are the key/value pairs, and how do they relate to the nodes and relationships in the graph? Examples would help.
UPDATE: The most natural way to do this with a graph database is to store it as a property on the edge between the two nodes. Then you can use Gremlin to get its value.
For example, to return a property on an edge that exists between two vertices (nodes) that have some properties:
start = g.idx('vertices')[[key:value]] // start vertex
edge = start.outE(label).as('e') // edge
end = edge.inV.filter{it.someprop == somevalue} // end vertex
prop = end.back('e').prop // edge property
return prop
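For reference, roughly the same lookup expressed in Cypher, assuming a hypothetical :Entity label with an index on foo and the value stored in a value property on a :KV relationship (all of those names are my assumptions, not part of your design):

// Given key X+Y, fetch Z from the relationship between the two nodes.
MATCH (a:Entity {foo: 'X'})-[r:KV]->(b:Entity {foo: 'Y'})
RETURN r.value;

Because the relationship type is part of the pattern, only :KV relationships between the two nodes are considered during the match, which addresses the lookup cost you were worried about.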
You could store it in an index like you suggested, but this adds more complexity to your system, and if you need to reference the data as part of the traversal, then you will either have to store duplicate data or look it up in Redis during the traversal, which you can do, see:
Have Gremlin Talk to Redis in Real Time while It's Walking the Graph
https://groups.google.com/d/msg/gremlin-users/xhqP-0wIg5s/bxkNEh9jSw4J
UPDATE 2:
If the IDs of vertices a and b are known ahead of time, then it's even easier:
g.v(a).outE(label).filter{it.inVertex.id == b}.prop
If the vertices a and b themselves are known ahead of time, then it's:
a.outE(label).filter{it.inVertex == b}.prop