What's the right way to model and iterate time dependent relationships?
For example:
John married Elizabeth on 07/03/69 and divorced her on 05/12/73; then he married Corrie on 03/18/82 and is still married.
Mark worked for IBM (certain date intervals), then MSFT (other time interval), etc.
There are many other relationships which are time dependent:
lived in
worked for
reported to
belonged to etc.
What is the right way to model these? A typical query would be a traversal with an "as of" parameter, e.g., "Who is the spouse of John as of 01/01/74?"
This is easily done with indexes. Indexes and properties can only hold primitive values, so convert your Date object to a long value and index that. You'll have to store the indexed value as a special numeric type, but after that you can search by range, so you can run "before", "after", or even "between" queries.
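As a rough illustration of the date-to-long idea, here is a minimal Python sketch. The names to_epoch_millis, marriage_starts and started_between are hypothetical, and the dictionary is only a stand-in for a numeric index, not Neo4j's index API. Dates are read as MM/DD/YY.

```python
from datetime import date

EPOCH = date(1970, 1, 1)
MS_PER_DAY = 86_400_000


def to_epoch_millis(d):
    """Convert a calendar date to milliseconds since the Unix epoch (UTC midnight)."""
    return (d - EPOCH).days * MS_PER_DAY


# Hypothetical stand-in for a numeric index: spouse name -> marriage start as a long.
marriage_starts = {
    "Elizabeth": to_epoch_millis(date(1969, 7, 3)),
    "Corrie": to_epoch_millis(date(1982, 3, 18)),
}


def started_between(index, lo, hi):
    """Return the keys whose indexed value lies in the inclusive range [lo, hi]."""
    return sorted(name for name, ts in index.items() if lo <= ts <= hi)


before_1974 = started_between(
    marriage_starts,
    to_epoch_millis(date(1900, 1, 1)),
    to_epoch_millis(date(1974, 1, 1)),
)
```

Dates before 1970 simply become negative longs, so range comparisons still work.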
You could consider using extra nodes to represent a particular marriage or period of employment in conjunction with a calendar subgraph. For example, John's marriages could be modelled as follows:
(John)-[:MARRIAGE]->(John+Liz)
(Liz)-[:MARRIAGE]->(John+Liz)
(John+Liz)-[:START_DATE]->(07/03/69)
(John+Liz)-[:END_DATE]->(05/12/73)
(John)-[:MARRIAGE]->(John+Corrie)
(Corrie)-[:MARRIAGE]->(John+Corrie)
(John+Corrie)-[:START_DATE]->(03/18/82)
This gives you flexibility in both the number of marriages each person may undertake as well as whether an END_DATE exists or not.
Hope this helps
Nige
I would always create two kinds of relationships for every type: lived_in_s, lived_in_e, belonged_to_s, belonged_to_e, and so on, where the suffixes "s" and "e" mark the start and end time of that relationship. Then a query with an "as of" parameter could look like:
START n=node({John})
MATCH n-[r:spouse_of_s]-m, n-[r2?:spouse_of_e]-m
WHERE r.time <= {timestamp} AND r2.time? > {timestamp}
RETURN m;
(I might have a typo in the query at r2?; I wrote it without testing.)
You might also want to use the indexes #nicholas wrote about.
I usually add a timestamp as an on_date attribute.
I would also recommend encoding your dates in the YYYYMMDD format. You can then easily do comparisons in your code. For instance:
John married Elizabeth on 07/03/69 and divorced her on 05/12/73; then he married Corrie on 03/18/82 and is still married.
If you want to know whether John is married and to whom, you can simply compare the dates and get the end node.
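A minimal Python sketch of that comparison, assuming dates are stored as YYYYMMDD integers and an ongoing marriage has no end date. JOHN_MARRIAGES and spouse_as_of are hypothetical names used only for illustration.

```python
from typing import Optional

# (spouse, start as YYYYMMDD, end as YYYYMMDD or None if still married)
JOHN_MARRIAGES = [
    ("Elizabeth", 19690703, 19730512),
    ("Corrie", 19820318, None),
]


def spouse_as_of(marriages, as_of: int) -> Optional[str]:
    """Return the spouse whose marriage interval contains as_of, or None."""
    for spouse, start, end in marriages:
        # YYYYMMDD integers sort in calendar order, so plain comparison works.
        if start <= as_of and (end is None or as_of < end):
            return spouse
    return None
```

With this data, asking for the spouse as of 01/01/74 (19740101) yields no one, since that date falls between the divorce and the second marriage.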
Related
I'm dealing with the following situation: many trips exist between many cities. Both have various properties. For example, cities have a name and a count of trips that passed through them, whereas trips have a distance and a time.
What is 'best practise' in Neo4j?
a) Add all cities and trips as nodes, and connect the trips to the start and end nodes by means of 'STARTED_AT' and 'ENDS_IN' relations.
or
b) Add only cities as nodes, and represent each trip as a relation between 2 nodes. This means there are many relations of the same type between the same nodes, differing only in their properties.
Information that might be useful: we only need to run queries of all kinds; no insertion is needed.
Thanks!
I would argue it's better to store trips as nodes, because relationship properties cannot be indexed and more complex queries (like finding the shortest route by time) will be slow. So if you are searching for trips by ID or something similar, you will need to store them as nodes.
On the other hand, an argument can be made for using relationships, because then you can take full advantage of APOC's weighted graph search functions.
A good way to decide whether something should be a node or a relation is to ask yourself, "Are there any other relations that would make sense here?" If you are asking whether two cities are connected, a relationship makes more sense, because they either are or are not. If you are talking about road trips, though, a trip can pass through multiple cities, can have participants (or groups thereof), and can have an owner. In that case, for future flexibility, nodes will be much easier to maintain.
I would say it really depends on how you model these trips. Let's assume we can generalize this as (city)-[trip]->(city). Notice that Neo4j relations always have a direction, so we can keep adding an unlimited number of trips between cities without having to redefine each city for each trip. This actually answers (a), by the way: we don't need to define where a trip starts and ends, because the relation does all that work for you.
'This means there are many of the same relations between nodes' <<- on this note, if you need to distinguish trips by the time the trip was taken, you can add the date/timestamp as a relationship property, or you can go with a time tree (see Mark Needham's article on that here and Graphgrid's take).
Hope this helps.
I am running into this wall regarding bidirectional relationships.
Say I am attempting to create a graph that represents a family tree. The problem here is that:
* Timmy can be Suzie's brother, but
* Suzie can not be Timmy's brother.
Thus, it becomes necessary to model this in 2 directions:
(Sure, technically I could say SIBLING_TO and leave only one edge, but I'm not sure what the vocabulary would be when I try to connect a grandma to a grandson.)
When all is said and done, I'm pretty sure there's no way around the fact that the direction matters in this example.
I was reading this blog post, regarding common Neo4j mistakes. The author states that this bidirectionality is not the most efficient way to model data in Neo4j and should be avoided.
And I am starting to agree. I set up a mock set of 2 families:
and I found that a lot of the queries I was attempting to run were very, very slow. This is because of the 'all connected to all' nature of the graph, at least within each respective family.
My question is this:
1) Am I correct to say that bidirectionality is not ideal?
2) If so, is my example of a family tree representable in any other way...and what is the 'best practice' in the many situations where my problem may occur?
3) If it is not possible to represent the family tree in another way, is it technically possible to still write queries in some manner that gets around the problem of 1) ?
Thanks for reading this and for your thoughts.
Storing redundant information (your bidirectional relationships) in a DB is never a good idea. Here is a better way to represent a family tree.
To indicate "siblingness", you only need a single relationship type, say SIBLING_OF, and you only need to have a single such relationship between 2 sibling nodes.
To indicate ancestry, you only need a single relationship type, say CHILD_OF, and you only need a single such relationship from a child to each of its parents.
You should also have a node label for each person, say Person. And each person should have a unique ID property (say, id), and some sort of property indicating gender (say, a boolean isMale).
With this very simple data model, here are some sample queries:
To find Person 123's sisters (note that the pattern does not specify a relationship direction):
MATCH (p:Person {id: 123})-[:SIBLING_OF]-(sister:Person {isMale: false})
RETURN sister;
To find Person 123's grandfathers (note that this pattern specifies that matching paths must have a depth of 2):
MATCH (p:Person {id: 123})-[:CHILD_OF*2..2]->(gf:Person {isMale: true})
RETURN gf;
To find Person 123's great-grandchildren:
MATCH (p:Person {id: 123})<-[:CHILD_OF*3..3]-(ggc:Person)
RETURN ggc;
To find Person 123's maternal uncles:
MATCH (p:Person {id: 123})-[:CHILD_OF]->(:Person {isMale: false})-[:SIBLING_OF]-(maternalUncle:Person {isMale: true})
RETURN maternalUncle;
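To make the fixed-depth CHILD_OF traversal concrete outside Cypher, here is a small Python sketch. The CHILD_OF mapping, the person names and ancestors_at_depth are hypothetical and only mirror the idea behind [:CHILD_OF*2..2].

```python
# Hypothetical data: child -> set of parents (one CHILD_OF edge per parent).
CHILD_OF = {
    "Timmy": {"Mom", "Dad"},
    "Suzie": {"Mom", "Dad"},
    "Mom": {"Grandma", "Grandpa"},
}


def ancestors_at_depth(person, depth, child_of):
    """Follow CHILD_OF edges exactly `depth` times and return the people reached."""
    frontier = {person}
    for _ in range(depth):
        # Replace the frontier with the parents of everyone currently in it.
        frontier = {parent for child in frontier for parent in child_of.get(child, ())}
    return frontier


grandparents = ancestors_at_depth("Timmy", 2, CHILD_OF)
```

Depth 1 gives parents, depth 2 grandparents, and so on; filtering by gender would be a further step on the result.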
I'm not sure if you are aware that it's possible to query bidirectionally (that is, to ignore the direction). So you can do:
MATCH (a)-[:SIBLING_OF]-(b)
and since I'm not matching a direction it will match both ways. This is how I would suggest modeling things.
Generally you only want to create multiple relationships if you actually want to store different state. For example, a KNOWS relationship might apply only one way, because person A might know person B while B does not know A. Similarly, you might have a LIKES relationship with a value property showing how much A likes B, and there might be different strengths of "liking" in the two directions.
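The same idea, one stored edge queried in either direction, can be sketched in Python. The SIBLING_OF set and the siblings function are hypothetical names; the direction in which each pair is stored is arbitrary.

```python
# Each sibling pair is stored exactly once; the stored direction carries no meaning.
SIBLING_OF = {("Timmy", "Suzie"), ("Suzie", "Anna")}


def siblings(person, edges):
    """Return everyone connected to `person` by a SIBLING_OF edge, ignoring direction."""
    found = set()
    for a, b in edges:
        if a == person:
            found.add(b)
        elif b == person:
            found.add(a)
    return found
```

Here siblings("Suzie", SIBLING_OF) finds both Timmy and Anna even though Suzie is the end of one edge and the start of the other.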
I am just getting into Graph databases and need advice.
For this example, I have a 'Person' node and a 'Project' node with two relationships between the two. The two relationships are:
A scheduled date, this is the projected finished date
A verified date, this is the actual finished date
Both are from the Person to the Project.
I'm specifically asking about using a relationship property to hold the "date value" of the event. Are there any downsides to this, or is there a better way to model this in a graph?
A simple mock up is below:
It is easier to hold dates as Unix epoch timestamps (stored as long integers) than as Julian dates. Neo4j had no built-in date/time type before version 3.4.
Timestamps can be used to perform calculations on the dates, for example to find how many days behind schedule the project is as of the current date.
The timestamp() function in Cypher returns the current Unix time in milliseconds.
Each relationship in Neo4j takes up 34 bytes of data internally, excluding the actual content of the relationship. It might be more efficient to hold both the scheduled completion and the verified completion as properties on a single relationship rather than storing them as two relationships.
A relationship does not need to have both the scheduled date and the verified date at the same time (the advantages of NoSQL). You can add the verified date later using the SET keyword.
Just to give you an example:
Use the following Cypher statement to create the data. The scheduled completion is stored in milliseconds so that it is comparable with timestamp().
CREATE (p:Person {name:'Bill'})-[r:Works_On {scheduledcompletion: 1461801600000}]->(pro:Project {name:'Jabberwakie'})
Use the following Cypher statement to set the verified date to the current time.
MATCH (p:Person {name:'Bill'})-[r:Works_On]->(pro:Project {name:'Jabberwakie'}) SET r.verifiedcompletion = timestamp()
Use the following Cypher statement to perform some kind of calculation, in this case to return a boolean value indicating whether the project finished behind schedule.
MATCH (p:Person {name:'Bill'})-[r:Works_On]->(pro:Project {name:'Jabberwakie'}) RETURN CASE WHEN r.verifiedcompletion > r.scheduledcompletion THEN true ELSE false END AS behindschedule
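For the "how many days behind schedule" calculation mentioned above, here is a minimal Python sketch. It assumes both values are epoch timestamps in milliseconds, which is the unit Cypher's timestamp() returns; days_behind is a hypothetical helper name.

```python
MS_PER_DAY = 86_400_000


def days_behind(scheduled_ms, verified_ms):
    """Whole days by which the verified completion trails the scheduled one.

    Returns 0 when the work finished on time or early.
    """
    return max(0, (verified_ms - scheduled_ms) // MS_PER_DAY)


scheduled = 1_461_801_600_000  # scheduled completion, epoch milliseconds
verified = scheduled + 3 * MS_PER_DAY + 5  # finished a little over 3 days late
late_by = days_behind(scheduled, verified)
```

Mixing seconds and milliseconds in the two properties would make any such comparison meaningless, so it is worth settling on one unit up front.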
Also think about storing the projected and actual finish dates on the Project node if these properties describe the whole project and are the same for every person related to it.
This helps you avoid duplicating data and makes querying projects by property faster, as you won't have to look through relationships. Even if your model is designed to hold different projected and actual dates for different persons on the relationships, it still makes sense to store combined information for the whole project on the Project node, as it makes the model clearer and some queries faster to execute.
Suppose I have a large knowledge base with many relationship types, e.g., hasChild, livesIn, locatedIn, capitalOf, largestCityOf...
The number of capitalOf relationships is relatively small (say, one hundred) compared to the number of nodes and of other types of relationships.
I want to fetch any capital which is also the largest city in their country by the following query:
MATCH city-[:capitalOf]->country, city-[:largestCityOf]->country RETURN city
Apparently it would be wise to take the capitalOf type as the clue, scan all 100 relationships of this type, and refine by [:largestCityOf]. However, Neo4j's current execution plan engine does an AllNodesScan and Expand. Why not consider adding a "RelationshipByTypeScan" operator to the query optimization engine, analogous to NodeByLabelScan?
I know that I can turn relationship types into relationship properties, index them using the legacy index, and manually specify
START r=relationship:rels(rtype = "capitalOf")
to tell neo4j how to make it efficient. But for a more complicated pattern query with many relationship types but no node id/label/property to start from, it is clearly a duty of the optimization engine to decide which relationship type to start with.
I saw many questions asking about the same problem but getting answers like "negative... a query TYPICALLY starts from nodes... ". I just want to use the above typical scenario to ask why once more.
Thanks!
A relationship is local to its start and end node - there is no global relationship dictionary. An operation like "give me globally all relationships of type x" is therefore an expensive operation - you need to go through all nodes and collect matching relationships.
There are 2 ways to deal with this:
1) use a manual index on relationships as you've sketched
2) assign labels to your nodes. Assume all the country nodes have a Country label. You can rewrite your query:
MATCH (city)-[:capitalOf]->(country:Country), (city)-[:largestCityOf]->(country) RETURN city
The AllNodesScan is now a NodeByLabelScan. The query grabs all countries and matches them to the cities. Since every country has one capital and one largest city, this is efficient and scales independently of the rest of your graph.
If you put all relationships into one index and try to grab the ~100 capitalOf relationships, that operation scales logarithmically with the total number of relationships in your graph.
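To illustrate why starting from the rare relationship type is attractive, here is a small Python sketch that groups relationships by type and then intersects two groups. RELS, index_by_type and capitals_that_are_largest are hypothetical names; this is an in-memory analogy, not how Neo4j stores or plans anything.

```python
from collections import defaultdict

# Hypothetical triples: (start node, relationship type, end node).
RELS = [
    ("Paris", "capitalOf", "France"),
    ("Paris", "largestCityOf", "France"),
    ("Canberra", "capitalOf", "Australia"),
    ("Sydney", "largestCityOf", "Australia"),
    ("Alice", "livesIn", "Paris"),
]


def index_by_type(rels):
    """Group (start, end) pairs by relationship type."""
    index = defaultdict(set)
    for start, rel_type, end in rels:
        index[rel_type].add((start, end))
    return index


def capitals_that_are_largest(index):
    """Scan only the capitalOf pairs and keep those that are also largestCityOf pairs."""
    largest = index["largestCityOf"]
    return sorted(city for city, country in index["capitalOf"] if (city, country) in largest)


result = capitals_that_are_largest(index_by_type(RELS))
```

The work is proportional to the number of capitalOf pairs once the grouping exists; building and maintaining that grouping is the cost the answer above refers to.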
I was thinking about text-driven search based on user input.
Often you are searching a database of addresses, where you can find customers and so on.
Does anybody have an idea how to find out which of the typed words is the name, which is the street name, and which is the company name?
And secondly, if the name is a double name like "Lee Harvey", how can I find out that the two words Lee and Harvey belong together?
The same problem arises with company names like "frank the baker inc."...
Is there any algorithm or best-practice strategy?
Thanks for links, tutorials, scripts and all other help ;-)
What you basically want is a search engine :) Here are the basic steps you need to follow -
You need to create an 'inverted index' of the content you want to search.
The index is a set of 'name'=>'value' pairs. You can structure these pairs whichever way you want (tuned according to your data and needs).
Eg. for your problem of double names, you could split all your names into single words & index it like so -
'lee'=>'lee harvey'
'harvey'=>'lee harvey'
...
This way, when anyone searches for 'lee' they get 'lee harvey'. There are other, better approaches to this, such as "n-gram" indexing. Check it out...
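A minimal Python sketch of the word-level inverted index described above. build_inverted_index, lookup and the sample NAMES are hypothetical; tokenization here is just lowercasing and splitting on whitespace.

```python
from collections import defaultdict


def build_inverted_index(names):
    """Map each lowercase word to the set of full names that contain it."""
    index = defaultdict(set)
    for name in names:
        for word in name.lower().split():
            index[word].add(name)
    return index


def lookup(index, term):
    """Return the full names indexed under a single search word."""
    return sorted(index.get(term.lower(), ()))


NAMES = ["Lee Harvey", "Frank the Baker Inc."]
index = build_inverted_index(NAMES)
```

Searching for either 'lee' or 'harvey' returns the full double name, which is how the index ties the two words together.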
You could build indexes of names, addresses, emails, etc., and when the user types a query, check it against all your indexes with the approach suggested above. After you get the results, merge them. You could also introduce the notion of rank so that you can sort your results and show the latest or most relevant ones at the top. For this you need to figure out a way to score your terms...
Don't worry about that; just perform a full-text search. Then check each result item to see which field contains the search terms. You may also display items in separate lists (terms found in name, terms found in address). The only difficulty is that if John Smith lives on John Smith Street, you must decide which list or lists the result item belongs to.
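The "check which field contains the search terms" step could be sketched as follows in Python. The record layout and the fields_matching function are hypothetical, and matching is a plain case-insensitive substring test rather than a real full-text engine.

```python
def fields_matching(record, query):
    """Return the names of the fields that contain every word of the query."""
    terms = query.lower().split()
    return [
        field
        for field, value in record.items()
        if all(term in value.lower() for term in terms)
    ]


record = {
    "name": "John Smith",
    "street": "John Smith Street",
    "company": "Frank the Baker Inc.",
}
hits = fields_matching(record, "john smith")
```

For this record the query matches both the name and the street, which is exactly the ambiguous case where the result has to be shown in both lists or assigned to one by some rule.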