I am just getting into Graph databases and need advice.
For this example, I have a 'Person' node and a 'Project' node with two relationships between the two. The two relationships are:
A scheduled date, this is the projected finished date
A verified date, this is the actual finished date
Both are from the Person to the Project.
Specifically referring to using the relationship property to hold the "date value" of the event. Are they any downsides to this, or a better way to model this in a graph?
A simple mock up is below:
It is easier to hold dates in the form of Unix Epoch time stamp (stored as long integer), rather than as Julian dates. Neo4j has no built in date / time format.
Timestamp and can be used to perform calculations on the dates to find things like how-many days behind schedule is the project based on current date.
The timestamp() function in Cypher provides a way to get the current Unix time within neo4j.
Each relationship in Neo4J takes up 34 Bytes of data internally, excluding the actual content of the relationship. It might be more efficient to hold both scheduled completion and verified completion as properties in a single relationship rather than storing them as two relationships.
A relationship does not need to have both the scheduled date and the verified date at the same time (the advantages of NoSQL). You can add the verified date later using the SET keyword.
Just to give you an example.
use the following Cypher statement to create.
Create (p:Person {name:'Bill'})-[r:Works_On {scheduledcompletion: 1461801600}]->(pro:Project {name:'Jabberwakie'})
use the following Cypher statement to set the verified date to current time.
Match (p:Person {name:'Bill'})-[r:Works_On]->(pro:Project {name:'Jabberwakie'}) set r.verifiedcompletion=timestamp()
use the following Cypher statement to perform some kind of calculation, in this case to return a boolean value if the project was behind schedule or not.
Match (p:Person {name:'Bill'})-[r:Works_On]->(pro:Project {name:'Jabberwakie'}) return case when r.scheduledcompletion > r.verifiedcompletion then true else false end as behindschedule
Also think about storing projected finished date and actual finished date in node project in case if this property related to the whole project and is the same for all persons related to it.
This will help you to avoid duplication of data and will make querying projects by properties work faster as you woun't have to look for relationships. In cases where your model designed to have different dates actual and finished for different persons in addition to storing datasets in relationships it still makes sense to store in project Node combined information for the whole project. As it will make model more clear and some queries faster to execute.
Related
I'm new to neo4j and currently attempting to migrate existing data into a neo4j database. I have written a small program to convert current data (in bespoke format) into a large CREATE cypher query for initial population of the database. My first iteration has been to somewhat retain the structuring of the existing object model, i.e Objects become nodes, node type is same as object name in current object model, and the members become properties (member name is property name). This is done for all fundamental types (and strings) and any member objects are thus decomposed in the same way as in the original object model.
This has been fine in terms of performance and 13000+ line CREATE cypher queries have been generated which can be executed throuh the web frontend/client. However the model is not ideal for a graph database, I beleive, since there can be many properties, and instead I would like to deomcompose these 'fundamental' nodes (with members which are fundamental types) into their own node, relating to a more 'abstract' node which represents the more higher level object/class. This means each member is a node with a single (at first, it may grow) property say { value:"42" }, or I could set the node type to the data type (i.e integer). If my understanding is correct this would also allow me to create relationships between the 'members' (since they are nodes and not propeties) allowing a greater freedom when expressing relationships between original members of different objects rather than just relating the parent objects to each other.
The problem is this now generates 144000+ line Cypher queries (and this isn't a large dataset in compraison to others) which the neo4j client seems to bulk at. The code highlighting appears to work in the query input box of the client (i.e it highlights correctly, which I assume implies it parsed it correctly and is valid cypher query), but when I come to run the query, I get the usual browser not responding and then a stack overflow (no punn intended) error. Whats more the neo4j client doesn't exit elegantly and always requires me to force end task and the db is in the 2.5-3GB usage from, what is effectively and small amount of data (144000 lines, approx 2/3 are relationships so at most ~48000 nodes). Yet I read I should be able to deal with millions of nodes and relationships in the milliseconds?
Have tried it with firefox and chrome. I am using the neo4j community edition on windows10. The sdk would initially be used with C# and C++. This research is in its initial stages so I haven't used the sdk yet.
Is this a valid approach, i.e to initially populate to database via a CREATE query?
Also is my approach about decomposing the data into fundamental types a good one? or are there issues which are likely to arise from this approach.
That is a very large Cypher query!!!
You would do much better to populate your database using LOAD CSV FROM... and supplying a CSV file containing the data you want to load.
For a detailed explaination, have a look at:
https://neo4j.com/developer/guide-import-csv/
(This page also discusses the batch loader for really large datasets.)
Since you are generating code for the Cypher query I wouldn't imagine you would have too much trouble generating a CSV file.
(As an indication of performance, I have been loading a 1 million record CSV today into Neo4j running on my laptop in under two minutes.)
My data model has a ClickerRecord entity with 2 attributes: date (NSDate) and numberOfBiscuits (NSNumber). Every time a new record is added, a different value for numberOfBiscuits can be entered.
To calculate a daily average for the number of biscuits I'm currently doing a fetch request for each day within range and using the corresponding NSExpression to calculate the sum of all numberOfBiscuits values for that day.
The problem: I'm using asynchronous fetch requests to avoid blocking the main thread, so it ends up being quite slow when there are many days between the first and last record. The fetch requests are performed one after another.
I could also load all records into memory and perform the sorting and calculations, but I'm worried that it could become an issue when the number of records becomes very large.
Therefore, my question: Is it possible to use NSExpressions to add something like sub-predicates for each date interval, in order to do a single fetch request and retrieve a dictionary with an entry for each daily sum of numberOfBiscuits?
If not, what would be the recommended approach for this situation?
I've read about subqueries but as far as I've understood they're not intended for this kind of use.
This is the first question I'm asking on SO, so I hope to have written it in a clear way :)
I think what you are looking for is the propertiesToGroupBy (see the Apple Docs) for the NSFetchRequest, though in your case it is not straight forward to implement, for reasons I will discuss later.
Suppose you could specify the category of biscuit consumed on each occasion, and this is stored in a category attribute of your entity. Then to obtain the total number of biscuits of each category (ignoring the date), you could use an NSExpression using #sum and specify:
fetch.propertiesToGroupBy = ["category"]
CoreData will then group the results of the fetch by the category and will calculate the sum for each group separately.
The problem in your case is that (unless you already strip out the time information from your date attribute), there is no attribute that represents the date interval that you want to group by, and CoreData will not let you specify a computed value to group by. You would need to add a new day attribute to your entity, and calculate that whenever you add/update a record, and specify it in the group by. And you face the same problem again if you subsequently want to calculate your average over a different interval - weeks or months for example. One other downside to this is that the results will only include days for which there are ClickerRecords: if the user has a day where they consume no biscuits, then the fetch will not show a result for that day (ie it will not infer an average of 0). You would need to handle this appropriately when using the results.
It might be better either to tune your asynchronous fetch or, as you suggest, just to read the whole lot into memory to perform the calculations. If your entity only has those two attributes, and assuming your users don't live entirely on biscuits, the volumes should not be too problematic.
Is a simple query, but i don't know which cypher operation or java collection would be useful to get the expected result.
So, again working with users assigned to a specific Taxi, where each user has a Grid destination (drop off).
First i returned the users assigned to a specific Taxi and same dropOff time:
MATCH (g:Grid)<-[d:DROP_OFF]-(u:User)-[rt:ASSIGNED]->(t:Taxi {name: 'Taxi1813'})
WHERE d.time = '04:38'
RETURN u.name, g.name as LocationGrid
u.name LocationGrid
UserTestSame Grid1239
UserTestNew Grid919
User177 Grid1239
I would like to add a condition to return the users located in the same Grid. The result set is not fixed; i may have n number f users.
Is there an easy way to do that, or should i use a collection in java to compare each Grid.
Thank you in advance
You can use aggregate Cypher's built-in aggregate functions, such as collect:
MATCH (g:Grid)<-[d:DROP_OFF]-(u:User)-[rt:ASSIGNED]->(t:Taxi {name: 'Taxi1813'})
WHERE d.time = '04:38'
RETURN g.name as LocationGrid, collect (u.name) as users
LocationGrid users
Grid1239 [UserTestSame, User177]
Grid919 [UserTestNew]
Your query assumes that users assigned to the same taxi have the same drop off time? Seems confusing to me, but I might be missing something.
BTW, I don't think storing and matching the drop off time as a string is going to be very suitable for following reasons:
querying will be inefficient as you'll probably need to match both a date and a time string property
it doesn't allow for matching drop off times which are close
I'd recommend modelling the drop off time as a time tree. See following links for reference:
How to represent days in timeline tree for Neo4j/graphDB
Neo4j: Cypher – Creating a time tree down to the day
performance of time filters in neo4j where
Importing Forests into Neo4j
//just use the answer above me
Also Neo4j has an option to use collect so the results come back in a list.
Match(n)-[]->(r)
return n, collect (r) as collection
http://neo4j.com/docs/stable/query-aggregation.html#aggregation-collect
I think you should change your date time. Not in a string but more like yyyyMMddhhmmss so you can compare. Or any other way.
It would also be faster not to store it as a string.
There are several possible ways I can think of to store and then query temporal data in Neo4j. Looking at an example of being able to search for recurring events and any exceptions, I can see two possibilities:
One easy option would be to create a node for each occurrence of the event. Whilst easy to construct a cypher query to find all events on a day, in a range, etc, this could create a lot of unnecessary nodes. It would also make it very easy to change individual events times, locations etc, because there is already a node with the basic information.
The second option is to store the recurrence temporal pattern as a property of the event node. This would greatly reduce the number of nodes within the graph. When searching for events on a specific date or within a range, all nodes that meet the start/end date (plus any other) criteria could be returned to the client. It then boils down to iterating through the results to pluck out the subset who's temporal pattern gives a date within the search range, then comparing that to any exceptions and merging (or ignoring) the results as necessary (this could probably be partially achieved when pulling the initial result set as part of the query).
Whilst the second option is the one I would choose currently, it seems quite inefficient in that it processes the data twice, albeit a smaller subset the second time. Even a plugin to Neo4j would probably result in two passes through the data, but the processing would be done on the database server rather than the requesting client.
What I would like to know is whether it is possible to use Cypher or Neo4j to do this processing as part of the initial query?
Whilst I'm not 100% sure I understand you requirement, I'd have a look at this blog post, perhaps you'll find a bit of inspiration there: http://graphaware.com/neo4j/2014/08/20/graphaware-neo4j-timetree.html
I am trying to build an database in Neo4j with a structure that contains seven different types of nodes, in total around 4-5000 nodes and between them around 40000 relationships. The cypher code i am currently using is that i first create the nodes with the code:
Create (node1:type {name:'example1', type:'example2'})
Around 4000 of that example with unique nodes.
Then I've got relationships stated as such:
Create
(node1)-[:r]-(node51),
(node2)-[:r]-(node5),
(node3)-[:r]-(node2);
Around 40000 of such unique relationships.
With smaller scale graphs this has not been any problem at all. But with this one, the Executing query never stops loading.
Any suggestions on how I can make this type of query work? Or what i should do instead?
edit. What I'm trying to build is a big graph over a product, with it's releases, release versions, features etc. in the same way as the Movie graph example is built.
The product has about 6 releases in total, each release has around 20 releaseversion. In total there is 371 features and of there 371 features there is also 438 featureversions. ever releaseversion (120 in total) then has around 2-300 featureversions each. These Featureversions are mapped to its Feature whom has dependencies towards a little bit of everything in the db. I have also involed HW dependencies such as the possible hw to run these Features on, releases on etc. so basicaly im using cypher code such as:
Create (Product1:Product {name:'ABC', type:'Product'})
Create (Release1:Release {name:'12A', type:'Release'})
Create (Release2:Release {name:'13A, type:'release'})
Create (ReleaseVersion1:ReleaseVersion {name:'12.0.1, type:'ReleaseVersion'})
Create (ReleaseVersion2:ReleaseVersion {name:'12.0.2, type:'ReleaseVersion'})
and below those i've structured them up using
Create (Product1)<-[:Is_Version_Of]-(Release1),
(Product1)<-[:Is_Version_Of]-(Release2),
(Release2)<-[:Is_Version_Of]-(ReleaseVersion21),
All the way down to features, and then I've also added dependencies between them such as:
(Feature1)-[:Requires]->(Feature239),
(Feature239)-[:Requires]->(Feature51);
Since i had to find all this information from many different excel-sheets etc, i made the code this way thinking i could just put it together in one mass cypher query and run it on the /browser on the localhost. it worked really good as long as i did not use more than 4-5000 queries at a time. Then it created the entire database in about 5-10 seconds at maximum, but now when I'm trying to run around 45000 queries at the same time it has been running for almost 24 hours, and are still loading and saying "executing query...". I wonder if there is anyway i can improve the time it takes, will the database eventually be created? or can i do some smarter indexes or other things to improve the performance? because by the way my cypher is written now i cannot divide it into pieces since everything in the database has some sort of connection to the product. Do i need to rewrite the code or is there any smooth way around?
You can create multiple nodes and relationships interlinked with a single create statement, like this:
create (a { name: "foo" })-[:HELLO]->(b {name : "bar"}),
(c {name: "Baz"})-[:GOODBYE]->(d {name:"Quux"});
So that's one approach, rather than creating each node individually with a single statement, then each relationship with a single statement.
You can also create multiple relationships from objects by matching first, then creating:
match (a {name: "foo"}), (d {name:"Quux"}) create (a)-[:BLAH]->(d);
Of course you could have multiple match clauses, and multiple create clauses there.
You might try to match a given type of node, and then create all necessary relationships from that type of node. You have enough relationships that this is going to take many queries. Make sure you've indexed the property you're using to match the nodes. As your DB gets big, that's going to be important to permit fast lookup of things you're trying to create new relationships off of.
You haven't specified which query you're running that isn't "stopping loading". Update your question with specifics, and let us know what you've tried, and maybe it's possible to help.
If you have one of the nodes already created then a simple approach would be:
MATCH (n: user {uid: "1"}) CREATE (n) -[r: posted]-> (p: post {pid: "42", title: "Good Night", msg: "Have a nice and peaceful sleep.", author: n.uid});
Here the user node already exists and you have created a new relation and a new post node.
Another interesting approach might be to generate your statements directly in Excel, see http://blog.bruggen.com/2013/05/reloading-my-beergraph-using-in-graph.html?view=sidebar for an example. You can run a lot of CREATE statements in one transaction, so this should not be overly complicated.
If you're able to use the Neo4j 2.1 prerelease milestones, then you should try using the new LOAD CSV and PERIODIC COMMIT features. They are designed for just this kind of use case.
LOAD CSV allows you to describe the structure of your data with one or more Cypher patterns, while providing the values in CSV to avoid duplication.
PERIODIC COMMIT can help make large imports more reliable and also improve performance by reducing the amount of memory that is needed.
It is possible to use a single cypher query to create a new node as well as relate it to an existing now.
As an example, assume you're starting with:
an existing "One" node which has an "id" property "1"
And your goal is to:
create a second node, let's call that "Two", and it should have a property id:"2"
relate the two nodes together
You could achieve that goal using a single Cypher query like this:
MATCH (one:One {id:'1'})
CREATE (one) -[:RELATED_TO]-> (two:Two {id:'2'})