I'm trying to decide on a good data model for representing tasks and sub-tasks. It's a two part problem:
First, I want to be able to get a string of tasks (task1)-[:NEXT]->(task2)-[:NEXT]->(task3) etc. And I want to be able to gather them starting with the first one and display them in order. The cypher is simple enough ... something like
p = match(first:Task)-[:NEXT*]->(others:Task)
return o.name, o.instructions
order by length(p) // or something like this, probably with a union to get both the first task and other tasks in the same output
However, I'd also like to let a sub-task have children. For instance, I might have a set of tasks that constitute "How to make coffee", but then when I'm creating a set of tasks that constitute "How to make breakfast", I'd like to point to the "How to make coffee" set of tasks and re-use them.
It would be nice to get cypher to return a staggered list (e.g. 1, 1.1, 1.1.1, 2, etc.), but I'd actually be equally happy with just 1, 2, 3 ... n.
I've been looking and haven't seen a clear solution anywhere. Here's a picture of what I'm imagining. Any directions, thoughts, or references much appreciated.
I'm going to reuse my answer to another question, but it really is the best solution to this problem. The problem you have is that your scheme starts to fall apart once you have multiple distinct but intersecting paths. So you end up trying to corrupt your data to try and resolve the conflicts generated.
1) For each chain, create a node to represent that chain.
2) Create a relation from that node to each node in the chain, and add an index property on the relationship.
3) Run Cyphers on your "chain" nodes instead.
To expand on the above for your case...
In this instance, the chain node represents a list of tasks that have to be done in order; And can itself also be a task. Separating the list from the task is more work, but it would allow you to define multiple ways to complete a certain task. For example, I can make coffee by starting my coffee machine, asking Google to make it (which will start the machine), or go to Starbucks. This should be flexible enough to support anything you need to represent, without instances trampling on each-other. Part of the key here is relationships can have properties too! Don't be afraid to use that to make them distinct. (You could just add a 'group-id' to each task chain to make them distinct, but that will have scaling issues)
Related
So I've just worked through the tutorial and I'm unclear about a few things. The main one, however, is how do you decide when something is a relationship and when it should be a Node?
For example, in the Movies Database,there is a relationship showing who acted in which film. A property of that relationship is the Role. BUT, what if it's a series of films? The role may well be constant between films (say, Jack Ryan in The Hunt for Red October, Patriot Games etc.)
We may also want to have some kind of Character bio, which would obviously remain constant between movies. Worse, the actor may change from one movie to another (Alec Baldwin then Harrison Ford.) There are many others like this (James Bond, for example).
Even if the actor doesn't change (Main roles in Harry Potter) the character is constant. So, at what point would the Role become a node in its own right? When it does, can I have a 3-way relationship (Actor-Role-Movie)? Say I start of with it being a relationship and then, down the line, decide it should've been a node, is there a simple way to go through the database and convert it?
No there is no way to convert your datamodel. When you start your own Database first take time to find a fitting schema. There is no ideal way to create a schema and also there are many different models fitting to the same situation without being totally wrong.
My strategy is to put less information to the relationship itself. I only add properties that directly concern the relationship and store all the other data in the nodes. Also think of properties you could use for traversing the graph. For example you might need some flags or even different labels for relationships even they more or less are the same. The apoc.algo.aStar is only including relationshiptypes you want (you could exclude certain nodes by giving them a special relationshiptype). So keep that in mind that you take a look at procedures that you might use later.
Try to create the schema as simple as possible and find a way to stay consistent in terms of what things are nodes and what deserves a relationship. Dont mix it up.
Choose a design that makes sense for you! (device 1)-[cable]-(device 2) vs (device 1)-[has cable]-(cable)-[has cable]-(device 2) in this case I'd prefer the first because [has cable] wouldn't bring anymore information. Irrespective to what I wrote above I would have a lot of information in this [cable] relationship but it totally makes sense for me because I wouldnt want to search in a device node for cable information.
For your example giving the role a own node is also valid way. For example if you want to espacially query which actors had the same role in common I'll totally go for giving the role a extra node.
Summary:
Think of what you want to do with the data and choose the easiest model.
I'm trying to create an application to find the best paths to use when traveling by bus within my local city. I have found some useful answers on here so far, but I'm currently struggeling with my approach in general and I'd like to get some feedback.
Current data model
There are stations and stages modeled as nodes and two relationships between stations and stops. Each stage node has a start and an end time as a string in "HH:mm" format and belongs to some higher-level structure which I call routes, that are connected to these stage nodes to describe a trajectory along stations with time details. Each :FROM relationship has a property duration to model the travel time for reduce statements.
So the following query would return something like this. The stage nodes show the start property in the picture.
match (from:Station {name: "Glosberg"})
match (to:Station {name: "Knellendorf"})
match paths=((from)-[:FROM|:TO*..10]->(to))
return paths;
Problems so far:
ShortestPath/AllShortestPaths is not a valid option as smallest number of hops does not mean best path. What I want is a reduction of travel duration, which I can achive with a Reduce statement, which I have already done. Since I have to check out all paths I'm using the general pattern matcher with a limit (as seen above). The limit I use in my queries is actually the length of the shortest paths between from and to plus 10% or so to also include paths that might consist of more hops but take less time. This is not necessarily accurate but seems like a fair trade-off.
Using dijkstra gives me all paths from A to B. Since stage nodes have a form of time data on them, most of the paths do not make sense, because they are either combined in reversed order (2pm -> 1pm) or produce long waiting times (2pm -> 4pm), which are not necessary. Therefore I have to filter out bad paths, either in cypher or at some api level. However, with the current data model there simply are too many paths to check for validity. With some sample data, which would also run in production, I have a route that visits 24 stations 2 times a day, resulting in 2^23 paths to take. I'm pretty sure that my data model is the problem, but I can't see any ways to solve this; any ideas?
Questions:
More of a side problem: How would you solve ordering paths with stages that go past 0am? As "23:59" is bigger than "00:01" but not chronologically.
What would you change about the data model?
Would you suggest any trade-offs in how the path finding works to reduce the complexity (eg. simply using shortestpath)?
Would you suggest seperating the actual route data (timetable, who stops where and when) from the infrastructure data (stations and which stations are close to which)? That way I'd have to and use neo4j to find a path/set of stations to travel along and then try to find a suiting set of elements from a timetable, similiar to wanderu's approach.
I've read that the traversal api is a better way to describe how the graph should be accessed instead of using cypher, which only describes what to look for, but I'd like to receive feedback on my thoughs until now before I dive into that.
I am trying to build an database in Neo4j with a structure that contains seven different types of nodes, in total around 4-5000 nodes and between them around 40000 relationships. The cypher code i am currently using is that i first create the nodes with the code:
Create (node1:type {name:'example1', type:'example2'})
Around 4000 of that example with unique nodes.
Then I've got relationships stated as such:
Create
(node1)-[:r]-(node51),
(node2)-[:r]-(node5),
(node3)-[:r]-(node2);
Around 40000 of such unique relationships.
With smaller scale graphs this has not been any problem at all. But with this one, the Executing query never stops loading.
Any suggestions on how I can make this type of query work? Or what i should do instead?
edit. What I'm trying to build is a big graph over a product, with it's releases, release versions, features etc. in the same way as the Movie graph example is built.
The product has about 6 releases in total, each release has around 20 releaseversion. In total there is 371 features and of there 371 features there is also 438 featureversions. ever releaseversion (120 in total) then has around 2-300 featureversions each. These Featureversions are mapped to its Feature whom has dependencies towards a little bit of everything in the db. I have also involed HW dependencies such as the possible hw to run these Features on, releases on etc. so basicaly im using cypher code such as:
Create (Product1:Product {name:'ABC', type:'Product'})
Create (Release1:Release {name:'12A', type:'Release'})
Create (Release2:Release {name:'13A, type:'release'})
Create (ReleaseVersion1:ReleaseVersion {name:'12.0.1, type:'ReleaseVersion'})
Create (ReleaseVersion2:ReleaseVersion {name:'12.0.2, type:'ReleaseVersion'})
and below those i've structured them up using
Create (Product1)<-[:Is_Version_Of]-(Release1),
(Product1)<-[:Is_Version_Of]-(Release2),
(Release2)<-[:Is_Version_Of]-(ReleaseVersion21),
All the way down to features, and then I've also added dependencies between them such as:
(Feature1)-[:Requires]->(Feature239),
(Feature239)-[:Requires]->(Feature51);
Since i had to find all this information from many different excel-sheets etc, i made the code this way thinking i could just put it together in one mass cypher query and run it on the /browser on the localhost. it worked really good as long as i did not use more than 4-5000 queries at a time. Then it created the entire database in about 5-10 seconds at maximum, but now when I'm trying to run around 45000 queries at the same time it has been running for almost 24 hours, and are still loading and saying "executing query...". I wonder if there is anyway i can improve the time it takes, will the database eventually be created? or can i do some smarter indexes or other things to improve the performance? because by the way my cypher is written now i cannot divide it into pieces since everything in the database has some sort of connection to the product. Do i need to rewrite the code or is there any smooth way around?
You can create multiple nodes and relationships interlinked with a single create statement, like this:
create (a { name: "foo" })-[:HELLO]->(b {name : "bar"}),
(c {name: "Baz"})-[:GOODBYE]->(d {name:"Quux"});
So that's one approach, rather than creating each node individually with a single statement, then each relationship with a single statement.
You can also create multiple relationships from objects by matching first, then creating:
match (a {name: "foo"}), (d {name:"Quux"}) create (a)-[:BLAH]->(d);
Of course you could have multiple match clauses, and multiple create clauses there.
You might try to match a given type of node, and then create all necessary relationships from that type of node. You have enough relationships that this is going to take many queries. Make sure you've indexed the property you're using to match the nodes. As your DB gets big, that's going to be important to permit fast lookup of things you're trying to create new relationships off of.
You haven't specified which query you're running that isn't "stopping loading". Update your question with specifics, and let us know what you've tried, and maybe it's possible to help.
If you have one of the nodes already created then a simple approach would be:
MATCH (n: user {uid: "1"}) CREATE (n) -[r: posted]-> (p: post {pid: "42", title: "Good Night", msg: "Have a nice and peaceful sleep.", author: n.uid});
Here the user node already exists and you have created a new relation and a new post node.
Another interesting approach might be to generate your statements directly in Excel, see http://blog.bruggen.com/2013/05/reloading-my-beergraph-using-in-graph.html?view=sidebar for an example. You can run a lot of CREATE statements in one transaction, so this should not be overly complicated.
If you're able to use the Neo4j 2.1 prerelease milestones, then you should try using the new LOAD CSV and PERIODIC COMMIT features. They are designed for just this kind of use case.
LOAD CSV allows you to describe the structure of your data with one or more Cypher patterns, while providing the values in CSV to avoid duplication.
PERIODIC COMMIT can help make large imports more reliable and also improve performance by reducing the amount of memory that is needed.
It is possible to use a single cypher query to create a new node as well as relate it to an existing now.
As an example, assume you're starting with:
an existing "One" node which has an "id" property "1"
And your goal is to:
create a second node, let's call that "Two", and it should have a property id:"2"
relate the two nodes together
You could achieve that goal using a single Cypher query like this:
MATCH (one:One {id:'1'})
CREATE (one) -[:RELATED_TO]-> (two:Two {id:'2'})
START names = node(*),
target=node:node_auto_index(target_name="TARGET_1")
MATCH names
WHERE NOT names-[:contains]->()
AND HAS (names.age)
AND (names.qualification =~ ".*(?i)B.TECH.*$"
OR names.qualification =~ ".*(?i)B.E.*$")
CREATE UNIQUE (names)-[r:contains{type:"declared"}]->(target)
RETURN names.name,names,names.qualification
Iam consisting of nearly 1,80,000 names nodes, i had iterated the above process to create unique relationships above 100 times by changing the target. its taking too much amount of time.How can i resolve it..
i build the query with java and iterated.iam using neo4j 2.0.0.5 and java 1.7 .
I edited your cypher query because I think I understand it, but I can barely read the rest of your question. If you edit it with white spaces and punctuation it might be easier to understand what you are trying to do. Until then, here are some thoughts about your query being slow.
You bind all the nodes in the graph, that's typically pretty slow.
You bind all the nodes in the graph twice. First you bind universally in your start clause: names=node(*), and then you bind universally in your match clause: MATCH names, and only then you limit your pattern. I don't quite know what the Cypher engine makes of this (possibly it gets a migraine and goes off to make a pot of coffee). It's unnecessary, you can at least drop the names=node(*) from your start clause. Or drop the match clause, I suppose that could work too, since you don't really do anything there, and you will still need a start clause for as long as you use legacy indexing.
You are using Neo4j 2.x, but you use legacy indexing instead of labels, at least in this query. Without knowing your data and model it's hard to know what the difference would be for performance, but it would certainly make it much easier to write (and read) your queries. So, that's a different kind of slow. It's likely that if you had labels and label indices, the query performance would improve.
So, first try removing one of the universal bindings of nodes, then use the 2.x schema tools to structure your data. You should be able to write queries like
MATCH target:Target
WHERE target.target_name="TARGET_1"
WITH target
MATCH names:Name
WHERE NOT names-[:contains]->()
AND HAS (names.age)
AND (names.qualification =~ ".*(?i)B.TECH.*$"
OR names.qualification =~ ".*(?i)B.E.*$")
CREATE UNIQUE (names)-[r:contains{type:"declared"}]->(target)
RETURN names.name,names,names.qualification
I have no idea if such a query would be fast on your data, however. If you put the "Name" label on all your nodes, then MATCH names:Name will still bind all nodes in the database, so it'll probably still be slow.
P.S. The relationships you create have a TYPE called contains, and you give them a property called type with value declared. Maybe you have a good reason, but that's potentially very confusing.
Edit:
Reading through your question and my answer again I no longer think that I understand even your cypher query. (Why are you returning both the bound nodes and properties of those nodes?) Please consider posting sample data on console.neo4j.org and explain in more detail what your model looks like and what you are trying to do. Let me know if my answer meets your question at all or I'll consider removing it.
Using Neo4j 1.9.3 -
I want to create a music program listing. On a given program there may be three pieces being performed. Each piece has a composer associated with them, and may appear on many different programs, so I can't put sequence numbers on the piece nodes.
I assume I can create the program, with relationships to each piece like so:
(program1)-[:PROGRAM_PIECE {program_seq: 1}]->(piece1)
(program1)-[:PROGRAM_PIECE {program_seq: 2}]->(piece2)
(program1)-[:PROGRAM_PIECE {program_seq: 3}]->(piece3)
My question is, how do I query the graph so that the pieces are in order of the relationship property program_seq? I'm fine using ORDER BY with node properties, but have not been successful with relationships (story of my life...)
If you like it, lock it down: that is, bind it to a variable. Then you can use ORDER BY the same way you would with node properties. If you have retrieved your program as (program1) you can do something like
MATCH (program1)-[r:PROGRAM_PIECE]->(piece1)
RETURN program1, r, piece1
ORDER BY r.program_seq
I have done the same thing recently to keep track of chess moves in a particular game. It is the same thing as node properties.
start program = node(*) // or better yet, use a real index query to find the program
match (program)-[program_piece:PROGRAM_PIECE]->(piece)
return program, piece
order by program_piece.program_seq