Consider three types of nodes in a database, connected according to the schema
(a)-[ra {qty}]->(b)-[rb {qty}]->(c)
where a user can have any of these nodes in their wishlist.
What is the best way to query the database for all the nodes on a user's wishlist, given that when the user has an (a), the associated (b) and (c) should also be returned, with some of their fields (say b.price and c.price) multiplied by the respective ra.qty and rb.qty?
NOTE: you can find the same problem without the variable length over here
Assuming you have users connected to the things they want like so:
(user:User)-[:WANTS]->(part:Part)
And that parts, like you describe, have dependencies on other parts in specific quantities:
CREATE
(a:Part)-[:CONTAINS {qty:2}]->(b:Part),
(a)-[:CONTAINS {qty:3}]->(c:Part),
(b)-[:CONTAINS {qty:2}]->(c)
Then you can find all parts, and how many of each, you need like so:
MATCH
(user:User {name:"Steven"})-[:WANTS]->(part),
chain=(part)-[:CONTAINS*1..4]->(subcomponent:Part)
RETURN subcomponent, sum( reduce( total=1, r IN relationships(chain) | total * r.qty) )
The 1..4 term says to look between 1 and 4 levels of sub-components down the tree. You can obviously set that to whatever you like, including "1..", which means unbounded depth.
The second term there is a bit complex. It helps to try the query without the sum to see what it does. Without that, the reduce will do the multiplying of parts that you want for each "chain" of dependencies. Adding the sum will then aggregate the result by subcomponent (inferred from your RETURN clause) and sum up the total count for that subcomponent.
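To see what the reduce and the sum each contribute, here is a minimal Python sketch of the same arithmetic. It is not Cypher and not how Neo4j evaluates the query; the CONTAINS dictionary and the function names are made up for illustration and simply mirror the three relationships created above.

```python
from collections import defaultdict
from math import prod

# Hypothetical in-memory copy of the CONTAINS relationships: parent -> [(child, qty)]
CONTAINS = {
    "a": [("b", 2), ("c", 3)],
    "b": [("c", 2)],
}

def chains(part, max_depth, qtys=()):
    """Yield (subcomponent, quantities along the chain) for every chain of 1..max_depth hops."""
    if max_depth == 0:
        return
    for child, qty in CONTAINS.get(part, []):
        path_qtys = qtys + (qty,)
        yield child, path_qtys
        yield from chains(child, max_depth - 1, path_qtys)

def totals(part, max_depth=4):
    """Multiply quantities along each chain (the reduce), then add them up per subcomponent (the sum)."""
    result = defaultdict(int)
    for subcomponent, path_qtys in chains(part, max_depth):
        result[subcomponent] += prod(path_qtys)
    return dict(result)

print(totals("a"))
```

For a wanted part a, the chains are a->b (2), a->b->c (2 * 2 = 4) and a->c (3), so b ends up with 2 and c with 4 + 3 = 7.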
Figuring out the price is then a matter of multiplying the aggregate quantity of each part by its price. I'll leave that as an exercise for the reader ;)
You can try this out by running the queries in the online console at http://console.neo4j.org/
Related
I have a graph model -
(p:Person)-[r:LINK {startDate: timestamp, endDate: timestamp}]->(c:Company)
A person can be linked to multiple companies at the same time and a company can have multiple people linking to it at the same time (i.e. there is a many-to-many relationship between companies and people).
The endDate property is optional and will only be present when a person has left a company.
I am trying to display a network of connections and can successfully return all related nodes from a person using the following cypher query (this will display 2 levels of people connections) -
MATCH (p:Person {id:<id>})-[r:LINK*0..4]-(l) RETURN *
What I now need to do is filter the relationships where the relationships match on timeframe, e.g. Person 1 worked at Company A between 01/01/2000 and 31/12/2002. Person 2 worked at Company A between 01/01/2001 and 30/06/2001. Person 3 has worked at Company A since 01/01/2005 and is still there. The results for Person 1 should include Person 2 but not Person 3.
This same logic needs to be applied to all levels of the graph (we allow the user to display 3 levels of connections) and relates to the parent node in each level, i.e. when displaying level 2, the dates for Person 2 and Person 3 should be used to filter their respective relationships.
Essentially, we are trying to do something similar to the LinkedIn connections but to filter based on people working at companies at the same time.
I have tried using the REDUCE function but cannot get the logic to work for the optional end date - can someone please advise how to filter the relationships based on the start and end dates?
It turns out there are 4 ways in which date ranges can overlap, but only 2 in which they do not (person 1 ends before person 2 starts, or person 2 ends before person 1 starts), so it is much simpler to check that neither of these no-overlap conditions holds.
In the level 1 case, this query should do the trick:
MATCH (start:Person{id:1})-[r1:LINK]->(c)<-[r2:LINK]-(suggest)
WHERE NOT ((r1.endDate IS NOT NULL and r1.endDate < r2.startDate)
OR (r2.endDate IS NOT NULL and r2.endDate < r1.startDate))
RETURN suggest
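The same "neither no-overlap condition holds" test can be sketched outside Cypher. This is only an illustration of the logic, assuming dates are stored as comparable integers such as 20000101 and that a missing endDate is represented as None; the function name is made up.

```python
from typing import Optional

def overlaps(start1: int, end1: Optional[int], start2: int, end2: Optional[int]) -> bool:
    """Return True unless one range ends before the other starts.

    An end of None means the person is still at the company, so that
    range can never "end before" anything.
    """
    one_ends_first = end1 is not None and end1 < start2
    two_ends_first = end2 is not None and end2 < start1
    return not (one_ends_first or two_ends_first)

# The three people from the question, with dates written as yyyymmdd integers
person1 = (20000101, 20021231)
person2 = (20010101, 20010630)
person3 = (20050101, None)

print(overlaps(*person1, *person2))  # Person 2 should be included
print(overlaps(*person1, *person3))  # Person 3 should not
```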
The tricky part is applying this to multiple levels.
While we could create a single Cypher query to handle this dynamically, the evaluation of the relationships would only happen after expansion, not during, so it may not be the most efficient:
MATCH path = (start:Person{id:1})-[:LINK*..6]-(suggest:Person)
WITH path, start, suggest, apoc.coll.pairsMin(relationships(path)) as pairs
WITH path, start, suggest, [index in range(0, size(pairs)-1) WHERE index % 2 = 0 | pairs[index]] as pairs
WHERE none(pair in pairs WHERE (pair[0].endDate IS NOT NULL AND pair[0].endDate < pair[1].startDate)
OR (pair[1].endDate IS NOT NULL AND pair[1].endDate < pair[0].startDate))
RETURN suggest
Some of the highlights here...
We're using apoc.coll.pairsMin() from APOC Procedures to get pairs of adjacent relationships from the collection of relationships in each path, but we're only interested in the even-numbered entries (the two relationships from people working at the same company), because the odd-numbered pairs correspond to relationships from the same person going to two different companies.
So if we were executing on this pattern:
MATCH path = (start:Person)-[r1:LINK]->(c1)<-[r2:LINK]-(person2)-[r3:LINK]->(c2)<-[r4:LINK]-(person3)
The apoc.coll.pairsMin(relationships(path)) would return [[r1, r2], [r2,r3], [r3,r4]], and as you can see the relationships we need to consider are the ones linking 2 people to a company, so indexes 0 and 2 in the pairs list.
After we get our pairs, we need to ensure that all of those interesting relationship pairs in the path to a suggestion meet your criteria and overlap (or rather, do not NOT overlap).
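If the pairing step is hard to picture, this small Python sketch mimics it on plain strings. pairs_min here is a stand-in written for illustration that returns adjacent pairs the way the answer describes for apoc.coll.pairsMin(); it is not the APOC implementation.

```python
def pairs_min(items):
    """Adjacent pairs of a list: [x1, x2, x3] -> [[x1, x2], [x2, x3]]."""
    return [[a, b] for a, b in zip(items, items[1:])]

def same_company_pairs(rels):
    """Keep the pairs at even indexes, i.e. two people linked to the same company."""
    pairs = pairs_min(rels)
    return [pairs[i] for i in range(0, len(pairs), 2)]

rels = ["r1", "r2", "r3", "r4"]
print(pairs_min(rels))
print(same_company_pairs(rels))
```

The pair [r2, r3] at index 1 is dropped because both relationships belong to the same person.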
Something like this should work:
MATCH path=(p:Person {id: $id})-[r:LINK*..4]-(l)
WHERE ALL(x IN NODES(path)[1..] WHERE x.startDate <= p.endDate AND x.endDate >= p.startDate)
RETURN path;
Assumptions:
The id value of the main person of interest is provided by the $id parameter.
You want the variable-length relationship pattern to have a lower bound of 1 (which is the default). If you used 0 for a lower bound, then you will also get the main person of interest as a result -- which is probably not what you want.
startDate and endDate have values that are suitable for comparison using comparison operators
I have got a graph that represents several bus/train stops in different cities.
Let's assume I want to go from city A (with stops a1, a2, a3...) to city Z (with stops z1, z2...)
There are several routes (relations) between the nodes and I want to get all paths between the start and the end node. My cost vector would be complex (travel time and waiting time and price and and and...) in reality, therefore I cannot use shortestpaths etc. I managed to write a (quite complex) query that does what I want: In general it is looking for each match with start A and end Z that is available.
I try to avoid looping by filtering out results with special characteristics, e. g.
MATCH (from{name:'a1'}), (to{name:'z1'}),
path = (from)-[:CONNECTED_TO*0..8]->(to)
WHERE ALL(b IN NODES(path) WHERE SINGLE(c IN NODES(path) WHERE b = c))
Now I want to avoid the possibility of visiting one city more than once, e. g. instead of a1-->a2-->d2-->d4-->a3-->a4-->z1 I want to get a1-->a4-->z1.
Therefore I have to check all nodes in the path. If the value of n.city is the same for consecutive nodes, everything is fine. But if I get a path with nodes of the same city that are not consecutive, e. g. cityA-->cityB-->cityA, I want to throw away that path.
How can I do that? Is it possible?
I know that this is not really a beautiful approach, but I invested quite a lot of time in finding a better one without throwing away the whole data structure and could not find one. It's just a prototype and Neo4j is not my focus. I want to test some tools and products to build some knowledge. I will go ahead with a better approach next time.
Interesting question. The important thing to observe here is that a path that never revisits a city (after leaving it) must have fewer transitions between cities than the number of distinct cities. For example:
AABBC (a "good" path) has 3 distinct cities and 2 transitions
ABBAC (a "bad" path) also has 3 distinct cities but 3 transitions
With this observation in mind, the following query should work (even if the start and end nodes are the same):
MATCH path = ({name:'a1'})-[:CONNECTED_TO*0..8]->({name:'z1'})
WITH path, NODES(path) as ns
WITH path, ns,
REDUCE(s = {cnt: 0, last: ns[0].city}, x IN ns[1..] |
CASE WHEN x.city = s.last THEN s ELSE {cnt: s.cnt+1, last: x.city} END).cnt AS nTransitions
UNWIND ns AS node
WITH path, nTransitions, COUNT(DISTINCT node.city) AS nCities
WHERE nTransitions < nCities
RETURN path;
The REDUCE function is used to calculate the number of transitions in a path.
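As a quick sanity check of the observation, here is a Python sketch of the same test applied to a list of city names (one per node in the path). The function name is made up, and it only illustrates the counting argument, not how Cypher evaluates the query.

```python
def revisits_city(cities):
    """True if the path returns to a city after having left it.

    A path that never revisits a city has exactly (distinct cities - 1)
    transitions, so transitions >= distinct cities means a revisit.
    """
    transitions = sum(1 for prev, cur in zip(cities, cities[1:]) if prev != cur)
    return transitions >= len(set(cities))

print(revisits_city(["A", "A", "B", "B", "C"]))  # "good" path: 2 transitions, 3 cities
print(revisits_city(["A", "B", "B", "A", "C"]))  # "bad" path: 3 transitions, 3 cities
```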
Currently I have a unique index on node with label "d:ReferenceEntity". It's taking approximately 11 seconds for this query to run, returning 7 rows. Granted T1 has about 400,000 relationships.
I'm not sure why this would take so long, considering we could build a map of all nodes connected to T1, giving constant-time lookups.
Am I missing some other index features that Neo4j provides? Also, my entire dataset is in memory, so it shouldn't have anything to do with going to disk.
match(n:ReferenceEntity {entityId : "T1" })-[r:HAS_REL]-(d:ReferenceEntity) WHERE d.entityId in ["T2", "T3", "T4"] return n
:schema
Indexes
ON :ReferenceEntity(entityId) ONLINE (for uniqueness constraint)
Constraints
ON (referenceentity:ReferenceEntity) ASSERT referenceentity.entityId IS UNIQUE
Explain Plan:
You used EXPLAIN instead of PROFILE to get that query plan, so it shows misleading estimated row counts. If you had used PROFILE, the Expand(All) operation would have shown about 400,000 rows, since that operation actually iterates through every relationship. That is why your query takes so long.
You can try this query, which tells Cypher to use the index on d as well as on n. (On my machine, I had to use the USING INDEX clause twice to get the desired results.) It definitely pays to use PROFILE to tune Cypher code.
MATCH (n:ReferenceEntity { entityId : "T1" })
USING INDEX n:ReferenceEntity(entityId)
MATCH (n)-[r:HAS_REL]-(d:ReferenceEntity)
USING INDEX d:ReferenceEntity(entityId)
WHERE d.entityId IN ["T2", "T3", "T4"]
RETURN n, d;
Here is the Profile Plan (In my DB, I had 2 relationships that satisfy the WHERE test):
I have a database in Neo4j of modules that I imported through CSV. The data looks something like this. Each module has its name, the module that is its successor, an average time duration and another duration called medtime.
I have been able to import the data and to set the relationships through a Cypher Query script that looks like this:
LOAD CSV WITH HEADERS FROM "file:c:/users/Skelo/Desktop/Neo4J related/Statistic Dependencies/Simple.csv" AS row FIELDTERMINATOR ';'
CREATE (n:Module)
SET n = row, n.name = row.name, n.mafter = row.mafter, n.avgtime = row.avgtime, n.medtime = row.medtime
WITH n
RETURN n
Then I have set the relationships like this:
Match (p:Module),(q:Module)
Where p.mafter = q.name
Merge (p)-[:PRECEEDS]->(q)
Return p,q
Now to the point. I want to calculate the shortest path from a certain module to another, more specifically the time that it takes to get from a module to another and for this, I use the more or less copied part of the script from
http://www.neo4j.org/graphgist?8412907 and that is
MATCH p = (trop:Module {name:'BLSACXAMT0A_00'})-[prec:PRECEEDS*]->(hop:Module {name:'BL_LOAD_CLOSE'})
WITH p, REDUCE(x = 0, a IN NODES(p) | x + a.avgtime) AS cum_duration
ORDER BY cum_duration DESC
LIMIT 1
RETURN cum_duration AS `Total Average Time`
This, however, takes about 50 seconds to execute and that is outrageous. You can see it on the screenshot right below. The number of modules imported into the database is only about 2000, and what I want to achieve is to work with more than 50 000 nodes and perform such tasks much faster.
Another issue is that the results are somewhat suspicious. The format looks wrong: every number I have in the database has at most 4 digits after the decimal point and I am only adding these values to zero, so a result like 00103,68330,51670 gives me serious doubts. Please help me: if it is wrong, why is that, and what can I do to correct it?
Neo4j claims to be efficient and fast, so I presume that the fault is in my code (the performance of my computer is more than enough). If you can, please help me shorten this time and explain the patterns needed to do so.
A few observations that should help:
You have several errors in how you are importing. These errors will create many more nodes than you think, and create the "suspicious" issue you raised:
Your file has multiple rows with the same name, but your import is creating a new Module node every time. Therefore, you are ending up with multiple nodes for some of your modules. You should be using MERGE instead of CREATE.
Your mafter property needs to contain a collection of strings, not a single string.
You are importing the numeric values as strings, so code such as x + a.avgtime is just doing string concatenation, not numeric addition. Furthermore, even if you did attempt to convert your strings to numbers, that would fail because your numbers use a comma instead of a period to indicate the decimal place.
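The string-versus-number problem is easy to reproduce outside Neo4j. The Python sketch below only imitates the effect: the sample values are invented, and the helper name to_float is hypothetical.

```python
def to_float(text: str) -> float:
    """Convert a number written with a decimal comma (e.g. '103,6833') to a float."""
    return float(text.replace(",", "."))

# Invented sample values, as they would arrive from the CSV (always strings)
raw_values = ["103,6833", "0,5167"]

# "Adding" strings starting from zero just glues them together
concatenated = "0" + "".join(raw_values)
print(concatenated)

# Converting first gives a real numeric sum
total = sum(to_float(v) for v in raw_values)
print(round(total, 4))
```

The first result has the same shape as the suspicious output in the question: several numbers with decimal commas run together.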
Try this for importing (into an empty DB):
LOAD CSV WITH HEADERS FROM "file:c:/users/Skelo/Desktop/Neo4J related/Statistic Dependencies/Simple.csv" AS row FIELDTERMINATOR ';'
MERGE (n:Module {name: row.name})
ON CREATE SET
n.mafter = [row.mafter],
n.avgtime = TOFLOAT(REPLACE(row.avgtime, ',', '.')),
n.medtime = TOFLOAT(REPLACE(row.medtime, ',', '.'))
ON MATCH SET
n.mafter = n.mafter + row.mafter;
You also need to change your current merge query so that you can handle an mafter that is a collection. Note that the following query is designed to NOT create any new nodes (even if a name in mafter does not yet have a module node).
MATCH (p:Module)
OPTIONAL MATCH (p)-[:PRECEEDS]->(z:Module)
WITH p, COLLECT(z.name) AS existing
WITH p, [x IN p.mafter WHERE NOT x IN existing] AS todo
MATCH (q:Module)
WHERE q.name IN todo
MERGE (p)-[:PRECEEDS]->(q)
RETURN p, q;
You should create an index to speed up the matching of modules by name:
CREATE INDEX ON :Module(name)
Cypher does have a shortestPath function, see http://neo4j.com/docs/stable/query-match.html#_shortest_path. However, this calculates the shortest path based on the number of hops and does not take a weight into account.
Neo4j has a couple of graph algorithms on board, e.g. Dijkstra or A*. Unfortunately these are not yet available via Cypher. Instead you have two alternatives to use them:
1) Write an unmanaged extension to Neo4j and use GraphAlgoFactory in the implementation. This requires writing some Java code and deploying it to the Neo4j server. Using a custom CostEvaluator you can use the avgtime property on your nodes as the cost parameter.
2) Use the REST API as documented on http://neo4j.com/docs/stable/rest-api-graph-algos.html#rest-api-execute-a-dijkstra-algorithm-and-get-a-single-path. This approach requires the weight to be a property on the relationship and not on a node (as in your data model).
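To illustrate why hop count and weight can disagree, here is a small, generic Dijkstra sketch in Python in which the cost sits on the nodes, as in the question's data model. It is not Neo4j's GraphAlgoFactory or the REST endpoint; the graph, the costs and the function name are invented, and costs are assumed to be non-negative.

```python
import heapq

def cheapest_path(graph, node_cost, source, target):
    """Dijkstra where visiting a node costs node_cost[node] (costs must be >= 0)."""
    best = {source: node_cost[source]}
    queue = [(node_cost[source], source, [source])]
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == target:
            return cost, path
        if cost > best.get(node, float("inf")):
            continue  # stale queue entry, a cheaper route was already found
        for nxt in graph.get(node, []):
            new_cost = cost + node_cost[nxt]
            if new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                heapq.heappush(queue, (new_cost, nxt, path + [nxt]))
    return None  # target not reachable

# Invented example: the 2-hop route via B is more expensive than the 3-hop route via C and E
graph = {"A": ["B", "C"], "B": ["D"], "C": ["E"], "E": ["D"]}
node_cost = {"A": 1, "B": 10, "C": 1, "D": 1, "E": 1}

print(cheapest_path(graph, node_cost, "A", "D"))
```

A hop-based shortest path would pick A, B, D, while the weighted search prefers the longer but cheaper route through C and E.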
I'm trying to find a query that will show me relationships of any length that exist between two nodes that share the same index. Basically, I want to know whether there is any overlap for a specific label. My graph is pretty simple and not particularly large:
(m:`Campaign`), (n:`Politician`), (o:`Assistant`), (p:`Staff`), (q:`Aid`), (s:`Contributor`)
(m)<-[:Campaigns_for]-(n)
(o)<-[:works_for]-(m)
(p)<-[:works_for]-(o)
(q)<-[:volunteers_for]-(p)
(m)<-[:contributes_to]-(s)
I want to find all the shared nodes and their relationships between Campaigns.
So far I have:
MATCH (n:`Campaign`)-[r*]-(m:`Campaign`)
RETURN n,count(r) as R,m
ORDER BY R DESC
but it's not returning everything I want. In addition to the counts, I want the labels of each relationship and the names of the nodes in between.
Assuming that "names of the nodes" means "return the name property of the node" (you could always substitute in "labels(n)" if you're after labels), then something like this might work, but you have some aggregation going on here so you may need to parse a bit:
MATCH p =(a:Campaign)-[r*]-(b:Campaign)
RETURN a, length(p) AS count, b, [x IN relationships(p) | type(x)], [x IN nodes(p) | x.name]
ORDER BY count DESC
I'm also assuming that when you say "not returning everything you want", you mean that in addition to what's currently returned in your result set, you want just those other items you listed.
Keep in mind it might also be possible to have a cycle in your graph (not knowing too much about your particular graph), so, you may want to check your beginning and ending nodes.