Given: A directed acyclic graph with weighted edges, where a node can have multiple parents.
Problem: For each child of the root node, find a minimum-cost (sum of weights) path from that child to some reachable leaf. A node can be present in only one such min-cost path.
Example graph:
In the above graph, for node 2, all the available paths are:
2 -> 5
2 -> 1 -> 9 -> 6
2 -> 1 -> 10 -> 6
Among which 2 -> 1 -> 10 -> 6 has minimum cost of 3.5
Similarly, for node 4, all the available paths are:
4 -> 11 -> 8
4 -> 7 -> 10 -> 6
Among which 4 -> 7 -> 10 -> 6 has minimum cost of 3.0
The result currently is:
For node 2, the path is 2 -> 1 -> 10 -> 6 : Cost 3.5
For node 4, the path is 4 -> 7 -> 10 -> 6 : Cost 3.0
For node 3, there is no such path.
I have written code that does this. If the min-cost paths had no node in common, the algorithm would stop and give me the min-cost paths for all children of the root node.
However, if two paths share a node, I have to retain it in only one of them. The reason is that such multi-parent nodes are normally due to noisy data; a node is supposed to belong to only one parent. I try to keep such a node in the path with the minimum cost. Here, node 10 belongs to the min-path of node 4, which costs 3.0, rather than the min-path of node 2, which costs 3.5. The same logic applies to node 6. So I just compare the costs to dis-associate some of the multi-parent nodes. Dis-association doesn't mean the edge is removed. All I do is save the best parent for each node within the node's data structure. For example, node 10 will have an entry saying "best parent is node 7" and node 6 will have an entry saying "best parent is node 10". I could remove the edge itself, but I may want the whole graph structure to remain intact for future computations.
So, the logic looks like this:
Do:
    For each child node of the root:
        Find the min-cost path. Store that path and the cost.
    If conflicting paths exist:
        Compare the costs of conflicting paths and save "best parent" for each node.
While there were conflicting paths;
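The conflict-resolution step in the loop above can be sketched in Python as follows. This is only an illustration under assumptions: the function name `assign_best_parents` and the `paths` dictionary layout are hypothetical, and the sketch covers a single comparison pass, not the surrounding re-iteration or its convergence.

```python
def assign_best_parents(paths):
    """Pick one parent per node from a set of min-cost paths.

    paths maps each child of the root to (path as a list of nodes, path cost).
    Returns (best_parent, conflicts): best_parent maps a node to its parent on
    the cheapest path that contains it; conflicts is the set of shared nodes.
    """
    owner = {}        # node -> (cost of cheapest path seen so far, parent on it)
    conflicts = set()
    for path, cost in paths.values():
        # Walk consecutive (parent, node) pairs along the path.
        for parent, node in zip(path, path[1:]):
            if node in owner:
                conflicts.add(node)
                if cost >= owner[node][0]:
                    continue  # the earlier, cheaper path keeps the node
            owner[node] = (cost, parent)
    best_parent = {node: parent for node, (_, parent) in owner.items()}
    return best_parent, conflicts


# The two min-cost paths from the example in the question.
paths = {2: ([2, 1, 10, 6], 3.5), 4: ([4, 7, 10, 6], 3.0)}
best_parent, conflicts = assign_best_parents(paths)
print(best_parent[10], best_parent[6], sorted(conflicts))
```

With the two example paths, node 10 ends up with best parent 7 and node 6 with best parent 10, as described in the question.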
Questions:
Does this logic make sense? I am worried that this iterative way of eliminating conflicts may not converge for some graphs. For example, suppose that while re-calculating the min-path for node 2, 2 -> 5 is now found to be the min-path, and node 5 was used in some other node's min-path during the first iteration. Then I would have to re-assign the "best parent" of node 5 to node 2 and re-iterate. In a nutshell, every time I fix one node's min-path, I may change another's. Can such an algorithm converge to a solution? If yes, what will its complexity be?
Is there a way to eliminate such conflicts before computing the min-cost paths in the first place?
This is dynamic programming.
First, reverse all the edges to make a new graph; call it newG.
In newG, every node without a parent (a leaf of the original graph) has the value 0.
For every node that has parents in newG, compute its parents' values first, then choose the parent that minimizes the parent's value plus the weight of the connecting edge; that parent must be part of the result.
A path found in newG is also the answer in the original graph (the edges in the answer may be in reverse order).
Time: O(n), taking n as the size of the graph (nodes plus edges), since each node and each edge is processed once.
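A minimal Python sketch of this dynamic programming idea, written as memoized recursion on the original graph instead of building an explicit reversed graph. The edge weights in `GRAPH` are hypothetical (the question does not list them); they are chosen only so that the path 2 -> 1 -> 10 -> 6 costs 3.5 as in the example. Note that this does not handle the "one node per path" constraint from the question.

```python
from functools import lru_cache

# Hypothetical weighted DAG: node -> list of (child, edge weight).
GRAPH = {
    2: [(5, 4.0), (1, 1.0)],
    1: [(9, 2.0), (10, 1.5)],
    9: [(6, 1.5)],
    10: [(6, 1.0)],
    5: [],
    6: [],
}


@lru_cache(maxsize=None)
def min_cost_to_leaf(node):
    """Return (cost, path) of the cheapest path from node to any leaf."""
    children = GRAPH[node]
    if not children:
        return 0.0, (node,)          # a leaf has value 0
    best = None
    for child, weight in children:
        cost, path = min_cost_to_leaf(child)   # memoized: each node solved once
        candidate = (weight + cost, (node,) + path)
        if best is None or candidate[0] < best[0]:
            best = candidate
    return best


print(min_cost_to_leaf(2))
```

Because every node's result is cached, each node and edge is visited once, which matches the linear running time claimed above.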
I have an application where nodes and relations are shown. After a result is shown, nodes and relations can be added through the GUI. When the user is done, I would like to get all the data from the database again (because I don't have all the data in the front-end at this point) based on the Neo4j IDs of all nodes and links. The difficult part for me is that there are "floating" nodes that don't have a relation in the GUI result (they will have relations in the database, but I don't want these). Worth mentioning is that my relations carry the start and end node IDs. I was thinking of starting from there, but then I would miss these floating nodes.
Let's take a look at this poorly drawn example image:
As you can see:
node 1 is linked (no direction) to node 2.
node 2 is linked to node 3 (from 2 to 3)
node 3 is linked to node 4 (from 3 to 4)
node 3 is also linked to node 5 (no direction)
node 6 is a floating node, without relations
Let's assume that:
id(relation between 1 and 2) = 11
id(relation between 2 and 3) = 12
id(relation between 3 and 4) = 13
id(relation between 3 and 5) = 14
Keeping in mind that behind the real data, there are way more relations between all these nodes, how can I recreate this very image again via Neo4j? I have tried doing something like:
match path=(n)-[rels*]-(m)
where id(n) in [1, 2, 3, 4, 5]
and all(rel in rels where id(rel) in [11, 12, 13, 14])
and id(m) in [1, 2, 3, 4, 5]
return path
However, this doesn't work properly for multiple reasons. Also, just matching on all the nodes doesn't get me the relations. Do I need to union multiple queries? Can this be done in one query? Do I need to write my own plugin?
I'm using Neo4j 3.3.5.
You don't need to keep a list of node IDs. Every relationship points to its 2 end nodes. Since you always want both end nodes, you get them for free using just the relationship ID list.
This query will return every single-relationship path from a relationship ID list. If you are using the neo4j Browser, its visualization should knit together these short paths and display your original full paths.
MATCH p=()-[r]-()
WHERE ID(r) IN [11, 12, 13, 14]
RETURN p
By the way, all neo4j relationships have a direction. You may choose not to specify the direction when you create one (using MERGE) and/or query for one, but it still has a direction. And the neo4j Browser visualization will always show the direction.
[UPDATED]
If you also want to include "floating" nodes that are not attached to a relationship in your relationship list, then you could just use a separate floating node ID list. For example:
MATCH p=()-[r]-()
WHERE ID(r) IN [11, 12, 13, 14]
RETURN p
UNION
MATCH p=(n)
WHERE ID(n) IN [6]
RETURN p
I need to compute the distance that separates each of two nodes A and B from their lowest common ancestor (LCA) in a graph. I use the following query to find the LCA:
match p1 = (A:Category {idCat: "Main_topic"}) -[*0..]-> (common:Category) <-[*0..]- (B:Category {idCat: "Heat_transfer"})
return common, p1
Is there any function in Neo4j that returns the respective distances d(A, common) and d(B, common)?
Thank you for your help.
If I understand the lowest common ancestor correctly, this comes down to finding the shortest path between A and B with at least one node in between. You can do that using the query below. The condition that the length of p is larger than 1 forces at least one node between the two. The example uses the IMDB toy database and returns the movie Avatar.
match p=shortestPath((n:Person {name:'Zoe Saldana'})-[r*1..15]-(n1:Person {name:'James Cameron'})) where length(p) > 1 return nodes(p)[1]
Basically, you can choose any element from the nodes in the path except the first and last one (since those will be A and B).
I've built a graph with 40 million nodes and 40 million relations with Neo4j.
Mostly I search for different shortest paths, and queries need to be very fast. Right now it usually takes a few milliseconds per query.
For speed I encode all parameters in the relation property val, so an ordinary query looks like this:
MATCH (one:Obj{oid:'1'})
with one
MATCH (two:Obj{oid:'2'}), path=shortestPath((one) -[*0..50]-(two))
WHERE ALL (x IN RELATIONSHIPS(path) WHERE ((x.val > 50 and x.val<109) ))
return path
But one filter cannot be done this way, as it should evaluate (on each step) property of starting node, property of relation, property of ending node, for example:
Path: n1(==1)-r1(==2)-n2(==1)-r2(==5)-n3(==3)
On step1: properties of n1 and n2 equal 1 and relation's property equals 2, that's OK, going further
On step2: property of n2 equals 1, but property of n3 equals 3, so we stop. If it was 1, we would stop anyway, because relation r2 is not 2, but 5.
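To make the intended per-step check concrete, here is a small Python sketch of the rule described above. It is not a Neo4j solution; the function name `valid_prefix_length` and the predicate arguments are hypothetical, and the path is represented simply as two lists of property values.

```python
def valid_prefix_length(node_vals, rel_vals, node_ok, rel_ok):
    """Count how many steps of a path pass the combined check.

    Step i is accepted only if its start node, its relation and its end node
    all satisfy their predicates; traversal stops at the first failing step.
    """
    steps = 0
    for i, rel_val in enumerate(rel_vals):
        start_ok = node_ok(node_vals[i])
        end_ok = node_ok(node_vals[i + 1])
        if not (start_ok and rel_ok(rel_val) and end_ok):
            break
        steps += 1
    return steps


# Path from the example: n1(==1)-r1(==2)-n2(==1)-r2(==5)-n3(==3)
node_vals = [1, 1, 3]
rel_vals = [2, 5]
accepted = valid_prefix_length(node_vals, rel_vals,
                               node_ok=lambda v: v == 1,
                               rel_ok=lambda v: v == 2)
print(accepted)
```

For the example path only the first step is accepted; the second fails on both the relation value and the end node value.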
I've used RELATIONSHIPS and NODES predicates, but they seem to work separately.
Also, I guess this can be done with traversal API, but I'll have to rewrite a lot of my other code, so it is not desirable.
Am I missing some fast solution?
It looks like your basic query is running quite fast. If you want to filter at additional steps, you probably have to add additional optional match and with statements to accommodate the filters. Undesired elements should drop out.
I have a DAG which for the most part is a tree... but there are a few cycles in it. I mention it in case it matters.
I have to translate the graph into pairs of relations. If:
A -> B
     C
     D -> 1
          2 -> X
               Y
Then I would produce ArB, ArC, ArD, Dr1, Dr2, 2rX, 2rY, where r is some relationship information (in other words, the query cannot totally ignore it).
Also, in my graph, node A has many cousins, so I need to 'anchor' my query to A.
My current attempt generates all possible pairs, so I get many unhelpful pairs such as ArY since A can eventually traverse to Y.
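To pin down the desired output independently of Cypher, here is a short Python sketch that walks the example tree from A and emits only direct parent-child pairs, so a transitive pair such as ArY is never produced. The `EDGES` dictionary and the function name `direct_pairs` are hypothetical, and the relationship information is reduced to the literal letter r.

```python
from collections import deque

# The example graph from the question as an adjacency list.
EDGES = {"A": ["B", "C", "D"], "D": ["1", "2"], "2": ["X", "Y"]}


def direct_pairs(edges, root):
    """Breadth-first walk from root, collecting one pair per direct edge."""
    pairs = []
    seen = {root}
    queue = deque([root])
    while queue:
        parent = queue.popleft()
        for child in edges.get(parent, []):
            pairs.append((parent, child))
            if child not in seen:      # expand each node only once
                seen.add(child)
                queue.append(child)
    return pairs


labels = [f"{parent}r{child}" for parent, child in direct_pairs(EDGES, "A")]
print(labels)
```

Anchoring the walk at A also keeps the "cousins" of A out of the result, since only nodes reachable from A are visited.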
What is a query that starts (or ends) with A, that returns a list of pairs? I don't want to query Neo individually for each node - I want to get the list in one shot if possible.
The query would be great, doc pages that explain would be great. Any help is appreciated.
EDIT Here's what I have so far, using Frobber's post as inspiration:
1. MATCH p=(n {id:"some_id"})-[*]->(m)
2. WITH DISTINCT(NODES(p)) as zoot
3. MATCH (x)-[r]->(y)
4. WHERE x IN zoot AND y IN zoot
5. RETURN DISTINCT x, TYPE(r) as r, y
In line 1, I match paths that include all the nodes under the one I care about.
In line 2, I convert each path to a collection of nodes.
In line 3, I start a new match that is intended to return my pairs.
In line 4, I accept only x and y nodes that were scooped up by the first match. I am not sure why I have to include y in the condition, but it seems to matter.
In line 5, I return the results. I do not know why I need a DISTINCT here; I thought the one on line 2 would do the trick.
So far, this is working for me. I have no insight into its performance in a large graph.
Here's an approach to try - this query is modeled off of the sample matrix data you can find online so you can play with it before adapting it to your schema.
MATCH p=(n:Crew)-[r:KNOWS*]-(m)
WHERE n.name='Neo'
WITH p, size(nodes(p)) AS nCount, size(relationships(p)) AS rCount
RETURN nodes(p)[nCount-2], relationships(p)[rCount-1], nodes(p)[nCount-1]
ORDER BY length(p) ASC;
A couple of notes about what's going on here:
Consider the "Neo" node (n.name="Neo") to be your "A" here. You're rooting this path traversal in some particular node you pick out.
We're matching paths, not nodes or edges.
We're going through all paths rooted at the A node, ordering by path length. This gets the near nodes before the distant nodes.
For each path we find, we're looking at the nodes and relationships in the path, and then returning the last pair: the second-to-last node (nodes(p)[nCount-2]), the last relationship in the path (relationships(p)[rCount-1]), and the last node (nodes(p)[nCount-1]).
This query basically returns the node, the relationship, and the connected node showing that you can get those items; from there you just customize the query to pull out whatever about those nodes/rels you might need pursuant to your schema.
The basic formula starts with matching p=(someNode {startingPoint: "A"})-[r*]->(otherStuff); from there it's just processing paths as you go.
Let's say you have a mnesia table replicated on nodes A and B. If on node C, which does not contain a copy of the table, I do mnesia:change_config(extra_db_nodes, [NodeA, NodeB]), and then on node C I do mnesia:dirty_read(user, bob) how does node C choose which node's copy of the table to execute a query on?
According to my own research, the answer to the question is: it will choose the most recently connected node. I will be grateful if you point out any errors you find; mnesia is a really complex system!
As Dan Gudmundsson pointed out on the mailing list, the algorithm for selecting the remote node to query is defined in mnesia_lib:set_remote_where_to_read/2. It is the following:
set_remote_where_to_read(Tab, Ignore) ->
    Active = val({Tab, active_replicas}),
    Valid =
        case mnesia_recover:get_master_nodes(Tab) of
            [] -> Active;
            Masters -> mnesia_lib:intersect(Masters, Active)
        end,
    Available = mnesia_lib:intersect(val({current, db_nodes}), Valid -- Ignore),
    DiscOnlyC = val({Tab, disc_only_copies}),
    Prefered = Available -- DiscOnlyC,
    if
        Prefered /= [] ->
            set({Tab, where_to_read}, hd(Prefered));
        Available /= [] ->
            set({Tab, where_to_read}, hd(Available));
        true ->
            set({Tab, where_to_read}, nowhere)
    end.
So it gets the list of active_replicas (i.e. the list of candidates), optionally shrinks the list to the master nodes for the table, removes the nodes to be ignored (for any reason), shrinks the list to the currently connected nodes, and then selects in the following order:
First non-disc_only_copies
Any available node
The most important part is in fact the list of active_replicas, since it determines the order of nodes in the list of candidates.
The list of active_replicas is formed by remote calls of mnesia_controller:add_active_replica/* from newly connected nodes to old nodes (i.e. those which were in the cluster before), which boils down to the function add/1, which adds the item at the head of the list.
Hence the answer to the question is: it will choose the most recently connected node.
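The selection order can be illustrated with a small Python model of the Erlang function quoted above. This is a sketch, not mnesia code: the function name `choose_where_to_read` and its arguments are hypothetical, and it assumes that the intersection steps keep the candidates in active_replicas order, which is what the conclusion above relies on.

```python
def choose_where_to_read(active, masters, connected, ignore, disc_only):
    """Model of the node selection: returns a node name or "nowhere".

    active is the active_replicas list, most recently added node first.
    """
    # Optionally restrict the candidates to the table's master nodes.
    valid = [n for n in active if n in masters] if masters else list(active)
    # Drop ignored nodes and nodes that are not currently connected.
    available = [n for n in valid if n not in ignore and n in connected]
    # Prefer replicas that are not disc_only_copies.
    preferred = [n for n in available if n not in disc_only]
    if preferred:
        return preferred[0]
    if available:
        return available[0]
    return "nowhere"


# Node b joined after node a, so it sits at the head of the active list.
active = ["b", "a"]
connected = ["a", "b", "c"]
print(choose_where_to_read(active, [], connected, [], []))          # head of list
print(choose_where_to_read(active, [], connected, [], ["b"]))       # skips disc_only b
print(choose_where_to_read(active, [], [], [], []))                 # no candidates
```

In this model the head of the candidate list wins unless it is filtered out, so the node added most recently to active_replicas is chosen first.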
Notes:
To check out the list of active replicas on the given node you can use this (dirty hack) code:
[ {T,X} || {{T,active_replicas}, X} <- ets:tab2list(mnesia_gvar) ].
Well, node C would need to contact either node A or node B in order to do a query. Thus node C will have to decide itself which table copy to execute the query on.
If you need something more than this you would either need to have some algorithm which will decide which node to query on, or even replicate the table on node C (this would typically depend on what kind of characteristics you want / need).
If node A and node B form or are part of a database cluster, a good start is probably the round robin algorithm (or random, as you suggest).