Looking for an algorithm to determine the difference of reachable vertices

For a project I need to know which vertices are reachable from one vertex but not another in a directed acyclic graph.
For the graph, the following assumptions can be made:
Each vertex has only one or two outgoing edges (more are possible but unlikely).
The number of incoming edges is not limited.
The graph grows fast in depth, but slow in breadth.
Basically you can imagine the graph as the commit graph of a git repository. The vertices are commits and the edges are the relations to the parent commit(s).
At the moment I just calculate the neighborhood (with infinite depth) for each of the two vertices and take the difference between the two result sets. However, this is an expensive method to begin with, and as the graph grows larger it only gets slower.
Example graph (Edges are directed from top to bottom):
* A
|\
| * B
* | C
| |
| |
| * D
* | E
|\ \
| |/
| * F
* | G
|\ \
| |/
| * H
|/
* I
Reachable from A: A, B, C, D, E, F, G, H, I
Reachable from C: C, E, F, G, H, I
So the difference between A and C is: A, B, D
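For reference, here is a minimal sketch of the approach currently described above (the adjacency-list representation and names are assumptions, not from the question): a breadth-first search from each of the two vertices, followed by a set difference.

from collections import deque

def reachable(graph, start):
    """Return the set of vertices reachable from start, including start.

    graph maps each vertex to a list of its successors
    (for a git-like DAG: commit -> parent commits).
    """
    seen = {start}
    queue = deque([start])
    while queue:
        v = queue.popleft()
        for w in graph.get(v, []):
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return seen

# The example graph from above, edges pointing from child to parent commit.
graph = {
    "A": ["C", "B"], "B": ["D"], "C": ["E"], "D": ["F"],
    "E": ["G", "F"], "F": ["H"], "G": ["I", "H"], "H": ["I"], "I": [],
}

print(sorted(reachable(graph, "A") - reachable(graph, "C")))  # ['A', 'B', 'D']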

Related

How to repeat row range N times in Google Sheets

Basically, I have N rows with one unique value always repeating three times. This is col_1. Then I have a range of values that I want repeated as many times as there are unique values in col_1. This needs to be dynamic, since col_1 is automatically generated from a list.
col_1 | values
------- ------
a | d
a | e
a | f
b |
b |
b |
c |
c |
c |
So this is what I want to end up with:
col_1 | col_2
----------------
a | d
a | e
a | f
b | d
b | e
b | f
c | d
c | e
c | f
Edit: as noted in a comment, my data is completely dynamic, so I can't make any assumptions about how many rows there will be. Here I have a list of [a,b,c], repeated as many times as there are items in Values, so [a,b,c] & [d,e,f] results in 9 rows. If I add "g" to [d,e,f], I then have 12 rows, and if I then add "h" to [a,b,c] I would have 16 rows. The dynamic part is the important bit here.
I want to answer my own question, because I spent way too long looking for an answer and couldn't find one, so I just came up with one myself. So here's the answer:
=ArrayFormula(TRANSPOSE(SPLIT(REPT(CONCATENATE(C2:C4&"~"),COUNTA(UNIQUE(A2:A500))),"~")))
You can just copy it and change the ranges for it to work, but let me explain how it works.
First we combine the values we want to repeat into one string with CONCATENATE. The three values are defined in the range of C2:C4.
CONCATENATE(C2:C4&"~") → "d~e~f~"
~ is used here as a delimiter; there are no special tricks to it. Next we repeat the string we just made as many times as there are unique values in col_1. For this we use a combination of COUNTA, UNIQUE and REPT.
COUNTA(UNIQUE(A2:A500)) ← Count how many unique values there are in the range (3)
REPT(CONCATENATE(C2:C4&"~"),COUNTA(UNIQUE(A2:A500)))
Basically this is converted into:
REPT("d~e~f~",3) → "d~e~f~d~e~f~d~e~f~"
Now we have as many d, e and f as we want. Next we need to turn them into cells. We'll do this with a combination of SPLIT and TRANSPOSE.
TRANSPOSE(SPLIT(REPT(CONCATENATE(C2:C4&"~"),COUNTA(UNIQUE(A2:A500))),"~"))
We split the string on "~", so we end up with an array that looks like [d,e,f,d,e,f,d,e,f]. We then need to transpose it so the values run down rows instead of across columns.
The last part is to wrap everything in ARRAYFORMULA so the formula actually works.
=ArrayFormula(TRANSPOSE(SPLIT(REPT(CONCATENATE(C2:C4&"~"),COUNTA(UNIQUE(A2:A500))),"~")))
Now the array will look like:
col_1 | col_2
----------------
a | d
a | e
a | f
b | d
b | e
b | f
c | d
c | e
c | f
Now any time you add a new unique value to col_1, three new values are added to col_2.
There is a new function, FLATTEN(), that was discovered on the Google Product forums thanks to a user's post.
In your scenario, this should work:
=ARRAYFORMULA(QUERY(SPLIT(FLATTEN(A2:A&"|"&TRANSPOSE(C2:C4)),"|",0,0),"where Col1<>''"))
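Outside of Sheets, the effect of both formulas is essentially a cross join of the unique col_1 entries with the values range. A small sketch of that logic for comparison (the literal lists below stand in for the ranges A2:A500 and C2:C4):

from itertools import product

col_1 = ["a", "a", "a", "b", "b", "b", "c", "c", "c"]  # as generated in the sheet
values = ["d", "e", "f"]                               # the range C2:C4

# Keep the first-seen order of the unique col_1 entries, like UNIQUE() does.
unique_keys = list(dict.fromkeys(col_1))

for key, value in product(unique_keys, values):
    print(key, "|", value)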

Proving MC/DC unique cause definition compliance

I'm reading the following paper on MC/DC: http://shemesh.larc.nasa.gov/fm/papers/Hayhurst-2001-tm210876-MCDC.pdf.
I have the source code: Z := (A or B) and (C or D) and the following test cases:
-----------------
| A | F F T F T |
| B | F T F T F |
| C | T F F T T |
| D | F T F F F |
| Z | F T F T T |
-----------------
I want to prove that the mentioned test cases comply with unique cause definition.
I started by eliminating masked tests:
A or B = F T T T T, so in the first test case the result of C or D is masked, because F and (C or D) = F.
C or D = T T F T T, so in the third test case the result of A or B is masked, because (A or B) and F = F.
I then determined MC/DC:
Required test cases for A or B:
F F (first case)
T F (fifth case)
F T (second or fourth case)
Required test cases for C or D:
F F (third case)
T F (fourth or fifth case)
F T (second case)
Required test cases for (A or B) and (C or D):
T T (second, fourth or fifth case)
F T (first case)
T F (third case)
According to the paper, this example doesn't comply with the unique cause definition. Instead, they propose changing the second test case from F T F T to T F F T.
-----------------
| A | F T T F T |
| B | F F F T F |
| C | T F F T T |
| D | F T F F F |
| Z | F T F T T |
-----------------
I determined MC/DC for A or B again:
F F (first case)
T F (fifth case)
F T (fourth case)
Then, they introduce the following independence pairs table that shows the difference between both examples (on page 38):
I understand that for the first example, the independence pair they show changes two variables instead of one; however, I don't understand how they are computing the independence pairs.
In the A column, I can infer they take F F T F from the test cases table's A row, and they compute the independence pair as the same test case with only A changed (T F T F).
In B's column, however, they pick F F T F again. According to my thinking, this should instead be B's column, F T F T.
The rest of the letters show the same dilemma.
Also, for D's column in the first example, they show that the independence pair of F T F T is T F F F, which ruins my theory that they compute the independence pair from the first value and proves that they are picking it from somewhere else.
Can someone explain how (and from where) they construct such an independence pair table?
First, let's re-read the definitions:
(From www.faa.gov/aircraft/air_cert/design_approvals/air_software/cast/cast_papers/media/cast-10.pdf)
DO-178B/ED-12B includes the following definitions:
Condition
A Boolean expression containing no Boolean operators.
Decision
A Boolean expression composed of conditions and zero or more Boolean operators.
A decision without a Boolean operator is a condition.
If a condition appears more than once in a decision, each occurrence is a
distinct condition.
Decision Coverage
Every point of entry and exit in the program has been invoked at least once
and every decision in the program has taken on all possible outcomes at least once.
Modified Condition/Decision Coverage
Every point of entry and exit in the program has been invoked at least once,
every condition in a decision in the program has taken all possible outcomes
at least once, every decision in the program has taken all possible outcomes
at least once, and each condition in a decision has been shown to independently
affect that decision's outcome.
A condition is shown to independently affect a decision's outcome by varying just
that condition while holding fixed all other possible conditions.
So, for the decision '(A or B) and (C or D)' we have four conditions: A, B, C and D.
For each condition we must find a pair of test vectors showing that the condition
'independently affects that decision's outcome'.
For unique cause MC/DC, only the value of the condition considered can vary in the pair of test vectors.
For example let's consider condition A. The following pair of test vectors covers condition A:
(A or B) and (C or D) = Z
T F T F T
F F T F F
With this pair of test vectors (TFTF, FFTF) only the value of A and Z (the decision) change.
We then search pairs for conditions B, C and D.
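To make that search concrete, here is a small sketch (the helper names are mine) that checks every pair of the question's original five test vectors for unique-cause independence pairs:

from itertools import combinations

def decision(a, b, c, d):
    return (a or b) and (c or d)

# The question's original test cases, columns A, B, C, D.
vectors = [
    (False, False, True,  False),  # test case 1
    (False, True,  False, True),   # test case 2
    (True,  False, False, False),  # test case 3
    (False, True,  True,  False),  # test case 4
    (True,  False, True,  False),  # test case 5
]
names = "ABCD"

# Unique cause: the two vectors differ in exactly one condition
# and the decision outcome differs as well.
pairs = {n: [] for n in names}
for i, j in combinations(range(len(vectors)), 2):
    diff = [k for k in range(4) if vectors[i][k] != vectors[j][k]]
    if len(diff) == 1 and decision(*vectors[i]) != decision(*vectors[j]):
        pairs[names[diff[0]]].append((i + 1, j + 1))

for name, found in pairs.items():
    print(name, found or "no unique-cause pair in this test set")

With the original test cases this reports pairs for A, B and C but none for D, which matches the observation below that a vector is missing for condition D.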
Using the RapiCover GUI (Qualifiable Code coverage tool from Rapita Systems - www.rapitasystems.com/products/rapicover) we can see the full set of test vectors (observed or missing) to fully cover all conditions of the decision.
RapiCover screenshot
Vector V3 (in yellow in the screenshot above) isn't used in any independence pair.
Vector V6 (in red in the screenshot) is missing for MC/DC coverage of condition D.
This is for the definition of 'unique cause' MC/DC.
Now for 'masking MC/DC':
For 'masking MC/DC' the requirement that the value of a single condition may vary in a pair
of test vectors is relaxed provided that any other change is masked by the boolean
operators in the expression.
For example, let's consider the pair of vectors for condition D:
(A or B) and (C or D) = Z
T F F T T
T F F F F
We can represent these two test vectors on the expression tree:
       and
      /   \
   or1     or2
   / \     / \
  A   B   C   D

       and                       and
       [T]                       [F]
      /   \                     /   \
   or1     or2               or1     or2
   [T]     [T]               [T]     [F]
   / \     / \               / \     / \
  A   B   C   D             A   B   C   D
 [T] [F] [F] [T]           [T] [F] [F] [F]
This is a pair for unique cause MC/DC.
Let's now consider a new pair of test vectors for condition D:
(A or B) and (C or D) = Z
F T F T T
T F F F F
Again we can represent these two test vectors on the expression tree:
       and                       and
       [T]                       [F]
      /   \                     /   \
   or1     or2               or1     or2
   [T]     [T]               [T]     [F]
   / \     / \               / \     / \
  A   B   C   D             A   B   C   D
 [F] [T] [F] [T]           [T] [F] [F] [F]
This is a pair for masking MC/DC because, although the values of three conditions (A, B and D) have changed,
the change for conditions A and B is masked by the boolean operator 'or1' (i.e. the value of 'A or B' is unchanged).
So, for masking MC/DC, the independence pairs for condition D can be:
RapiCover screenshot
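To make the masking rule concrete, here is a small sketch (the helper names are mine) that checks the example pair for condition D against the criterion described above: D and the decision must change, C must not change, and any change in A or B must leave the value of 'A or B' unchanged.

def or1(a, b):            # left operand of the top-level 'and'
    return a or b

def decision(a, b, c, d):
    return or1(a, b) and (c or d)

def is_masking_pair_for_d(v1, v2):
    # Masking independence pair for D in (A or B) and (C or D):
    # D changes, the decision changes, C is unchanged, and a change in
    # A or B is allowed only if 'A or B' keeps the same value (masked by or1).
    a1, b1, c1, d1 = v1
    a2, b2, c2, d2 = v2
    return (d1 != d2
            and c1 == c2
            and or1(a1, b1) == or1(a2, b2)
            and decision(*v1) != decision(*v2))

# The pair from the example above: F T F T and T F F F.
print(is_masking_pair_for_d((False, True,  False, True),
                            (True,  False, False, False)))  # True

A unique-cause check would reject this same pair, because A and B change as well.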

can neo4j coalesce identical object nodes?

I am considering learning graph databases (like neo4j), but I was curious whether such facilities are available in graph databases. E.g., if I do:
Step 1: create: A --> B --> C
Step 2: create: D --> B --> E
Step 3: create: F --> G --> E
This should automatically result in a graph stored something like:
A ---> B ----> C
      /|\ \
D -----|   \--> E
               /|\
F ---> G -------|
Here the common nodes B and E are coalesced (without having to programmatically check for the prior existence of these nodes). In a real-world example, there would be thousands of such B's and E's, which would be implemented in a relational DB as follows:
FK = Foreign Key; X, Y, Z are keys for the three primary tables.

 X | FK(Y)    Y | ...    FK(Y) | FK(Z)    Z | ..
---|------   ---|----   -------|------   ---|----
 A | FK(B)    B | ...    FK(B) | FK(C)    C | ..
 D | FK(B)    G | ..     FK(B) | FK(E)    E | ..
 F | FK(G)               FK(G) | FK(E)
In an RDB (e.g., when I insert the relation D-->B), I would have to programmatically search for a duplicate object B in the second table (or look for a failure code when trying to insert an identical object into it) and then get B's key to store as a foreign key along with D. I am hoping that in a graph DB such things are taken care of by the DB.
You should look at v2.0's new MERGE clause, which allows you to have a follow-on ON MATCH and ON CREATE clause, so you can take a specific action when a node is found vs created.
See the 2.0M3 blog post for an intro (2.0M4 is the latest build but MERGE was intro'd in M3), as well as this "What's new in 2.0" video.
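As an illustration, here is a rough sketch of what that can look like; the MERGE / ON CREATE / ON MATCH Cypher is standard, but the driver usage, the node label, the property names and the relationship type are assumptions rather than anything from the question (and it uses the current official Python driver rather than the 2.0-era API):

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# MERGE matches an existing node with these properties or creates it, so
# inserting A-->B and later D-->B reuses the same B node automatically.
create_edge = """
MERGE (a:Node {name: $src})
MERGE (b:Node {name: $dst})
  ON CREATE SET b.created = timestamp()  // runs only when the node is new
  ON MATCH  SET b.seen = true            // runs when the node already existed
MERGE (a)-[:LINKS_TO]->(b)
"""

with driver.session() as session:
    for src, dst in [("A", "B"), ("B", "C"), ("D", "B"),
                     ("B", "E"), ("F", "G"), ("G", "E")]:
        session.run(create_edge, src=src, dst=dst)

driver.close()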

cypher find relation direction

How can I find a relation's direction with respect to a containing path? I need this to do a weighted graph search that takes relation direction into account (weighting the "wrong" direction with a 0, see also comments).
Let's say:
START a=node({param})
MATCH a-[*]-b
WITH a, b
MATCH p = allshortestpaths(a-[*]-b)
RETURN extract(r in rels(p): flows_with_path(r)) as in_flow
where
flows_with_path = 1 if sp = (a)-[*0..]-[r]->[*0..]-(b), otherwise 0
EDIT: corrected query
So, here's a way to do it with existing cypher functions. I don't promise it's super performant, but give it a shot. We're building our collection with reduce, using an accumulator tuple with a collection and the last node we looked at, so we can check that it's connected to the next node. This requires 2.0's case/when syntax--there may be a way to do it in 1.9 but it's probably even more complex.
START a=node:node_auto_index(name="Trinity")
MATCH a-[*]-b
WHERE a <> b
WITH distinct a,b
MATCH p = allshortestpaths(a-[*]-b)
RETURN extract(x in nodes(p): x.name?), // a concise representation of the path we're checking
head(
reduce(acc=[[], head(nodes(p))], x IN tail(nodes(p)): // pop the first node off, traverse the tail
CASE WHEN ALL (y IN tail(acc) WHERE y-->x) // a bit of a hack because tail(acc)-->x doesn't parse right, so I had to wrap it so I can have a bare identifier in the pattern predicate
THEN [head(acc) + 0, x] // add a 0 to our accumulator collection
ELSE [head(acc) + 1, x] // add a 1 to our accumulator collection
END )) AS in_line
http://console.neo4j.org/r/v0jx03
Output:
+---------------------------------------------------------------+-----------+
| extract(x in nodes(p): x.name?)                               | in_line   |
+---------------------------------------------------------------+-----------+
| ["Trinity","Morpheus"]                                        | [1]       |
| ["Trinity","Morpheus","Cypher"]                               | [1,0]     |
| ["Trinity","Morpheus","Cypher","Agent Smith"]                 | [1,0,0]   |
| ["Trinity","Morpheus","Cypher","Agent Smith","The Architect"] | [1,0,0,0] |
| ["Trinity","Neo"]                                             | [1]       |
| ["Trinity","Neo",<null>]                                      | [1,1]     |
+---------------------------------------------------------------+-----------+
Note: Thanks #boggle for the brainstorming session.
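For comparison, here is the same bookkeeping outside Cypher as a small sketch; the edge set is a guess at the console sample data, chosen only to be consistent with the output above:

def in_line(path, edges):
    # Mirror of the reduce() above: for each consecutive pair (u, v) on the
    # path, emit 0 when a directed edge u -> v exists (the relationship
    # follows the path) and 1 when it does not.
    return [0 if (u, v) in edges else 1 for u, v in zip(path, path[1:])]

# Assumed directed edges (the real dataset also carries relationship types).
edges = {
    ("Morpheus", "Trinity"),
    ("Morpheus", "Cypher"),
    ("Cypher", "Agent Smith"),
    ("Agent Smith", "The Architect"),
    ("Neo", "Trinity"),
}

print(in_line(["Trinity", "Morpheus", "Cypher", "Agent Smith"], edges))  # [1, 0, 0]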

algo for reducing a graph while preserving edge values along paths from a start node to an end node

I have a directed cyclic graph with values at edges but no values at nodes.
The graph has a start node and an end node, and I want to preserve the set of paths through the graph but I don't care about the nodes on the path, only the edge values. Example below.
Are there any algorithms that will produce a smaller graph that preserves that property?
The graph might have 10's of thousands of nodes, but not millions. The number of edges per node is small w.r.t. the number of nodes.
Conservative heuristics are welcome.
As an example, where O is a node, and a number is the value of the adjacent edge:
  O --------> O --------> O
       2           3
  ^                       |4
  |1                      v
      1    2    3    4
start -> O -> O -> O -> end
  |5                      ^
  v                       |8
  O --------> O --------> O
       6           7
has two paths with edge values [1,2,3,4] from start to end, so one is redundant, and I would be happy to reduce the above to
  O --------> O --------> O
       2           3
  ^                       |4
  |1                      v
start                   end
  |5                      ^
  v                       |8
  O --------> O --------> O
       6           7
The graph can be cyclic, so in
          1
         /-\
         | /
         v/
start -> O -> O -> end
       1    1    2
a simpler graph would eliminate the second 1 transition to leave only the self-edge:
          1
         /-\
         | /
         v/
start -> O -> end
       1    2
I would iterate through all nodes that are not the start or the end and remove them one by one. When removing a node, you add a new edge between every pair of nodes that were connected through it (watch the direction, since it's a directed graph). The thing to remember is that if this process produces an edge that already exists, make sure to keep the edge with the smaller weight (that's the key).
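Here is a rough sketch of that elimination step. The answer does not say what value the new bypass edge should carry, so this sketch simply adds the two values as weights and keeps the smaller total when a duplicate edge appears; treat both choices as assumptions.

def eliminate_internal_nodes(edges, start, end):
    # edges: dict mapping (u, v) -> weight for a directed graph.
    # Remove every node other than start and end, replacing each
    # in-edge/out-edge pair through the removed node with a bypass edge.
    nodes = {n for edge in edges for n in edge} - {start, end}
    for v in nodes:
        incoming = [(u, w) for (u, x), w in edges.items() if x == v and u != v]
        outgoing = [(x, w) for (u, x), w in edges.items() if u == v and x != v]
        # Drop every edge touching v, including self-loops.
        edges = {edge: w for edge, w in edges.items() if v not in edge}
        for u, w_in in incoming:
            for x, w_out in outgoing:
                w = w_in + w_out                  # assumed combination rule
                if (u, x) not in edges or edges[(u, x)] > w:
                    edges[(u, x)] = w             # keep the smaller weight
    return edges

# Tiny example with placeholder names: the indirect route start -> m -> end
# (1 + 2) replaces the heavier direct edge of weight 4.
example = {("start", "m"): 1, ("m", "end"): 2, ("start", "end"): 4}
print(eliminate_internal_nodes(example, "start", "end"))  # {('start', 'end'): 3}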
Implication Charts did what I needed. They're O(n**2) space-wise in the number of nodes, but that's manageable in my situation.
