Flink Gelly Path/Trail Usecase - path

Our team is new to the Gelly API. We are looking to implement a simple use case that lists all paths originating from an initial vertex. For example, the input edge CSV file is
1,2
2,3
3,4
1,5
5,6
and the required output (every full path that starts from vertex 1) will be
1,2,3,4
1,5,6
Can someone please help.

You can use one of Gelly's iteration abstractions, e.g. vertex-centric iterations. Starting from the source vertex, you can iteratively extend the paths, one hop per superstep. Upon receiving a path, a vertex appends its ID to the path and propagates it to its outgoing neighbors. If a vertex has no outgoing neighbors, it prints / stores the path and does not propagate it further. To avoid loops, a vertex should also check whether its ID already exists in the path before propagating. The compute function could look like this:
public static final class ComputePaths extends ComputeFunction<Integer, Boolean, NullValue, ArrayList<Integer>> {

    @Override
    public void compute(Vertex<Integer, Boolean> vertex, MessageIterator<ArrayList<Integer>> paths) {
        if (getSuperstepNumber() == 1) {
            // the source propagates its ID
            if (vertex.getId().equals(1)) {
                ArrayList<Integer> msg = new ArrayList<>();
                msg.add(1);
                sendMessageToAllNeighbors(msg);
            }
        }
        else {
            // go through received messages
            for (ArrayList<Integer> p : paths) {
                if (!p.contains(vertex.getId())) {
                    // if no cycle => append ID and forward to neighbors
                    p.add(vertex.getId());
                    if (!vertex.getValue()) {
                        sendMessageToAllNeighbors(p);
                    }
                    else {
                        // no out-neighbors: print p
                        System.out.println(p);
                    }
                }
                else {
                    // found a cycle => print the path and don't propagate further
                    System.out.println(p);
                }
            }
        }
    }
}
In this code I have assumed that you have pre-processed vertices to mark the ones that have no out-neighbors with a "true" value. You could e.g. use graph.outDegrees() to find those.
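For illustration, that pre-processing could look roughly like the sketch below (Gelly 1.x style; the "input" graph variable and maxIterations are assumptions, not part of the original code):
// Sketch only: mark vertices that have no outgoing edges with "true".
// Assumes a Graph<Integer, NullValue, NullValue> named "input" built from the edge CSV.
DataSet<Tuple2<Integer, LongValue>> outDegrees = input.outDegrees();

Graph<Integer, Boolean, NullValue> marked = input
        // start by assuming a vertex has no out-neighbors
        .mapVertices(new MapFunction<Vertex<Integer, NullValue>, Boolean>() {
            @Override
            public Boolean map(Vertex<Integer, NullValue> v) {
                return true;
            }
        })
        // any vertex reported with a positive out-degree does have out-neighbors
        .joinWithVertices(outDegrees, new VertexJoinFunction<Boolean, LongValue>() {
            @Override
            public Boolean vertexJoin(Boolean vertexValue, LongValue outDegree) {
                return outDegree.getValue() == 0L;
            }
        });

// run the vertex-centric iteration shown above on the marked graph (null = no message combiner)
Graph<Integer, Boolean, NullValue> result =
        marked.runVertexCentricIteration(new ComputePaths(), null, maxIterations);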
Keep in mind that enumerating all paths in a large, dense graph is expensive to compute. The intermediate path state can explode quite quickly. You could look into a more compact representation for paths than an ArrayList of Integers, but beware of the cost if you have a dense graph with a large diameter.
If you don't need the paths themselves but you're only interested in reachability or shortest paths, then there exist more efficient algorithms.
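For instance, Gelly's library already contains a single-source shortest paths implementation; a minimal usage sketch (assuming a Graph<Integer, Double, Double> named weightedGraph with Double edge weights, source vertex 1, and at most 10 iterations) could look like this:
// Sketch: run Gelly's library SSSP instead of enumerating every path
DataSet<Vertex<Integer, Double>> distances =
        weightedGraph.run(new SingleSourceShortestPaths<Integer>(1, 10));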

Related

Apache Beam Stateful DoFn Periodically Output All K/V Pairs

I'm trying to aggregate (per key) a streaming data source in Apache Beam (via Scio) using a stateful DoFn (using @ProcessElement with @StateId ValueState elements). I thought this would be most appropriate for the problem I'm trying to solve. The requirements are:
for a given key, records are aggregated (essentially summed) across all time - I don't care about previously computed aggregates, just the most recent
keys may be evicted from the state (state.clear()) based on certain conditions that I control
Every 5 minutes, regardless of whether any new keys were seen, all keys that haven't been evicted from the state should be output
Given that this is a streaming pipeline and will be running indefinitely, using a combinePerKey over a global window with accumulating fired panes seems like it will continue to increase its memory footprint and the amount of data it needs to run over time, so I'd like to avoid it. Additionally, when testing this out, (maybe as expected) it simply appends the newly computed aggregates to the output along with the historical input, rather than using the latest value for each key.
My thought was that using a StatefulDoFn would simply allow me to output all of the global state up until now(), but it seems this isn't a trivial solution. I've seen hints at using timers to artificially execute callbacks for this, as well as potentially using a slowly growing side input map (How to solve Duplicate values exception when I create PCollectionView<Map<String,String>>) and somehow flushing this, but this would essentially require iterating over all values in the map rather than joining on it.
I feel like I might be overlooking something simple to get this working. I'm relatively new to many concepts of windowing and timers in Beam, looking for any advice on how to solve this. Thanks!
You are right that Stateful DoFn should help you here. This is a basic sketch of what you can do. Note that this only outputs the sum without the key. It may not be exactly what you want, but it should help you move forward.
class CombiningEmittingFn extends DoFn<KV<Integer, Integer>, Integer> {

  @TimerId("emitter")
  private final TimerSpec emitterSpec = TimerSpecs.timer(TimeDomain.PROCESSING_TIME);

  @StateId("done")
  private final StateSpec<ValueState<Boolean>> doneState = StateSpecs.value();

  @StateId("agg")
  private final StateSpec<CombiningState<Integer, int[], Integer>> aggSpec =
      StateSpecs.combining(
          Sum.ofIntegers().getAccumulatorCoder(null, VarIntCoder.of()), Sum.ofIntegers());

  @ProcessElement
  public void processElement(ProcessContext c,
      @StateId("agg") CombiningState<Integer, int[], Integer> aggState,
      @StateId("done") ValueState<Boolean> doneState,
      @TimerId("emitter") Timer emitterTimer) throws Exception {
    if (SOME CONDITION) {
      aggState.clear();
      doneState.write(true);
    } else {
      aggState.add(c.element().getValue());
      emitterTimer.align(Duration.standardMinutes(5)).setRelative();
    }
  }

  @OnTimer("emitter")
  public void onEmit(
      OnTimerContext context,
      @StateId("agg") CombiningState<Integer, int[], Integer> aggState,
      @StateId("done") ValueState<Boolean> doneState,
      @TimerId("emitter") Timer emitterTimer) {
    Boolean isDone = doneState.read();
    if (isDone != null && isDone) {
      return;
    } else {
      context.output(aggState.read());
      // Set the timer to emit again
      emitterTimer.align(Duration.standardMinutes(5)).setRelative();
    }
  }
}
Happy to iterate with you on something that'll work.
@Pablo was indeed correct that a StatefulDoFn and timers are useful in this scenario. Here is the code I was able to get working.
Stateful DoFn
// DomainState is a custom case class I'm using
type DoFnT = DoFn[KV[String, DomainState], KV[String, DomainState]]

class StatefulDoFn extends DoFnT {

  @StateId("key")
  private val keySpec = StateSpecs.value[String]()

  @StateId("domainState")
  private val domainStateSpec = StateSpecs.value[DomainState]()

  @TimerId("loopingTimer")
  private val loopingTimer: TimerSpec = TimerSpecs.timer(TimeDomain.EVENT_TIME)

  @ProcessElement
  def process(
      context: DoFnT#ProcessContext,
      @StateId("key") stateKey: ValueState[String],
      @StateId("domainState") stateValue: ValueState[DomainState],
      @TimerId("loopingTimer") loopingTimer: Timer): Unit = {
    // ... logic to create key/value from potentially null values
    if (keepState(value)) {
      loopingTimer.align(Duration.standardMinutes(5)).setRelative()
      stateKey.write(key)
      stateValue.write(value)
      if (flushState(value)) {
        context.output(KV.of(key, value))
      }
    } else {
      stateValue.clear()
    }
  }

  @OnTimer("loopingTimer")
  def onLoopingTimer(
      context: DoFnT#OnTimerContext,
      @StateId("key") stateKey: ValueState[String],
      @StateId("domainState") stateValue: ValueState[DomainState],
      @TimerId("loopingTimer") loopingTimer: Timer): Unit = {
    // ... logic to create key/value checking for nulls
    if (keepState(value)) {
      loopingTimer.align(Duration.standardMinutes(5)).setRelative()
      if (flushState(value)) {
        context.output(KV.of(key, value))
      }
    }
  }
}
With pipeline
sc
.pubsubSubscription(...)
.keyBy(...)
.withGlobalWindow()
.applyPerKeyDoFn(new StatefulDoFn())
.withFixedWindows(
duration = Duration.standardMinutes(5),
options = WindowOptions(
accumulationMode = DISCARDING_FIRED_PANES,
trigger = AfterWatermark.pastEndOfWindow(),
allowedLateness = Duration.ZERO,
// Only take the latest per key during a window
timestampCombiner = TimestampCombiner.END_OF_WINDOW
))
.reduceByKey(mostRecentEvent())
.saveAsCustomOutput(TextIO.write()...)

Context dependent ANTLR4 ParseTreeVisitor implementation

I am working on a project where we migrate a massive number (more than 12,000) of views from Oracle to Hadoop/Impala. I have written a small Java utility to extract view DDL from Oracle and would like to use ANTLR4 to traverse the AST and generate an Impala-compatible view DDL statement.
Most of the work is relatively simple; it only involves rewriting some Oracle-specific syntax quirks to Impala style. However, I am facing an issue where I am not sure I have the best answer yet: we have a number of special cases where values from a date field are extracted in multiple nested function calls. For example, the following extracts the day from a Date field:
TO_NUMBER(TO_CHAR(d.R_DATE , 'DD' ))
I have an ANTLR4 grammar declared for Oracle SQL and hence get the visitor callback when it reaches TO_NUMBER and TO_CHAR as well, but I would like to have special handling for this special case.
Is there any other way than implementing the handler method for the outer function and then resorting to manual traversal of the nested structure to see whether it matches this pattern?
I have something like this in the generated Visitor class:
@Override
public String visitNumber_function(PlSqlParser.Number_functionContext ctx) {
    // FIXME: seems to be dodgy code, can it be improved?
    String functionName = ctx.name.getText();
    if (functionName.equalsIgnoreCase("TO_NUMBER")) {
        final int childCount = ctx.getChildCount();
        if (childCount == 4) {
            final int functionNameIndex = 0;
            final int openRoundBracketIndex = 1;
            final int encapsulatedValueIndex = 2;
            final int closeRoundBracketIndex = 3;
            ParseTree encapsulated = ctx.getChild(encapsulatedValueIndex);
            if (encapsulated instanceof TerminalNode) {
                throw new IllegalStateException("TerminalNode is found at: " + encapsulatedValueIndex);
            }
            String customDateConversionOrNullOnOtherType =
                customDateConversionFromToNumberAndNestedToChar(encapsulated);
            if (customDateConversionOrNullOnOtherType != null) {
                // the child node contained our expected child element, so return the converted value
                return customDateConversionOrNullOnOtherType;
            }
            // otherwise the child was something unexpected, signalled by null,
            // so simply fall back to the default handler
        }
    }
    // some other numeric function, default handling
    return super.visitNumber_function(ctx);
}

private String customDateConversionFromToNumberAndNestedToChar(ParseTree parseTree) {
    // ...
}
For anyone hitting the same issue, the way to go seems to be:
changing the grammar definition and introducing custom sub-types for the encapsulated expression of the nested function. Then it is possible to hook into the processing at precisely the desired location of the parse tree.
using a second custom ParseTreeVisitor that captures the values of the function call and delegates the processing of the rest of the sub-tree back to the main, "outer" ParseTreeVisitor.
Once the second custom ParseTreeVisitor has finished visiting all the sub-parse-trees, I had the context information I required and all the sub-trees had been visited properly.
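A rough Java sketch of that second-visitor idea is below; the rule To_char_date_expr, its labeled tokens column and format, the helper buildImpalaDayExtract, and the generated base class name PlSqlParserBaseVisitor are all assumptions that depend on your grammar:
// Hypothetical second visitor: it only converts the nested TO_CHAR(<date column>, 'DD')
// expression and hands every other sub-tree back to the main ("outer") visitor.
class ToCharCaptureVisitor extends PlSqlParserBaseVisitor<String> {

    private final PlSqlParserBaseVisitor<String> mainVisitor;

    ToCharCaptureVisitor(PlSqlParserBaseVisitor<String> mainVisitor) {
        this.mainVisitor = mainVisitor;
    }

    // "To_char_date_expr" is an assumed rule introduced in the grammar for the
    // encapsulated expression of the nested function call.
    @Override
    public String visitTo_char_date_expr(PlSqlParser.To_char_date_exprContext ctx) {
        String column = ctx.column.getText(); // e.g. d.R_DATE (assumed label in the rule)
        String format = ctx.format.getText(); // e.g. 'DD'     (assumed label in the rule)
        return buildImpalaDayExtract(column, format);
    }

    @Override
    public String visitChildren(RuleNode node) {
        // anything this visitor does not handle is delegated back to the main visitor
        return node.accept(mainVisitor);
    }

    private String buildImpalaDayExtract(String column, String format) {
        // hypothetical mapping; a real implementation would dispatch on the format string
        return "day(" + column + ")";
    }
}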

How to chain an indefinite number of flatMap operators in Reactor?

I have some initial state in my application and a few policies that decorate this state with reactively fetched data (each policy's Mono returns a new instance of the state with additional data). Eventually I get a fully decorated state.
It basically looks like this:
public interface Policy {
    Mono<State> apply(State currentState);
}
Usage for a fixed number of policies would look like this:
Flux.just(baseState)
.flatMap(firstPolicy::apply)
.flatMap(secondPolicy::apply)
...
.subscribe();
It basically means that the entry state for each Mono is the result of accumulating the initial state and the results of all of that Mono's predecessors.
In my case the number of policies is not fixed; they come from another layer of the application as a collection of objects that implement the Policy interface.
Is there any way to achieve a similar result to the code above (with two flatMaps), but for an unknown number of policies? I have tried Flux's reduce method, but it works only if a policy returns a value, not a Mono.
This seems difficult because you're streaming your baseState, then trying to do an arbitrary number of flatMap() calls on that. There's nothing inherently wrong with using a loop to achieve this, but I like to avoid that unless absolutely necessary, as it breaks the natural reactive flow of the code.
If you instead iterate and reduce the policies into a single policy, then the flatMap() call becomes trivial:
Flux.fromIterable(policies)
.reduce((p1,p2) -> s -> p1.apply(s).flatMap(p2::apply))
.flatMap(p -> p.apply(baseState))
.subscribe();
If you're able to edit your Policy interface, I'd strongly suggest adding a static combine() method to reference in your reduce() call to make that more readable:
interface Policy {
    Mono<State> apply(State currentState);

    public static Policy combine(Policy p1, Policy p2) {
        return s -> p1.apply(s).flatMap(p2::apply);
    }
}
The Flux then becomes much more descriptive and less verbose:
Flux.fromIterable(policies)
.reduce(Policy::combine)
.flatMap(p -> p.apply(baseState))
.subscribe();
As a complete demonstration, swapping out your State for a String to keep it shorter:
interface Policy {
    Mono<String> apply(String currentState);

    public static Policy combine(Policy p1, Policy p2) {
        return s -> p1.apply(s).flatMap(p2::apply);
    }
}

public static void main(String[] args) {
    List<Policy> policies = new ArrayList<>();
    policies.add(x -> Mono.just("blah " + x));
    policies.add(x -> Mono.just("foo " + x));

    String baseState = "bar";
    Flux.fromIterable(policies)
        .reduce(Policy::combine)
        .flatMap(p -> p.apply(baseState))
        .subscribe(System.out::println); //Prints "foo blah bar"
}
If I understand the problem correctly, then the most simple solution is to use a regular for loop:
Flux<State> flux = Flux.just(baseState);
for (Policy policy : policies)
{
    flux = flux.flatMap(policy::apply);
}
flux.subscribe();
Also, note that if you have just a single baseState you can use Mono instead of Flux.
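For example (a sketch reusing the same Policy interface and policies collection):
Mono<State> mono = Mono.just(baseState);
for (Policy policy : policies)
{
    mono = mono.flatMap(policy::apply);
}
mono.subscribe();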
UPDATE:
If you are concerned about breaking the flow, you can extract the for loop into a method and apply it via transform operator:
Flux.just(baseState)
    .transform(this::applyPolicies)
    .subscribe();

private Publisher<State> applyPolicies(Flux<State> originalFlux)
{
    Flux<State> newFlux = originalFlux;
    for (Policy policy : policies)
    {
        newFlux = newFlux.flatMap(policy::apply);
    }
    return newFlux;
}

Logical node deletion in Neo4j embedded

I have the following graph in an embedded Neo4j instance:
I want to find all the people who are not greeted by anyone else. That's simple enough: MATCH (n) WHERE NOT ()-[:GREETS]->(n) RETURN n.
However, whenever I find non-greeted people, I want to remove those nodes from the db and repeat the query, as long as it matches one or more nodes. In other words, starting from the graph in the picture, I want to:
Run the query, which returns "Alice"
Remove "Alice" from the db
Run the query, which returns "Bob"
Remove "Bob" from the db
Run the query, which returns no matches
Return the names "Alice" and "Bob"
Moreover, I want to execute this algorithm without actually removing any nodes from the database - i.e., a sort of "logical deletion".
One solution I have found is to not call success() on the transaction, so that node deletions are not committed to the db, as in the following code:
import org.neo4j.graphdb.*;
import org.neo4j.graphdb.factory.GraphDatabaseFactory;

import java.io.File;
import java.util.*;

public class App
{
    static String dbPath = "~/neo4j/data/databases/graph.db";

    private enum RelTypes implements RelationshipType { GREETS }

    public static void main(String[] args) {
        File graphDirectory = new File(dbPath);
        GraphDatabaseService graph = new GraphDatabaseFactory().newEmbeddedDatabase(graphDirectory);
        Set<String> notGreeted = new HashSet<>();

        try (Transaction tx = graph.beginTx()) {
            while (true) {
                Node notGreetedNode = getFirstNode(graph, "MATCH (n) WHERE NOT ()-[:GREETS]->(n) RETURN n");
                if (notGreetedNode == null) {
                    break;
                }
                notGreeted.add((String) notGreetedNode.getProperty("name"));
                detachDeleteNode(graph, notGreetedNode);
            }
            // Here I do NOT call tx.success()
        }

        System.out.println("Non greeted people: " + String.join(", ", notGreeted));
        graph.shutdown();
    }

    private static Node getFirstNode(GraphDatabaseService graph, String cypherQuery) {
        try (Result r = graph.execute(cypherQuery)) {
            if (!r.hasNext()) {
                return null;
            }
            Collection<Object> nodes = r.next().values();
            if (nodes.size() == 0) {
                return null;
            }
            return (Node) nodes.iterator().next();
        }
    }

    private static boolean detachDeleteNode(GraphDatabaseService graph, Node node) {
        final String query = String.format("MATCH (n) WHERE ID(n) = %s DETACH DELETE n", node.getId());
        try (Result r = graph.execute(query)) {
            return true;
        }
    }
}
This code works correctly and prints "Non greeted people: Bob, Alice".
My question is: does this approach (i.e. keeping a series of db operations within an open transaction) have any drawbacks that I should be aware of (e.g. potential memory issues)? Are there other approaches I could use to accomplish this?
I have also considered using a boolean property on the nodes to mark them as either deleted or not deleted. My concern is that the actual application I'm working on contains thousands of nodes and various kinds of relationships, and the actual queries are non-trivial, so I'd rather not change them to accommodate a soft-deletion boolean property (but I am open to doing that, if that turns out to be the best approach).
Also, please note that I am not simply looking for nodes that are not in cycles. Rather, the underlying idea is as follows. I have some nodes that satisfy a certain condition c; I want to (logically) remove those nodes; this will potentially make new nodes satisfy the same condition c, and so on, until the set of nodes that satisfy c is empty.
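For what it's worth, a soft-delete variant of the two queries might look roughly like the sketch below (the deleted property name is made up, and every existing query would need a similar adjustment):
// Sketch: treat nodes flagged with deleted = true as if they had been removed.
// A node counts as "not greeted" if no non-deleted node greets it.
String findNotGreeted =
        "MATCH (n) WHERE coalesce(n.deleted, false) = false "
      + "OPTIONAL MATCH (m)-[:GREETS]->(n) WHERE coalesce(m.deleted, false) = false "
      + "WITH n, count(m) AS greeters WHERE greeters = 0 RETURN n";

// Instead of DETACH DELETE, only flag the node.
String markDeleted = "MATCH (n) WHERE ID(n) = %s SET n.deleted = true";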

Right way to create parallel For loop in Google DataFlow pipeline

I have a simple Dataflow Java job that reads a few lines from a .csv file. Each line contains a numeric cell, which represents how many times a certain function has to be performed for that line.
I don't want to perform that using a traditional For loop within the function, in case these numbers become very large. What is the right way to do this using the parallel-friendly DataFlow methodology?
Here's the current Java code:
public class SimpleJob {

    static class MyDoFn extends DoFn<String, Integer> {
        public void processElement(ProcessContext c) {
            String name = c.element().split("\\,")[0];
            int val = Integer.valueOf(c.element().split("\\,")[1]);
            for (int i = 0; i < val; i++) // <- what's the preferred way to do this in DF?
                System.out.println("Processing some function: " + name); // <- do something
            c.output(val);
        }
    }

    public static void main(String[] args) {
        DataflowPipelineOptions options = PipelineOptionsFactory
            .as(DataflowPipelineOptions.class);
        options.setProject(DEF.ID_PROJ);
        options.setStagingLocation(DEF.ID_STG_LOC);
        options.setRunner(DirectPipelineRunner.class);
        Pipeline pipeline = Pipeline.create(options);

        pipeline.apply(TextIO.Read.from("Source.csv"))
            .apply(ParDo.of(new MyDoFn()));

        pipeline.run();
    }
}
This is what the "source.csv" looks like (so each number represents how many times I want to run a parallel function on that line):
Joe,3
Mary,4
Peter,2
Curiously enough, this is one of the motivating use cases for Splittable DoFn! That API is currently in heavy development.
However, until that API is available, you can basically mimic most of what it would have done for you:
class ElementAndRepeats { String element; int numRepeats; }
PCollection<String> lines = p.apply(TextIO.Read....)
PCollection<ElementAndRepeats> elementAndNumRepeats = lines.apply(
ParDo.of(...parse number of repetitions from the line...));
PCollection<ElementAndRepeats> elementAndNumSubRepeats = elementAndNumRepeats
.apply(ParDo.of(
...split large numbers of repetitions into smaller numbers...))
.apply(...fusion break...);
elementAndNumSubRepeats.apply(ParDo.of(...execute the repetitions...))
where:
"split large numbers of repetitions" is a DoFn that, e.g., splits an ElementAndRepeats{"foo", 34} into {ElementAndRepeats{"foo", 10}, ElementAndRepeats{"foo", 10}, ElementAndRepeats{"foo", 10}, ElementAndRepeats{"foo", 4}}
fusion break - see here, to prevent the several ParDo's from being fused together, defeating the parallelization
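A rough sketch of that splitting DoFn, in the same old Dataflow SDK style as the question (the chunk size of 10 and the class name are made up; ElementAndRepeats would also need a registered coder):
static class SplitRepeats extends DoFn<ElementAndRepeats, ElementAndRepeats> {
    private static final int CHUNK = 10; // arbitrary chunk size

    @Override
    public void processElement(ProcessContext c) {
        ElementAndRepeats in = c.element();
        int remaining = in.numRepeats;
        while (remaining > 0) {
            // emit several smaller work items that add up to the original count
            ElementAndRepeats chunk = new ElementAndRepeats();
            chunk.element = in.element;
            chunk.numRepeats = Math.min(CHUNK, remaining);
            c.output(chunk);
            remaining -= chunk.numRepeats;
        }
    }
}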
