I have a simple DataFlow java job that reads a few lines from a .csv file. Each line contains a numeric cell, which represents how many steps a certain function has to be performed on that line.
I don't want to perform that using a traditional For loop within the function, in case these numbers become very large. What is the right way to do this using the parallel-friendly DataFlow methodology?
Here's the current Java code:
public class SimpleJob{
static class MyDoFn extends DoFn<String, Integer> {
public void processElement(ProcessContext c) {
String name = c.element().split("\\,")[0];
int val = Integer.valueOf(c.element().split("\\,")[1]);
for (int i = 0; i < val; i++) // <- what's the preferred way to do this in DF?
System.out.println("Processing some function: " + name); // <- do something
c.output(val);
}
}
public static void main() {
DataflowPipelineOptions options = PipelineOptionsFactory
.as(DataflowPipelineOptions.class);
options.setProject(DEF.ID_PROJ);
options.setStagingLocation(DEF.ID_STG_LOC);
options.setRunner(DirectPipelineRunner.class);
Pipeline pipeline = Pipeline.create(options);
pipeline.apply(TextIO.Read.from("Source.csv"))
.apply(ParDo.of(new MyDoFn()));
pipeline.run();
}
}
This is what the "source.csv" looks like (so each number represents how many times I want to run a parallel function on that line):
Joe,3
Mary,4
Peter,2
Curiously enough, this is one of the motivating use cases for Splittable DoFn! That API is currently in heavy development.
However, until that API is available, you can basically mimic most of what it would have done for you:
class ElementAndRepeats { String element; int numRepeats; }
PCollection<String> lines = p.apply(TextIO.Read....)
PCollection<ElementAndRepeats> elementAndNumRepeats = lines.apply(
ParDo.of(...parse number of repetitions from the line...));
PCollection<ElementAndRepeats> elementAndNumSubRepeats = elementAndNumRepeats
.apply(ParDo.of(
...split large numbers of repetitions into smaller numbers...))
.apply(...fusion break...);
elementAndNumSubRepeats.apply(ParDo.of(...execute the repetitions...))
where:
"split large numbers of repetitions" is a DoFn that, e.g., splits an ElementAndRepeats{"foo", 34} into {ElementAndRepeats{"foo", 10}, ElementAndRepeats{"foo", 10}, ElementAndRepeats{"foo", 10}, ElementAndRepeats{"foo", 4}}
fusion break - see here, to prevent the several ParDo's from being fused together, defeating the parallelization
Related
I'm trying to aggregate (per key) a streaming data source in Apache Beam (via Scio) using a stateful DoFn (using #ProcessElement with #StateId ValueState elements). I thought this would be most appropriate for the problem I'm trying to solve. The requirements are:
for a given key, records are aggregated (essentially summed) across all time - I don't care about previously computed aggregates, just the most recent
keys may be evicted from the state (state.clear()) based on certain conditions that I control
Every 5 minutes, regardless if any new keys were seen, all keys that haven't been evicted from the state should be outputted
Given that this is a streaming pipeline and will be running indefinitely, using a combinePerKey over a global window with accumulating fired panes seems like it will continue to increase its memory footprint and the amount of data it needs to run over time, so I'd like to avoid it. Additionally, when testing this out, (maybe as expected) it simply appends the newly computed aggregates to the output along with the historical input, rather than using the latest value for each key.
My thought was that using a StatefulDoFn would simply allow me to output all of the global state up until now(), but it seems this isn't a trivial solution. I've seen hintings at using timers to artificially execute callbacks for this, as well as potentially using a slowly growing side input map (How to solve Duplicate values exception when I create PCollectionView<Map<String,String>>) and somehow flushing this, but this would essentially require iterating over all values in the map rather than joining on it.
I feel like I might be overlooking something simple to get this working. I'm relatively new to many concepts of windowing and timers in Beam, looking for any advice on how to solve this. Thanks!
You are right that Stateful DoFn should help you here. This is a basic sketch of what you can do. Note that this only outputs the sum without the key. It may not be exactly what you want, but it should help you move forward.
class CombiningEmittingFn extends DoFn<KV<Integer, Integer>, Integer> {
#TimerId("emitter")
private final TimerSpec emitterSpec = TimerSpecs.timer(TimeDomain.PROCESSING_TIME);
#StateId("done")
private final StateSpec<ValueState<Boolean>> doneState = StateSpecs.value();
#StateId("agg")
private final StateSpec<CombiningState<Integer, int[], Integer>>
aggSpec = StateSpecs.combining(
Sum.ofIntegers().getAccumulatorCoder(null, VarIntCoder.of()), Sum.ofIntegers());
#ProcessElement
public void processElement(ProcessContext c,
#StateId("agg") CombiningState<Integer, int[], Integer> aggState,
#StateId("done") ValueState<Boolean> doneState,
#TimerId("emitter") Timer emitterTimer) throws Exception {
if (SOME CONDITION) {
countValueState.clear();
doneState.write(true);
} else {
countValueState.addAccum(c.element().getValue());
emitterTimer.align(Duration.standardMinutes(5)).setRelative();
}
}
}
#OnTimer("emitter")
public void onEmit(
OnTimerContext context,
#StateId("agg") CombiningState<Integer, int[], Integer> aggState,
#StateId("done") ValueState<Boolean> doneState,
#TimerId("emitter") Timer emitterTimer) {
Boolean isDone = doneState.read();
if (isDone != null && isDone) {
return;
} else {
context.output(aggState.getAccum());
// Set the timer to emit again
emitterTimer.align(Duration.standardMinutes(5)).setRelative();
}
}
}
}
Happy to iterate with you on something that'll work.
#Pablo was indeed correct that a StatefulDoFn and timers are useful in this scenario. Here is the with code I was able to get working.
Stateful Do Fn
// DomainState is a custom case class I'm using
type DoFnT = DoFn[KV[String, DomainState], KV[String, DomainState]]
class StatefulDoFn extends DoFnT {
#StateId("key")
private val keySpec = StateSpecs.value[String]()
#StateId("domainState")
private val domainStateSpec = StateSpecs.value[DomainState]()
#TimerId("loopingTimer")
private val loopingTimer: TimerSpec = TimerSpecs.timer(TimeDomain.EVENT_TIME)
#ProcessElement
def process(
context: DoFnT#ProcessContext,
#StateId("key") stateKey: ValueState[String],
#StateId("domainState") stateValue: ValueState[DomainState],
#TimerId("loopingTimer") loopingTimer: Timer): Unit = {
... logic to create key/value from potentially null values
if (keepState(value)) {
loopingTimer.align(Duration.standardMinutes(5)).setRelative()
stateKey.write(key)
stateValue.write(value)
if (flushState(value)) {
context.output(KV.of(key, value))
}
} else {
stateValue.clear()
}
}
#OnTimer("loopingTimer")
def onLoopingTimer(
context: DoFnT#OnTimerContext,
#StateId("key") stateKey: ValueState[String],
#StateId("domainState") stateValue: ValueState[DomainState],
#TimerId("loopingTimer") loopingTimer: Timer): Unit = {
... logic to create key/value checking for nulls
if (keepState(value)) {
loopingTimer.align(Duration.standardMinutes(5)).setRelative()
if (flushState(value)) {
context.output(KV.of(key, value))
}
}
}
}
With pipeline
sc
.pubsubSubscription(...)
.keyBy(...)
.withGlobalWindow()
.applyPerKeyDoFn(new StatefulDoFn())
.withFixedWindows(
duration = Duration.standardMinutes(5),
options = WindowOptions(
accumulationMode = DISCARDING_FIRED_PANES,
trigger = AfterWatermark.pastEndOfWindow(),
allowedLateness = Duration.ZERO,
// Only take the latest per key during a window
timestampCombiner = TimestampCombiner.END_OF_WINDOW
))
.reduceByKey(mostRecentEvent())
.saveAsCustomOutput(TextIO.write()...)
I am working on a project where we migrate massive number (more than 12000) views to Hadoop/Impala from Oracle. I have written a small Java utility to extract view DDL from Oracle and would like to use ANTLR4 to traverse the AST and generate an Impala-compatible view DDL statement.
The most of the work is relatively simple, only involves re-writing some Oracle specific syntax quirks to Impala style. However, I am facing an issue, where I am not sure I have the best answer yet: we have a number of special cases, where values from a date field are extracted in multiple nested function calls. For example, the following extracts the day from a Date field:
TO_NUMBER(TO_CHAR(d.R_DATE , 'DD' ))
I have an ANTLR4 grammar declared for Oracle SQL and hence get the visitor callback when it reaches TO_NUMBER and TO_CHAR as well, but I would like to have special handling for this special case.
Is not there any other way than implementing the handler method for the outer function and then resorting to manual traversal of the nested structure to see
I have something like in the generated Visitor class:
#Override
public String visitNumber_function(PlSqlParser.Number_functionContext ctx) {
// FIXME: seems to be dodgy code, can it be improved?
String functionName = ctx.name.getText();
if (functionName.equalsIgnoreCase("TO_NUMBER")) {
final int childCount = ctx.getChildCount();
if (childCount == 4) {
final int functionNameIndex = 0;
final int openRoundBracketIndex = 1;
final int encapsulatedValueIndex = 2;
final int closeRoundBracketIndex = 3;
ParseTree encapsulated = ctx.getChild(encapsulatedValueIndex);
if (encapsulated instanceof TerminalNode) {
throw new IllegalStateException("TerminalNode is found at: " + encapsulatedValueIndex);
}
String customDateConversionOrNullOnOtherType =
customDateConversionFromToNumberAndNestedToChar(encapsulated);
if (customDateConversionOrNullOnOtherType != null) {
// the child node contained our expected child element, so return the converted value
return customDateConversionOrNullOnOtherType;
}
// otherwise the child was something unexpected, signalled by null
// so simply fall-back to the default handler
}
}
// some other numeric function, default handling
return super.visitNumber_function(ctx);
}
private String customDateConversionFromToNumberAndNestedToChar(ParseTree parseTree) {
// ...
}
For anyone hitting the same issue, the way to go seems to be:
changing the grammar definition and introducing custom sub-types for
the encapsulated expression of the nested function.
Then, I it is possible to hook into the processing at precisely the desired location of the Parse tree.
Using a second custom ParseTreeVisitor that captures the values of function call and delegates back the processing of the rest of the sub-tree to the main, "outer" ParseTreeVisitor.
Once the second custom ParseTreeVisitor has finished visiting all the sub-ParseTrees I had the context information I required and all the sub-tree visited properly.
Let me simplify my case. I'm using Apache Beam 0.6.0. My final processed result is PCollection<KV<String, String>>. And I want to write values to different files corresponding to their keys.
For example, let's say the result consists of
(key1, value1)
(key2, value2)
(key1, value3)
(key1, value4)
Then I want to write value1, value3 and value4 to key1.txt, and write value4 to key2.txt.
And in my case:
Key set is determined when the pipeline is running, not when constructing the pipeline.
Key set may be quite small, but the number of values corresponding to each key may be very very large.
Any ideas?
Handily, I wrote a sample of this case just the other day.
This example is dataflow 1.x style
Basically you group by each key, and then you can do this with a custom transform that connects to cloud storage. Caveat being that your list of lines per-file shouldn't be massive (it's got to fit into memory on a single instance, but considering you can run high-mem instances, that limit is pretty high).
...
PCollection<KV<String, List<String>>> readyToWrite = groupedByFirstLetter
.apply(Combine.perKey(AccumulatorOfWords.getCombineFn()));
readyToWrite.apply(
new PTransformWriteToGCS("dataflow-experiment", TonyWordGrouper::derivePath));
...
And then the transform doing most of the work is:
public class PTransformWriteToGCS
extends PTransform<PCollection<KV<String, List<String>>>, PCollection<Void>> {
private static final Logger LOG = Logging.getLogger(PTransformWriteToGCS.class);
private static final Storage STORAGE = StorageOptions.getDefaultInstance().getService();
private final String bucketName;
private final SerializableFunction<String, String> pathCreator;
public PTransformWriteToGCS(final String bucketName,
final SerializableFunction<String, String> pathCreator) {
this.bucketName = bucketName;
this.pathCreator = pathCreator;
}
#Override
public PCollection<Void> apply(final PCollection<KV<String, List<String>>> input) {
return input
.apply(ParDo.of(new DoFn<KV<String, List<String>>, Void>() {
#Override
public void processElement(
final DoFn<KV<String, List<String>>, Void>.ProcessContext arg0)
throws Exception {
final String key = arg0.element().getKey();
final List<String> values = arg0.element().getValue();
final String toWrite = values.stream().collect(Collectors.joining("\n"));
final String path = pathCreator.apply(key);
BlobInfo blobInfo = BlobInfo.newBuilder(bucketName, path)
.setContentType(MimeTypes.TEXT)
.build();
LOG.info("blob writing to: {}", blobInfo);
Blob result = STORAGE.create(blobInfo,
toWrite.getBytes(StandardCharsets.UTF_8));
}
}));
}
}
Just write a loop in a ParDo function!
More details -
I had the same scenario today, the only thing is in my case key=image_label and value=image_tf_record. So like what you have asked, I am trying to create separate TFRecord files, one per class, each record file containing a number of images. HOWEVER not sure if there might be memory issues when a number of values per key are very high like your scenario:
(Also my code is in Python)
class WriteToSeparateTFRecordFiles(beam.DoFn):
def __init__(self, outdir):
self.outdir = outdir
def process(self, element):
l, image_list = element
writer = tf.python_io.TFRecordWriter(self.outdir + "/tfr" + str(l) + '.tfrecord')
for example in image_list:
writer.write(example.SerializeToString())
writer.close()
And then in your pipeline just after the stage where you get key-value pairs to add these two lines:
(p
| 'GroupByLabelId' >> beam.GroupByKey()
| 'SaveToMultipleFiles' >> beam.ParDo(WriteToSeparateTFRecordFiles(opt, p))
)
you can use FileIO.writeDinamic() for that
PCollection<KV<String,String>> readfile= (something you read..);
readfile.apply(FileIO. <String,KV<String,String >> writeDynamic()
.by(KV::getKey)
.withDestinationCoder(StringUtf8Coder.of())
.via(Contextful.fn(KV::getValue), TextIO.sink())
.to("somefolder")
.withNaming(key -> FileIO.Write.defaultNaming(key, ".txt")));
p.run();
In Apache Beam 2.2 Java SDK, this is natively supported in TextIO and AvroIO using respectively TextIO and AvroIO.write().to(DynamicDestinations). See e.g. this method.
Update (2018): Prefer to use FileIO.writeDynamic() together with TextIO.sink() and AvroIO.sink() instead.
Just write below lines in your ParDo class :
from apache_beam.io import filesystems
eventCSVFileWriter = filesystems.FileSystems.create(gcsFileName)
for record in list(Records):
eventCSVFileWriter.write(record)
If you want the full code I can help you with that too.
How can I use existing Flux operators to make a Flux return incoming values into multiple Lists with a minimum delay between returns?
This can be achieved with a non-trivial set of composed operators.
import java.time.Duration;
import java.util.*;
import reactor.core.publisher.*;
public class DelayedBuffer {
public static void main(String[] args) {
Flux.just(1, 2, 3, 6, 7, 10)
.flatMap(v -> Mono.delayMillis(v * 1000)
.doOnNext(w -> System.out.println("T=" + v))
.map(w -> v)
)
.compose(f -> delayedBufferAfterFirst(f, Duration.ofSeconds(2)))
.doOnNext(System.out::println)
.blockLast();
}
public static <T> Flux<List<T>> delayedBufferAfterFirst(Flux<T> source, Duration d) {
return source
.publish(f -> {
return f.take(1).collectList()
.concatWith(f.buffer(d).take(1))
.repeatWhen(r -> r.takeUntilOther(f.ignoreElements()));
});
}
}
(Note however, that the expected emission pattern may be better matched with a custom operator due to time being involved.)
I thought buffer(Duration) would fit your need, but it doesn't.
edit: leaving this in case someone with your exact same need is tempted to use that operator. This variant of buffer splits the sequence into consecutive time windows (that each produce a buffer). That is, the new delay starts at the end of the previous one, not whenever a new out-of-delay element is emitted.
I have many large unpartitioned BigQuery tables and files that I would like to partition in various ways. So I decided to try and write a Dataflow job to achieve this. The job I think is simple enough. I tried to write with generics so that I easily apply it both TextIO and BigQueryIO sources. It works fine with small tables, but I keep getting java.lang.OutOfMemoryError: Java heap space when I run it on large tables.
In my main class I either read a file with target keys (made with another DF job) or run a query against a BigQuery table to get a list of keys to shard by. My main class looks like this:
Pipeline sharder = Pipeline.create(opts);
// a functional interface that shows the tag map how to get a tuple tag
KeySelector<String, TableRow> bqSelector = (TableRow row) -> (String) row.get("COLUMN") != null ? (String) row.get("COLUMN") : "null";
// a utility class to store a tuple tag list and hash map of String TupleTag
TupleTagMap<String, TableRow> bqTags = new TupleTagMap<>(new ArrayList<>(inputKeys),bqSelector);
// custom transorm
ShardedTransform<String, TableRow> bqShard = new ShardedTransform<String, TableRow>(bqTags, TableRowJsonCoder.of());
String source = "PROJECTID:ADATASET.A_BIG_TABLE";
String destBase = "projectid:dataset.a_big_table_sharded_";
TableSchema schema = bq.tables().get("PROJECTID","ADATASET","A_BIG_TABLE").execute().getSchema();
PCollectionList<TableRow> shards = sharder.apply(BigQueryIO.Read.from(source)).apply(bqShard);
for (PCollection<TableRow> shard : shards.getAll()) {
String shardName = StringUtils.isNotEmpty(shard.getName()) ? shard.getName() : "NULL";
shard.apply(BigQueryIO.Write.to(destBase + shardName)
.withWriteDisposition(WriteDisposition.WRITE_TRUNCATE)
.withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
.withSchema(schema));
System.out.println(destBase+shardName);
}
sharder.run();
I generate a set of TupleTags to use in a custom transform. I created a utility class that stores a TupleTagList and HashMap so that I can reference the tuple tags by key:
public class TupleTagMap<Key, Type> implements Serializable {
private static final long serialVersionUID = -8762959703864266959L;
final private TupleTagList tList;
final private Map<Key, TupleTag<Type>> map;
final private KeySelector<Key, Type> selector;
public TupleTagMap(List<Key> t, KeySelector<Key, Type> selector) {
map = new HashMap<>();
for (Key key : t)
map.put(key, new TupleTag<Type>());
this.tList = TupleTagList.of(new ArrayList<>(map.values()));
this.selector = selector;
}
public Map<Key, TupleTag<Type>> getMap() {
return map;
}
public TupleTagList getTagList() {
return tList;
}
public TupleTag<Type> getTag(Type t){
return map.get(selector.getKey(t));
}
Then I have this custom transform that basically has a function that uses the tuple map to output PCollectionTuple and then moves it to a PCollectionList to return to the main class:
public class ShardedTransform<Key, Type> extends
PTransform<PCollection<Type>, PCollectionList<Type>> {
private static final long serialVersionUID = 3320626732803297323L;
private final TupleTagMap<Key, Type> tags;
private final Coder<Type> coder;
public ShardedTransform(TupleTagMap<Key, Type> tags, Coder<Type> coder) {
this.tags = tags;
this.coder = coder;
}
#Override
public PCollectionList<Type> apply(PCollection<Type> in) {
PCollectionTuple shards = in.apply(ParDo.of(
new ShardFn<Key, Type>(tags)).withOutputTags(
new TupleTag<Type>(), tags.getTagList()));
List<PCollection<Type>> shardList = new ArrayList<>(tags.getMap().size());
for (Entry<Key, TupleTag<Type>> e : tags.getMap().entrySet()){
PCollection<Type> shard = shards.get(e.getValue()).setName(e.getKey().toString()).setCoder(coder);
shardList.add(shard);
}
return PCollectionList.of(shardList);
}
}
The actual DoFn is dead simple it just uses the lambda provided in the main class do find the matching tuple tag in the hash map for side output:
public class ShardFn<Key, Type> extends DoFn<Type, Type> {
private static final long serialVersionUID = 961325260858465105L;
private final TupleTagMap<Key, Type> tags;
ShardFn(TupleTagMap<Key, Type> tags) {
this.tags = tags;
}
#Override
public void processElement(DoFn<Type, Type>.ProcessContext c)
throws Exception {
Type element = c.element();
TupleTag<Type> tag = tags.getTag(element);
if (tag != null)
c.sideOutput(tags.getTag(element), element);
}
}
The Beam model doesn't have good support for dynamic partitioning / large numbers of partitions right now. Your approach chooses the number of shards at graph construction time, and then the resulting ParDos likely all fuses together, so you've got each worker trying to write to 80 different BQ tables at the same time. Each write requires some local buffering, so it's probably just too much.
There's an alternate approach which will do the parallelization across tables (but not across elements). This would work well if you have a large number of relatively small output tables. Use a ParDo to tag each element with the table it should go to and then do a GroupByKey. This gives you a PCollection<KV<Table, Iterable<ElementsForThatTable>>>. Then process each KV<Table, Iterable<ElementsForThatTable>> by writing the elements to the table.
Unfortunately for now you'll have to the BQ write by hand to use this option. We're looking at extending the Sink APIs with built in support for this. And since the Dataflow SDK is being further developed as part of Apache Beam, we're tracking that request here: https://issues.apache.org/jira/browse/BEAM-92