How do I write to multiple files in Apache Beam?

How do I write to multiple files in Apache Beam? - google-cloud-dataflow

Let me simplify my case. I'm using Apache Beam 0.6.0. My final processed result is PCollection<KV<String, String>>. And I want to write values to different files corresponding to their keys.
For example, let's say the result consists of
(key1, value1)
(key2, value2)
(key1, value3)
(key1, value4)
Then I want to write value1, value3 and value4 to key1.txt, and write value4 to key2.txt.
And in my case:
Key set is determined when the pipeline is running, not when constructing the pipeline.
Key set may be quite small, but the number of values corresponding to each key may be very very large.
Any ideas?

Handily, I wrote a sample of this case just the other day.
This example is dataflow 1.x style
Basically you group by each key, and then you can do this with a custom transform that connects to cloud storage. Caveat being that your list of lines per-file shouldn't be massive (it's got to fit into memory on a single instance, but considering you can run high-mem instances, that limit is pretty high).
...
PCollection<KV<String, List<String>>> readyToWrite = groupedByFirstLetter
.apply(Combine.perKey(AccumulatorOfWords.getCombineFn()));
readyToWrite.apply(
new PTransformWriteToGCS("dataflow-experiment", TonyWordGrouper::derivePath));
...
And then the transform doing most of the work is:
public class PTransformWriteToGCS
extends PTransform<PCollection<KV<String, List<String>>>, PCollection<Void>> {
private static final Logger LOG = Logging.getLogger(PTransformWriteToGCS.class);
private static final Storage STORAGE = StorageOptions.getDefaultInstance().getService();
private final String bucketName;
private final SerializableFunction<String, String> pathCreator;
public PTransformWriteToGCS(final String bucketName,
final SerializableFunction<String, String> pathCreator) {
this.bucketName = bucketName;
this.pathCreator = pathCreator;
}
#Override
public PCollection<Void> apply(final PCollection<KV<String, List<String>>> input) {
return input
.apply(ParDo.of(new DoFn<KV<String, List<String>>, Void>() {
#Override
public void processElement(
final DoFn<KV<String, List<String>>, Void>.ProcessContext arg0)
throws Exception {
final String key = arg0.element().getKey();
final List<String> values = arg0.element().getValue();
final String toWrite = values.stream().collect(Collectors.joining("\n"));
final String path = pathCreator.apply(key);
BlobInfo blobInfo = BlobInfo.newBuilder(bucketName, path)
.setContentType(MimeTypes.TEXT)
.build();
LOG.info("blob writing to: {}", blobInfo);
Blob result = STORAGE.create(blobInfo,
toWrite.getBytes(StandardCharsets.UTF_8));
}
}));
}
}

Just write a loop in a ParDo function!
More details -
I had the same scenario today, the only thing is in my case key=image_label and value=image_tf_record. So like what you have asked, I am trying to create separate TFRecord files, one per class, each record file containing a number of images. HOWEVER not sure if there might be memory issues when a number of values per key are very high like your scenario:
(Also my code is in Python)
class WriteToSeparateTFRecordFiles(beam.DoFn):
def __init__(self, outdir):
self.outdir = outdir
def process(self, element):
l, image_list = element
writer = tf.python_io.TFRecordWriter(self.outdir + "/tfr" + str(l) + '.tfrecord')
for example in image_list:
writer.write(example.SerializeToString())
writer.close()
And then in your pipeline just after the stage where you get key-value pairs to add these two lines:
(p
| 'GroupByLabelId' >> beam.GroupByKey()
| 'SaveToMultipleFiles' >> beam.ParDo(WriteToSeparateTFRecordFiles(opt, p))
)

you can use FileIO.writeDinamic() for that
PCollection<KV<String,String>> readfile= (something you read..);
readfile.apply(FileIO. <String,KV<String,String >> writeDynamic()
.by(KV::getKey)
.withDestinationCoder(StringUtf8Coder.of())
.via(Contextful.fn(KV::getValue), TextIO.sink())
.to("somefolder")
.withNaming(key -> FileIO.Write.defaultNaming(key, ".txt")));
p.run();

In Apache Beam 2.2 Java SDK, this is natively supported in TextIO and AvroIO using respectively TextIO and AvroIO.write().to(DynamicDestinations). See e.g. this method.
Update (2018): Prefer to use FileIO.writeDynamic() together with TextIO.sink() and AvroIO.sink() instead.

Just write below lines in your ParDo class :
from apache_beam.io import filesystems
eventCSVFileWriter = filesystems.FileSystems.create(gcsFileName)
for record in list(Records):
eventCSVFileWriter.write(record)
If you want the full code I can help you with that too.

Related

Apache Beam Stateful DoFn Periodically Output All K/V Pairs

I'm trying to aggregate (per key) a streaming data source in Apache Beam (via Scio) using a stateful DoFn (using #ProcessElement with #StateId ValueState elements). I thought this would be most appropriate for the problem I'm trying to solve. The requirements are:
for a given key, records are aggregated (essentially summed) across all time - I don't care about previously computed aggregates, just the most recent
keys may be evicted from the state (state.clear()) based on certain conditions that I control
Every 5 minutes, regardless if any new keys were seen, all keys that haven't been evicted from the state should be outputted
Given that this is a streaming pipeline and will be running indefinitely, using a combinePerKey over a global window with accumulating fired panes seems like it will continue to increase its memory footprint and the amount of data it needs to run over time, so I'd like to avoid it. Additionally, when testing this out, (maybe as expected) it simply appends the newly computed aggregates to the output along with the historical input, rather than using the latest value for each key.
My thought was that using a StatefulDoFn would simply allow me to output all of the global state up until now(), but it seems this isn't a trivial solution. I've seen hintings at using timers to artificially execute callbacks for this, as well as potentially using a slowly growing side input map (How to solve Duplicate values exception when I create PCollectionView<Map<String,String>>) and somehow flushing this, but this would essentially require iterating over all values in the map rather than joining on it.
I feel like I might be overlooking something simple to get this working. I'm relatively new to many concepts of windowing and timers in Beam, looking for any advice on how to solve this. Thanks!

You are right that Stateful DoFn should help you here. This is a basic sketch of what you can do. Note that this only outputs the sum without the key. It may not be exactly what you want, but it should help you move forward.
class CombiningEmittingFn extends DoFn<KV<Integer, Integer>, Integer> {
#TimerId("emitter")
private final TimerSpec emitterSpec = TimerSpecs.timer(TimeDomain.PROCESSING_TIME);
#StateId("done")
private final StateSpec<ValueState<Boolean>> doneState = StateSpecs.value();
#StateId("agg")
private final StateSpec<CombiningState<Integer, int[], Integer>>
aggSpec = StateSpecs.combining(
Sum.ofIntegers().getAccumulatorCoder(null, VarIntCoder.of()), Sum.ofIntegers());
#ProcessElement
public void processElement(ProcessContext c,
#StateId("agg") CombiningState<Integer, int[], Integer> aggState,
#StateId("done") ValueState<Boolean> doneState,
#TimerId("emitter") Timer emitterTimer) throws Exception {
if (SOME CONDITION) {
countValueState.clear();
doneState.write(true);
} else {
countValueState.addAccum(c.element().getValue());
emitterTimer.align(Duration.standardMinutes(5)).setRelative();
}
}
}
#OnTimer("emitter")
public void onEmit(
OnTimerContext context,
#StateId("agg") CombiningState<Integer, int[], Integer> aggState,
#StateId("done") ValueState<Boolean> doneState,
#TimerId("emitter") Timer emitterTimer) {
Boolean isDone = doneState.read();
if (isDone != null && isDone) {
return;
} else {
context.output(aggState.getAccum());
// Set the timer to emit again
emitterTimer.align(Duration.standardMinutes(5)).setRelative();
}
}
}
}
Happy to iterate with you on something that'll work.

#Pablo was indeed correct that a StatefulDoFn and timers are useful in this scenario. Here is the with code I was able to get working.
Stateful Do Fn
// DomainState is a custom case class I'm using
type DoFnT = DoFn[KV[String, DomainState], KV[String, DomainState]]
class StatefulDoFn extends DoFnT {
#StateId("key")
private val keySpec = StateSpecs.value[String]()
#StateId("domainState")
private val domainStateSpec = StateSpecs.value[DomainState]()
#TimerId("loopingTimer")
private val loopingTimer: TimerSpec = TimerSpecs.timer(TimeDomain.EVENT_TIME)
#ProcessElement
def process(
context: DoFnT#ProcessContext,
#StateId("key") stateKey: ValueState[String],
#StateId("domainState") stateValue: ValueState[DomainState],
#TimerId("loopingTimer") loopingTimer: Timer): Unit = {
... logic to create key/value from potentially null values
if (keepState(value)) {
loopingTimer.align(Duration.standardMinutes(5)).setRelative()
stateKey.write(key)
stateValue.write(value)
if (flushState(value)) {
context.output(KV.of(key, value))
}
} else {
stateValue.clear()
}
}
#OnTimer("loopingTimer")
def onLoopingTimer(
context: DoFnT#OnTimerContext,
#StateId("key") stateKey: ValueState[String],
#StateId("domainState") stateValue: ValueState[DomainState],
#TimerId("loopingTimer") loopingTimer: Timer): Unit = {
... logic to create key/value checking for nulls
if (keepState(value)) {
loopingTimer.align(Duration.standardMinutes(5)).setRelative()
if (flushState(value)) {
context.output(KV.of(key, value))
}
}
}
}
With pipeline
sc
.pubsubSubscription(...)
.keyBy(...)
.withGlobalWindow()
.applyPerKeyDoFn(new StatefulDoFn())
.withFixedWindows(
duration = Duration.standardMinutes(5),
options = WindowOptions(
accumulationMode = DISCARDING_FIRED_PANES,
trigger = AfterWatermark.pastEndOfWindow(),
allowedLateness = Duration.ZERO,
// Only take the latest per key during a window
timestampCombiner = TimestampCombiner.END_OF_WINDOW
))
.reduceByKey(mostRecentEvent())
.saveAsCustomOutput(TextIO.write()...)

Right way to create parallel For loop in Google DataFlow pipeline

I have a simple DataFlow java job that reads a few lines from a .csv file. Each line contains a numeric cell, which represents how many steps a certain function has to be performed on that line.
I don't want to perform that using a traditional For loop within the function, in case these numbers become very large. What is the right way to do this using the parallel-friendly DataFlow methodology?
Here's the current Java code:
public class SimpleJob{
static class MyDoFn extends DoFn<String, Integer> {
public void processElement(ProcessContext c) {
String name = c.element().split("\\,")[0];
int val = Integer.valueOf(c.element().split("\\,")[1]);
for (int i = 0; i < val; i++) // <- what's the preferred way to do this in DF?
System.out.println("Processing some function: " + name); // <- do something
c.output(val);
}
}
public static void main() {
DataflowPipelineOptions options = PipelineOptionsFactory
.as(DataflowPipelineOptions.class);
options.setProject(DEF.ID_PROJ);
options.setStagingLocation(DEF.ID_STG_LOC);
options.setRunner(DirectPipelineRunner.class);
Pipeline pipeline = Pipeline.create(options);
pipeline.apply(TextIO.Read.from("Source.csv"))
.apply(ParDo.of(new MyDoFn()));
pipeline.run();
}
}
This is what the "source.csv" looks like (so each number represents how many times I want to run a parallel function on that line):
Joe,3
Mary,4
Peter,2

Curiously enough, this is one of the motivating use cases for Splittable DoFn! That API is currently in heavy development.
However, until that API is available, you can basically mimic most of what it would have done for you:
class ElementAndRepeats { String element; int numRepeats; }
PCollection<String> lines = p.apply(TextIO.Read....)
PCollection<ElementAndRepeats> elementAndNumRepeats = lines.apply(
ParDo.of(...parse number of repetitions from the line...));
PCollection<ElementAndRepeats> elementAndNumSubRepeats = elementAndNumRepeats
.apply(ParDo.of(
...split large numbers of repetitions into smaller numbers...))
.apply(...fusion break...);
elementAndNumSubRepeats.apply(ParDo.of(...execute the repetitions...))
where:
"split large numbers of repetitions" is a DoFn that, e.g., splits an ElementAndRepeats{"foo", 34} into {ElementAndRepeats{"foo", 10}, ElementAndRepeats{"foo", 10}, ElementAndRepeats{"foo", 10}, ElementAndRepeats{"foo", 4}}
fusion break - see here, to prevent the several ParDo's from being fused together, defeating the parallelization

Google Dataflow Out Of Heap When Creating Multiple Tagged Outputs

I have many large unpartitioned BigQuery tables and files that I would like to partition in various ways. So I decided to try and write a Dataflow job to achieve this. The job I think is simple enough. I tried to write with generics so that I easily apply it both TextIO and BigQueryIO sources. It works fine with small tables, but I keep getting java.lang.OutOfMemoryError: Java heap space when I run it on large tables.
In my main class I either read a file with target keys (made with another DF job) or run a query against a BigQuery table to get a list of keys to shard by. My main class looks like this:
Pipeline sharder = Pipeline.create(opts);
// a functional interface that shows the tag map how to get a tuple tag
KeySelector<String, TableRow> bqSelector = (TableRow row) -> (String) row.get("COLUMN") != null ? (String) row.get("COLUMN") : "null";
// a utility class to store a tuple tag list and hash map of String TupleTag
TupleTagMap<String, TableRow> bqTags = new TupleTagMap<>(new ArrayList<>(inputKeys),bqSelector);
// custom transorm
ShardedTransform<String, TableRow> bqShard = new ShardedTransform<String, TableRow>(bqTags, TableRowJsonCoder.of());
String source = "PROJECTID:ADATASET.A_BIG_TABLE";
String destBase = "projectid:dataset.a_big_table_sharded_";
TableSchema schema = bq.tables().get("PROJECTID","ADATASET","A_BIG_TABLE").execute().getSchema();
PCollectionList<TableRow> shards = sharder.apply(BigQueryIO.Read.from(source)).apply(bqShard);
for (PCollection<TableRow> shard : shards.getAll()) {
String shardName = StringUtils.isNotEmpty(shard.getName()) ? shard.getName() : "NULL";
shard.apply(BigQueryIO.Write.to(destBase + shardName)
.withWriteDisposition(WriteDisposition.WRITE_TRUNCATE)
.withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
.withSchema(schema));
System.out.println(destBase+shardName);
}
sharder.run();
I generate a set of TupleTags to use in a custom transform. I created a utility class that stores a TupleTagList and HashMap so that I can reference the tuple tags by key:
public class TupleTagMap<Key, Type> implements Serializable {
private static final long serialVersionUID = -8762959703864266959L;
final private TupleTagList tList;
final private Map<Key, TupleTag<Type>> map;
final private KeySelector<Key, Type> selector;
public TupleTagMap(List<Key> t, KeySelector<Key, Type> selector) {
map = new HashMap<>();
for (Key key : t)
map.put(key, new TupleTag<Type>());
this.tList = TupleTagList.of(new ArrayList<>(map.values()));
this.selector = selector;
}
public Map<Key, TupleTag<Type>> getMap() {
return map;
}
public TupleTagList getTagList() {
return tList;
}
public TupleTag<Type> getTag(Type t){
return map.get(selector.getKey(t));
}
Then I have this custom transform that basically has a function that uses the tuple map to output PCollectionTuple and then moves it to a PCollectionList to return to the main class:
public class ShardedTransform<Key, Type> extends
PTransform<PCollection<Type>, PCollectionList<Type>> {
private static final long serialVersionUID = 3320626732803297323L;
private final TupleTagMap<Key, Type> tags;
private final Coder<Type> coder;
public ShardedTransform(TupleTagMap<Key, Type> tags, Coder<Type> coder) {
this.tags = tags;
this.coder = coder;
}
#Override
public PCollectionList<Type> apply(PCollection<Type> in) {
PCollectionTuple shards = in.apply(ParDo.of(
new ShardFn<Key, Type>(tags)).withOutputTags(
new TupleTag<Type>(), tags.getTagList()));
List<PCollection<Type>> shardList = new ArrayList<>(tags.getMap().size());
for (Entry<Key, TupleTag<Type>> e : tags.getMap().entrySet()){
PCollection<Type> shard = shards.get(e.getValue()).setName(e.getKey().toString()).setCoder(coder);
shardList.add(shard);
}
return PCollectionList.of(shardList);
}
}
The actual DoFn is dead simple it just uses the lambda provided in the main class do find the matching tuple tag in the hash map for side output:
public class ShardFn<Key, Type> extends DoFn<Type, Type> {
private static final long serialVersionUID = 961325260858465105L;
private final TupleTagMap<Key, Type> tags;
ShardFn(TupleTagMap<Key, Type> tags) {
this.tags = tags;
}
#Override
public void processElement(DoFn<Type, Type>.ProcessContext c)
throws Exception {
Type element = c.element();
TupleTag<Type> tag = tags.getTag(element);
if (tag != null)
c.sideOutput(tags.getTag(element), element);
}
}

The Beam model doesn't have good support for dynamic partitioning / large numbers of partitions right now. Your approach chooses the number of shards at graph construction time, and then the resulting ParDos likely all fuses together, so you've got each worker trying to write to 80 different BQ tables at the same time. Each write requires some local buffering, so it's probably just too much.
There's an alternate approach which will do the parallelization across tables (but not across elements). This would work well if you have a large number of relatively small output tables. Use a ParDo to tag each element with the table it should go to and then do a GroupByKey. This gives you a PCollection<KV<Table, Iterable<ElementsForThatTable>>>. Then process each KV<Table, Iterable<ElementsForThatTable>> by writing the elements to the table.
Unfortunately for now you'll have to the BQ write by hand to use this option. We're looking at extending the Sink APIs with built in support for this. And since the Dataflow SDK is being further developed as part of Apache Beam, we're tracking that request here: https://issues.apache.org/jira/browse/BEAM-92

Test that either one thing holds or another in AssertJ

I am in the process of converting some tests from Hamcrest to AssertJ. In Hamcrest I use the following snippet:
assertThat(list, either(contains(Tags.SWEETS, Tags.HIGH))
.or(contains(Tags.SOUPS, Tags.RED)));
That is, the list may be either that or that. How can I express this in AssertJ? The anyOf function (of course, any is something else than either, but that would be a second question) takes a Condition; I have implemented that myself, but it feels as if this should be a common case.

Edited:
Since 3.12.0 AssertJ provides satisfiesAnyOf which succeeds is one of the given assertion succeeds,
assertThat(list).satisfiesAnyOf(
listParam -> assertThat(listParam).contains(Tags.SWEETS, Tags.HIGH),
listParam -> assertThat(listParam).contains(Tags.SOUPS, Tags.RED)
);
Original answer:
No, this is an area where Hamcrest is better than AssertJ.
To write the following assertion:
Set<String> goodTags = newLinkedHashSet("Fine", "Good");
Set<String> badTags = newLinkedHashSet("Bad!", "Awful");
Set<String> tags = newLinkedHashSet("Fine", "Good", "Ok", "?");
// contains is statically imported from ContainsCondition
// anyOf succeeds if one of the conditions is met (logical 'or')
assertThat(tags).has(anyOf(contains(goodTags), contains(badTags)));
you need to create this Condition:
import static org.assertj.core.util.Lists.newArrayList;
import java.util.Collection;
import org.assertj.core.api.Condition;
public class ContainsCondition extends Condition<Iterable<String>> {
private Collection<String> collection;
public ContainsCondition(Iterable<String> values) {
super("contains " + values);
this.collection = newArrayList(values);
}
static ContainsCondition contains(Collection<String> set) {
return new ContainsCondition(set);
}
#Override
public boolean matches(Iterable<String> actual) {
Collection<String> values = newArrayList(actual);
for (String string : collection) {
if (!values.contains(string)) return false;
}
return true;
};
}
It might not be what you if you expect that the presence of your tags in one collection implies they are not in the other one.

Inspired by this thread, you might want to use this little repo I put together, that adapts the Hamcrest Matcher API into AssertJ's Condition API. Also includes a handy-dandy conversion shell script.

How to parse using Grok from Java.. Is there any example available.?

I have seen Grok being very strong and lethal in parsing the log data. I wanted to use Grok for log parsing in our application, which is in java.. How can i connect/work with Grok from Java.?

Try downloading java-grok from GitHub: https://github.com/NFLabs/java-grok
You can test patterns using the Grok Debugger: http://grokdebug.herokuapp.com/

Check out this Java library
https://github.com/aicer/grok
You can include it in your project as a maven dependency
<dependency>
<groupId>org.aicer.grok</groupId>
<artifactId>grok</artifactId>
<version>0.9.0</version>
</dependency>
It comes with pre-defined patterns and you can also add yours.
The named patterns are extracted and the results are available in a map with the groups names as the keys and the retrieved values are mapped to these keys.
final GrokDictionary dictionary = new GrokDictionary();
// Load the built-in dictionaries
dictionary.addBuiltInDictionaries();
// Add custom pattern
dictionary.addDictionary(new File(patternDirectoryOrFilePath));
// Resolve all expressions loaded
dictionary.bind();
This next examples adds string patterns directly to the dictionary without using a file
final GrokDictionary dictionary = new GrokDictionary();
// Load the built-in dictionaries
dictionary.addBuiltInDictionaries();
// Add custom pattern directly
dictionary.addDictionary(new StringReader("DOMAINTLD [a-zA-Z]+"));
dictionary.addDictionary(new StringReader("EMAIL %{NOTSPACE}#%{WORD}\.%{DOMAINTLD}"));
// Resolve all expressions loaded
dictionary.bind();
Here is a complete example of how to use the library
public final class GrokStage {
private static final void displayResults(final Map<String, String> results) {
if (results != null) {
for(Map.Entry<String, String> entry : results.entrySet()) {
System.out.println(entry.getKey() + "=" + entry.getValue());
}
}
}
public static void main(String[] args) {
final String rawDataLine1 = "1234567 - israel.ekpo#massivelogdata.net cc55ZZ35 1789 Hello Grok";
final String rawDataLine2 = "98AA541 - israel-ekpo#israelekpo.com mmddgg22 8800 Hello Grok";
final String rawDataLine3 = "55BB778 - ekpo.israel#example.net secret123 4439 Valid Data Stream";
final String expression = "%{EMAIL:username} %{USERNAME:password} %{INT:yearOfBirth}";
final GrokDictionary dictionary = new GrokDictionary();
// Load the built-in dictionaries
dictionary.addBuiltInDictionaries();
// Resolve all expressions loaded
dictionary.bind();
// Take a look at how many expressions have been loaded
System.out.println("Dictionary Size: " + dictionary.getDictionarySize());
Grok compiledPattern = dictionary.compileExpression(expression);
displayResults(compiledPattern.extractNamedGroups(rawDataLine1));
displayResults(compiledPattern.extractNamedGroups(rawDataLine2));
displayResults(compiledPattern.extractNamedGroups(rawDataLine3));
}
}

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart

How do I write to multiple files in Apache Beam? - google-cloud-dataflow

In Apache Beam 2.2 Java SDK, this is natively supported in TextIO and AvroIO using respectively TextIO and AvroIO.write().to(DynamicDestinations). See e.g. this method. Update (2018): Prefer to use FileIO.writeDynamic() together with TextIO.sink() and AvroIO.sink() instead.

Just write below lines in your ParDo class : from apache_beam.io import filesystems eventCSVFileWriter = filesystems.FileSystems.create(gcsFileName) for record in list(Records): eventCSVFileWriter.write(record) If you want the full code I can help you with that too.

Related

Apache Beam Stateful DoFn Periodically Output All K/V Pairs

Right way to create parallel For loop in Google DataFlow pipeline

Google Dataflow Out Of Heap When Creating Multiple Tagged Outputs

Test that either one thing holds or another in AssertJ

How to parse using Grok from Java.. Is there any example available.?

Categories

Resources