I have an issue when I'm trying to set an AvroParquetWriter in a RollingSink;
the sink path and the writer path seem to be in conflict.
flink version : 1.1.3
parquet-avro version : 1.8.1
error :
[...]
12/14/2016 11:19:34 Source: Custom Source -> Sink: Unnamed(8/8) switched to CANCELED
INFO JobManager - Status of job af0880ede809e0d699eb69eb385ca204 (Flink Streaming Job) changed to FAILED.
java.lang.RuntimeException: Could not forward element to next operator
at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:376)
at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:358)
at org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:346)
at org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:329)
at org.apache.flink.streaming.api.operators.StreamSource$NonTimestampContext.collect(StreamSource.java:161)
at org.apache.flink.streaming.connectors.kafka.internals.AbstractFetcher.emitRecord(AbstractFetcher.java:225)
at org.apache.flink.streaming.connectors.kafka.internal.Kafka09Fetcher.run(Kafka09Fetcher.java:253)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: File already exists: /home/user/data/file
at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:264)
at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:257)
at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSOutputSummer.<init>(ChecksumFileSystem.java:386)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:447)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:426)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:906)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:887)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:784)
at org.apache.parquet.hadoop.ParquetFileWriter.<init>(ParquetFileWriter.java:223)
at org.apache.parquet.hadoop.ParquetWriter.<init>(ParquetWriter.java:266)
at org.apache.parquet.hadoop.ParquetWriter.<init>(ParquetWriter.java:217)
at org.apache.parquet.hadoop.ParquetWriter.<init>(ParquetWriter.java:183)
at org.apache.parquet.hadoop.ParquetWriter.<init>(ParquetWriter.java:153)
at org.apache.parquet.hadoop.ParquetWriter.<init>(ParquetWriter.java:119)
at org.apache.parquet.hadoop.ParquetWriter.<init>(ParquetWriter.java:92)
at org.apache.parquet.hadoop.ParquetWriter.<init>(ParquetWriter.java:66)
at org.apache.parquet.avro.AvroParquetWriter.<init>(AvroParquetWriter.java:54)
at fr.test.SpecificParquetWriter.open(SpecificParquetWriter.java:28) // line in code => writer = new AvroParquetWriter(new Path("/home/user/data/file"), schema, compressionCodecName, blockSize, pageSize);
at org.apache.flink.streaming.connectors.fs.RollingSink.openNewPartFile(RollingSink.java:451)
at org.apache.flink.streaming.connectors.fs.RollingSink.invoke(RollingSink.java:371)
at org.apache.flink.streaming.api.operators.StreamSink.processElement(StreamSink.java:39)
at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:373)
... 7 more
INFO JobClientActor - 12/14/2016 11:19:34 Job execution switched to status FAILED.
12/14/2016 11:19:34 Job execution switched to status FAILED.
INFO JobClientActor - Terminate JobClientActor.
[...]
main :
RollingSink sink = new RollingSink<String>("/home/user/data");
sink.setBucketer(new DateTimeBucketer("yyyy/MM/dd"));
sink.setWriter(new SpecificParquetWriter());
stream.addSink(sink);
SpecificParquetWriter :
public class SpecificParquetWriter<V> extends StreamWriterBase<V> {
private transient AvroParquetWriter writer;
private CompressionCodecName compressionCodecName = CompressionCodecName.SNAPPY;
private int blockSize = ParquetWriter.DEFAULT_BLOCK_SIZE;
private int pageSize = ParquetWriter.DEFAULT_PAGE_SIZE;
public static final String USER_SCHEMA = "{"
+ "\"type\":\"record\","
+ "\"name\":\"myrecord\","
+ "\"fields\":["
+ " { \"name\":\"str1\", \"type\":\"string\" },"
+ " { \"name\":\"str2\", \"type\":\"string\" },"
+ " { \"name\":\"int1\", \"type\":\"int\" }"
+ "]}";
public SpecificParquetWriter(){
}
@Override
// workaround
public void open(FileSystem fs, Path path) throws IOException {
super.open(fs, path);
Schema schema = new Schema.Parser().parse(USER_SCHEMA);
writer = new AvroParquetWriter(new Path("/home/user/data/file"), schema, compressionCodecName, blockSize, pageSize);
}
@Override
public void write(Object element) throws IOException {
if(writer != null)
writer.write(element);
}
@Override
public Writer duplicate() {
return new SpecificParquetWriter();
}
}
I don't know if I'm doing it the right way...
Is there a simple way to do this?
The problem is with the base class, i.e. Writer in the case of RollingSink or StreamWriterBase in the case of BucketingSink: they only accept writers that write to an OutputStream handed to them by the sink, rather than writers that manage the output files on their own.
writer = new AvroKeyValueWriter<K, V>(keySchema, valueSchema, compressionCodec, streamObject);
whereas AvroParquetWriter or ParquetWriter accepts a file path:
writer = AvroParquetWriter.<V>builder(new Path("filePath"))
.withCompressionCodec(CompressionCodecName.SNAPPY)
.withSchema(schema).build();
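For contrast, here is a minimal sketch of a writer that does fit the RollingSink model, because it only writes into the stream the sink opens for the current part file (this is essentially what the built-in StringWriter does). It assumes the Flink 1.1 flink-connector-filesystem API, where StreamWriterBase exposes that stream via getStream(); the class name LineWriter is just illustrative:
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import org.apache.flink.streaming.connectors.fs.StreamWriterBase;
import org.apache.flink.streaming.connectors.fs.Writer;
import org.apache.hadoop.fs.FSDataOutputStream;
public class LineWriter<T> extends StreamWriterBase<T> {
    @Override
    public void write(T element) throws IOException {
        // Write to the stream RollingSink opened for the current part file,
        // instead of opening a file of our own like AvroParquetWriter does.
        FSDataOutputStream out = getStream();
        out.write(element.toString().getBytes(StandardCharsets.UTF_8));
        out.write('\n');
    }
    @Override
    public Writer<T> duplicate() {
        return new LineWriter<T>();
    }
}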
I dug deeper into ParquetWriter and realized that what we are trying to do does not really make sense: Flink, being an event-processing system like Storm, can't write a single record at a time to a Parquet file, whereas Spark Streaming can because it works on the micro-batch principle.
Using Storm with Trident we can still write Parquet files, but with Flink we can't until Flink introduces something like micro-batches.
So, for this type of use case, Spark Streaming is a better choice.
Or go for batch processing if you want to use Flink.
Related
I have a simple SCDF stream that looks like this:
http --port=12346 | mvmn-transform | file --name=tmp.txt --directory=/tmp
The mvmn-transform is a simple custom transformer that looks like this:
@SpringBootApplication
@EnableBinding(Processor.class)
@EnableConfigurationProperties(ScdfTestTransformerProperties.class)
@Configuration
public class ScdfTestTransformer {
public static void main(String args[]) {
SpringApplication.run(ScdfTestTransformer.class, args);
}
@Autowired
protected ScdfTestTransformerProperties config;
@Transformer(inputChannel = Processor.INPUT, outputChannel = Processor.OUTPUT)
public Object transform(Message<?> message) {
Object payload = message.getPayload();
Map<String, Object> result = new HashMap<>();
Map<String, String> headersStr = new HashMap<>();
message.getHeaders().forEach((k, v) -> headersStr.put(k, v != null ? v.toString() : null));
result.put("headers", headersStr);
result.put("payload", payload);
result.put("configProp", config.getSomeConfigProp());
return result;
}
// See https://stackoverflow.com/questions/59155689/could-not-decode-json-type-for-key-file-name-in-a-spring-cloud-data-flow-stream
#Bean("kafkaBinderHeaderMapper")
public KafkaHeaderMapper kafkaBinderHeaderMapper() {
BinderHeaderMapper mapper = new BinderHeaderMapper();
mapper.setEncodeStrings(true);
return mapper;
}
}
This works fine.
But I've read that Spring Cloud Function should allow me to implement such apps without the need to specify the binding and transformer annotations, so I've changed it to this:
@SpringBootApplication
// @EnableBinding(Processor.class)
@EnableConfigurationProperties(ScdfTestTransformerProperties.class)
@Configuration
public class ScdfTestTransformer {
public static void main(String args[]) {
SpringApplication.run(ScdfTestTransformer.class, args);
}
@Autowired
protected ScdfTestTransformerProperties config;
// @Transformer(inputChannel = Processor.INPUT, outputChannel = Processor.OUTPUT)
@Bean
public Function<Message<?>, Map<String, Object>> transform(
// Message<?> message
) {
return message -> {
Object payload = message.getPayload();
Map<String, Object> result = new HashMap<>();
Map<String, String> headersStr = new HashMap<>();
message.getHeaders().forEach((k, v) -> headersStr.put(k, v != null ? v.toString() : null));
result.put("headers", headersStr);
result.put("payload", payload);
result.put("configProp", "Config prop val: " + config.getSomeConfigProp());
return result;
};
}
// See https://stackoverflow.com/questions/59155689/could-not-decode-json-type-for-key-file-name-in-a-spring-cloud-data-flow-stream
#Bean("kafkaBinderHeaderMapper")
public KafkaHeaderMapper kafkaBinderHeaderMapper() {
BinderHeaderMapper mapper = new BinderHeaderMapper();
mapper.setEncodeStrings(true);
return mapper;
}
}
And now I have a problem: the SCDF source and target topic names are apparently ignored by Spring Cloud Function, and topics transform-in-0 and transform-out-0 are created instead.
SCDF creates topics with names like <stream-name>.<app-name>, e.g. TestStream123.http and TestStream123.mvmn-transform.
Previously these were used by the transformer, as they should be, since it is part of the SCDF stream. But now they are ignored by Spring Cloud Function, and transform-in-0 and transform-out-0 are created instead.
Thus my transformer no longer receives any input, because it expects it on the wrong Kafka topic, and it probably produces no output to the stream either, since it also writes to the wrong Kafka topic.
P.S. Just in case, the full project code is on GitHub: https://github.com/mvmn/scdftest-transformer/tree/scfunc
In order to run it locally, start up Kafka, Skipper, SCDF and the SCDF console, do mvn clean install in the app folder, and then do app register --name mvmn-transform-1 --type processor --uri maven://x.mvmn.study.scdf.scdftest:scdftest-transformer:0.1.1-SNAPSHOT --metadata-uri maven://x.mvmn.study.scdf.scdftest:scdftest-transformer:0.1.1-SNAPSHOT in the console. Then you can deploy the stream using the definition http --port=12346 | mvmn-transform | file --name=tmp.txt --directory=/tmp
Since you are using the functional model of writing Spring Cloud Stream applications, when you deploy this app, you need to pass two properties on the custom processor to restore the Spring Cloud Data Flow behavior.
spring.cloud.stream.function.bindings.transform-in-0=input
spring.cloud.stream.function.bindings.transform-out-0=output
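For example (a sketch; it assumes the processor is registered as mvmn-transform and the stream is named TestStream123, as in the question), the two properties can be passed as app properties when deploying the stream from the SCDF shell:
stream deploy TestStream123 --properties "app.mvmn-transform.spring.cloud.stream.function.bindings.transform-in-0=input,app.mvmn-transform.spring.cloud.stream.function.bindings.transform-out-0=output"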
Can you try that and see if that makes a difference?
While in a distributed processing environment it is common to use "part" file names such as "part-000", is it possible to write an extension of some sort to rename the individual output file names (such as a per window file name) of Apache Beam?
To do this, one might have to be able to assign a name for a window or infer a file name based on the window's content. I would like to know if such an approach is possible.
As to whether the solution should be streaming or batch, a streaming-mode example is preferable.
Yes, as suggested by jkff, you can achieve this using TextIO.write().to(FilenamePolicy).
Examples are below:
If you want to write the output to a particular local file, you can use:
lines.apply(TextIO.write().to("/path/to/file.txt"));
Below is a simple way to write the output using a prefix (based on Beam's MinimalWordCount example). This example writes to Google Cloud Storage; you can use local or S3 paths instead.
public class MinimalWordCountJava8 {
public static void main(String[] args) {
PipelineOptions options = PipelineOptionsFactory.create();
// In order to run your pipeline, you need to make following runner specific changes:
//
// CHANGE 1/3: Select a Beam runner, such as BlockingDataflowRunner
// or FlinkRunner.
// CHANGE 2/3: Specify runner-required options.
// For BlockingDataflowRunner, set project and temp location as follows:
// DataflowPipelineOptions dataflowOptions = options.as(DataflowPipelineOptions.class);
// dataflowOptions.setRunner(BlockingDataflowRunner.class);
// dataflowOptions.setProject("SET_YOUR_PROJECT_ID_HERE");
// dataflowOptions.setTempLocation("gs://SET_YOUR_BUCKET_NAME_HERE/AND_TEMP_DIRECTORY");
// For FlinkRunner, set the runner as follows. See {@code FlinkPipelineOptions}
// for more details.
// options.as(FlinkPipelineOptions.class)
// .setRunner(FlinkRunner.class);
Pipeline p = Pipeline.create(options);
p.apply(TextIO.read().from("gs://apache-beam-samples/shakespeare/*"))
.apply(FlatMapElements
.into(TypeDescriptors.strings())
.via((String word) -> Arrays.asList(word.split("[^\\p{L}]+"))))
.apply(Filter.by((String word) -> !word.isEmpty()))
.apply(Count.<String>perElement())
.apply(MapElements
.into(TypeDescriptors.strings())
.via((KV<String, Long> wordCount) -> wordCount.getKey() + ": " + wordCount.getValue()))
// CHANGE 3/3: A Google Cloud Storage path is required for outputting the results.
.apply(TextIO.write().to("gs://YOUR_OUTPUT_BUCKET/AND_OUTPUT_PREFIX"));
p.run().waitUntilFinish();
}
}
This example code will give you more control over writing the output:
/**
* A {@link FilenamePolicy} produces a base file name for a write based on metadata about the data
* being written. This always includes the shard number and the total number of shards. For
* windowed writes, it also includes the window and pane index (a sequence number assigned to each
* trigger firing).
*/
protected static class PerWindowFiles extends FilenamePolicy {
// 'formatter' is used below but was not shown in the snippet; it needs to be a
// Joda-time DateTimeFormatter, e.g. ISODateTimeFormat.hourMinute().
private static final DateTimeFormatter formatter = ISODateTimeFormat.hourMinute();
private final ResourceId prefix;
public PerWindowFiles(ResourceId prefix) {
this.prefix = prefix;
}
public String filenamePrefixForWindow(IntervalWindow window) {
String filePrefix = prefix.isDirectory() ? "" : prefix.getFilename();
return String.format(
"%s-%s-%s", filePrefix, formatter.print(window.start()), formatter.print(window.end()));
}
@Override
public ResourceId windowedFilename(int shardNumber,
int numShards,
BoundedWindow window,
PaneInfo paneInfo,
OutputFileHints outputFileHints) {
IntervalWindow intervalWindow = (IntervalWindow) window;
String filename =
String.format(
"%s-%s-of-%s%s",
filenamePrefixForWindow(intervalWindow),
shardNumber,
numShards,
outputFileHints.getSuggestedFilenameSuffix());
return prefix.getCurrentDirectory().resolve(filename, StandardResolveOptions.RESOLVE_FILE);
}
@Override
public ResourceId unwindowedFilename(
int shardNumber, int numShards, OutputFileHints outputFileHints) {
throw new UnsupportedOperationException("Unsupported.");
}
}
@Override
public PDone expand(PCollection<InputT> teamAndScore) {
if (windowed) {
teamAndScore
.apply("ConvertToRow", ParDo.of(new BuildRowFn()))
.apply(new WriteToText.WriteOneFilePerWindow(filenamePrefix));
} else {
teamAndScore
.apply("ConvertToRow", ParDo.of(new BuildRowFn()))
.apply(TextIO.write().to(filenamePrefix));
}
return PDone.in(teamAndScore.getPipeline());
}
Yes. Per the documentation of TextIO:
If you want better control over how filenames are generated than the default policy allows, a custom FilenamePolicy can also be set using TextIO.Write.to(FilenamePolicy)
This is a perfectly valid example with Beam 2.1.0. You can call it on your data (a PCollection), e.g.:
import org.apache.beam.sdk.io.FileBasedSink.FilenamePolicy;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.fs.ResolveOptions.StandardResolveOptions;
import org.apache.beam.sdk.io.fs.ResourceId;
import org.apache.beam.sdk.transforms.display.DisplayData;
@SuppressWarnings("serial")
public class FilePolicyExample {
public static void main(String[] args) {
FilenamePolicy policy = new WindowedFilenamePolicy("somePrefix");
// 'data' is the PCollection<String> you want to write
data.apply(TextIO.write().to("your_DIRECTORY")
.withFilenamePolicy(policy)
.withWindowedWrites()
.withNumShards(4));
}
private static class WindowedFilenamePolicy extends FilenamePolicy {
final String outputFilePrefix;
WindowedFilenamePolicy(String outputFilePrefix) {
this.outputFilePrefix = outputFilePrefix;
}
@Override
public ResourceId windowedFilename(
ResourceId outputDirectory, WindowedContext input, String extension) {
String filename = String.format(
"%s-%s-%s-of-%s-pane-%s%s%s",
outputFilePrefix,
input.getWindow(),
input.getShardNumber(),
input.getNumShards() - 1,
input.getPaneInfo().getIndex(),
input.getPaneInfo().isLast() ? "-final" : "",
extension);
return outputDirectory.resolve(filename, StandardResolveOptions.RESOLVE_FILE);
}
@Override
public ResourceId unwindowedFilename(
ResourceId outputDirectory, Context input, String extension) {
throw new UnsupportedOperationException("Expecting windowed outputs only");
}
@Override
public void populateDisplayData(DisplayData.Builder builder) {
builder.add(DisplayData.item("fileNamePrefix", outputFilePrefix)
.withLabel("File Name Prefix"));
}
}
}
You can check https://beam.apache.org/releases/javadoc/2.3.0/org/apache/beam/sdk/io/FileIO.html for more information; search for "File naming" under "Writing files".
.apply(
FileIO.<RootElement>write()
.via(XmlIO
.sink(RootElement.class)
.withRootElement(ROOT_XML_ELEMENT)
.withCharset(StandardCharsets.UTF_8))
.to(FILE_PATH)
.withNaming((window, pane, numShards, shardIndex, compression) -> NEW_FILE_NAME)
I'm trying to build a dataflow process to help archive data by storing data into Google Cloud Storage. I have a PubSub stream of Event data which contains the client_id and some metadata. This process should archive all incoming events, so this needs to be a streaming pipeline.
I'd like to be able to handle archiving the events by putting each Event I receive inside a bucket that looks like gs://archive/client_id/eventdata.json. Is that possible to do within Dataflow/Apache Beam, specifically being able to assign the file name differently for each Event in the PCollection?
EDIT:
So my code currently looks like:
public static class PerWindowFiles extends FileBasedSink.FilenamePolicy {
private String customerId;
public PerWindowFiles(String customerId) {
this.customerId = customerId;
}
@Override
public ResourceId windowedFilename(ResourceId outputDirectory, WindowedContext context, String extension) {
String filename = bucket+"/"+customerId;
return outputDirectory.resolve(filename, ResolveOptions.StandardResolveOptions.RESOLVE_FILE);
}
@Override
public ResourceId unwindowedFilename(
ResourceId outputDirectory, Context context, String extension) {
throw new UnsupportedOperationException("Unsupported.");
}
}
public static void main(String[] args) throws IOException {
DataflowPipelineOptions options = PipelineOptionsFactory.fromArgs(args)
.withValidation()
.as(DataflowPipelineOptions.class);
options.setRunner(DataflowRunner.class);
options.setStreaming(true);
Pipeline p = Pipeline.create(options);
PCollection<Event> set = p.apply(PubsubIO.readStrings()
.fromTopic("topic"))
.apply(new ConvertToEvent());
PCollection<KV<String, Event>> events = labelEvents(set);
PCollection<KV<String, EventGroup>> sessions = groupEvents(events);
String customers = System.getProperty("CUSTOMERS");
JSONArray custList = new JSONArray(customers);
for (Object cust : custList) {
if (cust instanceof String) {
String customerId = (String) cust;
PCollection<KV<String, EventGroup>> custCol = sessions.apply(new FilterByCustomer(customerId));
stringifyEvents(custCol)
.apply(TextIO.write()
.to("gs://archive/")
.withFilenamePolicy(new PerWindowFiles(customerId))
.withWindowedWrites()
.withNumShards(3));
} else {
LOG.info("Failed to create TextIO: customerId was not String");
}
}
p.run()
.waitUntilFinish();
}
This code is ugly because I need to redeploy every time a new client appears in order to save their data. I would prefer to be able to assign customer data to an appropriate bucket dynamically.
"Dynamic destinations" - choosing the file name based on the elements being written - will be a new feature available in Beam 2.1.0, which has not yet been released.
I need to execute the operations below in sequence, as given:
PCollection<String> read = p.apply("Read Lines",TextIO.read().from(options.getInputFile()))
.apply("Get fileName",ParDo.of(new DoFn<String,String>(){
ValueProvider<String> fileReceived = options.getfilename();
@ProcessElement
public void processElement(ProcessContext c)
{
fileName = fileReceived.get().toString();
LOG.info("File: "+fileName);
}
}));
PCollection<TableRow> rows = p.apply("Read from BigQuery",
BigQueryIO.read()
.fromQuery("SELECT table,schema FROM `DatasetID.TableID` WHERE file='" + fileName +"'")
.usingStandardSql());
How to accomplish this in Apache Beam/Dataflow?
It seems that you want to apply BigQueryIO.read().fromQuery() to a query that depends on a value available via a property of type ValueProvider<String> in your PipelineOptions, and the provider is not accessible at pipeline construction time - i.e. you are invoking your job via a template.
In that case, the proper solution is to use NestedValueProvider:
PCollection<TableRow> tableRows = p.apply(BigQueryIO.read().fromQuery(
NestedValueProvider.of(
options.getfilename(),
new SerializableFunction<String, String>() {
@Override
public String apply(String filename) {
return "SELECT table,schema FROM `DatasetID.TableID` WHERE file='" + filename + "'";
}
})));
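For completeness, a minimal sketch of the options interface this pattern assumes (hypothetical; it mirrors the options.getfilename() accessor used in the question, following the usual Beam template conventions):
import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.ValueProvider;
public interface FileQueryOptions extends PipelineOptions {
    // Runtime-provided file name, so the query can be built when the template is executed.
    @Description("File name used to look up table and schema in BigQuery")
    ValueProvider<String> getfilename();
    void setfilename(ValueProvider<String> value);
}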
Based on the Javadocs and the blog post at https://beam.apache.org/blog/2017/02/13/stateful-processing.html, I tried a simple de-duplication example using the 2.0.0-beta-2 SDK, which reads a file from GCS (containing a list of JSONs, each with a user_id field) and then runs it through the pipeline explained below.
The input data contains about 146K events, of which only 50 are unique. The entire input is about 50MB, which should be processable in considerably less time than the 2-minute fixed window. I just placed a window there to make sure the per-key-per-window semantics hold without using a GlobalWindow. I run the windowed data through 3 parallel stages to compare the results, each of which is explained below.
1. Just copies the contents into a new file on GCS - this ensures all the events are being processed as expected, and I verified the contents are exactly the same as the input.
2. Combine.PerKey on the user_id, picking only the first element from the Iterable - this essentially should deduplicate the data, and it works as expected. The resulting file has the exact number of unique items from the original list of events - 50 elements.
3. Stateful ParDo which checks whether the key has been seen already and emits an output only when it has not. Ideally, the result from this should match the deduped data from [2], but all I am seeing is only 3 unique events. These 3 unique events always point to the same 3 user_ids across the few runs I did.
Interestingly, when I just switch from the DataflowRunner to the DirectRunner and run this whole process locally, I see that the output from [3] matches [2], having only 50 unique elements as expected. So I am wondering whether there are any issues with the DataflowRunner for the stateful ParDo.
public class StatefulParDoSample {
private static Logger logger = LoggerFactory.getLogger(StatefulParDoSample.class.getName());
static class StatefulDoFn extends DoFn<KV<String, String>, String> {
final Aggregator<Long, Long> processedElements = createAggregator("processed", Sum.ofLongs());
final Aggregator<Long, Long> skippedElements = createAggregator("skipped", Sum.ofLongs());
@StateId("keyTracker")
private final StateSpec<Object, ValueState<Integer>> keyTrackerSpec =
StateSpecs.value(VarIntCoder.of());
@ProcessElement
public void processElement(
ProcessContext context,
@StateId("keyTracker") ValueState<Integer> keyTracker) {
processedElements.addValue(1l);
final String userId = context.element().getKey();
int wasSeen = firstNonNull(keyTracker.read(), 0);
if (wasSeen == 0) {
keyTracker.write( 1);
context.output(context.element().getValue());
} else {
keyTracker.write(wasSeen + 1);
skippedElements.addValue(1l);
}
}
}
public static void main(String[] args) {
DataflowPipelineOptions pipelineOptions = PipelineOptionsFactory.create().as(DataflowPipelineOptions.class);
pipelineOptions.setRunner(DataflowRunner.class);
pipelineOptions.setProject("project-name");
pipelineOptions.setStagingLocation(GCS_STAGING_LOCATION);
pipelineOptions.setStreaming(false);
pipelineOptions.setAppName("deduper");
Pipeline p = Pipeline.create(pipelineOptions);
final ObjectMapper mapper = new ObjectMapper();
PCollection<KV<String, String>> keyedEvents =
p
.apply(TextIO.Read.from(GCS_SAMPLE_INPUT_FILE_PATH))
.apply(WithKeys.of(new SerializableFunction<String, String>() {
@Override
public String apply(String input) {
try {
Map<String, Object> eventJson =
mapper.readValue(input, Map.class);
return (String) eventJson.get("user_id");
} catch (Exception e) {
}
return "";
}
}))
.apply(
Window.into(
FixedWindows.of(Duration.standardMinutes(2))
)
);
keyedEvents
.apply(ParDo.of(new StatefulDoFn()))
.apply(TextIO.Write.to(GCS_SAMPLE_OUTPUT_FILE_PATH).withNumShards(1));
keyedEvents
.apply(Values.create())
.apply(TextIO.Write.to(GCS_SAMPLE_COPY_FILE_PATH).withNumShards(1));
keyedEvents
.apply(Combine.perKey(new SerializableFunction<Iterable<String>, String>() {
@Override
public String apply(Iterable<String> input) {
return !input.iterator().hasNext() ? "empty" : input.iterator().next();
}
}))
.apply(Values.create())
.apply(TextIO.Write.to(GCS_SAMPLE_COMBINE_FILE_PATH).withNumShards(1));
PipelineResult result = p.run();
result.waitUntilFinish();
}
}
This was a bug in the Dataflow service in batch mode, fixed in the upcoming 0.6.0 Beam release (or HEAD if you track the bleeding edge).
Thank you for bringing it to my attention! For reference, or if anything else comes up, this was tracked by BEAM-1611.