Only output projection to partitioned stream - EventStoreDB

From what I've found so far, I have to use the following fragment in my projection to output the projection results:
.partitionBy(({ metadata }) => metadata.someProperty)
.outputTo('some-stream', 'some-stream-{0}')
I tried it with only the partitioned stream name, but this results in a stream named “some-stream-{0}”. Is there a way to ONLY output to a stream based on the configured partitioning (e.g. partition by tenant)?

Related

KSQL stream created using AVRO as the value format outputs null

I have data arriving in this format:
{"ROWTIME":1557825832927,"ROWKEY":"null","respondent_id":"noon","machine_data":{"resolution":"1920x1080","region":860}}
When I create a stream called COMPLEX like this:
CREATE STREAM complex WITH (KAFKA_TOPIC='test-topic-complex-2', VALUE_FORMAT='AVRO');
And then run:
SELECT MACHINE_DATA from COMPLEX;
it works fine.
Running this:
SELECT MACHINE_DATA->RESOLUTION from COMPLEX;
doesn't work, saying RESOLUTION is not a field in machine_data. But it clearly is.
I dropped the COMPLEX stream, then recreated it and explicitly specified that resolution is a field by creating the stream with this syntax:
CREATE STREAM COMPLEX (respondent_id VARCHAR, machine_data struct<resolution VARCHAR, region INT>) WITH (KAFKA_TOPIC='test-topic-complex-2', VALUE_FORMAT='AVRO');
After this I can run the query select MACHINE_DATA->RESOLUTION from COMPLEX; but I get null as the output for resolution.
Everything works fine when using JSON as the value format. What gives? Could anyone point out what I am doing wrong?

Beam pipeline does not produce any output after GroupByKey with windowing, and I get a memory error

Purpose:
I want to load streaming data, add a key, and then count the elements by key.
Problem:
My Apache Beam Dataflow pipeline gets a memory error when I try to load and group-by-key a large dataset using the streaming approach (unbounded data). It seems the data accumulates in the group-by step and is not emitted early when each window's trigger fires.
If I decrease the element size (the element count does not change), it works, because the group-by step apparently waits for all the data to be grouped and only then fires all the newly windowed data.
I tested with both:
beam version 2.11.0 and scio version 0.7.4
beam version 2.6.0 and scio version 0.6.1
The way to reproduce the error:
Read a Pub/Sub message that contains a file name
Read and load the corresponding file from GCS as a row-by-row iterator
Flatten it row by row (so it generates around 10,000 elements)
Add timestamps (the current instant) to the elements
Create key-value pairs from the data (with random integer keys from 1 to 10)
Apply a window with triggering (it fires around 50 times in the case where rows are small and there is no memory problem)
Count per key (group by key, then combine)
Finally, we are supposed to end up with around 50 * 10 elements representing counts by window and key (tested successfully when the row sizes are small enough)
The visualization of the pipeline (steps 4 to 7) and the summary of the group-by-key step (screenshots omitted) show that the data is accumulated in the group-by step and does not get emitted.
The windowing code is here:
val windowedData = data.applyKvTransform(
  Window.into[myt](Sessions.withGapDuration(Duration.millis(1)))
    .triggering(
      Repeatedly.forever(
        AfterFirst.of(
          AfterPane.elementCountAtLeast(10),
          AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.millis(1))))
        .orFinally(AfterWatermark.pastEndOfWindow()))
    .withAllowedLateness(Duration.standardSeconds(100))
    .discardingFiredPanes()
)
The error:
org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker$KeyCommitTooLargeException: Commit request for stage S2 and key 2 is larger than 2GB and cannot be processed. This may be caused by grouping a very large amount of data in a single window without using Combine, or by producing a large amount of data from a single input element.
org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker$KeyCommitTooLargeException.causedBy(StreamingDataflowWorker.java:230)
org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.process(StreamingDataflowWorker.java:1287)
org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.access$1000(StreamingDataflowWorker.java:146)
org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker$6.run(StreamingDataflowWorker.java:1008)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:745)
Is there any way to solve the memory problem, maybe by forcing group-by to emit early results for each window?
The KeyCommitTooLargeException is not a memory problem but a protobuf serialization problem. Protobuf has a limit of 2GB for a single object (the Google protobuf maximum size). Dataflow found that the value of a single key in the pipeline was larger than 2GB, so it couldn't shuffle the data. The error message indicates that "This may be caused by grouping a very large amount of data in a single window without using Combine, or by producing a large amount of data from a single input element." Based on your pipeline setup (i.e., randomly assigned keys), it is more likely the latter.
The pipeline may have read a large file (>2GB) from GCS and assigned it to a random key. GroupByKey requires a key-shuffle operation, which Dataflow failed to perform because of the protobuf limitation, so it got stuck on that key and held back the watermark.
If a single key has a large value, you may want to reduce the value size, for example by compressing the string, splitting the string across multiple keys, or generating smaller GCS files in the first place.
If the large value comes from grouping many elements under the same key, you may want to increase the key space so that each group-by-key operation ends up grouping fewer elements together.
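If counting is ultimately what the pipeline needs, the second suggestion can be sketched in the Beam Java SDK roughly as follows. This is only a minimal sketch: the class name, the widened key range of 1 to 1,000, and the assumption that the windowing above has already been applied are illustrative, not taken from the question.
import java.util.concurrent.ThreadLocalRandom;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;
public class KeyedCounts {
  // Counts windowed rows per key while keeping the per-key state small.
  static PCollection<KV<Integer, Long>> countRows(PCollection<String> windowedRows) {
    return windowedRows
        // Spread elements over a wider key space (1 to 1,000 instead of 1 to 10) so no
        // single key accumulates a huge amount of data between trigger firings.
        .apply("AssignKey", MapElements
            .into(TypeDescriptors.kvs(TypeDescriptors.integers(), TypeDescriptors.strings()))
            .via((String row) -> KV.of(ThreadLocalRandom.current().nextInt(1000) + 1, row)))
        // Count.perKey is combiner-based, so the runner can pre-aggregate partial counts
        // before the shuffle instead of materializing every row under one key.
        .apply("CountPerKey", Count.<Integer, String>perKey());
  }
}
If the oversized value instead comes from a single GCS file landing on one key (the first case above), re-keying after the fact does not help; the file has to be split into smaller elements, or smaller files, before any grouping.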

Reading BigTable and converting to Generic Records using GCP Cloud DataFlow

I am trying to convert Bigtable table data to GenericRecords using Dataflow. After the conversion is done, I have to compare the result with other datasets in a bucket.
Below is my pseudocode; for the pipeline I have used:
pipeline
    .apply("Read from Bigtable", BigtableIO.read())
    .apply("Transform Bigtable rows to Avro GenericRecords",
        ParDo.of(new TransformAvro(out.toString())))
    .apply("Compare to existing Avro file")
    .apply("Write the data back to Bigtable")
// The DoFn below converts Bigtable rows to GenericRecords
public class BigTableToAvroFunction
    extends DoFn<KV<ByteString, Iterable<Mutation>>, GenericRecord> {
  @ProcessElement
  public void processElement(ProcessContext context) {
    GenericRecord gen = null;
    ByteString key = context.element().getKey();
    Iterable<Mutation> value = context.element().getValue();
    KV<ByteString, Iterable<Mutation>> element = context.element();
    // Build the GenericRecord from the key and mutations here,
    // then emit it with context.output(gen);
  }
}
I am stuck here.
It is unclear what you mean by comparing to existing data in a bucket. It depends on how you want to do the comparison, what the file size is, and probably other things. Examples of input vs. output would help.
For example, if what you're trying to do is similar to a join operation, you can try using CoGroupByKey (link to the doc) to join two PCollections, one read from Bigtable and another read from the Avro files in GCS.
Alternatively, if the file has a reasonable size (fits in memory), you can probably model it as a side input (link to the doc).
Or, ultimately, you can always use the raw GCS API to query the data in a ParDo and do everything manually.
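To make the CoGroupByKey suggestion concrete, here is a minimal sketch, assuming both sides have already been keyed by a common String row id and decoded to GenericRecord. The class name, tag names, and the "missing on one side" comparison are illustrative assumptions, not something taken from the question.
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.join.CoGbkResult;
import org.apache.beam.sdk.transforms.join.CoGroupByKey;
import org.apache.beam.sdk.transforms.join.KeyedPCollectionTuple;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TupleTag;
public class CompareWithBucket {
  // Tags identify which input each grouped value came from.
  private static final TupleTag<GenericRecord> BIGTABLE_TAG = new TupleTag<GenericRecord>() {};
  private static final TupleTag<GenericRecord> GCS_TAG = new TupleTag<GenericRecord>() {};
  // Both inputs must be keyed by the same id and need an AvroCoder set for GenericRecord.
  public static PCollection<String> compare(
      PCollection<KV<String, GenericRecord>> fromBigtable,
      PCollection<KV<String, GenericRecord>> fromGcs) {
    return KeyedPCollectionTuple.of(BIGTABLE_TAG, fromBigtable)
        .and(GCS_TAG, fromGcs)
        .apply(CoGroupByKey.create())
        .apply(ParDo.of(new DoFn<KV<String, CoGbkResult>, String>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            CoGbkResult grouped = c.element().getValue();
            Iterable<GenericRecord> bigtableRecords = grouped.getAll(BIGTABLE_TAG);
            Iterable<GenericRecord> gcsRecords = grouped.getAll(GCS_TAG);
            // Placeholder comparison: report keys missing on either side. A real
            // field-by-field comparison of the records would go here instead.
            if (!bigtableRecords.iterator().hasNext()) {
              c.output(c.element().getKey() + " is only in GCS");
            } else if (!gcsRecords.iterator().hasNext()) {
              c.output(c.element().getKey() + " is only in Bigtable");
            }
          }
        }));
  }
}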

Streaming .csv files into Cloud Storage using Pub/Sub

A general question, if anyone can point me in the right direction: what is the best way to get incoming streaming .csv files into BigQuery (with some transformations applied using Dataflow) at a large scale, using Pub/Sub?
I'm thinking of using Pub/Sub to handle the many large raw streams of incoming .csv files.
For example, the approach I'm thinking of is:
1. incoming raw .csv file > 2. Pub/Sub > 3. Cloud Storage > 4. Cloud Function (to trigger Dataflow) > 5. Dataflow (to transform) > 6. BigQuery
Let me know if there are any issues with this approach at scale, or if there is a better alternative.
If that is a good approach, how do I get Pub/Sub to pick up the .csv files, and how do I construct this?
Thanks
Ben
There are a couple of different ways to approach this, but much of your use case can be solved using the Google-provided Dataflow templates. When using the templates, light transformations can be done within a JavaScript UDF. This saves you from maintaining an entire pipeline and only writing the transformations necessary for your incoming data.
If you're accepting many files as a stream into Cloud Pub/Sub, remember that Cloud Pub/Sub has no ordering guarantees, so records from different files would likely get intermixed in the output. If you're looking to capture an entire file as is, uploading directly to GCS would be the better approach.
Using the provided templates, either Cloud Pub/Sub to BigQuery or GCS to BigQuery, you could utilize a simple UDF to transform the data from CSV format to a JSON format matching the BigQuery output table schema.
For example, if you had CSV records such as:
transactionDate,product,retailPrice,cost,paymentType
2018-01-08,Product1,99.99,79.99,Visa
You could write a UDF to transform that data into your output schema like this:
function transform(line) {
  var values = line.split(',');
  // Construct the output object and add transformations
  var obj = new Object();
  obj.transactionDate = values[0];
  obj.product = values[1];
  // Parse the numeric columns so they are emitted as numbers, not strings
  obj.retailPrice = parseFloat(values[2]);
  obj.cost = parseFloat(values[3]);
  obj.marginPct = (obj.retailPrice - obj.cost) / obj.retailPrice;
  obj.paymentType = values[4];
  var jsonString = JSON.stringify(obj);
  return jsonString;
}

Output sorted text file from Google Cloud Dataflow

I have a PCollection<String> in Google Cloud Dataflow and I'm outputting it to text files via TextIO.Write.to:
PCollection<String> lines = ...;
lines.apply(TextIO.Write.to("gs://bucket/output.txt"));
Currently the lines of each shard of output are in random order.
Is it possible to get Dataflow to output the lines in sorted order?
This is not directly supported by Dataflow.
For a bounded PCollection, if you shard your input finely enough, then you can write sorted files with a Sink implementation that sorts each shard. You may want to refer to the TextSink implementation for a basic outline.
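If a custom Sink is more than you need, a different workaround for a bounded PCollection is to shard the lines, group them, and sort each shard in memory inside a DoFn, emitting each shard as one multi-line element so TextIO writes it contiguously. The sketch below uses the Apache Beam Java SDK and is only an illustration (the class name and hash-based sharding are assumptions): each shard must fit in a worker's memory, and the shards still come out in arbitrary order, so this gives sorted blocks rather than one globally sorted file (that would require range partitioning instead of hashing).
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;
public class SortWithinShards {
  // Each shard is collected into memory and sorted, so numShards must be large enough
  // for any single shard to fit on one worker.
  public static PCollection<String> sortWithinShards(PCollection<String> lines, final int numShards) {
    return lines
        .apply("AssignShard", MapElements
            .into(TypeDescriptors.kvs(TypeDescriptors.integers(), TypeDescriptors.strings()))
            .via((String line) -> KV.of(Math.floorMod(line.hashCode(), numShards), line)))
        .apply("GroupShards", GroupByKey.<Integer, String>create())
        .apply("SortEachShard", ParDo.of(new DoFn<KV<Integer, Iterable<String>>, String>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            List<String> sorted = new ArrayList<>();
            for (String line : c.element().getValue()) {
              sorted.add(line);
            }
            Collections.sort(sorted);
            // Emit the whole shard as one multi-line element so its internal order
            // is preserved when TextIO writes the element out contiguously.
            c.output(String.join("\n", sorted));
          }
        }));
  }
}
Writing the result with, for example, sortWithinShards(lines, 100).apply(TextIO.write().to("gs://bucket/output")) then produces files in which every emitted block is internally sorted.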
