Reading BigTable and converting to Generic Records using GCP Cloud DataFlow - avro

I am trying to convert BigTable table data to generic records using Dataflow. After the conversion is done, I have to compare the result with other datasets in a bucket.
Below is my pseudo code; for the pipeline I have used:
pipeline
    .apply("Read from bigtable", BigtableIO.read())
    .apply("Transform BigTable to Avro Generic Records",
        ParDo.of(new TransformAvro(out.toString())))
    .apply("Compare to existing avro file")
    .apply("Write back the data to bigtable");
// Function code below to convert each row into a generic record
public class BigTableToAvroFunction
    extends DoFn<KV<ByteString, Iterable<Mutation>>, GenericRecord> {

  @ProcessElement
  public void processElement(ProcessContext context) {
    GenericRecord gen = null;
    KV<ByteString, Iterable<Mutation>> element = context.element();
    ByteString key = element.getKey();
    Iterable<Mutation> value = element.getValue();
    // TODO: build the GenericRecord from the row key and the mutations,
    // then emit it with context.output(gen)
  }
}
I am stuck here.

It is unclear what you mean by comparing to existing data in a bucket. It depends on how you want to do the comparison, what the file size is, and probably other things. Examples of input vs output would help.
For example, if what you're trying to do is similar to a Join operation, you can try using CoGroupByKey to join two PCollections, one reading from BigTable, the other reading Avro files from GCS.
Alternatively, if the file has a reasonable size (fits in memory), you can probably model it as a side input.
Or, ultimately, you can always use the raw GCS API to query the data in a ParDo and do everything manually.
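For illustration, here is a rough sketch of the CoGroupByKey approach. The String keys, the two already-keyed PCollections, and the tag names are assumptions about how you would key both sides, not something taken from your pipeline:
// Hedged sketch of joining BigTable rows with Avro records by row key.
// Assumes both sides were keyed by the same String key in earlier ParDos.
final TupleTag<GenericRecord> bigtableTag = new TupleTag<>();
final TupleTag<GenericRecord> avroTag = new TupleTag<>();

PCollection<KV<String, GenericRecord>> fromBigtable = ...;  // keyed BigTable records
PCollection<KV<String, GenericRecord>> fromGcs = ...;       // keyed records read via AvroIO

PCollection<KV<String, CoGbkResult>> joined =
    KeyedPCollectionTuple.of(bigtableTag, fromBigtable)
        .and(avroTag, fromGcs)
        .apply(CoGroupByKey.create());

joined.apply(ParDo.of(new DoFn<KV<String, CoGbkResult>, GenericRecord>() {
  @ProcessElement
  public void processElement(ProcessContext c) {
    Iterable<GenericRecord> bigtableSide = c.element().getValue().getAll(bigtableTag);
    Iterable<GenericRecord> avroSide = c.element().getValue().getAll(avroTag);
    // compare the two sides here and output whatever should be written back
  }
}));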

Related

Slowness / Lag in beam streaming pipeline in group by key stage

Context
Hi all, I have been using Apache Beam pipelines to generate a columnar DB to store in GCS. I have a data stream coming in from Kafka and use a window of 1 minute.
I want to transform all data of that 1-minute window into a columnar DB file (ORC in my case; it could be Parquet or anything else). I have written a pipeline for this transformation.
Problem
I am experiencing general slowness. I suspect it could be due to the group by key transformation, as I have only one key. Is there really a need for it? If not, what should be done instead? I read that Combine isn't very useful here, as my pipeline isn't really aggregating the data but creating a merged file. What I need is an iterable list of objects per window, which will be transformed into ORC files.
Pipeline Representation
input -> window -> group by key (only 1 key) -> pardo (to create DB) -> IO (to write to GCS)
What I have tried
I have tried using the profiler and scaling horizontally/vertically. Using the profiler, I saw more than 50% of the time going into the group by key operation. I believe the problem is hot keys, but I am unable to find a solution for what should be done instead. When I remove the group by key operation, my pipeline keeps up with the Kafka lag (i.e., it doesn't seem to be an issue at the Kafka end).
Code Snippet
p.apply("ReadLines", KafkaIO.<Long, byte[]>read().withBootstrapServers("myserver.com:9092")
.withTopic(options.getInputTopic())
.withTimestampPolicyFactory(MyTimePolicy.myTimestampPolicyFactory())
.withConsumerConfigUpdates(Map.of("group.id", "mygroup-id")).commitOffsetsInFinalize()
.withKeyDeserializer(LongDeserializer.class)
.withValueDeserializer(ByteArrayDeserializer.class).withoutMetadata())
.apply("UncompressSnappy", ParDo.of(new UncompressSnappy()))
.apply("DecodeProto", ParDo.of(new DecodePromProto()))
.apply("MapTSSample", ParDo.of(new MapTSSample()))
.apply(Window.<TSSample>into(FixedWindows.of(Duration.standardMinutes(1)))
.withTimestampCombiner(TimestampCombiner.END_OF_WINDOW))
.apply(WithKeys.<Integer, TSSample>of(1))
.apply(GroupByKey.<Integer, TSSample>create())
.apply("CreateTSORC", ParDo.of(new CreateTSORC()))
.apply(new WriteOneFilePerWindow(options.getOutput(), 1));
Wall Time Profile
https://gist.github.com/anandsinghkunwar/4cc26f7e3da7473af66ce9a142a74c35
The problem indeed seems to be a hot keys issue. I had to change my pipeline to create a custom IO for ORC files and bump the number of shards up to 50 for my case, and I removed the GroupByKey entirely. Since Beam doesn't yet auto-determine the number of shards for FileIO.write(), you'll have to manually choose a number that suits your workload.
Also, enabling the Streaming Engine API in Google Dataflow sped up the ingestion even more.
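For reference, a rough sketch of what the sharded FileIO write could look like; windowedTSSamples and OrcSink are assumed names (Beam has no built-in ORC sink), so treat this as an outline rather than a drop-in replacement:
// Hedged sketch: instead of WithKeys + GroupByKey + CreateTSORC, write the already
// windowed elements directly with FileIO and an explicit shard count.
windowedTSSamples                        // PCollection<TSSample>, windowed per minute as before
    .apply(FileIO.<TSSample>write()
        .via(new OrcSink())              // hypothetical FileIO.Sink<TSSample> that writes ORC
        .to(options.getOutput())         // GCS output prefix
        .withNumShards(50));             // FileIO.write() has no auto-sharding, so pick a count manually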

Cloud Dataflow/Beam - PCollection lookup another PCollection

a) Reading from a bounded source, how big can a PCollection be when running in Dataflow?
b) When dealing with big data, say a PCollection of about 50 million records looking up another PCollection of about 10 million records: can that be done, and how well does Beam/Dataflow perform? In a ParDo function, given that we can pass only one input and get back one output, how can a lookup be performed based on two input datasets? I am trying to look at Dataflow/Beam like any other ETL tool, where an easy lookup might be possible to create a new PCollection. Please provide any code snippets which might help.
I have also seen the side input functionality, but can a side input really hold that big a dataset, if that is how the lookup can be accomplished?
You can definitely do this with side inputs, as a side input may be arbitrarily large.
In Java you'd do something like this:
Pipeline pipeline = Pipeline.create(options);

// Lookup collection materialized as a map-valued side input.
PCollectionView<Map<...>> lookupCollection = pipeline
    .apply(new ReadMyLookupCollection())
    .apply(View.asMap());

PCollection<..> mainCollection = pipeline
    .apply(new ReadMyPCollection())
    .apply(
        ParDo.of(new JoinPCollsDoFn()).withSideInputs(lookupCollection));

class JoinPCollsDoFn<...> extends DoFn<...> {
  @ProcessElement
  public void processElement(ProcessContext c) {
    // Look up the matching record for this element in the side input map.
    Map<...> siMap = c.sideInput(lookupCollection);
    String lookupKey = c.element().lookupKey;
    AugmentedElement result = c.element().mergeWith(siMap.get(lookupKey));
    c.output(result);
  }
}
FWIW, this is a bit pseudo-codey, but it is a snippet of what you'd like to do. Let me know if you want to have further clarification.

Streaming .csv files into Cloud Storage using Pub/Sub

General question, if anyone can point me in the right direction: what is the best way to get incoming streaming .csv files into BigQuery (with some transformations applied using Dataflow) at a large scale, using Pub/Sub?
I'm thinking of using Pub/Sub to handle the many multiple large raw streams of incoming .csv files.
For example, the approach I'm thinking of is:
1. incoming raw .csv file > 2. Pub/Sub > 3. Cloud Storage > 4. Cloud Function (to trigger Dataflow) > 5. Dataflow (to transform) > 6. BigQuery
Let me know if there are any issues with this approach at scale, or a better alternative.
If that is a good approach, how do I get Pub/Sub to pick up the .csv files, and how do I construct this?
Thanks
Ben
There are a couple of different ways to approach this, but much of your use case can be solved using the Google-provided Dataflow templates. When using the templates, the light transformations can be done within a JavaScript UDF. This saves you from needing to maintain an entire pipeline and lets you write only the transformations necessary for your incoming data.
If you're accepting many files as input streamed to Cloud Pub/Sub, remember that Cloud Pub/Sub has no guarantees on ordering, so records from different files would likely get intermixed in the output. If you're looking to capture an entire file as-is, uploading directly to GCS would be the better approach.
Using the provided templates, either Cloud Pub/Sub to BigQuery or GCS to BigQuery, you could utilize a simple UDF to transform the data from CSV format to a JSON format matching the BigQuery output table schema.
For example, if you had CSV records such as:
transactionDate,product,retailPrice,cost,paymentType
2018-01-08,Product1,99.99,79.99,Visa
You could write a UDF to transform that data into your output schema as such:
function transform(line) {
  var values = line.split(',');

  // Construct output and add transformations
  var obj = new Object();
  obj.transactionDate = values[0];
  obj.product = values[1];
  obj.retailPrice = values[2];
  obj.cost = values[3];
  // JavaScript coerces the numeric strings when subtracting/dividing
  obj.marginPct = (obj.retailPrice - obj.cost) / obj.retailPrice;
  obj.paymentType = values[4];

  var jsonString = JSON.stringify(obj);
  return jsonString;
}

Determining size of PCollection

I am writing a Dataflow job which will read data from GCS and BigQuery.
This job will consolidate the data read from the two sources. The consolidated data is just a String.
Then the job will publish the consolidated data to an external API. A custom sink is written to publish the consolidated data.
The external API will not allow publishing if the consolidated data is more than 1 GB.
I just want to fail the Dataflow job if the consolidated data is more than 1 GB. How can I get the size of the data present in a PCollection?
Currently I am determining the size using the code below:
private static class CalculateSize extends PTransform<PCollection<String>, PCollection<Long>> {
  private static final long serialVersionUID = -7383871712471335638L;

  @Override
  public PCollection<Long> apply(PCollection<String> input) {
    return input
        // Emit the length of each element...
        .apply(ParDo.named("IndividualSize").of(new DoFn<String, Long>() {
          @Override
          public void processElement(ProcessContext c) throws Exception {
            c.output((long) c.element().length());
          }
        }))
        // ...and sum those lengths into a single global total.
        .apply(Combine.globally(new Sum.SumLongFn()));
  }
}
Is there any other better way to find the size?
The code you posted is the correct way to do this. Determining approximately how much space your data will take up when written to your sink in its expected format is entirely sink-specific, and Dataflow cannot do this for you. So writing a function to compute it manually is the best approach.
Note that you need to account for different sources of overhead. E.g. if your sink were, say, a CSV file, then simply adding up the lengths of the individual record fields would give you an underestimate of the number of bytes the file would take up. You would need to account for the commas, spaces, newlines, quotes, multi-byte characters, etc. This overhead is also entirely format-specific.
But if it is only important to make sure you don't exceed 1 GB, you can simply scale up your approximation somewhat pessimistically.
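To actually fail the job once the size is known, a minimal sketch in the same (older) Dataflow SDK style as the snippet above; the 1.1 safety factor and the idea of failing by throwing from a DoFn are assumptions:
// Apply a pessimistic safety factor and fail the bundle (and eventually the job)
// if the estimated size exceeds the external API's 1 GB limit.
final long maxBytes = 1L << 30;        // 1 GB
final double safetyFactor = 1.1;       // assumed allowance for format overhead

input
    .apply(new CalculateSize())
    .apply(ParDo.named("EnforceSizeLimit").of(new DoFn<Long, Long>() {
      @Override
      public void processElement(ProcessContext c) throws Exception {
        if (c.element() * safetyFactor > maxBytes) {
          throw new IllegalStateException(
              "Consolidated data too large: ~" + c.element() + " bytes");
        }
        c.output(c.element());
      }
    }));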

Using Dataflow to Remove Duplicates

I have a large data file (1 TB) to import into BigQuery. Each line contains a key. While importing the data and creating my PCollection to export to BigQuery, I'd like to ensure that I am not importing duplicate records based on this key value. What would be the most efficient approach to doing this in my Java program using Dataflow?
Thanks
The following might be worth a look
https://cloud.google.com/dataflow/java-sdk/JavaDoc/com/google/cloud/dataflow/sdk/transforms/RemoveDuplicates
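For instance, a minimal sketch of that transform applied to whole records (assuming each record is read as a String; in current Apache Beam releases the transform has been renamed Distinct):
// Hedged sketch: de-duplicate whole records with RemoveDuplicates.
PCollection<String> records = ...;                        // one String per input line
PCollection<String> distinctRecords =
    records.apply(RemoveDuplicates.<String>create());     // keeps one copy of each record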
The GroupByKey concept in Dataflow allows arbitrary groupings, which can be leveraged to remove duplicate keys from a PCollection.
The most generic approach to this problem would be:
1. read from your source file, producing a PCollection of input records,
2. use a ParDo transform to separate keys and values, producing a PCollection of KV<Key, Value>,
3. perform a GroupByKey operation on it, producing a PCollection of KV<Key, Iterable<Value>>,
4. use a ParDo transform to select which value mapped to the given key should be written, producing a PCollection of KV<Key, Value>,
5. use a ParDo transform to format the data for writing,
6. finally, write the results to BigQuery or any other sink.
Some of these steps may be omitted if you are solving a particular special case of the generic problem.
In particular, if the entire record is considered a key, the problem can be simplified to just running a Count transform and iterating over the resulting PCollection.
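A rough sketch of that simplified case, assuming the records are Strings:
// Hedged sketch: when the whole record is the key, Count.perElement() both groups
// and counts, and each distinct record is kept exactly once.
PCollection<String> records = ...;
PCollection<String> distinctRecords = records
    .apply(Count.<String>perElement())                       // KV<record, occurrence count>
    .apply(ParDo.of(new DoFn<KV<String, Long>, String>() {
      @ProcessElement
      public void processElement(ProcessContext c) {
        c.output(c.element().getKey());                       // emit the record once
      }
    }));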
Here's an approximate code example for GroupByKey:
PCollection<KV<String, Doc>> urlDocPairs = ...;
PCollection<KV<String, Iterable<Doc>>> urlToDocs =
urlDocPairs.apply(GroupByKey.<String, Doc>create());
PCollection<KV<String, Doc>> results = urlToDocs.apply(
ParDo.of(new DoFn<KV<String, Iterable<Doc>>, KV<String, Doc>>() {
public void processElement(ProcessContext c) {
String url = c.element().getKey();
Iterable<Doc> docsWithThatUrl = c.element().getValue();
// return a pair of url and an element from Iterable<Doc>.
}}));
You can also use org.apache.beam.sdk.transforms.Reshuffle:
https://beam.apache.org/releases/javadoc/2.0.0/index.html?org/apache/beam/sdk/transforms/Reshuffle.html
https://www.tabnine.com/code/java/classes/org.apache.beam.sdk.transforms.Reshuffle