UPDATE: it seems that the recently released org.apache.beam.sdk.io.hbase-2.6.0 includes the HBaseIO.readAll() API. I tested it in Google Dataflow and it seems to be working. Are there any issues or pitfalls in using HBaseIO directly in a Google Cloud Dataflow setting?
The BigtableIO.read takes PBegin as an input. I am wondering if there is anything like SpannerIO's readAll API, where the input to BigtableIO's read could be a PCollection of ReadOperations (e.g., Scan), producing a PCollection<Result> from those ReadOperations.
I have a use case where I need to run multiple prefix scans, each with a different prefix, and the number of rows sharing a prefix can be small (a few hundred) or large (a few hundred thousand). If nothing like ReadAll is already available, I am thinking of a DoFn that issues a 'limit' scan and, if the limit scan doesn't reach the end of the key range, splits the remainder into smaller chunks. In my case the key space is uniformly distributed, so the number of remaining rows can be estimated well from the last scanned row (assuming all keys smaller than the last scanned key have been returned by the scan).
Apologies if similar questions have been asked before.
HBaseIO is not compatible with the Bigtable HBase connector due to the region locator logic, and we haven't implemented the SplittableDoFn API for Bigtable yet.
How big are your rows? Are they small enough that scanning a few hundred thousand rows can be handled by a single worker?
If yes, then I'll assume that the expensive work you are trying to parallelize is further down in your pipeline. In this case, you can (see the sketch after this list):
create a subclass of AbstractCloudBigtableTableDoFn
in the DoFn, use the provided client directly, issuing a scan for each prefix element
Each row resulting from the scan should be assigned a shard id and emitted as a KV(shard id, row). The shard id should be an incrementing int mod some multiple of the number of workers.
Then do a GroupBy after the custom DoFn to fan out the shards. It's important to do a GroupByKey to allow for fanout, otherwise a single worker will have to process all of the emitted rows for a prefix.
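A rough sketch of those steps, assuming the bigtable-hbase-beam connector's AbstractCloudBigtableTableDoFn and its getConnection() helper; the class, table, and shard names are illustrative, and the exact connector surface may differ slightly between versions:

```java
import com.google.cloud.bigtable.beam.AbstractCloudBigtableTableDoFn;
import com.google.cloud.bigtable.beam.CloudBigtableConfiguration;
import org.apache.beam.sdk.values.KV;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Scans one prefix per input element and emits KV(shardId, row) so that a
// following GroupByKey can fan the rows out across workers. A coder for Result
// (e.g. the one shipped with beam-sdks-java-io-hbase) must be available.
class PrefixScanDoFn extends AbstractCloudBigtableTableDoFn<String, KV<Integer, Result>> {

  private final String tableId;
  private final int numShards;          // e.g. a small multiple of the expected worker count
  private transient int nextShard = 0;  // rotating shard assignment, reset per worker

  PrefixScanDoFn(CloudBigtableConfiguration config, String tableId, int numShards) {
    super(config);
    this.tableId = tableId;
    this.numShards = numShards;
  }

  @ProcessElement
  public void processElement(ProcessContext c) throws Exception {
    // getConnection() is the client managed by AbstractCloudBigtableTableDoFn.
    Table table = getConnection().getTable(TableName.valueOf(tableId));
    Scan scan = new Scan().setRowPrefixFilter(Bytes.toBytes(c.element()));
    try (ResultScanner scanner = table.getScanner(scan)) {
      for (Result row : scanner) {
        c.output(KV.of(nextShard, row));
        nextShard = (nextShard + 1) % numShards;
      }
    }
  }
}
```

The GroupByKey described in the last step is then applied directly to this DoFn's output.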
If your rows are big and you need to split each prefix scan across multiple workers then you will have to augment the above approach:
in main(), issue a SampleRowKeys request, which will give rough split points
insert a step in your pipeline before the manual scanning DoFn to split the prefixes using the results from SampleRowKeys, i.e. if the prefix is 'a' and SampleRowKeys contains 'ac', 'ap', 'aw', then the ranges it should emit would be [a, ac), [ac, ap), [ap, aw), [aw, b). Assign a shard id and group by it.
feed the split ranges to the manual scan step from above (a sketch of the splitting logic follows below).
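The splitting itself is plain range arithmetic. An illustrative helper matching the [a, ac), [ac, ap), [ap, aw), [aw, b) example above; keys are shown as Strings for readability, whereas a real pipeline would work with the byte[] row keys returned by SampleRowKeys:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Splits the key range covered by a prefix at the sampled row keys that fall
// strictly inside it, producing half-open ranges [start, end).
class PrefixSplitter {

  static List<String[]> split(String prefix, String prefixEnd, List<String> sortedSampleKeys) {
    List<String[]> ranges = new ArrayList<>();
    String start = prefix;
    for (String sample : sortedSampleKeys) {
      if (sample.compareTo(start) > 0 && sample.compareTo(prefixEnd) < 0) {
        ranges.add(new String[] {start, sample});
        start = sample;
      }
    }
    ranges.add(new String[] {start, prefixEnd}); // final range up to the prefix's upper bound
    return ranges;
  }

  public static void main(String[] args) {
    // Prints [a, ac), [ac, ap), [ap, aw), [aw, b)
    for (String[] r : split("a", "b", Arrays.asList("ac", "ap", "aw"))) {
      System.out.printf("[%s, %s)%n", r[0], r[1]);
    }
  }
}
```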
I have a Spring Boot application that is under moderate load. I want to collect metric data for a few of the operations of my app. I am mainly interested in Counters and Timers.
I want to count the number of times a method was invoked (number of invocations over a window, for example over the last day, week, or month)
If the method produces an unexpected result, increase a failure count and publish a few tags with that metric
I want to time a couple of expensive methods, i.e. see how much time each method took, and also publish a few tags with the metric for more context (see the sketch below)
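For reference, the instrumentation described above looks roughly like this with Micrometer; this is only a sketch, and the registry wiring, meter names, tags, and the OrderService class are made up:

```java
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;

class OrderService {

  private final MeterRegistry registry;

  OrderService(MeterRegistry registry) {
    this.registry = registry;
  }

  void placeOrder(String region) {
    // Count every invocation, tagged with contextual information.
    Counter.builder("orders.placed")
        .tag("region", region)
        .register(registry)
        .increment();

    // Time the expensive part of the operation.
    Timer.builder("orders.persist.time")
        .tag("region", region)
        .register(registry)
        .record(() -> persistOrder(region));
  }

  private void persistOrder(String region) {
    try {
      Thread.sleep(50); // placeholder for the real expensive work
    } catch (InterruptedException e) {
      // Count unexpected results/failures with tags describing the reason.
      Counter.builder("orders.failed").tag("reason", "interrupted").register(registry).increment();
      Thread.currentThread().interrupt();
    }
  }
}
```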
I have tried StatsD-SignalFx and Micrometer-InfluxDB, but both of these solutions have issues I could not solve:
StatsD aggregates data over the flush window, and due to that aggregation the metric tags get mixed up. For example, if I send 10 events with different tag values in a flush window, the StatsD agent aggregates those events and publishes only one event with counter = 10, so I am not sure which tag values it sends with the aggregated data.
The Micrometer-InfluxDB setup has its own problems, one of them being that Micrometer sends 0 values for counters when no new metric is produced, and that fake (0-value) counter reuses the tag values from the last valid (non-zero) counter.
I am not sure how, but Micrometer also seems to do some aggregation on the client side, in the MeterRegistry I believe, because I was getting a few counters with a value of 0.5 in InfluxDB.
Next, I am planning to explore Micrometer/StatsD + Telegraf + Influx + Grafana to see if it suits my use case.
Questions:
How can I avoid metric aggregation until the data reaches the data store (InfluxDB)? I can do the required aggregation in Grafana.
Is there any standard solution to the problem that I am trying to solve?
Any other suggestion or direction for my use case?
How can I check whether a PCollection is empty before writing it out to a text file in Apache Beam (2.1.0)?
What I'm trying to do here is break a file into PCollections, with the number of splits given as a parameter to the pipeline via a ValueProvider. As this ValueProvider is not available at pipeline construction time, I declare a reasonable upper bound of 26 (the number of letters in the alphabet, and the maximum a user can input) so that it is available for .withOutputTags(). So I get 26 tuple tags, from which I have to retrieve PCollections before writing to text files. Only the few tags corresponding to the user's input will actually get populated; the rest are all empty. Hence I want to ignore the empty PCollections returned by some of the tags before applying TextIO.write().
It seems that what you actually want is to write a collection into multiple sets of files, where some sets may be empty. The proper way to do this is the DynamicDestinations API - see TextIO.write().to(DynamicDestinations), which will be available in Beam 2.2.0, due to be cut within the next couple of weeks. Meanwhile, if you'd like to use it, you can build a snapshot of Beam at HEAD yourself.
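In later Beam releases the same idea is usually expressed with FileIO.writeDynamic(), which only produces files for destinations that actually receive elements. A rough sketch, assuming each element has already been tagged with its destination; the bucket path and naming scheme are made up:

```java
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.transforms.Contextful;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

PCollection<KV<String, String>> tagged = ...; // KV(destination tag, line)

tagged.apply(
    FileIO.<String, KV<String, String>>writeDynamic()
        .by((KV<String, String> kv) -> kv.getKey())                  // destination = the tag
        .via(Contextful.fn((KV<String, String> kv) -> kv.getValue()),
             TextIO.sink())                                          // what to write per element
        .to("gs://my-bucket/output")                                 // hypothetical base path
        .withDestinationCoder(StringUtf8Coder.of())
        .withNaming(tag -> FileIO.Write.defaultNaming("split-" + tag, ".txt")));
```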
Hi, after performing a GroupByKey on a KV PCollection, I need to:
1) Make every element in that PCollection a separate individual PCollection.
2) Insert the records in those individual PCollections into a BigQuery Table.
Basically my intention is to create a dynamic date partition in the BigQuery table.
How can I do this?
An example would really help.
For Google Dataflow to be able to perform the massive parallelisation that makes it one of a kind (as a service on the public cloud), the job flow needs to be predefined before it is submitted to Google Cloud. Every time you execute the jar file that contains your pipeline code (which includes the pipeline options and the transforms), a JSON file with the description of the job is created and submitted to the Google Cloud platform. The managed service then uses this to execute your job.
For the use case mentioned in the question, it demands that the input PCollection be split into as many PCollections as there are unique dates. For the split, the TupleTags needed to split the collection would have to be created dynamically, which is not possible at this time. Creating TupleTags dynamically is not allowed because that doesn't help in creating the job description JSON file, and it defeats the whole design/purpose with which Dataflow was built.
I can think of a couple of solutions to this problem (each having its own pros and cons):
Solution 1 (a workaround for the exact use case in the question):
Write a Dataflow transform that takes the input PCollection and, for each element in the input:
1. Checks the date of the element.
2. Appends the date to a pre-defined BigQuery table name as a decorator (in the format yyyyMMdd).
3. Makes an HTTP request to the BigQuery API to insert the row into the table, using the decorated table name.
You will have to take the cost perspective into consideration with this approach, because there is a single HTTP request for every element rather than the BigQuery load job that the BigQueryIO Dataflow SDK module would have issued (see the sketch below).
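A minimal sketch of steps 2 and 3, assuming the google-cloud-bigquery client library rather than hand-rolled HTTP; the dataset, table, and column names are made up:

```java
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.InsertAllRequest;
import com.google.cloud.bigquery.InsertAllResponse;
import com.google.cloud.bigquery.TableId;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.HashMap;
import java.util.Map;

class PartitionedInsert {

  private static final BigQuery BIGQUERY = BigQueryOptions.getDefaultInstance().getService();

  // Streams a single row into the partition matching the element's date,
  // e.g. "my_table$20180101".
  static void insertRow(LocalDate elementDate, String value) {
    String decorated = "my_table$" + elementDate.format(DateTimeFormatter.BASIC_ISO_DATE);
    TableId tableId = TableId.of("my_dataset", decorated);

    Map<String, Object> row = new HashMap<>();
    row.put("event_date", elementDate.toString());
    row.put("value", value);

    InsertAllResponse response =
        BIGQUERY.insertAll(InsertAllRequest.newBuilder(tableId).addRow(row).build());
    if (response.hasErrors()) {
      throw new RuntimeException("Insert failed: " + response.getInsertErrors());
    }
  }
}
```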
Solution 2 (best practice that should be followed in these type of use cases):
1. Run the dataflow pipeline in the streaming mode instead of batch mode.
2. Define a time window with whatever duration is suitable for the scenario in which it is being used.
3. For the `PCollection` in each window, write it to a BQ table with the decorator being the date of the time window itself.
You will have to consider rearchitecting your data source to send data to Dataflow in real time, but you will get a dynamically date-partitioned BigQuery table with the results of your data processing available in near real time (a sketch of the windowed write follows below).
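A rough sketch of step 3, assuming BigQueryIO's per-element table function, which lets the destination be derived from the element's window; the table spec is a placeholder and `schema` is assumed to be a TableSchema defined elsewhere:

```java
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.TableDestination;
import org.apache.beam.sdk.transforms.SerializableFunction;
import org.apache.beam.sdk.transforms.windowing.IntervalWindow;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.ValueInSingleWindow;
import org.joda.time.format.DateTimeFormat;

PCollection<TableRow> windowedRows = ...; // already assigned to time windows

windowedRows.apply(
    BigQueryIO.writeTableRows()
        .to(
            (SerializableFunction<ValueInSingleWindow<TableRow>, TableDestination>)
                input -> {
                  // Derive the partition decorator from the window's start time.
                  String day = DateTimeFormat.forPattern("yyyyMMdd")
                      .withZoneUTC()
                      .print(((IntervalWindow) input.getWindow()).start());
                  return new TableDestination(
                      "my_project:my_dataset.my_table$" + day, "daily partition");
                })
        .withSchema(schema) // a com.google.api.services.bigquery.model.TableSchema
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
```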
References -
Google Big Query Table Decorators
Google Big Query Table insert using HTTP POST request
How job description files work
Note: Please mention in the comments and I will elaborate the answer with code snippets if needed.
We have an input data source that is approximately 90 GB (it can be either a CSV or XML, it doesn't matter) that contains an already ordered list of data. For simplicity, you can think of it as having two columns: time column, and a string column. The hundreds of millions of rows in this file are already ordered by the time column in ascending order.
In our Google Cloud Dataflow pipeline, we have modeled each row as an element in our PCollection, and we apply DoFn transformations to the string field (e.g. counting the number of uppercase characters in the string). This works fine.
However, we then need to apply functions that are supposed to be calculated for a block of time (e.g. five minutes) with a one minute overlap. So, we are thinking about using a sliding windowing function (even though the data is bounded).
However, the calculation logic that needs to be applied over these five-minute windows assumes that the data is ordered logically (i.e. ascending) by the time field. My understanding is that even when using these windowing functions, one cannot assume that within each window the PCollection elements are ordered in any way, so one would need to manually iterate through every PCollection and reorder them, right? However, this seems like a huge waste of computational power, since the incoming data is already ordered. So is there a way to teach/inform Google Cloud Dataflow that the input data is ordered, and to maintain that order even within the windows?
On a minor note, I had another question: my understanding is that if the data source is unbounded, there is never an "overall aggregation" function that would ever execute, as it never really makes sense (since there is no end to the incoming data); however, if one uses a windowing function on bounded data, there is a true end state which corresponds to when all the data has been read from the CSV file. Therefore, is there a way to tell Google Cloud Dataflow to do a final calculation once all the data has been read in, even though we are using a windowing function to divide the data up?
SlidingWindows sounds like the right solution for your problem. The ordering of the incoming data is not preserved across a GroupByKey, so informing Dataflow of that would not be useful currently. However, the batch Dataflow runner does already sort by timestamp in order to implement windowing efficiently, so for simple windowing like SlidingWindows, your code will see the data in order.
If you want to do a final calculation after doing some windowed calculations on a bounded data set, you can re-window your data into the global window again, and do your final aggregation after that:
p.apply(Window.into(new GlobalWindows()));
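Putting the two steps together, a rough sketch assuming the rows have already been keyed and had their event timestamps set from the time column; the key type and the aggregations are made up, with the final global step taking the maximum per-window sum just as an example:

```java
import org.apache.beam.sdk.transforms.Max;
import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.transforms.Values;
import org.apache.beam.sdk.transforms.windowing.GlobalWindows;
import org.apache.beam.sdk.transforms.windowing.SlidingWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

// Elements keyed by some id, e.g. per-row uppercase counts, with timestamps attached.
PCollection<KV<String, Long>> scored = ...;

// Five-minute windows sliding every minute; per the note above, the batch runner
// sorts by timestamp, so the windowed logic sees the data in order.
PCollection<KV<String, Long>> perWindow =
    scored
        .apply(Window.<KV<String, Long>>into(
            SlidingWindows.of(Duration.standardMinutes(5)).every(Duration.standardMinutes(1))))
        .apply(Sum.longsPerKey());

// Re-window into the global window for one final, whole-dataset aggregation.
PCollection<Long> peakWindowedSum =
    perWindow
        .apply(Window.<KV<String, Long>>into(new GlobalWindows()))
        .apply(Values.<Long>create())
        .apply(Max.longsGlobally());
```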
I'm looking into grouping elements during the flow into batches based on a batch size.
In pseudocode:
PCollection[String].apply(Grouped.size(10))
Basically converting a PCollection[String] into a PCollection[List[String]] where each list now contains 10 elements. Since it is batch, if the elements don't divide evenly, the last batch would contain the leftover elements.
I have two ugly ideas involving windows with fake timestamps, or a GroupBy using keys based on a random index to distribute evenly, but this seems like too complex a solution for such a simple problem.
This question is similar to a variety of questions on how to batch elements. Take a look at these to get you started (a sketch using a later Beam transform follows after the links):
Can datastore input in google dataflow pipeline be processed in a batch of N entries at a time?
Partition data coming from CSV so I can process larger patches rather then individual lines
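For newer Beam releases, the built-in GroupIntoBatches transform covers this directly. A rough sketch; the single constant key is only for illustration and limits parallelism, so spread elements over several keys for large inputs:

```java
import org.apache.beam.sdk.transforms.GroupIntoBatches;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.WithKeys;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

PCollection<String> lines = ...;

PCollection<Iterable<String>> batches =
    lines
        // GroupIntoBatches requires keyed input.
        .apply(WithKeys.of(""))
        .apply(GroupIntoBatches.<String, String>ofSize(10))
        // Drop the key, leaving batches of up to 10 elements each
        // (the last batch may hold the leftovers).
        .apply(MapElements
            .into(TypeDescriptors.iterables(TypeDescriptors.strings()))
            .via((KV<String, Iterable<String>> kv) -> kv.getValue()));
```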