Determine if a PCollection is empty or not - google-cloud-dataflow

How can I check whether a PCollection is empty before writing it out to a text file in Apache Beam (2.1.0)?
What I'm trying to do here is break a file into a number of PCollections, where that number is given as a pipeline parameter via a ValueProvider. Since the ValueProvider is not available at pipeline construction time, I declare a fixed count of 26 (the total number of letters in the alphabet, which is the maximum a user can input) so that something is available for .withOutputTags(). That gives me 26 tuple tags from which I have to retrieve PCollections before writing to text files. Only as many tags as the user actually inputs will be populated; the rest remain empty. Hence I want to ignore the empty PCollections returned by some of the tags before I apply TextIO.write(). A rough sketch of this setup is shown below.
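A rough sketch of that setup with the Beam Java SDK (the routing logic, variable names, and output handling are illustrative; lines stands for the input PCollection<String>):

// Pre-declare 26 tags because the real count is only known at runtime via the ValueProvider.
List<TupleTag<String>> tags = new ArrayList<>();
for (int i = 0; i < 26; i++) {
  tags.add(new TupleTag<String>() {});  // anonymous subclass preserves the element type
}
TupleTagList additionalTags = TupleTagList.empty();
for (int i = 1; i < 26; i++) {
  additionalTags = additionalTags.and(tags.get(i));
}

PCollectionTuple outputs = lines.apply(
    ParDo.of(new DoFn<String, String>() {
      @ProcessElement
      public void processElement(ProcessContext c) {
        // Illustrative routing: bucket each line by its first letter.
        int bucket = Character.toLowerCase(c.element().charAt(0)) - 'a';
        c.output(tags.get(bucket), c.element());
      }
    }).withOutputTags(tags.get(0), additionalTags));
// Only some of outputs.get(tags.get(i)) will be non-empty at runtime, and Beam offers no
// construction-time way to check which ones - that is the crux of the question.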

It sounds like what you actually want is to write a collection into multiple sets of files, where some sets may be empty. The proper way to do this is the DynamicDestinations API - see TextIO.write().to(DynamicDestinations), which will be available in Beam 2.2.0, due to be cut within the next couple of weeks. Meanwhile, if you'd like to use it, you can build a snapshot of Beam at HEAD yourself.
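For reference, later Beam releases expose the same capability through FileIO.writeDynamic(). A minimal sketch under that assumption, grouping lines by their first letter (the keying function and output path are illustrative, and lines is the input PCollection<String>):

// Each distinct destination key gets its own set of files; keys that receive no
// elements simply produce no files, so empty groups never need to be detected or filtered.
lines.apply(FileIO.<String, String>writeDynamic()
    .by(line -> line.substring(0, 1).toLowerCase())               // destination = first letter
    .withDestinationCoder(StringUtf8Coder.of())
    .via(TextIO.sink())
    .to("gs://my-bucket/output")                                  // illustrative output path
    .withNaming(letter -> FileIO.Write.defaultNaming("letters-" + letter, ".txt"))
    .withNumShards(1));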

Cloud Bigtable multi-prefix scan in dataflow

UPDATE: it seems that the recently released org.apache.beam.sdk.io.hbase-2.6.0 includes the HBaseIO.readAll() API. I tested it in Google Dataflow and it seems to be working. Are there any issues or pitfalls in using HBaseIO directly in a Google Cloud Dataflow setting?
BigtableIO.read takes a PBegin as input. I am wondering whether there is anything like SpannerIO's readAll API, where the input to BigtableIO's read could be a PCollection of read operations (e.g. Scan), producing a PCollection<Result> from those read operations.
I have a use case where I need to run multiple prefix scans, each with a different prefix, and the number of rows sharing a prefix can be small (a few hundred) or large (a few hundred thousand). If nothing like readAll is already available, I am thinking of having a DoFn issue a 'limit' scan and, if the limit scan doesn't reach the end of the key range, splitting the remainder into smaller chunks. In my case the key space is uniformly distributed, so the number of remaining rows can be estimated well from the last scanned row (assuming all keys smaller than the last scanned key are returned by the scan).
Apologies if similar questions have been asked before.
HBaseIO is not compatible with the Bigtable HBase connector due to its region locator logic, and we haven't implemented the SplittableDoFn API for Bigtable yet.
How big are your rows? Are they small enough that scanning a few hundred thousand rows can be handled by a single worker?
If yes, then I'll assume that the expensive work you are trying to parallelize is further down in your pipeline. In that case, you can (a sketch follows these steps):
1. Create a subclass of AbstractCloudBigtableTableDoFn.
2. In the DoFn, use the provided client directly, issuing a scan for each prefix element.
3. Assign each row resulting from the scan a shard id and emit it as a KV(shard id, row). The shard id should be an incrementing int mod some multiple of the number of workers.
4. Do a GroupByKey after the custom DoFn to fan out the shards. It's important to do a GroupByKey to allow for fanout; otherwise a single worker will have to process all of the emitted rows for a prefix.
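A minimal sketch of steps 1-3, assuming the bigtable-hbase-beam AbstractCloudBigtableTableDoFn and the HBase client API; the table name and shard count are illustrative:

import com.google.cloud.bigtable.beam.AbstractCloudBigtableTableDoFn;
import com.google.cloud.bigtable.beam.CloudBigtableConfiguration;
import org.apache.beam.sdk.transforms.DoFn.ProcessElement;
import org.apache.beam.sdk.values.KV;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

class PrefixScanFn extends AbstractCloudBigtableTableDoFn<String, KV<Integer, Result>> {
  private static final int NUM_SHARDS = 100;  // pick some multiple of the expected worker count
  private transient int nextShard = 0;

  PrefixScanFn(CloudBigtableConfiguration config) {
    super(config);
  }

  @ProcessElement
  public void processElement(ProcessContext c) throws Exception {
    String prefix = c.element();
    Scan scan = new Scan().setRowPrefixFilter(Bytes.toBytes(prefix));
    try (Table table = getConnection().getTable(TableName.valueOf("my-table"));  // illustrative
         ResultScanner scanner = table.getScanner(scan)) {
      for (Result row : scanner) {
        // Round-robin shard ids so the GroupByKey in step 4 can fan rows out across workers.
        c.output(KV.of(nextShard++ % NUM_SHARDS, row));
      }
    }
  }
}

Note that a real pipeline will also need a Coder registered for the Result values in the KVs.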
If your rows are big and you need to split each prefix scan across multiple workers, then you will have to augment the above approach:
1. In main(), issue a SampleRowKeys request, which will give rough split points.
2. Insert a step in your pipeline before the manual scanning DoFn to split the prefixes using the results from SampleRowKeys. For example, if the prefix is 'a' and SampleRowKeys contains 'ac', 'ap', 'aw', then the ranges it should emit would be [a, ac), [ac, ap), [ap, aw), [aw, b). Assign a shard id and group by it.
3. Feed the resulting ranges to the manual scan step from above. A sketch of the range-splitting step follows.
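A rough sketch of that range-splitting step, assuming the prefixes and sampled keys are plain strings and that sampleKeys (from the SampleRowKeys response) is sorted; the end-of-prefix computation is simplified:

// Split a prefix into [start, end) ranges at the sampled keys that fall inside it,
// e.g. prefix "a" with samples [ac, ap, aw] -> [a, ac), [ac, ap), [ap, aw), [aw, b).
// KV here is Beam's org.apache.beam.sdk.values.KV.
static List<KV<String, String>> splitPrefix(String prefix, List<String> sampleKeys) {
  // Simplified "next prefix": bump the last character, so "a" -> "b".
  String prefixEnd = prefix.substring(0, prefix.length() - 1)
      + (char) (prefix.charAt(prefix.length() - 1) + 1);
  List<KV<String, String>> ranges = new ArrayList<>();
  String start = prefix;
  for (String sample : sampleKeys) {
    if (sample.compareTo(prefix) > 0 && sample.compareTo(prefixEnd) < 0) {
      ranges.add(KV.of(start, sample));  // [start, sample)
      start = sample;
    }
  }
  ranges.add(KV.of(start, prefixEnd));   // closing range, e.g. [aw, b)
  return ranges;
}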

Split a KV<K,V> PCollection into multiple PCollections

Hi, after performing a GroupByKey on a KV PCollection, I need to:
1) Make every element in that PCollection a separate individual PCollection.
2) Insert the records in those individual PCollections into a BigQuery Table.
Basically my intention is to create a dynamic date partition in the BigQuery table.
How can I do this?
An example would really help.
For Google Dataflow to be able to perform the massive parallelisation that makes it one of its kind (as a service on the public cloud), the job flow needs to be predefined before it is submitted to the Google Cloud console. Every time you execute the jar file that contains your pipeline code (which includes the pipeline options and the transforms), a JSON file with the description of the job is created and submitted to the Google Cloud platform. The managed service then uses this to execute your job.
For the use case mentioned in the question, the input PCollection would have to be split into as many PCollections as there are unique dates. For that split, the tuple tags needed to split the collection would have to be created dynamically, which is not possible at this time. Creating tuple tags dynamically is not allowed because it would prevent the job description JSON file from being created up front, and it defeats the whole design/purpose with which Dataflow was built.
I can think of a couple of solutions to this problem (each with its own pros and cons):
Solution 1 (a workaround for the exact use case in the question):
Write a Dataflow transform that takes the input PCollection and, for each element in the input:
1. Checks the date of the element.
2. Appends the date to a predefined BigQuery table name as a partition decorator (in the format YYYYMMDD).
3. Makes an HTTP request to the BQ API to insert the row into the table named with that decorator.
You will have to take the cost perspective into consideration with this approach, because there is a single HTTP request for every element rather than the one BQ load job that the BigQueryIO Dataflow SDK module would have issued.
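A rough sketch of Solution 1 inside a DoFn, assuming the google-cloud-bigquery client library for the per-row insert (the answer only specifies "an HTTP request to the BQ API"); the dataset, table, element type, and field names are hypothetical:

// Routes each element to "my_table$YYYYMMDD" via a streaming insert.
// Uses com.google.cloud.bigquery: BigQuery, BigQueryOptions, TableId, InsertAllRequest, InsertAllResponse.
// In practice, create the BigQuery client once in @Setup rather than per element.
@ProcessElement
public void processElement(ProcessContext c) {
  MyRecord record = c.element();  // hypothetical element type with getDate()/getValue()
  String partition = new java.text.SimpleDateFormat("yyyyMMdd").format(record.getDate());
  TableId tableId = TableId.of("my_dataset", "my_table$" + partition);  // partition decorator

  BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
  InsertAllResponse response = bigquery.insertAll(
      InsertAllRequest.newBuilder(tableId)
          .addRow(java.util.Collections.singletonMap("value", record.getValue()))
          .build());
  if (response.hasErrors()) {
    // Inspect the per-row errors and retry or log as appropriate.
  }
}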
Solution 2 (the best practice that should be followed in these types of use cases):
1. Run the Dataflow pipeline in streaming mode instead of batch mode.
2. Define a time window appropriate to the scenario in which it is being used.
3. For the PCollection in each window, write it to a BQ table with the decorator being the date of the time window itself (a sketch follows).
You will have to consider rearchitecting your data source to send data to Dataflow in real time, but in return you get a dynamically date-partitioned BigQuery table whose contents reflect your data processing in near real time.
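A minimal sketch of Solution 2, assuming a later Beam SDK where BigQueryIO.write() accepts a per-element table function; the window size, project/dataset/table names, and schema are illustrative, and rows stands for the streaming PCollection<TableRow>:

// Window the stream, then derive the partition decorator from each element's window.
// Uses BigQueryIO, TableDestination, ValueInSingleWindow (org.apache.beam.sdk.io.gcp.bigquery),
// Window/FixedWindows/IntervalWindow (Beam windowing), and Joda-Time's DateTimeFormat.
rows
    .apply(Window.into(FixedWindows.of(Duration.standardDays(1))))
    .apply(BigQueryIO.writeTableRows()
        .to((ValueInSingleWindow<TableRow> input) -> {
          IntervalWindow window = (IntervalWindow) input.getWindow();
          String partition = window.start().toString(DateTimeFormat.forPattern("yyyyMMdd"));
          return new TableDestination("my_project:my_dataset.my_table$" + partition, null);
        })
        .withSchema(tableSchema)  // TableSchema defined elsewhere
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));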
References:
Google BigQuery table decorators
Google BigQuery table insert using HTTP POST request
How job description files work
Note: Please say so in the comments if more detail is needed and I will elaborate the answer with code snippets.

Is there a way to tell Google Cloud Dataflow that the data coming in is already ordered?

We have an input data source that is approximately 90 GB (it can be either a CSV or XML, it doesn't matter) that contains an already ordered list of data. For simplicity, you can think of it as having two columns: time column, and a string column. The hundreds of millions of rows in this file are already ordered by the time column in ascending order.
In our Google Cloud Dataflow pipeline, we have modeled each row as an element in our PCollection, and we apply DoFn transformations to the string field (e.g. count the number of characters in the string that are uppercase, etc.). This works fine.
However, we then need to apply functions that are supposed to be calculated over a block of time (e.g. five minutes) with a one-minute overlap. So we are thinking about using a sliding window function (even though the data is bounded).
However, the calculation logic that needs to be applied over these five-minute windows assumes that the data is ordered logically (i.e. ascending) by the time field. My understanding is that even when using these windowing functions, one cannot assume that the PCollection elements within each window are ordered in any way, so one would need to manually iterate through every window's contents and reorder them, right? This seems like a huge waste of computational power, since the incoming data is already ordered. So is there a way to tell/inform Google Cloud Dataflow that the input data is ordered, so that the order is maintained even within the windows?
On a minor note, I had another question: my understanding is that if the data source is unbounded, there is never an "overall aggregation" step that would ever execute, as that never really makes sense (there is no end to the incoming data); however, if one uses a windowing function on bounded data, there is a true end state, which corresponds to the point when all the data has been read from the CSV file. Therefore, is there a way to tell Google Cloud Dataflow to do a final calculation once all the data has been read in, even though we are using a windowing function to divide the data up?
SlidingWindows sounds like the right solution for your problem. The ordering of the incoming data is not preserved across a GroupByKey, so informing Dataflow of that ordering would not currently be useful. However, the batch Dataflow runner already sorts by timestamp in order to implement windowing efficiently, so for simple windowing like SlidingWindows, your code will see the data in order.
If you want to do a final calculation after doing some windowed calculations on a bounded data set, you can re-window your data into the global window again, and do your final aggregation after that:
p.apply(Window.into(new GlobalWindows()));
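Putting the two parts together, a minimal sketch assuming the five-minute windows with a one-minute overlap from the question, and an illustrative count/sum as the per-window and final aggregations (rows stands for the timestamped input PCollection<String>):

// Per-window calculation on sliding windows, then re-window into the global
// window so one final aggregation runs over the whole bounded data set.
PCollection<Long> perWindowCounts = rows
    .apply(Window.into(
        SlidingWindows.of(Duration.standardMinutes(5))
            .every(Duration.standardMinutes(4))))         // consecutive windows overlap by 1 minute
    .apply(Count.<String>globally().withoutDefaults());   // per-window aggregation

PCollection<Long> total = perWindowCounts
    .apply(Window.into(new GlobalWindows()))              // back to a single global window
    .apply(Sum.longsGlobally());                          // fires once all input has been consumed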

Multiple response crosstabs/frequencies based on categorical variable in SPSS

I've just started using SPSS after using R for about five years (I'm not happy about it, but you do what your boss tells you). I'm just trying to do a simple count based on a categorical variable.
I have a data set where I know a person's year of birth. I've recoded into a new variable so that I have their generation as a categorical variable, named Generation. I also have a question that allows for multiple responses. I want a frequency of how many times each response was collected.
I've created a multiple response variable (Analyze > Multiple Response > Define Variable Sets). However, when I go to create crosstabs, the Generation variable isn't available to select. I've tried googling, but the videos I have watched all use numeric row variables.
Here is a google sheet that shows what I have and what I'm looking to achieve:
https://docs.google.com/spreadsheets/d/1oIMrhYv33ZQwPz3llX9mfxulsxsnZF9zaRf9Gh37tj8/edit#gid=0
Is it possible to do this?
First of all, to double check, when you say you go to crosstabs, is this Analyze > Multiple Response > Crosstabs (and not Analyze > Descriptive Statistics > Crosstabs)?
Second, with multiple response data, you are much better off working with Custom Tables. Start by defining the set with Analyze > Custom Tables > Multiple Response Sets. If you save your data file, those definitions are saved with it (unlike sets defined via the Multiple Response procedure).
Then you can just use Custom Tables to tabulate multiple response data pretty much as if it were a regular variable, but you have more choices about appropriate statistics, tests of significance, etc. There is no need in the CTABLES code to explicitly list the set members.
Try Custom Tables (CTABLES), although this is an additional add-on module that you need to have a licence for:
CTABLES /TABLE Generation[c] by (1_a+ 1_b + 1_c)[s][sum f8.0 'Count'].

blackberry reading a text file and updating after sort

I am successfully able to read and print the contents of a text file. My text file contains 5 data entries, such as:
Rashmi 120
Prema 900
It must sort only the integers, in descending order, and swap the respective names attached to them; the first column of serial numbers must remain the same. Each time a new entry is made, that score must be compared to the existing 5 records and placed accordingly, with the new name and score.
Since this is BlackBerry programming and the BlackBerry APIs don't support Collections.sort, please tell me how to do this. I tried using SimpleSortingVector but I am unable to put it into code.
I believe you need to start with your own logic, for example:
1) sorting depends on comparison
2) before making any comparison, you need to split each string at the space
3) after splitting, save the names and the numbers in separate arrays
4) compare the numbers and sort accordingly
5) after this, merge the array contents back together using the indices
I'm just suggesting one way; it may not be perfect, but drilling down will refine the logic and the use of the API. A rough sketch of these steps follows.
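A rough sketch of steps 2-5, using only CLDC-level String and array operations (no String.split or Collections.sort); the sample entries and the single-space "name score" format are taken from the question:

// Sort "name score" entries by score, descending, keeping name and score paired.
String[] entries = { "Rashmi 120", "Prema 900" };   // as read from the text file

String[] names = new String[entries.length];
int[] scores = new int[entries.length];

// Steps 2-3: split each line at the space into parallel arrays.
for (int i = 0; i < entries.length; i++) {
  int space = entries[i].indexOf(' ');
  names[i] = entries[i].substring(0, space);
  scores[i] = Integer.parseInt(entries[i].substring(space + 1).trim());
}

// Step 4: simple selection sort, descending by score, swapping both arrays together.
for (int i = 0; i < scores.length - 1; i++) {
  int max = i;
  for (int j = i + 1; j < scores.length; j++) {
    if (scores[j] > scores[max]) {
      max = j;
    }
  }
  int tmpScore = scores[i]; scores[i] = scores[max]; scores[max] = tmpScore;
  String tmpName = names[i]; names[i] = names[max]; names[max] = tmpName;
}

// Step 5: merge back into lines, now in descending score order; the serial number stays 1..n.
for (int i = 0; i < names.length; i++) {
  System.out.println((i + 1) + " " + names[i] + " " + scores[i]);
}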
