Making use of workers in Custom Sink - google-cloud-dataflow

I have a custom sink which will publish the final result from a pipeline to a repository.
I am getting the inputs for this pipeline from BigQuery and GCS.
The custom writer in the sink is called on all workers. It just collects the objects to be pushed and returns them as part of the WriteResult. Finally, I merge these records in CustomWriteOperation.finalize() and push them into my repository.
This works fine for smaller files. However, my repository will not accept a result larger than 5 MB, and it accepts no more than 20 writes per hour.
If I push the result from the workers, the limit on the number of writes will be violated. If I write it in CustomWriteOperation.finalize(), it may violate the size limit, i.e. 5 MB.
My current approach is to write in chunks in CustomWriteOperation.finalize(). As this is not executed on many workers, it might delay my job. How can I make use of workers in finalize(), and how can I specify the number of workers to be used inside a pipeline for a specific job (i.e. the write job)?
Or is there any better approach?

The sink API doesn't explicitly allow tuning of bundle size.
One workaround might be to use a ParDo to group records into bundles. For example, you can use a DoFn to randomly assign each record a key between 1 and N. You could then use a GroupByKey to group the records into KV<Integer, Iterable<Records>>. This should produce N groups of roughly the same size.
As a result, an invocation of Sink.Writer.write could write all the records with the same key at once and since write is invoked in parallel the bundles would be written in parallel.
However, since a given KV pair could be processed multiple times or in multiple workers at the same time, you will need to implement some mechanism to create a lock so that you only try to write each group of records once.
You will also need to handle failures and retries.
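As a rough illustration of the random-key idea above, here is a minimal sketch in plain Python rather than the Dataflow SDK. The function name shard_records and its parameters are hypothetical; in a real pipeline the key assignment would live in a DoFn and the grouping would be a GroupByKey.

```python
import random
from collections import defaultdict


def shard_records(records, num_shards, seed=None):
    """Assign each record a random key in 1..num_shards, then group by key.

    Hypothetical helper: it only mimics what a DoFn followed by a
    GroupByKey would do, so the resulting group sizes can be inspected.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for record in records:
        key = rng.randint(1, num_shards)  # the "DoFn" step: pick a key
        groups[key].append(record)        # the "GroupByKey" step
    return dict(groups)
```

With enough records, each of the N groups ends up with roughly len(records) / N entries, so N can be chosen to keep each group under the repository's size limit.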

So, if I understand correctly, you have a repository that
accepts no more than X write operations per hour (I suppose if you try to do more, you get an error from the API you're writing to), and
limits each write operation to no more than Y in size (with similar error reporting).
That means it is not possible to write more than X*Y data in 1 hour, so I suppose, if you want to write more than that, you would want your pipeline to wait longer than 1 hour.
Dataflow currently does not provide built-in support for enforcing either of these limits. However, it seems like you should be able to simply do retries with randomized exponential back-off to get around the first limitation (here's a good discussion), so it only remains to make sure individual writes are not too big.
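For reference, randomized exponential back-off could look roughly like the sketch below. The name call_with_backoff and all of its parameters are assumptions for illustration; a real writer would catch only the throttling error raised by the target API rather than every exception.

```python
import random
import time


def call_with_backoff(operation, max_attempts=5, base_delay=1.0,
                      max_delay=60.0, sleep=time.sleep):
    """Call operation(), retrying failures with randomized exponential back-off.

    Hypothetical helper. The delay ceiling doubles on every failed attempt
    (capped at max_delay) and the actual wait is drawn uniformly from
    [0, ceiling], so concurrent workers do not retry in lockstep.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, ceiling))
```

The sleep parameter is injectable only so that the waiting can be stubbed out when trying the function.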
Limiting individual writes can be done in your Writer class in the custom sink. You can maintain a buffer of records, have write() add to the buffer and flush it by issuing the API call (with exponential back-off, as mentioned) once it grows to just below the allowed write size, and flush one more time in close().
This way you will write bundles that are as big as possible but not bigger, and if you add retry logic, throttling will also be respected.
Overall, this seems to fit well in the Sink API.
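The buffering described above can be sketched as follows, in plain Python instead of the Sink API. BufferingWriter, send and max_bytes are hypothetical names; send stands in for the API call to the repository, which would be wrapped in retry logic in practice.

```python
class BufferingWriter:
    """Collect records and flush them before the buffer exceeds max_bytes.

    Hypothetical sketch: `send` receives a list of byte strings and stands
    in for the real API call. A single record larger than max_bytes is
    still sent on its own, so the caller has to rule that case out.
    """

    def __init__(self, send, max_bytes):
        self._send = send
        self._max_bytes = max_bytes
        self._buffer = []
        self._size = 0

    def write(self, record):
        # Flush first if adding this record would push us over the limit.
        if self._buffer and self._size + len(record) > self._max_bytes:
            self.flush()
        self._buffer.append(record)
        self._size += len(record)

    def flush(self):
        if self._buffer:
            self._send(list(self._buffer))
            self._buffer.clear()
            self._size = 0

    def close(self):
        # Final flush for whatever is left in the buffer.
        self.flush()
```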

I am working with Sam on this, and here are the actual limits imposed by our target system: 100 GB per API call, and a maximum of 25 API calls per day.
Given these limits, the retry method with back-off logic may cause the upload to take many days to complete since we don't have control on the number of workers.
Another approach would be to leverage FileBasedSink to write many files in parallel. Once all these files are written, finalize (or copyToOutputFiles) can combine files until total size reaches 100 GB and push to target system. This way we leverage parallelization from writer threads, and honor the limit from target system.
Thoughts on this, or any other ideas?
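To make the combining step in this proposal concrete, here is a small sketch of greedy batching by size. It is not FileBasedSink code; batch_files_by_size and its arguments are hypothetical, and file sizes are assumed to be known up front.

```python
def batch_files_by_size(file_sizes, max_batch_bytes):
    """Greedily group (name, size) pairs into batches of at most max_batch_bytes.

    Hypothetical helper: files are taken in the order given, and a new
    batch starts whenever the next file would push the current one over
    the limit.
    """
    batches, current, current_size = [], [], 0
    for name, size in file_sizes:
        if size > max_batch_bytes:
            raise ValueError(f"{name} alone exceeds the per-call limit")
        if current and current_size + size > max_batch_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches
```

Each returned batch corresponds to one API call, so the number of batches can be checked against the calls-per-day limit before pushing anything.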

Related

How are Dataflow bundles created after GroupBy/Combine?

Setup:
read from pubsub -> window of 30s -> group by user -> combine -> write to cloud datastore
Problem:
I'm seeing DataStoreIO writer errors as objects with similar keys are present in the same transaction.
Question:
I want to understand how my pipeline combines results into bundles after a group by/combine operation. I would expect the bundle to be created for every window after the combine. But apparently, a bundle can contain more than 2 occurrences of the same user?
Can re-execution (retries) of bundles cause this behavior?
Is this bundling dependent on the runner?
Is deduplication an option? If so, how would I best approach that?
Note that I'm not looking for a replacement for the datastore writer at the end of the pipeline, I already know that we can use a different strategy. I'm merely trying to understand how the bundling happens.
There are two answers to your question. One is specific to your use case, and the other is in general about bundling / windowing in streaming.
Specific to your pipeline
I am assuming that the 'key' for Datastore is the User ID? In that case, if you have events from the same user in more than one window, your GroupByKey or Combine operations will have one separate element for every pair of user+window.
So the question is: What are you trying to insert into datastore?
An individual user's resulting aggregate over all time? In that case, you'd need to use a Global Window.
A user's resulting aggregate for every 30 seconds in time? Then you need to use the window as part of the key you use to insert to datastore. Does that help / make sense?
Happy to help you design your pipeline to do what you want. Chat with me in the comments or via SO chat.
The larger question about bundling of data
Bundling strategies will vary by runner. In Dataflow, you should consider the following two factors:
Every worker is assigned a key range. Elements for the same key will be processed by the same worker.
Windows belong to single elements; but a bundle may contain elements from multiple windows. As an example, if the data freshness metric makes a big jump*, a number of windows may be triggered - and elements of the same key in different windows would be processed in the same bundle.
* When can data freshness jump suddenly? A stream with a single element that has a very old timestamp and is very slow to process may hold the watermark for a long time. Once this element is processed, the watermark may jump a lot, to the next oldest element (check out this lecture on watermarks).

Dataflow batch vs streaming: window size larger than batch size

Let's say we have log data with timestamps that can either be streamed into BigQuery or stored as files in Google Storage, but not streamed directly to the unbounded collection source types that Dataflow supports.
We want to analyse this data based on timestamp, either relatively or absolutely, e.g. "how many hits in the last 1 hour?" and "how many hits between 3pm and 4pm on 5th Feb 2018"?
Having read the documentation on windows and triggers, it's not clear how we would divide our incoming data into batches in a way that is supported by Dataflow if we want to have a large window - potentially we want to aggregate over the last day, 30 days, 3 months, etc.
For example, if our batched source is a BigQuery query, run every 5 mins, for the last 5 mins worth of data, will Dataflow keep the windows open between job runs, even though the data is arriving in 5 min chunks?
Similarly, if the log files are rotated every 5 mins, and we start Dataflow as a new file is saved to the bucket, the same question applies - is the job stopped and started, and all knowledge of previous jobs discarded, or does the large window (e.g. up to a month) remain open for new events?
How do we change/modify this pipeline without disturbing the existing state?
Apologies if these are basic questions, even a link to some docs would be appreciated.
It sounds like you want arbitrary interactive aggregation queries on your data. Beam / Dataflow are not a good fit for this per se, however one of the most common use cases of Dataflow is to ingest data into BigQuery (e.g. from GCS files or from Pubsub), which is a very good fit for that.
A few more comments on your question:
"it's not clear how we would divide our incoming data into batches"
Windowing in Beam is simply a way to specify the aggregation scope in the time dimension. E.g. if you're using sliding windows of size 15 minutes every 5 minutes, then a record whose event-time timestamp is 14:03 counts towards aggregations in three windows: 13:50..14:05, 13:55..14:10, 14:00..14:15.
So: same way as you don't need to divide your incoming data into "keys" when grouping by a key (the data processing framework performs the group-by-key for you), you don't divide it into windows either (the framework performs group-by-window implicitly as part of every aggregating operation).
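The sliding-window example above can be reproduced with a few lines of plain Python. This is only a sketch of the arithmetic, not Beam's implementation; sliding_windows is a hypothetical helper that assumes window starts are aligned to multiples of the period counted from the Unix epoch and that windows are half-open, [start, end).

```python
from datetime import datetime, timedelta


def sliding_windows(ts, size, period):
    """Return the [start, end) sliding windows that contain timestamp ts.

    Hypothetical helper; assumes window starts fall on multiples of
    `period` measured from the Unix epoch.
    """
    epoch = datetime(1970, 1, 1)
    # Latest window start that is not after ts.
    start = ts - (ts - epoch) % period
    windows = []
    while start + size > ts:  # window still covers ts
        windows.append((start, start + size))
        start -= period
    return list(reversed(windows))


for start, end in sliding_windows(datetime(2018, 2, 5, 14, 3),
                                  size=timedelta(minutes=15),
                                  period=timedelta(minutes=5)):
    print(start.strftime("%H:%M"), "..", end.strftime("%H:%M"))
# prints:
# 13:50 .. 14:05
# 13:55 .. 14:10
# 14:00 .. 14:15
```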
"will Dataflow keep the windows open between job runs"
I'm hoping this is addressed by the previous point, but to clarify more: No. Stopping a Dataflow job discards all of its state. However, you can "update" a job with new code (e.g. if you've fixed a bug or added an extra processing step) - in that case state is not discarded, but I think that's not what you're asking.
"if the log files are rotated every 5 mins, and we start Dataflow as a new file is saved"
It sounds like you want to ingest data continuously. The way to do that is to write a single continuously running streaming pipeline that ingests the data continuously, rather than to start a new pipeline every time new data arrives. In the case of files arriving into a bucket, you can use TextIO.read().watchForNewFiles() if you're reading text files, or its various analogues if you're reading some other kind of files (most general is FileIO.matchAll().continuously()).

Marking a key as complete in a GroupBy | Dataflow Streaming Pipeline

To our streaming pipeline, we want to submit unique GCS files, each file containing information about multiple events, and each event containing a key (for example, device_id). As part of the processing, we want to shuffle by this device_id so as to achieve some form of worker to device_id affinity (more background on why we want to do this is in another SO question). Once all events from the same file are complete, we want to reduce (GroupBy) by their source GCS file (which we will make a property of the event itself, something like file_id) and finally write the output to GCS (could be multiple files).
The reason we want to do the final GroupBy is because we want to notify an external service once a specific input file has completed processing. The only problem with this approach is that since the data is shuffled by the device_id and then grouped at the end by the file_id, there is no way to guarantee that all data from a specific file_id has completed processing.
Is there something we could do about it? I understand that Dataflow provides exactly_once guarantees which means all the events will be eventually processed but is there a way to set a deterministic trigger to say all data for a specific key has been grouped?
EDIT
I wanted to highlight the broader problem we are facing here. The ability to mark file-level completeness would help us checkpoint different stages of the data as seen by external consumers. For example, this would allow us to trigger per-hour or per-day completeness which are critical for us to generate reports for that window. Given that these stages/barriers (hour/day) are clearly defined on the input (GCS files are date/hour partitioned), it is only natural to expect the same of the output. But with Dataflow's model, this seems impossible.
Similarly, although Dataflow guarantees exactly-once, there will be cases where the entire pipeline needs to be restarted since something went horribly wrong - in those cases, it is almost impossible to restart from the correct input marker since there is no guarantee that what was already consumed has been completely flushed out. The DRAIN mode tries to achieve this but as mentioned, if the entire pipeline is messed up and draining itself cannot make progress, there is no way to know which part of the source should be the starting point.
We are considering using Spark since its micro-batch based streaming model seems to fit better. We would still like to explore Dataflow if possible, but it seems that we won't be able to achieve this without storing these checkpoints externally from within the application. If there is an alternative way of providing these guarantees from Dataflow, it would be great. The idea behind broadening this question was to see if we are missing an alternate perspective which would solve our problem.
Thanks
This is actually tricky. Neither Beam nor Dataflow has a notion of a per-key watermark, and it would be difficult to implement that level of granularity.
One idea would be to use a stateful DoFn instead of the second shuffle. This DoFn would need to receive the number of elements expected in the file (from either a side-input or some special value on the main input). Then it could count the number of elements it had processed, and only output that everything has been processed once it had seen that number of elements.
This would be assuming that the expected number of elements can be determined ahead of time, etc.
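The counting logic such a stateful DoFn would need can be sketched in plain Python as below. FileCompletionTracker and its methods are hypothetical names, and the dictionaries stand in for per-key state; in Beam the counts would be kept in state cells keyed by file_id.

```python
class FileCompletionTracker:
    """Track, per file_id, how many elements were seen versus expected.

    Hypothetical sketch of the per-key state. Both methods return True
    once the seen count equals the expected count, regardless of whether
    the expected count arrives before or after the elements.
    """

    def __init__(self):
        self._expected = {}
        self._seen = {}

    def set_expected(self, file_id, expected_count):
        self._expected[file_id] = expected_count
        return self._is_complete(file_id)

    def record_element(self, file_id):
        self._seen[file_id] = self._seen.get(file_id, 0) + 1
        return self._is_complete(file_id)

    def _is_complete(self, file_id):
        expected = self._expected.get(file_id)
        return expected is not None and self._seen.get(file_id, 0) == expected
```

The point at which either method first returns True is where the notification to the external service would be emitted.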

Parse large and multiple fetches for statistics

In my app I want to retrieve a large amount of data from Parse to build a view of statistics. However, in future, as data builds up, this may be a huge amount.
For example, 10,000 results. Even if I fetched in batches of 1000 at a time, this would result in 10 fetches. This could rapidly send me over Parse's limit of 30 requests per second, especially when several other chunks of data may need to be collected at the same time for other stats.
Any recommendations/tips/advice for this scenario?
You will also run into limits with the skip and limit query variables. Heavy lifting on a mobile device could also present issues for you.
If you can, you should pre-aggregate these statistics, perhaps once per day, so that you can simply request the details directly.
Alternatively, create a cloud code function to do some processing for you and return the results. Again, you may well run into limits here, so a cloud job may meet your needs better. You may then need to create a request object that is processed by the job, and then poll for completion or send out push notifications on completion.

Google Cloud Dataflow: 413 Request Entity Too Large

Any suggestions on how to work around this error besides reducing the number of transformations in the flow (or, likely, reducing the total serialized size of all transformation objects in the flow graph)?
Thanks,
Dataflow currently has a limitation in our system that caps requests at 1MB. The size of the job is specifically tied to the JSON representation of the pipeline; a larger pipeline means a larger request.
We are working on increasing this limit. In the meantime, you can work around this limitation by breaking your job into smaller jobs so that each job description takes less than 1MB.
To estimate the size of your request, run your pipeline with the option
--dataflowJobFile=<path to output file>
This will write a JSON representation of your job to a file. The size of that file is a good estimate of size of the request. The actual size of the request will be slightly larger due to additional information that is part of the request.
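Checking the resulting file against the cap is then a one-liner; the sketch below wraps it in a small helper. The name check_job_file_size is hypothetical, and treating 1MB as 1024 * 1024 bytes is an assumption, since the answer does not say how the limit is counted.

```python
import os

# Assumption: the 1MB cap is interpreted as 1024 * 1024 bytes.
REQUEST_LIMIT_BYTES = 1024 * 1024


def check_job_file_size(path, limit_bytes=REQUEST_LIMIT_BYTES):
    """Return (size_in_bytes, fits) for the job file written by --dataflowJobFile.

    Hypothetical helper. The real request is slightly larger than the
    file, so a file close to the limit should be treated as too large.
    """
    size = os.path.getsize(path)
    return size, size < limit_bytes
```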
Thank you for your patience.
We will update this thread once the limit has been increased.
This kind of error usually comes up when your bundle batch size for ingestion exceeds the limit (20MB).
I'm not sure if you're using WriteToBigQuery. If you're not, feel free to ignore this answer. I usually solve it by trying one of these two solutions:
Solution 1: Set batch_size of WriteToBigQuery to a number lower than 500. The default is 500.
Solution 2: Set method of WriteToBigQuery to "FILE_LOADS", and also set the other necessary parameters, such as triggering_frequency and custom_gcs_temp_location.
If the above two solutions cannot solve your problem or are not suitable for your case, you have to make each row smaller by reducing its granularity. This requires modifying the parsing logic and the BigQuery table schema.
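As a sketch of how the two solutions translate into keyword arguments for WriteToBigQuery: the helper name write_to_bigquery_kwargs, the batch size of 100, the 60-second triggering frequency and the bucket path are all illustrative assumptions, not recommended values. The helper only builds a dictionary, so it does not require Beam to be installed.

```python
def write_to_bigquery_kwargs(use_file_loads, temp_location=None):
    """Build keyword arguments for apache_beam.io.WriteToBigQuery.

    Hypothetical helper; the concrete values are placeholders.
    """
    if use_file_loads:
        # Solution 2: load jobs from files staged in GCS.
        return {
            "method": "FILE_LOADS",
            "triggering_frequency": 60,  # seconds, assumed value
            "custom_gcs_temp_location": temp_location,
        }
    # Solution 1: streaming inserts with a batch size below the default of 500.
    return {"method": "STREAMING_INSERTS", "batch_size": 100}


# Intended use inside a pipeline (table name is a placeholder):
#   beam.io.WriteToBigQuery("project:dataset.table",
#                           **write_to_bigquery_kwargs(True, "gs://my-bucket/tmp"))
```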
To see the detail of parameters, please see the reference link.
Reference:
https://beam.apache.org/releases/pydoc/2.39.0/apache_beam.io.gcp.bigquery.html#apache_beam.io.gcp.bigquery.WriteToBigQuery
https://cloud.google.com/dataflow/docs/guides/common-errors#json-request-too-large
Are you serializing a large amount of data as part of your pipeline specification? For example, are you using the Create Transform to create PCollections from inlined data?
Could you possibly share the JSON file? If you don't want to share it publicly, you could email it privately to the Dataflow team.
This was merged into Beam on Nov 16, 2018. It should not be too much longer before this is included in Dataflow.

Resources