I am in the middle of designing a component to load CSV files that arrive on a GCS bucket into BQ tables.
Since our requirements involve inserting additional columns, I have come up with the following design using a Dataflow streaming job.
1] A CSV file arrives on the GCS bucket.
2] The GCS object creation event is sent to a Pub/Sub topic.
These events are consumed by the 'GCS-BQ Loader' Dataflow streaming job.
2a] Looking at the metadata attached to the GCS object, it decides the BQ table name, settings, etc., and calculates values for the new columns (lot_num and batch).
3] Using the BigQuery API, a temporary external table is created for the CSV file.
4] Using the BigQuery API, a query is executed to insert the data from the BQ external table into the final BQ table. This step is needed because the additional columns (lot_num, batch) must be added to the final table.
5] Finally, the temporary external table is deleted.
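A sketch of steps 3 to 5, assuming the google-cloud-bigquery Java client (the table names, CSV schema handling, and the lot_num/batch values are placeholders):

    import com.google.cloud.bigquery.BigQuery;
    import com.google.cloud.bigquery.BigQueryOptions;
    import com.google.cloud.bigquery.CsvOptions;
    import com.google.cloud.bigquery.ExternalTableDefinition;
    import com.google.cloud.bigquery.QueryJobConfiguration;
    import com.google.cloud.bigquery.Schema;
    import com.google.cloud.bigquery.TableId;
    import com.google.cloud.bigquery.TableInfo;

    public class ExternalTableLoad {
      public static void load(String gcsUri, Schema csvSchema, String finalTable,
                              String lotNum, String batch) throws InterruptedException {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

        // 3] Temporary external table over the CSV file on GCS.
        TableId tempId = TableId.of("staging", "tmp_" + System.nanoTime());
        ExternalTableDefinition definition = ExternalTableDefinition.of(
            gcsUri, csvSchema, CsvOptions.newBuilder().setSkipLeadingRows(1).build());
        bigquery.create(TableInfo.of(tempId, definition));

        try {
          // 4] INSERT ... SELECT that appends the computed columns.
          String sql = String.format(
              "INSERT INTO `%s` SELECT *, '%s' AS lot_num, '%s' AS batch FROM `staging.%s`",
              finalTable, lotNum, batch, tempId.getTable());
          bigquery.query(QueryJobConfiguration.newBuilder(sql).build());
        } finally {
          // 5] Drop the temporary external table.
          bigquery.delete(tempId);
        }
      }
    }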
Depending on the metadata attached to the GCS objects, we expect to have around 1000 BQ tables. The CSV files range from a couple of kilobytes to ~1 GB.
I have the following questions regarding this design:
This process is IO bound (rather than CPU bound), since the BQ API calls are blocking.
In this case, the blocking calls will block the Dataflow threads. Will this impact performance?
How will Dataflow autoscaling work in this case?
From the docs:
Dataflow scales based on the parallelism of a pipeline. The parallelism of a pipeline is an estimate of the number of threads needed to most efficiently process data at any given time.
What kind of metric does Dataflow use for autoscaling this kind of IO-bound process?
Is it the number of unacknowledged Pub/Sub messages in the buffer?
Is Dataflow suitable for this kind of IO-bound processing, or is a plain Java app running on GKE (Kubernetes) more suitable for this case?
I have three steps in a Dataflow pipeline:
1. Reads from Pub/Sub, saves to a table, and splits the message into multiple events (put into the context output).
2. For each split event, queries the DB and decorates the event with additional data.
3. Publishes to another Pub/Sub topic for further processing.
PROBLEM:
After step 1, it splits into 10K to 20K events.
Now in step 2, it runs out of database connections (I have a static Hikari connection pool).
It works absolutely fine with less data. I am using an n1-standard-32 machine.
What should I do to limit the input to the next step, so that its parallelism is restricted, or to throttle the events going into it?
I think the basic idea is to reduce parallelism when executing step 2 (if you have massive parallelism, you will need 20K connections for 20K events, because 20K events are processed in parallel).
Ideas include:
- Stateful ParDo: execution is serialized per key per window, which means only one connection is needed for a stateful ParDo, because only one element is processed at a given time for a given key and window.
- One connection per bundle: you can initialize a connection in startBundle and make the elements within the same bundle share that connection (if my understanding is correct, execution within a bundle is likely serialized); see the sketch after this list.
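A minimal sketch of the second idea, assuming a Beam Java DoFn and a plain JDBC lookup (the JDBC URL, credentials, and query are placeholders):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.values.KV;

    // One JDBC connection per bundle instead of one per element.
    class DecorateWithDbFn extends DoFn<Long, KV<Long, String>> {
      private transient Connection connection;

      @StartBundle
      public void startBundle() throws Exception {
        // Opened once per bundle; all elements in the bundle reuse it.
        connection = DriverManager.getConnection(
            "jdbc:postgresql://10.0.0.1/events", "user", "password");
      }

      @ProcessElement
      public void processElement(ProcessContext c) throws Exception {
        try (PreparedStatement stmt = connection.prepareStatement(
            "SELECT extra_data FROM lookup WHERE id = ?")) {
          stmt.setLong(1, c.element());
          try (ResultSet rs = stmt.executeQuery()) {
            if (rs.next()) {
              // Emit the event decorated with the extra data.
              c.output(KV.of(c.element(), rs.getString("extra_data")));
            }
          }
        }
      }

      @FinishBundle
      public void finishBundle() throws Exception {
        if (connection != null) {
          connection.close();
          connection = null;
        }
      }
    }

(Opening the connection in @Setup and closing it in @Teardown instead would give one connection per DoFn instance, which also bounds the total number of connections.)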
So I am working on a little project that sets up a streaming pipeline using Google Dataflow and Apache Beam. I went through some tutorials and was able to get a pipeline up and running, streaming into BigQuery, but I am going to want to stream into a full relational DB (i.e. Cloud SQL). I have searched through this site and throughout Google, and it seems that the best route to achieve that would be to use JdbcIO. I am a bit confused here, because when I look up info on how to do this, it all refers to writing to Cloud SQL in batches and not full-on streaming.
My simple question is: can I stream data directly into Cloud SQL, or would I have to send it via batch instead?
Cheers!
You should use JdbcIO - it does what you want, and it makes no assumptions about whether its input PCollection is bounded or unbounded, so you can use it in any pipeline and with any Beam runner; the Dataflow streaming runner is no exception.
In case your question was prompted by reading its source code and seeing the word "batching": it simply means that, for efficiency, it writes multiple records per database call - the overloaded use of the word "batch" can be confusing, but here it simply means that it tries to avoid the overhead of an expensive database call for every single record.
In practice, the number of records written per call is at most 1000 by default, but in general it depends on how the particular runner chooses to execute this particular pipeline, on this particular data, at this particular moment, and it can be less than that.
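For example, a streaming pipeline writing Pub/Sub messages to Cloud SQL might look roughly like this (the topic, the JDBC URL via the Cloud SQL socket factory, and the table are placeholders):

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
    import org.apache.beam.sdk.io.jdbc.JdbcIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    public class StreamToCloudSql {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(
            PipelineOptionsFactory.fromArgs(args).withValidation().create());

        // Unbounded source; JdbcIO is applied to it exactly as it would be
        // to a bounded one.
        p.apply(PubsubIO.readStrings()
                .fromTopic("projects/my-project/topics/events"))
         .apply(JdbcIO.<String>write()
             .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
                 "com.mysql.jdbc.Driver",
                 "jdbc:mysql://google/mydb"
                     + "?cloudSqlInstance=my-project:us-central1:my-instance"
                     + "&socketFactory=com.google.cloud.sql.mysql.SocketFactory"))
             .withStatement("INSERT INTO events (payload) VALUES (?)")
             .withPreparedStatementSetter((element, statement) ->
                 statement.setString(1, element)));

        p.run();
      }
    }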
I have a Python beam.DoFn that uploads a file to the internet. This process uses 100% of one core for ~5 seconds and then proceeds to upload the file for 2-3 minutes (using a very small fraction of the CPU during the upload).
Is Dataflow smart enough to optimize around this by spinning up multiple DoFns in separate threads/processes?
Yes, Dataflow will spin up multiple instances of a DoFn using Python multiprocessing.
However, keep in mind that if you use a GroupByKey, the ParDo will process the elements for a particular key serially. You still achieve parallelism on the worker, since you are processing multiple keys at once. However, if all of your data lands on a single "hot key", you may not achieve good parallelism.
Are you using TextIO.Write in a batch pipeline? I believe the files are prepared locally and then uploaded after your main DoFn has been processed. That is, the file is not uploaded until the PCollection is complete and will receive no more elements.
I don't think it streams out the files as you produce elements.
We have a streaming Dataflow pipeline running on Google Cloud Dataflow workers, which needs to read from a Pub/Sub subscription, group messages, and write them to BigQuery. The built-in BigQuery sink does not fit our needs, as we need to target specific datasets and tables for each group. Since custom sinks are not supported for streaming pipelines, it seems the only solution is to perform the insert operations in a ParDo. Something like this:
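A rough sketch of what such a ParDo might look like, assuming the google-cloud-bigquery client and that each element carries its target "dataset.table" as a key (all names here are placeholders):

    import com.google.cloud.bigquery.BigQuery;
    import com.google.cloud.bigquery.BigQueryOptions;
    import com.google.cloud.bigquery.InsertAllRequest;
    import com.google.cloud.bigquery.InsertAllResponse;
    import com.google.cloud.bigquery.TableId;
    import java.util.Map;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.values.KV;

    // Writes each element to the BigQuery table named by its key.
    class BigQueryInsertFn extends DoFn<KV<String, Map<String, Object>>, Void> {
      private transient BigQuery bigquery;

      @Setup
      public void setup() {
        bigquery = BigQueryOptions.getDefaultInstance().getService();
      }

      @ProcessElement
      public void processElement(ProcessContext c) {
        // Key is "dataset.table"; value is the row content.
        String[] parts = c.element().getKey().split("\\.");
        TableId table = TableId.of(parts[0], parts[1]);
        InsertAllResponse response = bigquery.insertAll(
            InsertAllRequest.newBuilder(table)
                .addRow(c.element().getValue())
                .build());
        if (response.hasErrors()) {
          // Real code should retry or dead-letter failed rows.
          throw new RuntimeException("Insert failed: " + response.getInsertErrors());
        }
      }
    }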
Is there any known issue with not having a sink in a pipeline, or anything to be aware of when writing this kind of pipeline?
There should not be any issues with writing a pipeline without a sink. In fact, a sink is a type of ParDo in streaming.
I recommend that you use a custom ParDo and use the BigQuery API with your custom logic. Here is the definition of BigQuerySink; you can use this code as a starting point.
You can define your own DoFn, similar to StreamingWriteFn, to add your custom ParDo logic, which will write to the appropriate BigQuery dataset/table.
Note that this uses Reshuffle instead of GroupByKey. I recommend that you use Reshuffle, which also groups by key but avoids unnecessary windowing delays. In this case it means that elements should be written out as soon as they come in, without extra buffering/delay. Additionally, this allows you to determine BQ table names at runtime.
Edit: I do not recommend using the built-in BigQuerySink to write to different tables. The suggestion is to use the BigQuery API in your custom DoFn, rather than the BigQuerySink.
This wasn't clear from the documentation, but it looks like BigQueryIO.write performs a streaming write, which in turn limits the row size to <20 KB?
Is it possible to configure a non-streaming BigQuery write that supports the larger (1 MB) row size? My Dataflow job is a batch job, not a streaming one, so BigQuery streaming is not necessary, and it is undesired in this case, since it prevents me from importing my data.
If not, what's the recommended workflow for importing large rows into BigQuery? I guess I could run the Dataflow ETL and write my data to text files using TextIO, but then I'd have to add a manual step outside of this pipeline to trigger a BQ import.
Batch Dataflow jobs don't stream data to BigQuery. The data is written to GCS, and then we execute BigQuery import jobs to import the GCS files, so the streaming limits shouldn't apply.
Note that the import job is executed by the service, not by the workers, which is why you don't see code for it in BigQueryIO.write.
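If you are on a newer Beam SDK and want to make this explicit, the write method can be forced to use load jobs. A sketch, assuming rows is an existing PCollection<TableRow> and schema its TableSchema (the table name is a placeholder):

    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;

    // Explicitly request load jobs (the batch default anyway), so the
    // streaming-insert row-size limit does not apply.
    rows.apply(BigQueryIO.writeTableRows()
        .to("my-project:my_dataset.my_table")
        .withSchema(schema)
        .withMethod(BigQueryIO.Write.Method.FILE_LOADS)
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));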