What determines batch size in beam/dataflow? - google-cloud-dataflow

I have a pipeline that uses the batch variant of DoFn (which the docs weren't very helpful for). It looks like this:
import apache_beam as beam
from typing import Iterator, List

class MyFn(beam.DoFn):
    def process_batch(self, batch: List[MyType]) -> Iterator[List[MyType]]:
        # process batches
        results = []
        for foo in batch:
            ...  # do work, add to results
        yield results
I've got some logging set up that shows me my process_batch method is operating on 4096 items consistently. Does anyone know why it's 4096, or how to make it higher or lower?

Currently the batch size is hardcoded to 4096 in the batched DoFns.
You can raise a feature request with the Apache Beam community to make it configurable.
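If you need control over the batch size today, one possible workaround (a sketch, not the batched-DoFn machinery itself) is to do the batching yourself with BatchElements, which exposes min_batch_size/max_batch_size, and consume the resulting lists in an ordinary process method. The per-element work below is a placeholder:

import apache_beam as beam

class ProcessListFn(beam.DoFn):
    def process(self, batch):
        # 'batch' is a Python list emitted by BatchElements below
        results = [x * 2 for x in batch]  # stand-in for the real per-element work
        yield results

with beam.Pipeline() as p:
    (p
     | beam.Create(range(100_000))
     | beam.BatchElements(min_batch_size=512, max_batch_size=8192)
     | beam.ParDo(ProcessListFn()))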

Related

ML Engine with GPU workers errors

Hi, I am using ML Engine with a custom tier made up of a complex_m master, four workers each with a GPU, and one complex_m as a parameter server.
The model is training a CNN. However, there seems to be trouble with the workers.
This is an image of the logs https://i.stack.imgur.com/VJqE0.png.
The master still seems to be working because there are session checkpoints being saved; however, this is nowhere near the speed it should be.
With complex_m workers, the model works. It just waits for the model to be ready in the beginning (I assume that is until the master initializes the global variables, correct me if I am wrong) and then works normally. With GPUs, however, there seems to be a problem with the task.
I didn't use the tf.device() function anywhere; in the cloud I thought the device is set automatically if a GPU is available.
I followed the Census example and loaded the TF_CONFIG environment variable.
tf.logging.info('Setting up the server')
tf_config = os.environ.get('TF_CONFIG')
# If TF_CONFIG is not available run local
if not tf_config:
    return run('', True, *args, **kwargs)
tf_config_json = json.loads(tf_config)
cluster = tf_config_json.get('cluster')
job_name = tf_config_json.get('task', {}).get('type')
task_index = tf_config_json.get('task', {}).get('index')
# If cluster information is empty run local
if job_name is None or task_index is None:
    return run('', True, *args, **kwargs)
cluster_spec = tf.train.ClusterSpec(cluster)
server = tf.train.Server(cluster_spec,
                         job_name=job_name,
                         task_index=task_index)
# Wait for incoming connections forever
# Worker ships the graph to the ps server
# The ps server manages the parameters of the model.
if job_name == 'ps':
    server.join()
    return
elif job_name in ['master', 'worker']:
    return run(server.target, job_name == 'master', *args, **kwargs)
Then I used tf.train.replica_device_setter before defining the main graph.
As a session I am using tf.train.MonitoredTrainingSession; this should handle the initialization of variables and checkpoint saving. I do not know why the workers are saying that the variables are not initialized.
Variables to be initialized are all variables: https://i.stack.imgur.com/hAHPL.png
Optimizer: AdaDelta
I appreciate the help!
In the comments, you seem to have answered your own question (passing cluster_spec to replica_device_setter); for anyone hitting the same issue, that fix looks roughly like the sketch below.
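A minimal sketch of that device-setter usage, reusing cluster_spec, job_name and task_index from the TF_CONFIG parsing in the question (model code omitted):

import tensorflow as tf

# Give replica_device_setter the cluster information so that variables are
# placed on the ps tasks while the compute ops stay on this worker.
device_fn = tf.train.replica_device_setter(
    cluster=cluster_spec,
    worker_device='/job:%s/task:%d' % (job_name, task_index))
with tf.device(device_fn):
    # ... build the model, loss, optimizer and global_step here ...
    pass

With that out of the way, allow me to address the issue of throughput of a cluster of CPUs vs. a cluster of GPUs.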
GPUs are fairly powerful. You'll typically get higher throughput by getting a single machine with many GPUs rather than having many machines each with a single GPU. That's because the communication overhead becomes a bottleneck (the bandwidth and latency to main memory on the same machine is much better than communicating with a parameter server on a remote machine).
The reason for the GPUs being slower than CPUs may be due to the extra overhead of GPUs needing to copy data from main memory to the GPU and back. If you're doing a lot of parallelizable computation, then this copy is negligible. Your model may be doing too little on the GPU and the overhead may swamp the actual computation.
For more information about building high performance models, see this guide.
In the meantime, I recommend using a single machine with more GPUs to see if that helps:
{
  "scaleTier": "CUSTOM",
  "masterType": "complex_model_l_gpu",
  ...
}
Just beware that you'll have to modify your code to assign ops to the right GPUs, probably using towers.
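A rough sketch of the tower pattern under TF 1.x; the GPU count and the toy dense layer standing in for your CNN are assumptions:

import tensorflow as tf

NUM_GPUS = 2                                         # assumed GPU count
inputs = tf.placeholder(tf.float32, [None, 10])
labels = tf.placeholder(tf.float32, [None, 1])
input_splits = tf.split(inputs, NUM_GPUS)
label_splits = tf.split(labels, NUM_GPUS)

tower_losses = []
for i in range(NUM_GPUS):
    with tf.device('/gpu:%d' % i):                   # pin this tower's ops to GPU i
        with tf.variable_scope('model', reuse=(i > 0)):
            preds = tf.layers.dense(input_splits[i], 1)   # toy model, stands in for the CNN
            tower_losses.append(
                tf.losses.mean_squared_error(label_splits[i], preds))

loss = tf.reduce_mean(tower_losses)                  # average loss across towers
train_op = tf.train.AdadeltaOptimizer().minimize(loss)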

Reading large gzip JSON files from Google Cloud Storage via Dataflow into BigQuery

I am trying to read about 90 gzipped JSON logfiles from Google Cloud Storage (GCS), each about 2 GB in size (10 GB uncompressed), parse them, and write them to a date-partitioned table in BigQuery (BQ) via Google Cloud Dataflow (GCDF).
Each file holds 7 days of data, the whole date range is about 2 years (730 days and counting). My current pipeline looks like this:
p.apply("Read logfile", TextIO.Read.from(bucket))
.apply("Repartition", Repartition.of())
.apply("Parse JSON", ParDo.of(new JacksonDeserializer()))
.apply("Extract and attach timestamp", ParDo.of(new ExtractTimestamps()))
.apply("Format output to TableRow", ParDo.of(new TableRowConverter()))
.apply("Window into partitions", Window.into(new TablePartWindowFun()))
.apply("Write to BigQuery", BigQueryIO.Write
.to(new DayPartitionFunc("someproject:somedataset", tableName))
.withSchema(TableRowConverter.getSchema())
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
The Repartition is something I've built in while trying to make the pipeline reshuffle after decompressing; I have tried running the pipeline with and without it. Parsing JSON works via a Jackson ObjectMapper and corresponding classes as suggested here. The TablePartWindowFun is taken from here; it is used to assign a partition to each entry in the PCollection.
The pipeline works for smaller files and not too many, but breaks for my real data set. I've selected large enough machine types and tried setting a maximum number of workers, as well as using autoscaling up to 100 n1-highmem-16 machines. I've tried streaming and batch mode and diskSizeGb values from 250 up to 1200 GB per worker.
The possible solutions I can think of at the moment are:
Uncompress all files on GCS, thereby enabling dynamic work splitting between workers, as it is not possible to leverage GCS's gzip transcoding
Building "many" parallel pipelines in a loop, with each pipeline processing only a subset of the 90 files.
Option 2 seems to me like programming "around" a framework. Is there another solution?
Addendum:
With Repartition after Reading the gzip JSON files in batch mode with 100 workers max (of type n1-highmem-4), the pipeline runs for about an hour with 12 workers and finishes the Reading as well as the first stage of Repartition. Then it scales up to 100 workers and processes the repartitioned PCollection. After it is done the graph looks like this:
Interestingly, when reaching this stage, it first processes up to 1.5 million elements/s, then the progress drops to 0. The size of the OutputCollection of the GroupByKey step in the picture first goes up and then down from about 300 million to 0 (there are about 1.8 billion elements in total), as if it were discarding something. Also, the ExpandIterable and ParDo(Streaming Write) run-times at the end are 0. The picture shows the job slightly before it runs "backwards".
In the logs of the workers I see some "exception thrown while executing request" messages coming from the com.google.api.client.http.HttpTransport logger, but I can't find more info in Stackdriver.
Without Repartition after Reading, the pipeline fails on n1-highmem-2 instances with out-of-memory errors at exactly the same step (everything after GroupByKey) - using bigger instance types leads to exceptions like
java.util.concurrent.ExecutionException: java.io.IOException:
CANCELLED: Received RST_STREAM with error code 8 dataflow-...-harness-5l3s
talking to frontendpipeline-..-harness-pc98:12346
Thanks to Dan from the Google Cloud Dataflow Team and the example he provided here, I was able to solve the issue. The only changes I made:
Looping over the days in chunks of 175 days (= 25 weeks), running one pipeline after the other, so as not to overwhelm the system. In the loop, make sure the last files of the previous iteration are re-processed and the startDate is moved forward at the same speed as the underlying data (175 days). As WriteDisposition.WRITE_TRUNCATE is used, incomplete days at the end of the chunks are overwritten with correct, complete data this way.
Using the Repartition/Reshuffle transform mentioned above, after reading the gzipped files, to speed up the process and allow smoother autoscaling
Using DateTime instead of Instant types, as my data is not in UTC
UPDATE (Apache Beam 2.0):
With the release of Apache Beam 2.0 the solution became much easier. Sharding BigQuery output tables is now supported out of the box.
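For illustration only, in the Python SDK the per-day routing can now be expressed with a callable table destination roughly like this (project, dataset, field names and schema are placeholders):

import apache_beam as beam

def day_partition(row):
    # '$YYYYMMDD' selects the partition of a day-partitioned table;
    # 'date' is an assumed field holding an ISO date string
    return 'someproject:somedataset.sometable$' + row['date'].replace('-', '')

def write_rows(rows):
    return rows | beam.io.WriteToBigQuery(
        table=day_partition,
        schema='date:STRING,payload:STRING',  # placeholder schema
        create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)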
It may be worthwhile trying to allocate more resources to your pipeline by setting --numWorkers to a higher value when you run your pipeline. This is one of the possible solutions discussed in the "Troubleshooting Your Pipeline" online document, in the "Common Errors and Courses of Action" sub-chapter.

Too many 'steps' when executing a pipeline

We have a large data set which needs to be partitioned into 1,000 separate files, and the simplest implementation we wanted to use is to apply a PartitionFn which, given an element of the data set, returns a random integer between 1 and 1,000.
The problem with this approach is that it ends up creating 1,000 PCollections and the pipeline does not launch, as there seems to be a hard limit on the number of 'steps' (which correspond to the boxes shown in the execution graph on the job monitoring UI).
Is there a way to increase this limit (and what is the limit)?
The solution we are using to get around this issue is to partition the data into a smaller number of subsets first (say 50 subsets), and for each subset run another layer of partitioning pipelines to produce 20 subsets each (so the end result is 1,000 subsets). It would be nice if we could avoid this extra layer, as it ends up creating 1 + 50 pipelines and incurs the extra cost of writing and reading the intermediate data.
Rather than using the Partition transform and introducing many steps in the pipeline, consider using either of the following approaches:
Many sinks support the option to specify the number of output shards. For example, TextIO has a withNumShards method. If you pass it 1000, it will produce 1000 separate shards in the specified directory.
Using the shard number as a key and using a GroupByKey + a DoFn to write the results.
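For what it's worth, a minimal sketch of the first approach in the Beam Python SDK (paths are placeholders; the Java TextIO.Write.withNumShards call is analogous):

import apache_beam as beam

with beam.Pipeline() as p:
    (p
     | 'Read' >> beam.io.ReadFromText('gs://my-bucket/input/*')   # placeholder input
     | 'Write' >> beam.io.WriteToText('gs://my-bucket/output/part',
                                      num_shards=1000))           # exactly 1,000 shards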

Set num of output shard in Write.to(Sink) in dataflow

I have a customized sink extending FileBasedSink, to which I write by calling PCollection.apply(Write.to(MySink)) in Dataflow (very similar to XmlSink.java). However, it seems that by default simply calling Write.to will always result in 3 output shards? Is there any way that I could define the number of output shards (like TextIO.Write.withNumShards) just in the customized sink class definition, or do I have to define another customized PTransform like TextIO.Write?
Unfortunately, right now FileBasedSink does not support specifying the number of shards.
In practice, the number of shards you get will be dependent on how the framework chooses to optimize the parts of the pipeline producing the collection you're writing, so there's essentially no control over that.
I've filed a JIRA issue for your request so you can subscribe to the status.

Can dask work with an endless streaming input

I understand that dask works well in batch mode, like this:
def load(filename):
    ...

def clean(data):
    ...

def analyze(sequence_of_data):
    ...

def store(result):
    with open(..., 'w') as f:
        f.write(result)

dsk = {'load-1': (load, 'myfile.a.data'),
       'load-2': (load, 'myfile.b.data'),
       'load-3': (load, 'myfile.c.data'),
       'clean-1': (clean, 'load-1'),
       'clean-2': (clean, 'load-2'),
       'clean-3': (clean, 'load-3'),
       'analyze': (analyze, ['clean-%d' % i for i in [1, 2, 3]]),
       'store': (store, 'analyze')}

from dask.multiprocessing import get
get(dsk, 'store')  # executes in parallel
Can we use dask to process a streaming channel, where the number of chunks is unknown or even endless?
Can it perform the computation in an incremental way? For example, could the 'analyze' step above process ongoing chunks?
Must we call the "get" operation only after all the data chunks are known, or can we add new chunks after "get" was called?
Edit: see newer answer below
No
The current task scheduler within dask expects a single computational graph. It does not support dynamically adding to or removing from this graph. The scheduler is designed to evaluate large graphs in a small amount of memory; knowing the entire graph ahead of time is critical for this.
However, this doesn't stop one from creating other schedulers with different properties. One simple solution here is just to use a module like concurrent.futures on a single machine, or distributed on multiple machines.
Actually Yes
The distributed scheduler now operates fully asynchronously and you can submit tasks, wait on a few of them, submit more, cancel tasks, add/remove workers etc. all during computation. There are several ways to do this, but the simplest is probably the new concurrent.futures interface described briefly here:
http://dask.pydata.org/en/latest/futures.html
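A minimal sketch of that style with the distributed scheduler; clean and the extra chunks stand in for real streaming work:

from dask.distributed import Client, as_completed

def clean(chunk):
    return chunk * 2                         # stand-in for real per-chunk work

if __name__ == '__main__':
    client = Client()                        # local scheduler and workers by default
    # submit an initial batch of work
    pending = as_completed(client.submit(clean, i) for i in range(3))
    more_chunks = iter(range(3, 6))          # stands in for a streaming source
    for future in pending:
        print(future.result())
        nxt = next(more_chunks, None)
        if nxt is not None:
            pending.add(client.submit(clean, nxt))   # add new work mid-computation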
