I have a job that, among other things, inserts some of the data it reads from files into a BigQuery table for later manual analysis.
It fails with the following error:
job error: Too many sources provided: 10001. Limit is 10000., error: Too many sources provided: 10001. Limit is 10000.
What does it refer to as "source"? Is it a file or a pipeline step?
I'm guessing the error is coming from BigQuery and means that we are trying to upload too many files when we create your output table.
Could you provide some more details on the error and its context, such as a snippet of the command-line output (if using the BlockingDataflowPipelineRunner), so I can confirm? A jobId would also be helpful.
Is there something about your pipeline structure that is going to result in a large number of output files? That could either be a large amount of data or perhaps finely sharded input files without a subsequent GroupByKey operation (which would let us reshard the data into larger pieces).
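To illustrate what resharding buys you, here is a minimal plain-Java sketch. It is not Dataflow API: the class name ReshardSketch and the target of 100 shards are made up for the example. It regroups the elements of many tiny shards into a bounded number of larger ones by hashing, which is roughly the effect a GroupByKey has on the number of pieces that end up being written.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Plain-Java illustration of resharding; hypothetical names, not Dataflow API. */
class ReshardSketch {

    /** Regroups the elements of many small shards into at most maxShards larger ones. */
    static <T> List<List<T>> reshard(List<List<T>> smallShards, int maxShards) {
        List<List<T>> merged = new ArrayList<>();
        for (int i = 0; i < maxShards; i++) {
            merged.add(new ArrayList<>());
        }
        for (List<T> shard : smallShards) {
            for (T element : shard) {
                // Math.floorMod keeps the index non-negative even for negative hash codes.
                int target = Math.floorMod(element.hashCode(), maxShards);
                merged.get(target).add(element);
            }
        }
        // Drop shards that received nothing so they would not become empty outputs.
        merged.removeIf(List::isEmpty);
        return merged;
    }

    public static void main(String[] args) {
        List<List<Integer>> small = new ArrayList<>();
        for (int i = 0; i < 10_001; i++) {
            // One element per shard, as with very finely sharded input files.
            small.add(Collections.singletonList(i));
        }
        List<List<Integer>> merged = reshard(small, 100);
        System.out.println(small.size() + " shards -> " + merged.size() + " shards");
    }
}
```

With 10001 single-element shards as input, the output stays at 100 shards regardless of how finely the input was split.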
The note in "In Google Cloud Dataflow BigQueryIO.Write occur Unknown Error (http code 500)" describes a mitigation for this issue:
Dataflow SDK for Java 1.x: as a workaround, you can enable this experiment: --experiments=enable_custom_bigquery_sink
In Dataflow SDK for Java 2.x, this behavior is default and no experiments are necessary.
Note that in both versions, temporary files in GCS may be left over if your job fails.
public static class ForceGroupBy<T> extends PTransform<PCollection<T>, PCollection<KV<T, Iterable<Void>>>> {
  private static final long serialVersionUID = 1L;
  @Override
  public PCollection<KV<T, Iterable<Void>>> apply(PCollection<T> input) {
    // Pair every element with a null value so that it can be grouped by itself.
    PCollection<KV<T, Void>> syntheticGroup = input.apply(
        ParDo.of(new DoFn<T, KV<T, Void>>() {
          private static final long serialVersionUID = 1L;
          @Override
          public void processElement(
              DoFn<T, KV<T, Void>>.ProcessContext c)
              throws Exception {
            c.output(KV.of(c.element(), (Void) null));
          }
        }));
    // The GroupByKey gives the service a chance to reshard the data into larger pieces.
    return syntheticGroup.apply(GroupByKey.<T, Void>create());
  }
}
I'm setting up a slow-changing lookup Map in my Apache-Beam pipeline. It continuously updates the lookup map. For each key in lookup map, I retrieve the latest value in the global window with accumulating mode.
But it always fails with this exception:
org.apache.beam.sdk.Pipeline$PipelineExecutionException: java.lang.IllegalArgumentException: Duplicate values for mykey
Is there anything wrong with this code snippet?
If I use .discardingFiredPanes() instead, I lose the information from earlier panes and only see what was in the last emit.
.apply(GenerateSequence.from(0).withRate(1, Duration.standardMinutes(1L)))
Window.<Long>into(new GlobalWindows())
.apply(new ReadSlowChangingTable())
Example Input Trigger:
t1 : KV<k1,v1> KV<k2,v2>
t2 : KV<k1,v1>
accumulatingFiredPanes => expected result at t2 => KV(k1,v1), KV(k2,v2), but it fails with the duplicate values exception
discardingFiredPanes => expected result at t2 => KV(k1,v1) Success
Specifically with regard to the View.asMap and accumulating panes discussion in the comments:
If you would like to use the View.asMap side input (for example, when the source of the map elements is itself distributed, often because you are creating a side input from the output of a previous transform), there are other factors to take into consideration. View.asMap is itself an aggregation: it inherits the triggering and accumulates its input. In this specific pattern, setting the pipeline to accumulatingFiredPanes mode before this transform results in duplicate key errors even if a transform such as Latest.perKey is applied before View.asMap.
Since the read updates the whole map, View.asSingleton is, I think, a better approach for this use case.
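The duplicate-key failure can be reproduced without Beam. The following is a simplified plain-Java model, not Beam code: strictMap stands in for the map view (which rejects repeated keys), latestPerKey stands in for Latest.perKey applied to one firing, and viewInput replays the t1/t2 example from the question. All class and method names are made up for the illustration.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Simplified model of pane accumulation in front of a map view; not Beam API. */
class AccumulatingViewSketch {

    static Map.Entry<String, String> kv(String k, String v) {
        return new AbstractMap.SimpleEntry<>(k, v);
    }

    /** Builds a map that rejects a key seen more than once, as the map view does. */
    static Map<String, String> strictMap(List<Map.Entry<String, String>> elements) {
        Map<String, String> result = new HashMap<>();
        for (Map.Entry<String, String> e : elements) {
            if (result.putIfAbsent(e.getKey(), e.getValue()) != null) {
                throw new IllegalArgumentException("Duplicate values for " + e.getKey());
            }
        }
        return result;
    }

    /** Stand-in for Latest.perKey on a single firing: keeps the last value per key. */
    static List<Map.Entry<String, String>> latestPerKey(List<Map.Entry<String, String>> pane) {
        Map<String, String> latest = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : pane) {
            latest.put(e.getKey(), e.getValue());
        }
        return new ArrayList<>(latest.entrySet());
    }

    /** Elements the view has accumulated once the second trigger (t2) has fired. */
    static List<Map.Entry<String, String>> viewInput(boolean accumulating) {
        List<Map.Entry<String, String>> t1 = Arrays.asList(kv("k1", "v1"), kv("k2", "v2"));
        List<Map.Entry<String, String>> t2 = Arrays.asList(kv("k1", "v1"));

        // Upstream pane at the second firing: accumulating mode replays t1 as well.
        List<Map.Entry<String, String>> secondPane = new ArrayList<>();
        if (accumulating) {
            secondPane.addAll(t1);
        }
        secondPane.addAll(t2);

        // The view is an aggregation too, so in accumulating mode it also keeps
        // what the first firing produced.
        List<Map.Entry<String, String>> seenByView = new ArrayList<>();
        if (accumulating) {
            seenByView.addAll(latestPerKey(t1));
        }
        seenByView.addAll(latestPerKey(secondPane));
        return seenByView;
    }

    public static void main(String[] args) {
        System.out.println("discarding:   " + strictMap(viewInput(false)));
        try {
            strictMap(viewInput(true));
        } catch (IllegalArgumentException e) {
            System.out.println("accumulating: " + e.getMessage());
        }
    }
}
```

In this model the per-key deduplication works within each firing, yet the view still receives k1 once per firing, which is why deduplicating upstream does not help in accumulating mode.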
Some general notes around this pattern, which will hopefully be useful for others as well:
For this pattern, use the GenerateSequence source transform to emit a value periodically, for example once a day. Pass this value into a global window with a data-driven trigger that fires on each element. In a DoFn, use each firing as the cue to pull data from your bounded source. Then create your side input for use in downstream transforms.
It's important to note that because this pattern uses a global-window side input triggering on processing time, matching to elements being processed in event time will be nondeterministic. For example if we have a main pipeline which is windowed on event time, the version of the SideInput View that those windows will see will depend on the latest trigger that has fired in processing time rather than the event time.
Also important to note that in general the side input should be something that fits into memory.
Java (SDK 2.9.0):
In the sample below the side input is updated at very short intervals so that the effects can be seen easily. In practice the side input is expected to update slowly, for example every few hours or once a day.
The example code below builds a Map in a DoFn and exposes it through View.asSingleton; this is the recommended approach for this pattern.
The sample illustrates the pattern. Note that the View.asSingleton is rebuilt on every counter update.
public static void main(String[] args) {
  // Create pipeline
  PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
  // Using View.asSingleton, this pipeline uses a dummy external service as illustration.
  // Run in debug mode to see the output
  Pipeline p = Pipeline.create(options);
  // Create slowly updating sideinput
  PCollectionView<Map<String, String>> map = p
      .apply(GenerateSequence.from(0).withRate(1, Duration.standardSeconds(5L)))
      .apply(Window.<Long>into(new GlobalWindows())
          // Data-driven trigger that fires on each element; only the newest map is kept.
          .triggering(Repeatedly.forever(AfterProcessingTime.pastFirstElementInPane()))
          .discardingFiredPanes())
      .apply(ParDo.of(new DoFn<Long, Map<String, String>>() {
        @ProcessElement
        public void process(@Element Long input,
            OutputReceiver<Map<String, String>> o) {
          // Do any external reads needed here...
          // We will make use of our dummy external service.
          // Every time this triggers, the complete map will be replaced with that read from
          // the service.
          o.output(DummyExternalService.readDummyData());
        }
      }))
      .apply(View.asSingleton());
  // ---- Consume slowly updating sideinput
  // GenerateSequence is only used here to generate dummy data for this illustration.
  // You would use your real source for example PubSubIO, KafkaIO etc...
  p.apply(GenerateSequence.from(0).withRate(1, Duration.standardSeconds(1L)))
      .apply(ParDo.of(new DoFn<Long, KV<Long, Long>>() {
        @ProcessElement
        public void process(ProcessContext c) {
          Map<String, String> keyMap = c.sideInput(map);
          c.outputWithTimestamp(KV.of(1L, c.element()), Instant.now());
          LOG.debug("Value is {} key A is {} and key B is {}",
              c.element(), keyMap.get("Key_A"), keyMap.get("Key_B"));
        }
      }).withSideInputs(map));
  p.run();
}
public static class DummyExternalService {
  public static Map<String, String> readDummyData() {
    Map<String, String> map = new HashMap<>();
    Instant now = Instant.now();
    // Joda-Time pattern: HH = hour of day, mm = minute, ss = second.
    DateTimeFormatter dtf = DateTimeFormat.forPattern("HH:mm:ss");
    map.put("Key_A", now.minus(Duration.standardSeconds(30)).toString(dtf));
    map.put("Key_B", now.minus(Duration.standardSeconds(30)).toString());
    return map;
  }
}
I have 2 partitions in a step which writes into a database. I want to record the number of rows written in each partition, get the sum, and print it to the log.
I was thinking of using a static variable in the Writer and use Step Context/Job Context to get it in afterStep() of the Step Listener. However when I tried it I got null. I am able to get these values in close() of the Reader.
Is this the right way to go about it? Or should I use Partition Collector/Reducer/ Analyzer?
I am using Java Batch on WebSphere Liberty, and I am developing in Eclipse.
I was thinking of using a static variable in the Writer and use Step Context/Job Context to get it in afterStep() of the Step Listener. However when I tried it I got null.
The ItemWriter might already be destroyed at this point, but I'm not sure.
Is this the right way to go about it?
Yes, it should be good enough. However, you need to ensure the total row count is shared for all partitions because the batch runtime maintains a StepContext clone per partition. You should rather use JobContext.
I think using PartitionCollector and PartitionAnalyzer is a good choice, too. The PartitionCollector interface has a method collectPartitionData() to collect data coming from its partition. Once collected, the batch runtime passes this data to the PartitionAnalyzer to analyze it. Note that there are:
N PartitionCollector per step (1 per partition)
N StepContext per step (1 per partition)
1 PartitionAnalyzer per step
The number of records written can be passed via the StepContext's transientUserData. Since each StepContext is reserved for its own step partition, the transient user data won't be overwritten by another partition.
Here's the implementation :
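MyItemWriter :
@Inject
private StepContext stepContext;
@Override
public void writeItems(List<Object> items) throws Exception {
    // ...
    // Add this chunk's rows to the count kept in this partition's StepContext.
    Object userData = stepContext.getTransientUserData();
    int partRowCount = userData != null ? (int) userData : 0;
    stepContext.setTransientUserData(partRowCount + items.size());
}
PartitionCollector implementation :
@Inject
private StepContext stepContext;
@Override
public Serializable collectPartitionData() throws Exception {
    // get transient user data
    Object userData = stepContext.getTransientUserData();
    int partRowCount = userData != null ? (int) userData : 0;
    // Reset the count so rows already reported are not added again on the next call.
    stepContext.setTransientUserData(0);
    return partRowCount;
}
PartitionAnalyzer implementation :
private int rowCount = 0;
@Override
public void analyzeCollectorData(Serializable fromCollector) throws Exception {
    rowCount += (int) fromCollector;
    System.out.printf("%d rows processed (all partitions).%n", rowCount);
}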
Reference : JSR352 v1.0 Final Release.pdf
Let me offer a bit of an alternative on the accepted answer and add some comments.
PartitionAnalyzer variant - Use analyzeStatus() method
Another technique would be to use analyzeStatus which only gets called at the end of each entire partition, and is passed the partition-level exit status.
public void analyzeStatus(BatchStatus batchStatus, String exitStatus)
In contrast, the above answer using analyzeCollectorData gets called at the end of each chunk on each partition.
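The count can then be carried in the exit status, set for example from an ItemWriteListener:
public class MyItemWriteListener extends AbstractItemWriteListener {
    @Inject
    StepContext stepCtx;
    @Override
    public void afterWrite(List<Object> items) throws Exception {
        // update 'newCount' based on items.size(), then publish it as this partition's exit status
        stepCtx.setExitStatus(Integer.toString(newCount));
    }
}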
Obviously this only works if you aren't using the exit status for some other purpose. You can set the exit status from any artifact (though this freedom might be one more thing to keep track of).
The API is designed to let an implementation dispatch individual partitions across JVMs (e.g. in Liberty you can see this here). Using a static ties you to a single JVM, so it's not a recommended approach.
Also note that both the JobContext and the StepContext are implemented in the "thread-local"-like fashion we see in batch.
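As a loose analogy for that "thread-local"-like behaviour, here is a minimal plain-Java sketch. It does not use the JSR 352 API: the ThreadLocal stands in for a per-partition StepContext, the AtomicInteger stands in for the single analyzer, and all names are made up for the illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/** Plain-Java analogy for per-partition contexts; not the JSR 352 API. */
class PartitionContextSketch {

    /** Stand-in for the per-partition context: every thread sees its own counter. */
    static final ThreadLocal<int[]> PARTITION_ROWS = ThreadLocal.withInitial(() -> new int[1]);

    /** Runs one thread per partition and returns the total reported to the "analyzer". */
    static int runPartitions(int[] rowsPerPartition) {
        AtomicInteger total = new AtomicInteger(); // the single, shared "analyzer"
        List<Thread> threads = new ArrayList<>();
        for (int rows : rowsPerPartition) {
            Thread t = new Thread(() -> {
                for (int i = 0; i < rows; i++) {
                    PARTITION_ROWS.get()[0]++; // "writer": touches only this thread's counter
                }
                // "collector": hands this partition's count over to the analyzer.
                total.addAndGet(PARTITION_ROWS.get()[0]);
            });
            threads.add(t);
            t.start();
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for partitions", e);
            }
        }
        return total.get();
    }

    public static void main(String[] args) {
        System.out.println(runPartitions(new int[] {1000, 2500}) + " rows processed (all partitions).");
    }
}
```

The per-partition counters never interfere with each other, but the sum only exists because each partition explicitly reports to one shared place. Within a single JVM a static could play that role; the collector/analyzer hand-off is what keeps working when partitions run elsewhere.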
I’m trying to write a proof-of-concept which takes messages from Kafka, transforms them using Beam on Flink, then pushes the results onto a different Kafka topic.
I’ve used the KafkaWindowedWordCountExample as a starting point, and that’s doing the first part of what I want to do, but it outputs to text files as opposed to Kafka. FlinkKafkaProducer08 looks promising, but I can’t figure out how to plug it into the pipeline. I was thinking that it would be wrapped with an UnboundedFlinkSink, or some such, but that doesn’t seem to exist.
Any advice or thoughts on what I’m trying to do?
I’m running the latest incubator-beam (as of last night from Github), Flink 1.0.0 in cluster mode and Kafka, all on Google Compute Engine (Debian Jessie).
There is currently no UnboundedSink class in Beam. Most unbounded sinks are implemented using a ParDo.
You may wish to check out the KafkaIO connector. This is a Kafka reader that works in all Beam runners, and implements the parallel reading, checkpointing, and other UnboundedSource APIs. That pull request also includes a crude sink in the TopHashtags example pipeline by writing to Kafka in a ParDo:
class KafkaWriter extends DoFn<String, Void> {
  private final String topic;
  private final Map<String, Object> config;
  private transient KafkaProducer<String, String> producer = null;

  public KafkaWriter(Options options) {
    this.topic = options.getOutputTopic();
    this.config = ImmutableMap.<String, Object>of(
        "bootstrap.servers", options.getBootstrapServers(),
        "key.serializer", StringSerializer.class.getName(),
        "value.serializer", StringSerializer.class.getName());
  }

  @Override
  public void startBundle(Context c) throws Exception {
    if (producer == null) { // in Beam, startBundle might be called multiple times.
      producer = new KafkaProducer<String, String>(config);
    }
  }

  @Override
  public void finishBundle(Context c) throws Exception {
    // Assumption: flush here so that records buffered during the bundle are actually sent.
    producer.flush();
  }

  @Override
  public void processElement(ProcessContext ctx) throws Exception {
    producer.send(new ProducerRecord<String, String>(topic, ctx.element()));
  }
}
Of course, we would like to add sink support in KafkaIO as well. It would effectively be the same as KafkaWriter above, but much simpler to use.
A sink transform for writing to Kafka was added to Apache Beam / Dataflow in 2016. See the JavaDoc for KafkaIO in Apache Beam for a usage example.
As a followup question to the following question and answer:
I'd like to confirm with the Google Dataflow engineering team (@jkff) whether the 3rd option proposed by Eugene is at all possible with Google Dataflow:
"have a ParDo that takes these keys and creates the BigQuery tables, and another ParDo that takes the data and streams writes to the tables"
My understanding is that ParDo/DoFn will process each element, how could we specify a table name (function of the keys passed in from side inputs) when writing out from processElement of a ParDo/DoFn?
Updated with a DoFn, which obviously does not work, since c.element().getValue() is not a PCollection.
PCollection<KV<String, Iterable<String>>> output = ...;
public class DynamicOutput2Fn extends DoFn<KV<String, Iterable<String>>, Integer> {
  private final PCollectionView<List<String>> keysAsSideinputs;
  public DynamicOutput2Fn(PCollectionView<List<String>> keysAsSideinputs) {
    this.keysAsSideinputs = keysAsSideinputs;
  }
  public void processElement(ProcessContext c) {
    List<String> keys = c.sideInput(keysAsSideinputs);
    String key = c.element().getKey();
    //the below is not working!!! How could we write the value out to a sink, be it gcs file or bq table???
    c.element().getValue().apply(Pardo.of(new FormatLineFn()))
  }
}
The BigQueryIO.Write transform does not support this. The closest thing you can do is to use per-window tables, and encode whatever information you need to select the table in the window objects by using a custom WindowFn.
If you don't want to do that, you can make BigQuery API calls directly from your DoFn. With this, you can set the table name to anything you want, as computed by your code. This could be looked up from a side input, or computed directly from the element the DoFn is currently processing. To avoid making too many small calls to BigQuery, you can buffer the rows and send them in batches, flushing whatever is left in finishBundle().
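The batching idea can be sketched in plain Java as follows. This is not Dataflow or BigQuery client code: BatchingTableWriter is a made-up class, and the sink callback is a placeholder for whatever call actually inserts rows into a table. In a real DoFn the two methods would correspond to processElement and finishBundle.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;

/** Plain-Java sketch of per-bundle batching; the sink is a placeholder for a BigQuery call. */
class BatchingTableWriter {
    private final int maxBatchSize;
    private final BiConsumer<String, List<String>> sink; // (tableName, rows) -> one API call
    private final Map<String, List<String>> pending = new HashMap<>();

    BatchingTableWriter(int maxBatchSize, BiConsumer<String, List<String>> sink) {
        this.maxBatchSize = maxBatchSize;
        this.sink = sink;
    }

    /** Called once per element; the table name is computed by the caller per element. */
    void processElement(String tableName, String row) {
        List<String> rows = pending.computeIfAbsent(tableName, k -> new ArrayList<>());
        rows.add(row);
        if (rows.size() >= maxBatchSize) {
            flush(tableName);
        }
    }

    /** Called at the end of a bundle; sends whatever is still buffered. */
    void finishBundle() {
        // Iterate over a copy of the keys because flush() removes entries from the map.
        for (String tableName : new ArrayList<>(pending.keySet())) {
            flush(tableName);
        }
    }

    private void flush(String tableName) {
        List<String> rows = pending.remove(tableName);
        if (rows != null && !rows.isEmpty()) {
            sink.accept(tableName, rows);
        }
    }

    public static void main(String[] args) {
        BatchingTableWriter writer = new BatchingTableWriter(
            2, (table, rows) -> System.out.println(table + " <- " + rows));
        writer.processElement("table_a", "row1");
        writer.processElement("table_b", "row2");
        writer.processElement("table_a", "row3"); // table_a reaches the batch size and is sent
        writer.finishBundle();                     // table_b is sent here
    }
}
```

Buffering per table name keeps each call homogeneous, and the size cap bounds how much a single bundle holds in memory.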
You can see how the Dataflow runner does the streaming import here:
I'm using dataflow to generate a large amount of data.
I've tested two versions of my pipeline: one with a side input (of varying sizes), and the other without.
When I run the pipeline without the side input, my job will finish in about 7 minutes. When I run my job with the side input, my job will never finish.
Here's what my DoFn looks like:
public class MyDoFn extends DoFn<String, String> {
  final PCollectionView<Map<String, Iterable<TreeMap<Long, Float>>>> pCollectionView;
  final List<CSVRecord> stuff;
  private Aggregator<Integer, Integer> dofnCounter =
      createAggregator("DoFn Counter", new Sum.SumIntegerFn());

  public MyDoFn(PCollectionView<Map<String, Iterable<TreeMap<Long, Float>>>> pcv, List<CSVRecord> m) {
    this.pCollectionView = pcv;
    this.stuff = m;
  }

  public void processElement(ProcessContext processContext) throws Exception {
    Map<String, Iterable<TreeMap<Long, Float>>> pdata = processContext.sideInput(pCollectionView);
    processContext.output(AnotherClass.generateData(stuff, pdata));
  }
}
And here's what my pipeline looks like:
final Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).withValidation().create());
PCollection<KV<String, TreeMap<Long, Float>>> data;
data = p.apply(TextIO.Read.from("gs://where_the_files_are/*").named("Reading Data"))
    .apply(ParDo.named("Parsing data").of(new DoFn<String, KV<String, TreeMap<Long, Float>>>() {
      public void processElement(ProcessContext processContext) throws Exception {
        // Parse some data
        processContext.output(KV.of(key, value));
      }
    }));
final PCollectionView<Map<String, Iterable<TreeMap<Long, Float>>>> pcv =
    data.apply(GroupByKey.<String, TreeMap<Long, Float>>create())
        .apply(View.<String, Iterable<TreeMap<Long, Float>>>asMap());
DoFn<String, String> dofn = new MyDoFn(pcv, localList);
.apply(ParDo.named("Generating the Data").withSideInputs(pcv).of(dofn))
We've spent about two days trying various methods of getting this to work, and we've narrowed the problem down to the inclusion of the side input. Even if processElement is modified so that it never reads the side input, the job is still very slow as long as the side input is attached. If we don't call .withSideInputs(), it's very fast again.
Just to clarify, we've tested this with side input sizes from 20 MB to 1.5 GB.
Very grateful for any insight.
Including a few job IDs:
2016-01-21_08_04_33-1642110636871153093 (latest)
Please try the Dataflow SDK 1.5.0 or later; it should address the known performance problems behind your issue.
Side inputs in the Dataflow SDK 1.5.0+ use a new distributed format when running batch pipelines. Note that streaming pipelines and pipelines using older versions of the Dataflow SDK are still subject to re-reading the side input if the view can not be cached entirely in memory.
With the new format, we use an index to provide a block based lookup and caching strategy. Thus when looking into a list by index or looking into a map by key, only the block that contains said index or key will be loaded. Having a cache size which is greater than the working set size will aid in performance as frequently accessed indices/keys will not require re-reading the block they are contained in.
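To make the caching behaviour concrete, here is a minimal plain-Java sketch of an index-based block lookup with an LRU block cache. It is not the Dataflow implementation: BlockCachedList, the block loader, and the load counter are made up for the illustration, and the loader stands in for re-reading a block of the side input.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.IntFunction;

/** Plain-Java sketch of a block-indexed lookup with an LRU block cache; not Dataflow code. */
class BlockCachedList {
    private final int blockSize;
    private final IntFunction<String[]> blockLoader; // blockIndex -> block contents (the "re-read")
    private final Map<Integer, String[]> cache;
    private int loads = 0;

    BlockCachedList(int blockSize, int maxCachedBlocks, IntFunction<String[]> blockLoader) {
        this.blockSize = blockSize;
        this.blockLoader = blockLoader;
        // accessOrder = true keeps entries ordered from least to most recently used.
        this.cache = new LinkedHashMap<Integer, String[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Integer, String[]> eldest) {
                return size() > maxCachedBlocks;
            }
        };
    }

    /** Looks up one element; only the block containing the index is ever loaded. */
    String get(int index) {
        int blockIndex = index / blockSize;
        String[] block = cache.get(blockIndex);
        if (block == null) {
            block = blockLoader.apply(blockIndex);
            loads++;
            cache.put(blockIndex, block);
        }
        return block[index % blockSize];
    }

    int loads() {
        return loads;
    }

    public static void main(String[] args) {
        IntFunction<String[]> loader = b -> {
            String[] block = new String[4];
            for (int i = 0; i < 4; i++) {
                block[i] = "v" + (b * 4 + i);
            }
            return block;
        };
        BlockCachedList roomy = new BlockCachedList(4, 2, loader); // cache holds the working set
        BlockCachedList tight = new BlockCachedList(4, 1, loader); // cache smaller than working set
        for (int round = 0; round < 10; round++) {
            roomy.get(1); roomy.get(5);
            tight.get(1); tight.get(5);
        }
        System.out.println("loads with room for 2 blocks: " + roomy.loads());
        System.out.println("loads with room for 1 block:  " + tight.loads());
    }
}
```

When the cache covers the working set, each block is read once; when it is one block too small, alternating accesses evict each other and every lookup turns into a re-read.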
Side inputs in the Dataflow SDK can, indeed, introduce slowness if not used carefully. Most often, this happens when each worker has to re-read the entire side input per main input element.
You seem to be using a PCollectionView created via asMap. In this case, the entire side input PCollection must fit into memory of each worker. When needed, Dataflow SDK will copy this data on each worker to create such a map.
That said, the map on each worker may be created just once or multiple times, depending on several factors. If its size is small enough (usually less than 100 MB), it is likely that the map is read only once per worker and reused across elements and across bundles. However, if its size cannot fit into our cache (or something else evicts it from the cache), the entire map may be re-read again and again on each worker. This is, most often, the root-cause of the slowness.
The cache size is controllable via PipelineOptions as follows but, due to several important bug fixes, this should only be used with version 1.3.0 and later.
DataflowWorkerHarnessOptions opts = PipelineOptionsFactory.fromArgs(args).withValidation().create().cloneAs(DataflowWorkerHarnessOptions.class);
opts.setWorkerCacheMb(cacheSizeMb); // cacheSizeMb is a placeholder for the cache size (in MB) you choose
Pipeline p = Pipeline.create(opts);
For the time being, the fix is to change the structure of the pipeline to avoid excessive re-reading. I cannot offer specific advice there, as you haven't shared enough information about your use case. (Please post a separate question if needed.)
We are actively working on a related feature we refer to as distributed side inputs. This will allow a lookup into the side input PCollection without constructing the entire map on the worker. It should significantly help performance in this and related cases. We expect to release this very shortly.
I didn't see anything particularly suspicious about the two jobs you have quoted. They've been cancelled relatively quickly.
I'm manually setting the cache size when creating the pipeline in the following manner:
DataflowWorkerHarnessOptions opts = PipelineOptionsFactory.fromArgs(args).withValidation().create().cloneAs(DataflowWorkerHarnessOptions.class);
opts.setWorkerCacheMb(cacheSizeMb); // cacheSizeMb is a placeholder for the cache size (in MB) being tested
Pipeline p = Pipeline.create(opts);
For a side input of ~25 MB, this speeds up the execution time considerably (job id 2016-01-25_08_42_52-657610385797048159) compared with creating the pipeline in the manner below (job id 2016-01-25_07_56_35-14864561652521586982):
PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
However, when the side input is increased to ~400 MB, no increase in cache size improves performance. In theory, is all the memory indicated by the GCE machine type available for use by the worker? What would invalidate or evict something from the worker cache, forcing the re-read?