Received message larger than max on a batch processing pipeline - google-cloud-dataflow

I have a batch processing pipeline that has been running daily on Google's Cloud Dataflow service. It has started failing with the following message:
(88b342a0e3852af3): java.io.IOException: INVALID_ARGUMENT: Received message larger than max (21824326 vs. 4194304)
dataflow-batch-jetty-11171129-7ea5-harness-waia talking to localhost:12346 at
com.google.cloud.dataflow.sdk.runners.worker.ApplianceShuffleWriter.close(Native Method) at
com.google.cloud.dataflow.sdk.runners.worker.ChunkingShuffleEntryWriter.close(ChunkingShuffleEntryWriter.java:67) at
com.google.cloud.dataflow.sdk.runners.worker.ShuffleSink$ShuffleSinkWriter.close(ShuffleSink.java:286) at
com.google.cloud.dataflow.sdk.util.common.worker.WriteOperation.finish(WriteOperation.java:100) at
com.google.cloud.dataflow.sdk.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:77) at
com.google.cloud.dataflow.sdk.runners.worker.DataflowWorker.executeWork(DataflowWorker.java:264) at
com.google.cloud.dataflow.sdk.runners.worker.DataflowWorker.doWork(DataflowWorker.java:197) at
com.google.cloud.dataflow.sdk.runners.worker.DataflowWorker.getAndPerformWork(DataflowWorker.java:149) at
com.google.cloud.dataflow.sdk.runners.worker.DataflowWorkerHarness$WorkerThread.doWork(DataflowWorkerHarness.java:192) at
com.google.cloud.dataflow.sdk.runners.worker.DataflowWorkerHarness$WorkerThread.call(DataflowWorkerHarness.java:173) at
com.google.cloud.dataflow.sdk.runners.worker.DataflowWorkerHarness$WorkerThread.call(DataflowWorkerHarness.java:160) at
java.util.concurrent.FutureTask.run(FutureTask.java:266) at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at
java.lang.Thread.run(Thread.java:745)
I am still using an old workaround to output a CSV file with headers such as
PCollection<String> output = data.apply(ParDo.of(new DoFn<String, String>() {
    String new_line = System.getProperty("line.separator");
    String csv_header = "id, stuff_1, stuff_2" + new_line;
    StringBuilder csv_body = new StringBuilder().append(csv_header);

    @Override
    public void processElement(ProcessContext c) {
        csv_body.append(c.element()).append(new_line);
    }

    @Override
    public void finishBundle(Context c) throws Exception {
        c.output(csv_body.toString());
    }
})).apply(TextIO.Write.named("WriteData").to(options.getOutput()));
What is causing this? Is the output of this DoFn too big now? The size of the dataset being processed has not increased.

This looks like it might be a bug on our side and we're looking into it, but in general the code is probably not doing what you intend it to do.
As written, you'll end up with an unspecified number of output files, whose names start with the given prefix, each file containing a concatenation of your expected CSV-like output (including headers) for different chunks of the data, in an unspecified order.
To implement the CSV writing properly, use TextIO.Write.withHeader() to specify the header and remove your CSV-constructing ParDo entirely. This approach also avoids triggering the bug.
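For example (a minimal sketch, assuming Dataflow SDK for Java 1.7.0 or later, where withHeader() and withSuffix() are available, and reusing the header string from the workaround above):
data.apply(TextIO.Write.named("WriteData")
    .to(options.getOutput())
    .withHeader("id, stuff_1, stuff_2")
    .withSuffix(".csv"));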

Related

Apply not applicable with ParDo and DoFn using Apache Beam

I am implementing a Pub/Sub to BigQuery pipeline. It looks similar to How to create read transform using ParDo and DoFn in Apache Beam, but here I already have a PCollection created.
I am following what is described in the Apache Beam documentation to implement a ParDo operation to prepare a table row using the following pipeline:
static class convertToTableRowFn extends DoFn<PubsubMessage, TableRow> {
    @ProcessElement
    public void processElement(ProcessContext c) {
        PubsubMessage message = c.element();
        // Retrieve data from message
        String rawData = message.getData();
        Instant timestamp = new Instant(new Date());
        // Prepare TableRow
        TableRow row = new TableRow().set("message", rawData).set("ts_reception", timestamp);
        c.output(row);
    }
}
// Read input from Pub/Sub
// Read input from Pub/Sub
pipeline.apply("Read from Pub/Sub", PubsubIO.readMessagesWithAttributes().fromTopic(topicPath))
    .apply("Prepare raw data for insertion", ParDo.of(new convertToTableRowFn()))
    .apply("Insert in Big Query", BigQueryIO.writeTableRows().to(BQTable));
I found the DoFn function in a gist.
I keep getting the following error:
The method apply(String, PTransform<? super PCollection<PubsubMessage>,OutputT>) in the type PCollection<PubsubMessage> is not applicable for the arguments (String, ParDo.SingleOutput<PubsubMessage,TableRow>)
I always understood that a ParDo/DoFn operation is an element-wise PTransform; am I wrong? I never got this type of error in Python, so I'm a bit confused about why this is happening.
You're right, ParDos are element-wise transforms and your approach looks correct.
What you're seeing is a compilation error. It happens when the argument type that the Java compiler infers for apply() doesn't match the type of the actual argument, e.g. convertToTableRowFn.
From the error it looks like Java infers that the second parameter of apply() is of type PTransform<? super PCollection<PubsubMessage>, OutputT>, while you're passing a subclass of ParDo.SingleOutput<PubsubMessage, TableRow> instead (your convertToTableRowFn). Looking at the definition of SingleOutput, your convertToTableRowFn is basically a PTransform<PCollection<? extends PubsubMessage>, PCollection<TableRow>>, and Java fails to use it where apply() expects PTransform<? super PCollection<PubsubMessage>, OutputT>.
What looks suspicious is that Java didn't infer OutputT to PCollection<TableRow>. One reason it would fail to do so is if you have other errors. Are you sure you don't have other errors as well?
For example, looking at convertToTableRowFn, you're calling message.getData(), which doesn't exist when I try it and fails compilation there. In my case I need to do something like rawData = new String(message.getPayload(), Charset.defaultCharset()) instead. Also, .to(BQTable) expects a string (e.g. a string representing the BigQuery table name) as an argument, and you're passing the unknown symbol BQTable (maybe it exists somewhere in your program, in which case this is not a problem in your case).
After I fix these two errors your code compiles for me, apply() is fully inferred and the types are compatible.
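For reference, a corrected version of the DoFn with those two fixes might look like this (a sketch; it assumes the message payload is text in the default charset and keeps the original field names):
import java.nio.charset.Charset;
import java.util.Date;
import org.joda.time.Instant;
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
import org.apache.beam.sdk.transforms.DoFn;

static class convertToTableRowFn extends DoFn<PubsubMessage, TableRow> {
    @ProcessElement
    public void processElement(ProcessContext c) {
        PubsubMessage message = c.element();
        // PubsubMessage exposes its body as bytes via getPayload(), not getData().
        String rawData = new String(message.getPayload(), Charset.defaultCharset());
        Instant timestamp = new Instant(new Date());
        // The timestamp is emitted as a string here; adjust to whatever your
        // ts_reception column type expects.
        TableRow row = new TableRow()
            .set("message", rawData)
            .set("ts_reception", timestamp.toString());
        c.output(row);
    }
}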

RinSim: loading the Leuven map as a dynamic graph with collision avoidance fails with a "connection too short" error

I am trying to do route planning with RinSim, and I want to take collision avoidance into account, so I load the map with this method (because it seems collision avoidance is only supported on a dynamic graph):
private static ListenableGraph<LengthData> loadGrDynamicGraph(String name) {
    try {
        Graph<LengthData> g = DotGraphIO.getLengthGraphIO(Filters.selfCycleFilter())
            .read(DDRP.class.getResourceAsStream(name));
        return new ListenableGraph<>(g);
    } catch (Exception e) {
        // The exception is swallowed here; logging it makes load failures visible.
        e.printStackTrace();
    }
    return null;
}
I set the vehicle length to 1d and the distance unit to SI.METER, and it ends up with the following error:
Exception in thread "main" java.lang.IllegalArgumentException: Invalid graph: the minimum connection length is 1.0, connection (3296724.2131123254,2.5725043247255992E7)->(3296782.7337179,2.5724994399343655E7) defines length data that is too short: 0.8.
at com.google.common.base.Preconditions.checkArgument(Preconditions.java:146)
at com.github.rinde.rinsim.core.model.road.CollisionGraphRoadModelImpl.checkConnectionLength(CollisionGraphRoadModelImpl.java:261)
at com.github.rinde.rinsim.core.model.road.RoadModelBuilders$CollisionGraphRMB.build(RoadModelBuilders.java:702)
at com.github.rinde.rinsim.core.model.road.RoadModelBuilders$CollisionGraphRMB.build(RoadModelBuilders.java:606)
at com.github.rinde.rinsim.core.model.DependencyResolver$Dependency.build(DependencyResolver.java:223)
at com.github.rinde.rinsim.core.model.DependencyResolver$Dependency.<init>(DependencyResolver.java:217)
at com.github.rinde.rinsim.core.model.DependencyResolver.add(DependencyResolver.java:71)
at com.github.rinde.rinsim.core.model.ModelManager$Builder.doAdd(ModelManager.java:231)
at com.github.rinde.rinsim.core.model.ModelManager$Builder.add(ModelManager.java:212)
at com.github.rinde.rinsim.core.Simulator$Builder.addModel(Simulator.java:324)
at com.github.rinde.rinsim.examples.project.DDRP.run(DDRP.java:86)
at com.github.rinde.rinsim.examples.project.DDRP.main(DDRP.java:60)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)
I tried to change the vehicle length, but the error still occurs. Does anyone know how to overcome this error?
Thank you
A graph from OpenStreetMap (such as the map of Leuven) is not meant to be used in combination with the CollisionGraphRoadModel that you are trying to use. The reason is that the CollisionGraphRoadModel is meant for a warehouse-like environment, not a public street: the model doesn't support multiple parallel lanes, which makes it unrealistic for a city. The WarehouseExample defines two example graphs that can be used in combination with the CollisionGraphRoadModel.

Using MySQL as input source and writing into Google BigQuery

I have an Apache Beam task that reads from a MySQL source using JDBC and is supposed to write the data as-is to a BigQuery table. No transformation is performed at this point; that will come later on. For the moment I just want the database output to be written directly into BigQuery.
This is the main method trying to perform this operation:
public static void main(String[] args) {
Options options = PipelineOptionsFactory.fromArgs(args).withValidation().as(Options.class);
Pipeline p = Pipeline.create(options);
// Build the table schema for the output table.
List<TableFieldSchema> fields = new ArrayList<>();
fields.add(new TableFieldSchema().setName("phone").setType("STRING"));
fields.add(new TableFieldSchema().setName("url").setType("STRING"));
TableSchema schema = new TableSchema().setFields(fields);
p.apply(JdbcIO.<KV<String, String>>read()
.withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
"com.mysql.jdbc.Driver", "jdbc:mysql://host:3306/db_name")
.withUsername("user")
.withPassword("pass"))
.withQuery("SELECT phone_number, identity_profile_image FROM scraper_caller_identities LIMIT 100")
.withRowMapper(new JdbcIO.RowMapper<KV<String, String>>() {
public KV<String, String> mapRow(ResultSet resultSet) throws Exception {
return KV.of(resultSet.getString(1), resultSet.getString(2));
}
})
.apply(BigQueryIO.Write
.to(options.getOutput())
.withSchema(schema)
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE)));
p.run();
}
But when I execute the template using Maven, I get the following error:
Test.java:[184,6] cannot find symbol symbol: method
apply(com.google.cloud.dataflow.sdk.io.BigQueryIO.Write.Bound)
location: class org.apache.beam.sdk.io.jdbc.JdbcIO.Read<com.google.cloud.dataflow.sdk.values.KV<java.lang.String,java.lang.String>>
It seems that I'm not passing BigQueryIO.Write the expected data collection and that's what I am struggling with at the moment.
How can I make the data coming from MySQL meet BigQuery's expectations in this case?
I think that you need to provide a PCollection<TableRow> to BigQueryIO.Write instead of the PCollection<KV<String,String>> type that the RowMapper is outputting.
Also, please use the correct column name and value pairs when setting the TableRow.
Note: I think that your KVs are the phone and url values (e.g. {"555-555-1234": "http://www.url.com"}), not the column name and value pairs (e.g. {"phone": "555-555-1234", "url": "http://www.url.com"})
See the example here:
https://beam.apache.org/documentation/sdks/javadoc/0.5.0/
Would you please give this a try and let me know if it works for you? Hope this helps.
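For illustration, here is a sketch of that conversion. It assumes the Apache Beam SDK is used consistently (the error above suggests Dataflow 1.x and Beam classes are being mixed), that rows is the PCollection<KV<String, String>> obtained once the JdbcIO.<KV<String, String>>read() apply() is properly closed, and that schema and options are the ones defined in the question:
PCollection<TableRow> tableRows = rows.apply("Convert to TableRow",
    ParDo.of(new DoFn<KV<String, String>, TableRow>() {
        @ProcessElement
        public void processElement(ProcessContext c) {
            // Use the column names as keys, not the phone/url values themselves.
            c.output(new TableRow()
                .set("phone", c.element().getKey())
                .set("url", c.element().getValue()));
        }
    }));

tableRows.apply("Write to BigQuery", BigQueryIO.Write
    .to(options.getOutput())
    .withSchema(schema)
    .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
    .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE));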

How to add column names as a header when using Dataflow to export data to CSV

I am exporting some data to CSV with Dataflow, but beyond the data I want to add the column names as the first line of the output file, such as
col_name1, col_name2, col_name3, col_name4 ...
data1.1, data1.2, data1.3, data1.4 ...
data2.1 ...
Is there any way to do this with the current API? (I searched around TextIO.Write but didn't find anything that seems relevant.) Or is there any way I could "insert" the column names at the head of the to-be-exported PCollection and enforce that the data be written in order?
There is no built-in way to do that using TextIO.Write. PCollections are unordered, so it isn't possible to add an element to the front. You could write a custom BoundedSink which does this.
Custom sink APIs are now available if you want to be the brave one to craft a CSV sink. The current workaround builds up the output as a single string and outputs it all in finishBundle:
PCollection<String> output = data.apply(ParDo.of(new DoFn<String, String>() {
    private static final long serialVersionUID = 0;
    String new_line = System.getProperty("line.separator");
    String csv_header = "id, stuff1, stuff2, stuff3" + new_line;
    StringBuilder csv_body = new StringBuilder().append(csv_header);

    @Override
    public void processElement(ProcessContext c) {
        csv_body.append(c.element()).append(new_line);
    }

    @Override
    public void finishBundle(Context c) throws Exception {
        c.output(csv_body.toString());
    }
})).apply(TextIO.Write.named("WriteData").to(options.getOutput()));
This will only work if your big output string fits in memory.
As of Dataflow SDK version 1.7.0, there is a withHeader method in TextIO.Write.
So you can do this:
TextIO.Write.named("WriteToText")
    .to("/path/to/the/file")
    .withHeader("col_name1,col_name2,col_name3,col_name4")
    .withSuffix(".csv");
A new line character is automatically added to the end of the header.

Cloud Dataflow to BigQuery - too many sources

I have a job that, among other things, also inserts some of the data it reads from files into a BigQuery table for later manual analysis.
It fails with the following error:
job error: Too many sources provided: 10001. Limit is 10000., error: Too many sources provided: 10001. Limit is 10000.
What does it refer to as "source"? Is it a file or a pipeline step?
Thanks,
G
I'm guessing the error is coming from BigQuery and means that we are trying to upload too many files when we create your output table.
Could you provide some more details on the error / context (like a snippet of the command-line output, if using the BlockingDataflowPipelineRunner) so I can confirm? A jobId would also be helpful.
Is there something about your pipeline structure that is going to result in a large number of output files? That could either be a large amount of data or perhaps finely sharded input files without a subsequent GroupByKey operation (which would let us reshard the data into larger pieces).
The note in "In Google Cloud Dataflow BigQueryIO.Write occur Unknown Error (http code 500)" mitigates this issue:
Dataflow SDK for Java 1.x: as a workaround, you can enable this experiment with: --experiments=enable_custom_bigquery_sink
In Dataflow SDK for Java 2.x, this behavior is the default and no experiment is necessary.
Note that in both versions, temporary files in GCS may be left over if your job fails.
// Forces a reshard by pairing each element with a synthetic null value and
// grouping by key, which lets the service redistribute the data into larger
// pieces before the subsequent write.
public static class ForceGroupBy<T> extends PTransform<PCollection<T>, PCollection<KV<T, Iterable<Void>>>> {
    private static final long serialVersionUID = 1L;

    @Override
    public PCollection<KV<T, Iterable<Void>>> apply(PCollection<T> input) {
        PCollection<KV<T, Void>> syntheticGroup = input.apply(
            ParDo.of(new DoFn<T, KV<T, Void>>() {
                private static final long serialVersionUID = 1L;

                @Override
                public void processElement(DoFn<T, KV<T, Void>>.ProcessContext c)
                        throws Exception {
                    c.output(KV.of(c.element(), (Void) null));
                }
            }));
        return syntheticGroup.apply(GroupByKey.<T, Void>create());
    }
}
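A hedged usage sketch of the transform above (lines is a hypothetical PCollection<String>; note that GroupByKey needs a deterministic key coder, which holds for String but not for every element type):
PCollection<String> resharded = lines
    .apply(new ForceGroupBy<String>())
    .apply(ParDo.of(new DoFn<KV<String, Iterable<Void>>, String>() {
        private static final long serialVersionUID = 1L;

        @Override
        public void processElement(ProcessContext c) throws Exception {
            // Emit the key once per grouped occurrence so duplicates survive the reshard.
            for (Void ignored : c.element().getValue()) {
                c.output(c.element().getKey());
            }
        }
    }));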
