Suppose I create two output PCollections as a result of SideOutputs and depending on some condition I want to write only one of them to BigQuery. How to do this?
Basically my use case is that I'm trying to make Write_Append and Write_Truncate dynamic. I fetch the information(append/truncate) from a config table that I maintain in BigQuery. So depending on what I have in the config table I must apply Truncate or Append.
So using SideOutputs I was able to create two PCollections(Append and Truncate respectively) out of which one will be empty. And the one which has all the rows must be written to BigQuery. Is this approach correct?
The code that I'm using:
final TupleTag<TableRow> truncate =
new TupleTag<TableRow>(){};
// Additional output for rows that should be appended.
final TupleTag<TableRow> append =
new TupleTag<TableRow>(){};
PCollectionTuple results = read.apply("convert to table row",ParDo.of(new DoFn<String,TableRow>(){
@ProcessElement
public void processElement(ProcessContext c)
{
String value = c.sideInput(configView).get(0).toString();
LOG.info("config: "+value);
if(value.equals("truncate")){
LOG.info("outputting to truncate");
c.output(new TableRow().set("color", c.element()));
}
else
{
LOG.info("outputting to append");
c.output(append,new TableRow().set("color", c.element()));
}
//c.output(new TableRow().set("color", c.element()));
}
}).withSideInputs(configView).withOutputTags(truncate,
TupleTagList.of(append)));
results.get(truncate).apply("truncate",BigQueryIO.writeTableRows()
.to("projectid:datasetid.tableid")
.withSchema(schema)
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE)
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED));
results.get(append).apply("append",BigQueryIO.writeTableRows()
.to("projectid:datasetid.tableid")
.withSchema(schema)
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED));
I need to perform only one of the two. If I do both, the table is going to get truncated anyway.
P.S. I'm using Java SDK (Apache Beam 2.1)
I believe you are right: if your pipeline includes a write to a BigQuery table with WRITE_TRUNCATE at all, the table will currently get truncated even if there's no data. Feel free to file a JIRA to support more configurable behavior in this case.
So if you want it to conditionally not get truncated, you need to conditionally not include that write transform at all. Is there a way to push the condition to that level, or does the condition actually have to be computed from other data in the pipeline?
(the only workaround I can think of is to use DynamicDestinations to dynamically choose the name of the table to truncate, and truncate some other dummy empty table instead - I can elaborate on this more after your answer to the previous paragraph)
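If the append/truncate decision can be resolved before the pipeline graph is built (for example, by having the launcher program query the config table first), the choice can be made at construction time, so only one write transform is ever added. A minimal sketch in plain Java: the `Disposition` enum is a hypothetical stand-in for `BigQueryIO.Write.WriteDisposition`, and `fromConfig` is an illustrative helper, not part of Beam.

```java
import java.util.Locale;

class DispositionChooser {
    // Hypothetical stand-in for BigQueryIO.Write.WriteDisposition.
    enum Disposition { WRITE_APPEND, WRITE_TRUNCATE }

    // Maps the config-table value ("append" or "truncate") to a disposition.
    static Disposition fromConfig(String configValue) {
        String v = configValue.trim().toLowerCase(Locale.ROOT);
        switch (v) {
            case "truncate":
                return Disposition.WRITE_TRUNCATE;
            case "append":
                return Disposition.WRITE_APPEND;
            default:
                throw new IllegalArgumentException("Unknown config value: " + configValue);
        }
    }

    public static void main(String[] args) {
        // Assumption: in a real launcher this string is read from the config table
        // before Pipeline.create(...) is called.
        Disposition chosen = fromConfig("truncate");
        System.out.println("Single write would use: " + chosen);
    }
}
```

With the real Beam enum in place of the stand-in, the result would be passed to the single `.withWriteDisposition(...)` call, and the side outputs would no longer be needed.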
Trying to write two set of files using AvroIO.
I have a PCollection<KV<Item1, Item2>> and I want to write Item1s and Item2s in different set of files.
I want to split the shards in a way that fileItem1-XX-of-NN should contain the corresponding values in fileItem2-XX-of-NN.
We have a constraint on the number of elements in each shard file (say, 20,000 items in a single file).
There's no direct option to limit the number of records per shard, but a similar result can be achieved by setting the number of shards to 1 and then limiting the number of records per file.
One option is to use FinishBundle in a DoFn to limit the number of elements and then write them; you can see an example of this approach with Python here.
In addition, I think that a similar behavior can be achieved with GroupIntoBatches by creating batches with a fixed maximum number of records, and then using those results to write the files.
Note that both approaches will likely affect the performance of the pipeline, as the need to track the number of records will limit the parallelism.
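The batching idea itself can be illustrated outside of Beam. The sketch below is plain Java, not Beam code: `batches` and `shardName` are hypothetical helpers that show how a list is cut into chunks of at most N records and how the two matching file names for each chunk could be derived.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class BatchSketch {
    // Splits items into consecutive batches of at most maxSize elements.
    static <T> List<List<T>> batches(List<T> items, int maxSize) {
        if (maxSize <= 0) {
            throw new IllegalArgumentException("maxSize must be positive");
        }
        List<List<T>> out = new ArrayList<>();
        for (int start = 0; start < items.size(); start += maxSize) {
            int end = Math.min(start + maxSize, items.size());
            out.add(new ArrayList<>(items.subList(start, end)));
        }
        return out;
    }

    // Builds a name in the fileItemN-XX-of-NN style used in the question.
    static String shardName(String prefix, int index, int total) {
        return String.format(Locale.ROOT, "%s-%02d-of-%02d", prefix, index, total);
    }

    public static void main(String[] args) {
        List<Integer> items = new ArrayList<>();
        for (int i = 0; i < 45; i++) {
            items.add(i);
        }
        List<List<Integer>> result = batches(items, 20);
        for (int i = 0; i < result.size(); i++) {
            // Both files of a pair share the same shard index.
            System.out.println(shardName("fileItem1", i, result.size()) + " / "
                + shardName("fileItem2", i, result.size()) + " : "
                + result.get(i).size() + " records");
        }
    }
}
```

Because both names are derived from the same batch index, the Item1 and Item2 files for a given shard always describe the same records.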
Here's some code that you can use as a reference. I used the WordCount quickstart and modified it to write the Avro files. The example gives you a KV<String, Long> with the words counted, so I used it as if it were similar to your KV<item1,item2>.
static void runWordCount(WordCountOptions options) {
Pipeline p = Pipeline.create(options);
// Reading a file
p.apply("ReadLines", TextIO.read().from(options.getInputFile()))
.apply(new CountWords()) // Original data - KV<item1, item2>
.apply(WithKeys.of("1")) // Add artificial key - KV<1, KV<item1, item2>>
.apply(GroupIntoBatches.ofSize(1000)) // Group with max size of records per shard -> KV<1, [KV<item1, item2>, KV<item1, item2>, KV<item1, item2>]>
.apply(ParDo.of(new SplitForFiles())) // Split to write in files - KV<String, GenericRecord>
.setCoder(KvCoder.of(StringUtf8Coder.of(), AvroCoder.of(GenericRecord.class, getSchema())))
// Write dynamically using FileIO
.apply(FileIO.<String, KV<String, GenericRecord>> writeDynamic()
.by(elem -> elem.getKey()) // dest = key in the KV element
.via(Contextful.fn(elem -> elem.getValue()), // Get the Generic Record
Contextful.fn(dest -> AvroIO.<GenericRecord>sink(getSchema())))
.withNumShards(1) // Limit number of shards
.to(options.getOutput())
.withNaming(key -> FileIO.Write.defaultNaming("results-", key + ".avro"))
.withDestinationCoder(StringUtf8Coder.of()));
p.run().waitUntilFinish();
}
Here's the SplitForFiles DoFn, which gives item1 and item2 the same key prefix and different suffixes; these keys are later used to write the files:
static class SplitForFiles extends DoFn<KV<String, Iterable<KV<String, Long>>>, KV<String, GenericRecord>> {
@ProcessElement
public void processElement(@Element KV<String, Iterable<KV<String, Long>>> element, OutputReceiver<KV<String, GenericRecord>> receiver) {
String key = getCurrentTimeStamp();
Schema schema = getSchema();
for (KV<String,Long> record : element.getValue()) {
GenericRecord item1 = new GenericData.Record(schema);
item1.put("value", record.getKey()) ;
receiver.output(KV.of(key + "-01", item1));
GenericRecord item2 = new GenericData.Record(schema);
item2.put("value", record.getValue().toString()) ;
receiver.output(KV.of(key + "-02", item2));
}
}
}
The key uses the custom method getCurrentTimeStamp(), which retrieves a string with the timestamp up to milliseconds. This means that, in the unlikely event that two keys are generated in the same millisecond, the files might have more records than you were expecting. If this limit is critical, I suggest changing how the key is generated: you can use unique identifiers, or even combine it with the approach mentioned above of using FinishBundle.
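One way to rule out the collision is a random identifier per batch. A minimal sketch, assuming a `java.util.UUID` string is acceptable as the key prefix; `newBatchKey` is a hypothetical replacement for `getCurrentTimeStamp()`, not something from the original code:

```java
import java.util.UUID;

class BatchKeys {
    // Returns a random key prefix, so two batches created in the same
    // millisecond still end up under different keys.
    static String newBatchKey() {
        return UUID.randomUUID().toString();
    }

    public static void main(String[] args) {
        String key = newBatchKey();
        // Same suffix scheme as SplitForFiles: -01 for item1, -02 for item2.
        System.out.println(key + "-01");
        System.out.println(key + "-02");
    }
}
```

The trade-off is that file names no longer sort by creation time, which the timestamp-based key provided for free.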
At the end of the pipeline, you can see that I used FileIO to write Avro files with dynamic destinations.
This was a simplified example writing Avro files with the same schema, but you can also use it as a reference to write more complex pipelines, for example, writing to files with different schemas.
None of the provided Dataflow templates match what I need to do, so I'm trying to write my own. I managed to run example code such as the word count example without issue, so I tried to cobble together parts of separate examples that read from BigQuery and write to Spanner, but there are just so many things in the source code that I don't understand and cannot adapt to my own problem.
I'm REALLY lost on this and any help is greatly appreciated!
The goal is to use DataFlow and Apache Beam SDK to read from a BigQuery table with 3 string fields and 1 integer field, then concatenate the content of the 3 string fields into one string and put that new string in a new field called "key", then I want to write the key field and the integer field (which is unchanged) to a Spanner table that already exists, ideally append rows with a new key and update the integer field of rows with a key that already exists.
I'm trying to do this in Java because there is no I/O connector for Python. Any advice on doing this with Python is much appreciated.
For now I would be super happy if I could just read a table from BigQuery and write whatever I get from that table to a table in Spanner, but I can't even make that happen.
Problems:
I'm using Maven and I don't know what dependencies I need to put in the pom file
I don't know which package and import I need at the beginning of my java file
I don't know if I should use readTableRows() or read(SerializableFunction) to read from BigQuery
I have no idea how to access the string fields in the PCollection to concatenate them or how to make the new PCollection with only the key and integer field
I somehow need to make the PCollection into a Mutation to write to Spanner
I want to use an INSERT UPDATE query to write to the Spanner table, which doesn't seem to be an option in the Spanner i/o connector.
Honestly, I'm too embarrassed to even show that code I'm trying to run.
public class SimpleTransfer {
public static void main(String[] args) {
// Create and set your PipelineOptions.
DataflowPipelineOptions options = PipelineOptionsFactory.as(DataflowPipelineOptions.class);
// For Cloud execution, set the Cloud Platform project, staging location, and specify DataflowRunner.
options.setProject("myproject");
options.setStagingLocation("gs://mybucket");
options.setRunner(DataflowRunner.class);
// Create the Pipeline with the specified options.
Pipeline p = Pipeline.create(options);
String tableSpec = "database.mytable";
// read whole table from bigquery
rowsFromBigQuery =
p.apply(
BigQueryIO.readTableRows()
.from(tableSpec);
// Hopefully some day add a transform
// Somehow make a Mutation
PCollection<Mutation> mutation = rowsFromBigQuery;
// Only way I found to write to Spanner, not even sure if that works.
SpannerWriteResult result = mutation.apply(
SpannerIO.write().withInstanceId("myinstance").withDatabaseId("mydatabase").grouped());
p.run().waitUntilFinish();
}
}
It's intimidating to deal with these strange data types, but once you get used to the TableRow and Mutation types, you'll be able to code robust pipelines.
The first thing you need to do is take your PCollection of TableRows, and convert those into an intermediate format that is convenient for you. Let's use Beam's KV, which defines a key-value pair. In the following snippet, we're extracting the values from the TableRow, and concatenating the string you want:
rowsFromBigQuery
.apply(
MapElements.into(TypeDescriptors.kvs(TypeDescriptors.strings(),
TypeDescriptors.integers()))
.via(tableRow -> KV.of(
(String) tableRow.get("myKey1")
+ (String) tableRow.get("myKey2")
+ (String) tableRow.get("myKey3"),
(Integer) tableRow.get("myIntegerField"))))
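Depending on how the rows were read, the integer column may not arrive as an `Integer` (it can, for instance, come back as a `String`), in which case the cast above would fail at runtime. The sketch below shows a more defensive extraction in plain Java. It uses a `Map<String, Object>` as a stand-in for `TableRow`, and `buildKey` and `toLong` are hypothetical helpers; the field names are the placeholders from the snippet above.

```java
import java.util.HashMap;
import java.util.Map;

class RowFields {
    // Concatenates the three string fields into the new "key" value.
    static String buildKey(Map<String, Object> row) {
        return (String) row.get("myKey1")
            + (String) row.get("myKey2")
            + (String) row.get("myKey3");
    }

    // Accepts the integer field whether it arrives as a Number or as a String.
    static long toLong(Object value) {
        if (value instanceof Number) {
            return ((Number) value).longValue();
        }
        return Long.parseLong(String.valueOf(value));
    }

    public static void main(String[] args) {
        Map<String, Object> row = new HashMap<>();
        row.put("myKey1", "a");
        row.put("myKey2", "b");
        row.put("myKey3", "c");
        row.put("myIntegerField", "42"); // assumption: value delivered as text
        System.out.println(buildKey(row) + " -> " + toLong(row.get("myIntegerField")));
    }
}
```

If this variant is used inside the pipeline, the value type of the KV would be a long rather than an integer, so the type descriptor would need to change accordingly.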
Finally, to write to Spanner, we use Mutation-type objects, which define the kind of mutation that we want to apply to a row in Spanner. We'll do it with another MapElements transform, which takes N inputs, and returns N outputs. We define the insert or update mutations there:
myKvPairsPCollection
.apply(
MapElements.into(TypeDescriptor.of(Mutation.class))
.via(elm -> Mutation.newInsertOrUpdateBuilder("myTableName")
.set("key").to(elm.getKey())
.set("value").to(elm.getValue())
.build()));
And then you can pass the output of that to SpannerIO.write. The whole pipeline looks something like this:
Pipeline p = Pipeline.create(options);
String tableSpec = "database.mytable";
// read whole table from bigquery
PCollection<TableRow> rowsFromBigQuery =
p.apply(
BigQueryIO.readTableRows().from(tableSpec));
// Take in a TableRow, and convert it into a key-value pair
PCollection<Mutation> mutations = rowsFromBigQuery
// First we make the TableRows into the appropriate key-value
// pair of string key and integer.
.apply(
MapElements.into(TypeDescriptors.kvs(TypeDescriptors.strings(),
TypeDescriptors.integers()))
.via(tableRow -> KV.of(
(String) tableRow.get("myKey1")
+ (String) tableRow.get("myKey2")
+ (String) tableRow.get("myKey3"),
(Integer) tableRow.get("myIntegerField"))))
// Now we construct the mutations
.apply(
MapElements.into(TypeDescriptor.of(Mutation.class))
.via(elm -> Mutation.newInsertOrUpdateBuilder("myTableName")
.set("key").to(elm.getKey())
.set("value").to(elm.getValue())
.build()));
// Now we pass the mutations to spanner
SpannerWriteResult result = mutations.apply(
SpannerIO.write()
.withInstanceId("myinstance")
.withDatabaseId("mydatabase"));
p.run().waitUntilFinish();
}
I recently started checking new Java 8 features.
I've come across the forEach method, which iterates over a Collection.
Say I have an ArrayList of type <Integer> with the values {1,2,3,4,5}:
list.forEach(i -> System.out.println(i));
This statement iterates over the list and prints the values inside it.
I'd like to know how to specify that I want it to iterate over some specific values only.
For example, I want it to start from the 2nd value and iterate up to the 2nd-to-last value, or iterate over alternate elements.
How am I going to do that?
To iterate on a section of the original list, use the subList method:
list.subList(1, list.size()-1)
.stream() // This line is optional since List already has a forEach method taking a Consumer as parameter
.forEach(...);
This is the concept of streams. After one operation, the results of that operation become the input for the next.
So for your specific example, you can follow @Joni's suggestion. But if you're asking in general, then you can create a filter to only get the values you want to loop over.
For example, if you only wanted to print the even numbers, you could create a filter on the streams before you forEached them. Like this:
List<Integer> intList = Arrays.asList(1,2,3,4,5);
intList.stream()
.filter(e -> (e & 1) == 0)
.forEach(System.out::println);
You can similarly pick out the stuff you want to loop over before reaching your terminal operation (in your case the forEach) on the stream. I suggest you read this stream tutorial to get a better idea of how they work: http://winterbe.com/posts/2014/07/31/java8-stream-tutorial-examples/
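Both cases from the question (second through second-to-last, and alternate elements) can be put side by side. This is a small sketch using only the JDK; `inner` and `alternate` are hypothetical helper names, and filtering on the index rather than the value is one possible way to pick alternate positions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

class PartialIteration {
    // Elements from the second through the second-to-last.
    static <T> List<T> inner(List<T> list) {
        if (list.size() < 2) {
            return new ArrayList<>(); // subList(1, size - 1) would be invalid here
        }
        return new ArrayList<>(list.subList(1, list.size() - 1));
    }

    // Every second element, starting with the first (indices 0, 2, 4, ...).
    static <T> List<T> alternate(List<T> list) {
        return IntStream.range(0, list.size())
            .filter(i -> i % 2 == 0)
            .mapToObj(list::get)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Integer> list = Arrays.asList(1, 2, 3, 4, 5);
        inner(list).forEach(System.out::println);     // prints 2, 3, 4
        alternate(list).forEach(System.out::println); // prints 1, 3, 5
    }
}
```

Filtering on the index keeps the selection independent of the values stored in the list, unlike the even-number filter above, which inspects the values themselves.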
Now, I have the below code:
PCollection<String> input_data =
pipeline
.apply(PubsubIO
.Read
.withCoder(StringUtf8Coder.of())
.named("ReadFromPubSub")
.subscription("/subscriptions/project_name/subscription_name"));
Looks like you want to read some messages from pubsub and convert each of them to multiple parts by splitting a message on space characters, and then feed the parts to the rest of your pipeline. No special configuration of PubsubIO is needed, because it's not a "reading data" problem - it's a "transforming data you have already read" problem - you simply need to insert a ParDo which takes your "composite" record and breaks it down in the way you want, e.g.:
PCollection<String> input_data =
pipeline
.apply(PubsubIO
.Read
.withCoder(StringUtf8Coder.of())
.named("ReadFromPubSub")
.subscription("/subscriptions/project_name/subscription_name"))
.apply(ParDo.of(new DoFn<String, String>() {
public void processElement(ProcessContext c) {
String composite = c.element();
for (String part : composite.split(" ")) {
c.output(part);
}
}}));
I take it you mean that the data you want is present in different elements of the PCollection and want to extract and group it somehow.
A possible approach is to write a DoFn function that processes each String in the PCollection. You output a key value pair for each piece of data you want to group. You can then use the GroupByKey transform to group all the relevant data together.
For example you have the following messages from pubsub in your PCollection:
User 1234 bought item A
User 1234 bought item B
The DoFn function will output a key-value pair with the user id as key and the item bought as value: <1234, A>, <1234, B>.
Using the GroupByKey transform you group the two values together in one element. You can then perform further processing on that element.
This is a very common pattern in big data called MapReduce.
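The key/value extraction and grouping can be illustrated without a pipeline. The following is a plain-Java analogue, not Beam code: `parse` plays the role of the DoFn and `groupByUser` plays the role of GroupByKey. Both names are hypothetical, and the message format is assumed to be exactly the one shown above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class GroupPurchases {
    // Parses "User <id> bought item <name>" into {id, name}.
    static String[] parse(String message) {
        String[] parts = message.trim().split("\\s+");
        if (parts.length != 5) {
            throw new IllegalArgumentException("Unexpected message: " + message);
        }
        return new String[] { parts[1], parts[4] };
    }

    // Collects all items bought by the same user under one key.
    static Map<String, List<String>> groupByUser(List<String> messages) {
        Map<String, List<String>> grouped = new LinkedHashMap<>();
        for (String message : messages) {
            String[] kv = parse(message);
            grouped.computeIfAbsent(kv[0], k -> new ArrayList<>()).add(kv[1]);
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<String> messages = Arrays.asList(
            "User 1234 bought item A",
            "User 1234 bought item B");
        System.out.println(groupByUser(messages)); // prints {1234=[A, B]}
    }
}
```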
You can output an Iterable<A> and then use Flatten to squash it. Unsurprisingly, this is termed flatMap in many next-gen data processing platforms, cf. Spark / Flink.
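For readers unfamiliar with the term, the same one-to-many step exists in the JDK's own stream API. This is a sketch with `java.util.stream`, shown only as an analogy to the pipeline transform; `splitAll` is a hypothetical helper name.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class FlatMapSketch {
    // Turns each message into several parts and flattens them into one list.
    static List<String> splitAll(List<String> messages) {
        return messages.stream()
            .flatMap(message -> Arrays.stream(message.split(" ")))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(splitAll(Arrays.asList("a b", "c"))); // prints [a, b, c]
    }
}
```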
I know of:
http://lua-users.org/wiki/SimpleLuaApiExample
It shows me how to build up a table (key, value) pair entry by entry.
Suppose instead I want to build a gigantic table (say, a 1000-entry table where both key and value are strings). Is there a fast way to do this in Lua, rather than 4 function calls per entry:
push
key
value
rawset
What you have written is the fast way to solve this problem. Lua tables are brilliantly engineered, and fast enough that there is no need for some kind of bogus "hint" to say "I expect this table to grow to contain 1000 elements."
For string keys, you can use lua_setfield.
Unfortunately, for associative tables (string keys, non-consecutive-integer keys), no, there is not.
For array-type tables (where the regular 1...N integer indexing is being used), there are some performance-optimized functions, lua_rawgeti and lua_rawseti: http://www.lua.org/pil/27.1.html
You can use lua_createtable to create a table that already has the required number of slots. However, after that, there is no faster way to fill it than
for(int i = 0; i < 1000; i++) {
lua_push... // key
lua_push... // value
lua_rawset(L, tableindex);
}