I have a use case where the end goal is to make a REST call with the transformed data in an Apache Beam program. If a record in a bundle fails due to a connection or read timeout error, how can I reprocess only the failed records rather than reprocessing the entire bundle containing that record?
A single transform can have multiple outputs. So, for your case, you can emit the failed records into a dedicated "dead letter" PCollection and process it separately. See the example below:
final TupleTag<String> successElms = new TupleTag<String>(){};
final TupleTag<String> failedElms = new TupleTag<String>(){};

PCollectionTuple mixedCollection =
    dbRowCollection.apply(ParDo
        .of(new DoFn<String, String>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            RestResult res = runRestCall(c.element());
            if (res.success()) {
              // Emit to the main output, which collects the successful elements.
              c.output(c.element());
            } else {
              // Emit to the output for failed elements.
              c.output(failedElms, c.element());
            }
          }
        })
        .withOutputTags(successElms,
            // Specify the other outputs as a TupleTagList.
            TupleTagList.of(failedElms)));

// Get the subset of the output containing the failed elements.
mixedCollection.get(failedElms).apply(ProcessFailedElms.create());
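Here ProcessFailedElms.create() stands in for whatever recovery logic you need. As a minimal sketch of one option (the retry count and backoff values are illustrative assumptions, reusing runRestCall from above), the dead-letter branch could simply retry the REST call per element:

// A hedged sketch: retry only the failed elements, not the whole bundle.
mixedCollection.get(failedElms).apply("Retry_Failed_Elms", ParDo.of(
    new DoFn<String, String>() {
      private static final int MAX_RETRIES = 3; // illustrative assumption

      @ProcessElement
      public void processElement(ProcessContext c) throws InterruptedException {
        for (int attempt = 1; attempt <= MAX_RETRIES; attempt++) {
          RestResult res = runRestCall(c.element());
          if (res.success()) {
            c.output(c.element());
            return;
          }
          // Simple linear backoff between attempts.
          Thread.sleep(1000L * attempt);
        }
        // Still failing after MAX_RETRIES: log or persist for manual inspection.
      }
    }));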
I have a Beam pipeline that starts off by reading multiple text files, where each line in a file represents a row that gets inserted into Bigtable later in the pipeline. The scenario requires confirming that the count of rows extracted from each file and the count of rows later inserted into Bigtable match. For this I am planning to develop a custom windowing strategy so that lines from a single file get assigned to a single window, based on the file name as the key that will be passed to the windowing function.
Is there any code sample for creating custom Windowing functions?
Although I changed my strategy for confirming the inserted number of rows, for anyone who is interested in windowing elements read from a batch source (e.g. FileIO) in a batch job, here's the code for creating a custom windowing strategy:
public class FileWindows extends PartitioningWindowFn<Object, IntervalWindow> {

  private static final long serialVersionUID = -476922142925927415L;
  private static final Logger LOG = LoggerFactory.getLogger(FileWindows.class);

  @Override
  public IntervalWindow assignWindow(Instant timestamp) {
    Instant end = new Instant(timestamp.getMillis() + 1);
    IntervalWindow interval = new IntervalWindow(timestamp, end);
    LOG.info("FileWindows >> assignWindow(): Window assigned with Start: {}, End: {}", timestamp, end);
    return interval;
  }

  @Override
  public boolean isCompatible(WindowFn<?, ?> other) {
    return this.equals(other);
  }

  @Override
  public void verifyCompatibility(WindowFn<?, ?> other) throws IncompatibleWindowException {
    if (!this.isCompatible(other)) {
      throw new IncompatibleWindowException(other, String.format("Only %s objects are compatible.", FileWindows.class.getSimpleName()));
    }
  }

  @Override
  public Coder<IntervalWindow> windowCoder() {
    return IntervalWindow.getCoder();
  }
}
and then it can be used in the pipeline as below:
p
    .apply("Assign_Timestamp_to_Each_Message", ParDo.of(new AssignTimestampFn()))
    .apply("Assign_Window_to_Each_Message", Window.<KV<String, String>>into(new FileWindows())
        .withAllowedLateness(Duration.standardMinutes(1))
        .discardingFiredPanes());
Please keep in mind that you will need to write AssignTimestampFn() yourself so that each message carries a timestamp.
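For illustration, here is a minimal sketch of such a fn, assuming the elements are KV<String, String> pairs keyed by file name (as in the pipeline above); deriving the timestamp from a hash of the file name is an assumption of this sketch, not part of the original answer:

// A hedged sketch of AssignTimestampFn: map each file name to a stable,
// distinct instant so that all lines of one file fall into the same
// one-millisecond FileWindows window.
static class AssignTimestampFn extends DoFn<KV<String, String>, KV<String, String>> {
  @ProcessElement
  public void processElement(ProcessContext c) {
    long millis = Math.abs(c.element().getKey().hashCode());
    c.outputWithTimestamp(c.element(), new Instant(millis));
  }
}

Note that hash collisions between file names would merge two files into one window, so a production version would need a collision-free mapping from file name to timestamp.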
Is there a way to apply a side input to a BigQueryIO.read() operation in Apache Beam?
Say for example I have a value in a PCollection that I want to use in a query to fetch data from a BigQuery table. Is this possible using side input? Or should something else be used in such a case?
I used NestedValueProvider in a similar case, but I guess we can use that only when a certain value depends on my runtime value. Or can I use the same thing here? Please correct me if I'm wrong.
The code that I tried:
Bigquery bigQueryClient = start_pipeline.newBigQueryClient(options.as(BigQueryOptions.class)).build();
Tabledata tableRequest = bigQueryClient.tabledata();

PCollection<TableRow> existingData = readData.apply("Read existing data", ParDo.of(new DoFn<String, TableRow>() {
  @ProcessElement
  public void processElement(ProcessContext c) throws IOException {
    List<TableRow> list = c.sideInput(bqDataView);
    String tableName = list.get(0).get("table").toString();
    TableDataList table = tableRequest.list("projectID", "DatasetID", tableName).execute();
    for (TableRow row : table.getRows()) {
      c.output(row);
    }
  }
}).withSideInputs(bqDataView));
The error that I get is:
Exception in thread "main" java.lang.IllegalArgumentException: unable to serialize BeamTest.StarterPipeline$1@86b455
at org.apache.beam.sdk.util.SerializableUtils.serializeToByteArray(SerializableUtils.java:53)
at org.apache.beam.sdk.util.SerializableUtils.clone(SerializableUtils.java:90)
at org.apache.beam.sdk.transforms.ParDo$SingleOutput.<init>(ParDo.java:569)
at org.apache.beam.sdk.transforms.ParDo.of(ParDo.java:434)
at BeamTest.StarterPipeline.main(StarterPipeline.java:158)
Caused by: java.io.NotSerializableException: com.google.api.services.bigquery.Bigquery$Tabledata
at java.io.ObjectOutputStream.writeObject0(Unknown Source)
at java.io.ObjectOutputStream.defaultWriteFields(Unknown Source)
at java.io.ObjectOutputStream.writeSerialData(Unknown Source)
at java.io.ObjectOutputStream.writeOrdinaryObject(Unknown Source)
at java.io.ObjectOutputStream.writeObject0(Unknown Source)
at java.io.ObjectOutputStream.writeObject(Unknown Source)
at org.apache.beam.sdk.util.SerializableUtils.serializeToByteArray(SerializableUtils.java:49)
... 4 more
The Beam model does not currently support this kind of data-dependent operation very well.
A way of doing it is to code your own DoFn that receives the side input and connects directly to BQ. Unfortunately, this would not give you any parallelism, as the DoFn would run completely on the same thread.
Once Splittable DoFns are supported in Beam, this will be a different story.
In the current state of the world, you would need to use the BQ client library to add code that would query BQ as if you were not in a Beam pipeline.
Given the code in your question, a rough idea on how to implement this is the following:
class ReadDataDoFn extends DoFn<String, TableRow> {
  private Tabledata tableRequest;
  private Bigquery bigQueryClient;

  private Bigquery createBigQueryClientWithinDoFn() {
    // I'm not sure how you'd implement this, but you had the right idea
    return null; // placeholder
  }

  @Setup
  public void setup() {
    bigQueryClient = createBigQueryClientWithinDoFn();
    tableRequest = bigQueryClient.tabledata();
  }

  @ProcessElement
  public void processElement(ProcessContext c) throws IOException {
    List<TableRow> list = c.sideInput(bqDataView);
    String tableName = list.get(0).get("table").toString();
    TableDataList table = tableRequest.list("projectID", "DatasetID", tableName).execute();
    for (TableRow row : table.getRows()) {
      c.output(row);
    }
  }
}

// Note: the side input must also be registered on the ParDo.
PCollection<TableRow> existingData = readData.apply("Read existing data", ParDo.of(new ReadDataDoFn()).withSideInputs(bqDataView));
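For reference, bqDataView above is a side-input view built elsewhere in the pipeline. A minimal sketch of how it might be created, assuming a hypothetical configData PCollection<TableRow> that carries the table name:

// A hedged sketch: build the List side-input view consumed by ReadDataDoFn.
// configData is an assumed PCollection<TableRow> with a "table" field.
PCollectionView<List<TableRow>> bqDataView = configData.apply(View.<TableRow>asList());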
We are trying to run a daily Dataflow pipeline that reads off Bigtable and dumps data into GCS (using HBase's Scan and HBaseResultCoder as the coder) as follows (just to highlight the idea):
Pipeline pipeline = Pipeline.create(options);
Scan scan = new Scan();
scan.setCacheBlocks(false).setMaxVersions(1);
scan.addFamily(Bytes.toBytes("f"));
CloudBigtableScanConfiguration btConfig = new CloudBigtableScanConfiguration.Builder()
    .withProjectId("aaa").withInstanceId("bbb").withTableId("ccc").withScan(scan).build();
pipeline.apply(Read.from(CloudBigtableIO.read(btConfig)))
    .apply(TextIO.Write.to("gs://bucket/dir/file").withCoder(HBaseResultCoder.getInstance()));
pipeline.run();
This seems to run perfectly as expected.
Now, we want to be able to use the dumped file in GCS for a recovery job if needed. That is, we want to have a Dataflow pipeline which reads the dumped data (a PCollection<Result>) from GCS and creates Mutations ('Put' objects, basically). For some reason, the following code fails with a bunch of NullPointerExceptions. We are unsure why that would be the case; the if-statements below, which check for null or zero-length arrays, were added to see if that would make a difference, but it did not.
// Part of DoFn<Result, Mutation>
@Override
public void processElement(ProcessContext c) {
  Result result = c.element();
  byte[] row = result.getRow();
  if (row == null || row.length == 0) { // NullPointerException at this line
    return;
  }
  Put mutation = new Put(result.getRow());
  // Go through the column/value entries of this row, and create a corresponding put mutation.
  for (Entry<byte[], byte[]> entry : result.getFamilyMap(Bytes.toBytes(cf)).entrySet()) {
    byte[] qualifier = entry.getKey();
    if (qualifier == null || qualifier.length == 0) {
      continue;
    }
    byte[] val = entry.getValue();
    if (val == null || val.length == 0) {
      continue;
    }
    mutation.addImmutable(cf_bytes, qualifier, entry.getValue());
  }
  c.output(mutation);
}
The error we get is the following (line 83 is marked above):
(2a6ad6372944050d): java.lang.NullPointerException at some.package.RecoveryFromGcs$CreateMutationFromResult.processElement(RecoveryFromGcs.java:83)
I have two questions:
1. Has anyone experienced something like this when applying a ParDo to a PCollection<Result> to get a PCollection<Mutation> that is to be written to Bigtable?
2. Is this a reasonable approach? The end goal is to keep a daily snapshot of our Bigtable (for a specific column family) as a backup, in case something bad happens. We wish to be able to read the backup data via Dataflow and write it to Bigtable when we need to.
Any suggestions and help will be really appreciated!
-------- Edit
Here is the code that scans Bigtable and dumps data to GCS:
(Some details are hidden if they are not relevant.)
public static void execute(Options options) {
  Pipeline pipeline = Pipeline.create(options);
  final String cf = "f"; // some specific column family.

  Scan scan = new Scan();
  scan.setCacheBlocks(false).setMaxVersions(1); // Disable caching and read only the latest cell.
  scan.addFamily(Bytes.toBytes(cf));

  CloudBigtableScanConfiguration btConfig =
      BigtableUtils.getCloudBigtableScanConfigurationBuilder(options.getProject(), "some-bigtable-name").withScan(scan).build();

  PCollection<Result> result = pipeline.apply(Read.from(CloudBigtableIO.read(btConfig)));
  PCollection<Mutation> mutation =
      result.apply(ParDo.of(new CreateMutationFromResult(cf))).setCoder(new HBaseMutationCoder());
  mutation.apply(TextIO.Write.to("gs://path-to-files").withCoder(new HBaseMutationCoder()));

  pipeline.run();
}
The job that reads the output of the above code has the following code:
(This is the one throwing exception when reading from GCS)
public static void execute(Options options) {
  Pipeline pipeline = Pipeline.create(options);
  PCollection<Mutation> mutations = pipeline.apply(TextIO.Read
      .from("gs://path-to-files").withCoder(new HBaseMutationCoder()));

  CloudBigtableScanConfiguration config =
      BigtableUtils.getCloudBigtableScanConfigurationBuilder(options.getProject(), btTarget).build();
  if (config != null) {
    CloudBigtableIO.initializeForWrite(pipeline);
    mutations.apply(CloudBigtableIO.writeToTable(config));
  }
  pipeline.run();
}
The error I am getting (https://jpst.it/Qr6M) is a bit confusing, as the mutations are all Put objects, but the error is about a 'Delete' object.
It's probably best to discuss this issue on the Cloud Bigtable client GitHub issues page. We are currently working on import/export features like this one, so we'll respond quickly. We'll also explore this approach on our own, even if you don't add the GitHub issue; the GitHub issue will allow us to communicate better.
FWIW, I don't understand how you could get an NPE on the line you highlighted. Are you sure you have the right line?
EDIT (12/12):
The following processElement() method should work to convert a Result to a Put:
@Override
public void processElement(DoFn<Result, Mutation>.ProcessContext c) throws Exception {
  Result result = c.element();
  byte[] row = result.getRow();
  if (row != null && row.length > 0) {
    Put put = new Put(row);
    for (Cell cell : result.rawCells()) {
      put.add(cell);
    }
    c.output(put);
  }
}
We have a problem making the asList() method return a sorted list.
We thought we could do this by just extending the View class and overriding the asList method, but realized that the View class has a private constructor, so we could not do this.
Our other attempt was to fork the Google Dataflow code on GitHub and modify the PCollectionViews class to return a sorted list by using the Collections.sort method, as shown in the code snippet below:
@Override
protected List<T> fromElements(Iterable<WindowedValue<T>> contents) {
  Iterable<T> itr = Iterables.transform(
      contents,
      new Function<WindowedValue<T>, T>() {
        @SuppressWarnings("unchecked")
        @Override
        public T apply(WindowedValue<T> input) {
          return input.getValue();
        }
      });

  LOG.info("#### About to start sorting the list!");
  List<T> tempList = new ArrayList<T>();
  for (T element : itr) {
    tempList.add(element);
  }
  Collections.sort((List<? extends Comparable>) tempList);
  LOG.info("##### List should now be sorted!");
  return ImmutableList.copyOf(tempList);
}
Note that we are now sorting the list.
This seemed to work when run with the DirectPipelineRunner, but when we tried the BlockingDataflowPipelineRunner, it didn't seem like the code change was being executed.
Note: We actually recompiled the Dataflow SDK and used it in our project, but this did not work.
How can we achieve this (a sorted list from the asList method call)?
The classes in PCollectionViews are not intended for extension. Only the primitive view types provided by View.asSingleton, View.asList, View.asIterable, View.asMap, and View.asMultimap are supported.
To obtain a sorted list from a PCollectionView, you'll need to sort it after you have read it. The following code demonstrates the pattern.
// Assume you have some PCollection
PCollection<MyComparable> myPC = ...;

// Prepare it for side input as a list
final PCollectionView<List<MyComparable>> myView = myPC.apply(View.asList());

// Side input the list and sort it
someOtherValue.apply(
    ParDo.withSideInputs(myView).of(
        new DoFn<A, B>() {
          @Override
          public void processElement(ProcessContext ctx) {
            List<MyComparable> tempList =
                Lists.newArrayList(ctx.sideInput(myView));
            Collections.sort(tempList);
            // do whatever you want with the sorted list
          }
        }));
Of course, you may not want to sort it repeatedly, depending on the cost of sorting vs the cost of materializing it as a new PCollection, so you can output this value and read it as a new side input without difficulty:
// Side input the list, sort it, and put it in a PCollection
PCollection<List<MyComparable>> sortedSingleton = Create.<Void>of(null).apply(
    ParDo.withSideInputs(myView).of(
        new DoFn<Void, List<MyComparable>>() {
          @Override
          public void processElement(ProcessContext ctx) {
            List<MyComparable> tempList =
                Lists.newArrayList(ctx.sideInput(myView));
            Collections.sort(tempList);
            ctx.output(tempList);
          }
        }));

// Prepare it for side input as a singleton
final PCollectionView<List<MyComparable>> sortedView =
    sortedSingleton.apply(View.asSingleton());

someOtherValue.apply(
    ParDo.withSideInputs(sortedView).of(
        new DoFn<A, B>() {
          @Override
          public void processElement(ProcessContext ctx) {
            ... ctx.sideInput(sortedView) ...
            // do whatever you want with the sorted list
          }
        }));
You may also be interested in the unsupported sorter contrib module for doing larger sorts using both memory and local disk.
We tried to do it the way Ken Knowles suggested. There's a problem for large datasets: if the tempList is large (so sorting takes measurable time; it's O(n log n)) and there are millions of elements in the "someOtherValue" PCollection, then we are unnecessarily re-sorting the same list millions of times. We should be able to sort ONCE and FIRST, before passing the list to the someOtherValue.apply's DoFn.
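One way to avoid the repeated sort without materializing a new PCollection, as a minimal sketch (assuming a single window, e.g. the global window, so every element a DoFn instance sees gets the same side-input list), is to cache the sorted copy in a transient field:

// A hedged sketch: sort the side input once per DoFn instance, not per element.
someOtherValue.apply(
    ParDo.withSideInputs(myView).of(
        new DoFn<A, B>() {
          private transient List<MyComparable> sorted; // lazily built cache

          @Override
          public void processElement(ProcessContext ctx) {
            if (sorted == null) {
              sorted = Lists.newArrayList(ctx.sideInput(myView));
              Collections.sort(sorted);
            }
            // use the cached sorted list here
          }
        }));

With multiple windows this caching would be incorrect, since the side input can differ per window; in that case the materialize-then-side-input pattern above is safer.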
There are some database operations I need to execute before the end of the final attempt of my Hangfire background job (I need to delete the database record related to the job).
My current job is set with the following attribute:
[AutomaticRetry(Attempts = 5, OnAttemptsExceeded = AttemptsExceededAction.Delete)]
With that in mind, I need to determine what the current attempt number is, but I am struggling to find any documentation in that regard in the Hangfire.io documentation or via a Google search.
Simply add PerformContext to your job method; you'll also be able to access your JobId from this object. For the attempt number, this still relies on a magic string, but it's a little less flaky than the other answer:
public void SendEmail(PerformContext context, string emailAddress)
{
string jobId = context.BackgroundJob.Id;
int retryCount = context.GetJobParameter<int>("RetryCount");
// send an email
}
(NB! This is a solution to the OP's problem. It does not answer the question "How to get the current attempt number". If that is what you want, see the accepted answer for instance)
Use a job filter and the OnStateApplied callback:
public class CleanupAfterFailureFilter : JobFilterAttribute, IServerFilter, IApplyStateFilter
{
public void OnStateApplied(ApplyStateContext context, IWriteOnlyTransaction transaction)
{
try
{
var failedState = context.NewState as FailedState;
if (failedState != null)
{
// Job has finally failed (retry attempts exceeded)
// *** DO YOUR CLEANUP HERE ***
}
}
catch (Exception)
{
// Unhandled exceptions can cause an endless loop.
// Therefore, catch and ignore them all.
// See notes below.
}
}
public void OnStateUnapplied(ApplyStateContext context, IWriteOnlyTransaction transaction)
{
// Must be implemented, but can be empty.
}
}
Add the filter directly to the job function:
[CleanupAfterFailureFilter]
public static void MyJob()
or add it globally:
GlobalJobFilters.Filters.Add(new CleanupAfterFailureFilter ());
or like this:
var options = new BackgroundJobServerOptions
{
    FilterProvider = new JobFilterCollection { new CleanupAfterFailureFilter() }
};
app.UseHangfireServer(options, storage);
app.UseHangfireServer(options, storage);
Or see http://docs.hangfire.io/en/latest/extensibility/using-job-filters.html for more information about job filters.
NOTE: This is based on the accepted answer: https://stackoverflow.com/a/38387512/2279059
The difference is that OnStateApplied is used instead of OnStateElection, so the filter callback is invoked only after the maximum number of retries. A downside to this method is that the state transition to "failed" cannot be interrupted, but this is not needed in this case and in most scenarios where you just want to do some cleanup after a job has failed.
NOTE: Empty catch handlers are bad, because they can hide bugs and make them hard to debug in production. It is necessary here, so the callback doesn't get called repeatedly forever. You may want to log exceptions for debugging purposes. It is also advisable to reduce the risk of exceptions in a job filter. One possibility is, instead of doing the cleanup work in-place, to schedule a new background job which runs if the original job failed. Be careful to not apply the filter CleanupAfterFailureFilter to it, though. Don't register it globally, or add some extra logic to it...
You can use the OnPerforming or OnPerformed method of IServerFilter if you want to check the attempts, or you can just wait on OnStateElection of IElectStateFilter. I don't know exactly what requirement you have, so it's up to you. Here's the code you want :)
public class JobStateFilter : JobFilterAttribute, IElectStateFilter, IServerFilter
{
    public void OnStateElection(ElectStateContext context)
    {
        // All jobs that have finally failed after their retry attempts come through here.
        var failedState = context.CandidateState as FailedState;
        if (failedState == null) return;
    }

    public void OnPerforming(PerformingContext filterContext)
    {
        // Do nothing.
    }

    public void OnPerformed(PerformedContext filterContext)
    {
        // You could also move all of this code into OnPerforming if you prefer.
        var api = JobStorage.Current.GetMonitoringApi();
        var job = api.JobDetails(filterContext.BackgroundJob.Id);
        foreach (var history in job.History)
        {
            // Check the Reason property; you will find a string like:
            // "Retry attempt 3 of 3: The method or operation is not implemented."
        }
    }
}
How to add your filter
GlobalJobFilters.Filters.Add(new JobStateFilter());
----- or
var options = new BackgroundJobServerOptions
{
    FilterProvider = new JobFilterCollection { new JobStateFilter() }
};
app.UseHangfireServer(options, storage);
app.UseHangfireServer(options, storage);
Sample output: