From Bigtable To GCS (and vice versa) via Dataflow - google-cloud-dataflow

We are trying to run a daily Dataflow pipeline that reads from Bigtable and dumps the data into GCS (using HBase's Scan and HBaseResultCoder as the coder), as follows (simplified to highlight the idea):
Pipeline pipeline = Pipeline.create(options);
Scan scan = new Scan();
scan.setCacheBlocks(false).setMaxVersions(1);
scan.addFamily(Bytes.toBytes("f"));
CloudBigtableScanConfiguration btConfig = new CloudBigtableScanConfiguration.Builder()
    .withProjectId("aaa")
    .withInstanceId("bbb")
    .withTableId("ccc")
    .withScan(scan)
    .build();
pipeline.apply(Read.from(CloudBigtableIO.read(btConfig)))
    .apply(TextIO.Write.to("gs://bucket/dir/file").withCoder(HBaseResultCoder.getInstance()));
pipeline.run();
This seems to run perfectly as expected.
Now we want to be able to use the dumped file in GCS for a recovery job if needed. That is, we want a Dataflow pipeline that reads the dumped data (a PCollection&lt;Result&gt;) from GCS and creates Mutations ('Put' objects, basically). For some reason, the following code fails with a bunch of NullPointerExceptions. We are unsure why that would be the case; the if-statements below, which check for null or zero-length byte arrays, were added to see if that would make a difference, but it did not.
// Part of DoFn<Result, Mutation>
@Override
public void processElement(ProcessContext c) {
    Result result = c.element();
    byte[] row = result.getRow();
    if (row == null || row.length == 0) { // NullPointerException at this line
        return;
    }
    Put mutation = new Put(result.getRow());
    // Go through the column/value entries of this row, and create a corresponding put mutation.
    for (Entry<byte[], byte[]> entry : result.getFamilyMap(Bytes.toBytes(cf)).entrySet()) {
        byte[] qualifier = entry.getKey();
        if (qualifier == null || qualifier.length == 0) {
            continue;
        }
        byte[] val = entry.getValue();
        if (val == null || val.length == 0) {
            continue;
        }
        mutation.addImmutable(cf_bytes, qualifier, entry.getValue());
    }
    c.output(mutation);
}
The error we get is the following (line 83 is marked above):
(2a6ad6372944050d): java.lang.NullPointerException at some.package.RecoveryFromGcs$CreateMutationFromResult.processElement(RecoveryFromGcs.java:83)
I have two questions:
1. Has someone experienced something like this when running a ParDo on a PCollection&lt;Result&gt; to get a PCollection&lt;Mutation&gt; that is then written to Bigtable?
2. Is this a reasonable approach? The end goal is to keep a daily snapshot of our Bigtable (for a specific column family) as a backup in case something bad happens. We want to be able to read the backup data via Dataflow and write it back to Bigtable when we need to.
Any suggestions and help will be really appreciated!
-------- Edit
Here is the code that scans Bigtable and dumps data to GCS:
(Some details are hidden if they are not relevant.)
public static void execute(Options options) {
    Pipeline pipeline = Pipeline.create(options);
    final String cf = "f"; // some specific column family.
    Scan scan = new Scan();
    scan.setCacheBlocks(false).setMaxVersions(1); // Disable caching and read only the latest cell.
    scan.addFamily(Bytes.toBytes(cf));
    CloudBigtableScanConfiguration btConfig =
        BigtableUtils.getCloudBigtableScanConfigurationBuilder(options.getProject(), "some-bigtable-name")
            .withScan(scan)
            .build();
    PCollection<Result> result = pipeline.apply(Read.from(CloudBigtableIO.read(btConfig)));
    PCollection<Mutation> mutation =
        result.apply(ParDo.of(new CreateMutationFromResult(cf))).setCoder(new HBaseMutationCoder());
    mutation.apply(TextIO.Write.to("gs://path-to-files").withCoder(new HBaseMutationCoder()));
    pipeline.run();
}
The job that reads back the output of the above code is the following:
(This is the one throwing the exception when reading from GCS.)
public static void execute(Options options) {
    Pipeline pipeline = Pipeline.create(options);
    PCollection<Mutation> mutations = pipeline.apply(TextIO.Read
        .from("gs://path-to-files").withCoder(new HBaseMutationCoder()));
    CloudBigtableScanConfiguration config =
        BigtableUtils.getCloudBigtableScanConfigurationBuilder(options.getProject(), btTarget).build();
    if (config != null) {
        CloudBigtableIO.initializeForWrite(pipeline);
        mutations.apply(CloudBigtableIO.writeToTable(config));
    }
    pipeline.run();
}
The error I am getting (https://jpst.it/Qr6M) is a bit confusing, as the mutations are all Put objects but the error is about a 'Delete' object.

It's probably best to discuss this issue on the Cloud Bigtable client GitHub issues page. We are currently working on import/export features like this one, so we'll respond quickly. We'll also explore this approach on our own even if you don't open a GitHub issue, but the issue will allow us to communicate better.
FWIW, I don't understand how you could get an NPE on the line you highlighted. Are you sure you have the right line?
EDIT (12/12):
The following processElement() method should work to convert a Result to a Put (the Cell objects returned by rawCells() already carry the family, qualifier, timestamp and value, so they can be added to the Put directly):
@Override
public void processElement(DoFn<Result, Mutation>.ProcessContext c) throws Exception {
    Result result = c.element();
    byte[] row = result.getRow();
    if (row != null && row.length > 0) {
        Put put = new Put(row);
        for (Cell cell : result.rawCells()) {
            put.add(cell);
        }
        c.output(put);
    }
}

Related

Reprocess failed records in a bundle

I have a use case where the end goal is to make a REST call with the transformed data in the Apache Beam program. If a record in a bundle fails due to a connection or read-timeout error, how can I reprocess only the failed records rather than the entire bundle containing them?
You can have multiple outputs for a single transform. So, for your case, you can output the failed records into a dedicated PCollection of "dead letters" and process it separately. Please see an example here:
final TupleTag<String> successElms = new TupleTag<String>(){};
final TupleTag<String> failedElms = new TupleTag<String>(){};
PCollectionTuple mixedCollection =
    dbRowCollection.apply(ParDo
        .of(new DoFn<String, String>() {
            @ProcessElement
            public void processElement(ProcessContext c) {
                RestResult res = runRestCall(c.element());
                if (res.success()) {
                    // Emit to main output, which is the output for successful elements.
                    c.output(c.element());
                } else {
                    // Emit to output for failed elements
                    c.output(failedElms, c.element());
                }
            }
        })
        .withOutputTags(successElms,
            // Specify the other outputs as a TupleTagList.
            TupleTagList.of(failedElms)));
// Get subset of the output with failed elements.
mixedCollection.get(failedElms).apply(ProcessFailedElms.create());
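If the failures are transient (connection resets, read timeouts), one option is to retry a few times inside the DoFn before routing the element to the dead-letter output. A minimal sketch, assuming the same runRestCall() helper and failedElms tag as above (the retry count of 3 is an arbitrary choice):
@ProcessElement
public void processElement(ProcessContext c) {
    String element = c.element();
    for (int attempt = 1; attempt <= 3; attempt++) {
        RestResult res = runRestCall(element);
        if (res.success()) {
            // Success: emit to the main output and stop retrying.
            c.output(element);
            return;
        }
    }
    // Still failing after the retries: route to the dead-letter output.
    c.output(failedElms, element);
}
The dead-letter collection then only receives records that kept failing, which you can write to storage or a topic and replay later.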

EMQX- Publish MQTT Topic with unique identifier is taking much more time than Static MQTT Topic

I was trying to publish messages to the EMQX broker on different topics. The scenario takes much more time when publishing to a dynamic topic with one client; if we make the topic name static, it takes much less time.
Here are the results and the code. I am using the EMQX broker with the Eclipse Paho client version 3 and QoS level 1.
Time for 100 simple publish messages per topic pattern (consider id as dynamic here):
Total time pattern 1: /config/{id}/outward :: 36 sec -----------------> here the TOPIC is DYNAMIC; {id} is a variable whose value changes in the loop shown in the code below
Total time pattern 2: /config/test :: 1.2 sec -----------------------> here the TOPIC is STATIC
How should I publish messages with different ids so that topic creation does not take so much time?
public class MwttPublish {
    static IMqttClient instance = null;

    public static IMqttClient getInstance() {
        try {
            if (instance == null) {
                instance = new MqttClient(mqttHostUrl, "SimpleTestMQTT");
            }
            if (!instance.isConnected()) {
                MqttConnectOptions options = new MqttConnectOptions();
                options.setUserName("test");
                options.setPassword("test".toCharArray());
                options.setAutomaticReconnect(true);
                options.setCleanSession(false);
                options.setConnectionTimeout(10);
                instance.connect(options);
            }
        } catch (final Exception e) {
            System.out.println("Exception in mqtt: {}" + e.getMessage());
        }
        return instance;
    }

    public static void publishMessage() throws MqttException {
        IMqttClient iMqttClient = getInstance();
        MqttMessage mqttMessage = new MqttMessage("Hello".getBytes());
        mqttMessage.setQos(1);
        mqttMessage.setRetained(true);
        System.out.println("Publish Start for pattern 1");
        int i = 0;
        final BigDecimal mqttmsgPublishstartTime = new BigDecimal(System.currentTimeMillis());
        do {
            iMqttClient.publish("/config/" + i + "/outward", mqttMessage);
            i++;
        } while (i < 100);
        System.out.println("Total time pattern 1 /config/i/outward::" + (new BigDecimal(System.currentTimeMillis())).subtract(mqttmsgPublishstartTime));
        System.out.println("Publish Start for pattern 2");
        final BigDecimal mqttmsgPublishstartTime1 = new BigDecimal(System.currentTimeMillis());
        i = 0;
        do {
            iMqttClient.publish("/config/test", mqttMessage);
            i++;
        } while (i < 100);
        System.out.println("Total time pattern 2 /config/test::" + (new BigDecimal(System.currentTimeMillis())).subtract(mqttmsgPublishstartTime1));
    }
}
This is not a valid test; you've fallen into many of the classic micro-benchmark traps, e.g.:
Way too small a sample size
No accounting for JVM JIT warm-up or GC overhead
Not comparing like with like, e.g. the time taken to concatenate the strings for the topics (see the sketch after this answer for a fairer measurement)
Please check out the following: https://stackoverflow.com/a/2844291/504554
Also, from an MQTT point of view, topics are ephemeral: they only really "exist" for the instant a message is published, while the broker checks for subscribed clients with a matching pattern.
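For what it's worth, here is a minimal sketch of a fairer measurement with the same Paho client. The warm-up count, iteration count, use of System.nanoTime() and the pre-built topic strings are my own choices, not something from the original post:
// Sketch: pre-build the topic strings outside the timed loop, warm up the JVM
// and the connection first, then time a larger number of publishes.
public static long timePublishes(IMqttClient client, String[] topics, MqttMessage msg)
        throws MqttException {
    // Warm-up: let the JIT compile the hot path and the connection settle.
    for (int i = 0; i < 1_000; i++) {
        client.publish(topics[i % topics.length], msg);
    }
    long start = System.nanoTime();
    for (int i = 0; i < 10_000; i++) {
        client.publish(topics[i % topics.length], msg);
    }
    return (System.nanoTime() - start) / 1_000_000; // elapsed milliseconds
}
Build the dynamic topics once (e.g. an array containing /config/0/outward ... /config/99/outward, and a single-element array for /config/test), then compare timePublishes() for both; that removes string concatenation and warm-up effects from the comparison.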

Creating Custom Windowing Function in Apache Beam

I have a Beam pipeline that starts off by reading multiple text files, where each line in a file represents a row that gets inserted into Bigtable later in the pipeline. The scenario requires confirming that the count of rows extracted from each file and the count of rows later inserted into Bigtable match. For this I am planning to develop a custom windowing strategy so that lines from a single file get assigned to a single window, based on the file name as the key that is passed to the windowing function.
Is there any code sample for creating custom Windowing functions?
Although I changed my strategy for confirming the number of inserted rows, for anyone interested in windowing elements read from a batch source (e.g. FileIO) in a batch job, here is the code for creating a custom windowing strategy:
public class FileWindows extends PartitioningWindowFn<Object, IntervalWindow> {

    private static final long serialVersionUID = -476922142925927415L;
    private static final Logger LOG = LoggerFactory.getLogger(FileWindows.class);

    @Override
    public IntervalWindow assignWindow(Instant timestamp) {
        Instant end = new Instant(timestamp.getMillis() + 1);
        IntervalWindow interval = new IntervalWindow(timestamp, end);
        LOG.info("FileWindows >> assignWindow(): Window assigned with Start: {}, End: {}", timestamp, end);
        return interval;
    }

    @Override
    public boolean isCompatible(WindowFn<?, ?> other) {
        return this.equals(other);
    }

    @Override
    public void verifyCompatibility(WindowFn<?, ?> other) throws IncompatibleWindowException {
        if (!this.isCompatible(other)) {
            throw new IncompatibleWindowException(other, String.format("Only %s objects are compatible.", FileWindows.class.getSimpleName()));
        }
    }

    @Override
    public Coder<IntervalWindow> windowCoder() {
        return IntervalWindow.getCoder();
    }
}
and then it can be used in the pipeline as below:
p
    .apply("Assign_Timestamp_to_Each_Message", ParDo.of(new AssignTimestampFn()))
    .apply("Assign_Window_to_Each_Message", Window.<KV<String, String>>into(new FileWindows())
        .withAllowedLateness(Duration.standardMinutes(1))
        .discardingFiredPanes());
Please keep in mind that you will need to write AssignTimestampFn() so that each message carries a timestamp; a possible sketch follows below.
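Here is a minimal sketch of such a DoFn, assuming the elements are KV<String, String> pairs whose key is the file name; the way the timestamp is derived from the key is my own assumption, not part of the original answer:
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;
import org.joda.time.Instant;

public class AssignTimestampFn extends DoFn<KV<String, String>, KV<String, String>> {
    @ProcessElement
    public void processElement(ProcessContext c) {
        // Derive one timestamp per file (here, from the file-name key) so that
        // all lines of the same file land in the same FileWindows window.
        long millis = Math.abs((long) c.element().getKey().hashCode());
        c.outputWithTimestamp(c.element(), new Instant(millis));
    }
}
Any scheme works as long as every line of a given file gets exactly the same timestamp; in a batch file read the input elements typically carry the minimum timestamp, so outputting a later one is allowed, otherwise you would need to override getAllowedTimestampSkew().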

Read from multiple Pubsub subscriptions using ValueProvider

I have multiple Cloud Pub/Sub subscriptions to read, based on a certain prefix pattern, using Apache Beam. I extend the PTransform class and implement the expand() method to read from multiple subscriptions and apply a Flatten transform to the PCollectionList (one PCollection per subscription). I have a problem passing the subscription prefix as a ValueProvider into the expand() method, since expand() is called at template creation time, not when launching the job. However, if I only use one subscription, I can pass a ValueProvider into PubsubIO.readStrings().fromSubscription().
Here's some sample code.
public class MultiPubSubIO extends PTransform<PBegin, PCollection<PubsubMessage>> {

    private ValueProvider<String> prefixPubsub;

    public MultiPubSubIO(@Nullable String name, ValueProvider<String> prefixPubsub) {
        super(name);
        this.prefixPubsub = prefixPubsub;
    }

    @Override
    public PCollection<PubsubMessage> expand(PBegin input) {
        List<String> myList = null;
        try {
            // prefixPubsub.get() will return error
            myList = PubsubHelper.getAllSubscription("projectID", prefixPubsub.get());
        } catch (Exception e) {
            LogHelper.error(String.format("Error getting list of subscription : %s", e.toString()));
        }
        List<PCollection<PubsubMessage>> collectionList = new ArrayList<PCollection<PubsubMessage>>();
        if (myList != null && !myList.isEmpty()) {
            for (String subs : myList) {
                PCollection<PubsubMessage> pCollection = input
                    .apply("ReadPubSub", PubsubIO.readMessagesWithAttributes().fromSubscription(subs));
                collectionList.add(pCollection);
            }
            PCollection<PubsubMessage> pubsubMessagePCollection = PCollectionList.of(collectionList)
                .apply("FlattenPcollections", Flatten.pCollections());
            return pubsubMessagePCollection;
        } else {
            LogHelper.error(String.format("No subscription with prefix %s found", prefixPubsub));
            return null;
        }
    }

    public static MultiPubSubIO read(ValueProvider<String> prefixPubsub) {
        return new MultiPubSubIO(null, prefixPubsub);
    }
}
So I'm wondering how to read from a ValueProvider in the same way PubsubIO.read().fromSubscription() does. Or am I missing something?
Searched links:
extract-value-from-valueprovider-in-apache-beam - the answer talks about using a DoFn, while I need a PTransform that receives a PBegin.
Unfortunately this is not possible currently:
It is not possible for the value of a ValueProvider to affect transform expansion - at expansion time, it is unknown; by the time it is known, the pipeline shape is already fixed.
There is currently no transform like PubsubIO.read() that can accept a PCollection of topic names. Eventually there will be (it is enabled by Splittable DoFn), but it will take a while - nobody is working on this currently.
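If you can give up on templates, one workaround (my own sketch, not part of the answer above) is to resolve the prefix at pipeline-construction time with a plain String option instead of a ValueProvider, so the subscription list is known before the graph is built. getAllSubscription() is the asker's helper; getPrefixPubsub() is an assumed plain-String pipeline option:
// Sketch: expand the subscription list at construction time (no template support).
List<String> subs = PubsubHelper.getAllSubscription("projectID", options.getPrefixPubsub());
List<PCollection<PubsubMessage>> perSubscription = new ArrayList<>();
for (String sub : subs) {
    perSubscription.add(
        pipeline.apply("ReadPubSub-" + sub,
            PubsubIO.readMessagesWithAttributes().fromSubscription(sub)));
}
PCollection<PubsubMessage> merged =
    PCollectionList.of(perSubscription).apply("FlattenPcollections", Flatten.pCollections());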
You can use MultipleReadFromPubSub from the Apache Beam Python SDK's io module: https://beam.apache.org/releases/pydoc/2.27.0/_modules/apache_beam/io/gcp/pubsub.html
topic_1 = PubSubSourceDescriptor('projects/myproject/topics/a_topic')
topic_2 = PubSubSourceDescriptor(
    'projects/myproject2/topics/b_topic',
    'my_label',
    'my_timestamp_attribute')
subscription_1 = PubSubSourceDescriptor(
    'projects/myproject/subscriptions/a_subscription')
results = pipeline | MultipleReadFromPubSub(
    [topic_1, topic_2, subscription_1])

How do I get the current attempt number on a background job in Hangfire?

There are some database operations I need to execute before the end of the final attempt of my Hangfire background job (I need to delete the database record related to the job).
My current job is set with the following attribute:
[AutomaticRetry(Attempts = 5, OnAttemptsExceeded = AttemptsExceededAction.Delete)]
With that in mind, I need to determine what the current attempt number is, but am struggling to find any documentation in that regard in a Google search or the Hangfire.io documentation.
Simply add PerformContext to your job method; you'll also be able to access your JobId from this object. For attempt number, this still relies on magic strings, but it's a little less flaky than the current/only answer:
public void SendEmail(PerformContext context, string emailAddress)
{
    string jobId = context.BackgroundJob.Id;
    int retryCount = context.GetJobParameter<int>("RetryCount");
    // send an email
}
(NB! This is a solution to the OP's problem. It does not answer the question "How to get the current attempt number". If that is what you want, see the accepted answer for instance)
Use a job filter and the OnStateApplied callback:
public class CleanupAfterFailureFilter : JobFilterAttribute, IServerFilter, IApplyStateFilter
{
    public void OnStateApplied(ApplyStateContext context, IWriteOnlyTransaction transaction)
    {
        try
        {
            var failedState = context.NewState as FailedState;
            if (failedState != null)
            {
                // Job has finally failed (retry attempts exceeded)
                // *** DO YOUR CLEANUP HERE ***
            }
        }
        catch (Exception)
        {
            // Unhandled exceptions can cause an endless loop.
            // Therefore, catch and ignore them all.
            // See notes below.
        }
    }

    public void OnStateUnapplied(ApplyStateContext context, IWriteOnlyTransaction transaction)
    {
        // Must be implemented, but can be empty.
    }
}
Add the filter directly to the job function:
[CleanupAfterFailureFilter]
public static void MyJob()
or add it globally:
GlobalJobFilters.Filters.Add(new CleanupAfterFailureFilter ());
or like this:
var options = new BackgroundJobServerOptions
{
    FilterProvider = new JobFilterCollection { new CleanupAfterFailureFilter() }
};
app.UseHangfireServer(options, storage);
Or see http://docs.hangfire.io/en/latest/extensibility/using-job-filters.html for more information about job filters.
NOTE: This is based on the accepted answer: https://stackoverflow.com/a/38387512/2279059
The difference is that OnStateApplied is used instead of OnStateElection, so the filter callback is invoked only after the maximum number of retries. A downside to this method is that the state transition to "failed" cannot be interrupted, but this is not needed in this case and in most scenarios where you just want to do some cleanup after a job has failed.
NOTE: Empty catch handlers are bad, because they can hide bugs and make them hard to debug in production. It is necessary here, so the callback doesn't get called repeatedly forever. You may want to log exceptions for debugging purposes. It is also advisable to reduce the risk of exceptions in a job filter. One possibility is, instead of doing the cleanup work in-place, to schedule a new background job which runs if the original job failed. Be careful to not apply the filter CleanupAfterFailureFilter to it, though. Don't register it globally, or add some extra logic to it...
You can use the OnPerforming or OnPerformed method of IServerFilter if you want to check the attempts, or you can just wait on OnStateElection of IElectStateFilter. I don't know exactly what requirement you have, so it's up to you. Here's the code you want :)
public class JobStateFilter : JobFilterAttribute, IElectStateFilter, IServerFilter
{
    public void OnStateElection(ElectStateContext context)
    {
        // All jobs that have failed after the retry attempts come here.
        var failedState = context.CandidateState as FailedState;
        if (failedState == null) return;
    }

    public void OnPerforming(PerformingContext filterContext)
    {
        // do nothing
    }

    public void OnPerformed(PerformedContext filterContext)
    {
        // You have the option to move all of this code to OnPerforming if you want.
        var api = JobStorage.Current.GetMonitoringApi();
        var job = api.JobDetails(filterContext.BackgroundJob.Id);
        foreach (var history in job.History)
        {
            // Check the Reason property and you will find a string like
            // "Retry attempt 3 of 3: The method or operation is not implemented."
        }
    }
}
How to add your filter:
GlobalJobFilters.Filters.Add(new JobStateFilter());
or:
var options = new BackgroundJobServerOptions
{
    FilterProvider = new JobFilterCollection { new JobStateFilter() }
};
app.UseHangfireServer(options, storage);
Sample output :
