How to use an Apache Beam connector without running inside a pipeline - google-cloud-dataflow

We are running our program inside a Kubernetes pod, which listens for Pub/Sub messages. Based on the message data type it launches a Dataflow job, and once the job execution finishes we send a Pub/Sub message to another system.
The pipeline is launched in batch mode: it reads from GCS and, after processing, writes back to GCS.
Pipeline pipeline = Pipeline.create(options);

PCollection<String> read = pipeline
    .apply("Read from GCS",
        TextIO.read().from("GCS_PATH").withCompression(Compression.GZIP));
// process
// write to GCS
....

PipelineResult result = pipeline.run();
result.waitUntilFinish();

// send "job completed" message to Pub/Sub for the other component
....
....
Since I have to send an event to other components in the system, I am currently using the Pub/Sub Java client library to push the message.
Is there a way to use the Apache Beam Pub/Sub connector to send the message, like below? Or what is the right way to do this?
PubsubIO.writeMessages().to("topicName");

To solve this use case you can use the Wait API. Details can be found here:
PCollection<Void> firstWriteResults = data.apply(ParDo.of(...write to first database...));
data.apply(Wait.on(firstWriteResults))
    // Windows of this intermediate PCollection will be processed no earlier than when
    // the respective window of firstWriteResults closes.
    .apply(ParDo.of(...write to second database...));
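Applied to this question, a minimal sketch could look like the following, assuming the GCS write is expressed with TextIO.write().withOutputFilenames() so its result can serve as the Wait.on() signal, and assuming your runner supports PubsubIO writes in batch mode. The processed PCollection, bucket path, and topic name are placeholders; PubsubIO.writeStrings() is used for brevity, and writeMessages() would work similarly with PubsubMessage elements.

import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.WriteFilesResult;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.Values;
import org.apache.beam.sdk.transforms.Wait;
import org.apache.beam.sdk.values.PCollection;

// Write the processed data and capture the produced filenames as a completion signal.
WriteFilesResult<Void> written = processed.apply("Write to GCS",
    TextIO.write().to("gs://my-bucket/output/part").withOutputFilenames());

PCollection<String> writtenFiles =
    written.getPerDestinationOutputFilenames().apply(Values.create());

// Emit a single marker element, held back by Wait.on until the GCS write has finished,
// then publish it with the Beam Pub/Sub connector.
pipeline
    .apply("CompletionMarker", Create.of("job-completed"))
    .apply(Wait.on(writtenFiles))
    .apply("NotifyDownstream",
        PubsubIO.writeStrings().to("projects/my-project/topics/job-events"));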

Related

how to make jenkins declarative pipeline waitForWebhook without consuming jenkins executor?

We have a Jenkins declarative pipeline which, after deploying our product on a cloud VM, needs to run some product tests on it. The tests are implemented as a separate job on another Jenkins, and the main pipeline runs them by triggering the remote job on the second Jenkins using the Parameterized Remote Trigger plugin.
While this plugin works great, when using the option blockBuildUntilComplete it blocks until the remote job finishes but doesn't release the Jenkins executor. Since the tests can take a long time to complete (up to 2 days), the executor is blocked all this time just waiting for another job to complete. When setting blockBuildUntilComplete to false, the plugin returns a job handle which can be used to fetch the build status, result, etc. Example here:
while( !handle.isFinished() ) {
    echo 'Current Status: ' + handle.getBuildStatus().toString();
    sleep 5
    handle.updateBuildStatus()
}
But this still keeps consuming the executor, so we still have the same problem.
Based on comments in the article, we tried waitForWebhook, but even while waiting for the webhook it still keeps using the executor.
Based on the article we tried input, and we observed that it doesn't use an executor when the input block is inside the stage and the pipeline agent is none:
pipeline {
    agent none
    stages {
        stage("test") {
            input {
                message "Should we continue?"
            }
            agent any
            steps {
                script {
                    echo "hello"
                }
            }
        }
    }
}
So the input block does what we want waitForWebhook to do, at least as far as not consuming an executor while waiting.
Our original idea was to have waitForWebhook inside a timeout, surrounded by a try/catch, inside a while loop waiting for the remote job to finish:
while (!handle.isFinished()) {
    try {
        timeout(1) {
            data = waitForWebhook hook
        }
    } catch(Exception e) {
        log.info("timeout occurred")
    }
    handle.updateBuildStatus()
}
This way we could avoid using an executor for a long period of time and still benefit from the job handle returned by the plugin. But we cannot find a way to do this with the input step. input avoids using an executor only when it is a separate block inside the stage; it does use one when placed inside steps or steps->script. So we cannot use try/catch, cannot check the handle status, and cannot loop on it. Is there a way to do this?
Even if waitForWebhook worked the way input works, we could use that.
Our main pipeline runs on a Jenkins inside the corporate network, and the test Jenkins runs in the cloud and cannot communicate inwards with our corporate Jenkins. So when using waitForWebhook, we would publish a message on a message pipeline, which a consumer would read, fetch the webhook URL from a DB corresponding to the job, and post to it. We were hoping to avoid this with our solution of using while and try/catch.

is there a way to inject a config into a ParDo without sideInput?

I have a ParDo that uses state and timers, with a periodically updating PCollectionView as a side input to that ParDo; Google Dataflow throws an exception that timers are not allowed in such a case. Is there another way to feed config data to the ParDo without a side input? Essentially, the side input was a map of config data that was read from Datastore about every 24 hours.
I am currently trying to see if I can create a ParDo before the one with state and timers to periodically update the config, but I don't see how we can access that map from within the next ParDo. Any suggestions?
Note: This pipeline is running in streaming mode with a global window, reading Pub/Sub messages as they arrive. Datastore is used to hold data needed to decide when to output an element to a Pub/Sub topic.
Instead of using state and timers to update the side input, you can use a fixed window to periodically refresh your PCollectionView from your data source:
PCollectionView<Map<String, String>> sideInput = pipeline
    .apply(notifications)
    .apply(
        Window.<Long>into(FixedWindows.of(Duration.standardMinutes(refreshMinutes)))
            .triggering(
                Repeatedly.forever(AfterPane.elementCountAtLeast(1))
            )
            .withAllowedLateness(Duration.ZERO)
            .discardingFiredPanes()
    )
    .apply( /* query data source */ )
    .apply(View.<Map<String, String>>asSingleton());
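As for reading that map from the downstream ParDo, a minimal sketch (continuing from the snippet above; the events collection and its String element type are placeholder assumptions, and state/timers are omitted here) passes the view to the ParDo explicitly and reads it per element:

import java.util.Map;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;

PCollection<String> configured = events.apply("ApplyConfig",
    ParDo.of(new DoFn<String, String>() {
      @ProcessElement
      public void processElement(ProcessContext c) {
        // The runner supplies the side-input value that matches the element's window,
        // so this map reflects the most recently fired pane of the view.
        Map<String, String> config = c.sideInput(sideInput);
        c.output(config.getOrDefault(c.element(), c.element()));
      }
    }).withSideInputs(sideInput));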

Fixed window over Unbounded input (PubSub) stops firing after workers autoscale up

Using scio version 0.4.7, I have a streaming job that's listening to a pubsub topic. I'm using event-time processing here, with a 'timestamp' attribute present on the message properties in RFC3339 format:
val rtEvents: SCollection[RTEvent] = sc.pubsubTopic(args("topic"), timestampAttribute = "timestamp").map(jsonToObject)

val windowedEvents = rtEvents.withFixedWindows(Duration.standardMinutes(1L),
  options = WindowOptions(
    trigger = Repeatedly.forever(AfterWatermark.pastEndOfWindow()),
    accumulationMode = DISCARDING_FIRED_PANES,
    allowedLateness = Duration.standardSeconds(1L)
  )
)
I use windowedEvents for further aggregation and calculations in the pipeline
doSomeAggregation(windowedEvents)
def doSomeAggregation(events: SCollection[RTEvent]): SCollection[(String, Map[String, Int])] =
  events.map(e => (e.properties.key, (e.properties.category, e.id)))
    .groupByKey
    .map { case (key, tuple: Iterable[(String, String)]) =>
      val countPerCategory: Map[String, Int] = tuple.groupBy(_._1)
        .mapValues(_.toList.distinct.size)
      // some other http post and logging here
      (key, countPerCategory)
    }
sc.close().waitUntilFinish()
If I run the job with the following autoscaling parameters on Google Dataflow
--workerMachineType=n1-standard-8 --autoscalingAlgorithm=THROUGHPUT_BASED
--maxNumWorkers=4
the job runs and the fixed windows fire correctly as long as only one worker is running. As soon as the job autoscales up to more than one worker, the fixed windows stop firing, and the initial pubsub step's system lag and wall time keep growing while the data watermark does not move forward.
Is there something wrong with my trigger setup? Has anyone else experienced this on the Dataflow runner or other runners?
Any help is greatly appreciated. I'm inclined to drop scio and revert back to the apache-beam Java SDK if I can't solve this.
I managed to resolve the issue. In my setup the workers were unable to communicate with each other; the job silently fails without any timeout errors (something Beam should probably propagate up as an error).
If you're using Dataflow as your runner, make sure the firewall rule defined for Dataflow in your project applies to the 'default' network.
If the Dataflow firewall rule is defined for a custom network instead, you will need to pass an additional runtime parameter to your job:
--workerMachineType=n1-standard-8 --autoscalingAlgorithm=THROUGHPUT_BASED
--maxNumWorkers=4 --network='your-network'
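For reference, a hedged sketch of setting the same options programmatically via the Beam Java SDK's DataflowPipelineOptions (the network name is a placeholder; scio forwards these options to the same underlying runner):

import org.apache.beam.runners.dataflow.DataflowRunner;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.runners.dataflow.options.DataflowPipelineWorkerPoolOptions.AutoscalingAlgorithmType;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

DataflowPipelineOptions options =
    PipelineOptionsFactory.fromArgs(args).withValidation().as(DataflowPipelineOptions.class);
options.setRunner(DataflowRunner.class);
options.setWorkerMachineType("n1-standard-8");
options.setAutoscalingAlgorithm(AutoscalingAlgorithmType.THROUGHPUT_BASED);
options.setMaxNumWorkers(4);
options.setNetwork("your-network");  // must match the network your Dataflow firewall rules target
Pipeline pipeline = Pipeline.create(options);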

How to write a custom ES sink in Flume 1.7

In the Flume agent I am collecting the elements from Kafka topics and I need to insert them into ES. However, I need to perform a digestion process first, so I need to write a custom sink to pass the data from the agent's channel to a Java digestion module (which I have already written).
Can anyone share a template of a custom sink that I can use as a reference? Flume's official website doesn't say much about this topic:
A custom sink’s class and its dependencies must be included in the agent’s classpath when starting the Flume agent. The type of the custom sink is its FQCN.
https://flume.apache.org/FlumeUserGuide.html#custom-sink
And once the custom sink is ready, how could I link the following three files to make the agent work:
custom sink
ingestion jar (java module to perform the ingestion process)
FlumeAgent.properties
Thank you for any feedback. I will keep adding information as I progress with this task.
I understand you are trying to use Flume to receive events from Kafka (source) and forward them to ES (sink), with some data-processing logic you already have.
With this understanding, I would suggest you look into Flume interceptors, which are responsible for altering/filtering events on the fly before they are sent to the sink.
So all your business logic to alter the events can be implemented as a custom interceptor, and it should be configured on the Flume source (as in the sample config below).
For reference you can check out the source code of the native interceptors that are already available. This should give you an idea of the Flume interceptor framework.
Here is the ES Sink source code
Sample Flume config
a1.sources = kafkaSource
a1.sinks = ES_Sink
a1.channels = channel1
a1.sources.kafkaSource.interceptors = i1
a1.sources.kafkaSource.interceptors.i1.type = org.apache.flume.interceptor.<Custom_Interceptor_name>$Builder
a1.sinks.ES_Sink.channel = channel1
a1.sinks.ES_Sink.type = elasticsearch
a1.sinks.ES_Sink.hostNames = 127.0.0.1:9200
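As a starting point, a minimal, hypothetical interceptor skeleton might look like the following (the class and package names are placeholders, and digest() stands in for your existing Java digestion module). Build it into a jar, put it on the agent's classpath together with your ingestion jar, and reference it in the config as the FQCN plus $Builder, as in the sample above.

package com.example.flume;

import java.util.ArrayList;
import java.util.List;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

/** Hypothetical interceptor that runs custom digestion logic on each event body. */
public class DigestInterceptor implements Interceptor {

  @Override
  public void initialize() {
    // set up the digestion module here if it needs initialization
  }

  @Override
  public Event intercept(Event event) {
    byte[] digested = digest(event.getBody());  // delegate to your ingestion jar here
    event.setBody(digested);
    return event;                               // return null to drop the event
  }

  @Override
  public List<Event> intercept(List<Event> events) {
    List<Event> out = new ArrayList<>(events.size());
    for (Event e : events) {
      Event intercepted = intercept(e);
      if (intercepted != null) {
        out.add(intercepted);
      }
    }
    return out;
  }

  @Override
  public void close() {}

  private byte[] digest(byte[] body) {
    return body;  // placeholder for the real digestion logic
  }

  /** Flume instantiates interceptors through a Builder; reference it as FQCN$Builder. */
  public static class Builder implements Interceptor.Builder {
    @Override
    public Interceptor build() {
      return new DigestInterceptor();
    }

    @Override
    public void configure(Context context) {
      // read interceptor properties from the agent configuration if needed
    }
  }
}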

Google Dataflow trigger firing before elementCountAtLeast reached

I was planning on using Google Dataflow to coordinate human-in-the-loop form completion, checking for conflicts after 3 forms have been completed. I have set up Google Pub/Sub for both the Dataflow source and sink, and simply want the trigger to fire and send output to the Pub/Sub sink after three forms have been received for a given jobId.
This SO post looked similar to the problem I was trying to solve; however, when I implement it, the trigger fires and sends output to the Pub/Sub sink before AfterPane.elementCountAtLeast is reached.
I have tried it with the GlobalWindow and SlidingWindows. Once I get the trigger to fire after the elementCountAtLeast is reached, I plan to implement a GroupByKey on the jobId. However, before I move to that step I'd like to get elementCountAtLeast working in isolation.
Here is the code for reading from PubSub and the SlidingWindow:
PCollection<String> humanInTheLoopInput;
humanInTheLoopInput = pipeline
    .apply(PubsubIO.Read
        .named("ReadFromHumanInTheLoopSubscription")
        .subscription(options.getInputHumanInTheLoopRawSubscription()));

PCollection<String> windowedInput = humanInTheLoopInput
    .apply(Window
        .<String>into(SlidingWindows
            .of(Duration.standardSeconds(30))
            .every(Duration.standardSeconds(5)))
        .<String>triggering(Repeatedly.forever(AfterPane.elementCountAtLeast(3)))
        .discardingFiredPanes()
        .withAllowedLateness(Duration.standardDays(10)));
Without a GroupByKey nothing is triggered: windowing and triggering only take effect at grouping (and combining) operations, so elements passing straight through element-wise transforms to the sink are never held back or batched by the trigger.
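For illustration, a minimal sketch of that grouping step (written with Beam-style transform names; extractJobId and keying by jobId are assumptions based on the question) could look like:

import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.WithKeys;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

// Key every form by its jobId so the windowing/trigger configuration applies at the grouping.
PCollection<KV<String, String>> keyedForms = windowedInput
    .apply("KeyByJobId",
        WithKeys.of((String form) -> extractJobId(form))   // extractJobId is a placeholder parser
            .withKeyType(TypeDescriptors.strings()));

// Each fired pane now holds at least three forms for a jobId (per the trigger),
// and can be checked for conflicts before writing to the Pub/Sub sink.
PCollection<KV<String, Iterable<String>>> formsPerJob =
    keyedForms.apply("GroupByJobId", GroupByKey.create());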
