How does the SCDF stream diagram reflect multiple source channels?

If I have a source app (named A-Source) which has multiple channels to emit messages, e.g.
channelA.destination=b-topic, channelB.destination=c-topic.
The receiver for b-topic is B-Sink, and the receiver for c-topic is C-Sink.
How can I construct my streams? Can I describe them like A|B and A|C? If so, I think only part of my A-Source code is useful in each stream.
So my question is: how does the SCDF stream DSL deal with multiple taps on a single source app?

You can use named channel destinations in the Stream DSL.
For example:
dataflow:>stream create tap1 --definition ":b-topic > B-Sink"
dataflow:>stream create tap2 --definition ":c-topic > C-Sink"
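For completeness, here is a hedged sketch of what the source side might look like, assuming A-Source uses Spring Cloud Stream bindings with the channel names from the question:
spring.cloud.stream.bindings.channelA.destination=b-topic
spring.cloud.stream.bindings.channelB.destination=c-topic
With those bindings, a single deployment of A-Source publishes to both topics, and each tap stream above just attaches a sink to one of the named destinations, so the A-Source code is not duplicated per stream.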

Related

Stream PubSub to Spanner - Wait.on Step

The requirement is to delete the data in the Spanner tables before inserting the data from the Pub/Sub messages. As a MutationGroup does not guarantee the order of execution, I separated the delete mutations into their own set, so there are two sets: one for Delete mutations and one for AddReplace mutations.
PCollection<Data> dataJson = pipeLine
    .apply(PubsubIO.readStrings().fromSubscription(options.getInputSubscription()))
    .apply("ParsePubSubMessage", ParDo.of(new PubSubToDataFn()))
    .apply(Window.into(FixedWindows.of(Duration.standardSeconds(10))));

SpannerWriteResult deleteResult = dataJson
    .apply("DeleteDataMutation", MapElements.via(......))
    .apply("DeleteData", SpannerIO.write().withSpannerConfig(spannerConfig).grouped());

dataJson
    .apply("WaitOnDeleteMutation", Wait.on(deleteResult.getOutput()))
    .apply("AddReplaceMutation", MapElements.via(...))
    .apply("UpsertInfoToSpanner", SpannerIO.write().withSpannerConfig(spannerConfig).grouped());
This is a streaming Dataflow job, and I have tried multiple windowing strategies, but it never executes the "UpsertInfoToSpanner" step.
How can I fix this issue? Can someone suggest a path forward?
Update:
The requirement is to apply two mutation groups sequentially to the same input data, i.e. read the JSON from the Pub/Sub message to delete existing data from multiple tables with one mutation group, and then insert data read from the same JSON Pub/Sub message.
Re-pasting the earlier comment for better visibility:
The Mutation operations within a single MutationGroup are guaranteed to be executed in order within a single transaction, so I don't see what the issue is there... The reason why Wait.on() never releases is that the output stream being waited on is in the global window, so it will never be closed in a streaming pipeline.
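Building on that comment, one direction worth trying (a hedged sketch only, reusing the types from the snippet above; the window duration and step names are illustrative) is to take the Wait.on() signal out of the global window by explicitly re-windowing the delete result before waiting on it:
// Sketch: window the delete signal so Wait.on() has closing windows to release on.
PCollection<Void> deleteSignal = deleteResult
    .getOutput()
    .apply("WindowDeleteSignal",
        Window.<Void>into(FixedWindows.of(Duration.standardSeconds(10)))
            .triggering(Repeatedly.forever(AfterPane.elementCountAtLeast(1)))
            .withAllowedLateness(Duration.ZERO)
            .discardingFiredPanes());

dataJson
    .apply("WaitOnDeleteMutation", Wait.on(deleteSignal))
    .apply("AddReplaceMutation", MapElements.via(...))
    .apply("UpsertInfoToSpanner", SpannerIO.write().withSpannerConfig(spannerConfig).grouped());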

Is there a way to inject a config into a ParDo without a sideInput?

I have a ParDo that uses state and timers, with a periodically updating PCollectionView as a sideInput to that ParDo; Google Dataflow throws an exception saying that timers are not allowed in such a case. Is there another way to feed config data to the ParDo without a sideInput? Essentially, the sideInput was a map of config data that was read from Datastore about every 24 hours.
I am currently trying to see if I can create a ParDo before the one with state and timers to periodically update the config, but I don't see how we can access that map from within the next ParDo. Any suggestions?
Note: This pipeline runs in streaming mode with a global window, reading Pub/Sub messages as they arrive. Datastore is used to hold the data needed to decide when to output an element to a Pub/Sub topic.
Instead of using state and timers to update the side input, you can use a fixed window to periodically refresh your PCollectionView from your data source:
PCollectionView<Map<String, String>> sideInput = pipeline
    .apply(notifications)
    .apply(
        Window.<Long>into(FixedWindows.of(Duration.standardMinutes(refreshMinutes)))
            .triggering(Repeatedly.forever(AfterPane.elementCountAtLeast(1)))
            .withAllowedLateness(Duration.ZERO)
            .discardingFiredPanes())
    .apply( /* query data source */ )
    .apply(View.<Map<String, String>>asSingleton());
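For context, here is a minimal sketch of how the refreshed view could be read downstream, assuming the consuming ParDo does not use timers (per the restriction in the question) and that the main input is windowed compatibly with the view; mainInput and the String element type are placeholders:
mainInput
    .apply("WindowMainInput", Window.<String>into(FixedWindows.of(Duration.standardMinutes(refreshMinutes))))
    .apply("ApplyConfig", ParDo.of(new DoFn<String, String>() {
        @ProcessElement
        public void processElement(ProcessContext c) {
            // Read the latest config map published by the windowed view above.
            Map<String, String> config = c.sideInput(sideInput);
            // ... use config to decide what to output ...
            c.output(c.element());
        }
    }).withSideInputs(sideInput));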

Launching a composed task built by the DSL from a stream application

Every example I've seen (the task-launcher sink and the triggertask source) shows how to launch a task defined by a uri attribute.
My task definitions look like this:
sampleTask <t2: timestamp || t1: timestamp>
sampleTask-t1 timestamp
sampleTask-t2 timestamp
sampleTaskRunner composed-task-runner --graph=sampleTask
My question is: how do I launch the composed task runner (sampleTaskRunner, defined by the DSL) from a stream application?
Thanks
UPDATE
I ended up with the solution below, which triggers the task using the SCDF REST API:
composedTask definition:
<timestamp || mySampleTask>
Stream definition:
http | httpclient | log
Deployment properties:
app.http.port=81
app.httpclient.body=name=composedTask&arguments=--increment-instance-enabled=true
app.httpclient.http-method=POST
app.httpclient.url=http://localhost:9393/tasks/executions
app.httpclient.headers-expression={'Content-Type':'application/x-www-form-urlencoded'}
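For reference, the httpclient step configured above amounts to the following request against the SCDF server's REST API (a hedged curl equivalent, assuming the same localhost:9393 server and composedTask name from the properties):
curl -X POST http://localhost:9393/tasks/executions \
  -H 'Content-Type: application/x-www-form-urlencoded' \
  -d 'name=composedTask&arguments=--increment-instance-enabled=true'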
Though it's easy to implement an http sink component, it would be great if the stream application starters provided one out of the box.
Another concern I have is about discovering the SCDF REST URL when deployed in a distributed environment.
Here's a quick take from one of SCDF's R&D team members (Glenn Renfro).
stream create foozer --definition "trigger --fixed-delay=5 | tasklaunchrequest-transform --uri=maven://org.springframework.cloud.task.app:composedtaskrunner-task:1.1.0.BUILD-SNAPSHOT --command-line-arguments='--graph=sampleTask-t1||sampleTask-t2 --increment-instance-enabled=true --spring.datasource.url=jdbc:mariadb://localhost:3306/test --spring.datasource.username=root --spring.datasource.password=password --spring.datasource.driverClassName=org.mariadb.jdbc.Driver' | task-launcher-local" --deploy
In the foozer stream definition,
1) "trigger" source happens to trigger an upstream event every 5s
2) "tasklaunchrequest-transform" processor takes a few arguments; more specifically, it uses "composedtaskrunner-task:1.1.0.BUILD-SNAPSHOT" to launch a composed-task graph (i.e., sampleTask-t1||sampleTask-t2)
3) Pay attention to --increment-instance-enabled. This was recently added to CTR application and this provides the ability to re-launch a composed-task in a recurring cadence
4) Since the CTR and SCDF must share the same database, we are also passing datasource properties as command-line args. (SCDF-server is already started with the same datasource credentials)
Hope this helps.
Lastly, we will add a sample to the reference guide via: spring-cloud/spring-cloud-dataflow#1780

How to write a custom ES sink in Flume 1.7

In the Flume agent I am collecting elements from Kafka topics and I need to insert them into ES. However, I need to perform a digestion process first in the sink, so I need to write a custom sink that passes the data from the agent's channel to a Java digestion module (which I have already written).
Can anyone share a template of a custom sink that I can use as a reference? Flume's official website doesn't say much about this topic:
A custom sink’s class and its dependencies must be included in the agent’s classpath when starting the Flume agent. The type of the custom sink is its FQCN.
https://flume.apache.org/FlumeUserGuide.html#custom-sink
And once the custom sink is ready, how can I link the following three files to make the agent work?
custom sink
ingestion jar (the Java module that performs the ingestion process)
FlumeAgent.properties
Thank you for any feedback. I will keep adding information as I make progress on this task.
It sounds like you are trying to use Flume to receive events from Kafka (source) and forward them to ES (sink), with some data processing logic you already have.
With that understanding, I would suggest you look into Flume interceptors, which are responsible for altering/filtering events on the fly before they are sent to the sink.
So all your business logic to alter the events can be implemented as a custom interceptor, and it should be configured on the Flume source (as in the sample config below).
For reference, you can check out the source code of the native interceptors that are already available. This should give you an idea of the Flume interceptor framework.
Here is the ES Sink source code.
Sample Flume config:
a1.sources = kafkaSource
a1.sinks = ES_Sink
a1.channels = channel1
a1.sources.kafkaSource.interceptors = i1
a1.sources.kafkaSource.interceptors.i1.type = org.apache.flume.interceptor.<Custom_Interceptor_name>$Builder
a1.sinks.ES_Sink.channel = channel1
a1.sinks.ES_Sink.type = elasticsearch
a1.sinks.ES_Sink.hostNames = 127.0.0.1:9200
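Since the question asks for a template, here is a hedged skeleton of a custom interceptor (the package and class names are illustrative, not from the question); your digestion module would be invoked from intercept(). The Builder inner class is what the <Custom_Interceptor_name>$Builder value in the config above refers to.
package com.example.flume;  // illustrative package name

import java.util.ArrayList;
import java.util.List;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

public class DigestionInterceptor implements Interceptor {

    @Override
    public void initialize() {
        // Initialize your digestion module here.
    }

    @Override
    public Event intercept(Event event) {
        // Hand the event body to your digestion logic and put the result back.
        byte[] digested = event.getBody();  // placeholder: call your digestion module here
        event.setBody(digested);
        return event;  // return null to drop the event
    }

    @Override
    public List<Event> intercept(List<Event> events) {
        List<Event> out = new ArrayList<>(events.size());
        for (Event e : events) {
            Event intercepted = intercept(e);
            if (intercepted != null) {
                out.add(intercepted);
            }
        }
        return out;
    }

    @Override
    public void close() {
        // Release any resources held by the digestion module.
    }

    public static class Builder implements Interceptor.Builder {
        @Override
        public Interceptor build() {
            return new DigestionInterceptor();
        }

        @Override
        public void configure(Context context) {
            // Read any interceptor-specific properties from the agent config here.
        }
    }
}
Package the compiled interceptor (together with the ingestion jar it depends on) so that both end up on the agent's classpath, for example via the agent's plugins.d directory, and reference the Builder's fully qualified class name in FlumeAgent.properties as shown above.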

Aggregate-counter on an existing stream

I'm trying to create an aggregate counter for various streams I have set up. In Spring XD it would look like this: "tap:stream:MyCustomStream > aggregate-counter".
In Spring Cloud Data Flow, so far I have done ":MyKafkaTopic > aggregate-counter", which seems to create a Kafka consumer and read the payload to determine a count of events on the topic. I'd like to be able to tap any stream, not just a Kafka source, e.g. "MyApp1 | MyApp2" --name MyCustomStream.
The provided example "stream create --definition ":mainstream.http > counter" --name tap_at_http --deploy" essentially assumes mainstream.http is a Kafka topic (or RabbitMQ topic).
Has anyone done this before?
Going by your example,
stream create foo --definition "MyApp1 | MyApp2"
If you want to TAP the foo stream at the producer (MyApp1) level, your TAP stream would look like the following.
stream create bar --definition ":foo.MyApp1 > MyApp3"
You're just pointing at the producer in the stream where you'd like to TAP to get a copy of the same data. The format is: :<streamName>.<label/appName>. You could use "labels" instead of app names, too. Please review the reference guide for more details.
The provided example "stream create --definition ":mainstream.http > counter" --name tap_at_http --deploy" essentially assumes mainstream.http is a Kafka topic (or RabbitMQ topic).
In this case, mainstream is the stream name and you're TAP'ing at the http source application, which equates to :mainstream.http.
This is analogous to tap:stream:foo in Spring XD. By default, Spring XD assumes the producer (the source) when only the stream name is given; you'd have to specify the app explicitly when you TAP at a processor, though.
In SCDF, we require it explicitly to make the definition more descriptive, and the DSL is easier to follow as well.
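Putting that together with the aggregate-counter from the question, the definitions would look something like the following (stream names are illustrative):
stream create MyCustomStream --definition "MyApp1 | MyApp2" --deploy
stream create MyCustomStreamTap --definition ":MyCustomStream.MyApp1 > aggregate-counter" --deploy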
