Read from multiple Pubsub subscriptions using ValueProvider - google-cloud-dataflow

I need to read from multiple Cloud Pub/Sub subscriptions that share a certain prefix, using Apache Beam. I extend the PTransform class and implement the expand() method to read from each subscription, then apply a Flatten transform to the resulting PCollectionList (one PCollection per subscription). My problem is passing the subscription prefix as a ValueProvider into expand(): expand() is called at template creation time, not when the job is launched, so the value is not yet available. With a single subscription, however, I can pass a ValueProvider directly into PubsubIO.readStrings().fromSubscription().
Here's some sample code.
public class MultiPubSubIO extends PTransform<PBegin, PCollection<PubsubMessage>> {
    private ValueProvider<String> prefixPubsub;
    public MultiPubSubIO(@Nullable String name, ValueProvider<String> prefixPubsub) {
        super(name);
        this.prefixPubsub = prefixPubsub;
    }
    @Override
    public PCollection<PubsubMessage> expand(PBegin input) {
        List<String> myList = null;
        try {
            // prefixPubsub.get() fails here: the value is not available at template creation time
            myList = PubsubHelper.getAllSubscription("projectID", prefixPubsub.get());
        } catch (Exception e) {
            LogHelper.error(String.format("Error getting list of subscription : %s", e.toString()));
        }
        List<PCollection<PubsubMessage>> collectionList = new ArrayList<PCollection<PubsubMessage>>();
        if (myList != null && !myList.isEmpty()) {
            for (String subs : myList) {
                // Read from each matching subscription; every step needs a unique name
                PCollection<PubsubMessage> pCollection = input
                    .apply("ReadPubSub_" + subs, PubsubIO.readMessagesWithAttributes().fromSubscription(subs));
                collectionList.add(pCollection);
            }
            PCollection<PubsubMessage> pubsubMessagePCollection = PCollectionList.of(collectionList)
                .apply("FlattenPcollections", Flatten.pCollections());
            return pubsubMessagePCollection;
        } else {
            LogHelper.error(String.format("No subscription with prefix %s found", prefixPubsub));
            return null;
        }
    }
    public static MultiPubSubIO read(ValueProvider<String> prefixPubsub) {
        return new MultiPubSubIO(null, prefixPubsub);
    }
}
So I'm wondering how to read from a ValueProvider in the same way that PubsubIO.read().fromSubscription() does. Or am I missing something?
Searched links:
extract-value-from-valueprovider-in-apache-beam - The answer there talks about using a DoFn, while I need a PTransform that receives PBegin.

Unfortunately this is not possible currently:
It is not possible for the value of a ValueProvider to affect transform expansion - at expansion time, it is unknown; by the time it is known, the pipeline shape is already fixed.
There is currently no transform like PubsubIO.read() that can accept a PCollection of topic names. Eventually there will be (it is enabled by Splittable DoFn), but it will take a while - nobody is working on this currently.

In the Python SDK, you can use MultipleReadFromPubSub from the apache_beam.io.gcp.pubsub module: https://beam.apache.org/releases/pydoc/2.27.0/_modules/apache_beam/io/gcp/pubsub.html
from apache_beam.io.gcp.pubsub import MultipleReadFromPubSub, PubSubSourceDescriptor
# Each descriptor names a topic or a subscription, optionally with an id label and a timestamp attribute
topic_1 = PubSubSourceDescriptor('projects/myproject/topics/a_topic')
topic_2 = PubSubSourceDescriptor(
    'projects/myproject2/topics/b_topic',
    'my_label',
    'my_timestamp_attribute')
subscription_1 = PubSubSourceDescriptor(
    'projects/myproject/subscriptions/a_subscription')
results = pipeline | MultipleReadFromPubSub(
    [topic_1, topic_2, subscription_1])

EMQX: publishing to MQTT topics with a unique identifier takes much longer than publishing to a static MQTT topic

I am publishing messages to an EMQX broker on different topics. With a single client, publishing to a dynamic topic takes much longer than publishing to a static topic name.
The results and code are posted below.
I am using the EMQX broker with the Eclipse Paho client version 3 and QoS level 1.
Time to publish 100 simple messages for each topic pattern:
Total time pattern 1: /config/{id}/outward: 36 sec (dynamic topic; {id} is a variable whose value changes on each loop iteration, as shown in the code below)
Total time pattern 2: /config/test: 1.2 sec (static topic)
How should I publish messages with different ids so that topic creation won't take so much time?
public class MwttPublish {
    static IMqttClient instance = null;
    public static IMqttClient getInstance() {
        try {
            if (instance == null) {
                instance = new MqttClient(mqttHostUrl, "SimpleTestMQTT");
            }
            if (!instance.isConnected()) {
                MqttConnectOptions options = new MqttConnectOptions();
                options.setUserName("test");
                options.setPassword("test".toCharArray());
                options.setAutomaticReconnect(true);
                options.setCleanSession(false);
                options.setConnectionTimeout(10);
                instance.connect(options);
            }
        } catch (final Exception e) {
            System.out.println("Exception in mqtt: " + e.getMessage());
        }
        return instance;
    }
    public static void publishMessage() throws MqttException {
        IMqttClient iMqttClient = getInstance();
        MqttMessage mqttMessage = new MqttMessage("Hello".getBytes());
        mqttMessage.setQos(1);
        mqttMessage.setRetained(true);
        System.out.println("Publish Start for pattern 1");
        int i = 0;
        final BigDecimal mqttmsgPublishstartTime = new BigDecimal(System.currentTimeMillis());
        // Pattern 1: a different topic on every iteration
        do {
            iMqttClient.publish("/config/" + i + "/outward", mqttMessage);
            i++;
        } while (i < 100);
        System.out.println("Total time pattern 1 /config/i/outward::" + (new BigDecimal(System.currentTimeMillis())).subtract(mqttmsgPublishstartTime));
        System.out.println("Publish Start for pattern 2");
        final BigDecimal mqttmsgPublishstartTime1 = new BigDecimal(System.currentTimeMillis());
        i = 0;
        // Pattern 2: the same topic on every iteration
        do {
            iMqttClient.publish("/config/test", mqttMessage);
            i++;
        } while (i < 100);
        System.out.println("Total time pattern 2 /config/test::" + (new BigDecimal(System.currentTimeMillis())).subtract(mqttmsgPublishstartTime1));
    }
}
This is not a valid test; you've fallen into many of the classic microbenchmark traps, e.g.
Way too small a sample size
No accounting for JVM JIT warm-up or GC overhead
Not comparing like with like, e.g. the time taken to concatenate the strings for the topics
Please check out the following: https://stackoverflow.com/a/2844291/504554
Also, from an MQTT point of view, topics are ephemeral: they only really "exist" for the instant a message is published, while the broker checks for subscribed clients with a matching pattern.
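The benchmarking points above can be sketched as a small timing helper: run the action untimed for a warm-up phase, then time a much larger number of calls with System.nanoTime and report the average per call. This is only a rough sketch; the names PublishTiming and averageNanos are hypothetical, and the publish call is replaced by a stand-in that only builds the topic string, so no broker is involved. For serious measurements a dedicated harness such as JMH is the usual choice.

```java
import java.util.function.IntConsumer;

/** Hypothetical timing helper; not part of Paho or EMQX. */
final class PublishTiming {

    /**
     * Calls the action warmupRuns times without timing it (so the JIT can compile
     * the hot path), then times measuredRuns calls and returns the average
     * number of nanoseconds per call.
     */
    static double averageNanos(IntConsumer action, int warmupRuns, int measuredRuns) {
        if (measuredRuns <= 0) {
            throw new IllegalArgumentException("measuredRuns must be positive");
        }
        for (int i = 0; i < warmupRuns; i++) {
            action.accept(i);
        }
        long start = System.nanoTime();
        for (int i = 0; i < measuredRuns; i++) {
            action.accept(i);
        }
        return (System.nanoTime() - start) / (double) measuredRuns;
    }

    public static void main(String[] args) {
        // Accumulate into a sink so the work cannot be discarded as unused.
        long[] sink = new long[1];
        // Stand-ins for iMqttClient.publish(topic, message): they only build the topic string.
        IntConsumer dynamicTopic = i -> sink[0] += ("/config/" + i + "/outward").length();
        IntConsumer staticTopic = i -> sink[0] += "/config/test".length();

        System.out.printf("dynamic topic: %.1f ns/op%n", averageNanos(dynamicTopic, 10_000, 100_000));
        System.out.printf("static topic:  %.1f ns/op%n", averageNanos(staticTopic, 10_000, 100_000));
    }
}
```

To time real publishing, the two stand-in lambdas would be replaced by calls to the client from the question, keeping the same warm-up and iteration counts for both patterns so that like is compared with like.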

How to get my object (Generator) from a Map<UUID, List<Generator>> with streams?

I want to look up a Generator by its location, and I'd like to use streams to find the one whose location matches.
My current approach is as follows:
public Generator getGeneratorFromLocation(final Location location) {
    for (List<Generator> generator : playerGeneratorMap.values()) {
        for (Generator generator1 : generator) {
            if (generator1.getGenLocation().equals(location)) {
                return generator1;
            }
        }
    }
    return null;
}
I'd like to return a Generator from this using streams instead, to learn more ways of doing it.
Current map:
public final Map<UUID, List<Generator>> playerGeneratorMap = new HashMap<>();
Any help would be greatly appreciated.
You can use an AtomicReference to hold the return value and assign the wanted Generator to it inside the lambda expression. Ordinary local variables can't be reassigned in lambdas; only final or effectively final variables can be used inside them.
This function should solve the problem :)
public Generator getGeneratorFromLocation(final Location location) {
    AtomicReference<Generator> retVal = new AtomicReference<>(null);
    playerGeneratorMap.values().stream().forEach(generators -> {
        generators.forEach(generator -> {
            if (generator.getGenLocation().equals(location)) {
                retVal.set(generator);
            }
        });
    });
    return retVal.get();
}
By the way, a stream is unnecessary here, because you can call Collection.forEach instead of Stream.forEach. Streams are meant for more 'exotic' kinds of iteration such as filter, anyMatch, allMatch and reduce; you can read about the Streams API on Oracle's website.
I'll link the docs for future use; they are important for functional programming.
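For the stream-based lookup the question asks about, a minimal sketch could flatten the map's lists with flatMap, keep the matching elements with filter, and stop at the first hit with findFirst. The Location and Generator classes below are stand-ins written only for this example, since the real ones are not shown in the question, and GeneratorRegistry and findGeneratorAt are hypothetical names.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.Optional;
import java.util.UUID;

/** Stand-in for the question's Location type, which is not shown in the post. */
final class Location {
    private final int x, y, z;
    Location(int x, int y, int z) { this.x = x; this.y = y; this.z = z; }
    @Override public boolean equals(Object o) {
        if (!(o instanceof Location)) return false;
        Location other = (Location) o;
        return x == other.x && y == other.y && z == other.z;
    }
    @Override public int hashCode() { return Objects.hash(x, y, z); }
}

/** Stand-in for the question's Generator type. */
final class Generator {
    private final String name;
    private final Location genLocation;
    Generator(String name, Location genLocation) {
        this.name = name;
        this.genLocation = genLocation;
    }
    String getName() { return name; }
    Location getGenLocation() { return genLocation; }
}

final class GeneratorRegistry {
    final Map<UUID, List<Generator>> playerGeneratorMap = new HashMap<>();

    /** Returns the first generator at the given location, or an empty Optional. */
    Optional<Generator> findGeneratorAt(final Location location) {
        return playerGeneratorMap.values().stream()
                .flatMap(List::stream)                            // one stream over all generators
                .filter(g -> g.getGenLocation().equals(location)) // keep only matching ones
                .findFirst();                                     // stop at the first match
    }

    /** Same contract as the loop in the question: null when nothing matches. */
    Generator getGeneratorFromLocation(final Location location) {
        return findGeneratorAt(location).orElse(null);
    }

    public static void main(String[] args) {
        GeneratorRegistry registry = new GeneratorRegistry();
        Location spot = new Location(1, 2, 3);
        registry.playerGeneratorMap.put(UUID.randomUUID(), List.of(new Generator("gen-a", spot)));
        System.out.println(registry.findGeneratorAt(spot).map(Generator::getName).orElse("none"));
    }
}
```

Unlike the AtomicReference version, this stops at the first match instead of visiting every element, and returning an Optional lets the caller handle the "not found" case without a null check.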

Flink Count of Events using metric

I have a Kafka topic where I receive multiple types of events in JSON format. I have created a StreamingFileSink to write these events to S3 with bucketing.
FlinkKafkaConsumer errorTopicConsumer = new FlinkKafkaConsumer(ERROR_KAFKA_TOPICS,
    new SimpleStringSchema(),
    properties);
final StreamingFileSink<Object> errorSink = StreamingFileSink
    .forRowFormat(new Path(outputPath + "/error"), new SimpleStringEncoder<>("UTF-8"))
    .withBucketAssigner(new EventTimeBucketAssignerJson())
    .build();
env.addSource(errorTopicConsumer)
    .name("error_source")
    .setParallelism(1)
    .addSink(errorSink)
    .name("error_sink").setParallelism(1);
public class EventTimeBucketAssignerJson implements BucketAssigner<Object, String> {
    @Override
    public String getBucketId(Object record, Context context) {
        StringBuffer partitionString = new StringBuffer();
        Tuple3<String, Long, String> tuple3 = (Tuple3<String, Long, String>) record;
        try {
            partitionString.append("event_name=")
                .append(tuple3.f0).append("/");
            String timePartition = TimeUtils.getEventTimeDayPartition(tuple3.f1);
            partitionString.append(timePartition);
        } catch (Exception e) {
            partitionString.append("year=").append(Constants.DEFAULT_YEAR).append("/")
                .append("month=").append(Constants.DEFAULT_MONTH).append("/")
                .append("day=").append(Constants.DEFAULT_DAY);
        }
        return partitionString.toString();
    }
    @Override
    public SimpleVersionedSerializer<String> getSerializer() {
        return SimpleVersionedStringSerializer.INSTANCE;
    }
}
Now I want to publish an hourly count of each event as a metric to Prometheus and build a Grafana dashboard on top of it.
How can I compute the hourly count for each event using Flink metrics and publish it to Prometheus?
Thanks
Normally, this is done by simply creating a counter for requests and then using the rate() function in Prometheus, which gives you the rate of requests over the given time range.
If you want to do this on your own for some reason, you can do something similar to what is done in org.apache.kafka.common.metrics.stats.Rate. In that case you would keep a list of samples together with the time at which each was collected, along with the window size you want to use for the calculation. To compute the value, remove the samples that have expired (fallen out of the window) and count how many samples remain in the window.
You could then set the Gauge to the calculated value.
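The do-it-yourself approach described above could look roughly like the sketch below: keep the timestamp of every event, drop the ones that have fallen out of the window, and report how many remain. SlidingWindowCounter is a hypothetical helper written for this example, not a Flink or Kafka class, and it assumes that timestamps are recorded in non-decreasing order. For an hourly count per event type you would keep one instance per event name with a window of 3,600,000 ms.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical helper: counts the events recorded within the last windowMillis milliseconds. */
final class SlidingWindowCounter {
    private final long windowMillis;
    // Timestamps of the events still inside the window, oldest first.
    private final Deque<Long> timestamps = new ArrayDeque<>();

    SlidingWindowCounter(long windowMillis) {
        if (windowMillis <= 0) {
            throw new IllegalArgumentException("windowMillis must be positive");
        }
        this.windowMillis = windowMillis;
    }

    /** Records one event observed at nowMillis (assumed non-decreasing across calls). */
    synchronized void record(long nowMillis) {
        timestamps.addLast(nowMillis);
        evictExpired(nowMillis);
    }

    /** Returns the number of events whose timestamp is later than nowMillis - windowMillis. */
    synchronized int count(long nowMillis) {
        evictExpired(nowMillis);
        return timestamps.size();
    }

    // Removes samples that have fallen out of the window ending at nowMillis.
    private void evictExpired(long nowMillis) {
        while (!timestamps.isEmpty() && timestamps.peekFirst() <= nowMillis - windowMillis) {
            timestamps.removeFirst();
        }
    }

    public static void main(String[] args) {
        SlidingWindowCounter counter = new SlidingWindowCounter(3_600_000L); // one hour
        long now = System.currentTimeMillis();
        counter.record(now);
        counter.record(now);
        System.out.println("events in the last hour: " + counter.count(now));
    }
}
```

The value returned by count() is what the Gauge would report. Note that this keeps one timestamp per event in memory, so for high-volume topics the plain counter combined with a Prometheus query is likely the cheaper option.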

Creating Custom Windowing Function in Apache Beam

I have a Beam pipeline that starts off with reading multiple text files where each line in a file represents a row that gets inserted into Bigtable later in the pipeline. The scenario requires confirming that the count of rows extracted from each file & count of rows later inserted into Bigtable match. For this I am planning to develop a custom Windowing strategy so that lines from a single file get assigned to a single window based on the file name as the key that will be passed to the Windowing function.
Is there any code sample for creating custom Windowing functions?
Although I changed my strategy for confirming the inserted number of rows, for anyone who is interested in windowing elements read from a batch source e.g. FileIO in a batch job, here's the code for creating a custom windowing strategy:
public class FileWindows extends PartitioningWindowFn<Object, IntervalWindow> {
    private static final long serialVersionUID = -476922142925927415L;
    private static final Logger LOG = LoggerFactory.getLogger(FileWindows.class);
    @Override
    public IntervalWindow assignWindow(Instant timestamp) {
        Instant end = new Instant(timestamp.getMillis() + 1);
        IntervalWindow interval = new IntervalWindow(timestamp, end);
        LOG.info("FileWindows >> assignWindow(): Window assigned with Start: {}, End: {}", timestamp, end);
        return interval;
    }
    @Override
    public boolean isCompatible(WindowFn<?, ?> other) {
        return this.equals(other);
    }
    @Override
    public void verifyCompatibility(WindowFn<?, ?> other) throws IncompatibleWindowException {
        if (!this.isCompatible(other)) {
            throw new IncompatibleWindowException(other, String.format("Only %s objects are compatible.", FileWindows.class.getSimpleName()));
        }
    }
    @Override
    public Coder<IntervalWindow> windowCoder() {
        return IntervalWindow.getCoder();
    }
}
and then it can be used in the pipeline as below:
p
    .apply("Assign_Timestamp_to_Each_Message", ParDo.of(new AssignTimestampFn()))
    .apply("Assign_Window_to_Each_Message", Window.<KV<String,String>>into(new FileWindows())
        .withAllowedLateness(Duration.standardMinutes(1))
        .discardingFiredPanes());
Please keep in mind that you will need to write the AssignTimestampFn() so that each message carries a timestamp.

How to combine two publishers into one in Spring Reactor

I have implemented a dummy reactive repository but I am struggling with the update method:
@Override
public Mono<User> updateUser(int id, Mono<User> updateMono) {
    return //todo with getUser
}
@Override
public Mono<User> getUser(int id) {
    return Mono.justOrEmpty(this.users.get(id));
}
On the one hand I have the incoming publisher Mono<User> updateMono; on the other hand I have another publisher from Mono.justOrEmpty(this.users.get(id)).
How do I combine them, perform the update, and return just one publisher?
The only thing that comes to my mind is:
@Override
public Mono<User> updateUser(int id, Mono<User> updateMono) {
    return getUser(id).doOnNext(user -> {
        updateMono.subscribe(update -> {
            users.put(id, new User(id, update.getName(), update.getAge()));
            System.out.format("Updated user with id %d to %s%n", id, update);
        });
    });
}
Is it correct?
See the reference guide on finding the right operator
Notably, for Mono you have and, when, then (note this last one will become flatMap in 3.1.0, and flatmap will become flatMapMany)
doOnNext is meant for side operations such as logging or gathering stats. Subscribing inside another subscription is also bad form; generally you want flatMap or a similar operator instead.
I have been playing with the Spring 5 reactive features these days and have written some sample code (not yet published via blog or Twitter; I still need more practice with Reactor).
I ran into the same problem and ended up using Mono.zip to update the existing item in MongoDB.
https://github.com/hantsy/spring-reactive-sample/blob/master/boot-routes/src/main/java/com/example/demo/DemoApplication.java
public Mono<ServerResponse> update(ServerRequest req) {
    return Mono
        .zip(
            (data) -> {
                Post p = (Post) data[0];
                Post p2 = (Post) data[1];
                p.setTitle(p2.getTitle());
                p.setContent(p2.getContent());
                return p;
            },
            this.posts.findById(req.pathVariable("id")),
            req.bodyToMono(Post.class)
        )
        .cast(Post.class)
        .flatMap(post -> this.posts.save(post))
        .flatMap(post -> ServerResponse.noContent().build());
}
Update: Another working version written in Kotlin.
fun update(req: ServerRequest): Mono<ServerResponse> {
    return this.posts.findById(req.pathVariable("id"))
        .and(req.bodyToMono(Post::class.java))
        .map { it.t1.copy(title = it.t2.title, content = it.t2.content) }
        .flatMap { this.posts.save(it) }
        .flatMap { noContent().build() }
}
