Wrong Kafka topic names for Spring-Cloud-Function app deployed as part of Spring-Cloud-Data-Flow stream - spring-cloud-dataflow

I have a simple SCDF stream that looks like this:
http --port=12346 | mvmn-transform | file --name=tmp.txt --directory=/tmp
The mvmn-transform is a simple custom transformer that looks like this:
#SpringBootApplication
#EnableBinding(Processor.class)
#EnableConfigurationProperties(ScdfTestTransformerProperties.class)
#Configuration
public class ScdfTestTransformer {
public static void main(String args[]) {
SpringApplication.run(ScdfTestTransformer.class, args);
}
#Autowired
protected ScdfTestTransformerProperties config;
#Transformer(inputChannel = Processor.INPUT, outputChannel = Processor.OUTPUT)
public Object transform(Message<?> message) {
Object payload = message.getPayload();
Map<String, Object> result = new HashMap<>();
Map<String, String> headersStr = new HashMap<>();
message.getHeaders().forEach((k, v) -> headersStr.put(k, v != null ? v.toString() : null));
result.put("headers", headersStr);
result.put("payload", payload);
result.put("configProp", config.getSomeConfigProp());
return result;
}
// See https://stackoverflow.com/questions/59155689/could-not-decode-json-type-for-key-file-name-in-a-spring-cloud-data-flow-stream
#Bean("kafkaBinderHeaderMapper")
public KafkaHeaderMapper kafkaBinderHeaderMapper() {
BinderHeaderMapper mapper = new BinderHeaderMapper();
mapper.setEncodeStrings(true);
return mapper;
}
}
This works fine.
But I've read that Spring Cloud Function should allow me to implement such apps without a necessity to specify binding and transformer annotations, so I've changed it to this:
#SpringBootApplication
// #EnableBinding(Processor.class)
#EnableConfigurationProperties(ScdfTestTransformerProperties.class)
#Configuration
public class ScdfTestTransformer {
public static void main(String args[]) {
SpringApplication.run(ScdfTestTransformer.class, args);
}
#Autowired
protected ScdfTestTransformerProperties config;
// #Transformer(inputChannel = Processor.INPUT, outputChannel = Processor.OUTPUT)
#Bean
public Function<Message<?>, Map<String, Object>> transform(
// Message<?> message
) {
return message -> {
Object payload = message.getPayload();
Map<String, Object> result = new HashMap<>();
Map<String, String> headersStr = new HashMap<>();
message.getHeaders().forEach((k, v) -> headersStr.put(k, v != null ? v.toString() : null));
result.put("headers", headersStr);
result.put("payload", payload);
result.put("configProp", "Config prop val: " + config.getSomeConfigProp());
return result;
};
}
// See https://stackoverflow.com/questions/59155689/could-not-decode-json-type-for-key-file-name-in-a-spring-cloud-data-flow-stream
#Bean("kafkaBinderHeaderMapper")
public KafkaHeaderMapper kafkaBinderHeaderMapper() {
BinderHeaderMapper mapper = new BinderHeaderMapper();
mapper.setEncodeStrings(true);
return mapper;
}
}
And now I have a problem - SCDF source and target topic names are ignored by Spring-Cloud-Function apparently, and topics transform-in-0 and transform-out-0 are created instead.
SCDF creates topics that have names like <stream-name>.<app-name> eg something like TestStream123.http and TestStream123.mvmn-transform
Previously they were used for transformer - as it should be, since it is a part of the SCDF stream. But now they are ignored by Spring-Cloud-Function and transform-in-0 and transform-out-0 are created instead.
Thus my transformer no longer receives any input, as it expects it on a wrong Kafka topic. And would probably produce no output to the stream as well, since it outputs to the wrong Kafka topic also.
P.S. Just in case, full project code on GitHub: https://github.com/mvmn/scdftest-transformer/tree/scfunc
In order to run locally start up Kafka, Skipper, SCDF and SCDF console, do mvn clean install in the app folder and then do app register --name mvmn-transform-1 --type processor --uri maven://x.mvmn.study.scdf.scdftest:scdftest-transformer:0.1.1-SNAPSHOT --metadata-uri maven://x.mvmn.study.scdf.scdftest:scdftest-transformer:0.1.1-SNAPSHOTin the coonsole. Then you can deploy stream using definition http --port=12346 | mvmn-transform | file --name=tmp.txt --directory=/tmp

Since you are using the functional model of writing Spring Cloud Stream applications, when you deploy this app, you need to pass two properties on the custom processor to restore the Spring Cloud Data Flow behavior.
spring.cloud.stream.function.bindings.transform-in-0=input
spring.cloud.stream.function.bindings.transform-out-0=output
Can you try that and see if that makes a difference?

Related

Dependency Injection in Apache Storm topology

Little background: I am working on a topology using Apache Storm, I thought why not use dependency injection in it, but I was not sure how it will behave on cluster environment when topology deployed to cluster. I started looking for answers on if DI is good option to use in Storm topologies, I came across some threads about Apache Spark where it was mentioned serialization is going to be problem and saw some responses for apache storm along the same lines. So finally I decided to write a sample topology with google guice to see what happens.
I wrote a sample topology with two bolts, and used google guice to injects dependencies. First bolt emits a tick tuple, then first bolt creates message, bolt prints the message on log and call some classes which does the same. Then this message is emitted to second bolt and same printing logic there as well.
First Bolt
public class FirstBolt extends BaseRichBolt {
private OutputCollector collector;
private static int count = 0;
private FirstInjectClass firstInjectClass;
#Override
public void prepare(Map map, TopologyContext topologyContext, OutputCollector outputCollector) {
collector = outputCollector;
Injector injector = Guice.createInjector(new Module());
firstInjectClass = injector.getInstance(FirstInjectClass.class);
}
#Override
public void execute(Tuple tuple) {
count++;
String message = "Message count "+count;
firstInjectClass.printMessage(message);
log.error(message);
collector.emit("TO_SECOND_BOLT", new Values(message));
collector.ack(tuple);
}
#Override
public void declareOutputFields(OutputFieldsDeclarer outputFieldsDeclarer) {
outputFieldsDeclarer.declareStream("TO_SECOND_BOLT", new Fields("MESSAGE"));
}
#Override
public Map<String, Object> getComponentConfiguration() {
Config conf = new Config();
conf.put(Config.TOPOLOGY_TICK_TUPLE_FREQ_SECS, 10);
return conf;
}
}
Second Bolt
public class SecondBolt extends BaseRichBolt {
private OutputCollector collector;
private SecondInjectClass secondInjectClass;
#Override
public void prepare(Map map, TopologyContext topologyContext, OutputCollector outputCollector) {
collector = outputCollector;
Injector injector = Guice.createInjector(new Module());
secondInjectClass = injector.getInstance(SecondInjectClass.class);
}
#Override
public void execute(Tuple tuple) {
String message = (String) tuple.getValue(0);
secondInjectClass.printMessage(message);
log.error("SecondBolt {}",message);
collector.ack(tuple);
}
#Override
public void declareOutputFields(OutputFieldsDeclarer outputFieldsDeclarer) {
}
}
Class in which dependencies are injected
public class FirstInjectClass {
FirstInterface firstInterface;
private final String prepend = "FirstInjectClass";
#Inject
public FirstInjectClass(FirstInterface firstInterface) {
this.firstInterface = firstInterface;
}
public void printMessage(String message){
log.error("{} {}", prepend, message);
firstInterface.printMethod(message);
}
}
Interface used for binding
public interface FirstInterface {
void printMethod(String message);
}
Implementation of interface
public class FirstInterfaceImpl implements FirstInterface{
private final String prepend = "FirstInterfaceImpl";
public void printMethod(String message){
log.error("{} {}", prepend, message);
}
}
Same way another class that receives dependency via DI
public class SecondInjectClass {
SecondInterface secondInterface;
private final String prepend = "SecondInjectClass";
#Inject
public SecondInjectClass(SecondInterface secondInterface) {
this.secondInterface = secondInterface;
}
public void printMessage(String message){
log.error("{} {}", prepend, message);
secondInterface.printMethod(message);
}
}
another interface for binding
public interface SecondInterface {
void printMethod(String message);
}
implementation of second interface
public class SecondInterfaceImpl implements SecondInterface{
private final String prepend = "SecondInterfaceImpl";
public void printMethod(String message){
log.error("{} {}", prepend, message);
}
}
Module Class
public class Module extends AbstractModule {
#Override
protected void configure() {
bind(FirstInterface.class).to(FirstInterfaceImpl.class);
bind(SecondInterface.class).to(SecondInterfaceImpl.class);
}
}
Nothing fancy here, just two bolts and couple of classes for DI. I deployed it on server and it works just fine. The catch/problem though is that I have to initialize Injector in each bolt which makes me question what is side effect of it going to be?
This implementation is simple, just 2 bolts.. what if I have more bolts? what impact it would create on topology if I have to initialize Injector in all bolts?
If I try to initialize Injector outside prepare method I get error for serialization.

How to configure Micronaut and Micrometer to write ILP directly to InfluxDB?

I have a Micronaut application that uses Micrometer to report metrics to InfluxDB with the micronaut-micrometer project. Currently it is using the Statsd Registry provided via the io.micronaut.configuration:micronaut-micrometer-registry-statsd dependency.
I would like to instead output metrics in Influx Line Protocol (ILP), but the micronaut-micrometer project does not offer an Influx Registry currently. I tried to work around this by importing the io.micrometer:micrometer-registry-influx dependency and configuring an InfluxMeterRegistry manually like this:
#Factory
public class MyMetricRegistryConfigurer implements MeterRegistryConfigurer {
#Bean
#Primary
#Singleton
public MeterRegistry getMeterRegistry() {
InfluxConfig config = new InfluxConfig() {
#Override
public Duration step() {
return Duration.ofSeconds(10);
}
#Override
public String db() {
return "metrics";
}
#Override
public String get(String k) {
return null; // accept the rest of the defaults
}
};
return new InfluxMeterRegistry(config, Clock.SYSTEM);
}
#Override
public boolean supports(MeterRegistry meterRegistry) {
return meterRegistry instanceof InfluxMeterRegistry;
}
}
When the application runs, the metrics are exposed on my /metrics endpoint as I would expect, but nothing gets written to InfluxDB. I confirmed that my local InfluxDB accepts metrics at the expected localhost:8086/write?db=metrics endpoint using curl. Can anyone give me some pointers to get this working? I'm wondering if I need to manually define a reporter somewhere...
After playing around for a bit, I got this working with the following code:
#Factory
public class InfluxMeterRegistryFactory {
#Bean
#Singleton
#Requires(property = MeterRegistryFactory.MICRONAUT_METRICS_ENABLED, value =
StringUtils.TRUE, defaultValue = StringUtils.TRUE)
#Requires(beans = CompositeMeterRegistry.class)
public InfluxMeterRegistry getMeterRegistry() {
InfluxConfig config = new InfluxConfig() {
#Override
public Duration step() {
return Duration.ofSeconds(10);
}
#Override
public String db() {
return "metrics";
}
#Override
public String get(String k) {
return null; // accept the rest of the defaults
}
};
return new InfluxMeterRegistry(config, Clock.SYSTEM);
}
}
I also noticed that an InfluxMeterRegistry will be available out of the box in the future for micronaut-micrometer as of v1.2.0.

Apache Beam Coder for GenericRecord

I am building a pipeline that reads Avro generic records. To pass GenericRecord between stages I need to register AvroCoder. The documentation says that if I use generic record, the schema argument can be arbitrary: https://beam.apache.org/releases/javadoc/2.2.0/org/apache/beam/sdk/coders/AvroCoder.html#of-java.lang.Class-org.apache.avro.Schema-
However, when I pass an empty schema to the method AvroCoder.of(Class, Schema) it throws an exception at run time. Is there a way to create an AvroCoder for GenericRecord that does not require a schema? In my case, each GenericRecord has an embedded schema.
The exception and stacktrace:
Exception in thread "main" java.lang.NullPointerException
at org.apache.beam.sdk.coders.AvroCoder$AvroDeterminismChecker.checkIndexedRecord(AvroCoder.java:562)
at org.apache.beam.sdk.coders.AvroCoder$AvroDeterminismChecker.recurse(AvroCoder.java:430)
at org.apache.beam.sdk.coders.AvroCoder$AvroDeterminismChecker.check(AvroCoder.java:409)
at org.apache.beam.sdk.coders.AvroCoder.<init>(AvroCoder.java:260)
at org.apache.beam.sdk.coders.AvroCoder.of(AvroCoder.java:141)
I had a similar case and solved it with custom coder. The simplest (but sub-efficient) solution would be to encode schema along with each record. If your schemas are not too volatile you can get benefit of caching.
public class GenericRecordCoder extends AtomicCoder<GenericRecord> {
public static GenericRecordCoder of() {
return new GenericRecordCoder();
}
private static final ConcurrentHashMap<String, AvroCoder<GenericRecord>> avroCoders = new ConcurrentHashMap<>();
#Override
public void encode(GenericRecord value, OutputStream outStream) throws IOException {
String schemaString = value.getSchema().toString();
String schemaHash = getHash(schemaString);
StringUtf8Coder.of().encode(schemaString, outStream);
StringUtf8Coder.of().encode(schemaHash, outStream);
AvroCoder<GenericRecord> coder = avroCoders.computeIfAbsent(schemaHash,
s -> AvroCoder.of(value.getSchema()));
coder.encode(value, outStream);
}
#Override
public GenericRecord decode(InputStream inStream) throws IOException {
String schemaString = StringUtf8Coder.of().decode(inStream);
String schemaHash = StringUtf8Coder.of().decode(inStream);
AvroCoder<GenericRecord> coder = avroCoders.computeIfAbsent(schemaHash,
s -> AvroCoder.of(new Schema.Parser().parse(schemaString)));
return coder.decode(inStream);
}
}
While this solves the task, in fact I made it slightly different, using external schema registry (you can build this on the top of datastore for example). In this case you don't need to serialize/deserialize schema. The code looks like:
public class GenericRecordCoder extends AtomicCoder<GenericRecord> {
public static GenericRecordCoder of() {
return new GenericRecordCoder();
}
private static final ConcurrentHashMap<String, AvroCoder<GenericRecord>> avroCoders = new ConcurrentHashMap<>();
#Override
public void encode(GenericRecord value, OutputStream outStream) throws IOException {
SchemaRegistry.registerIfAbsent(value.getSchema());
String schemaName = value.getSchema().getFullName();
StringUtf8Coder.of().encode(schemaName, outStream);
AvroCoder<GenericRecord> coder = avroCoders.computeIfAbsent(schemaName,
s -> AvroCoder.of(value.getSchema()));
coder.encode(value, outStream);
}
#Override
public GenericRecord decode(InputStream inStream) throws IOException {
String schemaName = StringUtf8Coder.of().decode(inStream);
AvroCoder<GenericRecord> coder = avroCoders.computeIfAbsent(schemaName,
s -> AvroCoder.of(SchemaRegistry.get(schemaName)));
return coder.decode(inStream);
}
}
The usage is pretty straightforward:
PCollection<GenericRecord> inputCollection = pipeline
.apply(AvroIO
.parseGenericRecords(t -> t)
.withCoder(GenericRecordCoder.of())
.from(...));
After reviewing the code for AvroCoder, I do not think the documentation is correct there. Your AvroCoder instance will need a way to figure out the schema for your Avro records - and likely the only way to do that is by providing one.
So, I'd recommend calling AvroCoder.of(GenericRecord.class, schema), where schema is the correct schema for the GenericRecord objects in your PCollection.

Element value based writing to Google Cloud Storage using Dataflow

I'm trying to build a dataflow process to help archive data by storing data into Google Cloud Storage. I have a PubSub stream of Event data which contains the client_id and some metadata. This process should archive all incoming events, so this needs to be a streaming pipeline.
I'd like to be able to handle archiving the events by putting each Event I receive inside a bucket that looks like gs://archive/client_id/eventdata.json . Is that possible to do within dataflow/apache beam, specifically being able to assign the file name differently for each Event in the PCollection?
EDIT:
So my code currently looks like:
public static class PerWindowFiles extends FileBasedSink.FilenamePolicy {
private String customerId;
public PerWindowFiles(String customerId) {
this.customerId = customerId;
}
#Override
public ResourceId windowedFilename(ResourceId outputDirectory, WindowedContext context, String extension) {
String filename = bucket+"/"+customerId;
return outputDirectory.resolve(filename, ResolveOptions.StandardResolveOptions.RESOLVE_FILE);
}
#Override
public ResourceId unwindowedFilename(
ResourceId outputDirectory, Context context, String extension) {
throw new UnsupportedOperationException("Unsupported.");
}
}
public static void main(String[] args) throws IOException {
DataflowPipelineOptions options = PipelineOptionsFactory.fromArgs(args)
.withValidation()
.as(DataflowPipelineOptions.class);
options.setRunner(DataflowRunner.class);
options.setStreaming(true);
Pipeline p = Pipeline.create(options);
PCollection<Event> set = p.apply(PubsubIO.readStrings()
.fromTopic("topic"))
.apply(new ConvertToEvent()));
PCollection<KV<String, Event>> events = labelEvents(set);
PCollection<KV<String, EventGroup>> sessions = groupEvents(events);
String customers = System.getProperty("CUSTOMERS");
JSONArray custList = new JSONArray(customers);
for (Object cust : custList) {
if (cust instanceof String) {
String customerId = (String) cust;
PCollection<KV<String, EventGroup>> custCol = sessions.apply(new FilterByCustomer(customerId));
stringifyEvents(custCol)
.apply(TextIO.write()
.to("gs://archive/")
.withFilenamePolicy(new PerWindowFiles(customerId))
.withWindowedWrites()
.withNumShards(3));
} else {
LOG.info("Failed to create TextIO: customerId was not String");
}
}
p.run()
.waitUntilFinish();
}
This code is ugly because I need to redeploy every time a new client happens in order to be able to save their data. I would prefer to be able to assign customer data to an appropriate bucket dynamically.
"Dynamic destinations" - choosing the file name based on the elements being written - will be a new feature available in Beam 2.1.0, which has not yet been released.

Mbean registered but not found in mbean Server

I have a problem about the mbeans. I have created a simple mbean and I have registered it on the default mBeanServer that is run (Via eclipse or java -jar mbean.jar) and in the same process if I try to fouund the mbean registered with a simple query:
for (ObjectInstance instance : mbs.queryMBeans(ObjectNameMbean, null)) {
System.out.println(instance.toString());
}
the query retuerns my mbean, but if I start another process and try to search this mbean registered the mbeas is not found! why?
The approch is : (Process that is running)
public static void main(String[] args) throws Exception
{
MBeanServer mbeanServer =ManagementFactory.getPlatformMBeanServer();
ObjectName objectName = new ObjectName(ObjectNameMbean);
Simple simple = new Simple (1, 0);
mbeanServer.registerMBean(simple, objectName);
while (true)
{
wait (Is this necessary?)
}
}
So this is the first process that is running (that has the only pourpose to registry the mbean, because there is another process that want to read these informations.
So I start another process to search this mbean but nothing.
I 'm not using jboss but the local Java virtual Machine but my scope is to deploy this simple application in one ejb (autostart) and another ejb will read all informations.
All suggestions are really apprecciated.
This example should be more useful :
Object Hello:
public class Hello implements HelloMBean {
public void sayHello() {
System.out.println("hello, world");
}
public int add(int x, int y) {
return x + y;
}
public String getName() {
return this.name;
}
public int getCacheSize() {
return this.cacheSize;
}
public synchronized void setCacheSize(int size) {
this.cacheSize = size;
System.out.println("Cache size now " + this.cacheSize);
}
private final String name = "Reginald";
private int cacheSize = DEFAULT_CACHE_SIZE;
private static final int DEFAULT_CACHE_SIZE = 200;
}
Interface HelloBean (implemented by Hello)
public interface HelloMBean {
public void sayHello();
public int add(int x, int y);
public String getName();
public int getCacheSize();
public void setCacheSize(int size);
}
Simple Main
import java.lang.management.ManagementFactory;
import java.util.logging.Logger;
import javax.management.MBeanServer;
import javax.management.ObjectName;
public class Main {
static Logger aLog = Logger.getLogger("MBeanTest");
public static void main(String[] args) {
try{
MBeanServer mbs = ManagementFactory.getPlatformMBeanServer();
ObjectName name = new ObjectName("ApplicationDomain:type=Hello");
Hello mbean = new Hello();
mbs.registerMBean(mbean, name);
// System.out.println(mbs.getAttribute(name, "Name"));
aLog.info("Waiting forever...");
Thread.sleep(Long.MAX_VALUE);
}
catch(Exception x){
x.printStackTrace();
aLog.info("exception");
}
}
}
So now I have exported this project as jar file and run it as "java -jar helloBean.jar" and by eclipse I have modified the main class to read informations of this read (Example "Name" attribute) by using the same objectname used to registry it .
Main to read :
public static void main(String[] args) {
try{
MBeanServer mbs = ManagementFactory.getPlatformMBeanServer();
ObjectName name = new ObjectName("ApplicationDomain:type=Hello");
System.out.println(mbs.getAttribute(name, "Name"));
}
catch(Exception x){
x.printStackTrace();
aLog.info("exception");
}
}
But nothing, the bean is not found.
Project link : here!
Any idea?
I suspect the issue here is that you have multiple MBeanServer instances. You did not mention how you acquired the MBeanServer in each case, but in your second code sample, you are creating a new MBeanServer instance which may not be the same instance that other threads are reading from. (I assume this is all in one JVM...)
If you are using the platform agent, I recommend you acquire the MBeanServer using the ManagementFactory as follows:
MBeanServer mbs = java.lang.management.ManagementFactory.getPlatformMBeanServer() ;
That way, you will always get the same MBeanServer instance.

Resources