Source stream outside scdf - spring-cloud-dataflow

I have read the documentation and it's not clear for me the following scenario:
the consumer is outside scdf and the processor and sink are inside.
All the examples provided, the three components are inside.
From my point of view I think that there are two solutions:
The producer outside SCDF produces a message in the topic configured in SCDF
There is another binder outside SCDF and the processor/sink connects to this binder outside SCDF
If somebody could provide any sample it will be very useful

It is not entirely clear from your question what you mean by “outside scdf” . I assume you are referring to existing code that you want to use as a source for an scdf stream. It may already produce messages using a supported message middleware, for example it writes to a Kafka topic, or can be modified to do so but for some reason, you cannot use SCDF to manage its deployment. The simplest way to do this is to use the source topic as a named destination in your stream definition: :my-topic > processor | sink
https://dataflow.spring.io/docs/feature-guides/streams/named-destinations/

Related

Rational behind appending versions as Service/Deployment name on k8s with spring cloud skipper

I am kind of new the spring cloud dataflow world and while playing around with the framework, I see that if I have a stream = 'test-steram' with 1 application called 'app'. When I deploy using skipper to kubernetes, I see that It creates pod/deployment & service on kubernetes with name as
test-stream-app-v1.
My question is why do we need to have v1 in service/deployment names on k8s? What role does it play in the overall workflow using spring cloud dataflow?
------Follow up -----------
Just wanted to confirm few points to make sure i am on right track to understand the flow
My understanding is with traditional stream (bind through kafka topics) service (object on kubernetes) do not play a significant role.
Rolling Update (Red/Black) pattern has implemented in following way in skipper and versioning in deployment/service plays a role in following way.
Let's assume that app-v1 deployment already exists and upgrade is requested. Skipper creates app-v2 deployment and
wait for it to be ready. Once ready it destroys app-v1
If my above understanding is right I have following follow up questions...
I see that skipper can deploy and package (and it do not have to be a traditional stream) to work with. Is that the longer term plan or Skipper is only intended to work spring-cloud-dataflow streams?
In case of non-tradtional stream package, where an package has multiple apps(rest microservices) in a group, how this model of versioning will work? I mean when I want to call the microservice from other microservice, I cannot possibly know or less than ideal to know the release-version of the app?
#Anand. Congrats on the 1st post!
The naming convention goes by the idea that each of the stream application is "versioned" if Skipper is used with SCDF. The version gets bumped for when, as a user, when you rolling-upgrade and rolling-downgrade the streaming-application versions or the application-specific properties either on-demand or via CI/CD automation.
It is very relevant for continuous-delivery and continuous-deployment workflows, and we provide native options in SCDF through commands such as stream update .. and stream rollback .. respectively. For any of these operations, the applications will be rolling updated in K8s, and each action will bump the number in the application name. In your example, you'd see them as test-stream-app-v1, `test-stream-app-v2, etc.
With all the historical versions in a central place (i.e., Skipper's database), you'd be able to interact with them via stream history.. and stream manifest .. commands in SCDF.
To learn more about all this, watch this demo-webinar (starts # ~41.25), and also have a look at samples in the reference guide.
I hope this helps.

Logging directly to standard output

Where I work, we are migrating our entire infrastructure which was until now based on monolithic services that ran directly on a windows/linux VM to a docker based architecture that will be orchestrated by Kubernetes.
One of the things that came to my mind is how we would handle logs in this new infrastructure.
Up until now, each app had its own way of handling logs, some were using log4net/log4j to write to file system, some were writing to GrayLog via a dedicated library.
The main problem I have with that is that one of the core ideas of programming micro-services in a Docker environment is that every service should assume as little as possible about the rest of services or the platform.
So basically I was looking into how I can abstract the logging process from the application, make it independent from the rest of the infrastructure.
One interesting thing that I found was that you could write the logs to standard output (stdout) and then configure Kubernetes to pull these logs and direct them to a centralised storage or a centralised logging server (like GrayLog) https://kubernetes.io/docs/concepts/cluster-administration/logging/
I have several concerns with this approach, for once, I haven't seen too many companies that do it, most popular logging solutions are to use a dedicated library to log to filesystem.
I am also concerned about how it might impact performance, some languages block if you write to stdout, whereas when you use a standard logging library, the logs are queued.
So what about services that output massive amount of user related logs?
I was interested about what you think, I didn't see this approach used widely, maybe there is reason for that.
Logging to whatever stream (File, stdout, GrayLog...) can either be synchronous (blocking) or asynchronous (non-blocking). Inherently, that has nothing to do with the medium you log to per-se. It is true that using System.out.println in Java will result in heavy thread-contention.
All the major logging frameworks (like log4j) provide you with the means to log in an asynchronous fashion to every medium that you like.
Your perception of not many companies doing this I think is wrong. Logging to stdout and configuring your underlying architecture to forward logs somewhere is the defacto standard of all PaaS/containerized applications.
So my tip is going to be: Log to stdout using a good logging framework which ensures asynchronous usage of the stream. For the rest you'll probably be fine.

Collect metrics from dockers in Kubernetes

I work with Docker and Kubernetes.
I would like to collect application specific metrics from each Docker.
There are various applications, each running in one or more Dockers.
I would like to collect the metrics in JSON format in order to perform further processing on each type of metrics.
I am trying to understand what is the best practice, if any and what tools can I use to achieve my goal.
Currently I am looking into several options, none looks too good:
Connecting to kubectl, getting a list of pods, performing a command (exec) at each pod to cause the application to print/send JSON with metrics. I don't like this option as it means that I need to be aware to which pods exist and access each, while the whole point of having Kubernetes is to avoid dealing with this issue.
I am looking for Kubernetes API HTTP GET request that will allow me to pull a specific file.
The closest I found is GET /api/v1/namespaces/{namespace}/pods/{name}/log and it seems it is not quite what I need.
And again, it forces me to mention each pop by name.
I consider to use ExecAction in Probe to send JSON with metrics periodically. It is a hack (as this is not the purpose of Probe), but it removes the need to handle each specific pod
I can't use Prometheus for reasons that are out of my control but I wonder how Prometheus collects metric. Maybe I can use similar approach?
Any possible solution will be appreciated.
From an architectural point of view you have 2 options here:
1) pull model: your application exposes metrics data through a mechanisms (for instance using the HTTP protocol on a different port) and an external tool scrapes your pods at a timed interval (getting pod addresses from the k8s API); this is the model used by prometheus for instance.
2) push model: your application actively pushes metrics to an external server, tipically a time series database such as influxdb, when it is most relevant to it.
In my opinion, option 2 is the easiest to implement, because:
you don't need to deal with k8s APIs in order to discover pods addresses;
you don't need to create a local storage layer to store your metrics (because you push them one by one as they occour);
But there is a drawback: you need to be careful how you implement this, it could cause your API to become slower and you might need to deal with asynchronous calls to your metrics server.
This is obviously a very generic answer, but I hope it could point you in the right direction.
Pity you can not use Prometheus, but it's a good lead for what can be done in this scope. What Prom does is as follows :
1: it assumes that metrics you want to scrape (collect) are exposed with some http endpoint that Prometheus can access.
2: it connects to kubernetes api to perform a discovery of endpoints to scrape metrics from (there is a config for it, but generaly it means it has to be able to connect to the API and list services/deployments/pods and analyze their annotations (as they have info about metrics endpoints) to compose a list of places to scrape data from
3: periodicaly (15s, 60s etc.) it connects to the endpoints and collects the exposed metrics.
That's it. Rest is storage/postprocessing. The kube related part might be a significant amount of work to do though, so it would be way better to go with something that already exists.
Sidenote: while this is generaly a pull based model, there are cases where pull is not possible (vide short lived scripts like php), that is where Prometheus pushgateway comes into play to allow pushing metrics to a place where Prometheus will pull from.

Using custom DataFlow unbounded source on DirectPipelineRunner

I'm writing a custom DataFlow unbounded data source that reads from Kafka 0.8. I'd like to run it locally using the DirectPipelineRunner. However, I'm getting the following stackstrace:
Exception in thread "main" java.lang.IllegalStateException: no evaluator registered for Read(KafkaDataflowSource)
at com.google.cloud.dataflow.sdk.runners.DirectPipelineRunner$Evaluator.visitTransform(DirectPipelineRunner.java:700)
at com.google.cloud.dataflow.sdk.runners.TransformTreeNode.visit(TransformTreeNode.java:219)
at com.google.cloud.dataflow.sdk.runners.TransformTreeNode.visit(TransformTreeNode.java:215)
at com.google.cloud.dataflow.sdk.runners.TransformHierarchy.visit(TransformHierarchy.java:102)
at com.google.cloud.dataflow.sdk.Pipeline.traverseTopologically(Pipeline.java:252)
at com.google.cloud.dataflow.sdk.runners.DirectPipelineRunner$Evaluator.run(DirectPipelineRunner.java:662)
at com.google.cloud.dataflow.sdk.runners.DirectPipelineRunner.run(DirectPipelineRunner.java:374)
at com.google.cloud.dataflow.sdk.runners.DirectPipelineRunner.run(DirectPipelineRunner.java:87)
at com.google.cloud.dataflow.sdk.Pipeline.run(Pipeline.java:174)
Which makes some sense, as I haven't registered an evaluator for my custom source at any time.
Reading https://github.com/GoogleCloudPlatform/DataflowJavaSDK, it seems like only evaluators for bounded sources are registered. What's the recommended way to define and register an evaluator for an custom unbounded source?
DirectPipelineRunner currently runs over bounded input only. We are actively working on removing this restriction, and expect to release it shortly.
In the meanwhile, you can trivially turn any UnboundedSource into a BoundedSource, for testing purposes, by using withMaxNumRecords, as in the following example:
UnboundedSource<String> unboundedSource = ...; // make a Kafka source
PCollection<String> boundedKafkaCollection =
p.apply(Read.from(unboundedSource).withMaxNumRecords(10));
See this issue on GitHub for more details.
Separately, there are several efforts on contributing the Kafka connector. You may want to engage with us and other contributors about that via our GitHub repository.

Benefit of Apache Flume

I am new with Apache Flume.
I understand that Apache Flume can help transport data.
But I still fail to see the ultimate benefit offered by Apache Flume.
If I can configure a software or make a software to send which data goes where, why I need Flume?
Maybe someone can explain a situation that shows Apache Flume's benefit?
Reliable transmission (if you use the file channel):
Flume sends batches of small events. Every time it sends a batch to the next node it waits for acknowledgment before deleting. The storage in the file channel is optimized to allow recovery on crash.
I think the biggest benefit that you get out of flume is extensiblity. Basically all components starting from source, interceptor and sink, everything is extensible.
We use flume and read data using custom kakfa source, data is in the form of JSON we parse it in custom kafka source and then pass it on to HDFS sink. It working reliably in 5 of nodes. We extended only kafka source, HDFS sink functionality we got out the box.
At the same time, being from the Hadoop ecosystem, you get great community support and multiple options to use the tools in different ways.

Resources