Using custom DataFlow unbounded source on DirectPipelineRunner - google-cloud-dataflow

I'm writing a custom Dataflow unbounded data source that reads from Kafka 0.8. I'd like to run it locally using the DirectPipelineRunner. However, I'm getting the following stack trace:
Exception in thread "main" java.lang.IllegalStateException: no evaluator registered for Read(KafkaDataflowSource)
at com.google.cloud.dataflow.sdk.runners.DirectPipelineRunner$Evaluator.visitTransform(DirectPipelineRunner.java:700)
at com.google.cloud.dataflow.sdk.runners.TransformTreeNode.visit(TransformTreeNode.java:219)
at com.google.cloud.dataflow.sdk.runners.TransformTreeNode.visit(TransformTreeNode.java:215)
at com.google.cloud.dataflow.sdk.runners.TransformHierarchy.visit(TransformHierarchy.java:102)
at com.google.cloud.dataflow.sdk.Pipeline.traverseTopologically(Pipeline.java:252)
at com.google.cloud.dataflow.sdk.runners.DirectPipelineRunner$Evaluator.run(DirectPipelineRunner.java:662)
at com.google.cloud.dataflow.sdk.runners.DirectPipelineRunner.run(DirectPipelineRunner.java:374)
at com.google.cloud.dataflow.sdk.runners.DirectPipelineRunner.run(DirectPipelineRunner.java:87)
at com.google.cloud.dataflow.sdk.Pipeline.run(Pipeline.java:174)
This makes some sense, as I haven't registered an evaluator for my custom source anywhere.
Reading https://github.com/GoogleCloudPlatform/DataflowJavaSDK, it seems that only evaluators for bounded sources are registered. What's the recommended way to define and register an evaluator for a custom unbounded source?

DirectPipelineRunner currently runs over bounded input only. We are actively working on removing this restriction, and expect to release that support shortly.
In the meantime, for testing purposes you can trivially turn any UnboundedSource into a BoundedSource by using withMaxNumRecords, as in the following example:
UnboundedSource<String, ?> unboundedSource = ...; // make a Kafka source
PCollection<String> boundedKafkaCollection =
    p.apply(Read.from(unboundedSource).withMaxNumRecords(10));
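Put together, a minimal local test harness might look like the sketch below. KafkaDataflowSource and its create(...) factory are placeholders standing in for your own source class; only the SDK classes are real:

import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.Read;
import com.google.cloud.dataflow.sdk.options.PipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.runners.DirectPipelineRunner;
import com.google.cloud.dataflow.sdk.values.PCollection;

public class KafkaLocalTest {
  public static void main(String[] args) {
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    options.setRunner(DirectPipelineRunner.class); // run locally
    Pipeline p = Pipeline.create(options);

    // Placeholder: construct your custom unbounded Kafka source however you normally do.
    KafkaDataflowSource unboundedSource = KafkaDataflowSource.create("localhost:9092", "my-topic");

    // Capping the read turns it into a bounded source that DirectPipelineRunner can evaluate.
    PCollection<String> records =
        p.apply(Read.from(unboundedSource).withMaxNumRecords(10));

    p.run();
  }
}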
See this issue on GitHub for more details.
Separately, there are several efforts underway to contribute a Kafka connector. You may want to engage with us and other contributors about that via our GitHub repository.

Related

Source stream outside SCDF

I have read the documentation, but the following scenario is not clear to me:
the producer (source) is outside SCDF, while the processor and sink are inside.
In all the examples provided, all three components are inside.
From my point of view, there are two possible solutions:
The producer outside SCDF produces messages to the topic configured in SCDF.
There is another binder outside SCDF, and the processor/sink connect to that external binder.
If somebody could provide a sample, it would be very useful.
It is not entirely clear from your question what you mean by “outside SCDF”. I assume you are referring to existing code that you want to use as the source for an SCDF stream. It may already produce messages using a supported messaging middleware (for example, it writes to a Kafka topic), or it can be modified to do so, but for some reason you cannot use SCDF to manage its deployment. The simplest way to handle this is to use the source topic as a named destination in your stream definition: :my-topic > processor | sink
https://dataflow.spring.io/docs/feature-guides/streams/named-destinations/
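For illustration, using the topic name from above and hypothetical app names, the stream could be created from the SCDF shell roughly like this:

dataflow:> stream create --name external-source-stream --definition ":my-topic > my-processor | log" --deploy

Your existing producer then simply keeps writing to my-topic on the messaging middleware that the stream's binder points at; SCDF never needs to manage or even know about that producer.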

Inspecting port data in real time

Is there any recommended way to inspect/plot the numeric values that are being sent through the ports between Drake systems in real time (something similar to rqt_plot in ROS)? Apart from the SignalLogger, or writing and wiring up custom individual plotting Systems, is there any method to access the port values internally?
There's nothing as nice as rqt_plot as far as I know.
If you are able to alter your Diagram before calling DiagramBuilder::Build, you could add an LcmScopeSystem onto any vector-valued output port and then the port's contents will be transmitted on an LCM channel. You can add multiple scopes, but you currently have to add them one by one, ahead of time.
Once the data is on an LCM channel, you can use the provided drake-lcm-spy program, which can show (very rudimentary) live plots:
cd drake
bazel build //lcmtypes:drake-lcm-spy
bazel-bin/lcmtypes/drake-lcm-spy &
Also tangentially related would be https://github.com/RobotLocomotion/drake/issues/5857, though that is not on any near-term roadmap.

Rationale behind appending versions as Service/Deployment name on k8s with spring cloud skipper

I am kind of new to the Spring Cloud Data Flow world, and while playing around with the framework I see that if I have a stream 'test-stream' with one application called 'app', deploying it to Kubernetes via Skipper creates a pod/deployment and service on Kubernetes named
test-stream-app-v1.
My question is: why do we need v1 in the service/deployment names on K8s? What role does it play in the overall workflow when using Spring Cloud Data Flow?
------ Follow-up ------
Just wanted to confirm a few points to make sure I am on the right track in understanding the flow.
My understanding is that with a traditional stream (bound through Kafka topics), the Service object on Kubernetes does not play a significant role.
The rolling-update (red/black) pattern is implemented in Skipper in the following way, and versioning in the deployment/service name plays its part there:
Let's assume that the app-v1 deployment already exists and an upgrade is requested. Skipper creates the app-v2 deployment and waits for it to be ready. Once it is ready, Skipper destroys app-v1.
If my understanding above is right, I have the following follow-up questions:
I see that Skipper can deploy any package (it does not have to be a traditional stream). Is that the longer-term plan, or is Skipper only intended to work with spring-cloud-dataflow streams?
In the case of a non-traditional stream package, where a package has multiple apps (REST microservices) in a group, how will this model of versioning work? When I want to call one microservice from another, I cannot possibly know (or it is less than ideal to have to know) the release version of the app.
@Anand, congrats on the first post!
The naming convention follows the idea that each stream application is "versioned" when Skipper is used with SCDF. The version gets bumped when, as a user, you rolling-upgrade or rolling-downgrade the streaming-application versions or the application-specific properties, either on demand or via CI/CD automation.
It is very relevant for continuous-delivery and continuous-deployment workflows, and we provide native options in SCDF through the stream update .. and stream rollback .. commands respectively. For any of these operations, the applications are rolling-updated in K8s, and each action bumps the number in the application name. In your example, you'd see them as test-stream-app-v1, test-stream-app-v2, etc.
With all the historical versions in a central place (i.e., Skipper's database), you'd be able to interact with them via the stream history .. and stream manifest .. commands in SCDF.
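For illustration only (the property key and version values here are hypothetical), the shell flow behind those operations looks roughly like this:

dataflow:> stream update --name test-stream --properties "version.app=1.1.0.RELEASE"
dataflow:> stream history --name test-stream
dataflow:> stream rollback --name test-stream --releaseVersion 1

Each update or rollback rolls the application in K8s and bumps the suffix (test-stream-app-v2, test-stream-app-v3, and so on), while Skipper's database keeps the history that those commands operate on.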
To learn more about all this, watch this demo webinar (starting at ~41:25), and also have a look at the samples in the reference guide.
I hope this helps.

Logging directly to standard output

Where I work, we are migrating our entire infrastructure, which until now was based on monolithic services running directly on Windows/Linux VMs, to a Docker-based architecture orchestrated by Kubernetes.
One of the things that came to my mind is how we would handle logs in this new infrastructure.
Up until now, each app had its own way of handling logs: some used log4net/log4j to write to the file system, others wrote to GrayLog via a dedicated library.
The main problem I have with that is that one of the core ideas of programming micro-services in a Docker environment is that every service should assume as little as possible about the rest of the services or the platform.
So basically I was looking into how I can abstract the logging process away from the application and make it independent of the rest of the infrastructure.
One interesting thing I found is that you can write the logs to standard output (stdout) and then configure Kubernetes to pull these logs and direct them to centralised storage or a centralised logging server (like GrayLog): https://kubernetes.io/docs/concepts/cluster-administration/logging/
I have several concerns with this approach. For one, I haven't seen many companies doing it; the most popular logging solutions involve using a dedicated library to log to the file system.
I am also concerned about how it might impact performance: some languages block if you write to stdout, whereas with a standard logging library the logs are queued.
So what about services that output massive amounts of user-related logs?
I'm interested in what you think; I haven't seen this approach used widely, and maybe there is a reason for that.
Logging to whatever stream (file, stdout, GrayLog, ...) can be either synchronous (blocking) or asynchronous (non-blocking). Inherently, that has nothing to do with the medium you log to per se. It is true that using System.out.println in Java will result in heavy thread contention.
All the major logging frameworks (like log4j) provide you with the means to log asynchronously to whatever medium you like.
Your perception that not many companies do this is, I think, wrong. Logging to stdout and configuring your underlying architecture to forward logs somewhere is the de facto standard for PaaS/containerized applications.
So my tip is: log to stdout using a good logging framework that ensures asynchronous usage of the stream. Beyond that, you'll probably be fine.
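As a minimal sketch of that tip (the class here is illustrative; it assumes log4j2 on the classpath with a console appender, ideally wrapped in an async appender, configured in log4j2.xml):

import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

public class OrderService {
    // The framework decides where this ends up: a console appender writes to stdout,
    // and an async appender queues the event off the calling thread instead of
    // blocking on the stream the way a bare System.out.println would.
    private static final Logger log = LogManager.getLogger(OrderService.class);

    public void placeOrder(String orderId) {
        log.info("order placed id={}", orderId);
    }
}

Kubernetes then picks up the container's stdout exactly as described in the logging documentation you linked, so the application itself never has to know about GrayLog or the file system.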

Flume automatic scalability and failover

My company is considering using Flume for some fairly high-volume log processing. We believe that the log processing needs to be distributed, both for volume (scalability) and failover (reliability) reasons, and Flume seems the obvious choice.
However, we think we must be missing something obvious, because we don't see how Flume provides automatic scalability and failover.
I want to define a flow that says: for each log line, do thing A, then pass it along and do thing B, then pass it along and do thing C, and so on, which seems to match well with Flume. However, I want to be able to define this flow in purely logical terms and then basically say, "Hey Flume, here are the servers, here is the flow definition, go to work!". Servers will die (and ops will restart them), we will add servers to the cluster and retire others, and Flume will just direct the work to whatever nodes have available capacity.
This description is how Hadoop map-reduce implements scalability and failover, and I assumed that Flume would be the same. However, the documentation seems to imply that I need to manually configure which physical servers each logical node runs on, and configure specific failover scenarios for each node.
Am I right that Flume does not serve our purpose, or did I miss something?
Thanks for your help.
Depending on whether you are using multiple masters, you can code your configuration to follow a failover pattern.
This is fairly detailed in the guide: http://archive.cloudera.com/cdh/3/flume/UserGuide/index.html#_automatic_failover_chains
To answer your question bluntly: Flume does not yet have the ability to figure out a failover scheme automatically.
