Does data-flow-server and skipper have any active role after the deployment of the stream applications (other than maintaining state). Asked differently, say for example if I have an http source and amqp sink , does any of the traffic from http to amqp go through the data-flow-server or skipper?
Neither SCDF nor Skipper interact with or contribute towards streaming or batch processing. They are both responsible for the designing and the deployment of data pipelines made of streaming/batch applications. To do the designing and the deployment part, we provide a variety of tools, including UI, CLI, RESTful APIs, and Java DSL.
To say this differently, if you deploy a stream or a task data pipeline using SCDF, the applications involved in the data pipeline are solely responsible for the data processing—no data will ever go through SCDF or Skipper. The applications are independently addressing the use-case requirement. SCDF (and Skipper) will help with central monitoring, management, and continuous deliveries of the applications.
You can read more about the responsibilities from the architecture section in docs.
Related
Our project involves containerisation of services / application and later they will be deployed on Kuberentes. My job is to do performance testing using Jmeter after the services are hosted on Kubernetes.
I am relatively new to Performance testing and have basic experience on Jmeter that I gained from working on it. I have understood how the app is load / perf tested using basic URLs or APIs but I want to know how I should go about handling performance testing for Docker containers hosted on Kubernetes.
How could I handle the above scenario?
JMeter doesn't know anything about the underlying technologies used at the backend, it just sends requests via Samplers, waits for responses and measures the elapsed time of the request and some other performance metrics. Later on you can generate HTML Reporting Dashboard to visualize the results
So your goal is to:
Identify the business use cases you need to implement for the performance testing
Identify network protocols which are being used under the hood of these business use cases
Create a JMeter Test Plan to precisely mimic the real user (or other application) accessing your system and doing what it supposed to be doing
I am kind of new the spring cloud dataflow world and while playing around with the framework, I see that if I have a stream = 'test-steram' with 1 application called 'app'. When I deploy using skipper to kubernetes, I see that It creates pod/deployment & service on kubernetes with name as
test-stream-app-v1.
My question is why do we need to have v1 in service/deployment names on k8s? What role does it play in the overall workflow using spring cloud dataflow?
------Follow up -----------
Just wanted to confirm few points to make sure i am on right track to understand the flow
My understanding is with traditional stream (bind through kafka topics) service (object on kubernetes) do not play a significant role.
Rolling Update (Red/Black) pattern has implemented in following way in skipper and versioning in deployment/service plays a role in following way.
Let's assume that app-v1 deployment already exists and upgrade is requested. Skipper creates app-v2 deployment and
wait for it to be ready. Once ready it destroys app-v1
If my above understanding is right I have following follow up questions...
I see that skipper can deploy and package (and it do not have to be a traditional stream) to work with. Is that the longer term plan or Skipper is only intended to work spring-cloud-dataflow streams?
In case of non-tradtional stream package, where an package has multiple apps(rest microservices) in a group, how this model of versioning will work? I mean when I want to call the microservice from other microservice, I cannot possibly know or less than ideal to know the release-version of the app?
#Anand. Congrats on the 1st post!
The naming convention goes by the idea that each of the stream application is "versioned" if Skipper is used with SCDF. The version gets bumped for when, as a user, when you rolling-upgrade and rolling-downgrade the streaming-application versions or the application-specific properties either on-demand or via CI/CD automation.
It is very relevant for continuous-delivery and continuous-deployment workflows, and we provide native options in SCDF through commands such as stream update .. and stream rollback .. respectively. For any of these operations, the applications will be rolling updated in K8s, and each action will bump the number in the application name. In your example, you'd see them as test-stream-app-v1, `test-stream-app-v2, etc.
With all the historical versions in a central place (i.e., Skipper's database), you'd be able to interact with them via stream history.. and stream manifest .. commands in SCDF.
To learn more about all this, watch this demo-webinar (starts # ~41.25), and also have a look at samples in the reference guide.
I hope this helps.
I'm using Spring Cloud Dataflow local server and deploying 60+ streams with a Kafka topic and custom sink. The memory/cpu usage cost is not currently scalable. I've set the Xmx to 64m for most streams.
Currently exploring my options.
Disable embedded Tomcat server. I tried this once and SCDF couldn't tell the deployment status of the stream.
Group multiple Kafka "source" topics to a single sink app. This is allowed by Kafka but unclear if SCDF will permit subscribing to multiple topics.
Switch to using Kubernetes deployer. Won't exactly reduce the memory/cpu usage but distribute it across multiple machines. Haven't pursued this option because Kubernetes isn't used in my org yet. Maybe this will force the issue.
Open to other ideas. Might also be able to tweak Kafka configs such as max.poll.records and reduce memory usage.
Thanks!
First, I'd like to clarify the differences between SCDF and Stream/Task apps in the data pipeline.
SCDF is a lightweight Spring Boot app that includes the DSL, REST-APIs, and the Dashboard. Simply put, it serves as the orchestrator to define and deploy stream and task/batch data pipelines made of stream and task applications respectively.
The actual business logic, its performance, and the underlying resource consumption are at the individual Stream/Task application level. SCDF doesn't interfere with the app's operation, nor it contributes to the resource load. Everything, in the end, is standalone Boot apps - standalone Java processes.
Now, to your exploratory steps.
Disable embedded Tomcat server. I tried this once and SCDF couldn't tell the deployment status of the stream.
SCDF is a REST server and it requires the application container (in this case Tomcat), you cannot disable it.
Group multiple Kafka "source" topics to a single sink app. This is allowed by Kafka but unclear if SCDF will permit subscribing to multiple topics.
Again, there is no relation between SCDF and the apps. SCDF orchestrates full-blown Stream/Task (aka: Boot apps) into coherent data pipeline. If you have to produce or consumer to/from multiple Kafka topics, it is done at application level. Checkout the multi-io sample for more details.
There's the facility to consume from multiple topics directly via named-destination, too. SCDF provides a DSL/UI capability to build fan-in and fan-out pipelines. Refer to docs for more details. This video could be useful, too.
Switch to using Kubernetes deployer.
SCDF's Local-server is generally recommended for development. Primarily because there's no resiliency baked into the Local-server implementation. For example, if the streaming apps crash for any reason, there's no mechanism to restart them automatically. This is exactly why we recommend either SCDF's Kubernetes or Cloud Foundry server implementations in production. The platform provides the resiliency and fault-tolerance by automatically restarting the apps under fault scenarios.
From resourcing standpoint, once again, it depends on each application. They are standalone microservice application doing a specific operation at runtime, and it is up to how much resources the business logic requires.
I am wanting to know what would be the best way to expose a library via zeromq. Say, I install a machine learning library (mll) on one machine, and I have a zeromq broker running on another. Now, if I have a zeromq client which needs to call functions within the mll, how can it do so via the broker.
I am wanting to know the steps I will need to take to make this work for libraries in a generic way.
Basically you need to have a "listener" that picks up data from ZMQ and feeds it to your machine-learning backend code, then transmits the results back to the requestor.
There are a lot of design choices to be made, such as what format to use to serialize data between client and server (JSON? YAML? Pickle? Thrift? ...) , and how to encode requests and request options. But all things considered, this is a pretty straightforward ZMQ usage.
The problem comes when you want a more feature-rich, complete, robust, etc. design--things like multi-threaded or multi-process servers, multi-machine scalability, secure user / request authentication and authorization, job reporting and dashboard, or job checkpointing. All those "extras" are common "network job scheduler" or "(enterprise) message broker" functions that tend to come out-of-the-box with packages like Celery or RQ.
If you don't want to go the full "message broker middleware" route, you might start by examining others' designs for lightweight ZMQ-based job brokers, such as this one from Jeff Knupp.
I'm really impressed with the power of cloud computing when it comes to the possibility to scale up and down your facilities depending on your load.
How can I shift my paradigm and learn to write my applications in that way? Write it once and forget(no matter of the future load) would be the best solution.
How can I practice my skills in that area?
Setup virtualization environment when I can add another VMs into the private cloud(via command line?) on some smart algorithms to foresee the load for some period of time?
Ideally I want to practice it without buying actual Cloud computing services and just on my hardware.
The only thing I want to practice here is app/web role and/or message queue systems scaling when current workers have too many jobs in queue. So let's rule out database scaling from the question's goal as too big topic.
One option I will throw out is to use a native Cloud execution framework. You might look at CloudIQ Platform. One component is CloudIQ Engine. It allows you to develop cloud native apps in C/C++, Java and .NET. You get the capabilities of scale up by simply adding workers to your cloud. The framework automatically distributes your applications to the new machine(s), and once installed, will begin sending work to them as requests come in. So in effect the cloud handles your queueing issue for you.
Check out the Download and Community links for more information.
You should try AWS- Amazon's offering a free tier that gives you storage, messaging and micro instances (only linux). you can start developing small try-outs without paying. writing an application that scales isn't that hard- try to break your flow into small, concurrent tasks. client-server applications are even easier- use a load balancer to raise\terminate servers by demand.