I have recently started working with Docker, K8s and Argo. I am currently working on creating 2 containerized applications and then linking them up in such a way that they can run on Argo. The 2 containerized applications would be as follows:
ReadDataFromAFile: This container would have the code that receives a URL/file with some random names. It would separate out all those names and return an array/list of names.
PrintData: This container would accept the list of names and then print them out with some business logic involved.
I am currently not able to understand how to:
Pass text/file to the ReadData Container.
Pass on the processed array of names from the first container to the second container.
I have to write an Argo Workflow that would regularly perform these steps!
Posting this as a Community wiki answer for better visibility, with a general solution.
Feel free to expand it.
Since you don't need to store any artifacts, the best options to pass data between Kubernetes Pods are (as @David Maze mentioned in his comment):
1. Pass the data in the body of HTTP POST requests.
There is a good article with examples of HTTP POST requests here.
POST is an HTTP method designed to send data to the server from an HTTP client. The HTTP POST method requests that the web server accept the data enclosed in the body of the POST message.
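For instance, here is a minimal sketch of option 1 in Go, assuming the PrintData container exposes a hypothetical /names endpoint behind a Kubernetes Service named print-data (both names are placeholders):

```go
package main

import (
	"bytes"
	"encoding/json"
	"log"
	"net/http"
)

func main() {
	// Names extracted by the ReadDataFromAFile step.
	names := []string{"alice", "bob", "carol"}

	body, err := json.Marshal(names)
	if err != nil {
		log.Fatal(err)
	}

	// "print-data" is a placeholder Service name for the PrintData container.
	resp, err := http.Post("http://print-data/names", "application/json", bytes.NewReader(body))
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	log.Println("PrintData responded with", resp.Status)
}
```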
2. Use a message broker, for example, RabbitMQ.
RabbitMQ is the most widely deployed open source message broker. It supports multiple messaging protocols. RabbitMQ can be deployed in distributed and federated configurations to meet high-scale, high-availability requirements.
RabbitMQ provides a wide range of developer tools for most popular languages.
You can install RabbitMQ into the Kubernetes cluster using the Bitnami Helm chart.
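As a rough sketch of option 2 (assuming the widely used Go AMQP client github.com/streadway/amqp and a broker reachable through a Service named rabbitmq; both are placeholders), the ReadDataFromAFile container could publish the names to a queue that the PrintData container consumes:

```go
package main

import (
	"encoding/json"
	"log"

	"github.com/streadway/amqp"
)

func main() {
	// "rabbitmq" is a placeholder Service name for the broker inside the cluster.
	conn, err := amqp.Dial("amqp://guest:guest@rabbitmq:5672/")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	ch, err := conn.Channel()
	if err != nil {
		log.Fatal(err)
	}
	defer ch.Close()

	// Declare the queue the PrintData container would consume from.
	q, err := ch.QueueDeclare("names", true, false, false, false, nil)
	if err != nil {
		log.Fatal(err)
	}

	body, _ := json.Marshal([]string{"alice", "bob", "carol"})
	err = ch.Publish("", q.Name, false, false, amqp.Publishing{
		ContentType: "application/json",
		Body:        body,
	})
	if err != nil {
		log.Fatal(err)
	}
}
```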
Related
While trying to create a web application for shared drawing I got stuck on a problem regarding Kubernetes and scaling. The application uses an ASP.NET Core backend with SignalR for sharing the drawing data across its users. For scaling out the application I am using a deployment for each microservice of the system. For the SignalR part though, additional configuration is required.
After some research I have found out about the possibility to sync all instances of the SignalR backend either through the use of Azure's SignalR Service or a Redis backplane. I have gotten the latter to work on my local minikube environment. I am not really happy with this solution for the following reasons:
My main concern is that like this I have created a hard bottleneck in the system. Unlike in a chat application, where data is sent only once in a while, messages are sent for every few points drawn in the shared drawing experience by any client. Simply put, a lot of traffic can occur and all of it has to pass through the single Redis backplane.
Additionally, it seems unnecessary to me to make all instances of the SignalR backend talk to each other. In this application, shared drawing only occurs in small groups of up to, let's say, 10 clients. Groups of this size can easily be hosted on a single instance.
So without syncing all instances of the SignalR backend, I would have to route the client's connection, based on the SignalR group name, to the right instance of the SignalR backend when the client is trying to join a group.
I have found out about StatefulSets, which allow me to have a persistent address for each backend pod in the cluster. I could then somehow associate the SignalR group IDs with the pod addresses they are running on in, let's say, another lookup microservice. The problem with this is that the client needs to be able to access the right pod from outside of the cluster, where that cluster-internal address does not really help.
Also, I am wondering if there isn't an altogether better approach to the problem, since I am very new to the world of Kubernetes. I would be very grateful for your thoughts on this issue and any hint towards a (better) solution.
I'm currently trying to create a tracing tool for fun (one which supports gRPC tracing) and was confused as to whether or not I was thinking about this architecture properly. A tracing tool keeps track of the entire workflow/journey of a request (from the moment a user clicks the button, to when the request goes to the API gateway, between microservices, and back).
Let's say the application is a bookstore, and it is broken up into 2 microservices, maybe account and books. Let's say that there is a User Interface, and when you click a button, it allows a user to favorite a book. I'm only using 2 microservices to keep this example simple.
**Different parts of the Fake/Mock-up application**
UI ->
nginx -> I wanted to use this as an API Gateway.
microservice 1 -> (Contains data for all Users of a bookstore)
microservice 2 -> (Contains data for all the books)
**So my goal is to figure out a way to trace that request. So we can imagine the request goes to nginx.**
Concern #1: When the request goes to nginx, it is HTTP. Cool, but when the request is sent to the microservice, it is a grpc call (or over http2). Can nginx get an http request and then send that request over http2...? Not sure if I'm wording this correctly or not. I know nginx plus supports http2. I also know that grpc has a grpc gateway too.
Concern #2: Containerization. Do I have to containerize both microservices individually, or would I have to containerize the entire application as one container? Is it simple to link nginx and Docker?
Concern #3: When tracing gRPC requests (finding out how much time it takes for a request to be fulfilled), I'm considering using a middleware logger or a tracing API (OpenTracing, Jaeger, etc.) to do this. How else would I figure out how long gRPC requests take?
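For example, this is roughly the kind of timing middleware I have in mind, a sketch assuming grpc-go (the target address is a placeholder, and a real tracer like Jaeger would start/finish spans instead of logging):

```go
package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
)

// timingInterceptor logs how long each unary RPC takes; a tracing API
// (OpenTracing, Jaeger, etc.) would start and finish a span here instead.
func timingInterceptor(ctx context.Context, method string, req, reply interface{},
	cc *grpc.ClientConn, invoker grpc.UnaryInvoker, opts ...grpc.CallOption) error {
	start := time.Now()
	err := invoker(ctx, method, req, reply, cc, opts...)
	log.Printf("rpc %s took %s (err=%v)", method, time.Since(start), err)
	return err
}

func main() {
	// "books:50051" is a placeholder address for the books microservice.
	conn, err := grpc.Dial("books:50051", grpc.WithInsecure(),
		grpc.WithUnaryInterceptor(timingInterceptor))
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	// ... use the generated gRPC client with this conn as usual ...
}
```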
I was wondering if it is possible to address these concerns, if my thought process is correct, and if this architecture is feasible.
Most solutions in the industry are implemented on top of a container orchestration solution (Kubernetes, Docker Swarm, etc).
It is usually not a good idea to "containerize" and manage a reverse proxy yourself.
The reverse proxy should be aware of the status of all containers (by hooking into the orchestrator) and dynamically update its configuration when a container is created, crashes, or is relocated (because a machine goes out of service, for example).
In Kubernetes, gRPC load balancing is usually handled by a service mesh. Please take a look at Kubernetes service meshes.
If you decide to use Traefik and Docker Swarm, check out Traefik's h2c support.
In conclusion, consider more modern alternatives to Nginx when you want to load balance gRPC.
Preface
I am currently trying to learn how micro-services work and how to implement container replication and API gateways. I've hit a block though.
My Application
I have three main services for my application.
API Gateway
Crawler Manager
User
I will be focusing on the API Gateway and Crawler Manager services for this question.
API Gateway
This is a docker container running a Go server. The communication is all done with GraphQL.
I am using an API Gateway because I expect to have different services in my application each having their own specialized API. This is to unify everything.
All it does is proxy requests to their appropriate service and return a response back to the client.
Crawler Manager
This is another docker container running a Go server. The communication is done with GraphQL.
More or less, this behaves similarly to another API gateway. Let me explain.
This service expects the client to send a request like this:
{
  # In production 'url' will be encoded in base64
  example(url: "https://apple.example/") {
    test
  }
}
The url can only link to one of these three sites:
https://apple.example/
https://peach.example/
https://mango.example/
Any other site is strictly prohibited.
Once the Crawler Manager service receives a request and the link is one of those three, it decides which other service should fulfill the request. So in that way, it behaves much like another API gateway, but a specialized one.
Each URL domain gets its own dedicated service for processing it. Why? Because each site varies quite a bit in markup and each site needs to be crawled for information. Because their markup is varied, I'd like a service for each of them so that if a site is updated the whole Crawler Manager service doesn't go down.
As far as the querying goes, each site will return a response formatted identically to the other sites.
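To give a sense of the routing inside the Crawler Manager, here is a rough sketch in Go (the downstream service URLs are placeholders for whatever each crawler ends up being reachable at):

```go
package main

import (
	"fmt"
	"net/url"
)

// resolveCrawler picks the downstream crawler service for a submitted URL.
func resolveCrawler(raw string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	switch u.Host {
	case "apple.example":
		return "http://apple-crawler/graphql", nil
	case "peach.example":
		return "http://peach-crawler/graphql", nil
	case "mango.example":
		return "http://mango-crawler/graphql", nil
	default:
		return "", fmt.Errorf("site %q is not allowed", u.Host)
	}
}

func main() {
	target, err := resolveCrawler("https://apple.example/")
	if err != nil {
		panic(err)
	}
	fmt.Println("forwarding to", target)
}
```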
Visual Outline
Problem
Now that we have a bit of an idea of how my application works I want to discuss my actual issues here.
Is having a sort of secondary API gateway standard and good practice? Is there a better way?
How can I replicate this system and have multiple Crawler Manager service family instances?
I'm really confused about how I'd actually create this setup. I looked at clusters in Docker Swarm / Kubernetes, but with the way I have it set up it seems like I'd need to make clusters of clusters. That makes me question my design overall. Maybe I shouldn't think about keeping them so structured?
At a very generic level, if service A calls service B that has multiple replicas B1, B2, B3, ... then it needs to know how to call them. The two basic options are to have some sort of service registry that can return all of the replicas, and then pick one, or to put a load balancer in front of the second service and just directly reach that. Usually setting up the load balancer is a little bit easier: the service call can be a plain HTTP (GraphQL) call, and in a development environment you can just omit the load balancer and directly have one service call the other.
                                     /-> service-1-a
    Crawler Manager --> Service 1 LB --> service-1-b
                                     \-> service-1-c
If you're willing to commit to Kubernetes, it essentially has built-in support for this pattern. A Deployment is some number of replicas of identical pods (containers), so it would manage the service-1-a, -b, -c in my diagram. A Service provides the load balancer (its default ClusterIP type provides a load balancer accessible only within the cluster) and also a DNS name. You'd configure your crawler-manager pods with perhaps an environment variable SERVICE_1_URL=http://service-1.default.svc.cluster.local/graphql to connect everything together.
(In your original diagram, each "box" that has multiple replicas of some service would be a Deployment, and the point at the top of the box where inbound connections are received would be a Service.)
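For example, a minimal sketch of how the crawler-manager code could consume that SERVICE_1_URL environment variable and call the load-balanced Service (the GraphQL query mirrors your example above; everything else is a placeholder):

```go
package main

import (
	"bytes"
	"encoding/json"
	"io"
	"log"
	"net/http"
	"os"
)

func main() {
	// Set on the Deployment, e.g.
	// SERVICE_1_URL=http://service-1.default.svc.cluster.local/graphql
	target := os.Getenv("SERVICE_1_URL")
	if target == "" {
		log.Fatal("SERVICE_1_URL is not set")
	}

	// The Service's ClusterIP load balancer spreads these calls
	// across service-1-a, -b and -c.
	payload, _ := json.Marshal(map[string]string{
		"query": `{ example(url: "https://apple.example/") { test } }`,
	})

	resp, err := http.Post(target, "application/json", bytes.NewReader(payload))
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	log.Println(string(body))
}
```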
In plain Docker you'd have to do a bit more work to replicate this, including manually launching the replicas and load balancers.
Architecturally what you've shown seems fine. The big "if" to me is that you've designed it so that each site you're crawling potentially gets multiple independent crawling containers and a different code base. If that's really justified in your scenario, then splitting up the services this way makes sense, and having a "second routing service" isn't really a problem.
I am working on a big data project, where I am trying to get tweets from Twitter, analyse these tweets and make predictions out of them.
I have followed this tutorial for getting the tweets: http://blog.cloudera.com/blog/2012/10/analyzing-twitter-data-with-hadoop-part-2-gathering-data-with-flume/
Now I am planning to build a microservice which can replicate itself as I increase the number of topics on which I want tweets. Using the code I have already written to gather the tweets, I want to make a microservice that can take a keyword and create an instance of that code for that keyword and gather tweets; for each keyword an instance should be created.
It would also be helpful if you could tell me what tools to use for such an application.
Thank you.
I want to make a microservice that can take a keyword and create an instance of that code for that keyword and gather tweets; for each keyword an instance should be created.
You could use Kubernetes as an underlying cluster/deployment infrastructure. It has an API that allows you to deploy new services programmatically. So what you would have to do is:
Set up a basic service container for your twitter-service that is available in a container repository.
Then you deploy a first service based on your container. The service configuration will contain the keyword that the service uses, as well as information about the Kubernetes cluster (how to access the cluster API and where to find the container in the repository).
Now your first service has all the information it needs to automatically create additional service descriptions for Kubernetes (with other keywords) and deploy those additional services by calling the Kubernetes cluster API.
Since the additional services will be passed all the necessary information as well, they themselves can then start even more services and so on.
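As a very rough sketch of the "call the cluster API" part, assuming the official Go client (client-go), in-cluster credentials, and a placeholder image name, a running service could create a Deployment for a new keyword roughly like this:

```go
package main

import (
	"context"
	"log"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// deployKeywordService creates a Deployment running the tweet-gathering
// container for a single keyword. Image name and labels are placeholders.
func deployKeywordService(clientset *kubernetes.Clientset, keyword string) error {
	replicas := int32(1)
	labels := map[string]string{"app": "tweets-" + keyword}
	dep := &appsv1.Deployment{
		ObjectMeta: metav1.ObjectMeta{Name: "tweets-" + keyword},
		Spec: appsv1.DeploymentSpec{
			Replicas: &replicas,
			Selector: &metav1.LabelSelector{MatchLabels: labels},
			Template: corev1.PodTemplateSpec{
				ObjectMeta: metav1.ObjectMeta{Labels: labels},
				Spec: corev1.PodSpec{
					Containers: []corev1.Container{{
						Name:  "gatherer",
						Image: "registry.example.com/twitter-gatherer:latest",
						Env:   []corev1.EnvVar{{Name: "KEYWORD", Value: keyword}},
					}},
				},
			},
		},
	}
	_, err := clientset.AppsV1().Deployments("default").
		Create(context.TODO(), dep, metav1.CreateOptions{})
	return err
}

func main() {
	// Runs inside the cluster and uses the pod's service account credentials.
	config, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		log.Fatal(err)
	}
	if err := deployKeywordService(clientset, "bigdata"); err != nil {
		log.Fatal(err)
	}
}
```

Note that the service account the pod runs under would also need RBAC permissions to create Deployments.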
You probably need to put some effort into figuring out the cluster provisioning, but that can also be done automatically with auto-scaling (available for Google or AWS clouds for example).
A different approach would be to run a horizontally scaled cluster of your basic twitter services that use a self-organization algorithm to divide up all the keywords put into a database or event queue.
In the Kubernetes/Docker ecosystem there is a convention of using /healthz as a health-check endpoint for applications.
Where does the name 'healthz' come from, and are there any particular semantics associated with that name?
It historically comes from Google's internal practices. These endpoints are called "z-pages".
The reason it ends with z is to reduce collisions with actual application endpoints with the same name (like /status). See this talk for more: https://vimeo.com/173610242
Similar endpoints (at least inside Google) are /varz, /statusz, /rpcz. Services developed at Google automatically get these endpoints to export their health and metrics and there are tools that collect the exposed metrics/statuses from all the deployed services.
Open source tools like Prometheus implement this pattern (since the original authors of Prometheus are also ex-Googlers) by scraping a well-known endpoint to collect metrics from your application. Similarly, OpenCensus allows you to expose z-pages from your app (ideally on a different port) to diagnose problems.
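The endpoint itself is usually trivial to add; here is a minimal sketch in Go (the port is a placeholder, and a real handler might also verify database or queue connectivity):

```go
package main

import (
	"log"
	"net/http"
)

func main() {
	// Kubernetes liveness/readiness probes (or a human with curl) hit this endpoint.
	http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
		w.Write([]byte("ok"))
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```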