I'm trying to collect my application containers' logs throughout their entire life cycle. These containers run inside Kubernetes pods. I've found solutions like Fluentd, but I also found out that I need to specify a backend (Elasticsearch, AWS S3, etc.), whereas I want to collect logs into files with specific names, for example podname_namespace_containername.json, and then parse those files with a script. Is this possible with Fluentd?
By far the fastest way to set up log collection is the fluent-bit Helm chart: https://github.com/helm/charts/tree/master/stable/fluent-bit. Refer to values.yaml for all the available options. It supports multiple backends such as Elasticsearch, S3, and Kafka. Every log event is enriched with pod metadata (pod name, namespace, etc.) and tagged, so you can organize processing separately on a backend. E.g. on a backend you can select and parse only certain pods in certain namespaces.
According to https://kubernetes.io/docs/concepts/cluster-administration/logging/, applications log to stdout/stderr, the output gets written to the underlying node, and a log collector (a DaemonSet) picks everything up and forwards it on.
A Fluent Bit DaemonSet in Kubernetes implements exactly this architecture. More docs on Fluent Bit: https://docs.fluentbit.io/manual/concepts/data-pipeline
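If you do want local files rather than a remote backend: Fluent Bit also ships a file output plugin, so records matched by tag can be written to files and post-processed by your script. A minimal sketch, where the path and match pattern are assumptions you would adapt:

```
[OUTPUT]
    Name    file
    Match   kube.*
    Path    /var/log/collected/
    Format  plain
```

Since the Kubernetes filter enriches each record with pod name, namespace, and container name, your script can derive the podname_namespace_containername grouping from the parsed records.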
I have a container (a machine learning application) which is capable of loading pre-trained ML models stored in a persistent volume. I can ask the application to load a particular model by giving its name via its REST API.
Now I want to scale this application up so that I can load a given model in any one of the replicas (not in all of them) and then send data to that specific model for processing.
I know this can be done by having multiple deployments, with a service pointing to each of them, so that every instance gets a separate NodePort and I can reach each instance's REST API through its own port. That way I can keep track of which model is loaded on which instance on my own.
Is there a recommended way to accomplish this without multiple deployments, i.e. through replicas? (Something like a single deployment manifest with replicas and a manual load balancer at the service level.)
As you mentioned, the preferred way to achieve this is through multiple deployments and services.
Unless created manually, ReplicaSets are managed by the Deployment, and you won't be able to have a single Deployment with multiple ReplicaSets running different versions.
It will definitely be easier to have one Service + Deployment per version, with a single Ingress in front to route traffic based on some piece of the request. It could be a header, sub-domain, path, etc.
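For example, path-based routing with a single Ingress in front of per-model services might look like this sketch (all names and paths are hypothetical):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: model-router
spec:
  rules:
  - http:
      paths:
      - path: /models/sentiment
        pathType: Prefix
        backend:
          service:
            name: model-sentiment   # Service for the deployment serving this model
            port:
              number: 80
      - path: /models/forecast
        pathType: Prefix
        backend:
          service:
            name: model-forecast
            port:
              number: 80
```

Clients then hit one stable entry point, and the path decides which model-specific deployment answers.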
To generate all your deployments and services, you could have a look at kustomize.
I'm trying to design a good way to bring up a fleet of Docker containers that act as IoT devices, each with a slightly different configuration.
In each container I would have an app that simulates some hardware device, e.g. a temperature sensor, and each one would have a unique "identity", for example sensor1, sensor2, sensor3, etc., plus some other configuration that could vary.
I would also want to scale up and scale down the number of virtual devices based on the use case being tested.
The ways I can think of doing this would be to either pass unique properties to each container via a shell script, or have each container access a database or some other store where it would retrieve its unique configuration on startup from a pool of available configurations.
This way the same app would run in each container but its configuration would be unique. Then they would start sending data to some endpoint where I process their telemetry payloads as a data stream.
For a Kubernetes solution, it seems the containers would have to connect to some common datastore to fetch their configuration, since it would not be possible to pass a unique set of properties to each container. Is that right?
Regarding the identity, David already mentioned the StatefulSet.
Like a Deployment, a StatefulSet manages Pods that are based on an identical container spec. Unlike a Deployment, a StatefulSet maintains a sticky identity for each of their Pods. These pods are created from the same spec, but are not interchangeable: each has a persistent identifier that it maintains across any rescheduling.
Regarding the scaling: if I understand you correctly, you need a ReplicaSet.
A ReplicaSet is defined with fields, including a selector that specifies how to identify Pods it can acquire, a number of replicas indicating how many Pods it should be maintaining, and a pod template specifying the data of new Pods it should create to meet the number of replicas criteria. A ReplicaSet then fulfills its purpose by creating and deleting Pods as needed to reach the desired number. When a ReplicaSet needs to create new Pods, it uses its Pod template.
For the configuration/properties part, ConfigMaps are what you need.
ConfigMaps bind configuration files, command-line arguments, environment variables, port numbers, and other configuration artifacts to your Pods' containers and system components at runtime.
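Putting the three together, a hedged sketch: a StatefulSet gives each simulator a stable name (sensor-0, sensor-1, ...), the Downward API passes that name into the container as its device identity, and a ConfigMap carries the shared settings. The image name and ConfigMap name below are hypothetical:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: sensor
spec:
  serviceName: sensor
  replicas: 3                     # scale up/down per test scenario
  selector:
    matchLabels:
      app: sensor
  template:
    metadata:
      labels:
        app: sensor
    spec:
      containers:
      - name: simulator
        image: my-registry/device-sim:latest   # hypothetical image
        env:
        - name: DEVICE_ID
          valueFrom:
            fieldRef:
              fieldPath: metadata.name         # yields sensor-0, sensor-1, ...
        envFrom:
        - configMapRef:
            name: sensor-common-config         # hypothetical shared config
```

Each replica can then use DEVICE_ID to look up (or derive) its unique configuration, while the common settings stay in one place.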
Please let me know if that helped.
I have this idea for what I think should happen with my project, but I want to check whether it works on a theoretical level first. Basically I am working on a Django site that runs on Kubernetes, but I am struggling a bit with how I should set up my ReplicaSet/StatefulSet to manage uploaded content (images).
My biggest concern is trying to find out how to scale and maintain uploaded content. My first idea is that I need to have a single volume that has these files written to it, but can I have multiple pods write to the same volume that way while scaling?
From what I have gathered, it doesn't seem to work that way. It sounds more like each pod, or at least each node, would have its own volume. But would a request for an image then reach the volume it is stored on? Or should I write a custom backend to move things around so that images are served from an NGINX server like my other static content?
FYI - this is my first scalable project, lol. But I am really just trying to find the best way to manage uploads... or a way in general. I would appreciate any explanations, thoughts, or fancy diagrams on how something like this might work!
Hello, I think you should forget Kubernetes for a moment and think about the architecture and capabilities of your Django application. I guess you have built a web app that offers some 'upload image' functionality, and then you have code that stores this image somewhere. In the simplest scenario, running the app on your laptop, the web app is configured to save this content to a local folder. A more advanced example: you deploy your application to a VM or a cloud VM, e.g. an AWS EC2 instance, and your app saves the files to that instance's local storage.
The question is: what happens once you have two instances of your web app deployed? Can they be configured and run so that they share the same folder for saving images? I guess this is what you want; otherwise your app would not scale horizontally, and each user would have to hit one specific instance in order to upload or retrieve specific images. So this is a design decision of your application, which I am pretty sure you have already worked out: how can all the instances of my web app share a folder or bucket for saving files? If you spun up three different VMs on any cloud, you would have to use some kind of shared storage, so that all three instances point to the same physical storage location: an NFS drive, or a cloud storage service such as S3.
With all the above in mind, and clearly understanding that you need to decouple your application from the notion of local storage (especially if you want it to be as stateless as it gets, whatever that means to you): packaging your web app as a Docker container, deploying it to a Kubernetes cluster as a pod, and saving files to local storage is not going to get you far, since each pod (each Docker container) will use the underlying Kubernetes worker's (VM's) storage to save files, so another instance will be saving files on some other VM, and so on.
Kubernetes provides an abstraction for applications (pods) that want to share some storage within the cluster and, of course, persist it. Something I did not add above: if you save files on the Kubernetes worker or in the pod, once that VM/instance is restarted you will lose your data. So you want something durable.
To cut a long story short, you have two options:
1) Deploy your application/pod along with a PersistentVolumeClaim, assuming your Kubernetes cluster supports it. You mount into your pod some kind of folder/storage which is backed by whatever is available to your cluster, e.g. some kind of NFS store. https://kubernetes.io/docs/concepts/storage/persistent-volumes/
2) 'Outsource' this need for shared storage to an external provider, e.g. the common case of an S3 bucket, and don't tackle the problem in Kubernetes at all; just keep and provision the app within Kubernetes.
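For option 1, the claim itself is small; the important detail is the ReadWriteMany access mode, which requires a storage backend that supports it (NFS does, most local-disk provisioners do not). A sketch, with name and size as placeholders:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: media-uploads
spec:
  accessModes:
  - ReadWriteMany   # needed so every Django replica can mount the same volume
  resources:
    requests:
      storage: 10Gi
```

Every replica of the Django deployment then mounts this claim at MEDIA_ROOT and they all see the same files.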
I hope I gave you some basic ideas.
Note: Kubernetes 1.14 (March 2019) now ships "Durable Local Storage Management" as GA, which:
Makes locally attached (non-network attached) storage available as a persistent volume source.
Allows users to take advantage of the typically cheaper and improved performance of persistent local storage. (kubernetes/kubernetes: #73525, #74391, #74769; kubernetes/enhancements: #121 (KEP))
That might help secure truly persistent storage for your case.
As noted by x-yuri in the comments:
See more with "Kubernetes 1.14: Local Persistent Volumes GA", from Michelle Au (Google), Matt Schallert (Uber), Celina Ward (Uber).
You could use IPFS: https://pypi.org/project/django-ipfs-storage/
By creating a container with this image https://hub.docker.com/r/ipfs/go-ipfs/ in the same pod, you can refer to it as 'localhost'.
I work with Docker and Kubernetes.
I would like to collect application-specific metrics from each Docker container.
There are various applications, each running in one or more containers.
I would like to collect the metrics in JSON format in order to perform further processing on each type of metrics.
I am trying to understand what is the best practice, if any and what tools can I use to achieve my goal.
Currently I am looking into several options, none looks too good:
Connecting via kubectl, getting a list of pods, and running a command (exec) in each pod to make the application print/send JSON with metrics. I don't like this option, as it means I need to be aware of which pods exist and access each one, while the whole point of using Kubernetes is to avoid dealing with that.
I am looking for Kubernetes API HTTP GET request that will allow me to pull a specific file.
The closest I found is GET /api/v1/namespaces/{namespace}/pods/{name}/log and it seems it is not quite what I need.
And again, it forces me to mention each pod by name.
I am considering using an ExecAction in a Probe to send JSON with metrics periodically. It is a hack (this is not what probes are for), but it removes the need to handle each specific pod.
I can't use Prometheus for reasons that are out of my control, but I wonder how Prometheus collects metrics. Maybe I can use a similar approach?
Any possible solution will be appreciated.
From an architectural point of view you have two options here:
1) Pull model: your application exposes metrics through some mechanism (for instance over HTTP on a separate port) and an external tool scrapes your pods at a timed interval (getting pod addresses from the Kubernetes API); this is the model used by Prometheus, for instance.
2) Push model: your application actively pushes metrics to an external server, typically a time-series database such as InfluxDB, whenever it is most relevant to do so.
In my opinion, option 2 is the easiest to implement, because:
you don't need to deal with the Kubernetes API in order to discover pod addresses;
you don't need a local storage layer for your metrics (because you push them one by one as they occur).
But there is a drawback: you need to be careful how you implement this; it could make your API slower, and you may need to make the calls to your metrics server asynchronous.
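A minimal sketch of the push model using only the standard library. The endpoint URL and payload shape are assumptions, and the actual send runs on a background thread so it cannot slow down the request path:

```python
import json
import threading
import time
import urllib.request

METRICS_URL = "http://metrics.example.internal/ingest"  # hypothetical endpoint


def build_payload(name, value, labels=None):
    """Assemble a single metric event as a JSON-serializable dict."""
    return {
        "metric": name,
        "value": value,
        "labels": labels or {},
        "timestamp": time.time(),
    }


def push_metric(payload, url=METRICS_URL, timeout=2.0):
    """Fire-and-forget POST; runs in a daemon thread so the caller never blocks."""
    def _send():
        data = json.dumps(payload).encode("utf-8")
        req = urllib.request.Request(
            url, data=data, headers={"Content-Type": "application/json"}
        )
        try:
            urllib.request.urlopen(req, timeout=timeout)
        except OSError:
            pass  # metrics are best-effort; never fail the application for them
    threading.Thread(target=_send, daemon=True).start()
```

Usage would be a single call at the relevant point in your code, e.g. `push_metric(build_payload("requests_total", 1, {"pod": "api-0"}))`.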
This is obviously a very generic answer, but I hope it could point you in the right direction.
It's a pity you can't use Prometheus, but it's a good lead for what can be done in this scope. What Prometheus does is as follows:
1: it assumes that the metrics you want to scrape (collect) are exposed on some HTTP endpoint that Prometheus can access.
2: it connects to the Kubernetes API to discover the endpoints to scrape metrics from (this is configurable, but it generally means Prometheus has to be able to connect to the API, list services/deployments/pods, and analyze their annotations, which carry info about the metrics endpoints, to compose the list of places to scrape data from).
3: periodically (every 15s, 60s, etc.) it connects to those endpoints and collects the exposed metrics.
That's it; the rest is storage/post-processing. The Kubernetes-related part can be a significant amount of work, though, so it would be much better to go with something that already exists.
Side note: while this is generally a pull-based model, there are cases where pulling is not possible (e.g. short-lived scripts, such as PHP jobs), and that is where the Prometheus Pushgateway comes into play, allowing metrics to be pushed to a place Prometheus will pull from.
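If you want to imitate the pull approach without Prometheus, the application side boils down to serving a metrics snapshot as JSON on an HTTP endpoint. A minimal stdlib-only sketch; the port, path, and metric names are assumptions:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# In a real app this dict would be updated by your application code.
METRICS = {"requests_total": 0, "queue_depth": 0}


def render_metrics(metrics):
    """Serialize the current metrics snapshot as a JSON body."""
    return json.dumps(metrics).encode("utf-8")


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/metrics":
            body = render_metrics(METRICS)
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()


def serve(port=9100):
    """Blocks forever; a collector GETs http://<pod-ip>:9100/metrics at an interval."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Your own collector script then plays the Prometheus role: list the pod addresses once, fetch /metrics from each on a timer, and process the JSON however you like.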
What's the best way to get the IP addresses of the other kubernetes pods on a local network?
Currently, I'm using the following command and parsing the output: kubectl describe pods.
Unfortunately, the command above often takes many seconds to complete (at least 3, and often 30+ seconds), and if a number of requests happen nearly simultaneously, I get 503-style errors. I've built a caching system around this command to cache the IP addresses on the local pod, but when 10 or so pods wake up and need to build this cache, there is a large delay and often many errors. I feel like I'm doing something wrong. Getting the IP addresses of other pods on a network seems like it should be a straightforward process. So what's the best way to get them?
For added details, I'm using Google's kubernetes system on their container engine. Running a standard Ubuntu image.
Context: I'm trying to put together a shared memcached between the pods on the cluster. To do that, they all need to know each other's IP addresses. If there's an easier way to link pods/instances for the purposes of memcached, that would also be helpful.
Have you tried
kubectl get pods -o wide
This also returns the pods' IP addresses. Since it does not return ALL the information describe returns, it might be faster.
For your described use case you should be using services. A headless service would allow you to reference them with my-svc.my-namespace.svc.cluster.local. This assumes you don't need to know individual nodes, only how to reach one of them, as it will round robin between them.
If you do need to have fixed network identities in your cluster attached to the pods you can setup a StatefulSet and reference them with: app-0.my-svc.my-namespace.svc.cluster.local, app-1.my-svc.my-namespace.svc.cluster.local and so on.
You should never need to contact specific pod IPs in any other way, especially since pods can be rescheduled at any time and have their IPs changed.
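Resolving a headless service's DNS name returns one A record per ready pod behind it, which is exactly the membership list a memcached client needs. A minimal sketch; the service name is hypothetical:

```python
import socket


def resolve_pod_ips(service_dns):
    """Return the sorted set of IPv4 addresses behind a DNS name.

    For a headless service, e.g. "memcached.default.svc.cluster.local",
    this yields one address per ready pod, with no kubectl call involved.
    """
    infos = socket.getaddrinfo(service_dns, None, socket.AF_INET)
    return sorted({info[4][0] for info in infos})


# Inside the cluster you would call, for example (hypothetical name):
# resolve_pod_ips("memcached.default.svc.cluster.local")
```

This is also far cheaper than shelling out to kubectl, since it is a single DNS lookup against the cluster's own DNS service.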
For your use case specifically, it might be easier to just use the memcache helm chart, which supports a cluster in a StatefulSet: https://github.com/kubernetes/charts/tree/master/stable/memcached