Instance only when needed - GCP - docker

I have a video editing task that needs to be completed occasionally. The task is relatively intensive and therefore needs a powerful machine to do it. It can take up to about 10 minutes to complete. I might get 10-20 such requests per day, though that will increase in the future.
I have created a docker container that currently is a consumer that pulls jobs from PubSub. I was thinking to have an instance of this container on Google Container Engine. However, as I understand it, I would need to have at least one instance of this (large / powerful / expensive) container running at all times, even if the majority of time it is sat idle. Therefore my cost for running this service would be overly high until my usage increased.
Is there an alternative way of running my container (GCP or otherwise) where I push a job to some service, which then starts an instance of a powerful machine, processes the job, then shuts down? Therefore I am paying for my CPU hours used.

Have a look at the cluster autoscaler: https://cloud.google.com/container-engine/docs/cluster-autoscaler

Related

Delay on requests from Google API Gateway to Cloud Run

I'm currently seeing delays of 2-3 seconds on my first requests coming into our APIs.
We've set the min instances to 1 to prevent cold start but this a delay is still occurring.
If I check the metrics I don't see any startup latencies in the specified timeframe so I have no insights in what is causing these delays. Tracing gives the following:
The only thing I can change, is switching to "CPU is always allocated" but this isn't helping in any way.
Can somebody give more information on this?
As mentioned in the Answer :
As per doc :
Idle instances As traffic fluctuates, Cloud Run attempts to reduce the
chance of cold starts by keeping some idle instances around to handle
spikes in traffic. For example, when a container instance has finished
handling requests, it might remain idle for a period of time in case
another request needs to be handled.
Cloud Run But, Cloud Run will terminate unused containers after some
time if no requests need to be handled. This means a cold start can
still occur. Container instances are scaled as needed, and it will
initialize the execution environment completely. While you can keep
idle instances permanently available using the min-instance setting,
this incurs cost even when the service is not actively serving
requests.
So, let’s say you want to minimize both cost and response time latency
during a possible cold start. You don’t want to set a minimum number
of idle instances, but you also know any additional computation needed
upon container startup before it can start listening to requests means
longer load times and latency.
Cloud Run container startup There are a few tricks you can do to
optimize your service for container startup times. The goal here is to
minimize the latency that delays a container instance from serving
requests. But first, let’s review the Cloud Run container startup
routine.
When Starting the service
Starting the container
Running the entrypoint command to start your server
Checking for the open service port
You want to tune your service to minimize the time needed for step 1a.
Let’s walk through 3 ways to optimize your service for Cloud Run
response times.
1. Create a leaner service
2. Use a leaner base image
3. Use global variables
As mentioned in the Documentation :
Background activity is anything that happens after your HTTP response
has been delivered. To determine whether there is background activity
in your service that is not readily apparent, check your logs for
anything that is logged after the entry for the HTTP request.
Avoid background activities if CPU is allocated only during request processing
If you need to set your service to allocate CPU only during request
processing, when the Cloud Run service finishes handling a
request, the container instance's access to CPU will be disabled or
severely limited. You should not start background threads or routines
that run outside the scope of the request handlers if you use this
type of CPU allocation. Review your code to make sure all asynchronous
operations finish before you deliver your response.
Running background threads with this kind of CPU allocation can create
unpredictable behavior because any subsequent request to the same
container instance resumes any suspended background activity.
As mentioned in the Thread reason could be that all the operations you performed have happened after the response is sent.
According to the docs the CPU is allocated only during the request processing by default so the only thing you have to change is to enable CPU allocation for background activities.
You can refer to the documentation for more information related to the steps to optimize Cloud Run response times.
You can also have a look on the blog related to use of Google API Gateway with Cloud Run.

How does Openwhisk decide how many runtime containers to create?

I am working on a project that is using Openwhisk. I have created a Kubernetes cluster on Google cloud with 5 nodes and installed OW on it. My serverless function is written in Java. It does some processing based on arguments I pass to it. The processing can last up to 30 seconds and I invoke the function multiple times during these 30 seconds which means I want to have a greater number of runtime containers(pods) created without having to wait for the previous invocation to finish. Ideally, there should be a container for each invocation until the resources are finished.
Now, what happens is that when I start invoking the function, the first container is created, and then after few seconds, another one to serve the first two invocation. From that point on, I continue invoking the function (no more than 5 simultaneous invocation) but no containers are started. Then, after some time, a third container is created and sometimes, but rarely, a fourth one, but only after long time. What is even weirded is that the containers are all started on a single cluster node or sometimes on two nodes (always the same two nodes). The other nodes are not used. I have set up the cluster carefully. Each node is labeled as invoker. I have tried experimenting with memory assigned to each container, max number of containers, I have increased the max number of invocations I can have per minute but despite all this, I haven't been able to increase the number of containers created. Additionally, I have tried with different machines used for the cluster (different number of cores and memory) but it was in vain.
Since Openwhisk is still relatively a young project, I don't get enough information from the official documentation unfortunately. Can someone explain how does Openwhisk decide when to start a new container? What parameters can I change in values.yaml such that I achieve greater number of containers?
The reason why very few containers were created is the fact that worker nodes do not have Docker Java runtime image and that it needs be downloaded on each of the nodes the first this environment is requested. This image weights a few hundred MBs and it needs time to be downloaded (a couple of seconds in google cluster). I don't know why Openwhisk controller decided to wait for already created pods to be available instead of downloading the image on other nodes. Anyway, once I downloaded the image manually on each of the nodes, using the same application with the same request rate, a new pod was created for each request that could not be served with an existing pod.
The OpenWhisk scheduler implements several heuristics to map an invocation to a container. This post by Markus Thömmes https://medium.com/openwhisk/squeezing-the-milliseconds-how-to-make-serverless-platforms-blazing-fast-aea0e9951bd0 explains how container reuse and caching work and may be applicable for what you are seeing.
When you inspect the activation record for the invokes in your experiment, check the annotations on the activation record to determine if the request was "warm" or "cold". Warm means container was reused for a new invoke. Cold means a container was freshly allocated to service the invoke.
See this document https://github.com/apache/openwhisk/blob/master/docs/annotations.md#annotations-specific-to-activations which explains the meaning of waitTime and initTime. When the latter is decorating the annotation, the activation was "cold" meaning a fresh container was allocated.
It's possible your activation rate is not fast enough to trigger new container allocations. That is, the scheduler decided to allocate your request to an invoker where the previous invoke finished and could accept the new request. Without more details about the arrival rate or think time, it is not possible to answer your question more precisely.
Lastly, OpenWhisk is a mature serverless function platform. It has been in production since 2016 as IBM Cloud Functions, and now powers multiple public serverless offerings including Adobe I/O Runtime and Naver Lambda service among others.

Docker as "Function" (Create a Docker per request)

Is there a simple way to create an istance of a docker container for each request?
I have a Docker container that takes a very long time to compute a mathematical algorithm. When running, no other requests can be processed in parallel. Lambda Functions would be the best solution, but the container needs to download more than 1gb of data and needs at least 10 cores and 5GB ram to be executed, and therefore Lambda would be too expensive.
We have a big cluster (1000 cores, 0.5TB RAM) and I was considering to use a NGINX Load balancer or a Kubernetes bare metal.
Is it possible to configure in a way that creates an instance per request (similar to a Lambda Function)?
There are tools like Airflow or Argo that are designed for these things.
basically you can create a DAG will run very much like a function as a service but on what ever custom docker container you want.
You probably need to decouple the HTTP service from the backend processing. If the job takes minutes or longer to run, most browsers and other HTTP clients will time out before it will finish, so the HTTP end of it needs to start the job in some way and immediately return some sort of success message.
Once you’ve done that, you might find a job queue like RabbitMQ a useful piece of infrastructure technology. Again, this decouples the queue of jobs from the mechanism to actually run them. In a Docker/Kubernetes space you’d launch some number of persistent workers that all listened to the queue and did work as it appeared there. You wouldn’t necessarily launch one worker per job; or possibly you would have just one worker that launched other Docker containers or Kubernetes Jobs; but if the work backlog got too long you could launch additional workers.
In a pure-Docker space it’s theoretically possible to use the Docker API to launch additional containers. However, doing this gives your process unlimited root-level access to the host; if you are running this in the context of an HTTP server you need to be extremely careful about security considerations. Kubernetes also has an API and from a security point of view this is probably better: you can set up a service account that has permissions only to launch Jobs, and launch a Job per inbound job that arrives. (Security is still important but it’s much harder for a malicious input to root the host.)

What happens to ECS containers that exceed soft memory limit when there is memory contention?

Say I have an instance with 2G memory, and a task/container with 0.5G soft memory limit, and 0.75G hard memory limit.
The instance is running 3 containers, each consuming 0.6G memory. Now a 4th container needs to be added? What happens to the 3 running containers? Is their memory allocation reduced? Or are they migrated to another instance? What if there is no other instance, will the 4th container be placed?
I understand how soft and hard CPU limits work since CPU is a dynamic resource (the application can handle spikes in free CPU). In case of memory, however, you cannot really take away memory from a container that is already using it.
The 4th container will not be able to spawn and you will get the below error.
(service sample) was unable to place a task because no container instance met all of its requirements. The closest matching (container-instance 05016874-f518-4b7a-a817-eb32a4d387f1) has insufficient memory available. For more information, see the Troubleshooting section of the Amazon ECS Developer Guide.
You need to add another ecs instance if you want to schedule the 4th container. all other 3 containers will be in the steady state. Nothing like memory allocation reduced happened in the cluster. If there is no instance your service will always be in an unsteady state and continue to give you the above errors.
Ref: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task_definition_parameters.html
Actually, memory can be reclaimed from running processes. For example the kernel may evict memory that is backed by files (like the code of the process itself). If the data ends up being needed again the kernel can page it back in. This is explained a little in this blog post: https://chrisdown.name/2018/01/02/in-defence-of-swap.html
If the task is scheduled on that node but the kernel fails to reclaim enough memory to avoid an out-of-memory situation then one of the processes will get killed by the kernel, which docker will detect and kill the container, which ECS will notice. I'm not sure if ECS will try to reschedule the dead task on the same instance or a different one. It probably depends.

What's the main advantage of using replicas in Docker Swarm Mode?

I'm struggling to understand the idea of replica instances in Docker Swarm Mode. I've read that it's a feature that helps with high availability.
However, Docker automatically starts up a new task on a different node if one node goes down even with 1 replica defined for the service, which also provides high availability.
So what's the advantage of having 3 replica instances rather than 1 for an arbitrary service? My assumption was that with more replicas, Docker spends less time creating a new instance on another node in the event of failure, which aids performance. Is this correct?
What Makes a System Highly Available?
One of the goals of high availability is to eliminate single points of
failure in your infrastructure. A single point of failure is a
component of your technology stack that would cause a service
interruption if it became unavailable.
Let's take your example of a replica that consists of a single instance. Now let's suppose there is a failure. Docker Swarm will notice that the service failed and restart it. The service restarts, but a restart isn't instant. Let's say the restart takes 5 seconds. For those 5 seconds your service is unavailable. Single point of failure.
What if you had a replica that consists of 3 instances. Now when one of them fails (no service is perfect), Docker Swarm will notice that one of the instances is unavailable and create a new one. During that time you still have 2 healthy instances serving requests. To a user of your service, it appears as if there was no down time. This component is no longer a single point of failure.
ROMANARMY answer is very good and i just wanted to mention that the replicas can be on different nodes, so if one of your servers goes down(become unavailable) the container(replica) on the other server can be run without problem.

Resources