How to get Cloud Run to handle multiple simultaneous deployments?

I've got a project with 4 components, and every component has hosting set up on Google Cloud Run, separate deployments for testing and for production. I'm also using Google Cloud Build to handle the build & deployment of the components.
Due to a lack of good webhook events from the source system, I'm currently forced to trigger a rebuild of all components in the project every time there is a new change. That means 8 different images to build and deploy, since testing and production use different build-time settings as well.
I've managed to optimize Cloud Build to handle the 8 concurrent builds pretty nicely, but they all finish around the same time, and then all 8 are pushed to Cloud Run. Cloud Run often does not seem to like this at all and starts throwing errors at me that I've been unable to resolve.
The first and more serious problem is that often only 4-6 of the 8 deployments go through as expected, and the remaining ones are either significantly delayed or just fail: typically the first few go through fine, then a few come through with significant delays, and the final 1-2 just fail. This seems to be caused by some "reconciliation request quota" being exhausted in the region (in this case europe-north1), as that is the error shown at the top of the Cloud Run service view.
Additionally, and mostly just annoyingly, the Cloud Run dashboard itself does not seem to handle having 8 services deployed: just sitting on the dashboard view listing the services regularly throws another error, this one related to read quotas.
I've tried contacting Google via their recommended "Send feedback" button, but have received no reply in over a week (who knows exactly when I sent it, since they don't confirm receipt).
One option to improve the situation would be to deploy the "testing" and "production" variants in different regions, but that would be less than optimal, and this feels like a simple limit configuration somewhere. Are there other options I should consider? Or should I just set up some synchronization so that not all deployments fire at once?
Avoiding the need to build and deploy all components at once is not really an option here, since they have some shared code as well, and when that changes it would still be necessary to support this.

This is an issue with Cloud Run. Developers are expected to be able to deploy many services in parallel.
The bug should be fixed within a few days or a couple of weeks.
[update] The bug should now be fixed.
Make sure to use the --async flag if you want to deploy in parallel: gcloud run deploy $SERVICE --image $IMAGE --async
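For illustration, a minimal sketch of what the parallel deployment could look like (the service names, image path, and wait loop here are placeholders/assumptions, not from the original setup):

# Fire all deployments without blocking on each one
for SERVICE in comp1-test comp1-prod comp2-test comp2-prod; do
  # --async returns immediately instead of waiting for the new revision to be ready
  gcloud run deploy "$SERVICE" \
    --image "gcr.io/$PROJECT_ID/$SERVICE:latest" \
    --region europe-north1 \
    --async
done
# Later, check whether each service finished rolling out
for SERVICE in comp1-test comp1-prod comp2-test comp2-prod; do
  gcloud run services describe "$SERVICE" --region europe-north1 \
    --format 'value(status.conditions[0].type,status.conditions[0].status)'
done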

Related

Cloud Run - OpenBLAS Warnings and application restarts (Not a cold start issue)

Problem
I have an application that has been running on a Cloud Run instance for 5 months now.
The application has a startup time of about 3 minutes, and once startup is over it does not need much RAM.
Here are two snapshots of docker stats when I run the app locally:
When the app is idle:
When the app is receiving 10 requests per second (which is way over our use case for now):
There aren't any problems when I run the app locally; the problems arise when I deploy it to Cloud Run. I keep receiving "OpenBLAS WARNING - could not determine the L2 cache size on this system, assuming 256k" messages, followed by a restart of the app. This is a problem because, as I said, the app takes up to 3 minutes to restart, during which requests take a long time to be handled.
I already fixed the cold start issue by using a minimum instance count of 1 AND using Google Cloud Scheduler to query the service every minute.
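For reference, that setup roughly corresponds to commands like the following (service name, region, and URL are placeholders):

# Keep at least one instance warm
gcloud run services update my-service --region europe-west1 --min-instances 1
# Ping the service every minute via Cloud Scheduler
gcloud scheduler jobs create http my-service-warmup \
  --schedule "* * * * *" \
  --uri "https://my-service-xxxxx-ew.a.run.app/" \
  --http-method GET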
Examples
Here are examples of what I see in the logs.
In the second example, the warnings came once again right after the application restarted, which caused a second restart in a row; this happens quite often.
Also note that these warnings/restarts do not necessarily happen when users are connected to the app; they can happen when the only activity comes from Google Cloud Scheduler.
I tried increasing the allocated resources to 4 CPUs and 4 GB of RAM (which is huge overkill), and yet the problem remains.
Update 02/21
As of 01/01/21 we stopped witnessing this behavior from our Cloud Run service (maybe due to an update, I don't know). I did contact GCP support, but they just told me to raise an issue on the OpenBLAS GitHub repo; since I can't reproduce the behavior, I did not do so. I'll leave the question open, as nothing I did really worked.
OpenBLAS performs high-performance compute optimizations and needs to know the CPU's capabilities to tune itself as well as possible.
However, when you run a container on Cloud Run, you run it in a gVisor sandbox, which increases the security and isolation of all the containers running on the same serverless platform.
This sandbox intercepts low-level kernel calls and discards the abnormal/dangerous ones. I guess that's why OpenBLAS can't determine the L2 cache size. In your local environment you don't have this sandbox, and you can access the CPU info directly.
Why does it restart? It could be a problem with OpenBLAS, or a problem with Cloud Run (a suspicious kernel call leading it to kill the instance and restart it).
I don't have an immediate solution because I don't know OpenBLAS. I had similar behavior with TensorFlow Serving, and TensorFlow offers a compiled version without any CPU optimizations: less efficient, but more portable and resilient to different environment constraints. If a similar build exists for OpenBLAS, it would be worth testing.
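In the same spirit, one thing that might be worth a try (purely an assumption on my part, not a confirmed fix): OpenBLAS honors environment variables that constrain or bypass its hardware detection, so you could pin them on the service instead of letting it probe the sandboxed CPU:

# OPENBLAS_NUM_THREADS avoids thread auto-detection; OPENBLAS_CORETYPE forces a
# kernel selection, but only applies to DYNAMIC_ARCH builds of OpenBLAS.
# "my-service" and the region are placeholders.
gcloud run services update my-service \
  --region europe-west1 \
  --set-env-vars OPENBLAS_NUM_THREADS=1,OPENBLAS_CORETYPE=HASWELL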

Tasks will not run in Spring Cloud Data Flow (Docker/K8S)

Last week I installed the Docker/Kubernetes-based version of Spring Cloud Data Flow.
Although there were no overt errors, things are not working correctly.
I am able to create streams and tasks in the web UI and the Spring Cloud Data Flow Shell, but nothing runs.
I am most interested in Tasks.
When I create them, they all show a Task Status of UNKNOWN.
Unfortunately, no matter how many times I launch them, the status always remains UNKNOWN.
I'm able to delete them, but what magic must I use to make them run?
There's nothing apparent from the description as to what has failed. Perhaps if you can update it with more details, it'd be useful.
From a troubleshooting standpoint, when deploying streams, or if the launch of tasks fails for any reason, the failures will be logged in the SCDF-server/Skipper-server logs. You'd have to tail the logs of the respective pod to learn more about the failures.
Also, it'd be useful to check the output of kubectl describe pod/<POD_NAME> to see what's causing the stream/task pods not to start successfully. The reasons are usually listed towards the end of this command's output.
The usual suspects are pod health-check failures and/or stream/task application Docker images that aren't resolvable at runtime. You'll see the reasons in the logs, of course.
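Concretely, the troubleshooting loop looks something like this (the pod names are placeholders for whatever kubectl get pods shows in your namespace):

kubectl get pods
kubectl logs <scdf-server-pod>       # task launch failures show up here
kubectl logs <skipper-server-pod>    # stream deployment failures show up here
kubectl describe pod <task-pod>      # the Events section at the end explains why it didn't start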
This was a misconfiguration on my end.
I'm able to run as expected now.

Run a large number of tasks on a cluster

I'm looking for a solution for running a large number of tasks and monitoring their status on a cluster.
In detail: each task consists of 3-4 processes, each running in a Docker container (each process is a docker run command). All of the processes have to run on the same server.
The number of tasks we're talking about is bursts of several hundred tasks at a time.
I've looked into several solutions, all of them based on Mesos:
Chronos - Seems like it would falter under high load, and in any case it is more directed towards recurring (cron) jobs, while I need one-time (heavy) jobs.
Custom Mesos FW - Seems too low-level for my needs and would require me to write scheduling and retry mechanisms; I'd save this as a last resort.
Aurora - This seems promising, as each task runs on the same node and is comprised of several processes. I'm missing a couple of things here though: Aurora does not seem to be able to run several distinct tasks as part of a single job. Since my tasks are all similar with different inputs, I could use a single job with many (say 400) instances, where the first process of each task (whose role is to download the input from S3) downloads a different set based on the instance ID. Which brings me to another problem: I can't find a working example of using {{ mesos.instance }} in .aurora files - can anyone give me an example?
Thanks for all the fish, people.
You could also have a look at Kubernetes (which can also be run as a framework on Mesos). Kubernetes has the concept of Pods, which are basically sets of co-located containers. So in your case a pod would consist of your 3-4 processes/containers, and these pods can then be scaled up/down.
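A minimal sketch of that idea, assuming hypothetical image names; all containers in a pod are scheduled onto the same node, which matches the same-server requirement:

kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: task-0001              # one pod per task
spec:
  restartPolicy: Never
  containers:
  - name: downloader           # e.g. fetches this task's input set from S3
    image: example.com/downloader:latest
  - name: worker
    image: example.com/worker:latest
  - name: uploader
    image: example.com/uploader:latest
EOF

Note that plain pod containers run concurrently; if your 3-4 processes must run strictly in sequence, init containers (or a single wrapper entrypoint) would be the way to order them.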
Short comments regarding the other solutions you mentioned:
Chronos: Not really targeting your use case.
Custom FW: Actually not so difficult, but a good call to save this as a last resort.
Aurora: Very powerful, but also a complex framework.
Marathon (which you didn't mention): targeted at long-running applications that can be easily scaled up and down.
In addition to the other excellent answer, you could check out Two Sigma's Cook, which they only recently open-sourced but have been using in production at scale for a while.

travis/coverity: automatically re-schedule a build after given time

I'm using GitHub's integration of Travis CI with Coverity Scan (the free tiers of all these services) to test my FLOSS code.
The problem I'm facing is that when working continuously on the code, I hit the Coverity quota pretty quickly.
Since I'm working on multiple projects simultaneously, it can well happen that I switch away from a given project before I'm allowed to submit to Coverity again, potentially leaving flaws in the code for weeks even though Coverity would have caught them easily.
I would like to avoid this.
The first measure to avoid hitting the quota too frequently is using a dedicated branch (usually coverity_scan) which does not receive pushes as often as the master and/or feature branches.
However, this puts cognitive load on the user (me), which I'd also like to avoid.
Also, I sometimes still hit the quota (some of my projects are in the 100k-500k lines-of-code range, so they have a lower threshold than usual).
What I would like is to automatically re-trigger a Coverity scan once the quota has expired, if (and only if) the current build hit the quota.
Is something like this possible with plain Travis CI/Coverity features?
Or would I have to set up a separate hook that monitors the Coverity quota and the Travis CI builds?
You don't need to run Coverity on every check-in; it's just too slow.
You should configure your (Coverity build) system to poll your repo for changes, but have them checked infrequently - something like a few times per day.
This will trigger the build when things change, but not on every change that is detected.
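A sketch of that polling idea, assuming the Coverity addon is restricted to a dedicated coverity_scan branch: a cron job outside Travis fast-forwards the branch a few times a day, so a scan only happens when there is actually something new:

#!/bin/sh
# hypothetical cron script, run e.g. 3 times per day
git fetch origin
git checkout coverity_scan
git merge --ff-only origin/master   # no-op if nothing changed, so no wasted scan
git push origin coverity_scan       # a push triggers Travis, and thus the Coverity submission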

How are people solving app pool recycle issues on deployment with large apps?

Currently, after a build/deployment of our app (58 projects, large ASP.NET MVC 3 front end), the site takes ~15-20 seconds to load as it goes through the whole 'recycling the app pool' process (release configuration).
We do have a web farm, if that alters people's answers, but the question really is:
What are people doing in large-scale applications, where a maintenance window isn't viable (we're a 24/7, very active website), to minimize that initial 'first hit' on the app pool recycle after a deploy?
We've used a number of tools to analyze that startup time and there doesn't really seem to be any way to bring it down, so what I'm looking for are the techniques people employ to minimize the impact of a large application deploy on users.
By default, if you change 15 files in an ASP.NET application at once (even via FTP), the app pool is automatically recycled. You can change that number, but as soon as web.config or bin files are changed it needs to recycle anyway. So in my opinion the ideal solution for an environment like yours would be as follows:
4 web servers (this is an arbitrary number)
Each server has a status.aspx that the load balancer looks at - use TeamCity to take 2 of these servers "offline" (off the load balancer) and wait 20 seconds for the traffic to filter across. A distributed cache will help avoid user-experience problems.
Use TeamCity to deploy to those 2 servers - run your automated tests etc., and once you are happy put those back into the farm, then take the other 2 offline and deploy to those.
This can all be scripted/automated. The only issue is that schema changes which are not backwards compatible may not allow running the new version of the site in parallel with the old version for the 20 seconds it takes the load balancer to kick back in.
This is good old-fashioned canary releasing - there are some patterns at http://continuousdelivery.com/patterns/ to help take into consideration. I'd also suggest a copy of the Continuous Delivery book - it's like a continuous delivery bible and has got me out of a few situations :)
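As a sketch, the whole flow from this answer might be scripted roughly like this; mark_offline, deploy_to, warm_up, and mark_online are hypothetical helpers standing in for your TeamCity steps and status.aspx toggling:

#!/bin/sh
# hypothetical rolling deploy across a 4-server farm, 2 servers at a time
for BATCH in "web1 web2" "web3 web4"; do
  for SERVER in $BATCH; do mark_offline "$SERVER"; done
  sleep 20                       # let in-flight traffic drain across
  for SERVER in $BATCH; do
    deploy_to "$SERVER"          # run the actual deployment
    warm_up "$SERVER"            # hit key URLs before re-adding (avoids the first-hit delay)
    mark_online "$SERVER"
  done
done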
At the most basic level you could run a TinyGet script against the application after the deployment completes, which will "warm up" the application; however, if a customer hits your site before the script can run, they will still face a delay. What post-deployment steps do you currently have in place?
In a farm environment you could also stage deployments: take one server out of the load balancer, update it and bring it back online after deployment, then take the other out, complete the deployment, and reintroduce it into the farm. How is your SQL Server set up - clustered?
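For the warm-up itself, a plain curl loop works as a stand-in for the TinyGet script mentioned above (the host and URL list are placeholders for your hottest pages):

#!/bin/sh
for URL_PATH in / /products /account/login; do
  # -w prints the status code and total time so you can verify the warm-up worked
  curl -s -o /dev/null -w "%{http_code} %{time_total}s ${URL_PATH}\n" "http://web1.internal${URL_PATH}"
done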
copy and paste from my post here
We operate a Blue/Green deployment strategy on a 4 tier architecture which has a web site over 4 servers at the top tier. Due to the complexity the architecture introduced for deployments, we needed a way to deploy without disturbing any traffic to the "live" site. Following Fowler's advice, but not quite in the same way, we came up with a solution that means we have 2 sites on each server (a blue and a green, or in our case site A and site B). The live site has the appropriate host header, and once we have deployed and tested to the non-live site, we then flip the headers of the 2 sites so that what was once live is now the non-live site, and vice-versa. The effect is, a robust deployment that can be done in business hours and with the highest level of confidence.
This of course complicates your configuration and deployment slightly, but it's worth the effort. I guess it kind of goes without saying that you want to script both the deployment, and the host header swapping.
Firstly, unless you're running Google or something bigger, does a 15-20s load time at 3am for a handful of users really have that much impact? I'd say the effort invested in eliminating the occasional lag would far outweigh the 15-20s inconvenience to a couple of users.
I consider it a necessary evil of using ASP.NET unfortunately. Using a pre-compiled site (.DLLs instead of the code-behind files) will lessen the time but not necessarily eliminate it.
The best thing you can do is use something like a status notification bar to warn users they may experience some "issues" during "essential maintenance".
But even then, I'd say in terms of user experience it'd be better to keep quiet and have a handful of people blame their "slow internet" when your site takes 20s to load on one occasion, than announce to all and sundry that it will be slow.
You can also try this approach: http://weblogs.asp.net/scottgu/archive/2009/09/15/auto-start-asp-net-applications-vs-2010-and-net-4-0-series.aspx
Without knowing anything about your site, my first thought is that you might be able to break it down into smaller sites so that they start faster individually.
Second, with your web farm, I assume you have some sort of load-balancing device in front of it, from which you can pull machines out of the pool while they are being deployed. Don't put them back in the pool until after you have sent a request against the site to get it started up. You should be able to script this such that you are pretty much clicking a button that takes a machine out, deploys to it, and sends a request after it's back up and happy.
You can consider using aspnet_compiler.exe to precompile your application, because I think the delay after deployment is caused by the compilation phase rather than by the app pool recycle itself.
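For what it's worth, a precompile step looks roughly like this (paths are placeholders; -fixednames keeps assembly names stable between builds):

rem hypothetical build step; compiles the site so the first request skips the compile phase
aspnet_compiler.exe -v / -p C:\build\MySite -fixednames C:\build\MySite.Precompiled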
