Readiness check for Google Cloud Run - how?

I've searched quite extensively in the documentation at https://cloud.google.com/run/docs/how-to. I also found the YAML in console.cloud.google.com, but I can't edit it. Is there a way to set it up using a command I might have missed?
EDIT:
I couldn't find anything in https://cloud.google.com/sdk/gcloud/reference/beta/container/clusters/create about it either.
EDIT2:
I'm looking for a way to give Google Cloud Run a readiness check for the app in my container, the same way that Kubernetes does it - example here: https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/. The problem is I don't want my service to be down for 30-60 seconds while the app in the container is still spinning up. Google instantly redirects the traffic, causing users to wait a long time when I push a new build.
EDIT3:
Here's how long the first request takes after I've deployed a new version.
EDIT4:
The app I'm trying to start is in Python. It's a Flask app serving a TensorFlow model. It needs to load several files into memory. This takes only 5-10 seconds on my computer, but as you can see, it takes longer on Cloud Run.

Cloud Run does not have a readiness check other than confirming your service is listening on the specified port. Once that is done, traffic starts routing to the new revision, and previous serving revisions are scaled down as they wrap up in-progress requests.
If your goal is to ensure the service is ready ASAP after deployment, you might make a heavier entrypoint that takes care of more setup tasks.
A "heavier" entrypoint like this will help post-deploy responsiveness, at the cost of slower cold-starts.
Examples of things you can front-load in the entrypoint (whether in BASH scripts or in your service before turning on the HTTP server):
Perform all necessary setup tasks such as loading files into memory.
Establish and preserve in global state any clients or connections to backing services.
Perform health checks from your service code to verify that backing services and resources are available.
Warm up in-container caches to minimize the latency of the first response.
Again, this optimizes for post-deploy response by penalizing all cold starts.
https://cloud.google.com/run/docs/tips#optimizing_performance
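For example, a minimal entrypoint sketch along these lines (the warm-up helper and server command are assumptions for illustration, not part of the question):
#!/usr/bin/env bash
set -euo pipefail

# Do the heavy lifting before the HTTP server opens its port: Cloud Run starts
# routing traffic to a revision as soon as the container listens on $PORT.
# warmup.py is a hypothetical helper that pre-fetches/validates model files and
# pings backing services; in-memory state (e.g. the TensorFlow model) is better
# loaded at import time inside the service itself, so it is ready before the
# port opens.
python warmup.py

# Only now start the real server; exec keeps signal handling intact.
exec python app.py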

Note: There's been a recent update and things have improved.
https://cloud.google.com/run/docs/configuring/healthchecks
"You can configure HTTP and TCP startup health check probes, along with HTTP liveness probes for new and existing Cloud Run services."
Oddly, Cloud Run still doesn't allow readiness probes.
But startup and liveness probes are allowed.
The startup probe acts as a readiness probe for initial startup, as opposed to an ongoing readiness probe. That being said, startup + liveness probes are probably sufficient in the majority of cases: if an instance fails its liveness probe it'll still get removed from load balancing, and the startup probe gives replacements enough time to finish booting up before receiving traffic.
Note that I like to verify things. Cloud Run makes it difficult and unintuitive to verify that the startup probe's initialDelaySeconds works as intended, and its APIs are annoyingly limited in terms of automation friendliness.
That being said, this will give you an idea of how it works:
# Edit Vars
export CR_SERVICE_NAME=startup-probe-test
export PROJECT=chrism-playground
export REGION=us-central1
# Set defaults
gcloud config set project $PROJECT
gcloud config set run/region $REGION
# Deploy a Hello World App for testing purposes
gcloud run deploy $CR_SERVICE_NAME \
--image=us-docker.pkg.dev/cloudrun/container/hello \
--allow-unauthenticated \
--max-instances=1
# Hacky Automation to work around Cloud Run's lack of good automation
gcloud run services describe $CR_SERVICE_NAME --format export > example.yaml
export TOP=$(cat example.yaml | grep " ports:" -B 1000 | grep -v " ports:")
export BOTTOM=$(cat example.yaml | grep " ports:" -A 1000)
export MIDDLE=$(echo """
startupProbe:
initialDelaySeconds: 60
timeoutSeconds: 2
periodSeconds: 10
failureThreshold: 3
httpGet:
path: /healthy
""" | sed '1d')
export REPLACE_NAME=$(echo "$TOP" | grep "name: $CR_SERVICE_NAME-")
export NEW_TOP=$(echo "$TOP" | sed "s/$REPLACE_NAME/      name: $CR_SERVICE_NAME-new/")
echo -e "$NEW_TOP\n$MIDDLE\n$BOTTOM" | tee updated-example1.yaml
export NEW_TOP=$(echo "$TOP" | sed "s/$REPLACE_NAME/      name: $CR_SERVICE_NAME-newest/")
echo -e "$NEW_TOP\n$MIDDLE\n$BOTTOM" | tee updated-example2.yaml
# Run these (should take about 3 mins for both to finish)
gcloud beta run services replace updated-example1.yaml
gcloud beta run services replace updated-example2.yaml
After copy-pasting that bash you should see something like this:
Click the symbol near 1 in the image, then check the boxes to see Revision URLs & Actions, and press OK.
Create revision URLs original & startup-probe as shown, then click each to test.
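If you prefer the CLI over the console for tagging, something along these lines should work (the revision names here are assumptions; substitute the ones shown by gcloud run revisions list):
# Tag the pre-probe revision as "original" and a probed revision as
# "startup-probe" so each gets its own revision URL for testing.
gcloud run services update-traffic $CR_SERVICE_NAME \
  --update-tags=original=$CR_SERVICE_NAME-00001-abc,startup-probe=$CR_SERVICE_NAME-newest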
When I clicked on the original tagged link the site loaded instantly
https://original---startup-probe-test-pxehf3xjcq-uc.a.run.app/
When I clicked on the startup-probe tagged link, the site showed a loading circle for 60 seconds and then worked as expected, which proves the startup probe did its job
https://startup-probe---startup-probe-test-pxehf3xjcq-uc.a.run.app/

They recently also added support for startup and liveness probes in Terraform using the beta provider.
See https://cloud.google.com/run/docs/configuring/healthchecks#terraform_3

Related

Periodic cron-like Functions Across Containers in a Docker Project

I have implemented the LAMP stack for a 3rd party forum application on its own dedicated virtual server. One of my aims here was to use a composed docker project (under Git) to encapsulate the application fully. I wanted to keep this as simple to understand as possible for the other sysAdmins supporting the forum, so this really ruled out using S6 etc., and this in turn meant that I had to stick to the standard of one container per daemon service, using the docker runtime to implement the daemon functionality.
I had one particular design challenge that doesn't seem to be addressed cleanly through the Docker runtime system, and that is I need to run periodic housekeeping activities that need to interact across various docker containers, for example:
The forum application requires a per-minute PHP housekeeping task to be run using php-cli, and I only have php-cli and php-fpm (which runs as the foreground daemon process) installed in the php container.
Letsencrypt certificate renewal needs a weekly certbot script to be run in the apache container's file hierarchy.
I use conventional /var/log based logging for high-volume Apache access logs, as these generate GB-sized access files that I want to retain for ~7 days in the event of needing to do hack analysis, but that are otherwise ignored.
Yes, I could use the host's crontab to run docker exec commands, but this involves exposing application internals to the host system, and IMO this breaks one of my design rules. What follows is my approach to addressing this. My question is really to ask for comments and better alternative approaches; if there are none, then this can perhaps serve as a template for others searching for an approach to this challenge.
All of my containers contain two container-specific scripts: docker-entrypoint.sh, which is a well-documented convention, and docker-service-callback.sh, which is the action mechanism that implements the tasking system.
I have one application-agnostic host systemd service, docker-callback-reader.service, which uses the bash script docker-callback-reader. This services requests on a /run pipe that is volume-mapped into any container that needs to request such event processing.
In practice I have only one such housekeeping container (see here) that implements Alpine crond and runs all of the cron-based events. So, for example, the following entry does the per-minute PHP tasking call:
* * * * * echo ${VHOST} php task >/run/host-callback.pipe
In this case the env variable VHOST identifies the relevant docker stack, as I can have multiple instances (forum and test) running on the server; the next parameter (php in this case) identifies the destination service container; the final parameter (task) plus any optional parameters are passed as arguments to a docker exec of the php container's docker-service-callback.sh, and the magic happens as required.
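For illustration only, the reader side of this boils down to something like the following sketch (the real docker-callback-reader script is in the repo; the container-naming convention here is an assumption):
#!/usr/bin/env bash
# Host-side loop: read "<stack> <service> <action> [args...]" requests from the
# pipe and forward them into the matching container via docker exec.
PIPE=/run/host-callback.pipe
[ -p "$PIPE" ] || mkfifo "$PIPE"

while true; do
  while read -r stack service action args; do
    # Assumes docker-compose style container names like forum_php_1.
    docker exec "${stack}_${service}_1" docker-service-callback.sh "$action" $args
  done < "$PIPE"
done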
I feel that the strengths of the system are that:
Everything is suitably encapsulated. The host knows nothing of the internals of the app other than that any receiving container must have a docker-service-callback.sh on its execution path. The details of each request are implemented internally in the executing container, and are hidden from the tasking container.
The whole implementation is simple, robust and has minimal overhead.
Anyone is free to browse my Git repo and cherry-pick whatever of this they wish.
Comments?

Google Bigtable export hangs, is stuck, then fails in Dataflow. Workers never allocated

I'm trying to use this process:
https://cloud.google.com/bigtable/docs/exporting-sequence-files
to export my bigtable for backup. I've tried bigtable-beam-import versions 1.1.2 and 1.3.0 with no success. The program seems to kick off a Dataflow properly, but no matter what settings I use, workers never seem to get allocated to the job. The logs always say:
Autoscaling: Raised the number of workers to 0 based on the rate of progress in the currently running step(s).
Then it hangs and workers never get allocated. If I let it run, the logs say:
2018-03-26 (18:15:03) Workflow failed. Causes: The Dataflow appears to be stuck. Workflow failed. Causes: The Dataflow appears to be stuck. You can get help with Cloud Dataflow at https://cloud.google.com/dataflow/support.
then it gets cancelled:
Cancel request is committed for workflow job...
I think I've tried changing all the possible pipeline options described here:
https://cloud.google.com/dataflow/pipelines/specifying-exec-params
I've tried turning Autoscaling off and specifying the number of workers like this:
java -jar bigtable-beam-import-1.3.0-shaded.jar export \
--runner=DataflowRunner \
--project=mshn-preprod \
--bigtableInstanceId=[something] \
--bigtableTableId=[something] \
--destinationPath=gs://[something] \
--tempLocation=gs://[something] \
--maxNumWorkers=10 \
--zone=us-central1-c \
--bigtableMaxVersions=1 \
--numWorkers=10 \
--autoscalingAlgorithm=NONE \
--stagingLocation=gs://[something] \
--workerMachineType=n1-standard-4
I also tried specifying the worker machine type. Nothing changes: it always autoscales to 0 and fails. If there are people from the Dataflow team around, you can check out failed job ID: exportjob-danleng-0327001448-2d391b80.
Anyone else experience this?
After testing lots of changes to my GCloud project permissions, checking my quotas, etc., it turned out that my issue was with networking. This Stack Overflow question/answer was really helpful:
Dataflow appears to be stuck
It turns out that our team had created some networks/subnets in the gcloud project and removed the default network. When dataflow was trying to create VMs for the workers to run, it failed because it was unable to do so in the "default" network.
There was no error in the Dataflow logs, just the one above about "dataflow being stuck." We ended up finding a helpful error message in the "Activity" stream on the Google Cloud console home page. We then solved the problem by creating a VPC literally called "default", with subnets called "default" in all the regions. Dataflow was then able to allocate VMs properly.
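For anyone hitting the same thing, the fix described above amounts to roughly this (standard gcloud commands; the firewall rule is an assumption about what your workers may additionally need in order to talk to each other):
# Recreate an auto-mode VPC named "default"; auto mode creates a subnet called
# "default" in every region, which is what Dataflow falls back to.
gcloud compute networks create default --subnet-mode=auto

# Workers typically also need an internal-allow rule to reach each other
# (adjust ranges/protocols to your policy).
gcloud compute firewall-rules create default-allow-internal \
  --network=default --allow=tcp,udp,icmp --source-ranges=10.128.0.0/9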
You should be able to pass the network and subnet as pipeline parameters. That didn't work for us using the BigTable export script provided (link in the question), but if you're writing Java code directly against the Dataflow API, you can probably fix the issue I had by setting the right network and subnet from your code.
Hope this helps anyone who is dealing with the symptoms we saw.

Docker swarm mode load balancing

I've set up a docker swarm mode cluster, with two managers and one worker. This is on Centos 7. They're on machines dkr1, dkr2, dkr3. dkr3 is the worker.
I was upgrading to v1.13 the other day and wanted zero downtime, but it didn't work exactly as expected. I'm trying to work out the correct way to do it, since this is one of the main goals of having a cluster.
The swarm is in 'global' mode. That is, one replica per machine. My method for upgrading was to drain the node, stop the daemon, yum upgrade, start daemon. (Note that this wiped out my daemon config settings for ExecStart=...! Be careful if you upgrade.)
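In command form that per-node procedure looks roughly like this (the package name is an assumption for CentOS 7; run the node update steps from a manager):
docker node update --availability drain dkr3   # move tasks off the node
sudo systemctl stop docker
sudo yum upgrade -y docker-engine              # package name may differ
sudo systemctl start docker
docker node update --availability active dkr3  # allow tasks back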
Our client/ESB hits dkr2, which does its load-balancing magic over the swarm. dkr2 is the leader; dkr1 is 'reachable'.
I brought down dkr3. No issues. Upgraded docker. Brought it back up. No downtime from bringing down the worker.
Brought down dkr1. No issue at first. Still working when I brought it down. Upgraded docker. Brought it back up.
But during startup, it 404'ed. Once up, it was OK.
Brought down dkr2. I didn't actually record what happened then, sorry.
Anyway, while my app was starting up on dkr1, it 404'ed, since the server hadn't started yet.
Any idea what I might be doing wrong? I would suppose I need a health check of some sort, because the container is obviously ok, but the server isn't responding yet. So that's when I get downtime.
You are correct: you need to specify a healthcheck to run against your app inside the container in order to make sure it is ready. Your container will not receive traffic until this healthcheck has passed.
A simple curl to an endpoint should suffice. Use the HEALTHCHECK instruction in your Dockerfile to specify a healthcheck to perform.
An example of the healthcheck line in a Dockerfile to check if an endpoint returned 200 OK would be:
HEALTHCHECK CMD curl -f 'http://localhost:8443/somepath' || exit 1
If you can't modify your Dockerfile, then you can also specify your healthcheck manually at deployment time using the compose file healthcheck format.
If that's not possible either and you need to update a running service, you can do a service update and use a combination of the health flags to specify your healthcheck.
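For example, something like this should add a healthcheck to a service that is already running (the service name and endpoint are placeholders):
docker service update \
  --health-cmd "curl -f http://localhost:8443/somepath || exit 1" \
  --health-interval 10s \
  --health-timeout 5s \
  --health-retries 3 \
  my_service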

Sandbox command execution with docker via Ajax

I'm looking for help in this matter: what options do I have if I want to sandbox the execution of commands that are typed into a website? I would like to create an online interpreter for a programming language.
I've been looking at Docker; how would I use it? Is this the best option?
codecube.io does this. It's open source: https://github.com/hmarr/codecube
The author wrote up his rationale and process. Here's how the system works:
A user types some code in to a box on the website, and specifies the language the code is written in
They click “Run”, the code is POSTed to the server
The server writes the code to a temporary directory, and boots a docker container with the temporary directory mounted
The container runs the code in the mounted directory (how it does this varies according to the code’s language)
The server tails the logs of the running container, and pushes them down to the browser via server-sent events
The code finishes running (or is killed if it runs for too long), and the server destroys the container
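A rough sketch of steps 3-6 (the flags, image name, and time limit are illustrative and not taken from the codecube source):
tmpdir=$(mktemp -d)
printf '%s\n' "$user_code" > "$tmpdir/prog.py"

# --net=none blocks network access, --memory caps RAM, and the temp dir is
# mounted so the runner image can see the submitted file.
cid=$(docker run -d --net=none --memory=128m -v "$tmpdir":/code runner-image /code/prog.py)

docker logs -f "$cid" &   # tail output (the real server streams this to the browser)
sleep 30                  # crude time limit
docker rm -f "$cid"       # kill the container if still running and clean up
rm -rf "$tmpdir"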
The Docker container's entrypoint is entrypoint.sh, which inside a container runs:
prog=$1
<...create user and set permissions...>
sudo -u codecube /bin/bash /run-code.sh $prog
Then run-code.sh checks the extension and runs the relevant compiler or interpreter:
extension="${prog##*.}"
case "$extension" in
"c")
gcc $prog && ./a.out
;;
"go")
go run $prog
;;
<...cut...>
The server that accepts the code examples from the web and orchestrates the Docker containers was written in Go. Go turned out to be a pretty good choice for this, as much of the server relied on concurrency (tailing logs to the browser, waiting for containers to die so cleanup could happen), which Go makes joyfully simple.
The author also details how he implemented resource limiting and isolation, and his thoughts on security.

Using SSH (Scripts, Plugins, etc) to start processes

I'm trying to finish a remote deployment by restarting the two processes that make my Python app work, like so:
process-one &
process-two &
I've tried to "Execute a Shell Script" by doing this
ssh -i ~/.ssh/id_... user@xxx.xxx ./startup.sh
I've tried using the Jenkins SSH Plugin and the Publish Over SSH Plugin and doing the same thing. All of the previous steps, stopping the processes, restarting other services, pulling in new code, work fine. But when I get to the part where I start the services, it executes those two lines, and none of the plugins or the default script execution can get off of the server. They all either hang until I restart Jenkins, or time out in the case of the Publish Over SSH plugin. So my build either requires a restart of Jenkins, or is marked unstable.
Has anyone had any success doing something similar? I've tried
nohup process-one &
But the same thing happened. It's not that the services are messing up, because they actually start properly; it's just that Jenkins doesn't seem to recognize that.
Any help would be greatly appreciated. Thank you.
What probably happens is that the process, when spawned (even with the &), is consuming the same input and output as your ssh connection. Jenkins waits for these pipes to be emptied before the job closes, and thus waits for the processes to exit. You could verify that by killing your processes: you will see that the Jenkins job terminates.
Dissociating outputs and starting the process remotely
There are multiple solutions to your problem:
(preferred) Use proper daemon control tools. Your target platform probably has a standard way to manage those services, e.g. init.d scripts. Note, when writing init.d scripts, make sure you detach the process in the background AND ensure the input/output of the daemon are detached from the shell that starts them. There are several techniques, like the start-stop-daemon tool (http://www.unix.com/man-page/Linux/8/start-stop-daemon/), daemonize, daemontools, or something like the shell script described under https://wiki.jenkins-ci.org/display/JENKINS/Installing+Jenkins+as+a+Unix+daemon (take note of the su -s /bin/sh jenkins -c "YOUR COMMAND; ...disown" part). I also list some Python-specific techniques below.
ssh server 'program < /dev/null > /dev/null 2>&1 &'
or
ssh server 'program < /dev/null >> logfile.log 2>&1 &' if you want crude output management (no log rotation, etc...)
potentially using setsid: https://superuser.com/questions/172043/how-do-i-fork-a-process-that-doesnt-die-when-shell-exits . In my quick tests I wasn't able to get it to work, though...
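Applied to the startup.sh call from the original question, the redirect-and-background approach above looks something like this (the log path is an assumption):
# Detach stdin/stdout/stderr so Jenkins' ssh step is not left holding open pipes.
ssh -i ~/.ssh/id_... user@xxx.xxx \
  'nohup ./startup.sh < /dev/null >> /var/log/myapp-startup.log 2>&1 &'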
Python daemons
The initial question was focused on SSH, so I didn't fully describe how to run the Python process as a daemon. This is mostly covered in other techniques:
with start-stop-daemon: start-stop-daemon and python
with upstart on ubuntu: Run python script as daemon at boot time (Ubuntu)
some more python oriented approaches:
How to make a Python script run like a service or daemon in Linux
Can I run a Python script as a service?
