Automate bazel shutdown

Automate bazel shutdown - bazel

I am building a large project on a remote machine using Bazel. Clean build times are around 30 minutes. Incremental builds (changing code in 1-2 files) typically take around 10-20 seconds.
The problem I have is that when I log out of my machine and log back in again after 1-2 days the build command takes around 10 minutes even though I have not modified any source code.
If I call bazel shutdown and then call bazel build again the "no-build" op takes around 5-10 seconds (i.e. much better than the other "no-build" op).
If I log out and log back in again immediately I can see there is still a bazel process running in the background, which disappears when I call bazel shutdown. I am guessing that when I do not shut bazel down properly it gets killed in such a way that corrupts or deletes cached data. The long "no-build" op then spends a long time reconstructing data that was previously stored in the Bazel cache.
Is there a way to automatically shut down the bazel server when I am disconnected? Preferably this should work both when (i) I call exit from the command-line to log out, (ii) I get automatically disconnected through some kind of timeout or interruption in network connectivity.

Set up your development environment so that you sessions do not automatically exit / get killed, e.g., using a tool like screen or tmux. When you want to end a session call bazel shutdown prior to exit. Not completely automated but the point is that you should be in control of when your sessions end.

Related

Properly handle timeout on CloudRun

We use Google Cloud Run to wrap an analysis developed in R behind a web API. For this, we have a small Fastify app that launches an R script and uploads the results to Google Cloud Storage. The process' stdout and stderr are written to a file and are also uploaded at the end of the analysis.
However, we sometimes run into issues when a process takes longer to execute than expected. In these cases, we fail to upload anything and it's difficult to debug, because stdout and stderr are "lost" on the instance. The only thing we see in the Cloud Run logs is this message
The request has been terminated because it has reached the maximum request timeout
Is there a recommended way to handle a request timeout?
In App Engine there used to be a descriptive error: DeadllineExceededError for Python and DeadlineExceededException for Java.
We currently evaluate the following approach
Explicitly set Cloud Run's request timeout
Provide the same value as an environment variable, so it's available to the container
When receiving a request, we start a timer that calls a "cleanup" function just before the timeout is exceeded
The cleanup function stops the running analysis and uploads the current stdout and stderr files to Cloud Storage
This feels a little complicated so any feedback very appreciated.

Since the default timeout is 5 minutes and can extend up to 60 minutes, I would simply start by increasing this to 10 minutes. Then observe over the course of a month how that affects your service.
Aside from that fix, I would start investigating why your process is taking longer than expected and if it's perhaps due to a forever-growing result set.
If there's no result set scalability concern, then bumping the default timeout up from 5-minutes seems to be the most reasonable and simple fix. It would only be a problem until your script has to deal with more data in the future for some reason.

How can I see how long my Cloud Run deployed revision took to spin up?

I deployed a Vue.js and a Kotlin server app. Cloud Run does promise to put a service to sleep if no request to it arise for a specific time. I did not opened my app for a day now. As I opened it - it was available almost immediatly. Since I know how long it takes to spin up when started locally I kinda don't trust the promise that Cloud Run really had put the app to sleep and span it up so crazy fast.
I'd love to know a way how I can really see how long it took for the spinup - also for startup improvement for the backend service.

After having the service inactive for some time, record the time when you request the service URL and request it.
Then go to the logs for the Cloud Run service, and use this filter to see the logs for the service:
resource.type="cloud_run_revision"
resource.labels.service_name="$SERVICE_NAME"
Look for the log entry with the normal app output after your request, check its time and compare it with the recorded time.

You can't know when the instance will be evicted or if it is kept in memory. It could happen quickly, or take hours or days before eviction. it's "serverless".
About the starting time, when I test, I deploy a new revision and I have a try on it. In the logging service, the first log entry of the new revision provides me the cold start duration. (Usually 300+ ms, compare to usual 20 - 50 ms with warm start).
The billing instance time is the sum of all the containers running times. A container is considered as "running" when it process request(s).

PythonOperator task hangs accessing Cloud Storage and is stacked as SCHEDULED

One of the tasks in my DAG sometimes hangs when accessing Cloud Storage. It seems the code stops at the download function here:
hook = GoogleCloudStorageHook(google_cloud_storage_conn_id='google_cloud_default')
for input_file in hook.list(bucket, prefix=folder):
hook.download(bucket=bucket, object=input_file)
In my tests the folder contains a single 20Mb json file.
The task normally takes 20-30 seconds, but in some cases it will run for 5 minutes, and after that its state is updated to SCHEDULED and stuck there (waited for more than 6 hours). I suspect the 5 minutes are due to the configuration scheduler_zombie_task_threshold 300 but not sure.
If I clear the task manually on the Web UI, the task is quickly queued and run again correctly. I am getting around the issue by setting an execution_timeout which updates the task correctly to FAILED or UP_FOR_RETRY state when it takes longer than 10 minutes; but I'd like to fix the underlying issue to avoid relying on a fixed timeout threshold, any suggestions?

There was a discussion on the Cloud Composer Discuss group about this: https://groups.google.com/d/msg/cloud-composer-discuss/alnKzMjEj8Q/0lbp3bTlAgAJ. It is a problem with the Celery executor when Airflow workers die.
Although Composer is working on a fix, if you want this to happen less frequently in the current version, you may consider reducing your parallelism Airflow configuration or creating a new environment with a larger machine-type.

Distributed Erlang - network split recovery and using heart with distributed applications

I have a standard situation, two distributed Erlang nodes, one master one standby.
When I stop the master the standby comes on - failover, when I start the master the standby stops - takeover. Everything works fine as long as heart is not turned on and there is no network split.
However, when I disconnect the master from the network after 60 seconds or so the standby gives me an error message ** removing (timedout) connection ** and starts up as if the master node stopped. This makes sense to me, it doesn't know if the master is alive or not, and epmd can't connect to the master node so it is removed from the nodes() list. Lets pretend for a moment that this is the desired outcome.
The problem is that, when the connection is restored, I have master and standby running at the same time and the standby is oblivious to the fact that the master is running. Pinging the standby during the masters init does not solve the issue. I checked nodes() on the standby after doing so, it sees the master node but still it continues to run.
My solution for now has been to create a process, that monitors all nodes that are above each node in hierarchy and if any of them are online, can be pinged, the process calls erlang:halt() to terminate the standby node. It works for simple situations, but maybe someone can tell me if there is a better way? I found a similar problem described on Elixir forum so it probably a known erlang problem without an easy solution. https://elixirforum.com/t/distributed-application-network-split/10007
If during a network split you don't want to have two nodes running in parallel I'm guessing an outside monitoring application needs to be used?
The second major issue is heart. If heart is turned on, as is, the failover never happens. If heart is running with a sleep before it calls start it stops the failover node when it calls the application start. So even when it can't start the master, do to it not having access to vital resources for example, it stops the failover node, and doesn't bring it back up after it fails to start the master. I don't know if heart is not supposed to be used with a distributed application or if there is an option to run some erlang code to check if the resources are available before attempting a start the node and before stopping the failover node?
The documentation on heart is not great. Very hard to find any examples of HEART_COMMAND. I found a way to set the HEART_COMMAND to a script from within my application, but there is a limit to how long the argument can be, and it's not as long as stated in the documentation from what I can tell. This for example sets a sleep timer for 60 seconds before calling application start again. It doesn't solve any issues, because in 60 seconds it stops the failover node and hangs if master node can't start.
heart:set_cmd("sleep 60; ./bin/myapp start")
The solution I've ended up with for now is letting heart of the main release start another release, a pre-loader, which does a preliminary check that all resources are available and if they are it starts the main release-application, and if they are not it continues checking forever. This way the main app is running on the failover node without interruption. So the main release has heart turned on, and the pre-loader does not. I ended up using a bash script file because I needed to do more work than I could fit in the heart:set_cmd/1, so I'm still calling heart:set_cmd(Dir ++ "/priv/myHeartScript.sh " ++ Arg1 ++ " " ++ Arg2), but don't get carried away with the Args as there is a limit on size! I also used Environment Variables which I set in vm.args using -env to pass data to the script, such as the pre-loader path/name. This allowed me to avoid having to edit the scrip too during deployment.
If anyone has a better solution PLEASE let me know.
UPDATE
The team at Erlang Solutions was kind enough to shed some light onto the subject. Basically, nobody they know uses the Erlangs built in distributed model. Everything revolves around the data, and as long as it is available on redundant databases you can spin up new applications anytime. They recommend using the cloud hosts that can spin up new servers when one goes down or use a redundant node design, so have 5 nodes up in parallel and if a few go down you can restart them manually or by other means.
As for me, I can say that getting heart to start a pre-loader release/app gets the job done but it gets complicated fast. To launch the app now requires provisioning several extra sys.config/vm.args/rebar.config files. I will be looking into their suggestions for the next iteration.
UPDATE
Moved away from using Erlang distributed model. Using RabbitMQ to send heartbeats to all nodes, including itself. If a node is receiving heartbeats from itself and no other node it's the master, if receiving more than one use any attribute like node name to chose the master. You don't have to use RabbitMQ, but you need to make sure all nodes can reach the same destination and consume from it.
Also, devOps oppose using heart. They prefer to use standard Linux tools to monitor application status and restart it after crash or a server reboot.

What could be causing "no action" state?

I have a Bazel repo that builds some artifact. The problem is that it stops half way through hanging with this message:
[3 / 8] no action
What on the earth could be causing this condition? Is this normal?
(No, the problem is not easily reducible, there is a lot of custom code and if I could localize the issue, I'd not be writing this question. I'm interested in general answer, what in principle could cause this and is this normal.)

It's hard to answer your question without more information, but yes it is sometimes normal. Bazel does several things that are not actions.
One reason that I've seen this is if Bazel is computing digests of lots of large files. If you see getDigestInExclusiveMode in the stack trace of the Bazel server, it is likely due to this. If this is your problem, you can try out the --experimental_multi_threaded_digest flag.

Depending on the platform you are running Bazel on:
Windows: I've seen similar behavior, but I couldn't yet determine the reason. Every few runs, Bazel hangs at the startup for about half a minute.
If this is mid-build during execution phase (as it appears to be, given that Bazel is already printing action status messages), then one possible explanation is that your build contains many shell commands. I measured that on Linux (a VM with an HDD) each shell command takes at most 1-2ms, but on my Windows 10 machine (32G RAM, 3.5GHz CPU, HDD) they take 1-2 seconds, with 0.5% of the commands taking up to 10 seconds. That's 3-4 orders of magnitude slower if your actions are heavy on shell commands. There can be numerous explanations for this (antivirus, slow process creation, MSYS being slow), none of which Bazel has control over.
Linux/macOS: Run top and see if the stuck Bazel process is doing anything at all. Try hitting Ctrl+\, that'll print a JVM stacktrace which could help identifying the problem. Maybe the JVM is stuck waiting in a lock -- that would mean a bug or a bad rule implementation.
There are other possibilities too, maybe you have a build rule that hangs.
Does Bazel eventually continue, or is it stuck for more than a few minutes?

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart