I have some test and build targets in my Bazel BUILD file that take longer to run than others, so they end up being part of the critical path and always become the long poles for the build/tests to finish while the CPU is almost idle.
I have a theory I could reduce the total build (wall) time by shifting these build targets/tests to start earlier. It would shift some other tests to "fill" the idle CPU time towards the end, resulting in better average resource utilization.
Is there a way I can provide hints to Bazel in the BUILD file about what the critical path is likely to be, so that Bazel can prioritize making progress on that path to avoid it being the long pole?
I have not been able to find much about this in docs.
There are some strategies that can help to optimize build times and reduce the impact of long running targets.
Test Sharding: You can use the "--test_sharding_strategy=explicit" flag and "--test_shard_spread_units=1" flag to control the order of tests. You can also use the "shard_count" attribute in the "test_suite" rule to specify the number of shards for a test. By dividing your tests into smaller parts and running them in parallel, you can reduce the time spent waiting for a single test to complete.
Remote Execution: If you have a large number of machines, you can use Bazel's remote execution support to distribute the build and test workload across multiple machines. This can help reduce the impact of a single long running target.
Target Caching: You can use the "--remote_cache" flag to specify a remote cache for build artifacts. This allows Bazel to reuse build artifacts from previous builds, reducing the time spent building the same targets over and over again.
Heuristics: Bazel uses heuristics to determine the order in which to build targets. Factors such as the number of dependencies and the size of the outputs are taken into consideration. By structuring your build files in a way that minimizes the number of dependencies and reduces the size of the outputs, you can help Bazel make better decisions about the build order.
Profiling: You can use the "--profile" flag to enable profiling in Bazel. This generates a detailed report of the build process, including information about the time spent building each target. This information can help you identify opportunities for optimization.
Related
I'm trying to build a Jenkins declarative pipeline that will build on all agents in parallel.
How can I do this without disabling sandbox?
I have come across this page: https://jenkins.io/blog/2017/09/25/declarative-1/ but it seems repetitive, especially when padded out with my code as nearly all operations are performed almost the same on every node. Is there a way to do this and avoid repeating code?
I suggest that you follow the common pattern described in the referenced article.
By assigning labels identifying the node's operating system and allocating nodes based on these labels, you ensure that the job runs exactly once in each of the different build environments.
A severe drawback of your suggestion to build on all of the available agents (as said, I don't know anything how to actually do that)) would be in the case of one or multiple build agents being offline. So you don't run on Windows, because the server was just rebooting, but your build result is green as nothing failed? Not a good idea, isn't it?
Another benefit of the label-based approach is that you can easily add additional build agents to cope with increased number of builds, e.g., as your team grows. You don't want to build twice on Windows, when you add another build agent with Windows, right?
So I strongly recommend: Assign labels to your build agents and then specify, on which agents your job needs to run.
I have a Bazel repo that builds some artifact. The problem is that it stops half way through hanging with this message:
[3 / 8] no action
What on the earth could be causing this condition? Is this normal?
(No, the problem is not easily reducible, there is a lot of custom code and if I could localize the issue, I'd not be writing this question. I'm interested in general answer, what in principle could cause this and is this normal.)
It's hard to answer your question without more information, but yes it is sometimes normal. Bazel does several things that are not actions.
One reason that I've seen this is if Bazel is computing digests of lots of large files. If you see getDigestInExclusiveMode in the stack trace of the Bazel server, it is likely due to this. If this is your problem, you can try out the --experimental_multi_threaded_digest flag.
Depending on the platform you are running Bazel on:
Windows: I've seen similar behavior, but I couldn't yet determine the reason. Every few runs, Bazel hangs at the startup for about half a minute.
If this is mid-build during execution phase (as it appears to be, given that Bazel is already printing action status messages), then one possible explanation is that your build contains many shell commands. I measured that on Linux (a VM with an HDD) each shell command takes at most 1-2ms, but on my Windows 10 machine (32G RAM, 3.5GHz CPU, HDD) they take 1-2 seconds, with 0.5% of the commands taking up to 10 seconds. That's 3-4 orders of magnitude slower if your actions are heavy on shell commands. There can be numerous explanations for this (antivirus, slow process creation, MSYS being slow), none of which Bazel has control over.
Linux/macOS: Run top and see if the stuck Bazel process is doing anything at all. Try hitting Ctrl+\, that'll print a JVM stacktrace which could help identifying the problem. Maybe the JVM is stuck waiting in a lock -- that would mean a bug or a bad rule implementation.
There are other possibilities too, maybe you have a build rule that hangs.
Does Bazel eventually continue, or is it stuck for more than a few minutes?
I wonder why increasing of --ram_utilization_factor is not recommended (from the docs):
This option, which takes an integer argument, specifies what percentage of the system's RAM Bazel should try to use for its subprocesses. This option affects how many processes Bazel will try to run in parallel. The default value is 67. If you run several Bazel builds in parallel, using a lower value for this option may avoid thrashing and thus improve overall throughput. Using a value higher than the default is NOT recommended. Note that Bazel's estimates are very coarse, so the actual RAM usage may be much higher or much lower than specified. Note also that this option does not affect the amount of memory that the Bazel server itself will use.
Since Bazel has no way of knowing how much memory an action/worker uses/will use, the only way of setting this up seems ram_utilization_factor.
That comment is very old and I believe was the result of some trial and error when --ram_utilization_factor was first implemented. The comment was added to make sure that developers would have some memory left over for other applications to run on their machines. As far as I can tell, there is no deeper reason for it.
In Jenkins I have 100 java projects. Each has its own build file.
Every time I want clear the build file and compile all source files again.
Using bulkbuilder plugin I tried compling all the jobs.. Having 100 jobs run parallel.
But performance is very bad. Individually if the job takes 1 min. in the batch it takes 20mins. More the batch size more the time it takes.. I am running this on powerful server so no problem of memory and CPU.
Please Suggest me how to over come this.. what configurations need to be done in jenkins.
I am launching jenkins using war file.
Thanks..
Even though you say you have enough memory and CPU resources, you seem to imply there is some kind of bottleneck when you increase the number of parallel running jobs. I think this is understandable. Even though I am not a java developer, I think most of the java build tools are able to parallelize build internally. I.e. building a single job may well consume more than one CPU core and quite a lot of memory.
Because of this I suggest you need to monitor your build server and experiment with different batch sizes to find an optimal number. You should execute e.g. "vmstat 5" while builds are running and see if you have idle cpu left. Also keep an eye on the disk I/O. If you increase the batch size but disk I/O does not increase, you are consuming all of the I/O capacity and it probably will not help much if you increase the batch size.
When you have found the optimal batch size (i.e. how many executors to configure for the build server), you can maybe tweak other things to make things faster:
Try to spend as little time checking out code as possible. Instead of deleting workspace before build starts, configure the SCM plugin to remove files that are not under version control. If you use git, you can use a local reference repo or do a shallow clone or something like that.
You can also try to speed things up by using SSD disks
You can get more servers, run Jenkins slaves on them and utilize the cpu and I/O capacity of multiple servers instead of only one.
I've been looking around online at ways of improving our build time (which is currently ~30-40 minutes, depending on which build agent gets the task), and one common theme I've seen is use CI builds.
I understand the logic behind this, and it makes sense that it would reduce the time each build takes. Our problem, however, is that building on every check-in is a pointless use of our resources, because in our development branch, we only keep the latest successful build. This means that if 2 people check-in in a short space of time, whoever checked-in last will be the one whose build is kept.
It's this reason (along with disk space limitations) that we changed to using Rolling Builds, so that we only built the development branch a maximum of once every 45 minutes (obviously we could manually trigger builds on otp of that).
What I want to know (and haven't been able to find anywhere) is whether there's a way of combining rolling builds AND continuous integration. So keep building only once every 45 minutes, but only get and build files that have changed.
I'm not even sure it's possible, and if not then I'll look into other ways, but this seems like something that should be possible.