bazel: why do targets keep increasing if build graph is known

bazel: why do targets keep increasing if build graph is known - bazel

Per this doc, the analysis and execution phases handle building out the dependency tree (among other things) and going and doing the work if needed, respectively. If that's true, I'm curious why the total number of targets keeps increasing as the build progresses (i.e., when I start a large build, bazel may report that it's built 5 out of 100 targets, but later will say it's built 20 out of 300 targets, and so forth, with the denominator increasing for a while until it levels off).
I've heard the loading and analysis phases can be intermixed. My likely incomplete or incorrect understanding is that when bazel parses a BUILD file, analysis is invoked to determine what dependencies are needed for the requested targets on the command line, and then I guess this is somehow communicated back to the loader to pull in any other BUILD files referenced by these dependencies, which could cause the loader to go out and fetch a remote repo if the dependency (and thus BUILD file) is not in the local repo.
However, my understanding was also that while dynamic build-out of the dependency graph was a potential future direction for bazel, that currently, execution does not intermix with analysis, and thus when execution begins, the full dependency tree should be available to bazel (and thus the total number of targets known)? Does bazel have the full tree, but just not want to traverse the tree to get a count in case it's big, or is something else going on here?
Note: I found a brief mention of this phenomenon here, but without an explanation as to why it happens.

The number you're seeing in the progress bar refers to actions (command-lines..ish) and not targets (e.g. //my:target). I wrote a blog post about the action graph, and here's the relevant description about it:
The action graph contains a different set of information: file-level
dependencies, full command lines, and other information Bazel needs to
execute the build. If you are familiar with Bazel’s build phases, the
action graph is the output of the loading and analysis phase and used
during the execution phase.
However, Bazel does not necessarily execute every action in the graph.
It only executes if it has to, that is, the action graph is the super
set of what is actually executed.
As to why the denominator is ever-increasing, it's because the actions-to-execute discovery within the action graph is lazy. Here's a better explanation from the Bazel TL, Ulf Adams:
The problem is that Skyframe does not eagerly walk the action graph,
but it does it lazily. The reason for that is performance, since the
action graph can be rather large and this was previously a blocking
operation (where Bazel would just hang for some time). The downside is
that all threads that walk the action graph block on actions that they
execute, which delays discovery of remaining actions. That's why the
number keeps going up during the build.
Source: https://github.com/bazelbuild/bazel/issues/3582#issuecomment-329405311

Related

What's the difference between invocation_id and build_id in Bazel?

Looking at the outputs from Bazel's build event protocol, I can see that the BEP output only has one unique invocation_id (which makes sense) and also only one unique build_id (even when building multiple targets at once).
In that case, what's the point of build_id?

A higher-level system that uses Bazel (e.g., a CI system) may have a concept of a build that is broader than one Bazel invocation. For instance, the higher-level system may retry entire Bazel invocations under certain circumstances or allow a "build" to contain multiple Bazel build or test steps. The build id allows multiple Bazel invocations composing a high-level build to be correlated in Bazel's metadata emissions (most notably, the Build Event Protocol).

Running bazel query without downloading external dependencies

Given a rather big repository built with bazel and tons of third party dependencies in multiple languages (including heavy docker containers), I have the following problem:
running Bazel queries triggers the downloading of many of these dependencies, resulting in slow query performance. Hence, the question:
Is there a way to run bazel query without having to download the dependencies?
Typical query: bazel query 'kind("source file", deps(//...) except deps(//3rdparty/...))
I'm aware of the caching options, which I mostly use, but depending on the languages, things can still be slow.

After asking on Bazel's Slack channel, the response (from Sahin Yort) is not encouraging:
I don’t believe that’s possible due to nature of workspace files. loads from a workspace leads to fetch of the given workspace because has to expand the workspaces in order to know their targets. at that point, it is up to the repository rule to fetch whatever it needs to fetch eagerly or lazily. workspace rules usually expand BUILD files using various patterns. eg running a executable or using expand_template. i have little faith in that it is possible to get what you want.
I'll be looking into other ways to speed things up: a likely culprit for the slowness is probably the action/analysis cache being invalidated due to some flags changing.

If you only need to parse the AST without resolving dependencies, you can use buildozer instead:
buildozer "print srcs" "//some:target"
It also supports -output_json for machine readable output

How do I get workspace status in bazel

I would like to version build artefacts with build number for CI passed to bazel via workspace_status_command. Sometimes I would like to include build number to the name of the artefact.
Is there a way how do I access ctx when writing a macro(as I was trying to use ctx.info_file)? So far it seems that I am able to access such info just in new rule when creating a new rule which in this case is a bit awkward.
I guess that having a build number or similar info is pretty common use case so I wonder if thre is a simpler way how to access such info.

No, you really need to define a custom rule to be able to consume information passed from workspace_status_command through info_file and version_file file and even then you cannot just access it's values from Starlark, you can pass the file to your tooling (wrapper) and process the inputs there. After all, (build) rules do not execute anything, they emit actions to be executed at a later phase.
Be careful though, because if you depend on info_file (STABLE_* entries), changes to the file invalidate targets depending on it. For something like CI build number, it's usually not what you want and version_file is more likely what you are after. You may want to record the id, but you usually do not want to rebuild stuff just because the build ID has changed (it's a new CI run). However, even simple inclusion of IDs could be considered problematic, if you want your results to be completely reproducible.
Having variable artifact names is a whole new problem and there would be good reasons why not to. But generally since as proposed the name would be decided during execution of actions (reading in version_file in your tool), you're past the analysis phase to decide what comes out of the action. The only way I am currently aware of (that is for out of tree source of variable input, you can of course always define a Starlark variable and load it from your BUILD file) to be able to do that is to use tree artifacts (using declare_directory in your rule.

Jenkins workspaces and concurrent builds, how do they work?

I am currently learning the ins and outs of Jenkins and Pipeline.
One thing I do not yet understand is the following:
A Jenkins job by default can be executed concurrently (I can check the checkbox "Do not allow concurrent builds" if I don't want that).
What I don't understand is the following:
Let say Jenkins checks out code in /var/lib/jenkins/workspace/my-project-workspace/
Now how would it be possible to run concurrent builds without conflicts?
Let's say that build nr 1 checks out code in that path and starts testing it, and while doing that, build nr 2 is started and checks out code in that same path.
How will that not conflict with build nr 1?
I am probably missing something obvious here... Please help :)

The subdirectory inside the workspace/ folder will not always be your project name, but a (randomly) generated directory name. That's all the magic.

When this option is checked, multiple builds of this project may be executed in parallel.
By default, only a single build of a project is executed at a time — any other requests to start building that project will remain in the build queue until the first build is complete.
This is a safe default, as projects can often require exclusive access to certain resources, such as a database, or a piece of hardware.
But with this option enabled, if there are enough build executors available that can handle this project, then multiple builds of this project will take place in parallel. If there are not enough available executors at any point, any further build requests will be held in the build queue as normal.
Enabling concurrent builds is useful for projects that execute lengthy test suites, as it allows each build to contain a smaller number of changes, while the total turnaround time decreases as subsequent builds do not need to wait for previous test runs to complete.
This feature is also useful for parameterized projects, whose individual build executions — depending on the parameters used — can be completely independent from one another.
Each concurrently executed build occurs in its own build workspace, isolated from any other builds. By default, Jenkins appends "#" to the workspace directory name, e.g. "#2".
The separator "#" can be changed by setting the hudson.slaves.WorkspaceList Java system property when starting Jenkins. For example, "hudson.slaves.WorkspaceList=-" would change the separator to a hyphen.
For more information on setting system properties, see the wiki page.
However, if you enable the Use custom workspace option, all builds will be executed in the same workspace. Therefore caution is required, as multiple builds may end up altering the same directory at the same time. enter image description here

Power tradeoff between buildscript and CI server

Although this question specifically involves Gradle and Bamboo, it really is a question about any build system (Ant/Maven/Gradle/etc.) and any CI tool (Bamboo/Jenkins/Hudson/etc.).
I was always under the impression that the purpose of a CI build is to:
Check out code from VCS
Run a buildscript (Gradle, etc.)
Deploy a binary (WAR, etc.) to an environment
Hence, all the guts and heavy-lifting (running automated tests, code analysis, test coverage, compiling, Javadocs, packaging, etc.) was all to be done from inside the buildscript.
But Bamboo seems to allow you to break this heavy-lifting out of the buildscript and into Bamboo itself. In Bamboo, you can add build stages and decompose the stages into tasks. Each task is something just as atomic/fundamental as an Ant task.
So it got me thinking: how much should one empower the CI tool? What typical buildscript functionality should be transferred over to Bambooo/CI? For instance, should I be compiling from a Gradle task, or from a Bamboo task? Same goes for all tasks/stages.
For some reason, I view this as the same problem as to whether or not to use stored procedures or put the data processing all at the application layer. What are the pros/cons of each approach?

TL;DR at the bottom
My experience is with Jenkins, so examples will relate to that.
One thing with any build system (be it CI server or a buildscript), is that it should be stable, simple and self-contained so that an untrained receptionist (with printed instructions and proper credentials) could do it.
Ease of use and re-use
Based on the above, one would think that a buildscript wins. Not always. As with the receptionist example, it's about easy of use and easy of reproducibility.
If a buildscript has interdependent build targets that only work in correct order, dependence on pre-supplied property files that have to be adjusted for the correct branch ahead of build, reliance on environment variables that no-one remembers who created in the first place, and a supply of SCM revision numbers that have to be obtained by looking at the log of the commits for the last month... This is in no way better than a Jenkins job that can be triggered with a single button.
Likewise, a Jenkins workflow could be reliant on multiple dependant jobs, each being manually pre-configured before the build, and need artifacts uploaded from one place to another... which no receptionist will do.
So, at this point, a self-contained good buildscript that only requires ant build command to do everything from beginning to end, is just as good as a Jenkins job that only required build now... button to be pressed.
Self-contained
It is easy to think that since Jenkins will (at some point) end up calling at least a portion of a buildscript (say ant compile), that Jenkins is "compartmentalizing" the buildscript into multiple steps, thus breaking away from being self-contained.
However, instead you should zoom out by one level, and treat the whole Jenkins job configuration as a single XML file (which, by the way, can be stored and versioned through an SCM just like the buildscript)
So, at this point, it doesn't matter if the whole build logic is inside a single buildfile, or a single XML job configuration file. Both can be self-contained when done right.
The devil you know
In majority of cases, it comes down to what you know.
Some people find it easier to use Jenkins UI to visually arrange their build workflow, reporting, emailing, and archiving (and for anything that doesn't fit as wanted, find a plugin). For them, figuring out a build script language is more time consuming then simply trying it in UI.
Others prefer to know exactly what every single line of their build script does, and don't like giving control to some piece of foreign code obfuscated by UI.
Both points have merits from all sides Quality-Time-Budget triangle
The presentation
So far, things have been more or less balanced. However:
My Jenkins will email a detailed HTML report with a link to a job page and send it straight up to the (non tech-savvy) CEO. He can look at the list of latest builds, along with SCM changes for each build, linking him to JIRA issues fixed for each build (all hyperlinks to relevant places). He can select the build with the set of changes that he wants, and click "install iOS package" right off his iPad that he just used to view all this information. Meanwhile I can go to the same job page, and review the build logs and artifacts of each log, check the build time trends and compare the parameters that were used between the failing and succeeding jobs (and I didn't have to write any echos to display that, it's just all there, cause Jenkins does that for you)
With a buildscript, even if you piped the output to a file, would you send that to your (non tech-savvy) CEO? Unlikely. But wait, you know this devil very well. A few quick changes and hacks, couple Red Bulls... and months of thankless work (mostly after-hours) later... you've created a buildscript that will create and start a webserver, prepare HTML reports, collect statistics and history, email all the relevant people, and publish everything on a webpage, just like Jenkins did. (Ohh, if people could only see all the magic you did escaping and sanitizing all that HTML content in a buildscript). But wait... this only works for a single project.
So, a full case of Red Bulls later, you've managed to make it general enough to build any project, and you've created...
Another Jenkins/Bamboo/CI-server
Congratulations. Come up with a name, market it, and make some cash of it, cause this ultimate buildscript just became another CI solution a la Jenkins.
TL;DR:
Provided the CI-server can be configured simply and intuitively so that a receptionist could run the build, and provided the configuration can be self-contained (through whatever storage method the CI-server uses) and versioned in SCM, it all comes down to the Quality-Time-Budget triangle.
If you have little time and budget to learn the CI server, you can still greatly increase the quality (at least of the presentation) by embracing the CI-server's way of organizing stuff.
If you have unlimited time and budget, by all means, make your own Jenkins with the buildscript.
But considering the "unlimited" part is rather unrealistic, I would embrace the CI-server as much as possible. Yes, it's a change. However a little time invested in learning the CI-server and how it compartmentalizes or breaks into tasks the different parts of the build flow, this time spent can go a long way to increasing the quality.
Likewise, if you have no time and/or budget, figuring out the quirks of all the plugins/tasks/etc and how it all comes together will only bring your overall quality down, or even drag the time/budget down with it. In such cases, use the CI-server for bare minimum needed to trigger your existing buildscripts. However, in some cases, the "bare minimum" is no better than not using the CI-server in the first place. And when you are at this place... ask yourself:
Why do you want a CI-server in the first place?

Personally (and with today's tools), I'd take a pragmatic approach. I'd do as much as feasible on the build side (clearly better from an automation perspective), and the rest (e.g. distribution of work across machines) on the CI server. Anything that a developer might want to do on his own machine should definitely be automated on the build level. As to the concrete steps you gave, I'd generally check out code from the CI server, and deploy binaries from the build. I'd try to make every CI job look the same, invoking the build tool in the same way (e.g. gradlew ciBuild).
In Bamboo, you can add build stages and decompose the stages into tasks. Each task is something just as atomic/fundamental as an Ant task.
To some extent, this overlap in functionality is natural, as neither build tool nor CI server can assume existence of the other, and both want to provide as complete a solution as possible.
For some reason, I view this as the same problem as to whether or not to use stored procedures or put the data processing all at the application layer.
It's not an unfair comparison, and hence opinions will be as diverse, contextual, and nuanced.
Disclaimer: I'm a Gradle(ware) developer.

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart