Our Bazel builds sometimes get stuck and hit timeouts, so we lose all build logs when the VM is killed. To find the cause, we want to use the Build Event Protocol to see which rules started executing but did not finish (usually these are memory-hungry tests).
This graph from the official docs shows that the TargetConfigured and TargetCompleted events are the only events between rule start and finish.
But in reality Bazel configures all targets at the same time, so we cannot just subtract the TargetConfigured time from the TargetCompleted time.
Moreover, neither event contains a timestamp. Here is the build event file from the sample repo (truncated):
{"id":{"targetConfigured":{"label":"//:B"}},"children":[{"targetCompleted":{"label":"//:B","configuration":{"id":"f157fdcaf05e7672fa1bf535fbb2c3edb004ce9e9a7f6d84d9bf031454e2fb64"}}}],"configured":{"targetKind":"java_binary rule","tag":["__JAVA_RULES_MIGRATION_DO_NOT_USE_WILL_BREAK__"]}}
{"id":{"targetConfigured":{"label":"//:main"}},"children":[{"targetCompleted":{"label":"//:main","configuration":{"id":"f157fdcaf05e7672fa1bf535fbb2c3edb004ce9e9a7f6d84d9bf031454e2fb64"}}}],"configured":{"targetKind":"java_library rule","tag":["__JAVA_RULES_MIGRATION_DO_NOT_USE_WILL_BREAK__"]}}
{"id":{"targetConfigured":{"label":"//:step1"}},"children":[{"targetCompleted":{"label":"//:step1","configuration":{"id":"f157fdcaf05e7672fa1bf535fbb2c3edb004ce9e9a7f6d84d9bf031454e2fb64"}}}],"configured":{"targetKind":"genrule rule"}}
{"id":{"progress":{"opaqueCount":2}},"children":[{"progress":{"opaqueCount":3}},{"namedSet":{"id":"0"}}],"progress":{"stderr":"\r\u001b[1A\u001b[K\u001b[32mAnalyzing:\u001b[0m 3 targets (0 packages loaded, 0 targets configured)\n\r\u001b[1A\u001b[K\u001b[32mINFO: \u001b[0mAnalyzed 3 targets (0 packages loaded, 0 targets configured).\n\n\r\u001b[1A\u001b[K\u001b[32mINFO: \u001b[0mFound 3 targets...\n\n\r\u001b[1A\u001b[K\u001b[32m[0 / 1]\u001b[0m [Prepa] BazelWorkspaceStatusAction stable-status.txt\n"}}
{"id":{"workspaceStatus":{}},"workspaceStatus":{"item":[{"key":"BUILD_EMBED_LABEL"},{"key":"BUILD_HOST","value":"mtymchuk"},{"key":"BUILD_TIMESTAMP","value":"1598888970"},{"key":"BUILD_USER","value":"mikhailtymchuk"}]}}
{"id":{"namedSet":{"id":"0"}},"namedSetOfFiles":{"files":[{"name":"B.jar","uri":"file:///private/var/tmp/_bazel_mikhailtymchuk/3bd90847b9f03e9e5c46f99d542eb754/execroot/__main__/bazel-out/darwin-fastbuild/bin/B.jar","pathPrefix":["bazel-out","darwin-fastbuild","bin"]},{"name":"B","uri":"file:///private/var/tmp/_bazel_mikhailtymchuk/3bd90847b9f03e9e5c46f99d542eb754/execroot/__main__/bazel-out/darwin-fastbuild/bin/B","pathPrefix":["bazel-out","darwin-fastbuild","bin"]}]}}
{"id":{"targetCompleted":{"label":"//:B","configuration":{"id":"f157fdcaf05e7672fa1bf535fbb2c3edb004ce9e9a7f6d84d9bf031454e2fb64"}}},"completed":{"success":true,"outputGroup":[{"name":"default","fileSets":[{"id":"0"}]}],"tag":["__JAVA_RULES_MIGRATION_DO_NOT_USE_WILL_BREAK__"],"importantOutput":[{"name":"B.jar","uri":"file:///private/var/tmp/_bazel_mikhailtymchuk/3bd90847b9f03e9e5c46f99d542eb754/execroot/__main__/bazel-out/darwin-fastbuild/bin/B.jar","pathPrefix":["bazel-out","darwin-fastbuild","bin"]},{"name":"B","uri":"file:///private/var/tmp/_bazel_mikhailtymchuk/3bd90847b9f03e9e5c46f99d542eb754/execroot/__main__/bazel-out/darwin-fastbuild/bin/B","pathPrefix":["bazel-out","darwin-fastbuild","bin"]}]}}
{"id":{"progress":{"opaqueCount":3}},"children":[{"progress":{"opaqueCount":4}},{"namedSet":{"id":"1"}}],"progress":{}}
{"id":{"namedSet":{"id":"1"}},"namedSetOfFiles":{"files":[{"name":"libmain.jar","uri":"file:///private/var/tmp/_bazel_mikhailtymchuk/3bd90847b9f03e9e5c46f99d542eb754/execroot/__main__/bazel-out/darwin-fastbuild/bin/libmain.jar","pathPrefix":["bazel-out","darwin-fastbuild","bin"]}]}}
{"id":{"targetCompleted":{"label":"//:main","configuration":{"id":"f157fdcaf05e7672fa1bf535fbb2c3edb004ce9e9a7f6d84d9bf031454e2fb64"}}},"completed":{"success":true,"outputGroup":[{"name":"default","fileSets":[{"id":"1"}]}],"tag":["__JAVA_RULES_MIGRATION_DO_NOT_USE_WILL_BREAK__"],"importantOutput":[{"name":"libmain.jar","uri":"file:///private/var/tmp/_bazel_mikhailtymchuk/3bd90847b9f03e9e5c46f99d542eb754/execroot/__main__/bazel-out/darwin-fastbuild/bin/libmain.jar","pathPrefix":["bazel-out","darwin-fastbuild","bin"]}]}}
{"id":{"progress":{"opaqueCount":4}},"children":[{"progress":{"opaqueCount":5}},{"namedSet":{"id":"2"}}],"progress":{}}
{"id":{"namedSet":{"id":"2"}},"namedSetOfFiles":{"files":[{"name":"step1_output.txt","uri":"file:///private/var/tmp/_bazel_mikhailtymchuk/3bd90847b9f03e9e5c46f99d542eb754/execroot/__main__/bazel-out/darwin-fastbuild/bin/step1_output.txt","pathPrefix":["bazel-out","darwin-fastbuild","bin"]}]}}
{"id":{"targetCompleted":{"label":"//:step1","configuration":{"id":"f157fdcaf05e7672fa1bf535fbb2c3edb004ce9e9a7f6d84d9bf031454e2fb64"}}},"completed":{"success":true,"outputGroup":[{"name":"default","fileSets":[{"id":"2"}]}],"importantOutput":[{"name":"step1_output.txt","uri":"file:///private/var/tmp/_bazel_mikhailtymchuk/3bd90847b9f03e9e5c46f99d542eb754/execroot/__main__/bazel-out/darwin-fastbuild/bin/step1_output.txt","pathPrefix":["bazel-out","darwin-fastbuild","bin"]}]}}
So, is it possible to extract a target's build start time from the Build Event Protocol (or via some other method)?
On the console, if that helps, you should be able to get this information by combining --subcommands (or -s), which prints commands as they are executed, with --show_timestamps, which adds timestamps to all messages emitted.
It's not quite what you're asking for (I am not sure timestamps can be added to the Build Event Protocol just by configuration), but it may help with the debugging quest.
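Even without timestamps, the event stream itself can tell you which targets never finished: any label that appears in a targetConfigured id but has no matching targetCompleted event was still in flight when the VM died. Here is a minimal Python sketch, assuming the build was run with --build_event_json_file=build_events.json (the file name is arbitrary):

import json

# Collect labels from targetConfigured and targetCompleted event ids,
# matching the JSON structure shown in the sample above.
configured, completed = set(), set()
with open("build_events.json") as f:
    for line in f:
        event_id = json.loads(line).get("id", {})
        if "targetConfigured" in event_id:
            configured.add(event_id["targetConfigured"]["label"])
        elif "targetCompleted" in event_id:
            completed.add(event_id["targetCompleted"]["label"])

# Anything configured but never completed was unfinished at the cutoff.
for label in sorted(configured - completed):
    print("configured but never completed:", label)

On a killed VM, you would run this against whatever portion of the event file was flushed to disk before the machine went away.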
I am using terraform resource google_dataflow_flex_template_job to deploy a Dataflow flex template job.
resource "google_dataflow_flex_template_job" "streaming_beam" {
provider = google-beta
name = "streaming-beam"
container_spec_gcs_path = module.streaming_beam_flex_template_file[0].fully_qualified_path
parameters = {
"input_subscription" = google_pubsub_subscription.ratings[0].id
"output_table" = "${var.project}:beam_samples.streaming_beam_sql"
"service_account_email" = data.terraform_remote_state.state.outputs.sa.email
"network" = google_compute_network.network.name
"subnetwork" = "regions/${google_compute_subnetwork.subnet.region}/subnetworks/${google_compute_subnetwork.subnet.name}"
}
}
It's all working fine; however, without my requesting it, the job seems to be using Flexible Resource Scheduling (FlexRS) mode. I say this because the job takes about ten minutes to start, and during that time has state=QUEUED, which I think is only applicable to FlexRS jobs.
Using FlexRS mode is fine for production scenarios, but I'm currently still developing my Dataflow job, and while doing so FlexRS is massively inconvenient: it takes about ten minutes to see the effect of any change I make, no matter how small.
In Enabling FlexRS it is stated:
To enable a FlexRS job, use the following pipeline option:
--flexRSGoal=COST_OPTIMIZED, where the cost-optimized goal means that the Dataflow service chooses any available discounted resources or
--flexRSGoal=SPEED_OPTIMIZED, where it optimizes for lower execution time.
I then found the following statement:
To turn on FlexRS, you must specify the value COST_OPTIMIZED to allow the Dataflow service to choose any available discounted resources.
at Specifying pipeline execution parameters > Setting other Cloud Dataflow pipeline options
I interpret that to mean that flexrs_goal=SPEED_OPTIMIZED will turn off FlexRS mode, so I changed the definition of my google_dataflow_flex_template_job resource to:
resource "google_dataflow_flex_template_job" "streaming_beam" {
provider = google-beta
name = "streaming-beam"
container_spec_gcs_path = module.streaming_beam_flex_template_file[0].fully_qualified_path
parameters = {
"input_subscription" = google_pubsub_subscription.ratings[0].id
"output_table" = "${var.project}:beam_samples.streaming_beam_sql"
"service_account_email" = data.terraform_remote_state.state.outputs.sa.email
"network" = google_compute_network.network.name
"subnetwork" = "regions/${google_compute_subnetwork.subnet.region}/subnetworks/${google_compute_subnetwork.subnet.name}"
"flexrs_goal" = "SPEED_OPTIMIZED"
}
}
(note the addition of "flexrs_goal" = "SPEED_OPTIMIZED"), but it doesn't seem to make any difference. The Dataflow UI confirms that I have set SPEED_OPTIMIZED:
but it still takes too long (9 minutes 46 seconds) for the job to start processing data, and it was in state=QUEUED for all that time:
2021-01-17 19:49:19.021 GMT Starting GCE instance, launcher-2021011711491611239867327455334861, to launch the template.
...
...
2021-01-17 19:59:05.381 GMT Starting 1 workers in europe-west1-d...
2021-01-17 19:59:12.256 GMT VM, launcher-2021011711491611239867327455334861, stopped.
I then tried explicitly setting flexrs_goal=COST_OPTIMIZED just to see if it made any difference, but this only caused an error:
"The workflow could not be created. Causes: The workflow could not be
created due to misconfiguration. The experimental feature
flexible_resource_scheduling is not supported for streaming jobs.
Contact Google Cloud Support for further help. "
This makes sense: my job is indeed a streaming job, and the documentation does indeed state that FlexRS is only for batch jobs.
This page explains how to enable Flexible Resource Scheduling (FlexRS) for autoscaled batch pipelines in Dataflow.
https://cloud.google.com/dataflow/docs/guides/flexrs
This doesn't solve my problem though. As I said above, if I deploy with flexrs_goal=SPEED_OPTIMIZED then the job still sits in state=QUEUED for almost ten minutes, yet as far as I know QUEUED is only applicable to FlexRS jobs:
Therefore, after you submit a FlexRS job, your job displays an ID and a Status of Queued
https://cloud.google.com/dataflow/docs/guides/flexrs#delayed_scheduling
Hence I'm very confused:
Why is my job getting queued even though it is not a flexRS job?
Why does it take nearly ten minutes for my job to start processing any data?
How can I speed up the time it takes for my job to start processing data so that I can get quicker feedback during development/testing?
UPDATE: I dug a bit more into the logs to find out what was going on during those 9 minutes 46 seconds. These two consecutive log messages are 7 minutes 23 seconds apart:
2021-01-17 19:51:03.381 GMT
"INFO:apache_beam.runners.portability.stager:Executing command: ['/usr/local/bin/python', '-m', 'pip', 'download', '--dest', '/tmp/dataflow-requirements-cache', '-r', '/dataflow/template/requirements.txt', '--exists-action', 'i', '--no-binary', ':all:']"
2021-01-17 19:58:26.459 GMT
"INFO:apache_beam.runners.portability.stager:Downloading source distribution of the SDK from PyPi"
Whatever is going on between those two log records is the main contributor to the long time spent in state=QUEUED. Anyone know what might be the cause?
As mentioned in the existing answer, you need to pull the apache-beam module out of your requirements.txt and install it in a separate step in your Dockerfile:
# Install the Beam SDK in its own layer so pip does not download and
# build it from source when the job launches.
RUN pip install -U apache-beam==<version>
RUN pip install -U -r ./requirements.txt
While developing, I prefer to use DirectRunner for the fastest feedback.
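For completeness, a minimal sketch of what that looks like; the pipeline body here is a placeholder, only the runner option matters:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# DirectRunner executes the pipeline locally, so there is no worker
# provisioning delay while you iterate.
options = PipelineOptions(runner="DirectRunner")
with beam.Pipeline(options=options) as p:
    p | beam.Create(["hello", "world"]) | beam.Map(print)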
The question is mostly about step #3.
I need to implement the following loop in F#:
1. Load some input parameters from the database. If the input parameters specify that the loop should be terminated, then quit. This is straightforward, thanks to type providers.
2. Run a model generator, which generates a single F# source file, call it ModelData.fs. The generation is based on the input parameters and some random values. This may take up to several hours. I already have that.
3. Compile the project. I can hard-code a call to MSBuild, but that does not feel right.
4. Copy the executables into some temporary folder, let’s say <SomeTempData>\<SomeModelName>.
5. Asynchronously run the executable from the output location of the previous step. The execution time is usually between several hours and several days. Each process runs in a single thread because this is the most efficient way to do it in this case. I tried a parallel version, but it did not beat the single-threaded one.
6. Catch when the execution finishes (see the sketch after this list). This seems straightforward: https://learn.microsoft.com/en-us/dotnet/api/system.diagnostics.process.exited?view=netframework-4.7.2. The running process is responsible for storing useful results, if any. The event can be handled by an F# MailboxProcessor.
7. Repeat from the beginning. Because execution is async, monitor the number of running tasks to ensure that it does not exceed some allowed number. Again, a MailboxProcessor will do that with ease.
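For step 6, a minimal sketch of the non-blocking part, assuming the executable path comes from step 4 (runSolver and onExit are hypothetical names; the MailboxProcessor bookkeeping is left out):

open System.Diagnostics

/// Start an executable without blocking and invoke a callback when it exits.
let runSolver (exePath: string) (onExit: int -> unit) =
    let psi = ProcessStartInfo(exePath, UseShellExecute = false)
    let p = new Process(StartInfo = psi, EnableRaisingEvents = true)
    // Exited fires on a thread-pool thread; posting a message to a
    // MailboxProcessor from here keeps the bookkeeping single-threaded.
    p.Exited.Add(fun _ -> onExit p.ExitCode)
    p.Start() |> ignore
    p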
The whole thing will run on Windows, so there is no need to support multiple platforms; .NET Framework (say, 4.7.2 as of this date) will do fine.
This seems like a very straightforward CI-like exercise, and F# FAKE seemed like a proper solution. Unfortunately, none of the scarce provided examples worked (even with reasonable tweaks), and the bugs were cryptic. The worst part was that the compile step did not work at all. The provided example (http://fake.build/fake-gettingstarted.html#Compiling-the-application) cannot be run as-is, and even after accounting for something like https://github.com/fsharp/FAKE/issues/1579 it still silently chooses not to compile the project. I’d appreciate any advice.
Here is the code that I was trying to run. It is based on the references above:
#r #"C:\GitHub\ClmFSharp\Clm\packages\FAKE.5.8.4\tools\FakeLib.dll"
#r #"C:\GitHub\ClmFSharp\Clm\packages\FAKE.5.8.4\tools\System.Reactive.dll"
open System.IO
open Fake.DotNet
open Fake.Core
open Fake.IO
open Fake.IO.Globbing.Operators
let execContext = Fake.Core.Context.FakeExecutionContext.Create false "build.fsx" []
Fake.Core.Context.setExecutionContext (Fake.Core.Context.RuntimeContext.Fake execContext)
// Properties
let buildDir = #"C:\Temp\WTF\"
// Targets
Target.create "Clean" (fun _ ->
Shell.cleanDir buildDir
)
Target.create "BuildApp" (fun _ ->
!! #"..\SolverRunner\SolverRunner.fsproj"
|> MSBuild.runRelease id buildDir "Build"
|> Trace.logItems "AppBuild-Output: "
)
Target.create "Default" (fun _ ->
Trace.trace "Hello World from FAKE"
)
open Fake.Core.TargetOperators
"Clean"
==> "BuildApp"
==> "Default"
Target.runOrDefault "Default"
The problem is that it does not build the project at all, yet no error messages are produced! This is the output when running it in FSI:
run Default
Building project with version: LocalBuild
Shortened DependencyGraph for Target Default:
<== Default
<== BuildApp
<== Clean
The running order is:
Group - 1
- Clean
Group - 2
- BuildApp
Group - 3
- Default
Starting target 'Clean'
Finished (Success) 'Clean' in 00:00:00.0098793
Starting target 'BuildApp'
Finished (Success) 'BuildApp' in 00:00:00.0259223
Starting target 'Default'
Hello World from FAKE
Finished (Success) 'Default' in 00:00:00.0004329
---------------------------------------------------------------------
Build Time Report
---------------------------------------------------------------------
Target Duration
------ --------
Clean 00:00:00.0025260
BuildApp 00:00:00.0258713
Default 00:00:00.0003934
Total: 00:00:00.2985910
Status: Ok
---------------------------------------------------------------------
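One thing worth checking, given that BuildApp "succeeds" in roughly 0.03 seconds: whether the !! glob actually matches anything. A relative pattern resolves against FSI's current directory, not the script's location, and an empty match would hand MSBuild an empty file list, which finishes instantly with no error. A quick way to check from the same script (same opens as above):

// Print what the glob resolves to; an empty list would explain the
// instant "success" (nothing gets passed to MSBuild).
!! @"..\SolverRunner\SolverRunner.fsproj"
|> Seq.iter (fun path -> Trace.log (sprintf "matched: %s" path))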
I'm trying to separate out some code from drake/automotive/automotive_demo.cc. As a first step, I'm trying to copy automotive_demo.cc and automotive_demo.py into differently named files (test.cc and test.py) and then run bazel run automotive:test -- --num_simple_cars=1. I modified automotive/BUILD.bazel and test.py to account for the new dependencies.
The problem is that after bazel run, the simulator window opens but no car gets rendered. Eventually it crashes with the following errors:
[lcm-spy] ClassDiscoverer: java.lang.NoClassDefFoundError: apple/laf/AquaPopupMenuUI
[lcm-spy] jar: ../com_jidesoft_jide_oss/jide-oss-2.9.7.jar
[lcm-spy] class: com/jidesoft/plaf/aqua/AquaJidePopupMenuUI.class
...
[drake_visualizer] Qt WebEngine seems to be initialized from a plugin. Please set Qt::AA_ShareOpenGLContexts using QCoreApplication::setAttribute before constructing QGuiApplication.
...
[lcm-spy] LCM: Disabling IPV6 support
[lcm-spy] LCM: TTL set to zero, traffic will not leave localhost.
[lcm-spy] java.net.SocketException: Can't assign requested address
Here is an (unresolved) GitHub issue that points to the problem being that test is a "custom plug-in". But if automotive_demo works, surely there's a way to reproduce that behavior for test? I also tried grepping for QGuiApplication and only found a series of binary files, so I didn't know how to follow the error message's suggestion.
When trying out your steps on a Mac, I unfortunately cannot reproduce your specific errors. I do not think that having test as a target name should cause problems (at least I did not experience issues).
Could you please make sure:
1. You're able to run bazel run automotive:demo -- --num_simple_car=1?
2. After having renamed automotive_demo.* to test.*, the names in your BUILD.bazel and test.py files are mapped correctly: demo -> test and automotive_demo -> test_cc (or whatever unique name you choose)? A sketch follows below.
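For illustration, the renamed pair might look roughly like this in BUILD.bazel (rule names and deps here are placeholders, not Drake's actual definitions):

cc_binary(
    name = "test_cc",
    srcs = ["test.cc"],
    # same deps as the original automotive_demo target
)

py_binary(
    name = "test",
    srcs = ["test.py"],
    data = [":test_cc"],
)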
I'm working on moving us from Ant to gulp, and as part of the effort I want to write timing stats to Graphite. We do this in Ant as well (no idea how; beside the point anyway). My question is: I'd prefer not to have to manually add some plugin to every task we have (we have over 60), but rather have some sort of global behavior where, for every task, a timer is started before the task runs, and when the task signals completion we push some data to Graphite (over statsd).
Can someone point me in the right direction for where to hook into gulp for this? I couldn't find anything particularly useful in the docs / recipes...
We're running gulp 4.
Instead of adding timing code to your numerous tasks, you could make use of the gulp-duration NPM package.
A snippet of an example of its use is shown below:
// Assumed imports for this snippet; bundler would be e.g. a browserify instance.
var gulp = require('gulp')
var source = require('vinyl-source-stream')
var uglify = require('gulp-uglify')
var duration = require('gulp-duration')

function rebundle() {
  var uglifyTimer = duration('uglify time')
  var bundleTimer = duration('bundle time')

  return bundler.bundle()
    .pipe(source('bundle.js'))
    .pipe(bundleTimer)
    // start just before uglify receives its first file
    .once('data', uglifyTimer.start)
    .pipe(uglify())
    .pipe(uglifyTimer)
    .pipe(gulp.dest('example/'))
}
gulp-duration's duration function:
Creates a new pass-through duration stream. When this stream is
closed, it will log the amount of time since its creation to your
terminal.
will then allow you to log the duration of the task.
Whilst this is not a global-behaviour solution, at least you can keep the timing code in your gulpfile, as opposed to having to modify all 60+ of your tasks.
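If you do want something closer to global behaviour on gulp 4, one option is to register every task through a small wrapper that times it and reports to statsd. A sketch, assuming promise-returning task functions and the hot-shots statsd client (both assumptions, not part of gulp):

const gulp = require('gulp')
const StatsD = require('hot-shots')

const statsd = new StatsD() // defaults to localhost:8125

// Wrap a task function so its duration is sent to Graphite via statsd.
function timed(name, taskFn) {
  const wrapped = async () => {
    const start = Date.now()
    try {
      await taskFn() // assumes the task returns a promise
    } finally {
      statsd.timing('build.gulp.' + name, Date.now() - start)
    }
  }
  wrapped.displayName = name
  return wrapped
}

// Usage: register each task through the wrapper once, instead of editing
// every task body.
// gulp.task(timed('scripts', scriptsTask))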
So, in my collection I have about ten requests, the last two being:
/Wait 10 seconds
/Check Complete
The first makes a call to Postman's echo service (delay by 10 seconds), and the second is the call to my system to check for the complete status. If the status is unavailable, I wait another 10 seconds:
postman.setNextRequest("Wait 10 seconds");
The complete status on my system can appear in a minute or so. As one can see, this is an infinite loop if something goes wrong with the system and the status never becomes complete. Is there a way in a Postman/newman test to fail the run if it has been going for more than, say, 2 minutes?
Additionally, this will be executed from the command line in Jenkins, so I am not really looking at Postman settings or delays between requests in the runner.
You may have a look at the newman options here: https://www.npmjs.com/package/newman#newman-run-collection-file-source-options. The interesting option is
--timeout-request, which sets a timeout for each individual request; the related --timeout option caps the duration of the entire collection run, which may match your need more closely.
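For example (the collection file name is illustrative):

# Fail the whole run after 2 minutes; individual requests time out after 10 s.
newman run my-collection.json --timeout 120000 --timeout-request 10000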
In Postman itself, you may test responseTime. I recall that there is a snippet, on the right-hand side, which looks like this:
tests["Response time is less than 200ms"] = responseTime < 200;
and which could help you, as the test fails if the response does not arrive within the requested time.
Alexandre
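Another way to bound the loop itself, independent of newman flags: track the polling start time in an environment variable and stop re-queuing the wait request once a budget is exceeded. A sketch for the "Check Complete" request's Tests tab (the variable name and response shape are assumptions):

// Environment values are stored as strings, hence the parseInt.
var startedAt = parseInt(pm.environment.get("pollStartedAt"), 10);
if (isNaN(startedAt)) {
    startedAt = Date.now();
    pm.environment.set("pollStartedAt", startedAt);
}

var isComplete = pm.response.json().status === "COMPLETE"; // assumed shape

if (isComplete) {
    pm.environment.unset("pollStartedAt");
} else if (Date.now() - startedAt > 2 * 60 * 1000) {
    pm.environment.unset("pollStartedAt");
    pm.test("Polling timed out", function () {
        pm.expect.fail("status never became complete within 2 minutes");
    });
} else {
    postman.setNextRequest("Wait 10 seconds");
}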
If you are going to be using a Jenkins pipeline, you can use the timeout step to make long-running jobs result in failure; here's one for 2 minutes (the step's default unit is minutes, so the unit is spelled out explicitly):
timeout(time: 2, unit: 'MINUTES') {
    node {
        sh 'newman command'
    }
}
Check out the "Pipeline Syntax" editor in Jenkins to generated your code block and look for other useful functions.