Our Bazel builds sometimes get stuck and time out, so we lose all build logs when the VM is killed. To find the cause, we want to use the Build Event Protocol to see which rules started executing but did not finish (usually these are memory-hungry tests).
This graph from the official docs shows that TargetConfigured and TargetCompleted are the only events between rule start and finish.
But in reality Bazel configures all targets at roughly the same time, so we cannot simply subtract the TargetConfigured time from the TargetCompleted time.
Moreover, neither event contains a timestamp. Here is the build event file from the sample repo (truncated):
{"id":{"targetConfigured":{"label":"//:B"}},"children":[{"targetCompleted":{"label":"//:B","configuration":{"id":"f157fdcaf05e7672fa1bf535fbb2c3edb004ce9e9a7f6d84d9bf031454e2fb64"}}}],"configured":{"targetKind":"java_binary rule","tag":["__JAVA_RULES_MIGRATION_DO_NOT_USE_WILL_BREAK__"]}}
{"id":{"targetConfigured":{"label":"//:main"}},"children":[{"targetCompleted":{"label":"//:main","configuration":{"id":"f157fdcaf05e7672fa1bf535fbb2c3edb004ce9e9a7f6d84d9bf031454e2fb64"}}}],"configured":{"targetKind":"java_library rule","tag":["__JAVA_RULES_MIGRATION_DO_NOT_USE_WILL_BREAK__"]}}
{"id":{"targetConfigured":{"label":"//:step1"}},"children":[{"targetCompleted":{"label":"//:step1","configuration":{"id":"f157fdcaf05e7672fa1bf535fbb2c3edb004ce9e9a7f6d84d9bf031454e2fb64"}}}],"configured":{"targetKind":"genrule rule"}}
{"id":{"progress":{"opaqueCount":2}},"children":[{"progress":{"opaqueCount":3}},{"namedSet":{"id":"0"}}],"progress":{"stderr":"\r\u001b[1A\u001b[K\u001b[32mAnalyzing:\u001b[0m 3 targets (0 packages loaded, 0 targets configured)\n\r\u001b[1A\u001b[K\u001b[32mINFO: \u001b[0mAnalyzed 3 targets (0 packages loaded, 0 targets configured).\n\n\r\u001b[1A\u001b[K\u001b[32mINFO: \u001b[0mFound 3 targets...\n\n\r\u001b[1A\u001b[K\u001b[32m[0 / 1]\u001b[0m [Prepa] BazelWorkspaceStatusAction stable-status.txt\n"}}
{"id":{"workspaceStatus":{}},"workspaceStatus":{"item":[{"key":"BUILD_EMBED_LABEL"},{"key":"BUILD_HOST","value":"mtymchuk"},{"key":"BUILD_TIMESTAMP","value":"1598888970"},{"key":"BUILD_USER","value":"mikhailtymchuk"}]}}
{"id":{"namedSet":{"id":"0"}},"namedSetOfFiles":{"files":[{"name":"B.jar","uri":"file:///private/var/tmp/_bazel_mikhailtymchuk/3bd90847b9f03e9e5c46f99d542eb754/execroot/__main__/bazel-out/darwin-fastbuild/bin/B.jar","pathPrefix":["bazel-out","darwin-fastbuild","bin"]},{"name":"B","uri":"file:///private/var/tmp/_bazel_mikhailtymchuk/3bd90847b9f03e9e5c46f99d542eb754/execroot/__main__/bazel-out/darwin-fastbuild/bin/B","pathPrefix":["bazel-out","darwin-fastbuild","bin"]}]}}
{"id":{"targetCompleted":{"label":"//:B","configuration":{"id":"f157fdcaf05e7672fa1bf535fbb2c3edb004ce9e9a7f6d84d9bf031454e2fb64"}}},"completed":{"success":true,"outputGroup":[{"name":"default","fileSets":[{"id":"0"}]}],"tag":["__JAVA_RULES_MIGRATION_DO_NOT_USE_WILL_BREAK__"],"importantOutput":[{"name":"B.jar","uri":"file:///private/var/tmp/_bazel_mikhailtymchuk/3bd90847b9f03e9e5c46f99d542eb754/execroot/__main__/bazel-out/darwin-fastbuild/bin/B.jar","pathPrefix":["bazel-out","darwin-fastbuild","bin"]},{"name":"B","uri":"file:///private/var/tmp/_bazel_mikhailtymchuk/3bd90847b9f03e9e5c46f99d542eb754/execroot/__main__/bazel-out/darwin-fastbuild/bin/B","pathPrefix":["bazel-out","darwin-fastbuild","bin"]}]}}
{"id":{"progress":{"opaqueCount":3}},"children":[{"progress":{"opaqueCount":4}},{"namedSet":{"id":"1"}}],"progress":{}}
{"id":{"namedSet":{"id":"1"}},"namedSetOfFiles":{"files":[{"name":"libmain.jar","uri":"file:///private/var/tmp/_bazel_mikhailtymchuk/3bd90847b9f03e9e5c46f99d542eb754/execroot/__main__/bazel-out/darwin-fastbuild/bin/libmain.jar","pathPrefix":["bazel-out","darwin-fastbuild","bin"]}]}}
{"id":{"targetCompleted":{"label":"//:main","configuration":{"id":"f157fdcaf05e7672fa1bf535fbb2c3edb004ce9e9a7f6d84d9bf031454e2fb64"}}},"completed":{"success":true,"outputGroup":[{"name":"default","fileSets":[{"id":"1"}]}],"tag":["__JAVA_RULES_MIGRATION_DO_NOT_USE_WILL_BREAK__"],"importantOutput":[{"name":"libmain.jar","uri":"file:///private/var/tmp/_bazel_mikhailtymchuk/3bd90847b9f03e9e5c46f99d542eb754/execroot/__main__/bazel-out/darwin-fastbuild/bin/libmain.jar","pathPrefix":["bazel-out","darwin-fastbuild","bin"]}]}}
{"id":{"progress":{"opaqueCount":4}},"children":[{"progress":{"opaqueCount":5}},{"namedSet":{"id":"2"}}],"progress":{}}
{"id":{"namedSet":{"id":"2"}},"namedSetOfFiles":{"files":[{"name":"step1_output.txt","uri":"file:///private/var/tmp/_bazel_mikhailtymchuk/3bd90847b9f03e9e5c46f99d542eb754/execroot/__main__/bazel-out/darwin-fastbuild/bin/step1_output.txt","pathPrefix":["bazel-out","darwin-fastbuild","bin"]}]}}
{"id":{"targetCompleted":{"label":"//:step1","configuration":{"id":"f157fdcaf05e7672fa1bf535fbb2c3edb004ce9e9a7f6d84d9bf031454e2fb64"}}},"completed":{"success":true,"outputGroup":[{"name":"default","fileSets":[{"id":"2"}]}],"importantOutput":[{"name":"step1_output.txt","uri":"file:///private/var/tmp/_bazel_mikhailtymchuk/3bd90847b9f03e9e5c46f99d542eb754/execroot/__main__/bazel-out/darwin-fastbuild/bin/step1_output.txt","pathPrefix":["bazel-out","darwin-fastbuild","bin"]}]}}
So, is it possible to extract the target build start time from the Build Event Protocol (or by some other method)?
On the console, if that helps, you should be able to get this information by combining --subcommands (or -s), which prints commands as they are executed, with --show_timestamps, which adds a timestamp to every message Bazel emits.
It's not exactly what you're asking for (I am not sure that adding timestamps to the Build Event Protocol can be achieved through configuration alone), but it may help with the debugging quest.
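For the original goal of spotting targets that were configured but never completed, you can also diff the two event sets from a --build_event_json_file dump, since both event types carry the target label (as in the sample above). A minimal sketch in Python; the file-path argument and output wording are my own:
import json
import sys

# Collect target labels from targetConfigured and targetCompleted events
# in a --build_event_json_file dump (one JSON object per line).
configured, completed = set(), set()

with open(sys.argv[1]) as bep:
    for line in bep:
        if not line.strip():
            continue
        event_id = json.loads(line).get("id", {})
        if "targetConfigured" in event_id:
            configured.add(event_id["targetConfigured"]["label"])
        elif "targetCompleted" in event_id:
            completed.add(event_id["targetCompleted"]["label"])

# Targets with no completion event are the candidates for
# "started but did not finish" when the build was killed.
for label in sorted(configured - completed):
    print("no completion event for: " + label)
This won't give you per-target durations (the events are unstamped, as you noted), but after a killed build it narrows the suspects to the targets that never completed.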
I am running a Cloud Dataflow job using PubSubIO. The Dataflow job gets deployed/initiated successfully; it is invoked via a Template.
But while the ParDos are running, one of them produces no results in the output, and I am also getting the below error in the pipeline:
textPayload: "Error message from worker: Processing stuck in step PriorityRanking/ParMultiDo(MergeRecords) for at least 05m00s without outputting or completing in state windmill-read
at sun.misc.Unsafe.park(Native Method)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at org.apache.beam.vendor.guava.v26_0_jre.com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:502)
at org.apache.beam.vendor.guava.v26_0_jre.com.google.common.util.concurrent.AbstractFuture$TrustedFuture.get(AbstractFuture.java:83)
at org.apache.beam.runners.dataflow.worker.MetricTrackingWindmillServerStub.getStateData(MetricTrackingWindmillServerStub.java:202)
at org.apache.beam.runners.dataflow.worker.WindmillStateReader.startBatchAndBlock(WindmillStateReader.java:414)
at org.apache.beam.runners.dataflow.worker.WindmillStateReader$WrappedFuture.get(WindmillStateReader.java:313)
at org.apache.beam.runners.dataflow.worker.WindmillStateInternals$WindmillValue.read(WindmillStateInternals.java:385)
at org.apache.beam.runners.dataflow.worker.StreamingSideInputFetcher.blockedMap(StreamingSideInputFetcher.java:244)
at org.apache.beam.runners.dataflow.worker.StreamingSideInputFetcher.storeIfBlocked(StreamingSideInputFetcher.java:181)
at org.apache.beam.runners.dataflow.worker.StreamingSideInputDoFnRunner.processElement(StreamingSideInputDoFnRunner.java:72)
at org.apache.beam.runners.dataflow.worker.SimpleParDoFn.processElement(SimpleParDoFn.java:335)
at org.apache.beam.runners.dataflow.worker.util.common.worker.ParDoOperation.process(ParDoOperation.java:44)
at org.apache.beam.runners.dataflow.worker.util.common.worker.OutputReceiver.process(OutputReceiver.java:49)
at org.apache.beam.runners.dataflow.worker.SimpleParDoFn$1.output(SimpleParDoFn.java:280)
It seems from the log that processing is stuck at a stateful ParDo.
Apache Beam version: 2.16.0
Java version: 1.8
I have also added:
experiments.add("disable_conscrypt_security_provider"); as a solution suggested in some other blogs.
But it does not seem to work. Can anybody please help?
Thanks.
I am running JMeter via Jenkins, and I have a Duration Assertion of 300 ms in my script. The assertion is working fine, as my JMeter result shows errors, but the Jenkins build still passes.
Is there any way to fail the build when there are errors in the JMeter result?
You need to define an Error Threshold.
The Error Threshold marks the build as unstable or failed if the number of errors exceeds the specified value.
Use error thresholds on single build – define error thresholds in % for the current build.
The options are in the Performance Plugin GUI, under "Configure error threshold".
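If the job is a Jenkins pipeline rather than a freestyle project, the same thresholds can be passed to the Performance Plugin's perfReport step. A sketch; the results path is an assumption:
// Jenkinsfile fragment: publish JMeter results and mark the build
// failed if the error percentage exceeds 0%.
perfReport sourceDataFiles: 'results/jmeter.jtl',
        errorUnstableThreshold: 0,
        errorFailedThreshold: 0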
Jenkins considers a build passed when it returns a 0 exit status code; you can "tell" JMeter to exit with a non-zero exit code by adding a JSR223 Listener with the following code:
// Exit the JVM with a non-zero code as soon as any sampler fails
if (!prev.isSuccessful()) {
    System.out.println("Test failure, exiting...")
    System.exit(1)
}
where prev stands for the parent SampleResult class instance, which gives you access to the parent sampler's response code, message, data, etc. Check out the Top 8 JMeter Java Classes You Should Be Using with Groovy article for more information on the JMeter API shortcuts available to JSR223 test elements.
JMeter will then exit with status code 1, and this non-zero status code will cause the Jenkins build to fail.
An alternative to the JSR223 approach would be using the Taurus tool as a wrapper for your JMeter test; it provides a handy Pass/Fail Criteria subsystem, which gives you a flexible way of defining custom thresholds for considering your test failed.
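For illustration, a minimal Taurus config along those lines; the scenario file name and the exact criteria are assumptions mirroring the 300 ms assertion:
# taurus.yml: wrap the existing JMeter test and stop the run as failed
# when average response time exceeds 300 ms or any sampler errors occur.
execution:
- scenario:
    script: existing-test.jmx

reporting:
- module: passfail
  criteria:
  - avg-rt>300ms for 10s, stop as failed
  - failures>0%, stop as failed
When a criterion fires, Taurus exits with a non-zero status code, which Jenkins treats as a failed build just like the System.exit(1) approach.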
I'm using Apache Beam on Dataflow through the Python API to read data from BigQuery, process it, and dump it into a Datastore sink.
Unfortunately, quite often the job just hangs indefinitely and I have to stop it manually. While the data gets written into Datastore and Redis, from the Dataflow graph I've noticed that only a couple of entries get stuck and leave the job hanging.
As a result, when a job with fifteen 16-core machines is left running for 9 hours (normally the job runs for 30 minutes), it leads to huge costs.
Maybe there is a way to set a timer that would stop a Dataflow job if it exceeds a time limit?
It would be great if you could create a customer support ticket so we could try to debug this with you.
Maybe there is a way to set a timer that would stop a Dataflow job if it exceeds a time limit?
Unfortunately the answer is no; Dataflow does not have an automatic way to cancel a job after a certain time. However, it is possible to do this using the API: call wait_until_finish() with a timeout, then cancel() the pipeline.
You would do this like so:
p = beam.Pipeline(options=pipeline_options)
p | ...  # Define your pipeline code
pipeline_result = p.run()  # submits the job and returns without blocking
pipeline_result.wait_until_finish(duration=TIME_DURATION_IN_MS)
pipeline_result.cancel()  # If the pipeline has not finished, you can cancel it
To sum up, with the help of @ankitk's answer, this works for me (Python 2.7, SDK 2.14):
pipe = beam.Pipeline(options=pipeline_options)
...  # main pipeline code
run = pipe.run()  # doesn't do anything by itself
run.wait_until_finish(duration=3600000)  # (ms) actually starts the job
run.cancel()  # cancels if it can be cancelled
Thus, if the job finished successfully within the duration passed to wait_until_finish(), then cancel() will just print a warning ("already closed"); otherwise it will close the running job.
P.S. If you try to print the state of the job:
state = run.wait_until_finish(duration=3600000)
logging.info(state)
it will be RUNNING for a job that wasn't finished within wait_until_finish(), and DONE for a finished job.
Note: this technique will not work when running Beam from within a Flex Template Job...
The run.cancel() method doesn't work if you are writing a template, and I haven't seen any successful workaround for it...
I am using a batch file which calls an SPSS production job which runs many syntax files.
In the syntax files I want to be able to check some variables, and if certain conditions are not met then I want to stop the production job, exit SPSS, and return an error code to the batch file.
The batch file needs to stop running the next commands based on the error code returned. I know how to do this in the batch file already.
The most basic solution would be: if the error code is not 0, then stop, and write the error text to a separate text file from within the syntax. A bonus would be distinct error codes that I could then match to the place in the syntax where they were thrown.
What is the best way to achieve this in the SPSS syntax and/or production file?
One way to do this would be to execute Statistics as an external-mode Python job. Then you could interrogate any results, catch exceptions, and set exit codes and messages however you like. Here is an example:
jobs.py:
import sys
sys.path.append(r"""c:/spss23/python/lib/site-packages""")
import spss

try:
    spss.Submit("""INSERT FILE="c:/temp/syntax1.sps".""")
except:
    print "syntax1.sps failed"
    sys.exit(1)
try:
    spss.Submit("""INSERT FILE="c:/temp/syntax2.sps".""")
except:
    print "syntax2.sps failed"
    sys.exit(2)
Then the .bat file would do:
python c:/myjobs/jobs.py
echo %ERRORLEVEL%
or similar. The job would need to save the output in an appropriate format using OMS or shell redirection.
In external mode, you could use code like this or you could interrogate items in the Viewer.
import sys
import spss, spssdata

# Read the values of variable2 case by case; bail out with a
# distinct exit code as soon as the stop condition is met.
curs = spssdata.Spssdata("variable2")
for case in curs:
    if case[0] == 6:
        curs.CClose()
        sys.exit(99)
curs.CClose()