What could be causing "no action" state? - bazel

I have a Bazel repo that builds some artifact. The problem is that it stops half way through hanging with this message:
[3 / 8] no action
What on the earth could be causing this condition? Is this normal?
(No, the problem is not easily reducible, there is a lot of custom code and if I could localize the issue, I'd not be writing this question. I'm interested in general answer, what in principle could cause this and is this normal.)

It's hard to answer your question without more information, but yes it is sometimes normal. Bazel does several things that are not actions.
One reason that I've seen this is if Bazel is computing digests of lots of large files. If you see getDigestInExclusiveMode in the stack trace of the Bazel server, it is likely due to this. If this is your problem, you can try out the --experimental_multi_threaded_digest flag.

Depending on the platform you are running Bazel on:
Windows: I've seen similar behavior, but I couldn't yet determine the reason. Every few runs, Bazel hangs at the startup for about half a minute.
If this is mid-build during execution phase (as it appears to be, given that Bazel is already printing action status messages), then one possible explanation is that your build contains many shell commands. I measured that on Linux (a VM with an HDD) each shell command takes at most 1-2ms, but on my Windows 10 machine (32G RAM, 3.5GHz CPU, HDD) they take 1-2 seconds, with 0.5% of the commands taking up to 10 seconds. That's 3-4 orders of magnitude slower if your actions are heavy on shell commands. There can be numerous explanations for this (antivirus, slow process creation, MSYS being slow), none of which Bazel has control over.
Linux/macOS: Run top and see if the stuck Bazel process is doing anything at all. Try hitting Ctrl+\, that'll print a JVM stacktrace which could help identifying the problem. Maybe the JVM is stuck waiting in a lock -- that would mean a bug or a bad rule implementation.
There are other possibilities too, maybe you have a build rule that hangs.
Does Bazel eventually continue, or is it stuck for more than a few minutes?

Related

How can I debug why my Dataflow job is stuck?

I have a Dataflow job that is not making progress - or it is making very slow progress, and I do not know why. How can I start looking into why the job is slow / stuck?
The first resource that you should check is Dataflow documentation. It should be useful to check these:
Troubleshooting your Pipeline
Common error guidance
If these resources don't help, I'll try to summarize some reasons why your job may be stuck, and how you can debug it. I'll separate these issues depending on which part of the system is causing the trouble. Your job may be:
Job stuck at startup
A job can get stuck being received by the Dataflow service, or starting up new Dataflow workers. Some risk factors for this are:
Did you add a custom setup.py file?
Do you have any dependencies that require a special setup on worker startup?
Are you manipulating the worker container?
To debug this sort of issue I usually open StackDriver logging, and look for worker-startup logs (see next figure). These logs are written by the worker as it starts up a docker container with your code, and your dependencies. If you see any problem here, it would indicate an issue with your setup.py, your job submission, staged artifacts, etc.
Another thing you can do is to keep the same setup, and run a very small pipeline that stages everything:
with beam.Pipeline(...) as p:
(p
| beam.Create(['test element'])
| beam.Map(lambda x: logging.info(x)))
If you don't see your logs in StackDriver, then you can continue to debug your setup. If you do see the log in StackDriver, then your job may be stuck somewhere else.
Job seems stuck in user code
Something else that could happen is that your job is performing some operation in user code that is stuck or slow. Some risk factors for this are:
Is your job performing operations that require you to wait for them? (e.g. loading data to an external service, waiting for promises/futures)
Note that some of the builtin transforms of Beam do exactly this (e.g. the Beam IOs like BigQueryIO, FileIO, etc).
Is your job loading very large side inputs into memory? This may happen if you are using View.AsList for a side input.
Is your job loading very large iterables after GroupByKey operations?
A symptom of this kind of issue can be that the pipeline's throughput is lower than you would expect. Another symptom is seeing the following line in the logs:
Processing stuck in step <STEP_NAME>/<...>/<...> for at least <TIME> without outputting or completing in state <STATE>
.... <a stacktrace> ....
In cases like these it makes sense to look at which step is consuming the most time in your pipeline, and inspect the code for that step, to see what may be the problem.
Some tips:
Very large side inputs can be troublesome, so if your pipeline relies on accessing a very large side input, you may need to redesign it to avoid that bottleneck.
It is possible to have asynchronous requests to external services, but I recommend that you commit / finalize work on startBundle and finishBundle calls.
If your pipeline's throughput is not what you would normally expect, it may be because you don't have enough parallelism. This can be fixed by a Reshuffle, or by sharding your existing keys into subkeys (Beam often does processing per-key, and so if you have too few keys, your parallelism will be low) - or using a Combiner instead of GroupByKey + ParDo.
Another reason that your throughput is low may be that your job is waiting too long on external calls. You can try addressing this by trying out batching strategies, or async IO.
In general, there's no silver bullet to improve your pipeline's throughput,and you'll need to have experimentation.
The data freshness or system lag are increasing
First of all, I'd recommend you check out this presentation on watermarks.
For streaming, the advance of the watermarks is what drives the pipeline to make progress, thus, it is important to be watchful of things that could cause the watermark to be held back, and stall your pipeline downstream. Some reasons why the watermark may become stuck:
One possibility is that your pipeline is hitting an unresolvable error condition. When a bundle fails processing, your pipeline will continue to attempt to execute that bundle indefinitely, and this will hold the watermark back.
When this happens, you will see errors in your Dataflow console, and the count will keep climbing as the bundle is retried. See:
You may have a bug when associating the timestamps to your data. Make sure that the resolution of your timestamp data is the correct one!
Although unlikely, it is possible that you've hit a bug in Dataflow. If neither of the other tips helps, please open a support ticket.

Jenkins: jobs in queue are stuck and not triggered to be restarted

For a while, our Jenkins experiences critical problems. We have jobs hung, our job scheduler does not trigger the builds. After the Jenkins service restart, everything is back to normal, but after some period of time all problem are return. (this period can be week or day or ever less). Any idea where we can start looking? I'll appreciate any help on this issue
Muatik has made a good point in his comment, the recommended approach is to run jobs on agents (slave) nodes. If you already do it, you can look at:
Jenkins master machine CPU, RAM and hard disk usage. Access the machine and/or use plugin like Java Melody. I have seen missing graphics in the builds test results and stuck builds due to no hard disk space. You could also have hit the limit of RAM or CPU for the slaves/jobs you are executing. You may need more heap space.
Look at Jenkins log files, start with severe exceptions. If the files are too big or you see logrotate exceptions, you can change the logging levels, so that fewer exceptions are logged. For more details see my article on this topic. Try to fix exceptions that you see logged.
Go through recently made changes that can be the cause of such behavior, for example, new plugins, changes to config files (jenkins.xml)?
Look at TCP connections. Run netstat -a Are there suspicious connections (CLOSED_WAIT status)?
Delete old builds that you do not need.
We have been facing this issue from last 4 months, and tried everything, changing resources CPU & memory, increasing desired nodes in ASG. But nothing seems worked .
Solution: 1. Go to Manage Jnekins-> System Configurationd-> Maven project
configurations
2. In "usage" field, select "Only buid Jobs with label expressions matching this nodes"
Doing this resolved it and jenkins is working as a Rocket now :)

SonarQube 5.x performance changes

We currently use SonarQube 4.3.3 on a fairly big project with 30+ developers, 300k LOC, 2200 Unit Tests and it is currently used to help us tracking issues on our Continuous Integration cycle (Merge Requests, etc).
With the increase of custom XPath rules, we're seeing a big increase on the analysis time. Moreover, we currently can't run more than one analysis at the same time (the second one hangs on "waiting for a checkpoint from Job #xxx) so this usually makes things slow. We're seeing build times (Jenkins + JUnit + SonarQube) around 30 minutes. With the release of the 5.x branch, we'd like to know if the effort to update everything is worth the while, so we've got some questions?
Are Java rules faster than the old XPath rules to process?
Will we be able to run more than one analysis at the same time?
I saw on SonarSource's JIRA that the 5.2 release (which will detach the analysis from the DB) is due to the 23rd of September. Will this really happen? Are the users expected to see a big performance increase with this?
We're curious to hear real-world cases on this, so any input will be highly appreciated.
Thanks!

Dataflow job takes too long to start

I'm running a job which reads about ~70GB of (compressed data).
In order to speed up processing, I tried to start a job with a large number of instances (500), but after 20 minutes of waiting, it doesn't seem to start processing the data (I have a counter for the number of records read). The reason for having a large number of instances is that as one of the steps, I need to produce an output similar to an inner join, which results in much bigger intermediate dataset for later steps.
What should be an average delay before the job is submitted and when it starts executing? Does it depend on the number of machines?
While I might have a bug that causes that behavior, I still wonder what that number/logic is.
Thanks,
G
The time necessary to start VMs on GCE grows with the number of VMs you start, and in general VM startup/shutdown performance can have high variance. 20 minutes would definitely be much higher than normal, but it is somewhere in the tail of the distribution we have been observing for similar sizes. This is a known pain point :(
To verify whether VM startup is actually at fault this time, you can look at Cloud Logs for your job ID, and see if there's any logging going on: if there is, then some VMs definitely started up. Additionally you can enable finer-grained logging by adding an argument to your main program:
--workerLogLevelOverrides=com.google.cloud.dataflow#DEBUG
This will cause workers to log detailed information, such as receiving and processing work items.
Meanwhile I suggest to enable autoscaling instead of specifying a large number of instances manually - it should gradually scale to the appropriate number of VMs at the appropriate moment in the job's lifetime.
Another possible (and probably more likely) explanation is that you are reading a compressed file that needs to be decompressed before it is processed. It is impossible to seek in the compressed file (since gzip doesn't support it directly), so even though you specify a large number of instances, only one instance is being used to read from the file.
The best way to approach the solution of this problem would be to split a single compressed file into many files that are compressed separately.
The best way to debug this problem would be to try it with a smaller compressed input and take a look at the logs.

building multiple jobs in jenkins performance

In Jenkins I have 100 java projects. Each has its own build file.
Every time I want clear the build file and compile all source files again.
Using bulkbuilder plugin I tried compling all the jobs.. Having 100 jobs run parallel.
But performance is very bad. Individually if the job takes 1 min. in the batch it takes 20mins. More the batch size more the time it takes.. I am running this on powerful server so no problem of memory and CPU.
Please Suggest me how to over come this.. what configurations need to be done in jenkins.
I am launching jenkins using war file.
Thanks..
Even though you say you have enough memory and CPU resources, you seem to imply there is some kind of bottleneck when you increase the number of parallel running jobs. I think this is understandable. Even though I am not a java developer, I think most of the java build tools are able to parallelize build internally. I.e. building a single job may well consume more than one CPU core and quite a lot of memory.
Because of this I suggest you need to monitor your build server and experiment with different batch sizes to find an optimal number. You should execute e.g. "vmstat 5" while builds are running and see if you have idle cpu left. Also keep an eye on the disk I/O. If you increase the batch size but disk I/O does not increase, you are consuming all of the I/O capacity and it probably will not help much if you increase the batch size.
When you have found the optimal batch size (i.e. how many executors to configure for the build server), you can maybe tweak other things to make things faster:
Try to spend as little time checking out code as possible. Instead of deleting workspace before build starts, configure the SCM plugin to remove files that are not under version control. If you use git, you can use a local reference repo or do a shallow clone or something like that.
You can also try to speed things up by using SSD disks
You can get more servers, run Jenkins slaves on them and utilize the cpu and I/O capacity of multiple servers instead of only one.

Resources