GNU Parallel More Frequently Check Nodelist - gnu-parallel

I am using GNU Parallel for managing submissions across a remote cluster and local machines. As stated in the man page, -sshloginfile is only rechecked after a job completes. However, all of my jobs take 8h+ to complete. Is there anyway to check the sshloginfile more often?

There is not. But why would you need it?
What would it accomplish to have it checked more often?

Related

How can I debug why my Dataflow job is stuck?

I have a Dataflow job that is not making progress - or it is making very slow progress, and I do not know why. How can I start looking into why the job is slow / stuck?
The first resource that you should check is Dataflow documentation. It should be useful to check these:
Troubleshooting your Pipeline
Common error guidance
If these resources don't help, I'll try to summarize some reasons why your job may be stuck, and how you can debug it. I'll separate these issues depending on which part of the system is causing the trouble. Your job may be:
Job stuck at startup
A job can get stuck being received by the Dataflow service, or starting up new Dataflow workers. Some risk factors for this are:
Did you add a custom setup.py file?
Do you have any dependencies that require a special setup on worker startup?
Are you manipulating the worker container?
To debug this sort of issue I usually open StackDriver logging, and look for worker-startup logs (see next figure). These logs are written by the worker as it starts up a docker container with your code, and your dependencies. If you see any problem here, it would indicate an issue with your setup.py, your job submission, staged artifacts, etc.
Another thing you can do is to keep the same setup, and run a very small pipeline that stages everything:
with beam.Pipeline(...) as p:
(p
| beam.Create(['test element'])
| beam.Map(lambda x: logging.info(x)))
If you don't see your logs in StackDriver, then you can continue to debug your setup. If you do see the log in StackDriver, then your job may be stuck somewhere else.
Job seems stuck in user code
Something else that could happen is that your job is performing some operation in user code that is stuck or slow. Some risk factors for this are:
Is your job performing operations that require you to wait for them? (e.g. loading data to an external service, waiting for promises/futures)
Note that some of the builtin transforms of Beam do exactly this (e.g. the Beam IOs like BigQueryIO, FileIO, etc).
Is your job loading very large side inputs into memory? This may happen if you are using View.AsList for a side input.
Is your job loading very large iterables after GroupByKey operations?
A symptom of this kind of issue can be that the pipeline's throughput is lower than you would expect. Another symptom is seeing the following line in the logs:
Processing stuck in step <STEP_NAME>/<...>/<...> for at least <TIME> without outputting or completing in state <STATE>
.... <a stacktrace> ....
In cases like these it makes sense to look at which step is consuming the most time in your pipeline, and inspect the code for that step, to see what may be the problem.
Some tips:
Very large side inputs can be troublesome, so if your pipeline relies on accessing a very large side input, you may need to redesign it to avoid that bottleneck.
It is possible to have asynchronous requests to external services, but I recommend that you commit / finalize work on startBundle and finishBundle calls.
If your pipeline's throughput is not what you would normally expect, it may be because you don't have enough parallelism. This can be fixed by a Reshuffle, or by sharding your existing keys into subkeys (Beam often does processing per-key, and so if you have too few keys, your parallelism will be low) - or using a Combiner instead of GroupByKey + ParDo.
Another reason that your throughput is low may be that your job is waiting too long on external calls. You can try addressing this by trying out batching strategies, or async IO.
In general, there's no silver bullet to improve your pipeline's throughput,and you'll need to have experimentation.
The data freshness or system lag are increasing
First of all, I'd recommend you check out this presentation on watermarks.
For streaming, the advance of the watermarks is what drives the pipeline to make progress, thus, it is important to be watchful of things that could cause the watermark to be held back, and stall your pipeline downstream. Some reasons why the watermark may become stuck:
One possibility is that your pipeline is hitting an unresolvable error condition. When a bundle fails processing, your pipeline will continue to attempt to execute that bundle indefinitely, and this will hold the watermark back.
When this happens, you will see errors in your Dataflow console, and the count will keep climbing as the bundle is retried. See:
You may have a bug when associating the timestamps to your data. Make sure that the resolution of your timestamp data is the correct one!
Although unlikely, it is possible that you've hit a bug in Dataflow. If neither of the other tips helps, please open a support ticket.

Jenkins remote API - wait for build to finish and get output?

When using Jenkins CLI, I can use the build command with options -v and -s to run a build, waiting for it to finish and printing its output.
Is there any way I can achieve the same result (wait for execution and get job output) with a single call to the REST API? I know this can be done by polling for build status until it finishes and then requesting its output, but I want to know if there is a straightforward option for short-running jobs.
You can do like that somehow. But even if you do also you can't able to apply the same code for other jobs. There will be waiting period for the next available executor or some race conditions like this might happen. And holding the rest API for that long period is not gonna be a good option. And nobody suggests that.
So Instead of looking for the REST API, you can have an algorithm for polling itself. instead of every second, take results from the previous builds and process it and try to predict the near possible time and then poll. Like this kind of algorithms or else you can use Jenkins build remaining time also. Hope this makes sense.

Jenkins Pipeline and huge amount of parallel steps

I have searched the whole internet for 2 weeks now, asked on freenode IRC and in the Jenkins user group mailing list for that but got no answer so here I am, you are my last hope (no pressure)
I have a Jenkins scripted pipeline that generates hundreds of parallel branches that have to run simultaneously on hundreds of slaves node. At the moment it looks like Jenkins BlueOcean user interface is not suited for that. We reach a point were all the steps can't be displayed.
I need to provide some kind of background to let you understand our need: We have a huge project in our company that have thousands of Behat/Selenium and this takes more that 30h to run now if done sequentially. We implemented a basic solution some times ago were we use a queuing system (RabbitMq) to store all the tests and consumers that run the tests by downloading the source code from Jenkins and uploading artifacts back to Jenkins too, but this is not as scallable as Jenkins native slaves and it is not maintainable enough (eg. we don't benefit from real time output log and usage statistics).
I know there is an open issue that describe the problem here : https://issues.jenkins-ci.org/browse/JENKINS-41205 but, basically, I need a workaround working for the next week (Our deelopment team are waiting for this new pipeline for a long time now).
Our pippeline looks like that at the moment:
Build --- Unit Tests --- Integration Tests --- Functional Tests ---
| | |
tool A suite A matrix-A-A-batch 0
tool B suite B matrix-A-A-batch 1
tool C matrix-A-A-batch 2
matrix-A-A-batch 3
....
"Unable to display more"
You can find a full version of our Jenkinsfile here : https://github.com/willy-ahva/pim-community-dev/blob/086e4ed48ef1a3d880ca16b6f5572f350d26eb03/Jenkinsfile (It may looks complicated but, basically, the real problem is the "Functional Tests" stage)
My questions are:
Am I using parallel the good way ?
Is it only a Jenkins/BlueOcean issue and I should contribute to the issue I linked ? (If yes, how ? I'm not a Java dev at all)
Should I try to use MultiJob and parallelize jobs instead of steps ?
Is there any other tool except parallel that I can use ? (some kind of fork or whatever) ?
Thanks a lot for your help. I love what Jenkins became with the Pipeline and BlueOcean UI and I really want to make it work in our team.
This is probably a poor way to do the parallel tasks. I would instead treat each parallel map entry as a worker, and put your tests into a queue / stack / data structure. Each worker thread could pop off the queue as required, and then you wouldn't sit there with a million tasks queued. You would have to be more careful with your logging so that it is apparent which test failed, but that shouldn't be too tough.
It's probably not something that's easy to fix, as it is as much a UI design issue as anything else. I would recommend that you give it a poke though! Who knows, maybe a solution will click for you?
Probably not. In my opinion this makes this muddier
Parallel is your option for forking.
If you really want to keep doing this, but don't want the UI to be so weird, you can stop defining each test as a stage. It'll be less clear what failed when one fails, but the UI should be happier.

Gerrit+Jenkins: communicate results of non-blocking jobs?

I have two jobs A and B. Job A is triggering job B, but does not wait for the result. How can I make job B now communicate back to Gerrit that that has also been done?
Do I have to use the API?
Either use API: ssh -p 29418 review.example.com gerrit review --message "Job B ran extremely well" <sha1>
note 1: the quotes may be needed around the actual gerrit call
note 2: depending on branch strategy for your project you might also want to include <change_id> after <sha1>, since one sha1 may be present in different branches
Or make job A wait for job B's completion (one way is to turn on block until the other projects complete ).
The latter would be easier to use and gives more possibilities with reduced effort cost. The former, however, has an advantage of better customization.

Jenkins: how to block a job to make it unrunnable

This is not just another question about concurrent job execution in Jenkins. The problem I have is that there are several jobs that run independently from one another. When they finish it should be possible to run a manual job. The condition though is that all those automated jobs should be in successful state. Otherwise it should not be possible to run this manual job. It should also not be possible to run or even schedule run of this manual job if those other jobs are running.
I searched for the answer everywhere and checked every possible plugin that serves synchronization. But I did not figure it out how to solve the above problem.
IMHO the delivery pipeline plugin (see https://wiki.jenkins-ci.org/display/JENKINS/Delivery+Pipeline+Plugin for the download and http://www.infoq.com/articles/orch-pipelines-jenkins for a thorough description) could do what you want.
You can run a lot of jobs (in parallel or not), and when (and only when) they succeed another job (or more). You even can add manual steps (needing a button click when your pipeline may continue).
Everything is configurable - and quite stable at this moment.
No-one should be able to manually (or otherwise) start a job that is in "waiting state" for other jobs to finish.
Regarding this question:
Otherwise it should not be possible to run this manual job. It should also not be possible to run or even schedule run of this manual job if those other jobs are running.
You can use the Throttle Concurrent Builds Plugin and create a category which will include your automated jobs and the manual jobs.
If one automated job is running, it will be impossible to launch the manual jobs.
Regarding your first question, did you have a look to the Join plugin?
Cheers
https://wiki.jenkins-ci.org/display/JENKINS/Promoted+Builds+Plugin can also be option. Setup promotions in that way that manual approval is needed and build will not fail only if automated jobs are done.

Resources