OpenCensus prevents my Java process from exiting - Jenkins

I am using OpenCensus in my component. I am running a performance test with JMeter, started by Jenkins, but the process never ends. I discovered that it is OpenCensus that is keeping it alive (if I remove OpenCensus, the process finishes/dies normally).
Is there anything I can do in OpenCensus, Jenkins or JMeter to force the job to finish? Aborting the job does not help either, because Jenkins then does not collect the results.

IIRC, there's nothing inherent in OpenCensus that would cause this.
Caveat: I've mostly used OpenCensus with Golang, Python and JavaScript but not Java.
However, if you're using, for example, the Prometheus Exporter, it's common to run it in a separate thread, because the Prometheus server needs to scrape (via HTTP) a metrics endpoint that your component exposes.
Could it be that it is this that's keeping your component alive?
If so, there should be a mechanism to gracefully terminate the Exporter once your component is done with it.
zPages and possibly other Exporters take this background thread approach too.
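The question is about Java, but the mechanism suspected here (a background thread that is not a daemon thread keeps the process alive until it is stopped) can be sketched in Python, the language of the other code on this page. This is only an illustration: BackgroundExporter is a hypothetical stand-in, not an OpenCensus class, and a real exporter would have its own shutdown call that you would need to look up.

```python
import threading


class BackgroundExporter:
    """Hypothetical stand-in for an exporter that runs on a background thread."""

    def __init__(self):
        self._stop = threading.Event()
        # A non-daemon thread keeps the process alive until it returns.
        self._thread = threading.Thread(target=self._run, daemon=False)

    def start(self):
        self._thread.start()

    def _run(self):
        # Loop until shutdown() is called; a real exporter would serve HTTP here.
        while not self._stop.wait(timeout=0.05):
            pass

    def shutdown(self):
        # Signal the loop to exit, then wait for the thread to finish.
        self._stop.set()
        self._thread.join()

    def is_alive(self):
        return self._thread.is_alive()


if __name__ == "__main__":
    exporter = BackgroundExporter()
    exporter.start()
    try:
        pass  # placeholder for the component's real work
    finally:
        exporter.shutdown()  # without this call the process would never exit
```

If the exporter thread cannot be stopped explicitly, marking it as a daemon thread is the usual alternative, at the cost of it being cut off abruptly when the process exits.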

Related

What is the idiomatic way to implement long-running jobs in Jenkins?

I have a Jenkins declarative pipeline job which invokes a shared library. The Groovy script in the library executes a couple of quick commands and then polls for the result (with Thread.sleep()), which could take up to 10 minutes. However, after 5 minutes the thread is interrupted, which is by design. The discussion in the previous link mentions that such workloads should be done
"in a step so it can happen on a background thread rather than holding up this thread"
But it's not clear what is implied here. The shared library code is invoked from a script block inside a step.
In theory, a while loop that exits only when the program receives the signal that Jenkins sends when you click stop should work. But Jenkins seems like the wrong tool to manage a long-running process. I would look into systemd or another OS-level init system, so that I could pass the responsibility of keeping the process running on to my ops team.

How can I debug why my Dataflow job is stuck?

I have a Dataflow job that is not making progress - or it is making very slow progress, and I do not know why. How can I start looking into why the job is slow / stuck?
The first resource that you should check is the Dataflow documentation. These pages are a good starting point:
Troubleshooting your Pipeline
Common error guidance
If these resources don't help, I'll try to summarize some reasons why your job may be stuck, and how you can debug it. I'll separate these issues depending on which part of the system is causing the trouble: the job may be stuck at startup, it may seem stuck in user code, or its data freshness or system lag may be increasing.
Job stuck at startup
A job can get stuck being received by the Dataflow service, or starting up new Dataflow workers. Some risk factors for this are:
Did you add a custom setup.py file?
Do you have any dependencies that require a special setup on worker startup?
Are you manipulating the worker container?
To debug this sort of issue I usually open StackDriver logging and look for worker-startup logs (see next figure). These logs are written by the worker as it starts up a Docker container with your code and your dependencies. If you see any problem here, it would indicate an issue with your setup.py, your job submission, staged artifacts, etc.
Another thing you can do is to keep the same setup, and run a very small pipeline that stages everything:
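import logging
import apache_beam as beam

# Minimal pipeline: stages everything, then logs one element on a worker.
with beam.Pipeline(...) as p:
    (p
     | beam.Create(['test element'])
     | beam.Map(lambda x: logging.info(x)))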
If you don't see your logs in StackDriver, then you can continue to debug your setup. If you do see the log in StackDriver, then your job may be stuck somewhere else.
Job seems stuck in user code
Something else that could happen is that your job is performing some operation in user code that is stuck or slow. Some risk factors for this are:
Is your job performing operations that require you to wait for them? (e.g. loading data to an external service, waiting for promises/futures)
Note that some of Beam's built-in transforms do exactly this (e.g. the Beam IOs like BigQueryIO, FileIO, etc).
Is your job loading very large side inputs into memory? This may happen if you are using View.AsList for a side input.
Is your job loading very large iterables after GroupByKey operations?
A symptom of this kind of issue can be that the pipeline's throughput is lower than you would expect. Another symptom is seeing the following line in the logs:
Processing stuck in step <STEP_NAME>/<...>/<...> for at least <TIME> without outputting or completing in state <STATE>
.... <a stacktrace> ....
In cases like these it makes sense to look at which step is consuming the most time in your pipeline, and inspect the code for that step, to see what may be the problem.
Some tips:
Very large side inputs can be troublesome, so if your pipeline relies on accessing a very large side input, you may need to redesign it to avoid that bottleneck.
It is possible to have asynchronous requests to external services, but I recommend that you commit / finalize work on startBundle and finishBundle calls.
If your pipeline's throughput is not what you would normally expect, it may be because you don't have enough parallelism. This can be fixed by a Reshuffle, or by sharding your existing keys into subkeys (Beam often does processing per-key, and so if you have too few keys, your parallelism will be low) - or using a Combiner instead of GroupByKey + ParDo.
Another reason that your throughput is low may be that your job is waiting too long on external calls. You can try addressing this by trying out batching strategies, or async IO.
In general, there's no silver bullet to improve your pipeline's throughput, and you'll need to experiment.
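The tip about sharding existing keys into subkeys can be illustrated without Beam. The following is a minimal plain-Python sketch of the idea, not Beam code: the function names and the shard count of 4 are assumptions chosen for the example. A hot key is split into several (key, shard) subkeys, partial sums are computed per subkey, and the partial sums are then merged per original key.

```python
import random
from collections import defaultdict


def shard_key(key, num_shards, rng=random):
    """Attach a random shard number so one hot key becomes several subkeys."""
    return (key, rng.randrange(num_shards))


def sum_per_key_with_shards(records, num_shards=4):
    """Sum values per key in two stages: per subkey, then per original key."""
    # Stage 1: partial sums per (key, shard). In a distributed runner these
    # subkeys could be processed in parallel by different workers.
    partial = defaultdict(int)
    for key, value in records:
        partial[shard_key(key, num_shards)] += value

    # Stage 2: drop the shard number and merge the partial sums.
    totals = defaultdict(int)
    for (key, _shard), subtotal in partial.items():
        totals[key] += subtotal
    return dict(totals)
```

This two-stage approach only works when the aggregation can be applied to partial results and then merged (sums, counts, min/max), which is also the situation where a Combiner can replace GroupByKey + ParDo.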
The data freshness or system lag are increasing
First of all, I'd recommend you check out this presentation on watermarks.
For streaming, the advance of the watermark is what drives the pipeline to make progress, so it is important to watch for things that could hold the watermark back and stall your pipeline downstream. Some reasons why the watermark may become stuck:
One possibility is that your pipeline is hitting an unresolvable error condition. When a bundle fails processing, your pipeline will continue to attempt to execute that bundle indefinitely, and this will hold the watermark back.
When this happens, you will see errors in your Dataflow console, and the count will keep climbing as the bundle is retried.
You may have a bug when associating the timestamps to your data. Make sure that the resolution of your timestamp data is the correct one!
Although unlikely, it is possible that you've hit a bug in Dataflow. If neither of the other tips helps, please open a support ticket.
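As an illustration of the timestamp-resolution point, here is a small plain-Python sketch of a sanity check. It assumes, for the sake of the example, that the pipeline expects epoch seconds while the data carries epoch milliseconds. The helper names and the 2000 to 2100 year range are hypothetical choices.

```python
from datetime import datetime, timezone

# Assumed conversion table: how many of each unit make up one second.
UNITS_PER_SECOND = {"s": 1, "ms": 1_000, "us": 1_000_000, "ns": 1_000_000_000}


def to_epoch_seconds(value, unit):
    """Convert an epoch timestamp in the given unit to float seconds."""
    return value / UNITS_PER_SECOND[unit]


def looks_plausible(epoch_seconds, min_year=2000, max_year=2100):
    """Sanity check: does the timestamp fall in an expected range of years?"""
    try:
        year = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).year
    except (OverflowError, OSError, ValueError):
        # Values far outside the representable range end up here.
        return False
    return min_year <= year <= max_year
```

A millisecond value interpreted as seconds lands tens of thousands of years in the future, so a check like this on a sample of elements can catch the mistake before it distorts the watermark.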

Google Cloud Run and golang goroutines

I'm considering Google Cloud Run for some cron-like operations I need to perform. They will get triggered by an HTTP invocation. The invocation will return (likely with a 202) and continue running in the background via a golang goroutine.
But, I'm concerned that Google Cloud Run containers are destroyed when they're not handling HTTP requests. I could be part-way through my processing and get reaped.
Is there a way to tell GCR to keep the container alive until I'm finished?
Cloud Run will scale your CPU down to nearly zero when it's not handling any requests, because you’re only paying when a request is being processed. (It's documented here).
Therefore, applications starting goroutines in the background are not suitable for Cloud Run. If you do this, your goroutines will most likely starve for CPU time shares and your program may start behaving very weirdly (as it would be running on a very very slow CPU, if anything at all).
The minuscule amount of CPU an inactive Cloud Run application gets is probably only good for garbage collection, which the Go runtime will be doing for you.
If you want to wait for your goroutine to finish during the context of the request, you should block the request from returning, by using something like a blocking receive from a chan, or sync.WaitGroup#Wait().
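The answer refers to Go, but the pattern itself (start the background work, then block the handler until all of it has finished) is language-neutral. Below is a rough Python analogue using the standard library, matching the language of the other code on this page. handle_request and do_work are hypothetical names and no real HTTP framework is involved.

```python
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=4)


def do_work(n):
    """Placeholder for one unit of the cron-like processing."""
    return sum(range(n))


def handle_request(items):
    """Hypothetical request handler that returns only after all work is done."""
    futures = [executor.submit(do_work, n) for n in items]
    # Blocking on every future keeps the request open, so the work runs while
    # the request is still being handled rather than after the response.
    results = [future.result() for future in futures]
    return 200, results
```

The trade-off is that the caller waits for the whole job, so the request timeout has to cover the processing time.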
The fairly new Always On CPU feature of Cloud Run solves this. Here is a link to the details: https://cloud.google.com/blog/products/serverless/cloud-run-gets-always-on-cpu-allocation

Docker as "Function" (Create a Docker per request)

Is there a simple way to create an instance of a Docker container for each request?
I have a Docker container that takes a very long time to compute a mathematical algorithm. While it is running, no other requests can be processed in parallel. Lambda Functions would be the best solution, but the container needs to download more than 1 GB of data and needs at least 10 cores and 5 GB of RAM to run, so Lambda would be too expensive.
We have a big cluster (1000 cores, 0.5 TB RAM) and I was considering using an NGINX load balancer or bare-metal Kubernetes.
Is it possible to configure it in a way that creates an instance per request (similar to a Lambda Function)?
There are tools like Airflow or Argo that are designed for these things.
Basically, you can create a DAG that will run very much like a function as a service, but on whatever custom Docker container you want.
You probably need to decouple the HTTP service from the backend processing. If the job takes minutes or longer to run, most browsers and other HTTP clients will time out before it will finish, so the HTTP end of it needs to start the job in some way and immediately return some sort of success message.
Once you’ve done that, you might find a job queue like RabbitMQ a useful piece of infrastructure technology. Again, this decouples the queue of jobs from the mechanism to actually run them. In a Docker/Kubernetes space you’d launch some number of persistent workers that all listened to the queue and did work as it appeared there. You wouldn’t necessarily launch one worker per job; or possibly you would have just one worker that launched other Docker containers or Kubernetes Jobs; but if the work backlog got too long you could launch additional workers.
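The decoupling described so far can be sketched in a single process. This is a minimal illustration only: the standard-library queue stands in for a broker such as RabbitMQ, threads stand in for worker containers, and every name here (submit_job, worker, the results dictionary) is hypothetical.

```python
import queue
import threading
import uuid

jobs = queue.Queue()   # in-process stand-in for a message broker
results = {}           # job_id -> result, filled in by the workers
results_lock = threading.Lock()


def submit_job(payload):
    """What the HTTP endpoint would do: enqueue and return an id immediately."""
    job_id = str(uuid.uuid4())
    jobs.put((job_id, payload))
    return job_id


def worker():
    """Persistent worker: take jobs off the queue until a None sentinel arrives."""
    while True:
        item = jobs.get()
        if item is None:
            jobs.task_done()
            break
        job_id, payload = item
        outcome = sum(payload)  # placeholder for the long computation
        with results_lock:
            results[job_id] = outcome
        jobs.task_done()


def start_workers(count):
    """Launch a fixed number of workers that all listen to the same queue."""
    threads = [threading.Thread(target=worker) for _ in range(count)]
    for t in threads:
        t.start()
    return threads


def stop_workers(threads):
    """Send one sentinel per worker, then wait for all of them to exit."""
    for _ in threads:
        jobs.put(None)
    for t in threads:
        t.join()
```

The caller gets a job id back straight away and can poll for the result later. Scaling up when the backlog grows corresponds to calling start_workers again with more workers.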
In a pure-Docker space it’s theoretically possible to use the Docker API to launch additional containers. However, doing this gives your process unlimited root-level access to the host; if you are running this in the context of an HTTP server you need to be extremely careful about security considerations. Kubernetes also has an API and from a security point of view this is probably better: you can set up a service account that has permissions only to launch Jobs, and launch a Job per inbound job that arrives. (Security is still important but it’s much harder for a malicious input to root the host.)

Difference between agents and worker threads

I'm working on running NUnit console runners using Jenkins. These tests connect to a Selenium Grid (which is also run by Jenkins), so I want to limit their level of parallelism in order to avoid getting agents starving while waiting for a free node on the grid.
So far I haven't managed to figure out what exactly is the difference between an agent and a worker thread in NUnit... I suspect the agent can manage threads, but it's only a guess. Thanks :)
An agent is a separate process running tests for an assembly. A worker is a thread, within a process, running the tests for a particular assembly.
Theoretically, an agent process could have multiple appdomains, each domain could have multiple assemblies and each assembly could have multiple worker threads.
Practically, however, the normal thing to do is to have one process per assembly, so that there is no need for multiple domains, and each process will run some specified number of worker threads to run tests for the assembly. In some contexts, you may prefer to only run processes in parallel and not have any parallelism within the assembly - it's the approach that is most likely to work without any change to your tests, which you may not have designed with parallelism in mind.
Agents do not "manage" threads. They simply run the framework in a process and the framework decides how many threads to use depending on the attributes you have applied.
Using multiple agents is the only way to run NUnit v2 tests in parallel, since the v2 framework is ignorant of parallelism.

Resources