Max CPU% on Continuous WebJob - azure-webjobssdk

I have a continuous webjob running that reads messages from a queue, reads a file from Blob Storage, converts it, and then writes the converted file to a different blob container. All of the files are being converted properly. The Kudu site for my app service is running at nearly 100%. The process explorer in Kudu shows my webjob as the only other process running within that server. Conventional wisdom indicates that it is probably the webjob that is the problem. Are there any tools for determining what the problem might be?
Thanks!

The simplest approach is to download a process dump (From Kudu process explorer), and then analyze it locally, e.g. with windbg or Visual Studio. By looking at the threads, you should be able to identify which ones are actively spinning CPU.
Another simple test is to temporarily stop creating new blobs, and check whether that causes the CPU usage to go down. If it's still high in the absence of any processing, then something strange is going on.

Related

dask scheduler OOM opening a large number of avro files

I'm trying to run a pipeline via dask on a cluster on gcp. The pipeline loads a lot of avro files from cloud storage (~5300 files with around 300MB each) like this
bag = db.read_avro(
'gcs://mybucket/myfiles-*.avro',
blocksize=5000000
)
It then applies some transformations and saves the data back to cloud storage (as parquet files).
I've tested this pipeline with a fraction of the avro files and it works perfectly, but when I tell it to ingest all the files, the scheduler process sits at 100% CPU for a long time and at some point it runs out of memory (I have tried scaling my master node running the scheduler up to 64GB of RAM but that still does not suffice), while the workers are idling. I assume that the problem is that it has to create an excessive amount of tasks that are all held in RAM before being distributed to the workers.
Is this some sort of antipattern that I'm using when trying to open a very large number of files? If so, is there perhaps a built-in way to better cope with this or would I have to split the avro files manually?
Avro with Dask at scale is not particularly well-trodden territory. There is no theoretical reason it should not work. You could inspect the contents of the graph to see if things are getting serialised there that are large, or if simply a massive number of tasks are being generated. If the former, it may be solvable, and you could raise an issue.
As you say, you may be able to keep the load on the scheduler down by processing sub-batches out of the total set of files at a time and waiting for completion.

Cloud Run - OpenBLAS Warnings and application restarts (Not a cold start issue)

Problem
I have an application running on a Cloud Run instance for a 5 months now.
The application has a startup time of about 3 minutes and when the startup is over it does not need much RAM.
Here are two snapshots of docker stats when I run the app locally :
When the app isn't excited
When the app is receiving 10 requests per seconds (Which is way over our use case for now) :
There aren't any problems when I run the app locally however problems arise when I deploy it on Cloud Run. I keep receiving : "OpenBLAS WARNING - could not determine the L2 cache size on this system, assuming 256k" messages followed by the restart of the app. This is a problem because as I said the app takes up to 3 minutes to restart, during which the requests take a lot of time to get treated.
I already fixed the cold start issue by using a minimum instance of 1 AND using a google cloud scheduler to query the service every minutes.
Examples
Here are examples of what I see in the logs.
In the second example the warnings came once again just after the application restart which caused a second restart in a row, this happens quite often.
Also note that those warnings/restarts are not necessarily happening when users are connected to the app but can happen when the only activity is due to the Google Cloud Scheduler
I tried increasing the allocated RAM and CPU to 4 CPUs and 4 Go of RAM (which is a huge over kill) and yet the problem remains.
Update 02/21
As of 01/01/21 we stopped witnessing such behavior from our cloud run service (maybe due an update, I don't know). I did contact the GCP support but they just told me to raise an issue on the OpenBLAS github repo but since I can't reproduce the behavior I did not do so. I'll leave the question open as nothing I did really worked.
OpenBLAS performs high performance compute optimizations and need to know what are the CPU capacity to tune itself the best.
However, when you run a container on Cloud Run, you run it in a sandbox GVisor, to increase the security and the isolation of all the container running on the same serverless platform.
This sandbox intercepts low level kernel calls and discard the abnormal/dangerous ones. I guess that for this reason that OpenBLAS can't determine the L2 cache size. On your environment, you haven't this sandbox, and you can access directly to the CPU info.
Why it's restart?? It could be a problem with OpenBLAS or a problem with Cloud Run (suspicious kernel call, kill the instance and restart it).
I haven't immediate solution because I don't know OpenBLAS. I had similar behavior with Tensorflow Serving, and tensorflow proposes a compiled version without any CPU optimization: less efficient but more portable and resilient to different environment constraint. If a similar compilation exists for OpenBLAS, it could be great to test it.

What is the diffrence between Tika app, Tika Server and Java Wrapper. Which one is used and when?

I want to use Apache Tika for enterprise-level huge and lots of documents. Which one I use, Tika Server or Tika App or Java calls? Can you suggest me a system architecture? (i.e. Load balanced 3-4 Tika physically different Server)
Making PUT calls to a REST endpoint for sending thousands of 0.5 GB documents over HTTP, one at a time, is not an appropriate scenario for the Tika Server. It will not be memory efficient and the server will likely crash with some kind of memory leak or bugs.
Although as of v1.19 there is now a -spawnChild option to periodically restart the process after it has processed -maxFiles. From v2.x, this is now the default.
For your needs, you should simply use the tika-app in batch mode, which:
Runs locally, using an input and output directory that you specify
Sets up parent/child processes to robustly handle hangs/OOMEs
Runs multiple parser threads in parallel
Can restart child every x minutes or after y files to avoid memory leaks
Logs failures
java -jar tika-app.jar -i <input_directory> -o <output_dir>

Neo4j 2.0.4 browser cannot query large datasets

Whenever I try to run cypher queries in Neo4j browser 2.0 on large (anywhere from 3 to 10GB) batch-imported datasets, I receive an "Unknown Error." Then Neo4j server stops responding, and I need to exit out using Task Manager. Prior to this operation, the server shuts down quickly and easily. I have no such issues with smaller batch-imported datasets.
I work on a Win 7 64bit computer, using the Neo4j browser. I have adjusted the .properties file to allow for much larger memory allocations. I have configured my JVM heap to 12g, which should be fine for 64bit JDK. I just recently doubled my RAM, which I thought would fix the issue.
My CPU usage is pegged. I have the logs enabled but I don't know where to find them.
I really like the visualization capabilities of the 2.0.4 browser, does anyone know what might be going wrong?
Your query is taking a long time, and the web browser interface reports "Unknown Error" after a certain timeout period. The query is still running, but you won't see the results in the browser. This drove me nuts too when it first happened to me. If you run the query in the neo4j shell you can verify whether or not this is the problem, because the shell won't time out.
Once this timeout occurs, you can find that the whole system becomes quite non-responsive, especially if you re-run the query, because now you have two extremely long queries running in parallel!
Depending on the type of query, you may be able to improve performance. Sometimes it's as simple as limiting the number of returned nodes (in cases where you only need to find one node or path).
Hope this helps.
Grace and peace,
Jim

building multiple jobs in jenkins performance

In Jenkins I have 100 java projects. Each has its own build file.
Every time I want clear the build file and compile all source files again.
Using bulkbuilder plugin I tried compling all the jobs.. Having 100 jobs run parallel.
But performance is very bad. Individually if the job takes 1 min. in the batch it takes 20mins. More the batch size more the time it takes.. I am running this on powerful server so no problem of memory and CPU.
Please Suggest me how to over come this.. what configurations need to be done in jenkins.
I am launching jenkins using war file.
Thanks..
Even though you say you have enough memory and CPU resources, you seem to imply there is some kind of bottleneck when you increase the number of parallel running jobs. I think this is understandable. Even though I am not a java developer, I think most of the java build tools are able to parallelize build internally. I.e. building a single job may well consume more than one CPU core and quite a lot of memory.
Because of this I suggest you need to monitor your build server and experiment with different batch sizes to find an optimal number. You should execute e.g. "vmstat 5" while builds are running and see if you have idle cpu left. Also keep an eye on the disk I/O. If you increase the batch size but disk I/O does not increase, you are consuming all of the I/O capacity and it probably will not help much if you increase the batch size.
When you have found the optimal batch size (i.e. how many executors to configure for the build server), you can maybe tweak other things to make things faster:
Try to spend as little time checking out code as possible. Instead of deleting workspace before build starts, configure the SCM plugin to remove files that are not under version control. If you use git, you can use a local reference repo or do a shallow clone or something like that.
You can also try to speed things up by using SSD disks
You can get more servers, run Jenkins slaves on them and utilize the cpu and I/O capacity of multiple servers instead of only one.

Resources