Icinga - check_yum - Socket Timeout? - monitoring

I'm using the check_yum - Plugin in my Icinga-Monitoring-Environment to check if there are security critical updates available. This works quite fine but sometimes I get a " CHECK_NRPE: Socket timeout after xx seconds." while executing the check. Currently my NRPE-Timeout is 30 seconds.
If I re-schedule the check a few times or executing the check directly from my Icinga-Server with a higher nrpe-timeout-value everything works fine, at least after a few executions of the check. All other checks via NRPE are not throwing any errors. So I think there is no general error with my NRPE-config or the plugins I'm using. Is there some explanation for this strange behaviour of the check_yum - plugin? Maybe some caching issues on the monitored servers?

First, be sure you are using the 1.0 version of this check from: https://code.google.com/p/check-yum/downloads/detail?name=check_yum_1.0.0&can=2&q=
The changes I've seen in that version could fix this issue, depending on it's root cause.
Second, if your server(s) are not configured to use all 'local' cache repos, then this check will likely time out before the 30 second deadline. Because: 1> the amount of data from the refresh/update is pretty large and may be taking a long time to download from remote (include RH proper) servers and 2> most of the 'official' update servers tend to go off-line A LOT.
Best solution I've found is to have a cronjob to perform your update check at a set interval (I use weekly) and create a log file containing those security patches the system(s) require. Then use a Nagios check, via a simple shell script, to see if said file has any new items in it.

Related

How to change the socket time out value in Jira?

We are trying to change the socket timeout in Jira as some of the REST API calls are taking too long to respond due to which we are getting Request Time Out Error.
For changing it, we tried the following but NONE of them worked:
We made changes in the General Configuration settings by following this article.
We followed the following article and changed the JVM_SUPPORT_RECOMMENDED_ARGS parameter to increase the socket time and these are our observations:
When setting the JVM_SUPPORT_RECOMMENDED_ARGS to 20000milliseconds (20sec), we found that it fails for the delay above 20sec; rest it is working fine for any value below it.
When the JVM_SUPPORT_RECOMMENDED_ARGS parameter set to any value between 2min to 14min, the delay above 50sec gives error. For the delay uptill 50sec in project creation, it is working fine.
The snapshot of setenv.sh file, where we made our changes:
Please suggest how to increase the socket time out so that we do not get a Request Time Out for a delay of around 2 min.
Any suggestions would be helpful.
If you use unproxied access to your Jira (i.e. via Tomcat distributed together with Jira), then you will have to change the timeout also in the Tomcat configuration: <jira-install>/conf/server.xml.
There's Tomcat's connector configuration:
<Connector port="8080" connectionTimeout="20000" ...
^^^^^
Notes:
Also feel free to contact official Atlassian Support if you have paid subscription.
Also take a look into logs to find out why project creation takes so much time and times-out (it's not normal, project creation is a matter of a second).

Cloud Run - OpenBLAS Warnings and application restarts (Not a cold start issue)

Problem
I have an application running on a Cloud Run instance for a 5 months now.
The application has a startup time of about 3 minutes and when the startup is over it does not need much RAM.
Here are two snapshots of docker stats when I run the app locally :
When the app isn't excited
When the app is receiving 10 requests per seconds (Which is way over our use case for now) :
There aren't any problems when I run the app locally however problems arise when I deploy it on Cloud Run. I keep receiving : "OpenBLAS WARNING - could not determine the L2 cache size on this system, assuming 256k" messages followed by the restart of the app. This is a problem because as I said the app takes up to 3 minutes to restart, during which the requests take a lot of time to get treated.
I already fixed the cold start issue by using a minimum instance of 1 AND using a google cloud scheduler to query the service every minutes.
Examples
Here are examples of what I see in the logs.
In the second example the warnings came once again just after the application restart which caused a second restart in a row, this happens quite often.
Also note that those warnings/restarts are not necessarily happening when users are connected to the app but can happen when the only activity is due to the Google Cloud Scheduler
I tried increasing the allocated RAM and CPU to 4 CPUs and 4 Go of RAM (which is a huge over kill) and yet the problem remains.
Update 02/21
As of 01/01/21 we stopped witnessing such behavior from our cloud run service (maybe due an update, I don't know). I did contact the GCP support but they just told me to raise an issue on the OpenBLAS github repo but since I can't reproduce the behavior I did not do so. I'll leave the question open as nothing I did really worked.
OpenBLAS performs high performance compute optimizations and need to know what are the CPU capacity to tune itself the best.
However, when you run a container on Cloud Run, you run it in a sandbox GVisor, to increase the security and the isolation of all the container running on the same serverless platform.
This sandbox intercepts low level kernel calls and discard the abnormal/dangerous ones. I guess that for this reason that OpenBLAS can't determine the L2 cache size. On your environment, you haven't this sandbox, and you can access directly to the CPU info.
Why it's restart?? It could be a problem with OpenBLAS or a problem with Cloud Run (suspicious kernel call, kill the instance and restart it).
I haven't immediate solution because I don't know OpenBLAS. I had similar behavior with Tensorflow Serving, and tensorflow proposes a compiled version without any CPU optimization: less efficient but more portable and resilient to different environment constraint. If a similar compilation exists for OpenBLAS, it could be great to test it.

How to get Cloud Run to handle multiple simultaneous deployments?

I've got a project with 4 components, and every component has hosting set up on Google Cloud Run, separate deployments for testing and for production. I'm also using Google Cloud Build to handle the build & deployment of the components.
Due to lack of good webhook events from source system, I'm currently forced to trigger a rebuild of all components in a project every time there is a new change. In the project this means 8 different images to build and deploy, as testing and production use different build-time settings as well.
I've managed to optimize Cloud Build to handle the 8 concurrent builds pretty nicely, but they all finish around the same time, and then all 8 are pushed to Cloud Run. It often seems like Cloud Run does not like this at all and starts throwing some errors to me that I've been unable to resolve.
First and more serious is that often about 4-6 of the 8 deployments go through as expected, and the remaining ones either are significantly delayed or just fail, often so that the first few go through fine, then a few with significant delays, and the final 1-2 just fail. This seems to be caused by some "reconciliation request quota" being exhausted in the region (in this case europe-north1), as this is the error I can see at the top of the Cloud Run service -view:
Additionally and mostly annoyingly, the Cloud Run dashboard itself does not seem to handle having 8 services deployed, as just sitting on the dashboard view listing the services regularly throws me another error related to some read quotas:
I've tried contacting Google via their recommended "Send feedback" button but have received no reply in ~1wk+ (who knows when I sent it, because they don't seem to confirm receipt).
One option I can do to try and improve the situation is to deploy the "testing" and "production" variants in different regions, however that would be less than optimal, and seems like this is some simple configuration somewhere about the limits. Are there other options for me to consider? Or should I just try to set up some synchronization on these that not all deployments are fired at once?
Optimizing the need to build and deploy all components at once is not really an option in this case, since they have some shared code as well, and when that changes it would still be necessary to support this.
This is an issue with Cloud Run. Developers are expected to be able to deploy many services in parallel.
The bug should be fixed within a few days or couple of weeks.
[update] Bug should now be fixed.
Make sure to use the --async flag if you want to deploy in parrallel: gcloud run deploy $SERVICE --image --async

HBase 0.98.1 Put operations never timeout

I am using 0.98.1 version of HBase server and client. My application has strict response time requirements. As far as HBase is concerned, I would like to abort the HBase operation if the execution exceeds 1 or 2 seconds. This task timeout is useful in case of Region-Server being non-responsive or has crashed.
I tired configuring
1) HBASE_RPC_TIMEOUT_KEY = "hbase.rpc.timeout";
2) HBASE_CLIENT_RETRIES_NUMBER = "hbase.client.retries.number";
However, the Put operations never timeout (I am using sync flush). The operations return only after the Put is successful.
I looked through the code and found that the function receiveGlobalFailure in AsyncProcess class keeps resubmitting the task without any check on the retires. This is in version 0.98.1
I do see that in 0.99.1 there have been some changes to AsyncProcess class that might do what I want. I have not verified it though.
My questions are:
Is there any other configuration that I missed that can give me
the desired functionality.
Do I have to use 0.99.1 client to
solve my problem? Does 0.99.1 solve my problem?
If I have to use 0.99.1 client, then do I have to use 0.99.1 server or can I still use my existing 0.98.1 region-server.

Bulk publishing Issue in Fatwire

Currently performing a Bulk publish using Mirror Queue for around 11000 assets.
Have tried the same in lower environments and was successful.
No error logged # source and destination futuretense logs. The asset update got logged at destination for long time and no updates later. Publishing process still keeps rolling in source.
No stuck threads and memory consumption looks good on both environments.
Any inputs for the same, will be much appreciated.
What is the Fatwire version? Please note that mirror to server publishing is no longer supported in Oracle Webcenter Sites 11.1.1.8. Real time publishing is more robust and less error prone.

Resources