HBase 0.98.1 Put operations never time out

I am using version 0.98.1 of the HBase server and client. My application has strict response time requirements. As far as HBase is concerned, I would like to abort an HBase operation if its execution exceeds 1 or 2 seconds. This task timeout is useful in case a RegionServer is non-responsive or has crashed.
I tried configuring
1) HBASE_RPC_TIMEOUT_KEY = "hbase.rpc.timeout";
2) HBASE_CLIENT_RETRIES_NUMBER = "hbase.client.retries.number";
However, the Put operations never time out (I am using sync flush). The operations return only after the Put is successful.
I looked through the code and found that the receiveGlobalFailure function in the AsyncProcess class keeps resubmitting the task without any check on the retries. This is in version 0.98.1.
I do see that in 0.99.1 there have been some changes to the AsyncProcess class that might do what I want, but I have not verified this.
My questions are:
Is there any other configuration I missed that would give me the desired functionality?
Do I have to use the 0.99.1 client to solve my problem? Does 0.99.1 actually solve it?
If I have to use the 0.99.1 client, do I also have to use a 0.99.1 server, or can I keep my existing 0.98.1 region servers?


Large percent of requests in CLRThreadPoolQueue

We have an ASP.NET MVC application hosted in an Azure App Service. After running the profiler to help diagnose possible slow requests, we were surprised to see this:
An unusually high percentage of slow requests in the CLRThreadPoolQueue. We've now run multiple profiling sessions, and each comes back with between 40-80% in the CLRThreadPoolQueue (something we'd never seen in previous profiles). CPU each time was below 40%, and after checking our metrics we aren't seeing sudden spikes in requests.
The majority of the requests listed as slow are super simple API calls. We've added response caching and made them async. The only thing they do is hit a database looking for a single-record result. We've checked the metrics on the database, and the average query run time is around 50ms or less. Looking at Application Insights for these requests confirms this, and shows that the database query doesn't take place until the very end of the request timeline (I assume this is the request sitting in the queue).
Recently we started including SignalR in a portion of our application. It's not fully in use, but it is in the code base. We have since switched to using Azure SignalR Service and saw no changes. The addition of SignalR is the only "major" change/addition we've made since encountering this issue.
I understand we can scale up and/or increase the minWorkerThreads. However, this feels like I'm just treating the symptom not the cause.
Things we've tried:
Finding the most frequent requests and making them async (they weren't before)
Response caching for frequent requests
Using Azure SignalR Service rather than hosting it in the same web app
Running memory dumps and contacting Azure support (they found nothing)
Scaling up to an S3
Profiling with and without thread report
-- None of these steps have resolved our issue --
How can we determine what requests and/or code is causing requests to pile up in the CLRThreadPoolQueue?
We encountered a similar problem; I guess internally SignalR must be using up a lot of threads or some other contended resource.
We did three things that helped a lot:
Call ThreadPool.SetMinThreads(400, 1) on app startup to make sure that the threadpool has enough threads to handle all the incoming requests from the start
Create a second App Service with the same code deployed to it. In the javascript, set the SignalR URL to point to that second instance. That way, all the SignalR requests go to one app service, and all the app's HTTP requests go to the other. Obviously this requires a SignalR backplane to be set up, but assuming your app service has more than 1 instance you'll have had to do this anyway
Review the code for any synchronous code paths (e.g. making a non-async call to the database or to an API) and convert them to async code paths

How can I get the result of a Dask compute on a different machine than the one that submitted it?

I am using Dask behind a Django server and the basic setup I have is summarised here: https://github.com/MoonVision/django-dask-demo/ where the Dask client can be found here: https://github.com/MoonVision/django-dask-demo/blob/master/demo/daskmanager/daskmanager.py
I want to be able to separate the saving of a task's result from the server that submitted it, for robustness and scalability. I would also like more detailed information on the processing status of the task; right now the future's status is always pending, even if the task is processing. Having a rough estimate of percent complete would also be great.
Right now, if the web server were to die, the client would get deleted and the task would stop as no client is still holding the future. I can get around this by using fire_and_forget but I then have no way to save the task status and result when it completes.
Ways I see to track the status and save the result after a fire_and_forget:
I could have a scheduler plugin that sends all transfers to an AMQP server (RabbitMQ). I like the robustness, and being able to subscribe to certain messages output by the scheduler and knowing every message will be processed. I'm not sure how I could get the result itself with this method. I could manually add a node to the end of every graph to save the result, but I would rather have that happen behind the scenes.
Use get_task_stream on a separate server, or use it in some other way. With this, it seems I could miss some messages if the server were to go down, so it seems like a worse version of option 1.
Other option?
What would be the best way to accomplish this?
Edit: I just tested, and it seems that when the client that submitted a task shuts down, all futures it created are moved from processing to forgotten, even when calling fire_and_forget.
You probably want to look at Dask's coordination primitives like Queues and Pub/Sub. My guess is that putting your futures into a queue would solve your problem.
https://docs.dask.org/en/latest/futures.html#coordination-primitives
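
For illustration, here is a rough sketch of that suggestion. The scheduler address and the process_request / save_result functions are made up, and this is only an outline of the hand-off through a distributed Queue, not a tested implementation:

from distributed import Client, Queue, fire_and_forget

def process_request(payload):
    # placeholder for the real work submitted by the Django view
    return payload * 2

# --- in the web server process that submits work ---
client = Client("tcp://scheduler:8786")            # assumed scheduler address
results_q = Queue("results-to-save", client=client)

future = client.submit(process_request, 21)
fire_and_forget(future)        # ask the scheduler to keep the task even if this client dies
results_q.put(future)          # hand the future off through the scheduler-backed queue

# --- in a separate, long-lived "saver" process ---
saver = Client("tcp://scheduler:8786")
results_q = Queue("results-to-save", client=saver)

def save_result(value):
    # placeholder: persist to the Django database instead of printing
    print("saving", value)

while True:
    fut = results_q.get()      # blocks until a future arrives from the web server
    save_result(fut.result())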

How to set up logging on dask distributed workers?

After upgrading dask.distributed to version 1.15.0, my logging stopped working.
I've used logging.config.dictConfig to initialize Python's logging facilities, and previously these settings propagated to all workers. But after the upgrade it doesn't work anymore.
If I call dictConfig right before every log call on every worker it works, but that's not a proper solution.
So the question is: how do I initialize logging on every worker before my computation graph starts executing, and do it only once per worker?
UPDATE:
This hack worked on a dummy example but didn't make a difference on my system:
import distributed

def init_logging():
    # logging initialization happens here
    ...

client = distributed.Client()
# the idea: run init_logging once per worker by mapping over the worker addresses
client.map(lambda _: init_logging(), client.ncores())
UPDATE 2:
After digging through the documentation, this fixed the problem:
client.run(init_logging)
So the question now is: Is this a proper way to solve this problem?
As of version 1.15.0 we now fork workers from a clean process, so changes that you make to your process prior to calling Client() won't affect forked workers. For more information search for forkserver here: https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods
Your solution of using Client.run looks good to me. Client.run is currently (as of version 1.15.0) the best way to call a function on all currently active workers.
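
For reference, a minimal sketch of that pattern; the dictConfig contents below are purely illustrative, only the init_logging / client.run pairing comes from the question:

import logging.config
from distributed import Client

LOGGING_CONFIG = {
    "version": 1,
    "formatters": {
        "default": {"format": "%(asctime)s %(levelname)s %(name)s %(message)s"},
    },
    "handlers": {
        "console": {"class": "logging.StreamHandler", "formatter": "default"},
    },
    "root": {"level": "INFO", "handlers": ["console"]},
}

def init_logging():
    # runs inside each worker process and configures its logging
    logging.config.dictConfig(LOGGING_CONFIG)

client = Client()
# call init_logging once on every currently active worker
client.run(init_logging)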
Distributed Systems
It is worth noting that here you're setting up workers forked from the same process on a single computer. The trick above will not work in a distributed setting. I'm adding this note in case people come to this question asking how to handle logging with Dask in a cluster context.
Generally Dask does not move logs around. Instead, it is common that whatever mechanism you used to launch Dask handles this. Job schedulers like SGE/SLURM/Torque/PBS all do this. Cloud systems like YARN/Mesos/Marathon/Kubernetes all do this. The dask-ssh tool does this.

Redis flushall command is randomly being called

I have a Ruby app in production that uses Sidekiq (which uses Redis), and I have discovered that FLUSHALL commands are being called, which wipe the database (removing all the processed and scheduled jobs).
I don't know or understand what could be causing this.
Does anyone know how I can begin to trace the call to flushall?
Thanks,
It is most likely that your Redis server is open to the public network without any protection - that is just asking for trouble, because anyone can connect and do much more damage than just a FLUSHALL. If that is the case, use password authentication at the very least, after burning the compromised server - the attacker may have gained access to your server's operating system and from there, who knows where. More information at: http://antirez.com/news/96
If that isn't the case and you have a rogue application somewhere that randomly calls unwanted commands, you can try tracking it down by combining the MONITOR and CLIENT LIST commands.
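
As an illustration of that approach, here is a rough Python sketch using redis-py; the connection details are placeholders, and MONITOR adds real overhead, so only leave it running briefly:

import redis

r = redis.Redis(host="localhost", port=6379)   # placeholder connection details

# Snapshot the currently connected clients so addresses seen in MONITOR
# can be matched back to a client name / last command.
for info in r.client_list():
    print(info.get("addr"), info.get("name"), info.get("cmd"))

# Stream every command the server receives and flag the culprit.
with r.monitor() as mon:
    for event in mon.listen():
        if event["command"].upper().startswith("FLUSHALL"):
            print("FLUSHALL issued by", event["client_address"], event["client_port"])
            break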
Lastly, you can consider renaming/disabling the FLUSHALL command, at least temporarily, until you get to the bottom of this.

Icinga - check_yum - Socket Timeout?

I'm using the check_yum plugin in my Icinga monitoring environment to check whether security-critical updates are available. This works quite well, but sometimes I get a "CHECK_NRPE: Socket timeout after xx seconds." error while executing the check. Currently my NRPE timeout is 30 seconds.
If I re-schedule the check a few times, or execute the check directly from my Icinga server with a higher NRPE timeout value, everything works fine, at least after a few executions of the check. All other checks via NRPE are not throwing any errors, so I think there is no general error with my NRPE config or the plugins I'm using. Is there some explanation for this strange behaviour of the check_yum plugin? Maybe some caching issues on the monitored servers?
First, be sure you are using the 1.0 version of this check from: https://code.google.com/p/check-yum/downloads/detail?name=check_yum_1.0.0&can=2&q=
The changes I've seen in that version could fix this issue, depending on its root cause.
Second, if your server(s) are not configured to use all 'local' cache repos, then this check will likely time out before the 30-second deadline, because: 1) the amount of data from the refresh/update is fairly large and may take a long time to download from remote (including RH proper) servers, and 2) most of the 'official' update servers tend to go offline a lot.
The best solution I've found is to have a cron job perform the update check at a set interval (I use weekly) and write a log file containing the security patches the system(s) require. Then use a Nagios check, via a simple shell script, to see whether that file has any new items in it.
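
The answer describes a simple shell script; purely to illustrate the idea in Python, here is a sketch of such a check, assuming the cron job writes any pending security updates to a file like /var/log/yum_security_updates.log (a made-up path):

#!/usr/bin/env python3
# Nagios/Icinga-style check: exits CRITICAL if the cron-generated file lists updates.
import sys

LOG_FILE = "/var/log/yum_security_updates.log"   # hypothetical path written by the cron job

try:
    with open(LOG_FILE) as f:
        pending = [line.strip() for line in f if line.strip()]
except FileNotFoundError:
    print("UNKNOWN - %s not found; has the cron job run yet?" % LOG_FILE)
    sys.exit(3)

if pending:
    print("CRITICAL - %d pending security update(s)" % len(pending))
    sys.exit(2)

print("OK - no pending security updates")
sys.exit(0)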
