We run a TokuMX replica-set (2 instances + arbiter) with about 120GB of data (on disk) and lots of indices.
Since the upgrade to TokuMX 2.0 we noticed that restarting the SECONDARY instance always took a very long time. The database kept getting stuck at STARTUP2 for 1h+, before switching to normal mode. While the server is at STARTUP2, it's running at a continuous CPU load - we assume it's rebuilding its indices, even though it was shut down properly before.
While this is annoying, with the PRIMARY being available it caused no downtime. But recently during an extended maintenance we needed to restart both instances.
We stopped the SECONDARY first, then the PRIMARY and started them in reverse order. But this resulted in both taking the full 1h+ startup-time and therefore the replica-set was not available for this time.
Not being able to restart a possibly downed replica-set without waiting for such a long time is a risk we'd rather not take.
Is there a way to avoid the (possible) full index-rebuild on startup?
#Chris - We are revisiting your ticket now. It may have been inadvertently closed prematurely.
#Benjamin: You may want to post this on https://groups.google.com/forum/#!forum/tokumx-user where many more TokuMX users, and the Tokutek support team, will see it.
This is a bug in TokuMX, which is causing it to load and iterate the entire oplog on startup, even if the oplog has been mostly (or entirely) replicated already. I've located and fixed this issue in my local build of TokuMX. The pull request is here: https://github.com/Tokutek/mongo/pull/1230
This has reduced my node startup times from hours to <5 seconds.
Problem
I have had an application running on a Cloud Run instance for 5 months now.
The application has a startup time of about 3 minutes and when the startup is over it does not need much RAM.
Here are two snapshots of docker stats when I run the app locally:
When the app is idle:
When the app is receiving 10 requests per second (which is well above our current usage):
There aren't any problems when I run the app locally; however, problems arise when I deploy it on Cloud Run. I keep receiving "OpenBLAS WARNING - could not determine the L2 cache size on this system, assuming 256k" messages, followed by a restart of the app. This is a problem because, as I said, the app takes up to 3 minutes to restart, during which requests take a long time to be handled.
I already fixed the cold start issue by setting the minimum number of instances to 1 and using Google Cloud Scheduler to query the service every minute.
Examples
Here are examples of what I see in the logs.
In the second example the warnings came once again just after the application restarted, which caused a second restart in a row. This happens quite often.
Also note that those warnings/restarts do not necessarily happen when users are connected to the app; they can also happen when the only activity is due to the Google Cloud Scheduler.
I tried increasing the allocated resources to 4 CPUs and 4 GB of RAM (which is huge overkill), and yet the problem remains.
Update 02/21
As of 01/01/21 we stopped witnessing such behavior from our Cloud Run service (maybe due to an update, I don't know). I did contact GCP support, but they just told me to raise an issue on the OpenBLAS GitHub repo; since I can't reproduce the behavior, I did not do so. I'll leave the question open as nothing I did really worked.
OpenBLAS performs high-performance compute optimizations and needs to know the CPU's capabilities in order to tune itself.
However, when you run a container on Cloud Run, it runs inside the gVisor sandbox, which increases the security and isolation of all the containers running on the same serverless platform.
This sandbox intercepts low-level kernel calls and discards the abnormal/dangerous ones. I guess this is why OpenBLAS can't determine the L2 cache size. In your local environment you don't have this sandbox, so you can access the CPU info directly.
Why does it restart? It could be a problem with OpenBLAS or a problem with Cloud Run (a suspicious kernel call leading it to kill the instance and restart it).
I don't have an immediate solution because I don't know OpenBLAS. I saw similar behavior with TensorFlow Serving, and TensorFlow offers a compiled version without any CPU optimization: less efficient, but more portable and more resilient to different environment constraints. If a similar build exists for OpenBLAS, it would be worth testing.
I occasionally (~ 1 out of 30 times) get a net::ERR_CACHE_READ_FAILURE in Chrome dev tools when loading my Electron app. I can't track down a reason for the error and I can't reproduce it consistently. Has anyone run into this problem before?
If you run multiple instances of your app, the first instance might lock the cache, which will prevent another instance from reading the cache.
Take a look at this Github issue:
You should not run multiple instances of the same app at the same time, for certain operations global locks are applied. In your case the cache database is locked by one instance and all other instances will fail to read cache.
You can use the app.requestSingleInstanceLock() API to prevent multiple instances of your application from running if that is appropriate for you.
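As a rough sketch of how that lock is commonly wired up (TypeScript; the `SingleInstanceApp` interface and the `ensureSingleInstance` helper are illustrative names and not part of Electron, which only provides `app.requestSingleInstanceLock()`, `app.quit()` and the `second-instance` event), the lock is requested early during startup and the process quits if it is not granted:

```typescript
// Minimal shape of the parts of Electron's `app` object used below.
// The interface name is hypothetical; it exists so the helper can be
// exercised without Electron installed.
interface SingleInstanceApp {
  requestSingleInstanceLock(): boolean;
  quit(): void;
  on(event: "second-instance", listener: (...args: any[]) => void): unknown;
}

// Returns true if this process holds the lock and should keep starting up.
// Returns false after asking the app to quit because another instance
// already holds the lock.
function ensureSingleInstance(
  app: SingleInstanceApp,
  onSecondInstance: () => void
): boolean {
  if (!app.requestSingleInstanceLock()) {
    app.quit();
    return false;
  }
  // Fired in the first instance whenever someone tries to start another one.
  app.on("second-instance", () => onSecondInstance());
  return true;
}

// Usage sketch inside an Electron main process (assumes a `mainWindow`
// variable of your own):
//   import { app } from "electron";
//   if (ensureSingleInstance(app, () => mainWindow?.focus())) {
//     app.whenReady().then(createWindow);
//   }
```

With only one process ever opening the cache, the locking conflict described above should not occur. Whether that actually removes the intermittent ERR_CACHE_READ_FAILURE in your case would still need to be verified.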
We use Umbraco v.7.2.6 on multiple nodes.
When the database server gets reloaded, Umbraco starts issuing an endless stream of SQL queries similar to those shown in the image.
The load on the network channels increases sixfold and hits the bandwidth limit. Needless to say, our environment becomes very slow.
The only way we have found to solve the problem is to restore a previous backup of the database. This issue happens occasionally and we don't know how to fix it.
What would be the steps to troubleshoot the problem?
So I've been running into some speed issues with my site which has been online for a few weeks now. It's an MVC3 site using MySQL on discountasp.net.
I cleaned up the structure of the site and got it working pretty fast on my local machine, around 800-1100ms to load with no caching. The strange thing is when I try and visit the live site I get times of around 15-16 seconds, sometimes freezing up as long as 30 seconds. I switched off the viewstate in web.config and now the local loads in 1.3 seconds (yes, oddly a little longer) and the live site is down to 8-9 seconds most of the time, but that's still pretty poor.
Without making this problem too specific to my case (since there can be a million reasons sites go slow), I am curious whether there are any reasons why the local Visual Studio server or IIS Express would load so fast while the live site loads so slowly. Wouldn't anything code-wise or dependency-wise affect both equally? I just can't think of a reason that would affect the live site but not the local one.
Any thoughts?
Further thoughts: I have the site setup as a sub-folder which I'm using IIS URL Rewriting to map to a subdomain. I've not heard of this causing issues before, but could this be a problem?
Further Further Updates: So I uploaded a simple page that does nothing but query all the records in the largest table I have, with no caching. On my local machine it averages around 110ms (which still seems slow...), and on the live site it usually takes more than double that. If I'm hitting the database several times to load the page, it makes sense that this would heavily affect the page load time. I'm still not sure if the issue is with LINQ or MySQL or MVC in general (maybe even discountasp.net).
I had a similar problem once and the culprit was the initialization of the user session. It turned out that a lot of objects were being read from and written to the session state on each request, but for some reason this wasn't affecting my local machine (I probably had InProc mode enabled locally).
So try adding an attribute to some of your controllers and see if that speeds things up:
using System.Web.Mvc;
using System.Web.SessionState; // defines SessionStateBehavior
[SessionState(SessionStateBehavior.Disabled)]
public class MyController : Controller
{
}
On another note, I ran some tests, and surprisingly, it was faster to read some of those objects from the DB on each request than to read them once, then put them in the session state. That kinda makes sense, since session state mode in production was SqlServer, and serialization/deserialization was apparently slower than just assigning values to properties from a DataReader. Plus, changing that had the nice side-effect of avoiding deserialization errors when deploying a new version of the assembly...
By the way, even 992ms is too much, IMHO. Can you use output caching to shave that off a bit?
So as I mentioned above, I had caching turned off for development, but only on my local machine. What I didn't realise was there was a problem WITH the caching which was turned on for the LIVE server, which I never turned off because I thought it was helping fix the slow speeds! It all makes sense now :)
After fixing my cache issue (an IQueryable<> at the top of a dataset that was supposed to cache the entire table.. >_>), my speeds have increased tenfold.
Thanks to everyone who assisted!
I've used capistrano for a long time, but always for sites that weren't critical. If something went wrong, a few minutes of downtime weren't a big problem.
Now I'm working on a more critical service and need to cover my edge cases. One of them is my local connection to a server being interrupted in the middle of a deployment.
One solution I can think of is to do deployments directly from the server, inside of a screen session. This seems like a reasonable and obvious solution, but I'm surprised I've never read about it elsewhere or even seen it recommended in the capistrano documentation.
Any pointers / tips are welcome. Thanks!
There is only a very small time window during a typical Capistrano deploy where a dropped connection would cause trouble. This window is when the current release is linked to the new version and the server is told to restart. If your connection drops before or after that, you're fine.
If you positively need to be safe from disconnects 100%, you can log onto the server, open a screen session and do a cap deploy from the latest release folder.