How to properly handle asynchronous database replication? - scalability

I'm considering using Amazon RDS with read replicas to scale our database.
Some of our controllers in our web application are read/write, some of them are read-only. We already have an automated way for identifying which controllers are read-only, so my first approach would have been to open a connection to the master when requesting a read/write controller, else open a connection to a read replica when requesting a read-only controller.
In theory, that sounds good. But then I stumbled upon the concept of replication lag, which basically says that a replica can be several seconds behind the master.
Let's imagine the following use case then:
The browser posts to /create-account, which is read/write, thus connecting to the master
The account is created, transaction committed, and the browser gets redirected to /member-area
The browser opens /member-area, which is read-only, thus connecting to a replica. If the replica is even slightly behind the master, the user account might not exist yet on the replica, thus resulting in an error.
How do you realistically use read replicas in your application, to avoid these potential issues?

I worked with an application that used pseudo-vertical partitioning. Since only a handful of the data was time-sensitive, the application usually read from the slaves and went to the master only in selected cases.
As an example: when a user updated their password, the application would always query the master when authenticating them. When changing non-time-sensitive data (like user preferences), it would display a success dialog along with a note that it might take a while until everything is updated.
Some other ideas which might or might not work depending on environment:
After an update, compute a checksum of the entity, store it in the application cache, and when fetching the data always check that the result matches the checksum
Use a browser store/cookie to hold the delta, ensuring the user always sees the latest version
Add an "up-to-date" flag and invalidate it synchronously on every slave node before/after an update
Whatever solution you choose, keep in mind that it is subject to the CAP theorem.
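The checksum idea from the list above could look roughly like the sketch below. It is only an illustration under stated assumptions: the master, the replica and the application cache are stood in for by plain Python dicts, and `ChecksumReader` and its methods are hypothetical names, not part of any real library.

```python
import hashlib
import json


def checksum(entity: dict) -> str:
    """Stable checksum of an entity (keys are sorted so ordering does not matter)."""
    payload = json.dumps(entity, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


class ChecksumReader:
    """Reads from the replica unless it disagrees with the last known checksum."""

    def __init__(self, master: dict, replica: dict):
        self.master = master      # hypothetical stand-in for the master database
        self.replica = replica    # hypothetical stand-in for a lagging replica
        self.expected = {}        # application cache: key -> checksum of last write

    def write(self, key, entity):
        # Writes always go to the master; remember what the data should look like.
        self.master[key] = dict(entity)
        self.expected[key] = checksum(entity)

    def read(self, key):
        """Return (entity, source), where source is 'replica' or 'master'."""
        candidate = self.replica.get(key)
        want = self.expected.get(key)
        if want is None:
            # No write recorded in the cache, so the replica is trusted as-is.
            return candidate, "replica"
        if candidate is not None and checksum(candidate) == want:
            return candidate, "replica"
        # Replica is missing the row or is stale: fall back to the master.
        return self.master.get(key), "master"
```

In this sketch a read only hits the master while the replica still lags behind the last write the application knows about; once replication catches up, the checksum matches and reads return to the replica.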

This is a hard problem, and there are lots of potential solutions. One potential solution is to look at what Facebook did:
TL;DR: read requests get routed to the read-only copy, but if you do a write, then for the next 20 seconds all your reads go to the writable master.
The other main problem we had to address was that only our master
databases in California could accept write operations. This fact meant
we needed to avoid serving pages that did database writes from
Virginia because each one would have to cross the country to our
master databases in California. Fortunately, our most frequently
accessed pages (home page, profiles, photo pages) don't do any writes
under normal operation. The problem thus boiled down to, when a user
makes a request for a page, how do we decide if it is "safe" to send
to Virginia or if it must be routed to California?
This question turned out to have a relatively straightforward answer.
One of the first servers a user request to Facebook hits is called a
load balancer; this machine's primary responsibility is picking a web
server to handle the request but it also serves a number of other
purposes: protecting against denial of service attacks and
multiplexing user connections to name a few. This load balancer has
the capability to run in Layer 7 mode where it can examine the URI a
user is requesting and make routing decisions based on that
information. This feature meant it was easy to tell the load balancer
about our "safe" pages and it could decide whether to send the request
to Virginia or California based on the page name and the user's
location.
There is another wrinkle to this problem, however. Let's say you go to
editprofile.php to change your hometown. This page isn't marked as
safe so it gets routed to California and you make the change. Then you
go to view your profile and, since it is a safe page, we send you to
Virginia. Because of the replication lag we mentioned earlier,
however, you might not see the change you just made! This experience
is very confusing for a user and also leads to double posting. We got
around this concern by setting a cookie in your browser with the
current time whenever you write something to our databases. The load
balancer also looks for that cookie and, if it notices that you wrote
something within 20 seconds, will unconditionally send you to
California. Then when 20 seconds have passed and we're certain the
data has replicated to Virginia, we'll allow you to go back for safe
pages.
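The routing rule described in the quote can be sketched in a few lines. This is a simplified illustration, not Facebook's actual implementation: the function name, the `"master"`/`"replica"` labels and the idea of passing the cookie value in as a timestamp are assumptions made for the example; only the 20-second window comes from the quoted text.

```python
import time
from typing import Optional

PIN_SECONDS = 20  # window taken from the quoted description


def choose_backend(is_safe_page: bool,
                   last_write_ts: Optional[float],
                   now: Optional[float] = None) -> str:
    """Decide where a request should go.

    is_safe_page  -- True if the page performs no writes
    last_write_ts -- timestamp stored in the user's cookie at their last write,
                     or None if the cookie is absent
    """
    if now is None:
        now = time.time()
    if not is_safe_page:
        return "master"  # pages that write must reach the master
    if last_write_ts is not None and now - last_write_ts < PIN_SECONDS:
        return "master"  # recent writer: avoid reading stale data
    return "replica"
```

The same check works equally well in application code when no layer-7 load balancer is available: set the timestamp cookie in the response to every write request and consult it before picking a connection.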

Related

Aren't PWAs user unfriendly if the service worker is not immediately active?

I posted another question as a brute-force solution to this one (Angular: fully install service worker before anything else) but I thought I'd make a separate one to discuss the use case for when a service worker is used as intended.
According to the service worker life cycle (https://developers.google.com/web/fundamentals/primers/service-workers/lifecycle), the SW is installed but it's only active once you then reload the page (you can claim() the page but that's only for calls that happen after the service worker is installed). The reasoning is that if an existing version is updated, the old one and the new one do not mix states and caches. I can agree with that decision.
What I have trouble understanding is why it is not immediately active once it is initially installed. Instead, it requires a page reload unless you explicitly define precaching rules in the SW. If you define caching rules with wildcards, it's not possible to precache those so you need the reload.
Given a single page PWA (like Angular), a user will discover the site and browse around on it but the page will never be reloaded during that session. If they then want to use the site offline later, they need to have refreshed or re-opened the tab at least one other time. That seems like a pretty big pitfall to me.
Am I missing something here?
Your understanding of the service worker lifecycle is correct but I do not think the pitfall you mentioned is as severe as you think it is.
If I understand you correctly, the user experience will only be negatively affected if the user loses connectivity during the initial browsing of the page (before the service worker is active) and is missing an offline asset. If this is truly a scenario you want to account for then that offline asset can be pre-cached in the browser-side javascript. Alternatively, as you mentioned, you can skipWaiting() and claim() to make the service worker active without the user refreshing the page.

Why is self.skipWaiting() and self.clients.claim() not default behaviour for service workers

I'm researching service workers for my thesis. I understand how the lifecycle works, but I'm having trouble understanding the default update behaviour of service workers.
When installing a new service worker, while an old one is installed, the service worker will have to wait to activate. With self.skipWaiting() and self.clients.claim() it is possible to fully activate the service worker and control the pages. I don't get why this is not default behaviour. The main reason I can find is to preserve code and data consistency (https://redfin.engineering/service-workers-break-the-browsers-refresh-button-by-default-here-s-why-56f9417694). With some basic understanding of the lifecycle, shouldn't it be possible to preserve both code and data consistency when a service worker updates or am I missing something? Are there any additional reasons?
Also has this behaviour been different in the past? Have skipWaiting() and clients.claim() been added afterwards?
The default - as it is now - is safer in general and doesn't force everyone to come up with all sorts of solutions.
User loads page with main1.js, SWv1 registers 1 second later, site now fully cached
User loads the page again - this time from cache by SWv1, super fast. New SWv2 registers 1 second later, caches new assets (main1.js is now main2.js), takes control via skipWaiting and clientsClaim
Two things can happen now:
Page has loaded with main1.js and the browser has executed whatever that script said. User has interacted with the page etc. Page is running main1.js which expects to be talking to SWv1 but actually the SW in control is SWv2. The script, main1.js, could be sending messages and trying to interact with the SW in a way that only SWv1 understood but v2 doesn't have any idea about. Now the page breaks because of the mismatch.
SWv1 cached all assets that site v1 needed. Thus if main1.js was to lazyload something etc. when user interacted with the page, browser would get that from the cache. As SWv2 has taken control and cached its idea of the assets (these are now newer assets), when main1.js tries to lazyload something originally cached by SWv1 it's not found. Also, because this is now a new deployment, the asset is not on the HTTP server anymore. It would have been in caches handled by SWv1 but SWv2 doesn't know about it. SWv2 knows about a newer version of that file. Page breaks.
It is important to understand that this might not be the case for every site/SW combination. If you have very little logic in the SW script and main.js doesn't communicate with sw.js too much, it is possible to build a combination where skipWaiting and clientsClaim don't cause any problems. You can also code in such a way that if an error happens, you'll show the user a notification to refresh.

Accounting for users that have left website without using onunload

I have a webservice with very limited resources (I will be able to handle about 3 simultaneous users).
When users interact with my website they start a complex process server-side. (This process is the limiting factor, as my server machine will not be able to handle many in parallel, and clients cannot run this on their side.)
My question is how to make sure to end the process for users that leave, for example by closing the window.
I have considered onunload and onbeforeunload, but they are also triggered by links within the website (which I need for users to be able to interact with the process) so that does not seem like an option.
This approach seems problematic according to other questions (see this, for example), but it could work if there were a way to check if the user is still an active user when performing the action triggered by onunload (even if in a different page of the website), but I don't know how to do this.
I have also considered periodically checking the list of active users and cancelling the process for users that have left, but I don't know if this is even possible.
I have zero experience with cookies, but could this be a place to use them? Can the server access the (still living) cookies of disconnected users?
Which sounds like a reasonable approach for this problem?
Cases such as these are generally handled by heartbeats. Have your client send periodic heartbeats (which are essentially pings) to the server notifying that it is still alive and interested in the process's results. And the server automatically kills those processes for which it hasn't received client heartbeat for a configured amount of time.
I have considered onunload and onbeforeunload
You are right: you can't rely on them.
I have zero experience with cookies, but could this be a place to use them?
No. Cookies maintain client-side state that is sent to a server on HTTP calls. So, servers don't manage cookies. Instead, they only look at them to identify state.
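The heartbeat approach described above might be sketched on the server side as follows. This is a minimal illustration: `HeartbeatMonitor`, `beat` and `reap` are hypothetical names, and actually cancelling the server-side process for each expired client is left to the caller.

```python
import time


class HeartbeatMonitor:
    """Tracks the last heartbeat per client and reports clients that went silent."""

    def __init__(self, timeout_seconds: float, clock=time.monotonic):
        self.timeout = timeout_seconds
        self.clock = clock       # injectable so the logic can be tested without waiting
        self.last_seen = {}      # client_id -> time of last heartbeat

    def beat(self, client_id):
        """Call this from the endpoint that the client pings periodically."""
        self.last_seen[client_id] = self.clock()

    def reap(self):
        """Remove and return the clients whose last heartbeat is older than the timeout."""
        now = self.clock()
        expired = [c for c, seen in self.last_seen.items()
                   if now - seen > self.timeout]
        for c in expired:
            del self.last_seen[c]
        return expired
```

A background task would call `reap()` every few seconds and kill the process belonging to each returned client. The timeout should be a few multiples of the client's ping interval so that a single delayed request does not end a live session.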

Zero downtime/blue-green deployment of Single Page Application (SPA)

Yesterday together with the team we were discussing the possibility of using zero downtime deployments to support our single page application.
While discussing it we identified one edge case for it.
After user loads the page in his browser it cannot be removed from memory until he reloads the page. It means that if user loads the page and starts working with the website (for example starts typing a long article like I am doing now) then he cannot receive an updated version of it until he reloads the page.
We could ignore the fact that the user sees an old application version in his browser, but there are two problems, listed below.
If we introduce a breaking change to the HTTP API that serves the SPA, the user will not be able to save his article (data loss!) or may receive some other error when performing another backend-related action.
When the user navigates to a new page without reloading the SPA, he can receive a template of the next page, or of some control, that is incompatible with the old outer container. It can lead to broken markup or application logic.
We cannot force the user to log in again, as he can be in the middle of typing his article, and it is just bad UX.
Taking all these points into account, one could propose the following solution:
User 1 loads v1 of the SPA into his browser.
Along with the auth token, version information is sent to the browser (using JWT, for example).
We want to deploy v2 version of our application. We spin up the v2 version but do not disable v1.
User 2 loads v2 of SPA into his browser
User 1 goes to the next page in SPA. Load balancer checks the version information in his token and routes the traffic of the user 1 to v1 server.
User 2 gets routed in the same way to v2.
User 1 logs out of the app and closes the browser.
User 1 logs back in; this time he receives v2.
After v1 application does not receive any traffic for a long time it gets disposed.
With this approach, however, it is possible to have multiple versions alive, more than two (for example, if a user stays online for a whole day or two). It means that we will not be able to migrate the database to the new schema until the last user has logged out (imagine how that could work for sites like Facebook). Having multiple versions is not a problem in itself, however; tools such as Docker and Rancher allow us to do it easily.
Also, in step 7 the user needs to reload the page or close the browser; otherwise he will still be working with v1 and we cannot force him onto the next version.
The question I have is what approach do you use to do zero downtime/blue-green deployment of single page applications?
How do you manage the lifetime of "blue" version of your application when you are switching traffic to "green" version, especially in respect to existing "blue" client applications.
Did you solve these issues, do you know any other solution?
I've been struggling with this problem for quite some time and tried several approaches and one specifically worked really well:
Use hashed names when bundling the SPA (including images, et al)
Use a static asset bucket (e.g.: AWS S3) and upload all assets to it before the deployment process kicks in
Enforce internal guidelines to minimize breaking changes to API contracts (e.g. fields from an endpoint should only be removed after X releases)
Deploy with usual blue/green strategy
Rationale
Using a bucket with hashed bundles ensures that if a customer gets the old version of the SPA, all of its assets will be available before/during/after any deployment process.
Enforcing internal guidelines to not break API compatibility is sometimes tricky but it comes from the very same principles applied to any public API. Embracing/adapting an API deprecation policy from big players helps when communicating with the team with a concrete example.
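To make the hashed-names point concrete, here is a small sketch of how a content hash could be folded into an asset's file name. Real bundlers have their own naming schemes; the `hashed_name` helper, the use of SHA-256 and the 8-character digest length are assumptions chosen for illustration.

```python
import hashlib


def hashed_name(filename: str, content: bytes, digest_len: int = 8) -> str:
    """Return a file name that embeds a short hash of the file's content."""
    digest = hashlib.sha256(content).hexdigest()[:digest_len]
    stem, dot, ext = filename.rpartition(".")
    if not dot:
        # No extension: simply append the digest.
        return f"{filename}.{digest}"
    return f"{stem}.{digest}.{ext}"
```

Because the name changes whenever the content changes, uploading a new release to the bucket never overwrites the files an old SPA version still references, so old and new clients can fetch their own assets side by side.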
One approach you might consider is gradual reloading of the SPA in such a moment, when it is not burdensome (or even unnoticeable) for end user.
Suggested approach:
Colored versions of the system (components providing back-end services, API and front-end) "know" (runtimes are provided with) their "color". Component providing users with front-end application embeds this color information into the SPA. This is then sent (via cookie or custom HTTP header) with every request SPA is making to the backend.
Component that routes API calls (API gateway, load balancer, nginx, HAproxy, custom Zuul-based router etc) is aware of this color information and uses it to direct traffic to infrastructure of proper color.
Additionally there is a public URL (not provided by "colored" infrastructure, for example an S3 file served via CloudFront or another proxy) that holds the color of the latest version. The SPA checks this version at a fixed interval (60 or 120 seconds). If the version does not match the one the SPA was given when it loaded, then on the next major route change the page is reloaded "physically", instead of performing the navigation in the browser only.
You can choose which route changes are verifying this version in such a way that it is least obtrusive to the user (possibly almost unnoticeable).
If you choose some of the routes that are used every day by all users then pretty soon all users will migrate to the latest color. Those who have unused opened browser window for long periods of time (computer hibernated for two weeks?) can be handled by forcing reload after certain period of inactivity.
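The two moving parts of this approach, color-based routing and the deferred reload, can be modelled as below. The sketch is in Python purely to show the decision logic (in practice the client part would live in the SPA and the routing part in the gateway configuration); `SpaClient`, `pick_upstream` and the example URLs are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SpaClient:
    loaded_color: str        # color embedded into the SPA when it was loaded
    latest_color: str = ""   # last value fetched from the public version URL

    def on_version_poll(self, published_color: str) -> None:
        """Record the result of the periodic version check."""
        self.latest_color = published_color

    def should_hard_reload(self, route: str, reload_routes: set) -> bool:
        """Reload only when a newer color is known and the route is a chosen one."""
        stale = bool(self.latest_color) and self.latest_color != self.loaded_color
        return stale and route in reload_routes


def pick_upstream(request_color: str, upstreams: dict, default_color: str) -> str:
    """Gateway side: send the request to the infrastructure of the matching color."""
    return upstreams.get(request_color, upstreams[default_color])
```

For example, a client loaded from "blue" keeps being routed to the blue upstream and starts returning `True` from `should_hard_reload` for the selected routes only after a poll has reported "green".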
I hope I managed to make myself sound at least a bit cohesive :-)
Regards,
Wojtek
Not sure why you would go for a complete overhaul of your UI, since there is always a learning curve involved. Practically, in the real world it would be a bad idea to switch over to a new UI immediately. You would allow customers to switch over to the new interface over a period of time and then disable the older version after a forewarning. It is not worth the effort of having such a real-time switch. A/B testing could be a way to introduce customers to the new interface and then do an actual rollout.
The technique you're describing is called blue-green deployment. You start with your existing server (blue) and add your updated server (green). All new traffic from that point on is redirected to the green environment. The blue environment is only there for servicing existing HTTP connections and also for an optional "roll back" in case the green environment hits major problems. Eventually the "blue" environment can be retired when it has finished servicing all of its requests.
This technique requires that the two systems be somewhat similar. A database schema change, for instance, may make it impractical.

Best practice for assigning A/B test variation based on IP address

I am starting to write some code for A/B testing in a Grails web application. I want to ensure that requests from the same IP address always see the same variation. Rather than store a map of IP->variant, is it OK to simply turn the IP address into an integer by removing the dots, then use that as the seed for a random number generator? The following is taking place in a Grails Filter:
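def ip = request.remoteAddr
// toInteger() throws for addresses such as 192.168.100.200 (the digits exceed the int range) and for IPv6,
// so seed the generator with the string's hash code instead
def random = new Random(ip.hashCode())
def value = random.nextBoolean()
session.assignment = value
// value is always the same for a given IP address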
I know that identifying users by IP address is not reliable, and I will be using session variables/cookies as well, but this seems to be useful for the case where we have a new session, and no cookies set (or the user has cookies disabled).
You could simply take the 32-bit number and do ip mod number_of_test_scenarios. Or use a standard hashing function provided in ruby. But I feel I should point out a few problems with this approach:
If your app is behind any proxies, the ip will be the same for all the users of that proxy.
Some users will change IPs fairly frequently, more frequently than you think. Maybe (as Joel Spolsky says) "The internet is broken for those users", but I'd say it's a disservice to your customers if you make the internet MORE broken for them, especially in a subtle way, given that they are probably not in a position to do anything about it.
For users who have a new session, you can just assign the cookie on the first request and keep the assignments in memory; unless a user's initial requests go to multiple servers at the same time this should resolve that problem (it's what I do on the app I maintain).
For users with cookies disabled, I'd say "The Internet is broken", and I wouldn't go to much trouble to support that case; they'd get assigned to a default test bucket and all go there. If you plan to support many such users in a non-broken way you're creating work for yourself, but maybe that's ok. In this case you may want to consider using URL-rewriting and 302 redirects to send these users down one scenario or another. However in my opinion this isn't worth the time.
If your users can log into the site make sure you record the scenario assignments in your database and reconcile the cookie/db discrepancies accordingly.
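The "hash, then mod by the number of scenarios" suggestion at the start of this answer can be sketched as follows. It is shown in Python rather than Groovy for brevity; `assign_variant` and the `salt` parameter are assumptions added for the example (the salt lets different experiments split the same IPs differently).

```python
import hashlib


def assign_variant(ip: str, variants: list, salt: str = "experiment-1") -> str:
    """Deterministically map an IP address (IPv4 or IPv6 text) to a variant."""
    digest = hashlib.sha256(f"{salt}:{ip}".encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an integer and take it modulo
    # the number of variants.
    bucket = int.from_bytes(digest[:8], "big") % len(variants)
    return variants[bucket]
```

The same IP always lands in the same variant without storing an IP-to-variant map, but all the caveats listed above about proxies and changing IPs still apply, so this is best treated as a fallback behind the cookie or the stored per-user assignment.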
