Zero downtime/blue-green deployment of Single Page Application (SPA)

Yesterday my team and I discussed the possibility of using zero-downtime deployments for our single page application.
During the discussion we identified one edge case.
After the user loads the page in his browser, it stays in memory until he reloads the page. This means that if the user loads the page and starts working with the website (for example, starts typing a long article like I am doing now), he cannot receive an updated version of it until he reloads the page.
We could ignore the fact that the user sees an old version of the application in his browser, but there are two problems:
- If we introduce a breaking change to the HTTP API that serves the SPA, the user will not be able to save his article (data loss!) or may receive some other error when performing another backend-related action.
- When the user navigates to a new page without reloading the SPA, he can receive a template of the next page, or of some control, that is incompatible with the old outer container. This can lead to broken markup or broken application logic.
We cannot force the user to log in again, as he may be in the middle of typing his article, and it is simply bad UX.
Taking all these points into account, one could propose the following solution:
1. User 1 loads v1 of the SPA into his browser.
2. Along with the auth token, version information is sent to the browser (using JWT, for example).
3. We want to deploy v2 of our application. We spin up v2 but do not shut down v1.
4. User 2 loads v2 of the SPA into his browser.
5. User 1 goes to the next page in the SPA. The load balancer checks the version information in his token and routes user 1's traffic to the v1 server.
6. User 2 gets routed to v2 in the same way.
7. User 1 logs out of the app and closes the browser.
8. User 1 logs back in; this time he receives v2.
9. Once the v1 application has not received any traffic for a long time, it gets disposed of.
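The routing steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `VersionRouter` class, its method names, and the upstream addresses are hypothetical, and in a real setup this logic would live in the load balancer (reading the version claim from the token), not in application code.

```python
from dataclasses import dataclass, field


@dataclass
class VersionRouter:
    """Hypothetical version-aware router; not a real load balancer API."""

    latest: str = ""
    upstreams: dict = field(default_factory=dict)   # version -> upstream address
    last_seen: dict = field(default_factory=dict)   # version -> time of last request

    def deploy(self, version: str, upstream: str, now: float) -> None:
        # Start the new version next to the old ones; nothing is shut down here.
        self.upstreams[version] = upstream
        self.last_seen[version] = now
        self.latest = version

    def version_for_login(self) -> str:
        # The version claim written into the token at login time.
        return self.latest

    def route(self, token_version: str, now: float) -> str:
        # Send the request to the backend matching the version in the token.
        if token_version not in self.upstreams:
            raise LookupError(f"{token_version} is retired; client must reload")
        self.last_seen[token_version] = now
        return self.upstreams[token_version]

    def retire_idle(self, max_idle: float, now: float) -> list:
        # Dispose of old versions that saw no traffic for max_idle seconds.
        idle = [v for v, seen in self.last_seen.items()
                if v != self.latest and now - seen >= max_idle]
        for version in idle:
            del self.upstreams[version]
            del self.last_seen[version]
        return idle


router = VersionRouter()
router.deploy("v1", "http://app-v1:8080", now=0)
user1 = router.version_for_login()           # user 1 logs in and is pinned to v1
router.deploy("v2", "http://app-v2:8080", now=100)
user2 = router.version_for_login()           # user 2 logs in and is pinned to v2
print(router.route(user1, now=110))          # user 1 is still served by v1
print(router.route(user2, now=120))          # user 2 is served by v2
print(router.retire_idle(max_idle=3600, now=4000))
```

The sketch also makes the drawback visible: an old version is only retired once its last pinned user stops sending requests.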
With this approach, however, more than two versions can be alive at once (for example, if a user stays online for a whole day or two). This means that we will not be able to migrate the database to the new schema until the last user has logged out (imagine how that would work for sites like Facebook). Running multiple versions is not a problem in itself, however: tools such as Docker and Rancher let us do it easily.
Also, in step 7 the user needs to reload the page or close the browser; otherwise he will still be working with v1, and we cannot force him onto the next version.
The question I have is: what approach do you use for zero-downtime/blue-green deployment of single page applications?
How do you manage the lifetime of the "blue" version of your application when you are switching traffic to the "green" version, especially with respect to existing "blue" client applications?
Did you solve these issues? Do you know of any other solution?

I've been struggling with this problem for quite some time and tried several approaches and one specifically worked really well:
Use hashed names when bundling the SPA (including images, et al)
Use a static asset bucket (e.g.: AWS S3) and upload all assets to it before the deployment process kicks in
Enforce internal guidelines that minimize breaking changes to API contracts (e.g., fields should only be removed from an endpoint after X releases)
Deploy with usual blue/green strategy
Rationale
Using a bucket with hashed bundles ensures that if a customer gets the old version of the SPA, all of its assets will be available before/during/after any deployment process.
Enforcing internal guidelines to not break API compatibility is sometimes tricky but it comes from the very same principles applied to any public API. Embracing/adapting an API deprecation policy from big players helps when communicating with the team with a concrete example.
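To illustrate why hashed names keep old clients working, here is a minimal sketch. The `hashed_name` and `upload` helpers and the `name.<hash>.ext` scheme are assumptions made for illustration (bundlers produce such names themselves), and a plain dict stands in for the static asset bucket.

```python
import hashlib
from pathlib import PurePosixPath


def hashed_name(filename: str, content: bytes, digest_chars: int = 8) -> str:
    """Turn 'main.js' into 'main.<hash>.js', where <hash> depends on the content."""
    digest = hashlib.sha256(content).hexdigest()[:digest_chars]
    path = PurePosixPath(filename)
    return str(path.with_name(f"{path.stem}.{digest}{path.suffix}"))


# A dict stands in for the asset bucket in this sketch.
bucket = {}


def upload(filename: str, content: bytes) -> str:
    key = hashed_name(filename, content)
    bucket[key] = content   # existing keys from earlier releases are left alone
    return key


old_key = upload("main.js", b"console.log('v1')")
new_key = upload("main.js", b"console.log('v2')")

# Both releases are available side by side, so a browser still running
# the old SPA can keep fetching the assets it references.
print(sorted(bucket))
```

Because the name changes whenever the content changes, uploading a new release never overwrites a file that an older client still expects.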

One approach you might consider is gradual reloading of the SPA in such a moment, when it is not burdensome (or even unnoticeable) for end user.
Suggested approach:
Colored versions of the system (the components providing back-end services, the API and the front-end) "know" their "color" (their runtimes are provided with it). The component that serves the front-end application embeds this color information into the SPA. The SPA then sends it (via a cookie or a custom HTTP header) with every request it makes to the backend.
The component that routes API calls (API gateway, load balancer, nginx, HAProxy, custom Zuul-based router, etc.) is aware of this color information and uses it to direct traffic to the infrastructure of the proper color.
Additionally, there is a public URL (not served by the "colored" infrastructure; for example, an S3 file served via CloudFront or another proxy) that holds the color of the latest version. The SPA checks this version at a fixed interval (every 60 or 120 seconds, say). If the version does not match the one the SPA was given when it loaded, then on the next major route change the page is reloaded "physically", instead of the navigation being handled in the browser only.
You can choose which route changes verify this version, so that the reload is as unobtrusive to the user as possible (possibly almost unnoticeable).
If you choose routes that all users hit every day, then pretty soon all users will migrate to the latest color. Those who leave a browser window open and unused for long periods (computer hibernated for two weeks?) can be handled by forcing a reload after a certain period of inactivity.
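The reload decision described above could be sketched roughly as follows. The `ReloadPolicy` class, the route names and the one-hour inactivity limit are all hypothetical, and in a real SPA this logic would run in the browser; Python is used here only to show the decision rules.

```python
from dataclasses import dataclass


@dataclass
class ReloadPolicy:
    loaded_color: str                  # color embedded in the SPA when it loaded
    major_routes: frozenset            # routes where a full reload is acceptable
    max_idle_seconds: float = 3600.0   # assumed inactivity limit
    latest_color: str = ""             # last value read from the public URL

    def on_poll(self, published_color: str) -> None:
        # Called every 60 or 120 seconds with the content of the public URL.
        self.latest_color = published_color

    def is_stale(self) -> bool:
        return bool(self.latest_color) and self.latest_color != self.loaded_color

    def reload_on_navigation(self, route: str) -> bool:
        # Full page reload instead of in-browser navigation, on chosen routes only.
        return self.is_stale() and route in self.major_routes

    def reload_when_idle(self, idle_seconds: float) -> bool:
        # Catches windows left open and unused for a long time.
        return self.is_stale() and idle_seconds >= self.max_idle_seconds


policy = ReloadPolicy("blue", frozenset({"/dashboard"}))
policy.on_poll("green")                           # a new color has been published
print(policy.reload_on_navigation("/editor"))     # not a major route: keep working
print(policy.reload_on_navigation("/dashboard"))  # major route: reload the page
```

Keeping the set of major routes small protects screens where a reload would lose user input, such as an article editor.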
I hope I managed to make myself sound at least a bit coherent :-)
Regards,
Wojtek

Not sure why you would go for a complete overhaul of your UI, since there is always a learning curve involved. In practice, it would be a bad idea to switch over to a new UI immediately. You would let customers switch to the new interface over a period of time and then disable the older version after advance warning. A real-time switch like this is not worth the effort. A/B testing could be a way to introduce customers to the new interface before doing the actual rollout.

The technique you're describing is called blue-green deployment. You start with your existing server (blue) and add your updated server (green). All new traffic from that point on is directed to the green environment. The blue environment is only there to service existing HTTP connections and as an optional rollback in case the green environment hits major problems. Eventually the blue environment can be retired, once it has finished servicing all of its requests.
This technique requires that the two systems be somewhat similar. A database schema change, for instance, may make it impractical.

Related

Aren't PWAs user unfriendly if the service worker is not immediately active?

I posted another question as a brute-force solution to this one (Angular: fully install service worker before anything else) but I thought I'd make a separate one to discuss the use case for when a service worker is used as intended.
According to the service worker life cycle (https://developers.google.com/web/fundamentals/primers/service-workers/lifecycle), the SW is installed but only becomes active once you reload the page (you can claim() the page, but that only covers calls that happen after the service worker is installed). The reasoning is that if an existing version is updated, the old one and the new one do not mix states and caches. I can agree with that decision.
What I have trouble understanding is why it is not immediately active once it is initially installed. Instead, it requires a page reload unless you explicitly define precaching rules in the SW. If you define caching rules with wildcards, it's not possible to precache those so you need the reload.
Given a single page PWA (like Angular), a user will discover the site and browse around on it, but the page will never be reloaded during that session. If they then want to use the site offline later, they need to have refreshed or re-opened the tab at least one other time. That seems like a pretty big pitfall to me.
Am I missing something here?

Your understanding of the service worker lifecycle is correct but I do not think the pitfall you mentioned is as severe as you think it is.
If I understand you correctly, the user experience will only be negatively affected if the user loses connectivity during the initial browsing of the page (before the service worker is active) and is missing an offline asset. If this is truly a scenario you want to account for, then that offline asset can be pre-cached in the browser-side JavaScript. Alternatively, as you mentioned, you can call skipWaiting() and claim() to make the service worker active without the user refreshing the page.

Why is self.skipWaiting() and self.clients.claim() not default behaviour for service workers

I'm researching service workers for my thesis. I understand how the lifecycle works, but I'm having trouble understanding the default update behaviour of service workers.
When installing a new service worker, while an old one is installed, the service worker will have to wait to activate. With self.skipWaiting() and self.clients.claim() it is possible to fully activate the service worker and control the pages. I don't get why this is not default behaviour. The main reason I can find is to preserve code and data consistency (https://redfin.engineering/service-workers-break-the-browsers-refresh-button-by-default-here-s-why-56f9417694). With some basic understanding of the lifecycle, shouldn't it be possible to preserve both code and data consistency when a service worker updates or am I missing something? Are there any additional reasons?
Also has this behaviour been different in the past? Have skipWaiting() and clients.claim() been added afterwards?

The default - as it is now - is safer in general and doesn't force everyone to come up with all sorts of solutions.
User loads page with main1.js, SWv1 registers 1 second later, site now fully cached
User loads the page again - this time from cache by SWv1, super fast. New SWv2 registers 1 second later, caches new assets (main1.js is now main2.js), takes control via skipWaiting and clientsClaim
Two things can happen now:
Page has loaded with main1.js and the browser has executed whatever that script said. User has interacted with the page etc. Page is running main1.js which expects to be talking to SWv1 but actually the SW in control is SWv2. The script, main1.js, could be sending messages and trying to interact with the SW in a way that only SWv1 understood but v2 doesn't have any idea about. Now the page breaks because of the mismatch.
SWv1 cached all assets that site v1 needed. Thus if main1.js was to lazyload something etc. when user interacted with the page, browser would get that from the cache. As SWv2 has taken control and cached its idea of the assets (these are now newer assets), when main1.js tries to lazyload something originally cached by SWv1 it's not found. Also, because this is now a new deployment, the asset is not on the HTTP server anymore. It would have been in caches handled by SWv1 but SWv2 doesn't know about it. SWv2 knows about a newer version of that file. Page breaks.
It is important to understand that this might not be the case for every site/SW combination. If you have very little logic in the SW script and main.js doesn't communicate with sw.js too much, it is possible to build a combination where skipWaiting and clientsClaim don't cause any problems. You can also code in such a way that if an error happens, you show the user a notification to refresh.
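The second failure mode above (the lazy-loaded asset that disappears) can be shown with a toy simulation. Everything here is a simplified assumption: dicts stand in for the service worker cache and the HTTP server, and the file names `lazy1.js` and `lazy2.js` are made up for the example.

```python
def old_page_can_lazy_load(skip_waiting: bool) -> bool:
    """Return True if a page still running main1.js can fetch its v1 chunk."""
    v1_assets = {"main1.js": "v1 code", "lazy1.js": "v1 chunk"}
    v2_assets = {"main2.js": "v2 code", "lazy2.js": "v2 chunk"}

    cache = dict(v1_assets)    # SWv1 cached everything site v1 needs
    server = dict(v2_assets)   # after the deployment only v2 files are served

    if skip_waiting:
        # SWv2 takes control immediately and replaces the cache with
        # its own idea of the assets.
        cache = dict(v2_assets)

    # The open page still runs main1.js and lazy-loads a v1 chunk:
    # first from the cache, then from the HTTP server.
    url = "lazy1.js"
    return url in cache or url in server


print(old_page_can_lazy_load(skip_waiting=False))  # default: SWv1 stays in control
print(old_page_can_lazy_load(skip_waiting=True))   # SWv2 took over mid-session
```

With the default behaviour the old cache survives as long as the old page is open, which is why the lookup only fails in the skipWaiting case.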

Using django-channels with django-rest-framework in the creation of mobile application

I already have a project written in Django, and I am able to use Django REST framework with it without problems. The project is based on django-oscar, with some other features that I implemented. I am now working on the mobile version of this application and need real-time server updates (something like sockets), and I am aware of Django Channels. My question is: is it possible to link Django REST framework with Django Channels? For example, if a user makes a purchase in the mobile app, the number of available products should decrease in real time, or if a user adds a product to the cart, the user should immediately see the increased number of items reflected in a notification badge. I feel this can be achieved with Django Channels. So how can I connect the REST API to Django Channels?

URLRouter([
    url(r"^longpoll/$", LongPollConsumer),
    url(r"^notifications/(?P<stream>\w+)/$", LongPollConsumer),
    url(r"", AsgiHandler),
])
If a http argument is not provided, it will default to the Django view system’s ASGI interface, channels.http.AsgiHandler, which means that for most projects that aren’t doing custom long-poll HTTP handling, you can simply not specify a http option and leave it to work the “normal” Django way.
If you want to split HTTP handling between long-poll handlers and Django views, use a URLRouter with channels.http.AsgiHandler specified as the last entry with a match-everything pattern.
The content above is from https://channels.readthedocs.io/en/latest/topics/routing.html#protocoltyperouter

Google script origin request url

I'm developing a Google Sheets add-on. The add-on calls an API. In the API configuration, a url like https://longString-script.googleusercontent.com had to be added to the list of urls allowed to make requests from another domain.
Today, I noticed that this url changed to https://sameLongString-0lu-script.googleusercontent.com.
The url changed about 3 months after development start.
I'm wondering what makes the url change, because every change also means a configuration change in our back-end.
EDIT: Thanks for both your responses so far. Helped me understand better how this works but I still don't know if/when/how/why the url is going to change.
Quick update, the changing part of the url was "-1lu" for another user today (but not for me when I was testing). It's quite annoying since we can't use wildcards in the google dev console redirect uri field. Am I supposed to paste a lot of "-xlu" uris with x from 1 to like 10 so I don't have to touch this for a while?

For people coming across this now, we've also just encountered this issue while developing a Google Add-on. We've needed to add multiple origin urls to our oauth client for sign-in, following the longString-#lu-script.googleusercontent.com pattern mentioned by OP.
This is annoying as each url has to be entered separately in the authorized urls field (subdomain or wildcard matching isn't allowed). Also this is pretty fragile since it breaks if Google changes the urls they're hosting our add-on from. Furthermore I wasn't able to find any documentation from Google confirming that these are the script origins.
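Where the allow-list lives in your own backend (as with the API configuration in the question, unlike the OAuth console field, which takes no wildcards), one option is to match the origin against a pattern. The sketch below assumes the `longString-#lu-script.googleusercontent.com` shape observed in this thread; as noted above, nothing from Google confirms it, and `is_allowed_origin` and `script_id` are hypothetical names.

```python
import re


def is_allowed_origin(origin: str, script_id: str) -> bool:
    """Accept the script origin with or without the observed '-<digits>lu' part."""
    pattern = (
        rf"https://{re.escape(script_id)}(-\d+lu)?"
        r"-script\.googleusercontent\.com"
    )
    # fullmatch rejects anything appended before or after the expected origin.
    return re.fullmatch(pattern, origin) is not None


script_id = "longString"  # placeholder for the real per-script prefix
for origin in (
    "https://longString-script.googleusercontent.com",
    "https://longString-0lu-script.googleusercontent.com",
    "https://otherString-0lu-script.googleusercontent.com",
):
    print(origin, is_allowed_origin(origin, script_id))
```

Pinning the match to your own script prefix keeps other add-ons hosted on the same domain out of the allow-list.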

URLs are managed by the host in various ways. At the most basic level, when you build a web server you decide what to call it and what to call any pages on it. Google and other large content providers with farms of servers and redundant data centers and everything are going to manage it a bit differently, but for your purposes, it will be effectively the same in that ... you need to ask them since they are the hosting provider of your cloud content.
Something that MIGHT be related is that Google recently rolled out (or at least was scheduled to roll out) some changes dealing with the googleusercontent.com domain and Picasa images. So the Google support forums will be the way to go with this question for the freshest answers, since the cause of a URL change is usually going to be specific to that moment in time and not something that you necessarily need to worry about changing repeatedly. But again, they are going to need to confirm that it was something related to the recent planned changes... or not. :-)
When you find something out you can update this question in case it is of use to others. Especially, if they tell you that it wasn't a one time thing dealing with a change on their end.

This is more likely related to Changing origin in Same-origin Policy. As discussed:
A page may change its own origin with some limitations. A script can set the value of document.domain to its current domain or a superdomain of its current domain. If it sets it to a superdomain of its current domain, the shorter domain is used for subsequent origin checks.
For example, assume a script in the document at http://store.company.com/dir/other.html executes the following statement:
document.domain = "company.com";
After that statement executes, the page can pass the origin check with http://company.com/dir/page.html
So, as noted:
When using document.domain to allow a subdomain to access its parent securely, you need to set document.domain to the same value in both the parent domain and the subdomain. This is necessary even if doing so is simply setting the parent domain back to its original value. Failure to do this may result in permission errors.

How to properly handle asynchronous database replication?

I'm considering using Amazon RDS with read replicas to scale our database.
Some of our controllers in our web application are read/write, some of them are read-only. We already have an automated way for identifying which controllers are read-only, so my first approach would have been to open a connection to the master when requesting a read/write controller, else open a connection to a read replica when requesting a read-only controller.
In theory, that sounds good. But then I stumbled upon the concept of replication lag, which basically says that a replica can be several seconds behind the master.
Let's imagine the following use case then:
The browser posts to /create-account, which is read/write, thus connecting to the master
The account is created, transaction committed, and the browser gets redirected to /member-area
The browser opens /member-area, which is read-only, thus connecting to a replica. If the replica is even slightly behind the master, the user account might not exist yet on the replica, thus resulting in an error.
How do you realistically use read replicas in your application, to avoid these potential issues?

I worked on an application that used pseudo-vertical partitioning. Since only a handful of the data was time-sensitive, the application usually read from the slaves and read from the master only in selected cases.
For example, after a user updated their password, the application would always query the master when authenticating. When changing non-time-sensitive data (like user preferences), it would display a success dialog along with a note that it might take a while until everything is updated.
Some other ideas which might or might not work depending on environment:
After update compute entity checksum, store it in application cache and when fetching the data always ask for compliance with checksum
Use browser store/cookie for storing delta ensuring User always sees the latest version
Add "up-to-date" flag and invalidate synchronously on every slave node before/after update
Whatever solution you choose, keep in mind that it is subject to the CAP theorem.

This is a hard problem, and there are lots of potential solutions. One is to look at what Facebook did.
TLDR - read requests get routed to the read only copy, but if you do a write, then for the next 20 seconds, all your reads go to the writeable master.
The other main problem we had to address was that only our master
databases in California could accept write operations. This fact meant
we needed to avoid serving pages that did database writes from
Virginia because each one would have to cross the country to our
master databases in California. Fortunately, our most frequently
accessed pages (home page, profiles, photo pages) don't do any writes
under normal operation. The problem thus boiled down to, when a user
makes a request for a page, how do we decide if it is "safe" to send
to Virginia or if it must be routed to California?
This question turned out to have a relatively straightforward answer.
One of the first servers a user request to Facebook hits is called a
load balancer; this machine's primary responsibility is picking a web
server to handle the request but it also serves a number of other
purposes: protecting against denial of service attacks and
multiplexing user connections to name a few. This load balancer has
the capability to run in Layer 7 mode where it can examine the URI a
user is requesting and make routing decisions based on that
information. This feature meant it was easy to tell the load balancer
about our "safe" pages and it could decide whether to send the request
to Virginia or California based on the page name and the user's
location.
There is another wrinkle to this problem, however. Let's say you go to
editprofile.php to change your hometown. This page isn't marked as
safe so it gets routed to California and you make the change. Then you
go to view your profile and, since it is a safe page, we send you to
Virginia. Because of the replication lag we mentioned earlier,
however, you might not see the change you just made! This experience
is very confusing for a user and also leads to double posting. We got
around this concern by setting a cookie in your browser with the
current time whenever you write something to our databases. The load
balancer also looks for that cookie and, if it notices that you wrote
something within 20 seconds, will unconditionally send you to
California. Then when 20 seconds have passed and we're certain the
data has replicated to Virginia, we'll allow you to go back for safe
pages.
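The routing rule in the quote is small enough to sketch directly. This is not Facebook's code: the `choose_database` function and its parameters are assumptions for illustration, with `last_write_at` standing in for the timestamp stored in the cookie.

```python
from typing import Optional

WRITE_STICKINESS_SECONDS = 20  # the window mentioned in the quote


def choose_database(is_safe_page: bool,
                    last_write_at: Optional[float],
                    now: float) -> str:
    """Pick 'master' or 'replica' for one request."""
    if not is_safe_page:
        # Pages that write must always go to the master.
        return "master"
    if last_write_at is not None and now - last_write_at < WRITE_STICKINESS_SECONDS:
        # The user wrote recently, so the replica may not have the change yet.
        return "master"
    return "replica"


# The /create-account example from the question:
print(choose_database(is_safe_page=False, last_write_at=None, now=0.0))  # POST /create-account
print(choose_database(is_safe_page=True, last_write_at=0.0, now=1.0))    # /member-area right after
print(choose_database(is_safe_page=True, last_write_at=0.0, now=30.0))   # /member-area later
```

The window has to be longer than the worst replication lag you expect; otherwise the stale read from the question can still happen.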
