Why is a network-first strategy slower than no service worker? - service-worker

While benchmarking the performance of a service worker built with Workbox, we found an interesting phenomenon.
With the service worker applied, Workbox's network-first strategy is about 30 ms slower than plain networking with no service worker. We then skipped Workbox and implemented a network-first strategy manually; it is still about 20 ms slower.
My guess is that once a service worker kicks in, every request has to be handled by JavaScript code, and it is the execution of that JavaScript that makes networking slower.
I then checked the cache-first strategy, and it turns out that fetching content from Cache Storage is slower than fetching it from the disk cache (HTTP cache) without a service worker.
So, as I understand it, even though a service worker gives us more control over caching, it is not guaranteed to make caching faster, right?
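For reference, a hand-rolled network-first handler of the kind described above usually looks roughly like the following sketch (the cache name is made up; the asker's actual implementation is not shown):

    // sw.js -- minimal network-first sketch (cache name 'pages-v1' is illustrative)
    const CACHE_NAME = 'pages-v1';

    self.addEventListener('fetch', (event) => {
      if (event.request.method !== 'GET') return; // only GETs are cacheable here
      event.respondWith(
        fetch(event.request)
          .then((response) => {
            // Clone so one copy goes to the page and the other into the cache.
            const copy = response.clone();
            caches.open(CACHE_NAME).then((cache) => cache.put(event.request, copy));
            return response;
          })
          .catch(() => caches.match(event.request)) // fall back to cache when offline
      );
    });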

There is a cost associated with starting up a service worker that was not previously running. This could be on the order of tens of milliseconds, depending on the device. Once that service worker starts up, if it doesn't handle your navigation requests (which are almost certainly the first request that a service worker would receive) by going against the cache, then it's likely you'll end up with worse performance than if there were no service worker present at all.
If you are going against the cache, then having a service worker present should offer roughly the same performance as looking things up in the HTTP browser cache once it's actually running, but the same startup cost needs to be taken into account first.
The real performance benefits of using a service worker come from handling navigation requests for HTML in a cache-first manner, which is not something you could traditionally do with HTTP caching.
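As a rough illustration of that last point, a cache-first handler for navigation requests might look like the following sketch (the cache name is an assumption, and precaching the HTML at install time is not shown):

    // sw.js -- cache-first handling of navigation requests (sketch)
    const PAGE_CACHE = 'pages-v1';

    self.addEventListener('fetch', (event) => {
      if (event.request.mode !== 'navigate') return; // only handle HTML navigations
      event.respondWith(
        caches.match(event.request, { cacheName: PAGE_CACHE }).then((cached) => {
          // Serve the precached HTML immediately if present; otherwise hit the network.
          return cached || fetch(event.request);
        })
      );
    });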
You can read more about these tradeoffs and best practices in "High-performance service worker loading".

Related

Why is a service worker even slower than a normal network request in Chrome?

I used a service worker to precache a 15 KB page, but why can serving it be even slower than a normal network request? Sometimes the cached page served from the service worker takes ~200 ms and sometimes ~600 ms, which can be even slower than the network.
The service worker logic is pretty simple: it does a URL match and then uses fetchEvent.respondWith to return the response.
It seems the problem is not related to the Cache API. I tried caching the page as in-memory state in the service worker; even in cases where I can guarantee that state is not destroyed, a response of the same size served from the service worker can still take 150-600 ms.
I tested this in Chrome; Safari seems to perform comparatively better.

Delay on requests from Google API Gateway to Cloud Run

I'm currently seeing delays of 2-3 seconds on my first requests coming into our APIs.
We've set min instances to 1 to prevent cold starts, but the delay is still occurring.
If I check the metrics I don't see any startup latencies in the specified timeframe, so I have no insight into what is causing these delays. Tracing gives the following:
The only thing I can change is switching to "CPU is always allocated", but this isn't helping in any way.
Can somebody give more information on this?
As mentioned in the answer, and as per the documentation:
Idle instances: As traffic fluctuates, Cloud Run attempts to reduce the chance of cold starts by keeping some idle instances around to handle spikes in traffic. For example, when a container instance has finished handling requests, it might remain idle for a period of time in case another request needs to be handled.
But Cloud Run will terminate unused containers after some time if no requests need to be handled. This means a cold start can still occur. Container instances are scaled as needed, and each new instance initializes the execution environment completely. While you can keep idle instances permanently available using the min-instances setting, this incurs cost even when the service is not actively serving requests.
So, let's say you want to minimize both cost and response-time latency during a possible cold start. You don't want to set a minimum number of idle instances, but you also know that any additional computation needed at container startup, before it can start listening for requests, means longer load times and latency.
Cloud Run container startup: There are a few tricks you can use to optimize your service for container startup times. The goal here is to minimize the latency that delays a container instance from serving requests. But first, let's review the Cloud Run container startup routine when starting the service:
1. Starting the container
2. Running the entrypoint command to start your server
3. Checking for the open service port
You want to tune your service to minimize the time needed for these startup steps. Let's walk through three ways to optimize your service for Cloud Run response times:
1. Create a leaner service
2. Use a leaner base image
3. Use global variables
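The third point, "Use global variables", means doing expensive initialization once per container instance and reusing the result across requests instead of redoing it on every request. A rough Node.js/Express sketch (the framework choice and the simulated "heavy client" are assumptions, not taken from the question):

    // index.js -- reuse an expensive object across requests (sketch)
    const express = require('express');
    const app = express();

    let heavyClient; // lazily initialized once, then reused by every request on this instance

    async function getHeavyClient() {
      if (!heavyClient) {
        // Stand-in for expensive startup work (DB connections, loading config, ...).
        heavyClient = await new Promise((resolve) =>
          setTimeout(() => resolve({ query: async (q) => `result for ${q}` }), 500)
        );
      }
      return heavyClient;
    }

    app.get('/', async (req, res) => {
      const client = await getHeavyClient();
      res.send(await client.query('ping')); // only the first request pays the init cost
    });

    app.listen(process.env.PORT || 8080);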
As mentioned in the documentation:
Background activity is anything that happens after your HTTP response has been delivered. To determine whether there is background activity in your service that is not readily apparent, check your logs for anything that is logged after the entry for the HTTP request.
Avoid background activities if CPU is allocated only during request processing
If you need to set your service to allocate CPU only during request processing, then when the Cloud Run service finishes handling a request, the container instance's access to CPU will be disabled or severely limited. You should not start background threads or routines that run outside the scope of the request handlers if you use this type of CPU allocation. Review your code to make sure all asynchronous operations finish before you deliver your response.
Running background threads with this kind of CPU allocation can create unpredictable behavior, because any subsequent request to the same container instance resumes any suspended background activity.
As mentioned in the thread, the reason could be that all the operations you performed happened after the response was sent.
According to the docs, CPU is allocated only during request processing by default, so the only thing you have to change is to enable CPU allocation for background activities.
You can refer to the documentation for more information on the steps to optimize Cloud Run response times.
You can also have a look at the blog post about using Google API Gateway with Cloud Run.

Service workers slow speed when serving from cache

I have some resources that I want to be cached and served at top speed to my app.
When I used AppCache I got great serving speeds, but I was stuck with AppCache.
So I've replaced it with a service worker.
Then I tried the simplest strategy: cache the static assets on install and serve them from the cache whenever they are fetched.
It worked; when I checked Chrome's network panel I was happy to see my service worker in action, but the load times were horrible: each resource's load time doubled.
So I started thinking about other strategies; here you can find plenty of them. The cache-and-network race sounded interesting, but I was deterred by the data usage.
So I tried something different: aggressively caching the resources in the service worker's memory. Whenever my service worker is up and running it pulls the relevant resources from the cache and saves the response objects in memory for later use. When it gets a matching fetch it just responds with a clone of the in-memory response.
This strategy proved to be the fastest; here's a comparison I made:
So my question is pretty vague, as my understanding of service workers is still pretty vague...
Does this all make sense? Can I keep a cache of static resources in memory?
What about the bloated memory usage: are there any negative implications? For instance, maybe the browser shuts down service workers with high memory consumption more frequently.
You can't rely on keeping Response objects in memory inside of a service worker and then responding directly with them, for (at least) two reasons:
Service workers have a short lifetime, and everything in the global scope of the service worker is cleared each time the service worker starts up again.
You can only read the body of a Response object once. Responding to a fetch request with a Response object will cause its body to be read. So if you have two requests for the same URL that are both made before the service worker's global scope is cleared, using the Response for the second time will fail. (You can work around this by calling clone() on the Response and using the clone to respond to the fetch event, but then you're introducing additional overhead.)
If you're seeing a significant slowdown in getting your responses from the service worker back to your page, I'd take some time to dig into what your service worker code actually looks like, and also what the code on your client pages looks like. If your client pages have their main JavaScript thread locked up (due to heavyweight JavaScript operations that take a while to complete and never yield, for instance), that could introduce a delay in getting the response from the service worker to the client page.
Sharing some more details about how you've implemented your cache-based service worker would be a good first step.
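For what it's worth, the clone() workaround mentioned above, combined with rebuilding the in-memory map on each worker startup, looks roughly like this (a sketch; the cache name 'static-v1' is made up):

    // sw.js -- in-memory Response map, rebuilt each time the worker script is evaluated (sketch)
    const memoryCache = new Map(); // URL -> Response; wiped whenever the worker is stopped

    // Top-level code runs on every cold start of the worker, so the map is repopulated
    // from Cache Storage after each restart.
    const warmup = caches.open('static-v1').then((cache) =>
      cache.keys().then((requests) =>
        Promise.all(
          requests.map((request) =>
            cache.match(request).then((response) => memoryCache.set(request.url, response))
          )
        )
      )
    );

    self.addEventListener('fetch', (event) => {
      const cached = memoryCache.get(event.request.url);
      if (cached) {
        // Respond with a clone so the stored body is never consumed
        // and the entry can be reused for later requests to the same URL.
        event.respondWith(cached.clone());
      }
      // Requests arriving before `warmup` resolves fall through to the network,
      // which is one more reason this approach is hard to rely on.
    });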

How do I make HTTP requests in Rails while still servicing many requests per minute?

I'm trying to scale up an app server to process over 20,000 requests per minute.
When I stress-test them, most requests are easily handled at 20,000 RPM or more.
But requests that need to make an external HTTP request (e.g., Facebook Login) bring the server down to a crawl (3,000 RPM).
I conceptually understand the limitations of my current environment: 3 load-balanced servers with 4 Unicorn workers per server can only handle 12 requests at a time, even if all of them are waiting on HTTP requests.
What are my options for scaling this better? I'd like to handle many more connections at once.
Possible solutions as I understand it:
Brute force: use more unicorn workers (ie, more RAM) and more servers.
Push all the blocking operations into background/worker processes to free up the web processes. Clients will need to poll periodically to find when their request has completed.
Move to Puma instead of Unicorn (and probably to Rubinius from MRI), so that I can use threads instead of processes -- which may(??) improve memory usage per connection, and therefore allow the number of workers to be increased.
Fundamentally, what I'm looking for is: Is there a better way to increase the number of blocked/queued requests a single worker can handle so that I can increase the number of connections per server?
For example, I've heard discussion of using Thin with EventMachine. Does this open up the possibility of a Rails worker that can put down the web request it's currently working on (because that one is waiting on an external server) and pick up another request while it waits? If so, is this a worthwhile avenue to pursue for performance compared with Unicorn and Puma? (Does it strongly depend on the runtime activities of the app?)
Unicorn is a single-threaded, multi-process synchronous app server. It's not a good match for this kind of processing.
It sounds like your application is I/O bound. This argues for an event-oriented daemon to process your requests.
I'd recommend trying EventMachine with the em-http-request and em-http-server libraries.
This will allow you to service both incoming requests to the HTTP server and outgoing HTTP service calls asynchronously.

Practical Queuing Theory

I want to learn enough simple/practical queuing theory to model the behavior of a standard web application stack: Load balancer with multiple application server backends.
Given a simple traffic pattern extracted from a tool like New Relic, showing the percentage of traffic to a given part of an application and the average response time for that part of the application, I think I should be able to model different queueing behaviors with the load-balancer configuration, number of app servers, and queuing models.
Can anyone help point me to queuing theory introductory/fundamentals I would need to represent this system mathematically? I'm embarrassed to say I knew how to do this as an undergrad but have since forgotten all of the fundamentals.
My goal is to model different load-balancer and app-server queuing models and measure the results.
For example, it seems clear an N-mongrel Ruby on Rails application stack will have worse latency/wait time with a queue on each Mongrel than a Unicorn/Passenger system with a single queue for each group of app workers.
I can't point you at theory, but there are a few basic methods in popular usage:
Blind (linear or weighted) round-robining - requests are cycled through n servers, maybe according to some weighting. Each backend maintains a request queue. A slow-running request backs up that worker's request queue. A worker that stops returning results is eventually dropped out of the balancer pool, with all requests currently queued on it getting dropped. This is common for haproxy/nginx balancing setups.
Global pooling - a master queue maintains a list of requests, and workers report when they are free to accept a new request. The master hands off the front of the queue to the available worker. If a worker goes down, only the currently-being-handled request is lost. Results in slightly diminished performance under ideal circumstances (all workers up and returning requests quickly), since communication between the queue master and backends is a prerequisite to a job actually being handed off, but with the benefit of naturally avoiding slow, dead, or stalled workers. Passenger uses this balancing algorithm by default, and haproxy uses a variant of it with its "leastconn" balancing algorithm.
Hashed balancing - some component of the request is hashed, and the resulting hash determines which backend to use. memcached uses this sort of strategy for sharded setups. The downside is that if your cluster configuration changes, all the previous hashes become invalid, and may map to different backends than before. In the case of memcached specifically, this results in a likely invalidation of most or all of your cached data (reddit suffered some massive performance problems recently due to this sort of problem).
Generally speaking, for web apps, I tend to prefer the global pooling method, since it maintains the smoothest user experience when you have slow or dead workers.
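To illustrate the hashed-balancing pitfall in the last point, here is a toy sketch of a naive hash(key) % n scheme; the hash function and backend names are made up for the example:

    // Naive hashed balancing: pick a backend by hashing part of the request.
    function pickBackend(key, backends) {
      let hash = 0;
      for (const ch of key) {
        hash = (hash * 31 + ch.charCodeAt(0)) >>> 0; // simple string hash
      }
      return backends[hash % backends.length];
    }

    const before = ['cache-a', 'cache-b', 'cache-c'];
    const after = ['cache-a', 'cache-b', 'cache-c', 'cache-d']; // cluster grows by one node

    for (const key of ['user:1', 'user:2', 'user:3', 'user:4']) {
      // Many keys land on a different backend after the change, so their
      // previously cached entries are effectively lost.
      console.log(key, pickBackend(key, before), '->', pickBackend(key, after));
    }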
