How do different visitor metrics relate? - scalability

Hypothetically, let's say someone tells you to expect X unique visitors per day (say, 100,000) as the result of a successful marketing campaign.
How does that translate to peak requests/second? Peak simultaneous requests?
Obviously this depends on many factors, such as the typical number of pages requested per user session, the load time of a typical page, and many other things. Call these other variables Y, Z, V, etc.
I'm looking for some sort of function, or even just a ratio, to estimate these metrics, obviously for planning out the production environment's scalability strategy.
This might happen on a production site I'm working on really soon. Any kind of help estimating these is useful.

Edit: (following indication that we have virtually NO prior statistics on the traffic)
We can therefore forget about the bulk of the plan laid out below and directly get into the "run some estimates" part. The problem is that we'll need to fill-in parameters from the model using educated guesses (or plain wild guesses). The following is a simple model for which you can tweak the parameters based on your understanding of the situation.
Model
Assumptions:
a) The distribution of page requests follows the normal distribution curve.
b) Considering a short period during peak traffic, say 30 minutes, the number of requests can be considered to be evenly distributed.
This could be [somewhat] incorrect: for example we could have a double curve if the ad campaign targets multiple geographic regions, say the US and the Asian markets. Also the curve could follow a different distribution. These assumptions are however good ones for the following reasons:
it would err, if at all, on the "pessimistic side", i.e. over-estimating peak traffic values. This pessimistic outlook can be reinforced further by using a slightly smaller std deviation value. (We suggest using 2 to 3 hours, which would put 68% and 95% of the traffic over periods of 4 and 8 hours (2h std dev) or 6 and 12 hours (3h std dev), respectively.)
it makes for easy calculations ;-)
it is expected to generally match reality.
Parameters:
V = expected number of distinct visitors per 24 hour period
Ppv = average number of page requests associated with a given visitor session. (You may consider using the formula twice, once for "static" responses and once for dynamic responses, i.e. when the application spends time crafting a response for a given user/context.)
sig = std deviation in minutes
R = peak-time number of requests per minute.
Formula:
R = (V * Ppv * 0.0796)/(2 * sig / 10)
That is because, with a normal distribution, and as per the z-score table, roughly 3.98% of the samples fall within 1/10 of a std dev on either side of the mean (the very peak). We therefore get almost 8 percent of the samples within one tenth of a std dev around the peak, and with the assumption of a relatively even distribution during this period, we simply divide by the number of minutes in that window.
Example: V=75,000, Ppv=12 and sig = 150 minutes (i.e. 68% of traffic assumed to come over 5 hours, 95% over 10 hours, 5% over the other 14 hours of the day).
R = 2,388 requests per minute, i.e. about 40 requests per second. Rather heavy, but "doable" (unless the application takes 15 seconds per request...)
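The model above is easy to sketch in code. This is a direct transcription of the formula, with the parameter names from the text; the constant 0.0796 is 2 × P(0 < z < 0.1) from the standard normal table:

```python
def peak_requests_per_minute(V, Ppv, sig):
    """V: distinct visitors/day, Ppv: page requests per visit, sig: std dev in minutes."""
    # ~7.96% of a normal distribution falls within +/- 0.1 std dev of the mean;
    # spread those requests evenly over that 2 * sig / 10 minute window.
    return (V * Ppv * 0.0796) / (2 * sig / 10)

# The worked example from the text: 75,000 visitors, 12 pages each, sig = 150 min.
rpm = peak_requests_per_minute(75_000, 12, 150)
print(round(rpm), "req/min, ~", round(rpm / 60), "req/s")  # 2388 req/min, ~ 40 req/s
```

Tweaking `V`, `Ppv`, and `sig` per the suggestions below (e.g. dividing `V` by the number of daily peaks) is then a one-line change.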
Edit (Dec 2012):
I'm adding here an "executive summary" of the model proposed above, as provided in comments by full.stack.ex.
In this model, we assume most people visit our system at, say, noon. That's the peak time. Others jump ahead or lag behind, the farther out the fewer; nobody at midnight. We choose a bell curve that reasonably covers all requests within the 24h period: with about 4 sigma on the left and 4 on the right, "nothing" significant remains in the long tail. To simulate the very peak, we cut out a narrow stripe around noon and count the requests there.
It is noteworthy that this model, in practice, tends to overestimate the peak traffic, and may prove more useful for estimating a "worst case" scenario than for modeling plausible traffic patterns. Tentative suggestions to improve the estimate are:
to extend the sig parameter (to acknowledge that the effective traffic period of relatively high traffic is longer)
to reduce the overall amount of visits for the period considered, i.e. reduce the V parameter by, say, 20% (to acknowledge that about that many visits happen outside of any peak time)
to use a different distribution such as say the Poisson or some binomial distribution.
to consider that there are a number of peaks each day, and that the traffic curve is actually the sum of several normal (or other distribution) functions with similar spread, but with a distinct peak. Assuming that such peaks are sufficiently apart, we can use the original formula, only with a V factor divided by as many peaks as considered.
[original response]
It appears that your immediate concern is how the server(s) may handle the extra load... A very worthy concern ;-). Without distracting you from this operational concern, consider that the process of estimating the scale of the upcoming surge also provides an opportunity to prepare yourself to gather more and better intelligence about the site's traffic, during and beyond the ad campaign. Such information will in time prove useful for making better estimates of surges etc., but also for guiding some of the site's design (for commercial efficiency as well as for improving scalability).
A tentative plan
Assume qualitative similarity with existing traffic.
The ad campaign will expose the site to a different population (type of users) than its current visitor/user population: different situations select different subjects. For example, the "ad campaign" visitors may be more impatient, focused on a particular feature, concerned about price... as compared to the "self-selected?" visitors. Nevertheless, for lack of any other supporting model and measurement, and for the sake of estimating load, the general principle could be to assume that the surge users will, on the whole, behave similarly to the self-selected crowd. A common approach is to "run numbers" on this basis and to use educated guesses to slightly bend the coefficients of the model to accommodate a few distinctive qualitative differences.
Gather statistics about existing traffic
Unless you readily have better information for this (e.g. Tealeaf, Google Analytics...), your source for such information may simply be the web server's logs... You can then build some simple tools to parse these logs and extract the following statistics. Note that these tools will be reusable for future analysis (e.g. of the campaign itself); also look for opportunities to log more/different data, without significantly changing the application!
Average, Min, Max, Std Dev. for
number of pages visited per session
duration of a session
percentage of 24-hour traffic for each hour of a work day (exclude weekends and such, unless of course the site receives much traffic during those periods). These percentages should be calculated over several weeks at least, to remove noise.
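The statistics above can be pulled from an access log with a short script. A minimal sketch, assuming Apache/Nginx common log format with bracketed timestamps (the regex and file path are assumptions to adjust for your server):

```python
import re
from collections import Counter

def hourly_percentages(log_path):
    """Percentage of 24-hour traffic falling in each hour of the day."""
    hours = Counter()
    # Matches e.g. "[10/Oct/2000:13:55:36 -0700]" and captures the hour "13".
    ts = re.compile(r'\[\d{2}/\w{3}/\d{4}:(\d{2}):')
    with open(log_path) as f:
        for line in f:
            m = ts.search(line)
            if m:
                hours[int(m.group(1))] += 1
    total = sum(hours.values()) or 1
    return {h: 100.0 * hours[h] / total for h in range(24)}
```

The same pass over the logs can be extended to bucket requests by session (e.g. by IP + user agent within a time window) to get pages per session and session duration.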
"Run" some estimates:
For example, start with a peak-use estimate, using the peak hour(s) percentage, the average daily session count, the average number of page hits per session, etc. This estimate should take into account the stochastic nature of traffic. Note that you don't have to worry, in this phase, about the impact of queuing effects; instead, assume that the service time relative to the request period is low enough. Therefore just use a realistic estimate (or rather a value informed by the log analysis for these very-high-usage periods) for the way the probability of a request is distributed over short periods (say, of 15 minutes).
Finally, based on the numbers you obtained in this fashion, you can get a feel for the type of sustained load this would represent on the server, and plan to add resources or refactor part of the application. Also -very important!- if the outlook is for sustained at-capacity load, start running the Pollaczek-Khinchine formula, as suggested by ChrisW, to get a better estimate of the effective load.
For extra credit ;-) Consider running some experiments during the campaign, for example by randomly providing a distinct look or behavior for some of the pages visited, and by measuring the impact this may have (if any) on particular metrics (registrations for more info, orders placed, number of pages visited...). The effort associated with this type of experiment may be significant, but the return can be significant as well, and if nothing else it may keep your "usability expert/consultant" on his/her toes ;-) You'll obviously want to work on defining such experiments with the proper marketing/business authorities, and you may need to calculate ahead of time the minimum percentage of users to whom the alternate site would be shown, to keep the experiment statistically representative. It is indeed important to know that the experiment doesn't need to be applied to 50% of the visitors; one can start small, just not so small that any variations observed may be due to randomness...

I'd start by assuming that "per day" means "during the 8-hour business day", because that's a worse-case scenario without perhaps being unnecessarily worst-case.
So if you're getting an average of 100,000 in 8 hours, and if the time at which each one arrives is random (independent of the others) then in some seconds you're getting more and in some seconds you're getting less. The details are a branch of knowledge called "queueing theory".
Assuming that the Pollaczek-Khinchine formula is applicable, then because your service time (i.e. CPU time per request) is quite small (i.e. less than a second, probably), therefore you can afford to have quite a high (i.e. greater than 50%) server utilization.
In summary, assuming that the time per request is small, you need a capacity that's higher (but here's the good news: not much higher) than whatever's required to service the average demand.
The bad news is that if your capacity is less than the average demand, then the average queueing delay is infinite (or more realistically, some requests will be abandoned before they're serviced).
The other bad news is that when your service time is small, you're sensitive to temporary fluctuations in the average demand, for example ...
If the demand peaks during the lunch hour (i.e. isn't the same average demand as during other hours), or even if for some reason it peaks during a 5-minute period (during a TV commercial break, for example)
And if you can't afford to have customers queueing for that period (e.g. queueing for the whole lunch hour, or e.g. the whole five-minute commercial break)
... then your capacity needs to be enough to meet those short-term peak demands. OTOH you might decide that you can afford to lose the surplus: that it's not worth engineering for the peak capacity (e.g. hiring extra call centre staff during the lunch hour) and that you can afford some percentage of abandoned calls.
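The Pollaczek-Khinchine formula mentioned above gives the mean queueing delay for an M/G/1 queue: Wq = λ·E[S²] / (2·(1 − ρ)), with utilization ρ = λ·E[S]. A small sketch with illustrative numbers (100,000 requests over an 8-hour day, 50 ms of deterministic service per request):

```python
def mg1_mean_wait(arrival_rate, mean_service, second_moment_service):
    """Pollaczek-Khinchine mean waiting time for an M/G/1 queue (seconds)."""
    rho = arrival_rate * mean_service          # server utilization
    assert rho < 1, "unstable queue: capacity is below average demand"
    return arrival_rate * second_moment_service / (2 * (1 - rho))

lam = 100_000 / (8 * 3600)   # ~3.47 requests/second on average
s = 0.05                     # 50 ms CPU time per request
# Deterministic service: E[S^2] = E[S]^2. Wait is ~5 ms at ~17% utilization.
print(mg1_mean_wait(lam, s, s * s))
```

This illustrates the point above: with small service times, modest over-provisioning keeps queueing delay negligible, but as ρ approaches 1 the (1 − ρ) denominator makes the delay blow up.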

That will depend on the marketing campaign. For instance a TV ad will bring a lot of traffic at once, for a newspaper ad it will be spread out more over the day.
My experience with marketing types has been that they just pull a number from where the sun doesn't shine, typically higher than reality by at least an order of magnitude.

Related

Query on Estimated Savings

We have a question regarding the time saved by rectifying the opportunities presented on Google Lighthouse.
Question 1:
We have embedded an example of a Lighthouse scan from our company's website.
These are the opportunities identified:
Reduce initial server response time (estimated savings: 1.57s)
Enable text compression (estimated savings: 1.02s)
Reduce unused CSS (estimated savings: 0.72s)
Reduce unused JavaScript (estimated savings: 0.55s)
Eliminate render-blocking resources (estimated savings: 0.55s)
May I ask whether, assuming all issues identified by Lighthouse are addressed by our developers, we should be able to save at least 4 minutes and 41 seconds of performance speed (4m41s is derived by totaling all the estimated savings identified by Lighthouse)?
Question 2:
If it is true that I am able to save 4 minutes and 41 seconds of performance speed, which of the 6 metrics do I match this data against?
Thank you.
First of all, summing the time estimates in the Opportunities section gives you 4.41 seconds, not 4 minutes 41 seconds. In addition, these are just time estimates, so the actual savings could vary. These estimates are mostly there to help identify the most important areas that can be focused on, prioritizing engineering efforts.
The metrics shown in the second image are what the score is calculated with, not the opportunities in the first image (though improving those will likely help your score). web.dev (a resource by Google, who makes Lighthouse) has some great resources explaining what the metrics mean and also how the score is calculated from those metrics. These metrics taken together can help you see a bigger picture of how your page is performing.
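As a sanity check on the arithmetic, the estimates in the Opportunities panel are in seconds, so the total is a simple sum (values taken from the question):

```python
# Estimated savings from the Lighthouse Opportunities panel, in seconds.
savings = {
    "Reduce initial server response time": 1.57,
    "Enable text compression": 1.02,
    "Reduce unused CSS": 0.72,
    "Reduce unused JavaScript": 0.55,
    "Eliminate render-blocking resources": 0.55,
}
total = sum(savings.values())
print(f"{total:.2f} s")  # 4.41 s -- seconds, not minutes
```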

Is there an autonomous real-time clock with a monthly drift of less than 10 milliseconds?

Looking for a real-time clock for an IoT project. I need millisecond resolution for my app protocol, and its drift is critical. So I wonder: is there an autonomous real-time clock (with a battery) that will lose less than 10 ms per month and work for a year?
The drift parameters you're asking for here -- 10 ms / 30 days -- imply <4 ppb accuracy. This will be a very difficult target to hit. A typical quartz timing crystal of the type used by most RTCs will drift by 50 - 100+ ppm (50,000 - 100,000 ppb) just based on temperature fluctuations.
Most of the higher-quality timing options (TCXO, OCXO, etc) will not be usable within your power budget -- a typical OCXO may require as much as 1W of (continuous) power to run its heater. About the only viable option I can think of would be a GPS timing receiver, which can synchronize a low-quality local oscillator to the GPS time, which is highly accurate.
Ultimately, though, I suspect your best option will be to modify your protocol to loosen or remove these timing requirements.
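The <4 ppb figure quoted above follows from a quick back-of-the-envelope conversion of the drift budget:

```python
# Convert "10 ms per 30 days" into a fractional frequency accuracy (parts per billion).
drift_s = 0.010                      # allowed error: 10 milliseconds
period_s = 30 * 24 * 3600            # ~one month, in seconds
ppb = drift_s / period_s * 1e9
print(round(ppb, 2))  # ~3.86 ppb -- versus 50,000+ ppb for a typical quartz crystal
```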
Sync it with a precise clock source, like GPS for example.
You can also use a tiny atomic clock https://physicsworld.com/a/atomic-clock-is-smallest-on-the-market/
or, in Europe, a DCF77 receiver.

Is a high `min_replan_interval` a good idea for a low traffic app?

We have an app with a Neo4j backend that receives a relatively low amount of traffic, max a few hundred hits per hour. I'm wondering if the default value of 1 second for dbms.cypher.min_replan_interval will mean that all our query plans will be replanned between calls and whether we might see better performance if we increased it.
Also, are there any dangers to increasing this? If the structure of our data doesn't change all that much will it not then be a good idea to keep the query plans for as long as possible?

Is there a good method to profile a process with large reductions?

I found that the process reductions are large in our production environment, and the messages didn't decrease.
FYI, the reduction count was 10831243888178 and then 10838818431635 after 5 minutes. The message_queue_len was 1012 and then 1014 over the same interval.
I supposed that the messages shown by process_info(Pid) should be consumed within those 5 minutes, but they weren't. Can I say that the process was blocked by some messages?
I read on the web that one reduction can be viewed as one function call, but I don't fully understand it. I'd appreciate it if someone could tell me more about "reductions".
Reductions are a way to measure work done by a process.
Every scheduled process is given a number of reductions to spend before being preempted, in other words before it has to let other processes execute. Calling a function spends 1 reduction, that is right, but it is not the only thing that spends them; a lot of reductions will vanish inside that function call too.
It seems that the numbers you've given are the accumulated reductions spent by the process. A big number by itself doesn't actually mean much. A big increase, however, means the process is doing some hard work. If this workhorse is not consuming its message queue, there is a great chance it is stuck inside one very long, or even unending, computation.
You may try to inspect it further with process_info(Pid, current_function) or process_info(Pid, current_stacktrace).

Optimizing landing pages

In my current project (Rails 2.3) we have a collection of 1.2 million keywords, and each of them is associated with a landing page, which is effectively a search-results page for a given keyword. Each of those pages is pretty complicated, so it can take a long time to generate (up to 2 seconds with a moderate load, even longer during traffic spikes, with current hardware). The problem is that 99.9% of visits to those pages are new visits (via search engines), so it doesn't help a lot to cache a page on the first visit: it will still be slow for that visit, and the next visit could be several weeks later.
I'd really like to make those pages faster, but I don't have too many ideas on how to do it. A couple of things that come to mind:
build a cache for all keywords beforehand (with a very long TTL, a month or so). However, building and maintaining this cache can be a real pain, and the search results on the page might be outdated, or even no longer accessible;
given the volatile nature of this data, don't try to cache anything at all, and just try to scale out to keep up with traffic.
I'd really appreciate any feedback on this problem.
Something isn't quite adding up from your description. When you say 99.9% being new visits, that is actually pretty unimportant. When you cache a page you're not just caching it for one visitor. But perhaps you're saying that for 99.9% of those pages, there is only 1 hit every few weeks. Or maybe you mean that 99.9% of visits are to a page that only gets hit rarely?
In any case, the first thing I would be interested in knowing is whether there is a sizable percentage of pages that could benefit from full page caching? What defines a page as benefitting from caching? Well, the ratio of hits to updates is the most important metric there. For instance, even a page that only gets hit once a day could benefit significantly from caching if it only needs to be updated once a year.
In many cases page caching can't do much, so then you need to dig into more specifics. First, profile the pages... What are the slowest parts to generate? What parts have the most frequent updates? Are there any parts that depend on the logged-in state of the user (it doesn't sound like you have users, though?)?
The lowest-hanging fruit (and what will propagate throughout the system) is good old-fashioned optimization. Why does it take 2 seconds to generate a page? Optimize the hell out of your code and data store. But don't go doing things willy-nilly like removing all Rails helpers. Always profile first (the New Relic Silver and Gold tiers are tremendously useful for getting traces from the actual production environment; definitely worth the cost). Optimize your data store. This could be through denormalization or, in extreme cases, by switching to a different DB technology.
Once you've done all reasonable direct optimization strategy, look at fragment caching. Can the most expensive part of the most commonly accessed pages be cached with a good hit-update ratio? Be wary of solutions that are complicated or require expensive maintenance.
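To make the hit-update trade-off concrete, here is a minimal TTL fragment cache sketch. It is deliberately language-agnostic Python rather than Rails (in Rails 2.3 you would reach for built-in fragment caching backed by memcached instead), and the key/renderer names are hypothetical:

```python
import time

class TTLCache:
    """Tiny in-memory cache: serve a stored fragment until its TTL expires."""
    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.store = {}  # key -> (expires_at, value)

    def fetch(self, key, compute):
        now = time.time()
        hit = self.store.get(key)
        if hit and hit[0] > now:
            return hit[1]               # cache hit: skip the expensive render
        value = compute()               # cache miss: render and remember
        self.store[key] = (now + self.ttl, value)
        return value

# Month-long TTL, as in the "build a cache beforehand" option above.
cache = TTLCache(ttl_seconds=30 * 24 * 3600)
page = cache.fetch("keyword:red+shoes", lambda: "<html>...expensive render...</html>")
```

The caveat from the text applies directly: a long TTL only pays off when the hit-to-update ratio is favorable, and stale search results are the price of the month-long expiry.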
If there is any cardinal rule to optimizing scalability cost it is that you want enough RAM to fit everything you need to serve on a regular basis, because this will always get you more throughput than disk access no matter how clever you try to be about it. How much needs to be in RAM? Well, I don't have a lot of experience at extreme scales, but if you have any disk IO contention then you definitely need more RAM. The last thing you want is IO contention for something that should be fast (ie. logging) because you are waiting for a bunch of stuff that could be in RAM (page data).
One final note. All scalability is really about caching (CPU registers > L1 cache > L2 cache > RAM > SSD drives > disk drives > network storage). It's just a question of grain. Page caching is extremely coarse-grained, dead simple, and trivially scalable if you can do it. However, for huge data sets (Google) or highly personalized content (Facebook), caching must happen at a much finer-grained level. In Facebook's case, they have to optimize down to the individual asset. In essence they need to make it so that any piece of data can be accessed in just a few milliseconds from anywhere in their data center. Every page is constructed individually for a single user with a customized list of assets. This all has to be put together in < 500ms.