Azure Website Memory Usage - what's OK - memory

I have been trying to determine the best instance/size combination for 21 sites on Azure Websites because of what I think is memory pressure. CPU is not an issue (< 3%).
I have tried 1 medium and 1 or 2 smalls. Medium improved overall performance by about 15ms response time on the busiest site (per New Relic), probably because it doubles the cores (and memory).
Using the new preview portal's memory quota module:
1 or 2 smalls runs about 80%-90% average memory
1 medium runs about 70%
That makes no sense considering the medium is double the memory. Is the larger memory availability not forcing the GC to run as often on the medium instance?
At what % memory can an instance run without it impacting performance?
Is 80-90 OK on the small instance?
Will adding instances to smalls help a memory problem since it basically just creates the same setup across all the instances and will eventually use up the same amount?
I have not been able to isolate any issues on performance on any of the 21 sites, but I don't want a surprise if I am running too close.

Run the sites on a smaller instance and set auto-scale to ramp when it gets to high CPU. Monitor how often it needs to scale and if it scales often just switch to a larger instance size permanently.

Coming back to this for some user edification. It seems that the ~80% range is normal and is just using "what's available." No perf issues or failures ever.
So if you come across this and are worried about the high memory usage, you'll need to keep an eye on it to determine whether it's just normal or whether you have a memory leak.
I know, crappy answer :)

Related

How to know if I have to do memory profiling too?

I currently do CPU sampling of an ASP.NET Core application to which I send a huge number of requests (> 500K). I see that the peak working set of the application is around ~300 MB, which in my opinion is not huge considering the number of requests being made to the application. But what I have been observing is a huge drop in requests per second when I enable certain pieces of functionality in my application.
Question:
Should I do memory profiling too? I ask this because even though the peak working set is ~300MB, there could be large number of short lived objects that could be created & collected by GC and since work by GC also counts as CPU, should I do memory profiling too to see if I allocate too much?
I will answer this question myself based on new information that I found out.
This is based on the tool PerfView, which provides information about GC and allocations.
When you open the GCStats view, follow the links to the process you care about and you should see its GC statistics.
Notice that the view includes the % CPU Time spent Garbage Collecting. If this is > 5%, it is a cause for concern and you should start memory profiling.

Erlang ETS memory fragmentation

I have an erlang cluster where erlang:memory() 'total' is between 2-2.5GB from idle to busy time, day in day out. ets memory usage is around 440M and stays around there no matter what. The data within ets is heavily transient and completely changes throughout the day. Tomorrow's data is guaranteed to have no commonality with today's.
Linux top says beam is using about 10 gigabytes. free -m 'used' agrees with that (the machine really only runs beam). The overall memory usage of the system grows regularly, about 1% per day on 16GB systems. There is some variance across nodes, but not by a lot, and OS 'used' memory is always several times more than erlang:memory() total.
erlang:system_info({allocator, ets_alloc}) shows 20 allocators. Most have data that looks something like this (full output of command is here):
    {mbcs_pool,[{blocks,2054},
                {blocks_size,742672},
                {carriers,10},
                {carriers_size,17825792}]},
1) Does this mean that 742K bytes (words?) of memory are actually taking 17M of OS memory?
2) As this post suggests, should we add '+MEas bf' to the VM args, in order to reduce overhead?
3) What else can I do to avoid actually running out of memory?
This is R17.5 but we will be migrating to R19.3 in next deployment (this week). We don't have recon in the current deployment but will be adding it in the next deployment. Also, can't imagine this matters, but beam is running inside an alpine container.
In case someone else runs into this later: this was not actually leaked memory.
The default memory allocator strategy of erlang may not be optimal for your use, depending on what you do and on how erlang is configured to allocate blocks. It turns out that, in some cases, memory that is "free" from erlang's point of view won't necessarily be released to the OS immediately, due to allocator fragmentation.
It's somewhat explained here: http://erlang.org/doc/man/erts_alloc.html
The default allocator strategy for the version of erlang we used at the time is aoffcbf (address order first fit carrier best fit). In our case, this resulted in very high memory fragmentation (10+GB overhead worth). When troubleshooting these things, erlang:system_info(allocator) and erlang:system_info({allocator, Alloc}) are your friend. Changing to aobff (address order best fit) resulted in much more efficient memory usage. In truth, as long as the machine didn't run out of physical memory, it wouldn't matter, but for us, we were getting dangerously close to the physical limit. And you do not want to start paging. With aobff, we never passed 4GB, even after the node being up 18 months. With the aoffcbf we would pass 10GB in a few weeks.
As always, YMMV, as it all depends on what type, size, etc. of blocks are allocated, and how long they live.
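To put a number on the fragmentation, the figures from the question's mbcs_pool output can be turned into a utilization ratio. This is only a rough sketch: it assumes blocks_size and carriers_size are both reported in bytes, and the function name carrier_utilization is made up for illustration.

```python
def carrier_utilization(blocks_size, carriers_size):
    """Fraction of carrier memory actually occupied by live blocks.

    Assumes both arguments are byte counts taken from
    erlang:system_info({allocator, ets_alloc}) output.
    """
    if carriers_size <= 0:
        raise ValueError("carriers_size must be positive")
    return blocks_size / carriers_size


# Figures copied from the mbcs_pool entry in the question.
util = carrier_utilization(742_672, 17_825_792)
print(f"blocks occupy {util:.1%} of carrier memory")
```

A ratio this low (roughly 4%) for an allocator instance is consistent with the answer's diagnosis: the carriers are held by the VM, but most of the space inside them is unused.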

Apple Instruments slows down app when analyzing memory allocations

When running my app in the simulator and analyzing its memory allocations using Instruments, the app runs very slowly, at less than 1/30 of its normal speed.
The app uses about 50 MB RAM and has approximately 900,000 live objects (according to Instruments).
Could this be the reason for the slow performance?
When running the app on the device or in the simulator without using Instruments, it performs well (except for the memory issue I am trying to debug).
Do you have any idea how to solve this issue?
Did you encounter slow performance using the Memory Allocation instruments?
Would you consider having more than 900,000 live objects "concerning"?
Considering your Analyzer performance issue
In your specific case monitoring the app over a long period of time will not be necessary, as you reach the state of high memory consumption very soon. You could simply stop recording at this point. Then you won't have problems navigating through the different views and statistics to find the cause of the memory issue.
Analyzing the memory issue
Slowing down is normal. 1/30 sounds quite alarming.
You should probably track how the number of live objects and the memory usage change while you use the app.
It is difficult to decide whether a certain number of live objects at a specific point in time is critical (though 900,000 seems very high).
In general: if live objects and memory usage grow continuously and don't shrink, that is a bad sign.
If you take a look at Statistics -> Object Summary (Screenshot), Live Bytes should be a lot smaller than Overall Bytes and the number of #Living objects should be a lot smaller than the number of #Transitory objects.
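The "grows continuously and doesn't shrink" rule of thumb can be written down as a small check over a series of readings. This is a minimal sketch, assuming you have noted live-object counts (or live bytes) by hand at a few points in time; the function name and the sample numbers are hypothetical, not anything Instruments exports.

```python
def grows_without_shrinking(samples, min_growth=0):
    """Return True if no sample is lower than the one before it
    and the total growth from first to last exceeds min_growth."""
    if len(samples) < 2:
        return False
    never_shrinks = all(later >= earlier
                        for earlier, later in zip(samples, samples[1:]))
    return never_shrinks and (samples[-1] - samples[0]) > min_growth


# Hypothetical live-object counts noted while using the app.
suspicious = grows_without_shrinking([300_000, 520_000, 760_000, 900_000])
healthy = grows_without_shrinking([300_000, 520_000, 310_000, 500_000])
```

A series that dips back down after an operation finishes suggests objects are being released; one that only climbs is worth a closer look.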
The second thing you can look at is the Call Tree view.
It gives you a nice overview of which parts of the application are responsible for reserving large amounts of memory.
Possible solutions
Once you detect the parts of your code that are responsible for reserving the large memory amount you can look for retain-cycles or you could try to use more autorelease pools in that spot.
Check that you have enough available disk space. I had 8 GB left and it seems that was too little. Instruments was extremely slow: it took a minute just to start and was barely usable.
I cleared out more disk space and then it suddenly went back to being fast as before.

Should a process always consume the same amount of memory if executed in the same way?

Hi folks and thanks for your time in advance.
I'm currently extending our C# test framework to monitor the memory consumed by our application. The intention is that a bug is potentially raised if the memory consumption jumps significantly on a new build, as resources are always tight.
I'm using System.Diagnostics.Process.GetProcessesByName and then checking the PrivateMemorySize64 value.
While developing the new test, using the same build of the application for consistency, I've seen it consume differing amounts of memory despite supposedly executing exactly the same code.
So my question is: once an application has launched, fully loaded and, in this case, is in its idle state, hence in an identical state from run to run, can I expect the private bytes consumed to be identical from run to run?
I need to confirm that I can expect memory usage to be consistent, as any degree of variance starts to reduce the effectiveness of the test: a degree of tolerance would need to be introduced, something I'd like to avoid.
So...
1) Should the memory usage be 100% consistent, presuming the application is behaving consistently? This was my expectation.
or
2) Is there any degree of variance in the private byte usage returned by Windows, or in the memory it allocates when requested by an app?
Currently, if the answer is that memory consumed should be consistent, as I was expecting, then the issue lies in our app actually requesting a differing amount of memory.
Many thanks
H
Almost everything in .NET uses the runtime's garbage collector, and when exactly it runs and how much memory it frees depends on a lot of factors, many of which are out of your hands. For example, when another program needs a lot of memory, and you have a lot of collectable memory at hand, the GC might decide to free it now, whereas when your program is the only one running, the GC heuristics might decide it's more efficient to let collectable memory accumulate a bit longer. So, short answer: No, memory usage is not going to be 100% consistent.
OTOH, if you have really big differences between runs (say, a few megabytes on one run vs. half a gigabyte on another), you should get suspicious.
If the program is deterministic (like all embedded programs should be), then yes. In an OS environment you are very unlikely to get the same figures due to memory fragmentation and numerous other factors.
Update:
Just noticed this is a C# app, so no, but the numbers should be relatively close (+/- 10% or less).
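Since some variance is unavoidable, the test ends up comparing against a baseline with a tolerance rather than checking for equality. A minimal sketch of that comparison follows, in Python rather than C# purely for illustration; the function name, the use of a median over several samples, and the sample values are all assumptions, and the 10% default simply mirrors the rough figure above.

```python
from statistics import median


def within_tolerance(samples, baseline, tolerance=0.10):
    """Compare the median of several private-bytes readings against
    a baseline, allowing a relative deviation of `tolerance`."""
    if not samples:
        raise ValueError("need at least one sample")
    observed = median(samples)
    return abs(observed - baseline) <= tolerance * baseline


# Hypothetical readings in MB from three idle-state runs of the same build.
ok = within_tolerance([105, 98, 110], baseline=100)
regressed = within_tolerance([150, 160, 155], baseline=100)
```

Taking the median of a few runs dampens one-off spikes from GC timing, so the tolerance can stay tighter than it would for a single reading.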

cooperative memory usage across threads?

I have an application that has multiple threads processing work from a todo queue. I have no influence over what gets into the queue and in what order (it is fed externally by the user). A single work item from the queue may take anywhere from a couple of seconds to several hours of runtime and should not be interrupted while processing. Also, a single work item may consume anywhere from a couple of megabytes to around 2 GB of memory. The memory consumption is my problem. I'm running as a 64-bit process on an 8 GB machine with 8 parallel threads. If each of them hits a worst-case work item at the same time, I run out of memory. I'm wondering about the best way to work around this.
1) Plan conservatively and run 4 threads only. The worst case shouldn't be a problem anymore, but we waste a lot of parallelism, making the average case a lot slower.
2) Make each thread check available memory (or rather total allocated memory by all threads) before starting a new item. Only start when more than 2 GB of memory is left. Recheck periodically, hoping that other threads will finish their memory hogs so that we may start eventually.
3) Try to predict how much memory items from the queue will need (hard) and plan accordingly. We could reorder the queue (overriding user choice) or simply adjust the number of running worker threads.
4) More ideas?
I'm currently tending towards number 2 because it seems simple to implement and solve most cases. However, I'm still wondering what standard ways of handling situations like this exist? The operating system must do something very similar on a process level after all...
regards,
Sören
So your current worst-case memory usage is 16GB. With only 8GB of RAM, you'd be lucky to have 6 or 7GB left after the OS and system processes take their share. So on average you're already going to be thrashing memory on a moderately loaded system. How many cores does the machine have? Do you have 8 worker threads because it is an 8-core machine?
Basically you can either reduce memory consumption or increase available memory. Your option 1, running only 4 threads, under-utilises the CPU resources, which could halve your throughput - definitely sub-optimal.
Option 2 is possible, but risky. Memory management is very complex, and querying for available memory is no guarantee that you will be able to go ahead and allocate that amount (without causing paging). A burst of disk I/O could cause the system to increase the cache size, a background process could start up and swap in its working set, and any number of other factors. For these reasons, the smaller the available memory, the less you can rely on it. Also, over time memory fragmentation can cause problems too.
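One variation on option 2 that avoids asking the OS how much memory is free is to keep an in-process budget that worker threads reserve from before starting an item. The following is a minimal sketch, not a drop-in solution: the class name MemoryBudget is made up, the 6 GB total is an assumed figure, and it presumes each item reserves either an estimate or the 2 GB worst case, which the question notes is hard to predict.

```python
import threading


class MemoryBudget:
    """Shared byte budget; workers block until their reservation fits."""

    def __init__(self, total_bytes):
        self._total = total_bytes
        self._used = 0
        self._cond = threading.Condition()

    def acquire(self, nbytes, timeout=None):
        """Reserve nbytes, waiting up to `timeout` seconds.
        Returns True on success, False if the wait timed out."""
        if nbytes > self._total:
            raise ValueError("request exceeds total budget")
        with self._cond:
            fits = self._cond.wait_for(
                lambda: self._used + nbytes <= self._total, timeout)
            if fits:
                self._used += nbytes
            return fits

    def release(self, nbytes):
        """Return a reservation and wake any waiting workers."""
        with self._cond:
            self._used -= nbytes
            self._cond.notify_all()


GB = 1024 ** 3
budget = MemoryBudget(6 * GB)  # assumed usable memory, not a measured value
```

A worker would call budget.acquire(estimate) before processing an item and budget.release(estimate) in a finally block afterwards. This only tracks what the application itself has promised to its threads, so it does not protect against other processes taking memory.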
Option 3 is interesting, but could easily lead to under-loading the CPU. If you have a run of jobs that have high memory requirements, you could end up running only a few threads, and be in the same situation as option 1, where you are under-loading the cores.
So taking the "reduce consumption" strategy, do you actually need to have the entire data set in memory at once? Depending on the algorithm and the data access pattern (eg. random versus sequential) you could progressively load the data. More esoteric approaches might involve compression, depending on your data and the algorithm (but really, it's probably a waste of effort).
Then there's "increase available memory". In terms of price/performance, you should seriously consider simply purchasing more RAM. Sometimes, investing in more hardware is cheaper than the development time to achieve the same end result. For example, you could put in 32GB of RAM for a few hundred dollars, and this would immediately improve performance without adding any complexity to the solution. With the performance pressure off, you could profile the application to see just where you can make the software more efficient.
I have continued the discussion on Herb Sutter's blog and provoked some very helpful reader comments. Head over to Sutter's Mill if you are interested.
Thanks for all the suggestions so far!
Sören
Difficult to propose solutions without knowing exactly what you're doing, but how about considering:
See if your processing algorithm can access the data in smaller sections without loading the whole work item into memory.
Consider developing a service-based solution so that the work is carried out by another process (possibly a web service). This way you could scale the solution to run over multiple servers, perhaps using a load balancer to distribute the work.
Are you persisting the incoming work items to disk before processing them? If not, they probably should be anyway, particularly if it may be some time before the processor gets to them.
Is the memory usage proportional to the size of the incoming work item, or otherwise easy to calculate? Knowing this would help to decide how to schedule processing.
Hope that helps?!

Resources