I have a Rails 3.2.8 app running on Heroku Cedar with Ruby 1.9.3. The app runs fine when it launches but after a day or so of continuous use, I start to see R14 errors on my logs. Once the memory errors start, they never go away, even if the app is idle for several hours.
Shouldnt the garbage collector clean up unused objects after a while and reduce the memory load? It seems this is not happening on Heroku. Generally, memory usage starts to creep up after running some reports with several thousand rows of data, although results are paginated.
How can I find the memory leak? Plugins like bleak_house are way out of date or dont run nicely in the Heroku environment. Can I adjust the GC settings to make it more aggressive?
The GC should do the clean up, and probably does.
You can force the GC with GC.start; if many objects were not collected this will, but I suspect that is not the issue.
Is it possible you somehow create a bunch of objects and never release them, by keeping cached copies or something?
I'm unfamiliar with the existing tools to check this, but you may want to check which objects exist using ObjectSpace. For example:
ObjectSpace.each_object.with_object(Hash.new(0)){|obj, h| h[obj.class] +=1 }
# => a Hash with the number of objects by class
If you get an unexpected number for one of your classes, for instance, you would have a better idea of where to look for.
Install the New Relic add-on. It has a bunch of useful metrics that you can use to find out the source of the leak. I think its generally a better idea to try to see which part of the code takes the longest to execute and perhaps try to optimize that, rather than tweak the GC outright.
Some of the nice features New Relic includes is being able to pinpoint the source of the longest running SQL query, for example. I encourage you to give it a try.
Related
I'm pretty new to Ruby, Rails, and everything else in that ecosystem. I've joined a team that has a Ruby 3.1.2 / Rails 6.1.7 app backed by a Postgres database.
We have a scenario where, sometimes, memory usage on one of our running instances jumps up significantly and is never relinquished (we've waited days). Until today, we didn't know what was causing it or how to reproduce it.
It turns out that it's caused by an internal tool which was running an unbounded ActiveRecord query -- no limit and no paging. When pointing this tool at a more active customer, it takes many seconds, returns thousands of records, and memory usage increases by tens of MB. No amount of waiting will lead to the memory usage going back down again.
Having discovered that, we recently added paging to that particular tool, and in the ~week since, we have not seen usage increasing in giant chunks anymore. However, there are other scenarios which have similar behavior but with smaller payloads; these cause memory usage to increase gradually over time. We deploy this application often enough that it hasn't been a big deal, but I am looking to gain a better understanding of what's happening and to determine if there's a problem here, because that's not what we should see from a stable application that's free of memory leaks.
My first suspicion was a memoized instance variable on the controller, but a quick check indicates that Rails controllers are discarded as soon as the request finishes processing, so I don't think that's it.
My next suspicion was that ActiveRecord was caching my resultset, but I've done a bunch of research on how this works and my understanding is that any cached queries/relations should be released when the request completes. Even if I have that wrong, a subsequent identical request takes just as long and causes another jump in memory usage, so either that's not it, or caching is broken on our system.
My Google searches turn up lots of results about various caching capabilities in Rails 2, 3, and 5 -- but not much about 6.x, so maybe something significant has changed and I just haven't found it.
I did find ruby-prof, memory-profiler, and get_process_mem -- these all seem like they are only suitable for high-level analysis and wouldn't help me here.
Can I explore the contents of the object graph currently in memory on an existing, live instance of my app? I'm imagining that this would happen in the Rails console, but that's not a constraint on the question. If not, is there some other way that I could find out what is currently in memory, and whether it's just a bunch of fragmented pages or if there's actually something that isn't getting garbage collected?
EDIT
#engineersmnky pointed out in the comments that maybe everything is fine and that perhaps Ruby is just still holding on to the OS page due to some other still-valid object therein. However, if this is the case, it strikes me as unlikely that memory usage would not go back down to the previous baseline after several days of production usage.
Loading tens of MB worth of resultset into memory should result in the allocation of >1000 16kb memory pages in just a handful of seconds. It seems reasonable to assume that the vast majority of those would contain exclusively this resultset, and could therefore be released as soon as the resultset is garbage collected.
Furthermore, I can reproduce the increased memory usage by running the same unbounded ActiveRecord query in the Rails console, and when I close that console, the memory goes down almost immediately -- exactly what I was expecting to see when the web request completes. I don't fully understand how the Rails console works when connecting to a running application, though, so this may not be relevant.
I have a ruby on rails app, where we validate records from huge excel files(200k records) in background via sidekiq. We also use docker and hence a separate container for sidekiq. When the sidekiq is up, memory used is approx 120Mb, but as the validation worker begins, the memory reaches upto 500Mb (that's after a lot of optimisation).
Issue is even after the job is processed, the memory usage stays at 500Mb and is never freed, not allowing any new jobs to be added.
I manually start garbage collection using GC.start after every 10k records and also after the job is complete, but still no help.
This is most likely not related to Sidekiq, but to how Ruby allocates from and releases memory back to the OS.
Most likely the memory can not be released because of fragmentation. Besides optimizing your program (process data chunkwise instead of reading it all into memory) you could try and tweak the allocator or change the allocator.
There has been a lot written about this specific issue with Ruby/Memory, I really like this post by Nate Berkopec: https://www.speedshop.co/2017/12/04/malloc-doubles-ruby-memory.html which goes into all the details.
The simple "solution" is:
Use jemalloc or, if not possible, set MALLOC_ARENA_MAX=2.
The more complex solution would be to try and optimize your program further, so that it does not load that much data in the first place.
I was able to cut memory usage in a project from 12GB to < 3GB by switching to jemalloc. That project dealt with a lot of imports/exports and was written quite poorly and it was an easy win.
Logging performed by our Performance Team has indicated that this line specifically, is killing our CPUs
Microsoft.AI.PerfCounterCollector!Microsoft.ApplicationInsights.Extensibility.PerfCounterCollector.Implementation.PerformanceCounterUtility.ExpandInstanceName()
One theory was the the regexes used to identify Perf Counters in the library are recursing
https://adtmag.com/blogs/dev-watch/2016/07/stack-overflow-crash.aspx
I've inspected the Perf Counter names and nothing looks particularly out of kilter regarding the names and the regexes should have no trouble chewing over them. Certainly for large periods of time there are no issues whatsoever.
I've now turned on Applications Insights Diagnostic Logging in an attempt to observe the issue (in a test environment)
Has anyone else observed this, how can we mitigate this?
We have ensured DeveloperMode is NOT set to on.
this answer most likely won`t be usable now as AppInsights 2 code is much improved, but in my case it was double call AddApplicationInsightsTelemetry(). It adds collector each time this method is called, and due to lack of synchronization inside perf counter collector code it creates CPU spikes.
So avoid calling multiple times AddApplicationInsightsTelemetry, or use AppInsights 2.x. (and the best thing is to do both).
Do the counters you're collecting utilize instance placeholders in their names? If the instance name is known at build time, getting rid of placeholders may significantly improve performance. For instance, instead of
\Process(??APP_WIN32_PROC??)\% Processor Time
try using
\Process(w3wp)\% Processor Time
Also, how many counters are you collecting overall?
I have a background worker in my rails project that executes a lot of complicated data aggregation in-memory in ruby. I'm seeing a strange behavior. When I boot up a process for executing the jobs (thousands), I see a strange performance decrease over time. In the beginning a job completion takes around 300ms but after processing around 10.000 jobs the execution time will gradually have decreased to around 2000ms. This is a big problem for me and I'm puzzled about how this can possibly happen. I see no memory leaks (RAM usage is pretty stable), and I see no errors. What might cause this on a low level, and where should I start looking?
Background facts:
Among the things the job does, it does a lot of regexp comparisons of a lot of strings. There is no external database calls made except for read/write operations to a redis instance.
I have tried to execute the same on different servers/computers, and the symptoms are all the same.
If I restart the process when it starts to perform too bad, the performance turns good again immediately after.
I'm running ruby 1.9.3p194 and rails 3.2 and sidekiq 2.9.0 for job processor
It is difficult to tell from the limited description of your service, but the behaviour is consistent with a small (i.e. not leaky) cache of data that either has poor lookup performance, or that you are relying on very heavily, and that is growing at just a modest rate. A contrived example might be a list of "jobs done so far by this worker" which is being sorted on demand at a few points in the code.
One such cache is out of your direct control: Ruby's symbol table. Finding a Symbol is something like O(log(n)) on number of symbols in the system, which is good. But this could still impact you if you handle a lot of symbols, and each iteration of your worker can generate new symbols (for instance if keys in an input hash can be arbitrary data, and you use a symbolize_keys method or call to_sym on a lot of varying strings). Symbols are cached permanently in the Ruby process. In theory a few million would not show up as a memory leak. But if your code can go from say 10,000 symbols to 1,000,000 in total, all the symbol generating and checking code would slow down by a small fixed amount. If you are doing that a lot, it could potentially explain a few hundred ms.
If hunting through suspect code is getting you nowhere, your best bet to find the problem is to use a profiler. You should collect a profile of the code behaving well, and behaving badly, and compare the two.
Some odd behavior I encountered while optimizing a rails view:
After tweaking the amount of garbage collect calls in a request I didnt see real improvements on the performance. Investigating the problem I found out garbage collect didn't really remove that much dead objects!
I got a very heavy view with LOADS of objects.
Using scrap I found out after a fresh server start and page load the amount of objects was about 670.000, after reloading the page 3 times the amount has risen to 19.000.000!
RAILS_GC_MALLOC_LIMIT is set to 16.000.000 and I can read GC has been called 1400 times.
Why does the memory keep increasing on a refresh of the page? And is there a way to make sure the old objects are removed by GC?
PS: Running on REE 1.8.7 2011.03 with rails 3.2.1
I highly recommended to use newrelic for optimalization, got way more performance boost there then with a little gc tweaking..
You dont need to gc objects you never create :)