Sidekiq is not releasing the memory after finishing the job - ruby-on-rails

I am facing a strange issue with Sidekiq. I have a few heavy jobs running in the background with Sidekiq, but even after Sidekiq finishes a job it keeps holding on to the memory. What could be the reason?
Versions
ruby : 2.2.4
rails : 4.2.7
sidekiq : 3.5.4
I have attached the memory log also.
I have even checked the link, but it didn't help. I have also triggered the GC manually.

My understanding is that Ruby MRI will not release memory back to the OS. If your Sidekiq job consumes a lot of memory, even if those objects are garbage collected, the memory will just be released back to Ruby, not back to the OS. You should try to find a way to make your Sidekiq job(s) consume less memory, and assume that your workers will eventually allocate the maximum amount of memory that your most memory-consuming job requires.

Hi, we were also facing the same problem and I researched it a lot. Jim is right: it's Ruby MRI that is managing the resources. Your Sidekiq workers are heavy, or they're performing heavy operations, which likely means a lot of object allocations as well. The more resources they need, the more memory Ruby takes from the OS and then uses for the operations being performed in Sidekiq. It doesn't release that memory back to the OS; it keeps the same memory space and reuses it to serve other objects.
The main thing to keep in mind to stay out of this problem is to optimise your code:
If you're performing multiple heavy operations and they can be divided, use separate workers for them.
Run your fetch queries in batches to decrease the load on the server (see the sketch after this list).
Use array operations whenever they're necessary.
You can run GC.start at the end of your job to force the garbage collector to run and release the memory as well.
When you kill a Sidekiq process it releases the memory to the OS, because the process is run by Ruby and, once it's killed, Ruby hands its memory back to the OS.
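As a hedged sketch of the batching idea (the Record model, its processed flag and its do_heavy_work method are made-up names for illustration):

class HeavyWorker
  include Sidekiq::Worker

  def perform
    # find_each loads records in batches (1000 by default) instead of
    # instantiating the whole result set in memory at once.
    Record.where(processed: false).find_each(batch_size: 1000) do |record|
      record.do_heavy_work
      record.update(processed: true)
    end

    GC.start # optionally nudge the GC once the work is done, as suggested above
  end
end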
I hope this helps.
You'll have to optimise your code; this is the same suggestion given by mperham.

People above have already delineated the problem - MRI will not release the memory that Sidekiq allocated after a huge job is completed, and also a possible solution - run GC.start manually after you finish your job.
There are scenarios where you might have a big job that's split into smaller ones and that still generates a lot of memory. I would like to suggest an approach here:
The SmallerJobs have a way to know whether their counterparts have finished - maybe a flag on your Cache Engine** of choice:
class BigJob
  def perform(record_ids)
    Rails.cache.write("big_job/in_progress", record_ids)
    record_ids.each { |id| SmallerJob.perform_later(id) }
  end
end

class SmallerJob
  def perform(record_id)
    jobs_in_progress = Rails.cache.read("big_job/in_progress")

    # perform your action against the record

    jobs_in_progress -= [record_id]
    Rails.cache.write("big_job/in_progress", jobs_in_progress)
    ForceGCJob.perform_later if jobs_in_progress.empty?
  end
end
After you're done, you queue a ForceGCJob, which as the name says, simply does this:
def perform
  logger.debug(GC.stat) # if you're curious
  GC.start
end
** Concurrent writes to the cache might not be serializable, so due to race conditions you might lose some of the removals; this is just a simple example. In one scenario I keep a database column per record instead, which guarantees these operations are ACID.
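To sidestep the read-modify-write race entirely, a hedged alternative (my own sketch, not part of the original answer; the "big_job/pending" key is made up) is to keep an atomic pending counter in Redis through Sidekiq's connection pool:

class BigJob
  def perform(record_ids)
    # Set the number of pending children before fanning out.
    Sidekiq.redis { |conn| conn.set("big_job/pending", record_ids.size) }
    record_ids.each { |id| SmallerJob.perform_later(id) }
  end
end

class SmallerJob
  def perform(record_id)
    # ... perform your action against the record ...

    # DECR is atomic, so concurrent children can't lose each other's updates.
    remaining = Sidekiq.redis { |conn| conn.decr("big_job/pending") }
    ForceGCJob.perform_later if remaining <= 0
  end
end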

Related

Sidekiq does not release memory after job is processed

I have a Ruby on Rails app where we validate records from huge Excel files (200k records) in the background via Sidekiq. We also use Docker, and hence a separate container for Sidekiq. When Sidekiq is up, memory used is approx 120MB, but as the validation worker begins, the memory reaches up to 500MB (and that's after a lot of optimisation).
The issue is that even after the job is processed, the memory usage stays at 500MB and is never freed, not leaving room for any new jobs.
I manually start garbage collection using GC.start after every 10k records and also after the job is complete, but still no help.
This is most likely not related to Sidekiq, but to how Ruby allocates from and releases memory back to the OS.
Most likely the memory cannot be released because of fragmentation. Besides optimizing your program (process data chunkwise instead of reading it all into memory), you could try to tweak the allocator or switch to a different allocator.
There has been a lot written about this specific issue with Ruby/Memory, I really like this post by Nate Berkopec: https://www.speedshop.co/2017/12/04/malloc-doubles-ruby-memory.html which goes into all the details.
The simple "solution" is:
Use jemalloc or, if not possible, set MALLOC_ARENA_MAX=2.
The more complex solution would be to try and optimize your program further, so that it does not load that much data in the first place.
I was able to cut memory usage in a project from 12GB to < 3GB by switching to jemalloc. That project dealt with a lot of imports/exports and was written quite poorly, so it was an easy win.
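Since the question mentions running Sidekiq in its own Docker container, here is a hedged sketch of what enabling jemalloc (or falling back to MALLOC_ARENA_MAX=2) might look like in a Debian/Ubuntu-based Dockerfile; the package name and .so path vary by base image, so verify them before copying:

RUN apt-get update && apt-get install -y libjemalloc2

# Preload jemalloc for every process started in the container, Sidekiq included.
ENV LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2

# If jemalloc is not an option, cap glibc's malloc arenas instead:
# ENV MALLOC_ARENA_MAX=2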

DelayedJob doesn't release memory

I'm using Puma server and DelayedJob.
It seems that the memory taken by each job isn't released, and I slowly accumulate bloat that forces me to restart my dyno (Heroku).
Any reason why the dyno won't return to the same memory usage figure before the job was performed?
Any way to force releasing it? I tried calling GC but it doesn't seem to help.
You can have one of the following problems, or actually all of them:
Number 1. This is not an actual problem, but a misconception about how Ruby releases memory to the operating system. Short answer: it doesn't. Long answer: Ruby manages an internal list of free objects. Whenever your program needs to allocate new objects, it gets them from this free list. If there are none left, Ruby allocates new memory from the operating system. When objects are garbage collected they go back to the free list, so Ruby still holds the allocated memory. To illustrate: imagine your program normally uses 100 MB. If at some point it allocates 1 GB, it will hold on to that memory until you restart the process.
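A minimal sketch to see this behaviour yourself (assumes a Unix-like system where ps reports RSS in kilobytes; the sizes are illustrative):

def rss_mb
  `ps -o rss= -p #{Process.pid}`.to_i / 1024
end

puts "at start:         #{rss_mb} MB"

data = Array.new(1_000_000) { "x" * 1_000 } # roughly 1 GB of string data
puts "after allocating: #{rss_mb} MB"

data = nil
GC.start
puts "after GC:         #{rss_mb} MB" # usually stays far above the starting value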
There are some good resources to learn more about it here and here.
What you should do is increase your dyno size and monitor your memory usage over time. It should stabilize at some level; that will show you your normal memory usage.
Number 2. You can have an actual memory leak, either in your code or in some gem. Check out this repository; it contains information about well-known memory leaks and other memory issues in popular gems. delayed_job is actually listed there.
Number 3. You may have unoptimized code that uses more memory than needed; you should investigate memory usage and try to decrease it. If you are processing large files, maybe you should do it in smaller batches, etc.

Rising Total Memory on Heroku Dyno

I have a website hosted on a Heroku Dyno that allows max 512MB of memory.
My site allows users to upload raw time series data in CSV format, and I wanted to load test the performance of uploading a CSV with ~100k rows (3.2 MB in size). The UI lets the user upload the file, which in turn kicks off a Sidekiq job to import each row in the file into my database. It stores the uploaded file under /tmp storage on the dyno, which I believe gets cleared on each periodic restart of the dyno.
Everything actually finished without error, and all 100k rows were inserted. But several hours later I noticed my site was almost unresponsive and I checked Heroku metrics.
At the exact time I had started the upload, the memory usage began to grow and quickly exceeded the maximum 512MB.
The logs confirmed this fact -
# At the start of the job
Aug 22 14:45:51 gb-staging heroku/web.1: source=web.1 dyno=heroku.31750439.f813c7e7-0328-48f8-89d5-db79783b3024 sample#memory_total=412.68MB sample#memory_rss=398.33MB sample#memory_cache=14.36MB sample#memory_swap=0.00MB sample#memory_pgpgin=317194pages sample#memory_pgpgout=211547pages sample#memory_quota=512.00MB
# ~1 hour later
Aug 22 15:53:24 gb-staging heroku/web.1: source=web.1 dyno=heroku.31750439.f813c7e7-0328-48f8-89d5-db79783b3024 sample#memory_total=624.80MB sample#memory_rss=493.34MB sample#memory_cache=0.00MB sample#memory_swap=131.45MB sample#memory_pgpgin=441565pages sample#memory_pgpgout=315269pages sample#memory_quota=512.00MB
Aug 22 15:53:24 gb-staging heroku/web.1: Process running mem=624M(122.0%)
I can restart the Dyno to clear this issue, but I don't have much experience in looking at metrics so I wanted to understand what was happening.
If my job finished in ~30 mins, what are some common reasons why the memory usage might keep growing? Prior to the job it was pretty steady.
Is there a way to tell what data is being stored in memory? It would be great to do a memory dump, although I don't know if it would be anything more than hex address data.
What are some other tools I can use to get a better picture of the situation? I can reproduce the situation by uploading another large file to gather more data
Just a bit lost on where to start investigating.
Thanks!
Edit: - We have the Heroku New Relic addon which also collects data. Annoyingly enough, New Relic reports a different/normal memory usage value for that same time period. Is this common? What's it measuring?
These are the most probable reasons:
Scenario 1. You process the whole file, first loading every record from the CSV into memory, doing some processing, and then iterating over it and storing it into the database.
If that's the case, you need to change your implementation to process the file in batches: load 100 records, process them, store them in the database, repeat. You can also look at the activerecord-import gem to speed up your inserts.
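As a hedged sketch of that approach (TimeSeriesPoint, its columns and file_path are made-up names; the import class method comes from activerecord-import):

require "csv"

BATCH_SIZE = 1_000
batch = []

CSV.foreach(file_path, headers: true) do |row| # streams the file row by row
  batch << TimeSeriesPoint.new(measured_at: row["timestamp"], value: row["value"])

  if batch.size >= BATCH_SIZE
    TimeSeriesPoint.import(batch) # one multi-row INSERT instead of thousands
    batch.clear
  end
end
TimeSeriesPoint.import(batch) if batch.any?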
Scenario 2. You have a memory leak in your script. Maybe you process in batches, but you hold references to unused objects and they are not garbage collected.
You can find out by using the ObjectSpace module. It has some pretty useful methods.
count_objects returns a hash with counts for the different object types currently on the heap:
ObjectSpace.count_objects
=> {:TOTAL=>30162, :FREE=>11991, :T_OBJECT=>223, :T_CLASS=>884, :T_MODULE=>30, :T_FLOAT=>4, :T_STRING=>12747, :T_REGEXP=>165, :T_ARRAY=>1675, :T_HASH=>221, :T_STRUCT=>2, :T_BIGNUM=>2, :T_FILE=>5, :T_DATA=>1232, :T_MATCH=>105, :T_COMPLEX=>1, :T_NODE=>838, :T_ICLASS=>37}
It's just a hash, so you can look up a specific type of object:
ObjectSpace.count_objects[:T_STRING]
=> 13089
You can plug this snippet into different points in your script to see how many objects are on the heap at a given time. To get consistent results you should manually trigger the garbage collector before checking the counts; that ensures you only see live objects.
GC.start
ObjectSpace.count_objects[:T_STRING]
Another useful method is each_object, which iterates over all objects currently on the heap:
ObjectSpace.each_object { |o| puts o.inspect }
Or you can iterate over objects of one class:
ObjectSpace.each_object(String) { |o| puts o.inspect }
Scenario 3. You have a memory leak in a gem or system library.
This is like the previous scenario, but the problem lies outside your code. You can find it with ObjectSpace as well: if you see objects being retained after calling a library method, there is a chance that this library has a memory leak. The solution would be to update that library.
Take a look at this repo. It maintains a list of gems with known memory leak problems. If you are using something from this list, I suggest updating it quickly.
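One hedged way to check whether a particular library call retains objects (my own sketch; SuspectGem.do_work is a placeholder for the call you want to inspect) is to diff the heap counts around it:

def retained_objects
  GC.start
  before = ObjectSpace.count_objects.dup
  yield
  GC.start
  after = ObjectSpace.count_objects

  after.each do |type, count|
    diff = count - before.fetch(type, 0)
    puts "#{type}: +#{diff}" if diff > 0
  end
end

retained_objects { 100.times { SuspectGem.do_work } }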
Now, to address your other questions: even a perfectly healthy app on Heroku or any other provider will show memory increasing over time, but it should stabilise at some point. Heroku restarts dynos about once a day, so on your metrics you will see sudden drops followed by a slow increase over a span of two days or so.
New Relic by default shows data averaged across all instances. You should probably switch to showing data only from your worker dyno to see the correct memory usage.
Finally, I recommend reading this article about how Ruby uses memory. There are many useful tools mentioned there, derailed_benchmarks in particular. It was created by a guy who was at Heroku at the time, and it is a collection of benchmarks for the most common problems people have on Heroku.
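For instance, a rough sketch of how to see which gems contribute the most memory at boot: add the gem to your Gemfile and run the commands from your app's root.

# Gemfile
gem "derailed_benchmarks", group: :development

# Then, from the shell:
#   bundle exec derailed bundle:mem     # memory required by each gem at require time
#   bundle exec derailed exec perf:mem  # memory used while booting the app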

Ruby: What can cause execution of same codeblock to slowdown over time when ran over and over again?

I have a background worker in my rails project that executes a lot of complicated in-memory data aggregation in Ruby. I'm seeing a strange behavior: when I boot up a process to execute the jobs (thousands of them), performance decreases over time. In the beginning a job completes in around 300ms, but after processing around 10,000 jobs the execution time has gradually crept up to around 2000ms. This is a big problem for me and I'm puzzled about how this can possibly happen. I see no memory leaks (RAM usage is pretty stable) and I see no errors. What might cause this at a low level, and where should I start looking?
Background facts:
Among the things the job does, it performs a lot of regexp comparisons on a lot of strings. There are no external database calls except for read/write operations against a Redis instance.
I have tried to execute the same on different servers/computers, and the symptoms are all the same.
If I restart the process when it starts performing too badly, the performance becomes good again immediately.
I'm running ruby 1.9.3p194, rails 3.2 and sidekiq 2.9.0 as the job processor.
It is difficult to tell from the limited description of your service, but the behaviour is consistent with a small (i.e. not leaky) cache of data that either has poor lookup performance, or that you are relying on very heavily, and that is growing at just a modest rate. A contrived example might be a list of "jobs done so far by this worker" which is being sorted on demand at a few points in the code.
One such cache is out of your direct control: Ruby's symbol table. Finding a Symbol is something like O(log(n)) in the number of symbols in the system, which is good. But this could still impact you if you handle a lot of symbols and each iteration of your worker can generate new ones (for instance if keys in an input hash can be arbitrary data and you use a symbolize_keys method or call to_sym on a lot of varying strings). Symbols are cached permanently in the Ruby process. In theory a few million would not show up as a memory leak, but if your code goes from say 10,000 symbols to 1,000,000 in total, all the symbol-generating and symbol-checking code slows down by a small fixed amount. If you are doing that a lot, it could potentially explain a few hundred ms.
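A quick hedged check for this is to watch the symbol table grow while the worker runs (on Ruby 1.9.x dynamically created symbols are never collected):

# Log this periodically, e.g. at the end of each job:
puts "symbols: #{Symbol.all_symbols.size}"

# Demonstration of how varying strings inflate the table:
before = Symbol.all_symbols.size
10_000.times { |i| "dynamic_key_#{i}".to_sym }
puts "new symbols: #{Symbol.all_symbols.size - before}" # => roughly 10000 new entries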
If hunting through suspect code is getting you nowhere, your best bet to find the problem is to use a profiler. You should collect a profile of the code behaving well, and behaving badly, and compare the two.

How can I find a memory leak on Heroku?

I have a Rails 3.2.8 app running on Heroku Cedar with Ruby 1.9.3. The app runs fine when it launches, but after a day or so of continuous use I start to see R14 errors in my logs. Once the memory errors start, they never go away, even if the app is idle for several hours.
Shouldn't the garbage collector clean up unused objects after a while and reduce the memory load? It seems this is not happening on Heroku. Generally, memory usage starts to creep up after running some reports with several thousand rows of data, although the results are paginated.
How can I find the memory leak? Plugins like bleak_house are way out of date or don't run nicely in the Heroku environment. Can I adjust the GC settings to make it more aggressive?
The GC should do the cleanup, and probably does.
You can force the GC with GC.start; if many objects were not collected, this will help, but I suspect that is not the issue.
Is it possible you somehow create a bunch of objects and never release them, for example by keeping cached copies of things?
I'm unfamiliar with the existing tools to check this, but you may want to check which objects exist using ObjectSpace. For example:
ObjectSpace.each_object.with_object(Hash.new(0)) { |obj, h| h[obj.class] += 1 }
# => a Hash with the number of objects by class
If you get an unexpected number for one of your classes, for instance, you would have a better idea of where to look.
Install the New Relic add-on. It has a bunch of useful metrics that you can use to find the source of the leak. I think it's generally a better idea to figure out which part of the code takes the longest to execute and try to optimize that, rather than tweak the GC outright.
One of the nice features New Relic includes is being able to pinpoint the source of the longest-running SQL query, for example. I encourage you to give it a try.
