optimize memory usage in rails loop - ruby-on-rails

I develop a Heroku Rails application on the Cedar stack, and the method below is the bottleneck.
def self.to_csvAlt(options = {})
  CSV.generate(options) do |csv|
    column_headers = ["user_id", "session_id", "survey_id"]
    pages = PageEvent.order(:page).select(:page).map(&:page).uniq
    page_attributes = ["a", "b", "c", "d", "e"]

    pages.each do |p|
      page_attributes.each do |pa|
        column_headers << p + "_" + pa
      end
    end
    csv << column_headers

    session_ids = PageEvent.order(:session_id).select(:session_id).map(&:session_id).uniq

    session_ids.each do |si|
      session_user = PageEvent.find(:first, :conditions => ["session_id = ? AND page != ?", si, 'none'])
      if session_user.nil?
        row = [si, nil, nil, nil]
      else
        row = [session_user.username, si, session_user.survey_name]
      end

      pages.each do |p|
        a = 0
        b = 0
        c = 0
        d = 0
        e = 0

        allpages = PageEvent.where(:page => p, :session_id => si)
        allpages.each do |ap|
          a += ap.a
          b += ap.b
          c += ap.c
          d += ap.d
          e += ap.e
        end

        index = pages.index p
        end_index = (index + 1)*5 + 2

        if !p.nil?
          row[end_index]   = a
          row[end_index-1] = b
          row[end_index-2] = c
          row[end_index-3] = d
          row[end_index-4] = e
        else
          row[end_index]   = nil
          row[end_index-1] = nil
          row[end_index-2] = nil
          row[end_index-3] = nil
          row[end_index-4] = nil
        end
      end

      csv << row
    end
  end
end
As you can see, it generates a CSV file from a table that contains data on each individual page taken from a group of surveys. The problem is that there are ~50,000 individual pages in the table, and the Heroku app keeps giving me R14 errors (out of memory, 512 MB) and eventually dies when the dyno goes to sleep after an hour.
That being said, I really don't care how long it takes to run, I just need it to complete. I am waiting on approval to add a worker dyno to run the CSV generation, which I know will help, but in the meantime I would still like to optimize this code. There is potential for over 100,000 pages to be processed at a time, and I realize this is incredibly memory heavy; I really need to cut back its memory usage as much as possible. Thank you for your time.

You can split it up into batches so that the work is completed in sensible chunks.
Try something like this:
def self.to_csvAlt(options = {})
  # ...
  # find_each must be called on an ActiveRecord relation, not on the mapped array:
  PageEvent.select(:page).find_each(batch_size: 5000) do |page_event|
    # ...
Using find_each with a batch_size, you won't do one huge lookup for your loop. Instead it'll fetch 5000 rows, run your loop, fetch another 5000, loop again... and so on, until no more records are returned.
The other key thing to note here is that rather than Rails trying to instantiate all of the objects returned from the database at the same time, it will only instantiate those returned in the current batch. This can save a huge memory overhead if you have a giant dataset.
UPDATE:
Using #map to restrict your results to a single attribute of your model is highly inefficient. You should instead use the pluck ActiveRecord method to pull just the data you want straight back from the DB, rather than manipulating the results in Ruby, like this:
# Instead of this:
pages = PageEvent.order(:page).select(:page).map(&:page).uniq
# Use this:
pages = PageEvent.order(:page).pluck(:page).uniq
I also personally prefer to use .distinct rather than the alias .uniq, as I feel it sits more in line with the DB query and avoids confusion with what looks like an array method (note it goes on the relation, before pluck):
pages = PageEvent.distinct.order(:page).pluck(:page)

Use
CSV.open("path/to/file.csv", "wb")
instead of CSV.generate. This will stream the CSV into the file, whereas generate builds one huge string that will end up exhausting memory if it gets too large.
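For example, here is a minimal sketch of the same export written straight to disk instead of buffered in a string (column_headers and build_row_for are hypothetical helpers standing in for the header- and row-building code from the question):
require 'csv'

def self.to_csv_file(path = "tmp/page_events.csv")
  CSV.open(path, "wb") do |csv|
    csv << column_headers  # header row, built as in the original method
    session_ids = PageEvent.order(:session_id).pluck(:session_id).uniq
    session_ids.each do |si|
      csv << build_row_for(si)  # one row per session, built as in the original method
    end
  end
end
Each row is flushed to the file as it is written, so memory use stays roughly constant no matter how many sessions there are.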

Related

How to speed up a very frequently made query using raw SQL and without ORM?

I have an API endpoint that accounts for a little less than half of the average response time (on average taking about 514 ms, yikes). The endpoint simply returns some statistics about stored data, scoped to particular time periods such as this week, last week, this month, and so on...
There are a number of ways we could reduce its impact, like getting the clients to hit it less often and with more specific queries, such as only querying for "this week" when only that data is used. Here we focus on what can be done at the database level first. In our current implementation we generate this data for all "time scopes" on the fly, and the number of queries is enormous and made many times per second. No caching is used, but maybe there is a way to use Rails's cache_key, or the low-level Rails.cache?
The current implementation looks something like this:
class FooSummaries
  include SummaryStructs

  def self.generate_for(user)
    @user = user
    summaries = Struct::Summaries.new
    TimeScope::TIME_SCOPES.each do |scope|
      foos = user.foos.by_scope(scope.to_sym)
      summary = Struct::Summary.new
      # e.g: summaries.last_week = build_summary(foos)
      summaries.send("#{scope}=", build_summary(summary, foos))
    end
    summaries
  end

  private_class_method

  def self.build_summary(summary, foos)
    summary.all_quuz = @user.foos_count
    summary.all_quux = all_quux(foos)
    summary.quuw = quuw(foos).to_f
    %w[foo bar baz qux].product(
      %w[quux quuz corge]
    ).each do |a, b|
      # e.g: summary.foo_quux = quux(foos, "foo")
      summary.send("#{a.downcase}_#{b}=", send(b, foos, a) || 0)
    end
    summary
  end

  def self.all_quuz(foos)
    foos.count
  end

  def self.all_quux(foos)
    foos.sum(:quux)
  end

  def self.quuw(foos)
    foos.quuwable.total_quuw
  end

  def self.corge(foos, foo_type)
    return if foos.count.zero?
    count = self.quuz(foos, foo_type) || 0
    count.to_f / foos.count
  end

  def self.quux(foos, foo_type)
    case foo_type
    when "foo"
      foos.where(foo: true).sum(:quux)
    when "bar"
      foos.bar.where(foo: false).sum(:quux)
    when "baz"
      foos.baz.where(foo: false).sum(:quux)
    when "qux"
      foos.qux.sum(:quux)
    end
  end

  def self.quuz(foos, foo_type)
    case foo_type
    when "foo"
      foos.where(foo: true).count
    when "bar"
      foos.bar.where(foo: false).count
    when "baz"
      foos.baz.where(foo: false).count
    when "qux"
      foos.qux.count
    end
  end
end
To avoid making changes to the model, or creating a migration for a table to store this data (both of which may be valid and better solutions), I decided it might be easier to construct one large SQL query that is executed all at once, in the hope that building the query string and executing it would be faster than the overhead of Active Record setting up and tearing down many SQL queries.
The new approach looks something like this. It is horrifying to me and I know there must be a more elegant way:
class FooSummaries
  include SummaryStructs

  def self.generate_for(user)
    results = ActiveRecord::Base.connection.execute(build_query_for(user))
    results.each do |result|
      # build up summary struct from query results
    end
  end

  def self.build_query_for(user)
    TimeScope::TIME_SCOPES.map do |scope|
      time_scope = TimeScope.new(scope)
      %w[foo bar baz qux].map do |foo_type|
        %[
          select
            '#{scope}_#{foo_type}',
            sum(quux) as quux,
            count(*) as quuz,
            round(100.0 * (count(*) / #{user.foos_count.to_f}), 3) as corge
          from
            "foos"
          where
            "foos"."user_id" = #{user.id}
            and "foos"."foo_type" = '#{foo_type.humanize}'
            and "foos"."end_time" between '#{time_scope.from}' AND '#{time_scope.to}'
            and "foos"."foo" = '#{foo_type == 'foo' ? 't' : 'f'}'
          union
        ]
      end
    end.join.reverse.sub("union".reverse, "").reverse
  end
end
The funny way of replacing the last occurrence of union also horrifies me, but it seems to work. There must be a better way, as there are probably many things wrong with the above implementation(s). It may be helpful to note that I use PostgreSQL and have no problem writing queries that are not portable to other DBs. Any advice is truly appreciated!
Thanks for reading!
Update: I found a solution that works for me and sped up the endpoint that uses this service object by 500%! Essentially the idea is, instead of building a query string and executing it for each set of parameters, we create a prepared statement using prepare followed by exec_prepared, passing in the parameters to the query. Since this query is made many times over, this is a useful optimization because, as per the documentation:
A prepared statement is a server-side object that can be used to optimize performance. When the PREPARE statement is executed, the specified statement is parsed, analyzed, and rewritten. When an EXECUTE command is subsequently issued, the prepared statement is planned and executed. This division of labor avoids repetitive parse analysis work, while allowing the execution plan to depend on the specific parameter values supplied.
We prepare the query like so:
def prepare_query!
  ActiveRecord::Base.transaction do
    connection.prepare("foos_summary",
      %[with scoped_foos as (
          select
            *
          from
            "foos"
          where
            "foos"."user_id" = $3
            and ("foos"."end_time" between $4 and $5)
        )
        select
          $1::text as scope,
          $2::text as foo_type,
          sum(quux)::float as quux,
          sum(eggs + bacon + ham)::float as food,
          count(*) as count,
          round((sum(quux) / nullif(
            (select
               sum(quux)
             from
               scoped_foos), 0))::numeric,
            5)::float as quuz
        from
          scoped_foos
        where
          (case $6
           when 'Baz'
             then (baz = 't')
           else
             (baz = 'f' and foo_type = $6)
           end
          )
      ])
  end
end
You can see in this query we use a common table expression for more readability and to avoid writing the same select query twice over.
Then we execute the query, passing in the parameters we need:
def connection
  @connection ||= ActiveRecord::Base.connection.raw_connection
end

def query_results
  prepare_query! unless query_already_prepared?
  @results ||= TimeScope::TIME_SCOPES.map do |scope|
    time_scope = TimeScope.new(scope)
    %w[bacon eggs ham spam].map do |foo_type|
      connection.exec_prepared("foos_summary",
                               [scope,
                                foo_type,
                                @user.id,
                                time_scope.from,
                                time_scope.to,
                                foo_type.humanize])
    end
  end
end
Where query_already_prepared? is a simple check in the prepared statements table maintained by postgres:
def query_already_prepared?
  connection.exec(%(select
                      name
                    from
                      pg_prepared_statements
                    where name = 'foos_summary')).count.positive?
end
A nice solution, I thought! Hopefully the technique illustrated here will help others with similar problems.

Getting all the pages from an API

This is something I struggle with, and whenever I do it, it seems to come out messy.
I'm going to ask the question in a very generic way, as it's not a single problem I'm really trying to solve.
I have an API that I want to consume some data from, e.g. via:
def get_api_results(page)
  results = HTTParty.get("api.api.com?page=#{page}")
end
When I call it I can retrieve a total:
results["total"] # => 237
The API limits the number of records I can retrieve in one call, say 20, so I need to call it a few more times. I want to do something like the following, ideally breaking it into pieces so I can use things like delayed_job, etc.
def get_all_api_pages
  results = get_api_results(1)
  total = results["total"]
  page = 2
  until (page - 1) * 20 > total
    results += get_api_results(page)
    page += 1
  end
  results
end
I always feel like I'm writing rubbish whenever I try and solve this (and I've tried to solve it in a number of ways).
The above, for example, leaves me at the mercy of an error with the API, which knocks out all my collected results if I hit an error at any point.
Wondering if there is just a generally good, clean way of dealing with this situation.
I don't think you can make it much cleaner... because you only receive the total once you have called the API.
Have you tried building your own Enumerator for this? It encapsulates the ugly part. Here is a bit of sample code with a "mocked" API:
class AllRecords
  PER_PAGE = 50

  def each
    return enum_for(:each) unless block_given?
    current_page = 0
    total = nil
    while total.nil? || current_page * PER_PAGE < total
      current_page += 1
      page = load_page(current_page)
      total = page[:total]
      page[:items].each do |item|
        yield(item)
      end
    end
  end

  private

  def load_page(page)
    if page == 5
      {items: Array.new(37) { rand(100) }, total: 237}
    else
      {items: Array.new(50) { rand(100) }, total: 237}
    end
  end
end
AllRecords.new.each.each_with_index do |item, index|
  p index
end
You can surely clean that up a bit, but I think this is nice because it does not collect all the items first.
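If you later want to hand the work off in pieces (for example to delayed_job, as the question mentions), the enumerator also slices cleanly. A rough sketch, where ProcessChunkJob is a hypothetical job object with a #perform method (the shape delayed_job's enqueue expects):
AllRecords.new.each.each_slice(100) do |chunk|
  # Each chunk of 100 items becomes its own background job,
  # so a failure partway through only affects that chunk.
  Delayed::Job.enqueue(ProcessChunkJob.new(chunk))
end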

Retrieving only unique records with multiple requests

I have this "heavy_rotation" filter I'm working on. Basically it grabs tracks from our database based on certain parameters (a mixture of listens_count, staff_pick, purchase_count, to name a few).
An XHR request is made to the filter_tracks controller action. In there I have a flag to check if it's "heavy_rotation". I will likely move this to the model (because this controller is getting fat)... Anyway, how can I ensure, in an efficient way, that it does not pull the same records? I've considered an offset, but then I have to keep track of the offset for every query. Or maybe store track ids to compare against for each query? Any ideas? I'm having trouble thinking of an elegant way to do this.
Maybe it should be noted that a limit of 14 is set via JavaScript, and when a user hits "view more" to paginate, it sends another request to filter_tracks.
Any help appreciated! Thanks!
def filter_tracks
  params[:limit]  ||= 50
  params[:offset] ||= 0
  params[:order]  ||= 'heavy_rotation'
  # heavy rotation filter flag
  heavy_rotation ||= (params[:order] == 'heavy_rotation')

  @result_offset = params[:offset]
  @tracks = Track.ready.with_artist

  params[:order] = "tracks.#{params[:order]}" unless heavy_rotation
  if params[:order]
    order = params[:order]
    order.match(/artist.*/){|m|
      params[:order] = params[:order].sub /tracks\./, ''
    }
    order.match(/title.*/){|m|
      params[:order] = params[:order].sub /tracks.(title)(.*)/i, 'LOWER(\1)\2'
    }
  end

  searched = params[:q] && params[:q][:search].present?
  @tracks = parse_params(params[:q], @tracks)
  @tracks = @tracks.offset(params[:offset])
  @result_count = @tracks.count
  @tracks = @tracks.order(params[:order], 'tracks.updated_at DESC').limit(params[:limit]) unless heavy_rotation

  # structure heavy rotation results
  if heavy_rotation
    puts "*" * 300
    week_ago         = Time.now - 7.days
    two_weeks_ago    = Time.now - 14.days
    three_months_ago = Time.now - 3.months

    # mix in top licensed tracks within last 3 months
    t = Track.top_licensed
    tracks_top_licensed = t.where(
      "tracks.updated_at >= :top",
      top: three_months_ago).limit(5)

    # mix top listened to tracks within last two weeks
    tracks_top_listens = @tracks.order('tracks.listens_count DESC').where(
      "tracks.updated_at >= :top",
      top: two_weeks_ago)
      .limit(3)

    # mix top downloaded tracks within last two weeks
    tracks_top_downloaded = @tracks.order("tracks.downloads_count DESC").where(
      "tracks.updated_at >= :top",
      top: two_weeks_ago)
      .limit(2)

    # mix in 25% of staff picks added within 3 months
    tracks_staff_picks = Track.ready.staff_picks.
      includes(:artist).order("tracks.created_at DESC").where(
      "tracks.updated_at >= :top",
      top: three_months_ago)
      .limit(4)

    @tracks = tracks_top_licensed + tracks_top_listens + tracks_top_downloaded + tracks_staff_picks
  end

  render partial: "shared/results"
end
I think seeking an "elegant" solution is going to yield many diverse opinions, so I'll offer one approach and my reasoning. In my design decision, I feel that in this case it's optimal and elegant to enforce uniqueness on query intersections by filtering the returned record objects instead of trying to restrict the query to only yield unique results. As for getting contiguous results for pagination, on the other hand, I would store offsets from each query and use it as the starting point for the next query using instance variables or sessions, depending on how the data needs to be persisted.
Here's a gist to my refactored version of your code with a solution implemented and comments explaining why I chose to use certain logic or data structures: https://gist.github.com/femmestem/2b539abe92e9813c02da
#filter_tracks holds a hash map @tracks_offset which the other methods can access and update; each of the query methods is responsible for adding its own offset key to @tracks_offset.
#filter_tracks also holds a collection of track ids for tracks that already appear in the results.
If you need persistence, make @tracks_offset and @result_track_ids sessions/cookies instead of instance variables (see the sketch after the last code block below). The logic should be the same. If you use sessions to store the offsets and ids from the results, remember to clear them when your user is done interacting with this feature.
See below. Note, I refactored your #filter_tracks method to separate the responsibilities into 9 different methods: #filter_tracks, #heavy_rotation, #order_by_params, #heavy_rotation?, #validate_and_return_top_results, and #tracks_top_licensed... #tracks_top_<whatever>. This will make my notes easier to follow and your code more maintainable.
def filter_tracks
  # Does this need to be so high when JavaScript limits display to 14?
  @limit ||= 50
  @tracks_offset ||= {}
  @tracks_offset[:default] ||= 0
  @result_track_ids ||= []
  @order ||= params[:order] || 'heavy_rotation'

  tracks = Track.ready.with_artist
  tracks = parse_params(params[:q], tracks)
  @result_count = tracks.count

  # Checks for heavy_rotation filter flag
  if heavy_rotation? @order
    @tracks = heavy_rotation
  else
    @tracks = order_by_params
  end

  render partial: "shared/results"
end
All #heavy_rotation does is call the various query methods. This makes it easy to add, modify, or delete any one of the query methods as criteria changes without affecting any other method.
def heavy_rotation
  week_ago         = Time.now - 7.days
  two_weeks_ago    = Time.now - 14.days
  three_months_ago = Time.now - 3.months

  tracks_top_licensed(date_range: three_months_ago, max_results: 5) +
    tracks_top_listens(date_range: two_weeks_ago, max_results: 3) +
    tracks_top_downloaded(date_range: two_weeks_ago, max_results: 2) +
    tracks_staff_picks(date_range: three_months_ago, max_results: 4)
end
Here's what one of the query methods looks like. They're all basically the same, just with different SQL/ORM queries. You'll notice that I'm not setting the :limit parameter to the number of results I want the query method to return. That would create a problem if one of the records returned were duplicated by another query method, for example if the same track came back from both staff_picks and top_downloaded; I would then have to make an additional query to get another record. That's not a wrong decision, just one I decided not to make.
def tracks_top_licensed(args = {})
  args = @default.merge args
  max = args[:max_results]
  date_range = args[:date_range]

  # Adds own offset key to #filter_tracks hash map => @tracks_offset
  @tracks_offset[:top_licensed] ||= 0

  unfiltered_results = Track.top_licensed
    .where("tracks.updated_at >= :date_range", date_range: date_range)
    .limit(@limit)
    .offset(@tracks_offset[:top_licensed])

  top_tracks = validate_and_return_top_results(unfiltered_results, max)

  # Add offset of your most recent query to the cumulative offset
  # so triggering 'view more'/pagination returns contiguous results
  @tracks_offset[:top_licensed] += top_tracks[:offset]

  top_tracks[:top_results]
end
In each query method, I'm cleaning the record objects through a custom method, #validate_and_return_top_results. The validator checks the record objects for duplicates against the @result_track_ids collection set in #filter_tracks, then returns the number of records specified by its caller.
def validate_and_return_top_results(collection, max = 1)
  top_results = []
  i = 0 # offset incrementer

  until top_results.count >= max do
    # Stop early if the query ran out of records before reaching max
    break if collection[i].nil?
    # Checks if track has already appeared in the results
    unless @result_track_ids.include? collection[i].id
      # this will be returned to the caller
      top_results << collection[i]
      # this is the point of reference to validate your query method results
      @result_track_ids << collection[i].id
    end
    i += 1
  end

  { top_results: top_results, offset: i }
end

interpreter suddenly stops on creating multiple threads in ruby

I'm trying to fill my database, on the fly, with information that is being downloaded from the internet. I already have a list of ids in a table. What I initially tried was to get all the ids and traverse them in a loop, downloading the relevant information for each. It worked, but since I had more than 1000 ids it took approximately 24 hours. To speed it up I tried to create threads, with each thread allotted some number of ids to download. The problem here is that the interpreter suddenly stops and exits. I also want to ask whether the procedure I wrote will actually gain me some speedup in overall time. The code I wrote is something like this (I'm using Ruby):
def self.called_by_thread(start, limit = 50, retry_attempts = 5)
  last_id = start
  begin
    @users = User.where('id > ' + last_id.to_s).limit(limit)
    @users.each do |user|
      # called a function to download information of user and store it,
      # this function belongs to the user object
      last_id = user.id
    end
  rescue => msg
    puts "Something went wrong (#{msg})"
    if retry_attempts > 0
      retry_attempts -= 1
      limit -= last_id - start
      retry
    end
  end
end
In the above code, start is the id to start from.
I call the above function like this:
last_id = 1090
i = 1
limit = 50
workers = []

while i < num_workers
  t = Thread.new { called_by_thread(last_id, limit, 5) }
  workers << t
  i += 1
  last_id += limit
end

workers.each do |t|
  t.join
end
All ids are incremental, so there is no harm in adding a positive number to one. It is guaranteed that a user exists for a given id, provided it's below 10000.

Rails CSV import, adding to a related table

I have a CSV importing system in my app (used locally only) which parses the CSV file line by line and adds the data to the database table. This is based on a tutorial here.
require 'csv'

def csv_import
  @parsed_file = CSV::Reader.parse(params[:dump][:file])
  n = 0
  @parsed_file.each_with_index do |row, i|
    next if i == 0 # ignore the first row
    course = Course.new
    course.title       = row[0]
    course.unit_code   = row[1]
    course.course_type = row[2]
    course.value       = row[3]
    course.pass_mark   = row[4]
    if course.save
      n = n + 1
      GC.start if n % 50 == 0
    end
    flash.now[:message] = "CSV Import Successful, #{n} new courses added to the database."
  end
  redirect_to(courses_url)
end
This is all in the courses controller and works fine. There is a relationship: courses HABTM years and years HABTM courses. In the CSV file (effectively in row[5] to row[8]) are the year ids. Is there a way I can add these within the method above? I am confused as to how to loop over the 4 items and add them to the courses_years table.
Thank you
Jack
You can do this by adding a simple loop after your "normal" data is added to the model, and using the << method to append to the years association.
...
course.value = row[3]
course.pass_mark = row[4]

5.upto(8).each do |i|
  one_year = Year.find(row[i])
  course.years << one_year if one_year
end

if course.save
  n = n + 1
...
You can add more checks in the loop if you want to make sure the values are valid, and/or change the find to locate your year in another way. Another approach, when the related data is "trailing off the end" like this, is to keep adding until there is nothing left to add, and also to create the years themselves if they don't exist yet:
...
course.value = row[3]
course.pass_mark = row[4]

row[5..-1].each do |year_id|
  one_year = Year.find_or_create_by_id(year_id)
  course.years << one_year
end

if course.save
  n = n + 1
...
There are a lot of different ways to do this, and which one is right really depends on your actual data, but this is the basic method.
Have you tried putting either one of these before you save the course?
course.years.push(row[5])
course.years.push(row[6])
course.years.push(row[7])
course.years.push(row[8])
OR
course.years = [ row[5], row[6], row[7], row[8] ]
Place it before you save the course. It will fill the join table courses_years.
EDIT
The error that you get seems to be because we are trying to put ids instead of objects; we should do this instead:
.....
year_array = Year.find(row[5], row[6], row[7], row[8])
course.years << year_array
.....
After we get the Year objects, we put them inside the association. You can save the course object after that.
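One caveat: Year.find with several ids raises ActiveRecord::RecordNotFound if any of them is missing from the table. A small sketch of a more forgiving variant using where, which simply skips ids that don't exist (assuming, as above, the year ids live in row[5] through row[8]):
# Look up only the years that actually exist; missing ids are skipped.
Year.where(id: row[5..8].compact).each do |year|
  course.years << year
end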
