I have a database with a bunch of deviceapi entries that have a start_date and end_date (datetime in the schema). Typically these entries are no more than 20 seconds long (end_date - start_date). I have the following setup:
data = Deviceapi.all.where("start_date > ?", DateTime.now - 2.weeks)
I need to get the hour within data that had the highest number of Deviceapi entries. To make it a bit clearer, this was my latest try on it (code is approximated, don't mind typos):
runningtotal = 0
window_start = DateTime.now - 2.weeks
(2.weeks / 1.hour).to_i.times do |interval|
  current = data.select { |d| d.start_date > (window_start + (1.hour * interval)) }
                .select { |d| d.end_date < (window_start + (1.hour * (interval + 1))) }
                .count
  if current > runningtotal
    runningtotal = current
  end
end
The problem: this code works just fine. So did about a dozen other incarnations of it, using .where, .select, SQL queries, etc. But it is too slow. Waaaaay too slow. Because it has to loop through every hour within 2 weeks. Then this method might need to be called itself dozens of times.
There has to be a faster way to do this, maybe a sort? I'm stumped, and I've been searching for hours with no luck. Any ideas?
To get adequate performance, you'll want to do everything in a single query, which will mean avoiding ActiveRecord functionality and doing a raw query (e.g. via ActiveRecord::Base.connection.execute).
I have no way to test it, since I have neither your data nor schema, but I think something along these lines will do what you are looking for:
select y.starting_hour, y.num_entries as max_entries
from
(
  select x.starting_hour, count(*) as num_entries
  from
  (
    select date_trunc('hour', start_date) as starting_hour
    from deviceapi as d
  ) as x
  group by x.starting_hour
) as y
order by y.num_entries desc
limit 1;
The logic of this is as follows, from the inner-most query out:
"Bucket" each starting time to the hour
From the resulting table of buckets, get the total number of entries in each bucket
Order that table of counts from highest to lowest and take the top row, which gives you the busiest starting_hour together with its count.
If there happens to be more than one bucket with the same number of entries, you could add a consistent tie-breaker to the ORDER BY -- say starting_hour asc or similar (since that would stay the same even as data gets added, assuming you are not deleting items).
If you wanted to limit the initial time slice -- I see 2 weeks referenced in your post -- you could do that in the inner-most query with a where clause bracketing the date range.
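For completeness, a rough sketch of how that might be run from Rails (untested; it collapses the outer select, since ordering the grouped counts is enough, and assumes the pg adapter so execute returns rows as hashes):
sql = <<-SQL
  select x.starting_hour, count(*) as num_entries
  from (
    select date_trunc('hour', start_date) as starting_hour
    from deviceapi
    where start_date > now() - interval '2 weeks'
  ) as x
  group by x.starting_hour
  order by num_entries desc
  limit 1
SQL

busiest = ActiveRecord::Base.connection.execute(sql).first
# busiest is a hash like {"starting_hour" => ..., "num_entries" => ...}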
Related
I have an app that returns a bunch of records in a list in a random order with some pagination. For this, I save a seed value (so that refreshing the page will return the list in the same order again) and then use .order('random()').
However, say that out of 100 records, I have 10 records that have a preferred_level = 1 while all the other 90 records have preferred_level = 0.
Is there some way that I can place the preferred_level = 1 records first but still keep everything randomized?
For example, I have [A,1],[X,0],[Y,0],[Z,0],[B,1],[C,1],[W,0] and I hope I would get back something like [B,1],[A,1],[C,1],[Z,0],[X,0],[Y,0],[W,0].
Note that even the ones with preferred_level = 1 are randomized within themselves, just that they come before all the 0 records. In theory, I would hope whatever solution would place preferred_level = 2 before the 1 records if I were ever to add them.
------------
I had hoped it would be as intuitively simple as Model.all.order('random()').order('preferred_level DESC') but that doesn't seem to be the case. The second order doesn't seem to affect anything.
Any help would be appreciated!
This got the job done for me on a very similar problem.
select * from table order by preferred_level = 1 desc, random()
or I guess the Rails way
Model.order('preferred_level = 1 desc, random()')
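If you also want it to generalize to higher levels later (preferred_level = 2 ahead of 1), ordering by the column itself should do it. And since you mentioned saving a seed for pagination, something along these lines might work on PostgreSQL (untested sketch; seed is assumed to be a float between -1 and 1 that you already persist):
# seed the generator so random() is reproducible across paginated requests
Model.connection.execute("SELECT setseed(#{seed.to_f})")

# higher preferred_level first, shuffled within each level
Model.order('preferred_level DESC, random()')
Note that setseed only affects the current connection, so the setseed call and the ordered query need to run on the same connection for the ordering to be repeatable (and on newer Rails you may need to wrap the order string in Arel.sql).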
I have a query that loads thousands of objects and I want to tame it by using find_in_batches:
Car.includes(:member).where(:engine => "123").find_in_batches(batch_size: 500) ...
According to the docs, I can't have a custom sorting order: http://www.rubydoc.info/docs/rails/4.0.0/ActiveRecord/Batches:find_in_batches
However, I need a custom sort order of created_at DESC. Is there another method to run this query in chunks like it does in find_in_batches so that not so many objects live on the heap at once?
Hm, I've been thinking about a solution for this (I'm the person who asked the question). It makes sense that find_in_batches doesn't allow you to have a custom order, because let's say you sort by created_at DESC and specify a batch_size of 500. The first loop goes from 1-500, the second loop goes from 501-1000, etc. What if, before the 2nd loop occurs, someone inserts a new record into the table? It would be put at the top of the query results, everything else would be shifted down by 1, and your 2nd loop would have a repeat.
You could argue though that created_at ASC would be safe then, but it's not guaranteed if your app specifies a created_at value.
UPDATE:
I wrote a gem for this problem: https://github.com/EdmundMai/batched_query
Since using it, the average memory of my application has HALVED. I highly suggest anyone having similar issues to check it out! And contribute if you want!
The slower, manual way to do this is offset-based batching, something like this:
batch_size = 500
total = Car.where(:engine => "123").count
batches = (total / batch_size.to_f).ceil

batches.times do |i|
  cars = Car.includes(:member)
            .where(:engine => "123")
            .order(created_at: :desc)
            .offset(i * batch_size)
            .limit(batch_size)

  # do your updating, e.g. cars.each { ... } or cars.update_all(...)
end
Can you imagine how find_in_batches with sorting would work on 1M rows or more? It would sort all of the rows for every batch.
So I think it is better to reduce the number of queries that sort. For example, with a batch size of 500 you can load just the IDs (with the sort applied) for N * 500 rows up front, and afterwards load each batch of objects by those IDs. That way the number of sorted queries sent to the DB drops by a factor of N.
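A rough sketch of this idea, reusing the class and scope from the question (untested):
# one sorted query that loads only the ids, then cheap unsorted loads per batch
ids = Car.where(:engine => "123").order(created_at: :desc).ids

ids.each_slice(500) do |batch_ids|
  cars = Car.includes(:member).where(:id => batch_ids)
  # where(:id => ...) does not preserve the sorted order inside the batch,
  # so re-sort in Ruby if the order matters within each slice
  cars = cars.sort_by { |c| batch_ids.index(c.id) }

  # process cars ...
end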
I'm calculating total "walk time" for a dog walking app. The Walks table has two cols, start_time and end_time. Since I want to display total time out for ALL walks for a particular dog, I should just be able to sum the two columns, subtract start_times_total from end_times_total, and the result will be my total time out. However I'm getting strange results. When I sum the columns thusly,
start_times = dog.walks.sum('start_time')
end_times = dog.walks.sum('end_time')
BOTH start_times and end_times return the same value. Doing a sanity check I see that my start and end times in the db are indeed set as I would expect them to be (start times in the morning, end times in the afternoon), so the sum should definitely return a different value for each of the columns. Additionally, the value is different for each dog and in line with the relative values I would expect, so dogs with more walks return larger values than dogs with fewer walks. So, it looks like the sum is probably working, only somehow returning the same value for each column.
Btw, running this in dev Rails 3.2.3, ruby 2.0, SQLite.
I don't think that summing datetimes is a good idea. What you need is to calculate the duration of each single walk and sum those. You can do it in 2 ways:
1. DB-dependent, but more efficient:
# sqlite in dev and test modes
sql = "strftime('%s',end_time) - strftime('%s',start_time)" if !Rails.env.production?
# production with postgres
sql = "extract(epoch from end_time - start_time)" if Rails.env.production?
total = dog.walks.sum(sql)
2. DB-agnostic, but less efficient in the case of hundreds of records for each dog:
total = dog.walks.all.inject(0) {|tot,w| tot+=w.end_time-w.start_time}
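If you go with option 1, the environment check could live in one place, e.g. a model method along these lines (untested sketch; the method name is just for illustration, and it assumes PostgreSQL in production and SQLite elsewhere, as above):
class Dog < ActiveRecord::Base
  has_many :walks

  # total seconds spent walking, summed in the database
  def total_walk_seconds
    sql = if Rails.env.production?
            "extract(epoch from end_time - start_time)"              # PostgreSQL
          else
            "strftime('%s', end_time) - strftime('%s', start_time)"  # SQLite (dev/test)
          end
    walks.sum(sql)
  end
end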
I don't know how sqlite handles datetimes and operations on this data type, but while playing in the sqlite console I noticed that I could get reliable results when converting datetimes to seconds.
I would write it like:
dog.walks.sum("strftime('%s', end_time) - strftime('%s', start_time)")
Query should look like:
select sum(strftime('%s', end_time) - strftime('%s', start_time)) from walks;
I have a rails 4 (ruby 2) app that tracks time for employees against various companies. I need to get a sum of the minutes per company per date. My problem is I'm not sure the best way to pad date/company pairs with 0 if there are no time entries for that company on that day.
Tables
Companies:    id, name, ...
Time_Entries: id, created_at, company_id, minutes, ...
Current output given only 2 companies and 2 days,
[{"company_id":1,"company_name":"Company A","date":"2013-06-24","minutes":987},
{"company_id":1,"company_name":"Company A","date":"2013-06-25","minutes":5},
{"company_id":2,"company_name":"Company B","date":"2013-06-24","minutes":500}]
Expected output: days that aren't recorded should be padded with 0s, so there is an additional item in the list (the last item below is the new one).
[{"company_id":1,"company_name":"Company A","date":"2013-06-24","minutes":987},
{"company_id":1,"company_name":"Company A","date":"2013-06-25","minutes":5},
{"company_id":2,"company_name":"Company B","date":"2013-06-24","minutes":500},
{"company_id":2,"company_name":"Company B","date":"2013-06-25","minutes":0}]
Current Query (PostgreSQL)
#minutes = TimeEntry.where("created_at >= ?", 1.week.ago.utc)
.group('companies.id, date(created_at)')
.joins(:company)
.select("companies.id as company_id", "companies.name as company_name", "date(created_at)", "SUM(minutes) as minutes")
.order("date ASC")
I'm not sure of the best way to go about this. I can think of a couple of options:
A 3-deep loop that loops through days, then a loop through companies, then a loop through the found results to add any day/company pairs that have not already been added.
Do a left join on a generate_series() for a date range in PostgreSQL and coalesce null sums to 0 (see the sketch after this list), but I don't think that will get me all the way
Some unknown, more elegant option
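For reference, this is roughly what I mean by option 2 (untested; the date bounds are hard-coded to the two example days above, and I haven't worked out how to fold it back into the ActiveRecord query):
select c.id                        as company_id,
       c.name                      as company_name,
       d.day::date                 as date,
       coalesce(sum(t.minutes), 0) as minutes
from companies c
cross join generate_series(timestamp '2013-06-24',
                           timestamp '2013-06-25',
                           interval '1 day') as d(day)
left join time_entries t
       on t.company_id = c.id
      and date(t.created_at) = d.day::date
group by c.id, c.name, d.day
order by c.id asc, d.day asc;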
I have written a few simple Rails applications, accessing the database via the ActiveRecord abstraction, so I am afraid that I don't know much about the inner workings of the PostgreSQL engine.
However, I am writing a Rails app that needs to support 100000+ rows with dynamically updated content, and I am wondering whether I am using the random functions efficiently:
Database migration schema setting:
t.float :attribute1
t.integer :ticks
add_index :mytable, :attribute1
add_index :mytable, :ticks
Basically, I want to have the following random function distribution:
a) a row in the top 10% by attribute1 = 30% chance of being selected
b) a row in the middle 60% (by attribute1) = 50% chance of being selected
c) a row in the lowest 30% (by attribute1) with fewer than 100 ticks = 15% chance of being selected,
d) and a row in the lowest 30% (by attribute1) with at least X ticks (use X = 100 in this question) = 5% chance of being selected.
At the moment I have the following code:
@rand = rand()
if @rand > 0.7
  @randRowNum = Integer(rand * 0.1 * Mytable.count)
  @row = Mytable.find(:first, :offset => @randRowNum, :order => '"attribute1" DESC')
else
  if @rand > 0.2
    @randRowNum = Integer((rand * 0.6 + 0.1) * Mytable.count)
    @row = Mytable.find(:first, :offset => @randRowNum, :order => '"attribute1" DESC')
  else
    @row = Mytable.find(:first, :offset => Integer(0.7 * Mytable.count), :order => '"attribute1" DESC')
    if !@row.nil?
      if @rand > 0.05
        @row = Mytable.find(:first, :order => 'RANDOM()',
                            :conditions => ['"attribute1" <= ? AND "ticks" < 100', @row.attribute1])
      else
        @row = Mytable.find(:first, :order => 'RANDOM()',
                            :conditions => ['"attribute1" <= ? AND "ticks" >= 100', @row.attribute1])
      end
    end
  end
end
1) One thing I would like to do is avoid the use of :order => 'RANDOM()', as according to my research it seems that each call involves the SQL engine scanning through all the rows and assigning each a random value. Hence the use of @randRowNum = Integer(rand * 0.1 * Mytable.count) and an offset of @randRowNum for a) and b). Am I actually improving the efficiency? Is there any better way?
2) Should I be doing the same as 1) for c) and d)? And by using :conditions => ['"attribute1" <= ? AND "ticks" >= 100', @row.attribute1], am I forcing the SQL engine to scan through all the rows anyway? Is there anything apart from indexing that can improve the efficiency of this call (with as little space/storage overhead as possible too)?
3) There is a chance that the total number of entries in Mytable may have changed between the Mytable.count and Mytable.find calls. I could put the two calls within a transaction, but it seems excessive to lock the entire table just for a read operation (at the moment, I have extra code to fall back to a simple random row selection if I get a nil @row from the above code). Is it possible to move that .count call into a single atomic SQL query in Rails? Or would it have the same efficiency as locking via a transaction in Rails?
4) I have also been reading up on stored procedures in PostgreSQL... but for this particular case, is there any efficiency gain worth moving the code from the ActiveRecord abstraction to a stored procedure?
Many thanks!
P.S. development/deployment environment:
Ruby 1.8.7
Rails 2.3.14
PostgreSQL 8 (on Heroku)
Your question seems a bit vague, so correct me if my interpretation is wrong.
If you didn't split (c) and (d), I would just convert the uniform random variable over [0,1) to a biased random variable over [0,1) and use that to select your row offset.
Unfortunately, (c) and (d) is split based on the value of "ticks" — a different column. That's the hard part, and also makes the query much less efficient.
After you fetch the value of attribute1 at 70% from the bottom, also fetch the number of (c) rows; something like SELECT COUNT(*) FROM foo WHERE attribute1 <= partition_30 AND ticks < 100. Then use that to find the offset into either the ticks < 100 or the ticks >= 100 case. (You probably want an index on (attribute1, ticks) or something; which order is best depends on your data.)
If the "ticks" threshold is known in advance and doesn't need to change often, you can cache it in a column (BOOL ticks_above_threshold or whatever) which makes the query much more efficient if you have an index on (ticks_above_threshold, attribute1) (note the reversal). Of course, every time you change the threshold you need to write to every row.
(I think you can also use a "materialized view" to avoid cluttering the main table with an extra column, but I'm not sure what the difference in efficiency is.)
There are obviously some efficiency gains possible by using stored procedures. I wouldn't worry about it too much, unless latency to the server is particularly high.
EDIT:
To answer your additional questions:
Indexing (BOOL ticks_above_threshold, attribute1) should work better than (ticks, attribute1) (I may have the order wrong, though) because it lets you compare the first index column for equality. This is important.
Indexes generally use some sort of balanced tree to do a lookup on what is effectively a list. For example, take A4 B2 C3 D1 (ordered letter,number) and look up "number of rows with letter greater than B and number greater than 2". The best you can do is start after the Bs and iterate over the whole table. If you order by number,letter, you get 1D 2B 3C 4A. Again, start after the 2s.
If instead your index is on is_greater_than_2,letter, it looks like false,B false,D true,A true,C. You can ask the index for the position of (true,B) (between true,A and true,C) and count the number of entries until the end. Counting the number of items between two index positions is fast.
Google App Engine's Restrictions on Queries goes one step further:
Inequality Filters Are Allowed on One Property Only
A query may only use inequality filters (<, <=, >=, >, !=) on one property across all of its filters.
...
The query mechanism relies on all results for a query to be adjacent to one another in the index table, to avoid having to scan the entire table for results. A single index table cannot represent multiple inequality filters on multiple properties while maintaining that all results are consecutive in the table.
If none of your other queries benefit from an index on ticks, then yes.
In some cases, it might be faster to index a instead of a,b (or b,a) if the clause including b is almost always true and you're fetching row data (not just getting a COUNT()) (i.e. if 1000 <= a AND a <= 1010 matches a million rows and b > 100 only fails for two rows, then it might end up being faster to do two extra row lookups than to work with the bigger index).
As long as rows aren't being removed between the call to count() and the call to find(), I wouldn't worry about transactions. Definitely get rid of all calls ordering by RANDOM() as there is no way to optimize it. Make sure that attribute1 has an index on it. I haven't tested it, but something like this should be pretty quick:
total_rows = Mytable.count
r = rand()
if r > 0.7     # top 10% of attribute1 (offsets 90-100%)
  lower = total_rows * 0.9
  upper = total_rows
elsif r > 0.2  # middle 60% (offsets 30-90%)
  lower = total_rows * 0.3
  upper = total_rows * 0.9
else           # lowest 30% (offsets 0-30%)
  lower = 0
  upper = total_rows * 0.3
end
offset = [lower + (upper - lower) * rand(), total_rows - 1].min.to_i
@row = Mytable.find(:first, :offset => offset, :order => 'attribute1 ASC')