I'm trying to load a Dask dataframe from a SQL connection. Per the read_sql_table documentation, it is necessary to pass in an index_col. What should I do if there's a possibility that there are no good columns to act as index?
Could this be a suitable replacement?
# Break SQL Query into chunks
chunks = []
num_chunks = math.ceil(num_records / chunk_size)
# Run query for each chunk on Dask workers
for i in range(num_chunks):
query = 'SELECT * FROM ' + table + ' LIMIT ' + str(i * chunk_size) + ',' + str(chunk_size)
chunk = dask.delayed(pd.read_sql)(query, sql_uri)
chunks.append(chunk)
# Aggregate chunks
df = dd.from_delayed(chunks)
dfs[table] = df
Unfortunately, LIMIT/OFFSET is not in general a reliable way to partition a query in most SQL implementations. In particular, it is often the case that, to get to an offset and fetch later rows from a query, the engine must first parse through earlier rows, and thus the work to generate a number of partitions is much magnified. In some cases, you might even end up with missed or duplicated rows.
This was the reasoning behind requiring boundary values in the dask sql implementation.
However, there is nothing in principle wrong with the way you are setting up your dask dataframe. If you can show that your server does not suffer from the problems we were anticipating, then you are welcome to take that approach.
I'm trying to display a table that counts webhooks and arranges the various counts into cells by date_sent, sending_ip, and esp (email service provider). Within each cell, the controller needs to count the webhooks that are labelled with the "opened" event, and the "sent" event. Our database currently includes several million webhooks, and adds at least 100k per day. Already this process takes so long that running this index method is practically useless.
I was hoping that Rails could break down the enormous model into smaller lists using a line like this:
#today_hooks = #m_webhooks.where(:date_sent => this_date)
I thought that the queries after this line would only look at the partial list, instead of the full model. Unfortunately, running this index method generates hundreds of SQL statements, and they all look like this:
SELECT COUNT(*) FROM "m_webhooks" WHERE "m_webhooks"."date_sent" = $1 AND "m_webhooks"."sending_ip" = $2 AND (m_webhooks.esp LIKE 'hotmail') AND (m_webhooks.event LIKE 'sent')
This appears that the "date_sent" attribute is included in all of the queries, which implies that the SQL is searching through all 1M records with every single query.
I've read over a dozen articles about increasing performance in Rails queries, but none of the tips that I've found there have reduced the time it takes to complete this method. Thank you in advance for any insight.
m_webhooks.controller.rb
def index
def set_sub_count_hash(thip) {
gmail_hooks: {opened: a = thip.gmail.send(#event).size, total_sent: b = thip.gmail.sent.size, perc_opened: find_perc(a, b)},
hotmail_hooks: {opened: a = thip.hotmail.send(#event).size, total_sent: b = thip.hotmail.sent.size, perc_opened: find_perc(a, b)},
yahoo_hooks: {opened: a = thip.yahoo.send(#event).size, total_sent: b = thip.yahoo.sent.size, perc_opened: find_perc(a, b)},
other_hooks: {opened: a = thip.other.send(#event).size, total_sent: b = thip.other.sent.size, perc_opened: find_perc(a, b)},
}
end
#m_webhooks = MWebhook.select("date_sent", "sending_ip", "esp", "event", "email").all
#event = params[:event] || "unique_opened"
#m_list_of_ips = [#List of three ip addresses]
end_date = Date.today
start_date = Date.today - 10.days
date_range = (end_date - start_date).to_i
#count_array = []
date_range.times do |n|
this_date = end_date - n.days
#today_hooks = #m_webhooks.where(:date_sent => this_date)
#count_array[n] = {:this_date => this_date}
#m_list_of_ips.each_with_index do |ip, index|
thip = #today_hooks.where(:sending_ip => ip) #Stands for "Today Hooks ip"
#count_array[n][index] = set_sub_count_hash(thip)
end
end
Well, your problem is very simple, actually. You gotta remember that when you use where(condition), the query is not straight executed in the DB.
Rails is smart enough to detect when you need a concrete result (a list, an object, or a count or #size like in your case) and chain your queries while you don't need one. In your code, you keep chaining conditions to the main query inside a loop (date_range). And it gets worse, you start another loop inside this one adding conditions to each query created in the first loop.
Then you pass the query (not concrete yet, it was not yet executed and does not have results!) to the method set_sub_count_hash which goes on to call the same query many times.
Therefore you have something like:
10(date_range) * 3(ip list) * 8 # (times the query is materialized in the #set_sub_count method)
and then you have a problem.
What you want to do is to do the whole query at once and group it by date, ip and email. You should have a hash structure after that, which you would pass to the #set_sub_count method and do some ruby gymnastics to get the counts you're looking for.
I imagine the query something like:
main_query = #m_webhooks.where('date_sent > ?', 10.days.ago.to_date)
.where(sending_ip:#m_list_of_ips)
Ok, now you have one query, which is nice, but I think you should separate the query in 4 (gmail, hotmail, yahoo and other), which gives you 4 queries (the first one, the main_query, will not be executed until you call for materialized results, don forget it). Still, like 100 times faster.
I think this is the result that should be grouped, mapped and passed to #set_sub_count instead of passing the raw query and calling methods on it every time and many times. It will be a little work to do the grouping, mapping and counting for sure, but hey, it's faster. =)
In case this helps anybody else, I learned how to fill a hash with counts in a much simpler way. More importantly, this approach runs a single query (as opposed to the 240 queries that I was running before).
#count_array[esp_index][j] = MWebhook.where('date_sent > ?', start_date.to_date)
.group('date_sent', 'sending_ip', 'event', 'esp').count
I have a database with a bunch of deviceapi entries, that have a start_date and end_date (datetime in the schema) . Typically these entries no more than 20 seconds long (end_date - start_date). I have the following setup:
data = Deviceapi.all.where("start_date > ?", DateTime.now - 2.weeks)
I need to get the hour within data that had the highest number of Deviceapi entries. To make it a bit clearer, this was my latest try on it (code is approximated, don't mind typos):
runningtotal = 0
(2.weeks / 1.hour).to_i.times do |interval|
current = data.select{ |d| d.start_time > (start_date + (1.hour * (interval - 1))) }.select{ |d| d.end_time < (start_date + (1.hour * interval)) }.count
if current > runningtotal
runningtotal = current
end
The problem: this code works just fine. So did about a dozen other incarnations of it, using .where, .select, SQL queries, etc. But it is too slow. Waaaaay too slow. Because it has to loop through every hour within 2 weeks. Then this method might need to be called itself dozens of times.
There has to be a faster way to do this, maybe a sort? I'm stumped, and I've been searching for hours with no luck. Any ideas?
To get adequate performance, you'll want to do everything in a single query, which will mean avoiding ActiveRecord functionality and doing a raw query (e.g. via ActiveRecord::Base.connection.execute).
I have no way to test it, since I have neither your data nor schema, but I think something along these lines will do what you are looking for:
select y.starting_hour, max(y.num_entries) as max_entries
from
(
select x.starting_hour, count(*) as num_entries
from
(
select date_trunc('hour', start_time) starting_hour
from deviceapi as d
) as x
group by x.starting_hour
) as y
where y.num_entries = max(y.num_entries);
The logic of this is as follows, from the inner-most query out:
"Bucket" each starting time to the hour
From the resulting table of buckets, get the total number of entries in each bucket
Get the maximum number of entries from that table, and then use that number to match back to get the starting_hour itself.
If there happen to be more than one bucket with the same number of entries, you could determine a consistent way to pick one -- say the min(starting_hour) or similar (since that would stay the same even as data gets added, assuming you are not deleting items).
If you wanted to limit the initial time slice -- I see 2 weeks referenced in your post -- you could do that in the inner-most query with a where clause bracketing the date range.
The following simpledb query returns 51 results:
select * from logger where time > '2011-07-29 17:45:10.540284+00:00'
This query returns 20534 results:
select * from logger where time < '2011-07-29 17:50:08.615626'
These two queries both return 0 results!!?:
select * from logger where time between '2011-07-29 17:45:10.540284+00:00' and '2011-07-29 17:50:08.615626'
select * from logger where time > '2011-07-29 17:45:10.540284+00:00' and time < '2011-07-29 17:50:08.615626'
What am I missing here?
But are any of your 51 results returned from the first query actually within the time span you are searching? If they are all later than 17:50:08.615626 then your queries are performing as expected.
I am also suspicious of the fact that you are being inconsistent in how you are representing the time. You should really be using ISO 8601 timestamps if you want consistent lexicographic matching of times with SDB.
The other option is that the queries are taking longer than the query timeout to run, are you checking for errors?
Finally - perhaps SDB is having a bad day and the query is just a bit slow - in those circumstances you can find you get 0 results but DO get a next token - and the actual results follow in the next batch.
Does any of that help?
I have a slightly complex time arithmetic problem.
I have a reminder system where the user can set "how many x before event" duration. For example: If I set '5 minutes' - I need to get reminder before 5 minutes of the event schedule.
In my reminder system, I have a cron which runs every minute and sends reminder mails. So far so good. I want to find all calendar events which are eligible for reminder (calendar entry whose scheduled time is between "5.minutes.from_now and 6.minutes.from_now"
I am trying the write the following where clause :
conds = "'when' >= '#{eval("#{cal.remind_before.to_s}.#{cal.remind_before_what.downcase}.from_now").to_s(:db)}' AND 'when' < '#{eval("#{cal.remind_before.to_s}.#{cal.remind_before_what.downcase}.from_now + 1.minutes").to_s(:db)}'"
#mail_calendar_for_reminder= Calendar.find(:all, :conditions=> conds)
Here cal.reminder_before = '5', cal.remind_before_what.downcase='minutes'
so the eval would be evaluating (5.minutes.from_now) and (6.minutes.from_now)
The resulting SQL statement is :
SELECT "calendars".* FROM "calendars" WHERE ('when' >= '2011-01-11 14:44:54' AND 'when' < '2011-01-11 14:45:54')
This SQL is syntactically and logically correct because it gets a time range of 5.minutes.from_now and 6.minutes.from_now. But it is not selecting eligible records. I suspect two things:
1. The SQL above is doing string comparisons rather than time comparisons.
2. The database entry for calendar's scheduled time has the following format :
2011-01-11 14:45:09.000000 --the 0's the end might be messing teh date comparisons.
I tried almost all sorts of date range arithmetic but could not get the eligible records in this query.
Depending on your server and its load, a one-minute window for cron might be a little optimistic.
What happens if you login to the dbms server and execute that SQL statement? Any rows returned? Any error messages?
You can try an explicit type cast. So
'when' >= CAST('2011-01-11 14:44:54' AS DATETIME) ...
Your dbms might require a different syntax for type casting and conversion. Search your docs.
Are your column names case sensitive? Is the column 'when' or 'When'? (Or wHen?)
This query returns your test event. Note the double quotes around the column name.
SELECT "calendars".*
FROM "calendars"
WHERE ("when" >= '2011-01-10 15:56'
AND "when" < '2011-01-10 15:57')