Ruby loop to create an array of DISTINCT counts - ruby-on-rails

I'm trying to create an array of counts per day. I want each count to include only distinct uids, where a uid is counted once overall, on the day it first appears (the set of uids already seen shouldn't be reset each day).
Before, I had:
@unique_count_array_by_day = []
15.times { |i|
  bar = Model.select("DISTINCT(uid)").where(:created_at => (Time.now.beginning_of_day - i.days)..(Time.now.beginning_of_day - (i-1).days)).count()
  @unique_count_array_by_day << bar
}
This wasn't giving me distinct uid's overall, it was giving me the count of unique uid's within a day. So I pulled the code selecting the distinct uid's out of the loop:
@unique_count_array_by_day = []
foo = Model.select("DISTINCT(uid)")
15.times { |i|
  bar = foo.where(:created_at => (Time.now.beginning_of_day - i.days)..(Time.now.beginning_of_day - (i-1).days)).count()
  @unique_count_array_by_day << bar
}
However, this still produces a count of distinct uid's per day instead of distinct uid's on their first occurrence in the data table.
Any thoughts on how to finagle this?

If you just want a list of distinct ID's you should just remove the loop:
@unique_uids = Model.select("DISTINCT(uid)").all
If you want to get the date that a uid first occurs, you could do something like this:
@unique_uids_with_first_dates = Model.select('uid, MIN(created_at) AS first_created_at').group('uid')
(untested, so not sure if that works as-is, but that's basically the way to do it)
Not sure if that totally answers your question, I was a little confused by "overall distincts"
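One way the first-occurrence idea could be turned into the per-day array from the question is sketched below in plain Ruby. It assumes the grouped query has already been loaded into a hash that maps each uid to the date it first appears; the `first_seen` sample data and the `new_uid_counts_by_day` helper are hypothetical, made up for illustration.

```ruby
require 'date'

# Hypothetical result of a "MIN(created_at) ... GROUP BY uid" query:
# each uid mapped to the date on which it first appears.
first_seen = {
  'a' => Date.new(2013, 5, 1),
  'b' => Date.new(2013, 5, 1),
  'c' => Date.new(2013, 5, 3)
}

# Count how many uids appear for the first time on each day in the range.
# Days without any new uid are kept with a count of 0.
def new_uid_counts_by_day(first_seen, from, to)
  counts = Hash.new(0)
  first_seen.each_value { |date| counts[date] += 1 }
  (from..to).map { |day| counts[day] }
end

unique_count_array_by_day =
  new_uid_counts_by_day(first_seen, Date.new(2013, 5, 1), Date.new(2013, 5, 4))
```

Because each uid contributes only to the day of its earliest record, a uid that shows up again on later days is not counted twice.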

Related

I need advice in speeding up this rails method that involves many queries

I'm trying to display a table that counts webhooks and arranges the various counts into cells by date_sent, sending_ip, and esp (email service provider). Within each cell, the controller needs to count the webhooks that are labelled with the "opened" event, and the "sent" event. Our database currently includes several million webhooks, and adds at least 100k per day. Already this process takes so long that running this index method is practically useless.
I was hoping that Rails could break down the enormous model into smaller lists using a line like this:
@today_hooks = @m_webhooks.where(:date_sent => this_date)
I thought that the queries after this line would only look at the partial list, instead of the full model. Unfortunately, running this index method generates hundreds of SQL statements, and they all look like this:
SELECT COUNT(*) FROM "m_webhooks" WHERE "m_webhooks"."date_sent" = $1 AND "m_webhooks"."sending_ip" = $2 AND (m_webhooks.esp LIKE 'hotmail') AND (m_webhooks.event LIKE 'sent')
It appears that the "date_sent" attribute is included in every one of the queries, which implies that each query searches through all 1M records again.
I've read over a dozen articles about increasing performance in Rails queries, but none of the tips that I've found there have reduced the time it takes to complete this method. Thank you in advance for any insight.
m_webhooks.controller.rb
def index
  # Builds the opened/sent counts per ESP for one relation.
  # Note: every .size call below runs its own COUNT query.
  def set_sub_count_hash(thip)
    {
      gmail_hooks: {opened: a = thip.gmail.send(@event).size, total_sent: b = thip.gmail.sent.size, perc_opened: find_perc(a, b)},
      hotmail_hooks: {opened: a = thip.hotmail.send(@event).size, total_sent: b = thip.hotmail.sent.size, perc_opened: find_perc(a, b)},
      yahoo_hooks: {opened: a = thip.yahoo.send(@event).size, total_sent: b = thip.yahoo.sent.size, perc_opened: find_perc(a, b)},
      other_hooks: {opened: a = thip.other.send(@event).size, total_sent: b = thip.other.sent.size, perc_opened: find_perc(a, b)},
    }
  end
  @m_webhooks = MWebhook.select("date_sent", "sending_ip", "esp", "event", "email").all
  @event = params[:event] || "unique_opened"
  @m_list_of_ips = [
    # List of three ip addresses
  ]
  end_date = Date.today
  start_date = Date.today - 10.days
  date_range = (end_date - start_date).to_i
  @count_array = []
  date_range.times do |n|
    this_date = end_date - n.days
    @today_hooks = @m_webhooks.where(:date_sent => this_date)
    @count_array[n] = {:this_date => this_date}
    @m_list_of_ips.each_with_index do |ip, index|
      thip = @today_hooks.where(:sending_ip => ip) # Stands for "Today Hooks ip"
      @count_array[n][index] = set_sub_count_hash(thip)
    end
  end
end
Well, your problem is actually very simple. You have to remember that when you use where(condition), the query is not executed in the DB right away.
Rails is smart enough to detect when you need a concrete result (a list, an object, or a count or #size, as in your case), and until then it just keeps chaining your conditions. In your code, you keep chaining conditions onto the main query inside a loop (date_range). It gets worse: inside that loop you start another one that adds conditions to each query created in the first loop.
Then you pass the query (not concrete yet: it has not been executed and has no results!) to the method set_sub_count_hash, which goes on to run it many times.
Therefore you have something like:
10 (date_range) * 3 (ip list) * 8 (times the query is materialized in the #set_sub_count_hash method) = 240 queries
and then you have a problem.
What you want is to run the whole query at once and group it by date, ip and email. That gives you a hash structure, which you would pass to the #set_sub_count_hash method, and then do some Ruby gymnastics to get the counts you're looking for.
I imagine the query something like:
main_query = @m_webhooks.where('date_sent > ?', 10.days.ago.to_date)
  .where(sending_ip: @m_list_of_ips)
OK, now you have one query, which is nice, but I think you should split it into 4 (gmail, hotmail, yahoo and other), which gives you 4 queries (the first one, main_query, will not be executed until you ask for materialized results, don't forget that). Still, that should be something like 100 times faster.
I think this is the result that should be grouped, mapped and passed to #set_sub_count_hash, instead of passing the raw query and calling methods on it over and over. It will take a little work to do the grouping, mapping and counting, for sure, but hey, it's faster. =)
In case this helps anybody else, I learned how to fill a hash with counts in a much simpler way. More importantly, this approach runs a single query (as opposed to the 240 queries that I was running before).
@count_array[esp_index][j] = MWebhook.where('date_sent > ?', start_date.to_date)
  .group('date_sent', 'sending_ip', 'event', 'esp').count
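For readers wondering how to use such a result: a grouped count in ActiveRecord comes back as a hash whose keys are arrays of the grouped column values. The sketch below is plain Ruby and uses a hand-written hash in that shape to show how one table cell might be read out without any further queries. The sample rows, the `cell_counts` helper and the 'opened' event name are hypothetical, and dates are written as strings only for brevity.

```ruby
# Hypothetical stand-in for the hash returned by
# .group('date_sent', 'sending_ip', 'event', 'esp').count
grouped = {
  ['2015-06-01', '10.0.0.1', 'sent',   'gmail']   => 200,
  ['2015-06-01', '10.0.0.1', 'opened', 'gmail']   => 50,
  ['2015-06-01', '10.0.0.1', 'sent',   'hotmail'] => 80
}

# Look up the counts for one table cell; missing combinations count as 0.
def cell_counts(grouped, date, ip, esp, event = 'opened')
  opened = grouped.fetch([date, ip, event, esp], 0)
  sent   = grouped.fetch([date, ip, 'sent', esp], 0)
  perc   = sent.zero? ? 0.0 : (opened * 100.0 / sent).round(1)
  { opened: opened, total_sent: sent, perc_opened: perc }
end

gmail_cell   = cell_counts(grouped, '2015-06-01', '10.0.0.1', 'gmail')
hotmail_cell = cell_counts(grouped, '2015-06-01', '10.0.0.1', 'hotmail')
```

All lookups here are in-memory hash reads, so filling every cell of the table costs no additional SQL once the single grouped query has run.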

Neo4j: Sum relationship properties where node properties equal Value A and Value B (intersection)

Basically my question is: how do I sum relationship properties where there are related nodes that have properties equal to Value A and Value B?
For example:
I have a simple DB that has the following relationship:
(site)-[:HAS_MEMBER]->(user)-[:POSTED]->(status)-[:TAGGED_WITH]->(tag)
On [:TAGGED_WITH] I have a property called "TimeSpent". I can easily SUM up all the time spent for a particular day and user by using the following query:
MATCH (user)-[:POSTED]->(updates)-[r:TAGGED_WITH]->(tags)
WHERE user.name = "Josh Barker" AND updates.date = 20141120
RETURN tags.name, SUM(r.TimeSpent) as totalTimeSpent;
This returns to me a nice table with tags and associated time spent on each. (i.e. #Meeting 4.5). However, the question arises if I want to do some advanced searches and say "Show me all the meetings for ProjectA" (i.e. #Meeting #ProjectA). Basically, I am looking for a query that I can get all of the relationships where a single status has BOTH tags (and only if it has both). Then I can SUM that number up to get a count for how many meetings I spent in #ProjectA.
How do I do this?
MATCH (updates)-[r1:TAGGED_WITH]->(tag1 {name: 'Meeting'}),
(updates)-[r2:TAGGED_WITH]->(tag2 {name: 'ProjectA'})
RETURN SUM(r1.TimeSpent + r2.TimeSpent) AS totalTimeSpent, count(updates);
This should find all updates tagged with both of those things, and sum all time spent across all of those updates.
To create a generic solution where you may want one or more tags, you could use something like this, passing in the array of tags as a parameter (and using the length of the array instead of the hard-coded 2).
MATCH (user)-[:POSTED]->(update)-[r:TAGGED_WITH]->(tag)
WHERE user.name = "Josh Barker" AND update.date = 20141120 AND tag.name IN ['Meeting', 'ProjectA']
WITH update, SUM(r.TimeSpent) AS totalTimeSpent, COLLECT(tag) AS tags
WHERE LENGTH(tags) = 2
RETURN update, totalTimeSpent
As long as tag.name is indexed, this should be fast.
Edit - Remove User constraint
MATCH (update)-[r:TAGGED_WITH]->(tag)
WHERE tag.name IN ['Meeting', 'ProjectA']
WITH update, SUM(r.TimeSpent) AS totalTimeSpent, COLLECT(tag) AS tags
WHERE LENGTH(tags) = 2
RETURN update, totalTimeSpent

Count records by date returning zero values

I am trying to find a way to output the number of records of each day between two given dates:
from = 7.days.ago.to_date
to = Date.today
total = Records.where(created_at: from..to).group("date(created_at)").count
list = total.values
The problem is that dates with no records are simply left out of the result instead of showing up as zero.
So instead of this:
list = [0,4,3,0,3,0,1]
I get this:
list = [4,3,3,1]
I think I understand why: I am querying records, not dates.
I could loop with from.upto(to) and run one query per day to build my little list (or do something else equally dirty), but is there a cleaner solution that doesn't involve lots of queries?
I know this is not a perfect solution, but it doesn't involve any extra queries. Basically, reverse-merge the result with a hash that holds a zero count for every date between from and to.
from = 7.days.ago.to_date
to = Date.today
date_hash = {}
(from..to).each{|date| date_hash[date] = 0}
total = Records.where(created_at: from..to).group("date(created_at)").count
total.reverse_merge!(date_hash)
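The same zero-filling idea can be tried without Rails. The sketch below is plain Ruby with a made-up `total` hash standing in for the grouped count. It assumes the keys are Date objects, which depends on the database adapter (some return strings, which would have to be converted with Date.parse first).

```ruby
require 'date'

from = Date.new(2014, 3, 1)
to   = Date.new(2014, 3, 7)

# Hypothetical grouped count: only days that have records show up.
total = {
  Date.new(2014, 3, 2) => 4,
  Date.new(2014, 3, 3) => 3,
  Date.new(2014, 3, 5) => 3,
  Date.new(2014, 3, 7) => 1
}

# Start from a zero for every day, then let the real counts overwrite them.
zeros = (from..to).each_with_object({}) { |date, hash| hash[date] = 0 }
list  = zeros.merge(total).values
```

Merging into the zero hash (rather than the other way round) also keeps the days in chronological order, since the key order comes from the date range.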

How to iterate over grouped results in ThinkingSphinx?

I'm having trouble figuring out how to loop over the results of a ThinkingSphinx search that has been set to group_by. I currently have the following:
search = Event.search(
{
group_by: 'category_id',
group_function: :attr
}
)
search.each_with_groupby_and_count do |event, group, count|
puts [event, group, count].join(' - ')
end
This, however, only returns one record per category. It seems like the group and count values are correct, but I only get the first Event of each category, which I would have expected to be all the events in the group. Is it possible to get an array of Hashes or similar? Furthermore, if this is possible, would the per_page option be per group?
I would expect each_with_group_and_count to iterate over something like this:
[
{group: 1, hits: [Event1, Event2], count: 2},
{group: 2, hits: [Event3], count: 1}
]
I'm afraid Sphinx's grouping functionality doesn't behave in that manner - it only returns one document (in this situation, one event) per group value.
It may be more appropriate to just sort by category_id instead, and track when it changes as you iterate over it (or use Enumerable#group_by to group all events by category_id) - keep in mind that Sphinx paginates results, so you may want to increase the default page size (with :per_page) depending on how you're using these results.
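A minimal sketch of the Enumerable#group_by suggestion, in plain Ruby. The `Event` struct and the sample records are hypothetical stand-ins for the search results; with real results the same call would run over whatever page of events Sphinx returned.

```ruby
# Hypothetical stand-in for the Event model.
Event = Struct.new(:name, :category_id)

results = [
  Event.new('Event1', 1),
  Event.new('Event2', 1),
  Event.new('Event3', 2)
]

# Build the structure asked for in the question: one entry per category,
# holding all of its events and their count.
groups = results.group_by(&:category_id).map do |category_id, events|
  { group: category_id, hits: events, count: events.size }
end
```

Note that the counts here only cover the events that were actually loaded, so they can differ from Sphinx's own group counts when the results span more than one page.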

Get random and specific items from database

This is how it should go. The table will show six values in total: three must be specific ones and the other three random ones, and of course they can't overlap. That means that if I create two separate instances of the Currencies model (which is the one in question), single out the three specific ones I need from the first, and use the second to get three random ones, I would have to exclude those three specific ones from the second instance. Example:
// instance
DateTime today = DateTime.Now.Date;
var currencies = db.Currencies.Where(c => c.DateCreated.Equals(today));
// first get the three specific ones
var currency1 = currencies.Where(c => c.Sign.Equals("EUR"));
var currency2 = currencies.Where(c => c.Sign.Equals("USD"));
var currency3 = currencies.Where(c => c.Sign.Equals("AUD"));
// second get three random ones
var randomCurrencies = db.Currencies.Where(c => c.DateCreated.Equals(today)).OrderBy(d => db.GetNewID()).Take(3);
Now, there should be a way (I think) to alter the currencies at 2nd get to use .Except but I'm not sure how to make an exception of three values. How to do this?
Source: Getting random records from a table using LINQ to SQL
var currencies = db.Currencies.Where(c => c.DateCreated.Equals(today))
.OrderBy(q => db.GetNewId())
.Take(6);
Reference:
Is there a way to select random rows?
Select N Random Records with Linq
linq: order by random
Hope this helps.
