I'm performing a query using an sqlite db where I pull out a quite large data set of call records from a database. On the same page I want to show the breakdown of counts per day on the call records, so I perform about 30 count queries on the database.
Is there a way I can filter the set that I retrieve initially and perform the counts on the in memory set, so I don't have to run those continuous queries? I need those counts for graphing and display purposes but even with an index on date, it takes about 10 seconds to run the initial query plus all of the count queries.
What I'm basically asking is there a way to perform the counts on the records returned or perform analysis on it, or is there a smarter way to cache this data?
#set = Record.get_records_for_range(date1, date2)
while date1 < date2
#count = Record.count_records_for_date(date1)
date1 = date1 + 1
end
is basically what I'm doing. Surely there's a simpler and faster way?
Using #set.length will get you the count of the in memory set without querying the database because it is performed by ruby not active record (like .count is)
Read about it here https://batsov.com/articles/2014/02/17/the-elements-of-style-in-ruby-number-13-length-vs-size-vs-count/
Here is a quote pulled out of that article
length is a method that’s not part of Enumerable - it’s part of a concrete class (like String or Array) and it’s usually running in O(1) (constant) time. That’s as fast as it gets, which means that using it is probably a good idea.
Related
My data model has a ClickerRecord entity with 2 attributes: date (NSDate) and numberOfBiscuits (NSNumber). Every time a new record is added, a different value for numberOfBiscuits can be entered.
To calculate a daily average for the number of biscuits I'm currently doing a fetch request for each day within range and using the corresponding NSExpression to calculate the sum of all numberOfBiscuits values for that day.
The problem: I'm using asynchronous fetch requests to avoid blocking the main thread, so it ends up being quite slow when there are many days between the first and last record. The fetch requests are performed one after another.
I could also load all records into memory and perform the sorting and calculations, but I'm worried that it could become an issue when the number of records becomes very large.
Therefore, my question: Is it possible to use NSExpressions to add something like sub-predicates for each date interval, in order to do a single fetch request and retrieve a dictionary with an entry for each daily sum of numberOfBiscuits?
If not, what would be the recommended approach for this situation?
I've read about subqueries but as far as I've understood they're not intended for this kind of use.
This is the first question I'm asking on SO, so I hope to have written it in a clear way :)
I think what you are looking for is the propertiesToGroupBy (see the Apple Docs) for the NSFetchRequest, though in your case it is not straight forward to implement, for reasons I will discuss later.
Suppose you could specify the category of biscuit consumed on each occasion, and this is stored in a category attribute of your entity. Then to obtain the total number of biscuits of each category (ignoring the date), you could use an NSExpression using #sum and specify:
fetch.propertiesToGroupBy = ["category"]
CoreData will then group the results of the fetch by the category and will calculate the sum for each group separately.
The problem in your case is that (unless you already strip out the time information from your date attribute), there is no attribute that represents the date interval that you want to group by, and CoreData will not let you specify a computed value to group by. You would need to add a new day attribute to your entity, and calculate that whenever you add/update a record, and specify it in the group by. And you face the same problem again if you subsequently want to calculate your average over a different interval - weeks or months for example. One other downside to this is that the results will only include days for which there are ClickerRecords: if the user has a day where they consume no biscuits, then the fetch will not show a result for that day (ie it will not infer an average of 0). You would need to handle this appropriately when using the results.
It might be better either to tune your asynchronous fetch or, as you suggest, just to read the whole lot into memory to perform the calculations. If your entity only has those two attributes, and assuming your users don't live entirely on biscuits, the volumes should not be too problematic.
I have a financial app that generates dynamic data from the database. Projected daily revenue numbers, for example, are generated and outputted as SQL records (PG:Result in PostgreSQL 9.4). How do I instantiate or build thousands of ActiveRecord objects quickly? To loop through each SQL record is taking too long.
Note that this is not a database issue, because I do NOT need to save to the database (in which case a mass insertion SQL statement would work). These numbers are dynamic so I don't have to create and save objects. The reason I want to build ActiveRecord objects is so that I can easily sort and filter through the numbers using AR methods such as WHERE and FIND (for example, sorting the objects by sheet ID and/or date). I noticed that keeping it as PG:Result saves time, but sorting through the results (which is just an array of hashes) requires messy code. A sample object, Flow, is below:
Flow.new
value_subunit: 10,
sheet_id: 1,
is_actual: true,
period_start: '2015-01-01'
end
The SQL query output looks like the following:
sheet_id value_subunit is_actual period_start
----------------------------------------------
1 10 1 '2015-01-01'
2 20 0 '2015-02-01'
Try activerecord-import. This blog talks about it: https://blog.codeship.com/speed-up-activerecord/
Haven't used this before, but stumbled upon this a week or two ago...
https://github.com/jamis/bulk_insert
Intro
I'm doing a system where I have a very simple layout only consisting of transactions (with basic CRUD). Each transaction has a date, a type, a debit amount (minus) and a credit amount (plus). Think of an online banking statement and that's pretty much it.
The issue I'm having is keeping my controller skinny and worrying about possibly over-querying the database.
A Simple Report Example
The total debit over the chosen period e.g. SUM(debit) as total_debit
The total credit over the chosen period e.g. SUM(credit) as total_credit
The overall total e.g. total_credit - total_debit
The report must allow a dynamic date range e.g. where(date BETWEEN 'x' and 'y')
The date range would never be more than a year and will only be a max of say 1000 transactions/rows at a time
So in the controller I create:
def report
#d = Transaction.select("SUM(debit) as total_debit").where("date BETWEEN 'x' AND 'y'")
#c = Transaction.select("SUM(credit) as total_credit").where("date BETWEEN 'x' AND 'y'")
#t = #c.credit_total - #d.debit_total
end
Additional Question Info
My actual report has closer to 6 or 7 database queries (e.g. pulling out the total credit/debit as per type == 1 or type == 2 etc) and has many more calculations e.g totalling up certain credit/debit types and then adding and removing these totals off other totals.
I'm trying my best to adhere to 'skinny model, fat controller' but am having issues with the amount of variables my controller needs to pass to the view. Rails has seemed very straightforward up until the point where you create variables to pass to the view. I don't see how else you do it apart from putting the variable creating line into the controller and making it 'skinnier' by putting some query bits and pieces into the model.
Is there something I'm missing where you create variables in the model and then have the controller pass those to the view?
A more idiomatic way of writing your query in Activerecord would probably be something like:
class Transaction < ActiveRecord::Base
def self.within(start_date, end_date)
where(:date => start_date..end_date)
end
def self.total_credit
sum(:credit)
end
def self.total_debit
sum(:debit)
end
end
This would mean issuing 3 queries in your controller, which should not be a big deal if you create database indices, and limit the number of transactions as well as the time range to a sensible amount:
#transactions = Transaction.within(start_date, end_date)
#total = #transaction.total_credit - #transaction.total_debit
Finally, you could also use Ruby's Enumerable#reduce method to compute your total by directly traversing the list of transactions retrieved from the database.
#total = #transactions.reduce(0) { |memo, t| memo + (t.credit - t.debit) }
For very small datasets this might result in faster performance, as you would hit the database only once. However, I reckon the first approach is preferable, and it will certainly deliver better performance when the number of records in your db starts to increase
I'm putting in params[:year_start]/params[:year_end] for x and y, is that safe to do?
You should never embed params[:anything] directly in a query string. Instead use this form:
where("date BETWEEN ? AND ?", params[:year_start], params[:year_end])
My actual report probably has closer to 5 database calls and then 6 or 7 calculations on those variables, should I just be querying the date range once and then doing all the work on the array/hash etc?
This is a little subjective but I'll give you my opinion. Typically it's easier to scale the application layer than the database layer. Are you currently having performance issues with the database? If so, consider moving the logic to Ruby and adding more resources to your application server. If not, maybe it's too soon to worry about this.
I'm really not seeing how I would get the majority of the work/calculations into the model, I understand scopes but how would you put the date range into a scope and still utilise GET params?
Have you seen has_scope? This is a great gem that lets you define scopes in your models and have them automatically get applied to controller actions. I generally use this for filtering/searching, but it seems like you might have a good use case for it.
If you could give an example on creating an array via a broad database call and then doing various calculations on that array and then passing those variables to the template that would be awesome.
This is not a great fit for Stack Overflow and it's really not far from what you would be doing in a standard Rails application. I would read the Rails guide and a Ruby book and it won't be too hard to figure out.
I have the following code for
h2.each {|k, v|
#count += 1
puts #count
sq.each do |word|
if Wordsdoc.find_by_docid(k).tf.include?(word)
sum += Wordsdoc.find_by_docid(k).tf[word] * #s[word]
end
end
rec_hash[k] = sum
sum = 0
}
h2 -> is a hash that contain ids of documents, the hash contains more than a 1000 of these
Wordsdoc -> is a model/table in my database...
sq -> is a hash that contain around 10 words
What i'm doing is i'm going through each of the document ids and then for each word in sq i look up in the Wordsdoc table if the word exists (Wordsdoc.find_by_docid(k).tf.include?(word) , here tf is a hash of {word => value}
and if it does I get the value of that word in Wordsdoc and multiple it with the value of the word in #s which is also a hash of {word = > value}
This seems to be running very slow. Tt processe one document per second. Is there a way to process this faster?
thanks really appreciate your help on this!
You do a lot of duplicate querying. While ActiveRecord can do some caching in the background to speed things up, there is a limit to what it can do, and there is no reason to make things harder for it.
The most obvious cause for slowdown is the Wordsdoc.find_by_docid(k). For each value of k, you call it 10 times, and each time you call it there is a possibility to call it again. That means you call that method with the same argument 10-20 times for each entry in h2. Queries to the database are expensive, since the database is on the hard disk, and accessing the hard disk is expensive in any system. You can just as easily call Wordsdoc.find_by_Docid(k) once, before you enter the sq.each loop, and store it in a variable - that would save a lot of querying and make your loop go much faster.
Another optimization - though not nearly as important as the first one - is to get all the Wordsdoc records in a single query. Almost all mid to high level(and some of the low level, too!) programming languages and libraries work better and faster when they work in bulks, and ActiveRecord is no exception. If you can query for all entries of Wordsdoc, and filter them by the docid's in h2's keys, you can turn 1000 queries(after the first optimization. Before the first optimization it was 10000-20000 queries) to a single, huge query. That will enable ActiveRerocd and the underlying database to retrieve your data in bigger chunks, and save you a lot of disc access.
There are some more minor optimization you can do, but the two I've specified should be more than enough.
You're calling Wordsdoc.find_by_docid(k) twice.
You could refactor the code to:
wordsdoc = Wordsdoc.find_by_docid(k)
if wordsdoc.tf.include?(word)
sum += wordsdoc.tf[word] * #s[word]
end
...but still it will be ugly and inefficient.
You should prefetch all records in batches, see: https://makandracards.com/makandra/1181-use-find_in_batches-to-process-many-records-without-tearing-down-the-server
For example something like that should be much more efficient:
Wordsdoc.find_in_batches(:conditions => {:docid => array_of_doc_ids}).each do |wordsdoc|
if wordsdoc.tf.include?(word)
sum += wordsdoc.tf[word] * #s[word]
end
end
Also you can retrieve only certain columns from Wordsdoc table using for example :select => :tf in find_in_batches method.
As you have a lot going on I'm just going to offer you up to things to check out.
A book called Eloquent Ruby deals with Documents and iterating through documents to count the number of times a word was used. All his examples are about a Document system he was maintaining and so it could even tackle other problems for you.
inject is a method that could speed up what you're looking to do for the sum part, maybe.
Delayed Job the whole thing if you are doing this async-ly. meaning if this is a web app, you must be timing out if you're waiting a 1000 seconds for this job to complete before it shows it's answers on the screen.
Go get em.
I have some table:
online | offline
10:32 | 11:06
12:28 | 13:04
14:07 | NULL
As you can see, - user is not offline at the time (assume that now is 14:15). And I need to get a duration of all user onlines. If user is not currently offline - I need to get difference Time.now - online.
In my ruby-code I create function:
def offline_datetime
if offline.nil? then Time.current else offline end
end
It's ok, but if I have many records it will be so slow... How can I do it through database?
I don't think it will make any difference. Even with hundreds of records, what you additionally have is one additional test on nil per record, and only one computation of 'Time.now'. Compared to the generation of the model objects, that is nearly nothing. You could compute the differences on storing, and then let the database compute the sum. But I would only do it if necessary (don't tune performance if it is not broken).
EDIT
If you have to compute the sum of the online times many times, you should compute the online time upfront when storing the offline timestamp:
Add by a migration the attribute diff as integer for the difference in minutes.
When storing your offline timestamp, compute as well the difference and store it together.
Add the sums by calling: Model.sum("diff")
Add to that the only record (there could be only one in your example) as #davidb has written in his solution.
I think in that position it would be the easyes solution to do two queries. One exluding datasets where offline is NULL and one where only these are grabbed. Then you can use the CURRENT_TIMESTAMP variable (It always contains the timestmap for actual time) of MySQL like this:
Model.where("offline IS NULL").sum("CURRENT_TIMESTAMP-online")
then you can add the result of the first and secound query. Maybe made method function out of it:
def self.your_method_name_here
secounds=0
secounds+=Model.where("offline IS NOT NULL").sum("offline-online").to_i
secounds+=Model.where("ofline IS NULL").sum("CURRENT_TIMESTAMP-online").to_i
secounds
end
It might be possible to do this in one query but I think this is quit practable!