Doing analytics on a large table in Rails / PostgreSQL

I have a "Vote" table in my database which is growing in size everyday, currently at around 100 million rows. For internal analytics / insights I used to have a rake task which would compute a few basic metrics, like the number of votes made daily in the past few days. It's just a COUNT with a where clause on the date "created_at".
This rake task was doing fine until I deleted the index on "created_at" because it seems that it had a negative impact on the app performance for all the other user-facing queries that didn't need this index, especially when inserting a new row.
Currently I don't have a lot of insights as to what is going on in my app and in this table. However I don't really want to add indexes on such a large table if it's only for my own use.
What else can I try ?

Alternately, you could sidestep the Vote table altogether and keep an external tally.
Every time a vote is cast, a separate tally class that keeps a running count of votes cast will be invoked. There will be one tally record per day. A tally record will have an integer representing the number of votes cast on that day.
Each increment call to the tally class will find a tally record for the current date (today), increment the vote count, and save the record. If no record exists, one will be created and incremented accordingly.
For example, let's have a class called VoteTally with two attributes: a date (date), and a vote count (integer), no timestamps, no associations. Here's what the model will look like:
class VoteTally < ActiveRecord::Base
  # Adds one vote to today's tally, creating the record on first use.
  def self.tally_up!
    find_or_create_by(date: Date.today).increment!(:votes)
  end

  # Removes one vote from today's tally (e.g. when a vote is destroyed).
  def self.tally_down!
    find_or_create_by(date: Date.today).decrement!(:votes)
  end

  # Returns the vote count for a date, or nil if no tally exists yet.
  def self.votes_on(date)
    find_by(date: date).try(:votes)
  end
end
Then, in the Vote model:
class Vote < ActiveRecord::Base
  after_create :tally_up
  after_destroy :tally_down

  # ...

  private

  def tally_up   ; VoteTally.tally_up!   ; end
  def tally_down ; VoteTally.tally_down! ; end
end
These methods will get vote counts:
VoteTally.votes_on Date.today
VoteTally.votes_on Date.yesterday
VoteTally.votes_on 3.days.ago
VoteTally.votes_on Date.parse("2013-05-28")
Of course, this is a simple example and you will have to adapt it to suit. This adds an extra query when a vote is cast, but it's a hell of a lot faster than a WHERE clause over 100M records with no index. Minor inaccuracies are possible with this solution, but I assume that's acceptable given the anecdotal nature of daily vote counts.
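One caveat worth flagging: increment! does a read-modify-write, so two concurrent votes can race and lose a count. Below is a hedged sketch of an atomic variant, assuming PostgreSQL 9.5+ (for ON CONFLICT) and a unique index on vote_tallies.date; it is not part of the original answer.

class VoteTally < ActiveRecord::Base
  # Hedged sketch: a race-free tally_up!, assuming PostgreSQL 9.5+
  # and a unique index on vote_tallies.date.
  def self.tally_up!
    quoted_date = connection.quote(Date.today)
    connection.execute(<<-SQL)
      INSERT INTO vote_tallies (date, votes)
      VALUES (#{quoted_date}, 1)
      ON CONFLICT (date) DO UPDATE SET votes = vote_tallies.votes + 1
    SQL
  end
end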

It's just a COUNT with a where clause on the date "created_at".
In that case the only credible index you can use is the one on created_at...
If write performance is an issue (methinks it's unlikely...) and you're using a composite primary key, clustering the table using that index might help too.

If the index really has an impact on write performance, and only a few people run statistics now and then, you might consider another general approach:
You could separate your "transaction processing database" from your "reporting database".
You would update your reporting database on a regular basis, and create the reporting-only indexes only there. What's more, report queries will not conflict with transaction-oriented traffic, and it doesn't matter how long they run.
Of course, this introduces a certain delay and increases system complexity. On the other hand, if you roll forward your reporting database on a regular basis, you can ensure that your backup scheme actually works.
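As a minimal sketch of the wiring (the "reporting" entry in config/database.yml and the model names here are my assumptions, not part of the answer):

# Hedged sketch: report-only models reading from a separate reporting DB.
# Assumes a "reporting" entry exists in config/database.yml.
class ReportingRecord < ActiveRecord::Base
  self.abstract_class = true
  establish_connection :reporting
end

class ReportedVote < ReportingRecord
  self.table_name = "votes" # same schema as production, refreshed on a schedule
end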

Related

Rails fastest way to get table records count

I have a user table in a PostgreSQL database. If I run User.count, it takes 150ms to get the result, which is too slow for us. Ideally, it should take less than 10ms to return the result. Is there any way to cache the SQL result at the model level? Something like
def self.total_count
  User.count.cached # that's my imagination
end
In my opinion, there are several ways you could go about this:
You could have another table that stores the count of the total number of users, incrementing/decrementing it there when a user is added/deleted, or at frequent time intervals.
If your table is extremely big and accuracy is not the most important thing, you can also look into Postgres' COUNT ESTIMATE query (a Ruby wrapper is sketched after this list):
SELECT reltuples AS approximate_row_count FROM pg_class WHERE relname = 'users';
You should look into counter_cache. It will work great if your User belongs_to another model: http://guides.rubyonrails.org/association_basics.html
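For the estimate option, a hedged sketch of a model-level wrapper (approximate_count is a made-up name, and reltuples is only as fresh as the last VACUUM/ANALYZE):

# app/models/user.rb -- hedged sketch, assumes PostgreSQL
def self.approximate_count
  connection.select_value(
    "SELECT reltuples::bigint FROM pg_class WHERE relname = 'users'"
  ).to_i
end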

Rails: Track Points On A Weekly Basis

In my current application, I need the ability to track points on a weekly basis so that the point totals for the user reset back to zero each week. I was planning on using the gem merit: https://github.com/tute/merit to track points.
In my user's profile I have a field that stores the points. What I have been unable to locate is how I can have Rails clear this field automatically for all users.
I have come across some information (Rails reset single column). I think this may be the answer in terms of resetting it every Sunday at a set time, but I am uncertain about that last part, and also about where the code would go (model or controller).
Also, I would welcome any suggestions if there is a better method.
You'd be better off making a Point model, which belongs_to :user.
This will allow you to add any points you want, and you can then query the table based on the created_at column to get a .count of the points for whatever timespan you want.
I can give you more info if you think it appropriate
Models
One principle we live by is to extend our models as much as possible
You want each model to hold only its data, thus ensuring more efficient db calls. I'm not super experienced with databases, but it's my opinion that having a lot of smaller models is more efficient than one huge model
So in your question, you wanted to assign some points to a user. The "right" way to do this is to store all the points perpetually; which can only be done with its own model
Points
#app/models/point.rb
class Point < ActiveRecord::Base
  belongs_to :user
end

#app/models/user.rb
class User < ActiveRecord::Base
  has_many :points
end
Points table could look like this:
points
id | user_id | value | created_at | updated_at
Saving
To save the points, you will literally just have to add extra records to the points table. The simplest way to achieve this will be to merge the params, like this:
#app/controllers/points_controller.rb
class PointsController < ApplicationController
  def new
    @points = Point.new
  end

  def create
    @points = Point.new(points_params)
    @points.save
  end

  private

  def points_params
    params.require(:points).permit(:value).merge(:user_id => current_user.id)
  end
end
You can define the "number" of points by setting in the value column (which I'd set as default 1). This will be how StackOverflow gives different numbers of points; by setting the value column differently ;)
Counting
To get weekly counts, you'll have to create some sort of scope that splits the points by week, like this:
#app/models/point.rb
def self.weekly
  where(created_at: Time.current.beginning_of_week..Time.current.end_of_week)
end
This scopes the query to the current week.
I'll sort out the function properly for you if you let me know a little more about how you'd like to record / display the weekly stats. Is it going to be operated via a cron job or something?
Based on your description, you might want to simply track the user's points and the time they got them. Then you can query for any 1-week period (or different periods if you decide you want all-time, annual, etc.) and you won't lose historical data.
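To make that concrete, here's a hedged usage sketch counting one user's points for the current week with the Point model above:

# Hedged sketch: count a user's points for the current week.
week = Time.current.beginning_of_week..Time.current.end_of_week
user.points.where(created_at: week).count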

Storing data popularity in a Ruby on Rails app

I was reading this Stack Overflow question and was wondering what is a common practice to store popularity values for data in a Ruby on Rails application?
My thinking is to have 2 models, a regular model and a popular one that has data from the regular model sorted by a popularity formula. A cronjob would populate the latter model at some specific interval.
Any thoughts?
It would depend on the specifics, but I think you can store the popularity information as a column in the model. For example, if you had Questions which you wanted to sort by popularity, you could run a migration AddPopularityToQuestions popularity:float.
You could then run a script at set intervals (e.g. with Whenever) to update the popularity value for each question. However, if there isn't that much activity, it might make more sense to update the popularity for a question whenever something happens that will change it. For example, if popularity is mainly determined by votes, you could update a question's popularity whenever there are new votes.
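For reference, a hedged sketch of the migration that answer describes (the default and the index are my additions, useful if you sort popular-first):

class AddPopularityToQuestions < ActiveRecord::Migration
  def change
    add_column :questions, :popularity, :float, default: 0.0
    add_index  :questions, :popularity # helps popular-first sorting
  end
end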
Use one rake task to update the popularity and run it with the Heroku Scheduler.
I would imagine popularity to be statistical. For example, views over the last 7 days, etc. The measurement you might want to use might change, even allowing the user to select which formula to use.
In this case, you'd want a Statistics model that belongs_to the regular model, and you can join the tables to get the popularity measurement that you're interested in.
I would go with a popularity_score field in the model and a scope that returns the popular items for example:
def self.popular(count = 10)
  order('popularity_score DESC').limit(count)
end
Depending on the scoring algorithm, I may add an additional model to hold the statistics:
class ModelStat < ActiveRecord::Base
  belongs_to :model
  # columns: statistic (string), value (integer)
end
This could hold stats like views, up_votes or shares, which would be periodically aggregated using your preferred popularity algorithm and the result saved into popularity_score (managed as a rake task kicked off by cron or similar; a sketch follows below).
I would make sure that popularity_score was an indexed field!
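A hedged sketch of that periodic aggregation as a rake task; the weights and the has_many :model_stats association are illustrative assumptions, not a prescribed formula:

# lib/tasks/popularity.rake -- hedged sketch; the weights are made up.
namespace :popularity do
  task refresh: :environment do
    Model.find_each do |record|
      views = record.model_stats.where(statistic: "views").sum(:value)
      votes = record.model_stats.where(statistic: "up_votes").sum(:value)
      record.update_column(:popularity_score, votes * 2 + views * 0.1)
    end
  end
end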

ActiveRecord: size vs count

In Rails, you can find the number of records using both Model.size and Model.count. If you're dealing with more complex queries is there any advantage to using one method over the other? How are they different?
For instance, I have users with photos. If I want to show a table of users and how many photos they have, will running many instances of user.photos.size be faster or slower than user.photos.count?
Thanks!
You should read that, it's still valid.
You'll adapt the function you use depending on your needs.
Basically:
if you have already loaded all entries, say User.all, you should use length to avoid another db query
if you haven't loaded anything yet, use count to make a count query on your db
if you don't want to bother with these considerations, use size, which will adapt
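A quick console sketch of those three cases (assuming Rails 4+ lazy relations):

users = User.all   # lazy relation, no query yet
users.size         # relation not loaded, so this runs a COUNT query
users.length       # loads all rows into Ruby and counts the array
users.size         # relation is now loaded, so this reuses the array length
User.count         # always a SELECT COUNT(*) query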
As the other answers state:
count will perform an SQL COUNT query
length will calculate the length of the resulting array
size will try to pick the most appropriate of the two to avoid excessive queries
But there is one more thing. We noticed a case where size acts differently to count/length altogether, and I thought I'd share it since it is rare enough to be overlooked.
If you use a :counter_cache on a has_many association, size will use the cached count directly, and not make an extra query at all.
class Image < ActiveRecord::Base
  belongs_to :product, counter_cache: true
end

class Product < ActiveRecord::Base
  has_many :images
end
> product = Product.first # query, load product into memory
> product.images.size # no query, reads the :images_count column
> product.images.count # query, SQL COUNT
> product.images.length # query, loads images into memory
This behaviour is documented in the Rails Guides, but I either missed it the first time or forgot about it.
tl;dr
If you know you won't be needing the data, use count.
If you know you will use or have used the data, use length.
If you don't know where it will be used or whether the speed difference is negligible, use size...
count
Resolves to sending a SELECT COUNT(*)... query to the DB. The way to go if you don't need the data, just the count.
Example: count of new messages, total elements when only a page is going to be displayed, etc.
length
Loads the required data (i.e. runs the query if needed) and then just counts it. The way to go if you are using the data.
Example: Summary of a fully loaded table, titles of displayed data, etc.
size
It checks whether the data is already loaded (i.e. already in Rails); if so, it just counts it, otherwise it calls count (plus the pitfalls already mentioned in other entries).
def size
  loaded? ? @records.length : count(:all)
end
What's the problem?
That you might hit the DB twice if you don't do things in the right order (e.g. if you render the number of elements in a table above the rendered table, there will effectively be 2 calls sent to the DB).
Sometimes size "picks the wrong one" and returns a hash (which is what count would do, e.g. on a grouped relation).
In that case, use length to get an integer instead of a hash.
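The hash case shows up on grouped relations; a hedged console sketch (the values are illustrative):

Vote.group("created_at::date").count
# => {"2013-05-28" => 1200, "2013-05-29" => 980, ...}
# count on a grouped relation returns a Hash keyed by group, not an Integer,
# and size delegates to count when nothing is loaded.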
The following strategies all make a call to the database to perform a COUNT(*) query.
Model.count
Model.all.size
records = Model.all
records.count
The following is not as efficient, as it will load all records from the database into Ruby, which then counts the size of the collection.
records = Model.all
records.size
If your models have associations and you want to find the number of belonging objects (e.g. #customer.orders.size), you can avoid database queries (disk reads). Use a counter cache and Rails will keep the cache value up to date, and return that value in response to the size method.
I recommend using the size method.
class Customer < ActiveRecord::Base
  has_many :customer_activities
end

class CustomerActivity < ActiveRecord::Base
  belongs_to :customer, counter_cache: true
end
Consider these two models. The customer has many customer activities.
If you use a :counter_cache on a has_many association, size will use the cached count directly, and not make an extra query at all.
Consider one example:
In my database, one customer has 20,000 customer activities, and I tried counting that customer's activities with each of the count, length and size methods. Below is the benchmark report for all three.
             user     system      total        real
Count:   0.000000   0.000000   0.000000 (  0.006105)
Size:    0.010000   0.000000   0.010000 (  0.003797)
Length:  0.030000   0.000000   0.030000 (  0.026481)
So I found that, with :counter_cache, size is the best option for getting the number of records.
A flowchart to simplify the decision-making process is available in the source below. Hope it helps.
Source: Difference Between the Length, Size, and Count Methods in Rails

Rails Caching DB Queries and Best Practices

The DB load on my site is getting really high so it is time for me to cache common queries that are being called 1000s of times an hour where the results are not changing.
So for instance on my city model I do the following:
def self.fetch(id)
  Rails.cache.fetch("city_#{id}") { City.find(id) }
end

def after_save
  Rails.cache.delete("city_#{self.id}")
end

def after_destroy
  Rails.cache.delete("city_#{self.id}")
end
So now when I call City.fetch(1), the first time I hit the DB, but the next 1000 times I get the result from memory. Great. But most of the calls to city are not City.fetch(1) but @user.city.name, where Rails does not use fetch but queries the DB again... which makes sense, but is not exactly what I want.
I can do City.fetch(@user.city_id) but that is ugly.
So my question to you guys: what are the smart people doing? What is the right way to do this?
With respect to the caching, a couple of minor points:
It's worth using a slash to separate object type and id, which is the Rails convention. Even better, ActiveRecord models provide the cache_key instance method, which gives a unique identifier combining table name and id: "cities/13", etc.
One minor correction to your after_save filter: since you have the data on hand, you might as well write it back to the cache instead of deleting it. That saves you a trip to the database ;)
def after_save
  Rails.cache.write(cache_key, self)
end
As to the root of the question, if you're continuously pulling @user.city.name, there are two real choices (a sketch of the first follows below):
Denormalize the user's city name into the user row: @user.city_name (keep the city_id foreign key). This value should be written at save time.
-or-
Implement your User.fetch method to eager-load the city. Only do this if the contents of the city row never change (i.e. name etc.), otherwise you can potentially open up a can of worms with respect to cache invalidation.
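A hedged sketch of the denormalization option (assumes you add a city_name column to users; the callback name is mine):

class User < ActiveRecord::Base
  belongs_to :city
  before_save :denormalize_city_name

  private

  # Copy the city name onto the user row whenever the city changes.
  def denormalize_city_name
    self.city_name = city.try(:name) if city_id_changed?
  end
end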
Personal opinion:
Implement basic id based fetch methods (or use a plugin) to integrate with memcached, and denormalize the city name to the user's row.
I'm personally not a huge fan of cached model style plugins, I've never seen one that's saved a significant amount of development time that I haven't grown out of in a hurry.
If you're getting way too many database queries it's definitely worth checking out eager loading (through :include) if you haven't already. That should be the first step for reducing the quantity of database queries.
If you need to speed up SQL queries on data that doesn't change much over time, then you can use materialized views.
A matview stores the results of a query in a table-like structure of its own, from which the data can be queried. It is not possible to add or delete rows, but the rest of the time it behaves just like an actual table. Queries are faster, and the matview itself can be indexed.
At the time of this writing, matviews are natively available in Oracle DB, PostgreSQL, Sybase, IBM DB2, and Microsoft SQL Server. MySQL doesn't provide native support for matviews, unfortunately, but there are open source alternatives to it.
Here are some good articles on how to use matviews in Rails:
sitepoint.com/speed-up-with-materialized-views-on-postgresql-and-rails
hashrocket.com/materialized-view-strategies-using-postgresql
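Tying this back to the vote-counting question, a hedged sketch of a matview migration (assumes PostgreSQL 9.3+; the view and table names are illustrative):

class CreateDailyVoteCounts < ActiveRecord::Migration
  def up
    execute <<-SQL
      CREATE MATERIALIZED VIEW daily_vote_counts AS
      SELECT created_at::date AS day, COUNT(*) AS votes
      FROM votes
      GROUP BY created_at::date;
    SQL
  end

  def down
    execute "DROP MATERIALIZED VIEW daily_vote_counts;"
  end
end

You'd refresh it on a schedule, e.g. with REFRESH MATERIALIZED VIEW daily_vote_counts; from a cron'd rake task.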
I would go ahead and take a look at Memoization, which is now in Rails 2.2.
"Memoization is a pattern of
initializing a method once and then
stashing its value away for repeat
use."
There was a great Railscast episode on it recently that should get you up and running nicely.
Quick code sample from the Railscast:
class Product < ActiveRecord::Base
  extend ActiveSupport::Memoizable
  belongs_to :category

  def filesize(num = 1)
    # some expensive operation
    sleep 2
    12345789 * num
  end

  memoize :filesize
end
More on Memoization
Check out cached_model
