I am building several reports in an application and have come across a few ways of building them. I wanted to get your take on the best/common ways to build reports that are both scalable and as close to real-time as possible.
First, some conditions/limits/goals:
The report should be able to handle being real time (using node.js or ajax polling)
The report should update in an optimized way
If the report is about page views, and you're getting thousands a second, it might not be best to update the report every page view, but maybe every 10 or 100.
But it should still be close to real-time (so daily/hourly cron is not an acceptable alternative).
The report shouldn't be recalculating things that it's already calculated.
If it has counts, it increments a counter.
If it has averages, maybe it can somehow update the average without grabbing all of the records it's averaging every second and recalculating (not sure how to do this yet; a running-average sketch is shown after this list).
If it has counts/averages for a date range (today, last week, last month, etc.), and it's real-time, it shouldn't have to recalculate those values on every second/request; it should somehow do only the most minimal operation.
If the report is about a record and the record's "lifecycle" is complete (say a Project, and the project lasted 6 months, had a bunch of activity, but now it's over), the report should be permanently saved so subsequent retrievals just pull a pre-computed document.
The reports don't need to be searchable, so once the data is in a document, we're just displaying the document. The client gets basically a JSON tree representing all the stats, charts, etc. so it can be rendered however in Javascript.
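As a sketch of the "most minimal operation" idea (this is an addition, not part of the original question): a count and a running average can both be updated in O(1) per event, and date-range stats can be kept as per-day buckets that are only ever incremented. The class and field names below are hypothetical.

# Hypothetical in-memory stats object; in practice the same fields could live
# in Redis, a DB row, or a Mongo document.
class UserStats
  attr_reader :count, :average, :daily_counts

  def initialize
    @count = 0
    @average = 0.0
    @daily_counts = Hash.new(0)
  end

  # O(1) per event: the running mean avoids re-reading every record.
  def record(value, at = Time.now)
    @count += 1
    @average += (value - @average) / @count
    @daily_counts[at.to_date] += 1  # per-day bucket for date-range stats
  end

  # A date-range count becomes a sum over a handful of buckets, not a full scan.
  def count_between(from, to)
    (from..to).sum { |day| @daily_counts[day] }
  end
end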
My question arises because I am trying to figure out a way to do real-time reporting on huge datasets.
Say I am reporting about overall user signup and activity on a site. The site has 1 million users, and there are on average 1000 page views per second. There is a User model and a PageView model let's say, where User has_many :page_views. Say I have these stats:
report = {
  :users => {
    :counts => {
      :all => user_count,
      :active => active_user_count,
      :inactive => inactive_user_count
    },
    :averages => {
      :daily => average_user_registrations_per_day,
      :weekly => average_user_registrations_per_week,
      :monthly => average_user_registrations_per_month,
    }
  },
  :page_views => {
    :counts => {
      :all => user_page_view_count,
      :active => active_user_page_view_count,
      :inactive => inactive_user_page_view_count
    },
    :averages => {
      :daily => average_user_page_view_registrations_per_day,
      :weekly => average_user_page_view_registrations_per_week,
      :monthly => average_user_page_view_registrations_per_month,
    }
  },
}
Things I have tried:
1. User and PageView are both ActiveRecord objects, so everything is via SQL.
I grab all of the users in chunks, something like this:
class User < ActiveRecord::Base
  class << self
    def report
      result = {}
      User.find_in_batches(:include => :page_views) do |users|
        # some calculations
        # result[:users]...
        users.each do |user|
          # result[:users][:counts][:active]...
          # some more calculations
        end
      end
      result
    end
  end
end
2. Both records are MongoMapper::Document objects
Map-reduce is really slow to calculate on the spot, and I haven't yet spent the time to figure out how to make this work in a near-real-time way (checking out hummingbird). Basically I do the same thing: chunk the records, add the results to a hash, and that's it (see the counter-increment sketch after this list).
3. Each calculation is its own SQL/NoSQL query
This is kind of the approach the Rails statistics gem takes. The only thing I don't like about it is the number of queries it could make (I haven't benchmarked whether making 30 queries per request per report is better than chunking all the objects into memory and sorting in plain Ruby).
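As a hedged sketch of the counter-increment idea for case 2 (not from the original post): instead of running map-reduce on the spot, each page view can atomically $inc a precomputed stats document. The collection and field names are made up, and the call assumes the classic 1.x Ruby mongo driver that MongoMapper is built on.

# Hypothetical precomputed stats document, bumped atomically on each event.
class Report
  include MongoMapper::Document
  key :name, String
end

def bump_page_view_counters(user_active)
  status = user_active ? 'active' : 'inactive'
  Report.collection.update(
    { :name => 'site_stats' },
    { '$inc' => {
        'page_views.counts.all'          => 1,
        "page_views.counts.#{status}"    => 1,
        "page_views.daily.#{Date.today}" => 1  # per-day bucket for date ranges
      } },
    :upsert => true
  )
end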
Question
The question I guess is: what's the best way, from your experience, to do real-time reporting on large datasets? With chunking/sorting the records in memory on every request (what I'm doing now, which I can somewhat optimize with an hourly cron, but then it's not real-time), the reports take about a second to generate (complex date formulas and such), sometimes longer.
Besides traditional optimizations (a better date implementation, SQL/NoSQL best practices), where can I find some practical, tried-and-true articles on building reports? I can build reports no problem; the issue is how to make them fast, real-time, optimized, and correct. I haven't found much.
The easiest way to build near real-time reports for your use case is to use caching.
So in the report method, you can use the Rails cache:
class User < ActiveRecord::Base
  class << self
    def report
      Rails.cache.fetch('users_report', expires_in: 10.seconds) do
        result = {}
        User.find_in_batches(:include => :page_views) do |users|
          # some calculations
          # result[:users]...
          users.each do |user|
            # result[:users][:counts][:active]...
            # some more calculations
          end
        end
        result
      end
    end
  end
end
And on the client side you just request this report with ajax polling. That way generating these reports won't be a bottleneck: generating one takes ~1 second, and many clients can share the latest cached result.
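For completeness, a minimal sketch of the endpoint such polling could hit (the controller name and route are assumptions, not from the original answer):

# Serves the cached report built by User.report above; next_report, used by the
# delta prediction below, is omitted here.
class ReportsController < ApplicationController
  def users
    render json: { report: User.report }
  end
end

# config/routes.rb (assumed):
# get 'reports/users', to: 'reports#users'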
For a better user experience you can store the delta between two reports and advance the report on the client side using that delta as a prediction, like this:
let nextPredictedReport = null;
let currentReport = null;
let predictionTimer = null;

const startDrawingPredicted = () => {
  const step = 500;
  let timePassed = 0;                 // must be reassignable (was const)
  clearInterval(predictionTimer);     // don't stack prediction timers across polls
  predictionTimer = setInterval(() => {
    timePassed += step;
    const predictedReport = calcDeltaReport(currentReport, nextPredictedReport, timePassed);
    drawReport(predictedReport);
  }, step);
};

setInterval(() => {
  doReportAjaxRequest().then((response) => {
    drawReport(response.report);
    currentReport = response.report;
    nextPredictedReport = response.next_report;
    startDrawingPredicted();          // restart the prediction timer from zero
  });
}, 10000);
That's just an example of the approach; calcDeltaReport and drawReport need to be implemented on your own, and this solution might have issues, as it's just an idea :)
Related
My app seems to be getting bogged down. Can someone help me optimize this controller code to run faster, or point me in the right direction?
I'm trying to display a list of current customers (where active is true) and a list of potential customers (where active is false). Archived customers have archived set to true.
Thank you.
if current_user.manager?
  get_customers = Customer.where(:archived => false)
  @cc = get_customers.where(:active => true)
  @current_customers = @cc.where(:user_id => current_user.id)
  @count_current = @current_customers.count
  @pc = get_customers.where(:active => false)
  @potential_customers = @pc.where(:user_id => current_user.id)
  @count_potential = @potential_customers.count
end
How does this look for improving speed?
Model
scope :not_archived, -> { where(:archived => false) }
scope :current_customers, -> { where(:active => true).not_archived }
scope :potential_customers, -> { where(:active => false).not_archived }
scope :archived_customers, -> { where(:archived => true) }
Controller
@current_customers = Customer.current_customers.includes(:contacts, :contracts)
View
link_to "Current Clients #{#count_current.size}"
You may find help here
As @Gabbar pointed out, and I will add to it: your app right now is eager-loading (the opposite of lazy-loading), which means you are loading more from the database than you need. What to optimize depends entirely on your use case.
Whatever the use case, there are a few common things you can do to make it better:
You can implement pagination (there are gems for it, and you can do it yourself too) or infinite scrolling. In either case, you load a set number of records from the db at first; as soon as the user wants more, they either scroll down or click a 'next' button, and your action is called again with an incremented page number to fetch the next set of records (see the sketch after the gem list below).
Implementing the scroll-based version involves JS, the view height, etc., but pagination is much simpler.
Gems:
kaminari gem
infinite-pages
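A minimal controller sketch using the kaminari gem's page/per scopes (the current_customers scope comes from the earlier code; the page size and everything else is assumed):

# Load only one page of customers per request instead of everything.
def index
  @current_customers = Customer.current_customers
                               .includes(:contacts, :contracts)
                               .page(params[:page])  # kaminari scope
                               .per(25)              # assumed page size
end

In the view, kaminari's paginate helper (paginate @current_customers) renders the page links.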
Using includes
One more thing you should do: use includes in your queries when your records are related. Using includes is tricky but very helpful and time-saving. It fetches the needed related records together, in one go, instead of your code going back and forth to the database multiple times. Fetching from the database takes a lot of time compared to fetching from RAM.
@users = User.all.includes(:comments) # comments for all users are brought along with the users and kept in RAM for future access
@comments = @users.map(&:comments)    # no need to go to the db again, just RAM
Using scopes in models:
Creating scopes in models helps too. In your case, you should create scopes like this:
scope :archived_customers, -> { where('archived IS false') }
scope :potential_customers, -> { where('active IS false') }
OR
scope :archived_customers, -> { where(:archived => false) }
scope :potential_customers, -> { where(:active => false) }
Loading all the available records in a single query can be very costly. Moreover, a user may be interested in only a couple of the most recent records (e.g., the latest posts in a blog) and does not want to wait for all records to load and render.
There are a couple of ways to sort out this problem (a load-more sketch follows the list):
Example 1: implementation of Load More
Example 2: implementation of Infinite Scrolling
Example 3: implementation of pagination
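A hedged sketch of the Load More idea (the Post model, batch size, and endpoint are assumptions, not tied to a particular gem): the action accepts an offset and returns the next slice as JSON.

class PostsController < ApplicationController
  BATCH_SIZE = 20  # assumed batch size

  def index
    offset = params[:offset].to_i
    # Only the next BATCH_SIZE records are loaded, newest first.
    @posts = Post.order(created_at: :desc).offset(offset).limit(BATCH_SIZE)

    respond_to do |format|
      format.html                          # initial page render
      format.json { render json: @posts }  # the "Load More" button fetches JSON
    end
  end
end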
New to elasticsearch-rails. It is acting weird.
When I call my API for the first time, it sometimes responds with an empty array, but calling the same API again returns the proper response.
API output (first call)
API output (second call)
My model:
class Consultation < ApplicationRecord
  include Elasticsearch::Model
  include Elasticsearch::Model::Callbacks

  after_save :set_index

  Consultation.import force: true

  def set_index
    self.__elasticsearch__.index_document
  end
end
My controller:
def search
  required_params_present = check_required_params(%i[search])
  if required_params_present
    searched_result = Elasticsearch::Model.search("*#{params[:search]}*", [Consultation]).records.records
    data = ActiveModel::ArraySerializer.new(searched_result, each_serializer: ConsultationSerializer)
    send_response(HTTP_STATUS_CODE_200, true, I18n.t('search'), data)
  else
    send_response(HTTP_STATUS_CODE_801, false, I18n.t('params_missing.error'))
  end
rescue => e
  send_response(HTTP_STATUS_CODE_500, false, e.message)
end
Response is empty only for the first time.
Is it that Elasticsearch takes time to respond the first time?
Any help or ideas will be really appreciated.
Newly indexed documents are not searchable immediately; by default they become visible within about one second, for performance reasons. See the reference.
You'll need to do a manual "refresh" on the index if you want real-time search results. However, I quote below:
While a refresh is much lighter than a commit, it still has a performance cost. A manual refresh can be useful when writing tests, but don’t do a manual refresh every time you index a document in production; it will hurt your performance. Instead, your application needs to be aware of the near real-time nature of Elasticsearch and make allowances for it.
In the test environment, it is perfectly acceptable to do a "refresh" just so you can immediately verify that the documents are searchable.
Because it seems that you are in development, I would advise against this, but you are still free to do so with something like the code below:
def set_index
  config = (Rails.env.development? || Rails.env.test?) ? { refresh: true } : {}
  self.__elasticsearch__.index_document config
end
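In tests, an alternative to passing refresh per document is to refresh the whole index once after setting up data; elasticsearch-model exposes refresh_index! on the __elasticsearch__ proxy. A hedged example (the factory, attribute, and expectation are assumptions):

# Index a record, force a refresh so it is immediately searchable, then search.
RSpec.describe 'consultation search' do
  it 'finds freshly indexed consultations' do
    create(:consultation, name: 'Cardiology follow-up')  # assumes a FactoryBot factory
    Consultation.__elasticsearch__.refresh_index!         # make the document visible now

    results = Elasticsearch::Model.search('Cardiology', [Consultation]).records.to_a
    expect(results).not_to be_empty
  end
end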
I have a query that spans multiple tables and, in the end, uses Active Model Serializers to render the data. Currently a large portion of the time is spent in the serializers, since I am forced to query some of the data from within the serializer itself. I want to find a way to speed this up, and that may mean not using AMS (which is okay).
My data model is as follows:
Location
-> Images
-> Recent Images
-> Days Images
Image
-> User
The recent_images and days_images are the same images, but with scopes that add a where clause to filter by day and limit the results to 6.
Currently this whole process takes about 15 seconds locally and 2-4 seconds in production. I feel like this could be much quicker, but I am not entirely sure how to modify my code.
The query to fetch the Locations is:
ids = @company
      .locations
      .active
      .has_image_taken
      .order(last_image_taken: :desc)
      .page(page)
      .per(per_page)
      .pluck(:id)

Location.fetch_multi(ids)
fetch_multi is from the identity_cache gem. These results then hit the serializer which is:
class V1::RecentLocationSerializer < ActiveModel::Serializer
  has_many :recent_images, serializer: V1::RecentImageSerializer do
    if scope[:view_user_photos]
      object.fetch_recent_images.take(6)
    else
      ids = object.recent_images.where(user_id: scope[:current_user].id).limit(6).pluck(:id)
      Image.fetch_multi(ids)
    end
  end

  has_many :days_images do
    if scope[:view_user_photos]
      object.fetch_days_images
    else
      ids = object.days_images.where(user_id: scope[:current_user].id).pluck(:id)
      Image.fetch_multi(ids)
    end
  end
end
The scopes for recent and days images are:
scope :days_images, -> { includes(:user).active.where('date_uploaded > ?', DateTime.now.beginning_of_day).ordered_desc_by_date }
scope :recent_images, -> { includes(:user).active.ordered_desc_by_date }
My question is whether you think I need to ditch AMS so I don't have to query in the serializer, and if so, how you would recommend rendering this.
I may have missed the point here - what part is slow? What do the logs look like? Are you missing a db index? I don't see a lot of joins in there, so maybe you just need an index on date_uploaded (and maybe user_id). I don't see anything in there that is doing a bunch of serializing.
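For reference, a migration for the index suggested above might look like this (the images table name and migration version are assumptions based on the scopes shown):

class AddReportingIndexToImages < ActiveRecord::Migration[5.2]  # adjust to your Rails version
  def change
    add_index :images, [:user_id, :date_uploaded]
  end
end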
You can speed up the SQL in many ways. For example, a trigger (in Ruby or SQL) that updates the image with a user_active boolean would let you drop that join (you'd include that column in the index).
Or build a cached table with just the IDs in it, kind of like an inverted index (I'd do it in Redis, but that's me), with a row for each user/location that is updated when an image is uploaded.
Push the work to the image upload rather than where it is now, viewing the list.
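A hedged sketch of pushing that work to upload time (the REDIS client constant, key format, and callback are all assumptions):

# Keep a small per-location cache of recent image ids, updated on upload,
# so the dashboard read becomes a simple id lookup plus fetch_multi.
class Image < ActiveRecord::Base
  belongs_to :location

  after_create :push_to_recent_cache

  private

  def push_to_recent_cache
    key = "location:#{location_id}:recent_image_ids"
    REDIS.lpush(key, id)    # REDIS is assumed to be a configured Redis client
    REDIS.ltrim(key, 0, 5)  # keep only the six most recent ids
  end
end

# Read path (e.g., in the serializer or a presenter):
ids = REDIS.lrange("location:#{location.id}:recent_image_ids", 0, 5).map(&:to_i)
recent_images = Image.fetch_multi(ids)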
I have a status dashboard that shows the status of remote hardware devices that 'ping' the application every minute and log their status.
class Sensor < ActiveRecord::Base
  has_many :logs

  def most_recent_log
    logs.order("id DESC").first
  end
end

class Log < ActiveRecord::Base
  belongs_to :sensor
end
Given I'm only interested in showing the current status, the dashboard only shows the most recent log for all sensors. This application has been running for a long time now and there are tens of millions of Log records.
The problem I have is that the dashboard takes around 8 seconds to load. From what I can tell, this is largely because there is an N+1 query fetching these logs.
Completed 200 OK in 4729.5ms (Views: 4246.3ms | ActiveRecord: 480.5ms)
I do have the following index in place:
add_index "logs", ["sensor_id", "id"], :name => "index_logs_on_sensor_id_and_id", :order => {"id"=>:desc}
My controller / lookup code is the following:
class SensorsController < ApplicationController
  def index
    @sensors = Sensor.all
  end
end
How do I make the load time reasonable?
Is there a way to avoid the N+1 and reload this?
I had thought of putting a latest_log_id reference onto Sensor and then updating it every time a new log for that sensor is posted, but something in my head is telling me that other developers would say this is a bad thing. Is this the case?
How are problems like this usually solved?
There are 2 relatively easy ways to do this:
Use ActiveRecord eager loading to pull in just the most recent logs
Roll your own mini eager loading system (as a Hash) for just this purpose
Basic ActiveRecord approach:
subquery = Log.group(:sensor_id).select("MAX(id)")
@sensors = Sensor.eager_load(:logs).where(logs: { id: subquery }).all
Note that you should NOT use your most_recent_log method for each sensor (that will trigger an N+1), but rather logs.first. Only the latest log for each sensor will actually be prefetched in the logs collection.
Rolling your own may be more efficient from a SQL perspective, but more complex to read and use:
@sensors = Sensor.all
logs = Log.where(id: Log.group(:sensor_id).select("MAX(id)"))
@sensor_logs = logs.each_with_object({}) do |log, hash|
  hash[log.sensor_id] = log
end
@sensor_logs is a Hash, permitting a fast lookup of the latest log by sensor.id.
Regarding your comment about storing the latest log id - you are essentially asking if you should build a cache. The answer would be 'it depends'. There are many advantages and many disadvantages to caching, so it comes down to 'is the benefit worth the cost'. From what you are describing, it doesn't appear that you are familiar with the difficulties they introduce (Google 'cache invalidation') or if they are applicable in your case. I'd recommend against it until you can demonstrate that a) it is adding real value over a non-cache solution, and b) it can be safely applied for your scenario.
There are 3 options:
eager loading
joining
caching the current status
--
Eager loading is explained by PinnyM.
For joining, you can do a join from the Sensor to just the latest Log record for each row, so everything gets fetched in one query. I'm not sure offhand how that will perform with the number of rows you have; it will likely still be slower than you want.
The thing you mentioned, caching the latest_log_id (or even caching just the latest_status, if that's all you need for the dashboard), is actually OK. It's called denormalization, and it's a useful thing if used carefully. You've likely come across "counter cache" plugins for Rails, which are in the same vein: duplicating data in the interest of optimising read performance. A small sketch follows.
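A hedged sketch of that denormalization (it assumes the sensors table has latest_log_id and latest_status columns, and that Log has a status attribute; these are hypothetical names):

class Log < ActiveRecord::Base
  belongs_to :sensor

  after_create :update_sensor_cache

  private

  # The write path does a little extra work so the dashboard read does none.
  def update_sensor_cache
    # update_columns is Rails 4+; use update_column per attribute on older versions.
    sensor.update_columns(latest_log_id: id, latest_status: status)  # status is an assumed Log attribute
  end
end

# Dashboard read: a single table scan, no joins or N+1.
@sensors = Sensor.all  # each row already carries latest_status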
How do you folks retrieve all objects up front in code?
I figure you can increase performance if you bundle all the model calls together.
This matters even more if your DB cannot keep everything in memory:
def hit_db_separately
  get X users
  ...code
  get Y users... code
  get Z users... code
end
Versus:
def hit_db_in_single_call
  get X+Y+Z users
  code for X
  code for Y...
end
Are you looking for a comparison between the approach where you load all users at once:
# Loads all users into memory simultaneously
@users = User.all

@users.each do |user|
  # ...
end
and the approach where you load them individually for a smaller memory footprint?
@user_ids = User.connection.select_values("SELECT id FROM users")

@user_ids.each do |user_id|
  user = User.find(user_id)
  # ...
end
The second approach would be slower, since it requires N+1 queries for N users, whereas the first loads them all with 1 query. However, the first requires sufficient memory to create model instances for each and every User record at the same time. This is not a function of "DB memory", but of application memory.
For any application with a non-trivial number of users, you should use an approach where you load users either individually or in groups. You can do this using:
@user_ids.in_groups_of(10, false) do |user_ids|  # false: don't pad the last group with nils
  User.find_all_by_id(user_ids).each do |user|
    # ...
  end
end
By tuning the grouping factor, you can balance memory usage against performance.
Can you give a snippet of actual Ruby on Rails code? Your pseudocode is a little confusing.
You can avoid the N+1 problem by using eager loading. You can accomplish this by passing the :include option to your Model.find call (or by using the includes method in newer Rails); an example follows the link below.
http://api.rubyonrails.org/classes/ActiveRecord/Associations/ClassMethods.html
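A minimal sketch, reusing the User/comments association from the earlier answer (the association and the limit are assumptions):

# Two queries total, no matter how many users are loaded:
# one for the users, one for all of their comments.
@users = User.includes(:comments).limit(100)

@users.each do |user|
  user.comments.size  # answered from the preloaded association, no extra query
end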