Logging Search Results in a Rails Application - ruby-on-rails

We're interested in logging and computing the number of times an item comes up in search or on a list page. With 50k unique visitors a day, we're expecting we could produce 3-4 million 'impressions' per day, which isn't a terribly high amount, but one we'd like to architect well.
We don't need to read this data in real time, but would like to be able to generate daily totals and analyze trends, etc. Similar to a business analytics tool.
We're planning to do this with an Ajax post after the page is rendered - this will allow us to count results even if those results are cached. We can do this in a single post per page, to send a comma delimited list of ids and their positions on the page.
I am hoping there is some sort of design pattern/gem/blog post about this that would help me avoid the common first-timer mistakes that may come up. I also don't really have much experience logging or reading logs.
My current strategy - make something to write events to a log file, and a background job to tally up the results at the end of the day and put the results back into mysql.

Ok, I have three approaches for you:
1) Queues
In your AJAX Handler, write the simplest method possible (use a Rack Middleware or Rails Metal) to push the query params to a queue. Then, poll the queue and gather the messages.
Queue pushes from a rack middleware are blindingly fast. We use this on a very high traffic site for logging of similar data.
An example Rack middleware is below (extracted from our app; it can handle a request in roughly 2 ms):
class TrackingMiddleware
  CACHE_BUSTER = {
    "Cache-Control" => "no-cache, no-store, max-age=0, must-revalidate",
    "Pragma"        => "no-cache",
    "Expires"       => "Fri, 29 Aug 1997 02:14:00 EST"
  }
  IMAGE_RESPONSE_HEADERS = CACHE_BUSTER.merge("Content-Type" => "image/gif").freeze
  IMAGE_RESPONSE_BODY = [File.open(Rails.root + "public/images/tracker.gif").read].freeze

  def initialize(app)
    @app = app
  end

  def call(env)
    if env["PATH_INFO"] =~ %r{^/track\.gif}
      request = Rack::Request.new(env)
      YOUR_QUEUE.push([Time.now, request.GET.symbolize_keys])
      # Rack responses are [status, headers, body]
      [200, IMAGE_RESPONSE_HEADERS, IMAGE_RESPONSE_BODY]
    else
      @app.call(env)
    end
  end
end
For the queue I'd recommend Starling; I've had nothing but good times with it.
On the parsing end, I would use the super-poller toolkit, but then I would say that, since I wrote it.
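For reference, here is a rough sketch of what YOUR_QUEUE could look like backed by Starling, which speaks the memcached protocol. The class name, queue name, and the use of the old memcache-client gem are assumptions, not part of the original answer:

require 'memcache'

class TrackingQueue
  def initialize(server = 'localhost:22122', queue_name = 'tracking')
    @client = MemCache.new(server)
    @queue_name = queue_name
  end

  # set on a Starling key enqueues a message
  def push(message)
    @client.set(@queue_name, message)
  end

  # get pops the next message off the queue (nil when empty)
  def pop
    @client.get(@queue_name)
  end
end

YOUR_QUEUE = TrackingQueue.new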
2) Logs
Pass all the params along as query params to a static file (/1x1.gif?foo=1&bar=2&baz=3).
This will not hit the Rails stack at all and will be blindingly fast.
When you need the data, just parse the web server's log files.
This is the best-scaling home-brew approach.
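For example, here is a minimal sketch of tallying impressions back out of the access log. The log line format and parameter names are assumptions; adjust the regex to whatever your web server actually writes:

require 'cgi'

counts = Hash.new(0)

File.foreach('log/access.log') do |line|
  # Pull the query string off requests like "GET /1x1.gif?item_id=42&position=3 HTTP/1.1"
  match = line.match(%r{GET /1x1\.gif\?(\S+)})
  next unless match

  params = CGI.parse(match[1])   # => {"item_id" => ["42"], "position" => ["3"]}
  item_id = params['item_id'].first
  counts[item_id] += 1 if item_id
end

counts.each { |item_id, count| puts "#{item_id}: #{count}" }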
3) Google Analytics
Why handle the load when Google will do it for you? You would be surprised at how good Google Analytics is; before you home-brew anything, check it out!
This will scale more or less infinitely, because Google buys servers faster than you do.
I could rant on this for ages, but I have to go now. Hope this helps!

Depending on the action required to list items, you might be able to do it in the controller and save yourself a round trip. You can do it with an after_filter to make the addition unobtrusive.
This only works if all the actions that list items you want to log require parameters, because page caching ignores GET requests with parameters.
Assuming you only want to log search data on the search action.
class ItemsController < ApplicationController
  after_filter :log_searches, :only => :search

  def log_searches
    @items.each do |item|
      # write to log here
    end
  end

  ...
  # rest of controller remains unchanged
  ...
end
Otherwise you're right on track with the AJAX, and an onload remote function.
As for processing, you could use a rake task run by a cron job to collect statistics, and possibly update items with a popularity rating.
Either way you will want to read up on the Ruby Logger class. Learning about cron jobs and rake tasks wouldn't hurt either.
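If you would rather keep the schedule in Ruby, the whenever gem wraps crontab; a minimal sketch, with a hypothetical rake task name:

# config/schedule.rb (whenever gem)
every 1.day, at: '12:05 am' do
  rake 'stats:rollup'   # hypothetical task that tallies the previous day's log
end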

This is what I ultimately did - it was enough for our use for now, and with some simple benchmarking, I feel OK about it. We'll be watching to see how it does in production before we expose the results to our customers.
The components:
class EventsController < ApplicationController
  def create
    logger = Logger.new("#{RAILS_ROOT}/log/impressions/#{Date.today}.log")
    logger.info "#{DateTime.now.strftime} #{params[:ids]}" unless params[:ids].blank?
    render :nothing => true
  end
end
This is called from an ajax call in the site layout...
<% javascript_tag do %>
var list = '';
$$('div.item').each(function(item) { list += item.id + ','; });
<%= remote_function(:url => { :controller => :events, :action => :create}, :with => "'ids=' + list" ) %>
<% end %>
Then I made a rake task to import these rows of comma delimited ids into the db. This is run the following day:
desc "Calculate impressions"
task :count_impressions => :environment do
date = ENV['DATE'] || (Date.today - 1).to_s # defaults to yesterday (yyyy-mm-dd)
file = File.new("log/impressions/#{date}.log", "r")
item_impressions = {}
while (line = file.gets)
ids_string = line.split(' ')[1]
next unless ids_string
ids = ids_string.split(',')
ids.each {|i| item_impressions[i] ||= 0; item_impressions[i] += 1 }
end
item_impressions.keys.each do |id|
ActiveRecord::Base.connection.execute "insert into item_stats(item_id, impression_count, collected_on) values('#{id}',#{item_impressions[id]},'#{date}')", 'Insert Item Stats'
end
file.close
end
One thing to note - the logger variable is declared in the controller action, not in environment.rb as you would normally do with a logger. I benchmarked this: 10,000 writes took about 20 seconds, averaging about 2 milliseconds per write. With the logger declared in environment.rb (fixed file name), it took about 14 seconds. We made this trade-off so we could dynamically determine the file name - an easy way to switch files at midnight.
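A hedged variant of the same idea (the class name is hypothetical, not from the original post): memoize one Logger per calendar day, so the file handle is reused across requests but the file still switches at midnight.

class ImpressionLogger
  def self.current
    today = Date.today
    # Reopen the log file only when the date changes (i.e. at midnight)
    if @logger.nil? || @logger_date != today
      @logger_date = today
      @logger = Logger.new("#{RAILS_ROOT}/log/impressions/#{today}.log")
    end
    @logger
  end
end

# In the controller action:
#   ImpressionLogger.current.info "#{DateTime.now.strftime} #{params[:ids]}"

Under a threaded server you would want to guard the reopen with a Mutex.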
Our main concern at this point - we have no idea how many different items will be counted per day, i.e. we don't know how long the tail is. This will determine how many rows are added to the db each day. We expect we'll need to limit how far back we keep daily reports, and will roll up results even further at that point.

Related

How to set an expiry on a cached Ruby search?

I have a function which returns a list of IDs. In the Rails caching guide I can see that an expiration can be set on cached results, but I have implemented my caching somewhat differently.
def provide_book_ids(search_param)
  @returned_ids ||= begin
    search = client.search(query: search_param, :reload => true)
    search.fetch
    search.options[:query] = search_str
    search.fetch(true)
    search.map(&:id)
  end
end
What is the recommended way to set a 10-minute cache expiry when written as above?
def provide_book_ids(search_param)
  @returned_ids = Rails.cache.fetch("zendesk_ids", expires_in: 10.minutes) do
    search = client.search(query: search_param, :reload => true)
    search.fetch
    search.options[:query] = search_str
    search.fetch(true)
    search.map(&:id)
  end
end
I am assuming this code is part of a request-response cycle and not something else (for example a long-running worker, or a class that is initialized once in your app). In such a case you wouldn't want to use @returned_ids directly but instead call provide_book_ids to get the value; but from what I understand that's not your scenario, so the approach above should work.
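One hedged refinement, not part of the answer above: if provide_book_ids can be called with different search_param values, you probably want the param in the cache key so different searches don't overwrite each other. Rails.cache.fetch accepts an array key:

def provide_book_ids(search_param)
  Rails.cache.fetch(["zendesk_ids", search_param], expires_in: 10.minutes) do
    search = client.search(query: search_param, :reload => true)
    search.fetch
    search.options[:query] = search_str
    search.fetch(true)
    search.map(&:id)
  end
end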

Get next/previous record in Collection by Date in Rails (and handle multiple dates on the same day)

I'm having trouble trying to isolate the next/previous record in a collection. I'm self-taught and relatively new to Rails/coding.
I have a Goal class which has many GoalTasks.
GoalTask has taskduedate. I want to be able to cycle next/previous on the goal_tasks, based on their taskduedate.
The issue is that a task's due date is just when the task is due to be completed; it can be set at any time and may not be in sequential order, so I don't know what else to order by to cycle through it correctly.
I have created an array of goal_tasks to identify which one is currently being viewed (e.g. Task: 3/20), so I could use that to go to the next one. I think there might be a solution there, but it feels wrong to handle it in the view/controller.
I've tried the solution below from Stack Overflow, but it doesn't handle the fact that I have multiple goal_tasks due on the same day; if I click next it just goes to the next day that goal_tasks are due. E.g. if I have three tasks due today and I'm on the first one and click next, it will just skip over the other two for today.
I then tried adding the >= (shown below) to pull the next task (including those due on the same day), and I've also tried ignoring the current task by filtering out its created_at with where.not, but I haven't managed to get it to cycle the way I want, and I imagine there's a better solution.
GoalTasksController:
def show
  @all_tasks_ordered_due_date_desc = @goal.goal_tasks.order('taskduedate ASC', 'id ASC')
end
show.html.erb:
Task: <%= @all_tasks_ordered_due_date_desc.find_index(@goal_task) + 1 %> /
<%= @goal.goal_tasks.count %>
GoalTask.rb
scope :next_task, lambda {|taskduedate| where('taskduedate >= ?', taskduedate).order('id ASC') }
scope :last_task, lambda {|taskduedate| where('taskduedate <= ?', taskduedate).order('id DESC') }
def next_goal_task
  goal.goal_tasks.next_task(self.taskduedate).first
end
Thanks
I used the method found here: Rails 5: ActiveRecord collection index_by
Which meant adding a default scope and changing GoalTask.rb to:
default_scope { order('taskduedate ASC') }

def next_goal_task
  index = goal.goal_tasks.index self
  goal.goal_tasks[index + 1]
end

def last_goal_task
  index = goal.goal_tasks.index self
  # note: when index is 0 this returns goal_tasks[-1], i.e. it wraps around to the last task
  goal.goal_tasks[index - 1]
end

Updating Lots of Records at Once in Rails

I've got a background job that runs about 5,000 times every 10 minutes. Each job makes a request to an external API and then either adds new or updates existing records in my database. Each API request returns around 100 items, so every 10 minutes I am making 50,000 CREATE or UPDATE SQL queries.
The way I handle this now is, each API item returned has a unique ID. I search my database for a post that has this id, and if it exists, it updates the model. If it doesn't exist, it creates a new one.
Imagine the api response looks like this:
[
  {
    external_id: '123',
    text: 'blah blah',
    count: 450
  },
  {
    external_id: 'abc',
    text: 'something else',
    count: 393
  }
]
which is set to the variable collection
Then I run this code in my parent model:
class ParentModel < ApplicationRecord
  def update
    collection.each do |attrs|
      child = ChildModel.find_or_initialize_by(external_id: attrs[:external_id], parent_model_id: self.id)
      child.assign_attributes attrs
      child.save if child.changed?
    end
  end
end
Each of these individual calls is extremely quick, but when I am doing 50,000 in a short period of time it really adds up and can slow things down.
I'm wondering if there's a more efficient way I can handle this, I was thinking of doing something instead like:
class ParentModel < ApplicationRecord
  def update
    eager_loaded_children = ChildModel.where(parent_model_id: self.id).limit(100)
    collection.each do |attrs|
      cached_child = eager_loaded_children.select { |child| child.external_id == attrs[:external_id] }.first
      if cached_child
        cached_child.update_attributes attrs
      else
        ChildModel.create attrs
      end
    end
  end
end
Essentially I would be saving the lookups and instead doing a bigger query up front (which is also quite fast), trading memory for queries. But this doesn't seem like it would save much time - it might slightly speed up the lookup part, but I'd still have to do 100 updates and creates.
Is there some kind of way I can do batch updates that I'm not thinking of? Anything else obvious that could make this go faster, or reduce the amount of queries I am doing?
You can do something like this:
def update
  collection2 = collection.map { |c| [c[:external_id], c.except(:external_id)] }.to_h
  ChildModel.where(external_id: collection2.keys).each do |cm|
    ext_id = cm.external_id
    cm.assign_attributes collection2[ext_id]
    cm.save if cm.changed?
    collection2.delete(ext_id)
  end
  if collection2.present?
    new_ids = collection2.keys
    new_records = collection.select { |c| new_ids.include? c[:external_id] }
    ChildModel.create(new_records)
  end
end
Better because it:
fetches all required records at once
creates all new records at once
You can use update_columns if you don't need callbacks/validations.
The only drawback is more Ruby-side data manipulation, which I think is a good trade-off for fewer DB queries.
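Another option, not part of the answer above: if you are on Rails 6 or later and have a unique index covering (parent_model_id, external_id), upsert_all writes the whole batch in a single statement. A hedged sketch, with a hypothetical method name:

class ParentModel < ApplicationRecord
  def sync_children(collection)
    rows = collection.map { |attrs| attrs.merge(parent_model_id: id) }
    # One bulk INSERT with conflict handling; note it bypasses validations and
    # callbacks, and (before Rails 7) does not populate created_at/updated_at for you.
    ChildModel.upsert_all(rows)
  end
end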

Speed up rake task by using typhoeus

So i stumbled across this: https://github.com/typhoeus/typhoeus
I'm wondering if this is what I need to speed up my rake task:
Event.all.each do |row|
  begin
    url = urlhere + row.first + row.second
    doc = Nokogiri::HTML(open(url))
    doc.css('.table__row--event').each do |tablerow|
      table = tablerow.css('.table__cell__body--location').css('h4').text
      next unless table == row.eventvenuename
      tablerow.css('.table__cell__body--availability').each do |button|
        buttonurl = button.css('a')[0]['href']
        row.update(row: buttonurl) unless buttonurl.include?('/checkout/external')
      end
    end
  rescue Faraday::ConnectionFailed
    puts "connection failed"
    next
  end
end
I'm wondering if this would speed it up, or whether it wouldn't because I'm doing a .each?
If it would, could you provide an example?
Sam
If you set up Typhoeus::Hydra to run parallel requests, you might be able to speed up your code, assuming that the Kernel#open calls are what's slowing you down. Before you optimize, you might want to run benchmarks to validate this assumption.
If it is true, and parallel requests would speed it up, you would need to restructure your code to load events in batches, build a queue of parallel requests for each batch, and then handle them after they execute. Here's some sketch code.
class YourBatchProcessingClass
  def initialize(batch_size: 200)
    @batch_size = batch_size
    @hydra = Typhoeus::Hydra.new(max_concurrency: @batch_size)
  end

  def perform
    # Get records in batches rather than all at once
    Event.find_in_batches(batch_size: @batch_size) do |batch|
      # Store all the requests so we can access their responses later.
      requests = batch.map do |record|
        request = Typhoeus::Request.new(your_url_build_logic(record))
        @hydra.queue request
        request
      end
      @hydra.run # Run requests in parallel

      # Process responses from each request
      requests.each do |request|
        your_response_processing(request.response.body)
      end
    end
  rescue WhateverError => e
    puts e.message
  end

  private

  def your_url_build_logic(event)
    # TODO
  end

  def your_response_processing(response_body)
    # TODO
  end
end

# Run the service by calling this in your Rake task definition
YourBatchProcessingClass.new.perform
Ruby can be used for pure scripting, but it functions best as an object-oriented language. Decomposing your processing work into clear methods can help clarify your code and help you catch the things Tom Lord mentioned in the comments on your question. Also, instead of wrapping your whole script in a begin..rescue block, you can use method-level rescues as in #perform above, or just wrap @hydra.run.
As a note, .all.each is a memory hog, and is thus considered a bad solution to iterating over records: .all loads all of the records into memory before iterating over them with .each. To save memory, it's better to use .find_each or .find_in_batches, depending on your use case. See: http://api.rubyonrails.org/classes/ActiveRecord/Batches.html
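For example, the loop above could become the following (the batch size shown is just illustrative):

Event.find_each(batch_size: 1000) do |row|
  # ... same body as the original loop ...
end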

Where does an if/then statement that needs to run constantly go in Rails?

Right now I'm building a call tracking app to learn Rails and Twilio. The app has two relevant models: the Plan model has_many users, and the plans table also has the value max_minutes.
I want it to make it so that when a particular user goes over their max_minutes, their sub account is disabled, and I can also warn them to upgrade in the view.
To do this, here's a method I created in the User class:
def at_max_minutes?
  time_to_bill = 0
  start_time = Time.now - (30 * 24 * 60 * 60) # 30 days
  @subaccount = Twilio::REST::Client.new(@user.twilio_account_sid, @user.twilio_auth_token)
  @subaccount.calls.list({ :page => 0, :page_size => 1000, :start_time => ">#{start_time.strftime("%Y-%m-%d")}" }).each do |call|
    time_to_bill += (call.duration.to_f / 60).ceil
  end
  time_to_bill >= self.plan.max_minutes
end
This allows me to run if/else statements in the view to urge them to upgrade. However, I'd also like an if/else statement where, if at_max_minutes?, then the user's Twilio subaccount is disabled; else, it's enabled.
I'm not sure where I would put that though in rails.
It would look something like this
@client = Twilio::REST::Client.new(@user.twilio_account_sid, @user.twilio_auth_token)
@account = @client.account
if at_max_minutes?
  @account = @account.create({ :status => 'suspended' })
else
  @account = @account.create({ :status => 'active' })
end
BUT, I'm not sure where I would put this code, so that it's active all the time.
How would you implement this code, for the functionality to work?
Instead of constantly computing the total minutes used in at_max_minutes?, why not keep track of a user's used minutes and set the status to "suspended" on the transition (when used minutes goes over max_minutes)? Then your view and call code would only have to check the status (you may also want to store the status directly on the user, to save API calls over to Twilio).
Add to User model:
used_minutes
When every call ends, update minutes:
def on_call_end(call)
  # this assumes Twilio gives you a callback and includes the length of the call
  self.used_minutes += call.duration_in_minutes
  save!
end
Add an after_save to User:
after_save :check_minutes_usage

def check_minutes_usage
  if used_minutes >= plan.max_minutes
    @account = @account.create({ :status => 'suspended' })
  else
    @account = @account.create({ :status => 'active' })
  end
end
You're going to have to do some sort of scheduled background job for this check if you want it to be "active all the time". I'd recommend resque with resque-scheduler, which is a pretty good scheduling solution for Rails. Basically what you to do is to make a job, which executes that second block of code you specified, and have it run on a regular interval (maybe every 2 hours).
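A rough sketch of such a job with Resque and resque-scheduler (the class, queue, and schedule names are hypothetical); it simply re-runs the check_minutes_usage logic from the answer above for every user:

# app/jobs/check_minutes_usage_job.rb
class CheckMinutesUsageJob
  @queue = :maintenance

  def self.perform
    # Re-run the usage check sketched above for each user, in batches
    User.find_each(&:check_minutes_usage)
  end
end

# config/resque_schedule.yml - runs the job every 2 hours
# check_minutes_usage:
#   every: 2h
#   class: CheckMinutesUsageJob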
