Consuming an external API and performance - ruby-on-rails

I am working on a Ruby on Rails project that consumes data from an external API.
This API lets me fetch a list of cars and display them on my single web page.
I created a model that holds all the methods related to this API.
The controller uses the list_cars method from the model to forward the data to the view.
This is the model dedicated to the API calls:
class CarsApi
  @base_uri = 'https://api.greatcars.com/v1/'

  def self.list_cars
    cars = Array.new
    response = HTTParty.get(@base_uri + 'cars',
      headers: {
        'Authorization' => 'Token token=' + ENV['GREATCARS_API_TOKEN'],
        'X-Api-Version' => ENV["GREATCARS_API_VERSION"]
      })
    response["data"].each_with_index do |(key, value), index|
      id = response["data"][index]["id"]
      make = response["data"][index]["attributes"]["make"]
      store = get_store(id)
      location = get_location(id)
      model = response["data"][index]["attributes"]["model"]
      if response["data"][index]["attributes"]["status"] == "on sale"
        cars << Job.new(id, make, store, location, model)
      end
    end
    cars
  end

  def self.get_store(job_id)
    store = ''
    response_related_store = HTTParty.get(@base_uri + 'cars/' + job_id + "/relationships/store",
      headers: {
        'Authorization' => 'Token token=' + ENV['GREATCARS_API_TOKEN'],
        'X-Api-Version' => ENV["GREATCARS_API_VERSION"]
      })
    if response_related_store["data"]
      store_id = response_related_store["data"]["id"]
      response_store = HTTParty.get(@base_uri + 'stores/' + store_id,
        headers: {
          'Authorization' => 'Token token=' + ENV['GREATCARS_API_TOKEN'],
          'X-Api-Version' => ENV["GREATCARS_API_VERSION"]
        })
      store = response_store["data"]["attributes"]["name"]
    end
    store
  end

  def self.get_location(job_id)
    address, city, country, zip, lat, long = ''
    response_related_location = HTTParty.get(@base_uri + 'cars/' + job_id + "/relationships/location",
      headers: {
        'Authorization' => 'Token token=' + ENV['GREATCARS_API_TOKEN'],
        'X-Api-Version' => ENV["GREATCARS_API_VERSION"]
      })
    if response_related_location["data"]
      location_id = response_related_location["data"]["id"]
      response_location = HTTParty.get(@base_uri + 'locations/' + location_id,
        headers: {
          'Authorization' => 'Token token=' + ENV['GREATCARS_API_TOKEN'],
          'X-Api-Version' => ENV["GREATCARS_API_VERSION"]
        })
      if response_location["data"]["attributes"]["address"]
        address = response_location["data"]["attributes"]["address"]
      end
      if response_location["data"]["attributes"]["city"]
        city = response_location["data"]["attributes"]["city"]
      end
      if response_location["data"]["attributes"]["country"]
        country = response_location["data"]["attributes"]["country"]
      end
      if response_location["data"]["attributes"]["zip"]
        zip = response_location["data"]["attributes"]["zip"]
      end
      if response_location["data"]["attributes"]["lat"]
        lat = response_location["data"]["attributes"]["lat"]
      end
      if response_location["data"]["attributes"]["long"]
        long = response_location["data"]["attributes"]["long"]
      end
    end
    Location.new(address, city, country, zip, lat, long)
  end
end
It takes... 1 minute and 10 seconds to load my home page!
I wonder if there is a better way to do this and improve performance.

If you want to improve the performance of a piece of code, the first thing you should always do is add some instrumentation. It seems you already have a metric for how long loading the whole page takes, but now you need to figure out WHAT is actually taking so long. There are many great gems and services out there that can help you.
Services:
https://newrelic.com/
https://www.skylight.io/
https://www.datadoghq.com/
Gems
https://github.com/MiniProfiler/rack-mini-profiler
https://github.com/influxdata/influxdb-rails
Or just add some plain old logging.
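For example, here is a minimal timing sketch using Ruby's built-in Benchmark; the fetch_cars helper is hypothetical and would simply wrap the existing HTTParty logic:

require 'benchmark'

class CarsApi
  def self.list_cars
    cars = nil
    elapsed = Benchmark.realtime do
      cars = fetch_cars  # hypothetical wrapper around the existing HTTParty calls
    end
    Rails.logger.info("CarsApi.list_cars took #{(elapsed * 1000).round} ms")
    cars
  end
end

Wrapping get_store and get_location the same way should quickly show where the time is actually going.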
N+1 requests
One assumption why this is slow could be that you make up to FOUR additional requests for every car to fetch its store and location. This means that if you display 10 jobs on your homepage you end up with roughly 40 extra API requests (41 including the initial list call). Even if each request only takes ~1 second, that already adds up to well over half a minute.
One simple idea is to cache the additional resources in a lookup table. I'm not sure how many jobs would share the same store and location, though, so I'm not sure how much it would actually save.
This could be a simple lookup table like:
class Store
  cattr_reader :table do
    {}
  end

  class << self
    # e.g. Store.find(store_id) from list_cars; repeated store ids hit the
    # in-memory table instead of triggering another API round trip
    def find(id)
      table[id] ||= fetch_store(id)
    end

    private

    def fetch_store(id)
      # copy code here to fetch the store from the API
    end
  end
end
This way, if several jobs share the same location / store you only make one request per store.
Lazy load
This depends on the design of the page, but you could lazy load additional information like the location and store.
One thing many pages do is display a placeholder or dominant colour and lazy load further content with JavaScript.
Another idea could be to load the store and location on scroll or hover, but this depends a bit on the design of your page.
Pagination / Limit
Maybe you're also requesting too many items from the API. See if the API has some options to limit the number of items you request, e.g. https://api.greatcars.com/v1/cars?limit=10&page=1
But as said, even if this is limited to 10 items you would still end up with roughly 40 requests in total. So until you fix the first issue, this won't have much impact.
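If the API does support such parameters (limit and page below are assumed names; check the API docs), HTTParty can pass them via its query option:

response = HTTParty.get(@base_uri + 'cars',
  query: { limit: 10, page: 1 },  # assumed parameter names
  headers: {
    'Authorization' => 'Token token=' + ENV['GREATCARS_API_TOKEN'],
    'X-Api-Version' => ENV['GREATCARS_API_VERSION']
  })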
General Caching
Generally I don't think it's a good idea to send an API request for every request your page gets. You could introduce some caching to, e.g., only hit the API once every x minutes / hours / days.
This could be as simple as storing the result in a class variable, or using memcached / Redis / the database.
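A minimal sketch using the Rails cache store, assuming a cache backend such as memcached or Redis is configured; fetch_cars_from_api is a hypothetical method holding the existing HTTParty and mapping logic:

class CarsApi
  def self.list_cars
    Rails.cache.fetch('cars_api/list_cars', expires_in: 10.minutes) do
      fetch_cars_from_api  # hypothetical: the existing request + mapping code
    end
  end
end

The first request after the cache expires still pays the full cost, so this pairs well with fixing the N+1 requests above.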

Related

How to properly use Sidekiq to process background tasks in Rails

So, I've generated a Rails app using https://github.com/Shopify/shopify_app - and for the most part have the app working as intended. Its goal is to get product quantities from an external stock management API, and then update the variant quantities in Shopify with the latest quantities from that stock management system.
My problem is that the initial POST request to the external API responds with a large number of products - this sometimes takes upwards of 15 seconds. In addition to this, another portion of my app then takes this response and, for every product in the response that also exists in Shopify, makes a PUT request to Shopify to update the variant quantities. As with the initial request, this also takes upwards of 10-15 seconds.
My problem is that I'm hosting the app on Heroku, and as a result I've hit their 30 second request timeout limit. So I need to move at least one of the requests above (perhaps both) into a background worker queue. I've gone with the widely recommended Sidekiq gem - https://github.com/mperham/sidekiq - which is easy enough to set up.
My problem is that I don't know how to get the results from the finished Sidekiq worker job and then use them again within the controller - I also don't know if this is best practice (I'm a little new to Rails/app development).
I've included my controller (prior to breaking it down into workers) that currently runs the app below. I guess I just need some advice: am I doing this correctly, should some of this logic be inside a model, and if so how would that model then communicate with the controller, and how would Sidekiq fit into all of it?
Appreciate any advice or assistance, thanks.
class StockManagementController < ShopifyApp::AuthenticatedController
  require 'uri'
  require 'net/http'
  require 'json'
  require 'nokogiri'
  require 'open-uri'
  require 'rexml/document'

  def new
    @token = StockManagementController.new
  end

  def get_token
    url = URI('https://external.api.endpoint/api/v1/AuthToken')
    http = Net::HTTP.new(url.host, url.port)
    http.use_ssl = true
    http.verify_mode = OpenSSL::SSL::VERIFY_NONE

    @HEROKU_ENV_USERNAME = ENV['HEROKU_ENV_USERNAME']
    @HEROKU_ENV_PASSWORD = ENV['HEROKU_ENV_PASSWORD']

    request = Net::HTTP::Post.new(url)
    request['content-type'] = 'application/x-www-form-urlencoded'
    request['cache-control'] = 'no-cache'
    request.body = 'username=' + @HEROKU_ENV_USERNAME + '&password=' + @HEROKU_ENV_PASSWORD + '&grant_type=password'

    response = http.request(request)
    responseJSON = JSON.parse(response.read_body)
    session[:accessToken] = responseJSON['access_token']

    if session[:accessToken]
      flash[:notice] = 'StockManagement token generation was successful.'
      redirect_to '/StockManagement/product_quantity'
    else
      flash[:alert] = 'StockManagement token generation was unsuccessful.'
    end
  end

  def product_quantity
    REXML::Document.entity_expansion_text_limit = 1_000_000
    @theToken = session[:accessToken]

    if @theToken
      url = URI('https://external.api.endpoint/api/v1/ProductQuantity')
      http = Net::HTTP.new(url.host, url.port)
      http.use_ssl = true
      http.verify_mode = OpenSSL::SSL::VERIFY_NONE

      request = Net::HTTP::Post.new(url)
      request['authorization'] = 'bearer ' + @theToken + ''
      request['content-type'] = 'application/xml'
      request['cache-control'] = 'no-cache'

      response = http.request(request)
      responseBody = response.read_body
      finalResponse = Hash.from_xml(responseBody).to_json
      resultQuantity = JSON.parse finalResponse

      @connectionType = resultQuantity['AutomatorResponse']['Type']
      @successResponse = resultQuantity['AutomatorResponse']['Success']
      @errorResponse = resultQuantity['AutomatorResponse']['ErrorMsg']
      productQuantityResponse = resultQuantity['AutomatorResponse']['ResponseString']

      xmlResponse = Hash.from_xml(productQuantityResponse).to_json
      jsonResponse = JSON.parse xmlResponse
      @fullResponse = jsonResponse['StockManagement']['Company']['InventoryQuantitiesByLocation']['InventoryQuantity']

      # This hash is used to store the final list of items that we need in order to display
      # the items we've synced, and to show the number of items we've synced successfully.
      @finalList = Hash.new
      # This array is used to contain the available products - this is used later on as a way of only rendering
      @availableProducts = Array.new

      # Here we get all of the variant data from Shopify.
      @variants = ShopifyAPI::Variant.find(:all, params: {})

      # For each piece of variant data, we push all of the available SKUs in the store
      # to the @availableProducts array for use later.
      @variants.each do |variant|
        @availableProducts << variant.sku
      end

      # Our final list of products, which will contain details from both the Stock Management
      # company and Shopify - we will use this list to run API calls against each item.
      @finalProductList = Array.new

      puts "Final product list has #{@fullResponse.length} items."
      puts @fullResponse.inspect

      # We look through every item in the response from Company.
      @fullResponse.each_with_index do |p, index|
        # We get the Quantity and Product Code.
        @productQTY = p["QtyOnHand"].to_f.round
        @productCode = p["Code"].upcase
        # If the product code is found in the list of available products in the Shopify store...
        if @availableProducts.include? @productCode
          @variants.each do |variant|
            if @productCode === variant.sku
              if @productQTY != 0
                @finalProductList << {
                  "sku" => variant.sku,
                  "inventory_quantity" => variant.inventory_quantity,
                  "old_inventory_quantity" => variant.old_inventory_quantity,
                  "id" => variant.id,
                  "company_sku" => @productCode,
                  "company_qty" => @productQTY
                }
              end
            end
          end
        end
      end

      # If we get a successful response from StockManagement, proceed...
      if @finalProductList
        flash[:notice] = 'StockManagement product quantity check was successful.'
        puts "Final product list has #{@finalProductList.length} items."
        puts @finalProductList

        @finalProductList.each do |item|
          @productSKU = item["sku"]
          @productInventoryQuantity = item["inventory_quantity"]
          @productOldInventoryQuantity = item["old_inventory_quantity"]
          @productID = item["id"]
          @companySKU = item["company_sku"]
          @companyQTY = item["company_qty"]

          url = URI("https://example.myshopify.com/admin/variants/#{@productID}.json")
          http = Net::HTTP.new(url.host, url.port)
          http.use_ssl = true
          http.verify_mode = OpenSSL::SSL::VERIFY_NONE

          request = Net::HTTP::Put.new(url)
          request["content-type"] = 'application/json'
          request["authorization"] = 'Basic KJSHDFKJHSDFKJHSDFKJHSDFKJHSDFKJHSDFKJHSDFKJHSDFKJHSDFKJHSDF'
          request["cache-control"] = 'no-cache'
          request.body = "{\n\t\"variant\": {\n\t\t\"id\": #{@productID},\n\t\t\"inventory_quantity\": #{@companyQTY},\n\t\t\"old_inventory_quantity\": #{@productOldInventoryQuantity}\n\t}\n}"

          # This is the line that actually runs the PUT request to update the quantity.
          response = http.request(request)

          # Finally, we populate the finalList hash with response information.
          @finalList[@companySKU] = ["", "You had #{@productOldInventoryQuantity} in stock, now you have #{@companyQTY} in stock."]
        end
      else
        # If the overall sync failed, we flash an alert.
        flash[:alert] = 'Quantity synchronisation was unsuccessful.'
      end

      # Lastly we get the final number of items that were synchronised.
      @synchronisedItems = @finalList.length
      # We flash this notification, letting the user know how many products were successfully synchronised.
      flash[:notice] = "#{@synchronisedItems} product quantities were synchronised successfully."
      # We then pretty print this to the console for debugging purposes.
      pp @finalList
    else
      flash[:alert] = @errorResponse
    end
  end
end
First of all, your product_quantity method is way too long. You should break it into smaller parts. Second, http.verify_mode = OpenSSL::SSL::VERIFY_NONE should not be used in production. The example you've provided along with your question is too complex and is therefore difficult to answer fully. It sounds like you need a basic understanding of design patterns, and this is not a Ruby-specific question.
If your app needs to make realtime API calls inside a controller, that is a poor design. You don't want to keep requests of any kind waiting for more than a couple of seconds at most. You should consider WHY you need to make these requests in the first place. If it's data you need quick access to, you should write background jobs to scrape the data on a schedule and store it in your own database.
If a user of your app makes a request which needs to wait for the API's response, you could write a worker to handle fetching the API data and eventually send a response to the user's browser, probably using ActionCable.
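A minimal Sidekiq sketch along those lines; StockManagement::Sync is a hypothetical service object that would hold the token fetch, the ProductQuantity call, and the Shopify PUTs currently living in the controller:

# app/workers/stock_sync_worker.rb
class StockSyncWorker
  include Sidekiq::Worker

  def perform(shop_id)
    StockManagement::Sync.new(shop_id).call  # hypothetical service object
  end
end

# From the controller, enqueue the job instead of doing the work inline
# (assuming you have a shop id or similar identifier handy to pass along):
# StockSyncWorker.perform_async(shop.id)

The worker can then report progress back through the database or ActionCable rather than the controller waiting on the API.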
For your constant definitions, you should probably do this in an initializer, which you would keep in my_app_root/config/initializers/constants.rb and which gets loaded into your app at runtime. You could just call them where needed using the ENV[] syntax, but if you prefer simpler constants, drop the @ since that naming convention in Ruby is for instance variables.
#app_root/config/initializers/constants.rb
HEROKU_ENV_USERNAME = ENV['HEROKU_ENV_USERNAME']
HEROKU_ENV_PASSWORD = ENV['HEROKU_ENV_PASSWORD']

Updating Lots of Records at Once in Rails

I've got a background job that runs about 5,000 times every 10 minutes. Each job makes a request to an external API and then either adds new or updates existing records in my database. Each API request returns around 100 items, so every 10 minutes I am making 50,000 CREATE or UPDATE SQL queries.
The way I handle this now is: each API item returned has a unique ID. I search my database for a record with this id, and if it exists, I update the model. If it doesn't exist, I create a new one.
Imagine the api response looks like this:
[
  {
    external_id: '123',
    text: 'blah blah',
    count: 450
  },
  {
    external_id: 'abc',
    text: 'something else',
    count: 393
  }
]
which is set to the variable collection
Then I run this code in my parent model:
class ParentModel < ApplicationRecord
  def update
    collection.each do |attrs|
      child = ChildModel.find_or_initialize_by(external_id: attrs[:external_id], parent_model_id: self.id)
      child.assign_attributes attrs
      child.save if child.changed?
    end
  end
end
Each of these individual calls is extremely quick, but when I am doing 50,000 in a short period of time it really adds up and can slow things down.
I'm wondering if there's a more efficient way I can handle this. I was thinking of doing something like this instead:
class ParentModel < ApplicationRecord
  def update
    eager_loaded_children = ChildModel.where(parent_model_id: self.id).limit(100)
    collection.each do |attrs|
      cached_child = eager_loaded_children.select { |child| child.external_id == attrs[:external_id] }.first
      if cached_child
        cached_child.update_attributes attrs
      else
        ChildModel.create attrs
      end
    end
  end
end
Essentially I would be saving the lookups and instead doing one bigger query up front (which is also quite fast), making a tradeoff in memory. But this doesn't seem like it would save that much time - it might slightly speed up the lookup part, but I'd still have to do 100 updates and creates.
Is there some kind of way I can do batch updates that I'm not thinking of? Anything else obvious that could make this go faster, or reduce the amount of queries I am doing?
You can do something like this:
def update
  # index the API payload by external_id for O(1) lookups
  collection2 = collection.map { |c| [c[:external_id], c.except(:external_id)] }.to_h

  # fetch all existing children in a single query
  ChildModel.where(external_id: collection2.keys).each do |cm|
    ext_id = cm.external_id
    cm.assign_attributes collection2[ext_id]
    cm.save if cm.changed?
    collection2.delete(ext_id)
  end

  # whatever is left in collection2 has no existing record yet
  if collection2.present?
    new_ids = collection2.keys
    new_records = collection.select { |c| new_ids.include? c[:external_id] }
    ChildModel.create(new_records)
  end
end
This is better because it:
fetches all required records in one query
creates all new records in one call
You can use update_columns if you don't need callbacks/validations (see the sketch below).
The only drawback is a bit more Ruby-side manipulation, which I think is a good tradeoff for fewer DB queries.
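A sketch of the update_columns variant, assuming the same collection2 hash as above and that its attribute keys match real column names (this writes straight to the database, skipping validations and callbacks):

ChildModel.where(external_id: collection2.keys).each do |cm|
  cm.update_columns(collection2[cm.external_id])
  collection2.delete(cm.external_id)
end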

Retrieving only unique records with multiple requests

I have this "heavy_rotation" filter I'm working on. Basically it grabs tracks from our database based on certain parameters (a mixture of listens_count, staff_pick, purchase_count, to name a few)
An xhr request is made to the filter_tracks controller action. In there I have a flag to check if it's "heavy_rotation". I will likely move this to the model (cos this controller is getting fat)... Anyway, how can I ensure (in a efficient way) to not have it pull the same records? I've considered an offset, but than I have to keep track of the offset for every query. Or maybe store track.id's to compare against for each query? Any ideas? I'm having trouble thinking of an elegant way to do this.
Maybe it should be noted that a limit of 14 is set via Javascript, and when a user hits "view more" to paginate, it sends another request to filter_tracks.
Any help appreciated! Thanks!
def filter_tracks
  params[:limit] ||= 50
  params[:offset] ||= 0
  params[:order] ||= 'heavy_rotation'
  # heavy rotation filter flag
  heavy_rotation ||= (params[:order] == 'heavy_rotation')

  @result_offset = params[:offset]
  @tracks = Track.ready.with_artist

  params[:order] = "tracks.#{params[:order]}" unless heavy_rotation
  if params[:order]
    order = params[:order]
    order.match(/artist.*/){|m|
      params[:order] = params[:order].sub /tracks\./, ''
    }
    order.match(/title.*/){|m|
      params[:order] = params[:order].sub /tracks.(title)(.*)/i, 'LOWER(\1)\2'
    }
  end

  searched = params[:q] && params[:q][:search].present?
  @tracks = parse_params(params[:q], @tracks)
  @tracks = @tracks.offset(params[:offset])
  @result_count = @tracks.count
  @tracks = @tracks.order(params[:order], 'tracks.updated_at DESC').limit(params[:limit]) unless heavy_rotation

  # structure heavy rotation results
  if heavy_rotation
    puts "*" * 300
    week_ago = Time.now - 7.days
    two_weeks_ago = Time.now - 14.days
    three_months_ago = Time.now - 3.months

    # mix in top licensed tracks within last 3 months
    t = Track.top_licensed
    tracks_top_licensed = t.where(
      "tracks.updated_at >= :top",
      top: three_months_ago).limit(5)

    # mix top listened to tracks within last two weeks
    tracks_top_listens = @tracks.order('tracks.listens_count DESC').where(
      "tracks.updated_at >= :top",
      top: two_weeks_ago)
      .limit(3)

    # mix top downloaded tracks within last two weeks
    tracks_top_downloaded = @tracks.order("tracks.downloads_count DESC").where(
      "tracks.updated_at >= :top",
      top: two_weeks_ago)
      .limit(2)

    # mix in 25% of staff picks added within 3 months
    tracks_staff_picks = Track.ready.staff_picks.
      includes(:artist).order("tracks.created_at DESC").where(
      "tracks.updated_at >= :top",
      top: three_months_ago)
      .limit(4)

    @tracks = tracks_top_licensed + tracks_top_listens + tracks_top_downloaded + tracks_staff_picks
  end
  render partial: "shared/results"
end
I think seeking an "elegant" solution is going to yield many diverse opinions, so I'll offer one approach and my reasoning. For my design, I feel that in this case it's optimal and elegant to enforce uniqueness on query intersections by filtering the returned record objects instead of trying to restrict the query to only yield unique results. As for getting contiguous results for pagination, on the other hand, I would store the offset from each query and use it as the starting point for the next query, using instance variables or sessions depending on how the data needs to be persisted.
Here's a gist to my refactored version of your code with a solution implemented and comments explaining why I chose to use certain logic or data structures: https://gist.github.com/femmestem/2b539abe92e9813c02da
#filter_tracks holds a hash map @tracks_offset which the other methods can access and update; each of the query methods holds the responsibility of adding its own offset key to @tracks_offset.
#filter_tracks also holds a collection of track ids for tracks that already appear in the results.
If you need persistence, make @tracks_offset and @result_track_ids sessions/cookies instead of instance variables. The logic should be the same. If you use sessions to store the offsets and ids from results, remember to clear them when your user is done interacting with this feature.
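A rough sketch of that session-backed variant (the key names are illustrative; the other methods would read and write these session keys instead of the instance variables):

def filter_tracks
  session[:tracks_offset] ||= {}
  session[:result_track_ids] ||= []
  # ... same logic as below, but using session[:tracks_offset] and
  # session[:result_track_ids] in place of @tracks_offset / @result_track_ids ...
end

# Clear them once the user leaves the feature, e.g.:
# session.delete(:tracks_offset)
# session.delete(:result_track_ids)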
See below. Note, I refactored your #filter_tracks method to separate the responsibilities into 9 different methods: #filter_tracks, #heavy_rotation, #order_by_params, #heavy_rotation?, #validate_and_return_top_results, and #tracks_top_licensed... #tracks_top_<whatever>. This will make my notes easier to follow and your code more maintainable.
def filter_tracks
  # Does this need to be so high when JavaScript limits display to 14?
  @limit ||= 50
  @tracks_offset ||= {}
  @tracks_offset[:default] ||= 0
  @result_track_ids ||= []
  @order ||= params[:order] || 'heavy_rotation'

  tracks = Track.ready.with_artist
  tracks = parse_params(params[:q], tracks)
  @result_count = tracks.count

  # Checks for heavy_rotation filter flag
  if heavy_rotation? @order
    @tracks = heavy_rotation
  else
    @tracks = order_by_params
  end

  render partial: "shared/results"
end
All #heavy_rotation does is call the various query methods. This makes it easy to add, modify, or delete any one of the query methods as criteria changes without affecting any other method.
def heavy_rotation
  week_ago = Time.now - 7.days
  two_weeks_ago = Time.now - 14.days
  three_months_ago = Time.now - 3.months

  tracks_top_licensed(date_range: three_months_ago, max_results: 5) +
    tracks_top_listens(date_range: two_weeks_ago, max_results: 3) +
    tracks_top_downloaded(date_range: two_weeks_ago, max_results: 2) +
    tracks_staff_picks(date_range: three_months_ago, max_results: 4)
end
Here's what one of the query methods looks like. They're all basically the same, but with custom SQL/ORM queries. You'll notice that I'm not setting the :limit parameter to the number of results that I want the query method to return. Doing so would create a problem if one of the records returned is duplicated by another query method, e.g. if the same track was returned by staff_picks and top_downloaded, because then I would have to make an additional query to get another record. That wouldn't be a wrong decision, just not the one I made.
def tracks_top_licensed(args = {})
  args = @default.merge args
  max = args[:max_results]
  date_range = args[:date_range]

  # Adds own offset key to #filter_tracks hash map => @tracks_offset
  @tracks_offset[:top_licensed] ||= 0

  unfiltered_results = Track.top_licensed
    .where("tracks.updated_at >= :date_range", date_range: date_range)
    .limit(@limit)
    .offset(@tracks_offset[:top_licensed])

  top_tracks = validate_and_return_top_results(unfiltered_results, max)

  # Add offset of your most recent query to the cumulative offset
  # so triggering 'view more'/pagination returns contiguous results
  @tracks_offset[:top_licensed] += top_tracks[:offset]

  top_tracks[:top_results]
end
In each query method, I'm cleaning the record objects through a custom method #validate_and_return_top_results. The validator checks the record objects for duplicates against the @result_track_ids collection in its ancestor method #filter_tracks. It then returns the number of records specified by its caller.
def validate_and_return_top_results(collection, max = 1)
  top_results = []
  i = 0 # offset incrementer

  until top_results.count >= max do
    # Checks if track has already appeared in the results
    unless @result_track_ids.include? collection[i].id
      # this will be returned to the caller
      top_results << collection[i]
      # this is the point of reference to validate your query method results
      @result_track_ids << collection[i].id
    end
    i += 1
  end

  { top_results: top_results, offset: i }
end

Memory Leak in Ruby net/ldap Module

As part of my Rails application, I've written a little importer that sucks in data from our LDAP system and crams it into a User table. Unfortunately, the LDAP-related code leaks huge amounts of memory while iterating over our 32K users, and I haven't been able to figure out how to fix the issue.
The problem seems to be related to the LDAP library in some way, as when I remove the calls to the LDAP stuff, memory usage stabilizes nicely. Further, the objects that are proliferating are Net::BER::BerIdentifiedString and Net::BER::BerIdentifiedArray, both part of the LDAP library.
When I run the import, memory usage eventually peaks at over 1GB. I need to find some way to correct my code if the problem is there, or to work around the LDAP memory issues if that's where the problem lies. (Or if there's a better LDAP library for large imports for Ruby, I'm open to that as well.)
Here's the pertinent bit of my code:
require 'net/ldap'
require 'pp'

class User < ActiveRecord::Base
  validates_presence_of :name, :login, :email

  # This method is responsible for populating the User table with the
  # login, name, and email of anybody who might be using the system.
  def self.import_all
    # initialization stuff. set bind_dn, bind_pass, ldap_host, base_dn and filter
    ldap = Net::LDAP.new
    ldap.host = ldap_host
    ldap.auth bind_dn, bind_pass
    ldap.bind

    begin
      # Build the list
      records = records_updated = new_records = 0
      ldap.search(:base => base_dn, :filter => filter) do |entry|
        name = entry.givenName.to_s.strip + " " + entry.sn.to_s.strip
        login = entry.name.to_s.strip
        email = login + "@txstate.edu"
        user = User.find_or_initialize_by_login :name => name, :login => login, :email => email
        if user.name != name
          user.name = name
          user.save
          logger.info( "Updated: " + email )
          records_updated = records_updated + 1
        elsif user.new_record?
          user.save
          new_records = new_records + 1
        else
          # update timestamp so that we can delete old records later
          user.touch
        end
        records = records + 1
      end

      # delete records that haven't been updated for 7 days
      records_deleted = User.destroy_all( ["updated_at < ?", Date.today - 7 ] ).size

      logger.info( "LDAP Import Complete: " + Time.now.to_s )
      logger.info( "Total Records Processed: " + records.to_s )
      logger.info( "New Records: " + new_records.to_s )
      logger.info( "Updated Records: " + records_updated.to_s )
      logger.info( "Deleted Records: " + records_deleted.to_s )
    end
  end
end
Thanks in advance for any help/pointers!
By the way, I did ask about this in the net/ldap support forum as well, but didn't get any useful pointers there.
One very important thing to note is that you never use the result of the method call. That means that you should pass :return_result => false to ldap.search:
ldap.search(:base => base_dn, :filter => filter, :return_result => false ) do |entry|
From the docs: "When :return_result => false, #search will return only a Boolean, to indicate whether the operation succeeded. This can improve performance with very large result sets, because the library can discard each entry from memory after your block processes it."
In other words, if you don't use this flag, all entries will be stored in memory, even if you do not need them outside the block! So, use this option.

Logging Search Results in a Rails Application

We're interested in logging and computing the number of times an item comes up in search or on a list page. With 50k unique visitors a day, we're expecting we could produce 3-4 million 'impressions' per day, which isn't a terribly high amount, but one we'd like to architect well.
We don't need to read this data in real time, but would like to be able to generate daily totals and analyze trends, etc. Similar to a business analytics tool.
We're planning to do this with an Ajax post after the page is rendered - this will allow us to count results even if those results are cached. We can do this in a single post per page, to send a comma delimited list of ids and their positions on the page.
I am hoping there is some sort of design pattern/gem/blog post about this that would help me avoid the common first-timer mistakes that may come up. I also don't really have much experience logging or reading logs.
My current strategy: make something to write events to a log file, and a background job to tally up the results at the end of the day and put the results back into MySQL.
Ok, I have three approaches for you:
1) Queues
In your AJAX Handler, write the simplest method possible (use a Rack Middleware or Rails Metal) to push the query params to a queue. Then, poll the queue and gather the messages.
Queue pushes from a rack middleware are blindingly fast. We use this on a very high traffic site for logging of similar data.
An example Rack middleware is below (extracted from our app; it can handle a request in <2ms or so):
class TrackingMiddleware
  CACHE_BUSTER = {"Cache-Control" => "no-cache, no-store, max-age=0, must-revalidate", "Pragma" => "no-cache", "Expires" => "Fri, 29 Aug 1997 02:14:00 EST"}
  IMAGE_RESPONSE_HEADERS = CACHE_BUSTER.merge("Content-Type" => "image/gif").freeze
  IMAGE_RESPONSE_BODY = [File.open(Rails.root + "public/images/tracker.gif").read].freeze

  def initialize(app)
    @app = app
  end

  def call(env)
    if env["PATH_INFO"] =~ %r{^/track.gif}
      request = Rack::Request.new(env)
      YOUR_QUEUE.push([Time.now, request.GET.symbolize_keys])
      # Rack expects [status, headers, body]
      [200, IMAGE_RESPONSE_HEADERS, IMAGE_RESPONSE_BODY]
    else
      @app.call(env)
    end
  end
end
For the queue I'd recommend Starling; I've had nothing but good times with it.
On the parsing end, I would use the super-poller toolkit - but I would say that, I wrote it.
2) Logs
Pass all the params along as query params to a static file (/1x1.gif?foo=1&bar=2&baz=3).
This will not hit the rails stack and will be blindingly fast.
When you need the data, just parse the log files!
This is the best scaling home brew approach.
3) Google Analytics
Why handle the load when Google will do it for you? You would be surprised at how good Google Analytics is - before you home brew anything, check it out!
This will scale infinitely, because Google buys servers faster than you do.
I could rant on this for ages, but I have to go now. Hope this helps!
Depending on the action required to list items, you might be able to do it in the controller and save yourself a round trip. You can do it with an after_filter, to make the addition unobtrusive.
This only works if all actions that list items you want to log, require parameters. This is because page caching ignores GET requests with parameters.
Assuming you only want to log search data on the search action.
class ItemsController < ApplicationController
  after_filter :log_searches, :only => :search

  def log_searches
    @items.each do |item|
      # write to log here
    end
  end

  ...
  # rest of controller remains unchanged
  ...
end
Otherwise you're right on track with the AJAX, and an onload remote function.
As for processing, you could use a rake task run by a cron job to collect statistics, and possibly update items with a popularity rating.
Either way you will want to read up on the Ruby Logging class. Learning about cron jobs and rake tasks wouldn't hurt either.
This is what I ultimately did - it was enough for our use for now, and with some simple benchmarking, I feel OK about it. We'll be watching to see how it does in production before we expose the results to our customers.
The components:
class EventsController < ApplicationController
  def create
    logger = Logger.new("#{RAILS_ROOT}/log/impressions/#{Date.today}.log")
    logger.info "#{DateTime.now.strftime} #{params[:ids]}" unless params[:ids].blank?
    render :nothing => true
  end
end
This is called from an ajax call in the site layout...
<% javascript_tag do %>
  var list = '';
  $$('div.item').each(function(item) { list += item.id + ','; });
  <%= remote_function(:url => { :controller => :events, :action => :create }, :with => "'ids=' + list" ) %>
<% end %>
Then I made a rake task to import these rows of comma delimited ids into the db. This is run the following day:
desc "Calculate impressions"
task :count_impressions => :environment do
date = ENV['DATE'] || (Date.today - 1).to_s # defaults to yesterday (yyyy-mm-dd)
file = File.new("log/impressions/#{date}.log", "r")
item_impressions = {}
while (line = file.gets)
ids_string = line.split(' ')[1]
next unless ids_string
ids = ids_string.split(',')
ids.each {|i| item_impressions[i] ||= 0; item_impressions[i] += 1 }
end
item_impressions.keys.each do |id|
ActiveRecord::Base.connection.execute "insert into item_stats(item_id, impression_count, collected_on) values('#{id}',#{item_impressions[id]},'#{date}')", 'Insert Item Stats'
end
file.close
end
One thing to note - the logger variable is declared in the controller action, not in environment.rb as you would normally do with a logger. I benchmarked this - 10,000 writes took about 20 seconds, averaging about 2 milliseconds per write. With the logger declared in environment.rb, it took about 14 seconds. We made this trade-off so we could dynamically determine the file name - an easy way to switch files at midnight.
Our main concern at this point - we have no idea how many different items will be counted per day, i.e. we don't know how long the tail is. This will determine how many rows are added to the db each day. We expect we'll need to limit how far back we keep daily reports and will roll up results even further at that point.

Resources