I am making a Rails application that crawls flight information from a specific website. The app can be found here: https://vemaybay.herokuapp.com/.
It only takes around 4-5 seconds to respond locally, but it takes 15-20 seconds when running on Heroku.
Is there any way to speed up this response time?
I have already changed from the free to the hobby dyno type to avoid spin-up costs, but I believe the DB connection and queries are not the root cause.
Is it a hosting problem? If so, I can think about buying different hosting.
Below is my example code:
FlightService
def crawl(from, to, date)
  return if flight_not_available?(from, to)

  begin
    selected_day = date.day - 1
    browser = ::Ferrum::Browser.new
    browser.headers.set({ "User-Agent" => "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36" })
    browser.goto("https://www.abay.vn/")
    browser.at_css("input#cphMain_ctl00_btnSearch").click
    browser.back
    browser.execute("document.getElementById('cphMain_ctl00_txtFrom').setAttribute('value','#{from}')")
    browser.execute("document.getElementById('cphMain_ctl00_txtTo').setAttribute('value','#{to}')")
    browser.execute("document.getElementById('cphMain_ctl00_cboDepartureDay').selectedIndex = #{selected_day}")
    browser.at_css("input#cphMain_ctl00_btnSearch").click
    # browser.execute("document.querySelectorAll('a.linkViewFlightDetail').forEach(btn => btn.click())")
    sleep(1)
    body = Nokogiri::HTML(browser.body)
    flight_numbers = body.css("table.f-result > tbody > tr.i-result > td.f-number").map(&:text)
    depart_times   = body.css("table.f-result > tbody > tr.i-result > td.f-time").map { |i| i.text.split(" - ").first }
    arrival_times  = body.css("table.f-result > tbody > tr.i-result > td.f-time").map { |i| i.text.split(" - ").second }
    base_prices    = body.css("table.f-result > tbody > tr.i-result > td.f-price").map(&:text)
    prices = base_prices
    store_flight(flight_numbers, from, to, date, depart_times, arrival_times, base_prices, prices)
    browser.quit
  rescue StandardError => e
    Rails.logger.error e.message
    fail_with_message(e.message)
    browser&.quit # browser may be nil if Ferrum failed to start
  end
end
Then in my controller I just call the crawl method to fetch data:
service = FlightService.new(from: @from, to: @to, departure_date: @departure_date, return_date: @return_date)
service.crawl_go_flights
@go_flights = service.go_flights
I would try adding the New Relic Heroku add-on; it will show you what takes the most time. Most likely it will be your Ruby code doing HTTP requests in a controller action to crawl the page.
Heroku tends to be slower than your own development machine because Heroku resources are shared across users unless you buy the more expensive performance (M/L) dynos.
Without seeing more of the crawling code it is hard to say where the bottleneck is. Do you crawl a single page or many pages (which would be slow)?
You can try moving the crawl logic to a background worker, for instance with the Sidekiq gem. You could crawl the page from time to time and store the results in your DB, so your controller action only reads data from your DB instead of crawling the page on every request. You could also run a rake task every 10 minutes via Heroku Scheduler to crawl the page instead of using Sidekiq (this might be faster to set up). I don't know if data that is up to 10 minutes old is good enough for your use case; you need to pick the technical solution that fits your business needs. With Sidekiq you could run jobs more often, for instance every minute, by triggering them with the clockwork gem.
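For example, a minimal Sidekiq sketch along those lines. The job class and the Flight model it reads back are illustrative (not from your app); it simply reuses the FlightService from the question:
```ruby
# Illustrative sketch: crawl in the background, serve cached rows from the DB.
class CrawlFlightsJob
  include Sidekiq::Worker

  def perform(from, to, departure_date, return_date = nil)
    FlightService.new(from: from, to: to,
                      departure_date: Date.parse(departure_date),
                      return_date: return_date && Date.parse(return_date)).crawl_go_flights
  end
end

# The controller action then only reads whatever the last crawl stored, e.g.:
# @go_flights = Flight.where(from: @from, to: @to, departure_date: @departure_date)
```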
I inherited a Rails app that is deployed on Heroku (I think). I edit it in AWS's Cloud9 IDE and, for now, just do everything in development mode. The app's purpose is to process large amounts of survey data and spit it out into a PDF report. This works for small reports with around 10 rows of data, but when I load a report that queries a data upload of 5000+ rows to create an HTML page which gets converted to a PDF, it takes around 105 seconds, much longer than Heroku's 30-second limit for HTTP requests.
Heroku says this on their website, which gave me some hope:
"Heroku supports HTTP 1.1 features such as long-polling and streaming responses. An application has an initial 30 second window to respond with a single byte back to the client. However, each byte transmitted thereafter (either received from the client or sent by your application) resets a rolling 55 second window. If no data is sent during the 55 second window, the connection will be terminated." (Source: https://devcenter.heroku.com/articles/request-timeout#long-polling-and-streaming-responses)
This sounds excellent to me: I can just send a byte to the client every second or so in a loop until we're done creating the large PDF report. However, I don't know how to send or receive a byte or so to "reset the rolling 55 second window" they're talking about.
Here's the part of my controller that renders the response.
return render pdf: pdf_name + " " + pdf_year.to_s,
              disposition: 'attachment',
              page_height: 1300,
              encoding: 'utf8',
              page_size: 'A4',
              footer: { html: { template: 'recent_grad/footer.html.erb' }, spacing: 0 },
              margin: { top: 10, # default 10 (mm)
                        bottom: 20,
                        left: 10,
                        right: 10 },
              template: "recent_grad/report.html.erb",
              locals: { start: @start, survey: @survey, years: @years, college: @college, department: @department, program: @program, emphasis: @emphasis, questions: @questions }
I'm making other requests to get to this point, but I believe the part that is causing the issue is here where the template is being rendered. My template queries the database in a finite loop that stops when it runs out of survey questions to query from.
My question is this: how can I "send or receive a byte to the client" to tell Heroku "I'm still trying to create this massive PDF, so please reset the timer and give me my 55 seconds!"? Is it in the form of a query? Because, if so, I am querying the MySQL database over and over again in my report.html.erb file.
Also, it used to work without issues and still works for small reports, but now I get a "504 Gateway Timeout" error before the request completes on the actual page, while my Puma console continues to query the database like a madman. I assume it's a Heroku problem because the 504 error happens at exactly 35 seconds every time (5 seconds to process the other parts and 30 seconds trying to finish the loop in the template so it can render correctly).
If you need more information or code, please ask! Thanks in advance.
EDIT:
Both of the comments below suggest possible duplicates, but neither of them has a real answer with real code; they simply refer to the docs that I am quoting here. I'm looking for a code example (or at least a way to get my foot in the door), not just a link to the docs. Thanks!
EDIT 2:
I tried what @Sergio said and installed Sidekiq. I think I'm really close, but I'm still having some issues with the worker. The worker doesn't have access to ActionView::Base, which is required for the render method in Rails, so it's not working. I can access the worker method, which means my Sidekiq and Redis servers are running correctly, but it gets caught on the ActionView line with this error:
WARN: NameError: uninitialized constant HardWorker::ActionView
Here's the worker code:
require 'sidekiq'

Sidekiq.configure_client do |config|
  # config.redis = { db: 1 }
  config.redis = { url: 'redis://172.31.6.51:6379/0' }
end

Sidekiq.configure_server do |config|
  # config.redis = { db: 1 }
  config.redis = { url: 'redis://172.31.6.51:6379/0' }
end

class HardWorker
  include Sidekiq::Worker

  def perform(pdf_name, pdf_year)
    av = ActionView::Base.new
    av.view_paths = ActionController::Base.view_paths
    av.class_eval do
      include Rails.application.routes.url_helpers
      include ApplicationHelper
    end
    puts "inside hardworker"
    puts pdf_name, pdf_year
    av.render pdf: pdf_name + " " + pdf_year.to_s,
              disposition: 'attachment',
              page_height: 1300,
              encoding: 'utf8',
              page_size: 'A4',
              footer: { html: { template: 'recent_grad/footer.html.erb' }, spacing: 0 },
              margin: { top: 10, # default 10 (mm)
                        bottom: 20,
                        left: 10,
                        right: 10 },
              template: "recent_grad/report.html.erb",
              locals: { start: @start, survey: @survey, years: @years, college: @college, department: @department, program: @program, emphasis: @emphasis, questions: @questions }
  end
end
Any suggestions?
EDIT 3:
I did what @Sergio said and attempted to make a PDF from an html.erb file directly and save it to a file. Here's my code:
# /app/controllers/recentgrad_controller.rb
pdf = WickedPdf.new.pdf_from_html_file('home/ec2-user/environment/gradSurvey/gradSurvey/app/views/recent_grad/report.html.erb')
save_path = Rails.root.join('pdfs', pdf_name + pdf_year.to_s + '.pdf')
File.open(save_path, 'wb') do |file|
  file << pdf
end
And the error output:
RuntimeError (Failed to execute:
["/usr/local/rvm/gems/ruby-2.4.1#gradSurvey/bin/wkhtmltopdf", "file:///home/ec2-user/environment/gradSurvey/gradSurvey/app/views/recent_grad/report.html.erb", "/tmp/wicked_pdf_generated_file20190523-15416-hvb3zg.pdf"]
Error: PDF could not be generated!
Command Error: Loading pages (1/6)
Error: Failed loading page file:///home/ec2-user/environment/gradSurvey/gradSurvey/app/views/recent_grad/report.html.erb (sometimes it will work just to ignore this error with --load-error-handling ignore)
Exit with code 1 due to network error: ContentNotFoundError
):
I have no idea what it means when it says "sometimes it will work just to ignore this error with --load-error-handling ignore". The file definitely exists and I've tried maybe 5 variations of the file path.
I've had to do something like this several times. In all cases, I ended up writing a background job that does all the heavy generation work. And because it's not a web request, it's not affected by the 30-second timeout. It goes something like this:
client (your JavaScript code) requests a new report.
server generates a job description and enqueues it for your worker to pick up.
worker picks the job from the queue and starts working (querying the database, etc.)
in the meantime, the client periodically asks the server "is my report done yet?". Server responds with "not yet, try again later".
worker finishes generating the report. It uploads the file to some storage (S3, for example), sets the job status to "completed" and the job result to the download link for the uploaded report file.
server, seeing that the job is completed, can now respond to the client's status update requests with "yes, it's done now. Here's the URL. Have a good day."
Everybody's happy. And nobody had to do any streaming or play with Heroku's rolling response timeouts.
The scenario above uses short-polling. I find it the easiest to implement. But it is, of course, a bit wasteful with regard to resources. You can use long-polling or websockets or other fancy things.
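A rough sketch of the server side of that flow, assuming a Report model with status/file_url columns and a GenerateReportJob worker (all names here are illustrative, not from your app):
```ruby
class ReportsController < ApplicationController
  # 1-2. client asks for a report, server enqueues the job and returns an id
  def create
    report = Report.create!(status: "pending")
    GenerateReportJob.perform_async(report.id)
    render json: { id: report.id, status: report.status }
  end

  # 4-6. client polls this endpoint until the job has finished
  def show
    report = Report.find(params[:id])
    if report.status == "completed"
      render json: { status: "completed", url: report.file_url }
    else
      render json: { status: report.status } # "not yet, try again later"
    end
  end
end
```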
Check my response here in case it works for you. I didn't want to change the user workflow by adding a background job and then a place/notification to get the result.
I use Rails' controller streaming support with the Live module and set the right response headers. I fetch the data from an Enumerable object.
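A minimal sketch of that approach; report_chunks stands in for whatever Enumerable produces your data, and the controller name is illustrative:
```ruby
class StreamedReportsController < ApplicationController
  include ActionController::Live

  def show
    response.headers["Content-Type"]  = "text/event-stream"
    response.headers["Cache-Control"] = "no-cache"
    report_chunks.each do |chunk|    # report_chunks: any Enumerable yielding strings
      response.stream.write(chunk)   # each write resets Heroku's rolling 55s window
    end
  ensure
    response.stream.close
  end
end
```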
I am seeing some very slow cache reads in my Rails app. Both Redis (redis-rails) and memcached (dalli) produced the same results.
It looks like it is only the first call to Rails.cache that causes the slowness (averaging 500 ms).
I am using Skylight to instrument my app, and I can see the slowness in its trace graph.
I have a Rails.cache.fetch call in this code, but when I benchmark it I see it average around 8ms, which matches what memcache-top shows for my average call time.
I thought this might be dalli connections opening slowly, but benchmarking that didn't show anything slow either. I'm at a loss for what else to check into.
Does anyone have any good techniques for tracking this sort of thing down in a Rails app?
Edit #1
The memcache servers are stored in ENV['MEMCACHE_SERVERS'], and all of them are in the us-east-1 data center.
Cache config looks like:
config.cache_store = :dalli_store, nil, { expires_in: 1.day, compress: true }
I ran something like:
100000.times { Rails.cache.fetch('something') }
and calculated the average timings, getting something on the order of 8 ms when running on one of my web servers.
To test my theory that the first request is slow, I opened a console on my web server and ran the following as the first command:
irb(main):002:0> Benchmark.ms { Rails.cache.fetch('someth') { 1 } }
Dalli::Server#connect my-cache.begfpc.0001.use1.cache.amazonaws.com:11211
=> 12.043342
Edit #2
OK, I split the fetch out into a separate read and write and tracked them independently with statsd (a rough sketch of the tracking is below the graph). It looks like the averages sit around what I would expect, but the max times on the read are very spiky and get up into the 500 ms range.
http://s16.postimg.org/5xlmihs79/Screen_Shot_2014_12_19_at_6_51_16_PM.png
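A simplified sketch of tracking the read and write timings independently via Rails' cache instrumentation events; STATSD here is assumed to be any client that responds to #timing (e.g. statsd-ruby), and the threshold is arbitrary:
```ruby
%w[cache_read.active_support cache_write.active_support].each do |event|
  ActiveSupport::Notifications.subscribe(event) do |name, start, finish, _id, payload|
    duration_ms = (finish - start) * 1000
    STATSD.timing("rails.#{name.split('.').first}", duration_ms)
    Rails.logger.warn("slow #{name} key=#{payload[:key]} #{duration_ms.round(1)}ms") if duration_ms > 250
  end
end
```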
I have a fairly simple Rails app. It listens for requests in the form
example.com/items?access_key=my_secret_key
My application controller looks at the secret key to determine which user is making the call, looks up their database credentials, and connects to the appropriate database to get that person's items.
However, we need this to support multiple requests at a time, and Puma seems to be everyone's favorite / the fastest server for us to use. We started running into problems when benchmarking it with ApacheBench. FYI, Puma is configured with 3 workers and min=1, max=16 threads.
If I were to run
ab -n 100 -c 10 127.0.0.1:3000/items?access_key=my_key
then this error is thrown with a whole lot of stack trace after it:
/home/user/.gem/ruby/2.0.0/gems/mysql2-0.3.16/lib/mysql2/client.rb:70: [BUG] Segmentation fault
ruby 2.0.0p353 (2013-11-22 revision 43784) [x86_64-linux]
Edit: This line also appears in the enormous amount of info that the error contains:
*** glibc detected *** puma: cluster worker 1: 17088: corrupted double-linked list: 0x00007fb671ddbd60
And it looks to me like that's tripping multiple times. I have been unable to determine exactly when (on which requests) it trips.
The benchmarking seems to still finish, but it seems quite slow (from ab):
Concurrency Level: 10
Time taken for tests: 21.085 seconds
Complete requests: 100
Total transferred: 3620724 bytes
21 seconds for 3 megabytes? Even if MySQL were being slow, that's... bad. But I think it's worse than that: the amount of data isn't high enough. There are no segfaults when I run with concurrency 1, and the amount of data for -n 10 -c 1 is 17 megabytes. So Puma is responding with some error page that I can't see; running 'curl address' gives me the expected data, and I can't manually do concurrency.
It gets worse when I run more requests or higher concurrency.
ab -n 1000 -c 10 127.0.0.1:3000/items?access_key=my_key
yields
apr_socket_recv: Connection reset by peer (104)
Total of 199 requests completed
and
ab -n 100 -c 50 127.0.0.1:3000/items?access_key=my_key
yields
apr_socket_recv: Connection reset by peer (104)
Total of 6 requests completed
Running top in another PuTTY window shows me that very often (most times I try to benchmark) only one of the three workers Puma created is doing any work. Rarely, all three do.
Because it seems like the error might be somewhere in here, I'll show you my application controller. It's short, but it is the bulk of the application (which, like I said, is fairly simple).
class ApplicationController < ActionController::Base
  # Prevent CSRF attacks by raising an exception.
  # For APIs, you may want to use :null_session instead.
  protect_from_forgery with: :exception

  def get_yaml_params
    YAML.load(File.read("#{APP_ROOT}/config/ecommerce_config.yml"))
  end

  def access_key_login
    access_key = params[:access_key]
    unless access_key
      show_error("missing access_key parameter")
      return false
    end
    access_info = get_yaml_params
    unless client_login = access_info[access_key]
      show_error("invalid access_key")
      return false
    end
    status = ActiveRecord::Base.establish_connection(
      :adapter  => "mysql2",
      :host     => client_login["host"],
      :username => client_login["username"],
      :password => client_login["password"],
      :database => client_login["database"]
    )
  end

  def generate_json(columns, footer)
    # config/application.rb includes the line
    # require 'json'
    query = "select"
    columns.each do |column, name|
      query += " #{column}"
      query += " AS #{name}" unless column == name
      query += ","
    end
    query = query[0..-2] # trim trailing ','
    query += " #{footer}"
    dbh = ActiveRecord::Base.connection
    results = dbh.select_all(query).to_hash
    data = results.map do |result|
      columns.map { |column, name| result[name] }
    end
    ({ "fields" => columns.values, "values" => data }).to_json
  end

  def show_error(msg)
    render(:text => "Error: #{msg}\n")
    nil
  end
end
And here is an example of a controller that uses it:
class CategoriesController < ApplicationController
  def index
    access_key_login or return
    columns = {
      "prd_type" => "prd_type",
      "prd_type_description" => "description"
    }
    footer = "from cin_desc;"
    json = generate_json(columns, footer)
    render(:json => json)
  end
end
That's pretty much it as far as custom code goes. I can't find anything that makes this not threadsafe, so I don't know what is causing the segfaults. I don't know why not all of the workers spin up when requests are made. I don't know what error is getting returned to ApacheBench. Thanks for helping; I can post more information as you need it.
It appears that the stable version of the mysql2 library, 0.3.17, is NOT threadsafe. Until it is updated to be threadsafe, using it with multi-threaded Puma will be impossible. An alternative would be to use Unicorn.
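For reference, a minimal sketch of the Unicorn alternative: a multi-process setup, so the non-threadsafe mysql2 build never shares a connection across threads. The worker count and values are illustrative.
```ruby
# config/unicorn.rb -- illustrative values only
worker_processes Integer(ENV["WEB_CONCURRENCY"] || 3)
timeout 30
preload_app true

before_fork do |_server, _worker|
  # Disconnect in the master so each worker opens its own connection after fork
  ActiveRecord::Base.connection_pool.disconnect! if defined?(ActiveRecord::Base)
end

after_fork do |_server, _worker|
  ActiveRecord::Base.establish_connection if defined?(ActiveRecord::Base)
end
```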
I would like to up/down-scale my dynos automatically depending on the size of the pending list.
I heard about HireFire, but it only scales once per minute, and I need it to be (almost) real time.
I would like to scale my dynos so that the pending list is almost always empty.
I was thinking about doing it myself (with a scheduler running every ~15 s and the Heroku API), because I'm not sure there is anything out there; if not, do you know of any monitoring tool that could send an email alert if the queue length exceeds a fixed size (similar to Apdex alerts on New Relic)?
A potential custom-code solution is included below. There are also two New Relic plugins that do Resque monitoring; I'm not sure if either does email alerts based on exceeding a certain queue size. Using Resque hooks you could output log messages that trigger email alerts (or Slack, HipChat, PagerDuty, etc.) via a service like Papertrail or Loggly. This might look something like:
def after_enqueue_pending_check(*args)
  job_count = Resque.info[:pending].to_i
  if job_count > PENDING_THRESHOLD
    Rails.logger.warn('pending queue threshold exceeded')
  end
end
Instead of logging, you could send an email, but without some sort of rate limiting on the emails you could easily get flooded if the pending queue grows rapidly.
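If you do go the email route, one option is to throttle the alert with a short-lived cache key, building on the hook above. AlertMailer and the 10-minute window are illustrative, not part of any existing app:
```ruby
def after_enqueue_pending_check(*args)
  job_count = Resque.info[:pending].to_i
  return unless job_count > PENDING_THRESHOLD

  # Only alert if we haven't alerted in the last 10 minutes
  if Rails.cache.read("resque_pending_alert_sent").nil?
    AlertMailer.pending_queue_alert(job_count).deliver_now # hypothetical mailer; use .deliver on older Rails
    Rails.cache.write("resque_pending_alert_sent", true, expires_in: 10.minutes)
  end
end
```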
I don't think there is a Heroku add-on or other service that can do the scaling in real time. There is a gem that will do this using the deprecated Heroku API, but you can also do it yourself using Resque hooks and the Heroku platform-api gem. This untested example uses platform-api to scale the 'worker' dynos up and down. Just as an example, I included 1 worker for every three pending jobs. The downscale only resets the workers to 1 when there are no pending jobs and no working jobs; this is not ideal and should be updated to fit your needs. See here for information about ensuring that you don't lose jobs when scaling down workers: http://quickleft.com/blog/heroku-s-cedar-stack-will-kill-your-resque-workers
require 'platform-api'

def after_enqueue_upscale(*args)
  heroku = PlatformAPI.connect_oauth('OAUTH_TOKEN')
  worker_count = heroku.formation.info('app-name', 'worker')["quantity"]
  job_count = Resque.info[:pending].to_i

  # one worker for every 3 jobs (minimum of 1)
  new_worker_count = ((job_count / 3) + 1).to_i
  return if new_worker_count <= worker_count

  heroku.formation.update('app-name', 'worker', { "quantity" => new_worker_count })
end

def after_perform_downscale
  heroku = PlatformAPI.connect_oauth('OAUTH_TOKEN')

  if Resque.info[:pending].to_i == 0 && Resque.info[:working].to_i == 0
    heroku.formation.update('app-name', 'worker', { "quantity" => 1 })
  end
end
I'm having a similar issue and ran into HireFire:
https://www.hirefire.io/.
For Ruby, use:
https://github.com/hirefire/hirefire-resource
It theoretically works like Adept Scale (https://www.adeptscale.com/). However, HireFire can also scale workers and doesn't limit itself to just web dynos. Hope this helps!
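Setup is roughly an initializer like the following, based on the pattern in the hirefire-resource documentation; double-check the macro name against the gem version you install:
```ruby
# config/initializers/hirefire.rb -- rough sketch, verify against the gem's README
HireFire::Resource.configure do |config|
  config.dyno(:worker) do
    # number of pending Resque jobs; HireFire sizes the worker formation from this
    HireFire::Macro::Resque.queue
  end
end
```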
I use the Mechanize gem to crawl websites. I wrote a very simple, single-threaded crawler inside a Rails rake task because I needed access to my Rails models.
The crawler runs just fine, but after watching it run for a while I can see that it eats more and more RAM over time, which is bad.
I use the God gem to monitor my crawler.
Below is my rake task code; I'm wondering if it has any chance of leaking memory.
task :abc => :environment do
  prefix_url = 'http://example.com/abc-'
  postfix_url = '.html'
  from_page_id = (AppConfig.last_crawled_id || 1) + 1
  to_page_id = 100000

  agent = Mechanize.new
  agent.user_agent_alias = 'Mac Safari'

  (from_page_id..to_page_id).each do |i|
    url = "#{prefix_url}#{i}#{postfix_url}"
    puts "#{Time.now} - Crawl #{url}"
    page = agent.get(url)

    page.search('#content > ul').each do |s|
      var   = s.css('li')[0].text()
      value = s.css('li')[1].text()
      MyModel.create :var => var, :value => value
    end

    AppConfig.last_crawled_id = i
  end

  # Finish crawling, let's stop
  `god stop crawl_abc`
end
Unless you've got the very latest version of Mechanize (2.1.1 was released only a day or so ago), by default Mechanize operates with an unlimited history size, i.e. it keeps all the pages you have visited, and so it will gradually use more and more memory.
In your case there isn't any point to this, so calling max_history= on your agent should limit how much memory is used in this fashion. For example:
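```ruby
# In the rake task above, right after creating the agent
# (1 is an arbitrary small size; the default in older versions is unlimited):
agent = Mechanize.new
agent.user_agent_alias = 'Mac Safari'
agent.max_history = 1 # keep only the most recent page instead of every page visited
```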