I have an issue with importing a large number of records from a user-provided Excel file into a database. The logic for this works fine, and I'm using activerecord-import to cut down on the number of database calls. However, when a file is too large, the processing takes too long and Heroku returns a timeout. The solution: Resque, and moving the processing to a background job.
So far, so good. I've needed to add CarrierWave to upload the files to S3, because I can't just hold the file in memory for the background job. The upload portion also works fine: I created a model for the uploads and pass the IDs through to the queued job to retrieve the file later, since I understand I can't pass a whole ActiveRecord object to a job.
I've installed Resque and Redis locally, and everything seems to be set up correctly in that regard. I can see the jobs I'm creating being queued and then run without failing. The job seems to run fine, but no records are added to the database. If I run the code from my job line by line in the console, the records are added to the database as I would expect. But when the queued jobs I'm creating run, nothing happens.
I can’t quite work out where the problem might be.
Here’s my upload controller’s create action:
def create
  @upload = Upload.new(upload_params)
  if @upload.save
    Resque.enqueue(ExcelImportJob, @upload.id)
    flash[:info] = 'File uploaded.
Data will be processed and added to the database.'
    redirect_to root_path
  else
    flash[:warning] = 'Upload failed. Please try again.'
    render :new
  end
end
This is a simplified version of the job with fewer sheet columns for clarity:
class ExcelImportJob < ApplicationJob
  @queue = :default

  def perform(upload_id)
    # Dig through CarrierWave's wrappers (uploader -> storage -> file)
    # to get at the underlying file
    file = Upload.find(upload_id).file.file.file
    data = parse_excel(file)
    if header_matches? data
      # Create a database entry for each row, ignoring the first header row,
      # using activerecord-import to batch the inserts
      sales = []
      data.drop(1).each_with_index do |row, index|
        sales << Sale.new(row)
        if index % 2500 == 0
          Sale.import sales
          sales = []
        end
      end
      Sale.import sales
    end
  end

  def parse_excel(upload)
    # Open the uploaded Excel document
    doc = Creek::Book.new upload
    # Map rows to the hash keys from the database
    doc.sheets.first.rows.map do |row|
      { date: row.values[0],
        title: row.values[1],
        author: row.values[2],
        isbn: row.values[3],
        release_date: row.values[5],
        units_sold: row.values[6],
        units_refunded: row.values[7],
        net_units_sold: row.values[8],
        payment_amount: row.values[9],
        payment_amount_currency: row.values[10] }
    end
  end

  # Returns true if the header matches the expected format
  def header_matches?(data)
    data.first == { date: 'Date',
                    title: 'Title',
                    author: 'Author',
                    isbn: 'ISBN',
                    release_date: 'Release Date',
                    units_sold: 'Units Sold',
                    units_refunded: 'Units Refunded',
                    net_units_sold: 'Net Units Sold',
                    payment_amount: 'Payment Amount',
                    payment_amount_currency: 'Payment Amount Currency' }
  end
end
I could probably improve the logic anyway, since right now I'm holding the whole file in memory, but that isn't the issue I'm having: even with a small file of only 500 or so rows, the job doesn't add anything to the database.
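For what it's worth, a refactor along these lines might avoid holding everything in memory, assuming Creek's rows enumerator streams lazily (the batch size is arbitrary, and row_to_attributes is a hypothetical helper standing in for the mapping already done in parse_excel):

# Untested sketch: stream rows in fixed-size batches instead of building
# one big array of Sale objects.
rows = Creek::Book.new(file).sheets.first.rows
rows.lazy.drop(1).each_slice(2500) do |batch|
  Sale.import(batch.map { |row| Sale.new(row_to_attributes(row)) })
end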
Like I said, my code worked fine when I wasn't using a background job, and it still works if I run it in the console. But for some reason the job does nothing.
This is my first time using Resque, so I don't know if I'm missing something obvious. I did create a worker, and as I said it does seem to run the job. Here's the output from Resque's verbose formatter:
*** resque-1.27.4: Waiting for default
*** Checking default
*** Found job on default
*** resque-1.27.4: Processing default since 1508342426 [ExcelImportJob]
*** got: (Job{default} | ExcelImportJob | [15])
*** Running before_fork hooks with [(Job{default} | ExcelImportJob | [15])]
*** resque-1.27.4: Forked 63706 at 1508342426
*** Running after_fork hooks with [(Job{default} | ExcelImportJob | [15])]
*** done: (Job{default} | ExcelImportJob | [15])
In the Resque dashboard the jobs aren’t logged as failed. They get executed and I can see an increment in the ‘processed’ jobs on the stats page. But as I say the DB remains untouched. What’s going on? How can I debug the job more clearly? Is there a way to get into it with Pry?
It looks like my problem was with Resque.enqueue(ExcelImportJob, @upload.id).
I changed my code to ExcelImportJob.perform_later(@upload.id) and now my code actually runs!
I also added a resque.rake task to lib/tasks as described here: http://bica.co/2015/01/20/active-job-resque/.
That link also notes how to use rails runner to call the job without running the full Rails server, which is useful for debugging.
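For anyone else landing here, the rake task from that post boils down to something like this (a sketch; see the post for the full details):

# lib/tasks/resque.rake
require 'resque/tasks'

# Load the Rails environment before the worker boots, so jobs can use
# ActiveRecord models.
task 'resque:setup' => :environment

And to trigger the job for debugging without a worker, something along the lines of:

rails runner 'ExcelImportJob.perform_now(Upload.last.id)'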
Strangely, I didn't quite manage to get the job to print anything to STDOUT as suggested by @hoffm, but at least it led me down a good avenue of inquiry.
I still don't fully understand why calling Resque.enqueue still added my jobs to the queue and indeed seemed to run them, yet the code wasn't executed, so if someone has a better grasp and an explanation, that would be much appreciated.
TL;DR: calling perform_later rather than Resque.enqueue fixed the problem but I don't know why.
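My best guess at an explanation, offered as a sketch rather than a definitive answer: plain Resque expects the enqueued class to implement a class-level self.perform, which the worker calls directly, while an ActiveJob subclass defines an instance-level perform and is meant to be enqueued through perform_later so ActiveJob's Resque adapter can wrap it. Roughly:

# A plain Resque worker: Resque.enqueue(PlainImportJob, 15) works here,
# because the worker process calls PlainImportJob.perform(15) directly.
class PlainImportJob
  @queue = :default

  def self.perform(upload_id)
    # import logic
  end
end

# An ActiveJob subclass: perform is an instance method, so there is no
# class-level perform for the worker to call when the class is enqueued
# directly with Resque.enqueue.
class ExcelImportJob < ApplicationJob
  queue_as :default

  def perform(upload_id)
    # import logic
  end
end

ExcelImportJob.perform_later(upload.id) # let the adapter do the wrapping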
Related
I inherited a Rails app that is deployed using Heroku (I think). I edit it on AWS's Cloud9 IDE and, for now, just do everything in development mode. The app's purpose is to process large amounts of survey data and spit it out into a PDF report. This works for small reports with around 10 rows of data, but when I load a report that queries a data upload of 5000+ rows to create an HTML page which gets converted to a PDF, it takes around 105 seconds, much longer than the 30 seconds Heroku allots for HTTP requests.
Heroku says this on their website, which gave me some hope:
"Heroku supports HTTP 1.1 features such as long-polling and streaming responses. An application has an initial 30 second window to respond with a single byte back to the client. However, each byte transmitted thereafter (either received from the client or sent by your application) resets a rolling 55 second window. If no data is sent during the 55 second window, the connection will be terminated." (Source: https://devcenter.heroku.com/articles/request-timeout#long-polling-and-streaming-responses)
This sounds excellent to me: I can just send a byte to the client every second or so in a loop until we're done creating the large PDF report. However, I don't know how to send or receive a byte to "reset the rolling 55 second window" they're talking about.
Here's the part of my controller that is sending the request.
return render pdf: pdf_name + " " + pdf_year.to_s,
              disposition: 'attachment',
              page_height: 1300,
              encoding: 'utf8',
              page_size: 'A4',
              footer: { html: { template: 'recent_grad/footer.html.erb' }, spacing: 0 },
              margin: { top: 10, # default 10 (mm)
                        bottom: 20,
                        left: 10,
                        right: 10 },
              template: "recent_grad/report.html.erb",
              locals: { start: @start, survey: @survey, years: @years, college: @college, department: @department, program: @program, emphasis: @emphasis, questions: @questions }
I'm making other requests to get to this point, but I believe the part causing the issue is here, where the template is rendered. My template queries the database in a finite loop that stops when it runs out of survey questions to query.
My question is this: how can I "send or receive a byte to the client" to tell Heroku "I'm still trying to create this massive PDF, so please reset the timer and give me my 55 seconds!"? Is it in the form of a query? Because, if so, I am already querying the MySQL database over and over again in my report.html.erb file.
Also, it used to work without issues and still works on small reports, but now I get a "504 Gateway Timeout" error before the request completes on the actual page, while my Puma console continues to query the database like a madman. I assume it's a Heroku problem because the 504 error happens after exactly 35 seconds every time (5 seconds to process the other parts and 30 seconds trying to finish the loop in the template so it can render correctly).
If you need more information or code, please ask! Thanks in advance.
EDIT:
Both of the comments below suggest possible duplicates, but neither of them has a real answer with real code; they simply refer to the docs that I am quoting here. I'm looking for a code example (or at least a way to get my foot in the door), not just a link to the docs. Thanks!
EDIT 2:
I tried what @Sergio said and installed Sidekiq. I think I'm really close, but I'm still having some issues with the worker. The worker doesn't have access to ActionView::Base, which is required for the render method in Rails, so it's not working. I can access the worker method, which means my Sidekiq and Redis servers are running correctly, but it gets caught on the ActionView line with this error:
WARN: NameError: uninitialized constant HardWorker::ActionView
Here's the worker code:
require 'sidekiq'

Sidekiq.configure_client do |config|
  # config.redis = { db: 1 }
  config.redis = { url: 'redis://172.31.6.51:6379/0' }
end

Sidekiq.configure_server do |config|
  # config.redis = { db: 1 }
  config.redis = { url: 'redis://172.31.6.51:6379/0' }
end

class HardWorker
  include Sidekiq::Worker

  def perform(pdf_name, pdf_year)
    av = ActionView::Base.new
    av.view_paths = ActionController::Base.view_paths
    av.class_eval do
      include Rails.application.routes.url_helpers
      include ApplicationHelper
    end
    puts "inside hardworker"
    puts pdf_name, pdf_year
    av.render pdf: pdf_name + " " + pdf_year.to_s,
              disposition: 'attachment',
              page_height: 1300,
              encoding: 'utf8',
              page_size: 'A4',
              footer: { html: { template: 'recent_grad/footer.html.erb' }, spacing: 0 },
              margin: { top: 10, # default 10 (mm)
                        bottom: 20,
                        left: 10,
                        right: 10 },
              template: "recent_grad/report.html.erb",
              locals: { start: @start, survey: @survey, years: @years, college: @college, department: @department, program: @program, emphasis: @emphasis, questions: @questions }
  end
end
Any suggestions?
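One avenue I'm exploring (a sketch, not working code): Rails 5 exposes a renderer that can be used outside controllers, which might sidestep building ActionView::Base by hand. The NameError above also suggests the constant lookup inside the worker class needs the top-level namespace (::ActionView::Base). Assuming the wicked_pdf gem and my existing template:

class HardWorker
  include Sidekiq::Worker

  def perform(pdf_name, pdf_year)
    # Render the template without instantiating ActionView::Base directly.
    html = ApplicationController.render(
      template: 'recent_grad/report',
      locals: { start: nil, survey: nil } # real values must be passed in as job arguments
    )
    pdf = WickedPdf.new.pdf_from_string(html)
    File.binwrite(Rails.root.join('pdfs', "#{pdf_name} #{pdf_year}.pdf"), pdf)
  end
end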
EDIT 3:
I did what @Sergio said and attempted to make a PDF from an html.erb file directly and save it to a file. Here's my code:
# /app/controllers/recentgrad_controller.rb
pdf = WickedPdf.new.pdf_from_html_file('home/ec2-user/environment/gradSurvey/gradSurvey/app/views/recent_grad/report.html.erb')
save_path = Rails.root.join('pdfs', pdf_name + pdf_year.to_s + '.pdf')
File.open(save_path, 'wb') do |file|
  file << pdf
end
And the error output:
RuntimeError (Failed to execute:
["/usr/local/rvm/gems/ruby-2.4.1#gradSurvey/bin/wkhtmltopdf", "file:///home/ec2-user/environment/gradSurvey/gradSurvey/app/views/recent_grad/report.html.erb", "/tmp/wicked_pdf_generated_file20190523-15416-hvb3zg.pdf"]
Error: PDF could not be generated!
Command Error: Loading pages (1/6)
Error: Failed loading page file:///home/ec2-user/environment/gradSurvey/gradSurvey/app/views/recent_grad/report.html.erb (sometimes it will work just to ignore this error with --load-error-handling ignore)
Exit with code 1 due to network error: ContentNotFoundError
):
I have no idea what it means when it says "sometimes it will work just to ignore this error with --load-error-handling ignore". The file definitely exists and I've tried maybe 5 variations of the file path.
I've had to do something like this several times. In all cases, I ended up writing a background job that does all the heavy lifting of the generation. And because it's not a web request, it's not affected by the 30-second timeout. It goes something like this:
The client (your JavaScript code) requests a new report.
The server generates a job description and enqueues it for your worker to pick up.
The worker picks the job from the queue and starts working (querying the database, etc.).
In the meanwhile, the client periodically asks the server "is my report done yet?". The server responds with "not yet, try again later".
The worker finishes generating the report. It uploads the file to some storage (S3, for example), sets the job status to "completed", and sets the job result to the download link for the uploaded report file.
The server, seeing that the job is completed, can now respond to client status-update requests with "yes, it's done now. Here's the URL. Have a good day."
Everybody's happy. And nobody had to do any streaming or play with Heroku's rolling response timeouts.
The scenario above uses short-polling. I find it the easiest to implement. But it is, of course, a bit wasteful with regard to resources. You can use long-polling or websockets or other fancy things.
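To make the flow concrete, here is a minimal sketch of the short-polling pieces. All the names (a Report model with status and url columns, ReportJob, the two helper methods) are hypothetical, and the PDF generation and upload steps are stubbed out:

class ReportsController < ApplicationController
  # POST /reports: enqueue the generation and hand back an id to poll.
  def create
    report = Report.create!(status: 'pending')
    ReportJob.perform_later(report.id)
    render json: { id: report.id }, status: :accepted
  end

  # GET /reports/:id/status: the client polls this every few seconds.
  def status
    report = Report.find(params[:id])
    render json: { status: report.status, url: report.url }
  end
end

class ReportJob < ApplicationJob
  def perform(report_id)
    report = Report.find(report_id)
    pdf = generate_pdf(report)   # the slow part, now off the request cycle
    url = upload_to_storage(pdf) # e.g. S3; returns a download URL
    report.update!(status: 'completed', url: url)
  end
end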
Check my response here just in case it works for you. I didn't want to change the user workflow by adding a background job and then a place/notification to get the result.
I use Rails controller streaming support with the Live module and set the right response headers. I fetch the data from some Enumerable object.
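For reference, a bare-bones sketch of that streaming setup (report_rows is a hypothetical data source; the headers are the part that matters, and your Rack middleware stack may need attention, since ETag buffering can swallow the chunks):

class ReportsController < ApplicationController
  include ActionController::Live

  def show
    response.headers['Content-Type'] = 'text/event-stream'
    # Setting Last-Modified opts out of Rack::ETag's body buffering, so
    # each chunk is flushed to the client as soon as it is written.
    response.headers['Last-Modified'] = Time.now.httpdate

    # Every write resets Heroku's rolling 55-second window.
    report_rows.each do |row|
      response.stream.write("#{row.to_json}\n")
    end
  ensure
    response.stream.close
  end
end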
I have 2 Sidekiq workers:
Foo:
# frozen_string_literal: true
class FooWorker
  include Sidekiq::Worker
  sidekiq_options queue: :foo

  def perform
    loop do
      File.open(File.join(Rails.root, 'foo.txt'), 'w') { |file| file.write('FOO') }
    end
  end
end
Bar:
# frozen_string_literal: true
class BarWorker
  include Sidekiq::Worker
  sidekiq_options queue: :bar

  def perform
    loop do
      File.open(File.join(Rails.root, 'bar.txt'), 'w') { |file| file.write('BAR') }
    end
  end
end
They have pretty much the same functionality; both run on different queues, and the YAML file looks like this:
---
:queues:
  - foo
  - bar
development:
  :concurrency: 5
The problem is, even though both are running and showing on the Busy page of the Sidekiq UI, only one of them will actually create a file and put contents in it. Shouldn't Sidekiq be multi-threaded?
Update:
This happens only on my machine.
I created a new project with rails new, and the same thing happened.
I cloned a colleague's project and ran his Sidekiq, and it worked!
I used his Sidekiq version: not working!
New Update:
This also happens on my colleague's machine if he clones my project.
If I run 2 jobs with a finite loop (like doing something 10 times with a sleep), the first job executes and then the second; but after the second finishes and they start again, both work at the same time as expected -- everyone who cloned the project from github.com/ArayB/sidekiq-test encountered the problem.
It's not an issue with Sidekiq. It's an issue somewhere in Ruby/MRI/Thread/GIL territory. Google for more info, but my understanding is that sometimes threads aren't real threads (see "green threads"), so they really just simulate threading. The important thing is that only one thread can execute at a time.
It's interesting that with only two threads the system isn't giving time to the second thread. No idea why, but it must realize its mistake when you run it again.
Interestingly, if you run the same app but instead fire off 10 TestWorkers (and tweak the output so you can tell the difference), Sidekiq will run all 10 "at once".
10.times {|i| TestWorker.perform_async(i) }
Here is the tweaked worker. Be sure to flush the output, because TTY buffering can also make the output misrepresent what the threads are actually doing:
class TestWorker
  include Sidekiq::Worker

  def perform(n)
    10.times do |i|
      puts "#{n} - #{i} - #{Time.current}"
      $stdout.flush
      sleep 1
    end
  end
end
Some interesting links:
https://en.wikipedia.org/wiki/Green_threads
http://ruby-doc.org/core-2.4.1/Thread.html#method-c-pass
https://github.com/ruby/ruby/blob/v2_4_1/thread.c
Does ruby have real multithreading?
I use a Sidekiq queue to process communications with an unreliable, 3rd party API. Since this API is often down for a couple minutes at a time and then back up again, Sidekiq has been handy. When a connection issue happens, an error is raised and Sidekiq throws the job back in the queue to be retried again later, after some time has passed.
I use NewRelic not only to help debug crashes, but also for monitoring. My problem is that the methodology above creates errors in NewRelic. If the 3rd-party API is down for more than a couple of minutes, the error count accumulates enough for NewRelic to send out notifications.
What I'd like to do is only raise an error from my worker when a certain number of retries have occurred for a job. I'm using sidekiq_retries_exhausted to do this. My problem is that I'm not quite sure how to put jobs back in the queue after they have an error without raising an error.
Does Sidekiq provide any facilities to return a job to a queue, increment the number of retries for the job, and have it sit there until it's due to run again, as if an exception was raised in the worker class?
You raise a specific error and tell the error service to ignore errors of that type. For NewRelic:
https://docs.newrelic.com/docs/agents/ruby-agent/installation-configuration/ruby-agent-configuration#error_collector.ignore_errors
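For example, in newrelic.yml (a sketch; the exact key names can vary by agent version, so check the doc above):

error_collector:
  enabled: true
  # Comma-separated list of error class names the agent will not report.
  ignore_errors: "ActionController::RoutingError,TaskWorker::RetryNotAnError"

Here TaskWorker::RetryNotAnError refers to the custom retry-signal class defined in the worker below.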
Here is what I did to keep intentional retry errors out of Airbrake:
class TaskWorker
  include Sidekiq::Worker

  class RetryNotAnError < RuntimeError
  end

  def perform(task_id)
    task = Task.find(task_id)
    task.do_cool_stuff

    if task.finished?
      @log.debug "Task #{task_id} was successful."
      return false
    else
      @log.debug "Task #{task_id} will try again later."
      raise RetryNotAnError, task_id
    end
  end
end
Tell Airbrake to ignore it:
Airbrake.configure do |config|
  config.ignore << 'RetryNotAnError'
end
It's good to make your exception name OBVIOUSLY not an error (e.g. RetryLaterNotAnError), as it will still show up in logs and such, and you don't want to freak people out when they see a bunch of them.
P.S. That said, I would really like to see Sidekiq provide an explicit, errorless retry mechanism.
If you're using Sidekiq Enterprise, one other option is to utilize the optional set of additional error types that are then treated as Sidekiq::Limiter::OverLimit violations.
For my purposes, I've used a new error class and then added it to the list in the config. Here are the notes from the sidekiq-ent code (not in the public sidekiq repo) on how to modify your config file:
# An optional set of additional error types which would be
# treated as a rate limit violation, so the job would automatically
# be rescheduled as with Sidekiq::Limiter::OverLimit.
#
# Sidekiq::Limiter.errors << MyApp::TooMuch
# Sidekiq::Limiter.errors = [Foo::Error, MyApp::Limited]
Inside the specific job you can specify the max_retries, or it will default to 20:
sidekiq_options max_limiter_retries: 10
Inside the job, I'll rescue the "expected" intermittent error that I'd rather not ignore completely and then raise the error I've added to the list, something like this:
rescue RestClient::RequestTimeout => e
  raise SidekiqSoftRetry.new(e.inspect)
end
Here's what that looks like in my initialization file -- Mike Perham was kind enough to respond with the option to update the global retry limit.
class SidekiqSoftRetry < RuntimeError
end

Sidekiq::Limiter::DEFAULT_OPTIONS[:reschedule] = 10

Sidekiq::Limiter.configure do |config|
  config.errors.concat([SidekiqSoftRetry])
end
I am using a background job to import user data from a CSV file into my database. At first I did this the "hard" way in my User model, by simply calling a method on my User model and passing the file path, which is transmitted via a form file_field:
User.import_csv(params[:file].path)
This worked well locally and on production (Heroku).
Now when it comes to huge CSV files, I understand that I need a job to perform the import in the background. I am familiar with Redis and Sidekiq, so the job was built quickly.
CsvImportJob.perform_async(URI.parse(params[:file].path))
and in my worker:
def perform(file_path)
  User.import_csv(file_path)
end
Well, that also works perfectly locally, but as soon as I try it on production, I see the following error in my log:
» 10 Aug 2015 13:56:26.596 2015-08-10 11:56:25.987726+00:00 app worker.1 - - 3 TID-oqvt6v1d4 ERROR: Actor crashed!
» 10 Aug 2015 13:56:26.596 2015-08-10 11:56:25.987728+00:00 app worker.1 - - Errno::ENOENT: No such file or directory @ rb_sysopen - /tmp/RackMultipart20150810-6-14u804c.csv
» 10 Aug 2015 13:56:26.596 2015-08-10 11:56:25.987730+00:00 app worker.1 - - /app/vendor/ruby-2.2.2/lib/ruby/2.2.0/csv.rb:1256:in `initialize'
The /tmp/RackMultipart... path in the error is the file_path variable.
Somehow Heroku is not able to find the file when I pass its path to a Sidekiq job. When I do this without Sidekiq, it works.
I don't really know how to tackle this issue so any help is appreciated.
I had the same experience; you can look at a similar project of mine at https://github.com/coderaven/datatable-exercise/tree/parallel_processing.
(Basically, just focus on the object_record.rb model and the jobs: import_csv_job.rb and process_csv_job.rb.)
The error: Errno::ENOENT: No such file or directory @ rb_sysopen
Since you said this works on Heroku without the job, the path you are getting is probably valid at upload time (in your example, the /tmp/ path).
So here are 2 probable problems and their solutions:
1.) You have saved a path that is unknown or inaccessible to Heroku, so it cannot be opened by the application when the job runs. When you handle the CSV import without Sidekiq, the uploaded file is held temporarily until you finish processing it within the same request. With a job scheduler like Sidekiq, however, the path cannot point at that short-lived tempfile; it should be an existing path that is accessible to the worker.
Solution: Save the file to storage somewhere. Heroku has an ephemeral filesystem, so you cannot persist files via the running web app. To work around this, use an Amazon S3-like service (you can also use Google Drive, as I did) to save your files, and then give the path or URL to your Sidekiq worker so it can access and process the file later (see the upload sketch after the code below).
2.) If the paths are correct and the files are saved properly, then in my experience the problem may be that you are using File.open instead of open-uri's open method. File.open does not accept remote files; you need to require open-uri in your worker and then use open to handle remote files. For example:
require 'open-uri'

class ProcessCsvJob < ActiveJob::Base
  queue_as :default

  def perform(csv_path)
    csv_file = open(csv_path, 'rb:UTF-8')
    SmarterCSV.process(csv_file) do |array|
      # ... code here for processing ...
    end
  end
end
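And a sketch of the upload side of option 1, using the aws-sdk-s3 gem (the bucket name and key prefix are made up): upload the tempfile at request time, while it still exists, and hand the worker a URL instead of a local path.

require 'aws-sdk-s3'

def enqueue_csv_import(uploaded_file)
  s3 = Aws::S3::Resource.new(region: 'us-east-1')
  key = "imports/#{File.basename(uploaded_file.path)}"
  object = s3.bucket('my-upload-bucket').object(key)
  object.upload_file(uploaded_file.path) # copy to S3 before the tempfile vanishes
  ProcessCsvJob.perform_later(object.public_url)
end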
I'm fully aware this question is almost a year old, so if you have already solved it, or if this answer works, it can also serve as a documentation archive for those who experience the same problem.
You can't pass a file object to the perform method.
The fix is to massage the data beforehand and pass in the parameters you need directly.
Something like...
def import_csv(file)
  CSV.foreach(file.path, headers: true) do |row|
    new_user = { email: row[0], password: row[1] }
    CsvImportJob.perform_async(new_user)
  end
end
Note: you'd call CsvImportJob.perform_later for Sidekiq with ActiveJob and Rails 5.
You got the error because on production/staging, the web app and Sidekiq run on different servers.
Use my solution: upload the CSV to Google Cloud Storage.
class Services::Downloader
  require 'fog'

  StorageCredentials = YAML.load_file("#{::Rails.root}/config/g.yml")[Rails.env]

  def self.download(file_name, local_path)
    storage = Fog::Storage.new(
      provider: "Google",
      google_storage_access_key_id: StorageCredentials['key_id'],
      google_storage_secret_access_key: StorageCredentials['access_key'])
    storage.get_bucket(StorageCredentials['bucket'])
    f = File.open(local_path)
    storage.put_object(StorageCredentials['bucket'], file_name, f)
    storage.get_object_https_url(StorageCredentials['bucket'], file_name, Time.now.to_f + 24.hours)
  end
end
The User class:

class User < ApplicationRecord
  require 'csv'
  require 'open-uri'

  def self.import_data(file)
    load_file = open(file)
    data = CSV.read(load_file, { encoding: "UTF-8", headers: true, header_converters: :symbol, converters: :all })
    ...
Worker
class ImportWorker
include Sidekiq::Worker
sidekiq_options queue: 'workers', retry: 0
def perform(filename)
User.import_data(filename)
end
end
And the code to start the worker:

path = Services::Downloader.download(zip.name, zip.path)
ImportWorker.perform_async(path)
Situation:
In a typical cluster setup, I have 5 instances of Mongrel running behind Apache 2.
In one of my initializer files, I schedule a cron task using Rufus::Scheduler which basically sends out a couple of emails.
Problem:
The task runs 5 times, once for each Mongrel instance, and each recipient ends up getting 5 emails (despite the fact that I store a log of each sent mail and check the log before sending). Is it possible that since all 5 instances run the task at the exact same time, they end up reading the email logs before they are written?
I am looking for a solution that will make the tasks run only once. I also have a Starling daemon up and running which can be utilized.
The rooster rails plugin specifically addresses your issue. It uses rufus-scheduler and ensures the environment is loaded only once.
The way I am doing it right now:
Try to open a file in exclusive locked mode
When the lock is acquired, check for messages in Starling
If a message exists, another process has already scheduled the job
Set the message again on the queue and exit
If no message is found, schedule the job, set the message, and exit
Here is the code that does it:
starling = MemCache.new("#{Settings[:starling][:host]}:#{Settings[:starling][:port]}")
mutex_filename = "#{RAILS_ROOT}/config/file.lock"
scheduler = Rufus::Scheduler.start_new

# The filelock method, taken from the Ruby Cookbook.
# This will ensure unlocking of the files.
def flock(file, mode)
  success = file.flock(mode)
  if success
    begin
      yield file
    ensure
      file.flock(File::LOCK_UN)
    end
  end
  return success
end

# The open_lock method, taken from the Ruby Cookbook.
# This will create and hold the locks.
def open_lock(filename, openmode = "r", lockmode = nil)
  if openmode == 'r' || openmode == 'rb'
    lockmode ||= File::LOCK_SH
  else
    lockmode ||= File::LOCK_EX
  end
  value = nil
  # Kernel's open method gives an IO object, in our case a file
  open(filename, openmode) do |f|
    flock(f, lockmode) do
      begin
        value = yield f
      ensure
        f.flock(File::LOCK_UN) # Comment this line out on Windows.
      end
    end
    return value
  end
end

# The actual scheduler
open_lock(mutex_filename, 'r+') do |f|
  puts f.read
  digest_schedule_message = starling.get("digest_scheduler")
  if digest_schedule_message
    puts "Found digest message in Starling. Releasing lock. '#{Time.now}'"
    puts "Message: #{digest_schedule_message.inspect}"
    # Read the message and set it back, so that other processes can read it too
    starling.set "digest_scheduler", digest_schedule_message
  else
    # Schedule job
    puts "Scheduling digest emails now. '#{Time.now}'"
    scheduler.cron("0 9 * * *") do
      puts "Begin sending digests..."
      WeeklyDigest.new.send_digest!
      puts "Done sending digests."
    end
    # Add message in queue
    puts "Done scheduling. Sending the message to Starling. '#{Time.now}'"
    starling.set "digest_scheduler", :date => Date.today
  end
end

# Sleep will ensure all instances have gone through their wait-acquire-lock-schedule(or not) cycle.
# This will ensure that on next reboot, Starling won't have any stale messages.
puts "Waiting to clear digest messages from Starling."
sleep(20)
puts "All digest messages cleared, proceeding with boot."
starling.get("digest_scheduler")
Why don't you use mod_passenger (Phusion Passenger)? I moved from Mongrel to Passenger and it worked perfectly (the switch took less than 5 minutes)!