I need to download a blob from Azure (from a batch), perform some actions on the data, and re-upload to Azure.
I'm running a script to do that in the Rails console, and to my surprise, after running successfully for 10-20 iterations, it starts to crash on every single iteration at the @client.get_blob line.
Is there any part of my process that could be using up the memory in console, or in some sort of scratch disk used by console? For example, the command File.binwrite?
@client = Azure::Storage::Blob::BlobService.create(
  storage_account_name: ENV.fetch('AZURE_STORAGE_ACCOUNT'),
  storage_access_key: ENV.fetch('AZURE_STORAGE_KEY'))
incorrect_files.each_with_index do |name, index|
  puts "starting #{index + start}: #{name}"
  # Get file data
  data = @client.get_blob('uploadedfiles', name)[1]
  # Create a temporary file
  Tempfile.create('tmpfile') do |temp|
    # Fix the file data and save to temp file
    File.binwrite(temp, fix(data))
    # Upload new file
    Uploader.upload(temp)
  end
end
I have an issue with importing a lot of records from a user provided excel file into a database. The logic for this is working fine, and I’m using ActiveRecord-import to cut down on the number of database calls. However, when a file is too large, the processing can take too long and Heroku will return a timeout. Solution: Resque and moving the processing to a background job.
So far, so good. I’ve needed to add CarrierWave to upload the files to S3 because I can’t just hold the file in memory for the background job. The upload portion is also working fine, I created a model for them and am passing the IDs through to the queued job to retrieve the file later as I understand I can’t pass a whole ActiveRecord object through to the job.
I’ve installed Resque and Redis locally, and everything seems to be set up correctly in that regard. I can see the jobs I’m creating being queued and then run without failing. The job seems to run fine, but no records are added to the database. If I run the code from my job line by line in the console, the records are added to the database as I would expect. But when the queued jobs run, nothing happens.
I can’t quite work out where the problem might be.
Here’s my upload controller’s create action:
def create
  @upload = Upload.new(upload_params)
  if @upload.save
    Resque.enqueue(ExcelImportJob, @upload.id)
    flash[:info] = 'File uploaded. Data will be processed and added to the database.'
    redirect_to root_path
  else
    flash[:warning] = 'Upload failed. Please try again.'
    render :new
  end
end
This is a simplified version of the job with fewer sheet columns for clarity:
class ExcelImportJob < ApplicationJob
  @queue = :default

  def perform(upload_id)
    file = Upload.find(upload_id).file.file.file
    data = parse_excel(file)
    if header_matches? data
      # Create a database entry for each row, ignoring the first header row
      # using activerecord-import
      sales = []
      data.drop(1).each_with_index do |row, index|
        sales << Sale.new(row)
        if index % 2500 == 0
          Sale.import sales
          sales = []
        end
      end
      Sale.import sales
    end
  end

  def parse_excel(upload)
    # Open the uploaded excel document
    doc = Creek::Book.new upload
    # Map rows to the hash keys from the database
    doc.sheets.first.rows.map do |row|
      { date: row.values[0],
        title: row.values[1],
        author: row.values[2],
        isbn: row.values[3],
        release_date: row.values[5],
        units_sold: row.values[6],
        units_refunded: row.values[7],
        net_units_sold: row.values[8],
        payment_amount: row.values[9],
        payment_amount_currency: row.values[10] }
    end
  end

  # Returns true if header matches the expected format
  def header_matches?(data)
    data.first == {:date => 'Date',
                   :title => 'Title',
                   :author => 'Author',
                   :isbn => 'ISBN',
                   :release_date => 'Release Date',
                   :units_sold => 'Units Sold',
                   :units_refunded => 'Units Refunded',
                   :net_units_sold => 'Net Units Sold',
                   :payment_amount => 'Payment Amount',
                   :payment_amount_currency => 'Payment Amount Currency'}
  end
end
I can probably have some improved logic anyway as right now I’m holding the whole file in memory, but that isn’t the issue I’m having – even with a small file that has only 500 or so rows, the job doesn’t add anything to the database.
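As an aside on the batching logic in the job: a flush condition of `index % 2500 == 0` is true for the very first row, so the first batch contains a single record. The sketch below is not the job itself; it uses two hypothetical helpers (`batches_with_modulo` and `batches_with_slice`) and plain integers in place of Sale records to compare that condition with `Enumerable#each_slice`, which might be a simpler way to chunk the rows.

```ruby
# Hypothetical helper mirroring the loop in the job: flush whenever
# index % size == 0, then flush whatever is left at the end.
def batches_with_modulo(rows, size)
  batches = []
  current = []
  rows.each_with_index do |row, index|
    current << row
    if index % size == 0
      batches << current
      current = []
    end
  end
  batches << current unless current.empty?
  batches
end

# Alternative: let Enumerable#each_slice do the chunking.
def batches_with_slice(rows, size)
  rows.each_slice(size).to_a
end

rows = (1..7).to_a
p batches_with_modulo(rows, 3).map(&:size) # prints [1, 3, 3]
p batches_with_slice(rows, 3).map(&:size)  # prints [3, 3, 1]
```

Either way every row ends up in some batch, so this only affects batch sizes, not whether records are imported.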
Like I said my code worked fine when I wasn’t using a background job, and still works if I run it in the console. But for some reason the job is doing nothing.
This is my first time using Resque so I don’t know if I’m missing something obvious? I did create a worker and as I said it does seem to run the job. Here’s the output from Resque’s verbose formatter:
*** resque-1.27.4: Waiting for default
*** Checking default
*** Found job on default
*** resque-1.27.4: Processing default since 1508342426 [ExcelImportJob]
*** got: (Job{default} | ExcelImportJob | [15])
*** Running before_fork hooks with [(Job{default} | ExcelImportJob | [15])]
*** resque-1.27.4: Forked 63706 at 1508342426
*** Running after_fork hooks with [(Job{default} | ExcelImportJob | [15])]
*** done: (Job{default} | ExcelImportJob | [15])
In the Resque dashboard the jobs aren’t logged as failed. They get executed and I can see an increment in the ‘processed’ jobs on the stats page. But as I say the DB remains untouched. What’s going on? How can I debug the job more clearly? Is there a way to get into it with Pry?
It looks like my problem was with Resque.enqueue(ExcelImportJob, @upload.id).
I changed my code to ExcelImportJob.perform_later(@upload.id) and now my code actually runs!
I also added a resque.rake task to lib/tasks as described here: http://bica.co/2015/01/20/active-job-resque/.
That link also notes how to use rails runner to call the job without running the full Rails server and triggering the job, which is useful for debugging.
Strangely, I didn't quite manage to get the job to print anything to STDOUT as suggested by @hoffm, but at least it led me down a good avenue of inquiry.
I still don't fully understand why calling Resque.enqueue added my jobs to the queue and indeed seemed to run them, yet the code inside the job wasn't executed. If someone has a better grasp and an explanation, that would be much appreciated.
TL;DR: calling perform_later rather than Resque.enqueue fixed the problem but I don't know why.
I am using a background job to import user data from a CSV file into my database. At first I did this directly ("hard") by calling a method on my User model and passing the file path, which is submitted via a form file_field:
User.import_csv(params[:file].path)
Worked well locally and on production (heroku).
Now when it comes to huge CSV files, I understood that I need a job to perform this import in the background. I am familiar with redis and sidekiq so the job was built quickly.
CsvImportJob.perform_async(URI.parse(params[:file].path))
and in my worker:
def perform(file_path)
  User.import_csv(file_path)
end
Well, that also works perfectly locally, but as soon as I hit this on production, I see the following error in my log:
» 10 Aug 2015 13:56:26.596 2015-08-10 11:56:25.987726+00:00 app worker.1 - - 3 TID-oqvt6v1d4 ERROR: Actor crashed!
» 10 Aug 2015 13:56:26.596 2015-08-10 11:56:25.987728+00:00 app worker.1 - - Errno::ENOENT: No such file or directory @ rb_sysopen - /tmp/RackMultipart20150810-6-14u804c.csv
» 10 Aug 2015 13:56:26.596 2015-08-10 11:56:25.987730+00:00 app worker.1 - - /app/vendor/ruby-2.2.2/lib/ruby/2.2.0/csv.rb:1256:in `initialize'
This is meant to be the file_path variable.
Somehow heroku is not able to find the file when I pass it to a sidekiq job. When I do this without sidekiq, it works.
I don't really know how to tackle this issue so any help is appreciated.
I had the same experience, you can look at a similar project of mine at https://github.com/coderaven/datatable-exercise/tree/parallel_processing
(Basically just focus on object_record.rb model and the jobs: import_csv_job.rb and process_csv_job.rb)
The error: Errno::ENOENT: No such file or directory @ rb_sysopen
Since you said this works on Heroku without Sidekiq, the path you are getting (under /tmp/ in your example) is probably valid at the time of the request.
So here are 2 probable problems and their solutions:
1.) You are passing a path that the worker cannot access or open when it runs. When you handle the CSV import without Sidekiq, the uploaded file is only held temporarily, for as long as the request is being processed. A background job (Sidekiq) runs later and outside that request, so the path must point to a file that still exists and is accessible to the worker.
Solution: Save the file to storage somewhere. Heroku has an ephemeral filesystem, so you cannot rely on files saved by the running web app. To work around this, use a service like Amazon S3 (you can also use Google Drive, as I did) to store the file, then give its location to your Sidekiq worker so it can access and process it later.
2.) If the paths are correct and the files are saved and processed correctly, then in my experience the cause could be that you are using File.open instead of open-uri's open method. File.open does not accept remote files; you need to require open-uri in your worker and use its open method to read remote files.
ex.
require 'open-uri'
class ProcessCsvJob < ActiveJob::Base
  queue_as :default

  def perform(csv_path)
    # open-uri lets open read remote URLs; on Ruby 3.0+ call URI.open instead
    csv_file = open(csv_path, 'rb:UTF-8')
    SmarterCSV.process(csv_file) do |array|
      # .... code here for processing ...
    end
  end
end
I'm fully aware this question is almost a year old, so even if you have already solved this, the answer can still serve as documentation for those who run into the same problem.
You can't pass a file object to the perform method.
The fix is to massage the data beforehand and pass in the parameters you need directly.
Something like...
def import_csv(file)
  CSV.foreach(file.path, headers: true) do |row|
    new_user = { email: row[0], password: row[1] }
    CsvImportJob.perform_async(new_user)
  end
end
Note: you'd call CsvImportJob.perform_later for Sidekiq with ActiveJob and Rails 5.
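To illustrate why plain data works as a job argument: Sidekiq serializes job arguments as JSON when enqueueing and parses them again in the worker. The sketch below does not use Sidekiq at all; it simulates that round trip with the standard JSON library through a hypothetical `round_trip` helper, and the CSV content is made up for the example. One thing it shows is that symbol keys arrive in the worker as string keys.

```ruby
require 'json'
require 'csv'

# Hypothetical stand-in for what happens to job arguments between
# enqueueing and performing: dump to JSON, then parse again.
def round_trip(args)
  JSON.parse(JSON.generate(args))
end

# Made-up CSV content standing in for the uploaded file.
csv_text = "email,password\na@example.com,secret\n"

rows = CSV.parse(csv_text, headers: true).map do |row|
  { email: row['email'], password: row['password'] }
end

payload = round_trip(rows.first)
p payload # prints {"email"=>"a@example.com", "password"=>"secret"} (exact format varies by Ruby version)
```

In the worker you would therefore read the values with string keys, for example `new_user['email']`.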
You got the error because in production/staging the web app and Sidekiq run on different servers.
My solution: upload the CSV to Google Cloud Storage
class Services::Downloader
  require 'fog'
  StorageCredentials = YAML.load_file("#{::Rails.root}/config/g.yml")[Rails.env]

  def self.download(file_name, local_path)
    storage = Fog::Storage.new(
      provider: "Google",
      google_storage_access_key_id: StorageCredentials['key_id'],
      google_storage_secret_access_key: StorageCredentials['access_key'])
    storage.get_bucket(StorageCredentials['bucket'])
    f = File.open(local_path)
    storage.put_object(StorageCredentials['bucket'], file_name, f)
    storage.get_object_https_url(StorageCredentials['bucket'], file_name, Time.now.to_f + 24.hours)
  end
end
Class User
class User < ApplicationRecord
  require 'csv'
  require 'open-uri'

  def self.import_data(file)
    load_file = open(file)
    data = CSV.read(load_file, { encoding: "UTF-8", headers: true, header_converters: :symbol, converters: :all})
    ...
Worker
class ImportWorker
  include Sidekiq::Worker
  sidekiq_options queue: 'workers', retry: 0

  def perform(filename)
    User.import_data(filename)
  end
end
and the code to start the worker:
path = Services::Downloader.download(zip.name, zip.path)
ImportWorker.perform_async(path)
When I have a lot of products (3,000 products and 22,000 variants), adding a new stock location takes hours because Spree creates stock items for every variant.
During this time the variants table is locked and the whole system is unusable. Is there a workaround for this, or has it been fixed in a newer version of Spree?
I am using spree 2.0.3.
I faced the same problem: with more than 400K variants, it's impossible to add a new stock location. So I wrote a Ruby script that writes an INSERT statement covering all variants to a SQL file. The stock location must be created without propagate_all_variants.
# lib/create_stock_items.rb
# Note: stock_location_id must be set to the ID of the new stock location before this runs.
variant_ids = Spree::Variant.pluck(:id)
File.open("stock_items.sql", "w") do |file|
  file.write("INSERT INTO spree_stock_items (stock_location_id, variant_id, backorderable) VALUES \n")
  variant_ids.each_with_index do |variant_id, index|
    # The last row ends the statement with a semicolon; all others end with a comma
    terminator = (index + 1 == variant_ids.length) ? ";" : ","
    file.write("(#{stock_location_id}, #{variant_id}, false)#{terminator} \n")
  end
end
Then run bundle exec rails runner lib/create_stock_items.rb -e production. This creates a stock_items.sql file in the Rails root path; finally, load that SQL directly into the DB (rails dbconsole).
I know it's a little hack, but a very fast solution for me.
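With several hundred thousand variants, a single INSERT statement can get very large. A possible variation, sketched below under assumptions, is to emit one statement per batch of IDs. The helper name `stock_item_inserts` and the `batch_size` parameter are hypothetical and not part of Spree; the table and column names are the ones used in the script above. The IDs are passed in as plain integers so the sketch runs without a database.

```ruby
# Hypothetical helper: builds one INSERT statement per batch of variant IDs.
# Integer() raises if a value is not numeric, which guards the interpolation.
def stock_item_inserts(stock_location_id, variant_ids, batch_size: 1000)
  location = Integer(stock_location_id)
  variant_ids.each_slice(batch_size).map do |batch|
    values = batch.map { |id| "(#{location}, #{Integer(id)}, false)" }
    "INSERT INTO spree_stock_items (stock_location_id, variant_id, backorderable) VALUES\n" +
      values.join(",\n") + ";\n"
  end
end

# Example with made-up IDs: three variants in batches of two.
statements = stock_item_inserts(7, [1, 2, 3], batch_size: 2)
puts statements
```

In the real script the ID list would come from something like Spree::Variant.pluck(:id), and the statements would be written to the SQL file one after another.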
Using Rails 3.1.1 and Heroku
I have 1,000 products in my app. They all have a very slow controller action, which is effectively solved by fragment caching. Although the data doesn't change very often, the cache still needs to expire periodically (which I do by sweeping), in my case once a week.
Now, after sweeping the cached views I don't want my users to create the new fragments by trying to access the products one after another (takes about 6-8 secs at the first load, 2-3 sec for the cached load). I assume I can do that with some sort of script that will load each Product Page one by one and thus make the server create those fragments.
I can imagine this can be handled in three ways:
1.) Run a script on my local machine that will try to access each url with some sort of get-command - Downside: Not very pretty and will affect visitor stats in a way I would not prefer.
2.) Run the same type of script on the server after the sweeper, that will load each Product. How can I do that, in that case?
3.) Use a smart Rails command to do this automatically. Is there such an elegant command?
I made this script and it works. The "product.slug" is because I have friendly_id installed. It will produce url-variables with names such as www.mydomain.com/productabc-123/ which will be read by Nokogiri (Nokogiri gem is needed for this solution).
PLEASE NOTE THAT I SWITCHED FROM FRAGMENT CACHING TO ACTION CACHING IN THIS SOLUTION (as opposed to the question, where I am using fragment caching). The important difference for this is when I check the cache if Rails.cache.exist?('views/www.mydomain.com/' + product.slug). For fragment_caching it should be the fragment name there instead.
require 'nokogiri'
require 'open-uri'
Product.all.each do |product|
  url = 'http://www.mydomain.com/' + product.slug
  begin
    if Rails.cache.exist?('views/www.mydomain.com/' + product.slug)
      puts url + " is already in cache"
    else
      # On Ruby 3.0+, use URI.open(url) here instead of open(url)
      doc = Nokogiri::HTML(open(url))
      puts "Reads " + url
      # Verifies if the caching worked. Only for trouble shooting
      if Rails.cache.exist?('views/www.mydomain.com/' + product.slug)
        puts "--->" + url + " is NOW in the cache"
      else
        puts "--->" + url + " is still not in the cache!"
      end
      sleep 1
    end
  # On Ruby 1.9+ Timeout::Error is a StandardError, so it must be rescued
  # before the bare rescue; otherwise the retry below is never reached
  rescue Timeout::Error
    puts 'Timeout rescue of ' + url
    puts 'Sleep for 5 sec'
    sleep 5
    retry
  rescue
    puts 'Normal rescue of ' + url
  end
end
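One detail worth checking in a script like this is the order of the rescue clauses: Ruby uses the first clause that matches, and on Ruby 1.9 and later Timeout::Error is a StandardError, so a bare rescue placed first also catches timeouts. The two methods below are hypothetical and exist only to show that behavior in isolation.

```ruby
require 'timeout'

# Specific clause first: timeouts get their own handling.
def classify(error_class)
  raise error_class, 'demo'
rescue Timeout::Error
  :timeout
rescue StandardError
  :other
end

# Generic clause first: it shadows the Timeout::Error clause below it.
def classify_shadowed(error_class)
  raise error_class, 'demo'
rescue StandardError
  :other
rescue Timeout::Error
  :timeout
end

p classify(Timeout::Error)          # prints :timeout
p classify_shadowed(Timeout::Error) # prints :other
```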
Create a script that runs as rake task, or better yet a worker, that runs and curls the page. There is no need to include a gem when you can just call curl
`curl -A "CacheRefresher" #{ENV['HOSTNAME']}/api/v1/#{klass.name.underscore.pluralize}/#{id} >/dev/null 2>&1`
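If shelling out to curl is not an option, roughly the same request can be made from Ruby with the standard library. This is only a sketch: `warm_request` and `warm` are hypothetical names, the URL is the example domain used earlier in this thread, and the "CacheRefresher" user agent is the one from the curl command, which makes it possible to filter these hits out of visitor stats.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: builds a GET request that identifies itself
# with a custom User-Agent header.
def warm_request(url, user_agent: 'CacheRefresher')
  Net::HTTP::Get.new(URI.parse(url), 'User-Agent' => user_agent)
end

# Hypothetical helper: sends the request and returns the HTTP status code
# as a string. The response body is discarded; the point is only to make
# the server render and cache the page.
def warm(url)
  request = warm_request(url)
  uri = request.uri
  Net::HTTP.start(uri.hostname, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.request(request).code
  end
end

request = warm_request('http://www.mydomain.com/productabc-123/')
puts request['User-Agent'] # prints CacheRefresher
```

Calling warm for each product URL from a rake task or worker would then play the role of the curl loop.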
Situation:
In a typical cluster setup, I have 5 instances of Mongrel running behind Apache 2.
In one of my initializer files, I schedule a cron task using Rufus::Scheduler, which basically sends out a couple of emails.
Problem:
The task runs 5 times, once for each Mongrel instance, and each recipient ends up getting 5 mails (despite the fact that I store a log of each sent mail and check the log before sending). Is it possible that, since all 5 instances run the task at the exact same time, they end up reading the email logs before they are written?
I am looking for a solution that will make the tasks run only once. I also have a Starling daemon up and running which can be utilized.
The rooster rails plugin specifically addresses your issue. It uses rufus-scheduler and ensures the environment is loaded only once.
The way I am doing it right now:
Try to open a file in exclusive locked mode
When lock is acquired, check for messages in Starling
If message exists, other process has already scheduled the job
Set the message again to the queue and exit.
If message is not found, schedule the job, set the message and exit
Here is the code that does it:
starling = MemCache.new("#{Settings[:starling][:host]}:#{Settings[:starling][:port]}")
mutex_filename = "#{RAILS_ROOT}/config/file.lock"
scheduler = Rufus::Scheduler.start_new

# The filelock method, taken from Ruby Cookbook
# This will ensure unblocking of the files
def flock(file, mode)
  success = file.flock(mode)
  if success
    begin
      yield file
    ensure
      file.flock(File::LOCK_UN)
    end
  end
  return success
end

# open_lock method, taken from Ruby Cookbook
# This will create and hold the locks
def open_lock(filename, openmode = "r", lockmode = nil)
  if openmode == 'r' || openmode == 'rb'
    lockmode ||= File::LOCK_SH
  else
    lockmode ||= File::LOCK_EX
  end
  value = nil
  # Kernel's open method gives an IO object, in our case a file
  open(filename, openmode) do |f|
    flock(f, lockmode) do
      begin
        value = yield f
      ensure
        f.flock(File::LOCK_UN) # Comment this line out on Windows.
      end
    end
    return value
  end
end

# The actual scheduler
open_lock(mutex_filename, 'r+') do |f|
  puts f.read
  digest_schedule_message = starling.get("digest_scheduler")
  if digest_schedule_message
    puts "Found digest message in Starling. Releasing lock. '#{Time.now}'"
    puts "Message: #{digest_schedule_message.inspect}"
    # Read the message and set it back, so that other processes can read it too
    starling.set "digest_scheduler", digest_schedule_message
  else
    # Schedule job
    puts "Scheduling digest emails now. '#{Time.now}'"
    scheduler.cron("0 9 * * *") do
      puts "Begin sending digests..."
      WeeklyDigest.new.send_digest!
      puts "Done sending digests."
    end
    # Add message in queue
    puts "Done Scheduling. Sending the message to Starling. '#{Time.now}'"
    starling.set "digest_scheduler", :date => Date.today
  end
end

# Sleep will ensure all instances have gone through their wait-acquire lock-schedule(or not) cycle
# This will ensure that on next reboot, starling won't have any stale messages
puts "Waiting to clear digest messages from Starling."
sleep(20)
puts "All digest messages cleared, proceeding with boot."
starling.get("digest_scheduler")
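A simpler variation on the same idea, sketched here under assumptions, is to let the file lock alone decide which process schedules the job: each process tries to take an exclusive, non-blocking lock, and only the one that gets it goes on to schedule. The helper name `acquire_scheduler_lock` and the lock file location are hypothetical. The sketch relies on File#flock returning false when LOCK_NB is set and another open handle already holds the lock, which is how it behaves on Linux; other platforms may differ.

```ruby
require 'tmpdir'

# Hypothetical helper: returns the open lock file if this caller won the
# exclusive lock, or nil if someone else already holds it. The winner must
# keep the returned file open for as long as it wants to hold the lock.
def acquire_scheduler_lock(lock_path)
  file = File.open(lock_path, File::RDWR | File::CREAT, 0o644)
  return file if file.flock(File::LOCK_EX | File::LOCK_NB)
  file.close
  nil
end

# Demo: two attempts on the same lock file, standing in for two app instances.
lock_path = File.join(Dir.tmpdir, "scheduler-demo-#{Process.pid}.lock")
first  = acquire_scheduler_lock(lock_path)
second = acquire_scheduler_lock(lock_path)

puts first ? "first attempt: would schedule the job" : "first attempt: skipped"
puts second ? "second attempt: would schedule the job" : "second attempt: skipped"

# Clean up the demo lock.
first.close if first
second.close if second
File.delete(lock_path) if File.exist?(lock_path)
```

Because the lock is released when the file is closed or the process exits, no stale marker needs to be cleared on the next boot.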
Why don't you use mod_passenger (Phusion Passenger)? I moved from Mongrel to Passenger and it worked perfectly (the switch took less than 5 minutes)!