Problems with file.path with csv import via sidekiq on heroku - ruby-on-rails

I am using a background job to import user data from a CSV file into my database. At first I did this the "hard" way, by simply calling a method on my User model and passing it the file path transmitted via a form file_field:
User.import_csv(params[:file].path)
This worked well locally and on production (Heroku).
Now that huge CSV files are involved, I understand that I need a job to perform this import in the background. I am familiar with Redis and Sidekiq, so the job was built quickly.
CsvImportJob.perform_async(URI.parse(params[:file].path))
and in my worker:
def perform(file_path)
User.import_csv(file_path)
end
Well, that also works perfectly locally, but as soon as I hit this in production, I see the following error in my log:
» 10 Aug 2015 13:56:26.596 2015-08-10 11:56:25.987726+00:00 app worker.1 - - 3 TID-oqvt6v1d4 ERROR: Actor crashed!
» 10 Aug 2015 13:56:26.596 2015-08-10 11:56:25.987728+00:00 app worker.1 - - Errno::ENOENT: No such file or directory @ rb_sysopen - /tmp/RackMultipart20150810-6-14u804c.csv
» 10 Aug 2015 13:56:26.596 2015-08-10 11:56:25.987730+00:00 app worker.1 - - /app/vendor/ruby-2.2.2/lib/ruby/2.2.0/csv.rb:1256:in `initialize'
The path in that error is the file_path variable that was passed to the job.
Somehow Heroku is not able to find the file when I pass its path to a Sidekiq job. When I do this without Sidekiq, it works.
I don't really know how to tackle this issue so any help is appreciated.

I had the same experience; you can look at a similar project of mine at https://github.com/coderaven/datatable-exercise/tree/parallel_processing
(Basically, just focus on the object_record.rb model and the jobs: import_csv_job.rb and process_csv_job.rb.)
The error: Errno::ENOENT: No such file or directory @ rb_sysopen
Since you said this works without Sidekiq, the path you are getting is probably valid at the time of the request (in your example you are using a /tmp/ path).
So here are two probable problems and their solutions:
1.) You have saved the file to a path that is unknown to (or inaccessible from) Heroku, so it cannot be opened by the application when the job runs. When handling the CSV import without Sidekiq, the uploaded file is only kept temporarily until the request finishes processing the CSV. In a job scheduler (such as Sidekiq), however, the path must point to an existing file that is still accessible to the app when the worker picks the job up.
Solution: Save the file to storage somewhere. Heroku has an ephemeral filesystem, so you cannot persist files from the running web app; to work around this, use an Amazon S3-like service (you can also use Google Drive, as I did) to store the file, and then give its path or URL to your Sidekiq worker so it can access and process it later.
2.) If the paths are correct and the files are saved properly, then from my experience the problem could be that you are using File.open instead of open-uri's open method. File.open does not accept remote files; you need to require open-uri in your worker and then use its open method to handle remote files.
ex.
require 'open-uri'

class ProcessCsvJob < ActiveJob::Base
  queue_as :default

  def perform(csv_path)
    csv_file = open(csv_path, 'rb:UTF-8')
    SmarterCSV.process(csv_file) do |array|
      # ... code here for processing ...
    end
  end
end
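With the file stored remotely as described in point 1, the enqueue call would then pass the remote URL rather than a local tmp path, for example (a sketch with a placeholder URL):

ProcessCsvJob.perform_later('https://storage.googleapis.com/your-bucket/users.csv')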
I'm fully aware this question is almost a year old, so even if you have already solved it or this answer worked for you, it can also serve as a documentation archive for those who experience the same problem.

You can't pass a file object to the perform method; Sidekiq arguments need to be simple, JSON-serializable values.
The fix is to massage the data beforehand and pass in the parameters you need directly.
Something like...
def import_csv(file)
  CSV.foreach(file.path, headers: true) do |row|
    new_user = { email: row[0], password: row[1] }
    CsvImportJob.perform_async(new_user)
  end
end
Note: you'd call CsvImportJob.perform_later for Sidekiq with ActiveJob and Rails 5.
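For completeness, the matching Sidekiq worker would then just build the record from the passed-in attributes, roughly like this (a sketch; the attribute names come from the example above, and note that Sidekiq round-trips arguments through JSON, so the hash keys arrive as strings):

class CsvImportJob
  include Sidekiq::Worker

  def perform(attrs)
    # keys are strings after serialization through Redis/JSON
    User.create!(email: attrs['email'], password: attrs['password'])
  end
end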

You got the error because in production/staging the web app and Sidekiq run on different servers.
Use my solution: upload the CSV to Google Cloud Storage.
class Services::Downloader
  require 'fog'

  StorageCredentials = YAML.load_file("#{::Rails.root}/config/g.yml")[Rails.env]

  def self.download(file_name, local_path)
    storage = Fog::Storage.new(
      provider: "Google",
      google_storage_access_key_id: StorageCredentials['key_id'],
      google_storage_secret_access_key: StorageCredentials['access_key'])
    storage.get_bucket(StorageCredentials['bucket'])
    f = File.open(local_path)
    storage.put_object(StorageCredentials['bucket'], file_name, f)
    storage.get_object_https_url(StorageCredentials['bucket'], file_name, Time.now.to_f + 24.hours)
  end
end
The User class:
class User < ApplicationRecord
  require 'csv'
  require 'open-uri'

  def self.import_data(file)
    load_file = open(file)
    data = CSV.read(load_file, { encoding: "UTF-8", headers: true, header_converters: :symbol, converters: :all })
    ...
The worker:
class ImportWorker
  include Sidekiq::Worker
  sidekiq_options queue: 'workers', retry: 0

  def perform(filename)
    User.import_data(filename)
  end
end
And the code to start the worker:
path = Services::Downloader.download(zip.name, zip.path)
ImportWorker.perform_async(path)

Related

Why am I getting Listen Loop Bad File Descriptor errors on my machine but not anyone else's machine when specific code is enabled?

I'm currently working on a project to enable database backed configurations in the frontend of our application. These need to be loaded after application initialization, so I created a module to load them and added a call to it in environment.rb, after Rails.application.initialize!.
The problem is that when this code is enabled, my console gets flooded with listen loop errors with bad file descriptors like:
2020-01-24 09:18:16 -0500: Listen loop error: #<Errno::EBADF: Bad file descriptor>
/Users/fionadurgin/.asdf/installs/ruby/2.6.5/lib/ruby/gems/2.6.0/gems/puma-4.3.1/lib/puma/server.rb:383:in `select'
/Users/fionadurgin/.asdf/installs/ruby/2.6.5/lib/ruby/gems/2.6.0/gems/puma-4.3.1/lib/puma/server.rb:383:in `handle_servers'
/Users/fionadurgin/.asdf/installs/ruby/2.6.5/lib/ruby/gems/2.6.0/gems/puma-4.3.1/lib/puma/server.rb:356:in `block in run'
When I disable either the call to the ConfigurationLoader or the methods I'm calling on the model, I no longer get these errors.
The rub is that I can't reproduce this issue on another machine, or in specs. I've tried on two other laptops and on one of our staging servers and they work perfectly with the ConfigurationLoader enabled.
I've tried restarting my computer, working from a freshly cloned repository, and setting all the file permissions for the application to 777. Nothing has worked so far.
Here's the ConfigurationLoader module:
module ConfigurationLoader
  # Overrides client default configurations if frontend configurations exist
  def self.call
    Configurations::ImportRowMapping.override_configurations
  rescue ActiveRecord::NoDatabaseError => e
    log_no_database_error(e)
  rescue ActiveRecord::StatementInvalid => e
    log_statement_invalid_error(e)
  rescue Mysql2::Error::ConnectionError => e
    log_connection_error(e)
  end

  def self.log_no_database_error(error)
    Rails.logger.warn(
      'Could not initialize database backed configurations, database does '\
      'not exist'
    )
    Rails.logger.warn(error.message)
  end

  def self.log_statement_invalid_error(error)
    Rails.logger.warn(
      'Could not initialize database backed configurations, table does '\
      'not exist'
    )
    Rails.logger.warn(error.message)
  end

  def self.log_connection_error(error)
    Rails.logger.warn(
      'Could not initialize database backed configurations, could not '\
      'connect to database'
    )
    Rails.logger.warn(error.message)
  end
end
The call in environment.rb:
# Load the Rails application.
require_relative 'application'
require_relative 'configuration_loader'
# Initialize the Rails application.
Rails.application.initialize!
ConfigurationLoader.call
And the model method being called:
def self.override_configurations
  return unless any?

  Rails.application.client.payroll_service_file.payroll_service_file
       .mappings = all.to_a
end
I'll note here that I get the errors when either the guard clause or the assignment is enabled.
Anyone have any ideas about what's going on? I'm about at my wits' end.
So I'm still not sure of the exact cause of the problem, but the solution was to move the configuration loader call out of environment.rb and into an after_initialize block in application.rb.
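In code, that change looks roughly like this (a sketch; the application class name is a placeholder and ConfigurationLoader is assumed to be required or autoloadable):

# config/application.rb
module MyApp
  class Application < Rails::Application
    # run the loader once the app has fully booted, instead of from environment.rb
    config.after_initialize do
      ConfigurationLoader.call
    end
  end
end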

How to run heroku restart from inside of a rails app?

I understand that from the console I can run heroku restart. What I'd like to do is to have a button in my app (admin console), where pushing that button runs a heroku restart. Does anyone know how to do that and if it's possible? So the code would look something like this:
<button id="heroku_restart">Restart</button>

$("#heroku_restart").click(function() {
  $.post('/restart', {}).done(function(response) {
    alert(response)
  })
})
class AdminsController
  # this is the action mapped to the route /restart
  def restart
    # code for heroku restart
  end
end
So per @vpibano, as of this writing, doing it with the platform-api is a breeze. Here's the action POSTed to by a button on my website:
def restart
  heroku = PlatformAPI.connect_oauth(ENV["heroku_oauth_token"])
  heroku.dyno.restart_all("lastmingear")
  render nothing: true
end
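For context, this assumes the platform-api gem is in the Gemfile and that a Heroku OAuth token is available to the app under the ENV key used above; a sketch of that setup (the token itself would typically come from heroku authorizations:create and be stored as a config var):

# Gemfile
gem 'platform-api'

# then set the token as a config var so ENV["heroku_oauth_token"] is available:
#   heroku config:set heroku_oauth_token=<token from `heroku authorizations:create`>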
As per the description mentioned in the post, one way of doing it is:
1) First locate the server.pid file
pid_file = Rails.root.join("tmp", "pids", "server.pid")
2) Now, truncate the contents of the file
File.open(pid_file, "w") {|f| f.truncate(0)}
3) Finally, run the server using the Kernel module:
Kernel.exec("rails s")
Note: As rightly mentioned by @vpibano, you will need authentication to access your app.
This is not a working model but a way to achieve the requirement.
Hope it helps!!

Rails - running threads after method has exited

When the client changes his profile picture it hits the update method, which responds with update.js.erb. This is a fast and straightforward process. However, behind the scenes on the server, a bunch of files (10 of them) are generated from the profile picture and these need to be uploaded to an Amazon bucket from the server. This is a lengthy process and I don't want to make the client wait until it is finished. Moreover, the file uploads often fail with a RequestTimeoutException because they take longer than 15 seconds.
All this raises many questions:
How do you perform the 10 file generations/uploads after the update method has exited? Threads are killed after the main method has finished.
How do you catch an exception inside a thread? The following code does not catch the timeout exceptions.
threads = []

threads << Thread.new {
  begin
    # upload file 1 ....
  rescue Rack::Timeout::RequestTimeoutException => e
    # try to upload again ....
  else
  ensure
  end
}

threads << Thread.new {
  begin
    # upload file 2 ....
  rescue Rack::Timeout::RequestTimeoutException => e
    # try to upload again ....
  else
  ensure
  end
}

threads.each { |thr|
  thr.join
}
What's the best way to try to upload a file again if it timed out?
What is the best solution to this problem?
You need to use a background-job gem such as delayed_job or whenever for this kind of task, but I would suggest Sidekiq.
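To make that concrete, here is a rough sketch of pushing each upload into a Sidekiq worker so the update action can respond immediately; the worker name, bucket, region, and key scheme are placeholders, and it assumes the aws-sdk-s3 gem:

require 'aws-sdk-s3'

class ProfileVariantUploadWorker
  include Sidekiq::Worker
  # let Sidekiq retry automatically if an upload raises or times out
  sidekiq_options retry: 5

  def perform(local_path, s3_key)
    s3 = Aws::S3::Resource.new(region: 'us-east-1') # placeholder region
    s3.bucket('my-profile-pictures').object(s3_key).upload_file(local_path)
  end
end

# enqueued once per generated file, after the update action has already responded:
# ProfileVariantUploadWorker.perform_async(path, "avatars/#{user.id}/large.jpg")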
I also faced the same problem in a project. I came across a solution using AWS Lambda. If you are using Rails, you can use the CarrierWave gem or the Rails 5 Active Storage module to upload the image to S3; if not, use the AWS SDK for Ruby to upload the files. You can bind events to a Lambda function so that whenever a file is created or modified on S3, the function is invoked. In the Lambda function you can write the logic to generate the derived files and upload them back to S3. Lambda code can be written in Ruby, Node, or Python.
This strategy may help you.

Rails+resque background job import not adding anything to the database

I have an issue with importing a lot of records from a user-provided Excel file into a database. The logic for this is working fine, and I'm using activerecord-import to cut down on the number of database calls. However, when a file is too large, the processing can take too long and Heroku will return a timeout. Solution: Resque and moving the processing to a background job.
So far, so good. I’ve needed to add CarrierWave to upload the files to S3 because I can’t just hold the file in memory for the background job. The upload portion is also working fine, I created a model for them and am passing the IDs through to the queued job to retrieve the file later as I understand I can’t pass a whole ActiveRecord object through to the job.
I’ve installed Resque and Redis locally, and everything seems to be setup correctly in that regard. I can see the jobs I’m creating being queued and then run without failing. The job seems to run fine, but no records are added to the database. If I run the code from my job line by line in the console, the records are added to the database as I would expect. But when the queued jobs I’m creating run, nothing happens.
I can’t quite work out where the problem might be.
Here’s my upload controller’s create action:
def create
  @upload = Upload.new(upload_params)
  if @upload.save
    Resque.enqueue(ExcelImportJob, @upload.id)
    flash[:info] = 'File uploaded.
      Data will be processed and added to the database.'
    redirect_to root_path
  else
    flash[:warning] = 'Upload failed. Please try again.'
    render :new
  end
end
This is a simplified version of the job with fewer sheet columns for clarity:
class ExcelImportJob < ApplicationJob
  @queue = :default

  def perform(upload_id)
    file = Upload.find(upload_id).file.file.file
    data = parse_excel(file)
    if header_matches? data
      # Create a database entry for each row, ignoring the first header row
      # using activerecord-import
      sales = []
      data.drop(1).each_with_index do |row, index|
        sales << Sale.new(row)
        if index % 2500 == 0
          Sale.import sales
          sales = []
        end
      end
      Sale.import sales
    end
  end

  def parse_excel(upload)
    # Open the uploaded excel document
    doc = Creek::Book.new upload
    # Map rows to the hash keys from the database
    doc.sheets.first.rows.map do |row|
      { date: row.values[0],
        title: row.values[1],
        author: row.values[2],
        isbn: row.values[3],
        release_date: row.values[5],
        units_sold: row.values[6],
        units_refunded: row.values[7],
        net_units_sold: row.values[8],
        payment_amount: row.values[9],
        payment_amount_currency: row.values[10] }
    end
  end

  # Returns true if header matches the expected format
  def header_matches?(data)
    data.first == { :date => 'Date',
                    :title => 'Title',
                    :author => 'Author',
                    :isbn => 'ISBN',
                    :release_date => 'Release Date',
                    :units_sold => 'Units Sold',
                    :units_refunded => 'Units Refunded',
                    :net_units_sold => 'Net Units Sold',
                    :payment_amount => 'Payment Amount',
                    :payment_amount_currency => 'Payment Amount Currency' }
  end
end
I could probably improve the logic anyway, as right now I'm holding the whole file in memory, but that isn't the issue I'm having: even with a small file of only 500 or so rows, the job doesn't add anything to the database.
Like I said, my code worked fine when I wasn't using a background job, and it still works if I run it in the console. But for some reason the job is doing nothing.
This is my first time using Resque, so I don't know if I'm missing something obvious. I did create a worker, and as I said it does seem to run the job. Here's the output from Resque's verbose formatter:
*** resque-1.27.4: Waiting for default
*** Checking default
*** Found job on default
*** resque-1.27.4: Processing default since 1508342426 [ExcelImportJob]
*** got: (Job{default} | ExcelImportJob | [15])
*** Running before_fork hooks with [(Job{default} | ExcelImportJob | [15])]
*** resque-1.27.4: Forked 63706 at 1508342426
*** Running after_fork hooks with [(Job{default} | ExcelImportJob | [15])]
*** done: (Job{default} | ExcelImportJob | [15])
In the Resque dashboard the jobs aren’t logged as failed. They get executed and I can see an increment in the ‘processed’ jobs on the stats page. But as I say the DB remains untouched. What’s going on? How can I debug the job more clearly? Is there a way to get into it with Pry?
It looks like my problem was with Resque.enqueue(ExcelImportJob, @upload.id).
I changed my code to ExcelImportJob.perform_later(@upload.id) and now my code actually runs!
I also added a resque.rake task to lib/tasks as described here: http://bica.co/2015/01/20/active-job-resque/.
That link also notes how to use rails runner to call the job without running the full Rails server and triggering the job, which is useful for debugging.
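For reference, that kind of resque.rake task is usually just a couple of lines; a sketch of the common pattern (not the exact file from the link):

# lib/tasks/resque.rake
require 'resque/tasks'

# load the Rails environment (and therefore the models) before workers run jobs
task 'resque:setup' => :environment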
Strangely, I didn't quite manage to get the job to print anything to STDOUT as suggested by @hoffm, but at least it led me down a good avenue of inquiry.
I still don't fully understand why calling Resque.enqueue added my jobs to the queue and indeed seemed to run them even though the code wasn't executed, so if someone has a better grasp and an explanation, that would be much appreciated.
TL;DR: calling perform_later rather than Resque.enqueue fixed the problem but I don't know why.

Fork, Ruby, ActiveRecord and File Descriptors on Fork

I understand that when we fork a process, the child process inherits a copy of the parent's open file descriptors and offsets. According to the man pages, these refer to the same open file descriptions used by the parent. Based on that theory, consider the following program:
puts "Process #{Process.pid}"
file = File.open('sample', 'w')
forked_pid = fork do
sleep(10)
puts "Writing to file now..."
file.puts("Hello World. #{Time.now}")
end
file.puts("Welcome to winter of my discontent #{Time.now}")
file.close
file = nil
Question 1:
Shouldn't the forked process, which is sleeping for 10 seconds, lose its file descriptor and be unable to write to the file once the parent process completes, closes the file, and exits?
Question 2: But if for whatever reason this works, then how does ActiveRecord lose its connection in this scenario? Only if I set :reconnect => true on the ActiveRecord connection can it actually connect, which means it is losing its connection.
require "rubygems"
require "redis"
require 'active_record'
require 'mysql2'
connection = ActiveRecord::Base.establish_connection({
:adapter =&gt 'mysql2',
:username =&gt 'root_user',
:password =&gt 'Pi',
:host =&gt 'localhost',
:database => 'list_development',
:socket =&gt '/var/lib/mysql/mysql.sock'
})
class User &lt ActiveRecord::Base
end
u = User.first
puts u.inspect
fork do
sleep 3
puts "*" * 50
puts User.first.inspect
puts "*" * 50
end
puts User.first.inspect
However, the same is not true of Redis (v2.4.8), which does not lose its connection on a fork. Does it try to reconnect internally on a fork?
If that's the case, then why isn't the file-writing program throwing an error?
Could somebody explain what's going on here? Thanks.
If you close a file descriptor in one process, it stays valid in the other process; this is why your file example works fine.
The mysql case is different because it's a socket with another process at the end. When you call close on the mysql adapter (or when the adapter gets garbage collected when ruby exits) it actually sends a "QUIT" command to the server saying that you're disconnecting, so the server tears down its side of the socket. In general you really don't want to share a mysql connection between two processes - you'll get weird errors depending on whether the two processes are trying to use the socket at the same time.
If closing a Redis connection just closes the socket (as opposed to sending an "I'm going away" message to the server), then the child's connection should continue to work because the socket won't actually have been closed.
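A common workaround for the ActiveRecord side of this is to give the forked child its own connection instead of sharing the parent's socket; a sketch using the same configuration as above:

fork do
  # re-establish the connection in the child so parent and child
  # are not talking over the same MySQL socket
  ActiveRecord::Base.establish_connection(
    :adapter  => 'mysql2',
    :username => 'root_user',
    :password => 'Pi',
    :host     => 'localhost',
    :database => 'list_development',
    :socket   => '/var/lib/mysql/mysql.sock'
  )
  puts User.first.inspect
end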
