Optimal way to structure polling external service (RoR) - ruby-on-rails

I have a Rails application with a Document model that has an available flag. The document is uploaded to an external server where it is not immediately available (it takes time to propagate). What I'd like to do is poll for availability and update the model once the document is available.
I'm looking for the most performant solution for this process (the service does not offer callbacks):
1. Document is uploaded to the app
2. App uploads the document to the external server
3. App polls the URL (http://external.server.com/document.pdf) until it is available
4. App updates the model: Document.available = true
I'm stuck on step 3. I'm already using Sidekiq in my project. Is that an option, or should I use a completely different approach (e.g. a cron job)?
Documents will be uploaded all the time, so it seems sensible to first query the database/Redis for Documents which are not yet available.

See this answer: Making HTTP HEAD request with timeout in Ruby
Basically you set up a HEAD request for the known url and then asynchronously loop until you get a 200 back (with a 5 second delay between iterations, or whatever).
Do this from your controller after the document is uploaded:
Document.delay.poll_for_finished(@document.id)
And then in your document model:
def self.poll_for_finished(document_id)
  document = Document.find(document_id)
  # make sure the document exists and should still be polled for
  return unless document.continue_polling?

  if document.remote_document_exists?
    document.available = true
  else
    document.poll_attempts += 1 # assumes you care how many times you've checked; could be ignored
    Document.delay_for(5.seconds).poll_for_finished(document.id)
  end
  document.save
end

def continue_polling?
  # this can be more or less sophisticated
  !available && poll_attempts < 5
end

def remote_document_exists?
  # Net::HTTP.start takes a host name, not a full URL;
  # `path` is assumed to be the document's path on that host, e.g. "/document.pdf"
  Net::HTTP.start('external.server.com') do |http|
    http.open_timeout = 2
    http.read_timeout = 2
    return "200" == http.head(path).code
  end
end
This is still a blocking operation. Opening the Net::HTTP connection will block if the server you're trying to contact is slow or unresponsive. If you're worried about it use Typhoeus. See this answer for details: What is the preferred way of performing non blocking I/O in Ruby?
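For example, the existence check could be swapped for something along these lines (a rough sketch; it assumes the typhoeus gem is in your Gemfile and that the model exposes the same path attribute used above):

require 'typhoeus'

def remote_document_exists?
  response = Typhoeus.head(
    "http://external.server.com#{path}",
    connecttimeout: 2,
    timeout: 2
  )
  response.code == 200
end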

Related

How can I use a websocket to report the progress of POST file uploads with Fast API

Taking the file upload and websocket examples from Fast API, consider a POST route:
@app.post("/uploadfiles/")
async def create_upload_files(files: List[UploadFile] = File(...)):
    data = [file.filename for file in files]
    return process_data(data)
And consider the websocket
@app.websocket("/ws/{client_id}")
async def websocket_endpoint(websocket: WebSocket, client_id: int):
    await manager.connect(websocket)
    try:
        while True:
            data = await websocket.receive_text()
            await manager.send_personal_message(f"You wrote: {data}", websocket)
            # Replace with an update based on process_data() status
    except WebSocketDisconnect:
        manager.disconnect(websocket)
How can I update the websocket on the progress of any post requests for a given client_id?
I've looked through dependency injections and background tasks, and I can't seem to figure out a way to do it.
Ultimately, the tool I'm trying to build:
User uploads multiple files on client side
Server processes the files (which can take minutes to hours), and will ultimately return a single file to the user (e.g., redirect to download the file)
Meanwhile, I'd like the user to be updated on the progress of the data processing (e.g., via websockets).
If the user closes the tab (client-side disconnect), the original data processing job is stopped.
I've read about polling and other approaches, but it seems like web sockets ought to work.

Rails - multiple theads to avoid the slack 3 second API response rule

I am working with the slack API. My script does a bunch of external processing and in some cases it can take around 3-6 seconds. What is happening is the Slack API expects a 200 response within 3 seconds and because my function is not finished within 3 seconds, it retries again and then it ends up posting the same automated responses 2-3 times.
I confirmed this by commenting out all the functions and I had no issue; it posted the responses to Slack fine. I then added sleep 10 and it posted the same responses 3 times, so the only thing different was that it took longer.
From what I read, I need to have threaded responses. I need to first respond to the Slack API in thread 1 and then go about processing my functions.
Here is what I tried:
def events
  Thread.new do
    json = {
      "text": "Here is your 200 response immediately slack",
    }
    render(json: json)
  end
  puts "--------------------------------Json response started----------------------"
  sleep 30
  puts "--------------------------------Json response completed----------------------"
  puts "this is a successful response"
end
When I tested it, the same issue happened, so I tried an online API tester: it hits the page, waits 30 seconds and then returns the 200 response. But I need it to respond immediately with the 200, THEN process the rest, otherwise I will get duplicates.
Am I using threads properly or is there another way to get around this Slack API 3 second response limit? I am new to both rails and slack API so a bit lost here.
Appreciate the eyes :)
I would recommend using ActiveJob to run the code in the background if you don't need to use the result of the code in the response. First, create an ActiveJob job by running:
bin/rails generate job do_stuff
And then open up the file created in app/jobs/do_stuff_job.rb and edit the #perform method to include your code (so the puts statements and sleep 30 in your example). Finally, from the controller action you can call DoStuffJob.perform_later and your job will run in the background! Your final controller action will look something like this:
def events
  DoStuffJob.perform_later # this schedules DoStuffJob to run later, in the
                           # background, so it returns immediately and
                           # execution continues to the next line
  json = {
    "text": "Here is your 200 response immediately slack",
  }
  render(json: json)
end
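For completeness, the generated app/jobs/do_stuff_job.rb might look roughly like this once the slow work is moved into it (a sketch only; the slow processing is stood in for by the sleep and puts from the question):

class DoStuffJob < ApplicationJob
  queue_as :default

  def perform
    # the slow work that used to block the controller goes here
    puts "--------------------------------Json response started----------------------"
    sleep 30
    puts "--------------------------------Json response completed----------------------"
    puts "this is a successful response"
  end
end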
As an aside, I'd highly recommend never using Thread.new in Rails. It can create some really confusing behavior, especially in test scripts, for a number of reasons, but usually because of how it interacts with open connections and specifically ActiveRecord.

Is there a way to get around H12 timeouts in Heroku using griddler to parse inbound messages with large attachments from sendgrid

My workflow is email--sendgrid--griddler and the rails app runs on heroku. All inbound emails have attachments, and some are rather large. I keep getting H12 timeouts on Heroku because the attachments take more than 30s to upload to Google Cloud Storage.
I have used delayed_job for everything I can, but I don't THINK I can pass an attachment from griddler to a delayed job, since the attachment is ephemeral. A friend suggested just moving to retrieving the emails from Gmail instead of using sendgrid and griddler, but that would be more of a rewrite than I am up for at the moment. In an ideal world, I would be able to pass the attachments to a delayed job, but I don't know if that is possible.
email_processor.rb
if pdfs = attachments.select { |attachment| attachment.content_type == 'application/pdf' }
  pdfs.each do |att|
    # att is an ActionDispatch::Http::UploadedFile
    # content_type = MIME::Types.type_for(att.to_path).first.content_type
    content_type = att.content_type
    if content_type == 'application/pdf'
      # later, if we use multiple attachments in a single fax, change num_pages
      fax = @email_address.phone_number.faxes.create(
        sending_email: @email_address.address,
        twilio_to: result,
        twilio_from: @email_address.phone_number.number,
        twilio_num_pages: 1
      )
      attachment = fax.create_attachment
      # next two rows should be run with a delay
      attachment.upload_file!(att)
      # moved to attachment model for testing
      # fax.send!
    end
  end
end
file upload from another model
def upload_file!(file)
  # file should be an ActionDispatch::Http::UploadedFile
  filename = file.original_filename.gsub(/\s+/, '_')
  filename = filename[0..15] if filename.size > 16
  path = "fax/#{fax.id}/att-#{id}-#{filename}"
  upload_out!(file.open, path)
  # self.fax.send!
  # make_thumbnail_pdf!(file.open)
end

def upload_out!(file, path)
  upload = StorageBucket.files.new key: path, body: file, public: true
  upload.save # upload file
  update_columns url: upload.public_url
  self.fax.update(status: 'outbound-uploaded')
  self.fax.process!
end
If you cannot receive and upload the attachment in 30 seconds then heroku won't work for receiving emails. You are right - the ephemeral storage on the web dyno is not accessible from the worker dyno running delayed job.
Even if it were possible for the worker dyno to read data from the web dyno's ephemeral storage there is no guarantee the web dyno could handle the POST from sendgrid in 30 seconds if the attachments were large enough.
One option is to configure sendgrid to forward the emails directly to your google app engine - https://cloud.google.com/appengine/docs/standard/python/mail/receiving-mail-with-mail-api
Your app engine script could write the attachments into google cloud storage, and then your app engine script could do a POST to your heroku app with the location of the attachment and the web app could then queue a delayed job to download and handle the attachment.
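Under that design, the Heroku side might be reduced to something like the following (a rough sketch; the route, the gcs_path parameter, and AttachmentFetcher are hypothetical names, not part of the original setup):

# config/routes.rb (hypothetical route)
post '/attachments/notify', to: 'attachments#notify'

# app/controllers/attachments_controller.rb
class AttachmentsController < ApplicationController
  def notify
    # The App Engine script POSTs the Cloud Storage path of the attachment it saved.
    # AttachmentFetcher is a hypothetical service object that downloads the file and
    # carries on with the fax processing; queueing it keeps this request fast.
    AttachmentFetcher.delay.fetch(params.require(:gcs_path))
    head :ok # respond well inside Heroku's 30-second window
  end
end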
I ended up just rewriting my email processing completely. I set up Gmail as the mail target, and then used a scheduled task on Heroku to process the emails (look for unread) and upload the attachments to Google Cloud Storage. Using a scheduled task gets around the H12 issues.
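That scheduled task could be a rake task built on the mail gem's IMAP retriever, along these lines (a rough sketch; the ENV variable names are placeholders, and StorageBucket is the same bucket object used in upload_out! above):

# lib/tasks/fetch_mail.rake
require 'mail'

task fetch_mail: :environment do
  Mail.defaults do
    retriever_method :imap,
                     address:    'imap.gmail.com',
                     port:       993,
                     user_name:  ENV['GMAIL_USER'],
                     password:   ENV['GMAIL_PASSWORD'],
                     enable_ssl: true
  end

  # look only at messages that are still unread
  Mail.find(what: :last, count: 50, keys: 'UNSEEN').each do |message|
    message.attachments.each do |attachment|
      next unless attachment.content_type.start_with?('application/pdf')
      path = "inbound/#{message.message_id}/#{attachment.filename}"
      StorageBucket.files.new(key: path, body: attachment.decoded, public: true).save
    end
  end
end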

Preventing rapid-fire login attempts with Rack::Attack

We have been reading the Definitive guide to form based website authentication with the intention of preventing rapid-fire login attempts.
One example of this could be:
1 failed attempt = no delay
2 failed attempts = 2 sec delay
3 failed attempts = 4 sec delay
etc
Other methods appear in the guide, but they all require a storage capable of recording previous failed attempts.
Blocklisting is discussed in one of the posts in this issue (it appears under the old name, blacklisting, which was later changed in the documentation to blocklisting) as a possible solution.
As per Rack::Attack specifically, one naive example of implementation could be:
Where the login fails:
StorageMechanism.increment("bad-login/#{req.ip}")
In the rack-attack.rb:
Rack::Attack.blacklist('bad-logins') { |req|
  StorageMechanism.get("bad-login/#{req.ip}")
}
There are two parts here: returning the response if the request is blocklisted, and checking whether a previous failed attempt happened (StorageMechanism).
The first part, returning the response, can be handled automatically by the gem. However, the second part is less clear to me, at least with the de facto cache backend choice for the gem and the Rails world, Redis.
As far as I know, expired keys in Redis are automatically removed. That would make it impossible to access the information (even if expired), set a new value for the counter, and increase the timeout for the refractory period accordingly.
Is there any way to achieve this with Redis and Rack::Attack?
I was thinking that maybe the 'StorageMechanism' has to remain absolutely agnostic in this case and know nothing about Rack::Attack and its storage choices.
Sorry for the delay in getting back to you; it took me a while to dig out my old code relating to this.
As discussed in the comments above, here is a solution using a blocklist with a findtime:
# config/initializers/rack-attack.rb
class Rack::Attack
  (1..6).each do |level|
    blocklist("allow2ban login scrapers - level #{level}") do |req|
      Allow2Ban.filter(
        req.ip,
        maxretry: (20 * level),
        findtime: (8**level).seconds,
        bantime:  (8**level).seconds
      ) do
        req.path == '/users/sign_in' && req.post?
      end
    end
  end
end
You may wish to tweak those numbers as desired for your particular application; the figures above are only what I decided as 'sensible' for my particular application - they do not come from any official standard.
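To see what those levels actually amount to, you can print the resulting ban schedule with the same arithmetic (8**6 seconds is roughly three days):

(1..6).each do |level|
  puts "level #{level}: #{20 * level} sign-in POSTs within #{8**level} seconds " \
       "bans the IP for #{8**level} seconds"
end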
One issue with the above is that when developing/testing the application (e.g. running your rspec test suite), you can easily hit those limits and inadvertently throttle yourself. This can be avoided by adding the following config to the initializer:
safelist('allow from localhost') do |req|
  '127.0.0.1' == req.ip || '::1' == req.ip
end
The most common brute-force login attack is a brute-force password attack where an attacker simply tries a large number of emails and passwords to see if any credentials match.
You should mitigate this in the application by use of an account LOCK after a few failed login attempts. (For example, if using devise then there is a built-in Lockable module that you can make use of.)
However, this account-locking approach opens a new attack vector: An attacker can spam the system with login attempts, using valid emails and incorrect passwords, to continuously re-lock all accounts!
This configuration helps mitigate that attack vector, by exponentially limiting the number of sign-in attempts from a given IP.
I also added the following "catch-all" request throttle:
throttle('req/ip', limit: 300, period: 5.minutes, &:ip)
This is primarily to throttle malicious/poorly configured scrapers; to prevent them from hogging all of the app server's CPU.
Note: If you're serving assets through rack, those requests may be counted by rack-attack and this throttle may be activated too quickly. If so, enable the condition to exclude them from tracking.
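That exclusion could look something like this (a variation on the catch-all throttle above; adjust the /assets prefix to wherever your static assets are served from):

throttle('req/ip', limit: 300, period: 5.minutes) do |req|
  req.ip unless req.path.start_with?('/assets')
end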
I also wrote an integration test to ensure that my Rack::Attack configuration was doing its job. There were a few challenges in making this test work, so I'll let the code+comments speak for itself:
class Rack::AttackTest < ActionDispatch::IntegrationTest
  setup do
    # Prevent subtle timing issues (==> intermittent test failures)
    # when the HTTP requests span across multiple seconds
    # by FREEZING TIME(!!) for the duration of the test
    travel_to(Time.now)

    @removed_safelist = Rack::Attack.safelists.delete('allow from localhost')

    # Clear the Rack::Attack cache, to prevent test failure when
    # running multiple times in quick succession.
    #
    # First, un-ban localhost, in case it is already banned after a previous test:
    (1..6).each do |level|
      Rack::Attack::Allow2Ban.reset('127.0.0.1', findtime: (8**level).seconds)
    end

    # Then, clear the 300-request rate limiter cache:
    Rack::Attack.cache.delete("#{Time.now.to_i / 5.minutes}:req/ip:127.0.0.1")
  end

  teardown do
    travel_back # Un-freeze time
    Rack::Attack.safelists['allow from localhost'] = @removed_safelist
  end

  test 'should block access on 20th successive /users/sign_in attempt' do
    19.times do |i|
      post user_session_url
      assert_response :success, "was not even allowed to TRY to login on attempt number #{i + 1}"
    end

    # For DOS protection: Don't even let the user TRY to login; they're going way too fast.
    # Rack::Attack returns 403 for blocklists by default, but this can be reconfigured:
    # https://github.com/kickstarter/rack-attack/blob/master/README.md#responses
    post user_session_url
    assert_response :forbidden, 'login access should be blocked upon 20 successive attempts'
  end
end

Queuing API calls to fit rate limit

I'm using the Full Contact API, but they have a rate limit of 300 calls/minute. I currently have it set up so that it makes an API call for each row while uploading the CSV file of emails. I want to queue it so that once it hits the rate limit or makes 300 calls, it waits for 1 minute and then proceeds. Then I will put delayed_job on it. How can I do that? A quick fix is to use
sleep 60
but how do I detect that it has already made 300 calls, and make it sleep or queue the rest for the next window?
def self.import(file)
  CSV.foreach(file.path, headers: true) do |row|
    hashy = row.to_hash
    email = hashy["email"]
    begin
      Contact.create!(email: email, contact_hash: FullContact.person(email: email).to_json)
    rescue FullContact::NotFound
      Contact.create!(email: email, contact_hash: "Not Found")
    end
  end
end
There are several issues to think about here - is there going to be a single process using your API key at any one time, or is it possible that multiple processes would be running at once? If you have multiple delayed_job workers, I think the latter is likely. I haven't used delayed_job enough to give you a good solution to that, but my feeling is you would be restricted to a single worker.
I am currently working on a similar problem with an API with a restriction of 1 request every 0.5 seconds, with a maximum of 1000 per day. I haven't worked out how I want to track the per-day usage yet, but I've handled the per-second restriction using threads. If you can frame the restriction as "1 request every 0.2 seconds", that might free you up from having to track it on a minute-by-minute basis (though you still have the issue of how to keep track across multiple workers).
The basic idea is that I have a request method that splits a single request into a queue of request parameters (based on the maximum number of objects allowed per request by the API), and then another method iterates over that queue and calls a block which sends the actual request to the remote server. Something like this:
# REQUEST_INTERVAL is the minimum time between the start of consecutive
# requests, e.g. 0.2 seconds for a 300 calls/minute limit
def make_multiple_requests(queue, &block)
  result = []
  queue.each do |request|
    timer = Thread.new { sleep REQUEST_INTERVAL }
    execution = Thread.new { result << yield(request) }
    [timer, execution].each(&:join)
  end
  result
end
To use it:
make_multiple_requests(queue) do |request|
  your_request_method_goes_here(request)
end
The main benefit here is that if a request takes longer than the allowed interval, you don't have to wait around for the sleep to finish, and you can start your next request right away. It just guarantees that the next request won't start until at least the interval has passed. I've noticed that even though the interval is set correctly, I occasionally get an 'over-quota' response from the API. In those cases, the request is retried after the appropriate interval has passed.
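Applied to the CSV import from the question, the same idea might look roughly like this (a sketch only; 0.2 seconds between calls corresponds to the 300 calls/minute limit, and the retry-on-over-quota handling is left out):

REQUEST_INTERVAL = 0.2 # 300 calls per minute

def self.import(file)
  CSV.foreach(file.path, headers: true) do |row|
    email = row.to_hash["email"]
    timer = Thread.new { sleep REQUEST_INTERVAL }
    begin
      Contact.create!(email: email, contact_hash: FullContact.person(email: email).to_json)
    rescue FullContact::NotFound
      Contact.create!(email: email, contact_hash: "Not Found")
    end
    timer.join # make sure at least REQUEST_INTERVAL has elapsed before the next row
  end
end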
