Running threads inside my rails controller method - ruby-on-rails

I've got a set of data that I'd like to do some calculations on inside my Rails application. Each calculation is independent of the others, so I'd like to thread them to make my response much faster.
Here's what I've got ATM:
def show
  @stats = Stats.new
  Thread.new {
    @stats.top_brands = # RESULT OF FIRST CALCULATION
  }
  Thread.new {
    @stats.top_retailers = # RESULT OF SECOND CALCULATION
  }
  Thread.new {
    @stats.top_styles = # RESULT OF THIRD CALCULATION
  }
  Thread.new {
    @stats.top_colors = # RESULT OF FOURTH CALCULATION
  }
  render json: @stats
end
Now this returns a bunch of empty arrays for each of the members of @stats. However, if I join the threads, it works, but that defeats the purpose of threading since each join blocks.
Since I'm very new to threads, I'm wondering what I'm doing wrong here, or if it's even possible to accomplish what I'm trying to do: run 4 calculations in parallel and return the result to the client.
Thanks,
Joe

It first depends on whether your calculations are processor-heavy or spend most of their time on blocking IO, like reading from databases, the file system, or the network. Threading won't do much good in the former case, since each thread takes up CPU time and no other thread can be scheduled while it runs; it's even worse on Ruby MRI, which has a Global Interpreter Lock. If the threads are doing blocking IO, however, each one can wait, let another thread run, wait again, and so on until they all return.
In the end you do have to join all the threads, because you need their results before you render. Do this below all your Thread.new calls. Save the return value of each Thread.new to an array:
threads = []
threads << Thread.new ...
Then join them together before you render:
threads.each(&:join)
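Putting those two pieces together, a sketch of the whole action might look like this (the compute_* methods are placeholders for your actual calculations, and the whole approach only pays off if those calculations are the blocking-IO kind discussed above):
def show
  @stats = Stats.new
  threads = []
  threads << Thread.new { @stats.top_brands    = compute_top_brands }
  threads << Thread.new { @stats.top_retailers = compute_top_retailers }
  threads << Thread.new { @stats.top_styles    = compute_top_styles }
  threads << Thread.new { @stats.top_colors    = compute_top_colors }
  threads.each(&:join) # wait for every calculation before rendering
  render json: @stats
end
Each thread writes to a different attribute of @stats, so they don't step on each other; the join is only there so the render waits for all four results.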
If you want to be really sure this helps you out, just benchmark the entire action:
def show
  start_time = Time.now.to_f
  @stats = Stats.new
  Thread.new {
    @stats.top_brands = # RESULT OF FIRST CALCULATION
  }
  Thread.new {
    @stats.top_colors = # RESULT OF FOURTH CALCULATION
  }
  @elapsed_time = Time.now.to_f - start_time
  # do something with @elapsed_time, like putsing it or rendering it in your response
  render json: @stats
end
Hope that helps.

Related

How can I keep the Tempfile contents from being empty in a separate (Ruby) thread?

In a Rails 6.x app, I have a controller method which backgrounds queries that take longer than 2 minutes (to avoid a browser timeout), advises the user, stores the results, and sends a link that can retrieve them to generate a live page (with Highcharts charts). This works fine.
Now, I'm trying to implement the same logic with a method that backgrounds the creation of a report, via a Tempfile, and attaches the contents to an email, if the query runs too long. This code works just fine if the 2-minute timeout is NOT reached, but the Tempfile is empty at the commented line if the timeout IS reached.
I've tried wrapping the second part in another thread, and wrapping the internals of each thread with a mutex, but this is all getting above my head. I haven't done a lot of multithreading, and every time I do, I feel like I stumble around till I get it. This time, I can't even seem to stumble into it.
I don't know if the problem is with my thread(s), or a race condition with the Tempfile object. I've had trouble using Tempfiles before, because they seem to disappear quicker than I can close them. Is this one getting cleaned up before it can be sent? The file handle actually still exists on the file system at the commented point, even though it's empty, so I'm not clear on what's happening.
def report
  queue = Queue.new
  file = Tempfile.new('report')

  thr = Thread.new do
    query = %Q(blah blah blah)
    @calibrations = ActiveRecord::Base.connection.exec_query query
    query = %Q(blah blah blah)
    @tunings = ActiveRecord::Base.connection.exec_query query

    if queue.empty?
      unless @tunings.empty?
        CSV.open(file.path, 'wb') do |csv|
          csv << ["headers...", @parameters].flatten
          @calibrations.each do |c|
            line = [c["h1"], c["h2"], c["h3"], c["h4"], c["h5"], c["h6"], c["h7"], c["h8"]]
            t = @tunings.select { |t| t["code"] == c["code"] }.first
            @parameters.each do |parameter|
              line << t[parameter.downcase]
            end
            csv << line
          end
        end
        send_data file.read, :type => 'text/csv; charset=iso-8859-1; header=present', :disposition => "attachment; filename=\"report.csv\""
      end
    else
      # When "timed out", `file` is empty here
      NotificationMailer.report_ready(current_user, file.read).deliver_later
    end
  end

  give_up_at = Time.now + 120.seconds
  while Time.now < give_up_at do
    if !thr.alive?
      break
    end
    sleep 1
  end

  if thr.alive?
    queue << "Timeout"
    render html: "Your report is taking longer than 2 minutes to generate. To avoid a browser timeout, it will finish in the background, and the report will be sent to you in email."
  end
end
The reason the file is empty is that you give the queries 120 seconds to complete. If that hasn't happened after 120 seconds, you add "Timeout" to the queue. The queries are still running inside the thread and haven't reached the point where you check whether the queue is empty. When they do complete, the queue is no longer empty, so you skip the part where you write the CSV file and go straight to the NotificationMailer.report_ready line. At that point the file is still empty because nothing was ever written into it.
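One minimal restructuring that avoids the empty file is to write the CSV before consulting the queue, and only then decide whether to stream it or email it. A rough sketch, reusing the names from the question (not tested against the real app):
thr = Thread.new do
  # ... run the two queries as before ...
  unless @tunings.empty?
    CSV.open(file.path, 'wb') do |csv|
      # ... build the header and rows exactly as before ...
    end
  end
  if queue.empty?
    # we beat the timeout, so the controller is still waiting and can stream the file
    send_data file.read, :type => 'text/csv; charset=iso-8859-1; header=present', :disposition => "attachment; filename=\"report.csv\""
  else
    # the timeout already fired and a page was rendered, so deliver by email instead
    NotificationMailer.report_ready(current_user, file.read).deliver_later
  end
end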
In the end, though, I think you need to rethink the overall logic of what you are trying to accomplish: there needs to be more communication between the thread and the top level. The thread needs to tell the top level whether it has already sent the result, and the top level needs to let the thread know when it's past time to send the result directly, so that it emails the result instead.
Here is some code that I think / hope will give some insight into how to approach this problem.
timeout_limit = 10
query_times = [5, 15, 1, 15]
timeout = []
sent_response = []
send_via_email = []

puts "time out is set to #{timeout_limit} seconds"

query_times.each_with_index do |query_time, query_id|
  puts "starting query #{query_id} that will take #{query_time} seconds"
  timeout[query_id] = false
  sent_response[query_id] = false
  send_via_email[query_id] = false

  Thread.new do
    ## do query
    sleep query_time
    unless timeout[query_id]
      puts "query #{query_id} has completed, displaying results now"
      sent_response[query_id] = true
    else
      puts "query #{query_id} has completed, emailing result now"
      send_via_email[query_id] = true
    end
  end

  give_up_at = Time.now + timeout_limit
  while Time.now < give_up_at
    break if sent_response[query_id]
    sleep 1
  end

  unless sent_response[query_id]
    puts "query #{query_id} timed out, we will email the result of your query when it is completed"
    timeout[query_id] = true
  end
end

# simulate server environment
loop { }
=>
time out is set to 10 seconds
starting query 0 that will take 5 seconds
query 0 has completed, displaying results now
starting query 1 that will take 15 seconds
query 1 timed out, we will email the result of your query when it is completed
starting query 2 that will take 1 seconds
query 2 has completed, displaying results now
starting query 3 that will take 15 seconds
query 1 has completed, emailing result now
query 3 timed out, we will email the result of your query when it is completed
query 3 has completed, emailing result now

Getting a Primary Key error in Rails using Sidekiq and Sidekiq-Cron

I have a Rails project that uses Sidekiq for worker tasks, and Sidekiq-Cron to handle scheduling. I am running into a problem, though. I built a controller (below) that handled all of my API querying, validation of data, and then inserting data into the database. All of the logic functioned properly.
I then tore out the section of code that actually inserts API data into the database, and moved it into a Job class. This way the Controller method could simply pass all of the heavy lifting off to a job. When I tested it, all of the logic functioned properly.
Finally, I created a Job that would call the Controller method every minute, do the validation checks, and then kick off the other Job to save the API data (if necessary). When I do this the first part of the logic seems to work, where it inserts new event data, but the logic where it checks to see if this is the first time we've seen an event for a specific object seems to be failing. The result is a Primary Key violation in PG.
Code below:
Controller
require 'date'

class MonnitOpenClosedSensorsController < ApplicationController
  def holderTester()
    # MonnitschedulerJob.perform_later(nil)
  end

  # Create Sidekiq queue to process new sensor readings
  def queueNewSensorEvents(auth_token, network_id)
    m = Monnit.new("iMonnit", 1)

    # Construct the query to select the most recent communication date for each sensor in the network
    lastEventForEachSensor = MonnitOpenClosedSensor.select('"SensorID", MAX("LastCommunicationDate") as "lastCommDate"')
    lastEventForEachSensor = lastEventForEachSensor.group("SensorID")
    lastEventForEachSensor = lastEventForEachSensor.where('"CSNetID" = ?', network_id)

    todaysDate = Date.today
    sevenDaysAgo = (todaysDate - 7)

    lastEventForEachSensor.each do |event|
      # puts event["lastCommDate"]
      recentEvent = MonnitOpenClosedSensor.select('id, "SensorID", "LastCommunicationDate"')
      recentEvent = recentEvent.where('"CSNetID" = ? AND "SensorID" = ? AND "LastCommunicationDate" = ?', network_id, event["SensorID"], event["lastCommDate"])

      recentEvent.each do |recent|
        message = m.get_extended_sensor(auth_token, recent["SensorID"])
        if message["LastDataMessageMessageGUID"] != recent["id"]
          MonnitopenclosedsensorJob.perform_later(auth_token, network_id, message["SensorID"])
          # puts "hi inner"
          # puts message["LastDataMessageMessageGUID"]
          # puts recent['id']
          # puts recent["SensorID"]
          # puts message["SensorID"]
          # raise message
        end
      end
    end

    # Queue up any Sensor Events for new sensors
    # This would be sensors we've never seen before, from a Postgres standpoint
    sensors = m.get_sensor_ids(auth_token)
    sensors.each do |sensor|
      sensorCheck = MonnitOpenClosedSensor.select(:SensorID)
      # sensorCheck = MonnitOpenClosedSensor.select(:SensorID)
      sensorCheck = sensorCheck.group(:SensorID)
      sensorCheck = sensorCheck.where('"CSNetID" = ? AND "SensorID" = ?', network_id, sensor)
      # sensorCheck = sensorCheck.where('id = "?"', sensor["LastDataMessageMessageGUID"])

      if sensorCheck.any? == false
        MonnitopenclosedsensorJob.perform_later(auth_token, network_id, sensor)
      end
    end
  end
end
The above code breaks on Sensor Events for new sensors. First, it doesn't recognize that a sensor already exists; second, it doesn't recognize that the event it is trying to create has already been persisted to the database (it uses a GUID for comparison).
Job to persist data
class MonnitopenclosedsensorJob < ApplicationJob
  queue_as :default

  def perform(auth_token, network_id, sensor)
    m = Monnit.new("iMonnit", 1)
    newSensor = m.get_extended_sensor(auth_token, sensor)

    sensorRecord = MonnitOpenClosedSensor.new
    sensorRecord.SensorID = newSensor['SensorID']
    sensorRecord.MonnitApplicationID = newSensor['MonnitApplicationID']
    sensorRecord.CSNetID = newSensor['CSNetID']

    lastCommunicationDatePretty = newSensor['LastCommunicationDate'].scan(/[0-9]+/)[0].to_i / 1000.0
    nextCommunicationDatePretty = newSensor['NextCommunicationDate'].scan(/[0-9]+/)[0].to_i / 1000.0

    sensorRecord.LastCommunicationDate = Time.at(lastCommunicationDatePretty)
    sensorRecord.NextCommunicationDate = Time.at(nextCommunicationDatePretty)

    sensorRecord.id = newSensor['LastDataMessageMessageGUID']
    sensorRecord.PowerSourceID = newSensor['PowerSourceID']
    sensorRecord.Status = newSensor['Status']
    sensorRecord.CanUpdate = newSensor['CanUpdate'] == "true" ? 1 : 0
    sensorRecord.ReportInterval = newSensor['ReportInterval']
    sensorRecord.MinimumThreshold = newSensor['MinimumThreshold']
    sensorRecord.MaximumThreshold = newSensor['MaximumThreshold']
    sensorRecord.Hysteresis = newSensor['Hysteresis']
    sensorRecord.Tag = newSensor['Tag']
    sensorRecord.ActiveStateInterval = newSensor['ActiveStateInterval']
    sensorRecord.CurrentReading = newSensor['CurrentReading']
    sensorRecord.BatteryLevel = newSensor['BatteryLevel']
    sensorRecord.SignalStrength = newSensor['SignalStrength']
    sensorRecord.AlertsActive = newSensor['AlertsActive']
    sensorRecord.AccountID = newSensor['AccountID']
    sensorRecord.CreatedOn = Time.now.getutc
    sensorRecord.CreatedBy = "Monnit Open Closed Sensor Job"
    sensorRecord.LastModifiedOn = Time.now.getutc
    sensorRecord.LastModifiedBy = "Monnit Open Closed Sensor Job"

    sensorRecord.save
    sensorRecord = nil
  end
end
Job to call controller every minute
class MonnitschedulerJob < ApplicationJob
  queue_as :default

  def perform(*args)
    m = Monnit.new("iMonnit", 1)
    getImonnitUsers = ImonnitCredential.select('"auth_token", "username", "password"')
    getImonnitUsers.each do |user|
      # puts user["auth_token"]
      # puts user["username"]
      # puts user["password"]
      if user["auth_token"] != nil
        m.logon(user["auth_token"])
      else
        auth_token = m.get_auth_token(user["username"], user["password"])
        auth_token = auth_token["Result"]
      end

      network_list = m.get_network_list(auth_token)
      network_list.each do |network|
        # puts network["NetworkID"]
        MonnitOpenClosedSensorsController.new.queueNewSensorEvents(auth_token, network["NetworkID"])
      end
    end
  end
end
Sorry about the length of the post. I tried to include as much information as I could about the code involved.
EDIT
Here is the code for the extended sensor, along with the JSON response:
def get_extended_sensor(auth_token, sensor_id)
  response = self.class.get("/json/SensorGetExtended/#{auth_token}?SensorID=#{sensor_id}")
  if response['Result'] != "Invalid Authorization Token"
    response['Result']
  else
    response['Result']
  end
end
{
  "Method": "SensorGetExtended",
  "Result": {
    "ReportInterval": 180,
    "ActiveStateInterval": 180,
    "InactivityAlert": 365,
    "MeasurementsPerTransmission": 1,
    "MinimumThreshold": 4294967295,
    "MaximumThreshold": 4294967295,
    "Hysteresis": 0,
    "Tag": "",
    "SensorID": 189092,
    "MonnitApplicationID": 9,
    "CSNetID": 24391,
    "SensorName": "Open / Closed - 189092",
    "LastCommunicationDate": "/Date(1500999632000)/",
    "NextCommunicationDate": "/Date(1501010432000)/",
    "LastDataMessageMessageGUID": "d474b3db-d843-40ba-8e0e-8c4726b61ec2",
    "PowerSourceID": 1,
    "Status": 0,
    "CanUpdate": true,
    "CurrentReading": "Open",
    "BatteryLevel": 100,
    "SignalStrength": 84,
    "AlertsActive": true,
    "CheckDigit": "QOLP",
    "AccountID": 14728
  }
}
Some thoughts:
recentEvent = MonnitOpenClosedSensor.select('id, "SensorID", "LastCommunicationDate"') -
this is not doing any ordering; you are presuming that the records you retrieve here are the latest records.
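For example, if the intent is "the most recent event for this sensor", making that explicit might look like this (a sketch of an alternative to matching on the MAX date):
recentEvent = MonnitOpenClosedSensor.select('id, "SensorID", "LastCommunicationDate"')
recentEvent = recentEvent.where('"CSNetID" = ? AND "SensorID" = ?', network_id, event["SensorID"])
recentEvent = recentEvent.order('"LastCommunicationDate" DESC').limit(1)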
m = Monnit.new("iMonnit", 1)
newSensor = m.get_extended_sensor(auth_token, sensor)
without the implementation details of get_extended_sensor it's impossible to tell you how
sensorRecord.id = newSensor['LastDataMessageMessageGUID']
is resolving.
It's highly likely that you are getting duplicate messages. It's almost never a good idea to use input data as a primary key; instead, autogenerate a GUID in your job, use that as the primary key, and use the LastDataMessageMessageGUID purely as a correlation id.
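A sketch of what that might look like in the job (it assumes you add a separate column for the API GUID; the column name here is made up):
require 'securerandom'

sensorRecord.id = SecureRandom.uuid # autogenerated primary key, never taken from API input
# keep the API's GUID purely as a correlation id in its own (hypothetical) column
sensorRecord.LastDataMessageGUID = newSensor['LastDataMessageMessageGUID']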
So the issue that I was running into, as it turns out, is as follows:
A sensor event was pulled from the API and queued up as a worker job in Sidekiq.
If the queue is running a bit slow (API speed, or simply a lot of jobs to process), the 1-minute poll might fire again, pull the same sensor event down, and queue it up a second time.
As the queue processes, the sensor event gets inserted into the database with its GUID as the primary key.
As the queue continues to catch up with itself, it hits the same event that was queued the second time. That job then fails.
My solution was to move the "does this SensorID and GUID already exist in the database?" check into the job itself. So the first thing the job does when it runs is check AGAIN whether the record already exists. This means I am checking twice, but the extra check has low overhead.
There is still a risk that the check passes while another job is inserting the same record but hasn't committed it yet, in which case the insert could still fail. But the retry would catch it, and the job would clear out as a successful run when the check finds the record on the second attempt. Having said that, the check occurs AFTER the API data has been pulled, and persisting a single record is much faster than the API call, so the chances of any job having to retry are tiny; you'd have a better chance of hitting the lottery than having the second check fail and trigger a retry.
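The guard at the top of the job could look roughly like this (a sketch; the exact where clause is whatever uniquely identifies an event in your schema):
def perform(auth_token, network_id, sensor)
  m = Monnit.new("iMonnit", 1)
  newSensor = m.get_extended_sensor(auth_token, sensor)

  # bail out if another job already persisted this event
  return if MonnitOpenClosedSensor.where(
    '"SensorID" = ? AND id = ?', newSensor['SensorID'], newSensor['LastDataMessageMessageGUID']
  ).exists?

  # ... build and save sensorRecord exactly as before ...
end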
If anyone else has a better or cleaner solution, please feel free to include it as a secondary answer!

Run external processes in non-blocking mode

I want to periodically perform some actions in parallel and, once they're all done, show the results to the user on a page. This will happen approximately once every 5 minutes, depending on user activity.
These actions are performed by external, third-party applications (processes). There are about 4 of them at the moment, so I have to run 4 external processes for each user request.
While they are running, I show the user a page with an ajax spinner and send ajax requests to the server to check whether everything is done. Once done, I show the results.
Here is a rough version of what I have
class MyController
  def my_action(request_id)
    res = external_apps_cmds_with_args.each do |x|
      # new process
      res = Open3.popen3 x do |stdin, stdout, stderr, wait_thr|
        exit_value = wait_thr.value.exitstatus
        if exit_value == 0 ....
      end
    end
    write_res_to_db res, request_id # each external app writes to the db its own result for each request_id
  end
end
The calculations CAN be done in parallel because there's NO overall result here; there are only the results from each tool. There is no race condition.
So I want them to run in non-blocking mode, obviously.
Is Open3.popen3 a non-blocking command? Or should I run the external processes in different threads:
threads = []
external_apps_cmds_with_args.each do |x|
  # new threads
  threads << Thread.new do
    # new process
    res = Open3.popen3 x do |stdin, stdout, stderr, wait_thr|
      exit_value = wait_thr.value.exitstatus
      if exit_value == 0 ....
    end
  end
  write_res_to_db res, request_id # each external app writes to the db its own result for each request_id
end
threads.each(&:join)
Or should I create only one thread?
# only one new thread
thread = Thread.new do
  res = external_apps_cmds_with_args.each do |x|
    # new process
    res = Open3.popen3 x do |stdin, stdout, stderr, wait_thr|
      exit_value = wait_thr.value.exitstatus
      if exit_value == 0 ....
    end
  end
  write_res_to_db res, request_id # each external app writes to the db its own result for each request_id
end
thread.join
Or should I continue using the approach I'm using now: NO threads at all?
What I would suggest is that you have one action to load the page and then a separate ajax action for each process. As the processes finish, they return data to the user (presumably in different parts of the page), and you take advantage of the multi-process/threading capabilities of your webserver.
This approach still has some issues because, like your original ideas, it ties up some of your web processes while the external processes are running, and you may run into timeouts. If you want to avoid that, you could run them as background jobs (delayed_job, resque, etc.) and then display the data when the jobs have finished.
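If you go the background-job route, a rough sketch with ActiveJob might look like this (the job name is made up, and write_res_to_db stands in for however you persist each tool's result):
require 'open3'

class ExternalToolJob < ApplicationJob
  queue_as :default

  def perform(cmd, request_id)
    Open3.popen3(cmd) do |stdin, stdout, stderr, wait_thr|
      output = stdout.read
      write_res_to_db(output, request_id) if wait_thr.value.exitstatus == 0
    end
  end
end

# in the controller: enqueue one job per external tool and return immediately
external_apps_cmds_with_args.each { |cmd| ExternalToolJob.perform_later(cmd, request_id) }
The ajax polling you already have can then check the database for all four results and render once they are present.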

Duplicated results on Ruby threading

I need to improve a rake task that builds clothing looks by fetching images from an external server.
When I try to create multiple threads, the results are duplicated.
But if I put sleep 0.1 before each Thread.new, the code works! Why?
new_looks = []
threads = []

for look in looks
  # sleep 0.1 - when I put it in, it works!
  threads << Thread.new do
    # an external http request is being done here
    new_looks << Look.new(ref: look["look_ref"])
  end
end

puts 'waiting threads to finish...'
threads.each(&:join)

puts 'saving...'
new_looks.sort_by(&:ref).each(&:save)
Array is not generally thread safe. Switch to a thread-safe data structure such as Queue:
new_look_queue = Queue.new
threads = looks.map do |look|
  Thread.new do
    new_look_queue.enq Look.new(ref: look["look_ref"])
  end
end

puts 'waiting threads to finish...'
threads.each(&:join)

puts 'saving...'
new_looks = []
while !new_look_queue.empty?
  new_looks << new_look_queue.deq
end
new_looks.sort_by(&:ref).each(&:save)
Queue#enq puts a new entry in the queue; Queue#deq gets one out, blocking if there isn't one.
If you don't need the new_looks saved in order, the code gets simpler:
puts 'saving...'
while !new_look_queue.empty?
  new_look_queue.deq.save
end
Or, even simpler yet, just do the save inside the thread.
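That simplest variant might look like this (each thread builds and saves its own record, so no shared collection is needed):
threads = looks.map do |look|
  Thread.new { Look.new(ref: look["look_ref"]).save }
end
threads.each(&:join)
Keep in mind that each thread saving a record checks out its own database connection, so the number of threads should stay within your ActiveRecord connection pool size.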
If you have a great many looks, the above code will create more threads than is good. Too many threads make the requests take too long to process and consume excess memory. In that case, consider creating a fixed number of producer threads:
NUM_THREADS = 8
As before, there's a queue of finished work:
new_look_queue = Queue.new
But there's now also a queue of work to be done:
look_queue = Queue.new
looks.each do |look|
  look_queue.enq look
end
Each thread will live until it's out of work, so let's add some "out of work" symbols to the queue, one for each thread:
NUM_THREADS.times { look_queue.enq :done }
And now the threads:
threads = NUM_THREADS.times.map do
  Thread.new do
    while (look = look_queue.deq) != :done
      new_look_queue.enq Look.new(ref: look["look_ref"])
    end
  end
end
Processing the new_look_queue is the same as above.
Try to update your code to this one:
for look in looks
  threads << Thread.new(look) do |lk|
    new_looks << Look.new(ref: lk["look_ref"])
  end
end
This should help you. A for loop does not create a new scope per iteration, so every thread's block closes over the same look variable, and by the time a thread actually runs, look may already hold a later element (which is why you see duplicates, and why sleep 0.1 hides the problem). Passing look to Thread.new hands each thread its current value as the block argument lk, so each thread works on its own look.
UPD: Forgot about Thread.new(args)

How do I iterate on a collection when I don't know what the upper limit of iterations is?

I have an API that I am pulling data from, and I want to collect all the tags from this API...but I don't know the number of tags in advance, and the API throttles access via the max number of results returned in any 1 call (100). It has an unlimited number of pages though.
So a call may look like this: Tag.update_tags(100, 5), where 100 is the max number of objects returned in one call and 5 is the page to begin on (i.e. assuming the tags are stored sequentially, this says: return the tag records with IDs in the range 401 - 500).
The issue is, I don't want to have to enter the 5 manually (i.e. I don't know what the upper limit is). There is no way for me to ping the total number of tags (if there were, I would simply divide it and put this call in a loop up to that number).
All I do know is that once it reaches a page that doesn't have any results, it will return an empty array [].
So, how do I loop through all the tags and stop when the result returned is an empty array (which would be the final result returned and therefore not evaluated)?
What does that loop look like?
Use an unconditional loop with a break statement for when the call returns the empty array.
i = 1
loop do
  result = call_to_api(i)
  break if result.empty?
  do_something_with(result)
  i += 1
end
Of course in a production scenario you want something a little more robust, including exception handlers, some progress log reporting, and some kind of concrete iteration limit to ensure that the loop does not become infinite.
Update
Here's an example using a class to wrap up the logic.
require 'net/http'
require 'json'
require 'uri'

class Api
  DEFAULT_OPTIONS = {:start_position => 1, :max_iterations => 1000}

  def initialize(base_uri, config = {})
    @base_uri = base_uri
    @config = DEFAULT_OPTIONS.merge(config)
    @position = @config[:start_position]
    @results_count = 0
  end

  def each(&block)
    advance(&block) while can_advance?
    log("Processed #{@results_count} results")
  end

  def advance(&block)
    yield result
    @results_count += result.count
    @position += 1
    @current_result = nil
  end

  def result
    @current_result ||= begin
      response = Net::HTTP.get_response(current_uri)
      JSON.parse(response.body)
    rescue
      # provide some exception handling/logging
    end
  end

  def can_advance?
    @position < (@config[:start_position] + @config[:max_iterations]) && result.any?
  end

  def current_uri
    URI.parse("#{@base_uri}?page=#{@position}")
  end
end

api = Api.new('http://somesite.com/api/v1/resource')
api.each do |result|
  do_something_with(result)
end
There's also an angle here that allows for concurrency: give each thread its own start position and iteration count, which would definitely speed this up since the HTTP requests would then run concurrently.
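For example, with the class above, each thread could be given its own window of pages (a sketch; the thread count and window size are arbitrary):
PAGES_PER_THREAD = 250

threads = 4.times.map do |n|
  Thread.new do
    api = Api.new('http://somesite.com/api/v1/resource',
                  {:start_position => n * PAGES_PER_THREAD + 1, :max_iterations => PAGES_PER_THREAD})
    api.each { |result| do_something_with(result) }
  end
end
threads.each(&:join)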
Hmmm. You can get 100 items at a time, and start at a particular page. How to implement the iteration depends on what you want to do. Let's suppose that you want to collect all the unique tags. Establish a map (in Ruby, a Hash), then retrieve one page at a time and process it. When you hit a page that's empty, you're done.
# Collects the unique tags into a hash, one page at a time
unique_tags = {}
page_number = 1

loop do
  # get a page of tags
  page = read_tags(page_number)
  break if page.nil? || page.empty?
  page.each { |tag| unique_tags[tag] = true }
  page_number += 1
end
