Threading in rails - ruby-on-rails

I have initialized 250 threads at a time and they are returning back to update some data in the database. I am using Postgresql database in my rails 2 application. I have set Pool size 100 and max connections 100 but the problem is after 100 connections remaining threads are causing problem like "FATAL ERROR: Too many clients". So now i want is as soon as any thread complete its process then kill that thread. SO to achieve this what should i do?
Here is my code:
consider detail = "contains 250 items in an array"
threads = []
detail.each do |item|
threads << Thread.new( item) do | item |
# block of code
end
end
threads.each { | t | t.join }

I hope you are using Rails 2.2 where connection pooling has been implemented. Check this:
http://guides.rubyonrails.org/2_2_release_notes.html#connection-pooling
and this http://api.rubyonrails.org/classes/ActiveRecord/ConnectionAdapters/ConnectionPool.html

Related

Read data from csv file with foreach function

I have been reading data from csv, if there is a large csv file, for avoid this time-out(rack 12 sec timeout) i have read only 25 rows from csv after 25 rows it return and again make a request so this will continue until read all the rows.
def read_csv(offset)
r_count = 1
CSV.foreach(file.tempfile, options) do |row|
if r_count > offset.to_i
#process
end
r_count += 1
end
But here it is creating a new issue, let say first read 25 rows then when the next request comes offset is 25 that time it will read upto first 25 rows then it will start read from 26 and do process, so how can i skip this rows which already read?, i tried this if next to skip iteration but that fails, or is there any other efficient way to do this?
Code
def read_csv(fileName)
lines = (`wc -l #{fileName}`).to_i + 1
lines_processed = 0
open(fileName) do |csv|
csv.each_line do |line|
#process
lines_processed += 1
end
end
end
Pure Ruby - SLOWER
def read_csv(fileName)
lines = open("sample.csv").count
lines_processed = 0
open(fileName) do |csv|
csv.each_line do |line|
#process
lines_processed += 1
end
end
end
Benchmarks
I ran a new benchmark comparing your original method provided and my own. I also included the test file information.
"File Information"
Lines: 1172319
Size: 126M
"django's original method"
Time: 18.58 secs
Memory: 0.45 MB
"OneNeptune's method"
Time: 0.58 secs
Memory: 2.18 MB
"Pure Ruby method"
Time: 0.96
Memory: 2.06 MB
Explanation
NOTE: I added a pure ruby method, since using wc is sort of cheating, and not portable. In most cases it's important to use pure language solutions.
You can use this method to process a very large CSV file.
~2MB memory I feel is pretty optimal considering the file size, it's a bit of an increase of memory usage, but the time savings seems to be a fair trade, and this will prevent timeouts.
I did modify the method to take a fileName, but this was just because I was testing many different CSV files to make sure they all worked correctly. You can remove this if you'd like, but it'll likely be helpful.
I also removed the concept of an offset, since you stated you originally included it to try to optimize the parsing yourself, but this is no longer necessary.
Also, I keep track of how many lines are in the file, and how many were processed since you needed to use that information. Note, that lines only works on unix based systems, and it's a trick to avoid loading the entire file into memory, it counts the new lines, and I add 1 to account for the last line. If you're not going to count headers as line though, you could remove the +1 and change lines to "rows" to be more accurate.
Another logistical problem you may run into is the need to figure how to handle if the CSV file has headers.
You could use lazy reading to speed this up, the whole of the file wouldn't be read, just from the beginning of the file until the chunk you use.
See http://engineering.continuity.net/csv-tricks/ and https://reinteractive.com/posts/154-improving-csv-processing-code-with-laziness for examples.
You could also use SmarterCSV to work in chunks like this.
SmarterCSV.process(file_path, {:chunk_size => 1000}) do |chunk|
chunk.each do |row|
# Do your processing
end
do_something_else
end
enter code here
The way I did this was by streaming the result to the user, if you see what is happening it doesn't bother that much you have to wait. The timeout you mention won't happen here.
I'm not a Rails user so I give an example from Sinatra, this can be done with Rails also. See eg http://api.rubyonrails.org/classes/ActionController/Streaming.html
require 'sinatra'
get '/' do
line = 0
stream :keep_open do |out|
1.upto(100) do |line| # this would be your CSV file opened
out << "processing line #{line}<br>"
# process line
sleep 1 # for simulating the delay
end
end
end
A still better but somewhat complicated solution would be to use websockets, the browser would receive the results from the server once the processing is finished. You will need some javascript in the client also to handle this. See https://github.com/websocket-rails/websocket-rails

Mutual Exclusion not happened in Ruby

Program:
def inc(n)
n + 1
end
sum = 0
threads = (1..10).map do
Thread.new do
10_000.times do
sum = inc(sum)
end
end
end
threads.each(&:join)
p sum
Output:
$ ruby MutualExclusion.rb
100000
$
My expected output of above program is less than 100,000. Because, the above program create 10 threads and each of the thread
update the shared variable 'sum' to 10,000 times. But during execution of the program, mutual exclusion will definitely happen. Because,
mutual exclusion is not handled here. So I expect less than 100,000 as output. But it gives exactly 100,000 as output. How it is
happened ? Who handle the mutual exclusion here ? And how I experiment this problem(ME).
The default interpreter for Ruby (MRI) doesn't execute threads in parallel. The mechanism that's preventing your race condition from introducing casually unexpected behavior is the Global Interpreter Lock (GIL).
You can learn more about this, including a very similar demonstration, here: http://www.jstorimer.com/blogs/workingwithcode/8085491-nobody-understands-the-gil

Why is this RegExp taking 16 minutes to process on Rails?

I've written a function to remove email addresses from my data using gsub. The code is below. The problem is that it takes a total of 27 minutes to execute the function on a set of 10,000 records. (16 minutes for the first pattern, 11 minutes for the second). Elsewhere in the code I process about 20 other RegExp's using a similar flow (iterating through data.each) and they all finish in less than a second. (BTW, I recognize that my RegExp's aren't perfect and may catch some strings that aren't email addresses.)
Is there something about these two RegExp's that is causing the processing time to be so high? I've tried it on seven different data sources all with the same result, so the problem isn't peculiar to my data set.
def remove_email_addresses!(data)
email_patterns = [
/[[:graph:]]+#[[:graph:]]+/i,
/[[:graph:]]+ +at +[^ ][ [[:graph:]]]{0,40} +dot +com/i
]
data.each do |row|
email_patterns.each do |pattern|
row[:title].gsub!(pattern,"") unless row[:title].blank?
row[:description].gsub!(pattern,"") unless row[:description].blank?
end
end
end
Check that your faster code isn't just doing var =~ /blah/ matching, rather than replacement: that is several orders of magnitude faster.
In addition to reducing backtracking and replacing + and * with ranges for safety, as follows...
email_patterns = [
/\b[-_.\w]{1,128}#[-_.\w]{1,128}/i,
/\b[-_.\w]{1,128} {1,10}at {1,10}[^ ][-_.\w ]{0,40} {1,10}dot {1,10}com/i
]
... you could also try "unrolling your loop", though this is unlikely to cause any issues unless there is some kind of interaction between the iterators (which there shouldn't be, but...). That is:
data.each do |row|
row[:title].gsub!(patterns[0],"") unless row[:title].blank?
row[:description].gsub!(patterns[0],"") unless row[:description].blank?
row[:title].gsub!(patterns[1],"") unless row[:title].blank?
row[:description].gsub!(patterns[1],"") unless row[:description].blank?
end
Finally, if this causes little to no speedup, consider profiling with something like ruby-prof to find out whether the regexes themselves are the issue, or whether there's a problem in the do iterator or the unless clauses instead.
Could it be that the data is large enough that it causes issues with paging once read in? If so, might it be faster to read the data in and parse it in chunks of N entries, rather than process the whole lot at once?

Ruby Multi threading, what am I doing wrong?

So, in order to improve to speed of our app I'm experimenting multi threading with our rails app.
Here is the code:
require 'thwait'
require 'benchmark'
city = Location.find_by_slug("orange-county", :select => "city, state, lat, lng", :limit => 1)
filters = ContractorSearchConditions.new()
image_filter = ImageSearchConditions.new()
filters.lat = city.lat
filters.lon = city.lng
filters.mile_radius = 20
filters.page_size = 15
filters.page = 1
image_filter.page_size = 5
sponsored_filter = filters.dup
sponsored_filter.has_advertised = true
sponsored_filter.page_size = 50
Benchmark.bm do |b|
b.report('with') do
1.times do
cities = Thread.new{
Location.where("lat between ? and ? and lng between ? and ?", city.lat-0.5, city.lat+0.5, city.lng-0.5, city.lng+0.5)
}
images = Thread.new{
Image.search(image_filter)[:hits]
}
sponsored_results_extended = Thread.new{
sponsored_filter.mile_radius = 50
#sponsored_results = Contractor.search( sponsored_filter )
}
results = Thread.new{
Contractor.search( filters )
}
ThreadsWait.all_waits(cities, images, sponsored_results_extended, results)
#cities = cities.value
#images = images.value
#sponsored_results = sponsored_results_extended.value
#results = results.value
end
end
b.report('without') do
1.times do
#cities = Location.where("lat between ? and ? and lng between ? and ?", city.lat-0.5, city.lat+0.5, city.lng-0.5, city.lng+0.5)
#image = Image.search(image_filter)[:hits]
#sponsored_results = Contractor.search( sponsored_filter )
#results = Contractor.search( filters )
end
end
end
Class.search is running a search on our ElasticSearch servers.(3 servers behind a Load balancer), where active record queries are being runned in our RDS instance.
(Everything is in the same datacenter.)
Here is the output on our dev server:
Bob#dev-web01:/usr/local/dev/buildzoom/rails$ script/rails runner script/thread_bm.rb -e development
user system total real
with 0.100000 0.010000 0.110000 ( 0.342238)
without 0.020000 0.000000 0.020000 ( 0.164624)
Nota: I've a very limited knowledge if no knowledge about thread, mutex, GIL, ..
There is a lot more overhead in the "with" block than the "without" block due to the Thread creation and management. Using threads will help the most when the code is IO-bound, and it appears that is NOT the case. Four searches complete in 20ms (without block), which implies that in parallel those searches should take less that amount of time. The "with" block takes 100ms to execute, so we can deduce that at least 80ms of that time is not spent in searches. Try benchmarking with longer queries to see how the results differ.
Note that I've made the assumption that all searches have the same latency, which may or may not be true, and always perform the same. It may be possible that the "without" block benefits from some sort of query caching since it runs after the "with" block. Do results differ when you swap the order of the benchmarks? Also, I'm ignoring overhead from the iteration (1.times). You should remove that unless you change the number of iterations to something greater than 1.
Even though you are using threads, and hence performing query IO in parallel, you still need to deserialize whatever results are coming back from your queries. This uses the CPU. MRI Ruby 2.0.0 has a global interpreter lock. This means Ruby code can only run one line at a time, not in parallel, and only on one CPU core. In order to deserialize all your results, the CPU has to context switch many times between the different threads. This is a lot more overhead than deserializing each result set sequentially.
If your wall time is dominated by waiting for a response from your queries, and they don't all come back at the same time, then there might be an advantage to parallelizing with threads. But it's hard to predict that.
You could try using JRuby or Rubinius. These will both utilize multiple cores, and hence can actually speed up your code as expected.

Ruby File IO - Failed to allocate memory

Below is a method which inserts records into the devices database. I am having a problem where I get a 'failed to allocate memory' error.
It is being run on a Windows Mobile device with quite limited memory.
There are 10 models, one is reasonably large with 108,000 records.
The error occurs when executing this line (f.readlines().each do |line|) but it occurs after the largest model has already been inserted.
Is the memory not being released by the block that is iterating through the lines? Or is there something else happening?
Any help on this matter would be greatly appreciated!
def insertRecordsIntoRhom(models)
updateAmount = 45 / models.length
GC.enable
models.each_with_index do |model,i|
csvColumns = Array.new
db = ::Rho::RHO.get_src_db(model)
db.start_transaction
begin
j=0
f = File.new("#{model}.csv")
f.readlines().each do |line|
#extract columns from header line of csv
if j==0
csvColumns = getCsvFieldFromHeader(line)
j+=1
next
end
eval(models[i] + ".create(#{csvPutIntoHash(line,csvColumns)})")
end
f.close
db.commit
rescue
db.rollback
end
end
end
IO#readlines returns an Array, i.e. it reads the whole file and returns a list of all the lines. No line can be garbage collected until you are completely done iterating that list.
Since you only need one line at a time, you should use IO#each_line instead. This will read only a little bit at a time and pass you lines one by one. Once you are done with a line, it can be garbage collected while the rest of the file is being processed.
Finally, note that Ruby comes bundled with a good CSV library, you probably want to use that if you can instead of rolling your own.

Resources