Program:
def inc(n)
  n + 1
end

sum = 0
threads = (1..10).map do
  Thread.new do
    10_000.times do
      sum = inc(sum)
    end
  end
end
threads.each(&:join)
p sum
Output:
$ ruby MutualExclusion.rb
100000
$
My expected output for the above program is less than 100,000, because the program creates 10 threads and each thread updates the shared variable 'sum' 10,000 times. Since mutual exclusion is not handled here, a race condition should definitely occur during execution, so I expect an output of less than 100,000. But it prints exactly 100,000. How did this happen? Who handles the mutual exclusion here? And how can I experiment with this problem (ME)?
The default interpreter for Ruby (MRI) doesn't execute threads in parallel. The mechanism that's preventing your race condition from producing unexpected behavior is the Global Interpreter Lock (GIL).
You can learn more about this, including a very similar demonstration, here: http://www.jstorimer.com/blogs/workingwithcode/8085491-nobody-understands-the-gil
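To answer the experimentation part: on an interpreter without a GIL (JRuby, for example) the race becomes observable, and the intended mutual exclusion can be made explicit with Ruby's built-in Mutex. A minimal sketch of the guarded counter, mirroring the program above rather than any library internals:
def inc(n)
  n + 1
end

sum = 0
lock = Mutex.new

threads = (1..10).map do
  Thread.new do
    10_000.times do
      # Without this synchronize block, the read-increment-write on `sum`
      # could interleave between threads on a truly parallel interpreter.
      lock.synchronize { sum = inc(sum) }
    end
  end
end
threads.each(&:join)
p sum # => 100000 on any interpreter, GIL or not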
Related
I'm evaluating the use of Lua scripts in Redis, and they seem to be a bit slow. I ran a benchmark as follows:
For a non-Lua version, I did a simple SET key_i val_i 1M times
For a Lua version, I did the same thing, but in a script: EVAL "return redis.call('SET', KEYS[1], ARGV[1])" 1 key_i val_i
Testing on my laptop, the Lua version is about 3x slower than the non-Lua version. I understand that Lua is a scripting language, not compiled, etc., but this seems like a lot of performance overhead. Is this normal?
Assuming this is indeed normal, is there any workaround? Is there a way to implement the script in a faster language, such as C (which Redis is written in), to achieve better performance?
Edit: I am testing this using the go code located here: https://gist.github.com/ortutay/6c4a02dee0325a608941
The problem is not with Lua or Redis; it's with your expectations. You are compiling a script 1 million times. There is no reason to expect this to be fast.
The purpose of EVAL within Redis is not to execute a single command; you could do that yourself. The purpose is to do complex logic within Redis itself, on the server rather than on your local client. That is, instead of doing one set operation per EVAL call, you actually perform the entire series of 1 million sets within a single EVAL script, which the Redis server itself will execute.
I don't know much about Go, so I can't write the syntax for calling it. But I know what the Lua script would look like:
for i = 1, tonumber(ARGV[1]) do  -- ARGV values arrive as strings, so convert
  local key = "key:" .. tostring(i)
  redis.call('SET', key, i)
end
Put that in a Go string, then pass that to the appropriate call, with no key arguments and a single non-key argument that is the number of times to loop.
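The answer leaves the Go call unspecified; for reference, here is a hedged sketch of the same single EVAL from Ruby with the redis gem (the client used in the benchmark below), passing the loop count as the one non-key argument:
require 'redis'

redis = Redis.new

loop_script = <<-LUA
  for i = 1, tonumber(ARGV[1]) do
    local key = "key:" .. tostring(i)
    redis.call('SET', key, i)
  end
LUA

# A single round trip; the server performs all 1M sets itself.
redis.eval(loop_script, [], ["1000000"])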
I stumbled on this thread and was also curious about the benchmark results. I wrote a quick Ruby script to compare the approaches. The script does a simple SET/GET pair on the same key using different options.
require "redis"
def elapsed_time(name, &block)
start = Time.now
block.call
puts "#{name} - elapsed time: #{(Time.now-start).round(3)}s"
end
iterations = 100000
redis_key = "test"
redis = Redis.new
elapsed_time "Scenario 1: From client" do
iterations.times { |i|
redis.set(redis_key, i.to_s)
redis.get(redis_key)
}
end
eval_script1 = <<-LUA
redis.call("SET", "#{redis_key}", ARGV[1])
return redis.call("GET", "#{redis_key}")
LUA
elapsed_time "Scenario 2: Using EVAL" do
iterations.times { |i|
redis.eval(eval_script1, [redis_key], [i.to_s])
}
end
elapsed_time "Scenario 3: Using EVALSHA" do
sha1 = redis.script "LOAD", eval_script1
iterations.times { |i|
redis.evalsha(sha1, [redis_key], [i.to_s])
}
end
eval_script2 = <<-LUA
for i = 1,#{iterations} do
redis.call("SET", "#{redis_key}", tostring(i))
redis.call("GET", "#{redis_key}")
end
LUA
elapsed_time "Scenario 4: Inside EVALSHA" do
sha1 = redis.script "LOAD", eval_script2
redis.evalsha(sha1, [redis_key], [])
end
eval_script3 = <<-LUA
for i = 1,2*#{iterations} do
redis.call("SET", "#{redis_key}", tostring(i))
redis.call("GET", "#{redis_key}")
end
LUA
elapsed_time "Scenario 5: Inside EVALSHA with 2x the operations" do
sha1 = redis.script "LOAD", eval_script3
redis.evalsha(sha1, [redis_key], [])
en
I got the following results running on my MacBook Pro:
Scenario 1: From client - elapsed time: 11.498s
Scenario 2: Using EVAL - elapsed time: 6.616s
Scenario 3: Using EVALSHA - elapsed time: 6.518s
Scenario 4: Inside EVALSHA - elapsed time: 0.241s
Scenario 5: Inside EVALSHA with 2x the operations - elapsed time: 0.5s
In summary:
Scenario 1 vs. scenario 2 shows that the main contributor is round-trip time: scenario 1 makes 2 requests to Redis while scenario 2 makes only 1, and scenario 1 takes ~2x the execution time.
Scenario 2 vs. scenario 3 shows that EVALSHA does provide some benefit, and I am sure this benefit increases the more complex the script gets.
Scenario 4 vs. scenario 5 shows that the overhead of invoking the script is minimal: we doubled the number of operations and saw a ~2x increase in execution time.
So there is now a workaround, using a module created by John Sully. It works for Redis and KeyDB and allows you to use the V8 JIT engine, which runs complex scripts much faster than Lua scripts: https://github.com/JohnSully/ModJS
Within a larger Lua script, I have to copy several tables dt:
for i = 1, dt:nrow() do
  local r = {}
  for j = 1, dt:ncol() do
    r[j] = dt[i][j]
  end
  rslt:append(r)
end
The tables are about 50,000 rows x 25 cols, containing mainly doubles. LuaJIT takes about 10 times as long as "standard" Lua. On all other calculations/operations I do beforehand, LuaJIT is faster (1.5 to 3 times).
As silly as this may sound, try pre-allocating the r table with 25 values:
local r = {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}
Unfortunately the Lua API doesn't allow pre-allocation of tables, so this is the only way to avoid the re-allocations caused by array assignment in the inner loop. My tests show a noticeable improvement, but not close to 10x (although I don't use your methods, so your results may vary).
I've written a function to remove email addresses from my data using gsub. The code is below. The problem is that it takes a total of 27 minutes to execute the function on a set of 10,000 records (16 minutes for the first pattern, 11 minutes for the second). Elsewhere in the code I process about 20 other regexes using a similar flow (iterating through data.each), and they all finish in less than a second. (BTW, I recognize that my regexes aren't perfect and may catch some strings that aren't email addresses.)
Is there something about these two regexes that is causing the processing time to be so high? I've tried it on seven different data sources, all with the same result, so the problem isn't peculiar to my data set.
def remove_email_addresses!(data)
  email_patterns = [
    /[[:graph:]]+@[[:graph:]]+/i,
    /[[:graph:]]+ +at +[^ ][ [[:graph:]]]{0,40} +dot +com/i
  ]

  data.each do |row|
    email_patterns.each do |pattern|
      row[:title].gsub!(pattern, "") unless row[:title].blank?
      row[:description].gsub!(pattern, "") unless row[:description].blank?
    end
  end
end
Check that your faster code isn't just doing var =~ /blah/ matching, rather than replacement: that is several orders of magnitude faster.
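If in doubt, a quick sketch to compare the two (the sample string and iteration counts here are invented for illustration):
require 'benchmark'

text = "lorem ipsum contact someone@example.com dolor sit " * 20
pattern = /[[:graph:]]+@[[:graph:]]+/i

Benchmark.bm(8) do |b|
  b.report('=~')    { 10_000.times { text =~ pattern } }              # match only
  b.report('gsub!') { 10_000.times { text.dup.gsub!(pattern, "") } }  # replace
end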
In addition to reducing backtracking and replacing + and * with ranges for safety, as follows...
email_patterns = [
  /\b[-_.\w]{1,128}@[-_.\w]{1,128}/i,
  /\b[-_.\w]{1,128} {1,10}at {1,10}[^ ][-_.\w ]{0,40} {1,10}dot {1,10}com/i
]
... you could also try "unrolling your loop", though this is unlikely to help unless there is some kind of interaction between the iterators (which there shouldn't be, but...). That is:
data.each do |row|
  row[:title].gsub!(email_patterns[0], "") unless row[:title].blank?
  row[:description].gsub!(email_patterns[0], "") unless row[:description].blank?
  row[:title].gsub!(email_patterns[1], "") unless row[:title].blank?
  row[:description].gsub!(email_patterns[1], "") unless row[:description].blank?
end
Finally, if this causes little to no speedup, consider profiling with something like ruby-prof to find out whether the regexes themselves are the issue, or whether there's a problem in the do iterator or the unless clauses instead.
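As a starting point, a minimal ruby-prof sketch (assuming the gem is installed and data holds your 10,000 records):
require 'ruby-prof'

result = RubyProf.profile do
  remove_email_addresses!(data)
end

# A flat report sorted by self time shows whether the regexes or the
# surrounding iteration dominate.
RubyProf::FlatPrinter.new(result).print($stdout)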
Could it be that the data is large enough that it causes issues with paging once read in? If so, might it be faster to read the data in and parse it in chunks of N entries, rather than process the whole lot at once?
So, in order to improve the speed of our app, I'm experimenting with multithreading in our Rails app.
Here is the code:
require 'thwait'
require 'benchmark'

city = Location.find_by_slug("orange-county", :select => "city, state, lat, lng", :limit => 1)

filters = ContractorSearchConditions.new()
image_filter = ImageSearchConditions.new()

filters.lat = city.lat
filters.lon = city.lng
filters.mile_radius = 20
filters.page_size = 15
filters.page = 1

image_filter.page_size = 5

sponsored_filter = filters.dup
sponsored_filter.has_advertised = true
sponsored_filter.page_size = 50

Benchmark.bm do |b|
  b.report('with') do
    1.times do
      cities = Thread.new {
        Location.where("lat between ? and ? and lng between ? and ?", city.lat - 0.5, city.lat + 0.5, city.lng - 0.5, city.lng + 0.5)
      }
      images = Thread.new {
        Image.search(image_filter)[:hits]
      }
      sponsored_results_extended = Thread.new {
        sponsored_filter.mile_radius = 50
        @sponsored_results = Contractor.search(sponsored_filter)
      }
      results = Thread.new {
        Contractor.search(filters)
      }
      ThreadsWait.all_waits(cities, images, sponsored_results_extended, results)
      @cities = cities.value
      @images = images.value
      @sponsored_results = sponsored_results_extended.value
      @results = results.value
    end
  end
  b.report('without') do
    1.times do
      @cities = Location.where("lat between ? and ? and lng between ? and ?", city.lat - 0.5, city.lat + 0.5, city.lng - 0.5, city.lng + 0.5)
      @image = Image.search(image_filter)[:hits]
      @sponsored_results = Contractor.search(sponsored_filter)
      @results = Contractor.search(filters)
    end
  end
end
Class.search runs a search against our Elasticsearch servers (3 servers behind a load balancer), while the ActiveRecord queries run against our RDS instance.
(Everything is in the same datacenter.)
Here is the output on our dev server:
Bob#dev-web01:/usr/local/dev/buildzoom/rails$ script/rails runner script/thread_bm.rb -e development
user system total real
with 0.100000 0.010000 0.110000 ( 0.342238)
without 0.020000 0.000000 0.020000 ( 0.164624)
Note: I have very limited knowledge, if any, about threads, mutexes, the GIL, ...
There is a lot more overhead in the "with" block than the "without" block due to the Thread creation and management. Using threads helps the most when the code is IO-bound, and it appears that is NOT the case here. Four searches complete in 20ms (the "without" block), which implies that in parallel those searches should take less than that amount of time. The "with" block takes 100ms to execute, so we can deduce that at least 80ms of that time is not spent in searches. Try benchmarking with longer queries to see how the results differ.
Note that I've assumed all searches have the same latency and always perform the same, which may or may not be true. It may also be that the "without" block benefits from some sort of query caching, since it runs after the "with" block. Do the results differ when you swap the order of the benchmarks? Also, I'm ignoring the overhead of the iteration (1.times); you should remove that unless you change the number of iterations to something greater than 1. One way to control for ordering and warm-up effects is sketched below.
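A hedged sketch: Benchmark.bmbm from the standard library runs a rehearsal pass before the measured pass, which reduces ordering and warm-up effects (the two helper methods are hypothetical stand-ins for your blocks):
require 'benchmark'

Benchmark.bmbm do |b|
  b.report('with')    { run_threaded_searches }   # hypothetical: the Thread-based block
  b.report('without') { run_sequential_searches } # hypothetical: the sequential block
end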
Even though you are using threads, and hence performing query IO in parallel, you still need to deserialize whatever results come back from your queries. This uses the CPU. MRI Ruby 2.0.0 has a global interpreter lock: only one thread can execute Ruby code at a time, on one CPU core. In order to deserialize all your results, the CPU has to context-switch many times between the different threads. This is a lot more overhead than deserializing each result set sequentially.
If your wall time is dominated by waiting for a response from your queries, and they don't all come back at the same time, then there might be an advantage to parallelizing with threads. But it's hard to predict that.
You could try using JRuby or Rubinius. These will both utilize multiple cores, and hence can actually speed up your code as expected.
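To see the IO-bound case where MRI threads genuinely help, here is a small self-contained sketch; sleep stands in for network latency, and MRI releases the GIL while a thread sleeps or waits on IO:
require 'benchmark'

def fake_query
  sleep 0.05 # stand-in for waiting on Elasticsearch or the database
end

Benchmark.bm(12) do |b|
  b.report('sequential') { 4.times { fake_query } }
  b.report('threaded') do
    4.times.map { Thread.new { fake_query } }.each(&:join)
  end
end
# 'threaded' should finish in roughly one query's latency, not four.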
Note: This is a sequel to my previous question about powersets.
I have got a nice Ruby solution to my previous question about generating a powerset of a set without needing to keep a stack:
class Array
  def powerset
    return to_enum(:powerset) unless block_given?
    1.upto(self.size) do |n|
      self.combination(n).each { |i| yield i }
    end
  end
end
# demo
['a', 'b', 'c'].powerset{|item| p item} # items are generated one at a time
ps = [1, 2, 3, 4].powerset # no block, so you'll get an enumerator
10.times.map{ ps.next } # 10.times without a block is also an enumerator
It does the job and works nicely.
However, I would like to try to rewrite the same solution in Erlang, because for the {|item| p item} block I have a big portion of working code already written in Erlang (it does some stuff with each generated subset).
Although I have some experience with Erlang (I have read both books :)), I am pretty confused by the example and the comments that sepp2k kindly gave me on my previous question about powersets. I do not understand the last line of the example; the only thing I know is that it is a list comprehension. I do not understand how I can modify it to do something with each generated subset (then throw it away and continue with the next subset).
How can I rewrite this iterative Ruby powerset generation in Erlang? Or does the provided Erlang example already almost suit the need?
All the given examples have O(2^N) memory complexity: the first example returns the whole result, and the last two use regular recursion, so the stack grows. The code below, a modification and combination of those examples, will do what you want:
subsets(Lst) ->
    N = length(Lst),
    Max = trunc(math:pow(2, N)),
    subsets(Lst, 0, N, Max).

subsets(Lst, I, N, Max) when I < Max ->
    _Subset = [lists:nth(Pos+1, Lst) || Pos <- lists:seq(0, N-1), I band (1 bsl Pos) =/= 0],
    % perform some actions on the particular subset here
    subsets(Lst, I+1, N, Max);
subsets(_, _, _, _) ->
    done.
In the above snippet tail recursion is used, which the Erlang compiler optimizes and converts to simple iteration under the covers. Recursion may be optimized this way only if the recursive call is the last one in the function's execution flow. Note also that each generated _Subset may be used in place of the comment and will be forgotten (garbage collected) right after that. Thanks to this, neither the stack nor the heap grows, but you also have to perform your operation on the subsets one after another, as there is no final result containing all of them.
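For readers more at home in Ruby, here is a sketch of the same bit-mask idea (subset I contains the element at position Pos exactly when bit Pos of I is set), with each subset handed to a block and then discarded:
def each_subset(list)
  (0...(1 << list.size)).each do |i|
    # Integer#[] reads a single bit, mirroring `I band (1 bsl Pos)` above.
    subset = list.each_index.select { |pos| i[pos] == 1 }.map { |pos| list[pos] }
    yield subset # used and then forgotten; nothing accumulates
  end
end

each_subset(['a', 'b', 'c']) { |s| p s }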
My code uses the same names for analogous variables as the examples, to make it easier to compare them. I'm sure it could be refined for performance, but I only want to show the idea.