Working with a large data object between ruby processes - ruby-on-rails

I have a Ruby hash that reaches approximately 10 megabytes if written to a file using Marshal.dump. After gzip compression it is approximately 500 kilobytes.
Iterating through and altering this hash is very fast in ruby (fractions of a millisecond). Even copying it is extremely fast.
The problem is that I need to share the data in this hash between Ruby on Rails processes. In order to do this using the Rails cache (file_store or memcached) I need to Marshal.dump the file first, however this incurs a 1000 millisecond delay when serializing the file and a 400 millisecond delay when serializing it.
Ideally I would want to be able to save and load this hash from each process in under 100 milliseconds.
One idea is to spawn a new Ruby process to hold this hash that provides an API to the other processes to modify or process the data within it, but I want to avoid doing this unless I'm certain that there are no other ways to share this object quickly.
Is there a way I can more directly share this hash between processes without needing to serialize or deserialize it?
Here is the code I'm using to generate a hash similar to the one I'm working with:
#a = []
0.upto(500) do |r|
#a[r] = []
0.upto(10_000) do |c|
if rand(10) == 0
#a[r][c] = 1 # 10% chance of being 1
else
#a[r][c] = 0
end
end
end
#c = Marshal.dump(#a) # 1000 milliseconds
Marshal.load(#c) # 400 milliseconds
Update:
Since my original question did not receive many responses, I'm assuming there's no solution as easy as I would have hoped.
Presently I'm considering two options:
Create a Sinatra application to store this hash with an API to modify/access it.
Create a C application to do the same as #1, but a lot faster.
The scope of my problem has increased such that the hash may be larger than my original example. So #2 may be necessary. But I have no idea where to start in terms of writing a C application that exposes an appropriate API.
A good walkthrough through how best to implement #1 or #2 may receive best answer credit.
Update 2
I ended up implementing this as a separate application written in Ruby 1.9 that has a DRb interface to communicate with application instances. I use the Daemons gem to spawn DRb instances when the web server starts up. On start up the DRb application loads in the necessary data from the database, and then it communicates with the client to return results and to stay up to date. It's running quite well in production now. Thanks for the help!

A sinatra app will work, but the {un}serializing, and the HTML parsing could impact performance compared to a DRb service.
Here's an example, based on your example in the related question. I'm using a hash instead of an array so you can use user ids as indexes. This way there is no need to keep both a table on interests and a table of user ids on the server. Note that the interest table is "transposed" compared to your example, which is the way you want it anyways, so it can be updated in one call.
# server.rb
require 'drb'
class InterestServer < Hash
include DRbUndumped # don't send the data over!
def closest(cur_user_id)
cur_interests = fetch(cur_user_id)
selected_interests = cur_interests.each_index.select{|i| cur_interests[i]}
scores = map do |user_id, interests|
nb_match = selected_interests.count{|i| interests[i] }
[nb_match, user_id]
end
scores.sort!
end
end
DRb.start_service nil, InterestServer.new
puts DRb.uri
DRb.thread.join
# client.rb
uri = ARGV.shift
require 'drb'
DRb.start_service
interest_server = DRbObject.new nil, uri
USERS_COUNT = 10_000
INTERESTS_COUNT = 500
# Mock users
users = Array.new(USERS_COUNT) { {:id => rand(100000)+100000} }
# Initial send over user interests
users.each do |user|
interest_server[user[:id]] = Array.new(INTERESTS_COUNT) { rand(10) == 0 }
end
# query at will
puts interest_server.closest(users.first[:id]).inspect
# update, say there's a new user:
new_user = {:id => 42}
users << new_user
# This guy is interested in everything!
interest_server[new_user[:id]] = Array.new(INTERESTS_COUNT) { true }
puts interest_server.closest(users.first[:id])[-2,2].inspect
# Will output our first user and this new user which both match perfectly
To run in terminal, start the server and give the output as the argument to the client:
$ ruby server.rb
druby://mal.lan:51630
$ ruby client.rb druby://mal.lan:51630
[[0, 100035], ...]
[[45, 42], [45, 178902]]

Maybe it's too obvious, but if you sacrifice a little access speed to the members of your hash, a traditional database will give you much more constant time access to values. You could start there and then add caching to see if you could get enough speed from it. This will be a little simpler than using Sinatra or some other tool.

be careful with memcache, it has some object size limitations (2mb or so)
One thing to try is to use MongoDB as your storage. It is pretty fast and you can map pretty much any data structure into it.

If it's sensible to wrap your monster hash in a method call, you might simply present it using DRb - start a small daemon that starts a DRb server with the hash as the front object - other processes can make queries of it using what amounts to RPC.
More to the point, is there another approach to your problem? Without knowing what you're trying to do, it's hard to say for sure - but maybe a trie, or a Bloom filter would work? Or even a nicely interfaced bitfield would probably save you a fair amount of space.

Have you considered upping the memcache max object size?
Versions greater than 1.4.2
memcached -I 11m #giving yourself an extra MB in space
or on previous versions changing the value of POWER_BLOCK in the slabs.c and recompiling.

What about storing the data in Memcache instead of storing the Hash in Memcache? Using your code above:
#a = []
0.upto(500) do |r|
#a[r] = []
0.upto(10_000) do |c|
key = "#{r}:#{c}"
if rand(10) == 0
Cache.set(key, 1) # 10% chance of being 1
else
Cache.set(key, 0)
end
end
end
This will be speedy and you won't have to worry about serialization and all of your systems will have access to it. I asked in a comment on the main post about accessing the data, you will have to get creative, but it should be easy to do.

Related

Mongoid identity_map and memory usage, memory leaks

When I executing query
Mymodel.all.each do |model|
# ..do something
end
It uses allot of memory and amount of used memory increases at all the time and at the and it crashes. I found out that to fix it I need to disable identity_map but when I adding to my mongoid.yml file identity_map_enabled: false I am getting error
Invalid configuration option: identity_map_enabled.
Summary:
A invalid configuration option was provided in your mongoid.yml, or a typo is potentially present. The valid configuration options are: :include_root_in_json, :include_type_for_serialization, :preload_models, :raise_not_found_error, :scope_overwrite_exception, :duplicate_fields_exception, :use_activesupport_time_zone, :use_utc.
Resolution:
Remove the invalid option or fix the typo. If you were expecting the option to be there, please consult the following page with repect to Mongoid's configuration:
I am using Rails 4 and Mongoid 4, Mymodel.all.count => 3202400
How can I fix it or maybe some one know other way to reduce amount of memory used during executing query .all.each ..?
Thank you very much for the help!!!!
I started with something just like you by doing loop through millions of record and the memory just keep increasing.
Original code:
#portal.listings.each do |listing|
listing.do_something
end
I've gone through many forum answers and I tried them out.
1st attempt: I try to use the combination of WeakRef and GC.start but no luck, I fail.
2nd attempt: Adding listing = nil to the first attempt, and still fail.
Success Attempt:
#start_date = 10.years.ago
#end_date = 1.day.ago
while #start_date < #end_date
#portal.listings.where(created_at: #start_date..#start_date.next_month).each do |listing|
listing.do_something
end
#start_date = #start_date.next_month
end
Conclusion
All the memory allocated for the record will never be released during
the query request. Therefore, trying with small number of record every
request does the job, and memory is in good condition since it will be
released after each request.
Your problem isn't the identity map, I don't think Mongoid4 even has an identity map built in, hence the configuration error when you try to turn it off. Your problem is that you're using all. When you do this:
Mymodel.all.each
Mongoid will attempt to instantiate every single document in the db.mymodels collection as a Mymodel instance before it starts iterating. You say that you have about 3.2 million documents in the collection, that means that Mongoid will try to create 3.2 million model instances before it tries to iterate. Presumably you don't have enough memory to handle that many objects.
Your Mymodel.all.count works fine because that just sends a simple count call into the database and returns a number, it won't instantiate any models at all.
The solution is to not use all (and preferably forget that it exists). Depending on what "do something" does, you could:
Page through all the models so that you're only working with a reasonable number of them at a time.
Push the logic into the database using mapReduce or the aggregation framework.
Whenever you're working with real data (i.e. something other than a trivially small database), you should push as much work as possible into the database because databases are built to manage and manipulate big piles of data.

Rails - how to cache data for server use, serving multiple users

I have a class method (placed in /app/lib/) which performs some heavy calculations and sub-http requests until a result is received.
The result isn't too dynamic, and requested by multiple users accessing a specific view in the app.
So, I want to schedule a periodic run of the method (using cron and Whenever gem), store the results somewhere in the server using JSON format and, by demand, read the results alone to the view.
How can this be achieved? what would be the correct way of doing that?
What I currently have:
def heavyMethod
response = {}
# some calculations, eventually building the response
File.open(File.expand_path('../../../tmp/cache/tests_queue.json', __FILE__), "w") do |f|
f.write(response.to_json)
end
end
and also a corresponding method to read this file.
I searched but couldn't find an example of achieving this using Rails cache convention (and not some private code that I wrote), on data which isn't related with ActiveRecord.
Thanks!
Your solution should work fine, but using Rails.cache should be cleaner and a bit faster. Rails guides provides enough information about Rails.cache and how to get it to work with memcached, let me summarize how I would use it in your case
Heavy method
def heavyMethod
response = {}
# some calculations, eventually building the response
Rails.cache.write("heavy_method_response", response)
end
Request
response = Rails.cache.fetch("heavy_method_response")
The only problem here is that when ur server starts for the first time, the cache will be empty. Also if/when memcache restarts.
One advantage is that somewhere on the flow, the data u pass in is marshalled into storage, and then unmartialled on the way out. Meaning u can pass in complex datastructures, and dont need to serialize to json manually.
Edit: memcached will clear your item if it runs out of memory. Will be very rare since its using a LRU (i think) algoritm to expire things, and I presume you will use this often.
To prevent this,
set expires_in larger than your cron period,
change your fetch code to call the heavy_method if ur fetch fails (like Rails.cache.fetch("heavy_method_response") {heavy_method}, and change heavy_method to just return the object.
Use something like redis which will not delete items.

How do I ensure correctness when using find_in_batches?

current my application have stat needs and I
make up a background job using rufus-scheduler and runs at 3:00
to batch process these records into CacheStat table. It's just like
any normal application's Weekly/Monthly Stat needs.
And I found out using find_each(say using User.find_each to iterate
all users), which invokes find_in_batches, I checkout the source code
of rails,
while records.any?
records_size = records.size
primary_key_offset = records.last.id
yield records
break if records_size < batch_size
if primary_key_offset
records = relation.where(table[primary_key].gt(primary_key_offset)).to_a
else
raise "Primary key not included in the custom select clause"
end
end
which the implentation is by comparing the primary-key,
my concern is the cocurrency,while I processing the batch,
whatif some records be inserted in-between?
does anybody have this kind of problem?
While I think, this code implementation may be be problemic,
because new records will always have larger PK and later in the
end will be find.
So this is what this kind of needs be implemented? If I want to
implement a batch stat processing by myself(without rails), then I
need to ensure have an integer primary key and using these fields to
compare(better not to use other kind of fields)?
(I was thinking of this because I'm kind of in the middle of switching
from mysql to mongo, so maybe later I need to implement this kind of
functionality by myself).
If I understand correctly, you can ensure correctness here by enforcing transactional isolation, e.g.
User.transaction do
User.find_each do |user|
user
end
end

Rails - given an array of Users - how to get a output of just emails?

I have the following:
#users = User.all
User has several fields including email.
What I would like to be able to do is get a list of all the #users emails.
I tried:
#users.email.all but that errors w undefined
Ideas? Thanks
(by popular demand, posting as a real answer)
What I don't like about fl00r's solution is that it instantiates a new User object per record in the DB; which just doesn't scale. It's great for a table with just 10 emails in it, but once you start getting into the thousands you're going to run into problems, mostly with the memory consumption of Ruby.
One can get around this little problem by using connection.select_values on a model, and a little bit of ARel goodness:
User.connection.select_values(User.select("email").to_sql)
This will give you the straight strings of the email addresses from the database. No faffing about with user objects and will scale better than a straight User.select("email") query, but I wouldn't say it's the "best scale". There's probably better ways to do this that I am not aware of yet.
The point is: a String object will use way less memory than a User object and so you can have more of them. It's also a quicker query and doesn't go the long way about it (running the query, then mapping the values). Oh, and map would also take longer too.
If you're using Rails 2.3...
Then you'll have to construct the SQL manually, I'm sorry to say.
User.connection.select_values("SELECT email FROM users")
Just provides another example of the helpers that Rails 3 provides.
I still find the connection.select_values to be a valid way to go about this, but I recently found a default AR method that's built into Rails that will do this for you: pluck.
In your example, all that you would need to do is run:
User.pluck(:email)
The select_values approach can be faster on extremely large datasets, but that's because it doesn't typecast the returned values. E.g., boolean values will be returned how they are stored in the database (as 1's and 0's) and not as true | false.
The pluck method works with ARel, so you can daisy chain things:
User.order('created_at desc').limit(5).pluck(:email)
User.select(:email).map(&:email)
Just use:
User.select("email")
While I visit SO frequently, I only registered today. Unfortunately that means that I don't have enough of a reputation to leave comments on other people's answers.
Piggybacking on Ryan's answer above, you can extend ActiveRecord::Base to create a method that will allow you to use this throughout your code in a cleaner way.
Create a file in config/initializers (e.g., config/initializers/active_record.rb):
class ActiveRecord::Base
def self.selected_to_array
connection.select_values(self.scoped)
end
end
You can then chain this method at the end of your ARel declarations:
User.select('email').selected_to_array
User.select('email').where('id > ?', 5).limit(4).selected_to_array
Use this to get an array of all the e-mails:
#users.collect { |user| user.email }
# => ["test#example.com", "test2#example.com", ...]
Or a shorthand version:
#users.collect(&:email)
You should avoid using User.all.map(&:email) as it will create a lot of ActiveRecord objects which consume large amounts of memory, a good chunk of which will not be collected by Ruby's garbage collector. It's also CPU intensive.
If you simply want to collect only a few attributes from your database without sacrificing performance, high memory usage and cpu cycles, consider using Valium.
https://github.com/ernie/valium
Here's an example for getting all the emails from all the users in your database.
User.all[:email]
Or only for users that subscribed or whatever.
User.where(:subscribed => true)[:email].each do |email|
puts "Do something with #{email}"
end
Using User.all.map(&:email) is considered bad practice for the reasons mentioned above.

how to save activerecord object of rails to redis

I am using redis as my web cache, and I want to store those activerecord objects to redis directly, but using redis-rb I get an error.
It seems that I can't serialize it or some what. Is there a lib to do this for me?
Am I have to serialize it to json format?
Which serialization format would be the most efficient?
Redis stores strings (and a few other data structures of strings); so you can serialize into Redis values however you like so long as you end up with a string.
JSON is probably the best place to start as it's lean, not overly brittle, works well with live upgrade patterns, and is readable in situ. Later you can add more complexity to meet your goals as needed, e.g., compression. #to_json and #from_json are already on ActiveRecord if you want to use JSON (with YAJL or its ilk that shouldn't be excessively slow, relatively speaking.) #to_xml is also there, if you're into S&M.
Raw marshaling can also work, but occasionally goes horrifically wrong (I've had marshaled objects exceed 2MB after LZO compression that were only a few K in JSON.)
If it's really a bottleneck for you, you'll want to run your own efficiency tests for your goal(s), e.g., write speed, read speed, or storage size, with your own objects and data patterns.
You can convert your model to a hash using attributes method and then save it with mapped_hmset
def redis_set()
redis.mapped_hmset("namespace:modelName:#{self.id}", self.attributes)
end
def redis_get(id)
redis.hgetall("namespace:modelName:#{id}")
end
def self.set(friend_list, player_id)
redis.set("friend_list_#{player_id}", Marshal.dump(friend_list)) == 'OK' ? friend_list : nil
end
def self.get(player_id)
friend_list = redis.get("friend_list_#{player_id}")
Marshal.load(friend_list) if friend_list
end

Resources