I have a Ruby on Rails app that handles around 10,000 requests per minute.
Some of these requests write to a database table. The maximum number of connections to the database is 200.
I would like to know which is more efficient: writing to an array in the cache and saving the data in the background in one operation, or saving each request directly to the database?
Are there any race conditions or performance issues if I write the data to an array in the cache?
Are there any better approaches to optimize performance and avoid a database bottleneck?
Sample Code
#...
def self.add_data_message_to_queue(event_id, chat_item)
  bucket_name = 'BUCKET_GET_CHAT_' + event_id.to_s
  chat_queue = Rails.cache.fetch(bucket_name)
  if chat_queue.blank?
    chat_queue = []
  end
  chat_queue.push(chat_item)
  Rails.cache.write(bucket_name, chat_queue, expires_in: 30.days)
end
Server: Unicorn (High Concurrency)
Thanks in advance
SOLUTION
According to my benchmarks, writing to memcache is far more efficient, although it is then necessary to handle race conditions (see the feedback from the memcachier team below).
Test Saving Chat Messages to DB
Same Test - Not Saving Chat Messages to current DB
Response time is way better. The app can serve more requests per minute as well.
Handling Race Conditions
*( Feedback from memcachier team )
There are, in general, two ways to address this in memcache:
Since you're appending to an array, you could instead use memcache's APPEND and PREPEND operations. They are not supported in Rails.cache, but the underlying library, Dalli, supports these commands. Their semantics are that they append/prepend a string to the existing value. When you fetch the value, you'll get all the strings you "appended" concatenated together, so you'd have to, e.g., separate each element with a semicolon or something like that to break it into an array.
A more general solution (which works for any data-race conditions) is to use the versioning support in memcache. Specifically, each value in memcache is assigned a version, which is returned on any get requests. Set operations in memcache can take an optional CAS (for compare-and-swap) field, such that the operation will succeed only if the version matches the current value stored. Again, I believe Rails.cache doesn't support this, but Dalli does, through the cas method:
cache.add("bucket_get_chat", [])
while(!cache.cas("bucket_get_chat") {|val| val.push(chat_item)}); end
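To make the two strategies above concrete, here is a minimal sketch that runs without a memcache server. `FakeMemcache` is a hypothetical in-memory stand-in that only imitates the ADD / APPEND / CAS semantics described in the feedback; it is not Dalli, and the helper names (`enqueue_with_append`, `read_appended`, `enqueue_with_cas`) are made up for illustration. With a real Dalli client, the append approach would, as far as I know, also require storing the value in raw (unserialized) form.

```ruby
require "json"

# Hypothetical in-memory stand-in for a memcache client. It only imitates the
# ADD / APPEND / CAS semantics described above; it is not Dalli.
class FakeMemcache
  def initialize
    @entries = {} # key => [value, version]
    @lock = Mutex.new
  end

  # ADD: store the value only if the key is absent.
  def add(key, value)
    @lock.synchronize do
      next false if @entries.key?(key)
      @entries[key] = [value.dup, 0]
      true
    end
  end

  def get(key)
    @lock.synchronize { @entries.key?(key) ? @entries[key][0].dup : nil }
  end

  # APPEND: atomically concatenate a string onto the stored string.
  def append(key, string)
    @lock.synchronize do
      next false unless @entries.key?(key)
      value, version = @entries[key]
      @entries[key] = [value + string, version + 1]
      true
    end
  end

  # CAS: read value and version, compute the new value outside the lock, and
  # write it back only if no other writer bumped the version in between.
  def cas(key)
    value, version = @lock.synchronize { @entries[key] }
    return nil if version.nil? # key missing
    new_value = yield(value.dup)
    @lock.synchronize do
      next false unless @entries[key][1] == version
      @entries[key] = [new_value, version + 1]
      true
    end
  end
end

# Strategy 1: append one JSON document per line, split on read.
def enqueue_with_append(cache, key, item)
  cache.add(key, "") # no-op if the key already exists
  cache.append(key, JSON.generate(item) + "\n")
end

def read_appended(cache, key)
  (cache.get(key) || "").each_line.map { |line| JSON.parse(line) }
end

# Strategy 2: retry a compare-and-swap until this writer wins.
def enqueue_with_cas(cache, key, item)
  cache.add(key, [])
  Thread.pass until cache.cas(key) { |queue| queue + [item] }
  true
end
```

The append variant never re-reads the queue, so it stays cheap under contention; the CAS variant works for any value type but may retry several times when many workers write to the same key.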
Related
According to streaming example at http://orientdb.com/docs/3.0.x/java/Java-Query-API.html, we can use the Orient result set streaming API as follows
ODatabaseDocument db;
...
String statement = "SELECT FROM V WHERE name = ? and surname = ?";
OResultSet rs = db.query(statement, "John", "Smith");
rs.stream().forEach(x -> System.out.println(x.getProperty("age")));
rs.close();
This is fine but too trivial - what if we need to keep the rs/stream around? We can't very well close the resultset because we want to reuse the stream on a subsequent user request in a web application, say (in scenarios such as paging).
But to keep the streams "alive" the Orient user guide says that:
OResultSet is implemented as a paginated structure, that holds some
iterators open during the iteration. This is true both in remote and
in embedded usage.
You should always invoke OResultSet.close() at the end of the
execution, to free resources.
OResultSet instances are automatically closed when you close the
ODatabase that returned them.
It is important to always close result sets, even when they are
converted to streams (after the stream is consumed).
Are there any best practices around this? As far as I can tell, we would need to:
1) Keep the Orient database connection open until the user "paging" session is done (which could be say 5-10 minutes). Only when the user says "done" can we close the result set & close the database connection. The Orient database connection (and whatever stream it generated) thus becomes "private" to a single application user. Moreover, since every user request can be activated on a different thread, the said database connection would need to be made active on the current thread before using it.
2) Use the Java Stream API to navigate through arbitrary subsets of the "arbitrarily" large resultset. How would memory usage be handled by the underlying Orient db stream implementation? What determines the memory usage for using a "single rs/stream" and keeping it around for a while? What happens when we have thousands of open rs/streams especially if each user has their own "private" rs/stream they're looking at?
3) If a given Orient database connection can only be used on a single thread at a time (an Orient requirement), how do we handle multiple users with their own custom long-lived rs/streams/connections? Does this mean that if we have 1000 clients using their own private rs/stream (that they hang on to for, say, 5 minutes), then we have to keep 1000 database connections open (i.e. one for each user/rs)? What are the limits around this? This style is obviously quite different from the more typical execute query/close rs pattern for quick user interaction that is stateless from one request to the next (naive paging that re-executes queries every time for a given range, which can get expensive).
P.S. I realize that once we get a Java stream, then we pretty much start just using the Java API itself - so I suppose that JOOQ streaming usage (for example) would be pretty similar to Orient streaming usage once you start getting into the Stream interfaces - I'm not familiar with the Java Streams API, but I suppose How to paginate a list of objects in Java 8? is a good place to start?
My conclusion is that streaming works well when scrolling through a large result set without consuming a large amount of memory or having to keep re-executing offset/limit queries (similar to forward only scrolling over JDBC resultsets). A typical use case is an export scenario.
For forward and backward paging, in Orient at least, you likely need an indexed property/properties and perform range queries - you'll need to make sure the index is SB-tree so that it supports range queries.
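The range-query style of paging can be sketched independently of OrientDB. The following Ruby snippet is only an illustration under stated assumptions: `ROWS` is a hypothetical in-memory stand-in for rows sorted by an indexed `id` property, and `page_after` / `page_before` are made-up helper names. A binary search plays the role of the index range scan.

```ruby
# Hypothetical in-memory stand-in for an indexed property: rows sorted by :id.
ROWS = (1..20).map { |id| { id: id, name: "row-#{id}" } }.freeze

# Forward page: the first `limit` rows with id > after_id.
def page_after(rows, after_id, limit)
  start = rows.bsearch_index { |row| row[:id] > after_id } || rows.length
  rows[start, limit]
end

# Backward page: the last `limit` rows with id < before_id, in ascending order.
def page_before(rows, before_id, limit)
  stop = rows.bsearch_index { |row| row[:id] >= before_id } || rows.length
  first = [stop - limit, 0].max
  rows[first...stop]
end
```

The client only has to remember the first and last `id` of the page it is showing; no result set or connection needs to stay open between requests.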
FYI, Solr has a cursor mechanism which works pretty well for forward pagination on sorted results - but if you keep some simple state markers on the client you can also go back to results already encountered. "go to" random pages is not supported in Solr cursors but you can always re-sort/filter on some other criteria in order to move "useful" results to the top of the resultset instead of deep paging (https://lucene.apache.org/solr/guide/6_6/pagination-of-results.html)
Running PostgreSQL 9, Ruby 2.4 and Rails 5.x.
Memory-wise, which code would be better?
object_with_huge_texts.each do |x|
  MyModel.create(text_col: x.huge_text)
end
versus
values = []
object_with_huge_texts.each do |x|
  # Quote each value so the text is escaped and wrapped in single quotes.
  values.push("(#{ActiveRecord::Base.connection.quote(x.huge_text)})")
end
ActiveRecord::Base.connection.execute(
  "INSERT INTO my_model (text_col) VALUES #{values.join(",")}"
)
I understand that the second option will be one SQL query instead of n.
But will the giant values array cause too big a memory bloat?
It depends on how "huge" this data is. I've used servers with >1TB of memory and even a thrifty $5/mo. VPS still has >1GB in most cases, so it's all relative.
The first version benefits from garbage collection: once each model has been created, Ruby can discard its data. There is additional overhead for the model object itself, though.
The second version requires composing a potentially huge SQL string and sending it all at once. This could be problematic for two reasons: your Ruby memory footprint might be too large, or your database might reject the query as being too big. Postgres has no configurable "max query size", but a single query string is in practice limited to roughly 1GB.
If you're doing bulk loads on a regular basis and need it to be efficient you could try using a prepared statement with a placeholder value, then supply different values when executing. This scales quite well and performance is usually comparable to multi-insert style operations so long as there's not a lot of index pressure on the data.
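A minimal sketch of the prepared-statement idea, assuming a connection object that responds to `prepare` and `exec_prepared` the way `PG::Connection` from the pg gem does. `insert_texts`, the statement name, and `RecordingConnection` are hypothetical names; the latter is only a stand-in so the sketch runs without a database.

```ruby
# Insert each text through one prepared statement. `conn` is assumed to
# respond to prepare/exec_prepared like PG::Connection does.
def insert_texts(conn, texts, statement_name: "insert_my_model_text")
  conn.prepare(statement_name, "INSERT INTO my_model (text_col) VALUES ($1)")
  count = 0
  texts.each do |text|
    # Only one value is sent per call, so no giant SQL string is ever built.
    conn.exec_prepared(statement_name, [text])
    count += 1
  end
  count
end

# Hypothetical recording stand-in so the sketch runs without a database.
class RecordingConnection
  attr_reader :prepared, :executions

  def initialize
    @prepared = {}
    @executions = []
  end

  def prepare(name, sql)
    @prepared[name] = sql
  end

  def exec_prepared(name, params)
    raise ArgumentError, "unknown statement #{name}" unless @prepared.key?(name)
    @executions << [name, params]
  end
end
```

Because `texts` can be any enumerable, the caller can pass a lazy source and keep only one huge text in memory at a time.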
Content addressable storage systems use the hash of the stored data as the identifier and the address. Collisions are incredibly rare, but if the system is used a lot for a long time, it might happen. What happens if there are two pieces of data that produce the same hash? Is it inevitable that the most recently stored one wins and data is lost, or is it possible to devise ways to store both and allow accessing both?
To keep the question narrow, I'd like to focus on Camlistore. What happens if permanodes collide?
It is assumed that collisions do not happen, which is a perfectly reasonable assumption given a strong hash function and casual, non-malicious user input. SHA-1, which is what Camlistore currently uses, was also considered resistant to malicious attempts to produce a collision when this was written, although a practical SHA-1 collision has since been demonstrated (in 2017).
In case a hash function becomes weak with time and needs to be retired, Camlistore supports a migration to a new hash function for new blobrefs, while keeping old blob refs accessible.
If a collision did happen, as far as I understand, the first stored blobref with that hash would win.
source: https://groups.google.com/forum/#!topic/camlistore/wUOnH61rkCE
In an ideal collision-resistant system, when a new file / object is ingested:
1. A hash is computed of the incoming item.
2. If the incoming hash does not already exist in the store:
   - The item data is saved and associated with the hash as its identifier.
3. If the incoming hash does match an existing hash in the store:
   - The existing data is retrieved.
   - A bit-by-bit comparison of the existing data is performed with the new data.
   - If the two copies are found to be identical, the new entry is linked to the existing hash.
   - If the two copies are not identical, the new data is either
     - rejected, or
     - appended or prefixed* with additional data (e.g. a timestamp or userid) and re-hashed; this entire process is then repeated.
So no, it's not inevitable that information is lost in a content-addressable storage system.
* Ideally, the existing stored data would then be re-hashed in the same way, and the original hash entry tagged somehow (e.g. linked to a zero-byte payload) to notate that there were multiple stored objects that originally resolved to that hash (similar in concept to a 'Disambiguation page' on Wikipedia). Whether that is necessary depends on how data needs to be retrieved from the system.
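The ingest procedure above can be sketched in a few lines of Ruby. This is an illustration under assumptions, not how Camlistore works: `ContentStore` is a hypothetical class, the hash function is injectable so a deliberately weak one can force collisions, and a simple counter prefix stands in for the "timestamp or userid" mentioned above.

```ruby
require "digest"

# Hypothetical content-addressable store that refuses to lose colliding data.
class ContentStore
  def initialize(hasher: ->(bytes) { Digest::SHA256.hexdigest(bytes) })
    @hasher = hasher
    @blobs = {} # key => stored data
  end

  # Stores data and returns the key it ended up under.
  def put(data)
    salt = 0
    loop do
      # First attempt hashes the data itself; later attempts add a prefix.
      key = @hasher.call(salt.zero? ? data : "#{salt}:#{data}")
      existing = @blobs[key]
      if existing.nil?
        @blobs[key] = data.dup # unseen hash: save under this key
        return key
      elsif existing == data
        return key # same content: link to the existing entry
      end
      salt += 1 # true collision: re-hash with different extra data
    end
  end

  def get(key)
    @blobs[key]
  end
end
```

For example, with the weak hasher `->(bytes) { bytes.bytesize.to_s }`, storing "aa" and then "bb" keeps both values under different keys instead of overwriting the first.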
While intentionally causing a collision may be astronomically impractical for a given algorithm, a random collision is possible as soon as the second storage transaction.
Note: Some small / non-critical systems skip the binary comparison step, trading risk for bandwidth or processing time. (Usually, this is only done if certain metadata matches, such as filename or data length.)
The risk profile of such a system (e.g. a single git repository) is far different from that of an enterprise / cloud-scale environment that ingests large amounts of binary data, especially if that data is apparently random binary data (e.g. encrypted / compressed files) combined with something like sliding-window deduplication.
See also, e.g.:
https://stackoverflow.com/a/2437377/5711986
Use a composite key, e.g. hash + userId.
I'm trying to perform a daily operation on a larger than normal dataset (2m+ records). However, Rails seems to take a very long time performing operations on such a dataset. Operations like
Dataset.all.each do |data|
  ...
end
take a very long time to complete (I assume this is because it can't fit all the items into memory at once, right?).
Does anyone have any strategies on how I could handle this situation? I know SQL would probably speed up the process, but I'm looking to use the Rails environment as I can do many more complicated things to the data than I can with just SQL statements.
You want to use ActiveRecord's find_each for this.
Dataset.find_each do |data|
  ...
end
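find_each helps because it loads records in batches (1000 per batch by default, ordered by primary key) rather than instantiating every row at once. The plain-Ruby sketch below imitates that idea without Rails; `FakeTable`, `fetch_batch`, and `each_in_batches` are hypothetical names, and ids are assumed to be positive integers.

```ruby
# Hypothetical stand-in for a table: returns at most `limit` rows with
# id > after_id, ordered by id, and counts how many queries were issued.
class FakeTable
  attr_reader :queries

  def initialize(rows)
    @rows = rows.sort_by { |row| row[:id] }
    @queries = 0
  end

  def fetch_batch(after_id, limit)
    @queries += 1
    @rows.select { |row| row[:id] > after_id }.first(limit)
  end
end

# Yields every row while holding only one batch in memory at a time.
def each_in_batches(table, batch_size: 1000)
  last_id = 0 # assumes ids are positive
  loop do
    batch = table.fetch_batch(last_id, batch_size)
    break if batch.empty?
    batch.each { |row| yield row }
    last_id = batch.last[:id]
  end
end
```

With the real API, the batch size can be tuned in the same spirit, e.g. `Dataset.find_each(batch_size: 500)`.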
When processing a large set of rows, a database is very fast and efficient; it's what databases were designed for. I would recommend attempting to do all this processing in SQL if you want maximum performance. If you prefer to use Rails, or it is impossible to do everything you want in SQL, you might attempt to do some pre-processing in SQL and the remainder in Rails. Short of that, 2m+ rows is a lot to loop over: even if each row only takes a fraction of a second, it adds up to a long time.
I am a newbie working on a simple Rails app that translates a document (a long string) from one language to another. The dictionary is a table of terms (a string regexp to find and substitute, and a block that outputs a substituting string). The table is 1 million records long.
Each request is a document that wants to be translated. In a first brutish force approach I need to run the whole dictionary against each request/document.
Since the dictionary will run whole every time (from the first record to the last), instead of loading the table of records of the dictionary with each document, I think the best would be to have the whole dictionary as an array in memory.
I know it is not the most efficient, but the dictionary has to run whole at this point.
1.- If no efficiency can be gained by restructuring the document and dictionary (meaning it is not possible to create smaller subsets of the dictionary). What is the best design approach?
2.- Do you know of similar projects that I can learn from?
3.- Where should I look to learn how to load such a big table into memory (cache?) at rails startup?
Any answer to any of the posed questions will be greatly appreciated. Thank you very much!
I don't think your web hoster will be happy with a solution like this. This script
dict = {}
(0..1000_000).each do | num |
  dict[/#{num}/] = "#{num}_subst"
end
consumes a gigabyte of RAM on my MBP for storing the hash table. Another approach will be to store your substitutions marshaled in memcached so that you could (at least) store them across machines.
require 'rubygems'
require 'memcached'
@table = Memcached.new("localhost:11211")
retained_keys = (0..1000_000).each do | num |
  stored_blob = Marshal.dump([/#{num}/, "#{num}_subst"])
  @table.set("p#{num}", stored_blob)
end
You will have to worry about keeping the keys "hot" since memcached will expire them if they are not needed.
The best approach however, for your case, would be very simple - write your substitutions to a file (one line per substitution) and make a stream-filter that reads the file line by line, and replaces from this file. You can also parallelize that by mapping work on this, say, per letter of substitution and replacing markers.
But this should get you started:
require "base64"
File.open("./dict.marshal", "wb") do | file |
  (0..1000_000).each do | num |
    stored_blob = Base64.encode64(Marshal.dump([/#{num}/, "#{num}_subst"]))
    file.puts(stored_blob)
  end
end
puts "Table populated (should be a 35 meg file), now let's run substitutions"
File.open("./dict.marshal", "r") do | f |
  until f.eof?
    pattern, replacement = Marshal.load(Base64.decode64(f.gets))
  end
end
puts "All replacements out"
To populate the file AND load each substitution, this takes me:
real 0m21.262s
user 0m19.100s
sys 0m0.502s
To just load the regexp and the string from file (all the million, piece by piece)
real 0m7.855s
user 0m7.645s
sys 0m0.105s
So this is 7 seconds IO overhead, but you don't lose any memory (and there is huge room for improvement) - the RSIZE is about 3 megs. You should easily be able to make it go faster if you do IO in bulk, or make one file for 10-50 substitutions and load them as a whole. Put the files on an SSD or a RAID and you got a winner, but you get to keep your RAM.
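The script above only loads each pattern; it never applies it. A minimal sketch of the stream-filter idea that actually performs the substitutions might look like this. The tab-separated line format and the `translate` helper are assumptions made for the example, not the Marshal/Base64 format used above.

```ruby
require "tempfile"

# Applies every substitution in dict_path to text, reading one rule per line.
# Assumed (hypothetical) line format: "<regexp source>\t<replacement>".
def translate(text, dict_path)
  result = text.dup
  File.foreach(dict_path, chomp: true) do |line|
    source, replacement = line.split("\t", 2)
    next if source.nil? || replacement.nil?
    # Block form so the replacement is inserted literally, without
    # backreference expansion.
    result.gsub!(Regexp.new(source)) { replacement }
  end
  result
end

# Usage: write a tiny two-rule dictionary and run a document through it.
Tempfile.create("dict") do |file|
  file.write("\\bhello\\b\thola\n\\bworld\\b\tmundo\n")
  file.flush
  puts translate("hello world, hello!", file.path)
end
```

Only one rule is held in memory at a time, so memory use stays flat no matter how many lines the dictionary file has.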
In production mode, Rails will not reload classes between requests. You can keep something in memory easily by putting it into a class variable.
You could do something like:
class Dictionary < ActiveRecord::Base
  @@cached = nil
  mattr_accessor :cached

  def self.cache_dict!
    @@cached = Dictionary.all
  end
end
And then in production.rb:
Dictionary.cache_dict!
For your specific questions:
1. Possibly write the part that's inefficient in C, Java, or another faster language.
2. Nope, sorry. Maybe you could use a MapReduce algorithm to distribute the load across servers.
3. See above.
This isn't so much a specific answer to one of your questions as a process recommendation. If you're having (or anticipating) performance issues, you should be using a profiler from the get-go.
Check out this tutorial: How to Profile Your Rails Application.
My experience on a number of platforms (ANSI C, C#, Ruby) is that performance problems are very hard to deal with in advance; rather, you're better off implementing something that looks like it might be performant then load-testing it through a profiler.
Then, once you know where your time is being spent, you can expend some effort on optimisation.
If I had to take a guess, I'd say the regex work you'll be performing will be as much of a performance bottleneck as any ActiveRecord work. But without verifying that with a profiler, that guess is of little value.
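Before reaching for a full profiler, a rough first measurement of the regex work can be taken with Ruby's standard Benchmark module. The workload below is hypothetical (`apply_rules` and the generated terms are invented for the example); it only shows the shape of such a measurement.

```ruby
require "benchmark"

# Hypothetical workload: apply `rule_count` regexp substitutions to a document.
def apply_rules(document, rule_count)
  result = document.dup
  rule_count.times { |n| result.gsub!(/\bterm#{n}\b/) { "subst#{n}" } }
  result
end

document = (0...200).map { |n| "term#{n}" }.join(" ")
elapsed = Benchmark.realtime { apply_rules(document, 200) }
puts format("200 substitutions took %.4f seconds", elapsed)
```

Scaling `rule_count` up and watching how the elapsed time grows gives an early hint of whether the regex pass or the database access dominates.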
If you use something like cache_fu, you can then leverage something like memcache without doing any work yourself. If you are trying to bring 1 MM rows into memory, being able to leverage the distributed nature of memcache will probably be useful.