Sharing an large array with all users on a rails app - ruby-on-rails

I have inherited an app that generates a large array for every user that visit the app. I recently discovered that it is identical for nearly all the users!!
Now I want to somehow make one copy of it so it is not built over and over again. I have thought of a few options and wanted input to see which one is the best:
1) Create a model and shove the data into the database
2) Create a YAML file and have the app load it when it initializes.
I personally like the model idea but a few engineers at work feel as though it does not deserve to be a full model. 97% of the times users will see the same exact thing but 3% of the time users will get a slightly different array (a few elements will have changed).
Any other approaches that I should consider.??..thanks in advance.

Remember that if you store the data in the DB, each request which requires the data will have to execute a DB query to pull it out. If you are running multiple server threads, each thread could have its own copy in memory (if they are all handling requests which require the use of the array). In that case, you wouldn't be saving any memory (though you might save time from not having to regenerate the array).
If you are running multiple server processes (not threads), and if the array contents change as the application is running, and the changes have to be visible to all the processes, caching in memory won't work. You will have to use the DB in that case.
From the information in your comment, I suggest you try something like this:
Store the array in your DB, and make sure that the record(s) used have created/updated timestamps. Cache the contents in memory using a constant/global variable/class variable. Also store the last time the cache was updated.
Every time you need to use the array, retrieve the relevant "updated" timestamp from the DB. (You may need to use hand-coded SQL and ModelName.connection.execute to avoid pulling back all the data in the record, which ActiveRecord will probably do.) If the timestamp is later than the last time your cache was updated, pull the array from the DB and update your cache.
Use a Mutex ('require thread') when retrieving/updating the cached data, in case your server setup may use multiple threads. (I don't think that Passenger does, but I have had problems similar to threading problems when using Passenger+RMagick, so I would still use a Mutex to be safe.)
Wrap all the code which deals with the cached array in a library class (or a class method on the model used to store the data), so the details of cache management don't spill over into the rest of the application.
Do a little bit of performance testing on the cache setup using Benchmark.measure {}. If a bug in the setup actually made performance worse rather than better, that would be sad...

I'd go with option 2. You can add two constants (for the 97% and 3%) that load from a YAML file when the app initializes. That ought to shrink your memory footprint considerably.
Having said that, yikes, this is just a band-aid on a hack, but you knew that already. I'd consider putting some time into a redesign, if you have that luxury.

Related

Core data migration without reading all of the data into memory?

We have a core data DB (sqlite store) which, for some users, is about 100-150 MB. I wouldn't think that would be too big for a storage system to deal with (even on a mobile device), but we've found that with that size core data DB, ANY lightweight migration takes ~10+ seconds to complete. Even something as simple as adding a completely new independent entity (not related to any other entity). With raw sqlite this would be a create table statement. So, my question is whether anyone else has seen this and, if so, have they found a workaround to make such simple migrations faster? Specifically, I'm looking for a way to handle adding a new independent entity to an existing core data DB that's ~100-150 MB and have it be quick (i.e., under 5 seconds).
I believe that core data migrations ALWAYS have to read all of the data from the source and write it all to a destination for a migration (which is terrible BTW), but I'm hoping someone can prove me wrong. :) I couldn't find any way to do it with a custom migration either.
I've considered munging the DB with sql directly to basically make the model look like what CoreData would expect (I've done stuff like this to manually "downgrade" a core data DB for debugging purposes), but we really want to avoid doing something like that in production.
UPDATE:
For reference, this is the current approach we are taking. This is not a generic solution, but will work for our use case. Unless I get a better answer I'll add this as an answer at some point in the future and accept it.
We're going to deal with this by essentially making the DB smaller. There are 2 out of 15 entities that take up the bulk of the space in the DB (~95%). We're going to create completely separate data models each with one of those entities. This is done without changing the main model at all (hence, no core data migration). We'll then make a task that runs with background priority in GCD that, if any of those entities are found in the main DB, they're moved to the appropriate separate DB and removed from the main DB. This is done in batches with some sleeps between batches so it's less resource intensive and doesn't affect normal app operation. We'll modify the code that accesses those entities to try to get them from the new DB and fall back to the main DB if they're not in there.
In a future update after we find that all, or at least most, of our users have updated their data in the new DBs we'll drop those entities from the main DB.
This leaves us with a small main DB that can have migrations applied quickly and two large DBs that have migrations done more slowly. Those large DBs, in our case, should change less often (maybe never?) and even if they do change there are limited places in the app that require them so we can work around it in the UI (e.g., report some feature as unavailable until we move data).
A 10-20 second delay for an update to a huge dataset seems perfectly reasonable to me. Just don't do it on the main thread.
This means you'll have to modify the boilerplate Core Data stack setup that you get in the usual Xcode templates. Instead of always setting up the stack on the main thread at launch time, check to see if migration is needed. If so, put up appropriate UI, do the migration in a background thread, and be ready to invoke beginBackgroundTaskWithExpirationHandler: if needed.

How does memcache and rails work at max limits?

I am trying to understand how memcache works when (if) you fill up the allocated memory buffer. In particular I want to understand the lifecycle of a key value pair in cache. I am talking about low level cache operations in rails where I am directly creating the key/value pairs. e.g. commands like
Rails.cache.write key, cached_data
Rails.cache.fetch key
Assume for the sake of argument I have an infinite loop that was just generating random UUIDs as keys and storing random data. What happens when the cache fills up? Do older items just get bumped off or is there some specific algorithm behind the scenes that handles this eventuality?
I have read elsewhere "Cache Invalidation is a Hard Problem".
Just trying to understand how it actually works.
Maybe some simple code examples that illustrate the best way to create and destroy cached data? Do you have to explicitly define when entries should expire?
MemcacheD handles this behind the scenes. Check out this question -
Memcache and expired items
You can define expiration parameters, check out this wiki page -
http://code.google.com/p/memcached/wiki/NewProgramming#Cache_Invalidation
For cache invalidation specific to you application logic (and not just exhaustion of memory behind the scenes), the delete function will simply remove the data. As far when to delete cached data in your app, thats harder to say - hence the quote you referenced about cache invalidation being hard. I might suggest you start by thinking about ActiveRecord callbacks like after_commit - http://api.rubyonrails.org/classes/ActiveRecord/Callbacks.html, to let you easily invalidate cached data whenever your database changes.
But this is only a suggestion, there are many different cache invalidation schemes out there.

safe and performing way to save site wide variable

I am building a rails app, have a site wide counter variable (counter) that is used (read and write) by different pages, basically many pages can cause the counter increment, what is a multi-thread-safe way to store this variable so that
1) it is thread-safe, I may have many concurrent user access might read and write this variable
2) high performing, I originally thought about persist this variable in DB, but wondering is there a better way given there can be high volume of request and I don't want to make this DB query the bottleneck of my app...
suggestions?
It depends whether it has to be perfectly accurate. Assuming not, you can store it in memcached and sync to the database occasionally. If memcached crashes, it expires (shouldn't happen if configured properly), you have to shutdown, etc., reload it from the database on startup.
You could also look at membase. I haven't tried it, but to simplify, it's a distributed memcached server that automatically persists to disk.
For better performance and accuracy, you could look at a sharded approach.
Well you need persistence, so you have to store it in the Database/Session/File, AFAIK.

How to build cached stats in database without taking down site?

I'm working on a Ruby on Rails site.
In order to improve performance, I'd like to build up some caches of various stats so that in the future when displaying them, I only have to display the caches instead of pulling all database records to calculate those stats.
Example:
A model Users has_many Comments. I'd like to store into a user cache model how many comments they have. That way when I need to display the number of comments a user has made, it's only a simple query of the stats model. Every time a new comment is created or destroyed, it simply increments or decrements the counter.
How can I build these stats while the site is live? What I'm concerned about is that after I request the database to count the number of Comments a User has, but before it is able to execute the command to save it into stats, that user might sneak in and add another comment somewhere. This would increment the counter, but then by immediately overwritten by the other thread, resulting in incorrect stats being saved.
I'm familiar with the ActiveRecord transactions blocks, but as I understand it, those are to guarantee that all or none succeed as a whole, rather than to act as mutex protection for data on the database.
Is it basically necessary to take down the site for changes like these?
Your use case is already handled by rails. It's called counter cache. There is a rails cast here: http://railscasts.com/episodes/23-counter-cache-column
Since it is so old, it might be out of date. The general idea is there though.
It's generally not a best practice to co-mingle application and reporting logic. Send your reporting data outside the application, either to another database, to log files that are read by daemons, or to some other API that handle the storage particulars.
If all that sounds like too much work then, you don't really want real time reporting. Assuming you have a backup of some sort (hot or cold) run the aggregations and generate the reports on the backup. That way it doesn't affect running application and you data shouldn't be more than 24 hours stale.
FYI, I think I found the solution here:
http://guides.ruby.tw/rails3/active_record_querying.html#5
What I'm looking for is called pessimistic locking, and is addressed in 2.10.2.

Storing Data In Memory: Session vs Cache vs Static

A bit of backstory: I am working on an web application that requires quite a bit of time to prep / crunch data before giving it to the user to edit / manipulate. The data request task ~ 15 / 20 secs to complete and a couple secs to process. Once there, the user can manipulate vaules on the fly. Any manipulation of values will require the data to be reprocessed completely.
Update: To avoid confusion, I am only making the data call 1 time (the 15 sec hit) and then wanting to keep the results in memory so that I will not have to call it again until the user is 100% done working with it. So, the first pull will take a while, but, using Ajax, I am going to hit the in-memory data to constantly update and keep the response time to around 2 secs or so (I hope).
In order to make this efficient, I am moving the intial data into memory and using Ajax calls back to the server so that I can reduce processing time to handle the recalculation that occurs w/ this user's updates.
Here is my question, with performance in mind, what would be the best way to storing this data, assuming that only 1 user will be working w/ this data at any given moment.
Also, the user could potentially be working in this process for a few hours. When the user is working w/ the data, I will need some kind of failsafe to save the user's current data (either in a db or in a serialized binary file) should their session be interrupted in some way. In other words, I will need a solution that has an appropriate hook to allow me to dump out the memory object's data in the case that the user gets disconnected / distracted for too long.
So far, here are my musings:
Session State - Pros: Locked to one user. Has the Session End event which will meet my failsafe requirements. Cons: Slowest perf of the my current options. The Session End event is sometimes tricky to ensure it fires properly.
Caching - Pros: Good Perf. Has access to dependencies which could be a bonus later down the line but not really useful in current scope. Cons: No easy failsafe step other than a write based on time intervals. Global in scope - will have to ensure that users do not collide w/ each other's work.
Static - Pros: Best Perf. Easies to maintain as I can directly leverage my current class structures. Cons: No easy failsafe step other than a write based on time intervals. Global in scope - will have to ensure that users do not collide w/ each other's work.
Does anyone have any suggestions / comments on what I option I should choose?
Thanks!
Update: Forgot to mention, I am using VB.Net, Asp.Net, and Sql Server 2005 to perform this task.
I'll vote for secret option #4: use the database for this. If you're talking about a 20+ second turnaround time on the data, you are not going to gain anything by trying to do this in-memory, given the limitations of the options you presented. You might as well set this up in the database (give it a table of its own, or even a separate database if the requirements are that large).
I'd go with the caching method of for storing the data across any page loads. You can name the cache you want to store the data in to avoid conflicts.
For tracking user-made changes, I'd go with a more old-school approach: append to a text file each time the user makes a change and then sweep that file at intervals to save changes back to DB. If you name the files based on the user/account or some other session-unique indicator then there's no issue with conflict and the app (or some other support app, which might be a better idea in general) can sweep through all such files and update the DB even if the session is over.
The first part of this can be adjusted to stagger the write out more: save changes to Session, then write that to file at intervals, then sweep the file at larger intervals. you can tune it to performance and choose what level of possible user-change loss will be possible.
Use the Session, but don't rely on it.
Simply, let the user "name" the dataset, and make a point of actively persisting it for the user, either automatically, or through something as simple as a "save" button.
You can not rely on the session simply because it is (typically) tied to the users browser instance. If they accidentally close the browser (click the X button, their PC crashes, etc.), then they lose all of their work. Which would be nasty.
Once the user has that kind of control over the "persistent" state of the data, you can rely on the Session to keep it in memory and leverage that as a cache.
I think you've pretty much just answered your question with the pros/cons. But if you are looking for some peer validation, my vote is for the Session. Although the performance is slower (do you know by how much slower?), your processing is going to take a long time regardless. Do you think the user will know the difference between 15 seconds and 17 seconds? Both are "forever" in web terms, so go with the one that seems easiest to implement.
perhaps a bit off topic. I'd recommend putting those long processing calls in asynchronous (not to be confused with AJAX's asynchronous) pages.
Take a look at this article and ping me back if it doesn't make sense.
http://msdn.microsoft.com/en-us/magazine/cc163725.aspx
I suggest to create a copy of the data in a new database table (let's call it EDIT) as you send the initial results to the user. If performance is an issue, do this in a background thread.
As the user edits the data, update the table (also in a background thread if performance becomes an issue). If you have to use threads, you must make sure that the first thread is finished before you start updating the rows.
This allows a user to walk away, come back, even restart the browser and commit whenever she feels satisfied with the result.
One possible alternative to what the others mentioned, is to store the data on the client.
Assuming the dataset is not too large, and the code that manipulates it can be handled client side. You could store the data as an XML data island or JSON object. This data could then be manipulated/processed and handled all client side with no round trips to the server. If you need to persist this data back to the server the end resulting data could be posted via an AJAX or standard postback.
If this does not work with your requirements I'd go with just storing it on the SQL server as the other comment suggested.

Resources