I'm using paper_trail to keep track of changes. Until now I kept all versions forever, which made it easy to determine the revision number of an object by simply counting its versions (e.g. Object.first.versions.length).
Now the versions table is getting really big and I don't need every version, so I decided to drop old ones automatically and only keep the last n versions. Unfortunately, this breaks the monotonic increase of revision numbers.
What's the best way to keep a separate counter for the number of changes to an object?
Does it make sense to add an additional revisions field to the database that uses a counter_cache which is increased every time an object is updated or are there better ways to do that?
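Concretely, something like this sketch is what I have in mind (the model and column names are placeholders):

# assumed migration: add_column :widgets, :revision, :integer, default: 0, null: false
class Widget < ActiveRecord::Base
  has_paper_trail

  # bump the revision as part of the same UPDATE that persists the change
  before_update { |record| record.revision += 1 }
end

One concern is that concurrent updates could race on the read-modify-write of revision unless it's paired with locking or an atomic SQL increment.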
I'm working with a real-time editor (https://beefree.io/) in my Rails app. Multiple users can edit a document and it autosaves every 10 seconds, returning a JSON object representing the document along with a version number that increments with every change. The JSON object is being saved to a table associated with a Document model.
I'm curious whether optimistic locking could be used to prevent users' old changes from overwriting newer changes when a save request doesn't complete in time. The idea would be to take the version number that the editor provides and use it as the lock_version column's value. Can I pass an arbitrary value to lock_version like that? Or is the database meant to increment the value itself?
Another issue I have is that I'm saving to a table that has other columns that I don't want to be locked by this lock_version attribute. Can I specifically lock the real-time data column?
I'm curious whether optimistic locking could be used to prevent users' old changes from overwriting newer changes when a save request doesn't complete in time. The idea would be to take the version number that the editor provides and use it as the lock_version column's value. Can I pass an arbitrary value to lock_version like that? Or is the database meant to increment the value itself?
With Rails' built-in optimistic locking, Active Record (and not your code) is supposed to be responsible for incrementing lock_version. You might be able to pass in your own lock_version with an update, but Rails will still auto-increment it with any other updates to the model. But, given the second part of your question...
Another issue I have is that I'm saving to a table that has other columns that I don't want to be locked by this lock_version attribute. Can I specifically lock the real-time data column?
Locking only on updates to certain columns is not a feature Rails currently supports, though there might be a gem or monkey-patch out there that accomplishes it. However, given this as well as your need for a custom version number, it would probably be easiest to implement optimistic locking yourself for just this update:
# last_version is the version the editor thinks was last saved
# new_version is the new version provided by the editor
# the update only succeeds if the document still has version == last_version;
# update_all returns the number of rows affected, so 0 means the save was stale
rows_updated = Document.where(id: document.id, version: last_version)
                       .update_all(version: new_version, text: new_text)
I do want to point out that, while this will prevent out-of-order changes, collaborative editing (two or more people editing the same doc in real time) is still not going to be fully functional. To get a real collaborative editor, you'll need to implement a solution based on either OT (operational transformation) or a CRDT (conflict-free replicated data type). See also:
https://www.aha.io/engineering/articles/how-to-build-collaborative-text-editor-rails
https://github.com/benaubin/rails-collab
I'm developing an application in Rails 3.2 with MongoDB as the database. I am using the mongoid gem.
I want to track all the changes to the database during runtime, but I have no idea how to do it.
Any clues on this?
There are several approaches you could use to track changes to your data, depending on whether you are just interested in monitoring current activity or need to create some sort of transactional history.
In rough order of least-to-most effort:
1) Enable the MongoDB Query Profiler
If you enable the query profiler at level 2, it will collect profiling data for all operations (reads as well as writes) on a specific database. You can also set the profiling level in your configuration options to change the default for all databases on mongod startup. The profile data is saved in a capped collection per database and will only contain a recent snapshot of queries. The profiling level can also be changed at runtime, so you can easily enable or disable it as required.
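For example, you can set the profiling level and read the capped system.profile collection from Ruby. This sketch uses the modern mongo driver API (the Moped driver underneath Mongoid 3 exposes a similar command method); the connection details are placeholders:

require 'mongo'

client = Mongo::Client.new(['127.0.0.1:27017'], database: 'my_app_development')

# level 2 profiles all operations; level 1 only operations slower than slowms
client.database.command(profile: 2, slowms: 100)

# profiled operations land in the capped system.profile collection
client['system.profile'].find.sort('$natural' => -1).limit(5).each do |op|
  puts "#{op['op']} #{op['ns']} #{op['millis']}ms"
end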
2) Add Mongoid callbacks to your application
Add appropriate logic to Mongoid callbacks such as after_insert, after_save, or after_upsert, depending on what information you are trying to capture.
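For example, a sketch of a callback-based audit trail (the Article model, the ChangeLog collection, and the field names are assumptions):

class Article
  include Mongoid::Document
  field :title, type: String
  field :body,  type: String

  before_save :capture_changes
  after_save  :record_change

  private

  # `changes` (Mongoid's dirty tracking) lists the fields about to be written,
  # as { "field" => [old_value, new_value] }
  def capture_changes
    @pending_changes = changes.dup
  end

  def record_change
    ChangeLog.create(target_type: self.class.name, target_id: id,
                     changed_fields: @pending_changes, recorded_at: Time.now)
  end
end

# an assumed audit collection to hold the captured changes
class ChangeLog
  include Mongoid::Document
  field :target_type,    type: String
  field :target_id
  field :changed_fields, type: Hash
  field :recorded_at,    type: Time
end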
3) Create a tailable cursor on the MongoDB oplog
If you run MongoDB as part of a replica set (or with the --replSet option), it creates a capped collection called the oplog (operations log). You can use a tailable cursor to follow changes as they are committed to the oplog. The oplog details all changes to the database, and is the mechanism MongoDB uses for replication.
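A sketch of tailing the oplog from Ruby (the cursor_type option is from the modern mongo gem; the Moped driver from the Rails 3.2 era spells this differently, so treat the details as assumptions):

require 'mongo'

# the oplog lives in the 'local' database of a replica set member
client = Mongo::Client.new(['127.0.0.1:27017'], database: 'local')
oplog  = client['oplog.rs']

# a tailable cursor blocks waiting for new entries instead of terminating
oplog.find({}, cursor_type: :tailable_await).each do |entry|
  # 'op' is i/u/d for insert/update/delete; 'ns' is the namespace; 'o' is the document
  puts "#{entry['op']} #{entry['ns']} #{entry['o'].inspect}"
end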
I am not familiar with MongoDB or Mongoid, but here is my idea for someone using MySQL (I hope it gives you some clue):
First you take a snapshot of your database (using a tool like mysqldump).
Then at certain intervals, you check (audit) for those records which have an updated_at value greater (later) than the time you took the snapshot, or your last audit time.
Now, you have two versions of your data, which you can check for changes.
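In ActiveRecord terms the audit step could look like this sketch (the model name and the last_audit_at bookkeeping are assumptions):

last_audit_at = Time.utc(2013, 1, 1) # however you record when the last audit ran

# everything touched since the snapshot / last audit
Article.where('updated_at > ?', last_audit_at).find_each do |record|
  # compare this record against its snapshot copy and store the differences
end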
As I said, this is just an idea, and it needs to be more refined. I hope it gives you a clue.
Is it possible to store all changes of a set by using some means of logical paths - of the changes as they occur - such that one may revert the changes by essentially "stepping back"? I assume that something would need to map the changes as they occur, and the process of reverting them would thus ultimately be linear.
Apologies for any incoherence; this isn't applicable to any particular language. Rather, it's a problem of memory: can a set of finite size (e.g. some store of user input) that is changed continuously (at any given time, for any amount of time; there's no limit to how much it can be changed) be mapped procedurally, such that each new change is recorded as the consequence of the prior one in a second, mirror store that can be used to revert the set all the way back to its initial state?
You might want to look at functional data structures. Functional languages, like Erlang, make it easy to roll back to an earlier state, since changes are always made to new data structures instead of mutating existing ones. While this feature can be used repeatedly internally, Erlang programming typically uses it abundantly at the top level of a "process", so that on any kind of failure it aborts both the processing and all the changes in their entirety simply by throwing an exception (in a non-functional language using mutable data structures, you'd be able to throw an exception to abort, but restoring the originals would be your program's job, not the runtime's). This is one reason Erlang has a solid reputation.
Some of this functional style of programming is usefully applied to non-functional languages, in particular, use of immutable data structures, such as immutable sets, lists, or trees.
Regarding immutable sets, for example, one might design a functionally-oriented data structure where modifications always generate a new set given some changes and an existing set (a change set consisting of additions and removals). You'd leave the old set hanging around for reference (by whomever); languages with automatic garbage collection reclaim the old ones when they're no longer being used (referenced).
You can put an id or tag into your set data structure; this way you can do some introspection to see which version of the data structure someone has hold of. You can also capture the id of the base off of which each new version was generated; this gives you some history or lineage.
If desired, you can also capture a reference to the entire old data structure in the new one, or maintain a global list of all of the sets as they are generated. If you do, however, you'll have to take over more responsibility for storage management, as an automatic collector will probably not find any unused (unreferenced) garbage to collect without some additional help.
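A Ruby sketch of this idea (class and method names are made up): each change produces a new frozen set, tagged with an id and a reference to the version it was derived from.

require 'set'
require 'securerandom'

class VersionedSet
  attr_reader :items, :id, :parent

  def initialize(items = Set.new, parent = nil)
    @items  = items.freeze                 # old versions are never mutated
    @id     = SecureRandom.uuid            # tag for introspection / lineage
    @parent = parent                       # the version this one was derived from
  end

  # applying a change set returns a brand-new version; the receiver is untouched
  def apply(additions: [], removals: [])
    VersionedSet.new(Set.new(items).merge(additions).subtract(removals), self)
  end

  # walking parent links yields the full lineage back to the initial state
  def history
    version, lineage = self, []
    while version
      lineage << version
      version = version.parent
    end
    lineage
  end
end

v0 = VersionedSet.new
v1 = v0.apply(additions: [:a, :b])
v2 = v1.apply(additions: [:c], removals: [:b])
v2.items.to_a   # => [:a, :c]
v0.items.to_a   # => []  -- the original is untouched
v2.history.size # => 3   -- v2 -> v1 -> v0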
Database designs do some of this in their transaction controllers. For the purposes of your question, you can think of a database as a glorified set. You might look into MVCC (Multi-Version Concurrency Control) as one example that is reasonably well written up in the literature. This technique keeps old snapshot versions of data structures around (temporarily), meaning that mutations always appear to be in new versions of the data. An old snapshot is maintained until no active transaction references it; then it is discarded.

When two concurrently running transactions both modify the database, they each get a new version based off the same current and latest data set. (The transaction controller knows exactly which version each transaction is based off of, though the transaction's client doesn't see the version information.) Assuming both concurrent transactions choose to commit their changes, the versioning control in the transaction controller recognizes that the second committer is trying to commit a change set that is not a logical successor to the first (since both change sets, as we postulated above, were based on the same earlier version). If possible, the transaction controller will merge the changes as if the second committer were really working off the other, newer version committed by the first committer. (There are varying definitions of when this is possible; MVCC says it is when there are no write conflicts, which is a less-than-perfect answer but fast and scalable.) If that is not possible, it will abort the second committer's transaction and inform the second committer (who then has the opportunity, should they like, to retry the transaction starting from the newer base).

Under the covers, the various snapshot versions in flight for concurrent transactions will probably share the bulk of the data (with some transaction-specific change sets that are consulted first) in order to make the snapshots cheap. There is usually no API provided to access older versions, so in this domain the transaction controller knows that as transactions retire, the original snapshot versions they were using can also be (reference counted and) retired.
Another area where this is done is append-only files. Logging is a way of recording changes; some databases are based entirely on log-oriented designs.
BerkeleyDB has a nice log structure. Though used mostly for recovery, it contains all the history, so you can recreate the database from the log (up to the point you purge the log, at which point you should also archive the database). Again, someone has to decide when they can start a new log file and when they can purge old log files, which you'd do to conserve space.
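A minimal Ruby sketch of the append-only idea (the file name and entry format are made up): every change is appended as one JSON line, and replaying the file rebuilds the state at any point.

require 'json'
require 'set'

LOG_PATH = 'changes.log'

# record a change by appending one JSON line; existing lines are never rewritten
def append_change(op, value)
  File.open(LOG_PATH, 'a') { |f| f.puts({ op: op, value: value }.to_json) }
end

# rebuild the set by replaying the log from the beginning
def replay(upto = nil)
  state = Set.new
  File.foreach(LOG_PATH).with_index(1) do |line, n|
    break if upto && n > upto            # stop early to "step back" in time
    change = JSON.parse(line)
    change['op'] == 'add' ? state.add(change['value']) : state.delete(change['value'])
  end
  state
end

append_change(:add, 1)
append_change(:add, 2)
append_change(:remove, 1)
replay    # => #<Set: {2}>
replay(2) # => #<Set: {1, 2}> -- the state as of the second change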
These database techniques can be applied in memory as well. (Nothing is free, though, of course ;)
Anyway, yes, there are fields where this is done.
Immutable data structures help preserve history, by simply keeping old copies; changes always go to new copies. (And efficiency techniques can make this not as bad as it sounds.)
Id's can help understand lineage without necessarily holding onto all the old copies.
If you do want to hold onto all the old copies, you have to look at your domain design to understand when/how/if old data structures can possibly get accessed, with an eye toward how to eventually reclaim them. You'll most likely have to get involved in defining how they get released, if ever, or how they get archived for posterity, though at the cost of slower access later.
I'm working on a Ruby on Rails site.
In order to improve performance, I'd like to build up some caches of various stats so that in the future when displaying them, I only have to display the caches instead of pulling all database records to calculate those stats.
Example:
A model Users has_many Comments. I'd like to store into a user cache model how many comments they have. That way when I need to display the number of comments a user has made, it's only a simple query of the stats model. Every time a new comment is created or destroyed, it simply increments or decrements the counter.
How can I build these stats while the site is live? What I'm concerned about is that after I ask the database to count the number of Comments a User has, but before I'm able to save that count into the stats table, the user might sneak in and add another comment somewhere. That would increment the counter, which would then be immediately overwritten by the other thread, resulting in incorrect stats being saved.
I'm familiar with the ActiveRecord transactions blocks, but as I understand it, those are to guarantee that all or none succeed as a whole, rather than to act as mutex protection for data on the database.
Is it basically necessary to take down the site for changes like these?
Your use case is already handled by Rails. It's called a counter cache. There is a RailsCast here: http://railscasts.com/episodes/23-counter-cache-column
Since it is so old, it might be out of date. The general idea is there though.
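For reference, a minimal sketch of the counter cache setup (the table and column names follow the Rails convention; the backfill line at the end is relevant when adding the counter to a live site):

# assumed migration:
#   add_column :users, :comments_count, :integer, default: 0, null: false

class User < ActiveRecord::Base
  has_many :comments
end

class Comment < ActiveRecord::Base
  # Rails issues an atomic SQL increment/decrement of users.comments_count
  # whenever a comment is created or destroyed
  belongs_to :user, counter_cache: true
end

# backfill existing rows (or fix any later drift) with reset_counters
User.find_each { |user| User.reset_counters(user.id, :comments) }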
It's generally not a best practice to co-mingle application and reporting logic. Send your reporting data outside the application, either to another database, to log files that are read by daemons, or to some other API that handles the storage particulars.
If all that sounds like too much work, then you don't really want real-time reporting. Assuming you have a backup of some sort (hot or cold), run the aggregations and generate the reports on the backup. That way it doesn't affect the running application, and your data shouldn't be more than 24 hours stale.
FYI, I think I found the solution here:
http://guides.ruby.tw/rails3/active_record_querying.html#5
What I'm looking for is called pessimistic locking, and is addressed in 2.10.2.
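For the record, a sketch of what that looks like with Active Record (with_lock issues a SELECT ... FOR UPDATE inside a transaction, so concurrent writers wait their turn; the stats model and column are assumptions):

stat = UserStat.where(user_id: user.id).first
stat.with_lock do
  # the row is locked until the block finishes; competing writers block here
  stat.comments_count = user.comments.count
  stat.save!
end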
I have a rails app that tracks membership cardholders, and needs to report on a cardholder's status. The status is defined - by business rule - as being either "in good standing," "in arrears," or "canceled," depending on whether the cardholder's most recent invoice has been paid.
Invoices are sent 30 days in advance, so a customer who has just been invoiced is still in good standing, one who is 20 days past the payment due date is in arrears, and a member who fails to pay his invoice more than 30 days after it is due would be canceled.
I'm looking for advice on whether it would be better to store the cardholder's current status as a field at the customer level (and deal with the potential update anomalies resulting from potential updates of invoice records without updating the corresponding cardholder's record), or whether it makes more sense to simply calculate the current cardholder status based on data in the database every time the status is requested (which could place a lot of load on the database and slow down the app).
Recommendations? Or other ideas I haven't thought of?
One important constraint: while it's unlikely that anyone will modify the database directly, there's always that possibility, so I need to try to put some safeguards in place to prevent the various database records from becoming out of sync with each other.
The storage of calculated data in your database is generally an optimisation. I would suggest that you calculate the value on every request and then monitor the performance of your application. If the fact that this data is not stored becomes an issue for you, that is the time to refactor and store the value in the database.
Storing calculated values, particularly those that can affect multiple tables, is generally a bad idea for the reasons you have mentioned.
When/if you do refactor and store the value in the DB then you probably want a batch job that checks the value for data integrity on a regular basis.
The simplest approach would be to calculate the current cardholder status based on data in the database every time the status is requested. That way you have no duplication of data, and therefore no potential problems with the duplicates becoming out of step.
If, and only if, your measurements show that this calculation is causing a significant slowdown, then you can think about caching the value.
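A sketch of the calculate-on-request version (the model, column names, and exact thresholds are assumptions drawn from the business rule in the question):

class Cardholder < ActiveRecord::Base
  has_many :invoices

  def status
    invoice = invoices.order('due_on DESC').first
    return 'in good standing' if invoice.nil? || invoice.paid_at.present?

    days_past_due = (Date.current - invoice.due_on).to_i
    if days_past_due <= 0     then 'in good standing'  # invoiced, not yet due
    elsif days_past_due <= 30 then 'in arrears'
    else                           'canceled'
    end
  end
end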
Recently I had a similar decision to make, and I decided to store the status as a field in the database. This was because I wanted to reduce SQL queries and it looks simpler. I chose to do it that way because I will need to get this status very often, and calculating it is (at least in my case) a bit complicated.
A possible problem is that it can get out of sync, so I added some after_save and after_destroy callbacks to the child model to keep it synchronized. And of course, if somebody modified the database in some other way, it would cause problems.
You can write a simple rake task that checks all statuses and, if needed, corrects them. You can run it via cron so you don't have to worry about it.
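The sync task can be as small as this sketch (the task, model, and method names are assumptions):

# lib/tasks/statuses.rake
namespace :cardholders do
  desc 'Recalculate cached statuses and correct any that have drifted'
  task sync_statuses: :environment do
    Cardholder.find_each do |cardholder|
      calculated = cardholder.calculate_status # however the business rule is implemented
      next if cardholder.status == calculated
      cardholder.update_column(:status, calculated) # skips callbacks/validations
    end
  end
end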