Rails 5.2 Active Storage: How to determine and ensure that there are no floating blobs with direct uploads - ruby-on-rails

We recently upgraded our app to Rails 5.2 to make use of the active storage direct upload feature.
Following this guide to integrate direct upload with our existing JS drag and drop, we've been able to get the upload working. We take the signed ID returned and add it to hidden fields. Then, on form submission, we create a new record and pass the blob's signed_id to create the association.
However, if the user doesn't go through with the form submission, is there a recommended way to ensure that blobs/files without model associations get purged? The tricky part seems to be determining when to purge the blob.

The purging process depends on your underlying storage. On S3, for example, you can define an object expiration policy for temporary blobs; on the filesystem, you can periodically delete all files in the temporary folder that are older than some limit.
As for the age at which a temporary blob should be purged, that also depends on your application. It obviously needs to be longer than the time a user spends filling in the form, plus some margin. If these blobs lingering a bit longer is not a problem, you can set the threshold to around 24 hours or even more and purge once a day, so users will never encounter a lost file.

I went through the same questioning and ended up concluding there is no ideal way: since it depends on the user's absence of input, the blob can only be purged after a certain arbitrary timeout.
It can be a cron-like job, for example.
Removing all dangling Blobs can be done through a one-liner though:
ActiveStorage::Blob.unattached.each(&:purge)
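In practice you would probably run that one-liner from a scheduled task with an age threshold, so uploads belonging to forms still being filled in are not deleted. A minimal sketch, assuming the unattached scope used above is available in your Rails version, a cron-driven rake task, and an arbitrary 24-hour threshold:

# lib/tasks/blobs.rake
namespace :blobs do
  desc "Purge unattached Active Storage blobs older than 24 hours"
  task purge_unattached: :environment do
    # find_each batches the query so we don't load every blob at once.
    ActiveStorage::Blob.unattached
                       .where("active_storage_blobs.created_at <= ?", 24.hours.ago)
                       .find_each(&:purge)
  end
end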
(Note: I spent quite some time on the MD5 computation too; if that's your case, take a look at the blog article I posted on MD5 computation in JavaScript.)

Related

Uniquely identify files with same name and size but with different contents

We have a scenario in our project where files come from the client with the same file name, sometimes with the same file size too. Currently, when we upload a file, we check the new file name against the existing files in the database, and if there is a match we mark it as a duplicate and do not allow the upload at all. But now we have a requirement to check the content of files that have the same file name, so we need a solution that differentiates such files based on their contents. How do we do that efficiently, avoiding even a minute chance of error?
Rails 3.1, Ruby 1.9.3
Below is one option I have read from a web reference.
require 'digest'
digest_value = Digest::MD5.base64digest(File.read(file_path))
The above line reads the entire contents of the incoming file and generates a unique hash from them, right? Then we can use it for unique file identification. But we have more than 500 users working simultaneously, 24/7, and most of them will be doing this operation. So if the incoming file is large (> 25 MB), the digest will take a long time to read the whole contents and we will thereby suffer performance issues. What could be a better solution, considering all these facts?
I have read the question and the comments, and I have to say the problem is not stated 100% correctly. It seems that what you need is to identify identical content. Period. Regardless of whether the name and size are equal or not. Correct me if I am wrong, but you likely don't want to allow users to upload 100 duplicates of the same file just because they have 100 copies of it locally under different names.
So far, so good. I would use the following approach. The file name is not involved at all. The file size might help as a fast uniqueness check (if the sizes differ, the files are definitely different).
Then one might allow the upload with an instant "OK" response. Afterwards, the server should run Digest::MD5 in the background, comparing the file against everything already uploaded. If there is a duplicate, the new copy of the file should be removed, but the name should remain on the filesystem as a symbolic link to the original.
That way you won't frustrate users, giving them the ability to have as many copies of the file as they want under different names, while keeping disk usage at the lowest possible level.

Removing documents from CouchDB replicas

We have a product that uses a central CouchDB database per client, replicating to apps running on users' iPads. Most of the database can replicate normally, but we have two categories of document that we want to filter:
Documents with an owner - we want to filter the replication to only the current user's documents (and documents with no specified owner).
Last X documents of some type. For some sorts of documents we only want to leave the last 10 (say) copies on the iPad.
We can set up both rules easily enough using filtered replication - so that the server only presents the subset of documents we want to the iPad for replication. Except... it does not work.
If a document has no owner (replicated) and later an owner is specified, it vanishes from the replication stream - but not from the iPad. In fact, the version of the document that remains on the iPad still has NO owner, so we can't even hide it in code.
When a document becomes the 11th oldest and vanishes from the replication stream, it does not vanish from the iPad. Indeed, unless the iPad database is rebuilt, all versions of these documents end up there and no longer replicate, which is worse than just replicating them all in the first place.
We did find a hacky workaround: in the case where a document gains a new owner OR becomes older than X, we duplicate it and delete the original. The delete propagates to the iPad and the new document is filtered out of replication. This worked well enough (although it is a bit inefficient). However, we then realised the newly copied document had lost all of its revision information, and we were relying on the revisions to track changes!
So - does anyone have any other suggestion? What we are looking for is a mechanism to pull a document from the iPad replicas on demand. I am aware we could instruct the iPad to delete the documents locally - but then sooner or later those deletes would leak back to the server and destroy the original?
... we were relying on the revisions to track changes
IMHO this is the most interesting point for talking about an alternative solution.
I'm sorry, but I have to say you are using CouchDB's revision control in a way that is not recommended. Document revisions are temporary. The best way to track changes to a document is to write a changes log inside or outside the doc.
How would you persist changes outside the doc itself? Yes, you would create new docs. Surprise: your "hack" is the right solution \o/
Maybe you are shaking your head and are not happy, because you tried to remove docs from the iPad to make them invisible client-side. That was the starting point of your "hack", right?
My recommendation is not to conflate "visibility" and "existence". It would be better to apply your know-how of building view indexes server-side in the same way client-side with PouchDB. Let replication just handle replication - that's hard enough. Use views/filters client- and server-side to solve the visibility requirements.
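For illustration, a sketch of what an in-document changes log could look like, written as the Ruby hash you would save through whatever CouchDB client you use (all field names apart from _id are illustrative, not a CouchDB convention):

doc = {
  "_id"   => "invoice-1234",
  "type"  => "invoice",
  "owner" => "user-42",
  "total" => 99.50,
  # Application-level history that survives compaction, unlike CouchDB's
  # temporary _rev values.
  "changes" => [
    { "at" => "2014-05-01T10:00:00Z", "by" => "user-42",
      "field" => "total", "from" => 80.00, "to" => 99.50 }
  ]
}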

How to build cached stats in database without taking down site?

I'm working on a Ruby on Rails site.
In order to improve performance, I'd like to build up some caches of various stats so that in the future when displaying them, I only have to display the caches instead of pulling all database records to calculate those stats.
Example:
A User model has_many Comments. I'd like to store in a user cache model how many comments each user has. That way, when I need to display the number of comments a user has made, it's only a simple query against the stats model. Every time a new comment is created or destroyed, it simply increments or decrements the counter.
How can I build these stats while the site is live? What I'm concerned about is that after I ask the database to count the number of Comments a User has, but before it executes the command to save that count into the stats model, the user might sneak in and add another comment somewhere. That would increment the counter, but it would then be immediately overwritten by the other thread, resulting in incorrect stats being saved.
I'm familiar with the ActiveRecord transactions blocks, but as I understand it, those are to guarantee that all or none succeed as a whole, rather than to act as mutex protection for data on the database.
Is it basically necessary to take down the site for changes like these?
Your use case is already handled by Rails. It's called a counter cache. There is a RailsCast here: http://railscasts.com/episodes/23-counter-cache-column
Since it is so old, it might be out of date. The general idea is there though.
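For reference, a minimal sketch of a counter cache on the association from the question; comments_count follows the Rails naming convention for counter cache columns (on newer Rails versions the migration superclass also takes a version tag, e.g. ActiveRecord::Migration[5.2]):

# Migration: add the cache column to users.
class AddCommentsCountToUsers < ActiveRecord::Migration
  def change
    add_column :users, :comments_count, :integer, :default => 0, :null => false
  end
end

class User < ActiveRecord::Base
  has_many :comments
end

class Comment < ActiveRecord::Base
  # Rails increments/decrements users.comments_count atomically
  # whenever a comment is created or destroyed.
  belongs_to :user, :counter_cache => true
end

After that, user.comments.size reads the cached column instead of issuing a COUNT against the comments table.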
It's generally not a best practice to co-mingle application and reporting logic. Send your reporting data outside the application, either to another database, to log files that are read by daemons, or to some other API that handles the storage particulars.
If all that sounds like too much work, then you don't really want real-time reporting. Assuming you have a backup of some sort (hot or cold), run the aggregations and generate the reports on the backup. That way it doesn't affect the running application, and your data shouldn't be more than 24 hours stale.
FYI, I think I found the solution here:
http://guides.ruby.tw/rails3/active_record_querying.html#5
What I'm looking for is called pessimistic locking, and is addressed in 2.10.2.
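For completeness, a minimal sketch of pessimistic locking applied to the stats update, assuming a hypothetical UserStat model with a comments_count column; lock(true) issues a SELECT ... FOR UPDATE, so concurrent writers to that row block until the transaction commits:

UserStat.transaction do
  # The row stays locked until the transaction commits, so a parallel
  # update cannot interleave with this read-count-save sequence.
  stat = UserStat.lock(true).where(:user_id => user.id).first
  stat.comments_count = user.comments.count
  stat.save!
end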

Sharing a large array with all users on a Rails app

I have inherited an app that generates a large array for every user that visits the app. I recently discovered that it is identical for nearly all users!
Now I want to somehow make one copy of it so it is not built over and over again. I have thought of a few options and wanted input to see which one is the best:
1) Create a model and shove the data into the database
2) Create a YAML file and have the app load it when it initializes.
I personally like the model idea, but a few engineers at work feel it does not deserve to be a full model. 97% of the time users will see the exact same thing, but 3% of the time users will get a slightly different array (a few elements will have changed).
Are there any other approaches I should consider? Thanks in advance.
Remember that if you store the data in the DB, each request which requires the data will have to execute a DB query to pull it out. If you are running multiple server threads, each thread could have its own copy in memory (if they are all handling requests which require the use of the array). In that case, you wouldn't be saving any memory (though you might save time from not having to regenerate the array).
If you are running multiple server processes (not threads), and if the array contents change as the application is running, and the changes have to be visible to all the processes, caching in memory won't work. You will have to use the DB in that case.
From the information in your comment, I suggest you try something like this:
Store the array in your DB, and make sure that the record(s) used have created/updated timestamps. Cache the contents in memory using a constant/global variable/class variable. Also store the last time the cache was updated.
Every time you need to use the array, retrieve the relevant "updated" timestamp from the DB. (You may need to use hand-coded SQL and ModelName.connection.execute to avoid pulling back all the data in the record, which ActiveRecord will probably do.) If the timestamp is later than the last time your cache was updated, pull the array from the DB and update your cache.
Use a Mutex (require 'thread') when retrieving/updating the cached data, in case your server setup uses multiple threads. (I don't think that Passenger does, but I have had problems similar to threading problems when using Passenger+RMagick, so I would still use a Mutex to be safe.)
Wrap all the code which deals with the cached array in a library class (or a class method on the model used to store the data), so the details of cache management don't spill over into the rest of the application.
Do a little bit of performance testing on the cache setup using Benchmark.measure {}. If a bug in the setup actually made performance worse rather than better, that would be sad...
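A rough sketch of that setup, assuming a hypothetical SharedArray model with a serialized data column and the usual updated_at timestamp (class, table, and column names are illustrative; Rails 3-style query API):

require 'thread'

# Wraps all access to the shared array so cache management does not
# leak into the rest of the application.
class SharedArrayCache
  @mutex     = Mutex.new
  @cached    = nil
  @cached_at = nil

  def self.fetch
    @mutex.synchronize do
      # Pull back only the timestamp, not the large serialized payload.
      latest = SharedArray.connection.select_value(
        "SELECT updated_at FROM shared_arrays ORDER BY updated_at DESC LIMIT 1"
      )
      if @cached.nil? || latest != @cached_at
        @cached    = SharedArray.order("updated_at DESC").first.data
        @cached_at = latest
      end
      @cached
    end
  end
end

SharedArrayCache.fetch then returns the in-memory copy on every call except when the DB row has changed.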
I'd go with option 2. You can add two constants (for the 97% and 3%) that load from a YAML file when the app initializes. That ought to shrink your memory footprint considerably.
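A minimal sketch of that option, assuming the data lives in config/shared_array.yml under common and variant keys (the file and key names are illustrative):

# config/initializers/shared_array.rb
require 'yaml'

raw = YAML.load_file(Rails.root.join("config", "shared_array.yml"))

# Loaded once at boot: the array most users see, plus the few elements
# that differ for the remaining 3% of users.
SHARED_ARRAY          = raw["common"].freeze
SHARED_ARRAY_VARIANTS = raw["variant"].freeze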
Having said that, yikes, this is just a band-aid on a hack, but you knew that already. I'd consider putting some time into a redesign, if you have that luxury.

How are you mapping database records to physical files such as image uploads

37signals suggests ID partitioning to accomplish this:
http://37signals.com/svn/archives2/id_partitioning.php
Any suggestions would be more than welcome.
Thanks.
We use Paperclip for storing our files. It can do what you want pretty easily.
We use partitioning by date, so an image uploaded today would end up in 2009/12/10/image_12345.jpg. The path is stored in the DB for reference, and the path to the image folder (the parent of 2009) is placed in a config file. If we need to change things later, this makes it very easy.
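For reference, a rough sketch of date partitioning with Paperclip; :year, :month, and :day are not built-in interpolations, so they are registered here via Paperclip.interpolates (the model and path layout are illustrative):

# config/initializers/paperclip_interpolations.rb
# Custom interpolations based on the record's creation time.
Paperclip.interpolates(:year)  { |attachment, style| attachment.instance.created_at.strftime("%Y") }
Paperclip.interpolates(:month) { |attachment, style| attachment.instance.created_at.strftime("%m") }
Paperclip.interpolates(:day)   { |attachment, style| attachment.instance.created_at.strftime("%d") }

class Photo < ActiveRecord::Base
  has_attached_file :image,
    :path => ":rails_root/public/system/:year/:month/:day/:basename_:id.:extension",
    :url  => "/system/:year/:month/:day/:basename_:id.:extension"
end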
You can map by virtually anything. We use mapping by user in our designs, but it's an HR system, so it makes sense (there's no way a user will have 32k file entries) and the files are clearly connected to a user. On the Media Library parts of the system, dividing by date or ID is more useful.
The catch is that you should store some part of the file path in a database table (as suggested before), whether it is a date or a user hash/name (often also subdivided, e.g. u/user, j/john, j/jo/john, etc.). Then you don't have to worry about changing the division scheme, as this will only require a database update.
