MongoDB, Carrierwave, GridFS and prevention of files' duplication - upload

I am using Mongoid, CarrierWave and GridFS to store my uploads.
For example, I have an Article model that contains a file upload (a picture).
    class Article
      include Mongoid::Document
      field :title, :type => String
      field :content, :type => String
      mount_uploader :asset, AssetUploader
    end
But I would like to store each file only once, even when the same file is uploaded many times for different articles.
I saw that GridFS has an MD5 checksum.
What would be the best way to prevent duplication of identical files?
EDIT:
In fact, users will be able to upload files on my website.
But to avoid storing multiple copies of identical files, I would like to just create links through an association table. Nothing difficult, but how do I do this with the libraries mentioned above?
If you have any ideas, please share them.
Thanks

De-duplication may very well be a worthy goal depending on your application, but my first instinct to approaching this problem would be to turn it around -- why do you expect a lot of duplicate uploads? Can you reduce that likelihood so that users don't have to spend needless time uploading and you don't have to spend needless effort processing checks for duplicates?
What if you create an Asset model and attach the uploader to that, then an Article references_one :asset, and you let users choose from already available assets when creating a new article or upload a new one if needed?
I may not understand your application domain if you're giving a simplified example (please explain further if so), and it's certainly possible that duplication could still be a real issue, but I'd start by asking why significant duplication is expected, and next gather some data about how much of a problem it really is in your app and dataset before expending a lot of effort to address it.
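If de-duplication does turn out to be worth it, the separate Asset idea combines naturally with a content checksum. The following is only a minimal plain-Ruby sketch of that lookup: AssetStore is a hypothetical in-memory stand-in for an Asset collection (in a real app it would presumably be a Mongoid model with a unique index on the checksum), and it does not use the CarrierWave or GridFS APIs at all.

```ruby
require "digest"

# Hypothetical in-memory stand-in for an Asset collection keyed by checksum.
class AssetStore
  Asset = Struct.new(:id, :md5, :data)

  def initialize
    @by_md5 = {}
    @next_id = 1
  end

  # Returns the existing asset when identical bytes were stored before,
  # otherwise stores the bytes once and returns the new asset.
  def find_or_store(data)
    md5 = Digest::MD5.hexdigest(data)
    @by_md5[md5] ||= begin
      asset = Asset.new(@next_id, md5, data)
      @next_id += 1
      asset
    end
  end

  def count
    @by_md5.size
  end
end

# Articles only hold a reference to the shared asset.
Article = Struct.new(:title, :asset_id)

store  = AssetStore.new
first  = Article.new("First",  store.find_or_store("same bytes").id)
second = Article.new("Second", store.find_or_store("same bytes").id)
third  = Article.new("Third",  store.find_or_store("other bytes").id)
```

Here first and second end up pointing at the same asset, while third gets its own. MD5 is fine for catching accidental duplicates, but it is not collision-resistant against deliberately crafted files, so a stronger digest such as SHA-256 may be a safer key if uploads are untrusted.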

Related

dropbox clone - To use Active Storage directly on User Model or Have a separate Model handling the file attachment?

I'm building an app similar to Dropbox's concept where file storage is the key feature. It struck me when I started planning the app as to whether I should:
1. Have a User model with has_many_attached to handle all files/images related to a user
2. Have a UserFile model with has_one_attached and belongs_to :user
Still a rookie here, and I guess my concern is that I'm not sure if option 1 will have more limitations as the database grows in the future, and whether accessing, storing, viewing, updating, and deleting any files belonging to the user may not be as flexible.
Also, additional tracking on the file is required, i.e. download counter, document verified etc.
Looking at option 2, it is definitely working but it makes the flow more complex and definitely will be difficult to maintain down the road.
Thanks in advance for your input.
I have searched Stack Overflow and even the Rails guides, but I can't find any information that helps me with this decision. Or perhaps I just can't understand it.

Rails implementation of a database-based file system

Because "file system" and "rails" are such common topics both together and separate I fail to find any Ruby on Rails open source app that implements a file system in the database. I would like to use such an application as a starting point or template.
I've already been able to implement the User and the Directory models (using Ancestry for the latter), and I'm working on the File model (my app only requires one kind of file).
    class User < ActiveRecord::Base
      attr_accessible :email, :name, :password, :password_confirmation
      has_secure_password
      has_many :directories, dependent: :destroy
      # ...
    end # class User

    class Directory < ActiveRecord::Base
      attr_accessible :name, :parent_id
      has_ancestry
      belongs_to :user
      has_many :files, dependent: :destroy
      # ...
    end # class Directory

    # not actually implemented, yet
    # NOTE: a top-level class named File reopens Ruby's core File class and
    # fails with a superclass mismatch; a different name would avoid that.
    class File < ActiveRecord::Base
      attr_accessible :name
      belongs_to :directory
      # ...
    end # class File
In views I'm using jsTree to present the tree and a form to add/delete, edit, ... This will need to change into using AJAX because redirecting back to same page does not preserve the expanded/collapsed state of the tree.
However I have this nagging feeling that I'm doing something that has already been done lots of times. Can you please provide links to such application(s) or give hints about implementing both the model part and the view part?
Hints about implementing the model part
To organise a model as a tree structure, one well-known technique is the nested set model, so a useful search term is "ActiveRecord nesting" ;-)
Your choice of Ancestry is fine, but you may benefit from having a look at projects (mix-ins, plug-ins, ...) such as:
awesome_nested_set
acts_as_nested_set
Better nested set
acts_as_tree
Closure Tree
Arboreal
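To make the difference between these libraries a bit more concrete, here is a rough plain-Ruby sketch of the materialized-path idea that Ancestry is built on: each node keeps the ids of its ancestors in a single string. This is a simplified illustration, not Ancestry's actual implementation; Node, child_ancestry and descendants are hypothetical names, and the "/" separator is an assumption of the sketch.

```ruby
# Simplified materialized-path tree: each node stores its ancestors' ids
# as a "/"-joined string, so a subtree can be found with a prefix match.
Node = Struct.new(:id, :name, :ancestry) do
  def ancestor_ids
    ancestry.to_s.split("/").map(&:to_i)
  end

  # Path that this node's children store in their own ancestry field.
  def child_ancestry
    [ancestry, id].compact.join("/")
  end
end

root = Node.new(1, "home", nil)
docs = Node.new(2, "docs", root.child_ancestry)
work = Node.new(3, "work", docs.child_ancestry)
pics = Node.new(4, "pics", root.child_ancestry)
nodes = [root, docs, work, pics]

# All nodes below the given node, found by comparing path prefixes.
def descendants(nodes, node)
  prefix = node.child_ancestry
  nodes.select do |n|
    n.ancestry == prefix || n.ancestry.to_s.start_with?("#{prefix}/")
  end
end
```

In a database the same prefix comparison becomes an equality or LIKE condition on one column, which is why moving or listing a whole directory subtree stays cheap with this layout.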
For the file upload and storage part I would suggest, in addition to the already mentioned Paperclip, looking at CarrierWave. By itself it provides storage based on the "fog" gem (which supports storing files with AWS, Google, Local and Rackspace), but you can opt for database (e.g. SQLite) storage by using carrierwave-activerecord.
Hints about implementing the view part
For the views, you might be interested in this answer about jQuery File Tree, a configurable AJAX file browser plugin for jQuery, and in the dnamique blog, which has a Rails connector for this plugin along with sources and a demo.
As an alternative, look at the implementation (sources) of the applications mentioned in the next section.
Links to such applications
Here are some file managers of interest:
Boxroom
Saphyra (available as mountable engine)
Rails-based CMSs might also have code of interest
I think you're on the right track. Your Directory and File models look fine to me.
Your nagging feeling is partly correct. It's a common requirement to support uploading and storing files, but it's not that common to model and display an entire hierarchical directory structure.
You may want to reconsider actually storing the files in the database. This is usually a bad idea. Because file sizes vary so widely, they can bloat your table and hurt performance. I recommend storing your files in Amazon S3. It is reliable, fast storage, and you can serve S3 URLs directly to clients to reduce bandwidth and load on your own servers. You can use the paperclip gem to handle file uploads and store the files either on disk or on S3.

Rails: Progressive Validation, use STI or something different?

I have a rails app where users share specific kinds of photos. Currently the app requires photos to be categorized in several ways before they are valid, hence users must upload photos one at a time and categorize them in order to save them to the database.
Categorization takes some time, so I'd like to allow users to upload batches of photos and then come back and categorize them when they have time, but when photos are stored without being fully categorized I don't want them mixed in with "complete" photos.
I'd ideally like this to be a sort of "Wizard" system where users can upload a bunch of photos at once and then proceed through their personal queue and categorize each photo (to finish creating it) when they have time.
My question is: how would you approach a problem like this?
I've been thinking about using Single Table Inheritance to create two subclasses of Photo: IncompletePhoto and CompletePhoto. The IncompletePhoto would only require the image file itself, but CompletePhoto would require categorization. Users could view their own IncompletePhotos, but search results within the app would only return CompletePhotos.
Does that sound like the right approach for the problem I'm trying to solve, or is there a better way? I've never used STI before and I'm not sure whether or not it's a good idea.
I'd say that STI was created to be useful when you have different objects with some, but not all, properties in common, for the cases where you'll benefit from DRY in both database and models. I'm not sure if there is a way to correctly change the type of an instance of such a model. Well, you can just modify the type column itself, but the Ruby class of the object will stay the same, and so will the validations, unless you re-fetch the model after saving and then run validations manually. The latter sounds like a dirty hack to me.
A cleaner approach, I'd suggest, is to add a complete column and use conditional validators of the form validates ..., :if => :complete.
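To show what the conditional validation buys you, here is a framework-free sketch of the same idea. It deliberately does not use ActiveRecord, so the class below is a hypothetical stand-in: in the real model the category check would be a validates call with the :if => :complete option rather than a hand-written errors method.

```ruby
# Framework-free sketch: the category is only required once the photo
# has been marked complete; an uploaded-but-unsorted photo stays valid.
class Photo
  attr_accessor :image, :category, :complete

  def initialize(image:, category: nil, complete: false)
    @image = image
    @category = category
    @complete = complete
  end

  def errors
    errs = []
    errs << "image can't be blank" if image.nil? || image.empty?
    # Only enforced when the photo is flagged as complete.
    errs << "category can't be blank" if complete && (category.nil? || category.empty?)
    errs
  end

  def valid?
    errors.empty?
  end
end

uploaded = Photo.new(image: "beach.jpg")
uploaded.valid?                  # => true, only the image is required so far
uploaded.complete = true
uploaded.valid?                  # => false, a category is now required
uploaded.category = "landscape"
uploaded.valid?                  # => true
```

Search results can then simply filter on the complete flag, and each user's queue is the set of their photos where it is still false.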

When does STI make sense? We are storing the same information for every type but using it differently

So I know STI is the most reviled thing ever but I have an instance where I think it might actually make sense. My app is parsing a bunch of different types of xml files. Every file model stores the exact same information. Just some info about what user it is associated with, when it was uploaded, and where it is stored on S3.
After the xml file gets stored then I parse it for information which I use to create various other models. Each type of file is going to create different things. It is possible there could be 100 or more different types of xml files although I don't think I'm going to write parsers for that many. Does STI make sense in this case?
The downside, I guess, is that the models all live in one directory, so they are going to flood that directory unless I hack Rails and put them in a subdirectory of the models directory.
The other option is to have a kind field and put something in the lib directory that handles all this. Or, since I'm using resque, maybe every XML file parser should be its own job. There are drawbacks to that though, like it being kind of awkward to force a job in the Rails console.
From your explanation, the 'file' model is only storing the results of the file upload process and associated meta data. Without more information about the other kinds of models being generated from the parsed XML data, I don't see why single table inheritance applies to this use case.
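The kind-field alternative mentioned in the question could look roughly like the sketch below: one file record type, plus a registry that maps each kind to its parser. All names here (XmlParsers, UploadedFile, the "invoice" and "order" kinds) are made up for illustration, and the lambdas are placeholders for real parsing code.

```ruby
# Hypothetical parser registry: a single uploaded-file record with a kind
# field, and one parser per kind, instead of one STI subclass per file type.
module XmlParsers
  REGISTRY = {}

  def self.register(kind, parser)
    REGISTRY[kind] = parser
  end

  def self.parse(kind, xml)
    parser = REGISTRY.fetch(kind) do
      raise ArgumentError, "no parser for kind #{kind.inspect}"
    end
    parser.call(xml)
  end
end

# Placeholder parsers; real ones would build the other models from the XML.
XmlParsers.register("invoice", ->(xml) { { type: :invoice, bytes: xml.bytesize } })
XmlParsers.register("order",   ->(xml) { { type: :order,   bytes: xml.bytesize } })

UploadedFile = Struct.new(:kind, :body)

file = UploadedFile.new("invoice", "<invoice/>")
result = XmlParsers.parse(file.kind, file.body)
```

Adding a new file type then means registering one more parser, with no new model class and no change to the table, and the same dispatch can be called from a background job or directly from the console.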

rails + paperclip: Is a generic "Attachment" model a good idea?

In my application I have several things with attachments on them, using Paperclip.
Clients have one logo.
Stores can have one or more pictures. These pictures, in addition, can have other information such as the date on which they were taken.
Products can have one or more pictures of them, categorized (from the front, from the back, etc).
For now, each one of my Models has its own "paperclip-fields" (Client has_attached_file) or has_many models that have attached files (Store has_many StorePictures, Product has_many ProductPictures)
My client has also told me that in the future we might be adding more attachments to the system (i.e. pdf documents for the clients to download).
My application has a rather complex authorization system implemented with declarative_authorization. One can not, for example, download pictures from a product he's not allowed to 'see'.
I'm considering re-factoring my code so I can have a generic "Attachment" model. So any model can has_many :attachments.
With this context, does it sound like a good idea? Or should I continue making Foos and FooPictures?
I've found there are often cases where a generic Attachment class is a whole lot easier to manage than independent attachments on various other types of records. The only down-side to the simple Attachment approach is that the thumbnails that need to be produced are defined for all possible attachments simultaneously instead of on a case-by-case basis.
A hybrid approach that allows for more flexibility is to create an STI-based Attachment table by including a 'type' column and making use-specific subclasses, such as ProductAttachment, that each define their own styles.
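As a rough illustration of that hybrid, the plain-Ruby sketch below shows the shape of it without ActiveRecord or Paperclip. The for_type helper only mimics what STI does with the type column, ClientLogo is a hypothetical second subclass, and the style names and sizes are invented for the example.

```ruby
# Plain-Ruby sketch of the hybrid idea: a base Attachment plus use-specific
# subclasses that each declare their own thumbnail styles.
class Attachment
  def self.styles
    {}
  end

  # Mimics STI's type column: the stored class name selects the subclass.
  def self.for_type(type_name)
    Object.const_get(type_name).new
  end

  def styles
    self.class.styles
  end
end

class ProductAttachment < Attachment
  def self.styles
    { thumb: "100x100", large: "800x800" }
  end
end

class ClientLogo < Attachment
  def self.styles
    { logo: "200x60" }
  end
end

attachment = Attachment.for_type("ProductAttachment")
attachment.styles
```

Every record still lives in one table and can be handled generically, for example by the authorization rules, while each subclass only generates the thumbnails it actually needs.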
