Uniquely identify files with same name and size but with different contents - ruby-on-rails

We have a scenario in our project where files come from the client with the same file name, and sometimes with the same file size too. Currently, when a file is uploaded, we compare the new file name with the existing files in the database; if there is a match, we mark it as a duplicate and do not allow the upload at all. But now we have a requirement to check the content of the files when they have the same file name, so we need a way to differentiate such files based on their contents. How do we do that efficiently, meaning with not even a minute chance of error?
Rails 3.1, Ruby 1.9.3
Below is one option I have read from a web reference.
require 'digest'
digest_value = Digest::MD5.base64digest(File.read( file_path ))
That line reads the entire contents of the incoming file and generates a hash from it, right? Then we can use it for unique file identification. But we have more than 500 users working simultaneously, 24/7, and most of them will be doing this operation. So if the incoming file is huge (> 25MB), the digest will take longer to read the whole contents and performance will suffer. What would be a better solution considering all these facts?

I have read the question and the comments, and I have to say the problem is not stated 100% correctly. It seems that what you need is to identify identical content. Period. Regardless of whether name and size are equal or not. Correct me if I am wrong, but you likely don’t want to allow users to upload 100 duplicates of the same file just because they have 100 local copies of it under different names.
So far, so good. I would use the following approach. The file name is not involved at all. The file size might help as a fast uniqueness check (if the sizes differ, the files are definitely different).
Then one might allow the upload with an instant “OK” response. Afterwards, the server should run Digest::MD5 in the background, comparing the file against all files already uploaded. If there is a duplicate, the new copy of the file should be removed, but the name should stay on the filesystem as a symbolic link to the original.
That way you’ll not frustrate users: they can keep as many copies of the file as they want under different names, while disk usage stays as low as possible.
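The size-then-digest check described above could look roughly like the sketch below. The helper name `find_duplicate` and the idea of passing in a plain list of stored paths are assumptions made for illustration; a real application would more likely look up stored sizes and digests in the database. `Digest::MD5.file` is used instead of `File.read` so that the file is read in chunks rather than loaded into memory in one go.

```ruby
require 'digest'

# Hypothetical helper (not from the original answer): returns the path of an
# already stored file whose content matches new_path, or nil if there is none.
def find_duplicate(new_path, existing_paths)
  new_size = File.size(new_path)

  # Cheap pre-filter: files of different sizes cannot have identical content.
  candidates = existing_paths.select { |path| File.size(path) == new_size }
  return nil if candidates.empty?

  # Only hash when at least one stored file has the same size.
  new_digest = Digest::MD5.file(new_path).hexdigest
  candidates.find { |path| Digest::MD5.file(path).hexdigest == new_digest }
end
```

In practice the digest of each stored file would be computed once and saved alongside its record, so that only the incoming file has to be hashed. A matching digest means identical content only with very high probability, so a byte-by-byte comparison of the two files can be added where no chance of error is acceptable.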


Rails 5.2 Active Storage: How to determine and ensure that there are no floating blobs with direct uploads

We recently upgraded our app to Rails 5.2 to make use of the active storage direct upload feature.
Following this guide to integrate direct upload with our existing JS drag and drop, we've been able to get the upload working. We take the signed ID returned and add it to hidden fields. Then, on form submission, we create a new record and pass each blob's signed_id to create the association.
However, if the user doesn't go through with the form submission, is there a recommended way to ensure that the blobs/files without model associations get purged? The tricky part seems how to determine when to purge the blob.
The purging process depends on your underlying storage. For example, on S3 you can define an object expiration policy for temporary blobs; on a filesystem, you can periodically delete all files in the temporary folder that are older than some limit.
As for the age at which a temporary blob should be purged, this also depends on your application. Obviously it should be longer than the time a user takes to fill in the form, plus some margin. If you do not mind these blobs lingering a bit longer, you can set the threshold to around 24 hours or even more and purge once a day, so users will not encounter a lost file.
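For the filesystem case, the periodic cleanup could be sketched as below. The function name `purge_stale_files`, the use of the file's modification time as its age, and the 24-hour threshold in the usage note are all assumptions for illustration; they are not part of Active Storage.

```ruby
# Hypothetical helper: deletes regular files under dir whose modification
# time is at least max_age_seconds in the past, and returns the removed paths.
# Note: this glob pattern does not match dotfiles.
def purge_stale_files(dir, max_age_seconds, now = Time.now)
  removed = []
  Dir.glob(File.join(dir, '**', '*')).each do |path|
    next unless File.file?(path)                          # skip directories
    next if now - File.mtime(path) < max_age_seconds      # still fresh
    File.delete(path)
    removed << path
  end
  removed
end
```

A daily cron-like job could then call something like `purge_stale_files('/path/to/tmp/uploads', 24 * 60 * 60)`, where the directory is whatever temporary location your setup uses.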
I went through the same questioning and ended up concluding there is no ideal way: since it depends on the absence of user input, the Blob can only be purged after some arbitrary timeout.
It can be a cron-like job for example.
Removing all dangling Blobs can be done through a one-liner though:
ActiveStorage::Blob.unattached.each(&:purge)
(Note: I spent quite some time on the MD5 computation too, if it's your case, take a look at the blog article I posted on MD5 computation in javascript)

OneDrive Api - With whom the item is shared?

In OneDrive Business account I have shared files and folders and I'm trying to get a list of emails/users with whom the items are shared.
Both
https://graph.microsoft.com/v1.0/me/drive/sharedWithMe
and
https://graph.microsoft.com/v1.0/me/drive/root/children
produce a similar result. I get the list of files, but the property Permissions is never present. All I see is whether the items are shared, but not with whom.
Now, I'm aware of /drive/items/{fileId}/permissions, but this would mean checking the files one-by-one. My app deals with a lot of files and I would really appreciate a way to get those permissions in bulk...
Is there such an option?
/sharedWithMe is actually the opposite of what you're looking for. These are not files you've shared with others but rather files others have shared with you.
As for your specific scenario, permissions is unfortunately not supported in a collection. In other words, it isn't possible to $expand=permissions on the /children collection. Each file needs to be inspected separately.
You can, however, reduce the number of files you need to inspect by looking at the shared property. For example, if its scope property is set to users, you know this file was shared with specific users. If the shared property is null, you know this file is only available to the current user.
You can also reduce the number of calls you're making by using JSON Batching. After constructing a list of shared files you want to check, you can use Batching to process them in blocks of 20. This should greatly reduce the amount of overhead and dramatically improve the overall performance.
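As a rough sketch of the batching step, the code below only builds the JSON Batching payloads, 20 requests per batch, and leaves out authentication and the HTTP call itself. The helper name `build_permission_batches` and the item IDs are hypothetical; the request shape (`id`, `method`, `url` inside a `requests` array) follows the Graph batching format.

```ruby
require 'json'

# Batch size taken from the answer above ("blocks of 20").
GRAPH_BATCH_LIMIT = 20

# Hypothetical helper: turns a list of drive item IDs into batch payloads,
# each asking for the permissions of up to 20 items.
def build_permission_batches(item_ids)
  item_ids.each_slice(GRAPH_BATCH_LIMIT).map do |slice|
    requests = slice.each_with_index.map do |item_id, index|
      {
        'id'     => (index + 1).to_s,   # correlates each response with its request
        'method' => 'GET',
        'url'    => "/me/drive/items/#{item_id}/permissions"
      }
    end
    { 'requests' => requests }
  end
end
```

Each payload would then be serialized with `JSON.generate` and sent as a POST to https://graph.microsoft.com/v1.0/$batch; the responses come back keyed by the same `id` values.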
_api/web/onedriveshareditems?$top=100&$expand=SpItemUrl might just do the trick. This is the URL that is used by the web interface of OneDrive. Hope it helps

Solution For Monitoring and Maintaining App's Size on Disc

I'm building an app that makes extensive use of CoreData, and a lot of my models have UIImage and NSData properties (for images and videos). Since it's not a great idea to store that data directly in CoreData, I built a file manager class that writes the files into different buckets in the documents directory, depending on the context in which each file was created and its media type.
My question now is how do I manage the documents directory? Is there a way to detect how much space the app has used up out of its total allocated space? Additionally, what is the best way to go about cleaning those directories; do I check every time a file is written or only on app launch, etc.?
Is there a way to detect how much space the app has used up out of its total allocated space?
Apps don't have a limit on total allocated space; they're limited by the amount of space on the device. You can find out how much space you're using for these files by using NSFileManager to scan the directories. There are several methods that do this in different ways. Check out enumeratorAtPath:, for example. For each file, use a method like attributesOfItemAtPath:error: to get the file size.
Better would be to track the file sizes as you create and delete files. Keep a running total, stored in user defaults. When you create a new file, increase it by the amount of new data. When you remove a file, decrease the running total.
Additionally, what is the best way to go about cleaning those directories; do I check every time a file is written or only on app launch, etc.?
If these files are local data that's inherently part of the associated Core Data object, the sensible approach is to delete a file when its Core Data object is deleted. The managed object needs the data file, so don't delete the file if you still use the object. That means there must be some way to link the two, but I'm assuming that's already true since you say that these files are used by managed objects somehow.
If the files are something like cached data that's easily re-created or re-downloaded, you should put them in the location returned by NSTemporaryDirectory(). Then iOS can delete them when it thinks the space is needed. You can also clear out old files whenever it seems appropriate, by scanning for older files or ones that haven't been used in a while (the details depend on exactly how you use the files).

How efficient iOS file system in dealing with large number of files in single folder

If I have a large number of files (n x 100K individual files), what would be the most efficient way to store them in the iOS file system (in terms of speed of access to a file by path)? Should I dump them all in a single folder or break them into a multilevel folder hierarchy?
Basically this breaks in three questions:
Does file access time depend on the number of "sibling" files (I think the answer is yes. If I am correct, file names are organized into a b-tree, so it should be O(log n))?
How expensive is traversing from one folder to another along the path (is it something like m * O( log nm ), where m is the number of components in the path and nm is the number of "siblings" at each path component)?
What gets cached at file system level to make above assumptions incorrect?
It would be great if someone had direct experience with this kind of problem and could share some real-life results.
Your comments will be highly appreciated.
This seems like it might provide relevant, hard data:
File System vs Core Data: the image cache test
http://biasedbit.com/blog/filesystem-vs-coredata-image-cache
Conclusion:
File system cache is, as expected, faster. Core Data falls shortly behind when storing (marginally slower) but load times are way higher when performing single random accesses.
For such a simple case Core Data functionality really doesn't pay up, so stick to the file system version.
I think you should store everything in one folder and create a hash table of key (file name) and value (source path) pairs. With a hash table the lookup complexity will be constant, O(1), and this will speed up your process as well.
The file system is not an optimal database. With that many thousands of files, you should consider using Core Data, or other database instead to store the name and contents of each file.

How are you mapping database records to physical files such as image uploads

37signals suggests ID partitioning to accomplish this:
http://37signals.com/svn/archives2/id_partitioning.php
Any suggestions would be more than welcome.
Thanks.
We use Paperclip for storing our files. It can do what you want pretty easily.
We use partitioning by date so an image uploaded today would end up in 2009/12/10/image_12345.jpg. The path is stored in the db for reference and the path to the image folder (the parent of 2009) is placed in some config file. If we need to change things later it makes it very easy.
You can map by virtually anything. We map by user in our designs, but it's an HR system so it makes sense (there's no way a user will have 32k file entries) and the files are clearly connected with the user. In the Media Library parts of the system, dividing by date or ID is more useful.
The catch is that you should store some part of the file path in the database table (as suggested before), whether it is a date or a user hash/name (often also divided, e.g. u/user, j/john, j/jo/john, etc.). Then you don't have to worry about changing the division scheme, as this will only require a database update.
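The two partitioning schemes mentioned in this thread can be sketched in a few lines. Both helper names are hypothetical, as is the `image_<id>` file naming (borrowed from the date example above). The ID scheme shown pads the ID to nine digits and splits it into three groups, which is how Paperclip's id_partition lays out paths; the scheme in the 37signals post may group digits differently.

```ruby
require 'date'

# Date partitioning: the relative path stored in the database is built
# from the upload date plus a file name derived from the record ID.
def date_partitioned_path(id, extension, date)
  File.join(date.strftime('%Y/%m/%d'), "image_#{id}.#{extension}")
end

# ID partitioning: zero-pad the ID to nine digits and split into groups of
# three, so no directory ever holds more than 1000 entries per level.
def id_partitioned_path(id, extension)
  # Assumption: IDs fit in nine digits; larger IDs would need wider padding.
  raise ArgumentError, 'id must be between 0 and 999_999_999' unless (0..999_999_999).cover?(id)

  File.join(*format('%09d', id).scan(/\d{3}/), "image_#{id}.#{extension}")
end
```

For example, `date_partitioned_path(12345, 'jpg', Date.new(2009, 12, 10))` gives `2009/12/10/image_12345.jpg`, and `id_partitioned_path(12345, 'jpg')` gives `000/012/345/image_12345.jpg`. Either result is the relative path to store in the database, with the root folder kept in configuration.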
