How are you mapping database records to physical files such as image uploads? (ruby-on-rails)

37signals suggests ID partitioning to accomplish this:
http://37signals.com/svn/archives2/id_partitioning.php
Any suggestions would be more than welcome.
Thanks.

We use Paperclip for storing our files. It can do what you want pretty easily.
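For reference, a hedged sketch of how a Paperclip attachment might be wired up (model and attachment names are illustrative); the :path option controls where the physical file lands, and Paperclip's built-in :id_partition interpolation implements exactly the 37signals-style scheme linked above:

class User < ActiveRecord::Base
  # :id_partition expands e.g. id 12345 to 000/012/345
  has_attached_file :avatar,
    :path => ":rails_root/public/system/:attachment/:id_partition/:style/:filename",
    :url  => "/system/:attachment/:id_partition/:style/:filename"
end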

We use partitioning by date, so an image uploaded today would end up in 2009/12/10/image_12345.jpg. The path is stored in the database for reference, and the path to the image folder (the parent of 2009) lives in a config file. If we need to change things later, this makes it very easy.
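As an illustrative sketch of that layout (UPLOAD_ROOT and the helper name are made up; the real root would come from the config file mentioned above):

require 'date'

UPLOAD_ROOT = '/var/uploads' # read from a config file in practice

def partitioned_path(filename, date = Date.today)
  File.join(UPLOAD_ROOT, date.strftime('%Y/%m/%d'), filename)
end

partitioned_path('image_12345.jpg', Date.new(2009, 12, 10))
# => "/var/uploads/2009/12/10/image_12345.jpg"

Only the "2009/12/10/image_12345.jpg" part would be stored in the database.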

You can map by virtually anything. We map by user in our designs, but it's an HR system, so that makes sense (there's no way a user will have 32k file entries) and the files are clearly connected with a user. For the media-library parts of the system, dividing by date or ID will be more useful.
The catch is that you should store some part of the file path in a database table (as suggested before), whether that is a date or a user hash/name (often also subdivided, e.g. u/user, j/john, j/jo/john, etc.). Then you don't have to worry about changing the division scheme later, as that will only require a database update. A minimal sketch of the name-based division follows.
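Here the helper name is hypothetical; only the returned fragment would be stored in the database, so the scheme can be changed later with a single update:

def user_partition(username)
  File.join(username[0, 1], username[0, 2], username)
end

user_partition('john') # => "j/jo/john"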

Uniquely identify files with same name and size but with different contents

We have a scenario in our project where files come from the client with the same file name, and sometimes the same file size too. Currently, when a file is uploaded, we check the new file name against the existing files in the database; if there is a match, we mark it as a duplicate and do not allow the upload at all. But now we have a requirement to check the content of files that have the same file name, so we need a way to differentiate such files based on their contents. How do we do that efficiently, avoiding even a minute chance of error?
Rails 3.1, Ruby 1.9.3
Below is one option I have read from a web reference.
require 'digest'
# Reads the entire file into memory, then hashes it:
digest_value = Digest::MD5.base64digest(File.read(file_path))
The above line reads the entire contents of the incoming file and generates a unique hash from them, right? Then we can use that hash for unique file identification. But we have more than 500 users working simultaneously, 24/7, and most of them will be doing this operation. So if an incoming file is large (> 25 MB), the digest will take a long time to read the whole contents, and performance will suffer. What would be a better solution considering all these facts?
I have read the question and the comments, and I have to say the problem is not stated quite correctly. It seems that what you need is to identify identical content. Period. Regardless of whether the name and size are equal. Correct me if I am wrong, but you likely don't want to let a user upload 100 duplicates of the same file just because they have 100 copies of it locally under different names.
So far, so good. I would use the following approach. The file name is not involved at all. The file size can serve as a fast uniqueness pre-check: if the sizes differ, the files are definitely different.
Then one might allow the upload with an instant "OK" response. Afterwards, the server should run Digest::MD5 in the background, comparing the file against everything already uploaded. If there is a duplicate, the new copy of the file should be removed, but its name should stay on the filesystem as a symbolic link to the original.
That way you won't frustrate users: they can keep as many copies of a file as they want under different names, while disk usage stays at the lowest possible level.

Solution for Monitoring and Maintaining an App's Size on Disk

I'm building an app that makes extensive use of Core Data, and a lot of my models have UIImage and NSData properties (for images and videos). Since it's not a great idea to store that data directly in Core Data, I built a file manager class that writes the files into different buckets in the documents directory, depending on the context in which each file was created and the media type.
My question now is: how do I manage the documents directory? Is there a way to detect how much space the app has used up out of its total allocated space? Additionally, what is the best way to go about cleaning those directories: do I check every time a file is written, or only on app launch, etc.?
Is there a way to detect how much space the app has used up out of its total allocated space?
Apps don't have a limit on total allocated space; they're limited by the amount of space on the device. You can find out how much space you're using for these files by using NSFileManager to scan the directories. There are several methods that do this in different ways; check out enumeratorAtPath:, for example. For each file, use a method like attributesOfItemAtPath:error: to get the file size.
Better would be to track the file sizes as you create and delete files. Keep a running total, stored in user defaults. When you create a new file, increase it by the amount of new data. When you remove a file, decrease the running total.
Additionally, what is the best way to go about cleaning those directories; do I check every time a file is written or only on app launch, etc.?
If these files are local data that's inherently part of the associated Core Data object, the sensible approach is to delete a file when its Core Data object is deleted. The managed object needs the data file, so don't delete the file if you still use the object. That means there must be some way to link the two, but I'm assuming that's already true since you say that these files are used by managed objects somehow.
If the files are something like cached data that's easily re-created or re-downloaded, you should put them in the location returned by NSTemporaryDirectory(). Then iOS can delete them when it thinks the space is needed. You can also clear out old files whenever it seems appropriate, by scanning for older files or ones that haven't been used in a while (the details depend on exactly how you use the files).

How efficient is the iOS file system at dealing with a large number of files in a single folder?

If I have a large number of files (n x 100K individual files), what would be the most efficient way to store them in the iOS file system, from the point of view of speed of access to a file by path? Should I dump them all into a single folder, or break them into a multilevel folder hierarchy?
Basically this breaks down into three questions:
1. Does file access time depend on the number of "sibling" files? (I think the answer is yes; if file names are organized into a B-tree, lookup should be O(log n).)
2. How expensive is traversing from one folder to another along the path? Is it something like m * O(log n_i), where m is the number of components in the path and n_i is the number of "siblings" at each path component?
3. What gets cached at the file system level that might make the above assumptions incorrect?
It would be great if someone with direct experience of this kind of problem could share some real-life results.
Your comments will be highly appreciated.
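One clarifying back-of-the-envelope point, under the B-tree assumption in the first question: splitting N files evenly across an m-level hierarchy does not change the asymptotics, since
m * O(log(N^(1/m))) = m * O((1/m) * log N) = O(log N),
the same as one flat folder with N files. What differs in practice are the constants and the caching behaviour raised in the third question.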
This seems like it might provide relevant, hard data:
File System vs Core Data: the image cache test
http://biasedbit.com/blog/filesystem-vs-coredata-image-cache
Conclusion:
File system cache is, as expected, faster. Core Data falls shortly behind when storing (marginally slower) but load times are way higher when performing single random accesses.
For such a simple case Core Data functionality really doesn't pay up, so stick to the file system version.
I think you should store everything in one folder and create a hash table with the file name as key and the source path as value. With a hash table, lookup complexity will be constant, O(1), which will speed up your process as well.
The file system is not an optimal database. With that many thousands of files, you should consider using Core Data, or another database, to store the name and contents of each file.

Serve my text from the filesystem instead of a database?

I am working on a content management application in which the data being stored in the database is extremely generic. In this particular instance a container has many resources, and those resources map to some kind of digital asset, whether that be a picture, a movie, an uploaded file, or even plain text.
I have been arguing with a colleague for a week now because, in addition to storing the pictures and so on, they would like to store the text assets on the file system and have the application look up the file location (from the database) and read in the text file (from the file system) before serving it to the client application.
Common sense seemed to scream at me that this was ridiculous: if we are bothering to look something up in the database, we might as well store the text in a database column and serve it along with the row lookup. Database lookup + file I/O sounded uncontrollably slower than just a database lookup. After going back and forth for some time, I decided to run some benchmarks, and found the results a little surprising. There seems to be very little consistency in the benchmark times. The only clear winner was pulling a large dataset from the database and iterating over the results to display the text assets; however, pulling objects one at a time from the database and displaying their text content seems to be neck and neck with the file system.
Now, I know the limitations of running benchmarks, and I am not sure I am even running the right kind of tests (for example, file system writes are ridiculously faster than database writes; I didn't know that!). I guess my question is for confirmation: is file I/O comparable to database text storage/lookup? Am I missing a part of the argument here? Thanks ahead of time for your opinions/advice!
A quick word about what I am using: this is a Ruby on Rails application, using Ruby 1.8.6 and SQLite3. I plan on moving the same codebase to MySQL tomorrow to see if the benchmarks are the same.
The major advantage you'll get from not using the filesystem is that the database will manage concurrent access properly.
Say two processes need to modify the same text at the same time: synchronisation on the filesystem may lead to race conditions, whereas you will have no problem at all with everything in the database.
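As a hedged illustration (Asset is a hypothetical ActiveRecord model with a body text column; Rails 3-style API assumed), the database serializes concurrent writers for you:

Asset.transaction do
  asset = Asset.lock.find(asset_id) # row lock: SELECT ... FOR UPDATE
  asset.body = new_text
  asset.save!
end

With files on disk you have to coordinate writers yourself, and every reader and writer must remember to take the lock:

File.open(path, 'r+') do |f|
  f.flock(File::LOCK_EX) # advisory lock; easy to forget somewhere
  f.truncate(0)
  f.write(new_text)
end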
I think your benchmark results will depend on how you store the text data in your database.
If you store it as a LOB, then behind the scenes it is stored in an ordinary file.
With any kind of LOB you pay the database lookup + file I/O anyway.
VARCHAR is stored in the tablespace
Ordinary text data types (VARCHAR et al.) are very limited in size in typical relational database systems: something like 2000 or 4000 characters (Oracle), sometimes 8000 or even 65536. Some databases support LONG text columns, but these have serious drawbacks and are not recommended.
LOBs are references to file system objects
If your text is larger you have to use a LOB data type (e.g. CLOB in Oracle).
LOBs usually work like this:
The database stores only a reference to a file system object.
The file system object contains the data (e.g. the text data).
This is very similar to what your colleague proposes, except the DBMS takes care of the heavy lifting of managing references and files.
The bottom line is:
If you can store your text in a VARCHAR, go for it.
If you can't, you have two options: use a LOB, or store the data in a file referenced from the database. Both are technically similar, and slower than using VARCHAR.
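In Rails terms, this is the difference between string and text columns: string maps to a size-limited VARCHAR, while text maps to the database's large-text type (TEXT/CLOB), which the engine may store out of row. A hedged sketch (Rails 3.1-style migration; table and column names invented):

class CreateTextAssets < ActiveRecord::Migration
  def change
    create_table :text_assets do |t|
      t.string :title, :limit => 255 # VARCHAR(255); lives in the row
      t.text   :body                 # TEXT/CLOB; may live outside the row
      t.timestamps
    end
  end
end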
I did this before. It's a mess: you need to keep the filesystem and the database synchronized all the time, which makes the programming more complicated, as you would guess.
My advice is to go for either an all-filesystem solution or an all-database solution, depending on the data. Notably, if you require lots of searches or conditional data retrieval, go for the database; otherwise the file system.
Note that databases may not be optimized for storing large binary files. And remember: if you use both, you're going to have to keep them synchronized, which doesn't make for an elegant or enjoyable (to program) solution.
Good luck!
At least if your problems come from the performance side, you could use a NoSQL storage solution like Redis (via Ohm, for example) or CouchDB...
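For example, a minimal Ohm sketch (model and attribute names invented); the text lives in Redis, keyed by id:

require 'ohm'

class TextAsset < Ohm::Model
  attribute :body
end

asset = TextAsset.create(:body => 'some text content')
TextAsset[asset.id].body # => "some text content"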

What is the best way to upload a user portrait?

Should it be stored in the database or on the file system?
And I need several different sizes, like 128x128, 96x96, 64x64, and so on.
What is the best way to upload a user portrait?
Not knowing all of your constraints, I'd upload the 128x128 picture and then create all the other portraits on the fly.
I don't think you need to worry about storing the images in the DB, especially if you're running SQL Server 2008 (and use the new FILESTREAM type).
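If you go the generate-on-the-fly route in a Rails app, here is a hedged sketch using the MiniMagick gem (an ImageMagick wrapper); the sizes, helper name, and the assumption that the original file name contains "128x128" are all illustrative:

require 'mini_magick'

PORTRAIT_SIZES = %w[96x96 64x64]

def portrait_variants(original_path)
  PORTRAIT_SIZES.each do |size|
    image = MiniMagick::Image.open(original_path) # opens a working copy
    image.resize(size)                            # e.g. "96x96"
    image.write(original_path.sub('128x128', size))
  end
end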
It definitely depends on the number of images you need to store.
If you store them on the file system, you just need to keep the URL or location of each image in the database. You can resize on the fly using components appropriate to your language; for .NET, ASPjpeg is a very good one, but you can also manipulate images with the System.Drawing.Imaging classes. There's no need to deal with database FILESTREAM or BLOB fields.
On the other hand, storing images in the database can make it too big to back up or download comfortably, depending on the number of records, even on SQL Server; but you do have everything in the same place, and maintenance is faster.
The problem with storing images in the file system is the cleanup procedure: if you delete a record, you need to delete the image from the file system too, to avoid accumulating garbage.
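In a Rails app, the usual way to avoid that garbage is a destroy callback; a hedged sketch (Portrait and image_path are invented names):

class Portrait < ActiveRecord::Base
  after_destroy :remove_image_file

  private

  # Delete the physical file when the database record is destroyed.
  def remove_image_file
    File.delete(image_path) if image_path && File.exist?(image_path)
  end
end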
