Organizing thousands of images on a server - storage

I'm developing a website which might grow up to a few thousand users, all of which would upload up to ten pictures on the server.
I'm wondering what would be the best way of storing pictures.
Let's assume that I have 5000 users with 10 pictures each, which gives us 50,000 pics. (I guess it wouldn't be a good idea to store them in the database in blobs ;) )
Would it be a good way to dynamically create directories for every 100 users registered, (50 dirs in total, assuming 5000 users), and upload their pictures there? Would naming convention 'xxx_yy.jpg' (xxx being user id and yy picture number) be ok?
In this case, however, there would be 1000 (100x10) pictures in one folder. Isn't that too many?

I would most likely store the images by a hash of their contents, a 128-bit hash for instance (MD5, or a SHA-2 digest truncated to 128 bits). So, I'd rename a user's uploaded image 'foo.jpg' to its 128-bit hash (probably in base 64, for uniform 22-character names) and then store the user's name for the file and its hash in a database. I'd probably also add a reference count. Then if several people upload the same image, it only gets stored once, and you can delete it when all references vanish.
As for actual physical storage, now that you have a guaranteed uniform naming scheme, you can use your file system as a balanced tree. You can either decide how many files maximum you want in a directory, and have a balancer move files to maintain this, or you can imagine what a fully populated tree would look like, and store your files that way.
The only real drawback to this scheme is that it decouples file names from contents so a database loss can mean not knowing what any file is called, but you should be careful to back up that kind of information anyway.
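As a rough illustration of the scheme above, here is a minimal Python sketch. The function names (content_name, shard_path, store), the choice of SHA-256 truncated to 128 bits, and the two-level directory layout are all assumptions made for the example, not something the answer prescribes. It uses URL-safe base64 so that names never contain '/'; on a case-insensitive file system a hex name would be the safer choice.

```python
import base64
import hashlib
from pathlib import Path


def content_name(data: bytes) -> str:
    """Return a 22-character, filesystem-safe name derived from the content."""
    digest = hashlib.sha256(data).digest()[:16]  # keep 128 bits (assumption)
    # URL-safe base64 avoids '/' in file names; strip the '==' padding.
    return base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")


def shard_path(root: Path, name: str, levels: int = 2, width: int = 2) -> Path:
    """Spread files over nested directories taken from the leading characters."""
    parts = [name[i * width:(i + 1) * width] for i in range(levels)]
    return root.joinpath(*parts, name)


def store(root: Path, data: bytes) -> Path:
    """Write the data once; identical uploads map to the same path."""
    path = shard_path(root, content_name(data))
    if not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)
    return path
```

The user's original file name and the reference count would live in the database next to the returned name; they are left out here.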

Different filesystems perform differently with directories holding large numbers of files. Some slow down tremendously. Some don't mind at all. For example, IBM JFS2 stores the contents of directory inodes as a B+ tree sorted by filename, so it probably provides O(log n) access time even in the case of very large directories.
Getting ls or dir to read, sort, get size/date info, and print them to stdout is a completely different task from accessing the file contents given the filename, so don't let the inability of ls to list a huge directory guide you.
Whatever you do, don't optimize too early. Just make sure your file access mechanism can be abstracted (make a FileStorage that you .getfile(id) from, or something...).
That way you can put in whatever directory structure you like, or for example if you find it's better to store these items as a BLOB column in a database, you have that option...
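A sketch of what such an abstraction might look like in Python. The class and method names here (FileStorage, put_file, get_file, and the two backends) are hypothetical; the point is only that callers depend on the interface, so the backend can be swapped later.

```python
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Dict


class FileStorage(ABC):
    """Interface the rest of the application talks to."""

    @abstractmethod
    def put_file(self, file_id: str, data: bytes) -> None: ...

    @abstractmethod
    def get_file(self, file_id: str) -> bytes: ...


class FlatDirectoryStorage(FileStorage):
    """Simplest backend: every file lives in one directory."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def put_file(self, file_id: str, data: bytes) -> None:
        (self.root / file_id).write_bytes(data)

    def get_file(self, file_id: str) -> bytes:
        return (self.root / file_id).read_bytes()


class InMemoryStorage(FileStorage):
    """Stand-in for another backend (e.g. a BLOB column); handy in tests."""

    def __init__(self) -> None:
        self._files: Dict[str, bytes] = {}

    def put_file(self, file_id: str, data: bytes) -> None:
        self._files[file_id] = data

    def get_file(self, file_id: str) -> bytes:
        return self._files[file_id]
```

Moving to a sharded directory tree or a database later means adding one more subclass, with no change to the calling code.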

Granted, I have never stored 50,000 images, but I usually just store all images in the same directory and name them as such to avoid conflict, then store the reference in the db.
$ext = pathinfo( $filename, PATHINFO_EXTENSION ); // explode() would return an array, not the extension
$newName = md5( microtime() ) . '.' . $ext;
That way you are very unlikely to get the same filename twice, since microtime() changes every microsecond; two uploads handled at the same instant could still collide, though.
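For comparison, a small Python sketch of the same idea that swaps the timestamp hash for a random UUID, so the name does not depend on the clock at all. Note this is a different technique from the md5(microtime()) shown in the answer, and unique_name is a made-up helper name.

```python
import os
import uuid


def unique_name(original_filename: str) -> str:
    """Return a random 32-hex-character name that keeps the original extension."""
    _, ext = os.path.splitext(original_filename)  # ext includes the leading dot
    return uuid.uuid4().hex + ext.lower()
```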


Cassandra Data storage: data directory space not equal to the space occupied

This is a beginner's question on Cassandra architecture.
I have a 3 node Cassandra cluster. The data directory is at $CASSANDRA_HOME/data/data. I've loaded a huge data set. I did a nodetool flush and then nodetool tablestats on the table I loaded the data into. This says the total space occupied is around 50GiB. I was curious and checked the size of my data directory with du $CASSANDRA_HOME/data/data on each of the nodes, which shows around 1-2GB on each. How could the data directory be smaller than the space occupied by a single table? Am I missing something? My table is created with replication factor 1.
du reports the actual storage used by the paths given to it. This is not always directly connected to the size of the data stored in these paths.
Two main effects make the output of du differ from any other storage usage information you might get (e.g. from Cassandra).
du might report a smaller number than expected for two reasons: ⓐ It combines hard links. This means that if the paths given to it contain hard-linked files (I won't explain hard links here, but this term is a fixed one for Unixish operating systems so it can be looked up easily), these are counted only once even though the files appear multiple times. ⓑ It is aware of sparse files; these are files which contain large (sometimes huge) areas of empty space (zero bytes). In many Unixish file systems these can be stored efficiently, depending on how they have been created.
du might report a larger number than expected because file systems have some overhead. To store a file of n bytes, n + h bytes need to be stored because of this. h depends on the file system and its configuration. The most important factor is that file systems typically store files in a block structure. If a file's size isn't an exact multiple of the file system's block size, the last block is still allocated completely to this file, so part of it is wasted. du will show the whole block as allocated because, in fact, it is.
So in your case Cassandra might talk about space occupied of 50GiB but a lot of it might be empty (never written-to) space. This might be stored in a sparse file on the file system which in fact only uses 2GiB of storage size (which du shows).
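The difference between apparent size and allocated size can be observed directly. The following Python sketch is Unix-only and assumes that st_blocks counts 512-byte units, as it does on Linux; the helper names are invented for the example. Whether the file really ends up sparse depends on the file system.

```python
import os
from typing import Tuple


def apparent_and_allocated(path: str) -> Tuple[int, int]:
    """Return (apparent size, bytes actually allocated on disk)."""
    st = os.stat(path)
    # st_size is what ls -l shows; st_blocks * 512 is roughly what du counts
    # (assumes 512-byte units, as on Linux).
    return st.st_size, st.st_blocks * 512


def make_sparse_file(path: str, size: int) -> None:
    """Create a file of the given length without writing any data to it."""
    with open(path, "wb") as f:
        f.truncate(size)  # extends the file; the hole is usually not allocated
```

On a file system with sparse-file support, a 10 MiB file created this way typically reports an apparent size of 10485760 bytes and an allocated size close to zero.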

How do content addressable storage systems deal with possible hash collisions?

Content addressable storage systems use the hash of the stored data as the identifier and the address. Collisions are incredibly rare, but if the system is used a lot for a long time, it might happen. What happens if there are two pieces of data that produce the same hash? Is it inevitable that the most recently stored one wins and data is lost, or is it possible to devise ways to store both and allow accessing both?
To keep the question narrow, I'd like to focus on Camlistore. What happens if permanodes collide?
It is assumed that collisions do not happen, which is a perfectly reasonable assumption given a strong hash function and casual, non-malicious user input. SHA-1, which is what Camlistore currently uses, was also considered resistant to malicious attempts to produce a collision when this was written, although a practical SHA-1 collision has since been demonstrated (SHAttered, 2017).
In case a hash function becomes weak with time and needs to be retired, Camlistore supports a migration to a new hash function for new blobrefs, while keeping old blob refs accessible.
If a collision did happen, as far as I understand, the first stored blobref with that hash would win.
source: https://groups.google.com/forum/#!topic/camlistore/wUOnH61rkCE
In an ideal collision-resistant system, when a new file / object is ingested:
A hash is computed of the incoming item.
If the incoming hash does not already exist in the store:
    The item data is saved and associated with the hash as its identifier.
If the incoming hash does match an existing hash in the store:
    The existing data is retrieved.
    A bit-by-bit comparison of the existing data with the new data is performed.
    If the two copies are found to be identical, the new entry is linked to the existing hash.
    If the two copies are not identical, the new data is either:
        Rejected, or
        Appended or prefixed* with additional data (e.g. a timestamp or userid) and re-hashed; this entire process is then repeated.
So no, it's not inevitable that information is lost in a content-addressable storage system.
* Ideally, the existing stored data would then be re-hashed in the same way, and the original hash entry tagged somehow (e.g. linked to a zero-byte payload) to notate that there were multiple stored objects that originally resolved to that hash (similar in concept to a 'Disambiguation page' on Wikipedia). Whether that is necessary depends on how data needs to be retrieved from the system.
While intentionally causing a collision may be astronomically impractical for a given algorithm, a random collision is possible as soon as the second storage transaction.
Note: Some small / non-critical systems skip the binary comparison step, trading risk for bandwidth or processing time. (Usually, this is only done if certain metadata matches, such as filename or data length.)
The risk profile of such a system (e.g. a single git repository) is far different than for an enterprise / cloud-scale environment that ingests large amounts of binary data, especially if that data is apparently random binary data (e.g. encrypted / compressed files) combined with something like sliding-window deduplication.
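The ingest procedure described in this answer can be sketched in a few lines of Python. This is an in-memory toy, not how Camlistore or any particular system works: the class name, the pluggable hash_fn parameter, and the counter-prefix used for re-hashing are assumptions made for illustration.

```python
import hashlib
from typing import Callable, Dict


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


class CollisionCheckedStore:
    """Toy content-addressed store that compares bytes before deduplicating."""

    def __init__(self, hash_fn: Callable[[bytes], str] = sha256_hex) -> None:
        self._hash = hash_fn  # injectable so a weak hash can be used in tests
        self._blobs: Dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        salt = 0
        key = self._hash(data)
        while key in self._blobs:
            if self._blobs[key] == data:
                return key  # identical content: link to the existing entry
            # Genuine collision: prefix a counter, re-hash, and repeat the check.
            salt += 1
            key = self._hash(str(salt).encode("ascii") + b":" + data)
        self._blobs[key] = data
        return key

    def get(self, key: str) -> bytes:
        return self._blobs[key]
```

With a deliberately weak hash such as `lambda d: str(len(d))`, storing b"aaa" and then b"bbb" yields two different keys and both values stay retrievable. The trade-off is that a re-hashed key is no longer the plain hash of the content, so callers have to keep the key that put() returns.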
See also, e.g.:
https://stackoverflow.com/a/2437377/5711986
Use a composite key, e.g. hash + userId.

Store a checksum or similar for a file, to tell easily if it's the same as another file

In our app we have a table called support_files which stores documents that have been uploaded, which are mostly PDFs.
I'd like to get a unique list of these files; often the same file is uploaded more than once. I thought that a way to do this would be to add a column to the database called "checksum", and then, for each file, calculate the checksum somehow and store it in the column. (This is obviously the slow part).
Once this is done then I can easily filter out duplicates from my table by examining the checksum column.
Can anyone recommend a method to generate this checksum/hash/whatever? Ideally I'd like to generate a hash/checksum that's large enough to guarantee uniqueness, but small enough to fit into a string field in my database.
My server's running on Ubuntu server, and the total number of files I need to checksum is currently around 12,000. For the sake of argument assume it won't grow over 100,000.
A bit of Googling reveals sha1sum, but this may be more suited to telling if a file has been accidentally changed rather than if two files are different?
Take a look at Digest::SHA256, it can interface directly with files and it works great.
From the referenced documentation:
require 'digest'
p Digest::SHA256.file("X11R6.8.2-src.tar.bz2").hexdigest
# => "f02e3c85572dc9ad7cb77c2a638e3be24cc1b5bea9fdbb0b0299c9668475c534"
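To show how the stored checksum would then be used to spot duplicates, here is a minimal Python sketch (the answer itself uses Ruby). The function names are invented; in the real app the digest would be saved in the "checksum" column and the grouping done with a database query.

```python
import hashlib
from collections import defaultdict
from pathlib import Path
from typing import Dict, Iterable, List


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB chunks so large PDFs never sit fully in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()  # 64 hex characters, fits a normal string column


def group_duplicates(paths: Iterable[Path]) -> Dict[str, List[Path]]:
    """Map each digest that occurs more than once to the files sharing it."""
    groups: Dict[str, List[Path]] = defaultdict(list)
    for p in paths:
        groups[file_sha256(p)].append(p)
    return {digest: ps for digest, ps in groups.items() if len(ps) > 1}
```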

Sorting 20GB of data

In the past I had to work with big files, somewhere about in the 0.1-3GB range. Not all the 'columns' were needed so it was ok to fit the remaining data in RAM.
Now I have to work with files in the 1-20GB range, and they will probably grow as time passes. That is totally different because you cannot fit the data in RAM anymore.
My file contains several million 'entries' (I have found one with 30 million entries). One entry consists of about 10 'columns': one string (50-1000 unicode chars) and several numbers. I have to sort the data by 'column' and show it. For the user only the top entries (1-30%) are relevant; the rest is low quality data.
So, I need some suggestions about which direction to head in. I definitely don't want to put the data in a DB because they are hard to install and configure for non-computer-savvy persons. I would like to deliver a monolithic program.
Showing the data is not difficult at all. But sorting... without loading the data in RAM, on regular PCs (2-6GB RAM)... will kill some good hours.
I was looking a bit into MMF (memory mapped files) but this article from Danny Thorpe shows that it may not be suitable: http://dannythorpe.com/2004/03/19/the-hidden-costs-of-memory-mapped-files/
So, I was thinking about loading only the data from the column that has to be sorted in ram AND a pointer to the address (into the disk file) of the 'entry'. I sort the 'column' then I use the pointer to find the entry corresponding to each column cell and restore the entry. The 'restoration' will be written directly to disk so no additional RAM will be required.
PS: I am looking for a solution that will work both on Lazarus and Delphi because Lazarus (actually FPC) has 64 bit support for Mac. 64 bit means more RAM available = faster sorting.
I think a way to go is Mergesort; it's a great algorithm for sorting a large amount of fixed records with limited memory.
General idea:
read N lines from the input file (a value that allows you to keep the lines in memory)
sort these lines and write the sorted lines to file 1
repeat with the next N lines to obtain file 2
...
you reach the end of the input file and you now have M files (each of which is sorted)
merge these files into a single file (you'll have to do this in steps as well)
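The steps above can be sketched in Python as follows (the question is about Delphi/Lazarus, so treat this purely as an outline of the algorithm). It assumes one record per line and sorts whole lines as text; external_sort and lines_per_run are made-up names. Unlike the last step above, it merges all runs in a single pass, which assumes the number of runs stays below the operating system's limit on open files.

```python
import heapq
import itertools
import os
import tempfile
from typing import List


def _write_run(lines: List[str]) -> str:
    """Sort one chunk in memory and write it to a temporary run file."""
    lines.sort()
    fd, path = tempfile.mkstemp(suffix=".run", text=True)
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        f.writelines(lines)
    return path


def external_sort(src: str, dst: str, lines_per_run: int = 100_000) -> None:
    runs: List[str] = []
    with open(src, encoding="utf-8") as f:
        while True:
            chunk = list(itertools.islice(f, lines_per_run))  # read N lines
            if not chunk:
                break
            if not chunk[-1].endswith("\n"):
                chunk[-1] += "\n"  # last line of the input may lack a newline
            runs.append(_write_run(chunk))

    files = [open(r, encoding="utf-8") for r in runs]
    try:
        with open(dst, "w", encoding="utf-8") as out:
            # heapq.merge reads the sorted runs lazily, one line at a time.
            out.writelines(heapq.merge(*files))
    finally:
        for fh in files:
            fh.close()
        for r in runs:
            os.remove(r)
```

Memory use is bounded by lines_per_run during the first phase and by one buffered line per run during the merge.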
You could also consider a solution based on an embedded database, e.g. Firebird embedded: it works well with Delphi/Windows and you only have to add some DLL in your program folder (I'm not sure about Lazarus/OSX).
If you only need a fraction of the whole data, scan the file sequentially and keep only the entries needed for display. For instance, let's say you need only 300 entries from 1 million. Scan the first 300 entries in the file and sort them in memory. Then for each remaining entry, check if it is lower than the lowest in memory and, if so, skip it. If it is higher than the lowest entry in memory, insert it into the correct place inside the 300 and throw away the lowest. This will make the second lowest the lowest. Repeat until end of file.
Really, there are no sorting algorithms that can make moving 30 GB of randomly sorted data fast.
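A minimal Python sketch of this single-pass selection, using a min-heap in place of the sorted 300-entry list described above (the heap keeps the lowest retained entry at the front, which is all the scan needs). top_k is an illustrative name; Python's built-in heapq.nlargest does essentially the same thing.

```python
import heapq
from typing import Callable, Iterable, List, Tuple, TypeVar

T = TypeVar("T")


def top_k(entries: Iterable[T], k: int, key: Callable[[T], float]) -> List[T]:
    """Return the k highest-scoring entries, best first, in one pass."""
    if k <= 0:
        return []
    heap: List[Tuple[float, int, T]] = []  # min-heap: lowest kept entry first
    for seq, entry in enumerate(entries):
        item = (key(entry), seq, entry)  # seq breaks ties between equal scores
        if len(heap) < k:
            heapq.heappush(heap, item)
        elif item[0] > heap[0][0]:
            heapq.heapreplace(heap, item)  # drop the lowest, keep the new one
        # otherwise the entry is lower than everything kept: skip it
    return [entry for _, _, entry in sorted(heap, reverse=True)]
```

Only k entries are ever held in memory, no matter how large the input file is.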
If you need to sort in multiple ways, the trick is not to move the data itself at all, but instead to create an index for each column that you need to sort.
I do it like that with files that are also tens of gigabytes long, and users can sort, scroll and search the data without noticing that it's a huge dataset they're working with.
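One way such an index might look, sketched in Python under the assumption of one entry per line: only the sort key and the byte offset of each entry are held in memory, and the entries themselves are read from disk on demand. build_index and read_sorted are hypothetical names.

```python
from typing import Callable, Iterator, List, Tuple


def build_index(path: str, key: Callable[[bytes], object]) -> List[Tuple[object, int]]:
    """Return (sort key, byte offset) pairs for every line, sorted by key."""
    index: List[Tuple[object, int]] = []
    with open(path, "rb") as f:
        while True:
            offset = f.tell()  # position where the next entry starts
            line = f.readline()
            if not line:
                break
            index.append((key(line), offset))
    index.sort()  # the data file itself is never rewritten
    return index


def read_sorted(path: str, index: List[Tuple[object, int]]) -> Iterator[bytes]:
    """Yield the entries in index order by seeking to each stored offset."""
    with open(path, "rb") as f:
        for _, offset in index:
            f.seek(offset)
            yield f.readline()
```

A separate index can be built per column, and showing the top of the list only requires reading the first few offsets.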
Please find here a class which sorts a file using a slightly optimized merge sort. I wrote that a couple of years ago for fun. It uses a skip list for sorting files in-memory.
Edit: The forum is German and you have to register (for free). It's safe but requires a bit of German knowledge.
If you cannot fit the data into main memory then you are into the realms of external sorting. Typically this involves external merge sort. Sort smaller chunks of the data in memory, one by one, and write back to disk. And then merge these chunks.

importing and processing data from a CSV File in Delphi

I had a pre-interview task, which I completed and the solution works; however, I was marked down and did not get an interview because I used a TADODataset. I basically imported a CSV file which populated the dataset. The data had to be processed in a specific way, so I used filtering and sorting of the dataset to make sure that the data was ordered in the way I wanted, and then I did the logic processing in a while loop. The feedback that was received said that this was bad as it would be very slow for large files.
My main question here is if using an in memory dataset is slow for processing large files, what would have been better way to access the information from the csv file. Should I have used String Lists or something like that?
It really depends on how "big" the file is and on the resources (in this case RAM) available for the task.
"The feedback that was received said that this was bad as it would be very slow for large files."
CSV files are usually used for moving data around (in most cases that I've encountered, files are ~1MB up to ~10MB, but that's not to say that others would not dump more data in CSV format) without worrying too much (if at all) about import/export, since the format is extremely simple.
Suppose you have an 80MB CSV file. That's a file you want to process in chunks; otherwise (depending on your processing) you can eat hundreds of MB of RAM. In this case what I would do is:
while dataToProcess do begin
  // step 1: read up to <X> lines from the file, where <X> is the maximum
  // number of lines you read in one go; if fewer lines remain (i.e. you're
  // down to 50 lines and X is 100), read those
  // step 2: process the information
  // step 3: generate output, database inserts, etc.
end;
In the above case, you're not loading 80MB of data into RAM, but only a few hundred KB, and the rest you use for processing, i.e. linked lists, dynamic insert queries(batch insert), etc.
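The same loop, written out as a runnable Python sketch rather than Delphi. read_in_chunks, process_file and the handle_chunk callback are illustrative names, and the chunk size of 100 rows simply mirrors the example above.

```python
import csv
import itertools
from typing import Callable, Iterator, List

Row = List[str]


def read_in_chunks(path: str, chunk_size: int = 100) -> Iterator[List[Row]]:
    """Step 1: yield at most chunk_size parsed rows at a time."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        while True:
            chunk = list(itertools.islice(reader, chunk_size))
            if not chunk:  # nothing left to read
                return
            yield chunk


def process_file(path: str, handle_chunk: Callable[[List[Row]], None],
                 chunk_size: int = 100) -> int:
    """Steps 2 and 3: hand each chunk to the caller; return the row count."""
    total = 0
    for chunk in read_in_chunks(path, chunk_size):
        handle_chunk(chunk)  # process, then write output / batch insert
        total += len(chunk)
    return total
```

Only one chunk of rows is alive at any time, so memory use stays roughly constant regardless of the file size.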
"...however I was marked down and did not get an interview due to having used a TADODataset."
I'm not surprised. They were probably looking to see whether you're capable of creating algorithms and providing simple solutions on the spot, without using "ready-made" solutions.
They were probably hoping to see you use dynamic arrays and write one (or more) sorting algorithms.
"Should I have used String Lists or something like that?"
The response might have been the same, again, I think they wanted to see how you "work".
The interviewer was quite right.
The correct, scalable and fastest solution on any medium file upwards is to use an 'external sort'.
An 'External Sort' is a 2 stage process, the first stage being to split each file into manageable and sorted smaller files. The second stage is to merge these files back into a single sorted file which can then be processed line by line.
It is extremely efficient on any CSV file with over say 200,000 lines. The amount of memory the process runs in can be controlled and thus dangers of running out of memory can be eliminated.
I have implemented many such sort processes and in Delphi would recommend a combination of TStringList, TList and TQueue classes.
Good Luck
