Prologue: I'm building a sort of CMS/social networking service that will host many images.
I'm intending to use Eucalyptus/Amazon S3 to store the images and was wondering about the significance of the seemingly random file names used by sites like Tumblr, Twitter, etc., e.g.
31.media.tumblr.com/d6ba16060ea4dfd3c67ccf4dbc91df92/tumblr_n164cyLkNl1qkdb42o1_500.jpg
and
pbs.twimg.com/media/Bg7B_kBCMAABYfF.jpg
How do they generate these strings, and what benefits does this have over just incrementing an integer for each file name? Are they just random characters? A hashed integer?
Thanks!
Twitter uses an ID-generation scheme called 'snowflake'. There is source on GitHub.
The basic format encodes a timestamp (42 bits), a data center id (5 bits), a worker id (the machine at that data center; 5 bits), and a per-machine sequence number (the remaining 12 bits).
For tweet IDs, they write the value as a long decimal number. Tweet ID '508285932617736192' is hex value '070DCB5CDA022000'. The first 42 bits are the timestamp (the time_t value is 070DCB5C plus the epoch, 1291675244). The next five bits are the data center id (in this case '1'), and the next five bits are the worker id ('2').
For images, they do exactly the same thing but write the value in base64 (using the URL-safe alphabet from RFC 4648, in which the last two characters are hyphen and underscore).
BwjA8nCCcAAy5zA.jpg decodes as 2014-09-02 20:23:58 GMT, data center #1, worker #7
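For illustration, here is a minimal Python sketch that decodes an ID using the commonly documented snowflake layout (41-bit millisecond timestamp since Twitter's published epoch, then data center, worker, and sequence); it reproduces the data center and worker values above:

from datetime import datetime, timezone

TWITTER_EPOCH_MS = 1288834974657  # Twitter's published snowflake epoch (2010-11-04)

def decode_snowflake(snowflake_id):
    # Layout: 41-bit ms timestamp | 5-bit data center | 5-bit worker | 12-bit sequence
    ts_ms = (snowflake_id >> 22) + TWITTER_EPOCH_MS
    return {
        "time": datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc),
        "data_center": (snowflake_id >> 17) & 0x1F,
        "worker": (snowflake_id >> 12) & 0x1F,
        "sequence": snowflake_id & 0xFFF,
    }

print(decode_snowflake(508285932617736192))
# 2014-09-06 16:09:44 UTC, data center 1, worker 2, sequence 0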
This is a way to organize media and to guarantee that a file will not get written over by another file with the same name. For example, if Twitter had a million photos in its pbs.twimg.com/media/ directory and two of them were named cat.jpg, Twitter would run into an issue uploading the second file, or retrieving a file when two share its name. As a result, Twitter (among other applications) prevents the database from mixing those two files up by renaming each file after compressing it to a much more specific name: a set of numbers, letters, and symbols that may seem random but is incrementally generated.
In your CMS, I suggest creating some sort of failsafe to prevent two files from clashing, whether one is about to overwrite another on upload or you are retrieving a file whose name is shared by another. You can do this in a few different ways. One method is the one just described: rename each file with an auto-incrementing scheme. Do not generate these file names in an obvious pattern, though, or all of your media becomes trivially enumerable through the address bar; this is another reason the URLs are not readable.
You can also use the file_exists() function in your uploader. This is a PHP function that checks whether a file with a certain name already exists in a certain directory; see the PHP manual for details.
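A minimal sketch combining both ideas, in Python (the directory and token length are placeholder assumptions, not anyone's production code):

import os
import secrets

UPLOAD_DIR = "/var/www/media"  # hypothetical upload directory

def unique_filename(extension=".jpg"):
    # Draw a random URL-safe name and re-roll in the vanishingly rare
    # case that it already exists -- the same check file_exists() does.
    while True:
        name = secrets.token_urlsafe(12) + extension  # 96 random bits
        if not os.path.exists(os.path.join(UPLOAD_DIR, name)):
            return name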
Hope this helps.
My guess about the tumblr file naming scheme is as follows:
d6ba16060ea4dfd3c67ccf4dbc91df92 - hash of the image file, might be MD5 or SHA-1
tumblr_n164cyLkNl1qkdb42o1_500.jpg - several parts:
- tumblr_ - obvious prefix to advertise the site
- n164cyLkNl1qkdb42o - consists of 2 parts: 10 characters before the '1' and 7 after
  - n164cyLkNl - some kind of hash of the post ID that the image belongs to; might be a Base64 value with a custom alphabet
  - qkdb42o - hash of the tumblr blog name
- then goes the number, in this case '1' - the index of the image in a photoset (just '1' for a single photo)
- finally, _500 - the maximum width of the image in pixels
Source: I have collected quite a lot of images and tags from tumblr, and the pattern turned out to be obvious. You can see that the tagging style is consistent for a given blog-name hash, while the tags of posts sharing the same post-ID hash are 100% identical.
Now, if only there were a way to decode those hashes back to their original values (assuming they're not actually hashes but encoded values, which is unlikely).
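If the guess above is right, such filenames can be split mechanically. A purely illustrative Python sketch that encodes only the guessed layout (the group names are mine; nothing here is a documented Tumblr format):

import re

GUESSED = re.compile(
    r"tumblr_(?P<post_hash>\w{10})1(?P<blog_hash>\w{7})"
    r"(?P<photoset_index>\d+)_(?P<width>\d+)\.(?P<ext>\w+)$"
)

m = GUESSED.match("tumblr_n164cyLkNl1qkdb42o1_500.jpg")
print(m.groupdict() if m else "no match")
# {'post_hash': 'n164cyLkNl', 'blog_hash': 'qkdb42o',
#  'photoset_index': '1', 'width': '500', 'ext': 'jpg'}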
Related
I am building something like an image hosting service. An image is identified by a user ID/image ID pair; both are 20 characters long.
I would like to use this pair as a shareable image URL, but I do not want it to be too obvious, e.g. two images from the same user sharing a suffix or prefix.
Basically, I am doing something similar to JWT, but the length of the encrypted string is a real concern.
Is there a good algorithm for this?
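One hedged sketch, assuming you are willing to keep a server-side lookup table: derive the token with a keyed hash, which makes it unguessable without the key and gives two images from the same user no common prefix or suffix. Since a hash is not reversible, you store the token-to-pair mapping yourself. The key and token length below are placeholders:

import hashlib

SECRET_KEY = b"change-me-server-side"  # hypothetical secret key

def share_token(user_id, image_id, digest_size=9):
    # Keyed BLAKE2 over the pair; 9 bytes -> an 18-character hex token.
    h = hashlib.blake2b((user_id + "/" + image_id).encode(),
                        key=SECRET_KEY, digest_size=digest_size)
    return h.hexdigest()  # store token -> (user_id, image_id) in the DB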
Take this hash for example:
ba7816bf 8f01cfea 414140de 5dae2223 b00361a3 96177a9c b410ff61 f20015ad
It's too long for my purposes so I intend to use a small chunk from it, such as:
ba7816bf8f01
ba7816bf
Or similar. My intended use case:
Video gallery on a website, represented by thumbnails. They are in random order.
They play in the lightbox. They don't have a unique ID, only their URL is unique.
While the lightbox is open I add something to the end of the page URL with JS History API.
//example.com/video-gallery/lightbox/ba7816bf8f01
The suffix needs to be short and simple, definitely not a URL.
People share the URL.
The server can make sense of the lightbox/ba7816bf8f01 in relation to /video-gallery.
Visiting the URL, the lightbox needs to find which video the suffix belongs to and play it.
I thought I'd SHA256 the URL of the video, use the first few characters as an ad-hoc ID. How many characters should I use from the generated hash, to considerably reduce the chance of collision?
I got the idea from URLs and Hashing by Google.
The Wikipedia page on birthday attacks has a table with the number of entries you need to produce a certain chance of collision with a certain number of bits as a random identifier. If you want to have a one in a million chance of a collision and expect to store a million documents, for example, you’ll need fewer than 64 bits (16 hex characters).
Base64 is a good way to fit more bits into the same length of string compared to hex, too, taking 1⅓ characters per byte instead of 2.
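A quick sketch of both halves in Python, truncating SHA-256 and checking the birthday bound (the figures assume a million stored video URLs):

import hashlib
import math

def short_id(url, hex_chars=16):
    # 16 hex characters keep 64 bits of the SHA-256 digest
    return hashlib.sha256(url.encode()).hexdigest()[:hex_chars]

def collision_probability(n_items, bits):
    # Birthday approximation: p = 1 - exp(-n(n-1) / 2^(bits+1))
    return 1 - math.exp(-n_items * (n_items - 1) / (2 * 2**bits))

print(short_id("https://example.com/videos/intro.mp4"))
print(collision_probability(1_000_000, 48))  # ~1.8e-3: 12 hex chars is risky
print(collision_probability(1_000_000, 64))  # ~2.7e-8: 16 hex chars is comfortable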
I maintain a client-server DMS written in Delphi/SQL Server.
I would like to allow users to search for a string inside all the documents stored in the DB (files are stored as blobs, zipped to save space).
My idea is to index them on check-in: as I store a new file, I extract all the text in it and put it in a new DB field. So my files table will be something like:
ID_FILE integer
ZIPPED_FILE blob
TEXT_CONTENT text field (nvarchar in sql server)
I would like to support "indexing" of at least the most common text-like files, such as pdf, txt, rtf, doc, docx, and maybe also xls, xlsx, ppt, and pptx.
For MS Office files I can use ActiveX, since I already do that in my application; for txt files I can simply read the file. But what about pdf and odt?
Could you suggest the best technique, or even a third-party component (paid is fine), that parses all these file types with "no fear"?
Thanks
Searching documents this way would be very slow and inconvenient to use. I'd advise producing two additional tables instead of the TEXT_CONTENT field.
When you parse the text, you should extract the valuable words and standardize them (see the sketch at the end of this answer) so that you:
- get rid of lower/upper-case differences;
- get rid of characters that might be used interchangeably (e.g. in Turkish we have the ç character, which might be entered as c);
- get rid of common words (stop words) in the language you are dealing with: from "Thing I am looking for", only "Thing" and "looking" might be of interest;
- get rid of whatever other problems you face.
Each word that already has an entry in the string_search table should reuse the ID already assigned there.
The records may look like this:
original_file_table
zip_id number
zip_file blob
string_search
str_id number
standardized_word text (or any string type with an appropriate secondary index)
file_string_reference
zip_id number
str_id number
I hope this gives you an idea of what I have in mind.
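A rough Python sketch of the standardization step (the stop-word list and character folding are illustrative assumptions, not a complete solution):

import re

STOP_WORDS = {"i", "am", "for", "the", "a", "an"}  # illustrative stop words
FOLD = str.maketrans("çğıöşü", "cgiosu")           # Turkish-style folding

def standardize(text):
    # Lower-case, fold interchangeable characters, drop stop words.
    words = re.findall(r"\w+", text.lower().translate(FOLD))
    return {w for w in words if w not in STOP_WORDS}

print(standardize("Thing I am looking for"))  # {'thing', 'looking'}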
Your major problem is that zipping the files before putting them into the database as blobs makes them unsearchable by the database itself. I would suggest the following.
Don't zip files you put in the database. Disk space is cheap.
You can write a query like this as long as you save the files in a text field.
Select * from MyFileTable Where MyFileData like '%Thing I am looking for%'
This is slow, but it will work, because the text in most of those file types is stored as plain text rather than binary (though some of the newer file types are binary).
The other alternative is to use an indexing engine such as Apache Lucene or Apache Solr, which will, as you put it, parse "with no fear" all file types.
I am working with some clickstream data and need to give the vendor specifications for a preferred format to be consumed by SSIS.
Since it's URL data in a text file, which column delimiter would you recommend? I was thinking pipe "|", but I realize that pipes can be used within a URL.
I did some testing with a multi-character delimiter like |^|, but when creating a flat file connection there is no such option in SSIS; I had to type the characters in. And when I went back to edit the flat file connection manager, it had changed to {|}^{|}. That made me nervous, even though the import succeeded.
I just wanted to see if anybody has good ideas about what would be a safe column delimiter to use.
Tab-delimited would probably be fairly safe, at least assuming that by "clickstream" you mean a list of URLs or something similar, since a literal tab is not valid inside a URL. But in theory any delimiter should be fine as long as the supplier quotes the data appropriately.
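If you can also ask the vendor to quote fields, a standard CSV/TSV writer and reader will round-trip even awkward values. A small Python sketch (the column layout is invented for illustration):

import csv
import io

rows = [["2014-09-02T20:23:58Z", "https://example.com/?q=a\tb|c", "click"]]

buf = io.StringIO()
csv.writer(buf, delimiter="\t", quoting=csv.QUOTE_MINIMAL).writerows(rows)

buf.seek(0)
print(list(csv.reader(buf, delimiter="\t")))
# The embedded tab survives because the writer quoted that field.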
In the 90s there was a toy called Barcode Battler. It scanned barcodes and, from the values, generated an RPG-like monster with various stats such as hit points, attack power, and magic power. Could there be a way to do a similar thing with a URL: from just an ordinary URL, generate stats like that? I was thinking of maybe taking the ASCII values of various characters in the URL and using them, but this seems too predictable and obvious.
Take the MD5 sum of the ASCII encoding of the URL? It's incredibly easy to do on most platforms, and it gives you 128 bits to come up with the stats from. If you want more, use a longer hash algorithm.
(I can't remember the details about what's allowed in a URL - if non-ASCII is allowed, you could use UTF-8 instead.)
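A sketch of that idea in Python (the stat names and ranges are invented; only the hashing step comes from the suggestion above):

import hashlib

def url_stats(url):
    digest = hashlib.md5(url.encode("utf-8")).digest()  # 16 bytes = 128 bits
    return {
        "hit_points": 50 + digest[0] % 200,  # hypothetical stat formulas
        "attack": 1 + digest[1] % 100,
        "magic": 1 + digest[2] % 100,
        "defense": 1 + digest[3] % 100,
    }

print(url_stats("https://example.com/"))  # the same URL always yields the same monster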