How unique are the first 8-12 characters of SHA256 hashes?

Take this hash for example:
ba7816bf 8f01cfea 414140de 5dae2223 b00361a3 96177a9c b410ff61 f20015ad
It's too long for my purposes so I intend to use a small chunk from it, such as:
ba7816bf8f01
ba7816bf
Or similar. My intended use case:
Video gallery on a website, represented by thumbnails. They are in random order.
They play in the lightbox. They don't have a unique ID, only their URL is unique.
While the lightbox is open I add something to the end of the page URL with JS History API.
//example.com/video-gallery/lightbox/ba7816bf8f01
The suffix needs to be short and simple, definitely not a URL.
People share the URL.
The server can make sense of the lightbox/ba7816bf8f01 in relation to /video-gallery.
Visiting the URL, the lightbox needs to find which video the suffix belongs to and play it.
I thought I'd SHA256 the URL of the video, use the first few characters as an ad-hoc ID. How many characters should I use from the generated hash, to considerably reduce the chance of collision?
I got the idea from URLs and Hashing by Google.
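(For illustration, a minimal Python sketch of the truncation idea described above; the function name, the 12-character default, and the example URL are illustrative assumptions, not from the question:)

    import hashlib

    def short_video_id(video_url, length=12):
        # Truncate the SHA-256 hex digest of the video URL to a short, stable ID.
        return hashlib.sha256(video_url.encode("utf-8")).hexdigest()[:length]

    # The same URL always maps to the same suffix.
    print(short_video_id("https://example.com/videos/cat.mp4"))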

The Wikipedia page on birthday attacks has a table with the number of entries you need to produce a certain chance of collision with a certain number of bits as a random identifier. If you want to have a one in a million chance of a collision and expect to store a million documents, for example, you’ll need fewer than 64 bits (16 hex characters).
Base64 is a good way to fit more bits into the same length of string compared to hex, too, taking 1⅓ characters per byte instead of 2.
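To make that concrete, here is the usual birthday approximation, p ≈ 1 - e^(-n(n-1)/2^(b+1)), as a quick Python sketch (the function name is mine):

    import math

    def collision_probability(n_items, bits):
        # Birthday bound: probability of at least one collision among
        # n_items identifiers drawn uniformly from 2**bits values.
        return -math.expm1(-n_items * (n_items - 1) / 2 / 2**bits)

    # A million documents with 64-bit (16 hex character) IDs:
    print(collision_probability(10**6, 64))  # ~2.7e-8, well under one in a million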

Related

Encrypt/hash 2 strings in a single one that is as short as possible

I am building something like image hosting. Image is identified by user ID/image ID pair. Both are 20 chars long.
I would like to use this pair as a shareable image URL, but I do not want it to be too obvious, such as two images from the same user having the same suffix or prefix.
Basically, I am doing something similar to JWT but the encrypted string length is a real concern.
Is there some good algorithm for this?
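The thread doesn't settle on an algorithm, but one plausible sketch (the key, names, and sizes here are all assumptions) is a keyed hash of the pair, truncated and base64url-encoded, so IDs from the same user share no visible prefix or suffix:

    import hashlib, base64

    SECRET_KEY = b"keep-this-on-the-server"  # hypothetical secret

    def opaque_image_token(user_id, image_id):
        # Keyed BLAKE2s of the pair; 9 bytes -> exactly 12 base64url characters.
        mac = hashlib.blake2s((user_id + "/" + image_id).encode("utf-8"),
                              key=SECRET_KEY, digest_size=9).digest()
        return base64.urlsafe_b64encode(mac).decode("ascii")

    # The token is one-way, so the server keeps a token -> (user, image) lookup.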

How to append a long list of ID's to a URL

I have an application that has a list of IDs as part of the URL, but soon the number of IDs is going to increase drastically. What is the best method to put a large number of IDs in the URL? There can be 200 or more IDs at a time.
You could encode your ID array in a string (JSON is an easy format for that) and transmit it as a single variable via POST; a sketch follows the links below.
Simple GET parameters, and even the URL itself, have limits on their length that cannot be avoided. Most web servers also have security filters in place that won't accept more than a certain number of parameters (Suhosin, for example).
See:
What is the maximum length of a URL in different browsers?
What is apache's maximum url length?
http://www.suhosin.org/stories/configuration.html
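A minimal sketch of the JSON-over-POST suggestion (the endpoint URL is made up):

    import json
    import urllib.request

    ids = list(range(1, 201))  # e.g. 200 IDs, too many for a GET query string

    body = json.dumps({"ids": ids}).encode("utf-8")
    req = urllib.request.Request("https://example.com/lookup", data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:  # one POST variable instead of 200 GET parameters
        print(resp.status)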

Image URL Naming Scheme

Prologue: I'm building a sort of CMS/social networking service that will host many images.
I'm intending on using Eucalyptus/Amazon S3 to store the images and was wondering about the significance of the seemingly-random file-names used by sites like Tumblr, Twitter, &c., e.g.
31.media.tumblr.com/d6ba16060ea4dfd3c67ccf4dbc91df92/tumblr_n164cyLkNl1qkdb42o1_500.jpg
and
pbs.twimg.com/media/Bg7B_kBCMAABYfF.jpg
How do they generate these strings, and what benefits does this have over just incrementing an integer for each file name? Maybe just random characters? Maybe hashing an integer?
Thanks!
Twitter uses an encoding method called 'snowflake'. The source is on GitHub.
The basic format encodes a timestamp (42 bits), data center id (5 bits), and worker id (the computer at the data center; 5 bits)
For tweet IDs, they write the value as a long decimal number. Tweet ID '508285932617736192' is hex value '070DCB5CDA022000'. The first 42 bits are the timestamp (the time_t value is 070DCB5C + the epoch, 1291675244). The next five bits are the data center (in this case, '1'), and the next five bits are the worker id ('2').
For images, they do the exact same thing, but use base64 encoding (following the RFC 4648 standard for URL encoding; the last two base64 characters are hyphen and underscore).
BwjA8nCCcAAy5zA.jpg decodes as 2014-09-02 20:23:58 GMT, data center #1, worker #7
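Following the layout described above, a decoding sketch in Python (the millisecond epoch constant is the one published in Twitter's snowflake source; treating the low 12 bits as a per-millisecond sequence is my assumption):

    EPOCH_MS = 1288834974657  # Twitter's snowflake epoch, in milliseconds

    def decode_snowflake(snowflake_id):
        # Peel the fields off from the low bits up.
        sequence    = snowflake_id         & 0xFFF  # low 12 bits
        worker      = (snowflake_id >> 12) & 0x1F   # next 5 bits
        data_center = (snowflake_id >> 17) & 0x1F   # next 5 bits
        timestamp   = (snowflake_id >> 22) + EPOCH_MS
        return timestamp, data_center, worker, sequence

    # Tweet ID from the example above -> (ms timestamp, data center 1, worker 2, sequence 0)
    print(decode_snowflake(508285932617736192))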
This is a way to organize media and to guarantee that a file will never be overwritten by another file with the same name. For example, if Twitter had a million photos in its pbs.twimg.com/media/ directory and two of them were named cat.jpg, Twitter would run into trouble uploading the second file or retrieving one where two share a name. To keep the database from mixing those two files up, Twitter (among other applications) renames each file after compressing it to a name with much more specificity: a set of numbers, letters, and symbols that may seem random but is incrementally generated.
In your CMS, I suggest creating some sort of failsafe to prevent two files from clashing, whether one is trying to overwrite another on upload or you are retrieving one of two files that share a name. You can do this in a few different ways. One method is as I just described: rename the file and create a system that auto-increments file names. Do not generate these file names in an obvious pattern, because then all media would be easily accessible through the address bar. This is another reason why the URLs are not very readable.
You can also apply the file_exists() function in your uploader. This is a PHP function that checks whether a file with a certain name already exists in a certain directory; see the PHP documentation for details.
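The same guard, sketched in Python rather than PHP (a random name plus an existence check; the helper name is mine):

    import os, secrets

    def safe_upload_path(upload_dir, ext=".jpg"):
        # Pick a random, hard-to-guess name; retry on the (unlikely) collision,
        # which is the same check file_exists() gives you in PHP.
        while True:
            path = os.path.join(upload_dir, secrets.token_urlsafe(12) + ext)
            if not os.path.exists(path):
                return path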
Hope this helps.
My guess about the tumblr file naming scheme is as follows:
d6ba16060ea4dfd3c67ccf4dbc91df92 - hash of the image file, might be MD5 or SHA-1.
tumblr_n164cyLkNl1qkdb42o1_500.jpg - several parts:
tumblr_ - obvious prefix to advertise the site.
n164cyLkNl1qkdb42o - consists of two parts, 10 characters before the '1' and 7 after:
n164cyLkNl - some kind of hash of the post ID the image belongs to; might be a custom-alphabet Base64 value.
qkdb42o - hash of the tumblr blog name.
Then comes the number, in this case '1' - the # of the image in a photoset (just '1' for a single photo).
Finally, _500 - maximum width of the image in pixels.
Source: I have collected quite a lot of images and tags from tumblr, and the pattern turned out to be obvious. You can see that the tagging manner is the same for the same blog-name hash, while tags of posts with the same post-ID hash are 100% identical.
Now, if only there were a way to decode those hashes back to the original values (assuming they're not actually hashes but encoded values, which is unlikely).

If you know the length of a string and apply a SHA1 hash to it, can you unhash it?

Just wondering if knowing the original string length means that you can better unhash a SHA1 hash.
No, not in the general case: a hash function is not an encryption function and it is not designed to be reversible.
It is usually impossible to recover the original input for certain. This is because the domain of a hash function is larger than its range. For SHA-1 the domain is unbounded, but the range is 160 bits.
That means that, by the pigeonhole principle, multiple values in the domain map to the same value in the range. When two such values map to the same hash, it is called a hash collision.
However, for a specific limited set of inputs (where the domain of the inputs is much smaller than the range of the hash function), if a hash collision is found, such as through a brute-force search, it may be "acceptable" to assume that the input causing the hash was the original value. This process is effectively a preimage attack. Note that it very quickly becomes infeasible, as demonstrated at the bottom. (There are likely some nice math formulas that can define "acceptable" in terms of chance of collision for a given domain size, but I am not that savvy.)
The only way to know that this was the only input that mapped to the hash, however, would be to perform an exhaustive search over all the values in the domain -- such as all strings of the given length -- and ensure that it was the only such input that resulted in the given hash value.
Do note, however, that in no case is the hash process "reversed". Even without the pigeonhole principle in effect, SHA-1 and other cryptographic hash functions are specifically designed to be infeasible to reverse -- that is, they are "one way" hash functions. There are some advanced techniques which can be used to reduce the search space for various hashes; these are best left to Ph.D.s or people who specialize in cryptanalysis :-)
Happy coding.
For fun, try creating a brute-force preimage attack on a string of 3 characters. Assuming only English letters (A-Z, a-z) and numbers (0-9) are allowed, there are "only" 62³ (238,328) combinations in this case. Then try on a string of 4 characters (62⁴ = 14,776,336 combinations) ... 5 characters (62⁵ = 916,132,832 combinations) ... 6 characters (62⁶ = 56,800,235,584 combinations) ...
Note how much larger the domain is for each additional character: this approach quickly becomes impractical (or "infeasible") and the hash function wins :-)
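Here's what that brute-force experiment looks like in Python (a sketch, trying all 62³ three-character candidates):

    import hashlib
    from itertools import product
    from string import ascii_letters, digits

    def brute_force_sha1(target_hex, length=3):
        # Try every alphanumeric string of the given length (62**length of them)
        # until one hashes to the target.
        for chars in product(ascii_letters + digits, repeat=length):
            candidate = "".join(chars)
            if hashlib.sha1(candidate.encode("ascii")).hexdigest() == target_hex:
                return candidate
        return None

    print(brute_force_sha1(hashlib.sha1(b"abc").hexdigest()))  # finds "abc"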
One way password crackers speed up preimage attacks is to use rainbow tables (which may only cover a small set of all values in the domain they are designed to attack), which is why passwords that use hashing (SHA-1 or otherwise) should always have a large random salt as well.
Hash functions are one-way functions. For a given hash there are many strings that may have produced it.
Now, if you know that the input size is fixed and small enough, let's say 10 bytes, and you know that each byte can take only certain values (for example ASCII's A-Za-z0-9), then you can use that information to precompute all the possible hashes and find which plaintext produces the hash you have. This technique is the basis for rainbow tables.
If this were possible, SHA1 would not be that secure, would it? So no, you cannot, unless you have considerable computing power (around 2^80 operations), in which case you don't need to know the length either.
One of the basic properties of a good cryptographic hash function, of which SHA1 happens to be one, is:
it is infeasible to generate a message that has a given hash
Theoretically, let's say the string was also known to be solely of ASCII characters, and it's of size n.
There are 95 characters in ASCII not including controls. We'll assume controls weren't used.
There are 95ⁿ possible such strings.
There are 1.461501×10⁴⁸ possible SHA-1 values (give or take), and at just n=25 there are 2.7739×10⁴⁹ possible ASCII-only strings without controls in them, which means guaranteed collisions (some such strings have the same SHA-1).
So, we only need to get to n=25 before recovering the string for certain becomes impossible, even with infinite resources and time.
And remember, up until now I've been making it deliberately easy with my ASCII-only rule. Real-world modern text doesn't follow that.
Of course, only a subset of such strings would be anything likely to be real (if one says "hello my name is Jon" and the other is "fsdfw09r12esaf", then it was probably the first). Still, up until now I was assuming infinite time and computing power. If we want to work it out sometime before the universe ends, we can't assume that.
Of course, the nature of the attack is also important. In some cases I want to find the original text, while in others I'll be happy with gibberish with the same hash (if I can input it into a system expecting a password).
Really though, the answer is no.
I posted this as an answer to another question, but I think it is applicable here:
SHA1 is a hashing algorithm. Hashing is one-way, which means that you can't recover the input from the output.
(The answer illustrated this with the classic hash-function diagram: a handful of names hashed into two-digit buckets.)
As you can see there, both John Smith and Sandra Dee are mapped to 02. This means that you can't recover which name was hashed given only 02.
Hashing is used basically due to this principle:
If hash(A) == hash(B), then there's a really good chance that A == B. Hashing maps large data sets (like a whole database) to a tiny output, like a 10-character string. If you move the database and the hash of both the input and the output are the same, then you can be pretty sure that the database is intact. It's much faster than comparing both databases byte-by-byte.
That is what the diagram shows: the long names are mapped to 2-digit numbers.
To adapt this to your question: with a brute-force search, for a string of a given length (say length l) you will have to compute (dictionary size)^l hashes.
If the dictionary consists of only alphanumeric case-sensitive characters, that's (10 + 26 + 26)^l = 62^l candidate strings to hash. I'm not sure how many FLOPs are required to produce one hash (it depends on the input length). Let's be super-unrealistic and say it takes 10 FLOPs to perform one hash.
For a 12-character password, that's 62^12 ≈ 3×10^21 hashes, so about 3×10^22 FLOPs -- days of computation on the fastest supercomputer to date.
Multiply that by a few thousand and you'll see that it is infeasible if I increase my dictionary size a little bit or make my password longer.
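The arithmetic, spelled out (the ~10^17 FLOPS figure for a top supercomputer is a rough assumption):

    candidates = 62 ** 12         # ~3.2e21 twelve-character strings
    total_flop = candidates * 10  # the deliberately generous 10-FLOPs-per-hash guess
    flops = 1e17                  # ~100 PFLOPS, a round top-supercomputer figure
    print(total_flop / flops)     # ~3.2e5 seconds, i.e. days of computation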

Method for generating numerical values from a URL

In the 90s there was a toy called Barcode Battler. It scanned barcodes and, from the values, generated an RPG-like monster with various stats such as hit points, attack power, magic power, etc. Could there be a way to do a similar thing with a URL? From just an ordinary URL, generate stats like that. I was thinking of maybe taking the ASCII values of various characters in the URL and using them, but this seems too predictable and obvious.
Take the MD5 sum of the ASCII encoding of the URL? Incredibly easy to do on most platforms. That would give you 128 bits to come up with the stats from. If you want more, use a longer hash algorithm.
(I can't remember the details about what's allowed in a URL - if non-ASCII is allowed, you could use UTF-8 instead.)
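As a sketch of that suggestion (the stat names and ranges are invented for illustration):

    import hashlib

    def monster_stats(url):
        # Carve fixed bytes out of the URL's 128-bit MD5 digest;
        # the same URL always yields the same monster.
        digest = hashlib.md5(url.encode("utf-8")).digest()
        return {
            "hit_points":   1 + digest[0] % 100,
            "attack_power": 1 + digest[1] % 100,
            "magic_power":  1 + digest[2] % 100,
            "speed":        1 + digest[3] % 100,
        }

    print(monster_stats("https://example.com/"))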
