Method for generating numerical values from a URL - url

In the 90s there was a toy called Barcode Battler. It scanned barcodes, and from the values generated an RPG like monster with various stats such as hit points, attack power, magic power, etc. Could there be a way to do a similar thing with a URL? From just an ordinary URL, generate stats like that. I was thinking of maybe taking the ASCII values of various characters in the URL and using them, but this seems too predictable and obvious.

Take the MD5 sum of the ASCII encoding of the URL? Incredibly easy to do on most platforms. That would give you 128 bits to come up with the stats from. If you want more, use a longer hash algorithm.
(I can't remember the details about what's allowed in a URL - if non-ASCII is allowed, you could use UTF-8 instead.)

Related

How unique are the first 8-12 characters of SHA256 hashes?

Take this hash for example:
ba7816bf 8f01cfea 414140de 5dae2223 b00361a3 96177a9c b410ff61 f20015ad
It's too long for my purposes so I intend to use a small chunk from it, such as:
ba7816bf8f01
ba7816bf
Or similar. My intended use case:
Video gallery on a website, represented by thumbnails. They are in random order.
They play in the lightbox. They don't have a unique ID, only their URL is unique.
While the lightbox is open I add something to the end of the page URL with JS History API.
//example.com/video-gallery/lightbox/ba7816bf8f01
The suffix needs to be short and simple, definitely not a URL.
People share the URL.
The server can make sense of the lightbox/ba7816bf8f01 in relation to /video-gallery.
Visiting the URL, the lightbox needs to find which video the suffix belongs to and play it.
I thought I'd SHA256 the URL of the video, use the first few characters as an ad-hoc ID. How many characters should I use from the generated hash, to considerably reduce the chance of collision?
I got the idea from URLs and Hashing by Google.
The Wikipedia page on birthday attacks has a table with the number of entries you need to produce a certain chance of collision with a certain number of bits as a random identifier. If you want to have a one in a million chance of a collision and expect to store a million documents, for example, you’ll need fewer than 64 bits (16 hex characters).
Base64 is a good way to fit more bits into the same length of string compared to hex, too, taking 1⅓ characters per byte instead of 2.

trying to figure out the charset

I'm downloading a CSV from Google Docs and in it characters like “ are saved as \xE2\x80\x9C and ” are saved as \xE2\x80\x9D.
My question is... what charset are those being saved in? How might I go about figuring that out?
It is in UTF-8.. You can tell by decoding it as UTF-8 and it shows the correct characters.
UTF-8 also has a unique and very distinctive pattern, just 3 bytes with highest bit set forming a valid UTF-8 sequence are enough to tell if something is UTF-8 with 99% confidence. Even with 2 bytes with highest bit set forming a valid UTF-8 sequence, you can already get to 90%.
In a case it wasn't UTF-8, and was some 8-bit code page instead, it would be impossible to tell just by looking at the bytes alone. Without any other information, you would basically have to brute force by decoding it in various 8-bit encodings and then seeing if it looks correct. The other possibility is using an algorithm that would go through the encodings automatically, and see if it the result makes sense in any language.
With more information like what operating system and locale the file was saved in, you could reduce the amount of possible encodings to try by a huge deal though.

If you know the length of a string and apply a SHA1 hash to it, can you unhash it?

Just wondering if knowing the original string length means that you can better unlash a SHA1 encryption.
No, not in the general case: a hash function is not an encryption function and it is not designed to be reversible.
It is usually impossible to recover the original hash for certain. This is because the domain size of a hash function is larger than the range of the function. For SHA-1 the domain is unbounded but the range is 160bits.
That means that, by the Pigeonhole principle, multiple values in the domain map to the same value in the range. When such two values map to the same hash, it is called a hash collision.
However, for a specific limited set of inputs (where the domain of the inputs is much smaller than the range of the hash function), then if a hash collision is found, such as through an brute force search, it may be "acceptable" to assume that the input causing the hash was the original value. The above process is effectively a preimage attack. Note that this approach very quickly becomes infeasible, as demonstrated at the bottom. (There are likely some nice math formulas that can define "acceptable" in terms of chance of collision for a given domain size, but I am not this savvy.)
The only way to know that this was the only input that mapped to the hash, however, would be to perform an exhaustive search over all the values in the range -- such as all strings with the given length -- and ensure that it was the only such input that resulted in the given hash value.
Do note, however, that in no case is the hash process "reversed". Even without the Pigeon hole principle in effect, SHA-1 and other cryptographic hash functions are especially designed to be infeasible to reverse -- that is, they are "one way" hash functions. There are some advanced techniques which can be used to reduce the range of various hashes; these are best left to Ph.D's or people who specialize in cryptography analysis :-)
Happy coding.
For fun, try creating a brute-force preimage attack on a string of 3 characters. Assuming only English letters (A-Z, a-z) and numbers (0-9) are allowed, there are "only" 623 (238,328) combinations in this case. Then try on a string of 4 characters (624 = 14,776,336 combinations) ... 5 characters (625 = 916,132,832 combinations) ... 6 characters (626 = 56,800,235,584 combinations) ...
Note how much larger the domain is for each additional character: this approach quickly becomes impractical (or "infeasible") and the hash function wins :-)
One way password crackers speed up preimage attacks is to use rainbow tables (which may only cover a small set of all values in the domain they are designed to attack), which is why passwords that use hashing (SHA-1 or otherwise) should always have a large random salt as well.
Hash functions are one-way function. For a given size there are many strings that may have produced that hash.
Now, if you know that the input size is fixed an small enough, let's say 10 bytes, and you know that each byte can have only certain values (for example ASCII's A-Za-z0-9), then you can use that information to precompute all the possible hashes and find which plain text produces the hash you have. This technique is the basis for Rainbow tables.
If this was possible , SHA1 would not be that secure now. Is it ? So no you cannot unless you have considerable computing power [2^80 operations]. In which case you don't need to know the length either.
One of the basic property of a good Cryptographic hash function of which SHA1 happens to be one is
it is infeasible to generate a message that has a given hash
Theoretically, let's say the string was also known to be solely of ASCII characters, and it's of size n.
There are 95 characters in ASCII not including controls. We'll assume controls weren't used.
There are 95ⁿ possible such strings.
There are 1.461501×10⁴⁸ possible SHA-1 values (give or take) and a just n=25, there are 2.7739×10⁴⁹ possible ASCII-only strings without controls in them, which would mean guaranteed collisions (some such strings have the same SHA-1).
So, we only need to get to n=25 when this becomes impossible even with infinite resources and time.
And remember, up until now I've been making it deliberately easy with my ASCII-only rule. Real-world modern text doesn't follow that.
Of course, only a subset of such strings would be anything likely to be real (if one says "hello my name is Jon" and the other says "fsdfw09r12esaf" then it was probably the first). Stil though, up until now I was assuming infinite time and computing power. If we want to work it out sometime before the universe ends, we can't assume that.
Of course, the nature of the attack is also important. In some cases I want to find the original text, while in others I'll be happy with gibberish with the same hash (if I can input it into a system expecting a password).
Really though, the answer is no.
I posted this as an answer to another question, but I think it is applicable here:
SHA1 is a hashing algorithm. Hashing is one-way, which means that you can't recover the input from the output.
This picture demonstrates what hashing is, somewhat:
As you can see, both John Smith and Sandra Dee are mapped to 02. This means that you can't recover which name was hashed given only 02.
Hashing is used basically due to this principle:
If hash(A) == hash(B), then there's a really good chance that A == B. Hashing maps large data sets (like a whole database) to a tiny output, like a 10-character string. If you move the database and the hash of both the input and the output are the same, then you can be pretty sure that the database is intact. It's much faster than comparing both databases byte-by-byte.
That can be seen in the image. The long names are mapped to 2-digit numbers.
To adapt to your question, if you use bruteforce search, for a string of a given length (say length l) you will have to hash through (dictionary size)^l different hashes.
If the dictionary consists of only alphanumeric case-sensitive characters, then you have (10 + 26 + 26)^l = 62^l hashes to hash. I'm not sure how many FLOPS are required to produce one hash (as it is dependent on the hash's length). Let's be super-unrealistic and say it takes 10 FLOP to perform one hash.
For a 12-character password, that's 62^12 ~ 10^21. That's 10,000 seconds of computations on the fastest supercomputer to date.
Multiply that by a few thousand and you'll see that it is unfeasible if I increase my dictionary size a little bit or make my password longer.

Extra text after the URL in a QR code

I've seen a number of QR codes that contain a URL but also have extra some text after it. Something like:
http://www.example.com Thanks for scanning this QR code!
I've experimented with using a number of different delimiting methods (several spaces, a question mark, two dashes, one or two returns) and all work to varying degrees on various scanning programs.
Some respect the space character, others respect the return. Some think the URL isn't a URL at all when I use a return. Long story short, it's all over the map how the various scanning programs (NeoReader, iNigma, Qrafter, Beetag, OptiScan, etc.) treat characters after a URL.
Is there any consensus on weather (a)this is even a good idea or not and (b)if so, what is the 'correct' (best practice) way to do it? (I know I should go read the RFC for the exact definition of a URL but since the reader programs are all over the map, I suspect they didn't read it either.)
You can make it work by converting the text message into valid URL, while trying to keep readability.
In your case it can be:
http://www.example.com?Thanks_for_scanning_this_QR_code
It's not perfect, but it can help on web analytics side to distinguish all QR code users.
Spaces are definitely not part of a URL, so, in that sense a space definitely should delimit the end of a URL.
The entire string is not a URL, taken as a whole of course. So yes it's asking for trouble.
As you've found, the empirical answer is that not every reader does what you want. Barcode Scanner for instance understands the split here, but does not prompt the user to launch the browser since the payload isn't a URL per se.
So: it's a bad idea.

Allowing non-English (ASCII) characters in the URL for SEO?

I have lots of UTF-8 content that I want inserted into the URL for SEO purposes. For example, post tags that I want to include in th URI (site.com/tags/id/TAG-NAME). However, only ASCII characters are allowed by the standards.
Characters that are allowed in a URI
but do not have a reserved purpose are
called unreserved. These include
uppercase and lowercase letters,
decimal digits, hyphen, period,
underscore, and tilde.
The solution seems to be to:
Convert the character string into a
sequence of bytes using the UTF-8
encoding
Convert each byte that is
not an ASCII letter or digit to %HH,
where HH is the hexadecimal value of
the byte
However, that converts the legible (and SEO valuable) words into mumbo-jumbo. So I'm wondering if google is still smart enough to handle searches in URL's that contain encoded data - or if I should attempt to convert those non-english characters into there semi-ASCII counterparts (which might help with latin based languages)?
Firstly, search engines really don't care about the URLs. They help visitors: visitors link to sites, and search engines care about that. URLs are easy to spam, if they cared there would be incentive to spam. No major search engines wants that. The allinurl: is merely a feature of google to help advanced users, not something that gets factored into organic rankings. Any benefits you get from using a more natural URL will probably come as a fringe benefit of the PR from an inferior search engine indexing your site -- and there is some evidence this can be negative with the advent of negative PR too.
From Google Webmaster Central
Does that mean I should avoid
rewriting dynamic URLs at all?
That's
our recommendation, unless your
rewrites are limited to removing
unnecessary parameters, or you are
very diligent in removing all
parameters that could cause problems.
If you transform your dynamic URL to
make it look static you should be
aware that we might not be able to
interpret the information correctly in
all cases. If you want to serve a
static equivalent of your site, you
might want to consider transforming
the underlying content by serving a
replacement which is truly static. One
example would be to generate files for
all the paths and make them accessible
somewhere on your site. However, if
you're using URL rewriting (rather
than making a copy of the content) to
produce static-looking URLs from a
dynamic site, you could be doing harm
rather than good. Feel free to serve
us your standard dynamic URL and we
will automatically find the parameters
which are unnecessary.
I personally don't believe it matters all that much short of getting a little more click through and helping users out. So far as Unicode, you don't understand how this works: the request goes to the hex-encoded unicode destination, but the rendering engine must know how to handle this if it wishes to decode them back to something visually appealing. Google will render (aka decode) unicode (encoded) URL's properly.
Some browsers make this slightly more complex by always encoding the hostname portion, because of phishing attacks using ideographs that look the same.
I wanted to show you an example of this, here is request to http://hy.wikipedia.org/wiki/Գլխավոր_Էջ issued by wget:
Hypertext Transfer Protocol
GET /wiki/%D4%B3%D5%AC%D5%AD%D5%A1%D5%BE%D5%B8%D6%80_%D4%B7%D5%BB HTTP/1.0\r\n
[Expert Info (Chat/Sequence): GET /wiki/%D4%B3%D5%AC%D5%AD%D5%A1%D5%BE%D5%B8%D6%80_%D4%B7%D5%BB HTTP/1.0\r\n]
[Message: GET /wiki/%D4%B3%D5%AC%D5%AD%D5%A1%D5%BE%D5%B8%D6%80_%D4%B7%D5%BB HTTP/1.0\r\n]
[Severity level: Chat]
[Group: Sequence]
Request Method: GET
Request URI: /wiki/%D4%B3%D5%AC%D5%AD%D5%A1%D5%BE%D5%B8%D6%80_%D4%B7%D5%BB
Request Version: HTTP/1.0
User-Agent: Wget/1.11.4\r\n
Accept: */*\r\n
Host: hy.wikipedia.org\r\n
Connection: Keep-Alive\r\n
\r\n
As you can see, wget like every other browser will just url-encode the destination for you, and the continue the request to the url-encoded destination. The url-decoded domain only exists as a visual convenience.
Do you know what language everything will be in? Is it all latin based?
If so, then I would suggest building a sort of lookup table that will convert UTF-8 to ASCII when possible(and non-colliding) Something like that would convert Ź into Z and such, and when there is a collision or the character doesn't exist in your lookup table, then it just uses %HH.

Resources