Why are URLs encoded in Base32? - url

This is a really short question I think but I'm not sure I understand the point of it.
Why are URLs encoded in Base32? What are the benefits of it and what are the drawbacks of it?

Sometimes URL data needs to be encoded to encapsulate things that aren't easily type-able, such as "ÓĆ", or even binary data that has no text representation at all. Putting that inside of a query string was problematic. Some servers don't understand Unicode text in a query string, though that situation is certainly getting better.
So the data needs to be encoded somehow that the server can interpret correctly, and the application knows how to use. Base32 is commonly used for that. It encodes any binary data into a ASCII text representation of that data. When the original data is needed, it is decoded.
So why not base64? Base64 will almost always have a shorter encoding length. Base64's weakness is that it uses both upper and lower case letter for encoding. There is a distinction between A and a. Whereas Base32 only uses one letter's casing, so it can be case insensitive. Generally (but not always), URLs are case insensitive, and using Base32 keeps that notion alive. This distinction is useful when the encoded data is meant to be typed, read aloud, etc.
The drawback to Base32 is that the resulting encoding is almost always longer due to a much smaller character set.

Related

Cannot get expected result for Spring4D cryptography examples

The Spring4D library has cryptography classes, however I cannot get them to work as expected. I'm probably using them incorrectly, however lack of any examples makes it difficult.
For example on the website https://quickhash.com/hash-sha256-online, I can hash the word "test" to generate the following hash:
9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08
Using the Spring4D library, the following code produces a different hash:
CreateSHA256.ComputeHash('test').ToString;
results in:
9EFEA1AEAC9EDA04A892885A65FDAE0E6D9BE8C9FC96DA76D31B929262E12B1D
Upper/lower case aside, it is a different hash altogether. I know must be doing something wrong, but again there's no examples of use so I'm stuck on how to do this.
Hashing algorithms operate on binary data, typically represented using byte arrays.
Unfortunately, both of the resources you have used offer the ability to hash text. In order to hash text, you first need to convert from text to binary. To do so requires a choice of encoding. And neither method makes it clear what that choice is.
When I use this Delphi code:
LowerCase(CreateSHA256.ComputeHash(TEncoding.UTF8.GetBytes('test')).ToString)
I get the same hash as appears in your question.
I urge you never to attempt to encrypt/hash text and instead regard these operations as operating on binary. Always use an explicit encoding and then encrypt/hash the array of bytes that the encoding produced.
I've picked the UTF-8 encoding here, because it is a full Unicode encoding, and tends to be efficient in terms of space. However, I don't think your online encoder uses UTF-8. In fact I've no idea what encoding it uses, it is unclear on the matter. This is of course the same old issue of text being different from binary.
In my opinion it is a design flaw of the Delphi library that you use that it allows you to hash text without an explicit choice of encoding. If this library must offer a function that hashes text, then it should require the caller to supply an extra TEncoding parameter.
There is no conversion going on internally so it hashes the UnicodeString which is at least 2 bytes per character.
If you want the same result as on the page you have to use UTF8Encode or directly pass as AnsiString.
However I tried some strings that contained different unicode characters and the page returned a different result. So I am not quite sure how they treat the strings there. I guess it's a codepage thing.
Edit: If you use this page http://www.xorbin.com/tools/sha256-hash-calculator it generates the same hash as TSHA256 with UTF8Encode.
Which type of string are you using? Do you use AnsiString or WideString (Unicode string). Delphi 2009 and Newer are using WideString by default.
Why is string type inportant? All hasging algorithm operates on raw bytes data so it is omportant if each character of your string is stored in one Byte of memory (AnsiString) or multiple Bytes of memory (WideString).

Can urls have UTF-8 characters?

I was curious if I should encode urls with ASCII or UTF-8. I was under the belief that urls cannot have non-ASCII characters, but someone told me they can have UTF-8, and I searched around and couldn't quite find which one is true. Does anyone know?
There are two parts to this, but they both amount to "yes".
With IDNA, it is possible to register domain names using the full Unicode repertoire (with a few minor twists to prevent ambiguities and abuse).
The path part is not strictly regulated, but it's possible to encode arbitrary strings in the path. The browser could opt to display a human-readable rendering rather than an encoded path. However, this requires heuristics, as there is no way to specify the character set and encoding of the path.
So, http://xn--msic-0ra.example/mot%C3%B6rhead is a (fictional example, not entirely correct) computer-readable encoded URL which could be displayed to the user as http://müsic.example/motörhead. The domain name is encoded as xn--msic-0ra.example in something called Punycode, and the path contains the label "motörhead" encoded as UTF-8 and URL encoded (the Unicode code point U+00F6 is reprecented with the two bytes 0xC3 0xB6 in UTF-8).
The path could also be mot%F6rhead which is the same label in Latin-1. In this case, deducing a reasonable human-readable representation would be much harder, but perhaps the context of the surrounding characters could offer enough hints for a good guess.
In isolation, %F6 could be pretty much anything, and %C3%B6 could be e.g. UTF-16.

Squeak Monticello character-encoding

For a work project I am using headless Squeak on a (displayless, remote) Linuxserver and also using Squeak on a Windows developer-machine.
Code on the developer machine is managed using Monticello. I have to copy the mcz to the server using SFTP unfortunately (e.g. having a push-repository on the server is not possible for security reasons). The code is then merged by eg:
MczInstaller installFileNamed: 'name-b.18.mcz'.
Which generally works.
Unfortunately our code-base contains strings that contain Umlauts and other non-ascii characters. During the Monticello-reimport some of them get replaced with other characters and some get replaced with nothing.
I also tried e.g.
MczInstaller installStream: (FileStream readOnlyFileNamed: '...') binary
(note .mcz's are actually .zip's, so binary should be appropriate, i guess it is the default anyway)
Finding out how to make Monticello's transfer preserve the Squeak internal-encoding of non-ascii's is the main Goal of my question. Changing all the source code to only use ascii-strings is (at least in this codebase) much less desirable because manual labor is involved. If you are interested in why it is not a simple grep-replace in this case read this side note:
(Side note: (A simplified/special case) The codebase uses Seaside's #text: method to render strings that contain chars that have to be html-escaped. This works fine with our non-ascii's e.g. it converts ä into ä, if we were to grep-replace the literal ä's by ä explicitly, then we would have to use the #html: method instead (else double-escape), however that would then require that we replace all other characters that have to be html-escaped as well (e.g. &), but then again the source-code itself contains such characters. And there are other cases, like some #text:'s that take third-party strings, they may not be replaced by #html's...)
Squeak does use unicode (ISO 10646) internally for encoding characters in a String.
It might use extension like CP1252 for characters in range 16r80 to: 16r9F, but I'm not really sure anymore.
The characters codes are written as is on the stream source.st, and these codes are made of a single byte for a ByteString when all characters are <= 16rFF. In this case, the file should look like encoded in ISO-8859-L1 or CP1252.
If ever you have character codes > 16rFF, then a WideString is used in Squeak. Once again the codes are written as is on the stream source.st, but this time these are 32 bits codes (written in big-endian order). Technically, the encoding is thus UTF-32BE.
Now what does MczInstaller does? It uses the snapshot/source.st file, and uses setConverterForCode for reading this file, which is either UTF-8 or MacRoman... So non ASCII characters might get changed, and this is even worse in case of WideString which will be re-interpreted as ByteString.
MC itself doesn't use the snapshot/source.st member in the archive.
It rather uses the snapshot.bin (see code in MCMczReader, MCMczWriter).
This is a binary file whose format is governed by DataStream.
The snippet that you should use is rather:
MCMczReader loadVersionFile: 'YourPackage-b.18.mcz'
Monticello isn't really aware of character encoding. I don't know the present situation in squeak but the last time I've looked into it there was an assumed character encoding of latin1. But that would mean it should work flawlessly in your situation.
It should work somehow anyway if you are writing and reading from the same kind of image. If the proper character encoding fails usually the internal byte representation is written from memory to disk. While this prevents any cross dialect exchange of packages it should work if using the same image kind.
Anyway there are things that should or could work but they often go wrong. So most projects try to avoid using non 7bit characters in their code.
You don't need to convert non 7bit characters to HTML entities. You can use
Character value: 228
for producing an ä in your code without using non 7bit characters. On every character you like to add a conversion you can do
$ä asciiValue => 228
I know this is not the kind of answer some would want to get. But monticello is one of these things that still need to be adjusted for proper character encoding.

trying to figure out the charset

I'm downloading a CSV from Google Docs and in it characters like “ are saved as \xE2\x80\x9C and ” are saved as \xE2\x80\x9D.
My question is... what charset are those being saved in? How might I go about figuring that out?
It is in UTF-8.. You can tell by decoding it as UTF-8 and it shows the correct characters.
UTF-8 also has a unique and very distinctive pattern, just 3 bytes with highest bit set forming a valid UTF-8 sequence are enough to tell if something is UTF-8 with 99% confidence. Even with 2 bytes with highest bit set forming a valid UTF-8 sequence, you can already get to 90%.
In a case it wasn't UTF-8, and was some 8-bit code page instead, it would be impossible to tell just by looking at the bytes alone. Without any other information, you would basically have to brute force by decoding it in various 8-bit encodings and then seeing if it looks correct. The other possibility is using an algorithm that would go through the encodings automatically, and see if it the result makes sense in any language.
With more information like what operating system and locale the file was saved in, you could reduce the amount of possible encodings to try by a huge deal though.

Character Encoding and the ’ Issue

Even today, one frequently sees character encoding problems with significant frequency. Take for example this recent job post:
(Note: This is an example, not a spam job post... :-)
I have recently seen that exact error on websites, in popular IM programs, and in the background graphics on CNN.
My two-part question:
What causes this particular, common encoding issue?
As a developer, what should I do with user input to avoid common encoding issues like
this one? If this question requires simplification to provide a
meaningful answer, assume content is entered through a web browser.
What causes this particular, common encoding issue?
This will occur when the conversion between characters and bytes has taken place using the wrong charset. Computers handles data as bytes, but to represent the data in a sensible manner to humans, it has to be converted to characters (strings). This conversion takes place based on a charset of which there are many different ones.
In the particular ’ example, this is a typical CP1252 representation of the Unicode Character 'RIGHT SINQLE QUOTATION MARK' (U+2019) ’ which was been read using UTF-8. In UTF-8, that character exist of the bytes 0xE2, 0x80 and 0x99. If you check the CP1252 codepage layout, then you'll see that those bytes represent exactly the characters â, € and ™.
This can be caused by the website not having read in the original source properly (it should have used CP1252 for this), or is displaying an UTF-8 page with the wrong charset=CP1252 attribute in Content-Type response header (or the attribute is missing; on Windows machines the default charset of CP1252 would be used then).
As a developer, what should I do with user input to avoid common encoding issues like this one? If this question requires simplification to provide a meaningful answer, assume content is entered through a web browser.
Ensure that you read the characters from arbitrary byte stream sources (e.g. a file, an URL, a network socket, etc) using a known and predefinied charset. Then, ensure that you're consistently storing, writing and sending it using an Unicode charset, preferably UTF-8.
If you're familiar with Java (your question history confirms this), you may find this article useful.

Resources