What's the character set of SHA-1? - sha1

I need to know which characters SHA-1 will generate for me.
Is it possible to know the character set of SHA-1? Or, if it's configurable, what is its default character set?
Thank you.

SHA-1 doesn't generate text; it generates a binary hash (like most digests), so it doesn't have a charset (or care about the input's charset, for that matter).
You can represent the hash as text if you want (hex and Base64 representations are popular), especially if you need to transfer it over the network or display it to users. That encoding is up to you.

I'm fairly sure it's just binary data rather than any character encoding. You could then encode that in Base64 if you like.

The SHA-1 hash algorithm takes a stream of bytes as input and calculates a 160-bit digest. Command-line versions output the digest as a hexadecimal string. No charsets are involved.
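To make the distinction concrete, here is a minimal Java sketch (the class name Sha1Demo is invented for the example) showing that SHA-1 produces raw bytes, and that the familiar 40-character hex string is a separate encoding step applied afterwards:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Sha1Demo {
    public static void main(String[] args) throws NoSuchAlgorithmException {
        // The input's charset only matters for turning the input *text* into bytes.
        byte[] input = "hello".getBytes(StandardCharsets.UTF_8);

        // SHA-1 itself produces 20 raw bytes (160 bits), not text.
        byte[] digest = MessageDigest.getInstance("SHA-1").digest(input);

        // Rendering those bytes as hex is a separate, optional encoding step.
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        System.out.println(hex);  // 40 hex characters for the 20 digest bytes
    }
}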

Related

Delphi decoded base64 to something

I am a bit stuck with decoding. I have a base64-encoded .rtf file.
A little part of this looks like this: Bek\u252\''fcld\u337\''3f
Which represents: Beküldő
But my output data after decoding is: Bekuld?
If I manually replace the characters, it works:
StringReplace(Result, 'U337\''3F', '''F5', [rfReplaceAll, rfIgnoreCase]);
Does anyone know a general solution for this? Some conversion or something?
For instance, \u242 means Unicode character #242.
So you could search for \u in the RTF content (ignoring any \\ escaped sequence), then retrieve the following number, and use it as a character.
But RTF is a very complex beast.
Check what the RTF 1.5 specification says about encoding:
\uN This keyword represents a single Unicode character which has no
equivalent ANSI representation based on the current ANSI code page. N
represents the Unicode character value expressed as a decimal number.
This keyword is followed immediately by equivalent character(s) in
ANSI representation. In this way, old readers will ignore the \uN
keyword and pick up the ANSI representation properly. When this
keyword is encountered, the reader should ignore the next N
characters, where N corresponds to the last \ucN value encountered.
Perhaps the easiest approach is to use a hidden RichEdit control for decoding, under Windows/VCL.
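As a rough illustration of the search-and-replace idea (in Java rather than Delphi, and not a full RTF parser), here is a sketch that assumes the skip count is the default \uc1 and that the ANSI fallback is either a \'xx escape or a single plain character; the class and method names are invented for the example:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RtfUnicodeSketch {
    // Matches \uN followed by its ANSI fallback, assumed here to be either
    // a \'xx hex escape or one plain character (i.e. the default \uc1).
    private static final Pattern RTF_U =
            Pattern.compile("\\\\u(-?\\d+)(\\\\'[0-9a-fA-F]{2}|.)");

    static String decodeUnicodeEscapes(String rtf) {
        Matcher m = RTF_U.matcher(rtf);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int code = Integer.parseInt(m.group(1));
            if (code < 0) {
                code += 65536;  // \uN is a signed 16-bit value in RTF
            }
            // Keep the real Unicode character, drop the ANSI fallback.
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf((char) code)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // Prints "Beküldő"
        System.out.println(decodeUnicodeEscapes("Bek\\u252\\'fcld\\u337\\'3f"));
    }
}

This only covers the simple case: a real reader must honour \ucN and skip \\-escaped sequences, which is why a full parser (or the hidden RichEdit trick) is more robust.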

What kind of encoding is this URL?

A bunch of photos in a website directory has these URLs for each photo:
www.example.com/3aecbc1bf32c7615fb732d407b1b571a.jpg
www.example.com/27cbb6.jpg
My question is: is the random-looking part some kind of encoding that can be decoded? Or is each photo really represented by a random character string? I wish to understand the pattern so I can guess the URLs and view all the photos in the directory. Thanks.
The first string looks like an MD5 hash value. The output of an MD5 hash function is always the same length regardless of the length of the input: 128 bits, or 32 hexadecimal characters. And since a hash is one-way, it can't be decoded back to anything.
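A quick Java sketch (class name invented for the example) illustrating the fixed output length: whatever the input, the hex digest is always 32 characters:

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Md5LengthDemo {
    static String md5Hex(String input) throws NoSuchAlgorithmException {
        byte[] digest = MessageDigest.getInstance("MD5").digest(input.getBytes());
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        // Both digests are exactly 32 hex characters, whatever the input length.
        System.out.println(md5Hex("a"));
        System.out.println(md5Hex("a-much-longer-photo-filename.jpg"));
    }
}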

Why does JDK8's Base64 use ISO-8859-1?

I'm writing my own Base64 encoder/decoder for some constrained environments.
I found that the documentation for Base64.Encoder#encodeToString says it uses ISO-8859-1 to construct a String from the encoded bytes.
I presume that the ISO-8859-1 charset covers the whole Base64 alphabet.
Is there any possible reason not to use US-ASCII?
I suspect it's more efficient: converting from ISO-8859-1 back to text is just a matter of promoting each byte straight to a char, whereas for ASCII you'd need to check that the byte is valid ASCII. The result for base64 will always be the same, of course.
(That's only a guess, but an educated one. You could always run benchmarks if you want to validate it...)
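A small Java check supporting that guess: since the Base64 alphabet lies entirely within the ASCII range, decoding the encoded bytes with either charset yields the same String (class name invented for the example):

import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class Base64CharsetDemo {
    public static void main(String[] args) {
        byte[] encoded = Base64.getEncoder().encode("any input".getBytes(StandardCharsets.UTF_8));

        // Every byte of Base64 output is below 0x80, so both charsets decode
        // it identically; ISO-8859-1 just maps each byte straight to a char
        // without having to validate it.
        String viaLatin1 = new String(encoded, StandardCharsets.ISO_8859_1);
        String viaAscii = new String(encoded, StandardCharsets.US_ASCII);
        System.out.println(viaLatin1.equals(viaAscii));  // true
    }
}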

How to identify if a TBytes array may safely convert to AnsiString, string or UTF8String?

Given a TBytes array, can we identify whether the array can be converted to AnsiString, String or UTF8String without losing any characters?
What you appear to be asking to do is impossible. You seem to have a byte array of unknown provenance that may be encoded as ANSI, UTF-8 or UTF-16. You are hoping to be able to determine which encoding is correct.
This is impossible because there exist byte arrays that are valid in all three of those encodings, and that represent different strings in each encoding. Raymond Chen shows a nice clean example here: The Notepad file encoding problem, redux.
You can use heuristic algorithms to attempt to guess the encoding, an example of which is IsTextUnicode. But any such approach is by necessity not robust.
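The question is about Delphi, but the principle is easy to sketch in Java: a strict decoder can prove that a byte array is not valid UTF-8, while a successful decode proves nothing, because the same bytes may also be valid in another encoding (class and method names invented for the example):

import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8Check {
    // True if the bytes form valid UTF-8. This can only *rule out* UTF-8;
    // a positive result does not prove the text was meant as UTF-8, since
    // many byte arrays are valid in several encodings at once.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValidUtf8(new byte[] {(byte) 0xC3, (byte) 0xB6}));  // true: UTF-8 "ö"
        System.out.println(isValidUtf8(new byte[] {(byte) 0xF6}));               // false: lone Latin-1 "ö"
    }
}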

Can urls have UTF-8 characters?

I was curious whether I should encode URLs with ASCII or UTF-8. I was under the impression that URLs cannot contain non-ASCII characters, but someone told me they can contain UTF-8, and I searched around and couldn't quite find which is true. Does anyone know?
There are two parts to this, but they both amount to "yes".
With IDNA, it is possible to register domain names using the full Unicode repertoire (with a few minor twists to prevent ambiguities and abuse).
The path part is not strictly regulated, but it's possible to encode arbitrary strings in the path. The browser could opt to display a human-readable rendering rather than an encoded path. However, this requires heuristics, as there is no way to specify the character set and encoding of the path.
So, http://xn--msic-0ra.example/mot%C3%B6rhead is a (fictional example, not entirely correct) computer-readable encoded URL which could be displayed to the user as http://müsic.example/motörhead. The domain name is encoded as xn--msic-0ra.example in something called Punycode, and the path contains the label "motörhead" encoded as UTF-8 and URL encoded (the Unicode code point U+00F6 is represented by the two bytes 0xC3 0xB6 in UTF-8).
The path could also be mot%F6rhead which is the same label in Latin-1. In this case, deducing a reasonable human-readable representation would be much harder, but perhaps the context of the surrounding characters could offer enough hints for a good guess.
In isolation, %F6 could be pretty much anything, and %C3%B6 could be e.g. UTF-16.
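A small Java sketch of both parts (the Charset overload of URLEncoder.encode needs Java 10 or later; the expected outputs in the comments follow the example above):

import java.net.IDN;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class UrlEncodingDemo {
    public static void main(String[] args) {
        // The domain: IDNA maps Unicode labels to an ASCII-safe Punycode form.
        System.out.println(IDN.toASCII("müsic.example"));  // xn--msic-0ra.example

        // The path: percent-encode the UTF-8 bytes of the label. URLEncoder is
        // really meant for form data (it turns spaces into '+'), but for a
        // label without spaces the result matches plain percent-encoding.
        System.out.println(URLEncoder.encode("motörhead", StandardCharsets.UTF_8));  // mot%C3%B6rhead
    }
}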
