Shortening a GUID - character-encoding

We generate a GUID for a document and need to include the GUID in a barcode (Type 29 2D) which is C40 encoded and has the following restrictions:
Maximum length of 25 characters.
Only uppercase alphanumeric characters; no special characters.
I had thought of converting to Base64, but that uses special characters.

You could use a base36 encoding.
Given that a UUID is only 16 bytes (128 bits), it fits into 25 base36 characters, since 36^25 ≈ 8.1 × 10^38 is larger than 2^128 ≈ 3.4 × 10^38.
To demonstrate, here's a small JavaScript snippet that takes the example UUID from the Wikipedia page (123e4567-e89b-12d3-a456-426614174000) and converts it to base36:
const guid = BigInt('0x123e4567e89b12d3a456426614174000');
const encoded = guid.toString(36).toUpperCase();
console.log("Encoded: " + encoded); // 12VQJRNXK8WHV3I8QI6QGRLZ4
console.log("Length: " + encoded.length); // 25

Related

How can I convert a 4-byte string into a Unicode emoji?

A web service I use in my Delphi 10.3 application returns a string consisting of these four bytes: F0 9F 99 82. I expect a slightly smiling emoji. This site shows this byte sequence as the UTF-8 representation of that emoji. So I guess I have a UTF-8 representation in my string, but it is held in an actual Unicode string? How do I convert my string into the actual Unicode representation, to show it, for example, in a TMemo?
The character 🙂 has the Unicode code point U+1F642. Displaying text is defined through an encoding, which specifies how a sequence of bytes is to be interpreted:
in UTF-8 one character can consist of 8, 16, 24 or 32 bits (1 to 4 bytes); this one is $F0 $9F $99 $82.
in UTF-16 one character can consist of 16 or 32 bits (2 or 4 bytes = 1 or 2 Words); this one is $D83D $DE42 (using surrogates).
in UTF-32 one character always consists of 32 bits (4 bytes = 1 Cardinal or DWord) and always equals the code point, which is $1F642.
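To see these representations concretely, here is a small sketch (in Python rather than Delphi, purely for illustration) that decodes the four bytes from the question and prints the UTF-16 and UTF-32 forms listed above:
raw = bytes([0xF0, 0x9F, 0x99, 0x82])      # the four bytes returned by the web service

ch = raw.decode('utf-8')                    # interpret them as UTF-8 -> one character
print(ch, hex(ord(ch)))                     # 🙂 0x1f642, i.e. code point U+1F642

# The same character in the other encodings mentioned above:
print(ch.encode('utf-16-le').hex(' '))      # 3d d8 42 de  (surrogate pair $D83D $DE42, little-endian)
print(ch.encode('utf-32-be').hex(' '))      # 00 01 f6 42  (equals the code point)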
In Delphi, you can use:
TEncoding.UTF8.GetString() for UTF-8
(or TEncoding.Unicode.GetString() if you had UTF-16LE,
and TEncoding.BigEndianUnicode.GetString() if you had UTF-16BE).
Keep in mind that 🙂 is just a character like every letter, symbol and whitespace character in this text: it can be marked through selection (e.g. Ctrl+A) and copied to the clipboard (e.g. Ctrl+C). No special care is needed.

Concatenate two Base64 encoded strings

I want to decode two Base64-encoded strings and combine them to make one 128-bit string. I am able to decode the Base64-encoded strings. Can someone guide me on how to combine these two decoded strings?
This is the code I used for decoding the two encoded strings.
NSData *decodedData_contentKey = [[NSData alloc] initWithBase64EncodedString:str_content options:0];
NSString *decodedString_contentKey = [[NSString alloc] initWithData:decodedData_contentKey encoding:NSUTF8StringEncoding];
NSLog(@"%@", decodedString_contentKey);
Thanks.
Base64 is a statically sized encoding of octets/bytes into characters/text: every 6 bits of input are represented as one printable ASCII character. Hence the name: 2^6 = 64, so it uses an alphabet of 64 characters to encode the binary data (plus a padding character, '=', that does not carry encoded bits).
UTF-8 - used in your sample code - on the other hand is a character encoding. It is used to encode characters into octets, so it works the other way around. What you are actually doing is decoding characters back from the bytes. UTF-8 does not use 128-bit values, nor is it statically sized; multiple bytes may be used to represent one character. It will likely fail when it comes across an octet or octets that do not combine into a valid character encoding.
There is no such thing as base 128 encoding. Please think about what you are trying to accomplish and ask a new question that we can decode, if you get stuck.
GUESSED ANSWER:
Base64 encoding outputs 4 ASCII characters (32 bits of text) for every 3 input bytes - equivalently, 8 characters (64 bits) for every 6 bytes. Therefore, if you want 128 bits (16 characters) of encoded output, you simply have to input 12 bytes. As the Base64 encoding restarts at each 4-character boundary (each character carries 6 bits, and 4 × 6 = 24 bits = exactly 3 bytes of input per 4-character group), you can simply concatenate the two Base64 strings without decoding - provided the first string encodes a whole multiple of 3 bytes and therefore ends without '=' padding.
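A minimal sketch of both approaches in Python (standard base64 module; the short byte strings are made-up examples): decode each piece and re-encode the combined bytes, and, when there is no padding, just concatenate the encoded strings directly.
import base64

a = base64.b64encode(b"abc")        # b'YWJj'     - 3 input bytes, so no '=' padding
b = base64.b64encode(b"defghi")     # b'ZGVmZ2hp'

# Safe general approach: decode each piece, join the raw bytes, re-encode.
combined = base64.b64decode(a) + base64.b64decode(b)
print(base64.b64encode(combined))   # b'YWJjZGVmZ2hp'

# Because the first piece encodes a multiple of 3 bytes (no padding),
# simply concatenating the encoded strings gives the same result here.
print(a + b)                        # b'YWJjZGVmZ2hp'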

Cannot find base64 encoded string

I have an MS Doc file and I have converted it from a Blob to a Base64-encoded string. It contains a string in it: <z></z>
And I have the Base64-encoded string for this: PHo+PC96Pg==
But when I search for it in the string converted from the blob data, I am not able to find it!
Can you guide me on what I am doing wrong:
Blob beforeblob1 = Blob.valueOf(vDovMerge.Merge_Text__c);
String vDovMergeBlob = EncodingUtil.base64Encode(beforeblob1 );
String v = EncodingUtil.base64Encode(vDoc.Body);
system.debug('****v****'+v);
Blob beforeblob = Blob.valueOf('<z></z>');
String rep = EncodingUtil.base64Encode(beforeblob );
system.debug('****rep****'+rep );
v = v.replace(rep ,vDovMergeBlob );
system.debug('****v****'+v);
Base64 encoding converts each 3 bytes of input into 4 characters of output. So when encoding <z></z> on its own, it is guaranteed to start at the first byte of a block to be encoded. When it is encoded as part of a larger block of data, it may end up starting at the second or third byte of a block, producing totally different output - output that even depends on the data surrounding your string.
Example:
Assuming ASCII encoding:
encoding <z></z> results in PHo+PC96Pg==
encoding a<z></z> results in YTx6Pjwvej4=
encoding aa<z></z> results in YWE8ej48L3o+
encoding aaa<z></z> results in YWFhPHo+PC96Pg== which again contains the original encoding since it starts on a 3-byte-boundary.
So the only way to search the Base64-encoded data directly would be to treat it as a bit stream and search for the bit pattern of <z></z> without regard to byte boundaries - doesn't sound like a lot of fun to me :-(
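The alignment effect is easy to reproduce; a small Python sketch (standard base64 module, with the same made-up 'a' prefixes as in the examples above):
import base64

target = b"<z></z>"
for prefix in (b"", b"a", b"aa", b"aaa"):
    # Shift the target by 0-3 bytes and see how its encoding changes.
    print(prefix, base64.b64encode(prefix + target))
# b''    b'PHo+PC96Pg=='
# b'a'   b'YTx6Pjwvej4='
# b'aa'  b'YWE8ej48L3o+'
# b'aaa' b'YWFhPHo+PC96Pg=='   <- only here does the original encoding reappear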

Percent escaping special characters like é on iOS

I'm currently struggling with percent escaping special characters on iOS, for instance "é" when contained in a query parameter value.
I'm using AFNetworking, but the issue isn't specific to it.
The "é" character should be percent escaped to "%E9", yet the result is "%C3%A9". The reason is because "é" is represented as those 2 bytes in UTF8.
The actual percent escaping method is the well-known one and I'm passing UTF8 as the string encoding. The string itself is @"é".
static NSString * AFPercentEscapedQueryStringPairMemberFromStringWithEncoding(NSString *string, NSStringEncoding encoding)
{
static NSString * const kAFCharactersToBeEscaped = @":/?&=;+!@#$()~";
static NSString * const kAFCharactersToLeaveUnescaped = @"[].";
return (__bridge_transfer NSString *)CFURLCreateStringByAddingPercentEscapes(kCFAllocatorDefault, (__bridge CFStringRef)string, (__bridge CFStringRef)kAFCharactersToLeaveUnescaped, (__bridge CFStringRef)kAFCharactersToBeEscaped, CFStringConvertNSStringEncodingToEncoding(encoding));
}
I had hoped passing in the UTF-16 string encoding would solve it, but it doesn't. The result is "%FF%FE%E9%00" in this case; it contains "%E9", but I must be missing something obvious.
Somehow I can't get my head around it.
Any pointers would be awesome.
RFC 3986 explains that, unless the characters you're encoding fall into the unreserved US-ASCII range, the convention is to convert the character to its (in this case, UTF-8-encoded) byte values and use those values as the basis for the percent encoding.
The behavior you're seeing is correct.
The disparity between the encoded values given for UTF-8 vs. UTF-16 is due to a couple of factors.
Encoding Differences
First, there's the difference in the way that the respective encodings are actually defined. UTF-16 uses two bytes (one code unit) to represent a character in the Basic Multilingual Plane such as é, essentially concatenating the higher-order byte with the lower-order byte to define the code. (The ordering of these bytes depends on whether the value is encoded as little-endian or big-endian.) UTF-8, on the other hand, uses a dynamic number of bytes, depending on where in the Unicode code space the character lies. UTF-8 signals how many bytes it is going to use through the bits that are set in the first byte itself.
So if we look at C3 A9, that translates into the following bits:
1100 0011 1010 1001
Looking at RFC 2279, we see that the leading run of '1' bits followed by a terminating '0' denotes how many bytes will be used - in this case, 2. Stripping off the initial 110 metadata, we're left with 00011 from the first byte: those are the leftmost bits of the actual value.
For the next byte (1010 1001), again from the RFC we see that, for every subsequent byte, 10 will be "prefix" metadata for the actual value. Stripping that off, we're left with 101001.
Concatenating the actual value bits, we end up with 00011 101001, which is 233 in base-10, or E9 in base-16.
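As a quick check of that bit arithmetic, a short Python sketch (just reproducing the numbers from this answer):
# é in UTF-8 is the two bytes C3 A9
b1, b2 = 0xC3, 0xA9

payload1 = b1 & 0b00011111          # strip the leading '110' marker -> 00011
payload2 = b2 & 0b00111111          # strip the '10' continuation marker -> 101001

code_point = (payload1 << 6) | payload2
print(hex(code_point))              # 0xe9 (233), i.e. U+00E9 'é'
print('é'.encode('utf-8').hex())    # c3a9, confirming the byte values above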
Encoding Identification
The other thing to consider, specifically with the UTF-16 value (%FF%FE%E9%00), comes from the original RFC, which notes that there is no explicit indication, within the encoded value itself, of which encoding was used. So in this case, iOS is "cheating", giving you an indication of what encoding is used. FF FE is the well-known byte order mark (BOM) placed at the start of UTF-16 little-endian data to announce the encoding. As for E9 00: UTF-16 uses two bytes for this character, and since all of its data fits in one byte (é is U+00E9), the other byte is simply null.
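A short sketch in Python (urllib.parse rather than AFNetworking, purely to illustrate where both results come from):
from urllib.parse import quote

s = 'é'

# quote() percent-encodes the UTF-8 bytes of the string by default.
print(quote(s))                                   # %C3%A9

# Prepending the FF FE byte order mark to the UTF-16LE bytes reproduces
# the value seen when UTF-16 was requested.
utf16 = b'\xff\xfe' + s.encode('utf-16-le')
print(quote(utf16))                               # %FF%FE%E9%00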

Does the SHA1 of a string always return ASCII characters?

The input string can be a Unicode string. Will the output string after calculating SHA1 always contain only ASCII characters?
It depends, but strictly speaking, no. The output of the SHA-1 hash is 160 bits, or 20 bytes, but the bytes are not guaranteed to be in the ASCII range.
However, some hash functions output the hex equivalent (i.e. 40 characters) of the 20 bytes, so if the first three bytes of the actual hash are 0x7e, 0x03, and 0xb2, the output would begin with "7e03b2", in which case the output is ASCII.
SHA1 returns 20 bytes. SHA1 does not deal with encodings, text, ASCII, etc.
One common way to represent binary data is by encoding it in hexadecimal - in that case, the output consists only of the characters [0-9a-f].
sha1 returns a binary string. Some sha1 functions may, as a convenience, also encode that binary string into hexadecimal or base64 - if so, the result will be ASCII characters. But sha1 itself does not return ASCII.
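For example, in Python's hashlib (used here only to illustrate the general point), the raw digest is 20 arbitrary bytes, while hexdigest() gives the 40-character, all-ASCII representation:
import hashlib

# Hash the UTF-8 bytes of a Unicode string.
h = hashlib.sha1('héllo wörld'.encode('utf-8'))

raw = h.digest()          # 20 raw bytes; not guaranteed to be printable ASCII
print(len(raw), raw)

hexed = h.hexdigest()     # 40 hexadecimal characters, always ASCII
print(len(hexed), hexed)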
