golang convert iso8859-1 to utf8 - character-encoding

I am trying to convert an ISO 8859-1 encoded string to UTF-8.
The following function works with my test data, which contains German umlauts, but I'm not quite sure what source encoding the rune(b) conversion assumes. Does it assume some kind of default encoding, e.g. ISO 8859-1, or is there a way to tell it which encoding to use?
func toUtf8(iso8859_1_buf []byte) string {
    var buf = bytes.NewBuffer(make([]byte, len(iso8859_1_buf)*4))
    for _, b := range(iso8859_1_buf) {
        r := rune(b)
        buf.WriteRune(r)
    }
    return string(buf.Bytes())
}

rune is an alias for int32, and when it comes to encoding, a rune is assumed to hold a Unicode code point. So the value b in rune(b) should be a Unicode value. For 0x00–0xFF these values are identical to Latin-1, so you don't have to worry about it.
Then you need to encode the runes into UTF-8. But this encoding is simply done by converting a []rune to string.
This is an example of your function without using the bytes package:
func toUtf8(iso8859_1_buf []byte) string {
    buf := make([]rune, len(iso8859_1_buf))
    for i, b := range iso8859_1_buf {
        buf[i] = rune(b)
    }
    return string(buf)
}
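For example, feeding it a few ISO 8859-1 encoded German umlauts (a quick sketch, not part of the original answer; 0xE4, 0xF6 and 0xFC are the Latin-1 codes for ä, ö and ü):
latin1 := []byte{0xE4, 0xF6, 0xFC}
s := toUtf8(latin1)
fmt.Println(s)      // äöü
fmt.Println(len(s)) // 6: each umlaut takes two bytes in UTF-8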

The effect of
r := rune(expression)
is:
Declare variable r with type rune (alias for int32).
Initialize variable r with the value of expression.
No (re)encoding is involved, and specifying one would require explicitly writing/handling that re-encoding in code. Luckily, in this case no re-encoding is necessary: Unicode incorporates the code points of ISO 8859-1 directly (the first 256 Unicode code points are identical to Latin-1), in the same way it incorporates ASCII.
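A small sketch to confirm that identity (my addition, not from the original answer):
b := byte(0xE4)                      // "ä" in ISO 8859-1
r := rune(b)                         // U+00E4, the same code point in Unicode
fmt.Printf("%U %q\n", r, string(r))  // U+00E4 "ä"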

Related

from utf8 to latin1

I have a UTF-8 string like this:
=C3=A0=C3=A8=C3=AC=C3=B2=C3=B9
I need to decode the string to latin1. The expected result is:
àèìòù
This is what I tried without success:
utf8.decode(stringData.runes.toList())
Solved.
Thanks to @lenz's comment and the JS function found here, I came up with this function:
String decodeQuotedPrintable(String input) {
    return input
        // https://tools.ietf.org/html/rfc2045#section-6.7, rule 3:
        // “Therefore, when decoding a `Quoted-Printable` body, any trailing white
        // space on a line must be deleted, as it will necessarily have been added
        // by intermediate transport agents.”
        .replaceAll(RegExp(r'[\t\x20]$', multiLine: true), '')
        // Remove hard line breaks preceded by `=`. Proper `Quoted-Printable`-
        // encoded data only contains CRLF line endings, but for compatibility
        // reasons we support separate CR and LF too.
        .replaceAll(RegExp(r'=(?:\r\n?|\n|$)', multiLine: true), '')
        // Decode escape sequences of the form `=XX` where `XX` is any
        // combination of two hexadecimal digits. For optimal compatibility,
        // lowercase hexadecimal digits are supported as well. See
        // https://tools.ietf.org/html/rfc2045#section-6.7, note 1.
        .replaceAllMapped(RegExp(r'=([a-fA-F\d]{2})'), (match) {
            var codePoint = int.parse(match[1], radix: 16);
            return String.fromCharCode(codePoint);
        });
}
and with this code:
var output = decodeQuotedPrintable(input);
output = utf8.decode(latin1.encode(output));
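For comparison, Go's standard library can do the same decode directly on the bytes, which avoids the latin1/utf8 round trip; a minimal sketch using mime/quotedprintable (my addition, not part of the original answer):
import (
    "io"
    "mime/quotedprintable"
    "strings"
)

// decodeQP decodes a Quoted-Printable string; the =XX escapes become raw bytes,
// which in this case are already UTF-8, so no re-encoding step is needed.
func decodeQP(input string) (string, error) {
    raw, err := io.ReadAll(quotedprintable.NewReader(strings.NewReader(input)))
    if err != nil {
        return "", err
    }
    return string(raw), nil
}

// decodeQP("=C3=A0=C3=A8=C3=AC=C3=B2=C3=B9") yields "àèìòù".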

Convert string to base64 byte array in swift and java give different value

In the case of Android everything is working perfectly. I want to implement the same feature in iOS too, but I am getting different values. Please check the description below.
In Java/Android Case:
I tried to convert the string to a base64 byte array in Java like this:
byte[] data1 = Base64.decode(balance, Base64.DEFAULT);
Output:
In Swift3/iOS Case:
I tried to convert the string to a base64 byte array in Swift like this:
let data:Data = Data(base64Encoded: balance, options: NSData.Base64DecodingOptions(rawValue: 0))!
let data1:Array = (data.bytes)
Output:
Finally solved:
This is due to signed versus unsigned integers (unsigned covers 0 to 255, signed covers -128 to 127). Here we need to convert the UInt8 array to an Int8 array, and then the problem is solved.
let intArray = data1.map { Int8(bitPattern: $0) }
In no case should you try to compare data on two systems the way you just did. That goes for all types, but especially for raw data.
Raw data is NOT presentable without additional context, which means any system that does present it may choose how to do so: raw data may represent text in UTF-8 or ASCII, a JPEG or PNG image, raw RGB pixel data, an audio sample, or anything else. In your case one system is showing the bytes as a list of signed 8-bit integers while the other uses unsigned 8-bit integers for the same thing. Another system might, for instance, show you a hex string, which would look completely different.
As @Larme already mentioned, these are the same values; it is safe to assume that one system shows signed and the other unsigned representations. So to convert from signed (Android) to unsigned (iOS) you convert negative values as unsigned = 256 + signed, so for instance -55 => 256 + (-55) = 201.
If you really need to compare data in your case, it is best to save it to a file as raw data, transfer that file to the other system, and compare the native raw data to the file contents to check whether there really is a difference.
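The same relationship is easy to check in Go, where byte is an alias for uint8 (an illustration of the arithmetic above, not part of the original answer):
b := byte(201)            // unsigned view of the byte
s := int8(b)              // same bit pattern viewed as signed: -55
fmt.Println(b, s)         // 201 -55
fmt.Println(int(s) + 256) // 201: unsigned = 256 + signed for negative values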
EDIT (from comment):
Printing raw data as a string is a problem, but there are a few ways. The issue is that many byte values are not printable as characters: some are whitespace or reserved control codes, and above all a value of 0 usually means the end of a string, yet it may occur in the middle of your byte sequence.
So you already have two ways of printing the data byte by byte, showing the corresponding Int8 or UInt8 values. As described in the comments, converting directly to a string may not work as simply as:
let string = String(data: data, encoding: .utf8) // Will return nil for strange strings
One way of converting data to string may be to convert each byte into a corresponding character. Check this code:
let characterSequence = data.map { UnicodeScalar($0) } // Create an array of Unicode scalars from the bytes
let stringArray = characterSequence.map { String($0) } // Create an array of single-character strings from the scalars
let myString = stringArray.reduce("", { $0 + $1 }) // Convert an array of strings to a single string
let myString2 = data.reduce("", { $0 + String(UnicodeScalar($1)) }) // Same thing in a single line
Then to test it I used:
let data = Data(bytes: Array(0...255)) // Generate test data with byte values 0, 1, 2, ... up to 255
let myString2 = data.reduce("", { $0 + String(UnicodeScalar($1)) })
print(myString2)
The printing result is:
!"#$%&'()*+,-./0123456789:;<=>?#ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~ ¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþ
Then another popular way is using a hex string. It can be displayed as:
let hexString = data.reduce("", { $0 + String(format: "%02hhx",$1) })
print(hexString)
And with the same data as before the result is:
000102030405060708090a0b0c0d0e0f101112131415161718191a1b1c1d1e1f202122232425262728292a2b2c2d2e2f303132333435363738393a3b3c3d3e3f404142434445464748494a4b4c4d4e4f505152535455565758595a5b5c5d5e5f606162636465666768696a6b6c6d6e6f707172737475767778797a7b7c7d7e7f808182838485868788898a8b8c8d8e8f909192939495969798999a9b9c9d9e9fa0a1a2a3a4a5a6a7a8a9aaabacadaeafb0b1b2b3b4b5b6b7b8b9babbbcbdbebfc0c1c2c3c4c5c6c7c8c9cacbcccdcecfd0d1d2d3d4d5d6d7d8d9dadbdcdddedfe0e1e2e3e4e5e6e7e8e9eaebecedeeeff0f1f2f3f4f5f6f7f8f9fafbfcfdfeff
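The same hex dump is a one-liner in Go with the standard encoding/hex package (shown only for comparison; the answer above is Swift):
data := make([]byte, 256)
for i := range data {
    data[i] = byte(i) // byte values 0x00 through 0xff
}
fmt.Println(hex.EncodeToString(data)) // prints the same 000102...fdfeff string as above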
I hope this is enough, but in general you can do pretty much anything with an array of bytes and display it however you like. For instance, you could create an image treating the bytes as RGB data with 8 bits per component, if that made sense. It might sound silly, but if you are looking for patterns it can be quite a clever solution.

Convert base64-encoded string to normal string in swift 3

I want to convert the following base64-encoded String in Swift 3:
dfYcSGpvBqyzvkAXkdbHDA==
to its equivalent String:
uöHjo¬³¾#‘ÖÇ
Following websites do the job very fine:
http://www.motobit.com/util/base64-decoder-encoder.asp
http://www.utilities-online.info/base64/#.WG-FwrFh2Rs
So does PHP's base64_decode function. The documentation of this function says:
Returns FALSE if input contains character from outside the base64
alphabet.
But I am unable to do the same in Swift 3. The following code doesn't do the job either:
func convertBase64ToNormalString(base64String: String) -> String! {
    let decodedData = Data(base64Encoded: base64String, options: Data.Base64DecodingOptions())
    let bytes = decodedData?.bytes
    return String(bytes: bytes, encoding: NSUTF8StringEncoding)
}
Here is the contextual information about why I need to convert the base64 string into a string:
My PHP developer wants me to send all API parameter values encrypted with the AES algorithm. For that, I am using this lib.
He has given me the AES key in hex format (I mentioned it in my last question) and the IV in base64 (given above), and he instructed me to decode this base64 value before using it, because he is doing the same in his PHP code. Here is his PHP code for encryption and decryption:
function encryptParamAES($plaintext, $encryptionEnabled = true) {
    if (! $encryptionEnabled) {
        return $plaintext;
    }
    // --- ENCRYPTION ---
    // Constants =========================================================
    // the key should be random binary, use scrypt, bcrypt or PBKDF2 to
    // convert a string into a key
    // key is specified using hexadecimal
    $key = pack ( 'H*', "dcb04a9e103a5cd8b53763051cef09bc66abe029fdebae5e1d417e2ffc2a07a4" );
    // create a random IV to use with CBC encoding
    $iv_size = 16;
    $iv = base64_decode ( "dfYcSGpvBqyzvkAXkdbHDA==" );
    // End Of constants ===================================================
    // echo "IV: " . base64_encode ( $iv ) . "\n<br>";
    // echo "IV Size: " . $iv_size . "\n<br>";
    // show key size use either 16, 24 or 32 byte keys for AES-128, 192
    // and 256 respectively
    // $key_size = strlen ( $key );
    // echo "Key size: " . $key_size . "\n<br><br>";
    // creates a cipher text compatible with AES (Rijndael block size = 128)
    // to keep the text confidential
    // only suitable for encoded input that never ends with value 00h
    // (because of default zero padding)
    $ciphertext = mcrypt_encrypt ( MCRYPT_RIJNDAEL_128, $key, $plaintext, MCRYPT_MODE_CBC, $iv );
    // prepend the IV for it to be available for decryption
    $ciphertext = $iv . $ciphertext;
    // encode the resulting cipher text so it can be represented by a string
    $ciphertext_base64 = base64_encode ( $ciphertext );
    return $ciphertext_base64;
}
function decryptParamAES($ciphertext_base64, $encryptionEnabled = true) {
    if (! $encryptionEnabled) {
        return $ciphertext_base64;
    }
    // --- DECRYPTION ---
    // Constants =========================================================
    // the key should be random binary, use scrypt, bcrypt or PBKDF2 to
    // convert a string into a key
    // key is specified using hexadecimal
    $key = pack ( 'H*', "dcb04a9e103a5cd8b53763051cef09bc66abe029fdebae5e1d417e2ffc2a07a4" );
    // create a random IV to use with CBC encoding
    $iv_size = 16;
    $iv = base64_decode ( "dfYcSGpvBqyzvkAXkdbHDA==" );
    // End Of constants ===================================================
    $ciphertext_dec = base64_decode ( $ciphertext_base64 );
    // retrieves the IV, iv_size should be created using mcrypt_get_iv_size()
    $iv_dec = substr ( $ciphertext_dec, 0, $iv_size );
    // retrieves the cipher text (everything except the $iv_size in the front)
    $ciphertext_dec = substr ( $ciphertext_dec, $iv_size );
    // may remove 00h valued characters from end of plain text
    $plaintext_dec = mcrypt_decrypt ( MCRYPT_RIJNDAEL_128, $key, $ciphertext_dec, MCRYPT_MODE_CBC, $iv_dec );
    return rtrim ( $plaintext_dec );
}
I just saw this PHP code and wondered why he is not using $iv as the last parameter of mcrypt_decrypt; I will update you on that. But the question remains the same: PHP's base64_decode does not return FALSE for the above base64 string!
I tested this function myself with the terminal command php test.php, where test.php contains the following code:
<?php
$iv = base64_decode ( "dfYcSGpvBqyzvkAXkdbHDA==" );
echo $iv;
?>
And the output was: u?Hjo???#???
Looking at your revised question, you're trying to take this base-64 string and use it as the iv in your AES algorithm. I can understand why you are wondering how to convert the resulting Data into a String, but you should not do that. Yes, there's a rendition of AES that expects the iv as a string, but there's another rendition that expects an Array<UInt8>. So, just as MartinR said in his answer to your other question, build an array of UInt8 instead, like so:
let iv = Array(Data(base64Encoded: "dfYcSGpvBqyzvkAXkdbHDA==")!)
That resulting iv is an Array<UInt8> (also known as [UInt8]). You can use that with your AES function.
My original discussion about converting Data objects to UTF-8 strings is below. But the key message is that you shouldn't try to do so. Just build your array of UInt8 and use that with your library's AES function.
Looking at your other question (Convert hex-encoded String to String in Swift 3), you revealed in comments that you were dealing with an AES key. I'm suspicious that we're dealing with a similar issue here (though that was 32 bytes of data, and here we have 16 bytes).
Bottom line, I'd suggest you completely drop this "how to I get a string representation of the data captured in this base-64 string" line of inquiry. If it's an encryption key (or some token or whatever), don't bother trying to represent it as a string. (This is the raison d'être of base-64, to create transmittable string representations of data that isn't a string.)
I'd suggest you step back and describe the broader problem that you are trying to solve. Stop trying to create strings out of these binary payloads. In your code snippet, you successfully create a Data from the base-64 string. The real question, I think, is not "how do I now get a string from that?", but rather "what do I do with this Data?"
We can't answer that question without more context about where you got this data and what it is for.
By the way, my original answer to your question is below.
The problem is that this base-64 string translates to 16 bytes of data whose hexadecimal representation is
75f61c48 6a6f06ac b3be4017 91d6c70c
But that payload is not a valid UTF-8 string. The second byte, f6, is the problem: a multi-byte lead byte must be followed by continuation bytes in the range 80–bf, and the next byte, 1c, is not (in fact, f6 is not even a legal lead byte in modern UTF-8, which allows lead bytes only up to f4).
So, this simply is not a valid UTF-8 string. Clearly those sites you're using are not gracefully detecting/handling invalid UTF-8 strings.
Perhaps the data in this base-64 string was converted from a string using a different encoding. Or, perhaps it was not originally a string at all. Not all binary payloads have clean string representations. Frankly, this is why we use base-64 representations in the first place, to come up with a text representation of a blob of data that is not a string.
If you provide more information about the source of the data contained in this base-64 string, we might be able to advise you further.
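That claim is easy to verify mechanically; for example in Go (a small sketch I am adding, outside the original Swift answer, using the encoding/base64, encoding/hex and unicode/utf8 packages):
raw, _ := base64.StdEncoding.DecodeString("dfYcSGpvBqyzvkAXkdbHDA==")
fmt.Println(hex.EncodeToString(raw)) // 75f61c486a6f06acb3be401791d6c70c
fmt.Println(utf8.Valid(raw))         // false: 0xf6 cannot start a valid UTF-8 sequence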

get length of string in UTF8

How can I get the length (not number of bytes) of a string in its UTF-8 encoded form (PHP's mb_strlen(.., 'UTF-8') equivalent)?
I tried string.characters.count but it does not return the correct length for certain characters like an emoji.
Example:
let s = "✌🏿️"
print(s.characters.count) // prints 2, but should print 3.
You can access the UTF-8 encoding of a string with the .utf8 property. Use count on that to get the number of UTF-8 code units in the string:
let string = "\u{1f603}" // One of the smiley face emojis...
print(string.utf8.count) // prints "4"
Based on your edited question, what you are probably looking for is the number of UnicodeScalars used to encode the string. You access that with the unicodeScalars property:
let s = "✌🏿️"
print(s.unicodeScalars.count) // prints 3
The reason everyone is confused is that your original question asks for the length of the string in its UTF-8 encoded form, while the answer you actually wanted has nothing to do with the UTF-8 encoding at all.
I think you are confused about the difference between Unicode "extended grapheme clusters", Unicode code points, and the various encodings (like UTF-8) that can be used to encode a Unicode code point.
A Character in Swift represents what Unicode calls an "extended grapheme cluster". That is to say, it is a single visual character, even if it is made up of multiple Unicode code points.
A Unicode code point is a single linguistic symbol that is given a numeric value (a 21-bit number, which Swift stores in a 32-bit type). Two or more Unicode code points can combine to create a single Character. In Swift, the Unicode code point is represented by the UnicodeScalar type.
When it comes time to store a string, or send it over the internet, or otherwise turn it into data that is represented by bytes, you have to decide how to encode it. There are all kinds of encodings; the most common is probably UTF-8, which encodes the string as a series of UInt8 values.
That's just a brief snippet of the difference between the three concepts. It is actually a really interesting subject and if you Google some of those terms, you will find a lot more good information.
let str = "ačŘ"
print("str has \(str.characters.count) characters") // 3
print("and \(str.utf8.count) bytes as encoded in UTF-8") // 5
update (based on your notes)
let s = "✌🏿️"
let arr:[UInt8] = [226, 156, 140, 240, 159, 143, 191, 239, 184, 143]
var arrCchar = arr.map { (uint8) -> Int8 in
    Int8(bitPattern: uint8)
}
arrCchar += [0] // to be null terminated
let str = String.fromCString(&arrCchar)
print(str) // Optional("✌🏿️")
s == str // TRUE !!!!
by characters
s.characters.forEach { (c) -> () in
    let str = String(c)
    print(str.utf8.map{$0}, "which represents character: ", c)
    str.unicodeScalars.forEach({ (u) -> () in
        print("composed from unicode scalar(s): ", u.debugDescription)
    })
}
/*
[226, 156, 140] which represents character: ✌
composed from unicode scalar(s): "\u{270C}"
[240, 159, 143, 191, 239, 184, 143] which represents character: 🏿️
composed from unicode scalar(s): "\u{0001F3FF}"
composed from unicode scalar(s): "\u{FE0F}"
*/
Every character in Unicode can be represented by one or more unicode scalars. A unicode scalar is a unique 21-bit number (and name) for a character or modifier, such as U+0061 for LOWERCASE LATIN LETTER A("a"), or U+1F425 for FRONT-FACING BABY CHICK ("\U0001f425").
When a Unicode string is written to a text file or some other storage, these unicode scalars are encoded in one of several Unicode-defined formats. Each format encodes the string in small chunks known as code units. These include the UTF-8 format (which encodes a string as 8-bit code units) and the UTF-16 format (which encodes a string as 16-bit code units).
// copied from the Apple Developer Swift programming guide
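For comparison, the same distinction shows up in Go (a sketch I am adding: Go's builtin len counts UTF-8 bytes and utf8.RuneCountInString counts Unicode scalars, while counting user-perceived characters the way Swift's Character does is not covered by the standard library):
s := "\u270C\U0001F3FF\uFE0F"          // the "✌🏿️" string from the question
fmt.Println(len(s))                    // 10: UTF-8 bytes
fmt.Println(utf8.RuneCountInString(s)) // 3: Unicode scalars (code points)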

Does golang provide htonl/htons?

In C programming, when I want to send an integer across the network, I need to use htonl() or htons() to convert the integer from host byte order to network byte order before sending it.
But in Go, I have checked the net package and can't find similar functions like htons/htonl. So how should I send an integer when using Go? Do I need to implement htons/htonl myself?
Network byte order is just big endian, so you can use the encoding/binary package to perform the encoding.
For example:
data := make([]byte, 6)
binary.BigEndian.PutUint16(data, 0x1011)
binary.BigEndian.PutUint32(data[2:6], 0x12131415)
Alternatively, if you are writing to an io.Writer, the binary.Write() function from the same package may be more convenient (again, using the binary.BigEndian value as the order argument).
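A brief sketch of that binary.Write variant (my addition, assuming a bytes.Buffer as the io.Writer):
var buf bytes.Buffer
if err := binary.Write(&buf, binary.BigEndian, uint32(0x12131415)); err != nil {
    // binary.Write only fails for unsupported types or a failing writer
    log.Fatal(err)
}
fmt.Printf("% x\n", buf.Bytes()) // 12 13 14 15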
I think what you're after is ByteOrder in encoding/binary.
A ByteOrder specifies how to convert byte sequences into 16-, 32-, or 64-bit unsigned integers.
These helpers just convert integers between big-endian and little-endian:
// NetToHostShort converts a 16-bit integer from network to host byte order, aka "ntohs"
func NetToHostShort(i uint16) uint16 {
    data := make([]byte, 2)
    binary.BigEndian.PutUint16(data, i)
    return binary.LittleEndian.Uint16(data)
}

// NetToHostLong converts a 32-bit integer from network to host byte order, aka "ntohl"
func NetToHostLong(i uint32) uint32 {
    data := make([]byte, 4)
    binary.BigEndian.PutUint32(data, i)
    return binary.LittleEndian.Uint32(data)
}

// HostToNetShort converts a 16-bit integer from host to network byte order, aka "htons"
func HostToNetShort(i uint16) uint16 {
    b := make([]byte, 2)
    binary.LittleEndian.PutUint16(b, i)
    return binary.BigEndian.Uint16(b)
}

// HostToNetLong converts a 32-bit integer from host to network byte order, aka "htonl"
func HostToNetLong(i uint32) uint32 {
    b := make([]byte, 4)
    binary.LittleEndian.PutUint32(b, i)
    return binary.BigEndian.Uint32(b)
}
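A quick usage check (note these helpers hard-code binary.LittleEndian as the host order, so they are only correct on little-endian machines such as x86 and most ARM setups):
fmt.Printf("%#x\n", HostToNetShort(0x1234))    // 0x3412
fmt.Printf("%#x\n", HostToNetLong(0x12131415)) // 0x15141312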
