codeUnits property vs utf8.encode function in Dart

I have this little code:
import 'dart:convert';

void main(List<String> args) {
  const data = 'amigo+/=:chesu';
  var encoded = base64Encode(utf8.encode(data));
  var encoded2 = base64Encode(data.codeUnits);
  var decoded = utf8.decode(base64Decode(encoded));
  var decoded2 = utf8.decode(base64Decode(encoded2));
  print(encoded);
  print(encoded2);
  print(decoded);
  print(decoded2);
}
The output is:
YW1pZ28rLz06Y2hlc3U=
YW1pZ28rLz06Y2hlc3U=
amigo+/=:chesu
amigo+/=:chesu
The codeUnits property gives an unmodifiable list of the string's UTF-16 code units. Is it OK to use the utf8.decode function here, or what function should be used for encoded2?

It's simply not a good idea to do base64Encode(data.codeUnits), because base64Encode encodes bytes, and data.codeUnits isn't necessarily a list of bytes.
Here they happen to be bytes (all the characters of the string have code points below 256; they are even ASCII), but you shouldn't rely on that.
Using utf8.encode before base64Encode is good. It works for all strings.
The best way to convert UTF-16 code units back to a String is String.fromCharCodes.
Here you are using base64Encode(data.codeUnits), which only works if the data string contains only code units up to 255. If you assume that, then decoding can be done with either latin1.decode or String.fromCharCodes.
Using ascii.decode and utf8.decode also works if the string only contains ASCII (which it does here, but that isn't guaranteed just because base64Encode succeeded).
In short, don't do base64Encode(data.codeUnits). Convert the string to bytes before doing base64Encode, then use the reverse conversion to convert bytes back to strings.
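A minimal Dart sketch of both round trips, using only dart:convert; the latin1 route plays the role of the codeUnits trick, and latin1.encode throws if any code unit is above 255:
import 'dart:convert';

void main() {
  const data = 'amigo+/=:chesu';

  // Safe for any string: UTF-8 bytes in, UTF-8 bytes out.
  final encoded = base64Encode(utf8.encode(data));
  print(utf8.decode(base64Decode(encoded))); // amigo+/=:chesu

  // Only valid when every code unit is <= 255: treat the code units as
  // Latin-1 bytes, and reverse with latin1.decode (or String.fromCharCodes).
  final encoded2 = base64Encode(latin1.encode(data));
  print(latin1.decode(base64Decode(encoded2))); // amigo+/=:chesu
}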

I tried this
print(utf8.decode('use â€œsmartâ€ symbols like â€˜ thisâ€™'.codeUnits));
and got this
use “smart” symbols like ‘ this’
The ” and ‘ are smart-punctuation characters from the iOS keyboard.
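That works because the string literal is already mojibake: its code units are exactly the UTF-8 bytes of the smart quotes, so utf8.decode(s.codeUnits) reinterprets them correctly. A small Dart sketch of the effect (the single '“' example is mine, not from the comment):
import 'dart:convert';

void main() {
  // The UTF-8 bytes of '“' mis-decoded as Latin-1 give the mojibake 'â\u0080\u009C'.
  final mojibake = latin1.decode(utf8.encode('“'));

  // Its code units are exactly those UTF-8 bytes, so decoding them as
  // UTF-8 recovers the original character.
  print(utf8.decode(mojibake.codeUnits)); // “
}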

Related

Converting Extended ASCII characters in Dart

My flutter app retrieves information via a REST interface which can contain Extended ASCII characters e.g. e-acute 0xe9. How can I convert this into UTF-8 (e.g 0xc3 0xa9) so that it displays correctly?
0xE9 corresponds to e-acute (é) in the ISO-8859/Latin 1 encoding. (It's one of many possible encodings for "extended ASCII", although personally I associate the term "extended ASCII" with code page 437.)
You can decode it to a Dart String (which internally stores UTF-16) using Latin1Codec. If you really want UTF-8, you can encode that String to UTF-8 afterward with Utf8Codec.
import 'dart:convert';

void main() {
  var s = latin1.decode([0xE9]);
  print(s); // Prints: é
  var utf8Bytes = utf8.encode(s);
  print(utf8Bytes); // Prints: [195, 169]
}
I was getting confused because sometimes the data contained extended ASCII characters and sometimes UTF-8, and when I tried doing a UTF-8 decode it baulked at the extended ASCII.
I fixed it by attempting the utf8 decode and catching the error when the data is extended ASCII, which seems to decode it OK.
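A sketch of that fallback in Dart; decodeUtf8OrLatin1 is just an illustrative name, and it assumes the bytes are either valid UTF-8 or Latin-1:
import 'dart:convert';

// Try UTF-8 first; if the bytes aren't valid UTF-8, fall back to Latin-1.
String decodeUtf8OrLatin1(List<int> bytes) {
  try {
    return utf8.decode(bytes);
  } on FormatException {
    return latin1.decode(bytes);
  }
}

void main() {
  print(decodeUtf8OrLatin1([0xC3, 0xA9])); // é (valid UTF-8)
  print(decodeUtf8OrLatin1([0xE9]));       // é (Latin-1 fallback)
}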

Replace double backslash in Dart

I have this escaped string:
\u0414\u043B\u044F \u043F\u0440\u043E\u0434\u0430\u0436\u0438
\u043D\u0435\u0434\u0432\u0438\u0436\u0438\u043C\u043E\u0441\u0442\u0438
If I do:
print('\u0414\u043B\u044F \u043F\u0440\u043E\u0434\u0430\u0436\u0438 \u043D\u0435\u0434\u0432\u0438\u0436\u0438\u043C\u043E\u0441\u0442\u0438');
Console will show me:
Для продажи недвижимости
But if I get a string from the server that has been escaped twice:
\\u0414\\u043B\\u044F
\\u043F\\u0440\\u043E\\u0434\\u0430\\u0436\\u0438
\\u043D\\u0435\\u0434\\u0432\\u0438\\u0436\\u0438\\u043C\\u043E\\u0441\\u0442\\u0438
And do some replace job:
var result = string.replaceAll(new RegExp(r'\\'), r'\');
The compiler will not decode those characters and will show the same escaped string:
print(result);
Console:
\u0414\u043B\u044F \u043F\u0440\u043E\u0434\u0430\u0436\u0438
\u043D\u0435\u0434\u0432\u0438\u0436\u0438\u043C\u043E\u0441\u0442\u0438
How can I remove those redundant slashes?
In string literals in Dart source files, \u0414 is a literal representing a unicode code point, whereas in the case of data returned from the server, you're just getting back a string containing backslashes, us, and digits that looks like a bunch of unicode code point literals.
The ideal fix is to have your server return the UTF-8 string you'd like to display rather than a string that uses Dart's string literal syntax that you need to parse. Writing a proper parser for such strings is fairly involved. You can take a look at unescapeCodeUnits in the Dart SDK for an example.
A very inefficient (not to mention entirely hacky and unsafe for real-world use) means of decoding this particular string would be to extract the string representations of the unicode codepoints with a RegExp, parse the hex to an int, then use String.fromCharCode().
Note: the following code is absolutely not safe for production use and doesn't match other valid Dart code point literals such as \u{1f601}, or reject entirely invalid literals such as \uffffffffff.
void main() {
  // Match \\u0123-style substrings (note this will also match invalid
  // codepoints such as \\u123456789).
  final RegExp r = RegExp(r'\\\\u([0-9a-fA-F]+)');

  // Sample string to parse.
  final String source =
      r'\\u0414\\u043B\\u044F \\u043F\\u0440\\u043E\\u0434\\u0430\\u0436\\u0438 \\u043D\\u0435\\u0434\\u0432\\u0438\\u0436\\u0438\\u043C\\u043E\\u0441\\u0442\\u0438';

  // Replace each \\u0123 escape with the decoded codepoint.
  final String decoded = source.replaceAllMapped(r, (Match m) {
    // Extract the parenthesised hex string: '\\u0414' -> '0414'.
    final String hexString = m.group(1)!;
    // Parse the hex string to an int.
    final int codepoint = int.parse(hexString, radix: 16);
    // Convert the codepoint to a string.
    return String.fromCharCode(codepoint);
  });

  print(decoded); // Для продажи недвижимости
}

Conversion of sequence of bytes to ASCII string in lua

I am trying to write a custom dissector for Wireshark that changes the byte/hex output into an ASCII string.
I was able to write the body of this dissector and it works. My only problem is conversion of this data to ASCII string.
Wireshark declares this data to be sequence of bytes.
To Lua the data type is userdata (tested using type(data)).
If I simply convert it to string using tostring(data) my dissector returns 24:50:48, which is the exact hex representation of bytes in an array.
Is there any way to directly convert this byte sequence to ASCII, or can you help me convert this colon-separated string to an ASCII string? I am totally new to Lua. I've tried something like split(tostring(data), ":") but this returns: Lua Error: attempt to call global 'split' (a nil value)
Using Jakuje's answer I was able to create something like this:
function isempty(s)
  return s == nil or s == ''
end

data = "24:50:48:49:4A"
s = ""
-- Split on ':' and turn each hex byte into its ASCII character.
for i in string.gmatch(data, "[^:]*") do
  if not isempty(i) then
    print(string.char(tonumber(i, 16)))
    s = s .. string.char(tonumber(i, 16))
  end
end
print(s)  -- "$PHIJ"
I am not sure if this is effective, but at least it works ;)
There is no such function as split in Lua (consulting the reference manual is a good start). You should probably use the string.gmatch function, as described on the wiki:
data = "24:50:48"
for i in string.gmatch(data, "[^:]*") do
print(i)
end
Then you are looking for the string.char function to convert each byte value to an ASCII character.
You need to mark the range of bytes in the buffer that you're interested in and convert it to the type you want:
data:range(offset, length):string()
-- or just call it, which works the same thanks to __call metamethod
data(offset, length):string()
See TvbRange description in https://wiki.wireshark.org/LuaAPI/Tvb for full list of available methods of converting buffer range data to different types.

UInt8 XOR'd array result to NSString conversion returns nil every time

I'm having issues working with iOS Swift 2.0 to perform an XOR on a [UInt8] and convert the XORd result to a String. I'm having to interface with a crude server that wants to do simple XOR encryption with a predefined array of UInt8 values and return that result as a String.
Using iOS Swift 2.0 Playground, create the following array:
let xorResult : [UInt8] = [24, 48, 160, 212] // XORd result
let result = NSString(bytes: xorResult, length: xorResult.count, encoding: NSUTF8StringEncoding)
The result is always nil. If you remove the 160 and 212 values from the array, NSString is not nil. If I switch to NSUTF16StringEncoding then I do not receive nil; however, the server does not support UTF-16. I have tried converting the values to a hex string, then converting the hex string to NSData, then trying to convert that with NSUTF8StringEncoding, but it is still nil until I remove the 160 and 212. I know this algorithm works in Java, however in Java we're using a combination of char and StringBuilder and everything is happy. Is there a way around this in iOS Swift?
To store an arbitrary chunk of binary data as a string, you need a string encoding which maps each single byte (0 ... 255) to some character. UTF-8 does not have this property, as for example 160 is the start of a multi-byte UTF-8 sequence and not valid on its own.
The simplest encoding with this property is ISO Latin 1, aka ISO 8859-1, which is the ISO/IEC 8859-1 encoding when supplemented with the C0 and C1 control codes. It maps the Unicode code points U+0000 .. U+00FF to the bytes 0x00 .. 0xFF (compare 8859-1.TXT). This encoding is available for (NS)String as NSISOLatin1StringEncoding.
Please note: The result of converting an arbitrary binary chunk to a (NS)String with NSISOLatin1StringEncoding will contain embedded NUL and control characters. Some functions behave unexpectedly when used with such a string. For example, NSLog() terminates the output at the first embedded NUL character. This conversion is meant to solve the OP's concrete problem (creating a QR-code which is recognized by a 3rd party application). It is not meant as a universal mechanism to convert arbitrary data to a string which may be printed or presented in any way to the user.
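For comparison, Dart's latin1 codec has the same byte-preserving property; a minimal sketch using the byte values from the question:
import 'dart:convert';

void main() {
  final bytes = [24, 48, 160, 212]; // 160 and 212 are not valid standalone UTF-8

  // latin1 maps every byte 0x00-0xFF to the code point of the same value,
  // so the round trip through a String is lossless.
  final s = latin1.decode(bytes);
  print(latin1.encode(s)); // [24, 48, 160, 212]
}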

Handing strings with binary data in it using java.nio

I am having issues parsing text files that have illegal characters (binary markers) in them. An example would be something as follows:
test.csv
^000000^id1,text1,text2,text3
Here the ^000000^ is a textual representation of illegal characters in the source file.
I was thinking about using java.nio to validate each line before I process it, by introducing a Validator trait as follows:
import java.nio.charset._

trait Validator {
  private def encoder = Charset.forName("UTF-8").newEncoder

  def isValidEncoding(line: String): Boolean = {
    encoder.canEncode(line)
  }
}
Do you guys think this is the correct approach to handle the situation?
Thanks
It is too late once you already have a String; UTF-8 can always encode any string*. You need to go back to the point where you are decoding the file initially.
ISO-8859-1 is an encoding with interesting properties:
Literally any byte sequence is valid ISO-8859-1
The code point of each decoded character is exactly the same as the value of the byte it was decoded from
So you could decode the file as ISO-8859-1 and just strip non-English characters:
// Decode the raw bytes as ISO-8859-1 (this cannot fail), then strip the
// control and non-ASCII ranges, i.e. [\u0000-\u0019\u007F-\u00FF]:
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}
val bytes = Files.readAllBytes(Paths.get("test.csv"))
val str = new String(bytes, StandardCharsets.ISO_8859_1)
val cleaned = str.replaceAll("[\\x00-\\x19\\x7F-\\xFF]", "")
You can also iterate line-by-line, and ignore each line that contains a character in [\u0000-\u0019\u007F-\u00FF], if that's what you mean by validating a line before processing it.
It also occurred to me that the binary marker could be a BOM. You can use a hex editor to view the values.
*Except those with illegal surrogates which is probably not the case here.
Binary data is not a string. Don't try to hack around input sequences that would be illegal upon conversion to a String.
If your input is an arbitrary sequence of bytes (even if many of them conform to ASCII), don't even try to convert it to a String.
