Converting Extended ASCII characters in Dart

My Flutter app retrieves information via a REST interface which can contain extended ASCII characters, e.g. e-acute 0xE9. How can I convert this into UTF-8 (e.g. 0xC3 0xA9) so that it displays correctly?

0xE9 corresponds to e-acute (é) in the ISO-8859/Latin 1 encoding. (It's one of many possible encodings for "extended ASCII", although personally I associate the term "extended ASCII" with code page 437.)
You can decode it to a Dart String (which internally stores UTF-16) using Latin1Codec. If you really want UTF-8, you can encode that String to UTF-8 afterward with Utf8Codec.
import 'dart:convert';

void main() {
  var s = latin1.decode([0xE9]);
  print(s); // Prints: é

  var utf8Bytes = utf8.encode(s);
  print(utf8Bytes); // Prints: [195, 169]
}

I was getting confused because sometimes the data contained extended ASCII characters and sometimes UTF-8. When I tried doing a UTF-8 decode, it baulked at the extended ASCII.
I fixed it by trying the UTF-8 decode first and catching the error when the data is extended ASCII; the fallback seems to decode it OK.
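A minimal sketch of that try-then-fall-back approach in Dart (decodeLenient is just an illustrative name; the fall-back to latin1 is my reading of the comment above, following the earlier answer):

import 'dart:convert';

// Try UTF-8 first; if the bytes are not valid UTF-8, assume Latin-1.
String decodeLenient(List<int> bytes) {
  try {
    return utf8.decode(bytes);
  } on FormatException {
    return latin1.decode(bytes);
  }
}

void main() {
  print(decodeLenient([0xC3, 0xA9])); // valid UTF-8: é
  print(decodeLenient([0xE9]));       // invalid UTF-8, Latin-1 fallback: é
}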

Related

codeUnits property vs utf8.encode function in Dart

I have this little code:
import 'dart:convert';

void main(List<String> args) {
  const data = 'amigo+/=:chesu';
  var encoded = base64Encode(utf8.encode(data));
  var encoded2 = base64Encode(data.codeUnits);
  var decoded = utf8.decode(base64Decode(encoded));
  var decoded2 = utf8.decode(base64Decode(encoded2));
  print(encoded);
  print(encoded2);
  print(decoded);
  print(decoded2);
}
The output is:
YW1pZ28rLz06Y2hlc3U=
YW1pZ28rLz06Y2hlc3U=
amigo+/=:chesu
amigo+/=:chesu
The codeUnits property gives an unmodifiable list of the UTF-16 code units. Is it OK to use the utf8.decode function, or what function should be used for encoded2?
It's simply not a good idea to do base64Encode(data.codeUnits), because base64Encode encodes bytes, and data.codeUnits isn't necessarily a list of bytes.
Here it happens to be (because all the characters of the string have code points below 128, they are even ASCII).
Using utf8.encode before base64Encode is good. It works for all strings.
The best way to convert from UTF-16 code units back to a String is String.fromCharCodes.
Here you are using base64Encode(data.codeUnits), which only works if the data string contains only code units up to 255. If you assume that, then decoding can be done using either latin1.decode or String.fromCharCodes.
Using ascii.decode and utf8.decode also works if the string only contains ASCII (which it does here, but which isn't guaranteed by base64Encode succeeding).
In short, don't do base64Encode(data.codeUnits). Convert the string to bytes before doing base64Encode, then use the reverse conversion to convert the bytes back to a string.
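To see the difference concretely, here is a small sketch with a character whose code unit is above 255 (the test string is my own, not from the question):

import 'dart:convert';

void main() {
  const data = 'amigo ☕'; // U+2615 is a single code unit, 0x2615, above 255
  print(base64Encode(utf8.encode(data))); // YW1pZ28g4piV
  try {
    base64Encode(data.codeUnits); // these code units are not bytes
  } catch (e) {
    print(e); // base64Encode rejects values above 255
  }
}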
I tried this
print(utf8.decode('use âsmartâ symbols like â thisâ'.codeUnits));
and got this
use “smart” symbols like ‘ this’
The ” and ‘ are smart-quote characters from the iOS keyboard.
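What is likely happening there, sketched in Dart: the smart quotes arrived as UTF-8 bytes but were decoded as Latin-1 somewhere upstream, so the string's code units are exactly the raw UTF-8 bytes (several of them invisible C1 control characters, which is why only the â shows), and utf8.decode on the code units re-interprets them correctly:

import 'dart:convert';

void main() {
  // Simulate the upstream mistake: UTF-8 bytes decoded as Latin-1.
  var mojibake = latin1.decode(utf8.encode('use “smart” symbols'));
  print(mojibake); // looks like: use âsmartâ symbols (C1 controls are invisible)
  print(utf8.decode(mojibake.codeUnits)); // use “smart” symbols
}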

How to convert Arabic digits to numeric in Ruby

Params also include Arabic digits, which I want to convert into ASCII digits:
"lexus/yr_٢٠٠١_٢٠٠٦"
I tried this one
params[:query].tr!('٠١٢٣٤٥٦٧٨٩','0123456789').delete!(" ")
but it gives an error
Encoding::CompatibilityError Exception: incompatible character encodings: UTF-8 and ASCII-8BIT
To work around that I do
params[:query].force_encoding('utf-8').encode.tr!('٠١٢٣٤٥٦٧٨٩','0123456789').delete!(" ")
How can I convert this?
If you have enforced UTF-8 encoding then this should work fine.
str = "lexus/yr_٢٠٠١_٢٠٠٦"
str.tr('٠١٢٣٤٥٦٧٨٩','0123456789')
returns "lexus/yr_2001_2006"
ASCII-8BIT is not really an encoding: it means binary data, not something text-based, and transcoding ASCII-8BIT to UTF-8 is not a meaningful operation. I would recommend ensuring that the request that passes the query parameter from your text field uses a valid string encoding, if you can control this. You can use Ruby's String#valid_encoding? method to check that you are receiving a correctly encoded string.

Lua hex string to ASCII?

I want to convert a hex string to an ASCII character (for the game ROBLOX).
Here's the page for the ASCII icon:
http://www.fileformat.info/info/unicode/char/25ba/index.htm
Although I'm not even sure that Lua supports that icon.
EDIT:
Turns out ROBLOX doesn't support UTF-8 symbols at all due to their 'chat filtering'.
Strings in Lua are encoding-agnostic and you can just use the character in the string:
print"►"
Alternatively:
Output the Unicode code point directly with print"\u{25BA}" (the \u escape requires Lua 5.3+).
Output the UTF-8 encoding directly with print"\xE2\x96\xBA".
Output the UTF-8 encoding directly with print"\226\150\186".

UInt8 XOR'd array result to NSString conversion returns nil every time

I'm having issues working with iOS Swift 2.0 to perform an XOR on a [UInt8] and convert the XORd result to a String. I'm having to interface with a crude server that wants to do simple XOR encryption with a predefined array of UInt8 values and return that result as a String.
Using iOS Swift 2.0 Playground, create the following array:
let xorResult : [UInt8] = [24, 48, 160, 212] // XORd result
let result = NSString(bytes: xorResult, length: xorResult.count, encoding: NSUTF8StringEncoding)
The result is always nil. If you remove the 160 and 212 values from the array, NSString is not nil. If I switch to NSUTF16StringEncoding then I do not receive nil, however, the server does not support UTF16. I have tried converting the values to a hex string, then converting the hex string to NSData, then try to convert that to NSUTF8StringEncoding but still nil until I remove the 160 and 212. I know this algorithm works in Java, however in Java we're using a combination of char and StringBuilder and everything is happy. Is there a way around this in iOS Swift?
To store an arbitrary chunk of binary data as a string, you need a string encoding which maps each single byte (0 ... 255) to some character. UTF-8 does not have this property: for example, 160 is the start of a multi-byte UTF-8 sequence and not valid on its own.
The simplest encoding with this property is ISO Latin 1, aka ISO-8859-1, which is the ISO/IEC 8859-1 encoding supplemented with the C0 and C1 control codes. It maps the Unicode code points U+0000 .. U+00FF to the bytes 0x00 .. 0xFF (compare 8859-1.TXT). This encoding is available for (NS)String as NSISOLatin1StringEncoding.
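That byte-to-code-point property is the same one the Latin-1 answers above rely on; a quick round-trip sketch in Dart (this compilation's main language):

import 'dart:convert';

void main() {
  final bytes = [24, 48, 160, 212]; // the XORd result from the question
  final s = latin1.decode(bytes);   // U+0018, U+0030, U+00A0, U+00D4
  print(latin1.encode(s));          // lossless round trip: [24, 48, 160, 212]
}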
Please note: The result of converting an arbitrary binary chunk to a (NS)String with NSISOLatin1StringEncoding will contain embedded NUL and control characters. Some functions behave unexpectedly when used with such a string. For example, NSLog() terminates the output at the first embedded NUL character. This conversion is meant to solve OP's concrete problem (creating a QR-code which is recognized by a 3rd party application). It is not meant as a universal mechanism to convert arbitrary data to a string which may be printed or presented in any way to the user.

Handling strings with binary data in them using java.nio

I am having issues parsing text files that have illegal characters (binary markers) in them. An example would be as follows:
test.csv
^000000^id1,text1,text2,text3
Here the ^000000^ is a textual representation of illegal characters in the source file.
I was thinking about using java.nio to validate each line before I process it, by introducing a Validator trait as follows:
import java.nio.charset._

trait Validator {
  private def encoder = Charset.forName("UTF-8").newEncoder

  def isValidEncoding(line: String): Boolean = {
    encoder.canEncode(line)
  }
}
Do you guys think this is the correct approach to handle the situation?
Thanks
It is too late once you already have a String: UTF-8 can always encode any string*. You need to go back to the point where you are decoding the file initially.
ISO-8859-1 is an encoding with interesting properties:
Literally any byte sequence is valid ISO-8859-1
The code point of each decoded character is exactly the same as the value of the byte it was decoded from
So you could decode the file as ISO-8859-1 and just strip non-English characters:
// Scala (the pseudocode made runnable): decode as ISO-8859-1, then strip the unwanted ranges
val bytes = java.nio.file.Files.readAllBytes(java.nio.file.Paths.get("test.csv"))
val str = new String(bytes, java.nio.charset.StandardCharsets.ISO_8859_1)
val cleaned = str.replaceAll("[\u0000-\u0019\u007F-\u00FF]", "")
You can also iterate line-by-line, and ignore each line that contains a character in [\u0000-\u0019\u007F-\u00FF], if that's what you mean by validating a line before processing it.
It also occurred to me that the binary marker could be a BOM. You can use a hex editor to view the values.
*Except strings with unpaired surrogates, which is probably not the case here.
Binary data is not a string. Don't try to hack around input sequences that would be illegal upon conversion to a String.
If your input is an arbitrary sequence of bytes (even if many of them conform to ASCII), don't even try to convert it to a String.

Resources