Lua hex string to ASCII? - lua

I want to convert a hex string to an ASCII character (for the game ROBLOX).
Here's the page for the ASCII icon:
http://www.fileformat.info/info/unicode/char/25ba/index.htm
Although I'm not even sure that Lua supports that icon.
EDIT:
Turns out ROBLOX doesn't support UTF-8 symbols at all due to their 'chat filtering'.

Strings in Lua are encoding-agnostic and you can just use the character in the string:
print"►"
Alternatively:
Use the Unicode code point escape (Lua 5.3 and later), which emits the UTF-8 encoding: print"\u{25BA}".
Write the UTF-8 bytes with hexadecimal escapes (Lua 5.2 and later): print"\xE2\x96\xBA".
Write the UTF-8 bytes with decimal escapes: print"\226\150\186".
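To see why the three escape forms above are equivalent, here is a minimal sketch in Python rather than Lua. It parses the hex string from the question, turns it into a character, and lists the UTF-8 bytes in hex and decimal. The helper name hex_to_char is made up for this illustration.

```python
def hex_to_char(hex_string: str) -> str:
    """Interpret a hex string such as "25ba" as a Unicode code point."""
    return chr(int(hex_string, 16))

arrow = hex_to_char("25ba")          # U+25BA
utf8_bytes = arrow.encode("utf-8")   # the bytes Lua's \u{25BA} escape produces

print(arrow)
print(" ".join(f"{b:02X}" for b in utf8_bytes))  # hex form, as in \xE2\x96\xBA
print(list(utf8_bytes))                          # decimal form, as in \226\150\186
```

The hex and decimal listings are the same three bytes, which is why either escape style prints the same character.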

Related

Converting Extended ASCII characters in Dart

My flutter app retrieves information via a REST interface which can contain Extended ASCII characters e.g. e-acute 0xe9. How can I convert this into UTF-8 (e.g 0xc3 0xa9) so that it displays correctly?
0xE9 corresponds to e-acute (é) in the ISO-8859-1 (Latin-1) encoding. (It's one of many possible encodings for "extended ASCII", although personally I associate the term "extended ASCII" with code page 437.)
You can decode it to a Dart String (which internally stores UTF-16) using Latin1Codec. If you really want UTF-8, you can encode that String to UTF-8 afterward with Utf8Codec.
import 'dart:convert';

void main() {
  var s = latin1.decode([0xE9]);
  print(s); // Prints: é
  var utf8Bytes = utf8.encode(s);
  print(utf8Bytes); // Prints: [195, 169]
}
I was getting confused because sometimes the data contained extended ASCII characters and sometimes UTF-8 characters. When I tried a UTF-8 decode, it baulked at the extended ASCII.
I fixed it by trying the UTF-8 decode first and catching the error it throws on extended ASCII data; decoding that data as Latin-1 instead seems to work OK.
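The try-then-fall-back approach described in that comment can be sketched as follows, in Python rather than Dart. The function name decode_mixed is hypothetical, and the fallback assumes that anything which is not valid UTF-8 is Latin-1.

```python
def decode_mixed(raw: bytes) -> str:
    """Decode as UTF-8 if possible, otherwise assume Latin-1 (an assumption)."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a character, so this never fails.
        return raw.decode("latin-1")

print(decode_mixed(b"\xe9"))      # a lone Latin-1 e-acute byte
print(decode_mixed(b"\xc3\xa9"))  # the same character already in UTF-8
```

This is a heuristic: a Latin-1 byte sequence that happens to be valid UTF-8 will be decoded as UTF-8, so it only suits data where that collision is unlikely.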

What scheme is used to encode unicode characters in a .url shortcut?

What scheme is used to encode unicode characters in a windows url shortcut?
For example, a new shortcut for url "http://Ψαℕ℧▶" produces a .url file with the text:
[{000214A0-0000-0000-C000-000000000046}]
Prop3=19,2
[InternetShortcut]
IDList=
URL=http://?aN??/
[InternetShortcut.A]
URL=http://?aN??/
[InternetShortcut.W]
URL=http://+A6gDsSEVIScltg-/
What is the algorithm to decode "+A6gDsSEVIScltg-" to "Ψαℕ℧▶"?
I am not asking for API code, but I would like to know the encoding scheme details.
Note: The encoding scheme is not utf-8 nor utf-16 nor ucs-2 and no %encoding.
+A6gDsSEVIScltg- is the UTF-7 encoded form of Ψαℕ℧▶.
The correct way to process a .url file is to use the IUniformResourceLocator and IPropertyStorage interfaces from the CLSID_InternetShortcut COM object. See Internet Shortcuts on MSDN for details.
The answer (utf-7) allowed me to successfully develop the url conversion routine.
Let me summarize the steps for obtaining the Unicode URL from the [InternetShortcut.W] entry of a .url file:
. Pass ASCII characters through until CRLF, after making them internet safe.
. An unescaped + character starts a UTF-7 encoded Unicode sequence:
. Collect 6-bit groups from the base64-coded ASCII characters.
. For each 16 bits collected, convert the 16-bit value to UTF-8 (1, 2, or 3 bytes).
. Pass the generated UTF-8 bytes as %hh.
. Continue until a "-" character occurs.
. At that point, the bit collector should be zero.
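The steps above can be sketched roughly as follows in Python. This is an illustration, not the routine the poster wrote: the names B64 and utf7_to_text are invented here, the "+-" shorthand for a literal plus is handled as an extra case, and the 16-bit units are joined through UTF-16 so that surrogate pairs would also decode.

```python
from urllib.parse import quote

B64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

def utf7_to_text(s: str) -> str:
    """Decode the UTF-7 runs ("+...-") embedded in an otherwise ASCII string."""
    out = []
    i = 0
    while i < len(s):
        if s[i] != "+":
            out.append(s[i])          # plain ASCII passes through unchanged
            i += 1
            continue
        i += 1
        if i < len(s) and s[i] == "-":
            out.append("+")           # "+-" stands for a literal plus sign
            i += 1
            continue
        bits, nbits, units = 0, 0, []
        while i < len(s) and s[i] in B64:
            bits = (bits << 6) | B64.index(s[i])   # collect one 6-bit group
            nbits += 6
            if nbits >= 16:                        # a full 16-bit unit is ready
                nbits -= 16
                units.append((bits >> nbits) & 0xFFFF)
                bits &= (1 << nbits) - 1           # keep only the leftover bits
            i += 1
        if i < len(s) and s[i] == "-":
            i += 1                                 # the terminator is consumed
        raw = b"".join(u.to_bytes(2, "big") for u in units)
        out.append(raw.decode("utf-16-be"))
    return "".join(out)

url = utf7_to_text("http://+A6gDsSEVIScltg-/")
print(url)
print(quote(url, safe=":/"))   # UTF-8 bytes written as %hh
```

Python also ships a utf-7 codec, so "+A6gDsSEVIScltg-".encode("ascii").decode("utf-7") gives the same text and makes a handy cross-check for the manual version.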

How to convert arabic digits to numeric in ruby

My params can include Arabic-Indic digits, which I want to convert into ASCII digits:
"lexus/yr_٢٠٠١_٢٠٠٦"
I tried this one
params[:query].tr!('٠١٢٣٤٥٦٧٨٩','0123456789').delete!(" ")
but it gives an error
Encoding::CompatibilityError Exception: incompatible character encodings: UTF-8 and ASCII-8BIT
To work around that, I tried:
params[:query].force_encoding('utf-8').encode.tr!('٠١٢٣٤٥٦٧٨٩','0123456789').delete!(" ")
How can I convert this?
If you have enforced UTF-8 encoding then this should work fine.
str = "lexus/yr_٢٠٠١_٢٠٠٦"
str.tr('٠١٢٣٤٥٦٧٨٩','0123456789')
returns "lexus/yr_2001_2006"
ASCII-8BIT is not really a text encoding; it is Ruby's label for raw binary data, so transcoding ASCII-8BIT to UTF-8 is not a meaningful operation. I would recommend ensuring that the request that passes the query parameter through your text field uses a valid string encoding, if you can control this. You can use the String#valid_encoding? method in Ruby to check that you are receiving a correctly encoded string.
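For comparison, here is a minimal sketch of the same idea in Python rather than Ruby: decode the incoming bytes as UTF-8 first, then map the Arabic-Indic digits U+0660 to U+0669 onto 0 to 9. The names DIGIT_TABLE and normalize_digits are hypothetical, and the sketch assumes the request body really is UTF-8.

```python
# Translation table: U+0660..U+0669 (Arabic-Indic digits) -> "0".."9"
DIGIT_TABLE = {0x0660 + d: str(d) for d in range(10)}

def normalize_digits(raw: bytes) -> str:
    """Decode raw request bytes as UTF-8 (an assumption) and map the digits."""
    text = raw.decode("utf-8")
    return text.translate(DIGIT_TABLE).replace(" ", "")

query = "lexus/yr_\u0662\u0660\u0660\u0661_\u0662\u0660\u0660\u0666"
print(normalize_digits(query.encode("utf-8")))
```

Decoding up front means the translation only ever sees text, which is the situation the answer recommends arranging on the Ruby side as well.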

What character encoding are the following German words using?

I'm trying to process a German word list and can't figure out what encoding the file is in. The 'file' unix command says the file is "Non-ISO extended-ASCII text". Most of the words are in ascii, but here are the exceptions:
ANDR\x82
ATTACH\x82
C\x82ZANNE
CH\x83TEAU
CONF\x82RENCIER
FABERG\x82
L\x82VI-STRAUSS
RH\x93NETAL
P\xF2ANGE
Any hints would be great. Thanks!
EDIT: To be clear, the hex codes above are C hex string literals so replace \xXX with the literal hex value XX.
It looks like CP437 or CP852, assuming the \x82 sequences encode single bytes and are not literally four characters. At least, every line except the last decodes sensibly that way; the last line is a bit of a puzzle.
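One way to test a guess like this is to decode the sample bytes with each candidate code page and look at the result. The sketch below does that in Python with its built-in cp437 and cp852 codecs; the helper name decode_all is made up for the illustration.

```python
words = [
    b"ANDR\x82", b"ATTACH\x82", b"C\x82ZANNE", b"CH\x83TEAU",
    b"CONF\x82RENCIER", b"FABERG\x82", b"L\x82VI-STRAUSS",
    b"RH\x93NETAL", b"P\xF2ANGE",
]

def decode_all(byte_words, codec):
    """Decode every sample word with the given single-byte code page."""
    return [w.decode(codec) for w in byte_words]

for codec in ("cp437", "cp852"):
    print(codec, decode_all(words, codec))
```

Both code pages agree on 0x82, 0x83 and 0x93, which is why the sample cannot tell them apart; they differ on 0xF2, and neither reading of that byte gives an obvious word.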

Parsing \"–\" with Erlang re

I've parsed an HTML page with mochiweb_html and want to parse the following text fragment
0 – 1
Basically I want to split the string on the spaces and dash character and extract the numbers in the first characters.
Now the string above is represented as the following Erlang list
[48,32,226,128,147,32,49]
I'm trying to split it using the following regex:
{ok, P}=re:compile("\\xD2\\x80\\x93"), %% characters 226, 128, 147
re:split([48,32,226,128,147,32,49], P, [{return, list}])
But this doesn't work; it seems the \xD2 character is the problem [if I remove it from the regex, the split occurs]
Could someone possibly explain
what I'm doing wrong here ?
why the '–' character seemingly requires three integers for representation [226, 128, 147]
Thanks.
226,128,147 is E2,80,93 in hex.
> {ok, P} = re:compile("\xE2\x80\x93").
...
> re:split([48,32,226,128,147,32,49], P, [{return, list}]).
["0 "," 1"]
As to your second question, about why a dash takes 3 bytes to encode, it's because the dash in your input isn't an ASCII hyphen (hex 2D), but is a Unicode en-dash (hex 2013). Your code is receiving this in UTF-8 encoding, rather than the more obvious UCS-2 encoding. Hex 2013 comes out to hex E28093 in UTF-8 encoding.
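The byte arithmetic is easy to check. Here is a small sketch in Python rather than Erlang that encodes U+2013 as UTF-8 and then splits the question's byte list on those three bytes; the variable names are chosen for this example only.

```python
import re

en_dash = "\u2013"
encoded = en_dash.encode("utf-8")
print(list(encoded))            # decimal byte values of the en-dash
print(encoded.hex().upper())    # the same bytes in hex

data = bytes([48, 32, 226, 128, 147, 32, 49])   # the list from the question
parts = re.split(re.escape(encoded), data)
print(parts)
```

Matching on bytes, as here, mirrors what the Erlang answer does: the pattern is the three UTF-8 bytes, not the single code point.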
If your next question is "why UTF-8", it's because it's far easier to retrofit an old system using 8-bit characters and null-terminated C style strings to use Unicode via UTF-8 than to widen everything to UCS-2 or UCS-4. UTF-8 remains compatible with ASCII and C strings, so the conversion can be done piecemeal over the course of years, or decades if need be. Wide characters require a "Big Bang" one-time conversion effort, where everything has to move to the new system at once. UTF-8 is therefore far more popular on systems with legacies dating back to before the early 90s, when Unicode was created.
