lua string.upper not working with accented characters? - lua

I'm trying to convert some French text to upper case in lua, it is not converting the accented characters. Any idea why?
test script:
print('échelle')
print(string.upper('échelle'))
print('ÉCHELLE')
print(string.lower('ÉCHELLE'))
output:
échelle
éCHELLE
ÉCHELLE
Échelle

It might be a bit overkill, but you can do this with slnunicode (which is available in LuaRocks).
require "unicode"
print(unicode.utf8.upper("échelle"))
-- ÉCHELLE
You may need to use unicode.ascii.upper or unicode.latin1.upper depending on the encoding of your source files.

You need to set a suitable locale, which depends how these strings are encoded in the source.
You seem to be using Latin 1 because of the output you gave.
In this case, trying adding the line below at the top of your script:
os.setlocale("fr_FR.ISO8859-1")
This name is for Mac OS X. For Linux, try
os.setlocale("fr_FR.iso88591")
If you're using UTF, then setting a locale won't help because string.lower converts the string one byte at a time.

Lua just uses the C library function toupper, which AFAIK doesn't support accented characters. You'd need to write a routine for that yourself.

To explain this all more effectively, Lua does not have built-in support for non-ASCII strings. You can store a Latin-1 or UTF-8-encoded string, but none of the special string manipulation functions (upper, lower, etc) will work on any non-ASCII character.
There are Lua libraries that add varying degrees of Unicode support. So you will have to use one of them.

Related

Why is the following piece of Lua code, completely valid?

From my Lua knowledge (and according to what I have read in Lua manuals), I've always been under impression that an identifier in Lua is only limited to A-Z & a-z & _ & digits (and can not start using a digit nor be a reserved keyword i.e. local local = 123).
And now I have run into some (obfuscated) Lua program which uses all kind of weird characters for an identifier:
https://i.imgur.com/HPLKMxp.png
-- Most likely, copy+paste won't work. Download the file from https://tknk.io/7HHZ
print(_VERSION .. " " .. (jit and "JIT" or "non-JIT"))
local T = {}
T.math = T.math or {}
T.math.​â®â€‹âŞâ®â€‹­ď»żâ€Śâ€­âŽ­ = math.sin
T.math.â¬â€‹â­â¬â­â«â®â€­â€¬ = math.cos
for k, v in pairs(T.math) do print(k, v) end
Output:
Lua 5.1 JIT
â¬â€‹â­â¬â­â«â®â€­â€¬ function: builtin#45
​â®â€‹âŞâ®â€‹­ď»żâ€Śâ€­âŽ­ function: builtin#44
It is unclear to me, why is this set of characters allowed for an identifier?
In other words, why is it a completely valid Lua program?
Unlike some languages, Lua is not really defined by a formal specification, one which covers every contingency and entirely explains all of Lua's behavior. Something as simple as "what character set is a Lua file encoded in" isn't really explain in Lua's documentation.
All the docs say about identifiers is:
Names (also called identifiers) in Lua can be any string of letters, digits, and underscores, not beginning with a digit and not being a reserved word.
But nothing ever really says what a "letter" is. There isn't even a definition for what character set Lua uses. As such, it's essentially implementation-dependent. A "letter" is... whatever the implementation wants it to be.
So, let's say you're writing a Lua implementation. And you want users to be able to provide Unicode-encoded strings (that is, strings within the Lua text). Lua 5.3 requires this. But you also don't want them to have to use UTF-16 encoding for their files (also because lua_load gets sequences of bytes, not shorts). So your Lua implementation assumes the byte sequence it gets in lua_load is encoded in UTF-8, so that users can write strings that use Unicode characters.
When it comes to writing the lexer/parser part of this implementation, how do you handle this? The simplest, easiest way to handle UTF-8 is to... not handle UTF-8. Indeed, that's the whole point of that encoding. Since everything that Lua defines with specific symbols are encoded in ASCII, and ASCII text is also UTF-8 text with the same meaning, you can basically treat a UTF-8 string like an ASCII string. For in-Lua strings, you just copy the sequence of bytes between the start and end characters of the string.
So how do you go about lexing identifiers? Well, you could ask the question above. Or you could ask a much simpler question: is the character a space, control character, digit, or symbol? A "letter" is merely something that isn't one of those.
Lua defines what things it considers to be "symbols". ASCII can tell you what is a control character, space, and a digit. In such an implementation, any UTF-8 code unit with a value outside of ASCII is a letter. Even if technically, those code units decode into something Unicode thinks of as a "symbol", your lexer just threats it as a letter.
This simple form of UTF-8 lexing gives you fast performance and low memory overhead. You don't have to decode UTF-8 into Unicode codepoints, and you don't need a giant Unicode table to tell you whether a codepoint is a "symbol" or "space" or whatever. And of course, it's also something that would naturally fall out of many ASCII-based Lua implementations.
So most Lua implementations will do it this way, if only by accident. Doing something more would require deliberate effort.
It also allows a user to use Unicode character sequences as identifiers. That means that someone can easily write code in their native language (outside of keywords).
But it also means that obfuscators have lots of ways to create "identifiers" that are just strings of nonsensical bytes. Indeed, because there are multiple ways in Unicode to "spell" the same apparent Unicode string (unless you examine the bytes directly), obfuscators can rig up identifiers that appear when rendered in a text editor to all be the same text, while actually being different strings.
To clarify there is only one identifier T
T.math is sugar syntax for T["math"] this also extends to the obfuscate strings. It is perfectly valid to have a key contain any characters or even start with a number.
Now being able to use the . rather then [ ] does not work with a string that don't conform to the identifier's limitations. See Nicol Bolas' answer for a great break down of those limitations.

Reversing a string which is written in a right-to-left language

I'm looking for a way to reverse a string which is written in a right-to-left language such as Hebrew or Arabic.
Generally, as I attempted to use the string.reverse function the results were missing characters and odd characters at the beginnings and endings of the incomplete "reversed" words.
For example, this is the code I used as a test(Hebrew characters in the string):
local str = "ניסיון"
print(string.reverse(str))
These were the results:
�ויסינ�
Now I assumed it had to do with the "\0" or "null" character, but I couldn't really find any way to check it. The same issue persisted when I created my own function in Lua to reverse the string.
Is there a way to reverse a string of a Right-To-Left language without that being the result?
The � is a unicode symbol that it uses to replace an "unknown, unrecognizable, or unrepresentable character." So basically, the string.reverse()'s unicode doesn't recognize those characters on the end so it replaces them with the � symbol.
I haven't done much messing around in Lua with non-english characters, but I would suggest having a look at the Lua Unicode library page, or look into this module which provides support for UTF-8 for Lua and LuaJIT. Finally, this Stack Overflow question has a good explanation of how Lua's support for Unicode works.
Failing all that, you may just have to make your own reverse function by storing each char to an array and then reversing the order of the array before finally compiling them back into a string.
Hopefully this is helpful!
string.reverse reverse the bytes in a string, not necessarily its characters if the string contains text encoded using a multibyte encoding such as UTF-8.

Trouble making a heart symbol in Lua?

I was wondering how to make the heart sign or "♥" in Lua, I have tried \003 because that is the ASCII code for it, but it does not print it out.
This has little to do with Lua.
You need to find out which character set and encoding is used in your environment and select a font that supports ♥ in that encoding.
Then you need to use an editor for your Lua script that saves in that encoding. If that part is not possible then you can determine the byte sequence required, code it as numeric escapes in a literal string and save in a compatible encoding such as CP437. For example, if you are outputting to a UTF-8 processor, "\xE2\x99\xA5".
Keep in mind that a Lua string is a counted sequence of bytes. It's up to you and your editor to put the right bytes in in the file, it's up to your environment (e.g., console) to interpret those bytes in a particular character encoding, and up to the font to display the glyph.
In a Windows console, you can select the Lucinda Console font, chcp 65001 to use UTF-8 and use Lua 5.1 like this: lua -e "print('\226\153\165')". As a comparison, chcp 437 to use IBM437 and use Lua 5.1 like this: lua -e "print('\003')".
For ASCII, only range 0x20 to 0x7E are printable. Others, including 0x03, isn't printable. Printing its value would be up to the implementation.
If the environment supports Unicode, you can simply call:
print("♥")
For instance, Lua Demo outputs ♥, same in ideone.

CGI::unescape can't handle unescaping "wymiana+teflon%F3w"?

I am working on data imported from legacy database into sqlite for development, legacy database has a lot of url encoded strings with Polish characters. I can get most of these strings readable by using
CGI::unescape_html( CGI::unescape "string" )
except for one case (that I noticed yet, there may be more as I didn't do any testing yet), the letter "ó". For instance, using unescapeHTML on string "wymiana+teflon%F3w" throws an invalid byte sequence exception.
Question now is either my string is properly escaped, as other Polish characters are using sequences of "&#nnn;" like "b%26%23322%3Bad+zapisu+%2D+powinno+by%26%23263%3B+brak", which seems to follow standard for numeric character referencing. BTW, this string is properly unescaped into
"bład zapisu - powinno być brak"
But, on the other hand, there are also strings with similar character encoding, e.g. "odpowietrzanie+weza%5C" which is properly handled by CGI::unescapeHTML. However, %5C represents a backslash not a letter with code point lower than U+0256. Can it be the reason? I tried to research on this but haven't found any explanation. I also updated my Ruby to 2.1.0 as CGI::Util has changed in new version, but still no luck.
ó is 0xF3 in ISO-8859-2 (and ISO-8859-1) but '\xF3' is not a valid UTF-8 string, that ó should be %C3%B3 in the URL if you're expecting UTF-8. Someone somewhere probably used the deprecated escape JavaScript function to encode the string instead of modern encodeURIComponent; you can see the difference with a simple test in your browser's JavaScript console:
> escape('ó')
"%F3"
> encodeURIComponent('ó')
"%C3%B3"
There's the %F3 you're seeing and the %C3%B3 that you want to see. One thing that should work is to fix the encoding by hand:
irb> CGI::unescape('wymiana+teflon%F3w').force_encoding('ISO-8859-2').encode('UTF-8')
=> "wymiana teflonów"
This assumes that you know what should be ISO-8859-1 and what should be UTF-8. You might have a mix of both ISO-8859-2 (or -1, -3, ..., Windows CP-1258, ...) in your data; unfortunately, there's no reliable way to tell the difference as the encodings overlap and there's no way to be sure what result makes sense without eye-balling it and knowing the various languages involved.
Probably the best you can do is:
Send everything through through your CGI::unescape_html(CGI::unescape(...)) converter.
Wrap that in an exception handler to trap the inevitable problems.
Stash the problem strings off to the side somewhere.
Try the ISO-8859-2 to UTF-8 conversion on the strings from (3) and eye-ball them to see if they makes sense.
Repeat with other common encodings until there's nothing left that you care about.
Note that I'm using ISO-8859-2 instead of the more common ISO-8859-1 as Latin-2 is for Eastern European languages (such as Polish) whereas Latin-1 is for Western European languages. They overlap on ó but there is no ł in Latin-1. With tasks like this you usually try the encodings that are probably there first, then fall back on other common encodings, then fall back to whatever other encodings you can think of, and then fall back on hard liquor.
Good luck, modernizing legacy data is not the funnest job in the world.
I've chosen another way to solve my problem, simply substituting all occurrences of '%F3' with '%26%23xF3%3B' before unescaping. BTW, capital letter Ó also needs similar substitution. The actual code I used:
def unescape_ó(s)
s = s.gsub(/%D3|%F3/, {'%D3' =>'%26%23xD3%3B', '%F3' => '%26%23xF3%3B'})
end
With this approach I don't have to handle invalid byte sequence exception as properly escaped string is used in CGI::unescapeHTML

Matching Unicode letters with RegExp

I am in need of matching Unicode letters, similarly to PCRE's \p{L}.
Now, since Dart's RegExp class is based on ECMAScript's, it doesn't have the concept of \p{L}, sadly.
I'm looking into perhaps constructing a big character class that matches all Unicode letters, but I'm not sure where to start.
So, I want to match letters like:
foobar
מכון ראות
But the R symbol shouldn't be matched:
BlackBerry®
Neither should any ASCII control characters or punctuation marks, etc. Essentially every letter in every language Unicode supports, whether it's å, ä, φ or ת, they should match if they are actual letters.
I know this is an old question. But RegExp now supports unicode categories (since Dart 2.4) so you can do something like this:
RegExp alpha = RegExp(r'\p{Letter}', unicode: true);
print(alpha.hasMatch("f")); // true
print(alpha.hasMatch("ת")); // true
print(alpha.hasMatch("®")); // false
I don't think that complete information about classification of Unicode characters as letters or non-letters is anywhere in the Dart libraries. You might be able to put something together that would mostly work using things in the Intl library, particularly Bidi. I'm thinking that, for example,
isLetter(oneCharacterString) => Bidi.endsWithLtr(oneLetterString) || Bidi.endsWithRTL(oneLetterString);
might do a plausible job. At least it seems to have a number of ranges for valid characters in there. Or you could put together your own RegExp based on the information in _LTR_CHARS and _RTL_CHARS. It explicitly says it's not 100% accurate, but good for most practical purposes.
Looks like you're going to have to iterate through the runes in the string and then check the integer value against a table of unicode ranges.
Golang has some code to generate these tables directly from the unicode source. See maketables.go, and some of the other files in the golang unicode package.
Or take the lazy option, and file a Dart bug, and wait for the Dart team to implement it ;)
There's no support for this yet in Dart or JS.
The Xregexp JS library has support for generating fairly large character class regexps to support something like this. You may be able to generate the regexp, print it and cut and paste it into your app.

Resources