For example, from English to Spanish, e.g.:
función miFunción( )
impresión '¡Hola!'
fin
Translating the keywords is simple: just edit luaX_tokens in llex.c. Do not reorder that list.
Using other names for functions is also simple and can be done without patching Lua by writing a Lua script with impresion=print etc.
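For instance, a minimal sketch of such an alias script (only impresion = print comes from the sentence above; the other names are illustrative):
-- Spanish aliases for standard functions; no patching of Lua required.
impresion = print
tipo = type
impresion('¡Hola!')   --> ¡Hola!
impresion(tipo(1))    --> number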
The main problem is the accents. Using a code page like ISO-8859-1 is easy: just edit lctype.c and add the accented letters as letters. Using UTF-8 is hard because Lua is byte-oriented; perhaps it's doable by editing lctype.c.
Related
From my Lua knowledge (and from what I have read in the Lua manuals), I've always been under the impression that an identifier in Lua is limited to A-Z, a-z, _, and digits (and cannot start with a digit or be a reserved keyword, i.e. local local = 123 is invalid).
And now I have run into some (obfuscated) Lua program which uses all kind of weird characters for an identifier:
https://i.imgur.com/HPLKMxp.png
-- Most likely, copy+paste won't work. Download the file from https://tknk.io/7HHZ
print(_VERSION .. " " .. (jit and "JIT" or "non-JIT"))
local T = {}
T.math = T.math or {}
T.math.​â®â€‹âŞâ®â€‹ď»żâ€Śâ€âŽ = math.sin
T.math.â¬â€‹ââ¬ââ«â®â€â€¬ = math.cos
for k, v in pairs(T.math) do print(k, v) end
Output:
Lua 5.1 JIT
â¬â€‹ââ¬ââ«â®â€â€¬ function: builtin#45
​â®â€‹âŞâ®â€‹ď»żâ€Śâ€âŽ function: builtin#44
It is unclear to me, why is this set of characters allowed for an identifier?
In other words, why is it a completely valid Lua program?
Unlike some languages, Lua is not really defined by a formal specification, one which covers every contingency and entirely explains all of Lua's behavior. Something as simple as "what character set is a Lua file encoded in" isn't really explained in Lua's documentation.
All the docs say about identifiers is:
Names (also called identifiers) in Lua can be any string of letters, digits, and underscores, not beginning with a digit and not being a reserved word.
But nothing ever really says what a "letter" is. There isn't even a definition for what character set Lua uses. As such, it's essentially implementation-dependent. A "letter" is... whatever the implementation wants it to be.
So, let's say you're writing a Lua implementation. And you want users to be able to provide Unicode-encoded strings (that is, strings within the Lua text). Lua 5.3 requires this. But you also don't want them to have to use UTF-16 encoding for their files (also because lua_load gets sequences of bytes, not shorts). So your Lua implementation assumes the byte sequence it gets in lua_load is encoded in UTF-8, so that users can write strings that use Unicode characters.
When it comes to writing the lexer/parser part of this implementation, how do you handle this? The simplest, easiest way to handle UTF-8 is to... not handle UTF-8. Indeed, that's the whole point of that encoding. Since everything that Lua defines with specific symbols is encoded in ASCII, and ASCII text is also UTF-8 text with the same meaning, you can basically treat a UTF-8 string like an ASCII string. For in-Lua strings, you just copy the sequence of bytes between the start and end characters of the string.
So how do you go about lexing identifiers? Well, you could ask the question above. Or you could ask a much simpler question: is the character a space, control character, digit, or symbol? A "letter" is merely something that isn't one of those.
Lua defines what things it considers to be "symbols". ASCII can tell you what is a control character, a space, and a digit. In such an implementation, any UTF-8 code unit with a value outside of ASCII is a letter. Even if, technically, those code units decode into something Unicode thinks of as a "symbol", your lexer just treats it as a letter.
This simple form of UTF-8 lexing gives you fast performance and low memory overhead. You don't have to decode UTF-8 into Unicode codepoints, and you don't need a giant Unicode table to tell you whether a codepoint is a "symbol" or "space" or whatever. And of course, it's also something that would naturally fall out of many ASCII-based Lua implementations.
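As a sketch (written in Lua itself, though the real classification lives in C, in lctype.c; the function name here is hypothetical), the whole test boils down to something like:
-- Classify a single byte the way such a lexer would.
local function is_name_char(b)      -- b is a byte value, 0..255
  if b >= 128 then return true end  -- any non-ASCII UTF-8 code unit counts as a "letter"
  return string.char(b):match("[%w_]") ~= nil  -- ASCII letter, digit, or '_'
end
print(is_name_char(string.byte("a")))  --> true
print(is_name_char(string.byte("+")))  --> false
print(is_name_char(226))               --> true (one byte of a multi-byte character)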
So most Lua implementations will do it this way, if only by accident. Doing something more would require deliberate effort.
It also allows a user to use Unicode character sequences as identifiers. That means that someone can easily write code in their native language (outside of keywords).
But it also means that obfuscators have lots of ways to create "identifiers" that are just strings of nonsensical bytes. Indeed, because there are multiple ways in Unicode to "spell" the same apparent Unicode string (unless you examine the bytes directly), obfuscators can rig up identifiers that appear when rendered in a text editor to all be the same text, while actually being different strings.
To clarify: there is only one identifier here, T.
T.math is syntactic sugar for T["math"], and this also extends to the obfuscated strings. It is perfectly valid for a key to contain any characters, or even to start with a number.
Being able to use . rather than [ ] does not work with strings that don't conform to the identifier's limitations. See Nicol Bolas' answer for a great breakdown of those limitations.
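To illustrate (a small sketch; the keys are arbitrary examples):
local T = {}
T["not an identifier!"] = 1  -- any string can be a key
T["123"] = 2                 -- even one that starts with a digit
T.math = 3                   -- dot syntax is sugar for T["math"]...
print(T["math"])             --> 3
-- ...but T.123 and T.not an identifier! are syntax errors;
-- such keys must be written with the bracket form.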
I'm currently translating a website from English into other languages, but have a problem when it comes to technical terms (non-words) like "crontab".
Should I keep the English word, or is there another way to find an equivalent?
These aren't actually English words, and when it comes to languages like Japanese, I'm at a loss as to what to do.
Here's an example sentence:
"Use crontab to schedule scripts."
which translated into Japanese via Google Translate becomes:
"スクリプトをスケジュールするcrontabを使用してください。"
You can see how bizarre this looks, and I'm wondering if the sentence could even be understood by a Japanese speaker.
What do I do in these situations?
Using English words in Japanese
Talking about the word crontab, I think it's not bizarre to write it in English in a Japanese sentence like this:
crontabを使用してください
(please use crontab)
On the Japanese Wikipedia, you can see that crontab is used without being translated into Japanese.
http://ja.wikipedia.org/wiki/Crontab
In Japanese technical writing, especially when you mention the names of tools, it is common to use the English term as-is, without translating it into Japanese.
Using Katakana
You could also write the sentence like below using Katakana.
クーロンタブを使用してください
(please use crontab).
Japanese usually renders English words in Katakana. Katakana is phonetic; in other words, each character represents a sound (not a meaning). But in this case, it doesn't look natural.
Mistranslation
There is a mistranslation in your Japanese sentence.
スクリプトをスケジュールするcrontabを使用してください。
(Please use crontab, which schedules a script.)
To correct this, you could go like this:
スクリプトをスケジュールするには、crontabを使用してください。
(In order to schedule a script, please use crontab.)
Hope this helps.
I am using Clozure CL on Mac OS X 10.9 and Portable AllegroServe.
I have a file whose text has characters like ı ç ş ö (characters that Turkish also has) and some Arabic characters. I cannot serve them: when I visit the page from the browser, these characters are not displayed at all; only the part of the text up to the first ı is shown.
In Lisp, I use a function built from do, read-line, and format (I have also tried print, princ, and prin1) that reads the entire document, and when I set :external-format :utf-8 it shows the characters properly in Lisp. The problem is in serving them; if I can serve them as they are read in Lisp, the job is done.
Also, if I do not set :external-format at all, the text is read improperly in Lisp, as expected; however, this time the browser shows all the text, but with wrong characters in place of the ones described above.
How do I fix this and use :external-format character encodings properly?
See http://www.xach.com/lisp/allegro-cl/2001-3/964.html for an example on how to use :external-format in AllegroServe.
Cheers
Frank
P.S. I also posted an answer to the same question in the newsgroup comp.lang.lisp.
I am in need of matching Unicode letters, similarly to PCRE's \p{L}.
Now, since Dart's RegExp class is based on ECMAScript's, it doesn't have the concept of \p{L}, sadly.
I'm looking into perhaps constructing a big character class that matches all Unicode letters, but I'm not sure where to start.
So, I want to match letters like:
foobar
מכון ראות
But the ® symbol shouldn't be matched:
BlackBerry®
Neither should any ASCII control characters or punctuation marks, etc. Essentially every letter in every language Unicode supports, whether it's å, ä, φ or ת, they should match if they are actual letters.
I know this is an old question, but RegExp now supports Unicode categories (since Dart 2.4), so you can do something like this:
RegExp alpha = RegExp(r'\p{Letter}', unicode: true);
print(alpha.hasMatch("f")); // true
print(alpha.hasMatch("ת")); // true
print(alpha.hasMatch("®")); // false
I don't think that complete information about classification of Unicode characters as letters or non-letters is anywhere in the Dart libraries. You might be able to put something together that would mostly work using things in the Intl library, particularly Bidi. I'm thinking that, for example,
isLetter(oneCharacterString) => Bidi.endsWithLtr(oneCharacterString) || Bidi.endsWithRtl(oneCharacterString);
might do a plausible job. At least it seems to have a number of ranges for valid characters in there. Or you could put together your own RegExp based on the information in _LTR_CHARS and _RTL_CHARS. It explicitly says it's not 100% accurate, but good for most practical purposes.
Looks like you're going to have to iterate through the runes in the string and then check the integer value against a table of Unicode ranges.
Golang has some code to generate these tables directly from the Unicode source data. See maketables.go, and some of the other files in the Go unicode package.
Or take the lazy option, and file a Dart bug, and wait for the Dart team to implement it ;)
There's no support for this yet in Dart or JS.
The XRegExp JS library has support for generating fairly large character-class regexps to support something like this. You may be able to generate the regexp, print it, and cut and paste it into your app.
I'm trying to convert some French text to upper case in Lua, but it is not converting the accented characters. Any idea why?
test script:
print('échelle')
print(string.upper('échelle'))
print('ÉCHELLE')
print(string.lower('ÉCHELLE'))
output:
échelle
éCHELLE
ÉCHELLE
Échelle
It might be a bit overkill, but you can do this with slnunicode (which is available in LuaRocks).
require "unicode"
print(unicode.utf8.upper("échelle"))
-- ÉCHELLE
You may need to use unicode.ascii.upper or unicode.latin1.upper depending on the encoding of your source files.
You need to set a suitable locale, which depends on how these strings are encoded in the source.
You seem to be using Latin 1 because of the output you gave.
In this case, try adding the line below at the top of your script:
os.setlocale("fr_FR.ISO8859-1")
This name is for Mac OS X. For Linux, try
os.setlocale("fr_FR.iso88591")
If you're using UTF-8, then setting a locale won't help, because string.lower converts the string one byte at a time.
Lua just uses the C library function toupper, which AFAIK doesn't support accented characters. You'd need to write a routine for that yourself.
To explain this all more effectively: Lua does not have built-in support for non-ASCII strings. You can store a Latin-1- or UTF-8-encoded string, but none of the special string-manipulation functions (upper, lower, etc.) will work on any non-ASCII character.
There are Lua libraries that add varying degrees of Unicode support. So you will have to use one of them.
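If pulling in a library is not an option, here is a minimal byte-level sketch: an assumption-laden workaround for Latin-1 letters in UTF-8 text, not a general Unicode solution. It uppercases the ASCII letters and the Latin-1 accented letters:
local function upper_latin1_utf8(s)
  s = s:gsub("%l", string.upper)  -- ASCII letters first
  -- U+00E0..U+00FE uppercase by subtracting 0x20 from the codepoint;
  -- in UTF-8 these letters are the two-byte sequences C3 A0 .. C3 BE.
  return (s:gsub("\195([\160-\190])", function(c)
    local b = c:byte()
    if b ~= 0xB7 then  -- skip ÷ (U+00F7), which has no uppercase form
      return "\195" .. string.char(b - 0x20)
    end
  end))
end
print(upper_latin1_utf8("échelle"))  --> ÉCHELLE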