Trouble making a heart symbol in Lua?

I was wondering how to make the heart sign, or "♥", in Lua. I have tried \003 because that is the ASCII code for it, but it does not print it out.

This has little to do with Lua.
You need to find out which character set and encoding is used in your environment and select a font that supports ♥ in that encoding.
Then you need to use an editor for your Lua script that saves in that encoding. If that is not possible, you can determine the byte sequence required, code it as numeric escapes in a literal string, and save the script in a compatible encoding such as CP437. For example, if you are outputting to a UTF-8 processor, that would be "\xE2\x99\xA5".
Keep in mind that a Lua string is a counted sequence of bytes. It's up to you and your editor to put the right bytes in the file, up to your environment (e.g., the console) to interpret those bytes in a particular character encoding, and up to the font to display the glyph.
In a Windows console, you can select the Lucida Console font, run chcp 65001 to use UTF-8, and use Lua 5.1 like this: lua -e "print('\226\153\165')". For comparison, run chcp 437 to use IBM437 and use Lua 5.1 like this: lua -e "print('\003')".
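To make the byte-level view concrete, here is a minimal sketch assuming the output goes to a console that decodes UTF-8 (the decimal escapes work in Lua 5.1; the \x hex form needs Lua 5.2 or later):
local heart = "\226\153\165"     -- U+2665 as UTF-8: bytes E2 99 A5, written as decimal escapes
-- local heart = "\xE2\x99\xA5"  -- same bytes with hex escapes (Lua 5.2+)

print(heart)                     -- shows ♥ only if the console interprets the output as UTF-8
print(heart:byte(1, -1))         -- 226  153  165, regardless of the console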

In ASCII, only the range 0x20 to 0x7E is printable. Other codes, including 0x03, are not printable; what gets printed for them is up to the implementation.
If the environment supports Unicode, you can simply call:
print("♥")
For instance, Lua Demo outputs ♥, and so does ideone.

Related

Japanese encoding JIS_X_0208 codepage in python and C++

I am trying to encode and decode Japanese characters that are encoded in JIS_X_0208.
In Python I use this command to encode my string from UTF-8 to Japanese characters:
string.decode('utf8').encode('iso2022_jp')
to encode the kanji properly.
In C++ I decode it to UTF-16 with this line:
MultiByteToWideChar(932, 0, &s[0], s.size(), &unicodeBuffer[0], s.size());
All the kanji are properly encoded/decoded.
But the problem is that it is not compliant with JIS_X_0208. I should point out that the use of JIS_X_0208 is mandatory and I can't change it.
For instance, the roman characters are supposed to be encoded as two bytes, the first one being 0x23; for example, the letter T should be encoded as 0x23 0x54 (according to both the JIS_X_0208 Wikipedia page and the sample I was given as an example).
I guess the only issue I have is finding the correct codepage for the encoding, but I can't find the one I need.
Does anyone know what the correct codepage is, or at least where I can find the available codepages for C++ and Python on Windows?
Thank you in advance.

Cobol REPLACING ALL pattern matching

I'm working on converting some legacy COBOL code and came across a statement like this:
INSPECT WS-LOCAL-VAR REPLACING ALL X'0D25' BY ' '
I understand that the INSPECT...REPLACING ALL statement will look through WS-LOCAL-VAR, match the pattern X'0D25' and replace it with a space.
What I don't understand is the purpose of the X outside of '0D25'. All examples of REPLACING ALL that I've found online don't use anything other than a char literal for pattern matching.
How does the X affect which patterns are replaced?
COBOL is running on an EBCDIC machine and the input file is coming from a Windows machine.
The X indicates that the characters in the string are given in hexadecimal. In this case, X"0D" indicates the carriage return character and X"25" the % sign (assuming an ASCII system).
A similar notation is used to indicate national strings (N"こんにちは") and boolean/bit strings (B"0101010"), and their respective hexadecimal equivalents (NX"01F5A4" and BX"2A").
Is the COBOL running on an EBCDIC machine (mainframe / AS400), and is the file coming from a Windows machine?
EBCDIC has only one end-of-line character, x'25', as opposed to the two (\r, \n) in ASCII. X'0D25' is the EBCDIC representation of the Windows end-of-line marker \r\n. In EBCDIC, 0D is not a valid character.
Possible sources of the problem:
Poor conversion of a Windows text file when transferred to the mainframe / AS400.
Java (and possibly other modern languages) on Windows. Java on Windows supports writing EBCDIC text files using its standard writers, but on Windows Java insists on writing \r\n even though \r is not a valid EBCDIC character, so you get corrupt files containing x'0D25'.
If you move a program that hard-codes \r\n to the mainframe and run it, you will also get x'0D25' in files.
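If you want to check a transferred file for these byte pairs outside of COBOL, a short script can scan the raw bytes. Here is a minimal sketch in Lua (the language of the main question above), using a hypothetical file name; it counts and replaces the x'0D25' pairs much like the INSPECT ... REPLACING statement does:
local f = assert(io.open("transfer.dat", "rb"))   -- hypothetical file name, opened in binary mode
local data = f:read("*a")
f:close()

-- 0x0D is "\013"; 0x25 is '%', which must be escaped as "%%" in a Lua pattern.
local cleaned, count = data:gsub("\013%%", " ")
print(("replaced %d occurrences of x'0D25'"):format(count))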

how to compare Spanish characters in Lua

I have to compare the contents of a Lua variable with a string containing Spanish characters, e.g.
check whether it is equal to "bisción".
if myvar == "bisción" does not work even when myvar contains the same value.
I could not find anything relevant to this in the Lua documentation except setting the locale at http://www.lua.org/pil/20.html. However, this also does not seem to work.
How do I test for equality? (If it matters, I am using Ubuntu 14.04.)
This is not a problem of Lua itself.
> print("bisción" == "bisción")
true
Perhaps there is a discrepancy between the character encoding used by your source code editor and the one used by your data sources. Lua performs the comparison at the byte level. It's enough to have the Lua source file encoded as UTF-8, for example, and the data loaded from a file with UTF-16 encoding, for the comparison to fail.
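A quick way to see this is to dump the raw bytes: the same visible text, "bisción", written out as numeric escapes in UTF-8 and in Latin-1, produces different byte sequences, so == is false. A minimal sketch:
local utf8_form   = "bisci\195\179n"   -- "ó" as UTF-8: two bytes, C3 B3
local latin1_form = "bisci\243n"       -- "ó" as Latin-1: one byte, F3

print(utf8_form == latin1_form)        -- false: Lua compares raw bytes
print(utf8_form:byte(1, -1))           -- 98 105 115 99 105 195 179 110
print(latin1_form:byte(1, -1))         -- 98 105 115 99 105 243 110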

lua string.upper not working with accented characters?

I'm trying to convert some French text to upper case in Lua, but it is not converting the accented characters. Any idea why?
test script:
print('échelle')
print(string.upper('échelle'))
print('ÉCHELLE')
print(string.lower('ÉCHELLE'))
output:
échelle
éCHELLE
ÉCHELLE
Échelle
It might be a bit overkill, but you can do this with slnunicode (which is available in LuaRocks).
require "unicode"
print(unicode.utf8.upper("échelle"))
-- ÉCHELLE
You may need to use unicode.ascii.upper or unicode.latin1.upper depending on the encoding of your source files.
You need to set a suitable locale, which depends on how these strings are encoded in the source.
You seem to be using Latin-1, judging by the output you gave.
In that case, try adding the line below at the top of your script:
os.setlocale("fr_FR.ISO8859-1")
This name is for Mac OS X. For Linux, try
os.setlocale("fr_FR.iso88591")
If you're using UTF-8, then setting a locale won't help, because string.lower converts the string one byte at a time.
Lua just uses the C library function toupper, which in the default C locale doesn't handle accented characters. You'd need to write a routine for that yourself.
To put it more generally, Lua does not have built-in support for non-ASCII strings. You can store a Latin-1 or UTF-8-encoded string, but none of the special string manipulation functions (upper, lower, etc.) will work on non-ASCII characters.
There are Lua libraries that add varying degrees of Unicode support. So you will have to use one of them.
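If your strings are Latin-1 (as the output above suggests) and pulling in a library or changing the locale is not an option, the accented range can also be mapped by hand: in Latin-1 the lowercase letters 0xE0-0xFE sit exactly 32 above their uppercase counterparts, with 0xF7 (the division sign) being the only non-letter in that range and 0xFF (ÿ) having no Latin-1 uppercase form. A minimal sketch under that assumption:
-- Uppercase a Latin-1 string without relying on the C locale.
-- Accented letters in 0xE0-0xFE (except 0xF7) are shifted down by 32;
-- plain ASCII letters are handled by string.upper afterwards.
local function upper_latin1(s)
  return (s:gsub("[\224-\254]", function(c)
    local b = c:byte()
    if b == 247 then return c end      -- 0xF7 is ÷, not a letter
    return string.char(b - 32)
  end):upper())
end

print(upper_latin1("échelle"))         -- ÉCHELLE, if this file and the console are Latin-1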

ReadLn working with WideString (utf-8 files)

I use Delphi 7.
I need to read a UTF-8 file line by line; each line contains a word and its weight (a number).
So I need to read each line, split it on a separator (the tab character), and save the result in memory.
So,
1) Is there a library for working with UTF-8 files in Delphi (third-party, maybe)?
2) Will string functions operate correctly on WideString? I use PosEx. If they won't, can you also give a link to a third-party library for working with WideStrings?
If it is really UTF-8 that you are dealing with, then you should not need anything special to read and process it. You should be able to treat it as a PChar or even as a normal Delphi 7 string. If you try to show the contents in some kind of message box, then you may need to do some conversions. For example, I don't believe the Delphi 7 message box method would display UTF-8 strings correctly if the string contained any byte values over 127 (0x7F). For something like that, you would need to convert to UTF-16 and call the Windows API MessageBoxW or something similar. Otherwise, though, UTF-8 strings can be treated in many situations the same as single-byte ANSI strings.
I don't think UTF-8 is typically referred to as "widestring". I might be wrong, but I think that typically means UTF-16.
If your file is encoded as UTF-8, and the characters you're looking for are ASCII, then there's no need to use WideString at all. ASCII is a subset of UTF-8, and any ASCII character is guaranteed not to interfere with the special encoding used for other characters in UTF-8. The number characters 0 through 9 and the tab character are all ASCII.
The JCL comes with various functions and classes for dealing with Unicode, if you find you really need to use them.
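To illustrate the byte-level reasoning above (sketched in Lua, since that is this thread's main language, but it applies to any UTF-8 handling): the tab byte 0x09 can never occur inside a UTF-8 multi-byte sequence, because every byte of such a sequence is 0x80 or above, so splitting a line on the tab is safe.
-- "café<TAB>42" encoded as UTF-8; the é is the two bytes C3 A9.
local line = "caf\195\169\t42"

local word, weight = line:match("^(.-)\t(.+)$")
print(word)              -- café (if the console decodes UTF-8)
print(tonumber(weight))  -- 42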
If most of your input is UTF-8, it might be worthwhile to change your codepage on startup from the "default" to UTF-8 (codepage 65001). This will make all AnsiString -> WideString conversions effectively a lossless UTF-8 -> UTF-16 conversion.
With D7, you will need a set of so-called "Unicode" components, i.e. components built on the Windows API -W functions. Delphi's own components only do this from the watershed D2009 release, which switched the default string type to UTF-16.
If you want to invest heavily in Unicode support, upgrading might be a smart thing to do.
WideString is a UTF-16 implementation (a COM BSTR-compatible one); it can't store UTF-8 strings. If you assign an 8-bit string to it, the string will be converted to UTF-16, but unless you explicitly use the proper conversion function, Delphi will interpret the 8-bit string using the current codepage.
A UTF-8 string can be stored in a Delphi AnsiString (the default string type in Delphi 7), but the string manipulation functions are designed for ANSI codepages, not UTF-8. The difference is that UTF-8 is a multi-byte character set: beyond the first 127 ASCII characters, more than one byte is needed to encode a given "character", while many ANSI codepages (especially those for European languages) require only one byte, encoding just 255 "characters" (whereas UTF-8 can encode the whole Unicode set).
If you're just looking for the tab character, AFAIK you could simply use an AnsiString, but you have to ensure that any byte above $80 you may need to look for is not part of a multi-byte sequence. If you have more complex processing needs, it may be easier to find libraries working on UTF-16 strings than on UTF-8. As Rob Kennedy said, the JCL is a good starting point as a free library implementing UTF string manipulation.
You could simply read the file as-is into a normal TStringList via its LoadFrom...() methods, then loop through the list as needed. If loading the entire file into memory at one time is not an option, then you can open the file using a TFileStream and then use the TStreamReader.ReadLine() method to read the stream line-by-line.
If you need to decode a given UTF-8 sequence to UTF-16 for processing, then I would suggest using the Win32 API MultiByteToWideChar() function directly, only because the RTL's UTF8Decode() function has a broken UTF-8 implementation in older Delphi versions (not sure about D7, but it definitely does in D6).
The nice thing about either loading approach is that they are both encoding-aware in D2009 and later, which means that if you ever upgrade, you can make a couple of very small code changes to tell the RTL that the data is UTF-8, and it will decode it to UTF-16 for you automatically, and then the rest of your processing code can remain the same (assuming you are not doing anything that is Ansi-specific).
