iconv C API: charset conversion from/to local encoding - character-encoding

I am using the iconv C API and I want iconv to detect the local encoding of the computer. Is that possible? Apparently it is because when I look in the source code, I find in the file iconv_open1.h that if the fromcode or tocode variables are empty strings ("") then the local encoding is used using the locale_charset() function call.
Someone also told me that in order to convert the locale encoding to unicode, all I needed was to use iconv_open ("UTF-8", "")
Unfortunately, I find no mention of this in the documentation.
And when I convert some iso-8859-1 text to the locale encoding (which is utf-8 on my machine), then during conversion I get errno=EILSEQ (illegal sequence). I checked and iconv_open returned no error.
If instead of the empty string in iconv_open I specify "utf-8", then I get no error. Obviously iconv failed to detect my current charset.
edit: I checked with a simple C program that puts(nl_langinfo(CODESET)) and I get ANSI_X3.4-1968 (which is ASCII). Apparently, I got a problem with charset detection.
edit: this should be related to Why is nl_langinfo(CODESET) different from locale charmap?
additional information: my program is written in Ada, and I bind at link-time to C functions. Apparently, the locale setting is not initialized the same way in the Ada runtime and C runtime.

I'll take the same answer as in Why is nl_langinfo(CODESET) different from locale charmap?
You need to first call
setlocale(LC_ALL, "");

Related

Using Umlaut or special characters in ibm-doors from batch

We have a link module that looks something like this:
const string lMod = "/project/_admin/somethingÜ" // Umlaut
We later use the linkMod like this to loop through the outlinks:
for a in obj->lMod do {}
But this only works when executing directly from DOORS and not from a batch script since it for some reason doesn't recognize the Umlaut causing the inside of the loop to never to be run; exchanging lMod with "*" works and also shows the objects linked to by the lMod.
We are already using UTF-8 encoding for the file:
pragma encoding, "UTF-8"
Any solutions are welcome.
Encode the file as UTF-8 in Notepad++ by going to Encoding > Convert to UTF-8. (Make sure it's not already set to UTF-8 before you do it).

How to encode a STRING variable into a given code page

I've got a string variable containing a text that I need to encode and write to a file, in UTF-16LE code page.
Currently the following code generates a UTF-8 file and I don't see any option in the statement OPEN DATASET to generate the file in UTF-16LE.
REPORT zmyprogram.
DATA(filename) = `/tmp/myfile`.
OPEN DATASET filename IN TEXT MODE ENCODING DEFAULT FOR OUTPUT.
TRANSFER 'HELLO WORLD' TO filename.
CLOSE DATASET filename.
I guess one solution is to first encode the string in memory, then write the encoded bytes to the file.
Generally speaking, how to encode a string of characters into a given code page, in memory?
In the first part, I explain how to encode a string of characters into a given code page (all is done in memory), and in the second part, I explain specifically how to write files to the application server in a given code page.
General way (all in memory)
If a string of characters (type STRING) has to be encoded, the result has to be stored in a string of bytes, which corresponds to the built-in data type XSTRING.
There are several possibilities which depend on the ABAP version:
Since 7.53, use the class CL_ABAP_CONV_CODEPAGE:
DATA(xstring) = cl_abap_conv_codepage=>create_out( codepage = `UTF-16LE` )->convert( source = `ABCDE` ).
Since 7.02, use the class CL_ABAP_CODEPAGE:
DATA xstring TYPE xstring.
xstring = cl_abap_codepage=>convert_to( source = `ABCDE` codepage = `UTF-16LE` ).
Before 7.02, use the class CL_ABAP_CONV_OUT_CE (documentation provided with the class):
First, instantiate the conversion object, use a SAP code page number instead of the ISO name (list of values shown hereafter):
DATA: conv TYPE REF TO CL_ABAP_CONV_OUT_CE, xstring TYPE xstring.
conv = CL_ABAP_CONV_OUT_CE=>CREATE( encoding = '4103' ). "4103 = utf-16le
Then encode the string and retrieve the bytes encoded:
conv->RESET( ).
conv->WRITE( data = `ABCDE` ).
xstring = conv->GET_BUFFER( ).
Eventually, instead of using RESET, WRITE and GET_BUFFER, the method CONVERT was added in 6.40 and retroported :
conv->CONVERT( EXPORTING data = `ABCDE` IMPORTING buffer = xstring ).
With the class CL_ABAP_CONV_OUT_CE, you need to use the number of the SAP Code Page, not the ISO name. Here are the most common SAP code pages and their equivalent ISO names:
1100: ISO-8859-1
1101: US-ASCII
1160: Windows-1252 ("ANSI")
1401: ISO-8859-2
4102: UTF-16BE
4103: UTF-16LE
4104: UTF-32BE
4105: UTF-32LE
4110: UTF-8
Etc. (the possible values are defined in the table TCP00A, in lines with column CPATTRKIND = 'H').
 
Writing a file on the application server in a given code page
In ABAP, OPEN DATASET can directly specify the target code page, most code pages are supported including UTF-8, but not other UTF (code pages 41xx) which can be done only by the solution explained in 2.3 below (by first encoding in memory).
2.1) IN TEXT MODE ENCODING ...
Possible ENCODING values:
UTF-8: in this mode, it's possible to add the Byte Order Mark if needed, via the option WITH BYTE-ORDER MARK.
DEFAULT: will be UTF-8 in a SAP "Unicode" system (that you can check via the menu System > Status > Unicode System Yes/No), NON-UNICODE otherwise.
NON-UNICODE: will depend on the current ABAP linguistic environment; for language English, it's the character encoding iso-8859-1, for language Polish, it's the character encoding iso-8859-2, etc. (the equivalences are shown in table TCP0C.)
Example in ABAP version 7.52 to write to UTF-8 with the byte order mark:
REPORT zmyprogram.
DATA(filename) = `/tmp/dataset_utf_8`.
OPEN DATASET filename IN TEXT MODE ENCODING UTF-8 WITH BYTE-ORDER MARK FOR OUTPUT.
TRY.
TRANSFER `Witaj świecie` TO filename.
CATCH cx_sy_conversion_codepage INTO DATA(lx).
" Character not supported in language code page
ENDTRY.
CLOSE DATASET filename.
Example in ABAP version 7.52 to write to iso-8859-2 (Polish language here):
REPORT zmyprogram.
SET LOCALE LANGUAGE 'L'. " Polish
DATA(filename) = `/tmp/dataset_nonunicode_pl`.
OPEN DATASET filename IN TEXT MODE ENCODING NON-UNICODE FOR OUTPUT.
TRY.
TRANSFER `Witaj świecie` TO filename.
CATCH cx_sy_conversion_codepage INTO DATA(lx).
" Character not supported in language code page
ENDTRY.
CLOSE DATASET filename.
2.2) IN LEGACY TEXT MODE CODE PAGE ...
Use any code page number except code pages 41xx (i.e. UTF-8 and other UTF; see workaround in 2.3 below).
Example in ABAP version 7.52 to write to iso-8859-2 (code page 1401) :
REPORT zmyprogram.
DATA(filename) = `/tmp/dataset_iso_8859_2`.
OPEN DATASET filename IN LEGACY TEXT MODE CODE PAGE '1401' FOR OUTPUT. " iso-8859-2
TRY.
TRANSFER `Witaj świecie` TO filename.
CATCH cx_sy_conversion_codepage INTO DATA(lx).
" Character not supported in language code page
ENDTRY.
CLOSE DATASET filename.
2.3) UTF = general way + IN BINARY MODE
Example in ABAP version 7.52:
REPORT zmyprogram.
TRY.
DATA(xstring) = cl_abap_codepage=>convert_to( source = `Witaj świecie` codepage = `UTF-16LE` ).
CATCH cx_sy_conversion_codepage INTO DATA(lx).
" Character not supported in language code page
BREAK-POINT.
ENDTRY.
DATA(filename) = `/tmp/dataset_utf_16le`.
OPEN DATASET filename IN BINARY MODE FOR OUTPUT.
TRANSFER xstring TO filename.
CLOSE DATASET filename.

LoadFromFile with Unicode data

My input file(f) has some Unicode (Swedish) that isn't being read correctly.
Neither of these approaches works, although they give different results:
LoadFromFile(f);
or
LoadFromFile(f,TEncoding.GetEncoding(GetOEMCP));
I'm using Delphi XE
How can I LoadFromFile some Unicode data....also how do I subsequently SaveToFile? Thanks
In order to load a Unicode text file you need to know its encoding. If the file has a Byte Order Mark (BOM), then you can simply call LoadFromFile(FileName) and the RTL will use the BOM to determine the encoding.
If the file does not have a BOM then you need to explicitly specify the encoding, e.g.
LoadFromFile(FileName, TEncoding.UTF8);
LoadFromFile(FileName, TEncoding.Unicode);//UTF-16 LE
LoadFromFile(FileName, TEncoding.BigEndianUnicode);//UTF-16 BE
For some reason, unknown to me, there is no built in support for UTF-32, but if you had such a file then it would be easy enough to add a TEncoding instance to handle that.
I assume that you mean 'UTF-8' when you say 'Unicode'.
If you know that the file is UTF-8, then do
LoadFromFile(f, TEncoding.UTF8).
To save:
SaveToFile(f, TEncoding.UTF8);
(The GetOEMCP WinAPI function is for old 255-character character sets.)

HTML decoding in C/C++

I'm using libcurl for getting HTML pages.
I have some problems with Hebrew characters.
for example this: סלקום
gets gibberish.
How do I get Hebrew characters and not gibberish?
Do I need some HTML decoder?
Does libcurl support such operation?
Does libiconv support such operation?
I appreciate any help.
Thanks
Edit: Ok, so what you’re seeing is UTF-8 data being decoded as Windows-1252 (so the numeric character references were a red herring). Here’s a demonstration in Python:
>>> u = ''.join(map(unichr, [1505, 1500, 1511, 1493, 1501]))
>>> s = u.encode('utf-8')
>>> print s.decode('cp1255', 'replace')
׳¡׳�׳§׳•׳�
The solution to this problem depends on the environment in which the output is displayed. Merely outputting the bytes received and expecting them to be interpreted as characters leads to problems like this.
An HTML document typically contains a header tag like <meta charset=utf-8> to indicate to the browser what its encoding should be. A document served by a web server contains an HTTP header like Content-Type: text/html; charset=utf-8.
You should ask libcurl for the Content-Type HTTP header to know the encoding of the document, and then convert it to the system encoding using iconv. While in your case that would be codepage 1255, it depends on the user’s system and so you should look up the appropriate functions to detect that.
(Read Unicode and Character Sets and the character-encoding tag on this site for a wealth of further information.)

Iconv.conv in Rails application to convert from unicode to ASCII//translit

We wanted to convert a unicode string in Slovak language into plain ASCII (without accents/carons) That is to do: č->c š->s á->a é->e etc.
We tried:
cstr = Iconv.conv('us-ascii//translit', 'utf-8', a_unicode_string)
It was working on one system (Mac) and was not working on the other (Ubuntu) where it was giving '?' for accented characters after conversion.
Problem: iconv was using LANG/LC_ALL variables. I do not know why, when the encodings are known, but well... You had to set the locale variables to something.utf8, for example: sk_SK.utf8 or en_GB.utf8
Next step was to try to set ENV['LANG'] and ENV['LC_ALL'] in config/application.rb. This was ignored by Iconv in ruby.
Another try was to use global system setting in /etc/default/locale - this worked in command line, but not for Rails application. Reason: apache has its own environment. Therefore the final solution was to add LANG/LC_ALL variables into /etc/apache2/envvars:
export LC_ALL="en_GB.utf8"
export LANG="en_GB.utf8"
export LANGUAGE="en_GB.utf8"
Restarted apache and it worked.
This is more a little how-to than a question. However, if someone has better solution I would like to know about it.
You can try unaccent approach instead.

Resources