Read mixed language txt file in TCL - character-encoding

I need to read a text file that contains mixed languages (German & Arabic) in a TCL script. When I open the file with utf-8 encoding, only German letters are recognized correctly.
set My_Sheet [open $ref_book r]
fconfigure $My_Sheet -encoding utf-8
Does anyone konw what the correct encoding that I can use to open the file.
P.S: when I type
encoding names
I can see the following encodings are available:
cp860 cp861 cp862 cp863 tis-620 cp864 cp865 cp866 gb12345 gb2312-raw
cp949 cp950 cp869 dingbats ksc5601 macCentEuro cp874 macUkraine
jis0201 gb2312 euc-cn euc-jp macThai iso8859-10 jis0208 iso2022-jp macIceland
iso2022 iso8859-13 jis0212 iso8859-14 iso8859-15 cp737 iso8859-16 big5 euc-kr
macRomania macTurkish gb1988 iso2022-kr macGreek ascii cp437 macRoman
iso8859-1 iso8859-2 iso8859-3 macCroatian koi8-r iso8859-4 ebcdic iso8859-5
cp1250 macCyrillic iso8859-6 cp1251 macDingbats koi8-u iso8859-7 cp1252
iso8859-8 cp1253 iso8859-9 cp1254 cp1255 cp850 cp1256 cp932 identity cp1257 cp852
macJapan cp1258 shiftjis utf-8 cp855 cp936 symbol cp775 unicode cp857

Related

How to encode a STRING variable into a given code page

I've got a string variable containing a text that I need to encode and write to a file, in UTF-16LE code page.
Currently the following code generates a UTF-8 file and I don't see any option in the statement OPEN DATASET to generate the file in UTF-16LE.
REPORT zmyprogram.
DATA(filename) = `/tmp/myfile`.
OPEN DATASET filename IN TEXT MODE ENCODING DEFAULT FOR OUTPUT.
TRANSFER 'HELLO WORLD' TO filename.
CLOSE DATASET filename.
I guess one solution is to first encode the string in memory, then write the encoded bytes to the file.
Generally speaking, how to encode a string of characters into a given code page, in memory?
In the first part, I explain how to encode a string of characters into a given code page (all is done in memory), and in the second part, I explain specifically how to write files to the application server in a given code page.
General way (all in memory)
If a string of characters (type STRING) has to be encoded, the result has to be stored in a string of bytes, which corresponds to the built-in data type XSTRING.
There are several possibilities which depend on the ABAP version:
Since 7.53, use the class CL_ABAP_CONV_CODEPAGE:
DATA(xstring) = cl_abap_conv_codepage=>create_out( codepage = `UTF-16LE` )->convert( source = `ABCDE` ).
Since 7.02, use the class CL_ABAP_CODEPAGE:
DATA xstring TYPE xstring.
xstring = cl_abap_codepage=>convert_to( source = `ABCDE` codepage = `UTF-16LE` ).
Before 7.02, use the class CL_ABAP_CONV_OUT_CE (documentation provided with the class):
First, instantiate the conversion object, use a SAP code page number instead of the ISO name (list of values shown hereafter):
DATA: conv TYPE REF TO CL_ABAP_CONV_OUT_CE, xstring TYPE xstring.
conv = CL_ABAP_CONV_OUT_CE=>CREATE( encoding = '4103' ). "4103 = utf-16le
Then encode the string and retrieve the bytes encoded:
conv->RESET( ).
conv->WRITE( data = `ABCDE` ).
xstring = conv->GET_BUFFER( ).
Eventually, instead of using RESET, WRITE and GET_BUFFER, the method CONVERT was added in 6.40 and retroported :
conv->CONVERT( EXPORTING data = `ABCDE` IMPORTING buffer = xstring ).
With the class CL_ABAP_CONV_OUT_CE, you need to use the number of the SAP Code Page, not the ISO name. Here are the most common SAP code pages and their equivalent ISO names:
1100: ISO-8859-1
1101: US-ASCII
1160: Windows-1252 ("ANSI")
1401: ISO-8859-2
4102: UTF-16BE
4103: UTF-16LE
4104: UTF-32BE
4105: UTF-32LE
4110: UTF-8
Etc. (the possible values are defined in the table TCP00A, in lines with column CPATTRKIND = 'H').
 
Writing a file on the application server in a given code page
In ABAP, OPEN DATASET can directly specify the target code page, most code pages are supported including UTF-8, but not other UTF (code pages 41xx) which can be done only by the solution explained in 2.3 below (by first encoding in memory).
2.1) IN TEXT MODE ENCODING ...
Possible ENCODING values:
UTF-8: in this mode, it's possible to add the Byte Order Mark if needed, via the option WITH BYTE-ORDER MARK.
DEFAULT: will be UTF-8 in a SAP "Unicode" system (that you can check via the menu System > Status > Unicode System Yes/No), NON-UNICODE otherwise.
NON-UNICODE: will depend on the current ABAP linguistic environment; for language English, it's the character encoding iso-8859-1, for language Polish, it's the character encoding iso-8859-2, etc. (the equivalences are shown in table TCP0C.)
Example in ABAP version 7.52 to write to UTF-8 with the byte order mark:
REPORT zmyprogram.
DATA(filename) = `/tmp/dataset_utf_8`.
OPEN DATASET filename IN TEXT MODE ENCODING UTF-8 WITH BYTE-ORDER MARK FOR OUTPUT.
TRY.
TRANSFER `Witaj świecie` TO filename.
CATCH cx_sy_conversion_codepage INTO DATA(lx).
" Character not supported in language code page
ENDTRY.
CLOSE DATASET filename.
Example in ABAP version 7.52 to write to iso-8859-2 (Polish language here):
REPORT zmyprogram.
SET LOCALE LANGUAGE 'L'. " Polish
DATA(filename) = `/tmp/dataset_nonunicode_pl`.
OPEN DATASET filename IN TEXT MODE ENCODING NON-UNICODE FOR OUTPUT.
TRY.
TRANSFER `Witaj świecie` TO filename.
CATCH cx_sy_conversion_codepage INTO DATA(lx).
" Character not supported in language code page
ENDTRY.
CLOSE DATASET filename.
2.2) IN LEGACY TEXT MODE CODE PAGE ...
Use any code page number except code pages 41xx (i.e. UTF-8 and other UTF; see workaround in 2.3 below).
Example in ABAP version 7.52 to write to iso-8859-2 (code page 1401) :
REPORT zmyprogram.
DATA(filename) = `/tmp/dataset_iso_8859_2`.
OPEN DATASET filename IN LEGACY TEXT MODE CODE PAGE '1401' FOR OUTPUT. " iso-8859-2
TRY.
TRANSFER `Witaj świecie` TO filename.
CATCH cx_sy_conversion_codepage INTO DATA(lx).
" Character not supported in language code page
ENDTRY.
CLOSE DATASET filename.
2.3) UTF = general way + IN BINARY MODE
Example in ABAP version 7.52:
REPORT zmyprogram.
TRY.
DATA(xstring) = cl_abap_codepage=>convert_to( source = `Witaj świecie` codepage = `UTF-16LE` ).
CATCH cx_sy_conversion_codepage INTO DATA(lx).
" Character not supported in language code page
BREAK-POINT.
ENDTRY.
DATA(filename) = `/tmp/dataset_utf_16le`.
OPEN DATASET filename IN BINARY MODE FOR OUTPUT.
TRANSFER xstring TO filename.
CLOSE DATASET filename.

What encoding is this and how do I turn it into something I can see properly?

I'm writing a script that will operate on the subtitle files of a popular streaming service (Netfl*x).
The subtitle files have strange characters in them and I can't get them to render in a way that my text editors or web browser will display in a readable way. The xml encoding says UTF-8, but some characters are not readable.
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<tt xmlns:tt="http://www.w3.org/ns/ttml" xmlns:ttm="http://www.w3.org/ns/ttml#metadata" xmlns:ttp="http://www.w3.org/ns/ttml#parameter" xmlns:tts="http://www.w3.org/ns/ttml#styling" ttp:tickRate="10000000" ttp:timeBase="media" xmlns="http://www.w3.org/ns/ttml">
<p>de 15 % la nuit dernière.</span></p>
<p>if youâve got things to doâ¦</span></p>
And in Vim:
This is what it looks like in the browser:
How can I convert this into something I can use?
I'll go out on a limb and say that file is UTF-8 encoded just fine, and you're merely looking at it using the wrong encoding. The character À encoded in UTF-8 is C3 80. C3 in ISO-8859-1 is Ã, which in your screenshot is followed by an 80. So looks like you're looking at a UTF-8 file using the (wrong) ISO-8859 encoding.
Use the correct encoding when opening the file.
My terminal is set to en_US.UTF-8, but was also rendering this supposedly UTF-8 encoded file incorrectly (sonné -> sonné). I was able to solve this by using iconv to encode the file in ISO8859-1.
iconv original.xml -t ISO8859-1 -o converted.xml
In the new file, the characters were properly rendered, although I don't quite understand why.

Can we change XML encoding from utf-8 to utf -16?

I have written a code for generating XML with UTF-8 encoding.I always validate the XML with XSD file. In the same code i need UTF-16 encoding. Because one of my XSD file is of UTF-16 encoding.
But in my existing code it is not accepted. it gives following error.
FAILED: Fatal error: Document labelled UTF-16 but has UTF-8 content at filepath/standard.xsd:1.
and at line 1 of XSD this tag is defined <?xml version="1.0" encoding="utf-16"?>
How can i validate it with utf-8 encoding?
Is there any way to change UTF-16 to UTF-8 encoding.
Thanks in advance.
You can change the encoding from utf16 to utf-8 with Iconv
Call iconv from Ruby 1.8.7 through system to convert a file from utf-16 to utf-8
When you write the new file you can replace the first line with a new header like
<?xml version="1.0" encoding="utf-8" ?>
Ruby - Open file, find and replace multiple lines
If you need it in the other way then change the endoding in the function.

BlackBerry - language support for Chinese

I have localised my app by adding the correct resource files for various European languages / dialects.
I have the required folder in my project: ./res/com/demo/localization
It contains the required files e.g. Demo.rrh, Demo.rrc, Demo_de.rrc etc.
I want to add support for 2 Chinese dialects, and I have the translations in an Excel file. On iPhone, they are referred to by the codes zh_TW & zh_CM. Following the pattern with German, I created 2 extra files called Demo_zh_TW.rrc & Demo_zh_CN.rrc.
I opened file Demo_zh_CN.rrc using Eclipse's text editor, and pasted in line of the Chinese translation using the normal resource file format:
START_LOCATION#0="开始位置";
When I tried to save the file, I got Eclipse's error about the Cp1252 character encoding:
Save could not be completed.
Reason:
Some characters cannot be mapped using "Cp1252" character encoding.
Either change the encoding or remove the characters which are not
supported by the "Cp1252" character encoding.
It seems the Eclipse editor will accept the Chinese characters, but the resource tool expects that these characters must be saved in the resource file as Java Unicode /u encoding.
How do I add language support for these 2 regions without manually copy n pasting in each string?
Is there maybe a tool that I can use to Java Unicode /u encode the strings from Excel so they can be saved in Code page 1252 Latin chars only?
I'm not aware of any readily available tools for working with BlackBerry's peculiar localization style.
Here's a snippet of Java-SE code I use to convert the UTF-8 strings I get for use with BlackBerry:
private static String unicodeEscape(String value, CharsetEncoder encoder) {
StringBuilder sb = new StringBuilder();
for(char c : value.toCharArray()) {
if(encoder.canEncode(c)) {
sb.append(c);
} else {
sb.append("\\u");
sb.append(hex4(c));
}
}
return sb.toString();
}
private static String hex4(char c) {
String ret = Integer.toHexString(c);
while(ret.length() < 4) {
ret = "0" + ret;
}
return ret;
}
Call unicodeEscape with the 8859-1 encoder with Charset.forName("ISO-8859-1").newEncoder()
I suggest you look at Blackberry Hindi and Gujarati text display
You need to use the resource editor to make these files with the right encoding. Eclipse will escape the characters automatically.
This is a problem with the encoding of your resource file. 1252 Code Page contains Latin characters only.
I have never worked with Eclipse, but there should be somewhere you specify the encoding of the file, you should set your default encoding for files to UTF-8 if possible. This will handle your chinese characters.
You could also use a good editor like Notepad++ or EMEditor to set the encoding of your file.
See here for how you can configure Eclipse to use UTF-8 by default.

How to change html encoded character to ascii character

I have a french character that is encoded as follows:
"Jos\xE9e"
I need to convert it to regular character because it produces this error on my server:
invalid byte sequence in UTF-8
What can I do to fix this error?
Rails 3 Ruby 1.9.2
That looks like "Josée" encoded in ISO 8859-1 (AKA Latin-1). You can use Iconv to convert it to UTF-8:
require 'iconv'
utf_string = Iconv.conv('UTF-8', 'ISO-8859-1', "Jos\xE9e")
Use a editor support utf-8, and add coding line at the top of all source files:
# coding: utf-8
If some input string is not utf-8, convert it to utf-8 first before processing:
input_str = "Jos\xE9e"
utf_input = input_str.force_encoding('iso-8859-1').encode('utf-8')
All above only work under ruby 1.9. For more information, you can check the book: Ruby Best Practices.
you should use utf8 in all your source code, how about save your file in utf-8 encoding

Resources