XML parsing with parseFromString and Portuguese characters - BlackBerry

I have a BlackBerry app developed using PhoneGap, and I am using a suds client to call a web service. The web service XML contains some Portuguese characters, and I am not able to parse it into an XML document using DOMParser.
I am using
xmlDoc = parser.parseFromString(_xml, "text/xml");
The encoding is UTF-8. Without the Portuguese characters, parsing works perfectly.

"I am using is UTF-8 encoding type." - this can mean several things, so it is unclear what exactly you do in order to support UTF-8 end-to-end.
For example, you should check that:
your web service really sends data in UTF-8 (when it converts string characters into bytes for the output stream, it should use UTF-8)
the device code that reads data from the web really uses UTF-8 to convert the received bytes into the string _xml
P.S. I'm not familiar with phonegap API so this is just a general plan.
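The two checks above can be illustrated with a minimal Python sketch. It is not PhoneGap or BlackBerry code; the sample payload and the helper name element_text are assumptions made for illustration. It shows that the same UTF-8 bytes give the right text when decoded as UTF-8 and mojibake when decoded with a different charset.

```python
from xml.dom import minidom

# Hypothetical payload; the element name and text are illustrative only.
xml_text = '<?xml version="1.0" encoding="UTF-8"?><msg>ação concluída</msg>'

# What a correctly configured web service would put on the wire.
wire_bytes = xml_text.encode("utf-8")

# Correct: the client decodes the bytes with the same charset.
good = wire_bytes.decode("utf-8")

# Wrong: the client assumes ISO-8859-1, so every accented character
# turns into two unrelated characters.
bad = wire_bytes.decode("iso-8859-1")


def element_text(raw_bytes):
    """Parse raw XML bytes and return the text of the root element."""
    doc = minidom.parseString(raw_bytes)
    return doc.documentElement.firstChild.data


print(good == xml_text)          # True
print(bad == xml_text)           # False
print(element_text(wire_bytes))
```

If the string handed to the parser already looks like the "bad" variant, the problem lies in the transport or decoding step, not in the parser.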

Related

How to encode a STRING variable into a given code page

I've got a string variable containing a text that I need to encode and write to a file, in UTF-16LE code page.
Currently the following code generates a UTF-8 file and I don't see any option in the statement OPEN DATASET to generate the file in UTF-16LE.
REPORT zmyprogram.
DATA(filename) = `/tmp/myfile`.
OPEN DATASET filename IN TEXT MODE ENCODING DEFAULT FOR OUTPUT.
TRANSFER 'HELLO WORLD' TO filename.
CLOSE DATASET filename.
I guess one solution is to first encode the string in memory, then write the encoded bytes to the file.
Generally speaking, how to encode a string of characters into a given code page, in memory?
In the first part, I explain how to encode a string of characters into a given code page (all is done in memory), and in the second part, I explain specifically how to write files to the application server in a given code page.
General way (all in memory)
If a string of characters (type STRING) has to be encoded, the result has to be stored in a string of bytes, which corresponds to the built-in data type XSTRING.
There are several possibilities which depend on the ABAP version:
Since 7.53, use the class CL_ABAP_CONV_CODEPAGE:
DATA(xstring) = cl_abap_conv_codepage=>create_out( codepage = `UTF-16LE` )->convert( source = `ABCDE` ).
Since 7.02, use the class CL_ABAP_CODEPAGE:
DATA xstring TYPE xstring.
xstring = cl_abap_codepage=>convert_to( source = `ABCDE` codepage = `UTF-16LE` ).
Before 7.02, use the class CL_ABAP_CONV_OUT_CE (documentation provided with the class):
First, instantiate the conversion object, use a SAP code page number instead of the ISO name (list of values shown hereafter):
DATA: conv TYPE REF TO CL_ABAP_CONV_OUT_CE, xstring TYPE xstring.
conv = CL_ABAP_CONV_OUT_CE=>CREATE( encoding = '4103' ). "4103 = utf-16le
Then encode the string and retrieve the bytes encoded:
conv->RESET( ).
conv->WRITE( data = `ABCDE` ).
xstring = conv->GET_BUFFER( ).
Alternatively, instead of using RESET, WRITE and GET_BUFFER, you can use the method CONVERT, which was added in 6.40 and backported:
conv->CONVERT( EXPORTING data = `ABCDE` IMPORTING buffer = xstring ).
With the class CL_ABAP_CONV_OUT_CE, you need to use the number of the SAP Code Page, not the ISO name. Here are the most common SAP code pages and their equivalent ISO names:
1100: ISO-8859-1
1101: US-ASCII
1160: Windows-1252 ("ANSI")
1401: ISO-8859-2
4102: UTF-16BE
4103: UTF-16LE
4104: UTF-32BE
4105: UTF-32LE
4110: UTF-8
Etc. (the possible values are defined in the table TCP00A, in lines with column CPATTRKIND = 'H').
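For comparison outside ABAP, the same in-memory encoding can be sketched in Python. This is only an illustration, not SAP code: the function name encode_for_sap_codepage and the mapping from SAP code page numbers to Python codec names are assumptions derived from the list above.

```python
# Illustrative mapping from the SAP code page numbers listed above to
# Python codec names. The Python names are an assumption of this sketch.
SAP_TO_PYTHON_CODEC = {
    "1100": "iso-8859-1",
    "1101": "ascii",
    "1160": "cp1252",
    "1401": "iso-8859-2",
    "4102": "utf-16-be",
    "4103": "utf-16-le",
    "4104": "utf-32-be",
    "4105": "utf-32-le",
    "4110": "utf-8",
}


def encode_for_sap_codepage(text, sap_codepage):
    """Encode a character string into bytes for the given SAP code page.

    Raises KeyError for an unknown code page number and
    UnicodeEncodeError if a character does not exist in the code page.
    """
    codec = SAP_TO_PYTHON_CODEC[sap_codepage]
    return text.encode(codec)


# Each character becomes two bytes, low byte first.
print(encode_for_sap_codepage("ABCDE", "4103").hex())  # 41004200430044004500
```

As in ABAP, the result is a byte string, and a character missing from the target code page raises an error instead of being silently dropped.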
 
Writing a file on the application server in a given code page
In ABAP, OPEN DATASET can directly specify the target code page. Most code pages are supported, including UTF-8, but not the other UTF encodings (the remaining code pages 41xx); those can be written only with the solution explained in 2.3 below (by first encoding in memory).
2.1) IN TEXT MODE ENCODING ...
Possible ENCODING values:
UTF-8: in this mode, it's possible to add the Byte Order Mark if needed, via the option WITH BYTE-ORDER MARK.
DEFAULT: will be UTF-8 in a SAP "Unicode" system (that you can check via the menu System > Status > Unicode System Yes/No), NON-UNICODE otherwise.
NON-UNICODE: will depend on the current ABAP linguistic environment; for language English, it's the character encoding iso-8859-1, for language Polish, it's the character encoding iso-8859-2, etc. (the equivalences are shown in table TCP0C.)
Example in ABAP version 7.52 to write to UTF-8 with the byte order mark:
REPORT zmyprogram.
DATA(filename) = `/tmp/dataset_utf_8`.
OPEN DATASET filename IN TEXT MODE ENCODING UTF-8 WITH BYTE-ORDER MARK FOR OUTPUT.
TRY.
TRANSFER `Witaj świecie` TO filename.
CATCH cx_sy_conversion_codepage INTO DATA(lx).
" Character not supported in language code page
ENDTRY.
CLOSE DATASET filename.
Example in ABAP version 7.52 to write to iso-8859-2 (Polish language here):
REPORT zmyprogram.
SET LOCALE LANGUAGE 'L'. " Polish
DATA(filename) = `/tmp/dataset_nonunicode_pl`.
OPEN DATASET filename IN TEXT MODE ENCODING NON-UNICODE FOR OUTPUT.
TRY.
TRANSFER `Witaj świecie` TO filename.
CATCH cx_sy_conversion_codepage INTO DATA(lx).
" Character not supported in language code page
ENDTRY.
CLOSE DATASET filename.
2.2) IN LEGACY TEXT MODE CODE PAGE ...
Use any code page number except code pages 41xx (i.e. UTF-8 and other UTF; see workaround in 2.3 below).
Example in ABAP version 7.52 to write to iso-8859-2 (code page 1401) :
REPORT zmyprogram.
DATA(filename) = `/tmp/dataset_iso_8859_2`.
OPEN DATASET filename IN LEGACY TEXT MODE CODE PAGE '1401' FOR OUTPUT. " iso-8859-2
TRY.
TRANSFER `Witaj świecie` TO filename.
CATCH cx_sy_conversion_codepage INTO DATA(lx).
" Character not supported in language code page
ENDTRY.
CLOSE DATASET filename.
2.3) Other UTF encodings = general way (encode in memory) + IN BINARY MODE
Example in ABAP version 7.52:
REPORT zmyprogram.
TRY.
DATA(xstring) = cl_abap_codepage=>convert_to( source = `Witaj świecie` codepage = `UTF-16LE` ).
CATCH cx_sy_conversion_codepage INTO DATA(lx).
" Character not supported in language code page
BREAK-POINT.
ENDTRY.
DATA(filename) = `/tmp/dataset_utf_16le`.
OPEN DATASET filename IN BINARY MODE FOR OUTPUT.
TRANSFER xstring TO filename.
CLOSE DATASET filename.
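The same two-step pattern (encode in memory, then write the bytes in binary mode) can be sketched in Python for comparison. This is an illustration only, not ABAP; the function name write_encoded and the temporary file name are assumptions.

```python
import os
import tempfile


def write_encoded(path, text, codec):
    """Encode text in memory, then write the raw bytes in binary mode.

    Returns the number of bytes written. Raises UnicodeEncodeError if a
    character cannot be represented in the target codec.
    """
    data = text.encode(codec)
    with open(path, "wb") as handle:
        handle.write(data)
    return len(data)


# Hypothetical file name inside a throwaway temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "dataset_utf_16le")
    written = write_encoded(demo_path, "Witaj świecie", "utf-16-le")
    with open(demo_path, "rb") as handle:
        round_trip = handle.read().decode("utf-16-le")

print(written)      # 26 (13 characters, 2 bytes each)
print(round_trip)
```

Because the file is opened in binary mode, no further conversion happens on write, which is the point of the workaround.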

Slack slash command results in different encoding coming from iOS app or MacOS App

We have a Slash command integration and discovered that the text passed to the slash command is encoded differently if it comes from the (iOS) mobile app compared to the Desktop app.
For the command "/whereis #xsd" on the MacOS Desktop app, the text element in the body is encoded as: text=%3C%23C02MKG1LH%7Cxsd%3E
For the command "/whereis #xsd" on the iOS app the text element in the body is encoded as: text=%26lt%3B%23C02MKG1LH%7Cxsd%26gt%3B
The iOS app is incorrect.
Did anyone else experience this? Any solutions?
(I have posted this question to Slack, they confirmed the behavior a while back but no solution from them so far).
This is not a bug. Both are valid encodings of the same text. You can verify this by decoding them on this website.
The difference is that the string from iOS is additionally HTML-encoded (special characters like < become entities such as &lt;) before it is URL-encoded, while the desktop string is only URL-encoded. To handle both, your app has to first URL-decode the input string and then decode the HTML special characters.
The results are:
Desktop: <#C02MKG1LH|xsd>
iOS: <#C02MKG1LH|xsd>
Here is a sample code that will decode both strings correctly in PHP:
<?php
function decodeInputString($input)
{
return htmlspecialchars_decode(urldecode($input));
}
$desktop = "%3C%23C02MKG1LH%7Cxsd%3E";
$ios = "%26lt%3B%23C02MKG1LH%7Cxsd%26gt%3B";
$desktop_plain = decodeInputString($desktop);
$ios_plain = decodeInputString($ios);
var_dump($desktop_plain);
var_dump($ios_plain);
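For readers not using PHP, the same two-step decoding can be sketched in Python with the standard library. The function name decode_input_string mirrors the PHP sample and is otherwise an assumption; unquote_plus is used because, like PHP's urldecode, it also turns "+" into a space.

```python
import html
from urllib.parse import unquote_plus


def decode_input_string(raw):
    """URL-decode first, then decode HTML entities such as &lt; and &gt;."""
    return html.unescape(unquote_plus(raw))


desktop = "%3C%23C02MKG1LH%7Cxsd%3E"
ios = "%26lt%3B%23C02MKG1LH%7Cxsd%26gt%3B"

print(decode_input_string(desktop))  # <#C02MKG1LH|xsd>
print(decode_input_string(ios))      # <#C02MKG1LH|xsd>
```

The HTML step is a no-op for the desktop string, so the same function can be applied to input from either client.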

NSString: dealing with UTF8-based API

Which character set is the default for NSString when I get typed content from a UITextField?
I developed an app that sends such NSStrings to a UTF-8-based REST API. At the backend, there is a utf8-based MySQL database with utf8-based varchar fields.
My POST request sends string data from the iOS app to the server, and with a GET request I receive those strings from the REST API.
Within the app, everything is displayed fine. Special characters like ÄÖÜ are shown correctly after sending them to the server and receiving them back.
But when I open the mysql console on the REST API server and run a SELECT on this data, broken characters are visible.
What could be the root cause? Which character set does Apple use for NSString?
It sounds like a server issue. Check that the MySQL version you are using supports full UTF-8; older versions do not. See: How to support full Unicode in MySQL database
MySQL’s utf8 encoding is different from proper UTF-8 encoding. It doesn’t offer full Unicode support.
MySQL 5.5.3 (released in early 2010) introduced a new encoding called utf8mb4 which maps to proper UTF-8 and thus fully supports Unicode.
NSString has an internal representation that is essentially opaque.
The UITextField method text returns an NSString.
When you want data from a string to send to a server, use - (NSData *)dataUsingEncoding:(NSStringEncoding)encoding and specify the encoding, such as NSUTF8StringEncoding.
NSData *textFieldUTF8Data = [textFieldInstance.text dataUsingEncoding: NSUTF8StringEncoding];
If, by "mysql console", you are referring to the DOS-like window in Windows, then you need the following:
The command "chcp" controls the "code page". chcp 65001 provides UTF-8, but it needs a special charset installed, too.
To set the font in the console window: Right-click on the title of the window → Properties → Font → pick Lucida Console
Also, tell the 'console' that your bytes are UTF8 by doing SET NAMES utf8mb4.
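A small Python sketch can help separate the two possible causes mentioned above. It rests on two assumptions that are not from the question: that the console uses code page 850, and the helper name needs_utf8mb4. MySQL's utf8 stores at most 3 bytes per character, so only characters that need 4 bytes in UTF-8 require utf8mb4; umlauts need 2 bytes each, so their broken display points to the console code page instead.

```python
def needs_utf8mb4(text):
    """Return True if any character takes 4 bytes in UTF-8."""
    return any(len(ch.encode("utf-8")) == 4 for ch in text)


umlauts = "ÄÖÜ"
utf8_bytes = umlauts.encode("utf-8")

# Assumed console code page: 850. The console interprets each UTF-8 byte
# as one character, which produces the "broken" display.
console_view = utf8_bytes.decode("cp850")

print(needs_utf8mb4(umlauts))   # False: 2 bytes each, fits MySQL utf8
print(needs_utf8mb4("😀"))      # True: needs utf8mb4
print(len(utf8_bytes))          # 6
print(console_view == umlauts)  # False: the stored data may still be fine
```

If the app round-trips the text correctly, as described in the question, a display-only problem of this kind is a plausible explanation.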

Indy IMAP4 does not display German symbols correctly

I am using TIdIMAP4 component to fill the string grid with the messages of my GMail mailbox.
var IMAPClient: TIdIMAP4;
Some messages have German umlauts. When I call IMAPClient.RetrieveAllHeaders(MyMsgList) the string grid is populated as expected (all umlauts are displayed) but there are no UIDs however (I guess that RetrieveAllHeaders just doesn't fetch UIDs).
When I call IMAPClient.UIDRetrieveAllEnvelopes(MyMsgList), all additional attributes of the messages are there, but the headers are displayed as undecoded gibberish (=?ISO-8859-1?Q?_Die_Br=FCcke_von_Arnheim?=), where it should be 'Die Brücke von Arnheim'.
I've read many supportive posts but could not find the answer why IndyIMAP4 treats German symbols incorrectly.
Any ideas?
RetrieveAllHeaders() decodes the raw data it retrieves. UIDRetrieveAllEnvelopes() retrieves the raw data only, it does not decode. You can decode the raw headers manually by calling Indy's DecodeHeader() function in the IdCoderHeader unit.
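The raw value shown in the question is an RFC 2047 encoded word. What the decoding step does can be sketched in Python with the standard library. This is not Indy code, and the helper name decode_mime_header is an assumption.

```python
from email.header import decode_header


def decode_mime_header(raw):
    """Decode RFC 2047 encoded words in a raw header value to text."""
    parts = []
    for value, charset in decode_header(raw):
        if isinstance(value, bytes):
            # Fall back to ASCII when no charset is given (assumption).
            parts.append(value.decode(charset or "ascii", errors="replace"))
        else:
            parts.append(value)
    return "".join(parts)


raw_subject = "=?ISO-8859-1?Q?_Die_Br=FCcke_von_Arnheim?="
print(decode_mime_header(raw_subject).strip())  # Die Brücke von Arnheim
```

In the Q encoding an underscore stands for a space, which is why the decoded value starts with a blank before it is stripped.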

HTML decoding in C/C++

I'm using libcurl for getting HTML pages.
I have some problems with Hebrew characters.
for example this: סלקום
gets gibberish.
How do I get Hebrew characters and not gibberish?
Do I need some HTML decoder?
Does libcurl support such operation?
Does libiconv support such operation?
I appreciate any help.
Thanks
Edit: Ok, so what you’re seeing is UTF-8 data being decoded as Windows-1255 (so the numeric character references were a red herring). Here’s a demonstration in Python:
>>> u = ''.join(map(unichr, [1505, 1500, 1511, 1493, 1501]))
>>> s = u.encode('utf-8')
>>> print s.decode('cp1255', 'replace')
׳¡׳�׳§׳•׳�
The solution to this problem depends on the environment in which the output is displayed. Merely outputting the bytes received and expecting them to be interpreted as characters leads to problems like this.
An HTML document typically contains a header tag like <meta charset=utf-8> to indicate to the browser what its encoding should be. A document served by a web server contains an HTTP header like Content-Type: text/html; charset=utf-8.
You should ask libcurl for the Content-Type HTTP header to know the encoding of the document, and then convert it to the system encoding using iconv. While in your case that would be codepage 1255, it depends on the user’s system and so you should look up the appropriate functions to detect that.
(Read Unicode and Character Sets and the character-encoding tag on this site for a wealth of further information.)
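The approach described above (read the charset from the Content-Type header, then decode the body with it) can be sketched in Python. This is not libcurl or iconv code; the helper names and the UTF-8 fallback for a missing charset are assumptions.

```python
from email.message import Message


def charset_from_content_type(content_type, default="utf-8"):
    """Extract the charset parameter from a Content-Type header value.

    The default used when no charset is present is an assumption.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset() or default


def decode_body(body_bytes, content_type):
    """Decode a response body using the charset named in Content-Type."""
    charset = charset_from_content_type(content_type)
    return body_bytes.decode(charset, errors="replace")


body = "סלקום".encode("utf-8")
print(decode_body(body, "text/html; charset=utf-8"))
print(body.decode("cp1255", "replace") == "סלקום")  # False: wrong charset
```

Once the body is a proper character string, converting it to whatever encoding the display needs is a separate encode step.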
