NSString: dealing with UTF8-based API - ios

Which characterset is the default characterset for NSString, when i get typed content from a UITextField?
I developed an app, which sends such NSStrings to a UTF8-based REST-API. At the backend, there is an utf8 based MySQL-Database and also utf8-based varchar-fields.
My POST-Request sends string data from the iOS App to the server. And with a GET-Request i receive those strings from the REST API.
Within the App, everything is printed fine. Special UTF-8-Characters like ÄÖÜ are showed correctly after sending them to the server and after receive them back.
But when i enter the mysql-console of the server of the REST API, and do a SELECT-Command at these data, there are broken characters visible.
What could be the root cause? In which characterset does Apple use a NSString?

It sounds like it is a server issue. Check that the version you are using supports UTF-8, older versions do not. See : How to support full Unicode in MySQL database
MySQL’s utf8 encoding is different from proper UTF-8 encoding. It doesn’t offer full Unicode support.
MySQL 5.5.3 (released in early 2010) introduced a new encoding called utf8mb4 which maps to proper UTF-8 and thus fully supports Unicode.
NSString has in internal representation that is essentially opaque.
The UITextField method text returns an NSString.
When you want data from a string use to send to a server use - (NSData *)dataUsingEncoding:(NSStringEncoding)encoding and specify the encoding such as NSUTF8StringEncoding.
NSData *textFieldUTF8Data = [textFieldInstance.text dataUsingEncoding: NSUTF8StringEncoding];

If, by "mysql console", you are referring to the DOS-like window in Windows, then you need:
The command "chcp" controls the "code page". chcp 65001 provides utf8, but it needs a special charset installed, too. some code pages
To set the font in the console window: Right-click on the title of the window → Properties → Font → pick Lucida Console
Also, tell the 'console' that your bytes are UTF8 by doing SET NAMES utf8mb4.

Related

How to encode a STRING variable into a given code page

I've got a string variable containing a text that I need to encode and write to a file, in UTF-16LE code page.
Currently the following code generates a UTF-8 file and I don't see any option in the statement OPEN DATASET to generate the file in UTF-16LE.
REPORT zmyprogram.
DATA(filename) = `/tmp/myfile`.
OPEN DATASET filename IN TEXT MODE ENCODING DEFAULT FOR OUTPUT.
TRANSFER 'HELLO WORLD' TO filename.
CLOSE DATASET filename.
I guess one solution is to first encode the string in memory, then write the encoded bytes to the file.
Generally speaking, how to encode a string of characters into a given code page, in memory?
In the first part, I explain how to encode a string of characters into a given code page (all is done in memory), and in the second part, I explain specifically how to write files to the application server in a given code page.
General way (all in memory)
If a string of characters (type STRING) has to be encoded, the result has to be stored in a string of bytes, which corresponds to the built-in data type XSTRING.
There are several possibilities which depend on the ABAP version:
Since 7.53, use the class CL_ABAP_CONV_CODEPAGE:
DATA(xstring) = cl_abap_conv_codepage=>create_out( codepage = `UTF-16LE` )->convert( source = `ABCDE` ).
Since 7.02, use the class CL_ABAP_CODEPAGE:
DATA xstring TYPE xstring.
xstring = cl_abap_codepage=>convert_to( source = `ABCDE` codepage = `UTF-16LE` ).
Before 7.02, use the class CL_ABAP_CONV_OUT_CE (documentation provided with the class):
First, instantiate the conversion object, use a SAP code page number instead of the ISO name (list of values shown hereafter):
DATA: conv TYPE REF TO CL_ABAP_CONV_OUT_CE, xstring TYPE xstring.
conv = CL_ABAP_CONV_OUT_CE=>CREATE( encoding = '4103' ). "4103 = utf-16le
Then encode the string and retrieve the bytes encoded:
conv->RESET( ).
conv->WRITE( data = `ABCDE` ).
xstring = conv->GET_BUFFER( ).
Eventually, instead of using RESET, WRITE and GET_BUFFER, the method CONVERT was added in 6.40 and retroported :
conv->CONVERT( EXPORTING data = `ABCDE` IMPORTING buffer = xstring ).
With the class CL_ABAP_CONV_OUT_CE, you need to use the number of the SAP Code Page, not the ISO name. Here are the most common SAP code pages and their equivalent ISO names:
1100: ISO-8859-1
1101: US-ASCII
1160: Windows-1252 ("ANSI")
1401: ISO-8859-2
4102: UTF-16BE
4103: UTF-16LE
4104: UTF-32BE
4105: UTF-32LE
4110: UTF-8
Etc. (the possible values are defined in the table TCP00A, in lines with column CPATTRKIND = 'H').
 
Writing a file on the application server in a given code page
In ABAP, OPEN DATASET can directly specify the target code page, most code pages are supported including UTF-8, but not other UTF (code pages 41xx) which can be done only by the solution explained in 2.3 below (by first encoding in memory).
2.1) IN TEXT MODE ENCODING ...
Possible ENCODING values:
UTF-8: in this mode, it's possible to add the Byte Order Mark if needed, via the option WITH BYTE-ORDER MARK.
DEFAULT: will be UTF-8 in a SAP "Unicode" system (that you can check via the menu System > Status > Unicode System Yes/No), NON-UNICODE otherwise.
NON-UNICODE: will depend on the current ABAP linguistic environment; for language English, it's the character encoding iso-8859-1, for language Polish, it's the character encoding iso-8859-2, etc. (the equivalences are shown in table TCP0C.)
Example in ABAP version 7.52 to write to UTF-8 with the byte order mark:
REPORT zmyprogram.
DATA(filename) = `/tmp/dataset_utf_8`.
OPEN DATASET filename IN TEXT MODE ENCODING UTF-8 WITH BYTE-ORDER MARK FOR OUTPUT.
TRY.
TRANSFER `Witaj świecie` TO filename.
CATCH cx_sy_conversion_codepage INTO DATA(lx).
" Character not supported in language code page
ENDTRY.
CLOSE DATASET filename.
Example in ABAP version 7.52 to write to iso-8859-2 (Polish language here):
REPORT zmyprogram.
SET LOCALE LANGUAGE 'L'. " Polish
DATA(filename) = `/tmp/dataset_nonunicode_pl`.
OPEN DATASET filename IN TEXT MODE ENCODING NON-UNICODE FOR OUTPUT.
TRY.
TRANSFER `Witaj świecie` TO filename.
CATCH cx_sy_conversion_codepage INTO DATA(lx).
" Character not supported in language code page
ENDTRY.
CLOSE DATASET filename.
2.2) IN LEGACY TEXT MODE CODE PAGE ...
Use any code page number except code pages 41xx (i.e. UTF-8 and other UTF; see workaround in 2.3 below).
Example in ABAP version 7.52 to write to iso-8859-2 (code page 1401) :
REPORT zmyprogram.
DATA(filename) = `/tmp/dataset_iso_8859_2`.
OPEN DATASET filename IN LEGACY TEXT MODE CODE PAGE '1401' FOR OUTPUT. " iso-8859-2
TRY.
TRANSFER `Witaj świecie` TO filename.
CATCH cx_sy_conversion_codepage INTO DATA(lx).
" Character not supported in language code page
ENDTRY.
CLOSE DATASET filename.
2.3) UTF = general way + IN BINARY MODE
Example in ABAP version 7.52:
REPORT zmyprogram.
TRY.
DATA(xstring) = cl_abap_codepage=>convert_to( source = `Witaj świecie` codepage = `UTF-16LE` ).
CATCH cx_sy_conversion_codepage INTO DATA(lx).
" Character not supported in language code page
BREAK-POINT.
ENDTRY.
DATA(filename) = `/tmp/dataset_utf_16le`.
OPEN DATASET filename IN BINARY MODE FOR OUTPUT.
TRANSFER xstring TO filename.
CLOSE DATASET filename.

Slack slash command results in different encoding coming from iOS app or MacOS App

We have a Slash command integration and discovered that the text passed to the slash command is encoded differently if it comes from the (iOS) mobile app compared to the Desktop app.
For the command "/whereis #xsd" on the MacOS Desktop app, the text element in the body is encoded as: text=%3C%23C02MKG1LH%7Cxsd%3E
For the command "/whereis #xsd" on the iOS app the text element in the body is encoded as: text=%26lt%3B%23C02MKG1LH%7Cxsd%26gt%3B
The iOS app is incorrect.
Did anyone else experience this? Any solutions?
(I have posted this question to Slack, they confirmed the behavior a while back but no solution from them so far).
This is not a bug. Both are valid HTML encodings. You can verify this by decoding them on this website.
The difference is that the string from IOS also includes an encoding of HTML special characters (like <) but the desktop string does not. To address this your app has to first do a URL decoding of the input string and then decode special HTML chars.
The results are:
Desktop: <#C02MKG1LH|xsd>
IOS: <#C02MKG1LH|xsd>
Here is a sample code that will decode both strings correctly in PHP:
<?php
function decodeInputString($input)
{
return htmlspecialchars_decode(urldecode($input));
}
$desktop = "%3C%23C02MKG1LH%7Cxsd%3E";
$ios = "%26lt%3B%23C02MKG1LH%7Cxsd%26gt%3B";
$desktop_plain = decodeInputString($desktop);
$ios_plain = decodeInputString($ios);
var_dump($desktop_plain);
var_dump($ios_plain);

Delphi TIdHTTP POST does not encode plus sign

I have a TIdHTTP component on a form, and I am sending an http POST request to a cloud-based server. Everything works brilliantly, except for 1 field: a text string with a plus sign, e.g. 'hello world+dog', is getting saved as 'hello world dog'.
Researching this problem, I realise that a '+' in a URL is regarded as a space, so one has to encode it. This is where I'm stumped; it looks like the rest of the POST request is encoded by the TIdHTTP component, except for the '+'.
Looking at the request through Fiddler, it's coming through as 'hello%20world+dog'. If I manually encode the '+' (hello world%2Bdog), the result is 'hello%20world%252Bdog'.
I really don't know what I'm doing here, so if someone could point me in the right direction it would be most appreciated.
Other information:
I am using Delphi 2010. The component doesn't have any special settings, I presume I might need to set something? The header content-type that comes through in Fiddler is 'application/x-www-form-urlencoded'.
Then, the Delphi code:
Request:='hello world+dog';
URL :='http://............./ExecuteWithErrors';
TSL:=TStringList.Create;
TSL.Add('query='+Request);
Try
begin
IdHTTP1.ConnectTimeout:=5000;
IdHTTP1.ReadTimeout :=5000;
Reply:=IdHTTP1.Post(URL,TSL);
You are using an outdated version of Indy and need to upgrade.
TIdHTTP's webform data encoder was changed several times in late 2010. Your version appears to predate all of those changes.
In your version, TIdHTTP uses TIdURI.ParamsEncode() internally to encode the form data, where a space character is encoded as %20 and a + character is left un-encoded, thus:
hello%20world+dog
In October 2010, the encoder was updated to encode a space character as & before calling TIdURI.ParamsEncode(), thus:
hello&world+dog
In early December 2010, the encoder was updated to encode a space character as + instead, thus:
hello+world+dog
In late December 2010, the encoder was completely re-written to follow W3C's HTML specifications for application/x-www-form-urlencoded. A space character is encoded as + and a + character is encoded as %2B, thus:
hello+world%2Bdog
In all cases, the above logic is applied only if the hoForceEncodeParams flag is enabled in the TIdHTTP.HTTPOptions property (which it is by default). If upgrading is not an option, you will have to disable the hoForceEncodeParams flag and manually encode the TStringList content yourself:
Request:='hello+world%2Bdog';

iconv C API: charset conversion from/to local encoding

I am using the iconv C API and I want iconv to detect the local encoding of the computer. Is that possible? Apparently it is because when I look in the source code, I find in the file iconv_open1.h that if the fromcode or tocode variables are empty strings ("") then the local encoding is used using the locale_charset() function call.
Someone also told me that in order to convert the locale encoding to unicode, all I needed was to use iconv_open ("UTF-8", "")
Unfortunately, I find no mention of this in the documentation.
And when I convert some iso-8859-1 text to the locale encoding (which is utf-8 on my machine), then during conversion I get errno=EILSEQ (illegal sequence). I checked and iconv_open returned no error.
If instead of the empty string in iconv_open I specify "utf-8", then I get no error. Obviously iconv failed to detect my current charset.
edit: I checked with a simple C program that puts(nl_langinfo(CODESET)) and I get ANSI_X3.4-1968 (which is ASCII). Apparently, I got a problem with charset detection.
edit: this should be related to Why is nl_langinfo(CODESET) different from locale charmap?
additional information: my program is written in Ada, and I bind at link-time to C functions. Apparently, the locale setting is not initialized the same way in the Ada runtime and C runtime.
I'll take the same answer as in Why is nl_langinfo(CODESET) different from locale charmap?
You need to first call
setlocale(LC_ALL, "");

XML Parsing with parseFromString with portuguese characters

I have a Blackberry app developed using PhoneGap. I am using suds client to call web service. There are some Portuguese character in the webservice XML. I am not able to parse to XMLDoc using the DOMParser.
I am using
xmlDoc = parser.parseFromString(_xml, "text/xml");
The encoding type is UTF-8. Without the Portuguese character, parsing is working perfectly.
"I am using is UTF-8 encoding type." - this can mean several things, so it is unclear what exactly you do in order to support UTF-8 end-to-end.
E.g. you should check:
your web service really sends data in UTF-8 (when it converts string chars into bytes to be sent into output stream it should use UTF-8)
the device code that reads data from web really uses UTF-8 to convert bytes to string _xml
P.S. I'm not familiar with phonegap API so this is just a general plan.

Resources