Delphi THTTPClient: POST's IHttpResponse gives an encoded string, as seen in Fiddler's WebView

I have a THTTPClient giving a strange response (apparently UTF-16 encoded?) when invoking lHttpResp.ContentAsString().
The string comes through as this:
㰀㼀砀洀氀 瘀攀爀猀椀漀渀㴀∀㄀⸀ ∀ 攀渀挀漀搀椀渀最㴀∀唀吀䘀ⴀ㄀㘀∀ 猀琀愀渀搀愀氀漀渀攀㴀∀礀攀猀∀㼀㸀਀㰀刀䔀匀唀䰀吀㸀਀    㰀倀䄀夀倀䄀䜀䔀唀刀䰀㸀栀琀琀瀀猀㨀⼀⼀攀攀⸀琀攀猀琀⸀瀀愀礀最愀琀攀眀愀礀⸀挀漀洀⼀䠀漀猀琀倀愀礀匀攀爀瘀椀挀攀⼀瘀㄀⼀栀漀猀琀瀀愀礀⼀瀀愀礀瀀愀最攀⼀㄀㘀㜀 㔀㘀㘀㈀㄀㔀㐀㌀㈀㐀欀䬀䬀儀㔀唀漀䈀椀吀氀㠀䔀爀䈀䴀 戀㰀⼀倀䄀夀倀䄀䜀䔀唀刀䰀㸀਀    㰀匀䔀匀匀䤀伀一吀伀䬀䔀一㸀㄀㘀㜀 㔀㘀㘀㈀㄀㔀㐀㌀㈀㐀欀䬀䬀儀㔀唀漀䈀椀吀氀㠀䔀爀䈀䴀 戀㰀⼀匀䔀匀匀䤀伀一吀伀䬀䔀一㸀਀㰀⼀刀䔀匀唀䰀吀㸀਀
Running Fiddler, I can see the response is fine in the Raw or Text view, but it matches the above encoding in the WebView. I'm probably missing something pretty obvious here, but I've tried converting with TEncoding to no avail, as per this thread:
Delphi - converting string back from UTF-8
Fiddler's Text view shows the correct text:
<?xml version="1.0" encoding="UTF-16" standalone="yes"?>
<RESULT>
    <PAYPAGEURL>https://url/</PAYPAGEURL>
    <SESSIONTOKEN>1670565241202KKv4NPBBScANOL6rxbi</SESSIONTOKEN>
</RESULT>

A colleague helped with this and found it was TEncoding.BigEndianUnicode; it was one of the few encodings I hadn't tried, due to tunnel vision. The resolution is below. Thanks for your input.
lHttpResp.ContentAsString(TEncoding.BigEndianUnicode); // to get the result as text, for testing etc.
XmlFile.LoadFromStream(lHttpResp.ContentStream); // to load it into an XML file
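For future readers, the giveaway is the CJK-looking output: big-endian UTF-16 decoded as little-endian swaps the bytes of every 16-bit code unit, turning ASCII into CJK characters. A minimal Java sketch (illustrative only, not the Delphi code) reproduces the effect:
import java.nio.charset.StandardCharsets;

public class ByteSwapDemo {
    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-16\" standalone=\"yes\"?>";
        byte[] bigEndian = xml.getBytes(StandardCharsets.UTF_16BE);
        // Decoding big-endian bytes as little-endian byte-swaps every code unit:
        System.out.println(new String(bigEndian, StandardCharsets.UTF_16LE)); // 㰀㼀砀洀氀...
        // Decoding with the correct byte order recovers the original text:
        System.out.println(new String(bigEndian, StandardCharsets.UTF_16BE));
    }
}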

Related

SAXParseException when using rest-assured

I am trying to verify an XML response with rest-assured like this:
.then().body("some.xml.path", is("abc"));
However, what I get is a SAXParseException:
DOCTYPE is disallowed when the feature "http://apache.org/xml/features/disallow-doctype-decl" set to true.
Response starts like this:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE cXML SYSTEM "http://xml.cXML.org/schemas/cXML/1.2.021/cXML.dtd">
<cXML ...
Why am I getting this exception? What should I change?
I am using version 3.2.0 of rest-assured.
A similar question has been answered here. In short, the answer suggests using disableLoadingOfExternalDtd() to have REST Assured ignore the document type definition (DTD) in your XML.
Normally, the DTD (via the external definition) describes the structural layout of the cXML element.
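Assuming that approach, the configuration would look roughly like this (a sketch; the endpoint path is a placeholder, and the matcher is taken from the question):
import static io.restassured.RestAssured.given;
import static org.hamcrest.Matchers.is;

import io.restassured.RestAssured;
import io.restassured.config.XmlConfig;

public class CxmlTest {
    public void verifiesBodyDespiteDoctype() {
        given()
            // Tell REST Assured's XML parser not to load the external DTD:
            .config(RestAssured.config().xmlConfig(
                XmlConfig.xmlConfig().disableLoadingOfExternalDtd()))
        .when()
            .get("/some/endpoint") // placeholder endpoint
        .then()
            .body("some.xml.path", is("abc"));
    }
}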

What encoding is this and how do I turn it into something I can see properly?

I'm writing a script that will operate on the subtitle files of a popular streaming service (Netfl*x).
The subtitle files contain strange characters that neither my text editors nor my web browser will render readably. The XML declaration says UTF-8, but some characters are not readable.
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<tt xmlns:tt="http://www.w3.org/ns/ttml" xmlns:ttm="http://www.w3.org/ns/ttml#metadata" xmlns:ttp="http://www.w3.org/ns/ttml#parameter" xmlns:tts="http://www.w3.org/ns/ttml#styling" ttp:tickRate="10000000" ttp:timeBase="media" xmlns="http://www.w3.org/ns/ttml">
<p>de 15 % la nuit dernière.</span></p>
<p>if youâve got things to doâ¦</span></p>
In Vim and in the browser it shows the same garbled characters (screenshots omitted).
How can I convert this into something I can use?
I'll go out on a limb and say that the file is UTF-8 encoded just fine, and you're merely looking at it using the wrong encoding. The character À encoded in UTF-8 is C3 80. C3 in ISO-8859-1 is Ã, which in your screenshot is followed by an 80. So it looks like you're viewing a UTF-8 file using the (wrong) ISO-8859-1 encoding.
Use the correct encoding when opening the file.
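The effect is easy to reproduce; a minimal Java sketch (illustrative only):
import java.nio.charset.StandardCharsets;

public class Utf8AsLatin1 {
    public static void main(String[] args) {
        byte[] utf8 = "dernière".getBytes(StandardCharsets.UTF_8); // è -> C3 A8
        // Misreading the UTF-8 bytes as ISO-8859-1 produces the mojibake:
        System.out.println(new String(utf8, StandardCharsets.ISO_8859_1)); // dernière
        // Reading them with the correct charset recovers the text:
        System.out.println(new String(utf8, StandardCharsets.UTF_8)); // dernière
    }
}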
My terminal is set to en_US.UTF-8, but it was also rendering this supposedly UTF-8 encoded file incorrectly (sonné -> sonné). I was able to solve this by using iconv to convert the file to ISO-8859-1.
iconv original.xml -t ISO8859-1 -o converted.xml
In the new file, the characters were properly rendered, although I don't quite understand why.
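The likely explanation: the file was double-encoded, i.e. text that was already UTF-8 got read as ISO-8859-1 and encoded to UTF-8 again. Converting from UTF-8 back to ISO-8859-1 peels off that extra layer, leaving plain UTF-8 bytes. A Java sketch of the same round trip (hypothetical data, just to show the mechanics):
import java.nio.charset.StandardCharsets;

public class DoubleEncoding {
    public static void main(String[] args) {
        byte[] once = "sonné".getBytes(StandardCharsets.UTF_8);          // é -> C3 A9
        String mojibake = new String(once, StandardCharsets.ISO_8859_1); // "sonné"
        byte[] twice = mojibake.getBytes(StandardCharsets.UTF_8);        // double-encoded

        // What iconv -t ISO8859-1 does: decode UTF-8, re-encode as ISO-8859-1.
        String decoded = new String(twice, StandardCharsets.UTF_8);      // "sonné"
        byte[] repaired = decoded.getBytes(StandardCharsets.ISO_8859_1); // back to C3 A9
        System.out.println(new String(repaired, StandardCharsets.UTF_8)); // sonné
    }
}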

Simple NSData category to parse XML with Cyrillic

I have to parse NSData containing an XML string. Does somebody know a simple category to do it? I have one for JSON, but I'm forced to use XML. I tried XMLReader; its interface looks clean, but I found some issues:
Mysterious newline characters and spaces everywhere:
"comment_count" = {text = "\n \n 21";};
My Cyrillic symbols look like this:
"description_text" = {text = "\n \U041f\U0438\U043a\U0430\U0431\U0443\U0448};
Example:
<?xml version="1.0" encoding="UTF-8" ?>
<news>
    <xml_count>43</xml_count>
    <hot_count>449</hot_count>
    <item type="text">
        <id>1469845</id>
        <rating>147</rating>
        <pluses>171</pluses>
        <minuses>24</minuses>
        <title>
            <![CDATA[Обновление огромного архива Пикабу!]]>
        </title>
        <comment_count>26</comment_count>
        <comment_link>http://pikabu.ru/story/obnovlenie_ogromnogo_arkhiva_pikabu_1469845</comment_link>
        <author>icq677555</author>
        <description_text>
            <![CDATA[Пикабушники, я обновил свой огромный архив текстовых постов из горячего!]]>
        </description_text>
    </item>
</news>
I just realized what's going on. Your data samples are obviously NSDictionary instances printed in the debugger. So the issues you found are:
As XML was originally designed as an annotated text format, its whitespace (spaces, newlines) handling doesn't perfectly fit data-only usage. You can either trim all resulting strings ([stringVar stringByTrimmingCharactersInSet:[NSCharacterSet whitespaceAndNewlineCharacterSet]]), adapt XMLReader to do it, or use the XML parser at http://ios.biomsoft.com/2011/09/11/simple-xml-to-nsdictionary-converter/ (which does it by default).
The funny output you get for Cyrillic characters is just the proper escaping for non-ASCII characters in the debugger output, which uses the old-style property-list format. It's an artifact of the debugger; your variables contain the proper characters.
BTW: While JSON contains implicit type information (strings are always quoted, numbers are never quoted, etc.), XML without a schema file does not. So all the parsed simple values will be strings, even if they originally were numbers.
Update:
The XML parser you're using still contains the old whitespace-handling code described in Pesky new lines and whitespace in XML reader class (though the comment claims otherwise). Apply the fix mentioned at the bottom of that answer, namely change the line:
[dictInProgress setObject:textInProgress forKey:kXMLReaderTextNodeKey];
to:
[dictInProgress setObject:[textInProgress stringByTrimmingCharactersInSet:[NSCharacterSet whitespaceAndNewlineCharacterSet]] forKey:kXMLReaderTextNodeKey];

What does "Error parsing XML: not well-formed" mean?

<?xml version="1.0" encoding="utf-8"?>
<LinearLayout xmlns:android="http://schemas.android.com/apk/res/android"
android:orientation=”vertical”
android:layout_width=”fill_parent”
android:layout_height=”fill_parent” >
I get these two errors:
error: Error parsing XML: not well-formed (invalid token)
&
Open quote is expected for attribute "android:orientation" associated with an element type "LinearLayout".
Did you copy and paste that from Word? Your quotes look a little funky. Sometimes Word will use a different character than the expected " for double quotes. Make sure those are all consistent. Otherwise, the syntax is invalid.
Looks like you have "smart quotes" (not simple " double quotes) around some attributes in your LinearLayout element.
There are many references that explain the differences between valid and well-formed XML documents. A good starting point can be found here. There is also an online XML Validator that you can use to test XML documents.
The validator shows that you have two issues:
Some of your attribute values use an invalid quote character: ” vs. ", and
you need to close the LinearLayout tag with /> instead of just >.
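With straight quotes, the fragment would look like this (self-closed with /> here to satisfy the validator; in a real layout with children you would instead keep > and add a matching </LinearLayout>):
<?xml version="1.0" encoding="utf-8"?>
<LinearLayout xmlns:android="http://schemas.android.com/apk/res/android"
    android:orientation="vertical"
    android:layout_width="fill_parent"
    android:layout_height="fill_parent" />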

HTML decoding in C/C++

I'm using libcurl to fetch HTML pages.
I have some problems with Hebrew characters.
For example, this: סלקום
comes out as gibberish.
How do I get Hebrew characters and not gibberish?
Do I need some HTML decoder?
Does libcurl support such an operation?
Does libiconv support such an operation?
I appreciate any help.
Thanks
Edit: OK, so what you're seeing is UTF-8 data being decoded as Windows-1255 (so the numeric character references were a red herring). Here's a demonstration in Python:
>>> u = ''.join(map(unichr, [1505, 1500, 1511, 1493, 1501]))
>>> s = u.encode('utf-8')
>>> print s.decode('cp1255', 'replace')
׳¡׳�׳§׳•׳�
The solution to this problem depends on the environment in which the output is displayed. Merely outputting the bytes received and expecting them to be interpreted as characters leads to problems like this.
An HTML document typically contains a tag in its header like <meta charset=utf-8> to indicate to the browser what its encoding should be, and a document served by a web server carries an HTTP header like Content-Type: text/html; charset=utf-8.
You should ask libcurl for the Content-Type HTTP header to learn the encoding of the document, and then convert it to the system encoding using iconv. While in your case that would be codepage 1255, it depends on the user's system, so you should look up the appropriate functions to detect that.
(Read Unicode and Character Sets and the character-encoding tag on this site for a wealth of further information.)
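The shape of the fix is the same in any language: take the raw body bytes plus the charset from the Content-Type header, then decode. A rough Java sketch (the header parsing here is deliberately naive, an illustration rather than a robust parser):
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DecodeBody {
    // body: raw response bytes; contentType: e.g. "text/html; charset=utf-8"
    static String decode(byte[] body, String contentType) {
        Charset cs = StandardCharsets.ISO_8859_1; // HTTP's historical default
        int i = contentType.toLowerCase().indexOf("charset=");
        if (i >= 0) {
            cs = Charset.forName(contentType.substring(i + 8).trim());
        }
        return new String(body, cs); // a proper Unicode string, ready to re-encode
    }

    public static void main(String[] args) {
        byte[] hebrew = "סלקום".getBytes(StandardCharsets.UTF_8);
        System.out.println(decode(hebrew, "text/html; charset=utf-8")); // סלקום
    }
}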
