Simple NSData category to parse XML with Cyrillic - iOS

I have to parse NSData containing an XML string. Does anybody know a simple category to do it? I have one for JSON, but I am forced to use XML. I tried XMLReader; its interface looks clean, but I found some issues:
Mysterious new line characters and spaces everywhere:
"comment_count" = {text = "\n \n 21";};
My Cyrillic characters look like this:
"description_text" = {text = "\n \U041f\U0438\U043a\U0430\U0431\U0443\U0448};
Example:
<?xml version="1.0" encoding="UTF-8" ?>
<news>
<xml_count>43</xml_count>
<hot_count>449</hot_count>
<item type="text">
<id>1469845</id>
<rating>147</rating>
<pluses>171</pluses>
<minuses>24</minuses>
<title>
<![CDATA[Обновление огромного архива Пикабу!]]>
</title>
<comment_count>26</comment_count>
<comment_link>http://pikabu.ru/story/obnovlenie_ogromnogo_arkhiva_pikabu_1469845</comment_link>
<author>icq677555</author>
<description_text>
<![CDATA[Пикабушники, я обновил свой огромный архив текстовых постов из горячего!]]>
</description_text>
</item>
</news>

I just realized what's going on. Your data samples are obviously NSDictionary instances printed in the debugger. So the issues you found are:
As XML was originally designed as an annotated text format, the whitespace (spaces, newlines) handling doesn't perfectly fit for data only usage. You can either trim all resulting strings ([stringVar stringByTrimmingCharactersInSet:[NSCharacterSet whitespaceAndNewlineCharacterSet]]), adapt XMLReader to do it or use the XML parser at http://ios.biomsoft.com/2011/09/11/simple-xml-to-nsdictionary-converter/ (which does it by default).
The funny output you get for Cyrillic characters is the proper escaping for non-ASCII characters in the debugger output (which uses the old-style property list format). It's an artifact of the debugger output. Your variables contain the proper characters.
BTW: While JSON contains implicit type information (strings are always quoted, numbers are never quoted etc.), XML without a schema file does not. So all the parsed simple values will be strings even if they originally were numbers.
Update:
The XML parser you're using still contains the old whitespace handling code described in "Pesky new lines and whitespace in XML reader class" (though the comment says otherwise). Apply the fix mentioned at the bottom of that answer, namely change the line:
[dictInProgress setObject:textInProgress forKey:kXMLReaderTextNodeKey];
to:
[dictInProgress setObject:[textInProgress stringByTrimmingCharactersInSet:[NSCharacterSet whitespaceAndNewlineCharacterSet]] forKey:kXMLReaderTextNodeKey];
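The three points above (whitespace around values, escaped Cyrillic in debug output, everything arriving as strings) are not specific to XMLReader. As a rough illustration outside iOS, here is a minimal Python sketch using the standard library's ElementTree; the sample document is a shortened, hypothetical variant of the one in the question, and the variable names are made up for this example.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical variant of the question's document.
SAMPLE = """<news>
<item type="text">
<comment_count>26</comment_count>
<title>
<![CDATA[Обновление огромного архива Пикабу!]]>
</title>
</item>
</news>"""

item = ET.fromstring(SAMPLE).find("item")

# The newlines around the CDATA section are part of the element's text.
raw_title = item.findtext("title")
# Trimming plays the same role as stringByTrimmingCharactersInSet:.
title = raw_title.strip()

# Without a schema, every simple value is a string; convert explicitly.
count_text = item.findtext("comment_count")
count = int(count_text)

# An escaped spelling and the literal characters denote the same string.
same = "\u041f\u0438\u043a\u0430\u0431\u0443" == "Пикабу"

print(repr(raw_title))
print(title, count, same)
```

The escaped form only shows up when a string is printed in its debug representation; the value itself holds the proper characters.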

Related

Using Umlaut or special characters in ibm-doors from batch

We have a link module that looks something like this:
const string lMod = "/project/_admin/somethingÜ" // Umlaut
We later use lMod like this to loop through the outlinks:
for a in obj->lMod do {}
But this only works when executing directly from DOORS and not from a batch script. For some reason the batch run doesn't recognize the Umlaut, so the body of the loop is never executed. Replacing lMod with "*" works and also shows the objects linked to via lMod.
We are already using UTF-8 encoding for the file:
pragma encoding, "UTF-8"
Any solutions are welcome.
Encode the file as UTF-8 in Notepad++ by going to Encoding > Convert to UTF-8. (Make sure it's not already set to UTF-8 before you do it).

SEC company filings: Is the <SEC-HEADER> tag valid SGML? If so, how to parse it?

I tried to parse SEC company filings from sec.gov. Starting from the FB 10-Q index.htm, let's look at a complete submission text filing. It has a structure like:
<SEC-DOCUMENT>
<SEC-HEADER>
<ACCEPTANCE-DATETIME>"some content" This tag is not closed.
"some lines resembling yaml markup"
These are indented lines with a
"key": "value" structure.
</SEC-HEADER>
<DOCUMENT>
.
.
some content
.
.
</DOCUMENT>
"several DOCUMENT tags" ...
</SEC-DOCUMENT>
I tried to figure out the structure of the <SEC-HEADER> tag and found some information under Public Dissemination Service (PDS) Technical Specification (pdf) and concluded that the content of the header should be SGML.
Nevertheless, I am clueless about the formatting, since there are no angle brackets and the key-value pairs are separated by colons, like key: value instead of <key>value</key>. In the linked pdf I could not find anything about colons.
Question: Is the <SEC-HEADER> tag valid SGML? If it is, how to parse it?
I'd be grateful for any help.
The short answer is no. The <SEC-HEADER> section in the raw filing is not valid SGML.
However, it is my understanding that this section in the raw filing is parsed automatically from the header file <accession_num>.hdr.sgml, which does follow SGML. This header file can be found in the same directory as the raw filing (i.e., the <accession_num>.txt file).
I use a REGEX of the form: ^<(.+?)>(.+?)$ (with re.MULTILINE option) to capture each (tag, value) tuple and get the results directly in a dict().
I believe the only tag in that file that has a closing tag is the </FILER> tag, where there could be multiple filers in each filing. You can first extract those using a REGEX of the form: <FILER>(.+?)</FILER> and then employ the same REGEX as above to get the inner tags for each filer.
Note that other than 'FILER', there could be other tags, representing different relations of the entities to the filing. Those are 'ISSUER', 'SUBJECT COMPANY', 'FILED BY', 'FILED FOR', 'SERIAL COMPANY', 'REPORTING OWNER'.
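The two-step regex approach described above can be sketched as follows. This is a minimal illustration, not a complete parser: the sample text is a made-up excerpt in the style of an <accession_num>.hdr.sgml file, the helper name parse_header is hypothetical, and the re.DOTALL flag on the <FILER> pattern is an assumption added here so that the pattern can span several lines.

```python
import re

# Made-up excerpt in the style of an <accession_num>.hdr.sgml file.
SAMPLE = """<ACCEPTANCE-DATETIME>20140101120000
<TYPE>10-Q
<FILER>
<COMPANY-DATA>
<CONFORMED-NAME>EXAMPLE CORP ONE
<CIK>0000000001
</COMPANY-DATA>
</FILER>
<FILER>
<COMPANY-DATA>
<CONFORMED-NAME>EXAMPLE CORP TWO
<CIK>0000000002
</COMPANY-DATA>
</FILER>
"""

# One (tag, value) pair per line; lines holding only a tag do not match.
TAG_VALUE = re.compile(r"^<(.+?)>(.+?)$", re.MULTILINE)
# DOTALL (an assumption of this sketch) lets "." cross line breaks.
FILER_BLOCK = re.compile(r"<FILER>(.+?)</FILER>", re.DOTALL)


def parse_header(text):
    """Return (top-level tags, list of per-filer tag dicts)."""
    filers = [dict(TAG_VALUE.findall(block))
              for block in FILER_BLOCK.findall(text)]
    # Remove the filer blocks first so their tags do not leak into
    # the top-level dictionary.
    top_level = dict(TAG_VALUE.findall(FILER_BLOCK.sub("", text)))
    return top_level, filers


top_level, filers = parse_header(SAMPLE)
print(top_level)
print(filers)
```

Collecting into a dict keeps only the last value when a tag repeats within one block, so a list of tuples may be the safer container for real filings. The same block pattern could be repeated for the other relation tags such as ISSUER or REPORTING OWNER.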

Encoding in POJO to/from XML conversion within Camel

We have been very successful in carrying out POJO to/from XML conversion within Camel. The following code shows a typical case of how we use Camel. Our application listens to an Oracle AQ. The queue entry is an XML string. The XML is converted to a POJO class (MyClass), and we then apply some transformation to the MyClass object using data from another source. After this transformation, the POJO is converted back to a string and sent to another system (here we save it to a file).
<route id="testing">
<from uri="oracleaq:queue:FUSEQUEUE"/>
<convertBodyTo type="generated.MyClass"/>
<bean ref="mainReqprocessor" method="Modify"/>
<convertBodyTo type="java.lang.String"/>
<setHeader headerName="Exchange.FILE_NAME">
<simple>output.xml</simple>
</setHeader>
<to uri="file:C:\\Temp\\OUT"/>
</route>
Everything worked fine until yesterday, when we introduced HTML tags into one of the text fields of the POJO class. We wrapped the text in a CDATA section: "<![CDATA[" + str + "]]>". But when the POJO is converted to a string, the opening and closing brackets of the CDATA section are still encoded, as shown below. Because of this, the resulting XML string is no longer valid XML and therefore cannot be converted back to MyClass by other applications. This is not the desired behavior. How can I avoid the encoding of the CDATA opening and closing brackets? [Note: the first < and the last > of the CDATA section are encoded.]
<TEXT>
&lt;![CDATA[&lt;html&gt;&lt;div&gt;&lt;pre&gt;COMPONENT PARTS.&lt;/br&gt;&lt;/div&gt;&lt;/pre&gt;&lt;/html&gt;]]&gt;
</TEXT>
You have a marshalling/unmarshalling problem, but you don't mention how you convert the XML to a POJO and back. That information is important for anyone trying to help.
If you are using JAXB for the conversion, this Q/A could perhaps help you:
JAXB Marshalling Unmarshalling with CDATA

retrieve xml file using nsxmlparser in ios

I am having a problem reading XML files with NSXMLParser in iOS:
<PRODUCTS>
<PRODUCTSLIST>
<PRODUCTDETAILS>
<headertext> test header </headertext>
<description><b style="font-size: x-small;">product, advantages</b></description>
</PRODUCTDETAILS>
</PRODUCTSLIST>
</PRODUCTS>
When I read the file using NSXMLParser, I am able to get the value (test header) for headertext, but the description value contains HTML tags, so I can't get the expected result (<b style="font-size: x-small;">product, advantages</b>); I get an empty result instead.
How can I get (<b style="font-size: x-small;">product, advantages</b>) as the result for the description element?
Speaking from a developer's perspective, I would not recommend using NSXMLParser because of its laborious way of parsing XML files. There is a great write-up about choosing the right XML parser.
I use KissXML quite often.
You can find a quick tutorial on using it here.
Hope this helps.
Your problem is that the "b" tag is considered part of the XML structure. Try escaping the '<' and '>' characters of the 'b' tag:
@"&lt;b style=\"font-size: x-small;\"&gt;product, advantages&lt;/b&gt;"
see here
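The effect of escaping can be checked with any XML parser, not only NSXMLParser. Below is a minimal Python sketch using the standard library; it assumes the document producer can be changed to escape the markup, and it also shows a CDATA section as an alternative that the answer above does not mention. The variable names are made up for this example.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

html = '<b style="font-size: x-small;">product, advantages</b>'

# Raw markup: <b> is parsed as a child element of <description>,
# so <description> itself has no direct text.
raw = ET.fromstring("<description>" + html + "</description>")

# Escaped markup: the parser hands back the original HTML as plain text.
escaped = ET.fromstring("<description>" + escape(html) + "</description>")

# Alternative (an assumption of this sketch): wrap the HTML in CDATA.
cdata = ET.fromstring("<description><![CDATA[" + html + "]]></description>")

print(raw.text, raw[0].tag, raw[0].text)
print(escaped.text)
print(cdata.text)
```

In the raw case the text "product, advantages" is still available, but it belongs to the nested b element rather than to description, which is why the description value looks empty.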

What does "Error parsing XML: not well-formed" mean?

<?xml version="1.0" encoding="utf-8"?>
<LinearLayout xmlns:android="http://schemas.android.com/apk/res/android"
android:orientation=”vertical”
android:layout_width=”fill_parent”
android:layout_height=”fill_parent” >
I get these two errors
error: Error parsing XML: not well-formed (invalid token)
and
Open quote is expected for attribute "android:orientation" associated with an element type "LinearLayout".
Did you copy and paste that from Word? Your quotes look a little funky. Sometimes Word will use a different character than the expected " for double quotes. Make sure those are all consistent. Otherwise, the syntax is invalid.
Looks like you have "smart quotes" (not simple " double quotes) around some attributes in your LinearLayout element.
There are many references that explain the differences between valid and well formed XML documents. A good starting point can be found here. There is also an online XML Validator that you can use to test XML documents.
The validator shows that you have two issues:
Some of your attribute values use an invalid quote character: ” vs. ", and
you need to close the LinearLayout tag with /> instead of just > (or add a matching </LinearLayout> closing tag if the layout has children).
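Both issues can be reproduced with any XML parser. Here is a minimal Python sketch using the standard library; the layout is the one from the question with a closing tag added, and the blanket replacement of typographic quotes is an assumption that is only safe when the file has no curly quotes in its text content.

```python
import xml.etree.ElementTree as ET

ANDROID_NS = "http://schemas.android.com/apk/res/android"

# The question's layout, with the curly quotes written as escapes
# (U+201D) and a closing tag added so that only the quotes are wrong.
BROKEN = (
    '<?xml version="1.0" encoding="utf-8"?>\n'
    '<LinearLayout xmlns:android="' + ANDROID_NS + '"\n'
    '    android:orientation=\u201dvertical\u201d\n'
    '    android:layout_width=\u201dfill_parent\u201d\n'
    '    android:layout_height=\u201dfill_parent\u201d >\n'
    '</LinearLayout>\n'
)

# Map typographic quotes to their plain ASCII counterparts.
SMART_QUOTES = str.maketrans({
    "\u201c": '"', "\u201d": '"',
    "\u2018": "'", "\u2019": "'",
})


def is_well_formed(text):
    """Return True if the text parses as XML, False otherwise."""
    try:
        ET.fromstring(text.encode("utf-8"))
        return True
    except ET.ParseError:
        return False


fixed = BROKEN.translate(SMART_QUOTES)
root = ET.fromstring(fixed.encode("utf-8"))

print(is_well_formed(BROKEN))
print(is_well_formed(fixed))
print(root.get("{" + ANDROID_NS + "}orientation"))
```

Dropping the closing tag from the fixed text makes it fail again, which corresponds to the second issue the validator reports.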

Resources