Retrieve XML file using NSXMLParser in iOS

I am having a problem reading XML files with NSXMLParser on iOS:
<PRODUCTS>
<PRODUCTSLIST>
<PRODUCTDETAILS>
<headertext> test header </headertext>
<description><b style="font-size: x-small;">product, advantages</b></description>
</PRODUCTDETAILS>
</PRODUCTSLIST>
</PRODUCTS>
When I read the file with NSXMLParser, I get the value (test header) for headertext, but the value of the description element contains HTML tags, so instead of (<b style="font-size: x-small;">product, advantages</b>) I get an empty result.
How can I get (<b style="font-size: x-small;">product, advantages</b>) as the result for the description element?

Speaking from a developer's perspective, I would not recommend NSXMLParser because parsing XML files with it is laborious. There is a great write-up about choosing the right XML parser.
I use KissXML quite often.
You can find a quick tutorial on using it here.
Hope this helps.

Your problem is that the "b" tag is treated as part of the XML structure. Try escaping the '<' and '>' characters of the 'b' tag:
@"&lt;b style=\"font-size: x-small;\"&gt;product, advantages&lt;/b&gt;"
see here
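The effect is not specific to NSXMLParser. The following is a minimal sketch in Python (using the standard library's xml.etree.ElementTree rather than the iOS API) of why the unescaped markup comes back empty and how escaping restores it. The CDATA variant is an additional assumption, not something this answer proposes.

```python
import xml.etree.ElementTree as ET

# 1. Markup left as-is: <b> becomes a child element, so <description> has no text.
raw = '<description><b style="font-size: x-small;">product, advantages</b></description>'

# 2. Angle brackets escaped: the markup is ordinary character data.
escaped = ('<description>&lt;b style="font-size: x-small;"&gt;'
           'product, advantages&lt;/b&gt;</description>')

# 3. Wrapped in a CDATA section (assumed alternative): also plain character data.
cdata = ('<description><![CDATA[<b style="font-size: x-small;">'
         'product, advantages</b>]]></description>')


def text_of(xml_string):
    """Return the character data directly inside the root element."""
    return ET.fromstring(xml_string).text


raw_text = text_of(raw)
escaped_text = text_of(escaped)
cdata_text = text_of(cdata)

print(repr(raw_text))
print(escaped_text)
print(cdata_text)
```

In an event-based parser such as NSXMLParser, the first case shows up as a nested start-element event for "b" instead of character data for description, which is why the collected string is empty.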

Related

SEC company filings: Is the <SEC-HEADER> tag valid SGML? If so, how to parse it?

I am trying to parse SEC company filings from sec.gov. Starting from the fb 10-Q index.htm, let's look at a complete submission text filing. It has a structure like:
<SEC-DOCUMENT>
<SEC-HEADER>
<ACCEPTANCE-DATETIME>"some content" This tag is not closed.
"some lines resembling yaml markup"
These are indented lines with a
"key": "value" structure.
</SEC-HEADER>
<DOCUMENT>
.
.
some content
.
.
</DOCUMENT>
"several DOCUMENT tags" ...
</SEC-DOCUMENT>
I tried to figure out the structure of the <SEC-HEADER> tag, found some information in the Public Dissemination Service (PDS) Technical Specification (pdf), and concluded that the content of the header should be SGML.
Nevertheless, I am clueless about the formatting, since there are no angle brackets and the key-value pairs are separated by colons, like key: value instead of <key>value</key>. I could not find anything about colons in the PDF.
Question: Is the <SEC-HEADER> tag valid SGML? If it is, how to parse it?
I'd be glad at any help.
The short answer is no. The <SEC-HEADER> section in the raw filing is not valid SGML.
However, it is my understanding that this section in the raw filing is parsed automatically from the header file <accession_num>.hdr.sgml, which does follow SGML. This header file can be found in the same directory as the raw filing (i.e., the <accession_num>.txt file).
I use a regex of the form ^<(.+?)>(.+?)$ (with the re.MULTILINE option) to capture each (tag, value) tuple and get the results directly in a dict().
I believe the only tag in that file that has a closing tag is <FILER>, since there can be multiple filers in each filing. You can first extract those using a regex of the form <FILER>(.+?)</FILER> (with re.DOTALL, so that . also matches the newlines inside the block) and then apply the same regex as above to get the inner tags for each filer.
Note that other than 'FILER', there could be other tags, representing different relations of the entities to the filing. Those are 'ISSUER', 'SUBJECT COMPANY', 'FILED BY', 'FILED FOR', 'SERIAL COMPANY', 'REPORTING OWNER'.
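As a rough sketch of the two-step regex approach described above: the sample text below is made up for illustration (it is not copied from a real .hdr.sgml file), and parse_header is a hypothetical helper name. The re.DOTALL flag on the FILER pattern is needed so the block can span several lines.

```python
import re

# Made-up sample in the style described above: one value per tag, no closing
# tags except </FILER>. Real header files contain many more fields.
SAMPLE = """<ACCESSION-NUMBER>0000000000-00-000000
<TYPE>10-Q
<FILER>
<CONFORMED-NAME>EXAMPLE CO A
<CIK>0000000001
</FILER>
<FILER>
<CONFORMED-NAME>EXAMPLE CO B
<CIK>0000000002
</FILER>
"""

# One (tag, value) pair per line; lines with nothing after the tag do not match.
TAG_VALUE = re.compile(r"^<(.+?)>(.+?)$", re.MULTILINE)
# A whole <FILER>...</FILER> block; DOTALL lets "." cross line breaks.
FILER_BLOCK = re.compile(r"<FILER>(.+?)</FILER>", re.DOTALL)


def parse_header(text):
    """Return (top-level fields, list of per-filer fields) as dicts."""
    filers = [dict(TAG_VALUE.findall(block)) for block in FILER_BLOCK.findall(text)]
    # Remove the filer blocks so their inner tags do not overwrite each other.
    top_level = dict(TAG_VALUE.findall(FILER_BLOCK.sub("", text)))
    return top_level, filers


top_level, filers = parse_header(SAMPLE)
print(top_level)
print(filers)
```

The same pattern could be repeated for the other relation tags listed above (for example 'ISSUER' or 'REPORTING OWNER') by swapping the tag name in the block regex.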

Encoding in POJO to/from XML conversion within Camel

We have been very successful in carrying out POJO to/from XML conversion within Camel. The following code shows a typical case of how we use Camel. Our application listens to an Oracle AQ. The queue entry is an XML string. The XML is converted to a POJO class (MyClass), and we then transform the MyClass object with data from another source. After this transformation, the POJO is converted back to a string and sent to another system (here we save it to a file).
<route id="testing">
<from uri="oracleaq:queue:FUSEQUEUE"/>
<convertBodyTo type="generated.MyClass"/>
<bean ref="mainReqprocessor" method="Modify"/>
<convertBodyTo type="java.lang.String"/>
<setHeader headerName="Exchange.FILE_NAME">
<simple>output.xml</simple>
</setHeader>
<to uri="file:C:\\Temp\\OUT"/>
</route>
Everything worked fine until yesterday, when we introduced HTML tags into one of the text fields of the POJO class. We wrapped the text in CDATA: "<![CDATA[" + str + "]]>". But when the POJO is converted to a string, the opening and closing brackets of the CDATA section are still encoded, as in the following. Because of this, the resulting XML string is no longer valid XML and therefore cannot be converted back to MyClass by other applications. This is not the desired behavior. How can I avoid the encoding of the CDATA opening and closing brackets? [Note: the first < and the last > of the CDATA section are encoded.]
<TEXT>
&lt;![CDATA[&lt;html&gt;&lt;div&gt;&lt;pre&gt;COMPONENT PARTS.&lt;/br&gt;&lt;/div&gt;&lt;/pre&gt;&lt;/html&gt;]]&gt;
</TEXT>
Although you have a marshalling/unmarshalling problem, you don't mention how you convert the XML to a POJO and back. That information is important for helping you.
If you are using JAXB for the conversion, this Q/A could perhaps help you:
JAXB Marshalling Unmarshalling with CDATA

Google Translate messes up the markup of my file

I am trying to use Google Translate to localize an XML file. It has nearly 350K lines, and some of them contain markup for in-game font size and color, like so:
<replacement><p horizontalalignment="center"><br/><image enablescale="false" imagesetpath="00015590.InterD_Jeryoung_3"/><br/><image enablescale="true" imagesetpath="00015590.Tag_Dungeon_Six_Superior" scalerate="1.5"/><image enablescale="true" imagesetpath="00015590.Tag_Dungeon_Four_Superior" scalerate="1.5"/><br/><image enablescale="true" imagesetpath="00009499.Field_Boss" scalerate="1.4"/>Хмельной лик<br/><br/></p>Уничтожить зараженных насекомых<br/>возле мест обитания их королевы。<br/></replacement>
For some reason, Google Translate alters that markup during translation into something unusable, like so:
<replacement> <p horizontalalignment="center"> <br/> <image enablescale="false" imagesetpath="00015590.InterD_Jeryoung_3"/> <br/> <image enablescale = "true "imagesetpath =" 00015590.Tag_Dungeon_Six_Superior "scalerate =" 1.5 "/> <image enablescale="true" imagesetpath="00015590.Tag_Dungeon_Four_Superior" scalerate="1.5"/> <br/> <image enablescale = "true" imagesetpath = "00009499.Field_Boss" scalerate = "1.4" /> Intoxicated face <br/> <br/> </ p> Destroy infected insects <br/> habitats near their queen. <br/> </ replacement>
Is there any way to avoid that, and why exactly is it happening? Any help is appreciated, thanks.
EDIT: I am also looking for a way to feed in my text and get it back in the same language, with only the markup mishaps changed, so that I can isolate those, build a comparison table, and use it to fix the errors after the actual translation is done. But I don't see a way to select the same language as both input and output in Google Translate; it always forces me to choose a different one for input or output. That kind of makes sense, but if there is a way to do it, I might be able to work around the problem.
Do not feed your XML file to Google Translate; as far as I know, it doesn't understand XML.
Extract the text from the Xml file.
Feed the text to translate.
Transform the text back to Xml.
You could simply transform the Xml to a text document with a single line per Xml element so it would be easier to turn it back into Xml.
More detail
According to the Toolkit you can upload:
HTML (.HTML)
Microsoft Word (.DOC/.DOCX)
OpenDocument Text (.ODT)
Plain Text (.TXT)
Rich Text (.RTF)
Wikipedia URLs
And a couple of extras such as JSON. So no Xml.
The best way I see is to transform your XML document into one of these types (I would probably use JSON) in such a way that it can easily be transformed back again, using either position (line 1 in the text file is the first element in the XML document) or an id (add the id or position of the element in the XML hierarchy to the JSON element).
My guess is that the Toolkit recognizes the HTML tags in the XML and escapes them. So another option might be to un-escape &gt; to > and &lt; to <.
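The extract, translate, re-insert round trip described above could look roughly like this. It is a sketch under assumptions: it uses Python's xml.etree.ElementTree, identifies each string by its position, and replaces the actual translation step with a hard-coded lookup table (FAKE_TRANSLATIONS is a stand-in, not a Google Translate API). The helper names are hypothetical.

```python
import xml.etree.ElementTree as ET


def walk(elem):
    """Yield (element, attribute) slots for text and tail in reading order."""
    yield elem, "text"
    for child in elem:
        yield from walk(child)
        yield child, "tail"


def extract_lines(root):
    """Return the slots that hold visible text, plus one stripped line per slot."""
    slots, lines = [], []
    for elem, attr in walk(root):
        value = getattr(elem, attr)
        if value and value.strip():
            slots.append((elem, attr))
            lines.append(value.strip())  # note: surrounding whitespace is dropped
    return slots, lines


def apply_lines(slots, translated):
    """Write translated lines back into the same positions."""
    if len(slots) != len(translated):
        raise ValueError("line count changed during translation")
    for (elem, attr), line in zip(slots, translated):
        setattr(elem, attr, line)


# Stand-in for the translation step (assumption); a real run would send the
# extracted lines as plain text and read the translated lines back.
FAKE_TRANSLATIONS = {
    "Хмельной лик": "Intoxicated face",
    "Уничтожить зараженных насекомых": "Destroy infected insects",
}

SOURCE = ('<replacement><p horizontalalignment="center"><br/>'
          '<image enablescale="false" imagesetpath="00015590.InterD_Jeryoung_3"/>'
          'Хмельной лик<br/></p>Уничтожить зараженных насекомых<br/></replacement>')

root = ET.fromstring(SOURCE)
slots, lines = extract_lines(root)
apply_lines(slots, [FAKE_TRANSLATIONS[line] for line in lines])
result = ET.tostring(root, encoding="unicode")
print(result)
```

Because only the text between tags ever leaves the document, the tags and attribute values cannot be altered by the translator. For a 350K-line file the lines would likely need to be sent in batches.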

html2pdf and local (latvian) language characters

I am using Html2PDf to convert html to pdf.
But I cannot get it to show local (Latvian) letters. It shows ? instead.
I understand that I should somehow add appropriate fonts, but I do not know where to get fonts that support Latvian or how to add them to Html2Pdf.
Html2Pdf is based on TCPDF, and it currently includes a fonts folder.
This seems like a trivial question, but I searched Google and have not found an answer that works for me.
require_once('inc/html2pdf/html2pdf.class.php');
$html2pdf = new HTML2PDF('P','A4','en');
//$html2pdf->pdf->setDefaultFont('times');
// HEADER
$pdf_output .='<page style="font-size: 11px;">';
$pdf_output .= '<img src="images/raka_pdf_logo.png" alt="logo"/><br><br><br><br>';
...
You can find the available font families in the html2pdf > tcpdf > fonts folder.

Simple NSData category to parse XML with Cyrillic

I have to parse NSData containing an XML string. Does somebody know a simple category to do it? I have one for JSON, but I am forced to use XML. I tried XMLReader; its interface looks clean, but I found some issues:
Mysterious new line characters and spaces everywhere:
"comment_count" = {text = "\n \n 21";};
My Cyrillic characters look like this:
"description_text" = {text = "\n \U041f\U0438\U043a\U0430\U0431\U0443\U0448};
Example:
<?xml version="1.0" encoding="UTF-8" ?>
<news>
<xml_count>43</xml_count>
<hot_count>449</hot_count>
<item type="text">
<id>1469845</id>
<rating>147</rating>
<pluses>171</pluses>
<minuses>24</minuses>
<title>
<![CDATA[Обновление огромного архива Пикабу!]]>
</title>
<comment_count>26</comment_count>
<comment_link>http://pikabu.ru/story/obnovlenie_ogromnogo_arkhiva_pikabu_1469845</comment_link>
<author>icq677555</author>
<description_text>
<![CDATA[Пикабушники, я обновил свой огромный архив текстовых постов из горячего!]]>
</description_text>
</item>
</news>
I just realized what's going on. Your data samples are obviously NSDictionary instances printed in the debugger. So the issues you found are:
As XML was originally designed as an annotated text format, its whitespace handling (spaces, newlines) doesn't fit data-only use perfectly. You can either trim all resulting strings ([stringVar stringByTrimmingCharactersInSet:[NSCharacterSet whitespaceAndNewlineCharacterSet]]), adapt XMLReader to do it, or use the XML parser at http://ios.biomsoft.com/2011/09/11/simple-xml-to-nsdictionary-converter/ (which does it by default).
The funny output you get for Cyrillic characters is the proper escaping for non-ASCII characters in the debugger output (which uses the old-style property list format). It's an artifact of the debugger output. Your variables contain the proper characters.
BTW: While JSON contains implicit type information (strings are always quoted, numbers are never quoted etc.), XML without a schema file does not. So all the parsed simple values will be strings even if they originally were numbers.
Update:
The XML parser you're using still contains the old whitespace handling code described in Pesky new lines and whitespace in XML reader class (though the comment tells otherwise). Apply the fix mentioned at the bottom of the answer, namely change the line:
[dictInProgress setObject:textInProgress forKey:kXMLReaderTextNodeKey];
to:
[dictInProgress setObject:[textInProgress stringByTrimmingCharactersInSet:[NSCharacterSet whitespaceAndNewlineCharacterSet]] forKey:kXMLReaderTextNodeKey];
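The two points above (whitespace kept in text nodes, and every value arriving as a string) are not specific to XMLReader. A small sketch in Python with xml.etree.ElementTree, using a made-up fragment modeled on the example feed, shows the same behavior:

```python
import xml.etree.ElementTree as ET

# Made-up fragment: the value sits on its own indented line, as pretty-printed
# XML often does.
DOC = "<item>\n  <comment_count>\n    26\n  </comment_count>\n</item>"

node = ET.fromstring(DOC).find("comment_count")

raw_value = node.text        # newlines and indentation are part of the text node
trimmed = raw_value.strip()  # same idea as stringByTrimmingCharactersInSet:
count = int(trimmed)         # XML has no type information; convert explicitly

print(repr(raw_value))
print(repr(trimmed))
print(count)
```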
