I'm trying to detect ignorable whitespace in an XML document, but my delegate's parser:foundIgnorableWhitespace: method is never invoked. I've also tried contrived examples that include ignorable whitespace, but it doesn't work for those either.
This is a test document I've tried:
<?xml version="1.0" encoding="UTF-8" ?><!DOCTYPE library [ <!ELEMENT library (book+)> <!ELEMENT book (text)> <!ELEMENT text (#PCDATA)>]><library><book><text>lorem ipsum</text> </book></library>
The whitespace after the text element should be ignorable (in my understanding of what that means) because a book is not allowed to contain any #PCDATA, but it is being passed to foundCharacters rather than foundIgnorableWhitespace. Any ideas why?
Related
I have an XML document whose prolog looks like this:
<?xml version="1.0" encoding="utf-8" standalone="no"?>
...
This XML document is valid against an external DTD that starts with the exact same prolog:
<?xml version="1.0" encoding="utf-8" standalone="no"?>
<!DOCTYPE root [
...
]>
When I transform using Saxon (latest release):
$:/opt/tomcat/webapps/ROOT/$ java net.sf.saxon.Transform -s:pandora.xml -xsl:pandora.xsl -o:pandora.html
Error on line 1 column 53 of pandora.dtd:
SXXP0003 Error reported by XML parser: No more pseudo attributes are allowed.: No more
pseudo attributes are allowed.
org.xml.sax.SAXParseException; systemId: file:/opt/tomcat/webapps/ROOT/fred/pandora/dtd/pandora.dtd; lineNumber: 1; columnNumber: 53; No more pseudo attributes are allowed.
I am a newbie, and my research so far has only led me to list the pseudo-attributes in the order shown above. Does anybody have a clue?
Edit
I have run other transformations with the same process in other projects without any problem. The only difference is that in this problematic application I use an additional namespace, exsl, to get a function that XSLT 1.0 does not provide (node-set). Everything else is similar.
For an external subset of the DTD the specification defines the format in https://www.w3.org/TR/xml/#NT-extSubset as
extSubset ::= TextDecl? extSubsetDecl
extSubsetDecl ::= ( markupdecl | conditionalSect | DeclSep)*
and the "Text Declaration" in https://www.w3.org/TR/xml/#sec-TextDecl as TextDecl ::= '<?xml' VersionInfo? EncodingDecl S? '?>' so a standalone pseudo-attribute is indeed not allowed there.
So remove standalone="no" from the first line of your external DTD file, and make sure the file does not repeat <!DOCTYPE root; it is only meant to contain markup declarations, e.g. for elements and attributes.
The error message you get comes anyway just from the XML parser and is not transformation/XSLT related.
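To make the TextDecl rule concrete, here is a minimal Python sketch that checks the first line of a DTD file for pseudo-attributes the grammar does not allow. The function name text_decl_problems and the two sample DTD strings are made up for illustration; it is a rough regex check, not a replacement for the parser's own diagnostics.

```python
import re

# Matches an XML or text declaration at the very start of a file.
_DECL = re.compile(r'^<\?xml\s+(.*?)\?>', re.DOTALL)
# Matches one pseudo-attribute such as version="1.0".
_PSEUDO_ATTR = re.compile(r'(\w+)\s*=\s*(["\'])(.*?)\2')

def text_decl_problems(dtd_source):
    """Return reasons why the leading declaration is not a valid TextDecl."""
    match = _DECL.match(dtd_source)
    if match is None:
        return []  # the text declaration is optional in an external subset
    names = [name for name, _quote, _value in _PSEUDO_ATTR.findall(match.group(1))]
    problems = []
    if "encoding" not in names:
        problems.append("encoding is required in a text declaration")
    for name in names:
        if name not in ("version", "encoding"):
            problems.append("pseudo-attribute not allowed: " + name)
    return problems

# Hypothetical DTD contents, not the asker's actual pandora.dtd.
bad = '<?xml version="1.0" encoding="utf-8" standalone="no"?>\n<!ELEMENT root (#PCDATA)>'
good = '<?xml version="1.0" encoding="utf-8"?>\n<!ELEMENT root (#PCDATA)>'
print(text_decl_problems(bad))
print(text_decl_problems(good))
```

The first call reports the standalone pseudo-attribute, which is what the parser complains about at column 53; the second returns an empty list.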
I have to parse NSData containing an XML string. Does somebody know a simple category to do it? I have one for JSON, but I am forced to use XML. I tried XMLReader; its interface looks clean, but I found some issues:
Mysterious new line characters and spaces everywhere:
"comment_count" = {text = "\n \n 21";};
My Cyrillic characters look like this:
"description_text" = {text = "\n \U041f\U0438\U043a\U0430\U0431\U0443\U0448};
Example:
<?xml version="1.0" encoding="UTF-8" ?>
<news>
<xml_count>43</xml_count>
<hot_count>449</hot_count>
<item type="text">
<id>1469845</id>
<rating>147</rating>
<pluses>171</pluses>
<minuses>24</minuses>
<title>
<![CDATA[Обновление огромного архива Пикабу!]]>
</title>
<comment_count>26</comment_count>
<comment_link>http://pikabu.ru/story/obnovlenie_ogromnogo_arkhiva_pikabu_1469845</comment_link>
<author>icq677555</author>
<description_text>
<![CDATA[Пикабушники, я обновил свой огромный архив текстовых постов из горячего!]]>
</description_text>
</item>
</news>
I just realized what's going on. Your data samples are obviously NSDictionary instances printed in the debugger. So the issues you found are:
As XML was originally designed as an annotated text format, the whitespace (spaces, newlines) handling doesn't perfectly fit for data only usage. You can either trim all resulting strings ([stringVar stringByTrimmingCharactersInSet:[NSCharacterSet whitespaceAndNewlineCharacterSet]]), adapt XMLReader to do it or use the XML parser at http://ios.biomsoft.com/2011/09/11/simple-xml-to-nsdictionary-converter/ (which does it by default).
The funny output you get for Cyrillic characters is the proper escaping for non-ASCII characters in the debugger output (which uses the old-style property list format). It's an artifact of the debugger output. Your variables contain the proper characters.
BTW: While JSON contains implicit type information (strings are always quoted, numbers are never quoted etc.), XML without a schema file does not. So all the parsed simple values will be strings even if they originally were numbers.
Update:
The XML parser you're using still contains the old whitespace handling code described in Pesky new lines and whitespace in XML reader class (though the comment tells otherwise). Apply the fix mentioned at the bottom of the answer, namely change the line:
[dictInProgress setObject:textInProgress forKey:kXMLReaderTextNodeKey];
to:
[dictInProgress setObject:[textInProgress stringByTrimmingCharactersInSet:[NSCharacterSet whitespaceAndNewlineCharacterSet]] forKey:kXMLReaderTextNodeKey];
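The same two points (trim the text of every leaf node, and expect every value to come back as a string) can be illustrated outside Objective-C. Below is a rough Python sketch using the standard library's ElementTree rather than XMLReader; element_to_dict is a hypothetical helper, the sample is a shortened version of the feed above, and repeated sibling tags would simply overwrite each other in this simplified version.

```python
import xml.etree.ElementTree as ET

def element_to_dict(element):
    """Convert an element to nested dicts, trimming whitespace around leaf text."""
    children = list(element)
    if not children:
        # Leaf node: drop the newlines and indentation around the value.
        return (element.text or "").strip()
    result = {}
    for child in children:
        # Simplification: a repeated tag overwrites the previous entry.
        result[child.tag] = element_to_dict(child)
    return result

sample = """<news>
  <hot_count>449</hot_count>
  <item type="text">
    <title>
      <![CDATA[Обновление огромного архива Пикабу!]]>
    </title>
    <comment_count>
      26
    </comment_count>
  </item>
</news>"""

parsed = element_to_dict(ET.fromstring(sample))
print(parsed["item"]["title"])
print(repr(parsed["item"]["comment_count"]))
# Values are strings; convert explicitly where a number is needed.
print(int(parsed["hot_count"]) + 1)
```

Without the strip() call the title would carry the surrounding newlines and indentation, which is the effect seen in the dictionary dump in the question.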
I need to validate XML using a DTD stored in memory, i.e. something like the following:
static const char *dtd_str = "<!ELEMENT ...>";
xmlDtdPtr dtd;
dtd = xmlParseMemoryDtd(dtd_str);
The XML_PARSE_DTDVALID parser option makes it possible to validate against a DTD embedded in the XML:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE some_tag[
<!ELEMENT some_tag ...>
...
]>
<some_tag>...</some_tag>
So a workaround is to modify the in-memory XML. Things become more complicated
when the parser is used in "push mode". In push mode we have to detect the end
of the XML declaration (<?xml ...?>) or the start of the root element, and then
put our inline DTD between them.
Could you suggest a better solution?
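For reference, the splicing workaround described above can be sketched in a few lines of Python, assuming the whole document is available as one string and has no DOCTYPE of its own. The function name embed_internal_subset and the sample document are made up for illustration; a push-mode version would additionally have to buffer input until the end of the declaration has been seen.

```python
import re

# Optional XML declaration at the very start of the document.
_XML_DECL = re.compile(r'<\?xml\s.*?\?>', re.DOTALL)

def embed_internal_subset(xml_text, root_name, dtd_text):
    """Insert <!DOCTYPE root_name [ dtd_text ]> right after the XML declaration.

    Assumes xml_text does not already contain a DOCTYPE declaration.
    """
    doctype = "<!DOCTYPE %s [\n%s\n]>\n" % (root_name, dtd_text)
    match = _XML_DECL.match(xml_text)
    if match is None:
        # No declaration: the DOCTYPE goes first.
        return doctype + xml_text
    head = xml_text[:match.end()]
    tail = xml_text[match.end():].lstrip()
    return head + "\n" + doctype + tail

dtd = "<!ELEMENT response (#PCDATA)>"
doc = '<?xml version="1.0" encoding="UTF-8"?><response>ok</response>'
print(embed_internal_subset(doc, "response", dtd))
```

The result can then be handed to a validating parser; the sketch itself only rearranges text and performs no validation.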
EDIT
A workaround is to validate the parsed XML a posteriori, as Daniel(_DV) suggested below.
Example: main.c, response.xml.
But I was searching for a way to "embed" a DTD and validate the XML "on the fly" while libxml2 parses it chunk by chunk.
The following approach doesn't work for me:
xmlCtxtUseOptions(ctxt, XML_PARSE_NOENT | XML_PARSE_NOWARNING | XML_PARSE_DTDVALID);
ctxt->sax->internalSubset = ngx_http_file_chunks_sax_internal_subset;
ctxt->sax->externalSubset = NULL;
$ ./parsexml
validity error : Validation failed: no DTD found !
<response>
^
Document is not valid
xmlValidateDtd lets you do DTD validation a posteriori on an already parsed XML document
to make sure it validates against the DTD. This will not use the internal subset...
http://xmlsoft.org/html/libxml-valid.html#xmlValidateDtd
See xmllint.c code in libxml2 for a full example of how to use it,
Daniel
I just installed Biopython and wanted to try out its features and so I started to go through the tutorial.
However, when I reached the chapter about obtaining information from Entrez, I encountered a problem.
The example in the tutorial is simple:
from Bio import Entrez
Entrez.email = "A.N.Other@example.com"
handle = Entrez.einfo(db="pubmed")
record = Entrez.read(handle)
This works fine. But as soon as I query a database other than pubmed, I get the following error:
Bio.Entrez.Parser.ValidationError: Failed to find tag 'Build' in the DTD. To skip all tags that are not represented in the DTD, please call Bio.Entrez.read or Bio.Entrez.parse with validate=False.
Trying the validate=False option doesn't work either, because it raises a Bio.Entrez.Parser.NotXMLError.
Can someone tell me what I did wrong and how I can solve this issue?
In order to get round this problem simply alter your call to Entrez.read() to include a validate parameter, like so:
record = Entrez.read(handle,validate=False)
The other answer to this question is right. It's a fault in the Biopython parser. Hopefully they'll update it soon.
THIS IS NOT REALLY A VALID SOLUTION, BUT SHOWS WHAT THE PROBLEM IS. I think it's probably a biopython (Entrez.Parse) bug, so I'll get in contact with them and see what they think.
So a bit of hacking at Biopython shows the problem is because of a 'build' tag name.
If we do this manually, the first few lines of the pubmed XML request look like this
<eInfoResult>
<DbInfo>
<DbName>pubmed</DbName>
<MenuName>PubMed</MenuName>
<Description>PubMed bibliographic record</Description>
<Count>22224084</Count>
<LastUpdate>2012/10/30 03:30</LastUpdate>
....
But the protein request looks like this:
<eInfoResult>
<DbInfo>
<DbName>protein</DbName>
<MenuName>Protein</MenuName>
<Description>Protein sequence record</Description>
<Build>Build121030-0741m.1</Build> <-------- THIS IS BAD
<Count>59244879</Count>
<LastUpdate>2012/10/30 18:39</LastUpdate>
I had a look at how Entrez.Parser works, and it basically doesn't recognize the Build tag. Further digging shows that the tags are defined in DTD files, including the einfo DTD file, which on my system is in this directory:
/usr/local/lib/python2.7/dist-packages/Bio/Entrez/DTDs
If we open the relevant file, eInfo_020511.dtd, and add a Build element declaration (the line marked with the arrow below wasn't there before):
<!--
This is the Current DTD for Entrez eInfo
$Id: eInfo_020511.dtd,v 1.1 2008-05-13 11:17:44 mdehoon Exp $
-->
<!-- ================================================================= -->
<!ELEMENT DbName (#PCDATA)> <!-- \S+ -->
<!ELEMENT Name (#PCDATA)> <!-- .+ -->
<!ELEMENT FullName (#PCDATA)> <!-- .+ -->
<!ELEMENT Description (#PCDATA)> <!-- .+ -->
<!ELEMENT Build (#PCDATA)> <!-- .+ --> <------- I ADDED THIS LINE
<!ELEMENT TermCount (#PCDATA)> <!-- \d+ -->
<!ELEMENT Menu (#PCDATA)> <!-- .+ -->
It now works. The comments in this file suggest it hasn't been updated since 2008 (the line below comes from the DTD header).
$Id: eInfo_020511.dtd,v 1.1 2008-05-13 11:17:44 mdehoon Exp $
My guess is that the build tag has been added since then but this file was never updated to reflect that.
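A quick way to confirm this kind of mismatch without patching Biopython is to compare the tags used in a response with the elements the DTD declares. The following is only a sketch using the Python standard library; the helper names are made up, and the DTD and response strings are shortened stand-ins rather than the real eInfo files.

```python
import re
import xml.etree.ElementTree as ET

def declared_elements(dtd_text):
    """Collect element names declared with <!ELEMENT ...> in a DTD."""
    return set(re.findall(r'<!ELEMENT\s+([^\s>]+)', dtd_text))

def undeclared_tags(xml_text, dtd_text):
    """Return the tags used in the XML that the DTD does not declare, sorted."""
    declared = declared_elements(dtd_text)
    used = {element.tag for element in ET.fromstring(xml_text).iter()}
    return sorted(used - declared)

# Abbreviated, hypothetical DTD: it has no Build declaration.
dtd = """
<!ELEMENT eInfoResult (DbInfo)>
<!ELEMENT DbInfo (DbName, Description, Count)>
<!ELEMENT DbName (#PCDATA)>
<!ELEMENT Description (#PCDATA)>
<!ELEMENT Count (#PCDATA)>
"""
response = """<eInfoResult><DbInfo>
<DbName>protein</DbName>
<Description>Protein sequence record</Description>
<Build>Build121030-0741m.1</Build>
<Count>59244879</Count>
</DbInfo></eInfoResult>"""

print(undeclared_tags(response, dtd))
```

Any tag it reports is one the parser would reject with a "Failed to find tag" error; here that is Build.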
<?xml version="1.0" encoding="utf-8"?>
<LinearLayout xmlns:android="http://schemas.android.com/apk/res/android"
android:orientation=”vertical”
android:layout_width=”fill_parent”
android:layout_height=”fill_parent” >
I get these two errors
error: Error parsing XML: not well-formed (invalid token)
&
Open quote is expected for attribute "android:orientation" associated with an element type "LinearLayout".
Did you copy and paste that from Word? Your quotes look a little funky. Sometimes Word will use a different character than the expected " for double quotes. Make sure those are all consistent. Otherwise, the syntax is invalid.
Looks like you have "smart quotes" (not simple " double quotes) around some attributes in your LinearLayout element.
There are many references that explain the differences between valid and well formed XML documents. A good starting point can be found here. There is also an online XML Validator that you can use to test XML documents.
The validator shows that you have two issues:
Some of your attribute values use an invalid quote character: ” vs. ", and
you need to close the LinearLayout element: with /> instead of just > if it has no children, or with a matching </LinearLayout> end tag otherwise.
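If the file was pasted from a word processor and has many of these, the quotes can be replaced in bulk. Here is a small Python sketch of that idea; straighten_quotes is a made-up helper. Note that it also rewrites typographic quotes inside text content, so it is best applied to layout files rather than to prose.

```python
# Map typographic quotes to the ASCII quotes XML expects around attribute values.
_QUOTE_MAP = str.maketrans({
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
})

def straighten_quotes(text):
    """Replace smart quotes with plain ASCII quotes."""
    return text.translate(_QUOTE_MAP)

line = "android:orientation=\u201dvertical\u201d"
print(straighten_quotes(line))
```

After this, the attribute reads android:orientation="vertical" and the "Open quote is expected" error should go away.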