I am working on an iPhone project that needs to parse some xml. The xml may or may not include a default namespace. I need to know how to parse the xml in case it uses a default namespace. As I need to both read an write xml, I'm leaning towards using KissXML, but I'm open for suggestions.
This is my code:
NSString *content = [NSString stringWithContentsOfFile:[[NSBundle mainBundle]
pathForResource:#"bookstore" ofType:#"xml"] encoding:NSUTF8StringEncoding error:nil];
DDXMLDocument *theDocument = [[DDXMLDocument alloc] initWithXMLString:content options:0 error:nil];
NSArray *results = [theDocument nodesForXPath:#"//book" error:nil];
NSLog(#"%d", [results count]);
It works as expected on this xml:
<?xml version="1.0" encoding="UTF-8"?>
<bookstore>
<book category="COOKING">
<title lang="en">Everyday Italian</title>
</book>
<book category="CHILDREN">
<title lang="en">Harry Potter</title>
</book>
</bookstore>
But when the xml has a namespace, like this, it stops working:
<?xml version="1.0" encoding="UTF-8"?>
<bookstore xmlns="[INSERT RANDOM NAMESPACE]">
<book category="COOKING">
<title lang="en">Everyday Italian</title>
</book>
<book category="CHILDREN">
<title lang="en">Harry Potter</title>
</book>
</bookstore>
Of course, I could just preprocess the string and remove the xmlns, though that feels like a sort of ugly hack. What is the proper way to handle this?
The Clean Way: Querying for the Namespace
You can use two XPath queries, one to fetch the namespace, then register it; as second query use the one you already have including namespaces. I can only help you with the query, but it seems you're quite familiar with namespaces and how to register them in the KissXML framework:
namespace-uri(/*)
This expression fetches all child nodes starting at the document root, which is per XML definition a single root element, and returns it's namespace uri.
The Ugly Way: Only Testing for Local Name
It seems KissXML only supports XPath 1.0. With this less-capable language version, you need to use wildcard selectors at each axis step and compare the local name (without namespace prefix) inside the predicate:
//*[local-name(.) = 'book']
Starting from XPath 2.0, you could query using the namespace wildcard, which is much shorter:
//*:book
According to this comment KissXML implements "correct" behaviour while NSXML doesn't. Which doesn't exactly help. There is a proposed fix for this waiting to be merged. [edit] 11/2021 - still waiting to be merged!
Expanding on the accepted answer's first proposed solution the workaround I found was to rename the default namespace and then use that prefix in my XPath queries. Something like:
DDXMLNode *defaultNamespace = [document.rootElement namespaceForPrefix:#""];
defaultNamespace.name = #"default";
NSArray *xmlNodes = [[document rootElement] nodesForXPath:#"//default:foo/default:bar" error:nil];
This seems cleaner to me than textual processing of the file. You could of course check and handle namespace collisions but the above should work in most simple cases.
Related
I have an XML document which prolog looks like this :
<?xml version="1.0" encoding="utf-8" standalone="no"?>
...
This XML document is valid against the external DTD with the exact same prolog :
<?xml version="1.0" encoding="utf-8" standalone="no"?>
<!DOCTYPE root [
...
]>
When I transform using Saxon (latest release):
$:/opt/tomcat/webapps/ROOT/$ java net.sf.saxon.Transform -s:pandora.xml -xsl:pandora.xsl -o:pandora.html
Error on line 1 column 53 of pandora.dtd:
SXXP0003 Error reported by XML parser: No more pseudo attributes are allowed.: No more
pseudo attributes are allowed.
org.xml.sax.SAXParseException; systemId: file:/opt/tomcat/webapps/ROOT/fred/pandora/dtd/pandora.dtd; lineNumber: 1; columnNumber: 53; No more pseudo attributes are allowed.
I am newbie and my research about this has only led to listing the pseudo-attributes in the order they actually are. If anybody have a clue there.
Edit
I have made other transformations using the same process with other projects without any problem. The only difference is in this problematic application, I make use of another namespace exsl to use a function not provided with version 1.0 (node-set). Everything else is similar.
For an external subset of the DTD the specification defines the format in https://www.w3.org/TR/xml/#NT-extSubset as
extSubset ::= TextDecl? extSubsetDecl
extSubsetDecl ::= ( markupdecl | conditionalSect | DeclSep)*
, for the "Text Declaration" in https://www.w3.org/TR/xml/#sec-TextDecl as TextDecl ::= '<?xml' VersionInfo? EncodingDecl S? '?>' so a standalone "pseudo" attribute is indeed not allowed there.
So make sure that your external DTD file does not repeat <!DOCTYPE root, it is just meant to contain declaration of markup, e.g. elements, attributes.
The error message you get comes anyway just from the XML parser and is not transformation/XSLT related.
Good afternoon,
I am working with Java Saxon 9.8.0.4. I would like to use EXPath File Module function "file:list" with its third "pattern" parameter. But I am in doubt, which style of pattern is supported.
I read both Saxon documentation and EXPath documentation. But I do not know, which patterns are supported in Saxon 9.8.0.4. It would be great to support regular expression, but I understand it is overkill for most users. I tried several blind tests, but just * and ? wildchars works for me as defined in EXPath documentation.
Yes, I can quite easily do regexp postprocessing in for-each, but to know more about list function could help.
Thank You in advance for Your help, Stepan
P.S: My use-case is to get all files without extension ("test" and not "test.txt") recursively from large and deep directory structure and process all of matching files with XSL-T 3.0. Most of such files have identical fileName and thus I can not do "copy to one folder" pre-processing for Saxon's -s:directory -o:directory one time invocation and invocation of Java (Saxon) for each file is of cource terrible time overhead. So I would like to read all matching files into sequence and process each item of such sequence using for-each (files are text ones and I read them using unparsed-text). And no, GAWK is not solution, as I have all transformation infrastructure from XML to SQL already in XSL-T, because 95 % of files are XMLs.
--ADDED code and explanation below:
Example of my test files.
XML file "a.xml":
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="a.xsl"?>
<root/>
XSL-T file "a.xsl":
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:saxon="http://saxon.sf.net/"
xmlns:expathFile="http://expath.org/ns/file"
exclude-result-prefixes="xs saxon"
version="3.0">
<xsl:output method="text" />
<xsl:template match="/root">
<xsl:variable name="list" select="expathFile:list('C:\temp\temp\test\', false(), '^.*$')"/>
<xsl:for-each select="$list">
<xsl:value-of select="."/>
</xsl:for-each>
</xsl:template>
My folder "C:\temp\temp\test\" contains 6 test files: "a.txt", "b.txt", "c.txt", "e", "f", "g".
But after testing of online Java RegExp tester on "http://www.regexplanet.com/advanced/java/index.html" I have found, that the problem is solely on my side, because Java regular expression behaves little different than PCRE (Perl), sed, gawk regular expressions. So it is my fault and I need to learn Java regular expression.
Saxon uses the same code for this pattern as for the filter in select="pattern" in collection URIs, which is described at http://www.saxonica.com/documentation/index.html#!sourcedocs/collections
Extracting the relevant details:
The pattern used in the select parameter can use glob-like syntax, for
example *.xml selects all files with extension "xml". More generally,
the pattern is converted to a regular expression by prepending "^",
appending "$", replacing "." by "\.", "*" by ".*", and "?" by ".?",
and it is then used to match the file names appearing in the directory
using the Java regular expression rules. So, for example, you can
write ?select=*.(xml|xhtml) to match files with either of these two
file extensions. Note however, that special characters used in the URL
(that is, characters such as backslash and curly braces that are not
allowed in the query part of a URI) must be escaped using the %HH
convention. For example, vertical bar needs to be written as %7C. This
escaping can be achieved using the encode-for-uri() function.
Note that Saxon's collection() function now also supports match=pattern in the URI, where the pattern is a standard XPath 3.1 regular expression.
I have a simple XML file like this
<?xml version="1.0"?>
<catalog>
<timezone>
<name>UTC-11:00</name>
<place>
<value>American Samoa</value>
<value>Samoa</value>
</place>
</timezone>
<timezone>
<name>UTC-10:00</name>
<place>
<value>Honolulu</value>
<value>Tahiti</value>
</place>
</timezone>
</catalog>
What is the canonical/standard way to parse this in iOS? I want to save "names" and "places" for each name.
NSXMLParser is already built in. You should look at this article for a more thorough comparison.
http://www.raywenderlich.com/553/xml-tutorial-for-ios-how-to-choose-the-best-xml-parser-for-your-iphone-project
It is NSXMLParser. Official documentation
I need to validate XML using DTD stored in memory, i.e. something like the following:
static const char *dtd_str = "<!ELEMENT ...>";
xmlDtdPtr dtd;
dtd = xmlParseMemoryDtd(dtd_str);
XML_PARSE_DTDVALID parser option allows to validate DTD embedded into XML:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE some_tag[
<!ELEMENT some_tag ...>
...
]>
<some_tag>...</some_tag>
So a workaround is to modify in-memory XML. Things become more complicated with
a parser used in "push mode". In push mode we have to detect whether the XML
declaration (<?xml ...?>), or start of the root element, then put our inline
DTD between them.
Could you suggest better solution?
EDIT
A workaround is to validate parsed XML posteriori as Daniel(_DV) suggested below.
Example: main.c, response.xml.
But I was searching for way to "embed" a DTD and validate XML "on-the-fly" while libxml2 parses XML chunk-by-chunk.
The following aproach doesn't work for me:
xmlCtxtUseOptions(ctxt, XML_PARSE_NOENT | XML_PARSE_NOWARNING | XML_PARSE_DTDVALID);
ctxt->sax->internalSubset = ngx_http_file_chunks_sax_internal_subset;
ctxt->sax->externalSubset = NULL;
$ ./parsexml
validity error : Validation failed: no DTD found !
<response>
^
Document is not valid
xmlValidateDtd allows to do DTD validation a posteriori of an already parsed XML document
to make sure it validates against the DTD. This will not use the internal subset...
http://xmlsoft.org/html/libxml-valid.html#xmlValidateDtd
See xmllint.c code in libxml2 for a full example of how to use it,
Daniel
I'm trying to use the XDOM functions of UniVerse to parse an XML file, but I can't get it to correctly parse XML that uses a default namespace. It can correctly handle XML without namespaces, or with named namespaces, but if there is a default namespace, all xPaths will fail to locate the nodes they are supposed to match.
To give a simple example, I'm, trying to parse this XML:
<?xml version="1.0" encoding="ISO-8859-1"?>
<bookstore xmlns="http://www.example.com">
<book category="COOKING">
<title lang="en">Everyday Italian</title>
<author>Giada De Laurentiis</author>
<year>2005</year>
<price>30.00</price>
</book>
<book category="CHILDREN">
<title lang="en">Harry Potter</title>
<author>J K. Rowling</author>
<year>2005</year>
<price>29.99</price>
</book>
<book category="WEB">
<title lang="en">Learning XML</title>
<author>Erik T. Ray</author>
<year>2003</year>
<price>39.95</price>
</book>
</bookstore>
With this code:
PROGRAM XDOM.TEST
$INCLUDE SYSCOM XML.H
OPEN "XML" TO F.XML ELSE STOP "OPEN FAILED"
READ XML FROM F.XML, 'TEST.xml' ELSE STOP "READ FAILED"
EXIT.PROG = #FALSE
CONVERT #FM TO CHAR(10) IN XML
IF NOT(EXIT.PROG) AND XDOMOpen(XML, XML.FROM.STRING, XDOM) # XML.SUCCESS THEN GOSUB XML.ERR
IF NOT(EXIT.PROG) AND XDOMLocate(XDOM, '/bookstore/book[#category="CHILDREN"]', 'xmlns=http://www.example.com', XNODE) # XML.SUCCESS THEN GOSUB XML.ERR
IF NOT(EXIT.PROG) AND XDOMEvaluate(XNODE, './author', 'xmlns=http://www.example.com', AUTHOR) # XML.SUCCESS THEN GOSUB XML.ERR
IF NOT(EXIT.PROG) then PRINT AUTHOR
STOP
XML.ERR:
XML.CODE = ''
XML.ERR = ''
EXIT.PROG = #TRUE
IF XMLGetError(XML.CODE, XML.ERR) = XML.SUCCESS THEN
PRINT XML.CODE
PRINT XML.ERR
END
RETURN
END
When I run this code as is, I get the output:
10
The location path '/bookstore/book[#category="CHILDREN"]' was not found.
However, if I remove the "xmlns=http://www.example.com" namespace, it works fine.
After searching on my own, I found out that this is actually a bug UniVerse's XDOM parser itself. Someone has kindly documented a work-around here. You can "fool" the parser into working by giving a name to the default namespace. They also note that you can't use double-quotes in the namespace map either.
Personally, I prefer the simpler solution of just manually stripping out the offending namespace before parsing it. Adding this line to the above program fixes the issue, albeit in a very hacky way:
XML = CHANGE(XML, ' xmlns="http://www.example.com"', '')
This way you don't have to put unnecessary prefixes on all your xPath nodes. This may not always be an option though, depending on how you are getting the XDOM handle.