Xerces attempts to download Namespace URI - Why? Can I disable it?

I have the following problem: I have an XML document with a number of namespaces - here is the opening tag:
<?xml version="1.0" encoding="UTF-8"?>
<REQ-IF
xmlns="http://www.omg.org/spec/ReqIF/20110401/reqif.xsd"
xmlns:doors="http://www.ibm.com/rdm/doors/REQIF/xmlns/1.0"
xmlns:reqif="http://www.omg.org/spec/ReqIF/20110401/reqif.xsd"
xmlns:reqif-common="http://www.prostep.org/reqif"
xmlns:reqif-xhtml="http://www.w3.org/1999/xhtml"
xmlns:rm="http://www.ibm.com/rm"
xmlns:rm-reqif="http://www.ibm.com/rm/reqif"
xmlns:xhtml="http://www.w3.org/1999/xhtml"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
As you can see, there are a number of namespaces. I use Xerces as the parser. The problem is that the parser attempts to visit the URIs of the namespaces it does not know about. This is bad, because it slows the parsing down. For instance, "http://www.prostep.org/reqif" resolves to a web page. The content is parsed just fine (of course, as a namespace URI is just a name); it just takes a long time, because the parser hangs while retrieving the URI.
So, two questions:
Why would Xerces attempt to treat a namespace URI like a URI with "real" content?
How can I disable this?
For the record, the URI is neither the location of a schema nor of a DTD. I still tried to disable loading external DTDs, which, of course, didn't do anything:
parser.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
Any thoughts?
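Not an answer to the "why", but for completeness: here is a minimal sketch of how to stop a SAX-based parser from dereferencing anything external at all. It combines the feature above with its SAX siblings and an EntityResolver that short-circuits every remaining lookup with an empty document. The class and method names are made up for illustration, and this assumes parsing via SAXParserFactory:

import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

public class OfflineParserFactory {
    public static XMLReader newOfflineReader() throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        // These identifiers are features, so they go through setFeature(), not setProperty().
        factory.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
        factory.setFeature("http://xml.org/sax/features/external-general-entities", false);
        factory.setFeature("http://xml.org/sax/features/external-parameter-entities", false);
        XMLReader reader = factory.newSAXParser().getXMLReader();
        // Belt and braces: answer any remaining external lookup with an empty
        // document instead of letting the parser hit the network.
        reader.setEntityResolver((publicId, systemId) -> new InputSource(new StringReader("")));
        return reader;
    }
}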

Related

White spaces are required between publicId and systemId, but XML looks OK

I just pulled out a piece of code which I wrote a few months ago. The code fetches an XML document from a web server and parses it using JAXB. The last time I tried, it worked flawlessly; now I am getting an exception:
org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 50; White spaces are required between publicId and systemId.
at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:257)
at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:339)
at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:121)
Looking around, this suggests some issues with the XML header data, namely <!DOCTYPE ...>. The answer suggests that the statement is misleading: in the case described, systemId was missing altogether, despite the error just complaining about a missing whitespace in front of it.
However, if I get the XML document with a web browser, it doesn’t even contain the <!DOCTYPE ...> header.
Parsing an XML document I retrieved a few months back works without issues.
If I diff the document I retrieved today and the one from a few months back, both are exactly the same up to the start of the root element.
Capturing the HTTP traffic finally provided the answer (unencrypted connections come in handy at times): Apparently the service switched from HTTP to HTTPS in the last few months, with URLs remaining unchanged otherwise.
Requests to the old URL are answered with 301 Moved Permanently and the new URL.
When reading from a URL with java.net.URL.openStream(), a redirect that switches protocols, such as from HTTP to HTTPS, is not followed automatically (same-protocol redirects are). Thus, the data it returns is the body of the 301 response, which is not valid XML, leading to the error message.
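If you need to cope with such a protocol-changing redirect, one option is to follow it by hand before handing the stream to the parser. A sketch, with a hypothetical URL standing in for the real service:

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class RedirectAwareFetch {
    public static InputStream open(String address) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(address).openConnection();
        int status = conn.getResponseCode();
        if (status == HttpURLConnection.HTTP_MOVED_PERM
                || status == HttpURLConnection.HTTP_MOVED_TEMP) {
            // HttpURLConnection refuses to follow redirects that switch the
            // protocol (e.g. http -> https), so hop to the new location manually.
            String location = conn.getHeaderField("Location");
            conn = (HttpURLConnection) new URL(location).openConnection();
        }
        return conn.getInputStream(); // hand this to the parser instead of url.openStream()
    }
}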
Lesson learned for today: White spaces are required between publicId and systemId is really just a cryptic way of saying: Something’s wrong with the XML data you supplied, but we didn’t bother to dig any deeper.

Web page without real files corresponding to URLs?

Hello, geniuses!
I need to make a demo page acting like DBpedia (http://dbpedia.org).
Two pages from different URLs,
http://dbpedia.org/page/Barack_Obama and
http://dbpedia.org/page/Lionel_Messi,
show different content.
I cannot really believe DBpedia has millions of pages, one for each individual entity (e.g., Barack Obama and Lionel Messi).
How can I handle such URL requests?
I know a bit about GET requests, but the example URLs above do not seem to use the GET method.
Thank you in advance!
P.S. Please teach me the process. Something like:
1. A user enters URL on a browser.
2. ...
When visiting http://dbpedia.org/page/Barack_Obama, your browser does send a GET request, e.g.:
GET /page/Barack_Obama HTTP/1.1
Host: dbpedia.org
The server (dbpedia.org) receives this GET request and then decides what to do. From the outside, you can't know (for sure) how the server produces its response. The two common cases are:
Static web page: a file gets served that exists somewhere on the server. The URL path is often mapped to the server’s file system, but that’s not necessarily the case.
Dynamic web page: a file gets served that is generated on the fly. The content often comes from a database, but that’s not necessarily the case.
After trying some solutions, I'm now using the Spring Web MVC framework. This is probably the "dynamic web page" approach mentioned in unor's answer.
import javax.servlet.http.HttpServletRequest;
import org.springframework.stereotype.Controller;
import org.springframework.ui.Model;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.servlet.HandlerMapping;

@Controller
public class SimpleDisplayController {
    @RequestMapping("/page/{symbolicName:[!-z]+}")
    public String displayEntity(HttpServletRequest hsr, Model model) {
        String reqPath = (String) hsr.getAttribute(HandlerMapping.PATH_WITHIN_HANDLER_MAPPING_ATTRIBUTE);
        // Skip past the last '/' so the label does not include the slash itself.
        String entityLb = reqPath.substring(reqPath.lastIndexOf("/") + 1);
        model.addAttribute("label", entityLb);
        return "entity";
    }
}
I can capture the request path using a regex, as you can see in @RequestMapping("/page/{symbolicName:[!-z]+}").
The method above returns the string 'entity', which is the name of an HTML file serving as a template.
The following code is the body of the example HTML template.
<body>
<p th:text="'About entity ' + ${label} + '...'" />
</body>
Since I add an attribute with the key 'label' in the controller above, the template can process ${label}.
In the example HTML template, th:text is syntax from Thymeleaf (a Java library for creating XML/XHTML/HTML5 templates), which is supported by Spring.

W3C validator says 'feed does not validate', 'url must be a full URL'... what's wrong with it?

Validating my feed, it has an enclosure with a URL of
https://archive.org/download/NigelFarageAPersonalMessageToNorthernIrelandVoters./Nigel%20Farage,%20a%20personal%20message%20to%20Northern%20Ireland%20voters..mp3
I know it is a bit convoluted... but what is wrong with it? The full stop in the directory name? The double dot in the file name? The comma? All of them?
I have looked at the RFC on URLs but can't make it out(!).
This feed does not validate.
line 441, column 2: url must be a full URL: https://archive.org/download/NigelFarageAPersonalMessageToNorthernIrelandVoters./Nigel%20Farage,%20a%20personal%20message%20to%20Northern%20Ireland%20voters..mp3 (4 occurrences) [help]
<enclosure type="audio/mpeg" url="https://archive.org/download/NigelFarage ...
^
** edit **
A useful (even if incorrect) answer was added (and removed...) showing the result from the W3C Link Checker - https://validator.w3.org/checklink
This Link Checker looks for issues in links, anchors and referenced objects in a Web page, CSS style sheet, or recursively on a whole Web site. For best results, it is recommended to first ensure that the documents checked use Valid (X)HTML Markup and CSS. The Link Checker is part of the W3C's validators and Quality Web tools.
If you find this question, you may find the link checker a useful resource!
The problem seems to be that it's an HTTPS URL instead of an HTTP URL.
The linked error documentation, foo attribute of bar must be a full URL, says:
If this is a link to a web page, you must include the "http://" at the beginning and immediately follow it with a valid domain name.
The RSS 2.0 spec says about <enclosure>:
The url must be an http url.
If you change https://archive.org/download/… to http://archive.org/download/…, it validates.
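For illustration, here is the enclosure from the validator output with the scheme swapped (other attributes elided, as in the output above):
<enclosure type="audio/mpeg" url="http://archive.org/download/NigelFarageAPersonalMessageToNorthernIrelandVoters./Nigel%20Farage,%20a%20personal%20message%20to%20Northern%20Ireland%20voters..mp3" ... />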
On the other hand, if you don't use HTTPS, browsers will warn that your page isn't secure; feedvalidator needs to step up. There is a ton of feedback/complaints about this on the support forum here: https://groups.google.com/forum/#!forum/feedvalidator-users
More specifically here: https://github.com/rubys/feedvalidator/issues/16

Got metadata in OData. What next?

I am trying to fetch data from a service that I do not know much about.
So I got its URL, like:
http://ABC.com/ABC.svc
So I thought to get the metadata as:
http://ABC.com/ABC.svc/$metadata
It gives me:
<EntityType Name="E1">
  <Key>
    <PropertyRef Name="E1k1" />
  </Key>
  <Property Name="E2" Type="Edm.String" Nullable="true"
            m:FC_TargetPath="SyndicationTitle" ... >
<ComplexType Name="OptionV1">
  <Property Name="Value" Type="Edm.Int32" Nullable="true" />
... and a lot more.
How do I find out what should come after ABC.svc/?
I want to write queries to access the data. Can somebody point me to what my next steps should be?
Any learning resource on generating queries from the metadata would also be helpful.
Thanks
There are two ways:
1) Using the service document. Navigate to ABC.svc; that should return a service document, which is an ATOM service payload containing the names of the entity sets available from the service. For a sample, you can go to http://services.odata.org/OData/OData.svc/. This should return a document with three collections (entity sets). The href attribute is a relative URI to the entity set (relative to the xml:base, which is usually the base of the service). So if, for example, your service has an entity set E1Set, then typically its address would be ABC.svc/E1Set.
2) Using the $metadata document and assuming the usual addressing scheme (note that this usually applies to the service but it doesn't have to). The $metadata document will define entity sets. Each of these is usually exposed by the service and typically follows the addressing scheme of ABC.svc/EntitySetName.
Once you navigate to the entity set, you should get back an ATOM feed with the entities in that set. The $metadata will help you recognize the shapes of the entities and the relationships.
Some services also have service operations or actions and so on. These are not exposed in the service document from option 1; they are only visible in the $metadata as FunctionImport elements. They usually follow the addressing scheme ABC.svc/FunctionImportName. But note that you might need to know more about a service operation to be able to invoke it (which HTTP verb to use, what the parameters are, what it will do, and so on).
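To make the addressing scheme concrete, here are a few hypothetical requests against the E1Set example above (the $-prefixed system query options come from the OData URI conventions; the key value is made up, and E2 is the string property from the $metadata shown in the question):
GET ABC.svc/E1Set                           (the whole entity set, as an ATOM feed)
GET ABC.svc/E1Set('someKey')                (a single entity, addressed by its key)
GET ABC.svc/E1Set?$top=5                    (only the first five entities)
GET ABC.svc/E1Set?$filter=E2 eq 'something' (entities whose E2 property matches)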
LinqPad provides a very simple means of getting started with OData services (assuming some familiarity with LINQ). If you will primarily be consuming this service from .NET, I'd recommend starting with this application. You point it at the $metadata endpoint and it generates proxy classes which let you work with the OData service much like you would in a plain-old-.NET app. On the Results Log tab, it outputs the URL used to query the OData service, which you can then pick up and tweak in Fiddler. (For more about how to use OData + Fiddler, see this blog post.)
If you'll primarily be using the OData service from JavaScript, you might want to start by understanding the URI conventions better or by playing around with data.js.

URL with multiple forward slashes: does it break anything?

http://example.com/something/somewhere//somehow/script.js
Does the double slash break anything on the server side? I have a script that parses URLs, and I was wondering if it would break anything (or change the path) if I replaced multiple slashes with a single slash. Especially on the server side, some frameworks like CodeIgniter and Joomla use segmented URL schemes and routing. I just want to know if it breaks anything.
RFC 2396, the URI syntax specification, defines the path separator to be a single slash.
However, unless you're using some kind of URL rewriting (in which case the rewriting rules may be affected by the number of slashes), the URI usually maps to a path on disk, and in (most?) modern operating systems (Linux/Unix, Windows) multiple path separators in a row have no special meaning, so /path/to/foo and /path//to////foo would eventually map to the same file.
An additional thing that might be affected is caching. Since both your browser and the server cache individual pages (according to their caching settings), requesting the same file multiple times via slightly different URIs might affect the caching (depending on server and client implementation).
The correct answer to this question is: it depends upon the implementation of the server!
Preface: Double-slash is syntactically valid according to RFC 2396, which defines URL path syntax. As amn explains, it therefore implies an empty URI segment. Note however that RFC 2396 only defines the syntax, not semantics of paths, including empty path segments, so it is up to your server to decide the semantics of the empty path.
You didn't mention the server software stack you're using; perhaps you're even rolling your own. So please use your imagination as to what the semantics could be!
Practically, I would like to point out some everyday semantic-related reasons which mean you should avoid double slashes even though they are syntactically valid:
Since not everyone expects an empty segment to be valid, it can cause bugs. And even though your server technology of today might be compatible with it, either your server technology of tomorrow or the next version of your server technology of today might decide not to support it any more. Example: the ASP.NET MVC Web API library throws an error when you try to specify a route template with a double slash.
Some servers might interpret // as indicating the root path. This can either be on-purpose, or a bug - and then likely it is a security bug, i.e. a directory traversal vulnerability.
Because it is sometimes a bug, and a security bug, some clever server stacks and firewalls will see the substring '//', deduce you are possibly making an attempt at exploiting such a bug, and therefore they will return 403 Forbidden or 400 Bad Request etc, and refuse to actually do any further processing of the URI.
URLs don't have to map to filesystem paths. So even if // in a filesystem path is equivalent to /, you can't guarantee the same is true for all URLs.
Consider the declaration of the relevant path-absolute non-terminal in "RFC3986: Uniform Resource Identifier (URI): Generic Syntax" (specified, as is typical, in ABNF syntax):
path-absolute = "/" [ segment-nz *( "/" segment ) ]
Then consider the segment declaration a few lines further down in the same document:
segment = *pchar
If you can read ABNF, the asterisk (*) specifies that the following element pchar may be repeated any number of times to make up a segment, including zero times. Learning this and re-reading the path-absolute declaration above, you can see that a potentially empty segment implies that the second "/" may repeat indefinitely, hence allowing valid combinations like ////// (an arbitrary run of at least one /) as part of path-absolute (which itself is used in specifying the rule describing a URI).
As all URLs are URIs we can conclude that yes, URLs are allowed multiple consecutive forward slashes, per quoted RFC.
But it's not like everyone follows or implements URI parsers per specification, so I am fairly sure there are non-compliant URI/URL parsers and all kinds of software that stacks on top of these where such corner cases break larger systems.
One thing you may want to consider is that it might affect your page indexing in a search engine. According to this web page,
A URL with the same path repeated 3 times will not be indexed in Google
The example they use is:
example.com/path/path/path/
I haven't confirmed whether this would also be true if you used example.com///, but I would certainly want to find out if SEO was critical for my website.
They mention that "This is because Google thinks it has hit a URL trap." If anyone else knows the answer for sure, please add a comment to this answer; otherwise, I thought it relevant to include this case for consideration.
Yes, it can most definitely break things.
The spec considers http://host/pages/foo.html and http://host/pages//foo.html to be different URIs, and servers are free to assign different meanings to them. However, most servers will treat the paths /pages/foo.html and /pages//foo.html identically (because the underlying file system does too). But even when dealing with such servers, it's easily possible for an extra slash to break things. Consider the situation where a relative URI is returned by the server.
http://host/pages/foo.html + ../images/foo.png = http://host/images/foo.png
http://host/pages//foo.html + ../images/foo.png = http://host/pages/images/foo.png
Let me explain what that means. Say your server returns an HTML document that contains the following:
<img src="../images/foo.png">
If your browser obtained that page using
http://host/pages/foo.html # Path has 2 segments: "pages" and "foo.html"
your browser will attempt to load
http://host/images/foo.png # ok
However, if your browser obtained that page using
http://host/pages//foo.html # Path has 3 segments: "pages", "" and "foo.html"
you'll probably get the same page (because the server probably doesn't distinguish /pages//foo.html from /pages/foo.html), but your browser will erroneously try to load
http://host/pages/images/foo.png # XXX
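You can reproduce this resolution with java.net.URI, whose resolve() method implements the RFC 3986 reference-resolution algorithm; a self-contained sketch:

import java.net.URI;

public class SlashResolution {
    public static void main(String[] args) {
        URI twoSegments = URI.create("http://host/pages/foo.html");
        URI threeSegments = URI.create("http://host/pages//foo.html");
        // With the double slash, ".." consumes the empty segment, not "pages".
        System.out.println(twoSegments.resolve("../images/foo.png"));   // http://host/images/foo.png
        System.out.println(threeSegments.resolve("../images/foo.png")); // http://host/pages/images/foo.png
    }
}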
You may be surprised, for example, when building links for resources in your app.
<script src="mysite.com/resources/jquery//../angular/script.js"></script>
will not resolve to mysite.com/resources/angular/script.js but to mysite.com/resources/jquery/angular/script.js, which is probably not what you wanted.
Double slashes are evil, try to avoid them.
Your question is "does it break anything". In terms of the URL specification, extra slashes are allowed. Rather than reading the RFC, here is a quick experiment you can try to see whether your browser silently mangles the URL:
echo '<?= $_SERVER["REQUEST_URI"];' > tmp.php
php -S localhost:4000 tmp.php
Requesting http://localhost:4000/hello//world on macOS 10.14 (18A391) with Safari 12.0 (14606.1.36.1.9) and Chrome 69.0.3497.100, both get the result:
/hello//world
This indicates that the extra slash is preserved and visible to the web application.
Certain use cases will be broken when using a double slash. This includes URL redirects/routing that are expecting a single-slashed URL or other CGI applications that are analyzing the URI directly.
But for normal cases of serving static content, such as your example, this will still fetch the correct content. However, the client will get a cache miss when the same content is accessed with a different number of slashes.
