Is a (local) file path a URI? - url

On some input we allow the following paths:
C:\Folder
\\server\Folder
http://example.com/...
Can I mark them all as "URI"s?

C:/Folder and /server/Folder/ are file paths.
http://example.com/ is a URL, which is a URI sub-type, so you could mark it as a URI but not the other way around (like how squares are rectangles but not vice versa).
Of course, here you have posted a clear, simple example. When discussing the distinction between URI and URL, not only is the answer not clear cut, it is also disputed. I recommend taking a look at this thread and the answers posted in it for clarification. Generally though, it is mostly agreed that the main difference is that URLs provide an access method (such as http://).
So if we were to convert your first file path into a URL it would become the following (note the addition of the access method):
file:///c:/Folder/test.txt
If you modify all your file paths to include an access method like in my example, then it will be okay for you to mark them as URIs.
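As a rough sketch of that conversion, Python's standard pathlib can build file URLs from Windows-style paths on any platform. The drive-letter path is the one from the example above; the test.txt name under the UNC share is made up for illustration.

```python
from pathlib import PureWindowsPath

def to_file_uri(path: str) -> str:
    """Turn an absolute Windows path (drive letter or UNC) into a file: URL."""
    # PureWindowsPath only manipulates the string; it never touches the disk,
    # so this also runs on Linux or macOS. Relative paths raise ValueError.
    return PureWindowsPath(path).as_uri()

print(to_file_uri(r"C:\Folder\test.txt"))        # file:///C:/Folder/test.txt
print(to_file_uri(r"\\server\Folder\test.txt"))  # file://server/Folder/test.txt
```

Note that the UNC server name ends up in the authority position of the URL, while a drive-letter path gets an empty authority (hence the three slashes).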

Strictly speaking no, unless you make sure it's an absolute path and add "file://" to the beginning.
As per RFC 3986 Section 3:
URI       = scheme ":" hier-part [ "?" query ] [ "#" fragment ]
hier-part = "//" authority path-abempty
          / path-absolute
          / path-rootless
          / path-empty
The scheme and the ":" are not in square brackets [], which means they are not optional.
However, the HTML standard calls these "path-relative-scheme-less-URL strings" and they're valid in the href attribute of an HTML element so maybe it's fine to call relative Unix paths "URLs" (not absolute Unix paths or Windows paths though).
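To make the "scheme is required" point concrete, here is a minimal sketch of a check, assuming a deliberately loose test: the string must start with a scheme and contain only characters that RFC 3986 permits in a URI. It is not a full grammar validator, and the function name looks_like_uri is invented for this example.

```python
import re

# scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ), followed by ":"
SCHEME = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*:")
# Characters RFC 3986 allows somewhere in a URI: unreserved, gen-delims,
# sub-delims, and "%" for percent-encoding. Backslash is not among them.
URI_CHARS = re.compile(r"[A-Za-z0-9\-._~:/?#\[\]@!$&'()*+,;=%]*")

def looks_like_uri(text: str) -> bool:
    """Rough check: starts with a scheme and uses only URI characters."""
    return bool(SCHEME.match(text)) and bool(URI_CHARS.fullmatch(text))

for candidate in [r"C:\Folder", r"\\server\Folder", "http://example.com/",
                  "file:///c:/Folder/test.txt", "docs/readme.txt"]:
    print(candidate, looks_like_uri(candidate))
```

The drive-letter case is the interesting one: "C:" is syntactically a valid scheme, so it is the backslash that disqualifies C:\Folder, not the missing scheme.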

Related

Changing "/" to "%2f" in URL doesn't work

I have an orchard site and have the following problem:
If I use the URL: http://asiahotelct.com/tours/ct---chau-%C4%91oc---ha-tien-3n2%C4%91, it's okay. But when I change the / in the URL to %2f (like so: http://asiahotelct.com/tours%2fct---chau-%C4%91oc---ha-tien-3n2%C4%91), it no longer works.
Why can / not be replaced by %2f?
A URL is a complete address of some resource (such as a file) on the network. For URLs to work as expected, a few characters are given a specific meaning; in this case, "/" is the separator between the individual elements of the address.
If you need one of these special characters to be part of an element of the address, it must be encoded.
URL encoding converts characters into a format that can be transmitted
over the Internet.
- w3Schools
So, "/" is actually a separator, but "%2f" is an ordinary character that simply represents the "/" character within an element of your URL.
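The difference can be illustrated with Python's urllib.parse. This is only a sketch of how a URL library sees the two forms (the example.com URLs and the some-tour segment are placeholders), not a description of how Orchard itself routes requests.

```python
from urllib.parse import quote, unquote, urlsplit

def path_segments(url: str) -> list[str]:
    """Split the path on literal "/" first, then decode each segment."""
    raw_path = urlsplit(url).path  # urlsplit leaves percent-escapes untouched
    return [unquote(segment) for segment in raw_path.split("/") if segment]

print(path_segments("http://example.com/tours/some-tour"))    # ['tours', 'some-tour']
print(path_segments("http://example.com/tours%2fsome-tour"))  # ['tours/some-tour']
print(quote("tours/some-tour", safe=""))                      # tours%2Fsome-tour
```

With a literal "/" the path has two segments; with "%2f" it has a single segment whose text happens to contain a slash, so a route that expects two segments no longer matches.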

How Can I Implement A Standard Set of Hyperlink Detection Rules in Delphi

I currently do automatic detection of hyperlinks within text in my program. I made it very simple and only look for http:// or www.
However, a user suggested to me that I extend it to other forms, e.g.: https:// or .com
Then I realized it might not stop there because there's ftp and mailto and file, all the other top level domains, and even email addresses and file paths.
What I think is best is to limit it to what is practical by following some often-used standard set of hyperlink detection rules that are currently in use. Maybe how Microsoft Word does it, or maybe how RichEdit does it or maybe you know of a better standard.
So my question is:
Is there a built in function that I can call from Delphi to do the detection, and if so, what would the call look like? (I plan in the future to go to FireMonkey, so I would prefer something that will work beyond Windows.)
If there isn't a function available, is there some place I can find a documented set of rules of what is detected in Word, in RichEdit, or any other set of rules of what should be detected? That would then allow me to write the detection code myself.
Try the PathIsURL function, which is declared in the ShLwApi unit.
The following regex, taken from RegexBuddy's library, might get you started (I can't make any claims about performance).
Regex
Match; JGsoft; case insensitive:
\b(https?|ftp|file)://[-A-Z0-9+&@#/%?=~_|$!:,.;]*[A-Z0-9+&@#/%=~_|$]
Explanation
URL: Find in full text
The final character class makes sure that if an URL is part of some text,
punctuation such as a comma or full stop after the URL is not interpreted as part
of the URL.
Matches (whole or partial)
http://regexbuddy.com
http://www.regexbuddy.com
http://www.regexbuddy.com/
http://www.regexbuddy.com/index.html
http://www.regexbuddy.com/index.html?source=library
You can download RegexBuddy at http://www.regexbuddy.com/download.html.
Does not match
regexbuddy.com
www.regexbuddy.com
"www.domain.com/quoted URL with spaces"
support@regexbuddy.com
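For anyone who wants to try the pattern outside Delphi, here is a small sketch using Python's re module. It assumes the character classes contain "@#" and that JGsoft's case-insensitive mode corresponds to re.IGNORECASE; the helper name find_urls is invented for this example.

```python
import re

# The RegexBuddy pattern, written with "@" in both character classes.
URL_IN_TEXT = re.compile(
    r"\b(https?|ftp|file)://[-A-Z0-9+&@#/%?=~_|$!:,.;]*[A-Z0-9+&@#/%=~_|$]",
    re.IGNORECASE,
)

def find_urls(text: str) -> list[str]:
    # finditer + group(0) returns the whole match, not just the scheme group.
    return [m.group(0) for m in URL_IN_TEXT.finditer(text)]

print(find_urls("You can download RegexBuddy at http://www.regexbuddy.com/download.html."))
# ['http://www.regexbuddy.com/download.html']
print(find_urls("www.regexbuddy.com or support@regexbuddy.com"))
# []
```

The trailing full stop is left out of the first match because the final character class does not include punctuation such as "." or ",".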
For a set of rules you might look into RFC 3986
A Uniform Resource Identifier (URI) is a compact sequence of
characters that identifies an abstract or physical resource. This
specification defines the generic URI syntax and a process for
resolving URI references that might be in relative form, along with
guidelines and security considerations for the use of URIs on the
Internet
A regex that validates a URL as specified in RFC 3986 would be
^
(# Scheme
[a-z][a-z0-9+\-.]*:
(# Authority & path
//
([a-z0-9\-._~%!$&'()*+,;=]+@)? # User
([a-z0-9\-._~%]+ # Named host
|\[[a-f0-9:.]+\] # IPv6 host
|\[v[a-f0-9][a-z0-9\-._~%!$&'()*+,;=:]+\]) # IPvFuture host
(:[0-9]+)? # Port
(/[a-z0-9\-._~%!$&'()*+,;=:@]+)*/? # Path
|# Path without authority
(/?[a-z0-9\-._~%!$&'()*+,;=:@]+(/[a-z0-9\-._~%!$&'()*+,;=:@]+)*/?)?
)
|# Relative URL (no scheme or authority)
([a-z0-9\-._~%!$&'()*+,;=@]+(/[a-z0-9\-._~%!$&'()*+,;=:@]+)*/? # Relative path
|(/[a-z0-9\-._~%!$&'()*+,;=:@]+)+/?) # Absolute path
)
# Query
(\?[a-z0-9\-._~%!$&'()*+,;=:@/?]*)?
# Fragment
(\#[a-z0-9\-._~%!$&'()*+,;=:@/?]*)?
$
Regular Expressions may be the way to go here, to define the various patterns which you deem to be appropriate hyperlinks.

urn in url for RESTful service, building url path

We are working on creating a RESTful service and are trying to decide on the URL path format.
We have a URN that uniquely identifies a resource throughout the organization, and we are building the REST service to serve that resource in the format the requester is looking for, via HTTP content negotiation.
My question is how we should form the path of the URL for the service. Which of these makes more sense?
http://{domain}/{somethinghere}/{full urn string}
or
http://{domain}/{somethinghere}/{urn-part-1}/{urn-part-2}/{urn-part-3}
I have the same question too!... IMHO, I would use the full urn string,
http://{domain}/{somethinghere}/{full urn string}
It's elegant, semi-legal, and has a user-friendly feature of making it easier to copy-and-paste URN strings into your URL. Here's some of the homework I've done:
There is an old experimental RFC 2169 which suggests putting in the full urn string, and not %quoting the colons (:). This is clean and elegant... And there are examples of colons in the wild e.g.,
http://en.wikipedia.org/wiki/Talk:Buckminster_Fuller
One of my fears (can anyone confirm or reject this?) is that some browsers, servers, frameworks, or tools may try to %quote or otherwise choke on a colon because of various assumptions that they may make about what a colon represents.
Neither RFC 1630 nor other RFCs make it clear whether a colon may be used in a path of the http scheme or not. There is a caveat however! The placement of a colon is important in determining whether or not a URL is absolute (and this is specified under the section "Partial (relative) form" in RFC 1630). If a colon appears before a slash (/), then the URL is absolute. (N.B. the colon is referred to as a "reserved" delimiter in the RFCs, but the intended reserved use of it is clear and does not rule out use in paths.)
I'd love to hear more ideas about this... (and not just taking the easy cop-out of slash-encoding everything, as that is not as elegant).
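A short experiment with Python's urllib.parse shows both the readable form and the caveat about colon placement. The base URL and the sample URN are made up for illustration, and other libraries may behave differently.

```python
from urllib.parse import quote, urljoin

BASE = "http://example.com/resources/"  # hypothetical service root
URN = "urn:isbn:0451450523"             # sample URN, for illustration only

# Percent-encode the URN as one path segment but leave ":" readable.
segment = quote(URN, safe=":")
print(BASE + segment)   # http://example.com/resources/urn:isbn:0451450523
print(quote(URN))       # urn%3Aisbn%3A0451450523 (the default also encodes ":")

# Used as a relative reference, the leading "urn:" is read as a scheme...
print(urljoin(BASE, segment))         # urn:isbn:0451450523
# ...which is why RFC 3986 section 4.2 suggests a "./" prefix in that position.
print(urljoin(BASE, "./" + segment))  # http://example.com/resources/urn:isbn:0451450523
```

So the colon is harmless once the URN sits after the service path in an absolute URL; it only changes meaning when the URN is the first segment of a relative link.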

Why we don't use such URL formats?

I am reworking the URL formats of my project. The basic format of our search URLs is this:
www.projectname/module/search/<search keyword>/<exam filter>/<subject filter>/... other params ...
When searching with no search keyword and no exam filter, the URL will be:
www.projectname/module/search///<subject filter>/... other params ...
My question is why don't we see such URLs with back to back slashes (3 slashes after www.projectname/module/search)? Please note that I am not using .htaccess rewrite rules in my project anymore. This URL works perfectly. So, should I use this format?
For more details on why we chose this format, please check my other question:
Suggest best URL style
Web servers will typically remove multiple slashes before the application gets to see the request, for a mix of compatibility and security reasons. When serving plain files, it is usual to allow any number of slashes between path segments to behave as one slash.
Blank URL path segments are not invalid in URLs but they are typically avoided because relative URLs with blank segments may parse unexpectedly. For example in /module/search, a link to //subject/param is not relative to the file, but a link to the server subject with path /param.
Whether you can see the multiple-slash sequences from the original URL depends on your server and application framework. In CGI, for example (and other gateway standards based on it), the PATH_INFO variable that is typically used to implement routing will usually omit multiple slashes. But on Apache there is a non-standard environment variable REQUEST_URI which gives the original form of the request without having elided slashes or done any %-unescaping like PATH_INFO does. So if you want to allow empty path segments, you can, but it'll cut down on your deployment options.
There are other strings than the empty string that don't make good path segments either. Using an encoded / (%2F), \ (%5C) or null byte (%00) is blocked by default by many servers. So you can't put any old string in a segment; it'll have to be processed to remove some characters (often ‘slug’-ified to remove all but letters and numbers). Whilst you are doing this you may as well replace the empty string with _.
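A minimal sketch of that processing, assuming a simple slug rule (keep letters and digits, join runs of anything else with a hyphen) and "_" as the placeholder for an empty filter. The helper names and sample filter values are invented; only the /module/search prefix comes from the question.

```python
import re

def to_segment(value: str) -> str:
    """Reduce a filter value to letters, digits and hyphens; "_" if empty."""
    slug = re.sub(r"[^A-Za-z0-9]+", "-", value).strip("-").lower()
    return slug or "_"

def search_path(*filters: str) -> str:
    # "/module/search" mirrors the URL layout in the question.
    return "/module/search/" + "/".join(to_segment(f) for f in filters)

print(search_path("", "", "Organic Chemistry"))  # /module/search/_/_/organic-chemistry
print(search_path("a/b", "GATE 2012", ""))       # /module/search/a-b/gate-2012/_
```

Because every segment is non-empty and free of "/", "\" and "%", the resulting path never contains back-to-back slashes or escapes that a server might reject.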
Probably because it's not clearly defined whether the extra / should be ignored.
For instance: http://news.bbc.co.uk/sport and http://news.bbc.co.uk//////////sport both display the same page in Firefox and Chrome. The server is treating the two URLs as the same thing, whereas your server obviously does not.
I'm not sure whether this behaviour is defined somewhere or not, but it does seem to make sense (at least for the BBC website - if I type an extra /, it does what I meant it to do.)

Is the Scheme Optional in URIs?

I was recently asked to add some Woopra JavaScript to a website and noticed that the URL started with a double slash (i.e. omitted the scheme). I've never seen this before, so I went trying to find out more about it, but the only thing I could really find was an item on the Woopra FAQ:
The Woopra JavaScript in the Setup does not include http in the URL call for the script. This is correct. The JavaScript has been optimized to run very fast and efficiently on your site.
However, some validation and site testing/debugging services and tools do not recognize the code as correct. It is correct and valid. If the warnings annoy you, just add the http to the script’s URL. It will not impact the script.
(For clarification, the URL is "//static.woopra.com/js/woopra.v2.js"—the colon is omitted in addition to the "http".)
Is there any more information about this practice? If this is indeed valid, there must be a spec that talks about it, and I'd very much like to see it.
Thanks in advance for satisfying my curiosity!
This is a valid URL. It's called a "network-path reference" as defined in RFC 3986. When you don't specify a scheme/protocol, it will fall back to the current scheme. So if you are viewing a page via https:// all network path references will also use https.
For an example, here's a link to the RFC 3986 document again but with a network path reference. If you were viewing this page over https (although it looks like you can't use https with StackOverflow) the link will reflect your current URI scheme, unlike the first link.
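The fallback can be reproduced with Python's urllib.parse.urljoin, which resolves references following RFC 3986. The example.com page URLs are placeholders for whatever page includes the script.

```python
from urllib.parse import urljoin

SCRIPT = "//static.woopra.com/js/woopra.v2.js"  # network-path reference

# The reference supplies host and path; the scheme comes from the base page.
print(urljoin("http://example.com/page.html", SCRIPT))
# http://static.woopra.com/js/woopra.v2.js
print(urljoin("https://example.com/page.html", SCRIPT))
# https://static.woopra.com/js/woopra.v2.js
```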
See RFC 3986, section 3:
The generic URI syntax consists of a
hierarchical sequence of components
referred to as the scheme, authority,
path, query, and fragment.
URI       = scheme ":" hier-part [ "?" query ] [ "#" fragment ]
hier-part = "//" authority path-abempty
          / path-absolute
          / path-rootless
          / path-empty
The scheme and path components are
required, though the path may be
empty (no characters).

Resources