What are the legal and illegal characters in URL/Link? - url

What happens if there is an illegal character? Does the URL fix itself by encoding the illegal characters into something else?

As explained here, the characters allowed in a URI are:
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~:/?#[]@!$&'()*+,;=
Any other character needs to be encoded with the percent-encoding
(%hh). Each part of the URI has further restrictions about what
characters need to be represented by a percent-encoded word.
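The rule quoted above can be sketched in Python. This is only an illustration, assuming the allowed set is the RFC 3986 unreserved and reserved characters and that UTF-8 is used for anything outside ASCII; the helper names `illegal_chars` and `encode_illegal` are made up for this example, and the per-component restrictions mentioned above are not checked.

```python
from urllib.parse import quote

# RFC 3986 unreserved + reserved characters (the set quoted above).
ALLOWED = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789"
    "-._~:/?#[]@!$&'()*+,;="
)


def illegal_chars(uri):
    """Return the characters of `uri` that are outside the allowed set.

    "%" is tolerated because it introduces an existing percent-escape.
    """
    return [ch for ch in uri if ch not in ALLOWED and ch != "%"]


def encode_illegal(uri):
    """Percent-encode (as UTF-8) only the characters outside the allowed set."""
    return quote(uri, safe=ALLOWED + "%")


print(illegal_chars("http://example.com/a b"))   # the space is flagged
print(encode_illegal("http://example.com/a b"))  # the space becomes %20
```

Note that this only makes the string syntactically acceptable character by character; it does not decide whether a reserved character such as "?" or "#" was meant as a delimiter or as data.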

Allowed characters
RFC 3986 defines which characters are allowed in which URI components.
RFCs for specific URI schemes might further restrict this.
If you are interested in HTTP/HTTPS URIs: they are defined in RFC 7230. AFAIK they don’t have further restrictions regarding allowed characters, so you could stick to the definitions in RFC 3986.
What happens if illegal characters are used?
Depends on many factors … could be anything from "nothing happens" to "doesn’t work anymore".
Does the URL fix itself by encoding the illegal characters into something else?
A URI can’t fix itself, it’s just a string.
Clients working with this URI (browser, server, email client, etc.) may try to fix a URI (or work with invalid URIs) according to their own rules.
URI vs. link
Also note that there’s a difference between a URI and linking to (or storing etc.) this URI in a document.
The host language (e.g., HTML) might have rules about what to encode. This does not change the URI, only the way the URI is stored/specified in this document.
For example, the valid URI http://example.com/a&b would have to be linked like this in HTML documents:
<a href="http://example.com/a&amp;b">Link</a>
But the URI is still http://example.com/a&b, not http://example.com/a&amp;b.
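The difference between a URI and its representation in an HTML document can be illustrated with Python's standard `html` module. This is a minimal sketch; the variable names are only for illustration.

```python
from html import escape, unescape

uri = "http://example.com/a&b"

# Escaping applies to the HTML attribute value, not to the URI itself.
markup = '<a href="{}">Link</a>'.format(escape(uri, quote=True))
print(markup)

# An HTML parser reverses the escaping and recovers the original URI.
recovered = unescape("http://example.com/a&amp;b")
print(recovered == uri)
```

The escaped form exists only inside the document; any client that follows the link works with the unescaped URI.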

Related

Language specific characters in URL

Colleagues from work have created an API endpoint which uses language-specific characters in the URL. The API URL looks like this:
http://somedomain.com/someapi/somemethod/zażółć/gęślą/jaźń
Is this OK or is it a bad approach?
Technically, that's not a valid URL but web browsers and other clients finesse it. The script that characters are from is not an issue but structural characters like "/?#" could be. You'll have to consider what to do when they show up in data that you are "pasting" into your URLs.
An HTTP URL is:
an ASCII-encoded scheme (in this case the protocol "http")
a punycode-encoded, ASCII-encoded domain
a %-encoded, ASCII-encoded, server-defined sequence of octets for the path, optional query, and optional hash.
See RFC 3986
The assumption that everyone makes—quite reasonably because it is the predominant practice—is that the path, query, and hash are text. There is no text but encoded text. So, some character encoding is involved. Where %-encoding is needed outside of structural characters, browsers are going to assume UTF-8. If you don't want browsers to do the %-encoding, use valid URLs by doing it yourself with the character encoding that you are using.
As the world is standardizing on UTF-8 (where applicable), JavaScript has as well, with the encodeURIComponent function, which always encodes as UTF-8. Clients using JavaScript in a web browser are likely to use this function, either directly or through some library.
UTF-8 encoded, %-encoded (and, then on the wire, ASCII-encoded) version of your URL that my browser created:
http://somedomain.com/someapi/somemethod/za%C5%BC%C3%B3%C5%82%C4%87/g%C4%99%C5%9Bl%C4%85/ja%C5%BA%C5%84
(You can see this yourself using your browser's dev tools [F12 key, network tab] or a packet sniffer [e.g., Wireshark or Fiddler]. What you gave as a URL is never seen on the wire.)
Your server application probably understands that just fine. In any case, it is your server's rules that the client complies with. If your API uses UTF-8 encoded, %-encoded URLs then just document that. (But phrase it in a way that doesn't confuse people who do that already without knowing.)
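For a client that builds such URLs itself, the UTF-8 plus %-encoding step described above might look like the following Python sketch. The helper name `build_path` is hypothetical; it encodes each path segment separately so that structural characters such as "/", "?" and "#" appearing in the data cannot be mistaken for delimiters.

```python
from urllib.parse import quote, unquote


def build_path(*segments):
    """Join path segments, percent-encoding each one as UTF-8.

    safe="" means "/", "?" and "#" inside a segment are encoded too.
    """
    return "/" + "/".join(quote(s, safe="") for s in segments)


path = build_path("someapi", "somemethod", "zażółć", "gęślą", "jaźń")
print("http://somedomain.com" + path)

# Structural characters that arrive as data are escaped, not interpreted.
print(build_path("a/b?c#d"))

# The server side reverses the step, again assuming UTF-8.
print(unquote(path))
```

The first URL printed should match the one the browser produced above, since both apply UTF-8 followed by %-encoding.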

Should I url encode a query string parameter that's a URL?

Say I have the following URL, which has a query string parameter that is itself a URL:
http://www.someSite.com?next=http://www.anotherSite.com?test=1&test=2
Should I url encode the next parameter? If I do, who's responsible for decoding it - the web browser, or my web app?
The reason I ask is I see lots of big sites that do things like the following
http://www.someSite.com?next=http://www.anotherSite.com/another/url
In the above, they don't bother encoding the next parameter because I'm guessing, they know it doesn't have any query string parameters itself. Is this ok to do if my next url doesn't include any query string parameters as well?
RFC 2396 sec. 2.2 says that you should URL-encode those symbols anywhere where they're not used for their explicit meanings; i.e. you should always form targetUrl + '?next=' + urlencode(nextURL).
The web browser does not 'decode' those parameters at all; the browser doesn't know anything about the parameters but just passes along the string. A query string of the form http://www.example.com/path/to/query?param1=value&param2=value2 is GET-requested by the browser as:
GET /path/to/query?param1=value&param2=value2 HTTP/1.1
Host: www.example.com
(other headers follow)
On the backend, you'll need to parse the results. I think PHP's $_REQUEST array will have already done this for you; in other languages you'll want to split over the first ? character, then split over the & characters, then split over the first = character, then urldecode both the name and the value.
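Both halves of this answer can be sketched in Python: building the URL with an encoded `next` value, and parsing the query on the backend with the split steps described above. The function names `build_redirect_url` and `parse_query` are made up for this example; in practice a framework or `urllib.parse.parse_qs` would normally do the parsing.

```python
from urllib.parse import quote, unquote_plus


def build_redirect_url(target_url, next_url):
    # safe="" encodes ":", "/", "?", "&" and "=" inside the value.
    return target_url + "?next=" + quote(next_url, safe="")


def parse_query(url):
    """Split on the first "?", then on "&", then on the first "=", then decode."""
    _, _, query = url.partition("?")
    query, _, _ = query.partition("#")  # drop a fragment, if present
    params = []
    for pair in query.split("&"):
        if not pair:
            continue
        name, _, value = pair.partition("=")
        params.append((unquote_plus(name), unquote_plus(value)))
    return params


encoded = build_redirect_url(
    "http://www.someSite.com", "http://www.anotherSite.com?test=1&test=2"
)
print(encoded)
print(parse_query(encoded))

# Without encoding, the inner "&test=2" is read as a parameter of the outer URL.
unencoded = "http://www.someSite.com?next=http://www.anotherSite.com?test=1&test=2"
print(parse_query(unencoded))
```

The last call shows why encoding matters in the question's first example: the unencoded form yields a truncated `next` value and a stray `test` parameter.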
According to RFC 3986:
The query component is indicated by the first question mark ("?")
character and terminated by a number sign ("#") character or by the
end of the URI.
So the following URI is valid:
http://www.example.com?next=http://www.example.com
The following excerpt from the RFC makes this clear:
... as query components are often used to carry identifying
information in the form of "key=value" pairs and one frequently used
value is a reference to another URI, it is sometimes better for
usability to avoid percent-encoding those characters.
It is worth noting that RFC 3986 makes RFC 2396 obsolete.
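The query-component rule quoted from RFC 3986 can be checked with Python's `urllib.parse.urlsplit`, which splits at the first "?" and stops at "#". The `#top` fragment below is added here only to show where the query ends.

```python
from urllib.parse import urlsplit

parts = urlsplit("http://www.example.com?next=http://www.example.com#top")

print(parts.netloc)    # the authority
print(parts.query)     # everything between the first "?" and the "#"
print(parts.fragment)  # everything after the "#"
```

The unencoded ":" and "/" in the value stay inside the query component, which is why the URI above is valid as long as the value contains no "&" or "#" of its own.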

Can I use an at (@) sign in the path part of a URL

I know it can't be part of the authority section, as usernames with an @ are used there, but can I use it in the path section?
The reason I want to use it is as part of a URL for a user's resources, e.g.
www.example.com/user@domain.com/someresource
The @ symbol is a reserved character in RFC 3986 so it is not allowed in your URL. It would be converted to %40 when URL encoding is used.
In your case, following best practice for clean RESTful URLs, it should be
www.example.com/domain.com/user/someresource
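The %40 conversion mentioned in this answer can be reproduced with Python's `urllib.parse.quote`. This is a small sketch; the segment value is the at-sign identifier from the question.

```python
from urllib.parse import quote, unquote

segment = "user@domain.com"

# safe="" asks quote() to encode every reserved character, including "@".
encoded = quote(segment, safe="")
print("www.example.com/" + encoded + "/someresource")

# Decoding restores the original identifier on the server side.
print(unquote(encoded))
```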

urn in url for RESTful service, building url path

We are working on creating a RESTful service and trying to decide on the URL path format.
We have a URN that uniquely identifies a resource throughout the organization, and we are building the REST service to serve that resource in the format the requester is looking for, via HTTP content negotiation.
My question is how we should form the path of the URL for the service. Which of these makes more sense?
http://{domain}/{somethinghere}/{full urn string}
or
http://{domain}/{somethinghere}/{urn-part-1}/{urn-part-2}/{urn-part-3}
I have the same question too!... IMHO, I would use the full urn string,
http://{domain}/{somethinghere}/{full urn string}
It's elegant, semi-legal, and has a user-friendly feature of making it easier to copy-and-paste URN strings into your URL. Here's some of the homework I've done:
There is an old experimental RFC 2169 which suggests putting in the full urn string, and not %quoting the colons (:). This is clean and elegant... And there are examples of colons in the wild e.g.,
http://en.wikipedia.org/wiki/Talk:Buckminster_Fuller
One of my fears (can anyone confirm or reject this?) is that some browsers, servers, frameworks, or tools may try to %quote or otherwise choke on a colon because of various assumptions that they may make about what a colon represents.
Neither RFC 1630 nor other RFCs make it clear whether a colon may be used in a path of the http scheme or not. There is a caveat however! The placement of a colon is important in determining whether or not a URL is absolute (and this is specified under the section "Partial (relative) form" in RFC 1630). If a colon appears before a slash (/), then the URL is absolute. (N.B. the colon is referred to as a "reserved" delimiter in the RFCs, but the intended reserved use of it is clear and does not rule out use in paths.)
I'd love to hear more ideas about this... (and not just taking the easy cop-out of slash-encoding everything, as that is not as elegant).
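The two options for the colon, and the caveat about where it appears, can be explored with Python's `urllib.parse`. This is only a sketch of how one parser behaves, not a statement about every tool; the URN `urn:example:a123` and the `/svc/` prefix are hypothetical placeholders.

```python
from urllib.parse import quote, urlsplit

urn = "urn:example:a123"  # hypothetical URN

# Option 1: keep the colons readable in the path segment.
print("/svc/" + quote(urn, safe=":"))

# Option 2: percent-encode the colons as well.
print("/svc/" + quote(urn, safe=""))

# A colon after the first slash is treated as ordinary path data ...
in_path = urlsplit("/svc/urn:example:a123")
print(repr(in_path.scheme), in_path.path)

# ... but a colon before any slash makes the reference look absolute,
# so the text before the colon is parsed as a scheme.
leading = urlsplit("urn:example:a123/x")
print(repr(leading.scheme), leading.path)
```

With this parser, the unencoded colon is harmless as long as the URN is not the first segment of a relative reference.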

Why are URLs in the form of "http://www.mongodb.org/display/DOCS/mongo+-+The+Interactive+Shell"

What is the mongo+-+The+Interactive+Shell part for and why is it that way? It seems like it is urlencoded from "mongo - The Interactive Shell"
For the same reason the URL to this question includes why-are-urls-in-the-form-of-http-www-mongodb-org-display-docs-mongo-theinte. Unencoded spaces aren't valid, and encoded ones (%20) are hard to read, so a more readable alternative is used.
The W3C reserved the plus sign as a shorthand for the space character. You'll also find the same document codified as RFC 1630.
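The plus-for-space convention can be seen with Python's `urllib.parse`, which offers both styles. This is a short sketch using the page title from the question.

```python
from urllib.parse import quote, quote_plus, unquote_plus

title = "mongo - The Interactive Shell"

# Plus-style encoding: spaces become "+".
plus_form = quote_plus(title)
print(plus_form)

# Plain percent-encoding: spaces become "%20".
percent_form = quote(title)
print(percent_form)

# Decoding the plus form gives back the original title.
print(unquote_plus(plus_form))
```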

Resources