I am trying to write a function to encode URIs in order to make them compliant with RFC 3986.
I.e. checking that every character other than alphanumerics and /?:@&=+$-_.!~*'()#[] gets replaced by %[hex octet].
I want to be sure that if the function gets called with an already encoded URI, the code won't ruin it.
So far, all I am doing is looking for a '%' sign followed by two hex digits (an escaped octet). Any other reserved character I find, I replace.
Is there any other check I should be doing?
Don't mind security issues; they are being handled somewhere else.
I think that properly-encoded URIs should always pass through cleanly the second time.
The reason being that you have to correctly parse a URI no matter what, because it's entirely legal to have characters such as / # . : ? & = in a URI, provided they appear in the right places.
So you only encode a character if it is not legal in that part of the URI. With that assertion, you then create an encoded string that IS legal at every position, so when you parse it, there is nothing left to encode.
Bear in mind that if someone throws a URI at you to be encoded and it happens to be ambiguous (i.e. it contains special characters that alter the URI syntax), they cannot expect a correct result.
To answer your question more directly, I would say yes: in light of all the above, you only need to have special treatment for the % escape sequences.
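To make that concrete, here is a minimal Python sketch of such an idempotent encoder (the encode_once name is mine, and the single allowed set below is the full RFC 3986 set for illustration; a real implementation would vary the set per URI component):

import re

# Unreserved + reserved characters from RFC 3986 Appendix A.
ALLOWED = re.compile(r"[A-Za-z0-9\-._~:/?#\[\]@!$&'()*+,;=]")
# An existing escape sequence: '%' followed by two hex digits.
HEX_ESCAPE = re.compile(r"%[0-9A-Fa-f]{2}")

def encode_once(uri: str) -> str:
    out = []
    i = 0
    while i < len(uri):
        m = HEX_ESCAPE.match(uri, i)
        if m:
            # Pass existing %XX sequences through untouched.
            out.append(m.group(0))
            i = m.end()
        elif ALLOWED.match(uri[i]):
            out.append(uri[i])
            i += 1
        else:
            # Escape each UTF-8 byte of the disallowed character.
            out.extend(f"%{b:02X}" for b in uri[i].encode("utf-8"))
            i += 1
    return "".join(out)

assert encode_once("a b") == "a%20b"
assert encode_once(encode_once("a b")) == "a%20b"  # second pass is a no-op

Because every %XX sequence is passed through untouched, a correctly encoded URI comes out unchanged.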
Um, how do you know that an already encoded URI should not be encoded once again? Maybe the URI contains, say, an example of how to encode URIs, and if it does not get encoded a second time, then decoding will break it?
That said, you can check whether only allowed characters plus % are present, and whether every % is followed by two hex digits. If yes, there is a good chance (but no guarantee) that the encoding has already been done.
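A rough sketch of that check in Python (heuristic only, per the caveat above: a True result is no guarantee that encoding has actually happened):

import re

def looks_encoded(uri: str) -> bool:
    # Only RFC 3986 characters, and every '%' starts a valid %XX escape.
    return re.fullmatch(
        r"(?:[A-Za-z0-9\-._~:/?#\[\]@!$&'()*+,;=]|%[0-9A-Fa-f]{2})*", uri
    ) is not None

print(looks_encoded("a%20b"))  # True
print(looks_encoded("a b"))    # False: unencoded space
print(looks_encoded("100%"))   # False: % not followed by two hex digits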
Related
I need a character to separate two or more URIs in one string. Later I will split the string to get each URI separately.
The problem is I'm not sure what character to pick. Is there a good character to choose that definitely can't be part of a URI itself? Or are ultimately pretty much all characters allowed in a URI?
I know certain characters are illegal in certain parts of the URI, but I'm talking about a URI as a whole, like this:
scheme://username:password@domain.tld/path/to/file.ext?key=value#blah
I'm thinking maybe space, although technically I suppose that could be part of the password, or would it be escaped as %20 in that case?
Any of the control characters should be good for this, such as TAB, FF and so on.
RFC3986 (a) controls the URI specification and Appendix A of that RFC states that the characters are limited to:
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
0123456789-._~:/?#[]@!$&'()*+,;=
(and the % encoding character, of course, for all other characters not listed above).
So, basically, any other character should be okay as a delimiter.
(a) This has actually been augmented by RFC6874 which has to do with changes to the IPv6 part of the URI, adding a zone identifier. Since the zone ID consists of % and "unreserved" characters already included above, it doesn't change the set of characters allowed.
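As a quick Python illustration, using TAB as the delimiter (any other control character works the same way, since none of them can appear in a valid URI):

# Pack several URIs into one string and split them back apart.
uris = [
    "http://example.com/path?key=value#frag",
    "https://user:pass@example.org/other,path",
]
packed = "\t".join(uris)
assert packed.split("\t") == uris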
I have read the question Why do you need to encode URLs
but I am still confused:
Why doesn't the W3C just allow more characters to exist in URLs, so that encoding could be avoided?
Why does decoding exist?
The URL representation of characters may differ from the characters you have in your code. In other words, there is a specific grammar that defines how URLs are assembled. Special characters that are used in forming a URL need to be encoded so that they do not cause unexpected results.
Now to answer your questions more specifically:
They may already allow some of the characters you are thinking of, but these characters (&, ?, for example) are given special meaning to function in a certain way. Therefore, they cannot be used in a different context. From the link to the question you posted, it also looks like in the example of the space character, it is not supported because of the problems it would introduce in its use.
Decode is useful for decoding the URL to get the string representation of the URL before it was encoded for manipulation/other functions in the application.
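For instance, with Python's standard library (a small illustration; quote() and unquote() are the generic encode/decode pair):

from urllib.parse import quote, unquote

original = "a value with spaces & ampersands"
encoded = quote(original, safe="")  # safe="" encodes "/" as well
print(encoded)           # a%20value%20with%20spaces%20%26%20ampersands
print(unquote(encoded))  # a value with spaces & ampersands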
Is anyone aware of any problems with using commas in SEO-friendly URLs? I'm working with some software that uses a lot of commas in its SEO-friendly URLs, but I am 100% certain I have seen some instances where some programs/platforms don't recognize the URL correctly and cut the "linking" of the URL off after the first comma.
I just tested this out with Thunderbird, Gmail, Hotmail and on an SMF forum with no problems; however, I know I have seen the issue before.
So my question is: is there anything in particular that would cause some platforms to stop linking URLs with a comma? Such as a certain character after the comma?
There will be countless implementations that cut off the automatic linking at that point, as with many other characters. But that's not a problem caused by using these characters; it's a problem of a wrong/incomplete implementation.
See for example this very site, Stack Overflow. It will cut off the link at the * when manually entering/pasting this URL (see bug; in case it gets fixed, here’s a screenshot of it):
http://wayback.archive.org/web/*/http://www.example.com/
But when using the hyperlink syntax, it works fine:
http://wayback.archive.org/web/*/http://www.example.com/
The * character is allowed in an HTTP URL path, so the link detection should have recognized the first URL instead of breaking it at the occurrence of *.
Regarding the comma:
The comma is a reserved character and its meaning is relevant for the URL path (bold emphasis mine):
Aside from dot-segments in hierarchical paths, a path segment is considered opaque by the generic syntax. URI producing applications often use the reserved characters allowed in a segment to delimit scheme-specific or dereference-handler-specific subcomponents. For example, the semicolon (";") and equals ("=") reserved characters are often used to delimit parameters and parameter values applicable to that segment. The comma (",") reserved character is often used for similar purposes. For example, one URI producer might use a segment such as "name;v=1.1" to indicate a reference to version 1.1 of "name", whereas another might use a segment such as "name,1.1" to indicate the same.
So, if you don't intend to use the comma for the function it has as a reserved character, you may want to percent-encode it as %2C. Users copying such a URL from their browser's address bar would paste it in the encoded form, so it should work almost everywhere.
However, especially because it’s a reserved character, the unencoded form should work, too.
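For example, in Python (the path here is made up; quote() leaves "/" unescaped by default but does encode the comma):

from urllib.parse import quote

path = "/reports/name,1.1/summary"
print(quote(path, safe="/"))  # /reports/name%2C1.1/summary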
I came across the following URL today:
http://www.sfgate.com/cgi-bin/blogs/inmarin/detail??blogid=122&entry_id=64497
Notice the doubled question mark at the beginning of the query string:
??blogid=122&entry_id=64497
My browser didn't seem to have any trouble with it, and running a quick bookmarklet:
javascript:alert(document.location.search);
just gave me the query string shown above.
Is this a valid URL? The reason I'm being so pedantic (assuming that I am) is because I need to parse URLs like this for query parameters, and supporting doubled question marks would require some changes to my code. Obviously if they're in the wild, I'll need to support them; I'm mainly curious if it's my fault for not adhering to URL standards exactly, or if it's in fact a non-standard URL.
Yes, it is valid. Only the first ? in a URL has significance; any that appear after it are treated as literal question marks:
The query component is indicated by the first question mark ("?") character and terminated by a number sign ("#") character or by the end of the URI.
...
The characters slash ("/") and question mark ("?") may represent data within the query component. Beware that some older, erroneous implementations may not handle such data correctly when it is used as the base URI for relative references (Section 5.1), apparently because they fail to distinguish query data from path data when looking for hierarchical separators. However, as query components are often used to carry identifying information in the form of "key=value" pairs and one frequently used value is a reference to another URI, it is sometimes better for usability to avoid percent-encoding those characters.
https://www.rfc-editor.org/rfc/rfc3986#section-3.4
As a tangentially related answer, foo?spam=1?&eggs=3 gives the parameter spam the value "1?".
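Both behaviours are easy to confirm with a standard URL parser; a quick Python check using the URL from the question:

from urllib.parse import urlsplit, parse_qs

url = "http://www.sfgate.com/cgi-bin/blogs/inmarin/detail??blogid=122&entry_id=64497"
query = urlsplit(url).query        # everything after the *first* "?"
print(query)                       # ?blogid=122&entry_id=64497
print(parse_qs(query))             # {'?blogid': ['122'], 'entry_id': ['64497']}
print(parse_qs("spam=1?&eggs=3"))  # {'spam': ['1?'], 'eggs': ['3']}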
Why do you need to encode URLs? Is there a good reason why you have to change every space in the GET data to %20?
Because some characters have special meanings.
For instance, in a query string, the ampersand (&) is used as a separator between key-value pairs. If you were to put an ampersand into one of those values, it would look like the separator between the end of a value and the beginning of the next key. So for special characters like this, we use percent encoding so that we can be sure that the data is unambiguously encoded.
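A small Python sketch of that ambiguity (the fish & chips value is made up):

from urllib.parse import urlencode, parse_qs

params = {"q": "fish & chips", "page": "1"}
print(urlencode(params))  # q=fish+%26+chips&page=1 -- the "&" is escaped

# Left unencoded, the "&" in the value looks like a pair separator
# and the value gets truncated:
print(parse_qs("q=fish & chips&page=1"))  # {'q': ['fish '], 'page': ['1']}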
From RFC 2396, section 2.4.3:
The space character is excluded because significant spaces may disappear and insignificant spaces may be introduced when URI are transcribed or typeset or subjected to the treatment of word-processing programs. Whitespace is also used to delimit URI in many contexts.
Originally, older browsers could get confused by the spaces (not really an issue anymore).
Now, if someone copies the URL to send as a link, the space can break the hyperlink, e.g.:
Hey! Check out this derping cat playing a piano!
http://www.mysite.com/?video=funny cat plays piano.
See how the link breaks?
Now look at this:
http://www.mysite.com/?video=funny%20cat%20plays%20piano.
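That fixed link can be produced mechanically; for instance, in Python:

from urllib.parse import quote

video = "funny cat plays piano"
print("http://www.mysite.com/?video=" + quote(video))
# http://www.mysite.com/?video=funny%20cat%20plays%20piano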
Let's break down your question.
Why do you need to encode a URL?
A URL is composed of only a limited set of characters: digits (0-9), letters (A-Z, a-z), and a few special characters ("-", ".", "_", "~").
So does it mean that we cannot use any other character?
The answer to this question is "YES". But wait a minute, there is a hack, and the hack is URL Encoding or Percent Encoding. So if you want to transmit any character which is not a member of the above-mentioned set (digits, letters, and a few special chars), then we need to encode it. And that is why we need to encode "space" as "%20".
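A one-liner shows where "%20" comes from: it is just the space character's byte value (0x20) written in hex after a "%":

print(f"%{ord(' '):02X}")  # %20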
OK, so is this enough about URL encoding? No, there is a lot more to it, but I'm not going to turn this into a big, boring technical answer. If you want to know more, you can read about it here: https://www.urlencoder.io/learn/ (credit goes to this writer).
Well, you do so because every browser knows how the string that makes up the URL is encoded. Converting the space to %20, etc. makes that URL/URI portable. It could be Latin-1, it could be Unicode; it needs to be normalized to something that is understood universally. Take a look at RFC 3986: https://www.rfc-editor.org/rfc/rfc3986#section-2.1
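As a small Python illustration of that normalization (the non-ASCII input is just an example): quote() emits percent-escaped UTF-8 bytes, so the result is plain ASCII that survives any transport:

from urllib.parse import quote, unquote

print(quote("café", safe=""))  # caf%C3%A9
print(unquote("caf%C3%A9"))    # café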