I'm curious as to the best way to use the relatively new pushState feature for URLs.
From what I understand, a hash ("#") symbol is typically used:
http://www.somewebsite.com/page.html#someoperation
However, in browsers such as Safari, two "#" symbols cannot be used. This is an issue if you wish to store some data in the URL.
http://www.somewebsite.com/page.html#someoperation#somedata=data
...because it converts the second hash into "%23".
I also understand that certain characters are "reserved" (although I am unsure what this really means), and "#" is one of them.
The "#" delimits a fragment identifier, which is why browsers may refuse to accept that character twice. Read RFC 1738 and its successor RFC 3986 for the full list of unsafe and reserved characters; there are quite a lot of them.
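If you need to keep extra data in the fragment, one workaround is to keep a single "#" and percent-encode the data portion, so any literal "#" inside it becomes %23 instead of a second delimiter. A minimal sketch, reusing the names from your example and "/" as an arbitrary in-fragment separator:

const operation = "someoperation";
const data = "somedata=data"; // may itself contain "#" once encoded
// Keep one "#" as the fragment delimiter; escape everything inside the data part.
history.pushState(null, "", `page.html#${operation}/${encodeURIComponent(data)}`);
// Reading it back:
const [op, encoded] = window.location.hash.slice(1).split("/");
const decoded = decodeURIComponent(encoded ?? "");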
(Context: I'm writing an HTML sanitiser, and want to normalize URLs as a defence-in-depth measure, making it impossible to use abnormally escaped URLs to bypass downstream blacklists (I'm not relying on blacklists myself) or mislead users.)
When given a URL, in what contexts can a character be changed to its percent-encoded version, or vice versa, without changing the meaning of the URL?
What I've been able to conclude so far:
In the path portion of a URL, / is not equivalent to its escaped form %2F
The separator ? between the path and query string is not equivalent to its escaped form %3F (presumably the same rule also applies to the fragment separator #)
For the special cases of . and .. within a hierarchical path, . is equivalent to %2E according to the specification
Some characters, such as ^, are illegal in URLs, and thus must only appear in encoded form – the decoded form is not equivalent because it can't be used at all
I don't have an authoritative source for this, but all the software I've tested agrees that percent-encoded domain names are equivalent to the corresponding decoded versions (e.g. ex%61mple.com is equivalent to example.com in the host part of a URL) – this makes sense because %, /, and illegal-in-URL characters are all illegal in domain names anyway, so escaping could not possibly be of use
% cannot be equivalent to its encoded form %25, otherwise there would be no way to escape the escape character
application/x-www-form-urlencoded is a commonly (although not universally) used format for URL query strings, and in that format, =, +, & are not equivalent to %3D, %2B, %26 respectively; thus these equivalences cannot hold in URL query strings
However, I'm finding it unclear what the correct action to take with real-world URLs is in other cases, especially as real-life URL parsing libraries tend not to match the specification exactly. In particular:
Should I be percent-decoding characters in the path portion of a URL that are URL-safe (other than %/?#) but have been unexpectedly encoded anyway? The most common software behaviour that I've seen for URLs like http://example.com/ind%65x.html is to treat them as distinct URLs from http://example.com/index.html (e.g. they appear differently in logs and don't compare as equal), but to actually handle the two "distinct" URLs the same way. I don't know whether this is an implementation detail, or whether it's some sort of compatibility workaround.
Should I be decoding any characters in query strings? If so, which?
Should I be decoding any characters in fragments? If so, which?
There seem to be competing standards on this subject, and real-world application behaviour might not match any of them, so I'm interested in knowing how far I can go with URL normalization without breaking real-world use cases. (It would also be helpful to know in which situations escaped characters might be technically different in meaning from the non-escaped versions, but in which escaping them would have no legitimate uses – a sanitiser could have an option to reject URLs that escaped these characters as being likely to be malicious.)
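To make the question concrete, the most conservative normalization I can derive from the rules above is: decode a percent-escape only when the decoded octet is an RFC 3986 unreserved character, and uppercase the hex digits of every escape that stays. A rough sketch of the idea, not a complete sanitiser:

// Decode %XX only when the decoded octet is unreserved
// (ALPHA / DIGIT / "-" / "." / "_" / "~"); normalize the rest to uppercase hex.
const UNRESERVED = /^[A-Za-z0-9\-._~]$/;
function normalizeEscapes(component: string): string {
  return component.replace(/%([0-9A-Fa-f]{2})/g, (_m, hex: string) => {
    const ch = String.fromCharCode(parseInt(hex, 16));
    return UNRESERVED.test(ch) ? ch : `%${hex.toUpperCase()}`;
  });
}
normalizeEscapes("ind%65x.html"); // -> "index.html"
normalizeEscapes("a%2Fb");        // -> "a%2Fb" (the slash stays encoded)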
I hope this may provide some insight into your question:
We should only encode the individual components of the URL that may contain unsafe symbols (for example, query parameters and fragments), excluding the domain name. Note that the different components have different rules for which characters need to be encoded and which do not; see RFC 3986 [https://datatracker.ietf.org/doc/html/rfc3986].
In general, you can follow the rules below:
These unreserved characters need not be encoded: ALPHA (uppercase and lowercase) / decimal digits / "-" / "." / "_" / "~"
In application/x-www-form-urlencoded data, the space character is converted into a plus sign "+", so it does not need percent-encoding there.
All other characters (unsafe characters, and reserved characters when not used for their reserved purposes) should be encoded. Below is a list of such characters (there may be a few more):
! * ' ( ) ; : # & = + $ , / ? [ ] % { } | \ ^
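Most languages ship an encoder that applies these rules per component. In JavaScript/TypeScript, for example, encodeURIComponent escapes everything except A-Z a-z 0-9 - _ . ! ~ * ' ( ) — that is, all of the characters listed above except ! * ' ( ), which stricter encoders escape as well. A hypothetical query-string value:

const value = "rock & roll / 100% #1?";
const query = `q=${encodeURIComponent(value)}`;
// -> "q=rock%20%26%20roll%20%2F%20100%25%20%231%3F"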
Is anyone aware of any problems with using commas in SEO-friendly URLs? I'm working with some software that uses a lot of commas in its SEO-friendly URLs, but I am 100% certain I have seen instances where some programs/platforms don't recognize the URL correctly and cut the "linking" of the URL off after the first comma.
I just tested this out with Thunderbird, Gmail, Hotmail, and an SMF forum with no problems; however, I know I have seen the issue before.
So my question is: is there anything in particular that would cause some platforms to stop linking URLs with a comma? Such as a certain character after the comma?
There will be countless implementations that cut off the automatic linking at that point, as with many other characters too. But that's not a problem caused by using these characters; it's caused by a wrong/incomplete implementation.
See for example this very site, Stack Overflow. It will cut off the link at the * when manually entering/pasting this URL (see bug; in case it gets fixed, here’s a screenshot of it):
http://wayback.archive.org/web/*/http://www.example.com/
But when using the hyperlink syntax, it works fine:
http://wayback.archive.org/web/*/http://www.example.com/
The * character is allowed in an HTTP URL path, so the link detection should have recognized the first URL instead of breaking it at the occurrence of *.
Regarding the comma:
The comma is a reserved character and its meaning is relevant for the URL path:
Aside from dot-segments in hierarchical paths, a path segment is considered opaque by the generic syntax. URI producing applications often use the reserved characters allowed in a segment to delimit scheme-specific or dereference-handler-specific subcomponents. For example, the semicolon (";") and equals ("=") reserved characters are often used to delimit parameters and parameter values applicable to that segment. The comma (",") reserved character is often used for similar purposes. For example, one URI producer might use a segment such as "name;v=1.1" to indicate a reference to version 1.1 of "name", whereas another might use a segment such as "name,1.1" to indicate the same.
So, if you don't intend to use the comma for the function it has as a reserved character, you may want to percent-encode it as %2C. Users copying such a URL from their browser's address bar would paste it in the encoded form, so it should work almost everywhere.
However, especially because it’s a reserved character, the unencoded form should work, too.
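You can see the comma's reserved status in JavaScript's two stock encoders, which disagree on it for exactly this reason:

encodeURI("name,1.1");          // -> "name,1.1" ("," is reserved, left alone)
encodeURIComponent("name,1.1"); // -> "name%2C1.1" (comma treated as data)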
I have a long URL with several values.
Example 1:
http://www.domain.com/list?seach_type[]=0&search_period[]=1&search_min=3000&search_max=21000&search_area=6855%3B7470%3B7700%3B7730%3B7741%3B7742%3B7752%3B7755%3B7760%3B7770%3B7800%3B7840%3B7850%3B7860%3B7870%3B7884%3B7900%3B7950%3B7960%3B7970%3B7980%3B7990%3B8620%3B8643%3B8800%3B8830%3B8831%3B8832%3B8840%3B8850%3B8860%3B8881%3B9620%3B9631%3B9632
My variable search_area contains only 4-digit numbers (for example 4000, 5000), but can contain a lot of them. Right now I separate these in the URL by using ; as the separator symbol. Though, as seen in Example 1, the ; is converted into %3B. This makes me believe that this is a bad symbol to use.
What is the best URL separator?
Moontear, I think you have misread the linked document. That limitation only applies to the "scheme" portion of the URL. In the case of WWW URLs, that is the "http".
The next section of the document goes on to say:
Thus, only alphanumerics, the special characters "$-_.+!*'(),", and reserved characters used for their reserved purposes may be used unencoded within a URL.
I'd personally use comma (,). However, plus (+) and dash (-) are both reasonable options as well.
BTW, that document also mentions that semi-colon (;) is reserved in some schemes.
Well, according to RFC1738, valid URLs may only contain the letters a-z, the plus sign (+), period and hyphen (-).
Generally I would go with a plus to separate your search areas. So your URL would become http://www.domain.com/list?seach_type=0&search_period=1&search_min=3000&search_max=21000&search_area=6855+7470+7700+...
--EDIT--
As GinoA pointed out I misread the document. Hence "$-_.+!*'()," are valid characters too. I'd still go with the + sign though.
If there are only numbers to separate, you have a large choice of separators. You can choose any letter for example.
A space can probably be a good choice. It will be transformed into the + character in the URL, so it will be more readable than a letter.
Example: search_area=4000+5000+6000
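That is exactly what the standard form serializers produce; for example, with the parameter name from the question:

// URLSearchParams serializes with application/x-www-form-urlencoded rules,
// so the spaces between the area codes come out as "+".
new URLSearchParams({ search_area: "4000 5000 6000" }).toString();
// -> "search_area=4000+5000+6000"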
I'm very late to the party, but a valid query string can repeat variables so instead of...
http://x.y.z/list?type=0&period=1&min=3000&max=21000&area=6855+7470+7700
...you could also use...
http://x.y.z/list?type=0&period=1&min=3000&max=21000&area=6855&area=7470&area=7700
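Most query-string helpers read and write repeated variables directly; a small sketch with URLSearchParams (shortened values from the example):

const params = new URLSearchParams();
for (const area of ["6855", "7470", "7700"]) {
  params.append("area", area); // repeating the key is valid
}
params.toString();     // -> "area=6855&area=7470&area=7700"
params.getAll("area"); // -> ["6855", "7470", "7700"]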
"+" is to be interpreted as a space " " when the content-type is application/x-www-form-urlencoded (standard for HTML forms). This may be handled by your server software.
I prefer "!". It doesn't get URL encoded (at least not in Chrome) and it reserves "+" for use as a real space character in the typical case.
I came across an approach to encode just the following 4 characters in the POST parameter's value: # ; & +. What problems can it cause, if any?
Personally I dislike such hacks. The reason why I'm asking about this one is that I have an argument with its inventor.
Update: to clarify, this question is about encoding parameters in the POST body, not about escaping POST parameters on the server side, e.g. before feeding them into a shell, database, HTML page or whatever.
From RFC 1738 (if you're using application/x-www-form-urlencoded encoding to transfer data):
Unsafe:
Characters can be unsafe for a number of reasons. The space character is unsafe because significant spaces may disappear and insignificant spaces may be introduced when URLs are transcribed or typeset or subjected to the treatment of word-processing programs. The characters "<" and ">" are unsafe because they are used as the delimiters around URLs in free text; the quote mark (""") is used to delimit URLs in some systems. The character "#" is unsafe and should always be encoded because it is used in World Wide Web and in other systems to delimit a URL from a fragment/anchor identifier that might follow it. The character "%" is unsafe because it is used for encodings of other characters. Other characters are unsafe because gateways and other transport agents are known to sometimes modify such characters. These characters are "{", "}", "|", "\", "^", "~", "[", "]", and "`".
All unsafe characters must always be encoded within a URL. For example, the character "#" must be encoded within URLs even in systems that do not normally deal with fragment or anchor identifiers, so that if the URL is copied into another system that does use them, it will not be necessary to change the URL encoding.
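In other words, a real application/x-www-form-urlencoded serializer escapes the whole unsafe set at once, not a hand-picked subset. A quick comparison using a hypothetical value that mixes the four characters from the question with other unsafe ones:

// URLSearchParams applies the full form-encoding rules:
new URLSearchParams({ v: "50% off #1 {deal} & more+" }).toString();
// -> "v=50%25+off+%231+%7Bdeal%7D+%26+more%2B"
// Escaping only "#", ";", "&", "+" would have left "%", "{" and "}" alone.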
Escaping metacharacters is usually (always?) done to prevent injection attacks. Different systems have different metacharacters, so each needs its own way of preventing injections. Different systems have different ways of escaping characters. Some systems don't need to escape characters, since they have different channels for control and data (e.g. prepared statements). Additionally, the filtering is usually best performed when the data is introduced to a system.
The biggest problem is that escaping only those four characters won't provide complete protection. SQL, HTML and shell injection attacks are still possible after filtering the four characters you mention.
Consider this: $sql = "DELETE FROM articles WHERE id='" . $_POST['id'] . "'";
And you enter in the form: 1' OR '10
It then becomes this: $sql = "DELETE FROM articles WHERE id='1' OR '10'";
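For contrast, the prepared-statement route mentioned above keeps the data out of the SQL text entirely, so no amount of quoting in the input changes the statement. A sketch using the node-postgres driver (connection setup assumed to come from the environment):

import { Pool } from "pg"; // node-postgres, used here only as an example driver
const pool = new Pool();   // connection details assumed from environment variables

// The id travels in a separate channel from the SQL text, so input like
// "1' OR '10" is just a non-matching string value, not SQL.
async function deleteArticle(id: string): Promise<void> {
  await pool.query("DELETE FROM articles WHERE id = $1", [id]);
}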
Why do you need to encode URLs? Is there a good reason why you have to change every space in the GET data to %20?
Because some characters have special meanings.
For instance, in a query string, the ampersand (&) is used as a separator between key-value pairs. If you were to put an ampersand into one of those values, it would look like the separator between the end of a value and the beginning of the next key. So for special characters like this, we use percent encoding so that we can be sure that the data is unambiguously encoded.
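For example, with a hypothetical title parameter:

// Unencoded, the "&" inside the value looks like a pair separator:
//   ?title=fish & chips  ->  a "title" key plus a dangling " chips" key
const qs = `title=${encodeURIComponent("fish & chips")}`;
// -> "title=fish%20%26%20chips", unambiguous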
From RFC 2396, section 2.4.3:
The space character is excluded because significant spaces may disappear and insignificant spaces may be introduced when URI are transcribed or typeset or subjected to the treatment of word-processing programs. Whitespace is also used to delimit URI in many contexts.
Originally, older browsers could get confused by the spaces (not really an issue anymore).
Now, if someone copies the URL to send as a link, the space can break the hyperlink, i.e.:
Hey! Check out this derping cat playing a piano!
http://www.mysite.com/?video=funny cat plays piano.
See how the link breaks?
Now look at this:
http://www.mysite.com/?video=funny%20cat%20plays%20piano.
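Any URL encoder produces that second form for you; for example:

// Each space becomes %20, so autolinkers no longer stop at the first word.
const url = "http://www.mysite.com/?video=" + encodeURIComponent("funny cat plays piano");
// -> "http://www.mysite.com/?video=funny%20cat%20plays%20piano"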
Let's break down your question.
Why do you need to encode a URL?
A URL may be composed of only a limited set of characters: digits (0-9), letters (A-Z, a-z), and a few special characters ("-", ".", "_", "~").
So does it mean that we cannot use any other character?
The answer to this question is "yes". But wait a minute, there is a hack, and the hack is URL encoding, or percent encoding. So if you want to transmit any character which is not a member of the set mentioned above (digits, letters, and a few special characters), then you need to encode it. And that is why we need to encode a space as "%20".
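The "%20" is nothing magic: it is just a "%" followed by the character's value in hexadecimal (per UTF-8 byte for non-ASCII characters). You can reproduce it by hand:

// A percent-escape is "%" plus the octet's value in hex (ASCII case shown).
const esc = "%" + " ".charCodeAt(0).toString(16).toUpperCase().padStart(2, "0");
// -> "%20"
encodeURIComponent(" "); // -> "%20", the same thing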
OK, is this enough for URL encoding? No, there's a lot more to URL encoding, but I'm not going to turn this into a big, boring technical answer. If you want to know more, you can read about it here: https://www.urlencoder.io/learn/ (credit goes to this writer).
Well, you do so because every browser needs to know how the string that makes up the URL is encoded. Converting the space to %20, etc., makes the URL/URI portable. It could be Latin-1, it could be Unicode; it needs to be normalized to something that is understood universally. Take a look at RFC 3986: https://www.rfc-editor.org/rfc/rfc3986#section-2.1