I came across an approach to encode just the following 4 characters in the POST parameter's value: # ; & +. What problems can it cause, if any?
Personally I dislike such hacks. The reason why I'm asking about this one is that I have an argument with its inventor.
Update. To clarify, this question is about encoding parameters in the POST body and not about escaping POST parameters on the server side, e. g. before feeding them into shell, database, HTML page or whatever.
From rfc1738 (if you're using application/x-www-form-urlencoded encoding to transfer data):
Unsafe:
Characters can be unsafe for a number of reasons. The space character is unsafe because significant spaces may disappear and insignificant spaces may be introduced when URLs are transcribed or typeset or subjected to the treatment of word-processing programs. The characters "<" and ">" are unsafe because they are used as the delimiters around URLs in free text; the quote mark (""") is used to delimit URLs in some systems. The character "#" is unsafe and should always be encoded because it is used in World Wide Web and in other systems to delimit a URL from a fragment/anchor identifier that might follow it. The character "%" is unsafe because it is used for encodings of other characters. Other characters are unsafe because gateways and other transport agents are known to sometimes modify such characters. These characters are "{", "}", "|", "\", "^", "~", "[", "]", and "`".
All unsafe characters must always be encoded within a URL. For example, the character "#" must be encoded within URLs even in systems that do not normally deal with fragment or anchor identifiers, so that if the URL is copied into another system that does use them, it will not be necessary to change the URL encoding.
Escaping metacharacters is usually (always?) done to prevent injection attacks. Different systems have different metacharacters, so each needs its own way of preventing injections. Different systems have different ways of escaping characters. Some systems don't need to escape characters, since they have different channels for control and data (e.g. prepared statements). Additionally, the filtering is usually best performed when the data is introduced to a system.
The biggest problem is that escaping only those four characters won't provide complete protection. SQL, HTML and shell injection attacks are still possible after filtering the four characters you mention.
Consider this: $sql ='DELETE * fromarticlesWHEREid='.$_POST['id'].';
And you enter in the form: 1' OR '10
It then Becomes this : $sql ='DELETE * fromarticlesWHEREid='1' OR '10';
Related
(Context: I'm writing an HTML sanitiser, and want to normalize URLs as a defence-in-depth measure, making it impossible to use abnormally escaped URLs to bypass downstream blacklists (I'm not relying on blacklists myself) or mislead users.)
When given a URL, in what contexts can a character be changed to its percent-encoded version, or vice versa, without changing the meaning of the URL?
What I've been able to conclude so far:
In the path portion of a URL, / is not equivalent to its escaped form %2F
The separator ? between the path and query string is not equivalent to its escaped form %3F (presumably the same rule also applies to the fragment separator #)
For the special cases of . and .. within a hierarchical path, . is equivalent to %2E according to the specification
Some characters, such as ^, are illegal in URLs, and thus must only appear in encoded form – the decoded form is not equivalent because it can't be used at all
I don't have a second-hand source for this, but all the software I've tested agrees that percent-encoded domain names are equivalent to the corresponding decoded versions (e.g. ex%61mple.com is equivalent to example.com in the host part of a URL) – this makes sense because %, /, and illegal-in-URL characters are all illegal in domain names anyway, so escaping could not possibly be of use
% cannot be equivalent to its encoded form %25, otherwise there would be no way to escape the escape character
application/x-www-form-urlencoded is a commonly (although not universally) used format for URL query strings, and in that format, =, +, & are not equivalent to %3D, %2B, %26 respectively; thus these equivalences cannot hold in URL query strings
However, I'm finding it unclear what the correct action to take with real-world URLs is in other cases, especially as real-life URL parsing libraries tend not to match the specification exactly. In particular:
Should I be percent-decoding characters in the path portion of a URL that are URL-safe (other than %/?#) but have been unexpectedly encoded anyway? The most common software behaviour that I've seen for URLs like http://example.com/ind%65x.html is to treat them as distinct URLs from http://example.com/index.html (e.g. they appear differently in logs and don't compare as equal), but to actually handle the two "distinct" URLs the same way. I don't know whether this is an implementation detail, or whether it's some sort of compatibility workaround.
Should I be decoding any characters in query strings? If so, which?
Should I be decoding any characters in fragments? If so, which?
There seem to be competing standards on this subject, and real-world application behaviour might not match any of them, so I'm interested in knowing how far I can go with URL normalization without breaking real-world use cases. (It would also be helpful to know in which situations escaped characters might be technically different in meaning from the non-escaped versions, but in which escaping them would have no legitimate uses – a sanitiser could have an option to reject URLs that escaped these characters as being likely to be malicious.)
I hope this may provide some insight to your question:
We should only encode the individual components of the ur (example query parameters and fragments), excluding the domain name, that may contain unsafe symbols. Please note, the different components have different rules of what characters need to be encoded and which ones do not. Please read here [https://datatracker.ietf.org/doc/html/rfc3986].
In general, you may follow below:
These unreserved Characters Need not be encoded: ALPHA (uppercase and lowercase) / Decimal Digits / "-" / "." / "_" / "~"
The space character is converted into a plus sign "+" and should not trigger encoding.
All other characters (unsafe, reserved characters if not used for their reserved purposes) should be encoded. Below is a list of such characters (it may include a few more):
! * ' ( ) ; : # & = + $ , / ? # [ ] % { } | \ ^
I need a character to separate two or more URIs in one string. Later I will the split the string to get each URI separately.
The problem is I'm not sure what character to pick here. Is there a good character to choose here that definitely can't be part of a URI itself? Or is ultimately pretty much all characters allowed in a URI?
I know certain characters are illegal in certain parts of the URI, but I'm talking about a URI as a whole, like this:
scheme://username:password#domain.tld/path/to/file.ext?key=value#blah
I'm thinking maybe space, although technically I suppose that could be part of the password, or would it be escaped as %20 in that case?
Any of the control characters should be good for this, such as TAB, FF and so on.
RFC3986 (a) controls the URI specification and Appendix A of that RFC states that the characters are limited to:
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
0123456789-._~:/?#[]#!$&'()*+,;=
(and the % encoding character, of course, for all other characters not listed above).
So, basically, any other character should be okay as a delimiter.
(a) This has actually been augmented by RFC6874 which has to do with changes to the IPv6 part of the URI, adding a zone identifier. Since the zone ID consists of % and "unreserved" characters already included above, it doesn't change the set of characters allowed.
I'm curious as to the best way to use the relatively new pushState feature for URL's.
From what I understand, a hex "#" symbol is typically used:
http://www.somewebsite.com/page.html#someoperation
However, in browsers such as Safari, two "#" symbols cannot be used. This is an issue if you wish to store some data in the URL.
http://www.somewebsite.com/page.html#someoperation#somedata=data
...Because it converts the second hex to a "%23".
I also understand that certain characters are "reserved" although I am unsure what this really means, and the "#" is one of them.
The # delimits a fragment identifier which is why browsers may refuse to accept that character twice. Read RFC1738 and its successor RFC3986 for the full list of unsafe and reserved characters, there's quite a lot of them.
Clearly % needs to be encoded. The wikipedia article on the standard says:
Because the percent ("%") character serves as the indicator for
percent-encoded octets, it must be percent-encoded as "%25" for that
octet to be used as data within a URI.
Why isn't it also listed as a reserved character? Clearly it is reserved to signify something special in the context of a URI...
The "reserved" characters are intended to be available as delimiters between different parts of a URI. The percent-sign isn't used for that — can't be used for that — because of its use in percent encoding.
It may help clarify matters to point out that there's a separate list of "unreserved" characters, and the percent-sign is not one of those, either:
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
(from http://www.ietf.org/rfc/rfc3986.txt, bottom of page 12). In other words, in the context of URIs, "reserved" has a more specific meaning than one might expect. :-)
The reserved characters are ones that have some special meaning in a URI and therefore need to be escaped in some way if they are used for something other than their special purpose.
The percent character does not have a special meaning in a URI -- which is what makes it a good choice for an escape/encoding character.
The fact that it is being used to do encoding is the only reason that percent itself needs to be escaped, by percent-encoding it.
This is similar to character escaping, where backslash \ has to itself be escaped \\ only because it was the character chosen to do the initial escaping as in \t or \n
The percent sign is already reserved through its involvement in the grammar rule pct-encoded. Also, this paragraph seems enlightening on the subject:
A URI is composed from a limited set of characters consisting of digits, letters, and a few graphic symbols. A reserved subset of those characters may be used to delimit syntax components within a URI while the remaining characters, including both the unreserved set and those reserved characters not acting as delimiters, define each component's identifying data.
This suggests that the percent symbol itself is indeed reserved for percent encoding (as it does not delimit syntax components within a URI). Your original interpretation is correct, I think it's just a matter of semantics.
In designing of a (mini)language:
When there are certain characters that should be escaped to lose special meanings (like quotes in some programming languages), what should be done, especially from a security perspective, when characters that are not escapable (e.g. normal characters which never have special meaning) are escaped? Should an error be "error"ed, or should the character be discarded, or should it be in the output the same as if it was not escaped?
Example:
In a simple language where strings are delimited by double-quotes("), and any quotes in a given string are escaped with a back-slash(\): for input "We \said, \"We want Moshiach Now\"" -- what would should be done with the letter s in said which is escaped?
I prefer the lexer to whine when this occurs. A lexer/parser should be tight about syntax; one can always loosen it up later. If you are sloppy, you'll find you can't retract a decision you didn't think you made.
Assume that you initially decide to treat " backslash not-an-escape " as that pair of characters, and the "T" is
not-an-escape today. Sometime later you decide to extend the language, and want "\T" to mean something special, and you change your language.
You'll find an angry mob of programmers storming your design castle,
because for them, "\T" means "\" "T" (or "T" depending on your default decision),
and you just broke their code. You hang your head in shame, retract the decision,
and then realize... oops, there are no more available escape characters!
This lesson goes for any piece of syntax that isn't well defined in your language. If it isn't explicitly legal, it should be implicitly illegal and your compiler should check it. Or you'll never be able to extend your successful language.
If your language isn't going to be successful, you may not care as much.
Well, one way to solve the problem is for the backslash to just mean backslash when it precedes a non-escapable character. That's what Python does:
>>> print "a\tb"
a b
>>> print "a\tb\Rc"
a b\Rc
Obviously, most systems take the escape character to mean "take the next character verbatim", so escaping a "non-escapable" character is usually harmless. The problem later happens when you get to comparisons and such, where the literal text does not represent the actual value (that's where you see a lot of issues securitywise, especially with things like URLs).
So on the one hand, you can only accept a limited number of escaped characters. In that sense, you have an "escape sequence", rather than an escaped character (the \x is the entire sequence rather than a \ followed by an x). That's like the most safe mechanism, and it's not really burdensome to write.
The other option is to ensure that you you "canonicalizing" everything you compare, through some ruleset. This typically means removing all of the escape sequences properly up front, before comparison and comparing only the final values rather than the literals.
Most systems interpret the slash as Will Hartung says, except for alphanumerics which are variously used as aliases for control codes, character classes, word boundaries, the start of hex sequences, case region markers, hex or octal digits, etc. \s in particular often means white-space in perl5 style regexs. JavaScript, which interprets it as 's' in one context and as whitespace in another suffers from subtle bugs because of this choice. Consider /foo\sbar/ vs new RegExp('foo\sbar').