I have some data where citation signs (i.e. ") are encoded as &quot;. I want to reverse this encoding. What encoding scheme does this? It doesn't seem to be the classic HTML URL encoding (https://www.w3schools.com/tags/ref_urlencode.asp).
Related
I have noticed that Google does not encode all special characters in the query part of the URL. For example:
Placing this string in Google's search: !@#$%^&*()
Yields this URL: https://www.google.com/#q=!%40%23%24%25^%26*()
Notice that the !, ^, *, (, and ) are not encoded.
Some of the characters such as : or < are considered unsafe or reserved, yet Google doesn't encode them.
Can someone explain why Google does this, and if they have a reference document as to exactly what characters get encoded and which don't?
Thanks for any help!
As documented here:
Some characters are not safe to use in a URL without first being encoded. Because a Google search request is made by using an HTTP URL, the search request must follow URL conventions, including character encoding, where necessary.
The HTTP URL syntax defines that only alphanumeric characters, the special characters $-_.+!*'(), and the reserved characters ;/?:#=& can be used as values within an HTTP URL request. Since reserved characters are used by the search engine to decode the URL, and some special characters are used to request search features, then all non-alphanumeric characters used as a value to an input parameter must be URL-encoded.
To URL-encode a string:
1. Replace space characters with a "+" character.
2. Replace each non-alphanumeric character by its hexadecimal ASCII value, in the format of a "%" character followed by two hexadecimal digits. (Such an ASCII value may be referred to as an escape code.)
Some input parameters require that the values passed to Google search are double-URL-encoded. This requirement means that you must apply the URL encoding to the string twice in succession to generate the final value.
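A quick way to see these rules in action is Python's standard urllib.parse module. This is just my own sketch; the query string is an invented example, not anything Google-specific:

    import urllib.parse

    query = 'price < 100 & "sale"'   # example value with spaces and reserved characters

    # Single URL encoding: spaces become "+", other non-alphanumerics become %XX.
    once = urllib.parse.quote_plus(query)
    print(once)    # price+%3C+100+%26+%22sale%22

    # Double URL encoding: the "%" signs produced by the first pass are themselves
    # encoded as %25, which is what "apply the URL encoding twice" means.
    twice = urllib.parse.quote_plus(once)
    print(twice)   # price%2B%253C%2B100%2B%2526%2B%2522sale%2522

    # The "special characters" from the quoted documentation ($-_.+!*'()) can be
    # kept literal by passing them as the safe set.
    print(urllib.parse.quote("!@#$%^&*()", safe="$-_.+!*'()"))   # !%40%23$%25%5E%26*()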
I'm importing an RSS feed from Tumblr into a Kynetx app. It appears that the RSS feed has some encoding issues, as apostrophes appear like this:
The feed (which you can find here) claims to be encoded in UTF-8.
Is there a way to specify the encoding or else replace those characters with regular apostrophes?
While not optimal, you could try to catch these encodings and replace them with the UTF-8 standard:
newstring = oldstring.replace(re/’/, "'");
This appears to be a case of a service that specifies UTF-8 but doesn't explicitly enforce it. I uploaded an image of the RSS feed that you provided. For comparison, I cut and pasted the text into a notepad document and then typed in the same text from my keyboard.
I don't know if you can tell from the image, but the apostrophe that is mangled is different from the apostrophe that is generated by my UTF-8 browser.
I suspect that this post was submitted via a Windows client. If you look at your encoding options, you will see an option for Western (Windows-1252).
Windows-1252 is a legacy encoding from Windows that resembles ISO 8859-1, but substitutes some of its own characters for control characters in the ANSI standard and changes the codepage location of others.
A couple of quotes from the Wikipedia page that I cite above:
It is very common to mislabel Windows-1252 text data with the charset label ISO-8859-1. Many web browsers and e-mail clients treat the MIME charset ISO-8859-1 as Windows-1252 characters in order to accommodate such mislabeling.
Many Microsoft programs, such as Word, will automatically substitute Windows-1252 characters when standard ASCII characters are entered, such as for "smart quotes" (e.g. substituting ’ for the apostrophe in a contraction) or substituting © for the three characters '(c)'.
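To make that concrete, here is a small Python sketch (my own illustration, not part of the feed or of KRL) that reproduces the mangled apostrophe by reading UTF-8 bytes as Windows-1252 and then repairs it by reversing the step:

    # U+2019 is the right single quotation mark that Word substitutes for
    # the plain apostrophe in a contraction.
    original = "it\u2019s"

    # Its UTF-8 bytes are E2 80 99; decoding them as Windows-1252 yields the
    # familiar three-character mojibake.
    mangled = original.encode("utf-8").decode("windows-1252")
    print(mangled)                 # itâ€™s

    # Reversing the mistake: re-encode as Windows-1252, then decode as UTF-8.
    repaired = mangled.encode("windows-1252").decode("utf-8")
    print(repaired == original)    # True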
KRL supports all of the language charsets supported by UTF-8, so it supports multi-byte international characters natively; however, that comes at the expense of the encoding fudging that is possible when you only have ISO-8859-1 or Windows-1252 to choose from.
I came across an approach to encode just the following 4 characters in the POST parameter's value: # ; & +. What problems can it cause, if any?
Personally I dislike such hacks. The reason why I'm asking about this one is that I have an argument with its inventor.
Update: To clarify, this question is about encoding parameters in the POST body and not about escaping POST parameters on the server side, e.g. before feeding them into a shell, database, HTML page or whatever.
From RFC 1738 (if you're using application/x-www-form-urlencoded encoding to transfer data):
Unsafe:
Characters can be unsafe for a number of reasons. The space character is unsafe because significant spaces may disappear and insignificant spaces may be introduced when URLs are transcribed or typeset or subjected to the treatment of word-processing programs. The characters "<" and ">" are unsafe because they are used as the delimiters around URLs in free text; the quote mark (""") is used to delimit URLs in some systems. The character "#" is unsafe and should always be encoded because it is used in World Wide Web and in other systems to delimit a URL from a fragment/anchor identifier that might follow it. The character "%" is unsafe because it is used for encodings of other characters. Other characters are unsafe because gateways and other transport agents are known to sometimes modify such characters. These characters are "{", "}", "|", "\", "^", "~", "[", "]", and "`".
All unsafe characters must always be encoded within a URL. For example, the character "#" must be encoded within URLs even in systems that do not normally deal with fragment or anchor identifiers, so that if the URL is copied into another system that does use them, it will not be necessary to change the URL encoding.
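For comparison, here is a minimal Python sketch (standard library only; the parameter name and value are invented) showing a fully encoded application/x-www-form-urlencoded body next to one where only #, ;, & and + are escaped:

    import urllib.parse

    params = {"comment": 'He said "50% off" & more; see #deals + <b>bold</b>'}

    # Full form-encoding: every unsafe or reserved character becomes %XX, spaces become "+".
    full = urllib.parse.urlencode(params)
    print(full)
    # comment=He+said+%2250%25+off%22+%26+more%3B+see+%23deals+%2B+%3Cb%3Ebold%3C%2Fb%3E

    # The partial scheme from the question: only '#', ';', '&' and '+' are replaced.
    partial_table = str.maketrans({"#": "%23", ";": "%3B", "&": "%26", "+": "%2B"})
    partial = "comment=" + params["comment"].translate(partial_table)
    print(partial)
    # comment=He said "50% off" %26 more%3B see %23deals %2B <b>bold</b>
    # Literal '%' signs, quotes and spaces pass through untouched, so the receiver
    # cannot tell a literal "%26" typed by the user from an escaped '&', and spaces
    # may be mangled in transit.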
Escaping metacharacters is usually (always?) done to prevent injection attacks. Different systems have different metacharacters, so each needs its own way of preventing injections. Different systems have different ways of escaping characters. Some systems don't need to escape characters, since they have different channels for control and data (e.g. prepared statements). Additionally, the filtering is usually best performed when the data is introduced to a system.
The biggest problem is that escaping only those four characters won't provide complete protection. SQL, HTML and shell injection attacks are still possible after filtering the four characters you mention.
Consider this: $sql = "DELETE FROM articles WHERE id='" . $_POST['id'] . "'";
And you enter in the form: 1' OR '10
It then becomes this: $sql = "DELETE FROM articles WHERE id='1' OR '10'";
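The usual fix is not to escape more characters but to keep data out of the control channel entirely. A minimal sketch with Python's built-in sqlite3 module (the table and values are made up for the example):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE articles (id TEXT, title TEXT)")
    conn.execute("INSERT INTO articles VALUES ('1', 'first'), ('2', 'second')")

    user_input = "1' OR '10"   # the malicious value from the example above

    # The "?" placeholder sends the value out of band, so the quote and the OR
    # never reach the SQL parser as syntax; nothing is deleted unless a row's id
    # is literally the string "1' OR '10".
    conn.execute("DELETE FROM articles WHERE id = ?", (user_input,))
    print(conn.execute("SELECT COUNT(*) FROM articles").fetchone()[0])   # 2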
I have an action like this:
<%=Html.ActionLink("My_link", "About", "Home", new RouteValueDictionary {
{ "id", "Österreich" } }, null)%>
This produces the following link: http://localhost:1855/Home/About/%C3%96sterreich
I want a link which looks like this - localhost:1855/Home/About/Österreich
I have tried.
Server.HtmlDecode("Österreich")
HttpUtility.UrlDecode("Österreich")
Neither seems to be helping. What else can I try to get my desired result?
I think this is an issue with your browser (IE).
Your code is correct as it is, no explicit UrlEncoding needed.
<%=Html.ActionLink("My_link", "About", "Home", new RouteValueDictionary {
{ "id", "Österreich" } }, null)%>
There is nothing wrong with ASP.NET MVC. See Unicode URLs on the web, e.g. http://he.wikipedia.org/wiki/%D7%9B%D7%A9%D7%A8%D7%95%D7%AA in IE and in a browser that handles Unicode in URLs correctly.
Chrome, for example, displays Unicode URLs without any problem; IE does not decode "special" Unicode characters in the address bar.
This is only a cosmetic issue.
According to RFC 1738, Uniform Resource Locators (URL), only US-ASCII is supported; all other characters must be encoded.
2.2. URL Character Encoding Issues
URLs are sequences of characters, i.e., letters, digits, and special characters. A URL may be represented in a variety of ways: e.g., ink on paper, or a sequence of octets in a coded character set. The interpretation of a URL depends only on the identity of the characters used.
In most URL schemes, the sequences of characters in different parts of a URL are used to represent sequences of octets used in Internet protocols. For example, in the ftp scheme, the host name, directory name and file names are such sequences of octets, represented by parts of the URL. Within those parts, an octet may be represented by the character which has that octet as its code within the US-ASCII coded character set.
In addition, octets may be encoded by a character triplet consisting of the character "%" followed by the two hexadecimal digits (from "0123456789ABCDEF") which form the hexadecimal value of the octet. (The characters "abcdef" may also be used in hexadecimal encodings.)
Octets must be encoded if they have no corresponding graphic character within the US-ASCII coded character set, if the use of the corresponding character is unsafe, or if the corresponding character is reserved for some other interpretation within the particular URL scheme.
No corresponding graphic US-ASCII:
URLs are written only with the graphic printable characters of the US-ASCII coded character set. The octets 80-FF hexadecimal are not used in US-ASCII, and the octets 00-1F and 7F hexadecimal represent control characters; these must be encoded.
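The %C3%96 in the generated link is exactly this rule at work: Ö is first turned into its UTF-8 octets (C3 96) and each octet is then percent-encoded. A short Python sketch of the round trip (urllib.parse is the standard library; the value is the one from the question):

    from urllib.parse import quote, unquote

    value = "Österreich"

    # "Ö" becomes the two UTF-8 octets C3 96, each written as a %XX triplet.
    encoded = quote(value)
    print(encoded)            # %C3%96sterreich

    # Browsers that handle Unicode in URLs simply decode this again for display.
    print(unquote(encoded))   # Österreich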
I think your desire for a non-URL-encoded URL is valid, but I don't think the tools actually make it easy to do this.
Would putting the generated link inside an <a>, with the link text being the non-encoded string, not be good enough? It would still look bad in the browser's URL field, but your UI would be a little prettier.
Also, in Firefox anyway, the URL shown in my status bar when I mouse over your "ugly" link shows the unencoded version, so it would probably look fine there as well.
What's the difference between a URL Encode and an HTML Encode?
HTML Encoding escapes special characters in strings used in HTML documents to prevent confusion with HTML elements like changing
"<hello>world</hello>"
to
"<hello>world</hello>"
URL Encoding does a similar thing for string values in a URL like changing
"hello+world = hello world"
to
"hello%2Bworld+%3D+hello+world"
urlEncode replaces special characters with characters that can be understood by web browsers/web servers for the purpose of addressing... hence URL. For instance, spaces are replaced with %20, ' with %27, and so on.
See these references:
http://www.blooberry.com/indexdot/html/topics/urlencoding.htm
http://www.degraeve.com/reference/urlencoding.php
HtmlEncode replaces special characters with character strings that are recognised by the HTML engine itself to render the content of the page - things like & becomes &amp;, < becomes &lt;, and > becomes &gt;. This prevents the HTML engine from interpreting these characters as part of the HTML markup and therefore renders them as if they were plain strings.
See this reference:
http://msdn.microsoft.com/en-us/library/ms525347.aspx
Both HTML and URLs are essentially very constrained languages. As a language they add meaning to specific keywords or operators. For both of these languages, though, keywords are almost always single characters. For example:
HTML: > and <
URL: / and :
In the use of each language though it is possible to use these constructs in a manner that does not ensure the meaning of the language. For instance this post contains a > character. I do not want it to be interpreted as HTML, just text.
This is where Encode and Decode methods come into play. These methods will respectively take a string and convert any of the characters that would otherwise be treated as keywords into an escaped form which will not be interpreted as part of the language.
For instance: Passing > into HtmlEncode will return &gt;
HTMLEncode and URLEncode deal with invalid characters in HTML and URLs, or more accurately, characters that need to be specially written to be interpreted correctly. For example, in HTML the < and > characters are used to indicate tags. Thus, if you wanted to write a math formula, something like 1+1 < 2+2, the '<' would normally be interpreted as the beginning of a tag. HTMLEncoding turns this character into "&lt;" which is the encoded representation of the less-than sign. URLEncoding does the same, but for URLs, for which the special characters are different, although there is some overlap.
I don't know what language you are working in, but the PHP manual for example provides good explanations.
URLEncode
Returns a string in which all non-alphanumeric characters except -_. have been replaced with a percent (%) sign followed by two hex digits and spaces encoded as plus (+) signs. It is encoded the same way that the posted data from a WWW form is encoded, that is the same way as in application/x-www-form-urlencoded media type. This differs from the RFC 1738 encoding (see rawurlencode()) in that for historical reasons, spaces are encoded as plus (+) signs.
Read on
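If you happen to be in Python rather than PHP, a similar split exists in its standard library: quote_plus() roughly corresponds to PHP's urlencode() (form-style, spaces become +) and quote() roughly corresponds to rawurlencode() (RFC-style, spaces become %20). A minimal sketch:

    from urllib.parse import quote, quote_plus

    s = "a brief note: 50% off!"

    print(quote_plus(s))   # form-style: a+brief+note%3A+50%25+off%21
    print(quote(s))        # RFC-style:  a%20brief%20note%3A%2050%25%20off%21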