There are classes and functions in many programming languages for encoding and decoding strings to make them URL-friendly. For example,
in Java:
URLEncoder.encode(String, String)
in PHP:
urlencode ( string $str )
and so on.
My question is: if I URL-encode a string in Java, can I expect URL decoders in other languages to decode it back to the same original string?
I'm creating a service that needs to encode a Base64 value in a query string, and I have no idea who will be consuming it.
Please consider that the query string seems to be the only option I have here. I can't use XML, JSON, or HTTP headers, since this value needs to be in a URL to be redirected to.
I looked around and there were some questions exactly like this, but none of them had a proper answer.
I'd appreciate any insight or solutions.
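To make the round trip concrete, here is a minimal Java sketch (the class name and the sample Base64 payload are made up for illustration); PHP's urldecode understands the same %xx and + escapes:

import java.net.URLDecoder;
import java.net.URLEncoder;

public class RoundTrip {
    public static void main(String[] args) throws Exception {
        String original = "dGVzdCB2YWx1ZQ=="; // Base64 of "test value"; '=' needs escaping
        // Name the charset explicitly: a charset mismatch between encoder
        // and decoder is the usual cross-language pitfall.
        String encoded = URLEncoder.encode(original, "UTF-8");
        System.out.println(encoded); // dGVzdCB2YWx1ZQ%3D%3D
        System.out.println(original.equals(URLDecoder.decode(encoded, "UTF-8"))); // true
    }
}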
EDIT:
For example, the PHP manual has this description:
Returns a string in which all non-alphanumeric characters except -_. have been replaced with a percent (%) sign followed by two hex digits and spaces encoded as plus (+) signs. It is encoded the same way that the posted data from a WWW form is encoded, that is the same way as in application/x-www-form-urlencoded media type. This differs from the » RFC 3986 encoding (see rawurlencode()) in that for historical reasons, spaces are encoded as plus (+) signs.
That sounds like it does not follow the RFC.
It seems URL encoders can use different schemes in different programming languages,
so one should look up the encoding scheme used by each function. For example, one of them could be
application/x-www-form-urlencoded
Looking into Java's URLEncoder:
Translates a string into application/x-www-form-urlencoded format using a specific encoding scheme. This method uses the supplied encoding scheme to obtain the bytes for unsafe characters.
And looking into PHP's urlencode documentation:
that is the same way as in application/x-www-form-urlencoded media type
So if you are looking for cross-platform URL encoding, you should tell your users which format your encoder produces.
That way, they can find the appropriate decoder, or otherwise implement their own.
After some investigation, it seems application/x-www-form-urlencoded is the most popular format.
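For instance, Java's URLEncoder produces the form-urlencoded flavor (space becomes +); if a consumer expects RFC 3986 style instead, a common workaround is to post-process the + signs. A small sketch of the difference:

import java.net.URLEncoder;

public class FormVsRfc3986 {
    public static void main(String[] args) throws Exception {
        String value = "a b+c";
        String formEncoded = URLEncoder.encode(value, "UTF-8");
        System.out.println(formEncoded); // a+b%2Bc (form style: space -> +)
        // RFC 3986 style wants %20 for spaces; any literal '+' was already
        // escaped to %2B above, so this replacement is safe.
        System.out.println(formEncoded.replace("+", "%20")); // a%20b%2Bc
    }
}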
Related
I'm analyzing the character sets used in MIME to combine multiple character sets.
For that, I wrote a sample email:
This is sample test email 精巣日本 dsdsadsadsads
which automatically gets converted into:
This is sample test email &#31934;&#24035;&#26085;&#26412; dsdsadsadsads
I want to know which character set encoding is used to encode these characters.
Is it possible to use that character set encoding in C?
Email client: Postfix webmail
The purpose of MIME is to allow for support of arbitrary content types and encodings. As long as the content is adequately tagged in the MIME headers, you can use any encoding you see fit. There is no single right encoding for your use case; though in this day and age, the simplest solution by far is to use Unicode for everything.
In MIME terms, you'd use something like Content-Type: text/plain; charset="utf-8" and then correspondingly encode the body text. If you need the email to be 7-bit safe, you might use a quoted-printable or base64 content-transfer-encoding on top, but any modern MIME library should take care of this detail for you.
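For example (a sketch assuming the JavaMail / javax.mail library and a placeholder sender address), setting the charset explicitly lets the library generate the headers and transfer encoding for you:

import java.util.Properties;
import javax.mail.Session;
import javax.mail.internet.InternetAddress;
import javax.mail.internet.MimeMessage;

public class Utf8Mail {
    public static void main(String[] args) throws Exception {
        Session session = Session.getInstance(new Properties());
        MimeMessage message = new MimeMessage(session);
        message.setFrom(new InternetAddress("sender@example.com")); // placeholder address
        message.setSubject("Sample test email", "UTF-8");
        // Produces Content-Type: text/plain; charset=UTF-8 and lets the
        // library choose a suitable content-transfer-encoding (e.g. base64).
        message.setText("This is sample test email 精巣日本 dsdsadsadsads", "UTF-8");
        message.saveChanges();       // finalize the MIME headers
        message.writeTo(System.out); // dump the raw RFC 822 message
    }
}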
The HTML entities you observed in your experiment are not suitable for plain-text emails, though they are a viable alternative for pure-HTML email. (If your webmail client used them in plaintext emails, it is buggy; it will only work if the sender and recipient both have the same bug.)
Traditionally, Japanese email messages would use one of the legacy Japanese encodings, like Shift_JIS or ISO-2022-JP. These have reasonable support for English, but generalize poorly to properly multilingual text (though ISO-2022 does somehow support it). With Unicode, by contrast, mixing Japanese with e.g. Farsi, Uzbek, and Turkish is straightforward and undramatic.
Using UTF-8 from C is easy and basically transparent. See e.g. http://utf8everywhere.org/ for some starting points.
When someone types an url in a browser to access a page, which charset is used for that URL? Is there a standard? Can I consider that UTF-8 is used everywhere? Which characters are accepted?
URLs may contain only a subset of ASCII; every valid URL is pure ASCII.
International domain names must be Punycode encoded. Non-ASCII characters in the path or query parts must be encoded, with Percent-encoding being the generally agreed-upon standard.
Percent-encoding only takes the raw bytes and encodes each byte as %xx. There's no generally followed standard on what encoding should be used to determine a byte representation. As such, it's basically impossible to assume any particular character set being used in the percent-encoded representation. If you're creating those links, then you're in full control over the used charset before percent-encoding; if you're not, you're mostly out of luck. Though you will most likely encounter UTF-8, this is not guaranteed.
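A small Java sketch of both rules (the .example hostname is a placeholder); note how the percent-encoded bytes change with the charset:

import java.net.IDN;
import java.net.URLEncoder;

public class UrlCharsets {
    public static void main(String[] args) throws Exception {
        // Hostnames use Punycode, not percent-encoding.
        System.out.println(IDN.toASCII("münchen.example")); // xn--mnchen-3ya.example
        // Path/query parts use percent-encoding of raw bytes, so the
        // result depends on the charset chosen before encoding.
        System.out.println(URLEncoder.encode("Գլխավոր", "UTF-8"));
        // -> %D4%B3%D5%AC%D5%AD%D5%A1%D5%BE%D5%B8%D6%80
        System.out.println(URLEncoder.encode("Գլխավոր", "UTF-16BE"));
        // -> %05%33%05%6C... (same text, different bytes)
    }
}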
I have read the question Why do you need to encode URLs
but I am still confused:
Why doesn't the W3C just allow more characters in URLs, so that encoding could be avoided?
Why does decoding exist?
The URL representation of characters may differ from the characters you have in your code. In other words, there is a specific grammar that defines how URLs are assembled. Special characters that are used in forming a URL need to be encoded so that they do not cause unexpected results.
Now to answer your questions more specifically:
They may already allow some of the characters you are thinking of, but these characters (&, ?, for example) are given special meaning to function in a certain way. Therefore, they cannot be used in a different context. From the link to the question you posted, it also looks like in the example of the space character, it is not supported because of the problems it would introduce in its use.
Decode is useful for decoding the URL to get the string representation of the URL before it was encoded for manipulation/other functions in the application.
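A short Java illustration of both points, using a value that contains the reserved character &:

import java.net.URLDecoder;
import java.net.URLEncoder;

public class ReservedChars {
    public static void main(String[] args) throws Exception {
        // '&' and '=' structure the query string, so a literal '&' inside
        // a value must be encoded or it would split the key/value pair.
        String value = "fish & chips";
        String encoded = URLEncoder.encode(value, "UTF-8");
        System.out.println("query=" + encoded); // query=fish+%26+chips
        // Decoding recovers the original string for use in the application.
        System.out.println(URLDecoder.decode(encoded, "UTF-8")); // fish & chips
    }
}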
What are the valid characters that can be used in a URL query variable?
I'm asking because I would like to create GUIDs of minimal string length by using the largest character set so long as they can be passed as a URL query variable (www.StackOverflow.com?query=guiddaf09834fasnv)
Edit
If you want to encode a UUID/GUID or any other information represented in a byte array into a URL-friendly string, you can use this method from the Apache Commons Codec library:
Base64.encodeBase64URLSafeString(byte[])
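For example (assuming commons-codec 1.4 or later on the classpath), a 36-character UUID shrinks to 22 URL-safe characters:

import java.nio.ByteBuffer;
import java.util.UUID;
import org.apache.commons.codec.binary.Base64;

public class CompactGuid {
    public static void main(String[] args) {
        UUID uuid = UUID.randomUUID();
        // Pack the 128-bit UUID into its 16 raw bytes.
        ByteBuffer buf = ByteBuffer.allocate(16);
        buf.putLong(uuid.getMostSignificantBits());
        buf.putLong(uuid.getLeastSignificantBits());
        // URL-safe Base64 uses '-' and '_' instead of '+' and '/' and
        // omits the '=' padding, so no further escaping is needed.
        System.out.println(Base64.encodeBase64URLSafeString(buf.array()));
    }
}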
When in doubt, just go to the RFC.
Note: a query variable is not dealt with any differently than the rest of the URL.
From the section "2.2. URL Character Encoding Issues"
... only alphanumerics, the special characters "$-_.+!*'(),", and reserved characters used for their reserved purposes may be used unencoded within a URL.
I have lots of UTF-8 content that I want inserted into the URL for SEO purposes. For example, post tags that I want to include in the URI (site.com/tags/id/TAG-NAME). However, only ASCII characters are allowed by the standards.
Characters that are allowed in a URI but do not have a reserved purpose are called unreserved. These include uppercase and lowercase letters, decimal digits, hyphen, period, underscore, and tilde.
The solution seems to be to:
1. Convert the character string into a sequence of bytes using the UTF-8 encoding.
2. Convert each byte that is not an ASCII letter or digit to %HH, where HH is the hexadecimal value of the byte (sketched below).
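A minimal Java sketch of exactly these two steps, keeping the unreserved set quoted above (letters, digits, and -._~):

import java.nio.charset.StandardCharsets;

public class PercentEncode {
    static String encode(String s) {
        StringBuilder out = new StringBuilder();
        // Step 1: get the UTF-8 bytes of the string.
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            int c = b & 0xFF;
            // Step 2: pass unreserved bytes through, escape everything else.
            if ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
                    || (c >= '0' && c <= '9') || "-._~".indexOf(c) >= 0) {
                out.append((char) c);
            } else {
                out.append(String.format("%%%02X", c));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("tag-ünïcode")); // tag-%C3%BCn%C3%AFcode
    }
}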
However, that converts the legible (and SEO-valuable) words into mumbo-jumbo. So I'm wondering: is Google still smart enough to handle searches on URLs that contain encoded data, or should I attempt to convert those non-English characters into their semi-ASCII counterparts (which might help with Latin-based languages)?
Firstly, search engines really don't care about the URLs; URLs help visitors, visitors link to sites, and search engines care about that. URLs are easy to spam; if search engines cared about them, there would be an incentive to spam them, and no major search engine wants that. The allinurl: operator is merely a Google feature for advanced users, not something that gets factored into organic rankings. Any benefit you get from using a more natural URL will probably come as a fringe benefit of the PR from an inferior search engine indexing your site, and there is some evidence this can be negative with the advent of negative PR too.
From Google Webmaster Central
Does that mean I should avoid rewriting dynamic URLs at all?
That's our recommendation, unless your rewrites are limited to removing unnecessary parameters, or you are very diligent in removing all parameters that could cause problems. If you transform your dynamic URL to make it look static you should be aware that we might not be able to interpret the information correctly in all cases. If you want to serve a static equivalent of your site, you might want to consider transforming the underlying content by serving a replacement which is truly static. One example would be to generate files for all the paths and make them accessible somewhere on your site. However, if you're using URL rewriting (rather than making a copy of the content) to produce static-looking URLs from a dynamic site, you could be doing harm rather than good. Feel free to serve us your standard dynamic URL and we will automatically find the parameters which are unnecessary.
I personally don't believe it matters much, beyond getting a little more click-through and helping users out. As far as Unicode goes, here is how it works: the request goes to the hex-encoded Unicode destination, but the rendering engine must know how to handle this if it wishes to decode it back into something visually appealing. Google will render (i.e., decode) percent-encoded Unicode URLs properly.
Some browsers make this slightly more complex by always encoding the hostname portion, because of phishing attacks using ideographs that look the same.
I wanted to show you an example of this; here is a request to http://hy.wikipedia.org/wiki/Գլխավոր_Էջ issued by wget:
Hypertext Transfer Protocol
GET /wiki/%D4%B3%D5%AC%D5%AD%D5%A1%D5%BE%D5%B8%D6%80_%D4%B7%D5%BB HTTP/1.0\r\n
Request Method: GET
Request URI: /wiki/%D4%B3%D5%AC%D5%AD%D5%A1%D5%BE%D5%B8%D6%80_%D4%B7%D5%BB
Request Version: HTTP/1.0
User-Agent: Wget/1.11.4\r\n
Accept: */*\r\n
Host: hy.wikipedia.org\r\n
Connection: Keep-Alive\r\n
\r\n
As you can see, wget, like every browser, will just URL-encode the destination for you and then continue the request to the URL-encoded destination. The URL-decoded form only exists as a visual convenience.
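In Java, java.net.URI performs the same step when you build a URL from its parts; a quick sketch reproducing the request line above:

import java.net.URI;

public class WikiUri {
    public static void main(String[] args) throws Exception {
        // The multi-argument URI constructor percent-encodes the path,
        // just as wget did before issuing the request.
        URI uri = new URI("http", "hy.wikipedia.org", "/wiki/Գլխավոր_Էջ", null);
        System.out.println(uri.toASCIIString());
        // http://hy.wikipedia.org/wiki/%D4%B3%D5%AC%D5%AD%D5%A1%D5%BE%D5%B8%D6%80_%D4%B7%D5%BB
    }
}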
Do you know what language everything will be in? Is it all Latin-based?
If so, then I would suggest building a sort of lookup table that converts UTF-8 to ASCII when possible (and non-colliding). Something like that would convert Ź into Z and so on, and when there is a collision or the character doesn't exist in your lookup table, it just uses %HH.
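A rough sketch of that idea in Java, using Unicode decomposition in place of a hand-built lookup table (the class and method names are made up; characters like ł that have no decomposition would still need the %HH fallback):

import java.text.Normalizer;

public class Translit {
    static String toAscii(String s) {
        // NFD splits accented letters into base letter + combining mark,
        // then the marks are stripped: Ź -> Z, ó -> o.
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "");
    }

    public static void main(String[] args) {
        System.out.println(toAscii("Źółw")); // prints "Zołw"; ł survives and needs the %HH fallback
    }
}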