Keep spaces in URLs without encoding them - url

As Stack Overflow seems to be unable to create links from URLs that have spaces in them, copy and paste this URL into your browser.
http://grooveshark.com/#!/search/song?q=we will rock you
It does not redirect you to ...song?q=we%20will%20rock%20you or anything like that.
The spaces just simply stay there. When I first saw this, it looked so foreign to me. How is this achieved?

I believe they use JavaScript to set the contents of the URL bar. You can use something like Live HTTP Headers to confirm that the browser definitely sends a request with %20-encoded spaces.

It’s a browser setting. The browser decodes the URL, to make it more readable for humans.
If you copy the URL from the browser’s address bar and paste it into a text document, you’ll see that the space characters are percent-encoded.
See How can I see how the browser percent-encoded my URL? (which is not visible on address bar)
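To see the encoded form without a browser, here is a minimal sketch using Python's standard urllib.parse module. The search text is the one from the question; everything else is just illustration.

```python
from urllib.parse import quote, unquote

# The search text from the question, with literal spaces.
query = "we will rock you"

# Percent-encode it the way a URL would carry it on the wire.
encoded = quote(query)
print(encoded)           # we%20will%20rock%20you

# Decoding gives back the human-readable form shown in the address bar.
print(unquote(encoded))  # we will rock you
```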

Related

Web address changes during cut & paste

I've just encountered something that I don't quite understand. I received a document (administrative memo from my employer) containing a web address. The address is not a clickable hyperlink, it is just text.
What is interesting is that when the address is copy & pasted into a web browser address bar, it causes the web browser to attempt to contact a different web address than the cut & pasted text contains. The address text initially appears to be pasted correctly into the address bar, until I hit enter -- then instantly the text changes to something else.
Please note that this is not a matter of simple web site redirection. I know this because if I manually type in the same address (instead of copy & pasting it from the original document), the "correct" address is loaded. It is only following the copy/paste/load process that text appears to be magically changing.
I have also noticed that if I copy & paste the address first into a Notepad text file, save the text file, close, re-open, and then copy/paste to the web browser, the "correct" site then loads. Of note, when I save, Notepad warns that there are characters in Unicode format which will be lost. So I assume that there is some hidden Unicode text that is being stripped out when I save as plain text.
But, in Notepad if I enable the "Show Unicode Control Characters" option, I see nothing. So what could be going on here?
To get really specific, the domain transforms like this: http://www.aaaaaaaaaa-usa.com/bbbbb/ddddddtools.html ==> www.xn--aaaaaaaaaausa-km6g.com. (The browser of course reports that it cannot find the IP address of the server)
For compatibility, domain names should be ASCII text, so there is a standard (IDN) that converts other characters to ASCII, marking each converted label with the prefix xn-- (two letters followed by two dashes).
Additionally, there have been phishing attacks that used letters from other alphabets that look like Latin letters to deceive users. So some browsers choose to display the ASCII name instead of the intended name. (The behavior varies from browser to browser, and usually applies only to selected look-alike characters.)
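If you want to check a pasted address yourself, a rough sketch in Python follows. The helper name non_ascii_chars and both sample strings are made up for illustration; the assumption is that the memo contained a non-ASCII look-alike such as U+2010 (HYPHEN) instead of the plain ASCII hyphen. The second part shows the xn-- form using Python's built-in idna codec (which implements the older IDNA 2003 rules) on a well-known example name.

```python
import unicodedata

def non_ascii_chars(text):
    """Return (index, character, Unicode name) for every non-ASCII character."""
    return [
        (i, ch, unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(text)
        if ord(ch) > 127
    ]

# Hypothetical pasted host name with U+2010 where an ASCII "-" was expected.
sample = "example\u2010usa.com"
print(non_ascii_chars(sample))

# Converting an internationalized name to its ASCII (xn--) form and back.
ascii_form = "b\u00fccher.example".encode("idna")
print(ascii_form)                 # b'xn--bcher-kva.example'
print(ascii_form.decode("idna"))  # bücher.example
```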

HTTrack gives 404 on unicode urls with german special characters

I've realized that HTTrack can't download files if URLs have special characters in them, like the German ß; it returns a 404 response.
The errors look like those in the screenshot:
Is there any setting in HTTrack to make it able to deal with such characters?
ps: I found a similar thread, but without an answer:
Httrack faulty when encountering japanese encoded URLS
HTTrack seems to be able to fetch files without errors from URLs with special characters only if you don't run a "real" domain crawl, but instead:
first create a URL list,
save it as ISO-8859-1,
then let HTTrack crawl this list.
If HTTrack explores URLs on its own, it runs into 404 errors on URLs with special characters; at least I wasn't able to get them without errors. Maybe somebody will provide a magic setting ;)
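The list-saving step can be scripted. This is only a sketch: the helper write_url_list, the file name, and the sample URL are all hypothetical, and it does nothing HTTrack-specific beyond writing one URL per line in ISO-8859-1. The last two lines merely illustrate that the same character maps to different bytes in the two encodings; they are not a diagnosis of why HTTrack gets a 404.

```python
import tempfile
from pathlib import Path
from urllib.parse import quote

def write_url_list(urls, path):
    """Write one URL per line, encoded as ISO-8859-1 (Latin-1)."""
    Path(path).write_text("\n".join(urls) + "\n", encoding="iso-8859-1")

# Hypothetical URL containing a German sharp s (U+00DF).
urls = ["http://example.com/stra\u00dfe.html"]

with tempfile.TemporaryDirectory() as tmp:
    list_file = Path(tmp) / "urls.txt"
    write_url_list(urls, list_file)
    print(list_file.read_bytes())  # the sharp s is the single byte 0xDF here

# The same character percent-encoded under each encoding.
print(quote("\u00df", encoding="utf-8"))       # %C3%9F
print(quote("\u00df", encoding="iso-8859-1"))  # %DF
```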

Browser Support for UTF8 Encoded Characters in URL's

If I navigate to the following URL with a special UTF8 encoded character I get different results in web browsers:
http://example.com/lörickè
Firefox 37 - Shows the correct URL as above.
Chrome 42 - Shows the correct URL as above.
Edge - Shows the correct URL as above.
IE 11 - Shows percent encoded URL http://example.com/l%c3%b6rick%c3%a8/
Where can I find a list of browsers and versions that support this feature, and are there any announcements of whether the new Microsoft Edge browser supports it?
This StackOverflow post highlights the above issue for those interested.
What is shown in browser address bars is not necessarily what is used internally.
If you enter http://example.com/lörickè in Firefox, it gets shown like that, but it actually gets percent-encoded and becomes http://example.com/l%C3%B6rick%C3%A8. This is for usability reasons (or, if IRIs are not supported, like in HTTP/1.1, for transforming an IRI into a URI), so users don’t necessarily have to enter the correct URL (with percent-encoding), and don’t get confused by seeing these cryptic parts.
You can easily check what really gets used by copy-pasting the URL from the address bar into a text document.
So the browsers from your example probably all use the same URI (i.e., percent-encoded), but most of them decided to display the un-encoded variant instead.
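The conversion described here can be approximated in a few lines of Python. This is a simplified sketch, not what any browser actually runs: the function name iri_to_uri is made up, it only percent-encodes the path and query as UTF-8, and it assumes the host name is already ASCII.

```python
from urllib.parse import quote, urlsplit, urlunsplit

def iri_to_uri(iri):
    """Percent-encode non-ASCII characters in the path and query (UTF-8).

    Assumes an ASCII host; "%" is kept as-is so existing escapes survive.
    """
    parts = urlsplit(iri)
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%")
    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))

print(iri_to_uri("http://example.com/l\u00f6rick\u00e8"))
# http://example.com/l%C3%B6rick%C3%A8
```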

Is '|' a recommended separator for semantic URLs?

After researching Google and SO, there seem to be conflicting opinions on this.
We have run into a problem with Google Chrome substituting the | separator with %7C, whereas Firefox and Safari do not.
Here's an example:
http://www.example.com/page1|sub-page2|sub-page-3
Are there any strict rules to follow when choosing a separator character for semantic URLs and are there any strong arguments against (or workarounds when) using |?
| is not a valid character in a URL. Modern browsers will silently encode it to %7C when sending, and may or may not display this change in the address bar. Similarly, servers will silently decode the character for you.
This would have been a problem in the last millennium, when browsers would crash just because you didn't specify http://, but today you can use whatever you want and the browser will take care of it. However, automatic link parsers such as Markdown may not accept http://example.com/test|fish as a valid URL. In this case it looks like it does, but try that on my forums and it will complain at you.
Internet Explorer and Chrome use URL encoding when displaying the URL in the address bar after a page request has been made; %7C is the safe way of displaying a pipe ('|'), so it's not a problem that Chrome is doing this.
As a cheeky fix to make all browsers behave the same way, why not use %7C as your separator from the get-go instead of a pipe? All browsers should then interpret it as a pipe behind the scenes, but display it as %7C in the address bar.
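On the server side the two forms are equivalent once decoded. A small sketch in Python, using the path segments from the question (the variable names are just for illustration):

```python
from urllib.parse import quote, unquote

# Segments from the example URL in the question.
segments = ["page1", "sub-page2", "sub-page-3"]

# Joining with a raw pipe, then percent-encoding, turns "|" into %7C.
raw_path = "|".join(segments)
encoded_path = quote(raw_path)
print(encoded_path)  # page1%7Csub-page2%7Csub-page-3

# Decoding first and splitting afterwards recovers the segments either way.
print(unquote(encoded_path).split("|"))
print(unquote(raw_path).split("|"))
```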

Internet Explorer does not display Chinese characters from the URL

I am working on a requirement to display (make readable) characters from the URL.
When I use Google Chrome, it displays the parameters in Chinese - even though they are encoded to UTF-8.
When I use Mozilla Firefox, it displays the parameters in Chinese - even though they are encoded to UTF-8.
When I use Internet Explorer, it displays the parameters encoded in UTF-8.
N.B. The URL is encoded to UTF-8; I know that because when I copy the URL from the three of them and paste it to Notepad++ the three of them display the following:
/%E6%89%93%E5%BC%80%E7%9B%AE%E5%BD%95/%E7%9B%B8%E6%9C%BA/%E6%95%B0%E7%A0%81%E7%9B%B8%E6%9C%BA/%E5%B0%8F%E5%9E%8B%E6%95%B0%E7%A0%81%E7%9B%B8%E6%9C%BA/PowerShot-A480/p/1934793
Could it be that Mozilla Firefox and Google Chrome guys have this improvement that can make an encoded String readable and perhaps the IE guys do not support that? Or, is there any way to activate that with IE?
By the way... Going to View >> Encoding >> Unicode (UTF-8) takes care of the text inside of the page but does not make any difference for the text in the URL.
Any help will be greatly appreciated!
I've written a blog post about Internet Explorer not displaying the decoded version of non-ASCII characters and using IRIs to solve the problem.
As of today, we have the following situation:
HTML5 supports IRIs, i.e. URIs with Unicode character support
HTTP does not support IRIs, but all major browsers take care of converting IRIs to valid (encoded) URIs to retrieve the specified resource (page).
IE supports IRIs in the href attribute of anchor tags and properly displays them in its address bar just like when you enter your URL by hand (keyboard ;-)).
If you choose to percent-encode your IRI thus making it a URI, IE will not decode that URI back into an IRI.
So you could try the following:
Save your HTML files using UTF-8. This allows you to insert any Unicode character into it.
Do not percent-encode your URLs inside your HTML pages' links. Just use links like this: 亦思巴奚兵乱
A great article on the topic can also be found at the W3C: An Introduction to Multilingual Web Addresses.
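For reference, the readable form that Chrome and Firefox display is just the UTF-8 decoding of the percent-encoded bytes. A minimal sketch in Python, using only the first path segment from the question plus its /p/1934793 ending (shortened here for brevity):

```python
from urllib.parse import quote, unquote

# First segment of the path from the question, plus its trailing part.
encoded = "/%E6%89%93%E5%BC%80%E7%9B%AE%E5%BD%95/p/1934793"

# Decode the percent-escapes as UTF-8 to get the readable (IRI-style) path.
readable = unquote(encoded)
print(readable)

# Encoding the readable path again yields the original URI path.
print(quote(readable) == encoded)  # True
```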
