How to transform encoded URL to readable texts? - url

This is about Bangla Unicode text, but the same problem can affect any language written in a non-Latin script.
I host a Bangla blog with all its text and categories in Bangla (I prefer not to say Bengali, because the name of the language is Bangla rather than Bengali).
So the Bangla category "বাংলা" has a URL like:
http://www.example.com/category/বাংলা
But whenever I copy the URL from the address bar and paste it into a chat panel or somewhere else, it turns into some strange characters, for example:
http://www.example.com/category/%E0%A6%B8%E0%A7%8D%E0%A6%A8%E0
(This is just an example, not the exact encoding of the word "বাংলা".)
So in many cases I get encoded URLs like the one above, with no way to tell which Unicode text they represent. Recently one of my plugins has been logging some 404 errors. Among them I found a URI like:
/category/%E0%A6%B8%E0%A7%8D%E0%A6%A8%E0%A6%BE%E0%A7%9F%E0%A7%81%E0%A6%AC%E0%A6%BF%E0%A6%A6%E0%A7%8D%E0%A6%AF%E0
I used Jetpack's Omnisearch to look for a match, but the result was empty. I can't even tell which category is producing this 404.
So here comes the question:
How can I transform the encoded URL to readable glyphs?

http://www.example.com/category/বাংলা
isn't a URL; URLs can only contain ASCII characters. This is an IRI.
http://www.example.com/category/%E0%A6%AC%E0%A6%BE%E0%A6%82%E0%A6%B2%E0%A6%BE
is the URI representation of that IRI. They are otherwise equivalent. A browser may display the ‘pretty’ IRI version in the user interface, but put the URI version on the clipboard so that you can paste it into other tools that don't support IRI.
The 404 address you pasted translates to:
/category/স্নায়ুবিদ্য�
where the last character is a � because it is an invalid, truncated UTF-8 sequence. (This is probably why the request failed.) Someone may have mis-pasted a partial URI here.
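As a rough illustration of the decoding described above, here is a minimal Python sketch using the standard library's urllib.parse.unquote. Python is an assumption on my part (the question does not name a language), and the variable names are made up for the example.

```python
from urllib.parse import unquote

# URI form of the category address from the answer above.
category_uri = "/category/%E0%A6%AC%E0%A6%BE%E0%A6%82%E0%A6%B2%E0%A6%BE"

# The truncated URI from the 404 log, copied as-is (it ends in a lone %E0).
logged_uri = (
    "/category/%E0%A6%B8%E0%A7%8D%E0%A6%A8%E0%A6%BE%E0%A7%9F%E0%A7%81"
    "%E0%A6%AC%E0%A6%BF%E0%A6%A6%E0%A7%8D%E0%A6%AF%E0"
)

# unquote() decodes %XX escapes as UTF-8; by default, bytes that do not
# form a complete character become U+FFFD (the replacement character).
decoded_category = unquote(category_uri)
decoded_logged = unquote(logged_uri)

print(decoded_category)
print(decoded_logged)
print(decoded_logged.endswith("\ufffd"))  # True: the final sequence is cut off
```

Running this against entries from a 404 log should make it easier to see which category a request was aiming for, even when the tail of the address is damaged.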

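If you're using JavaScript you can do:
decodeURIComponent(url);
This decodes the percent-encoded UTF-8 bytes back into the original characters. Note that it throws a URIError if the input contains a malformed sequence, such as the truncated one in your 404 log.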

Related

Ruby on Rails - Arabic text sometimes being encoded, and others not - why and how to fix?

I want to save a wikipedia link that has Arabic characters in it.
In my console I can do
> title = "The Broken Wings / الأجنحة المتكسرة"
=> "The Broken Wings / الأجنحة المتكسرة"
And it returns properly with English and Arabic. But if I try to save a link, it encodes the Arabic characters.
When I try to enter this link: https://ar.wikipedia.org/wiki/الأجنحة_المتكسرة
it changes to https://ar.wikipedia.org/wiki/%D8%A7%D9%84%D8%A3%D8%AC%D9%86%D8%AD%D8%A9_%D8%A7%D9%84%D9%85%D8%AA%D9%83%D8%B3%D8%B1%D8%A9
How can I save the link as-is?
I assume by 'console' you mean the Rails console. Ruby (since 2.0) uses UTF-8 as its default character encoding. That means it can natively handle Universal Character Set (aka Unicode) characters. So, when you work with Arabic strings in the Rails console there is no transformation occurring.
However, for URIs (Uniform Resource Identifiers, the general category that URLs belong to), the standard (RFC 3986) says that only a subset of US-ASCII characters is allowed. You can use other characters to specify a location (this is called an Internationalized Resource Identifier, or IRI), but only some systems natively understand IRIs. Everywhere else, each non-ASCII character is converted to its UTF-8 bytes and each byte is written as %XX, a scheme called percent-encoding, which is what you see in the Wikipedia URL.
This introduction does a more complete job of explaining how multi-lingual web addressing works and how translation between IRIs and URIs works with percent encoding.
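To make the IRI/URI translation concrete, here is a small sketch in Python rather than Ruby (my choice for illustration only, not something the question uses). It starts from the percent-encoded title in the question and converts in both directions with urllib.parse; the variable names are hypothetical.

```python
from urllib.parse import quote, unquote

# Percent-encoded page title from the Wikipedia link in the question.
encoded_title = (
    "%D8%A7%D9%84%D8%A3%D8%AC%D9%86%D8%AD%D8%A9_"
    "%D8%A7%D9%84%D9%85%D8%AA%D9%83%D8%B3%D8%B1%D8%A9"
)

# URI -> IRI: turn each run of %XX escapes back into UTF-8 text.
readable_title = unquote(encoded_title)

# IRI -> URI: encode the text as UTF-8 and write each byte as %XX.
# Letters, digits and "_.-~" are left alone by quote().
round_tripped = quote(readable_title)

print(readable_title)
print(round_tripped == encoded_title)  # True
```

Since the two forms convert into each other without loss, one possible approach is to store whichever form the tooling hands over and decode only when displaying it.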

Apostrophe (valid char) is percent-encoded - but only sometimes

Use Google to find the Wikipedia article about De Morgan's laws.
Click the link, and see the URL. At least in Chrome, it will be
https://en.wikipedia.org/wiki/De_Morgan%27s_laws
' is percent-encoded as %27, even though it is a valid URL character (what's more, if you manually change %27 to ' in the address bar, it still works). Why?
While the apostrophe may be a valid character, the URL-encoded version is equally valid!
I'm not sure there is a hard reason, so this is a "soft" answer: an apostrophe (and/or double quote) needs to be escaped somehow if the URL is ever embedded in, for example, JSON or XML. Percent-encoding them as part of sanitizing URLs solves this in one way, and protects against poor JSON/XML handling and programmer errors. It's just pragmatic.
Decoding these particular valid characters in HTTP response headers and the like (so the browser shows them "right") should be possible and maybe nice, but it means extra work and code. Note that there are also characters for which decoding would not be OK, so this would have to be selective! So, at least in this case, I guess it just wasn't done. If a character gets URL-encoded at any step of the page-loading chain, it stays that way.
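A quick way to see that both spellings name the same resource is to run the title through a percent-encoder with and without the apostrophe marked as safe. The sketch below uses Python's urllib.parse as a stand-in; it is not a claim about what Google or Wikipedia actually run.

```python
from urllib.parse import quote, unquote

title = "De_Morgan's_laws"

# By default quote() escapes the apostrophe, as in the link from the search result.
escaped = quote(title)

# Listing "'" as safe leaves it literal, as in the hand-edited address.
literal = quote(title, safe="/'")

print(escaped)   # De_Morgan%27s_laws
print(literal)   # De_Morgan's_laws
print(unquote(escaped) == unquote(literal))  # True
```

A sanitizer that escapes more than strictly necessary therefore produces a different-looking but equivalent address.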

lua reading chinese character

I have the following xml that I would like to read:
chinese xml - https://news.google.com/news/popular?ned=cn&topic=po&output=rss
korean xml - http://www.voanews.com/templates/Articles.rss?sectionPath=/korean/news
Currently I am using LuaXML to parse XML that contains Chinese characters. However, when I print the result to the console, the Chinese characters are not displayed correctly and show up as garbage.
Is there any way to parse Chinese or Korean characters into a Lua table?
I don't think Lua is the issue here. The raw data the remote site sends is encoded using UTF-8, and Lua does no special interpretation of that—which means it should be preserved perfectly if you just (1) read from the remote site, and (2) save the read data to a file. The data in the file will contain CJK characters encoded in UTF-8, just like the remote site sent back.
If you're getting funny results like you mention, the fault probably lies either with the library you're using to read from the remote site, or perhaps simply with the way your console displays the results when you output to it.
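To illustrate the point that the bytes themselves are usually fine and only the interpretation goes wrong, here is a small sketch. It is in Python rather than Lua (purely for illustration), and Latin-1 is just an assumed stand-in for whatever encoding a misconfigured console might apply.

```python
# Two CJK characters, encoded the way the feed would send them (UTF-8).
raw = "\u4e2d\u7f8e".encode("utf-8")

# Passing the bytes around untouched loses nothing.
preserved = raw.decode("utf-8")

# Interpreting the same bytes in a single-byte encoding (Latin-1 here, as a
# stand-in for a misconfigured console) produces garbage on screen.
misread = raw.decode("latin-1")

print(len(raw))                      # 6 bytes: three per character
print(preserved == "\u4e2d\u7f8e")   # True
print(repr(misread))
```

Writing the data to a file and opening it in a UTF-8 aware editor is one way to tell a display problem from a parsing problem.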
I managed to convert "&#20013;&#32654;" into Chinese characters.
I needed one additional step: converting every such sequence in the string using the method from this link, http://forum.luahub.com/index.php?topic=3617.msg8595#msg8595, before saving to XML format.
-- utf8.char needs Lua 5.3+; string.char only accepts values 0-255, so it fails on CJK code points
string.gsub(l, "&#([0-9]+);", function(c) return utf8.char(tonumber(c)) end)
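For comparison, the same substitution can be sketched in Python (not the language of this thread; shown only to make the idea easy to test). The helper name decode_numeric_refs is made up for the example.

```python
import html
import re

def decode_numeric_refs(text: str) -> str:
    """Replace decimal references such as &#20013; with the character itself."""
    return re.sub(r"&#([0-9]+);", lambda m: chr(int(m.group(1))), text)

sample = "&#20013;&#32654;"
decoded = decode_numeric_refs(sample)

print(decoded == "\u4e2d\u7f8e")         # True
print(html.unescape(sample) == decoded)  # True: the stdlib handles these too
```

The key detail is that the number is a Unicode code point, so the conversion function has to accept values well above 255.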
I would also like to ask about LuaXML. I have come across the method xml.registerCode(decoded,encoded).
Its documentation says that it
registers a custom code for the conversion between non-standard characters and XML character entities
What do they mean by non-standard characters and how do I use it?

ASP.NET MVC Colon in URL

I've seen that IIS has a problem with letting colons into URLs. I also saw the suggestions others offered here.
With the site I'm working on, I want to be able to pass titles of movies, books, etc., into my URL, colon included, like this:
mysite.com/Movie/Bob:The Return
This would be consumed by my MovieController, for example, as a string and used further down the line.
I realize that a colon is not ideal. Does anyone have any other suggestions? As poor as it currently is, I'm doing a find-and-replace from all colons (:) to another character, then a backwards replace when I want to consume it on the Controller end.
I resolved this issue by adding this to my web.config:
<httpRuntime requestPathInvalidCharacters=""/>
This must be within the system.web section.
The default is:
<httpRuntime requestPathInvalidCharacters="&lt;,&gt;,*,%,&amp;,:,\,?"/>
So to make an exception only for the colon it would become
<httpRuntime requestPathInvalidCharacters="&lt;,&gt;,*,%,&amp;,\,?"/>
Read more at: http://msdn.microsoft.com/en-us/library/system.web.configuration.httpruntimesection.requestpathinvalidcharacters.aspx
From what I understand, the colon character is acceptable as an unencoded character in a URL. I don't know why they added it to the default of requestPathInvalidCharacters.
Consider URL encoding and decoding your movie titles.
You'd end up with foo.com/bar/Bob%3AThe%20Return
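As a sanity check on the encoded form: the colon is byte 0x3A, so it percent-encodes to %3A. The sketch below uses Python's urllib.parse instead of the .NET helpers (an assumption made only to keep the example short and runnable).

```python
from urllib.parse import quote, unquote

title = "Bob:The Return"

# safe="" makes quote() escape every reserved character, including ":" and "/".
segment = quote(title, safe="")

print(segment)                    # Bob%3AThe%20Return
print(unquote(segment) == title)  # True
```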
As an alternative, consider leveraging an HTML helper to remove URL-unfriendly characters in URLs (method is URLFriendly()). The SEO difference between a colon and a placeholder (e.g. a dash) would likely be negligible.
One of the biggest worries with your approach is that the movie name isn't always going to be unique (e.g. "The Italian Job"). Also, what about other illegal characters (e.g. brackets)?
It might be a good idea to use an id number in the url to locate the movie in your database. You could still include a url friendly copy of movie name in your url, but you wouldn't need to worry about getting back to the original title with all the illegal characters in it.
A good example is the url to this page. You can see that removing the title of the page still works:
ASP.NET MVC Colon in URL
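The id-plus-friendly-name idea can be sketched roughly as follows. This is Python rather than ASP.NET MVC, and the route shape /Movie/<id>/<slug> and all function names are hypothetical, chosen just to show the mechanics.

```python
import re

def slugify(title: str) -> str:
    """Lowercase the title and collapse anything non-alphanumeric into '-'."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")

def movie_path(movie_id: int, title: str) -> str:
    # Hypothetical route shape: /Movie/<id>/<slug>
    return f"/Movie/{movie_id}/{slugify(title)}"

def movie_id_from_path(path: str) -> int:
    # Only the id matters; the slug may be missing or out of date.
    match = re.match(r"^/Movie/(\d+)(?:/.*)?$", path)
    if match is None:
        raise ValueError(f"not a movie path: {path!r}")
    return int(match.group(1))

path = movie_path(42, "Bob:The Return")
print(path)                             # /Movie/42/bob-the-return
print(movie_id_from_path(path))         # 42
print(movie_id_from_path("/Movie/42"))  # 42
```

Because the lookup uses only the id, the slug never has to be converted back into the original title. Note that this simple slugify drops non-ASCII letters entirely, so titles in other scripts would need a different rule.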
The colon is a reserved character in a URI according to RFC 3986. The RFC does permit it inside a path segment, but IIS rejects it by default, so the safest course is to either URL-encode it or use another character. And here's a nice blog post you might take a look at.
The simplest way is to use System.Web.HttpUtility.UrlEncode() when building the URL and System.Web.HttpUtility.UrlDecode() when interpreting the results coming back. You would also have problems with the space character if you don't encode the value first.

Encoding of XHTML and & (ampersand)

My website is XHTML Transitional compliant except for one thing: the & (ampersand) characters in URLs are written as-is, instead of as &amp;
That is, all the URLs in my pages are usually like this:
<a href="http://foo?x=1&y=2">Foo</a>
But the XHTML validator generates this error:
cannot generate system identifier for general entity "y"
... and it wants the URL to be written like this:
<a href="http://foo?x=1&amp;y=2">Foo</a>
The problem is that Internet Explorer and Firefox don't handle the URL correctly and ignore the y parameter. How can I make this link work and validate correctly?
It seems to me that it is impossible to write XHTML pages if the browsers don't work with strict encoded XHTML URLs.
Do you want to see it in action? See the difference between these two links (copy and paste them as they are):
http://stackoverflow.com/search?q=ff&sort=newest
and
http://stackoverflow.com/search?q=ff&amp;sort=newest
I have just tried this. What you attempted to do is correct. In HTML, if you are writing a link, the & characters should be encoded as &amp;. You would only encode the & as %26 if you wanted a parameter value to contain an ampersand. I just wrote a simple HTML page that contained a link: Click me
and it worked fine: default2.aspx received the parameters intended and the source passed validation.
The encoding of & as &amp; is required in HTML, not in the link. When the browser sees the &amp; in the HTML source for a link it will interpret it as an ampersand and the link target will be as intended. If you paste a URL into your browser address bar it does not expect it to be HTML and does not try to interpret any HTML encoding that it may contain. This is why the example links that you suggest we copy and paste into a browser don't work, and why we wouldn't expect them to work.
If you post a bit more of your actual code we might be able to see what you have done wrong, but you appear to be heading in the right direction by using &amp; in your anchor tags.
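The distinction between "escaped in the markup" and "escaped in the URL" can be shown with a short sketch. It uses Python's html module as a stand-in for whatever the page framework does, and the URL is the one the asker mentions in the follow-up.

```python
import html

url = "http://foo?x=1&y=2"

# In the page source, the ampersand inside the attribute is written as &amp;.
anchor = f'<a href="{html.escape(url, quote=True)}">Foo</a>'
print(anchor)  # <a href="http://foo?x=1&amp;y=2">Foo</a>

# A browser parsing the attribute undoes the escaping before using the URL.
print(html.unescape("http://foo?x=1&amp;y=2") == url)  # True

# Escaping a string that is already escaped encodes it twice.
print(html.escape("http://foo?x=1&amp;y=2"))  # http://foo?x=1&amp;amp;y=2
```

The last line shows the kind of double encoding to watch for when a control or template engine already escapes attribute values for you.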
It was my fault: the hyperlink control already encoded &, so my URL http://foo?x=1&y=2 was encoded to http://foo?x=1&amp;y=2
Normally the &amp; inside the URL is correctly handled by browsers, as you stated.
You could use &amp; instead of & in your URL within your page.
That should allow it to be validated as strict XHTML...
Foo
Note, if used by an ASP.NET Request.QueryString function, the query string doesn't use XML encoding; it uses URL encoding:
/mypath/mypage?b=%26stuff
So you need to provide a function translating '&amp;' into %26.
Note: in that case, Server.URLEncode("neetu & geetu"), which would produce neetu+%26+geetu, is not what you want, since you need to translate &amp; into %26, not just '&'. You must add a replace() call applied to the URLEncode result, in order to replace '%26amp;' with '%26'.
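One way to avoid the replace() step is to keep the two layers separate: URL-encode the raw values first, then apply markup escaping to the finished URL. Here is a sketch in Python (standing in for the ASP.NET calls, which I have not reproduced); the parameter names and values reuse the examples above.

```python
import html
from urllib.parse import parse_qs, urlencode

# Step 1: URL-encode the parameter value, so its "&" becomes %26.
query = urlencode({"b": "&stuff"})
print(query)  # b=%26stuff

# Step 2: only the separators between parameters need &amp; in markup.
full_query = urlencode({"a": "neetu & geetu", "b": "&stuff"})
href = html.escape("/mypath/mypage?" + full_query)
print(href)  # /mypath/mypage?a=neetu+%26+geetu&amp;b=%26stuff

# The server sees the original values after both layers are undone.
print(parse_qs(html.unescape(href).split("?", 1)[1]))
```

With this ordering the raw value never contains &amp; at the moment it is URL-encoded, so there is no '%26amp;' to clean up afterwards.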
To be even more thorough: use &#38;, a numeric character reference.
Because &amp; is a character entity reference:
Character entity references are defined in the markup language
definition. This means, for example, that for HTML only a specific
range of characters (defined by the HTML specification) can be
represented as character entity references (and that includes only a
small subset of the Unicode range).
That's coming from the wise people at W3C (read this for more).
Of course, this is not a very big deal, but the suggestion of W3C is that the numeric one will be valid and useable everywhere and always, while the named one is 'fine' for HTML but nothing more.
The problem is worse than you think - try it in Safari. &amp; gets converted to &#38; and the hash ends the URL.
The correct answer is to not output XHTML - there's no reason that justifies spending more time on development and alienating Mac users.

Resources