Is there any way to avoid showing "xn--" for IDN domains? - url

If I use a domain such as www.äöü.com, is there any way to avoid it being displayed as www.xn--4ca0bs.com in users’ browsers?
Domains such as www.xn--4ca0bs.com cause a lot of confusion for average internet users, I suspect.

This is entirely up to the browser. In fact, IDNs are pretty much a browser-only technology. Domain names cannot contain non-ASCII characters, so the actual domain name is always the Punycode encoded xn--... form. It's up to the browser to prettify this, but many choose to not do so to avoid domain name spoofing using lookalike Unicode characters.
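To illustrate, the Punycode form can be produced with Python's built-in `idna` codec (a sketch; this codec implements IDNA 2003, so some newer domains may need the third-party `idna` package instead):

```python
# Convert a Unicode domain to its ASCII ("xn--") form and back,
# using Python's built-in IDNA 2003 codec.
unicode_domain = "www.äöü.com"

ascii_domain = unicode_domain.encode("idna")   # per-label Punycode
print(ascii_domain)                            # b'www.xn--4ca0bs.com'

roundtrip = ascii_domain.decode("idna")        # back to the Unicode form
print(roundtrip)                               # www.äöü.com
```

The `xn--4ca0bs` label is exactly what the question describes: the real domain name on the wire, which the browser may or may not choose to prettify.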

From a security perspective, Unicode domains can be problematic because many Unicode characters are difficult to distinguish from common ASCII characters (or indeed other Unicode characters).
It is possible to register domains such as "xn--pple-43d.com", which is equivalent to "аpple.com". It may not be obvious at first glance, but "аpple.com" uses the Cyrillic "а" (U+0430) rather than the ASCII "a" (U+0061). This is known as a homograph attack.
Fortunately, modern browsers have mechanisms in place to limit IDN homograph attacks. Chrome's IDN Policy page describes the conditions under which an IDN is displayed in its native Unicode form. Generally speaking, the Unicode form will be hidden if a domain label contains characters from multiple different languages. The "аpple.com" domain described above will appear in its Punycode form as "xn--pple-43d.com" to limit confusion with the real "apple.com".
For more information see this blog post by Xudong Zheng.
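The lookalike can be demonstrated in Python: the Cyrillic "а" makes an entirely different domain (a sketch using the built-in IDNA 2003 codec; the Punycode form matches the one quoted above):

```python
# Two visually near-identical domains built from different code points.
latin = "apple.com"              # ASCII "a" (U+0061)
homograph = "\u0430pple.com"     # Cyrillic "а" (U+0430)

print(latin == homograph)        # False: different code points
print(homograph.encode("idna"))  # b'xn--pple-43d.com'
```

The two strings render almost identically in most fonts, which is the whole point of the attack; only the encoded form makes the difference visible.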

Internet Explorer 8.0 on Windows 7 displays your UTF-8 domain just fine.
Google Chrome 19, on the other hand, doesn't.
Read more here: An Introduction to Multilingual Web Addresses #phishing.
Different browsers do things differently, possibly because some use the system codepage/locale/encoding/whatever, while others use their own settings or a list of allowed characters.
Read that article carefully, it explains how each browser works when making a decision.
If you are targeting a specific language, you can get away with it and make it work.

Related

Is it a good idea to generate short URLs with non-ASCII characters?

Given I want to generate short urls and I want to keep it as short as possible, is it a good idea to add non-ASCII characters in the path?
URL could look like https://example.com/hžйÄю
Possible pros:
Many possible combinations
Addresses will be short even if there are thousands of them
Possible cons:
No user will ever be able to write it without installing/enabling a variety of keyboard layouts, let alone remember it
Some operating systems may not be able to interpret the characters
Is it even worth considering doing this?
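One more consideration, sketched in Python: a non-ASCII path is "short" only on screen. When actually transmitted, it must be percent-encoded UTF-8, so the five visible characters balloon on the wire:

```python
from urllib.parse import quote, unquote

path = "hžйÄю"          # 5 characters on screen
wire = quote(path)      # what actually travels in the HTTP request line
print(wire)             # h%C5%BE%D0%B9%C3%84%D1%8E
print(len(path), len(wire))   # 5 25
assert unquote(wire) == path  # lossless round trip
```

So the address-bar form is short, but logs, analytics, and anywhere the encoded form leaks will show the 25-character version.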

URL Structure: Lower case VS Upper case

It just occurred to me, when I was going through some websites, that some of them have a combination of upper case and lower case in the URL, something like http://www.domain.com/Home/Article
Now, as far as I know, we should always use lowercase in URLs, but I have no idea of the technical reason. I would like to learn from you experts why to use lowercase in URLs, and what the advantages and disadvantages of uppercase URLs are.
The domain part is not case sensitive. GoOgLe.CoM works. You can add uppercase as you like, but normally there's no reason to do so and, as stated in the comments below, it may hurt your SEO ranking.
The path part is or is not case sensitive, depending on the server environment and server. Typically Windows machines are case insensitive, while Linux machines are case sensitive. This means that you should stick to lowercase or you risk introducing a bug that's really hard to hunt down (mismatched case that doesn't matter on the dev server).
The query string part is available to the server as it is. You can readily use mixed case as you like, or discard the case (toLowerCase(...)). This also means that using base64-encoded keys will work. You can't expect users to type those correctly, though.
The hash part (called "fragment identifier") is only available to the client code, not to the server. Javascript may distinguish between the cases as it likes, and so does the browser. url#a will scroll to the element with the ID a, but url#A won't.
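The four rules above can be sketched in Python with `urllib.parse`: lowercase the host always, lowercase the path only by your own server convention, and leave query and fragment untouched (the `normalize` helper here is hypothetical, not a standard function):

```python
from urllib.parse import urlsplit, urlunsplit

def normalize(url: str) -> str:
    """Lowercase the case-insensitive parts of a URL, keep the rest as-is."""
    parts = urlsplit(url)
    return urlunsplit((
        parts.scheme.lower(),  # scheme: case-insensitive
        parts.netloc.lower(),  # host: case-insensitive
        parts.path.lower(),    # path: server-dependent; lowercased by convention
        parts.query,           # query: passed through to the server verbatim
        parts.fragment,        # fragment: client-only, case-sensitive
    ))

print(normalize("HTTP://GoOgLe.CoM/Home/Article?Key=AbC#Section"))
# http://google.com/home/article?Key=AbC#Section
```

Note that the query and fragment keep their case: lowercasing them could break base64 keys or `#ElementId` anchors.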
I'm going to have to disagree with all established wisdom on this, so I'll probably get downvoted, but:
If you redirect all mixed-case URLs to your properly cased URL, it solves all the problems mentioned. Therefore it seems this argument comes from tradition and preference. The point of a URL is to have a user-friendly representation of a page, and if your URL is friendlier with upper case, why not use it? Compare:
moviesforyoutowatch.com/batman-vii-the-dark-knight-whatevers
MoviesForYouToWatch.com/Batman-VII-The-Dark-Knight-Whatevers
I find the mixed case version superior for the purpose. If there's a technical reason that can't be solved with a lower-case compare and redirect, please share it.
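The compare-and-redirect idea can be sketched as a tiny WSGI middleware in Python (a hypothetical illustration; a real application would normally use its framework's redirect facility):

```python
def lowercase_redirect(app):
    """Wrap a WSGI app: 301-redirect any mixed-case path to its lowercase twin."""
    def middleware(environ, start_response):
        path = environ.get("PATH_INFO", "")
        if path != path.lower():
            # Mixed case: send the canonical lowercase location instead.
            start_response("301 Moved Permanently",
                           [("Location", path.lower())])
            return [b""]
        return app(environ, start_response)
    return middleware
```

With this in front of the app, /Batman-VII and /batman-vii both end up at the same canonical page, so search engines see one URL while print media can show the friendlier mixed-case form.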
I know you asked for technical reasons but it's also worth considering this from a UX perspective.
Say you have a URL with uppercase characters and, for argument's sake, it has been distributed in printed media. When a user comes to enter that URL into their browser, they may well be compelled to match that case (or be forced to match it if your web server is case sensitive). Ultimately you are giving them more work to do, as they have to consider case as well. After all, they don't know whether your server is case sensitive, and they may have experienced 404s from case-sensitive web servers in the past.
If your server is case sensitive and you are using mixed-case URLs, you are giving more scope for the user to mistype the URL. Furthermore, say you have the URL www.example.com/Contact. It's easy to confuse an upper and lower case "c" (especially if it is copied in handwriting); if the user overlooks this and uses the wrong case, they may never reach your content.
With all this in mind, consider www.example.com/News/Articles/FreeIceCreamForAll. On a keyboard that's not too difficult, but on a mobile device it would be very fiddly to input.
The reverse is also true should a user want to write down a URL from the address bar. They may feel they need to match the case, ultimately giving them more work to do and increasing the likelihood of errors.
To conclude; keep URLs lower case.
REGARDING SECURITY ASPECTS OF THIS ISSUE:
There is actually a security argument for using a mix of uppercase and lowercase: it has the effect of confusing and blocking attackers.
In conversation, humans easily confuse uppercase and lowercase use.
Humans can't clearly "speak" identifiers, passwords, or URLs that mix cases.
This can help protect data or passwords on site sub-parts that are provided as part of a locked-in or secure "automated access" area of a site or its data.
It's similar to not using JSON.
JSON is human-readable text, so it hands attackers (including governments and companies that harvest your ideas and data) almost everything they need to know about the data. It is arguably more secure to confuse them by using private, bespoke, very fast binary protocols with your own unknowable data structures. Just watch out, because it is also possible to confuse yourself or your own development team.
All your security layers and protocols have to be well managed to avoid confusion.
There is therefore an extra level of site and data security against human attackers (and some robots) to be had simply by using totally unconventional systems (that is, why would anybody use a "standard security protocol" when, with some heavyweight prior computing, they can all be broken?).
Just salt and hash everything, and also add some extra bespoke security of your own; it's just common sense.
Conclusion: all the above answers are very clear and correct, but you can also happily leverage that very same knowledge to confuse potential attackers.

Dealing with invalid characters from web scraping

I've written a web scraper to extract a large amount of information from a website using Nokogiri and Mechanize, which outputs a database seed file. Unfortunately, I've discovered there are a lot of invalid characters in the text on the source website, things like keppnisÃ¦find, ScÃ©mario and KlÃ¤tiring, which is preventing the seed file from running. The seed file is too large to go through with search and replace, so how can I go about dealing with this issue?
I think those are HTML characters; all you need to do is write functions that will clean the characters. This depends on the programming platform.
Those are almost certainly UTF-8 characters; the words should look like keppnisæfind, Scémario and Klätiring. The web sites in question might be sending UTF-8 but not declaring that as their encoding, in which case you will have to force Mechanize to use UTF-8 for sites with no declared encoding. However, that might complicate matters if you encounter other web sites without a declared encoding and they send something besides UTF-8.
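If the text really is UTF-8 that got decoded as Latin-1 (the classic "Ã©"-style mojibake), it can often be repaired after the fact by reversing the mistake. A Python sketch of the idea (in Ruby something like `str.encode('ISO-8859-1').force_encoding('UTF-8')` should do the equivalent):

```python
# Reverse the classic double-encoding: UTF-8 bytes mistakenly decoded as Latin-1.
def fix_mojibake(text: str) -> str:
    # Re-encode to get the original bytes back, then decode them correctly.
    return text.encode("latin-1").decode("utf-8")

print(fix_mojibake("KlÃ¤tiring"))  # Klätiring
print(fix_mojibake("ScÃ©mario"))   # Scémario
```

This only works when every affected string went through exactly that one wrong decode; text that was mangled twice, or that mixes encodings, needs more careful detection first.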

UTF-8 uses and alternatives

Under what circumstances would you recommend using UTF-8? Is there an alternative to it that will serve the same purpose?
Is UTF-8 what is used for i18n?
Since you tagged this with web design, I assume you need to optimize the code size to be as small as possible to transfer files quickly.
The alternatives to UTF-8 would be the other Unicode encodings, since there is no alternative to using Unicode (for regular computer systems at least).
If you look at how UTF-8 is specified, you'll see that all code points up to U+007F will require one octet, and code points up to U+07FF will require two octets, up to U+FFFF three and four octets for code points up to U+10FFFF.
For UTF-16, you will need two octets up to U+FFFF (mostly), and four octets for values up to U+10FFFF.
For UTF-32, you need four octets for all code points.
In other words, scripts that lie under U+07FF will have some size benefit from using UTF-8 compared to UTF-16, while scripts above that will have some size penalty.
However, since the domain is web design, it might be worth noting that all control characters lie in the one-octet range of UTF-8, which makes this less true for texts with lots of, say, HTML markup and Javascript, compared to the amount of actual "text".
Scripts under U+07FF include Latin (except some extensions such as tone marks), Greek, Cyrillic, Hebrew and probably some more. Wikipedia has pretty good coverage on Unicode issues, and on the Unicode Consortium you can get even more details.
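The size trade-off described above is easy to check in Python (byte counts per encoding; the `-le` variants are used so the optional BOM doesn't skew the UTF-16/UTF-32 numbers):

```python
samples = {
    "Latin":   "hello",   # all code points below U+0080
    "Greek":   "γειά",    # code points below U+0800
    "Chinese": "中文字",   # code points above U+0800
}

for name, text in samples.items():
    sizes = {enc: len(text.encode(enc))
             for enc in ("utf-8", "utf-16-le", "utf-32-le")}
    print(name, sizes)
```

The Latin sample is half the size in UTF-8, the Greek sample ties, and the Chinese sample is a third larger in UTF-8 than in UTF-16, exactly the crossover at U+07FF described above.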
Since you are asking for recommendations, I recommend using it in all circumstances, all the time, i.e. for HTML files and textual resources. For an English-only application it doesn't change a thing, but when you need to actually localize it, having UTF-8 in the first place will be a benefit (you won't need to revisit your code and change it; one less source of defects).
As for the other Unicode family encodings (especially UTF-16), I would not recommend using them for web applications. Although bandwidth consumption might actually be higher in UTF-8 for e.g. Chinese characters (at least three bytes each), you'll avoid problems with transmission and browser interpretation (yes, I know that in theory it should all work the same; unfortunately, in practice it tends to break).
Use UTF-8 all the way. No excuses.
Use UTF-8 for Latin-based languages, UTF-16 for every other language.

What is the best way to set a collation in Informix IDS?

I'm administering an Informix IDS DBMS in Argentina. We speak Spanish, and the traditional ASCII character set of Informix doesn't fit our needs.
I've been fooling around and made it work with the DB_LOCALE variable. But I've seen others called CLIENT_LOCALE and SERVER_LOCALE. Should I use them? Is DB_LOCALE enough?
Thanks.
You mainly need to set CLIENT_LOCALE and DB_LOCALE - to es_es.8859-1 or something similar (maybe es_ar.8859-1, but you would probably need to get the ILS International Language Supplement to get that, assuming it is available at all).
The server locale controls the language used when the server reports errors. Some of the messages in the server log files would be given in Spanish rather than English.
The DB_LOCALE controls how the data is sorted in the database in indexes. It is most critical when the database is created; if it is not set, the database will be assumed to be in US English (American). You should normally set DB_LOCALE when accessing the database too, though it isn't quite as critical. The CLIENT_LOCALE should be set too. Usually, these values are the same. Sometimes, though, you have a Windows client running using a Microsoft code page for Spanish (CP 1252, I think) and a Unix server using 8859-1 or perhaps 8859-15. In those cases, the GLS (Global Language Support) library will automatically take care of codeset conversion for you.
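As a sketch, this is the environment a client process would set up before the Informix client library loads (the locale names are examples from the answer above; check which ILS locales are actually installed on your system):

```python
import os

# Hypothetical setup: these must be in the environment before connecting.
os.environ["CLIENT_LOCALE"] = "es_es.8859-1"  # codeset used on the client side
os.environ["DB_LOCALE"]     = "es_es.8859-1"  # codeset the database was created with
os.environ["SERVER_LOCALE"] = "es_es.8859-1"  # language of server log messages

for var in ("CLIENT_LOCALE", "DB_LOCALE", "SERVER_LOCALE"):
    print(var, "=", os.environ[var])
```

As the answer notes, CLIENT_LOCALE and DB_LOCALE are usually the same; they differ only when the client machine uses a different codeset (such as a Windows code page) and you want the GLS library to convert between them.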
