Under what circumstances would you recommend using UTF-8? Is there an alternative to it that will serve the same purpose?
Is UTF-8 what is used for i18n?
Since you tagged this with web design, I assume you want to keep file sizes as small as possible so they transfer quickly.
The alternatives to UTF-8 would be the other Unicode encodings, since there is no alternative to using Unicode (for regular computer systems at least).
If you look at how UTF-8 is specified, you'll see that code points up to U+007F require one octet, code points up to U+07FF require two, code points up to U+FFFF require three, and code points up to U+10FFFF require four.
For UTF-16, you need two octets for code points up to U+FFFF (mostly), and four octets for code points above that, up to U+10FFFF.
For UTF-32, you need four octets for every code point.
In other words, scripts that lie under U+07FF will have some size benefit from using UTF-8 compared to UTF-16, while scripts above that will have some size penalty.
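If you want to check these numbers yourself, here is a quick sketch (written for the Erlang shell only because that's the language used elsewhere on this page; any language with a Unicode library would do) that prints the encoded size of a few sample code points:

    %% Print how many octets a few sample code points occupy in each encoding.
    %% The samples are chosen to hit the ranges described above
    %% (ASCII, Greek, Hebrew, a three-octet symbol, a supplementary-plane character).
    lists:foreach(
      fun(CP) ->
          Utf8  = unicode:characters_to_binary([CP], unicode, utf8),
          Utf16 = unicode:characters_to_binary([CP], unicode, utf16),
          Utf32 = unicode:characters_to_binary([CP], unicode, utf32),
          io:format("U+~.16B  UTF-8: ~p  UTF-16: ~p  UTF-32: ~p octets~n",
                    [CP, byte_size(Utf8), byte_size(Utf16), byte_size(Utf32)])
      end,
      [16#0041, 16#03A9, 16#05D0, 16#20AC, 16#10348]).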
However, since the domain is web design, it might be worth noting that all of the markup and syntax characters lie in the one-octet ASCII range of UTF-8, which softens that penalty for documents with lots of, say, HTML markup and JavaScript relative to the amount of actual "text".
Scripts under U+07FF include Latin (except some extensions such as tone marks), Greek, Cyrillic, Hebrew and probably some more. Wikipedia has pretty good coverage of Unicode issues, and the Unicode Consortium's site has even more detail.
Since you are asking for recommendations, I recommend using it in all circumstances, i.e. for HTML files and textual resources alike. For an English-only application it doesn't change a thing, but when you need to actually localize it, having UTF-8 in place from the start is a benefit (you won't need to revisit your code and change it; that's one less source of defects).
As for the other encodings in the Unicode family (especially UTF-16), I would not recommend them for web applications. Although bandwidth consumption might actually be higher for, e.g., Chinese characters (at least three bytes each in UTF-8), you'll avoid problems with transmission and browser interpretation (yes, in theory it should all work the same; unfortunately, in practice it tends to break).
Use UTF-8 all the way. No excuses.
Use UTF-8 for Latin languages, and UTF-16 for every other language.
Related
Given that I want to generate short URLs and keep them as short as possible, is it a good idea to use non-ASCII characters in the path?
A URL could look like https://example.com/hžйÄю
Possible pros:
Many possible combinations
Addresses will be short even if there are thousands of them
Possible cons:
No user will ever be able to write it without installing/enabling a variety of keyboard layouts, let alone remember it
Some operating systems may not be able to interpret the characters
Is it even worth considering doing this?
I've written a web scraper to extract a large amount of information from a website using Nokogiri and Mechanize, which outputs a database seed file. Unfortunately, I've discovered there are a lot of invalid characters in the text from the source website, things like keppnisÃ¦find, ScÃ©mario and KlÃ¤tiring, which is preventing the seed file from running. The seed file is too large to go through with search and replace, so how can I go about dealing with this issue?
I think those are HTML-encoded characters; all you need to do is write functions that clean up the characters. How to do that depends on the programming platform.
Those are almost certainly UTF-8 characters; the words should look like keppnisæfind, Scémario and Klätiring. The web sites in question might be sending UTF-8 but not declaring that as their encoding, in which case you will have to force Mechanize to use UTF-8 for sites with no declared encoding. However, that might complicate matters if you encounter other web sites without a declared encoding and they send something besides UTF-8.
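To see that this is the classic UTF-8-decoded-as-Latin-1 problem rather than genuinely bad source data, here is a small round-trip sketch (written with Erlang's unicode module, but the same experiment works in Ruby or any other language):

    %% The usual failure mode: the page really is UTF-8, but it gets decoded as
    %% Latin-1, so every two-octet character turns into two one-octet characters.
    Utf8     = unicode:characters_to_binary("keppnisæfind"),  %% correct UTF-8 bytes
    Mojibake = unicode:characters_to_list(Utf8, latin1),      %% "keppnisÃ¦find"
    Fixed    = unicode:characters_to_list(Utf8, utf8),        %% "keppnisæfind" again
    io:format("~ts vs ~ts~n", [Mojibake, Fixed]).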
If I use a domain such as www.äöü.com, is there any way to avoid it being displayed as www.xn--4ca0bs.com in users’ browsers?
Domains such as www.xn--4ca0bs.com cause a lot of confusion for average internet users, I guess.
This is entirely up to the browser. In fact, IDNs are pretty much a browser-only technology. Domain names cannot contain non-ASCII characters, so the actual domain name is always the Punycode encoded xn--... form. It's up to the browser to prettify this, but many choose to not do so to avoid domain name spoofing using lookalike Unicode characters.
From a security perspective, Unicode domains can be problematic because many Unicode characters are difficult to distinguish from common ASCII characters (or indeed other Unicode characters).
It is possible to register domains such as "xn--pple-43d.com", which is equivalent to "аpple.com". It may not be obvious at first glance, but "аpple.com" uses the Cyrillic "а" (U+0430) rather than the ASCII "a" (U+0061). This is known as a homograph attack.
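You can see the difference at the code-point level; for instance, in the Erlang shell (purely illustrative, the same check works anywhere):

    %% The two lookalike letters are distinct code points with distinct UTF-8 encodings.
    1072        = $а,            %% Cyrillic small a, U+0430
    97          = $a,            %% Latin small a, U+0061
    <<208,176>> = <<"а"/utf8>>,  %% Cyrillic form takes two octets in UTF-8
    <<97>>      = <<"a"/utf8>>.  %% Latin form takes one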
Fortunately, modern browsers have mechanisms in place to limit IDN homograph attacks. The IDN policy page for Chrome highlights the conditions under which an IDN is displayed in its native Unicode form. Generally speaking, the Unicode form will be hidden if a domain label mixes characters from multiple different languages. The "аpple.com" domain described above will appear in its Punycode form as "xn--pple-43d.com" to limit confusion with the real "apple.com".
For more information see this blog post by Xudong Zheng.
Internet Explorer 8.0 on Windows 7 displays your UTF-8 domain just fine.
Google Chrome 19 on the other hand doesn't.
Read more here: An Introduction to Multilingual Web Addresses #phishing.
Different browsers do things differently, possibly because some use the system codepage/locale/encoding/whatever, while others use their own settings or a list of allowed characters.
Read that article carefully; it explains how each browser decides.
If you are targeting a specific language, you can get away with it and make it work.
The Erlang external term format has changed at least once (but this change appears to predate the history stored in the Erlang/OTP GitHub repository); clearly, it could change in the future.
However, as a practical matter, is it generally considered safe to assume that this format is stable now? By "stable," I mean specifically that, for any term T, term_to_binary will return the same binary in any current or future version of Erlang (not merely whether it will return a binary that binary_to_term will convert back to a term identical to T). I'm interested in this property because I'd like to store hashes of arbitrary Erlang terms on disk and I want identical terms to have the same hash value now and in the future.
If it isn't safe to assume that the term format is stable, what are people using for efficient and stable term serialization?
It's been stated that Erlang will provide compatibility for at least two major releases. That would mean that BEAM files, the distribution protocol, the external term format, etc. from R14 will work at least up to R16.
"We have as a strategy to at least support backwards compatibility 2 major releases back in time."
"In general, we only break backward compatibility in major releases
and only for a very good reason and usually after first deprecating
the feature one or two releases beforehand."
erlang:phash2 is guaranteed to be a stable hash of an Erlang term.
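For example, a minimal sketch of wrapping it for this use case (the helper name term_hash is made up):

    %% Hypothetical helper: a portable integer hash of an arbitrary Erlang term.
    %% phash2/1 returns values in 0..2^27-1; phash2/2 takes an explicit range,
    %% up to 2^32.
    term_hash(Term) ->
        erlang:phash2(Term, 1 bsl 32).

Note that phash2 yields at most a 32-bit value, so it works for bucketing and change detection, but it is not a collision-resistant digest the way a cryptographic hash is.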
I don't think OTP makes the guarantee that term_to_binary(T) in vX =:= term_to_binary(T) in vY. Lots of things could change if they introduce new term codes for optimized representations of things, or if we need to add Unicode strings to the ETF or something, or in the vanishingly unlikely future in which we introduce a new fundamental datatype. For an example of a change that has happened in the external representation only (stored terms compare equal, but are not byte-equal), see float_ext vs. new_float_ext.
In practical terms, if you stick to atoms, lists, tuples, integers, floats and binaries, then you're probably safe with term_to_binary for quite some time. If the time comes that their ETF representation changes, then you can always write your own version of term_to_binary that doesn't change with the ETF.
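As a rough illustration of that last suggestion, here is a hedged sketch of a hand-rolled encoding for just those simple types (the function name is made up, and floats are left out because they need a canonical text form of their own); you would then hash the result with something like crypto:hash(sha256, encode(Term)) instead of hashing the ETF directly:

    %% A hand-rolled, stable serialization for a restricted set of types.
    %% Because the byte layout is defined here rather than by the ETF, it cannot
    %% change underneath you when OTP revises its external term format.
    encode(I) when is_integer(I) ->
        [$i, integer_to_binary(I), $;];
    encode(A) when is_atom(A) ->
        Bin = atom_to_binary(A, utf8),
        [$a, integer_to_binary(byte_size(Bin)), $:, Bin];
    encode(B) when is_binary(B) ->
        [$b, integer_to_binary(byte_size(B)), $:, B];
    encode(L) when is_list(L) ->
        [$l, [encode(E) || E <- L], $;];
    encode(T) when is_tuple(T) ->
        [$t, [encode(E) || E <- tuple_to_list(T)], $;].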
For data serialization, I usually choose between Google Protocol Buffers and JSON. Both of them are very stable. For working with these formats from Erlang I use Piqi, Erlson and mochijson2.
A big advantage of Protobuf and JSON is that they can be used from other programming languages by design, whereas the Erlang external term format is more or less specific to Erlang.
Note that JSON string representation is implementation-dependent (escaped characters, floating point precision, whitespace, etc.) and for this reason it may not be suitable for your use-case.
Protobuf is less straightforward to work with compared to schemaless formats but it is a very well-designed and powerful tool.
Here are a couple of other schemaless binary serialization formats to consider. I don't know how stable they are. It may turn out that Erlang external term format is more stable.
https://github.com/uwiger/sext
https://github.com/TonyGen/bson-erlang
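For what it's worth, sext's API (as described in its README; I haven't verified how it behaves across versions) is just encode/decode, with the extra property that the encoded binaries sort in the same order as the original terms:

    %% Hedged sketch, assuming sext:encode/1 and sext:decode/1 as documented.
    Bin  = sext:encode({user, 42, <<"alice">>}),
    Term = sext:decode(Bin),
    true = (Term =:= {user, 42, <<"alice">>}).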
I'm working on internationalization of applications. I wonder whether it is better to keep unformatted versions of the static text in the YAML files, all beginning with a lowercase letter, and to capitalize them each time in the view (with the capitalize method). The advantage of this approach is that a translator creating subsequent files does not need to pay attention to letter case; the downside may be the time overhead of calling the helper many times in the view.
Different languages have different capitalization rules, so it might not be a good idea. For example, I should capitalize 'i' when talking about myself in English.