Is there a standard for localized locale codes?

I am currently working on a project that would benefit from localized locale codes. For example, RFC 5646 and its parent standard BCP 47 define locale codes for various locales, such as en-GB for British English and zh-Hans-SG for Singaporean Chinese written in simplified Chinese characters. Unfortunately, these codes use only a small subset of the Latin alphabet.
I am looking for a similar standard or commonly used system that defines a set of language codes in the respective writing system of each language (somewhat akin to an autoglossonym).
EDIT: I am strictly seeking localized locale codes since in the problem's context (URI i18n/l10n), it would be unreasonable to use an autoglossonym or other verbose equivalent.

Locale codes as specified by RFC 5646 and BCP 47 are meant to be machine-parseable. Thus, en-GB is "English (Great Britain)" and zh-Hans-SG is "Chinese (Singapore, Simplified Chinese Script)".
They are designed so that web pages, e-books and other documents can declare, in a standard way, the language and script they are written in.
Thus, each language, script and country is given a unique code by the respective standards, and these codes are collated in the IANA Language Subtag Registry (http://www.iana.org/assignments/language-subtag-registry/language-subtag-registry).
For a localized version of this, you are better off mapping the codes to localized names (e.g. by localizing the Description field of the subtag registry database, or by using a project like iso-codes) and formatting those in a presentable way, keeping the locale code as the internal representation.
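A minimal sketch of that approach, using Java's built-in CLDR display names (the tags and target locales here are only examples; the exact output strings vary by CLDR version):

```java
import java.util.Locale;

public class LocalizedLocaleNames {
    public static void main(String[] args) {
        // Keep the BCP 47 tag as the internal representation...
        Locale tag = Locale.forLanguageTag("zh-Hans-SG");

        // ...and render a localized display name, either in the locale's own
        // language (an autoglossonym-like label) or in the UI language
        System.out.println(tag.getDisplayName(tag));            // e.g. 中文（简体，新加坡）
        System.out.println(tag.getDisplayName(Locale.ENGLISH)); // e.g. Chinese (Simplified, Singapore)
    }
}
```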

Related

Handling Declension in iOS

I need to handle declensions. The app is in a different language (Czech), where words change between singular and plural, and based on gender as well.
Example in English:
1 item, 2 items, 5 items, ...
In the target language (Czech):
1 položka, 2 položky, 5 položek, ...
I have found a few repositories that I am currently going through.
https://github.com/adamelliot/Inflections
https://github.com/mattt/InflectorKit
On Android, there is a way to do it via XML. Is there any recommended way to handle this on iOS? I don't want to use ifs or switches.
Thank you for any suggestions.
Matti
In iOS (and other Apple platforms), plural declensions and other localized strings that change along with numeric values run through the same API as other localized strings. So you just call NSLocalizedString in your code (no extra ifs or switches), but you provide more supporting localized data — in addition to the Localizable.strings file that contains the main (number-independent) localized strings, you add a stringsdict file for the declensions.
Apple's docs run through this step-by-step: see Handling Noun Plurals and Units of Measurement in their Internationalization and Localization Guide. The example there is Russian, which IIUC has similar plural rules to Czech... but the stringsdict format supports the full set of Unicode CLDR Plural Rules if you need to handle more.
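The stringsdict format itself is Apple-only, but the CLDR plural categories it keys on ("one", "few", "many", "other", ...) can be inspected with ICU. A small sketch using ICU4J (assuming the com.ibm.icu dependency), just to show which category each Czech count falls into:

```java
import com.ibm.icu.text.PluralRules;
import com.ibm.icu.util.ULocale;

public class CzechPluralCategories {
    public static void main(String[] args) {
        // CLDR plural rules for Czech: 1 -> "one", 2-4 -> "few", 5+ -> "other"
        PluralRules czech = PluralRules.forLocale(new ULocale("cs"));
        for (int n : new int[] {1, 2, 5}) {
            // select() returns the category a stringsdict entry is keyed on
            System.out.println(n + " -> " + czech.select(n));
        }
        // Prints: 1 -> one, 2 -> few, 5 -> other,
        // matching "1 položka, 2 položky, 5 položek" from the question
    }
}
```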

What are the current ISO standards for locales and languages?

I can't work out which ISO standards are current nowadays for locales and languages. Where can I find such info?
I want to standardize languages and locales in my project.
I found the ISO 639 standard for languages, but it is said to be obsolete. Should I use ISO 639-1 then? Should I then use ISO 3166-2 for country codes, to build locales myself?
ISO 639 is the way to go for language codes: it is a 'superset' of standards for language codes with different lengths. Two-letter language codes, for example, can be found in ISO 639-1. You will have to decide if ISO 639-1 covers all languages that you need or not: if not, switch to ISO 639-2 or 639-3 (although even ISO 639-3 does not cover all languages in the world yet).
The 'obsolete' note that you saw most likely refers to the list that is kept on the W3C website: language codes do change from time to time.
See also this question for a more detailed answer - although I do not agree with all details given there.
To build your locale, just combine language and country codes (the latter from ISO 3166), as you suggested. Two-letter codes are more common than three-letter codes: use them if they cover all locales that you need.
The most authoritative guidance on this topic is IETF BCP 47, which recommends that you use the shortest code that is available for a language. So if both a two-letter code (ISO 639-1) and a three-letter code (ISO 639-2) are available, IETF BCP 47 recommends that you use the two-letter code.
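A quick sketch of that advice in Java, whose Locale class follows BCP 47: combine an ISO 639-1 language code with an ISO 3166-1 country code and you get a well-formed tag using the shortest language subtag:

```java
import java.util.Locale;

public class BuildLocale {
    public static void main(String[] args) {
        // ISO 639-1 language code + ISO 3166-1 alpha-2 country code
        Locale ptBR = new Locale.Builder()
                .setLanguage("pt")
                .setRegion("BR")
                .build();

        System.out.println(ptBR.toLanguageTag());   // pt-BR: the BCP 47 tag
        System.out.println(ptBR.getISO3Language()); // por: the ISO 639-2 equivalent
        System.out.println(ptBR.getDisplayName(Locale.ENGLISH)); // Portuguese (Brazil)
    }
}
```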

What products support 3-digit region subtags, e.g., es-419 for Latin-American Spanish?

Are web browsers, translation tools and translators familiar with these numeric codes in addition to the more common "es" or "es-ES"?
I've already visited the following pages:
W3C Choosing a Language Tag
W3C Language tags in HTML and XML
RFC 5646 Tags for Identifying Languages
Microsoft National Language Support (NLS) API Reference
I doubt that many products like that exist. It seems that some mainstream programming languages (I have tested C# and Java) do not support these tags, so it would be quite hard to develop programs that do.
By the way, the NLS API Reference that you provided does not contain a region tag in any of the LCID definitions. And if you think about it for a moment, knowing how a Locale Identifier is built, there is actually no way to support it now; an implementation change would be required (they would have to use some reserved bits, I suppose).
I don't think we will see support for region tags in the foreseeable future.
Edit
I saw that Microsoft assigned LCIDs of value -1 and -2 to "European Union 1" and "European Union 2" respectively. However, I don't think it is related.
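That said, it is easy to check whether a given runtime at least parses the numeric region subtags. A quick sketch for Java 7 and later, whose Locale.forLanguageTag accepts UN M.49 region codes (the display output depends on the locale data the JDK ships with):

```java
import java.util.Locale;

public class RegionSubtagCheck {
    public static void main(String[] args) {
        Locale latam = Locale.forLanguageTag("es-419");
        System.out.println(latam.getLanguage()); // es
        System.out.println(latam.getCountry());  // 419: the numeric region survives parsing

        // With CLDR locale data (the JDK default since version 9) this
        // should print something like "Spanish (Latin America)"
        System.out.println(latam.getDisplayName(Locale.ENGLISH));
    }
}
```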

If you have an application localized in pt-BR and pt-PT, which language should you choose if the system is reporting only the "pt" code?

If you have an application localized in pt-BR and pt-PT, which language should you choose if the system reports only the pt code (generic Portuguese)?
This question is independent of the nature of the application: desktop, mobile or browser-based. Let's assume you are not able to get region information from another source and you have to choose one language as the default.
The question applies just as well to other cases, including:
pt-PT and pt-BR
en-US and en-GB
fr-FR and fr-CA
zh-CN, zh-TW, .... In fact, in this case I know that zh can be used as the predominant language for Simplified Chinese, where the full code is zh-Hans. For Traditional Chinese, with codes like zh-TW, zh-Hant-TW, zh-HK and zh-MO, the proper (canonical) code should be zh-Hant.
Q1: How do I determine the predominant language for a given meta-language?
I need a solution that will cover at least Portuguese, English and French.
Q2: If the system reports Simplified Chinese (PRC) (zh-CN) as the user's preferred language and I have translations only for English and Traditional Chinese (en, zh-TW), which of the two options should I choose: en or zh-TW?
In general you should separate the "guess the missing parameters" problem from the "matching a list of locales I want vs. a list of locales I have" problem. They are different.
Guessing the missing parts
These are all tricky areas, and even (potentially) politically charged.
But with very few exceptions the rule is to select the "original country" of the language.
The exceptions are mostly based on population.
So fr-FR for fr, es-ES for es, and so on.
Some exceptions: pt-BR instead of pt-PT, en-US instead of en-GB.
It is also commonly accepted (and required by the Chinese standards) that zh maps to zh-CN.
You might also have to look at the country to determine the script, or the other way around.
For instance, az => az-AZ, but az-Arab => az-Arab-IR, and az_IR => az_Arab_IR.
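ICU exposes exactly this kind of guessing as the CLDR "likely subtags" data (the uloc_addLikelySubtags function listed in the references below). A short sketch of the Java equivalent in ICU4J, assuming the com.ibm.icu dependency; the az examples match the ones above:

```java
import com.ibm.icu.util.ULocale;

public class LikelySubtags {
    public static void main(String[] args) {
        // Expand a bare language (or a partial tag) to the likely full locale
        System.out.println(ULocale.addLikelySubtags(new ULocale("zh")));    // zh_Hans_CN
        System.out.println(ULocale.addLikelySubtags(new ULocale("az")));    // az_Latn_AZ
        System.out.println(ULocale.addLikelySubtags(new ULocale("az_IR"))); // az_Arab_IR

        // minimizeSubtags() goes the other way, removing subtags that the
        // likely-subtags data can reconstruct
        System.out.println(ULocale.minimizeSubtags(new ULocale("zh_Hans_CN"))); // zh
    }
}
```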
Matching 'want' vs. 'have'
This involves matching a list of want vs. a list of have languages.
Dealing with lists makes it harder. And the result should also be sorted in a smart way, if possible (for instance, if want = [ fr ro ] and have = [ en fr_CA fr_FR ro_RO ], then you probably want [ fr_FR fr_CA ro_RO ] as the result).
There should be no match between language with different scripts. So zh-TW should not fallback to zh-CN, and mn-Mong should not fallback to mn-Cyrl.
Tricky areas: sr-Cyrl should not fallback to sr-Latn in theory, but it might be understood by users. ro-Cyrl might fallback to ro-Latn, but not the other way around.
Some references
RFC 4647 deals with language fallback (but is not very useful in this case, because it follows the "cut from the right" rule).
ICU 4.2 and newer (draft in 4.0, I think) has uloc_addLikelySubtags (and uloc_minimizeSubtags) in uloc.h. That implements http://www.unicode.org/reports/tr35/#Likely_Subtags
Also in ICU's uloc.h there are uloc_acceptLanguageFromHTTP and uloc_acceptLanguage, which deal with want vs. have. But they are somewhat useless as they stand, because they take a UEnumeration* as input, and there is no public API to build a UEnumeration.
There is some work on language matching going beyond the simple RFC 4647. See http://cldr.unicode.org/development/design-proposals/languagedistance
Locale matching in ActionScript at http://code.google.com/p/as3localelib/
The APIs in the new Flash Player 10.1 flash.globalization namespace do both tag guessing and language matching (http://help.adobe.com/en_US/FlashPlatform/beta/reference/actionscript/3/flash/globalization/package-detail.html). It works on TR-35 and can look beyond the # and consider the operation. For instance, if have = [ ja ja#collation=radical ja#calendar=japanese ] and want = [ ja#calendar=japanese;collation=radical ], then the best match depends on the operation you want: for date formatting, ja#calendar=japanese is the better match, but for collation you want ja#collation=radical.
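Java's standard library happens to implement the plain RFC 4647 filtering and lookup mentioned in the references above, which makes the want-vs-have mechanics (and the "cut from the right" limitation) easy to demonstrate. A minimal sketch, reusing the want/have lists from the example earlier:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Locale.LanguageRange;

public class WantVsHave {
    public static void main(String[] args) {
        // "want": the user's priority list; "have": the available translations
        List<LanguageRange> want = LanguageRange.parse("fr,ro;q=0.9");
        List<Locale> have = Arrays.asList(
                Locale.forLanguageTag("en"),
                Locale.forLanguageTag("fr-CA"),
                Locale.forLanguageTag("fr-FR"),
                Locale.forLanguageTag("ro-RO"));

        // RFC 4647 filtering: returns all matches, ordered by the priority
        // list, but within one range it keeps the input order -- it has no
        // notion that fr_FR would be the smarter first pick for "fr"
        System.out.println(Locale.filter(want, have)); // [fr_CA, fr_FR, ro_RO]

        // RFC 4647 lookup truncates the *range* from the right, so the range
        // "fr" never matches the longer tags fr-CA / fr-FR: exactly the
        // weakness noted above
        System.out.println(Locale.lookup(want, have)); // null
    }
}
```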
Do you expect to have more users in Portugal or in Brazil? Pick accordingly.
For your general solution, you can find out by reading up on Ethnologue.

Can someone explain ja_JP.UTF8?

I know UTF-8, but what is the difference between the *.utf8 locales?
From the answer to my post
Locale = ja_JP
Encoding = UTF-8
Before Unicode, handling non-English characters was done using tricks like code pages and special character sets (such as Shift_JIS). UTF-8 covers a much larger range of characters with a completely different mapping system (i.e. the way each character is addressed by number).
When setting ja_JP.UTF8 as the locale, the "UTF8" part signifies the encoding used for the special characters needed. For example, when you output a currency amount in the Japanese locale, you need the ¥ character. The encoding information defines which character set is used to represent the ¥.
I'm assuming there could also be a ja_JP.Shift_JIS locale. One difference from the UTF-8 one, among others, would be that the ¥ sign is encoded in a way that works in that specific encoding.
Why ja_JP?
The two codes in ja_JP signify the language (I think based on the ISO 639 norm) and the country (based on ISO 3166). This is important if a language is spoken in more than one country. In the German-speaking area, for example, the Swiss format numbers differently from the Germans: 1'000'000 vs. 1.000.000. The country code serves to define these distinctions within the same language.
In which context? ja_JP tells us that the string is in the Japanese language. That does not have anything to do with the character encoding, but is probably used, depending on context, for sorting, keyboard input and the language of text displayed in the program.
At a guess, I'd say each utf8 file with that naming convention contains a language definition for translating your site.
It's a locale name. The basic format is language_COUNTRY. ja = Japanese language, JP = Japan.
In addition to a date format, currency symbol, etc., each locale is associated with a character encoding. This is a historical legacy from the days when every language had its own encoding. Now, UTF-8 provides a common encoding for every locale.
The reason .UTF8 is part of the locale name is to distinguish it from older locales with a different encoding. For example, there's a ja_JP.EUC-JP locale available on my system. And for Germany, there's the choice of de_DE (obsolete pre-Euro locale with ISO-8859-1 encoding), de_DE@euro (ISO-8859-15 encoding, to provide the € sign), and de_DE.UTF-8.
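To make the naming scheme concrete, here is a minimal parsing sketch (in Java, purely for illustration) that splits a POSIX locale name of the form language_COUNTRY.ENCODING into its parts; it assumes a well-formed name and ignores @modifier suffixes like @euro:

```java
import java.nio.charset.Charset;
import java.util.Locale;

public class PosixLocaleName {
    public static void main(String[] args) {
        String name = "ja_JP.UTF-8"; // language _ COUNTRY . ENCODING

        String[] localeAndEncoding = name.split("\\.", 2);
        String[] langAndCountry = localeAndEncoding[0].split("_", 2);

        Locale locale = new Locale.Builder()
                .setLanguage(langAndCountry[0])
                .setRegion(langAndCountry[1])
                .build();
        Charset charset = Charset.forName(localeAndEncoding[1]);

        System.out.println(locale.getDisplayName(Locale.ENGLISH)); // Japanese (Japan)
        System.out.println(charset.name());                        // UTF-8
    }
}
```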
