ICU - How to specify custom formats inside pattern strings?

Does ICU or any of its implementations allow specifying custom formats inside pattern strings? Here is what I am looking to do:
English:
"{0} likes {1}."
Polish:
"{0} lubi {1, accusative}."
Russian:
"{0, dative} нравится {1}."
Here, accusative and dative are language-specific formatters that return the inflected value of their arguments.
I am aware of the existence of setFormat(), but that doesn't work for me because grammatical cases are language-specific and need to be set by translators rather than programmers.
Do you know any other localization frameworks that do support this?
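For reference, here is a minimal sketch in Java of what the setFormat() route looks like (java.text.MessageFormat; ICU4J's MessageFormat exposes the same setFormat* methods). AccusativeFormat and the demo class are hypothetical names for illustration; the point is that the formatter is attached in code, where translators cannot reach it:

import java.text.FieldPosition;
import java.text.Format;
import java.text.MessageFormat;
import java.text.ParsePosition;
import java.util.Locale;

// Hypothetical formatter that would inflect a Polish noun into the accusative.
class AccusativeFormat extends Format {
    @Override
    public StringBuffer format(Object obj, StringBuffer toAppendTo, FieldPosition pos) {
        return toAppendTo.append(inflect(obj.toString()));
    }

    private String inflect(String noun) {
        return noun; // real inflection logic would go here
    }

    @Override
    public Object parseObject(String source, ParsePosition pos) {
        pos.setIndex(source.length());
        return source;
    }
}

public class SetFormatDemo {
    public static void main(String[] args) {
        MessageFormat mf = new MessageFormat("{0} lubi {1}.", new Locale("pl"));
        // The formatter is wired up by the programmer; nothing in the pattern
        // string itself tells the translator that argument 1 will be inflected.
        mf.setFormatByArgumentIndex(1, new AccusativeFormat());
        System.out.println(mf.format(new Object[] {"Anna", "koty"}));
    }
}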

Related

Does the FEEL language built-in string function 'replace' affect the first match or all occurrences of the search pattern?

The Decision Model and Notation (DMN) FEEL language has many built-in functions.
For strings, one function is replace. It accepts an input string, a regex pattern, a replacement string, and optional flags.
Does replace act only on the first regex match or does it replace all matches? The DMN version 1.3 specification, page 138, does not seem to address this.
To answer your question: it replaces all matches.
Another valid example:
replace("banana","a","o") = "bonono"
taken from the agreed behaviour test cases of the DMN TCK project.
I agree that the DMN specification document from OMG could list some more down-to-earth examples :)
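If a Java analogy helps make the behaviour concrete, FEEL's replace acts like Java's String.replaceAll, which also substitutes every occurrence of the pattern (ReplaceDemo is just an illustrative name):

public class ReplaceDemo {
    public static void main(String[] args) {
        // Like FEEL's replace, Java's replaceAll substitutes every match,
        // not just the first one.
        System.out.println("banana".replaceAll("a", "o")); // prints "bonono"
    }
}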

Explicit Plural string using iOS Stringsdict files

I am getting started learning iOS Stringsdict files and found some existing code on a project which used the following syntax:
<key>zero</key>
<string>You no message.</string>
As per the CLDR, zero is not a valid plural category in English, and we expect to use explicit plural rules (=0 when using ICU MessageFormat).
I tried to find how to use explicit plural rules in iOS Stringsdict files and could not find any way to achieve this. Can someone confirm if this is supported or not?
Examples of possible solutions (I cannot test them, but maybe someone can?):
<key>0</key>
<string>You no message.</string>
Or
<key>=0</key>
<string>You no message.</string>
Extra reference on explicit plural rules, which are part of the CLDR implementation of ICU MessageFormat:
https://formatjs.io/guides/message-syntax/#plural-format
=value
This is used to match a specific value regardless of the plural categories of the current locale.
If you are only interested in the zero rule, it is handled in .stringsdict files for any language.
Source: Foundation Release Notes for OS X v10.9
If "zero" is present, the value is used for mapping the argument value zero regardless of what CLDR rule specifies for the numeric value.
Otherwise, these are the only rules handled (depending on the language): zero, one, two, few, many, other
Short Answer
.stringsdict files have no way to support explicit plural rules (other than a custom Apple implementation of zero, which is detailed below)
Detailed Answer
Normal CLDR implementation:
All rules that are not in the CLDR for a given language will be ignored
If using the rule zero, it will use the CLDR values (most languages have 0 as the only value for zero). This also includes languages like Latvian, which have 20, 30, etc. mapped to zero, and this also contradicts Apple's own documentation (this behavior was verified):
If "zero" is present, the value is used for mapping the argument value
zero regardless of what CLDR rule specifies for the numeric value.
Source: Foundation Release Notes for OS X v10.9
Custom (Apple) CLDR implementation:
All languages can use the zero category from the CLDR even if the rule is not defined for this language (reference here)
Presumably, they implemented this to facilitate negative forms of sentences, which is a common use case (this can even be found in their examples). For example, instead of writing:
You have 0 emails.
You can write:
You have no emails.
This is a very common use case, but it is typically not covered by CLDR categories; instead it is handled with explicit values. For example, in ICU MessageFormat you use =0, not zero, for negative forms.
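To illustrate, here is a small ICU4J sketch of the difference between =0 and zero (the pattern syntax is standard ICU MessageFormat; the class name is just for the example):

import com.ibm.icu.text.MessageFormat;
import com.ibm.icu.util.ULocale;

public class PluralDemo {
    public static void main(String[] args) {
        // "=0" matches the literal value 0 in any locale; the "zero" keyword
        // only matches when the locale's CLDR rules select that category.
        MessageFormat mf = new MessageFormat(
                "{0, plural, =0 {You have no emails.} one {You have # email.} other {You have # emails.}}",
                ULocale.ENGLISH);
        System.out.println(mf.format(new Object[] {0})); // You have no emails.
        System.out.println(mf.format(new Object[] {1})); // You have 1 email.
        System.out.println(mf.format(new Object[] {5})); // You have 5 emails.
    }
}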
While this seems convenient, it creates a big problem: what if you want to use negative forms for Latvian using the zero category? You simply can't; Apple essentially broke linguistic rules by overriding the CLDR.
Complementary details:
There are only two languages in the CLDR where the zero category covers more than just the value 0:
Latvian: 1.3 million speakers worldwide
Prussian: dead language since the 18th century
Neither iOS nor macOS is available in Latvian, but both support Latvian locale settings (keyboard and date formats)
This means that there are probably few applications that will support Latvian, unless they have a manual way to change the language inside the application itself (a less common scenario on iOS, which typically honors the device's settings)
Conclusion
Tip #1: If you need to use Latvian, you should probably avoid using zero for negative forms, and use code instead, with strings outside of the stringsdict file
Tip #2: Make sure that your translation process supports this behavior correctly!

Is there a standard for localized locale codes?

I am currently working on a project that would benefit from localized locale codes. For example, RFC 5646 and the parent-standard BCP 47 define locale codes for various locales, such as en-GB for British English and zh-Hans-SG for Singaporean Chinese using simplified Chinese characters. Unfortunately, these codes use only a small subset of the Latin alphabet.
I am looking for a similar standard or commonly used system that defines a set of language codes in the respective writing system of each language (somewhat akin to an autoglossonym).
EDIT: I am strictly seeking localized locale codes since in the problem's context (URI i18n/l10n), it would be unreasonable to use an autoglossonym or other verbose equivalent.
Locale codes as specified by RFC 5646 and BCP 47 are meant to be machine-parseable. Thus, en-GB is "English (Great Britain)" and zh-Hans-SG is "Chinese (Singapore, Simplified Chinese Script)".
They are designed so that web pages, e-books and other documents can specify, in a standard way, the language and script they are written in.
Thus, each language, script and country is given a unique code from the respective standards and collated in the IANA Language Subtag Registry (http://www.iana.org/assignments/language-subtag-registry/language-subtag-registry).
For a localized version of this, you are better off mapping the codes to a localized name (e.g. localizing the Description field of the subtag registry database, or using a project like iso-codes) and formatting that in a presentable way, keeping the locale code as an internal representation.
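As a sketch of that mapping approach, the JDK can already render localized display names from its bundled locale data (ICU4J's ULocale offers an equivalent getDisplayName), so the BCP 47 tag stays the internal representation:

import java.util.Locale;

public class DisplayNameDemo {
    public static void main(String[] args) {
        Locale british = Locale.forLanguageTag("en-GB");
        // Keep "en-GB" as the internal representation; localize only the label.
        System.out.println(british.getDisplayName(Locale.ENGLISH)); // English (United Kingdom)
        System.out.println(british.getDisplayName(Locale.FRENCH));  // anglais (Royaume-Uni)
        System.out.println(british.getDisplayName(british));        // the name in the locale's own language
    }
}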

StrLComp vs AnsiStrLComp when called with Unicode strings

I'm having a bit of confusion regarding the "Ansi" vs "regular" RTL string functions when called with Unicode strings. I understand that under older versions of Delphi (when AnsiString was the default) the "Ansi" versions handled multibyte characters. Does this mean anything when dealing with Unicode strings? Assuming that I need to handle Korean characters and also that my code does not have to be compatible with older Delphi versions, which RTL functions should be used?
The 'Ansi' prefix of the string compare functions really never signified anything other than that the locale was taken into account when comparing strings instead of doing "just" a simple binary comparison. In the Unicode world this is still the case. The Ansi* family of functions also take (Unicode) strings as their parameters and take the locale into account when doing the comparison.
From the AnsiCompareStr doc (D2009):
Most locales consider lowercase characters to be less than the
corresponding uppercase characters. This is in contrast to ASCII
order, in which lowercase characters are greater than uppercase
characters. Thus, setting S1 to 'a' and S2 to 'A' causes
AnsiCompareStr to return a value less than zero, while CompareStr,
with the same arguments, returns a value greater than zero.
What the effect of "taking the locale into account" may be differs per locale. It may have to do with accented characters or not. In Unicode versions it may actually take into account how the characters are composed. For example an accented e (é) may be encoded exactly like that but may also be encoded as two separate items: the accent and the e.
Both the Ansi* and the "normal" string compare functions are included in the SysUtils unit. They all take strings as their parameters and in Unicode Delphi that does indeed mean UnicodeStrings.
If you need to work with AnsiStrings then you need to use the AnsiStrings unit. It has the same set of string compare functions, but in this unit they all take AnsiStrings as their parameters.
Now, if you don't need compatibility with older versions: use the standard functions from SysUtils. Use the normal ones if byte comparison is enough. Use the Ansi ones if you need to take locale considerations into account.
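If a runnable illustration of binary versus locale-aware comparison helps, the same distinction exists in Java's Collator; this is only an analogy to CompareStr versus AnsiCompareStr, not Delphi code:

import java.text.Collator;
import java.util.Locale;

public class CompareDemo {
    public static void main(String[] args) {
        // Binary comparison: 'a' (U+0061) sorts after 'A' (U+0041).
        System.out.println("a".compareTo("A") > 0); // true

        // Locale-aware comparison: most locales order lowercase before the
        // corresponding uppercase, as the AnsiCompareStr doc quoted above says.
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        System.out.println(collator.compare("a", "A") < 0); // true
    }
}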
Not sure what exactly you want to do, but...
if you want to compare two strings by your current user locale rules, use AnsiStrLComp for case sensitive comparison or AnsiStrLIComp for case insensitive comparison. Internally these functions use the CompareString function with the LOCALE_USER_DEFAULT locale set
if you want to compare two strings using the Delphi internal comparing mechanism, use the StrLComp function for case sensitive comparison or StrLIComp for case insensitive comparison
So if you compare the same two strings with AnsiStrLComp or AnsiStrLIComp on machines with different user locale settings, you may get different results, but on the other hand you get natural sorting for the user's language settings in your application.
StrLComp and StrLIComp will work the same way on all machines, locale independently.
The simple answer is that when it comes to Delphi string routines, you should use the Ansi*() functions for Unicode strings.
However, if you are comparing strings (among other things) then you may also need to consider normalising those strings first, depending on the nature and needs of your application (and the source of its strings), to deal with Unicode equivalence.
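On the normalization point, here is a short Java sketch of Unicode equivalence: the two encodings of é mentioned above compare unequal until both are normalized:

import java.text.Normalizer;

public class NormalizeDemo {
    public static void main(String[] args) {
        String composed = "\u00E9";    // é as a single precomposed code point
        String decomposed = "e\u0301"; // e followed by a combining acute accent
        System.out.println(composed.equals(decomposed)); // false

        // Normalizing both sides (here to NFC) makes canonically equivalent
        // strings compare equal.
        System.out.println(Normalizer.normalize(decomposed, Normalizer.Form.NFC)
                .equals(Normalizer.normalize(composed, Normalizer.Form.NFC))); // true
    }
}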

If you have an application localized in pt-br and pt-pt, which language should you choose if the system is reporting only the "pt" code?

If you have an application localized in pt-br and pt-pt, which language should you choose if the system is reporting only the pt code (generic Portuguese)?
This question is independent of the nature of the application, desktop, mobile or browser based. Let's assume you are not able to get region information from another source and you have to choose one language as the default one.
The question applies to other cases as well, including:
pt-pt and pt-br
en-us and en-gb
fr-fr and fr-ca
zh-cn, zh-tw, ... - in fact, in this case I know that zh can be used as the predominant language for Simplified Chinese, where the full code is zh-hans. For Traditional Chinese, with codes like zh-tw, zh-hant-tw, zh-hk, and zh-mo, the proper (canonical) code should be zh-hant.
Q1: How do I determine the predominant languages for a specified meta-language?
I need a solution that will include at least Portuguese, English and French.
Q2: If the system reported Simplified Chinese (PRC) (zh-cn) as preferred language of the user and I have translation only for English and Traditional Chinese (en,zh-tw) what should I choose from the two options: en or zh-tw?
In general you should separate the "guess the missing parameters" problem from the "matching a list of locales I want vs. a list of locales I have" problem. They are different.
Guessing the missing parts
These are all tricky areas, and even (potentially) politically charged.
But with very few exceptions the rule is to select the "original country" of the language.
The exceptions are mostly based on population.
So fr-FR for fr, es-ES for es, etc.
Some exceptions: pt-BR instead of pt-PT, en-US instead of en-GB.
It is also commonly accepted (and required by the Chinese standards) that zh maps to zh-CN.
You might also have to look at the country to determine the script, or the other way around.
For instance az => az-AZ but az-Arab => az-Arab-IR, and az_IR => az_Arab_IR
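This guessing step is exactly what CLDR calls "likely subtags" (see the ICU references below); a minimal ICU4J sketch, assuming ICU4J is on the classpath:

import com.ibm.icu.util.ULocale;

public class LikelySubtagsDemo {
    public static void main(String[] args) {
        // addLikelySubtags fills in the most likely script and region
        // from the CLDR likely-subtags data.
        System.out.println(ULocale.addLikelySubtags(new ULocale("pt"))); // pt_Latn_BR
        System.out.println(ULocale.addLikelySubtags(new ULocale("en"))); // en_Latn_US
        System.out.println(ULocale.addLikelySubtags(new ULocale("zh"))); // zh_Hans_CN
    }
}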
Matching 'want' vs. 'have'
This involves matching a list of want vs. a list of have languages.
Dealing with lists makes it harder. And the result should also be sorted in a smart way, if possible (for instance, if want = [ fr ro ] and have = [ en fr_CA fr_FR ro_RO ], then you probably want [ fr_FR fr_CA ro_RO ] as the result).
There should be no match between language with different scripts. So zh-TW should not fallback to zh-CN, and mn-Mong should not fallback to mn-Cyrl.
Tricky areas: sr-Cyrl should not fallback to sr-Latn in theory, but it might be understood by users. ro-Cyrl might fallback to ro-Latn, but not the other way around.
Some references
RFC 4647 deals with language fallback (but is not very useful in this case, because it follows the "cut from the right" rule); see the JDK filtering sketch after this list.
ICU 4.2 and newer (draft in 4.0, I think) has uloc_addLikelySubtags (and uloc_minimizeSubtags) in uloc.h. That implements http://www.unicode.org/reports/tr35/#Likely_Subtags
Also in ICU uloc.h there are uloc_acceptLanguageFromHTTP and uloc_acceptLanguage that deal with want vs have. But kind of useless as they are, because they take a UEnumeration* as input, and there is no public API to build a UEnumeration.
There is some work on language matching going beyond the simple RFC 4647. See http://cldr.unicode.org/development/design-proposals/languagedistance
Locale matching in ActionScript at http://code.google.com/p/as3localelib/
The APIs in the new Flash Player 10.1 flash.globalization namespace do both tag guessing and language matching (http://help.adobe.com/en_US/FlashPlatform/beta/reference/actionscript/3/flash/globalization/package-detail.html). It works on TR-35 and can look beyond the # and consider the operation. For instance, if have = [ ja ja#collation=radical ja#calendar=japanese ] and want = [ ja#calendar=japanese;collation=radical ] then the best match depends on the operation you want. For date formatting ja#calendar=japanese is the better match, but for collation you want ja#collation=radical
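As promised next to the RFC 4647 reference above, here is a minimal JDK sketch of want-versus-have matching using java.util.Locale's built-in RFC 4647 filtering; it implements only the simple rules, so the script-fallback subtleties discussed above still need extra care:

import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class MatchDemo {
    public static void main(String[] args) {
        // "want": the user's preferences, in priority order.
        List<Locale.LanguageRange> want = Locale.LanguageRange.parse("fr,ro");
        // "have": the translations actually available.
        List<Locale> have = Arrays.asList(
                Locale.forLanguageTag("en"),
                Locale.forLanguageTag("fr-CA"),
                Locale.forLanguageTag("fr-FR"),
                Locale.forLanguageTag("ro-RO"));

        // Locale.filter implements RFC 4647 basic filtering and returns all
        // matches ordered by the priority of the ranges, e.g. [fr_CA, fr_FR, ro_RO].
        System.out.println(Locale.filter(want, have));
    }
}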
Do you expect to have more users in Portugal or in Brazil? Pick accordingly.
For your general solution, you can find out by reading up on Ethnologue.
