I am getting started learning iOS Stringsdict files and found some existing code on a project which used the following syntax:
<key>zero</key>
<string>You no message.</string>
As per the CLDR, zero is not a valid plural category in English, and we would expect to use explicit plural rules (=0 when using ICU MessageFormat).
I tried to find out how to use explicit plural rules in iOS Stringsdict files and could not find any way to achieve this. Can someone confirm whether this is supported or not?
Examples of possible solutions (I cannot test them, but maybe someone can?):
<key>0</key>
<string>You no message.</string>
Or
<key>=0</key>
<string>You no message.</string>
Extra reference on explicit plural rules, which are part of the CLDR implementation of ICU MessageFormat:
https://formatjs.io/guides/message-syntax/#plural-format
=value
This is used to match a specific value regardless of the plural categories of the current locale.
If you are interested in the zero rule only, it is handled in the .stringsdict file for any language.
Source: Foundation Release Notes for OS X v10.9
If "zero" is present, the value is used for mapping the argument value zero regardless of what CLDR rule specifies for the numeric value.
Otherwise, these are the only rules handled (depending on language): zero, one, two, few, many, other
Short Answer
.stringsdict files have no way to support explicit plural rules (other than a custom Apple implementation of zero, which is detailed below)
Detailed Answer
Normal CLDR implementation:
All rules that are not in the CLDR for a given language will be ignored
If using the rule zero, it will use the CLDR values (most languages have 0 as the value for zero). This also includes languages like Latvian, which have 20, 30, etc. also mapped to zero, and this contradicts Apple's own documentation (this behavior was verified):
If "zero" is present, the value is used for mapping the argument value
zero regardless of what CLDR rule specifies for the numeric value.
Source: Foundation Release Notes for OS X v10.9
Custom (Apple) CLDR implementation:
All languages can use the zero category from the CLDR even if the rule is not defined for this language (reference here)
Presumably, they implemented this to facilitate negative forms of sentences, which is a common use case (it can even be found in their examples). For example, instead of writing:
You have 0 emails.
You can write:
You have no emails.
This is a very common use case, but it is typically not covered using CLDR categories; it is handled using explicit values. For example, in ICU MessageFormat you would use =0, not zero, for negative forms.
While this seems convenient, it creates a big problem: what if you want to use negative forms for Latvian using the zero category? You simply can't - basically, Apple broke linguistic rules by overriding the CLDR.
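For illustration, here is what a minimal stringsdict entry using Apple's zero category for a negative form could look like (the emails key name and wording are made up for this example; the NSString* keys are the standard stringsdict keys):
<key>emails</key>
<dict>
    <key>NSStringLocalizedFormatKey</key>
    <string>%#@emails@</string>
    <key>emails</key>
    <dict>
        <key>NSStringFormatSpecTypeKey</key>
        <string>NSStringPluralRuleType</string>
        <key>NSStringFormatValueTypeKey</key>
        <string>d</string>
        <key>zero</key>
        <string>You have no emails.</string>
        <key>one</key>
        <string>You have %d email.</string>
        <key>other</key>
        <string>You have %d emails.</string>
    </dict>
</dict>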
Complementary details:
There are only two languages in the CLDR where zero does not equal 0:
Latvian: 1.3 million speakers worldwide
Prussian: dead language since the 18th century
Neither iOS nor macOS is available in the Latvian language, but both support Latvian locale settings (keyboard and date formats)
This means that there are probably few applications that will support Latvian, unless they have a manual way to change the language inside the application itself (this is a less common scenario for iOS, which typically honors the device's settings)
Conclusion
Tip #1: If you need to support Latvian, you should probably avoid using zero for negative forms and use code instead, with strings outside of the stringsdict file (see the sketch after these tips)
Tip #2: Make sure that your translation process supports this behavior correctly!
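A minimal sketch of Tip #1 in Swift, assuming hypothetical key names (no_emails in Localizable.strings, emails_count in Localizable.stringsdict): branch on zero in code, so the stringsdict zero category stays free for languages like Latvian.
import Foundation

// Sketch only: both key names are hypothetical.
func emailsLabel(count: Int) -> String {
    if count == 0 {
        // Negative form kept as a plain string in Localizable.strings
        return NSLocalizedString("no_emails", comment: "Negative form")
    }
    // Plural-aware format string resolved through Localizable.stringsdict
    let format = NSLocalizedString("emails_count", comment: "Email count")
    return String.localizedStringWithFormat(format, count)
}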
Related
The code below results in "10.000". Where I live this means "ten thousand".
format!("{:.3}", 10.0);
I would like the output to be "10,000".
There is no support for internationalization (i18n) or localization (l10n) baked into the Rust standard library.
There are several reasons, in no particular order:
a locale-dependent output should be a conscious choice, not a default,
i18n and l10n are much more complicated than just formatting numbers,
the Rust std aims at being small.
The format! machinery is going to be used to write JSON or XML files. You really do NOT want to end up with a differently formatted file depending on the locale of the machine that encoded it. It's a recipe for disaster.
The detection of locale at run-time is also optimization unfriendly. Suddenly you cannot pre-compute things at compile-time (even partially), you cannot even know which size of buffer to allocate at compile-time.
And this ties in with a dubious usefulness. Dates and numbers are arguably important, however this American vs English formatting war is ultimately a drop in the ocean. A French grammar schooler will certainly appreciate that the number is formatted in the typical French format... but it will be of no avail to her if the surrounding text is in English (we French are notoriously bad at teaching/learning foreign languages). Locale should influence language selection, sorting order, etc... merely changing the format of numbers is pointless, everything should switch with it, and this requires much more serious support (check gettext for a C library that provides a good base).
Basing the detection of the locale on the host locale, and it being global to the whole process, is also a very dubious architectural choice in this age of multi-threaded web servers. Imagine if Facebook was served in Swedish in Europe just because its datacenter is running there.
Finally, all this language/date/... support requires a humongous amount of data. ICU has several dozens (or is it hundreds?) of MBs of such data embedded inside it. This would make the size of the std explode, and make it completely unsuitable for embedded development, which probably does not care about this anyway.
Of course, you could cut down on this significantly if you only chose to support a handful of languages... which is yet another argument for putting this outside the standard library.
Since the standard library doesn't have this functionality (localization of number format), you can just replace the dot with a comma:
fn main() {
    // Format to three decimal places, then swap the first "." for ",".
    let formatted = format!("{:.3}", 10.0).replacen(".", ",", 1);
    println!("{}", formatted); // prints "10,000"
}
There are other ways of doing this, but this is probably the most straightforward solution.
This is not the role of the format! macro. This should be handled by Rust itself. Unfortunately, my search led me to the conclusion that Rust doesn't handle locales (yet?).
There is a library, rust-locale, but it is still in alpha.
I need to handle declensions. The app is in a different language (Czech), where words change for singular and plural, and based on gender as well.
Example in English
1 item, 2 items, 5 items, ...
In target language => Czech Language
1 položka, 2 položky, 5 položek, ...
I have found a few repositories that I am currently going through.
https://github.com/adamelliot/Inflections
https://github.com/mattt/InflectorKit
On Android, there is a way to do it via XML. Is there any recommended way to handle this on iOS? I don't want to use ifs or switches.
Thank you for any suggestions.
Matti
In iOS (and other Apple platforms), plural declensions and other localized strings that change along with numeric values run through the same API as other localized strings. So you just call NSLocalizedString in your code (no extra ifs or switches), but you provide more supporting localized data — in addition to the Localizable.strings file that contains the main (number-independent) localized strings, you add a stringsdict file for the declensions.
Apple's docs run through this step-by-step: see Handling Noun Plurals and Units of Measurement in their Internationalization and Localization Guide. The example there is Russian, which IIUC has similar plural rules to Czech... but the stringsdict format supports the full set of Unicode CLDR Plural Rules if you need to handle more.
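As a rough sketch of the calling side once the stringsdict entry exists (the items_count key is hypothetical; the Czech Localizable.stringsdict would supply the one/few/many/other variants required by CLDR):
import Foundation

// "items_count" is a hypothetical key defined in Localizable.stringsdict.
let format = NSLocalizedString("items_count", comment: "Item count label")
print(String.localizedStringWithFormat(format, 1)) // 1 položka
print(String.localizedStringWithFormat(format, 2)) // 2 položky
print(String.localizedStringWithFormat(format, 5)) // 5 položek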
Let's say I have a field which accepts A-Z, a-z, 0-9. If I'm trying to communicate to someone, via documentation or API creation, "what" my code can accept, I HAVE to say:
A-Z,a-z,0-9
Now, in my mind, this is restrictive and error-prone.
Compare that to what I'm proposing.
Suppose A-Z,a-z,0-9 were allocated the "code" ANSI456.
When I'm communicating that to someone, I can say that my code accepts ANSI456. If someone else were developing a check, there would be no confusion about what my code can or cannot accept.
To those who will suggest just specifying character ranges, please note that what I'm envisioning will handle scenarios where even this is defined as a valid "code":
0-9, +, -, *, /
In fact, if it's done properly, we can have a site generate automatic code in various languages to accommodate the different "codes".
Okay - I KNOW there are ~infinite values, e.g.:
a-z
is different from
a-l,n-z
And these would have two different codes in this "system".
I'm not proposing a HUMAN-moderated system - it can be a completely automatic BUT systematic way of generating these "codes".
There already is such a standard, although it doesn't have the word "standard" in its name. It is called Perl 5 compatible regular expressions, and it is used in Perl 5, Java, JavaScript, libpcre and many other contexts.
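For example, the two sets from the question map directly onto PCRE character classes, which makes the "code" self-describing:
^[A-Za-z0-9]+$    one or more ASCII letters or digits
^[0-9+*/-]+$      digits plus the four arithmetic operators (the trailing - is literal)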
I have what I imagine will be a fairly involved technical challenge: I want to be able to reliably alpha-rename identifiers in multiple languages (as many as possible). This will require special consideration for each language, and I'm asking for advice for how to minimize the amount of work I need to do by sharing code. Something like a unified parsing or abstract syntax framework that already has support for many languages would be great.
For example, here is some python code:
def foo(x):
    def bar(y):
        return x+y
    return bar
An alpha renaming of x to y changes the x to a y and preserves semantics. So it would become:
def foo(y):
    def bar(y1):
        return y+y1
    return bar
See how we needed to rename y to y1 in order to keep from breaking the code? That is why this is a hard problem. It seems like the program would have to have a pretty good knowledge of what constitutes a scope, rather than just doing, say, a string search and replace.
I would also like to preserve as much of the formatting as possible: comments, spacing, indentation. But that is not 100% necessary, it would just be nice.
Any tips?
To do this safely, you need to be able to determine
all the identifiers (and those things that are not, e.g., the middle of a comment) in your code
the scopes of validity for each identifier
the ability to substitute a new identifier for an old one in the text
the ability to determine if renaming an identifier causes another name to be shadowed
To determine identifiers accurately, you need at least a language-accurate lexer. Identifiers in PHP look different than they do in COBOL.
To determine scopes of validity, you have to determine program structure, in practice, since most "scopes" are defined by such structure. This means you need a language-accurate parser; scopes in PHP are different than scopes in COBOL.
To determine which names are valid in which scopes, you need to know the language's scoping rules. Your language may insist that the identifier X refers to different Xes depending on the context in which X is found (consider object constructors named X with different arguments). Now you need to be able to traverse the scope structures according to the naming rules. Single inheritance, multiple inheritance, overloading, and default types will pretty much require you to build a model of the scopes for the programs, insert the identifiers and corresponding types into each scope, and then climb from the point of encounter of an identifier in the program text through the various scopes according to the language semantics. You will need symbol tables, inheritance linkages, ASTs, and the ability to navigate all of these. These structures are different for PHP and COBOL, but they share lots of common ideas, so you likely need a library with the common concept support.
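As a toy illustration of that scope-chain idea (a sketch only, not tied to any particular tool), name resolution climbs from the innermost scope outward, which is also the walk a renamer must do to detect captures:
// Toy symbol table: each scope maps names to a declaration note and
// points at its lexical parent.
final class Scope {
    let parent: Scope?
    var symbols: [String: String] = [:] // name -> declaration site (toy)
    init(parent: Scope? = nil) { self.parent = parent }

    // Climb outward until the name is found or the chain ends.
    func resolve(_ name: String) -> String? {
        symbols[name] ?? parent?.resolve(name)
    }
}

let outer = Scope()
outer.symbols["x"] = "declared in outer scope"
let inner = Scope(parent: outer)
inner.symbols["y"] = "declared in inner scope"
print(inner.resolve("x") ?? "unresolved") // found via the parent chain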
To rename an identifier, you have to modify the text. In a million lines of code, you need to point carefully. Modifying an AST node is one way to point carefully. Actually, you need to modify all the identifiers that correspond to the one being renamed; you have to climb over the tree to find them all, or record in the AST where all the references exist so they can be found easily. After modifying the tree, you have to regenerate the source text from the AST. That's a lot of machinery; see my SO answer on how to prettyprint ASTs preserving all of the stuff you reasonably suggest should be preserved.
(Your other choice is to keep track in the AST of where the text for the string is, and then read/patch/write the file.)
Before you update the file, you need to check that you haven't shadowed something. Consider this code:
{ local x;
  x=1;
  { local y;
    y=2;
    { local z;
      z=y;
      print(x);
    }
  }
}
We agree this code prints "1". Now we decide to rename y to x.
We've broken the scoping, and now the print statement, which referred conceptually to the outer x, refers to an x captured by the renamed y. The code now prints "2", so our rename broke it. This means that one must check all the other identifiers in scopes in which the renamed variable might be found, to see if the new name "captures" some name we weren't expecting. (This rename would be legal if the print statement printed z.)
This is a lot of machinery.
Yes, there is a framework that has almost all of this, as well as a number of robust language front ends. See our DMS Software Reengineering Toolkit. It has parsers producing ASTs, prettyprinters to produce text back from ASTs, generic symbol table management machinery (including support for multiple inheritance), and AST visiting/modification machinery. It has front ends for C, C++, COBOL and Java that implement name and type resolution (e.g., instantiating symbol table scopes and identifier-to-symbol-table-entry mappings); it has front ends for many other languages that don't have scoping implemented yet.
We've just finished an exercise in implementing "rename" for Java. (All the above issues of course appeared.) We are about to start one for C++.
You could try to create Xtext-based implementations for the involved languages. The Xtext framework provides reliable infrastructure for cross-language rename refactoring. However, you'll have to provide a grammar and at least a "good enough" scope resolution for each language.
Languages mostly guarantee tokens will be unique, whatever the context. A naive first approach (and this will break many, many pieces of code) would be:
cp file file.orig
# Park any existing uses of the new name so they are not clobbered.
sed -i 's/\bnewTokenName\b/TEMPTOKEN/g' file
sed -i 's/\boldTokenName\b/newTokenName/g' file
With GNU sed, this will break on PHP. Rewriting \b to a general token match, like ([^a-zA-Z~$-_][^a-zA-Z0-9~$-_]), would work on most C, Java, PHP, and Python, but not Perl (you would need to add # and % to the token characters). Beyond that, it would require a plugin architecture that works for any language you wanted to add. At some point, there will be two languages whose variable and function naming rules are incompatible, and at that point, you'll need to do more and more in the plugin.
What products support 3-digit region subtags, e.g., es-419 for Latin-American Spanish?
Are web browsers, translation tools and translators familiar with these numeric codes in addition to the more common "es" or "es-ES"?
I've already visited the following pages:
W3C Choosing a Language Tag
W3C Language tags in HTML and XML
RFC 5646 Tags for Identifying Languages
Microsoft National Language Support (NLS) API Reference
I doubt that many products like that exist. It seems that some mainstream programming languages (I have tested C# and Java) do not support these tags; therefore, it would be quite hard to develop programs that do.
BTW, the NLS API Reference that you have provided does not contain a region tag in any of the LCID definitions. And if you think about it for a moment, knowing how a Locale Identifier is built, there is actually no way to support it now. An implementation change would be required (they would have to use some reserved bits, I suppose).
I don't think we will see support for region tags in the foreseeable future.
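If you want to test a given platform yourself, here is a minimal probe in Swift (what it prints depends on the ICU data shipped with the OS, so the expected values in the comments are assumptions, not guarantees):
import Foundation

let locale = Locale(identifier: "es-419")
print(locale.languageCode ?? "nil") // expected "es"
print(locale.regionCode ?? "nil")   // "419" if UN M.49 region codes are parsed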
Edit
I saw that Microsoft assigned LCIDs of value -1 and -2 to "European Union 1" and "European Union 2", respectively. However, I don't think it is related.