Korean translation in Globalization Pipeline

What's the relative quality level of Korean translation in IBM Globalization Pipeline?
"Invalid usage" is translated to "효력이 없는 사용", but when I send this:
Invalid usage (extra arguments), try {name1}{name2}
I get:
. {name1} {name2}
For all the other languages (es, fr, de, it, ja, pt-BR, zh-Hans, zh-Hant), the results look reasonable.

In general, when variable parameters are involved, the output can be very poor. At the moment, the service depends on generic machine translation engines running on the backend, and the translation results are not really optimized for incomplete sentences.

Create translation from translation, not from source

There are two files, a German template file (de.pot), generated from source, and an English translation in en.po.
Now someone who doesn't speak German wants to translate the application. But on the surface, it seems Poedit only allows creating translations from the source (which is German). What's the workflow to create an en → fr translation, for example, for this setup?
Translating from a translation (or worse, from a translation of a translation of a…) is a bad idea. It leads to serious accuracy and understandability issues.
If you don't have other options (the canonical one being to use English for the source), you can use the poswap tool to do it.
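To see what the swap amounts to, here is a minimal Rust sketch of the idea, assuming every entry is a single-line msgid followed by a single-line msgstr; the real poswap from the Translate Toolkit handles the full PO format (plurals, comments, multi-line strings) and is what you should actually use:

// Conceptual msgid/msgstr swap: the old translation becomes the new
// source string, and the old source becomes the (to-be-replaced) translation.
// Assumes single-line msgid/msgstr pairs only; this is an illustration,
// not a substitute for poswap.
fn swap_po(po: &str) -> String {
    let mut out = String::new();
    let mut pending_msgid = "\"\"";

    for line in po.lines() {
        if let Some(id) = line.strip_prefix("msgid ") {
            pending_msgid = id;
        } else if let Some(translation) = line.strip_prefix("msgstr ") {
            out.push_str("msgid ");
            out.push_str(translation);
            out.push('\n');
            out.push_str("msgstr ");
            out.push_str(pending_msgid);
            out.push('\n');
        } else {
            out.push_str(line);
            out.push('\n');
        }
    }
    out
}

fn main() {
    // A single de -> en entry; the output is an en-sourced skeleton that a
    // French translator can then fill in (replacing the German msgstr).
    let de_en = "msgid \"Hallo Welt\"\nmsgstr \"Hello world\"\n";
    print!("{}", swap_po(de_en));
}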

How can I change the formatter's decimal separator in Rust?

The code below results in "10.000". Where I live, this means "ten thousand".
format!("{:.3}", 10.0);
I would like the output to be "10,000".
There is no support for internationalization (i18n) or localization (l10n) baked into the Rust standard library.
There are several reasons, in no particular order:
a locale-dependent output should be a conscious choice, not a default,
i18n and l10n are much more complicated than just formatting numbers,
the Rust std aims at being small.
The format! machinery is going to be used to write JSON or XML files. You really do NOT want to end up with a differently formatted file depending on the locale of the machine that encoded it. It's a recipe for disaster.
The detection of locale at run-time is also optimization unfriendly. Suddenly you cannot pre-compute things at compile-time (even partially), you cannot even know which size of buffer to allocate at compile-time.
And this ties in with a dubious usefulness. Dates and numbers are arguably important, however this American vs English formatting war is ultimately a drop in the ocean. A French grammar schooler will certainly appreciate that the number is formatted in the typical French format... but it will be of no avail to her if the surrounding text is in English (we French are notoriously bad at teaching/learning foreign languages). Locale should influence language selection, sorting order, etc... merely changing the format of numbers is pointless, everything should switch with it, and this requires much more serious support (check gettext for a C library that provides a good base).
Basing the detection of the locale on the host locale, and making it global to the whole process, is also a very dubious architectural choice in this age of multi-threaded web servers. Imagine if Facebook were served in Swedish across Europe just because its datacenter runs there.
Finally, all this language/date/... support requires a humongous amount of data. ICU has several dozens (or is it hundreds?) of MBs of such data embedded inside it. This would make the size of the std explode and make it completely unsuitable for embedded development, which probably does not care about this anyway.
Of course, you could cut down on this significantly if you only chose to support a handful of languages... which is yet another argument for putting this outside the standard library.
Since the standard library doesn't have this functionality (localization of number format), you can just replace the dot with a comma:
fn main() {
    println!("{}", format!("{:.3}", 10.0).replacen(".", ",", 1));
}
There are other ways of doing this, but this is probably the most straightforward solution.
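If you also need a different thousands separator, a small hand-rolled helper is enough. This is just a sketch (the helper name and the separators are made up for the example), not a library API:

// Format a float with 3 decimals, then rewrite it with a chosen decimal
// separator and thousands separator ("10000.123" -> "10.000,123").
// Only handles the plain, non-negative decimal output of format!; it is
// a sketch, not a general-purpose localization routine.
fn format_localized(value: f64, thousands: char, decimal: char) -> String {
    let s = format!("{:.3}", value);
    let (int_part, frac_part) = s.split_once('.').unwrap();

    // Insert the thousands separator every three digits, right to left.
    let mut grouped = String::new();
    for (i, c) in int_part.chars().rev().enumerate() {
        if i > 0 && i % 3 == 0 {
            grouped.push(thousands);
        }
        grouped.push(c);
    }
    let int_grouped: String = grouped.chars().rev().collect();

    format!("{}{}{}", int_grouped, decimal, frac_part)
}

fn main() {
    println!("{}", format_localized(10000.0, '.', ',')); // 10.000,000
    println!("{}", format_localized(10.0, '.', ','));    // 10,000
}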
This is not the role of the format! macro. This should be handled by Rust itself. Unfortunately, my search led me to the conclusion that Rust doesn't handle locales (yet?).
There is a library, rust-locale, but it is still in alpha.

Translation API with candidates

I am looking for a translation API that outputs all the candidates and not just single "best" candidate.
All statistical machine translation systems, at the last stage, score the list of translation candidates and choose the best candidate. I wonder if there is a system like Google Translate or Microsoft Translator that returns the list of all possible candidates so that I can score them myself.
Thanks.
I think WordNet is good for this:
https://wordnet.princeton.edu/
Originally, WordNet is an English ontology describing English words in English, showing synonyms, definitions, etc., but there are many WordNet projects for other languages as well as multilingual WordNets. Some interesting links:
http://globalwordnet.org/wordnets-in-the-world/
http://www.certifiedchinesetranslation.com/openaccess/WordNet/
There is also a big dictionary project that builds on WordNets:
http://babelnet.org/about

Why do we use the term syntax in computer languages and not the term grammar instead?

I am confused about the words syntax and grammar. Is there a reason that for computer languages we always use the word syntax to describe the word order, and not the word grammar?
The terms "syntax" and "grammar" both come from the field of linguistics. In linguistics, syntax refers to the rules by which sentences are constructed. Grammar refers to how the rules of the language relate to one another.
Grammar actually covers syntax, morphology and phonology. Morphology covers the rules of how words can be modified to add meaning or context. Phonology covers the rules of how words should sound (which in turn governs how spelling works in that language).
So, how did concepts from linguistics get adopted by programmers?
If you look at really old papers and publications related to computing, for example Turing's seminal work on computability (Turing machines) or even older, Babbage's publications describing his Analytical Engine and Ada Lovelace's publications on programming, you'll find that they don't refer to computer programs as languages. Instead, they were just referred to as instructions or, if you want to get fancy, algorithms.
It was partly, perhaps mostly, the work of Noam Chomsky that related languages to programming.
Looking for a new way to study languages and how to extract meaning from sentences, Chomsky created the concept of the Chomsky hierarchy. His idea was to start with the simplest system that could process a string of "stuff" (sounds, letters, words), a Turing machine, and categorize the instructions for a Turing machine as a type-0 grammar. Then he went on to define grammar types 1, 2 and 3, each more restricted than the last, hoping that as we understand how complexity gets introduced, we will end up with a parser for human languages such as English or Swahili.
Most programming languages are type 2. Indeed, we have built parsers for types 0, 1 and 2 in the form of language interpreters and CPU designs.
Inheriting Chomsky's work, we have defined "syntax" in computing to mean how symbols are arranged to implement a language feature and "grammar" to mean the collection of syntax rules.
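To make that concrete, here is a small illustration (not from the answer above): a type-2, context-free grammar for toy arithmetic expressions and the recursive-descent parser its rules induce, one function per grammar rule, written in Rust.

// Grammar (the collection of syntax rules):
//   expr -> term ( '+' term )*
//   term -> digit | '(' expr ')'
// The "syntax" of this toy language is the set of strings that these rules,
// and therefore these two functions, accept.
struct Parser<'a> {
    input: &'a [u8],
    pos: usize,
}

impl<'a> Parser<'a> {
    fn peek(&self) -> Option<u8> {
        self.input.get(self.pos).copied()
    }

    // expr -> term ( '+' term )*
    fn expr(&mut self) -> Option<i64> {
        let mut value = self.term()?;
        while self.peek() == Some(b'+') {
            self.pos += 1; // consume '+'
            value += self.term()?;
        }
        Some(value)
    }

    // term -> digit | '(' expr ')'
    fn term(&mut self) -> Option<i64> {
        match self.peek()? {
            b'(' => {
                self.pos += 1; // consume '('
                let value = self.expr()?;
                if self.peek() != Some(b')') {
                    return None; // syntax error: missing ')'
                }
                self.pos += 1; // consume ')'
                Some(value)
            }
            d @ b'0'..=b'9' => {
                self.pos += 1;
                Some(i64::from(d - b'0'))
            }
            _ => None, // syntax error: unexpected symbol
        }
    }
}

fn main() {
    let mut parser = Parser { input: b"1+(2+3)", pos: 0 };
    println!("{:?}", parser.expr()); // Some(6)
}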
Because a language has only "one" syntax (the set of strings it will accept), and probably very many grammars even if we exclude trivial variants.
This may be clearer if you think about the phrase, "the language syntax allows stuff". This phrase is independent of any grammars that might be used to describe the syntax.

If you have an application localized in pt-br and pt-pt, which language should you choose if the system reports only the "pt" code?

If you have an application localized in pt-br and pt-pt, which language should you choose if the system reports only the pt code (generic Portuguese)?
This question is independent of the nature of the application, desktop, mobile or browser based. Let's assume you are not able to get region information from another source and you have to choose one language as the default one.
The question applies as well to more cases, including:
pt-pt and pt-br
en-us and en-gb
fr-fr and fr-CA
zh-cn, zh-tw, .... - in fact, in this case I know that zh can be used as the predominant language for Simplified Chinese, where the full code is zh-hans. For Traditional Chinese, with codes like zh-tw, zh-hant-tw, zh-hk, zh-mo, the proper (canonical) code should be zh-hant.
Q1: How do I determine the predominant language for a specified meta-language?
I need a solution that will include at least Portuguese, English and French.
Q2: If the system reported Simplified Chinese (PRC) (zh-cn) as the user's preferred language and I have translations only for English and Traditional Chinese (en, zh-tw), what should I choose from the two options: en or zh-tw?
In general you should separate the "guess the missing parameters" problem from the "matching a list of locales I want vs. a list of locales I have" problem. They are different.
Guessing the missing parts
These are all tricky areas, and even (potentially) politically charged.
But with very few exceptions the rule is to select the "original country" of the language.
The exceptions are mostly based on population.
So fr-FR for fr, es-ES for es, etc.
Some exceptions: pt-BR instead of pt-PT, en-US instead of en-GB.
It is also commonly accepted (and required by the Chinese standards) that zh maps to zh-CN.
You might also have to look at the country to determine the script, or the other way around.
For instance az => az-AZ, but az-Arab => az-Arab-IR, and az-IR => az-Arab-IR.
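As a rough illustration of this guessing step, here is a toy lookup in the spirit of CLDR likely subtags / ICU's uloc_addLikelySubtags; the table below is hand-written from the rules above for illustration, not real CLDR data:

// Toy "add likely subtags": expand a bare or partial tag to a full one.
// Hand-written illustration of the rules above; not CLDR data.
fn add_likely_subtags(tag: &str) -> &str {
    match tag {
        "fr" => "fr-FR",          // "original country" rule
        "es" => "es-ES",
        "en" => "en-US",          // population-based exception
        "pt" => "pt-BR",          // population-based exception
        "zh" => "zh-CN",          // required by the Chinese standards
        "zh-TW" => "zh-Hant-TW",  // country determines the script
        "az" => "az-AZ",
        "az-Arab" => "az-Arab-IR",
        "az-IR" => "az-Arab-IR",  // script determined from the country
        other => other,           // unknown: leave untouched
    }
}

fn main() {
    for tag in ["pt", "zh", "az-IR"] {
        println!("{} -> {}", tag, add_likely_subtags(tag));
    }
}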
Matching 'want' vs. 'have'
This involves matching a list of want vs. a list of have languages.
Dealing with lists makes it harder. And the result should also be sorted in a smart way, if possible. (For instance, if want = [ fr ro ] and have = [ en fr_CA fr_FR ro_RO ], then you probably want [ fr_FR fr_CA ro_RO ] as the result.)
There should be no match between language with different scripts. So zh-TW should not fallback to zh-CN, and mn-Mong should not fallback to mn-Cyrl.
Tricky areas: sr-Cyrl should not fallback to sr-Latn in theory, but it might be understood by users. ro-Cyrl might fallback to ro-Latn, but not the other way around.
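A minimal sketch of the matching step under these rules (exact tag first, then same language, never across scripts). It assumes tags already carry an explicit script where it matters, for example after running them through a likely-subtags expansion like the one sketched earlier; real matching (CLDR language matching, ICU) is considerably more involved:

// Return the "have" tags usable for a list of wanted tags: exact matches
// first, then same-language matches, never across different scripts.
// Assumes tags of the form "lang", "lang-REGION" or "lang-Script-REGION".
fn base_lang(tag: &str) -> &str {
    tag.split('-').next().unwrap_or(tag)
}

fn script(tag: &str) -> Option<&str> {
    // A 4-letter second subtag is a script ("Hans", "Cyrl", "Latn", ...).
    tag.split('-').nth(1).filter(|s| s.len() == 4)
}

fn compatible(want: &str, have: &str) -> bool {
    if !base_lang(want).eq_ignore_ascii_case(base_lang(have)) {
        return false;
    }
    match (script(want), script(have)) {
        (Some(w), Some(h)) => w.eq_ignore_ascii_case(h), // no cross-script fallback
        _ => true, // script not explicit on one side: accept (see caveat above)
    }
}

fn best_matches<'a>(want: &[&str], have: &[&'a str]) -> Vec<&'a str> {
    let mut result: Vec<&'a str> = Vec::new();
    for &w in want {
        for &h in have {
            // exact tag first
            if h.eq_ignore_ascii_case(w) && !result.contains(&h) {
                result.push(h);
            }
        }
        for &h in have {
            if compatible(w, h) && !result.contains(&h) {
                result.push(h);
            }
        }
    }
    result
}

fn main() {
    let want = ["fr", "ro"];
    let have = ["en", "fr-CA", "fr-FR", "ro-RO"];
    // Prints ["fr-CA", "fr-FR", "ro-RO"]; the smarter ordering described
    // above (fr-FR before fr-CA for a bare "fr") needs extra region
    // weighting that this sketch does not attempt.
    println!("{:?}", best_matches(&want, &have));
}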
Some references
RFC 4647 deals with language fallback (but is not very useful in this case, because it follows the "cut from the right" rule; see the sketch after these references).
ICU 4.2 and newer (draft in 4.0, I think) has uloc_addLikelySubtags (and uloc_minimizeSubtags) in uloc.h. That implements http://www.unicode.org/reports/tr35/#Likely_Subtags
Also in ICU uloc.h there are uloc_acceptLanguageFromHTTP and uloc_acceptLanguage, which deal with want vs have. But they are kind of useless as they are, because they take a UEnumeration* as input, and there is no public API to build a UEnumeration.
There is some work on language matching going beyond the simple RFC 4647. See http://cldr.unicode.org/development/design-proposals/languagedistance
Locale matching in ActionScript at http://code.google.com/p/as3localelib/
The APIs in the new Flash Player 10.1 flash.globalization namespace do both tag guessing and language matching (http://help.adobe.com/en_US/FlashPlatform/beta/reference/actionscript/3/flash/globalization/package-detail.html). It works on TR-35 and can look beyond the # and consider the operation. For instance, if have = [ ja ja#collation=radical ja#calendar=japanese ] and want = [ ja#calendar=japanese;collation=radical ] then the best match depends on the operation you want. For date formatting ja#calendar=japanese is the better match, but for collation you want ja#collation=radical
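The "cut from the right" lookup from RFC 4647 mentioned above is simple enough to sketch, and the sketch shows why it is of limited use here: pt-PT falls back to pt and then gives up; it never reaches pt-BR.

// RFC 4647-style lookup: strip subtags from the right of the wanted tag
// until it matches something we have, e.g. zh-Hant-TW -> zh-Hant -> zh.
fn lookup<'a>(want: &str, have: &[&'a str]) -> Option<&'a str> {
    let mut tag = want.to_string();
    loop {
        if let Some(found) = have.iter().copied().find(|h| h.eq_ignore_ascii_case(&tag)) {
            return Some(found);
        }
        match tag.rfind('-') {
            Some(cut) => tag.truncate(cut), // drop the rightmost subtag
            None => return None,            // nothing left to cut
        }
    }
}

fn main() {
    let have = ["en", "pt-BR"];
    // "pt-PT" -> "pt": still no match, so the caller ends up with None even
    // though pt-BR would have been an acceptable answer for most users.
    println!("{:?}", lookup("pt-PT", &have));      // None
    println!("{:?}", lookup("pt-BR-x-foo", &have)); // Some("pt-BR")
}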
Do you expect to have more users in Portugal or in Brazil? Pick accordingly.
For your general solution, you can find out by reading up on Ethnologue.

Resources