I am developing a project which supports multiple languages. One of the features we support is replacement parameters.
Here is a simplified example of what I mean:
A string "{CUSTNAME} has 10 customers" is defined somewhere. It includes one parameter, {CUSTNAME}, which will be defined within the hierarchy where this string is used. When the item with this string is opened, {CUSTNAME} resolves to its defined value.
Since in some languages a single word or phrase can actually change the preceding or following character(s) in the sentence, how do I implement the replacement-field functionality in that situation?
You'll need to do a few things.
(1). Set up some functions that return different translations based on the quantity and the rules of that language.
Aside from your customer-name replacement, the part that says "10 customers" will also need replacement, and will need to be built with a function call that looks more like:
ngettext( 'customer', 'customers', 10 )
This is along the lines of how Gettext works.
(2). Set up your translation source strings such that they're aware of pluralization rules.
You haven't said what technology you're working with, but Gettext has this built in, and many languages, including PHP, can interact with your system's Gettext.
(3). Organize your text replacement into two stages, possibly using sprintf instead of your token replacement, but that part is up to you.
Because you're using stored translations plus your own customer-name replacement, I'd do as follows:
Set up translation strings with your full template in each language, perhaps like this in a Gettext PO file:
# ....
msgid "%1$s has one customer"
msgid_plural "%1$s has %2$u customers"
msgstr[0] "%1$s a un client"
msgstr[1] "%2$u clients pour %1$s"
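For plural selection to work, the PO file's header also needs to declare the language's plural rule. For French, that is typically:
"Plural-Forms: nplurals=2; plural=(n > 1);\n"
Gettext evaluates this expression against the quantity to decide which msgstr[n] to use; note that French, unlike English, uses the singular form for n = 0.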
You would then fetch the required template based on the quantity and perform your replacement afterwards, for example in PHP:
$n = 10;
$name = "Pierre";
$template = ngettext( '%1$s has one customer', '%1$s has %2$u customers', $n );
$rendered = sprintf( $template, $name, $n );
There are lots of gotchas here, and not all language pack formats support plurals. If you can't use Gettext in your system, have a look at Loco as a way to manage plural rules and export to a file format you can work with.
Related
I am trying to add Turkish support to my product. Turkish is an agglutinative language, which means that it tends to express concepts in complex words consisting of many elements, rather than by inflection or by using isolated elements.
Currently we have created keys for i18next like the following:
tr/resourceExample.json
{
  "comment": "Yorum",
  "comment_plural": "Yorumlar",
  "select_label": "{{label}} seç"
}
Whenever we want to add a sentence like "Select comments", we use:
t("resourceExample:select_label",{label:t("resourceExample:comment_plural")})
Now this works properly for languages like English or Spanish. But in Turkish, the suffix of "comment" changes when the word is used with a verb.
For example, our current key structure will give the following output for Turkish:
Yorumlar seç
But the actual expected result for Turkish is:
Yorumları seç
The reason behind keeping this structure is that we didn't want to create new keys for select_label, because "Select something" is used in many places, where "something" can be replaced by many different words.
So my question is: is there any functionality in i18next that can help in this situation?
If I understood you correctly, you can add a custom format function:
i18next.services.formatter.add('objectify', (value, lng, options) => {
  if (lng === 'tr') {
    // add the suffix or any other decoration here
    value = value + 'ı';
  }
  return value;
});
Read more at i18next Docs
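The hard part is computing the right suffix rather than hard-coding "ı". The Turkish accusative ending follows vowel harmony: the suffix vowel (ı, i, u or ü) is chosen by the last vowel of the word. Here is a minimal sketch of that rule (written in Java for illustration; it deliberately ignores the buffer consonant "y" after a final vowel and consonant mutation, so treat it as a starting point rather than a complete solution):
public class TurkishAccusative {

    private static final String VOWELS = "aeıioöuü";

    // Maps the last vowel of the word to the accusative suffix vowel:
    // a, ı -> ı; e, i -> i; o, u -> u; ö, ü -> ü.
    public static String accusative(String word) {
        char suffix = 'i'; // fallback for words without a vowel
        for (int k = word.length() - 1; k >= 0; k--) {
            char c = Character.toLowerCase(word.charAt(k));
            if (VOWELS.indexOf(c) < 0) continue;
            if (c == 'a' || c == 'ı') suffix = 'ı';
            else if (c == 'e' || c == 'i') suffix = 'i';
            else if (c == 'o' || c == 'u') suffix = 'u';
            else suffix = 'ü';
            break;
        }
        return word + suffix;
    }

    public static void main(String[] args) {
        System.out.println(accusative("Yorumlar")); // prints "Yorumları"
    }
}
The same lookup table is straightforward to port into the JavaScript formatter above.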
In Java there is a class called BreakIterator which allows me to pass in a paragraph of text in any language (the language it is written in is known) and have it split the text into separate sentences. The magic is that it takes as an argument the locale of the language the text is written in and splits the text according to that language's rules (if you look into it, this is actually a very complex issue even in English; it is certainly not a case of 'split by full-stops/periods').
Does anybody know how I would do this in Elixir? I can't find anything in a Google search.
I am almost at the point of deploying a very thin public API that does only this basic task so that I can call into it from Elixir, but this is really not desirable.
Any help would be really appreciated.
The i18n library should be usable for this. Just going from the examples provided, since I have no experience using it, something like the following should work (:en is the locale code):
str = :i18n_string.from("some string")
iter = :i18n_iterator.open(:en, :sentence)
sentences = :i18n_string.split(iter, str)
There's also Cldr, which implements a lot of locale-dependent Unicode algorithms directly in Elixir, but it doesn't seem to include iteration in particular at the moment (you may want to raise an issue there).
I want to get some words and phrases in different languages from Google Translator without translating their actual meaning. Is it possible to convert the text to other languages rather than translating it?
Example:
I want plain conversion, like cambridge - كامبردج, कैंब्रिज, cambridge, 剑桥, Кембридж
I do not want translation, like university - جامعة, विश्वविद्यालय, universitet, 大学, Университет
Yes. This is called "transliteration". There are multiple ways to do it programmatically, depending on which programming language you are using. Here, for demonstration, I'm using the ICU4J library in Groovy:
// https://mvnrepository.com/artifact/com.ibm.icu/icu4j
@Grapes(
    @Grab(group='com.ibm.icu', module='icu4j', version='59.1')
)
import com.ibm.icu.text.Transliterator;
String sourceString = "cambridge";
List<String> transformSchemes = ["Latin-Arabic", "Latin-Cyrillic", "Latin-Devanagari", "Latin-Hiragana"]
for (t in transformSchemes) {
    println "${t}: " + Transliterator.getInstance(t).transform(sourceString);
}
Which returns:
Latin-Arabic: كَمبرِدگِ
Latin-Cyrillic: цамбридге
Latin-Devanagari: चंब्रिद्गॆ
Latin-Hiragana: かんぶりでげ
Obviously, since these are rule-based transformations from one language to another, they tend to be imperfect.
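If you want to see which transform schemes your ICU build ships with, ICU4J can enumerate them. A minimal Java sketch using the same icu4j dependency:
import com.ibm.icu.text.Transliterator;
import java.util.Enumeration;

public class ListTransforms {
    public static void main(String[] args) {
        // Print every transliterator ID this ICU build knows about, e.g. "Latin-Arabic".
        Enumeration<String> ids = Transliterator.getAvailableIDs();
        while (ids.hasMoreElements()) {
            System.out.println(ids.nextElement());
        }
    }
}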
Therefore, if you are looking for names of places (since you mentioned "Cambridge" as an example), you'll have better luck using a database of names of places; ICU has some names of cities and many names of countries. You could also use Wikidata API to retrieve such information; here is a sample call: https://www.wikidata.org/w/api.php?action=wbgetentities&ids=Q350
I am looking at gettext and .po files for creating a multilingual application. My understanding is that in the .po file msgid is the source and msgstr is the translation. Accordingly I see 2 ways of defining msgid:
Using full text (e.g. "My name is %s.\n") with the following advantages:
- when calling gettext you can clearly see what is about to be translated
- it's easier to translate .po files because they contain the actual content to be translated
Using a key (e.g. my-name %s) with the following advantages:
- when the source text is long (e.g. a paragraph about the company), gettext calls are more concise, which makes your views cleaner
- it's easier to maintain several .po files and views, because the key is less likely to change (e.g. the key company-description is far less likely to change than the actual company description)
Hence my question:
Is there a way of working with gettext and .po files that allows combining the advantages of both methods, that is:
- usage of keys for gettext calls
- ability for the translator to see the full text that needs to be translated?
gettext was designed to translate English text to other languages, and this is the way you should use it. Do not use it with keys. If you want keys, use some other technique such as an associative array.
I have managed two large open-source projects (50 languages, 5000 translations), one using the key approach and one using the gettext approach - and I would never use the key approach again.
The cons of the key approach include propagating changes in the English text to the other languages. If you change
msg_no_food = "We had no food left, so we had to eat the cats"
to
msg_no_food = "We had no food left, so we had to eat the cat's"
The new text has a completely different meaning, so how do you ensure that the other translations are invalidated and updated? With full-text msgids, the changed English is itself a new msgid, so every existing translation stops matching and is flagged for review; with keys, nothing changes, and stale translations go unnoticed.
You mentioned having long text that makes your scripts hard to read. The solution to this might be to put these messages in a separate script. For example, put this in the main code:
print help_message('help_no_food')
and have a script that just provides help messages:
function help_message($help_msg) {
    switch ($help_msg) {
        // ...
        case 'help_no_food': return gettext("We had no food left, so we had to eat the cat's");
        // ...
    }
}
Another problem for gettext is when you have a full page to translate, perhaps a brochure page on a website that contains lots of embedded images. If you allow lots of space for languages with long text (e.g. German), you will have lots of whitespace in languages with short text (e.g. Chinese). As a result, you might need different images/layout for each language.
Since these pages tend to be few in number, it is often easier to implement them outside gettext completely, e.g.:
brochure-view.en.php
brochure-view.de.php
brochure-view.zh.php
I just answered a similar (much older) question here.
Short version:
The PO file format is very simple, so it is possible to generate PO/MO files from another workflow that allows the flexibility you're asking for (your devs want identifiers, your translators want words).
You could roll this solution yourself, or use a cloud-based app like Loco to manage your translations and export a Gettext file with identifiers when your devs need them.
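The PO format itself even has a conventional slot for this: "#." extracted comments travel with each entry and are displayed to translators by PO editors, so a key-style msgid can still carry the full source text. A hypothetical entry might look like:
#. My name is %s.
msgid "my-name"
msgstr "Je m'appelle %s."
(The comment text here is illustrative; how it gets into the file depends on your extraction workflow.)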
I have a database of words (including nouns and verbs). Now I would like to generate all the different (inflected) forms of those nouns and verbs. What would be the best strategy to do this?
As Latin is a highly inflected language, there is:
a) the declension of nouns
b) the conjugation of verbs
See this translated page for an example of a verb's conjugation ("mandare"): conjugation
I don't want to type in all those forms for all the words manually.
How can I generate them automatically? What is the best approach?
- a list of complex rules for how to inflect all the words
- Bayesian methods
- ...
There's a program called "William Whitaker's Words". It creates inflections for Latin words as well, so it does exactly what I want to do.
Wikipedia says that the program works like this:
Words uses a set of rules based on natural pre-, in-, and suffixation, declension, and conjugation to determine the possibility of an entry. As a consequence of this approach of analysing the structure of words, there is no guarantee that these words were ever used in Latin literature or speech, even if the program finds a possible meaning to a given word.
The program's source is also available here. But I don't really understand how it works. Can you help me? Maybe this would be the solution to my question ...
You could do something similar to the hunspell dictionary format (see http://www.manpagez.com/man/4/hunspell/).
You define two tables. One contains the roots of the words (the part that never changes), and the other contains the modifications for a given class. For a given class, for each declension (or conjugation), it tells you what characters to add at the end (or the beginning) of the root. It can even specify replacing a given number of characters. Now, to get a word in a specific declension, you take the root, apply the transformation from the class it belongs to, and voilà!
For example, for mandare, the root would be mand, and the class would contain suffixes like o, as, at, amus, atis, ant for the present active indicative.
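A toy sketch of those two tables in Java (the table names and the single "a-conjugation" class are mine for illustration; hunspell's actual file format encodes the same idea more compactly):
import java.util.List;
import java.util.Map;

public class Inflector {

    // Table 1: root of each word -> the suffix class it belongs to.
    static final Map<String, String> ROOTS = Map.of("mand", "a-conjugation");

    // Table 2: suffix class -> endings, here the present active indicative.
    static final Map<String, List<String>> SUFFIX_CLASSES =
            Map.of("a-conjugation", List.of("o", "as", "at", "amus", "atis", "ant"));

    public static void main(String[] args) {
        String root = "mand";
        for (String ending : SUFFIX_CLASSES.get(ROOTS.get(root))) {
            System.out.println(root + ending); // mando, mandas, mandat, ...
        }
    }
}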
I'll use nouns as the example, but this also applies to verbs.
First, I would create two classes: Regular and Irregular. For the Regular nouns, I would make three classes for the three declensions, and make them all implement a Declensable (or however the word is in English :) interface (FirstDeclension extends Regular implements Declensable). The interface would define two static enums (NOMINATIVE, VOCATIVE, etc, and SINGULAR, PLURAL).
All would have a string for the root and a static hashmap of suffixes. The method FirstDeclension#get(case, number) would then append the right suffix based on the hashmap.
The Irregular class would have to define a local hashmap for each word and then implement the same Declensable interface.
Does it make any sense?
Addendum: To clarify, the constructor of class Regular would be
public Regular (String stem) {
    this.stem = stem;
}
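To make the whole design concrete, here is a compact sketch (first declension only, with just a few endings filled in; the enum and class names follow the description above, but the details are illustrative):
import java.util.Map;

interface Declensable {
    enum Case { NOMINATIVE, GENITIVE, ACCUSATIVE } // etc.
    enum Num { SINGULAR, PLURAL }
    String get(Case c, Num n);
}

abstract class Regular implements Declensable {
    protected final String stem;

    protected Regular(String stem) {
        this.stem = stem;
    }
}

class FirstDeclension extends Regular {

    // (case, number) -> suffix appended to the stem.
    private static final Map<Case, Map<Num, String>> SUFFIXES = Map.of(
            Case.NOMINATIVE, Map.of(Num.SINGULAR, "a", Num.PLURAL, "ae"),
            Case.GENITIVE, Map.of(Num.SINGULAR, "ae", Num.PLURAL, "arum"),
            Case.ACCUSATIVE, Map.of(Num.SINGULAR, "am", Num.PLURAL, "as"));

    FirstDeclension(String stem) {
        super(stem);
    }

    @Override
    public String get(Case c, Num n) {
        return stem + SUFFIXES.get(c).get(n);
    }
}
With this, calling get with GENITIVE and PLURAL on new FirstDeclension("ros") returns "rosarum".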
Perhaps you could follow the approach of AOT in your implementation (it's under LGPL):
http://prometheus.altlinux.org/en/Sisyphus/srpms/aot
http://seman.sourceforge.net/
http://aot.ru/
There's no Latin morphology in AOT, only Russian, German, and English; but Russian is an example of an inflectional morphology as complex as Latin's, so AOT should be ready as a framework for implementing it.
Still, I believe one has to have an elaborate, precise formal system for the morphology clearly defined before going on to programming. As for Russian, I would guess that most of the working morphological computer systems are based on the serious analysis of Russian morphology done by Andrey Zalizniak in the Grammatical Dictionary of Russian and related works.