Transform Text into Different Languages - localization

I want to get some words and phrases in different languages, as from Google Translate, but without translating their actual meaning. Is it possible to convert the text to other languages rather than translating it?
Example:
I want a plain conversion, like cambridge - كامبردج, कैंब्रिज, cambridge, 剑桥, Кембридж
I do not want a translation, like university - جامعة, विश्वविद्यालय, universitet, 大学, Университет

Yes. This is called "transliteration". There are multiple ways to do it programmatically, depending on which programming language you are using. Here, for demonstration, I'm using the ICU4J library in Groovy:
// https://mvnrepository.com/artifact/com.ibm.icu/icu4j
@Grapes(
    @Grab(group='com.ibm.icu', module='icu4j', version='59.1')
)
import com.ibm.icu.text.Transliterator

String sourceString = "cambridge"
List<String> transformSchemes = ["Latin-Arabic", "Latin-Cyrillic", "Latin-Devanagari", "Latin-Hiragana"]
for (t in transformSchemes) {
    println "${t}: " + Transliterator.getInstance(t).transform(sourceString)
}
Which returns:
Latin-Arabic: كَمبرِدگِ
Latin-Cyrillic: цамбридге
Latin-Devanagari: चंब्रिद्गॆ
Latin-Hiragana: かんぶりでげ
Obviously, since these are rule-based transformations from one script to another, they tend to be imperfect.
Therefore, if you are looking for names of places (since you mentioned "Cambridge" as an example), you'll have better luck using a database of place names; ICU has some names of cities and many names of countries. You could also use the Wikidata API to retrieve such information; here is a sample call: https://www.wikidata.org/w/api.php?action=wbgetentities&ids=Q350
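For instance, a minimal Java sketch of such a Wikidata lookup might look like this (the props and languages parameters are standard wbgetentities options; %7C is a URL-encoded pipe separating the language codes):
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class WikidataLabels {
    public static void main(String[] args) throws Exception {
        // Request only the labels of Q350 (Cambridge) in a few languages
        String url = "https://www.wikidata.org/w/api.php"
                + "?action=wbgetentities&ids=Q350&props=labels"
                + "&languages=ar%7Chi%7Cru%7Czh&format=json";
        HttpResponse<String> response = HttpClient.newHttpClient().send(
                HttpRequest.newBuilder(URI.create(url)).build(),
                HttpResponse.BodyHandlers.ofString());
        // The JSON reply holds the localized names under
        // entities.Q350.labels.<lang>.value; use a JSON parser to extract them
        System.out.println(response.body());
    }
}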

Related

How to handle different suffixes in i18next for agglutinative languages (e.g. Turkish, Japanese, etc.)

I am trying to add Turkish support to my product. Turkish is an agglutinative language, which means that it tends to express concepts in complex words consisting of many elements, rather than by inflection or by using isolated elements.
Currently we have created keys for i18next like following:
tr/resourceExample.json
{
  "comment": "Yorum",
  "comment_plural": "Yorumlar",
  "select_label": "{{label}} seç"
}
Whenever we want to add a sentence like "Select comments" we use
t("resourceExample:select_label",{label:t("resourceExample:comment_plural")})
Now this works properly for languages like English or Spanish. But in Turkish, the suffix of "comment" changes when the word is used with a verb.
For example, our current key structure gives the following output for Turkish:
Yorumlar seç
But the actual expected result for Turkish is:
Yorumları seç
The reason for keeping this structure is that we didn't want to create new keys for select_label, because "Select something" is used in many places, where "something" can be replaced by many different words.
So, my question is: is there any functionality in i18next which can help in this situation?
If I got you right, you can add a custom format function:
i18next.services.formatter.add('objectify', (value, lng, options) => {
  if (lng === 'tr') {
    // add a suffix or any decorations here
    value = value + 'ı'
  }
  return value
})
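Assuming the formatter is registered under the name objectify as above, you can then reference it directly in the key using i18next's {{value, formatName}} interpolation syntax, e.g. "select_label": "{{label, objectify}} seç", so the Turkish suffix is applied wherever the label is interpolated.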
Read more at i18next Docs

Elixir/Erlang - Split paragraph into sentences based on the language

In Java there is a class called BreakIterator which allows me to pass in a paragraph of text in any language (the language it is written in is known) and it will split the text into separate sentences. The magic is that it can take as an argument the locale of the language the text is written in, and it will split the text according to that language's rules (if you look into it, this is actually a very complex issue even in English - it is certainly not a case of 'split by full-stops/periods').
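For reference, a minimal Java sketch of the behaviour I mean (the sample text is illustrative):
import java.text.BreakIterator;
import java.util.Locale;

public class SentenceSplit {
    public static void main(String[] args) {
        String text = "It was a bright cold day in April. The clocks were striking thirteen.";
        // Sentence iterator configured with the locale of the text
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            System.out.println(text.substring(start, end).trim());
        }
    }
}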
Does anybody know how I would do this in Elixir? I can't find anything in a Google search.
I am almost at the point of deploying a very thin public API that does only this basic task, which I could call into from Elixir - but this is really not desirable.
Any help would be really appreciated.
The i18n library should be usable for this. Just going from the examples provided (since I have no experience using it), something like the following should work (:en is the locale code):
str = :i18n_string.from("some string")
iter = :i18n_iterator.open(:en, :sentence)
sentences = :i18n_string.split(iter, str)
There's also Cldr, which implements a lot of locale-dependent Unicode algorithms directly in Elixir, but it doesn't seem to include iteration in particular at the moment (you may want to raise an issue there).

How can I translate my webpage to another language (Can't use google translator or any else)

Basically I want a translator that translates my webpage into a specific language (the word meanings are project-specific). For that language, the words and corresponding word meanings have to be created manually; I mean, there should be something like a dictionary, because the words/texts that need to be converted have specific meanings based on my project. So what is the best method/concept/approach to do this?

Replacement fields inside a multi-language application

I am developing a project which supports multiple languages. One of the features we have is support for replacement parameters.
Here is a simplified example of what I mean:
A string "{CUSTNAME} has 10 customers" is defined somewhere. It includes one parameter {CUSTNAME}, which will be defined within the hierarchy where this string will be used. When the item with this string is opened up, the {CUSTNAME} resolves to its defined value.
Since in some languages a single word or phrase can actually change the previous or the following character(s) in the sentence, how do I implement the replacement-field functionality in that situation?
You'll need to do a few things.
(1). Set up some functions that return different translations based on the quantity and the rules of that language.
Aside from your customer-name replacement, the part that says "10 customers" will also need some replacement, and will need to be built with a function call that looks more like:
ngettext( 'customer', 'customers', 10 )
This is along the lines of how Gettext works.
(2). Set up your translation source strings such that they're aware of pluralization rules.
You haven't said what technology you're working with, but Gettext has this built in, and many languages, including PHP, can interact with your system's Gettext.
(3). Organize your text replacement into two stages. You could possibly use sprintf instead of your token replacement, but that part is up to you.
Because you're using stored translations plus your own customer-name replacement, I'd do as follows:
Set up translation strings with your full template in each language, perhaps like this in a Gettext PO file:
# ....
msgid "%1$s has one customer"
msgid_plural "%1$s has %2$u customers"
msgstr[0] "%1$s a un client"
msgstr[1] "%2$u clients pour %1$s"
You would then fetch the required template based on the quantity and perform your replacement afterwards. For example, in PHP:
$n = 10;
$name = "Pierre";
$template = ngettext( '%1$s has one customer', '%1$s has %2$u customers', $n );
$rendered = sprintf( $template, $name, $n );
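Assuming the French catalogue above is the one loaded, $n = 10 selects the plural form, so $rendered would come out as "10 clients pour Pierre".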
There are lots of gotchas here, and not all language pack formats support plurals. If you can't use Gettext in your system then have a look at Loco as a way to manage the rules of plurals and export to a file format you can work with.

Latin inflection

I have a database of words (including nouns and verbs). Now I would like to generate all the different (inflected) forms of those nouns and verbs. What would be the best strategy to do this?
As Latin is a highly inflected language, there is:
a) the declension of nouns
b) the conjugation of verbs
See this translated page for an example of a verb's conjugation ("mandare"): conjugation
I don't want to type in all those forms for all the words manually.
How can I generate them automatically? What is the best approach?
a list of complex rules describing how to inflect all the words
Bayesian methods
...
There's a program called "William Whitaker's Words". It creates inflections for Latin words as well, so it does exactly what I want to do.
Wikipedia says that the program works like this:
Words uses a set of rules based on natural pre-, in-, and suffixation, declension, and conjugation to determine the possibility of an entry. As a consequence of this approach of analysing the structure of words, there is no guarantee that these words were ever used in Latin literature or speech, even if the program finds a possible meaning to a given word.
The program's source is also available here. But I don't really understand how it is supposed to work. Can you help me? Maybe this would be the solution to my question...
You could do something similar to the hunspell dictionary format (see http://www.manpagez.com/man/4/hunspell/).
You define 2 tables. One contains the roots of the words (the part that never changes), and the other contains the modifications for a given class. For a given class, for each declension (or conjugation), it tells which characters to add at the end (or the beginning) of the root. It can even specify to replace a given number of characters. Now, to get a word at a specific declension, you take the root, apply the transformation from the class it belongs to, and voilà!
For example, for mandare, the root would be mand, and the class would contain suffixes like o, as, at, amus, atis, ant for the present active indicative.
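A minimal Java sketch of that two-table idea, with illustrative table contents and class names:
import java.util.List;
import java.util.Map;

public class LatinInflector {
    // table 1: root -> inflection class (the part of the word that never changes)
    static final Map<String, String> ROOTS = Map.of("mand", "first-conjugation");
    // table 2: inflection class -> ordered suffixes (present active indicative)
    static final Map<String, List<String>> SUFFIXES = Map.of(
            "first-conjugation", List.of("o", "as", "at", "amus", "atis", "ant"));

    public static void main(String[] args) {
        String root = "mand";
        // append each suffix of the root's class to generate the inflected forms
        for (String suffix : SUFFIXES.get(ROOTS.get(root))) {
            System.out.println(root + suffix); // mando, mandas, mandat, ...
        }
    }
}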
I'll use nouns as an example, but this also applies to verbs.
First, I would create two classes: Regular and Irregular. For the regular nouns, I would make a class for each declension, and make them all implement a Declensable (or however the word is in English :) interface (FirstDeclension extends Regular implements Declensable). The interface would define two enums (NOMINATIVE, VOCATIVE, etc., and SINGULAR, PLURAL).
All would have a string for the root and a static hashmap of suffixes. The method FirstDeclension#get(case, number) would then append the right suffix based on the hashmap.
The Irregular class would have to define a local hashmap for each word and then implement the same Declensable interface.
Does it make any sense?
Addendum: To clarify, the constructor of class Regular would be
public Regular(String stem) {
    this.stem = stem;
}
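Putting the design together, a rough Java sketch might look like this (only two cases are filled in, and for brevity the stem field lives directly on the declension class rather than on Regular):
import java.util.Map;

interface Declensable {
    enum Case { NOMINATIVE, GENITIVE, DATIVE, ACCUSATIVE, ABLATIVE, VOCATIVE }
    enum Number { SINGULAR, PLURAL }
    String get(Case c, Number n);
}

class FirstDeclension implements Declensable {
    // static suffix table for this declension (only two cases filled in)
    private static final Map<Case, Map<Number, String>> SUFFIXES = Map.of(
            Case.NOMINATIVE, Map.of(Number.SINGULAR, "a", Number.PLURAL, "ae"),
            Case.ACCUSATIVE, Map.of(Number.SINGULAR, "am", Number.PLURAL, "as"));

    private final String stem;

    FirstDeclension(String stem) { this.stem = stem; }

    public String get(Case c, Number n) {
        // e.g. new FirstDeclension("puell").get(Case.ACCUSATIVE, Number.SINGULAR) -> "puellam"
        return stem + SUFFIXES.get(c).get(n);
    }
}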
Perhaps you could follow the line of AOT in your implementation (it's under LGPL):
http://prometheus.altlinux.org/en/Sisyphus/srpms/aot
http://seman.sourceforge.net/
http://aot.ru/
There's no Latin morphology in AOT, only Russian, German, and English; but Russian is, of course, an example of an inflectional morphology as complex as Latin, so AOT should be ready as a framework for implementing it.
Still, I believe one has to have an elaborate, precise formal system for the morphology already clearly defined before going on to programming. As for Russian, I guess most of the working morphological computer systems are based on the serious analysis of Russian morphology done by Andrey Zalizniak in the Grammatical Dictionary of Russian and related works.
