Tool in the gettext suite to unify source strings with fuzzy match?

Is there any way to leverage the tools in the gettext suite to do something like fuzzy match the source strings within one PO file to find strings which are almost identical? This would seem like a useful quality check to improve the sources. Example:
#: my_file
msgid "Sorry, something went wrong"
msgstr ""
#: some_other_file
msgid "Sorry, something went wrong."
msgstr ""
#: yet_another_file
msgid "Sorry, something is wrong"
msgstr ""
These strings are virtually identical and the source code could possibly be changed to use the same message in each instance. This would reduce the l10n work and make the UI more coherent. It would seem to me that the fuzzy match algorithm in msgmerge should already be pretty well suited to identify these instances. Yet I could not find an obvious way to do this.

You don't want to do any kind of folding without human supervision.
Most translation tools have that feature, but a human should validate such folding.
You can't even do it for perfectly identical strings because of the context.
Why:
buttons ("commands") often get translated differently than labels and titles ("descriptions")
Example: "Print" is translated to French as "Imprimer" (buttons) or "Impression" (titles)
gender, number, case, will change the translation.
Example: translating a "New" button into Spanish can give you "Nuevo" (masculine, singular), "Nuevos" (masculine, plural), "Nueva" (feminine, singular), or "Nuevas" (feminine, plural)
the same word can be translated differently if it has a different meaning.
Example: "Scan" will have different translations if it is about scanning the disk (for a virus) or scanning a piece of paper.
So, you don't want to "magically merge strings" to save a few cents if the price is lower translation quality.
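To surface candidates for that kind of human review, near-identical source strings can be listed with a small script. The following is only a sketch: it uses Python's standard difflib rather than the fuzzy matcher inside msgmerge, it reads single-line msgid entries only, and the 0.85 threshold and the extra "Print" entry are assumptions made for illustration.

```python
import difflib
import itertools
import re

# Sample catalog: the three entries from the question plus one unrelated string.
PO_TEXT = '''\
#: my_file
msgid "Sorry, something went wrong"
msgstr ""
#: some_other_file
msgid "Sorry, something went wrong."
msgstr ""
#: yet_another_file
msgid "Sorry, something is wrong"
msgstr ""
#: unrelated_file
msgid "Print"
msgstr ""
'''


def read_msgids(po_text):
    """Collect single-line, non-empty msgid values (multi-line entries are skipped)."""
    return re.findall(r'^msgid "(.+)"$', po_text, flags=re.MULTILINE)


def near_duplicates(msgids, threshold=0.85):
    """Return (ratio, a, b) for every pair whose similarity reaches the threshold."""
    pairs = []
    for a, b in itertools.combinations(msgids, 2):
        ratio = difflib.SequenceMatcher(None, a, b).ratio()
        if ratio >= threshold:
            pairs.append((round(ratio, 2), a, b))
    return sorted(pairs, reverse=True)


candidates = near_duplicates(read_msgids(PO_TEXT))
for ratio, a, b in candidates:
    print(f"{ratio:.2f}  {a!r}  ~  {b!r}")
```

The output is a list of suggestions for a person to look at, not a list of strings to merge automatically; context still decides whether two messages may share one msgid.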

Related

Add forbidden words to TexStudio / Latex

I have some words in my language (German) that TeXstudio's spellchecker accepts as valid.
However they must not be used for my thesis (and globally for me at least).
Is it possible to add words to a list so that using them triggers an (ideally huge) "DO NOT USE THIS!" warning, or even prevents compilation in LaTeX?
I'm looking for something like a negative dictionary.
I've seen files like "badwords" or "stopwords" but don't know when/how they are used. I can freely use them although "check for bad words" is on.

In case anyone else has the problem: Badword files are named after the main language. For me it happened that I have "de_DE_frami" as the dictionary set. Hence it did not use the "de_DE.badwords".
For a good highlighting: One can change the appearance in the options dialog (syntaxhighlighting->badwords) and make it e.g. background red, size 200%
I'd still like to have a distinction between "bad" words and "impossible" words, as you sometimes cannot avoid "bad" words, or they are not bad in all contexts.

Max size for PO file strings

I know that PO / MO files are meant to be used for small strings like button names, labels, etc. Not long text like an About page, etc.
But lately I am encountering a lot of situations that are in the middle. For example, a two sentence call to action. Or a short paragraph.
Is there best practice or "rule of thumb" for when a string is too long to put in a PO file?
update
For "long" text I use partials and include the correct language version. My question is WHEN is it optimal to use one vs the other. I've heard that PO files are "inefficient" for "long" pieces of text. But what does that mean and when is it too "long"? Or is this not a concern?

Use one entry for a self-contained chunk of text; e.g. a sentence as you say.
Two sentences that belong together and don't make sense without each other should be one entry. Why? Because otherwise the translator wouldn't have the context necessary to translate it well. Same goes for a short paragraph, e.g. explaining a setting: if it's inseparable in the code, it should be one entry.
If you encounter a situation where you have lots of long texts regularly (e.g. entire pages or paragraphs of pages), that's usually a sign that you are using an ill-fitting tool. Some people do it, using Gettext for entire articles, but you're better off having separate documents in such cases. But that doesn't seem to be the case here.

Combining keys and full text when working with gettext and .po files

I am looking at gettext and .po files for creating a multilingual application. My understanding is that in the .po file msgid is the source and msgstr is the translation. Accordingly I see 2 ways of defining msgid:
Using full text (e.g. "My name is %s.\n") with the following advantages:
when calling gettext you can clearly see what is about to be translated
it's easier to translate .po files because they contain the actual content to be translated
Using a key (e.g. my-name %s) with the following advantages:
when the source text is long (e.g. paragraph about company), gettext calls are more concise which makes your views cleaner
easier to maintain several .po files and views, because the key is less likely to change (e.g. key of company-description far less likely to change than the actual company description)
Hence my question:
Is there a way of working with gettext and .po files that allows combining the advantages of both methods, that is:
- usage of keys for gettext calls
- ability for the translator to see the full text that needs to be translated?

gettext was designed to translate English text to other languages, and this is the way you should use it. Do not use it with keys. If you want keys, use some other technique such as an associative array.
I have managed two large open-source projects (50 languages, 5000 translations), one using the key approach and one using the gettext approach - and I would never use the key approach again.
The cons of the key approach include propagating changes in the English text to the other languages. If you change
msg_no_food = "We had no food left, so we had to eat the cats"
to
msg_no_food = "We had no food left, so we had to eat the cat's"
The new text has a completely different meaning - so how do you ensure that other translations are invalidated and updated?
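The difference can be illustrated with two toy lookups. This is a sketch only: the dictionaries stand in for real catalogs, the placeholder translation text is hypothetical, and the fallback in lookup_by_text mimics gettext returning the msgid when no translation exists.

```python
OLD = "We had no food left, so we had to eat the cats"
NEW = "We had no food left, so we had to eat the cat's"

# Hypothetical catalogs holding a translation made for the OLD sentence.
key_catalog = {"msg_no_food": "<translation of the old sentence>"}
text_catalog = {OLD: "<translation of the old sentence>"}


def lookup_by_key(key):
    """Key approach: the key is unchanged, so the stale translation is still returned."""
    return key_catalog.get(key)


def lookup_by_text(text):
    """Text approach: a changed source string no longer matches and falls back to itself."""
    return text_catalog.get(text, text)


print(lookup_by_key("msg_no_food"))  # stale translation, nothing signals the change
print(lookup_by_text(NEW))           # untranslated English, so the gap is visible
```

With keys, nothing in the catalog reveals that the English text changed; with the text as msgid, the entry simply stops matching and has to be translated or reviewed again.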
You mentioned having long text that makes your scripts hard to read. The solution to this might be to put these in a separate script. For example, put this in the main code
print help_message('help_no_food');
and have a script that just provides help messages:
function help_message($help_msg) {
    switch ($help_msg) {
        // ...
        case 'help_no_food': return gettext("We had no food left, so we had to eat the cat's");
        // ...
    }
}
Another problem for gettext is when you have a full page to translate. Perhaps a brochure page on a website that contains lots of embedded images. If you allow lots of space for languages with long text (e.g. German), you will have lots of whitespace on languages with short text (e.g. Chinese). As a result, you might have different images/layout for each language.
Since these tend to be few in number, it is often easier to implement these outside gettext completely. e.g.
brochure-view.en.php
brochure-view.de.php
brochure-view.zh.php
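Choosing among such per-language files usually needs a fallback for languages that have no dedicated view. A minimal sketch, assuming the three files listed above exist and that English is the default (the function name and the locale format are illustrative):

```python
# Languages for which a dedicated brochure view exists (taken from the list above).
AVAILABLE = {"en", "de", "zh"}


def brochure_view(locale, default="en"):
    """Return the view file for a locale such as 'de_DE', falling back to the default."""
    language = locale.replace("-", "_").split("_")[0].lower()
    if language not in AVAILABLE:
        language = default
    return f"brochure-view.{language}.php"


print(brochure_view("de_DE"))
print(brochure_view("fr"))
```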

I just answered a similar (much older) question here.
Short version:
The PO file format is very simple, so it is possible to generate PO/MO files from another workflow that allows the flexibility you're asking for. (your devs want identifiers, your translators want words)
You could roll this solution yourself, or use a cloud-based app like Loco to manage your translations and export a Gettext file with identifiers when your devs need them.
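A rough sketch of the roll-your-own variant: generate PO entries whose msgid is the identifier, and carry the English text in an extracted comment (the "#." lines of the PO format) so the translator can still read what needs translating. The helper names and the sample texts are assumptions for illustration, not part of any existing tool.

```python
def po_escape(text):
    """Escape backslashes, double quotes and newlines for a PO string literal."""
    return (text.replace("\\", "\\\\")
                .replace('"', '\\"')
                .replace("\n", "\\n"))


def build_po(source_texts):
    """Emit one PO entry per key, with the English text as an extracted comment."""
    entries = []
    for key, english in source_texts.items():
        comment = "\n".join(f"#. {line}" for line in english.splitlines())
        entries.append(f'{comment}\nmsgid "{po_escape(key)}"\nmsgstr ""\n')
    return "\n".join(entries)


# Hypothetical source texts keyed by identifier.
SOURCE_TEXTS = {
    "my-name": "My name is %s.",
    "company-description": "First line of a longer description.\nSecond line.",
}

print(build_po(SOURCE_TEXTS))
```

This keeps gettext calls short, but the caveat from the other answer still applies: when the English text changes, the key does not, so stale translations have to be tracked by some other means.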

How do I design a heuristic for matching translated sentences?

Summary
I am trying to design a heuristic for matching up sentences in a translation (from the original language to the translated language) and would like guidance and tips. Perhaps there is a heuristic that already does something similar? So given two text files, I would like to be able to match up the sentences (so I can pick out a sentence and say this is the translation of that sentence).
Details
The input text would be translated novels, so I do not expect the translations to be literal, although using something like Google Translate might be a good way to test the accuracy of the heuristic.
To help me, I have a library that will gloss the contents of the translated text and give me the definitions of the words in the sentence. Other things I know:
Chapters and order are preserved; I know that the first sentence in chapter three will match with the first sentence in chapter three of the translation (Note, this is not strictly true; the first sentence might match up with the first two sentences, or even the second sentence)
I can calculate the overall size (characters, sentences, paragraphs); which could give me an idea of the average difference in sentence size (for example, the translation might be 30% longer).
Looking at some books I have, the translated version has about 30% more sentences than the original text.
Implementation
(if it matters)
I am planning to do this in Java - but I am not that fussed - any language will do.
I am not greatly concerned about speed.
I guess that, to be sure of the matches, some user feedback might be required, like saying "Yes, this sentence definitely matches that sentence." This would give the heuristic some more ground to stand on, but it would mean that the user needs a little proficiency in both languages.
Background
(for those interested)
The reason I want to make this is that I want it to assist with my foreign language study. I am studying Japanese and find it hard to find "good" material (where "good" is defined by what I like). There are already tools to do something similar with subtitles from videos (an easier task - using the timing information of the video). But nothing, as far as I know, for texts.

There are tools called "sentence aligners", used in NLP research, that do exactly what you want.
I advise hunalign:
http://mokk.bme.hu/resources/hunalign/
and MS sentence aligner:
http://research.microsoft.com/en-us/downloads/aafd5dcf-4dcc-49b2-8a22-f7055113e656/
Both are quite OK, but remember that nothing is perfect. Sentences that are too hard to be aligned will be dropped and some sentences may be wrongly aligned.
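To give an idea of what such aligners do, here is a deliberately simplified, length-based dynamic-programming sketch. It is not the algorithm used by hunalign or the MS aligner: it only compares character lengths, it only considers 1-1, 1-2 and 2-1 groupings, and the ratio and merge_penalty parameters and the sample sentences are assumptions chosen for illustration.

```python
def align(src, tgt, ratio=1.0, merge_penalty=2.0):
    """Align two sentence lists using 1-1, 1-2 and 2-1 groupings scored by length."""
    inf = float("inf")
    n, m = len(src), len(tgt)
    # cost[i][j]: cheapest alignment of the first i source and first j target sentences.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for di, dj in ((1, 1), (1, 2), (2, 1)):
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                src_len = sum(len(s) for s in src[i:ni])
                tgt_len = sum(len(t) for t in tgt[j:nj])
                # Penalise length mismatch, scaled by the expected expansion ratio.
                step = abs(src_len * ratio - tgt_len)
                if (di, dj) != (1, 1):
                    step += merge_penalty
                if cost[i][j] + step < cost[ni][nj]:
                    cost[ni][nj] = cost[i][j] + step
                    back[ni][nj] = (i, j)
    if cost[n][m] == inf:
        raise ValueError("no alignment exists with 1-1, 1-2 and 2-1 groupings")
    # Walk the back-pointers from the end to recover the aligned groups.
    pairs = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        pairs.append((src[pi:i], tgt[pj:j]))
        i, j = pi, pj
    return pairs[::-1]


# Stand-in sentences; a real run would use the original and the translated chapter.
original = ["It was late.", "She closed the book and went to sleep."]
translated = ["It was late.", "She closed the book.", "Then she slept."]
for src_group, tgt_group in align(original, translated):
    print(src_group, "<->", tgt_group)
```

Running it chapter by chapter fits the observation that chapters and order are preserved, and the ratio parameter is where a measured length difference (for example a translation that is 30% longer) would go. Real aligners add dictionary evidence on top of length, which is where the glossing library could help.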

Localization ground rules

I submitted my first localized app to the iPhone App Store the other day. I decided to do it to learn about application localization, and because my app was simple enough to stumble through localizing with my mediocre French. I know I didn't do everything "right", but I learned a lot from doing it once. I'd like to keep doing this for all my future apps.
For one thing, I learned to code with localization in mind, but not to start localizing until the app is ready to be released. I spent way too much time making small tweaks in 2 UI files.
What are your favourite localization basics, cardinal rules, and best practices?
I'm thinking mostly for small hobby developers like myself, although stuff from the big leagues would be interesting as well.

The biggest one for me is don't concatenate strings:
Bad:
"You have " + messageCount + " messages";
Good:
"You have {0} messages"
Word order varies from language to language, and so you can't assume where in a sentence your dynamic data might occur.
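A small sketch of the same idea in Python, using named placeholders so each language can put the number wherever it belongs. The "xx" entry is a made-up pseudo-locale used only to show a different word order; it is not a real translation.

```python
# One complete template per language; the placeholder position is up to the translator.
TEMPLATES = {
    "en": "You have {count} messages",
    "xx": "Messages waiting for you: {count}",  # hypothetical locale with another word order
}


def render(locale, count):
    """Fill the whole-sentence template for the given locale."""
    return TEMPLATES[locale].format(count=count)


print(render("en", 3))
print(render("xx", 3))
```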

In your UI, allow for about 30-50% expansion of translations from English. A method I learned early in my career was to produce a 'pig latin' localized version of the UI.
If your user interface is still legible in Pig Latin, it will probably be legible in real languages.
Ifway ouryay userway interfaceway isway illstay egiblelay inway Igpay Atinlay, itway illway obablypray ebay egiblelay inway ealray anguageslay.
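Such a pseudo-localized build can be generated automatically. The following sketch follows the convention visible in the example above (vowel-initial words get "way", otherwise the leading consonants move to the end followed by "ay"); the function names are illustrative, and named placeholders such as {count} would need to be protected before running real UI strings through it.

```python
import re

VOWELS = "aeiouAEIOU"


def _convert(match):
    """Convert a single word to Pig Latin, keeping an initial capital."""
    word = match.group(0)
    if word[0] in VOWELS:
        result = word + "way"
    else:
        # Move the leading consonant cluster to the end, then add "ay".
        split = next((i for i, ch in enumerate(word) if ch in VOWELS), len(word))
        result = word[split:] + word[:split] + "ay"
    return result.capitalize() if word[0].isupper() else result


def pig_latin(text):
    """Pseudo-localize text word by word; digits and punctuation are left untouched."""
    return re.sub(r"[A-Za-z]+", _convert, text)


print(pig_latin("If your user interface is still legible in Pig Latin, "
                "it will probably be legible in real languages."))
```

Because every word grows by two or three letters, short labels expand by well over 30%, which makes truncated buttons and overlapping labels easy to spot.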

Use Unicode for all strings - UTF-16 or UTF-8. If reading/writing to any program/format that doesn't assume that by default, make sure you specify UTF-16 or UTF-8 explicitly.
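In Python, for instance, specifying the encoding explicitly means passing it on every read and write instead of relying on the platform default. A minimal sketch (the file name and sample text are arbitrary):

```python
import os
import tempfile

text = "Größe: 30–50 %"  # non-ASCII sample text
path = os.path.join(tempfile.mkdtemp(), "strings.txt")

# Always name the encoding; the platform default may not be UTF-8.
with open(path, "w", encoding="utf-8") as handle:
    handle.write(text)

with open(path, encoding="utf-8") as handle:
    round_trip = handle.read()

print(round_trip == text)
```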
As Mike Sickler said, don't concatenate strings. Better yet, don't have sentences with inserts, since you don't know how the insert affects the rest of the sentence - different languages have different rules regarding plural / etc.
Bad: "You have " + messageCount + " messages"
Better: "You have {0} messages" (but what if {0} == 1? Do you write message(s)? What about Hebrew, where "one" comes after the noun, but other numbers before?)
Best: "Messages: {0}"
As rhsatrhs said, allow 30-50% expansion. In my (big league) company, we usually assume that German is the longest, although I found out that Russian sometimes came out over 100% longer. I suspect that is sometimes because translators don't know the exact term, so they write a longer description using a related term (Example: Symbol ==> source code reference marker).
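Where a sentence with a number cannot be avoided, gettext-style plural handling lets each language supply its own forms instead of a hard-coded "message(s)". A sketch using Python's standard gettext module; NullTranslations stands in for a real catalog here, so only the English singular/plural rule is applied:

```python
import gettext

# Stand-in for a loaded catalog; a real one would carry the target language's plural rules.
catalog = gettext.NullTranslations()


def inbox_summary(count):
    """Pick the singular or plural template, then fill in the number."""
    template = catalog.ngettext("You have {n} message", "You have {n} messages", count)
    return template.format(n=count)


for count in (0, 1, 5):
    print(inbox_summary(count))
```

With a real catalog, the number of plural forms and the rule for choosing among them come from the translation itself, which is what covers languages with more than two forms.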
